January 20, 2014 - Re: Should this work?
Posted in reply to Marco Leise

On Fri, 17 Jan 2014 23:58:04 -0000, Marco Leise <Marco.Leise@gmx.de> wrote:

> Am Fri, 17 Jan 2014 21:38:18 +0100
> schrieb Marco Leise <Marco.Leise@gmx.de>:
>
>> Am Mon, 13 Jan 2014 11:40:19 -0000
>> schrieb "Regan Heath" <regan@netmail.co.nz>:
>>
>> > On Fri, 10 Jan 2014 19:47:07 -0000, H. S. Teoh <hsteoh@quickfur.ath.cx> wrote:
>> >
>> > > On Sat, Jan 11, 2014 at 02:14:41AM +1000, Manu wrote:
>> > > [...]
>> > >> One more, again here to reduce spam...
>> > >>
>> > >> Two overloads exist:
>> > >>
>> > >>     void func(const(char)* str);
>> > >>     void func(const(char)[] str);
>> > >>
>> > >> Called with a string literal:
>> > >>
>> > >>     func("literal");
>> > >>
>> > >> called with argument types (string), it matches both.
>> > >>
>> > >> I appreciate the convenience of the automatic string literal ->
>> > >> const(char)* cast. But in this case, surely it should just choose the
>> > >> array version of the function, like it does when calling with a
>> > >> 'string' variable? The convenience should be just that, a convenience,
>> > >> not a hard rule...?
>> > >
>> > > File a bug against dmd for this? I agree that it should match the array
>> > > overload, not the pointer overload. I'm not sure if it's fixable,
>> > > though, due to the way overloads are resolved currently. But maybe
>> > > Kenji has a way. ;)
>> >
>> > I think this should remain an error, for the same reason as any other
>> > overload resolution error; you might have one overload, then add the
>> > second, and behaviour silently changes - this is bad.
>> >
>> > Instead.. isn't the first overload strictly incorrect, unless that first
>> > overload expects a null-terminated **UTF-8** string? If it's a C function
>> > it should be const(ubyte)* str, right? What overload does D select if you
>> > use that instead?
>> >
>> > R
>>
>> A few days ago I noticed - in shock :D - that I had this situation with a
>> \0-terminated D string literal and totally expected the char* overload to
>> be chosen!
>>
>> I want to object to your claim that C functions should take const(ubyte)*.
>> When working on Windows that is correct due to the ANSI/Microsoft codepage
>> madness, but on Mac OS X and Linux UTF-8 can be assumed.
>
> Then again, technically you can still set any encoding you like, so my
> argument is moot and you are correct that const(ubyte)* should be used in
> any case.

I was thinking in a very Windows-centric way when I wrote my comment, but it doesn't surprise me that other platforms can be configured to other locales. What do they default to? The last Linux install I did was for my Raspberry Pi: UTF-8 was recommended, and I selected it, yet I still had to break out some weird console magic to fully realise that choice (I think there was a disjoint component which had not been configured correctly.. some part of the installation dropped the ball).

> Heck, even most of Phobos probably passes UTF-8 directly into OS APIs
> without converting to the current locale. :p

Yeah, and it's only "working" because UTF-8 is a superset of ASCII, or the locale has defaulted or been set to UTF-8, or on Windows the Win32 "W" function API accepts UTF-16 and we pass that instead.

R

--
Using Opera's revolutionary email client: http://www.opera.com/mail/
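For reference, a minimal self-contained sketch of the situation Manu describes; the "matches both" behavior is as reported in the thread, and the exact diagnostic depends on the dmd version:

    // The two overloads from the thread:
    void func(const(char)*  str) {}  // expects a null-terminated C string
    void func(const(char)[] str) {}  // expects a D slice

    void main()
    {
        string s = "variable";
        func(s);          // fine: only the slice overload matches
        func("literal");  // per the thread: a string literal implicitly
                          // converts to both const(char)* and const(char)[],
                          // so the call is reported as matching both overloads
    }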
January 20, 2014 - Re: Should this work?
Posted in reply to Regan Heath

On Monday, 20 January 2014 at 10:22:03 UTC, Regan Heath wrote:

> I was thinking in a very Windows-centric way when I wrote my comment, but
> it doesn't surprise me that other platforms can be configured to other
> locales. What do they default to?

UTF-8, for most "user-friendly" distros I know. The region/locale is usually entered by the user during installation, but the encoding is always UTF-8. That said, changing it system-wide is just a matter of tweaking a config file and regenerating the locale, so it can't be relied upon in the standard library.

> The last Linux install I did was for my Raspberry Pi: UTF-8 was
> recommended, and I selected it, yet I still had to break out some weird
> console magic to fully realise that choice (I think there was a disjoint
> component which had not been configured correctly.. some part of the
> installation dropped the ball).

Most likely just an installer issue for the specific distro you used.
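As an illustration of what the chosen locale looks like to a running program, here is a small D sketch that asks the C runtime for the environment's locale and checks whether it advertises UTF-8 (core.stdc.locale and core.stdc.string are real modules; the substring heuristic is an assumption for illustration, not a robust detection method):

    import core.stdc.locale : setlocale, LC_CTYPE;
    import core.stdc.string : strstr;
    import std.stdio : writeln;

    void main()
    {
        // Adopt the environment's locale; programs start in the "C" locale.
        const(char)* loc = setlocale(LC_CTYPE, "");
        // Typical values look like "en_US.UTF-8" on the distros mentioned above.
        bool utf8 = loc !is null &&
            (strstr(loc, "UTF-8") !is null || strstr(loc, "utf8") !is null);
        writeln(utf8 ? "locale advertises UTF-8" : "non-UTF-8 or unknown locale");
    }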
January 21, 2014 - Re: Should this work?
Posted in reply to Dicebot

Am Mon, 20 Jan 2014 11:14:46 +0000
schrieb "Dicebot" <public@dicebot.lv>:

> On Monday, 20 January 2014 at 10:22:03 UTC, Regan Heath wrote:
> > I was thinking in a very Windows-centric way when I wrote my comment,
> > but it doesn't surprise me that other platforms can be configured to
> > other locales. What do they default to?
>
> UTF-8, for most "user-friendly" distros I know. The region/locale is
> usually entered by the user during installation, but the encoding is
> always UTF-8. That said, changing it system-wide is just a matter of
> tweaking a config file and regenerating the locale, so it can't be
> relied upon in the standard library.
>
> > The last Linux install I did was for my Raspberry Pi: UTF-8 was
> > recommended, and I selected it, yet I still had to break out some weird
> > console magic to fully realise that choice (I think there was a disjoint
> > component which had not been configured correctly.. some part of the
> > installation dropped the ball).
>
> Most likely just an installer issue for the specific distro you used.

I asked on #linux how encodings are handled. I figured it must be a complicated process from the file system to the kernel to the C library to your D program. If I understood correctly, the kernel exposes the names from the file systems as they are, unless they are in UTF-16 like Joliet, in which case they need to be converted (in that case to an ISO charset, not UTF-8). Wondering how I could make sure file names are always represented as UTF-8, I was told that iocharset=utf8 as the mount option, where applicable, should do the trick. I haven't looked into that yet as I don't currently have any issues, but what I took away from it is that this is the only place a conversion will happen, and my C locale does not influence it.

Yet the mere possibility that a C string could be in any encoding is unsettling, especially when strings from a proxy library and an implementation library can be concatenated. Like this: "Standard-Audiogerät using DirectSound" - the parts may end up in different encodings (not in this case, but imagine Cyrillic or Greek). So to work reliably you need all interacting components to agree on the charset. When you ask the OpenAL devs why they didn't enforce UTF-8 for strings, they say that it is just a spec and implementations are free to use the language default for characters. For most people that means working with "C strings", whatever they represent, since most programming languages will just use a C implementation directly or via bindings. Some Haskell developer complained about this as well. Modern programming languages generally use some Unicode encoding, which makes the C string issue more obvious.

My current "best practice" is this:

* If a C string represents an identifier in C (e.g. a variable or function name), assume it is ASCII and thus a valid UTF-8 char* in D.

* Otherwise keep it as a ubyte*. Chances are we need to pass it back into the C API as-is or don't have to print it on screen.

* When a C string has to be displayed, use this on Windows:

      import core.sys.windows.windows; // MultiByteToWideChar, CP_ACP
      import std.exception : assumeUnique;

      wstring ansiToString(const ubyte* ansi)
      {
          // First call: ask for the required UTF-16 length (including \0).
          auto utf16Len = MultiByteToWideChar(CP_ACP, 0,
              cast(const(char)*) ansi, -1, null, 0);
          if (utf16Len == 0) { /* handle error */ }
          wchar[] utf16 = new wchar[](utf16Len);
          // Second call: perform the actual conversion into the buffer.
          utf16Len = MultiByteToWideChar(CP_ACP, 0,
              cast(const(char)*) ansi, -1, utf16.ptr, cast(int) utf16.length);
          if (utf16Len == 0) { /* handle error */ }
          return assumeUnique(utf16)[0 .. $ - 1]; // drop the \0 terminator
      }

  I don't know the best practice for other systems yet.

* Don't store a C string only in a converted UTF-8 version if you intend to pass it back into the C API. If the original C string had some invalid characters in it, data loss would occur.
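A hypothetical usage of the sketch above, Windows-only; GetCommandLineA is a real Win32 call that returns the command line in the current ANSI code page:

    version (Windows)
    {
        import core.sys.windows.windows : GetCommandLineA;
        import std.stdio : writeln;

        void main()
        {
            // Treat the returned char* as raw ANSI bytes, per the convention above.
            auto ansi = cast(const(ubyte)*) GetCommandLineA();
            wstring cmd = ansiToString(ansi); // the function from the post
            writeln(cmd); // writeln transcodes the UTF-16 string for output
        }
    }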