Posted by Arcane Jill
To clarify a point made in another post... Suppose you have a file called "café". This filename will be stored within the Windows filesystem as a sequence of 16-bit words (UTF-16 code units), each representing a Unicode character, which in this case will be the sequence { 0x0063, 0x0061, 0x0066, 0x00E9 }. (This is not, of course, true of old DOS 8.3 filenames, but we'll ignore them.)
The D char[] string "café", on the other hand, will be stored as the five-byte UTF-8 sequence { 0x63, 0x61, 0x66, 0xC3, 0xA9 }.
However, when you call the char version of CreateFile(), the filename string will be interpreted as if it were encoded in the default Windows codepage (normally WINDOWS-1252 in English-speaking countries). Under this interpretation, the byte sequence { 0x63, 0x61, 0x66, 0xC3, 0xA9 } will be seen as the string "cafÃ©", and so Windows will attempt to open a file of that name. So either it will fail, or it will open the wrong file.
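The three byte-level views of "café" described above can be checked with a few lines of Python. This is only an illustrative sketch, not D or Phobos code; the variable names are made up for the example, and "cp1252" is Python's name for WINDOWS-1252.

```python
# The filename from the post; U+00E9 is the character "é".
name = "caf\u00e9"

# UTF-16 code units, as Windows stores the name (one unit per character here).
utf16 = name.encode("utf-16-le")
units = [int.from_bytes(utf16[i:i + 2], "little") for i in range(0, len(utf16), 2)]

# UTF-8 bytes, as a D char[] would hold the same string.
utf8 = name.encode("utf-8")

# What a WINDOWS-1252 API sees when it is handed those UTF-8 bytes.
misread = utf8.decode("cp1252")

print([f"0x{u:04X}" for u in units])  # ['0x0063', '0x0061', '0x0066', '0x00E9']
print([f"0x{b:02X}" for b in utf8])   # ['0x63', '0x61', '0x66', '0xC3', '0xA9']
print(ascii(misread))                 # 'caf\xc3\xa9', i.e. the text "cafÃ©"
```

The last line shows the mismatch: five bytes that mean four characters in UTF-8 become five different characters under the codepage interpretation.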
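The fix, of course, is to pass CreateFile() the value (std.utf.toUTF16(filename)) instead of (filename), and to call the wide-character version of CreateFile() so that the 16-bit words are taken as they are. This should not have to be done by users; it needs to be done at the Phobos level.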
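To make the transcoding step concrete, here is a rough Python sketch of what a library-level wrapper would have to do before handing a filename to a wide-character API. The helper name to_wide_filename is hypothetical, and the trailing NUL is included on the assumption that the target API expects a NUL-terminated string, as the wide Windows calls do. In D the conversion itself would come from std.utf rather than from anything shown here.

```python
def to_wide_filename(utf8_name: bytes) -> bytes:
    """Transcode a UTF-8 filename to NUL-terminated UTF-16LE (hypothetical helper)."""
    text = utf8_name.decode("utf-8")  # UTF-8 bytes -> Unicode characters
    # Unicode characters -> 16-bit code units, plus a 16-bit NUL terminator.
    return text.encode("utf-16-le") + b"\x00\x00"


wide = to_wide_filename(b"caf\xc3\xa9")
print(wide.hex())  # 630061006600e9000000
```

The output is the little-endian form of { 0x0063, 0x0061, 0x0066, 0x00E9 } followed by the terminator, which matches what the filesystem stores.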
The situation is more complicated on Linux, unfortunately. On Linux, filenames are stored as a sequence of bytes, not 16-bit words. On one level that sequence of bytes is kind of "raw": fopen() can be passed any sequence of bytes not containing "/" or "\0", and it will consider a filename to match only if it is byte-for-byte identical. However, this does not really mitigate the problem, because bytes only turn into characters, even 8-bit-wide ones, when you interpret them according to an encoding. Thus, if you have C source code which says fopen("café", "r"), your C compiler will still need to know what sequence of bytes should represent these characters. By and large, it will assume the system default encoding, called the "locale" in Linux-speak (although it has very little to do with the ISO language-country-variant understanding of "locale"). Some Linux users will have set their default "locale" to UTF-8. Others won't. Getting this right will be tricky.
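The byte-for-byte matching can be seen in a small experiment. This Python sketch assumes a typical Linux filesystem that accepts arbitrary byte names; the choice of Latin-1 as the "other" encoding is just an example of a non-UTF-8 locale.

```python
import os
import tempfile

name = "caf\u00e9"
as_utf8 = name.encode("utf-8")      # b"caf\xc3\xa9"
as_latin1 = name.encode("latin-1")  # b"caf\xe9"

with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)
    # Create the file under its UTF-8 byte name.
    with open(os.path.join(tmp_bytes, as_utf8), "wb"):
        pass
    # Look the file up under both byte sequences.
    found_utf8 = os.path.exists(os.path.join(tmp_bytes, as_utf8))
    found_latin1 = os.path.exists(os.path.join(tmp_bytes, as_latin1))

print(found_utf8, found_latin1)  # True False on a typical Linux system
```

Both byte strings are "café" to a human reader, but a program compiled or run under one encoding will not find a file created under the other.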
Unfortunately, you can't ignore this problem, unless you want to tell people that D's File (FileStream?) class will only work for filenames containing ASCII characters, and that is hardly a realistic option if you want D to compete seriously with C++ and Java.
It will be easier to fix this for Windows, for the reasons given above. I think that, at least, should happen as part of the ongoing std.stream improvements. Someone who knows more about Linux encoding will have to help out on the Linux fix.
Arcane Jill