The encoding of Windows/Linux filenames
June 26, 2004
To clarify a point made in another post... Suppose you have a file called "café". This filename will be stored within the Windows filesystem as a sequence of 16-bit words, each representing a Unicode character, which in this case will be the sequence { 0x0063, 0x0061, 0x0066, 0x00E9 }. (This is not, of course, true of old DOS 8.3 filenames, but we'll ignore them.)

The D char[] string, "café", on the other hand, will be stored as the five-byte UTF-8 sequence { 0x63, 0x61, 0x66, 0xC3, 0xA9 }.

However, when you call the char version of CreateFile(), the filename string will be interpreted as if it were encoded in the default Windows codepage (normally WINDOWS-1252 in English-speaking countries). Under this interpretation, the byte sequence { 0x63, 0x61, 0x66, 0xC3, 0xA9 } will be seen as the string "cafÃ©", and so Windows will attempt to open a file of that name. So, either it will fail, or it will open the wrong file.
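
The mismatch is easy to reproduce outside D. The following is only a sketch, written in Python for convenience, and the variable names are made up for illustration. It relies on é being U+00E9, encodes the name both ways, and then decodes the UTF-8 bytes as WINDOWS-1252, which is roughly what the char version of the API does with them.

```python
# Illustrative only: reproduces the byte-level mismatch described above.
# "caf\u00e9" is "café"; the escape keeps the source file's own encoding out of the picture.
name = "caf\u00e9"

# UTF-16 code units, which is how the Windows filesystem stores the name.
utf16 = name.encode("utf-16-le")
utf16_units = [int.from_bytes(utf16[i:i + 2], "little") for i in range(0, len(utf16), 2)]

# UTF-8 bytes, which is what a D char[] holds for the same text.
utf8_bytes = list(name.encode("utf-8"))

# What a WINDOWS-1252 interpretation makes of those UTF-8 bytes.
misread = name.encode("utf-8").decode("cp1252")

print([f"0x{u:04X}" for u in utf16_units])
print([f"0x{b:02X}" for b in utf8_bytes])
print(misread)
```

The last line prints a five-character string, not the four-character name that was intended, so a lookup by that name cannot match the file on disk.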

The fix, of course, is to pass to the wide-character version of CreateFile()
(CreateFileW) the value (std.utf.toUTF16(filename)), instead of (filename). This
should not have to be done by users - it needs to be done at the Phobos level.
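
As a rough sketch of the conversion step (not the Phobos implementation), here is the same idea in Python. The helper name to_utf16_path is hypothetical, and the NUL terminator is my assumption about what a wide-character C API expects; no Windows call is actually made.

```python
def to_utf16_path(filename_utf8: bytes) -> bytes:
    """Hypothetical helper: re-encode a UTF-8 filename as NUL-terminated UTF-16LE."""
    # Decoding first means malformed UTF-8 raises UnicodeDecodeError
    # instead of silently producing a wrong filename.
    text = filename_utf8.decode("utf-8")
    # Assumption: the wide-character API wants little-endian 16-bit units
    # followed by a 16-bit zero terminator.
    return text.encode("utf-16-le") + b"\x00\x00"


# The UTF-8 bytes of "café" become four 16-bit units plus the terminator.
wide = to_utf16_path(b"caf\xc3\xa9")
print(wide)
```

Doing this once inside the library wrapper, not in every caller, is the point of fixing it at the Phobos level.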

The situation is more complicated on Linux, unfortunately. On Linux, filenames are stored as a sequence of bytes, not 16-bit words. On one level that sequence of bytes is kind of "raw" - fopen() can be passed any sequence of bytes not containing "/" or "\0", and it will consider a filename to match only if it is byte-for-byte identical. However, this does not really mitigate the problem, because bytes only turn into characters - even 8-bit-wide ones - when you interpret them according to an encoding. Thus, if you have C source code which says fopen("café", "r"), your C compiler will still need to know what sequence of bytes should represent these characters. By and large, it will assume the system default encoding, called the "locale" in Linux-speak (although it has very little to do with the ISO language-country-variant understanding of "locale"). Some Linux users will have set their default "locale" to UTF-8. Others won't. Getting this right will be tricky.
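
To make the Linux side concrete, here is a small sketch in Python. The two encodings are just examples I picked, and nothing here touches the filesystem. It shows that one name yields different byte strings under different encodings, which the filesystem would treat as two unrelated filenames.

```python
import locale
import sys

# "caf\u00e9" is "café"; the escape avoids depending on this file's own encoding.
name = "caf\u00e9"

# The same characters under two example encodings a user's locale might select.
candidates = {enc: name.encode(enc) for enc in ("utf-8", "iso-8859-1")}
for enc, raw in candidates.items():
    print(enc, raw)

# A byte-for-byte comparison sees these as different names.
distinct_names = len(set(candidates.values()))
print("distinct byte strings:", distinct_names)

# What this particular process believes the defaults are; results vary by machine.
print("locale encoding:", locale.getpreferredencoding(False))
print("filesystem encoding:", sys.getfilesystemencoding())
```

A library that always sends UTF-8 will therefore miss files that were created under a non-UTF-8 locale, and vice versa.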

Unfortunately, you can't ignore this problem, unless you want to tell people that D's File (FileStream?) class will only work for filenames containing ASCII characters - and that is hardly a realistic option if you want D to compete seriously with C++ and Java.

It will be easier to fix this for Windows, for the reasons given above. I think, at least, that should happen as part of the ongoing std.stream improvements. Someone who knows more about Linux encodings will have to help out with the Linux fix.

Arcane Jill


June 26, 2004
This issue is already fixed for std.file operations under Win32, this fix just needs to be propagated to std.stream. For linux, the file name operations assume the linux APIs take UTF-8. I don't know how to do code pages in linux, so this will have to wait until I figure it out <g>.


June 26, 2004
"Walter" <newshound@digitalmars.com> wrote in message
news:cbkbv8$abd$2@digitaldaemon.com
| This issue is already fixed for std.file operations under Win32, this fix
| just needs to be propagated to std.stream.

... which I already did, and posted in the bugs ng.

| For linux, the file name
| operations assume the linux APIs take UTF-8. I don't know how to do code
| pages in linux, so this will have to wait until I figure it out <g>.

-----------------------
Carlos Santander Bernal


June 26, 2004
"Carlos Santander B." <carlos8294@msn.com> wrote in message news:cbkdvf$d4a$1@digitaldaemon.com...
> "Walter" <newshound@digitalmars.com> wrote in message
> news:cbkbv8$abd$2@digitaldaemon.com
> | This issue is already fixed for std.file operations under Win32, this fix
> | just needs to be propagated to std.stream.
>
> ... which I already did, and posted in the bugs ng.

Yes, you did.