Thread overview
std.file
Sep 30, 2004
novice2
Sep 30, 2004
novice
Sep 30, 2004
M
Sep 30, 2004
Arcane Jill
Sep 30, 2004
novice
Sep 30, 2004
Arcane Jill
Sep 30, 2004
J C Calvarese
September 30, 2004
Hello.
What can I do if my program must work with file/folder names that contain
non-English symbols? These names may be read from the Windows registry or
entered by the user.
Environment: localised Windows XP
Local code page: 1251 (8 bit, cyrillic, non-english letters have codes > 0x80)
Test program:
/*********/
private import std.file;
private import std.utf;

char[] test(char[] dirName)
{
    try
    {
        exists(dirName);
        return "passed";
    }
    catch(UtfError)
    {
        return "failed";
    }
}

void main()
{
    char[] dir1 = "not exists dir A";
    char[] dir2 = "not exists dir \xC0"; //this is cyrillic letter "A"
    printf("dir1=%.*s\n", dir1);
    printf("dir2=%.*s\n", dir2);

    printf("test1 %.*s\n", test(dir1));
    printf("test2 %.*s\n", test(dir2));
    printf("test3 %.*s\n", test(toUTF8(dir1)));
    printf("test4 %.*s\n", test(toUTF8(dir2)));
}
/*********/

In my environment this program prints:
dir1=not exists folder A
dir2=not exists folder À
test1 passed
test2 failed
test2 passed
test2 failed


September 30, 2004
>For my environment this program print:
>dir1=not exists folder A
>dir2=not exists folder À
>test1 passed
>test2 failed
>test2 passed
>test2 failed

I skipped the explanation of the problem: std.file.exists() throws a "Bad UTF
sequence" exception if the directory name contains a non-English letter.
Can I work around this problem?


September 30, 2004
Maybe the function needs a UTF-8 string and you aren't giving it one. You can pass it non-English letters, but they must be UTF-8 encoded.

M


September 30, 2004
In article <cjgfna$2507$1@digitaldaemon.com>, novice2 says...
>
>Hello.

Hiya


>What can I do if my program must work with file/folder names that contain
>non-English symbols? These names may be read from the Windows registry or
>entered by the user.
>Environment: localised Windows XP
>Local code page: 1251 (8 bit, cyrillic, non-english letters have codes > 0x80)

Understand that your local code page is something that D neither knows nor cares about. I'll try to explain more further on. Bear with me.


>char[] dir2 = "not exists dir \xC0"; //this is cyrillic letter "A"

No it isn't. It's an invalid UTF-8 sequence. What you should do instead is this:

#    char[] dir2 = "not exists dir \u0410"; //Cyrillic capital letter A

(or simply insert the Cyrillic capital letter A straight into your source code as
a single character).


In D, source code is portable. The sequence "\u0410" emits the Unicode character U+0410 (CYRILLIC CAPITAL LETTER A), and - importantly - it will do so for /all users/, not just folk who use Windows code page 1251.

In D, "\x##" emits UTF fragments, not characters. Therefore you should never use "\x##" in a string unless you are prepared to encode UTF-8 by hand. "\u####" or "\U########" emit characters, so that's what you want. The codepoint (character code) should always be the /Unicode/ codepoint, not the Windows-1251 codepoint.
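
To make the difference concrete, here is a minimal sketch. It assumes a current D compiler and Phobos (std.utf.validate and UTFException, rather than the UtfError in the test program above), and isValidUtf8 is a hypothetical helper name of my own, not a library function. It checks that "\xC0" is a lone UTF-8 fragment while "\u0410" is a complete two-byte character:

```d
import std.stdio : writefln;
import std.string : representation;
import std.utf : validate, UTFException;

/// Hypothetical helper: true if `s` is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string fragment = "\xC0";   // one raw byte, 0xC0: a UTF-8 lead byte with nothing after it
    string letter   = "\u0410"; // U+0410, stored as the two bytes 0xD0 0x90

    immutable ubyte[] expected = [0xD0, 0x90];

    assert(!isValidUtf8(fragment));
    assert(isValidUtf8(letter));
    assert(letter.representation == expected);

    // Print the raw bytes of each literal in hex.
    writefln("\\xC0   -> %(%02X %)", fragment.representation);
    writefln("\\u0410 -> %(%02X %)", letter.representation);
}
```

The byte 0xC0 only means "Cyrillic A" to something that already assumes code page 1251. A function that expects UTF-8 sees the start of a multi-byte sequence that never finishes.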

Arcane Jill


--------------------------------------------------------------------------------
PS. Walter - I change my mind about things occasionally, and I'm now starting to agree with Regan in suggesting that "\x" should be deprecated, precisely because it causes this kind of confusion. It's reasonable to assume that people who want to do UTF-encoding by hand are likely to be knowledgeable enough to figure out some other way of doing this.



September 30, 2004
Hi, Arcane Jill

>or care. I'll try to explain more further on. Bear with me.

thank you

>U+0410 (CYRILLIC CAPITAL LETTER A), and - importantly - it will do so for /all users/, not just folk who use Windows code page 1251.

Unfortunately (?) code page 1251 is the standard for the Russian Windows localization. Many editors (standard Notepad, for example) use it. It is the standard for exchanging text between two Russian Windows machines. By contrast, Unicode is only used by Windows internally. I would have to search for a special text editor to produce Unicode text :(


>(or simply insert the Cyrillic capital letter A straight into your source
> code as a single character).

I tried inserting the Cyrillic letter straight into the source before I posted my question. The compiler reported "bad utf sequence" :(

>In D, source code is portable.

Yes, Unicode is portable. But where can I see it? My friends on Unix use the ISO 8859-5 or KOI8-R code pages (8-bit code pages like 1251), and ALL Russian users on Windows MUST use code page 1251...



September 30, 2004
In article <cjgst9$2blj$1@digitaldaemon.com>, novice says...

>Unfortunately (?) code page 1251 is the standard for the Russian Windows localization.

Yes, I understand that - just as WINDOWS-1252 is standard for Western European. However, that has got nothing to do with UTF-8, which is independent of localization - and that's the whole point, of course. Windows /does/ understand Unicode. Windows 95 understood Unicode, and every version of Windows thereafter uses Unicode internally.


>Many editors (standard Notepad, for example) use it.

Standard Notepad /also/ uses UTF-8. Click on "Save As..."; Go to the "Encoding" pull-down menu and select "UTF-8". That's all you have to do. Really - it's that simple.


>It is the standard for exchanging
>text between two Russian Windows machines. By contrast, Unicode is only used by Windows
>internally. I would have to search for a special text editor to produce Unicode text :(

I think you may be surprised to learn that /almost all/ text editors these days can save in UTF-8. There's usually an "Encoding" option on the "Save As..." menu item. Not only that, many text editors can auto-detect UTF encodings, so a UTF-8 text file created using one text editor can be loaded up in another with no problems.

What text editor are you using?

Even in the unlikely event that your text editor can't cope with UTF, there are plenty that can. (And you're going to want other features too, like syntax highlighting, so maybe a text editor upgrade wouldn't be a bad thing).



>>(or simply insert the Cyrillic capital letter A straight into your source
>> code as a single character).
>
>I tried inserting the Cyrillic letter straight into the source before I posted my question. The compiler reported "bad utf sequence" :(

Yes, that's because you didn't save your source code as UTF-8. Saving your source code as UTF-8 before passing it to DMD will fix this.



>Yes, Unicode is portable. But where can I see it? My friends on Unix use the ISO 8859-5 or KOI8-R code pages (8-bit code pages like 1251), and ALL Russian users on Windows MUST use code page 1251...

That's just not true. Windows uses Unicode internally (its filenames are stored
in UTF-16, for example). And Windows can understand a great variety of
encodings. For example - have you ever used Google? (You know, www.google.com)?
If so, you've been using UTF-8. (For proof, Google something, then view the page
source. You'll see it starts:
<html><head><meta HTTP-EQUIV="content-type" CONTENT="text/html; charset=UTF-8">

Saying that all Russian users of Windows MUST use encoding Windows-1251 is simply not true. Windows has been using Unicode for nearly a decade. If you open a Unicode file, it will "just work". You probably won't even notice you've done it.
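
For names that do arrive at run time as Windows-1251 bytes (from the registry or typed by the user, as in the original question), the bytes have to be converted to UTF-8 before they reach std.file. A rough sketch, assuming a current D compiler and Phobos: cp1251ToUtf8 is a hypothetical helper of my own, and it is deliberately incomplete. It relies only on the fact that bytes 0xC0..0xFF in code page 1251 map in order to U+0410..U+044F. The bytes 0x80..0xBF need a full lookup table, so this sketch just turns them into U+FFFD.

```d
import std.stdio : writeln;
import std.utf : encode;

/// Hypothetical helper: converts Windows-1251 bytes to a UTF-8 string.
/// Handles ASCII and the basic Russian alphabet only; bytes in
/// 0x80..0xBF become U+FFFD because they need a real lookup table.
string cp1251ToUtf8(const(ubyte)[] bytes)
{
    char[] result;
    foreach (b; bytes)
    {
        dchar c;
        if (b < 0x80)
            c = b;                                  // ASCII is identical in both encodings
        else if (b >= 0xC0)
            c = cast(dchar)(0x0410 + (b - 0xC0));   // 0xC0..0xFF -> U+0410..U+044F
        else
            c = '\uFFFD';                           // not covered by this sketch
        encode(result, c);                          // append c to result as UTF-8
    }
    return result.idup;
}

void main()
{
    // Three Cyrillic letters as they would arrive in code page 1251.
    ubyte[] fromRegistry = [0xC0, 0xE1, 0xE2];
    string name = cp1251ToUtf8(fromRegistry);

    assert(name == "\u0410\u0431\u0432");
    assert(name.length == 6);   // each of these letters takes two bytes in UTF-8

    writeln(name);              // shows Cyrillic only if the console is set to UTF-8
}
```

Once the name is UTF-8, it can be passed on to the std.file functions like any other string.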

Arcane Jill


September 30, 2004
Arcane Jill wrote:
> In article <cjgst9$2blj$1@digitaldaemon.com>, novice says...
> 
> 
>>Unfortunately (?) code page 1251 is the standard for the Russian Windows localization.
> 
> 
> Yes, I understand that - just as WINDOWS-1252 is standard for Western European.
> However, that has got nothing to do with UTF-8, which is independent of
> localization - and that's the whole point, of course. Windows /does/ understand
> Unicode. Windows 95 understood Unicode, and every version of Windows thereafter
> uses Unicode internally.
> 
> 
> 
>>Many editors (standard Notepad, for example) use it.
> 
> 
> Standard Notepad /also/ uses UTF-8. Click on "Save As..."; Go to the "Encoding"
> pull-down menu and select "UTF-8". That's all you have to do. Really - it's that
> simple.

Actually, I think that whether Notepad.exe supports Unicode depends on the version of Windows. Notepad supports Unicode on Windows 2000/XP.

I think that with Win95/Win98/WinME, Notepad doesn't have an option to save in Unicode. (I'd hate to guess whether WinNT's Notepad supports Unicode, but I doubt that it does.)

In any case, I'm sure there are several free Unicode-enabled editors out there. If the OP uses one of those, I suspect he'll have much more success with D.


-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/