November 20, 2004 Re: Character encoding problem | ||||
---|---|---|---|---|
| ||||
Posted in reply to Mathias Bierschenk | Hello,
I have the same problem on Debian Linux (sarge) and SUSE 9.1:
"invalid UTF-8 sequence"
My editor is Vim.
Manfred
Mathias Bierschenk wrote:
> How can I print German characters? I've tried the following simple program:
>
> import std.c.stdio;
>
> int main()
> {
>     puts("äöüßÄÖÜ"); // German characters
>
>     return 0;
> }
>
> As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German edition) I tried Mozilla to save the source code file with different character encodings but none worked as expected. Here's what I tried using the current DMD version:
>
> MS-DOS encoding as performed by Microsoft's EDIT editor:
> (5) "invalid UTF-sequence"
>
> Western (ISO-8859-1):
> (5) "invalid UTF-sequence"
>
> Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
> (1) "semicolon expected, not '.'"
> (1) no identifier for declarator
>
> Unicode (UTF-16 and UTF-8):
> both compile fine but output garbage under MS-DOS
> (Windows 98 SE, German edition)
|
November 20, 2004 Re: Character encoding problem | ||||
---|---|---|---|---|
| ||||
Posted in reply to Manfred Hansen |
Manfred Hansen wrote on Sat, 20 Nov 2004 08:53:41 +0100:
> Hello,
>
> i have the same problem on Linux Debian (sarge) and SUSE 9.1.
> "invalid UTF-8 sequence"
> Editor is vim .
>
Vim 6.2 works for me.
Are you sure your locale is set to use UTF-8?
# > locale
# LANG=de_DE.UTF-8
# LC_CTYPE=de_DE.UTF-8
# LC_NUMERIC=de_DE.UTF-8
# LC_TIME=de_DE.UTF-8
# LC_COLLATE=de_DE.UTF-8
# LC_MONETARY=de_DE.UTF-8
# LC_MESSAGES=de_DE.UTF-8
# LC_PAPER=de_DE.UTF-8
# LC_NAME=de_DE.UTF-8
# LC_ADDRESS=de_DE.UTF-8
# LC_TELEPHONE=de_DE.UTF-8
# LC_MEASUREMENT=de_DE.UTF-8
# LC_IDENTIFICATION=de_DE.UTF-8
# LC_ALL=
Please send me a sample, if this problem persists.
Thomas
|
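The locale question above can also be checked from a program. What follows is only a minimal sketch in present-day D (not the 2004 Phobos used in this thread). It assumes the usual POSIX precedence, in which LC_ALL overrides LC_CTYPE and LC_CTYPE overrides LANG. The helper names effectiveCtype and looksLikeUtf8 are made up for illustration.

```d
import std.algorithm.searching : canFind;
import std.process : environment;
import std.stdio : writeln;
import std.uni : toLower;

/// Hypothetical helper: picks the setting that governs character
/// encoding, assuming the precedence LC_ALL > LC_CTYPE > LANG.
string effectiveCtype(string lcAll, string lcCtype, string lang)
{
    if (lcAll.length)
        return lcAll;
    if (lcCtype.length)
        return lcCtype;
    return lang;
}

/// Hypothetical helper: true if the locale name mentions UTF-8.
bool looksLikeUtf8(string locale)
{
    auto lower = locale.toLower;
    return lower.canFind("utf-8") || lower.canFind("utf8");
}

void main()
{
    // Unset variables are read as empty strings.
    auto ctype = effectiveCtype(environment.get("LC_ALL", ""),
                                environment.get("LC_CTYPE", ""),
                                environment.get("LANG", ""));
    writeln(ctype, looksLikeUtf8(ctype) ? ": UTF-8 locale"
                                        : ": not a UTF-8 locale");
}
```

With the de_DE.UTF-8 values listed above the check succeeds; with a setting such as de_DE@euro it does not, which would be consistent with the compiler rejecting the source file as invalid UTF-8.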
November 20, 2004 Re: Character encoding problem | ||||
---|---|---|---|---|
| ||||
Posted in reply to Thomas Kuehne | Thomas Kuehne wrote:
>
> Manfred Hansen wrote on Sat, 20 Nov 2004 08:53:41 +0100:
>> Hello,
>>
>> i have the same problem on Linux Debian (sarge) and SUSE 9.1.
>> "invalid UTF-8 sequence"
>> Editor is vim .
>>
>
> Vim 6.2 works for me.
> Are you sure your locale is set to use UTF-8?
>
> # > locale
> # LANG=de_DE.UTF-8
> # LC_CTYPE=de_DE.UTF-8
> # LC_NUMERIC=de_DE.UTF-8
> # LC_TIME=de_DE.UTF-8
> # LC_COLLATE=de_DE.UTF-8
> # LC_MONETARY=de_DE.UTF-8
> # LC_MESSAGES=de_DE.UTF-8
> # LC_PAPER=de_DE.UTF-8
> # LC_NAME=de_DE.UTF-8
> # LC_ADDRESS=de_DE.UTF-8
> # LC_TELEPHONE=de_DE.UTF-8
> # LC_MEASUREMENT=de_DE.UTF-8
> # LC_IDENTIFICATION=de_DE.UTF-8
> # LC_ALL=
>
> Please send me a sample, if this problem persists.
>
> Thomas
My locale:
hansen@hansen-lx:~/d$ locale
LANG=de_DE@euro
LC_CTYPE="de_DE@euro"
LC_NUMERIC="de_DE@euro"
LC_TIME="de_DE@euro"
LC_COLLATE="de_DE@euro"
LC_MONETARY="de_DE@euro"
LC_MESSAGES="de_DE@euro"
LC_PAPER="de_DE@euro"
LC_NAME="de_DE@euro"
LC_ADDRESS="de_DE@euro"
LC_TELEPHONE="de_DE@euro"
LC_MEASUREMENT="de_DE@euro"
LC_IDENTIFICATION="de_DE@euro"
LC_ALL=
Thank you for the advice; I will try switching to UTF-8.
Regards, Manfred
|
November 20, 2004 Re: Character encoding problem | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter | On Fri, 19 Nov 2004 14:13:32 -0800, Walter <newshound@digitalmars.com> wrote:
> Using Microsoft Notepad, click on "Save As" and under encoding, select
> "UTF-8". Then, use std.stdio.writef() instead of std.c.stdio.puts(), and it
> should work.
No, that doesn't work. Some others here have tracked down the main problem: the Win9x console doesn't support Unicode. Instead, one can only make use of some DOS escape sequences. The only thing that works so far (thanks to Stewart Gordon):
    puts("\x84\x94\x81\xE1\x8E\x99\x9A"); // äöüßÄÖÜ
or, more portable(?), written by myself:
    import std.c.stdio;
    int main()
    {
        version(Win32)
            puts("\x84\x94\x81\xE1\x8E\x99\x9A");
        else
            puts("äöüßÄÖÜ");
        return 0;
    }
Carlos Santander B. suggested another solution, based on Y. Tomino's Win32 headers, that seems to convert characters at run-time. I can't get it to print anything at the moment, so I can't yet tell whether it is better than what I have so far.
Maybe someone should write a tutorial about input/output basics in D? ;-)
|
November 22, 2004 Re: Character encoding problem | ||||
---|---|---|---|---|
| ||||
Posted in reply to Mathias Bierschenk | In article <opshrgt5ci9gaiaw@dialin-212-144-051-198.arcor-ip.net>, Mathias Bierschenk says...
[...]
>Some others here have tracked down the main problem: The Win9x console doesn't support Unicode.
This problem also affects Windows NT/2000/XP.
Consoles use the OEM character set.
D doesn't support this.
> Instead one can only make use of some DOS escape
>sequences. The only thing that works so far (thanks to Stewart Gordon):
>
>puts("\x84\x94\x81\xE1\x8E\x99\x9A"); // äöüßÄÖÜ
These are binary encodings of OEM characters.
>or, more portable(?), written by myself:
>
>import std.c.stdio;
>
>int main()
>{
>    version(Win32)
>        puts("\x84\x94\x81\xE1\x8E\x99\x9A");
>    else
>        puts("äöüßÄÖÜ");
>
>    return 0;
>}
This is not portable at all. It works only if the OEM code page in use is compatible with CP437 for those code points.
The solution is to use CharToOemW, a function that translates a string from UTF-16 to the OEM character set (when possible, of course). Here is an example:
<code>
import std.stdio;
import std.c.stdio;
import std.c.windows.windows;
extern (Windows)
{
    export BOOL CharToOemW(
        LPCWSTR lpszSrc, // string to translate
        LPSTR lpszDst    // translated string
    );
}
int main()
{
    puts("-- untranslated --");
    puts("äöüßÄÖÜ");
    writef("äöüßÄÖÜ\n");
    puts("-- translated --");
    wchar[] mess = "äöüßÄÖÜ";
    // CharToOemW also writes a terminating NUL, so reserve one extra byte.
    char[] OEMmess = new char[mess.length + 1];
    CharToOemW(mess, OEMmess);
    puts(OEMmess);
    writef(OEMmess);
    return 0;
}
</code>
This outputs:
-- untranslated --
├ñ├Â├╝├ƒ├ä├û├£
├ñ├Â├╝├ƒ├ä├û├£
-- translated --
äöüßÄÖÜ
Error: invalid UTF-8 sequence
Here you can see that puts() works but writef() does not. That's because writef expects OEMmess to be UTF-8.
The result is that writef doesn't work under Windows in any case.
Note also that on Windows 95/98/Me this works only if the Microsoft Layer for Unicode is installed.
The only alternative is to use CharToOemA, which converts from the current ANSI code page (for most Western countries: Windows-1252) to the current OEM code page. I don't know how to translate UTF-8 to ANSI.
>Carlos Santander B. suggested another solution, based on Y. Tomino's Win32 headers, that seems to convert characters at run-time. I can't get it to print anything at the moment, so I can't yet tell if it is better than what I have got so far.
I haven't tested it either.
>Maybe someone should write a tutorial about input/output basics in D? ;-)
Yes, please do it.
Ciao
|
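One possible route from UTF-8 to the console code page is to go through UTF-16 and then call CharToOemW, as in the post above. This is a hedged sketch in present-day D rather than the 2004 Phobos of this thread. The helper name toConsoleBytes is hypothetical. CharToOemW is declared by hand with the signature given in the Win32 documentation, and the buffer size assumes an OEM code page needs at most two bytes per UTF-16 code unit. On non-Windows systems the text is passed through unchanged.

```d
import std.stdio;

version (Windows)
{
    import core.stdc.string : strlen;
    import std.utf : toUTF16z;

    pragma(lib, "user32");
    // Hand-written declaration (assumption: BOOL is a 32-bit int).
    extern (Windows) int CharToOemW(const(wchar)* src, char* dst);
}

/// Hypothetical helper: converts UTF-8 text to console bytes.
const(char)[] toConsoleBytes(string utf8)
{
    version (Windows)
    {
        // Assumed upper bound: two OEM bytes per code unit, plus a NUL.
        auto oem = new char[](utf8.length * 2 + 1);
        oem[] = '\0';
        // UTF-8 -> zero-terminated UTF-16 -> OEM code page.
        CharToOemW(toUTF16z(utf8), oem.ptr);
        return oem[0 .. strlen(oem.ptr)];
    }
    else
    {
        // Assumption: the terminal already expects UTF-8.
        return utf8;
    }
}

void main()
{
    // Escapes keep this source file pure ASCII: äöüßÄÖÜ
    stdout.rawWrite(toConsoleBytes(
        "\u00E4\u00F6\u00FC\u00DF\u00C4\u00D6\u00DC\n"));
}
```

The converted bytes are written with rawWrite rather than writef, because, as noted above, writef expects its input to be valid UTF-8.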
November 22, 2004 Re: Character encoding problem | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter | In article <cnlrlp$14b6$1@digitaldaemon.com>, Walter says...
>
> [...]
>
>Using Microsoft Notepad, click on "Save As" and under encoding, select
>"UTF-8". Then, use std.stdio.writef() instead of std.c.stdio.puts(), and it
>should work.
The code doesn't work anyway; see my other post for details.
The biggest problem is that writef() doesn't work on Windows, neither 9x/Me nor NT/2000/XP.
Ciao
|
November 22, 2004 Re: Character encoding problem | ||||
---|---|---|---|---|
| ||||
Posted in reply to Roberto Mariottini | Roberto Mariottini wrote on Mon, 22 Nov 2004 09:52:27 +0000 (UTC):
> Here you can note that puts() works, but writef() not. That's because writef
> expects OEMmess to be UTF-8.
> The results are that writef doesn't work, in any case, under Windows.
>
> Note also that on Windows 95/98/Me this works only if the Microsoft Layer for Unicode is installed.
>
> The only alternative is to use CharToOemA, that converts the current ANSI codepage (for most western countries: Windows-1252) to current OEM codepage. I don't know how to translate UTF-8 to ANSI.
Maybe you could take a look at dmd/src/phobos/std/c/stdio.d?
You should be able to change it so that, if the "FILE*" is stdout, stderr or stdlog and the hosting environment is Windows, CharToOemA is called before C's "fputs", "fputc", "puts" or "putw".
As a consequence, all writef/*put* calls should produce reasonable output.
To do the same with "printf" you'd have to modify dmd/src/phobos/internal/object.d and dmd/src/phobos/object.d.
I'm currently not running Windows, but it would be interesting to know whether "fputws" works correctly for non-ASCII chars.
Thomas
|
November 22, 2004 Re: Character encoding problem | ||||
---|---|---|---|---|
| ||||
Posted in reply to Roberto Mariottini | Roberto Mariottini wrote:
>>Some others here have tracked down the main problem: The Win9x console doesn't support Unicode.
>
> This problem is for Windows NT/2000/XP also.
> Consoles use OEM character set.
> D doesn't support this.
Mac OS X has a similar issue (it uses MacRoman/ISO-8859-1 by default), but fortunately you can choose UTF-8 in the Terminal settings...
> This is not portable at all. It works only if the OEM codepage used is compatible
> with CP437 for those code points.
>
> The solution is to use CharToOemW, a function that translates a string from
> UTF-16 to OEM character set (when possible, of course).
Or supply similar functions in D, which could be an alternative?
>>Carlos Santander B. suggested another solution, based on Y. Tomino's Win32 headers, that seems to convert characters at run-time. I can't get it to print anything at the moment, so I can't yet tell if it is better than what I have got so far.
I have written some basic lookups (i.e. "wchar mapping[256];") using the tables that are all available on the Unicode site:
ISO Latin-1 (simple!)
http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
DOS Latin Console
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT
Windows "Latin-1"
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
Mac OS Roman
http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT
(there are a few dozen others, but I think these are the most common?)
But it needs a more thought-through API to be really useful... And some optimization to do the reverse lookup, I suppose?
I'm thinking one array of char[256], and one char[] of exceptions (where 0x00-0xFF would use the lookup, and 0x0100-0xFFFF the hash).
--anders
|
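The table idea from the post above could look roughly like this. It is a sketch in present-day D, not the code the post refers to. The forward table holds only ASCII plus the seven CP437 bytes quoted earlier in the thread; a real table would be generated from CP437.TXT. The names cp437Subset, Reverse, invert and encode are hypothetical, and unmapped characters are replaced with '?' as an arbitrary choice.

```d
import std.stdio;

enum wchar unmapped = '\uFFFD';

/// Forward table (code page byte -> Unicode), like "wchar mapping[256]".
wchar[256] cp437Subset()
{
    wchar[256] table = unmapped;
    foreach (i; 0 .. 0x80)
        table[i] = cast(wchar) i; // ASCII maps to itself
    // The seven bytes quoted in the thread for äöüßÄÖÜ.
    table[0x84] = '\u00E4';
    table[0x94] = '\u00F6';
    table[0x81] = '\u00FC';
    table[0xE1] = '\u00DF';
    table[0x8E] = '\u00C4';
    table[0x99] = '\u00D6';
    table[0x9A] = '\u00DC';
    return table;
}

/// Reverse lookup: direct array below U+0100, hash for the rest.
struct Reverse
{
    ubyte[256] low;
    bool[256] lowMapped;
    ubyte[wchar] high;
}

Reverse invert(const(wchar)[] forward)
{
    Reverse r;
    foreach (i, wc; forward)
    {
        if (wc == unmapped)
            continue;
        if (wc < 0x100)
        {
            r.low[wc] = cast(ubyte) i;
            r.lowMapped[wc] = true;
        }
        else
            r.high[wc] = cast(ubyte) i;
    }
    return r;
}

/// Encodes UTF-8 text as code page bytes; unmapped characters become '?'.
ubyte[] encode(string utf8, const Reverse r)
{
    ubyte[] bytes;
    foreach (dchar c; utf8) // foreach decodes the UTF-8 input
    {
        ubyte b = 0x3F; // '?'
        if (c < 0x100)
        {
            if (r.lowMapped[c])
                b = r.low[c];
        }
        else if (c <= 0xFFFF)
        {
            if (auto p = cast(wchar) c in r.high)
                b = *p;
        }
        bytes ~= b;
    }
    return bytes;
}

void main()
{
    auto forward = cp437Subset();
    auto reverse = invert(forward[]);
    auto bytes = encode("\u00E4\u00F6\u00FC\u00DF\u00C4\u00D6\u00DC", reverse);
    writefln("%(%02X %)", bytes); // hex dump of the encoded bytes
}
```

For the seven German characters the encoded bytes should match the escape sequence used earlier in the thread (\x84\x94\x81\xE1\x8E\x99\x9A). Swapping in a different forward table (CP1252, Mac OS Roman) would reuse invert and encode unchanged.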