Character encoding problem (page 3)

Hello,

I have the same problem on Linux Debian (sarge) and SUSE 9.1.
"invalid UTF-8 sequence"
Editor is vim.

Manfred

Mathias Bierschenk wrote:
> How can I print German characters? I've tried the following simple program:
>
> import std.c.stdio;
>
> int main()
> {
>     puts("äöüßÄÖÜ"); // German characters
>
>     return 0;
> }
>
> As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German edition) I tried Mozilla to save the source code file with different character encodings but none worked as expected. Here's what I tried using the current DMD version:
>
> MS-DOS encoding as performed by Microsoft's EDIT editor:
> (5) "invalid UTF-sequence"
>
> Western (ISO-8859-1):
> (5) "invalid UTF-sequence"
>
> Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
> (1) "semicolon expected, not '.'"
> (1) no identifier for declarator
>
> Unicode (UTF-16 and UTF-8):
> both compile fine but output garbage under MS-DOS
> (Windows 98 SE, German edition)

Manfred Hansen wrote on Sat, 20 Nov 2004 08:53:41 +0100:
> Hello,
>
> i have the same problem on Linux Debian (sarge) and SUSE 9.1.
> "invalid UTF-8 sequence"
> Editor is vim .

Vim 6.2 works for me.
Are you sure your locale is set to use UTF-8?

# > locale
# LANG=de_DE.UTF-8
# LC_CTYPE=de_DE.UTF-8
# LC_NUMERIC=de_DE.UTF-8
# LC_TIME=de_DE.UTF-8
# LC_COLLATE=de_DE.UTF-8
# LC_MONETARY=de_DE.UTF-8
# LC_MESSAGES=de_DE.UTF-8
# LC_PAPER=de_DE.UTF-8
# LC_NAME=de_DE.UTF-8
# LC_ADDRESS=de_DE.UTF-8
# LC_TELEPHONE=de_DE.UTF-8
# LC_MEASUREMENT=de_DE.UTF-8
# LC_IDENTIFICATION=de_DE.UTF-8
# LC_ALL=

Please send me a sample if this problem persists.

Thomas

Thomas Kuehne wrote:
> Manfred Hansen wrote on Sat, 20 Nov 2004 08:53:41 +0100:
>> Hello,
>>
>> i have the same problem on Linux Debian (sarge) and SUSE 9.1.
>> "invalid UTF-8 sequence"
>> Editor is vim .
>
> Vim 6.2 works for me.
> Are you sure your locale is set to use UTF-8?
>
> # > locale
> # LANG=de_DE.UTF-8
> # LC_CTYPE=de_DE.UTF-8
> # LC_NUMERIC=de_DE.UTF-8
> # LC_TIME=de_DE.UTF-8
> # LC_COLLATE=de_DE.UTF-8
> # LC_MONETARY=de_DE.UTF-8
> # LC_MESSAGES=de_DE.UTF-8
> # LC_PAPER=de_DE.UTF-8
> # LC_NAME=de_DE.UTF-8
> # LC_ADDRESS=de_DE.UTF-8
> # LC_TELEPHONE=de_DE.UTF-8
> # LC_MEASUREMENT=de_DE.UTF-8
> # LC_IDENTIFICATION=de_DE.UTF-8
> # LC_ALL=
>
> Please send me a sample, if this problem persists.
>
> Thomas

My locale:

hansen@hansen-lx:~/d$ locale
LANG=de_DE@euro
LC_CTYPE="de_DE@euro"
LC_NUMERIC="de_DE@euro"
LC_TIME="de_DE@euro"
LC_COLLATE="de_DE@euro"
LC_MONETARY="de_DE@euro"
LC_MESSAGES="de_DE@euro"
LC_PAPER="de_DE@euro"
LC_NAME="de_DE@euro"
LC_ADDRESS="de_DE@euro"
LC_TELEPHONE="de_DE@euro"
LC_MEASUREMENT="de_DE@euro"
LC_IDENTIFICATION="de_DE@euro"
LC_ALL=

Thank you for the advice, I will try switching to UTF-8.

Regards,
Manfred

On Fri, 19 Nov 2004 14:13:32 -0800, Walter <newshound@digitalmars.com> wrote:
> Using Microsoft Notepad, click on "Save As" and under encoding, select
> "UTF-8". Then, use std.stdio.writef() instead of std.c.stdio.puts(), and it
> should work.

No, that doesn't work.

Some others here have tracked down the main problem: The Win9x console doesn't support Unicode. Instead one can only make use of some DOS escape sequences. The only thing that works so far (thanks to Stewart Gordon):

puts("\x84\x94\x81\xE1\x8E\x99\x9A"); // äöüßÄÖÜ

or, more portable(?), written by myself:

import std.c.stdio;

int main()
{
   version(Win32)
     puts("\x84\x94\x81\xE1\x8E\x99\x9A");
   else
     puts("äöüßÄÖÜ");

   return 0;
}

Carlos Santander B. suggested another solution, based on Y. Tomino's Win32 headers, that seems to convert characters at run-time. I can't get it to print anything at the moment, so I can't yet tell if it is better than what I have got so far.

Maybe someone should write a tutorial about input/output basics in D? ;-)

November 22, 2004

Re: Character encoding problem

Posted by Roberto Mariottini
in reply to Mathias Bierschenk


In article <opshrgt5ci9gaiaw@dialin-212-144-051-198.arcor-ip.net>, Mathias
Bierschenk says...
[...]
>Some others here have tracked down the main problem: The Win9x console doesn't support Unicode.

This problem also affects Windows NT/2000/XP.
Consoles use the OEM character set.
D doesn't support this.

> Instead one can only make use of some DOS escape
>sequences. The only thing that works so far (thanks to Stewart Gordon):
>
>puts("\x84\x94\x81\xE1\x8E\x99\x9A"); // äöüßÄÖÜ

These are the binary encodings of the characters in the OEM character set.

>or, more portable(?), written by myself:
>
>import std.c.stdio;
>
>int main()
>{
>   version(Win32)
>     puts("\x84\x94\x81\xE1\x8E\x99\x9A");
>   else
>     puts("äöüßÄÖÜ");
>
>   return 0;
>}

This is not portable at all. It works only if the OEM codepage in use is compatible with CP437 for those code points.

The solution is to use CharToOemW, a function that translates a string from UTF-16 to the OEM character set (when possible, of course).

Here is an example:

<code>
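import std.stdio;
import std.c.stdio;
import std.c.windows.windows;

// Win32 API: translates a UTF-16 string to the current OEM code page.
extern (Windows)
{
    export BOOL CharToOemW(
        LPCWSTR lpszSrc,  // string to translate
        LPSTR lpszDst     // translated string
    );
}

int main()
{
    puts("-- untranslated --");
    puts("äöüßÄÖÜ");
    writef("äöüßÄÖÜ\n");

    puts("-- translated --");
    wchar[] mess = "äöüßÄÖÜ";
    // One extra byte for the terminating zero that CharToOemW writes.
    char[] OEMmess = new char[mess.length + 1];
    CharToOemW(mess, OEMmess);
    puts(OEMmess);    // C output: the bytes go to the console unchanged
    writef(OEMmess);  // D output: throws, OEM bytes are not valid UTF-8

    return 0;
}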
</code>

This outputs:

-- untranslated --
├ñ├Â├╝├ƒ├ä├û├£
├ñ├Â├╝├ƒ├ä├û├£
-- translated --
äöüßÄÖÜ
Error: invalid UTF-8 sequence

Here you can note that puts() works, but writef() does not. That's because writef
expects OEMmess to be UTF-8.
The result is that writef doesn't work under Windows in either case.
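
To make the reason for the exception concrete, here is a minimal sketch, assuming a current D compiler with Phobos (not the 2004 library used above). It checks the OEM bytes from this thread against the UTF-8 rules with std.utf.validate; the helper name isValidUtf8 is made up for this illustration.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Hypothetical helper: true if `bytes` form well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        // validate() throws UTFException on the first malformed sequence.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "äöüßÄÖÜ" as OEM bytes, taken from the puts() call quoted above.
    immutable ubyte[] oem = [0x84, 0x94, 0x81, 0xE1, 0x8E, 0x99, 0x9A];
    // The same text as UTF-8 (assumes this source file is saved as UTF-8).
    auto utf8 = cast(immutable(ubyte)[]) "äöüßÄÖÜ";

    writeln("OEM bytes are valid UTF-8:   ", isValidUtf8(oem));
    writeln("UTF-8 bytes are valid UTF-8: ", isValidUtf8(utf8));
}
```

The first OEM byte, 0x84, is already a UTF-8 continuation byte with no lead byte before it, so any function that decodes its input as UTF-8 has to reject the string.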

Note also that on Windows 95/98/Me this works only if the Microsoft Layer for Unicode is installed.

The only alternative is to use CharToOemA, which converts from the current ANSI codepage (for most western countries: Windows-1252) to the current OEM codepage. I don't know how to translate UTF-8 to ANSI.
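
For the UTF-8 to ANSI step, one possible sketch, assuming current D and assuming the ANSI codepage is Windows-1252. It only handles the code points where Windows-1252 and Unicode coincide (ASCII and U+00A0 to U+00FF), which covers the German characters here; everything else becomes '?'. The function name toAnsiSubset is made up for this illustration, and this is not a full Windows-1252 converter.

```d
import std.stdio : writefln;

/// Hypothetical helper: UTF-8 to single-byte text, limited to the range
/// that Windows-1252 shares with Unicode. Other code points become '?'.
ubyte[] toAnsiSubset(const(char)[] utf8)
{
    ubyte[] result;
    // A dchar loop variable makes foreach decode the UTF-8 input.
    foreach (dchar c; utf8)
    {
        if (c < 0x80 || (c >= 0xA0 && c <= 0xFF))
            result ~= cast(ubyte) c;   // same value in both encodings
        else
            result ~= cast(ubyte) '?'; // outside the handled subset
    }
    return result;
}

void main()
{
    // Assumes this source file is saved as UTF-8.
    auto ansi = toAnsiSubset("äöüßÄÖÜ");
    writefln("%(%02X %)", ansi); // hex dump of the converted bytes
}
```

On Windows, the resulting bytes (plus a terminating zero) could then be handed to CharToOemA. The characters Windows-1252 places in 0x80 to 0x9F, such as the euro sign, would need an extra table.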

>Carlos Santander B. suggested another solution, based on Y. Tomino's Win32 headers, that seems to convert characters at run-time. I can't get it to print anything at the moment, so I can't yet tell if it is better than what I have got so far.

I haven't tested it either.

>Maybe someone should write a tutorial about input/output basics in D? ;-)

Yes, please do it.

Ciao

In article <cnlrlp$14b6$1@digitaldaemon.com>, Walter says...
>
> [...]
>
>Using Microsoft Notepad, click on "Save As" and under encoding, select
>"UTF-8". Then, use std.stdio.writef() instead of std.c.stdio.puts(), and it
>should work.

The code doesn't work anyway, see my other post for details.
The biggest problem is that writef() doesn't work on Windows, neither 9x/Me nor NT/2000/XP.

Ciao

Roberto Mariottini wrote on Mon, 22 Nov 2004 09:52:27 +0000 (UTC):
> Here you can note that puts() works, but writef() not. That's because writef
> expects OEMmess to be UTF-8.
> The results are that writef doesn't work, in any case, under Windows.
>
> Note also that on Windows 95/98/Me this works only if the Microsoft Layer for Unicode is installed.
>
> The only alternative is to use CharToOemA, that converts the current ANSI codepage (for most western countries: Windows-1252) to current OEM codepage. I don't know how to translate UTF-8 to ANSI.

Maybe you could take a look at dmd/src/phobos/std/c/stdio.d?

You should be able to change it in a way that - if "FILE*" equals stdout, stderr or stdlog and the hosting environment is Windows - CharToOemA is called before C's "fputs", "fputc", "puts" or "putw" is called. The consequence would be that all writef/*put* calls should produce reasonable output.

To do the same with "printf" you'd have to modify dmd/src/phobos/internal/object.d and dmd/src/phobos/object.d.

I'm currently not running Windows, but it would be interesting to know whether "fputws" works correctly for non-ASCII chars.

Thomas

Roberto Mariottini wrote:
>>Some others here have tracked down the main problem: The Win9x console doesn't support Unicode.
>
> This problem is for Windows NT/2000/XP also.
> Consoles use OEM character set.
> D doesn't support this.

Mac OS X has a similar issue (uses MacRoman/ISO-8859-1 by default), but fortunately you can choose UTF-8 from the Terminal settings...

> This is not portable at all. It works only if the OEM codepage used is compatible
> with CP437 for those code points.
>
> The solution is to use CharToOemW, a function that translates a string from
> UTF-16 to OEM character set (when possible, of course).

Or supply similar functions in D, which could be an alternative?

>>Carlos Santander B. suggested another solution, based on Y. Tomino's Win32 headers, that seems to convert characters at run-time. I can't get it to print anything at the moment, so I can't yet tell if it is better than what I have got so far.

I have written some basic lookups (i.e. "wchar mapping[256];") using the tables that are all available on the Unicode site:

ISO Latin-1 (simple!)
http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
DOS Latin Console
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT
Windows "Latin-1"
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
Mac OS Roman
http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT

(there are a few dozen others, but I think these are the most common?)

But it needs a more thought-through API to be really useful... And some optimization to do the reverse lookup, I suppose? I'm thinking one array of char[256], and one char[] of exceptions (where 0x00-0xFF would use the lookup, and 0x0100-0xFFFF the hash).

--anders
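
The reverse lookup described above (a 256-entry array plus a hash for the exceptions) could look roughly like the following. This is only a sketch in current D, not the code from the post: the names Cp437Encoder, makeEncoder and unmapped are invented for the example, and just a handful of CP437 entries are filled in by hand. A real table would be generated from CP437.TXT.

```d
import std.stdio : writefln;

enum ubyte unmapped = 0x3F; // '?' stands in for anything without a mapping

/// Hypothetical Unicode -> CP437 encoder with a partial table.
struct Cp437Encoder
{
    ubyte[256] low;     // direct lookup for U+0000..U+00FF
    ubyte[dchar] high;  // "exceptions" hash for code points above U+00FF

    /// Encodes one code point, falling back to `unmapped`.
    ubyte encode(dchar c) const
    {
        if (c < 0x100)
            return low[cast(size_t) c];
        if (auto p = c in high)
            return *p;
        return unmapped;
    }

    /// Encodes a whole UTF-8 string, one code point at a time.
    ubyte[] encodeString(const(char)[] utf8) const
    {
        ubyte[] result;
        foreach (dchar c; utf8)
            result ~= encode(c);
        return result;
    }
}

/// Builds the partial table used in this sketch.
Cp437Encoder makeEncoder()
{
    Cp437Encoder e;
    e.low[] = unmapped;
    foreach (i; 0 .. 0x80)      // ASCII maps to itself
        e.low[i] = cast(ubyte) i;
    // German characters, byte values as given earlier in the thread.
    e.low[0xE4] = 0x84; // ä
    e.low[0xF6] = 0x94; // ö
    e.low[0xFC] = 0x81; // ü
    e.low[0xDF] = 0xE1; // ß
    e.low[0xC4] = 0x8E; // Ä
    e.low[0xD6] = 0x99; // Ö
    e.low[0xDC] = 0x9A; // Ü
    // Two box-drawing characters as examples of the "exceptions".
    e.high['\u2502'] = 0xB3; // │
    e.high['\u251C'] = 0xC3; // ├
    return e;
}

void main()
{
    auto enc = makeEncoder();
    // Assumes this source file is saved as UTF-8.
    writefln("%(%02X %)", enc.encodeString("äöüßÄÖÜ"));
}
```

The array keeps the common Latin-1 range at a single index operation, and the associative array only pays its hashing cost for the comparatively rare code points above U+00FF.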
