November 20, 2004
Hello,

I have the same problem on Linux Debian (sarge) and SUSE 9.1:
"invalid UTF-8 sequence"
The editor is Vim.

Manfred

Mathias Bierschenk wrote:

> How can I print German characters? I've tried the following simple program:
> 
> import std.c.stdio;
> 
> int main()
> {
>    puts("äöüßÄÖÜ"); // German characters
> 
>    return 0;
> }
> 
> As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German edition) I tried Mozilla to save the source code file with different character encodings but none worked as expected. Here's what I tried using the current DMD version:
> 
> MS-DOS encoding as performed by Microsoft's EDIT editor:
> (5) "invalid UTF-sequence"
> 
> Western (ISO-8859-1):
> (5) "invalid UTF-sequence"
> 
> Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
> (1) "semicolon expected, not '.'"
> (1) no identifier for declarator
> 
> Unicode (UTF-16 and UTF-8):
> both compile fine but output garbage under MS-DOS
> (Windows 98 SE, German edition)

November 20, 2004
Manfred Hansen wrote on Sat, 20 Nov 2004 08:53:41 +0100:
> Hello,
>
> I have the same problem on Linux Debian (sarge) and SUSE 9.1:
> "invalid UTF-8 sequence"
> The editor is Vim.
>

Vim 6.2 works for me.
Are you sure your locale is set to use UTF-8?

# > locale
# LANG=de_DE.UTF-8
# LC_CTYPE=de_DE.UTF-8
# LC_NUMERIC=de_DE.UTF-8
# LC_TIME=de_DE.UTF-8
# LC_COLLATE=de_DE.UTF-8
# LC_MONETARY=de_DE.UTF-8
# LC_MESSAGES=de_DE.UTF-8
# LC_PAPER=de_DE.UTF-8
# LC_NAME=de_DE.UTF-8
# LC_ADDRESS=de_DE.UTF-8
# LC_TELEPHONE=de_DE.UTF-8
# LC_MEASUREMENT=de_DE.UTF-8
# LC_IDENTIFICATION=de_DE.UTF-8
# LC_ALL=

Please send me a sample, if this problem persists.

Thomas
November 20, 2004
Thomas Kuehne wrote:

> 
> Manfred Hansen wrote on Sat, 20 Nov 2004 08:53:41 +0100:
>> Hello,
>>
>> I have the same problem on Linux Debian (sarge) and SUSE 9.1:
>> "invalid UTF-8 sequence"
>> The editor is Vim.
>>
> 
> Vim 6.2 works for me.
> Are you sure your locale is set to use UTF-8?
> 
> # > locale
> # LANG=de_DE.UTF-8
> # LC_CTYPE=de_DE.UTF-8
> # LC_NUMERIC=de_DE.UTF-8
> # LC_TIME=de_DE.UTF-8
> # LC_COLLATE=de_DE.UTF-8
> # LC_MONETARY=de_DE.UTF-8
> # LC_MESSAGES=de_DE.UTF-8
> # LC_PAPER=de_DE.UTF-8
> # LC_NAME=de_DE.UTF-8
> # LC_ADDRESS=de_DE.UTF-8
> # LC_TELEPHONE=de_DE.UTF-8
> # LC_MEASUREMENT=de_DE.UTF-8
> # LC_IDENTIFICATION=de_DE.UTF-8
> # LC_ALL=
> 
> Please send me a sample, if this problem persists.
> 
> Thomas

My locale
hansen@hansen-lx:~/d$ locale
LANG=de_DE@euro
LC_CTYPE="de_DE@euro"
LC_NUMERIC="de_DE@euro"
LC_TIME="de_DE@euro"
LC_COLLATE="de_DE@euro"
LC_MONETARY="de_DE@euro"
LC_MESSAGES="de_DE@euro"
LC_PAPER="de_DE@euro"
LC_NAME="de_DE@euro"
LC_ADDRESS="de_DE@euro"
LC_TELEPHONE="de_DE@euro"
LC_MEASUREMENT="de_DE@euro"
LC_IDENTIFICATION="de_DE@euro"
LC_ALL=

Thank you for the advice; I will try switching to UTF-8.

Regards, Manfred


November 20, 2004
On Fri, 19 Nov 2004 14:13:32 -0800, Walter <newshound@digitalmars.com> wrote:

> Using Microsoft Notepad, click on "Save As" and under encoding, select
> "UTF-8". Then, use std.stdio.writef() instead of std.c.stdio.puts(), and it
> should work.

No, that doesn't work.
Some others here have tracked down the main problem: The Win9x console doesn't support Unicode. Instead one can only make use of some DOS escape sequences. The only thing that works so far (thanks to Stewart Gordon):

puts("\x84\x94\x81\xE1\x8E\x99\x9A"); // äöüßÄÖÜ

or, more portable(?), written by myself:

import std.c.stdio;

int main()
{
  version(Win32)
    puts("\x84\x94\x81\xE1\x8E\x99\x9A");
  else
    puts("äöüßÄÖÜ");

  return 0;
}

Carlos Santander B. suggested another solution, based on Y. Tomino's Win32 headers, that seems to convert characters at run-time. I can't get it to print anything at the moment, so I can't yet tell if it is better than what I have got so far.
Maybe someone should write a tutorial about input/output basics in D? ;-)
November 22, 2004
In article <opshrgt5ci9gaiaw@dialin-212-144-051-198.arcor-ip.net>, Mathias
Bierschenk says...
[...]
>Some others here have tracked down the main problem: The Win9x console doesn't support Unicode.

This problem also affects Windows NT/2000/XP.
Consoles use the OEM character set, which D doesn't support.

> Instead one can only make use of some DOS escape
>sequences. The only thing that works so far (thanks to Stewart Gordon):
>
>puts("\x84\x94\x81\xE1\x8E\x99\x9A"); // äöüßÄÖÜ

These are the binary encodings of OEM characters.

>or, more portable(?), written by myself:
>
>import std.c.stdio;
>
>int main()
>{
>   version(Win32)
>     puts("\x84\x94\x81\xE1\x8E\x99\x9A");
>   else
>     puts("äöüßÄÖÜ");
>
>   return 0;
>}

This is not portable at all. It works only if the OEM codepage in use is compatible with CP437 for those code points.

The solution is to use CharToOemW, a function that translates a string from UTF-16 to the OEM character set (when possible, of course).

See an example:

<code>
import std.stdio;
import std.c.stdio;
import std.c.windows.windows;

// CharToOemW is declared by hand here; it is exported by user32.dll.
extern (Windows)
{
    export BOOL CharToOemW(
        LPCWSTR lpszSrc,  // string to translate
        LPSTR lpszDst     // translated string
    );
}

int main()
{
    puts("-- untranslated --");
    puts("äöüßÄÖÜ");
    writef("äöüßÄÖÜ\n");

    puts("-- translated --");
    wchar[] mess = "äöüßÄÖÜ";
    // One extra char for the terminating NUL that CharToOemW writes.
    char[] OEMmess = new char[mess.length + 1];
    CharToOemW(mess, OEMmess);
    puts(OEMmess);
    writef(OEMmess);  // fails: OEMmess holds OEM bytes, not UTF-8

    return 0;
}
</code>

This outputs:

-- untranslated --
├ñ├Â├╝├ƒ├ä├û├£
├ñ├Â├╝├ƒ├ä├û├£
-- translated --
äöüßÄÖÜ
Error: invalid UTF-8 sequence

Here you can note that puts() works, but writef() does not. That's because writef
expects OEMmess to be UTF-8.
The result is that writef cannot print these characters correctly under Windows in either case.
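The "invalid UTF-8 sequence" error can be reproduced without a Windows console. The following is a minimal sketch in present-day D (not the 2004 Phobos used in this thread), assuming that std.utf.validate applies the same well-formedness rule writef does. The helper name isValidUtf8 is made up for illustration.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true when the bytes form well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "äöüßÄÖÜ" as OEM bytes, the values used earlier in the thread.
    immutable ubyte[] oem = [0x84, 0x94, 0x81, 0xE1, 0x8E, 0x99, 0x9A];
    // The same text as UTF-8: two bytes per character.
    string utf8 = "äöüßÄÖÜ";

    assert(!isValidUtf8(oem));                       // 0x84 cannot start a sequence
    assert(isValidUtf8(cast(const(ubyte)[]) utf8));
    assert(utf8.length == 14);
}
```

Each OEM byte here is above 0x7F and none of them starts a valid multi-byte sequence followed by the right continuation bytes, so the check fails on the first byte.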

Note also that on Windows 95/98/Me this works only if the Microsoft Layer for Unicode is installed.

The only alternative is to use CharToOemA, which converts from the current ANSI codepage (for most western countries: Windows-1252) to the current OEM codepage. I don't know how to translate UTF-8 to ANSI.
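For the German characters in this thread, the UTF-8 to ANSI step can be done by hand, because Windows-1252 and ISO-8859-1 agree with Unicode for ASCII and for U+00A0 to U+00FF. The sketch below is in present-day D and covers only that shared range. The function name utf8ToAnsiSubset and the '?' fallback are assumptions for illustration, and a complete converter would need the full Windows-1252 table for 0x80 to 0x9F.

```d
// Hypothetical helper: converts UTF-8 text to single-byte text for the
// range that Windows-1252 and ISO-8859-1 share (ASCII plus U+00A0..U+00FF).
// Every other code point becomes '?'.
immutable(ubyte)[] utf8ToAnsiSubset(string utf8)
{
    immutable(ubyte)[] result;
    foreach (dchar c; utf8) // iterating as dchar decodes each UTF-8 sequence
    {
        if (c < 0x80 || (c >= 0xA0 && c <= 0xFF))
            result ~= cast(ubyte) c;
        else
            result ~= cast(ubyte) '?';
    }
    return result;
}

void main()
{
    immutable ubyte[] expected = [0xE4, 0xF6, 0xFC, 0xDF, 0xC4, 0xD6, 0xDC];
    assert(utf8ToAnsiSubset("äöüßÄÖÜ") == expected);
}
```

The resulting bytes are what CharToOemA would expect as input on a system whose ANSI codepage is Windows-1252.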

>Carlos Santander B. suggested another solution, based on Y. Tomino's Win32 headers, that seems to convert characters at run-time. I can't get it to print anything at the moment, so I can't yet tell if it is better than what I have got so far.

I haven't tested it either.

>Maybe someone should write a tutorial about input/output basics in D? ;-)

Yes, please do it.

Ciao


November 22, 2004
In article <cnlrlp$14b6$1@digitaldaemon.com>, Walter says...
>
>
[...]
>
>Using Microsoft Notepad, click on "Save As" and under encoding, select
>"UTF-8". Then, use std.stdio.writef() instead of std.c.stdio.puts(), and it
>should work.

The code doesn't work anyway; see my other post for details.
The biggest problem is that writef() cannot print non-ASCII characters correctly on the
Windows console, on either 9x/Me or NT/2000/XP.

Ciao


November 22, 2004
Roberto Mariottini wrote on Mon, 22 Nov 2004 09:52:27 +0000 (UTC):
> Here you can note that puts() works, but writef() does not. That's because writef
> expects OEMmess to be UTF-8.
> The result is that writef cannot print these characters correctly under Windows in either case.
>
> Note also that on Windows 95/98/Me this works only if the Microsoft Layer for Unicode is installed.
>
> The only alternative is to use CharToOemA, that converts the current ANSI codepage (for most western countries: Windows-1252) to current OEM codepage. I don't know how to translate UTF-8 to ANSI.

Maybe you could take a look at dmd/src/phobos/std/c/stdio.d?

You should be able to change it in a way that - if "FILE*" equals stdout, stderr or stdlog and the hosting environment is Windows - CharToOemA is called before C's "fputs", "fputc", "puts" or "putw" is called.

The consequence would be that all writef/*put* calls should produce reasonable output. To do the same with "printf" you'd have to modify dmd/src/phobos/internal/object.d and dmd/src/phobos/object.d.

I'm currently not running Windows, but it would be interesting to know whether "fputws" works correctly for non-ASCII characters.

Thomas
November 22, 2004
Roberto Mariottini wrote:

>>Some others here have tracked down the main problem: The Win9x console  doesn't support Unicode.
> 
> This problem is for Windows NT/2000/XP also.
> Consoles use OEM character set.
> D doesn't support this.

Mac OS X has a similar issue (uses MacRoman/ISO-8859-1 by default),
but fortunately you can choose UTF-8 from the Terminal settings...

> This is not portable at all. It works only if the OEM codepage in use is compatible
> with CP437 for those code points.
> 
> The solution is to use CharToOemW, a function that translates a string from
> UTF-16 to OEM character set (when possible, of course).

Or similar functions could be supplied in D itself, as an alternative?

>>Carlos Santander B. suggested another solution, based on Y. Tomino's Win32  headers, that seems to convert characters at run-time. I can't get it to  print anything at the moment, so I can't yet tell if it is better than  what I have got so far.

I have written some basic lookups (i.e. "wchar mapping[256];")
using the tables that are all available on the Unicode site:


ISO Latin-1 (simple!)
http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT

DOS Latin Console
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT

Windows "Latin-1"
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

Mac OS Roman
http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT

(there are a few dozen others, but I think these are the most common?)


But it needs a more thought-through API to be really useful...
And some optimization to do the reverse lookup, I suppose ?

I'm thinking one array of char[256], and one char[] of exceptions.
(where 0x00-0xFF would use the lookup, and 0x0100-0xFFFF the hash)
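One way the table idea could look, as a rough sketch in present-day D: a 256-entry forward table, plus a reverse map that uses a direct array for 0x00-0xFF and an associative array for everything above. Only a handful of CP437 entries from CP437.TXT are filled in here (a full table has 128 non-ASCII entries), and the names buildForward, ReverseMap and buildReverse are invented for illustration.

```d
// Forward table: codepage byte -> Unicode code unit (0 = not filled in here).
wchar[256] buildForward()
{
    wchar[256] table = '\0';
    foreach (i; 0 .. 0x80)
        table[i] = cast(wchar) i;   // ASCII maps to itself
    // A few CP437 entries; the remaining ones are left out of this sketch.
    table[0x84] = '\u00E4'; // ä
    table[0x94] = '\u00F6'; // ö
    table[0x81] = '\u00FC'; // ü
    table[0xE1] = '\u00DF'; // ß
    table[0x8E] = '\u00C4'; // Ä
    table[0x99] = '\u00D6'; // Ö
    table[0x9A] = '\u00DC'; // Ü
    table[0xC3] = '\u251C'; // ├ (box drawing, outside 0x00-0xFF)
    return table;
}

// Reverse lookup: array for 0x00-0xFF, hash for 0x0100-0xFFFF.
struct ReverseMap
{
    ubyte[256] low;     // 0 means "no mapping" (except for code point 0)
    ubyte[wchar] high;

    ubyte encode(wchar c, ubyte fallback = cast(ubyte) '?') const
    {
        if (c < 0x100)
        {
            immutable b = low[c];
            return (b != 0 || c == 0) ? b : fallback;
        }
        if (auto p = c in high)
            return *p;
        return fallback;
    }
}

ReverseMap buildReverse(const wchar[256] forward)
{
    ReverseMap r;
    foreach (i, c; forward)
    {
        if (i != 0 && c == 0)
            continue;               // entry not filled in
        if (c < 0x100)
            r.low[c] = cast(ubyte) i;
        else
            r.high[c] = cast(ubyte) i;
    }
    return r;
}

void main()
{
    auto reverse = buildReverse(buildForward());
    ubyte[] oem;
    foreach (wchar c; "äöüßÄÖÜ"w)
        oem ~= reverse.encode(c);
    immutable ubyte[] expected = [0x84, 0x94, 0x81, 0xE1, 0x8E, 0x99, 0x9A];
    assert(oem == expected);
}
```

The encoded bytes match the escape sequence posted earlier in the thread, and characters missing from the table fall back to '?'.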

--anders