Character encoding problem (page 2)

Ben Hinkle wrote:
<snip>
> Are you sure your command window is set to use UTF-8? On Windows I think you
> change it by going to the "Regional Settings" control panel.

In Windows 98, a command prompt is still a plain old MS-DOS window. As such, it can't possibly use UTF-8, as this would break the essential one-to-one mapping between bytes and on-screen character positions.

I don't know how different this really is in Windows 2000/XP....

Stewart.
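
To illustrate the point about bytes versus character positions, here is a minimal sketch. It assumes a current D2 compiler and a source file saved as UTF-8, neither of which matches the 2004 DMD setup discussed in this thread. It compares the number of bytes in the test string with the number of characters it contains, using `std.utf.count` from Phobos.

```d
import std.stdio : writeln;
import std.utf : count;

void main()
{
    // Assumption: this source file is saved as UTF-8.
    string s = "äöüßÄÖÜ";

    writeln(s.length);  // UTF-8 code units (bytes): 14
    writeln(s.count);   // code points (characters): 7
}
```

Each of the seven letters takes two bytes in UTF-8, so a display that assumes one byte per character cell shows fourteen symbols instead of seven.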

On Fri, 19 Nov 2004 17:03:36 +0100, Ilya Minkov <minkov@cs.tum.edu> wrote:
>> Are you sure your command window is set to use UTF-8? On Windows I think you
>> change it by going to the "Regional Settings" control panel.
>
> That doesn't matter, or rather I think there is nothing to configure. The problem is that he is using Mozilla for something it isn't meant for. He should rather use a programmer's editor which supports UTF-8, for example SciTE. In SciTE, also go to File -> Encoding -> UTF-8.

I've just downloaded SciTE and have done what you suggested. I admit that using Mozilla for encoding issues is not very elegant. SciTE doesn't change anything, though. I still get garbage. By the way, is there a D plugin for SciTE?

> The output will be another problem: either multi-character garbage (C functions) or automatically converted to the local code page (D native Unicode functions).

On Fri, 19 Nov 2004 16:02:17 +0000, Stewart Gordon <smjg_1998@yahoo.com> wrote:
> You can include MS-DOS characters in a string, but only as escape codes. In your case (assuming your code page is 437, 850, 852, 853 or 857):
>
>     puts("\x84\x94\x81\xE1\x8E\x99\x9A");
>
> Since the whole point of this is for outputting to MS-DOS, you could argue that this is appropriate use of non-Unicode characters in a string.

Yep, that works. Maybe this is more portable (source file encoded as UTF-8):

import std.c.stdio;

int main()
{
    version(Win32)
        puts("\x84\x94\x81\xE1\x8E\x99\x9A");
    else
        puts("äöüßÄÖÜ");

    return 0;
}

What do you think?
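
The escape string above can also be produced from readable text instead of being typed by hand. The following is only a sketch under stated assumptions: a current D2 compiler, a UTF-8 source file, and code page 437 or 850 on the console. The helper `toDosGerman` and the two tables are hypothetical names introduced here and are not part of Phobos. Only the seven letters from the thread are mapped.

```d
import std.stdio : writefln;

// The seven letters from the thread and their code page 437/850 bytes,
// in the same order as the escape string in the post above.
immutable dstring letters = "äöüßÄÖÜ"d;
immutable ubyte[] dosBytes = [0x84, 0x94, 0x81, 0xE1, 0x8E, 0x99, 0x9A];

// Hypothetical helper (not a library function): converts UTF-8 text to
// code page bytes for the letters listed above.
ubyte[] toDosGerman(string text)
{
    ubyte[] result;
    foreach (dchar c; text)  // foreach with dchar decodes UTF-8 to code points
    {
        // ASCII keeps its byte value; any other unmapped character becomes '?'.
        ubyte b = c < 0x80 ? cast(ubyte) c : cast(ubyte) '?';
        foreach (i, dchar known; letters)
            if (known == c)
                b = dosBytes[i];
        result ~= b;
    }
    return result;
}

void main()
{
    // Print the bytes in hex so they can be compared with the escapes.
    writefln("%(%02X %)", toDosGerman("äöüßÄÖÜ"));
}
```

A complete solution would need a full table for the code page in use. This sketch only shows why the hand-written escapes work.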

Here is a patch that enables GDC-0.8 and DMD-0.106 to handle UTF-8/16/32 with and without a BOM.

Thomas

--- gdc-0.8/d/dmd/module.c 2004-10-02 19:19:31.000000000 +0200
+++ gdc-0.8d/d/dmd/module.c 2004-11-19 19:19:09.522419400 +0100
@@ -241,6 +241,7 @@
      * EF BB BF UTF-8
      */
 
+    int haveNoBom=0;
     if (buf[0] == 0xFF && buf[1] == 0xFE)
     {
         if (buflen >= 4 && buf[2] == 0 && buf[3] == 0)
@@ -257,6 +258,7 @@
             fatal();
         }
 
+        pu-=haveNoBom;
         dbuf.reserve(buflen / 4);
         while (++pu < pumax)
         { unsigned u;
@@ -292,6 +294,7 @@
             fatal();
         }
 
+        pu-=haveNoBom;
         dbuf.reserve(buflen / 2);
         while (++pu < pumax)
         { unsigned u;
@@ -354,6 +357,8 @@
      * figure out the encoding.
      */
 
+    haveNoBom=1;
+
     if (buflen >= 4)
     { if (buf[1] == 0 && buf[2] == 0 && buf[3] == 0)
         { // UTF-32LE

Thomas Kuehne wrote on Fri, 19 Nov 2004 14:19:33 +0000 (UTC):
>>> Let's try to track down the real problem.
>>>
>>> change the string into "\u00E4\u00F6\u00FC\u00DF" (ae)(oe)(ue)(ss).
>>>
>>> If the output is still garbage try printf instead of puts.
>>
>> I've tested the above string. The result for both puts and printf is that either it doesn't compile or it outputs garbage:
>>
>> MS-DOS/Western (ISO-8859-1), UTF-16, UTF-8
>> compile fine but output garbage under MS-DOS
>> (Windows 98 SE, German edition)
>
> Clearly seems to be a shell problem.
>
>> Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
>> (1) "semicolon expected, not '.'"
>> (1) no identifier for declarator
>
> This is a known problem. If you use UTF-16/32 without a BOM (byte order mark) the current dmd assumes UTF-8 and subsequently fails.
>
> http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_16be
> http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_16le
> http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_32be
> http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_32le

Thomas Kuehne wrote on Fri, 19 Nov 2004 19:26:25 +0100:
>>> Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
>>> (1) "semicolon expected, not '.'"
>>> (1) no identifier for declarator
>>
>> This is a known problem. If you use UTF-16/32 without a BOM (byte order mark) the current dmd assumes UTF-8 and subsequently fails.

The real problem was that it stripped the bytes of a BOM that wasn't there.

Thomas
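
The idea behind the fix can be sketched outside the compiler: report how many bytes the BOM occupies, and report zero when there is none, so the caller never skips real source text. This is not the code from module.c. It is a simplified illustration for a current D2 compiler, and `Encoding`, `Detected` and `detectBom` are hypothetical names. Unlike the patched compiler, it does not try to guess UTF-16/32 when the BOM is missing and simply falls back to UTF-8.

```d
import std.stdio : writeln;

enum Encoding { utf8, utf16le, utf16be, utf32le, utf32be }

struct Detected
{
    Encoding encoding;
    size_t bomLength;  // bytes to skip; 0 when no BOM is present
}

// Hypothetical helper: inspects the first bytes of a source buffer.
Detected detectBom(const(ubyte)[] buf)
{
    // UTF-32LE is tested before UTF-16LE because both start with FF FE.
    if (buf.length >= 4 && buf[0] == 0xFF && buf[1] == 0xFE
        && buf[2] == 0x00 && buf[3] == 0x00)
        return Detected(Encoding.utf32le, 4);
    if (buf.length >= 4 && buf[0] == 0x00 && buf[1] == 0x00
        && buf[2] == 0xFE && buf[3] == 0xFF)
        return Detected(Encoding.utf32be, 4);
    if (buf.length >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return Detected(Encoding.utf16le, 2);
    if (buf.length >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return Detected(Encoding.utf16be, 2);
    if (buf.length >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return Detected(Encoding.utf8, 3);

    // No BOM: fall back to UTF-8 and skip nothing.
    return Detected(Encoding.utf8, 0);
}

void main()
{
    const(ubyte)[] withBom = [0xFF, 0xFE, 0x69, 0x00];  // BOM + "i" in UTF-16LE
    const(ubyte)[] noBom   = [0x69, 0x6D, 0x70];        // "imp" without a BOM

    writeln(detectBom(withBom));
    writeln(detectBom(noBom));
}
```

A caller would then start decoding at `buf[result.bomLength .. $]`, which leaves the first character intact when the file has no BOM.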

"Mathias Bierschenk" <Mathias.Bierschenk@web.de> wrote in message news:opshp0d1h29gaiaw@dialin-145-254-035-176.arcor-ip.net...
> By the way, is there a D plugin for SciTE?

You'll find it here:
http://www.prowiki.org/wiki4d/wiki.cgi?EditorSupport#SciTE

On Fri, 19 Nov 2004 22:08:56 +0100, Valéry Croizier <valery@freesurf.fr> wrote:
>> By the way, is there a D plugin for SciTE?
>
> You'll find it here:
> http://www.prowiki.org/wiki4d/wiki.cgi?EditorSupport#SciTE

Thanks!

"Mathias Bierschenk" <Mathias.Bierschenk@web.de> wrote in message news:opshpm3zlo9gaiaw@dialin-212-144-051-051.arcor-ip.net...
> How can I print German characters? I've tried the following simple program:
>
> import std.c.stdio;
>
> int main()
> {
>     puts("äöüßÄÖÜ"); // German characters
>
>     return 0;
> }
>
> As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German edition) I tried Mozilla to save the source code file with different character encodings but none worked as expected. Here's what I tried using the current DMD version:
>
> MS-DOS encoding as performed by Microsoft's EDIT editor:

Using Microsoft Notepad, click on "Save As" and under encoding, select "UTF-8". Then, use std.stdio.writef() instead of std.c.stdio.puts(), and it should work.

November 19, 2004

Re: Character encoding problem

Posted by Carlos Santander B.
in reply to Mathias Bierschenk

"Mathias Bierschenk" <Mathias.Bierschenk@web.de> wrote in message
news:opshpm3zlo9gaiaw@dialin-212-144-051-051.arcor-ip.net...
| How can I print German characters? I've tried the following simple program:
|
| import std.c.stdio;
|
| int main()
| {
|   puts("äöüßÄÖÜ"); // German characters
|
|   return 0;
| }
|
| As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German
| edition) I tried Mozilla to save the source code file with different
| character encodings but none worked as expected. Here's what I tried using
| the current DMD version:
|
| MS-DOS encoding as performed by Microsoft's EDIT editor:
| (5) "invalid UTF-sequence"
|
| Western (ISO-8859-1):
| (5) "invalid UTF-sequence"
|
| Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
| (1) "semicolon expected, not '.'"
| (1) no identifier for declarator
|
| Unicode (UTF-16 and UTF-8):
| both compile fine but output garbage under MS-DOS
| (Windows 98 SE, German edition)

I was investigating the same thing recently. What I really wanted was a Windows
console that did Unicode, but I couldn't find one.
But I came across a C++ program that lets you output UTF-16 strings
(wchar_t * in C++ on Windows). Translated to D, the program looks like this:

import std.c.string;   // strlen
import std.file;
import std.string;
import std.utf;

import win32.winbase;
import win32.wincon;
import win32.winnls;

void main ()
{
    // UTF-16 copy of the text. D arrays carry their own length and are not
    // guaranteed to be zero-terminated, so the length is passed explicitly
    // instead of calling wcslen().
    wchar []   tmp_w = toUTF16(cast(char[])"carlos andrés");
    wchar *    szwOut = tmp_w.ptr;
    DWORD      dwBytesWritten;
    DWORD      fdwMode;
    HANDLE     outHandle = GetStdHandle(STD_OUTPUT_HANDLE);

    // Output goes to a real console: write the UTF-16 text directly.
    if( (GetFileType(outHandle) & FILE_TYPE_CHAR) && GetConsoleMode( outHandle, &fdwMode) )
        WriteConsoleW( outHandle, szwOut, tmp_w.length, &dwBytesWritten, null);
    else
    {
        // Output is redirected: write a multibyte string instead.
        int nOutputCP = GetConsoleOutputCP();
        //int charCount = WideCharToMultiByte(nOutputCP, 0, szwOut, -1, null, 0, null, null);
        //char* szaStr = new char[charCount];
        //WideCharToMultiByte( nOutputCP, 0, szwOut, -1, szaStr, charCount, null, null);
        char [] tmp = toUTF8(tmp_w);
        char * szaStr = toMBSz(tmp);
        // Length of the converted string, which can differ from tmp.length.
        int charCount = strlen(szaStr);
        WriteFile(outHandle, szaStr, charCount, &dwBytesWritten, null);
    }

}

It uses Y. Tomino's Win32 headers. The encoding in which the source file is
saved doesn't seem to matter.
I really don't remember where I found the original, so you can use this code as
you want since it's not mine.
For Linux, I don't think there's any problem since it uses UTF-8 by default (at
least with Red Hat based distros, in my experience).
BTW, if someone knows about a Unicode console for Windows, please let me know.

-----------------------
Carlos Santander Bernal

Mathias Bierschenk wrote:
> I've just downloaded SciTE and have done what you suggested. I admit that using Mozilla for encoding issues is not very elegant. SciTE doesn't change anything, though. I still get garbage.

Ah, I missed that you had already gotten as far as getting garbage. :) Well, I'll see what could be wrong. In general, non-NT Windows has not been given much consideration in the Phobos implementation, because these Windows versions are not very Unicode compatible.

-eye
