November 19, 2004
Ben Hinkle wrote:
<snip>
> Are you sure your command window is set to use UTF-8? On Windows I think you
> change it by going to the "Regional Settings" control panel.

In Windows 98, a command prompt is still a plain old MS-DOS window.  As such, it can't possibly use UTF-8, as this would break the essential one-to-one mapping between bytes and on-screen character positions.
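
To make the byte-versus-character point concrete, here is a small C sketch (the function name `utf8_code_points` is made up for illustration, and it assumes the input is valid UTF-8). It counts characters by skipping UTF-8 continuation bytes, which shows that a string's byte length and its character count diverge as soon as an umlaut appears.

```c
#include <stddef.h>

/* Count Unicode code points in a NUL-terminated UTF-8 string.
 * Continuation bytes have the bit pattern 10xxxxxx and are not counted.
 * Assumption: the input is valid UTF-8; no validation is done. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the UTF-8 bytes `"\xC3\xA4\xC3\xB6\xC3\xBC"` (the letters ä, ö, ü), `strlen` reports 6 bytes while `utf8_code_points` reports 3 characters, so a display that draws one cell per byte cannot show the text correctly.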

I don't know how different this really is in Windows 2000/XP....

Stewart.
November 19, 2004
On Fri, 19 Nov 2004 17:03:36 +0100, Ilya Minkov <minkov@cs.tum.edu> wrote:

>> Are you sure your command window is set to use UTF-8? On Windows I think you
>> change it by going to the "Regional Settings" control panel.
>
> That doesn't matter, or rather, I think there is nothing to configure. The problem is that he is using Mozilla for a job it isn't suited to. He should use a programmer's editor that supports UTF-8 instead, for example SciTE. There, also go to File -> Encoding -> UTF-8.

I've just downloaded SciTE and have done what you suggested. I admit that using Mozilla for encoding issues is not very elegant. SciTE doesn't change anything, though. I still get garbage.
By the way, is there a D plugin for SciTE?

> The output will be another problem - either multi-character garbage (C functions) or automatically converted to local codepage (D native Unicode functions)
November 19, 2004
On Fri, 19 Nov 2004 16:02:17 +0000, Stewart Gordon <smjg_1998@yahoo.com> wrote:

> You can include MS-DOS characters in a string, but only as escape codes.   In your case (assuming your code page is 437, 850, 852, 853 or 857):
>
>      puts("\x84\x94\x81\xE1\x8E\x99\x9A");
>
> Since the whole point of this is for outputting to MS-DOS, you could argue that this is appropriate use of non-Unicode characters in a string.

Yep, that works. Maybe this is more portable (source file saved as UTF-8):

import std.c.stdio;

int main()
{
  version(Win32)
    puts("\x84\x94\x81\xE1\x8E\x99\x9A");
  else
    puts("äöüßÄÖÜ");

  return 0;
}

What do you think?!
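
Instead of hard-coding the escape codes per platform, the conversion could also be done at run time. Below is a minimal C sketch of that idea, not a general transcoder: it handles only ASCII plus the seven German letters from this thread, uses the byte values given above for code pages 437/850, and replaces everything else with '?'. The function names are made up for illustration.

```c
#include <stddef.h>

/* Map the second byte of a UTF-8 sequence that starts with 0xC3 to the
 * DOS code page byte quoted earlier in the thread (CP437/850 values). */
static unsigned char german_to_dos(unsigned char second)
{
    switch (second) {
    case 0xA4: return 0x84; /* a-umlaut */
    case 0xB6: return 0x94; /* o-umlaut */
    case 0xBC: return 0x81; /* u-umlaut */
    case 0x9F: return 0xE1; /* sharp s */
    case 0x84: return 0x8E; /* A-umlaut */
    case 0x96: return 0x99; /* O-umlaut */
    case 0x9C: return 0x9A; /* U-umlaut */
    default:   return '?';  /* not covered by this sketch */
    }
}

/* Convert UTF-8 text to DOS bytes. Writes at most out_size - 1 bytes plus
 * a terminating NUL and returns the number of bytes written (without NUL). */
size_t utf8_german_to_dos(const char *in, char *out, size_t out_size)
{
    size_t n = 0;
    if (out_size == 0)
        return 0;
    while (*in != '\0' && n + 1 < out_size) {
        unsigned char c = (unsigned char)*in++;
        if (c < 0x80) {
            out[n++] = (char)c;                 /* plain ASCII */
        } else if (c == 0xC3 && *in != '\0') {
            out[n++] = (char)german_to_dos((unsigned char)*in++);
        } else {
            out[n++] = '?';                     /* unsupported character */
            while (((unsigned char)*in & 0xC0) == 0x80)
                in++;                           /* skip its continuation bytes */
        }
    }
    out[n] = '\0';
    return n;
}
```

Feeding it the UTF-8 bytes of the seven letters yields exactly the bytes `\x84\x94\x81\xE1\x8E\x99\x9A` used in the `puts` call above, so the result can be passed straight to `puts` on a DOS console.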
November 19, 2004
Here is a patch that enables GDC-0.8 and DMD-0.106 to handle UTF-8/16/32 with and without a BOM.

Thomas

--- gdc-0.8/d/dmd/module.c	2004-10-02 19:19:31.000000000 +0200
+++ gdc-0.8d/d/dmd/module.c	2004-11-19 19:19:09.522419400 +0100
@@ -241,6 +241,7 @@
 	 * EF BB BF	UTF-8
 	 */

+	int haveNoBom=0;
 	if (buf[0] == 0xFF && buf[1] == 0xFE)
 	{
 	    if (buflen >= 4 && buf[2] == 0 && buf[3] == 0)
@@ -257,6 +258,7 @@
 		    fatal();
 		}

+		pu-=haveNoBom;
 		dbuf.reserve(buflen / 4);
 		while (++pu < pumax)
 		{   unsigned u;
@@ -292,6 +294,7 @@
 		    fatal();
 		}

+		pu-=haveNoBom;
 		dbuf.reserve(buflen / 2);
 		while (++pu < pumax)
 		{   unsigned u;
@@ -354,6 +357,8 @@
 	     * figure out the encoding.
 	     */

+            haveNoBom=1;
+
 	    if (buflen >= 4)
 	    {   if (buf[1] == 0 && buf[2] == 0 && buf[3] == 0)
 		{   // UTF-32LE


Thomas Kuehne wrote on Fri, 19 Nov 2004 14:19:33 +0000 (UTC):
>>> Let's try to track down the real problem.
>>>
>>> Change the string to "\u00E4\u00F6\u00FC\u00DF" (ae)(oe)(ue)(ss).
>>>
>>> If the output is still garbage try printf instead of puts.
>>
>>I've tested the above string. The result for both puts and printf is that either it doesn't compile or it outputs garbage:
>>
>>MS-DOS/Western (ISO-8859-1), UTF-16, UTF-8
>>compile fine but output garbage under MS-DOS
>>(Windows 98 SE, German edition)
>
> Clearly seems to be a shell problem.
>
>>Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
>>(1) "semicolon expected, not '.'"
>>(1) no identifier for declarator
>
> This is a known problem. If you use UTF-16/32 without a BOM (byte order mark), the current dmd assumes UTF-8 and subsequently fails.
>
> http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_16be
> http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_16le
> http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_32be
> http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_32le
November 19, 2004
Thomas Kuehne wrote on Fri, 19 Nov 2004 19:26:25 +0100:
>>>Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
>>>(1) "semicolon expected, not '.'"
>>>(1) no identifier for declarator
>>
>> This is a known problem. If you use UTF-16/32 without a BOM (byte order mark), the current dmd assumes UTF-8 and subsequently fails.

The real problem was that the compiler skipped the bytes of a BOM even when the file had none, so the first code unit of a BOM-less UTF-16/32 source was thrown away.
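
As a rough illustration of the principle (a standalone C sketch, not the code in module.c; the enum and function names are invented here): detect the BOM first, report how many bytes it occupies, and skip only that many. With no BOM the skip count is 0, so no source text is lost. Unlike dmd, this sketch does not try to guess UTF-16/32 from the content when the BOM is missing; it simply falls back to UTF-8.

```c
#include <stddef.h>

/* Encodings a source file might use; names are illustrative only. */
typedef enum {
    ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE, ENC_UTF32LE, ENC_UTF32BE
} encoding_t;

/* Look for a byte order mark at the start of buf.
 * Returns the encoding it indicates and stores the BOM size in *bom_len.
 * Without a BOM, returns ENC_UTF8 and stores 0 (assumption of this sketch),
 * so the caller never skips real content. */
encoding_t detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    *bom_len = 0;
    /* UTF-32LE must be tested before UTF-16LE: both start with FF FE. */
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0 && buf[3] == 0) {
        *bom_len = 4;
        return ENC_UTF32LE;
    }
    if (len >= 4 && buf[0] == 0 && buf[1] == 0 && buf[2] == 0xFE && buf[3] == 0xFF) {
        *bom_len = 4;
        return ENC_UTF32BE;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return ENC_UTF16LE;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return ENC_UTF16BE;
    }
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return ENC_UTF8;
    }
    return ENC_UTF8; /* no BOM found */
}
```

A caller would then start decoding at `buf + bom_len`, which is the same effect the `haveNoBom` adjustment has in the patch.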

Thomas
November 19, 2004
"Mathias Bierschenk" <Mathias.Bierschenk@web.de> wrote in message news:opshp0d1h29gaiaw@dialin-145-254-035-176.arcor-ip.net...

> By the way, is there a D plugin for SciTE?

You'll find it here: http://www.prowiki.org/wiki4d/wiki.cgi?EditorSupport#SciTE


November 19, 2004
On Fri, 19 Nov 2004 22:08:56 +0100, Valéry Croizier <valery@freesurf.fr> wrote:

>> By the way, is there a D plugin for SciTE?
>
> You'll find it there
> http://www.prowiki.org/wiki4d/wiki.cgi?EditorSupport#SciTE

Thanks!
November 19, 2004
"Mathias Bierschenk" <Mathias.Bierschenk@web.de> wrote in message news:opshpm3zlo9gaiaw@dialin-212-144-051-051.arcor-ip.net...
> How can I print German characters? I've tried the following simple program:
>
> import std.c.stdio;
>
> int main()
> {
>    puts("äöüßÄÖÜ"); // German characters
>
>    return 0;
> }
>
> As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German edition) I tried Mozilla to save the source code file with different character encodings but none worked as expected. Here's what I tried using the current DMD version:
>
> MS-DOS encoding as performed by Microsoft's EDIT editor:

Using Microsoft Notepad, click on "Save As" and under encoding, select
"UTF-8". Then, use std.stdio.writef() instead of std.c.stdio.puts(), and it
should work.


November 19, 2004
"Mathias Bierschenk" <Mathias.Bierschenk@web.de> wrote in message
news:opshpm3zlo9gaiaw@dialin-212-144-051-051.arcor-ip.net...
| How can I print German characters? I've tried the following simple program:
|
| import std.c.stdio;
|
| int main()
| {
|   puts("äöüßÄÖÜ"); // German characters
|
|   return 0;
| }
|
| As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German
| edition) I tried Mozilla to save the source code file with different
| character encodings but none worked as expected. Here's what I tried using
| the current DMD version:
|
| MS-DOS encoding as performed by Microsoft's EDIT editor:
| (5) "invalid UTF-sequence"
|
| Western (ISO-8859-1):
| (5) "invalid UTF-sequence"
|
| Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
| (1) "semicolon expected, not '.'"
| (1) no identifier for declarator
|
| Unicode (UTF-16 and UTF-8):
| both compile fine but output garbage under MS-DOS
| (Windows 98 SE, German edition)

I was investigating the same thing recently. What I really wanted was a Windows
console that did Unicode, but I couldn't find one.
I did come across a C++ program that lets you output UTF-16 strings
(wchar * in C++ on Windows). Translated to D, the program looks like this:

import std.file;
import std.string;
import std.utf;

import win32.winbase;
import win32.wincon;
import win32.winnls;

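void main ()
{
    // toUTF16 returns an array that is not necessarily NUL-terminated,
    // so the character count is taken from the array, not from wcslen.
    wchar [] tmp_w = toUTF16(cast(char[])"carlos andrés");
    wchar *   szwOut = tmp_w;
    DWORD      dwBytesWritten;
    DWORD      fdwMode;
    HANDLE     outHandle = GetStdHandle(STD_OUTPUT_HANDLE);

    // Real console: write the UTF-16 text directly.
    if( (GetFileType(outHandle) & FILE_TYPE_CHAR) && GetConsoleMode( outHandle, &fdwMode) )
        WriteConsoleW( outHandle, szwOut, tmp_w.length, &dwBytesWritten, null);
    else
    {
        // Redirected output: convert to a multibyte string and write its bytes.
        int nOutputCP = GetConsoleOutputCP();
        //int charCount = WideCharToMultiByte(nOutputCP, 0, szwOut, -1, null, 0, null, null);
        //char* szaStr = new char[charCount];
        //WideCharToMultiByte( nOutputCP, 0, szwOut, -1, szaStr, charCount, null, null);
        char [] tmp = toUTF8(tmp_w);
        char * szaStr = toMBSz(tmp);
        // Count the bytes of the converted string; this can differ from
        // tmp.length because the two encodings differ.
        int charCount = 0;
        while (szaStr[charCount] != 0)
            charCount++;
        WriteFile(outHandle, szaStr, charCount, &dwBytesWritten, null);
    }

}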

It uses Y Tomino's Win32 headers. The encoding the source file is saved in
doesn't seem to matter.
I really don't remember where I found the original, so use this code as you
like, since it's not mine.
For Linux, I don't think there's any problem, since it uses UTF-8 by default (at
least with Red Hat based distros, in my experience).
BTW, if someone knows of a Unicode console for Windows, please let me know.

-----------------------
Carlos Santander Bernal



November 19, 2004
Mathias Bierschenk wrote:

> I've just downloaded SciTE and have done what you suggested. I admit that  using Mozilla for encoding issues is not very elegant. SciTE doesn't  change anything, though. I still get garbage.

Ah, I missed that you had already gotten as far as garbage output. :) Well, I'll see what could be wrong. In general, non-NT Windows has not received much attention in the Phobos implementation, because those Windows versions are not very Unicode compatible.

-eye