July 14, 2004
In article <cd2ksg$fng$1@digitaldaemon.com>, Roberto Mariottini says...

>This leads to some questions:

>How can it detect the right coding?

UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE are very easy to tell apart, either with or without a BOM (a BOM, or byte order mark, is a special prefix).

It cannot, however, distinguish the above from any OTHER encoding.
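
A minimal sketch of how such detection might go, written for a current D compiler rather than the DMD of this thread. The function name guessUtf is made up for illustration, and the no-BOM branch assumes the first character of the file is ASCII, which is what makes the zero bytes give the game away:

```d
import std.stdio : writeln;

// Hypothetical helper (not part of Phobos or DMD): guess the UTF flavour
// of a buffer from its first bytes.
string guessUtf(const(ubyte)[] b)
{
    // With a BOM. The 4-byte marks are tested first because the
    // UTF-32LE mark FF FE 00 00 begins with the UTF-16LE mark FF FE.
    if (b.length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return "UTF-32BE";
    if (b.length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return "UTF-32LE";
    if (b.length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8";
    if (b.length >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16BE";
    if (b.length >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16LE";

    // Without a BOM, assuming the first character is ASCII: the position
    // of the zero bytes reveals both the width and the byte order.
    if (b.length >= 4 && b[0] == 0 && b[1] == 0 && b[2] == 0 && b[3] != 0)
        return "UTF-32BE";
    if (b.length >= 4 && b[0] != 0 && b[1] == 0 && b[2] == 0 && b[3] == 0)
        return "UTF-32LE";
    if (b.length >= 2 && b[0] == 0 && b[1] != 0)
        return "UTF-16BE";
    if (b.length >= 2 && b[0] != 0 && b[1] == 0)
        return "UTF-16LE";
    return "UTF-8"; // also covers plain ASCII
}

void main()
{
    ubyte[] sample = [0x69, 0x00, 0x6E, 0x00]; // "in" as UTF-16LE, no BOM
    writeln(guessUtf(sample));
}
```

Note that a windows-1252 file falls straight through to the final "UTF-8" guess, which is exactly the limitation described above.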


>Does endianness matter?

With the UTF family, no. As I said, they are easy to tell apart.


>And what about my current default codepage (windows-1252)?

D is designed with a global philosophy, so it will ignore your default codepage, and signal an error if you rely upon it. This is a good thing, because in D (unlike C/C++), the same source file will compile identically on all machines. Consider the following fragment of C++:

#    std::basic_string<wchar_t> s = toUTF16("\x80"); // C++

(assuming the existence of a C++ toUTF16() function). Even in Western Europe and
America, if you run that on Linux (where the default encoding is ISO-8859-1)
you'll end up with s containing U+0080, but if you run it on Windows (where the
default encoding is WINDOWS-1252) you'll end up with s containing U+20AC.
Outside of Western Europe and America, the situation would be decidedly worse.
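
To make the difference concrete, here is a small sketch (current D syntax) of what each platform's default decoder would do with that one byte. The two function names are invented for the example, and fromWindows1252 is deliberately a stub that only knows about 0x80:

```d
import std.stdio : writefln;

// ISO-8859-1 maps every byte to the code point with the same number.
dchar fromLatin1(ubyte b)
{
    return cast(dchar) b;
}

// WINDOWS-1252 differs from ISO-8859-1 only in the range 0x80..0x9F.
// Illustrative stub: it handles just the one byte discussed here and
// is not a complete decoder.
dchar fromWindows1252(ubyte b)
{
    if (b == 0x80) return '\u20AC'; // EURO SIGN
    return cast(dchar) b;
}

void main()
{
    writefln("ISO-8859-1:   U+%04X", cast(uint) fromLatin1(0x80));
    writefln("WINDOWS-1252: U+%04X", cast(uint) fromWindows1252(0x80));
}
```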

D, on the other hand, will produce a consistent binary for the same source, no matter where you live or what your encoding is. In other words, the short answer to your question:

>And what about my current default codepage (windows-1252)?

is, if you're using D, forget it.


>If I pass an HTML as source, does it honor the encoding specified in the header?

No. It can't, because DMD doesn't come armed with hundreds of different decoders.


Arcane Jill


July 14, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cd2lsj$hsu$1@digitaldaemon.com...
> Me, I use TextPad. TextPad is not fully Unicode-aware (yet), but it CAN save
> files in UTF-8 format, which is all I need.

Actually, I'd like to ask:
how many people currently use 'Unicode editors' for their project's
sources on a regular basis, rather than just occasionally? It seems to me
that it happens very rarely (if ever), and that it's not a strict rule in
companies or projects. So I ask myself: why is that, if Unicode is so wonderful?


July 14, 2004
In article <cd2vp0$164f$1@digitaldaemon.com>, Blandger says...
>
>"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cd2lsj$hsu$1@digitaldaemon.com...
>> Me, I use TextPad. TextPad is not fully Unicode-aware (yet), but it CAN save
>> files in UTF-8 format, which is all I need.
>
>Actually, I'd like to ask:
>how many people currently use 'Unicode editors' for their project's
>sources on a regular basis, rather than just occasionally? It seems to me
>that it happens very rarely (if ever), and that it's not a strict rule in
>companies or projects. So I ask myself: why is that, if Unicode is so wonderful?


And I say again, almost ALL text editors these days can save in UTF. In fact, I'm not even sure I can name one that doesn't.

On that basis, then, the probable answer is almost everyone (although they may not consciously be aware of it).

Arcane Jill


July 14, 2004
In article <cd2p4m$p0c$1@digitaldaemon.com>, Arcane Jill says...
>
>In article <cd2ksg$fng$1@digitaldaemon.com>, Roberto Mariottini says...
>
>>This leads to some questions:
>
>>How can it detect the right coding?
>
>UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE are very easy to tell apart, either with or without a BOM (a BOM is a special prefix).
>
>It cannot, however, distinguish the above from any OTHER encoding.
>
>
>>Does endianness matter?
>
>With the UTF family, no. As I said, they are easy to tell apart.
>
>
>>And what about my current default codepage (windows-1252)?
>
[...]
>is, if you're using D, forget it.

Thanks for the answer.
I should have RTFM before asking, though.
http://www.digitalmars.com/d/lex.html states that D supports only ASCII
and UTF-*; if there is no BOM at the beginning, UTF-8 is assumed (so ASCII
is safe too).
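
A quick way to see why ASCII is safe while windows-1252 is not, sketched for a current D compiler. validate and UTFException come from std.utf in today's Phobos; isValidUtf8 is a made-up helper name:

```d
import std.stdio : writeln;
import std.utf : UTFException, validate;

// Hypothetical helper: true if the bytes form well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        // validate throws UTFException on the first malformed sequence.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    ubyte[] ascii  = [0x50, 0x65, 0x72];                         // "Per", plain ASCII
    ubyte[] cp1252 = [0x50, 0x65, 0x72, 0x63, 0x68, 0xE9];       // "Perché" as windows-1252
    ubyte[] utf8   = [0x50, 0x65, 0x72, 0x63, 0x68, 0xC3, 0xA9]; // "Perché" as UTF-8

    writeln("ASCII:        ", isValidUtf8(ascii));
    writeln("windows-1252: ", isValidUtf8(cp1252));
    writeln("UTF-8:        ", isValidUtf8(utf8));
}
```

Every ASCII byte is below 0x80 and so is a complete UTF-8 sequence on its own, whereas the lone 0xE9 announces a three-byte sequence that never arrives.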

>>If I pass an HTML as source, does it honor the encoding specified in the header?
>
>No. It can't, because DMD doesn't come armed with hundreds of different decoders.

Well, do you know of any translator from 1252 to UTF-8?

Ciao


July 14, 2004
In article <cd372c$1i77$1@digitaldaemon.com>, Roberto Mariottini says...

>Well, do you know any translator from 1252 to UTF-8?

How about I just make one up right now:

#    import std.utf;    // for toUTF8
#
#    char[] windows1252ToUTF8(ubyte[] s)
#    {
#        dchar[] t = new dchar[s.length];    // dchar[], so the dchar results fit without narrowing
#        for (uint i=0; i<s.length; ++i)
#        {
#            t[i] = windows1252ToUnicode(s[i]);
#        }
#        return toUTF8(t);
#    }
#
#    dchar windows1252ToUnicode(ubyte c)
#    {
#        if (c < 0x80 || c > 0x9F) return cast(dchar) c;
#        switch (c)
#        {
#        case 0x80: return '\u20AC'; //EURO SIGN
#        case 0x82: return '\u201A'; //SINGLE LOW-9 QUOTATION MARK
#        case 0x83: return '\u0192'; //LATIN SMALL LETTER F WITH HOOK
#        case 0x84: return '\u201E'; //DOUBLE LOW-9 QUOTATION MARK
#        case 0x85: return '\u2026'; //HORIZONTAL ELLIPSIS
#        case 0x86: return '\u2020'; //DAGGER
#        case 0x87: return '\u2021'; //DOUBLE DAGGER
#        case 0x88: return '\u02C6'; //MODIFIER LETTER CIRCUMFLEX ACCENT
#        case 0x89: return '\u2030'; //PER MILLE SIGN
#        case 0x8A: return '\u0160'; //LATIN CAPITAL LETTER S WITH CARON
#        case 0x8B: return '\u2039'; //SINGLE LEFT-POINTING ANGLE QUOTATION MARK
#        case 0x8C: return '\u0152'; //LATIN CAPITAL LIGATURE OE
#        case 0x8E: return '\u017D'; //LATIN CAPITAL LETTER Z WITH CARON
#        case 0x91: return '\u2018'; //LEFT SINGLE QUOTATION MARK
#        case 0x92: return '\u2019'; //RIGHT SINGLE QUOTATION MARK
#        case 0x93: return '\u201C'; //LEFT DOUBLE QUOTATION MARK
#        case 0x94: return '\u201D'; //RIGHT DOUBLE QUOTATION MARK
#        case 0x95: return '\u2022'; //BULLET
#        case 0x96: return '\u2013'; //EN DASH
#        case 0x97: return '\u2014'; //EM DASH
#        case 0x98: return '\u02DC'; //SMALL TILDE
#        case 0x99: return '\u2122'; //TRADE MARK SIGN
#        case 0x9A: return '\u0161'; //LATIN SMALL LETTER S WITH CARON
#        case 0x9B: return '\u203A'; //SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
#        case 0x9C: return '\u0153'; //LATIN SMALL LIGATURE OE
#        case 0x9E: return '\u017E'; //LATIN SMALL LETTER Z WITH CARON
#        case 0x9F: return '\u0178'; //LATIN CAPITAL LETTER Y WITH DIAERESIS
#        default: throw new Exception("Invalid character in WINDOWS-1252");
#        }
#    }

Arcane Jill


July 14, 2004
I've played with this a little, but I can't seem to find a suitable solution. Attached is Jill's code, modified into a filter program.

My test program is this:

import std.c.stdio;
import std.utf;

int main(char[][] args)
{
    int perché;

    printf("Perché\n");

    return 0;
}

Obviously, if I compile it in its original encoding (Windows 1252) I get an
error:

test.d(6): invalid UTF-8 sequence
test.d(6): invalid UTF-8 sequence
test.d(6): unsupported char 0xe9

So I translated it to UTF-8, using:
w2u.exe test.d > test2.d

The newly encoded file compiles without errors, but the printf output is scrambled
by the conversion: two characters are printed instead of the special one. The
filter translates the special character into a two-byte UTF-8 sequence, and
printf doesn't recognize UTF-8 encoded strings.
So I changed it to use wprintf:

wprintf(toUTF16("Perché\n"));

And I get the following error:

test2.d(8): function toUTF16 overloads wchar[](char[]s) and wchar[](dchar[]s) both match argument list for toUTF16

There is an ambiguity (why?), so I applied an explicit cast on line 8:

wprintf(toUTF16(cast(char[])"Perché\n"));

Now the program compiles fine, but the character printed is still not correct:
wprintf correctly converts UTF-16 characters to CP 1252 (ANSI), but the Command
Prompt window uses the old DOS CP 850 (!!!).

What can I do now?
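
One possibility, sketched here for a current D compiler rather than the DMD used above: switch the console itself to UTF-8 and print the UTF-8 string directly. SetConsoleOutputCP is a Win32 call and 65001 is the Windows code page number for UTF-8; the helper name switchConsoleToUtf8 is made up, and whether the é actually displays may still depend on the console font.

```d
import std.stdio : writeln;

version (Windows) import core.sys.windows.wincon : SetConsoleOutputCP;

// Hypothetical helper: ask the console to interpret program output as UTF-8.
bool switchConsoleToUtf8()
{
    version (Windows)
    {
        // 65001 is the Windows code page identifier for UTF-8.
        return SetConsoleOutputCP(65001) != 0;
    }
    else
    {
        // Assumption: terminals on other systems already expect UTF-8.
        return true;
    }
}

void main()
{
    if (!switchConsoleToUtf8())
        writeln("could not change the console code page");
    writeln("Perché");
}
```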

Ciao

