December 01, 2004
Anders F Björklund wrote:
<snip>
> Most of the consoles mentioned only support old 16-bit Unicode anyway ?
<snip>

MS-DOS, and hence DOS windows in Win9x, only supports 8-bit IBM codepages.

Stewart.

-- 
My e-mail is valid but not my primary mailbox.  Please keep replies on the 'group where everyone may benefit.
December 02, 2004
In article <cokk5u$2dt9$1@digitaldaemon.com>, Anders F Björklund says...
>
>Roberto Mariottini wrote:
>
>> The first writef uses a UTF-8 string, but it doesn't print what is expected. Either one should work, but neither does.
>
>It works just fine, but you *have* to set your console to UTF-8.
>D does *not* support consoles or shells which are not Unicode... :(

Windows XP does *not* support UTF-8 consoles. Neither does Windows NT/2000. So the bug still applies.

I don't think D will go any further if it doesn't support non-English versions of Windows.

Ciao


December 02, 2004
In article <cokqtc$2okj$1@digitaldaemon.com>, Ben Hinkle says...
>
>
[...]
>Now that you mention it, it might be nice to make another
>Stream subclass and add support for the "native" encodings. It sounds fun -
>I'll give it a shot. It should be pretty easy actually since you just
>override writeString and writeStringW to call some OS function to convert
>the string or char from UTF to the native encoding.

The Windows function to accomplish this task is the already-cited CharToOemW().
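
For the record, a minimal wrapper might look like this. This is only a sketch: the extern declaration is hand-translated from the Win32 headers, "toOem" is a hypothetical name, and the buffer size is a safe overestimate (UTF-8 is never shorter, in code units, than UTF-16):

import std.utf;      // toUTF16z
import std.c.string; // strlen

extern (Windows) int CharToOemW(wchar* src, char* dst);

// hypothetical helper: UTF-8 in, current OEM codepage (850, 437, ...) out
char[] toOem(char[] mess)
{
    wchar* wide = toUTF16z(mess);              // UTF-8 -> NUL-terminated UTF-16
    char[] oem = new char[mess.length + 1];    // >= one OEM byte per wide char, + NUL
    CharToOemW(wide, cast(char*) oem);
    return oem[0 .. strlen(cast(char*) oem)];  // trim to the real length
}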

>Supporting arbitrary encodings would probably be left for non-phobos libraries since they would presumably require something like ICU or libiconv. So basically what I have in mind is that to write to stdout with native encoding you'd have to write
>
>import std.stream;
>...
>stdoutn = new NativeTextStream(stdout);
>stdoutn.writef(<some utf encoded string>);

The default stdout should be a NativeTextStream on Windows.

Ciao


December 02, 2004
In article <cokjmk$2d8u$1@digitaldaemon.com>, Anders F Björklund says...
>
[...]
>
>Moral of the story being that 8-bit strings should be declared ubyte[], even if it makes you cast them to a pointer before usage with C routines:
>
>> ubyte[] OEMmess = new ubyte[mess.length + 1];            // room for the NUL
>> CharToOemW(std.utf.toUTF16z(mess), cast(LPSTR) OEMmess); // wide source
>> puts(cast(char *) OEMmess);
>
>The "char" type in C, is known as "byte" in D. Confusingly enough. Like Ben says, the D char type only accepts valid UTF-8 code units...
>
>PS. No, it doesn't help that the C routines are declared as (char *)
>     when they really take (ubyte *) arguments. It's just as a shortcut
>     to avoid having to translate the C function declarations to D...

Sorry, I don't understand.
Are you proposing to change every C function prototype that uses "char*" to
"ubyte*"?
I agree that this would make it clear that D's char[] is different from C's char*.
But it's a lot of work.

>     And of course, it also works just fine for ASCII-only strings.
>     (a char[] can be directly converted to char *, iff it is ASCII)
>     With non-US-ASCII characters, it doesn't work - as you've seen.

With non-US-ASCII characters that are within the currently selected 8-bit OEM codepage, it works. The problem is that UTF-8 doesn't get correctly translated to IBM-850 (or 437, or ...) on Windows.
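
Concretely (a tiny untested sketch; the byte values are from the Unicode and CP850 tables):

import std.stdio;

void main()
{
    char[] a = "\u00e5";  // U+00E5 "å" is two bytes in UTF-8: 0xC3 0xA5
    writefln(a.length);   // prints 2; a CP850 console renders those raw
                          // bytes as two wrong glyphs, while a correctly
                          // transcoded "å" is the single CP850 byte 0x86
}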

Ciao


December 02, 2004
Roberto Mariottini wrote:

>>PS. No, it doesn't help that the C routines are declared as (char *)
>>    when they really take (ubyte *) arguments. It's just as a shortcut
>>    to avoid having to translate the C function declarations to D...
> 
> Sorry, I don't understand.

I was being somewhat vague, sorry.

> Are you proposing to change any C function prototype that uses "char*" to
> "ubyte*"?
> I agree that this would make clear that D char[] are different from C char*.
> But it's a lot of work.

That is why it was skipped, but you still need to be aware of the difference or it will cause subtle bugs like the one you encountered...
(actually it is a huge pain, as soon as you leave the old ASCII strings)

Anyway, if you stick non-UTF-8 strings in char[] variables you are
setting yourself up for "invalid UTF-8 sequence". So ubyte[] is better ?
They both convert to C's (char *) in the usual way (with a NUL added)
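
For instance, with Phobos' std.string.toStringz, which does exactly that (a minimal sketch):

import std.string;

void main()
{
    char[] s = "hall\u00e5 v\u00e4rlden!";  // stored as UTF-8 in memory
    char* p = toStringz(s);                 // same bytes, with a NUL appended
}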

char[] and wchar[] should be enough for any strings internal to D;
you should only need to mess with 8-bit encodings for input/output...
(and then it should preferably all be handled by a library routine)

>>    And of course, it also works just fine for ASCII-only strings.
>>    (a char[] can be directly converted to char *, iff it is ASCII)
>>    With non-US-ASCII characters, it doesn't work - as you've seen.
> 
> With non-US-ASCII, but within the currently selected 8-bit OEM codepage, it
> works. The problem is that UTF-8 doesn't get correctly translated to IBM-850 (or
> 437, or ...) on Windows.

I meant that you can output ASCII as UTF-8 and it will still work...
(mostly, except if you are stuck in EBCDIC or some other weird place)

>   writefln("hello world!"); // English, works about everywhere US-ASCII

But to output to the console on Windows (or other non-Unicode platform),
it needs to be translated to the local "code page" or "charset/encoding"

Like if you want to support characters beyond the 96 or so that are
in the ASCII subset, for instance if you live in Italy or Sweden.

>   writefln("hall\u00e5 v\u00e4rlden!"); // Swedish, only works in UTF-8

And there are currently no functions in D to do that, as far as I know ?


Same thing applies to console input such as the "char[] args" params...
If you just echo those args on a non-Unicode console, you get errors!
(since then they are not really UTF-8 strings, but cast ubyte[]'s)
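
A minimal way to see it (a sketch, assuming std.utf.validate, which throws on malformed UTF-8):

import std.stdio;
import std.utf;

void main(char[][] args)
{
    foreach (char[] arg; args)
    {
        validate(arg);  // throws here if arg is really raw codepage bytes
        writefln(arg);
    }
}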

--anders
December 02, 2004
On Thu, 02 Dec 2004 11:11:27 +0100, Anders F Björklund <afb@algonet.se> wrote:
> Roberto Mariottini wrote:
>
>>> PS. No, it doesn't help that the C routines are declared as (char *)
>>>    when they really take (ubyte *) arguments. It's just as a shortcut
>>>    to avoid having to translate the C function declarations to D...
>>  Sorry, I don't understand.
>
> I was being somewhat vague, sorry.
>
>> Are you proposing to change any C function prototype that uses "char*" to
>> "ubyte*"?
>> I agree that this would make clear that D char[] are different from C char*.
>> But it's a lot of work.

I think it's a good idea. I reckon it will initially cause people to be confused, e.g. they see:

int strcmp(byte *, byte *)

and think "huh? strcmp takes a char * not a byte *" but then if they look up byte * and/or char * in the D docs they should hopefully realise the difference, that C's char * is really a byte * and D's char[] is UTF encoded.

Oh yeah, correct me if I'm wrong but C's "char*" is really a "byte*", not a "ubyte*", as C's chars are typically signed.

> That is why it was skipped, but you still need to be aware of the difference or it will cause subtle bugs like the one you encountered...
> (actually is a huge pain, as soon as you leave the old ascii strings)
>
> Anyway, if you stick non-UTF-8 strings in char[] variables you are
> setting yourself up for "invalid UTF-8 sequence". So ubyte[] is better ?
> They both convert to C's (char *) in the usual way (with a NUL added)
>
> char[] and wchar[] should be enough for any strings internal to D,
> you should only need to mess with 8-bit encodings for input/output...
> (and then it should preferrably all be handled by a library routine)

Exactly: all transcoding should be done at the input/output stage (if at all); internally you should use char[], wchar[] or dchar[]. Unless of course you have a good reason not to.

>>>    And of course, it also works just fine for ASCII-only strings.
>>>    (a char[] can be directly converted to char *, iff it is ASCII)
>>>    With non-US-ASCII characters, it doesn't work - as you've seen.
>>  With non-US-ASCII, but within the currently selected 8-bit OEM codepage, it
>> works. The problem is that UTF-8 doesn't get correctly translated to IBM-850 (or
>> 437, or ...) on Windows.
>
> I meant that you can output ASCII as UTF-8 and it will still work...
> (mostly, except if you are stuck in EBCDIC or some other weird place)
>
>>   writefln("hello world!"); // English, works about everywhere US-ASCII
>
> But to output to the console on Windows (or other non-Unicode platform),
> it needs to be translated to the local "code page" or "charset/encoding"
>
> Like if you want to support characters beyond the 96 or so that are
> in the ASCII subset, for instance if you live in Italy or Sweden.
>
>>   writefln("hall\u00e5 v\u00e4rlden!"); // Swedish, only works in UTF-8
>
> And there are currently no functions in D to do that, as far as I know ?

No, but you can wrap and use the C (Windows) function CharToOemW.

Someone suggested that the default stdout stream should do this automatically; I think that's a great idea. IIRC Ben was considering giving this a go.
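
Something like this, perhaps. A rough sketch only: the FilterStream base, the class name, and the toOem helper (a hypothetical UTF-8 -> OEM converter, e.g. wrapping CharToOemW as above) are assumptions, not existing Phobos API:

import std.stream;

// hypothetical filter: transcode UTF-8 to the console codepage on output
class NativeTextStream : FilterStream
{
    this(Stream sink) { super(sink); }

    override void writeString(char[] s)
    {
        super.writeString(toOem(s));  // toOem: UTF-8 -> OEM bytes
    }
}

// usage, wrapping once at startup: stdout = new NativeTextStream(stdout);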

> Same thing applies to console input such as the "char[] args" params...
> If you just echo those args on a non-Unicode console, you get errors!
> (since then they are not really UTF-8 strings, but casted ubyte[]'s)

Which strikes me as ridiculous.

Regan
December 03, 2004
Regan Heath wrote:

> and think "huh? strcmp takes a char * not a byte *" but then if they look  up byte * and/or char * in the D docs they should hopefully realise the  difference, that C's char * is really a byte * and D's char[] is UTF  encoded.
> 
> Oh yeah, correct me if I'm wrong but C's "char*" is really a "byte*", not a "ubyte*", as C's chars are typically signed.

D is only concerned about "byte size", so it will remain as char*...

>>>   writefln("hall\u00e5 v\u00e4rlden!"); // Swedish, only works in UTF-8
>>
>> And there is currently no functions in D to do that, as far as I know ?
> 
> No, but you can wrap and use the C (windows) function CharToOemW.

I'm not using Windows, but a modern system with a UTF-8 console ;-)

>> Same thing applies to console input such as the "char[] args" params...
>> If you just echo those args on a non-Unicode console, you get errors!
>> (since then they are not really UTF-8 strings, but casted ubyte[]'s)
> 
> Which strikes me as ridiculous.

Either way, both stdout and stdin need to be "extended" for non-UTF-8 consoles.

--anders
December 03, 2004
"Roberto Mariottini" <Roberto_member@pathlink.com> wrote in message news:comhrd$26ug$1@digitaldaemon.com...
> In article <cokqtc$2okj$1@digitaldaemon.com>, Ben Hinkle says...
> >
> >
> [...]
> >Now that you mention it, it might be nice to make another
> >Stream subclass and add support for the "native" encodings. It sounds fun -
> >I'll give it a shot. It should be pretty easy actually since you just override writeString and writeStringW to call some OS function to convert the string or char from UTF to the native encoding.
>
> The Windows function to accomplish this task is the already-cited CharToOemW().
>
> >Supporting arbitrary encodings would probably be left for non-phobos libraries since they would presumably require something like ICU or libiconv. So basically what I have in mind is that to write to stdout with native encoding you'd have to write
> >
> >import std.stream;
> >...
> >stdoutn = new NativeTextStream(stdout);
> >stdoutn.writef(<some utf encoded string>);
>
> The default stdout should be a NativeTextStream on Windows.
>
> Ciao
>
>

Actually, on second thought I'm getting hesitant to put this kind of thing into std.stream since it is so platform-specific - the Mac's iconv API is mapped to libiconv using C preprocessor macros, so the D code would have to hard-code those symbol names (AFAIK). Also it looks like CharToOemW might not be on all Win95/98/Me systems. Each platform will have to get special code for how to handle this NativeTextStream stuff. It could get pretty messy for fairly small bang-for-buck. I'm leaning towards leaving this to an outside library that can handle arbitrary encodings - like libiconv or Mango's ICU wrapper or something.
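
For reference, the libiconv route would look roughly like this. A hand-written sketch only: the extern declarations are translated from the POSIX C API, the buffer sizing is a generous guess, and error handling (iconv returning (size_t) -1) is elided:

extern (C)
{
    alias void* iconv_t;
    iconv_t iconv_open(char* tocode, char* fromcode);
    size_t iconv(iconv_t cd, char** inbuf, size_t* inbytesleft,
                 char** outbuf, size_t* outbytesleft);
    int iconv_close(iconv_t cd);
}

// hypothetical helper: UTF-8 in, some 8-bit encoding (e.g. "CP850") out
char[] transcode(char[] utf8, char* tocode)
{
    iconv_t cd = iconv_open(tocode, "UTF-8");
    char[] outbuf = new char[utf8.length * 4 + 4];  // generous upper bound
    char* inp = cast(char*) utf8;
    char* outp = cast(char*) outbuf;
    size_t inleft = utf8.length;
    size_t outleft = outbuf.length;
    iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    return outbuf[0 .. outbuf.length - outleft];    // the converted bytes
}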


December 04, 2004
Ben Hinkle wrote:
>  Each platform will have to get special
> code for how to handle this NativeTextStream stuff. It could get pretty
> messy for fairly small bang-for-buck. I'm leaning towards putting in an
> outside library that can handle arbitrary encodings - like libiconv or
> mango's ICU wrapper or something.
> 
> 
I'd like to encourage you to do so. If you take that approach I'll write an adapter for Mango.io, so there are more options for everyone. I'd also like to see a Stream adapter for the ICU converters; perhaps there will be cases where the 200+ ICU transcoders cover areas that iconv does not?