utf-32 text

Sep 07, 2004

Carlos Santander B.

Sep 07, 2004

Arcane Jill

Sep 07, 2004

Carlos Santander B.

Sep 09, 2004

James McComb

Sep 09, 2004

Arcane Jill

Somebody enlighten (sp?) me, please.
AFAIU, this code:

//////////////////////////////////
import std.file;
import std.utf;

void main ()
{
    char [] u32 = `import std.stdio; void main() { writefln("adiós"); }`;
    void [] txt = cast (void[]) "\xFF\xFE\x00\x00";
    write("test32.d", txt ~ cast(void[]) toUTF32(u32));
}

//////////////////////////////////

Should produce a valid D program:

import std.stdio; void main() { writefln("adiós"); }

In fact, DMD accepts it.

Now my questions:

1. How do I edit the created file (test32.d)? I tried a number of different editors and not even one of them could display the text correctly. Notepad shows something like " i m p o r t ..." and that's the general case (NULL before every letter). Why is that? SciTe thought it was UTF16-LE, and probably the rest of them too.

2. UTF32 is always 4 bytes per character, right? Then why did the resulting program output this? "adi??s" (2 bytes for "ó", 1 for the rest). Further testing showed it was the output as if it was UTF8. Did I miss something in the process? (FWIW, the original file was saved as UTF8 and UTF16-BE).

3. I tried to use the other BOM (00 00 FE FF) for testing and the results were exactly the same. Do BE or LE matter at all?. However I could do this: "\u0000\uFEFF", but not this: "\uFFFE\u0000" ("invalid UTF character \U0000fffe"). Why is that? Is that the correct way to use \u?

4. If I save a file as, say, UTF8 and then assign a string literal to a dchar [], does DMD convert it automatically or does it produce an invalid string?

Take for what it is: just ignorance.

-----------------------
Carlos Santander Bernal

September 07, 2004

Re: utf-32 text

Posted by Arcane Jill
in reply to Carlos Santander B.


In article <chja8o$20d3$1@digitaldaemon.com>, Carlos Santander B. says...
>
>Somebody enlighten (sp?) me, please.
>AFAIU, this code:
>
>//////////////////////////////////
>import std.file;
>import std.utf;
>
>void main ()
>{
>    char [] u32 = `import std.stdio; void main() { writefln("adiós"); }`;
>    void [] txt = cast (void[]) "\xFF\xFE\x00\x00";
>    write("test32.d", txt ~ cast(void[]) toUTF32(u32));
>}
>
>//////////////////////////////////
>
>Should produce a valid D program:

It does.


>In fact, DMD accepts it.

As it should.


>Now my questions:
>1. How do I edit the created file (test32.d)? I tried a number of different
>editors and not even one of them could display the text correctly.

That's because most Windows text editors don't grok UTF-32. You can blame this
on Microsoft. Microsoft incorrectly lists the following encodings:
* ANSI                     SHOULD BE: WINDOWS-1252 (NOT an ANSI standard)
* Unicode                  SHOULD BE: UTF-16LE
* Unicode (big endian)     SHOULD BE: UTF-16BE

and most Windows text editors follow suit.



>Notepad shows
>something like " i m p o r t ..." and that's the general case (NULL before every
>letter). Why is that? SciTe thought it was UTF16-LE, and probably the rest of
>them too.

You will have to ask individual text editor vendors that.

One editor which gets it /right/ is SC Unipad (www.unipad.org). Unfortunately this is hideously expensive.




>2. UTF32 is always 4 bytes per character, right? Then why did the resulting program output this? "adi??s" (2 bytes for "ó", 1 for the rest). Further testing showed it was the output as if it was UTF8. Did I miss something in the process? (FWIW, the original file was saved as UTF8 and UTF16-BE).

You didn't miss anything. Blame it on the text editor.
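
For reference, a minimal sketch of how many code units the word occupies in each encoding. It is written for current D2 compilers, which postdate this thread, and the variable names are only illustrative. If the program's output is written as UTF-8 and then shown by something that expects one byte per character (an assumption about the setup, not something established in this thread), the two bytes of "ó" would appear as two separate characters:

```d
import std.string : representation;
import std.utf : toUTF16, toUTF32;

void main()
{
    // "\u00F3" is "ó"; the escape keeps this file independent of its own encoding.
    string utf8 = "adi\u00F3s";
    wstring utf16 = toUTF16(utf8);
    dstring utf32 = toUTF32(utf8);

    // .length counts code units, not characters.
    assert(utf8.length == 6);   // "ó" takes two bytes in UTF-8
    assert(utf16.length == 5);  // one 16-bit unit per character here
    assert(utf32.length == 5);  // always one 32-bit unit per code point

    // The two bytes that encode "ó" in UTF-8.
    immutable ubyte[] oAcute = [0xC3, 0xB3];
    assert(utf8.representation[3 .. 5] == oAcute);
}
```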


>3. I tried to use the other BOM (00 00 FE FF) for testing and the results were exactly the same. Do BE or LE matter at all?.

To Unicode, yes. To an application which doesn't understand it, no.


>However I could do this:
>"\u0000\uFEFF", but not this: "\uFFFE\u0000" ("invalid UTF character
>\U0000fffe"). Why is that? Is that the correct way to use \u?

\u is used to denote a Unicode codepoint, and nothing else. It should /not/ be used to inject bytes into a byte array. The actual bytes inserted will depend on the encoding of the character literal -- normally UTF-8 in D, although there are arguments that D should be more flexible in this regard.

The phrase "invalid UTF character" is meaningless, since there is no such thing as a "UTF character". However, U+FFFE is a noncharacter codepoint, and it is indeed invalid to find such a codepoint in a conformant Unicode string (which of course is precisely why U+FEFF was chosen as the byte-order-mark).
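
A minimal sketch of the difference between naming a code point and writing bytes, in current D2 syntax rather than the 2004 compiler discussed here. The byte values are the standard encodings of U+FEFF; the variable names are illustrative only:

```d
import std.bitmanip : nativeToBigEndian, nativeToLittleEndian;
import std.string : representation;

void main()
{
    // \u names a code point; a UTF-8 string stores U+FEFF as three bytes.
    string s = "\uFEFF";
    immutable ubyte[] utf8Bom = [0xEF, 0xBB, 0xBF];
    assert(s.representation == utf8Bom);

    // The UTF-32 BOM is the same code point written as one 32-bit unit;
    // only the byte order differs between the LE and BE forms.
    uint bom = 0xFEFF;
    immutable ubyte[] le = [0xFF, 0xFE, 0x00, 0x00];
    immutable ubyte[] be = [0x00, 0x00, 0xFE, 0xFF];
    ubyte[4] leBytes = nativeToLittleEndian(bom);
    ubyte[4] beBytes = nativeToBigEndian(bom);
    assert(leBytes[] == le);
    assert(beBytes[] == be);
}
```

To put a BOM at the start of a file, it is therefore safer to write the bytes from a `ubyte[]` than to rely on `\u` escapes inside a string literal.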



>4. If I save a file as, say, UTF8 and then assign a string literal to a dchar [], does DMD convert it automatically or does it produce an invalid string?

Current DMD behavior is:
*) COMPILE-TIME constants are converted.
*) Values known only at RUN-TIME are not.

Again, plenty of us believe that this is not the best way for DMD to behave, and that implicit conversion should happen always, just as it does from short to int, because such conversions generate zero loss of information.
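
A minimal sketch of the two cases, assuming a current D2 compiler (where `string` and `dstring` are the usual aliases); behavior of the 2004 DMD may have differed in detail:

```d
import std.utf : toUTF32;

void main()
{
    // A literal is a compile-time constant, so the compiler emits it as UTF-32.
    dstring fromLiteral = "adi\u00F3s";
    assert(fromLiteral.length == 5);

    // A run-time UTF-8 value is not converted implicitly.
    static assert(!is(string : dstring));

    // It has to be transcoded explicitly instead.
    string utf8 = "adi\u00F3s";
    dstring converted = toUTF32(utf8);
    assert(converted == fromLiteral);
}
```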


Arcane Jill

Arcane Jill wrote:
> Again, plenty of us believe that this is not the best way for DMD to behave, and
> that implicit conversion should happen always, just as it does from short to
> int, because such conversions generate zero loss of information.

That sounds like a beautiful knockdown argument:

short-->int does not lose information, so it happens implicitly.
dchar-->char does not lose information, so it should happen implicitly.

+1 for implicit conversions between char, wchar and dchar.

James McComb

In article <chognu$1hnj$1@digitaldaemon.com>, James McComb says...

>That sounds like a beautiful knockdown argument:
>
>short-->int does not lose information, so it happens implicitly.

Yes. That's what happens now, and it's perfectly sensible.

>dchar-->char does not lose information, so it should happen implicitly.

Huh? I think you may be a little confused there. dchar-->char is not lossless. (And for that matter, char-->dchar, IMO, should either require an explicit cast or throw a UTF exception if char value >0x80).

>+1 for implicit conversions between char, wchar and dchar.

Lossless conversion is possible between char[], wchar[] and dchar[] - but /not/ between char, wchar and dchar. Please be aware of the difference.

Arcane Jill
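
The array-versus-element distinction can be sketched as follows, in current D2 syntax. The sample values are arbitrary and chosen only for illustration:

```d
import std.utf : toUTF32, toUTF8;

void main()
{
    // Array conversion re-encodes the text, so the round trip is lossless.
    string original = "adi\u00F3s \u20AC";
    dstring wide = toUTF32(original);
    assert(toUTF8(wide) == original);

    // A single dchar does not fit in one char: the cast keeps only the low byte.
    dchar euro = '\u20AC';
    char narrowed = cast(char) euro;
    assert(narrowed == 0xAC);
    assert(cast(dchar) narrowed != euro);   // the original code point is gone
}
```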
