September 07, 2004
utf-32 text
Somebody enlighten (sp?) me, please.
AFAIU, this code:

//////////////////////////////////
import std.file;
import std.utf;

void main ()
{
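   // UTF-8 source text of the program to generate (despite the name u32)
   char [] u32 = `import std.stdio; void main() { writefln("adiós"); }`;
   // UTF-32LE byte-order mark: FF FE 00 00
   void [] txt = cast (void[]) "\xFF\xFE\x00\x00";
   // toUTF32 yields dchars in native byte order (little-endian on x86)
   write("test32.d", txt ~ cast(void[]) toUTF32(u32));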
}

//////////////////////////////////

Should produce a valid D program:
import std.stdio; void main() { writefln("adiós"); }

In fact, DMD accepts it. Now my questions:
1. How do I edit the created file (test32.d)? I tried a number of different
editors and not even one of them could display the text correctly. Notepad shows
something like " i m p o r t ..." and that's the general case (NULL before every
letter). Why is that? SciTe thought it was UTF16-LE, and probably the rest of
them too.
2. UTF32 is always 4 bytes per character, right? Then why did the resulting
program output this? "adi??s" (2 bytes for "ó", 1 for the rest). Further testing
showed it was the output as if it was UTF8. Did I miss something in the process?
(FWIW, the original file was saved as UTF8 and UTF16-BE).
3. I tried to use the other BOM (00 00 FE FF) for testing and the results were
exactly the same. Do BE or LE matter at all? However I could do this:
"\u0000\uFEFF", but not this: "\uFFFE\u0000" ("invalid UTF character
\U0000fffe"). Why is that? Is that the correct way to use \u?
4. If I save a file as, say, UTF8 and then assign a string literal to a dchar
[], does DMD convert it automatically or does it produce an invalid string?

Take it for what it is: just ignorance.

-----------------------
Carlos Santander Bernal
September 07, 2004
Re: utf-32 text
In article <chja8o$20d3$1@digitaldaemon.com>, Carlos Santander B. says...
>
>Somebody enlighten (sp?) me, please.
>AFAIU, this code:
>
>//////////////////////////////////
>import std.file;
>import std.utf;
>
>void main ()
>{
>    char [] u32 = `import std.stdio; void main() { writefln("adiós"); }`;
>    void [] txt = cast (void[]) "\xFF\xFE\x00\x00";
>    write("test32.d", txt ~ cast(void[]) toUTF32(u32));
>}
>
>//////////////////////////////////
>
>Should produce a valid D program:

It does.


>In fact, DMD accepts it.

As it should.


>Now my questions:
>1. How do I edit the created file (test32.d)? I tried a number of different
>editors and not even one of them could display the text correctly.

That's because most Windows text editors don't grok UTF-32. You can blame this
on Microsoft. Microsoft incorrectly lists the following encodings:
* ANSI                     SHOULD BE: WINDOWS-1252 (NOT an ANSI standard)
* Unicode                  SHOULD BE: UTF-16LE
* Unicode (big endian)     SHOULD BE: UTF-16BE

and most Windows text editors follow suit.



>Notepad shows
>something like " i m p o r t ..." and that's the general case (NULL before every
>letter). Why is that? SciTe thought it was UTF16-LE, and probably the rest of
>them too.

You will have to ask individual text editor vendors that.
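
One plausible cause, sketched below under assumptions: the UTF-32LE mark FF FE 00 00 begins with the UTF-16LE mark FF FE, so a detector that only knows the 2-byte marks will report UTF-16LE. The names detectBom, detectBomNaive and hasPrefix are hypothetical, the code targets a present-day D2 compiler rather than the 2004 DMD, and no claim is made that any particular editor works this way.

```d
import std.stdio : writeln;

// Byte-order marks. The 4-byte UTF-32 marks must be tested first.
immutable ubyte[] bomUtf32LE = [0xFF, 0xFE, 0x00, 0x00];
immutable ubyte[] bomUtf32BE = [0x00, 0x00, 0xFE, 0xFF];
immutable ubyte[] bomUtf8    = [0xEF, 0xBB, 0xBF];
immutable ubyte[] bomUtf16LE = [0xFF, 0xFE];
immutable ubyte[] bomUtf16BE = [0xFE, 0xFF];

/// True if data begins with the bytes in mark.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] mark)
{
    return data.length >= mark.length && data[0 .. mark.length] == mark;
}

/// Hypothetical detector that checks the longest marks first.
/// Note: FF FE 00 00 could also be UTF-16LE text starting with U+0000,
/// so even this order is a heuristic.
string detectBom(const(ubyte)[] data)
{
    if (hasPrefix(data, bomUtf32LE)) return "UTF-32LE";
    if (hasPrefix(data, bomUtf32BE)) return "UTF-32BE";
    if (hasPrefix(data, bomUtf8))    return "UTF-8";
    if (hasPrefix(data, bomUtf16LE)) return "UTF-16LE";
    if (hasPrefix(data, bomUtf16BE)) return "UTF-16BE";
    return "unknown";
}

/// Hypothetical detector that only knows the UTF-16 marks.
string detectBomNaive(const(ubyte)[] data)
{
    if (hasPrefix(data, bomUtf16LE)) return "UTF-16LE";
    if (hasPrefix(data, bomUtf16BE)) return "UTF-16BE";
    return "unknown";
}

void main()
{
    // First bytes of test32.d: the mark, then 'i' as a little-endian dchar.
    const(ubyte)[] head = [0xFF, 0xFE, 0x00, 0x00, 0x69, 0x00, 0x00, 0x00];
    writeln(detectBom(head));      // UTF-32LE
    writeln(detectBomNaive(head)); // UTF-16LE
}
```

Read as UTF-16LE, every 4-byte character becomes a letter followed by U+0000, which matches the " i m p o r t ..." display described in the question.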

One editor which gets it /right/ is SC Unipad (www.unipad.org). Unfortunately
this is hideously expensive.




>2. UTF32 is always 4 bytes per character, right? Then why did the resulting
>program output this? "adi??s" (2 bytes for "ó", 1 for the rest). Further testing
>showed it was the output as if it was UTF8. Did I miss something in the process?
>(FWIW, the original file was saved as UTF8 and UTF16-BE).

You didn't miss anything. Blame it on the text editor.


>3. I tried to use the other BOM (00 00 FE FF) for testing and the results were
>exactly the same. Do BE or LE matter at all?

To Unicode, yes. To an application which doesn't understand it, no.
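
To make the difference concrete, here is a small sketch of how one codepoint is laid out in the two UTF-32 byte orders. It assumes a present-day D2 compiler and Phobos (std.bitmanip); the helper names utf32le and utf32be are made up for this example.

```d
import std.bitmanip : nativeToBigEndian, nativeToLittleEndian;
import std.stdio : writefln;

/// Bytes of one codepoint as UTF-32LE (hypothetical helper).
ubyte[4] utf32le(dchar c) { return nativeToLittleEndian(cast(uint) c); }

/// Bytes of one codepoint as UTF-32BE (hypothetical helper).
ubyte[4] utf32be(dchar c) { return nativeToBigEndian(cast(uint) c); }

void main()
{
    // U+FEFF gives the two byte-order marks quoted in the question.
    immutable ubyte[4] expectLE = [0xFF, 0xFE, 0x00, 0x00];
    immutable ubyte[4] expectBE = [0x00, 0x00, 0xFE, 0xFF];
    assert(utf32le('\uFEFF') == expectLE);
    assert(utf32be('\uFEFF') == expectBE);

    // Same codepoint, opposite byte order.
    auto le = utf32le('\u00F3');
    auto be = utf32be('\u00F3');
    writefln("LE: %(%02X %)", le[]);
    writefln("BE: %(%02X %)", be[]);
}
```

A file whose mark says BE but whose text was written in native little-endian order would be mislabeled, which only shows up in a reader that actually honors the mark.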


>However I could do this:
>"\u0000\uFEFF", but not this: "\uFFFE\u0000" ("invalid UTF character
>\U0000fffe"). Why is that? Is that the correct way to use \u?

\u is used to denote a Unicode codepoint, and nothing else. It should /not/ be
used to inject bytes into a byte array. The actual bytes inserted will depend on
the encoding of the character literal -- normally UTF-8 in D, although there are
arguments that D should be more flexible in this regard.
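
A short sketch of that distinction, assuming a present-day D2 compiler (where string, wstring and dstring replace the char[], wchar[] and dchar[] literals of 2004). The helper utf8Bytes is hypothetical.

```d
import std.string : representation;

/// Raw UTF-8 code units of a string (hypothetical helper).
immutable(ubyte)[] utf8Bytes(string s) { return s.representation; }

void main()
{
    // The same \u escape yields a different number of code units
    // depending on the encoding of the literal.
    string  s8  = "\u00F3";
    wstring s16 = "\u00F3";
    dstring s32 = "\u00F3";
    assert(s8.length == 2);   // two UTF-8 code units
    assert(s16.length == 1);  // one UTF-16 code unit
    assert(s32.length == 1);  // one UTF-32 code unit

    // In UTF-8, U+00F3 is the byte pair C3 B3, not the byte F3.
    immutable ubyte[] expected = [0xC3, 0xB3];
    assert(utf8Bytes(s8) == expected);

    // \x escapes, by contrast, insert exactly the bytes written.
    string raw = "\xFF\xFE\x00\x00";
    assert(raw.length == 4);
}
```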

The phrase "invalid UTF character" is meaningless, since there is no such thing
as a "UTF character". However, U+FFFE is a noncharacter codepoint, and it is
indeed invalid to find such a codepoint in a conformant Unicode string (which of
course is precisely why U+FEFF was chosen as the byte-order-mark: read with the
wrong byte order it shows up as U+FFFE, which cannot occur in valid text).
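
The relationship between the two codepoints can be sketched as follows. Both functions are hypothetical helpers written for this example; the noncharacter ranges are the ones the Unicode Standard defines (U+FDD0..U+FDEF, plus the last two codepoints of every plane).

```d
/// Swap the two bytes of a 16-bit value (hypothetical helper).
ushort swapBytes(ushort v)
{
    return cast(ushort)((v << 8) | (v >> 8));
}

/// True for Unicode noncharacters (hypothetical helper).
bool isNoncharacter(dchar c)
{
    return (c >= 0xFDD0 && c <= 0xFDEF)
        || (c <= 0x10FFFF && (c & 0xFFFE) == 0xFFFE);
}

void main()
{
    // The byte-order mark read with the wrong endianness...
    assert(swapBytes(0xFEFF) == 0xFFFE);
    // ...is a noncharacter, so it cannot be mistaken for real text.
    assert(isNoncharacter(cast(dchar) 0xFFFE));
    assert(!isNoncharacter(cast(dchar) 0xFEFF));
}
```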



>4. If I save a file as, say, UTF8 and then assign a string literal to a dchar
>[], does DMD convert it automatically or does it produce an invalid string?

Current DMD behavior is:
*) COMPILE-TIME constants are converted.
*) Values known only at RUN-TIME are not.

Again, plenty of us believe that this is not the best way for DMD to behave, and
that implicit conversion should happen always, just as it does from short to
int, because such conversions generate zero loss of information.
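
As a rough illustration of the compile-time versus run-time split, here is a sketch that assumes a present-day D2 compiler and Phobos, which may differ in detail from the 2004 DMD described above. The helper name widen is hypothetical; toUTF32 is the std.utf function used in the original post.

```d
import std.utf : toUTF32;

/// Explicit run-time conversion from UTF-8 to UTF-32 (hypothetical helper).
dstring widen(string s) { return toUTF32(s); }

void main()
{
    // A literal is converted by the compiler to the target encoding.
    dstring fromLiteral = "adi\u00F3s";
    assert(fromLiteral.length == 5);  // five codepoints

    // The same text held as UTF-8 has six code units.
    string utf8 = "adi\u00F3s";
    assert(utf8.length == 6);

    // A run-time value is not converted implicitly...
    static assert(!__traits(compiles, { string s; dstring d = s; }));

    // ...so the conversion has to be requested explicitly.
    assert(widen(utf8) == fromLiteral);
}
```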


Arcane Jill
September 07, 2004
Re: utf-32 text
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message
news:chjpcj$284s$1@digitaldaemon.com
|
| ...
|
| Arcane Jill

Thanks

-----------------------
Carlos Santander Bernal
September 09, 2004
Re: utf-32 text
Arcane Jill wrote:

> Again, plenty of us believe that this is not the best way for DMD to behave, and
> that implicit conversion should happen always, just as it does from short to
> int, because such conversions generate zero loss of information.

That sounds like a beautiful knockdown argument:

short-->int does not lose information, so it happens implicitly.

dchar-->char does not lose information, so it should happen implicitly.

+1 for implicit conversions between char, wchar and dchar.

James McComb
September 09, 2004
Re: utf-32 text
In article <chognu$1hnj$1@digitaldaemon.com>, James McComb says...

>That sounds like a beautiful knockdown argument:
>
>short-->int does not lose information, so it happens implicitly.

Yes. That's what happens now, and it's perfectly sensible.

>dchar-->char does not lose information, so it should happen implicitly.

Huh? I think you may be a little confused there. dchar-->char is not lossless. 

(And for that matter, char-->dchar, IMO, should either require an explicit cast
or throw a UTF exception if the char value is >= 0x80, i.e. not ASCII).


>+1 for implicit conversions between char, wchar and dchar.

Lossless conversion is possible between char[], wchar[] and dchar[] - but /not/
between char, wchar and dchar. Please be aware of the difference.
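
The difference can be sketched like this, assuming a present-day D2 compiler and Phobos (std.utf). The helper utf8Length is hypothetical.

```d
import std.utf : encode, toUTF8, toUTF32;

/// Number of UTF-8 code units needed for one codepoint (hypothetical helper).
size_t utf8Length(dchar c)
{
    char[4] buf;
    return encode(buf, c);
}

void main()
{
    // Arrays: converting to UTF-32 and back returns the original text.
    string s = "adi\u00F3s";
    assert(toUTF8(toUTF32(s)) == s);

    // Single values: U+00F3 needs two chars in UTF-8, so it cannot
    // be stored in one char.
    assert(utf8Length('\u00F3') == 2);

    // Forcing it into one char keeps only the low byte, 0xF3, which
    // on its own is not a valid UTF-8 sequence.
    dchar wide = '\u00F3';
    char narrow = cast(char) wide;
    assert(narrow == 0xF3);
}
```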

Arcane Jill