November 13, 2005
> (*) This character will be one represented by 1 dchar, 2 wchar, or 3+ char code units. (Arcane Jill, were she still frequenting this place, would be able to tell us one in an instant; sadly you have only me, a poor replacement for true expertise)

Well, such a character might, for example, be "\U00100000".  For a nice table comparing the ranges covered by each type of encoding, I suggest:

http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
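
For what it's worth, a minimal sketch (untested; assuming DMD's literal suffixes behave as in the test code elsewhere in this thread) of how many code units that character occupies in each string type:

import std.stdio;

void main()
{
    // U+100000 lies outside the Basic Multilingual Plane, so its
    // storage size differs with each UTF encoding:
    char[]  c = "\U00100000"c;  // UTF-8:  4 code units (bytes)
    wchar[] w = "\U00100000"w;  // UTF-16: 2 code units (a surrogate pair)
    dchar[] d = "\U00100000"d;  // UTF-32: 1 code unit

    writefln(c.length, " ", w.length, " ", d.length); // prints: 4 2 1
}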

-[Unknown]
November 13, 2005
Regan Heath wrote:
> On Sat, 12 Nov 2005 17:06:58 +0200, Georg Wrede <georg.wrede@nospam.org>  wrote:
> 
>> Regan Heath wrote:
>>
>>> The habit/trap here is to provide 3 functions but have 2 of them call the 3rd. It's common sense, as it reduces the maintenance nightmare of a bug existing in one implementation and not another. The 3rd function does all the work in one encoding; what this means, however, is that unless the application's internal encoding is of that 3rd type, all calls to that library result in transcoding. I can't really see a good solution to this "transcoding nightmare". Then again, maybe I am seeing a bigger problem than there really is: is transcoding going to be a significant problem (in efficiency terms) for a common application?
>>
>> THIS IS NOT AGAINST YOU, OR ANYBODY ELSE
>>
>> -- I am cross posting this to digitalmars.D
>> -- I suggest follow-ups be written only there
>>
>> To create a common understanding,
> 
> 
> Sounds good to me.
> 

>> I propose a thought experiment. Those who are active or hard-headed may actually carry out the experiment. (Oh, and by Bob, if my reasoning below turns out to be incorrect, please inform us all about it!) :-(
> 
> 
> I have done so. My code changes are not criticism but rather, I hope,
> serve as examples of how it currently works. The full modified code I used
> is posted at the end of this message.
> 
>> Given the below code:
>>
>> void main()
>> {
>>      for(int i=0; i<50; i++)
>>      {
>>          // myprint("Foobar"); // see below about this line
> 
> 
> I suspect you meant:
> 
>>          // _myprint("Foobar"); // see below about this line
> 
> 
> here? (note the _), am I correct?
> 
>>          myprint("Foobar"c);
>>          myprint("Foobar"w);
>>          myprint("Foobar"d);
>>      }
>> }
>>
>> int myrandom(int b, int e)
>> {
>> // return a uniform random integer n with b <= n < e
>> }
>>
>> void myprint( char[] s) {_myprint(char[] s}
>> void myprint(wchar[] s) {_myprint(char[] s}
>> void myprint(dchar[] s) {_myprint(char[] s}
> 
> 
> Were these supposed to read?
> 
> void myprint( char[] s) {_myprint(s); }
> void myprint(wchar[] s) {_myprint(cast(char[])s); }
> void myprint(dchar[] s) {_myprint(cast(char[])s); }
> 
> If so, see below for why this does not work, instead you require:
> 
> void myprint( char[] s) {_myprint(s); }
> void myprint(wchar[] s) {_myprint(toUTF8(s)); }
> void myprint(dchar[] s) {_myprint(toUTF8(s)); }
> 
>> void _myprint(char[]s)
>> {
>>      char[] cs;
>>      wchar[] ws;
>>      dchar[] ds;
>>
>>      switch ( myrandom(0,3) )
>>      {
>>          case 0: cs = s; printfln(cs); break;
>>          case 1: ws = s; printfln(ws); break;
>>          case 2: ds = s; printfln(ds); break;
>>          default: static assert(0);
>>      }
>> }
> 
> 
> These lines:
> 
>>          case 0: cs = s; printfln(cs); break;
>>          case 1: ws = s; printfln(ws); break;
>>          case 2: ds = s; printfln(ds); break;
> 
> 
> cause errors, D does not implicitly transcode or implicitly paint (thank  Bob!) one char type as another.
> 
> These lines:
> 
>>          case 0: cs = s; printfln(cs); break;
>>          case 1: ws = cast(wchar[])s; printfln(ws); break;
>>          case 2: ds = cast(dchar[])s; printfln(ds); break;
> 
> 
> compile but cause an "array cast misalignment" exception, due to the cast 'painting' the data as opposed to transcoding it. These lines:
> 
>>          case 0: cs = s; printfln(cs); break;
>>          case 1: ws = toUTF16(s); printfln(ws); break;
>>          case 2: ds = toUTF32(s); printfln(ds); break;
> 
> 
> will compile and run 'correctly' (as I see it).
> 
>> Question 1: do the printed lines look alike?
> 
> 
> It depends on the contents of the string being printed.
> It also depends on the format the output device is expecting.
> 
> Assuming:
> 1. the text is ASCII
> 2. the output device expects ASCII
> 
> then all 3 will be identical (the test code shows this with the Windows console default)
> 
> However, if the text contains characters outside of ASCII and the output device expects, say, UTF-8, then UTF-16 and UTF-32 strings will come out as garbage in some cases.
> 
> Take the Windows console for example: it expects Latin-1 on my PC, so
> printing non-ASCII UTF-8, UTF-16, or UTF-32 code units (the individual
> bytes or words that make up each character) can cause garbage to be
> displayed for certain output. Setting it to Unicode or UTF-8 will cause
> UTF-8 to be displayed correctly, but can cause garbage for certain
> UTF-16 or UTF-32 output.
> 
>> Question 2: if one has a text editor that can produce a source code
>>  file in both UTF-8, UTF-16, UTF-32, and USASCII formats, and then
>> "saves as" this source code to all of those -- would the program
>> outputs look alike?
> 
> 
> Saving the source file as a different encoding will have no effect on
>  the output assuming you save as one of the valid encodings (as per D
>  spec):
>   ASCII
>   UTF-8
>   UTF-16BE
>   UTF-16LE
>   UTF-32BE
>   UTF-32LE
> 
> D takes the literal as stored in whatever source encoding and implicitly transcodes it at compile time to the required type, which is then stored as that type in the binary (I believe), unless there is a collision (the original complaint that started this thread).
> 
>> ----
>> 
>> Supposing the reader has answered "yes" to both of the above, then a further question:
> 
> 
> I haven't, but I believe I can continue anyway...
> 
>> Question 3: in that case, if we uncomment line 5 above, AND do the exercise in question 2, would the program outputs look alike?
> 
> 
> Uncommenting that line causes a compile error, adding the "_" works as I believe you intended.
> 
> The result is again identical lines. However (see the answer to question #1, as it applies here as well), outputting UTF-16 or UTF-32 to a UTF-8 console can cause garbage for certain output.
> 
>> Question 4: would you say that it actually makes _no_ difference how the compiler chooses to consider an undecorated string literal?
>> 
> 
> 
> No.
> 
> For the reasons I have posted earlier, namely:
> 
> #1 If you have 2 functions of the same name, doing different things,
> and one takes char[] and the other wchar[] then allowing the compiler
>  to choose an encoding can result in different behaviour depending on
>  the function chosen.
> 
> #2 Depending on the encoding the program uses internally (say it uses
> char[] everywhere), if the compiler chooses wchar[], then calling a
> wchar[] function returning a wchar[] (for example) means the result
> will need to be transcoded back to char[] for use with the rest of the
> program code.
> 
> Ignoring problem #1, in cases where the function takes a wchar literal and does not return a char, it makes no difference, because the compiler does the transcoding at compile time and stores the literal as wchar data in the binary. (However, duplicate literals may end up encoded in other encodings, due to the functions they are passed to taking different types; in these cases you'll get a binary containing the same literal encoded in different ways.)
> 
> In other words, ignoring #1, the encoding is important for efficiency
> reasons. The question is: how inefficient is it? Does it make any
> real difference?
> 
>> Question 5: would you say that we could let the compiler vendors decide how they choose to interpret the undecorated string literal?
>> 
> 
> 
> No. I believe it should be the programmers choice. For reasons above.
> 
> 
>> ----
>> 
>> Supposing that the reader still has answered "yes" to all questions, we go further:
>> 
>> Question 6: would it now seem obvious that undecorated string literals should default to _something_, and that when the programmer writes an undecorated string literal, he is thereby expressly indicating that he does not care, AND that he specifically does not want to argue with the compiler about it?
> 
> 
> No. For reasons above.
> 
> To actually see the problem I describe in #1:
> 
> - Set the Windows console to UTF-8 (I forget how, sorry)
> - Find a character that is encoded differently (as different code unit
>   sequences) in all 3 UTF encodings (*)
> - Type/paste the character into the source
> - Save the source as any of the valid UTF encodings (ASCII will not
>   work, as that character does not exist in ASCII)
> - Run the test again.
> 
> What you should see (assuming it is done correctly and that I am correct in my reasoning) is that one of them, the UTF-8 one, will look correct and the other 2 will look wrong.
> 
> (*) This character will be one represented by 1 dchar, 2 wchar, or 3+
> char code units. (Arcane Jill, were she still frequenting this
> place, would be able to tell us one in an instant; sadly you have
> only me, a poor replacement for true expertise)
> 
> Regan
> ----test code----
> import std.random;
> import std.stdio;
> import std.utf;
> 
> void main()
> {
>     for(int i=0; i<50; i++)
>     {
>         //_myprint("Foobar"); // see below about this line
>         myprint("Foobar"c);
>         myprint("Foobar"w);
>         myprint("Foobar"d);
>     }
> }
> 
> 
> int myrandom(int b, int e)
> {
>     // return a uniform random integer n with b <= n < e
>     return rand()%(e-b) + b;
> }
> 
> void myprint( char[] s) {_myprint(s); }
> void myprint(wchar[] s) {_myprint(toUTF8(s)); }
> void myprint(dchar[] s) {_myprint(toUTF8(s)); }
> 
> 
> void _myprint(char[]s)
> {
>     char[] cs;
>     wchar[] ws;
>     dchar[] ds;
> 
>     switch ( myrandom(0,3) )
>     {
>         case 0: cs = s; writefln(cs); break;
>         case 1: ws = toUTF16(s); writefln(ws); break;
>         case 2: ds = toUTF32(s); writefln(ds); break;
>         default: assert(0);
>     }
> }


Ok.

Did my own homework. Took the random stuff out, because it seemed to distract rather than power-drive my point. :-) Then I inserted an Arabic and a Russian character directly into the source, and made the same characters with \u codes.

Then I saved the file as UTF-7, UTF-8, UTF-16, UCS-2, UCS-4. I also tried to save it as Western-ISO-8859-15, which the editor aborted halfway.

(I'm on Fedora 4, where the default character set is UTF-8.)

Then I tried to compile each of the files. UTF-7 produced a C-like slew of errors, and upon looking at the file with "less" I found out it was full of extra crap. (The file itself was ok, but UTF-7 is not for us.)

The others compiled fine, although each of the binaries was subtly different.

The output from each was exactly alike, and right!!
Walter is the Man!

-----

Below is the source as text -- but I got paranoid, so I replaced the "foreign" characters with "Z". The actual (UTF-8) version is compressed and attached to this post.

I also took the UTF-8 version to Windows and looked at it with WordPad, but the special characters looked destroyed. And I don't have DMD on that Windows machine (yet).

Here's the code:

import std.stdio;
import std.utf;

void main()
{
	// myprint("Foobar"); // see below
	myprint("Foobar"c);
	myprint("Foobar"w);
	myprint("Foobar"d);

     writefln("   U+FEF8 'arabic ligature lam with alef with hamza above
final form'");
     writefln("   should look like a mirrored hand written K with crap
below at left");
     writefln("Direct inserted with text editor");
	myprint("FEF8:Z"c); // direct insert
	myprint("FEF8:Z"d); // direct insert
	myprint("FEF8:Z"w); // direct insert
     writefln("Inserted as backslash-and-hex");
	myprint("With uXXXX code: \uFEF8"c);
	myprint("With uXXXX code: \uFEF8"d);
	myprint("With uXXXX code: \uFEF8"w);

     writefln("   U+0429 'cyrillic capital letter shcha");
     writefln("   should look like three vertical bars joined at bottom
with");
     writefln("   a horizontal bar having a small hanging crap at right
end");
     writefln("Direct inserted with text editor");
	myprint("0429:Z"c);
	myprint("0429:Z"w);
	myprint("0429:Z"d);
     writefln("Inserted as backslash-and-hex");
	myprint("With uXXXX code: \u0429"c);
	myprint("With uXXXX code: \u0429"d);
	myprint("With uXXXX code: \u0429"w);
}

void myprint( char[] s) {_myprint( toUTF8(s) );}
void myprint(wchar[] s) {_myprint( toUTF8(s) );}
void myprint(dchar[] s) {_myprint( toUTF8(s) );}

void _myprint(char[] s)
{
	char[] cs;
	wchar[] ws;
	dchar[] ds;

	cs = toUTF8(s) ; writefln(cs);
	ws = toUTF16(s); writefln(ws);
	ds = toUTF32(s); writefln(ds);
}



November 13, 2005

Georg Wrede wrote:
> Ok.
> 
> Did my own homework. 

> 
> The output from each was exactly alike, and right!!

So, what did we learn?

1. DMD handles this perfectly.

2. It does not matter which width (char, wchar, or dchar) the string literal originally has, whether from the source file's encoding or from decoration.

3. The \u notation is robust, but obviously awkward.

4. In theory, source files in UTF should be copyable between systems without deterioration; in practice it doesn't work. If nothing else gets in the way, the target system does not necessarily have suitable fonts to show the files correctly.

5. We are years from being able to reliably write anything but USASCII in source code (whether within quotes or otherwise) -- however all kinds of OS vendors purport otherwise.

6. There is _no_ reason for not having a default encoding for undecorated string literals in source code.

7. Which encoding this default is does not matter for program functionality.

8. The only thing for which it matters is performance.

9. Even this performance difference is minimal. (Here we are talking only about string literals. Of course the performance difference is big for Real Work string processing, but here we are only talking about Undecorated string literals.)

10. When the programmer doesn't explicitly specify, the compiler should be free to choose what width an undecorated string literal is.

---

Some explaining:

For any non-trivial program, the time spent on decoding string _literals_ is small compared to doing real work, or mangling and manipulating strings.

String literals are commonly used for giving prompts to the user, or for details in file or printed output. These activities are inherently slow anyhow.

The character width (char, wchar, dchar) with which the majority of string handling is done depends only on the operating system. If, for example, one operating system customarily uses 16-bit characters, then there is no reason whatever for not using this width throughout the program.

As to Unicode, a string is either Unicode or not. Period. What this "width thing" means, is just a storage attribute. Therefore, the contents of the string do not change however you transfer it from one width to the other. (Of course the width varies, as well as the bit pattern, but the contents do not change.)
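
A minimal sketch of what I mean (assuming the same std.utf functions used in the test code; untested as written here):

import std.utf;

void main()
{
    char[] original = "caf\u00e9 \u0429"c;  // some non-ASCII text, stored as UTF-8
    wchar[] w = toUTF16(original);          // same text, 16-bit code units
    dchar[] d = toUTF32(original);          // same text, 32-bit code units

    // Converting back gives exactly the same UTF-8 sequence,
    // because only the storage width changed, not the contents:
    assert(toUTF8(w) == original);
    assert(toUTF8(d) == original);
}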

---

According to D documents, unless a file is in ASCII, it has to be in _some_ UTF format.

In ASCII you cannot insert Unicode literal characters in strings. (You can use the \u notation -- but now we are not talking about that, or ASCII files either.)

FROM THIS IT FOLLOWS that a literal string (whether decorated or not!) is already taken to be in UTF! More specifically, it is taken to be in the particular character width that the source file is in -- even before any possible decoration is encountered!

---

Sheesh, I feel like I'm banging my head against everybody else's head. Ok, let's try another tack:

I just ran all the test programs that were compiled from the different sorts of UTF source. Then I saved the output of each, and checked the exact file type of them.

Turns out they _all_ were in UTF-8 format.

Now, how perverted a person should I be if I implicitly assumed that an undecorated string literal on this machine is in UTF-8 ?

Think about it -- one of the lines in the program looks like this:

	ds = toUTF32(s); writefln(ds);

and the output still turns out to be in UTF-8.

---

Now I'm getting angry here.

I typed text from the keyboard directly to a file:

$ cat > mytext.xxx

and then checked with the unix "file" command -- even that file was UTF-8.

Then I used the same "file" command to check all of the here discussed source files, and got surprising results:

unitest2.d:       UTF-8
unitest2UCS2.d:   data
unitest2UCS4.d:   data
unitest2UTF16.d:  MPEG ADTS, audio file
unitest2UTF7.d:   ASCII program

So, anything else but UTF-8 is asking for trouble on this machine.

And interestingly, DMD did compile each of these, except the UTF-7 file.

_Now_ am I unreasonable with the default?

---

If anybody still opposes string literal width defaults, then I'm going to ask that undecorated string literals are removed from the language.

Bobdamn!
November 13, 2005
Georg Wrede wrote:
> 
> 6. There is _no_ reason for not having a default encoding for undecorated string literals in source code.

This seems vaguely consistent with integer promotion, as this is unambiguous:

void fn(short x) {}
void fn(int x) {}
void fn(long x) {}

The int version will be selected by default.  And since int has a specific size in D (unlike C), this is really not substantially different from having a default encoding for unqualified string literals.
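
To spell the analogy out (just a sketch):

void fn(short x) {}
void fn(int x)   {}
void fn(long x)  {}

void main()
{
    fn(5);  // unambiguous: the literal 5 has type int, so fn(int) is chosen.
            // The suggestion is that an undecorated "..." literal would
            // likewise resolve to one width (say char[]) by default.
}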


Sean
November 13, 2005
On Sun, 13 Nov 2005 14:21:35 -0800, Sean Kelly <sean@f4.ca> wrote:
> Georg Wrede wrote:
>>  6. There is _no_ reason for not having a default encoding for undecorated string literals in source code.
>
> This seems vaguely consistent with integer promotion, as this is unambiguous:
>
> void fn(short x) {}
> void fn(int x) {}
> void fn(long x) {}
>
> The int version will be selected by default.  And since int has a specific size in D (unlike C), this is really not substantially different from having a default encoding for unqualified string literals.

It's not quite the same as short/int/long: you can lose data converting from long to int, and from int to short, but you can't lose data converting from dchar to wchar or from wchar to char.
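
A quick sketch of the difference (assuming std.utf; untested):

import std.utf;

void main()
{
    // Narrowing an integer can silently drop information:
    long big   = 0x100000000;    // 2^32, representable only as a long
    int  small = cast(int) big;  // becomes 0; the high bits are gone
    assert(small != big);

    // Converting between UTF widths cannot drop information:
    dchar[] d = "\U00100000"d;   // a single code point above the BMP
    wchar[] w = toUTF16(d);      // stored as a surrogate pair...
    assert(toUTF32(w) == d);     // ...and converts back exactly
}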

Regan
November 13, 2005
On Mon, 14 Nov 2005 00:07:30 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
> 6. There is _no_ reason for not having a default encoding for undecorated string literals in source code.

What if you have:

void bob(char[] a)  { printf("1\n"); }
void bob(wchar[] a) { printf("2\n"); }
void bob(dchar[] a) { printf("3\n"); }

void main()
{
	bob("test");
}

In other words, with 2 or 3 functions of the same name which do _different_ things, the compiler cannot correctly choose which function to call, right?

This can only occur if the functions are in the same module; if they are in different modules you get collision errors requiring 'alias' to resolve, so really it should never happen. But even if it does happen (assuming the compiler picks one silently) it could be a very hard bug to find.
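
Roughly like this (a sketch only; 'liba' and 'libb' are made-up module names):

// --- liba.d ---
// void bob(char[] a)  { printf("1\n"); }

// --- libb.d ---
// void bob(wchar[] a) { printf("2\n"); }

// --- main module ---
import liba;
import libb;

// Without these aliases any call to bob() is rejected as a conflict
// between liba.bob and libb.bob; with them, the two overload sets are
// merged and normal overload resolution applies (including the
// undecorated-literal ambiguity discussed above):
alias liba.bob bob;
alias libb.bob bob;

void main()
{
    bob("test"c);
}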

> 8. The only thing for which it matters is performance.
>
> 9. Even this performance difference is minimal. (Here we are talking only about string literals. Of course the performance difference is big for Real Work string processing, but here we are only talking about Undecorated string literals.)

What if the literal is used in the data processing, i.e. inserted into or searched for within a large block of text in another encoding? What if the literal is thus transcoded millions of times in the normal operation of the program? I don't think you can discount performance so easily.
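
For example (a sketch only; 'containsWord' and 'documents' are made-up names), if the program works in wchar[] internally and the literal defaults to char[], you would at least want to transcode it once rather than on every use:

import std.utf;

bool containsWord(wchar[] haystack, wchar[] needle)
{
    for (int i = 0; i + needle.length <= haystack.length; i++)
        if (haystack[i .. i + needle.length] == needle)
            return true;
    return false;
}

void search(wchar[][] documents)
{
    // Transcode the literal once, outside the loop, instead of paying
    // for a char[] -> wchar[] conversion on every iteration:
    wchar[] needle = toUTF16("Foobar"c);

    foreach (wchar[] doc; documents)
    {
        if (containsWord(doc, needle))
        {
            // ... process the match ...
        }
    }
}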

> 10. When the programmer doesn't explicitly specify, the compiler should be free to choose what width an undecorated string literal is.

Unless it is affected by cases like #6 and/or #9.

> As to Unicode, a string is either Unicode or not. Period. What this "width thing" means, is just a storage attribute. Therefore, the contents of the string do not change however you transfer it from one width to the other. (Of course the width varies, as well as the bit pattern, but the contents do not change.)

This is a key concept people _must_ understand before talking about Unicode issues. The 3 types can all represent the same data; it is just represented in different ways. It's not like short/int/long, where a long can represent values the other two cannot.

> Sheesh, I feel like I'm banging my head against everybody else's head. Ok, let's try another tack:
>
> I just ran all the test programs that were compiled from the different sorts of UTF source. Then I saved the output of each, and checked the exact file type of them.
>
> Turns out they _all_ were in UTF-8 format.

That is as it should be: assuming the program was intending to output UTF-8, the source file encoding should never have any effect on the program output.

> Now, how perverted a person should I be if I implicitly assumed that an undecorated string literal on this machine is in UTF-8 ?
>
> Think about it -- one of the lines in the program looks like this:
>
> 	ds = toUTF32(s); writefln(ds);
>
> and the output still turns out to be in UTF-8.

This is what was confusing me. I would have expected the line above to print in UTF-32. The only explanation I can think of is that the output stream is converting to UTF-8. In fact, I find it quite likely.

If you want to write a UTF-32 file you can do so by using "write" as opposed to "writeString" on a dchar array, I believe.
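
Or, as a sketch of another route (assuming std.file.write, which takes a raw void[] buffer; no BOM is written):

import std.file;
import std.utf;

void main()
{
    dchar[] ds = toUTF32("Foobar \u0429"c);

    // writefln() transcodes for the console; dumping the raw code
    // units instead produces an actual UTF-32 (native byte order) file:
    std.file.write("out.utf32", cast(void[]) ds);
}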

Regan
November 13, 2005
On Sun, 13 Nov 2005 22:12:36 +0200, Georg Wrede wrote:

[snip]

> Then I saved the file as UTF-7, UTF-8, UTF-16, UCS-2, UCS-4.

I've just fixed the Build utility to read UTF-8, UTF-16le/be, UTF-32le/be encodings, but is there any reason I should support UTF-7? It seems a bit superfluous, and under-supported elsewhere too.

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
14/11/2005 9:45:22 AM
November 13, 2005
Derek Parnell wrote:
> On Sun, 13 Nov 2005 22:12:36 +0200, Georg Wrede wrote:
> 
> [snip]
> 
>>Then I saved the file as UTF-7, UTF-8, UTF-16, UCS-2, UCS-4. 
> 
> I've just fixed the Build utility to read UTF-8, UTF-16le/be, UTF-32le/be
> encodings, but is there any reason I should support UTF-7? It seems a bit
> superfluous, and under-supported elsewhere too.

No.

It's a relic.

Those who may need it, have other worries than choosing between languages (so D is not an option for those guys).
November 13, 2005
Regan Heath wrote:
> On Sun, 13 Nov 2005 14:21:35 -0800, Sean Kelly <sean@f4.ca> wrote:
> 
>> Georg Wrede wrote:
>>
>>>  6. There is _no_ reason for not having a default encoding for  undecorated string literals in source code.
>>
>> This seems vaguely consistent with integer promotion, as this is  unambiguous:
>>
>> void fn(short x) {}
>> void fn(int x) {}
>> void fn(long x) {}
>>
>> The int version will be selected by default.  And since int has a  specific size in D (unlike C), this is really not substantially  different from having a default encoding for unqualified string literals.
> 
> It's not quite the same as short/int/long: you can lose data converting from long to int, and from int to short, but you can't lose data converting from dchar to wchar or from wchar to char.

Ehh, I'd have thought that the implication goes the other way?

That is, if integers are handled like this (with the risk of losing content when converting the wrong way), then I don't see how that can be taken as an argument against doing the same with UTF, especially when UTF-to-UTF conversions _don't_ lose content!
November 13, 2005
Sean Kelly wrote:
> Georg Wrede wrote:
>> 
>> 6. There is _no_ reason for not having a default encoding for undecorated string literals in source code.
> 
> This seems vaguely consistent with integer promotion, as this is unambiguous:
> 
> void fn(short x) {}
> void fn(int x) {}
> void fn(long x) {}
> 
> The int version will be selected by default.  And since int has a specific size in D (unlike C), this is really not substantially different from having a default encoding for unqualified string
> literals.

Right!