November 23, 2005
"Oskar Linde" <oskar.lindeREM@OVEgmail.com> wrote
> Kris wrote:
>
>> There's the additional question as to whether the literal content should be used to imply the type (like literal chars and numerics do), but that's a different topic and probably not as important.
>
> How would the content imply the type? All Unicode strings are
> representable equally by char[], wchar[] and dchar[].
> Do you mean that the most optimal (memory wise) encoding is used?
> Or do you mean that file encoding could imply type?
> (This means that transcoding the source code could change program
> behaviour.)

I meant in terms of implying the "default" type. We're suggesting that default be char[], but if the literal contained unicode chars then the default might be something else. This is the method used to assign type to a char literal, but Walter has noted it might be confusing to do something similar with string literals.
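For illustration only, a rough sketch -- the show() overloads here are just hypothetical probes for whichever type the compiler picks for each literal:

import std.stdio;

void show(char c)  { writefln("char");  }
void show(wchar c) { writefln("wchar"); }
void show(dchar c) { writefln("dchar"); }

void main()
{
    show('a');          // plain ASCII content
    show('ༀ');          // U+0F00: more than one UTF-8 code unit
    show('\U00010000'); // outside the BMP: more than one UTF-16 code unit
}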

As indicated, I don't think this aspect is of much importance compared to the issues associated with an uncommitted literal type ~ that uncommitted aspect needs to be fixed, IMO.


November 23, 2005
On Wed, 23 Nov 2005 13:18:49 -0800, Kris <fu@bar.com> wrote:
> "Oskar Linde" <oskar.lindeREM@OVEgmail.com> wrote
>> Kris wrote:
>>
>>> There's the additional question as to whether the literal content should
>>> be used to imply the type (like literal chars and numerics do), but
>>> that's a different topic and probably not as important.
>>
>> How would the content imply the type? All Unicode strings are
>> representable equally by char[], wchar[] and dchar[].
>> Do you mean that the most optimal (memory wise) encoding is used?
>> Or do you mean that file encoding could imply type?
>> (This means that transcoding the source code could change program
>> behaviour.)
>
> I meant in terms of implying the "default" type. We're suggesting that
> default be char[], but if the literal contained unicode chars then the
> default might be something else. This is the method used to assign type to a
> char literal, but Walter has noted it might be confusing to do something
> similar with string literals.

Did you see my reply to you in this other thread .. wait a minute .. where's it gone .. no wonder you didn't reply, it seems my post was never made. Let me try again:
http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.learn/2216

What do you two think of what I have there?

> As indicated, I don't think this aspect is of much importance compared to
> the issues associated with an uncommitted literal type ~ that uncommitted
> aspect needs to be fixed, IMO.

I agree. While there exists a slight risk in doing so, it's no different to the risk involved with the current handling of integer literals. This change would essentially make string literal handling consistent with integer literal handling.
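To make the analogy concrete, a quick sketch (the print overloads are only for illustration):

import std.stdio;

void print(int x)     { writefln("int");     }
void print(long x)    { writefln("long");    }
void print(char[] s)  { writefln("char[]");  }
void print(wchar[] s) { writefln("wchar[]"); }

void main()
{
    print(1);        // fine: an undecorated integer literal defaults to int
    print(1L);       // the suffix commits it to long
    print("hello"c); // the suffix commits it to char[]
    //print("hello"); // currently ambiguous; with the change it would pick
                      // char[], just as print(1) picks int
}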

Regan
November 23, 2005
Regan Heath wrote:
> On Wed, 23 Nov 2005 21:16:03 +0200, Georg Wrede <georg.wrede@nospam.org>  wrote:
> 
>> A UTF-8 code unit can not be stored in char. Therefore UTF can't be  mentioned at all in this context.
> 
> I suspect we're having terminology issues again.
> 
> Regan

Thanks! Of course!

So "A UTF-8 code point".
November 23, 2005
Bruno Medeiros wrote:
> "a) Use UTF-8. This preserves ASCII, but not Latin-1, because the characters >127 are different from Latin-1. UTF-8 uses the bytes in the ASCII only for ASCII characters. "
> http://www.unicode.org/faq/utf_bom.html
> 
> I've actually only found this today when trying to writefln('ç');

That's odd. I tried this on Linux (dmd .139):

writefln('ༀ');	// -> no error!
char a = 'ༀ';	// -> error!

It seems that DMD allows all Unicode-character values as character literals (IMO this is correct behavior). One problem is that while you cannot assign the om-symbol 'ༀ' to a char (this is also correct - 3840>127), you can do this:

char a = 'ä';	// -> no error!
writefln(a);	// outputs: Error: 4invalid UTF-8 sequence (oops!)
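(The runtime error comes from the output stage: the char ends up holding the raw value 0xE4, which by itself is not a valid UTF-8 sequence. A small sketch that skips the decoding, just to show what is stored -- assuming the literal is stored as its code point value, which is what the behaviour above suggests:)

import std.stdio;

void main()
{
    char a = 'ä';           // compiles; a holds the single code unit 0xE4
    writefln(cast(int) a);  // prints 228: the raw value, no UTF-8 decoding
    //writefln(a);          // would try to decode 0xE4 as a complete UTF-8
                            // sequence and fail at runtime
}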
November 23, 2005
"Regan Heath" <regan@netwin.co.nz> wrote
[snip]
>> As indicated, I don't think this aspect is of much importance compared to the issues associated with an uncommitted literal type ~ that uncommitted aspect needs to be fixed, IMO.
>
> I agree. While there exists a slight risk in doing so, it's no different to the risk involved with the current handling of integer literals. This change would essentially make string literal handling consistent with integer literal handling.

Consistent with auto string-literals too.


November 23, 2005
Regan Heath wrote:
> On Wed, 23 Nov 2005 13:18:49 -0800, Kris <fu@bar.com> wrote:
> 
>> "Oskar Linde" <oskar.lindeREM@OVEgmail.com> wrote
>>
>>> Kris wrote:
>>>
>>>> There's the additional question as to whether the literal content  should
>>>> be used to imply the type (like literal chars and numerics do), but
>>>> that's a different topic and probably not as important.
>>>
>>>
>>> How would the content imply the type? All Unicode strings are
>>> representable equally by char[], wchar[] and dchar[].
>>> Do you mean that the most optimal (memory wise) encoding is used?
>>> Or do you mean that file encoding could imply type?
>>> (This means that transcoding the source code could change program
>>> behaviour.)
>>
>>
>> I meant in terms of implying the "default" type. We're suggesting that
>> default be char[], but if the literal contained unicode chars then the
>> default might be something else. This is the method used to assign type  to a
>> char literal, but Walter has noted it might be confusing to do something
>> similar with string literals.

I think I see what you mean. By unicode char, you actually mean a Unicode character with a code point > 127 (i.e. not representable in ASCII).

> Did you see my reply to you in this other thread .. wait a minute ..  where's it gone .. no wonder you didn't reply, it seems my post was never  made. Let me try again:
> http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.learn/2216
> 
> What do you two think of what I have there?
> 
>> As indicated, I don't think this aspect is of much importance compared to
>> the issues associated with an uncommitted literal type ~ that uncommitted
>> aspect needs to be fixed, IMO.
> 
> 
> I agree. While there exists a slight risk in doing so, it's no different to the risk involved with the current handling of integer literals. This change would essentially make string literal handling consistent with integer literal handling.

Reasoning: The only risk in always defaulting to char[] is efficiency? Like the following scenario:

A user writes string literals, in say Chinese, without a suffix, making those strings char[] by default. Those literals are fed to functions having multiple implementations, optimised for char[], wchar[] and dchar[]. Since the optimal encoding for Chinese is UTF-16 (guessing here), the user's code will be suboptimal.

You are hypothesizing that a heuristic could be used to pick an encoding depending on content. The heuristic you are considering seems to be:
If the string contains characters not representable by a single UTF-16 code unit, make the string dchar[].
Else, if the string contains characters not representable by a single UTF-8 code unit, make the string wchar[].
Else, the string becomes char[].
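Purely as illustration, that heuristic could be sketched like this (a hypothetical helper, not anything that exists in the compiler):

// Pick the narrowest array type in which every character of the
// literal is a single code unit.
char[] inferLiteralType(dchar[] literal)
{
    bool beyondBMP, beyondASCII;
    foreach (dchar c; literal)
    {
        if (c > 0xFFFF)    beyondBMP   = true; // needs two UTF-16 code units
        else if (c > 0x7F) beyondASCII = true; // needs more than one UTF-8 code unit
    }
    if (beyondBMP)   return "dchar[]";
    if (beyondASCII) return "wchar[]";
    return "char[]";
}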

Hmm... By having such a rule, you would also guarantee that:
a) all indexing of string literals give valid unicode characters
b) all slicing of string literals give valid unicode strings
c) counting characters in a Unicode aware editor gives you the correct index

Without such a rule, the following might give surprising results for
"naïve users"[0..5]

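Concretely (a sketch; the lengths and slice results follow from the encodings):

import std.stdio;

void main()
{
    char[] c = "naïve users";
    writefln(c.length);  // 12: 'ï' takes two UTF-8 code units
    writefln(c[0..5]);   // "naïv" -- four characters, not the five one expects

    dchar[] d = "naïve users";
    writefln(d.length);  // 11: one code unit per character
    writefln(d[0..5]);   // "naïve"
}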
a), b) and c) above seem to me to be quite strong arguments. I have almost convinced myself that there might be a good reason to have string literal types depend on content... :)

/Oskar
November 23, 2005
Jari-Matti Mäkelä wrote:
> Bruno Medeiros wrote:
> 
>> "a) Use UTF-8. This preserves ASCII, but not Latin-1, because the characters >127 are different from Latin-1. UTF-8 uses the bytes in the ASCII only for ASCII characters. "
>> http://www.unicode.org/faq/utf_bom.html
>>
>> I've actually only found this today when trying to writefln('ç');
> 
> 
> That's odd. I tried this on Linux (dmd .139):
> 
> writefln('ༀ');    // -> no error!
> char a = 'ༀ';    // -> error!
> 
> It seems that DMD allows all Unicode-character values as character literals (IMO this is correct behavior). One problem is that while you cannot assign the om-symbol 'ༀ' to a char (this is also correct - 3840>127), you can do this:
> 
> char a = 'ä';    // -> no error!
> writefln(a);    // outputs: Error: 4invalid UTF-8 sequence (oops!)

I think Bruno may have been using a non-unicode format like Latin-1 for his sources. 'ç' would then appear as garbage to the compiler.

/Oskar
November 23, 2005
Georg Wrede wrote:
> Regan Heath wrote:
> 
>> On Wed, 23 Nov 2005 21:16:03 +0200, Georg Wrede <georg.wrede@nospam.org>  wrote:
>>
>>> A UTF-8 code unit can not be stored in char. Therefore UTF can't be  mentioned at all in this context.
>>
>>
>> I suspect we're having terminology issues again.
>>
>> Regan
> 
> 
> Thanks! Of course!
> 
> So "A UTF-8 code point".

No... Apparently a code point is a point in the Unicode space, i.e. a character. The guys that came up with that terminology should be shot, though. ;)

http://www.unicode.org/glossary/

There are no UTF-8 code points by terminology. Only

UTF-8 code unit == UTF-8 code value
Unicode code point == Unicode character == Unicode symbol

:)
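A quick way to see the code unit / code point distinction in D (a sketch; the counts follow from the UTF definitions):

import std.stdio;

void main()
{
    char[]  a = "ༀ";    // U+0F00: one code point...
    wchar[] b = "ༀ";
    dchar[] c = "ༀ";
    writefln(a.length); // ...3 UTF-8 code units
    writefln(b.length); // 1 UTF-16 code unit
    writefln(c.length); // 1 UTF-32 code unit == 1 code point
}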

/Oskar
November 23, 2005
Oskar Linde wrote:
> Jari-Matti Mäkelä wrote:
> 
>> Bruno Medeiros wrote:
>>
>>> "a) Use UTF-8. This preserves ASCII, but not Latin-1, because the characters >127 are different from Latin-1. UTF-8 uses the bytes in the ASCII only for ASCII characters. "
>>> http://www.unicode.org/faq/utf_bom.html
>>>
>>> I've actually only found this today when trying to writefln('ç');
>>
>>
>>
>> That's odd. I tried this on Linux (dmd .139):
>>
>> writefln('ༀ');    // -> no error!
>> char a = 'ༀ';    // -> error!
>>
>> It seems that DMD allows all Unicode-character values as character literals (IMO this is correct behavior). One problem is that while you cannot assign the om-symbol 'ༀ' to a char (this is also correct - 3840>127), you can do this:
>>
>> char a = 'ä';    // -> no error!
>> writefln(a);    // outputs: Error: 4invalid UTF-8 sequence (oops!)
> 
> 
> I think Bruno may have been using a non-unicode format like Latin-1 for his sources. 'ç' would then appear as garbage to the compiler.

Yes, that might be true if he's using Windows. On Linux the compiler input must be valid Unicode. Maybe it's hard to create Unicode-compliant programs on Windows since even Sun Java has some problems with Unicode class names on Windows XP.

--

My point here was that since char[] is a fully valid UTF-8 string and the index/slice operators are intelligent enough to work on the Unicode-symbol level [code 1], we should be able to store Unicode symbols in a char variable as well. You see, a UTF-8 symbol requires 8-32 bits of storage space, so it would be perfectly possible to implement a UTF-8 symbol (char) using a standard 32-bit integer or 1-4 x 8-bit bytes.

The current implementation seems to always use 1 x 8-bit byte for the char type and n x 8-bit bytes for char[] strings. The strings work well, but it's impossible to store a single UTF-8 character now.

[code 1]:

  char[] a = "∇∆∈∋";
  writefln(a[2]);  //outputs: ∈ (a stupid implementation would output 0x87)
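For now, the closest workaround I can see is to keep the single character in a dchar and transcode when a char[] is needed. A sketch, assuming I have std.utf's toUTF8 overloads right:

import std.stdio;
import std.utf;

void main()
{
    dchar c = '∈';          // a dchar can hold any single Unicode character
    writefln(c);

    dchar[] tmp;
    tmp ~= c;
    char[] s = toUTF8(tmp); // the multi-byte UTF-8 form, for when char[] is needed
    writefln(s.length);     // 3 code units for this one character
}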
November 23, 2005
Oskar Linde wrote:
> Georg Wrede wrote:
>> Oskar Linde wrote:
>>> Georg Wrede wrote:
>>>
>>>> We've wasted truckloads of ink lately on the UTF issue.
>>>>
>>>> The prayer:
>>>>
>>>> Please remove the *char*, *wchar* and *dchar* basic data types from the documentation!
>>>
>>> What at least should be changed is the suggestion that the C char is replaceable by the D char, and a slight change in definition:
>>>
>>> char: UTF-8 code unit
>>> wchar: UTF-16 code unit
>>> dchar: UTF-32 code unit and/or unicode character
>>
>> A UTF-8 code unit can not be stored in char. [...]
> 
> Can it not? I thought I had been doing this all the time... Why?
> (I know most Unicode code points aka characters can not be stored in a char)

Like Regan pointed out, I meant code point.

>> [...] Therefore UTF can't be
>> mentioned at all in this context.
> 
>> By the same token, and because of symmetry, the wchar and dchar things should vanish. Disappear. Get nuked.
> 
> A dchar can represent any Unicode character. Is that not enough reason to keep it?

No.

What we need is to stop this entire issue from stealing bandwidth between the ears of every soul who reads the D documents.

>> The language manual (as opposed to the Phobos docs), should not once use the word UTF anywhere. Except possibly stating that a conformant compiler has to accept source code files that are stored in UTF.
> 
> Why? It is often important to know the encoding of a string. (When dealing with system calls, c-library calls etc.) Doesn't D do the right thing to define Unicode as the character set and UTF-* as the supported encodings?

D docs, no.

Phobos docs, yes. But only in std.utf, nowhere else.

>> If a programmer decides to do UTF mangling, then he should store the intermediate results in ubytes, ushorts and uints. This way he can avoid getting tangled and tripped up sooner or later.
> 
> Those are more or less equivalent to char, wchar, dchar.

Those are exactly equivalent.

The issue is not technical, compiler related, language related, anything. But having char, wchar, and dchar gives the impression that they have something inherently to do with UTF, and that therefore there's something special about them in D, as the docs now stand.

That, in turn, implies (to the reader, not necessarily to the writer) that there is some kind of difference in how D treats them compared to ubyte, ushort and uint.

Well, there is a difference. But that difference exists _only_ in char[] and wchar[]. From the docs however, the reader gets the impression that there is something in char, wchar and dchar themselves that differentiates them from whatever the reader previously may have assumed.
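(The array-level difference I mean is, for instance, that foreach can decode a char[] on the fly, which it won't do for a ubyte[]. A small sketch, if I recall the foreach decoding rules right:)

import std.stdio;

void main()
{
    char[] s = "naïve";

    foreach (dchar c; s)               // decoded one character at a time
        writefln(cast(uint) c);        // 110 97 239 118 101, one per line

    foreach (ubyte b; cast(ubyte[]) s) // raw code units, no decoding
        writefln(b);                   // 110 97 195 175 118 101
}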

> The difference is more a matter of convention. (and string/character
> literal handling) Doesn't the confusion rather lie in the name? (char
> may imply character rather than code unit. I agree that this is
> unfortunate.)

Yes, yes!

And that precisely is the issue here.

>> And this way he keeps remembering that it's his responsibility to get things right.
>>
>>>> Please remove ""c, ""w and ""d from the documentation!
>>>
>>> So you suggest that users should use an explicit cast instead?
>>
>> No.
> 
> I think this may have been the reason for the implementation of the suffixes:
> 
> print(char[] x) { printf("1\n"); }
> print(wchar[] x) { printf("2\n"); }
> void main() { print("hello"); }
> 
> A remedy is very close to the current implementation. The following change in behaviour: string literals are always char[], but may be implicitly cast to wchar[] and dchar[]. print(char[] x) would
> therefore be a closer match. (All casts of literals are of course
> done at compile time.)

I'd have no problem with that.

As long as we all agree that this is Because We Just Happened to Choose So -- and not because there'd be any Real Reason For Precisely This Choice.

(That does sound stupid, but it's actually a very important distinction here.)

> Currently in D: string literals are _uncommitted_ char[] and may be implicitly cast to char[], wchar[] and dchar[]. No cast takes preference.
> 
> D may be too agnostic about preferred encoding...

Yes. The specters were lurking behind the back, so the programmer got a little paranoid in the night.

>> How, or in what form the string literals get internally stored, is the compiler vendor's own private matter. And definitely not part of a language specification.
> 
> The form string literals are stored in affects the performance in many cases. Why remove this control from the programmer?

Because he really should not have to bother.

(You might take an average program, count the literal strings, and then time the execution. Figure out the stupidest choice between UTF[8..32] and the OS's native encoding, and then calculate the time for decoding. I'd say that this amounts to such peanuts that it really is not worth thinking about.)

If a programmer writes a 10kloc program, half of which is string literals, then maybe yes, it could make a measurable difference. OTOH, such a programmer's code probably has bigger performance issues (and others) anyhow.

> Rereading your post, trying to understand what you ask for, leads me to assume that by your suggestion, a poor, but conforming, compiler may include a hidden call to a conversion function and a hidden memory allocation in this function call:
> 
> foo("hello");

Yes. Don't buy anything from that company.

>> The string literal decorations only create a huge amount of distraction, sending every programmer on a wild goose chase, possibly several times, before they either find someone who explains things or switch away from D.
> 
> I can not understand this confusion. It is currently very well specified how a character literal gets stored (and you get to pick between 3 different encodings). 

This confusion is unnecessary and based on a misconception. And I dare say most of it is a direct result of _having_ the string literal decorations in the first place.

> The problem lies in the "" form, where you have not specified the encoding. The compiler will try to infer the encoding depending on context. This is a weakness I think should be fixed by:
> 
> - "" is char[], but may be implicitly cast to dchar[] and wchar[]

That would fix it technically. I'd vote yes.

The bigger thing to fix is the rest (char not being a code point, the gotchas when slicing arrays of char, etc.) -- those still have to be sorted out.

>> Behind the scenes, a smart compiler manufacturer probably stores all string literals as UTF-8. Nice. Or as some other representation that's convenient for him, be it UTF-whatever, or even a native encoding. In any case, the compiler knows what this representation is.
>>
>> When the string gets used (gets assigned to a variable, or put in a structure, or printed to screen), then the compiler should implicitly cast (as in toUTFxxx and not the current "cast") the string to what's expected.
> 
> I start to understand you now... (first time reading)
> 
>> We can have "string types", like [c/w/d]char[], but not standalone UTF chars. When the string literal gets assigned to any of these, then it should get converted.
>>
>> Actually, a smart compiler would already store the string literal in the width the string will get used, it's got the info right there in the same source file. [...]

>> [...] And in case the programmer is real dumb and assigns the same literal to more than one UTF width, it could be stored in all of those separately. -- But the brain cells of the compiler writer should be put to more productive issues than this.
> 
> How do you assign _one_ literal to more than one variable?

Gee, thanks! My bad.

Oh-oh, it's not my bad after all. If another literal with the same content gets assigned in other places, it's customary for the compiler to store only one copy of it in the binary. So, of course, _that_ compiler might then store it in different encodings if it is assigned to variables of different widths in the source.

>> The D programmer should not necessarily even know that representation. The only situation he would need to know it is if his program goes directly to the executable memory image snooping for the literal. I'm not sure that should be legal.
> 
> What is wrong with a well-defined representation?

First of all, there's the question of whether the machine is big endian or little endian. What if it wants to store them in an unexpected place? Maybe it's a CIA-certified compiler that encrypts all literals in binaries and inserts on-the-fly runtime decoding routines into all string fetches? (I REALLY wouldn't be surprised to hear they have one.)

Thousands of reasons why you'd be very interested in _not_ knowing the representation. "What you don't know doesn't hurt you, and at least it doesn't borrow brain cells from you."

   for (byte i = 0; i < 10; i++) { writefln("foo ", i); }

Do you honestly know whether i is a byte or an int in the compiled program? Should you even care? How many have even thought about it?

---

Having said that, if somebody asks me, I'd vote for UTF-8 on Linux, and whatever is appropriate on Windows. Then I could look for the strings in the binary file with the unix "strings" command.