Thread overview
A grave Prayer
Nov 23, 2005
Georg Wrede
Nov 23, 2005
Oskar Linde
Nov 23, 2005
Georg Wrede
Nov 23, 2005
Regan Heath
Nov 23, 2005
Oskar Linde
Nov 23, 2005
Bruno Medeiros
Nov 23, 2005
Regan Heath
Nov 23, 2005
Oskar Linde
Nov 24, 2005
Carlos Santander
Nov 24, 2005
Oskar Linde
Nov 24, 2005
Oskar Linde
Nov 24, 2005
Georg Wrede
Nov 24, 2005
Georg Wrede
Nov 23, 2005
Georg Wrede
Nov 23, 2005
Oskar Linde
Nov 23, 2005
Georg Wrede
Nov 23, 2005
Oskar Linde
Nov 23, 2005
Kris
Nov 23, 2005
Oskar Linde
Nov 23, 2005
Kris
Nov 23, 2005
Regan Heath
Nov 23, 2005
Kris
Nov 23, 2005
Oskar Linde
Nov 24, 2005
Georg Wrede
Nov 23, 2005
Georg Wrede
November 23, 2005
We've wasted truckloads of ink lately on the UTF issue.

Seems we're tangled in this like the new machine-room boy who was later found dead: he got so wound up in Cat-5 cables that he asphyxiated.


The prayer:

Please remove the *char*, *wchar* and *dchar* basic data types from the documentation!

Please remove ""c, ""w and ""d from the documentation!


The illusion they create is more than 70% responsible for the recent wastage of ink here. Further, they will cloud the minds of new D users -- without anybody even noticing.

For example, there is no such thing as an 8-bit UTF character. It simply does not exist. The mere hint of one in the documentation has to be the result of a slip of the mind.
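To see why, here is a quick sketch (in Python rather than D, purely because the encoding arithmetic is language-independent): most Unicode characters need more than one 8-bit code unit.

```python
# Number of 8-bit UTF-8 code units needed per character:
assert len("A".encode("utf-8")) == 1    # U+0041, plain ASCII
assert len("ç".encode("utf-8")) == 2    # U+00E7
assert len("€".encode("utf-8")) == 3    # U+20AC
assert len("𝄞".encode("utf-8")) == 4    # U+1D11E, outside the BMP
# So a lone 8-bit "UTF char" cannot represent most characters.
```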

---

There are other things to fix regarding strings, utf, and such, but it will take some time before we develop a collective understanding about them.

But the documentation should be changed RIGHT NOW, so as not to cause further misunderstanding among ourselves, and not to introduce this to every new reader.

Anybody who reads these archives a year from now can't help feeling we are a bunch of pathological cases harboring an incredible psychosis -- about something that is basically trivial.

---

PS: this is not a rhetorical joke; I'm really asking for the removals.

November 23, 2005
Georg Wrede wrote:
> We've wasted truckloads of ink lately on the UTF issue.
> 
> Seems we're tangled in this like the new machine room boy who was later found dead. He got so wound up with cat-5 cables he got asphyxiated.
> 
> 
> The prayer:
> 
> Please remove the *char*, *wchar* and *dchar* basic data types from the documentation!

What should at least be changed is the suggestion that the C char is replaceable by the D char, plus a slight change in the definitions:

char: UTF-8 code unit
wchar: UTF-16 code unit
dchar: UTF-32 code unit and/or Unicode character
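To illustrate the distinction (sketched in Python, since the definitions themselves are language-neutral): only in UTF-32 does one code unit always equal one character, which is why only dchar gets the "and/or Unicode character" clause.

```python
# Code units per character in each encoding:
for ch, u8, u16 in [("A", 1, 1), ("€", 3, 1), ("𝄞", 4, 2)]:
    assert len(ch.encode("utf-8")) == u8            # 8-bit code units (char)
    assert len(ch.encode("utf-16-le")) // 2 == u16  # 16-bit code units (wchar)
    assert len(ch.encode("utf-32-le")) // 4 == 1    # 32-bit code units (dchar)
```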

> 
> Please remove ""c, ""w and ""d from the documentation!

So you suggest that users should use an explicit cast instead?

/Oskar
November 23, 2005
Oskar Linde wrote:
> Georg Wrede wrote:
> 
>> We've wasted truckloads of ink lately on the UTF issue.
>>
>> The prayer:
>>
>> Please remove the *char*, *wchar* and *dchar* basic data types from the documentation!
> 
> What should at least be changed is the suggestion that the C char is replaceable by the D char, plus a slight change in the definitions:
> 
> char: UTF-8 code unit
> wchar: UTF-16 code unit
> dchar: UTF-32 code unit and/or Unicode character

A UTF-8 code unit can not be stored in char. Therefore UTF can't be mentioned at all in this context.

By the same token, and because of symmetry, the wchar and dchar things should vanish. Disappear. Get nuked.

The language manual (as opposed to the Phobos docs), should not once use the word UTF anywhere. Except possibly stating that a conformant compiler has to accept source code files that are stored in UTF.

---

If a programmer decides to do UTF mangling, then he should store the intermediate results in ubytes, ushorts and uints. This way he can avoid getting tangled and tripped up sooner or later.

And this way he keeps remembering that it is his responsibility to get things right.

>> Please remove ""c, ""w and ""d from the documentation!
> 
> So you suggest that users should use an explicit cast instead?

No.

How, or in what form the string literals get internally stored, is the compiler vendor's own private matter. And definitely not part of a language specification.

The string literal decorations only create a huge amount of distraction, sending every programmer on a wild goose chase, possibly several times, before they either find someone who explains things or switch away from D.

Behind the scenes, a smart compiler manufacturer probably stores all string literals as UTF-8. Nice. Or as some other representation that's convenient for him, be it UTF-whatever, or even a native encoding. In any case, the compiler knows what this representation is.

When the string gets used (gets assigned to a variable, or put in a structure, or printed to screen), then the compiler should implicitly cast (as in toUTFxxx and not the current "cast") the string to what's expected.

We can have "string types", like [c/w/d]char[], but not standalone UTF chars. When the string literal gets assigned to any of these, then it should get converted.

Actually, a smart compiler would already store the string literal in the width the string will get used, it's got the info right there in the same source file. And in case the programmer is real dumb and assigns the same literal to more than one UTF width, it could be stored in all of those separately. -- But the brain cells of the compiler writer should be put to more productive issues than this.

Probably the easiest would be to just decide that UTF-8 is it. And use that while something else is not demanded. Or UTF-whatever else, depending on the platform. But the D programmer should never have to cast a string literal.

The D programmer should not necessarily even know that representation. The only situation he would need to know it is if his program goes directly to the executable memory image snooping for the literal. I'm not sure that should be legal.
November 23, 2005
On Wed, 23 Nov 2005 21:16:03 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
> Oskar Linde wrote:
>> Georg Wrede wrote:
>>
>>> We've wasted truckloads of ink lately on the UTF issue.
>>>
>>> The prayer:
>>>
>>> Please remove the *char*, *wchar* and *dchar* basic data types from the documentation!
>>  What should at least be changed is the suggestion that the C char is replaceable by the D char, plus a slight change in the definitions:
>>  char: UTF-8 code unit
>> wchar: UTF-16 code unit
>> dchar: UTF-32 code unit and/or Unicode character
>
> A UTF-8 code unit can not be stored in char. Therefore UTF can't be mentioned at all in this context.

I suspect we're having terminology issues again.

http://en.wikipedia.org/wiki/Unicode
"each code point may be represented by a variable number of code values."

"code point" == "character"
"code value" == part of a (or in some cases a complete) character

I think Oskar is correct (because I believe "code unit" == "code value") and a char does contain a UTF-8 code value/unit.

However, Georg is also correct (because I suspect he meant) that a char can not contain all Unicode(*) code points. It can contain some, those with values less than 255 but not others.

(*) Note I did not say UTF-8 here; I believe it's incorrect to do so. Code points are universal across all the UTF encodings; they are simply represented by different code values/units depending on the encoding used.
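Sketched in Python (the terminology itself is language-neutral), here is one code point carried by two code values/units:

```python
ch = "ç"                            # one code point, U+00E7
assert ord(ch) == 0xE7              # the code point is a single number
units = list(ch.encode("utf-8"))    # its UTF-8 code values/units
assert units == [0xC3, 0xA7]        # two code units encode one code point
# The code point survives any UTF round trip unchanged:
assert ch.encode("utf-16-le").decode("utf-16-le") == ch
```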

Regan
November 23, 2005
Georg Wrede wrote:
> Oskar Linde wrote:
> 
>> Georg Wrede wrote:
>>
>>> We've wasted truckloads of ink lately on the UTF issue.
>>>
>>> The prayer:
>>>
>>> Please remove the *char*, *wchar* and *dchar* basic data types from the documentation!
>>
>>
>> What should at least be changed is the suggestion that the C char is replaceable by the D char, plus a slight change in the definitions:
>>
>> char: UTF-8 code unit
>> wchar: UTF-16 code unit
>> dchar: UTF-32 code unit and/or Unicode character
> 
> 
> A UTF-8 code unit can not be stored in char. [...]

Can it not? I thought I had been doing this all the time... Why?
(I know most Unicode code points aka characters can not be stored in a char)

> [...] Therefore UTF can't be
> mentioned at all in this context.
>
> By the same token, and because of symmetry, the wchar and dchar things should vanish. Disappear. Get nuked.

A dchar can represent any Unicode character. Is that not enough reason to keep it?

> The language manual (as opposed to the Phobos docs), should not once use the word UTF anywhere. Except possibly stating that a conformant compiler has to accept source code files that are stored in UTF.

Why? It is often important to know the encoding of a string. (When dealing with system calls, c-library calls etc.) Doesn't D do the right thing to define Unicode as the character set and UTF-* as the supported encodings?

> 
> ---
> 
> If a programmer decides to do UTF mangling, then he should store the intermediate results in ubytes, ushorts and uints. This way he can avoid getting tangled and tripped up before or later.

Those are more or less equivalent to char, wchar, dchar. The difference is more a matter of convention. (and string/character literal handling)
Doesn't the confusion rather lie in the name? (char may imply character rather than code unit. I agree that this is unfortunate.)

> And this way he keeps remembering that it's on his responsibility to get things right.
> 
>>> Please remove ""c, ""w and ""d from the documentation!
>>
>>
>> So you suggest that users should use an explicit cast instead?
>
> No.
>

I think this may have been the reason for the implementation of the suffixes:

import std.c.stdio;

void print(char[] x) { printf("1\n"); }
void print(wchar[] x) { printf("2\n"); }
void main() { print("hello"); } // ambiguous: an unsuffixed literal matches both overloads

A remedy is very close to the current implementation. The following change in behaviour: string literals are always char[], but may be implicitly cast to wchar[] and dchar[]. print(char[] x) would therefore be a closer match.
(All casts of literals are of course done at compile time.)

Currently in D: string literals are __uncommitted char[] and may be implicitly cast to char[], wchar[] and dchar[]. No cast takes preference.

D may be too agnostic about preferred encoding...

> How, or in what form the string literals get internally stored, is the compiler vendor's own private matter. And definitely not part of a language specification.

The form string literals are stored in affects performance in many cases. Why remove this control from the programmer?
Rereading your post, trying to understand what you ask for, leads me to assume that under your suggestion a poor, but conforming, compiler may include a hidden call to a conversion function and a hidden memory allocation in this function call:

foo("hello");

> The string literal decorations only create a huge amount of distraction, sending every programmer on a wild goose chase, possibly several times, before they either finds someone who explains things, or they switch away from D.

I cannot understand this confusion. It is currently very well specified how a character literal gets stored (and you get to pick between three different encodings). The problem lies in the "" form, where you have not specified the encoding. The compiler will try to infer the encoding depending on context. This is a weakness I think should be fixed by:

- "" is char[], but may be implicitly cast to dchar[] and wchar[]

> Behind the scenes, a smart compiler manufacturer probably stores all string literals as UTF-8. Nice. Or as some other representation that's convenient for him, be it UTF-whatever, or even a native encoding. In any case, the compiler knows what this representation is.
> 
> When the string gets used (gets assigned to a variable, or put in a structure, or printed to screen), then the compiler should implicitly cast (as in toUTFxxx and not the current "cast") the string to what's expected.

I start to understand you now... (first time reading)

> We can have "string types", like [c/w/d]char[], but not standalone UTF chars. When the string literal gets assigned to any of these, then it should get converted.
> 
> Actually, a smart compiler would already store the string literal in the width the string will get used, it's got the info right there in the same source file. [...]

But this is exactly what D does! Only in a well defined way. Not by hiding it as an implementation detail.

> [...] And in case the programmer is real dumb and assigns the same literal to more than one UTF width, it could be stored in all of those separately. -- But the brain cells of the compiler writer should be put to more productive issues than this.

How do you assign _one_ literal to more than one variable?

> The D programmer should not necessarily even know that representation. The only situation he would need to know it is if his program goes directly to the executable memory image snooping for the literal. I'm not sure that should be legal.

What is wrong with a well-defined representation?

Regards,

/Oskar
November 23, 2005
Regan Heath wrote:
> On Wed, 23 Nov 2005 21:16:03 +0200, Georg Wrede <georg.wrede@nospam.org>  wrote:
> 
>> Oskar Linde wrote:
>>
>>> Georg Wrede wrote:
>>>
>>>> We've wasted truckloads of ink lately on the UTF issue.
>>>>
>>>> The prayer:
>>>>
>>>> Please remove the *char*, *wchar* and *dchar* basic data types from  the documentation!
>>>
>>>  What should at least be changed is the suggestion that the C char is replaceable by the D char, plus a slight change in the definitions:
>>>  char: UTF-8 code unit
>>> wchar: UTF-16 code unit
>>> dchar: UTF-32 code unit and/or Unicode character
>>
>>
>> A UTF-8 code unit can not be stored in char. Therefore UTF can't be  mentioned at all in this context.
> 
> 
> I suspect we're having terminology issues again.
> 
> http://en.wikipedia.org/wiki/Unicode
> "each code point may be represented by a variable number of code values."
> 
> "code point" == "character"
> "code value" == part of a (or in some cases a complete) character
> 
> I think Oskar is correct (because I believe "code unit" == "code value") and a char does contain a UTF-8 code value/unit.
>
> However, Georg is also correct (because I suspect he meant) that a char  can not contain all Unicode(*) code points. It can contain some, those  with values less than 255 but not others.

I think you are correct in your analysis Regan.

I was actually referring to your definitions in:
news://news.digitalmars.com:119/ops0bhkaaa23k2f5@nrage.netwin.co.nz
http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/30029
and trying to be very careful to use the agreed terminology:

character: (Unicode character/symbol)
code unit: (part of the actual encoding)

I feel that I may have been a bit dense in my second reply to Georg. Apologies.

Regards,

Oskar
November 23, 2005
"Oskar Linde" <oskar.lindeREM@OVEgmail.com> wrote
[snip]
> A remedy is very close to the current implementation. The following change
> in behaviour: string literals are always char[], but may be implicitly
> cast to wchar[] and dchar[]. print(char[] x) would therefore be a closer
> match.
> (All casts of literals are of course done at compile time.)
>
> Currently in D: string literals are __uncommitted char[] and may be implicitly cast to char[], wchar[] and dchar[]. No cast takes preference.

Yes, very close. Rather than being uncommitted, they should default to char[]. A suffix can be used to direct as appropriate. This makes the behaviour more consistent in two (known) ways.

There's the additional question as to whether the literal content should be used to imply the type (like literal chars and numerics do), but that's a different topic and probably not as important.


November 23, 2005
Regan Heath wrote:
> On Wed, 23 Nov 2005 21:16:03 +0200, Georg Wrede <georg.wrede@nospam.org>  wrote:
> 
>> Oskar Linde wrote:
>>
>>> Georg Wrede wrote:
>>>
>>>> We've wasted truckloads of ink lately on the UTF issue.
>>>>
>>>> The prayer:
>>>>
>>>> Please remove the *char*, *wchar* and *dchar* basic data types from  the documentation!
>>>
>>>  What should at least be changed is the suggestion that the C char is replaceable by the D char, plus a slight change in the definitions:
>>>  char: UTF-8 code unit
>>> wchar: UTF-16 code unit
>>> dchar: UTF-32 code unit and/or Unicode character
>>
>>
>> A UTF-8 code unit can not be stored in char. Therefore UTF can't be  mentioned at all in this context.
> 
> 
> I suspect we're having terminology issues again.
> 
> http://en.wikipedia.org/wiki/Unicode
> "each code point may be represented by a variable number of code values."
> 
> "code point" == "character"
> "code value" == part of a (or in some cases a complete) character
> 
> I think Oskar is correct (because I believe "code unit" == "code value") and a char does contain a UTF-8 code value/unit.
> 

> However, Georg is also correct (because I suspect he meant) that a char  can not contain all Unicode(*) code points. It can contain some, those  with values less than 255 but not others.
Wrong, actually: it can only contain code points below 128. Above that, a character takes two or more bytes of storage.

"A character whose code point is below U+0080 is encoded with a single byte that contains its code point: these correspond exactly to the 128 characters of 7-bit ASCII."
http://en.wikipedia.org/wiki/UTF-8

"a) Use UTF-8. This preserves ASCII, but not Latin-1, because the characters >127 are different from Latin-1. UTF-8 uses the bytes in the ASCII only for ASCII characters. "
http://www.unicode.org/faq/utf_bom.html

I've actually only found this today when trying to writefln('ç');
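A quick check of that boundary (sketched in Python; the byte counts are a property of UTF-8 itself, not of any one language):

```python
assert len(chr(0x7F).encode("utf-8")) == 1   # U+007F: last single-byte code point
assert len(chr(0x80).encode("utf-8")) == 2   # U+0080: already needs two bytes
assert len("ç".encode("utf-8")) == 2         # U+00E7 > 127, hence two bytes
# And Latin-1 is not preserved: the single byte 0xE7 is not valid UTF-8.
try:
    bytes([0xE7]).decode("utf-8")
    decoded = True
except UnicodeDecodeError:
    decoded = False
assert not decoded
```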


-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
November 23, 2005
Kris wrote:

> There's the additional question as to whether the literal content should be used to imply the type (like literal chars and numerics do), but that's a different topic and probably not as important. 

How would the content imply the type? All Unicode strings are representable equally by char[], wchar[] and dchar[].
Do you mean that the most optimal (memory wise) encoding is used?
Or do you mean that file encoding could imply type?
(This means that transcoding the source code could change program behaviour.)

Character literals and numeric literals imply a type because of their limited representability. This is not the case with strings.

/Oskar
November 23, 2005
On Wed, 23 Nov 2005 20:50:16 +0000, Bruno Medeiros <daiphoenixNO@SPAMlycos.com> wrote:
> Regan Heath wrote:
>> However, Georg is also correct (because I suspect he meant) that a char  can not contain all Unicode(*) code points. It can contain some, those  with values less than 255 but not others.
>
> Wrong, actually: it can only contain code points below 128. Above that, a character takes two or more bytes of storage.

Thanks for the correction. I wasn't really sure.

Regan