November 14, 2005
On Mon, 14 Nov 2005 23:13:48 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
>> There's no end to the misunderstanding this proliferates and breeds.
>
> The online manual (http://www.digitalmars.com/d/type.html) states that "char is an unsigned 8 bit UTF-8".
>
> That means that something like the "C kind of 'char'" simply does not exist in D.

I believe "byte" is equivalent to C "char" and ubyte to C "unsigned char".

Have a look at:
http://www.digitalmars.com/d/ctod.html

In the section entitled "Primitive Types" the table "C to D types" shows:

        char               =>        char
        signed char        =>        byte
        unsigned char      =>        ubyte

IMO the first line is incorrect and confusing.
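
To make the difference concrete: if that first line mapped C's plain char to D's byte instead, a C prototype such as int strcmp(const char *s1, const char *s2) would be bound along these lines (purely hypothetical -- as far as I know the actual D headers keep char* here):

extern (C) int strcmp(byte* s1, byte* s2);

and a UTF-8 char[] could then no longer slip through to the C side by accident without an explicit cast.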

> A few lines down, under Integer Promotions, there is a table that says char and wchar are converted to int, and dchar to uint.
>
> ---
>
> So, _WHAT_ should the following program output?
>
> import std.stdio;
>
> void main()
> {
> 	struct U {
> 		union {char c; int m;}
> 	}
> 	U * u = new U;
>
> 	for (int i=0;i < 0xFF; i++)
> 	{
> 		u.c = i;
> 		int ii = u.c;
>
> 		writefln("i:%4d  ii:%4d  m:%5x m:%d", i,ii,u.m,u.m);
> 	}
> }

Exactly what it does output (I tried it): 255 lines, counting from 0 to 254 inclusive.
Why?

Regan
November 14, 2005
> I think we've painted ourselves in the corner by calling the UTF-8 entity "char"!!

Given:

    char[10] foo;

What is the storage capacity of foo?

 - is it 10 UTF-8 characters
 - is it 2.5 UTF-8 characters
 - "it depends"
 - something else

Another question:

How much storage capacity should I allocate here:

   char[?] bar = "Äiti syö lettuja.";
      // Finnish for "Mother eats pancakes." :-)
   char[?] baz = "ﮔﯛﺌﯔﺡﺠﮗﮝﮱﺼﺶ";
      // I hope the al Qaida or the CIA don't knock on my door,
      // I hereby officially state I have no idea what I wrote.

(Heh, btw, upon writing that string, my cursor started looking weird. Now I'll just have to see whether my news reader or Windows itself crashes first!! Which should never happen because I got the characters "legally", i.e. from the Windows Character Map in the System Tools menu.)

--

Character sets, my butt!
November 14, 2005
On Tue, 15 Nov 2005 00:23:47 +0200, Georg Wrede wrote:

>> I think we've painted ourselves in the corner by calling the UTF-8 entity "char"!!
> 
> Given:
> 
>      char[10] foo;
> 
> What is the storage capacity of foo?
> 
>   - is it 10 UTF-8 characters
>   - is it 2.5 UTF-8 characters
>   - "it depends"
>   - something else

10 bytes.  A char is 8 bits wide. But if your question is really how many characters it can contain, the answer is between 10 and 2.5, depending on which characters you are talking about.
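
A quick sketch to make that concrete:

import std.stdio;

void main()
{
	char[10] foo;
	writefln(foo.length);   // always 10: .length counts bytes, not characters
	char[] s = "syö";       // three characters on screen...
	writefln(s.length);     // ...but prints 4, since ö takes two bytes in UTF-8
}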


> Another question:
> 
> How much storage capacity should I allocate here:
> 
>     char[?] bar = "Äiti syö lettuja.";
>        // Finnish for "Mother eats pancakes." :-)
>     char[?] baz = "ﮔﯛﺌﯔﺡﺠﮗﮝﮱﺼﺶ";
>        // I hope the al Qaida or the CIA don't kock on my door,
>        // I hereby officially state I have no idea what I wrote.

Don't know. That's why I use char[].

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
15/11/2005 9:43:13 AM
November 14, 2005
On Tue, 15 Nov 2005 00:23:47 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
>> I think we've painted ourselves in the corner by calling the UTF-8 entity "char"!!
>
> Given:
>
>      char[10] foo;
>
> What is the storage capacity of foo?
>
>   - is it 10 UTF-8 characters
>   - is it 2.5 UTF-8 characters
>   - "it depends"
>   - something else

"something else", 10 UTF-8 'codepoints' where one or more codepoints make up a single grapheme/character. (I believe this is the correct terminology)

> Another question:
>
> How much storage capacity should I allocate here:
>
>     char[?] bar = "Äiti syö lettuja.";
>        // Finnish for "Mother eats pancakes." :-)
>     char[?] baz = "ﮔﯛﺌﯔﺡﺠﮗﮝﮱﺼﺶ";
>        // I hope the al Qaida or the CIA don't kock on my door,
>        // I hereby officially state I have no idea what I wrote.

Easy answer: You don't allocate any, you use char[] and it works it out for itself.
Hard answer: As it turns out, by trial and error I got:

import std.stdio;

void main()
{
	char[19] bar = "Äiti syö lettuja.";
	// Finnish for "Mother eats pancakes." :-)
	char[33] baz = "ﮔﯛﺌﯔﺡﺠﮗﮝﮱﺼﺶ";
	// I hope the al Qaida or the CIA don't knock on my door,
	// I hereby officially state I have no idea what I wrote.
}

There are 2 graphemes in the Finnish string that require an extra codepoint. The Arabic takes 3 UTF-8 codepoints for every grapheme.

Another interesting piece of information (you're gonna love this): there are, in some (many?) cases, actually several ways to represent the same grapheme in the same encoding, i.e. in UTF-8 there is more than one way to represent a single grapheme, meaning that the storage space (the byte length) of the char[] will change depending on which one is chosen.

The reason this can happen is that graphemes can be made up from their requisite parts; for example, the Ä is likely an A followed by the codepoint which means "add two dots above it". Some of the more complicated characters can therefore be made up of different combinations of parts in a different order, I believe (someone correct me if I am talking rubbish).

I believe there is a rule that the shortest possible form is deemed 'correct' for purposes like ours above, but if you obtain the strings from a file, say, you may encounter the same 'string' represented in UTF-8 and yet find that the char[]s have different lengths.
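
To make that concrete, here's a little test (assuming I have the combining mark right -- U+0308 is the "combining diaeresis"):

import std.stdio;

void main()
{
	// Precomposed: the single code point U+00C4 (A with diaeresis).
	char[] composed = "\u00C4";
	// Decomposed: a plain 'A' followed by U+0308 (combining diaeresis).
	char[] decomposed = "A\u0308";

	writefln(composed.length);    // 2 bytes in UTF-8
	writefln(decomposed.length);  // 3 bytes in UTF-8
	writefln(composed == decomposed ? "same bytes" : "different bytes");
}

Both should display as Ä, yet the arrays have different lengths and compare unequal byte-for-byte.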

Regan
November 14, 2005

(Don't take this personally, I'm just pissed off, because I wrote some Arabic in another post here, and even when I started this reply, my cursor again seems f**ked up.)


Sean Kelly wrote:
> Georg Wrede wrote:
> 
>> I think we've painted ourselves in the corner by calling the UTF-8 entity "char"!!
> 
> I agree that it's a tad weird, especially since a char literal can
> only hold a subset of valid UTF-8 values (ie. the single-byte ones).

Yes, and actually _only_ the USASCII ones, which for us Europeans is a major leap backwards!! (I wanted to say Mankind, like Americans do about themselves...)

>> There's no end to the misunderstanding this proliferates and
>> breeds.
>> 
>> For example, who would have thought that it is illegal to chop a char[] at an arbitrary point? I bet hardly anybody. Not 2 weeks ago
>>  anyway.
> 
> So long as you're only dealing with ASCII values (common for us US folks) you *can* chop a char[] anywhere you want. Though this is obviously risky for some applications.

You can't possibly even begin to imagine what kind of language that sentence provokes on this side of the Atlantic!   ;-)   Oh, btw, Finnish is the language (in Europe at least) that's got the largest vocabulary for cursing. (That's "4-letter words", for Americans. And I'm telling you, any kid here knows words that make swearing in English sound like daycare chat. The Swedes envy our worst words!)

>> Parallax thinking. That's what we are doing with our (so called) strings. (Parallax thinking is a term in Bubblefield, which essentially means "thinking of a red car that is blue".)
>>
>>    If we had the types
>>    utf8, utf16, utf32,
>>    while forbidding char,
>>    we could get somewhere.
> 
> I think this would be misleading, as it seems to imply that a utf8 literal may occupy 1-4 bytes.  This would make sense for strings
> however.

The precise intent was to make it obvious that a single utf8 character _may_ actually take anywhere from 1..4 bytes.
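
For instance (one-character strings, so .length shows the byte count of the single character):

import std.stdio;

void main()
{
	writefln("a".length);           // 1 byte  (U+0061)
	writefln("Ä".length);           // 2 bytes (U+00C4)
	writefln("€".length);           // 3 bytes (U+20AC)
	writefln("\U0001D11E".length);  // 4 bytes (U+1D11E, musical G clef)
}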

>> When somebody wants to do array twiddling while disregarding the
>> risk of encountering a multi-byte utf8, then one should have to be
>> explicit about it.
>> 
>> Equally, when somebody wants to twiddle with arrays of "virtually
>> just byte size entities, but in the rare case potentially
>> multibyte" entities, then _this_ should have to be explicit. This
>> would also keep the programmer himself remembering what he is
>> twiddling.
> 
> This is why I implemented readf using dchars.  It's just a lot
> simpler to manage this sort of thing when you don't have to worry
> about multibyte characters.

I took it so much for granted that you had "done the Bill Gates" (which implies using wchar and sweeping the problem under the rug: "as you're only dealing with BMP values (common for all civilised folks) you *can* chop a wchar[] anywhere you want, though this is obviously risky for some applications") that I had already written several lines of "less than appropriate" commentary.

Lucky I do proofreading! So you actually use dchar!

That means you're one of the Good Guys!  :-)

> The name "Dchar" also implies to me that it's the char type
> for the D language, even if it really means "double-word char."

I noticed that too. Cool, ain't it!

---

Heh, the quip about Bill Gates turned out to be a pun! He's actually done it on Windows! Good grief.
November 14, 2005
Georg Wrede wrote:
> 
> You can't possibly even begin to imagine what kind of language that sentence provokes on this side of the Atlantic!   ;-)   Oh, btw, Finnish is the language (in Europe at least) that's got the largest vocabulary for cursing. (That's "4-letter words", for Americans. And I'm telling you, any kid here knows words that make swearing in English sound like daycare chat. The Swedes envy our worst words!)

Hah.  US curses are probably the least creative around.  Though I suppose it offers a lot of insight into our culture that the worst words we can come up with refer to feces and sex.  Welcome to grade school! :-)


Sean
November 15, 2005
Regan Heath wrote:
> On Mon, 14 Nov 2005 16:12:32 +0200, Georg Wrede wrote:
> 
>> We have these things called char, wchar and dchar. They are defined
>> as UTF-8, UTF-16 and UTF-32. Fine.
>> 
>> C has this thing called char, too. Many of us work in both
>> languages.
> 
> Indeed, I work in C on a daily basis and D in my spare time.
> 
>> So? Well, what we do a lot is manipulating arrays of "char"s. While
>> at it, in D we are doing most of this manipulating without further
>> thinking, which results in code that essentially manipulates
>> arrays of "ubyte". Since practically nobody here has a mother
>> tongue that _needs_ characters of more than 8 bits, we "haven't had
>> any trouble".
> 
> That is not what I meant. Tho it's also true. If I had to parse a language involving multi-byte UTF-8 characters I would likely have
> had some trouble doing relatively simple things like getting the
> string length.
> 
> In part what I actually meant is explained by Kris when he referred
> to the stream libraries, I quote:
> 
> "The Stream methods are named like this specifically to avoid the
> problem:
> 
> write  (char[])
> writeW (wchar[])
> writeD (dchar[])"
> 
> I suspect he is correct in that _if_ the stream functions were all called "write" I would have encountered the problem with this:
> 
> write("test");
> 
> not knowing what function to call and requiring a suffix.
> 
> That said, I think I still prefer the current behaviour to a change which could allow a silent bug to creep in, however unlikely it may
> be (and I agree it's unlikely).

Agreed. What I'm trying to uncover is a way of making that risk go away. In other words, we should (of course) first get to the _very_ bottom of this issue (since this is the last time we can do that), and then figure out a Proper Solution for the _entire_ UTF thing.

With any luck, we could still retain "most of the current char[] manipulating code", so that those who use it _remain_conscious_ of the fact that such is only good for USASCII. Which in itself is good for many things, like custom-made compilers and whatever. (In-house languages, definition files, shell-script-like things -- all of these are really more productive (to develop, to apply, and to understand) when you don't complicate the issue with "the possibility of writing Chinese commands in them".)

>> I think we've painted ourselves in the corner by calling the UTF-8 entity "char"!!
>> 
>> There's no end to the misunderstanding this proliferates and
>> breeds.
> 
> This is true, technically char and char* from C should be byte and byte* in D, as C's char is an 8-bit signed type with "no specific encoding", whereas D's char is an 8-bit unsigned type with a specified encoding of UTF-8.

Gee, I'd forgotten that it's signed! Your point is thus even more accurate.

> Perhaps if we had used byte in our C porting D code it would have highlighted the differences to new D programmers. 

Man, there's nothing like hindsight! I agree fully!

And not only new D programmers, also the old farts, and even Gurus, who tend to forget some of this every now and then!

> It would certainly
> make things like this:
> 
> char[] a;
> strlwr(a);
> 
> illegal right off the bat. Which is the opposite of what D does; D actually implicitly converts the char[] to char* where required.
> 
> In order to call strlwr safely we have to use toStringz anyway,
> right? So why not write it:
> 
> byte* toStringz(char[] a);
> byte* toStringz(wchar[] a);
> byte* toStringz(dchar[] a);

Hmm. I wonder whether one _can_ toStringz a UTF thing at all?

At the very least one should know how the receiving end will interpret it, right? (Native character set, UTF-16, 32-bit DOS various character sets, Mac anybody?)
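
As far as I can tell, toStringz itself works mechanically -- it just appends a terminating zero and does no conversion -- so the real question is what the receiving end does with the raw UTF-8. A small sketch (assuming std.string.toStringz and the C printf binding from std.c.stdio):

import std.string;   // toStringz
import std.c.stdio;  // printf

void main()
{
	char[] s = "Äiti syö lettuja.";
	// The C side receives the UTF-8 bytes unchanged; whether they come out
	// right depends entirely on the terminal / locale at the other end.
	printf("%s\n", toStringz(s));
}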

> and turn all the C 'char' parameter types into 'byte'?

Yes, that's what we'll eventually have to do -- like it or not.

> I suspect this change would cause a bunch of compile errors for
> people, but a large percentage of those may be bugs waiting to happen

As I see it, the current state of "UTF support" in D is in such a shape that whatever we do to fix it is bound to break reams of code.

My only hope is that we get it over and done with before it's too late. The other thing I worry about is that, if we "release" D before all of this is fixed, a lot of "D opponents" will have enough fodder to hurt us real bad before the dust settles.

> and the rest will be:
> 
> strlwr("literal string");
> 
> Tho not likely calling strlwr ;)
> 
> And we can resolve this with either:
>  - another suffix, 'z' anyone?
>  - a call to toStringz, I know not as efficient.
> 
>> For example, who would have thought that it is illegal to chop a char[] at an arbitrary point? I bet hardly anybody. Not 2 weeks ago
>> anyway.
> 
> I did, but only because a year or so ago Arcane Jill was stomping
> round this place righting all the missinformation and confusion about
> unicode issues and the D types etc.

(Man I hope she's alive. The last few posts from her didn't make me feel easy at all.)

>> Parallax thinking. That's what we are doing with our (so called) strings. (Parallax thinking is a term in Bubblefield, which essentially means "thinking of a red car that is blue".)
>>
>>     If we had the types
>>     utf8, utf16, utf32,
>>     while forbidding char,
>>     we could get somewhere.
>>
>> When somebody wants to do array twiddling while disregarding the
>> risk of encountering a multi-byte utf8, then one should have to be explicit about it.
>> 
>> Equally, when somebody wants to twiddle with arrays of "virtually just byte size entities, but in the rare case potentially
>> multibyte" entities, then _this_ should have to be explicit. This
>> would also keep the programmer himself remembering what he is
>> twiddling.
> 
> D has a tradition, borrowed from C, of giving you access to the bare metal. char[] etc. are examples of this.

I never meant to make opaque things out of this. Just that (especially to remind the programmer himself!!) you should be forced to be explicit about which side of the fence you are on today.

> The concerns you've voiced above are the reasons a lot of people have
> proposed that a string type is required, in fact I'm surprised you haven't done the same.

Ouch. I was counting on nobody noticing! :-)

 - Walter knows, I don't have to state the obvious.

 - The longer we can discuss this without bringing up the ST word, the longer we can keep at it.

 - We really need to chart the territory thoroughly before writing to Santa.

 - Currently I have no idea whether we need a Type and/or a Class.

> A string type could theoretically encapsulate string data in any of
> the 3 types and provide access on a character by character basis
> instead of byte by byte. It could provide a strlen style method which
> would actually be correct for multibyte char[] data, and so on and so
> forth.

 - And, I'm still not sure whether this should be in the language at all, or just a library thing. (A rough sketch of the library route follows below.)
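
To give an idea of what the library-only route could look like, here's a rough sketch. The names are made up and nothing like this exists in Phobos; it simply leans on std.utf for the decoding:

import std.utf;    // toUTF32

// Hypothetical library string: stores UTF-8, hands out whole characters.
struct String
{
	char[] data;

	// Number of code points (not bytes).
	size_t length()
	{
		return toUTF32(data).length;   // decode, then count
	}

	// i-th code point, decoded on demand (naive but correct).
	dchar opIndex(size_t i)
	{
		return toUTF32(data)[i];
	}
}

void main()
{
	String s;
	s.data = "Äiti syö lettuja.";
	assert(s.length() == 17);      // 17 characters, even though 19 bytes
	assert(s[0] == '\u00C4');      // U+00C4, i.e. 'Ä'
}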
November 15, 2005
Regan Heath wrote:
> On Mon, 14 Nov 2005 23:13:48 +0200, Georg Wrede wrote:
> 
>>> There's no end to the misunderstanding this proliferates and
>>> breeds.
>> 
>> The online manual (http://www.digitalmars.com/d/type.html) states that "char is an unsigned 8 bit UTF-8".
>> 
>> That means that something like the "C kind of 'char'" simply does
>> not exist in D.
> 
> I believe "byte" is equivalent to C "char" and ubyte to C "unsigned
> char".

I know. So do most of us. That was the whole point of this exercise.

> Have a look at: http://www.digitalmars.com/d/ctod.html
> 
> In the section entitled "Primitive Types" the table "C to D types"
> shows:
> 
>         char               =>        char
>         signed char        =>        byte
>         unsigned char      =>        ubyte
> 
> IMO the first line is incorrect and confusing.

It's a lot worse than that. (See below.)

>> A few lines down, under Integer Promotions, there is a table that says char and wchar are converted to int, and dchar to uint.
>>
>> ---
>>
>> So, _WHAT_ should the following program output?
>>
>> import std.stdio;
>>
>> void main()
>> {
>>     struct U {
>>         union {char c; int m;}
>>     }
>>     U * u = new U;
>>
>>     for (int i=0;i < 0xFF; i++)
>>     {
>>         u.c = i;
>>         int ii = u.c;
>>
>>         writefln("i:%4d  ii:%4d  m:%5x m:%d", i,ii,u.m,u.m);
>>     }
>> }
> 
> Exactly what it does output (I tried it), 255 lines counting from 0
> to 254 inclusive. Why?

'Coz I'm too old to code, that's why. Of course it should print the "255" line too. :-)

Seriously, if D advertises "char" as a UTF-only entity instead of "C char", then _it_should_be_illegal_ to Promote it to Integer!

That's because a "char" octet can be part of a multibyte UTF character. In that case any Promotion would rip it off of context.

That's like Promoting the second octet of a "real". It would make just as much sense.

---

Another example of where even we ourselves have no clue where we are. And all just because of the fact that '"char" represents "UTF-8"'.
November 15, 2005
Nick wrote:
> In article <4374598B.30604@nospam.org>, Georg Wrede says...
> 
>> The compiler knows (or at least _should_ know) the character width
>> of the source code file. Now, if there's an undecorated string
>> literal in it, then _simply_assume_ that is the _intended_ type of
>> the string!
>> 
>> (( At this time opponents will say "what if the source code file
>> gets converted into another character width?" -- My answer: "Tough,
>> ain't it!", since there's a law against gratuituous mucking with
>> source code.  ))
> 
> Well that's a nice attitude. Makes copy-and-paste impossible, and
> makes writing code off html, plain text, and books impossible too,
> since the code's behaviour now depends on your language
> environment. I'm sure that won't cause any bugs at all ;-)

:-) there are actually 2 separate issues involved.

First of all, the copy-and-paste issue:

To be able to paste into the string, the text editor (or whatever) has to know the character width of the file to begin with, since pasting is done differently with the various UTF widths. Further, one cannot paste anything "in the wrong UTF width" as such, so the editor has to convert it into the width of the entire file first. (This _should_ be handled by the operating system (not the text editor), but I wouldn't bet on it, at least before 2010 or something. Not with at least _some_ "operating systems".)

Second, the width the undecorated literal is to be stored as:

What makes this issue interesting is whether it is feasible to assume something, or to declare the literal as being of "unspecified" width.

There's lately been some research into the issue (in the D newsgroup). The jury is still out.
November 15, 2005
On Tue, 15 Nov 2005 02:58:23 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
> Regan Heath wrote:
>> On Mon, 14 Nov 2005 23:13:48 +0200, Georg Wrede wrote:
>>> So, _WHAT_ should the following program output?
>>>
>>> import std.stdio;
>>>
>>> void main()
>>> {
>>>     struct U {
>>>         union {char c; int m;}
>>>     }
>>>     U * u = new U;
>>>
>>>     for (int i=0;i < 0xFF; i++)
>>>     {
>>>         u.c = i;
>>>         int ii = u.c;
>>>
>>>         writefln("i:%4d  ii:%4d  m:%5x m:%d", i,ii,u.m,u.m);
>>>     }
>>> }
>>  Exactly what it does output (I tried it), 255 lines counting from 0
>> to 254 inclusive. Why?
>
> 'Coz I'm too old to code, that's why. Of course it should print the "255" line too. :-)

After modification I get:
i: 255  ii: 255  m:   ff m:255

> Seriously, if D advertises "char" as a UTF-only entity instead of "C char", then _it_should_be_illegal_ to Promote it to Integer!
>
> That's because a "char" octet can be part of a multibyte UTF character.

Correct, a single char (octet) represents a single UTF-8 codepoint, part of a complete grapheme (character).

> In that case any Promotion would rip it out of its context.

It's true, it takes it out of context. So you're proposing that an explicit cast be required to obtain the value of a codepoint (char) as an int, or vice versa? Or are you proposing we prevent it altogether?

> That's like Promoting the second octet of a "real". It would make just as much sense.

True, provided that a single UTF-8 codepoint cannot be useful out of context. I have this vague recollection that a grapheme may be made up of a base codepoint followed by modifying codepoints; I mentioned this before with your Finnish example, e.g. the letter "Ä" may be made up of 'A' followed by the "add two dots" codepoint.

I believe this is in addition to its standard form, which this example gives us:

import std.stdio;

void main()
{
	char[] s = "Ä";
	writefln(s.length);              // byte length of the literal
	writefln("%d,%d", s[0], s[1]);   // the first two bytes as integers
}

I may be wrong here. Someone correct me if you know for certain!

> ---
>
> Another example of where even we ourselves have no clue where we are. And all just because of the fact that '"char" represents "UTF-8"'.

I don't have a problem with 'char' representing a single UTF-8 codepoint, provided it's explained correctly in the docs. I think it's fair to expect people to learn about Unicode issues and learn that 'char' in D is not 'char' from C.

I also believe that we need to provide the tools to manage Unicode issues in the standard library. Certainly we haven't got those tools yet; e.g., to begin with, a function that could tell you the grapheme length of a char[]. There was talk about porting a popular Unicode library to D; this would likely solve many of the problems.
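
Just to sketch the kind of helper I mean -- this one counts code points rather than full graphemes (proper grapheme counting would also have to deal with the combining marks mentioned earlier):

import std.stdio;

// Count code points in a UTF-8 char[] by skipping continuation bytes
// (the bytes of the form 10xxxxxx).
size_t codePointCount(char[] s)
{
	size_t n = 0;
	foreach (char c; s)
	{
		if ((c & 0xC0) != 0x80)
			n++;
	}
	return n;
}

void main()
{
	char[] bar = "Äiti syö lettuja.";
	writefln("%d bytes, %d code points", bar.length, codePointCount(bar));
	// prints: 19 bytes, 17 code points
}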

Regan