November 15, 2005
On Tue, 15 Nov 2005 02:30:27 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
>>  That said, I think I still prefer the current behaviour to a change which could allow a silent bug to creep in, however unlikely it may
>> be (and I agree it's unlikely).
>
> Agreed. What I'm trying to uncover is a way of making that risk go away. In other words, we should of course first get to the _very_ bottom of this issue (since this is the last time we can do that), and then figure out a Proper Solution for the _entire_ UTF thing.
>
> With any luck, we could still retain "most of the current char[] manipulating code" so that those who use it _remain_conscious_ of the fact that such is only good for USASCII. Which in itself is good for many things, like custom-made compilers and whatever. (In-house languages, definition files, shell-script-like things -- all of these are really more productive (to develop, to apply, and to understand) when you don't complicate the issue with "the possibility of writing Chinese commands in them".)

Perhaps a library flag/version could toggle from ASCII to full Unicode support, allowing ASCII-only programs to have fast, efficient string code that ignores the possibility of multi-codepoint graphemes.
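To make the ASCII-vs-Unicode distinction concrete, here is a small sketch in Python (used purely as neutral illustration, since the thread predates any settled D library API): for pure ASCII, one byte is one character, so byte-level string code is safe; once non-ASCII text appears, byte count and character count diverge.

```python
ascii_text = "Mother eats pancakes."
finnish_text = "Äiti syö lettuja."

# Pure ASCII: one byte per character, so byte-oriented code stays correct.
assert len(ascii_text.encode("utf-8")) == len(ascii_text)

# Non-ASCII: 17 characters but 19 UTF-8 bytes (Ä and ö take two bytes each),
# so byte-indexing code silently goes wrong.
assert len(finnish_text) == 17
assert len(finnish_text.encode("utf-8")) == 19
```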

> Hmm. I wonder whether one _can_ toStringz a UTF thing at all?
>
> At the very least one should know how the receiving end will interpret it, right? (Native character set, UTF-16, 32-bit DOS various character sets, Mac anybody?)

Good point. toStringz should really be toStringC or toStringOS or similar and actually be an alias for one of many character encodings:
 toUTF8
 toUTF16
 toUTF32
 toISOxxx
 toISOxxx
 toISOxxx
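The shape of such an encoding-explicit API can be sketched in Python (the function names mirror the hypothetical D aliases above and are invented for illustration; Python's codec machinery stands in for conversions a D library would have to implement):

```python
# Hypothetical encoding-explicit conversions, mirroring the proposed
# toUTF8/toUTF16/toUTF32/toISOxxx family.
def to_utf8(s: str) -> bytes:
    return s.encode("utf-8")

def to_utf16(s: str) -> bytes:
    return s.encode("utf-16-le")   # byte order must be chosen explicitly

def to_utf32(s: str) -> bytes:
    return s.encode("utf-32-le")

def to_iso8859_1(s: str) -> bytes:
    # Fails loudly for characters outside the target charset --
    # which is exactly the point of naming the encoding.
    return s.encode("iso-8859-1")

s = "Äiti"
assert to_utf8(s) == b"\xc3\x84iti"
assert len(to_utf16(s)) == 8       # 4 code units * 2 bytes
assert len(to_utf32(s)) == 16
assert to_iso8859_1(s) == b"\xc4iti"
```

The point of the design is that the caller must know what the receiving end expects, so the encoding appears in the function name rather than being hidden behind a generic `toStringz`.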

>> and turn all the C 'char' parameter types into 'byte'?
>
> Yes, that's what we'll eventually have to do -- like it or not.

I agree.

>> I suspect this change would cause a bunch of compile errors for
>> people, but a large percentage of those may be bugs waiting to happen
>
> As I see it, the current state of "UTF support" in D is in such a shape, that whatever we do to fix it, is just bound to break reams of code.

Perhaps not as much if the ASCII/UNICODE library flag/version was implemented and ASCII was the default. Then we could leave the current string handling in the ASCII version and "do it right" in the UNICODE version.

> My only hope is that we get it over and done with, before it's too late. The other thing I worry about is, if we "release" D before all of this is fixed, then a lot of "D opponents" will have enough fodder to hurt us real bad before the dust settles.

Yeah, I think everyone here wants to see D succeed. Contrast that with the rest of the world, which couldn't care less and will likely take even a small hint of weakness as an excuse to dismiss D (to their loss, IMO).

>>  I did, but only because a year or so ago Arcane Jill was stomping
>> round this place righting all the misinformation and confusion about
>> unicode issues and the D types etc.
>
> (Man I hope she's alive. The last few posts from her didn't make me feel easy at all.)

Me too, come back Arcane Jill!

>> The concerns you've voiced above are the reasons a lot of people have
>> proposed that a string type is required, in fact I'm surprised you haven't done the same.
>
> Ouch. I was counting on nobody noticing! :-)
>
>   - Walter knows, I don't have to state the obvious.
>
>   - The longer we can discuss this without bringing up the ST word, the longer we can keep at it.
>
>   - We really need to chart the territory thoroughly before writing to Santa.
>
>   - Currently I have no idea whether we need a Type and/or a Class.

I think the fact that D enforces no programming style (like OO) means it has to be a type or a set of functions to operate on the existing types.

>> A string type could theoretically encapsulate string data in any of
>> the 3 types and provide access on a character by character basis
>> instead of byte by byte. It could provide a strlen style method which
>> would actually be correct for multibyte char[] data, and so on and so
>> forth.
>
>   - And, I'm still not sure whether this should be in the language at all, or just a library thing.

Yeah, it's a tough call. Does D "support" Unicode in its current state? It provides the types, and it doesn't stop you writing the code to correctly handle it, but that code doesn't exist and/or is fragmented; some would argue that's not "support" at all.

Regan
November 15, 2005
Regan Heath wrote:
> On Tue, 15 Nov 2005 00:23:47 +0200, Georg Wrede <georg.wrede@nospam.org>  wrote:
> 
>>> I think we've painted ourselves in the corner by calling the UTF-8  entity "char"!!
>>
>> Given:
>>
>>      char[10] foo;
>>
>> What is the storage capacity of foo?
>>
>>   - is it 10 UTF-8 characters
>>   - is it 2.5 UTF-8 characters
>>   - "it depends"
>>   - something else
> 
> "something else", 10 UTF-8 'codepoints' where one or more codepoints make up a single grapheme/character. (I believe this is the correct terminology)

I wonder how many different answers one would get with a poll.
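The three layers -- code units, codepoints, graphemes -- can be poked at directly; a Python sketch, purely illustrative:

```python
import unicodedata

# "Ä" built from two codepoints: "A" followed by COMBINING DIAERESIS.
s = "A\u0308"
assert len(s.encode("utf-8")) == 3   # UTF-8 code units (bytes): what char[] stores
assert len(s) == 2                   # codepoints
# Visually it is a single grapheme; NFC folds it into one codepoint:
assert unicodedata.normalize("NFC", s) == "\u00c4"
```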

>> Another question:
>>
>> How much storage capacity should I allocate here:
>>
>>     char[?] bar = "Äiti syö lettuja.";
>>        // Finnish for "Mother eats pancakes." :-)
>>     char[?] baz = "ﮔﯛﺌﯔﺡﺠﮗﮝﮱﺼﺶ";
>>        // I hope the al Qaida or the CIA don't knock on my door,
>>        // I hereby officially state I have no idea what I wrote.
> 
> Hard answer: As it turns out, by trial and error I got:
> 
>     char[33] baz = "ﮔﯛﺌﯔﺡﺠﮗﮝﮱﺼﺶ";

Here's one you'd love: Guess what the following prints:

    char[] baz = "ﮔﯛﺌﯔﺡﺠﮗﮝﮱﺼﺶ";
    size_t len = baz.length;
    writefln(len);

Another one: did you know that it is impossible on some machines to
select that line from the beginning up until the middle of the Arabic
string?

Likewise, on some others, copying and pasting that string results in it
coming out backward.

Lawrence would have loved this: How do you know it's Arabic in the first place? Walk over it with the cursor keys, and you'll see a mirage. (Works on my Win-2k and FC-4 laptops.)

> Another interesting piece of information (you're gonna love this)
> there are, in some(many?) cases, actually several ways to represent
> the same grapheme in the same encoding, i.e. in UTF-8 there is more
> than one way to represent a single graphemes, meaning that the
> storage space (the byte length) of the char[] will change depending
> on which one is chosen.

Ugh! This has turned into a regular graduate class. No wonder people feel it's a huge issue.

> The reason this can happen is that graphemes can be made up from
> their requisite parts, example, the Ä is likely an A followed by the codepoint which means "add two dots above it". Some of the more complicated characters can therefore be made up of different combinations of parts in a different order, I believe (someone
> correct me if I am talking rubbish).

I was so hoping you remembered wrong, but yes, even that!

> I believe there is a rule that shortest possible one is deemed 'correct' for purposes like ours shown above, but if you obtain the strings from a file say, you may encounter the same 'string' represented in UTF-8 but find that the char[] are different lengths.

Aarrghh, and how am I supposed to figure out which of a number of ways would be the shortest possible? Even worse, if there is a "ready made" character for it, how would I know? And what about string lookup?

Yes, I just love it!
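This is exactly what Unicode's normalization forms address: NFC gives the precomposed ("shortest") representation, NFD the fully decomposed one, and lookups must normalize both sides first. A Python sketch of the problem and the fix (illustrative only):

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "Äiti")   # precomposed Ä: U+00C4
nfd = unicodedata.normalize("NFD", "Äiti")   # A + U+0308 COMBINING DIAERESIS

# Same text to the reader, different storage lengths:
assert len(nfc.encode("utf-8")) == 5
assert len(nfd.encode("utf-8")) == 6

# Naive string lookup therefore fails...
assert nfc != nfd
# ...unless both sides are normalized to the same form first:
assert unicodedata.normalize("NFC", nfd) == nfc
```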

---

$ man utf-8

on my machine says Unicode 3.1 compatible programs are _required_ to not accept the non-shortest forms -- for security reasons!
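The security point is easy to demonstrate: 0xC0 0xAF is an overlong (non-shortest) two-byte encoding of '/', the classic trick for smuggling path separators past naive validators, and a conforming decoder must reject it. A Python check:

```python
# The only valid UTF-8 encoding of '/' (U+002F) is the single byte 0x2F.
assert "/".encode("utf-8") == b"\x2f"

# The overlong two-byte form must be rejected, not decoded to '/':
rejected = False
try:
    b"\xc0\xaf".decode("utf-8")
except UnicodeDecodeError:
    rejected = True   # exactly the behaviour the man page requires
assert rejected
```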

Another "love this"-thing I found there:

"UTF-8 encoded UCS characters may be up to six bytes long"!

In practice one uses UTF-8 only for Unicode, which has no characters above 0x10ffff, so there one never needs more than four bytes per character, let alone six.
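A quick check of that claim in Python: the very last Unicode codepoint still fits in four UTF-8 bytes, and codepoints beyond the Unicode range cannot even be constructed, so five- and six-byte sequences never arise for Unicode text.

```python
# U+10FFFF, the highest Unicode codepoint, needs only four UTF-8 bytes.
assert len(chr(0x10FFFF).encode("utf-8")) == 4

# Codepoints beyond the Unicode range are refused outright.
out_of_range = False
try:
    chr(0x110000)
except ValueError:
    out_of_range = True
assert out_of_range
```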

However, nobody says that an application may not invent its own special characters, or something.

Good news:

That man page seems to assume that UTF-8 will become the standard on all POSIX systems, not just Linux.

(The Arabic line above actually does print "33"! Funny how things work when you least expect it.)
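For the record, the 33 is no accident: the string is eleven codepoints from the Arabic Presentation Forms blocks, each of which takes three bytes in UTF-8. A Python check (standing in for the D snippet, since char[].length counts bytes):

```python
baz = "ﮔﯛﺌﯔﺡﺠﮗﮝﮱﺼﺶ"
assert len(baz) == 11                    # codepoints
assert len(baz.encode("utf-8")) == 33    # bytes, i.e. what baz.length reports
assert all(len(c.encode("utf-8")) == 3 for c in baz)
```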
November 15, 2005
"Kris" <fu@bar.com> wrote in message news:dl0ngf$2f7s$1@digitaldaemon.com...
> Still, it troubles me that one has to decorate string literals in this
> manner, when the type could be inferred by the content, or by a cast()
> operator where explicit conversion is required. Makes it a hassle to
> create and use APIs that deal with all three array types.

One doesn't have to decorate them in that manner, one could use a cast instead.


November 15, 2005
"Kris" <fu@bar.com> wrote in message news:dl2p7i$11gf$1@digitaldaemon.com...
> Don't you think the type can be inferred from the content?

Not 100%.

> For the sake of
> discussion, how about this:
>
> 1) if the literal has a double-wide char contained, it is defaulted to a dchar[] type.
>
> 2) if not 1, and if the literal has a wide-char contained, it defaults to being of wchar[] type.
>
> 3) if neither of the above, it defaults to char[] type.
>
> Given the above, I can't think of a /common/ situation where casting would be thus be required.

I did consider that for a while, but eventually came to the conclusion that its behavior would be surprising to someone who did not very carefully read the spec. Also, the distinction between the various character types is not obvious when looking at the rendered text, further making it surprising.

I think it's better to now and then have to type in an extra character to nail down an ambiguity than to have a complicated set of rules to try and guess what the programmer's intent was.
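For reference, here is one possible reading of Kris's three rules, sketched in Python; the thresholds are my assumption (a "double-wide" char is taken to mean a codepoint beyond the BMP, i.e. one needing two UTF-16 code units, and a "wide" char anything beyond a single byte), and the function name is invented for illustration:

```python
def default_literal_type(s: str) -> str:
    """Sketch of the proposed literal-type inference (thresholds assumed)."""
    top = max(map(ord, s), default=0)
    if top > 0xFFFF:    # rule 1: needs a UTF-16 surrogate pair ("double-wide")
        return "dchar[]"
    if top > 0xFF:      # rule 2: beyond a single byte ("wide")
        return "wchar[]"
    return "char[]"     # rule 3: the default

assert default_literal_type("hello") == "char[]"
assert default_literal_type("漢字") == "wchar[]"
assert default_literal_type("\U0001D11E clef") == "dchar[]"   # musical G clef
```

The sketch also makes Walter's objection visible: editing a single character in a literal can silently change its inferred type.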


November 15, 2005
"Kris" <fu@bar.com> wrote in message news:dl3048$17nj$1@digitaldaemon.com...
> Ahh. GW made a suggestion in the other thread (d.learn) that would help
> here. The notion is that the default type of string literals can be
> implied through the file encoding (D can handle multiple file encodings),
> or might be set explicitly via a pragma of some kind. I think the former
> is an interesting idea.

But then the meaning of the program would change if the source was transliterated into a different UTF encoding. I don't think this is a good idea, as it would be surprising, and would work against someone wanting to edit code in the UTF format their editor happens to be good at.


November 15, 2005
"Kris" <fu@bar.com> wrote in message news:dl34fp$1c2j$1@digitaldaemon.com...
> The derived notion is still applicable though: just suppose there were a compiler option to specify what the default literal type should be ~ would the discussed changes not result in an effective resolution?

Sorry to throw another wet blanket on your efforts here :-( but that won't work either; I am loath to have the meaning of D programs altered with command line switches. I have much experience with the "signedness" of char in C/C++ being controllable with a command line switch, and nothing good ever comes of it.


November 15, 2005
"Regan Heath" <regan@netwin.co.nz> wrote in message news:opsz4oqoap23k2f5@nrage.netwin.co.nz...
> I can't really see a good solution to this "transcoding nightmare". Then again, maybe I am seeing a bigger problem than there really is: is transcoding going to be a significant problem (in efficiency terms) for a common application?

I agree with all your points, and for the last, I think it is being perceived as a bigger problem than it really is.


November 15, 2005
Walter Bright wrote:
> "Kris" <fu@bar.com> wrote in message news:dl0ngf$2f7s$1@digitaldaemon.com...
> 
>>Still, it troubles me that one has to decorate string literals in this
>>manner, when the type could be inferred by the content, or by a cast()
>>operator where explicit conversion is required. Makes it a hassle to
>>create and use APIs that deal with all three array types.
> 
> 
> One doesn't have to decorate them in that manner, one could use a cast
> instead.
> 
> 

ROFL! Good one :-D
November 15, 2005
Walter Bright wrote:
> "Kris" <fu@bar.com> wrote in message news:dl3048$17nj$1@digitaldaemon.com...
> 
>>Ahh. GW made a suggestion in the other thread (d.learn) that would help
>>here. The notion is that the default type of string literals can be
>>implied through the file encoding (D can handle multiple file encodings),
>>or might be set explicitly via a pragma of some kind. I think the former
>>is an interesting idea.
> 
> 
> But then the meaning of the program would change if the source was
> transliterated into a different UTF encoding. I don't think this is a good
> idea, as it would be surprising, and would work against someone wanting to
> edit code in the UTF format their editor happens to be good at.

Yes, that would be a problem. However, a pragma would work. I can well imagine you wouldn't care for that kind of 'solution' though. Still, there is surely some reasonable mechanism?
November 15, 2005
Walter Bright wrote:
> "Kris" <fu@bar.com> wrote in message news:dl34fp$1c2j$1@digitaldaemon.com...
> 
>>The derived notion is still applicable though: just suppose there were a
>>compiler option to specify what the default literal type should be ~ would
>>the discussed changes not result in an effective resolution?
> 
> 
> Sorry to throw another wet blanket on your efforts here :-( but that won't
> work either; I am loathe to have the meaning of D programs altered with
> command line switches. I have much experience with the "signedness" of char
> in C/C++ being controllable with a command line switch, and nothing good
> ever comes of it.

I suspected so, which is why I stated "suppose there were". Do you have some suggestions? I'm sure you could probably think of a handful.

As noted elsewhere, you might imagine for a moment that all of Phobos output (printf, writef, Stream) were sensitive to this issue with the char & char[] related types ~ I suspect you'd then think quite differently about the importance of it. Luckily, Phobos is not sensitive to this by design (via an avoidance of overloading, plus some small kludges where it is used). That doesn't equate to "all designs will have limited exposure". I can testify wholeheartedly to the contrary <g>

Sure, there are more important things to resolve, but this one is a festering sore for those who run into it. It will surely continue to be. It's surprising how some small but repetitive issue can shape one's opinions.

I ask, nay implore, you to give this some further consideration, Walter. And, thanks for chiming in ~ that doesn't happen very often :-)