September 29, 2006
Anders F Björklund wrote:
> Chad J > wrote:
> 
>> char[] data; 
> 
> 
>>   dchar opIndex( int index )
>>   {
>>     foreach( int i, dchar c; data )
>>     {
>>       if ( i == index )
>>         return c;
>>
>>       i++;
>>     }
>>   }
> 
> 
> This code probably does not work as you think it does...
> 
> If you loop through a char[] using dchars (with a foreach),
> then the int will get the codeunit index - *not* codepoint.
> (the ++ in your code above looks more like a typo though,
> since it needs to *either* foreach i, or do it "manually")
> 
> import std.stdio;
> void main()
> {
>    char[] str = "Björklund";
>    foreach(int i, dchar c; str)
>    {
>      writefln("%4d \\U%08X '%s'", i, c, ""d ~ c);
>    }
> }
> 
> Will print the following sequence:
> 
>    0 \U00000042 'B'
>    1 \U0000006A 'j'
>    2 \U000000F6 'ö'
>    4 \U00000072 'r'
>    5 \U0000006B 'k'
>    6 \U0000006C 'l'
>    7 \U00000075 'u'
>    8 \U0000006E 'n'
>    9 \U00000064 'd'
> 
> Notice how the non-ASCII character takes *two* code units ?
> (if you expect indexing to use characters, that'd be wrong)
> 
> More at http://prowiki.org/wiki4d/wiki.cgi?CharsAndStrs
> 
> --anders

Ah. And yep, the i++ was a typo (oops).

So maybe something like:

  dchar opIndex( int index )
  {
    int i;
    foreach( dchar c; data )  // decodes code points, not code units
    {
      if ( i == index )
        return c;

      i++;
    }
    throw new Exception("opIndex: index out of range");
  }

The i is no longer the foreach's index, so the i++ isn't a typo anymore.

Thanks for the info.  I'll check out that FAQ a little later, gotta go.
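[The code-unit/code-point gap Anders demonstrates comes from UTF-8 itself rather than from D, so it can be checked in any language. A minimal sketch in Python, using the same name from his output:]

```python
# "Björklund" in UTF-8: 'ö' (U+00F6) occupies two code units,
# so byte indices and character indices diverge after it --
# exactly the 2 -> 4 jump in the printed sequence above.
s = "Björklund"
data = s.encode("utf-8")          # the raw code units, like D's char[]
assert len(s) == 9                # 9 code points
assert len(data) == 10            # but 10 UTF-8 code units
assert data[2:4] == b"\xc3\xb6"   # 'ö' is the two-byte sequence C3 B6
assert data[4:5] == b"r"          # so 'r' sits at byte index 4, not 3
```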
September 29, 2006
Chad J > wrote:
> Georg Wrede wrote:
> 
>> The secret is, there actually is a delicate balance between UTF-8 and the library string operations. As long as you use library functions to extract substrings, join or manipulate them, everything is OK. And very few of us actually either need to, or see the effort of bit-twiddling individual octets in these "char" arrays.
>>
> 
> But this is what I'm talking about... you can't slice them or index them.  I might actually index a character out of an array from time to time.  If I don't know about UTF, and I do just keep on coding, and I do something like this:
> 
> char[] str = "some string in nonenglish text";
> for ( int i = 0; i < str.length; i++ )
> {
>   str[i] = doSomething( str[i] );
> }
> 
> and this will fail right?
> 
> If it does fail, then everything is not alright.  You do have to worry about UTF.  Someone has to tell you to use a foreach there.

Yes. That's why I talked about you falling down once you realise Daddy's not holding the bike.

Part of UTF-8's magic lies in that it is amazingly easy to get working smoothly with truly minor tweaks to "formerly ASCII-only" libraries -- so that even the most exotic languages have no problem.

Your concerns about the for loop are valid, and expected. Now, IMHO, the standard library should take care of "all" the situations where you would ever need to split, join, examine, or otherwise use strings, "non-ASCII" or not. (And I really have no complaint (Walter!) about this.) Therefore, in no normal circumstances should you have to twiddle them yourself -- unless.

And this "unless" is exactly why I'm unhappy with the situation, too.

Problem is, _technology_wise_ the existing setup may actually be the best, considering ease of writing the library, ease of using it, robustness of both the library and users' code, and the headaches saved from programmers who either haven't heard of the issue (whether they're American or Chinese!), or who simply trust their lives with the machinery.

So, where's the actual problem???

At this point I'm inclined to say: the documentation, and the stage props! The latter meaning: exposing the fact that our "strings" are just arrays is psychologically wrong, and even more so is the fact that we're shamelessly storing entities of variable length in arrays which have no notion of such -- even worse, while we brag with slices!

If this had been a university course assignment, we'd have been thrown out of class, for both half baked work, and for arrogance towards our client, victimizing the coder.

The former meaning: we should not be like "we're bad enough to overtly use plain arrays for variable-length data; now if you have a problem with it, then go home and learn stuff, or else just trust us".

Both "documentation" and "stage props" ultimately meaning that the largest problem here is psychology, pedagogy, and education.

---

A lot would already be won by:

merely aliasing char[] to string, and discouraging other than guru-level folks from screwing with their internals. This alone would save a lot of Fear, Uncertainty and D-phobia.

The documentation should take pains in explaining up front that if you _really_ want to do Character-by-Character ops _and_ you live outside of America, then the Right way to do it (ehh, actually the Canonical Way), is to first convert the string to dchar[]. Period.

Then, if somebody else knows enough of UTF-8 and knows he can handle bit twiddling more efficiently than using the Canonical Way, with plain char[] and "foreignish", then let him. But let that be undocumented and Un-Discussed in the docs. Precisely like a lot of other things are. (And should be.) And will be. He's on his own, and he ought to know it.
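[The round trip the Canonical Way prescribes (roughly D's std.utf.toUTF32 and toUTF8) is language-independent; sketched in Python, where decoding bytes yields the same per-code-point view a dchar[] gives. The sample string is an arbitrary Swedish phrase chosen for its non-ASCII letters:]

```python
# Canonical Way: widen the byte string (char[]) to code points
# (dchar[]), do the per-character work there, then narrow it back.
raw = "några räksmörgåsar".encode("utf-8")   # what a char[] would hold
points = raw.decode("utf-8")                 # one element per code point
assert points[4] == "a"                      # indexing is now per character
transformed = points.upper()                 # safe character-by-character op
back = transformed.encode("utf-8")           # back to the compact byte form
```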

---

In other words, the normal programmer should believe he's working with black-box Strings, and he will be happy with it. That way he'll survive whether he's in Urduland or Boise, Idaho -- without ever needing to have heard about UTF or other crap.

Not until in Appendix Z of the manual should we ever admit that the Emperor's Clothes are just plain arrays, and we apologize for the breach of manners of storing variable length data in simple naked arrays. And here would be the right place to explain how come this hasn't blown up in our faces already. And, exactly how you'll avoid it too. (This _needs_ to contain an adequate explanation about the actual format of UTF-8.)

---

TO RECAP

The _single_ biggest strings-related disservice to our pilgrims is to

    lead them to believe, that D stores
    strings in something like utf8[]

internally.

Now that's an oxymoron, if I ever saw one. (If utf8[] was _actually_ implemented, it would probably have to be an alias of char[][]. Right? Right? What we have instead is ubyte[], which is _not_ the same as utf8[].) (Oh, and if it ever becomes obvious that not _everybody_ understood this, then that in itself simply proves my point here.)

(*1)

And the fault lies in the documentation, not the implementation!

This results in braincell-hours wasted, precisely as many as everybody has to waste before they realise that the acronym RAII is a filthy lie. Akin only to the former "German _Democratic_ Republic". Only a politician should be capable of this kind of deception.

Ok, nobody is doing it on purpose. Things being too clear to oneself often make it hard to find ways to express them to new people. (Happens every day at the Math department! :-( ) And since all those in the know are unable to see it, and all those not in the know are too, both groups might think it's the thing itself that is "the problem", and not merely the chosen _presentation_ of it.

#################

Sorry for sounding Righteous, arrogant and whatever. But this really is a 5 minute thing for one person to fix for good, while it wastes entire days or months _per_person_, from _every_ non-defoiled victim who approaches the issue. Originally I was one of them: hence the aggression.

-------------------------------------------


(*1) Even I am not simultaneously both literally and theoretically right here. Those who saw it right away, probably won't mind, since it's the point that is the issue here.

Now, having to write this disclaimer, IMHO simply again underlines the very point attempted here.
September 29, 2006
Anders F Björklund wrote:
> If you're willing to handle the "surrogates", then UTF-16 is a rather
> good trade-off between the default UTF-8 and wasteful UTF-32 formats?
> A downside is that it is not "ascii-compatible" (has embedded NUL chars)
> and that it is endian-dependent unlike the more universal UTF-8 format.

Problem is, using 16-bit you sort-of get away with _almost_ all of it. But as a pay-back, the day your 16 bits don't suffice, you're in deep crap. And that day _will_ come.
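[The "day your 16 bits don't suffice" is any code point above U+FFFF, which UTF-16 must encode as a surrogate pair. The arithmetic can be checked in any language; a Python sketch using U+1D11E (MUSICAL SYMBOL G CLEF) as the example:]

```python
# A supplementary-plane character breaks the "one 16-bit unit per
# character" assumption: UTF-16 needs a surrogate pair for it.
clef = "\U0001D11E"                          # MUSICAL SYMBOL G CLEF
assert len(clef.encode("utf-16-le")) == 4    # two 16-bit units (a pair)
assert len(clef.encode("utf-32-le")) == 4    # one 32-bit unit
assert len(clef.encode("utf-8")) == 4        # four 8-bit units
```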
September 29, 2006
Georg Wrede wrote:

> The secret is, there actually is a delicate balance between UTF-8 and the library string operations. As long as you use library functions to extract substrings, join or manipulate them, everything is OK. And very few of us actually either need to, or see the effort of bit-twiddling individual octets in these "char" arrays.
> 
> So things just keep on working.
> 

I agree, but I disagree that there is a problem, or that utf-8 is a bad choice, or that perhaps char[] or string should be called utf8 instead.

As a note here, I actually had a page of text localised into Chinese last week - it came back as a utf8 text file.

The only thing with utf8 is that glyphs aren't represented by a single char.  But utf16 is no better!  And even utf32 codepoints can be combined into a single rendered glyph.  So truncating a string at an arbitrary index is not going to slice on a glyph boundary.
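[The point that even utf32 doesn't give "one index per glyph" is easy to demonstrate with a combining character; a quick Python sketch (the property itself is encoding-independent):]

```python
import unicodedata

# 'é' can be one code point (U+00E9) or two ('e' + combining acute),
# yet both render as a single glyph -- so slicing the two-code-point
# form at index 1 splits a glyph even with full 32-bit code points.
composed = "\u00e9"                  # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"               # 'e' + COMBINING ACUTE ACCENT
assert len(composed) == 1
assert len(decomposed) == 2
assert unicodedata.normalize("NFC", decomposed) == composed
```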

However, it doesn't mean utf8 is ASCII mixed with "garbage" bytes.  That garbage is a unique series of bytes that represent a codepoint.  This is a property not found in any other encoding.

As such, everything works, strstr, strchr, strcat, printf, scanf - for ASCII, normal unicode, and the "Astral planes".  It all just works.  The only thing that breaks is if you tried to index or truncate the data by hand.

But even that mostly works, you can iterate through, looking for ASCII sequences, chop out ASCII and string together more stuff, it all works because you can just ignore the higher order bytes.  Pretty much the only thing that fails is if you said "I don't know what's in the string, but chop it off at index 12".
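[The "unique series of bytes" property above can be verified directly: in UTF-8, every byte of a multibyte sequence has its high bit set, so an ASCII byte can never appear inside one. A Python sketch (the byte ranges are fixed by the UTF-8 definition, not by the language):]

```python
# Every byte belonging to a multibyte UTF-8 sequence is >= 0x80,
# so byte-level tools never mistake part of a kanji for ASCII.
text = "日本語 abc".encode("utf-8")
multibyte_bytes = "日本語".encode("utf-8")     # three 3-byte sequences
assert all(b >= 0x80 for b in multibyte_bytes)
# Hence a plain byte-level search finds the ASCII substring correctly:
assert text.find(b"abc") == 10               # 9 kanji bytes + 1 space
```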

September 30, 2006
Johan Granberg wrote:
> Georg Wrede wrote:
> 
>> Wrong.
>>
>> And that's precisely what I meant about the Daddy holding bike allegory a few messages back.
>>
>> The current system seems to work "by magic". So, if you do go to China, it'll "just work".
>>
>> At this point you _should_ not believe me. :-) But it still works.
>>
>> ---
> 
> 
> But is this not a needless source of confusion, that could be eliminated by defining char as "big enough to hold a unicode code point" or something else that eliminates the possibility to incorrectly divide utf tokens.
> 
> I will have to try using char[] with non ascii characters though I have been using dchar for that up till now.

You might begin with pasting this and compiling it:

import std.stdio;

void main()
{
	int öylätti;
	int ШеФФ;

	öylätti = 37;
	ШеФФ = 19;

	writefln("Köyhyys 1 on %d ja nöyrä 2 on %d, että näin.", öylätti, ШеФФ);
}

It will compile, and run just fine. (The source file having been read into DMD as a single big string, and then having gone through comment removal, lexing, parsing, compiling, and optimizing, and finally the variable names having found their way into the executable. Even though the front end has been written in D itself, with simply char[] all over the place.)

(Then you might see that the Windows "command prompt window" renders the output wrong, but that's only because Windows itself doesn't handle UTF-8 right in the Command Window.)

The next thing you might do is to write a grep program (that takes a file as input and writes the matching lines as output). Write the program as if you had never heard this discussion. Then feed it the Kalevala in Finnish, or Mao's Red Book in Chinese. Should still work.

As long as you don't start tampering with the individual octets in strings, you should be just fine. Don't think about UTF and you'll prosper.
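[The grep experiment works for the reason given earlier in the thread: a valid UTF-8 needle can only match at code-point boundaries. A minimal byte-oriented sketch in Python (the `grep` function and the Kalevala lines are illustrative, not a real program):]

```python
# A "grep" written as if UTF-8 had never been mentioned: it matches
# raw bytes, yet works on Finnish input, because UTF-8 substring
# matches always fall on code-point boundaries.
def grep(needle: bytes, lines):
    return [line for line in lines if needle in line]

kalevala = ["Vaka vanha Väinämöinen".encode("utf-8"),
            "laulaja iänikuinen".encode("utf-8")]
hits = grep("äinä".encode("utf-8"), kalevala)
assert hits == [kalevala[0]]     # only the line containing "äinä"
```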
September 30, 2006
On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:

> Derek Parnell wrote:
>> And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can.
> 
> I believe it's there. I don't think std::string or java.lang.String have anything over it.

I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurrence of the byte value for the letter 'a' which may well be inside a Japanese character. It looks for byte subsets rather than character subsets.

>> And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems?
> 
> I don't think it'll cause problems, it just seems pointless.

It may very well be pointless for your way of thinking, but your language is also for people who may not necessarily think in the same manner as yourself. I, for example, think there is a point to having my code read like its dealing with strings rather than arrays of characters. I suspect I'm not alone. We could all write the alias in all our code, but you could also be helpful and do it for us - like you did with bit/bool.

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
September 30, 2006
On Sat, 30 Sep 2006 03:03:02 +0300, Georg Wrede wrote:


> As long as you don't start tampering with the individual octets in strings, you should be just fine. Don't think about UTF and you'll prosper.

The Build program does lots of 'tampering'. I had to rewrite many standard routines and create some new ones to deal with unicode characters because the standard ones just don't work. And Build still fails to do some things correctly (e.g. case insensitive compares) but that's on the TODO list.

I have to think about UTF because it doesn't work unless I do that.

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
September 30, 2006
Geoff Carlton wrote:
> Georg Wrede wrote:
> 
>> The secret is, there actually is a delicate balance between UTF-8 and the library string operations. As long as you use library functions to extract substrings, join or manipulate them, everything is OK. And very few of us actually either need to, or see the effort of bit-twiddling individual octets in these "char" arrays.
>>
>> So things just keep on working.
>>
> 
> I agree, but I disagree that there is a problem, or that utf-8 is a bad choice, or that perhaps char[] or string should be called utf8 instead.
> 
> As a note here, I actually had a page of text localised into Chinese last week - it came back as a utf8 text file.
> 
> The only thing with utf8 is that glyphs aren't represented by a single char.  But utf16 is no better!  And even utf32 codepoints can be combined into a single rendered glyph.  So truncating a string at an arbitrary index is not going to slice on a glyph boundary.
> 
> However, it doesn't mean utf8 is ASCII mixed with "garbage" bytes.  That garbage is a unique series of bytes that represent a codepoint.  This is a property not found in any other encoding.
> 
> As such, everything works, strstr, strchr, strcat, printf, scanf - for ASCII, normal unicode, and the "Astral planes".  It all just works.  The only thing that breaks is if you tried to index or truncate the data by hand.
> 
> But even that mostly works, you can iterate through, looking for ASCII sequences, chop out ASCII and string together more stuff, it all works because you can just ignore the higher order bytes.  Pretty much the only thing that fails is if you said "I don't know what's in the string, but chop it off at index 12".

Yes.
September 30, 2006
Walter Bright wrote:
> Derek Parnell wrote:
>> And is it there yet? I mean, given that a string is just a lump of text, is
>> there any text processing operation that cannot be simply done to a char[]
>> item? I can't think of any but maybe somebody else can.
> 
> I believe it's there. I don't think std::string or java.lang.String have anything over it.
> 
>> And if a char[] is just as capable as a std::string, then why not have an
>> official alias in Phobos? Will 'alias char[] string' cause anyone any
>> problems?
> 
> I don't think it'll cause problems, it just seems pointless.

Hi,
The main reasons I think are these:

It simplifies the initial examples, particularly main(string[]), and maps such as string[string].  More complex examples are a map of words to text lines, string[][string], rather than char[][][char[]].

It clarifies the actual use of the entity.  It is a text string, not just a jumbled array of characters.  Arrays of char can be used for other things, such as the set of player letters in a scrabble game.  A string carries the additional meaning that we know it is a text string.  The alias reflects that intent.

Given a user wants to use a string, there is no need to expose the implementation detail of how strings are done in D.  Perhaps in Perl, strings are a linked list of shorts, but that doesn't mean you'd have list<short> all over the place.

Use of char[] and char[][] looks like low level C.  It has also been noted that it encourages char based indexing, which is not a good thing for utf8.

Anyway, hope one of those points grabbed you!
Geoff
September 30, 2006
Derek Parnell wrote:
> I'm pretty sure that the phobos routines for search and replace only work
> for ASCII text. For example, std.string.find(japanesetext, "a") will nearly
> always fail to deliver the correct result. It finds the first occurance of
> the byte value for the letter 'a' which may well be inside a Japanese
> character. It looks for byte-subsets rather than character sub-sets.

I take it that you mean that the bit pattern, or byte, 'a' (as in 0x61) may be found within a Japanese multibyte glyph? Or even a very long Japanese text.

That is not correct.

The designers of UTF-8 knew that this would be dangerous, and created UTF-8 so that such _will_not_happen_. Ever.

Therefore, something like std.string.find() doesn't even have to know about it.

Basically, std.string.find() and comparable functions only have to receive two octet sequences, and see where one of them first occurs in the other. No need to be aware of UTF or ASCII. For all we know, the strings may even be in EBCDIC. Still works.

If the strings themselves are valid (in whichever encoding you have chosen to use), then the result will also be valid.

((For the sake of completeness, here I've restricted the discussion to the versions of such functions that accept ubyte[] compatible input (obviously including char[]). For those taking 16 or 32 bits, and especially if we deliberately feed input of the wrong width to any of these, the results will of course be more complicated.))
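[Georg's claim about std.string.find() can be checked mechanically: the byte 0x61 ('a') never occurs inside a Japanese multibyte sequence, so a byte-level find lands on the real 'a'. A Python sketch (Python used only because the UTF-8 byte layout is identical in every language):]

```python
# 'a' (0x61) cannot appear inside the UTF-8 encoding of Japanese text,
# so a UTF-unaware byte-level search still returns the correct offset.
text = "こんにちは a world".encode("utf-8")
japanese = "こんにちは".encode("utf-8")       # five 3-byte hiragana
assert 0x61 not in japanese                  # no stray 'a' byte in there
assert text.find(b"a") == len(japanese) + 1  # the real 'a', after the space
```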