until strange behavior (page 2)

June 02, 2013

Re: until strange behavior

Posted by Jonathan M Davis
in reply to Jack Applegame

On Monday, June 03, 2013 01:04:28 Jack Applegame wrote:
> On Sunday, 2 June 2013 at 20:50:31 UTC, Jonathan M Davis wrote:
> > http://stackoverflow.com/questions/12288465
> 
> Let's take a string of chars that contains UTF-8 text.
> Does front(str[]) automatically convert the first Unicode character
> to UTF-32 and return it?
> I made a test case and the answer is: "Yes, it does!"
> Maybe this makes sense, but such an implicit conversion confuses
> everyone I asked.
> Therefore, a string is not an ordinary array (in the Phobos context), but
> a special array with special processing rules.
>
> I'm moving from C++ and often ask myself: "Why does D have so many hidden, confusing things?"

The language treats strings as arrays of code units. The standard library treats them as ranges of code points. Yes, this can be confusing, but we need both. In order to operate on strings efficiently, they need to be made up of code units, but correctness requires code points. This means that the complexity is to a great extent an intrinsic part of dealing with strings properly. In C++, people usually just screw it up and treat char as if it were a character when in fact it's not. It's a piece of one.
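To make the two views concrete, here is a minimal sketch. The sample string is my own choice, and it assumes a Phobos version in which front still decodes narrow strings, which is the behavior under discussion in this thread:

```d
import std.range : front, walkLength;

void main()
{
    // U+00E9 ('e' with an acute accent) takes two UTF-8 code units; '!' takes one.
    string s = "\u00E9!";

    // Language view: an array of code units.
    static assert(is(typeof(s[0]) == immutable(char)));
    assert(s.length == 3);
    assert(s[0] == 0xC3);          // first code unit of U+00E9, not a whole character

    // Library (range) view: code points, decoded on the fly.
    static assert(is(typeof(s.front) == dchar));
    assert(s.front == '\u00E9');
    assert(s.walkLength == 2);
}
```

Indexing and .length count code units, while front and walkLength count code points, which is why the two disagree on anything outside ASCII.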

Whether we handled the complexity of code units vs. code points in the best manner is debatable, but it can't be made simple if you want both efficiency and correctness. A better approach might have been a string type which held the code units internally but operated on code points, so that everything worked on code points by default. However, the library stuff was added later, and Walter Bright tends to think that everyone should understand Unicode well, so the decisions he makes with regard to that aren't always the best (since most people don't understand Unicode well and don't want to care).

What we have actually works quite well, but it does require that you come to at least a basic understanding of the difference between code units and code points.

- Jonathan M Davis

Jonathan, thanks for the detailed response. I think in D we should not use strings for storing "non-text" data. For such things we must use byte[] or ubyte[], and ranges will then work as expected. Is that correct?

On Monday, June 03, 2013 01:29:35 Jack Applegame wrote:
> Jonathan, thanks for the detailed response.
>
> I think in D we should not use strings for storing "non text" data. For such things we must use byte[] or ubyte[]. And ranges will work as expected. Is it correct?

Exactly. If you want bytes, use ubyte[] or byte[] (probably ubyte[]). C++ lacks such a proper type (though C99 has uint8_t). char is specifically a UTF-8 code unit and should be treated as such.

Also, if you have text that you _know_ is ASCII, then it's more efficient to cast the string to immutable(ubyte)[] and operate on it that way (so that it doesn't do any decoding). That's not currently handled by the string-specific functions (though the general array and range-based ones will handle it just fine), but I expect that that will change.

- Jonathan M Davis
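The cast to immutable(ubyte)[] suggested above could look roughly like this. It is only a sketch: the sample text and variable names are made up for illustration, and it is only appropriate when the data really is ASCII (or is not text at all):

```d
import std.algorithm : count;
import std.range : front;

void main()
{
    string text = "hello";         // assumed to be pure ASCII

    // Reinterpret the same memory as bytes; no copy is made.
    auto bytes = cast(immutable(ubyte)[]) text;

    // The range primitives now yield plain bytes, with no UTF-8 decoding step.
    static assert(is(typeof(bytes.front) : ubyte));
    assert(bytes.front == 'h');
    assert(bytes.length == 5);

    // General range-based algorithms work on the byte view as on any other array.
    assert(bytes.count(cast(ubyte) 'l') == 2);
}
```

Since the element type is ubyte rather than char, the generic array and range functions treat this as an ordinary array and never try to decode it.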

On 06/02/2013 11:23 AM, ixid wrote:
>> There were long and heated discussions when this behavior was first
>> proposed.
>
> Do you have a link to any of the discussions? This is one of the few
> things that irritates me in D in that I feel like I am fighting to
> control the type unnecessarily.

I think the following link is the first time this idea was discussed:

http://forum.dlang.org/thread/hkd9nl$h08$1@digitalmars.com

Ali