February 08, 2012
Jonathan M Davis wrote:

> thanks to how unicode works

This does not mean that the data structure representing a sequence of "letters" has to follow exactly the "workings" you cited above. That data structure must only enable them efficiently. If a requirement for sequences of letters is that a sequence `s' of letters indexed by some natural number `n' yields the letter `s[n]', and that is not efficiently possible, then Unicode and its "workings" are as badly designed as the alphabet Gutenberg had to use to produce books:

Take an ancient book `b' at random and a letter `c' at random. Then try to verify that `b[314159] == c'. Of course, you are allowed to read only one letter.
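
In D terms, a minimal sketch of the indexing problem (the sample string is just an illustration):

import std.range : walkLength;

void main()
{
    string s = "größer";        // 6 letters
    assert(s.length == 8);      // but 8 UTF-8 code units
    assert(s.walkLength == 6);  // and 6 code points
    // s[2] is the first code unit of 'ö', not the third letter:
    // indexing by code unit does not satisfy the `s[n]' requirement.
    assert(s[2] == 0xC3);
}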

-manfred

February 08, 2012
On Wednesday, February 08, 2012 09:35:28 H. S. Teoh wrote:
> On Wed, Feb 08, 2012 at 08:32:32AM -0800, Jonathan M Davis wrote: [...]
> 
> > Except that char[] is _not_ an array of characters. It's an array of code units. There is a _big_ difference. Not even dchar[] is an array of characters. It's both an array of code units and an array of code points, but not even that quite gets you characters (though at this point, Phobos pretty much treats a code point as if it were a character). If you want a character, you need a grapheme (which could be multiple code points). _That_ is where the problem comes in.
> > 
> > You can definitely do array operations on strings. In fact, it can be very desirable to do so if you want to process strings efficiently. But if you treat them like you would ubyte[], you're in for a heap of trouble thanks to how unicode works.
> 
> [...]
> 
> Except that the point of my code was to fix the byte order so that the code units can be correctly interpreted. I suppose I really should be using ubyte[] for that instead, and perhaps use a union to translate it to char[] when I call decode().

You shouldn't normally have to worry about byte order with char[] at all, so I don't know what you'd be doing that would result in the code units being in the wrong order. But char is a UTF-8 code unit by definition, so if you're doing something that involves char[] not being a valid array of UTF-8 code units, you almost certainly want to be using ubyte[] instead. There's a lot of stuff in Phobos which will throw if you hand it invalid UTF-8.
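
For instance, a minimal sketch (std.utf.validate and std.exception.assertThrown are the relevant Phobos helpers):

import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    // char[] is UTF-8 by definition; Phobos assumes validity and
    // throws when asked to decode an invalid sequence.
    char[] bad = [cast(char) 0xFF, cast(char) 0xFE]; // not valid UTF-8
    assertThrown!UTFException(validate(bad));

    // Raw bytes of unknown encoding belong in ubyte[]; reinterpret
    // as char[] only once they're known to be valid UTF-8.
    ubyte[] raw = [0x68, 0x69]; // "hi"
    char[] fixed = cast(char[]) raw;
    validate(fixed); // fine
}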

- Jonathan M Davis
February 08, 2012
On Wednesday, February 08, 2012 17:52:17 Manfred Nowak wrote:
> Jonathan M Davis wrote:
> > thanks to how unicode works
> 
> This does not mean that the data structure representing a sequence of "letters" has to follow exactly the "workings" you cited above. That data structure must only enable them efficiently. If a requirement for sequences of letters is that a sequence `s' of letters indexed by some natural number `n' yields the letter `s[n]', and that is not efficiently possible, then Unicode and its "workings" are as badly designed as the alphabet Gutenberg had to use to produce books:
> 
> Take an ancient book `b' at random and a letter `c' at random. Then try to verify that `b[314159] == c'. Of course, you are allowed to read only one letter.

With Unicode, it is impossible to have a random-access range of characters unless you have a range of graphemes - which would require a grapheme to be a struct of some kind representing a character - either that or an array of arrays. So, you could have

char[][]

where each char[] is a grapheme. But as long as you're dealing with an array of code units or code points like we do now, efficient random access of characters is impossible. Phobos currently takes the tack of treating a code point as a character, which _mostly_ works, but it's not correct.
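
For illustration, a sketch of counting actual characters with std.uni.byGrapheme (a Phobos addition newer than this thread):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one character,
    // two code points, three UTF-8 code units.
    string s = "e\u0301";
    assert(s.length == 3);                // code units
    assert(s.walkLength == 2);            // code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1); // graphemes, i.e. characters
}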

And while Unicode could definitely have been designed better IMHO (e.g. enforcing a canonical order for combining code points and _not_ having multiple ways to generate the same character), the core problem is that you're forced to have variable-length encodings. It wouldn't be feasible to have an integral value which represented _every_ single character, because of the combinatorial explosion caused by code points which modify other code points (e.g. subscript, superscript, cedilla, etc.). So, there are problems which are simply inherent to designing Unicode and which cannot be avoided no matter how good a job you do. And, of course, there are issues with the design on top of that.
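
As a sketch of the "multiple ways to generate the same character" problem, normalization is what makes the two encodings compare equal (std.uni.normalize also postdates this thread):

import std.uni : NFC, normalize;

void main()
{
    string composed   = "\u00E9";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    string decomposed = "e\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT
    assert(composed != decomposed);                // bitwise different
    assert(normalize!NFC(decomposed) == composed); // equal once normalized
}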

- Jonathan M Davis