February 10, 2005
"Anders F Björklund" <afb@algonet.se> wrote in message news:cugfi7$2sll$2@digitaldaemon.com...
> Matthew wrote:
>
>> Now you've got something of a point there. But, still, I'd prefer to leave it as char[]. The example you give is only 1-dim string / 2-dim char. What about higher dimensionality (of anything)? We could end up in the cow-dung of LPPPCSTR, etc.
>
> Ehrm, nooooo ? "The line must be drawn here". :-)
>
> I just wanted some easier basics, for beginners ?
> For the higher levels, you still need to learn
> about bit and char[] and other behind-the-scenes.

I know. And I like your sentiment. It's just that I think that the string-is-a-slice concept is so important and fundamental to D that it's more likely to be a disservice in the medium/long term.



February 10, 2005
On Thu, 10 Feb 2005 22:47:18 +0100, Anders F Björklund wrote:

> Derek wrote:
> 
>>>UTF-8 has two major advantages: 1) it's optimized for ASCII and
>>>does not require a BOM mark, making it compatible for files too
>>>2) it is Endian agnostic, no more X86 vs PPC gruffs like the others
>>>
>>>If you do a lot of Unicode, or non-Western languages, switch to ustr instead? It's equally well supported in all D std libraries. (the only downside of using ustr is that it's a little bigger/slower)
>> 
>> One cannot easily address individual code points using utf8. For example...
>> 
>>   char[] SomeText;
>> 
>> You cannot be sure whether SomeText[5] addresses the beginning of a code point or not. Remember that code points in utf8 are variable length, but are fixed length in utf32.
> 
> This is not that much of a problem, since you should not address individual code points anyway but treat the code units as a string.

I obviously do a different sort of programming to you. I often need to look at individual code points (i.e. characters) in a string.
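Derek's concern can be shown in a few lines of D. (A minimal sketch using present-day std.utf names; the 2005 Phobos API differed slightly.)

```d
import std.utf : decode;

void main()
{
    string s = "naïve";            // 'ï' takes two UTF-8 code units
    assert(s.length == 6);         // 6 code units for only 5 characters

    // s[3] is the trailing byte of 'ï', not the start of a code point
    assert((s[3] & 0xC0) == 0x80); // continuation bytes look like 10xxxxxx

    // std.utf.decode reads a whole code point and advances the index
    size_t i = 2;
    dchar c = decode(s, i);
    assert(c == 'ï');
    assert(i == 4);                // skipped both code units of 'ï'
}
```

So a plain index into a char[] gives you a code unit, which may or may not be a whole character.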

> See http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode:
> 
>> Code-point boundaries, iteration, and indexing are very fast with UTF-32. Code-point boundaries, accessing code points at a given offset, and iteration involve a few extra machine instructions for UTF-16; UTF-8 is a bit more cumbersome. Indexing is slow for both of them, but in practice indexing by different code units is done very rarely, except when communicating with specifications that use UTF-32 code units, such as XSL.
>> 
>> This point about indexing is true unless an API for strings allows access only by code point offsets. This is a very inefficient design: strings should always allow indexing with code unit offsets.

Yes, and a simple index into a char[] doesn't do this for you.
> 
> But char[] works fine for ASCII and wchar[] works fine for Unicode, *as long* as you watch out for any surrogates in the code units...
> 
> Which means you can have a fast standard route, and extra code to handle the exceptional characters if and when they occur ?

'exceptional' to whom? To latin-based alphabet users maybe, but not the great majority of the world's population.

>> So if using utf8, and one is doing some form of character manipulation, one should first convert to utf32, do the work, then convert back to utf8.
> 
> Yes, and this is easily done with a foreach(dchar c; SomeText) loop,
> as D can transparently handle the transition between char[] and dchar...

Except for "character manipulation" as 'foreach(inout dchar c; SomeText)'
is not permitted.
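The read-only decoding loop, and the dchar[] round trip you need for actual modification, look roughly like this (a sketch in present-day D; the character replacement is just an illustrative example):

```d
import std.utf : toUTF32, toUTF8;

void main()
{
    string text = "héllo";

    // Read-only access: foreach transparently decodes UTF-8 into dchars
    size_t count = 0;
    foreach (dchar c; text)
        ++count;
    assert(count == 5);          // 5 code points stored in 6 code units

    // foreach (inout dchar c; text) cannot work on a char[]: rewriting
    // one code point could change the byte length of the UTF-8 array.
    // The workaround: convert, edit the fixed-width dchars, convert back.
    dchar[] buf = toUTF32(text).dup;
    buf[1] = 'e';                // replace 'é' with a plain 'e'
    string result = toUTF8(buf);
    assert(result == "hello");
}
```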

> There are also readily available functions in the std.utf module: "encode" and "decode", and the toUTF8 / toUTF16 / toUTF32 wrappers.

Exactly my point. One needs to use these if *manipulating* characters in a utf8 or utf16 string.

> If you do a lot of loops like that, you can use a dchar[] (dstr alias) as an intermediate storage. But char[] and wchar[] are better for the long term.

'long term' meaning ??? Disk storage? RAM storage? Or until we finally get rid of all those silly 'alphabets' out there ;-)

-- 
Derek
Melbourne, Australia
11/02/2005 9:38:27 AM
February 10, 2005
Derek Parnell wrote:

>>But char[] works fine for ASCII and wchar[] works fine for Unicode,
>>*as long* as you watch out for any surrogates in the code units...
>>
>>Which means you can have a fast standard route, and extra code
>>to handle the exceptional characters if and when they occur ?
> 
> 'exceptional' to whom? To latin-based alphabet users maybe, but not the
> great majority of the world's population.

No, but it's mine ;-) (the ignorant westerner that I am)

Seriously, in my own language - Swedish - about 10% of the text
is non-ASCII, which means that Walter's optimized US-ASCII paths
run for 90% of the time. I assume this is much the same for the
rest of the Western-world languages that previously used ISO-8859-X...

Had I been using another alphabet, like Japanese or Chinese,
then UTF-16 would have been a nice bet. Surrogate characters do
not occur very often; in fact, they were only just introduced
in Java 1.5, since the original 16 bits of Unicode "overflowed".
So I think there's a 90-10 rule here too, with non-surrogates.

So I do think talking about "exceptions" is warranted ?

>>Yes, and this is easily done with a foreach(dchar c; SomeText) loop,
>>as D can transparently handle the transition between char[] and dchar...
> 
> Except for "character manipulation" as 'foreach(inout dchar c; SomeText)'
> is not permitted.

We are talking Copy-on-Write here, yes ? As in reading from read-only
and writing to read-write ? Otherwise you could use dchar[] instead,
and do a simple indexing. (or a foreach(inout dchar c; SomeText) on it)

And convert from UTF-8/UTF-16 on the way in, do all the processing
on the UTF-32 internal array, and convert back to UTF-8/UTF-16 on
the way out. (most routines now do include a dchar[] interface too,
you can even use dchar[] in switch/case statements - if you like)
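The convert-in / process-as-UTF-32 / convert-out pattern, including the switch/case usage Anders mentions, could be sketched like this (the Swedish words are just illustrative data):

```d
void main()
{
    dstring word = "två"d;        // a dchar (UTF-32) string literal

    // D allows switch over string types, including dstring, so
    // fixed-width UTF-32 text can be matched directly by value
    string meaning;
    switch (word)
    {
        case "en"d:  meaning = "one";   break;
        case "två"d: meaning = "two";   break;
        default:     meaning = "other"; break;
    }
    assert(meaning == "two");

    // Each dchar is a whole code point, so plain indexing is safe
    assert(word[2] == 'å');
}
```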

>>If you do a lot of loops like that, you can use a dchar[] (dstr alias) as an
>>intermediate storage. But char[] and wchar[] are better for the long term.
> 
> 'long term' meaning ??? Disk storage? RAM storage? Or until we finally get
> rid of all those silly 'alphabets' out there ;-)

Storage. Even with all the "silly alphabets" utilized, there are still 11
dead bits in each UTF-32 character (Unicode code points top out at U+10FFFF, which fits in 21 bits). UTF-16 is bound to be more efficient. Unless you are doing extinct-languages research or something? :-)

It's not just me... See http://www.unicode.org/faq/utf_bom.html#UTF32

--anders
February 11, 2005
Anders F Björklund wrote:

> Had I been using another alphabet, like Japanese or Chinese,
> then UTF-16 had been a nice bet. Surrogate characters are not
> occuring very often, in fact they were just now introduced
> in Java 1.5 since the original 16 bits of Unicode "overflowed".
> So I think there's a 90-10 rule here too, with non-Surrogates.

Does anyone here know if Japanese and Chinese use a lot of ASCII punctuation? If they do, then maybe UTF-8 is reasonable.

James McComb
February 12, 2005
"Derek Parnell" <derek@psych.ward> wrote in message news:uk7573l4ag4s.fkp4buj0rl0e.dlg@40tude.net...
> > This is not that much of a problem, since you should not address individual code points anyway but treat the code units as a string.
>
> I obviously do a different sort of programming to you. I often need to look at individual code points (i.e. characters) in a string.

Take a look at the functions std.utf.stride, std.utf.toUCSindex, and std.utf.toUTFindex. They provide the basic building blocks to manipulate UTF-8 strings as if they were an array of UCS characters.
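A quick sketch of those three functions in use (written against present-day std.utf; the word is just sample data):

```d
import std.utf : stride, toUCSindex, toUTFindex;

void main()
{
    string s = "smörgås";       // 'ö' and 'å' take two UTF-8 bytes each
    assert(s.length == 9);      // 9 code units holding 7 characters

    // stride: how many code units the code point at a byte offset uses
    assert(stride(s, 0) == 1);  // 's' is plain ASCII
    assert(stride(s, 2) == 2);  // 'ö' needs two bytes

    // Map between character (UCS) indices and code-unit (UTF) indices
    assert(toUTFindex(s, 4) == 5);  // 5th character, 'g', starts at byte 5
    assert(toUCSindex(s, 5) == 4);  // and back again
}
```

With these, a char[] can be stepped through character by character without converting the whole string to UTF-32 first.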

