February 10, 2005
"Anders F Björklund" <afb@algonet.se> wrote in message news:cugfi7$2sll$2@digitaldaemon.com...
> Matthew wrote:
>
>> Now you've got something of a point there. But, still, I'd prefer to leave it as char[]. The example you give is only 1-dim string / 2-dim char. What about higher dimensionality (of anything)? We could end up in the cow-dung of LPPPCSTR, etc.
>
> Ehrm, nooooo ? "The line must be drawn here". :-)
>
> I just wanted some easier basics, for beginners ?
> For the higher levels, you still need to learn
> about bit and char[] and other behind-the-scenes.

I know. And I like your sentiment. It's just that I think that the string-is-a-slice concept is so important and fundamental to D that it's more likely to be a disservice in the medium/long term.



February 10, 2005
On Thu, 10 Feb 2005 22:47:18 +0100, Anders F Björklund wrote:

> Derek wrote:
> 
>>>UTF-8 has two major advantages: 1) it's optimized for ASCII and
>>>does not require a BOM mark, making it compatible for files too
>>>2) it is Endian agnostic, no more X86 vs PPC gruffs like the others
>>>
>>>If you do a lot of Unicode, or non-Western languages, switch to ustr instead? It's equally well supported in all D std libraries. (the only downside of using ustr is that it's a little bigger/slower)
>> 
>> One cannot easily address individual code points using utf8. For example...
>> 
>>   char[] SomeText;
>> 
>> You cannot be sure whether SomeText[5] addresses the beginning of a code point or not. Remember that code points in utf8 are variable length, but are fixed length in utf32.
> 
> This is not that much of a problem, since you should not address individual code points anyway but treat the code units as a string.

I obviously do a different sort of programming to you. I often need to look at individual code points (i.e. characters) in a string.
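Derek's concern can be shown in a few lines of D. (A minimal sketch using present-day std.utf names; the 2005 Phobos API differed slightly.)

```d
import std.utf : decode;

void main()
{
    string s = "naïve";            // 'ï' takes two UTF-8 code units
    assert(s.length == 6);         // 6 code units for only 5 characters

    // s[3] is the trailing byte of 'ï', not the start of a code point
    assert((s[3] & 0xC0) == 0x80); // continuation bytes look like 10xxxxxx

    // std.utf.decode reads a whole code point and advances the index
    size_t i = 2;
    dchar c = decode(s, i);
    assert(c == 'ï');
    assert(i == 4);                // skipped both code units of 'ï'
}
```

So a plain index into a char[] gives you a code unit, which may or may not be a whole character.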

> See http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode:
> 
>> Code-point boundaries, iteration, and indexing are very fast with UTF-32. Code-point boundaries, accessing code points at a given offset, and iteration involve a few extra machine instructions for UTF-16; UTF-8 is a bit more cumbersome. Indexing is slow for both of them, but in practice indexing by different code units is done very rarely, except when communicating with specifications that use UTF-32 code units, such as XSL.
>> 
>> This point about indexing is true unless an API for strings allows access only by code point offsets. This is a very inefficient design: strings should always allow indexing with code unit offsets.

Yes, and a simple index into a char[] doesn't do this for you.
> 
> But char[] works fine for ASCII and wchar[] works fine for Unicode, *as long* as you watch out for any surrogates in the code units...
> 
> Which means you can have a fast standard route, and extra code to handle the exceptional characters if and when they occur ?

'exceptional' to whom? To latin-based alphabet users maybe, but not the great majority of the world's population.

>> So if using utf8, and one is doing some form of character manipulation, one should first convert to utf32, do the work, then convert back to utf8.
> 
> Yes, and this is easily done with a foreach(dchar c; SomeText) loop,
> as D can transparently handle the transition between char[] and dchar...

Except for "character manipulation" as 'foreach(inout dchar c; SomeText)'
is not permitted.
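The read-only decoding loop, and the dchar[] round trip you need for actual modification, look roughly like this (a sketch in present-day D; the character replacement is just an illustrative example):

```d
import std.utf : toUTF32, toUTF8;

void main()
{
    string text = "héllo";

    // Read-only access: foreach transparently decodes UTF-8 into dchars
    size_t count = 0;
    foreach (dchar c; text)
        ++count;
    assert(count == 5);          // 5 code points stored in 6 code units

    // foreach (inout dchar c; text) cannot work on a char[]: rewriting
    // one code point could change the byte length of the UTF-8 array.
    // The workaround: convert, edit the fixed-width dchars, convert back.
    dchar[] buf = toUTF32(text).dup;
    buf[1] = 'e';                // replace 'é' with a plain 'e'
    string result = toUTF8(buf);
    assert(result == "hello");
}
```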

> There are also readily available functions in the std.utf module: "encode" and "decode", and the toUTF8 / toUTF16 / toUTF32 wrappers.

Exactly my point. One needs to use these if *manipulating* characters in a utf8 or utf16 string.

> If you do a lot of loops like that, you can use a dchar[] (dstr alias) as an intermediate storage. But char[] and wchar[] are better for the long term.

'long term' meaning ??? Disk storage? RAM storage? Or until we finally get rid of all those silly 'alphabets' out there ;-)

-- 
Derek
Melbourne, Australia
11/02/2005 9:38:27 AM
February 10, 2005
Derek Parnell wrote:

>>But char[] works fine for ASCII and wchar[] works fine for Unicode,
>>*as long* as you watch out for any surrogates in the code units...
>>
>>Which means you can have a fast standard route, and extra code
>>to handle the exceptional characters if and when they occur ?
> 
> 'exceptional' to whom? To latin-based alphabet users maybe, but not the
> great majority of the world's population.

No, but it's mine ;-) (the ignorant westerner that I am)

Seriously, in my own language - Swedish - about 10% of the text
is non-ASCII, which means that Walter's optimized US-ASCII paths
run for 90% of the time. I assume this is much the same for the
rest of the Western-world languages that previously used ISO-8859-X...

Had I been using another alphabet, like Japanese or Chinese,
then UTF-16 would have been a nice bet. Surrogate characters do
not occur very often; in fact, they were only just introduced
in Java 1.5, since the original 16 bits of Unicode "overflowed".
So I think there's a 90-10 rule here too, with non-surrogates.

So I do think talking about "exceptions" is warranted ?

>>Yes, and this is easily done with a foreach(dchar c; SomeText) loop,
>>as D can transparently handle the transition between char[] and dchar...
> 
> Except for "character manipulation" as 'foreach(inout dchar c; SomeText)'
> is not permitted.

We are talking Copy-on-Write here, yes ? As in reading from read-only
and writing to read-write ? Otherwise you could use dchar[] instead,
and do a simple indexing. (or a foreach(inout dchar c; SomeText) on it)

And convert from UTF-8/UTF-16 on the way in, do all the processing
on the UTF-32 internal array, and convert back to UTF-8/UTF-16 on
the way out. (most routines now do include a dchar[] interface too,
you can even use dchar[] in switch/case statements - if you like)
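The convert-in / process-as-UTF-32 / convert-out pattern, including the switch/case usage Anders mentions, could be sketched like this (the Swedish words are just illustrative data):

```d
void main()
{
    dstring word = "två"d;        // a dchar (UTF-32) string literal

    // D allows switch over string types, including dstring, so
    // fixed-width UTF-32 text can be matched directly by value
    string meaning;
    switch (word)
    {
        case "en"d:  meaning = "one";   break;
        case "två"d: meaning = "two";   break;
        default:     meaning = "other"; break;
    }
    assert(meaning == "two");

    // Each dchar is a whole code point, so plain indexing is safe
    assert(word[2] == 'å');
}
```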

>>If you do a lot of loops like that, you can use a dchar[] (dstr alias) as an
>>intermediate storage. But char[] and wchar[] are better for the long term.
> 
> 'long term' meaning ??? Disk storage? RAM storage? Or until we finally get
> rid of all those silly 'alphabets' out there ;-)

Storage. Even with all the "silly alphabets" utilized, there are still 11
dead bits in each UTF-32 character (Unicode code points top out at U+10FFFF, which fits in 21 bits). UTF-16 is bound to be more efficient. Unless you are doing extinct-languages research or something? :-)

It's not just me... See http://www.unicode.org/faq/utf_bom.html#UTF32

--anders
February 11, 2005
Anders F Björklund wrote:

> Had I been using another alphabet, like Japanese or Chinese,
> then UTF-16 had been a nice bet. Surrogate characters are not
> occuring very often, in fact they were just now introduced
> in Java 1.5 since the original 16 bits of Unicode "overflowed".
> So I think there's a 90-10 rule here too, with non-Surrogates.

Does anyone here know if Japanese and Chinese use a lot of ASCII punctuation? If they do, then maybe UTF-8 is reasonable.

James McComb
February 12, 2005
"Derek Parnell" <derek@psych.ward> wrote in message news:uk7573l4ag4s.fkp4buj0rl0e.dlg@40tude.net...
> > This is not that much of a problem, since you should not address individual code points anyway but treat the code units as a string.
>
> I obviously do a different sort of programming to you. I often need to look at individual code points (i.e. characters) in a string.

Take a look at the functions std.utf.stride, std.utf.toUCSindex, and std.utf.toUTFindex. They provide the basic building blocks to manipulate UTF-8 strings as if they were an array of UCS characters.
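A quick sketch of those three functions in use (written against present-day std.utf; the word is just sample data):

```d
import std.utf : stride, toUCSindex, toUTFindex;

void main()
{
    string s = "smörgås";       // 'ö' and 'å' take two UTF-8 bytes each
    assert(s.length == 9);      // 9 code units holding 7 characters

    // stride: how many code units the code point at a byte offset uses
    assert(stride(s, 0) == 1);  // 's' is plain ASCII
    assert(stride(s, 2) == 2);  // 'ö' needs two bytes

    // Map between character (UCS) indices and code-unit (UTF) indices
    assert(toUTFindex(s, 4) == 5);  // 5th character, 'g', starts at byte 5
    assert(toUCSindex(s, 5) == 4);  // and back again
}
```

With these, a char[] can be stepped through character by character without converting the whole string to UTF-32 first.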

