February 10, 2005 Re: Evolution (Hello World)
Posted in reply to Anders F Björklund

"Anders F Björklund" <afb@algonet.se> wrote in message news:cugfi7$2sll$2@digitaldaemon.com...
> Matthew wrote:
>
>> Now you've got something of a point there. But, still, I'd prefer to
>> leave it as char[]. The example you give is only 1-dim string / 2-dim
>> char. What about higher dimensionality (of anything)? We could end up
>> in the cow-dung of LPPPCSTR, etc.
>
> Ehrm, nooooo ? "The line must be drawn here". :-)
>
> I just wanted some easier basics, for beginners ?
> For the higher levels, you still need to learn
> about bit and char[] and other behind-the-scenes.

I know. And I like your sentiment. It's just that I think the string-is-a-slice concept is so important and fundamental to D that it's more likely to be a disservice in the medium/long term.
February 10, 2005 Re: Evolution (Hello World)
Posted in reply to Anders F Björklund

On Thu, 10 Feb 2005 22:47:18 +0100, Anders F Björklund wrote:

> Derek wrote:
>
>>> UTF-8 has two major advantages: 1) it's optimized for ASCII and
>>> does not require a BOM mark, making it compatible for files too
>>> 2) it is Endian agnostic, no more X86 vs PPC gruffs like the others
>>>
>>> If you do a lot of Unicode, or non-Western languages, switch to ustr
>>> instead? It's equally well supported in all D std libraries. (the
>>> only downside of using str is that it's a little bigger/slower)
>>
>> One cannot easily address individual code points using utf8. For example...
>>
>> char[] SomeText;
>>
>> You cannot be sure whether SomeText[5] addresses the beginning of a
>> code point or not. Remember that code points in utf8 are variable
>> length, but are fixed length in utf32.
>
> This is not that much of a problem, since you should not address
> individual code points anyway but treat the code units as a string.

I obviously do a very different sort of programming to you. I often need to look at individual code points (i.e. characters) in a string.

> See http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode:
>
>> Code-point boundaries, iteration, and indexing are very fast with
>> UTF-32. Code-point boundaries, accessing code points at a given
>> offset, and iteration involve a few extra machine instructions for
>> UTF-16; UTF-8 is a bit more cumbersome. Indexing is slow for both of
>> them, but in practice indexing by different code units is done very
>> rarely, except when communicating with specifications that use UTF-32
>> code units, such as XSL.
>>
>> This point about indexing is true unless an API for strings allows
>> access only by code point offsets. This is a very inefficient design:
>> strings should always allow indexing with code unit offsets.

Yes, and a simple index into a char[] doesn't do this for you.

> But char[] works fine for ASCII and wchar[] works fine for Unicode,
> *as long* as you watch out for any surrogates in the code units...
>
> Which means you can have a fast standard route, and extra code to
> handle the exceptional characters if and when they occur ?

'Exceptional' to whom? To latin-based alphabet users maybe, but not to the great majority of the world's population.

>> So if using utf8, and one is doing some form of character
>> manipulation, one should first convert to utf32, do the work, then
>> convert back to utf8.
>
> Yes, and this is easily done with a foreach(dchar c; SomeText) loop,
> as D can transparently handle the transition between char[] and dchar...

Except for "character manipulation", as 'foreach(inout dchar c; SomeText)' is not permitted.

> There are also readily available functions in the std.utf module:
> "encode" and "decode", and the toUTF8 / toUTF16 / toUTF32 wrappers.

Exactly my point. One needs to use these if *manipulating* characters in a utf8 or utf16 string.

> If you do a lot of loops like that, you can use a dchar[] (dstr alias)
> as an intermediate storage. But char[] and wchar[] are better for the
> long term.

'Long term' meaning ??? Disk storage? RAM storage? Or until we finally get rid of all those silly 'alphabets' out there ;-)

-- 
Derek
Melbourne, Australia
11/02/2005 9:38:27 AM
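Both points in this exchange can be seen in a few lines of D. This is a sketch in modern D syntax, not code from the original thread; the string literal is chosen only for illustration:

```d
import std.stdio : writeln;

void main()
{
    // "héllo" has 5 code points but 6 UTF-8 code units,
    // since 'é' (U+00E9) encodes as two bytes (0xC3 0xA9).
    char[] s = "héllo".dup;
    writeln(s.length);   // 6 -- so indexing s[1] lands mid-character

    // foreach with a dchar loop variable decodes transparently,
    // yielding one whole code point per iteration...
    size_t codePoints = 0;
    foreach (dchar c; s)
        ++codePoints;
    writeln(codePoints); // 5

    // ...but a writable 'foreach (inout dchar c; s)' over a char[] is
    // rejected: writing a code point back could change the byte length.
}
```

This is exactly Derek's complaint: the read-only decoding loop works, but in-place character mutation through it does not.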
February 10, 2005 Re: Evolution (Hello World)
Posted in reply to Derek Parnell

Derek Parnell wrote:

>> But char[] works fine for ASCII and wchar[] works fine for Unicode,
>> *as long* as you watch out for any surrogates in the code units...
>>
>> Which means you can have a fast standard route, and extra code
>> to handle the exceptional characters if and when they occur ?
>
> 'exceptional' to whom? To latin-based alphabet users maybe, but not the
> great majority of the world's population.

No, but it's mine ;-) (the ignorant westerner that I am)

Seriously, in my own language - Swedish - about 10% of the text is non-ASCII, which means that Walter's optimized US-ASCII path runs for 90% of the time. I assume this is the same for the rest of the previously ISO-8859-X using Western world languages...

Had I been using another alphabet, like Japanese or Chinese, then UTF-16 would have been a nice bet. Surrogate characters do not occur very often; in fact they were only just introduced in Java 1.5, since the original 16 bits of Unicode "overflowed". So I think there's a 90-10 rule here too, with non-surrogates.

So I do think talking about "exceptions" is warranted ?

>> Yes, and this is easily done with a foreach(dchar c; SomeText) loop,
>> as D can transparently handle the transition between char[] and dchar...
>
> Except for "character manipulation" as 'foreach(inout dchar c; SomeText)'
> is not permitted.

We are talking Copy-on-Write here, yes ? As in reading from read-only and writing to read-write ?

Otherwise you could use dchar[] instead, and do a simple indexing (or a foreach(inout dchar c; SomeText) on it). And convert from UTF-8/UTF-16 on the way in, do all the processing on the UTF-32 internal array, and convert back to UTF-8/UTF-16 on the way out.

(most routines now do include a dchar[] interface too, and you can even use dchar[] in switch/case statements - if you like)

>> If you do a lot of loops like that, you can use a dchar[] (dstr alias)
>> as an intermediate storage. But char[] and wchar[] are better for the
>> long term.
>
> 'long term' meaning ??? Disk storage? RAM storage? Or until we finally
> get rid of all those silly 'alphabets' out there ;-)

Storage. Even with all the "silly alphabets" utilized, there are still 11 dead bits in each UTF-32 character. UTF-16 is bound to be more efficient, unless you are doing extinct-languages research or something? :-)

It's not just me...
See http://www.unicode.org/faq/utf_bom.html#UTF32

--anders
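The "convert in, process as UTF-32, convert out" pattern Anders describes can be sketched with the std.utf wrappers he mentions. A minimal illustration in modern D (the example string is an assumption, not from the thread):

```d
import std.stdio : writeln;
import std.utf : toUTF32, toUTF8;

void main()
{
    // Long-lived data stays as compact UTF-8...
    char[] text = "naïve".dup;

    // ...convert to UTF-32 for character-level work, where each
    // array slot holds exactly one code point...
    dchar[] wide = toUTF32(text).dup;
    wide[0] = 'N';               // plain indexed assignment is safe here

    // ...then convert back to UTF-8 on the way out.
    text = toUTF8(wide).dup;
    writeln(text);               // Naïve
}
```

The indexed assignment would be unsafe on the char[] itself, since replacing one character there could require shifting bytes; on the dchar[] it is a fixed-width store.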
February 11, 2005 Re: Evolution (Hello World)
Posted in reply to Anders F Björklund

Anders F Björklund wrote:
> Had I been using another alphabet, like Japanese or Chinese,
> then UTF-16 would have been a nice bet. Surrogate characters do not
> occur very often; in fact they were only just introduced
> in Java 1.5, since the original 16 bits of Unicode "overflowed".
> So I think there's a 90-10 rule here too, with non-surrogates.
Does anyone here know if Japanese and Chinese use a lot of ASCII punctuation? If they do, then maybe UTF-8 is reasonable.
James McComb
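The size trade-off behind that question can be made concrete. A rough illustration in D (the sample strings are assumptions): ASCII letters, digits, and punctuation stay one byte each in UTF-8, while kana and kanji code points typically take three bytes each, so the more ASCII punctuation mixed into CJK text, the better UTF-8 fares against UTF-16's flat two bytes per (non-surrogate) character:

```d
import std.stdio : writefln;

void main()
{
    // Pure ASCII: 13 code points, 13 UTF-8 bytes.
    string latin = "hello, world!";
    // Pure hiragana: 5 code points, 15 UTF-8 bytes (3 bytes each).
    string kana = "こんにちは";

    writefln("latin: %s bytes, kana: %s bytes",
             latin.length, kana.length);
}
```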
February 12, 2005 Re: Evolution (Hello World)
Posted in reply to Derek Parnell

"Derek Parnell" <derek@psych.ward> wrote in message news:uk7573l4ag4s.fkp4buj0rl0e.dlg@40tude.net...

>> This is not that much of a problem, since you should not address
>> individual code points anyway but treat the code units as a string.
>
> I obviously do a very different sort of programming to you. I often
> need to look at individual code points (i.e. characters) in a string.

Take a look at the functions std.utf.stride, std.utf.toUCSindex, and std.utf.toUTFindex. They provide the basic building blocks to manipulate UTF-8 strings as if they were an array of UCS characters.
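Those three building blocks can be demonstrated briefly. A sketch against the current std.utf module (the test string is an assumption, chosen so one character is multi-byte):

```d
import std.stdio : writeln;
import std.utf : stride, toUCSindex, toUTFindex;

void main()
{
    char[] s = "aéb".dup;   // 'é' spans two UTF-8 code units (bytes 1-2)

    // stride: how many code units the code point starting at an index uses
    writeln(stride(s, 0));      // 1 ('a')
    writeln(stride(s, 1));      // 2 ('é')

    // toUCSindex: code-unit offset -> character (code point) offset
    writeln(toUCSindex(s, 3));  // 2, i.e. 'b' is the third character

    // toUTFindex: character offset -> code-unit offset
    writeln(toUTFindex(s, 2));  // 3, i.e. 'b' starts at byte 3
}
```

Together they let code walk or index a char[] by character position without first converting the whole string to dchar[].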
Copyright © 1999-2021 by the D Language Foundation