November 17, 2004 Re: [OT] Re: switch (dchar[]) | ||||
---|---|---|---|---|
| ||||
Posted in reply to Sean Kelly | Sean Kelly wrote on Wed, 17 Nov 2004 20:33:07 +0000 (UTC):
>>>>> Isn't dchar[] a pretty useless type ?
>>>>> (dchar isn't, but an UTF-32 string...)
>>>>>
>>>>> But I suspect that wchar[] is better for
>>>>> storing a bunch of (dchar) code points ?
>>
>>>>When you are dealing with extended CJK, ancient or private scripts
>>>>dchar is useful. For simple operations you might use wchar, but
>>>>as soon as you start extensive text processing you add a huge amount
>>>>of overhead (lookup if this is a surrogate).
>>
>>> Semi-related question. Is it possible for there to be multiple UTF-8 (or UTF-16) sequences which represent the same UTF-32 character? I would assume not, but don't want to make any assumptions.
>>
>>The used encodings could technically present one codepoint with different UTF-16/UTF-8 sequences. But the standards require you to use the shortest possible sequence.
>>
>>Please don't confuse characters and codepoints.
>>e.g. "small Latin letter a with accent grave" can be represented
>>with 2 different codepoint sequences and thus with different UTF8/16
>>sequences.
>
> The reason I asked was for string matching. I wanted to be sure there was no advantage to doing comparisons in UTF-32 vs. UTF-8, for example. So you're saying that while it's theoretically possible to have two different UTF-8/16 sequences present the same codepoint, the requirements of the standard make this effectively impossible. Is that correct?

Yes - with one important exception.
Whenever you interact with Java you have to be aware that its "UTF-8" is 'adapted'. It doesn't encode the code point sequence (aka UTF-32) but encodes UTF-16 chars. So if you have a UTF-16 surrogate pair (for encoding code points > 0x00FFFF) Java will generate _2_ UTF-8 sequences - one for the lower surrogate part and one for the higher surrogate part - instead of _1_ for the code point.

Concerning speed: it's a matter of encoded string/byte length.
For mostly Latin scripts use UTF-8.
For everything else - e.g. Greek, Japanese ... - use UTF-16.

http://unicode.org

Thomas
|
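To make the three cases in the post above concrete, here is a rough Python sketch (Python rather than D, purely for illustration). It shows a strict UTF-8 decoder rejecting a non-shortest form, the same character written as two different code point sequences, and a surrogate-wise encoding of a code point above 0xFFFF. The helper names `decodes_as_utf8` and `encode_surrogatewise` are made up for this example, and the second one only imitates the surrogate handling described for Java; it is not Java's actual encoder and ignores any other differences that encoder may have.

```python
import unicodedata


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 1. Shortest-form rule: 0xC0 0x80 is an overlong two-byte form of U+0000,
#    so a strict decoder refuses it while the one-byte form is fine.
overlong_nul = b"\xc0\x80"
shortest_nul = b"\x00"

# 2. One character, two code point sequences (hence two byte sequences).
precomposed = "\u00e0"   # LATIN SMALL LETTER A WITH GRAVE
decomposed = "a\u0300"   # 'a' followed by COMBINING GRAVE ACCENT


def encode_surrogatewise(text: str) -> bytes:
    """Hypothetical helper: encode each UTF-16 surrogate on its own.

    Code points above U+FFFF are split into a high and a low surrogate,
    and each surrogate is written as a separate 3-byte sequence.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for unit in (high, low):
                # "surrogatepass" lets Python emit bytes for a lone surrogate.
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the 16-bit range
    print(decodes_as_utf8(overlong_nul), decodes_as_utf8(shortest_nul))
    print(precomposed.encode("utf-8").hex(), decomposed.encode("utf-8").hex())
    print(clef.encode("utf-8").hex(), encode_surrogatewise(clef).hex())
```

For matching, the second case is probably the one to watch: a byte-wise or code-point-wise comparison treats `precomposed` and `decomposed` as different strings unless both sides are normalized first, e.g. with `unicodedata.normalize("NFC", ...)`.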
November 17, 2004 Re: [OT] Re: switch (dchar[]) | ||||
---|---|---|---|---|
| ||||
Posted in reply to Thomas Kuehne | In article <mgrs62-3ig.ln1@kuehne.cn>, Thomas Kuehne says...
>
>Concerning the speed. It's a matter of encoded string/byte length.
>For mostly Latin scripts use UTF-8.
>For everything else - e.g. Greek, Japanese ... - use UTF-16.
Thanks a lot. This was all in reference to my readf routines, which are currently converting everything to UTF-32 before matching and such. I'll change them to use UTF-16 instead.
Sean
|
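A quick way to check the size argument for a given kind of text is to simply measure it. The Python sketch below (again just for illustration; `encoded_sizes` and the sample strings are invented for this example) reports the byte length of a string in each encoding, without a byte order mark.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded byte length of text in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" codecs are used so that no byte order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    # Sample strings are arbitrary; real input will mix scripts.
    samples = {
        "Latin": "switch",
        "Greek": "αβγδ",
        "Japanese": "スイッチ",
    }
    for name, text in samples.items():
        print(name, encoded_sizes(text))
```

With these particular samples, UTF-8 is smallest for the Latin text and UTF-16 is smallest for the Japanese text, while the Greek sample comes out the same size in UTF-8 and UTF-16. Real input usually contains ASCII punctuation, digits and whitespace as well, so it is worth measuring representative data before settling on one encoding.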
November 30, 2004 Re: switch (dchar[]) | ||||
---|---|---|---|---|
| ||||
Posted in reply to Thomas Kuehne | On Wed, 17 Nov 2004 10:09:41 +0100, Thomas Kuehne <thomas-dloop@kuehne.thisisspam.cn> wrote:
>
> Using dchar[] as case-keys within a switch results in:
> Internal error: s2ir.c 670
>
> http://svn.kuehne.cn/dstress/nocompile/switch_14.d
>
>
> Using multiple identical dchar[]s as case-keys within a switch results
> in:
> expression.c:1367: virtual int StringExp::compare(Object*): Assertion `0' failed
>
> http://svn.kuehne.cn/dstress/nocompile/switch_13.d
>
> I don't know why, but the current documentation states that only
> "integral types or char[] or wchar[]" are allowed for switch statements.
> It is certainly useful if wchar[] and floating types are allowed too.
>
> Thomas

Fixed in 1.06

--
"Unhappy Microsoft customers have a funny way of becoming Linux, Salesforce.com and Oracle customers." - www.microsoft-watch.com: "The Year in Review: Microsoft Opens Up"
|
November 30, 2004 Re: switch (dchar[]) | ||||
---|---|---|---|---|
| ||||
Posted in reply to Simon Buchan | On Tue, 30 Nov 2004 19:06:37 +1300, Simon Buchan <currently@no.where> wrote:
<snip>
> Fixed in 1.06
>

err.... 1.07
<smack forehead>

--
"Unhappy Microsoft customers have a funny way of becoming Linux, Salesforce.com and Oracle customers." - www.microsoft-watch.com: "The Year in Review: Microsoft Opens Up"
|
Copyright © 1999-2021 by the D Language Foundation