[OT] Re: switch (dchar[]) (page 2)

November 17, 2004

Re: [OT] Re: switch (dchar[])

Posted by Thomas Kuehne
in reply to Sean Kelly


Sean Kelly wrote on Wed, 17 Nov 2004 20:33:07 +0000 (UTC):
>>>>> Isn't dchar[] a pretty useless type ?
>>>>> (dchar isn't, but a UTF-32 string...)
>>>>>
>>>>> But I suspect that wchar[] is better for
>>>>> storing a bunch of (dchar) code points ?
>>
>>>>When you are dealing with extended CJK, ancient or private scripts
>>>>dchar is useful. For simple operations you might use wchar, but
>>>>as soon as you start extensive text processing you add a huge amount
>>>>of overhead (checking whether each unit is a surrogate).
>>
>>> Semi-related question.  Is it possible for there to be multiple UTF-8 (or UTF-16) sequences which represent the same UTF-32 character?  I would assume not, but don't want to make any assumptions.
>>
>>The encodings could technically represent one codepoint with different UTF-16/UTF-8 sequences. But the standards require you to use the shortest possible sequence.
>>
>>Please don't confuse characters and codepoints.
>>e.g. "small Latin letter a with grave accent" can be represented
>>with 2 different codepoint sequences and thus with different UTF-8/16
>>sequences.
>
> The reason I asked was for string matching.  I wanted to be sure there was no advantage to doing comparisons in UTF-32 vs. UTF-8, for example.  So you're saying that while it's theoretically possible to have two different UTF-8/16 sequences present the same codepoint, the requirements of the standard make this effectively impossible.  Is that correct?

Yes - with one important exception.
Whenever you interact with Java you have to be aware that its "UTF-8" is
'adapted'. It doesn't encode the code point sequence (aka UTF-32) but
encodes UTF-16 code units. So if you have a UTF-16 surrogate pair (for
encoding code points > 0x00FFFF) Java will generate _2_ UTF-8 sequences -
one for the high surrogate and one for the low surrogate - instead of _1_
for the code point.
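
To make the Java case concrete, here is a small sketch rather than anything definitive. It assumes the 'adapted' format meant above is the "modified UTF-8" written by DataOutputStream.writeUTF; String.getBytes with the UTF-8 charset produces standard UTF-8. The class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of any library.
final class ModifiedUtf8Demo {
    // Standard UTF-8: one sequence per code point.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by DataOutputStream.writeUTF.
    // writeUTF prepends a 2-byte length, which is stripped here.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        // U+10400 lies outside the BMP, so a Java String stores it
        // as a surrogate pair (two UTF-16 code units).
        String s = new String(Character.toChars(0x10400));
        byte[] std = standardUtf8(s);
        byte[] mod = modifiedUtf8(s);
        System.out.println(std.length);             // 4: one 4-byte sequence
        System.out.println(mod.length);             // 6: two 3-byte sequences
        System.out.println(Arrays.equals(std, mod)); // false
    }
}
```

For strings limited to U+0001..U+FFFF the two encoders give the same bytes; the difference only appears with surrogate pairs (and with U+0000, which writeUTF encodes as two bytes).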

Concerning the speed. It's a matter of encoded string/byte length.
For mostly Latin scripts use UTF-8.
For everything else - e.g. Greek, Japanese ... - use UTF-16.
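
The size argument can be checked with a quick sketch like the one below. The sample strings and the class name are only illustrative assumptions; UTF-16LE is used so that no byte order mark is counted.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of any library.
final class EncodedLengthDemo {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF_16LE does not emit a byte order mark, so only the text is counted.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    // UTF-32 always uses 4 bytes per code point.
    static int utf32Length(String s) {
        return 4 * s.codePointCount(0, s.length());
    }

    static void report(String label, String s) {
        System.out.println(label + ": UTF-8=" + utf8Length(s)
                + " UTF-16=" + utf16Length(s)
                + " UTF-32=" + utf32Length(s));
    }

    public static void main(String[] args) {
        report("Latin", "hello");                       // 5 / 10 / 20 bytes
        report("Greek", "\u03B3\u03B5\u03B9\u03AC");    // 8 / 8 / 16 bytes
        report("Japanese", "\u65E5\u672C\u8A9E");       // 9 / 6 / 12 bytes
    }
}
```

With these samples UTF-8 wins clearly for Latin text, UTF-16 wins for Japanese, and Greek comes out the same size in both, so real text mixed with ASCII markup or whitespace may tip the balance either way.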

http://unicode.org

Thomas

In article <mgrs62-3ig.ln1@kuehne.cn>, Thomas Kuehne says...
>
>Concerning the speed. It's a matter of encoded string/byte length.
>For mostly Latin scripts use UTF-8.
>For everything else - e.g. Greek, Japanese ... - use UTF-16.

Thanks a lot. This was all in reference to my readf routines, which are currently converting everything to UTF-32 before matching and such. I'll change them to use UTF-16 instead.

Sean

On Wed, 17 Nov 2004 10:09:41 +0100, Thomas Kuehne <thomas-dloop@kuehne.thisisspam.cn> wrote:
>
> Using dchar[] as case-keys within a switch results in:
> Internal error: s2ir.c 670
>
> http://svn.kuehne.cn/dstress/nocompile/switch_14.d
>
>
> Using multiple identical dchar[]s as case-keys within a switch results
> in:
> expression.c:1367: virtual int StringExp::compare(Object*): Assertion `0' failed
>
> http://svn.kuehne.cn/dstress/nocompile/switch_13.d
>
> I don't know why, but the current documentation states that only
> "integral types or char[] or wchar[]" are allowed for switch statements.
> It is certainly useful if wchar[] and floating types are allowed too.
>
> Thomas

Fixed in 1.06

--
"Unhappy Microsoft customers have a funny way of becoming Linux, Salesforce.com and Oracle customers." - www.microsoft-watch.com: "The Year in Review: Microsoft Opens Up"

On Tue, 30 Nov 2004 19:06:37 +1300, Simon Buchan <currently@no.where> wrote:
<snip>
> Fixed in 1.06
>

err.... 1.07 <smack forehead>

--
"Unhappy Microsoft customers have a funny way of becoming Linux, Salesforce.com and Oracle customers." - www.microsoft-watch.com: "The Year in Review: Microsoft Opens Up"
