November 17, 2004
Sean Kelly wrote on Wed, 17 Nov 2004 20:33:07 +0000 (UTC):
>>>>> Isn't dchar[] a pretty useless type ?
>>>>> (dchar isn't, but an UTF-32 string...)
>>>>>
>>>>> But I suspect that wchar[] is better for
>>>>> storing a bunch of (dchar) code points ?
>>
>>>>When you are dealing with extended CJK, ancient or private scripts
>>>>dchar is useful. For simple operations you might use wchar, but
>>>>as soon as you start extensive text processing you add a huge amount
>>>>of overhead (checking whether each unit is a surrogate).
>>
>>> Semi-related question.  Is it possible for there to be multiple UTF-8 (or UTF-16) sequences which represent the same UTF-32 character?  I would assume not, but don't want to make any assumptions.
>>
>>Technically, the encodings could represent one codepoint with
>>different UTF-8/UTF-16 sequences, but the standard requires you to use
>>the shortest possible sequence.
>>
>>Please don't confuse characters and codepoints.
>>e.g. "small Latin letter a with grave accent" can be represented
>>by 2 different codepoint sequences and thus by different UTF-8/UTF-16
>>sequences.
>
> The reason I asked was for string matching.  I wanted to be sure there was no advantage to doing comparisons in UTF-32 vs. UTF-8, for example.  So you're saying that while it's theoretically possible for two different UTF-8/16 sequences to represent the same codepoint, the requirements of the standard make this effectively impossible.  Is that correct?
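To illustrate the character-vs-codepoint distinction quoted above, here is a small sketch (in Python rather than D, only because its unicodedata module makes the point compactly):

```python
import unicodedata

precomposed = "\u00E0"   # "a with grave" as one code point (U+00E0)
decomposed = "a\u0300"   # "a" followed by COMBINING GRAVE ACCENT (U+0300)

# Same character, two different codepoint sequences ...
assert precomposed != decomposed
# ... and therefore two different UTF-8 byte sequences:
assert precomposed.encode("utf-8") == b"\xc3\xa0"
assert decomposed.encode("utf-8") == b"a\xcc\x80"

# Normalizing (here to NFC) collapses both to one canonical form:
assert unicodedata.normalize("NFC", decomposed) == precomposed
```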

Yes - with one important exception.
Whenever you interact with Java you have to be aware that its "UTF-8" is
'modified'. It doesn't encode the code point sequence (aka UTF-32) but
encodes the UTF-16 units. So if you have a UTF-16 surrogate pair (for
encoding >0x00FFFF) Java will generate _2_ UTF-8 sequences - one for the
low surrogate part and one for the high surrogate part - instead of _1_
for the code point.
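The Java behaviour described above can be reproduced by hand - a Python sketch (the sample code point is my own choice; the byte values follow from the UTF-8 and UTF-16 definitions):

```python
cp = 0x10400  # DESERET CAPITAL LETTER LONG I, above 0xFFFF

# Standard UTF-8 encodes the code point directly as one 4-byte sequence:
assert chr(cp).encode("utf-8") == b"\xf0\x90\x90\x80"

# Java-style "modified UTF-8" first splits the code point into a
# UTF-16 surrogate pair ...
v = cp - 0x10000
high = 0xD800 | (v >> 10)    # high surrogate
low = 0xDC00 | (v & 0x3FF)   # low surrogate

# ... then applies the plain 3-byte UTF-8 pattern to each surrogate:
def three_byte(u):
    return bytes([0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F)])

modified = three_byte(high) + three_byte(low)
assert modified == b"\xed\xa0\x81\xed\xb0\x80"  # 2 sequences, 6 bytes instead of 4

# A strict UTF-8 decoder rejects such surrogate encodings:
try:
    modified.decode("utf-8")
    assert False
except UnicodeDecodeError:
    pass
```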

Concerning speed: it's a matter of encoded string/byte length.
For mostly-Latin scripts use UTF-8.
For everything else - e.g. Greek, Japanese ... - use UTF-16.
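The length trade-off can be checked directly; a Python sketch (sample strings are my own, byte counts follow from the encoding definitions):

```python
samples = {
    "Latin": "hello",
    "Greek": "\u03b3\u03b5\u03b9\u03ac",               # 4 Greek letters
    "Japanese": "\u3053\u3093\u306b\u3061\u306f",      # 5 hiragana
}
for script, s in samples.items():
    utf8 = len(s.encode("utf-8"))
    utf16 = len(s.encode("utf-16-le"))  # -le: payload only, no BOM
    print(f"{script}: {utf8} bytes as UTF-8, {utf16} bytes as UTF-16")
# Latin:    5 as UTF-8, 10 as UTF-16  -> UTF-8 wins
# Greek:    8 as UTF-8,  8 as UTF-16  -> tie
# Japanese: 15 as UTF-8, 10 as UTF-16 -> UTF-16 wins
```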

http://unicode.org

Thomas

November 17, 2004
In article <mgrs62-3ig.ln1@kuehne.cn>, Thomas Kuehne says...
>
>Concerning speed: it's a matter of encoded string/byte length.
>For mostly-Latin scripts use UTF-8.
>For everything else - e.g. Greek, Japanese ... - use UTF-16.

Thanks a lot.  This was all in reference to my readf routines, which are currently converting everything to UTF-32 before matching and such.  I'll change them to use UTF-16 instead.


Sean
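Given the guarantee discussed above (each code point has exactly one valid UTF-8 sequence), matching can even run on the encoded bytes without decoding first - a minimal Python sketch:

```python
import unicodedata

# One byte sequence per code point, so substring equality carries
# over to bytewise equality of the encoded forms:
haystack = "na\u00efve caf\u00e9".encode("utf-8")
needle = "caf\u00e9".encode("utf-8")
assert needle in haystack            # bytewise search finds the substring

# The same does NOT hold across normalization forms: "e" followed by a
# combining accent encodes differently from the precomposed letter.
decomposed = unicodedata.normalize("NFD", "caf\u00e9").encode("utf-8")
assert decomposed not in haystack
```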


November 30, 2004
On Wed, 17 Nov 2004 10:09:41 +0100, Thomas Kuehne <thomas-dloop@kuehne.thisisspam.cn> wrote:

>
> Using dchar[] as case-keys within a switch results in:
> Internal error: s2ir.c 670
>
> http://svn.kuehne.cn/dstress/nocompile/switch_14.d
>
>
> Using multiple identical dchar[]s as case-keys within a switch results
> in:
> expression.c:1367: virtual int StringExp::compare(Object*): Assertion `0' failed
>
> http://svn.kuehne.cn/dstress/nocompile/switch_13.d
>
> I don't know why, but the current documentation states that only
> "integral types or char[] or wchar[]" are allowed for switch statements.
> It would certainly be useful if dchar[] and floating types were allowed too.
>
> Thomas

Fixed in 1.06

-- 
"Unhappy Microsoft customers have a funny way of becoming Linux,
Salesforce.com and Oracle customers." - www.microsoft-watch.com:
"The Year in Review: Microsoft Opens Up"
November 30, 2004
On Tue, 30 Nov 2004 19:06:37 +1300, Simon Buchan <currently@no.where> wrote:

<snip>
> Fixed in 1.06
>

err.... 1.07 <smack forehead>
