August 25, 2004
The decoding thing has issues, as you point out, Jill.

That's not the whole story, however. The Mango utf-8 Encoder is 5 to 15 times faster than the Phobos one, while executing the equivalent algorithm. The CheckedDecoder is 5 to 15 times faster, whilst the UncheckedDecoder is 10 to 30 times faster (all  variances are due to alternate mixes of char, wchar, dchar; all timings performed on a P3).

These are rather significant differences. I think it's safe to say that the Phobos routines were "not written with efficiency in mind". Either that, or Mango has some secret means of warping time ...
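
To give a feel for where the gap comes from, here's a rough sketch (not the actual Mango source; the name and signature are invented for illustration) of what an unchecked decode step boils down to. It simply assumes the input is valid UTF-8, so it never tests for truncated sequences, stray trailing bytes, or non-shortest forms ~ which is exactly the work a checked decoder has to add on top:

// hypothetical helper, for illustration only; assumes s[i] starts a valid sequence
dchar decodeUnchecked(char[] s, inout uint i)
{
    dchar c = s[i++];
    if (c < 0x80)                               // ASCII fast path
        return c;
    if (c < 0xE0)                               // 2-byte sequence
        return ((c & 0x1F) << 6) | (s[i++] & 0x3F);
    if (c < 0xF0)                               // 3-byte sequence
    {
        c = (c & 0x0F) << 12;
        c |= (s[i++] & 0x3F) << 6;
        return c | (s[i++] & 0x3F);
    }
    c = (c & 0x07) << 18;                       // 4-byte sequence
    c |= (s[i++] & 0x3F) << 12;
    c |= (s[i++] & 0x3F) << 6;
    return c | (s[i++] & 0x3F);
}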



"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgih09$2ak7$1@digitaldaemon.com...
> In article <cgidgk$28mg$1@digitaldaemon.com>, Sean Kelly says...
>
> >>If these implicit conversions are put in place, then I respectfully suggest
> >>the std.utf functions be replaced with something that avoids fragmenting the
> >>heap in the manner they currently do (for non Latin-1); and it's not hard to
> >>make them an order-of-magnitude faster, too.
> >
> >Then by all means do so :)
> >
> >Sean
>
> Some speed-up ideas...
>
> I posted a potentially speedier version of UTF-8 decode here a while back. The
> basic algorithm I used was this: get the first byte; if it's ASCII, return it;
> else use it as an index into a lookup table to get the sequence length. There's
> slightly more to it than that, obviously, but that was the basis. Walter wanted
> to know if there were any standard tests to check whether a UTF-8 function works
> correctly. I didn't know of any.
>
> The big difficulty with UTF-8 is that of being fully Unicode conformant. This is
> poorly understood, so people are often tempted to make shortcuts. The std.utf
> functions take no shortcuts and so are conformant.
>
> The gist is this, however. You can have two different kinds of UTF-8 decode
> routine - checked or unchecked. A checked function will ensure that the input
> contains no invalid sequences (non-shortest sequences are always invalid), and
> will throw an exception (or otherwise report the error) if that's not the case.
> Checked decoders can be made fully conformant, but the checking can slow you
> down.
>
> Unchecked decoders, on the other hand, simply /assume/ that the input is valid,
> and produce garbage if it isn't. Unchecked decoders can be made to go a lot
> faster, but they are not Unicode conformant ... unless of course you *KNOW* with
> 100% certainty that the input *IS* valid. (Without this knowledge, your
> application won't be Unicode conformant, and can actually be a security risk).
> So, it would be possible to write a fast, unchecked UTF-8 decoder, if you made
> use of D's Design by Contract. If you validate the string in the function's "in"
> block, then you can assume valid input in the function body, and thereby go
> faster (at least in a release build). But watch out for coding errors. The
> caller *MUST* fulfil that contract, or you have a bug. And you'd still need to
> have a checked UTF-8 decoder for those cases when you're not sure where the
> input came from.
>
> Being able to distinguish between sequences which have already been validated,
> and those which have not, can buy you a lot of efficiency. Unfortunately, I
> don't see how D can take advantage of that. If a D string were a class or a
> struct, then it could have a class invariant - but D strings are just simple
> arrays, and constructing invalid UTF-8 arrays is all too easy.
>
> Arcane Jill
>
>


August 25, 2004
In article <cgij3k$2bon$1@digitaldaemon.com>, antiAlias says...
>
>The decoding thing has issues as you point out Jill.
>
>That's not the whole story, however. The Mango utf-8 Encoder is 5 to 15 times faster than the Phobos one, while executing the equivalent algorithm. The CheckedDecoder is 5 to 15 times faster, whilst the UncheckedDecoder is 10 to 30 times faster (all  variances are due to alternate mixes of char, wchar, dchar; all timings performed on a P3).

Holy crap, what kind of data are you throwing at it?  I don't mean to criticize, but there must be some clever coding on your part or some serious loopholes in the algorithm to get that kind of an improvement. :)

>These are rather significant differences. I think it's safe to say that the Phobos routines were "not written with efficiency in mind". Either that, or Mango has some secret means of warping time ...

In the case of the latter, may we rename the "Mango" to "Tardis"?

-Pragma
[[ EricAnderton at (daleks can't use stairs) yahoo.com ]]
August 25, 2004
<g> That's funny.

No loopholes; no clever coding. Just take a look at (for example) what utf.encode does to the heap. Those order-of-magnitude timings are best-case for Phobos ~ in a busy server environment they'd be even slower, likely cause notable heap fragmentation, and persistently lock the heap against other (much more appropriate) usage by other threads. Imagine multiple threads doing implicit dchar[] conversions via utf.encode?
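
Roughly what I mean, as a sketch only (this is not the Mango code; the helper name is invented): let the caller hand the encoder a scratch buffer and get a slice back, so a tight loop never needs the heap at all. std.utf.encode, by contrast, appends into the array it's given, which is where the heap traffic comes from.

// illustrative only ~ assumes buf has room for at least 4 bytes and c is a valid code point
char[] encodeInto(char[] buf, dchar c)
{
    if (c < 0x80)
    {
        buf[0] = cast(char) c;
        return buf[0 .. 1];
    }
    if (c < 0x800)
    {
        buf[0] = cast(char) (0xC0 | (c >> 6));
        buf[1] = cast(char) (0x80 | (c & 0x3F));
        return buf[0 .. 2];
    }
    if (c < 0x10000)
    {
        buf[0] = cast(char) (0xE0 | (c >> 12));
        buf[1] = cast(char) (0x80 | ((c >> 6) & 0x3F));
        buf[2] = cast(char) (0x80 | (c & 0x3F));
        return buf[0 .. 3];
    }
    buf[0] = cast(char) (0xF0 | (c >> 18));
    buf[1] = cast(char) (0x80 | ((c >> 12) & 0x3F));
    buf[2] = cast(char) (0x80 | ((c >> 6) & 0x3F));
    buf[3] = cast(char) (0x80 | (c & 0x3F));
    return buf[0 .. 4];
}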

Because there's no supported means of resolving such things, one becomes inclined to simply 'reimplement' instead of pooling one's skills and resources to fix Phobos.

This is exactly the kind of thing the DSLG should take care of.



"pragma" <pragma_member@pathlink.com> wrote in message news:cgin45$2djh$1@digitaldaemon.com...
> In article <cgij3k$2bon$1@digitaldaemon.com>, antiAlias says...
> >
> >The decoding thing has issues as you point out Jill.
> >
> >That's not the whole story, however. The Mango utf-8 Encoder is 5 to 15 times
> >faster than the Phobos one, while executing the equivalent algorithm.
> >The CheckedDecoder is 5 to 15 times faster, whilst the UncheckedDecoder is
> >10 to 30 times faster (all variances are due to alternate mixes of char,
> >wchar, dchar; all timings performed on a P3).
>
> Holy crap, what kind of data are you throwing at it?  I don't mean to criticize,
> but there must be some clever coding on your part or some serious loopholes in
> the algorithm to get that kind of an improvement. :)
>
> >These are rather significant differences. I think it's safe to say that the
> >Phobos routines were "not written with efficiency in mind". Either that, or
> >Mango has some secret means of warping time ...
>
> In the case of the latter, may we rename the "Mango" to "Tardis"?
>
> -Pragma
> [[ EricAnderton at (daleks can't use stairs) yahoo.com ]]


August 25, 2004
Might I humbly suggest you add these routines to deimos?
Have you seen my post to the DSLG thread? I think my idea has merit..

On Wed, 25 Aug 2004 13:07:57 -0700, antiAlias <fu@bar.com> wrote:
> <g> That's funny.
>
> No loopholes; no clever coding. Just take a look at (for example) what
> utf.encode does to the heap. Those order-of-magnitude timings are best-case
> for Phobos ~ in a busy server environment they'd be even slower, likely
> cause notable heap fragmentation, and persistently lock the heap against
> other (much more appropriate) usage by other threads. Imagine multiple
> threads doing implicit dchar[] conversions via utf.encode?
>
> Because there's no supported means of resolving such things, one becomes
> inclined to simply 'reimplement' instead of pooling ones skills and
> resources to fix Phobos.
>
> This is exactly the kind of thing the DSLG should take care of.
>
>
>
> "pragma" <pragma_member@pathlink.com> wrote in message
> news:cgin45$2djh$1@digitaldaemon.com...
>> In article <cgij3k$2bon$1@digitaldaemon.com>, antiAlias says...
>> >
>> >The decoding thing has issues as you point out Jill.
>> >
>> >That's not the whole story, however. The Mango utf-8 Encoder is 5 to 15
>> >times faster than the Phobos one, while executing the equivalent algorithm.
>> >The CheckedDecoder is 5 to 15 times faster, whilst the UncheckedDecoder is
>> >10 to 30 times faster (all variances are due to alternate mixes of char,
>> >wchar, dchar; all timings performed on a P3).
>>
>> Holy crap, what kind of data are you throwing at it?  I don't mean to criticize,
>> but there must be some clever coding on your part or some serious loopholes in
>> the algorithm to get that kind of an improvement. :)
>>
>> >These are rather significant differences. I think it's safe to say that the
>> >Phobos routines were "not written with efficiency in mind". Either that, or
>> >Mango has some secret means of warping time ...
>>
>> In the case of the latter, may we rename the "Mango" to "Tardis"?
>>
>> -Pragma
>> [[ EricAnderton at (daleks can't use stairs) yahoo.com ]]
>
>



-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 26, 2004
On Wed, 25 Aug 2004 06:31:16 +0000 (UTC), Arcane Jill <Arcane_member@pathlink.com> wrote:
> In article <opsc9xu9h35a2sq9@digitalmars.com>, Regan Heath says...
>
>> Ahh.. excellent, that is what I was hoping to hear.
>
> It's not /all/ good news however. Consider these two cases:
>
> (1)
> #    class A { wchar[] toString(); }
> #    A a = new A();
> #    wchar[] s = a.toString();
>
> All hunky dory. No conversions happen. /But/
>
> (2)
> #    class A { wchar[] toString(); }
> #    Object a = new A();
> #    wchar[] s = a.toString();
>
> Now /two/ conversions happen (assuming Object.toString() still returns char[]) -
> toUTF8(wchar[]) followed by toUTF16(char[]). Still, that's polymorphism for you.

True, and we can come up with much nastier string concatenation examples too.. I wonder if some cleverness can be thought up to lessen this effect somehow?
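
e.g. spelled out by hand with today's std.utf, a mixed concatenation already looks something like this (just an illustration) - with implicit transcoding the same temporaries would simply be created behind your back:

import std.utf;

// each toUTF16 call allocates a temporary before the join even starts
wchar[] join3(char[] a, wchar[] b, dchar[] c)
{
    return toUTF16(a) ~ b ~ toUTF16(c);
}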

> It is better than the status quo, but not quite as good (IMO) as having wchar[]
> be the standard string type.

By 'standard' do you mean that the others do not exist? or that it is the type you are encouraged to use unless you have reason not to?

I think the other types have a valid place in the D language, after all each type will be more or less efficient based on the specific circumstances it gets used in.

The most generally efficient type (if that's even possible to decide) should be the type we're encouraged to use, if that's wchar so be it.

Implicit transcoding will fit nicely with a standard type as when you are using another type the library functions (if all written for wchar for example) will still be available without explicit toUTFxx calls.

Regan.

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 26, 2004
In article <cgirdm$2fod$1@digitaldaemon.com>, antiAlias says...
>
><g> That's funny.

Thanks!  That's why I've been doing these goofy sigs lately: if I can make people grin while they type, perhaps tempers won't flare as much in this NG. :)

>
>No loopholes; no clever coding. Just take a look at (for example) what utf.encode does to the heap. Those order-of-magnitude timings are best-case for Phobos ~ in a busy server environment they'd be even slower, likely cause notable heap fragmentation, and persistently lock the heap against other (much more appropriate) usage by other threads. Imagine multiple threads doing implicit dchar[] conversions via utf.encode?

Yikes.  That's amazing.  I only wonder how this might stand up against ICU?

>Because there's no supported means of resolving such things, one becomes inclined to simply 'reimplement' instead of pooling ones skills and resources to fix Phobos.

A personal favorite of mine:

1st law of engineering: Hit it with a hammer
2nd law of engineering: If law 1 fails, *use a bigger hammer*

... or the more classic idiom: use the right tool for the right job.

One could liken reimplementation to a "programmer's sin" of sorts only if they doom others to continue that replication.  Sadly, everybody's lib is currently "in progress" with no end in sight yet (which should change once D's remaining quirks are stamped out).  At least Brad has been moderating product additions on dsource to cut down some of the more coarse-grained duplication.

However we're still very much in the Beta years of D's life. ;)

>
>This is exactly the kind of thing the DSLG should take care of.

At first I wasn't too keen on the idea myself, but perhaps it could use some more discussion. (I'll post on the other thread).

- Pragma
[[ EricAnderton at (its gumby dammit) yahoo.com ]]
August 26, 2004
In article <opsdbboxb05a2sq9@digitalmars.com>, Regan Heath says...

>> It is better than the status quo, but not quite as good (IMO) as having
>> wchar[]
>> be the standard string type.
>
>By 'standard' do you mean that the others do not exist? or that it is the type you are encouraged to use unless you have reason not to?

I guess I mean specifically that:

(1) Object.toString() should return wchar[], not char[]

(2) String literals such as "hello world" should be interpreted as wchar[], not
char[].

(3) Object.d should contain the line:
#    alias wchar[] string;

(4) The text on page http://www.digitalmars.com/d/arrays.html should be changed.
Currently it says:

>Dynamic arrays in D suggest the obvious solution - a string is just a dynamic array of characters. String literals become just an easy way to write character arrays.
>
>    char[] str;
>    char[] str1 = "abc";

This should be changed to:

>Dynamic arrays in D suggest the obvious solution - a string is just a dynamic array of characters. String literals become just an easy way to write character arrays.
>
>    wchar[] str;
>    wchar[] str1 = "abc";

(5) There are probably several other things to change. I don't claim this is an
exhaustive list.



In other words, we could actually have our cake /and/ eat it. The intent is to minimize, as far as possible, the number of calls to toUTFxx(). Ideally, they should occur only at input and output. The way to minimize this is to keep everything in the same type, so conversion is not needed. If the D documentation, the behaviour of the compiler, and the organization of Phobos, were to consistently use the same string type, and others were encouraged to use the same type, conversions would be kept to a minimum.

Currently D does that - but its "string of choice" is the char[], not the wchar[]. Conversion to/from UTF-8 is incredibly slow for non-ASCII characters. (It could be made faster, but it can /never/ be made as fast as UTF-16). So we make wchar[], not char[], the "standard", and hey presto, things get faster (and what's more will interface with ICU without conversion, which is really important for internationalization).




>The most generally efficient type (if that's even possible to decide) should be the type we're encouraged to use, if that's wchar so be it.

Well, it is usually believed that
UTF-8  is the most space-efficient but the least speed-efficient.
UTF-32 is the most speed-efficient but the least space-efficient.
UTF-16 is the happy medium.

However...
*) UTF-16 is almost as fast as UTF-32, because the UTF-16 encoding is so simple
*) UTF-16 is more compact than UTF-8 for codepoints between U+0800 and U+FFFF,
each of which requires 3 bytes in UTF-8, but only two bytes in UTF-16
*) The characters expressible in UTF-16 in a single wchar include every symbol
from every living language, so if you /pretend/ that wchar[] is an array of
characters rather than UTF-16 fragments, the effect is relatively harmless
(unlike UTF-8).
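
You can see the middle point for yourself with a throwaway example (nothing more than a quick check using std.utf as it stands):

#    import std.stdio;
#    import std.utf;
#
#    void main()
#    {
#        dchar[] s = "\u20AC";   // U+20AC EURO SIGN
#        writefln("UTF-8: %d bytes, UTF-16: %d bytes",
#            toUTF8(s).length, toUTF16(s).length * wchar.sizeof);
#        // prints: UTF-8: 3 bytes, UTF-16: 2 bytes
#    }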


>Implicit transcoding will fit nicely with a standard type as when you are using another type the library functions (if all written for wchar for example) will still be available without explicit toUTFxx calls.

True. I can't argue with that.

But back to the case for ditching char - think of this from another perspective. In Java, would you be prepared to argue the case for the /introduction/ of an 8-bit wide character type to the language? And that this type could only ever be used for UTF-8?

There's a reason why that suggestion sounds absurd. It is.

Arcane Jill



August 26, 2004
Hear! Hear!


"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgk32j$1fj$1@digitaldaemon.com...
> In article <opsdbboxb05a2sq9@digitalmars.com>, Regan Heath says...
>
> >> It is better than the status quo, but not quite as good (IMO) as having
> >> wchar[]
> >> be the standard string type.
> >
> >By 'standard' do you mean that the others do not exist? or that it is the type you are encouraged to use unless you have reason not to?
>
> I guess I mean specifically that:
>
> (1) Object.toString() should return wchar[], not char[]
>
> (2) String literals such as "hello world" should be interpreted as wchar[], not
> char[].
>
> (3) Object.d should contain the line:
> #    alias wchar[] string;
>
> (4) The text on page http://www.digitalmars.com/d/arrays.html should be changed.
> Currently it says:
>
> >Dynamic arrays in D suggest the obvious solution - a string is just a dynamic
> >array of characters. String literals become just an easy way to write character
> >arrays.
> >
> >    char[] str;
> >    char[] str1 = "abc";
>
> This should be changed to:
>
> >Dynamic arrays in D suggest the obvious solution - a string is just a dynamic
> >array of characters. String literals become just an easy way to write character
> >arrays.
> >
> >    wchar[] str;
> >    wchar[] str1 = "abc";
>
> (5) There are probably several other things to change. I don't claim this is an
> exhaustive list.
>
>
>
> In other words, we could actually have our cake /and/ eat it. The intent is to
> minimize, as far as possible, the number of calls to toUTFxx(). Ideally, they
> should occur only at input and output. The way to minimize this is to keep
> everything in the same type, so conversion is not needed. If the D
> documentation, the behaviour of the compiler, and the organization of Phobos,
> were to consistently use the same string type, and others were encouraged to use
> the same type, conversions would be kept to a minimum.
>
> Currently D does that - but its "string of choice" is the char[], not the
> wchar[]. Conversion to/from UTF-8 is incredibly slow for non-ASCII characters.
> (It could be made faster, but it can /never/ be made as fast as UTF-16). So we
> make wchar[], not char[], the "standard", and hey presto, things get faster (and
> what's more will interface with ICU without conversion, which is really
> important for internationalization).
>
>
>
>
> >The most generally efficient type (if that's even possible to decide) should be the type we're encouraged to use, if that's wchar so be it.
>
> Well, it is usually believed that
> UTF-8  is the most space-efficient but the least speed-efficient.
> UTF-32 is the most speed-efficient but the least space-efficient.
> UTF-16 is the happy medium.
>
> However...
> *) UTF-16 is almost as fast as UTF-32, because the UTF-16 encoding is so simple
> *) UTF-16 is more compact than UTF-8 for codepoints between U+0800 and U+FFFF,
> each of which requires 3 bytes in UTF-8, but only two bytes in UTF-16
> *) The characters expressible in UTF-16 in a single wchar include every symbol
> from every living language, so if you /pretend/ that wchar[] is an array of
> characters rather than UTF-16 fragments, the effect is relatively harmless
> (unlike UTF-8).
>
>
> >Implicit transcoding will fit nicely with a standard type as when you are using another type the library functions (if all written for wchar for example) will still be available without explicit toUTFxx calls.
>
> True. I can't argue with that.
>
> But back to the case for ditching char - think of this from another perspective.
> In Java, would you be prepared to argue the case for the /introduction/ of an
> 8-bit wide character type to the language? And that this type could only ever be
> used for UTF-8?
>
> There's a reason why that suggestion sounds absurd. It is.
>
> Arcane Jill
>
>
>


August 26, 2004
On Thu, 26 Aug 2004 07:21:55 +0000 (UTC), Arcane Jill <Arcane_member@pathlink.com> wrote:
> In article <opsdbboxb05a2sq9@digitalmars.com>, Regan Heath says...
>
>>> It is better than the status quo, but not quite as good (IMO) as having
>>> wchar[]
>>> be the standard string type.
>>
>> By 'standard' do you mean that the others do not exist? or that it is the
>> type you are encouraged to use unless you have reason not to?
>
> I guess I mean specifically that:
>
> (1) Object.toString() should return wchar[], not char[]

Sure... so long as custom classes can overload and return char[] for situations where the app might be using char[] throughout (for whatever reason).

OT: here is an example where we _don't_ want the return value used for method name resolution.

> (2) String literals such as "hello world" should be interpretted as wchar[], not char[].

Currently doesn't it decide the type based on the context? i.e.

void foo(dchar[] a);
foo("hello world");

would make "hello world" a dchar string literal?

I guess what you're saying is the default should be wchar[] where the type is indeterminate i.e.

writef("hello world");

But why not use char[] for the above? It's more efficient in this case. The compiler could make a quick decision based on whether the string contains any code points >= U+0800: if not, use char[], otherwise use wchar[]. Would that be a good soln?
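
To be clearer about the check I mean, something like this (just a sketch of the idea, not compiler code):

// pick char[] for a literal unless it contains a code point that needs
// 3 or more bytes in UTF-8, i.e. anything >= U+0800
bool betterAsWchar(dchar[] literal)
{
    foreach (dchar c; literal)
        if (c >= 0x800)
            return true;
    return false;
}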

After all, I don't mind if my compile is a little slower if it means my app is faster.

> (3) Object.d should contain the line:
> #    alias wchar[] string;

I'm not sure I like this.. will this hide details a programmer should be aware of?

> (4) The text on page http://www.digitalmars.com/d/arrays.html should be changed.
> Currently it says:
>
>> Dynamic arrays in D suggest the obvious solution - a string is just a dynamic
>> array of characters. String literals become just an easy way to write character
>> arrays.
>>
>>    char[] str;
>>    char[] str1 = "abc";
>
> This should be changed to:
>
>> Dynamic arrays in D suggest the obvious solution - a string is just a dynamic
>> array of characters. String literals become just an easy way to write character
>> arrays.
>>
>>    wchar[] str;
>>    wchar[] str1 = "abc";
>
> (5) There are probably several other things to change. I don't claim this is an
> exhaustive list.

Sure, char[] is probably used and suggested in every example in the manuals except where it's giving an example of the differences in UTF-x encodings.

> In other words, we could actually have our cake /and/ eat it. The intent is to
> minimize, as far as possible, the number of calls to toUTFxx().

Agreed.

> Ideally, they should occur only at input and output. The way to minimize this is to keep everything in the same type, so conversion is not needed. If the D
> documentation, the behaviour of the compiler, and the organization of Phobos, were to consistently use the same string type, and others were encouraged to use the same type, conversions would be kept to a minimum.
>
> Currently D does that - but it's "string of choice" is the char[], not the wchar[]. Conversion to/from UTF-8 is incredibly slow for non-ASCII characters.
> (It could be made faster, but it can /never/ be made as fast as UTF-16). So we make wchar[], not char[], the "standard", and hey presto, things get faster

Qualification: For non ASCII only apps.

> (and what's more will interface with ICU without conversion, which is really
> important for internationalization).

The fact that ICU has no char type suggests it's a bad choice for D, that is, if we want to assume they knew what they were doing. Are there any complaints from developers about ICU anywhere? Perhaps some digging for dirt would help make an objective decision here?

>> The most generally efficient type (if that's even possible to decide)
>> should be the type we're encouraged to use, if that's wchar so be it.
>
> Well, it is usually believed that
> UTF-8  is the most space-efficient but the least speed-efficient.
> UTF-32 is the most speed-efficient but the least space-efficient.
> UTF-16 is the happy medium.
>
> However...
> *) UTF-16 almost as fast as UTF-32, because the UTF-16 encoding is so simple
> *) UTF-16 is more compact than UTF-8 for codepoints between U+0800 and U+FFFF,
> each of which require 3 bytes in UTF-8, but only two bytes in UTF-16
> *) The characters expressable in UTF-16 in a single wchar include every symbol
> from every living language

..? I thought the problem Java had was that Unicode contains characters that are > U+FFFF.. if they're not part of a 'living language' what are they?

> , so if you /pretend/ that wchar[] is an array of
> characters rather than UTF-16 fragments, the effect is relatively harmless
> (unlike UTF-8).

Sure, however if you're only dealing with ASCII doing that with char[] is also fine. Those of us who haven't done any internationalisation are used to dealing only with ASCII.


I'd like some more stats and figures, simply:
 - how many unicode characters are in the range < U+0800?
 - how many unicode characters are in the range U+0800 <= x <= U+FFFF?
 - how many unicode characters are > U+FFFF?

(the answers to the above are probably quite simple, but I want them from someone who 'knows' rather than me who'd be guessing)

Then, how commonly is each range used? I imagine this differs depending on exactly what you're doing.. basically when would you use characters in each range and how common is that task?

It used to be that ASCII < U+0800 was the most common, it still may be, but I can see that it's not the future, the future is unicode.

>> Implicit transcoding will fit nicely with a standard type as when you are
>> using another type the library functions (if all written for wchar for
>> example) will still be available without explicit toUTFxx calls.
>
> True. I can't argue with that.
>
> But back to the case for ditching char - think of this from another perspective.
> In Java, would you be prepared to argue the case for the /introduction/ of an
> 8-bit wide character type to the language? And that this type could only ever be
> used for UTF-8?
>
> There's a reason why that suggestion sounds absurd. It is.

Isn't the reason that all the existing Java stuff, of which there is a *lot*, uses wchar, so char wouldn't integrate? That is different to D, where all 3 exist and char is actually the one with the most integration.

That said, would the introduction of char to Java give you anything? perhaps.. it would allow you to write an app that only deals with ASCII (chars < U+0800) more space efficiently, correct?

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 27, 2004
In article <opsdc4sgsi5a2sq9@digitalmars.com>, Regan Heath says...

>I guess what you're saying is the default should be wchar[] where the type is indeterminate

Yes.

>but, why not use char[] for the above, it's more efficient in this case. The compiler could do a quick decision based on whether the string contains any code points >= U+0800 if not use char[] otherwise user wchar[], would that be a good soln?

Your suggestion would work. But I'm still thinking along the lines that no conversion is better than some conversion, and no conversion is only achievable by having only one type of string. And even if we don't enforce that, we should at least encourage it.


>After all, I don't mind if my compile is a little slower if it means my app is faster.

UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all text is ASCII is going to be just as fast in UTF-16 as it is in ASCII. Remember that ASCII is a subset of UTF-16, just as it is a subset of UTF-8.

Converting between UTF-8 and UTF-16 won't slow you down much if all your characters are ASCII, of course. Such a conversion is trivial - not much slower than a memcpy. /But/ - you're still making a copy, still allocating stuff off the heap and copying data from one place to another, and that's still overhead which you would have avoided had you used UTF-16 right through.
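
To put it another way (a toy sketch; the function name is made up), the per-character work in the ASCII case is trivial, but the allocation and the copy don't go away:

#    // assumes every character in s is ASCII
#    wchar[] asciiToUTF16(char[] s)
#    {
#        wchar[] r = new wchar[s.length];    // the heap allocation happens regardless
#        for (size_t i = 0; i < s.length; i++)
#            r[i] = s[i];                    // widening copy, about as cheap as a memcpy
#        return r;
#    }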



>> (3) Object.d should contain the line:
>> #    alias wchar[] string;
>
>I'm not sure I like this.. will this hide details a programmer should be aware of?

It's just a strong hint. If aliases are bad then we shouldn't use them anywhere.


>> (It could be made faster, but it can /never/ be made as fast as UTF-16). So we make wchar[], not char[], the "standard", and hey presto, things get faster
>
>Qualification: For non ASCII only apps.

No, for /all/ apps. I don't see any reason why ASCII stored in wchar[]s would be any slower than ASCII stored in char[]s. Can you think of a reason why that would be so?

ASCII is a subset of UTF-8
ASCII is a subset of UTF-16

where's the difference? The difference is space, not speed.



>The fact that ICU has no char type suggests it's a bad choice for D, that is, if we want to assume they knew what they we're doing.

See http://oss.software.ibm.com/icu/userguide/strings.html for ICU's discussion on this.


>Are there any complaints from developers about ICU anywhere, perhaps some digging for dirt would help make an objective decision here?

I don't know. I imagine so. People generally tend to complain about /everything/.


>I'd like some more stats and figures, simply:
>  - how many unicode characters are in the range < U+0800?

These figures are all from Unicode 4.0. Unicode now stands at 4.0.1, so these figures are out of date - but they'll give you the general idea.

1646

>  - how many unicode characters are in the range U+0800 <= x <= U+FFFF?

54014, of which 6400 are Private Use Characters

>  - how many unicode characters are > U+FFFF?

176012, of which 131068 are Private Use Characters


>Then, how commonly is each range used? I imagine this differs depending on exactly what you're doing.. basically when would you use characters in each range and how common is that task?

The biggest non-BMP chunks are:
U+20000 to U+2A6D6 = CJK Compatibility Ideographs
U+F0000 to U+FFFFD = Private Use
U+100000 to U+10FFFD = Private Use

Compatibility Ideographs are not used in Unicode except for round-trip compatibility with legacy CJK character sets. Every single one of them is nothing but a compatibility alias for another character. The Private Use characters are not defined by Unicode (being reserved for private interchange between consenting parties). The remainder of the non-BMP (> U+FFFF) characters are:

*) More CJK compatibility characters
*) Old italic variants of ASCII characters
*) Gothic letters (no idea what they are)
*) The Deseret script (a dead language, so far as I know)
*) Musical symbols
*) Miscellaneous mathematical symbols
*) Mathematical variants of ASCII and Greek letters
*) "Tagged" variants of ASCII characters

The math characters are used only in math. The tagged characters are used only in /one/ protocol (and in fact, some of those characters are not used at all, even in that protocol). None of these characters are likely to be found in general text, only in specialist applications.

Here is the complete list of "blocks":

0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
0250..02AF; IPA Extensions
02B0..02FF; Spacing Modifier Letters
0300..036F; Combining Diacritical Marks
0370..03FF; Greek
0400..04FF; Cyrillic
0530..058F; Armenian
0590..05FF; Hebrew
0600..06FF; Arabic
0700..074F; Syriac
0780..07BF; Thaana
0900..097F; Devanagari
0980..09FF; Bengali
0A00..0A7F; Gurmukhi
0A80..0AFF; Gujarati
0B00..0B7F; Oriya
0B80..0BFF; Tamil
0C00..0C7F; Telugu
0C80..0CFF; Kannada
0D00..0D7F; Malayalam
0D80..0DFF; Sinhala
0E00..0E7F; Thai
0E80..0EFF; Lao
0F00..0FFF; Tibetan
1000..109F; Myanmar
10A0..10FF; Georgian
1100..11FF; Hangul Jamo
1200..137F; Ethiopic
13A0..13FF; Cherokee
1400..167F; Unified Canadian Aboriginal Syllabics
1680..169F; Ogham
16A0..16FF; Runic
1780..17FF; Khmer
1800..18AF; Mongolian
1E00..1EFF; Latin Extended Additional
1F00..1FFF; Greek Extended
2000..206F; General Punctuation
2070..209F; Superscripts and Subscripts
20A0..20CF; Currency Symbols
20D0..20FF; Combining Marks for Symbols
2100..214F; Letterlike Symbols
2150..218F; Number Forms
2190..21FF; Arrows
2200..22FF; Mathematical Operators
2300..23FF; Miscellaneous Technical
2400..243F; Control Pictures
2440..245F; Optical Character Recognition
2460..24FF; Enclosed Alphanumerics
2500..257F; Box Drawing
2580..259F; Block Elements
25A0..25FF; Geometric Shapes
2600..26FF; Miscellaneous Symbols
2700..27BF; Dingbats
2800..28FF; Braille Patterns
2E80..2EFF; CJK Radicals Supplement
2F00..2FDF; Kangxi Radicals
2FF0..2FFF; Ideographic Description Characters
3000..303F; CJK Symbols and Punctuation
3040..309F; Hiragana
30A0..30FF; Katakana
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
31A0..31BF; Bopomofo Extended
3200..32FF; Enclosed CJK Letters and Months
3300..33FF; CJK Compatibility
3400..4DB5; CJK Unified Ideographs Extension A
4E00..9FFF; CJK Unified Ideographs
A000..A48F; Yi Syllables
A490..A4CF; Yi Radicals
AC00..D7A3; Hangul Syllables
D800..DB7F; High Surrogates
DB80..DBFF; High Private Use Surrogates
DC00..DFFF; Low Surrogates
E000..F8FF; Private Use
F900..FAFF; CJK Compatibility Ideographs
FB00..FB4F; Alphabetic Presentation Forms
FB50..FDFF; Arabic Presentation Forms-A
FE20..FE2F; Combining Half Marks
FE30..FE4F; CJK Compatibility Forms
FE50..FE6F; Small Form Variants
FE70..FEFE; Arabic Presentation Forms-B
FEFF..FEFF; Specials
FF00..FFEF; Halfwidth and Fullwidth Forms
FFF0..FFFD; Specials
10300..1032F; Old Italic
10330..1034F; Gothic
10400..1044F; Deseret
1D000..1D0FF; Byzantine Musical Symbols
1D100..1D1FF; Musical Symbols
1D400..1D7FF; Mathematical Alphanumeric Symbols
20000..2A6D6; CJK Unified Ideographs Extension B
2F800..2FA1F; CJK Compatibility Ideographs Supplement
E0000..E007F; Tags
F0000..FFFFD; Private Use
100000..10FFFD; Private Use



>It used to be that ASCII < U+0800 was the most common, it still may be, but I can see that it's not the future, the future is unicode.

Most common, perhaps, if you limit your text to Latin, Greek, Hebrew and Russian. But not otherwise.




>That said, would the introduction of char to Java give you anything? perhaps.. it would allow you to write an app that only deals with ASCII (chars < U+0800) more space efficiently, correct?

Only chars < U+0080 (not U+0800) would be more space efficient in UTF-8. Between U+0080 and U+07FF they both need two bytes. From U+0800 upwards, UTF-8 needs three bytes where UTF-16 needs two.

How am I doing on the convincing front? I'd still go for:
*) wchar[] for everything in Phobos and everything DMD-generated;
*) ditch the char
*) lossless implicit conversion between all remaining D string types

Arcane Jill