August 27, 2004
Ben Hinkle wrote:

> 
> I was going to start digging into non-ascii next. I remember reading
> somewhere that encoding asian languages in utf8 typically results in longer
> strings than utf16. That will definitely hamper utf8.

Either that or hamper Japanese coders :)

If it comes out to a performance draw when dealing with non-ascii text,
then might I suggest using programming ease (for library writers as
well) to be the tie breaker?
August 27, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgngtc$1nep$1@digitaldaemon.com...
> But it doesn't. Your tests were unfair. A UTF-16 array will not, in general,
> require twice as many bytes as a UTF-8 array, which is what you seemed to
> assume. That will only be true if the string is pure ASCII, but not
> otherwise, and for the majority of characters the performance will be worse.

The majority of characters are multibyte in UTF-8, that is true. But the distribution of characters is what matters for speed in real apps, and for those, the vast majority will be ASCII. Furthermore, many operations on strings can treat UTF-8 as if it were single byte, such as copying, sorting, and searching.
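That byte-level point is easy to demonstrate (an illustrative Python sketch of my own, not from the original post): UTF-8 guarantees that no byte of a multibyte sequence falls in the ASCII range 0x00-0x7F, so byte-wise copying and ASCII substring search need no decoding.

```python
# Illustrative: UTF-8 never reuses ASCII byte values inside a multibyte
# sequence, so ASCII-pattern operations on the raw bytes behave exactly
# as they would on a single-byte encoding.
s = "prix: 42€ (café)"             # mixed ASCII and non-ASCII text
b = s.encode("utf-8")

# Byte-wise copying is always safe -- no decode/re-encode needed.
copy = bytes(b)
assert copy == b

# Searching for a pure-ASCII substring at the byte level finds exactly
# the position a decoded, character-level search would find.
assert b.find(b"42") == 6          # right after the 6 ASCII bytes "prix: "
```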

Of course, there are still many situations where UTF-8 is not ideal, which is why D tries to be agnostic about whether the application programmer wants to use char[], wchar[], or dchar[] or any combination of the three.


August 27, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgncho$1la0$1@digitaldaemon.com...
> In article <cgn92o$1jj4$1@digitaldaemon.com>, Ben Hinkle says...
>
> > char: .36 seconds
> > wchar: .78
> > dchar: 1.3
>
> Yeah, I forgot about allocation time. Of course, D initializes all arrays,
> no matter whence they are allocated. char[]s will be filled with all FFs,
> and wchar[]s will be filled with all FFFFs. Twice as many bytes = twice as
> many bytes to initialize. Damn!

There are also twice as many bytes to scan for the gc, and half the data until your machine starts thrashing the swap disk. The latter is a very real issue for server apps, since it means that you reach the point of having to double the hardware in half the time.


August 28, 2004
In article <cgobse$237t$1@digitaldaemon.com>, Walter says...

>The majority of characters [within strings] are multibyte in UTF-8,
>that is true. But the [frequency]
>distribution of characters is what matters for speed in real apps, and for
>those, the vast majority will be ASCII.

Are you sure? That's one hellava claim to make.

I've noticed that D seems to generate a lot of interest from Japan (judging from the existence of Japanese web sites). Of course, Japanese strings average 1.5 times as long in UTF-8 as those same strings would have been in UTF-16.
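The 1.5x figure is easy to verify (a quick Python check of my own, with an arbitrary sample string): most Japanese characters sit in the Basic Multilingual Plane, costing 3 bytes in UTF-8 but only 2 in UTF-16.

```python
# Quick check of the 1.5x claim with an arbitrary Japanese sample.
s = "こんにちは世界"                      # 7 BMP characters
utf8_bytes = len(s.encode("utf-8"))       # 3 bytes per character here
utf16_bytes = len(s.encode("utf-16-le"))  # 2 bytes each (-le: no BOM)
assert utf8_bytes == 21 and utf16_bytes == 14
assert utf8_bytes / utf16_bytes == 1.5
```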

The whole "Most characters are ASCII" dogma is really only true if you happen to live in a certain part of the world, and to bias a computer language because of that assumption hurts performance for everyone else. /Please/ reconsider.



>Furthermore, many operations on
>strings can treat UTF-8 as if it were single byte, such as copying, sorting,
>and searching.

Copying, yes. But of course you miss the point that this would be just as true in UTF-16 as it is in UTF-8.

Sorting? - Lexicographical sorting, maybe, but the only reason you can get away with that is because, in ASCII-only parts of the world, codepoint order happens to correspond to the order we find letters in the alphabet, and even then only if we're prepared to compromise on case ("Foo" sorts before "bar"). Stick an acute accent over one of the vowels and lexicographical sort order goes out the window. Lexicographical sorting may be good for purely mechanical things like eliminating duplicates in an AA, but if someone wants to look up all entries between "Alice" and "Bob" in a database, I think they would be very surprised to find that "Äaron" was not in the list. (And again, you miss the point that if lexicographical sorting /is/ what you want, it works just as well in UTF-16 as it does in UTF-8). /Real/ sorting, however, requires full understanding of Unicode, and for that, ASCII is just not good enough.
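The accent problem is easy to reproduce (a Python sketch of my own, not from the post): codepoint order puts every uppercase ASCII letter before every lowercase one, and accented letters after both.

```python
# Codepoint (and UTF-8 byte) order is not alphabetical order:
# 'F' is U+0046, 'b' is U+0062, 'Ä' is U+00C4.
words = ["bar", "Foo", "Äaron"]
assert sorted(words) == ["Foo", "bar", "Äaron"]
# A dictionary-style sort would place "Äaron" first; that requires
# locale-aware Unicode collation, not raw codepoint comparison.
```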

Searching? If you treat "UTF-8 as if it were single byte", there are an /awful lot/ of characters you can't search for, including the British Pound (currency) sign, the Euro currency sign and anything with an accent over it. And searching for the single byte 0x81 (for example) is not exactly useful.



>Of course, there are still many situations where UTF-8 is not ideal, which is why D tries to be agnostic about whether the application programmer wants to use char[], wchar[], or dchar[] or any combination of the three.

Yes, agnostic is good. No problem there. I'm only talking about the /default/. I thought you were all /for/ internationalization and Unicode and all that? I'm surprised to find myself arguing with you on this one. (Okay, I didn't really expect you to ditch the char, but to prefer wchar[] over char[] is /reasonable/). I have given many, many, many reasons why I think that wchar[] strings should be the default in D, and if I can't convince you, I think that would be a big shame.

Arcane Jill


August 28, 2004
In article <cgobse$237t$2@digitaldaemon.com>, Walter says...

>There are also twice as many bytes to scan for the gc, and half the data until your machine starts thrashing the swap disk. The latter is a very real issue for server apps, since it means that you reach the point of having to double the hardware in half the time.

There you go again, assuming that wchar[] strings are double the length of char[] strings. THIS IS NOT TRUE IN GENERAL. In Chinese, wchar[] strings are shorter than char[] strings. In Japanese, wchar[] strings are shorter than char[] strings. In Mongolian, wchar[] strings are shorter than char[] strings. In Tibetan, wchar[] strings are shorter than char[] strings. I assume I don't need to go on...?
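The claim checks out for those scripts (illustrative sample strings of my own; any text in these scripts behaves the same way, since the characters cost 3 bytes in UTF-8 but 2 in UTF-16):

```python
# Byte counts for a few non-Latin scripts: UTF-16 is the shorter encoding.
samples = {
    "Chinese":  "编程语言",      # CJK block: 3 bytes UTF-8, 2 bytes UTF-16
    "Japanese": "プログラム",    # katakana: same trade-off
    "Tibetan":  "བོད་ཡིག",        # U+0F00 block: same trade-off
}
for script, s in samples.items():
    u8 = len(s.encode("utf-8"))
    u16 = len(s.encode("utf-16-le"))
    assert u16 < u8, script      # wchar[]-style storage wins here
```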

<sarcasm>But I guess server apps never have to deliver text in those languages.</sarcasm>

Walter, servers are one of the places where internationalization matters most. XML and HTML documents, for example, could be (a) stored and (b) requested in any encodings whatsoever. A server would have to push them through a transcoding function. For this, wchar[]s are more sensible.

I don't understand the basis of your determination. It seems ill-founded.

Jill


August 28, 2004
In article <cgobse$237t$2@digitaldaemon.com>, Walter says...

>There are also twice as many (sic) bytes to scan for the gc,

Why are strings added to the GC root list anyway? It occurs to me that arrays of bit, byte, ubyte, short, ushort, char, wchar and dchar which are allocated on the heap can never contain pointers, and so should not be added to the GC's list of things to scan when created with new (or modification of .length).

I imagine that this one simple step would increase D's performance rather dramatically.

Arcane Jill


August 28, 2004
In article <cgp7c3$2e36$1@digitaldaemon.com>, Arcane Jill says...

>if someone wants to look up all
>entries between "Alice" and "Bob" in a database, I think they would be very
>surprised to find that "Äaron" was not in the list.

Whoops - dumb brain fart! Please pretend I didn't say that.

The reasoning is still sound - it's just my conception of alphabetical order that's up the spout.

Jill


August 28, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgp7c3$2e36$1@digitaldaemon.com...
>>> But it doesn't. Your tests were unfair. A UTF-16 array will not, in
>>> general, require twice as many bytes as a UTF-8 array, which is what you
>>> seemed to assume. That will only be true if the string is pure ASCII, but
>>> not otherwise, and for the majority of characters the performance will be
>>> worse.
> In article <cgobse$237t$1@digitaldaemon.com>, Walter says...
> >The majority of characters [within strings] are multibyte in UTF-8,
> >that is true. But the [frequency]
> >distribution of characters is what matters for speed in real apps, and for
> >those, the vast majority will be ASCII.
> Are you sure? That's one hellava claim to make.

Yes. Nearly all the spam I get is in ascii <g>.

When optimizing for speed, the first rule is optimize for what the bulk of the data will likely consist of for your application. For example, if you're writing a user interface for Chinese people, you'd be sensible to consider using dchar[] throughout.

It probably makes sense to use wchar[] for the unicode library you're developing because programmers who have a need for such a library will most likely NOT be writing applications for ascii.

> I've noticed that D seems to generate a lot of interest from Japan (judging
> from the existence of Japanese web sites). Of course, Japanese strings
> average 1.5 times as long in UTF-8 as those same strings would have been in
> UTF-16.

If I was building a Japanese word processor, I certainly wouldn't use UTF-8 internally in it for that reason.


> The whole "Most characters are ASCII" dogma is really only true if you
> happen to live in a certain part of the world, and to bias a computer
> language because of that assumption hurts performance for everyone else.
> /Please/ reconsider.

If everything is optimized for Japanese, it will hurt performance for ASCII users. The point is, there is no UTF encoding that is optimal for everyone. That's why D supports all three.

> >Furthermore, many operations on
> >strings can treat UTF-8 as if it were single byte, such as copying,
> >sorting, and searching.
> Copying, yes. But of course you miss the point that this would be just as
> true in UTF-16 as it is in UTF-8.

Of course. My point was that quite a few common string operations do not require decoding. For example, the D compiler processes source as UTF-8. It almost never has to do any decoding. The performance penalty for supporting multibyte encodings in D source is essentially zero.

> Sorting? - Lexicographical sorting, maybe, but the only reason you can get
> away with that is because, in ASCII-only parts of the world, codepoint order
> happens to correspond to the order we find letters in the alphabet, and even
> then only if we're prepared to compromise on case ("Foo" sorts before "bar").
> Stick an acute accent over one of the vowels and lexicographical sort order
> goes out the window. Lexicographical sorting may be good for purely
> mechanical things like eliminating duplicates in an AA, but if someone wants
> to look up all entries between "Alice" and "Bob" in a database, I think they
> would be very surprised to find that "Äaron" was not in the list. (And
> again, you miss the point that if lexicographical sorting /is/ what you
> want, it works just as well in UTF-16 as it does in UTF-8).

Sure - and my point is it wasn't necessary to decode UTF-8 to do that sort. It's not necessary for hashing the string, either.
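That rests on a deliberate design property of UTF-8 which is easy to demonstrate (a Python check of my own): comparing the encoded bytes gives exactly the same ordering as comparing the decoded code points, so a lexicographical sort never needs to decode.

```python
# UTF-8 preserves code point order: sorting by raw encoded bytes and
# sorting by decoded code points give identical results.
words = ["Äaron", "bar", "Foo", "日本", "zebra", "éclair"]
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_codepoints = sorted(words)       # Python compares str by code point
assert by_bytes == by_codepoints    # no decoding needed to sort
```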

> /Real/ sorting, however, requires full
> understanding of Unicode, and for that, ASCII is just not good enough.

There are many different ways to sort, and since the unicode characters are not always ordered the obvious way, you have to deal with that specially in each of UTF-8, -16, and -32.

> Searching? If you treat "UTF-8 as if it were single byte", there are an
> /awful lot/ of characters you can't search for, including the British Pound
> (currency) sign, the Euro currency sign and anything with an accent over it.
> And searching for the single byte 0x81 (for example) is not exactly useful.

That's why std.string.find() takes a dchar as its search argument. What you do is treat it as a substring search. There are a lot of very fast algorithms for doing such searches, such as Boyer-Moore, which get pretty close to the performance of a single character search. Furthermore, I'd optimize it so the first thing the search did was check if the search character was ascii. If so, it'd do the single character scan. Otherwise, it'd do the substring search.
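That dispatch can be sketched like this (a Python illustration with names of my own; D's actual std.string.find implementation may differ):

```python
def find_char_utf8(data: bytes, ch: str) -> int:
    """Byte index of the first occurrence of character `ch` in UTF-8
    `data`, or -1. ASCII gets a one-byte scan; anything else becomes a
    substring search for the character's multibyte encoding."""
    if ord(ch) < 0x80:                    # ASCII fast path
        return data.find(ord(ch))         # single-byte scan
    return data.find(ch.encode("utf-8"))  # fast substring search

text = "price: £5 or €6".encode("utf-8")
assert find_char_utf8(text, ":") == 5                     # ASCII path
assert find_char_utf8(text, "€") == text.find("€".encode("utf-8"))
```

The substring search is safe precisely because no UTF-8 multibyte sequence can match starting in the middle of another character's bytes.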

> >Of course, there are still many situations where UTF-8 is not ideal, which
> >is why D tries to be agnostic about whether the application programmer
> >wants to use char[], wchar[], or dchar[] or any combination of the three.
>
> Yes, agnostic is good. No problem there. I'm only talking about the
> /default/. I thought you were all /for/ internationalization and Unicode
> and all that?

But D does not have a default. Programmers can use the encoding which is optimal for the data they expect to see. Even if UTF-8 were the default, UTF-8 still supports full internationalization and Unicode. I am certainly not talking about supporting only ASCII or having ASCII as the default.

> I'm surprised to find myself arguing with you on this one. (Okay, I didn't
> really expect you to ditch the char, but to prefer wchar[] over char[] is
> /reasonable/). I have given many, many, many reasons why I think that
> wchar[] strings should be the default in D, and if I can't convince you, I
> think that would be a big shame.

My experience with using UTF-16 throughout a program is that it sped up quite a bit when converted to UTF-8. There is no blanket advantage to UTF-16; it depends on your expected data. When your expected data will be mostly ASCII, then UTF-8 is the reasonable choice.
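The "mostly ASCII" assumption holds even for accented European text, which is easy to sanity-check (an illustrative French sample of my own): only a handful of characters pay the 2-byte UTF-8 penalty, so the string still comes out far smaller than in UTF-16.

```python
# Even accented European text is overwhelmingly ASCII, so UTF-8 stays
# well below UTF-16's fixed 2 bytes per character.
s = "Le café était très fréquenté ce matin-là."
ascii_chars = sum(1 for c in s if ord(c) < 0x80)
u8 = len(s.encode("utf-8"))        # accented letters cost 2 bytes each
u16 = len(s.encode("utf-16-le"))   # every character costs 2 bytes
assert ascii_chars / len(s) > 0.8  # the text is mostly ASCII
assert u8 < u16                    # UTF-8 wins despite the accents
```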


August 28, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgp8di$2ec2$1@digitaldaemon.com...
> In article <cgobse$237t$2@digitaldaemon.com>, Walter says...
>
> >There are also twice as many (sic) bytes to scan for the gc,
>
> Why are strings added to the GC root list anyway? It occurs to me that
> arrays of bit, byte, ubyte, short, ushort, char, wchar and dchar which are
> allocated on the heap can never contain pointers, and so should not be
> added to the GC's list of things to scan when created with new (or
> modification of .length).
>
> I imagine that this one simple step would increase D's performance rather dramatically.

There's certainly potential in D to add type awareness to the gc. But that adds penalties of its own, and it's an open question whether on the balance it will be faster or not.


August 28, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgp845$2ea2$1@digitaldaemon.com...
> In article <cgobse$237t$2@digitaldaemon.com>, Walter says...
> >There are also twice as many bytes to scan for the gc, and half the data
> >until your machine starts thrashing the swap disk. The latter is a very
> >real issue for server apps, since it means that you reach the point of
> >having to double the hardware in half the time.
> There you go again, assuming that wchar[] strings are double the length of char[] strings. THIS IS NOT TRUE IN GENERAL.

Are you sure? Even european languages are mostly ascii.

> In Chinese, wchar[] strings are shorter than char[] strings. In Japanese,
> wchar[] strings are shorter than char[] strings. In Mongolian, wchar[]
> strings are shorter than char[] strings. In Tibetan, wchar[] strings are
> shorter than char[] strings. I assume I don't need to go on...?
>
> <sarcasm>But I guess server apps never have to deliver text in those languages.</sarcasm>

Never is not the right word here. The right idea is what is the frequency distribution of the various types of data one's app will see. Once you know that, you optimize for the most common cases. Tibetan is still fully supported regardless.

> Walter, servers are one of the places where internationalization matters
> most. XML and HTML documents, for example, could be (a) stored and (b)
> requested in any encodings whatsoever.

Of course. But what are the frequencies of the requests for various encodings? Each of the 3 UTF encodings fully support unicode and are fully internationalized. Which one you pick depends on the frequency distribution of your data.

> A server would have to push them through a transcoding function. For this, wchar[]s are more sensible.

It is not optimal unless the person optimizing the server app has instrumented his data so he knows the frequency distribution of the various characters. Only then can he select the encoding that will deliver the best performance.

> I don't understand the basis of your determination. It seems ill-founded.

Experience optimizing apps. One of the most potent tools for optimization is analyzing the data patterns, and making the most common cases take the shortest path through the code. UTF-16 is not optimal for a great many applications - and I have experience with it.