Thread overview
performance of char vs wchar
 Aug 27, 2004  Ben Hinkle
 Aug 27, 2004  Ben Hinkle
 Aug 27, 2004  Berin Loritsch
 Aug 27, 2004  Ben Hinkle
 Aug 27, 2004  Arcane Jill
 Aug 27, 2004  Ben Hinkle
 Aug 27, 2004  Berin Loritsch
 Aug 27, 2004  Walter
 Aug 28, 2004  Arcane Jill
 Aug 28, 2004  Arcane Jill
 Aug 28, 2004  Walter
 Aug 30, 2004  Berin Loritsch
 Aug 30, 2004  Walter
 Sep 02, 2004  Roald Ribe
 Aug 27, 2004  Berin Loritsch
 Aug 27, 2004  Arcane Jill
 Aug 27, 2004  Walter
 Aug 28, 2004  Arcane Jill
 Aug 28, 2004  Walter
 Aug 30, 2004  Berin Loritsch
 Aug 30, 2004  Roald Ribe
 Aug 30, 2004  Lars Ivar Igesund
 Aug 30, 2004  Ben Hinkle
 Aug 30, 2004  Ilya Minkov
 Aug 31, 2004  Arcane Jill
 Sep 01, 2004  Ilya Minkov
 Aug 28, 2004  van eeshan
GC non-roots (was performance of char vs wchar)
 Aug 28, 2004  Arcane Jill
 Aug 28, 2004  Walter
 Aug 27, 2004  Sean Kelly
 Aug 27, 2004  Arcane Jill
August 27, 2004
All the threads about char vs wchar got me thinking about char performance,
so I thought I'd test out the difference. I've attached a simple test that
allocates a bunch of strings (of type char[] or wchar[]) of random length
each with one random 'a' in them. Then it loops over the strings randomly
and counts the number of 'a's. So this test does not measure transcoding
speed - it just measures allocation and simple access speed across a large
number of long strings. My times are
 char: .36 seconds
 wchar: .78
 dchar: 1.3

I tried to make sure the dchar version didn't run into swap space. If I scale up or down the size of the problem the factor of 2 generally remains until the times get so small that it doesn't matter. It looks like a lot of the time is taken in initializing the strings, since when I modify the test to only time access performance the scale factor goes down from 2 to something under 1.5 or so.

I built it using
 dmd charspeed.d -O -inline -release

It would be nice to try other tests that involve searching for sub-strings (like multi-byte unicode strings) but I haven't done that.
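
(The attached charspeed.d is not reproduced in this archive. The following is a rough sketch of the kind of test described above - written against today's Phobos, with made-up sizes and names - not the actual attachment:)

// charspeed_sketch.d - illustrative only, not the attached charspeed.d
// build: dmd -O -inline -release charspeed_sketch.d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.random : uniform;
import std.stdio : writefln;

enum numStrings = 2_000;    // sizes are made up for the sketch
enum maxLen     = 10_000;

void run(Char)(string label)
{
    auto sw = StopWatch(AutoStart.yes);

    // allocate strings of random length, each containing one random 'a'
    auto strs = new Char[][](numStrings);
    foreach (ref s; strs)
    {
        s = new Char[](uniform(1, maxLen));
        s[uniform(0, s.length)] = 'a';
    }

    // walk the strings in random order and count the 'a's
    size_t total;
    foreach (i; 0 .. numStrings)
        foreach (c; strs[uniform(0, numStrings)])
            if (c == 'a') ++total;

    writefln("%-5s %s ms (count=%s)", label, sw.peek.total!"msecs", total);
}

void main()
{
    run!char("char");
    run!wchar("wchar");
    run!dchar("dchar");
}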

-Ben

August 27, 2004
Note: it looks like the version that got attached was not the version I used to run the tests - it was the one I was using to test access. I thought attaching the file would make a copy right when I attached it, but it looks like my newsreader makes a copy only when the message is posted. The version I ran to get my original numbers had longer strings (like 10000 instead of 1000) and timed the allocation as well as the access.

-Ben

August 27, 2004
Ben Hinkle wrote:
> All the threads about char vs wchar got me thinking about char performance,
> so I thought I'd test out the difference. I've attached a simple test that
> allocates a bunch of strings (of type char[] or wchar[]) of random length
> each with one random 'a' in them. Then it loops over the strings randomly
> and counts the number of 'a's. So this test does not measure transcoding
> speed - it just measures allocation and simple access speed across a large
> number of long strings. My times are
>  char: .36 seconds
>  wchar: .78
>  dchar: 1.3
> 
> I tried to make sure the dchar version didn't run into swap space. If I
> scale up or down the size of the problem the factor of 2 generally remains
> until the times gets so small that it doesn't matter. It looks like a lot
> of the time is taken in initializing the strings since the when I modify
> the test to only time access performance the scale factor goes down from 2
> to something under 1.5 or so.
> 
> I built it using
>  dmd charspeed.d -O -inline -release
> 
> It would be nice to try other tests that involve searching for sub-strings
> (like multi-byte unicode strings) but I haven't done that. 
>

FWIW, the Java 'char' is a 16-bit value because of the Unicode standard.
The idea, of course, is that internally to the program all strings are
encoded the same way and translated on IO.

I would venture to say there are two observations about Java strings:

1) In most applications they are fast enough to handle a lot of work.

2) It can become a bottleneck when excessive string concatenation is
   happening, or logging is overdone.

As far as D is concerned, it should be expected that an array with the
same number of elements will take longer to allocate and initialize as
the element size grows - in this case 1 byte, 2 bytes, or 4 bytes.  I
don't know much about allocation routines, but I imagine getting them to
run at constant cost regardless of element size is either beyond the
realm of possibility or just way too difficult to be practical to pursue.

IMO, the benefits of using a standard Unicode encoding within the D
program outweigh the costs of allocating the arrays.  Anytime we make
it easier to provide internationalization (i18n), it is a step toward
greater acceptance.  To assume the world only works with European (or
American) character sets is to stick your head in the sand.

Honestly, I would prefer something that made internationalized strings
easier to manage rather than more difficult.  If there were no multi-char
graphemes (i.e. characters that take up more than one code space), that
would be the easiest to work with and write libraries for.
August 27, 2004
In article <cgn92o$1jj4$1@digitaldaemon.com>, Ben Hinkle says...

> char: .36 seconds
> wchar: .78
> dchar: 1.3

Yeah, I forgot about allocation time. Of course, D initializes all arrays, no matter whence they are allocated. char[]s will be filled with all FFs, and wchar[]s will be filled with all FFFFs. Twice as many bytes = twice as many bytes to initialize. Damn!

A super-fast character array allocator would make a lot of difference here. There are probably many different ways of doing this very fast. I guess this has to happen within DMD.
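
(As a reference for readers of this archive: today's Phobos, which postdates this thread, has std.array.uninitializedArray as exactly this kind of escape hatch. A small sketch of the default initialization being discussed and of skipping it:)

// sketch against today's Phobos, not the 2004 library
import std.array : uninitializedArray;

void main()
{
    // D's "invalid UTF" default initializers: a wchar[] of the same
    // length as a char[] has twice as many bytes to write up front
    static assert(char.init  == 0xFF);
    static assert(wchar.init == 0xFFFF);
    static assert(dchar.init == 0x0000FFFF);

    auto a = new char[](1024);     // arrives pre-filled with 0xFF
    assert(a[0] == char.init);

    // uninitializedArray allocates without the fill; the caller promises
    // to overwrite the memory before reading it
    auto b = uninitializedArray!(char[])(1024);
    b[] = 'x';
}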



August 27, 2004
"Berin Loritsch" <bloritsch@d-haven.org> wrote in message news:cgnc25$1l1f$1@digitaldaemon.com...
> Ben Hinkle wrote:
> > All the threads about char vs wchar got me thinking about char
> > performance, so I thought I'd test out the difference. I've attached a
> > simple test that allocates a bunch of strings (of type char[] or wchar[])
> > of random length each with one random 'a' in them. Then it loops over the
> > strings randomly and counts the number of 'a's. So this test does not
> > measure transcoding speed - it just measures allocation and simple access
> > speed across a large number of long strings. My times are
> >  char: .36 seconds
> >  wchar: .78
> >  dchar: 1.3
> >
> > I tried to make sure the dchar version didn't run into swap space. If I
> > scale up or down the size of the problem the factor of 2 generally remains
> > until the times gets so small that it doesn't matter. It looks like a lot
> > of the time is taken in initializing the strings since the when I modify
> > the test to only time access performance the scale factor goes down from 2
> > to something under 1.5 or so.
> >
> > I built it using
> >  dmd charspeed.d -O -inline -release
> >
> > It would be nice to try other tests that involve searching for
> > sub-strings (like multi-byte unicode strings) but I haven't done that.
> >
>
> FWIW, the Java 'char' is a 16 bit value due to the unicode standards. The idea of course, is that internally to the program all strings are encoded the same and translated on IO.
>
> I would venture to say there are two observations about Java strings:
>
> 1) In most applications they are fast enough to handle a lot of work.
>
> 2) It can become a bottleneck when excessive string concatenation is
>     happening, or logging is overdone.

I wonder what Java strings in utf8 would be like... I wonder if anyone has tried that out.

> As far as D is concerned, it should be expected that a same sized array would take longer if the difference between arrays is the element size. In this case 1 byte, 2 bytes, 4 bytes.  I don't know much about allocation routines, but I imagine getting something to run at constant cost is either beyond the realm of possibility or it is just way too difficult to be practical to pursue.

agreed. There probably isn't any way to speed up the initialization much.

> IMO, the benefits of using a standard unicode encoding within the D program outway the costs of allocating the arrays.  Anytime we make it easier to provide internationalization (i18n), it is a step toward greater acceptance.  To assume the world only works with european (or american) character sets is to stick your head in the sand.

agreed. utf8 and utf16 are both unicode standards that require multi-byte handling to cover all of unicode so in terms of ease of use they shouldn't be any different.

> Honestly, I would prefer something that made internationalized strings easier to manage that more difficult.  If there is no multi-char graphemes (i.e. takes up more than one code space) then that would be the easiest to work with and write libraries for.

dchar would be the choice for ease of use, but as you can see performance goes downhill significantly (at least for the naive test I ran). To me the performance of dchar is too poor to make it the standard, and the ease-of-use of utf8 and utf16 is essentially equivalent, so since utf8 has the best performance it should be the default. Hence my attempt to measure which is faster for typical usage: char or wchar?

-Ben


August 27, 2004
In article <cgner6$1mh4$1@digitaldaemon.com>, Ben Hinkle says...

>dchar would be the choice for ease of use

Actually, wchar and dchar are pretty much the same in terms of ease of use. In almost all cases, you can just /pretend/ that UTF-16 is not a multi-word encoding, and just write your code as if one wchar == one character. Except in very specialist cases, you'll be correct. Even if you're using characters beyond U+FFFF, regarding them as two characters instead of one is likely to be harmless.

Of course, there are applications for which you can't do this. Font rendering algorithms, for example, /must/ be able to distinguish true character boundaries. But even then, the cost of rendering greatly outweighs the (almost insignificant) cost of determining true character boundaries - a trivial task in UTF-16.

char, by contrast, is much more difficult to use, because you /cannot/ "pretend" it's a single-byte encoding, because there are too many characters beyond U+00FF which applications need to "understand".
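
(A small illustration of the shortcut being described, using today's std.utf; the sample strings are arbitrary:)

import std.stdio : writeln;
import std.utf : count;   // counts code points in any UTF string

void main()
{
    wstring bmp    = "résumé";                // all code points <= U+FFFF
    wstring astral = "G clef: \U0001D11E";    // one character beyond U+FFFF

    writeln(bmp.length,    " ", count(bmp));    // 6 6   - shortcut holds
    writeln(astral.length, " ", count(astral)); // 10 9  - surrogate pair
}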



>but as you can see performance
>goes downhill significantly (at least for the naive test I ran). To me the
>performance of dchar is too poor to make it the standard

Agreed.


>and the ease-of-use
>of utf8 and utf16 are essentially equivalent

Not really. The ICU API is geared around UTF-16, so if you use wchar[]s, you can call ICU functions directly. With char[]s, there's a bit more faffing, so UTF-8 becomes less easy to use in this regard.

And don't forget - UTF-8 is a complex algorithm, while UTF-16 is a trivially simple algorithm. As an application programmer, you may be shielded from the complexity of UTF-8 by library implementation and implicit conversion, but it's all going on under the hood. UTF-8's "ease of use" vanishes as soon as you actually have to implement it.



>so since utf8 has the best
>performance it should be the default.

But it doesn't. Your tests were unfair. A UTF-16 array will not, in general, require twice as many bytes as a UTF-8 array, which is what you seemed to assume. That is only true if the string is pure ASCII, not otherwise, and for the majority of characters UTF-8 comes off worse. Check out this table:

Codepoint            Number of bytes
range           UTF-8    UTF-16   UTF-32
----------------------------------------
0000 to 007F    1        2        4
0080 to 07FF    2        2        4
0800 to FFFF    3        2        4    <----- who wins on this row?
10000+          4        4        4
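
(The table can be checked against std.utf.codeLength in today's Phobos, which reports how many code units a single code point needs in a given encoding. A small sketch, with arbitrary sample code points:)

import std.stdio : writefln;
import std.utf : codeLength;

void main()
{
    // U+0041 'A', U+00E9 'é', U+20AC '€', U+1D11E (musical G clef)
    foreach (dchar c; "A\u00E9\u20AC\U0001D11E")
        writefln("U+%06X  utf8: %s bytes  utf16: %s bytes  utf32: %s bytes",
                 cast(uint) c,
                 codeLength!char(c)  * char.sizeof,
                 codeLength!wchar(c) * wchar.sizeof,
                 codeLength!dchar(c) * dchar.sizeof);
    // prints 1/2/4, 2/2/4, 3/2/4 and 4/4/4 - one line per row of the table
}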



>Hence my attempt to measure which is
>faster for typical usage: char or wchar?

You could try measuring it again, but this time without the assumption that all characters are ASCII.

Arcane Jill



August 27, 2004
In article <cgn92o$1jj4$1@digitaldaemon.com>, Ben Hinkle says...
>
> char: .36 seconds
> wchar: .78
> dchar: 1.3

So:

wchar = char * 2
dchar = char * 4

It looks like the time scales directly with element size, which stands to reason since D default-initializes all arrays.  I would be interested in seeing performance comparisons for transcoding between different formats. For the sake of argument, perhaps using both std.utf and whatever Mango uses.
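
(A rough sketch of such a transcoding comparison using std.utf only - Mango's converters aren't shown, and the sample text and iteration count are made up:)

import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;
import std.utf : toUTF8, toUTF16, toUTF32;

void time(string label, scope void delegate() dg, int n = 1_000)
{
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. n) dg();
    writefln("%-14s %s ms", label, sw.peek.total!"msecs");
}

void main()
{
    string src8 = "The quick brown fox jumps over the lazy dog. äöü 日本語 ";
    foreach (i; 0 .. 6) src8 ~= src8;          // a few KB of mixed-width text
    wstring src16 = toUTF16(src8);

    time("utf8->utf16",  { cast(void) toUTF16(src8);  });
    time("utf8->utf32",  { cast(void) toUTF32(src8);  });
    time("utf16->utf8",  { cast(void) toUTF8(src16);  });
    time("utf16->utf32", { cast(void) toUTF32(src16); });
}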


Sean


August 27, 2004
In article <cgnh8j$1nl6$1@digitaldaemon.com>, Sean Kelly says...

>So:
>
>wchar = char * 2
>dchar = char * 4

Only if there are the same number of chars in a char array as there are wchars in a wchar array, and dchars in a dchar array. This will /only/ be true if the string is pure ASCII.

Jill


August 27, 2004
Ben Hinkle wrote:

> "Berin Loritsch" <bloritsch@d-haven.org> wrote in message
> news:cgnc25$1l1f$1@digitaldaemon.com...
> 
>>FWIW, the Java 'char' is a 16 bit value due to the unicode standards.
>>The idea of course, is that internally to the program all strings are
>>encoded the same and translated on IO.
>>
>>I would venture to say there are two observations about Java strings:
>>
>>1) In most applications they are fast enough to handle a lot of work.
>>
>>2) It can become a bottleneck when excessive string concatenation is
>>    happening, or logging is overdone.
> 
> 
> I wonder what Java strings in utf8 would be like... I wonder if anyone has
> tried that out.

Internally there is no such thing.  It's just easier to deal with that
way.  The translation happens with encoders and decoders on IO.

>>Honestly, I would prefer something that made internationalized strings
>>easier to manage that more difficult.  If there is no multi-char
>>graphemes (i.e. takes up more than one code space) then that would be
>>the easiest to work with and write libraries for.
> 
> dchar would be the choice for ease of use but as you can see performance
> goes downhill significantly (at least for the naive test I ran). To me the
> performance of dchar is too poor to make it the standard and the ease-of-use
> of utf8 and utf16 are essentially equivalent so since utf8 has the best
> performance it should be the default. Hence my attempt to measure which is
> faster for typical usage: char or wchar?

Yeah, but I've learned not to get hung up on strict performance.  There
is a difference between ultimately fast and fast enough.  Sometimes
squeezing those extra cycles out causes more programming issues than
need be.

If the allocation routines could be made faster (big assumption here),
would it be preferable to use dchar if it is fast enough?

For the record, I believe Java uses UTF16 internally, which means for
most things there is less of a need to worry about MB characters.

The interesting test would be to have strings of the same length, and
then test the algorithm to get a substring from that string.  For
example, a fixed string that uses some multi-codespace characters here
and there, and then getting the 15th through 20th characters of that
string.
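
(A sketch of that substring test, assuming today's std.utf.toUTFindex, which maps a code-point index to a code-unit index; the sample string and indices are arbitrary:)

import std.stdio : writeln;
import std.utf : toUTFindex;

void main()
{
    // mixes 1-, 2- and 3-byte characters in utf8
    string s = "naïve café: 東京 and back";

    size_t lo = toUTFindex(s, 15);   // code-unit offset of the 15th code point
    size_t hi = toUTFindex(s, 20);   // ...and of the 20th
    writeln(s[lo .. hi]);            // "characters" 15 through 20: "and b"
}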

This will show just how things might come out in the wash when we are
dealing with that type of issue.  For example, it is not uncommon to
have the system read a block of text from a file into memory (say
4k worth at a time) and then iterate over it one line at a time - which
gives us the substring scenario.

Alternatively, there is the regex algorithm, which would need to account
for multi-codespace characters and would take a performance hit as well.

There is a lot more to overall system performance than just string
allocation--even though we are aware that it can be a significant cost.
August 27, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgngtc$1nep$1@digitaldaemon.com...
> In article <cgner6$1mh4$1@digitaldaemon.com>, Ben Hinkle says...
>
> >dchar would be the choice for ease of use
>
> Actually, wchar and dchar are pretty much the same in terms of ease of
> use. In almost all cases, you can just /pretend/ that UTF-16 is not a
> multi-word encoding, and just write your code as if one wchar == one
> character. Except in very specialist cases, you'll be correct. Even if
> you're using characters beyond U+FFFF, regarding them as two characters
> instead of one is likely to be harmless.
>
> Of course, there are applications for which you can't do this. Font
> rendering algorithms, for example, /must/ be able to distinguish true
> character boundaries. But even then, the cost of rendering greatly
> outweight the (almost insignificant) cost of determining true character
> boundaries - a trivial task in UTF-16.
>
> char, by contrast, is much more difficult to use, because you /cannot/
> "pretend" it's a single-byte encoding, because there are too many
> characters beyond U+00FF which applications need to "understand".

true - shortcuts can be taken if the application doesn't have to support all of unicode.

> >but as you can see performance
> >goes downhill significantly (at least for the naive test I ran). To me the
> >performance of dchar is too poor to make it the standard
>
> Agreed.
>
>
> >and the ease-of-use
> >of utf8 and utf16 are essentially equivalent
>
> Not really. The ICU API is geared around UTF-16, so if you use wchar[]s,
> you can call ICU functions directly. With char[]s, there's a bit more
> faffing, so UTF-8 becomes less easy to use in this regard.

I'm not sure what faffing is but yeah I agree - I just meant the ease-of-use without worrying about library APIs.

> And don't forget - UTF-8 is a complex algorithm, while UTF-16 is a
> trivially simple algorithm. As an application programmer, you may be
> shielded from the complexity of UTF-8 by library implementation and
> implicit conversion, but it's all going on under the hood. UTF-8's
> "ease of use" vanishes as soon as you actually have to implement it.

I meant ease-of-use for end-users. Calling toUTF8 is just as easy as calling toUTF16. Personally, as long as it is correct, the underlying library complexity doesn't affect me. Performance of the application becomes the key factor.

> >so since utf8 has the best
> >performance it should be the default.
>
> But it doesn't. Your tests were unfair. A UTF-16 array will not, in
> general, require twice as many bytes as a UTF-8 array, which is what you
> seemed to assume. That will only be true if the string is pure ASCII, but
> not otherwise, and for the majority of characters the performance will be
> worse. Check out this table:
>
> Codepoint            Number of bytes
> range           UTF-8    UTF-16   UTF-32
> ----------------------------------------
> 0000 to 007F    1        2        4
> 0080 to 07FF    2        2        4
> 0800 to FFFF    3        2        4    <----- who wins on this row?
> 10000+          4        4        4

I was going to start digging into non-ASCII next. I remember reading somewhere that encoding Asian languages in utf8 typically results in longer strings than utf16. That will definitely hamper utf8.
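
(Concretely, for the CJK case: the same three characters take three bytes each in utf8 but one 16-bit unit each in utf16:)

void main()
{
    string  u8  = "日本語";   // U+65E5 U+672C U+8A9E
    wstring u16 = "日本語";

    assert(u8.length  * char.sizeof  == 9);   // 3 bytes per character in utf8
    assert(u16.length * wchar.sizeof == 6);   // 2 bytes per character in utf16
}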

> >Hence my attempt to measure which is
> >faster for typical usage: char or wchar?
>
> You could try measuring it again, but this time without the assumption
> that all characters are ASCII.
>
> Arcane Jill
>
>
>

