Thread overview
August 27, 2004  performance of char vs wchar
All the threads about char vs wchar got me thinking about char performance, so I thought I'd test out the difference. I've attached a simple test that allocates a bunch of strings (of type char[] or wchar[]) of random length, each with one random 'a' in them. Then it loops over the strings randomly and counts the number of 'a's. So this test does not measure transcoding speed - it just measures allocation and simple access speed across a large number of long strings. My times are:

char: .36 seconds
wchar: .78
dchar: 1.3

I tried to make sure the dchar version didn't run into swap space. If I scale the size of the problem up or down, the factor of 2 generally remains until the times get so small that they don't matter. It looks like a lot of the time is taken in initializing the strings, since when I modify the test to only time access performance the scale factor goes down from 2 to something under 1.5 or so.

I built it using

dmd charspeed.d -O -inline -release

It would be nice to try other tests that involve searching for sub-strings (like multi-byte unicode strings) but I haven't done that.

-Ben
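The original D attachment isn't preserved in the archive, so here is a rough sketch of the benchmark's shape, in Python rather than D, with sizes and names guessed from the description. The `array` typecodes 'b', 'h' and 'i' stand in for D's char, wchar and dchar element widths (typically 1, 2 and 4 bytes); as in the attached version Ben mentions below, only access is timed, not allocation.

```python
import array
import random
import time

def run(typecode, nstrings=1000, maxlen=1000):
    """Allocate nstrings arrays of random length, each holding exactly one
    'a' at a random position, then scan them in random order and count the
    'a's. Returns (total count, access time in seconds)."""
    rng = random.Random(0)
    strings = []
    for _ in range(nstrings):
        n = rng.randrange(1, maxlen)
        s = array.array(typecode, [0] * n)   # allocation + zero fill
        s[rng.randrange(n)] = ord('a')       # one random 'a'
        strings.append(s)
    order = list(range(nstrings))
    rng.shuffle(order)                       # visit strings in random order
    t0 = time.perf_counter()
    total = sum(strings[i].count(ord('a')) for i in order)
    return total, time.perf_counter() - t0

for tc in ('b', 'h', 'i'):                   # 1-, 2- and 4-byte elements
    total, secs = run(tc)
    print(tc, total, round(secs, 4))
```

Since each string contains exactly one 'a', the count always equals the number of strings; only the timing varies with element width.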
August 27, 2004  Re: performance of char vs wchar
Posted in reply to Ben Hinkle

Note: it looks like the version that got attached was not the version I used to run the tests - it was the one I was using to test access. I thought attaching the file would make a copy right at attach time, but it looks like my newsreader makes a copy only when the message is posted. The version I ran to get my original numbers had longer strings (like 10000 instead of 1000) and timed the allocation as well as the access.

-Ben
August 27, 2004  Re: performance of char vs wchar
Posted in reply to Ben Hinkle

Ben Hinkle wrote:
> All the threads about char vs wchar got me thinking about char performance,
> so I thought I'd test out the difference. I've attached a simple test that
> allocates a bunch of strings (of type char[] or wchar[]) of random length
> each with one random 'a' in them. Then it loops over the strings randomly
> and counts the number of 'a's. So this test does not measure transcoding
> speed - it just measures allocation and simple access speed across a large
> number of long strings. My times are
> char: .36 seconds
> wchar: .78
> dchar: 1.3
>
> I tried to make sure the dchar version didn't run into swap space. If I
> scale the size of the problem up or down, the factor of 2 generally remains
> until the times get so small that they don't matter. It looks like a lot
> of the time is taken in initializing the strings, since when I modify
> the test to only time access performance the scale factor goes down from 2
> to something under 1.5 or so.
>
> I built it using
> dmd charspeed.d -O -inline -release
>
> It would be nice to try other tests that involve searching for sub-strings
> (like multi-byte unicode strings) but I haven't done that.
>
FWIW, the Java 'char' is a 16 bit value due to the unicode standards. The idea, of course, is that internally to the program all strings are encoded the same and translated on IO.

I would venture to say there are two observations about Java strings:

1) In most applications they are fast enough to handle a lot of work.

2) They can become a bottleneck when excessive string concatenation is happening, or logging is overdone.

As far as D is concerned, it should be expected that a same-sized array would take longer to allocate if the difference between arrays is the element size - in this case 1 byte, 2 bytes, 4 bytes. I don't know much about allocation routines, but I imagine getting something to run at constant cost is either beyond the realm of possibility or just way too difficult to be practical to pursue.

IMO, the benefits of using a standard unicode encoding within the D program outweigh the costs of allocating the arrays. Anytime we make it easier to provide internationalization (i18n), it is a step toward greater acceptance. To assume the world only works with European (or American) character sets is to stick your head in the sand.

Honestly, I would prefer something that made internationalized strings easier to manage rather than more difficult. If there were no multi-char graphemes (i.e. characters that take up more than one code space) then that would be the easiest to work with and write libraries for.
August 27, 2004  Re: performance of char vs wchar
Posted in reply to Ben Hinkle

In article <cgn92o$1jj4$1@digitaldaemon.com>, Ben Hinkle says...

> char: .36 seconds
> wchar: .78
> dchar: 1.3

Yeah, I forgot about allocation time. Of course, D initializes all arrays, no matter where they are allocated. char[]s will be filled with all FFs, and wchar[]s will be filled with all FFFFs. Twice as many bytes = twice as many bytes to initialize. Damn!

A super-fast character array allocator would make a lot of difference here. There are probably many different ways of doing this very fast. I guess this has to happen within DMD.
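The 0xFF fill mentioned above is deliberate: a byte of 0xFF can never appear in well-formed UTF-8, so uninitialized character data is caught early rather than silently decoded. A quick illustration in Python (not D) that the fill pattern really is invalid UTF-8:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# D's char.init fill pattern: 0xFF bytes are never valid in UTF-8
print(is_valid_utf8(b"\xff\xff"))      # False
print(is_valid_utf8("abc".encode()))   # True
```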
August 27, 2004  Re: performance of char vs wchar
Posted in reply to Berin Loritsch

"Berin Loritsch" <bloritsch@d-haven.org> wrote in message news:cgnc25$1l1f$1@digitaldaemon.com...

> Ben Hinkle wrote:
> > All the threads about char vs wchar got me thinking about char
> > performance, so I thought I'd test out the difference. [...]
>
> FWIW, the Java 'char' is a 16 bit value due to the unicode standards. The idea, of course, is that internally to the program all strings are encoded the same and translated on IO.
>
> I would venture to say there are two observations about Java strings:
>
> 1) In most applications they are fast enough to handle a lot of work.
>
> 2) They can become a bottleneck when excessive string concatenation is happening, or logging is overdone.

I wonder what Java strings in utf8 would be like... I wonder if anyone has tried that out.

> As far as D is concerned, it should be expected that a same-sized array would take longer to allocate if the difference between arrays is the element size - in this case 1 byte, 2 bytes, 4 bytes. I don't know much about allocation routines, but I imagine getting something to run at constant cost is either beyond the realm of possibility or just way too difficult to be practical to pursue.

Agreed. There probably isn't any way to speed up the initialization much.

> IMO, the benefits of using a standard unicode encoding within the D program outweigh the costs of allocating the arrays. Anytime we make it easier to provide internationalization (i18n), it is a step toward greater acceptance. To assume the world only works with European (or American) character sets is to stick your head in the sand.

Agreed. utf8 and utf16 are both unicode standards that require multi-byte handling to cover all of unicode, so in terms of ease of use they shouldn't be any different.

> Honestly, I would prefer something that made internationalized strings easier to manage rather than more difficult. If there were no multi-char graphemes (i.e. characters that take up more than one code space) then that would be the easiest to work with and write libraries for.

dchar would be the choice for ease of use, but as you can see performance goes downhill significantly (at least for the naive test I ran). To me the performance of dchar is too poor to make it the standard, and the ease-of-use of utf8 and utf16 are essentially equivalent, so since utf8 has the best performance it should be the default. Hence my attempt to measure which is faster for typical usage: char or wchar?

-Ben
August 27, 2004  Re: performance of char vs wchar
Posted in reply to Ben Hinkle

In article <cgner6$1mh4$1@digitaldaemon.com>, Ben Hinkle says...

> dchar would be the choice for ease of use

Actually, wchar and dchar are pretty much the same in terms of ease of use. In almost all cases, you can just /pretend/ that UTF-16 is not a multi-word encoding, and just write your code as if one wchar == one character. Except in very specialist cases, you'll be correct. Even if you're using characters beyond U+FFFF, regarding them as two characters instead of one is likely to be harmless.

Of course, there are applications for which you can't do this. Font rendering algorithms, for example, /must/ be able to distinguish true character boundaries. But even then, the cost of rendering greatly outweighs the (almost insignificant) cost of determining true character boundaries - a trivial task in UTF-16.

char, by contrast, is much more difficult to use, because you /cannot/ "pretend" it's a single-byte encoding - there are too many characters beyond U+00FF which applications need to "understand".

> but as you can see performance goes downhill significantly (at least for the naive test I ran). To me the performance of dchar is too poor to make it the standard

Agreed.

> and the ease-of-use of utf8 and utf16 are essentially equivalent

Not really. The ICU API is geared around UTF-16, so if you use wchar[]s, you can call ICU functions directly. With char[]s, there's a bit more faffing, so UTF-8 becomes less easy to use in this regard.

And don't forget - UTF-8 is a complex algorithm, while UTF-16 is a trivially simple one. As an application programmer, you may be shielded from the complexity of UTF-8 by library implementation and implicit conversion, but it's all going on under the hood. UTF-8's "ease of use" vanishes as soon as you actually have to implement it.

> so since utf8 has the best performance it should be the default.

But it doesn't. Your tests were unfair. A UTF-16 array will not, in general, require twice as many bytes as a UTF-8 array, which is what you seemed to assume. That will only be true if the string is pure ASCII, but not otherwise, and for the majority of characters the performance will be worse. Check out this table:

    Codepoint        Number of bytes per character
    range           UTF-8    UTF-16    UTF-32
    ------------------------------------------
    0000 to 007F      1        2         4
    0080 to 07FF      2        2         4
    0800 to FFFF      3        2         4    <----- who wins on this row?
    10000+            4        4         4

> Hence my attempt to measure which is faster for typical usage: char or wchar?

You could try measuring it again, but this time without the assumption that all characters are ASCII.

Arcane Jill
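The byte counts in the table above are easy to verify: encode one sample character from each codepoint range and compare the encoded lengths (Python here, purely as an illustration; the `-le` codecs are used so the byte-order mark isn't counted).

```python
# One sample character per codepoint range from the table.
samples = [
    ("U+0041",  "\u0041"),      # 0000-007F: ASCII letter
    ("U+00E9",  "\u00e9"),      # 0080-07FF: Latin-1 accented letter
    ("U+4E2D",  "\u4e2d"),      # 0800-FFFF: CJK ideograph, UTF-16's win
    ("U+1D11E", "\U0001d11e"),  # 10000+: musical symbol, surrogate pair
]
for name, ch in samples:
    u8 = len(ch.encode("utf-8"))
    u16 = len(ch.encode("utf-16-le"))   # -le avoids counting the BOM
    u32 = len(ch.encode("utf-32-le"))
    print(f"{name}: utf8={u8} utf16={u16} utf32={u32}")
```

The output reproduces the table row by row: 1/2/4, 2/2/4, 3/2/4, 4/4/4.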
August 27, 2004  Re: performance of char vs wchar
Posted in reply to Ben Hinkle

In article <cgn92o$1jj4$1@digitaldaemon.com>, Ben Hinkle says...

> char: .36 seconds
> wchar: .78
> dchar: 1.3

So:

wchar = char * 2
dchar = char * 4

It looks like the running time is a direct function of element size, which stands to reason since D default-initializes all arrays. I would be interested in seeing performance comparisons for transcoding between the different formats - for the sake of argument, perhaps using both std.utf and whatever Mango uses.

Sean
August 27, 2004  Re: performance of char vs wchar
Posted in reply to Sean Kelly

In article <cgnh8j$1nl6$1@digitaldaemon.com>, Sean Kelly says...

> So:
>
> wchar = char * 2
> dchar = char * 4

Only if there are the same number of chars in a char array as there are wchars in a wchar array, and dchars in a dchar array. This will /only/ be true if the string is pure ASCII.

Jill
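Jill's objection can be made concrete: the same text needs a different number of code units in each encoding, so "same element count, bigger elements" only holds for pure ASCII. A small illustration in Python (the CJK/kana sample string is hypothetical, chosen because its characters fall in the 0800-FFFF range):

```python
def units(s: str) -> tuple:
    """Code-unit counts for s: UTF-8 bytes, UTF-16 16-bit units,
    UTF-32 32-bit units (the -le codecs emit no byte-order mark)."""
    return (len(s.encode("utf-8")),
            len(s.encode("utf-16-le")) // 2,
            len(s.encode("utf-32-le")) // 4)

print(units("hello"))         # pure ASCII: same count in every encoding
print(units("中文テキスト"))  # 6 BMP characters: 18 UTF-8 bytes, 6 UTF-16 units
```

For the ASCII string a wchar[] really is twice the bytes of a char[]; for the CJK string the UTF-16 form (12 bytes) is actually smaller than the UTF-8 form (18 bytes).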
August 27, 2004  Re: performance of char vs wchar
Posted in reply to Ben Hinkle

Ben Hinkle wrote:
> "Berin Loritsch" <bloritsch@d-haven.org> wrote in message news:cgnc25$1l1f$1@digitaldaemon.com...
>
>> FWIW, the Java 'char' is a 16 bit value due to the unicode standards. The idea, of course, is that internally to the program all strings are encoded the same and translated on IO. [...]
>
> I wonder what Java strings in utf8 would be like... I wonder if anyone has tried that out.

Internally there is no such thing. It's just easier to deal with that way. The translation happens with encoders and decoders on IO.

>> Honestly, I would prefer something that made internationalized strings easier to manage rather than more difficult. [...]
>
> dchar would be the choice for ease of use but as you can see performance goes downhill significantly (at least for the naive test I ran). To me the performance of dchar is too poor to make it the standard and the ease-of-use of utf8 and utf16 are essentially equivalent so since utf8 has the best performance it should be the default. Hence my attempt to measure which is faster for typical usage: char or wchar?

Yeah, but I've learned not to get hung up on strict performance. There is a difference between ultimately fast and fast enough. Sometimes squeezing out those extra cycles causes more programming issues than it needs to. If the allocation routines could be made faster (big assumption here), would it be preferable to use dchar if it is fast enough?

For the record, I believe Java uses UTF16 internally, which means for most things there is less need to worry about MB characters.

The interesting test would be to have strings of the same length, and then test the algorithm to get a substring from that string. For example, a fixed string that uses some multi-codespace characters here and there, and then getting the 15th through 20th characters of that string. This will show just how things might come out in the wash when we are dealing with that type of issue.

For example, it is not uncommon to have the system read a block of text from a file into memory (say 4k worth at a time), and then just iterate one line at a time - which gives us the substring scenario. Alternatively there is the regex algorithm, which would need to account for multi-codespace characters and would take a performance hit as well.

There is a lot more to an overall performant system than just string allocation - even though we are aware that it can be a significant cost.
August 27, 2004  Re: performance of char vs wchar
Posted in reply to Arcane Jill

"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgngtc$1nep$1@digitaldaemon.com...

> Actually, wchar and dchar are pretty much the same in terms of ease of use. In almost all cases, you can just /pretend/ that UTF-16 is not a multi-word encoding, and just write your code as if one wchar == one character. [...] char, by contrast, is much more difficult to use, because you /cannot/ "pretend" it's a single-byte encoding, because there are too many characters beyond U+00FF which applications need to "understand".

True - shortcuts can be taken if the application doesn't have to support all of unicode.

> Not really. The ICU API is geared around UTF-16, so if you use wchar[]s, you can call ICU functions directly. With char[]s, there's a bit more faffing, so UTF-8 becomes less easy to use in this regard.

I'm not sure what faffing is, but yeah, I agree - I just meant the ease-of-use without worrying about library APIs.

> And don't forget - UTF-8 is a complex algorithm, while UTF-16 is a trivially simple algorithm. As an application programmer, you may be shielded from the complexity of UTF-8 by library implementation and implicit conversion, but it's all going on under the hood. UTF-8's "ease of use" vanishes as soon as you actually have to implement it.

I meant ease-of-use for end-users. Calling toUTF8 is just as easy as calling toUTF16. Personally, as long as it is correct, the underlying library complexity doesn't affect me. Performance of the application becomes the key factor.

> But it doesn't. Your tests were unfair. A UTF-16 array will not, in general, require twice as many bytes as a UTF-8 array, which is what you seemed to assume. That will only be true if the string is pure ASCII, but not otherwise, and for the majority of characters the performance will be worse.

I was going to start digging into non-ascii next. I remember reading somewhere that encoding asian languages in utf8 typically results in longer strings than utf16. That will definitely hamper utf8.
Copyright © 1999-2021 by the D Language Foundation