Thread overview
Unichar optimization: performance vs RAM usage. Your opinion?
Jun 08, 2004
Hauke Duden
Jun 08, 2004
Arcane Jill
Jun 08, 2004
Hauke Duden
Jun 09, 2004
Kevin Bealer
Jun 08, 2004
Arcane Jill
Jun 08, 2004
Arcane Jill
Jun 08, 2004
Hauke Duden
Jun 08, 2004
Walter
Jun 08, 2004
DemmeGod
Jun 08, 2004
I had an idea last night on how to optimize the unichar module to only use about 50 KB instead of the current 2 MB of RAM at runtime. It may also be possible to reduce the impact on the executable size (12 KB currently) but I haven't worked that out fully yet.

Unfortunately the change would make the lookup code slower. For example, translating "chr" to lower case would have to be changed from

return lowerData[chr]

to

return chr + lowerData[chr>>9][chr & 511];

In other words, instead of one memory lookup it will do two memory lookups, a shift, an AND, and an addition. It is not THAT expensive, but in an inner loop it might add up.
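
To make that concrete, here is a minimal sketch of what the paged variant could look like. The names and the slice-based page sharing are just placeholders - the real tables wouldn't necessarily be laid out exactly like this:

    // 512-codepoint pages; pages with identical contents (e.g. "no case
    // mapping anywhere") all alias one shared block, which is where the
    // memory saving comes from.
    enum PAGE_BITS = 9;
    enum PAGE_SIZE = 1 << PAGE_BITS;           // 512 codepoints per page
    enum PAGE_COUNT = 0x110000 >> PAGE_BITS;   // 2176 pages cover U+0000..U+10FFFF

    int[][PAGE_COUNT] lowerData;               // per-page delta slices, many aliasing the same block

    static this()
    {
        auto emptyPage = new int[PAGE_SIZE];   // all-zero deltas: chr maps to itself
        foreach (ref page; lowerData)
            page = emptyPage;                  // real code would fill in the distinct pages
    }

    dchar toLowerSketch(dchar chr)
    {
        // two dependent loads plus a shift, an AND and an add
        return cast(dchar)(chr + lowerData[chr >> PAGE_BITS][chr & (PAGE_SIZE - 1)]);
    }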

And people have lots of RAM nowadays - 2 MB is a small amount, and since only small parts of it will actually be used (depending on the language the program is working with), most of it will be paged out to disk by the OS.

So what do you think: is the lower RAM usage worth the performance decrease? I could also implement both versions and select one at compile time, but then the question is which one should be the default.
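
For the compile-time selection I'm thinking of something along these lines (the version identifier and the table declarations here are just placeholders, not final):

    version (UnicodeBigTables)
    {
        dchar[] lowerData;          // one flat table, fast but ~2 MB

        dchar toLowerCase(dchar chr)
        {
            return lowerData[chr];
        }
    }
    else
    {
        int[][] lowerData;          // 512-entry delta pages sharing duplicate blocks, ~50 KB

        dchar toLowerCase(dchar chr)
        {
            return cast(dchar)(chr + lowerData[chr >> 9][chr & 511]);
        }
    }

Building with -version=UnicodeBigTables would then switch back to the fast variant.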


Hauke


June 08, 2004
In article <ca3vvv$gpf$1@digitaldaemon.com>, Hauke Duden says...

>return lowerData[chr]

Huh? There are over a million codepoints in Unicode (1,114,112, in fact). That must be a ----ing HUGE lookup table! Even if the data were only one byte per codepoint, that's a meg already.

Of course it's reasonable to compress that. There are huge swathes of codepoints that are just totally unused (yet), and many, many blocks in which all characters have identical properties. One humungous lookup table is definitely not how I would have done it. Maybe I'm just not enough of a speed freak.

Jill



June 08, 2004
In article <ca3vvv$gpf$1@digitaldaemon.com>, Hauke Duden says...

>return chr + lowerData[chr>>9][chr & 511];

You can get faster than that.

>       union U
>       {
>           dchar c;
>           ubyte[3] n;
>       }
>
>       U u;
>       u.c = chr;
>       return lowerData[u.n[2]][u.n[1]][u.n[0]];

No bitshifting. You just need tables for blocks of 256 characters, then tables for blocks of 256 pointers to these, then tables for blocks of 17 pointers to these. Duplicate blocks (in particular, those for unassigned codepoints) can simply be the same block.
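
A rough sketch of the declarations and the sharing (names made up; I've also stored deltas here rather than the mapped characters themselves, simply so that all the "no mapping" blocks really are byte-for-byte identical and can alias one array - that part is my own choice, not the only way to do it):

    int[][][17] lowerData;                     // 17 planes -> 256 rows -> 256 per-codepoint deltas

    static this()
    {
        auto emptyRow   = new int[256];        // all zeros: no case mapping
        auto emptyPlane = new int[][](256);
        foreach (ref row; emptyPlane)
            row = emptyRow;                    // every row of the empty plane is the same block
        foreach (ref plane; lowerData)
            plane = emptyPlane;                // every plane starts out as the shared empty one
        // real code would then overwrite just the rows/planes that carry mappings
    }

    dchar toLowerSketch(dchar chr)
    {
        union U { dchar c; ubyte[4] n; }       // only the low three bytes matter
        U u;
        u.c = chr;
        // little-endian byte order assumed here (see the follow-up below)
        return cast(dchar)(chr + lowerData[u.n[2]][u.n[1]][u.n[0]]);
    }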

Well, that's how I would have done it.
Jill


June 08, 2004
In article <ca46cu$rfj$1@digitaldaemon.com>, Arcane Jill says...
>
>>       union U
>>       {
>>           dchar c;
>>           ubyte[3] n;
>>       }
>>

That's assuming little-endian architecture, of course. You'd need two versions to cover both possibilities.
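
In D the predefined version identifiers can take care of that, something like this (sketch only, with the table populated as in the previous post):

    int[][][17] lowerData;      // the three-level delta table from the previous sketch

    dchar toLowerSketch(dchar chr)
    {
        union U { dchar c; ubyte[4] n; }
        U u;
        u.c = chr;
        version (LittleEndian)
        {
            return cast(dchar)(chr + lowerData[u.n[2]][u.n[1]][u.n[0]]);
        }
        else version (BigEndian)
        {
            // byte 0 is the (always zero) top byte on big-endian machines
            return cast(dchar)(chr + lowerData[u.n[1]][u.n[2]][u.n[3]]);
        }
        else
        {
            static assert(0, "unknown endianness");
        }
    }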

Jill


June 08, 2004
Arcane Jill wrote:
> In article <ca3vvv$gpf$1@digitaldaemon.com>, Hauke Duden says...
> 
> 
>>return chr + lowerData[chr>>9][chr & 511];
> 
> 
> You can get faster than that.
> 
> 
>>      union U
>>      {
>>          dchar c;
>>          ubyte[3] n;
>>      }
>>
>>      U u;
>>      u.c = chr;
>>      return lowerData[u.n[2]][u.n[1]][u.n[0]];
> 
> 
> No bitshifting. You just need tables for blocks of 256 characters, then tables
> for blocks of 256 pointers to these, then tables for blocks of 17 pointers to
> these. Duplicate blocks (in particular, those for unassigned codepoints) can
> simply be the same block.
> 
> Well, that's how I would have done it.

Splitting the data into blocks and using the same memory for equal blocks is indeed the main idea behind the optimization.

But doing 3 memory lookups with byte indices is probably slower than doing the bit shift and 2 lookups. Memory is usually the bottleneck, while shifts have been single-cycle operations for ages (even on the 386, IIRC) and can be pipelined alongside the loads to some degree.

Also, byte-based operations are slower than int-based ones on many CPUs.
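
If it turns out to matter, the claim is easy to check with a throwaway benchmark along these lines (dummy all-zero tables and made-up names, written against today's std.datetime.stopwatch - a measurement sketch, not the unichar code):

    import std.datetime.stopwatch : benchmark;
    import std.stdio : writefln;

    enum PAGE = 512;
    enum PAGES = 0x110000 / PAGE;          // 2176 pages cover U+0000..U+10FFFF

    void main()
    {
        auto flat = new int[0x110000];     // one big table: flat[c] is the delta for c
        auto paged = new int[][](PAGES);   // paged[c >> 9][c & 511] is the same delta
        auto sharedPage = new int[PAGE];
        foreach (ref page; paged)
            page = sharedPage;             // every page aliases one all-zero block

        uint sink;                         // keeps the loops from being optimised away
        auto times = benchmark!(
            { foreach (uint c; 0 .. 0x110000) sink += c + flat[c]; },
            { foreach (uint c; 0 .. 0x110000) sink += c + paged[c >> 9][c & 511]; }
        )(100);

        writefln("flat:  %s", times[0]);
        writefln("paged: %s  (sink %s)", times[1], sink);
    }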

Hauke
June 08, 2004
Arcane Jill wrote:

> In article <ca3vvv$gpf$1@digitaldaemon.com>, Hauke Duden says...
> 
> 
>>return lowerData[chr]
> 
> 
> Huh? There are over a million codepoints in Unicode (1,114,112, in fact). That
> must be a ----ing HUGE lookup table! Even if the data were only one byte per
> codepoint, that's a meg already.

Not quite ;). If you look at the data you'll see that only the first ~200,000 codepoints actually have case mappings. That translates to about 2 MB of needed RAM (for all tables together).
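
The back-of-the-envelope version, treating the data as three separate 4-byte mapping tables (lower/upper/title) purely for the sake of the estimate:

    enum size_t mappedCodepoints = 200_000;
    enum size_t bytesPerEntry    = dchar.sizeof;     // 4
    enum size_t tableCount       = 3;                // lower, upper, title (assumed)
    enum size_t totalBytes       = mappedCodepoints * bytesPerEntry * tableCount;
    static assert(totalBytes == 2_400_000);          // roughly the 2 MB mentioned above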

> Of course it's reasonable to compress that. There are huge swathes of codepoints
> that are just totally unused (yet), and many, many blocks in which all
> characters have identical properties. One humungous lookup table is definitely
> not how I would have done it. Maybe I'm just not enough of a speed freak.

Dividing the Unicode range into pages and using the same memory for identical ones is exactly the optimization I think about doing. The question is if the modest (by today's standards) decrease in needed RAM is worth the decrease in performance.

Hauke

June 08, 2004
It's such a huge reduction in memory usage, I'd go for it. Reducing memory consumption has its own associated performance speedups - less swapping, less memory for the gc to scan, etc. Keep the faster version there with a version {} statement, though, in case experience shows otherwise.


June 08, 2004
I think what Walter means to say is that it'll never make it into Phobos using 2 MB of RAM.  Yes?

John

On Tue, 08 Jun 2004 09:23:08 -0700, Walter wrote:

> It's such a huge reduction in memory usage, I'd go for it. Reducing memory consumption has its own associated performance speedups - less swapping, less memory for the gc to scan, etc. Keep the faster version there with a version {} statement, though, in case experience shows otherwise.

June 09, 2004
In article <ca4aap$11jk$1@digitaldaemon.com>, Hauke Duden says...
>
>Arcane Jill wrote:
>
>> In article <ca3vvv$gpf$1@digitaldaemon.com>, Hauke Duden says...
>> 
>> 
>>>return lowerData[chr]
>> 
>> 
>> Huh? There are over a million codepoints in Unicode (1,114,112, in fact). That must be a ----ing HUGE lookup table! Even if the data were only one byte per codepoint, that's a meg already.
>
>Not quite ;). If you look at the data you'll see that only the first ~200,000 codepoints actually have case mappings. That translates to about 2 MB of needed RAM (for all tables together).
>
>> Of course it's reasonable to compress that. There are huge swathes of codepoints that are just totally unused (yet), and many, many blocks in which all characters have identical properties. One humungous lookup table is definitely not how I would have done it. Maybe I'm just not enough of a speed freak.
>
>Dividing the Unicode range into pages and using the same memory for identical ones is exactly the optimization I think about doing. The question is if the modest (by today's standards) decrease in needed RAM is worth the decrease in performance.
>
>Hauke

I've seen the idea of a two-level table mentioned on one of the Unicode sites; it was one of the recommended implementations for a lookup table.  I used a version of this at IBM to handle ASCII/EBCDIC/Unicode translation.

I think doing an extra table lookup per character will not be a big issue, since the first table (the pointers to the other tables) will probably stay in cache.
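
The numbers bear that out: with 512-codepoint pages the first-level table is tiny (rough figures, assuming one plain pointer per page):

    enum size_t pageSize   = 512;
    enum size_t pageCount  = 0x110000 / pageSize;           // 2176 pages for U+0000..U+10FFFF
    enum size_t firstLevel = pageCount * (void*).sizeof;    // 8704 bytes with 4-byte pointers
    static assert(pageCount == 2176);

A few KB sits comfortably in L1/L2 cache, so in a tight loop only the second lookup is likely to miss.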

Kevin