April 17, 2009 Re: Ternary Search Trees
Posted in reply to bearophile

bearophile wrote:
> Robert Fraser:
>> Hey, could you please post your implementation (assuming it's open-source?) I'd love to use them, but can't be bothered to implement it. Thanks!
>
> It's just a translation to D of this Py code:
> Info:
> http://jeethurao.com/blog/?p=146
> Code:
> http://bitbucket.org/woadwarrior/trie/
>
> Where necessary you can also follow the ideas from the original article here (but using an enum is MUCH better and it's also safer in C, to avoid aliasing):
> http://www.ddj.com/windows/184410528
>
> My implementation is about 57 times faster than the Python version and much faster than the Psyco version too.
>
> You can also take a look here:
> http://sourceforge.net/projects/ternary/
>
> You can implement it with a _single_ template struct, filled with all the methods you want:
>
> struct TST(T) {
>     TST* left, right;
>     union {
>         TST* mid;
>         int index;
>     }
>     T item;
>
>     // methods here
> }
>
> Note that inside there you don't need TST!(T)*, TST* is probably enough, but I don't know why yet.
>
> You can start with only the few basic methods, test them, and later when/if you want you can add more methods, taking ideas from here:
> http://abc.se/~re/code/tst/tst_docs/index.html
>
> I'd like to not show my code because it's not nice, it has no comments, tests, etc. But it's quite close to that Py code.
>
> A problem with that template struct is that even if you add a opIn_r() method as I have done, then you have to do (note the star):
> if (!(k in *ds))
>
> I don't know if in D2 for structs you can use the syntax you use with classes:
> if (!(k in ds))
>
> Even better syntax is:
> if (k !in ds)
>
> In Python you write something better still:
> if k not in ds:
>
> Bye,
> bearophile
Awesome; thanks
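For readers who want something runnable, here is a minimal C sketch of the node layout and lookup discussed in the quoted post, in the spirit of the Bentley-Sedgewick design from the linked DDJ article. It is simplified (no union of mid/index, no bulk_add) and all names are illustrative, not taken from the thread's code:

```c
/* Minimal ternary search tree: each node holds one character, a
   left/right pair ordered on that character, and a middle child that
   advances to the next character of the key. */
#include <stdlib.h>

typedef struct Node {
    char splitchar;
    struct Node *left, *mid, *right;
    int is_word;                 /* set on the node that ends a stored key */
} Node;

static Node *insert(Node *n, const char *s) {
    if (n == NULL) {
        n = calloc(1, sizeof(Node));
        n->splitchar = *s;
    }
    if (*s < n->splitchar)
        n->left = insert(n->left, s);
    else if (*s > n->splitchar)
        n->right = insert(n->right, s);
    else if (s[1] == '\0')
        n->is_word = 1;          /* key fully consumed: mark end of word */
    else
        n->mid = insert(n->mid, s + 1);
    return n;
}

static int contains(const Node *n, const char *s) {
    while (n) {
        if (*s < n->splitchar)      n = n->left;
        else if (*s > n->splitchar) n = n->right;
        else if (s[1] == '\0')      return n->is_word;
        else                        { n = n->mid; s++; }
    }
    return 0;
}
```

Note that a prefix of a stored word ("ca" when only "cat" was inserted) is correctly rejected, because only the node that ends a key has is_word set.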
April 17, 2009 Re: Ternary Search Trees
Posted in reply to Jason House

Fawzi:
> so TST in the structure refers by default to the struct itself.

Thank you, now I understand. I am slowly learning more.

-------------------------

Jason House:
> Why use a struct for TST?

To give you a good answer I have done some tests using a file "english_words.txt" of about 2.1 MB that contains about 220_000 different words.

The following tiny program loads that file and splits it. It requires 6.2 MB of RAM and about 0.09 seconds to run once the file cache is warm (this thanks to the last patch to the GC, which speeds this up a lot):

import std.stdio: writefln;
import std.string: split;
import std.file: read;

void main() {
    auto words = split(cast(string)read("english_words.txt"));
}

--------

The following program tests the TST (positive lookups only, no negative lookups yet); this is for the struct-based TST!(char). For the class-based version you can just replace *root with root:

void main() {
    auto words = split(cast(string)read("english_words.txt"));
    auto root = new TST!(char);
    root.bulk_add(words);
    for (int i; i < 100; i++)
        foreach (word; words)
            if (!(word in *root))
                throw new Exception("\"" ~ word ~ "\" not found");
}

The struct-based version needs 3.95 seconds and 24.35 MB of RAM to run. The class-based version (final methods) needs 4.03 seconds and 24.56 MB of RAM (DMD v1.042 on WinXP, 2 GHz CPU). Both versions allocate 461_495 TST nodes. [Each TST struct node needs 16 bytes. Each TST object probably needs 24 bytes, which I think the GC allocates as 32. But the actual memory used denies this twofold difference; I don't know why.] I guess the speed difference comes mostly from the higher memory overhead of classes (because the more memory there is, the more the CPU caches have to work).

--------

With a tiny change, adding "scope" so that just the root of the tree gets allocated on the stack, the running time of the class-based version seems to drop from 4.03 to 3.99 seconds. I don't know why moving the root to the stack makes so much difference. It's not a small difference: on a 2 GHz CPU, ~0.04 seconds means something like 80 million clock ticks. The root reference is traversed each time, so the removal of just this one level of indirection may explain the time difference. A mammalian brain like mine isn't well equipped to understand something that performs billions of operations every second. I have taken a look at the ASM generated by the class-based version with and without scope, but I have not understood the asm well enough:

void main() {
    auto words = split(cast(string)read("english_words.txt"));
    scope auto root = new TST!(char); // scope ****
    root.bulk_add(words);
    for (int i; i < 100; i++)
        foreach (word; words)
            if (!(word in root))
                throw new Exception("\"" ~ word ~ "\" not found");
}

--------

This, with a struct-based TST, needs 3.96 seconds (0.48 s just building the tree):

void main() {
    auto words = split(cast(string)read("english_words.txt"));
    TST!(char) root;
    root.bulk_add(words);
    for (int i; i < 100; i++)
        foreach (word; words)
            if (!(word in root))
                throw new Exception("\"" ~ word ~ "\" not found");
}

--------

The following, with a struct-based TST, needs 3.71 s and 13.68 MB (0.32 s just building, 0.31 s with no GC) (0.26 s just building when allocating the memory with calloc, for a total of 3.52 s and 13.37 MB):

TST[] nodes;
int nnode;

void main() {
    nodes = new TST!(char)[461_495];
    auto words = split(cast(string)read("english_words.txt"));
    auto root = nodes[nnode++];
    root.bulk_add(words);
    for (int i; i < 100; i++)
        foreach (word; words)
            if (!(word in root))
                throw new Exception("\"" ~ word ~ "\" not found");
}

Idem, but with the struct fields aligned to 1 byte: 3.89 s, 12.72 MB (0.48 s just building; 12.46 MB when allocating the memory with calloc).

--------

Without align, using std.gc.malloc + hasNoPointers + std.c.memset: 3.53 s total.

void main() {
    disable();
    TST.nodes = cast(TST*)gcmalloc(461_495 * TST.sizeof);
    hasNoPointers(TST.nodes);
    memset(TST.nodes, 0, 461_495 * TST.sizeof);
    auto words = split(cast(string)read("english_words.txt"));
    auto root = TST.nodes[TST.nnode++];
    root.bulk_add(words);
    for (int i; i < 100; i++)
        foreach (word; words)
            if (!(word in root))
                throw new Exception("\"" ~ word ~ "\" not found");
}

Of course you can add static methods to the struct to create the array, to create a new node moving an index forward, and to free the whole array, plus static fields for the array and the number of nodes used so far.

--------

Once tree nodes are items of an array, you can use indexes to address node children, so if memory gets very tight you can even use 24-bit integers as indexes:
http://www.fantascienza.net/leonardo/so/dlibs/uint24.html
They are very slow, so in most situations you want to use ints:

struct TST(T, TyIndex=int) { ...

Then this uses less memory, as low as 11 bytes/node, but TST must also be aligned to 1 byte:

TST!(char, Uint24) root;

--------

The last examples work only if you know the number of nodes at compile time. If you need a more dynamic data structure, the code gets more tricky/hairy. Probably the best thing you can do is to keep an array of blocks of nodes, each block not too big (e.g. 64 KB), and allocate such blocks on demand. Something like this (code untested, it surely contains bugs; this comes after several rounds of code simplification):

struct TST(T) {
    // *** TST instance attributes here ***

    struct Block {
        // Block is not too big nor too small
        const int SIZE = ((1 << 16) / TST.sizeof) > 1 ? ((1 << 16) / TST.sizeof) : 1;
        TST[TST.Block.SIZE] nodes;
    }

    // usedLastBlock keeps the number of nodes used from the last block of blocks.
    // There's no need to keep a count over all the blocks, because using it is
    // slower and all the blocks but the last one are full anyway.
    static int usedLastBlock;
    static Block*[] blocks;

    static TST* newNode() {
        if (!TST.blocks.length || TST.usedLastBlock >= TST.Block.SIZE) {
            // add new block
            TST.blocks.length = blocks.length + 1;
            // this whole new block gets scanned by the GC even if you need only 1 node
            TST.blocks[$-1] = new Block;
            TST.usedLastBlock = 0;
        }
        return &(TST.blocks[$-1].nodes[TST.usedLastBlock++]);
    }

    /// deallocate blocks of nodes
    static void free() {
        foreach (block_ptr; TST.blocks)
            delete block_ptr;
        TST.blocks[] = null;
        TST.blocks.length = 0;
    }

    // *** all TST methods here ***
}

This last code doesn't work yet; I have so far failed to debug it. The compiler returns:

struct test.TST!(char).TST no size yet for forward reference

I'll try to fix it when I have the time. If you know how to fix it, or what the problem is, please tell me.

Bye,
bearophile
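The same block-pool scheme can be sketched in standalone C; this is not the thread's D code, just an illustration of the idea (64 KB blocks allocated on demand, nodes handed out from the last block, everything freed at once; all names are made up for this sketch):

```c
#include <stdlib.h>

typedef struct TstNode {
    struct TstNode *left, *mid, *right;
    char splitchar;
} TstNode;

enum { BLOCK_BYTES = 1 << 16,                     /* 64 KB per block */
       BLOCK_NODES = BLOCK_BYTES / sizeof(TstNode) };

typedef struct Pool {
    TstNode **blocks;   /* array of block pointers, so free() can walk them */
    size_t nblocks;
    size_t used_last;   /* nodes already handed out from the last block */
} Pool;

static TstNode *pool_new_node(Pool *p) {
    if (p->nblocks == 0 || p->used_last >= BLOCK_NODES) {
        /* last block exhausted (or no block yet): add a fresh zeroed block */
        p->blocks = realloc(p->blocks, (p->nblocks + 1) * sizeof *p->blocks);
        p->blocks[p->nblocks++] = calloc(BLOCK_NODES, sizeof(TstNode));
        p->used_last = 0;
    }
    return &p->blocks[p->nblocks - 1][p->used_last++];
}

static void pool_free(Pool *p) {
    for (size_t i = 0; i < p->nblocks; i++)
        free(p->blocks[i]);
    free(p->blocks);
    p->blocks = NULL;
    p->nblocks = 0;
    p->used_last = 0;
}
```

Consecutive allocations come from the same block, so nodes allocated together sit next to each other in memory, which is the cache-friendliness the post is after.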
April 17, 2009 Re: Ternary Search Trees
Posted in reply to bearophile

To shorten the code further I have replaced the Block struct with a typedef'ed fixed-size array:
struct TST(T) {
    // *** TST instance attributes here ***

    const int BLOCKSIZE = ((1 << 16) / TST.sizeof) > 1 ? ((1 << 16) / TST.sizeof) : 1;

    // usedLastBlock keeps the number of nodes used from the last block of blocks.
    // There's no need to keep a count over all the blocks, because using it is
    // slower and all the blocks but the last one are full anyway.
    static int usedLastBlock;

    typedef TST[BLOCKSIZE] Block;
    static Block*[] blocks;

    static TST* newNode() {
        if (!TST.blocks.length || TST.usedLastBlock >= Block.length) {
            // add new block
            TST.blocks.length = blocks.length + 1;
            // this whole new block gets scanned by the GC even if you need only 1 node
            TST.blocks[$-1] = new Block; // this doesn't work, you can't new a static array
            TST.usedLastBlock = 0;
        }
        return &(*(TST.blocks[$-1])[TST.usedLastBlock++]);
    }

    /// deallocate blocks of nodes
    static void free() {
        foreach (block_ptr; TST.blocks)
            delete block_ptr;
        TST.blocks.length = 0;
    }
}
But besides the same bug as before:
struct test.TST!(char).TST no size yet for forward reference
it's even more buggy, because now you can't even new a Block. Oh, well.
Bye,
bearophile
April 17, 2009 Re: Ternary Search Trees
Posted in reply to bearophile

OK, now the code works. Beside the TST struct I have created this helper struct that keeps a pool of tree nodes:
struct StructPool(TyStruct, int MAX_BLOCK_SIZE_BYTES=1 << 14) {
    static assert(MAX_BLOCK_SIZE_BYTES >= 1);

    struct Block {
        // Block is not too big nor too small
        const int MAXB = MAX_BLOCK_SIZE_BYTES / TyStruct.sizeof;
        const int SIZE = MAXB >= 1 ? MAXB : 1;
        TyStruct[StructPool.Block.SIZE] structs;
    }

    static TyStruct* next_free_struct_ptr, last_struct_ptr;
    static Block*[] blocks;

    static TyStruct* newStruct() {
        if (StructPool.next_free_struct_ptr >= StructPool.last_struct_ptr) {
            // then add new block
            // this whole new block gets scanned by the GC even if you need only 1 node
            StructPool.blocks.length = StructPool.blocks.length + 1;
            StructPool.blocks[$-1] = new Block;
            StructPool.next_free_struct_ptr = StructPool.blocks[$-1].structs.ptr;
            StructPool.last_struct_ptr = StructPool.next_free_struct_ptr + Block.SIZE;
        }
        return StructPool.next_free_struct_ptr++;
    }

    /// deallocate blocks of structs
    static void free() {
        foreach (block_ptr; StructPool.blocks)
            delete block_ptr;
        StructPool.blocks[] = null;
        StructPool.blocks.length = 0;
        StructPool.next_free_struct_ptr = null;
        StructPool.last_struct_ptr = null;
    }
}

alias StructPool!(TST!(char)) TSTpool;
(I have used two static pointers; a single index to the last filled-up struct would also be enough.)
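The two-pointer trick can be shown in a small standalone C sketch (hypothetical names, not the thread's code): the bump pointer is compared against an end-of-block sentinel, so the fast path of each allocation is one compare and one increment:

```c
#include <stdlib.h>

typedef struct Obj { struct Obj *next; int value; } Obj;

enum { OBJS_PER_BLOCK = (1 << 14) / sizeof(Obj) };  /* 16 KB per block */

static Obj *next_free, *last;   /* both NULL before the first allocation */

static Obj *obj_new(void) {
    if (next_free == last) {    /* first call, or current block exhausted */
        /* block pointers would also be recorded somewhere so the pool
           can be freed later; omitted here for brevity */
        Obj *block = calloc(OBJS_PER_BLOCK, sizeof(Obj));
        next_free = block;
        last = block + OBJS_PER_BLOCK;
    }
    return next_free++;         /* fast path: just a bump */
}
```

A single counter of used slots works too; the pointer pair merely saves one add per allocation, which only matters when, as here, allocation is the hot path.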
That newStruct() is fast, about as fast as class allocations in Java :-)
It's not nice code yet, with that external alias and so on, but it seems to work, and it almost halves the total allocation time of a big ternary search tree (but despite the improved cache coherence, positive lookups into such a TST are about as fast as in a normally allocated TST, maybe because the D GC packs the nodes closely in memory anyway). So I think in certain situations it can be useful.
Now I'm trying to make the code nicer and less bug-prone, for example by turning that StructPool struct into a template mixin that I can use to add such pool capability to other structs that need to allocate many small structs and to deallocate them all at the same time. But so far I have had the same kinds of errors I have shown in the past two posts.
Bye,
bearophile
April 18, 2009 Re: Ternary Search Trees
Posted in reply to bearophile

bearophile wrote:
> OK, now the code works. Beside the TST struct I have created this helper struct that keeps a pool of tree nodes:
>
> struct StructPool(TyStruct, int MAX_BLOCK_SIZE_BYTES=1 << 14) {
> static assert(MAX_BLOCK_SIZE_BYTES >= 1);
>
> struct Block {
> // Block is not too much big nor too much small
> const int MAXB = MAX_BLOCK_SIZE_BYTES / TyStruct.sizeof;
> const int SIZE = MAXB >= 1 ? MAXB : 1;
> TyStruct[StructPool.Block.SIZE] structs;
> }
>
> static TyStruct* next_free_struct_ptr, last_struct_ptr;
> static Block*[] blocks;
>
> static TyStruct* newStruct() {
> if (StructPool.next_free_struct_ptr >= StructPool.last_struct_ptr) {
> // then add new block
>
> // this whole new block gets scanned by the GC even if you need only 1 node
> StructPool.blocks.length = StructPool.blocks.length + 1;
> StructPool.blocks[$-1] = new Block;
>
> StructPool.next_free_struct_ptr = StructPool.blocks[$-1].structs.ptr;
> StructPool.last_struct_ptr = StructPool.next_free_struct_ptr + Block.SIZE;
>
> }
>
> return StructPool.next_free_struct_ptr++;
> }
>
> /// deallocate blocks of structs
> static void free() {
> foreach (block_ptr; StructPool.blocks)
> delete block_ptr;
> StructPool.blocks[] = null;
> StructPool.blocks.length = 0;
> StructPool.next_free_struct_ptr = null;
> StructPool.last_struct_ptr = null;
> }
> }
>
> alias StructPool!(TST!(char)) TSTpool;
>
> (I have used two static pointers; a single index to the last filled-up struct would also be enough.)
> That newStruct() is fast, about as fast as class allocations in Java :-)
>
> It's not nice code yet, with that external alias and so on, but it seems to work, and it almost halves the total allocation time of a big ternary search tree (but despite the improved cache coherence, positive lookups into such a TST are about as fast as in a normally allocated TST, maybe because the D GC packs the nodes closely in memory anyway). So I think in certain situations it can be useful.
>
> Now I'm trying to make the code nicer and less bug-prone, for example by turning that StructPool struct into a template mixin that I can use to add such pool capability to other structs that need to allocate many small structs and to deallocate them all at the same time. But so far I have had the same kinds of errors I have shown in the past two posts.
>
> Bye,
> bearophile
Cool, thanks for posting this... Why is the default block size so freaking huge (16kb)?
April 18, 2009 Re: Ternary Search Trees
Posted in reply to Robert Fraser

Robert Fraser:
> Why is the default block size so freaking huge (16kb)?<
If the block is too small you waste too much time allocating small blocks of memory (about as much time as you'd waste allocating the single tree nodes).
If the block is too big you waste empty nodes in the last block, and over a certain size you don't gain much anyway. Also, very big chunks don't fit in the data part of the L1 cache (it's 32 KB on many CPUs).
From experiments I have seen that having a block bigger than 64 KB usually gains you very little, while blocks smaller than 4-8 KB are a waste of time. 16-32 KB is the fastest, and on a PC with 2 GB of RAM it doesn't waste much memory. If you want to save some RAM you can reduce that value at compile time.
Bye,
bearophile
April 24, 2009 Re: Ternary Search Trees
Posted in reply to bearophile (attachments)

bearophile wrote:
> Does someone have some need for Ternary Search Trees in Phobos (for D1, and eventually later for D2 too)?
> TSTs allow you to find keys, key prefixes, or even keys with holes. Keys are arrays of T, where T is the template type.
> They can be designed to store the keys alone, or as an associative data structure.
>
> With some benchmarks I have seen that a simple TST implementation is about as fast as the built-in AAs of D (but much slower than Python dicts).
>
> Bye,
> bearophile
I implemented a version of the TST you posted using Tango + D1... Here are my results:
Test        Part       Mean    Median  Max
----        ----       ----    ------  ---
TST         Insertion  0.0428  0.0338  0.0886
TST         Iteration  0.0022  0.0022  0.0024
TST         Lookup     0.0225  0.0223  0.0237
HashMap     Insertion  0.0621  0.0421  0.2205
HashMap     Iteration  0.0035  0.0034  0.0036
HashMap     Lookup     0.0169  0.0168  0.0184
TreeMap     Insertion  0.1045  0.1041  0.1058
TreeMap     Iteration  0.0041  0.0041  0.0044
TreeMap     Lookup     0.0895  0.0892  0.0917
AssocArray  Insertion  0.0262  0.0262  0.0268
AssocArray  Iteration  0.0015  0.0015  0.0016
AssocArray  Lookup     0.0130  0.0129  0.0132
(TreeMap and HashMap are in tango.util.container, AssocArray is the built-in D associative array; testing code is attached. Compiled with -O -release -inline using DSSS 0.78 and DMD 1.043 on Windows XP x86 SP3).
Indeed, TSTs seem to beat Red-Black BSTs (the implementation Tango's TreeMap uses) quite handily, and have faster insert times than hash maps, though slower lookup. The big winner here, though, appears to be D's built-in associative arrays. I thought they were supposed to be very slow, but Tango's implementation, at least, looks pretty good (without any re-hashing).
For order-sensitive collections, I'm definitely using TSTs; I'm sold on the concept. Not only are they faster; the prefix searches they allow could be very useful. However, the times here are very low (the word file is only 1.1 MB, and my system is fairly fast, though this is memory-dependent rather than CPU-dependent)... I'll try with a larger word list & see what results I get.
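The prefix searches mentioned above are the main thing a TST adds over a hash map: once you have walked down to the end of the prefix, every stored key in that node's middle subtree starts with it. A self-contained C sketch of the idea (illustrative names, not the attached test code):

```c
#include <stdlib.h>

typedef struct N { char c; struct N *lo, *eq, *hi; int end; } N;

static N *ins(N *n, const char *s) {
    if (!n) { n = calloc(1, sizeof *n); n->c = *s; }
    if (*s < n->c)      n->lo = ins(n->lo, s);
    else if (*s > n->c) n->hi = ins(n->hi, s);
    else if (s[1])      n->eq = ins(n->eq, s + 1);
    else                n->end = 1;
    return n;
}

/* count every word stored in the subtree rooted at n
   (the prefix characters have already been consumed) */
static int count_all(const N *n) {
    if (!n) return 0;
    return n->end + count_all(n->lo) + count_all(n->eq) + count_all(n->hi);
}

/* number of stored words beginning with `prefix` */
static int count_prefix(const N *n, const char *prefix) {
    while (n) {
        if (*prefix < n->c)      n = n->lo;
        else if (*prefix > n->c) n = n->hi;
        else if (prefix[1])      { n = n->eq; prefix++; }
        else return n->end + count_all(n->eq);  /* prefix consumed */
    }
    return 0;
}
```

Collecting the matching words instead of counting them is the same traversal with a string buffer carried along; a hash map can only answer this by scanning every key.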
April 24, 2009 Re: Ternary Search Trees
Posted in reply to Robert Fraser

Robert Fraser wrote:
> bearophile wrote:
>> Does someone have some need for Ternary Search Trees in Phobos (for D1, and eventually later for D2 too)?
>> TSTs allow you to find keys, key prefixes, or even keys with holes. Keys are arrays of T, where T is the template type.
>> They can be designed to store the keys alone, or as an associative data structure.
>>
>> With some benchmarks I have seen that a simple TST implementation is about as fast as the built-in AAs of D (but much slower than Python dicts).
>>
>> Bye,
>> bearophile
>
> I implemented a version of the TST you posted using Tango + D1... Here are my results:
>
> Test Part Mean Median Max
> ---- ---- ---- ------ ---
> TST Insertion 0.0428 0.0338 0.0886
> TST Iteration 0.0022 0.0022 0.0024
> TST Lookup 0.0225 0.0223 0.0237
> HashMap Insertion 0.0621 0.0421 0.2205
> HashMap Iteration 0.0035 0.0034 0.0036
> HashMap Lookup 0.0169 0.0168 0.0184
> TreeMap Insertion 0.1045 0.1041 0.1058
> TreeMap Iteration 0.0041 0.0041 0.0044
> TreeMap Lookup 0.0895 0.0892 0.0917
> AssocArray Insertion 0.0262 0.0262 0.0268
> AssocArray Iteration 0.0015 0.0015 0.0016
> AssocArray Lookup 0.0130 0.0129 0.0132
>
> (TreeMap and HashMap are in tango.util.container, AssocArray is the built-in D associative array; testing code is attached. Compiled with -O -release -inline using DSSS 0.78 and DMD 1.043 on Windows XP x86 SP3).
>
> Indeed, TSTs seem to beat Red-Black BSTs (the implementation Tango's TreeMap uses) quite handily, and have faster insert times than hash maps, though slower lookup. The big winner here, though, appears to be D's built-in associative arrays. I thought they were supposed to be very slow, but Tango's implementation, at least, looks pretty good (without any re-hashing).
>
> For order-sensitive collections, I'm definitely using TSTs; I'm sold on the concept. Not only are they faster; the prefix searches they allow could be very useful. However, the times here are very low (the word file is only 1.1 MB, and my system is fairly fast, though this is memory-dependent rather than CPU-dependent)... I'll try with a larger word list & see what results I get.
>
Results are similar on a much larger word list:
Test Part Mean Median Max
---- ---- ---- ------ ---
TST Insertion 0.1906 0.1630 0.2495
TST Iteration 0.0032 0.0031 0.0035
TST Lookup 0.1650 0.1651 0.1662
HashMap Insertion 0.1637 0.1504 0.2679
HashMap Iteration 0.0029 0.0028 0.0030
HashMap Lookup 0.1212 0.1210 0.1241
TreeMap Insertion 0.6133 0.6120 0.6276
TreeMap Iteration 0.0035 0.0035 0.0035
TreeMap Lookup 0.6310 0.6292 0.6446
AssocArray Insertion 0.1131 0.1129 0.1154
AssocArray Iteration 0.0014 0.0014 0.0016
AssocArray Lookup 0.0939 0.0928 0.1045
April 24, 2009 Re: Ternary Search Trees
Posted in reply to Robert Fraser

Robert Fraser:
> I implemented a version of the TST you posted using Tango + D1... Here are my results:

Nice.
Is your TST a map? (not a set).
Are you willing to show me your TST code? Your code is surely better than mine, and there can be a few little things that can be tuned.
And are you also willing to add it to Phobos? (I'll port it to Phobos1, it's probably easy to do.)

> The big winner here, though, appears to be D's built-in associative arrays. I thought they were supposed to be very slow, but Tango's implementation, at least, looks pretty good (without any re-hashing).

Try comparing the performance of built-in D AAs with Psyco+Python dicts, or a good C++ STL hash set/hash map. D AAs are slow probably also because Phobos is statically compiled; they aren't a template.

> However, the times here are very low (the word file is only 1.1MB, and my system is fairly fast, though this is memory-dependent not CPU-dependent)... I'll try with a larger word list & see what results I get.

With modern CPUs lots of tests become too fast :-)

Bye,
bearophile
April 24, 2009 Re: Ternary Search Trees
Posted in reply to bearophile

bearophile wrote:
> Robert Fraser:
>> I implemented a version of the TST you posted using Tango + D1... Here are my results:
>
> Nice.
> Is your TST a map? (not a set).

Yes, it's a map (for my use case).

> Are you willing to show me your TST code? Your code is surely better than mine, and there can be a few little things that can be tuned.
> And are you also willing to add it to Phobos? (I'll port it to Phobos1, it's probably easy to do.)

Attached, if you don't mind NO comments. I really want to port this:
http://dasnar.sdf-eu.org/res/ctst-README.html
... The automatic balancing might make it a lot better if elements are inserted in order. It might be very easy by replacing the malloc/free calls with gc_malloc/gc_free calls.

>> The big winner here, though, appears to be D's built-in associative arrays. I thought they were supposed to be very slow, but Tango's implementation, at least, looks pretty good (without any re-hashing).
>
> Try comparing the performance of built-in D AAs with Psyco+Python dicts, or a good C++ STL hash set/hash map. D AAs are slow probably also because Phobos is statically compiled; they aren't a template.

I think Phobos1's AAs were pretty bad, but have you tried Tango's/druntime's? They're quite optimized (they're implemented as a hash with trees used for collisions).
Copyright © 1999-2021 by the D Language Foundation