Thread overview
How to use sets in D?
Feb 07, 2020
mark
Feb 07, 2020
Jan Hönig
Feb 07, 2020
H. S. Teoh
Feb 08, 2020
mark
Mar 08, 2020
mark
Mar 08, 2020
Johan
Mar 08, 2020
Jesse Phillips
Mar 09, 2020
mark
February 07, 2020
I am porting code from other languages to D as part of learning D, and I find I've used sets quite a lot. AFAIK D doesn't have a built-in set type or one in the std. lib.

However, I've been perfectly successfully using int[E] where E is my ElementType, and adding with set[element] = 0. I mostly only need add, remove, iteration, and in, with uniqueness what I care most about.

I know I could use bool[E] and set[element] = false, or I suppose container.rbtree. Would either of these--or something else built-in or in the std. lib.--be better?
February 07, 2020
On Friday, 7 February 2020 at 19:37:08 UTC, mark wrote:
> I am porting code from other languages to D as part of learning D, and I find I've used sets quite a lot. AFAIK D doesn't have a built-in set type or one in the std. lib.
>
> However, I've been perfectly successfully using int[E] where E is my ElementType, and adding with set[element] = 0. I mostly only need add, remove, iteration, and in, with uniqueness what I care most about.
>
> I know I could use bool[E] and set[element] = false, or I suppose container.rbtree. Would either of these--or something else built-in or in the std. lib.--be better?

I have not that much experience, but i have asked the exact same thing in the IRC.
rbtree was recommended. However, if it is not performance critical (in terms of speed or size), your solution works fine as well.
February 07, 2020
On Fri, Feb 07, 2020 at 07:37:08PM +0000, mark via Digitalmars-d-learn wrote:
> I am porting code from other languages to D as part of learning D, and I find I've used sets quite a lot. AFAIK D doesn't have a built-in set type or one in the std. lib.
> 
> However, I've been perfectly successfully using int[E] where E is my ElementType, and adding with set[element] = 0. I mostly only need add, remove, iteration, and in, with uniqueness what I care most about.
> 
> I know I could use bool[E] and set[element] = false, or I suppose container.rbtree. Would either of these--or something else built-in or in the std. lib.--be better?

bool[E] works just fine.

The bool does take up 1 byte, though, so if you're *really* want to optimize that away, you could do this:

	alias Unit = void[0];
	enum unit = Unit.init;

	// Look, ma! A bona fide set!
	Unit[E] mySet;

	mySet[...] = unit;
	mySet.remove(...);
	... // etc.

Or you can wrap void[0][E] in a nice user-defined type that gives nice set-like syntax.  But IMO, this is all overkill, and adds needless complexity. Just use bool[E] or std.container.rbtree. :-D


T

-- 
If you look at a thing nine hundred and ninety-nine times, you are perfectly safe; if you look at it the thousandth time, you are in frightful danger of seeing it for the first time. -- G. K. Chesterton
February 08, 2020
On Friday, 7 February 2020 at 22:03:00 UTC, H. S. Teoh wrote:
> On Fri, Feb 07, 2020 at 07:37:08PM +0000, mark via Digitalmars-d-learn wrote:
[snip]
> bool[E] works just fine.
[snip]
> Or you can wrap void[0][E] in a nice user-defined type that gives nice set-like syntax.  But IMO, this is all overkill, and adds needless complexity. Just use bool[E] or std.container.rbtree. :-D

I didn't think bool[E] would be a win because although it is only one byte per item, it won't align so wouldn't it end up taking 4 bytes of space anyway. The void[0][E] you showed is good, but, I'll try using the rbtree since I'd rather use an out-of-the-box collection.

Thanks for the replies.
March 08, 2020
I use sets a lot and am now working on a program that will need to hold sets of 65,000+ items, so I thought I do some timings for the different approaches.

Here are some timings (uset uses the AA Unit approach, tset uses an rbtree, and aset uses an AA with bool values):

$ ./sets.d
size 50,000
uset 1 ms, 340 μs, and 8 hnsecs
tset 4 ms, 637 μs, and 1 hnsec
aset 1 ms, 402 μs, and 6 hnsecs
$ ./sets.d
size 100,000
uset 2 ms, 338 μs, and 4 hnsecs
tset 12 ms, 262 μs, and 6 hnsecs
aset 2 ms and 991 μs
$ ./sets.d
size 200,000
uset 5 ms, 971 μs, and 5 hnsecs
tset 30 ms, 675 μs, and 5 hnsecs
aset 6 ms, 74 μs, and 6 hnsecs
$ ./sets.d
size 400,000
uset 11 ms, 823 μs, and 4 hnsecs
tset 74 ms, 146 μs, and 2 hnsecs
aset 12 ms, 560 μs, and 5 hnsecs

What seems pretty clear is that for my purposes (checking presence or absence of membership in a set), AAs are much faster than rbtrees. (This is to be expected since if an AA uses a hash is should be O(1) vs O(lg n) for an rbtree).

Here is the test code I used:

#!/usr/bin/env rdmd

enum SIZE = 400_000;
enum SUB_SIZE = SIZE / 10;
enum MIN_LEN = 10;
enum MAX_LEN = 50;
enum AZ = "abcdefghijklmnopqrstuvwxyz";

void main() {
    import std.stdio: writefln;

    auto data = Data.populate();
    auto sets = Sets(data.words);

    writefln("size %,d", SIZE);
    check(sets.uset, data.present, data.absent, "uset");
    check(sets.tset, data.present, data.absent, "tset");
    check(sets.aset, data.present, data.absent, "aset");
}

struct Sets {
    import std.container.rbtree: RedBlackTree;

    alias Unit = void[0];
    enum unit = Unit.init;

    Unit[string] uset;
    RedBlackTree!string tset;
    bool[string] aset;

    this(string[] words) {
        tset = new RedBlackTree!string;
        foreach (word; words) {
            uset[word] = unit;
            tset.insert(word);
            aset[word] = false;
        }
    }
}

struct Data {
    string[] words;
    string[] present;
    string[] absent;

    static Data populate() {
        Data data;
        for (int i = 0; i < SIZE; i++) {
            auto word = makeWord;
            data.words ~= word;
            if (data.present.length < SUB_SIZE)
                data.present ~= word;
            if (data.absent.length < SUB_SIZE)
                data.absent ~= word ~ "9";
        }
        return data;
    }
}

string makeWord() {
    import std.random: randomCover, uniform;
    import std.range: take;
    import std.string: join, split;

    enum AZS = (AZ ~ AZ ~ AZ ~ AZ ~ AZ ~ AZ).split("");
    return randomCover(AZS).take(uniform(MIN_LEN, MAX_LEN)).join("");
}

void check(T)(T set, string[] present, string[] absent, string name) {
    import std.datetime.stopwatch: AutoStart, StopWatch;
    import std.stdio: writeln;

    auto timer = StopWatch(AutoStart.yes);
    foreach (string p; present)
        assert(p in set);
    foreach (string a; absent)
        assert(a !in set);
    writeln(name, " ", timer.peek);
}
March 08, 2020
On Sunday, 8 March 2020 at 08:43:10 UTC, mark wrote:
>
> Here are some timings ...
[...]
> #!/usr/bin/env rdmd

Please remember that performance testing is not trivial.
At the very least, you should be testing optimized code (-O) and preferably with LDC or GDC because they have a much stronger optimizer than DMD.
Also, `assert` is not a good way to check something in performance testing code. See https://forum.dlang.org/thread/bpbyhmrsfzirfqggnlyw@forum.dlang.org

-Johan

March 08, 2020
On Friday, 7 February 2020 at 19:37:08 UTC, mark wrote:
> I am porting code from other languages to D as part of learning D, and I find I've used sets quite a lot. AFAIK D doesn't have a built-in set type or one in the std. lib.
>
> However, I've been perfectly successfully using int[E] where E is my ElementType, and adding with set[element] = 0. I mostly only need add, remove, iteration, and in, with uniqueness what I care most about.
>
> I know I could use bool[E] and set[element] = false, or I suppose container.rbtree. Would either of these--or something else built-in or in the std. lib.--be better?

I think I've usually used the associative arrays, but I also think I tend to avoid using this approach but couldn't quite remember what I do instead.

I believe I have started just using an array.

arr ~= addMyData;
arr.sort.uniq

Then I make use of the algorithms here.

https://dlang.org/phobos/std_algorithm_setops.html
March 09, 2020
On Sunday, 8 March 2020 at 17:58:16 UTC, Jesse Phillips wrote:
> On Friday, 7 February 2020 at 19:37:08 UTC, mark wrote:
[snip]
> I think I've usually used the associative arrays, but I also think I tend to avoid using this approach but couldn't quite remember what I do instead.
>
> I believe I have started just using an array.
>
> arr ~= addMyData;
> arr.sort.uniq
>
> Then I make use of the algorithms here.
>
> https://dlang.org/phobos/std_algorithm_setops.html

I can see the sense in that for use with union & intersection.
However, AA's (assuming they are hashes under the hood) give O(1) for `in` whereas those algorithms like the rbtree are O(lg n) for `in`, and most of what I'm doing at the moment is `in`. I'll keep the idea in mind though, because I have use cases for intersection.