Thread overview
How to use sets in D?
Feb 08, 2022
tastyminerals
Feb 08, 2022
H. S. Teoh
Feb 09, 2022
Siarhei Siamashka
Feb 10, 2022
Siarhei Siamashka
Feb 10, 2022
H. S. Teoh
Feb 10, 2022
tastyminerals
Feb 08, 2022
Paul Backus
Feb 08, 2022
H. S. Teoh
Feb 08, 2022
Paul Backus
Feb 09, 2022
H. S. Teoh
February 08, 2022
https://forum.dlang.org/post/mailman.1072.1581112984.31109.digitalmars-d-learn@puremagic.com

On Friday, 7 February 2020 at 22:03:00 UTC, H. S. Teoh wrote:
> On Fri, Feb 07, 2020 at 07:37:08PM +0000, mark via Digitalmars-d-learn wrote:
>> [...]
>
> bool[E] works just fine.
>
> The bool does take up 1 byte, though, so if you're *really* want to optimize that away, you could do this:
>
> 	alias Unit = void[0];
> 	enum unit = Unit.init;
>
> 	// Look, ma! A bona fide set!
> 	Unit[E] mySet;
>
> 	mySet[...] = unit;
> 	mySet.remove(...);
> 	... // etc.
>
> Or you can wrap void[0][E] in a nice user-defined type that gives nice set-like syntax.  But IMO, this is all overkill, and adds needless complexity. Just use bool[E] or std.container.rbtree. :-D
>
>
> T

Can you please explain what does `bool[E]` mean? And how does the code with aliasing void[0] and then using enum even work?
February 08, 2022
On Tue, Feb 08, 2022 at 09:08:47PM +0000, tastyminerals via Digitalmars-d-learn wrote: [...]
> > bool[E] works just fine.
> > 
> > The bool does take up 1 byte, though, so if you're *really* want to optimize that away, you could do this:
> > 
> > 	alias Unit = void[0];
> > 	enum unit = Unit.init;
> > 
> > 	// Look, ma! A bona fide set!
> > 	Unit[E] mySet;
> > 
> > 	mySet[...] = unit;
> > 	mySet.remove(...);
> > 	... // etc.
> > 
> > Or you can wrap void[0][E] in a nice user-defined type that gives nice set-like syntax.  But IMO, this is all overkill, and adds needless complexity. Just use bool[E] or std.container.rbtree. :-D
[...]
> Can you please explain what does `bool[E]` mean?

E is whatever key type you want to use. String, or integer ID, or whatever you want to put in your set.


> And how does the code with aliasing void[0] and then using enum even work?

The alias is a workaround for a syntactic limitation in D, because you cannot write `mySet[someKey] = void[0].init;` without a syntax error. So we use the alias to give `void[0]` a name that we can refer to in assignment statements, so that we can assign its values to the AA.

As for `void[0]` itself, it's just a zero-element static array of an indeterminate type. The whole point is to create something that occupies 0 bytes. The actual type doesn't actually matter; you could have used int[0] and it'd work too.  (You can't use an empty struct because empty structs have size 1.)  By using a zero-sized value type for the AA, we effectively turn it into a set: it only stores keys.  Well, technically it stores values of zero size too, but since it's zero-sized, it's as if the value isn't there, and it incurs zero memory overhead. (Whereas an empty struct would likely occupy an entire machine word, rounded up from size 1.)

But as I said, this is overkill for something so trivial. Using `bool[E]` or an RBTree works just fine.


T

-- 
Javascript is what you use to allow third party programs you don't know anything about and doing you know not what to run on your computer. -- Charles Hixson
February 08, 2022

On Tuesday, 8 February 2022 at 21:08:47 UTC, tastyminerals wrote:

>

https://forum.dlang.org/post/mailman.1072.1581112984.31109.digitalmars-d-learn@puremagic.com

On Friday, 7 February 2020 at 22:03:00 UTC, H. S. Teoh wrote:

>

On Fri, Feb 07, 2020 at 07:37:08PM +0000, mark via Digitalmars-d-learn wrote:

>

[...]

bool[E] works just fine.

The bool does take up 1 byte, though, so if you're really want to optimize that away, you could do this:

alias Unit = void[0];
enum unit = Unit.init;

// Look, ma! A bona fide set!
Unit[E] mySet;

mySet[...] = unit;
mySet.remove(...);
... // etc.

[...]

>

Can you please explain what does bool[E] mean? And how does the code with aliasing void[0] and then using enum even work?

bool[E] means an associative array with keys of type E and values of type bool. Since a bool value takes up 1 byte in memory, using a bool[E] associative array to store a set of E values means that for every value in your set, you will have to allocate an extra 1 byte for the bool in addition to the bytes required for the E itself.

In most cases, 1 extra byte is not going to matter very much, so that's not a big deal. But if you really, really want to avoid allocating any extra bytes, you can use void[0] in place of bool, since a void[0] takes up 0 bytes in memory. (The void part has no special significance here--you could also use int[0] or char[0], for example.)

The alias and the enum just make the code a little nicer to read by letting you write Unit instead of void[0] and unit instead of void[0].init. You could get rid of them and the code would work exactly the same way; it'd just be a little bit uglier:

void[0][E] mySet;

mySet[...] = void[0].init;
mySet.remove(...);
// etc.

The name "unit" comes from the concept of a unit type in theoretical computer science.

February 08, 2022
On Tue, Feb 08, 2022 at 09:47:13PM +0000, Paul Backus via Digitalmars-d-learn wrote: [...]
> The `alias` and the `enum` just make the code a little nicer to read by letting you write `Unit` instead of `void[0]` and `unit` instead of `void[0].init`. You could get rid of them and the code would work exactly the same way; it'd just be a little bit uglier:
> 
>     void[0][E] mySet;
> 
>     mySet[...] = void[0].init;
[...]

Unfortunately, this doesn't work due to a syntax restriction (the parser isn't expecting a type name after the `=`, and will raise a syntax error). So the alias is in fact necessary.


T

-- 
Fact is stranger than fiction.
February 08, 2022
On Tuesday, 8 February 2022 at 21:58:49 UTC, H. S. Teoh wrote:
> On Tue, Feb 08, 2022 at 09:47:13PM +0000, Paul Backus via Digitalmars-d-learn wrote: [...]
>> The `alias` and the `enum` just make the code a little nicer to read by letting you write `Unit` instead of `void[0]` and `unit` instead of `void[0].init`. You could get rid of them and the code would work exactly the same way; it'd just be a little bit uglier:
>> 
>>     void[0][E] mySet;
>> 
>>     mySet[...] = void[0].init;
> [...]
>
> Unfortunately, this doesn't work due to a syntax restriction (the parser isn't expecting a type name after the `=`, and will raise a syntax error). So the alias is in fact necessary.

Ah, yeah, I forgot about that bug.

It also works if you use parentheses:

    mySet[...] = (void[0]).init;
February 08, 2022
On Tue, Feb 08, 2022 at 10:02:30PM +0000, Paul Backus via Digitalmars-d-learn wrote: [...]
> It also works if you use parentheses:
> 
>     mySet[...] = (void[0]).init;

OT1H, nice that works! ... but OTOH, ugh, it looks so ugly. :-P  I think I'll just stick with the alias.


T

-- 
Who told you to swim in Crocodile Lake without life insurance??
February 09, 2022

On Tuesday, 8 February 2022 at 21:42:06 UTC, H. S. Teoh wrote:

>

But as I said, this is overkill for something so trivial. Using bool[E] or an RBTree works just fine.

Is the current implementation of associative arrays in D language resistant to Denial of Service hash collision attacks?

If not, then redBlackTree may be safer to use when dealing with potentially adversarially constructed input data.

February 10, 2022

On Wednesday, 9 February 2022 at 21:05:47 UTC, Siarhei Siamashka wrote:

>

Is the current implementation of associative arrays in D language resistant to Denial of Service hash collision attacks?

Answering to myself. No, it isn't. Here's a simple example:

import std, core.time;

const ulong n = 100000;

// Count the number of unique elements using associative array
ulong count_unique_values_assoc(const ulong[] a)
{
    bool[ulong] set;
    foreach (x ; a)
        set[x] = true;
    return set.length;
}

// Count the number of unique elements using redBlackTree
ulong count_unique_values_rbtree(const ulong[] a)
{
    auto set = redBlackTree!("a<b", false, ulong)();
    foreach (x ; a)
        set.insert(x);
    return set.length;
}

// Generate array, which triggers hash collisions in the
// current D language associative array implementation
ulong[] generate_bad_array(ulong n)
{
    return iota(n).map!(x => (x << 46)).array;
}

// Benchmark associative array vs. rbtree
void run_benchmark(ulong[] a)
{
    auto t1 = MonoTime.currTime;
    auto result_assoc = a.count_unique_values_assoc;
    auto t2 = MonoTime.currTime;
    auto result_rbtree = a.count_unique_values_rbtree;
    auto t3 = MonoTime.currTime;

    assert(result_assoc == result_rbtree);
    writefln("n=%d, unique_elements=%d, assoc time: %d ms, rbtree time: %d ms",
             a.length, result_assoc,
             (t2 - t1).total!"msecs",
             (t3 - t2).total!"msecs");
}

void main()
{
    assert([1UL, 1UL, 2UL].count_unique_values_assoc == 2);
    assert([1UL, 1UL, 2UL].count_unique_values_rbtree == 2);

    writefln("== consecutive numbers ==");
    n.iota.array.run_benchmark;

    writefln("\n== random numbers between 0 and 99 ==");
    n.iota.map!(_ => uniform(0UL, 100UL)).array.run_benchmark;

    writefln("\n== adversarially constructed array ==");
    n.generate_bad_array.run_benchmark;
}

Benchmark results using a 64-bit LDC compiler:

== consecutive numbers ==
n=100000, unique_elements=100000, assoc time: 11 ms, rbtree time: 20 ms

== random numbers between 0 and 99 ==
n=100000, unique_elements=100, assoc time: 2 ms, rbtree time: 4 ms

== adversarially constructed array ==
n=100000, unique_elements=100000, assoc time: 22626 ms, rbtree time: 15 ms
February 09, 2022
On Thu, Feb 10, 2022 at 12:53:08AM +0000, Siarhei Siamashka via Digitalmars-d-learn wrote:
> On Wednesday, 9 February 2022 at 21:05:47 UTC, Siarhei Siamashka wrote:
> > Is the current implementation of associative arrays in D language resistant to Denial of Service hash collision attacks?
> 
> Answering to myself. No, it isn't. Here's a simple example:
[...]
> ulong[] generate_bad_array(ulong n)
> {
>     return iota(n).map!(x => (x << 46)).array;
> }
[...]
> == adversarially constructed array ==
> n=100000, unique_elements=100000, assoc time: 22626 ms, rbtree time: 15 ms
> ```

Nice investigation!  This should be filed as a bug.

I remember browsing through various hash functions in druntime before (built-in AA's use per-type hash functions), and many of them were trivial, which would imply that it's trivial to construct adversarial input that triggers a large number of collisions for various basic types.


T

-- 
Give me some fresh salted fish, please.
February 10, 2022
On Tuesday, 8 February 2022 at 21:42:06 UTC, H. S. Teoh wrote:
> On Tue, Feb 08, 2022 at 09:08:47PM +0000, tastyminerals via Digitalmars-d-learn wrote: [...]
>> > [...]
> [...]
>> [...]
>
> E is whatever key type you want to use. String, or integer ID, or whatever you want to put in your set.
>
> [...]

Thank you! That's neat.