Thread overview
How to resume iteration from previous point in associative array
Feb 19, 2014
Tarman
Feb 19, 2014
Tarman
Feb 19, 2014
bearophile
Feb 19, 2014
Nicolas Sicard
Feb 19, 2014
bearophile
Feb 19, 2014
Ali Çehreli
Feb 19, 2014
simendsjo
Feb 19, 2014
Tobias Pankrath
Feb 19, 2014
Tarman
February 19, 2014
Hi,

We're doing some "super computing" "big data" style stuff with D. We have a system where we're comparing associative arrays with billions of entries.

However in this system we need to fairly consider possible solutions for many "units" at a time within a single thread.

So we'd like to... say, first iterate over the first 10 million for each "unit" then iterate over the next 10 million for the next unit so that each unit gets a fair share of CPU time.

However in this case we can't:

int count = 0;
foreach (item; associativeArray) {
  if (++count == 10_000_000) {
    // here we wish to somehow save the iteration state
  }
}

Then on the next iteration:

foreach (resume from previous iteration point) {
}

We've tried copying the keys into a non-associative array and sure this works, but it is far far far less optimal than an equivalent C++ solution we wrote where we use an std::unordered_set and can simply store the iterator.
February 19, 2014
I guess what might work here is to allow some kind of argument to "byKey" that will let us resume so maybe something like:

foreach (item; massiveAssociativeArray.byKey(indexOfKeyToStartFrom)) {
}

If something like this isn't possible, then would one of our engineers be able to submit a patch to libphobos to provide the feature?
February 19, 2014
Tarman:

> We're doing some "super computing" "big data" style stuff with D. We have a system where we're comparing associative arrays with billions of entries.

Built-in associative arrays were never tested with so many pairs, so perform exhaustive performance benchmarks first, and report in Bugzilla the performance and memory problems you hit.


> We've tried copying the keys into a non-associative array and sure this works, but it is far far far less optimal than an equivalent C++ solution we wrote where we use an std::unordered_set and can simply store the iterator.

Have you tried the simplest thing, to let std.parallelism chunk a  AA.byKey.zip(AA.byValue) ?

Bye,
bearophile
February 19, 2014
On Wednesday, 19 February 2014 at 09:21:48 UTC, Tarman wrote:
> Hi,
>
> We're doing some "super computing" "big data" style stuff with D. We have a system where we're comparing associative arrays with billions of entries.
>
> However in this system we need to fairly consider possible solutions for many "units" at a time within a single thread.
>
> So we'd like to... say, first iterate over the first 10 million for each "unit" then iterate over the next 10 million for the next unit so that each unit gets a fair share of CPU time.
>
> However in this case we can't:
>
> int count = 0;
> foreach (item; associativeArray) {
>   if (++count == 10_000_000) {
>     // here we wish to somehow save the iteration state
>   }
> }
>
> Then on the next iteration:
>
> foreach (resume from previous iteration point) {
> }
>
> We've tried copying the keys into a non-associative array and sure this works, but it is far far far less optimal than an equivalent C++ solution we wrote where we use an std::unordered_set and can simply store the iterator.

Probably not an answer to your question, but if possible, you can use a fiber and yield when you should give control to another fiber.
http://dlang.org/phobos/core_thread.html#.Fiber
February 19, 2014
On Wednesday, 19 February 2014 at 09:21:48 UTC, Tarman wrote:
> Hi,
>
> We're doing some "super computing" "big data" style stuff with D. We have a system where we're comparing associative arrays with billions of entries.
>
> However in this system we need to fairly consider possible solutions for many "units" at a time within a single thread.
>
> So we'd like to... say, first iterate over the first 10 million for each "unit" then iterate over the next 10 million for the next unit so that each unit gets a fair share of CPU time.
>
> However in this case we can't:
>
> int count = 0;
> foreach (item; associativeArray) {
>   if (++count == 10_000_000) {
>     // here we wish to somehow save the iteration state
>   }
> }
>
> Then on the next iteration:
>
> foreach (resume from previous iteration point) {
> }
>
> We've tried copying the keys into a non-associative array and sure this works, but it is far far far less optimal than an equivalent C++ solution we wrote where we use an std::unordered_set and can simply store the iterator.

You could iterate over byValue/byKey/zip(byKey,byValue) which gives you a range to start again from. You'd need to iterate by hand (with while,for) though, since your range won't get consumed by foreach, I guess. Or use RefRange.

not tested:

auto values = refRange(&associativeArray.byValue);
foreach(item; values)
{
  if(someday) break;
}

// here values should have been advanced accordingly
// and you can just continue your iteration
foreach(item; values)
{

}


February 19, 2014
@bearophile:

Thanks for your suggestion! Yes we've tried this but unfortunately the performance doesn't work for us, maybe because all our CPUs are quite saturated.

@simendsjo:

Thanks also for your suggestion, we are using tasks and yield in our code but in this case we felt that using it just to resume iteration was perhaps a misuse of the feature although we would be interested to see what the overhead is.

@Mr Pankrath:

Thanks that looks like what we need!
February 19, 2014
On Wednesday, 19 February 2014 at 09:28:27 UTC, bearophile wrote:
> Tarman:
>
>> We're doing some "super computing" "big data" style stuff with D. We have a system where we're comparing associative arrays with billions of entries.
>
> Built-in associative arrays were never tested with so many pairs, so perform exhaustive performance benchmarks first, and report in Bugzilla the performance and memory problems you hit.
>
>
>> We've tried copying the keys into a non-associative array and sure this works, but it is far far far less optimal than an equivalent C++ solution we wrote where we use an std::unordered_set and can simply store the iterator.
>
> Have you tried the simplest thing, to let std.parallelism chunk a
>  AA.byKey.zip(AA.byValue) ?
>
> Bye,
> bearophile

Are byKey and byValue guaranteed to iterate the associative array in the same order?
February 19, 2014
Nicolas Sicard:

> Are byKey and byValue guaranteed to iterate the associative array in the same order?

Nope, or not yet, sorry.

Bye,
bearophile
February 19, 2014
On 02/19/2014 04:11 AM, bearophile wrote:
> Nicolas Sicard:
>
>> Are byKey and byValue guaranteed to iterate the associative array in
>> the same order?
>
> Nope, or not yet, sorry.
>
> Bye,
> bearophile

Is there a bug report about that? I can't imagine how they can iterate in any other order.

Ali