How to resume iteration from previous point in associative array

Feb 19, 2014

Tarman

Feb 19, 2014

Tarman

Feb 19, 2014

Feb 19, 2014

Feb 19, 2014

Feb 19, 2014

Feb 19, 2014

Feb 19, 2014

Feb 19, 2014

Hi, We're doing some "super computing" "big data" style stuff with D. We have a system where we're comparing associative arrays with billions of entries. However in this system we need to fairly consider possible solutions for many "units" at a time within a single thread. So we'd like to... say, first iterate over the first 10 million for each "unit" then iterate over the next 10 million for the next unit so that each unit gets a fair share of CPU time. However in this case we can't: int count = 0; foreach (item; associativeArray) { if (++count == 10_000_000) { // here we wish to somehow save the iteration state } } Then on the next iteration: foreach (resume from previous iteration point) { } We've tried copying the keys into a non-associative array and sure this works, but it is far far far less optimal than an equivalent C++ solution we wrote where we use an std::unordered_set and can simply store the iterator.

I guess what might work here is to allow some kind of argument to "byKey" that will let us resume so maybe something like: foreach (item; massiveAssociativeArray.byKey(indexOfKeyToStartFrom)) { } If something like this isn't possible, then would one of our engineers be able to submit a patch to libphobos to provide the feature?

Tarman: > We're doing some "super computing" "big data" style stuff with D. We have a system where we're comparing associative arrays with billions of entries. Built-in associative arrays were never tested with so many pairs, so perform exhaustive performance benchmarks first, and report in Bugzilla the performance and memory problems you hit. > We've tried copying the keys into a non-associative array and sure this works, but it is far far far less optimal than an equivalent C++ solution we wrote where we use an std::unordered_set and can simply store the iterator. Have you tried the simplest thing, to let std.parallelism chunk a AA.byKey.zip(AA.byValue) ? Bye, bearophile

On Wednesday, 19 February 2014 at 09:21:48 UTC, Tarman wrote: > Hi, > > We're doing some "super computing" "big data" style stuff with D. We have a system where we're comparing associative arrays with billions of entries. > > However in this system we need to fairly consider possible solutions for many "units" at a time within a single thread. > > So we'd like to... say, first iterate over the first 10 million for each "unit" then iterate over the next 10 million for the next unit so that each unit gets a fair share of CPU time. > > However in this case we can't: > > int count = 0; > foreach (item; associativeArray) { > if (++count == 10_000_000) { > // here we wish to somehow save the iteration state > } > } > > Then on the next iteration: > > foreach (resume from previous iteration point) { > } > > We've tried copying the keys into a non-associative array and sure this works, but it is far far far less optimal than an equivalent C++ solution we wrote where we use an std::unordered_set and can simply store the iterator. Probably not an answer to your question, but if possible, you can use a fiber and yield when you should give control to another fiber. http://dlang.org/phobos/core_thread.html#.Fiber

February 19, 2014

Re: How to resume iteration from previous point in associative array

Posted by Tobias Pankrath
in reply to Tarman

Permalink

Tobias Pankrath

Posted in reply to Tarman

Permalink

On Wednesday, 19 February 2014 at 09:21:48 UTC, Tarman wrote:
> Hi,
>
> We're doing some "super computing" "big data" style stuff with D. We have a system where we're comparing associative arrays with billions of entries.
>
> However in this system we need to fairly consider possible solutions for many "units" at a time within a single thread.
>
> So we'd like to... say, first iterate over the first 10 million for each "unit" then iterate over the next 10 million for the next unit so that each unit gets a fair share of CPU time.
>
> However in this case we can't:
>
> int count = 0;
> foreach (item; associativeArray) {
>   if (++count == 10_000_000) {
>     // here we wish to somehow save the iteration state
>   }
> }
>
> Then on the next iteration:
>
> foreach (resume from previous iteration point) {
> }
>
> We've tried copying the keys into a non-associative array and sure this works, but it is far far far less optimal than an equivalent C++ solution we wrote where we use an std::unordered_set and can simply store the iterator.

You could iterate over byValue/byKey/zip(byKey,byValue) which gives you a range to start again from. You'd need to iterate by hand (with while,for) though, since your range won't get consumed by foreach, I guess. Or use RefRange.

not tested:

auto values = refRange(&associativeArray.byValue);
foreach(item; values)
{
  if(someday) break;
}

// here values should have been advanced accordingly
// and you can just continue your iteration
foreach(item; values)
{

}

@bearophile: Thanks for your suggestion! Yes we've tried this but unfortunately the performance doesn't work for us, maybe because all our CPUs are quite saturated. @simendsjo: Thanks also for your suggestion, we are using tasks and yield in our code but in this case we felt that using it just to resume iteration was perhaps a misuse of the feature although we would be interested to see what the overhead is. @Mr Pankrath: Thanks that looks like what we need!

On Wednesday, 19 February 2014 at 09:28:27 UTC, bearophile wrote: > Tarman: > >> We're doing some "super computing" "big data" style stuff with D. We have a system where we're comparing associative arrays with billions of entries. > > Built-in associative arrays were never tested with so many pairs, so perform exhaustive performance benchmarks first, and report in Bugzilla the performance and memory problems you hit. > > >> We've tried copying the keys into a non-associative array and sure this works, but it is far far far less optimal than an equivalent C++ solution we wrote where we use an std::unordered_set and can simply store the iterator. > > Have you tried the simplest thing, to let std.parallelism chunk a > AA.byKey.zip(AA.byValue) ? > > Bye, > bearophile Are byKey and byValue guaranteed to iterate the associative array in the same order?

On 02/19/2014 04:11 AM, bearophile wrote: > Nicolas Sicard: > >> Are byKey and byValue guaranteed to iterate the associative array in >> the same order? > > Nope, or not yet, sorry. > > Bye, > bearophile Is there a bug report about that? I can't imagine how they can iterate in any other order. Ali

Forums