April 04, 2023

On Monday, 3 April 2023 at 23:50:48 UTC, Steven Schveighoffer wrote:

> So what you need is: inside createSpansOfNoBeacons, take a ref Span[MAX_SPANS] as a parameter, and have it return a Span[] that is a slice of the part that was "allocated".
>
> See if this helps.

Well Steven, just making the change you suggested reduced the execution time from ~6-7 secs to ~3 secs. Then, adding 'parallel' to the foreach statement took it down to ~1 sec.

Boy, lesson learned about appending to and zeroing dynamic arrays in a hot loop!
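For readers landing here, the fix Steven describes can be sketched roughly like this (Span, MAX_SPANS, and the fill logic are hypothetical stand-ins for the actual puzzle code, not the real solution):

```d
import std.stdio;

enum MAX_SPANS = 64;          // assumed capacity, hypothetical
struct Span { int lo, hi; }   // hypothetical span type

// Fill a caller-provided fixed-size buffer and return the used slice:
// no GC allocation happens inside the hot loop.
Span[] createSpans(ref Span[MAX_SPANS] buf)
{
    size_t n = 0;
    foreach (i; 0 .. 3)            // stand-in for the real work
        buf[n++] = Span(i, i + 1);
    return buf[0 .. n];            // slice of the static buffer
}

void main()
{
    Span[MAX_SPANS] storage;       // allocated once, outside the loop
    foreach (iteration; 0 .. 1000)
    {
        auto spans = createSpans(storage);  // reuses the same memory
        assert(spans.length == 3);
    }
    writeln("ok");
}
```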

April 04, 2023
On Tue, Apr 04, 2023 at 09:35:29PM +0000, Paul via Digitalmars-d-learn wrote: [...]
> Well Steven just making the change you said reduced the execution time from ~6-7 secs to ~3 secs.  Then, including the 'parallel' in the foreach statement took it down to ~1 sec.
> 
> Boy lesson learned in appending-to and zeroing dynamic arrays in a hot loop!

Best practices for arrays in hot loops:
- Avoid appending if possible; instead, pre-allocate outside the loop.
- Where possible, reuse existing arrays instead of discarding old ones
  and allocating new ones.
- Use slices where possible instead of making copies of subarrays (this
  esp. applies to strings).
- Where possible, prefer sequential access over random access (take
  advantage of the CPU cache hierarchy).
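In D terms, the first three points might look like this (a sketch; std.array.Appender is one way to pre-allocate and reuse a buffer):

```d
import std.stdio;
import std.array : appender;

void main()
{
    // Pre-allocate once, then reuse across iterations instead of
    // appending with a fresh array in the hot loop.
    auto buf = appender!(int[])();
    buf.reserve(1024);            // one up-front allocation

    foreach (iter; 0 .. 100)
    {
        buf.clear();              // keeps capacity, drops contents
        foreach (i; 0 .. 1000)
            buf.put(i);
        assert(buf[].length == 1000);
    }

    // Slicing shares memory with the original: no copy is made.
    string s = "hello, world";
    string word = s[0 .. 5];      // a view into s, not a new string
    assert(word == "hello");
    writeln("ok");
}
```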


T

-- 
Famous last words: I *think* this will work...
April 05, 2023
On Tuesday, 4 April 2023 at 22:20:52 UTC, H. S. Teoh wrote:

> Best practices for arrays in hot loops:
> - Avoid appending if possible; instead, pre-allocate outside the loop.
> - Where possible, reuse existing arrays instead of discarding old ones
>   and allocating new ones.
> - Use slices where possible instead of making copies of subarrays (this
>   esp. applies to strings).
> - Where possible, prefer sequential access over random access (take
>   advantage of the CPU cache hierarchy).

Thanks for sharing Teoh!  Very helpful.

would this be random access? for(size_t i; i<arr.length; i++) using indices?
...and this be sequential foreach(a;arr) ?

or would they have to be completely different kinds of containers?
a dlang 'range' vs arr[]?


April 05, 2023
On 4/5/23 6:34 PM, Paul wrote:
> On Tuesday, 4 April 2023 at 22:20:52 UTC, H. S. Teoh wrote:
> 
>> Best practices for arrays in hot loops:
>> - Avoid appending if possible; instead, pre-allocate outside the loop.
>> - Where possible, reuse existing arrays instead of discarding old ones
>>   and allocating new ones.
>> - Use slices where possible instead of making copies of subarrays (this
>>   esp. applies to strings).
>> - Where possible, prefer sequential access over random access (take
>>   advantage of the CPU cache hierarchy).
> 
> Thanks for sharing Teoh!  Very helpful.
> 
> would this be random access? for(size_t i; i<arr.length; i++) using indices?
> ...and this be sequential foreach(a;arr) ?

No, random access is access out of sequence. Those two lines are pretty much equivalent, and even a naive compiler is going to produce exactly the same generated code from both of them.

A classic example is processing a 2d array:

```d
for(int i = 0; i < arr[0].length; ++i)
   for(int j = 0; j < arr.length; ++j)
     arr[j][i]++;

// vs
for(int j = 0; j < arr.length; ++j)
   for(int i = 0; i < arr[0].length; ++i)
     arr[j][i]++;
```

The first accesses elements *by column*, which means that the array data is accessed non-linearly in memory.

To be fair, both are "linear" in terms of algorithm, but one is going to be faster because of cache coherency (you are accessing sequential *hardware addresses*).
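A quick way to observe the difference is to time the two loop orders. Absolute numbers vary by machine and compiler, but the row-major order is typically the clear winner:

```d
import std.datetime.stopwatch : StopWatch;
import std.stdio;

void main()
{
    enum N = 2000;
    auto arr = new int[][](N, N);   // N x N; each row is contiguous

    StopWatch sw;

    sw.start();
    foreach (j; 0 .. N)             // row-major: sequential addresses
        foreach (i; 0 .. N)
            arr[j][i]++;
    sw.stop();
    auto rowMajor = sw.peek;

    sw.reset();
    sw.start();
    foreach (i; 0 .. N)             // column-major: strided access
        foreach (j; 0 .. N)
            arr[j][i]++;
    sw.stop();
    auto colMajor = sw.peek;

    writefln("row-major: %s, column-major: %s", rowMajor, colMajor);
}
```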

-Steve
April 05, 2023
On Wed, Apr 05, 2023 at 10:34:22PM +0000, Paul via Digitalmars-d-learn wrote:
> On Tuesday, 4 April 2023 at 22:20:52 UTC, H. S. Teoh wrote:
> 
> > Best practices for arrays in hot loops:
[...]
> > - Where possible, prefer sequential access over random access (take
> >   advantage of the CPU cache hierarchy).
> 
> Thanks for sharing Teoh!  Very helpful.
> 
> would this be random access? for(size_t i; i<arr.length; i++) using
> indices?  ...and this be sequential foreach(a;arr) ?
> 
> or would they have to be completely different kinds of containers?  a dlang 'range' vs arr[]?
[...]

The exact syntactic construct you use is not important; under the hood, for(i; i<arr.length; i++) and foreach(a;arr) are exactly the same thing. What's important is the actual pattern of memory access.  Looping from the first element of an array to the last is sequential access; doing binary search on an array is random access (because it's unpredictable where the next access will be). Traversing a linked list is also random access, because in general individual nodes will be stored in different places in memory. So is a hashtable lookup.

Basically, modern CPUs have a cache hierarchy and memory prefetching controlled by an access predictor. When the predictor sees memory being accessed sequentially, it can speed up future accesses by preloading subsequent blocks of memory into cache lines, so that when the CPU next reads from that location it's already in the cache and there's no need to wait for a fetch cycle from RAM, which is slow.  The problem with things like hashtables is that it's unpredictable where the next access will be, so the CPU's predictor won't be able to preload the next access for you, and you'll incur a fetch cycle from RAM every time.

Of course, this doesn't mean that hashtable (or other data structure) lookups are necessarily bad.  If the alternative is linear scan through very large arrays, then hashtables will speed up your inner loop in spite of the cache misses. So it depends on your exact use case. But if your arrays are not overly large, scanning through the entire array may sometimes be faster than doing many hashtable lookups.  Some modern hashtable implementations are also adapted to take advantage of the cache hierarchy, so it may not be as bad as a naïve implementation.

Nevertheless, the general principle is that locality (adjacent memory accesses are close to each other) is good, and should generally be preferred over strategies that require jumping around in memory. So your data structures and algorithms should be designed in a way that takes advantage of linear access where possible.
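As a rough illustration of the trade-off (D's built-in associative arrays are hashtables; the actual crossover size is machine-dependent):

```d
import std.algorithm.searching : canFind;
import std.stdio;

void main()
{
    // For small collections, a linear scan over a contiguous array can
    // beat a hashtable lookup despite being O(n), because the scan is
    // cache-friendly and prefetchable.
    int[] small = [3, 1, 4, 1, 5, 9, 2, 6];
    assert(small.canFind(5));          // sequential scan, O(n)

    bool[int] set;                     // built-in hashtable (AA)
    foreach (x; small) set[x] = true;
    assert(5 in set);                  // O(1), but unpredictable access
    writeln("ok");
}
```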


T

-- 
In theory, there is no difference between theory and practice.
April 06, 2023
On Wednesday, 5 April 2023 at 23:06:54 UTC, H. S. Teoh wrote:

> So your data structures and algorithms should be designed in a way that takes advantage of linear access where possible.
>
>
> T

Yes I understand, basically, what's going on in hardware.  I just wasn't sure if the access type was linked to the container type.  It seems obvious now, since you've both made it clear, that it also depends on how I'm accessing my container.

Having said all of this, isn't a D 'range' fundamentally a sequential-access container (i.e., popFront)?
April 05, 2023
On Thu, Apr 06, 2023 at 01:20:28AM +0000, Paul via Digitalmars-d-learn wrote: [...]
> Yes I understand, basically, what's going on in hardware.  I just wasn't sure if the access type was linked to the container type.  It seems obvious now, since you've both made it clear, that it also depends on how I'm accessing my container.
> 
> Having said all of this, isn't a D 'range' fundamentally a sequential access container (i.e popFront) ?

D ranges are conceptually sequential, but the actual underlying memory access patterns depends on the concrete type at runtime. An array's elements are stored sequentially in memory, and arrays are ranges.  But a linked-list can also have a range interface, yet its elements may be stored in non-consecutive memory locations.  So the concrete type matters here; the range API only gives you conceptual sequentiality, it does not guarantee physically sequential memory access.
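For instance, both of the following satisfy the same input-range API while having very different memory layouts (a small sketch using Phobos containers):

```d
import std.range.primitives : isInputRange;
import std.container.slist : SList;
import std.algorithm.comparison : equal;
import std.stdio;

void main()
{
    // Both expose the same range interface...
    int[] arr = [1, 2, 3];
    auto list = SList!int(1, 2, 3);

    static assert(isInputRange!(int[]));
    static assert(isInputRange!(typeof(list[])));

    // ...and iterate in the same conceptual order,
    assert(arr.equal(list[]));

    // but arr's elements are contiguous in memory, while the list's
    // nodes may be scattered anywhere on the heap.
    writeln("ok");
}
```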


T

-- 
Many open minds should be closed for repairs. -- K5 user
April 06, 2023

On Tuesday, 4 April 2023 at 16:22:29 UTC, Steven Schveighoffer wrote:

> On 4/4/23 11:34 AM, Salih Dincer wrote:
>> On Tuesday, 4 April 2023 at 14:20:20 UTC, Steven Schveighoffer wrote:
>
> parallel is a shortcut to TaskPool.parallel, which is indeed a foreach-only construct; it does not return a range.
>
> I think what you want is TaskPool.map:
>
> ```d
> // untested, just looking at the
> taskPool.map!(/* your map function here */)
>    (s.iota(len)).writeln;
> ```

I tried it, thanks, but it goes into an infinite loop. For example, the first 50 elements of the sequence should have been printed to the screen immediately, without waiting.

```d
long[50] arr;
RowlandSequence_v2 range;

auto next(long a)
{
    range.popFront();
    return arr[a] = range.front();
}

void main()
{
    range = RowlandSequence_v2(7, 2);
    taskPool.map!next(iota(50))/*
    s.iota(50)
     .map!next
     .parallel//*/
     .writeln;
}
```

> Keep in mind that arr and range are thread-local, and so will have different states for different tasks.
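The thread-local pitfall can be reproduced in a minimal sketch (the names here are illustrative, not from the code above):

```d
import std.parallelism : parallel;
import std.stdio;

int counter;   // module-level variables are thread-local by default in D

void main()
{
    // Each worker thread gets its own copy of `counter`, so increments
    // made inside the parallel loop on other threads are invisible to
    // the main thread afterwards.
    foreach (i; parallel(new int[](100)))
        counter++;

    writefln("main thread counter = %s", counter);
    // Often less than 100: many increments landed in the worker
    // threads' copies of `counter`.  Use `shared` (with synchronization)
    // or __gshared if all threads must see one variable.
}
```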

I guess the reason it goes into an infinite loop is that gcd() is a recursive function. This is the only solution I can think of for this:

```d
import std.range, std.algorithm : map;
import std.stdio, std.parallelism;
//import std.numeric : gcd;
/*
struct Vector {
  long x, y, result;
  alias result this;
}

Vector gcd(long a, long b) {
  if(b == 0) return Vector(1, 0, a);

  auto pre = gcd(b, a % b);
  auto tmp = (a / b) * pre.y;

  return Vector(pre.y, pre.x - tmp, pre.result);
}//*/

struct RowlandSequence_v3 {
  long b, r, n, a = 3, limit;

  bool empty() { return n == limit; }

  auto front() { return a; }

  void popFront() {
    long result = 1;
    while(result == 1) {
      result = gcd(r++, b);
      b += result;
    }
    a = result;
  }

  long gcd(long a, long b) {
    long c;
    while(b != 0) {
      c = a % b;
      a = b;
      b = c;
    }
    return a;
  }
}

auto next(ref RowlandSequence_v3 range) {
  with(range) {
    if(empty) return [n, front];
    popFront();
    return [n++, front];
  }
}

long[179] arr;

void main()
{
  // initialization:
  RowlandSequence_v3[4] r = [
    RowlandSequence_v3(7, 2, 0, 3, 112),
    RowlandSequence_v3(186837678, 62279227, 112, 3, 145),
    RowlandSequence_v3(747404910, 249134971, 145, 6257, 160),
    RowlandSequence_v3(1494812421, 498270808, 160, 11, 177)
  ];

  auto tasks = [ task(&next, r[0]),
                 task(&next, r[1]),
                 task(&next, r[2]),
                 task(&next, r[3])
               ];

  // quad parallel operation:
  foreach(_; 0..r[0].limit)
  {
    foreach(p, ref task; tasks)
    {
        task.executeInNewThread;
        auto t = task.workForce;
        arr[t[0]] = t[1];
    }
  }

  // prints...
  foreach(x, n; arr) {
    switch(x + 1) {
      case 112, 145, 160:
        n.writeln; break;
      default:
        n.write(", ");
    }
  }
} /* PRINTS:
user@debian:~/Documents$ dmd -O rowlandSequence.d -release
user@debian:~/Documents$ time ./rowlandSequence
5, 3, 11, 3, 23, 3, 47, 3, 5, 3, 101, 3, 7, 11, 3, 13, 233, 3, 467, 3, 5, 3, 941, 3, 7, 1889, 3, 3779, 3, 7559, 3, 13, 15131, 3, 53, 3, 7, 30323, 3, 60647, 3, 5, 3, 101, 3, 121403, 3, 242807, 3, 5, 3, 19, 7, 5, 3, 47, 3, 37, 5, 3, 17, 3, 199, 53, 3, 29, 3, 486041, 3, 7, 421, 23, 3, 972533, 3, 577, 7, 1945649, 3, 163, 7, 3891467, 3, 5, 3, 127, 443, 3, 31, 7783541, 3, 7, 15567089, 3, 19, 29, 3, 5323, 7, 5, 3, 31139561, 3, 41, 3, 5, 3, 62279171, 3, 7, 83, 3
29, 3, 1103, 3, 5, 3, 13, 7, 124559609, 3, 107, 3, 911, 3, 249120239, 3, 11, 3, 7, 61, 37, 179, 3, 31, 19051, 7, 3793, 23, 3, 5, 3, 6257, 3
3, 11, 3, 13, 5, 3, 739, 37, 5, 3, 498270791, 3, 19, 11, 3
3, 3, 5, 3, 996541661, 3, 7, 37, 5, 3, 67, 1993083437, 3, 5, 3, 83, 3, 3, 0,
real	7m54.093s
user	7m54.062s
sys	0m0.024s
*/
```

However, parallel processing of 4-digit sequence elements is not promising, at least for the Rowland sequence.

SDB@79

April 06, 2023
On Thursday, 6 April 2023 at 01:44:15 UTC, H. S. Teoh wrote:

>
> D ranges are conceptually sequential, but the actual underlying memory access patterns depends on the concrete type at runtime. An array's elements are stored sequentially in memory, and arrays are ranges.  But a linked-list can also have a range interface, yet its elements may be stored in non-consecutive memory locations.  So the concrete type matters here; the range API only gives you conceptual sequentiality, it does not guarantee physically sequential memory access.

Very helpful Teoh.  Thanks again.