May 06, 2020
On 06.05.2020 07:52, data pulverizer wrote:
> On Wednesday, 6 May 2020 at 04:04:14 UTC, Mathias LANG wrote:
>> On Wednesday, 6 May 2020 at 03:41:11 UTC, data pulverizer wrote:
>>> Yes, that's exactly what I want. The actual computation I'm running is much more expensive and much larger. It shouldn't matter if I have like 100_000_000 threads, should it? The threads should just be queued until the CPU works on them?
>>
>> It does matter quite a bit. Each thread has its own resources allocated to it, and some part of the language will need to interact with *all* threads, e.g. the GC.
>> In general, if you want to parallelize something, you should aim to have as many threads as you have cores. Having 100M threads will mean you have to do a lot of context switches. You might want to look up the difference between tasks and threads.
> 
> Sorry, I meant 10_000, not 100_000_000. I squared the number by mistake because I'm calculating a 10_000 x 10_000 matrix. It's only 10_000 tasks, so 1 task does 10_000 calculations. The actual bit of code I'm parallelising is here:
> 
> ```
> auto calculateKernelMatrix(T)(AbstractKernel!(T) K, Matrix!(T) data)
> {
>    long n = data.ncol;
>    auto mat = new Matrix!(T)(n, n);
> 
>    foreach(j; taskPool.parallel(iota(n)))
>    {
>      auto arrj = data.refColumnSelect(j).array;
>      for(long i = j; i < n; ++i)
>      {
>        mat[i, j] = K.kernel(data.refColumnSelect(i).array, arrj);
>        mat[j, i] = mat[i, j];
>      }
>    }
>    return mat;
> }
> ```
> 
> At the moment this code is running a little bit faster than threaded SIMD-optimised Julia code, but as I said in an earlier reply to Ali, when I look at my system monitor I can see that all the D threads are active but running at ~40% usage, meaning that they are mostly doing nothing. The Julia code runs all threads at 100% and is still a tiny bit slower, so my (maybe incorrect?) assumption is that I could get more performance from D. The method `refColumnSelect(j).array` (tries to) reference a column from a matrix (a 1D array with computed index referencing), which I select from the matrix using:
> 
> ```
> return new Matrix!(T)(data[startIndex..(startIndex + nrow)], [nrow, 1]);
> ```
> 
> If I use the above code, am I wrong in assuming that the sliced data (`T[]`) is referenced rather than copied? So that if I do:
> 
> ```
> auto myData = data[5..10];
> ```
> 
> myData references elements [5..10] of data rather than creating a new array with those elements copied?

General advice - try to avoid using `array` and `new` in hot code. Memory allocation is slow in general, unless you use carefully crafted custom memory allocators, and it can easily be the reason for the 40% CPU usage: the cores are waiting on the memory subsystem.
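For instance, for a matrix stored as one flat array, the whole buffer can be allocated once up front and each task can then work through a slice view, which allocates nothing. A minimal sketch (not the original Matrix code, just the shape of the idea):

```d
import std.parallelism : taskPool;
import std.range : iota;

void main()
{
    enum n = 100;
    auto mat = new double[n * n];   // one up-front allocation

    foreach (j; taskPool.parallel(iota(n)))
    {
        // Slicing the backing array is a view: no allocation, no copy.
        auto colj = mat[j * n .. (j + 1) * n];
        foreach (i; 0 .. n)
            colj[i] = 0.5 * i + j;  // stand-in for the kernel computation
    }
}
```

Since every task writes to a disjoint slice, no synchronisation is needed and the GC is never touched inside the parallel loop.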
May 06, 2020
On Wednesday, 6 May 2020 at 05:44:47 UTC, drug wrote:
>
> proc is already a delegate, so &proc is a pointer to the delegate; just pass `proc` itself

Thanks, I've done that but now I'm getting a range violation on `z` which was not there before.

```
core.exception.RangeError@onlineapp.d(3): Range violation
----------------
??:? _d_arrayboundsp [0x55de2d83a6b5]
onlineapp.d:3 void onlineapp.process(double, double, long, shared(double[])) [0x55de2d8234fd]
onlineapp.d:16 void onlineapp.main().__lambda1() [0x55de2d823658]
??:? void core.thread.osthread.Thread.run() [0x55de2d83bdf9]
??:? thread_entryPoint [0x55de2d85303d]
??:? [0x7fc1d6088668]
```

May 06, 2020
On Wednesday, 6 May 2020 at 05:50:23 UTC, drug wrote:
> General advice - try to avoid using `array` and `new` in hot code. Memory allocation is slow in general, unless you use carefully crafted custom memory allocators, and it can easily be the reason for the 40% CPU usage: the cores are waiting on the memory subsystem.

I changed the Matrix object from class to struct and the timing went from about 19 seconds with ldc2 and `-O5` to 13.69 seconds, but CPU usage is still at ~40% while using `taskPool.parallel(iota(n))`. The `.array` method is my own method on the Matrix object that just returns the internal data array, so it shouldn't copy. Julia is now at about 34 seconds (D was at about 30 seconds when just using dmd with no optimizations). To make things more interesting, I also did an implementation in Chapel, which is now at around 9 seconds with the `--fast` flag.
May 06, 2020
On 06.05.2020 09:24, data pulverizer wrote:
> On Wednesday, 6 May 2020 at 05:44:47 UTC, drug wrote:
>>
>> proc is already a delegate, so &proc is a pointer to the delegate; just pass `proc` itself
> 
> Thanks, I've done that but now I'm getting a range violation on `z` which was not there before.
> 
> ```
> core.exception.RangeError@onlineapp.d(3): Range violation
> ----------------
> ??:? _d_arrayboundsp [0x55de2d83a6b5]
> onlineapp.d:3 void onlineapp.process(double, double, long, shared(double[])) [0x55de2d8234fd]
> onlineapp.d:16 void onlineapp.main().__lambda1() [0x55de2d823658]
> ??:? void core.thread.osthread.Thread.run() [0x55de2d83bdf9]
> ??:? thread_entryPoint [0x55de2d85303d]
> ??:? [0x7fc1d6088668]
> ```
> 

Confirmed. I think that's because the `proc` delegate captures the `i` variable of the `for` loop. I managed to get rid of the range violation by using `foreach`:
```
foreach(i; 0..n) // instead of for(long i = 0; i < n;)
```
I guess the `proc` delegate can't capture the `i` variable of the `foreach` loop, so the range violation doesn't happen.
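The pitfall can be reproduced without threads at all. A D delegate captures enclosing variables by reference, not by value, so every delegate created inside the loop sees whatever value `i` has when it finally runs; copying the value through a function parameter (which is also what passing it to a constructor does) gives each delegate its own copy. A minimal sketch:

```d
// Each call to makeGetter copies v by value into its own closure.
int delegate() makeGetter(int v)
{
    return () => v;
}

void main()
{
    int delegate()[] leaky;
    for (int i = 0; i < 3; ++i)
        leaky ~= () => i;        // all three delegates share the same i
    // By the time we call them the loop has finished and i == 3:
    assert(leaky[0]() == 3 && leaky[2]() == 3);

    int delegate()[] fixed;
    for (int i = 0; i < 3; ++i)
        fixed ~= makeGetter(i);  // each delegate got its own copy
    assert(fixed[0]() == 0 && fixed[2]() == 2);
}
```

With threads the same sharing means a worker can observe `i == n`, which is exactly an out-of-bounds index into `z`.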

You use the `proc` delegate to pass arguments to the `process` function. For this purpose I would recommend deriving a class from `Thread`; then you can pass the arguments in the constructor of the derived class like:
```
foreach(long i; 0..n)
    new DerivedThread(cast(double)(i), cast(double)(i + 1), i, z).start();
thread_joinAll();
```

An untested example of a derived thread:
```
class DerivedThread : Thread
{
    this(double x, double y, long i, shared(double[]) z)
    {
        this.x = x;
        this.y = y;
        this.i = i;
        this.z = z;
        super(&run);
    }

private:
    void run()
    {
        process(x, y, i, z);
    }

    double x, y;
    long i;
    shared(double[]) z;
}
```
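Putting the pieces together, a self-contained version might look like the following (with a stand-in `process` that just writes `x + y` into `z[i]`, since the real one isn't shown in this thread):

```d
import core.thread;

// Stand-in for the process function discussed above: writes x + y into z[i].
void process(double x, double y, long i, shared(double[]) z)
{
    z[i] = x + y;
}

class DerivedThread : Thread
{
    this(double x, double y, long i, shared(double[]) z)
    {
        this.x = x;
        this.y = y;
        this.i = i;
        this.z = z;
        super(&run);   // Thread runs this.run when started
    }

private:
    void run()
    {
        process(x, y, i, z);
    }

    double x, y;
    long i;
    shared(double[]) z;
}

void main()
{
    enum n = 4;
    auto z = new shared(double)[](n);
    foreach (long i; 0 .. n)
        new DerivedThread(cast(double) i, cast(double)(i + 1), i, z).start();
    thread_joinAll();
    // z[i] == i + (i + 1), and each thread wrote to its own slot
}
```

Because each constructor copies `i` by value into a member, no delegate capture of the loop variable is involved, which sidesteps the range violation entirely.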

May 06, 2020
On 06.05.2020 09:43, data pulverizer wrote:
> On Wednesday, 6 May 2020 at 05:50:23 UTC, drug wrote:
>> General advice - try to avoid using `array` and `new` in hot code. Memory allocation is slow in general, unless you use carefully crafted custom memory allocators, and it can easily be the reason for the 40% CPU usage: the cores are waiting on the memory subsystem.
> 
> I changed the Matrix object from class to struct and the timing went from about 19 seconds with ldc2 and `-O5` to 13.69 seconds, but CPU usage is still at ~40% while using `taskPool.parallel(iota(n))`. The `.array` method is my own method on the Matrix object that just returns the internal data array, so it shouldn't copy. Julia is now at about 34 seconds (D was at about 30 seconds when just using dmd with no optimizations). To make things more interesting, I also did an implementation in Chapel, which is now at around 9 seconds with the `--fast` flag.

Things are really interesting. So there is room to improve performance by 2.5 times :-)
Yes, `array` is smart enough: if you call it on another array it is a no-op.
What does `--fast` mean in Chapel? Did you try `--fast-math` in ldc? I don't know if -O5 uses this flag.
May 06, 2020
On Wednesday, 6 May 2020 at 06:54:07 UTC, drug wrote:
> Things are really interesting. So there is room to improve performance by 2.5 times :-)
> Yes, `array` is smart enough: if you call it on another array it is a no-op.
> What does `--fast` mean in Chapel? Did you try `--fast-math` in ldc? I don't know if -O5 uses this flag.

I tried `--fast-math` in ldc but it didn't make any difference. The documentation of `--fast` in Chapel says "Disable checks; optimize/specialize".
May 06, 2020
On Wednesday, 6 May 2020 at 06:49:13 UTC, drug wrote:
> ... Then you can pass the arguments in the constructor of the derived class like:
> ```
> foreach(long i; 0..n)
>     new DerivedThread(cast(double)(i), cast(double)(i + 1), i, z).start();
> thread_joinAll();
> ```
>
> An untested example of a derived thread:
> ```
> class DerivedThread : Thread
> {
>     this(double x, double y, long i, shared(double[]) z)
>     {
>         this.x = x;
>         this.y = y;
>         this.i = i;
>         this.z = z;
>         super(&run);
>     }
> 
> private:
>     void run()
>     {
>         process(x, y, i, z);
>     }
> 
>     double x, y;
>     long i;
>     shared(double[]) z;
> }
> ```

Thanks. Now working.
May 06, 2020
On 2020-05-06 05:25, data pulverizer wrote:
> I have been using std.parallelism and that has worked quite nicely but it is not fully utilising all the cpu resources in my computation

If you happen to be using macOS: I know that when std.parallelism checks how many cores the computer has, it counts physical cores instead of logical cores. That could be a reason for the underutilisation, if you're running macOS.
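If core detection is the issue, std.parallelism lets you inspect and override the worker count yourself (`totalCPUs` and `defaultPoolThreads` are real std.parallelism symbols; the count 7 below is just an example for an 8-logical-core machine):

```d
import std.parallelism : totalCPUs, defaultPoolThreads;
import std.stdio : writeln;

void main()
{
    // totalCPUs is the core count std.parallelism detected on this machine.
    writeln("detected CPUs: ", totalCPUs);

    // If detection undercounts (e.g. physical instead of logical cores),
    // force the pool size before taskPool is first used:
    defaultPoolThreads = 7;   // 7 workers + the main thread = 8 total
}
```

Note that `defaultPoolThreads` must be set before the first use of `taskPool`, since the default pool is created lazily on first access.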

-- 
/Jacob Carlborg
May 06, 2020
On Wednesday, 6 May 2020 at 07:27:19 UTC, data pulverizer wrote:
> On Wednesday, 6 May 2020 at 06:54:07 UTC, drug wrote:
>> Thing are really interesting. So there is a space to improve performance in 2.5 times :-)
>> Yes, `array` is smart enough and if you call it on another array it is no op.
>> What means `--fast` in Chapel? Do you try `--fast-math` in ldc? Don't know if 05 use this flag
>
> I tried `--fast-math` in ldc but it didn't make any difference the documentation of `--fast` in Chapel says "Disable checks; optimize/specialize".

Just tried removing the boundscheck and got 1.5 seconds in D!
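For reference, in ldc2 the bounds checks can be disabled with a compiler flag; a hypothetical single-file build command would look something like:

```shell
# -boundscheck=off removes all array bounds checks (including in @safe code),
# so only use it once the indexing logic is known to be correct.
ldc2 -O5 -release -boundscheck=off onlineapp.d
```

`-boundscheck=safeonly` is the gentler middle ground: it keeps checks in `@safe` code while removing them elsewhere.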