May 06, 2020
On 2020-05-06 08:54, drug wrote:

> Did you try `--fast-math` in ldc? I don't know if -O5 uses this flag

Try the following flags as well:

`-mcpu=native -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto`

-- 
/Jacob Carlborg
May 06, 2020
On 2020-05-06 06:04, Mathias LANG wrote:

> In general, if you want to parallelize something, you should aim to have as many threads as you have cores.

That should be _logical_ cores. If the CPU supports hyper-threading it can run two threads per core.
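
The runtime can report that number directly; a minimal sketch:

```
import std.parallelism : totalCPUs;
import std.stdio : writeln;

void main()
{
    // totalCPUs counts logical cores, so hyperthreads are included;
    // it is the usual upper bound for the number of worker threads.
    writeln("logical cores: ", totalCPUs);
}
```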

-- 
/Jacob Carlborg
May 06, 2020
06.05.2020 10:42, data pulverizer wrote:
> On Wednesday, 6 May 2020 at 07:27:19 UTC, data pulverizer wrote:
>> On Wednesday, 6 May 2020 at 06:54:07 UTC, drug wrote:
>>> Things are really interesting. So there is room to improve performance by 2.5 times :-)
>>> Yes, `array` is smart enough that if you call it on another array it is a no-op.
>>> What does `--fast` mean in Chapel? Did you try `--fast-math` in ldc? I don't know if -O5 uses this flag
>>
>> I tried `--fast-math` in ldc but it didn't make any difference. The documentation of `--fast` in Chapel says "Disable checks; optimize/specialize".
> 
> Just tried removing the boundscheck and got 1.5 seconds in D!
> 

Congrats! It looks like a thriller!
What about CPU usage? The same 40%?
May 06, 2020
On Wednesday, 6 May 2020 at 07:42:44 UTC, data pulverizer wrote:
> On Wednesday, 6 May 2020 at 07:27:19 UTC, data pulverizer wrote:
>> On Wednesday, 6 May 2020 at 06:54:07 UTC, drug wrote:
>>> Things are really interesting. So there is room to improve performance by 2.5 times :-)
>>> Yes, `array` is smart enough that if you call it on another array it is a no-op.
>>> What does `--fast` mean in Chapel? Did you try `--fast-math` in ldc? I don't know if -O5 uses this flag
>>
>> I tried `--fast-math` in ldc but it didn't make any difference. The documentation of `--fast` in Chapel says "Disable checks; optimize/specialize".
>
> Just tried removing the boundscheck and got 1.5 seconds in D!

Cool! But before getting too excited I would recommend you also run tests to check that the resulting data is still correct before you keep this change, if you haven't done so already!
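
A minimal sketch of such a check, comparing the fast build's output against a reference result element-wise within a tolerance (the `matricesMatch` helper, the tolerance, and the sample values are illustrative, not from the thread):

```
import std.math : abs;

// Compare two matrices element-wise within a tolerance.
bool matricesMatch(const(double[][]) a, const(double[][]) b, double tol = 1e-9)
{
    if (a.length != b.length) return false;
    foreach (i; 0 .. a.length)
    {
        if (a[i].length != b[i].length) return false;
        foreach (j; 0 .. a[i].length)
            if (abs(a[i][j] - b[i][j]) > tol)
                return false;
    }
    return true;
}

unittest
{
    auto reference = [[1.0, 2.0], [3.0, 4.0]];
    auto fast      = [[1.0, 2.0], [3.0, 4.0 + 1e-12]];
    assert(matricesMatch(reference, fast));
}
```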

If you feel like it, I would recommend writing up a small blog article about what you learned about improving the performance of hot code like this. Maybe simply write a post on reddit or make a full blog post or something.

Ultimately, all the smart suggestions in here should probably be aggregated. More benchmarks and more blog articles always help discoverability.
May 06, 2020
On Wednesday, 6 May 2020 at 07:47:59 UTC, drug wrote:
> 06.05.2020 10:42, data pulverizer wrote:
>> On Wednesday, 6 May 2020 at 07:27:19 UTC, data pulverizer wrote:
>>> On Wednesday, 6 May 2020 at 06:54:07 UTC, drug wrote:
>>>> Things are really interesting. So there is room to improve performance by 2.5 times :-)
>>>> Yes, `array` is smart enough that if you call it on another array it is a no-op.
>>>> What does `--fast` mean in Chapel? Did you try `--fast-math` in ldc? I don't know if -O5 uses this flag
>>>
>>> I tried `--fast-math` in ldc but it didn't make any difference. The documentation of `--fast` in Chapel says "Disable checks; optimize/specialize".
>> 
>> Just tried removing the boundscheck and got 1.5 seconds in D!
>> 
>
> Congrats! It looks like a thriller!
> What about CPU usage? The same 40%?

CPU usage now revs up and almost has time to touch 100% before the process finishes! Interestingly, using `--boundscheck=off` without `--ffast-math` gives a timing of around 4 seconds, whereas using `--ffast-math` without `--boundscheck=off` made no difference; having both gives the 1.5 seconds. As Jacob Carlborg suggested I tried adding `-mcpu=native -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto` but I didn't see any difference.
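
For what it's worth, LDC can also apply these at function granularity instead of globally. A rough, LDC-only sketch, where the `sumSquares` kernel is purely illustrative: `@fastmath` comes from `ldc.attributes`, and indexing through `.ptr` skips the bounds check for just that access.

```
import ldc.attributes; // LDC-specific: provides @fastmath

// @fastmath enables fast-math for this function only; arr.ptr[i] indexes
// through the raw pointer, so no bounds check is emitted for that access.
@fastmath double sumSquares(const(double)[] arr)
{
    double total = 0.0;
    foreach (i; 0 .. arr.length)
        total += arr.ptr[i] * arr.ptr[i];
    return total;
}

void main()
{
    double[] xs = [1.0, 2.0, 3.0];
    assert(sumSquares(xs) == 14.0);
}
```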

The current Julia time is still around 35 seconds, even when using `@inbounds @simd` and running `julia -O3 --check-bounds=no`. I'll probably need to run the code by the Julia community to see whether it can be further optimized, but it's pretty interesting to see D so far in front. Interestingly, when I attempt to switch off the garbage collector in Julia, the process gets killed because my computer runs out of memory (I have over 26 GB of memory free), whereas in D the memory I'm using barely registers (max 300 MB) - it uses even less than Chapel (max 500 MB), which doesn't use much at all. It's exactly the same computation; D's and Julia's timings were similar before the GC optimization and compiler flag magic in D.
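
To put numbers on the D memory figure, druntime can report the GC's own view of its heap; a minimal sketch (the surrounding computation is elided and the report format is just illustrative):

```
import core.memory : GC;
import std.stdio : writefln;

void main()
{
    // ... run the computation here ...

    // GC.stats() reports what the D GC currently has in use versus free;
    // GC.disable()/GC.enable() can also bracket a hot section so that no
    // collection runs in the middle of it.
    auto s = GC.stats();
    writefln("GC used: %s MiB, free: %s MiB",
             s.usedSize / (1 << 20), s.freeSize / (1 << 20));
}
```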
May 06, 2020
On Wednesday, 6 May 2020 at 07:57:46 UTC, WebFreak001 wrote:
> On Wednesday, 6 May 2020 at 07:42:44 UTC, data pulverizer wrote:
>> On Wednesday, 6 May 2020 at 07:27:19 UTC, data pulverizer wrote:
>> Just tried removing the boundscheck and got 1.5 seconds in D!
>
> Cool! But before getting too excited I would recommend you also run tests to check that the resulting data is still correct before you keep this change, if you haven't done so already!

Yes, I've been outputting portions of the result, which is a 10_000 x 10_000 matrix, but it's definitely a good idea to do a full reconciliation of the outputs from all the languages.

> If you feel like it, I would recommend writing up a small blog article about what you learned about improving the performance of hot code like this. Maybe simply write a post on reddit or make a full blog post or something.

I'll probably do a blog on GitHub and it can be linked on reddit.

> Ultimately, all the smart suggestions in here should probably be aggregated. More benchmarks and more blog articles always help discoverability.

Definitely. Julia has a very nice performance optimization section that makes things easy to start with (https://docs.julialang.org/en/v1/manual/performance-tips/index.html); it helps a lot to start getting your code speedy before you ask the community for help.

May 06, 2020
06.05.2020 11:18, data pulverizer wrote:
> 
> CPU usage now revs up and almost has time to touch 100% before the process finishes! Interestingly, using `--boundscheck=off` without `--ffast-math` gives a timing of around 4 seconds, whereas using `--ffast-math` without `--boundscheck=off` made no difference; having both gives the 1.5 seconds. As Jacob Carlborg suggested I tried adding `-mcpu=native -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto` but I didn't see any difference.
> 
> The current Julia time is still around 35 seconds, even when using `@inbounds @simd` and running `julia -O3 --check-bounds=no`. I'll probably need to run the code by the Julia community to see whether it can be further optimized, but it's pretty interesting to see D so far in front. Interestingly, when I attempt to switch off the garbage collector in Julia, the process gets killed because my computer runs out of memory (I have over 26 GB of memory free), whereas in D the memory I'm using barely registers (max 300 MB) - it uses even less than Chapel (max 500 MB), which doesn't use much at all. It's exactly the same computation; D's and Julia's timings were similar before the GC optimization and compiler flag magic in D.

What is the current D time? It would be really nice if you made a summary of your research.
May 06, 2020
On Wednesday, 6 May 2020 at 08:28:41 UTC, drug wrote:
> What is the current D time? ...

Current Times:

D:      ~ 1.5 seconds
Chapel: ~ 9 seconds
Julia:  ~ 35 seconds

> It would be really nice if you made a summary of your research.

Yes, I'll do a blog or something on GitHub and link it.

Thanks for all your help.
May 06, 2020
On 5/6/20 2:49 AM, drug wrote:
> 06.05.2020 09:24, data pulverizer wrote:
>> On Wednesday, 6 May 2020 at 05:44:47 UTC, drug wrote:
>>>
>>> `proc` is already a delegate, so `&proc` is a pointer to the delegate; just pass `proc` itself
>>
>> Thanks, done that, but now I'm getting a range violation on `z` which was not there before.
>>
>> ```
>> core.exception.RangeError@onlineapp.d(3): Range violation
>> ----------------
>> ??:? _d_arrayboundsp [0x55de2d83a6b5]
>> onlineapp.d:3 void onlineapp.process(double, double, long, shared(double[])) [0x55de2d8234fd]
>> onlineapp.d:16 void onlineapp.main().__lambda1() [0x55de2d823658]
>> ??:? void core.thread.osthread.Thread.run() [0x55de2d83bdf9]
>> ??:? thread_entryPoint [0x55de2d85303d]
>> ??:? [0x7fc1d6088668]
>> ```
>>
> 
> Confirmed. I think that's because the `proc` delegate captures the `i` variable of the `for` loop. I managed to get rid of the range violation by using `foreach`:
> ```
> foreach(i; 0..n) // instead of for(long i = 0; i < n;)
> ```
> I guess that the `proc` delegate can't capture the `i` variable of the `foreach` loop, so the range violation doesn't happen.

`foreach` over a range of integers is lowered to an equivalent `for` loop, so that was not the problem.

Indeed, D does not capture individual for loop contexts, only the context of the entire function.
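
A small self-contained illustration of this, using nothing beyond what's described above: every delegate created in the loop refers to the same `i` in the function's frame, so they all observe its final value.

```
import std.stdio : writeln;

void main()
{
    int delegate()[] dgs;
    foreach (i; 0 .. 3)
    {
        int delegate() dg = () => i; // refers to the single `i` in main's frame
        dgs ~= dg;
    }

    foreach (dg; dgs)
        writeln(dg()); // prints the final value of `i` three times, not 0, 1, 2
}
```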

> 
> You use the `proc` delegate to pass arguments to the `process` function. For this purpose I would recommend deriving a class from `Thread`. Then you can pass the arguments in the constructor of the derived class, like:
> ```
> foreach(long i; 0..n)
>      new DerivedThread(cast(double)(i), cast(double)(i + 1), i, z).start();
> thread_joinAll();
> ```

This is why it works: you are capturing the value manually while inside the loop itself.
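
For completeness, here's a rough, self-contained sketch of that derived-thread pattern, with a placeholder `process` body and a simplified `shared(double)[]` signature rather than the exact one from the thread:

```
import core.thread : Thread, thread_joinAll;
import std.stdio : writeln;

// Placeholder for the thread's `process` function, with a simplified signature.
void process(double x, double y, long i, shared(double)[] z)
{
    z[cast(size_t) i] = x + y; // each thread writes its own slot, so no locking needed here
}

class DerivedThread : Thread
{
    private double x, y;
    private long i;
    private shared(double)[] z;

    this(double x, double y, long i, shared(double)[] z)
    {
        this.x = x; this.y = y; this.i = i; this.z = z;
        super(&run); // the arguments live in this object, not in the loop's frame
    }

    private void run()
    {
        process(x, y, i, z);
    }
}

void main()
{
    enum n = 4;
    auto z = new shared(double)[](n);
    foreach (long i; 0 .. n)
        new DerivedThread(cast(double) i, cast(double)(i + 1), i, z).start();
    thread_joinAll();

    // All threads have finished, so it is safe to view the data as unshared.
    writeln(cast(double[]) z);
}
```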

Another way to do this is to create a new capture context:

foreach(long i; 0 .. n)
{
    auto proc = (val => {
        process(cast(double)(val), cast(double)(val + 1), val, z);
    })(i);
    ...
}

-Steve
May 06, 2020
On 2020-05-06 12:23, data pulverizer wrote:

> Yes, I'll do a blog or something on GitHub and link it.

It would be nice if you could get it published on the Dlang blog [1]. One usually gets paid for that. Contact Mike Parker.

[1] https://blog.dlang.org

-- 
/Jacob Carlborg