May 06, 2020
On 06.05.2020 16:57, Steven Schveighoffer wrote:
>> ```
>> foreach(i; 0..n) // instead of for(long i = 0; i < n;)
>> ```
>> I guess the `proc` delegate can't capture the `i` variable of the `foreach` loop, so the range violation doesn't happen.
> 
> foreach over a range of integers is lowered to an equivalent for loop, so that was not the problem.

I was surprised, but the `foreach` version does not have the range violation, so there is a difference between `foreach` and `for` loops. I did not try DerivedThread at all; I only suggested it to avoid variable capture. I just changed `for` to `foreach` and the range violation was gone. Probably this is an implementation detail.

May 06, 2020
On 06.05.2020 13:23, data pulverizer wrote:
> On Wednesday, 6 May 2020 at 08:28:41 UTC, drug wrote:
>> What is current D time? ...
> 
> Current Times:
> 
> D:      ~ 1.5 seconds
> Chapel: ~ 9 seconds
> Julia:  ~ 35 seconds
> 

Oh, I'm impressed. I thought the D time had been decreased by 1.5 seconds, but it actually is 1.5 seconds!

>> That would be really nice if you make the resume of your research.
> 
> Yes, I'll do a blog or something on GitHub and link it.
> 
> Thanks for all your help.

You're welcome! Helping others helps me too.
May 06, 2020
On 5/6/20 2:29 PM, drug wrote:
> On 06.05.2020 16:57, Steven Schveighoffer wrote:
>>> ```
>>> foreach(i; 0..n) // instead of for(long i = 0; i < n;)
>>> ```
>>> I guess the `proc` delegate can't capture the `i` variable of the `foreach` loop, so the range violation doesn't happen.
>>
>> foreach over a range of integers is lowered to an equivalent for loop, so that was not the problem.
> 
> I was surprised, but the `foreach` version does not have the range violation, so there is a difference between `foreach` and `for` loops. I did not try DerivedThread at all; I only suggested it to avoid variable capture. I just changed `for` to `foreach` and the range violation was gone. Probably this is an implementation detail.
> 

Ah yes, because foreach(i; 0 .. n) actually uses a hidden variable to iterate, and assigns it to i each time through the loop. It used to just use i for iteration, but then you could play tricks by adjusting i.

So the equivalent for loop would be:

```
for(int _i = 0; _i < n; ++_i)
{
   auto i = _i; // this won't be executed after _i is out of range
   ... // foreach body
}
```

So the problem would not be a range error, but just random i's coming through to the various threads ;)
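A minimal sketch of that consequence (my own illustration, not code from this thread): every delegate created in the loop shares the single captured `i`, but under `foreach` that `i` only ever holds in-range values.

```d
import core.thread : Thread;

// Run n threads created inside a foreach loop and record the value of `i`
// each thread observes through the captured variable.
int[] observedIndices(int n)
{
    int[] seen;
    Thread[] threads;

    foreach (i; 0 .. n) // lowered to: for (int _i; _i < n; ++_i) { auto i = _i; ... }
    {
        threads ~= new Thread({
            // All the delegates share the one `i` in this frame, so a thread
            // may observe a later iteration's value -- but never n itself,
            // hence no RangeError, unlike the plain for-loop version.
            synchronized seen ~= i;
        });
    }
    foreach (t; threads) t.start();
    foreach (t; threads) t.join();
    return seen;
}
```

The observed values may be duplicated or skewed toward later iterations, but they all stay inside `[0, n)`.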

Very interesting!

-Steve
May 06, 2020
On Wednesday, 6 May 2020 at 17:31:39 UTC, Jacob Carlborg wrote:
> On 2020-05-06 12:23, data pulverizer wrote:
>
>> Yes, I'll do a blog or something on GitHub and link it.
>
> It would be nice if you could get it published on the Dlang blog [1]. One usually gets paid for that. Contact Mike Parker.
>
> [1] https://blog.dlang.org

I'm definitely open to publishing it on the Dlang blog, and getting paid would be nice. I've just done a full reconciliation of the output from D and Chapel with Julia's output: they're all the same. In the calculation I used 32-bit floats to minimise memory consumption, and I was working with the 10,000-image MNIST data set (t10k-images-idx3-ubyte.gz, http://yann.lecun.com/exdb/mnist/) rather than randomly generated data.

The -O3 to -O5 optimization levels on the LDC compiler are instrumental in bringing the times down; going with -O2-based optimization, even with the other flags, gives us ~ 13 seconds for the 10,000-image dataset rather than the very nice 1.5 seconds.
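For reference, invocations along these lines reproduce the two configurations (the file name is hypothetical; `-release`, `-boundscheck=off`, and `-mcpu=native` are flags commonly added on top of the `-O` level, not necessarily the exact set used here):

```shell
# moderate optimization: the ~13 s configuration
ldc2 -O2 -release -mcpu=native kernel.d

# aggressive optimization: the ~1.5 s configuration
ldc2 -O3 -release -boundscheck=off -mcpu=native kernel.d
```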

As an idea of how kernel matrix computations scale: the file "train-images-idx3-ubyte.gz" contains 60,000 images. Julia performs the kernel matrix calculation in 1340 seconds while D performs it in 163 seconds, which is not really in line with the first timing; the matrix has 6 times as many rows and columns, so 36 times the work, and I'd expect around 1.5 * 36 = 54 seconds. Chapel performs it in 357 seconds, approximately in line with the original. The new kernel matrix consumes about 14 GB of memory, which is why I chose to use 32-bit floats: to give myself the opportunity to do the calculation on my laptop, which currently has 31 GB of RAM.
May 07, 2020
On Wednesday, 6 May 2020 at 23:10:05 UTC, data pulverizer wrote:
> The -O3 to -O5 optimization levels on the LDC compiler are instrumental in bringing the times down; going with -O2-based optimization, even with the other flags, gives us ~ 13 seconds for the 10,000-image dataset rather than the very nice 1.5 seconds.

What is the difference between -O2 and -O3 ldc2 compiler optimizations?


May 07, 2020
On Wednesday, 6 May 2020 at 10:23:17 UTC, data pulverizer wrote:
> D:      ~ 1.5 seconds

This is going to sound absurd but can we do even better? If none of the optimizations we have so far is using simd maybe we can get even better performance by using it. I think I need to go and read a simd primer.
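For what it's worth, D exposes explicit SIMD through `core.simd` (and LDC auto-vectorizes at -O3, so explicit vectors only help where the optimizer misses the pattern). A minimal sketch of a 4-wide dot product, assuming x86_64 and 16-byte-aligned inputs for the cast loads:

```d
import core.simd : float4;

// Dot product using 4-wide float SIMD with a scalar tail.
// Assumes a.length == b.length and 16-byte-aligned data for the pointer
// casts; otherwise an unaligned load helper is needed.
float dot(const float[] a, const float[] b)
{
    float4 acc = 0;
    size_t i = 0;
    for (; i + 4 <= a.length; i += 4)
    {
        auto va = *cast(const(float4)*) &a[i];
        auto vb = *cast(const(float4)*) &b[i];
        acc += va * vb; // 4 multiplies and 4 adds per iteration
    }
    float s = acc.array[0] + acc.array[1] + acc.array[2] + acc.array[3];
    for (; i < a.length; ++i)
        s += a[i] * b[i]; // scalar tail for lengths not divisible by 4
    return s;
}
```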


May 07, 2020
On Thursday, 7 May 2020 at 02:06:32 UTC, data pulverizer wrote:
> On Wednesday, 6 May 2020 at 10:23:17 UTC, data pulverizer wrote:
>> D:      ~ 1.5 seconds
>
> This is going to sound absurd but can we do even better? If none of the optimizations we have so far is using simd maybe we can get even better performance by using it. I think I need to go and read a simd primer.

After I ran the Julia code past the Julia community, they made some changes (using views rather than passing copies of the array) and their time has come down to ~ 2.5 seconds. The plot thickens.
May 07, 2020
On 07.05.2020 17:49, data pulverizer wrote:
> On Thursday, 7 May 2020 at 02:06:32 UTC, data pulverizer wrote:
>> On Wednesday, 6 May 2020 at 10:23:17 UTC, data pulverizer wrote:
>>> D:      ~ 1.5 seconds
>>
>> This is going to sound absurd but can we do even better? If none of the optimizations we have so far is using simd maybe we can get even better performance by using it. I think I need to go and read a simd primer.
> 
> After running the Julia code by the Julia community they made some changes (using views rather than passing copies of the array) and their time has come down to ~ 2.5 seconds. The plot thickens.

That's a good sign, because I was afraid that the super short D time was the result of a wrong benchmark (too good to be true). I'm glad the D time really is both great and real. Your blog post will definitely be very interesting.
May 07, 2020
On Thursday, 7 May 2020 at 15:41:12 UTC, drug wrote:
> On 07.05.2020 17:49, data pulverizer wrote:
>> On Thursday, 7 May 2020 at 02:06:32 UTC, data pulverizer wrote:
>>> On Wednesday, 6 May 2020 at 10:23:17 UTC, data pulverizer wrote:
>>>> D:      ~ 1.5 seconds
>> 
>> After running the Julia code by the Julia community they made some changes (using views rather than passing copies of the array) and their time has come down to ~ 2.5 seconds. The plot thickens.
>
> That's a good sign, because I was afraid that the super short D time was the result of a wrong benchmark (too good to be true). I'm glad the D time really is both great and real. Your blog post will definitely be very interesting.

Don't worry, the full code will be released so that it can be inspected by anyone interested before the blog is published. I'm now working on a version in Nim for even more comparison. Can't wait to find out how everything compares.
May 08, 2020
On Thursday, 7 May 2020 at 14:49:43 UTC, data pulverizer wrote:
> After running the Julia code by the Julia community they made some changes (using views rather than passing copies of the array) and their time has come down to ~ 2.5 seconds. The plot thickens.

I've run the Chapel code past the Chapel programming language people and they've brought the time down to ~ 6.5 seconds. I've disallowed calling BLAS because I'm looking at the performance of the programming language implementations rather than their ability to call other libraries.

So far the times are looking like this:

D:      ~ 1.5 seconds
Julia:  ~ 2.5 seconds
Chapel: ~ 6.5 seconds

I've been working on the Nim benchmark and have written a little set of byte-order functions for big-to-little-endian conversion (https://gist.github.com/dataPulverizer/744fadf8924ae96135fc600ac86c7060), which was fun; it has the ntoh, hton, and so forth functions that can be applied to any basic type. Now I'm writing a little matrix type in the same vein as the D matrix type I wrote, and then comes the easy bit: writing the kernel matrix algorithm itself.
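In D the equivalent big-to-little-endian plumbing is already in `std.bitmanip`. A sketch of reading one of the big-endian 32-bit header fields of an idx file (the helper name is mine):

```d
import std.bitmanip : bigEndianToNative;

// Read a 32-bit big-endian unsigned integer from a byte buffer, as found
// in the headers of the MNIST idx files.
uint readBigEndianUint(const(ubyte)[] buf, size_t offset)
{
    ubyte[4] raw = buf[offset .. offset + 4]; // copy into a static array
    return bigEndianToNative!uint(raw);       // byte-swap on little-endian hosts
}
```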

In the end I'll run the benchmark on data of various sizes. Currently I'm just running it on the (10,000 x 784) data set, which outputs a (10,000 x 10,000) matrix. I'll end up running (5,000 x 784), (10,000 x 784), (20,000 x 784), (30,000 x 784), (40,000 x 784), (50,000 x 784), and (60,000 x 784). Ideally I'd measure each one 100 times and plot confidence intervals, but I'll have to settle for measuring each one 3 times and taking an average, otherwise it will take too much time. I don't think that D will have it all its own way for all the data sizes; from what I can see, Julia may do better at the largest data set, and maybe SIMD will be a factor there.
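The "three runs, take the mean" protocol can be sketched with `std.datetime.stopwatch` (the function name is mine):

```d
import std.datetime.stopwatch : StopWatch, AutoStart;

// Time `run` a few times and return the mean wall-clock time in seconds.
double meanSeconds(void delegate() run, int reps = 3)
{
    double total = 0;
    foreach (_; 0 .. reps)
    {
        auto sw = StopWatch(AutoStart.yes); // starts timing immediately
        run();
        sw.stop();
        total += sw.peek.total!"usecs" / 1e6; // elapsed time in seconds
    }
    return total / reps;
}
```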

The data set sizes are not randomly chosen. In many common data science tasks, maybe > 90% of what data scientists currently work on, people work with data sets in this range or even smaller; the big data stuff is much less common unless you're working for Google (the FANGs) or a specialist startup. I remember running a kernel cluster in often-used "data science" languages (none of which I'm benchmarking here): it wasn't done after an hour, then hung and crashed, while something I implemented in Julia was done in a minute. Calculating kernel matrices is the cornerstone of many kernel-based machine learning methods: kernel PCA, kernel clustering, SVMs, and so on. It's a pretty important thing to calculate and shows the potential of these languages in the data science field. I think an article like this is valid for people who implement numerical libraries. I'm also hoping to throw in C++ by way of comparison.
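For readers wondering what is actually being computed, here is a hedged sketch (not the benchmark's actual code) of a Gaussian kernel matrix over n samples of dimension d, with rows parallelized via `std.parallelism`:

```d
import std.math : exp;
import std.parallelism : parallel;
import std.range : iota;

// K[i, j] = exp(-gamma * ||x_i - x_j||^2), stored row-major in an n*n array.
// `data` holds n samples of dimension d, row-major. Rows are independent,
// so the outer loop parallelizes trivially across the task pool.
float[] kernelMatrix(const(float)[] data, size_t n, size_t d, float gamma)
{
    auto K = new float[n * n];
    foreach (i; parallel(iota(n)))
    {
        foreach (j; i .. n) // only the upper triangle is computed
        {
            float s = 0;
            foreach (k; 0 .. d)
            {
                immutable diff = data[i*d + k] - data[j*d + k];
                s += diff * diff; // squared Euclidean distance
            }
            immutable v = exp(-gamma * s);
            K[i*n + j] = v;
            K[j*n + i] = v; // the matrix is symmetric
        }
    }
    return K;
}
```

For the MNIST case above, n is 10,000 and d is 784, which is why the output matrix alone dominates memory use at the larger sizes.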