December 04, 2020
On Friday, 4 December 2020 at 14:48:32 UTC, jmh530 wrote:
>
> It looks like all the `sweep_XXX` functions are only defined for contiguous slices, as that would be the default if define a Slice!(T, N).
>
> How the functions access the data is a big difference. If you compare the `sweep_field` version with the `sweep_naive` version, the `sweep_field` function is able to access through one index, whereas the `sweep_naive` function has to use two in the 2d version and 3 in the 3d version.
>
> Also, the main difference in the NDSlice version is that it uses *built-in* MIR functionality, like how `sweep_ndslice` uses the `each` function from MIR, whereas `sweep_field` uses a for loop. I think this is partially to show that the built-in MIR functionality is as fast as if you tried to do it with a for loop yourself.

I see, looking at some of the code, field case is literally doing the indexing calculation right there. I guess ndslice is doing the same thing just with "Mir magic" an in the background? Still, ndslice is able to get a consistent higher rate of flops than the field case - interesting. One thing I discovered about these kinds of plots is that introducing log scale or two particularly for timed comparisons can make the differences between different methods that look close clearer. A log plot might show some consistent difference between the timings of ndslice and the field case. Underneath they should be doing essentially the same thing so teasing out what is causing the difference would be interesting. Is Mir doing some more efficient form of the indexing calculation than naked field calculations?

I'm still not sure why slice is so slow. Doesn't that completely rely on the opSlice implementations? The choice of indexing method and underlying data structure? Isn't it just a symbolic interface that you write whatever you want?

December 04, 2020
On Friday, 4 December 2020 at 20:26:17 UTC, data pulverizer wrote:
> [snip]
>
> I see, looking at some of the code, field case is literally doing the indexing calculation right there. I guess ndslice is doing the same thing just with "Mir magic" an in the background? Still, ndslice is able to get a consistent higher rate of flops than the field case - interesting. One thing I discovered about these kinds of plots is that introducing log scale or two particularly for timed comparisons can make the differences between different methods that look close clearer. A log plot might show some consistent difference between the timings of ndslice and the field case. Underneath they should be doing essentially the same thing so teasing out what is causing the difference would be interesting. Is Mir doing some more efficient form of the indexing calculation than naked field calculations?
>
> I'm still not sure why slice is so slow. Doesn't that completely rely on the opSlice implementations? The choice of indexing method and underlying data structure? Isn't it just a symbolic interface that you write whatever you want?

Ilya might have a better ability to answer that than me.
December 05, 2020
On Thursday, 3 December 2020 at 16:50:39 UTC, Andre Pany wrote:
> On Thursday, 3 December 2020 at 16:27:59 UTC, 9il wrote:
>> Hi all,
>>
>> Since the first announcement [0] the original benchmark [1] has been boosted [2] with Mir-like implementations.
>>
>> D+Mir:
>>  1. is more abstract than NumPy
>>  2. requires less code for multidimensional algorithms
>>  3. doesn't require indexing
>>  4. uses recursion across dimensions
>>  5. a few times faster than NumPy for non-trivial real-world applications.
>>
>> Why Mir is faster than NumPy?
>>
>> 1. Mir allows the compiler to generate specialized kernels while NumPy constraints a user to write code that needs to access memory twice or more times.
>>
>> Another Mir killer feature is the ability to write generalized N-dimensional implementations, while Numpy code needs to have separate implementations for 1D, 2D, and 3D cases. For example, the main D loop in the benchmark can compile for 4D, 5D, and higher dimensional optimizations.
>>
>> 2. @nogc iteration loop. @nogc helps when you need to control what is going on with your memory allocations in the critical code part.
>>
>> [0] https://forum.dlang.org/post/pemharpztorlqkxdooul@forum.dlang.org
>> [1] https://github.com/typohnebild/numpy-vs-mir
>> [2] https://github.com/typohnebild/numpy-vs-mir/pull/1
>>
>> The benchmark [1] has been created by Christoph Alt and Tobias Schmidt.
>>
>> Kind regards,
>> Ilya
>
> Hi Ilya,
>
> Thanks a lot for sharing the update. I am currently working on porting a python package called FMPY to D. This package makes usage of numpy and I hope I can use MIR here.

Probably you may want to express FMI entities as Algebraic types rather than classes. mir.algebraic can be really helpful here

http://mir-core.libmir.org/mir_algebraic.html

> Somehow it is hard to get started to learn MIR. What maybe could help python developers is to have some articles showing numpy coding and side by side the equivalent MIR coding.

It is hard for me to write articles. I will try to write a small one this year, but it would be Mir only. Maybe this benchmark can be used as an example and if one wishes to write a side-by-side comparison with NumPy I would be happy to comment and explain the D implementation and what it is doing internally.

> What I miss in MIR is a function to read and write CSV files. Is s.th. like numpy.genfromtxt planned?

Unlikely I would add it but can do a code review.

Currently, we can load/safe NumPy binary data with numir

https://libmir.github.io/numir/io.html

Kind regards,
Ilya
December 05, 2020
On Thursday, 3 December 2020 at 17:08:58 UTC, jmh530 wrote:
> On Thursday, 3 December 2020 at 16:27:59 UTC, 9il wrote:
> Looks good, but a few typos:

Thanks!
December 05, 2020
On Friday, 4 December 2020 at 02:35:49 UTC, data pulverizer wrote:
> On Thursday, 3 December 2020 at 21:28:04 UTC, jmh530 wrote:
>
> Am I correct in assuming that the data in the NDSlice is also a single array?

sweep_ndslice uses (2*N - 1) arrays to index U, this allows LDC to unroll the loop.

For example, for 2D case, withNeighboursSum [2] will store the pointer to the result, and the pointer at rows above and below.

matrix:
--------------
------a------- above iterator
------r------- the result
------b------- below iterator
--------------

Also, for AVX-512 targets it allows vectorizing the loop [1]. The benchmark has been run on the AVX2 CPU.

[1] https://github.com/typohnebild/numpy-vs-mir/issues/4
[2] http://mir-algorithm.libmir.org/mir_ndslice_topology.html#.withNeighboursSum
December 05, 2020
On Friday, 4 December 2020 at 03:48:15 UTC, Walter Bright wrote:
> On 12/3/2020 8:27 AM, 9il wrote:
>> Since the first announcement [0] the original benchmark [1] has been boosted [2] with Mir-like implementations.
>
> This is really great! Can you write an article about it? Such would be really helpful in letting people know about it.

Thanks! The README is really great as the benchmark description. I will do a small article about Mir this year.
December 05, 2020
On Friday, 4 December 2020 at 20:26:17 UTC, data pulverizer wrote:
> On Friday, 4 December 2020 at 14:48:32 UTC, jmh530 wrote:
>>
>> It looks like all the `sweep_XXX` functions are only defined for contiguous slices, as that would be the default if define a Slice!(T, N).
>>
>> How the functions access the data is a big difference. If you compare the `sweep_field` version with the `sweep_naive` version, the `sweep_field` function is able to access through one index, whereas the `sweep_naive` function has to use two in the 2d version and 3 in the 3d version.
>>
>> Also, the main difference in the NDSlice version is that it uses *built-in* MIR functionality, like how `sweep_ndslice` uses the `each` function from MIR, whereas `sweep_field` uses a for loop. I think this is partially to show that the built-in MIR functionality is as fast as if you tried to do it with a for loop yourself.
>
> I see, looking at some of the code, field case is literally doing the indexing calculation right there. I guess ndslice is doing the same thing just with "Mir magic" an in the background?

sweep_ndslice uses (2*N - 1) arrays to index U, this allows LDC to unroll the loop.

More details here
https://forum.dlang.org/post/qejwviqovawnuniuagtd@forum.dlang.org

> I'm still not sure why slice is so slow. Doesn't that completely rely on the opSlice implementations? The choice of indexing method and underlying data structure?

sweep_slice is slower because it iterates data in few loops rather than in a single one. For small matrices this makes JMP/FLOP ratio higher, for large matrices that can't feet into the CPU cache, it is less memory efficient.


December 05, 2020
On Saturday, 5 December 2020 at 07:04:59 UTC, 9il wrote:
> On Thursday, 3 December 2020 at 16:50:39 UTC, Andre Pany wrote:
>> On Thursday, 3 December 2020 at 16:27:59 UTC, 9il wrote:
>>> [...]
>>
>> Hi Ilya,
>>
>> Thanks a lot for sharing the update. I am currently working on porting a python package called FMPY to D. This package makes usage of numpy and I hope I can use MIR here.
>
> Probably you may want to express FMI entities as Algebraic types rather than classes. mir.algebraic can be really helpful here
>
> http://mir-core.libmir.org/mir_algebraic.html
>
>> Somehow it is hard to get started to learn MIR. What maybe could help python developers is to have some articles showing numpy coding and side by side the equivalent MIR coding.
>
> It is hard for me to write articles. I will try to write a small one this year, but it would be Mir only. Maybe this benchmark can be used as an example and if one wishes to write a side-by-side comparison with NumPy I would be happy to comment and explain the D implementation and what it is doing internally.
>
>> What I miss in MIR is a function to read and write CSV files. Is s.th. like numpy.genfromtxt planned?
>
> Unlikely I would add it but can do a code review.
>
> Currently, we can load/safe NumPy binary data with numir
>
> https://libmir.github.io/numir/io.html
>
> Kind regards,
> Ilya

Thanks a lot for the tipps. I will work through the documentation of mir_algebraic. Currently I follow the strategy to have the D coding as similiar as possible to the python coding. Fmpy is in active development and I want to backport changes easily.

I totally missed the fact that MIR can load / safe numpy binary data. This is really great.

Kind regards
Andre
December 06, 2020
On Saturday, 5 December 2020 at 07:44:33 UTC, 9il wrote:
>
> sweep_ndslice uses (2*N - 1) arrays to index U, this allows LDC to unroll the loop.
>
> For example, for 2D case, withNeighboursSum [2] will store the pointer to the result, and the pointer at rows above and below.
>
> matrix:
> --------------
> ------a------- above iterator
> ------r------- the result
> ------b------- below iterator
> --------------
>
> Also, for AVX-512 targets it allows vectorizing the loop [1]. The benchmark has been run on the AVX2 CPU.
>
> [1] https://github.com/typohnebild/numpy-vs-mir/issues/4
> [2] http://mir-algorithm.libmir.org/mir_ndslice_topology.html#.withNeighboursSum

Very interesting, thank you for the explanations. Are there journal/book other implementation references for these approaches to implementing tensor-like multidimensional arrays? Tensor-like multidimensional arrays data structures is one of the worst covered in research/conventional literature compared to almost anything else, which can be rather frustrating.

December 07, 2020
On Sunday, 6 December 2020 at 17:30:13 UTC, data pulverizer wrote:
> On Saturday, 5 December 2020 at 07:44:33 UTC, 9il wrote:
>>
>> sweep_ndslice uses (2*N - 1) arrays to index U, this allows LDC to unroll the loop.
>>
>> For example, for 2D case, withNeighboursSum [2] will store the pointer to the result, and the pointer at rows above and below.
>>
>> matrix:
>> --------------
>> ------a------- above iterator
>> ------r------- the result
>> ------b------- below iterator
>> --------------
>>
>> Also, for AVX-512 targets it allows vectorizing the loop [1]. The benchmark has been run on the AVX2 CPU.
>>
>> [1] https://github.com/typohnebild/numpy-vs-mir/issues/4
>> [2] http://mir-algorithm.libmir.org/mir_ndslice_topology.html#.withNeighboursSum
>
> Very interesting, thank you for the explanations. Are there journal/book other implementation references for these approaches to implementing tensor-like multidimensional arrays?

I don't know. Tensors aren't so complex. The complex part is a design that allows Mir to construct and iterate various kinds of lazy tensors of any complexity and have quite a universal API, and all of these are boosted by the fact that the user-provided kernel(lambda) function is optimized by the compiler without the overhead.