Array fill performance differences between for, foreach, slice
March 31, 2020
I've observed large differences in timing performance while filling arrays using different methods (for vs foreach vs arr[] = x) and don't know why. I've looked at array.d (https://github.com/dlang/dmd/blob/9792735c82ac997d11d7fe6c3d6c604389b3f5bd/src/dmd/root/array.d) but I'm still none the wiser.

Here is an example:

```
/* fill.d */

import std.stdio: writeln;
import std.typecons: Tuple, tuple;
import std.algorithm.iteration: mean;
import std.datetime.stopwatch: AutoStart, StopWatch;


/* Benchmarking Function */
auto bench(alias fun, string units = "msecs",
          ulong minN = 10, bool doPrint = false)(ulong n, string msg = "")
{
  auto times = new double[n];
  auto sw = StopWatch(AutoStart.no);
  for(ulong i = 0; i < n; ++i)
  {
    sw.start();
    fun();
    sw.stop();
    times[i] = cast(double)sw.peek.total!units;
    sw.reset();
  }
  double ave = mean(times);
  double sd = 0;

  // sample standard deviation; only meaningful with enough runs
  if(n >= minN)
  {
    for(ulong i = 0; i < n; ++i)
      sd += (times[i] - ave)^^2;
    sd /= (n - 1);
    sd ^^= 0.5;
  }
  else
  {
    sd = double.nan;
  }

  static if(doPrint)
    writeln(msg ~ "Mean time("~ units ~ "): ", ave, ", Standard Deviation: ", sd);

  return tuple!("mean", "sd")(ave, sd);
}

/* Fill Functions */
auto fill_for(alias x, ulong n)()
{
  alias T = typeof(x);
  auto arr = new T[n];

  for(ulong i = 0; i < n; ++i)
    arr[i] = x;

  return arr;
}

auto fill_foreach(alias x, ulong n)()
{
  alias T = typeof(x);
  auto arr = new T[n];

  foreach(ref el; arr)
    el = x;

  return arr;
}

auto fill_slice(alias x, ulong n)()
{
  alias T = typeof(x);
  auto arr = new T[n];

  arr[] = x;

  return arr;
}


void main()
{
  double x = 42;

  bench!(fill_slice!(x, 100_000), "usecs", 10, true)(100, "Slice: ");
  bench!(fill_foreach!(x, 100_000), "usecs", 10, true)(100, "Foreach: ");
  bench!(fill_for!(x, 100_000), "usecs", 10, true)(100, "For: ");
}

/*
$ dmd fill.d && ./fill
Slice: Mean time(usecs): 87.38, Standard Deviation: 54.1542
Foreach: Mean time(usecs): 179.9, Standard Deviation: 41.4109
For: Mean time(usecs): 245.81, Standard Deviation: 53.0798

$ dmd --version
DMD64 D Compiler v2.090.1
...
*/
```

It would be great to know why these approaches differ so much in performance, and to see where `opSliceAssign` (or the equivalent method) for D's native arrays is implemented. Playing with `-boundscheck` made no difference to the relative performance. Thanks.



March 31, 2020
On 3/31/20 5:30 PM, data pulverizer wrote:
> I've observed large differences in timing performance while filling arrays using different methods (for vs foreach vs arr[] = x) and don't know why. I've looked at array.d (https://github.com/dlang/dmd/blob/9792735c82ac997d11d7fe6c3d6c604389b3f5bd/src/dmd/root/array.d) but I'm still none the wiser.

for: you are indexing an array on each step, incurring a bounds check, and also doing a pointer calculation on each access.

foreach: Better, iterates a pointer over the whole array, so it's going to be faster.

slice assign: Best, this uses whatever tricks are possible in the given architecture. Most likely memset, or memcpy. These are insanely optimized functions, which is why they perform so well.
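
To make that concrete, here's a rough sketch (not the compiler's actual lowering) of the difference between the indexed and pointer-based loops:

```
// Sketch of indexed vs pointer-based filling. The indexed loop pays a
// bounds check plus an address computation (arr.ptr + i * T.sizeof) per
// element; the pointer loop just increments and stores.
void fillByIndex(double[] arr, double x)
{
    for (size_t i = 0; i < arr.length; ++i)
        arr[i] = x;            // bounds check on every access
}

void fillByPointer(double[] arr, double x) @system
{
    auto p = arr.ptr;
    const end = p + arr.length;
    for (; p != end; ++p)      // roughly what foreach(ref el; arr) becomes
        *p = x;
}
```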

-Steve
April 01, 2020
On 2020-03-31 23:30, data pulverizer wrote:

> $ dmd fill.d && ./fill

You have not enabled optimizations. You should compile with `-O -release -inline` to enable all optimizations.
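
For example:

```
$ dmd -O -release -inline fill.d && ./fill
```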

Without optimizations I get numbers like these:

Slice: Mean time(usecs): 92.91, Standard Deviation: 49.8002
Foreach: Mean time(usecs): 183.95, Standard Deviation: 30.749
For: Mean time(usecs): 239.74, Standard Deviation: 30.753

With optimizations turned on I get quite varying results:

Slice: Mean time(usecs): 96.84, Standard Deviation: 52.5676
Foreach: Mean time(usecs): 95.84, Standard Deviation: 21.9244
For: Mean time(usecs): 106.73, Standard Deviation: 31.2545

Slice: Mean time(usecs): 192.03, Standard Deviation: 115.971
Foreach: Mean time(usecs): 177.06, Standard Deviation: 128.012
For: Mean time(usecs): 89.14, Standard Deviation: 25.681

Slice: Mean time(usecs): 97.74, Standard Deviation: 53.2058
Foreach: Mean time(usecs): 79.72, Standard Deviation: 11.0088
For: Mean time(usecs): 80.54, Standard Deviation: 12.06

If you care about performance you should really compile with LDC (using `-O5 -release -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto`), which usually produces much better code:

Slice: Mean time(usecs): 50.58, Standard Deviation: 46.2621
Foreach: Mean time(usecs): 53.92, Standard Deviation: 19.8039
For: Mean time(usecs): 39.89, Standard Deviation: 7.80041

Slice: Mean time(usecs): 76.62, Standard Deviation: 73.0014
Foreach: Mean time(usecs): 49.63, Standard Deviation: 24.5672
For: Mean time(usecs): 40.02, Standard Deviation: 7.67388
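
For reference, the full command line (assuming the compiler binary is named `ldc2`) is:

```
$ ldc2 -O5 -release -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto fill.d && ./fill
```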

-- 
/Jacob Carlborg
April 01, 2020
On Wednesday, 1 April 2020 at 06:48:09 UTC, Jacob Carlborg wrote:
> You have not enabled optimizations. You should compile with `-O -release -inline` to enable all optimizations.

-release should *never* be used. You're trading away memory safety (and opening security holes) for a few microseconds of execution time. Not worth the risk.

For realistic code, try to make it fast without that switch; there are plenty of other ways, like isolated @trusted sections where needed. If asserts cost you significant time, you can turn them off with `-check=assert=off`, leaving them out of the compile without affecting the other safety features.
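
For example, a build that keeps the safety features but drops asserts might look like:

```
$ dmd -O -inline -check=assert=off fill.d && ./fill
```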

> If you care about performance you should really compile with LDC (using `-O5 -release -flto=full

Again, forget the -release switch. The rest is cool, though.
April 01, 2020
On Wednesday, 1 April 2020 at 06:48:09 UTC, Jacob Carlborg wrote:
> If you care about performance you should really compile with LDC (using `-O5 -release -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto`), which usually produces much better code:
>
> Slice: Mean time(usecs): 50.58, Standard Deviation: 46.2621
> Foreach: Mean time(usecs): 53.92, Standard Deviation: 19.8039
> For: Mean time(usecs): 39.89, Standard Deviation: 7.80041
>
> Slice: Mean time(usecs): 76.62, Standard Deviation: 73.0014
> Foreach: Mean time(usecs): 49.63, Standard Deviation: 24.5672
> For: Mean time(usecs): 40.02, Standard Deviation: 7.67388

The benchmark is highly flawed; the results for LDC are completely meaningless. LDC produces identical code for every case: since the filled arrays are never used, the filling procedures are all elided.
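
If you want the numbers to mean something under optimization, make each result observable so the work can't be removed. A minimal sketch, modifying the thread's `main`:

```
// Sketch: consume each filled array so the stores stay live under -O.
double sink = 0;
foreach (run; 0 .. 100)
{
    auto arr = fill_slice!(x, 100_000)();
    sink += arr[$ - 1];   // the fill now has an observable consumer
}
writeln(sink);            // printing sink keeps the whole chain alive
```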
April 01, 2020
On Wednesday, 1 April 2020 at 12:22:48 UTC, Adam D. Ruppe wrote:
> On Wednesday, 1 April 2020 at 06:48:09 UTC, Jacob Carlborg wrote:
>> You have not enabled optimizations. You should compile with `-O -release -inline` to enable all optimizations.
>
> -release should *never* be used. You're trading memory safety and security holes for a few microseconds of execution time. Not worth the risk.
>

It is nice that bounds checks remain in place with `-release` when the code is @safe.
April 01, 2020
On Wednesday, 1 April 2020 at 15:04:44 UTC, Jesse Phillips wrote:
> It is nice that bounds checks remain in place when using release and the code is @safe.

Yeah, it used to be even worse than it is now, but it is still a terrible switch that should NEVER be used. There are always better ways and we should stop telling people to use it.

The newer `-check` and `-boundscheck` switches give more fine-grained control if you need it (and again, you should never need to, since you can control this on a per-expression basis with @trusted and the .ptr property).
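
For example, a sketch of that per-expression control (hypothetical helper, not from the original code):

```
// Sketch: .ptr indexing skips the bounds check; the @trusted lambda keeps
// the unsafe part confined to a single expression inside @safe code.
@safe void fillHot(double[] arr, double x)
{
    foreach (i; 0 .. arr.length)
        () @trusted { arr.ptr[i] = x; }();
}
```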
April 01, 2020
Thanks for all the suggestions made so far. I am still interested in the implementation details of the slice assign `arr[] = x`, which I can't seem to find. Before my initial post I tried `memcpy` and `memmove` under a `for` loop, but they didn't match the slice assign's performance, so I didn't mention them. I haven't tried them with the suggested compiler flags, though.

@StevenSchveighoffer also suggested using `memset` (as well as `memcpy`). Please correct me if I am wrong, but it looks as if `memset` can only fill with a repeated single byte, and I need the fill value to be of any potential type T.
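
For example (a quick sketch of what I mean):

```
import core.stdc.string: memset;

auto a = new double[8];
memset(a.ptr, 0, a.length * double.sizeof); // works: 0.0 is all-zero bytes
a[] = 42.0; // but an arbitrary double needs a typed fill like this
```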

    ----------------------------------------------------------------------

On a related note, I noticed that timings were reduced across the board, so much so that the slice time halved, when initialising with:

```
import core.memory: GC;

auto arr = (cast(T*) GC.malloc(T.sizeof * n, GC.BlkAttr.NO_SCAN | GC.BlkAttr.APPENDABLE))[0 .. n];
```

Instead of:

```
auto arr = new T[n];
```

If I am filling the array with something anyway, for example `arr[] = x`, is there anything wrong with this approach? Should I care that the memory is not "scanned"? I have no idea what that means, but I assume it's some kind of check that takes time and is switched off by `GC.BlkAttr.NO_SCAN`. Are there any downsides to this that I should be aware of?

I noticed that `GC.malloc()` is based on `gc_malloc()`, which takes the bit-mask option that makes it faster than `core.stdc.stdlib: malloc`. Is `gc_malloc` OS-dependent? I can't find it in the standard C library; the only reference I found is [here](https://linux.die.net/man/3/gc), where it is named slightly differently but appears to be the same function. In `core.memory` it is declared `extern (C)` (https://github.com/dlang/druntime/blob/master/src/core/memory.d), so I guess it must be somewhere on my system?

Thanks

April 01, 2020
On 4/1/20 11:23 AM, data pulverizer wrote:
> Thanks for all the suggestions made so far. I am still interested in looking at the implementation details of the slice assign `arr[] = x` which I can't seem to find. Before I made my initial post, I tried doing a `memcpy` and `memmove` under a `for` loop but it did not change the performance or get the same kind of performance as the initial slice performance so I didn't bother to mention them, I haven't tried it with the suggested compiler flags though.

Looking at the disassembly on run.dlang.io, it's using __memsetDouble.

> 
> @StevenSchveighoffer also suggested using `memset` (as well as `memcpy`) please correct me if I am wrong but it looks as if `memset` can only write from an `int` sized source and I need the source size to be any potential size (T).

Again, the compiler uses whatever tools are available. It might be memset, it might be something else.


In the case of your code, it's using __memsetDouble; I have no idea where that's defined (probably libc).

> ----------------------------------------------------------------------
> 
> On a related aside I noticed that the timing was reduced across the board so much so that the initial slice time halved when initialising with:
> 
> ```
> auto arr = (cast(T*)GC.malloc(T.sizeof*n, GC.BlkAttr.NO_SCAN | GC.BlkAttr.APPENDABLE))[0..n];
> ```
> 
> Instead of:
> 
> ```
> auto arr = new T[n];
> ```

What this means is, don't scan the block for pointers during a GC collect cycle. If you have pointers in your T, this is a very bad idea. Not only that, but this does not initialize the appendable data at the end of the block.

In addition, GC.malloc just zero-initializes the data. If you do new T[n], and T has an initializer, it's going to be a lot more expensive.

If you are going to use this, remove the GC.BlkAttr.APPENDABLE.

In the case of double, it is initialized to NaN.

This could explain the difference in timing.
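
So for plain doubles, something like this sketch (hypothetical helper name) gets the fast path safely:

```
// Sketch: NO_SCAN is fine here because double contains no pointers, and
// every element is overwritten, so the lack of default init doesn't matter.
import core.memory: GC;

double[] fillFast(double x, size_t n)
{
    auto arr = (cast(double*) GC.malloc(double.sizeof * n, GC.BlkAttr.NO_SCAN))[0 .. n];
    arr[] = x;
    return arr;
}
```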

> 
> I noticed that `GC.malloc()` is based on `gc_malloc()` which gives the bit mask option that makes it  faster than `core.stdc.stdlib: malloc`. Is `gc_malloc` OS dependent? I can't find it in the standard C library, the only reference I found for it is [here](https://linux.die.net/man/3/gc) and it is named slightly differently but appears to be the same function. In `core.memory`, it is specified by the `extern (C)` declaration (https://github.com/dlang/druntime/blob/master/src/core/memory.d) so I guess it must be somewhere on my system?

It's in the D garbage collector, here: https://github.com/dlang/druntime/blob/2eec30b35bab308a37298331353bdce5fee1b657/src/gc/proxy.d#L166

extern(C) functions can be implemented in D. The major difference between standard D functions and extern(C) ones is that the latter do not use D name mangling.
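
For example (a sketch with a hypothetical function name):

```
// Sketch: an extern(C) function implemented in D. It gets C linkage and an
// unmangled symbol name ("my_fill"), so it's callable from C as well as D.
extern (C) void my_fill(double* p, size_t n, double x)
{
    p[0 .. n] = x;
}
```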

-Steve
April 08, 2020
In all honesty, until now I haven't really thought deeply about memory allocation; I just assumed that malloc, free, and so on were low-level operating system functions and that was that. I've heard people in the D community talk about garbage collection and memory allocation, but I didn't think it was something I had to worry about.

A lot of the work I do is statistics and data science, where speed is often important; calculations on arrays involve a lot of array instantiation and a lot of moving and copying of elements, so the speed of memory operations matters. The fact that I could get those kinds of operations to run faster using D's GC.malloc and different types of iteration is significant. Today it hit me that I actually need to study memory allocation, and possibly garbage collection, as topics in their own right: not so that I can implement a garbage collector, but so that I can write faster code.

I know to everyone here that's just basic stuff, but it's a bit of a revelation for me. :-)