Array fill performance differences between for, foreach, slice - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » Learn » Array fill performance differences between for, foreach, slice

Thread overview

Array fill performance differences between for, foreach, slice
Mar 31, 2020 data pulverizer
Apr 01, 2020 Steven Schveighoffer
Apr 01, 2020 Jacob Carlborg
Apr 01, 2020 Adam D. Ruppe
Apr 01, 2020 Jesse Phillips
Apr 01, 2020 Adam D. Ruppe
Apr 01, 2020 lithium iodate
Apr 01, 2020 data pulverizer
Apr 01, 2020 Steven Schveighoffer
Apr 08, 2020 data pulverizer

March 31, 2020

Array fill performance differences between for, foreach, slice

Posted by data pulverizer

data pulverizer

I've observed large differences in timing performance while filling arrays using different methods (for vs foreach vs arr[] = x) and don't know why. I've looked at array.d (https://github.com/dlang/dmd/blob/9792735c82ac997d11d7fe6c3d6c604389b3f5bd/src/dmd/root/array.d) but I'm still none the wiser.

Here is an example:

```
/* fill.d */

import std.stdio: writeln;
import std.typecons: Tuple, tuple;
import std.algorithm.iteration: mean;
import std.algorithm.iteration: sum;
import std.datetime.stopwatch: AutoStart, StopWatch;


/* Benchmarking Function */
auto bench(alias fun, string units = "msecs",
          ulong minN = 10, bool doPrint = false)(ulong n, string msg = "")
{
  auto times = new double[n];
  auto sw = StopWatch(AutoStart.no);
  for(ulong i = 0; i < n; ++i)
  {
    sw.start();
    fun();
    sw.stop();
    times[i] = cast(double)sw.peek.total!units;
    sw.reset();
  }
  double ave = mean(times);
  double sd = 0;

  if(n >= minN)
  {
    for(ulong i = 0; i < n; ++i)
      sd += (times[i] - ave)^^2;
    sd /= (n - 1);
    sd ^^= 0.5;
  }else{
    sd = double.nan;
  }

  static if(doPrint)
    writeln(msg ~ "Mean time("~ units ~ "): ", ave, ", Standard Deviation: ", sd);

  return tuple!("mean", "sd")(ave, sd);
}

/* Fill Functions */
auto fill_for(alias x, ulong n)()
{
  alias T = typeof(x);
  auto arr = new T[n];

  for(ulong i = 0; i < n; ++i)
    arr[i] = x;

  return arr;
}

auto fill_foreach(alias x, ulong n)()
{
  alias T = typeof(x);
  auto arr = new T[n];

  foreach(ref el; arr)
    el = x;

  return arr;
}

auto fill_slice(alias x, ulong n)()
{
  alias T = typeof(x);
  auto arr = new T[n];

  arr[] = x;

  return arr;
}


void main()
{
  double x = 42;

  bench!(fill_slice!(x, 100_000), "usecs", 10, true)(100, "Slice: ");
  bench!(fill_foreach!(x, 100_000), "usecs", 10, true)(100, "Foreach: ");
  bench!(fill_for!(x, 100_000), "usecs", 10, true)(100, "For: ");
}

/*
$ dmd fill.d && ./fill
Slice: Mean time(usecs): 87.38, Standard Deviation: 54.1542
Foreach: Mean time(usecs): 179.9, Standard Deviation: 41.4109
For: Mean time(usecs): 245.81, Standard Deviation: 53.0798

$ dmd --version
DMD64 D Compiler v2.090.1
...
*/
```

It would be great to know why there are large differences in performance between these approaches and it would be great to see where array's opSliceAssign (or the equivalent method) for D's native array is implemented. Playing with `-boundscheck` made no difference in the contrasting performances. Thanks.

March 31, 2020

Re: Array fill performance differences between for, foreach, slice

Posted by Steven Schveighoffer
in reply to data pulverizer

Steven Schveighoffer

Posted in reply to data pulverizer

On 3/31/20 5:30 PM, data pulverizer wrote:
> I've observed large differences in timing performance while filling arrays using different methods (for vs foreach vs arr[] = x) and don't know why. I've looked at array.d (https://github.com/dlang/dmd/blob/9792735c82ac997d11d7fe6c3d6c604389b3f5bd/src/dmd/root/array.d) but I'm still none the wiser.

for: you are indexing an array on each step, incurring a bounds check, and also doing a pointer calculation on each access.

foreach: Better, iterates a pointer over the whole array, so it's going to be faster.

slice assign: Best, this uses whatever tricks are possible in the given architecture. Most likely memset, or memcpy. These are insanely optimized functions, which is why they perform so well.

-Steve

April 01, 2020

Re: Array fill performance differences between for, foreach, slice

Posted by Jacob Carlborg
in reply to data pulverizer

Jacob Carlborg

Posted in reply to data pulverizer

On 2020-03-31 23:30, data pulverizer wrote:

> $ dmd fill.d && ./fill

You have not enabled optimizations. You should compile with `-O -release -inline` to enable all optimizations.

Without optimizations I get numbers like these:

Slice: Mean time(usecs): 92.91, Standard Deviation: 49.8002
Foreach: Mean time(usecs): 183.95, Standard Deviation: 30.749
For: Mean time(usecs): 239.74, Standard Deviation: 30.753

With optimizations turned on I get quite varying results:

Slice: Mean time(usecs): 96.84, Standard Deviation: 52.5676
Foreach: Mean time(usecs): 95.84, Standard Deviation: 21.9244
For: Mean time(usecs): 106.73, Standard Deviation: 31.2545

Slice: Mean time(usecs): 192.03, Standard Deviation: 115.971
Foreach: Mean time(usecs): 177.06, Standard Deviation: 128.012
For: Mean time(usecs): 89.14, Standard Deviation: 25.681

Slice: Mean time(usecs): 97.74, Standard Deviation: 53.2058
Foreach: Mean time(usecs): 79.72, Standard Deviation: 11.0088
For: Mean time(usecs): 80.54, Standard Deviation: 12.06

I you care about performance you should really compile using LDC (with `-O5 -release -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto`), which usually produces much better code:

Slice: Mean time(usecs): 50.58, Standard Deviation: 46.2621
Foreach: Mean time(usecs): 53.92, Standard Deviation: 19.8039
For: Mean time(usecs): 39.89, Standard Deviation: 7.80041

Slice: Mean time(usecs): 76.62, Standard Deviation: 73.0014
Foreach: Mean time(usecs): 49.63, Standard Deviation: 24.5672
For: Mean time(usecs): 40.02, Standard Deviation: 7.67388

-- 
/Jacob Carlborg

April 01, 2020

Re: Array fill performance differences between for, foreach, slice

Posted by Adam D. Ruppe
in reply to Jacob Carlborg

Adam D. Ruppe

Posted in reply to Jacob Carlborg

On Wednesday, 1 April 2020 at 06:48:09 UTC, Jacob Carlborg wrote:
> You have not enabled optimizations. You should compile with `-O -release -inline` to enable all optimizations.

-release should *never* be used. You're trading memory safety and security holes for a few microseconds of execution time. Not worth the risk.

For realistic code, try to make it fast without using that switch, there's plenty of other ways, like isolated @trusted sections as needed. If you have significant cost from asserts you might turn them off with `-check=assert=off` to leave them out of the compile without affecting the other safety features.

> I you care about performance you should really compile using LDC (with `-O5 -release -flto=full

again forget the -release switch. the rest cool though.

April 01, 2020

Re: Array fill performance differences between for, foreach, slice

Posted by lithium iodate
in reply to Jacob Carlborg

lithium iodate

Posted in reply to Jacob Carlborg

On Wednesday, 1 April 2020 at 06:48:09 UTC, Jacob Carlborg wrote:
> I you care about performance you should really compile using LDC (with `-O5 -release -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto`), which usually produces much better code:
>
> Slice: Mean time(usecs): 50.58, Standard Deviation: 46.2621
> Foreach: Mean time(usecs): 53.92, Standard Deviation: 19.8039
> For: Mean time(usecs): 39.89, Standard Deviation: 7.80041
>
> Slice: Mean time(usecs): 76.62, Standard Deviation: 73.0014
> Foreach: Mean time(usecs): 49.63, Standard Deviation: 24.5672
> For: Mean time(usecs): 40.02, Standard Deviation: 7.67388

The benchmark is highly flawed, the results for LDC are completely meaningless. LDC produces identical code for every case and the filling procedures are all elided.

April 01, 2020

Re: Array fill performance differences between for, foreach, slice

Posted by Jesse Phillips
in reply to Adam D. Ruppe

Jesse Phillips

Posted in reply to Adam D. Ruppe

On Wednesday, 1 April 2020 at 12:22:48 UTC, Adam D. Ruppe wrote:
> On Wednesday, 1 April 2020 at 06:48:09 UTC, Jacob Carlborg wrote:
>> You have not enabled optimizations. You should compile with `-O -release -inline` to enable all optimizations.
>
> -release should *never* be used. You're trading memory safety and security holes for a few microseconds of execution time. Not worth the risk.
>

It is nice that bounds checks remain in place when using release and the code is @safe.

April 01, 2020

Re: Array fill performance differences between for, foreach, slice

Posted by Adam D. Ruppe
in reply to Jesse Phillips

Adam D. Ruppe

Posted in reply to Jesse Phillips

On Wednesday, 1 April 2020 at 15:04:44 UTC, Jesse Phillips wrote:
> It is nice that bounds checks remain in place when using release and the code is @safe.

Yeah, it used to be even worse than it is now, but it is still a terrible switch that should NEVER be used. There are always better ways and we should stop telling people to use it.

The newer `-check` and `-boundscheck` switches give more fine-grained control if you need it (and you again never should, since you can control on a per-expression basis with @trusted and the .ptr attribute.)

April 01, 2020

Re: Array fill performance differences between for, foreach, slice

Posted by data pulverizer
in reply to data pulverizer

data pulverizer

Posted in reply to data pulverizer

Thanks for all the suggestions made so far. I am still interested in looking at the implementation details of the slice assign `arr[] = x` which I can't seem to find. Before I made my initial post, I tried doing a `memcpy` and `memmove` under a `for` loop but it did not change the performance or get the same kind of performance as the initial slice performance so I didn't bother to mention them, I haven't tried it with the suggested compiler flags though.

@StevenSchveighoffer also suggested using `memset` (as well as `memcpy`) please correct me if I am wrong but it looks as if `memset` can only write from an `int` sized source and I need the source size to be any potential size (T).

    ----------------------------------------------------------------------

On a related aside I noticed that the timing was reduced across the board so much so that the initial slice time halved when initialising with:

```
auto arr = (cast(T*)GC.malloc(T.sizeof*n, GC.BlkAttr.NO_SCAN | GC.BlkAttr.APPENDABLE))[0..n];
```

Instead of:

```
auto arr = new T[n];
```

If I am filling the array with something anyway for example `arr[] = x` is there anything wrong with this approach? Should I care if the memory is not "scanned" - I have no idea what that means but I assume it's some kind of check that takes time and is switched off by `GC.BlkAttr.NO_SCAN`. Are there any downsides to this that I should be aware of?

I noticed that `GC.malloc()` is based on `gc_malloc()` which gives the bit mask option that makes it  faster than `core.stdc.stdlib: malloc`. Is `gc_malloc` OS dependent? I can't find it in the standard C library, the only reference I found for it is [here](https://linux.die.net/man/3/gc) and it is named slightly differently but appears to be the same function. In `core.memory`, it is specified by the `extern (C)` declaration (https://github.com/dlang/druntime/blob/master/src/core/memory.d) so I guess it must be somewhere on my system?

Thanks

April 01, 2020

Re: Array fill performance differences between for, foreach, slice

Posted by Steven Schveighoffer
in reply to data pulverizer

Steven Schveighoffer

Posted in reply to data pulverizer

On 4/1/20 11:23 AM, data pulverizer wrote:
> Thanks for all the suggestions made so far. I am still interested in looking at the implementation details of the slice assign `arr[] = x` which I can't seem to find. Before I made my initial post, I tried doing a `memcpy` and `memmove` under a `for` loop but it did not change the performance or get the same kind of performance as the initial slice performance so I didn't bother to mention them, I haven't tried it with the suggested compiler flags though.

Using disassembly, on run.dlang.io, it says it's using __memsetDouble.

> 
> @StevenSchveighoffer also suggested using `memset` (as well as `memcpy`) please correct me if I am wrong but it looks as if `memset` can only write from an `int` sized source and I need the source size to be any potential size (T).

Again, the compiler uses whatever tools are available. It might be memset, it might be something else.

In the case of your code, it's using __memsetDouble, which I have no idea where it's defined (probably libc).

> ----------------------------------------------------------------------
> 
> On a related aside I noticed that the timing was reduced across the board so much so that the initial slice time halved when initialising with:
> 
> ```
> auto arr = (cast(T*)GC.malloc(T.sizeof*n, GC.BlkAttr.NO_SCAN | GC.BlkAttr.APPENDABLE))[0..n];
> ```
> 
> Instead of:
> 
> ```
> auto arr = new T[n];
> ```

What this means is, don't scan the block for pointers during a GC collect cycle. If you have pointers in your T, this is a very bad idea. Not only that, but this does not initialize the appendable data at the end of the block.

In addition, GC.malloc just zero-initializes the data. If you do new T[n], and T has an initializer, it's going to be a lot more expensive.

If you are going to use this, remove the GC.BlkAttr.APPENDABLE.

In the case of double, it is initialized to NaN.

This could explain the difference in timing.

> 
> I noticed that `GC.malloc()` is based on `gc_malloc()` which gives the bit mask option that makes it  faster than `core.stdc.stdlib: malloc`. Is `gc_malloc` OS dependent? I can't find it in the standard C library, the only reference I found for it is [here](https://linux.die.net/man/3/gc) and it is named slightly differently but appears to be the same function. In `core.memory`, it is specified by the `extern (C)` declaration (https://github.com/dlang/druntime/blob/master/src/core/memory.d) so I guess it must be somewhere on my system?

It's in the D garbage collector, here: https://github.com/dlang/druntime/blob/2eec30b35bab308a37298331353bdce5fee1b657/src/gc/proxy.d#L166

extern(C) functions can be implemented in D. The major difference between standard D functions and extern(C) is that the latter does not do name mangling.

-Steve

April 08, 2020

Re: Array fill performance differences between for, foreach, slice

Posted by data pulverizer
in reply to Steven Schveighoffer

data pulverizer

Posted in reply to Steven Schveighoffer

In all honesty till now I haven't really thought deeply about memory allocation, I just assumed that malloc, free, and so on where low level operating system functions and that was that. I've heard people in the D community talk about garbage collection, and memory allocation but I didn't think it was something I had to worry about.

A lot of the work I do is statistics and data science based, often speed is important, calculations on arrays involve a lot of array instantiation, moving and copying elements and so forth. As far as all that stuff goes the speed of memory operations is important. The fact that I could get those kinds of operations to run faster using D's GC.malloc and different types of iteration is significant. Today it hit me that I actually need to study memory allocation and possibly garbage collection as topics - not so that I can implement a garbage collector, but so that I can write faster code.

I know to everyone here that's just basic stuff, but it's a bit of a revelation for me. :-)

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation