Thread overview | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
|
March 31, 2020 Array fill performance differences between for, foreach, slice | ||||
---|---|---|---|---|
| ||||
I've observed large differences in timing performance while filling arrays using different methods (for vs foreach vs arr[] = x) and don't know why. I've looked at array.d (https://github.com/dlang/dmd/blob/9792735c82ac997d11d7fe6c3d6c604389b3f5bd/src/dmd/root/array.d) but I'm still none the wiser. Here is an example: ``` /* fill.d */ import std.stdio: writeln; import std.typecons: Tuple, tuple; import std.algorithm.iteration: mean; import std.algorithm.iteration: sum; import std.datetime.stopwatch: AutoStart, StopWatch; /* Benchmarking Function */ auto bench(alias fun, string units = "msecs", ulong minN = 10, bool doPrint = false)(ulong n, string msg = "") { auto times = new double[n]; auto sw = StopWatch(AutoStart.no); for(ulong i = 0; i < n; ++i) { sw.start(); fun(); sw.stop(); times[i] = cast(double)sw.peek.total!units; sw.reset(); } double ave = mean(times); double sd = 0; if(n >= minN) { for(ulong i = 0; i < n; ++i) sd += (times[i] - ave)^^2; sd /= (n - 1); sd ^^= 0.5; }else{ sd = double.nan; } static if(doPrint) writeln(msg ~ "Mean time("~ units ~ "): ", ave, ", Standard Deviation: ", sd); return tuple!("mean", "sd")(ave, sd); } /* Fill Functions */ auto fill_for(alias x, ulong n)() { alias T = typeof(x); auto arr = new T[n]; for(ulong i = 0; i < n; ++i) arr[i] = x; return arr; } auto fill_foreach(alias x, ulong n)() { alias T = typeof(x); auto arr = new T[n]; foreach(ref el; arr) el = x; return arr; } auto fill_slice(alias x, ulong n)() { alias T = typeof(x); auto arr = new T[n]; arr[] = x; return arr; } void main() { double x = 42; bench!(fill_slice!(x, 100_000), "usecs", 10, true)(100, "Slice: "); bench!(fill_foreach!(x, 100_000), "usecs", 10, true)(100, "Foreach: "); bench!(fill_for!(x, 100_000), "usecs", 10, true)(100, "For: "); } /* $ dmd fill.d && ./fill Slice: Mean time(usecs): 87.38, Standard Deviation: 54.1542 Foreach: Mean time(usecs): 179.9, Standard Deviation: 41.4109 For: Mean time(usecs): 245.81, Standard Deviation: 53.0798 $ dmd --version DMD64 D Compiler v2.090.1 ... */ ``` It would be great to know why there are large differences in performance between these approaches and it would be great to see where array's opSliceAssign (or the equivalent method) for D's native array is implemented. Playing with `-boundscheck` made no difference in the contrasting performances. Thanks. |
March 31, 2020 Re: Array fill performance differences between for, foreach, slice | ||||
---|---|---|---|---|
| ||||
Posted in reply to data pulverizer | On 3/31/20 5:30 PM, data pulverizer wrote:
> I've observed large differences in timing performance while filling arrays using different methods (for vs foreach vs arr[] = x) and don't know why. I've looked at array.d (https://github.com/dlang/dmd/blob/9792735c82ac997d11d7fe6c3d6c604389b3f5bd/src/dmd/root/array.d) but I'm still none the wiser.
for: you are indexing an array on each step, incurring a bounds check, and also doing a pointer calculation on each access.
foreach: Better, iterates a pointer over the whole array, so it's going to be faster.
slice assign: Best, this uses whatever tricks are possible in the given architecture. Most likely memset, or memcpy. These are insanely optimized functions, which is why they perform so well.
-Steve
|
April 01, 2020 Re: Array fill performance differences between for, foreach, slice | ||||
---|---|---|---|---|
| ||||
Posted in reply to data pulverizer | On 2020-03-31 23:30, data pulverizer wrote: > $ dmd fill.d && ./fill You have not enabled optimizations. You should compile with `-O -release -inline` to enable all optimizations. Without optimizations I get numbers like these: Slice: Mean time(usecs): 92.91, Standard Deviation: 49.8002 Foreach: Mean time(usecs): 183.95, Standard Deviation: 30.749 For: Mean time(usecs): 239.74, Standard Deviation: 30.753 With optimizations turned on I get quite varying results: Slice: Mean time(usecs): 96.84, Standard Deviation: 52.5676 Foreach: Mean time(usecs): 95.84, Standard Deviation: 21.9244 For: Mean time(usecs): 106.73, Standard Deviation: 31.2545 Slice: Mean time(usecs): 192.03, Standard Deviation: 115.971 Foreach: Mean time(usecs): 177.06, Standard Deviation: 128.012 For: Mean time(usecs): 89.14, Standard Deviation: 25.681 Slice: Mean time(usecs): 97.74, Standard Deviation: 53.2058 Foreach: Mean time(usecs): 79.72, Standard Deviation: 11.0088 For: Mean time(usecs): 80.54, Standard Deviation: 12.06 I you care about performance you should really compile using LDC (with `-O5 -release -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto`), which usually produces much better code: Slice: Mean time(usecs): 50.58, Standard Deviation: 46.2621 Foreach: Mean time(usecs): 53.92, Standard Deviation: 19.8039 For: Mean time(usecs): 39.89, Standard Deviation: 7.80041 Slice: Mean time(usecs): 76.62, Standard Deviation: 73.0014 Foreach: Mean time(usecs): 49.63, Standard Deviation: 24.5672 For: Mean time(usecs): 40.02, Standard Deviation: 7.67388 -- /Jacob Carlborg |
April 01, 2020 Re: Array fill performance differences between for, foreach, slice | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jacob Carlborg | On Wednesday, 1 April 2020 at 06:48:09 UTC, Jacob Carlborg wrote: > You have not enabled optimizations. You should compile with `-O -release -inline` to enable all optimizations. -release should *never* be used. You're trading memory safety and security holes for a few microseconds of execution time. Not worth the risk. For realistic code, try to make it fast without using that switch, there's plenty of other ways, like isolated @trusted sections as needed. If you have significant cost from asserts you might turn them off with `-check=assert=off` to leave them out of the compile without affecting the other safety features. > I you care about performance you should really compile using LDC (with `-O5 -release -flto=full again forget the -release switch. the rest cool though. |
April 01, 2020 Re: Array fill performance differences between for, foreach, slice | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jacob Carlborg | On Wednesday, 1 April 2020 at 06:48:09 UTC, Jacob Carlborg wrote:
> I you care about performance you should really compile using LDC (with `-O5 -release -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto`), which usually produces much better code:
>
> Slice: Mean time(usecs): 50.58, Standard Deviation: 46.2621
> Foreach: Mean time(usecs): 53.92, Standard Deviation: 19.8039
> For: Mean time(usecs): 39.89, Standard Deviation: 7.80041
>
> Slice: Mean time(usecs): 76.62, Standard Deviation: 73.0014
> Foreach: Mean time(usecs): 49.63, Standard Deviation: 24.5672
> For: Mean time(usecs): 40.02, Standard Deviation: 7.67388
The benchmark is highly flawed, the results for LDC are completely meaningless. LDC produces identical code for every case and the filling procedures are all elided.
|
April 01, 2020 Re: Array fill performance differences between for, foreach, slice | ||||
---|---|---|---|---|
| ||||
Posted in reply to Adam D. Ruppe | On Wednesday, 1 April 2020 at 12:22:48 UTC, Adam D. Ruppe wrote:
> On Wednesday, 1 April 2020 at 06:48:09 UTC, Jacob Carlborg wrote:
>> You have not enabled optimizations. You should compile with `-O -release -inline` to enable all optimizations.
>
> -release should *never* be used. You're trading memory safety and security holes for a few microseconds of execution time. Not worth the risk.
>
It is nice that bounds checks remain in place when using release and the code is @safe.
|
April 01, 2020 Re: Array fill performance differences between for, foreach, slice | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jesse Phillips | On Wednesday, 1 April 2020 at 15:04:44 UTC, Jesse Phillips wrote:
> It is nice that bounds checks remain in place when using release and the code is @safe.
Yeah, it used to be even worse than it is now, but it is still a terrible switch that should NEVER be used. There are always better ways and we should stop telling people to use it.
The newer `-check` and `-boundscheck` switches give more fine-grained control if you need it (and you again never should, since you can control on a per-expression basis with @trusted and the .ptr attribute.)
|
April 01, 2020 Re: Array fill performance differences between for, foreach, slice | ||||
---|---|---|---|---|
| ||||
Posted in reply to data pulverizer | Thanks for all the suggestions made so far. I am still interested in looking at the implementation details of the slice assign `arr[] = x` which I can't seem to find. Before I made my initial post, I tried doing a `memcpy` and `memmove` under a `for` loop but it did not change the performance or get the same kind of performance as the initial slice performance so I didn't bother to mention them, I haven't tried it with the suggested compiler flags though. @StevenSchveighoffer also suggested using `memset` (as well as `memcpy`) please correct me if I am wrong but it looks as if `memset` can only write from an `int` sized source and I need the source size to be any potential size (T). ---------------------------------------------------------------------- On a related aside I noticed that the timing was reduced across the board so much so that the initial slice time halved when initialising with: ``` auto arr = (cast(T*)GC.malloc(T.sizeof*n, GC.BlkAttr.NO_SCAN | GC.BlkAttr.APPENDABLE))[0..n]; ``` Instead of: ``` auto arr = new T[n]; ``` If I am filling the array with something anyway for example `arr[] = x` is there anything wrong with this approach? Should I care if the memory is not "scanned" - I have no idea what that means but I assume it's some kind of check that takes time and is switched off by `GC.BlkAttr.NO_SCAN`. Are there any downsides to this that I should be aware of? I noticed that `GC.malloc()` is based on `gc_malloc()` which gives the bit mask option that makes it faster than `core.stdc.stdlib: malloc`. Is `gc_malloc` OS dependent? I can't find it in the standard C library, the only reference I found for it is [here](https://linux.die.net/man/3/gc) and it is named slightly differently but appears to be the same function. In `core.memory`, it is specified by the `extern (C)` declaration (https://github.com/dlang/druntime/blob/master/src/core/memory.d) so I guess it must be somewhere on my system? Thanks |
April 01, 2020 Re: Array fill performance differences between for, foreach, slice | ||||
---|---|---|---|---|
| ||||
Posted in reply to data pulverizer | On 4/1/20 11:23 AM, data pulverizer wrote: > Thanks for all the suggestions made so far. I am still interested in looking at the implementation details of the slice assign `arr[] = x` which I can't seem to find. Before I made my initial post, I tried doing a `memcpy` and `memmove` under a `for` loop but it did not change the performance or get the same kind of performance as the initial slice performance so I didn't bother to mention them, I haven't tried it with the suggested compiler flags though. Using disassembly, on run.dlang.io, it says it's using __memsetDouble. > > @StevenSchveighoffer also suggested using `memset` (as well as `memcpy`) please correct me if I am wrong but it looks as if `memset` can only write from an `int` sized source and I need the source size to be any potential size (T). Again, the compiler uses whatever tools are available. It might be memset, it might be something else. In the case of your code, it's using __memsetDouble, which I have no idea where it's defined (probably libc). > ---------------------------------------------------------------------- > > On a related aside I noticed that the timing was reduced across the board so much so that the initial slice time halved when initialising with: > > ``` > auto arr = (cast(T*)GC.malloc(T.sizeof*n, GC.BlkAttr.NO_SCAN | GC.BlkAttr.APPENDABLE))[0..n]; > ``` > > Instead of: > > ``` > auto arr = new T[n]; > ``` What this means is, don't scan the block for pointers during a GC collect cycle. If you have pointers in your T, this is a very bad idea. Not only that, but this does not initialize the appendable data at the end of the block. In addition, GC.malloc just zero-initializes the data. If you do new T[n], and T has an initializer, it's going to be a lot more expensive. If you are going to use this, remove the GC.BlkAttr.APPENDABLE. In the case of double, it is initialized to NaN. This could explain the difference in timing. > > I noticed that `GC.malloc()` is based on `gc_malloc()` which gives the bit mask option that makes it faster than `core.stdc.stdlib: malloc`. Is `gc_malloc` OS dependent? I can't find it in the standard C library, the only reference I found for it is [here](https://linux.die.net/man/3/gc) and it is named slightly differently but appears to be the same function. In `core.memory`, it is specified by the `extern (C)` declaration (https://github.com/dlang/druntime/blob/master/src/core/memory.d) so I guess it must be somewhere on my system? It's in the D garbage collector, here: https://github.com/dlang/druntime/blob/2eec30b35bab308a37298331353bdce5fee1b657/src/gc/proxy.d#L166 extern(C) functions can be implemented in D. The major difference between standard D functions and extern(C) is that the latter does not do name mangling. -Steve |
April 08, 2020 Re: Array fill performance differences between for, foreach, slice | ||||
---|---|---|---|---|
| ||||
Posted in reply to Steven Schveighoffer | In all honesty till now I haven't really thought deeply about memory allocation, I just assumed that malloc, free, and so on where low level operating system functions and that was that. I've heard people in the D community talk about garbage collection, and memory allocation but I didn't think it was something I had to worry about. A lot of the work I do is statistics and data science based, often speed is important, calculations on arrays involve a lot of array instantiation, moving and copying elements and so forth. As far as all that stuff goes the speed of memory operations is important. The fact that I could get those kinds of operations to run faster using D's GC.malloc and different types of iteration is significant. Today it hit me that I actually need to study memory allocation and possibly garbage collection as topics - not so that I can implement a garbage collector, but so that I can write faster code. I know to everyone here that's just basic stuff, but it's a bit of a revelation for me. :-) |
Copyright © 1999-2021 by the D Language Foundation