April 01, 2021

On Thursday, 1 April 2021 at 19:00:08 UTC, Berni44 wrote:
> On Thursday, 1 April 2021 at 16:52:17 UTC, Nestor wrote:
>> I was hoping to beat my dear Python and get similar results to Go, but that is not the case, whether using rdmd or running the executable generated by dmd. I am getting values between 350-380 ms, and 81 ms in Python.
>
> Try using ldc2 instead of dmd:
>
> ```
> ldc2 -O3 -release -boundscheck=off -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto speed.d
> ```
>
> should produce much better results.

It did! I tried those flags with dmd and ldc and got the following times (in ms) for the approaches I had earlier (two runs of each):

DMD: 11, 7, 6, 4
DMD: 15, 7, 10, 6
LDC: 6, 7, 9, 6
LDC: 12, 6, 8, 5

April 01, 2021
On Thursday, 1 April 2021 at 19:00:08 UTC, Berni44 wrote:
>
> Try using ldc2 instead of dmd:
>
> ```
> ldc2 -O3 -release -boundscheck=off -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto speed.d
> ```
>
> should produce much better results.

Since this is a "Learn" part of the Foruam, be careful with "-boundscheck=off".

I mean for this little snippet is OK, but for a other projects this my be wrong, and as it says here: https://dlang.org/dmd-windows.html#switch-boundscheck

"This option should be used with caution and as a last resort to improve performance. Confirm turning off @safe bounds checks is worthwhile by benchmarking."

Matheus.
April 01, 2021
On 01.04.21 21:00, Berni44 wrote:
> ```
> ldc2 -O3 -release -boundscheck=off -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto speed.d
> ```

Please don't recommend `-boundscheck=off` to newbies. It's not just an optimization. It breaks @safe. If you want to do welding without eye protection, that's on you. But please don't recommend it to the new guy.
April 01, 2021
On 4/1/21 3:27 PM, ag0aep6g wrote:
> On 01.04.21 21:00, Berni44 wrote:
>> ```
>> ldc2 -O3 -release -boundscheck=off -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto speed.d
>> ```
> 
> Please don't recommend `-boundscheck=off` to newbies. It's not just an optimization. It breaks @safe. If you want to do welding without eye protection, that's on you. But please don't recommend it to the new guy.

Yes, but you can recommend `-boundscheck=safeonly`, which leaves it on for @safe code.

Though I personally leave it on for everything.
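	
A quick sketch of what that means in practice (illustrative only, the two helpers are made up for the example): built with `-boundscheck=safeonly`, the @safe function below keeps its bounds check while the @system one does not.

```
import std.stdio;

int safeRead(int[] a, size_t i) @safe
{
    // Still bounds-checked under -boundscheck=safeonly.
    return a[i];
}

int systemRead(int[] a, size_t i) @system
{
    // Not checked under -boundscheck=safeonly.
    return a[i];
}

void main()
{
    int[] a = [1, 2, 3];
    writeln(safeRead(a, 1), " ", systemRead(a, 2));
}
```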

-Steve
April 01, 2021
On Thu, Apr 01, 2021 at 04:52:17PM +0000, Nestor via Digitalmars-d-learn wrote: [...]
> ```
> import std.stdio;
> import std.random;
> import std.datetime.stopwatch : benchmark, StopWatch, AutoStart;
> import std.algorithm;
> 
> void main()
> {
>     auto sw = StopWatch(AutoStart.no);
>     sw.start();
>     int[] mylist;

Since the length of the array is already known beforehand, you could get significant speedups by preallocating the array:

	int[] mylist = new int[100000];
	for (int number ...)
	{
		...
		mylist[number] = n;
	}


>     for (int number = 0; number < 100000; ++number)
>     {
>         auto rnd = Random(unpredictableSeed);
[...]

Don't reseed the RNG every loop iteration. (1) It's very inefficient and slow, and (2) it actually makes it *less* random than if you seeded it only once at the start of the program.  Move this outside the loop, and you should see some gains.


>         auto n = uniform(0, 100, rnd);
>         mylist ~= n;
>     }
>     mylist.sort();
>     sw.stop();
>     long msecs = sw.peek.total!"msecs";
>     writefln("%s", msecs);
> }
[...]
> ```

Also, whenever performance matters, use gdc or ldc2 instead of dmd. Try `ldc2 -O2`, for example.


I did a quick test with LDC, with a side-by-side comparison of your original version and my improved version:

-------------
import std.stdio;
import std.random;
import std.datetime.stopwatch : benchmark, StopWatch, AutoStart;
import std.algorithm;

void original()
{
    auto sw = StopWatch(AutoStart.no);
    sw.start();
    int[] mylist;
    for (int number = 0; number < 100000; ++number)
    {
        auto rnd = Random(unpredictableSeed);
        auto n = uniform(0, 100, rnd);
        mylist ~= n;
    }
    mylist.sort();
    sw.stop();
    long msecs = sw.peek.total!"msecs";
    writefln("%s", msecs);
}

void improved()
{
    auto sw = StopWatch(AutoStart.no);
    sw.start();
    int[] mylist = new int[100000];
    auto rnd = Random(unpredictableSeed);
    for (int number = 0; number < 100000; ++number)
    {
        auto n = uniform(0, 100, rnd);
        mylist[number] = n;
    }
    mylist.sort();
    sw.stop();
    long msecs = sw.peek.total!"msecs";
    writefln("%s", msecs);
}

void main()
{
    original();
    improved();
}
-------------


Here's the typical output:
-------------
209
5
-------------

As you can see, that's a 40x improvement in speed. ;-)

Assuming that the ~209 msec on my PC corresponds with your observed 280ms, and assuming that the 40x improvement also applies on your machine, the improved version should run in about 9-10 msec.  So this *should* give you a roughly 4x speedup over the Python version, in theory. I'd love to see how it actually measures on your machine, if you don't mind. ;-)


T

-- 
Holding a grudge is like drinking poison and hoping the other person dies. -- seen on the 'Net
April 01, 2021
On 01.04.21 21:36, Steven Schveighoffer wrote:
> On 4/1/21 3:27 PM, ag0aep6g wrote:
>> On 01.04.21 21:00, Berni44 wrote:
>>> ```
>>> ldc2 -O3 -release -boundscheck=off -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto speed.d
>>> ```
[...]
> Yes, but you can recommend `-boundscheck=safeonly`, which leaves it on for @safe code.
`-O -release` already does that, doesn't it?
April 01, 2021
On 4/1/21 3:44 PM, ag0aep6g wrote:
> On 01.04.21 21:36, Steven Schveighoffer wrote:
>> On 4/1/21 3:27 PM, ag0aep6g wrote:
>>> On 01.04.21 21:00, Berni44 wrote:
>>>> ```
>>>> ldc2 -O3 -release -boundscheck=off -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto speed.d
>>>> ```
> [...]
>> Yes, but you can recommend `-boundscheck=safeonly`, which leaves it on for @safe code.
> `-O -release` already does that, doesn't it?

Maybe, but I wasn't responding to that, just your statement not to recommend -boundscheck=off. In any case, it wouldn't hurt, right?

I don't know what -O3 and -release do on ldc.

-Steve
April 01, 2021
On Thu, Apr 01, 2021 at 07:25:53PM +0000, matheus via Digitalmars-d-learn wrote: [...]
> Since this is a "Learn" part of the Foruam, be careful with "-boundscheck=off".
> 
> I mean for this little snippet is OK, but for a other projects this my be wrong, and as it says here: https://dlang.org/dmd-windows.html#switch-boundscheck
> 
> "This option should be used with caution and as a last resort to improve performance. Confirm turning off @safe bounds checks is worthwhile by benchmarking."
[...]

It's interesting that whenever a question about D's performance pops up in the forums, people tend to reach for optimization flags.  I wouldn't say it doesn't help; but I've found that significant performance improvements can usually be obtained by examining the code first, and catching common newbie mistakes.  Those usually account for the majority of the observed performance degradation.

Only after the code has been cleaned up and obvious mistakes fixed, is it worth reaching for optimization flags, IMO.

Common mistakes I've noticed include:

- Constructing large arrays by appending 1 element at a time with `~`.
  Obviously, this requires many array reallocations and the associated
  copying; not to mention greatly-increased GC load that could have been
  easily avoided by preallocation or using std.array.appender (a rough
  sketch follows this list).

- Failing to move repeated computations (esp. inefficient ones) outside
  the inner loop.  Sometimes a good optimizing compiler is able to hoist
  it out automatically, but not always.

- Constructing lots of temporaries in inner loops as heap-allocated
  classes instead of by-value structs: the former leads to heavy GC
  load, not to mention memory allocation is generally slow and should be
  avoided inside inner loops. Heap-allocated objects also require
  indirections, which slow things down even more. The latter can be
  passed around in registers: no GC pressure, no indirections; so can
  significantly improve performance.

- Using O(N^2) (or other super-linear) algorithms with large data sets
  where a more efficient algorithm is available. This one ought to speak
  for itself. :-D  Nevertheless it still crops up from time to time, so
  deserves to be mentioned again.
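	
To make the first point concrete, here is a rough sketch (illustrative only, untimed; the helper names are made up) of three ways to build the same array. The appender and preallocated versions avoid most of the reallocation and GC churn:

-------------
import std.array : appender;

// Naive: each ~= may have to reallocate and copy the whole array.
int[] naive(int n)
{
    int[] a;
    foreach (i; 0 .. n)
        a ~= i;
    return a;
}

// Appender: amortizes growth and keeps the GC bookkeeping down.
int[] withAppender(int n)
{
    auto app = appender!(int[])();
    app.reserve(n);          // optional: request the capacity up front
    foreach (i; 0 .. n)
        app.put(i);
    return app.data;
}

// Preallocation: one allocation, then plain element stores.
int[] preallocated(int n)
{
    auto a = new int[n];
    foreach (i; 0 .. n)
        a[i] = i;
    return a;
}

void main()
{
    assert(naive(10) == withAppender(10));
    assert(withAppender(10) == preallocated(10));
}
-------------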


T

-- 
Those who don't understand Unix are condemned to reinvent it, poorly.
April 01, 2021
On 01.04.21 21:53, Steven Schveighoffer wrote:
> Maybe, but I wasn't responding to that, just your statement not to recommend -boundscheck=off. In any case, it wouldn't hurt, right?

Right.
April 01, 2021
On 4/1/21 12:55 PM, H. S. Teoh wrote:

> - Constructing large arrays by appending 1 element at a time with `~`.
>    Obviously, this requires many array reallocations and the associated
>    copying

And that may not be a contributing factor. :) The following program sees just 15 allocations and 1722 element copies for 1 million appending operations:

import std.stdio;

void main() {
  int[] arr;
  auto place = arr.ptr;
  size_t relocated = 0;
  size_t copied = 0;
  foreach (i; 0 .. 1_000_000) {
    arr ~= i;
    if (arr.ptr != place) {
      ++relocated;
      copied += arr.length - 1;
      place = arr.ptr;
    }
  }

  writeln("relocated: ", relocated);
  writeln("copied   : ", copied);
}

This is because the GC does not allocate if there are unused pages right after the array. (However, increasing the element count to 10 million increases allocations slightly to 18 but element copies jump to 8 million.)
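	
And when relocation needs to be avoided entirely, `reserve` can request the capacity up front; a quick sketch of the same measurement, which I would expect to report zero relocations:

import std.stdio;

void main() {
  int[] arr;
  arr.reserve(1_000_000);               // ask for the capacity up front
  writeln("capacity : ", arr.capacity); // at least 1_000_000

  auto place = arr.ptr;
  size_t relocated = 0;
  foreach (i; 0 .. 1_000_000) {
    arr ~= i;
    if (arr.ptr != place) {
      ++relocated;
      place = arr.ptr;
    }
  }

  writeln("relocated: ", relocated);    // expected: 0
}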

Ali