November 17, 2019
On Sat, Nov 16, 2019 at 5:50 PM Jacob Shtokolov via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
>
> On Saturday, 16 November 2019 at 16:34:58 UTC, Jacob Shtokolov wrote:
> > Just tried to compile and run Base64
>
> The Havlak test is closer to reality:
>
> ```
> Nim:    12.24s, 477.8Mb
> C++:    17.33s, 179.3Mb
> Golang: 21.58s, 358.0Mb
> D LDC2: 23.55s, 460.4Mb
> D DMD:  29.04s, 461.9Mb
> ```
>
> Nim is the winner.
>
> But here I would look into the code: what makes LDC produce such a poorly optimized binary?
>

The LDC binary is OK; this is about the GC. I was able to make it almost twice as fast for LDC with some improvements.
November 17, 2019
On Sun, Nov 17, 2019 at 11:36 AM Daniel Kozak <kozzi11@gmail.com> wrote:
>
> > Nim is the winner.
> >
> > But here I would look into the code: what makes LDC produce such a poorly optimized binary?
> >
>
> The LDC binary is OK; this is about the GC. I was able to make it almost twice as fast for LDC with some improvements.

original code
Golang: 22.74s, 364.1Mb
D LDC2: 29.55s, 463.9Mb
D DMD:  29.42s, 462.5Mb
D GDC:  25.28s, 415.3Mb
Nim:    14.26s, 468.9Mb

with small changes:
Golang: 22.74s, 364.1Mb
D LDC2: 15.90s, 389.8Mb
D DMD:  16.86s, 387.3Mb
D GDC:  19.48s, 403.8Mb
Nim:    14.26s, 468.9Mb
November 17, 2019
On Saturday, 16 November 2019 at 16:45:02 UTC, Jacob Shtokolov wrote:
> On Saturday, 16 November 2019 at 16:34:58 UTC, Jacob Shtokolov wrote:
>> Just tried to compile and run Base64
>
> The Havlak test is closer to reality:
>
> ```
> Nim:    12.24s, 477.8Mb
> C++:    17.33s, 179.3Mb
> Golang: 21.58s, 358.0Mb
> D LDC2: 23.55s, 460.4Mb
> D DMD:  29.04s, 461.9Mb
> ```
>
> Nim is the winner.
>
> But here I would look into the code: what makes LDC produce such a poorly optimized binary?

C++ memory consumption is way lower than the rest. Is this because of the tracing GC penalty? It would have been interesting to see Rust here, since it doesn't use a GC, to check whether it would get close to the C++ memory consumption.
November 17, 2019
On 11/17/19 6:04 AM, Daniel Kozak wrote:
> On Sun, Nov 17, 2019 at 11:36 AM Daniel Kozak <kozzi11@gmail.com> wrote:
>> The LDC binary is OK; this is about the GC. I was able to make it
>> almost twice as fast for LDC with some improvements.


Can you summarize or share the changes for learning purposes?
November 17, 2019
On Sunday, 17 November 2019 at 10:36:41 UTC, Daniel Kozak wrote:
> The LDC binary is OK; this is about the GC. I was able to make it almost twice as fast for LDC with some improvements.

Just checked the code and found that it's allocating with `new` inside loops. It would be very interesting to see what changes you made to make it run so much faster!

Could you please share it somewhere?
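Not the benchmark's actual code, but a minimal sketch of the pattern being described (function and variable names here are hypothetical): a fresh `new` allocation on every loop iteration produces short-lived garbage for the GC to collect, while hoisting the allocation out of the loop reuses a single buffer.

```d
import std.array : appender;

// Per-iteration allocation: every pass through the loop creates a new
// GC-managed array that becomes garbage almost immediately.
int[] rowSumsNaive(int n)
{
    int[] results;
    foreach (i; 0 .. n)
    {
        auto row = new int[](3);          // fresh allocation each iteration
        row[0] = i; row[1] = 2 * i; row[2] = 3 * i;
        results ~= row[0] + row[1] + row[2];
    }
    return results;
}

// Same result with one allocation reused across iterations, plus an
// Appender for the output so growth is amortized.
int[] rowSumsReused(int n)
{
    auto results = appender!(int[])();
    auto row = new int[](3);              // allocated once, reused
    foreach (i; 0 .. n)
    {
        row[0] = i; row[1] = 2 * i; row[2] = 3 * i;
        results.put(row[0] + row[1] + row[2]);
    }
    return results.data;
}
```

Both versions compute the same values; the second simply gives the GC far less to do in a hot loop.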

November 17, 2019
On Sun, Nov 17, 2019 at 2:50 PM Jacob Shtokolov via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
>
> On Sunday, 17 November 2019 at 10:36:41 UTC, Daniel Kozak wrote:
> The LDC binary is OK; this is about the GC. I was able to make it almost twice as fast for LDC with some improvements.
>
> Just checked the code and found that it's allocating with `new` inside loops. It would be very interesting to see what changes you made to make it run so much faster!
>
> Could you please share it somewhere?
>
Sorry, I forgot to insert the link. It is on my GitHub: https://github.com/Kozzi11/benchmarks/tree/improve_d
November 17, 2019
On Sunday, 17 November 2019 at 11:04:55 UTC, Daniel Kozak wrote:
> On Sun, Nov 17, 2019 at 11:36 AM Daniel Kozak <kozzi11@gmail.com> wrote:
>>
>> > Nim is the winner.
>> >
>> > But here I would look into the code: what makes LDC produce such a poorly optimized binary?
>> >
>>
>> The LDC binary is OK; this is about the GC. I was able to make it almost twice as fast for LDC with some improvements.
>
> original code
> Golang: 22.74s, 364.1Mb
> D LDC2: 29.55s, 463.9Mb
> D DMD:  29.42s, 462.5Mb
> D GDC:  25.28s, 415.3Mb
> Nim:    14.26s, 468.9Mb
>
> with small changes:
> Golang: 22.74s, 364.1Mb
> D LDC2: 15.90s, 389.8Mb
> D DMD:  16.86s, 387.3Mb
> D GDC:  19.48s, 403.8Mb
> Nim:    14.26s, 468.9Mb

With full LTO, I'm seeing an additional 5% boost on Windows (`-flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto`). As they are using gcc LTO for the brainfuck2 benchmark too (https://github.com/kostya/benchmarks/blob/2777925c4e64987e83e9a53478910de080408057/brainfuck2/build.sh#L5), I wouldn't consider it cheating.
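For reference, a build line combining those flags might look like the following; the source and output file names are assumptions, and `-O3 -release` are the usual optimization flags rather than something stated above.

```shell
# Full LTO for both the program and the LTO-enabled druntime/Phobos
# libraries shipped with LDC (file names here are hypothetical).
ldc2 -O3 -release -flto=full \
     -defaultlib=phobos2-ldc-lto,druntime-ldc-lto \
     havlak.d -of=havlak
```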
November 17, 2019
On Sunday, 17 November 2019 at 14:15:00 UTC, Daniel Kozak wrote:
> Sorry, I forgot to insert the link. It is on my GitHub: https://github.com/Kozzi11/benchmarks/tree/improve_d

Now it's faster than the C++ version on my machine:

```
Nim:    12.01s, 478.1Mb
D LDC2: 13.48s, 428.1Mb
C++:    19.97s, 179.3Mb
Golang: 21.90s, 364.7Mb
```

So basically the only critical change was to replace the built-in associative arrays with Appender types?

That's really amazing!
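A minimal sketch (not the benchmark code itself) of the kind of substitution being discussed: when keys are dense integers, a flat array grown through `std.array.Appender` can stand in for a built-in associative array, skipping hashing and per-bucket allocations.

```d
import std.array : appender;

// The straightforward version: an AA keyed by dense integers.
int[int] squaresViaAA(int n)
{
    int[int] m;
    foreach (i; 0 .. n)
        m[i] = i * i;      // each insertion may allocate a bucket
    return m;
}

// The same mapping as a flat array built with an Appender; reserve()
// pre-sizes the buffer when the final length is known up front.
int[] squaresViaAppender(int n)
{
    auto app = appender!(int[])();
    app.reserve(n);
    foreach (i; 0 .. n)
        app.put(i * i);    // amortized growth, no hashing
    return app.data;
}
```

Lookup by index is then a plain array access, which is also friendlier to the GC scan than an AA's bucket graph.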
November 17, 2019
> So basically the only critical change was to replace the built-in associative arrays with Appender types?
>
> That's really amazing!

Not only that. Another change is not pre-filling the `number` AA with UNVISITED, and another is disabling the parallel GC, because it causes a performance decrease here.
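For anyone wanting to try the parallel-GC change: druntime's GC options can be set without touching the allocation code, either on the command line or baked into the binary. A sketch, assuming a druntime recent enough (2.087+) to have parallel marking:

```d
// Compile-time default: embed a runtime option disabling parallel marking.
// The same effect is available at run time, without recompiling:
//     ./app "--DRT-gcopt=parallel:0"
extern(C) __gshared string[] rt_options = ["gcopt=parallel:0"];

void main()
{
    // the program now runs with single-threaded GC marking
}
```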
November 17, 2019
On Sunday, 17 November 2019 at 16:25:52 UTC, Daniel Kozak wrote:
>> So basically the only critical change was to replace the built-in associative arrays with Appender types?
>>
>> That's really amazing!
>
> Not only that. Another change is not pre-filling the `number` AA with UNVISITED, and another is disabling the parallel GC, because it causes a performance decrease here.

Regarding the benefits seen from switching from AAs to Appenders: this is a nice performance improvement, and a nice example of the kind of optimization often available in D programs.

At a high level, I feel I've seen this pattern a number of times. When people starting with D run benchmarks as part of their initial experiments, they naturally start with the simplest and most straightforward programming approaches. Nothing wrong with this. It's a strength of D that quality code can be written quickly.

However, in many cases these simple approaches allocate a fair bit of GC memory, memory that becomes unused quickly and needs to be GC collected. Again, nothing wrong with this. But, I have the impression that many times there is an expectation that such code will perform similarly to code using manually managed memory in other native compiled languages. And often this expectation is not met, as memory allocation and use patterns are a major performance driver.

What often gets missed in these assessments is that D has quite a few mechanisms for better memory management that don't require dropping GC paradigms entirely and moving to fully manually managed memory. Modifying performance-sensitive programs to use these mechanisms is often not hard. The switch here from AAs to Appenders is an example.

Being able to improve program performance in this way is a strength of D. One consideration is that until one has some experience with the language, it may not be obvious that these options exist or which specific changes and approaches can be used. This can lead to perception issues if nothing else.

--Jon