May 25, 2020
On Monday, 25 May 2020 at 05:55:07 UTC, Walter Bright wrote:
> On 5/24/2020 8:19 PM, Adam D. Ruppe wrote:
>> I ran a test to compare the two implementations. First was comparing 10,000 references to the same instance.
>
> Thank you. This is good information. I'm curious how the new staticMap compares:
>
> https://github.com/dlang/phobos/pull/7490

This has brought down the compile time of my test indeed.
Almost but not quite down to `...`

Let me post my test.

I measured that for my test your staticMap is 1.2x faster than the regular the version before the pull.

--
  './dmd.sh sm.d -c -version=Walter' ran
    1.21 ± 0.10 times faster than './dmd.sh sm.d -c'
--


I measured that our `...` is about 1.5x faster than  than your improved staticMap

--
  './dmd.sh sm.d -c -version=DotDotDot' ran
    1.49 ± 0.18 times faster than './dmd.sh sm.d -c -version=Walter'
--

Which should make our version about 2 times faster than the baseline
and indeed

--
  './dmd.sh sm.d -c -version=DotDotDot' ran
    2.05 ± 0.23 times faster than './dmd.sh sm.d -c'
--



The source used file used for the benchmark is here:
https://gist.github.com/UplinkCoder/9772f27cf088f08901b44fb57935aaaf

The dmd used is at commit 2f23e3348e9a927c025637a4852d4d77beb997d3
which is on manu's fork of the compiler.
May 25, 2020
On 5/25/20 1:59 AM, FeepingCreature wrote:
> On Sunday, 24 May 2020 at 23:40:48 UTC, Andrei Alexandrescu wrote:
>> Consider staticMap's current version (there are a few upcoming changes that don't change the point):
>>
>> ...
>>
>> In the instantiation staticMap!(pred, A, B, C, D, E, F, G, H), recursion will instantiate:
>>
>> staticMap!(pred, A, B, C, D)
>> staticMap!(pred, A, B)
>> staticMap!(pred, A)
>> staticMap!(pred, B)
>> staticMap!(pred, C, D)
>> staticMap!(pred, C)
>> staticMap!(pred, D)
>> staticMap!(pred, E. F, G, H)
>> staticMap!(pred, E, F)
>> staticMap!(pred, E)
>> staticMap!(pred, F)
>> staticMap!(pred, G, H)
>> staticMap!(pred, G)
>> staticMap!(pred, H)
>>
>> Total: 15 instantiations including original. This is because a full tree with 8 leaves is being created. Generally, a tree with N leaves has about 2N - 1 nodes total (exactly so when N is a power of 2).
>>
>> The more naive version would simply use linear decomposition:
>> ...
>> There' no more need to handle the 1-element case, which is already a plus. But that's not the topic here. The instantiations created are:
>>
>> staticMap!(pred, B, C, D, E, F, G, H)
>> staticMap!(pred, C, D, E, F, G, H)
>> staticMap!(pred, D, E, F, G, H)
>> staticMap!(pred, E, F, G, H)
>> staticMap!(pred, F, G, H)
>> staticMap!(pred, G, H)
>> staticMap!(pred, H)
>>
>> Total: 8 instantiations including original. That's about half the current number of instantiations.
>>
>> So it seems divide et impera doesn't quite help with certain recursive templates.
> 
> To add to the good previous points: the divide-and-conquer strategy can take advantage of caching for the leaf instantiations with one parameter. That is, 8 template instantiations can trivially be reused, and the others can sometimes be reused if they hit a common subtree. The linear instantiation can only be reused once or twice at the very end.

Not sure I understand. A,...,H are assumed to be distinct. The number of templates generated is the same regardless of memoization.
May 25, 2020
On 5/25/20 2:18 AM, FeepingCreature wrote:
> On Monday, 25 May 2020 at 03:19:51 UTC, Adam D. Ruppe wrote:
>> All this was done on a stock DMD64 D Compiler v2.091.1 on linux. Several enhancements were merged to dmd master recently that may change the results btw.
>>
>> quick test: simple impl: 2s, 1.57G. phobos impl: 2.3s, 1.97G
>>
>> hand impl: 1.1s, 725M. w/ static if: 1.2s, 775M
>>
>> Decent time improvement across the board... but the binary got bigger at 6.4M :( (prolly druntime's fault more than anything else but idk, im not focusing on that rn)
> 
> Could you also try with this version, please?
> 
> template staticMap(alias F, T...)
> {
>      mixin(staticMapHelper!(T.length));
> }
> 
> private enum staticMapHelper(size_t length) = staticMapHelperGen!(length);
> 
> private string staticMapHelperGen(size_t length)()
> {
>      string res = "alias staticMap = AliasSeq!(";
>      static foreach (i; 0 .. length)
>      {
>          if (i) res ~= ", ";
>          res ~= "F!(T[" ~ i.stringof ~ "])";
>      }
>      res ~= ");";
>      return res;
> }

Simpler:

private string staticMapHelperGen(size_t length)()
{
    string res = "alias staticMap = AliasSeq!(";
    static foreach (i; 0 .. length)
        res ~= "F!(T[" ~ i.stringof ~ "]),";
    res ~= ");";
    return res;
}
May 25, 2020
On 5/25/20 3:15 AM, Max Samukha wrote:
> On Monday, 25 May 2020 at 06:18:13 UTC, FeepingCreature wrote:
> 
>> static foreach and stringof were used so I didn't have to pull in `format` or `to!string`, which had issues in std.meta.
>>
>> I'm seeing some improvement over 2.086 at least; it should be about equivalent to the hand-unrolled version in master.
> 
> How could you miss the state-of-the-art?) Preallocate the result string and use the string counter hack!
> http://dpldocs.info/this-week-in-d/Blog.Posted_2020_05_11.html#dmd-and-static-foreach 

static foreach is quadratic? Is this a matter of principle, or QoI?

May 25, 2020
On 5/25/20 9:00 AM, Adam D. Ruppe wrote:
> On Monday, 25 May 2020 at 06:18:13 UTC, FeepingCreature wrote:
>> private enum staticMapHelper(size_t length) = staticMapHelperGen!(length);
> 
> lol, also called it staticMapHelper when doing this myself!
> 
> Solid result though: 2s, 810MB.
> 
> I'm kinda surprised it beat the crap out of my mixin version.
> 
> Mine was 3s, 1.2G using a mixin expression (alias staticMap = mixin(helper()). Changing it to a mixin declaration (the alias is in the mixin, like yours) brought it to 2.2s, 924MB. Interesting. And changing the helper from a template to a function gives another small win: 2.2s but 906MB.

So it seems a mixin-based version + Max's precalculated length = win!

May 25, 2020
On Monday, 25 May 2020 at 16:04:24 UTC, Andrei Alexandrescu wrote:
> On 5/25/20 3:15 AM, Max Samukha wrote:
>> On Monday, 25 May 2020 at 06:18:13 UTC, FeepingCreature wrote:
>> 
>>> static foreach and stringof were used so I didn't have to pull in `format` or `to!string`, which had issues in std.meta.
>>>
>>> I'm seeing some improvement over 2.086 at least; it should be about equivalent to the hand-unrolled version in master.
>> 
>> How could you miss the state-of-the-art?) Preallocate the result string and use the string counter hack!
>> http://dpldocs.info/this-week-in-d/Blog.Posted_2020_05_11.html#dmd-and-static-foreach
>
> static foreach is quadratic? Is this a matter of principle, or QoI?

Yes it is.
I did present this fact at the pre-event to Dconf 2018.

As for your other question it's both. The QoI is not stellar.
But it is fundamentally limited by having to create independent scopes.

--
Stefan
May 25, 2020
On Monday, 25 May 2020 at 16:07:56 UTC, Stefan Koch wrote:

> I did present this fact at the pre-event to Dconf 2018.

Could have been 19 ?


May 25, 2020
On Monday, 25 May 2020 at 15:59:57 UTC, Andrei Alexandrescu wrote:
> On 5/25/20 1:59 AM, FeepingCreature wrote:
>> To add to the good previous points: the divide-and-conquer strategy can take advantage of caching for the leaf instantiations with one parameter. That is, 8 template instantiations can trivially be reused, and the others can sometimes be reused if they hit a common subtree. The linear instantiation can only be reused once or twice at the very end.
>
> Not sure I understand. A,...,H are assumed to be distinct. The number of templates generated is the same regardless of memoization.

I think the idea is that you can potentially re-use template instances across separate calls to staticMap. So for example, if you have `staticMap!(A, B, C, D)` and `staticMap!(A, B, E, F)`, you get:

staticMap!(pred, A, B, C, D)
staticMap!(pred, A, B)
staticMap!(pred, A)
staticMap!(pred, B)
staticMap!(pred, C, D)
staticMap!(pred, C)
staticMap!(pred, D)

staticMap!(pred, A, B, E, F)
staticMap!(pred, E, F)
staticMap!(pred, E)
staticMap!(pred, F)
May 25, 2020
On Monday, 25 May 2020 at 16:04:24 UTC, Andrei Alexandrescu wrote:
> static foreach is quadratic? Is this a matter of principle, or QoI?

I have a hard time seeing why it HAS to be this way. Regular tuple foreach is better in most cases, string mixins are better in most cases (if written with appropriate care).

There might be some cases where it is fundamental - I just honestly don't know - and it is possible automatically differentiating those cases is impossible and/or more trouble than it is worth, but I continue to believe our primary problem is the implementation right now.

The problem is idk what the fix is. I spent a couple hours looking at the code a couple weeks ago and didn't find a solution. But that doesn't mean there isn't one.
May 25, 2020
On 25.05.20 18:04, Andrei Alexandrescu wrote:
> On 5/25/20 3:15 AM, Max Samukha wrote:
>> On Monday, 25 May 2020 at 06:18:13 UTC, FeepingCreature wrote:
>>
>>> static foreach and stringof were used so I didn't have to pull in `format` or `to!string`, which had issues in std.meta.
>>>
>>> I'm seeing some improvement over 2.086 at least; it should be about equivalent to the hand-unrolled version in master.
>>
>> How could you miss the state-of-the-art?) Preallocate the result string and use the string counter hack!
>> http://dpldocs.info/this-week-in-d/Blog.Posted_2020_05_11.html#dmd-and-static-foreach 
> 
> 
> static foreach is quadratic? Is this a matter of principle, or QoI?
> 

QoI. The original `static foreach` implementation in my compiler frontend runs in linear time, even though it is implemented in a very similar way. Note that even `iota(n).array` has superlinear running time with DMD's CTFE.

The implementation of `static foreach` in DMD is based on lowering, and the existing "tuple" foreach implementation. I suspect the culprit for bad performance for very large `static foreach` is a slow expression "tuple" implementation, but I am not sure.
Therefore, maybe performance could be improved by modifying the foreach expansion code such that it can accept a CTFE array instead of an expression tuple. One could also add a special-case implementation for range `static foreach` to work around inefficient CTFE.