May 19, 2016
On Thursday, 19 May 2016 at 23:56:46 UTC, poliklosio wrote:
> (...)
> Clearly the reason for building such a gigantic string was some sort of repetition. Detect the repetition on-the-fly to avoid processing all of it. This way the generated name is already compressed.

It seems like dynamic programming can be useful.
May 20, 2016
On Thursday, 19 May 2016 at 22:46:02 UTC, Adam D. Ruppe wrote:
> On Thursday, 19 May 2016 at 22:16:03 UTC, Walter Bright wrote:
>> Using 64 character random strings will make symbolic debugging unpleasant.
>
> Using 6.4 megabyte strings already makes symbolic debugging unpleasant.
>
> The one thing that worries me about random strings is that it needs to be the same across all builds, or you'll get random linking errors when doing package-at-a-time or whatever (dmd already has some problems like this!). But building a gigantic string then compressing or hashing it still sucks... what we need is an O(1) solution that is still unique and repeatable.

Good point. Using a SHA1 derived from the string instead of a GUID is imho better. It has the advantage of repeatability, is even shorter and not very expensive to generate.
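To make the idea concrete, here is a minimal sketch in Python (not dmd code). The function name `hashed_symbol` and the `_DH` prefix are made up for illustration; only `hashlib.sha1` is a real API. Note that the hash still has to read the whole long name once, so this is repeatable and short, but not O(1).

```python
import hashlib

def hashed_symbol(mangled: str) -> str:
    """Return a short, repeatable stand-in for a very long mangled name."""
    digest = hashlib.sha1(mangled.encode("utf-8")).hexdigest()  # 40 hex chars
    # "_DH" is a hypothetical prefix, not an actual dmd mangling rule.
    return "_DH" + digest

# Stand-in for a gigantic symbol name built from repeated pieces.
long_name = "S3std5range" * 100_000
short = hashed_symbol(long_name)
print(len(long_name), "->", len(short))
```

The same input always gives the same output, which is the property a GUID or random string lacks.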

May 20, 2016
On Thursday, 19 May 2016 at 22:16:03 UTC, Walter Bright wrote:
> On 5/19/2016 6:45 AM, Andrei Alexandrescu wrote:
>> I very much advocate slapping a 64-long random string for all Voldemort returns
>> and calling it a day. I bet Liran's code will get a lot quicker to build and
>> smaller to boot.
>
> Let's see how far we get with compression first.
>
>   https://github.com/dlang/dmd/pull/5793
>
> Using 64 character random strings will make symbolic debugging unpleasant.

You could simply add a "trivial" version of the struct + enclosing function name (i.e. once, without repetitions) before or after the random character string. This way you would know which struct it's referring to, it's unique, and you still avoid generating a 5 Exabyte large symbol name just to compress/hash/whatever it.
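A rough sketch of what such a name could look like, in Python rather than compiler code. `tagged_symbol`, the `.`/`_` separators and the example names `chain` and `Result` are all assumptions for illustration; `secrets.token_hex` is a real standard-library call. A truly random suffix like this one differs between compiler runs, which is the repeatability concern raised elsewhere in the thread.

```python
import secrets

def tagged_symbol(function_name: str, struct_name: str) -> str:
    """Readable prefix (stated once, no repetition) plus a random suffix."""
    suffix = secrets.token_hex(32)  # 32 random bytes -> 64 hex characters
    # The "function.struct_suffix" layout is hypothetical.
    return f"{function_name}.{struct_name}_{suffix}"

name = tagged_symbol("chain", "Result")
print(name)
```

The prefix is enough to tell in a debugger which struct is meant, and the name length stays constant no matter how deeply the type is nested.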
May 20, 2016
On Friday, 20 May 2016 at 05:34:15 UTC, default0 wrote:
> On Thursday, 19 May 2016 at 22:16:03 UTC, Walter Bright wrote:
> > (...)
> (...) you still avoid generating a 5 Exabyte large symbol name just to compress/hash/whatever it.

This is why compression is not good enough.
May 20, 2016
On Thursday, 19 May 2016 at 16:41:59 UTC, Andrei Alexandrescu wrote:
> On 05/19/2016 11:56 AM, Georgi D wrote:
>> Making a local copy of chain and moving the structure outside of the
>> method solved the problem and reduced the code size.
>>
>> The stripped size even reduced from the version that did not experience
>> the huge increase. Stripped: 7.4Mb -> 5.5MB
>
> Thanks very much for helping with these measurements! -- Andrei

Just some more numbers:

On the version with the local chain which is 5.5MB

# strings test.local_choose|wc -l
4170
# strings test.local_choose|wc -c
5194012

Which means that of the 5.5MB total binary size, 4.95MB are strings.
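For reference, the 4.95MB figure follows from the `wc -c` output if 1MB is taken as 2^20 bytes (an assumption about the units used here):

```python
strings_bytes = 5_194_012            # output of `strings test.local_choose | wc -c`
strings_mib = strings_bytes / 2**20  # bytes -> MiB
print(f"{strings_mib:.2f} MiB")      # prints "4.95 MiB"
```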

Inspecting the content of the strings, over 98% are symbol names. Is there a way to not have all the symbol names as strings even after the binary was stripped?

I think that some compression for the symbol names is a good idea.

I agree though that an O(n) or better solution for the voldemort types is also needed.


May 20, 2016
On Thursday, 19 May 2016 at 22:16:03 UTC, Walter Bright wrote:
> On 5/19/2016 6:45 AM, Andrei Alexandrescu wrote:
>> I very much advocate slapping a 64-long random string for all Voldemort returns
>> and calling it a day. I bet Liran's code will get a lot quicker to build and
>> smaller to boot.
>
> Let's see how far we get with compression first.
>
>   https://github.com/dlang/dmd/pull/5793
>
> Using 64 character random strings will make symbolic debugging unpleasant.

Unfortunately, the PR doesn't cure the root cause, but only provides a linear 45-55% improvement, which doesn't scale with the exponential growth of the symbol size:


    case             time to compile    du -h    strings file | wc -c
    NG-case before   0m19.708s          339M     338.23M
    NG-case after    0m27.006s          218M     209.35M
    OK-case before   0m1.466s            16M      15.33M
    OK-case after    0m1.856s            11M       9.28M

For more info on the measurements:
https://github.com/dlang/dmd/pull/5793#issuecomment-220550682
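To illustrate why a constant-factor saving does not keep up with exponential growth, here is a toy model in Python. It assumes (purely for illustration, not measured from dmd) that each extra level of nesting doubles the symbol length and that compression halves it; `BASE`, `raw_length` and `compressed_length` are made-up names.

```python
BASE = 64  # hypothetical length of the innermost symbol name

def raw_length(depth: int) -> int:
    """Symbol length when every nesting level doubles the name (assumed model)."""
    return BASE * 2 ** depth

def compressed_length(depth: int, ratio: float = 0.5) -> float:
    """Length after a constant-factor compression (ratio = fraction kept)."""
    return raw_length(depth) * ratio

for depth in (5, 10, 15, 20):
    print(depth, raw_length(depth), compressed_length(depth))
```

Under these assumptions, halving the size only buys one extra nesting level: the compressed name at depth 11 is as long as the uncompressed name at depth 10.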


May 20, 2016
On 5/19/16 6:16 PM, Walter Bright wrote:
> On 5/19/2016 6:45 AM, Andrei Alexandrescu wrote:
>> I very much advocate slapping a 64-long random string for all
>> Voldemort returns
>> and calling it a day. I bet Liran's code will get a lot quicker to
>> build and
>> smaller to boot.
>
> Let's see how far we get with compression first.
>
>    https://github.com/dlang/dmd/pull/5793
>
> Using 64 character random strings will make symbolic debugging unpleasant.

This is a fallacy. I don't think so, at all, when the baseline is an extremely long string. In any given scope there are at most a handful of Voldemort types at play, and it's easy to identify which is which. We've all done so many times with IPs, GUIDs of objects, directories, etc. etc. -- Andrei
May 20, 2016
On Friday, 20 May 2016 at 09:48:15 UTC, ZombineDev wrote:
> On Thursday, 19 May 2016 at 22:16:03 UTC, Walter Bright wrote:
>> On 5/19/2016 6:45 AM, Andrei Alexandrescu wrote:
>>> I very much advocate slapping a 64-long random string for all Voldemort returns
>>> and calling it a day. I bet Liran's code will get a lot quicker to build and
>>> smaller to boot.
>>
>> Let's see how far we get with compression first.
>>
>>   https://github.com/dlang/dmd/pull/5793
>>
>> Using 64 character random strings will make symbolic debugging unpleasant.
>
> Unfortunately, the PR doesn't cure the root cause, but only provides a linear 45-55% improvement, which doesn't scale with the exponential growth of the symbol size:
>
>
>     case             time to compile    du -h    strings file | wc -c
>     NG-case before   0m19.708s          339M     338.23M
>     NG-case after    0m27.006s          218M     209.35M
>     OK-case before   0m1.466s            16M      15.33M
>     OK-case after    0m1.856s            11M       9.28M
>
> For more info on the measurements:
> https://github.com/dlang/dmd/pull/5793#issuecomment-220550682

IMO, the best way forward is:
+ The compiler should lower voldemort types, according to the scheme that Steve suggested (http://forum.dlang.org/post/nhkmo7$ob5$1@digitalmars.com)
+ After that, during symbol generation (mangling) if a symbol starts getting larger than some threshold (e.g. 800 characters), the mangling algorithm should detect that and bail out by generating some unique id instead. The only valuable information that the symbol must include is the module name and location (line and column number) of the template instantiation.
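The second step could look roughly like the following Python sketch (not dmd code). `mangle`, the `module.LxCy.id` layout and the use of SHA-1 for the unique id are assumptions made for illustration; the 800-character threshold is the example value from above. The pieces are hashed incrementally, so the oversized name is never built in memory, although every piece is still visited once.

```python
import hashlib
from typing import Iterable

THRESHOLD = 800  # example threshold in characters

def mangle(parts: Iterable[str], module: str, line: int, column: int) -> str:
    """Join the pieces of a symbol, bailing out to a short id if it gets too long."""
    hasher = hashlib.sha1()
    kept = []    # pieces are only collected while the name is under the threshold
    length = 0
    for part in parts:
        hasher.update(part.encode("utf-8"))
        length += len(part)
        if length <= THRESHOLD:
            kept.append(part)
    if length <= THRESHOLD:
        return "".join(kept)  # short enough: keep the ordinary name
    # Too long: module + location + unique id (hypothetical layout).
    return f"{module}.L{line}C{column}.{hasher.hexdigest()}"

print(mangle(["abc", "def"], "app.main", 12, 5))
print(mangle(("x" * 10 for _ in range(1000)), "app.main", 12, 5))
```

Because the id is derived from the symbol's own pieces rather than from a random source, two compiler runs over the same code would produce the same name.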
May 20, 2016
On Friday, 20 May 2016 at 11:32:16 UTC, ZombineDev wrote:
> IMO, the best way forward is:
> + The compiler should lower voldemort types, according to the scheme that Steve suggested (http://forum.dlang.org/post/nhkmo7$ob5$1@digitalmars.com)
> + After that, during symbol generation (mangling) if a symbol starts getting larger than some threshold (e.g. 800 characters), the mangling algorithm should detect that and bail out by generating some unique id instead. The only valuable information that the symbol must include is the module name and location (line and column number) of the template instantiation.

Location info shouldn't be used. This will break things like interface files and dynamic libraries.
May 20, 2016
On Friday, 20 May 2016 at 11:40:12 UTC, Rene Zwanenburg wrote:
> On Friday, 20 May 2016 at 11:32:16 UTC, ZombineDev wrote:
>> IMO, the best way forward is:
>> + The compiler should lower voldemort types, according to the scheme that Steve suggested (http://forum.dlang.org/post/nhkmo7$ob5$1@digitalmars.com)
>> + After that, during symbol generation (mangling) if a symbol starts getting larger than some threshold (e.g. 800 characters), the mangling algorithm should detect that and bail out by generating some unique id instead. The only valuable information that the symbol must include is the module name and location (line and column number) of the template instantiation.
>
> Location info shouldn't be used. This will break things like interface files and dynamic libraries.

Well... template-heavy code doesn't play well with header files and dynamic libraries. Most of the time templates are used for the implementation of an interface, but template types such as ranges are unusable in function signatures. That's why they're called voldemort types - because they're the ones that can not/must not be named.
Instead dynamic libraries should use stable types such as interfaces, arrays and function pointers, which don't have the aforementioned symbol size problems.

Since we're using random numbers for symbols (instead of the actual names) it would not be possible for such symbols to be part of an interface, because a different invocation of the compiler would produce different symbol names. Such symbols should always be an implementation detail, and not part of an interface. That's why location info would play no role, except for debugging purposes.