Need a Faster Compressor (page 9)

On Tuesday, 24 May 2016 at 18:30:36 UTC, Stefan Koch wrote: > On Tuesday, 24 May 2016 at 16:22:56 UTC, Timon Gehr wrote: >> On 24.05.2016 01:02, Walter Bright wrote: >>> [...] >> >> Speed. The title of this thread is "Need a faster compressor". >> >>> [...] >> >> No. Just increase the recursion depth by a small number of levels to completely kill the speedups gained by using a faster compression algorithm. >> >>> [...] >> >> Yes, it does. The compiler does not use exponential space to store the AST. >> >>> [...] >> >> It's _exponential_ growth. We don't even want to spend the time and memory required to generate the strings. >> >>>[...] >> >> The reason we have this discussion is that the worst case isn't rare enough to make this argument. Why compress in the first place if mangled names don't grow large in practice? > > I completely agree! > However such a thing will only work if dmd is used as a pre-linker and If we can stablize the hash. > I.E. every run of the compiler generates the same hash. KAMOULOX!

May 24, 2016

Re: Need a Faster Compressor

Posted by Anon
in reply to Timon Gehr

Permalink

Anon

Posted in reply to Timon Gehr

Permalink

On Tuesday, 24 May 2016 at 16:22:56 UTC, Timon Gehr wrote:
> It's _exponential_ growth. We don't even want to spend the time and memory required to generate the strings.
>
> The reason we have this discussion is that the worst case isn't rare enough to make this argument. Why compress in the first place if mangled names don't grow large in practice?

Since Walter is evidently still having a hard time understanding this, I've done a few more pathological cases, comparing LZ4 to Back Referencing Named Types.

For BRNT I manually replaced struct names in the mangling with N<number> numeric identifiers for all but one of their appearances, to simulate what could actually be done by the compiler. No other mangling changes (e.g., for eponymous templates or hidden types) were applied.

auto foo(T)(T x)
{
    static struct Vold { T payload; }
    return Vold(x);
}

At a chain of length 10:
foo(5).foo.foo.foo.foo.foo.foo.foo.foo.foo

current: Generate a 57_581 byte symbol
lz4 -9: Generate a 57_581 byte symbol, then compress it to 412 bytes
lzma -9: Generate a 57_581 byte symbol, then compress it to 239 bytes
BRNT: Generate a 332 byte symbol

Upping it to twelve levels:
foo(5).foo.foo.foo.foo.foo.foo.foo.foo.foo.foo.foo

current: Generate a 230_489 byte symbol
lz4 -9: Generate a 230_489 byte symbol, then compress it to 1128 bytes
lzma -9: Generate a 230_489 byte symbol, then compress it to 294 bytes
BRNT: Generate a 393 byte symbol

Upping it to fourteen levels:

I'm too lazy to do more manual BRNT, so beyond this point its numbers are estimated based on the previously established fact that BRNT symbols grow linearly.

current: Generate a 922_127 byte symbol
lz4 -9: Generate a 922_127 byte symbol, then compress it to 4012 bytes
lzma -9: Generate a 922_127 byte symbol, compress it to 422 bytes
BRNT: Generate a ~457 byte symbol

Upping it to sixteen levels:

current: Generate a 3_688_679 byte symbol
lz4 -9: Generate a 3_688_679 byte symbol, then compress it to 15535 bytes
lzma -9: Generate a 3_688_679 byte symbol, then compress it to 840 bytes
BRNT: Generate a ~527 byte symbol

I want to let that sink in: in the general case, BRNT beats even **LZMA**.

As if winning the compactness war while still being a symbol linkers won't have problems with wasn't enough, this approach also avoids generating giant symbols in the first place. Compression and/or hashing cannot do that. If D really wants to, it can compress symbols, but it *needs* to fix this problem properly *first*.

On 5/24/2016 12:16 PM, Anon wrote: > As if winning the compactness war while still being a symbol linkers won't have > problems with wasn't enough, this approach also avoids generating giant symbols > in the first place. Compression and/or hashing cannot do that. If D really wants > to, it can compress symbols, but it *needs* to fix this problem properly *first*. Because identifiers are generated from substrings that don't know about each other, to do your algorithm a scan of the combined string will still be necessary. In any case, you should be able to implement your algorithm as a replacement for id_compress(). How about giving it a go?

On 5/24/2016 4:55 AM, Era Scarecrow wrote: > So Walter, big question: How big before the compressor kicks in? What's the > minimum size of the input string the compressor should expect? It's currently set at 64 in the PR. But that number is just picked out of the air. > I assume speed is more important than a little loss in compression. Yes, but it's a judgement call.

On 5/24/2016 9:22 AM, Timon Gehr wrote: > Yes, it does. The compiler does not use exponential space to store the AST. BTW, all types in dmd have a 'deco' string which is the AST linearized as a string. This string is also used to build the mangled names. All the deco's are put into a hashtable, and then all types can be uniquely identified by their address in the hashtable. This reduces type comparisons to a single pointer comparison.

On Monday, 23 May 2016 at 20:34:19 UTC, Walter Bright wrote: > On 5/23/2016 12:32 PM, Timon Gehr wrote: >> Isn't it enough to create >> references to previously embedded mangled symbols (i-th symbol already mangled) >> while generating the mangled string? > > That's been covered upthread, and VC++/g++ do that. But as shown, it is inadequate. There's a lot more redundancy than just the identifiers. Yes, not all redundancy is covered. However, I think all non-pathologic blow-ups are avoided. Let's look at the properties of the Itanium C++ ABI mangling: 1. Guaranteed linear size name for recursive types that have exponential size names. 2. Compression speed can be linear to the size of the compressed name (i.e. no need to create an exponential garbage name just to compress it). 3. Decompression easy, possible without tools and speed linear to size of the compressed name. 4. Name retains some redundancy (in itself and when compared to other mangled names), well suited for a follow-up compression. Compare this to apply some arbitrary compression algorithm to a name of exponential size: 1. Exponential size names remain exponential size. Yes, up until the view size, some compression algorithms guarantee that, but they can not guarantee it in the general case. 2. Compression needs the whole name first. 3. Decompression only possible with tools. 4. Multiple mangled names can not be compressed well.

On Tuesday, 24 May 2016 at 20:07:02 UTC, Walter Bright wrote: > On 5/24/2016 4:55 AM, Era Scarecrow wrote: >> So Walter, big question: How big before the compressor kicks in? What's the minimum size of the input string the compressor should expect? > > It's currently set at 64 in the PR. But that number is just picked out of the air. 64 bytes? Hmm I had the impression it was going to be far larger...

On Tuesday, 24 May 2016 at 20:16:32 UTC, Walter Bright wrote: > On 5/24/2016 9:22 AM, Timon Gehr wrote: >> Yes, it does. The compiler does not use exponential space to store the AST. > > BTW, all types in dmd have a 'deco' string which is the AST linearized as a string. This string is also used to build the mangled names. > > All the deco's are put into a hashtable, and then all types can be uniquely identified by their address in the hashtable. This reduces type comparisons to a single pointer comparison. There's no reason not to use the compression in the deco name. Just make sure the references are relative and you're set. No exponential space to store the AST. Please, if you change the ABI, do it right the first time. Some arbitrary compression algorithm doesn't solve the problem.

On 5/24/2016 1:27 PM, Guillaume Boucher wrote: > There's no reason not to use the compression in the deco name. Just make sure > the references are relative and you're set. Not totally, since you'd like to refer to types that are common across deco names.

On Tuesday, 24 May 2016 at 20:39:41 UTC, Walter Bright wrote: > On 5/24/2016 1:27 PM, Guillaume Boucher wrote: >> There's no reason not to use the compression in the deco name. >> Just make sure >> the references are relative and you're set. > > Not totally, since you'd like to refer to types that are common across deco names. On Tuesday, 24 May 2016 at 20:39:41 UTC, Walter Bright wrote: A serious question, do these templates even need a mangle ? Is anyone ever going to call them from a different module ? I don't think you could do that because those are voldemort types. Those symbols are only temporary aren't they ?

Forums