Need a Faster Compressor

The current one is effective, but slow: https://github.com/dlang/dmd/blob/master/src/backend/compress.c Anyone want to make a stab at making it faster? Changing the format is fair game, as well as down and dirty assembler if that's what it takes. So really, how good are you at fast code?

On Saturday, 21 May 2016 at 21:12:15 UTC, Walter Bright wrote: > So really, how good are you at fast code? I recall being able to get a marginal boost in speed by assigning the first 1-2 bytes in preassigned buckets (256 & 65536 respectively), so you have an immediate 1-2 byte matches (skipping the chafe so to speak). Unfortunately memory-wise it's highly inefficient and requires 2 passes over the data, and probably doesn't play well with cache. I assume this is related to compressing symbols thread? I mentioned possibly considering the LZO library.

On 21.05.2016 23:12, Walter Bright wrote: > The current one is effective, but slow: > > https://github.com/dlang/dmd/blob/master/src/backend/compress.c > > Anyone want to make a stab at making it faster? Changing the format is > fair game, as well as down and dirty assembler if that's what it takes. > > So really, how good are you at fast code? Why is longest_match Ω(nm) instead of O(n+m) (e.g. KMP)?

On 21.05.2016 23:37, Timon Gehr wrote: > On 21.05.2016 23:12, Walter Bright wrote: >> The current one is effective, but slow: >> >> https://github.com/dlang/dmd/blob/master/src/backend/compress.c >> >> Anyone want to make a stab at making it faster? Changing the format is >> fair game, as well as down and dirty assembler if that's what it takes. >> >> So really, how good are you at fast code? > > Why is longest_match Ω(nm) instead of O(n+m) (e.g. KMP)? (Those are worst case bounds, not sure what's better on average on D symbol strings.)

On Saturday, 21 May 2016 at 21:12:15 UTC, Walter Bright wrote: > The current one is effective, but slow: > > https://github.com/dlang/dmd/blob/master/src/backend/compress.c > > Anyone want to make a stab at making it faster? Changing the format is fair game, as well as down and dirty assembler if that's what it takes. > > So really, how good are you at fast code? Is LZ4 acceptable ? If so the current C implementation is hard to beat. And maybe it's best to just copy it in.

On 5/21/2016 2:27 PM, Era Scarecrow wrote: > I assume this is related to compressing symbols thread? Yes. > I mentioned possibly considering the LZO library. Give it a try, see if it pans out. Other possibilities are the lookback dictionary is fixed at 1023 characters. Experimenting and benchmarking can indicate if it should be more or less. Perhaps 512 will yield just as good compression on real world identifier strings. Or try another scheme entirely, like huffman coding.

On 5/21/2016 2:37 PM, Timon Gehr wrote: > Why is longest_match Ω(nm) instead of O(n+m) (e.g. KMP)? I don't understand the terms you use, but as to the "why" it is based on what I knew about LZ77 compression. I don't pretend to be an expert on compression, and this is what I churned out. As mentioned elsewhere, the C++ mangling scheme has a primitive and ineffective LZ77 scheme built in, so I wouldn't waste time on that. A quick google search on finding the longest match in a string did not turn up anything obvious. If you want to throw your hat (i.e. expertise) into the ring and post a faster compressor, please do so!

On Saturday, 21 May 2016 at 21:12:15 UTC, Walter Bright wrote: > The current one is effective, but slow: > > https://github.com/dlang/dmd/blob/master/src/backend/compress.c > > Anyone want to make a stab at making it faster? Changing the format is fair game, as well as down and dirty assembler if that's what it takes. > > So really, how good are you at fast code? Could you post a link to code you use for benchmarking ?

May 21, 2016

Re: Need a Faster Compressor

Posted by Guillaume Boucher
in reply to Walter Bright

Permalink

Guillaume Boucher

Posted in reply to Walter Bright

Permalink

On Saturday, 21 May 2016 at 22:07:27 UTC, Walter Bright wrote:
> As mentioned elsewhere, the C++ mangling scheme has a primitive and ineffective LZ77 scheme built in, so I wouldn't waste time on that.

Sorry if I didn't memorize everything in this forum from the last 20 years, can you give a link to some reasoning?

The only thing I found are some statements in https://github.com/dlang/dmd/pull/5793:

> Note that the VC++ and g++ name mangling schemes for C++ each use a similar, but primitive (and fairly ineffective) form of compression. It's like people who invent their own crypto algorithms.

The name mangling in the Itanium C++ ABI is similar to LZ77 but heavily optimized for the typical use case in symbol names.  I don't really see how the general LZ77 would be better.  It's almost as if using information about the actual data leads to a better compression scheme.

> This is what C++ does, and results are pretty poor compression. Scanning to determine the backward references would consume the same order of time, anyway.

It uses a constant amount of time if implemented correctly, which is much faster.

Forums