Need a Faster Compressor (page 3)

On Sunday, 22 May 2016 at 08:50:47 UTC, Walter Bright wrote:
> On 5/21/2016 11:26 PM, Era Scarecrow wrote:
> > With 1 Million iterations:
> >
> > new_compress: TickDuration(311404604)
> > id_compress: TickDuration(385806589)
> >
> > Approx 20% increase in speed (if i'm reading and did this right).
>
> It is promising. Need more!

Well, there's other good news. While I was working with about 2Mb for the data size in general, it looks like it can be reduced to about 8k-16k and run entirely on the stack. It doesn't appear to gain any speed, or any speed gained is lost with better memory management.

Although based on how things look, id_compress might perform better with a max window of 255 and max match of 127. I'll give the original one a tweak based on my own findings and see if it turns out true.
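For readers unfamiliar with the two knobs mentioned above, here is a minimal sketch of an LZ77-style longest-match search with a bounded window and a bounded match length. It is a generic illustration, not the code in dmd's compress.c; the function name `find_longest_match` and its parameters are hypothetical. The point it shows is that the window size caps how many earlier positions get scanned per input byte, which is where most of the time goes in a naive search.

```python
def find_longest_match(data: bytes, pos: int, max_window: int, max_match: int):
    """Return (offset, length) of the longest earlier match for data[pos:].

    Generic LZ77-style search (an illustration, not dmd's id_compress):
    only the previous `max_window` bytes are considered as match starts,
    and a match is never extended past `max_match` bytes.
    """
    best_offset, best_length = 0, 0
    start = max(0, pos - max_window)          # oldest position we may look at
    for candidate in range(start, pos):       # at most max_window candidates
        length = 0
        while (length < max_match
               and pos + length < len(data)
               and data[candidate + length] == data[pos + length]):
            length += 1
        if length > best_length:              # keep the first longest match
            best_offset, best_length = pos - candidate, length
    return best_offset, best_length


if __name__ == "__main__":
    sample = b"std.range.std.range.std.algorithm"
    # Wide window: the repeat that starts 10 bytes back is found.
    print(find_longest_match(sample, 10, max_window=255, max_match=127))
    # Tiny window: nothing usable is in range, so no match is reported.
    print(find_longest_match(sample, 10, max_window=4, max_match=127))
```

With limits of 255 and 127, an offset fits in one byte and a length in seven bits. Whether id_compress encodes matches that way is an assumption here, not something stated in the thread.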

On Sunday, 22 May 2016 at 09:07:07 UTC, Era Scarecrow wrote:
> On Sunday, 22 May 2016 at 08:50:47 UTC, Walter Bright wrote:
> > On 5/21/2016 11:26 PM, Era Scarecrow wrote:
> > > With 1 Million iterations:
> > >
> > > new_compress: TickDuration(311404604)
> > > id_compress: TickDuration(385806589)
> > >
> > > Approx 20% increase in speed (if i'm reading and did this right).
> >
> > It is promising. Need more!
>
> Although based on how things look, the id_compress might perform better with a max window of 255 and max match of 127. I'll give the original one a tweak based on my own findings and see if it turns out true.

Oh noes! Making the tiny change to id_compress and it's faster!!! Now i can't take over the world!!!

new_compress: TickDuration(306321939) 21% faster
modified id_compress: TickDuration(269275629) 31% faster

On Sunday, 22 May 2016 at 09:19:54 UTC, Era Scarecrow wrote:
> On 5/21/2016 11:26 PM, Era Scarecrow wrote:
> > With 1 Million iterations:
> >
> > id_compress: TickDuration(385806589) (original/baseline)
>
> modified id_compress: TickDuration(269275629) 31% faster

And shrinking the lookback to a mere 127 makes it faster still, but compression starts taking a hit. Can't shrink it any more anyway.

modified id_compress2: TickDuration(230528074) 41% faster!!
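The percentages quoted in these posts can be checked from the tick counts. The sketch below assumes every run used the same 1 million iterations and takes the original id_compress figure (385806589 ticks) as the baseline; the helper names are hypothetical. It reports both the reduction in elapsed ticks, which is roughly what the "% faster" figures correspond to, and the throughput ratio, which is a larger number for the same data.

```python
BASELINE_TICKS = 385806589  # original id_compress, as posted

# Tick counts as posted in the thread.
RESULTS = {
    "new_compress (first run)": 311404604,
    "new_compress (second run)": 306321939,
    "modified id_compress": 269275629,
    "modified id_compress2": 230528074,
}


def time_reduction_percent(baseline: int, ticks: int) -> float:
    """Percentage of the baseline's elapsed ticks that was saved."""
    return 100.0 * (baseline - ticks) / baseline


def speedup_factor(baseline: int, ticks: int) -> float:
    """How many times more work fits in the same time."""
    return baseline / ticks


if __name__ == "__main__":
    for name, ticks in RESULTS.items():
        saved = time_reduction_percent(BASELINE_TICKS, ticks)
        factor = speedup_factor(BASELINE_TICKS, ticks)
        print(f"{name}: {saved:.1f}% fewer ticks, {factor:.2f}x throughput")
```

The two measures diverge as the gains grow, so it is worth stating which one a "% faster" claim refers to when comparing results across posts.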

On Sunday, 22 May 2016 at 09:43:29 UTC, Era Scarecrow wrote:
> And shrinking the lookback to a mere 127 makes it faster still, but compression starts taking a hit. Can't shrink it any more anyway.
>
> modified id_compress2: TickDuration(230528074) 41% faster!!

Nice results.

On Sunday, 22 May 2016 at 09:48:32 UTC, Stefan Koch wrote:
> Nice results.

This is with one sample to test against. I need a much bigger sample, but I'm not sure how to generate/find it to give it a full run for its money. Not to mention there's a huge memory leak while doing the tests since I'm trying not to have malloc/free in the code results as much as possible.

Although with a 30%-40% boost it might be enough to be good enough, and has minimal code changes.

Question Walter (or Andrei): Would it be terrible to have lower characters as part of the identifier string? I'm referring to characters under 32 (control characters, especially the \r, \n and \t).

May 22, 2016

Re: Need a Faster Compressor

Posted by Marco Leise
in reply to Era Scarecrow

On Sun, 22 May 2016 09:57:20 +0000,
Era Scarecrow <rtcvb32@yahoo.com> wrote:

> On Sunday, 22 May 2016 at 09:48:32 UTC, Stefan Koch wrote:
> > Nice results.
> 
>   This is with one sample to test against. I need a much bigger
> sample, but I'm not sure how to generate/find it to give it a
> full run for its money. Not to mention there's a huge memory
> leak while doing the tests since I'm trying not to have
> malloc/free in the code results as much as possible.
> 
>   Although with a 30%-40% boost it might be enough to be good
> enough, and has minimal code changes.
> 
> 
>   Question Walter (or Andrei): Would it be terrible to have lower
> characters as part of the identifier string? I'm referring to
> characters under 32 (control characters, especially the \r, \n
> and \t).

That would be my question, too. What is the allowed alphabet? Control characters might break output of debuggers, and spaces would also look confusing. So I guess everything in the range [33..127] (95 symbols)?

-- 
Marco

On Sunday, 22 May 2016 at 10:12:11 UTC, Marco Leise wrote:
> > Question Walter (or Andrei): Would it be terrible to have lower characters as part of the identifier string? I'm referring to characters under 32 (control characters, especially the \r, \n and \t).
>
> That would be my question, too. What is the allowed alphabet? Control characters might break output of debuggers, and spaces would also look confusing. So I guess everything in the range [33..127] (95 symbols)?

Actually, if compressed text (which uses symbols 128-255) is okay, then that gives us closer to 223 symbols to work with, not 95. Curiously, the extra 95 symbols are just enough to keep compression up and bring the speed gain to about 37%, with what looks like no disadvantages. Then again, I still need a larger set to test things on, but it's very promising!
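The symbol counts in this exchange are easy to verify. The sketch below only does the arithmetic on byte ranges; it assumes nothing about which bytes the mangler or linkers actually accept. One detail worth noting: read inclusively, 33..127 does contain 95 values, but 127 is DEL, itself a control character, so the printable non-space ASCII set is 33..126 with 94 values.

```python
def count_inclusive(lo: int, hi: int) -> int:
    """Number of byte values in the closed range lo..hi."""
    return hi - lo + 1


# ASCII code points that are printable and not the space character.
PRINTABLE_NO_SPACE = [c for c in range(128)
                      if chr(c).isprintable() and chr(c) != " "]

if __name__ == "__main__":
    print("33..127 inclusive:", count_inclusive(33, 127))
    print("printable, no space:", len(PRINTABLE_NO_SPACE),
          "from", PRINTABLE_NO_SPACE[0], "to", PRINTABLE_NO_SPACE[-1])
    print("128..255 inclusive:", count_inclusive(128, 255))
    print("33..255 inclusive:", count_inclusive(33, 255))
```

The 223 figure matches 33..255 inclusive, that is, the 95 values below 128 plus all 128 byte values with the high bit set.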

On Saturday, 21 May 2016 at 21:12:15 UTC, Walter Bright wrote:
> The current one is effective, but slow:
>
> https://github.com/dlang/dmd/blob/master/src/backend/compress.c
>
> Anyone want to make a stab at making it faster? Changing the format is fair game, as well as down and dirty assembler if that's what it takes.
>
> So really, how good are you at fast code?

A quick search led to this SO thread:

https://stackoverflow.com/questions/1138345/an-efficient-compression-algorithm-for-short-text-strings

It links to SMAZ:

https://github.com/antirez/smaz

It uses a codebook primed for English text. Since this is for name mangling, priming for Phobos might be a good idea. E.g. put "std.", "core.", "etc." into the codebook.
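To make the codebook idea concrete, here is a minimal sketch of a primed substitution coder. It is not SMAZ's actual format and not anything in dmd: the byte layout (codes starting at 0x80), the function names and the ASCII-only restriction are all assumptions chosen to keep the example short. The three codebook entries are the ones suggested above; real mangled names are encoded differently from dotted module paths, so a real codebook would have to be built from actual mangled symbols.

```python
# Hypothetical codebook; entries taken from the suggestion in the post.
CODEBOOK = ["std.", "core.", "etc."]
FIRST_CODE = 0x80  # assumption: codes live in the high-bit byte range


def compress(name: str) -> bytes:
    """Replace codebook entries with one-byte codes; copy other ASCII as-is."""
    out = bytearray()
    i = 0
    while i < len(name):
        for code, entry in enumerate(CODEBOOK):
            if name.startswith(entry, i):
                out.append(FIRST_CODE + code)
                i += len(entry)
                break
        else:
            ch = ord(name[i])
            if ch >= FIRST_CODE:
                raise ValueError("this sketch only handles ASCII input")
            out.append(ch)
            i += 1
    return bytes(out)


def decompress(blob: bytes) -> str:
    """Inverse of compress(): expand codes, pass literals through."""
    return "".join(CODEBOOK[b - FIRST_CODE] if b >= FIRST_CODE else chr(b)
                   for b in blob)


if __name__ == "__main__":
    name = "std.algorithm.iteration.map"
    packed = compress(name)
    print(len(name), "->", len(packed), "bytes")
    print(decompress(packed) == name)
```

Because the lookup is a fixed table with no history to search, this kind of coder is cheap per byte. The trade-off is that it only helps with strings the table was primed for.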

On Sunday, 22 May 2016 at 10:40:46 UTC, qznc wrote:
> On Saturday, 21 May 2016 at 21:12:15 UTC, Walter Bright wrote:
> It links to SMAZ:
>
> https://github.com/antirez/smaz
>
> It uses a codebook primed for english text. Since this is for name mangling, priming for Phobos might be a good idea. E.g. put "std.", "core.", "etc." into the codebook.

Curiously, premaking a few primers like that for different languages can give similar performance with zlib and gzip. I recall doing that back in 2009 when I had xml pages to deal with: it would compress using all my pages, find the best one and use that. Then decompression was simply a matter of finding the right CRC32 that matched it, and away it went!

Although, priming something similar for the current id_compress early on wouldn't hurt. I see 'unittest' quite a bit; priming with common templates would probably help too, for minor compression benefits.
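zlib supports this kind of priming directly through a preset dictionary. The sketch below uses the `zdict` parameter of Python's `zlib.compressobj` and `zlib.decompressobj`. The primer contents and the sample string are made up for illustration, and this is not a reconstruction of the 2009 setup described above. Both sides must use the same primer bytes, otherwise decompression fails.

```python
import zlib

# Hypothetical primer: substrings expected to recur in the inputs.
PRIMER = b"unittest core.stdc.stdlib std.range.primitives std.algorithm.iteration"


def compress_primed(data: bytes, primer: bytes = PRIMER) -> bytes:
    """Deflate `data` with the primer preloaded as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=primer)
    return comp.compress(data) + comp.flush()


def decompress_primed(blob: bytes, primer: bytes = PRIMER) -> bytes:
    """Inflate a blob produced by compress_primed() with the same primer."""
    decomp = zlib.decompressobj(zdict=primer)
    return decomp.decompress(blob) + decomp.flush()


if __name__ == "__main__":
    sample = b"std.algorithm.iteration.map!(std.range.primitives.front)"
    plain = zlib.compress(sample, 9)
    primed = compress_primed(sample)
    print("input:", len(sample), "plain zlib:", len(plain), "primed:", len(primed))
    print(decompress_primed(primed) == sample)
```

For short inputs the primer matters because there is otherwise no history for the compressor to match against. The same reasoning applies to priming an LZ-style window for id_compress.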

On Sat, 21 May 2016 14:12:15 -0700, Walter Bright <newshound2@digitalmars.com> wrote:
> The current one is effective, but slow:
>
> https://github.com/dlang/dmd/blob/master/src/backend/compress.c
>
> Anyone want to make a stab at making it faster? Changing the format is fair game, as well as down and dirty assembler if that's what it takes.
>
> So really, how good are you at fast code?

I can write something 2 times more compact and 16 times faster (in microbenchmarks) for the 10 times nested "testexpansion" test case from https://issues.dlang.org/show_bug.cgi?id=15831#c0 using BSD licensed lz4 compression. Does it have to be C++ code or can it be D code?

-- 
Marco
