May 22, 2016 Re: Need a Faster Compressor
Posted in reply to Era Scarecrow

On 5/22/2016 2:57 AM, Era Scarecrow wrote:
> Question Walter (or Andrei): Would it be terrible to have lower characters as
> part of the identifier string? I'm referring to characters under 32 (control
> characters, especially the \r, \n and \t).
Currently there's a bug in id_compress() where it doesn't take UTF-8 sequences into account, as a subset of them is allowed in identifiers. I rue the day I allowed that in D, but we're kinda stuck with it.
I kinda hate disallowing control characters, because then the compressor gets a bit format-specific.
May 22, 2016 Re: Need a Faster Compressor
Posted in reply to Era Scarecrow

On 5/22/2016 3:32 AM, Era Scarecrow wrote:
> Curiously the extra 95 symbols actually is just enough to keep compression up
> and bring speed to about 37% speed gain, with it looks like no disadvantages.
> Then again I still need a larger set to test things on, but it's very promising!
You're doing the gods' work, Era!
May 22, 2016 Re: Need a Faster Compressor
Posted in reply to Era Scarecrow

On 5/22/2016 3:32 AM, Era Scarecrow wrote:
> [...]
My idea for speeding things up is every time a new character c is matched in pattern[], a bit is set in mask:
ulong mask;
...
mask |= 1UL << (c & 63);
Then, when scanning for longer matches, test:
if (!((1UL << (id[i + matchlen - 1] & 63)) & mask))
    i += matchlen;
because we know that id[i + matchlen - 1] cannot match any of the characters in the current match.
The two expressions look complex, but they should compile to single instructions on x64.
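The skip heuristic above can be sketched as a small C function (names and surrounding structure are illustrative, not dmd's actual id_compress internals): while searching for a match of length matchlen, a last-byte test against the mask lets the scan jump ahead by matchlen positions at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the mask-based skip (illustrative, not dmd's code).
   Searches id[] for an occurrence of the first matchlen bytes of
   pat[], skipping stretches that provably cannot contain one. */
static long find_match(const char *id, size_t idlen,
                       const char *pat, size_t matchlen)
{
    uint64_t mask = 0;

    /* set a bit for each pattern character, folded to 6 bits */
    for (size_t k = 0; k < matchlen; k++)
        mask |= 1ULL << (pat[k] & 63);

    size_t i = 0;
    while (i + matchlen <= idlen) {
        /* Any match of length matchlen starting in [i, i+matchlen-1]
           must include the byte id[i+matchlen-1]; if that byte's bit
           is absent from the mask, skip the whole stretch at once. */
        if (!((1ULL << (id[i + matchlen - 1] & 63)) & mask)) {
            i += matchlen;
            continue;
        }
        if (memcmp(id + i, pat, matchlen) == 0)
            return (long)i;     /* match found at position i */
        i++;
    }
    return -1;                  /* no match of that length */
}
```

The folding to 6 bits means the mask test can give false positives (two characters sharing the same bit), but never false negatives, so the skip is always safe.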
May 22, 2016 Re: Need a Faster Compressor
Posted in reply to Walter Bright

On Sunday, 22 May 2016 at 19:21:13 UTC, Walter Bright wrote:
> On 5/22/2016 3:32 AM, Era Scarecrow wrote:
>> Curiously, the extra 95 symbols are just enough to keep compression up and bring the speed gain to about 37%, with what looks like no disadvantages.
>
> You're doing the gods' work, Era!
Maybe... I have to wonder whether LZ4 will take over instead, from the looks of things. We'll have to see how it performs (and whether we can integrate it). Meanwhile, the main advantage of id_compress (as I'm tinkering with it) is that it makes no changes to the signature/behavior, so taking advantage of its speed boost can be immediate (although it will no doubt compress a bit worse, and demangling needs to be rewritten for it).
Honestly, the majority of the speed gain in id_compress comes simply from having a much smaller window (1023 down to 220), which also gives a smaller footprint for the compressed output (3 bytes down to 2 bytes).
By comparison, the speed gain from my new algorithm (about 20% faster) is pretty constant, and it retains a HUGE window (pre-scans 8k, has a 2k window) that it can find matches in, so it will win on compression.
If LZ4 is half as good as the initial results suggest, then I'd go with it instead.
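The window/footprint tradeoff Era describes can be illustrated with a toy token layout in C (an assumed format for illustration only, not dmd's actual id_compress encoding): with a window of 220 the offset fits in a single byte alongside a one-byte length marker, while a 1023-byte window needs 10 offset bits and pushes every back-reference to 3 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Toy back-reference encodings (assumed layout, for illustration).
   High bit of the first byte marks a back-reference token. */

/* 2-byte token: marker byte (0x80 | length), then a 1-byte offset */
static int encode_small(uint8_t out[2], unsigned offset, unsigned len)
{
    if (offset >= 220 || len >= 128)
        return -1;              /* out of range for the small window */
    out[0] = (uint8_t)(0x80 | len);
    out[1] = (uint8_t)offset;
    return 2;
}

/* 3-byte token: marker byte, then a 10-bit offset over two bytes */
static int encode_big(uint8_t out[3], unsigned offset, unsigned len)
{
    if (offset >= 1024 || len >= 128)
        return -1;              /* out of range for the big window */
    out[0] = (uint8_t)(0x80 | len);
    out[1] = (uint8_t)(offset >> 8);
    out[2] = (uint8_t)(offset & 0xFF);
    return 3;
}
```

Every emitted back-reference saving a byte is why shrinking the window shrinks the output, at the cost of fewer reachable matches.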
May 22, 2016 Re: Need a Faster Compressor
Posted in reply to Rainer Schuetze

On 5/22/2016 10:44 AM, Rainer Schuetze wrote:
> You are right about the symbols using the VC mangling. The test case
> "1.s.s.s.s.s" in the bugreport translated to C++ yields
>
> ?foo@Result@?1???$s@UResult@?1???$s@UResult@?1???$s@UResult@?1???$s@UResult@?1???$s@H@testexpansion@@YA@H@Z@@testexpansion@@YA@U1?1???$s@H@2@YA@H@Z@@Z@@testexpansion@@YA@U1?1???$s@UResult@?1???$s@H@testexpansion@@YA@H@Z@@2@YA@U1?1???$s@H@2@YA@H@Z@@Z@@Z@@testexpansion@@YA@U1?1???$s@UResult@?1???$s@UResult@?1???$s@H@testexpansion@@YA@H@Z@@testexpansion@@YA@U1?1???$s@H@2@YA@H@Z@@Z@@2@YA@U1?1???$s@UResult@?1???$s@H@testexpansion@@YA@H@Z@@2@YA@U1?1???$s@H@2@YA@H@Z@@Z@@Z@@Z@@testexpansion@@YA@U1?1???$s@UResult@?1???$s@UResult@?1???$s@UResult@?1???$s@H@testexpansion@@YA@H@Z@@testexpansion@@YA@U1?1???$s@H@2@YA@H@Z@@Z@@testexpansion@@YA@U1?1???$s@UResult@?1???$s@H@testexpansion@@YA@H@Z@@2@YA@U1?1???$s@H@2@YA@H@Z@@Z@@Z@@2@YA@U1?1???$s@UResult@?1???$s@UResult@?1???$s@H@testexpansion@@YA@H@Z@@testexpansion@@YA@U1?1???$s@H@2@YA@H@Z@@Z@@2@YA@U1?1???$s@UResult@?1???$s@H@testexpansion@@YA@H@Z@@2@YA@U1?1???$s@H@2@YA@H@Z@@Z@@Z@@Z@@Z@QAEXXZ

Haha, thus showing why people should never invent their own ad-hoc crypto nor compression algorithms :-)

> _ZZN13testexpansion1sIZNS0_IZNS0_IZNS0_IZNS0_IiEEDaT_E6ResultEES1_S2_E6ResultEES1_S2_E6ResultEES1_S2_E6ResultEES1_S2_EN6Result3fooEv

Note "Result" appearing 5 times. It's still crappy compression. Ironically, this design predates the much worse VC++ one.

> which is only 39 characters longer than the compressed version of the D symbol.
> It uses a much smaller character set, though.

The much smaller character set can be done, if desired, by using base64 encoding. The compressor is careful to not emit 0 bytes, as those definitely would screw up bintools, but the rest seem unperturbed by binary strings.
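The base64 option Walter mentions could look like this C sketch (an assumed approach for illustration; dmd's compressor actually emits raw bytes): re-encode the compressed output with a 64-character identifier-safe alphabet, at the cost of ~33% size growth over raw bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: re-encode compressed bytes into a 64-character alphabet so
   the mangled symbol stays within conservative ASCII. The last two
   alphabet characters ('_' and '$') are an assumption, chosen to be
   linker-friendly where '+' and '/' of standard base64 are not. */
static const char b64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_$";

static size_t encode_symbol(const uint8_t *in, size_t n, char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i += 3) {
        /* pack up to 3 input bytes into a 24-bit group */
        uint32_t v = (uint32_t)in[i] << 16;
        if (i + 1 < n) v |= (uint32_t)in[i + 1] << 8;
        if (i + 2 < n) v |= in[i + 2];
        /* emit 6 bits per output character */
        out[o++] = b64[(v >> 18) & 63];
        out[o++] = b64[(v >> 12) & 63];
        if (i + 1 < n) out[o++] = b64[(v >> 6) & 63];
        if (i + 2 < n) out[o++] = b64[v & 63];
    }
    out[o] = '\0';
    return o;
}
```

Since the alphabet matches standard base64 for the first 62 entries, encode_symbol over the bytes "Man" yields the familiar "TWFu".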
May 22, 2016 Re: Need a Faster Compressor
Posted in reply to Marco Leise

On 5/22/2016 9:06 AM, Marco Leise wrote:
> Are there any objections against the benchmarking method or the license?

BSD may cause problems.

> Can the implementation be in D?

Not yet.

> If not, should we link against the system liblz4.so/a

No.

> or copy the C code into the backend?

Yes.

> Is a pre-filled dictionary worthwhile?

The trouble with that is the demangler would need the same dictionary, right? If so, that would be problematic.
May 23, 2016 Re: Need a Faster Compressor
Posted in reply to Era Scarecrow

On Sunday, 22 May 2016 at 19:44:08 UTC, Era Scarecrow wrote:
> ...

Well, here's the rundown of some numbers. min_compress uses a tiny window, big_compress was my original algorithm but modified to use 16k total for a window, and reduced_id_compress is the original except reduced to a 220 window and 2-byte constant output. Along with the compressed outputs of each.

min_compress:        [TickDuration(46410084)] 0.779836
big_compress:        [TickDuration(47998202)] 0.806545
orig_id_compress:    [TickDuration(59519257)] baseline
reduced_id_compress: [TickDuration(44033192)] 0.739894

1001 (original size)
72 testexpansion.s!(æεó▌æ╗int).så°Resulà≡Ñ╪¢╨╘ÿ¼É ↑─↑╜►╘fñv├ÿ╜ ↑│↑Ä .foo()
73 testexpansion.s!(ÅæεÅó▌Åæ╗int).sÅå°ResulÅà≡ÅÑ╪Å¢╨Å╘ÿżÉ₧├ÿÄ╜É╝▓ÿëåâ.foo()
67 tes╤xpansion.s!(ÇææÇóóÇææint)∙⌡ResulÇàÅÇѺǢ»Ç╘τǼ∩ë├τü╜∩¢▓τ².foo()
78 testexpansion.s!(æ2óCæ2int).så(Resulà0ÑH¢P╘ê¼É╘íñÖ├ê┤ÿσ║ñ¬├ê¼É╘íñÖ├êÉ1å).foo()

min_compress:        [TickDuration(29210832)] 0.82391
big_compress:        [TickDuration(31058664)] 0.87601
orig_id_compress:    [TickDuration(35466130)] baseline
reduced_id_compress: [TickDuration(25032532)] 0.705977

629 (original size)
52 E.s!(à·è⌡àδint).så°Resulà≡ÖΣÅ▄╝╝ö┤ líd _Ög læ◄.foo()
61 E.s!(Åà·Åè⌡Åàδint).sÅå°ResulÅà≡ÅÖΣÅÅ▄Å╝╝Åö┤₧ç∞ÄÖΣ¡ó╠ïå╗.foo()
52 E.s!(ΣÇèèΣint)∙⌡ResulÇàÅÇÖ¢ÇÅúÇ╝├Çö╦ëçôüÖ¢Æó│².foo()
52 E.s!(à&è+à&int).så(Resulà0Ö<ÅD╝döl ┤í╝ ┴Ö╣ ┤æ9.foo()
May 23, 2016 Re: Need a Faster Compressor
Posted in reply to Walter Bright

On Sun, 22 May 2016 13:02:37 -0700, Walter Bright <newshound2@digitalmars.com> wrote:
> On 5/22/2016 9:06 AM, Marco Leise wrote:
> > Are there any objections against the benchmarking method or the license?
>
> BSD may cause problems.
>
> > Can the implementation be in D?
>
> Not yet.
>
> > If not, should we link against the system liblz4.so/a
>
> No.
>
> > or copy the C code into the backend?
>
> Yes.

Can you elaborate on how copying the source overrules your above license concerns? I would create a directory "lz4" for the sources, compile them like we compile zlib now, and install dmd with a backend license stating that it uses lz4 licensed under BSD?

> > Is a pre-filled dictionary worthwhile?
>
> The trouble with that is the demangler would need the same dictionary, right? If so, that would be problematic.

Right, it's probably more trouble than it's worth.

-- Marco
May 23, 2016 Re: Need a Faster Compressor
Posted in reply to Rainer Schuetze

On Sun, 22 May 2016 20:22:06 +0200, Rainer Schuetze <r.sagitario@gmx.de> wrote:
> The 3 consecutive ?s are also in the object file, so they are actually part of the mangled symbol. VC++ uses only ASCII characters for mangling (which D should do, too, IMO).

In general I agree. The extended characters may be swallowed and the representation not copyable. In the worst case we may uncover bugs in debugging software. On the other hand, base-64 adds ~17% to the symbol length compared to base-128, and the linker in particular should only be interested in a terminating \0.

-- Marco
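Marco's ~17% figure falls out of the bits carried per character: base-64 carries 6 bits per character and base-128 carries 7, so base-64 needs 7/6 ≈ 1.167 times as many characters for the same payload. A quick C check (illustrative only):

```c
#include <assert.h>

/* characters needed to carry nbytes of payload at bits_per_char each,
   i.e. ceil(nbytes * 8 / bits_per_char) */
static int chars_needed(int nbytes, int bits_per_char)
{
    return (nbytes * 8 + bits_per_char - 1) / bits_per_char;
}
```

For the 52-byte compressed symbols from the benchmark post above, that works out to 70 characters in base-64 versus 60 in base-128, about 17% more.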
May 22, 2016 Re: Need a Faster Compressor
Posted in reply to Marco Leise

On 5/22/2016 10:52 PM, Marco Leise wrote:
> Can you elaborate on how copying the source overrules your above license concerns?

It does not. It just avoids having to link in a library that may or may not exist on every machine we want to create demanglers on.

> I would create a directory "lz4" for the sources, compile them like we compile zlib now and install dmd with a backend license stating that it uses lz4 licensed under BSD?

How many lines of code are we talking about here? The post you made makes it look like just a handful - why go to all the trouble of creating more directories and build complexity?
Copyright © 1999-2021 by the D Language Foundation