Can this implementation of Damm algorithm be optimized? (page 3)

On Sunday, 12 February 2017 at 00:43:55 UTC, Nestor wrote: > I fail to see where you are declaring QG10Matrix2, because apparently it's an array of chars, but buildMatrix2 returns an array of int (2560 elements??) with lots of -1 values. I declared it here: http://forum.dlang.org/post/fiamjuhiddbzwvaplxkr@forum.dlang.org and it's not chars, it's ints. Part of it is to avoid the mem access penalty for non divisible by 4 addresses, and another is that the array returns not chars, but numbers from 0-2560, which is the answer *256 (<<8), which offers hopefully some slight speed gain on longer inputs. Also the size of the array is guesstimated, and the -1's just signify the padding (which can be any value, but -1 makes it obvious). Since it's a 10x10 array it's based on, but using 4bits per section, there's 6 elements lost, and 6 whole rows lost. It's a necessary loss to gain speed; Thankfully it's only using 10k (2560 members) and not 700k as was my original guess when i was calculating it wrong earlier. Doing it wrong earlier, the compiler kept crashing... :P running out of memory.

On Saturday, 11 February 2017 at 21:56:54 UTC, Era Scarecrow wrote: > Just ran the unittests under the dmd profiler, says the algorithm is 11% faster now. So yeah slightly more optimized. Ran some more tests. Without optimization but with with 4 levels (a 2.5Mb table), it gains to a whopping 27%! However with optimizations turned on it dwindled to a mere 15% boost And optimization + no bounds checking, 2 & 4 levels both give a 9% boost total. Testing purely on 8byte inputs (Brute forced all combinations) receives the same 9% boost with negligible difference. Safe to say going higher levels isn't going to give you sufficient improvement; Also exe file is 3Mb big (but compresses to 150k).

On Sunday, 12 February 2017 at 05:54:34 UTC, Era Scarecrow wrote: > On Saturday, 11 February 2017 at 21:56:54 UTC, Era Scarecrow wrote: >> Just ran the unittests under the dmd profiler, says the algorithm is 11% faster now. So yeah slightly more optimized. > > Ran some more tests. > > Without optimization but with with 4 levels (a 2.5Mb table), it gains to a whopping 27%! > However with optimizations turned on it dwindled to a mere 15% boost > And optimization + no bounds checking, 2 & 4 levels both give a 9% boost total. > > Testing purely on 8byte inputs (Brute forced all combinations) receives the same 9% boost with negligible difference. > > Safe to say going higher levels isn't going to give you sufficient improvement; Also exe file is 3Mb big (but compresses to 150k). Wow! Thanks for the interest and effort.

February 13, 2017

Re: Can this implementation of Damm algorithm be optimized?

Posted by Era Scarecrow
in reply to Nestor

Permalink

Era Scarecrow

Posted in reply to Nestor

Permalink

On Monday, 13 February 2017 at 00:56:37 UTC, Nestor wrote:
> On Sunday, 12 February 2017 at 05:54:34 UTC, Era Scarecrow wrote:
>> Ran some more tests.
>
> Wow!
> Thanks for the interest and effort.

 Certainly. But the bulk of the answer comes down that the 2 levels that I've already provided are the fastest you're probably going to get. Certainly we can test using shorts or bytes instead, but it's likely the results will only go down.

 To note my tests are strictly on my x86 system and it would be better to also test this on other systems like PPC, Linux, ARM, and other architectures to see how they perform, and possibly tweak them as appropriate.

 Still we did find out there is some optimization that can be done and successfully for the Damm algorithm, it just isn't going to be a lot.

 Hmmm... A thought does come to mind. Parallelizing the code; However that would require probably 11 instances to get a 2x speedup (calculating the second half with all 10 possibilities for the carry over, and also calculating the first half, then choosing which of the 10 based on the first half's output), which only really works if you have a ton of cores, and the input is REALLY REALLY large, like a meg or something. While the usage of the Damm code is more useful for adding a digit to the end of a code like UPC or Barcodes as error detection, and expecting larger than 32 for real applications is unlikely.

 But at this point I'm rambling.

Forums