October 16
On Tuesday, 16 October 2018 at 17:57:23 UTC, Jabari Zakiya wrote:
> This is the exact same behavior I found with the Nim compiler too.

Well Nim compiler is more like translator. It translates Nim code to c or c++. Since gcc was responsible for optimizations and instruction selection it would be more correct to say that gcc behavior was same as llvm.


> How many other unknown gotchas are in the compilers? :-(

the number of inlining shall be 3
https://www.youtube.com/watch?v=s4wnuiCwTGU


October 16
On Tuesday, 16 October 2018 at 20:38:24 UTC, welkam wrote:
> On Tuesday, 16 October 2018 at 17:57:23 UTC, Jabari Zakiya wrote:
>> This is the exact same behavior I found with the Nim compiler too.
>
> Well Nim compiler is more like translator. It translates Nim code to c or c++. Since gcc was responsible for optimizations and instruction selection it would be more correct to say that gcc behavior was same as llvm.
>
>
>> How many other unknown gotchas are in the compilers? :-(
>
> the number of inlining shall be 3
> https://www.youtube.com/watch?v=s4wnuiCwTGU

Yes, it finally is the compiler doing it, and Clang does the same thing.
And they could be modded to catch semantics like this and produce faster code.

My guess is the difference in code execution times based on the ALU operation pipeline.

seg[k] = seg[k] | 1

can be done "inplace" in the pipeline, where the lsb can to be changed in mem, whereas

seg[k] = 1

may require a separate constant|value creation and memory write, and bypass the ALU.

I used to know most of these tricks back in the Pentium days, but I haven't had a need to keep up with Intel Ixx core chips machine|compiler level instruction optimizations.

October 16
On Tuesday, 16 October 2018 at 20:58:54 UTC, Jabari Zakiya wrote:
> And they could be modded to catch semantics like this and produce faster code.

Its hard to prove that you will only write 1 or 0 in the array and even if you write such pass it wont fire very often. So slower compile times for no gain in common case.


> My guess is the difference in code execution times based on the ALU operation pipeline.


and my guess is that CPU knows that its going to write the same value to memory so it ignores that write.

> I used to know most of these tricks back in the Pentium days, but I haven't had a need to keep up with Intel Ixx core chips machine|compiler > level instruction optimizations.

Life is too short for such things. Today compilers are good at optimizing low level stuff. Those decades of improvements add up.
October 16
On Tuesday, 16 October 2018 at 21:12:39 UTC, welkam wrote:
> On Tuesday, 16 October 2018 at 20:58:54 UTC, Jabari Zakiya wrote:
>> And they could be modded to catch semantics like this and produce faster code.
>
> Its hard to prove that you will only write 1 or 0 in the array and even if you write such pass it wont fire very often. So slower compile times for no gain in common case.
>
>
>> My guess is the difference in code execution times based on the ALU operation pipeline.
>
>
> and my guess is that CPU knows that its going to write the same value to memory so it ignores that write.

Exactly. It does the same instruction twice, and once it's set the first time it can ignore setting it the second time.

However with seg[k] = 1 , the mov instruction always moves, because it doesn't know|care what the previous value was. That's my guess.

> Life is too short for such things. Today compilers are good at optimizing low level stuff. Those decades of improvements add up.

Amen! :-)

November 07
On Tuesday, 16 October 2018 at 07:09:05 UTC, Vijay Nayar wrote:
> On Monday, 15 October 2018 at 22:17:57 UTC, Jabari Zakiya wrote:
>>> $ dub build --compiler=ldc2 -b=release && echo "3000000000" | ./twinprimes
>>> Enter integer number:
>>> threads = 8
>>> each thread segment is [1 x 65536] bytes array
>>> twinprime candidates = 175324676; resgroups = 1298702
>>> each 135 threads has nextp[2 x 5566] array
>>> setup time = 1 ms, 864 μs, and 7 hnsecs
>>> perform twinprimes ssoz sieve
>>> sieve time = 196 ms, 566 μs, and 5 hnsecs
>>> last segment = 53518 resgroups; segment slices = 20
>>> total twins = 9210144; last twin = 2999999712+/- 1
>>> total time = 198 ms, 431 μs, and 2 hnsecs
>>>
>>> My understanding is that the difference in performance is largely due to slightly better optimization from the LLVM based ldc2 compiler, where I believe Nim is using gcc.
>>
>> Here's what I get on my system.
>>
>> $ echo 3_000_000_000 | ./twinprimes_test7yc.0180.gcc821
>> Enter integer number: threads = 8
>> each thread segment is [1 x 65536] bytes array
>> twinprime candidates = 175324676; resgroups = 1298702
>> each 135 threads has nextp[2 x 5566] array
>> setup time = 0.000 secs
>> perform twinprimes ssoz sieve
>> sieve time = 0.144 secs
>> last segment = 53518 resgroups; segment slices = 20
>> total twins = 9210144; last twin = 2999999712+/-1
>> total time = 0.144 secs
>>
>> Could you list your hardware, D ver, compiler specs.
>>
>> I will run your code on my system with your D version and compiler, if I can.
>>
>> Excellent work!
>
> D has multiple compilers, but for the speed of the finished binary, LDC2 is generally recommended.  I used version 1.11.0.  https://github.com/ldc-developers/ldc/releases/tag/v1.11.0
>
> I was using DUB to manage the project, but to build the stand-alone file from the gist link, use this command:  $ ldc2 -release -O3 twinprimes_ssoz.d
> And to run it:  $ echo "3000000000" | ./twinprimes_ssoz
>
> Running the program a bunch of times, I get variable results, mostly between 195ms and 250ms.  Running the Nim version, I also get variable results, mostly between 230ms and 280ms.
>
> My system is an Alienware 14x R2 from 2012.  Specs can be found here: https://www.cnet.com/products/alienware-m14xr2-14-core-i7-3630qm-8-gb-ram-750-gb-hdd/specs/

Hey Vijay I got a complete D version running correctly.

https://gist.github.com/jzakiya/ae93bfa03dbc8b25ccc7f97ff8ad0f61

Here's the Nim code for comparison.

https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e

I'm about 80-85% finished writing my paper. It's interesting the things you learn about a topic if you try to write about it. :-)  When I finish I'll post it here, and everything will become a whole lot clearer about what|why|how the algorithm|code works.

I found a few edge case bugs in the original code I posted, but also found some coding and performance improvements too. The D (and revised Nim) code reflects all these changes, and are the current state-of-the-art regarding this implementation.

One thing that could|should make it faster is the use of fast bitarrays to create the seg arrays in the threads (vs byte arrays). I didn't find anywhere that D has a native bitarray structure (Nim doesn't either).

Please don't be too offset by my coding style. I tried to mimick the Nim style (which is non-standard too).  I code to minimize the syntax noise, and like complete concepts on one line.

Performance is "similar" to Nim's, though Nim is slightly faster as N increases. Both are still faster than primesieve however. Still hoping someone does a GPU version.

Please compile on the different Nim compilers and see what differences they produce.

And last, the D version doesn't us P17, as in the Nim version, because of the issue you found with the D compiler when trying to compile with P17. If you have time maybe you (someone) can try to resolve that issue, to match its functionality to the Nim version.


November 08
Hi Jabari,

On Wednesday, 7 November 2018 at 20:30:33 UTC, Jabari Zakiya wrote:
> Hey Vijay I got a complete D version running correctly.
>
> https://gist.github.com/jzakiya/ae93bfa03dbc8b25ccc7f97ff8ad0f61

I cloned this and enabled P17 for you, get it at https://github.com/veelo/twinprimes_ssoz

I basically optimized the compile time table generation by removing templates and avoiding reallocations. This is a big deal because currently the compiler will not collect its own garbage. You'll see that I reserve exactly the minimum array lengths; of course I didn't know these when I started, I got it to work just by preallocating for up to P13 and using appender for P17. But since I now know the exact required length, I can just as well use it. My system has 16GB of RAM, which is enough.

[...]
> One thing that could|should make it faster is the use of fast bitarrays to create the seg arrays in the threads (vs byte arrays). I didn't find anywhere that D has a native bitarray structure (Nim doesn't either).

There is https://dlang.org/phobos/std_bitmanip.html#BitArray

I haven't looked at the run-time code, it is very well possible that similar optimizations are possible there too.
3 days ago
On Thursday, 8 November 2018 at 18:49:35 UTC, Bastiaan Veelo wrote:
> Hi Jabari,
>
> On Wednesday, 7 November 2018 at 20:30:33 UTC, Jabari Zakiya wrote:
>> Hey Vijay I got a complete D version running correctly.
>>
>> https://gist.github.com/jzakiya/ae93bfa03dbc8b25ccc7f97ff8ad0f61
>
> I cloned this and enabled P17 for you, get it at https://github.com/veelo/twinprimes_ssoz
>
> I basically optimized the compile time table generation by removing templates and avoiding reallocations. This is a big deal because currently the compiler will not collect its own garbage. You'll see that I reserve exactly the minimum array lengths; of course I didn't know these when I started, I got it to work just by preallocating for up to P13 and using appender for P17. But since I now know the exact required length, I can just as well use it. My system has 16GB of RAM, which is enough.
>
> [...]
>> One thing that could|should make it faster is the use of fast bitarrays to create the seg arrays in the threads (vs byte arrays). I didn't find anywhere that D has a native bitarray structure (Nim doesn't either).
>
> There is https://dlang.org/phobos/std_bitmanip.html#BitArray
>
> I haven't looked at the run-time code, it is very well possible that similar optimizations are possible there too.

Hi Bastiaan,

Good work on the code.

I've been playing with it, and made versions using boolean and bitarrays for seg array. Byte arrays are definitely faster, with bools slightly slower, but bitarrays way slower, by at least 2x.

I also had made a slight modification to twin_sieve, at its end, to speed it up a smidgen. I didn't know if you want PRs to your code.

However, there is a problem in the code when using P17. It generates incorrect results > 15 trillion (when P17 is selected). For 20 trillion (20_000_000_000_000) it gives 37,593,079,058 instead of 30,198,862,775 for the twins count, and the last twin value is off too.This likely means the P17 parameters are off, and/or nextp values, as not enough of the k values in twins_sieve are marking off prime multiples in the seg array.

However for P5-P13 the results are correct, and the performance is comparable to the Nim version, up to N = 5 billion, where you start to see D be slower (by a few percent). You can run both on your system to see what the difference is on it.

Anyway, if you figure out what's up with P17 that we can get it totally working, and I'll look at playing around with it to.

I'm over 90% finished with my paper now (want to finish b4 Thanksgiving).
Next ›   Last »
1 2 3 4 5