October 14, 2018
On Saturday, 13 October 2018 at 19:04:48 UTC, Jabari Zakiya wrote:
> On Saturday, 13 October 2018 at 18:31:57 UTC, Vijay Nayar wrote:
>> On Saturday, 13 October 2018 at 18:14:20 UTC, Vijay Nayar wrote:
>>> On Saturday, 13 October 2018 at 18:05:45 UTC, Jabari Zakiya wrote:
>>>>
>>>> It may also be running into a hard time limit imposed on compilation, like the one Nim had/has that prevented my code from initially compiling. I'm generating a lot of PG parameter constants at compile time, and the compiler is doing a lot of number crunching and building larger and larger arrays of constants as the PGs get larger.
>>>>
>>>> Try compiling with successive PGs (just P5, then P5 and P7, etc.) to see where it fails. That will let you know the code is working correctly, and whether the compiler is choking because of a hard time limit, a memory limit, or both. That's why I put a compiler output statement in 'genPGparameters': to see the progression of the PG parameters being built by the compiler and find where it starts choking. You may also need to patch whatever facility in the D compiler chain controls this.
>>>
>>> It's P17, the biggest one that takes the longest to build in the Nim version. I actually don't know what memory limits exist for the D compiler at compile-time, so I may need to do some homework.
>>
>> It's not just DMD either.
>>
>> $ ldc2 twinprimes_ssoz.d
>> ...
>> generating parameters for P17
>> Killed
>>
>> $ gdc twinprimes_ssoz.d
>> ...
>> generating parameters for P17
>> gdc: internal compiler error: Killed (program cc1d)
>> Please submit a full bug report,
>> with preprocessed source if appropriate.
>> See <file:///usr/share/doc/gcc-5/README.Bugs> for instructions.
>>
>> $ dmd twinprimes_ssoz.d
>> ...
>> generating parameters for P17
>> Killed
>
> In the Nim code, starting at line 91 is where the PG constants are generated at compile time.
>
> ---------------------------------------------------------
> # Generate at compile time the parameters for PGs P5-P17.
> const parametersp5  = genPGparameters(5)
> const parametersp7  = genPGparameters(7)
> const parametersp11 = genPGparameters(11)
> const parametersp13 = genPGparameters(13)
> const parametersp17 = genPGparameters(17)
> ---------------------------------------------------------
>
> Can it compile just using P5 (the first line, others commented out), and then P7, etc?
>
> I'm not understanding your comments now.
>
> If you can get a working version up and running (with correct output) we can solve the P17 compiler issues later (or in a parallel forum thread), especially if you have to delve into the weeds of the compiler chain.
>
> In my mind (same as with the Nim process) getting working code using any PG is the first-order priority (because you'll also need to get multi-threading working). Once you can do that, you can then, by default, use any generator you want if you create the correct parameters for it. That's one of the advantages of the algorithm: it's PG agnostic (as long as your hardware will accommodate it).
>
> So don't prioritize getting P17 to compile right now (in this thread). Create the working generic structure that can work with any PG first.

Updated:  https://gist.github.com/vnayar/79e2d0a9850833b8859dd9f08997b4d7

I still get a few runtime errors for certain primes, likely from mistakes in my conversion.  I'll resolve those after I get back from the gym.

But as previous posters have said, the code is not really very different between Nim and D.  Most of it is array manipulation and arithmetic operations, and not many of the features of either D or Nim are very different.  Both turn into fast code, both have garbage collection, and both have generally similar operators and libraries for this kind of problem.

The biggest differences I observed revolved not around the languages themselves, but around code style.  For example, can you put a loop and 3 additional statements on a single line in D?  Yes.  But from a style perspective it is generally considered hard to read.
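For instance (a contrived sketch, not code from the port):

    import std.stdio;

    void main() {
        int n = 10, sum, count, last;

        // Legal, but dense: a loop and three statements on a single line.
        foreach (i; 0 .. n) { sum += i; count++; last = i; }

        // Reset (kept on one line purely to keep the example short).
        sum = 0; count = 0; last = 0;

        // The same logic spread out -- the form most D style guides prefer.
        foreach (i; 0 .. n) {
            sum += i;
            count++;
            last = i;
        }

        writeln(sum, " ", count, " ", last);  // 45 10 9
    }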

Once I get the bugs out, I'm curious to see if any performance differences crop up.  There's the theory that says they should be the same, and then there's the practice.
October 14, 2018
On Sunday, 14 October 2018 at 10:51:11 UTC, Vijay Nayar wrote:
> But as previous posters have said, the code is not really very different between Nim and D.  Most of it is array manipulation and arithmetic operations, and not many of the features of either D or Nim are very different.  Both turn into fast code, both have garbage collection, and both have generally similar operators and libraries for this kind of problem.
>
> The biggest differences I observed revolved not around the languages themselves, but around code style.

Are you aware of DCompute? [1] I haven’t used it myself but since Jabari explicitly mentioned taking advantage of the GPU, it may be worth giving it a try. It should have an impact on performance and possibly on style as well, even though it will still be D. It would be nice if Nicholas could chime in on this thread.

Bastiaan.

[1] https://dlang.org/blog/2017/10/30/d-compute-running-d-on-the-gpu/
October 15, 2018
On Sunday, 14 October 2018 at 10:51:11 UTC, Vijay Nayar wrote:
> Once I get the bugs out, I'm curious to see if any performance differences crop up.  There's the theory that says they should be the same, and then there's the practice.

I don't actually understand the underlying algorithm, but I at least understand the flow of the program and the structure.  The algorithm utilized depends heavily on using shared memory access, which can be done in D, but I definitely wouldn't call it idiomatic.  In D, message passing is preferred, but it really can't be well demonstrated on your algorithm without a deeper understanding of the algorithm itself.
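For what it's worth, here is a minimal std.concurrency message-passing sketch, unrelated to the sieve itself, just to show the idiom I mean:

    import std.concurrency;
    import std.stdio;

    void worker(Tid parent) {
        // Receive one value from the owner, compute on it, send the result back.
        int n = receiveOnly!int();
        parent.send(n * 2);
    }

    void main() {
        Tid tid = spawn(&worker, thisTid);
        tid.send(21);
        writeln(receiveOnly!int());  // prints 42
    }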

A complete working version can be found at: https://gist.github.com/vnayar/79e2d0a9850833b8859dd9f08997b4d7

I modified both versions of the program to utilize the pgParameters13 for more of an apples-to-apples comparison.

The final results are as follows:
$ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim && echo "3000000000" | ./twinprimes_ssoz
Enter integer number: threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 0.000 secs
perform twinprimes ssoz sieve
sieve time = 0.222 secs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/-1
total time = 0.223 secs

$ dub build --compiler=ldc2 -b=release && echo "3000000000" | ./twinprimes
Enter integer number:
threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 1 ms, 864 μs, and 7 hnsecs
perform twinprimes ssoz sieve
sieve time = 196 ms, 566 μs, and 5 hnsecs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/- 1
total time = 198 ms, 431 μs, and 2 hnsecs

My understanding is that the difference in performance is largely due to slightly better optimization from the LLVM based ldc2 compiler, where I believe Nim is using gcc.

October 15, 2018
> I don't actually understand the underlying algorithm, but I at least understand the flow of the program and the structure.  The algorithm utilized depends heavily on using shared memory access, which can be done in D, but I definitely wouldn't call it idiomatic.  In D, message passing is preferred, but it really can't be well demonstrated on your algorithm without a deeper understanding of the algorithm itself.
>
> A complete working version can be found at: https://gist.github.com/vnayar/79e2d0a9850833b8859dd9f08997b4d7
>
> I modified both versions of the program to utilize the pgParameters13 for more of an apples-to-apples comparison.
>
> The final results are as follows:
> $ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim && echo "3000000000" | ./twinprimes_ssoz
> Enter integer number: threads = 8
> each thread segment is [1 x 65536] bytes array
> twinprime candidates = 175324676; resgroups = 1298702
> each 135 threads has nextp[2 x 5566] array
> setup time = 0.000 secs
> perform twinprimes ssoz sieve
> sieve time = 0.222 secs
> last segment = 53518 resgroups; segment slices = 20
> total twins = 9210144; last twin = 2999999712+/-1
> total time = 0.223 secs
>
> $ dub build --compiler=ldc2 -b=release && echo "3000000000" | ./twinprimes
> Enter integer number:
> threads = 8
> each thread segment is [1 x 65536] bytes array
> twinprime candidates = 175324676; resgroups = 1298702
> each 135 threads has nextp[2 x 5566] array
> setup time = 1 ms, 864 μs, and 7 hnsecs
> perform twinprimes ssoz sieve
> sieve time = 196 ms, 566 μs, and 5 hnsecs
> last segment = 53518 resgroups; segment slices = 20
> total twins = 9210144; last twin = 2999999712+/- 1
> total time = 198 ms, 431 μs, and 2 hnsecs
>
> My understanding is that the difference in performance is largely due to slightly better optimization from the LLVM based ldc2 compiler, where I believe Nim is using gcc.

Hey Great!  I'll take some time and study your code.

Nim currently allows using gcc or clang (--cc:gcc or --cc:clang), and compiles
to C (nim c --cc:gcc ...) or C++ (nim c++ --cc:gcc ...).

You can compile Nim code with or without garbage collection (GC), and choose which
GC to use. This version needs GC to recover the memory created in each thread, or it
will eat it up. Watch the program with htop/top to see the threads and memory at work.
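For reference, the GC is selected at compile time; the exact option names depend on the Nim version, but with 0.18/0.19 something like this should work:

$ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim                   # default GC (refc)
$ nim c --cc:gcc --d:release --threads:on --gc:markAndSweep twinprimes_ssoz.nim
$ nim c --cc:gcc --d:release --threads:on --gc:none twinprimes_ssoz.nim         # no GC: memory keeps growing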

If D has a fast bitarray structure, try using it for the data arrays:
"prms" in proc "sozpg" and the "seg" arrays in "twin_sieves". This could allow
larger segment sizes|arrays that still fit into cpu cache, reducing the number
of segment slices to process. This would all have to be profiled and
encapsulated in "selectPG", which adaptively selects the best PG to use.
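D does have one in Phobos (std.bitmanip.BitArray). A rough sketch of a bit-packed "seg", with placeholder names rather than the port's actual variables:

    import std.bitmanip : BitArray;

    void main() {
        enum segSize = 1 << 16;     // placeholder segment size, in bits
        BitArray seg;
        seg.length = segSize;       // one bit per residue group instead of one byte

        seg[12_345] = true;         // mark a residue group as composite
        assert(seg[12_345]);
        assert(!seg[12_344]);
    }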

Here is a table of output data and performance times (in seconds).

The results were produced on a System76 laptop with an Intel i7-6700HQ cpu,
2.6-3.5 GHz clock, 8 threads, and 16GB of memory. I compiled the file
"twinprimes_ssoz.nim" with Nim versions 0.18.0 and the just-released 0.19.0,
using gcc 8.2.1 on Manjaro (Linux) in VirtualBox. I then ran the executables
on my host Linux system (which only has gcc 4.9.8) to use all 8 threads.

The output was produced on a "quiet" system: I shut the laptop down,
rebooted, and ran the tests with only a terminal and a spreadsheet open (to
record the results). The times are the "best" times from multiple runs.

I also compare times from the program "primesieve" - https://primesieve.org/
which states on its site:

------------------------------------------------------------------------------------
primesieve is a free software program and C/C++ library that generates primes using a highly optimized sieve of Eratosthenes implementation. It counts the primes below 10^10 in just 0.45 seconds on an Intel Core i7-6700 CPU (4 x 3.4 GHz). primesieve can generate primes and prime k‑tuplets up to 2^64.
------------------------------------------------------------------------------------

(An optimized C/C++ version of twinprimes_ssoz would be nice to compare against too.)


  Number  | Twinprimes | Last Twinprime |twinprime_ssoz.nim, gcc-8.2.1| primesieve |
          |            |                |     0.18.0   |    0.19.0    |   9.7.1    |
----------|------------|----------------|--------------|--------------|------------|
1 x 10^09 |    3424506 |     1999999872 |      0.043   |      0.041   |     0.049  |
----------|------------|----------------|--------------|--------------|------------|
5 x 10^09 |   14618166 |     4999999860 |      0.219   |      0.226   |     0.248  |
----------|------------|----------------|--------------|--------------|------------|
1 x 10^10 |   27412679 |     9999999702 |      0.398   |      0.401   |     0.533  |
----------|------------|----------------|--------------|--------------|------------|
5 x 10^10 |  118903682 |    49999999590 |      2.364   |      2.331   |     2.978  |
----------|------------|----------------|--------------|--------------|------------|
1 x 10^11 |  224376048 |    99999999762 |      4.586   |      4.476   |     6.189  |
----------|------------|----------------|--------------|--------------|------------|
5 x 10^11 |  986222314 |   499999999062 |     26.625   |     26.578   |    35.120  |
----------|------------|----------------|--------------|--------------|------------|
1 x 10^12 | 1870585220 |   999999999960 |     57.895   |     58.195   |    73.966  |
----------|------------|----------------|--------------|--------------|------------|
5 x 10^12 | 8312493003 |  4999999999878 |    385.562   |    389.347   |   433.320  |
----------|------------|----------------|--------------|--------------|------------|
1 x 10^13 |15834664872 |  9999999998490 |    855.705   |    857.771   |   934.253  |
----------|------------|----------------|--------------|--------------|------------|
2 x 10^13 |30198862775 | 19999999999830 |   1806.282   |   1818.766   |  1998.341  |
----------|------------|----------------|--------------|--------------|------------|

A typical output of the "twinprimes_ssoz.nim" file looks as below:

$ echo 123_345_789_000 | ./twinprimes_test7yc.0180.gcc821
Enter integer number: threads = 8
each thread segment is [1 x 98304] bytes array          // 98,304 segment size (Kn)
twinprime candidates = 6099517038; resgroups = 4107419  // max twins; max cols (KB)
each 1485 threads has nextp[2 x 30062] array            // 30,062, primes <= sqrt(N)
setup time = 0.006 secs                                 // time for primes|ssoz setup
perform twinprimes ssoz sieve                           // ssoz starts now
sieve time = 5.771 secs                                 // time to do just the ssoz
last segment = 76955 resgroups; segment slices = 42     // last seg cols; seg loops
total twins = 272018910; last twin = 123345788640+/-1   // twinprimes, val for last
total time = 5.778 secs                                 // setup + ssoz time

You can determine the selected PG by the number of threads being used.

I'm writing up the paper, and will probably release the program operation description
to provide a detailed understanding of how|why it works, which will allow for modification.
October 15, 2018
> $ dub build --compiler=ldc2 -b=release && echo "3000000000" | ./twinprimes
> Enter integer number:
> threads = 8
> each thread segment is [1 x 65536] bytes array
> twinprime candidates = 175324676; resgroups = 1298702
> each 135 threads has nextp[2 x 5566] array
> setup time = 1 ms, 864 μs, and 7 hnsecs
> perform twinprimes ssoz sieve
> sieve time = 196 ms, 566 μs, and 5 hnsecs
> last segment = 53518 resgroups; segment slices = 20
> total twins = 9210144; last twin = 2999999712+/- 1
> total time = 198 ms, 431 μs, and 2 hnsecs
>
> My understanding is that the difference in performance is largely due to slightly better optimization from the LLVM based ldc2 compiler, where I believe Nim is using gcc.

Here's what I get on my system.

$ echo 3_000_000_000 | ./twinprimes_test7yc.0180.gcc821
Enter integer number: threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 0.000 secs
perform twinprimes ssoz sieve
sieve time = 0.144 secs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/-1
total time = 0.144 secs

Could you list your hardware, D version, and compiler specs?

I will run your code on my system with your D version and compiler, if I can.

Excellent work!

October 16, 2018
On Monday, 15 October 2018 at 22:17:57 UTC, Jabari Zakiya wrote:
>> $ dub build --compiler=ldc2 -b=release && echo "3000000000" | ./twinprimes
>> Enter integer number:
>> threads = 8
>> each thread segment is [1 x 65536] bytes array
>> twinprime candidates = 175324676; resgroups = 1298702
>> each 135 threads has nextp[2 x 5566] array
>> setup time = 1 ms, 864 μs, and 7 hnsecs
>> perform twinprimes ssoz sieve
>> sieve time = 196 ms, 566 μs, and 5 hnsecs
>> last segment = 53518 resgroups; segment slices = 20
>> total twins = 9210144; last twin = 2999999712+/- 1
>> total time = 198 ms, 431 μs, and 2 hnsecs
>>
>> My understanding is that the difference in performance is largely due to slightly better optimization from the LLVM based ldc2 compiler, where I believe Nim is using gcc.
>
> Here's what I get on my system.
>
> $ echo 3_000_000_000 | ./twinprimes_test7yc.0180.gcc821
> Enter integer number: threads = 8
> each thread segment is [1 x 65536] bytes array
> twinprime candidates = 175324676; resgroups = 1298702
> each 135 threads has nextp[2 x 5566] array
> setup time = 0.000 secs
> perform twinprimes ssoz sieve
> sieve time = 0.144 secs
> last segment = 53518 resgroups; segment slices = 20
> total twins = 9210144; last twin = 2999999712+/-1
> total time = 0.144 secs
>
> Could you list your hardware, D version, and compiler specs?
>
> I will run your code on my system with your D version and compiler, if I can.
>
> Excellent work!

D has multiple compilers, but for the speed of the finished binary, LDC2 is generally recommended.  I used version 1.11.0.  https://github.com/ldc-developers/ldc/releases/tag/v1.11.0

I was using DUB to manage the project, but to build the stand-alone file from the gist link, use this command:  $ ldc2 -release -O3 twinprimes_ssoz.d
And to run it:  $ echo "3000000000" | ./twinprimes_ssoz

Running the program a bunch of times, I get variable results, mostly between 195ms and 250ms.  Running the Nim version, I also get variable results, mostly between 230ms and 280ms.

My system is an Alienware 14x R2 from 2012.  Specs can be found here: https://www.cnet.com/products/alienware-m14xr2-14-core-i7-3630qm-8-gb-ram-750-gb-hdd/specs/
October 16, 2018
So I ran the profiler, and 97% of the time is spent in the void twinsSieve function, and the hotspots are the seg[k] = seg[k] | 1; lines. Since seg[k] can only be 1 or 0, I removed that OR operation. And the results are... cue the drum-roll... 5% slower.

I thought that all of my studying was getting somewhere, that I was beginning to understand things, but no. Performing an OR operation and then storing the data is faster than just storing the data. [sarcasm] Of course it makes sense [/sarcasm]

I looked at the assembly and nothing changed except

orb    $0x1,(%rbx,%rdi,1)

is changed to

movb   $0x1,(%rbx,%rdi,1)

I'm completely lost.
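If anyone wants to poke at this in isolation, a stand-alone micro-benchmark of the two store variants might look like this (hypothetical, not the sieve's actual inner loop; results will vary by CPU and compiler):

    import std.datetime.stopwatch : benchmark;
    import std.stdio : writeln;

    void main() {
        auto seg = new ubyte[](1 << 20);   // power-of-two length so we can wrap with a mask
        size_t k;

        auto times = benchmark!(
            () { seg[k] = seg[k] | 1; k = (k + 1) & (seg.length - 1); },  // orb variant
            () { seg[k] = 1;          k = (k + 1) & (seg.length - 1); }   // movb variant
        )(10_000_000);

        writeln("or-then-store: ", times[0]);
        writeln("plain store:   ", times[1]);
    }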
October 16, 2018
On Tuesday, 16 October 2018 at 07:09:05 UTC, Vijay Nayar wrote:
> D has multiple compilers, but for the speed of the finished binary, LDC2 is generally recommended.  I used version 1.11.0.  https://github.com/ldc-developers/ldc/releases/tag/v1.11.0
>
> I was using DUB to manage the project, but to build the stand-alone file from the gist link, use this command:  $ ldc2 -release -O3 twinprimes_ssoz.d
> And to run it:  $ echo "3000000000" | ./twinprimes_ssoz

It'd be interesting to see if LTO or PGO generates an improvement. It looks like it could in this case, as it might optimize some of the inner loops. LTO is easy; enable it with:

    -flto=<thin|full> -defaultlib=phobos2-ldc-lto,druntime-ldc-lto

(see: https://github.com/ldc-developers/ldc/releases/tag/v1.9.0). I've been using 'thin' on OSX, 'full' on Linux.
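For this file that would presumably combine with the earlier build command like so:

    $ ldc2 -release -O3 -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto twinprimes_ssoz.d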

PGO is a bit more work, but not too bad. A good primer is here: https://johanengelen.github.io/ldc/2016/07/15/Profile-Guided-Optimization-with-LDC.html

--Jon


October 16, 2018
On Tuesday, 16 October 2018 at 16:57:12 UTC, welkam wrote:
> So I ran the profiler, and 97% of the time is spent in the void twinsSieve function, and the hotspots are the seg[k] = seg[k] | 1; lines. Since seg[k] can only be 1 or 0, I removed that OR operation. And the results are... cue the drum-roll... 5% slower.
>
> I thought that all of my studying was getting somewhere, that I was beginning to understand things, but no. Performing an OR operation and then storing the data is faster than just storing the data. [sarcasm] Of course it makes sense [/sarcasm]
>
> I looked at the assembly and nothing changed except
>
> orb    $0x1,(%rbx,%rdi,1)
>
> is changed to
>
> movb   $0x1,(%rbx,%rdi,1)
>
> I'm completely lost.

This is the exact same behavior I found with the Nim compiler too.

seg[k] = 1  is slower than seg[k] = seg[k] or 1, which is why it's coded like that.

Thanks for finding the exact instruction difference.

How many other unknown gotchas are in the compilers? :-(

October 16, 2018
On Tuesday, 16 October 2018 at 07:09:05 UTC, Vijay Nayar wrote:
> On Monday, 15 October 2018 at 22:17:57 UTC, Jabari Zakiya wrote:
>>> $ dub build --compiler=ldc2 -b=release && echo "3000000000" | ./twinprimes
>>> Enter integer number:
>>> threads = 8
>>> each thread segment is [1 x 65536] bytes array
>>> twinprime candidates = 175324676; resgroups = 1298702
>>> each 135 threads has nextp[2 x 5566] array
>>> setup time = 1 ms, 864 μs, and 7 hnsecs
>>> perform twinprimes ssoz sieve
>>> sieve time = 196 ms, 566 μs, and 5 hnsecs
>>> last segment = 53518 resgroups; segment slices = 20
>>> total twins = 9210144; last twin = 2999999712+/- 1
>>> total time = 198 ms, 431 μs, and 2 hnsecs
>>>
>>> My understanding is that the difference in performance is largely due to slightly better optimization from the LLVM based ldc2 compiler, where I believe Nim is using gcc.
>>
>> Here's what I get on my system.
>>
>> $ echo 3_000_000_000 | ./twinprimes_test7yc.0180.gcc821
>> Enter integer number: threads = 8
>> each thread segment is [1 x 65536] bytes array
>> twinprime candidates = 175324676; resgroups = 1298702
>> each 135 threads has nextp[2 x 5566] array
>> setup time = 0.000 secs
>> perform twinprimes ssoz sieve
>> sieve time = 0.144 secs
>> last segment = 53518 resgroups; segment slices = 20
>> total twins = 9210144; last twin = 2999999712+/-1
>> total time = 0.144 secs

Hey Vijay, I found subtle static typing errors.

When N < 2^32 - 1 (4,294,967,295) the twinprime count and last twin values are correct.
When N > 2^32 - 1 both values start to be incorrect.

Example: For N = 4,000,000,000 you get correct answers:

$ echo 4000000000 | ./twinprimes_ssoz
Enter integer number:
threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 233766231; resgroups = 1731602
each 135 threads has nextp[2 x 6332] array
setup time = 667 μs and 5 hnsecs
perform twinprimes ssoz sieve
sieve time = 181 ms, 277 μs, and 6 hnsecs
last segment = 27666 resgroups; segment slices = 27
total twins = 11944438; last twin = 3999999798+/- 1
total time = 181 ms, 945 μs, and 1 hnsec

Example: For N = 5,000,000,000 you get wrong answers:

$ echo 5000000000 | ./twinprimes_ssoz
Enter integer number:
threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 292207792; resgroups = 2164503
each 135 threads has nextp[2 x 6999] array
setup time = 744 μs and 7 hnsecs
perform twinprimes ssoz sieve
sieve time = 225 ms, 323 μs, and 4 hnsecs
last segment = 1815 resgroups; segment slices = 34
total twins = 14618173; last twin = 705034592+/- 1
total time = 226 ms, 68 μs, and 1 hnsec

These are the correct answers for N = 5,000,000,000:

$ echo 5000000000 | ./twinprimes_test7yc.0180.gcc821
Enter integer number: threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 292207792; resgroups = 2164503
each 135 threads has nextp[2 x 6999] array
setup time = 0.001 secs
perform twinprimes ssoz sieve
sieve time = 0.237 secs
last segment = 1815 resgroups; segment slices = 34
total twins = 14618166; last twin = 4999999860+/-1
total time = 0.237 secs

The first easy check: the last twin value should be just a little smaller than N.
See the timing|results table I posted earlier to see correct outputs as N gets bigger.

It took me a looong time, and was a big PITA, to get the correct static typing too in Nim, especially coming from a dynamic language like Ruby.
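As a minimal illustration of the failure mode (hypothetical variable names, not necessarily the exact culprit in the gist): 5,000,000,000 pushed through a 32-bit value silently wraps.

    import std.stdio;

    void main() {
        ulong n = 5_000_000_000;
        uint  wrapped = cast(uint) n;   // silently truncated modulo 2^32
        writeln(wrapped);               // 705032704 -- near the wrong "last twin" above
        writeln(n);                     // 5000000000
    }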