Parallel reads on std.container.array.Array
December 08, 2017
I was wondering if std.container.array.Array supports threadsafe parallel reads similar to std::vector. I've created a small program for demonstration https://github.com/carun/parallel-read-tester

It works fine, with just a couple of problems though:

1. The D version takes way too long compared to the C++ version.

```
bash build-and-run.sh
g++ (Ubuntu 7.2.0-8ubuntu3) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

LDC - the LLVM D compiler (1.6.0):
  based on DMD v2.076.1 and LLVM 5.0.0
  built with LDC - the LLVM D compiler (1.6.0)
  Default target: x86_64-unknown-linux-gnu
  Host CPU: skylake
  http://dlang.org - http://wiki.dlang.org/LDC

  Registered Targets:
    aarch64    - AArch64 (little endian)
    aarch64_be - AArch64 (big endian)
    arm        - ARM
    arm64      - ARM64 (little endian)
    armeb      - ARM (big endian)
    nvptx      - NVIDIA PTX 32-bit
    nvptx64    - NVIDIA PTX 64-bit
    ppc32      - PowerPC 32
    ppc64      - PowerPC 64
    ppc64le    - PowerPC 64 LE
    thumb      - Thumb
    thumbeb    - Thumb (big endian)
    x86        - 32-bit X86: Pentium-Pro and above
    x86-64     - 64-bit X86: EM64T and AMD64

=== Starting CPP version ===
Took 3.7583 to load 2000000 items. Gonna search in parallel...
5 4000000
6 4000000
2 4000000
0 4000000
1 4000000
7 4000000
4 4000000
3 4000000
Took 7.0247 to search

=== Starting D version ===
Took 1 sec, 506 ms, 672 μs, and 4 hnsecs to load 2000000 items. Gonna search in parallel...
3 4000000
4 4000000
2 4000000
6 4000000
7 4000000
5 4000000
1 4000000
0 4000000
Took 13 secs, 53 ms, 790 μs, and 3 hnsecs to search.
```
2. I'm on an 8-CPU box and I don't seem to hit 800% CPU with the D version (max 720%). However, I can get 800% CPU usage with the C++ version.

3. Introducing a string in the struct Data results in "std.container.Array.reserve failed to allocate memory", whereas adding a similar std::string in the C++ struct seems to work fine.

Am I missing anything obvious here?

Also, why doesn't std.container.array support an equivalent of std::vector::erase?

Cheers,
Arun
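
On the std::vector::erase question: `Array` does provide `linearRemove`, which removes a range previously obtained from the same container and may cover that use case. A minimal sketch follows; the helper name `eraseRange` is made up for illustration and is not part of Phobos.

```d
import std.algorithm.comparison : equal;
import std.container.array : Array;

// Hypothetical helper: erase the half-open index range [from, to) in place.
void eraseRange(ref Array!int a, size_t from, size_t to)
{
    // linearRemove expects a range that was obtained from this same container.
    a.linearRemove(a[from .. to]);
}

void main()
{
    auto a = Array!int(1, 2, 3, 4, 5);
    eraseRange(a, 1, 3);              // drop the elements at indices 1 and 2
    assert(a[].equal([1, 4, 5]));
}
```

As the name suggests, the removal is linear in the length of the container, because the elements after the removed range are shifted down.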
December 08, 2017
On Friday, 8 December 2017 at 07:34:53 UTC, Arun Chandrasekaran wrote:

> 2. I'm on an 8 CPU box and I don't seem to hit 800% CPU with D version (max 720%). However I can get 800% CPU usage with the C++ version.
Please ignore this; it is because of the write.
December 08, 2017
On Friday, 8 December 2017 at 07:34:53 UTC, Arun Chandrasekaran wrote:
> I was wondering if std.container.array.Array supports threadsafe parallel reads similar to std::vector. I've created a small program for demonstration https://github.com/carun/parallel-read-tester
>
> It works fine with just couple of problems though:
>
> 1. D version takes way too long compared to C++ version.
>
My mistake (an IO bottleneck; std.stdio.write is probably flushing?)! The timings are now close, differing only on the order of milliseconds. This holds not just for one run, but across multiple runs. (I should probably test this on a Xeon server.)

=== Starting CPP version ===
Took 3.79253 to load 2000000 items. Gonna search in parallel...
4 400000000
1 400000000
3 400000000
2 400000000
6 400000000
7 400000000
5 400000000
0 400000000
Took 6.28018 to search

=== Starting D version ===
Took 1 sec, 474 ms, 869 μs, and 4 hnsecs to load 2000000 items. Gonna search in parallel...
0 400000000
1 400000000
2 400000000
7 400000000
6 400000000
4 400000000
3 400000000
5 400000000
Took 6 secs, 472 ms, 467 μs, and 8 hnsecs to search.

What puzzles me is: what's wrong with the C++ version? :) Why is it so slow at loading the gallery (more than twice as slow as the D counterpart)? I thought std::vector::emplace_back should do a decent job. RVO in D?

> 2. Introducing a string in the struct Data results in "std.container.Array.reserve failed to allocate memory", whereas adding a similar std::string in the C++ struct seems to work fine.
Couldn't find the reason!

> Am I missing anything obvious here?
>
> Also why doesn't std.container.array support an equivalent of std::vector::erase?
December 08, 2017
On Friday, 8 December 2017 at 07:34:53 UTC, Arun Chandrasekaran wrote:
> I was wondering if std.container.array.Array supports threadsafe parallel reads similar to std::vector.

No. Your code can also fail on a system with inconsistent caches, because data written by the writing thread can remain in its cache and not reach shared memory in time, or the reading threads can read from their stale caches.
December 08, 2017
On Friday, 8 December 2017 at 10:01:14 UTC, Kagamin wrote:
> On Friday, 8 December 2017 at 07:34:53 UTC, Arun Chandrasekaran wrote:
>> I was wondering if std.container.array.Array supports threadsafe parallel reads similar to std::vector.
>
> No, your code can also fail on a system with inconsistent cache because data written by writing thread can remain in its cache and not reach shared memory in time or reading threads can read from their stale cache.

I'm OK with some delay between the writes and the reads. The same applies to writes and reads across processes. At least between threads the impact/delay is minimal, whereas between processes it's even worse, as the page will have to be reflected in all the mapped processes.
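
For the visibility concern discussed above, one common approach is to publish the data with a release store and have readers observe it with an acquire load. The following is only a minimal sketch of that pattern using `core.atomic`; the names `payload`, `observed`, `ready`, and `publishAndSum` are made up for illustration and do not come from the linked repository.

```d
import core.atomic : atomicLoad, atomicStore, MemoryOrder;
import core.thread : Thread;

__gshared int[] payload;   // plain data, written before publication
__gshared long observed;   // result written by the reader thread
shared bool ready;         // publication flag

long publishAndSum()
{
    atomicStore(ready, false);

    auto reader = new Thread({
        // Acquire load: once `ready` is seen as true, the writes made
        // before the matching release store are visible as well.
        while (!atomicLoad!(MemoryOrder.acq)(ready))
            Thread.yield();
        long sum = 0;
        foreach (x; payload)
            sum += x;
        observed = sum;
    });
    reader.start();

    payload = [1, 2, 3];                          // write the data first
    atomicStore!(MemoryOrder.rel)(ready, true);   // then publish it
    reader.join();
    return observed;
}

void main()
{
    assert(publishAndSum() == 6);
}
```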
December 09, 2017
So I tried the same on a Haswell processor with LDC 1.6.0 and it crashes:

```
=== Starting D version ===
Took 1 sec, 107 ms, and 383 μs to load 1000000 items. Gonna search in parallel...
*** Error in `./dmain-ldc': double free or corruption (fasttop): 0x0000000000edc6e0 ***
*** Error in `./dmain-ldc': double free or corruption (fasttop): 0x0000000000edc6e0 ***
```

DMD, on the other hand, takes forever to run and doesn't complete.
December 09, 2017
On Saturday, 9 December 2017 at 01:34:40 UTC, Arun Chandrasekaran wrote:
> So I tried the same on Haswell processor with LDC 1.6.0 and it crashes
>
> ```
> === Starting D version ===
> Took 1 sec, 107 ms, and 383 μs to load 1000000 items. Gonna search in parallel...
> *** Error in `./dmain-ldc': double free or corruption (fasttop): 0x0000000000edc6e0 ***
> *** Error in `./dmain-ldc': double free or corruption (fasttop): 0x0000000000edc6e0 ***
> ```

I learnt (from David Nadlinger) that, because of how the lifetime of transitory ranges is managed, they can't be used for parallel reads. Iterating by index has solved the problem.
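
A minimal sketch of what iterating by index could look like with `std.parallelism`. This is not the code from the linked repository; `countMatches` and the `int` element type are stand-ins chosen for illustration. The point is that worker threads only index into the `Array` and never copy a range over it.

```d
import core.atomic : atomicLoad, atomicOp;
import std.container.array : Array;
import std.parallelism : parallel;
import std.range : iota;

// Hypothetical stand-in for the search: count the items equal to `needle`.
// Array is a reference-counted handle, so passing it by value here (on the
// calling thread) does not copy the elements.
long countMatches(Array!int gallery, int needle)
{
    shared long hits;
    // Each worker receives indices only; no Array range is copied or
    // destroyed inside the parallel loop.
    foreach (i; parallel(iota(gallery.length)))
    {
        if (gallery[i] == needle)
            atomicOp!"+="(hits, 1);
    }
    return atomicLoad(hits);
}

void main()
{
    Array!int gallery;
    foreach (i; 0 .. 10_000)
        gallery.insertBack(i % 10);
    assert(countMatches(gallery, 3) == 1_000);
}
```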

However, accessing the items in Array results in a value copy. Is that expected? How can I fix this?
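
One possible explanation, assuming the Phobos version in use returns elements by `ref` from `Array.opIndex` and from the range's `front`: the copy comes from binding the element to a new variable, not from the indexing itself. The sketch below shows a few ways to reach the element in place; `Data` and `readId` are made-up names for illustration.

```d
import std.container.array : Array;

struct Data
{
    int id;
    int[16] payload;
}

// Taking the parameter by ref lets the caller pass an element without copying it.
int readId(ref const Data d)
{
    return d.id;
}

void main()
{
    Array!Data a;
    a.insertBack(Data(7));

    auto copy = a[0];      // declares a new value, so the element is copied
    copy.id = 99;
    assert(a[0].id == 7);  // the stored element is unchanged

    auto p = &a[0];        // pointer to the element stored in the container
    p.id = 42;
    assert(a[0].id == 42);
    assert(readId(a[0]) == 42);

    foreach (ref e; a[])   // ref iteration modifies elements in place
        e.id += 1;
    assert(a[0].id == 43);
}
```

A pointer obtained this way should not be kept across operations that may reallocate the container, such as `insertBack` or `reserve`.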

http://forum.dlang.org/post/cfhkszdbkaezprbzrnlc@forum.dlang.org