May 06, 2014
On Monday, 5 May 2014 at 22:11:39 UTC, Ali Çehreli wrote:
> On 05/05/2014 02:38 PM, Kapps wrote:
>
> > I think that the GC actually blocks when
> > creating objects, and thus multiple threads creating
> > instances would not
> > provide a significant speedup, possibly even a slowdown.
>
> Wow! That is the case. :)
>
> > You'd want to benchmark this to be certain it helps.
>
> I did:
>
> import std.range;
> import std.parallelism;
>
> class C
> {}
>
> void foo()
> {
>     auto c = new C;
> }
>
> void main(string[] args)
> {
>     enum totalElements = 10_000_000;
>
>     if (args.length > 1) {
>         foreach (i; iota(totalElements).parallel) {
>             foo();
>         }
>
>     } else {
>         foreach (i; iota(totalElements)) {
>             foo();
>         }
>     }
> }
>
> Typical run on my system for "-O -noboundscheck -inline":
>
> $ time ./deneme parallel
>
> real	0m4.236s
> user	0m4.325s
> sys	0m9.795s
>
> $ time ./deneme
>
> real	0m0.753s
> user	0m0.748s
> sys	0m0.003s
>
> Ali

Huh, that's a much, much higher impact than I'd expected.
I tried with GDC as well (the one in Debian stable, which is unfortunately still 2.055...) and got similar results. I also tried creating only totalCPUs threads and having each of them create NUM_ELEMENTS / totalCPUs objects, rather than risking that each creation became its own task, and the results were still about the same.

Using malloc and emplace instead of new, the results are about 50% faster single-threaded and roughly 3-4 times faster multi-threaded (4-core/8-thread machine, Linux 64-bit). The multi-threaded version is still twice as slow as single-threaded, though. On my Windows laptop (with the program compiled for 32-bit), it did not make a significant difference, and the multi-threaded version is still 4 times slower.

That being said, I think most malloc implementations, while thread-safe, use locks internally and do not scale well across threads.

Code:
import std.range;
import std.parallelism;
import std.datetime;
import std.stdio;
import core.stdc.stdlib;
import std.conv;

class C {}

void foo() {
    //auto c = new C;
    enum size = __traits(classInstanceSize, C);
    // Allocate from the C heap instead of the GC heap;
    // deliberately never freed, since this is only a benchmark.
    void[] mem = malloc(size)[0 .. size];
    emplace!C(mem);
}

void createFoos(size_t count) {
    foreach(i; 0 .. count) {
        foo();
    }
}

void main(string[] args) {
    StopWatch sw = StopWatch(AutoStart.yes);
    enum totalElements = 10_000_000;
    if (args.length <= 1) {
        foreach (i; iota(totalElements)) {
            foo();
        }
    } else if(args[1] == "tasks") {
        foreach (i; parallel(iota(totalElements))) {
            foo();
        }
    } else if(args[1] == "parallel") {
        for(int i = 0; i < totalCPUs; i++) {
            taskPool.put(task(&createFoos, totalElements / totalCPUs));
        }
        taskPool.finish(true);
    } else
        writeln("Unknown argument '", args[1], "'.");
    sw.stop();
    writeln(cast(Duration)sw.peek);
}

Results (Linux 64-bit):
shardsoft:~$ dmd -O -inline -release test.d
shardsoft:~$ ./test
552 ms, 729 μs, and 7 hnsecs
shardsoft:~$ ./test
532 ms, 139 μs, and 5 hnsecs
shardsoft:~$ ./test tasks
1 sec, 171 ms, 126 μs, and 4 hnsecs
shardsoft:~$ ./test tasks
1 sec, 38 ms, 468 μs, and 6 hnsecs
shardsoft:~$ ./test parallel
1 sec, 146 ms, 738 μs, and 2 hnsecs
shardsoft:~$ ./test parallel
1 sec, 268 ms, 195 μs, and 3 hnsecs



May 06, 2014
On Tuesday, 6 May 2014 at 15:56:11 UTC, Kapps wrote:
> On Monday, 5 May 2014 at 22:11:39 UTC, Ali Çehreli wrote:
>> [snip]
>
> Huh, that's a much, much, higher impact than I'd expected.
> I tried with GDC as well (the one in Debian stable, which is unfortunately still 2.055...) and got similar results. I also tried creating only totalCPUs threads and having each of them create NUM_ELEMENTS / totalCPUs objects rather than risking that each creation was a task, and it still seems to be the same.
>
>snip

I tried with using an allocator that never releases memory, rounds up to a power of 2, and is lock-free. The results are quite a bit better.

shardsoft:~$ ./test
1 sec, 47 ms, 474 μs, and 4 hnsecs
shardsoft:~$ ./test
1 sec, 43 ms, 588 μs, and 2 hnsecs
shardsoft:~$ ./test tasks
692 ms, 769 μs, and 8 hnsecs
shardsoft:~$ ./test tasks
692 ms, 686 μs, and 8 hnsecs
shardsoft:~$ ./test parallel
691 ms, 856 μs, and 9 hnsecs
shardsoft:~$ ./test parallel
690 ms, 22 μs, and 3 hnsecs

I get similar results on my laptop (which is much faster than the results I got on it using DMD's malloc):
>test
1 sec, 125 ms, and 847 μs
>test
1 sec, 125 ms, 741 μs, and 6 hnsecs

>test tasks
556 ms, 613 μs, and 8 hnsecs
>test tasks
552 ms and 287 μs

>test parallel
554 ms, 542 μs, and 6 hnsecs
>test parallel
551 ms, 514 μs, and 9 hnsecs


Code:
http://pastie.org/9146326

Unfortunately it doesn't compile with the ancient version of GDC available in Debian, so I couldn't test with that; the results there should be quite a bit better still, since core.atomic would be faster. And frankly, I'm not sure the allocator actually works properly, but it's just for testing purposes anyway.
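In case the pastie link goes away: here is a minimal sketch of the kind of allocator described above. The names (LockFreeHeap, alloc) and layout are my own, not the original code. It grabs one large block up front, hands out power-of-two sized chunks with a single atomic add, and never frees anything:

```d
import core.atomic : atomicOp;
import core.stdc.stdlib : malloc;

// Hypothetical sketch; the original pastie code may differ.
struct LockFreeHeap
{
    private void* base;
    private shared size_t offset;  // bump pointer, advanced atomically
    private size_t capacity;

    this(size_t capacity)
    {
        this.base = malloc(capacity);
        this.capacity = capacity;
    }

    void[] alloc(size_t size)
    {
        // Round the request up to the next power of two.
        size_t rounded = 1;
        while (rounded < size)
            rounded <<= 1;

        // Reserve a chunk with one atomic add; no locks are taken.
        immutable end = atomicOp!"+="(offset, rounded);
        if (end > capacity)
            return null;  // out of space; a real allocator would chain blocks
        return base[end - rounded .. end];
    }
}
```

foo() above could then emplace into heap.alloc(size) instead of calling malloc per object. The single fetch-and-add is why contention stays low: threads never wait on a lock, they only race on one atomic increment.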

May 06, 2014
On 05/06/2014 05:46 AM, hardcoremore wrote:

> But what exactly does it mean that the Garbage Collector blocks? What
> does it block, and in which way?

I know this much: the current GC that comes with the D runtime is a single-threaded, "stop-the-world" GC, meaning that all threads are paused while it runs a garbage collection cycle.
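As an aside (not something discussed above): when a program needs to allocate a burst of objects, collection cycles can be suspended around the burst using druntime's core.memory.GC, at the cost of heap growth in the meantime. A minimal sketch:

```d
import core.memory : GC;

// Allocate a burst of objects with collections suspended. GC.disable()
// prevents collection cycles from running (allocation still works and the
// heap simply grows); scope(exit) guarantees the GC is re-enabled even if
// the loop throws.
void buildManyObjects()
{
    GC.disable();
    scope (exit) GC.enable();

    foreach (i; 0 .. 100_000)
    {
        auto c = new Object;  // GC-heap allocation, but no pauses meanwhile
    }
}
```

This does not make allocation itself scale across threads, since the allocator's internal lock is still taken; it only keeps collection pauses out of the hot loop.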

> And can I use threads to create multiple instance faster or that is just
> not possible?

My example program that did nothing but constructed objects on the GC heap cannot be an indicator of the performance of all multi-threaded programs. In real programs there will be computation-intensive parts; there will be parts blocked on I/O; etc. There is no way of knowing without measuring.

Ali
