May 06, 2014
On Monday, 5 May 2014 at 22:11:39 UTC, Ali Çehreli wrote:
> On 05/05/2014 02:38 PM, Kapps wrote:
>
> > I think that the GC actually blocks when
> > creating objects, and thus multiple threads creating
> > instances would not
> > provide a significant speedup, possibly even a slowdown.
>
> Wow! That is the case. :)
>
> > You'd want to benchmark this to be certain it helps.
>
> I did:
>
> import std.range;
> import std.parallelism;
>
> class C
> {}
>
> void foo()
> {
>     auto c = new C;
> }
>
> void main(string[] args)
> {
>     enum totalElements = 10_000_000;
>
>     if (args.length > 1) {
>         foreach (i; iota(totalElements).parallel) {
>             foo();
>         }
>
>     } else {
>         foreach (i; iota(totalElements)) {
>             foo();
>         }
>     }
> }
>
> Typical run on my system for "-O -noboundscheck -inline":
>
> $ time ./deneme parallel
>
> real	0m4.236s
> user	0m4.325s
> sys	0m9.795s
>
> $ time ./deneme
>
> real	0m0.753s
> user	0m0.748s
> sys	0m0.003s
>
> Ali

Huh, that's a much, much higher impact than I'd expected.
I tried with GDC as well (the one in Debian stable, which is unfortunately still 2.055...) and got similar results. I also tried creating only totalCPUs threads and having each of them create NUM_ELEMENTS / totalCPUs objects, rather than risking that each creation became its own task, and the results were still about the same.

Using malloc and emplace instead of new, the results are about 50% faster single-threaded and roughly 3-4 times faster multi-threaded (4-core/8-thread machine, Linux 64-bit). The multi-threaded version is still twice as slow as single-threaded, though. On my Windows laptop (with the program compiled for 32-bit), it did not make a significant difference, and the multi-threaded version is still 4 times slower.

That being said, I think most malloc implementations, while thread-safe, use locks internally and do not scale well across threads.

Code:
import std.range;
import std.parallelism;
import std.datetime;
import std.stdio;
import core.stdc.stdlib;
import std.conv;

class C {}

void foo() {
    //auto c = new C;
    enum size = __traits(classInstanceSize, C);
    // Allocate from the C heap instead of the GC heap;
    // deliberately never freed, since this is only a benchmark.
    void[] mem = malloc(size)[0 .. size];
    emplace!C(mem);
}

void createFoos(size_t count) {
    foreach(i; 0 .. count) {
        foo();
    }
}

void main(string[] args) {
    StopWatch sw = StopWatch(AutoStart.yes);
    enum totalElements = 10_000_000;
    if (args.length <= 1) {
        foreach (i; iota(totalElements)) {
            foo();
        }
    } else if(args[1] == "tasks") {
        foreach (i; parallel(iota(totalElements))) {
            foo();
        }
    } else if(args[1] == "parallel") {
        for(int i = 0; i < totalCPUs; i++) {
            taskPool.put(task(&createFoos, totalElements / totalCPUs));
        }
        taskPool.finish(true);
    } else
        writeln("Unknown argument '", args[1], "'.");
    sw.stop();
    writeln(cast(Duration)sw.peek);
}

Results (Linux 64-bit):
shardsoft:~$ dmd -O -inline -release test.d
shardsoft:~$ ./test
552 ms, 729 μs, and 7 hnsecs
shardsoft:~$ ./test
532 ms, 139 μs, and 5 hnsecs
shardsoft:~$ ./test tasks
1 sec, 171 ms, 126 μs, and 4 hnsecs
shardsoft:~$ ./test tasks
1 sec, 38 ms, 468 μs, and 6 hnsecs
shardsoft:~$ ./test parallel
1 sec, 146 ms, 738 μs, and 2 hnsecs
shardsoft:~$ ./test parallel
1 sec, 268 ms, 195 μs, and 3 hnsecs



May 06, 2014
On Tuesday, 6 May 2014 at 15:56:11 UTC, Kapps wrote:
> On Monday, 5 May 2014 at 22:11:39 UTC, Ali Çehreli wrote:
>> [snip]
>
> Huh, that's a much, much, higher impact than I'd expected.
> I tried with GDC as well (the one in Debian stable, which is unfortunately still 2.055...) and got similar results. I also tried creating only totalCPUs threads and having each of them create NUM_ELEMENTS / totalCPUs objects rather than risking that each creation was a task, and it still seems to be the same.
>
>snip

I tried with using an allocator that never releases memory, rounds up to a power of 2, and is lock-free. The results are quite a bit better.

shardsoft:~$ ./test
1 sec, 47 ms, 474 μs, and 4 hnsecs
shardsoft:~$ ./test
1 sec, 43 ms, 588 μs, and 2 hnsecs
shardsoft:~$ ./test tasks
692 ms, 769 μs, and 8 hnsecs
shardsoft:~$ ./test tasks
692 ms, 686 μs, and 8 hnsecs
shardsoft:~$ ./test parallel
691 ms, 856 μs, and 9 hnsecs
shardsoft:~$ ./test parallel
690 ms, 22 μs, and 3 hnsecs

I get similar results on my laptop (which is much faster than the results I got on it using DMD's malloc):
>test
1 sec, 125 ms, and 847 μs
>test
1 sec, 125 ms, 741 μs, and 6 hnsecs

>test tasks
556 ms, 613 μs, and 8 hnsecs
>test tasks
552 ms and 287 μs

>test parallel
554 ms, 542 μs, and 6 hnsecs
>test parallel
551 ms, 514 μs, and 9 hnsecs


Code:
http://pastie.org/9146326

Unfortunately it doesn't compile with the ancient version of GDC available in Debian, so I couldn't test with that; the results there should be quite a bit better still, since core.atomic would be faster. And frankly, I'm not sure the allocator actually works properly, but it's just for testing purposes anyway.
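In case the pastie link goes away: here is a minimal sketch of the kind of allocator described above. The names (LockFreeHeap, alloc) and layout are my own, not the original code. It grabs one large block up front, hands out power-of-two sized chunks with a single atomic add, and never frees anything:

```d
import core.atomic : atomicOp;
import core.stdc.stdlib : malloc;

// Hypothetical sketch; the original pastie code may differ.
struct LockFreeHeap
{
    private void* base;
    private shared size_t offset;  // bump pointer, advanced atomically
    private size_t capacity;

    this(size_t capacity)
    {
        this.base = malloc(capacity);
        this.capacity = capacity;
    }

    void[] alloc(size_t size)
    {
        // Round the request up to the next power of two.
        size_t rounded = 1;
        while (rounded < size)
            rounded <<= 1;

        // Reserve a chunk with one atomic add; no locks are taken.
        immutable end = atomicOp!"+="(offset, rounded);
        if (end > capacity)
            return null;  // out of space; a real allocator would chain blocks
        return base[end - rounded .. end];
    }
}
```

foo() above could then emplace into heap.alloc(size) instead of calling malloc per object. The single fetch-and-add is why contention stays low: threads never wait on a lock, they only race on one atomic increment.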

May 06, 2014
On 05/06/2014 05:46 AM, hardcoremore wrote:

> But what exactly does it mean that the Garbage Collector blocks? What
> does it block, and in which way?

I know this much: the current GC that comes with the D runtime is a single-threaded, "stop-the-world" GC, meaning that all threads are paused while it runs a garbage collection cycle.
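As an aside (not something discussed above): when a program needs to allocate a burst of objects, collection cycles can be suspended around the burst using druntime's core.memory.GC, at the cost of heap growth in the meantime. A minimal sketch:

```d
import core.memory : GC;

// Allocate a burst of objects with collections suspended. GC.disable()
// prevents collection cycles from running (allocation still works and the
// heap simply grows); scope(exit) guarantees the GC is re-enabled even if
// the loop throws.
void buildManyObjects()
{
    GC.disable();
    scope (exit) GC.enable();

    foreach (i; 0 .. 100_000)
    {
        auto c = new Object;  // GC-heap allocation, but no pauses meanwhile
    }
}
```

This does not make allocation itself scale across threads, since the allocator's internal lock is still taken; it only keeps collection pauses out of the hot loop.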

> And can I use threads to create multiple instance faster or that is just
> not possible?

My example program that did nothing but constructed objects on the GC heap cannot be an indicator of the performance of all multi-threaded programs. In real programs there will be computation-intensive parts; there will be parts blocked on I/O; etc. There is no way of knowing without measuring.

Ali
