May 06, 2014 Re: Create many objects using threads | ||||
---|---|---|---|---|
| ||||
Posted in reply to Ali Çehreli | On Monday, 5 May 2014 at 22:11:39 UTC, Ali Çehreli wrote: > On 05/05/2014 02:38 PM, Kapps wrote: > > > I think that the GC actually blocks when > > creating objects, and thus multiple threads creating > instances would not > > provide a significant speedup, possibly even a slowdown. > > Wow! That is the case. :) > > > You'd want to benchmark this to be certain it helps. > > I did: > > import std.range; > import std.parallelism; > > class C > {} > > void foo() > { > auto c = new C; > } > > void main(string[] args) > { > enum totalElements = 10_000_000; > > if (args.length > 1) { > foreach (i; iota(totalElements).parallel) { > foo(); > } > > } else { > foreach (i; iota(totalElements)) { > foo(); > } > } > } > > Typical run on my system for "-O -noboundscheck -inline": > > $ time ./deneme parallel > > real 0m4.236s > user 0m4.325s > sys 0m9.795s > > $ time ./deneme > > real 0m0.753s > user 0m0.748s > sys 0m0.003s > > Ali Huh, that's a much, much, higher impact than I'd expected. I tried with GDC as well (the one in Debian stable, which is unfortunately still 2.055...) and got similar results. I also tried creating only totalCPUs threads and having each of them create NUM_ELEMENTS / totalCPUs objects rather than risking that each creation was a task, and it still seems to be the same. Using malloc and emplace instead of new D, results are about 50% faster for single-threadeded and ~3-4 times faster for multi-threaded (4 cpu 8 thread machine, Linux 64-bit). The multi-threaded version is still twice as slow though. On my Windows laptop (with the program compiled for 32-bit), it did not make a significant difference and the multi-threaded version is still 4 times slower. That being said, I think most malloc implementations while being thread-safe, usually use locks or do not scale well. Code: import std.range; import std.parallelism; import std.datetime; import std.stdio; import core.stdc.stdlib; import std.conv; class C {} void foo() { //auto c = new C; enum size = __traits(classInstanceSize, C); void[] mem = malloc(size)[0..size]; emplace!C(mem); } void createFoos(size_t count) { foreach(i; 0 .. count) { foo(); } } void main(string[] args) { StopWatch sw = StopWatch(AutoStart.yes); enum totalElements = 10_000_000; if (args.length <= 1) { foreach (i; iota(totalElements)) { foo(); } } else if(args[1] == "tasks") { foreach (i; parallel(iota(totalElements))) { foo(); } } else if(args[1] == "parallel") { for(int i = 0; i < totalCPUs; i++) { taskPool.put(task(&createFoos, totalElements / totalCPUs)); } taskPool.finish(true); } else writeln("Unknown argument '", args[1], "'."); sw.stop(); writeln(cast(Duration)sw.peek); } Results (Linux 64-bit): shardsoft:~$ dmd -O -inline -release test.d shardsoft:~$ ./test 552 ms, 729 μs, and 7 hnsecs shardsoft:~$ ./test 532 ms, 139 μs, and 5 hnsecs shardsoft:~$ ./test tasks 1 sec, 171 ms, 126 μs, and 4 hnsecs shardsoft:~$ ./test tasks 1 sec, 38 ms, 468 μs, and 6 hnsecs shardsoft:~$ ./test parallel 1 sec, 146 ms, 738 μs, and 2 hnsecs shardsoft:~$ ./test parallel 1 sec, 268 ms, 195 μs, and 3 hnsecs |
May 06, 2014 Re: Create many objects using threads | ||||
---|---|---|---|---|
| ||||
Posted in reply to Kapps | On Tuesday, 6 May 2014 at 15:56:11 UTC, Kapps wrote: > On Monday, 5 May 2014 at 22:11:39 UTC, Ali Çehreli wrote: >> On 05/05/2014 02:38 PM, Kapps wrote: >> >> > I think that the GC actually blocks when >> > creating objects, and thus multiple threads creating >> instances would not >> > provide a significant speedup, possibly even a slowdown. >> >> Wow! That is the case. :) >> >> > You'd want to benchmark this to be certain it helps. >> >> I did: >> >> import std.range; >> import std.parallelism; >> >> class C >> {} >> >> void foo() >> { >> auto c = new C; >> } >> >> void main(string[] args) >> { >> enum totalElements = 10_000_000; >> >> if (args.length > 1) { >> foreach (i; iota(totalElements).parallel) { >> foo(); >> } >> >> } else { >> foreach (i; iota(totalElements)) { >> foo(); >> } >> } >> } >> >> Typical run on my system for "-O -noboundscheck -inline": >> >> $ time ./deneme parallel >> >> real 0m4.236s >> user 0m4.325s >> sys 0m9.795s >> >> $ time ./deneme >> >> real 0m0.753s >> user 0m0.748s >> sys 0m0.003s >> >> Ali > > Huh, that's a much, much, higher impact than I'd expected. > I tried with GDC as well (the one in Debian stable, which is unfortunately still 2.055...) and got similar results. I also tried creating only totalCPUs threads and having each of them create NUM_ELEMENTS / totalCPUs objects rather than risking that each creation was a task, and it still seems to be the same. > >snip I tried with using an allocator that never releases memory, rounds up to a power of 2, and is lock-free. The results are quite a bit better. shardsoft:~$ ./test 1 sec, 47 ms, 474 μs, and 4 hnsecs shardsoft:~$ ./test 1 sec, 43 ms, 588 μs, and 2 hnsecs shardsoft:~$ ./test tasks 692 ms, 769 μs, and 8 hnsecs shardsoft:~$ ./test tasks 692 ms, 686 μs, and 8 hnsecs shardsoft:~$ ./test parallel 691 ms, 856 μs, and 9 hnsecs shardsoft:~$ ./test parallel 690 ms, 22 μs, and 3 hnsecs I get similar results on my laptop (which is much faster than the results I got on it using DMD's malloc): >test 1 sec, 125 ms, and 847 ╬╝s >test 1 sec, 125 ms, 741 ╬╝s, and 6 hnsecs >test tasks 556 ms, 613 ╬╝s, and 8 hnsecs >test tasks 552 ms and 287 ╬╝s >test parallel 554 ms, 542 ╬╝s, and 6 hnsecs >test parallel 551 ms, 514 ╬╝s, and 9 hnsecs Code: http://pastie.org/9146326 Unfortunately it doesn't compile with the ancient version of gdc available in Debian, so I couldn't test with that. The results should be quite a bit better since core.atomic would be faster. And frankly, I'm not sure if the allocator actually works properly, but it's just for testing purposes anyways. |
May 06, 2014 Re: Create many objects using threads | ||||
---|---|---|---|---|
| ||||
Posted in reply to hardcoremore | On 05/06/2014 05:46 AM, hardcoremore wrote: > But what does exactly means that Garbage Collector blocks? What > does it blocks and in which way? I know this much: The current GC that comes in D runtime is a single-threaded GC (aka "a stop-the-world GC"), meaning that all threads are stopped when the GC is running a garbage collection cycle. > And can I use threads to create multiple instance faster or that is just > not possible? My example program that did nothing but constructed objects on the GC heap cannot be an indicator of the performance of all multi-threaded programs. In real programs there will be computation-intensive parts; there will be parts blocked on I/O; etc. There is no way of knowing without measuring. Ali |
Copyright © 1999-2021 by the D Language Foundation