February 09, 2012
On 09.02.2012 at 17:22, dsimcha <dsimcha@yahoo.com> wrote:

> I wonder how much it helps to just optimize the GC a little.  How much does the performance gap close when you use DMD 2.058 beta instead of 2.057?  This upcoming release has several new garbage collector optimizations.  If the GC is the bottleneck, then it's not surprising that anything that relies heavily on it is slow because D's GC is still fairly naive.

I did some OProfile-ing. The full report is attached, but for simplicity it omits the call graph this time. Here is an excerpt:

CPU: Core 2, speed 2001 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  %        linenr info                 symbol name
13838    18.8416  gcx.d:426                   void* gc.gcx.GC.malloc(ulong, uint, ulong*)
4465      6.0795  gcx.d:2454                  ulong gc.gcx.Gcx.fullcollect(void*)
...

Compiled with: gcc version 4.6.2 20111026 (gdc 0.31 - r751:34491c2e7bb4, using dmd 2.057) (GCC)

February 09, 2012
On 09.02.2012 at 18:35, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:
>> Generally, D's message passing is implemented in quite an easy-to-use
>> way, but far from being fast.
>> I dislike the Variant structure, because it adds a huge overhead. I'd
>> rather have a templated message passing system with type-safe message
>> queue, so no Variant is necessary.
>> In specific cases Messages can be polymorphic objects. This will be
>> way faster than Variant.
>
> cc Sean Kelly
>
> I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags).
>
> We must make the message passing subsystem not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's).
>
>
> Andrei

Well, what does +1 Variant and +1 LinkedListNode sum up to?
February 09, 2012
On Feb 9, 2012, at 10:14 AM, Marco Leise wrote:

> On 09.02.2012 at 17:22, dsimcha <dsimcha@yahoo.com> wrote:
> 
>> I wonder how much it helps to just optimize the GC a little.  How much does the performance gap close when you use DMD 2.058 beta instead of 2.057?  This upcoming release has several new garbage collector optimizations.  If the GC is the bottleneck, then it's not surprising that anything that relies heavily on it is slow because D's GC is still fairly naive.
> 
> I did some OProfile-ing. The full report is attached, but for simplicity it omits the call graph this time. Here is an excerpt:
> 
> CPU: Core 2, speed 2001 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples  %        linenr info                 symbol name
> 13838    18.8416  gcx.d:426                   void* gc.gcx.GC.malloc(ulong, uint, ulong*)
> 4465      6.0795  gcx.d:2454                  ulong gc.gcx.Gcx.fullcollect(void*)

One random thing that just occurred to me… if the standard receive pattern is:

receive((int x) { … });

There's a good chance that a stack frame is being dynamically allocated for the delegate when it's passed to receive (since I don't believe there's any way to declare the parameters to receive as "scope").  I'll have to check this, and maybe consider changing receive to use alias template parameters instead of normal function parameters?
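To illustrate the allocation I'm worried about, here's a minimal sketch (toy functions of my own, not the actual std.concurrency signatures), assuming the compiler only heap-allocates the closure when the delegate parameter may escape:

import std.stdio;

// dg may escape, so the caller's captured frame is typically GC-allocated
void callPlain(void delegate(int) dg) { dg(1); }

// dg cannot escape, so the captured frame can stay on the caller's stack
void callScoped(scope void delegate(int) dg) { dg(1); }

void main()
{
    int captured;
    callPlain((int x) { captured = x; });    // likely one closure allocation
    callScoped((int x) { captured = x; });   // no allocation needed
    writeln(captured);
}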
February 09, 2012
On 2/9/12 10:31 AM, Marco Leise wrote:
> On 09.02.2012 at 18:35, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:
>>> Generally, D's message passing is implemented in quite an easy-to-use
>>> way, but far from being fast.
>>> I dislike the Variant structure, because it adds a huge overhead. I'd
>>> rather have a templated message passing system with type-safe message
>>> queue, so no Variant is necessary.
>>> In specific cases Messages can be polymorphic objects. This will be
>>> way faster than Variant.
>>
>> cc Sean Kelly
>>
>> I haven't looked at the implementation, but one possible liability is
>> that large messages don't fit in a Variant and must use dynamic
>> allocation under the wraps. There are a number of ways to avoid that,
>> such as parallel arrays (one array per type for data and one for the
>> additional tags).
>>
>> We must make the message passing subsystem not use any memory
>> allocation in the quiescent state. If we're doing one allocation per
>> message passed, that might explain the 4x performance difference (I
>> have no trouble figuring Java's allocator is this much faster than D's).
>>
>>
>> Andrei
>
> Well, what does +1 Variant and +1 LinkedListNode sum up to?

Sorry, I don't understand...

Andrei
February 09, 2012
On 02/09/2012 08:27 PM, Sean Kelly wrote:
> On Feb 9, 2012, at 10:14 AM, Marco Leise wrote:
>
>> On 09.02.2012 at 17:22, dsimcha <dsimcha@yahoo.com> wrote:
>>
>>> I wonder how much it helps to just optimize the GC a little.  How much does the performance gap close when you use DMD 2.058 beta instead of 2.057?  This upcoming release has several new garbage collector optimizations.  If the GC is the bottleneck, then it's not surprising that anything that relies heavily on it is slow because D's GC is still fairly naive.
>>
>> I did some OProfile-ing. The full report is attached, but for simplicity it omits the call graph this time. Here is an excerpt:
>>
>> CPU: Core 2, speed 2001 MHz (estimated)
>> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
>> samples  %        linenr info                 symbol name
>> 13838    18.8416  gcx.d:426                   void* gc.gcx.GC.malloc(ulong, uint, ulong*)
>> 4465      6.0795  gcx.d:2454                  ulong gc.gcx.Gcx.fullcollect(void*)
>
> One random thing that just occurred to me… if the standard receive pattern is:
>
> receive((int x) { … });
>
> There's a good chance that a stack frame is being dynamically allocated for the delegate when it's passed to receive (since I don't believe there's any way to declare the parameters to receive as "scope").  I'll have to check this, and maybe consider changing receive to use alias template parameters instead of normal function parameters?

You can mark an entire tuple as scope without trouble:

void foo(T,S...)(T arg1, scope S args) {...}

Does this improve the run time?
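For example, something along these lines compiles fine; dispatch is just a hypothetical stand-in for receive, with the handler tuple marked scope:

import std.stdio;

// hypothetical receive-like helper: the handler tuple is scope, so the
// delegates cannot escape and should need no closure allocation
void dispatch(Handlers...)(int value, scope Handlers handlers)
{
    foreach (h; handlers)
        h(value);
}

void main()
{
    int seen;
    dispatch(42, (int x) { seen = x; });
    writeln(seen);   // 42
}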
February 09, 2012
On 09.02.2012 at 20:35, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

>>> If we're doing one allocation per
>>> message passed, that might explain the 4x performance difference (I
>>> have no trouble figuring Java's allocator is this much faster than D's).
>>>
>>>
>>> Andrei
>>
>> Well, what does +1 Variant and +1 LinkedListNode sum up to?
>
> Sorry, I don't understand...
>
> Andrei

There are at least 2 allocations, one for the Variant and one for the new node in the linked list aka message box. But from what you wrote it sounds like a Variant doesn't allocate unless the contained data exceeds some internal storage. Sean found another possible allocation in the other branch of this discussion.
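For illustration, a small sketch of that Variant behaviour (Big is an arbitrary type that I'm assuming is larger than Variant's internal buffer):

import std.stdio;
import std.variant;

struct Big { ubyte[128] payload; }   // assumed to exceed Variant's inline storage

void main()
{
    Variant small = 42;     // an int fits in the internal buffer, no allocation
    Variant large = Big();  // too large for the buffer, so it gets boxed on the heap
    writeln(small.get!int);                  // 42
    writeln(large.get!Big.payload.length);   // 128
}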
February 09, 2012
Marco Leise:

> Sean found another possible allocation in the other branch of this discussion.

Maybe this is able to help Sean and similar situations: http://d.puremagic.com/issues/show_bug.cgi?id=5070

Bye,
bearophile
February 09, 2012
On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly <sean@invisibleduck.org> wrote:

> So a queue per message type?  How would ordering be preserved? Also, how would this work for interprocess messaging?  An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc.  I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.

I haven't yet gotten around to polishing my lock-free SList/DList implementations,
but mutexes should only become a problem under high contention, when you need to block.
You would also need some kind of blocking for lock-free lists.

The best first-order optimization would be to allocate the list nodes deterministically.
The only reason to use GC memory for them is that malloc'ing is still too cumbersome.
Nodes are unshared, so you'd want a unique_pointer.
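Roughly what I mean, as a sketch (plain malloc plus a thread-local free-list; nothing that exists in druntime yet):

import core.stdc.stdlib : malloc, free;
import std.stdio;

struct Node
{
    int   payload;   // stand-in for the real message
    Node* next;
}

// thread-local free-list: nodes are recycled instead of being GC-allocated
Node* freeList;

Node* acquireNode()
{
    if (freeList !is null)
    {
        auto n = freeList;
        freeList = n.next;
        return n;
    }
    return cast(Node*) malloc(Node.sizeof);
}

void releaseNode(Node* n)
{
    n.next = freeList;
    freeList = n;
}

void main()
{
    auto n = acquireNode();
    n.payload = 123;
    writeln(n.payload);
    releaseNode(n);   // goes back onto the free-list, not to the GC
}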
February 09, 2012
On 2/9/12 11:49 AM, Marco Leise wrote:
> On 09.02.2012 at 20:35, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>>>> If we're doing one allocation per
>>>> message passed, that might explain the 4x performance difference (I
>>>> have no trouble figuring Java's allocator is this much faster than
>>>> D's).
>>>>
>>>>
>>>> Andrei
>>>
>>> Well, what does +1 Variant and +1 LinkedListNode sum up to?
>>
>> Sorry, I don't understand...
>>
>> Andrei
>
> There are at least 2 allocations, one for the Variant and one for the
> new node in the linked list aka message box. But from what you wrote it
> sounds like a Variant doesn't allocate unless the contained data exceeds
> some internal storage. Sean found another possible allocation in the
> other branch of this discussion.

I understand. The good news is, this looks like low-hanging fruit! I'll keep an eye on pull requests in druntime. Thanks to fellow Romanian Nicolae Mihalache for contributing the comparison.


Andrei
February 09, 2012
On Feb 9, 2012, at 10:31 AM, Marco Leise wrote:

> On 09.02.2012 at 18:35, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
> 
>> On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:
>>> Generally, D's message passing is implemented in quite an easy-to-use
>>> way, but far from being fast.
>>> I dislike the Variant structure, because it adds a huge overhead. I'd
>>> rather have a templated message passing system with type-safe message
>>> queue, so no Variant is necessary.
>>> In specific cases Messages can be polymorphic objects. This will be
>>> way faster than Variant.
>> 
>> cc Sean Kelly
>> 
>> I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags).
>> 
>> We must make the message passing subsystem not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's).
> 
> Well, what does +1 Variant and +1 LinkedListNode sum up to?

FWIW, you can use DMD's built-in profiler as long as the receiving thread is the same as the sending thread:

import std.concurrency;

void main() {
    for(int i = 0; i < 1_000_000; i++) {
        send(thisTid, 12345);
        auto x = receiveOnly!int();
    }
}


I generated timings for this both before and after adding "scope" to mbox.get():

$ dmd -release -inline -O abc
$ time abc

real	0m0.831s
user	0m0.829s
sys	0m0.002s

… add "scope" to mbox.get()

$ dmd -release -inline -O abc
$ time abc

real	0m0.653s
user	0m0.649s
sys	0m0.003s


And here's the trace log after "scope" was added.  Notice that there were 61 calls to Gcx.fullcollect().  We can also see that there was one allocation per send/receive operation, so the only allocation is for the message list node.

$ dmd -O -release -profile abc
gladsheim:misc sean$ time abc

real	0m11.348s
user	0m11.331s
sys	0m0.015s


======== Timer Is 3579545 Ticks/Sec, Times are in Microsecs ========

  Num          Tree        Func        Per
  Calls        Time        Time        Call

1000000   437709765   220179413         220     void std.concurrency._send!(int)._send(std.concurrency.MsgType, std.concurrency.Tid, int)
1000000   300987757   140736393         140     bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void function(std.concurrency.LinkTerminated)*, pure @safe void function(std.concurrency.OwnerTerminated)*, pure @safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow @safe void delegate(int), scope pure @safe void function(std.concurrency.LinkTerminated)*, scope pure @safe void function(std.concurrency.OwnerTerminated)*, scope pure @safe void function(std.variant.VariantN!(32u).VariantN)*)
1000000   202131609    89479808          89     void* gc.gcx.GC.malloc(uint, uint, uint*)
      1   825045422    57556501    57556501     _Dmain
1000033   112651800    52026745          52     void* gc.gcx.GC.mallocNoSync(uint, uint, uint*)
     61    53422342    49606106      813214     uint gc.gcx.Gcx.fullcollect(void*)
2000000   160103753    42531732          21     bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void function(std.concurrency.LinkTerminated)*, pure @safe void function(std.concurrency.OwnerTerminated)*, pure @safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow @safe void delegate(int), scope pure @safe void function(std.concurrency.LinkTerminated)*, scope pure @safe void function(std.concurrency.OwnerTerminated)*, scope pure @safe void function(std.variant.VariantN!(32u).VariantN)*).bool scan(ref std.concurrency.List!(std.concurrency.Message).List)
2000000    42018612    39837170          19     int std.variant.VariantN!(32u).VariantN.handler!(int).handler(std.variant.VariantN!(32u).VariantN.OpID, ubyte[32]*, void*)
1000000   117572021    24641771          24     bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void function(std.concurrency.LinkTerminated)*, pure @safe void function(std.concurrency.OwnerTerminated)*, pure @safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow @safe void delegate(int), scope pure @safe void function(std.concurrency.LinkTerminated)*, scope pure @safe void function(std.concurrency.OwnerTerminated)*, scope pure @safe void function(std.variant.VariantN!(32u).VariantN)*).bool onStandardMsg(ref std.concurrency.Message)
1000000    47280794    20418675          20     void std.concurrency.Message.map!(nothrow @safe void delegate(int)).map(nothrow @safe void delegate(int))
1000000   316556767    15569009          15     int std.concurrency.receiveOnly!(int).receiveOnly()
1000000    36317362    13212905          13     @property bool std.variant.VariantN!(32u).VariantN.convertsTo!(int).convertsTo()
1000000    15445906    10879089          10     std.concurrency.Message std.concurrency.Message.__ctor!(int).__ctor(std.concurrency.MsgType, int)
1000000    45649454     9332092           9     @property bool std.concurrency.Message.convertsTo!(int).convertsTo()
1000000    26790032     7875877           7     @property int std.variant.VariantN!(32u).VariantN.get!(int).get()
1000000   444757778     7048013           7     void std.concurrency._send!(int)._send(std.concurrency.Tid, int)
  15512     6681162     6657279         429     int gc.gcx.Gcx.allocPage(ubyte)
1000000   450932153     6174374           6     void std.concurrency.send!(int).send(std.concurrency.Tid, int)
1000000     4566817     4566817           4     std.variant.VariantN!(32u).VariantN std.variant.VariantN!(32u).VariantN.opAssign!(int).opAssign(int)
   4087     3635735     3518353         860     void gc.gcx.Gcx.mark(void*, void*, int)
2000000     2069965     2069965           1     int std.variant.VariantN!(32u).VariantN.handler!(int).handler(std.variant.VariantN!(32u).VariantN.OpID, ubyte[32]*, void*).bool tryPutting(int*, TypeInfo, void*)
2990271      195287      195287           0     void* gc.gcx.sentinel_add(void*)
1000000      147610      147610           0     bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void function(std.concurrency.LinkTerminated)*, pure @safe void function(std.concurrency.OwnerTerminated)*, pure @safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow @safe void delegate(int), scope pure @safe void function(std.concurrency.LinkTerminated)*, scope pure @safe void function(std.concurrency.OwnerTerminated)*, scope pure @safe void function(std.variant.VariantN!(32u).VariantN)*).bool pty(ref std.concurrency.List!(std.concurrency.Message).List)
1000032      121459      121459           0     void gc.gcx.Gcx.setBits(gc.gcx.Pool*, uint, uint)
2000000      111475      111475           0     int std.variant.VariantN!(32u).VariantN.handler!(int).handler(std.variant.VariantN!(32u).VariantN.OpID, ubyte[32]*, void*).int* getPtr(void*)
1000033      102413      102413           0     ubyte gc.gcx.Gcx.findBin(uint)
1002228       94742       94742           0     @property uint gc.gcx.Pool.shiftBy()
1000000       72086       72086           0     int std.concurrency.receiveOnly!(int).receiveOnly().nothrow @safe void __lambda17(int)
 995119       70854       70854           0     void gc.gcx.Gcx.log_free(void*)
1000033       61505       61505           0     void gc.gcx.sentinel_init(void*, uint)
1000033       55070       55070           0     void gc.gcx.Gcx.log_malloc(void*, uint)
 995119       51748       51748           0     void gc.gcx.sentinel_Invariant(const(void*))
      1       24692       24692       24692     void gc.gcx.Pool.initialize(uint, bool)
   3538     3544517       24525           6     void gc.gcx.Gcx.mark(void*, void*)
  15511       23883       23522           1     uint gc.gcx.Pool.allocPages(uint)
 124447       11615       11615           0     void gc.gcx.Gcx.clrBitsSmallSweep(gc.gcx.Pool*, uint, uint)
      1        7051        5921        5921     void gc.gcx.GC.initialize()
      1       27490        2797        2797     gc.gcx.Pool* gc.gcx.Gcx.newPool(uint, bool)
     61    53424783        2441          40     uint gc.gcx.Gcx.fullcollectshell()
     55        2227        2227          40     void gc.gcx.Gcx.addRange(void*, void*)
      1        1129        1119        1119     void gc.gcx.Gcx.initialize()
     16         360         360          22     uint gc.gcx.Pool.extendPages(uint)
      1        2547         319         319     void gc.gcx.GC.addRange(void*, uint)
   2196         278         278           0     gc.gcx.Pool* gc.gcx.Gcx.findPool(void*)
      1          65          65          65     void gc.gcx.GC.disable()
      1           9           9           9     void gc.gcx.Gcx.log_init()
      1           5           5           5     void gc.gcx.GC.enable()
      1           0           0           0     void gc.gcx.GC.setStackBottom(void*)
      1           0           0           0     @property uint gc.gcx.Pool.divisor()