February 09, 2012
On Feb 9, 2012, at 11:53 AM, bearophile wrote:

> Marco Leise:
> 
>> Sean found another possible allocation in the other branch of this discussion.
> 
> Maybe this can help Sean in this and similar situations: http://d.puremagic.com/issues/show_bug.cgi?id=5070

This would be handy.  I don't always think to check the asm dump when I'm working with delegates.
February 09, 2012
On Feb 9, 2012, at 11:57 AM, Martin Nowak wrote:

> On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly <sean@invisibleduck.org> wrote:
> 
>> So a queue per message type?  How would ordering be preserved? Also, how would this work for interprocess messaging?  An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc.  I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.
> 
> I haven't yet gotten around to polishing my lock-free SList/DList implementations,
> but mutexes should only become a problem under high contention, when you need to block.
> You'd also need some kind of blocking for lock-free lists.

No blocking should be necessary for the lock-free list.  Just try to steal a node with a CAS.  If the result was null (i.e. if the list ended up being empty), allocate a node via malloc/GC.
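
For reference, a rough sketch of that scheme (the names are mine, not from druntime, and it glosses over the ABA problem a production version would have to handle):

import core.atomic;
import core.stdc.stdlib : malloc;

struct Node
{
    Node* next;
    // message payload would live here
}

shared Node* head;  // top of the global node free-list

Node* acquireNode()
{
    for (;;)
    {
        auto h = atomicLoad(head);
        if (h is null)  // list ended up empty: fall back to the allocator
            return cast(Node*) malloc(Node.sizeof);
        auto next = atomicLoad(h.next);
        if (cas(&head, h, next))  // try to steal the head node
            return cast(Node*) h;
        // another thread won the race; retry
    }
}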

> Best first order optimization would be to allocate the list node deterministically.

Neat idea.  I think I can make that change fairly trivially.
February 09, 2012
I suggest using a template-generated type that can contain any of the messages to be sent over a channel. It is reasonably straightforward to generate all the boilerplate code necessary to make this happen. My prototype (attached) still needs work to remove Linux dependencies and tighten it up, but it works OK. Another advantage of this approach (well, I see it as an advantage) is that you declare in a single location all the messages that can be sent over the channel, and of course the messages are type-safe.

The file of interest is concurrency.d.
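
For anyone who doesn't want to open the attachment, the core idea looks roughly like this (a simplified sketch, not the actual code):

import std.typetuple : staticIndexOf;

// All message types a channel can carry are declared in one place; storage
// is sized for the largest one, so no Variant is needed. (Alignment is
// glossed over here.)
struct Message(Types...)
{
    private ubyte[maxSize!Types] store;
    private int tag = -1;  // which of Types is currently held

    void set(T)(T value) if (staticIndexOf!(T, Types) >= 0)
    {
        tag = staticIndexOf!(T, Types);
        *cast(T*) store.ptr = value;
    }

    T get(T)() if (staticIndexOf!(T, Types) >= 0)
    {
        assert(tag == staticIndexOf!(T, Types), "wrong message type");
        return *cast(T*) store.ptr;
    }
}

private template maxSize(Types...)
{
    static if (Types.length == 1)
        enum maxSize = Types[0].sizeof;
    else
        enum maxSize = Types[0].sizeof > maxSize!(Types[1 .. $])
            ? Types[0].sizeof : maxSize!(Types[1 .. $]);
}

A channel over, say, Message!(int, string, SomeCommand) then accepts exactly those types, and anything else is rejected at compile time.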

On 10/02/12 02:14, Sean Kelly wrote:
> So a queue per message type?  How would ordering be preserved? Also, how would this work for interprocess messaging?  An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc.  I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.
>
> On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan<gor.f.gyolchanyan@gmail.com>  wrote:
>
>> Generally, D's message passing is implemented in quite an easy-to-use
>> way, but it is far from fast.
>> I dislike the Variant structure, because it adds a huge overhead. I'd
>> rather have a templated message passing system with a type-safe message
>> queue, so no Variant is necessary.
>> In specific cases messages can be polymorphic objects. This would be
>> way faster than Variant.
>>
>> On Thu, Feb 9, 2012 at 3:12 PM, Alex_Dovhal<alex_dovhal@yahoo.com>  wrote:
>>> Sorry, my mistake. It's strange to have different 'n', but you measure speed as 1000*n/time, so it doesn't matter if n is 10 times bigger.
>>>
>>>
>>
>>
>> -- 
>> Bye,
>> Gor Gyolchanyan.


-- 
Graham St Jack



February 09, 2012
On Feb 9, 2012, at 2:17 PM, Sean Kelly wrote:
> 
>> Best first order optimization would be to allocate the list node deterministically.
> 
> Neat idea.  I think I can make that change fairly trivially.

$ time abc

real	0m0.556s
user	0m0.555s
sys	0m0.001s

So another 100ms improvement.  Switching to a (__gshared, no mutex) free-list that falls back on malloc yields:

$ time abc

real	0m0.505s
user	0m0.503s
sys	0m0.001s

Not as much of a gain there, and I believe we've eliminated all the allocations (though I'd have to do a profiling build to verify).  Still, that's approaching twice as fast as before, which is definitely something.
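
For context, the free-list version is presumably shaped something like the sketch below (the names are guesses; with __gshared and no mutex this is only valid while a single thread owns the list):

import core.stdc.stdlib : malloc;

struct Node { Node* next; /* payload elided */ }

__gshared Node* freeList;  // recycled nodes, deliberately unsynchronized

Node* getNode()
{
    if (auto n = freeList)  // reuse a recycled node when available
    {
        freeList = n.next;
        return n;
    }
    return cast(Node*) malloc(Node.sizeof);  // empty: fall back to malloc
}

void putNode(Node* n)
{
    n.next = freeList;  // push back for reuse instead of freeing
    freeList = n;
}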

February 10, 2012
On 2/9/12 11:17 PM, Sean Kelly wrote:
> On Feb 9, 2012, at 11:57 AM, Martin Nowak wrote:
>> I haven't yet gotten around to polishing my lock-free SList/DList implementations,
>> but mutexes should only become a problem under high contention, when you need to block.
>> You'd also need some kind of blocking for lock-free lists.
>
> No blocking should be necessary for the lock-free list.  Just try to steal a node with a CAS.  If the result was null (i.e. if the list ended up being empty), allocate a node via malloc/GC.

And the neat thing is that you don't have to worry about node deletion as much when you have a GC…

David
February 10, 2012
> > I wonder how much it helps to just optimize the GC a little.  How much does the performance gap close when you use DMD 2.058 beta instead of 2.057?  This upcoming release has several new garbage collector optimizations.  If the GC is the bottleneck, then it's not surprising

Is there a way to "turn off" the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is unlikely to kick in? I searched through this page: http://www.d-programming-language.org/dmd-windows.html#switches but couldn't find anything helpful. Then you could measure the thing with the GC "turned off" to see whether the GC is the problem or not.

-- Oliver

February 10, 2012
On 2012-02-10 14:54, Oliver Plow wrote:
>>> I wonder how much it helps to just optimize the GC a little.  How much
>>> does the performance gap close when you use DMD 2.058 beta instead of
>>> 2.057?  This upcoming release has several new garbage collector
>>> optimizations.  If the GC is the bottleneck, then it's not surprising
>
> Is there a way to "turn off" the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is unlikely to kick in? I searched through this page: http://www.d-programming-language.org/dmd-windows.html#switches but couldn't find anything helpful. Then you could measure the thing with the GC "turned off" to see whether the GC is the problem or not.
>
> -- Oliver

There's a stub implementation of the GC which uses malloc, IIRC. Try using that one at link time.

https://github.com/D-Programming-Language/druntime/blob/master/src/gcstub/gc.d

-- 
/Jacob Carlborg
February 10, 2012
On 09/02/2012 20:57, Martin Nowak wrote:
> On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly <sean@invisibleduck.org>
> wrote:
>
>> So a queue per message type? How would ordering be preserved? Also,
>> how would this work for interprocess messaging? An array-based queue
>> is an option however (though it would mean memmoves on receive), as
>> are free-lists for nodes, etc. I guess the easiest thing there would
>> be a lock-free shared slist for the node free-list, though I couldn't
>> weigh the chance of cache misses from using old memory blocks vs. just
>> expecting the allocator to be fast.
>
> I haven't yet gotten around to polishing my lock-free SList/DList
> implementations, but mutexes should only become a problem under high
> contention, when you need to block.
> You'd also need some kind of blocking for lock-free lists.
>

Doesn't this require shared to be working?
February 10, 2012
On 02/10/12 14:54, Oliver Plow wrote:
>>> I wonder how much it helps to just optimize the GC a little.  How much does the performance gap close when you use DMD 2.058 beta instead of 2.057?  This upcoming release has several new garbage collector optimizations.  If the GC is the bottleneck, then it's not surprising
> 
> Is there a way to "turn off" the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is unlikely to kick in? I searched through this page: http://www.d-programming-language.org/dmd-windows.html#switches but couldn't find anything helpful. Then you could measure the thing with the GC "turned off" to see whether the GC is the problem or not.


Calling GC.disable() at runtime will delay collection until it's actually
needed, but won't disable the GC completely.
Having a standard no-op GC stub selectable by a switch would be nice, but
you can get the same effect by giving the linker an object file that
provides the necessary stubs.

For this test case, something like the patch below improves things significantly; for more gains, std.concurrency needs more invasive changes.

Note it's just a POC, meant to measure the current std.concurrency efficiency vs other approaches. The "freelist" arrays are not freed and leak; a complete implementation would free them when the link is shut down. Ignore the synchronize() calls - "synchronized" isn't properly lowered by the compiler, so I had to resort to this after switching the locking primitives. It should work equally well with std "synchronized".

The original testcase from this thread achieves ~4M msg/sec with this change (the numbers aren't stable, but mostly in the 3.5..4.0M range, 4.5M+ happens sometimes). The memory usage also decreases noticeably.

artur

--- std/concurrency.d
+++ std/concurrency.d
@@ -1387,7 +1396,7 @@ private
                 m_last = n;
             Node* todelete = n.next;
             n.next = n.next.next;
-            //delete todelete;
+            delete todelete;
             m_count--;
         }

@@ -1430,6 +1439,61 @@ private
             {
                 val = v;
             }
+            import core.memory;
+            import core.exception;
+            import std.c.stdlib;
+            // Class allocator: reuse a node from a free-list when possible,
+            // falling back to malloc (registered with the GC) when empty.
+            new(size_t size) {
+               void* p;
+               if (afreelist.length)
+                  p = afreelist[--afreelist.length];
+               else if (gfreelist.length) {
+                  {
+                     scope lock = synchronize(fl);
+                     if (gfreelist.length) {
+                        afreelist = cast(Node*[])gfreelist;
+                        gfreelist.length=0;
+                     }
+                  }
+                  if (afreelist.length)
+                     p = afreelist[--afreelist.length];
+               }
+
+               if (p)
+                  return p;
+
+               p = std.c.stdlib.malloc(size);
+               if (!p)
+                   throw new OutOfMemoryError();
+               GC.addRange(p, size);
+               return p;
+            }
+            // Class deallocator: recycle nodes onto the free-lists instead
+            // of freeing them immediately.
+            delete(void* p) {
+               if (!p)
+                  return;
+               pfreelist ~= cast(Node*)p;
+               if (pfreelist.length>=8)
+               {
+                  {
+                     scope lock = synchronize(fl);
+                     gfreelist ~= cast(shared Node*[])pfreelist;
+                  }
+                  pfreelist.length=0;
+                  pfreelist.assumeSafeAppend();
+               }
+               // At some point all free nodes need to be freed, using:
+               //GC.removeRange(p);
+               //std.c.stdlib.free(p);
+            }
+            static Node*[] afreelist;        // thread-local list nodes are taken from
+            static ubyte[56] d1;             // padding, presumably to avoid false sharing
+            static Node*[] pfreelist;        // thread-local nodes pending return to gfreelist
+            static ubyte[56] d2;             // padding, presumably to avoid false sharing
+            shared static Node*[] gfreelist; // global free-list, guarded by fl
+            shared static Mutex fl;
         }


February 10, 2012
GC.disable and GC.reserve are applicable. I tested with these and they did help but not a ton.
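
In case anyone wants to repeat the experiment, the test amounts to something like this (the reserve size is arbitrary):

import core.memory;

void main()
{
    GC.reserve(256 * 1024 * 1024);  // pre-allocate a large heap
    GC.disable();                   // suspend collections for the benchmark
    scope (exit) GC.enable();

    // ... run the message-passing benchmark here ...
}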

On Feb 10, 2012, at 5:54 AM, "Oliver Plow" <saxo123@gmx.de> wrote:

>>> I wonder how much it helps to just optimize the GC a little.  How much does the performance gap close when you use DMD 2.058 beta instead of 2.057?  This upcoming release has several new garbage collector optimizations.  If the GC is the bottleneck, then it's not surprising
> 
> Is there a way to "turn off" the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is unlikely to kick in? I searched through this page: http://www.d-programming-language.org/dmd-windows.html#switches but couldn't find anything helpful. Then you could measure the thing with the GC "turned off" to see whether the GC is the problem or not.
> 
> -- Oliver
> 