More embarrassing microbenchmarks for D's GC.
June 09, 2008
I've done a very simple microbenchmark testing a singly linked list.

What the test does is fill the list with long elements, and then iterate the list and increment each associated value. This test mostly measures allocation speed.

I've done a plain C version, a C++ version (using the STL doubly linked list, so it has some disadvantage here), a D version without using the GC, a D version using the GC, and a Python version.
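
For reference, a minimal sketch of roughly what the "D gc" variant looks like (the real tests are in the attachment, so names and details here are illustrative, and the timing code is omitted):

struct Node
{
    Node* next;
    long  data;
}

void main()
{
    const size_t N = 1_000_000;

    // Fill: one GC allocation per node.
    Node* head = new Node;
    Node* n = head;
    for (size_t i = 0; i < N; i++)
    {
        n.data = i;
        n.next = new Node;
        n = n.next;
    }
    n.next = null;

    // Inc: walk the list and bump each value.
    for (Node* p = head; p !is null; p = p.next)
        p.data++;
}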

Here are the results (Celeron ~2GHz, 1M elements, average of 3 runs):

       C         C++       D no gc   D gc       D gc dis  Python
Fill   0.136136  0.142705  0.131238  22.628577  0.242637  12.952816
Inc    0.025134  0.038768  0.037456   0.050480  0.051545   3.765271
Total  0.161270  0.181473  0.168694  22.679057  0.294182  16.718087

Results are in seconds, compiled with gcc/g++/gdc (-O3 -finline, plus
-frelease for D code), using Phobos (gdc 0.25-20080419-4.1.2-22, with
Debian patches).

You can see Python is almost twice as fast as D at doing allocation (and the Python example has some intentional overhead to keep it as close to the D code as possible; more Pythonic code could yield better results).

But there are a few other results I can't explain:
1) Why is the D GC version (disabled or not) ~25% slower than the D version
   that uses malloc when iterating the list? There shouldn't be any GC
   activity in that part. Could it be some GC locality issue that yields
   more cache misses?
2) Why is the D malloc version ~33% slower than the C version? I took a look
   at the generated assembly and it's almost identical:
	<_Dmain+198>:   lea    -0x20(%ebp),%eax
	<_Dmain+201>:   lea    0x0(%esi,%eiz,1),%esi
	<_Dmain+208>:   addl   $0x1,0x8(%eax)
	<_Dmain+212>:   adcl   $0x0,0xc(%eax)
	<_Dmain+216>:   mov    (%eax),%eax
	<_Dmain+218>:   test   %eax,%eax
	<_Dmain+220>:   jne    0x804a240 <_Dmain+208>


	<main+248>:     lea    -0x1c(%ebp),%eax
	<main+251>:     nop
	<main+252>:     lea    0x0(%esi,%eiz,1),%esi
	<main+256>:     addl   $0x1,0x4(%eax)
	<main+260>:     adcl   $0x0,0x8(%eax)
	<main+264>:     mov    (%eax),%eax
	<main+266>:     test   %eax,%eax
	<main+268>:     jne    0x8048550 <main+256>
	<main+270>:     movl   $0x0,0x4(%esp)
	<main+278>:     movl   $0x8049800,(%esp)

Tests attached.
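
To clarify how the other two D variants differ (again only illustrative, the exact code is in the attachment): "D no gc" allocates nodes through C's malloc instead of the GC heap, and "D gc dis" keeps GC allocation but switches collections off:

import std.c.stdlib; // D1 Phobos bindings for C's malloc/free
import std.gc;       // D1 Phobos GC control: disable()/enable()

struct Node
{
    Node* next;
    long  data;
}

void main()
{
    // "D gc dis": same allocations as "D gc", but no collection
    // cycles run while it's disabled.
    std.gc.disable();

    // "D no gc": bypass the GC heap entirely.
    Node* n = cast(Node*) malloc(Node.sizeof);
    n.data = 42;
    n.next = null;
    free(n);

    std.gc.enable();
}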

-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
I QUIT TO GO PRAY TO SAN CAYETANO
	-- Crónica TV


June 09, 2008
Leandro Lucarella:
> You can see Python is almost twice as fast as D at doing allocation (and the Python example has some intentional overhead to keep it as close to the D code as possible; more Pythonic code could yield better results).

Note that Psyco can't compile code outside functions (global code is slower even if you don't use Psyco).

Try putting your code into a main() and add a:
from psyco.classes import __metaclass__
at the top.

The code:

import sys
from datetime import datetime

import psyco
psyco.full()
from psyco.classes import __metaclass__

class Node:
    def __init__(self):
        self.data = 0
        self.next = None

def main():
    # Fill: build the list, one Node allocated per iteration.
    start = datetime.now()
    head = Node()
    n = head
    for i in xrange(long(sys.argv[1])):
        n.data = i
        n.next = Node()
        n = n.next
    n.next = None
    delta = datetime.now() - start
    elapsed1 = delta.seconds * 1000000 + delta.microseconds

    # Inc: walk the list and bump each value.
    start = datetime.now()
    p = head
    while p:
        p.data += 1
        p = p.next
    delta = datetime.now() - start
    elapsed2 = delta.seconds * 1000000 + delta.microseconds

    print "Fill:  %lu usec" % elapsed1
    print "Inc:   %lu usec" % elapsed2
    print "Total: %lu usec" % (elapsed1 + elapsed2)

main()

-----------------------

Original: 9.32 s
Just moving the code into a function: 8.39 s
The same, using Psyco: 1.94 s

Bye,
bearophile
June 09, 2008
== Quote from Leandro Lucarella (llucax@gmail.com)'s article
> But there are a few other results I can't explain:
> 1) Why is D gc (disabled or not) version ~25% slower than the D version
>    that uses malloc when iterating the list? It shouldn't be any GC
>    activity in that part. Could be some GC locallity issue that yields
>    more cache misses?

I think it may have more to do with the allocation strategy in the GC.
It obtains memory in chunks from the OS, and each chunk is typically a
max of 8MB.  So for a test like this the D GC will end up hitting the OS
quite a few times asking for more memory.  If I had to guess I'd say
that malloc has a more efficient strategy here.  If you're interested,
try running the same test using Tango with and without a call to
tango.core.Memory.GC.reserve() for the amount of memory you expect the
app to use before the loop.
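
A minimal sketch of what I mean, assuming Tango's tango.core.Memory module (the node type mirrors the test, and the factor of 2 is just headroom, a guess):

import tango.core.Memory;

struct Node
{
    Node* next;
    long  data;
}

void main()
{
    const size_t N = 1_000_000;

    // Reserve roughly the memory the list will need up front, so the
    // GC doesn't have to go back to the OS repeatedly mid-loop.
    GC.reserve(N * Node.sizeof * 2);

    Node* head = new Node;
    Node* n = head;
    for (size_t i = 0; i < N; i++)
    {
        n.data = i;
        n.next = new Node;
        n = n.next;
    }
    n.next = null;
}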

> 2) Why is D malloc version ~33% slower than the C version? I took a look
>    at the generated assembly and it's almost identical:
> 	<_Dmain+198>:   lea    -0x20(%ebp),%eax
> 	<_Dmain+201>:   lea    0x0(%esi,%eiz,1),%esi
> 	<_Dmain+208>:   addl   $0x1,0x8(%eax)
> 	<_Dmain+212>:   adcl   $0x0,0xc(%eax)
> 	<_Dmain+216>:   mov    (%eax),%eax
> 	<_Dmain+218>:   test   %eax,%eax
> 	<_Dmain+220>:   jne    0x804a240 <_Dmain+208>
> 	<main+248>:     lea    -0x1c(%ebp),%eax
> 	<main+251>:     nop
> 	<main+252>:     lea    0x0(%esi,%eiz,1),%esi
> 	<main+256>:     addl   $0x1,0x4(%eax)
> 	<main+260>:     adcl   $0x0,0x8(%eax)
> 	<main+264>:     mov    (%eax),%eax
> 	<main+266>:     test   %eax,%eax
> 	<main+268>:     jne    0x8048550 <main+256>
> 	<main+270>:     movl   $0x0,0x4(%esp)
> 	<main+278>:     movl   $0x8049800,(%esp)

No idea.  I'd expect them to be roughly equivalent.


Sean
June 10, 2008
Leandro Lucarella wrote:
> But there are a few other results I can't explain:
> 1) Why is D gc (disabled or not) version ~25% slower than the D version
>    that uses malloc when iterating the list? It shouldn't be any GC
>    activity in that part. Could be some GC locallity issue that yields
>    more cache misses?

There are so many factors that can influence this that it's hard to say without spending a lot of time carefully examining it.

> 2) Why is D malloc version ~33% slower than the C version? I took a look
>    at the generated assembly and it's almost identical:
> 	<_Dmain+198>:   lea    -0x20(%ebp),%eax
> 	<_Dmain+201>:   lea    0x0(%esi,%eiz,1),%esi
> 	<_Dmain+208>:   addl   $0x1,0x8(%eax)
> 	<_Dmain+212>:   adcl   $0x0,0xc(%eax)
> 	<_Dmain+216>:   mov    (%eax),%eax
> 	<_Dmain+218>:   test   %eax,%eax
> 	<_Dmain+220>:   jne    0x804a240 <_Dmain+208>
> 
> 
> 	<main+248>:     lea    -0x1c(%ebp),%eax
> 	<main+251>:     nop
> 	<main+252>:     lea    0x0(%esi,%eiz,1),%esi
> 	<main+256>:     addl   $0x1,0x4(%eax)
> 	<main+260>:     adcl   $0x0,0x8(%eax)
> 	<main+264>:     mov    (%eax),%eax
> 	<main+266>:     test   %eax,%eax
> 	<main+268>:     jne    0x8048550 <main+256>
> 	<main+270>:     movl   $0x0,0x4(%esp)
> 	<main+278>:     movl   $0x8049800,(%esp)
> 
> Tests attached.

Without running a profiler, there's no way to be sure about just where in the code the time is being consumed.
June 10, 2008
bearophile, on June 9 at 17:03, you wrote:
> Leandro Lucarella:
> > You can see Python is almost twice as fast as D at doing allocation (and the Python example has some intentional overhead to keep it as close to the D code as possible; more Pythonic code could yield better results).
> 
> Note that Psyco can't compile code outside functions (global code is slower even if you don't use Psyco).
> 

Thanks for the info. I imported psyco but didn't really play with it =)

June 10, 2008
Sean Kelly, on June 9 at 22:53, you wrote:
> == Quote from Leandro Lucarella (llucax@gmail.com)'s article
> > But there are a few other results I can't explain:
> > 1) Why is D gc (disabled or not) version ~25% slower than the D version
> >    that uses malloc when iterating the list? It shouldn't be any GC
> >    activity in that part. Could be some GC locallity issue that yields
> >    more cache misses?
> 
> I think it may have more to do with the allocation strategy in the GC.  It obtains
> memory in chunks from the OS, and each chunk is typically a max of 8MB.  So
> for a test like this the D GC will end up hitting the OS quite a few times asking
> for more memory.  If I had to guess I'd say that malloc has a more efficient
> strategy here.  If you're interested, try running the same test using Tango
> with and without a call to tango.core.Memory.GC.reserve() for the amount of
> memory you expect the app to use before the loop.

Really? I thought it was more collection-related. I don't know exactly how the D GC is implemented, but I thought that, as you say, it allocates a chunk of memory, each new gives you a piece of that memory, and when it runs out of memory, it runs a collection cycle. When that cycle can't recover unused memory, it allocates more.

Considering that the version with the GC disabled is almost as fast as the malloc version, and that the time grows exponentially (graph attached) with the number of nodes, it looks to me like the problem is the GC collection cycles. Is the GC algorithm exponential?
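
In pseudo-D, the policy I have in mind looks roughly like this (the pool helpers are made-up stubs, not the actual Phobos GC internals):

// Hypothetical stand-ins for the real pool machinery.
void* tryAllocFromPools(size_t size) { return null; }
void  fullCollect()                  { }
void  growPoolsFromOS(size_t size)   { }

void* gcAlloc(size_t size)
{
    void* p = tryAllocFromPools(size);
    if (p is null)
    {
        fullCollect();                  // run a collection cycle first
        p = tryAllocFromPools(size);
        if (p is null)
        {
            growPoolsFromOS(size);      // only then ask the OS for more
            p = tryAllocFromPools(size);
        }
    }
    return p;
}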



June 10, 2008
Leandro Lucarella:
> Thanks for the info. I imported psyco but didn't really play with it =)

If you have a bit of free time, you could time that code of mine (equal to yours plus Psyco) on your PC, so you can show us those timings too, compared to the other ones...

Bye,
bearophile
June 10, 2008
Walter Bright, on June 9 at 17:12, you wrote:
> Leandro Lucarella wrote:
> >But there are a few other results I can't explain:
> >1) Why is D gc (disabled or not) version ~25% slower than the D version
> >   that uses malloc when iterating the list? It shouldn't be any GC
> >   activity in that part. Could be some GC locallity issue that yields
> >   more cache misses?
> 
> There are so many factors that can influence this that it's hard to say without spending a lot of time carefully examining it.

But there is no GC activity there, right?

> >2) Why is D malloc version ~33% slower than the C version? I took a look
> >   at the generated assembly and it's almost identical:
> >	<_Dmain+198>:   lea    -0x20(%ebp),%eax
> >	<_Dmain+201>:   lea    0x0(%esi,%eiz,1),%esi
> >	<_Dmain+208>:   addl   $0x1,0x8(%eax)
> >	<_Dmain+212>:   adcl   $0x0,0xc(%eax)
> >	<_Dmain+216>:   mov    (%eax),%eax
> >	<_Dmain+218>:   test   %eax,%eax
> >	<_Dmain+220>:   jne    0x804a240 <_Dmain+208>
> >	<main+248>:     lea    -0x1c(%ebp),%eax
> >	<main+251>:     nop
> >	<main+252>:     lea    0x0(%esi,%eiz,1),%esi
> >	<main+256>:     addl   $0x1,0x4(%eax)
> >	<main+260>:     adcl   $0x0,0x8(%eax)
> >	<main+264>:     mov    (%eax),%eax
> >	<main+266>:     test   %eax,%eax
> >	<main+268>:     jne    0x8048550 <main+256>
> >	<main+270>:     movl   $0x0,0x4(%esp)
> >	<main+278>:     movl   $0x8049800,(%esp)
> >Tests attached.
> 
> Without running a profiler, there's no way to be sure about just where in the code the time is being consumed.

Attached is the output of gprof (for list-test-d-gc, with the GC running). I don't see anything helpful for point 2), but it clearly shows that most of the time is spent in garbage collection.


June 10, 2008
Adding info about Python+Psyco:
Here are the results (Celeron ~2GHz, 1M elements, average of 3 runs):

       C         C++       D no gc   D gc       D gc dis  Python     Py+Psyco
Fill   0.136136  0.142705  0.131238  22.628577  0.242637  12.952816  2.564722
Inc    0.025134  0.038768  0.037456   0.050480  0.051545   3.765271  0.087375
Total  0.161270  0.181473  0.168694  22.679057  0.294182  16.718087  2.652097

Results are in seconds, compiled with gcc/g++/gdc (-O3 -finline, plus
-frelease for D code), using Phobos (gdc 0.25-20080419-4.1.2-22, with
Debian patches).

June 10, 2008
== Quote from Leandro Lucarella (llucax@gmail.com)'s article
> Sean Kelly, on June 9 at 22:53, you wrote:
> > == Quote from Leandro Lucarella (llucax@gmail.com)'s article
> > > But there are a few other results I can't explain:
> > > 1) Why is D gc (disabled or not) version ~25% slower than the D version
> > >    that uses malloc when iterating the list? It shouldn't be any GC
> > >    activity in that part. Could be some GC locallity issue that yields
> > >    more cache misses?
> >
> > I think it may have more to do with the allocation strategy in the GC.  It obtains
> > memory in chunks from the OS, and each chunk is typically a max of 8MB.  So
> > for a test like this the D GC will end up hitting the OS quite a few times asking
> > for more memory.  If I had to guess I'd say that malloc has a more efficient
> > strategy here.  If you're interested, try running the same test using Tango
> > with and without a call to tango.core.Memory.GC.reserve() for the amount of
> > memory you expect the app to use before the loop.
> Really? I thought it was more collection-related. I don't know exactly
> how the D GC is implemented, but I thought that, as you say, it allocates
> a chunk of memory, each new gives you a piece of that memory, and when it
> runs out of memory, it runs a collection cycle. When that cycle can't
> recover unused memory, it allocates more.
> Considering that the version with the GC disabled is almost as fast as
> the malloc version, and that the time grows exponentially (graph attached)
> with the number of nodes, it looks to me like the problem is the GC
> collection cycles. Is the GC algorithm exponential?

Oh okay.  I was replying to the post that said the malloc version was still much faster than D with the GC disabled.  But if they're comparable then all the better.  This is what I would expect anyway.

As for the GC algorithm being exponential... all nodes of the list contain pointers, so they will all be scanned by the GC.  Given this and the chance of false positives, I would expect performance to decrease in a superlinear manner.  But an exponential slow-down still seems a bit extreme.
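
One back-of-the-envelope way to see how that can happen (assuming a "collect on every pool exhaustion" policy, which may not be exactly what the GC does): if a collection runs each time the heap has to grow by a fixed-size pool of p bytes, and each collection scans the whole live set, then filling n bytes of nodes triggers about n/p collections scanning roughly p, 2p, 3p, ..., n bytes, for a total of

	p + 2p + ... + (n/p)*p  ~  n^2 / (2p)

i.e. quadratic rather than exponential work -- and over a small range a quadratic curve is easy to mistake for an exponential one.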


Sean