June 09, 2008 More embarrassing microbenchmarks for D's GC.
I've done a very simple microbenchmark testing a singly-linked list.
The test fills the list with long elements, then iterates the list and increments the associated value. It mostly tests allocation speed.
I've done a plain C version, a C++ version (using the STL list, which is doubly-linked, so it's at some disadvantage here), a D version without using the GC, a D version using the GC, and a Python version.
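The tests themselves are attached rather than inlined, but to give an idea, the D GC variant's two loops presumably look something like the sketch below (a from-memory illustration, not the attached file; the class name, count, and comments are mine):

// Sketch of the D GC variant's core (illustrative, not the attached file).
class Node
{
    long data;   // the "long elements" being filled in
    Node next;
}

void main()
{
    const size_t count = 1_000_000;

    // Fill phase: one GC allocation per element -- this dominates the time.
    Node head = new Node;
    Node n = head;
    for (size_t i = 0; i < count; i++)
    {
        n.data = i;
        n.next = new Node;
        n = n.next;
    }

    // Inc phase: walk the list and bump each value; no allocation happens here.
    for (Node p = head; p !is null; p = p.next)
        p.data++;

    // The "D gc dis" variant presumably just calls std.gc.disable() up front,
    // and the "D no gc" variant uses malloc'd structs instead of `new`.
}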
Here are the results (Celeron ~2GHz, 1M elements, average of 3 runs):

|       | C        | C++      | D no gc  | D gc      | D gc dis | Python    |
|-------|----------|----------|----------|-----------|----------|-----------|
| Fill  | 0.136136 | 0.142705 | 0.131238 | 22.628577 | 0.242637 | 12.952816 |
| Inc   | 0.025134 | 0.038768 | 0.037456 | 0.050480  | 0.051545 | 3.765271  |
| Total | 0.161270 | 0.181473 | 0.168694 | 22.679057 | 0.294182 | 16.718087 |
Results are in seconds; compiled with gcc/g++/gdc (-O3 -finline, plus -frelease for the D code), using Phobos (gdc 0.25-20080419-4.1.2-22, with Debian patches).
You can see Python is almost twice as fast as D at allocation (and the Python example has some intentional overhead to keep it as close to the D code as possible; more Pythonic code could yield better results).
But there are a few other results I can't explain:

1) Why is the D gc version (disabled or not) ~25% slower than the D version
   that uses malloc when iterating the list? There shouldn't be any GC
   activity in that part. Could it be some GC locality issue that yields
   more cache misses?
2) Why is the D malloc version ~33% slower than the C version? I took a look
   at the generated assembly and the loops are almost identical:
D version (_Dmain):

<_Dmain+198>: lea    -0x20(%ebp),%eax
<_Dmain+201>: lea    0x0(%esi,%eiz,1),%esi
<_Dmain+208>: addl   $0x1,0x8(%eax)
<_Dmain+212>: adcl   $0x0,0xc(%eax)
<_Dmain+216>: mov    (%eax),%eax
<_Dmain+218>: test   %eax,%eax
<_Dmain+220>: jne    0x804a240 <_Dmain+208>

C version (main):

<main+248>: lea    -0x1c(%ebp),%eax
<main+251>: nop
<main+252>: lea    0x0(%esi,%eiz,1),%esi
<main+256>: addl   $0x1,0x4(%eax)
<main+260>: adcl   $0x0,0x8(%eax)
<main+264>: mov    (%eax),%eax
<main+266>: test   %eax,%eax
<main+268>: jne    0x8048550 <main+256>
<main+270>: movl   $0x0,0x4(%esp)
<main+278>: movl   $0x8049800,(%esp)
Tests attached.
June 09, 2008 Re: More embarrassing microbenchmarks for D's GC.
Posted in reply to Leandro Lucarella

Leandro Lucarella:
> You can see Python is almost twice as fast as D at allocation (and the
> Python example has some intentional overhead to keep it as close to the
> D code as possible; more Pythonic code could yield better results).
Note that Psyco can't compile code outside functions (global code is slower even if you don't use Psyco).
Try putting your code into a main() and add:

from psyco.classes import __metaclass__

at the top.
The code:

import sys
from datetime import datetime

import psyco
psyco.full()
from psyco.classes import __metaclass__

class Node:
    def __init__(self):
        self.data = 0
        self.next = None

def main():
    # Fill phase: build the list, one allocation per element.
    start = datetime.now()
    list = Node()
    n = list
    for i in xrange(long(sys.argv[1])):
        n.data = i
        n.next = Node()
        n = n.next
    n.next = None
    delta = datetime.now() - start
    elapsed1 = delta.seconds * 1000000 + delta.microseconds

    # Inc phase: walk the list and bump each value.
    start = datetime.now()
    p = list
    while p:
        p.data += 1
        p = p.next
    delta = datetime.now() - start
    elapsed2 = delta.seconds * 1000000 + delta.microseconds

    print "Fill: %lu usec" % elapsed1
    print "Inc: %lu usec" % elapsed2
    print "Total: %lu usec" % (elapsed1 + elapsed2)

main()
-----------------------
Original: 9.32 s
Just moving the code into a function: 8.39 s
The same, using Psyco: 1.94 s

Bye,
bearophile
June 09, 2008 Re: More embarrassing microbenchmarks for D's GC.
Posted in reply to Leandro Lucarella
== Quote from Leandro Lucarella (llucax@gmail.com)'s article
> But there are a few other results I can't explain:
> 1) Why is the D gc version (disabled or not) ~25% slower than the D version
>    that uses malloc when iterating the list? There shouldn't be any GC
>    activity in that part. Could it be some GC locality issue that yields
>    more cache misses?

I think it may have more to do with the allocation strategy in the GC. It obtains memory in chunks from the OS, and each chunk is typically a max of 8MB. So for a test like this the D GC will end up hitting the OS quite a few times asking for more memory. If I had to guess I'd say that malloc has a more efficient strategy here. If you're interested, try running the same test using Tango with and without a call to tango.core.Memory.GC.reserve() for the amount of memory you expect the app to use before the loop.

> 2) Why is the D malloc version ~33% slower than the C version? I took a look
>    at the generated assembly and the loops are almost identical:
> [assembly listings snipped]

No idea. I'd expect them to be roughly equivalent.

Sean
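For reference, that Tango experiment would look something like the sketch below. The 32MB reserve is an illustrative guess at the app's footprint, not a measured figure:

// Sketch: pre-reserving heap space before the benchmark loops, using Tango.
// The 32MB figure is an illustrative guess, not a measured requirement.
import tango.core.Memory;

void main()
{
    GC.reserve(32 * 1024 * 1024);  // one big request to the OS, up front

    // ... run the fill and inc loops as in the original test ...
}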
June 10, 2008 Re: More embarrassing microbenchmarks for D's GC.
Posted in reply to Leandro Lucarella
Leandro Lucarella wrote:
> But there are a few other results I can't explain:
> 1) Why is the D gc version (disabled or not) ~25% slower than the D version
>    that uses malloc when iterating the list? There shouldn't be any GC
>    activity in that part. Could it be some GC locality issue that yields
>    more cache misses?

There are so many factors that can influence this, it's hard to say without spending a lot of time carefully examining it.

> 2) Why is the D malloc version ~33% slower than the C version? I took a look
>    at the generated assembly and the loops are almost identical:
> [assembly listings snipped]
>
> Tests attached.

Without running a profiler, there's no way to be sure about just where in the code the time is being consumed.
June 10, 2008 Re: More embarrassing microbenchmarks for D's GC.
Posted in reply to bearophile
bearophile wrote on June 9 at 17:03:
> Leandro Lucarella:
> > You can see Python is almost twice as fast as D at allocation (and the
> > Python example has some intentional overhead to keep it as close to the
> > D code as possible; more Pythonic code could yield better results).
>
> Note that Psyco can't compile code outside functions (global code is slower even if you don't use Psyco).

Thanks for the info; I imported psyco but didn't really play with it =)
June 10, 2008 Re: More embarrassing microbenchmarks for D's GC.
Posted in reply to Sean Kelly (with attachments)
Sean Kelly wrote on June 9 at 22:53:
> I think it may have more to do with the allocation strategy in the GC. It
> obtains memory in chunks from the OS, and each chunk is typically a max of
> 8MB. So for a test like this the D GC will end up hitting the OS quite a
> few times asking for more memory. If I had to guess I'd say that malloc has
> a more efficient strategy here. If you're interested, try running the same
> test using Tango with and without a call to tango.core.Memory.GC.reserve()
> for the amount of memory you expect the app to use before the loop.

Really? I thought it was more collection-related. I don't know exactly how the D GC is implemented, but I thought that, as you say, it allocates a chunk of memory, each new gives you a piece of that memory, and when it runs out of memory, it runs a collection cycle. When that cycle can't recover unused memory, it allocates more.

Considering that the version with the GC disabled is almost as fast as the malloc version, and that the time grows exponentially (graph attached) with the number of nodes, it looks to me like the problem is the GC collection cycles. Is the GC algorithm exponential?
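In code form, the policy described above would be something like the toy model below. This is a guess at the collector's behaviour, not actual Phobos source; all helper names are hypothetical stubs added so the sketch compiles:

// Toy model of the assumed allocation policy (not actual Phobos source).
void* tryAllocFromPools(size_t size) { return null; }  // carve from an existing chunk (stub: always fails)
void runCollection() { }                               // mark live data, sweep the rest (stub)
void* growHeapFromOS(size_t size) { return (new byte[size]).ptr; }  // get a fresh chunk

void* gcAlloc(size_t size)
{
    void* p = tryAllocFromPools(size);
    if (p is null)
    {
        runCollection();               // pools full: collect first...
        p = tryAllocFromPools(size);
        if (p is null)
            p = growHeapFromOS(size);  // ...and grow the heap only if that freed nothing
    }
    return p;
}

void main()
{
    void* p = gcAlloc(16);  // in the real test this would be each `new Node`
}

Under such a policy, a fill loop whose live set only ever grows would trigger a full collection on every pool exhaustion, which fits the hypothesis that collection cycles dominate.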
June 10, 2008 Re: More embarrassing microbenchmarks for D's GC.
Posted in reply to Leandro Lucarella

Leandro Lucarella:
> Thanks for the info; I imported psyco but didn't really play with it =)
If you have a bit of free time, you could time that code of mine (equal to yours + Psyco) on your PC, so you can show us those timings too, compared with the other ones...
Bye,
bearophile
June 10, 2008 Re: More embarrassing microbenchmarks for D's GC.
Posted in reply to Walter Bright (with attachments)
Walter Bright wrote on June 9 at 17:12:
> Leandro Lucarella wrote:
> > But there are a few other results I can't explain:
> > 1) Why is the D gc version (disabled or not) ~25% slower than the D
> >    version that uses malloc when iterating the list? There shouldn't be
> >    any GC activity in that part. Could it be some GC locality issue that
> >    yields more cache misses?
>
> There are so many factors that can influence this, it's hard to say without spending a lot of time carefully examining it.

But there is no GC activity there, right?

> > 2) Why is the D malloc version ~33% slower than the C version? I took a
> >    look at the generated assembly and the loops are almost identical:
> > [assembly listings snipped]
> > Tests attached.
>
> Without running a profiler, there's no way to be sure about just where in the code the time is being consumed.

Attached is the output of gprof (for list-test-d-gc with the GC running). I don't see any helpful information for point 2), but it shows clearly that most of the time is spent in garbage collection.
June 10, 2008 Re: More embarrassing microbenchmarks for D's GC.
Posted in reply to Leandro Lucarella

Adding info about Python+Psyco:
Here are the results (Celeron ~2GHz, 1M elements, average of 3 runs):

|       | C        | C++      | D no gc  | D gc      | D gc dis | Python    | Py+Psyco |
|-------|----------|----------|----------|-----------|----------|-----------|----------|
| Fill  | 0.136136 | 0.142705 | 0.131238 | 22.628577 | 0.242637 | 12.952816 | 2.564722 |
| Inc   | 0.025134 | 0.038768 | 0.037456 | 0.050480  | 0.051545 | 3.765271  | 0.087375 |
| Total | 0.161270 | 0.181473 | 0.168694 | 22.679057 | 0.294182 | 16.718087 | 2.652097 |

Results are in seconds; compiled with gcc/g++/gdc (-O3 -finline, plus -frelease for the D code), using Phobos (gdc 0.25-20080419-4.1.2-22, with Debian patches).
June 10, 2008 Re: More embarrassing microbenchmarks for D's GC.
Posted in reply to Leandro Lucarella

== Quote from Leandro Lucarella (llucax@gmail.com)'s article
> Sean Kelly wrote on June 9 at 22:53:
> > I think it may have more to do with the allocation strategy in the GC. It
> > obtains memory in chunks from the OS, and each chunk is typically a max
> > of 8MB. So for a test like this the D GC will end up hitting the OS quite
> > a few times asking for more memory. If I had to guess I'd say that malloc
> > has a more efficient strategy here. If you're interested, try running the
> > same test using Tango with and without a call to
> > tango.core.Memory.GC.reserve() for the amount of memory you expect the
> > app to use before the loop.
>
> Really? I thought it was more collection-related. I don't know exactly how
> the D GC is implemented, but I thought that, as you say, it allocates a
> chunk of memory, each new gives you a piece of that memory, and when it
> runs out of memory, it runs a collection cycle. When that cycle can't
> recover unused memory, it allocates more.
>
> Considering that the version with the GC disabled is almost as fast as the
> malloc version, and that the time grows exponentially (graph attached)
> with the number of nodes, it looks to me like the problem is the GC
> collection cycles. Is the GC algorithm exponential?
Oh okay. I was replying to the post that said the malloc version was still much faster than D with the GC disabled. But if they're comparable then all the better. This is what I would expect anyway.
As for the GC algorithm being exponential... all of the list's nodes contain pointers, so they will all be scanned by the GC. Given this and the chance of false positives, I would expect performance to degrade superlinearly. But an exponential slow-down still seems a bit extreme.
Sean
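A back-of-the-envelope check of that last point (my own arithmetic, under the simplifying assumption that a full mark of all live nodes runs once every k allocations): collection number j scans about jk live nodes, so over n allocations the total mark work is

$$\sum_{j=1}^{n/k} jk \;=\; \frac{n}{2}\left(\frac{n}{k} + 1\right) \;=\; O\!\left(\frac{n^{2}}{k}\right),$$

which is quadratic: clearly superlinear on a graph, but not truly exponential.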