Profiling graph library
July 30, 2013
Hello all,

I thought I'd share some profiling results from the graph/network library that
I've been working on -- see:
https://github.com/WebDrake/Dgraph
http://braingam.es/2013/07/complex-networks-in-d/
http://braingam.es/2013/07/complex-networks-in-d-part-2-building-the-network/

I'd like to get others' thoughts and insight on these -- obviously I have some ideas of my own about how to interpret them and move forward, but it's always good to have feedback (and there are several things I'd like to make sure I have my facts straight on before making any more "public" comment).

The basic data structure, which is taken from the igraph library, consists of a pair of arrays of head and tail vertices for the edges, together with indices that sort the edges by head or tail vertex, and cumulative sums of the number of edges whose head/tail vertex ID is less than a given value: https://github.com/WebDrake/Dgraph/blob/master/dgraph/graph.d#L33-L38

Taken together this offers a very memory-efficient structure that should also be able to minimize the total number of allocations required to build a graph.
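
For concreteness, here is a rough sketch of that layout (the names are illustrative, not the actual Dgraph members):

    // Sketch of the igraph-style layout described above: for V vertices and
    // E edges, the whole graph is just six size_t arrays.
    final class GraphSketch
    {
        size_t[] head, tail;            // endpoints of edge e: head[e] -- tail[e]
        size_t[] indexHead, indexTail;  // edge IDs sorted by head (resp. tail) vertex
        size_t[] sumHead, sumTail;      // sumHead[v] = number of edges with head < v;
                                        // length V + 1, so sumHead[V] == E
    }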

When I started working on the code, the initial goal was to have something that gave a very simple, easy-to-understand representation of the algorithms at work.  So, for example, I used std.algorithm.map quite freely to give an "easy" representation of how a vertex's neighbours are calculated: https://github.com/WebDrake/Dgraph/blob/master/dgraph/graph.d#L260-L286
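
In rough outline, that lazy version looks something like the sketch below, written against the illustrative GraphSketch fields above (not the actual Dgraph code): edges in which v appears as tail contribute their heads, and vice versa.

    import std.algorithm : map;
    import std.range : chain, iota;

    // Sketch of a lazy, map/chain-based neighbours range, as a method of the
    // GraphSketch class above (illustrative only).  Because the returned range
    // escapes the method, the compiler heap-allocates a closure for the
    // captured context on each call -- a plausible contributor to the GC
    // activity visible in the profile below -- and every element access goes
    // through several layers of range machinery rather than an array lookup.
    auto neighbours(size_t v) const
    {
        return chain(
            // edges where v is the tail: their heads are neighbours of v
            iota(sumTail[v], sumTail[v + 1]).map!(i => head[indexTail[i]]),
            // edges where v is the head: their tails are neighbours of v
            iota(sumHead[v], sumHead[v + 1]).map!(i => tail[indexHead[i]])
        );
    }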

It turns out this results in a fairly major speed hit that stems from a variety of factors.  The following is the result of profiling a simple function that just iterates over all the neighbours of each vertex in turn, and sums the neighbour IDs.  The code is attached in neighbours.d.

---------------------------------------------------------------------------------
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 19.20      0.24     0.24 50000000     0.00     0.00
_D6dgraph5graph13__T5GraphVb0Z5Graph10neighboursMxFymZS3std5range380__T5chainTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ5chainFS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ6Result18__T10__lambda46TmZ10__lambda46MFNbNfmZxm
 18.40      0.47     0.23
_D2gc3gcx2GC12mallocNoSyncMFmkPmZPv
 15.60      0.67     0.20 50000000     0.00     0.00
_D6dgraph5graph13__T5GraphVb0Z5Graph10neighboursMxFymZS3std5range380__T5chainTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ5chainFS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ6Result18__T10__lambda48TmZ10__lambda48MFNbNfmZxm
  9.60      0.79     0.12   500000     0.00     0.00
_D10neighbours24__T15checkNeighboursVb0Z15checkNeighboursFKC6dgraph5graph13__T5GraphVb0Z5GraphZm
  8.00      0.89     0.10 25000000     0.00     0.00
_D6dgraph5graph13__T5GraphVb0Z5Graph10neighboursMxFymZS3std5range380__T5chainTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ5chainFS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ6Result
  8.00      0.99     0.10                             _D2gc3gcx2GC6mallocMFmkPmZPv
  4.80      1.05     0.06                             _D2gc6gcbits6GCBits3setMFmZv
  4.80      1.11     0.06                             _D2gc3gcx3Gcx11fullcollectMFZm
  4.00      1.16     0.05                             _D2gc6gcbits6GCBits4testMFmZm
  2.40      1.19     0.03
_D4core4sync5mutex5Mutex6unlockMFNeZv
  1.60      1.21     0.02                             _D2gc6gcbits6GCBits5allocMFmZv
  1.20      1.22     0.02
_D3std4conv15__T6toImplTiTkZ6toImplFNaNfkZi15__dgliteral2501MFNaNfZC6object9Throwable
  1.20      1.24     0.02                             gc_malloc
  0.40      1.24     0.01
_D4core4sync5mutex5Mutex12MonitorProxy11__xopEqualsFKxS4core4sync5mutex5Mutex12MonitorProxyKxS4core4sync5mutex5Mutex12MonitorProxyZb
  0.40      1.25     0.01
_D4core4sync5mutex5Mutex4lockMFNeZv
  0.40      1.25     0.01                             gc_clrAttr
---------------------------------------------------------------------------------

What's striking about this is that (i) the various range/map-based functions carry a substantial hit in and of themselves, and (ii) there's also a fairly nasty hit from the GC.  (If I run this with a directed graph, which avoids the use of std.range.chain, there's a small saving, but only a very small one.)

As an experiment, I wrote some alternative code that provides a cache of the neighbours of every vertex.  This takes an overall memory hit, but means that the list of neighbours need not be recalculated so long as the graph itself doesn't change: https://github.com/WebDrake/Dgraph/blob/cache/dgraph/graph.d#L42-L53 https://github.com/WebDrake/Dgraph/blob/cache/dgraph/graph.d#L358-L379
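
Schematically, the cache amounts to something like this, again in terms of the illustrative GraphSketch names above rather than the actual implementation:

    // Extra members and a method for the GraphSketch class above: per-vertex
    // neighbour arrays, filled on first use and reused until the graph changes.
    // Both arrays are assumed to be pre-sized to the number of vertices, and
    // cacheValid[] entries are reset whenever edges are added or removed.
    size_t[][] neighboursCache;
    bool[] cacheValid;

    size_t[] cachedNeighbours(size_t v)
    {
        if (!cacheValid[v])
        {
            // total degree of v: edges with tail v plus edges with head v
            immutable deg = (sumTail[v + 1] - sumTail[v])
                          + (sumHead[v + 1] - sumHead[v]);
            neighboursCache[v].length = deg;
            size_t j = 0;
            foreach (i; sumTail[v] .. sumTail[v + 1])
                neighboursCache[v][j++] = head[indexTail[i]];
            foreach (i; sumHead[v] .. sumHead[v + 1])
                neighboursCache[v][j++] = tail[indexHead[i]];
            cacheValid[v] = true;
        }
        return neighboursCache[v];
    }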

The speed benefits are immediate:

---------------------------------------------------------------------------------
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 64.29      0.14     0.14 25000000     0.00     0.00
_D6dgraph5graph13__T5GraphVb0Z5Graph10neighboursMFNaNbNfymZAm
 33.33      0.21     0.07   500000     0.00     0.00
_D10neighbours24__T15checkNeighboursVb0Z15checkNeighboursFNaNbNfKC6dgraph5graph13__T5GraphVb0Z5GraphZm
  2.38      0.21     0.01
_D6dgraph5graph13__T5GraphVb0Z5Graph13incidentEdgesMFNaNbNfymZAm
---------------------------------------------------------------------------------

In fact there's a double benefit here: (i) it's more efficient to write the neighbour values into an array than to use a map, and (ii) because we're caching them, we only have to calculate each vertex's neighbours once.  The extra memory required may be bearable given that the overall constraints on network size are probably going to be down to something other than RAM.

However, I think it might be worth avoiding this if possible.  So I'd like to be sure that I understand -- first, are map, chain, etc. likely to become more efficient in terms of speed and memory use?  I recall that there has been some concern over unnecessary allocations, and I'm not sure if std.algorithm.map or std.range.chain are victims here.  Alternatively, there might be some benefit in replacing the maps, chains, etc. with custom data structures that provide _exactly_ what is wanted.

Finally, whether using the map approach or the cache, neither seems as fast in benchmarking as the kind of simple ad-hoc structure I might use for a graph in simulation, such as,

    size_t[][] neighbours;

My instinct is that the latter approach will start to choke when you try to build a really big graph, because of the many, many reallocations required to build the structure.  It will probably also become unwieldy once you want to allow arbitrary collections of edge and vertex properties (weight, colour, etc.), because it doesn't offer a very elegant way to store and retrieve those properties.
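
For comparison, building a graph with that structure is just repeated appending, which is where the reallocation concern comes from (an illustrative sketch):

    // The ad-hoc adjacency-list approach: trivially simple and fast to query,
    // but each append while building the graph may reallocate that vertex's
    // array, and there is no natural slot for per-edge properties.
    size_t[][] neighbours;   // length == number of vertices

    void addEdge(size_t u, size_t v)
    {
        neighbours[u] ~= v;  // may trigger a reallocation as the array grows
        neighbours[v] ~= u;
    }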

Suffice it to say that it looks like there is an interesting range of compromises to make depending on one's needs, and it's quite likely there's a benefit in providing different graph data structures according to what scale of graph (and what kind of graph properties) you want to work with.

Anyway, although this is a very limited bit of profiling work, I'd be very grateful for anyone's thoughts or insight on the issues raised here.

Thanks & best wishes,

     -- Joe


July 30, 2013
There are many ways to design a graph data structure. All of them have trade-offs; none of them can be very good for everything.

For your benchmarks are you using the ldc2 compiler with correct compilation switches?

map() should not allocate memory.

Sometimes lazy code is faster and sometimes it's slower. A good compiler (like Haskell GHC) should be able to optimize map well.

Also try to use a bitvector from Phobos instead of bool[].

Sometimes you can keep memory allocation low, performance almost acceptable, and code easy to read by using a pre-allocated scratch space together with map() and copy() from std.algorithm.

Bye,
bearophile
July 30, 2013
On Tuesday, 30 July 2013 at 23:02:40 UTC, bearophile wrote:
> Sometimes you can keep memory allocation low, performance almost acceptable, and code easy to read by using a pre-allocated scratch space together with map() and copy() from std.algorithm.

Any chance of some nice examples of this?
July 30, 2013
John Colvin:

> Any chance of some nice examples of this?

Allocate a buffer somewhere, on the stack or the heap, and then, instead of using map()+array() as in that code, use map()+copy(), plus a test to be sure the space is sufficient.
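
For instance, something along these lines (a minimal sketch: the buffer size and the x + 1 mapping are arbitrary placeholders, and the input is assumed to be a range of integral values with a known length):

    import std.algorithm : copy, map;
    import std.exception : enforce;

    // A scratch buffer allocated once and reused across calls, filled with
    // map()+copy() instead of building a new array with array() every time.
    size_t[4096] buf;

    size_t[] mappedIntoScratch(R)(R sequence)
    {
        enforce(sequence.length <= buf.length, "scratch buffer too small");
        auto rest = sequence.map!(x => size_t(x) + 1).copy(buf[]);
        return buf[0 .. buf.length - rest.length];
    }

The returned slice points into buf, so it must be used (or .dup'd) before the next call, but no per-call allocation is performed.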

But in the end I don't know how well even ldc2 will compile that. You have to take a look at the produced asm with the "-output-s" switch of ldc2.

Bye,
bearophile
July 30, 2013
On 07/31/2013 01:02 AM, bearophile wrote:
> There are many ways to design a graph data structure. All of them have trade-offs; none of them can be very good for everything.

Sure, it's just that in this case the reference code (igraph) is _much_ more performant.  I'm keen to try and identify the major issues that are holding the D code back.

I _could_ just copy the C algorithms from igraph into my own code, but nothing would be learned from that, and it would seem to defeat the purpose of using a more elegant language like D.

> For your benchmarks are you using the ldc2 compiler with correct compilation switches?

No, gdmd -O -release -inline -g -profile, and then gprof to generate the profile.

> map() should not allocate memory.
> 
> Sometimes lazy code is faster and sometimes it's slower. A good compiler (like Haskell GHC) should be able to optimize map well.

That's what I'd assumed, which was why I was so disappointed that in this case performance was _very_ bad.  I particularly don't understand the major GC hit.

Or rather, I should say, I don't see why that GC hit should be there, unless it's that there is a fault in the implementation of iota, map and/or chain.

> Also try to use a bitvector from Phobos instead of bool[].

I did consider that, but I haven't tried it yet.  Obviously it's superior from a storage point of view; I wasn't sure whether it would be worse or equivalent performance-wise (I seem to recall reading that it's slightly slower than a bool[]).  I'll try it. :-)

> Sometimes you can keep memory allocation low, performance almost acceptable,
> and code easy to read by using a pre-allocated scratch space together with
> map() and copy() from std.algorithm.

Could you expand a little on that point?

I don't mean to ask you to write my code for me, it's just that (like most non-computer-science researchers) my knowledge of programming is somewhat specialized, and there are blank spots on the map.  So when in your reply to John you say, "Allocate a buffer somewhere, on the stack or heap", I'm still not sure exactly what you mean or how you'd use it.  It'd really help to see a concrete example (it doesn't need to be tailored to the use-case here).

In any case, thanks as ever for all the useful remarks and advice. :-)
July 31, 2013
On 07/31/2013 01:02 AM, bearophile wrote:
> Also try to use a bitvector from Phobos instead of bool[].

I take it you mean std.bitmanip.BitArray ... ?

I ran into some interesting problems which are probably down to incomplete implementation: e.g.

    myBitArray.length += n;

... results in a compiler error message: myBitArray.length() is not an lvalue

BitArrays are also not sliceable, it seems.

I'm guessing that's simply failure to implement certain interfaces, rather than an unavoidable problem ... ?
July 31, 2013
Joseph Rushton Wakeling:

> I take it you mean std.bitmanip.BitArray ... ?

Right.


> I ran into some interesting problems which are probably down to incomplete
> implementation: e.g.
>
>     myBitArray.length += n;
>
> ... results in a compiler error message: myBitArray.length() is not an lvalue

This is just a missing feature. Someone should implement it. std.bitmanip.BitArray lacks several small features.


> BitArrays are also not sliceable, it seems.

This is a bit more complex to implement in BitArray.
One way to implement it is to copy all the sliced data, with no alignment, and create a new BitArray. But this is semantically different from regular array slicing (which doesn't copy data).
If you don't want to copy data, then you have to add to BitArray a field (a ubyte or a uint; if you use a ubyte it needs a static if for static safety) that represents an offset in the range 0 .. (size_t.sizeof * 8). But this could slow down the access of single bits a little.
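
As a toy illustration of the second option (not the real std.bitmanip.BitArray):

    // A non-copying bit slice: it shares the underlying words and carries a
    // bit offset, paying one extra addition on every single-bit access.
    struct BitSlice
    {
        const(size_t)[] words;  // shared storage, not copied
        size_t offset;          // in 0 .. size_t.sizeof * 8
        size_t length;          // number of bits in the slice

        bool opIndex(size_t i) const
        {
            assert(i < length);
            enum bitsPerWord = size_t.sizeof * 8;
            immutable bit = offset + i;
            return ((words[bit / bitsPerWord] >> (bit % bitsPerWord)) & 1) != 0;
        }
    }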

Bye,
bearophile
July 31, 2013
Joseph Rushton Wakeling:

> No, gdmd -O -release -inline -g -profile, and then gprof to generate the profile.

Then I suggest trying ldc2 too.


> That's what I'd assumed, which was why I was so disappointed that in this case
> performance was _very_ bad.  I particularly don't understand the major GC hit.
>
> Or rather, I should say, I don't see why that GC hit should be there, unless
> it's that there is a fault in the implementation of iota, map and/or chain.

My suggestion is to copy the implementation of those ranges into a test module, and try to understand the causes of the GC activity you see.


> Could you expand a little on that point?
>
> I don't mean to ask you to write my code for me, it's just that (like most
> non-computer-science researchers) my knowledge of programming is somewhat
> specialized, and there are blank spots on the map.  So when in your reply to
> John you say, "Allocate a buffer somewhere, on the stack or heap", I'm still not
> sure exactly what you mean or how you'd use it.  It'd really help to see a
> concrete example (it doesn't need to be tailored to the use-case here).

If this resulting array is used only for a short period of time:

return sequence.map!(x => x.func).array;


then perhaps you can perform the array allocation in a constructor or static constructor, or you could allocate a fixed-size array inside the body of the graph, etc. Let's say this array is named "buf".

Then with code like:

auto left = sequence.map!(x => x.func).copy(buf[]);
return buf[...];

left is the unfilled part of buf returned by copy(), so you can slice buf accordingly and return it. Somewhere you also have to save the length of the used part of buf.
If the resulting array is used only for a moment, this will not cause trouble in a single-threaded program. The point is to remove the memory allocation caused by array().

An alternative design comes from Tango: instead of allocating the buffer somewhere, your function takes an optional buffer and uses it unless it's too small for the purpose; otherwise it allocates with array() or something else.
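
A rough sketch of that alternative (again the x + 1 mapping is a placeholder, and the input is assumed to be a range of integral values with a known length):

    import std.algorithm : copy, map;
    import std.array : array;

    // Tango-style design: the caller may pass a buffer; it is used when it is
    // big enough, otherwise the function falls back to an ordinary allocation.
    size_t[] mappedValues(R)(R sequence, size_t[] buffer = null)
    {
        auto mapped = sequence.map!(x => size_t(x) + 1);
        if (buffer.length >= sequence.length)
        {
            auto rest = mapped.copy(buffer);
            return buffer[0 .. buffer.length - rest.length];
        }
        return mapped.array;  // no buffer given, or buffer too small
    }

A caller that cares about performance can keep a scratch array around and pass it in; other callers can simply omit it.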

Bye,
bearophile
July 31, 2013
On 07/31/2013 02:31 PM, bearophile wrote:
> Then I suggest trying ldc2 too.

I haven't profiled with ldc2 yet (I will) but the broad speed tendencies are similar.  It's faster but not so much faster that I can assume the speed hit isn't still there.

> My suggestion is to copy the implementation of those ranges into a test module, and try to understand the causes of the GC activity you see.

I'll look into that.  Thanks for the suggestion. :-)

> If this resulting array is used only for a short period of time:
> 
> return sequence.map!(x => x.func).array;
> 
> 
> then perhaps you can perform the array allocation in a constructor or static constructor, or you could allocate a fixed-size array inside the body of the graph, etc. Let's say this array is named "buf".
> 
> Then with code like:
> 
> auto left = sequence.map!(x => x.func).copy(buf[]);
> return buf[...];
> 
> left is the unfilled part of buf returned by copy(), so you can slice buf
> accordingly and return it. Somewhere you also have to save the length of the
> used part of buf.
> If the resulting array is used only for a moment, this will not cause trouble
> in a single-threaded program. The point is to remove the memory allocation
> caused by array().
> 
> An alternative design comes from Tango: instead of allocating the buffer somewhere, your function takes an optional buffer and uses it unless it's too small for the purpose; otherwise it allocates with array() or something else.

Oh, I see.  Yes, that makes sense and could be useful.

Actually, something like this (but not done as elegantly as your example) was the reason behind the cached version of the graph.  I did think of just returning a small buffer with the neighbours in, but I was concerned that if someone wanted to keep using the resulting array over a longer period of time, errors could creep in.

So, I came up with the full cache, which is memory-heavy but safer, as any copies you take remain valid so long as the graph itself is not altered.

But most likely the best compromise is, as you suggest, a buffer, with the instruction ".dup the result if you want it to be persistent".

July 31, 2013
Joseph Rushton Wakeling:

> But most likely the best compromise is, as you
> suggest, a buffer, with the instruction ".dup the result
> if you want it to be persistent".

In my library code I often do the opposite: I add a boolean template switch named "doCopy" that defaults to true. If it's true, the function copies the buffer. So if you don't know what you are doing, or you don't care about performance (and this happens often), it's safer. When you know what you are doing, or your profiling (or your experience) tells you that you want more performance, you call the function with a false doCopy, and it doesn't allocate a new buffer.
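
Schematically, it looks something like this (illustrative only; fillScratchBuffer is a hypothetical helper that writes into an internal, reused buffer and returns a slice of it):

    // doCopy switch: by default the caller gets an independent copy, which is
    // always safe; passing doCopy = false returns a slice of the internal
    // scratch buffer, valid only until the next call.
    // Assumed helper (not real library code): size_t[] fillScratchBuffer(size_t v)
    size_t[] neighbours(bool doCopy = true)(size_t v)
    {
        auto result = fillScratchBuffer(v);
        static if (doCopy)
            return result.dup;
        else
            return result;
    }

    // usage:
    // auto safeCopy = neighbours(v);        // default: allocates a copy
    // auto view     = neighbours!false(v);  // no allocation; use immediately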

This is the design I suggested for File.byLine(), but Andrei preferred a simpler design that doesn't create a new buffer for each line.

Bye,
bearophile