Implementing and optimizing a simple graph metric
Hello all,

I thought I'd do a writeup of the process of implementing and optimizing one of the graph metrics in Dgraph, starting from a fairly straight copy of the pseudo-code in a research paper and working through the various incremental tweaks that improve performance.
http://braingam.es/2013/09/betweenness-centrality-in-dgraph/

I guess the optimizations made in this process are trivial for anyone experienced in D, but I hope the detailed description of which changes produce useful speedups will be helpful to people new to the language.

{Enj,Destr}oy :-)

Best wishes,

-- Joe
Joseph Rushton Wakeling:

Some small improvements:

// Needs: import std.array : Appender, minimallyInitializedArray;
//        import std.traits : isFloatingPoint;
T[] betweenness(Graph, T = double)(ref Graph g, bool[] ignore = null)
    if (isGraph!Graph && isFloatingPoint!T)
{
    enum T T0 = 0;
    enum T T1 = 1;
    auto centrality = minimallyInitializedArray!(typeof(return))(g.vertexCount);
    centrality[] = T0;
    auto stack = new size_t[g.vertexCount];
    auto sigma = minimallyInitializedArray!(T[])(g.vertexCount);
    sigma[] = T0;
    auto delta = minimallyInitializedArray!(T[])(g.vertexCount);
    delta[] = T0;
    auto d = minimallyInitializedArray!(long[])(g.vertexCount);
    d[] = -1;
    auto q = VertexQueue(g.vertexCount);
    auto p = new Appender!(size_t[])[g.vertexCount];

    foreach (immutable s; 0 .. g.vertexCount)
    {
        if (ignore && ignore[s])
        {
            continue;
        }

        size_t stackLength = 0;
        assert(q.empty);
        sigma[s] = T1;
        d[s] = 0;
        q.push(s);

        while (!q.empty)
        {
            immutable size_t v = q.front;
            q.pop;
            stack[stackLength] = v;
            ++stackLength;
            foreach (immutable w; g.neighboursOut(v))
            {
                if (ignore && ignore[w])
                {
                    continue;
                }

                if (d[w] < 0)
                {
                    q.push(w);
                    d[w] = d[v] + 1;
                    assert(sigma[w] == T0);
                    sigma[w] = sigma[v];
                    p[w].clear;
                    p[w].put(v);
                }
                else if (d[w] > (d[v] + 1))
                {
                    /* w has already been encountered, but we've
                       found a shorter path to the source.  This
                       is probably only relevant to the weighted
                       case, but let's keep it here in order to
                       be ready for that update. */
                    d[w] = d[v] + 1L;
                    sigma[w] = sigma[v];
                    p[w].clear;
                    p[w].put(v);
                }
                else if (d[w] == (d[v] + 1))
                {
                    sigma[w] += sigma[v];
                    p[w].put(v);
                }
            }
        }

        while (stackLength > 0)
        {
            --stackLength;
            immutable w = stack[stackLength];
            foreach (immutable v; p[w].data)
            {
                delta[v] += ((sigma[v] / sigma[w]) * (T1 + delta[w]));
            }
            if (w != s)
            {
                centrality[w] += delta[w];
            }
            sigma[w] = T0;
            delta[w] = T0;
            d[w] = -1;
        }
    }

    static if (!g.directed)
    {
        centrality[] /= 2;
    }

    return centrality;
}

Just for a test, try to allocate all those arrays in a different way:
- First try a std.array.Array using the LDC2 compiler;
- Another thing to try is to allocate them on the stack using core.stdc.stdlib.alloca:
auto p = cast(T*)alloca(T.sizeof * g.vertexCount);
if (!p) throw new MemoryError...
auto centrality = p[0 .. g.vertexCount];
This probably can't be used for very large graphs, so you have to add further logic to allocate the arrays on the C heap if they are too large. This could be useful if you call betweenness() many, many times.
- Try to optionally accept the buffers from outside.
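As a rough sketch of the alloca-plus-fallback idea (the function name, the 4096-element cutoff, and the error handling here are illustrative assumptions, not Dgraph code):

```d
import core.stdc.stdlib : alloca, malloc, free;
import core.exception : OutOfMemoryError;

T sumOfOnes(T)(size_t vertexCount)
{
    enum stackLimit = 4096;              // assumed cutoff, in elements
    immutable onHeap = vertexCount > stackLimit;
    T* ptr;
    if (onHeap)
        ptr = cast(T*) malloc(T.sizeof * vertexCount);  // large: C heap
    else
        ptr = cast(T*) alloca(T.sizeof * vertexCount);  // small: stack
    if (ptr is null)
        throw new OutOfMemoryError;
    scope(exit) if (onHeap) free(ptr);   // alloca memory frees itself on return

    auto buf = ptr[0 .. vertexCount];    // slice over the raw storage
    buf[] = 1;
    T total = 0;
    foreach (x; buf)
        total += x;
    return total;
}
```

The slice `ptr[0 .. vertexCount]` gives the rest of the function ordinary array syntax over the manually allocated storage, so the algorithm body need not change.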

Bye,
bearophile
On Wednesday, 18 September 2013 at 13:39:29 UTC, bearophile wrote:
> Just for a test, try to allocate all those arrays in a different way:
> - First try a std.array.Array using the LDC2 compiler;
> - Another thing to try is to allocate them on the stack using core.stdc.stdlib.alloca:
> auto p = cast(T*)alloca(T.sizeof * g.vertexCount);
> if (!p) throw new MemoryError...
> auto centrality = p[0 .. g.vertexCount];
> This probably can't be used for very large graphs, so you have to add further logic to allocate the arrays on the C heap if they are too large. This could be useful if you call betweenness() many, many times.
> - Try to optionally accept the buffers from outside.

Nice suggestions, I'll try those out! :-)

I think I did give std.array.Array a trial when trying to speed up the code, and I don't remember it making any difference (if anything it may have slowed things down).  But I'll give it a second look and report back.

I haven't yet tried alloca or other manual memory management -- I felt a bit resistant to this as I'd prefer to keep the code simple and readable -- but I'll give that a go too just to see how it goes.

It's interesting that you pick up on my use of to!T(0) etc.  I had some concern about whether that would affect speed at all.  At the time of originally writing the code, my impression was that not having it actually made things worse, and that the compiler was smart enough to carry out the conversion at compile-time.

However, my impression now is that having or not having it, or any other method (e.g. enum T0, T1, etc.) makes absolutely no difference, and that I might as well be simple and write 0 for integral-types, 0.0 for floating-point.
Joseph Rushton Wakeling:

> I haven't yet tried alloca or other manual memory management -- I felt a bit resistant to this as I'd prefer to keep the code simple and readable -- but I'll give that a go too just to see how it goes.

I'd like some stack-allocated variable-length array in D+Phobos, as in C++14. It's also a basis to build several other stack-allocated data structures.
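Something in this direction can be approximated in library code today; a small-buffer sketch (the name SmallArray and the cutoff N are my own, purely illustrative) keeps storage on the stack up to N elements and falls back to the GC heap beyond that:

```d
// Hypothetical small-buffer array: stack storage up to N elements,
// GC heap beyond that.
struct SmallArray(T, size_t N = 128)
{
    private T[N] fixed;   // in-place storage, lives on the stack
    private T[] heap;     // used only when n > N
    private size_t len;

    this(size_t n)
    {
        len = n;
        if (n > N)
            heap = new T[n];
    }

    // Expose the active storage as an ordinary slice.
    inout(T)[] opSlice() inout
    {
        return len <= N ? fixed[0 .. len] : heap[0 .. len];
    }
}
```

Usage would then look like `auto sigma = SmallArray!(double, 128)(g.vertexCount); sigma[][] = 0;`, with small graphs paying no allocation cost at all.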

> However, my impression now is that having or not having it, or any other method (e.g. enum T0, T1, etc.) makes absolutely no difference, and that I might as well be simple and write 0 for integral-types, 0.0 for floating-point.

Right. And this could be right even if you want T=Rational.

Bye,
bearophile
On Wednesday, 18 September 2013 at 15:22:51 UTC, bearophile wrote:
> Joseph Rushton Wakeling:
>
>> I haven't yet tried alloca or other manual memory management -- I felt a bit resistant to this as I'd prefer to keep the code simple and readable -- but I'll give that a go too just to see how it goes.
>
> I'd like some stack-allocated variable-length array in D+Phobos, as in C++14. It's also a basis to build several other stack-allocated data structures.

Yes, that'd be useful, although in the case of this code I think that stack vs. heap probably isn't that important.  If the data is small enough for stack allocation, the calculation will be quick anyway.

>> However, my impression now is that having or not having it, or any other method (e.g. enum T0, T1, etc.) makes absolutely no difference, and that I might as well be simple and write 0 for integral-types, 0.0 for floating-point.
>
> Right. And this could be right even if you want T=Rational.

Actually, on re-testing this just now, I'm returning to my original view.  I find that if you put in raw floating-point numbers like 0.0, 1.0 etc. the resulting code can be slower in the case where T = float.  Putting to!T or using enums as you suggest appears to make no difference.

This is a guess rather than confirmed by looking at assembly, but I'm presuming that to!T(0) and other conversions of compile-time constants are evaluated at compile time, so you don't in practice carry the cost you'd normally get from using to().
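For illustration, the distinction under discussion (the names here are mine, not Dgraph's): with `T = float`, a plain `0.0` is a `double` literal, whereas `enum T T0 = 0` and `to!T(0)` both produce a `T`-typed constant that the compiler can fold away.

```d
import std.conv : to;

void constants(T)()
{
    enum T T0 = 0;   // typed enum constant: folded at compile time
    T a = to!T(0);   // conversion of a literal; CTFE/inlining normally
                     // reduces this to the same constant
    T b = 0.0;       // double literal: for T == float this is a double
                     // constant implicitly converted to float
    assert(a == T0 && b == T0);
}
```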
Joseph Rushton Wakeling:

> in the case of this code I think that stack vs. heap probably isn't that important.  If the data is small enough for stack allocation, the calculation will be quick anyway.

How many times or how often do you need to call betweenness()? If it's called only a few times, or once in a while, then using the GC is good enough. But if you have to call it many millions of times, all those GC array allocations could cost some time, even if they are small arrays. In that case allocating them on the stack (or not allocating them at all, using externally allocated buffers) helps.

Bye,
bearophile
On Wednesday, 18 September 2013 at 15:17:25 UTC, Joseph Rushton Wakeling wrote:
> I think I did give std.array.Array a trial when trying to speed up the code, and I don't remember it making any difference (if anything it may have slowed things down).  But I'll give it a second look and report back.

I tried rewriting the variable declarations:

Array!T centrality;
centrality.length = g.vertexCount;
centrality[] = to!T(0);
Array!size_t stack;
stack.length = g.vertexCount;
Array!T sigma;
sigma.length = g.vertexCount;
Array!T delta;
delta.length = g.vertexCount;
Array!long d;
d.length = g.vertexCount;
auto q = VertexQueue(g.vertexCount);
Appender!(size_t[])[] p = new Appender!(size_t[])[g.vertexCount];

It results in a moderate slowdown of the code, at a guess because it increases the total number of mallocs/deallocs.

I note that there seems to be no way to initialize std.container.Array as having a certain length ... ?

I tried instead using std.array.uninitializedArray to initialize the arrays in the betweenness() function -- it was a bit difficult to judge here: with a relatively small number of calls to betweenness() it seems to have resulted in a speedup, but with very many calls, it seems to have been slower.
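For reference, the difference being tested (an illustrative snippet, not the actual Dgraph code): `uninitializedArray` leaves contents undefined, so they must be assigned before use, while `minimallyInitializedArray` zero-initializes anything containing indirections, which makes it the safer choice for the array of `Appender`s:

```d
import std.array : Appender, uninitializedArray, minimallyInitializedArray;

void buffers(size_t n)
{
    auto sigma = uninitializedArray!(double[])(n);  // contents are garbage here
    sigma[] = 0.0;                                  // so assign before reading
    auto p = minimallyInitializedArray!(Appender!(size_t[])[])(n);
    p[0].put(size_t(42));  // safe: internal pointers were zero-initialized
    assert(sigma[n - 1] == 0.0);
    assert(p[0].data.length == 1 && p[0].data[0] == 42);
}
```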
On Wednesday, 18 September 2013 at 13:39:29 UTC, bearophile wrote:
> - Try to optionally accept the buffers from outside.

Does this look good to you?

/////////////////////////////////////////////////

auto ref betweenness(T = double, Graph)(ref Graph g, bool[] ignore = null)
    if (isFloatingPoint!T && isGraph!Graph)
{
    T[] centrality = new T[g.vertexCount];
    return betweenness!(T, Graph)(g, centrality, ignore);
}

auto ref betweenness(T = double, Graph)(ref Graph g, ref T[] centrality, bool[] ignore = null)
{
    centrality.length = g.vertexCount;
    centrality[] = to!T(0);
    // ... the rest as before
}

/////////////////////////////////////////////////
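The point of the second overload is that repeated calls can reuse a single allocation. A minimal sketch of the pattern (`compute` is a stand-in for illustration, not the real betweenness):

```d
// The callee resizes the caller's buffer in place; after the first call,
// setting .length to the same value is a no-op, so no new allocation occurs.
T[] compute(T)(size_t n, ref T[] buf)
{
    buf.length = n;  // allocates on the first call only
    buf[] = 0;
    return buf;
}
```

A caller in a hot loop would declare `double[] buf;` once and pass it to every call, so only the first iteration pays for the allocation.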
On Wednesday, 18 September 2013 at 17:13:28 UTC, bearophile wrote:
> How many times or how often do you need to call betweenness()? If it's called only a few times, or once in a while, then using the GC is good enough. But if you have to call it many millions of times, all those GC array allocations could cost some time, even if they are small arrays. In that case allocating them on the stack (or not allocating them at all, using externally allocated buffers) helps.

I think millions of times is excessive. In my test code I have an example with a 50-node network calculating betweenness centrality 10000 times, which takes under 1.5 seconds. So obviously if you scale up to millions, small improvements could make a difference, but it's not going to be on such a scale that it's worth prioritizing, at least for now. I think even 10000 calculations is excessive for any practical case I can think of.

You're right to call for the opportunity to pass a buffer, though, and I've enabled that. In the current test code it doesn't seem to improve anything, but that may be simply a matter of scale.
On Wednesday, 18 September 2013 at 13:39:29 UTC, bearophile wrote:
>     auto centrality = minimallyInitializedArray!(typeof(return))(g.vertexCount);
>     centrality[] = T0;
>     auto stack = new size_t[g.vertexCount];
>     auto sigma = minimallyInitializedArray!(T[])(g.vertexCount);
>     sigma[] = T0;
>     auto delta = minimallyInitializedArray!(T[])(g.vertexCount);
>     delta[] = T0;
>     auto d = minimallyInitializedArray!(long[])(g.vertexCount);
>     d[] = -1;
>     auto q = VertexQueue(g.vertexCount);
>     auto p = new Appender!(size_t[])[g.vertexCount];

As an experiment I tried something along these lines -- using uninitializedArray for most of the arrays here and minimallyInitializedArray for p.

It seems to result in a small speed improvement, although I'm not certain about that -- certainly not a speed loss.

On the other hand, if inside the VertexQueue implementation, I replace the "new" declaration with an uninitializedArray call, the code gets slower by a noticeable amount.

Very odd -- any ideas why that might be?