July 11, 2013
On Thu, Jul 11, 2013 at 08:34:17PM +0200, Joseph Rushton Wakeling wrote:
> On Thursday, 11 July 2013 at 18:14:01 UTC, w0rp wrote:
> >I don't like it. Here is why.
> >
> >1. You can only use a size_t type as the vertex type. What about strings?
> 
> This is a core graph type designed explicitly for speed and efficiency. If you want string ids for vertices, it's better to wrap this with maps from string to size_t and back again. Having strings or other arbitrary data types involved in the core data type will just be a nightmare space- and speed-wise.
[...]
> >My most common use of graphs is to start from nothing, build from adding edges arbitrarily, usually from IDs which may be integers, and may be strings, and then use graph algorithms to gain some understanding of the relation between objects mapped to these IDs. With this library, this will be difficult or impossible.
> 
> I understand those requirements, I just think that the core data structure on which calculations are done is not the ideal place to implement them.
> 
> (Sorry for terse replies, I'm writing on my phone...)

That makes sense. The core graph type is what's used internally by the graphing algorithms, right? So it should be optimized for maximum performance in those algorithms.

However, I think we should also provide friendlier representations for user-facing types. One possibility is a wrapper type that accepts string IDs for nodes/edges, and internally maintains a mapping between them and numerical IDs, so that the user doesn't have to keep doing this manually. This way the algorithms can run at top speed, and the user can still get a nice API for string-based IDs.


T

-- 
Freedom of speech: the whole world has no right *not* to hear my spouting off!
July 11, 2013
On Thursday, 11 July 2013 at 18:42:10 UTC, H. S. Teoh wrote:
> However, I think we should also provide friendlier representations for
> user-facing types. One possibility is a wrapper type that accepts string
> IDs for nodes/edges, and internally maintains a mapping between them and
> numerical IDs, so that the user doesn't have to keep doing this
> manually. This way the algorithms can run at top speed, and the user can
> still get a nice API for string-based IDs.

Yes, I agree. The goal right now is to get the internal stuff really well done, with a friendly face to follow.
July 11, 2013
11-Jul-2013 20:02, Joseph Rushton Wakeling пишет:
> On 07/11/2013 05:17 PM, Andrei Alexandrescu wrote:
>> Unstable sort should be faster than stable sort. If I remember correctly (1) our
>> TimSort is indeed slower than quicksort on random numbers, (2) there is a
>> performance bug that makes our quicksort perform quadratically on data that's
>> essentially sorted but has one unsorted element at the end. Is that the case?
>
> That's _exactly_ the case here. :-\

In such a case if inserting exactly one element into a sorted range you may want to special case it to lowerbound + insert.
E.g. something like this:

auto idx = sortedRange.lowerbound(val).length;
//insert should be called on array backing that sortedRange
insert(sortedRange, idx, val);

> Thanks for the clarification!
>


-- 
Dmitry Olshansky
July 11, 2013
On 07/11/2013 10:43 PM, Dmitry Olshansky wrote:
> In such a case if inserting exactly one element into a sorted range you may want
> to special case it to lowerbound + insert.
> E.g. something like this:
> 
> auto idx = sortedRange.lowerbound(val).length;
> //insert should be called on array backing that sortedRange
> insert(sortedRange, idx, val);

... which would basically correspond to an insertion sort, right?

I'm quite tired on reading this so will have to think it over fresh tomorrow. What I'm not certain of is how to tweak it given that actually here we have two arrays:

    size_t[] values;  // doesn't change, but we've added a new value at the end
    size_t[] sortedIndices; // indices 0 .. values.length, sorted according
                            // to values[i]

... so, it's sortedIndices that we need to sort, but according to the corresponding entries in the array of values.

Anyway, thanks for the thought -- lowerbound() is a function I've never used before so it wouldn't have occurred to me.
July 11, 2013
On 07/11/2013 08:14 PM, w0rp wrote:
> 3. I have to control the graph's capacity manually, and I can't use arbitrary numbers for edges.

I'm not sure I understand why this is a problem with the class presented here. Can you clarify what your concern is?
July 11, 2013
12-Jul-2013 02:23, Joseph Rushton Wakeling пишет:
> On 07/11/2013 10:43 PM, Dmitry Olshansky wrote:
>> In such a case if inserting exactly one element into a sorted range you may want
>> to special case it to lowerbound + insert.
>> E.g. something like this:
>>
>> auto idx = sortedRange.lowerbound(val).length;
>> //insert should be called on array backing that sortedRange
>> insert(sortedRange, idx, val);
>
> ... which would basically correspond to an insertion sort, right?

Yup, but done in one step. I'd put it the other way around: simply put this is inserting an item into a sorted range => binary search + insert. An insertion sort is built around this operation performed repeatedly.

>
> I'm quite tired on reading this so will have to think it over fresh tomorrow.
> What I'm not certain of is how to tweak it given that actually here we have two
> arrays:
>
>      size_t[] values;  // doesn't change, but we've added a new value at the end
>      size_t[] sortedIndices; // indices 0 .. values.length, sorted according
>                              // to values[i]

Append new value to values.

Then use 'values.length-1' (new length - 1 i.e. the old length) as an item to insert into sortedIndices.

The last step is to figure out what range to call lowerBound on - I'd say something like:

assumeSorted(sortedIndices.map!(x => values[x]))

then use that to find a suitable place to insert in sortedIndices.
>
> ... so, it's sortedIndices that we need to sort, but according to the
> corresponding entries in the array of values.
>
> Anyway, thanks for the thought -- lowerbound() is a function I've never used
> before so it wouldn't have occurred to me.
>


-- 
Dmitry Olshansky
July 12, 2013
On 07/11/2013 08:14 PM, w0rp wrote:
> 1. You can only use a size_t type as the vertex type. What about strings?

I want to reply at greater length on this point, because I don't want you to feel I'm dismissing your concerns.  I remember that back in our earlier discussions you posted a basic data type of something like,

    WeightType[VertexID][VertexID];

where VertexID might be an integer, a string, or potentially something else.

Let's simplify and suppose we have bool[VertexID][VertexID], so that it's just true of there's a link, false otherwise.  In principle you could simplify still further and just have a set of VertexID pairs, except that Phobos still doesn't have a decent Set implementation (... hint, hint ... :-)

This will be quick (and fairly memory efficient, I'd have thought) to create, and the distinction between Set and MultiSet would allow an easy constraint on whether you have a graph or a multigraph.

You also have a ready and efficient way to check if a given link (head, tail) is in the graph, although I think I can probably do a good job on that with the current implementation (more on that another time:-).

What I'm not sure that you have so readily is an efficient way to extract e.g. degree(v), or neighbours(v), and I'm concerned that the extra data you'll have to carry to change this may actually lead to a memory blow-up.

I'd be very happy to be proved wrong about this, so if you have some ideas, or better still code for comparison and benchmarking, I'll happily reconsider the design.

But for now my contention is that generic vertex IDs are only relevant when it comes to input or output, and that's where they should be implemented, not in the core processing code.

> 3. I have to control the graph's capacity manually, and I can't use arbitrary numbers for edges.

I don't see the capacity issue as being different from e.g. the fact that an array's capacity must be controlled manually unless you explicitly attempt to append to the end of it.

As for arbitrary numbers for edges: I think in practice this should be OK, so
long as you ensure that vertexCount is greater than the maximum vertex ID
number, and there are not _too_ many gaps between the vertex IDs that are used.
 (e.g. if your vertex IDs are 1 to 50 rather than 0 to 49, you should have no
problem.)  Then, most results can be adapted simply by filtering out "nodes"
with degree == 0.

If on the other hand your node IDs are 10, 20, 30, ... you might not get such a good performance ... :-)

> 4. Adding an edge doesn't add vertices.

Here I don't think I have anything to add apart from that I don't like the idea of a graph that might surreptitiously add an arbitrary number of vertices simply on the basis of an edge being added.  I'd rather force the user to check that the link being added is valid given the vertices available, and provide a wrapper to offer those kinds of checks and the opportunity to expand the number of vertices if needed.
July 12, 2013
On 07/12/2013 12:37 AM, Dmitry Olshansky wrote:
> Append new value to values.
> 
> Then use 'values.length-1' (new length - 1 i.e. the old length) as an item to insert into sortedIndices.
> 
> The last step is to figure out what range to call lowerBound on - I'd say something like:
> 
> assumeSorted(sortedIndices.map!(x => values[x]))
> 
> then use that to find a suitable place to insert in sortedIndices.

Thanks very much for that, I'll try it out and report back on performance. :-)

I think I may move the discussion over to D.learn as I have some other new profiling results to follow up on -- it'll be an interesting lesson in how to manage memory for performance.
July 12, 2013
On 07/12/2013 12:37 AM, Dmitry Olshansky wrote:
> Append new value to values.
> 
> Then use 'values.length-1' (new length - 1 i.e. the old length) as an item to insert into sortedIndices.
> 
> The last step is to figure out what range to call lowerBound on - I'd say something like:
> 
> assumeSorted(sortedIndices.map!(x => values[x]))
> 
> then use that to find a suitable place to insert in sortedIndices.

So, just to report back -- this is the code I came up with based on your suggestion:

    void indexEdges()
    {
        immutable size_t l = _indexHead.length;
        foreach(e; iota(l, _head.length))
        {
            size_t i = _indexHead.map!(a =>
_head[a]).assumeSorted.lowerBound(_head[e]).length;
            insertInPlace(_indexHead, i, e);
            i = _indexTail.map!(a =>
_tail[a]).assumeSorted.lowerBound(_tail[e]).length;
            insertInPlace(_indexTail, i, e);
        }
    }

This is _much_ faster than the previous code (although still not as fast as igraph when adding multiple edges in one go), and also more memory efficient.

I'm a little bothered that the insertion implies potentially many re-allocs, and I wonder if it might be even better if the length of _indexHead and _indexTail can be increased in one go, and the "insertion" of the new edge index happen without risking a reallocation.

Anyway, thanks very much for the useful idea!

igraph is still ultimately faster in the case where multiple edges are added in a single go, but that's an issue we can confront later.
July 12, 2013
On 07/12/2013 03:12 PM, Joseph Rushton Wakeling wrote:
> I'm a little bothered that the insertion implies potentially many re-allocs, and I wonder if it might be even better if the length of _indexHead and _indexTail can be increased in one go, and the "insertion" of the new edge index happen without risking a reallocation.

An attempt at an alternative:

        immutable size_t l = _indexHead.length;
        _indexHead.length = _head.length;
        _indexTail.length = _tail.length;
        foreach(e; iota(l, _head.length))
        {
            size_t i, j;
            i = _indexHead[0 .. e].map!(a =>
_head[a]).assumeSorted.lowerBound(_head[e]).length;
            for(j = e; j > i; --j)
                _indexHead[j] = _indexHead[j - 1];
            _indexHead[i] = e;
            i = _indexTail[0 .. e].map!(a =>
_tail[a]).assumeSorted.lowerBound(_tail[e]).length;
            for(j = e; j > i; --j)
                _indexTail[j] = _indexTail[j - 1];
            _indexTail[i] = e;
        }

It doesn't seem to make any difference in speed or memory consumption to the current code (OK, perhaps a tiny speed increase, but very tiny).

It might have some relevance in the case where multiple edges are added in one go.  I'll keep it in mind for that.
1 2
Next ›   Last »