Thread overview
Opportunistic worst-case complexity upgrade? aka binary search with `find`
Jul 24, 2014
Jakob Ovrum
Jul 24, 2014
Jakob Ovrum
Jul 24, 2014
Jonathan M Davis
Jul 27, 2014
Jakob Ovrum
Jul 27, 2014
Xinok
July 24, 2014
We should talk about a design question surrounding binary search with `canFind`/`find` and possibly other linear-search functions.

Currently we have binary search in Phobos as part of std.range.SortedRange. Its interface is not compatible with `canFind` or `find` - you can't simply wrap the haystack in a SortedRange and pass it to an algorithm to give you logarithmic complexity.

The first question is whether this opportunistic upgrade is desirable - binary search has much better worst-case complexity than linear search, but it's not necessarily faster in the average case, which depends on the specific use pattern. One important thing to note is that, assuming the binary-search specialization is documented, the user can use `SortedRange.release` to explicitly request linear search.

Myself and others have sometimes mistakenly expected `find` and friends to be specialized for `SortedRange` inputs, opportunistically providing better worst-case complexity, but this is not the case. It seems simple at first glance, but the problem lies in the predicate - binary search can only be leveraged when the specific order is known:

---
auto data = assumeSorted([1, 2, 3]);

// Equality, so bsearch then trot left
auto result = data.find!((x, e) => x == e)(2); // default predicate
assert(result.equal([2, 3]));

// Opposite order and exclusive, bsearch then trot right
result = data.find!((x, e) => x > e)(2);
assert(result.equal([3]));

// Same order, bsearch then trot left.
// Compare first element as an optimization?
result = data.find!((x, e) => x < e)(0);
assert(result.empty);

struct S { string name; }
auto data2 = assumeSorted!((a, b) => a.name < b.name)(
    [S("a"), S("b"), S("c")]
);

// Same order and exclusive, but with a transformation, yikes...
auto result2 = data2.find!((x, e) => x.name < e.name)(S("b"));
assert(result2.equal(data2));
---

Identifying the characteristics described in the code comments above is the biggest problem: predicate functions don't have any notion of equality.

String lambdas can be compared for equality, but they're really fragile: "a == b" != "b == a" etc. Besides, string lambdas are undesirable for other reasons and should be phased out in the long-term[1].

Someone suggested defining a standard set of predicates, making it look like this:

---
auto data = assumeSorted!less([1, 2, 3]); // Would be default

// Equality, so bsearch then trot left
auto result = data.find!equality(2); // Would be default
---

Templates can be compared with __traits(isSame, ...), so this approach seems feasible. If we want to do this, this seems like the most realistic approach. I'm not sure if it can be made to work when transformation of arguments is involved, but it might still be worth doing.

Another issue is that SortedRange's interface does not actually support all the above patterns because it splits the result into 3 instead of 2; we would need to amend SortedRange to support the inverse functions of lowerBound and upperBound.

So, what does the community think? Desirable, or not? Thoughts about implementation?

[1] http://forum.dlang.org/post/jnlqesrwxfekdsxjerlp@forum.dlang.org
July 24, 2014
On Thursday, 24 July 2014 at 01:21:44 UTC, Jakob Ovrum wrote:
> -snip-

Another point is that the range types of the two currently available sorted containers - RedBlackTree and BinaryHeap - are *not* instances of SortedRange. If algorithms working on sorted ranges become a thing, it seems like they should be.
July 24, 2014
On Thursday, 24 July 2014 at 01:26:48 UTC, Jakob Ovrum wrote:
> On Thursday, 24 July 2014 at 01:21:44 UTC, Jakob Ovrum wrote:
>> -snip-
>
> Another point is that the range types of the two currently available sorted containers - RedBlackTree and BinaryHeap - are *not* instances of SortedRange. If algorithms working on sorted ranges become a thing, it seems like they should be.

Maybe a better approach would be do just have sorted range types define find themselves? UFCS would then make it so that the best search function for that type would be used (e.g. I don't think that binary search is the best method to use a red-black tree).

However, regardless of that, it could be useful to know whether a range is sorted or not in general, and we may want to change how we do that so that there's an enum which indicates it rather than simply wrapping it in SortedRange (e.g. we know that RedBlackTree's range type is sorted, but I doubt that we want to always be wrapping it in a SortedRange to show that).

- Jonathan M Davis
July 27, 2014
On Thursday, 24 July 2014 at 21:08:58 UTC, Jonathan M Davis wrote:
> On Thursday, 24 July 2014 at 01:26:48 UTC, Jakob Ovrum wrote:
>> On Thursday, 24 July 2014 at 01:21:44 UTC, Jakob Ovrum wrote:
>>> -snip-
>>
>> Another point is that the range types of the two currently available sorted containers - RedBlackTree and BinaryHeap - are *not* instances of SortedRange. If algorithms working on sorted ranges become a thing, it seems like they should be.
>
> Maybe a better approach would be do just have sorted range types define find themselves? UFCS would then make it so that the best search function for that type would be used (e.g. I don't think that binary search is the best method to use a red-black tree).

Right, the two tree structures can't be searched properly using a generic binary search algorithm.

I don't think having containers define `find` as a member is really feasible. `find` is a highly generic function with a multitude of forms, and to be interchangeable, every container would have to support all these forms. That sounds extremely impractical.

> However, regardless of that, it could be useful to know whether a range is sorted or not in general, and we may want to change how we do that so that there's an enum which indicates it rather than simply wrapping it in SortedRange (e.g. we know that RedBlackTree's range type is sorted, but I doubt that we want to always be wrapping it in a SortedRange to show that).
>
> - Jonathan M Davis

The container could simply make its Range type an alias to SortedRange!(RangeImpl, pred). It wouldn't cost anything. But, of course, as you pointed out, it's not possible for the tree structures, at least not without changing something in SortedRange.

Maybe SortedRange's search functions should defer to the underlying range if the underlying range supports logarithmic search, such as by defining `lowerBound` et al. as members of the range type. This would cause some duplication of course, as currently, `lowerBound` et al. are container primitives. However, wouldn't it be the right place for them?

Currently, using `SortedRange`, we can do binary search on a *subset* of the original range, as its search functions in turn return sorted ranges:

---
auto data = [1, 2, 3, 4, 5].assumeSorted();

auto lower = data.lowerBound(5);
auto target = lower.upperBound(1); // Narrower search area!

assert(target.equal([2, 3, 4]));
---

In the current container model, where they are primitives of the container and not the container's range, this is not possible.
July 27, 2014
On Thursday, 24 July 2014 at 01:26:48 UTC, Jakob Ovrum wrote:
> On Thursday, 24 July 2014 at 01:21:44 UTC, Jakob Ovrum wrote:
>> -snip-
>
> Another point is that the range types of the two currently available sorted containers - RedBlackTree and BinaryHeap - are *not* instances of SortedRange. If algorithms working on sorted ranges become a thing, it seems like they should be.

Just to clarify, BinaryHeap is heapified but not sorted, so it would be impossible to do a binary search on it. It's only an InputRange so, at best, you could do a linear search on it.