June 08, 2012
On Friday, June 08, 2012 00:30:06 Dmitry Olshansky wrote:
> On 07.06.2012 22:34, Peter Alexander wrote:
> > Don't get me started on std.algorithm.find...
> 
> Why not? It's a perfect example of the goodness of generic programming:
> flexibility (for free) + specialization (when you need it), for the win.
> If your language lacks such a construct, you end up inventing ways around it:
> - make findInt, findString, quickFindString, findPerson etc. * number of
> containers
> - shoehorn everything into a simpler design: qsort, boxing-unboxing etc.

std.algorithm does a lot of really cool stuff in a very generic way, and on the whole, it works very well. It's true that its functions tend to be more complicated than most, but they're generally not complicated to use, and if you're going to have that kind of complication, it's best to have it in the standard library written by experts in the language rather than trying to write it yourself. And D manages to make such functions _very_ sane in comparison to what C++ would need to do to get the same result. If anything, the problem here is simply that not all of these functions are tested as well as they should be (as the bug with min shows).

Personally, my favorite function in std.algorithm is startsWith (or endsWith). It just about blew my mind when I realized what it was doing. It is fantastically cool how it uses recursive templates to get its job done, and it makes for an incredibly powerful and easy to use function.
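For instance, here's a sketch of the variadic form in action (behavior per the Phobos documentation of the time: with one needle it returns bool; with several, it returns 0 for no match, otherwise the 1-based index of the needle that matched):

import std.algorithm : startsWith;

void main()
{
    // Single needle: returns bool.
    assert(startsWith("phobos", "pho"));

    // Multiple needles: returns the 1-based index of the needle
    // that matched, or 0 if none did.
    assert(startsWith("phobos", "dei", "pho") == 2);
    assert(startsWith("phobos", "dei", "mos") == 0);
}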

- Jonathan M Davis
June 08, 2012
On 6/7/12 6:50 PM, Peter Alexander wrote:
> On Thursday, 7 June 2012 at 22:29:09 UTC, Andrei Alexandrescu wrote:
>> Great points, example could be a lot better. Maybe it's time you do
>> get started on find(). Destroy or be destroyed.
>
> Ok...
>
> This overload:
>
> R find(alias pred = "a == b", R, E)(R haystack, E needle)
>     if (isInputRange!R &&
>         is(typeof(binaryFun!pred(haystack.front, needle)) : bool))
> {
>     for (; !haystack.empty; haystack.popFront())
>     {
>         if (binaryFun!pred(haystack.front, needle)) break;
>     }
>     return haystack;
> }
>
>
> Is fine. In fact it's practically perfect. It's the obvious solution to
> a simple problem. The biggest problem is the "is typeof" syntax, but
> that's not your fault.

So far so good.

> Second overload:
>
> R1 find(alias pred = "a == b", R1, R2)(R1 haystack, R2 needle)
>     if (isForwardRange!R1 && isForwardRange!R2
>         && is(typeof(binaryFun!pred(haystack.front, needle.front)) : bool)
>         && !isRandomAccessRange!R1)
> {
>     static if (is(typeof(pred == "a == b")) && pred == "a == b" &&
>                isSomeString!R1 && isSomeString!R2
>                && haystack[0].sizeof == needle[0].sizeof)
>     {
>         //return cast(R1) find(representation(haystack), representation(needle));
>         // Specialization for simple string search
>         alias Select!(haystack[0].sizeof == 1, ubyte[],
>                       Select!(haystack[0].sizeof == 2, ushort[], uint[]))
>             Representation;
>         // Will use the array specialization
>         return cast(R1) .find!(pred, Representation, Representation)
>             (cast(Representation) haystack, cast(Representation) needle);
>     }
>     else
>     {
>         return simpleMindedFind!pred(haystack, needle);
>     }
> }
>
> As far as I can tell, this is a proxy for various substring search
> implementations.
>
> Problem 1: Why is find overloaded to do substring search? Why does it do
> substring and not subset or subsequence? Substring search is a completely
> different algorithm from linear search and even has a different asymptotic
> running time. This is needlessly overloaded; it adds nothing.

This overload is an optimization that exploits the fact that comparing two strings for equality is the same as comparing their representations.
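As a sketch of the idea (using std.string.representation; the example values are made up):

import std.string : representation;

void main()
{
    string s = "héllo";
    // The representation is the raw array of code units. Two strings
    // are equal exactly when these arrays are equal, so the search can
    // run on ubyte[]/ushort[]/uint[] without any decoding.
    immutable(ubyte)[] bytes = s.representation;
    assert(bytes.length == 6); // 'é' occupies two UTF-8 code units
    assert(s == cast(string) bytes);
}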

> Problem 2: The attempted specialisation is atrocious. It compares the
> predicate string with "a == b".

Yah, that's to detect the defaulted parameter (not to detect all comparisons for equality). I should have defined a separate overload that took one less parameter, but back when I wrote find() there were some compiler bugs that prevented me from doing so.

> I just did a quick check, and this means
> that these two calls DO NOT use the same algorithm:
>
> string a = ..., b = ...;
> auto fast = find!"a == b"(a, b);
> auto slow = find!"a==b"(a, b);
>
> I have to be honest, I have only just noticed this, but that's really
> sloppy.

Again, the point there is to detect the default (most frequently used), not all equality comparisons (which would be a much harder task). It's probably worth revisiting this implementation at some point and doing it the right way: an overload of find() for strings when a lambda is not specified.
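A minimal sketch of that approach (findDefault is a hypothetical name, not Phobos code):

import std.algorithm : find;
import std.string : representation;

// With no predicate supplied, string search can dispatch straight to
// the fast representation-based path; the predicated overload stays
// fully generic.
string findDefault(string haystack, string needle)
{
    auto rest = find(haystack.representation, needle.representation);
    // Re-slice the original string; the match begins where the
    // representation search stopped.
    return haystack[$ - rest.length .. $];
}

void main()
{
    assert(findDefault("hello world", "wor") == "world");
}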

> It's also a direct symptom of over-complicated code. As a user, I would
> 100% expect these calls to be the same.

Why? Would you expect that these calls are also the same?

find!((a, b) => a == b)(a, b);
find!((a, b) => b == a)(a, b);
find!((a, b) => !(b != a))(a, b);
auto lambda = (a, b) => a == b;
find!lambda(a, b);

I'm sure you can figure out how difficult the task of detecting all semantically equivalent lambdas is. The point, again, is to optimize the frequent case where people just call find(a, b). That's it.

> If I accidentally used the
> second version, I would have no warning: my code would just run slower
> and I'd be none the wiser. Only upon careful inspection of the source
> could you discover what this "simple" code does, and you would be
> horrified like I am now.

I don't see any reason that justifies one being horrified.

> The next two overloads of find appear to implement a couple of
> reasonably fast substring searches. Again, it would take me a minute or
> so to figure out exactly what algorithm is being used.

This is actually not bad at all. Standard library implementations are very highly leveraged, and writing them in a "simple" manner that a beginner can understand immediately is optimizing along the wrong dimensions. No language that's actually used does that, including those for which performance has traditionally not been a primary concern, such as Lisp or Haskell.

> After that there's a multi-range find. Seems simple enough, but it's yet
> another overload to consider and I'm not sure it's commonly used enough
> to even warrant existence. I'd hate to think what would happen if I
> wanted the single-range search but accidentally added an extra parameter.

Implementing multi-range find properly is difficult for an end user, and it's a natural thing to add to the standard library.

> In summary: the specialisation detection is shocking, which leads me to
> question what other static ifs are accidentally firing or failing. If
> the code was more simple, and less overloaded it would be easy to reason
> about all this, but it's not. The various find implementations span a
> few hundred lines, and all have numerous compile time branches and
> checks. The cyclomatic complexity must be huge.

I don't think you are putting forth good arguments to simplify things. Essentially you're saying you want to read the source of the standard library but don't care to have it run fast.

> How I'd do things:
>
> - The first version of find would be the only one. No overloads, no
> specialisation.

So out the window go fast string searching and many other simple optimizations, such as looking for the last element of a bidirectional sequence in a random-access range first.
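A minimal sketch of that last optimization (findLastFirst is a hypothetical function, not the Phobos implementation):

// Search a random-access haystack, probing the needle's *last* element
// at each candidate position first; a mismatch there rejects the
// position without comparing the rest.
inout(T)[] findLastFirst(T)(inout(T)[] haystack, const(T)[] needle)
{
    if (needle.length == 0 || haystack.length < needle.length)
        return haystack[$ .. $];
    immutable last = needle.length - 1;
    foreach (pos; 0 .. haystack.length - needle.length + 1)
    {
        if (haystack[pos + last] != needle[last])
            continue; // cheap rejection on the final element
        if (haystack[pos .. pos + last] == needle[0 .. last])
            return haystack[pos .. $];
    }
    return haystack[$ .. $];
}

void main()
{
    assert(findLastFirst("mississippi", "ssip") == "ssippi");
}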

> - substring would be a separate, non-predicated function. It would be
> specialised for strings. I'm too tired now to think about the variety of
> range specialisations, but I'm sure there's a more elegant way to handle
> to combinations.

So now when we have strings we can use the slow find and the fast substring for essentially the same operation? Why? "Well we thought it makes the standard library implementation one bit simpler." Ehm.

> - Drop the variadic find altogether.

I guess we could do that, but again, it's difficult for the user to implement correctly. Then there's the backwards compatibility to worry about.

> - Get rid of BoyerMooreFinder + find overload, replace with a boyerMoore
> function.

That's an experiment that I haven't decided whether to declare successful or not. I figured the fundamental optimizations of find(a, b) are effected as follows:

1. Add structure to a (e.g. a is sorted, has a trie associated with it, etc.);

2. Add structure to b (e.g. Knuth-Morris-Pratt, Boyer-Moore, Rabin-Karp);

3. Add structure to both (e.g. search a sorted range within another).

The decision on which algorithm to use is often dictated by this concern: where am I willing to focus the preprocessing effort? So I thought I'd clarify that in the way find(a, b) is invoked: depending on the structure of a and b, different search decisions are taken.
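In the current API, adding structure to b is what the BoyerMooreFinder overload mentioned above expresses (a sketch with made-up values):

import std.algorithm : boyerMooreFinder, find;

void main()
{
    // Preprocess the needle once (structure added to b), then reuse it.
    auto finder = boyerMooreFinder("ssip");
    assert(find("mississippi", finder) == "ssippi");
}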

I'd been planning to add several other find implementations along these lines, but haven't gotten around to it yet.


Andrei
June 09, 2012
On Friday, June 08, 2012 01:50:36 Peter Alexander wrote:
> - Drop the variadic find altogether.

It's _great_ that functions like find are variadic. It makes them more powerful, and they're still easy to use.
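For instance, here's a sketch of the variadic overload in use (per the Phobos documentation of the time, it returns a tuple of the remaining haystack and the 1-based index of the needle that matched, 0 if none matched):

import std.algorithm : find;

void main()
{
    // Search for whichever needle occurs earliest in the haystack.
    auto r = find("hello world", "world", "hello");
    assert(r[0] == "hello world"); // the match begins at "hello"...
    assert(r[1] == 2);             // ...which was the second needle
}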

Ease of use, power, and performance are _far_ more important than code readability when talking about the standard library. These functions are being used by _everyone_ and are maintained by a relatively small number of people. If you were arguing that they were hard to use, then that would be one thing, but arguing that they're hard to read is arguing about something which matters comparatively little.

And for as powerful as the functions in std.algorithm are, they don't tend to be all that complicated. find is certainly one of the most complicated (probably _the_ most complicated actually), but it's not complicated to use, and you get a lot of power out of it. And considering how insanely common it is to use find, having it be powerful is critical.

- Jonathan M Davis
June 11, 2012
On Thursday, 7 June 2012 at 22:46:33 UTC, Dmitry Olshansky wrote:
> On 08.06.2012 0:46, Dmitry Olshansky wrote:
>> On 07.06.2012 20:04, bearophile wrote:
>>> The slide packs of the conference C++Now! 2012 are available:
>>> https://github.com/boostcon/cppnow_presentations_2012
>>>
>>
>> Thanks, nice stuff for my brain to chew on.
>>
> - It's fun to see an M$ guy give a talk about regex. Last time I checked, their STL regex was a bleak shadow of the Boost one in terms of performance (why not just copy the good stuff if you can't get it right?)

Big companies have a lot of NIH syndrome.

Plus many, especially Microsoft, don't like to use licenses other than their own.

--
Paulo