June 08, 2012
On Friday, June 08, 2012 00:30:06 Dmitry Olshansky wrote:
> On 07.06.2012 22:34, Peter Alexander wrote:
> > Don't get me started on std.algorithm.find...
> 
> Why not? It's a perfect example of the goodness of generic programming:
> flexibility (for free) + specialization (when you need it), for the win.
> If your language lacks such a construct, you end up inventing ways around it:
> - make findInt, findString, quickFindString, findPerson etc. * number of
> containers
> - shoehorn everything into a simpler design: qsort, boxing-unboxing etc.

std.algorithm does a lot of really cool stuff in a very generic way, and on the whole, it works very well. It's true that its functions tend to be more complicated than most, but they're generally not complicated to use, and if you're going to have that kind of complication, it's best to have it in the standard library written by experts in the language rather than trying to write it yourself. And D manages to make such functions _very_ sane in comparison to what C++ would need to do to get the same result. If anything, the problem here is simply that not all of these functions are tested as well as they should be (as the bug with min shows).

Personally, my favorite function in std.algorithm is startsWith (or endsWith). It just about blew my mind when I realized what it was doing. It is fantastically cool how it uses recursive templates to get its job done, and it makes for an incredibly powerful and easy to use function.
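For instance, here's a sketch of the variadic form in action (behavior per the Phobos documentation of the time: with one needle it returns bool; with several, it returns 0 for no match, otherwise the 1-based index of the needle that matched):

import std.algorithm : startsWith;

void main()
{
    // Single needle: returns bool.
    assert(startsWith("phobos", "pho"));

    // Multiple needles: returns the 1-based index of the needle
    // that matched, or 0 if none did.
    assert(startsWith("phobos", "dei", "pho") == 2);
    assert(startsWith("phobos", "dei", "mos") == 0);
}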

- Jonathan M Davis
June 08, 2012
On 6/7/12 6:50 PM, Peter Alexander wrote:
> On Thursday, 7 June 2012 at 22:29:09 UTC, Andrei Alexandrescu wrote:
>> Great points, example could be a lot better. Maybe it's time you do
>> get started on find(). Destroy or be destroyed.
>
> Ok...
>
> This overload:
>
> R find(alias pred = "a == b", R, E)(R haystack, E needle)
>     if (isInputRange!R &&
>         is(typeof(binaryFun!pred(haystack.front, needle)) : bool))
> {
>     for (; !haystack.empty; haystack.popFront())
>     {
>         if (binaryFun!pred(haystack.front, needle)) break;
>     }
>     return haystack;
> }
>
>
> Is fine. In fact it's practically perfect. It's the obvious solution to
> a simple problem. The biggest problem is the "is typeof" syntax, but
> that's not your fault.

So far so good.

> Second overload:
>
> R1 find(alias pred = "a == b", R1, R2)(R1 haystack, R2 needle)
>     if (isForwardRange!R1 && isForwardRange!R2
>         && is(typeof(binaryFun!pred(haystack.front, needle.front)) : bool)
>         && !isRandomAccessRange!R1)
> {
>     static if (is(typeof(pred == "a == b")) && pred == "a == b" &&
>                isSomeString!R1 && isSomeString!R2
>                && haystack[0].sizeof == needle[0].sizeof)
>     {
>         //return cast(R1) find(representation(haystack), representation(needle));
>         // Specialization for simple string search
>         alias Select!(haystack[0].sizeof == 1, ubyte[],
>                       Select!(haystack[0].sizeof == 2, ushort[], uint[]))
>             Representation;
>         // Will use the array specialization
>         return cast(R1) .find!(pred, Representation, Representation)
>             (cast(Representation) haystack, cast(Representation) needle);
>     }
>     else
>     {
>         return simpleMindedFind!pred(haystack, needle);
>     }
> }
>
> As far as I can tell, this is a proxy for various substring search
> implementations.
>
> Problem 1: Why is find overloaded to do substring search? Why does it do
> substring and not subset or subsequence? Substring search is a completely
> different algorithm from linear search and even has a different asymptotic
> running time. This is needlessly overloaded; it adds nothing.

This overload is an optimization that exploits the fact that comparing two strings for equality is the same as comparing their representations.
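As a sketch of the idea (using std.string.representation; the example values are made up):

import std.string : representation;

void main()
{
    string s = "héllo";
    // The representation is the raw array of code units. Two strings
    // are equal exactly when these arrays are equal, so the search can
    // run on ubyte[]/ushort[]/uint[] without any decoding.
    immutable(ubyte)[] bytes = s.representation;
    assert(bytes.length == 6); // 'é' occupies two UTF-8 code units
    assert(s == cast(string) bytes);
}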

> Problem 2: The attempted specialisation is atrocious. It compares the
> predicate string with "a == b".

Yah, that's to detect the defaulted parameter (not to detect all comparisons for equality). I should have defined a separate overload that took one less parameter, but back when I wrote find() there were some compiler bugs that prevented me from doing so.

> I just did a quick check, and this means
> that these two calls DO NOT use the same algorithm:
>
> string a = ..., b = ...;
> auto fast = find!"a == b"(a, b);
> auto slow = find!"a==b"(a, b);
>
> I have to be honest, I have only just noticed this, but that's really
> sloppy.

Again, the point there is to detect the default (most frequently used), not all equality comparisons (which would be a much harder task). It's probably worth revisiting this implementation at some point and doing it the right way: an overload of find() for strings when a lambda is not specified.
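A minimal sketch of that approach (findDefault is a hypothetical name, not Phobos code):

import std.algorithm : find;
import std.string : representation;

// With no predicate supplied, string search can dispatch straight to
// the fast representation-based path; the predicated overload stays
// fully generic.
string findDefault(string haystack, string needle)
{
    auto rest = find(haystack.representation, needle.representation);
    // Re-slice the original string; the match begins where the
    // representation search stopped.
    return haystack[$ - rest.length .. $];
}

void main()
{
    assert(findDefault("hello world", "wor") == "world");
}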

> It's also a direct symptom of over-complicated code. As a user, I would
> 100% expect these calls to be the same.

Why? Would you expect that these calls are also the same?

find!((a, b) => a == b)(a, b);
find!((a, b) => b == a)(a, b);
find!((a, b) => !(b != a))(a, b);
auto lambda = (a, b) => a == b;
find!lambda(a, b);

I'm sure you can figure out how difficult the task of detecting all semantically equivalent lambdas is. The point, again, is to optimize the frequent case where people just call find(a, b). That's it.

> If I accidentally used the
> second version, I would have no warning: my code would just run slower
> and I'd be none the wiser. Only upon careful inspection of the source
> could you discover what this "simple" code does, and you would be
> horrified like I am now.

I don't see any reason that justifies one being horrified.

> The next two overloads of find appear to implement a couple of
> reasonably fast substring searches. Again, it would take me a minute or
> so to figure out exactly what algorithm is being used.

This is actually not bad at all. Standard library implementations are very highly leveraged, and writing them in a "simple" manner that a beginner can understand immediately is optimizing along the wrong dimensions. No language that's actually used does that, including those for which performance has traditionally not been a primary concern, such as Lisp or Haskell.

> After that there's a multi-range find. Seems simple enough, but it's yet
> another overload to consider and I'm not sure it's commonly used enough
> to even warrant existence. I'd hate to think what would happen if I
> wanted the single-range search but accidentally added an extra parameter.

Implementing multi-range find properly is difficult for an end user, and it's a natural thing to add to the standard library.

> In summary: the specialisation detection is shocking, which leads me to
> question what other static ifs are accidentally firing or failing. If
> the code was more simple, and less overloaded it would be easy to reason
> about all this, but it's not. The various find implementations span a
> few hundred lines, and all have numerous compile time branches and
> checks. The cyclomatic complexity must be huge.

I don't think you are putting forth good arguments to simplify things. Essentially you're saying you want to read the source of the standard library but don't care to have it run fast.

> How I'd do things:
>
> - The first version of find would be the only one. No overloads, no
> specialisation.

So out the window go fast string searching and many other simple optimizations, such as looking for the last element of a bidirectional sequence in a random-access range first.
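A minimal sketch of that last optimization (findLastFirst is a hypothetical function, not the Phobos implementation):

// Search a random-access haystack, probing the needle's *last* element
// at each candidate position first; a mismatch there rejects the
// position without comparing the rest.
inout(T)[] findLastFirst(T)(inout(T)[] haystack, const(T)[] needle)
{
    if (needle.length == 0 || haystack.length < needle.length)
        return haystack[$ .. $];
    immutable last = needle.length - 1;
    foreach (pos; 0 .. haystack.length - needle.length + 1)
    {
        if (haystack[pos + last] != needle[last])
            continue; // cheap rejection on the final element
        if (haystack[pos .. pos + last] == needle[0 .. last])
            return haystack[pos .. $];
    }
    return haystack[$ .. $];
}

void main()
{
    assert(findLastFirst("mississippi", "ssip") == "ssippi");
}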

> - substring would be a separate, non-predicated function. It would be
> specialised for strings. I'm too tired now to think about the variety of
> range specialisations, but I'm sure there's a more elegant way to handle
> to combinations.

So now when we have strings we can use the slow find and the fast substring for essentially the same operation? Why? "Well we thought it makes the standard library implementation one bit simpler." Ehm.

> - Drop the variadic find altogether.

I guess we could do that, but again, it's difficult for the user to implement correctly. Then there's the backwards compatibility to worry about.

> - Get rid of BoyerMooreFinder + find overload, replace with a boyerMoore
> function.

That's an experiment that I haven't decided whether to declare successful or not. I figured the fundamental optimizations of find(a, b) are effected as follows:

1. Add structure to a (e.g. a is sorted, has a trie associated with it, etc.);

2. Add structure to b (e.g. Knuth-Morris-Pratt, Boyer-Moore, Rabin-Karp);

3. Add structure to both (e.g. search a sorted range within another).

The decision on which algorithm to use is often dictated by this concern: where am I willing to focus the preprocessing effort? So I thought I'd clarify that in the way find(a, b) is invoked: depending on the structure of a and b, different search decisions are taken.
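In the current API, adding structure to b is what the BoyerMooreFinder overload mentioned above expresses (a sketch with made-up values):

import std.algorithm : boyerMooreFinder, find;

void main()
{
    // Preprocess the needle once (structure added to b), then reuse it.
    auto finder = boyerMooreFinder("ssip");
    assert(find("mississippi", finder) == "ssippi");
}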

I'd been planning to add several other find implementations along these lines, but haven't gotten around to it yet.


Andrei
June 09, 2012
On Friday, June 08, 2012 01:50:36 Peter Alexander wrote:
> - Drop the variadic find altogether.

It's _great_ that functions like find are variadic. It makes them more powerful, and they're still easy to use.
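For instance, here's a sketch of the variadic overload in use (per the Phobos documentation of the time, it returns a tuple of the remaining haystack and the 1-based index of the needle that matched, 0 if none matched):

import std.algorithm : find;

void main()
{
    // Search for whichever needle occurs earliest in the haystack.
    auto r = find("hello world", "world", "hello");
    assert(r[0] == "hello world"); // the match begins at "hello"...
    assert(r[1] == 2);             // ...which was the second needle
}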

Ease of use, power, and performance are _far_ more important than code readability when talking about the standard library. These functions are being used by _everyone_ and are maintained by a relatively small number of people. If you were arguing that they were hard to use, then that would be one thing, but arguing that they're hard to read is arguing about something which matters comparatively little.

And for as powerful as the functions in std.algorithm are, they don't tend to be all that complicated. find is certainly one of the most complicated (probably _the_ most complicated actually), but it's not complicated to use, and you get a lot of power out of it. And considering how insanely common it is to use find, having it be powerful is critical.

- Jonathan M Davis
June 11, 2012
On Thursday, 7 June 2012 at 22:46:33 UTC, Dmitry Olshansky wrote:
> On 08.06.2012 0:46, Dmitry Olshansky wrote:
>> On 07.06.2012 20:04, bearophile wrote:
>>> The slide packs of the conference C++Now! 2012 are available:
>>> https://github.com/boostcon/cppnow_presentations_2012
>>>
>>
>> Thanks, nice stuff for my brain to chew on.
>>
> - It's fun to see an M$ guy give a talk about regex. Last time I checked, their STL regex was a bleak shadow of the Boost one in terms of performance (why not just copy the good stuff if you can't get it right?)

Big companies have a lot of NIH syndrome.

Plus many, especially Microsoft, don't like to use licenses other than their own.

--
Paulo