March 04, 2009
Andrei Alexandrescu:
> The approach I took with the new phobos is:
> 
> int[] haystack;
> int[] needle;
> ...
> auto pos1 = find(haystack, needle); // linear
> sort(haystack);
> auto pos2 = find(assumeSorted(haystack), needle);

In my dlibs I do in a simpler and shorter way:

auto pos1 = haystack.index(needle);
haystack.sort();
auto pos2 = haystack.bisect(needle);

Here there's no need to give the same name to two very different functions.

If you really like to use a single function mame, with named arguments you may also do (bisect is false by default):

auto pos1 = haystack.index(needle);
haystack.sort();
auto pos2 = haystack.index(needle, bisect=true);

Bye,
bearophile
March 04, 2009
Max Samukha wrote:
> 
> If you intruduce a dummy type, why not make it perform validation in a
> debug build when sumthing like debug=slowButSafe is set?

Because in the case of binarySearch slowButSafe quickly becomes slowAndUseless. It's happened to me - I had an assert(isSorted) in binary search (I guess it's in one of the older phobos releases!) and when I was using the debug version, my program would take forever to run.

A debug build should at most change the constant multiplying the complexity, not the complexity.


Andrei
March 04, 2009
Sean Kelly wrote:
> I'm not terribly fond of the added verbosity however, or that this seems like I couldn't use the property form:
> 
>     assumeSorted("abcd").find('c')
> 
> Truth be told, my initial inclination would be to repackage the binary search as a one-liner with a different name, which kind of sabotages the whole idea.  But I'll try to resist this urge.

I understand. This is rather new, but I found it irresistibly cool to unify find() routines under one name and specify structure in the arguments' types. Usage is very easy, there's little to remember, and every piece of structure is where it should be. Consider:

int[] a = [ 1, 2, 3, 4 ];
int[] b = [ 2. 3 ];

These algorithms each performs search a different way because each is informed in different ways about the structure of their arguments:

find(a, b);
find(assumeSorted(a), b);
find(a, assumeSorted(b));
find(assumeSorted(a), assumeSorted(b));
find(a, boyerMooreFinder(b));

There's three names to remember that compose modularly. The run-of-the-mill approach is:

find(a, b);
binaryFind(a, b);
findRhsSorted(a, b);
binaryFindRhsSorted(a, b);
boyerMooreFind(a, b);

To add insult to injury, boyerMooreFind is not enough because it hides the structure created around b. So there's also need for one extra type e.g. BoyerMooreFinder!(int[]) for cases when there are multiple searches of the same thing. It's just onerous.


Andrei

P.S. By the way, this is the running example used in Chapter 4 of TDPL.
March 04, 2009
bearophile wrote:
> Andrei Alexandrescu:
>> The approach I took with the new phobos is:
>> 
>> int[] haystack; int[] needle; ... auto pos1 = find(haystack,
>> needle); // linear sort(haystack); auto pos2 =
>> find(assumeSorted(haystack), needle);
> 
> In my dlibs I do in a simpler and shorter way:
> 
> auto pos1 = haystack.index(needle); haystack.sort(); auto pos2 =
> haystack.bisect(needle);
> 
> Here there's no need to give the same name to two very different
> functions.

They do the exact same thing. Unifying them under the same name is good
abstraction.

> If you really like to use a single function mame, with named
> arguments you may also do (bisect is false by default):
> 
> auto pos1 = haystack.index(needle); haystack.sort(); 
> auto pos2 = haystack.index(needle, bisect=true);

I prefer encoding structural information in types.


Andrei
March 04, 2009
On Wed, 04 Mar 2009 10:27:55 -0800, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

>Max Samukha wrote:
>> 
>> If you intruduce a dummy type, why not make it perform validation in a debug build when sumthing like debug=slowButSafe is set?
>
>Because in the case of binarySearch slowButSafe quickly becomes slowAndUseless. It's happened to me - I had an assert(isSorted) in binary search (I guess it's in one of the older phobos releases!) and when I was using the debug version, my program would take forever to run.
>
>A debug build should at most change the constant multiplying the complexity, not the complexity.
>
>
>Andrei

I intentionaly proposed a special debug mode, not regular asserts, which are on in any debug build. I would like, knowing that I can wait a couple of days to make sure my program is correct, to be able to turn on validation.
March 04, 2009
Andrei Alexandrescu:
> int[] a = [ 1, 2, 3, 4 ];
> int[] b = [ 2. 3 ];

I guess you meant:

int[] b = [ 2, 3 ];

> find(a, b);
> find(assumeSorted(a), b);
> find(a, assumeSorted(b));
> find(assumeSorted(a), assumeSorted(b));
> find(a, boyerMooreFinder(b));

Are you talking about finding the position of a subarray into a bigger array? Then the two useful cases are:
a.index(b);
a.indexBoyerMoore(b);
The other cases aren't common enough, I think.

Bye,
bearophile
March 04, 2009
Max Samukha wrote:
> On Wed, 04 Mar 2009 10:27:55 -0800, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
> 
>> Max Samukha wrote:
>>> If you intruduce a dummy type, why not make it perform validation in a
>>> debug build when sumthing like debug=slowButSafe is set?
>> Because in the case of binarySearch slowButSafe quickly becomes slowAndUseless. It's happened to me - I had an assert(isSorted) in binary search (I guess it's in one of the older phobos releases!) and when I was using the debug version, my program would take forever to run.
>>
>> A debug build should at most change the constant multiplying the complexity, not the complexity.
>>
>>
>> Andrei
> 
> I intentionaly proposed a special debug mode, not regular asserts,
> which are on in any debug build. I would like, knowing that I can wait
> a couple of days to make sure my program is correct, to be able to
> turn on validation.

I am waiting a couple of days in release mode with all optimizations turned on and wind from behind.

Andrei
March 04, 2009
bearophile wrote:
> Andrei Alexandrescu:
>> int[] a = [ 1, 2, 3, 4 ];
>> int[] b = [ 2. 3 ];
> 
> I guess you meant:
> 
> int[] b = [ 2, 3 ];
> 
>> find(a, b);
>> find(assumeSorted(a), b);
>> find(a, assumeSorted(b));
>> find(assumeSorted(a), assumeSorted(b));
>> find(a, boyerMooreFinder(b));
> 
> Are you talking about finding the position of a subarray into a bigger array? Then the two useful cases are:
> a.index(b);
> a.indexBoyerMoore(b);
> The other cases aren't common enough, I think.
> 
> Bye,
> bearophile

Binary search is rather common.

As an aside, your use of "index" suggests you return integrals out of the function. IMHO that's strongly unrecommended.


Andrei
March 04, 2009
Andrei Alexandrescu:
> Binary search is rather common.

Oh, yes, sorry, I meant among the ones you have listed there...


> As an aside, your use of "index" suggests you return integrals out of the function. IMHO that's strongly unrecommended.

I don't want to use too much of your time (that it may be better spent with your new child), but I don't understand what you mean.
That index() function is meant the index position of the item or sub-sequence into the bigger array (or iterable), and it returns -1 if not found. This is an usual design.
Some people think that such controls for -1 value aren't always done, so to avoid that and some bugs, it's better to raise something like IndexException when the needle isn't found.

Bye,
bearophile
March 04, 2009
bearophile wrote:
> Andrei Alexandrescu:
>> Binary search is rather common.
> 
> Oh, yes, sorry, I meant among the ones you have listed there...

Of five, three are frequent (linear, binary, Boyer-Moore), one is a form
of set intersection (find sorted in sorted), and the odd one is:

find(a, assumeSorted(b));

This is rare but has excellent best-case complexity (is it O(a.length /
b.length)?) and is easy to add for completeness.

>> As an aside, your use of "index" suggests you return integrals out
>> of the function. IMHO that's strongly unrecommended.
> 
> I don't want to use too much of your time (that it may be better
> spent with your new child), but I don't understand what you mean. That index() function is meant the index position of the item or
> sub-sequence into the bigger array (or iterable), and it returns -1
> if not found. This is an usual design.

This is an extremely sloppy design. That it is usual doesn't make things
any better!

> Some people think that such controls for -1 value aren't always done,
> so to avoid that and some bugs, it's better to raise something like
> IndexException when the needle isn't found.

Yah, this is the subject of a long rant but in short: returning int from find means that essentially that find is unusable with anything except random access structures. This in turn means you'd have to have different means, APIs, and user code to deal with e.g. lists, in spite of the fact that linear search is the same boring thing for all: look at the current thing, yes/no, move on to the next thing. IMHO ever since the STL has seen the light of day there is no excuse, not even sheer ignorance, to ever traffic in integers as a mean to access elements in containers, in any language that has even the most modest parameterized types capability.

Returning int from find is an insult. To add injury to it, have find also return an int for a list.

"Is this item in this list?"
"Yeppers. It's the 538th element. Took me a hike to find it."
"Well I wanted to do something with it."
"Then go get it. I'm telling you, it'll take exactly 538 steps."


Andrei