August 28, 2008
This is another small associative array benchmark, with strings. Please point out any problems or bugs in it.

You can generate the data set with this Python script:


from array import array
from random import randrange, seed

def generate(filename, N):
    fout = file(filename, "w")
    seed(1)  # fixed seed, so the data set is reproducible
    a = array("B", " ") * 10  # reusable 10-byte buffer
    for i in xrange(N):
        n = randrange(3, 11)  # word length in [3, 10]
        for j in xrange(n):
            a[j] = randrange(97, 123)  # random lowercase ASCII letter
        print >>fout, a.tostring()[:n]
    fout.close()

import psyco; psyco.full()
generate("words.txt", 1600000)


It generates a text file of about 13.3 MB, each line containing a random word. Such a dataset isn't exactly like a real one: real words aren't random, they contain a lot of redundancy that may slightly worsen the performance of a hash function. So this is probably a somewhat favorable situation for an associative array.
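
As a rough illustration of why redundancy matters, here is a sketch of a typical multiply-and-add string hash (an illustration only; the actual hash in the D runtime may differ). Random letters feed it well-mixed input, while natural words keep repeating the same prefixes and suffixes:

// Sketch of a common string hash shape (not the exact runtime hash).
uint stringHash(string s) {
    uint hash = 0;
    foreach (c; s)
        hash = hash * 31 + c;  // classic multiply-and-add step
    return hash;
}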


The D code:

import std.stream, std.stdio, std.gc;

void main() {
    //disable();  // uncomment to run with the GC disabled
    int[string] d;
    foreach (string line; new BufferedFile("words.txt"))
        d[line.dup] = 0;  // dup: "line" is a slice of a reused buffer
}

Note that this program tests the I/O performance too. If you want to avoid that, you can read all the file lines up-front and time just the AA creation (see below).

This is a first Python+Psyco version. It's not exactly equivalent, because the Python developers have chosen to keep the newline at the end of each line, so here the words are stored with their trailing newlines:

def main():
    d = {}
    for line in file("words.txt"):
        d[line] = 0
import psyco; psyco.full()
main()


This second, slower Python version strips the trailing newline from each line; rstrip() is similar to std.string.stripr():

def main():
    d = {}
    for line in file("words.txt"):
        d[line.rstrip()] = 0

import psyco; psyco.full()
main()


A few timings:

N = 800_000:
	Psyco:          0.69 s
	Psyco stripped: 0.77 s
	D:              1.26 s
	D no GC:        0.96 s

N = 1_600_000:
	Psyco:          1.19 s
	Psyco stripped: 1.35 s
	D:              2.80 s
	D no GC:        2.08 s

Note that disabling the GC in those two Python programs has no effect on their running time.

-------------------------------------

To be sure not to compare apples with oranges, I have written two "empty" benchmarks that measure just the time needed to read the lines:

D code:

import std.stream;

void main() {
    foreach (string line; new BufferedFile("words.txt"))
        line.dup;  // copy each line, to mirror the AA benchmark
}


Python code:

def main():
    for line in file("words.txt"):
        pass
import psyco; psyco.full()
main()


The D version contains a dup to make the comparison more accurate: in the Python version each "line" is an actual string copy, while in D it is just a slice of the read buffer, so the dup forces an equivalent copy.

Line reading timings, N = 1_600_000:
  D:     0.58 s
  Psyco: 0.30 s

So the I/O of Phobos is slower and may enjoy some tuning; it also means my timings of the hash benchmarks above are partly measuring I/O.

Subtracting 0.58 - 0.30 = 0.28 seconds from the timings of the D associative array benchmarks gives:

N = 1_600_000:
	Psyco:                  1.19 s
	Psyco stripped:         1.35 s
	D:                      2.80 s
	D no GC:                2.08 s
	D, I/O corrected:       2.80 - 0.28 = 2.52 s
	D no GC, I/O corrected: 2.08 - 0.28 = 1.80 s

------------------------------------

To create a hopefully more meaningful benchmark, I have then written code that loads all the lines first and times only the creation of the hash.

The Python+Psyco code:

from timeit import default_timer as clock

def main():
    words = []
    for line in open("words.txt"):
        words.append(line.rstrip())
    t = clock()  # time only the dict creation
    d = {}
    for line in words:
        d[line] = 0
    print round(clock() - t, 2), "s"

import psyco; psyco.bind(main)
main()


The D code:

import std.stream, std.stdio, std.c.time, std.gc;

void main() {
    string[] words;
    foreach (string line; new BufferedFile("words.txt"))
        words ~= line.dup;
    //disable();  // uncomment to run with the GC disabled
    auto t = clock();  // time only the AA creation
    int[string] d;
    foreach (word; words)
        d[word] = 0;
    writefln((cast(double)(clock() - t)) / CLOCKS_PER_SEC, " s");
}


Timings, N = 1_600_000:
	Psyco:          0.61 s  (total running time: 1.36 s)
	D:              1.42 s  (total running time: 2.46 s)
	D no GC:        1.22 s  (total running time: 2.28 s)

Bye,
bearophile
August 28, 2008
On 2008-08-28 13:06:59 +0200, Michel Fortin <michel.fortin@michelf.com> said:

> On 2008-08-28 05:47:22 -0400, Fawzi Mohamed <fmohamed@mac.com> said:
> 
>> An iterator should be like a generator, have a method next, and one at_end or something similar packaged (and maybe prev() and at_start() if it can also go back) in a single struct, furthermore it should work seamlessly with a kind of for_each(x;iterator) construct.
> 
> I perfectly agree with this. This is why I much prefer Objective-C's NSEnumerator over C++ iterators: they're much simpler to use.
> 
> Some time ago, Walter asked if he could change D arrays to have internally an end pointer instead of a length value. He said it was to allow arrays, or slices, to be used as iterator efficiently. I hope he hasn't given up on that idea since slices are much easier to understand and use than C++-style iterators.

I agree.

> >> Instead C++ chose to have begin & end iterators, simply because with that construct it is trivial for the compiler to optimize it for arrays, and you can use pointers as iterators without a cast/constructor.
> 
> I would add that if you are using many iterators for the same array (vector) in a given algorithm or in function parameters, having only one "end" pointer saves some space.

OK, granted; but I would say that one almost never takes advantage of this, and even then the real advantage is probably small.

August 28, 2008
Nick Sabalausky Wrote:

> "Steven Schveighoffer" <schveiguy@yahoo.com> wrote in message news:g95b89$1jii$1@digitalmars.com...
> >
> > When someone sees an index operator the first thought is that it is a quick lookup.
> 
> This seems to be a big part of the disagreement. Personally, I think it's insane to look at [] and just assume it's a cheap lookup. That was true in pure C where, as someone else said, it was nothing more than a shorthand for a specific pointer arithmetic operation, but carrying that idea over to more modern languages (especially one that also uses [] for associative arrays) is a big mistake.
> 
> The '+' operator means "add". Addition is typically O(1). But vectors can be added, and that's an O(n) operation. Should opAdd never be used for vectors?

This is a big, big mistake! I think I do not know how to explain it well. I could not so far, because we keep going from one mistake to another.

If you add vectors, a[] = b[] + c[], then that is one operation repeated for each element of the vector. The complexity is linear in the cost of one element operation. It would be a mistake if each per-element + took O(a.length). Otherwise it is, as expected, simply proportional to the length of the input.
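
A minimal sketch of the point (my code, not from the thread): the vector sum is one constant-time element operation repeated n times, so the whole expression costs O(n) in the length of its own input, and no less is possible because every element must be touched once.

// What a[] = b[] + c[] does, conceptually: n additions, each O(1),
// so O(n) overall -- proportional to the input, as expected.
void vectorAdd(int[] a, int[] b, int[] c) {
    assert(a.length == b.length && b.length == c.length);
    foreach (i, ref x; a)
        x = b[i] + c[i];  // one constant-time operation per element
}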

> > You can force yourself to think differently, but the reality is that most people think that because of the universal usage of square brackets (except for VB, and I feel pity for anyone who needs to use VB) to mean 'lookup by key', and usually this is only useful on objects where the lookup is quick ( < O(n) ).  Although there is no requirement, nor enforcement, the 'quick' contract is expected by the user, no matter how much docs you throw at them.
> >
> > Look, for instance, at Tango's now-deprecated LinkMap, which uses a linked-list of key/value pairs (copied from Doug Lea's implementation). Nobody in their right mind would use link map because lookups are O(n), and it's just as easy to use a TreeMap or HashMap.  Would you ever use it?
> >
> >> If you *really* need that sort of guarantee (and I can imagine it may be useful in some cases), then the implementation of the guarantee does *not* belong in the realm of "implements vs doesn't-implement a particular operator overload". Doing so is an abuse of operator overloading, since operator overloading is there for defining syntactic sugar, not for acting as a makeshift contract.
> >
> > I don't think anybody is asking for a guarantee from the compiler or any specific tool.  I think what we are saying is that violating the 'opIndex is fast' notion is bad design because you end up with users thinking they are doing something that's quick.  You end up with people posting benchmarks on your containers saying 'why does python beat the pants off your list implementation?'.  You can say 'hey, it's not meant to be used that way', but then why can the user use it that way?  A better design is to nudge the user into using a correct container for the job by only supporting operations that make sense on the collections.
> >
> > And as far as operator semantic meaning, D's operators are purposely named after what they are supposed to do.  Notice that the operator for + is opAdd, not opPlus.  This is because opAdd is supposed to mean you are performing an addition operation.  Assigning a different semantic meaning is not disallowed by the compiler, but is considered bad design.  opIndex is supposed to be an index function, not a linear search.  It's not called opSearch for a reason.  Sure you can redefine it however you want semantically, but it's considered bad design.  That's all we're saying.
> >
> 
> Nobody is suggesting using [] to invoke a search (Although we have talked about using [] *in the search function's implementation*). Search means you want to get the position of a given element, or in other words, "element" -> search -> "key/index". What we're talking about is the reverse: getting the element at a given position, ie, "key/index" -> [] -> "element". It doesn't matter if it's an array, a linked list, or a super-duper-collection-from-2092: that's still indexing, not searching.

I think this is also a big mistake. I am so sorry! I can not explain it better. But it looks like you think that if you call it something different, it behaves differently. Sorry, Dee Girl
August 28, 2008
Nick Sabalausky Wrote:

> "Dee Girl" <deegirl@noreply.com> wrote in message news:g94j7a$2875$1@digitalmars.com...
> > Nick Sabalausky Wrote:
> >
> >> "Dee Girl" <deegirl@noreply.com> wrote in message news:g943oi$11f4$1@digitalmars.com...
> >> > Benji Smith Wrote:
> >> >
> >> >> Dee Girl wrote:
> >> >> > Michiel Helvensteijn Wrote:
> >> >> >> That's simple. a[i] looks much nicer than a.nth(i).
> >> >> >
> >> >> > It is not nicer. It is more deceiving. If you look at the code, it looks like array code.
> >> >> >
> >> >> > foreach (i; 0 .. a.length)
> >> >> > {
> >> >> >     a[i] += 1;
> >> >> > }
> >> >> >
> >> >> > For an array it works nicely. But for a list it is terrible! Many operations just to increment a small list.
> >> >>
> >> >> Well, that's what you get with operator overloading.
> >> >
> >> > I am sorry. I disagree. I think that is what you get with bad design.
> >> >
> >> >> The same thing could be said for "+" or "-". They're inherently deceiving, because they look like builtin operations on primitive data types.
> >> >>
> >> >> For expensive operations (like performing division on an unlimited-precision decimal object), should the author of the code use "opDiv" or should he implement a separate "divide" function?
> >> >
> >> > The cost of + and - is proportional to the digits in the number. For a small number of digits the computer does it fast, in hardware. For many digits the cost grows. The number of digits is log n. I think + and - are fine for big integers. I am not surprised.
> >> >
> >> >> Forget opIndex for a moment, and ask the more general question about all overloaded operators. Should they imply any sort of asymptotic complexity guarantee?
> >> >
> >> > I think it depends on good design. For example, think of ++ or -- for an iterator. If it is O(n) it is bad design. Bad design makes people say, like you, "This is what you get with operator overloading".
> >> >
> >> >> Personally, I don't think so.
> >> >>
> >> >> I don't like "nth".
> >> >>
> >> >> I'd rather use the opIndex. And if I'm using a linked list, I'll be aware of the fact that it'll exhibit linear-time indexing, and I'll be cautious about which algorithms to use.
> >> >
> >> > But inside algorithm you do not know if you use a linked list or a
> >> > vector.
> >> > You lost that information in bad abstraction. Also abstraction is bad
> >> > because if you change data structure you have concept errors that still
> >> > compile. And run until tomorrow ^_^.
> >> >
> >>
> >> A generic algorithm has absolutely no business caring about the complexity of the collection it's operating on. If it does, then you've created a concrete algorithm, not a generic one.
> >
> > I appreciate your viewpoint. Please allow me to explain. That viewpoint is in opposition to STL. In STL each algorithm defines what kind of iterator it operates with, and it requires a certain iterator complexity.
> > 
> > I agree that other designs can be made. But STL has that design. In my opinion it is a big part of what makes STL so successful.
> > 
> > I disagree that an algorithm that knows the complexity of its iterator is concrete. I think exactly the contrary. Maybe it is good if you read the STL book by Josuttis. STL algorithms are the most generic I have ever found in any language. I hope std.algorithm in D will be better. But right now std.algorithm works only with arrays.
> >
> >> If an algorithm uses [] and doesn't know the complexity of the []...good! It shouldn't know, and it shouldn't care. It's the code that sends the collection to the algorithm that knows and cares.
> >
> > I think this is a mistake. The algorithm should know. Otherwise "linear find" is not "linear find"! It is "quadratic find". If you want to define something called linear find then you must know the iterator complexity.
> >
> 
> If a generic algorithm describes itself as "linear find" then I know damn well that it's referring to the behavior of *just* the function itself, and is not a statement that the function *combined* with the behavior of the collection and/or a custom comparison is always going to be O(n).

I think this is wrong. (Maybe I woke up moody! ^_^) A linear find that uses another linear find at each iteration is not a linear find.

> A question about STL: If I create a collection that, internally, is like a linked list, but starts each indexing operation from the position of the last indexing operation (so that a "find first" would run in O(n) instead of O(n*n)), is it possible to send that collection to STL's generic "linear find first"? I would argue that it should somehow be possible *even* if the STL's generic "linear find first" guarantees a *total* performance of O(n) (Since, in this case, it would still be O(n) anyway). Because otherwise, the STL wouldn't be very extendable, which would be a bad thing for a library of "generic" algorithms.
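
For illustration, here is a sketch of the collection described above (hypothetical code): the list caches the node and position of the last lookup, so a forward scan through opIndex costs O(n) in total instead of O(n*n).

// Hypothetical linked list whose opIndex resumes from the last
// visited node; sequential forward indexing is amortized O(1).
struct Node(T) {
    T value;
    Node!(T)* next;
}

struct CachedList(T) {
    Node!(T)* head;
    Node!(T)* lastNode;  // node found by the previous lookup
    size_t lastIndex;    // its position

    T opIndex(size_t i) {
        if (lastNode is null || i < lastIndex) {
            lastNode = head;  // going backwards: restart at the front
            lastIndex = 0;
        }
        for (; lastIndex < i; lastIndex++)
            lastNode = lastNode.next;  // walk forward from the cache
        return lastNode.value;
    }
}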

Of course you can design a bad collection and a bad iterator. Let me ask this.

interface IUnknown
{
    void AddRef();
    void Release();
    int QueryInterface(IID*, void**);
}

Now I come and ask you: if I implement the functions badly, to do wrong things, can I use my class with COM? Maybe, but I will have leaks and other bad things.

The compiler or STL can not enforce the meaning of words. It can only give you a framework to express meanings correctly. A framework can be better or worse. You hide the fact that the nth element costs O(n) as a detail. Then I can not write find or binary_search with your framework. Then I say STL is better than your framework.
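
To make that concrete, here is a sketch (my code, not from the thread) of why binary_search must know the indexing cost: its O(log n) bound silently assumes each c[mid] is O(1), and with an O(n) opIndex the very same code quietly becomes O(n log n).

// Binary search written against opIndex and length. The O(log n)
// promise holds only when opIndex is O(1); on a linked list with a
// linear opIndex this same code degrades to O(n log n).
bool binarySearch(Coll, T)(Coll c, T value) {
    size_t lo = 0;
    size_t hi = c.length;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;  // midpoint without overflow
        if (c[mid] < value)
            lo = mid + 1;
        else if (value < c[mid])
            hi = mid;
        else
            return true;  // found
    }
    return false;
}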

> Another STL question: It is possible to use STL to do a "linear find" using a custom comparison? If so, it is possible to make STL's "linear find" function use a comparison that just happens to be O(n)? If so, doesn't that violate the linear-time guarantee, too? If not, how does it know that the custom comparison is O(n) instead of O(1) or O(log n)?

An element of an array does not have easy access to the whole array. But if you really want, it can store the array as a member or use a global array. So you can make find O(n*n) or even worse.

But I think it is the same mistake. If you can do something bad, it does not mean the framework is bad. The test is whether you can do the good thing easily. STL allows you to do the good thing easily. Your framework makes doing the good thing impossible. Thank you, Dee Girl
August 28, 2008
On Thu, 28 Aug 2008 13:47:22 +0400, Fawzi Mohamed <fmohamed@mac.com> wrote:

> On 2008-08-28 00:24:50 +0200, Dee Girl <deegirl@noreply.com> said:
>
>> Derek Parnell Wrote:
>>
>>> On Wed, 27 Aug 2008 17:08:47 -0400, Nick Sabalausky wrote:
>>>
>>>> The way I see it, encapsulation is all about the black box idea. And the
>>>> only things you can see from outside the black box are the inputs and
>>>> outputs.
>>>  Well said.
>>  I am sorry, I will say my opinion. This sounds good but is simplistic. A black box is good in principle. But it assumes you found the right interface for the black box. If you define a bad interface you have a bad black box. Also please remember that an iterator is a black box too. But it defines the right interface.
>
> I agree with the meaning, but I disagree with the example.
> I think that iterators are an example of a bad interface; as others brought up the iterator as a good example, I thought that I should say something.
>
> An iterator should be like a generator, have a method next, and one at_end or something similar packaged (and maybe prev() and at_start() if it can also go back) in a single struct, furthermore it should work seamlessly with a kind of for_each(x;iterator) construct.
>
> Instead C++ chose to have begin & end iterators, simply because with that construct it is trivial for the compiler to optimize it for arrays, and you can use pointers as iterators without a cast/constructor.
>
> This means a worse interface for 99% of the uses. Apart from arrays and vectors, I think one is better off without an end iterator, and even when this is not the case, writing something like for_each(x; FromTo(a, b)), with FromTo constructing a generator, is (I think) better than for (i = t.begin(); i != t.end(); ++i), and the implementation of a generator itself is easier (no ==, !=, increment, decrement (pre/post), ... for performance reasons).
>
> As I believe that the optimizations needed to make the better interface as efficient as the iterator one are perfectly doable (some work, yes, but not extremely much), I see no good reason for the C++ design.
>
> Fawzi
>

Agreed. I usually write iterators that support "T getValue()", "void moveNext()" and "bool isValid()" operations.
Also, I don't like the Java-style "T getNext()" iterator idiom and think it should be split into two methods: getValue() and moveNext().
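
A minimal sketch of that iterator shape (hypothetical code, not from the post), over a singly linked list:

// getValue/moveNext/isValid iterator over a hypothetical singly
// linked list; note that no separate "end" iterator is needed.
struct Node(T) {
    T value;
    Node!(T)* next;
}

struct ListIterator(T) {
    Node!(T)* current;
    bool isValid() { return current !is null; }  // not past the end?
    void moveNext() { current = current.next; }  // advance one node
    T getValue() { return current.value; }       // current element
}

Iteration then reads as for (auto it = ...; it.isValid(); it.moveNext()), with no end iterator to carry around.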
August 28, 2008
On 2008-08-28 13:31:26 +0200, "David Wilson" <dw@botanicus.net> said:

> 2008/8/28 Fawzi Mohamed <fmohamed@mac.com>:
>> On 2008-08-28 00:24:50 +0200, Dee Girl <deegirl@noreply.com> said:
>> 
>>> Derek Parnell Wrote:
>>> 
>>>> On Wed, 27 Aug 2008 17:08:47 -0400, Nick Sabalausky wrote:
>>>> 
>>>>> The way I see it, encapsulation is all about the black box idea. And the
>>>>> only things you can see from outside the black box are the inputs and
>>>>> outputs.
>>>> 
>>>> Well said.
>>> 
>>> I am sorry, I will say my opinion. This sounds good but is simplistic. A black box is good in principle. But it assumes you found the right interface for the black box. If you define a bad interface you have a bad black box. Also please remember that an iterator is a black box too. But it defines the right interface.
>> 
>> I agree with the meaning, but I disagree with the example.
>> I think that iterators are an example of a bad interface; as others brought up the iterator as a good example, I thought that I should say something.
>> 
>> An iterator should be like a generator, have a method next, and one at_end
>> or something similar packaged (and maybe prev() and at_start() if it can
>> also go back) in a single struct, furthermore it should work seamlessly with
>> a kind of for_each(x;iterator) construct.
>> 
>> Instead C++ chose to have begin & end iterators, simply because with that construct it is trivial for the compiler to optimize it for arrays, and you can use pointers as iterators without a cast/constructor.
>> 
>> This means a worse interface for 99% of the uses. Apart from arrays and vectors, I think one is better off without an end iterator, and even when this is not the case, writing something like for_each(x; FromTo(a, b)), with FromTo constructing a generator, is (I think) better than for (i = t.begin(); i != t.end(); ++i), and the implementation of a generator itself is easier (no ==, !=, increment, decrement (pre/post), ... for performance reasons).
> 
> It is worse in simple "for each element" uses, but one of the cooler
> things about C++ iterators is exactly that they are exchangeable for
> any pointer-like object, and that it's trivial to iterate over a
> subset of a container. Emulating pointers means that in some ways,
> once you grok how they work, they are more trivial than e.g. Python or
> D since there is no further specialised syntax beyond "*", "++", and
> maybe "--".
> 
> I love the concise D and Python syntaxes, just playing devil's advocate here. :)

I appreciate it, and in fact a general iterator is more powerful than the simple interface I gave.
And the iterator concept *is* a powerful one.
The thing is, a general random-access iterator can also be implemented as a single structure (not a start/end pair); you just need more methods (and a comparison operator if you want to compare iterators).
Indeed you have a little overhead for arrays and vectors (you also store the end), but I think that in most cases the real difference is zero, and nicer handling of the most common case more than offsets these considerations (and it is also nicer to implement a new iterator).
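
A quick sketch of that single-struct random-access iterator over an array (my code, following the description above): the end pointer lives inside the iterator itself, which is exactly the small overhead mentioned.

// A random-access iterator as one struct: it carries its own end,
// so testing validity needs no second iterator object.
struct ArrayIterator(T) {
    T* ptr;  // current position
    T* end;  // one past the last element
    bool isValid() { return ptr < end; }
    void moveNext() { ptr++; }
    void moveBy(ptrdiff_t n) { ptr += n; }  // the random-access step
    T getValue() { return *ptr; }
}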

> 
>> As I believe that the optimizations to make the better interface be as
>> efficient as the iterator one are perfectly doable (some work, yes, but not
>> extremely much so), I see no good reason for C++ design.
>> 
>> Fawzi


August 28, 2008
Fawzi Mohamed wrote:

> I think

Sorry. Although I cancelled my posting within seconds, you grabbed it even faster.

-manfred

-- 
If life is going to exist in this Universe, then the one thing it cannot afford to have is a sense of proportion. (Douglas Adams)

August 28, 2008
bearophile wrote:

>> This statement seems to be as true as the statement:
> No, it's a different statement.

Yes. It is a different statement. So what?

If you do not recognize how well the accompanying elaborations tied it to your statement, then there is no value in going any further.

Just convince Walter and the case is settled.


> I'll try to post another more real-looking benchmark later

Different people might perceive reality differently, don't they?

-manfred

-- 
If life is going to exist in this Universe, then the one thing it cannot afford to have is a sense of proportion. (Douglas Adams)

August 28, 2008
"Steven Schveighoffer" <schveiguy@yahoo.com> wrote in message news:g9650h$cp9$1@digitalmars.com...
> "Nick Sabalausky" wrote
>> "Steven Schveighoffer" wrote in message news:g95b89$1jii$1@digitalmars.com...
>>>
>>> When someone sees an index operator the first thought is that it is a quick lookup.
>>
>> This seems to be a big part of the disagreement. Personally, I think it's insane to look at [] and just assume it's a cheap lookup. That was true in pure C where, as someone else said, it was nothing more than a shorthand for a specific pointer arithmetic operation, but carrying that idea over to more modern languages (especially one that also uses [] for associative arrays) is a big mistake.
>
> Perhaps it is a mistake to assume it, but it is a common mistake. And the expectation is intuitive.

Why in the world would any halfway competent programmer ever look at a *linked list* and assume that the linked list's [] (if implemented) is O(1)?

As for the risk that this could create of accidentally sending a linked list to a "search" (i.e., a "search for an element which contains data X") that uses [] internally instead of iterators (but then, why wouldn't it just use iterators anyway?): I'll agree that in a case like this there should be some mechanism for the automatic choice of an algorithm, but that mechanism should be at a separate level of abstraction. There would be a function "search" that, through either RTTI or template constraints or something else, says "does collection 'c' implement ConstantTimeForewardDirectionIndexing?" or better yet IMO "does the collection have an attribute ForewardDirectionIndexingComplexity that is set equal to Complexity.Constant?", and based on that passes control to either IndexingSearch or IteratorSearch.
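
A rough sketch of that dispatch (my code; all names are hypothetical, following the suggestion above): each collection advertises the complexity of its indexing as a compile-time constant, and search() selects an implementation with a static if.

// Hypothetical dispatch on an advertised indexing complexity.
enum Complexity { Constant, Linear }

struct ArrayColl(T) {
    T[] data;
    static const Complexity indexingComplexity = Complexity.Constant;
    size_t length() { return data.length; }
    T opIndex(size_t i) { return data[i]; }
}

// Chooses a strategy from the advertised cost, so a linked list
// is never scanned through an O(n) opIndex by accident.
size_t search(Coll, T)(Coll c, T value) {
    static if (Coll.indexingComplexity == Complexity.Constant) {
        for (size_t i = 0; i < c.length; i++)
            if (c[i] == value)
                return i;  // O(1) lookups, O(n) total
    } else {
        size_t i = 0;
        foreach (x; c) {   // never index; walk the elements once
            if (x == value)
                return i;
            i++;
        }
    }
    return size_t.max;     // not found
}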

>  You don't see people making light switches that look like outlets, even
> though it's possible.  You might perhaps make a library where opIndex is a
> linear search in your list, but I would expect that people would not use
> that indexing feature correctly.  Just as if I plug my lamp into the light
> switch that looks like an outlet, I'd expect it to get power, and be
> confused when it doesn't.  Except the opIndex mistake is more subtle
> because I *do* get what I actually want, but I just am not realizing the
> cost of it.
>
>> The '+' operator means "add". Addition is typically O(1). But vectors can be added, and that's an O(n) operation. Should opAdd never be used for vectors?
>
> As long as it's addition, I have no problem with an O(n) operation (and it depends on your view of n).  Indexing implies speed; look at all the other cases where indexing is used.  For addition to be proportional to the size of the element is natural and expected.
>

When people look at '+' they typically think "integer/float addition". Why would, for example, the risk of mistaking an O(n) "big int" addition for an O(1) integer/float addition be any worse than the risk of mistaking an O(n) linked list "get element at index" for an O(1) array "get element at index"?

>>> You can force yourself to think differently, but the reality is that most people think that because of the universal usage of square brackets (except for VB, and I feel pity for anyone who needs to use VB) to mean 'lookup by key', and usually this is only useful on objects where the lookup is quick ( < O(n) ).  Although there is no requirement, nor enforcement, the 'quick' contract is expected by the user, no matter how much docs you throw at them.
>>>
>>> Look, for instance, at Tango's now-deprecated LinkMap, which uses a linked-list of key/value pairs (copied from Doug Lea's implementation). Nobody in their right mind would use link map because lookups are O(n), and it's just as easy to use a TreeMap or HashMap.  Would you ever use it?
>>>
>>>> If you *really* need that sort of guarantee (and I can imagine it may be useful in some cases), then the implementation of the guarantee does *not* belong in the realm of "implements vs doesn't-implement a particular operator overload". Doing so is an abuse of operator overloading, since operator overloading is there for defining syntactic sugar, not for acting as a makeshift contract.
>>>
>>> I don't think anybody is asking for a guarantee from the compiler or any specific tool.  I think what we are saying is that violating the 'opIndex is fast' notion is bad design because you end up with users thinking they are doing something that's quick.  You end up with people posting benchmarks on your containers saying 'why does python beat the pants off your list implementation?'.  You can say 'hey, it's not meant to be used that way', but then why can the user use it that way?  A better design is to nudge the user into using a correct container for the job by only supporting operations that make sense on the collections.
>>>
>>> And as far as operator semantic meaning, D's operators are purposely named after what they are supposed to do.  Notice that the operator for + is opAdd, not opPlus.  This is because opAdd is supposed to mean you are performing an addition operation.  Assigning a different semantic meaning is not disallowed by the compiler, but is considered bad design. opIndex is supposed to be an index function, not a linear search.  It's not called opSearch for a reason.  Sure you can redefine it however you want semantically, but it's considered bad design.  That's all we're saying.
>>>
>>
>> Nobody is suggesting using [] to invoke a search (Although we have talked about using [] *in the search function's implementation*). Search means you want to get the position of a given element, or in other words, "element" -> search -> "key/index". What we're talking about is the reverse: getting the element at a given position, ie, "key/index" -> [] -> "element". It doesn't matter if it's an array, a linked list, or a super-duper-collection-from-2092: that's still indexing, not searching.
>
> It is a search: you are searching for the element whose key matches.  When the key can be used to look up the element efficiently, we call that indexing.  Indexing is a more constrained form of searching.
>

If you've got a linked list, and you want to get element N, are you *really* going to go reaching for a function named "search"? How often do you really see a generic function named "search" or "find" that takes a numeric index as the "to be found" parameter instead of something to be matched against the element's value? I would argue that that would be confusing for most people. Like I said in a different post farther down, the implementation of a "getAtIndex()" is obviously going to work like a search, but from "outside the box", what you're asking for is not the same.


August 28, 2008
Nick Sabalausky Wrote:

> "Steven Schveighoffer" <schveiguy@yahoo.com> wrote in message news:g9650h$cp9$1@digitalmars.com...
> > "Nick Sabalausky" wrote
> >> "Steven Schveighoffer" wrote in message news:g95b89$1jii$1@digitalmars.com...
> >>>
> >>> When someone sees an index operator the first thought is that it is a quick lookup.
> >>
> >> This seems to be a big part of the disagreement. Personally, I think it's insane to look at [] and just assume it's a cheap lookup. That was true in pure C where, as someone else said, it was nothing more than a shorthand for a specific pointer arithmetic operation, but carrying that idea over to more modern languages (especially one that also uses [] for associative arrays) is a big mistake.
> >
> > Perhaps it is a mistake to assume it, but it is a common mistake. And the expectation is intuitive.
> 
> Why in the world would any halfway competent programmer ever look at a *linked list* and assume that the linked list's [] (if implemented) is O(1)?

A programmer can look at generic code. Generic code does not know if it is working on a linked list or something else. I think it really helps to think in terms of generic code, because it is generic code that makes the problem interesting.

> As for the risk that this could create of accidentally sending a linked list to a "search" (i.e., a "search for an element which contains data X") that uses [] internally instead of iterators (but then, why wouldn't it just use iterators anyway?): I'll agree that in a case like this there should be some mechanism for the automatic choice of an algorithm, but that mechanism should be at a separate level of abstraction. There would be a function "search" that, through either RTTI or template constraints or something else, says "does collection 'c' implement ConstantTimeForewardDirectionIndexing?" or better yet IMO "does the collection have an attribute ForewardDirectionIndexingComplexity that is set equal to Complexity.Constant?", and based on that passes control to either IndexingSearch or IteratorSearch.

I think this is an extremely complicated design. What is the advantage of this design over STL?

> >  You don't see people making light switches that look like outlets, even
> > though it's possible.  You might perhaps make a library where opIndex is a
> > linear search in your list, but I would expect that people would not use
> > that indexing feature correctly.  Just as if I plug my lamp into the light
> > switch that looks like an outlet, I'd expect it to get power, and be
> > confused when it doesn't.  Except the opIndex mistake is more subtle
> > because I *do* get what I actually want, but I just am not realizing the
> > cost of it.
> >
> >> The '+' operator means "add". Addition is typically O(1). But vectors can be added, and that's an O(n) operation. Should opAdd never be used for vectors?
> >
> > As long as it's addition, I have no problem with O(n) operation (and it depends on your view of n).  Indexing implies speed, look at all other cases where indexing is used.  For addition to be proportional to the size of the element is natural and expected.
> >
> 
> When people look at '+' they typically think "integer/float addition". Why would, for example, the risk of mistaking an O(n) "big int" addition for an O(1) integer/float addition be any worse than the risk of mistaking an O(n) linked list "get element at index" for an O(1) array "get element at index"?

This is again wrong, for two reasons. I am sorry! One small thing is that I think big int "+" is O(log n) in the value, not O(n), because the cost grows with the number of digits. But the real problem is that people look at a[] = b[] + c[] and see the operands. It is evident from the operands that the cost is proportional to the input size. If it were any shorter it would be a miracle, because it would mean some elements are not even seen! You compare the wrong situations; I mean a different situation. And operator or function is not important. I said that if array access were written index(a, n), and everybody always thought of it that way, then index(a, n) still should not do a linear search.

> >>> You can force yourself to think differently, but the reality is that most people think that because of the universal usage of square brackets (except for VB, and I feel pity for anyone who needs to use VB) to mean 'lookup by key', and usually this is only useful on objects where the lookup is quick ( < O(n) ).  Although there is no requirement, nor enforcement, the 'quick' contract is expected by the user, no matter how much docs you throw at them.
> >>>
> >>> Look, for instance, at Tango's now-deprecated LinkMap, which uses a linked-list of key/value pairs (copied from Doug Lea's implementation). Nobody in their right mind would use link map because lookups are O(n), and it's just as easy to use a TreeMap or HashMap.  Would you ever use it?
> >>>
> >>>> If you *really* need that sort of guarantee (and I can imagine it may be useful in some cases), then the implementation of the guarantee does *not* belong in the realm of "implements vs doesn't-implement a particular operator overload". Doing so is an abuse of operator overloading, since operator overloading is there for defining syntactic sugar, not for acting as a makeshift contract.
> >>>
> >>> I don't think anybody is asking for a guarantee from the compiler or any specific tool.  I think what we are saying is that violating the 'opIndex is fast' notion is bad design because you end up with users thinking they are doing something that's quick.  You end up with people posting benchmarks on your containers saying 'why does python beat the pants off your list implementation?'.  You can say 'hey, it's not meant to be used that way', but then why can the user use it that way?  A better design is to nudge the user into using a correct container for the job by only supporting operations that make sense on the collections.
> >>>
> >>> And as far as operator semantic meaning, D's operators are purposely named after what they are supposed to do.  Notice that the operator for + is opAdd, not opPlus.  This is because opAdd is supposed to mean you are performing an addition operation.  Assigning a different semantic meaning is not disallowed by the compiler, but is considered bad design. opIndex is supposed to be an index function, not a linear search.  It's not called opSearch for a reason.  Sure you can redefine it however you want semantically, but it's considered bad design.  That's all we're saying.
> >>>
> >>
> >> Nobody is suggesting using [] to invoke a search (Although we have talked about using [] *in the search function's implementation*). Search means you want to get the position of a given element, or in other words, "element" -> search -> "key/index". What we're talking about is the reverse: getting the element at a given position, ie, "key/index" -> [] -> "element". It doesn't matter if it's an array, a linked list, or a super-duper-collection-from-2092: that's still indexing, not searching.
> >
> > It is a search, you are searching for the element whose key matches.  When the key can be used to lookup the element efficiently, we call that indexing.  Indexing is a more constrained form of searching.
> >
> 
> If you've got a linked list, and you want to get element N, are you *really* going to go reaching for a function named "search"?

Yes. This is exactly what STL is doing: to reach element N of a list you call std::advance, which is linear for a list iterator. And there is nothing wrong with it. Again, I think the STL book by Josuttis is very helpful. Also the Stepanov notes online are very interesting! Thank you Don.

> How often do you really see a generic function named "search" or "find" that takes a numeric index as the "to be found" parameter instead of something to be matched against the element's value? I would argue that that would be confusing for most people.

I think you are losing the argument. The experience of STL shows that it is not confusing. STL is the most successful library for C++, even though C++ is now old and has problems.