August 27, 2008
"superdan" <super@dan.org> wrote in message news:g92drj$h0u$1@digitalmars.com...
> Nick Sabalausky Wrote:
>
>> "superdan" <super@dan.org> wrote in message news:g921nb$2qqq$1@digitalmars.com...
>> > Nick Sabalausky Wrote:
>> >
>> >> "Denis Koroskin" <2korden@gmail.com> wrote in message news:op.ugie28dxo7cclz@proton.creatstudio.intranet...
>> >> > On Tue, 26 Aug 2008 23:29:33 +0400, superdan <super@dan.org> wrote: [snip]
>> >> >>> The D spec certainly doesn't make any guarantees about
>> >> >>> the time/memory complexity of opIndex; it's up to the implementing
>> >> >>> class
>> >> >>> to do so.
>> >> >>
>> >> >> it don't indeed. it should. that's a problem with the spec.
>> >> >>
>> >> >
>> >> > I agree. You can't rely on function invocation, i.e. the following
>> >> > might
>> >> > be slow as death:
>> >> >
>> >> > auto n = collection.at(i);
>> >> > auto len = collection.length();
>> >> >
>> >> > but index operations and property getters should be real-time and
>> >> > have
>> >> > O(1) complexity by design.
>> >> >
>> >> > auto n = collection[i];
>> >> > auto len = collection.length;
>> >>
>> >> I disagree. That strategy strikes me as a very clear example of
>> >> breaking
>> >> encapsulation by having implementation details dictate certain aspects
>> >> of
>> >> the API. At the very least, that will make the API overly rigid,
>> >> hindering
>> >> future changes that could otherwise have been non-breaking,
>> >> behind-the-scenes changes.
>> >
>> > take this:
>> >
>> > auto c = new Customer;
>> > c.loadCustomerInfo("123-12-1234");
>> >
>> > that's all cool. there's no performance guarantee other than some best
>> > effort kinda thing `we won't sabotage things'. if you come with faster
>> > or
>> > slower methods to load customers, no problem. coz noone assumes any.
>> >
>> > now take sort. sort says. my input is a range that supports indexing
>> > and
>> > swapping independent of the range size. if you don't have that just let
>> > me
>> > know and i'll use a totally different method. just don't pretend.
>> >
>>
>> Choosing a sort method is a separate task from the actual sorting.
>
> thot you were the one big on abstraction and encapsulation and all those good things. as a user i want to sort stuff. i let the library choose what's best for the collection at hand.
>
> sort(stuff)
> {
>    1) figure out best algo for stuff
>    2) have at it
> }
>
> i don't want to make that decision outside. for example i like one sort routine. not quicksort, heapsort, quicksort_with_median_of_5, or god forbid bubblesort.
>

I never said that shouldn't be available. In fact, I did say it should be there. But just not forced.

>> Any sort
>> that expects to be able to index will still work correctly if given a
>> collection that has O(n) or worse indexing, just as it will still work
>> correctly if given an O(n) or worse comparison delegate.
>
> i disagree but now that christopher gave me a black eye guess i have to shut up.
>

That's not a matter of agreeing or disagreeing, it's a verifiable fact. Grab a working sort function that operates on collection classes that implement indexing-syntax and a length property, feed it an unsorted linked list that has opIndex overloaded to return the nth node and a proper length property, and when it returns, the list will be sorted.

Or are you maybe talking about a sort function that's parameterized specifically to take an "array" instead of "a collection that implements opIndex and a length property"? Because that might make a difference depending on the language (not sure about D offhand).
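To make that concrete, here's a rough D sketch (the names are invented) of such a sort, parameterized on the collection type rather than on an array. All it assumes of the collection is a length property, a readable opIndex and a writable opIndexAssign; give it a linked list with those members and it will sort it correctly, just with O(n) cost per index:

// Sketch only: a selection sort over anything indexable.
// Correctness needs nothing beyond indexing, index-assignment and length;
// the running time, of course, depends on how the collection implements them.
void genericSort(C)(C c)
{
    for (size_t i = 0; i + 1 < c.length; i++)
    {
        size_t min = i;
        for (size_t j = i + 1; j < c.length; j++)
            if (c[j] < c[min])
                min = j;
        if (min != i)
        {
            auto tmp = c[i];
            c[i] = c[min];    // needs opIndexAssign on user-defined collections
            c[min] = tmp;
        }
    }
}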

>> It's up to the
>> caller of the sort function to know that an O(n log n) sort (for
>> instance)
>> is only O(n log n) if the indexing and comparison are both O(1). And then
>> it's up to them to decide if they still want to send it a linked list or
>> a
>> complex comparison. The sort shouldn't crap out at compile-time (or
>> runtime)
>> just because some novice might not know that doing a generalized bubble
>> sort
>> on a linked list scales poorly.
>
> it should coz there's an obvious good choice. there's no good tradeoff in that. there never will be a case to call the bad sort on the bad range.
>

No matter what type of collection you're using, the "best sort" is still going to vary depending on factors like the number of elements to be sorted, whether duplicates might exist, how close the collection is to perfectly sorted, perfectly backwards, or totally random, how likely it is to be random/sorted/backwards at any given time, etc. And then there can be different variations of the same basic algorithm that can be better or worse for certain scenarios. And then there's the issue of how the algorithm-choosing sort handles user-created collections, if at all.
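For illustration, the kind of size-based dispatch I mean looks roughly like this (just a sketch; the cutoff of 32 and the name are made up):

// Sketch: insertion sort's low constant factor usually wins on tiny
// (or nearly sorted) inputs; otherwise fall back to the built-in sort.
void adaptiveSort(T)(T[] a)
{
    if (a.length < 32)
    {
        for (size_t i = 1; i < a.length; i++)
        {
            auto x = a[i];
            size_t j = i;
            for (; j > 0 && x < a[j - 1]; j--)
                a[j] = a[j - 1];
            a[j] = x;
        }
    }
    else
    {
        a.sort;    // the built-in array sort property
    }
}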

>> If you want automatic choosing of an appropriate sort algorithm (which is certainly a good thing to have), then that can be done at a separate, optional, level of abstraction using function overloading, template specialization, RTTI, etc. That way you're not imposing arbitrary restrictions on anyone.
>
> i think u missed a lil point i was making.
>
> All collections: implement nth()
> Indexable collections: implement opIndex
>
> is all. there is no restriction. just use nth.
>

If you want opIndex to be reserved for highly scalable indexing, then I can see how that would lead to what you describe here. But I'm in the camp that feels opIndex means "indexing", not "cheap/scalable indexing", in which case it becomes unnecessary to also expose the separate "nth()" function.

>> > with that indexing and swapping complexity ain't implementation detail. they're part of the spec. guess stepanov's main contribution was to clarify that.
>> >
>>
>> When I called indexing an implementation detail, I was referring to the
>> collection itself. The method of indexing *is* an implementation detail
>> of
>> the collection.
>
> not when it gets composed in higher level ops.
>

It's not? If a collection's indexing isn't implemented by the collection's own class (or the equivalent functions in non-OO), then where is it implemented? Don't tell me it's the sort function, because I know that I'm not calling a sort function every time I say "collection[i]". The method of indexing is implemented by the collection class; therefore, it's an implementation detail of that method/class, not of the functions that call it. Claiming otherwise is like saying that all of the inner workings of printf() are implementation details of main().

>> It should not be considered an implementation detail of the
>> sort algorithm since it's encapsulated in the collection and thus hidden
>> away from the sort algorithm. If you make a rule that collections with
>> cheap
>> indexing are indexed via opIndex and collections with expensive indexing
>> are
>> indexed via a function, then you've just defined the API in terms of the
>> collection's implementation (and introduced an unnecessary inconsistency
>> into the API).
>
> no. there is consistency. nth() is consistent across - o(n) or better indexing.
>
> i think u have it wrong when u think of "cheap" as if it were "fewer machine instructions". no. it's about asymptotic complexity and that does matter.
>

Since we're talking about algorithmic complexity, I figured "cheap" and "expensive" would be understood as being intended in the same sense. So yes, I'm well aware of that.

>> If a sort function is desired that only accepts collections with O(1) indexing, then that can be accomplished at a higher level of abstraction (using function overloading, RTTI, etc.) without getting in the way when such a guarantee is not needed.
>
> exactly. nth() and opIndex() fit the bill. what's there not to love?
>

The "nth()" deviates from standard indexing syntax. I consider "collection[i]" to mean "indexing", not "low-complexity indexing".

>> >> For realtime code, I can see the benefit to what you're saying.
>> >> Although
>> >> in
>> >> many cases only part of a program needs to be realtime, and for the
>> >> rest
>> >> of
>> >> the program's code I'd hate to have to sacrifice the encapsulation
>> >> benefits.
>> >
>> > realtime has nothin' to do with it.
>>
>> For code that needs to run in realtime, I agree with Denis Koroskin that
>> it
>> could be helpful to be able to look at a piece of code and have some sort
>> of
>> guarantee that there is no behind-the-scenes overloading going on that is
>> any more complex than the operators' default behaviors. But for code that
>> doesn't need to finish within a maximum amount of time, that becomes less
>> important and the encapsulation/syntactic-consistency gained from the use
>> of
>> such things becomes a more worthy pursuit. That's what I was saying about
>> realtime.
>
> i disagree but am in a rush now. guess i can't convince u.
>
>> > encapsulation ain't broken by making complexity part of the reqs. any
>> > more
>> > than any req ain't breakin' encapsulation. if it looks like a problem
>> > then
>> > encapsulation was misdesigned and needs change.
>> >
>> > case in point. all containers should provide 'nth' say is it's o(n) or
>> > better. then there's a subclass of container that is indexed_container.
>> > that provides opIndex and says it's o(log n) or better. it also
>> > provides
>> > 'nth' by just forwarding to opIndex. faster than o(n) ain't a problem.
>> >
>> > but forcing a list to blurt something for opIndex - that's just bad design.
>>
>> I agree that not all collections should implement an opIndex. Anything
>> without a natural sequence or mapping should lack opIndex (such as a tree
>> or
>> graph). But forcing the user of a collection that *does* have a natural
>> sequence (like a linked list) to use function-call-syntax instead of
>> standard indexing-syntax just because the collection is implemented in a
>> way
>> that causes indexing to be less scalable than other collections - that's
>> bad
>> design.
>
> no. it's great design. because it's not lyin'. you want o(1) indexing you say a[n]. you are ok with o(n) indexing you say a.nth(n). this is how generic code works, with consistent notation. not with lyin'.
>

Number of different ways to index a collection:
One way: Consistent
Two ways: Not consistent

>> The way I see it, "group[n]" means "Get the nth element of group". Not
>> "Get
>> the element at location group.ptr + (n * sizeof(group_base_type)) or
>> something else that's just as scalable."
>
> no need to get that low. just say o(1) and understand o(1) has nothing to do with the count of assembler ops.
>

Here:

"group.ptr + (n * sizeof(group_base_type))..."

One multiplication, one addition, one memory read, no loops, no recursion: O(1)

"...or something else that's just as scalable"

Something that's just as scalable as O(1) must be O(1). So yes, that's what I said.

>> In plain C, those are one and the
>> same. But when you start talking about generic collections, encapsulation
>> and interface versus implementation, they are very different: the former
>> is
>> interface, the latter is implementation.
>
> so now would u say stl has a poor design? because it's all about stuff that you consider badly designed.

I abandoned C++ back when STL was still fairly new, so I can't honestly say. I seem to remember C++ having some trouble with certain newer language concepts; it might be that STL is the best that can be reasonably done given the drawbacks of C++. Or there might be room for improvement.


August 27, 2008
On 2008-08-26 14:15:28 +0200, bearophile <bearophileHUGS@lycos.com> said:

> [...]
> On my site I am keeping a gallery of tiny benchmarks where D code (with DMD) is 10 or more times slower than essentially equivalent Python, C, or Java code (I have about 12 programs so far, very different from each other. There's a benchmark regarding associative arrays too). Hopefully it will become useful once people start tuning D implementations.
> 
> Bye,
> bearophile

You know I have got the impression that you have a naive view of datastructures, and each time you find a performance problem you ask for the data structure to be improved.
One cannot expect to have a single data structure accommodate all uses: just because something is a container and supports a given operation does not mean that it supports it efficiently.
*If* something is slow for a given purpose, what I do is sit down and think a little about which datastructure is optimal for my problem, and then switch to it (maybe taking it from the tango containers).

Don't get me wrong, it is useful to know which usage patterns give performance problems with the default data structures, and if associative arrays would use a tree or some sorted structure for small sizes (avoiding the cost of hashing) I would not complain, but I do not think (for example) that arrays should necessarily be very optimized for appending...

Fawzi

August 27, 2008
superdan wrote:
> Steven Schveighoffer Wrote:
> 
>> "superdan" wrote
>>> Christopher Wright Wrote:
>>>
>>>> superdan wrote:
>>>>>> class LinkedList(T)
>>>>>> {
>>>>>>     Node!(T) head;
>>>>>>
>>>>>>     /** Gets the ith item of the list. Throws: SIGSEGV if i >= length of
>>>>>>     the list. Time complexity: O(N) for a list of length N. This operation
>>>>>>     is provided for completeness and not recommended for frequent use in
>>>>>>     large lists. */
>>>>>>     T opIndex(int i)
>>>>>>     {
>>>>>>         auto current = head;
>>>>>>         while (i--)
>>>>>>         {
>>>>>>             current = current.next;
>>>>>>         }
>>>>>>         return current.value;
>>>>>>     }
>>>>>> }
>>>>> oopsies. houston we got a problem here. problem is all that pluggable sort business works only if it can count on a constant time opIndex. why? because sort has right in its spec that it takes o(n log n) time. if u pass LinkedList to it you obtain a nonsensical design that compiles but runs afoul of the spec. because with that opIndex sort will run in quadratic time and no amount of commentin' is gonna save it from that.
>>>> WRONG!
>>>> Those sorting algorithms are correct. Their runtime is now O(n^2 log n)
>>>> for this linked list.
>>> you got the wrong definition of correct. sort became a non-scalable algo from a scalable algo. whacha gonna say next. bubble sort is viable?!?
>> O(n^2 log n) is still considered solved.
> 
> of course. for traveling salesman that is. just not sorting. o(n^2 log n) sorting algo, that's  a no-go. notice also i didn't say solved/unsolved. i said scalable. stuff beyond n log n don't scale.

Why are people unable to have the brains to understand that they're using data structures in a suboptimal manner?

Furthermore, you're faulting linked lists as having a bad opIndex.  Why not implement a cursor (Java LinkedList-like iterator) in the opIndex function?  Thus you could retain the reference to the last indexed location, and simply use that instead of the root node when calling opIndex.  Granted that whenever the contents of the list are modified that reference would have to be considered invalid (start from the root node again), but it'd work with O(1) efficiency for sequential accesses from 0 to length.  True, it'll add another pointer to a node in memory, as well as an integer representing the position of that node reference.
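A sketch of what I mean (assuming the Node!(T) type from the example earlier in the thread, plus two extra members on the list that every mutating operation must reset):

// Cached cursor: sequential access (0, 1, 2, ...) advances one link per
// call instead of re-walking the list from the head each time.
Node!(T) lastNode;    // null means the cache is invalid
int      lastIndex;   // index of lastNode

T opIndex(int i)
{
    if (lastNode is null || i < lastIndex)
    {
        lastNode = head;      // cache invalid or target is behind the cursor
        lastIndex = 0;
    }
    while (lastIndex < i)
    {
        lastNode = lastNode.next;
        lastIndex++;
    }
    return lastNode.value;
}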

> sure thing. problem is, you must know it's a list. otherwise you wouldn't make a copy.
> 
> don't forget how this all started when answering tiny bits of my post. it started with a dood claiming vector and list both have opIndex and then sort works with both without knowing the details. it don't work with both.

Wrong.  It works.  Granted, it's not precisely what the spec for sort dictates (which is probably in error, since no spec can guarantee a precise efficiency if it doesn't know the precise container type).  You are also misinterpreting the spec.  It is saying that it uses a specific efficiency of algorithm, not that you can arbitrarily expect a certain efficiency out of it regardless of how dumb you might be with the choice of container you use.

>> So the total is O(2n + n lgn), and we all know you always take the most significant part of the polynomial, so it then becomes:
>>
>> O(n lgn)
>>
>> Can I have my PhD now? :P
> 
> sure. i must have seen an email with an offer somewhere ;)

A Ph.D from superdan... gee, I'd value that just above my MSDN membership.  Remember: I value nothing less than my MSDN membership.



August 27, 2008
Fawzi Mohamed:
>You know I have got the impression that you have a naive view of datastructures,<

I think you are wrong, see below.


>One cannot expect to have a single data structure accommodate all uses,<

My views on the topic are:
- Each data structure (DS) is a compromise: it allows you to do some operations with a certain performance while giving you a different performance on other operations, so it gives you a performance profile. Generally you can't have a DS with the best performance for every operation.
- Sometimes you may want to choose a DS with worse performance just because its implementation is simpler, to reduce development time, bug count, etc.
- The standard library of a modern language has to contain most of the most common DSs, to avoid a lot of problems and speed up programming, etc.
- If the standard library doesn't contain a certain DS, or if the DS required is very uncommon, or the performance profile you need for a time-critical part of your code is very specific, then your language is supposed to allow you to write your own DS (some scripting languages may require you to drop down to a lower-level language to do this).
- A modern language is supposed to have some built-in DSs. Which ones to choose? This is a tricky question, but the answer I give is that a built-in data structure has to be very flexible, so it has to be efficient enough in a large variety of situations, without being optimal for any single one. This allows programmers to use it in most situations where max performance isn't required, so the programmer has to reach for the DSs of the standard library (or even his/her own) only once in a while. Creating a very flexible DS is not easy: it requires a lot of tuning, tons of benchmarks done on real code, and your DS often needs to use some extra memory to be flexible (you can even add subsystems to such a DS that collect its usage statistics at runtime to adapt it to the specific usage in the code).


>Don't get me wrong, it is useful to know which usage patterns give performance problems with the default data structures,<

If you want to write efficient programs such knowledge is very important, even in scripting languages.


>and if associative arrays would use a tree or some sorted structure for small sizes (avoiding the cost of hashing) I would not complain,<

Python hashes are optimized for small sizes too, and they don't use a tree.


>but I do not think (for example) that arrays should necessarily be very optimized for appending...<

From the long thread it seems that allowing fast slices, mutability and fast append at the same time isn't easy. I think all three features are important, so some compromise has to be found, because appending is a common enough operation and at the moment it is slow or really slow, much too slow.

Now you say that the built-in arrays don't need to be very optimized for appending. In the Python world they solve this kind of problem by mining real programs to collect real usage statistics. So they try to find out whether append is actually used in real programs, and how often, where, etc.
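To be clear, by "fast append" I mean the usual amortized O(1) scheme, sketched below (the names are invented): keep some spare capacity and grow it geometrically, so each element is copied only a constant number of times on average.

// Sketch of amortized O(1) appending: double the storage when it fills up.
struct GrowableArray(T)
{
    private T[] data;      // backing store, possibly longer than 'used'
    private size_t used;   // number of valid elements

    void append(T x)
    {
        if (used == data.length)
            data.length = data.length ? data.length * 2 : 4;  // geometric growth
        data[used++] = x;
    }

    T[] slice() { return data[0 .. used]; }
}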

Bye,
bearophile
August 27, 2008
superdan wrote:

> my point was opIndex should not be written for a list to begin with.

Ok. So you are not opposed to the random access operation on a list, as long as it doesn't use opIndex but a named function, correct?

You are saying that there is a rule somewhere (either written or unwritten)
that guarantees a time-complexity of O(1) for opIndex, wherever it appears.

This of course means that a linked list cannot define opIndex, since a
random access operation on it will take O(n) (there are tricks that can
make it faster in most practical cases, but I digress).

That, in turn, means that a linked list and a dynamic array can not share a common interface that includes opIndex.

Aren't you making things difficult for yourself with this rule?

A list and an array are very similar data-structures and it is natural for
them to share a common interface. The main differences are:
* A list takes more memory.
* A list has slower random access.
* A list has faster insertions and growth.

But the interface shouldn't necessarily make any complexity guarantees. The implementations should. And any programmer worth his salt will be able to use this wisely and choose the right sorting algorithm for the right data-structure. There are other algorithms, I'm sure, that work equally well on either. Of course, any algorithm should give its time-complexity in terms of the complexity of the operations it uses.

I do understand your point, however. And I believe my argument would be stronger if there were some sort of automatic complexity analysis tool. This could either warn a programmer in case he makes the wrong choice, or even take the choice out of the programmer's hands and automatically choose the right sorting algorithm for the job. That's a bit ambitious. I guess a profiler is the next best thing.

-- 
Michiel

August 27, 2008
On 2008-08-27 13:21:10 +0200, bearophile <bearophileHUGS@lycos.com> said:

> Fawzi Mohamed:
>> You know I have got the impression that you have a naive view of datastructures,<
> 
> I think you are wrong, see below.

good :)

>> One cannot expect to have a single data structure accommodate all uses,<
> 
> My views on the topic are:
> - Each data structure (DS) is a compromise: it allows you to do some operations with a certain performance while giving you a different performance on other operations, so it gives you a performance profile. Generally you can't have a DS with the best performance for every operation.
> - Sometimes you may want to choose a DS with worse performance just because its implementation is simpler, to reduce development time, bug count, etc.
> - The standard library of a modern language has to contain most of the most common DSs, to avoid a lot of problems and speed up programming, etc.
> - If the standard library doesn't contain a certain DS, or if the DS required is very uncommon, or the performance profile you need for a time-critical part of your code is very specific, then your language is supposed to allow you to write your own DS (some scripting languages may require you to drop down to a lower-level language to do this).
> - A modern language is supposed to have some built-in DSs. Which ones to choose? This is a tricky question, but the answer I give is that a built-in data structure has to be very flexible, so it has to be efficient enough in a large variety of situations, without being optimal for any single one. This allows programmers to use it in most situations where max performance isn't required, so the programmer has to reach for the DSs of the standard library (or even his/her own) only once in a while. Creating a very flexible DS is not easy: it requires a lot of tuning, tons of benchmarks done on real code, and your DS often needs to use some extra memory to be flexible (you can even add subsystems to such a DS that collect its usage statistics at runtime to adapt it to the specific usage in the code).
> 
> 
>> Don't get me wrong, it is useful to know which usage patterns give performance problems with the default data structures,<
> 
> If you want to write efficient programs such knowledge is very important, even in scripting languages.

on this we agree

>> and if associative arrays would use a tree or some sorted structure for small sizes (avoiding the cost of hashing) I would not complain,<
> 
> Python hashes are optimized for small sizes too, and they don't use a tree.
> 
> 
>> but I do not think (for example) that arrays should necessarily be very optimized for appending...<
> 
> From the long thread it seems that allowing fast slices, mutability and fast append at the same time isn't easy. I think all three features are important, so some compromise has to be found, because appending is a common enough operation and at the moment it is slow or really slow, much too slow.
> 
> Now you say that the built-in arrays don't need to be very optimized for appending. In the Python world they solve this kind of problem by mining real programs to collect real usage statistics. So they try to find out whether append is actually used in real programs, and how often, where, etc.

You know, as you say it depends on the programs, but it also depends on the language: if in a language it is clear that using the default structure you are not supposed to append often, and that if you really have to do it you use a special method, then the programs written in that language will use that.

On the other hand when you translate code from a language to another you might encounter problems.

The standard array as I see it embodies the philosophy of the C array:
- minimal memory overhead (it is ok to have lots of them, D does have some overhead vs C)
- normal memory layout (usable with low level routines that pass memory around and with C)
- you can check bounds (C can't)
- you can slice it
- appending to it is difficult

D tries to do different things to mitigate the fact that appending is difficult, and maybe it could do more, but it will *never* be as efficient as a structure that gives up the requirement that an array be just a chunk of contiguous memory.

Now I find the choice of having the basic array be just a chunk of contiguous memory with minimal structural overhead very reasonable for a system programming language that has to interact with C, so I also find it ok that appending is not as fast as with other structures.

Clearly someone coming from lisp, any functional language or even python, might disagree about what is requested from the "default" array container.
It isn't that one is right and the other wrong, it is just a question of priorities of the language, the feel of it, its style...

This does not mean that improvements shouldn't be made if they can be done without compromising too much, just that failing some benchmarks might be ok :)

Fawzi

> 
> Bye,
> bearophile


August 27, 2008
Michiel Helvensteijn Wrote:

> superdan wrote:
> 
> > my point was opIndex should not be written for a list to begin with.
> 
> Ok. So you are not opposed to the random access operation on a list, as long as it doesn't use opIndex but a named function, correct?

correctamundo.

> You are saying that there is a rule somewhere (either written or unwritten)
> that guarantees a time-complexity of O(1) for opIndex, wherever it appears.

yeppers. amend that to o(log n). in d, that rule is a social contract derived from the built-in vector and hash indexing syntax.

> This of course means that a linked list cannot define opIndex, since a
> random access operation on it will take O(n) (there are tricks that can
> make it faster in most practical cases, but I digress).

it oughtn't. & you digress in the wrong direction. you can't prove a majority of "practical cases" will not suffer a performance hit. the right direction is to define the right abstraction for forward iteration.

i mean opIndex optimization is making a shitty design run better. y not make a good design to start with?

> That, in turn, means that a linked list and a dynamic array can not share a common interface that includes opIndex.

so what. they can share a common interface that includes nth(). what exactly is yer problem with that.

> Aren't you making things difficult for yourself with this rule?

not @ all. i want o(n) index access, i use nth() and i know it's gonna take me o(n) and i'll design my higher-level algorithm accordingly. if random access helps my particular algo then is(typeof(a[n])) tells me that a supports random access. if i can't live without a[n] my algo won't compile. every1's happy.
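somethin' like this. just a sketch, names made up. if c has a[n] i can binary search, else i just walk it:

// needs c sorted either way. the dispatch happens at compile time.
bool has(C, T)(C c, T x)
{
    static if (is(typeof(c[0])))       // random access available: binary search
    {
        size_t lo = 0, hi = c.length;
        while (lo < hi)
        {
            size_t mid = (lo + hi) / 2;
            if (c[mid] < x)      lo = mid + 1;
            else if (x < c[mid]) hi = mid;
            else                 return true;
        }
        return false;
    }
    else                               // sequential container: linear walk
    {
        foreach (e; c)                 // assumes the container supports foreach
            if (e == x) return true;
        return false;
    }
}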

> A list and an array are very similar data-structures and it is natural for them to share a common interface.

sure. both r sequence containers. opIndex ain't part of a sequence container interface.

> The main differences are:
> * A list takes more memory.
> * A list has slower random access.

nooonononono. o(n) vs. o(1) to be precise. that's not "slower". that's sayin' "list don't have random access, you can as well get up your lazy ass & do a linear search by calling nth()". turn that on its head. if i give u a container and say "it has random access" u'd rightly expect better than a linear search. a deque has slower random access than a vector @ the same complexity.

> * A list has faster insertions and growth.

o(1) insertion if u have the insertion point.  both list and vector have o(1) growth.

list also has o(1) splicing. that's important.

we get to the point where we realize the two r fundamentally different structures built around fundamentally different tradeoffs. they do satisfy the same interface. just ain't the vector interface. it's a sequential-access interface. not a random-access interface.
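in code, roughly (just a sketch):

// the complexity bound is part of each interface's contract.
interface Sequence(T)
{
    size_t length();
    T nth(size_t i);           // contract: o(n) or better
}

interface RandomAccessSequence(T) : Sequence!(T)
{
    T opIndex(size_t i);       // contract: o(log n) or better
    // an implementation forwards nth() to opIndex, so it's still a Sequence
}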

> But the interface shouldn't necessarily make any complexity guarantees.

problem is many said so til stl came and said enuff is enuff. for fundamental data structures & algos complexity /must/ be part of the spec & design. otherwise all u get is a mishmash of crap & u simply can't do generic stuff w/ a mishmash of crap. as other cont/algo libs've copiously shown. that approach's impressive coz it challenged some stupid taboos & proved them worthless. it was contrarian & to great effect. for that alone stl puts to shame previous container/algo libs. i know i'd used half a dozen and wrote a couple. thot the whole container/algo design is old hat. when stl came along i was like, holy effin' guacamole. that's why i say. even if u don't use c++ for its many faults. understand stl coz it's d shiznit.

> The
> implementations should. And any programmer worth his salt will be able to
> use this wisely and choose the right sorting algorithm for the right
> data-structure.

here's where the thing blows apart. i agree with choosing manually if i didn't want to do generic programming. if u wanna do generic programming u want help from the compiler in mixing n matching stuff. it's not about the saltworthiness. it's about writing generic code.

> There are other algorithms, I'm sure, that work equally
> well on either. Of course, any algorithm should give its time-complexity in
> terms of the complexity of the operations it uses.
> 
> I do understand your point, however. And I believe my argument would be stronger if there were some sort of automatic complexity analysis tool.

stl makes-do without an automatic tool.

> This could either warn a programmer in case he makes the wrong choice, or even take the choice out of the programmer's hands and automatically choose the right sorting algorithm for the job. That's a bit ambitious. I guess a profiler is the next best thing.

i have no doubt stl has had big ambitions. for what i can tell it fulfilled them tho c++ makes higher-level algorithms look arcane. so i'm happy with the lambdas in std.algorithm. & can't figure why containers don't come along. walt?

August 27, 2008
Chris R. Miller Wrote:

> superdan wrote:
> > Steven Schveighoffer Wrote:
> > 
> >> "superdan" wrote
> >>> Christopher Wright Wrote:
> >>>
> >>>> superdan wrote:
> >>>>>> class LinkedList(T)
> >>>>>> {
> >>>>>>     Node!(T) head;
> >>>>>>
> >>>>>>     /** Gets the ith item of the list. Throws: SIGSEGV if i >= length of
> >>>>>>     the list. Time complexity: O(N) for a list of length N. This operation
> >>>>>>     is provided for completeness and not recommended for frequent use in
> >>>>>>     large lists. */
> >>>>>>     T opIndex(int i)
> >>>>>>     {
> >>>>>>         auto current = head;
> >>>>>>         while (i--)
> >>>>>>         {
> >>>>>>             current = current.next;
> >>>>>>         }
> >>>>>>         return current.value;
> >>>>>>     }
> >>>>>> }
> >>>>> oopsies. houston we got a problem here. problem is all that pluggable sort business works only if it can count on a constant time opIndex. why? because sort has right in its spec that it takes o(n log n) time. if u pass LinkedList to it you obtain a nonsensical design that compiles but runs afoul of the spec. because with that opIndex sort will run in quadratic time and no amount of commentin' is gonna save it from that.
> >>>> WRONG!
> >>>> Those sorting algorithms are correct. Their runtime is now O(n^2 log n)
> >>>> for this linked list.
> >>> you got the wrong definition of correct. sort became a non-scalable algo from a scalable algo. whacha gonna say next. bubble sort is viable?!?
> >> O(n^2 log n) is still considered solved.
> > 
> > of course. for traveling salesman that is. just not sorting. o(n^2 log n) sorting algo, that's  a no-go. notice also i didn't say solved/unsolved. i said scalable. stuff beyond n log n don't scale.
> 
> Why are people unable to have the brains to understand that they're using data structures in a suboptimal manner?

coz they wants to do generic programming. they can't know what structures they're using. so mos def structures must define expressive interfaces that describe their capabilities.

> Furthermore, you're faulting linked lists as having a bad opIndex.  Why not implement a cursor (Java LinkedList-like iterator) in the opIndex function?  Thus you could retain the reference to the last indexed location, and simply use that instead of the root node when calling opIndex.  Granted that whenever the contents of the list are modified that reference would have to be considered invalid (start from the root node again), but it'd work with O(1) efficiency for sequential accesses from 0 to length.  True, it'll add another pointer to a node in memory, as well as an integer representing the position of that node reference.

you: "this scent will make skunk farts stink less."
me: "let's kick the gorram skunk outta here!"

> > sure thing. problem is, you must know it's a list. otherwise you wouldn't make a copy.
> > 
> > don't forget how this all started when answering tiny bits of my post. it started with a dood claiming vector and list both have opIndex and then sort works with both without knowing the details. it don't work with both.
> 
> Wrong.  It works.  Granted, it's not precisely what the spec for sort dictates (which is probably in error, since no spec can guarantee a precise efficiency if it doesn't know the precise container type).

sure it can. in big oh.

>  You
> are also misinterpreting the spec.  It is saying that it uses a specific
> efficiency of algorithm, not that you can arbitrarily expect a certain
> efficiency out of it regardless of how dumb you might be with the choice
> of container you use.

in stl the spec says as i say. in d the spec is not precise. it should.

> >> So the total is O(2n + n lgn), and we all know you always take the most significant part of the polynomial, so it then becomes:
> >>
> >> O(n lgn)
> >>
> >> Can I have my PhD now? :P
> > 
> > sure. i must have seen an email with an offer somewhere ;)
> 
> A Ph.D from superdan... gee, I'd value that just above my MSDN membership.  Remember: I value nothing less than my MSDN membership.

humor is a sign of intelligence. but let me explain it. i was referring to the spam emails advertising phds from non-accredited universities.

August 27, 2008
Fawzi Mohamed:
> it also depends on the language: if in a language it is clear that using the default structure you are not supposed to append often, and that if you really have to do it you use a special method, then the programs written in that language will use that.

I agree. But then the D specs have to be updated to say that appending to built-in D arrays is a slow or very slow operation (not amortized O(1)), so people coming from the C++ STL, Python, Ruby, Tcl, Lua, Lisp, Clean, Oz, etc. will not be bitten by it.


> D tries to do different things to mitigate the fact that appending is difficult, and maybe it could do more, but it will *never* be as efficient as a structure that gives up the requirement that an array be just a chunk of contiguous memory.

I agree, a deque will probably always be faster at appending than a dynamic array. But I think D may do more here :-)


> Now I find the choice of having the basic array be just a chunk of contiguous memory with minimal structural overhead very reasonable for a system programming language that has to interact with C

Note that both vector of the C++ STL and "list" of Python are generally (or always) implemented with a chunk of contiguous memory.


> Clearly someone coming from lisp, any functional language or even
> python, might disagree about what is requested from the "default" array
> container.
> It isn't that one is right and the other wrong, it is just a question
> of priorities of the language, the feel of it, its style...

I agree, that's why I have suggested collecting statistics from real D code, and not from Lisp programs, to see what a good performance-profile compromise for D built-in dynamic arrays is :-) This means that I'll stop caring about fast array append in D if most D programmers don't need fast appends much.


> This does not mean that improvements shouldn't be made if they can be done without compromising too much, just that failing some benchmarks might be ok :)

Well, in my opinion that's okay, but there's a limit. So I think a built-in data structure has to be optimized for flexibility: while not being the best at anything, it must not be terrible at any commonly done operation.
On the other hand, I can see what you say, that in a low-level language an overly flexible data structure may not be a good fit as a built-in, while a simpler and less flexible one may fit better. You may be right and I may be wrong on this point :-)

Bye,
bearophile
August 27, 2008
superdan wrote:

>> This of course means that a linked list cannot define opIndex, since a
>> random access operation on it will take O(n) (there are tricks that can
>> make it faster in most practical cases, but I digress).
> 
> it oughtn't. & you digress in the wrong direction. you can't prove a majority of "practical cases" will not suffer a performance hit.

Perhaps. It's been a while since I've worked with data-structures on this level, but I seem to remember there are ways.

What if you maintain a linked list of small arrays? Say each node in the
list contains around log(n) of the elements in the entire list. Wouldn't
that bring random access down to O(log n)? Of course, this would also bring
insertions up to O(log n).

And what if you save an index/pointer pair after each access. Then with each
new access request, you can choose from three locations to start walking:
* The start of the list.
* The end of the list.
* The last access-point of the list.

In a lot of practical cases a new access is close to the last access. Of course, the general case would still be O(n).

>> That, in turn, means that a linked list and a dynamic array can not share a common interface that includes opIndex.
> 
> so what. they can share a common interface that includes nth(). what exactly is yer problem with that.

That's simple. a[i] looks much nicer than a.nth(i).

By the way, I suspect that if opIndex is available only on arrays and nth() is available on all sequence types, algorithm writers will forget about opIndex and use nth(), to make their algorithms more widely compatible. And I wouldn't blame them, though I guess you would.

>> * A list has faster insertions and growth.
> 
> o(1) insertion if u have the insertion point.  both list and vector have
> o(1) growth.

Yeah, but dynamic arrays have to re-allocate once in a while. Lists don't.

> we get to the point where we realize the two r fundamentally different structures built around fundamentally different tradeoffs. they do satisfy the same interface. just ain't the vector interface. it's a sequential-access interface. not a random-access interface.

I believe we agree in principle, but are just confused about each others definitions. If the "random-access interface" guarantees O(1) for nth/opIndex/whatever, of course you are right.

But if time-complexity is not taken into consideration, the sequential-access interface and the random-access interface are equivalent, no?

I'm not opposed to complexity guarantees in public contracts. Far from it, in fact. Just introduce both interfaces and let the algorithm writers choose which one to accept. But give both interfaces opIndex, since it's just good syntax.

I do think it's a good idea for algorithms to support the interface with the weakest constraints (sequential-access). As long as they specify their time-complexity in terms of the complexities of the interface operations, not in absolute terms.

Then, when a programmer writes 'somealgorithm(someinterfaceobject)', a hypothetical analysis tool could tell him the time-complexity of the resulting operation. The programmer might even assert a worst-case complexity at that point and the compiler could bail out if it doesn't match.

> even if u don't use c++ for its many faults. understand stl coz it's d shiznit.

I use C++. I use STL. I love both. But that doesn't mean there is no room for improvement. The STL is quite complex, and maybe it doesn't have to be.

-- 
Michiel