August 26, 2008
Benji Smith Wrote:

> superdan wrote:
> > relax. believe me i'm tryin', maybe you could put it a better way and meet me in the middle.
> 
> Okay. I'll try :)

'preciate that.

> Think about a collection API.

okay.

> The container classes are all written to satisfy a few basic primitive operations: you can get an item at a particular index, you can iterate in sequence (either forward or in reverse). You can insert items into a hashtable or retrieve them by key. And so on.

how do you implement getting an item at a particular index for a linked list?

how do you make a hashtable, an array, and a linked list obey the same interface? guess hashtable has stuff that others don't support?

these are serious questions. not in jest not rhetorical and not trick.

> Someone else comes along and writes a library of algorithms. The algorithms can operate on any container that implements the necessary operations.

hm. things are starting to screech a bit. but let's see your answers to your questions above.

> When someone clever comes along and writes a new sorting algorithm, I can plug my new container class right into it, and get the algorithm for free. Likewise for the guy with the clever new collection class.

things ain't that simple.

saw this flick "the devil wears prada", an ok movie but one funny remark stayed with me. "you are in desperate need of chanel."

i'll paraphrase. "you are in desperate need of stl." you need to learn stl and then you'll figure why you can't plug a new sorting algorithm into a container. you need more guarantees. and you need iterators.

> We don't bat an eye at the idea of containers & algorithms connecting to one another using a reciprocal set of interfaces.

i do. it's completely wrong. you need iterators that broker between containers and algos. and iterators must give complexity guarantees.

> In most cases, you get a performance **benefit** because you can mix and match the container and algorithm implementations that most suit your needs. You can design your own performance solution, rather than being stuck a single "low level" implementation that might be good for the general case but isn't ideal for your problem.

assuming there are iterators in the picture, sure. there is a performance benefit. even more so when said mixing and matching is done during compilation.

> Over in another message BCS said he wants an array index to compile to 3 ASM ops. Cool I'm all for it.

great. but then you must be all for the consequences of it.

> I don't know a whole lot about the STL, but my understanding is that most C++ compilers are smart enough that they can produce the same ASM from an iterator moving over a vector as incrementing a pointer over an array.

they are because stl is designed in a specific way. that specific way is lightyears away from the design you outline above.

> So the default implementation is damn fast.

not sure what you mean by default here, but playing along.

> But if someone else, with special design constraints, needs to implement a custom container template, it's no problem. As long as the container provides a function for getting iterators to the container elements, it can consume any of the STL algorithms too, even if the performance isn't as good as the performance for a vector.
> 
> There's no good reason the same technique couldn't provide both speed and API flexibility for text processing.

you see here's the problem. you systematically forget to factor in the cost of reaching through a binary interface. and if that's not there, congrats. you just discovered perpetual motion.

stl is fast for two main reasons. one. it uses compile-time interfaces and not run-time interfaces as you want. two. it defines and strictly uses a compile-time hierarchy of iterators with stringent complexity guarantees.

your container design can't be fast because it uses runtime interfaces. let alone that you don't mention complexity guarantees. but let's say those can be provided. but the fundamental problem is that you want runtime interfaces for a very low level data structure. fast that can't be. please understand.
August 26, 2008
On 2008-08-25 21:56:18 -0400, Benji Smith <dlanguage@benjismith.net> said:

> But if someone else, with special design constraints, needs to implement a custom container template, it's no problem. As long as the container provides a function for getting iterators to the container elements, it can consume any of the STL algorithms too, even if the performance isn't as good as the performance for a vector.

Indeed. But notice that the Standard Template Library containers doesn't use inheritance, but templates. You can create your own version of std::string by creating a different class and implementing the same functions, but then every function accepting a std::string would have to be a template capable of accepting either one as input, or changed to use your new string class. That's why std::find and std::foreach, akin many others, are template functions: those would work with your custom string class.

The situation is no different in D: you can create your own string class or struct, but only functions taking your string class or struct, or template functions where the string type is a template argument, will be able to use it.

If your argument is that string functions in Phobos should be template functions accepting any kind of string as input, then that sounds reasonable to me. But that's not exacly what you said you wanted.

> There's no good reason the same technique couldn't provide both speed and API flexibility for text processing.

This is absolutely right... but unfortunately, using virtual inheritance (as interfaces in D imply) isn't the same technique as in the STL at all. Template algorithms parametrized on the container and iterator type is what the STL is all about, and from there come its speed.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

August 26, 2008
Michel Fortin Wrote:

> On 2008-08-25 21:56:18 -0400, Benji Smith <dlanguage@benjismith.net> said:
> 
> > But if someone else, with special design constraints, needs to implement a custom container template, it's no problem. As long as the container provides a function for getting iterators to the container elements, it can consume any of the STL algorithms too, even if the performance isn't as good as the performance for a vector.
> 
> Indeed. But notice that the Standard Template Library containers doesn't use inheritance, but templates. You can create your own version of std::string by creating a different class and implementing the same functions, but then every function accepting a std::string would have to be a template capable of accepting either one as input, or changed to use your new string class. That's why std::find and std::foreach, akin many others, are template functions: those would work with your custom string class.
> 
> The situation is no different in D: you can create your own string class or struct, but only functions taking your string class or struct, or template functions where the string type is a template argument, will be able to use it.
> 
> If your argument is that string functions in Phobos should be template functions accepting any kind of string as input, then that sounds reasonable to me. But that's not exacly what you said you wanted.

perfect answer. u da man.

for example look at this fn from std.string.

int cmp(C1, C2)(in C1[] s1, in C2[] s2);

so it looks like cmp accepts arrays of any character type. that is cool but the [] limits the thing to builtin arrays. the correct sig is

int cmp(S1, S2)(in S1 s1, in S2 s2)
    if (isSortaString!(S1) && isSortaString!(S2));

correct?
August 26, 2008
superdan wrote:
> Benji Smith Wrote:
> 
>> superdan wrote:
>>> relax. believe me i'm tryin', maybe you could put it a better way and meet me in the middle.
>> Okay. I'll try :)
> 
> 'preciate that.
> 
>> Think about a collection API.
> 
> okay.
> 
>> The container classes are all written to satisfy a few basic primitive operations: you can get an item at a particular index, you can iterate in sequence (either forward or in reverse). You can insert items into a hashtable or retrieve them by key. And so on.
> 
> how do you implement getting an item at a particular index for a linked list? 

class Node(T)
{
	Node!(T) next;
	T value;
}

class LinkedList(T)
{
	Node!(T) head;

	/** Gets the ith item of the list. Throws: SIGSEGV if i >= length of the list. Time complexity: O(N) for a list of length N. This operation is provided for completeness and not recommended for frequent use in large lists. */
	T opIndex(int i)
	{
		auto current = head;
		while (i)
		{
			current = current.next;
		}
		return current.value;
	}
}

> how do you make a hashtable, an array, and a linked list obey the same interface? guess hashtable has stuff that others don't support?

You have an interface for collections (you can add, remove, and get the length, maybe a couple other things).

You have an interface for lists (they're collections, and you can index them).

Then you can use all the collection-oriented stuff with lists, and you can do special list-type things with them if you want.

> these are serious questions. not in jest not rhetorical and not trick.

Yes, but they're solved problems.

>> Someone else comes along and writes a library of algorithms. The algorithms can operate on any container that implements the necessary operations.
> 
> hm. things are starting to screech a bit. but let's see your answers to your questions above.
> 
>> When someone clever comes along and writes a new sorting algorithm, I can plug my new container class right into it, and get the algorithm for free. Likewise for the guy with the clever new collection class.
> 
> things ain't that simple.

Collection-oriented library code will care sufficiently about performance that this mix-and-match stuff is not feasible. Almost anything else doesn't care enough to take only an AssociativeArrayHashSet and not a TreeSet or a LinkedList or a primitive array.

> saw this flick "the devil wears prada", an ok movie but one funny remark stayed with me. "you are in desperate need of chanel." 
> 
> i'll paraphrase. "you are in desperate need of stl." you need to learn stl and then you'll figure why you can't plug a new sorting algorithm into a container. you need more guarantees. and you need iterators.
> 
>> We don't bat an eye at the idea of containers & algorithms connecting to one another using a reciprocal set of interfaces.
> 
> i do. it's completely wrong. you need iterators that broker between containers and algos. and iterators must give complexity guarantees.

I don't. If I'm just going to iterate through the items of a collection, I only care about the opApply. If I need to index stuff, I don't care if I get a primitive array or Bob Dole's brand-new ArrayList class.
August 26, 2008
On 2008-08-25 22:52:52 -0400, superdan <super@dan.org> said:

> int cmp(S1, S2)(in S1 s1, in S2 s2)
>     if (isSortaString!(S1) && isSortaString!(S2));
> 
> correct?

That's sorta what I had in mind.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

August 26, 2008
Christopher Wright Wrote:

> superdan wrote:
> > Benji Smith Wrote:
> > 
> >> superdan wrote:
> >>> relax. believe me i'm tryin', maybe you could put it a better way and meet me in the middle.
> >> Okay. I'll try :)
> > 
> > 'preciate that.
> > 
> >> Think about a collection API.
> > 
> > okay.
> > 
> >> The container classes are all written to satisfy a few basic primitive operations: you can get an item at a particular index, you can iterate in sequence (either forward or in reverse). You can insert items into a hashtable or retrieve them by key. And so on.
> > 
> > how do you implement getting an item at a particular index for a linked list?
> 
> class Node(T)
> {
> 	Node!(T) next;
> 	T value;
> }

so far so good.

> class LinkedList(T)
> {
> 	Node!(T) head;
> 
> 	/** Gets the ith item of the list. Throws: SIGSEGV if i >= length of
> the list. Time complexity: O(N) for a list of length N. This operation
> is provided for completeness and not recommended for frequent use in
> large lists. */
> 	T opIndex(int i)
> 	{
> 		auto current = head;
> 		while (i)
> 		{
> 			current = current.next;
> 		}
> 		return current.value;
> 	}
> }

oopsies. houston we got a problem here. problem is all that pluggable sort business works only if it can count on a constant time opIndex. why? because sort has right in its spec that it takes o(n log n) time. if u pass LinkedList to it you obtain a nonsensical design that compiles but runs afoul of the spec. because with that opIndex sort will run in quadratic time and no amount of commentin' is gonna save it from that.

this design is a stillborn.

what needs done is to allow different kinds of containers implement different interfaces. in fact a better way to factor things is via iterators, as stl has shown.

> > how do you make a hashtable, an array, and a linked list obey the same interface? guess hashtable has stuff that others don't support?
> 
> You have an interface for collections (you can add, remove, and get the length, maybe a couple other things).

this is an incomplete response. what do you add for a vector!(T)? A T. what do you add for a hash!(T, U)? you tell me. and you tell me how you make that signature consistent across vector and hash.

> You have an interface for lists (they're collections, and you can index them).

wrong. don't ever mention a linear-time indexing operator in an interview. you will fail it right then. you can always define linear-time indexing as a named function. but never masquerade it as an index operator.

> Then you can use all the collection-oriented stuff with lists, and you can do special list-type things with them if you want.
> 
> > these are serious questions. not in jest not rhetorical and not trick.
> 
> Yes, but they're solved problems.

apparently not since you failed at'em.

> >> Someone else comes along and writes a library of algorithms. The algorithms can operate on any container that implements the necessary operations.
> > 
> > hm. things are starting to screech a bit. but let's see your answers to your questions above.
> > 
> >> When someone clever comes along and writes a new sorting algorithm, I can plug my new container class right into it, and get the algorithm for free. Likewise for the guy with the clever new collection class.
> > 
> > things ain't that simple.
> 
> Collection-oriented library code will care sufficiently about performance that this mix-and-match stuff is not feasible.

what's that supposed to mean? you sayin' stl don't exist?

> Almost anything else doesn't care enough to take only an AssociativeArrayHashSet and not a TreeSet or a LinkedList or a primitive array.

wrong for the reasons above.

> > saw this flick "the devil wears prada", an ok movie but one funny remark stayed with me. "you are in desperate need of chanel."
> > 
> > i'll paraphrase. "you are in desperate need of stl." you need to learn stl and then you'll figure why you can't plug a new sorting algorithm into a container. you need more guarantees. and you need iterators.
> > 
> >> We don't bat an eye at the idea of containers & algorithms connecting to one another using a reciprocal set of interfaces.
> > 
> > i do. it's completely wrong. you need iterators that broker between containers and algos. and iterators must give complexity guarantees.
> 
> I don't. If I'm just going to iterate through the items of a collection, I only care about the opApply. If I need to index stuff, I don't care if I get a primitive array or Bob Dole's brand-new ArrayList class.

you too are in desperate need for stl.
August 26, 2008
Benji Smith wrote:
> In another thread (about array append performance) I mentioned that Strings ought to be implemented as classes rather than as simple builtin
> arrays. Superdan asked why. Here's my response...
> 
> I'll start with a few of the softball, easy reasons.
> 
 > But much more important than either of those reasons is the lack of
> polymorphism on character arrays. Arrays can't have subclasses, and they can't implement interfaces.

I don't think polymorphic strings is right for D strings.  This is the sort of thing a lib could implement but D should (and does) provide the basic components of which to build more complex components.  You can already extend D strings by using strings as a component, if it its necessary.

I don't want all this extra overhead in the primitive array type.  It seems to me a classic case of feature creep, pretty soon we have something that has been designed for all but its original purpose.  I'm ok with having features like hash caching as long as they can be implemented without changing the core mechanics of the primitive.

To me its not even correct design to inherit from a concrete class (there are quite a few books on that, Effective C++ talks about it a bit and so does Herb/Alexandrescu 101 Coding standard book).  I think, there are *much better* ways to handle this sort of thing.  Personally I don't want to encourage that sort of design.

-Joel
August 26, 2008
Benji Smith wrote:
> But in this "systems language", it's a O(n) operation to get the nth character from a string, to slice a string based on character offsets, or to determine the number of characters in the string.
> 
> I'd gladly pay the price of a single interface vtable lookup to turn all of those into O(1) operations.

I've written internationalized applications that dealt with multibyte utf strings. It looks like one would regularly need all those operations, but interestingly it just doesn't come up. It turns out that one needs to slice with the byte offset, or get the byte length, or get the nth byte. In the very rare case where one wants to do it with characters, one seems to already have the right offsets at hand.

If you choose to use dchar's instead, there is a 1:1 mapping between characters and indices, and it doesn't cost you any class overhead. It's also a simple conversion from UTF-8 <==> UCS-2. I can't think of a scenario where using classes would produce any performance advantage.

August 26, 2008
Benji Smith wrote:
> For starters, with strings implemented as character arrays, writing library code that accepts and operates on strings is a bit of a pain in the neck, since you always have to write templates and template code is slightly less readable than non-template code.
> You can't distribute your code as a DLL or a shared object, because the template instantiations won't be included (unless you create wrapper functions with explicit template instantiations, bloating your code size, but more importantly tripling the number of functions in your API).

Is the problem you're referring to the fact that there are 3 character types?

> Another good low-hanging argument is that strings are frequently used as keys in associative arrays. Every insertion and retrieval in an associative array requires a hashcode computation. And since D strings are just dumb arrays, they have no way of memoizing their hashcodes.

True, but I've written a lot of string processing programs (compilers are just one example of such). This has never been an issue, because the AA itself memoizes the hash, and from then on the dictionary handle is used.


> We've already observed that D assoc arrays are less performant than even Python maps, so the extra cost of lookup operations is unwelcome.

Every one of those benchmarks that purported to show that D AA's were relatively slow turned out to be, on closer examination, D running the garbage collector more often than Python does. It had NOTHING to do with the AA's.

> But much more important than either of those reasons is the lack of polymorphism on character arrays. Arrays can't have subclasses, and they can't implement interfaces.
> 
> A good example of what I'm talking about can be seen in the Phobos and Tango regular expression engines. At least the Tango implementation matches against all string types (the Phobos one only works with char[] strings).
> 
> But what if I want to consume a 100 MB logfile, counting all lines that match a pattern?
> 
> Right now, to use the either regex engine, I have to read the entire logfile into an enormous array before invoking the regex search function.
> 
> Instead, what if there was a CharacterStream interface? And what if all the text-handling code in Phobos & Tango was written to consume and return instances of that interface?
> 
> A regex engine accepting a CharacterStream interface could process text from string literals, file input streams, socket input streams, database records, etc, etc, etc... without having to pollute the API with a bunch of casts, copies, and conversions. And my logfile processing application would consume only a tiny fraction of the memory needed by the character array implementation.
> 
> Most importantly, the contract between the regex engine and its consumers would provide a well-defined interface for processing text, regardless of the source or representation of that text.

I think a better solution is for regexp to accept an Iterator as its source. That doesn't require polymorphic behavior via inheritance, it can do polymorphism by value (which is what templates do).

> 
> Along a similar vein, I've worked on a lot of parsers over the past few years, for domain specific languages and templating engines, and stuff like that. Sometimes it'd be very handy to define a "Token" class that behaves exactly like a String, but with some additional behavior. Ideally, I'd like to implement that Token class as an implementor of the CharacterStream interface, so that it can be passed directly into other text-handling functions.
> 
> But, in D, with no polymorphic text handling, I can't do that.

Templates are the ideal solution to that, and the more specific idiom is to use iterators.


> But then again, I haven't used any of the const functionality in D2, so I can't actually comment on relative usability of compiler-enforced immutability versus interface-enforced immutability.

From my own experience, I didn't 'get' invariant strings until I'd used them for a while.
August 26, 2008
Benji Smith wrote:
> I don't know a whole lot about the STL,

STL is a piece of brilliance in C++ (and one can reasonably argue that STL saved C++). The design of STL solves the problems you are talking about.

Andrei has been hard at work getting equivalent functionality into the D library (see http://www.digitalmars.com/d/2.0/phobos/std_algorithm.html).