Kinds of containers

I'm finally getting the cycles to get to thinking about Design by Introspection containers. First off, there are several general categories of containers. I think D should support all properly. One question is which we go for in the standard library.

1. Functional containers.

These are immutable; once created, neither their topology nor their elements may be observably changed. Manipulating a container entails creating an entire new container, often based on an existing container (e.g. append takes a container and an element and creates a whole new container).

Internally, functional containers take advantage of common substructure and immutability to share actual data. The classic resource for defining and implementing functional containers is http://www.amazon.com/Purely-Functional-Structures-Chris-Okasaki/dp/0521663504.

2. Reference containers.

These have classic reference semantics (à la Java). Internally, they may be implemented either as class objects or as reference counted structs.

They're by default mutable. Qualifiers should apply to them gracefully.

3. Eager value containers.

These are STL-style. Somewhat surprisingly I think these are the worst of the pack; they expensively duplicate at the drop of a hat and need to be carefully passed around by reference lest performance silently drops. Nevertheless, when used as members inside other data structures value semantics might be the appropriate choice. Also, thinking of them as values often makes code simpler.

By default eager value containers are mutable. They should support immutable and const meaningfully.

4. Copy-on-write containers.

These combine the advantages of value and reference containers: you get to think of them as values, yet they're not expensive to copy. Copying only occurs by necessity upon the first attempt to change them.

The disadvantage is implementations get somewhat complicated. Also, they are shunned in C++ because there is no proper support for COW; for example, COW strings have been banned starting with C++11 which is quite the bummer.

Together with Scott Meyers, Walter figured out a way to change D to support COW properly. The language change consists of two attributes.

=======

I'll attempt to implement a few versions of each and see what they look like. The question here is what containers are of interest for D's standard library.


Andrei

October 21, 2015

Re: Kinds of containers

Posted by Rikki Cattermole
in reply to Andrei Alexandrescu

Permalink

Rikki Cattermole

Posted in reply to Andrei Alexandrescu

Permalink

On 22/10/15 12:05 AM, Andrei Alexandrescu wrote:
> I'm finally getting the cycles to get to thinking about Design by
> Introspection containers. First off, there are several general
> categories of containers. I think D should support all properly. One
> question is which we go for in the standard library.
>
> 1. Functional containers.
>
> These are immutable; once created, neither their topology nor their
> elements may be observably changed. Manipulating a container entails
> creating an entire new container, often based on an existing container
> (e.g. append takes a container and an element and creates a whole new
> container).
>
> Internally, functional containers take advantage of common substructure
> and immutability to share actual data. The classic resource for defining
> and implementing functional containers is
> http://www.amazon.com/Purely-Functional-Structures-Chris-Okasaki/dp/0521663504.
>
>
> 2. Reference containers.
>
> These have classic reference semantics (à la Java). Internally, they may
> be implemented either as class objects or as reference counted structs.
>
> They're by default mutable. Qualifiers should apply to them gracefully.
>
> 3. Eager value containers.
>
> These are STL-style. Somewhat surprisingly I think these are the worst
> of the pack; they expensively duplicate at the drop of a hat and need to
> be carefully passed around by reference lest performance silently drops.
> Nevertheless, when used as members inside other data structures value
> semantics might be the appropriate choice. Also, thinking of them as
> values often makes code simpler.
>
> By default eager value containers are mutable. They should support
> immutable and const meaningfully.
>
> 4. Copy-on-write containers.
>
> These combine the advantages of value and reference containers: you get
> to think of them as values, yet they're not expensive to copy. Copying
> only occurs by necessity upon the first attempt to change them.
>
> The disadvantage is implementations get somewhat complicated. Also, they
> are shunned in C++ because there is no proper support for COW; for
> example, COW strings have been banned starting with C++11 which is quite
> the bummer.
>
> Together with Scott Meyers, Walter figured out a way to change D to
> support COW properly. The language change consists of two attributes.
>
> =======
>
> I'll attempt to implement a few versions of each and see what they look
> like. The question here is what containers are of interest for D's
> standard library.
>
>
> Andrei

I've got a list and a map implementation using allocators. They are pretty primitive but they may lead to insight so links provided.

https://github.com/rikkimax/alphaPhobos/blob/master/source/std/experimental/internal/containers/list.d
https://github.com/rikkimax/alphaPhobos/blob/master/source/std/experimental/internal/containers/map.d

October 21, 2015

Re: Kinds of containers

Posted by Russel Winder
in reply to Andrei Alexandrescu

Permalink

Russel Winder

Posted in reply to Andrei Alexandrescu

Attachments:

signature.asc (This is a digitally signed message part)

Permalink

On Wed, 2015-10-21 at 07:05 -0400, Andrei Alexandrescu via Digitalmars- d wrote:
> 
[…]
> 1. Functional containers.
> 
> These are immutable; once created, neither their topology nor their
> elements may be observably changed. Manipulating a container entails
> creating an entire new container, often based on an existing
> container
> (e.g. append takes a container and an element and creates a whole new
> container).
> 
> Internally, functional containers take advantage of common
> substructure
> and immutability to share actual data. The classic resource for
> defining
> and implementing functional containers is
> http://www.amazon.com/Purely-Functional-Structures-Chris-Okasaki/dp/0
> 521663504.

I believe (currently, I am open to being re-educated) that most of the rest of the world calls these persistent data structures, despite the title of Okasaki's thesis. I am not sure I like either of the labels "functional" or "persistent", there is little affordance with the concept underneath.

> 2. Reference containers.
> 
> These have classic reference semantics (à la Java). Internally, they
> may
> be implemented either as class objects or as reference counted
> structs.
> 
> They're by default mutable. Qualifiers should apply to them gracefully.
> 
> 3. Eager value containers.
> 
> These are STL-style. Somewhat surprisingly I think these are the
> worst
> of the pack; they expensively duplicate at the drop of a hat and need
> to
> be carefully passed around by reference lest performance silently
> drops.
> Nevertheless, when used as members inside other data structures value
> semantics might be the appropriate choice. Also, thinking of them as
> values often makes code simpler.
> 
> By default eager value containers are mutable. They should support immutable and const meaningfully.

Is there a need for separate eager structures explicity? Is it not enough for all structures to be lazy with there being a pull operation?

> 4. Copy-on-write containers.
> 
> These combine the advantages of value and reference containers: you
> get
> to think of them as values, yet they're not expensive to copy.
> Copying
> only occurs by necessity upon the first attempt to change them.
> 
> The disadvantage is implementations get somewhat complicated. Also,
> they
> are shunned in C++ because there is no proper support for COW; for
> example, COW strings have been banned starting with C++11 which is
> quite
> the bummer.
> 
> Together with Scott Meyers, Walter figured out a way to change D to support COW properly. The language change consists of two attributes.
> 
> =======
> 
> I'll attempt to implement a few versions of each and see what they
> look
> like. The question here is what containers are of interest for D's
> standard library.

N-dimensional arrays, N-dimensional dynamic array, singly-linked list, doubly-linked list, hash map (aka dictionary, associative array,…), red-black tree, 2-3 tree, doubly-linked tree, tree map, b-tree. And no doubt a few others will be proposed since they are classic, appear in undergraduate courses, and are actually used for real.

However these are classic sequential data structures. We should be looking to support data structures that can be used in a multi-threaded context without having explicit locks. The obvious exemplars to indicate the way are ConcurrentHashMap in Java and NumPy array. (cf. other thread talking about these quasi-weird array data structures.) NumPy array is an opaque data structure that should never be accessed in any way other than the functions and higher-order functions from the library. It is a nice design, works well enough for data scientists, quants, bioinformatics people to think it is great, but fundamentally is seriously slow. If D could provide data structures that did NumPy/SciPy/Pandas things significantly faster than NumPy/SciPy/Pandas, there would be traction.

In this sense std.concurrent, std.parallelism, and std.datastructures (?) whilst almost certainly remaining separate, must be able to interwork.

The alternative is the sort of hack that goes on in Java where data structures are sequential but you can provide thread-safe versions. This is fine for small data structures, but is not acceptable for large ones in a multicore situation.  Hence ConcurrentHashMap, and the ilke.

Then there is the question whether to try and do the sort of thing Chapel and X10 are doing. They have the ideas of domain and places which allow for partitioned arrays that can be mapped to clusters of multicore, multiprocessor machines — Partitioned Global Address Space. IBM have now released their APGAS stuff over Hazelcast for Java users. This will begin to challenge Apache Spark (and Hadoop).

Apache Spark is in many ways like NumPy in it's architecture, an opaque data N-dimensional array operated on by functions and higher-order functions.

Obviously I am conflicted here: I use Go for networking (but mostly because I earn money running learning workshops), quite like Rust because it is the fastest language currently on my microbenchmarks, like D because it's nice, love Chapel because it is great (better than D in so many ways), sort of like X10 (because it is a bit Scala like and can be JVM or native), but end up having to work on some C++ codebases. I am not entirely sure D can compete with Chapel in HPC and increasingly the Python-verse. Laeeth and others are championing D instead of Python, but really the competition is Python/Chapel vs. Python/NumPy/SciPy/Pandas. Unless that is D can get the Pandas style data structures. Which brings my ramblings back on point.

-- 
Russel. ============================================================================= Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net 41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

October 21, 2015

Re: Kinds of containers

Posted by Andrei Alexandrescu
in reply to Russel Winder

Permalink

Andrei Alexandrescu

Posted in reply to Russel Winder

Permalink

On 10/21/2015 09:01 AM, Russel Winder via Digitalmars-d wrote:
> I believe (currently, I am open to being re-educated) that most of the
> rest of the world calls these persistent data structures, despite the
> title of Okasaki's thesis. I am not sure I like either of the labels
> "functional" or "persistent", there is little affordance with the
> concept underneath.

OK. We should go with the more widespread name.

>> By default eager value containers are mutable. They should support
>> immutable and const meaningfully.
>
> Is there a need for separate eager structures explicity? Is it not
> enough for all structures to be lazy with there being a pull operation?

Well it is but I wanted to clarify where STL-style containers fit.

> N-dimensional arrays, N-dimensional dynamic array, singly-linked list,
> doubly-linked list, hash map (aka dictionary, associative array,…),
> red-black tree, 2-3 tree, doubly-linked tree, tree map, b-tree. And no
> doubt a few others will be proposed since they are classic, appear in
> undergraduate courses, and are actually used for real.

These are orthogonal on the kinds I discussed. So we have container topologies (which you enumerate a few of) and container kinds, which concerns copying semantics of the entire containers.

> However these are classic sequential data structures.

Hashes and some of the trees would actually be associative data structures.

> We should be
> looking to support data structures that can be used in a multi-threaded
> context without having explicit locks. The obvious exemplars to
> indicate the way are ConcurrentHashMap in Java and NumPy array. (cf.
> other thread talking about these quasi-weird array data structures.)
> NumPy array is an opaque data structure that should never be accessed
> in any way other than the functions and higher-order functions from the
> library. It is a nice design, works well enough for data scientists,
> quants, bioinformatics people to think it is great, but fundamentally
> is seriously slow. If D could provide data structures that did
> NumPy/SciPy/Pandas things significantly faster than NumPy/SciPy/Pandas,
> there would be traction.

Cool thoughts, thanks. Those would have their own APIs, separate from mainstream containers.

> Then there is the question whether to try and do the sort of thing
> Chapel and X10 are doing. They have the ideas of domain and places
> which allow for partitioned arrays that can be mapped to clusters of
> multicore, multiprocessor machines — Partitioned Global Address Space.
> IBM have now released their APGAS stuff over Hazelcast for Java users.
> This will begin to challenge Apache Spark (and Hadoop).

We don't support PGAS at language level, but I'll need to look into how APGAS works with Java.

> Obviously I am conflicted here: I use Go for networking (but mostly
> because I earn money running learning workshops), quite like Rust
> because it is the fastest language currently on my microbenchmarks,
> like D because it's nice, love Chapel because it is great (better than
> D in so many ways), sort of like X10 (because it is a bit Scala like
> and can be JVM or native), but end up having to work on some C++
> codebases. I am not entirely sure D can compete with Chapel in HPC and
> increasingly the Python-verse. Laeeth and others are championing D
> instead of Python, but really the competition is Python/Chapel vs.
> Python/NumPy/SciPy/Pandas. Unless that is D can get the Pandas style
> data structures. Which brings my ramblings back on point.

My finance folks also rave about Pandas. I wish I could fork myself to look into it.

Nevertheless, what I'm working on now should help libraries such as Pandas for D. We have a large language but are unclear on recipe-style idioms for writing library code.


Andrei

October 21, 2015

Re: Kinds of containers

Posted by Andrea Fontana
in reply to Andrei Alexandrescu

Permalink

Andrea Fontana

Posted in reply to Andrei Alexandrescu

Permalink

On Wednesday, 21 October 2015 at 11:05:12 UTC, Andrei Alexandrescu wrote:
> I'll attempt to implement a few versions of each and see what they look like. The question here is what containers are of interest for D's standard library.
>
>
> Andrei

Tree (binary, quad, octo, r/b, etc...), graph (directed, undirected, sparse, weighted, etc...) have my votes. They enable a lot of algorithms to be implemented.

October 21, 2015

Re: Kinds of containers

Posted by Robert burner Schadek
in reply to Andrei Alexandrescu

Permalink

Robert burner Schadek

Posted in reply to Andrei Alexandrescu

Permalink

On Wednesday, 21 October 2015 at 11:05:12 UTC, Andrei Alexandrescu wrote:
> I'm finally getting the cycles to get to thinking about Design by Introspection containers.

What do you have in mind about the Design by Introspection (DyI, DbI is taken) for the container?

How do you plan to use the allocators? Have the container create and handle the allocator, pass it at container construction, or (YOUR IDEA HERE)?

anyway, container yuhu http://www.alfabetajuega.com/multimedia/imagenes/201310/53112.homer_yuhu_simpsons.jpg

October 21, 2015

Re: Kinds of containers

Posted by Jonathan M Davis
in reply to Andrei Alexandrescu

Permalink

Jonathan M Davis

Posted in reply to Andrei Alexandrescu

Permalink

On Wednesday, 21 October 2015 at 11:05:12 UTC, Andrei Alexandrescu wrote:
> I'm finally getting the cycles to get to thinking about Design by Introspection containers. First off, there are several general categories of containers. I think D should support all properly. One question is which we go for in the standard library.
>
> 1. Functional containers.

I fully expect that these have their place, but I honestly have no idea what they would be. When I've used functional languages, I've only ever found these attributes to be a royal pain to deal with, not a benefit. From what I've seen, containers are the sort of thing that almost always always need to be mutable to be of any real benefit. Upon occasion, constructing a container up front and then never tweaking it after that is useful, but that's pretty rare from what I've seen.

The only cases that I can think of where this could be really beneficial would be something like strings, and we're using arrays for those currently (though they are arguably functional containers given that they have immutable elements).

> 2. Reference containers.

These are where it's at IMHO. 99.999999% this is what makes sense.

> 3. Eager value containers.

_Maybe_ this makes sense for specific container types, but in general, it's a horrible idea IMHO. Almost all containers involve allocating something on the heap internally, so the fact that they're treated as values doesn't avoid heap allocations, and reference-counting reference containers solves the problem of the containers living past the end of the lifetime of whatever owns them. And having value type containers is incredibly error-prone - both from an efficiency standpoint and correctness standpoint. It's _way_ too easy to copy them when you don't mean to, and simple stuff like

vector<T> foo() {...}

for(auto iter = foo().begin(); iter != foo().end(); ++iter) {...}

ends up doing nasty things, because you have iterators (or ranges) to a container that's already been destroyed.

Honestly, unless you're counting something like a point or tuple as a container, I really don't see any value in these sort of containers. They're just too error-prone without adding any real value.

> 4. Copy-on-write containers.

This could be interesting for some applications (probably more so than functional containers), but I would expect that it would only really be useful for very specific containers. I certainly wouldn't want to end up with a copy of my vector or list because I added a value to it. COW for strings makes a lot of sense, because they don't tend to get mutated much (which is also why string is immutable(char)[]). But since we're already using arrays for strings, a COW string would then be for special cases only.

Overall, I think that we should focus on getting good reference containers while leaving the door open for cool stuff from the other container types. It's the reference containers that most programs are going to need, whereas I expect that the others are more likely to be nice-to-haves for a minority of applications and outright useless for the majority.

- Jonathan M Davis

October 21, 2015

Re: Kinds of containers

Posted by Andrei Alexandrescu
in reply to Andrea Fontana

Permalink

Andrei Alexandrescu

Posted in reply to Andrea Fontana

Permalink

On 10/21/2015 09:13 AM, Andrea Fontana wrote:
> On Wednesday, 21 October 2015 at 11:05:12 UTC, Andrei Alexandrescu wrote:
>> I'll attempt to implement a few versions of each and see what they
>> look like. The question here is what containers are of interest for
>> D's standard library.
>>
>>
>> Andrei
>
> Tree (binary, quad, octo, r/b, etc...), graph (directed, undirected,
> sparse, weighted, etc...) have my votes. They enable a lot of algorithms
> to be implemented.

As I also replied to Russel, I am now discussing semantics of container values. -- Andrei

October 21, 2015

Re: Kinds of containers

Posted by Andrei Alexandrescu
in reply to Robert burner Schadek

Permalink

Andrei Alexandrescu

Posted in reply to Robert burner Schadek

Permalink

On 10/21/2015 09:46 AM, Robert burner Schadek wrote:
> On Wednesday, 21 October 2015 at 11:05:12 UTC, Andrei Alexandrescu wrote:
>> I'm finally getting the cycles to get to thinking about Design by
>> Introspection containers.
>
> What do you have in mind about the Design by Introspection (DyI, DbI is
> taken) for the container?

Containers will obey subsets of a nomenclature of primitives.

> How do you plan to use the allocators? Have the container create and
> handle the allocator, pass it at container construction, or (YOUR IDEA
> HERE)?

Container will pick the current allocator reference, or get it as a creation parameters, and store it for the life of the container.


Andrei

October 21, 2015

Re: Kinds of containers

Posted by Timon Gehr
in reply to Andrei Alexandrescu

Permalink

Timon Gehr

Posted in reply to Andrei Alexandrescu

Permalink

On 10/21/2015 01:05 PM, Andrei Alexandrescu wrote:
> I'm finally getting the cycles to get to thinking about Design by
> Introspection containers. First off, there are several general
> categories of containers. I think D should support all properly. One
> question is which we go for in the standard library.
>
> 1. Functional containers.
>
> These are immutable; once created, neither their topology nor their
> elements may be observably changed. Manipulating a container entails
> creating an entire new container, often based on an existing container
> (e.g. append takes a container and an element and creates a whole new
> container).
>
> Internally, functional containers take advantage of common substructure
> and immutability to share actual data. The classic resource for defining
> and implementing functional containers is
> http://www.amazon.com/Purely-Functional-Structures-Chris-Okasaki/dp/0521663504.
>
> ...

I still think those should be mutable by default in order to have painless interchangeability with other value type containers. Why should corresponding ephemeral and persistent containers have different interfaces?

I assume you envision code using those to look as follows?

FunSet!int a;
a=a.with_(1);
auto b=a;
a=a.with_(2);
a=a.with_(3);
// b = {1}, a={1,2,3};

I think this should be allowed too:

FunSet!int a;
a.insert(1);
auto b=a;
a.insert(2);
a.insert(3);
// b = {1}, a={1,2,3};

One of the two versions should be automatically implemented via UFCS.

> 2. Reference containers.
>
> These have classic reference semantics (à la Java). Internally, they may be implemented either as class objects or as reference counted structs.
>
> They're by default mutable. Qualifiers should apply to them gracefully.
>
> 3. Eager value containers.
>
> These are STL-style. Somewhat surprisingly I think these are the worst of the pack; they expensively duplicate at the drop of a hat and need to be carefully passed around by reference lest performance silently drops. Nevertheless, when used as members inside other data structures value semantics might be the appropriate choice. Also, thinking of them as values often makes code simpler.
>
> By default eager value containers are mutable. They should support immutable and const meaningfully.

>
> 4. Copy-on-write containers.
>
> These combine the advantages of value and reference containers: you get
> to think of them as values, yet they're not expensive to copy. Copying
> only occurs by necessity upon the first attempt to change them.
> ...

IMO "1." ought to combine the advantages of value and reference containers as well, just without any expensive copying at all, even when updates happen.

> The disadvantage is implementations get somewhat complicated. Also, they
> are shunned in C++ because there is no proper support for COW; for
> example, COW strings have been banned starting with C++11 which is quite
> the bummer.
>
> Together with Scott Meyers, Walter figured out a way to change D to
> support COW properly. The language change consists of two attributes.
>
> =======
>
> I'll attempt to implement a few versions of each and see what they look
> like. The question here is what containers are of interest for D's
> standard library.
> ...

List:
 - forward iteration
 - bidirectional iteration

Stack:
 - basic stack
 - ordered stack [0]

Queue:
 - basic queue
 - heap

Set:
- hash set
- ordered set
- accumulating set [1]
- trie/radix tree
+ multiset versions

Map:
- hash map
- ordered map
- accumulating map [1]
- accumulating map with range update [2]
- trie/radix tree
- accumulating trie [1]
- accumulating trie with range update [2]
+ multimap versions
- array
- accumulating array [1]
- accumulating array with range update [2]
+ O(1) reset versions [3]
- rope
- accumulating rope [1]
- accumulating rope with range update [2]

[0] ordered stack: Push operations automatically pop the minimal number of elements off the stack prior to pushing, such as to guarantee that the elements on the stack remain ordered. The stack should expose a sorted range in order to support binary search.

[1] accumulation: The (ordered!) data structure allows fast queries for the result of some binary associative operations on the elements in a certain range. (the allowed operations are determined in advance and some intermediate results are automatically maintained). (for map, the operation would be just on the values, not on the keys.) This is usually quite easy support, but very useful.

[2] range update: here, the idea is that the data structure allows all elements in a certain range to be updated. the updates are performed lazily and have to be compatible with the associative operations (if any).

[3] fast reset: here the idea is that the map allows fast reset of its values at the cost of some small additional overhead per lookup. (destructors are called lazily.)

Top | Forum index | About this forum

Forums