Container insertion and removal (page 2) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » General » Container insertion and removal (page 2)

March 07, 2010

Re: Container insertion and removal

Posted by Steven Schveighoffer
in reply to Robert Jacques

Steven Schveighoffer

Posted in reply to Robert Jacques

On Sat, 06 Mar 2010 11:19:15 -0500, Robert Jacques <sandford@jhu.edu> wrote:

> On Sat, 06 Mar 2010 08:19:36 -0500, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
>>
>> How can softRemove not affect iterating ranges?  What if the range is positioned on the element removed?
>
> It always affects ranges in so far as the range and container are inconsistent, but that is a problem of softInsert as well. Removing an element from an array doesn't invalidate the underlying range, since the memory is still there. And so long as you're not trying to use free lists, linked-lists, trees, etc. can be written so that the ranges never enter an invalid state. If they happen to be pointing at the removed node, they just end up stopping early.

If my linked list range has two node pointers as the implementation, and you remove the end node, it's going to end later, not early.
Of course, you can get around this by just storing a current node and a length, but that may not be what the user is expecting.  For example, if you did a range of a sorted tree from "a" to "f" and it stops at "n", I think that would be extremely unexpected.

Stopping early is invalidation also IMO.  If your program logic depends on iterating over all the elements you originally intended to iterate, then we have big problems if they stop early.

>> The only two containers that would support softInsert would be linked list and sorted map/set.  Anything else might completely screw up the iteration.  I don't see a lot of "generic" use for it.
>
> There's all the containers based upon linked-lists, etc like hashes, stacks, queues and dequeues.

Hashes may be rehashed when inserting, completely invalidating a range (possibly the end point will be before the starting point).

Yes, the others you mentioned will be valid.  But I still don't see it being more useful than just using documentation to indicate insertion will not invalidate ranges.  Perhaps I am wrong.

>> Another option is to use a "mutation" field that is checked every chance by the range.  If it changes, then the range is invalidated.
>
> The mutation field would have to be a version number to support multiple ranges, and given experience with lock-free algorithms which use a 'tag' in a similar manner, this concept is bug prone and should not be relied upon. It would be better to 'lock' the node or container to topology changes, though this does slow things down and has no conflict resolution: removing a locked node would have to throw an exception.

I was not thinking of multithreaded applications.  I don't think it's worth making containers by default be multithreaded safe.

The mutation index has been used in Tango forever, and I think was in Doug Lea's original container implementations.  I'm pretty sure it is sound in single-threaded uses.

-Steve

March 07, 2010

Re: Container insertion and removal

Posted by Steven Schveighoffer
in reply to Andrei Alexandrescu

Steven Schveighoffer

Posted in reply to Andrei Alexandrescu

On Sat, 06 Mar 2010 14:55:49 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> Steven Schveighoffer wrote:
>> Andrei Alexandrescu Wrote:
>>
>>> In the STL world, writing container-independent code is generally shunned (see e.g. http://www.informit.com/content/images/0201749629/items/item2-2.pdf).
>>>
>>> One problem is a very small intersection between the functionalities offered by the various STL containers, and the conceptual organization that is weaker than that of iterators.
>>>
>>> A worse problem is iterator invalidation rules, something that we'll need to address too. I'm thinking that the best defense is a strong offense, and I plan to define the following naming convention:
>>>
>>> Methods such as insert, remove, pushFront, pushBack, removeFront, removeBack, are assumed to affect the container's topology and must be handled in user code as such.
>>>
>>> In addition to those, a container may also define functions named after the above by adding a "soft" prefix (e.g. softInsert, softRemove...) that are guaranteed to not affect the ranges currently iterating the container.
>>>
>>> Generic code that needs specific iterator (non-)invalidation rules can use softXxx methods, in confidence that containers not supporting it will be ruled out during compilations.
>>>
>>> Sounds good?
>>  How can softRemove not affect iterating ranges?  What if the range is positioned on the element removed?
>
> With GC, you can softRemove things without invalidating iterators.

If you define invalidation by "not pointing to unallocated memory," wouldn't normal remove (not soft remove) also result in valid memory being pointed to?  Why do we need a special soft version?

Even if this is what you mean, with allocators (which significantly speed up container insert/removal), you may be pointing to a newly valid piece of memory that may not be where you want.

I thought you meant by not invalidating that the range would iterate over the elements it was originally scheduled to iterate over, sans the removed element.

>
>> The only two containers that would support softInsert would be linked list and sorted map/set.  Anything else might completely screw up the iteration.  I don't see a lot of "generic" use for it.
>
> Singly linked lists, doubly linked lists, various trees - three's a crowd. Most node-based containers support softInsert.

Hashes may rehash on insert, invalidating any ranges (it's one of the notes of dcollections' HashMap and HashSet).  I guess it depends on if an insert alters the ordering of the nodes.  This is why I said sorted containers which should be unaffected.

You have a point that this covers a lot of containers however, but I think the usefulness of just having a softInsert (I think softRemove is not a useful concept) is pretty limited.  It limits you to

BTW, I think its a good idea to think about how to make this work, if we can come up with something that covers a lot of ground, I think that's a good thing.  I thought of making ranges slower but more robust to container changes in debug mode, does this sound useful?  I.e. if you compile in non-release mode, removing/adding an element may trigger the range to do a O(n) check to validate itself.

-Steve

March 07, 2010

Re: Container insertion and removal

Posted by Steven Schveighoffer
in reply to BLS

Steven Schveighoffer

Posted in reply to BLS

On Sat, 06 Mar 2010 19:41:57 -0500, BLS <windevguy@hotmail.de> wrote:

> ..and frankly said I wonder .. no offense.. why you are the (only) one who is creating and designing the collection library ?
> finally :
> from a pragmatic view > why create from scratch instead of reusing Stevens data-structures..

We have had private discussions, and we have fundamental differences of opinion for what belongs in containers.  It may be that through trial and error and constructive discussion we come up with the same conclusions, in which case I'm sure we will combine forces (recently, I think we both moved our points of view closer, but not quite there yet).  However, I will not stop maintaining dcollections as long as Phobos' collection solution does not implement the features I think are important.  I am in the process of porting/improving dcollections to D2, and when it is done, I think it will have some really useful features in it.

-Steve

March 07, 2010

Re: Container insertion and removal

Posted by Robert Jacques
in reply to Steven Schveighoffer

Robert Jacques

Posted in reply to Steven Schveighoffer

On Sat, 06 Mar 2010 21:54:50 -0500, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
> On Sat, 06 Mar 2010 11:19:15 -0500, Robert Jacques <sandford@jhu.edu> wrote:
>
>> On Sat, 06 Mar 2010 08:19:36 -0500, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
>>>
>>> How can softRemove not affect iterating ranges?  What if the range is positioned on the element removed?
>>
>> It always affects ranges in so far as the range and container are inconsistent, but that is a problem of softInsert as well. Removing an element from an array doesn't invalidate the underlying range, since the memory is still there. And so long as you're not trying to use free lists, linked-lists, trees, etc. can be written so that the ranges never enter an invalid state. If they happen to be pointing at the removed node, they just end up stopping early.
>
> If my linked list range has two node pointers as the implementation, and you remove the end node, it's going to end later, not early.

A linked list maps to a forward range: so the range simply points to a single internal node. If you're going to store a pointer to the end, then you should be using a doubly linked list, not a singly linked list.

> Of course, you can get around this by just storing a current node and a length, but that may not be what the user is expecting.  For example, if you did a range of a sorted tree from "a" to "f" and it stops at "n", I think that would be extremely unexpected.

By this logic, any and all forms of mutation, not just insertions and deletions must not be allowed.

> Stopping early is invalidation also IMO.  If your program logic depends on iterating over all the elements you originally intended to iterate, then we have big problems if they stop early.

Stopping early is the result of a logical error by the programmer. The code itself, however, is completely valid.

>>> The only two containers that would support softInsert would be linked list and sorted map/set.  Anything else might completely screw up the iteration.  I don't see a lot of "generic" use for it.
>>
>> There's all the containers based upon linked-lists, etc like hashes, stacks, queues and dequeues.
>
> Hashes may be rehashed when inserting, completely invalidating a range (possibly the end point will be before the starting point).

Wouldn't re-hashing necessitate re-allocation? (Thus the range would see a stale view)

> Yes, the others you mentioned will be valid.  But I still don't see it being more useful than just using documentation to indicate insertion will not invalidate ranges.  Perhaps I am wrong.

The difference is that algorithms can document in their template constraints that they need a container with 'soft' properties.

>>> Another option is to use a "mutation" field that is checked every chance by the range.  If it changes, then the range is invalidated.
>>
>> The mutation field would have to be a version number to support multiple ranges, and given experience with lock-free algorithms which use a 'tag' in a similar manner, this concept is bug prone and should not be relied upon. It would be better to 'lock' the node or container to topology changes, though this does slow things down and has no conflict resolution: removing a locked node would have to throw an exception.
>
> I was not thinking of multithreaded applications.  I don't think it's worth making containers by default be multithreaded safe.

I wasn't thinking of multi-threaded containers. I was trying to point out that version ids have failed in lock-free containers, where things are happening on the order of a single atomic op or a context switch. Given the time a range could go unused in standard code, versioning won't work.

> The mutation index has been used in Tango forever, and I think was in Doug Lea's original container implementations.  I'm pretty sure it is sound in single-threaded uses.

No it's not. version tags + integer overflow = bug. Doug Lea knew about the problem and but thought it would never happen in real code. And Bill Gates thought no one will need more than 640k of ram. They both have been proven wrong.

March 07, 2010

Re: Container insertion and removal

Posted by Andrei Alexandrescu
in reply to Steven Schveighoffer

Andrei Alexandrescu

Posted in reply to Steven Schveighoffer

Steven Schveighoffer wrote:
> On Sat, 06 Mar 2010 19:41:57 -0500, BLS <windevguy@hotmail.de> wrote:
> 
>> ..and frankly said I wonder .. no offense.. why you are the (only) one who is creating and designing the collection library ?
>> finally :
>> from a pragmatic view > why create from scratch instead of reusing Stevens data-structures..
> 
> We have had private discussions, and we have fundamental differences of opinion for what belongs in containers.  It may be that through trial and error and constructive discussion we come up with the same conclusions, in which case I'm sure we will combine forces (recently, I think we both moved our points of view closer, but not quite there yet).

I concur with Steve's characterization of the situation. My interface definitions are very crisp and my cost function looks more like a step function, so I'm difficult to negotiate with. I'm running a mile away faster than you can say "Here, there's this Container.contains() method, which..."

But Steve suggested that he could volunteer, with credit, some of his implementation code to my interfaces, even though he has disagreements regarding the exact definitions of those interfaces.

> However, I will not stop maintaining dcollections as long as Phobos' collection solution does not implement the features I think are important.  I am in the process of porting/improving dcollections to D2, and when it is done, I think it will have some really useful features in it.

That's great. This kind of competition can only be best for everyone.

Andrei

March 07, 2010

Re: Container insertion and removal

Posted by Andrei Alexandrescu
in reply to Steven Schveighoffer

Andrei Alexandrescu

Posted in reply to Steven Schveighoffer

Steven Schveighoffer wrote:
> On Sat, 06 Mar 2010 14:55:49 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
> 
>> Steven Schveighoffer wrote:
>>> Andrei Alexandrescu Wrote:
>>>
>>>> In the STL world, writing container-independent code is generally shunned (see e.g. http://www.informit.com/content/images/0201749629/items/item2-2.pdf).
>>>>
>>>> One problem is a very small intersection between the functionalities offered by the various STL containers, and the conceptual organization that is weaker than that of iterators.
>>>>
>>>> A worse problem is iterator invalidation rules, something that we'll need to address too. I'm thinking that the best defense is a strong offense, and I plan to define the following naming convention:
>>>>
>>>> Methods such as insert, remove, pushFront, pushBack, removeFront, removeBack, are assumed to affect the container's topology and must be handled in user code as such.
>>>>
>>>> In addition to those, a container may also define functions named after the above by adding a "soft" prefix (e.g. softInsert, softRemove...) that are guaranteed to not affect the ranges currently iterating the container.
>>>>
>>>> Generic code that needs specific iterator (non-)invalidation rules can use softXxx methods, in confidence that containers not supporting it will be ruled out during compilations.
>>>>
>>>> Sounds good?
>>>  How can softRemove not affect iterating ranges?  What if the range is positioned on the element removed?
>>
>> With GC, you can softRemove things without invalidating iterators.
> 
> If you define invalidation by "not pointing to unallocated memory," wouldn't normal remove (not soft remove) also result in valid memory being pointed to?  Why do we need a special soft version?

Well consider an array of File objects, which are reference counted. I'd want the resizing of an array to not increment the reference count of the files. That means that resizing leaves the old array without meaningful content by calling clear() against the remaining files.

> Even if this is what you mean, with allocators (which significantly speed up container insert/removal), you may be pointing to a newly valid piece of memory that may not be where you want.
> 
> I thought you meant by not invalidating that the range would iterate over the elements it was originally scheduled to iterate over, sans the removed element.

For remove that should be the case, but probably not for insert.

>>> The only two containers that would support softInsert would be linked list and sorted map/set.  Anything else might completely screw up the iteration.  I don't see a lot of "generic" use for it.
>>
>> Singly linked lists, doubly linked lists, various trees - three's a crowd. Most node-based containers support softInsert.
> 
> Hashes may rehash on insert, invalidating any ranges (it's one of the notes of dcollections' HashMap and HashSet).  I guess it depends on if an insert alters the ordering of the nodes.  This is why I said sorted containers which should be unaffected.
> 
> You have a point that this covers a lot of containers however, but I think the usefulness of just having a softInsert (I think softRemove is not a useful concept) is pretty limited.  It limits you to

Iterator invalidation has been a notion thoroughly explored by the STL and a commonly-mentioned liability of STL iterators. People find it very jarring that syntactically identical interfaces have distinct effects on iterators, it's a dependency very difficult to track. I'm glad that that experience has already been accumulated and that we can build upon it.

> BTW, I think its a good idea to think about how to make this work, if we can come up with something that covers a lot of ground, I think that's a good thing.  I thought of making ranges slower but more robust to container changes in debug mode, does this sound useful?  I.e. if you compile in non-release mode, removing/adding an element may trigger the range to do a O(n) check to validate itself.

I think debug vs. release should not affect complexity.


Andrei

March 07, 2010

Re: Container insertion and removal

Posted by Fawzi Mohamed
in reply to Steven Schveighoffer

Fawzi Mohamed

Posted in reply to Steven Schveighoffer

I am coming late into this discussion, some comments:

I like a duck typing approach whenever possible, and don't need runtime adaptability (which seems to be in line with others)

different container have different trade offs, and switching between them should be a design choice

Generic algorithm in my opinion are useful to easily (and with little well tested code) give a wise range interface to several containers.
They should not "forbid" special implementation of some algorithms in a container.
To have a common interface one can always use a trampoline function, that looks for the specialized implementation in various places.
I know that in theory one should be able to write specialized templates instead, but in practice I have often encountered bugs and strange behavior, in container implementation seems much cleaner to me

Different containers have different "threadsafeness". Normally some subset of operations is threadsafe, but mixing in other might break it. This in my opinion *must* be documented, for example one should not have to look to the source to know if addition of an element to an AA breaks a foreach loop or not.
Threadsafeness has a cost, in general non threadsafe structures are faster, so it makes sense to have them (especially thinking that often even in the non threadsafe structures one can use a subset of operations concurrently without problems.
There is a large class of structures that *are* threadsafe, these have some beautiful properties, and are very useful to know. A very good review of them (I don't say introduction because the book is quite advanced) is "Purely Functional Data Structures" by Chris Okasaki. In these structures the values read are always the same, but lazy evaluation might be used to transform amortized cost in non amortized cost.

I had already spoken about these back during the const discussion, because I would have liked that these structures were "immutable" and usable by pure functions, which meant a slighlty different thing from immutable. Some casting might be used to implement them, if one assumes that casting a mutable structure to immutable and modifying it later is doable (no compiler optimization on immutable disallow it, this could be done even just for a special "thunk" that contain a lazy function that will produce the value). Anyway that is another discussion.

For the current discussion the important thing is that yes, there are fully threadsafe structures, and it would be useful to have them, but in general they have a slightly larger cost.
Not safe structures are also useful as they are faster, in general in the documentation one expects to find which operations are safe and which not.
The minima baseline typically is that just reading is threadsafe, modifications in general aren't and might invalidate all pointers to the structure.
Exceptions to this should be clearly stated, for example appending to an array can be made threadsafe without a large cost, whereas insertion (if the update of an element is not atomic) cannot.

If D really wants to go after multithreading it makes sense to offer as many safe structures as possible, and in any case clearly document the non threadsafeness of the others.

it might make sense to give "safe" interfaces (I would say basically just reading without updates), but one cannot expect to really express all threadsafeness of a container though interfaces, because that would be too complicated and then one would end up with one container per interface without much gain.

Fawzi

March 07, 2010

Re: Container insertion and removal

Posted by Steven Schveighoffer
in reply to Robert Jacques

Steven Schveighoffer

Posted in reply to Robert Jacques

Robert Jacques Wrote:

> On Sat, 06 Mar 2010 21:54:50 -0500, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
> > On Sat, 06 Mar 2010 11:19:15 -0500, Robert Jacques <sandford@jhu.edu> wrote:
> >
> >> On Sat, 06 Mar 2010 08:19:36 -0500, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
> >>>
> >>> How can softRemove not affect iterating ranges?  What if the range is positioned on the element removed?
> >>
> >> It always affects ranges in so far as the range and container are inconsistent, but that is a problem of softInsert as well. Removing an element from an array doesn't invalidate the underlying range, since the memory is still there. And so long as you're not trying to use free lists, linked-lists, trees, etc. can be written so that the ranges never enter an invalid state. If they happen to be pointing at the removed node, they just end up stopping early.
> >
> > If my linked list range has two node pointers as the implementation, and you remove the end node, it's going to end later, not early.
> 
> A linked list maps to a forward range: so the range simply points to a single internal node. If you're going to store a pointer to the end, then you should be using a doubly linked list, not a singly linked list.

How do you know when to stop?  A range has a beginning and an ending, otherwise it's an iterator.  Whether you store it via a pointer to the last (not-to-be-iterated) node, or via a length, or via a value to compare with for stopping, you have to use something.  Or are you asserting the only useful range for a linked list iterates the entire list?

Think of it as the equivalent of a slice of an array.

> 
> > Of course, you can get around this by just storing a current node and a length, but that may not be what the user is expecting.  For example, if you did a range of a sorted tree from "a" to "f" and it stops at "n", I think that would be extremely unexpected.
> 
> By this logic, any and all forms of mutation, not just insertions and deletions must not be allowed.

Allowed, yes.  Not invalidating iterators, no.

> 
> > Stopping early is invalidation also IMO.  If your program logic depends on iterating over all the elements you originally intended to iterate, then we have big problems if they stop early.
> 
> Stopping early is the result of a logical error by the programmer. The code itself, however, is completely valid.

I still fail to see the difference between "soft" operations and non-soft.  What does soft guarantee?  Give me a concrete definition, an example would help too.

> >>> The only two containers that would support softInsert would be linked list and sorted map/set.  Anything else might completely screw up the iteration.  I don't see a lot of "generic" use for it.
> >>
> >> There's all the containers based upon linked-lists, etc like hashes, stacks, queues and dequeues.
> >
> > Hashes may be rehashed when inserting, completely invalidating a range (possibly the end point will be before the starting point).
> 
> Wouldn't re-hashing necessitate re-allocation? (Thus the range would see a stale view)

God no.  If my hash collision solution is linked-list based (which it is in dcollections), why should I reallocate all those nodes?  I just rearrange them in a new bucket array.

> 
> > Yes, the others you mentioned will be valid.  But I still don't see it being more useful than just using documentation to indicate insertion will not invalidate ranges.  Perhaps I am wrong.
> 
> The difference is that algorithms can document in their template constraints that they need a container with 'soft' properties.

What is the advantage?  Why would an algorithm require soft functions?  What is an example of such an algorithm?

> 
> >>> Another option is to use a "mutation" field that is checked every chance by the range.  If it changes, then the range is invalidated.
> >>
> >> The mutation field would have to be a version number to support multiple ranges, and given experience with lock-free algorithms which use a 'tag' in a similar manner, this concept is bug prone and should not be relied upon. It would be better to 'lock' the node or container to topology changes, though this does slow things down and has no conflict resolution: removing a locked node would have to throw an exception.
> >
> > I was not thinking of multithreaded applications.  I don't think it's worth making containers by default be multithreaded safe.
> 
> I wasn't thinking of multi-threaded containers. I was trying to point out that version ids have failed in lock-free containers, where things are happening on the order of a single atomic op or a context switch. Given the time a range could go unused in standard code, versioning won't work.

Are you trying to say that if you don't use your range for exactly 2^32 mutations, it could mistakenly think the range is still valid?  That's a valid, but very very weak point.

> 
> > The mutation index has been used in Tango forever, and I think was in Doug Lea's original container implementations.  I'm pretty sure it is sound in single-threaded uses.
> 
> No it's not. version tags + integer overflow = bug. Doug Lea knew about the problem and but thought it would never happen in real code. And Bill Gates thought no one will need more than 640k of ram. They both have been proven wrong.

Overflow != bug.  Wrapping completely to the same value == bug, but is so unlikely, it's worth the possibility.

-Steve

March 07, 2010

Re: Container insertion and removal

Posted by Fawzi Mohamed
in reply to Steven Schveighoffer

Fawzi Mohamed

Posted in reply to Steven Schveighoffer

On 2010-03-07 14:23:03 +0100, Steven Schveighoffer <schveiguy@yahoo.com> said:

> Robert Jacques Wrote:
> 
>> On Sat, 06 Mar 2010 21:54:50 -0500, Steven Schveighoffer
>> <schveiguy@yahoo.com> wrote:
>>> On Sat, 06 Mar 2010 11:19:15 -0500, Robert Jacques <sandford@jhu.edu>
>>> wrote:
>>> 
>>>> On Sat, 06 Mar 2010 08:19:36 -0500, Steven Schveighoffer
>>>> <schveiguy@yahoo.com> wrote:
>>>>> 
>>>>> How can softRemove not affect iterating ranges?  What if the range is
>>>>> positioned on the element removed?
>>>> 
>>>> It always affects ranges in so far as the range and container are
>>>> inconsistent, but that is a problem of softInsert as well. Removing an
>>>> element from an array doesn't invalidate the underlying range, since
>>>> the memory is still there. And so long as you're not trying to use free
>>>> lists, linked-lists, trees, etc. can be written so that the ranges
>>>> never enter an invalid state. If they happen to be pointing at the
>>>> removed node, they just end up stopping early.
>>> 
>>> If my linked list range has two node pointers as the implementation, and
>>> you remove the end node, it's going to end later, not early.
>> 
>> A linked list maps to a forward range: so the range simply points to a
>> single internal node. If you're going to store a pointer to the end, then
>> you should be using a doubly linked list, not a singly linked list.
> 
> How do you know when to stop?  A range has a beginning and an ending, otherwise it's an iterator.  Whether you store it via a pointer to the last (not-to-be-iterated) node, or via a length, or via a value to compare with for stopping, you have to use something.  Or are you asserting the only useful range for a linked list iterates the entire list?
> 
> Think of it as the equivalent of a slice of an array.
> 
>> 
>>> Of course, you can get around this by just storing a current node and a
>>> length, but that may not be what the user is expecting.  For example, if
>>> you did a range of a sorted tree from "a" to "f" and it stops at "n", I
>>> think that would be extremely unexpected.
>> 
>> By this logic, any and all forms of mutation, not just insertions and
>> deletions must not be allowed.
> 
> Allowed, yes.  Not invalidating iterators, no.

I agree with this point of view, it makes sense to make some operations invalidating iteration, and even there there is a hierarcy of "invalidation":

1) all iterator are still valid, and iterate on the same set that they were iterating before. This is the best thing from a user, and is what persistent structures do, all the structures described in the book I gave have this property.
To achieve this you need to basically *always* create a copy when modifying a structure, but fortunately often there are ways to share most of the structure with the old value. Still this has a cost, and normally puts pressure on the memory/gc (all the nodes, or groups of them have to be single objects for the gc).

2) iterators are still valid, but they might iterate on a different set of elements than originally planned.
Array append with atomic update of elements gives this for example, removal in general has more problems, even if the memory might still be valid. Invariants of the array might be violated (for example strict ordering), one could separate this in two sub points, invariant violating and non, but probably it is not worth the effort.

3) iterators are fully invalid, and at worst they might iterate on completely wrong/invalid data without detection of failure, hopefully at least in debug mode detection should be an option.

>>> Stopping early is invalidation also IMO.  If your program logic depends
>>> on iterating over all the elements you originally intended to iterate,
>>> then we have big problems if they stop early.
>> 
>> Stopping early is the result of a logical error by the programmer. The
>> code itself, however, is completely valid.
> 
> I still fail to see the difference between "soft" operations and non-soft.  What does soft guarantee?  Give me a concrete definition, an example would help too.

I suppose soft should guarantee either 1 or 2, from what andrei was saying I suppose he wants 2

This information is very important for the user for example to iterate +manipulate (addition/removal), one doesn't necessarily need to do a copy with all containers.

In general one wants to have clean interface, + just one or two "good" implementations, I don't feel that there is a strong need to have many implementations of the basic containers in a standard library, (the other would be just implemented for a specific application, if needed). Then there might be different containers (db, distributed hashes, xml,...), and it would be nice if those can be as close as possible in some aspects of the api to the standard containers.
Again here generic algorithms are useful to augment the usefulness of special containers with a minimal coding effort, but should not "disallow" ad-hoc implementation in the container.

March 07, 2010

Re: Container insertion and removal

Posted by Robert Jacques
in reply to Steven Schveighoffer

Robert Jacques

Posted in reply to Steven Schveighoffer

On Sun, 07 Mar 2010 08:23:03 -0500, Steven Schveighoffer <schveiguy@yahoo.com> wrote:

> Robert Jacques Wrote:
>
>> On Sat, 06 Mar 2010 21:54:50 -0500, Steven Schveighoffer
>> <schveiguy@yahoo.com> wrote:
>> > On Sat, 06 Mar 2010 11:19:15 -0500, Robert Jacques <sandford@jhu.edu>
>> > wrote:
>> >
>> >> On Sat, 06 Mar 2010 08:19:36 -0500, Steven Schveighoffer
>> >> <schveiguy@yahoo.com> wrote:
>> >>>
>> >>> How can softRemove not affect iterating ranges?  What if the range  
>> is
>> >>> positioned on the element removed?
>> >>
>> >> It always affects ranges in so far as the range and container are
>> >> inconsistent, but that is a problem of softInsert as well. Removing  
>> an
>> >> element from an array doesn't invalidate the underlying range, since
>> >> the memory is still there. And so long as you're not trying to use  
>> free
>> >> lists, linked-lists, trees, etc. can be written so that the ranges
>> >> never enter an invalid state. If they happen to be pointing at the
>> >> removed node, they just end up stopping early.
>> >
>> > If my linked list range has two node pointers as the implementation,  
>> and
>> > you remove the end node, it's going to end later, not early.
>>
>> A linked list maps to a forward range: so the range simply points to a
>> single internal node. If you're going to store a pointer to the end, then
>> you should be using a doubly linked list, not a singly linked list.
>
> How do you know when to stop?  A range has a beginning and an ending, otherwise it's an iterator.  Whether you store it via a pointer to the last (not-to-be-iterated) node, or via a length, or via a value to compare with for stopping, you have to use something.  Or are you asserting the only useful range for a linked list iterates the entire list?
>
> Think of it as the equivalent of a slice of an array.

Please define for me an O(1) slice or index operation for a linked-list. The only way of doing this is to search the entire list in order, comparing for the search terms or counting the number of nodes passed. The range itself can do this just as easily as the host container, if you really want this functionality (I'd argue this isn't a valid list operation). For simple list[5..10] operations its definitely more efficient and then the range could even throw an exception when it reaches the end of the list before finding the end slice value (due to list topology manipulation). The efficiency of more complex ranges, like list["apple".."oranges"], is still much more efficient for the single range case. Even when you start to take many slices, cache effects can marginalize a lot of the cost.

>>
>> > Of course, you can get around this by just storing a current node and  
>> a
>> > length, but that may not be what the user is expecting.  For example,  
>> if
>> > you did a range of a sorted tree from "a" to "f" and it stops at "n",  
>> I
>> > think that would be extremely unexpected.
>>
>> By this logic, any and all forms of mutation, not just insertions and
>> deletions must not be allowed.
>
> Allowed, yes.  Not invalidating iterators, no.

If I change a node in a sorted tree from 'a' to 'n', the node moves, changing the tree topology. Any range currently at the node would continue to look for the 'f' node, and iterate the rest of the tree erroneously. By the way, having ranges detect if they reach their end nodes or not is fairly easy to do.

>>
>> > Stopping early is invalidation also IMO.  If your program logic  
>> depends
>> > on iterating over all the elements you originally intended to iterate,
>> > then we have big problems if they stop early.
>>
>> Stopping early is the result of a logical error by the programmer. The
>> code itself, however, is completely valid.
>
> I still fail to see the difference between "soft" operations and non-soft.  What does soft guarantee?  Give me a concrete definition, an example would help too.

There are a couple of possible definitions for soft operations: 1) the memory safety of the ranges of a collection are guaranteed. 2) That for the topology viewed by a range isn't logically changed. i.e. the range will continue to perform the same logical function if the topology its operating on is updated 3) That for the topology viewed by a range isn't actually changed and all elements selected at range creation will be viewed. 4) Like 3, but with all values being viewed.

For example, modifying an array in any way doesn't change 1, 2 or 3 for any of its slices.
For a linked list defining a forward range, mutation, insertion and removal can be done under 1 & 2.
The same can be said about doubly linked lists and bidirectional ranges.
For other containers, such as a sorted tree, mutation can break a 2/3 though insertion and deletion don't break 2.
Although, the ranges will see many values, they may not see all the values currently in the collection nor all the values in the collection when the iterator was generated. So code that relies on such properties would be logically invalid.

I'd probably define hard ops as being 1) and soft ops at level 2. 4) is really only possible with immutable containers.

>> >>> The only two containers that would support softInsert would be  
>> linked
>> >>> list and sorted map/set.  Anything else might completely screw up  
>> the
>> >>> iteration.  I don't see a lot of "generic" use for it.
>> >>
>> >> There's all the containers based upon linked-lists, etc like hashes,
>> >> stacks, queues and dequeues.
>> >
>> > Hashes may be rehashed when inserting, completely invalidating a range
>> > (possibly the end point will be before the starting point).
>>
>> Wouldn't re-hashing necessitate re-allocation? (Thus the range would see a
>> stale view)
>
> God no.  If my hash collision solution is linked-list based (which it is in dcollections), why should I reallocate all those nodes?  I just rearrange them in a new bucket array.

Sorry, I was assuming that if you were going to implement a hash collection you wouldn't be using a linked list approach, since that's what D's associative arrays already do. The are some really good reasons to not use a list based hash in D due to GC false pointer issues, but basically none to re-implementing (poorly?) D's built-in data structure.

>>
>> > Yes, the others you mentioned will be valid.  But I still don't see it
>> > being more useful than just using documentation to indicate insertion
>> > will not invalidate ranges.  Perhaps I am wrong.
>>
>> The difference is that algorithms can document in their template
>> constraints that they need a container with 'soft' properties.
>
> What is the advantage?  Why would an algorithm require soft functions?  What is an example of such an algorithm?

Something that uses toUpperCase or toLowerCase, for example.

>>
>> >>> Another option is to use a "mutation" field that is checked every
>> >>> chance by the range.  If it changes, then the range is invalidated.
>> >>
>> >> The mutation field would have to be a version number to support
>> >> multiple ranges, and given experience with lock-free algorithms which
>> >> use a 'tag' in a similar manner, this concept is bug prone and should
>> >> not be relied upon. It would be better to 'lock' the node or  
>> container
>> >> to topology changes, though this does slow things down and has no
>> >> conflict resolution: removing a locked node would have to throw an
>> >> exception.
>> >
>> > I was not thinking of multithreaded applications.  I don't think it's
>> > worth making containers by default be multithreaded safe.
>>
>> I wasn't thinking of multi-threaded containers. I was trying to point out
>> that version ids have failed in lock-free containers, where things are
>> happening on the order of a single atomic op or a context switch. Given
>> the time a range could go unused in standard code, versioning won't work.
>
> Are you trying to say that if you don't use your range for exactly 2^32 mutations, it could mistakenly think the range is still valid?  That's a valid, but very very weak point.

Umm, no. That is a valid point that happens in production code to disastrous effects. Worse, it doesn't happen often and there's no good unit test for it. I for one, never want to debug a program that only glitches after days of intensive use in a live environment with real customer data. Integer overflow bugs like this are actually one of the few bugs that have ever killed anyone.

>>
>> > The mutation index has been used in Tango forever, and I think was in
>> > Doug Lea's original container implementations.  I'm pretty sure it is
>> > sound in single-threaded uses.
>>
>> No it's not. version tags + integer overflow = bug. Doug Lea knew about
>> the problem and but thought it would never happen in real code. And Bill
>> Gates thought no one will need more than 640k of ram. They both have been
>> proven wrong.
>
> Overflow != bug.  Wrapping completely to the same value == bug, but is so unlikely, it's worth the possibility.

Statistics 101: do a test enough times and even the highly improbable will happen.

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation