October 03, 2011
On 2011-10-03 15:57, Robert Jacques wrote:
> So, in essence, you are saying that by the time archiving occurs,
> isSliceOf will always return false? Then why is it part of the public API?

No, I'm not saying that. Example:

struct Array
{
	void* ptr;
	size_t length;
	size_t elementSize;
	
	bool isSliceOf (Array b)
	{
		return ptr >= b.ptr && ptr + length * elementSize <= b.ptr + b.length * b.elementSize;
	}
}

void main ()
{
	auto a = [0, 1, 2, 3, 4];
	auto b = a[2 .. 4];
	
	auto aArr = Array(a.ptr, a.length, 4);
	auto bArr = Array(b.ptr, b.length, 4);
	
	assert(bArr.isSliceOf(aArr));
	assert(!aArr.isSliceOf(bArr));
}

Both the asserts in the above code passes as expected. See, no serialization or archiving in sight. Is there something I'm missing?

Actually it does not need to be part of the public API when I think about it. I can move it into Serializer. Array would still need to be public since both Serailzer and Archive need access to it and the package attribute doesn't work very well.

>>> The solution, in my mind, is to think in terms of memory blocks/chucks.
>>> Every reference can be thought as pointing to a memory chunk defined by
>>> two pointers and a flag:
>>>
>>> {
>>> void* head; // Start of the memory chunk
>>> void* tail; // End of the memory chunk
>>> bool hasAliases; // True if there are more than one reference to this
>>> chunk
>>> }
>>>
>>> For alias detection / resolution, you build a balanced tree of memory
>>> chunks, widening the chunk and flagging hasAliases as appropriate. Which
>>> should give you O(N log(N)) performance.
>>
>> I'm not sure I understand. That would require that the arrays are stored
>> in a continues block of memory? Won't "head" and "tail" always point to
>> start of the array and the end of the array?
>
> Most of the time yes. But not all of the time. It would be fair to say
> that 'head' and 'tail' would be inside the GC memory region of the array
> and are <= or >= of the start/end of an array, respectively. More
> importantly, both objects and pointers would resolve to memory chunks as
> well and be included in the alias resolution algorithm.

Now I think I start to understand.

> I'm assuming you
> currently separate object and array alias resolution and don't handle
> pointers at all.

Yes. Pointers are handled as well. It's handled similar to arrays/slices. First it always serializes what the pointer points to. Then in a post process step, after all serialization is done, it replaces all serialized pointers, that points to a value that has been serialized, with a reference. If a given pointer doesn't point to a value that has be serialized it's left as is.

I can see if I this memory chunk approach can be used instead. How will this be used with the balanced tree, could you give a simple example?

-- 
/Jacob Carlborg
October 04, 2011
On Mon, 03 Oct 2011 14:10:52 -0400, Jacob Carlborg <doob@me.com> wrote:
> On 2011-10-03 15:57, Robert Jacques wrote:
>> So, in essence, you are saying that by the time archiving occurs,
>> isSliceOf will always return false? Then why is it part of the public API?
>
> No, I'm not saying that. Example:
>
> struct Array
> {
> 	void* ptr;
> 	size_t length;
> 	size_t elementSize;
> 	
> 	bool isSliceOf (Array b)
> 	{
> 		return ptr >= b.ptr && ptr + length * elementSize <= b.ptr + b.length
> * b.elementSize;
> 	}
> }
>
> void main ()
> {
> 	auto a = [0, 1, 2, 3, 4];
> 	auto b = a[2 .. 4];
> 	
> 	auto aArr = Array(a.ptr, a.length, 4);
> 	auto bArr = Array(b.ptr, b.length, 4);
> 	
> 	assert(bArr.isSliceOf(aArr));
> 	assert(!aArr.isSliceOf(bArr));
> }
>
> Both the asserts in the above code passes as expected. See, no
> serialization or archiving in sight. Is there something I'm missing?

That putting isSliceOf in the public API, implies its usage by the archiver.

> Actually it does not need to be part of the public API when I think
> about it. I can move it into Serializer. Array would still need to be
> public since both Serailzer and Archive need access to it and the
> package attribute doesn't work very well.

Regarding design, I agree, although I'd go one further and define Array as a public type inside the Serializer class. However, this concept of an 'Array' is fundamentally flawed. Consider:

auto c = a[1..3];

auto cArr = Array(c.ptr,c.length,4);

assert(!cArr.isSliceOf(bArr));
assert(!bArr.isSliceOf(cArr));

// and

b ~= 5;

bArr = Array(b.ptr, b.length, 4);

assert(!bArr.isSliceOf(aArr));

In short, a serializer must be capable of handling overlapping arrays, not just strict slices. The array representation therefore needs to parameterize both the underlying array common to all slices and the actual slice that was serialized for that key.

>>>> The solution, in my mind, is to think in terms of memory blocks/chucks.
>>>> Every reference can be thought as pointing to a memory chunk defined by
>>>> two pointers and a flag:
>>>>
>>>> {
>>>> void* head; // Start of the memory chunk
>>>> void* tail; // End of the memory chunk
>>>> bool hasAliases; // True if there are more than one reference to this
>>>> chunk
>>>> }
>>>>
>>>> For alias detection / resolution, you build a balanced tree of memory
>>>> chunks, widening the chunk and flagging hasAliases as appropriate. Which
>>>> should give you O(N log(N)) performance.
>>>
>>> I'm not sure I understand. That would require that the arrays are stored
>>> in a continues block of memory? Won't "head" and "tail" always point to
>>> start of the array and the end of the array?
>>
>> Most of the time yes. But not all of the time. It would be fair to say
>> that 'head' and 'tail' would be inside the GC memory region of the array
>> and are <= or >= of the start/end of an array, respectively. More
>> importantly, both objects and pointers would resolve to memory chunks as
>> well and be included in the alias resolution algorithm.
>
> Now I think I start to understand.
>
>> I'm assuming you
>> currently separate object and array alias resolution and don't handle
>> pointers at all.
>
> Yes. Pointers are handled as well. It's handled similar to
> arrays/slices. First it always serializes what the pointer points to.
> Then in a post process step, after all serialization is done, it
> replaces all serialized pointers, that points to a value that has been
> serialized, with a reference. If a given pointer doesn't point to a
> value that has be serialized it's left as is.

Can a pointer point to the interior of an object? To an element of an array?

> I can see if I this memory chunk approach can be used instead. How will
> this be used with the balanced tree, could you give a simple example?

Well, balanced trees need a comparison function so:

struct Node {
    void* head; // Start of the memory chunk
    void* tail; // End of the memory chunk
    bool hasAliases; // True if there are more than one reference to this chunk
    //... Other meta-data, i.e. ID,

    int opCmp(const ref Node b) {
        if(  tail < b.head) return -1;
        if(b.tail <   head) return 1;
        return 0;
    }
}

On equality / assignment, one just has to combine the heads and tail with min/max, respectively, and update hasAliases, etc. The difficulty is when a new node 'bridges the gap' between two existing nodes. This has to handled explicitly as part of the tree re-balancing, but you may want to consider making the merging of nodes part of the comparison operator for simplicity/efficiency:

head = min(head,b.head);
tail = max(tail,b.tail);
hasAliases = true;

After pruning, updating meta-data, etc, the aliased memory chunk for any given pointer can be found using a separate comparison operator:

    int opCmp(const ref void* b) {
        if(tail < b   ) return -1;
        if(b    < head) return 1;
        return 0;
    }

// Which you'd probably use like:

if( auto node = arr.ptr in setOfAliases ) {
    auto offset = arr.ptr - node.head;
    //...
} else {//...


So at the pseudo-code level, it would be something like:

foreach(obj; [objects,pointers,arrays]) {
   auto node = Node(obj);
   setOfAliases.insert(node);
   // Convert obj into an intermediate form
}

setOfAliases.pruneUnaliasedNodes;
setOfAliases.postProcessMetadata; // i.e. assign unique alias ID, etc

foreach(obj; [objects,pointers,arrays]) {
    if( auto node = arr.ptr in setOfAliases ) {
        auto offset = arr.ptr - node.head;
        // Perform an aliased archival
    } else {
        // Perform an unaliased archival
    }
}

I hope that helped, though I'm not sure if I really answered your question or not.
October 04, 2011
On 2011-10-04 07:21, Robert Jacques wrote:
> That putting isSliceOf in the public API, implies its usage by the
> archiver.

Ok, then I'll remove it.

>> Actually it does not need to be part of the public API when I think
>> about it. I can move it into Serializer. Array would still need to be
>> public since both Serailzer and Archive need access to it and the
>> package attribute doesn't work very well.
>
> Regarding design, I agree, although I'd go one further and define Array
> as a public type inside the Serializer class. However, this concept of
> an 'Array' is fundamentally flawed. Consider:
>
> auto c = a[1..3];
>
> auto cArr = Array(c.ptr,c.length,4);
>
> assert(!cArr.isSliceOf(bArr));
> assert(!bArr.isSliceOf(cArr));

Any suggestion how to fix this, how to properly detect if an array is a slice of some other array?

> // and
>
> b ~= 5;
>
> bArr = Array(b.ptr, b.length, 4);
>
> assert(!bArr.isSliceOf(aArr));

Appending to "b" will reallocate "b" making it a regular array and not a slice:

b ~= 5;
b[] = 100;
assert(a == [0, 1, 2, 3, 4]);

"a" is not modified and the assert passes.

> Can a pointer point to the interior of an object? To an element of an
> array?

Yes, have a look at: https://github.com/jacob-carlborg/orange/blob/master/tests/Pointer.d

Not sure about the array part though, but I think so. Each element of an array is serialized just like all the other types.

>> I can see if I this memory chunk approach can be used instead. How will
>> this be used with the balanced tree, could you give a simple example?
>
> Well, balanced trees need a comparison function so:
>
> struct Node {
> void* head; // Start of the memory chunk
> void* tail; // End of the memory chunk
> bool hasAliases; // True if there are more than one reference to this chunk
> //... Other meta-data, i.e. ID,
>
> int opCmp(const ref Node b) {
> if( tail < b.head) return -1;
> if(b.tail < head) return 1;
> return 0;
> }
> }
>
> On equality / assignment, one just has to combine the heads and tail
> with min/max, respectively, and update hasAliases, etc. The difficulty
> is when a new node 'bridges the gap' between two existing nodes. This
> has to handled explicitly as part of the tree re-balancing, but you may
> want to consider making the merging of nodes part of the comparison
> operator for simplicity/efficiency:
>
> head = min(head,b.head);
> tail = max(tail,b.tail);
> hasAliases = true;
>
> After pruning, updating meta-data, etc, the aliased memory chunk for any
> given pointer can be found using a separate comparison operator:
>
> int opCmp(const ref void* b) {
> if(tail < b ) return -1;
> if(b < head) return 1;
> return 0;
> }
>
> // Which you'd probably use like:
>
> if( auto node = arr.ptr in setOfAliases ) {
> auto offset = arr.ptr - node.head;
> //...
> } else {//...
>
>
> So at the pseudo-code level, it would be something like:
>
> foreach(obj; [objects,pointers,arrays]) {
> auto node = Node(obj);
> setOfAliases.insert(node);
> // Convert obj into an intermediate form
> }
>
> setOfAliases.pruneUnaliasedNodes;
> setOfAliases.postProcessMetadata; // i.e. assign unique alias ID, etc
>
> foreach(obj; [objects,pointers,arrays]) {
> if( auto node = arr.ptr in setOfAliases ) {
> auto offset = arr.ptr - node.head;
> // Perform an aliased archival
> } else {
> // Perform an unaliased archival
> }
> }
>
> I hope that helped, though I'm not sure if I really answered your
> question or not.

Yes, I think it at least helped somewhat. The thing is that I'm not very familiar with using trees like these.

Would this something similar to: https://github.com/jacob-carlborg/orange/blob/master/orange/serialization/Serializer.d#L1520 ?

What is the advantage with using a tree? Is the advantage that you loop over the elements once in the pseudo-code compared to that I loop over them twice, as in: https://github.com/jacob-carlborg/orange/blob/master/orange/serialization/Serializer.d#L1495 ?

-- 
/Jacob Carlborg
October 04, 2011
On Tue, 04 Oct 2011 03:22:35 -0400, Jacob Carlborg <doob@me.com> wrote:
> On 2011-10-04 07:21, Robert Jacques wrote:

[snip]

>>> Actually it does not need to be part of the public API when I think
>>> about it. I can move it into Serializer. Array would still need to be
>>> public since both Serailzer and Archive need access to it and the
>>> package attribute doesn't work very well.
>>
>> Regarding design, I agree, although I'd go one further and define Array
>> as a public type inside the Serializer class. However, this concept of
>> an 'Array' is fundamentally flawed. Consider:
>>
>> auto c = a[1..3];
>>
>> auto cArr = Array(c.ptr,c.length,4);
>>
>> assert(!cArr.isSliceOf(bArr));
>> assert(!bArr.isSliceOf(cArr));
>
> Any suggestion how to fix this, how to properly detect if an array is a
> slice of some other array?

Well, there are two problems. First, there is an API issue: When you pass an aliased array to the archiver, you must pass both an array and a slice of that array.
Second, is detection. The comparison operator I suggested below covers this issue. Just remember that with a loop based approach, you can't terminate the search early: you have to check all existing arrays for possible matches.

>> // and
>>
>> b ~= 5;
>>
>> bArr = Array(b.ptr, b.length, 4);
>>
>> assert(!bArr.isSliceOf(aArr));
>
> Appending to "b" will reallocate "b" making it a regular array and not a
> slice:
>
> b ~= 5;
> b[] = 100;
> assert(a == [0, 1, 2, 3, 4]);
>
> "a" is not modified and the assert passes.

I'm sorry, you're right. In my mind b extended to the end of the a array, for some reason. However, if you do define b to extend to the end of the a array, then it can append without allocating:

	auto a = [0, 1, 2, 3, 4];
	auto b = a[2 .. $];
	b ~= 5;
	assert(b[0]==2);
	a[2] = 10;
	assert(b[0]==10);

So please don't dismiss this point.

[snip]

> Would this something similar to:
> https://github.com/jacob-carlborg/orange/blob/master/orange/serialization/Serializer.d#L1520
> ?

I'm not sure, that routine seems to be filtering pointers between those with aliases and those without aliases, which would be similar in effect to:

if( auto node = arr.ptr in setOfAliases ) {} else {}

> What is the advantage with using a tree? Is the advantage that you loop
> over the elements once in the pseudo-code compared to that I loop over
> them twice, as in:
> https://github.com/jacob-carlborg/orange/blob/master/orange/serialization/Serializer.d#L1495
> ?

Primarily, it's O(N logN) vs O(N^2). Also, it solves the isSliceOf problem we discussed above and puts arrays and objects into the same framework, as objects containing fixed sized arrays can have slices.
October 04, 2011
On 2011-10-04 17:14, Robert Jacques wrote:
> I'm sorry, you're right. In my mind b extended to the end of the a
> array, for some reason. However, if you do define b to extend to the end
> of the a array, then it can append without allocating:
>
> auto a = [0, 1, 2, 3, 4];
> auto b = a[2 .. $];
> b ~= 5;
> assert(b[0]==2);
> a[2] = 10;
> assert(b[0]==10);
>
> So please don't dismiss this point.
>

Hmm, D1 and D2 behaves differently in this case. In D1 "a" is not changed when "b" is changed. Since you append to "b" I though that it would always require a reallocation of "b".

> [snip]
>
>> Would this something similar to:
>> https://github.com/jacob-carlborg/orange/blob/master/orange/serialization/Serializer.d#L1520
>>
>> ?
>
> I'm not sure, that routine seems to be filtering pointers between those
> with aliases and those without aliases, which would be similar in effect
> to:
>
> if( auto node = arr.ptr in setOfAliases ) {} else {}
>
>> What is the advantage with using a tree? Is the advantage that you loop
>> over the elements once in the pseudo-code compared to that I loop over
>> them twice, as in:
>> https://github.com/jacob-carlborg/orange/blob/master/orange/serialization/Serializer.d#L1495
>>
>> ?
>
> Primarily, it's O(N logN) vs O(N^2). Also, it solves the isSliceOf
> problem we discussed above and puts arrays and objects into the same
> framework, as objects containing fixed sized arrays can have slices.

I guess I just have to try this and see how it works out. Anyway, thank you for your review and your patience.

-- 
/Jacob Carlborg
October 05, 2011
On Tue, 04 Oct 2011 12:54:27 -0400, Jacob Carlborg <doob@me.com> wrote:

> On 2011-10-04 17:14, Robert Jacques wrote:
>> I'm sorry, you're right. In my mind b extended to the end of the a
>> array, for some reason. However, if you do define b to extend to the end
>> of the a array, then it can append without allocating:
>>
>> auto a = [0, 1, 2, 3, 4];
>> auto b = a[2 .. $];
>> b ~= 5;
>> assert(b[0]==2);
>> a[2] = 10;
>> assert(b[0]==10);
>>
>> So please don't dismiss this point.
>>
>
> Hmm, D1 and D2 behaves differently in this case. In D1 "a" is not
> changed when "b" is changed. Since you append to "b" I though that it
> would always require a reallocation of "b".

Appending has never mandated reallocation. Maybe you're confusing it with concatenation (i.e. b = b ~ 5;), which does always reallocate. Or the period in time when the anti-array-stomping parts of druntime was overly conservative and would have prevented b from appending in place.
October 05, 2011
On 2011-10-05 03:49, Robert Jacques wrote:
> On Tue, 04 Oct 2011 12:54:27 -0400, Jacob Carlborg <doob@me.com> wrote:
>
>> On 2011-10-04 17:14, Robert Jacques wrote:
>>> I'm sorry, you're right. In my mind b extended to the end of the a
>>> array, for some reason. However, if you do define b to extend to the end
>>> of the a array, then it can append without allocating:
>>>
>>> auto a = [0, 1, 2, 3, 4];
>>> auto b = a[2 .. $];
>>> b ~= 5;
>>> assert(b[0]==2);
>>> a[2] = 10;
>>> assert(b[0]==10);
>>>
>>> So please don't dismiss this point.
>>>
>>
>> Hmm, D1 and D2 behaves differently in this case. In D1 "a" is not
>> changed when "b" is changed. Since you append to "b" I though that it
>> would always require a reallocation of "b".
>
> Appending has never mandated reallocation. Maybe you're confusing it
> with concatenation (i.e. b = b ~ 5;), which does always reallocate. Or
> the period in time when the anti-array-stomping parts of druntime was
> overly conservative and would have prevented b from appending in place.

Yeah, maybe you're right.

-- 
/Jacob Carlborg
1 2 3
Next ›   Last »