Range interface for std.serialization - D Programming Language Discussion Forum

August 21, 2013

Range interface for std.serialization

Posted by Jacob Carlborg

After reading the review thread for std.serialization, I've been trying to figure out what a range interface for std.serialization could look like. There have been several suggestions for how to implement the range interface, and I really don't know what the best choice would be.

As far as I can see, there are two parts of the package where it makes sense to support the range APIs: the serializer (Serializer) and the archives (Archive).

Let's start with the archive and its output, used for serializing. One idea is to have the current method "data" return an input range.

Alternative AO1 (Archive Output 1):

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);
serializer.serialize(new Object);

auto inputRange = archive.data;

This is pretty straightforward, and the returned range can later be used to write to disk or whatever the user chooses.

If this alternative is chosen, how should the range for the XmlArchive work? Currently the archive returns a string. Should the range just wrap the string and step through it character by character? That doesn't sound very efficient.
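
One way to avoid stepping character by character is to hand out the data in blocks. This is only a minimal sketch: `inBlocks` is a hypothetical helper, not part of std.serialization, and it assumes the archive already holds the complete string.

```d
import std.range : chunks;
import std.string : representation;

// Hypothetical helper: expose already-serialized text as a range of
// fixed-size blocks instead of single characters.
auto inBlocks(string data, size_t blockSize)
{
    // representation() views the string as immutable(ubyte)[], so every
    // block is a plain slice of the original data and nothing is decoded
    // or copied per element.
    return data.representation.chunks(blockSize);
}
```

Each element can be cast back with `cast(string) block` before writing it out. Note that a block boundary may fall in the middle of a multi-byte UTF-8 sequence, so the blocks are only safe to treat as text once they are concatenated again.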



Alternative AO2:

Another idea is for the archive to write to an output range, with this interface:

auto archive = new XmlArchive!(char);
archive.writeTo(outputRange);

auto serializer = new Serializer(archive);
serializer.serialize(new Object);

Use the output range when the serialization is done.

This raises the same question as the input range: should I put to the range character by character?
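
For what it's worth, an output range does not force character-by-character writes, since `put` from std.range.primitives accepts whole strings when the sink does. A minimal sketch, assuming a hypothetical `writeElement` function in place of the real archive code:

```d
import std.array : appender;
import std.range.primitives : put;

// Hypothetical: write one XML element to any output range that accepts
// strings. Every put() hands over a whole slice, not a single character.
void writeElement(Sink)(ref Sink sink, string tag, string value)
{
    put(sink, "<");
    put(sink, tag);
    put(sink, ">");
    put(sink, value);
    put(sink, "</");
    put(sink, tag);
    put(sink, ">");
}

// Usage with an in-memory sink; any other output range would work too.
string demo()
{
    auto sink = appender!string();
    sink.writeElement("int", "4");
    return sink.data;
}
```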



Now we come to the input for the archive, used for deserializing. I think the only alternative is an input range, which I guess is pretty straightforward. The archive needs to take the range type as a template parameter to be able to store the range.

A problem with this, although I don't know if it's really a problem, is that the following won't be possible:

auto archive = new XmlArchive!(InputRange);
archive.data = archive.data;

Which one would usually expect from an OO API. The problem here is that the archive is typed for the original input range but the returned range from "data" is of a different type.



Then we come to the serializer. For the input to the serializer there are a couple of alternatives:

Alternative SI1:

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);
serializer.serialize(new Object);

The serializer can accept any type and will just serialize it. This is the current approach.

Alternative SI2:

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);
serializer.serialize([1,2,3,4,5].stride(2).take(2));

The serializer recognizes input ranges and treats them differently. A range can be serialized in a couple of different ways:

* Serialize it as an array
* Serialize the range as if the "serialize" method had been called multiple times
* Figure out a new structure and serialize it as a range

Alternative SI3:

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);
[1,2,3,4,5].stride(2).take(2).copy(serializer);

The serializer can be an output range and implement a "put" method. I guess this has the same problem as alternative SI2: how should it serialize the range?
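
A minimal sketch of what SI3 could look like, using a toy `ToySerializer` struct (hypothetical, not the real Serializer class) that simply treats every `put` as a separate "serialize" call:

```d
import std.algorithm : copy;
import std.conv : to;
import std.range : stride, take;

// Hypothetical toy serializer. Having put(int) is what makes it an
// output range for int.
struct ToySerializer
{
    string[] output;

    void put(int value)
    {
        output ~= "<int>" ~ value.to!string ~ "</int>";
    }
}

string[] serializeSample()
{
    // copy() takes the target by value and returns it, so for a struct
    // the result has to be read from the returned serializer.
    auto serializer = [1, 2, 3, 4, 5].stride(2).take(2).copy(ToySerializer());
    return serializer.output;
}
```

With a class-based serializer, as in the snippet above, the by-value issue goes away because only the reference is copied.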



For the output of the serializer (deserializing) I'm not sure if it makes sense to return a range. That's because you need to tell the serializer what root type to return:

Alternative SO1:

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);

serializer.serialize(new Object);

auto object = serializer.deserialize!(Object)(data);

This is the current interface.



Alternative SO2:

auto archive = new XmlArchive!(char);
auto serializer = new Serializer(archive);

serializer.serialize(new Object);

auto range = serializer.deserialize!(?)(data);

If the serializer returns a range, what type should be used in place of the question mark?



Conclusion:

As far as I can see there are many alternatives and I don't know which is best to choose.

-- 
/Jacob Carlborg

August 21, 2013

Re: Range interface for std.serialization

Posted by Dicebot
in reply to Jacob Carlborg

My 5 cents:

On Wednesday, 21 August 2013 at 18:45:48 UTC, Jacob Carlborg wrote:
> If this alternative is chosen, how should the range for the XmlArchive work? Currently the archive returns a string. Should the range just wrap the string and step through it character by character? That doesn't sound very efficient.

It should be a range of strings - one call to popFront should serialize one object from the input object range and provide the matching string buffer.

> Alternative AO2:
>
> Another idea is the archive is an output range, having this interface:
>
> auto archive = new XmlArchive!(char);
> archive.writeTo(outputRange);
>
> auto serializer = new Serializer(archive);
> serializer.serialize(new Object);
>
> Use the output range when the serialization is done.

I can't imagine a use case for this. Adding ranges just because you can is not very good :)

> A problem with this, actually I don't know if it's considered a problem, is that the following won't be possible:
>
> auto archive = new XmlArchive!(InputRange);
> archive.data = archive.data;

What should this snippet do?

> Which one would usually expect from an OO API. The problem here is that the archive is typed for the original input range but the returned range from "data" is of a different type.

Range-based algorithms don't assign ranges. Transferring data from one range to another is done via copy(sourceRange, destRange) and similar tools.

> ... snip

It looks like the difficulties come from your initial assumption that one call to serialize/deserialize implies one object - in that model ranges are hardly useful. I don't think it is a reasonable restriction. What is practically useful is lazy (de)serialization of a large list of objects - and this is a natural job for ranges.
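
A minimal sketch of that model, assuming a hypothetical `Point` type and a made-up `toXml` helper (neither is part of std.serialization). The result is a lazy range of strings, where each element is produced only when it is requested:

```d
import std.algorithm : map;
import std.format : format;

struct Point { int x, y; }

// Hypothetical per-object serializer producing one string buffer.
string toXml(Point p)
{
    return format(`<point x="%d" y="%d"/>`, p.x, p.y);
}

// Lazily turns an input range of objects into a range of strings.
// Nothing is serialized until front is evaluated.
auto serializeAll(R)(R objects)
{
    return objects.map!toXml;
}
```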

August 22, 2013

Re: Range interface for std.serialization

Posted by Tyler Jameson Little
in reply to Dicebot

On Wednesday, 21 August 2013 at 20:21:49 UTC, Dicebot wrote:
> My 5 cents:
>
> On Wednesday, 21 August 2013 at 18:45:48 UTC, Jacob Carlborg wrote:
>> If this alternative is chosen, how should the range for the XmlArchive work? Currently the archive returns a string. Should the range just wrap the string and step through it character by character? That doesn't sound very efficient.
>
> It should be a range of strings - one call to popFront should serialize one object from the input object range and provide the matching string buffer.

I don't like this because it still caches the whole object into memory. In a memory-restricted application, this is unacceptable.

I think one call to popFront should release part of the serialized object. For example:

struct B {
    int c, d;
}

struct A {
    int a;
    B b;
}

The JSON output of this would be:

    {
        a: 0,
        b: {
            c: 0,
            d: 0
        }
    }

There's no reason why the serializer can't output this in chunks:

Chunk 1:

    {
        a: 0,

Chunk 2:

        b: {

Etc...

Most archive formats should support chunking. I realize this may be a rather large change to Orange, but I think it's a direction it should be headed.
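
A rough sketch of the idea, not how Orange works today: `emit` is a hypothetical function that pushes each piece of output to a sink delegate as soon as it is known, so nothing larger than one piece has to be buffered. It quotes the keys and skips whitespace, unlike the hand-written output above.

```d
import std.conv : to;

struct B { int c, d; }
struct A { int a; B b; }

// Hypothetical chunked serializer: every call to sink() is one chunk.
void emit(T)(T value, scope void delegate(const(char)[]) sink)
{
    static if (is(T == struct))
    {
        sink("{");
        foreach (i, member; value.tupleof)
        {
            static if (i > 0)
                sink(",");
            // Field name of the i-th member, resolved at compile time.
            sink(`"` ~ __traits(identifier, T.tupleof[i]) ~ `":`);
            emit(member, sink); // nested structs recurse
        }
        sink("}");
    }
    else
    {
        sink(value.to!string);
    }
}

// Collects the chunks only to make them visible; a real sink would
// write each one to a socket or file instead.
string[] chunksOf(T)(T value)
{
    string[] pieces;
    emit(value, (const(char)[] piece) { pieces ~= piece.idup; });
    return pieces;
}
```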

>> Alternative AO2:
>>
>> Another idea is the archive is an output range, having this interface:
>>
>> auto archive = new XmlArchive!(char);
>> archive.writeTo(outputRange);
>>
>> auto serializer = new Serializer(archive);
>> serializer.serialize(new Object);
>>
>> Use the output range when the serialization is done.
>
> I can't imagine a use case for this. Adding ranges just because you can is not very good :)

I completely agree.

>> A problem with this, actually I don't know if it's considered a problem, is that the following won't be possible:
>>
>> auto archive = new XmlArchive!(InputRange);
>> archive.data = archive.data;
>
> What should this snippet do?
>
>> Which one would usually expect from an OO API. The problem here is that the archive is typed for the original input range but the returned range from "data" is of a different type.
>
> Range-based algorithms don't assign ranges. Transferring data from one range to another is done via copy(sourceRange, destRange) and similar tools.

This is just a read-only property, which arguably doesn't break misconceptions. There should be no reason to assign directly to a range.

> It looks like difficulties come from your initial assumption that one call to serialize/deserialize implies one object - in that model ranges hardly are useful. I don't think it is a reasonable restriction. What is practically useful is (de)serialization of large list of objects lazily - and this is a natural job for ranges.

I agree that (de)serializing a large list of objects lazily is important, but I don't think that's the natural interface for a Serializer. I think that each object should be lazily serialized instead to maximize throughput.

If a Serializer is defined as only (de)serializing a single object, then serializing a range of Type would be as simple as using map() with a Serializer (getting a range of Serialize). If the allocs are too much, then the same serializer can be used, but serialize one-at-a-time.

My main point here is that data should be written as it's being serialized. In a networked application, it may take a few packets to encode a larger object, so the first packets should be sent ASAP.

As usual, feel free to destroy =D

August 22, 2013

Re: Range interface for std.serialization

Posted by Jacob Carlborg
in reply to Dicebot

On 2013-08-21 22:21, Dicebot wrote:

> It should be a range of strings - one call to popFront should serialize
> one object from the input object range and provide the matching string buffer.

How should nesting be handled for a format like XML? Example:

class Bar
{
    int a;
}

class Foo
{
    int b;
    Bar bar;
}

Currently the following XML is produced when serializing Foo:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">
    <int key="b" id="1">4</int>
    <object runtimeType="main.Bar" type="main.Bar" key="bar" id="2">
        <int key="a" id="3">3</int>
    </object>
</object>

If I shouldn't return the whole object, Foo, how can we know that when the string for Bar is returned it should actually be nested inside Foo?

> I can't imagine a use case for this. Adding ranges just because you can
> is not very good :)

Ok.

> What should this snippet do?

That was just a dummy snippet to set the data. This is a slightly better example:

auto archive = new XmlArchive!(string);
auto serializer = new Serializer(archive);
serializer.serialize(new Object);

writeToFile("foo.xml", archive.data);

Now I want to deserialize the data back:

archive.data = readFromFile("foo.xml"); // Error, cannot convert ReadFromFileRange to string

> Range-based algorithms don't assign ranges. Transferring data from one
> range to another is done via copy(sourceRange, destRange) and similar
> tools.

So what should the API look like for setting the data used when deserializing? Like this?

auto data = readFromFile("foo.xml");
auto archive = new XmlArchive!(string);
copy(data, archive.data);

> It looks like difficulties come from your initial assumption that one
> call to serialize/deserialize implies one object - in that model ranges
> hardly are useful. I don't think it is a reasonable restriction. What is
> practically useful is (de)serialization of large list of objects lazily
> - and this is a natural job for ranges.

It depends on how you look at it. Currently it's only possible to serialize a single object with a single call to "serialize". So if you want to serialize multiple objects you do as you would do normally in your code, use an array, a linked list or similar. An array is still a single object, though it contains multiple objects, that is handled perfectly fine.

The question is if a range should be treated as multiple objects, and not a single object (which it really is). How should it be serialized?

* Something like an array, resulting in this XML:

<array type="int" length="5" key="0" id="0">
    <int key="0" id="1">1</int>
    <int key="1" id="2">2</int>
    <int key="2" id="3">3</int>
    <int key="3" id="4">4</int>
    <int key="4" id="5">5</int>
</array>

* Or like calling "serialize" multiple times, resulting in this XML:

<int key="0" id="0">1</int>
<int key="1" id="1">2</int>
<int key="2" id="2">3</int>
<int key="3" id="3">4</int>
<int key="4" id="4">5</int>

* Or as a single object:

Then it would actually serialize the struct/class representing the range.

And the most important question: how should ranges be deserialized? One has to tell the serializer what type to return, otherwise it won't work. But the whole point of ranges is that you shouldn't need to know the type. Sometimes you cannot even name the type, e.g. Voldemort types.

-- 
/Jacob Carlborg

August 22, 2013

Re: Range interface for std.serialization

Posted by Jacob Carlborg
in reply to Tyler Jameson Little

On 2013-08-22 05:13, Tyler Jameson Little wrote:

> I don't like this because it still caches the whole object into memory.
> In a memory-restricted application, this is unacceptable.

It needs to store all serialized reference types, otherwise it cannot properly serialize a complete object graph. We don't want duplicates. Example:

The following code:

auto bar = new Bar;
bar.a = 3;

auto foo = new Foo;
foo.a = bar;
foo.b = bar;

Is serialized as:

<object runtimeType="main.Foo" type="main.Foo" key="0" id="0">
    <object runtimeType="main.Bar" type="main.Bar" key="a" id="1">
        <int key="a" id="2">3</int>
    </object>
    <reference key="b">1</reference>
</object>

Here "foo.b" is serialized as just a reference, not the complete object, because the object has already been serialized. The serializer needs to keep track of that.
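
A simplified sketch of that bookkeeping, with a hypothetical `ReferenceTable` type (the real serializer also assigns ids to values, which is left out here). It maps object identity to the id that was handed out when the object was first serialized:

```d
// Hypothetical bookkeeping for already-serialized references.
struct ReferenceTable
{
    // Keyed by address so that identity, not opEquals, decides.
    private size_t[const(void)*] ids;
    private size_t nextId;

    // Returns true if obj has been seen before. In both cases id is set
    // to the id belonging to obj.
    bool alreadySerialized(Object obj, out size_t id)
    {
        auto key = cast(const(void)*) obj;
        if (auto known = key in ids)
        {
            id = *known;
            return true;   // emit <reference> instead of the object
        }
        id = nextId++;
        ids[key] = id;
        return false;      // first visit: serialize the full object
    }
}
```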

> I think one call to popFront should release part of the serialized
> object. For example:
>
> struct B {
>      int c, d;
> }
>
> struct A {
>      int a;
>      B b;
> }
>
> The JSON output of this would be:
>
>      {
>          a: 0,
>          b: {
>              c: 0,
>              d: 0
>          }
>      }
>
> There's no reason why the serializer can't output this in chunks:
>
> Chunk 1:
>
>      {
>          a: 0,
>
> Chunk 2:
>
>          b: {
>
> Etc...

It seems hard to keep track of nesting. I can't see how pretty printing using this technique would work.

> This is just a read-only property, which arguably doesn't break
> misconceptions. There should be no reason to assign directly to a range.

How should I set the data used for deserializing?

> I agree that (de)serializing a large list of objects lazily is
> important, but I don't think that's the natural interface for a
> Serializer. I think that each object should be lazily serialized instead
> to maximize throughput.
>
> If a Serializer is defined as only (de)serializing a single object, then
> serializing a range of Type would be as simple as using map() with a
> Serializer (getting a range of Serialize). If the allocs are too much,
> then the same serializer can be used, but serialize one-at-a-time.
>
> My main point here is that data should be written as it's being
> serialized. In a networked application, it may take a few packets to
> encode a larger object, so the first packets should be sent ASAP.
>
> As usual, feel free to destroy =D

Again, how does one keep track of nesting in formats like XML, JSON and YAML?

-- 
/Jacob Carlborg

August 22, 2013

Re: Range interface for std.serialization

Posted by Tyler Jameson Little
in reply to Jacob Carlborg

On Thursday, 22 August 2013 at 07:16:11 UTC, Jacob Carlborg wrote:
> On 2013-08-22 05:13, Tyler Jameson Little wrote:
>
>> I don't like this because it still caches the whole object into memory.
>> In a memory-restricted application, this is unacceptable.
>
> It needs to store all serialized reference types, otherwise it cannot properly serialize a complete object graph. We don't want duplicates. Example:
>
> The following code:
>
> auto bar = new Bar;
> bar.a = 3;
>
> auto foo = new Foo;
> foo.a = bar;
> foo.b = bar;
>
> Is serialized as:
>
> <object runtimeType="main.Foo" type="main.Foo" key="0" id="0">
>     <object runtimeType="main.Bar" type="main.Bar" key="a" id="1">
>         <int key="a" id="2">3</int>
>     </object>
>     <reference key="b">1</reference>
> </object>
>
> Here "foo.b" is serialized as just a reference, not the complete object, because the object has already been serialized. The serializer needs to keep track of that.

Right, but it doesn't need to keep the serialized data in memory.

>> I think one call to popFront should release part of the serialized
>> object. For example:
>>
>> struct B {
>>     int c, d;
>> }
>>
>> struct A {
>>     int a;
>>     B b;
>> }
>>
>> The JSON output of this would be:
>>
>>     {
>>         a: 0,
>>         b: {
>>             c: 0,
>>             d: 0
>>         }
>>     }
>>
>> There's no reason why the serializer can't output this in chunks:
>>
>> Chunk 1:
>>
>>     {
>>         a: 0,
>>
>> Chunk 2:
>>
>>         b: {
>>
>> Etc...
>
> It seems hard to keep track of nesting. I can't see how pretty printing using this technique would work.

Can't you just keep a counter? When you enter anything that would increase the indentation level, increment the indentation level. When leaving, decrement. At each level, insert whitespace equal to indentationLevel * whitespacePerLevel. This seems pretty trivial, unless I'm missing something.
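
Something like this is what I have in mind. It's only a sketch with a hypothetical `IndentWriter` type; a real archive would push each line to its output instead of appending to a string:

```d
import std.array : replicate;

// Hypothetical pretty printer that tracks nesting with a single counter.
struct IndentWriter
{
    string output;
    size_t level;
    enum spacesPerLevel = 4;

    // Write a line at the current depth.
    void line(string text)
    {
        output ~= replicate(" ", level * spacesPerLevel) ~ text ~ "\n";
    }

    // Entering a nested element: write the opener, then go one deeper.
    void open(string text)
    {
        line(text);
        ++level;
    }

    // Leaving a nested element: come back up, then write the closer.
    void close(string text)
    {
        --level;
        line(text);
    }
}
```

The only state carried between chunks is `level`, so each line can be flushed as soon as it is written.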

Also, I didn't check, but pretty-printing is turned off by default, right?

>> This is just a read-only property, which arguably doesn't break
>> misconceptions. There should be no reason to assign directly to a range.
>
> How should I set the data used for deserializing?

How about passing it in with a function? Each range passed this way would represent a single object, so the current deserialize!Foo(InputRange) would work the same way it does now.

>> I agree that (de)serializing a large list of objects lazily is
>> important, but I don't think that's the natural interface for a
>> Serializer. I think that each object should be lazily serialized instead
>> to maximize throughput.
>>
>> If a Serializer is defined as only (de)serializing a single object, then
>> serializing a range of Type would be as simple as using map() with a
>> Serializer (getting a range of Serialize). If the allocs are too much,
>> then the same serializer can be used, but serialize one-at-a-time.
>>
>> My main point here is that data should be written as it's being
>> serialized. In a networked application, it may take a few packets to
>> encode a larger object, so the first packets should be sent ASAP.
>>
>> As usual, feel free to destroy =D
>
> Again, how does one keep track of nesting in formats like XML, JSON and YAML?

YAML will take a little extra care since whitespace is significant, but it should work well enough as I've described above.

August 22, 2013

Re: Range interface for std.serialization

Posted by Dicebot
in reply to Tyler Jameson Little

On Thursday, 22 August 2013 at 03:13:46 UTC, Tyler Jameson Little wrote:
> On Wednesday, 21 August 2013 at 20:21:49 UTC, Dicebot wrote:
>> It should be a range of strings - one call to popFront should serialize one object from the input object range and provide the matching string buffer.
>
> I don't like this because it still caches the whole object into memory. In a memory-restricted application, this is unacceptable.

Well, in memory-restricted applications having a large object at all is unacceptable. The rationale is that you hardly ever want a half-deserialized object. If the environment is very restrictive, smaller objects will be used anyway (a list of smaller objects).

> ...
> There's no reason why the serializer can't output this in chunks

Output on its own is not useful to discuss - in a pipe model the output matches the input. What is the point of outputting partial chunks of a serialized object if you still need to provide it as a whole to the input?

August 22, 2013

Re: Range interface for std.serialization

Posted by Dicebot
in reply to Jacob Carlborg

I'll focus on the part I find crucial:

On Thursday, 22 August 2013 at 07:08:28 UTC, Jacob Carlborg wrote:
> The question is if a range should be treated as multiple objects, and not a single object (which it really is). How should it be serialized?
>
> * Something like an array, resulting in this XML:
>
> <array type="int" length="5" key="0" id="0">
>     <int key="0" id="1">1</int>
>     <int key="1" id="2">2</int>
>     <int key="2" id="3">3</int>
>     <int key="3" id="4">4</int>
>     <int key="4" id="5">5</int>
> </array>
>
> * Or like calling "serialize" multiple times, resulting in this XML:
>
> <int key="0" id="0">1</int>
> <int key="1" id="1">2</int>
> <int key="2" id="2">3</int>
> <int key="3" id="3">4</int>
> <int key="4" id="4">5</int>

Is there a reason arrays need to be serialized as (1), not (2)? I'd expect any input-range compliant data to be serialized as (2), and lazily. That allows you to use the deserializer as a pipe over some sort of network-based string feed to get a potentially infinite input range of deserialized objects.
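
A minimal sketch of the pipe idea, under the simplifying assumption that each string in the feed holds exactly one serialized int as plain text (no XML parsing); `deserializeInts` is a hypothetical name:

```d
import std.algorithm : map;
import std.conv : to;
import std.string : strip;

// Hypothetical deserializer pipe: one incoming string, one outgoing int.
// The feed may be infinite, since each element is converted only when
// it is requested.
auto deserializeInts(R)(R feed)
{
    return feed.map!(s => s.strip.to!int);
}
```

Because the result is itself an input range, it composes with `take`, `filter` and friends, for example `feed.deserializeInts.take(3)` over an endless feed.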

August 22, 2013

Re: Range interface for std.serialization

Posted by John Colvin
in reply to Dicebot

On Thursday, 22 August 2013 at 14:48:57 UTC, Dicebot wrote:
> Outputting on its own is not useful to discuss - in pipe model output matches input. What is the point in outputting partial chunks of serialized object if you still need to provide it as a whole to the input?

Partial chunks of serialized objects can be useful for applications that aren't immediately deserializing: E.g. sending over a network, storing to disk etc.

August 22, 2013

Re: Range interface for std.serialization

Posted by Dicebot
in reply to John Colvin

On Thursday, 22 August 2013 at 14:55:50 UTC, John Colvin wrote:
> Partial chunks of serialized objects can be useful for applications that aren't immediately deserializing: E.g. sending over a network, storing to disk etc.

But text I/O operates on character ranges anyway, it just uses whatever data is available:
// some imaginary stuff
InputRange!Object.serialize.until(5).copy(stdout);

`copy` will write a text buffer that matches one Object at a time. What is the point of serializing only half of a given object if it is already in memory and available?
