August 22, 2013
On Wed, 21 Aug 2013 22:21:48 +0200, "Dicebot" <public@dicebot.lv> wrote:

> 
> > Alternative AO2:
> >
> > Another idea is the archive is an output range, having this interface:
> >
> > auto archive = new XmlArchive!(char);
> > archive.writeTo(outputRange);
> >
> > auto serializer = new Serializer(archive);
> > serializer.serialize(new Object);
> >
> > Use the output range when the serialization is done.
> 
> I can't imagine a use case for this. Adding ranges just because you can is not very good :)
> 


I'm kinda confused why nobody here sees the benefits of the output range model. Most serialization libraries in other languages are implemented like that. For example, .NET:

--------
IFormatter formatter = ...
Stream stream = new FileStream(...)
formatter.Serialize(stream, obj);
stream.Close();
--------

The reason is simple: in serialization it is not common to post-process the serialized data, as far as I know. Usually it's either written to a file or sent over the network, both of which are perfect examples of Streams (or output ranges). Common usage is like this:

-------
auto s = new FileStream(...);
auto serializer = Serializer(s);
serializer.serialize(1);
serializer.serialize("Hello");
foreach (value; ...)
    serializer.serialize(value);
-------

The classic way to implement this pattern efficiently is with an OutputRange/Stream. Serialization must be capable of outputting hundreds of megabytes to a file or the network without significant memory overhead.
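Roughly, what I have in mind is something like this (just a sketch, the names are made up):

--------
import std.range : isOutputRange;
import std.stdio : File, stdout;

// Minimal sketch of a file-backed sink: every chunk handed to put()
// goes straight to the file, so memory usage stays bounded no matter
// how much data the serializer produces.
struct FileSink
{
    File file;
    void put(const(ubyte)[] chunk) { file.rawWrite(chunk); }
}

static assert(isOutputRange!(FileSink, ubyte[]));

void main()
{
    auto sink = FileSink(stdout);
    ubyte[3] bytes = [0x01, 0x02, 0x03];
    sink.put(bytes[]); // a serializer would issue calls like this
}
--------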




There are two specific ways in which an InputRange interface can be useful. The first is when the serializer works as a filter for another range:
--------
auto serializer = new Serializer([1, 2, 3, 4, 5].take(3));
foreach (ubyte[] data; serializer) { /* write data somewhere */ }
--------
But InputRanges are limited to the same type for all elements; the "serialize" call isn't. Of course you can use Variant, but what about big structs? Performance matters, so the InputRange approach only works nicely if you serialize values of the same type.
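Just to illustrate what the Variant route forces on you (a toy example, not real serializer code):

--------
import std.variant : Variant;

void main()
{
    static struct Big { ubyte[256] payload; }

    // To push ints, strings and structs through one InputRange, every
    // element has to be boxed into a Variant; Big no longer fits into
    // Variant's small internal buffer, so it gets copied to the heap.
    Variant[] mixed = [Variant(1), Variant("Hello"), Variant(Big.init)];

    foreach (Variant v; mixed)
    {
        // a consumer has to dispatch on the runtime type of every element
    }
}
--------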

The second is when you only want to serialize a single element:
--------
auto serializer = new Serializer(myobject);
foreach (ubyte[] data; serializer) { /* write data somewhere */ }
--------

It does not work well if you want to mix it with the "serialize" call:
-------
auto serializer = new Serializer();
serializer.serialize(1);
serializer.serialize("Hello");
serializer.serialize(3);
serializer.serialize(4);
foreach (ubyte[] data; serializer) { /* write data somewhere */ }
-------

Here the serializer has to cache the data or the original objects until the data is processed via foreach. If the serializer had access to an output range, the "serialize" calls could write directly to the stream without any caching. So the output-range model is clearly superior in this case.
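To make that concrete, a foreach-driven serializer would have to look roughly like this internally (invented names, not a proposed API):

--------
import std.array : Appender;

// Sketch: in the InputRange/foreach model serialize() has nowhere to
// send the bytes yet, so everything piles up in an internal buffer
// until the user finally iterates over the result.
struct BufferingSerializer
{
    private Appender!(ubyte[]) buffer;

    void serialize(T)(T value)
    {
        // kept in memory until the foreach loop drains it
        buffer.put((cast(const(ubyte)*)&value)[0 .. T.sizeof]);
    }

    // front/popFront would then just walk this buffer in chunks
    @property const(ubyte)[] data() { return buffer.data; }
}
--------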
August 22, 2013
On Thursday, 22 August 2013 at 15:33:07 UTC, Johannes Pfau wrote:
> The reason is simple: in serialization it is not common to post-process
> the serialized data, as far as I know. Usually it's either written to a
> file or sent over the network, both of which are perfect examples of
> Streams (or output ranges).

Hm, but in this model it is the file / socket which is the OutputRange, isn't it? The Serializer itself just provides yet another InputRange which can be fed to the target OutputRange. Am I getting this part wrong?

> But InputRanges are limited to the same type for all elements, the
> "serialize" call isn't.

I was thinking about this, but do we have any way to express a heterogeneous range in D? Variant seems to be the only option, though it implies that any format used by the Archiver must always store type information.

> Of course you can use Variant. But what about
> big structs?

After some thinking I have come to the conclusion that it is simply a matter of two `data` ranges - one parametrized by the output type and the other "raw". The latter can then output the data in chunks of unspecified size (as small as the serialization implementation allows). Does that help?
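Roughly something like this, with all names invented just to illustrate the idea:

--------
import std.range : InputRange;

// Sketch of the "two data ranges" idea: the archive exposes its result
// both in its own data type and as raw chunks of unspecified size.
interface Archive(Data)
{
    // fully typed result, e.g. Data == string for an XML archive
    @property Data data();

    // the same content as a lazy range of raw chunks; the chunk size is
    // an implementation detail of the archive
    @property InputRange!(const(ubyte)[]) rawData();
}
--------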
August 22, 2013
On 2013-08-22 16:16, Tyler Jameson Little wrote:

> Right, but it doesn't need to keep the serialized data in memory.

No, exactly.

> Can't you just keep a counter? When you enter anything that would
> increase the indentation level, increment the indentation level. When
> leaving, decrement. At each level, insert whitespace equal to
> indentationLevel * whitespacePerLevel. This seems pretty trivial, unless
> I'm missing something.

That sounds like it will require quite some work. Currently I'm using std.xml.Document.toString.

> Also, I didn't check, but it turns off pretty-printing by default, right?

No, not currently, see above.

> How about passing it in with a function? Each range passed this way
> would represent a single object, so the current
> deserialize!Foo(InputRange) would work the same way it does now.

The archive needs to store it somehow, so pass it in the constructor?

-- 
/Jacob Carlborg
August 22, 2013
On 2013-08-22 16:52, Dicebot wrote:

> Is there a reason arrays need to be serialized as (1), not (2)?

For arrays, one advantage is that I can allocate the whole array at once instead of appending, since the length of the array is serialized.
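A toy illustration of the difference (readLength/readElement are just stand-ins, not the actual archive API):

--------
import std.array : appender;

void main()
{
    // stand-ins for whatever the archive actually reads from the data
    size_t readLength() { return 1000; }
    int readElement() { return 42; }

    // length is serialized up front: allocate the whole array at once
    auto eager = new int[](readLength());
    foreach (ref e; eager)
        e = readElement();

    // lazy range of unknown length: grow by appending instead
    auto grown = appender!(int[])();
    foreach (i; 0 .. 1000)
        grown.put(readElement());
}
--------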

> I'd expect any input-range compliant data to be serialized as (2) and lazy.
> That allows you to use deserializer as a pipe over some sort of
> network-based string feed to get a potentially infinite input range of
> deserialized objects.

I still don't know how I would deserialize a range. I need to know the type to deserialize, not just the interface.

-- 
/Jacob Carlborg
August 22, 2013
On 2013-08-22 17:33, Johannes Pfau wrote:

> The reason is simple: In serialization it is not common to post-process
> the serialized data as far as I know.

Perhaps compression or encryption.

-- 
/Jacob Carlborg
August 22, 2013
On 2013-08-22 17:49, Dicebot wrote:

> I was thinking about this, but do we have any way to express a heterogeneous
> range in D? Variant seems to be the only option, though it implies that any
> format used by the Archiver must always store type information.

I'm wondering how interesting this really is to support. I basically only serialize a single object, which is the start of an object graph or an array.

> After some thinking I have come to the conclusion that it is simply a matter
> of two `data` ranges - one parametrized by the output type and the other
> "raw". The latter can then output the data in chunks of unspecified size (as
> small as the serialization implementation allows). Does that help?

Do you mean that only the archive should handle ranges and the serializer shouldn't?

-- 
/Jacob Carlborg
August 22, 2013
On Thu, 22 Aug 2013 17:49:04 +0200, "Dicebot" <public@dicebot.lv> wrote:

> On Thursday, 22 August 2013 at 15:33:07 UTC, Johannes Pfau wrote:
> > The reason is simple: in serialization it is not common to post-process
> > the serialized data, as far as I know. Usually it's either written to a
> > file or sent over the network, both of which are perfect examples of
> > Streams (or output ranges).
> 
> Hm, but in this model it is the file / socket which is the OutputRange, isn't it? The Serializer itself just provides yet another InputRange which can be fed to the target OutputRange. Am I getting this part wrong?

Yes, but the important point is that the Serializer is _not_ an InputRange of serialized data. Instead it _uses_ an OutputRange / Stream internally.

I'll show a very simplified example:
---------------------
import std.stdio;

struct Serializer(T) //if(isOutputRange!(T, ubyte[]))
{
    private T _output;
    this(T output)
    {
        _output = output;
    }

    void serialize(U)(U data)
    {
        // Simplified: just dump the raw bytes of the value. A real
        // implementation would of course write the contents of arrays
        // and strings, not the slice itself.
        _output.put((cast(ubyte*)&data)[0 .. U.sizeof]);
    }
}

void put(File f, ubyte[] data) //File is not an OutputRange...
{
    f.rawWrite(data);
}

void main()
{
    auto serializer = Serializer!File(stdout);
    serializer.serialize("Test");
    serializer.serialize("Hello World!");
}
---------------------

As you can see, there are absolutely no memory allocations necessary. Of course, in reality you'd want a fixed buffer, but there's still no dynamic allocation.
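Such a fixed buffer could be as simple as this (again just a sketch):

--------
import std.stdio : File;

// A small fixed buffer in front of the file, flushed whenever it fills
// up, so there is still no dynamic allocation.
struct BufferedFileSink(size_t bufferSize = 4096)
{
    File file;
    ubyte[bufferSize] buffer;
    size_t used;

    void put(const(ubyte)[] chunk)
    {
        foreach (b; chunk)
        {
            if (used == buffer.length)
                flush();
            buffer[used++] = b;
        }
    }

    void flush()
    {
        if (used)
            file.rawWrite(buffer[0 .. used]);
        used = 0;
    }
}
--------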

Now try to implement this in an efficient way as an InputRange. Here's the skeleton:
---------------------
struct Serializer
{
    void serialize(T)(T data) { /* ??? */ }
    bool empty() { /* ??? */ return true; }
    ubyte[] front;
    void popFront() { /* ??? */ }
}

void main()
{
    Serializer serializer;
    serializer.serialize("Test");
    serializer.serialize("Hello World!");
    foreach (ubyte[] data; serializer) { /* write data somewhere */ }
}
---------------------

How would you implement this? It can only work efficiently if the Serializer wraps an InputRange itself, or if there's only one value to serialize. But the serialize method as defined above cannot be implemented efficiently with this approach.

Now I do confess that an InputRange filter is useful, but only for specific use cases. The more common use case is outputting directly to an OutputRange, and this should be as efficient as possible. With a good design it should be possible to support both cases efficiently with the same "backends". But implementing an InputRange serializer filter will still be much more difficult than the OutputRange case (the serializer must be capable of resuming serialization at any point, as the output buffer might be full).




I'd like to make another comment about performance. I think there are two possible usages / user groups of std.serialization.

1) The classical, heavyweight C#/Java-style serialization, which can serialize complete object graphs, deal with inheritance and so on

2) The simple "Just write the JSON representation of this struct to this file" kind of usage.

For use case 2 it's important that there's as little overhead as possible. Consider this struct:

struct Song
{
    string artist;
    string title;
}

If I wrote the JSON serialization manually, it would look like this:
---------
import std.array : appender;

auto a = appender!string(); // or any output range
Song song;
a.put("{\n");
a.put(`    "artist": "`);
a.put(song.artist);
a.put("\",\n");
a.put(`    "title": "`);
a.put(song.title);
a.put("\"\n}\n");
---------

As you can see, this code does basically nothing: no intermediate allocations, no string processing, it just copies data into the output range. But it's annoying to write this boilerplate.
I'd expect a serialization library to let me do this:
serialize!JSON(a, song);
And performance should be very close to the hand-written code above.
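Just to sketch how that could be done without giving up the performance (this is only an illustration of the compile-time approach, not a proposed API, and it assumes string fields):

--------
import std.array : appender;

void serializeJson(Sink, T)(ref Sink sink, auto ref T value)
{
    sink.put("{\n");
    foreach (i, ref field; value.tupleof)
    {
        static if (i > 0)
            sink.put(",\n");
        // the key string is assembled at compile time
        enum key = `    "` ~ __traits(identifier, T.tupleof[i]) ~ `": "`;
        sink.put(key);
        sink.put(field); // toy version: assumes the field is a string
        sink.put(`"`);
    }
    sink.put("\n}\n");
}

struct Song
{
    string artist;
    string title;
}

void main()
{
    auto a = appender!string();
    serializeJson(a, Song("Some Artist", "Some Title"));
}
--------

At run time this boils down to the same put() calls as the hand-written version.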
August 22, 2013
On Thu, 22 Aug 2013 18:13:23 +0200, Jacob Carlborg <doob@me.com> wrote:

> On 2013-08-22 17:33, Johannes Pfau wrote:
> 
> > The reason is simple: In serialization it is not common to post-process the serialized data as far as I know.
> 
> Perhaps compression or encryption.
> 

But compression or encryption are usually implemented as OutputRanges / OutputStreams.
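For example, a compressing filter can itself be an output range (rough sketch using std.zlib; the wrapper type and names are made up):

--------
import std.array : appender;
import std.zlib : Compress;

// Uncompressed bytes go in through put(), compressed bytes come out the
// other end into whatever sink sits behind the filter.
struct CompressingSink(Sink)
{
    Sink* next;          // the downstream sink (file, socket, ...)
    Compress compressor;

    void put(const(ubyte)[] chunk)
    {
        auto compressed = cast(const(ubyte)[]) compressor.compress(chunk);
        if (compressed.length)
            next.put(compressed);
    }

    void finish()
    {
        next.put(cast(const(ubyte)[]) compressor.flush());
    }
}

void main()
{
    auto output = appender!(ubyte[])();
    auto sink = CompressingSink!(typeof(output))(&output, new Compress());

    sink.put(cast(const(ubyte)[]) "some serialized data, some serialized data");
    sink.finish();
}
--------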
August 22, 2013
On Thursday, 22 August 2013 at 17:39:19 UTC, Johannes Pfau wrote:
> Yes, but the important point is that the Serializer is _not_ an InputRange
> of serialized data. Instead it _uses_ an OutputRange / Stream internally.

Shame on me. I completely misunderstood you and thought you wanted to make the serializer an OutputRange itself.

Your examples make a lot of sense and I do agree it is a use case worth supporting. I need some more time to imagine how that may impact the API in general.
August 22, 2013
On 2013-08-22 19:41, Johannes Pfau wrote:

> But compression or encryption are usually implemented as OutputRanges /
> OutputStreams.

Ok, I didn't know that.

-- 
/Jacob Carlborg