August 03, 2014
On 8/3/14, 2:38 AM, Sönke Ludwig wrote:
[snip]

We need to address the matter of std.jgrandson competing with
vibe.data.json. Clearly at a point only one proposal will have to be
accepted so the other would be wasted work.

Following our email exchange I decided to work on this because (a) you
mentioned more work is needed and your schedule was unclear, (b) we need
this at FB sooner rather than later, (c) there were a few things I
thought could be improved in vibe.data.json. I hope that taking
std.jgrandson to proof of concept spurs things into action.

Would you want to merge some of std.jgrandson's deltas into a new
proposal std.data.json based on vibe.data.json? Here's a few things that
I consider necessary:

1. Commit to a schedule. I can't abandon stuff in wait for the perfect design that may or may not come someday.

2. Avoid UTF decoding.

3. Offer a lazy token stream as a basis for a non-lazy parser. A lazy general parser would be considerably more difficult to write and would only serve a small niche. On the other hand, a lazy tokenizer is easy to write and make efficient, and can serve as a basis for user-defined specialized lazy parsers if the user so wishes.

4. Avoid string allocation. String allocation can be replaced with slices of the input when these two conditions are true: (a) input type is string, immutable(byte)[], or immutable(ubyte)[]; (b) there are no backslash-encoded sequences in the string, i.e. the input string and the actual string are the same.

5. Build on std.variant through and through. Again, anything that doesn't work is a usability bug in std.variant, which was designed for exactly this kind of stuff. Exposing the representation such that user code benefits from Algebraic's primitives may be desirable.

6. Address w0rp's issue with undefined. In fact std.variant.Algebraic does have an uninitialized state :o).
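
The uninitialized state mentioned in point 6 can be checked with a minimal sketch. The alias name `Payload` and the chosen member types are assumptions made for illustration only, not the actual std.jgrandson definition; `hasValue` and `peek` are existing std.variant primitives.

```d
import std.variant : Algebraic;

// Hypothetical payload type; the real set of JSON types would be larger.
alias Payload = Algebraic!(long, double, string);

unittest
{
    Payload p;                        // default-constructed: holds nothing
    assert(!p.hasValue);              // the "undefined" state

    p = 42L;                          // now holds a long
    assert(p.hasValue);
    assert(p.peek!long !is null && *p.peek!long == 42);
    assert(p.peek!string is null);    // wrong type: peek yields null
}
```

This only shows that an "undefined" JSON value would not need an extra member type; whether that is the right way to model it is still open.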

Sönke, what do you think?


Andrei

August 03, 2014
On Sunday, 3 August 2014 at 15:14:43 UTC, Andrei Alexandrescu wrote:
>> 3. Use of "opDispatch" for an open set of members has been criticized
>> for vibe.data.json before and I agree with that criticism. The only
>> advantage is saving a few keystrokes (json.key instead of json["key"]),
>> but I came to the conclusion that the right approach to work with JSON
>> values in D is to always directly deserialize when/if possible anyway,
>> which mostly makes this a moot point.
>
> Interesting. Well if experience with opDispatch is negative then it should probably not be used here, or only offered on an opt-in basis.

I support this opinion. opDispatch looks cool with JSON objects when you implement it but it results in many subtle quirks when you consider something like range traits for example - most annoying to encounter and debug. It is not worth the gain.
August 03, 2014
Am Sun, 03 Aug 2014 08:34:20 -0700
schrieb Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org>:

> On 8/3/14, 2:38 AM, Sönke Ludwig wrote:
> [snip]
> 
> We need to address the matter of std.jgrandson competing with vibe.data.json. Clearly at a point only one proposal will have to be accepted so the other would be wasted work.
> 
> [...]
> 
> 4. Avoid string allocation. String allocation can be replaced with
> slices of the input when these two conditions are true: (a) input
> type is string, immutable(byte)[], or immutable(ubyte)[]; (b) there
> are no backslash-encoded sequences in the string, i.e. the input
> string and the actual string are the same.

I think for the lowest level interface we could avoid allocation
completely:
The tokenizer could always return slices to the raw string, even if a
string contains backslash-encoded sequences or if the token is a number.
Simply expose that as token.rawValue. Then add two functions,
token.decodeString() and token.decodeNumber(), to actually decode the
values. decodeString could additionally support decoding into a buffer.

If the input is not sliceable, read the input into an internal buffer first and slice that buffer.

The main usecase for this is if you simply stream lots of data and you only want to parse very little of it and skip over most content. Then you don't need to decode the strings. This is also true if you only write a JSON formatter: No need to decode and encode the strings.
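
A rough sketch of what such a token could look like, assuming a hypothetical `Token` struct (the names `Kind`, `rawValue`, `decodeString` and `decodeNumber` follow the wording above but are not an existing API). Decoding only allocates when a backslash is actually present; `\uXXXX` escapes are left out to keep it short.

```d
import std.array : appender;
import std.conv : to;
import std.string : indexOf;

/// Hypothetical token: stores only a slice of the input, never a copy.
struct Token
{
    enum Kind { string_, number }
    Kind kind;
    const(char)[] rawValue; // raw slice of the input, quotes excluded

    /// Decodes backslash escapes on demand. Returns the slice itself
    /// when there is nothing to decode, so no allocation happens then.
    const(char)[] decodeString()
    {
        if (rawValue.indexOf('\\') < 0)
            return rawValue;

        auto result = appender!(char[])();
        for (size_t i = 0; i < rawValue.length; ++i)
        {
            immutable c = rawValue[i];
            if (c != '\\' || i + 1 == rawValue.length)
            {
                result.put(c);
                continue;
            }
            switch (rawValue[++i])
            {
                case 'n': result.put('\n'); break;
                case 'r': result.put('\r'); break;
                case 't': result.put('\t'); break;
                // covers \" \\ and \/ ; \uXXXX is NOT handled in this sketch
                default:  result.put(rawValue[i]); break;
            }
        }
        return result.data;
    }

    /// Converts the raw number text only when asked to.
    double decodeNumber()
    {
        return rawValue.to!double;
    }
}

unittest
{
    auto plain = Token(Token.Kind.string_, `abc`);
    assert(plain.decodeString() is plain.rawValue); // same slice back

    auto escaped = Token(Token.Kind.string_, `a\nb`);
    assert(escaped.decodeString() == "a\nb");
}
```

Code that skips over most of the input never calls the decode functions and so never pays for them.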

> 
> 5. Build on std.variant through and through. Again, anything that doesn't work is a usability bug in std.variant, which was designed for exactly this kind of stuff. Exposing the representation such that user code benefits from Algebraic's primitives may be desirable.
> 

Variant uses TypeInfo internally, right? I think as long as it uses TypeInfo it can't replace all use-cases for a standard tagged union.


August 03, 2014
On 8/3/14, 8:51 AM, Johannes Pfau wrote:
> Am Sun, 03 Aug 2014 08:34:20 -0700
> schrieb Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org>:
>
>> On 8/3/14, 2:38 AM, Sönke Ludwig wrote:
>> [snip]
>>
>> We need to address the matter of std.jgrandson competing with
>> vibe.data.json. Clearly at a point only one proposal will have to be
>> accepted so the other would be wasted work.
>>
>> [...]
>>
>> 4. Avoid string allocation. String allocation can be replaced with
>> slices of the input when these two conditions are true: (a) input
>> type is string, immutable(byte)[], or immutable(ubyte)[]; (b) there
>> are no backslash-encoded sequences in the string, i.e. the input
>> string and the actual string are the same.
>
> I think for the lowest level interface we could avoid allocation
> completely:
> The tokenizer could always return slices to the raw string, even if a
> string contains backslash-encoded sequences or if the token is a number.
> Simply expose that as token.rawValue. Then add two functions,
> token.decodeString() and token.decodeNumber(), to actually decode the
> values. decodeString could additionally support decoding into a buffer.

That works but not e.g. for File.byLine which reuses its internal buffer. But it's a neat idea for arrays of immutable bytes.

> If the input is not sliceable, read the input into an internal buffer
> first and slice that buffer.

At that point the cost of decoding becomes negligible.

> The main usecase for this is if you simply stream lots of data and you
> only want to parse very little of it and skip over most content. Then
> you don't need to decode the strings.

Awesome.

> This is also true if you only
> write a JSON formatter: No need to decode and encode the strings.

But wouldn't that still need to encode \n, \r, \t, \v?

>> 5. Build on std.variant through and through. Again, anything that
>> doesn't work is a usability bug in std.variant, which was designed
>> for exactly this kind of stuff. Exposing the representation such that
>> user code benefits from Algebraic's primitives may be desirable.
>>
>
> Variant uses TypeInfo internally, right?

No.


Andrei

August 03, 2014
Am 03.08.2014 09:16, schrieb Andrei Alexandrescu:
> We need a better json library at Facebook. I'd discussed with Sönke the
> possibility of taking vibe.d's json to std but he said it needs some
> more work. So I took std.jgrandson to proof of concept state and hence
> ready for destruction:
>
> http://erdani.com/d/jgrandson.d
> http://erdani.com/d/phobos-prerelease/std_jgrandson.html


Is the name supposed to stay or just a working title?
"std.j*grandson*" (being the successor of "std.j*son*") is of course a funny play on words, but it's not really obvious at first sight what it does.
I.e., if someone skims the std modules in the documentation, looking for json, he'd probably not think that this is the new json module.
std.json2 or something like that would be more obvious.

Cheers,
Daniel
August 03, 2014
I don't want to pay for anything I don't use.  No allocations should occur within the parser and it should simply slice up the input.  So the lowest layer should allow me to iterate across symbols in some way.  When I've done this in the past it was SAX-style (ie. a callback per type) but with the range interface that shouldn't be necessary.
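
To make the range-based lowest layer concrete, here is a minimal sketch of a lazy tokenizer that yields slices of the input and allocates nothing. `JsonTokens` and its helper functions are hypothetical names, it assumes a `string` input, and it does no validation at all.

```d
import std.ascii : isWhite;

/// True for the six structural JSON characters.
bool isStructural(char c)
{
    switch (c)
    {
        case '{', '}', '[', ']', ':', ',': return true;
        default: return false;
    }
}

/// True for any character that ends a number or literal token.
bool isDelimiter(char c)
{
    return isStructural(c) || isWhite(c) || c == '"';
}

/// Hypothetical lazy tokenizer: an input range of slices of the input.
struct JsonTokens
{
    private string input;   // not yet scanned
    private string current; // current token, a slice of the original input

    this(string s) { input = s; popFront(); }

    @property bool empty() const { return current.length == 0; }
    @property string front() const { return current; }

    void popFront()
    {
        size_t i = 0;
        while (i < input.length && isWhite(input[i])) ++i;
        input = input[i .. $];
        if (input.length == 0) { current = null; return; }

        size_t n = 1; // structural tokens are a single code unit
        if (input[0] == '"')
        {
            // Find the closing quote; escapes are skipped, not decoded.
            while (n < input.length && input[n] != '"')
                n += (input[n] == '\\') ? 2 : 1;
            n = n < input.length ? n + 1 : input.length;
        }
        else if (!isStructural(input[0]))
        {
            // Number or literal: runs up to the next delimiter.
            while (n < input.length && !isDelimiter(input[n])) ++n;
        }
        current = input[0 .. n];
        input = input[n .. $];
    }
}

unittest
{
    import std.algorithm : equal;
    assert(JsonTokens(`{"a": 1}`).equal([`{`, `"a"`, `:`, `1`, `}`]));
}
```

A SAX-style callback layer, or a parser that builds values, could then be written on top of this range.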

The parser shouldn't decode or convert anything unless I ask it to.  Most of the time I only care about specific values, and paying for conversions on everything is wasted process time.

I suggest splitting number into float and integer types.  In a language like D where these are distinct internal types, it can be valuable to know this up front.

Is there support for output?  I see the makeArray and makeObject routines...  Ideally, there should be a way to serialize JSON against an OutputRange with optional formatting.
August 03, 2014
On 8/3/14, 9:49 AM, Daniel Gibson wrote:
> Am 03.08.2014 09:16, schrieb Andrei Alexandrescu:
>> We need a better json library at Facebook. I'd discussed with Sönke the
>> possibility of taking vibe.d's json to std but he said it needs some
>> more work. So I took std.jgrandson to proof of concept state and hence
>> ready for destruction:
>>
>> http://erdani.com/d/jgrandson.d
>> http://erdani.com/d/phobos-prerelease/std_jgrandson.html
>
>
> Is the name supposed to stay or just a working title?

Just a working title, but of course if it were wildly successful... but then again it's not. -- Andrei

August 03, 2014
On 8/3/14, 10:19 AM, Sean Kelly wrote:
> I don't want to pay for anything I don't use.  No allocations should
> occur within the parser and it should simply slice up the input.

What to do about arrays and objects, which would naturally allocate arrays and associative arrays respectively? What about strings with backslash-encoded characters?

No allocation works for tokenization, but parsing is a whole different matter.

> So the
> lowest layer should allow me to iterate across symbols in some way.

Yah, that would be the tokenizer.

> When I've done this in the past it was SAX-style (ie. a callback per
> type) but with the range interface that shouldn't be necessary.
>
> The parser shouldn't decode or convert anything unless I ask it to.
> Most of the time I only care about specific values, and paying for
> conversions on everything is wasted process time.

That's tricky. Once you scan for 2 specific characters you may as well scan for a couple more, the added cost is negligible. In contrast, scanning once for finding termination and then again for decoding purposes will definitely be a lot more expensive.

> I suggest splitting number into float and integer types.  In a language
> like D where these are distinct internal types, it can be valuable to
> know this up front.

Yah, that kept on sticking out like a sore thumb throughout.

> Is there support for output?  I see the makeArray and makeObject
> routines...  Ideally, there should be a way to serialize JSON against an
> OutputRange with optional formatting.

Not yet, and yah those should be in.


Andrei

August 03, 2014
Am 03.08.2014 17:14, schrieb Andrei Alexandrescu:
> On 8/3/14, 2:38 AM, Sönke Ludwig wrote:
>> A few thoughts based on my experience with vibe.data.json:
>>
>> 1. No decoding of strings appears to mean that "Value" also always
>> contains encoded strings. This seems to be a leaky and also error-prone
>> abstraction. For the token stream, performance should be top
>> priority, so it's okay to not decode there, but "Value" is a high level
>> abstraction of a JSON value, so it should really hide all implementation
>> details of the storage format.
>
> Nonono. I think there's a confusion. The input strings are not UTF
> decoded for the simple reason that there's no need (all tokenization decisions
> are taken on the basis of ASCII characters/code units). The
> backslash-prefixed characters are indeed decoded.
>
> An optimization I didn't implement yet is to use slices of the input
> wherever possible (when the input is string, immutable(byte)[], or
> immutable(ubyte)[]). That will reduce allocations considerably.

Ah okay, *phew* ;) But in that case I'd actually think about leaving off the backslash decoding in the low level parser, so that slices could be used for immutable inputs in all cases - maybe with a name of "rawString" for the stored data and an additional "string" property that decodes on the fly. This may come in handy when the first comparative benchmarks together with rapidjson and the like are done.

>> 2. Algebraic is a good choice for its generic handling of operations on
>> the contained types (which isn't exposed here, though). However, a
>> tagged union type in my experience has quite some advantages for
>> usability. Since adding a type tag possibly affects the interface in a
>> non-backwards compatible way, this should be evaluated early on.
>
> There's a public opCast(Payload) that gives the end user access to the
> Payload inside a Value. I forgot to add documentation to it.

I see. Suppose that opDispatch would be dropped, would anything speak against "alias this"ing _payload to avoid the need for the manually defined operators?

> What advantages are there to a tagged union? (FWIW: to me Algebraic and
> Variant are also tagged unions, just that the tags are not 0, 1, ..., n.
> That can be easily fixed for Algebraic by defining operations to access
> the index of the currently-stored type.)

The two major points are probably that it's possible to use "final switch" on the type tag if it's an enum, and the type id can be easily stored in both integer and string form (which is not as conveniently possible with a TypeInfo).
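
Both points can be illustrated with a small sketch. The enum `JsonType` and its members here are placeholders picked for the example, not a proposed API.

```d
import std.conv : to;

// Hypothetical type tag for a JSON value.
enum JsonType { null_, boolean, number, text }

string describe(JsonType t)
{
    // "final switch" fails to compile if a member of JsonType is not
    // handled, so adding a new tag flags every switch that needs updating.
    final switch (t)
    {
        case JsonType.null_:   return "null";
        case JsonType.boolean: return "true or false";
        case JsonType.number:  return "a number";
        case JsonType.text:    return "a string";
    }
}

unittest
{
    assert(describe(JsonType.number) == "a number");

    // The tag converts easily to and from integer and string form.
    assert(cast(int) JsonType.number == 2);
    assert(JsonType.number.to!string == "number");
    assert("number".to!JsonType == JsonType.number);
}
```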

> (...)
>
> The way I see it, good work on tagged unions must be either integrated
> within std.variant (either by modifying Variant/Algebraic or by adding
> new types to it). I am very strongly opposed to adding a tagged union
> type only for JSON purposes, which I'd consider essentially a usability
> bug in std.variant, the opposite of dogfooding, etc.

Definitely agree there.

An enum based tagged union design also currently has the unfortunate property that the order of enum values and that of the accepted types must be defined consistently, or bad things will happen. Supporting UDAs on enum values would be a possible direction to fix this:

	enum JsonType {
		@variantType!string string,
		@variantType!(JsonValue[]) array,
		@variantType!(JsonValue[string]) object
	}
	alias JsonValue = TaggedUnion!JsonType;

But then there are obviously still issues with cyclic type references. So, anyway, this is something that still requires some thought. It could also be designed in a way that is backwards compatible with a pure "Algebraic", so it shouldn't be a blocker for the current design.
August 03, 2014
Am Sun, 03 Aug 2014 09:17:57 -0700
schrieb Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org>:

> On 8/3/14, 8:51 AM, Johannes Pfau wrote:
> >
> > Variant uses TypeInfo internally, right?
> 
> No.
> 

https://github.com/D-Programming-Language/phobos/blob/master/std/variant.d#L210
https://github.com/D-Programming-Language/phobos/blob/master/std/variant.d#L371
https://github.com/D-Programming-Language/phobos/blob/master/std/variant.d#L696

Also the handler function concept will always have more overhead than a simple tagged union. It is certainly useful if you want to store any type, but if you only want a limited set of types there are more efficient implementations.