August 11, 2015
On 11-Aug-2015 20:30, deadalnix wrote:
> On Tuesday, 11 August 2015 at 17:08:39 UTC, Atila Neves wrote:
>> On Tuesday, 28 July 2015 at 14:07:19 UTC, Atila Neves wrote:
>>> Start of the two week process, folks.
>>>
>>> Code: https://github.com/s-ludwig/std_data_json
>>> Docs: http://s-ludwig.github.io/std_data_json/
>>>
>>> Atila
>>
>> I forgot to give warnings that the two week period was about to be up,
>> and was unsure from comments if this would be ready for voting, so
>> let's give it another two days unless there are objections.
>>
>> Atila
>
> Ok some actionable items.
>
> 1/ How big is a JSON struct? What is the biggest element in the union?
> Is that element really needed? Recurse.

+1 Also, most JS engines use NaN-boxing to fit the type tag along with the payload in 8 bytes total. At least the _fast_ path of std.data.json should take advantage of similar techniques.
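To make the idea concrete, here is a minimal sketch of NaN-boxing (all names are illustrative, not a std.data.json proposal): IEEE 754 doubles have a 52-bit mantissa, and a quiet NaN leaves 51 of those bits free, which is enough for a 3-bit type tag plus a 48-bit payload (a pointer or small integer).

```d
// Hedged sketch of NaN-boxing; the constants and names are illustrative.
enum Tag : ulong { integer = 1, pointer = 2 }

enum QNAN    = 0x7FF8_0000_0000_0000UL; // quiet-NaN prefix (bits 51-62 set)
enum PAYLOAD = 0x0000_FFFF_FFFF_FFFFUL; // low 48 bits carry the value

ulong box(Tag tag, ulong payload)
{
    // Tag lives in bits 48-50; real doubles never match the QNAN prefix.
    return QNAN | (cast(ulong) tag << 48) | (payload & PAYLOAD);
}

Tag tagOf(ulong boxed)       { return cast(Tag)((boxed >> 48) & 0x7); }
ulong payloadOf(ulong boxed) { return boxed & PAYLOAD; }

unittest
{
    auto b = box(Tag.integer, 42);
    assert(tagOf(b) == Tag.integer && payloadOf(b) == 42);
}
```

A real implementation would additionally have to map untagged doubles onto the remaining bit patterns; that is where the actual complexity of NaN-boxing lies.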

> 2/ As far as I can see, the elements are discriminated using typeid. An
> enum is preferable, as the compiler would know the values ahead of time
> and optimize based on this. It also allows use of things like final switch.

> 3/ Going from the untyped world to the typed world and providing an API
> to get back to the untyped world is a loser strategy. That sounds true
> intuitively, but also from my experience manipulating JSON in various
> languages. The Nodes produced by this lib need to be "manipulatable" as
> the unstructured values they represent.
>


-- 
Dmitry Olshansky
August 11, 2015
Am 11.08.2015 um 19:30 schrieb deadalnix:
> Ok some actionable items.
>
> 1/ How big is a JSON struct? What is the biggest element in the union?
> Is that element really needed? Recurse.

See http://s-ludwig.github.io/std_data_json/stdx/data/json/value/JSONValue.payload.html

The question of whether each field is "really" needed obviously depends on the application. However, the biggest type is BigInt, which, from a quick look, contains a dynamic array plus a bool field, so it's not as compact as it could be, but also not really large. There is also an additional Location field that may sometimes be important for good error messages and the like, and sometimes may be totally unneeded.

However, my goal when implementing this was never to make the DOM representation as efficient as possible. The simple reason is that a DOM representation is inherently inefficient compared to operating on the structure using either the pull parser or a deserializer that directly converts into a static D type. IMO, those should be advertised instead of trying to milk a dead cow (in terms of performance).

> 2/ As far as I can see, the elements are discriminated using typeid. An
> enum is preferable, as the compiler would know the values ahead of time
> and optimize based on this. It also allows use of things like final switch.

Using a tagged-union-like structure is definitely what I'd like to have, too. However, the main goal was to build the DOM type upon a generic algebraic type instead of using a home-brew tagged union. The reason is that it automatically makes different DOM types with a similar structure interoperable (JSON/BSON/TOML/...).

Now, Phobos unfortunately only has Algebraic, which not only doesn't have a type enum, but is currently also really bad at keeping static type information when forwarding function calls or operators. The only options were basically to either resort to Algebraic for now and have something that works, or to first implement an alternative algebraic type and get it accepted into Phobos, which would delay the whole process nearly indefinitely.
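For illustration, the enum-discriminated design described in point 2 might look roughly like this (hypothetical names, not the proposed API); the point is that `final switch` makes the compiler verify exhaustiveness:

```d
// Illustrative tagged union with an enum discriminator; all names are
// hypothetical. `final switch` fails to compile if any Kind is unhandled.
enum Kind { null_, boolean, number, text }

struct Value
{
    Kind kind;
    union
    {
        bool   b;
        double num;
        string str;
    }
}

string describe(Value v)
{
    final switch (v.kind)
    {
        case Kind.null_:   return "null";
        case Kind.boolean: return v.b ? "true" : "false";
        case Kind.number:  return "number";
        case Kind.text:    return v.str;
    }
}

unittest
{
    Value v;
    v.kind = Kind.boolean;
    v.b = true;
    assert(describe(v) == "true");
}
```

Adding a fifth member to `Kind` immediately breaks the build of `describe`, which is exactly the optimization and safety benefit a typeid-based discriminator cannot offer.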

> 3/ Going from the untyped world to the typed world and providing an API
> to get back to the untyped world is a loser strategy. That sounds true
> intuitively, but also from my experience manipulating JSON in various
> languages. The Nodes produced by this lib need to be "manipulatable" as
> the unstructured values they represent.

It isn't really clear to me what you mean by this. What exactly about JSONValue can't be manipulated like the "unstructured values [it] represent[s]"?

Or do you perhaps mean the JSON -> deserialize -> manipulate -> serialize -> JSON approach? That definitely is not a "loser strategy"*, but yes, it is limited to applications where you have a partially fixed schema. However, arguably most applications fall into that category.

* OT: My personal observation is that, sadly, the overall tone in the community has generally become a lot less friendly over the last few months. I'm a bit worried about where this may lead in the long term.
August 11, 2015
Am 11.08.2015 um 20:15 schrieb Dmitry Olshansky:
> On 11-Aug-2015 20:30, deadalnix wrote:
>>
>> Ok some actionable items.
>>
>> 1/ How big is a JSON struct? What is the biggest element in the union?
>> Is that element really needed? Recurse.
>
> +1 Also, most JS engines use NaN-boxing to fit the type tag along with
> the payload in 8 bytes total. At least the _fast_ path of std.data.json
> should take advantage of similar techniques.

But the array field already needs 16 bytes on 64-bit systems anyway. We could surely abuse some bits there to at least not use up more for the type tag, but before we go that far, we should first tackle some other questions, such as the allocation strategy of JSONValues during parsing, the Location field and BigInt/Decimal support.

Maybe we should first have a vote about whether BigInt/Decimal should be supported or not, because that would at least solve some of the controversial tradeoffs. I didn't have a use for those personally, but at least we had the real-world issue in vibe.d's implementation that a ulong wasn't exactly representable.
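As a concrete illustration of that representability issue (a minimal reproduction, not vibe.d's actual code): a double mantissa holds 53 bits, so 2^53 + 1 is the first integer it cannot store exactly, and anything above long.max doesn't fit a signed 64-bit slot at all.

```d
// Sketch: round-tripping a ulong through double silently loses precision.
ulong roundTrip(ulong x)
{
    double d = cast(double) x;  // only 53 mantissa bits survive
    return cast(ulong) d;
}

unittest
{
    ulong big = (1UL << 53) + 1;               // 9007199254740993
    assert(roundTrip(big) == big - 1);         // the +1 is silently dropped
    assert(roundTrip(1UL << 53) == 1UL << 53); // 2^53 itself is exact
}
```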

My view generally still is that the DOM representation is something for convenient manipulation of small chunks of JSON, so that performance is not a priority, but feature completeness is.
August 11, 2015
Am 04.08.2015 um 19:14 schrieb deadalnix:
> On Tuesday, 4 August 2015 at 13:10:11 UTC, Sönke Ludwig wrote:
>> This is how it used to be in the vibe.data.json module. I consider
>> that to be a mistake now for multiple reasons, at least on this
>> abstraction level. My proposal would be to have a clean, "strongly
>> typed" JSONValue and a generic jsvar like struct on top of that, which
>> is defined independently, and could for example work on a BSONValue,
>> too. The usage would simply be "var value = parseJSONValue(...);".
>
> That is not going to cut it. I've been working with these for ages. This
> is the very kind of scenario where dynamically typed languages are way
> more convenient.
>
> I've used both quite extensively and this is clear cut: you don't want
> what you call the strongly typed version of things. I've done it in many
> languages, including Java, for instance.
>
> The jsvar interface removes the problematic parts of JS (it uses ~ instead
> of + for string concatenation and does not implement the opDispatch part
> of the API).
>

I just said that jsvar should be supported (even in its full glory), so why is that not going to cut it? Also, in theory, Algebraic already does more or less exactly what you propose (forwards operators, but skips opDispatch and JS-like string operators).
August 11, 2015
On Tuesday, 11 August 2015 at 21:27:48 UTC, Sönke Ludwig wrote:
>> That is not going to cut it. I've been working with these for ages. This
>> is the very kind of scenario where dynamically typed languages are way
>> more convenient.
>>
>> I've used both quite extensively and this is clear cut: you don't want
>> what you call the strongly typed version of things. I've done it in many
>> languages, including Java, for instance.
>>
>> The jsvar interface removes the problematic parts of JS (it uses ~ instead
>> of + for string concatenation and does not implement the opDispatch part
>> of the API).
>>
>
> I just said that jsvar should be supported (even in its full glory), so why is that not going to cut it? Also, in theory, Algebraic already does more or less exactly what you propose (forwards operators, but skips opDispatch and JS-like string operators).

Ok, then maybe there was a misunderstanding on my part.

My understanding was that there was a Node coming from the parser, and that the node could be wrapped in some facility providing a jsvar-like API.

My position is that it is preferable to have whatever DOM node there is be jsvar-like out of the box, rather than having to wrap it into something to get that.
August 11, 2015
On Tuesday, 11 August 2015 at 21:06:24 UTC, Sönke Ludwig wrote:
> See http://s-ludwig.github.io/std_data_json/stdx/data/json/value/JSONValue.payload.html
>
> The question of whether each field is "really" needed obviously depends on the application. However, the biggest type is BigInt, which, from a quick look, contains a dynamic array plus a bool field, so it's not as compact as it could be, but also not really large. There is also an additional Location field that may sometimes be important for good error messages and the like, and sometimes may be totally unneeded.
>

Urg. Looks like BigInt should steal a bit somewhere instead of having a bool like this. That is not really your lib's fault, but it's quite a heavy cost.

Consider this: if the struct fits into 2 registers, it will be passed around as such rather than in memory. That is a significant difference. For BigInt itself and, by proxy, for the JSON library.

Putting the BigInt thing aside, it seems like the biggest field in there is an array of JSONValues or a string. For the string, you can artificially limit the length by 3 bits to stick in a tag. That still gives absurdly large strings. For the JSONValue case, the alignment of the pointer is such that you can steal 3 bits from it. Or, as for strings, the length can be used.

It seems very realizable to me to have the JSONValue struct fit into 2 registers, granted the tag fits in 3 bits (8 different types).

I can help with that if you want to.
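To sketch what the length-stealing variant could look like (purely illustrative, not the actual JSONValue layout): reserving the top 3 bits of a 64-bit length still leaves room for 2^61-byte strings, and the (pointer, length+tag) pair fits in two registers.

```d
// Hedged sketch: a 3-bit type tag packed into the top bits of the length.
enum TAG_SHIFT = 61;
enum LEN_MASK  = (1UL << TAG_SHIFT) - 1;

struct TaggedSlice
{
    const(char)* ptr;
    ulong lenAndTag;  // tag in bits 61-63, length in bits 0-60

    this(string s, ulong tagBits)
    {
        assert(tagBits < 8 && s.length <= LEN_MASK);
        ptr = s.ptr;
        lenAndTag = (tagBits << TAG_SHIFT) | s.length;
    }

    ulong tag() const     { return lenAndTag >> TAG_SHIFT; }
    size_t length() const { return cast(size_t)(lenAndTag & LEN_MASK); }
}

unittest
{
    auto t = TaggedSlice("hello", 5);
    assert(t.tag == 5 && t.length == 5);
}
```

On 64-bit targets this struct is 16 bytes, i.e. exactly the two-register size discussed above.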

> However, my goal when implementing this was never to make the DOM representation as efficient as possible. The simple reason is that a DOM representation is inherently inefficient compared to operating on the structure using either the pull parser or a deserializer that directly converts into a static D type. IMO, those should be advertised instead of trying to milk a dead cow (in terms of performance).
>

Indeed. Still, JSON nodes should be as lightweight as possible.

>> 2/ As far as I can see, the elements are discriminated using typeid. An
>> enum is preferable, as the compiler would know the values ahead of time
>> and optimize based on this. It also allows use of things like final switch.
>
> Using a tagged-union-like structure is definitely what I'd like to have, too. However, the main goal was to build the DOM type upon a generic algebraic type instead of using a home-brew tagged union. The reason is that it automatically makes different DOM types with a similar structure interoperable (JSON/BSON/TOML/...).
>

That is a great point that I haven't considered. I'd go the other way around: provide a typeid-based struct derived from the enum-tagged one for compatibility. It can even be alias this, so the transition is transparent.

The transformation is not bijective, so it'd be great to get the most restrictive form (the enum) and fall back on the least restrictive one (alias this) when wanted.

> Now, Phobos unfortunately only has Algebraic, which not only doesn't have a type enum, but is currently also really bad at keeping static type information when forwarding function calls or operators. The only options were basically to either resort to Algebraic for now and have something that works, or to first implement an alternative algebraic type and get it accepted into Phobos, which would delay the whole process nearly indefinitely.
>

That's fine. Done is better than perfect. Still, API changes tend to be problematic, so we need to nail down that part at least, and an enum with a fallback on a typeid-based solution seems like the best option.

> Or do you perhaps mean the JSON -> deserialize -> manipulate -> serialize -> JSON approach? That definitely is not a "loser strategy"*, but yes, it is limited to applications where you have a partially fixed schema. However, arguably most applications fall into that category.
>

Yes.

August 12, 2015
On 12-Aug-2015 00:21, Sönke Ludwig wrote:
> Am 11.08.2015 um 20:15 schrieb Dmitry Olshansky:
>> On 11-Aug-2015 20:30, deadalnix wrote:
>>>
>>> Ok some actionable items.
>>>
>>> 1/ How big is a JSON struct? What is the biggest element in the union?
>>> Is that element really needed? Recurse.
>>
>> +1 Also, most JS engines use NaN-boxing to fit the type tag along with
>> the payload in 8 bytes total. At least the _fast_ path of std.data.json
>> should take advantage of similar techniques.
>
> But the array field already needs 16 bytes on 64-bit systems anyway. We
> could surely abuse some bits there to at least not use up more for the
> type tag, but before we go that far, we should first tackle some other
> questions, such as the allocation strategy of JSONValues during parsing,
> the Location field and BigInt/Decimal support.

A pointer to an array should work for all fields > 8 bytes. Depending on the ratio of value frequency vs. array frequency (which is at least ~5-10 in any practical scenario), it would make things both more compact and faster.

> Maybe we should first have a vote about whether BigInt/Decimal should be
> supported or not, because that would at least solve some of the
> controversial tradeoffs. I didn't have a use for those personally, but
> at least we had the real-world issue in vibe.d's implementation that a
> ulong wasn't exactly representable.

Well, I've stated why I think BigInt should be optional. The reason is that C++ parsers don't even bother with anything beyond ulong/double, nor would e.g. any Node.js stuff bother with things beyond double.

Lastly, we don't have BigFloat, so supporting BigInt but not BigFloat is kinda half-way.

So please make it an option. And again, add an extra indirection (that is, BigInt*) for the BigInt field in the union, because big integers are extremely rare.
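The indirection amounts to something like this (a sketch with illustrative names, not the proposed layout): with `BigInt*` in the union, the rare big-number case costs one heap allocation, while the union itself stays no larger than a string slice.

```d
import std.bigint : BigInt;

// Sketch: BigInt stored behind a pointer so it doesn't inflate the union.
union Payload
{
    long    integer;
    double  floating;
    string  text;     // 16 bytes on 64-bit: the largest member
    BigInt* big;      // one pointer instead of BigInt's array + bool
}

// The union never grows past the slice, regardless of BigInt's own size.
static assert(Payload.sizeof == string.sizeof);

unittest
{
    Payload p;
    p.big = new BigInt(123);  // `new` on a struct yields a BigInt*
    assert(*p.big == 123);
}
```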

> My view generally still is that the DOM representation is something for
> convenient manipulation of small chunks of JSON, so that performance is
> not a priority, but feature completeness is.

I'm confused - there must be some struct that represents a useful value. And more importantly - is JSONValue going to be converted to jsvar? If not, I'm fine. Otherwise, whatever inefficiency is present in JSONValue would be compounded by this conversion process.

-- 
Dmitry Olshansky
August 12, 2015
Am 11.08.2015 um 23:52 schrieb deadalnix:
> On Tuesday, 11 August 2015 at 21:27:48 UTC, Sönke Ludwig wrote:
>>> That is not going to cut it. I've been working with these for ages. This
>>> is the very kind of scenario where dynamically typed languages are way
>>> more convenient.
>>>
>>> I've used both quite extensively and this is clear cut: you don't want
>>> what you call the strongly typed version of things. I've done it in many
>>> languages, including Java, for instance.
>>>
>>> The jsvar interface removes the problematic parts of JS (it uses ~ instead
>>> of + for string concatenation and does not implement the opDispatch part
>>> of the API).
>>>
>>
>> I just said that jsvar should be supported (even in its full glory),
>> so why is that not going to cut it? Also, in theory, Algebraic already
>> does more or less exactly what you propose (forwards operators, but
>> skips opDispatch and JS-like string operators).
>
> Ok, then maybe there was a misunderstanding on my part.
>
> My understanding was that there was a Node coming from the parser, and
> that the node could be wrapped in some facility providing a jsvar-like API.

Okay, no, that's correct.

>
> My position is that it is preferable to have whatever DOM node there is
> be jsvar-like out of the box, rather than having to wrap it into
> something to get that.

But take into account that Algebraic already behaves much like jsvar (at least ideally), just without opDispatch and the JavaScript operator emulation (which I'm strongly opposed to as a *default*). So the jsvar wrapper would really only be needed for the cases where especially concise code is desired when operating on JSON objects.

We also discussed an alternative approach similar to opt(n).foo.bar[1].baz, where n is a JSONValue and opt() creates a wrapper that enables safe navigation within the DOM, propagating any missing/mismatched fields to the final result instead of throwing. This could also be combined with a final type query: opt!string(n).foo.bar
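A rough sketch of how such an opt() wrapper could work over a toy DOM (all names here are hypothetical, not the actual std.data.json API): `opDispatch` turns a missing field into a null state that propagates to the end of the chain instead of throwing.

```d
// Hedged sketch of opt()-style safe navigation over a toy DOM.
// `Node`, `Opt`, `opt`, and `get` are illustrative names only.
struct Node
{
    Node[string] fields;
    string value;
}

struct Opt
{
    const(Node)* node;  // becomes null once any step in the chain is missing

    Opt opDispatch(string name)() const
    {
        if (node is null) return Opt(null);
        auto p = name in node.fields;
        return Opt(p ? p : null);
    }

    string get(string fallback = null) const
    {
        return node ? node.value : fallback;
    }
}

Opt opt(ref const Node n) { return Opt(&n); }

unittest
{
    Node doc;
    doc.fields["foo"] = Node(["bar": Node(null, "hi")]);
    assert(opt(doc).foo.bar.get == "hi");
    assert(opt(doc).foo.missing.get("dflt") == "dflt"); // no throw
}
```

The final type query from the message above (opt!string(n)...) would just parameterize `get`'s result type instead of hard-coding string.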
August 12, 2015
Am 12.08.2015 um 08:28 schrieb Dmitry Olshansky:
> On 12-Aug-2015 00:21, Sönke Ludwig wrote:
>> Am 11.08.2015 um 20:15 schrieb Dmitry Olshansky:
>>> On 11-Aug-2015 20:30, deadalnix wrote:
>>>>
>>>> Ok some actionable items.
>>>>
>>>> 1/ How big is a JSON struct? What is the biggest element in the
>>>> union?
>>>> Is that element really needed? Recurse.
>>>
>>> +1 Also, most JS engines use NaN-boxing to fit the type tag along with
>>> the payload in 8 bytes total. At least the _fast_ path of std.data.json
>>> should take advantage of similar techniques.
>>
>> But the array field already needs 16 bytes on 64-bit systems anyway. We
>> could surely abuse some bits there to at least not use up more for the
>> type tag, but before we go that far, we should first tackle some other
>> questions, such as the allocation strategy of JSONValues during parsing,
>> the Location field and BigInt/Decimal support.
>
> A pointer to an array should work for all fields > 8 bytes. Depending on
> the ratio of value frequency vs. array frequency (which is at least
> ~5-10 in any practical scenario), it would make things both more compact
> and faster.
>
>> Maybe we should first have a vote about whether BigInt/Decimal should be
>> supported or not, because that would at least solve some of the
>> controversial tradeoffs. I didn't have a use for those personally, but
>> at least we had the real-world issue in vibe.d's implementation that a
>> ulong wasn't exactly representable.
>
> Well, I've stated why I think BigInt should be optional. The reason is
> that C++ parsers don't even bother with anything beyond ulong/double,
> nor would e.g. any Node.js stuff bother with things beyond double.

The trouble begins with long vs. ulong, even if we leave larger numbers aside. We'd really have to support both, but choosing between the two is ambiguous, which isn't very pretty overall.

>
> Lastly we don't have BigFloat so supporting BigInt but not BigFloat is
> kinda half-way.

That's where Decimal would come in. There is some code for that commented out, but I really didn't want to add it without a standard Phobos implementation. But I wouldn't say that this is really an argument against BigInt, maybe more one for implementing a Decimal type.

>
> So please make it an option. And again, add an extra indirection (that
> is, BigInt*) for the BigInt field in the union, because big integers are
> extremely rare.

Good idea, didn't think about that.

>
>> My view generally still is that the DOM representation is something for
>> convenient manipulation of small chunks of JSON, so that performance is
>> not a priority, but feature completeness is.
>
> I'm confused - there must be some struct that represents a useful value.

There is also the lower-level JSONParserNode, which represents a single piece of the JSON document. But since that struct is just part of a range, its size doesn't matter for speed or memory consumption (the nodes are not allocated or copied while parsing).

> And more importantly - is JSONValue going to be converted to jsvar? If
> not, I'm fine. Otherwise, whatever inefficiency is present in JSONValue
> would be compounded by this conversion process.

By default and currently it isn't, but it might be an idea for the future. The jsvar struct could possibly be implemented as a wrapper around JSONValue as a whole, so that it doesn't have to perform an actual conversion of the whole document.

Generally, working with JSONValue is already rather inefficient due to all of the dynamic allocations to populate dynamic and associative arrays. Changing that would require switching to completely different underlying container types, which would at least make the API a lot less intuitive.

We could of course also simply provide an alternative value representation that is not based on Algebraic (or an enum tag based alternative) and is not augmented with location information, but optimized solely for speed and low memory consumption.
August 12, 2015
Am 12.08.2015 um 00:21 schrieb deadalnix:
> On Tuesday, 11 August 2015 at 21:06:24 UTC, Sönke Ludwig wrote:
>> See
>> http://s-ludwig.github.io/std_data_json/stdx/data/json/value/JSONValue.payload.html
>>
>>
>> The question of whether each field is "really" needed obviously depends
>> on the application. However, the biggest type is BigInt, which, from a
>> quick look, contains a dynamic array plus a bool field, so it's not as
>> compact as it could be, but also not really large. There is also an
>> additional Location field that may sometimes be important for good
>> error messages and the like, and sometimes may be totally unneeded.
>>
>
> Urg. Looks like BigInt should steal a bit somewhere instead of having a
> bool like this. That is not really your lib's fault, but it's quite a
> heavy cost.
>
> Consider this: if the struct fits into 2 registers, it will be passed
> around as such rather than in memory. That is a significant difference.
> For BigInt itself and, by proxy, for the JSON library.

Agreed, this was what I also thought. Considering that BigInt is heavy anyway, Dmitry's suggestion to store a "BigInt*" sounds like a good idea to sidestep that issue, though.

> Putting the BigInt thing aside, it seems like the biggest field in there
> is an array of JSONValues or a string. For the string, you can
> artificially limit the length by 3 bits to stick in a tag. That still
> gives absurdly large strings. For the JSONValue case, the alignment of
> the pointer is such that you can steal 3 bits from it. Or, as for
> strings, the length can be used.
>
> It seems very realizable to me to have the JSONValue struct fit into 2
> registers, granted the tag fits in 3 bits (8 different types).
>
> I can help with that if you want to.

The question is mainly just: should we decide on a single way to represent values (either speed or features), or let the library user decide, either by making JSONValue a template or by providing two separate structs optimized for each case?

In the latter case, we could really optimize on all fronts and, for example, use custom containers that need fewer allocations and are more cache-friendly than the built-in ones.

>> However, my goal when implementing this was never to make the DOM
>> representation as efficient as possible. The simple reason is that a
>> DOM representation is inherently inefficient compared to operating on
>> the structure using either the pull parser or a deserializer that
>> directly converts into a static D type. IMO, those should be advertised
>> instead of trying to milk a dead cow (in terms of performance).
>>
>
> Indeed. Still, JSON nodes should be as lightweight as possible.
>
>>> 2/ As far as I can see, the elements are discriminated using typeid. An
>>> enum is preferable, as the compiler would know the values ahead of time
>>> and optimize based on this. It also allows use of things like final switch.
>>
>> Using a tagged-union-like structure is definitely what I'd like to
>> have, too. However, the main goal was to build the DOM type upon a
>> generic algebraic type instead of using a home-brew tagged union. The
>> reason is that it automatically makes different DOM types with a
>> similar structure interoperable (JSON/BSON/TOML/...).
>>
>
> That is a great point that I haven't considered. I'd go the other way
> around: provide a typeid-based struct derived from the enum-tagged one
> for compatibility. It can even be alias this, so the transition is
> transparent.
>
> The transformation is not bijective, so it'd be great to get the most
> restrictive form (the enum) and fall back on the least restrictive one
> (alias this) when wanted.

As long as the set of types is fixed, it would even be bijective. Anyway, I've just started to work on a generic variant of an enum-based algebraic type that exploits as much static type information as possible. If that works out (compiler bugs?), it would be a great thing to have in Phobos, so maybe it's worth delaying the JSON module for that if necessary.

The optimization to store the type enum in the length field of dynamic arrays could also be built into the generic type.

>> Now, Phobos unfortunately only has Algebraic, which not only doesn't
>> have a type enum, but is currently also really bad at keeping static
>> type information when forwarding function calls or operators. The only
>> options were basically to either resort to Algebraic for now and have
>> something that works, or to first implement an alternative algebraic
>> type and get it accepted into Phobos, which would delay the whole
>> process nearly indefinitely.
>>
>
> That's fine. Done is better than perfect. Still, API changes tend to be
> problematic, so we need to nail down that part at least, and an enum with
> a fallback on a typeid-based solution seems like the best option.

Yeah, the transition is indeed problematic. Sadly, the "alias this" idea wouldn't work for that either, because operators and methods of the enum-based algebraic type usually have different return types.

>> Or do you perhaps mean the JSON -> deserialize -> manipulate ->
>> serialize -> JSON approach? That definitely is not a "loser
>> strategy"*, but yes, it is limited to applications where you have a
>> partially fixed schema. However, arguably most applications fall into
>> that category.
>
> Yes.

Just to state explicitly what I mean: this strategy has the most efficient in-memory storage format and profits from all the static type checking niceties of the compiler. It also means that there is a documented schema in the code that can be used for reference by the developers and that will automatically be verified by the serializer, resulting in less, and better checked, code. So, where applicable, I claim that this is the best strategy for working with such data.

For maximum efficiency, it can also be transparently combined with the pull parser. The pull parser can, for example, be used to jump between array entries, with the serializer then reading each individual array entry.
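To illustrate the typed round trip with something runnable, here is a sketch using Phobos' std.json as a stand-in (std.data.json's serializer would look different; `fromJSON` and `User` are purely illustrative): the struct itself serves as the documented, compiler-verified schema.

```d
import std.json;

struct User   // the struct *is* the schema the text above refers to
{
    string name;
    int    age;
}

// Toy deserializer: maps JSON object members onto struct fields by name.
T fromJSON(T)(JSONValue v)
{
    T result;
    foreach (i, ref field; result.tupleof)
    {
        enum fieldName = __traits(identifier, T.tupleof[i]);
        static if (is(typeof(field) == string))
            field = v[fieldName].str;
        else static if (is(typeof(field) : long))
            field = cast(typeof(field)) v[fieldName].integer;
    }
    return result;
}

unittest
{
    auto u = fromJSON!User(parseJSON(`{"name":"ada","age":36}`));
    assert(u.name == "ada" && u.age == 36);
    u.age += 1;  // manipulate with full static type checking
}
```

A schema mismatch (missing member, wrong type) fails loudly at deserialization time instead of deep inside application code, which is the "better checked" property claimed above.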