August 25, 2014
On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote:
> Following up on the recent "std.jgrandson" thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block.
>
> Code: https://github.com/s-ludwig/std_data_json
> Docs: http://s-ludwig.github.io/std_data_json/
> DUB: http://code.dlang.org/packages/std_data_json
>
> The new code contains:
>  - Lazy lexer in the form of a token input range (using slices of the
>    input if possible)
>  - Lazy streaming parser (StAX style) in the form of a node input range
>  - Eager DOM style parser returning a JSONValue
>  - Range based JSON string generator taking either a token range, a
>    node range, or a JSONValue
>  - Opt-out location tracking (line/column) for tokens, nodes and values
>  - No opDispatch() for JSONValue - this has shown to do more harm than
>    good in vibe.data.json
>
> The DOM style JSONValue type is based on std.variant.Algebraic. This currently has a few usability issues that can be solved by upgrading/fixing Algebraic:
>
>  - Operator overloading only works sporadically
>  - No "tag" enum is supported, so that switch()ing on the type of a
>    value doesn't work and an if-else cascade is required
>  - Operations and conversions between different Algebraic types is not
>    conveniently supported, which gets important when other similar
>    formats get supported (e.g. BSON)
>
> Assuming that those points are solved, I'd like to get some early feedback before going for an official review. One open issue is how to handle unescaping of string literals. Currently it always unescapes immediately, which is more efficient for general input ranges when the unescaped result is needed, but less efficient for string inputs when the unescaped result is not needed. Maybe a flag could be used to conditionally switch behavior depending on the input range type.
>
> Destroy away! ;)
>
> [1]: http://forum.dlang.org/thread/lrknjl$co7$1@digitalmars.com
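
On the open question of when to unescape string literals, one option is to keep the raw slice in the token and unescape only on first access. A minimal sketch of that idea, in Python rather than D; `StringToken` is a hypothetical name and is not part of std_data_json, and the unescaping is simply delegated to Python's json module:

```python
import json


class StringToken:
    """String literal token that keeps the raw slice and unescapes on demand."""

    def __init__(self, raw: str):
        self.raw = raw        # the literal as it appears in the input, quotes included
        self._value = None    # unescaped text, filled in on first access

    @property
    def value(self) -> str:
        # Pay for unescaping only if somebody actually asks for the result.
        if self._value is None:
            self._value = json.loads(self.raw)
        return self._value


token = StringToken(r'"line\nbreak \u00e9"')
```

A consumer that only copies or skips the token never touches `value`, so no unescaping work is done for it.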


One missing feature (which is also missing from the existing std.json) is support for NaN and Infinity as JSON values. Although they are not part of the formal JSON spec (which is a ridiculous omission; the argument given for excluding them is fallacious), they do get generated if you use JavaScript's toString to create the JSON. Many JSON libraries (e.g. Google's) also generate them, so they are frequently encountered in practice. So a JSON parser should at least be able to lex them.

i.e. this should be parsable:

{"foo": NaN, "bar": Infinity, "baz": -Infinity}

You should also put tests in for what happens when you pass NaN or infinity to toJSON. It shouldn't silently generate invalid JSON.
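
For comparison only, here is how another standard library handles both halves of this. The sketch uses Python's json module, not std.json or std_data_json: its parser accepts the three literals by default, and its generator can be told to raise instead of emitting them.

```python
import json
import math

text = '{"foo": NaN, "bar": Infinity, "baz": -Infinity}'

# Python's json module accepts these non-standard literals by default.
doc = json.loads(text)


def strict_dump(value):
    # allow_nan=False makes the generator raise ValueError instead of
    # silently producing output that is not valid JSON.
    return json.dumps(value, allow_nan=False)


try:
    strict_dump({"foo": float("nan")})
    raised = False
except ValueError:
    raised = True
```

The kind of test asked for above would then simply assert that `raised` is true for NaN, Infinity and -Infinity.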





August 25, 2014
On Monday, 25 August 2014 at 13:07:08 UTC, Don wrote:
> practice. So a JSON parser should at least be able to lex them.
>
> ie this should be parsable:
>
> {"foo": NaN, "bar": Infinity, "baz": -Infinity}
>
> You should also put tests in for what happens when you pass NaN or infinity to toJSON. It shouldn't silently generate invalid JSON.

I believe you are allowed to use very high exponents, though, such as 1E999. So you need to decide whether those should be mapped to +Infinity or to the max value…
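
To make the two choices concrete, a small sketch in Python (its json module, not the D code under discussion). The parser there maps the overflowing literal to infinity; `clamp_to_max` is a hypothetical helper showing what the alternative mapping would look like:

```python
import json
import math
import sys

# "1E999" is a grammatically valid JSON number that overflows a double.
huge = json.loads("1E999")


def clamp_to_max(x: float) -> float:
    # The alternative policy: replace an overflowed value by the largest
    # finite double of the same sign.
    if math.isinf(x):
        return math.copysign(sys.float_info.max, x)
    return x
```

Clamping turns an obviously out-of-range value into one that looks like ordinary data, which is the main risk of that policy.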

NaN also comes in two forms with differing semantics: signalling (NaNs) and quiet (NaN). NaN is used for 0/0 and sqrt(-1), whereas NaNs is used for illegal values and failure.

For some reason D does not seem to support this aspect of IEEE754? I cannot find ".nans" listed on the page http://dlang.org/property.html

The distinction is important when you do conditional branching. With NaNs you might not be able to figure out which branch to take, since a real value may be missing; with NaN you did get a value (one known not to be real), so you might still be able to branch.
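
For reference, the two forms differ only in one bit of the encoding. A minimal sketch that classifies a raw 64-bit pattern, assuming the IEEE 754-2008 convention in which the most significant fraction bit is the "quiet" bit (some older architectures use the opposite convention); the names are illustrative:

```python
EXP_MASK = 0x7FF0_0000_0000_0000    # all eleven exponent bits
FRAC_MASK = 0x000F_FFFF_FFFF_FFFF   # the 52 fraction bits
QUIET_BIT = 0x0008_0000_0000_0000   # most significant fraction bit


def classify(bits: int) -> str:
    # A NaN has an all-ones exponent and a non-zero fraction.
    if (bits & EXP_MASK) != EXP_MASK or (bits & FRAC_MASK) == 0:
        return "not a NaN"
    return "quiet NaN" if bits & QUIET_BIT else "signalling NaN"
```

The function works on the integer bit pattern rather than on a float, so that no floating-point operation gets a chance to quiet a signalling NaN along the way.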
August 25, 2014
On 25.08.2014 14:12, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
> On Monday, 25 August 2014 at 11:30:15 UTC, Sönke Ludwig wrote:
>> I've added support (compile time option [1]) for long and BigInt in
>> the lexer (and parser), see [2]. JSONValue currently still only stores
>> double for numbers.
>
> It can be very useful to have a base 10 exponent representation in
> certain situations where you need to have the exact same results in two
> systems (like a third party ERP server versus a client side
> application). Base 2 exponents are tricky (incorrect) when you read ascii.
>
> E.g. I have resorted to using Decimal in Python just to avoid the weird
> round off issues when calculating prices where the price is given in
> fractions of the order unit.
>
> Perhaps a marginal problem, but could be important for some serious
> application areas where you need to integrate D with existing systems
> (for which you don't have the source code).

In fact, I've already prepared the code for that, but commented it out for now, because I wanted to have an efficient algorithm for converting double to Decimal and because we should probably first add a Decimal type to Phobos instead of adding it to the JSON module.
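
The round-off issue from the quoted post is easy to reproduce. The following sketch uses Python's json module and its Decimal type (the one mentioned above), not the commented-out D code; the field names are made up for the example:

```python
import json
from decimal import Decimal

text = '{"unit_price": 0.1, "quantity": 3}'

# Default: fractional numbers become binary doubles.
as_double = json.loads(text)
# parse_float lets the caller choose a base 10 representation instead.
as_decimal = json.loads(text, parse_float=Decimal)

double_total = as_double["unit_price"] * as_double["quantity"]
decimal_total = as_decimal["unit_price"] * as_decimal["quantity"]
```

With doubles the total is not exactly 0.3, while the decimal total is exactly `Decimal("0.3")`, which is what a system exchanging prices with a third party would expect.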
August 25, 2014
On 25.08.2014 15:07, Don wrote:
> One missing feature (which is also missing from the existing std.json)
> is support for NaN and Infinity as JSON values. Although they are not
> part of the formal JSON spec (which is a ridiculous omission, the
> argument given for excluding them is fallacious), they do get generated
> if you use Javascript's toString to create the JSON. Many JSON libraries
> (eg Google's) also generate them, so they are frequently encountered in
> practice. So a JSON parser should at least be able to lex them.
>
> ie this should be parsable:
>
> {"foo": NaN, "bar": Infinity, "baz": -Infinity}

This would probably be best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though.

>
> You should also put tests in for what happens when you pass NaN or
> infinity to toJSON. It shouldn't silently generate invalid JSON.

Good point. The current solution to just use formattedWrite("%.16g") is also not ideal.
August 25, 2014
On 25.08.2014 16:04, Sönke Ludwig wrote:
> On 25.08.2014 15:07, Don wrote:
>> One missing feature (which is also missing from the existing std.json)
>> is support for NaN and Infinity as JSON values. Although they are not
>> part of the formal JSON spec (which is a ridiculous omission, the
>> argument given for excluding them is fallacious), they do get generated
>> if you use Javascript's toString to create the JSON. Many JSON libraries
>> (eg Google's) also generate them, so they are frequently encountered in
>> practice. So a JSON parser should at least be able to lex them.
>>
>> ie this should be parsable:
>>
>> {"foo": NaN, "bar": Infinity, "baz": -Infinity}
>
> This would probably best added as another (CT) optional feature. I think
> the default should strictly adhere to the JSON specification, though.

http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/LexOptions.specialFloatLiterals.html

>
>>
>> You should also put tests in for what happens when you pass NaN or
>> infinity to toJSON. It shouldn't silently generate invalid JSON.
>
> Good point. The current solution to just use formattedWrite("%.16g") is
> also not ideal.

By default, floating-point special values are now output as 'null', according to the ECMAScript standard. Optionally, they will be emitted as 'NaN' and 'Infinity':

http://s-ludwig.github.io/std_data_json/stdx/data/json/generator/GeneratorOptions.specialFloatLiterals.html
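
The behaviour described above can be sketched in a few lines. This is an illustration in Python, not the D generator; `to_json` and its `special_float_literals` parameter are hypothetical names that only loosely echo the option linked above:

```python
import json
import math


def special_floats_to_null(value):
    # Recursively replace NaN and +/-Infinity by None, which serializes as null.
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {k: special_floats_to_null(v) for k, v in value.items()}
    if isinstance(value, list):
        return [special_floats_to_null(v) for v in value]
    return value


def to_json(value, special_float_literals=False):
    if special_float_literals:
        # Opt-in: emit the non-standard NaN / Infinity literals.
        return json.dumps(value)
    # Default: strictly valid JSON, special values become null.
    return json.dumps(special_floats_to_null(value), allow_nan=False)
```

For example, `to_json({"x": float("nan")})` yields `{"x": null}`, while passing `special_float_literals=True` yields `{"x": NaN}`.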
August 25, 2014
On Monday, 25 August 2014 at 15:34:29 UTC, Sönke Ludwig wrote:
> By default, floating-point special values are now output as 'null', according to the ECMA-script standard. Optionally, they will be emitted as 'NaN' and 'Infinity':

ECMAScript presumes double. I think one should base Phobos on language-independent standards. I suggest:

http://tools.ietf.org/html/rfc7159

For a web server it would be most useful to get an exception, since otherwise you risk ending up with web clients that do not work and nothing in the logs. It is better to throw an exception and log an error so that the problem can be fixed.
August 25, 2014
On Monday, 25 August 2014 at 15:46:12 UTC, Ola Fosheim Grøstad wrote:
> For a web server it would be most useful to get an exception since you risk ending up with web-clients not working with no logging. It is better to have an exception and log an error so the problem can be fixed.

Let me expand a bit on the difference between web clients and servers, assuming D is used on the server:

* Web servers have to check all input and log illegal activity. It is either a bug or an attack.

* Web clients don't have to check input from the server (at most a crypto check) and should not do double work if servers validate anyway.

* Web servers detect errors and send the error as a response to the client that displays it as a warning to the user. This is the uncommon case so you don't want to burden the client with it.

From this we can infer:

- It makes more sense for ECMAScript to turn illegal values into null since it runs on the client.

- The server needs efficient validation of input so that it can have faster response.

- The more integration of validation of typedness you can have in the parser, the better.


Thus it would be an advantage to be able to configure the validation done in the parser (through template mechanisms):


1. On write: throw exception on all illegal values or values that cannot be represented in the format. If the values are illegal then the client should not receive it. It could cause legal problems (like wrong prices).


2. On read: add the ability to configure the validation of typedness on many parameters:

- no nulls, no dicts, only nesting arrays etc

- predetermined key-values and automatic mapping to structs on exact match.

- require all leaf arrays to be uniform (array of strings, array of numbers)

- match a predefined grammar

etc
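
To make two of the read-side rules concrete ("no nulls" and "uniform leaf arrays"), here is a rough sketch in Python. It validates after parsing rather than inside the parser, so it only shows the rules, not the efficiency argument; all names are hypothetical:

```python
import json


class ValidationError(ValueError):
    pass


def _kind(leaf) -> str:
    # bool is checked first because Python's bool is a subclass of int.
    if isinstance(leaf, bool):
        return "bool"
    if isinstance(leaf, (int, float)):
        return "number"
    return "string"


def validate(node, path="$"):
    if node is None:
        raise ValidationError(f"{path}: null is not allowed")
    if isinstance(node, dict):
        for key, child in node.items():
            validate(child, f"{path}.{key}")
    elif isinstance(node, list):
        for i, child in enumerate(node):
            validate(child, f"{path}[{i}]")
        leaves = [c for c in node if not isinstance(c, (dict, list))]
        # An array consisting only of leaves must hold a single kind of value.
        if len(leaves) == len(node) and len({_kind(c) for c in leaves}) > 1:
            raise ValidationError(f"{path}: leaf array is not uniform")


def is_accepted(text: str) -> bool:
    try:
        validate(json.loads(text))
    except ValidationError:
        return False
    return True
```

A server would log the `ValidationError` message, which carries the path of the offending value, and reject the request.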


August 25, 2014
On 8/23/2014 6:32 PM, Brad Roberts via Digitalmars-d wrote:
>> I'm not convinced that using an adapter algorithm won't be just as fast.
> Consider your own talks on optimizing the existing dmd lexer.  In those talks
> you've talked about the evils of additional processing on every byte.  That's
> what you're talking about here.  While it's possible that the inliner and other
> optimizer steps might be able to integrate the two phases and remove some
> overhead, I'll believe it when I see the resulting assembly code.

On the other hand, deadalnix demonstrated that the ldc optimizer was able to remove the extra code.

I have a reasonable faith that optimization can be improved where necessary to cover this.
August 25, 2014
On 8/23/2014 3:51 PM, Andrei Alexandrescu wrote:
> An adapter would solve the wrong problem here. There's nothing to adapt from and
> to.
>
> An adapter would be good if e.g. the stream uses UTF-16 or some Windows
> encoding. Bytes are the natural input for a json parser.

The adaptation is to take arbitrary byte input in an unknown encoding and produce valid UTF.

Note that many HTML readers scan the bytes to see if it is ASCII, UTF, some code page encoding, Shift-JIS, etc., and translate accordingly. I do not see why that is less costly to put inside the JSON lexer than as an adapter.
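
As a rough illustration of the adapter idea (in Python, with hypothetical names, and far simpler than what an HTML reader does): the adapter looks at the leading bytes, picks a decoder, and hands plain text to a parser that knows nothing about encodings.

```python
import codecs
import json

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM begins with
# the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def to_text(raw: bytes, fallback: str = "utf-8") -> str:
    # The adapter: sniff a byte order mark, otherwise trust the caller's fallback.
    for bom, codec in BOMS:
        if raw.startswith(bom):
            return raw.decode(codec)   # these codecs consume the BOM
    return raw.decode(fallback)


def parse_bytes(raw: bytes, fallback: str = "utf-8"):
    # The parser itself only ever sees already-decoded text.
    return json.loads(to_text(raw, fallback))
```

Whether this layering costs more than sniffing inside the lexer is exactly the point under debate; the sketch only shows that the two concerns separate cleanly.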

August 25, 2014
On 8/25/2014 6:23 AM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
> On Monday, 25 August 2014 at 13:07:08 UTC, Don wrote:
>> practice. So a JSON parser should at least be able to lex them.
>>
>> ie this should be parsable:
>>
>> {"foo": NaN, "bar": Infinity, "baz": -Infinity}
>>
>> You should also put tests in for what happens when you pass NaN or infinity to
>> toJSON. It shouldn't silently generate invalid JSON.
>
> I believe you are allowed to use very high exponents, though. Like: 1E999 . So
> you need to decide if those should be mapped to +Infinity or to the max value…

Infinity. Mapping to max value would be a horrible bug.


> NaN also come in two forms with differing semantics: signalling(NaNs) and quiet
> (NaN).  NaN is used for 0/0 and sqrt(-1), but NaNs is used for illegal values
> and failure.
>
> For some reason D does not seem to support this aspect of IEEE754? I cannot find
> ".nans" listed on the page http://dlang.org/property.html

Because I tried supporting them in C++. It doesn't work for various reasons. Nobody else supports them, either.