August 23, 2014
On Saturday, 23 August 2014 at 19:01:13 UTC, Brad Roberts via Digitalmars-d wrote:
> original string is ascii or utf-8 or other.  The cost during lexing is essentially zero.

I am not so sure when it comes to SIMD lexing. I think the behaviour should be specified in a way that encourages later optimizations.
August 23, 2014
Some baselines for performance:

https://github.com/mloskot/json_benchmark

http://chadaustin.me/2013/01/json-parser-benchmarking/
August 23, 2014
On Saturday, 23 August 2014 at 09:22:01 UTC, Sönke Ludwig wrote:
> Main issues of using opDispatch:
>
>  - Prone to bugs where a normal field/method of the JSONValue struct is accessed instead of a JSON field
>  - On top of that the var.field syntax gives the wrong impression that you are working with static typing, while var["field"] makes it clear that runtime indexing is going on
>  - Every interface change of JSONValue would be a silent breaking change, because the whole string domain is used up for opDispatch

Yes, I don't mind missing that one. It looks like a false good idea.
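
The first point in particular is easy to trip over. A contrived sketch (not the proposed JSONValue, just an illustration of the collision):

import std.stdio;

// Illustrative toy type with opDispatch-based member access.
struct Value
{
    string[string] fields;

    string opDispatch(string name)() const { return fields[name]; }
    string opIndex(string name) const { return fields[name]; }

    size_t length() const { return fields.length; }   // a normal member
}

void main()
{
    auto v = Value(["length": "field named length", "name": "JSON"]);

    writeln(v.name);      // "JSON" -- resolved through opDispatch
    writeln(v.length);    // 2 -- the struct's own method wins, not the JSON field
    writeln(v["length"]); // "field named length" -- unambiguous
}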
August 23, 2014
On 8/23/14, 10:46 AM, Walter Bright wrote:
> On 8/23/2014 10:42 AM, Sönke Ludwig wrote:
>> Am 23.08.2014 19:38, schrieb Walter Bright:
>>> On 8/23/2014 9:36 AM, Sönke Ludwig wrote:
>>>> input types "string" and "immutable(ubyte)[]"
>>>
>>> Why the immutable(ubyte)[] ?
>>
>> I've adopted that basically from Andrei's module. The idea is to allow
>> processing data with arbitrary character encoding. However, the output
>> will always be Unicode and JSON is defined to be encoded as Unicode, too,
>> so that could probably be dropped...
>
> I feel that non-UTF encodings should be handled by adapter algorithms,
> not embedded into the JSON lexer, so yes, I'd drop that.

I think accepting ubyte is a good idea. It means "got this stream of bytes off of the wire and it hasn't been validated as a UTF string". It also means (which is true) that the lexer does enough validation to constrain arbitrary bytes into text, and saves the caller from either a check (expensive) or a cast (unpleasant).

Reality is the JSON lexer takes ubytes and produces tokens.
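
To make the trade-off concrete, this is roughly what the caller is pushed into when the lexer only accepts string (illustrative snippet, sample data made up):

import std.stdio;
import std.utf : validate;

void main()
{
    // Bytes straight off the wire, not yet known to be valid UTF-8.
    immutable(ubyte)[] wire = [0x7B, 0x22, 0x61, 0x22, 0x3A, 0x31, 0x7D]; // {"a":1}

    // The "unpleasant cast": compiles, but performs no validation at all.
    auto unchecked = cast(string) wire;

    // The "expensive check": a full extra pass over the data before lexing.
    validate(unchecked);

    writeln(unchecked); // prints {"a":1}
}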


Andrei

August 23, 2014
On 8/23/2014 12:00 PM, Brad Roberts via Digitalmars-d wrote:
> On 8/23/2014 10:46 AM, Walter Bright via Digitalmars-d wrote:
>> I feel that non-UTF encodings should be handled by adapter algorithms,
>> not embedded into the JSON lexer, so yes, I'd drop that.
>
> For performance purposes, determining encoding during lexing is useful.

I'm not convinced that using an adapter algorithm won't be just as fast.

August 23, 2014
On 8/23/2014 2:36 PM, Andrei Alexandrescu wrote:
> I think accepting ubyte is a good idea. It means "got this stream of bytes off
> of the wire and it hasn't been validated as a UTF string". It also means (which
> is true) that the lexer does enough validation to constrain arbitrary bytes into
> text, and saves the caller from either a check (expensive) or a cast (unpleasant).
>
> Reality is the JSON lexer takes ubytes and produces tokens.

Using an adapter still makes sense, because:

1. The adapter should be just as fast as wiring it in internally

2. The adapter then becomes a general purpose tool that can be used elsewhere where the encoding is unknown or suspect

3. The scope of the adapter is small, so it is easier to get it right, and being reusable means every user benefits from it

4. If we can't make adapters efficient, we've failed at the ranges+algorithms model, and I'm very unwilling to fail at that
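
To be concrete, the kind of adapter I have in mind is a small range wrapper along these lines (rough, untested sketch with made-up names; a real one would decode or transcode rather than assert):

import std.range.primitives;
import std.stdio;

// Wraps an input range of ubyte and presents it as a range of char,
// rejecting anything outside plain ASCII.
struct AsciiAdapter(R)
    if (isInputRange!R && is(ElementType!R : ubyte))
{
    R source;

    @property bool empty() { return source.empty; }

    @property char front()
    {
        immutable b = source.front;
        assert(b < 0x80, "non-ASCII byte");
        return cast(char) b;
    }

    void popFront() { source.popFront(); }
}

auto asciiAdapter(R)(R source) { return AsciiAdapter!R(source); }

void main()
{
    immutable(ubyte)[] wire = [0x7B, 0x22, 0x61, 0x22, 0x3A, 0x31, 0x7D]; // {"a":1}
    foreach (c; asciiAdapter(wire))
        write(c);
    writeln();
}

The same wrapper then sits in front of any lexer or parser that wants validated character input, not just the JSON one.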


August 23, 2014
On 8/23/14, 3:24 PM, Walter Bright wrote:
> On 8/23/2014 2:36 PM, Andrei Alexandrescu wrote:
>> I think accepting ubyte is a good idea. It means "got this stream of
>> bytes off of the wire and it hasn't been validated as a UTF string". It
>> also means (which is true) that the lexer does enough validation to
>> constrain arbitrary bytes into text, and saves the caller from either a
>> check (expensive) or a cast (unpleasant).
>>
>> Reality is the JSON lexer takes ubytes and produces tokens.
>
> Using an adapter still makes sense, because:
>
> 1. The adapter should be just as fast as wiring it in internally
>
> 2. The adapter then becomes a general purpose tool that can be used
> elsewhere where the encoding is unknown or suspect
>
> 3. The scope of the adapter is small, so it is easier to get it right,
> and being reusable means every user benefits from it
>
> 4. If we can't make adapters efficient, we've failed at the
> ranges+algorithms model, and I'm very unwilling to fail at that

An adapter would solve the wrong problem here. There's nothing to adapt from and to.

An adapter would be good if, e.g., the stream uses UTF-16 or some Windows encoding. Bytes are the natural input for a JSON parser.


Andrei


August 24, 2014
On 8/23/2014 3:20 PM, Walter Bright via Digitalmars-d wrote:
> On 8/23/2014 12:00 PM, Brad Roberts via Digitalmars-d wrote:
>> On 8/23/2014 10:46 AM, Walter Bright via Digitalmars-d wrote:
>>> I feel that non-UTF encodings should be handled by adapter algorithms,
>>> not embedded into the JSON lexer, so yes, I'd drop that.
>>
>> For performance purposes, determining encoding during lexing is useful.
>
> I'm not convinced that using an adapter algorithm won't be just as fast.

Consider your own talks on optimizing the existing dmd lexer.  In those talks you pointed out the evils of additional processing on every byte, and that is exactly what you're proposing here.  While it's possible that the inliner and other optimizer steps might be able to integrate the two phases and remove some overhead, I'll believe it when I see the resulting assembly code.
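
For what it's worth, a micro-benchmark along these lines would settle it (skeleton only, no outcome claimed; names made up):

import std.datetime.stopwatch : benchmark;
import std.stdio;

enum size_t N = 16 * 1024 * 1024;

ubyte[] data;

size_t scanRaw()
{
    size_t quotes;
    foreach (b; data)
        if (b == '"') ++quotes;        // the "wired-in" byte loop
    return quotes;
}

size_t scanChecked()
{
    size_t quotes;
    foreach (b; data)
    {
        if (b >= 0x80)                 // the adapter's extra per-byte check
            assert(0, "non-ASCII");
        if (b == '"') ++quotes;
    }
    return quotes;
}

void main()
{
    data = new ubyte[](N);
    data[] = cast(ubyte) 'a';          // fill with a harmless ASCII byte

    auto results = benchmark!(scanRaw, scanChecked)(10);
    writeln("raw:     ", results[0]);
    writeln("checked: ", results[1]);
}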
August 25, 2014
I've added support (compile-time option [1]) for long and BigInt in the lexer (and parser), see [2]. JSONValue currently still stores only double for numbers. There are two options for extending JSONValue:

1. Add long and BigInt to the set of supported types for JSONValue. This preserves all features of Algebraic and would later still allow transparent conversion to other similar value types (e.g. BSONValue). On the other hand it would be necessary to always check the actual type before accessing a number, or the Algebraic would throw.

2. Instead of double, store a JSONNumber in the Algebraic. This enables all the transparent conversions of JSONNumber and would thus be more convenient, but blocks the way for possible automatic conversions in the future.
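
Roughly, the two layouts would look like this (simplified sketch; the stand-in types are invented for illustration and the real JSONValue of course has more member types):

import std.bigint : BigInt;
import std.stdio;
import std.variant : Algebraic;

// Option 1: the numeric types become first-class members of the Algebraic
// (simplified -- the real JSONValue also carries null, array and object
// payloads).
alias Payload1 = Algebraic!(bool, string, double, long, BigInt);

// Option 2: all numbers go through a single JSONNumber-like wrapper. This
// struct is only a stand-in for the real JSONNumber from [2].
struct NumberStandIn
{
    double value;     // the real type can also hold long or BigInt internally
    alias value this; // transparent conversion to the numeric value
}
alias Payload2 = Algebraic!(bool, string, NumberStandIn);

void main()
{
    // Option 1: the caller has to check which numeric type is stored.
    Payload1 a = 42L;
    if (a.type == typeid(long))
        writeln(a.get!long);                       // 42
    a = BigInt("123456789012345678901234567890");
    if (a.type == typeid(BigInt))
        writeln(a.get!BigInt);

    // Option 2: there is only one numeric member to deal with.
    Payload2 b = NumberStandIn(42);
    double d = b.get!NumberStandIn;                // via alias this
    writeln(d);                                    // 42
}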

I'm leaning towards 1, because allowing generic conversion between different JSONValue-like types was one of my prime goals for the new module.

[1]: http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/LexOptions.html
[2]: http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/JSONNumber.html
August 25, 2014
On Monday, 25 August 2014 at 11:30:15 UTC, Sönke Ludwig wrote:
> I've added support (compile time option [1]) for long and BigInt in the lexer (and parser), see [2]. JSONValue currently still only stores double for numbers.

It can be very useful to have a base-10 exponent representation in situations where you need the exact same results in two systems (like a third-party ERP server versus a client-side application). Base-2 exponents are tricky (effectively incorrect) when the input is decimal ASCII.

E.g. I have resorted to using Decimal in Python just to avoid the weird round-off issues when calculating prices where the price is given in fractions of the order unit.

Perhaps a marginal problem, but it could be important for some serious application areas where you need to integrate D with existing systems (for which you don't have the source code).
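
A tiny illustration of the kind of round-off I mean, using D's double instead of Python (a base-10 representation keeps an integer mantissa plus a decimal exponent, which is essentially what Decimal does):

import std.stdio;

void main()
{
    // Binary floating point cannot represent most decimal fractions exactly,
    // so two systems that round at different points can disagree.
    writeln(0.1 + 0.2 == 0.3);        // false
    writefln("%.17f", 0.1 + 0.2);     // 0.30000000000000004

    // Base-10 representation: integer mantissa plus decimal exponent.
    // 0.1 == 1e-1 and 0.2 == 2e-1, so the sum is exactly 3e-1 == 0.3.
    long mantissaSum = 1 + 2;
    int  decimalExp  = -1;
    writefln("%se%s", mantissaSum, decimalExp);   // 3e-1
}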