August 17, 2015
On 7/28/15 10:07 AM, Atila Neves wrote:
> Start of the two week process, folks.
>
> Code: https://github.com/s-ludwig/std_data_json
> Docs: http://s-ludwig.github.io/std_data_json/
>
> Atila

I'll preface my review with a general comment. This API comes at an interesting juncture; we're striving as much as possible for interfaces that abstract away lifetime management, so they can be used comfortably with GC, or at high performance (and hopefully no or only marginal loss of comfort) with client-chosen lifetime management policies.

The JSON API is a great test bed for our emerging recommended "push lifetime up" idioms; it's not too complicated yet it's not trivial either, and has great usefulness.

With this, here are some points:

* All new stuff should go in std.experimental. I assume "stdx" would change to that, should this work be merged.

* On the face of it, dedicating 6 modules to such a small specification as JSON seems excessive. I'm thinking one module here. (As a simple point: who would ever want to import only foundation, which in turn has one exception type and one location type in it?) I think it shouldn't be up for debate that we must aim for simple and clean APIs.

* stdx.data.json.generator: I think the API for converting in-memory JSON values to strings needs to be redone, as follows:

- JSONValue should offer a byToken range, which offers the contents of the value one token at a time. For example, "[ 1, 2, 3 ]" offers the '[' token followed by three numeric tokens with the respective values followed by the ']' token.

- On top of byToken it's immediate to implement a method (say toJSON or toString) that accepts an output range of characters and formatting options.

- On top of the method above with output range, implementing a toString overload that returns a string for convenience is a two-liner. However, it shouldn't return a "string"; Phobos APIs should avoid "hardcoding" the string type. Instead, it should return a user-chosen string type (including reference counting strings). (A rough sketch of this layering follows this list.)

- While at it make prettification a flag in the options, not its own part of the function name.
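
To make the layering concrete, here is a rough sketch with a toy value type standing in for JSONValue and hypothetical names; the point is just the output-range core with a string-type-parameterized convenience layer on top:

import std.array : appender;
import std.format : formattedWrite;
import std.range.primitives : isOutputRange, put;

// Toy stand-in value so the example is self-contained; the real JSONValue
// would drive this from its byToken range instead.
struct ToyValue
{
    int[] items;

    // Core primitive: write the textual form into any output range of char.
    void toJSON(Out)(ref Out sink) const
        if (isOutputRange!(Out, char))
    {
        put(sink, '[');
        foreach (i, v; items)
        {
            if (i) put(sink, ',');
            sink.formattedWrite("%s", v);
        }
        put(sink, ']');
    }

    // Convenience layer: the caller picks the string type S.
    S toString(S = string)() const
    {
        auto app = appender!S();
        toJSON(app);
        return app.data;
    }
}

unittest
{
    assert(ToyValue([1, 2, 3]).toString() == "[1,2,3]");
}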

* stdx.data.json.lexer:

- I assume the idea was to accept ranges of integrals to mean "there's some raw input from a file". This seems to be a bit overdone, e.g. there's no need to accept signed integers or 64-bit integers. I suggest just going with the three character types.

- I see tokenization accepts input ranges. This forces the tokenizer to store its own copy of things, which is no doubt the business of appenderFactory. Here the departure of the current approach from what I think should become canonical Phobos APIs deepens for multiple reasons. First, appenderFactory does allow customization of the append operation (nice) but that's not enough to allow the user to customize the lifetime of the created strings, which is usually reflected in the string type itself. So the lexing method should be parameterized by the string type used. (By default string (as is now) should be fine.) Therefore instead of customizing the append method just customize the string type used in the token.

- The lexer should internally take optimization opportunities, e.g. if the string type is "string" and the lexed type is also "string", great, just use slices of the input instead of appending them to the tokens.

- As a consequence the JSONToken type also needs to be parameterized by the type of the string that holds its payload. I understand this is a complication compared to the current approach, but I don't see an out. In the grand scheme of things it seems a necessary evil: tokens may or may not need a means to manage lifetime of their payload, and that's determined by the type of the payload. Hopefully simplifications in other areas of the API would offset this. (A rough sketch of such a token follows this list.)

- At token level there should be no number parsing. Just store the payload with the token and leave it for later. Very often numbers are converted without there being a need, and the process is costly. This also nicely sidesteps the entire matter of bigints, floating point etc. at this level.

- Also, at token level strings should be stored with escapes unresolved. If the user wants a string with the escapes resolved, a lazy range does it.

- Validating UTF is tricky; I've seen some discussion in this thread about it. On the face of it JSON only accepts valid UTF characters. As such, a modularity-based argument is to pipe UTF validation before tokenization. (We need a lazy UTF validator and sanitizer stat!) An efficiency-based argument is to do validation during tokenization. I'm inclining in favor of modularization, which allows us to focus on one thing at a time and do it well, instead of duplicating validation everywhere. Note that it's easy to write routines that do JSON tokenization and leave UTF validation for later, so there's a lot of flexibility in composing validation with JSONization.

- Litmus test: if the input type is a forward range AND if the string type chosen for tokens is the same as input type, successful tokenization should allocate exactly zero memory. I think this is a simple way to make sure that the tokenization API works well.

- If noThrow is a runtime option, some functions can't be nothrow (and consequently nogc). Not sure how important this is. Probably quite a bit because of the current gc implications of exceptions. IMHO: at lexing level a sound design might just emit error tokens (with the culprit as payload) and never throw. Clients may always throw when they see an error token.
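
To make the parameterization concrete, a rough sketch of the token shape I have in mind (names are illustrative, not the module's actual ones):

enum TokenKind
{
    objectStart, objectEnd, arrayStart, arrayEnd, colon, comma,
    string_, number, true_, false_, null_, error
}

struct JSONToken(String = string)
{
    TokenKind kind;
    String text;   // raw payload: unescaped string body, unparsed number
                   // text, or the offending input for TokenKind.error
}

// When the input is itself a string and String == string, the payload can
// simply be a slice of the input, so tokenization allocates nothing.
unittest
{
    string input = `{"a": 12}`;
    auto tok = JSONToken!string(TokenKind.number, input[6 .. 8]);
    assert(tok.text == "12");
    assert(tok.text is input[6 .. 8]);   // same memory, nothing copied
}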

* stdx.data.json.parser:

- Similar considerations regarding string type used apply here as well: everything should be parameterized with it - the use case to keep in mind is someone wants everything with refcounted strings.

- The JSON value does its own internal allocation (e.g. for arrays and hashtables), which should be fine as long as it's encapsulated and we can tweak it later (e.g. make it use reference counting inside).

- parseJSONStream should parameterize on string type, not on appenderFactory.

- Why both parseJSONStream and parseJSONValue? I'm thinking parseJSONValue would be enough because then you trivially parse a stream with repeated calls to parseJSONValue.

- FWIW I think the whole thing with accommodating BigInt etc. is an exaggeration. Just stick with long and double.

- readArray suddenly introduces a distinct kind of interaction - callbacks. Why? Should be a lazy range lazy range lazy range. An adapter using callbacks is then a two-liner (see the sketch after this list).

- Why is readBool even needed? Just readJSONValue and then enforce it as a bool. Same reasoning applies to readDouble and readString.

- readObject is with callbacks again - it would be nice if it were a lazy range.

- skipXxx are nice to have and useful.
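
To substantiate the two-liner claim: if readArray yielded a lazy range of elements, the callback adapter reduces to something like this (hypothetical name, with a plain array standing in for the element range):

void readArrayWithCallback(Range, Callback)(Range elements, Callback callback)
{
    foreach (e; elements)
        callback(e);
}

unittest
{
    int sum;
    readArrayWithCallback([1, 2, 3], (int x) { sum += x; });
    assert(sum == 6);
}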

* stdx.data.json.value:

- The etymology of "opt" is unclear - no word starting with "opt" or obviously abbreviating to it is in the documentation. "opt2" is awkward. How about "path" and "dyn", respectively?

- I think Algebraic should be used throughout instead of TaggedAlgebraic, or motivation be given for the latter.

- JSONValue should be more opaque and not expose its representation as much as it does now. In particular, offering a built-in hashtable is bound to be problematic because those are expensive to construct, create garbage, and are not customizable. Instead, the necessary lookup and set APIs should be provided by JSONValue whilst keeping the implementation hidden. The same goes for arrays - the built-in array should not be exposed; instead, indexed access primitives should be exposed. Separate types might be offered (e.g. JSONArray, JSONDictionary) if deemed necessary. The string type should be a type parameter of JSONValue. (An interface-only sketch follows this list.)
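
By way of illustration, an interface-only sketch of what "opaque" could look like (bodies omitted, names hypothetical):

struct JSONValue(String = string)
{
    // representation deliberately hidden (tag and storage live here)
    private void* impl;

    // object access, without ever handing out a built-in hashtable
    ref inout(JSONValue) opIndex(String key) inout;
    void opIndexAssign(JSONValue value, String key);
    inout(JSONValue)* opBinaryRight(string op : "in")(String key) inout;

    // array access, without ever handing out a built-in array
    ref inout(JSONValue) opIndex(size_t index) inout;
    size_t length() const;
}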

==============================

So, here we are. I realize a good chunk of this is surprising ("you mean I shouldn't create strings in my APIs?"). My point here is, again, we're at a juncture. We're trying to factor garbage (heh) out of API design in ways that defer the lifetime management to the user of the API.

We could pull json into std.experimental and defer the policy decisions for later, but I think it's a great driver for them. (Thanks Sönke for doing all the work, this is a great baseline.) I think we should use the JSON API as a guinea pig for the new era of D API design in which we have a solid set of principles, tools, and guidelines to defer lifetime management. Please advise.



Andrei

August 17, 2015
On 8/17/15 2:47 PM, Dmitry Olshansky wrote:
>
> Actually one can combine the two:
> - use integer type tag for everything built-in
> - use pointer tag for what is not

But a pointer tag can do everything that an integer tag does. -- Andrei
August 17, 2015
On 8/17/15 2:51 PM, deadalnix wrote:
>  From the compiler perspective, the tag is much nicer. Compiler can use
> jump table for instance.

The pointer is a more direct conduit to a jump table.

> It is not a good solution for Variant (which needs to be able to
> represent arbitrary types) but if the amount of types is finite, tag is
> almost always a win.
> In the case of JSON, using a tag and packing trick, it is possible to
> pack everything in a 2 pointers sized struct without much trouble.

Point taken. Question is if this is worth it.


Andrei
August 17, 2015
On 8/17/15 2:56 PM, Sönke Ludwig wrote:
> - The enum is useful to be able to identify the types outside of the D
> code itself. For example when serializing the data to disk, or when
> communicating with C code.

OK.

> - It enables the use of pattern matching (final switch), which is often
> very convenient, faster, and safer than an if-else cascade.

Sounds tenuous.

> - A hypothesis is that it is faster, because there is no function call
> indirection involved.

Again: pointers can do everything integrals do. To compare:

if (myptr == ThePtrOf!int) { ... this is an int ... }

I want to make clear that this is understood.
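
For concreteness, here is one way such a per-type pointer could be obtained - a guess at what ThePtrOf would look like, not code from any existing module:

immutable(ubyte)* ThePtrOf(T)()
{
    static immutable ubyte marker = 0;   // one distinct address per T
    return &marker;
}

unittest
{
    auto myptr = ThePtrOf!int();
    assert(myptr is ThePtrOf!int());       // "this is an int"
    assert(myptr !is ThePtrOf!double());   // other types get other addresses
}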

> - It naturally enables fully statically typed operator forwarding as far
> as possible (have a look at the examples of the current version). A
> pointer based version could do this, too, but only by jumping through
> hoops.

I'm unclear on that. Could you please point me to the actual file and lines?

> - The same type can be used multiple times with a different enum name.
> This can alternatively be solved using a Typedef!T, but I had several
> occasions where that proved useful.

Unclear on this.



Andrei
August 18, 2015
On 2015-08-18 00:21, Andrei Alexandrescu wrote:

> * On the face of it, dedicating 6 modules to such a small specification
> as JSON seems excessive. I'm thinking one module here. (As a simple
> point: who would ever want to import only foundation, which in turn has
> one exception type and one location type in it?) I think it shouldn't be
> up for debate that we must aim for simple and clean APIs.

I don't think this is excessive. We should strive to have small modules. We already have/had problems with std.algorithm and std.datetime, let's not repeat those mistakes. A module with 2000 lines is more than enough.

-- 
/Jacob Carlborg
August 18, 2015
On 18-Aug-2015 01:33, Andrei Alexandrescu wrote:
> On 8/17/15 2:47 PM, Dmitry Olshansky wrote:
>>
>> Actually one can combine the two:
>> - use integer type tag for everything built-in
>> - use pointer tag for what is not
>
> But a pointer tag can do everything that an integer tag does. -- Andrei

albeit quite a deal slooower.

-- 
Dmitry Olshansky
August 18, 2015
On Monday, 17 August 2015 at 22:21:50 UTC, Andrei Alexandrescu wrote:
> * stdx.data.json.generator: I think the API for converting in-memory JSON values to strings needs to be redone, as follows:
>
> - JSONValue should offer a byToken range, which offers the contents of the value one token at a time. For example, "[ 1, 2, 3 ]" offers the '[' token followed by three numeric tokens with the respective values followed by the ']' token.

For iterating tree-like structures, a callback-based approach seems nicer, because it can naturally use the call stack for storing its state. (I assume std.concurrency.Generator is too heavy-weight for this case.)
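
A toy illustration with a stand-in node type: the recursion keeps the traversal state on the call stack, where a range would have to store it explicitly.

struct Node
{
    int value;
    Node*[] children;
}

void visit(const(Node)* node, scope void delegate(int) sink)
{
    sink(node.value);
    foreach (child; node.children)
        visit(child, sink);   // the call stack holds the traversal state
}

unittest
{
    auto root = new Node(1, [new Node(2), new Node(3)]);
    int sum;
    visit(root, (int v) { sum += v; });
    assert(sum == 6);
}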

>
> - On top of byToken it's immediate to implement a method (say toJSON or toString) that accepts an output range of characters and formatting options.

If there really needs to be a range, `joiner` and `copy` should do the job.
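
Something along these lines, with plain strings standing in for the per-token text:

import std.algorithm : copy, joiner;
import std.array : appender;

unittest
{
    auto pieces = ["[", "1", ",", "2", "]"];   // pretend per-token text
    auto sink = appender!string();
    pieces.joiner.copy(sink);                  // flatten into the output range
    assert(sink.data == "[1,2]");
}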

>
> - On top of the method above with output range, implementing a toString overload that returns a string for convenience is a two-liner. However, it shouldn't return a "string"; Phobos APIs should avoid "hardcoding" the string type. Instead, it should return a user-chosen string type (including reference counting strings).

`to!string`, for compatibility with std.conv.

>
> - While at it make prettification a flag in the options, not its own part of the function name.

(That's already done.)

>
> * stdx.data.json.lexer:
>
> - I assume the idea was to accept ranges of integrals to mean "there's some raw input from a file". This seems to be a bit overdone, e.g. there's no need to accept signed integers or 64-bit integers. I suggest just going with the three character types.
>
> - I see tokenization accepts input ranges. This forces the tokenizer to store its own copy of things, which is no doubt the business of appenderFactory. Here the departure of the current approach from what I think should become canonical Phobos APIs deepens for multiple reasons. First, appenderFactory does allow customization of the append operation (nice) but that's not enough to allow the user to customize the lifetime of the created strings, which is usually reflected in the string type itself. So the lexing method should be parameterized by the string type used. (By default string (as is now) should be fine.) Therefore instead of customizing the append method just customize the string type used in the token.
>
> - The lexer should internally take optimization opportunities, e.g. if the string type is "string" and the lexed type is also "string", great, just use slices of the input instead of appending them to the tokens.
>
> - As a consequence the JSONToken type also needs to be parameterized by the type of its string that holds the payload. I understand this is a complication compared to the current approach, but I don't see an out. In the grand scheme of things it seems a necessary evil: tokens may or may not need a means to manage lifetime of their payload, and that's determined by the type of the payload. Hopefully simplifications in other areas of the API would offset this.

I've never seen JSON encoded in anything other than UTF-8. Is it really necessary to complicate everything for such an infrequent niche case?

>
> - At token level there should be no number parsing. Just store the payload with the token and leave it for later. Very often numbers are converted without there being a need, and the process is costly. This also nicely sidesteps the entire matter of bigints, floating point etc. at this level.
>
> - Also, at token level strings should be stored with escapes unresolved. If the user wants a string with the escapes resolved, a lazy range does it.

This was already suggested, and it looks like a good idea, though there was an objection because of possible performance costs. The other objection, that it requires an allocation, is no longer valid if sliceable input is used.
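
A toy lazy unescaper, only to show the shape of such a range (it handles just the single-character escapes, not \uXXXX):

import std.algorithm : equal;

struct Unescape
{
    string s;   // raw token payload, escapes still in place

    bool empty() const { return s.length == 0; }

    char front() const
    {
        if (s[0] != '\\')
            return s[0];
        switch (s[1])
        {
            case 'n': return '\n';
            case 't': return '\t';
            default:  return s[1];   // \" \\ \/ and so on
        }
    }

    void popFront()
    {
        s = s[0] == '\\' ? s[2 .. $] : s[1 .. $];
    }
}

unittest
{
    assert(equal(Unescape(`a\"b\nc`), "a\"b\nc"));
}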

>
> - Validating UTF is tricky; I've seen some discussion in this thread about it. On the face of it JSON only accepts valid UTF characters. As such, a modularity-based argument is to pipe UTF validation before tokenization. (We need a lazy UTF validator and sanitizer stat!) An efficiency-based argument is to do validation during tokenization. I'm inclining in favor of modularization, which allows us to focus on one thing at a time and do it well, instead of duplicating validation everywhere. Note that it's easy to write routines that do JSON tokenization and leave UTF validation for later, so there's a lot of flexibility in composing validation with JSONization.

Well, in an ideal world, there should be no difference in performance between manually combined tokenization/validation and composed ranges. We should practice what we preach here.

> * stdx.data.json.parser:
>
> - FWIW I think the whole thing with accommodating BigInt etc. is an exaggeration. Just stick with long and double.

Or, as above, leave it to the end user and provide a `to(T)` method that can support built-in types and `BigInt` alike.
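
For instance, something like this on the number token (hypothetical names), so the caller pays only for the conversion it actually asks for:

import std.bigint : BigInt;

struct NumberToken
{
    string text;   // unparsed payload exactly as it appeared in the input

    T to(T)() const
    {
        import std.conv : convert = to;
        return convert!T(text);
    }
}

unittest
{
    assert(NumberToken("42").to!long == 42);
    assert(NumberToken("42.5").to!double == 42.5);
    assert(NumberToken("12345678901234567890").to!BigInt
           == BigInt("12345678901234567890"));
}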
August 18, 2015
On Monday, 17 August 2015 at 22:34:36 UTC, Andrei Alexandrescu wrote:
> On 8/17/15 2:51 PM, deadalnix wrote:
>>  From the compiler perspective, the tag is much nicer. Compiler can use
>> jump table for instance.
>
> The pointer is a more direct conduit to a jump table.

Not really, because it most likely doesn't point to where you need it, but to a `TypeInfo` struct instead, which doesn't help you in a `switch` statement. Besides, you probably shouldn't compare pointers vs integers, but pointers vs enums.

>
>> It is not a good solution for Variant (which needs to be able to
>> represent arbitrary types) but if the amount of types is finite, tag is
>> almost always a win.
>> In the case of JSON, using a tag and packing trick, it is possible to
>> pack everything in a 2 pointers sized struct without much trouble.
>
> Point taken. Question is if this is worth it.

Anything that makes it fit in two registers instead of three (= 2 regs + memory, in practice) is most likely worth it.
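
Roughly what that looks like as a plain layout (back-of-the-envelope, 64-bit; the packing tricks deadalnix mentions, like folding the tag into spare pointer bits or storing short strings inline, would go further):

enum Kind : ubyte { null_, boolean, integer, floating, text, array, object }

struct PackedJSONValue
{
    union Payload
    {
        bool   boolean;
        long   integer;
        double floating;
        void*  reference;   // strings/arrays/objects live behind a pointer
    }
    Payload payload;   // first word
    Kind kind;         // tag fits in the second word, with room to spare
}

static if (size_t.sizeof == 8)
    static assert(PackedJSONValue.sizeof == 2 * size_t.sizeof);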
August 18, 2015
Am Tue, 18 Aug 2015 09:10:25 +0000
schrieb "Marc Schütz" <schuetzm@gmx.net>:

> On Monday, 17 August 2015 at 22:34:36 UTC, Andrei Alexandrescu wrote:
> > On 8/17/15 2:51 PM, deadalnix wrote:
> >>  From the compiler perspective, the tag is much nicer.
> >> Compiler can use
> >> jump table for instance.
> >
> > The pointer is a more direct conduit to a jump table.
> 
> Not really, because it most likely doesn't point to where you need it, but to a `TypeInfo` struct instead, which doesn't help you in a `switch` statement. Besides, you probably shouldn't compare pointers vs integers, but pointers vs enums.

Here's an example with an enum tag, showing what compilers can do: http://goo.gl/NUZwNo

ARM ASM is easier to read for me. Feel free to switch to X86.

The jump table requires only one instruction (the cmp #4 shouldn't be necessary for a final switch, probably a GDC/GCC enhancement). All instructions/data should be in the instruction cache. There's no register save / function call overhead.


If you use a pointer:
http://goo.gl/9kb0vQ

No jump table optimization. Cache should be OK as well. No call overhead.


Note how both examples can also combine the code for uint/int. If you use a function pointer instead, you'll call a different function.


Calling a function through pointer:
http://goo.gl/zTU3sA

You have one indirect call. Probably hard on the branch predictor, although I don't really know. Probably also worse regarding cache. I also cheated by using a single pointer, for add only. In reality you'll need to store one pointer per operation or use a switch inside the called function.


I think it's reasonable to expect the enum version to be faster. To be really sure we'd need some benchmarks.
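
Roughly the kind of code being compared (the godbolt links above contain the real snippets; this is just the shape):

enum Tag { int_, uint_, double_ }

// Tag-based dispatch: a final switch the compiler can lower to a jump table,
// and int_/uint_ can share a branch.
double addOne(Tag tag, double payload)
{
    double result;
    final switch (tag)
    {
        case Tag.int_, Tag.uint_: result = payload + 1;   break;
        case Tag.double_:         result = payload + 1.0; break;
    }
    return result;
}

// Pointer-based dispatch: every operation goes through an indirect call.
double addOneIndirect(double function(double) op, double payload)
{
    return op(payload);
}

unittest
{
    assert(addOne(Tag.int_, 41) == 42);
    assert(addOneIndirect(x => x + 1, 41) == 42);
}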

August 18, 2015
On 8/18/15 2:31 AM, Jacob Carlborg wrote:
> On 2015-08-18 00:21, Andrei Alexandrescu wrote:
>
>> * On the face of it, dedicating 6 modules to such a small specification
>> as JSON seems excessive. I'm thinking one module here. (As a simple
>> point: who would ever want to import only foundation, which in turn has
>> one exception type and one location type in it?) I think it shouldn't be
>> up for debate that we must aim for simple and clean APIs.
>
> I don't think this is excessive. We should strive to have small modules.
> We already have/had problems with std.algorithm and std.datetime, let's
> not repeat those mistakes. A module with 2000 lines is more than enough.

How about a module with 20? -- Andrei