August 18, 2015
On 8/18/15 2:55 AM, Dmitry Olshansky wrote:
> On 18-Aug-2015 01:33, Andrei Alexandrescu wrote:
>> On 8/17/15 2:47 PM, Dmitry Olshansky wrote:
>>>
>>> Actually one can combine the two:
>>> - use integer type tag for everything built-in
>>> - use pointer tag for what is not
>>
>> But a pointer tag can do everything that an integer tag does. -- Andrei
>
> albeit quite a deal slooower.

I think there's a misunderstanding. Pointers _are_ 64-bit integers and may be compared as such. You can use a pointer as an integer. -- Andrei
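
For illustration, a minimal D sketch (all names hypothetical) of a pointer tag doing double duty: testing it is a single word-sized comparison, exactly like an integer tag, and the very same pointer can drive an indirect call when dispatch is needed.

```d
import std.stdio : writeln;

alias Handler = void function(void* payload);

void handleInt(void* p)    { writeln("int: ", *cast(int*) p); }
void handleDouble(void* p) { writeln("double: ", *cast(double*) p); }

struct Tagged
{
    Handler tag;                 // pointer tag, unique per stored type
    union { int i; double d; }
}

void main()
{
    Tagged v;
    v.tag = &handleInt;
    v.i = 42;

    if (v.tag is &handleInt)     // identity check: one word-sized compare
        v.tag(&v.i);             // the same pointer drives dispatch
}
```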

August 18, 2015
On 2015-08-18 15:18, Andrei Alexandrescu wrote:

> How about a module with 20? -- Andrei

If it's used in several other modules, I don't see a problem with it.

-- 
/Jacob Carlborg
August 18, 2015
On 8/18/15 5:10 AM, "Marc Schütz" <schuetzm@gmx.net> wrote:
> On Monday, 17 August 2015 at 22:34:36 UTC, Andrei Alexandrescu wrote:
>> On 8/17/15 2:51 PM, deadalnix wrote:
>>>  From the compiler perspective, the tag is much nicer. Compiler can use
>>> jump table for instance.
>>
>> The pointer is a more direct conduit to a jump table.
>
> Not really, because it most likely doesn't point to where you need it,
> but to a `TypeInfo` struct instead

No, in std.variant it points to a dispatcher function. -- Andrei
August 18, 2015
On 8/18/15 7:02 AM, Johannes Pfau wrote:
> On Tue, 18 Aug 2015 09:10:25 +0000, "Marc Schütz" <schuetzm@gmx.net> wrote:
>
>> On Monday, 17 August 2015 at 22:34:36 UTC, Andrei Alexandrescu
>> wrote:
>>> On 8/17/15 2:51 PM, deadalnix wrote:
>>>>   From the compiler perspective, the tag is much nicer.
>>>> Compiler can use
>>>> jump table for instance.
>>>
>>> The pointer is a more direct conduit to a jump table.
>>
>> Not really, because it most likely doesn't point to where you
>> need it, but to a `TypeInfo` struct instead, which doesn't help
>> you in a `switch` statement. Besides, you probably shouldn't
>> compare pointers vs integers, but pointers vs enums.
>
> Here's an example with an enum tag, showing what compilers can do:
> http://goo.gl/NUZwNo
>
> ARM ASM is easier to read for me. Feel free to switch to X86.
>
> The jump table requires only one instruction (the cmp #4 shouldn't be
> necessary for a final switch, probably a GDC/GCC enhancement). All
> instructions/data should be in the instruction cache. There's no
> register save / function call overhead.
>
>
> If you use a pointer:
> http://goo.gl/9kb0vQ

That's a language issue - switch does not work with any pointers. I just submitted https://issues.dlang.org/show_bug.cgi?id=14931. -- Andrei

August 18, 2015
On 8/18/15 9:31 AM, Jacob Carlborg wrote:
> On 2015-08-18 15:18, Andrei Alexandrescu wrote:
>
>> How about a module with 20? -- Andrei
>
> If it's used in several other modules, I don't see a problem with it.

Me neither if internal. I do see a problem if it's public. -- Andrei

August 18, 2015
On Tue, 18 Aug 2015 10:58:17 -0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 8/18/15 7:02 AM, Johannes Pfau wrote:
> > On Tue, 18 Aug 2015 09:10:25 +0000, "Marc Schütz" <schuetzm@gmx.net> wrote:
> >
> >> On Monday, 17 August 2015 at 22:34:36 UTC, Andrei Alexandrescu wrote:
> >>> On 8/17/15 2:51 PM, deadalnix wrote:
> >>>>   From the compiler perspective, the tag is much nicer.
> >>>> Compiler can use
> >>>> jump table for instance.
> >>>
> >>> The pointer is a more direct conduit to a jump table.
> >>
> >> Not really, because it most likely doesn't point to where you need it, but to a `TypeInfo` struct instead, which doesn't help you in a `switch` statement. Besides, you probably shouldn't compare pointers vs integers, but pointers vs enums.
> >
> > Here's an example with an enum tag, showing what compilers can do: http://goo.gl/NUZwNo
> >
> > ARM ASM is easier to read for me. Feel free to switch to X86.
> >
> > The jump table requires only one instruction (the cmp #4 shouldn't be necessary for a final switch, probably a GDC/GCC enhancement). All instructions/data should be in the instruction cache. There's no register save / function call overhead.
> >
> >
> > If you use a pointer:
> > http://goo.gl/9kb0vQ
> 
> That's a language issue - switch does not work with any pointers. I just submitted https://issues.dlang.org/show_bug.cgi?id=14931. -- Andrei
> 

Yes, if we enable switch for pointers we get nicer D code.

No, this won't improve the ASM much: enum values start at 0 and are consecutive, and with a final switch they're also bounded. None of these points apply to pointers: they don't start at 0, are not guaranteed to be consecutive, and likely can't be used with final switch. Because of that, a switch on pointers can never use jump tables.
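
To make the enum case concrete: a final switch over a dense, zero-based enum is exactly the shape that lets the compiler emit a bounds-check-free jump table. A minimal sketch with hypothetical names:

```d
enum Kind { null_, boolean, integer, floating, text }

int cost(Kind k)
{
    final switch (k)   // dense, zero-based, bounded: ideal for a jump table
    {
        case Kind.null_:    return 0;
        case Kind.boolean:  return 1;
        case Kind.integer:  return 2;
        case Kind.floating: return 3;
        case Kind.text:     return 4;
    }
}

void main()
{
    assert(cost(Kind.floating) == 3);
}
```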

August 18, 2015
On 8/18/15 11:39 AM, Johannes Pfau wrote:
> No, this won't improve the ASM much: Enum values start at 0 and are
> consecutive. With a final switch they're also bounded. All these points
> do not apply to pointers. They don't start at 0, are not guaranteed to
> be consecutive and likely can't be used with final switch. Because of
> that a switch on pointers can never use jump tables.

I agree there's a margin here in favor of integers, but it's getting thin. Meanwhile, pointers maintain large advantages of principle. I suggest we pursue better use of pointers as tags instead of adding integral-tagged unions to phobos. -- Andrei

August 18, 2015
On 18-Aug-2015 16:19, Andrei Alexandrescu wrote:
> On 8/18/15 2:55 AM, Dmitry Olshansky wrote:
>> On 18-Aug-2015 01:33, Andrei Alexandrescu wrote:
>>> On 8/17/15 2:47 PM, Dmitry Olshansky wrote:
>>>>
>>>> Actually one can combine the two:
>>>> - use integer type tag for everything built-in
>>>> - use pointer tag for what is not
>>>
>>> But a pointer tag can do everything that an integer tag does. -- Andrei
>>
>> albeit quite a deal slooower.
>
> I think there's a misunderstanding. Pointers _are_ 64-bit integers and
> may be compared as such. You can use a pointer as an integer. -- Andrei
>

An integer in a small range is faster to switch on. Plus, comparing to zero is faster, so if the common type has tag == 0 it's a net gain.

Strictly speaking, a pointer with a vtbl is about as fast as a switch, but when we have to switch on 2 types the vtbl dispatch needs to be based on 2 types instead of one. So ideally we need a vtbl per pair of types to support e.g. fast binary operators on TaggedAlgebraic.
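
A sketch of the per-pair idea, with hypothetical names: one flat table of function pointers indexed by both operands' tags, so a binary operator becomes a single indexed indirect call instead of two nested switches.

```d
enum Kind { integer, floating }

struct Value
{
    Kind kind;
    union { long i; double d; }
}

alias BinOp = Value function(Value, Value);

Value addII(Value a, Value b) { Value r; r.kind = Kind.integer;  r.i = a.i + b.i; return r; }
Value addIF(Value a, Value b) { Value r; r.kind = Kind.floating; r.d = a.i + b.d; return r; }
Value addFI(Value a, Value b) { Value r; r.kind = Kind.floating; r.d = a.d + b.i; return r; }
Value addFF(Value a, Value b) { Value r; r.kind = Kind.floating; r.d = a.d + b.d; return r; }

immutable BinOp[2][2] addTable =   // one entry per pair of types
    [[&addII, &addIF], [&addFI, &addFF]];

Value add(Value a, Value b)
{
    return addTable[a.kind][b.kind](a, b);
}

void main()
{
    Value a, b;
    a.kind = Kind.integer;  a.i = 2;
    b.kind = Kind.floating; b.d = 0.5;
    assert(add(a, b).d == 2.5);
}
```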

-- 
Dmitry Olshansky
August 18, 2015
On 8/18/15 12:31 PM, Dmitry Olshansky wrote:
> On 18-Aug-2015 16:19, Andrei Alexandrescu wrote:
>> On 8/18/15 2:55 AM, Dmitry Olshansky wrote:
>>> On 18-Aug-2015 01:33, Andrei Alexandrescu wrote:
>>>> On 8/17/15 2:47 PM, Dmitry Olshansky wrote:
>>>>>
>>>>> Actually one can combine the two:
>>>>> - use integer type tag for everything built-in
>>>>> - use pointer tag for what is not
>>>>
>>>> But a pointer tag can do everything that an integer tag does. -- Andrei
>>>
>>> albeit quite a deal slooower.
>>
>> I think there's a misunderstanding. Pointers _are_ 64-bit integers and
>> may be compared as such. You can use a pointer as an integer. -- Andrei
>>
>
> An integer in a small range is faster to switch on. Plus, comparing to
> zero is faster, so if the common type has tag == 0 it's a net gain.

Agreed. These are small gains, though, unless tight loops are involved.

> Strictly speaking, a pointer with a vtbl is about as fast as a switch,
> but when we have to switch on 2 types the vtbl dispatch needs to be
> based on 2 types instead of one. So ideally we need a vtbl per pair of
> types to support e.g. fast binary operators on TaggedAlgebraic.

But I'm talking about using pointers for indirect calls IN ADDITION to using pointers for simple integral comparison. So the comparison is not appropriate. It's better to have both options instead of just one.


Andrei

August 18, 2015
On 18.08.2015 at 00:21, Andrei Alexandrescu wrote:
> I'll preface my review with a general comment. This API comes at an
> interesting juncture; we're striving as much as possible for interfaces
> that abstract away lifetime management, so they can be used comfortably
> with GC, or at high performance (and hopefully no or only marginal loss
> of comfort) with client-chosen lifetime management policies.
>
> The JSON API is a great test bed for our emerging recommended "push
> lifetime up" idioms; it's not too complicated yet it's not trivial
> either, and has great usefulness.
>
> With this, here are some points:
>
> * All new stuff should go in std.experimental. I assume "stdx" would
> change to that, should this work be merged.

Check.

>
> * On the face of it, dedicating 6 modules to such a small specification
> as JSON seems excessive. I'm thinking one module here. (As a simple
> point: who would ever want to import only foundation, which in turn has
> one exception type and one location type in it?) I think it shouldn't be
> up for debate that we must aim for simple and clean APIs.

That would mean a single module that is >5k lines long. Spreading out certain things, such as JSONValue, into their own modules also makes sense to avoid unnecessarily large imports where other parts of the functionality aren't needed. Maybe we could move some private things to "std.internal" or similar and merge some of the modules?

But I also think that grouping symbols by topic is a good thing and makes figuring out the API easier. There is also always package.d if you really want to import everything.

> * stdx.data.json.generator: I think the API for converting in-memory
> JSON values to strings needs to be redone, as follows:
>
> - JSONValue should offer a byToken range, which offers the contents of
> the value one token at a time. For example, "[ 1, 2, 3 ]" offers the '['
> token followed by three numeric tokens with the respective values
> followed by the ']' token.

An input-range-style generator is on the TODO list, but would a token range really be useful for anything in practice? I would just go straight for a char range.

Another thing I'd like to add is an output range that takes parser nodes and writes to a string output range. This would be the kind of interface that would be most useful for a serialization framework.

> - On top of byToken it's immediate to implement a method (say toJSON or
> toString) that accepts an output range of characters and formatting
> options.
>
> - On top of the method above with output range, implementing a toString
> overload that returns a string for convenience is a two-liner. However,
> it shouldn't return a "string"; Phobos APIs should avoid "hardcoding"
> the string type. Instead, it should return a user-chosen string type
> (including reference counting strings).

Without any existing code to test this against, what would this look like? Simply using an `Appender!rcstring`?
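
One possible shape, sketched with a stubbed JSONValue to stay self-contained: the core writer takes any character output range, and the convenience overload is parameterized on the result's string type (a refcounted string would plug in via its own appender).

```d
import std.array : appender;
import std.range.primitives : isOutputRange;

struct JSONValue { double number; }   // stand-in for the real type

void toJSON(Output)(JSONValue value, ref Output output)
    if (isOutputRange!(Output, char))
{
    import std.format : formattedWrite;
    output.formattedWrite("%s", value.number);   // stub for the real writer
}

S toJSON(S = string)(JSONValue value)
{
    auto app = appender!S();
    toJSON(value, app);
    return app.data;
}

void main()
{
    auto s = toJSON(JSONValue(3.14));   // default: GC-allocated string
    assert(s == "3.14");
    // with a hypothetical refcounted string: auto r = toJSON!RCString(v);
}
```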

>
> - While at it make prettyfication a flag in the options, not its own
> part of the function name.

Already done. Pretty printing is now the default and there is GeneratorOptions.compact.

> * stdx.data.json.lexer:
>
> - I assume the idea was to accept ranges of integrals to mean "there's
> some raw input from a file". This seems to be a bit overdone, e.g.
> there's no need to accept signed integers or 64-bit integers. I suggest
> just going with the three character types.

It's funny you say that, because this was your own design proposal.

Regarding the three character types: if we drop everything but those, I think we could also go with Walter's suggestion and just drop everything apart from "char". Putting a conversion range from dchar to char in front would be trivial and should be fast enough.
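
Such a conversion range is indeed small; here's a hand-rolled sketch (not tied to any particular Phobos helper) that lazily re-encodes a range of dchar as a range of char:

```d
import std.range.primitives;
import std.utf : encode;

struct ByChar(R) if (isInputRange!R && is(ElementType!R : dchar))
{
    private R source;
    private char[4] buf;
    private size_t len, pos;

    this(R source) { this.source = source; fill(); }

    private void fill()
    {
        if (source.empty) { len = pos = 0; return; }
        len = encode(buf, source.front);   // one code point -> 1..4 code units
        pos = 0;
        source.popFront();
    }

    @property bool empty() const { return len == 0; }
    @property char front() const { return buf[pos]; }
    void popFront() { if (++pos == len) fill(); }
}

void main()
{
    import std.array : array;
    auto utf8 = ByChar!dstring("häh?"d).array;
    assert(utf8 == "häh?");   // char[] vs string compares code units
}
```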

> - I see tokenization accepts input ranges. This forces the tokenizer to
> store its own copy of things, which is no doubt the business of
> appenderFactory.  Here the departure of the current approach from what I
> think should become canonical Phobos APIs deepens for multiple reasons.
> First, appenderFactory does allow customization of the append operation
> (nice) but that's not enough to allow the user to customize the lifetime
> of the created strings, which is usually reflected in the string type
> itself. So the lexing method should be parameterized by the string type
> used. (By default string (as is now) should be fine.) Therefore instead
> of customizing the append method just customize the string type used in
> the token.

Okay, sounds reasonable if Appender!rcstring is just going to work.
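
In other words (hypothetical names), the token's payload type becomes a template parameter, and any appending is an implementation detail of the lexer:

```d
struct JSONToken(String = string)
{
    String text;   // payload lives in a caller-chosen string type
}

JSONToken!String[] lexJSON(String = string, Input)(Input input)
{
    // stub: a real lexer would return a lazy range of tokens
    return [JSONToken!String(cast(String) input)];
}

void main()
{
    auto tokens = lexJSON(`"hi"`);               // payloads default to string
    static assert(is(typeof(tokens[0].text) == string));
    // auto rcTokens = lexJSON!MyRCString(src);  // hypothetical RC string type
}
```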

> - The lexer should internally take optimization opportunities, e.g. if
> the string type is "string" and the lexed type is also "string", great,
> just use slices of the input instead of appending them to the tokens.

It does.

> - As a consequence the JSONToken type also needs to be parameterized by
> the type of its string that holds the payload. I understand this is a
> complication compared to the current approach, but I don't see an out.
> In the grand scheme of things it seems a necessary evil: tokens may or
> may not need a means to manage lifetime of their payload, and that's
> determined by the type of the payload. Hopefully simplifications in
> other areas of the API would offset this.

It wouldn't be too bad here, because it's presumably pretty rare to store tokens or parser nodes. Worse is JSONValue.

> - At token level there should be no number parsing. Just store the
> payload with the token and leave it for later. Very often numbers are
> converted without there being a need, and the process is costly. This
> also nicely sidesteps the entire matter of bigints, floating point etc.
> at this level.

Okay, again, this was your own suggestion. The downside of always storing the string representation is that it requires allocations if no slices are used, and that the string will have to be parsed twice if the number is indeed going to be used. This can have a considerable performance impact.
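
For concreteness, a tiny sketch of the deferred scheme (hypothetical names): the token keeps the raw slice, and conversion happens only if and when the number is actually used, in whatever type the caller wants. This is also what makes the ulong case mentioned later in the thread tractable:

```d
import std.conv : to;

struct NumberToken
{
    string raw;                          // ideally a slice of the input
    T get(T)() const { return raw.to!T; }
}

void main()
{
    auto t = NumberToken("18446744073709551615");
    assert(t.get!ulong == ulong.max);    // double, BigInt, ... work the same
}
```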

> - Also, at token level strings should be stored with escapes unresolved.
> If the user wants a string with the escapes resolved, a lazy range does it.

To make things efficient, it currently stores escaped strings if slices of the input are used, but stores unescaped strings if allocations are necessary anyway.

> - Validating UTF is tricky; I've seen some discussion in this thread
> about it. On the face of it JSON only accepts valid UTF characters. As
> such, a modularity-based argument is to pipe UTF validation before
> tokenization. (We need a lazy UTF validator and sanitizer stat!) An
> efficiency-based argument is to do validation during tokenization. I'm
> inclining in favor of modularization, which allows us to focus on one
> thing at a time and do it well, instead of duplicating validation
> everywhere. Note that it's easy to write routines that do JSON
> tokenization and leave UTF validation for later, so there's a lot of
> flexibility in composing validation with JSONization.

It's unfortunate to see this change of mind in the face of the work that already went into the implementation. I also still think that this is a good optimization opportunity that doesn't really affect the implementation complexity. Validation isn't duplicated, but reused from std.utf.

> - Litmus test: if the input type is a forward range AND if the string
> type chosen for tokens is the same as input type, successful
> tokenization should allocate exactly zero memory. I think this is a
> simple way to make sure that the tokenization API works well.

Supporting arbitrary forward ranges doesn't seem to be enough; it would at least have to be combined with something like take(), but then the type doesn't equal the string type anymore. I'd suggest keeping it to "if it is sliceable and the input type equals the string type", at least for the initial version.

> - If noThrow is a runtime option, some functions can't be nothrow (and
> consequently nogc). Not sure how important this is. Probably quite a bit
> because of the current gc implications of exceptions. IMHO: at lexing
> level a sound design might just emit error tokens (with the culprit as
> payload) and never throw. Clients may always throw when they see an
> error token.

noThrow is a compile-time option, and there are nothrow unit tests to make sure that the API is nothrow at least for string inputs.
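
A sketch of the error-token idea with hypothetical names: the lexer itself never throws, malformed input surfaces as an in-band token, and throwing (or not) becomes the caller's decision.

```d
enum TokenKind { number, str, error }

struct Token
{
    TokenKind kind;
    string text;   // for `error`: the offending slice of the input
}

Token[] lex(string input) nothrow
{
    // stub: a real lexer would stop at the first bad code unit
    return [Token(TokenKind.error, input)];
}

void main()
{
    foreach (t; lex("@oops"))
        assert(t.kind == TokenKind.error);   // a client may throw here instead
}
```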

> * stdx.data.json.parser:
>
> - Similar considerations regarding string type used apply here as well:
> everything should be parameterized with it - the use case to keep in
> mind is someone wants everything with refcounted strings.

Okay.

> - The JSON value does its own internal allocation (for e.g. arrays and
> hashtables), which should be fine as long as it's encapsulated and we
> can tweak it later (e.g. make it use reference counting inside).

Since it's based on (Tagged)Algebraic, the internal types are part of the interface. Changing them later is bound to break some code. So AFAICS this would require either making the used types (string, array and AA types) parameters, or abstracting them away completely, i.e. only forwarding operations but denying direct access to the stored type.

... thinking about it, TaggedAlgebraic could do that, while Algebraic can't.

> - parseJSONStream should parameterize on string type, not on
> appenderFactory.

Okay.

> - Why both parseJSONStream and parseJSONValue? I'm thinking
> parseJSONValue would be enough because then you trivially parse a stream
> with repeated calls to parseJSONValue.

parseJSONStream is the pull parser (StAX style) interface. It returns the contents of a JSON document as individual nodes instead of storing them in a DOM. This part is vital for high-performance parsing, especially of large documents.
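
A self-contained toy of the pull model (not the proposed API, just its shape): events stream out of a range one node at a time, and nothing resembling a DOM is ever accumulated.

```d
import std.stdio : writeln;

enum Event { objectStart, key, number, objectEnd }

struct Node { Event kind; string key; double number; }

// a fixed event stream standing in for parsing `{"a": 1}`
Node[] toyStream()
{
    return [Node(Event.objectStart), Node(Event.key, "a"),
            Node(Event.number, null, 1), Node(Event.objectEnd)];
}

void main()
{
    foreach (node; toyStream())
        final switch (node.kind)
        {
            case Event.objectStart: writeln("{");               break;
            case Event.key:         writeln("key: ", node.key); break;
            case Event.number:      writeln(node.number);       break;
            case Event.objectEnd:   writeln("}");               break;
        }
}
```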

> - FWIW I think the whole thing with accommodating BigInt etc. is an
> exaggeration. Just stick with long and double.

As mentioned earlier somewhere in this thread, there are practical needs to at least be able to handle ulong, too. Maybe the solution is indeed to just (optionally) store the string representation, so people can convert as they see fit.

> - readArray suddenly introduces a distinct kind of interacting -
> callbacks. Why? Should be a lazy range lazy range lazy range. An adapter
> using callbacks is then a two-liner.

It just has a more complicated implementation, but it's already on the TODO list.
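
The adapter direction Andrei alludes to, sketched generically: given a lazy range, a callback-style readArray really is a two-liner.

```d
void readArray(R, T)(R range, void delegate(T) callback)
{
    foreach (item; range) callback(item);
}

void main()
{
    import std.range : iota;
    int sum;
    void delegate(int) cb = (x) { sum += x; };
    readArray(iota(3), cb);   // 0 + 1 + 2
    assert(sum == 3);
}
```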

> - Why is readBool even needed? Just readJSONValue and then enforce it as
> a bool. Same reasoning applies to readDouble and readString.

This is for lower-level access. Using parseJSONValue would certainly be possible, but it would incur quite some unneeded overhead and would also be non-@nogc.

> - readObject is with callbacks again - it would be nice if it were a
> lazy range.

Okay, is also already on the list.

> - skipXxx are nice to have and useful.
>
> * stdx.data.json.value:
>
> - The etymology of "opt" is unclear - no word starting with "opt" or
> obviously abbreviating to it is in the documentation. "opt2" is awkward.
> How about "path" and "dyn", respectively.

The names are just placeholders currently. I think one of the two should be enough; I've implemented both so that both can be tested/seen in practice. There have also been some more name suggestions in a thread mentioned by Meta, with a more general suggestion for normal D member access. I'll see if I can dig those up, too.

> - I think Algebraic should be used throughout instead of
> TaggedAlgebraic, or motivation be given for the latter.

There have already been quite a few arguments that I think are compelling, especially given the lack of counterarguments (maybe their consequences need to be explained better, though). TaggedAlgebraic could also (implicitly) convert to Algebraic. An additional argument is the potential for TaggedAlgebraic to abstract away the underlying type, since it doesn't rely on a has!T and get!T API.

But apart from that, Algebraic is unfortunately currently quite unsuited for this kind of abstraction, even if that can be solved in theory (with a lot of work). It requires writing things like obj.get!(JSONValue[string])["foo"].get!JSONValue instead of just obj["foo"], because it simply returns Variant from all of its forwarded operators.
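
The gap is easy to demonstrate with std.variant's This placeholder (a small sketch; the commented line is what a type-aware TaggedAlgebraic could offer instead):

```d
import std.variant : Algebraic, This;

alias JSONValue = Algebraic!(long, string, This[string]);

void main()
{
    auto obj = JSONValue(["foo": JSONValue(42L)]);

    // today, through Algebraic's type-erased operators:
    auto x = obj.get!(JSONValue[string])["foo"].get!long;
    assert(x == 42);

    // what a TaggedAlgebraic that knows its member types could allow:
    // auto y = obj["foo"];   // typed result, no get!T chain
}
```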

> - JSONValue should be more opaque and not expose representation as much
> as it does now. In particular, offering a built-in hashtable is bound to
> be problematic because those are expensive to construct, create garbage,
> and are not customizable. Instead, the necessary lookup and set APIs
> should be provided by JSONValue whilst keeping the implementation
> hidden. The same goes about array - a JSONValue shall not be exposed;
> instead, indexed access primitives should be exposed. Separate types
> might be offered (e.g. JSONArray, JSONDictionary) if deemed necessary.
> The string type should be a type parameter of JSONValue.

This would unfortunately destroy almost all of the benefits of using (Tagged)Algebraic, namely that it opens up the possibility of interoperability between different data formats (for example, passing a JSONValue to a BSON generator without letting the BSON generator know about JSON). This is an area that I've not yet properly exploreded, but I think it's important as we go forward with other data formats.
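
For concreteness, the opaque direction being debated would look roughly like the following (purely hypothetical, and exactly what the interoperability argument above weighs against):

```d
struct JSONValue
{
    private JSONValue[string] fields;   // hidden; swappable for an RC map later
    private double number;

    this(double n) { number = n; }

    // lookup/set primitives instead of an exposed built-in hashtable
    ref JSONValue opIndex(string key) { return fields[key]; }
    void opIndexAssign(JSONValue v, string key) { fields[key] = v; }
}

void main()
{
    JSONValue obj;
    obj["pi"] = JSONValue(3.14);
    assert(obj["pi"].number == 3.14);
}
```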

>
> ==============================
>
> So, here we are. I realize a good chunk of this is surprising ("you mean
> I shouldn't create strings in my APIs?"). My point here is, again, we're
> at a juncture. We're trying to factor garbage (heh) out of API design in
> ways that defer the lifetime management to the user of the API.

Most suggestions so far sound very reasonable, namely parameterizing parsing/lexing on the string type and using ranges where possible. JSONValue is a different beast that needs some more thought if we really want to keep it generic in terms of allocation/lifetime model.

In terms of removing "garbage" from the API, I'm just not 100% sure whether removing small but frequently used functions, such as a string conversion function (one that returns an allocated string), is really a good idea (as Walter suggested).

>
> We could pull json into std.experimental and defer the policy decisions
> for later, but I think it's a great driver for them. (Thanks Sönke for
> doing all the work, this is a great baseline.) I think we should use the
> JSON API as a guinea pig for the new era of D API design in which we
> have a solid set of principles, tools, and guidelines to defer lifetime
> management. Please advise.