August 21, 2015
On 8/18/15 1:21 PM, Sönke Ludwig wrote:
> Am 18.08.2015 um 00:37 schrieb Andrei Alexandrescu:
>> On 8/17/15 2:56 PM, Sönke Ludwig wrote:
>>> - The enum is useful to be able to identify the types outside of the D
>>> code itself. For example when serializing the data to disk, or when
>>> communicating with C code.
>>
>> OK.
>>
>>> - It enables the use of pattern matching (final switch), which is often
>>> very convenient, faster, and safer than an if-else cascade.
>>
>> Sounds tenuous.
>
> It's more convenient/readable in cases where a complex type is used
> (typeID == Type.object vs. has!(JSONValue[string])). This is especially
> true if the type is ever changed (or parametric) and all has!()/get!()
> code needs to be adjusted accordingly.
>
> It's faster, even if there is no indirect call involved in the pointer
> case, because the compiler can emit efficient jump tables instead of
> generating a series of conditional jumps (an if-else cascade).
>
> It's safer because of the possibility of using final switch in addition
> to a normal switch.
>
> I wouldn't call that tenuous.

Well I guess I would, but no matter. It's something where reasonable people may disagree.
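For the record, here is roughly what the two styles look like side by side. This is a sketch with made-up names; Kind and the payload types stand in for the proposal's actual ones.

    import std.variant : Algebraic;

    enum Kind { null_, boolean, number, object }

    // Tagged style: final switch forces every Kind to be handled and
    // lets the compiler emit a jump table.
    void visit(Kind kind)
    {
        final switch (kind)
        {
            case Kind.null_:   /* ... */ break;
            case Kind.boolean: /* ... */ break;
            case Kind.number:  /* ... */ break;
            case Kind.object:  /* ... */ break;
        }
    }

    // Untagged style: an if-else cascade of peek! tests, with no
    // compile-time exhaustiveness check.
    void visit(Algebraic!(bool, double, string[string]) v)
    {
        if (auto p = v.peek!bool) { /* ... */ }
        else if (auto p = v.peek!double) { /* ... */ }
        else if (auto p = v.peek!(string[string])) { /* ... */ }
        // a forgotten case here goes unnoticed until runtime
    }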

>>> - A hypothesis is that it is faster, because there is no function call
>>> indirection involved.
>>
>> Again: pointers do all that integrals do. To compare:
>>
>> if (myptr == ThePtrOf!int) { ... this is an int ... }
>>
>> I want to make clear that this is understood.
>
> Got that.
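To spell out the pseudocode: one distinct static address per type is enough to play the role of the integral tag. A sketch, not std.variant's actual implementation (VariantN uses a handler function pointer in a similar role):

    // One unique address per instantiation acts as the type tag.
    immutable(void)* ThePtrOf(T)()
    {
        static immutable ubyte anchor;
        return &anchor;
    }

    struct Boxed
    {
        immutable(void)* typePtr;   // the "tag", pointer-sized
        union { int i; double d; }
    }

    void use(ref Boxed b)
    {
        if (b.typePtr == ThePtrOf!int)         { /* ... this is an int ... */ }
        else if (b.typePtr == ThePtrOf!double) { /* ... */ }
    }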
>
>>
>>> - It naturally enables fully statically typed operator forwarding as far
>>> as possible (have a look at the examples of the current version). A
>>> pointer based version could do this, too, but only by jumping through
>>> hoops.
>>
>> I'm unclear on that. Could you please point me to the actual file and
>> lines?
>
> See the operator implementation code [1] that is completely statically
> typed until the final "switch" happens [2]. You can of course do the
> same for the pointer based Algebraic, but that would just
> duplicate/override the code that is already implemented by the pointer
> method.

Classic code factoring can be done to avoid duplication.

>>> - The same type can be used multiple times with a different enum name.
>>> This can alternatively be solved using a Typedef!T, but I had several
>>> occasions where that proved useful.
>>
>> Unclear on this.
>
> I'd say this is just a little perk of the representation but not a hard
> argument since it can be achieved in a different way relatively easily.
>
> [1]:
> https://github.com/s-ludwig/taggedalgebraic/blob/591b45ca8f99dbab1da966192c67f45354c1e34e/source/taggedalgebraic.d#L145
>
> [2]:
> https://github.com/s-ludwig/taggedalgebraic/blob/591b45ca8f99dbab1da966192c67f45354c1e34e/source/taggedalgebraic.d#L551

Thanks.


Andrei
August 21, 2015
On 8/18/15 12:54 PM, Sönke Ludwig wrote:
> Am 18.08.2015 um 00:21 schrieb Andrei Alexandrescu:
>> * On the face of it, dedicating 6 modules to such a small specification
>> as JSON seems excessive. I'm thinking one module here. (As a simple
>> point: who would ever want to import only foundation, which in turn has
>> one exception type and one location type in it?) I think it shouldn't be
>> up for debate that we must aim for simple and clean APIs.
>
> That would mean a single module that is >5k lines long. Spreading out
> certain things, such as JSONValue, into its own module also makes sense
> to avoid unnecessarily large imports where other parts of the
> functionality aren't needed. Maybe we could move some private things to
> "std.internal" or similar and merge some of the modules?

That would help. My point is it's good design to make the response proportional to the problem. 5K lines is not a lot, but reducing those 5K in the first place would be a noble pursuit. And btw saving parsing time is so C++ :o).

> But I also think that grouping symbols by topic is a good thing and
> makes figuring out the API easier. There is also always package.d if you
> really want to import everything.

Figuring out the API easily is a good goal. The best way to achieve that is making the API no larger than necessary.

>> * stdx.data.json.generator: I think the API for converting in-memory
>> JSON values to strings needs to be redone, as follows:
>>
>> - JSONValue should offer a byToken range, which offers the contents of
>> the value one token at a time. For example, "[ 1, 2, 3 ]" offers the '['
>> token followed by three numeric tokens with the respective values
>> followed by the ']' token.
>
> An input-range-style generator is on the TODO list, but would a token
> range really be useful for anything in practice? I would just go
> straight for a char range.

Sounds good.

> Another thing I'd like to add is an output range that takes parser nodes
> and writes to a string output range. This would be the kind of interface
> that would be most useful for a serialization framework.

Couldn't that be achieved trivially by e.g. using map!(t => t.toString) or similar?

This is the nice thing about rangifying everything - suddenly you have a host of tools at your disposal.
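For instance, something along these lines (a sketch; the node type and its toString are assumptions, not the proposal's exact API):

    import std.algorithm : copy, joiner;
    import std.array : appender;

    void main()
    {
        auto nodes = ["[", "1", ",", "2", "]"]; // stand-ins for parser nodes
        auto sink = appender!string();
        // with real parser nodes: nodes.map!(n => n.toString).joiner.copy(sink);
        nodes.joiner.copy(sink);
        assert(sink.data == "[1,2]");
    }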

>> - On top of byToken it's immediate to implement a method (say toJSON or
>> toString) that accepts an output range of characters and formatting
>> options.
>>
>> - On top of the method above with output range, implementing a toString
>> overload that returns a string for convenience is a two-liner. However,
>> it shouldn't return a "string"; Phobos APIs should avoid "hardcoding"
>> the string type. Instead, it should return a user-chosen string type
>> (including reference counting strings).
>
> Without any existing code to test this against, what would this look
> like? Simply using an `Appender!rcstring`?

Yes.
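Sketched out, with rcstring purely hypothetical -- any string type that works with Appender would do:

    import std.array : appender;
    import std.range.primitives : put;

    // The output-range primitive works with any character sink. (The
    // JSONValue parameter is elided; put() stands in for real generation.)
    void toJSON(Output)(ref Output output)
    {
        put(output, `{"example":true}`);
    }

    // The convenience overload is then a two-liner, parameterized on the
    // string type instead of hardcoding `string`.
    S toJSONString(S = string)()
    {
        auto app = appender!S();
        toJSON(app);
        return app.data;
    }

    void main()
    {
        assert(toJSONString() == `{"example":true}`);
    }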

>> - While at it, make prettification a flag in the options, not its own
>> part of the function name.
>
> Already done. Pretty printing is now the default and there is
> GeneratorOptions.compact.

Great, thanks.

>> * stdx.data.json.lexer:
>>
>> - I assume the idea was to accept ranges of integrals to mean "there's
>> some raw input from a file". This seems to be a bit overdone, e.g.
>> there's no need to accept signed integers or 64-bit integers. I suggest
>> just going with the three character types.
>
> It's funny you say that, because this was your own design proposal.

Ooops...

> Regarding the three character types, if we drop everything but those, I
> think we could also go with Walter's suggestion and just drop everything
> apart from "char". Putting a conversion range from dchar to char in
> front would be trivial and should be fast enough.

That's great, thanks.
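For what it's worth, assuming std.utf.byChar is available in the targeted Phobos version, the building block is already there:

    import std.array : array;
    import std.utf : byChar;

    void main()
    {
        dstring wide = "häh?"d;    // a range of dchar
        auto narrow = wide.byChar; // lazy UTF-8 re-encoding, no allocation
        assert(narrow.array == "häh?");
    }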

>> - I see tokenization accepts input ranges. This forces the tokenizer to
>> store its own copy of things, which is no doubt the business of
>> appenderFactory.  Here the departure of the current approach from what I
>> think should become canonical Phobos APIs deepens for multiple reasons.
>> First, appenderFactory does allow customization of the append operation
>> (nice) but that's not enough to allow the user to customize the lifetime
>> of the created strings, which is usually reflected in the string type
>> itself. So the lexing method should be parameterized by the string type
>> used. (By default string (as is now) should be fine.) Therefore instead
>> of customizing the append method just customize the string type used in
>> the token.
>
> Okay, sounds reasonable if Appender!rcstring is just going to work.

Awesome, thanks.

>> - The lexer should internally take optimization opportunities, e.g. if
>> the string type is "string" and the lexed type is also "string", great,
>> just use slices of the input instead of appending them to the tokens.
>
> It does.

Yay to that.
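For reference, the shape of that optimization (a sketch; the index-based interface is an assumption, not the module's actual internals):

    import std.array : appender;

    // Return a token's text either as a zero-copy slice of the input,
    // or as a freshly built String when the types don't line up.
    String tokenText(String, Input)(Input input, size_t start, size_t end)
    {
        static if (is(Input == String))
            return input[start .. end];   // free: just a slice
        else
        {
            auto app = appender!String(); // fallback: copy
            foreach (i; start .. end)
                app.put(input[i]);
            return app.data;
        }
    }

    void main()
    {
        string src = `{"key":123}`;
        assert(tokenText!string(src, 2, 5) == "key");
    }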

>> - At token level there should be no number parsing. Just store the
>> payload with the token and leave it for later. Very often numbers are
>> converted without there being a need, and the process is costly. This
>> also nicely sidesteps the entire matter of bigints, floating point etc.
>> at this level.
>
> Okay, again, this was your own suggestion. The downside of always
> storing the string representation is that it requires allocations if no
> slices are used, and that the string will have to be parsed twice if the
> number is indeed going to be used. This can have a considerable
> performance impact.

Hmm, point taken. I'm not too worried about the parsing part but string allocation may be problematic.
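A possible middle ground is a token that keeps the raw text as a slice and converts only on demand (names are illustrative, not the proposal's):

    import std.conv : to;

    struct NumberToken
    {
        const(char)[] raw; // slice of the input, no allocation

        double asDouble() const { return raw.to!double; }
        long   asLong()   const { return raw.to!long; }
    }

    void main()
    {
        auto tok = NumberToken("42");
        assert(tok.asLong == 42); // parsed only when requested
    }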

>> - Also, at token level strings should be stored with escapes unresolved.
>> If the user wants a string with the escapes resolved, a lazy range
>> does it.
>
> To make things efficient, it currently stores escaped strings if slices
> of the input are used, but stores unescaped strings if allocations are
> necessary anyway.

That seems a good balance, and probably could be applied to numbers as well.
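The lazy unescaping range for the slice case could be as simple as this sketch (it handles only \n and \" here; a real version covers the full JSON escape set including \uXXXX):

    struct Unescape
    {
        const(char)[] s; // still-escaped slice of the input

        @property bool empty() const { return s.length == 0; }
        @property char front() const
        {
            if (s[0] != '\\') return s[0];
            switch (s[1])
            {
                case 'n': return '\n';
                case '"': return '"';
                default:  return s[1]; // covers \\ and \/
            }
        }
        void popFront() { s = s[0] == '\\' ? s[2 .. $] : s[1 .. $]; }
    }

    void main()
    {
        import std.array : array;
        assert(Unescape(`line\nbreak`).array == "line\nbreak");
    }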

>> - Validating UTF is tricky; I've seen some discussion in this thread
>> about it. On the face of it JSON only accepts valid UTF characters. As
>> such, a modularity-based argument is to pipe UTF validation before
>> tokenization. (We need a lazy UTF validator and sanitizer stat!) An
>> efficiency-based argument is to do validation during tokenization. I'm
>> inclining in favor of modularization, which allows us to focus on one
>> thing at a time and do it well, instead of duplicating validation
>> everywhere. Note that it's easy to write routines that do JSON
>> tokenization and leave UTF validation for later, so there's a lot of
>> flexibility in composing validation with JSONization.
>
> It's unfortunate to see this change of mind in the face of the work
> that already went into the implementation. I also still think that this is a
> good optimization opportunity that doesn't really affect the
> implementation complexity. Validation isn't duplicated, but reused from
> std.utf.

Well if the validation is reused from std.utf, it can't have been very much work. I maintain that separating concerns seems like a good strategy here.
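The lazy validator could start as small as this sketch (hypothetical -- nothing like it is in Phobos yet); it hands the code units through unchanged and throws on malformed input:

    import std.utf : decode;

    struct ValidateUTF8
    {
        string s;
        size_t i;      // current position
        size_t seqEnd; // end of the currently validated sequence

        this(string s)
        {
            this.s = s;
            if (s.length) check();
        }
        private void check()
        {
            size_t j = i;
            decode(s, j); // throws UTFException on malformed input
            seqEnd = j;
        }
        @property bool empty() const { return i >= s.length; }
        @property char front() const { return s[i]; }
        void popFront()
        {
            if (++i >= seqEnd && i < s.length) check();
        }
    }

    void main()
    {
        import std.array : array;
        assert(ValidateUTF8("häh?").array == "häh?");
    }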

>> - Litmus test: if the input type is a forward range AND if the string
>> type chosen for tokens is the same as input type, successful
>> tokenization should allocate exactly zero memory. I think this is a
>> simple way to make sure that the tokenization API works well.
>
> Supporting arbitrary forward ranges doesn't seem to be enough; it would
> at least have to be combined with something like take(), but then the
> type doesn't equal the string type anymore. I'd suggest keeping it to
> "if it is sliceable and the input type equals the string type", at
> least for the initial version.

I had "take" in mind. Don't forget that "take" automatically uses slices wherever applicable. So if you just use typeof(take(...)), you get the best of all worlds.

The more restrictive version seems reasonable for the first release.
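To illustrate the take() point -- per std.range's documentation, Take!R is R itself when R supports slicing and has length, so zero-copy is preserved:

    import std.range : take;

    void main()
    {
        int[] a = [1, 2, 3, 4];
        auto t = a.take(2);
        static assert(is(typeof(t) == int[])); // still a plain slice
        assert(t == [1, 2]);
    }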

>> - If noThrow is a runtime option, some functions can't be nothrow (and
>> consequently nogc). Not sure how important this is. Probably quite a bit
>> because of the current gc implications of exceptions. IMHO: at lexing
>> level a sound design might just emit error tokens (with the culprit as
>> payload) and never throw. Clients may always throw when they see an
>> error token.
>
> noThrow is a compile-time option and there are nothrow unit tests to
> make sure that the API is nothrow, at least for string inputs.

Awesome.
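The error-token design, sketched with illustrative names (not the proposal's API):

    enum TokenKind { literal, string_, number, error /* ... */ }

    struct Token
    {
        TokenKind kind;
        const(char)[] text; // for `error`: the offending input slice
    }

    // The lexer itself never throws; clients opt into throwing.
    void client(Token tok)
    {
        if (tok.kind == TokenKind.error)
            throw new Exception("bad JSON near: " ~ tok.text.idup);
    }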

>> - The JSON value does its own internal allocation (for e.g. arrays and
>> hashtables), which should be fine as long as it's encapsulated and we
>> can tweak it later (e.g. make it use reference counting inside).
>
> Since it's based on (Tagged)Algebraic, the internal types are part of
> the interface. Changing them later is bound to break some code. So
> AFAICS this would either require making the used types parameterized
> (string, array, and AA types), or abstracting them away completely,
> i.e. only forwarding operations but denying direct access to the type.
>
> ... thinking about it, TaggedAlgebraic could do that, while Algebraic
> can't.

Well if you figure the general Algebraic type is better replaced by a type specialized for JSON, fine.

What we shouldn't endorse is two nearly identical library types (Algebraic and TaggedAlgebraic) that are only different in subtle matters related to performance in certain use patterns.

If integral tags are better for closed type universes, specialize Algebraic to use integral tags where applicable.

>> - Why both parseJSONStream and parseJSONValue? I'm thinking
>> parseJSONValue would be enough because then you trivially parse a stream
>> with repeated calls to parseJSONValue.
>
> parseJSONStream is the pull parser (StAX style) interface. It returns
> the contents of a JSON document as individual nodes instead of storing
> them in a DOM. This part is vital for high-performance parsing,
> especially of large documents.

So perhaps this is just a naming issue. The names don't suggest everything you said. What I see is "parse a JSON stream" and "parse a JSON value". So I naturally assumed we're looking at consuming a full stream vs. consuming only one value off a stream and stopping. How about better names?

>> - FWIW I think the whole thing with accommodating BigInt etc. is an
>> exaggeration. Just stick with long and double.
>
> As mentioned earlier somewhere in this thread, there are practical needs
> to at least be able to handle ulong, too. Maybe the solution is indeed
> to just (optionally) store the string representation, so people can
> convert as they see fit.

Great. I trust you'll find the right compromise there. All I'm saying is that BigInt sticks out like a sore thumb in the whole affair. Best to just take it out and let folks who need it build on top of the lexer.

>> - readArray suddenly introduces a distinct kind of interaction -
>> callbacks. Why? Should be a lazy range lazy range lazy range. An adapter
>> using callbacks is then a two-liner.
>
> It just has a more complicated implementation, but is already on the
> TODO list.

Great. Let me say again that with ranges you get to instantly tap into a wealth of tools. I say get rid of the callbacks and let a "tee" take care of it for whomever needs it.
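E.g., with an array standing in for the lazy readArray range, the callback adapter really is a one-liner:

    import std.algorithm.iteration : tee;
    import std.stdio : writeln;

    void main()
    {
        auto values = [1, 2, 3]; // stand-in for a lazy readArray range
        foreach (v; values.tee!(x => writeln("callback saw ", x)))
        {
            // regular range consumption continues here
        }
    }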

>> - Why is readBool even needed? Just readJSONValue and then enforce it as
>> a bool. Same reasoning applies to readDouble and readString.
>
> This is for lower-level access; using parseJSONValue would certainly be
> possible, but it would have quite a bit of unneeded overhead and would
> also be non-@nogc.

Meh, fine. But all of this is adding weight to the API in the wrong places.

>> - readObject is with callbacks again - it would be nice if it were a
>> lazy range.
>
> Okay, is also already on the list.

Awes!

>> - skipXxx are nice to have and useful.
>>
>> * stdx.data.json.value:
>>
>> - The etymology of "opt" is unclear - no word starting with "opt" or
>> obviously abbreviating to it is in the documentation. "opt2" is awkward.
>> How about "path" and "dyn", respectively?
>
> The names are just placeholders currently. I think just one of the two
> should be enough; I've implemented both only so that both can be
> tested/seen in practice. There have also been some more name
> suggestions in a thread mentioned by Meta, with a more general
> suggestion for normal D member access. I'll see if I can dig those up,
> too.

Okay.

>> - I think Algebraic should be used throughout instead of
>> TaggedAlgebraic, or motivation be given for the latter.
>
> There have already been quite a few arguments that I think are
> compelling, especially with a lack of counterarguments (maybe their
> consequences need to be explained better, though). TaggedAlgebraic
> could also (implicitly) convert to Algebraic. An additional argument is
> the potential of TaggedAlgebraic to abstract away the underlying type,
> since it doesn't rely on a has!T and get!T API.

To reiterate the point I made above: we should not endorse two mostly equivalent types that exhibit subtle performance differences. Feel free to change Algebraic to use integrals for some/most cases when the number of types involved is bounded. Adding new methods to Algebraic should also be fine. Just don't add a new type that's 98% the same.

> But apart from that, Algebraic is unfortunately currently quite
> unsuited for this kind of abstraction, even if that can be solved in
> theory (with a lot of work). It requires writing things like
> obj.get!(JSONValue[string])["foo"].get!JSONValue instead of just
> obj["foo"], because it simply returns Variant from all of its forwarded
> operators.

Algebraic does not expose opIndex. We could add it to Algebraic such that obj["foo"] returns the same type as "this".
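A sketch of what that could look like, using std.variant's This placeholder for the recursive case (hypothetical -- opIndex written as a free function here, not current std.variant behavior):

    import std.variant : Algebraic, This;

    alias JSON = Algebraic!(bool, double, string, This[string]);

    // Indexing yields the same algebraic type, so json["foo"] composes.
    JSON at(JSON v, string key)
    {
        return v.get!(JSON[string])[key];
    }

    void main()
    {
        JSON j = ["answer" : JSON(42.0)];
        assert(j.at("answer") == JSON(42.0));
    }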

It's easy for anyone to say that what's there is unfit for a particular purpose. It's also easy for many to define an ever-so-slightly-different new artifact that fits a particular purpose. Where you come in as a talented hacker is to operate with an understanding of the importance of making things work, and to make it work.

>> - JSONValue should be more opaque and not expose representation as much
>> as it does now. In particular, offering a built-in hashtable is bound to
>> be problematic because those are expensive to construct, create garbage,
>> and are not customizable. Instead, the necessary lookup and set APIs
>> should be provided by JSONValue whilst keeping the implementation
>> hidden. The same goes for arrays - a built-in array of JSONValue shall not be exposed;
>> instead, indexed access primitives should be exposed. Separate types
>> might be offered (e.g. JSONArray, JSONDictionary) if deemed necessary.
>> The string type should be a type parameter of JSONValue.
>
> This would unfortunately at the same time destroy almost all benefits
> that using (Tagged)Algebraic has, namely that it opens up the
> possibility of interoperability between different data formats (for
> example, passing a JSONValue to a BSON generator without letting the
> BSON generator know about JSON). This is unfortunately an area that I've
> also not yet properly explored, but I think it's important as we go
> forward with other data formats.

I think we need to do it. Otherwise we're stuck with "D's JSON API cannot be used without the GC". We want to escape that gravitational pull. I know it's hard. But it's worth it.
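To make it concrete, the opaque surface could start as small as this sketch (names and representation are placeholders; the point is that callers never see the hashtable type):

    struct JSONValue
    {
        // Representation hidden: can later move to ref counting, custom
        // allocators, or a non-GC hashtable without breaking callers.
        private JSONValue[string] dict;
        private double num;

        this(double v) { num = v; }

        @property double number() const { return num; }
        JSONValue opIndex(string key) { return dict[key]; }
        void opIndexAssign(JSONValue v, string key) { dict[key] = v; }
    }

    void main()
    {
        JSONValue obj;
        obj["pi"] = JSONValue(3.14);
        assert(obj["pi"].number == 3.14);
    }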

>> ==============================
>>
>> So, here we are. I realize a good chunk of this is surprising ("you mean
>> I shouldn't create strings in my APIs?"). My point here is, again, we're
>> at a juncture. We're trying to factor garbage (heh) out of API design in
>> ways that defer the lifetime management to the user of the API.
>
> Most suggestions so far sound very reasonable, namely parameterizing
> parsing/lexing on the string type and using ranges where possible.
> JSONValue is a different beast that needs some more thought if we really
> want to keep it generic in terms of allocation/lifetime model.
>
> In terms of removing "garbage" from the API, I'm just not 100% sure if
> removing small but frequently used functions, such as a string
> conversion function (one that returns an allocated string), is really a
> good idea (as Walter suggested).

We must accommodate a GC-less world. It's definitely time to acknowledge the GC as a brake that limits D adoption, and put our full thrust behind removing it.


Andrei

August 21, 2015
On 8/21/15 1:30 PM, Andrei Alexandrescu wrote:
> So perhaps this is just a naming issue. The names don't suggest
> everything you said. What I see is "parse a JSON stream" and "parse a
> JSON value". So I naturally assumed we're looking at consuming a full
> stream vs. consuming only one value off a stream and stopping. How about
> better names?

I should add that in parseJSONStream, "stream" refers to the input, whereas in parseJSONValue, "value" refers to the output. -- Andrei
August 21, 2015
On Friday, 21 August 2015 at 17:30:43 UTC, Andrei Alexandrescu wrote:
> We must accommodate a GC-less world. It's definitely time to acknowledge the GC as a brake that limits D adoption, and put our full thrust behind removing it.
>
>
> Andrei

Wow. Just wow.
August 21, 2015
On 8/21/15 2:03 PM, tired_eyes wrote:
> On Friday, 21 August 2015 at 17:30:43 UTC, Andrei Alexandrescu wrote:
>> We must accommodate a GC-less world. It's definitely time to
>> acknowledge the GC as a brake that limits D adoption, and put our full
>> thrust behind removing it.
>>
>>
>> Andrei
>
> Wow. Just wow.

By "it" there I mean "the brake" :o). -- Andrei
August 21, 2015
On Fri, Aug 21, 2015 at 02:21:06PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> On 8/21/15 2:03 PM, tired_eyes wrote:
> >On Friday, 21 August 2015 at 17:30:43 UTC, Andrei Alexandrescu wrote:
> >>We must accommodate a GC-less world. It's definitely time to acknowledge the GC as a brake that limits D adoption, and put our full thrust behind removing it.
> >>
> >>
> >>Andrei
> >
> >Wow. Just wow.
> 
> By "it" there I mean "the brake" :o). -- Andrei

Wait, wait. So you're saying the GC is a brake, and we should remove the brake, and therefore we should remove the GC?  This is ... wow. I'm speechless here.


T

-- 
He who sacrifices functionality for ease of use, loses both and deserves neither. -- Slashdotter
August 21, 2015
On 8/21/15 2:50 PM, H. S. Teoh via Digitalmars-d wrote:
> On Fri, Aug 21, 2015 at 02:21:06PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
>> On 8/21/15 2:03 PM, tired_eyes wrote:
>>> On Friday, 21 August 2015 at 17:30:43 UTC, Andrei Alexandrescu wrote:
>>>> We must accommodate a GC-less world. It's definitely time to
>>>> acknowledge the GC as a brake that limits D adoption, and put our
>>>> full thrust behind removing it.
>>>>
>>>>
>>>> Andrei
>>>
>>> Wow. Just wow.
>>
>> By "it" there I mean "the brake" :o). -- Andrei
>
> Wait, wait. So you're saying the GC is a brake, and we should remove the
> brake, and therefore we should remove the GC?  This is ... wow. I'm
> speechless here.

Nothing new here. We want to make it a pleasant experience to use D without a garbage collector. -- Andrei

August 21, 2015
On Fri, Aug 21, 2015 at 03:22:25PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> On 8/21/15 2:50 PM, H. S. Teoh via Digitalmars-d wrote:
> >On Fri, Aug 21, 2015 at 02:21:06PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> >>On 8/21/15 2:03 PM, tired_eyes wrote:
> >>>On Friday, 21 August 2015 at 17:30:43 UTC, Andrei Alexandrescu wrote:
> >>>>We must accommodate a GC-less world. It's definitely time to acknowledge the GC as a brake that limits D adoption, and put our full thrust behind removing it.
> >>>>
> >>>>
> >>>>Andrei
> >>>
> >>>Wow. Just wow.
> >>
> >>By "it" there I mean "the brake" :o). -- Andrei
> >
> >Wait, wait. So you're saying the GC is a brake, and we should remove the brake, and therefore we should remove the GC?  This is ... wow. I'm speechless here.
> 
> Nothing new here. We want to make it a pleasant experience to use D without a garbage collector. -- Andrei

Making it pleasant to use without a GC is not the same thing as removing the GC. Which is it?


T

-- 
Try to keep an open mind, but not so open your brain falls out. -- theboz
August 21, 2015
On 8/21/15 3:22 PM, Andrei Alexandrescu wrote:
> On 8/21/15 2:50 PM, H. S. Teoh via Digitalmars-d wrote:
>> On Fri, Aug 21, 2015 at 02:21:06PM -0400, Andrei Alexandrescu via
>> Digitalmars-d wrote:
>>> On 8/21/15 2:03 PM, tired_eyes wrote:
>>>> On Friday, 21 August 2015 at 17:30:43 UTC, Andrei Alexandrescu wrote:
>>>>> We must accommodate a GC-less world. It's definitely time to
>>>>> acknowledge the GC as a brake that limits D adoption, and put our
>>>>> full thrust behind removing it.
>>>>>
>>>>>
>>>>> Andrei
>>>>
>>>> Wow. Just wow.
>>>
>>> By "it" there I mean "the brake" :o). -- Andrei
>>
>> Wait, wait. So you're saying the GC is a brake, and we should remove the
>> brake, and therefore we should remove the GC?  This is ... wow. I'm
>> speechless here.
>
> Nothing new here. We want to make it a pleasant experience to use D
> without a garbage collector. -- Andrei
>

Allow me to (possibly) clarify.

What Andrei is saying is that you should be able to use D and Phobos *without* the GC, not that we should remove the GC.

E.g., what Walter was talking about at DConf 2015: instead of converting an integer to a GC-allocated string, you return a range that produces the same characters but doesn't allocate.
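The idea, sketched as a toy (unsigned only, no sign handling): a range that yields the digits of an integer lazily, with no allocation anywhere.

    struct DigitRange
    {
        uint value;
        uint divisor = 1;

        this(uint v)
        {
            value = v;
            while (divisor <= value / 10)
                divisor *= 10;
        }
        @property bool empty() const { return divisor == 0; }
        @property char front() const
        {
            return cast(char)('0' + value / divisor % 10);
        }
        void popFront() { divisor /= 10; }
    }

    void main()
    {
        import std.algorithm.comparison : equal;
        assert(DigitRange(1203).equal("1203")); // no GC allocation
    }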

-Steve
August 22, 2015
Am 21.08.2015 um 18:54 schrieb Andrei Alexandrescu:
> On 8/19/15 4:55 AM, Sönke Ludwig wrote:
>> Am 19.08.2015 um 03:58 schrieb Andrei Alexandrescu:
>>> On 8/18/15 1:24 PM, Jacob Carlborg wrote:
>>>> On 2015-08-18 17:18, Andrei Alexandrescu wrote:
>>>>
>>>>> Me neither if internal. I do see a problem if it's public. -- Andrei
>>>>
>>>> If it's public and those 20 lines are useful on its own, I don't see a
>>>> problem with that either.
>>>
>>> In this case at least they aren't. There is no need to import the JSON
>>> exception and the JSON location without importing anything else JSON. --
>>> Andrei
>>>
>>
> The only other module where it would fit would be lexer.d, but that
> means that importing JSONValue also has to import the parser and lexer
> modules, which are usually only needed in a few places.
>
> I'm sure there are a number of better options to package things nicely.
> -- Andrei

I'm all ears ;)