RFC: std.json sucessor (page 16)

October 13, 2014

Re: RFC: std.json sucessor

Posted by Sönke Ludwig
in reply to Andrei Alexandrescu

Permalink

Sönke Ludwig

Posted in reply to Andrei Alexandrescu

Permalink

Am 12.10.2014 20:17, schrieb Andrei Alexandrescu:
> Here's my destruction of std.data.json.
>
> * lexer.d:
>
> ** Beautifully done. From what I understand, if the input is string or
> immutable(ubyte)[] then the strings are carved out as slices of the
> input, as opposed to newly allocated. Awesome.
>
> ** The string after lexing is correctly scanned and stored in raw format
> (escapes are not rewritten) and decoded on demand. Problem with decoding
> is that it may allocate memory, and it would be great (and not
> difficult) to make the lexer 100% lazy/non-allocating. To achieve that,
> lexer.d should define TWO "Kind"s of strings at the lexer level: regular
> string and undecoded string. The former is lexer.d's way of saying "I
> got lucky" in the sense that it didn't detect any '\\' so the raw and
> decoded strings are identical. No need for anyone to do any further
> processing in the majority of cases => win. The latter means the lexer
> lexed the string, saw at least one '\\', and leaves it to the caller to
> do the actual decoding.

This is actually more or less done in unescapeStringLiteral() - if it doesn't find any '\\', it just returns the original string. Also JSONString allows to access its .rawValue without doing any decoding/allocations.

https://github.com/s-ludwig/std_data_json/blob/master/source/stdx/data/json/lexer.d#L1421

Unfortunately .rawValue can't be @nogc because the "raw" value might have to be constructed first when the input is not a "string" (in this case unescaping is done on-the-fly for efficiency reasons).


>
> ** After moving the decoding business out of lexer.d, a way to take this
> further would be to qualify lexer methods as @nogc if the input is
> string/immutable(ubyte)[]. I wonder how to implement a conditional
> attribute. We'll probably need a language enhancement for that.

Isn't @nogc inferred? Everything is templated, so that should be possible. Or does attribute inference only work for template function and not for methods of templated types? Should it?

>
> ** The implementation uses manually-defined tagged unions for work.
> Could we use Algebraic instead - dogfooding and all that? I recall there
> was a comment in Sönke's original work that Algebraic has a specific
> issue (was it false pointers?) - so the question arises, should we fix
> Algebraic and use it thus helping other uses as well?

I had started on an implementation of a type and ID safe TaggedAlgebraic that uses Algebraic for its internal storage. If we can get that in first, it should be no problem to use it instead (with no or minimal API breakage). However, it uses a struct instead of an enum to define the "Kind" (which is the only nice way I could conceive to safely couple enum value and type at compile time), so it's not as nice in the generated documentation.

>
> ** I see the "boolean" kind, should we instead have the "true_" and
> "false_" kinds?

I always found it cumbersome and awkward to work like that. What would be the reason to go that route?

>
> ** Long story short I couldn't find any major issue with this module,
> and I looked! I do think the decoding logic should be moved outside of
> lexer.d or at least the JSONLexerRange.
>
> * generator.d: looking good, no special comments. Like the consistent
> use of structs filled with options as template parameters.
>
> * foundation.d:
>
> ** At four words per token, Location seems pretty bulky. How about
> reducing line and column to uint?

Single line JSON files >64k (or line counts >64k) are no exception, so that would only work in a limited way. My thought about this was that it is quite unusual to actually store the tokens for most purposes (especially when directly serializing to a native D type), so that it should have minimal impact on performance or memory consumption.

>
> ** Could JSONException create the message string in toString (i.e.
> when/if used) as opposed to in the constructor?

That could of course be done, but the you'd not get the full error message using ex.msg, only with ex.toString(), which usually prints a call trace instead. Alternatively, it's also possible to completely avoid using exceptions with LexOptions.noThrow.

>
> * parser.d:
>
> ** How about using .init instead of .defaults for options?

I'd slightly tend to prefer the more explicit "defaults", especially because "init" could mean either "defaults" or "none" (currently it means "none"). But another idea would be to invert the option values so that defaults==none... any objections?

>
> ** I'm a bit surprised by JSONParserNode.Kind. E.g. the objectStart/End
> markers shouldn't appear as nodes. There should be an "object" node
> only. I guess that's needed for laziness.

While you could infer the end of an object in the parser range by looking for the first entry that doesn't start with a "key" node, the same would not be possible for arrays, so in general the end marker *is* required. Not that the parser range is a StAX style parser, which is still very close to the lexical structure of the document.

I was also wondering if there might be a better name than "JSONParserNode". It's not really embedded into a tree or graph structure, which the name tends to suggest.

>
> ** It's unclear where memory is being allocated in the parser. @nogc
> annotations wherever appropriate would be great.

The problem is that the parser accesses the lexer, which in turn accesses the underlying input range, which in turn could allocate. Depending on the options passed to the lexer, it could also throw, and thus allocate, an exception. In the end only JSONParserRange.empty could generally be made @nogc.

However, attribute inference should be possible here in theory (the noThrow option is compile-time).

>
> * value.d:
>
> ** Looks like this is/may be the only place where memory is being
> managed, at least if the input is string/immutable(ubyte)[]. Right?

Yes, at least when setting aside optional exceptions and lazy allocations.

>
> ** Algebraic ftw.
>
> ============================
>
> Overall: This is very close to everything I hoped! A bit more care to
> @nogc would be awesome, especially with the upcoming focus on memory
> management going forward.

I've tried to use @nogc (as well as nothrow) in more places, but mostly due to not knowing if the underlying input range allocates, it hasn't really been possible. Even on lower levels (private functions) almost any Phobos function that is called is currently not @nogc for reasons that are not always obvious, so I gave up on that for now.

>
> After one more pass it would be great to move forward for review.

There is also still one pending change that I didn't finish yet; the optional UTF input validation (never validate "string" inputs, but do validate "ubyte[]" inputs).

Oh and there is the open issue of how to allocate in case of non-array inputs. Initially I wanted to wait with this until we have an allocators module, but Walter would like to have a way to do manual memory management in the initial version. However, the ideal design is still unclear to me - it would either simply resemble a general allocator interface, or could use something like a callback that returns an output range, which would probably be quite cumbersome to work with. Any ideas in this direction would be welcome.

Sönke

On 10/13/14, 4:48 AM, Sönke Ludwig wrote:
> Am 13.10.2014 13:37, schrieb Jacob Carlborg:
>> On 22/08/14 00:35, Sönke Ludwig wrote:
>>> Following up on the recent "std.jgrandson" thread [1], I've picked up
>>> the work (a lot earlier than anticipated) and finished a first version
>>> of a loose blend of said std.jgrandson, vibe.data.json and some changes
>>> that I had planned for vibe.data.json for a while. I'm quite pleased by
>>> the results so far, although without a serialization framework it still
>>> misses a very important building block.
>>>
>>> Code: https://github.com/s-ludwig/std_data_json
>>
>> JSONToken.Kind and JSONParserNode.Kind could be "ubyte" to save space.
>>
>
> But it won't save space in practice, at least on x86, due to alignment,
> and depending on what the compiler assumes, the access can also be
> slower that way.

Correct. -- Andrei

Forums