August 16, 2015
On 8/16/2015 3:39 AM, Dmitry Olshansky wrote:
> About x2 faster than decode + check-if-alphabetic on my stuff:
>
> https://github.com/DmitryOlshansky/gsoc-bench-2012
>
> I haven't updated it in a while. There are nice bargraphs for decoding versions
> by David comparing DMD vs LDC vs GDC:
>
> Page 15 at http://dconf.org/2013/talks/nadlinger.pdf

Thank you.
August 16, 2015
Am 16.08.2015 um 02:50 schrieb Walter Bright:
> On 8/15/2015 3:18 AM, Sönke Ludwig wrote:
>>> I don't know what 'isStringInputRange' is. Whatever it is, it should be
>>> a 'range of char'.
>>
>> I'll rename it to isCharInputRange. We don't have something like that
>> in Phobos,
>> right?
>
> That's right, there isn't one. But I use:
>
>      if (isInputRange!R && is(Unqual!(ElementEncodingType!R) == char))
>
> I'm not a fan of more names for trivia, the deluge of names has its own
> costs.

Good, I'll use `if (isInputRange!R && (isSomeChar!(ElementEncodingType!R) || isIntegral!(ElementEncodingType!R)))`. It's just used in a number of places, and it's quite a bit more verbose (twice as long). I'd also guess that a large number of algorithms in Phobos accept char ranges, so that may actually warrant a name in this case.
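For illustration, a named shortcut could look like the sketch below; the name `isCharOrIntegralInputRange` and its placement are only a suggestion, nothing with that name exists in Phobos:

import std.range.primitives : ElementEncodingType, isInputRange;
import std.traits : isIntegral, isSomeChar;

// a possible named constraint (suggestion only, not part of Phobos)
enum isCharOrIntegralInputRange(R) =
    isInputRange!R &&
    (isSomeChar!(ElementEncodingType!R) || isIntegral!(ElementEncodingType!R));

static assert( isCharOrIntegralInputRange!(char[]));
static assert( isCharOrIntegralInputRange!(ubyte[]));
static assert(!isCharOrIntegralInputRange!(string[]));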

>>> There is no reason to validate UTF-8 input. The only place where
>>> non-ASCII code units can even legally appear is inside strings, and
>>> there they can just be copied verbatim while looking for the end of the
>>> string.
>> The idea is to assume that any char based input is already valid UTF
>> (as D
>> defines it), while integer based input comes from an unverified
>> source, so that
>> it still has to be validated before being cast/copied into a 'string'.
>> I think
>> this is a sensible approach, both semantically and performance-wise.
>
> The json parser will work fine without doing any validation at all. I've
> been implementing string handling code in Phobos with the idea of doing
> validation only if the algorithm requires it, and only for those parts
> that require it.

Yes, and it won't do that if a char range is passed in. If the integral range path gets removed, there are basically two possibilities left: perform the validation up front (slower), or risk UTF exceptions in unrelated parts of the code base. I don't see why we shouldn't take the opportunity for a full and fast validation here. But I'll relay this to Andrei, it was his idea originally.

> There are many validation algorithms in Phobos one can tack on - having
> two implementations of every algorithm, one with an embedded reinvented
> validation and one without - is too much.

There is nothing reinvented here. It simply validates all non-string parts of a JSON document implicitly and uses validate() for the parts of JSON strings that can contain Unicode characters.

> The general idea with algorithms is that they do not combine things, but
> they enable composition.

It's just that there is no way to achieve the same performance using composition in this case.

>>> Why do both? Always return an input range. If the user wants a string,
>>> he can pipe the input range to a string generator, such as .array
>> Convenience for one.
>
> Back to the previous point, that means that every algorithm in Phobos
> should have two versions, one that returns a range and the other a
> string? All these variations will result in a combinatorial explosion.

This may be a factor of two, but not a combinatorial explosion.

> The other problem, of course, is that returning a string means the
> algorithm has to decide how to allocate that string. As much as
> possible, algorithms should not be making allocation decisions.

Granted, the fact that format() and to!() support input ranges (I didn't notice that until now) makes the issue less important. But without those, it would basically mean that almost all places that generate JSON strings would have to import std.array and append .array. Nothing particularly bad if viewed in isolation, but it makes the language appear a lot less clean/more verbose if it occurs often. It's also a stumbling block for language newcomers.
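To make the point concrete, here is roughly what the two call-site styles look like; `jsonRange` is only a stand-in for a hypothetical range-returning writer, not this module's actual API:

import std.array : array;
import std.conv : to;
import std.utf : byChar;

// stand-in: an input range of char, as a range-based writer would return
auto jsonRange() { return `{"x":42}`.byChar; }

void main()
{
    auto s1 = jsonRange.array;      // char[]; needs import std.array at every call site
    auto s2 = jsonRange.to!string;  // works because to!() accepts input ranges
}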

>> The lack of number to input range conversion functions is
>> another concern. I'm not really keen to implement an input range style
>> floating-point to string conversion routine just for this module.
>
> Not sure what you mean. Phobos needs such routines anyway, and you still
> have to do something about floating point.

There are output-range- and allocation-based float-to-string conversions available, but no input-range-based one. But well, using an internal buffer together with formattedWrite would probably be a viable workaround...
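A minimal sketch of that workaround; FixedSink and formatNumber are ad-hoc illustrative helpers, not Phobos API:

import std.format : formattedWrite;

struct FixedSink  // ad-hoc helper, not a Phobos type
{
    char[] buf;
    size_t pos;
    void put(char c) { buf[pos++] = c; }  // minimal output-range interface
}

char[] formatNumber(char[] storage, double value)
{
    auto sink = FixedSink(storage);
    sink.formattedWrite("%.17g", value);
    return storage[0 .. sink.pos];  // slice that can then be served as a char range
}

void main()
{
    char[32] tmp;
    auto chunk = formatNumber(tmp[], 3.25);  // yields "3.25"
}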

>> Finally, I'm a little worried about performance. The output range
>> based approach
>> can keep a lot of state implicitly using the program counter register.
>> But an
>> input range would explicitly have to keep track of the current JSON
>> element, as
>> well as the current character/state within that element (and possibly
>> one level
>> deeper, for example for escape sequences). This means that it will
>> require
>> either multiple branches or indirection for each popFront().
>
> Often this is made up for by not needing to allocate storage. Also, that
> state is in the cached "hot zone" on top of the stack, which is much
> faster to access than a cold uninitialized array.

Branch misprediction alone will most probably be problematic. But I think this can be made fast enough anyway by making the input range partially eager and serving chunks of strings at a time. That way, the additional branching only has to happen once per chunk. I'll have a look.
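A very rough sketch of that chunked idea (purely illustrative, escape handling omitted): the range yields whole slices of the string body, so the per-element branching is paid once per chunk instead of once per character.

struct ChunkedStringRange  // illustrative only, not the proposed API
{
    const(char)[] rest;   // unconsumed part of a JSON string body
    const(char)[] chunk;  // current run of plain characters

    this(const(char)[] input) { rest = input; advance(); }

    bool empty() const { return chunk.length == 0; }
    const(char)[] front() const { return chunk; }
    void popFront() { advance(); }

    private void advance()
    {
        // hand out everything up to the next escape or closing quote verbatim;
        // real code would then decode the escape and continue
        size_t i;
        while (i < rest.length && rest[i] != '\\' && rest[i] != '"') i++;
        chunk = rest[0 .. i];
        rest = rest[i .. $];
    }
}

unittest
{
    auto r = ChunkedStringRange(`hello \n world`);
    assert(r.front == "hello ");  // one 6-character chunk instead of 6 popFronts
}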

> I share your concern with performance, and I had very good results with
> Warp by keeping all the state on the stack in this manner.
>

August 16, 2015
On 2015-08-16 14:34, Sönke Ludwig wrote:

> Good, I'll use `if (isInputRange!R &&
> (isSomeChar!(ElementEncodingType!R) ||
> isIntegral!(ElementEncodingType!R)))`. It's just used in a number of places
> and quite a bit more verbose (twice as long) and I guess a large number
> of algorithms in Phobos accept char ranges, so that may actually warrant
> a name in this case.

I agree. Signatures like this are what makes std.algorithm look more complicated than it is.

-- 
/Jacob Carlborg
August 16, 2015
On 8/16/2015 5:34 AM, Sönke Ludwig wrote:
> Am 16.08.2015 um 02:50 schrieb Walter Bright:
>>      if (isInputRange!R && is(Unqual!(ElementEncodingType!R) == char))
>>
>> I'm not a fan of more names for trivia, the deluge of names has its own
>> costs.
>
> Good, I'll use `if (isInputRange!R && (isSomeChar!(ElementEncodingType!R) ||
> isIntegral!(ElementEncodingType!R)))`. It's just used in a number of places and
> quite a bit more verbose (twice as long) and I guess a large number of
> algorithms in Phobos accept char ranges, so that may actually warrant a name in
> this case.

Except that there is no reason to support wchar, dchar, int, ubyte, or anything other than char. The idea is not to support something just because you can, but there should be an identifiable, real use case for it first. Has anyone ever seen JSON data as ulongs? I haven't either.


>> The json parser will work fine without doing any validation at all. I've
>> been implementing string handling code in Phobos with the idea of doing
>> validation only if the algorithm requires it, and only for those parts
>> that require it.
>
> Yes, and it won't do that if a char range is passed in. If the integral range
> path gets removed there are basically two possibilities left, perform the
> validation up-front (slower), or risk UTF exceptions in unrelated parts of the
> code base. I don't see why we shouldn't take the opportunity for a full and fast
> validation here. But I'll relay this to Andrei, it was his idea originally.

That argument could be used to justify validation in every single algorithm that deals with strings.


>>>> Why do both? Always return an input range. If the user wants a string,
>>>> he can pipe the input range to a string generator, such as .array
>>> Convenience for one.
>>
>> Back to the previous point, that means that every algorithm in Phobos
>> should have two versions, one that returns a range and the other a
>> string? All these variations will result in a combinatorial explosion.
>
> This may be a factor of two, but not a combinatorial explosion.

We're already up to validate or not, to string or not, i.e. 4 combinations.


>> The other problem, of course, is that returning a string means the
>> algorithm has to decide how to allocate that string. As much as
>> possible, algorithms should not be making allocation decisions.
>
> Granted, the fact that format() and to!() support input ranges (I didn't notice
> that until now) makes the issue less important. But without those, it would
> basically mean that almost all places that generate JSON strings would have to
> import std.array and append .array. Nothing particularly bad if viewed in
> isolation, but makes the language appear a lot less clean/more verbose if it
> occurs often. It's also a stepping stone for language newcomers.

This has been argued before, and the problem is it applies to EVERY algorithm in Phobos, and winds up with a doubling of the number of functions to deal with it. I do not view this as clean.

D is going to be built around ranges as a fundamental way of coding. Users will need to learn something about them. Appending .array is not a big hill to climb.


> There are output range and allocation based float->string conversions available,
> but no input range based one. But well, using an internal buffer together with
> formattedWrite would probably be a viable workaround...

I plan to fix that, so using a workaround in the meantime is appropriate.

August 17, 2015
On 7/28/15 10:07 AM, Atila Neves wrote:
> Start of the two week process, folks.
>
> Code: https://github.com/s-ludwig/std_data_json
> Docs: http://s-ludwig.github.io/std_data_json/
>
> Atila

I'll submit a review in short order, but thought this might be of use in performance comparisons: https://www.reddit.com/r/programming/comments/3hbt4w/using_json_in_a_low_latency_environment/ -- Andrei
August 17, 2015
On 8/14/15 7:40 AM, Andrei Alexandrescu wrote:
> On 8/12/15 5:43 AM, Sönke Ludwig wrote:
>>> Anyway, I've just started to work on a generic variant of an enum based
>>> algebraic type that exploits as much static type information as
>>> possible. If that works out (compiler bugs?), it would be a great thing
>>> to have in Phobos, so maybe it's worth to delay the JSON module for that
>>> if necessary.
>>>
>>
>> First proof of concept:
>> https://gist.github.com/s-ludwig/7a8a60150f510239f071#file-taggedalgebraic-d-L148
>>
>>
>>
>> It probably still has issues with const/immutable and ref in some
>> places, but the basics seem to work as expected.
>
> struct TaggedAlgebraic(U) if (is(U == union)) { ... }
>
> Interesting. I think it would be best to rename it to TaggedUnion
> (instantly recognizable; also TaggedAlgebraic is an oxymoron as there's
> no untagged algebraic type). A good place for it is straight in
> std.variant.
>
> What are the relative advantages of using an integral over a pointer to
> function? In other words, what's a side by side comparison of
> TaggedAlgebraic!U and Algebraic!(types inside U)?
>
> Thanks,
>
> Andrei

Ping on this. My working hypothesis:

- If there's a way to make a tag smaller than one word, e.g. by using various packing tricks, then the integral tag has an advantage over the pointer tag.

- If there's some ordering among types (e.g. all types below 16 have some property etc), then the integral tag again has an advantage over the pointer tag.

- Other than that, the pointer tag is superior to the integral tag at everything. Where it really wins is that there is one unique tag for each type, present or future, so the universe of representable types is the total set. The pointer may be used for dispatching but also as a simple integral tag, so the pointer tag is a superset of the integral tag (see the sketch below).

I've noticed many people are surprised by std.variant's use of a pointer instead of an integral for tagging. I'd like to either figure out whether there's an advantage to integral tags or, if not, settle a misconception for good.
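As a concrete illustration of the dispatch point above (all names here are placeholders, not std.variant's actual internals), a per-type handler address can serve both as a unique id and as the thing you call:

alias Handler = void function(void* payload);

// placeholder handlers
void handleInt(void* p)    { /* ... */ }
void handleDouble(void* p) { /* ... */ }

struct Tagged
{
    Handler tag;   // doubles as the type id and the dispatch target
    void*   payload;
}

void process(ref Tagged t)
{
    if (t.tag is &handleInt) { /* use it as a plain comparable id */ }
    t.tag(t.payload);          // or use it for dispatch directly
}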


Andrei
August 17, 2015
On 17-Aug-2015 21:12, Andrei Alexandrescu wrote:
> On 8/14/15 7:40 AM, Andrei Alexandrescu wrote:
>> On 8/12/15 5:43 AM, Sönke Ludwig wrote:
>>>> Anyway, I've just started to work on a generic variant of an enum based
>>>> algebraic type that exploits as much static type information as
>>>> possible. If that works out (compiler bugs?), it would be a great thing
>>>> to have in Phobos, so maybe it's worth to delay the JSON module for
>>>> that
>>>> if necessary.
>>>>
>>>
>>> First proof of concept:
>>> https://gist.github.com/s-ludwig/7a8a60150f510239f071#file-taggedalgebraic-d-L148
>>>
>>>
>>>
>>>
>>> It probably still has issues with const/immutable and ref in some
>>> places, but the basics seem to work as expected.
>>
>> struct TaggedAlgebraic(U) if (is(U == union)) { ... }
>>
>> Interesting. I think it would be best to rename it to TaggedUnion
>> (instantly recognizable; also TaggedAlgebraic is an oxymoron as there's
>> no untagged algebraic type). A good place for it is straight in
>> std.variant.
>>
>> What are the relative advantages of using an integral over a pointer to
>> function? In other words, what's a side by side comparison of
>> TaggedAlgebraic!U and Algebraic!(types inside U)?
>>
>> Thanks,
>>
>> Andrei
>
> Ping on this. My working hypothesis:
>
> - If there's a way to make a tag smaller than one word, e.g. by using
> various packing tricks, then the integral tag has an advantage over the
> pointer tag.
>
> - If there's some ordering among types (e.g. all types below 16 have
> some property etc), then the integral tag again has an advantage over
> the pointer tag.
>
> - Other than that the pointer tag is superior to the integral tag at
> everything. Where it really wins is there is one unique tag for each
> type, present or future, so the universe of types representable is the
> total set. The pointer may be used for dispatching but also as a simple
> integral tag, so the pointer tag is a superset of the integral tag.
>
> I've noticed many people are surprised by std.variant's use of a pointer
> instead of an integral for tagging. I'd like to either figure whether
> there's an advantage to integral tags, or if not settle for good a
> misconception.
>

Actually one can combine the two:
- use integer type tag for everything built-in
- use pointer tag for what is not

In code:
union NiftyTaggedUnion
{
    // An int tag is marked by setting the LSB of `head` to 1; a pointer tag
    // (at least 4-byte aligned) always has the LSB == 0.
    // This assumes little-endian, though big-endian is doable too.
    @property bool isIntTag() { return (common.head & 1) != 0; }

    IntTagged   intTagged;
    PtrTagged   ptrTagged;
    CommonUnion common;
}

// placeholder: size of the largest built-in payload
enum size_of_max_builtin = double.sizeof;

struct CommonUnion
{
    ubyte[size_of_max_builtin] store;
    // this is where the type tag starts - pointer or int
    uint head;
}

struct IntTagged // int-tagged; a struct (not a union) so `tag` lines up with `head`
{
    union // builtins go here
    {
        int    ival;
        double dval;
        // ...
    }
    uint tag;
}

struct PtrTagged // pointer-to-typeinfo scheme; `pinfo` lines up with `head`
{
    ubyte[size_of_max_builtin] payload;
    TypeInfo pinfo; // class reference, i.e. a pointer under the hood
}

It's going to be challenging, but I think I can pull off even NaN-boxing with this scheme.
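A hypothetical usage of the sketch above (the tag value 3 is just an arbitrary odd number marking an int tag):

unittest
{
    // hypothetical usage of the NiftyTaggedUnion sketch above
    NiftyTaggedUnion u;

    u.intTagged.dval = 3.14;          // store a built-in payload
    u.intTagged.tag  = (1 << 1) | 1;  // LSB == 1 marks an int tag
    assert(u.isIntTag);

    u.ptrTagged.pinfo = typeid(string); // word-aligned reference, LSB == 0
    assert(!u.isIntTag);
}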

-- 
Dmitry Olshansky
August 17, 2015
On Monday, 17 August 2015 at 18:12:02 UTC, Andrei Alexandrescu wrote:
> On 8/14/15 7:40 AM, Andrei Alexandrescu wrote:
>> On 8/12/15 5:43 AM, Sönke Ludwig wrote:
>>>> Anyway, I've just started to work on a generic variant of an enum based
>>>> algebraic type that exploits as much static type information as
>>>> possible. If that works out (compiler bugs?), it would be a great thing
>>>> to have in Phobos, so maybe it's worth to delay the JSON module for that
>>>> if necessary.
>>>>
>>>
>>> First proof of concept:
>>> https://gist.github.com/s-ludwig/7a8a60150f510239f071#file-taggedalgebraic-d-L148
>>>
>>>
>>>
>>> It probably still has issues with const/immutable and ref in some
>>> places, but the basics seem to work as expected.
>>
>> struct TaggedAlgebraic(U) if (is(U == union)) { ... }
>>
>> Interesting. I think it would be best to rename it to TaggedUnion
>> (instantly recognizable; also TaggedAlgebraic is an oxymoron as there's
>> no untagged algebraic type). A good place for it is straight in
>> std.variant.
>>
>> What are the relative advantages of using an integral over a pointer to
>> function? In other words, what's a side by side comparison of
>> TaggedAlgebraic!U and Algebraic!(types inside U)?
>>
>> Thanks,
>>
>> Andrei
>
> Ping on this. My working hypothesis:
>
> - If there's a way to make a tag smaller than one word, e.g. by using various packing tricks, then the integral tag has an advantage over the pointer tag.
>
> - If there's some ordering among types (e.g. all types below 16 have some property etc), then the integral tag again has an advantage over the pointer tag.
>
> - Other than that the pointer tag is superior to the integral tag at everything. Where it really wins is there is one unique tag for each type, present or future, so the universe of types representable is the total set. The pointer may be used for dispatching but also as a simple integral tag, so the pointer tag is a superset of the integral tag.
>
> I've noticed many people are surprised by std.variant's use of a pointer instead of an integral for tagging. I'd like to either figure whether there's an advantage to integral tags, or if not settle for good a misconception.
>
>
> Andrei

From the compiler's perspective, the tag is much nicer. The compiler can use a jump table, for instance.

It is not a good solution for Variant (which needs to be able to represent arbitrary types), but if the number of types is finite, a tag is almost always a win.

In the case of JSON, using a tag and a packing trick, it is possible to pack everything into a struct the size of two pointers without much trouble.
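Roughly the kind of layout meant here; the field and enum names are made up for illustration and are not taken from the proposed std.data.json implementation:

enum Kind : ubyte { null_, boolean, number, string_, array, object }

struct PackedJSONValue  // illustrative layout; 16 bytes on 64-bit, the size of two pointers
{
    union Payload
    {
        bool             boolean;
        double           number;
        immutable(char)* strPtr;    // string data; its length lives in `meta`
        void*            aggregate; // array/object storage
    }
    Payload payload;  // word 1
    size_t  meta;     // word 2: Kind in the low bits, string length or
                      // element count packed into the remaining bits
}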
August 17, 2015
Am 17.08.2015 um 20:12 schrieb Andrei Alexandrescu:
> On 8/14/15 7:40 AM, Andrei Alexandrescu wrote:
>>
>> struct TaggedAlgebraic(U) if (is(U == union)) { ... }
>>
>> Interesting. I think it would be best to rename it to TaggedUnion
>> (instantly recognizable; also TaggedAlgebraic is an oxymoron as there's
>> no untagged algebraic type). A good place for it is straight in
>> std.variant.
>>
>> What are the relative advantages of using an integral over a pointer to
>> function? In other words, what's a side by side comparison of
>> TaggedAlgebraic!U and Algebraic!(types inside U)?
>>
>> Thanks,
>>
>> Andrei
>
> Ping on this. My working hypothesis:
>
> - If there's a way to make a tag smaller than one word, e.g. by using
> various packing tricks, then the integral tag has an advantage over the
> pointer tag.
>
> - If there's some ordering among types (e.g. all types below 16 have
> some property etc), then the integral tag again has an advantage over
> the pointer tag.
>
> - Other than that the pointer tag is superior to the integral tag at
> everything. Where it really wins is there is one unique tag for each
> type, present or future, so the universe of types representable is the
> total set. The pointer may be used for dispatching but also as a simple
> integral tag, so the pointer tag is a superset of the integral tag.
>
> I've noticed many people are surprised by std.variant's use of a pointer
> instead of an integral for tagging. I'd like to either figure whether
> there's an advantage to integral tags, or if not settle for good a
> misconception.
>
>
> Andrei

(reposting to NG, accidentally replied by e-mail)

Some more points come to mind:

- The enum is useful to be able to identify the types outside of the D code itself. For example when serializing the data to disk, or when communicating with C code.

- It enables the use of pattern matching (final switch), which is often very convenient, faster, and safer than an if-else cascade (see the sketch at the end of this post).

- A hypothesis is that it is faster, because there is no function call indirection involved.

- It naturally enables fully statically typed operator forwarding as far as possible (have a look at the examples of the current version). A pointer based version could do this, too, but only by jumping through hoops.

- The same type can be used multiple times with a different enum name. This can alternatively be solved using a Typedef!T, but I had several occasions where that proved useful.

They both have their place, but IMO where the pointer approach really shines is for unbounded Variant types.
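For reference, the final-switch point boils down to something like this (Kind and Value are illustrative types, not the actual TaggedAlgebraic API):

// illustrative types, not TaggedAlgebraic's API
enum Kind { null_, number, text }

struct Value
{
    Kind kind;
    union { double number; string text; }
}

double asNumber(Value v)
{
    final switch (v.kind)  // the compiler enforces that every Kind is handled
    {
        case Kind.null_:  return 0;
        case Kind.number: return v.number;
        case Kind.text:   return v.text.length;  // arbitrary example rule
    }
}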
August 17, 2015
Am Mon, 17 Aug 2015 20:56:18 +0200
schrieb Sönke Ludwig <sludwig@outerproduct.org>:

> Am 17.08.2015 um 20:12 schrieb Andrei Alexandrescu:
> > On 8/14/15 7:40 AM, Andrei Alexandrescu wrote:
> >>
> >> struct TaggedAlgebraic(U) if (is(U == union)) { ... }
> >>
> >> Interesting. I think it would be best to rename it to TaggedUnion (instantly recognizable; also TaggedAlgebraic is an oxymoron as there's no untagged algebraic type). A good place for it is straight in std.variant.
> >>
> >> What are the relative advantages of using an integral over a pointer to function? In other words, what's a side by side comparison of TaggedAlgebraic!U and Algebraic!(types inside U)?
> >>
> >> Thanks,
> >>
> >> Andrei
> >
> > Ping on this. My working hypothesis:
> >
> > - If there's a way to make a tag smaller than one word, e.g. by using various packing tricks, then the integral tag has an advantage over the pointer tag.
> >
> > - If there's some ordering among types (e.g. all types below 16 have some property etc), then the integral tag again has an advantage over the pointer tag.
> >
> > - Other than that the pointer tag is superior to the integral tag at everything. Where it really wins is there is one unique tag for each type, present or future, so the universe of types representable is the total set. The pointer may be used for dispatching but also as a simple integral tag, so the pointer tag is a superset of the integral tag.
> >
> > I've noticed many people are surprised by std.variant's use of a pointer instead of an integral for tagging. I'd like to either figure whether there's an advantage to integral tags, or if not settle for good a misconception.
> >
> >
> > Andrei
> 
> (reposting to NG, accidentally replied by e-mail)
> 
> Some more points come to mind:
> 
> - The enum is useful to be able to identify the types outside of the D code itself. For example when serializing the data to disk, or when communicating with C code.
> 
> - It enables the use of pattern matching (final switch), which is often very convenient, faster, and safer than an if-else cascade.
> 
> - A hypothesis is that it is faster, because there is no function call indirection involved.
> 
> - It naturally enables fully statically typed operator forwarding as far as possible (have a look at the examples of the current version). A pointer based version could do this, too, but only by jumping through hoops.
> 
> - The same type can be used multiple times with a different enum name. This can alternatively be solved using a Typedef!T, but I had several occasions where that proved useful.
> 
> They both have their place, but IMO where the pointer approach really shines is for unbounded Variant types.


I think Andrei's point is that a pointer tag can do most things an integral tag could, as you don't have to dereference the pointer:

void* tag;
if (tag == &someFunc!A)

So the only benefit is that the compiler knows that the _enum_ (not
simply an integral) tag is bounded. So we gain:
* easier debugging (readable type tag)
* potentially better codegen (jump tables fit perfectly: ordered values,
  0-x, no gaps)
* final switch

In some cases enum tags might also be smaller than a pointer.
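Putting the two styles side by side (someFunc and the tag values are placeholders, not std.variant's actual internals):

import std.stdio : writeln;

// placeholder handler
void someFunc(T)() { writeln("handling ", T.stringof); }

// pointer tag: compare the stored address against instantiated handlers
void dispatchPtr(void* tag)
{
    if (tag is cast(void*) &someFunc!int)         someFunc!int();
    else if (tag is cast(void*) &someFunc!double) someFunc!double();
}

// enum tag: bounded, ordered values, so the compiler can emit a jump table,
// and final switch checks exhaustiveness
enum Tag { intVal, doubleVal }

void dispatchEnum(Tag tag)
{
    final switch (tag)
    {
        case Tag.intVal:    someFunc!int();    break;
        case Tag.doubleVal: someFunc!double(); break;
    }
}

void main()
{
    dispatchPtr(cast(void*) &someFunc!double);
    dispatchEnum(Tag.intVal);
}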