August 22, 2014
On Friday, 22 August 2014 at 20:02:41 UTC, Sönke Ludwig wrote:
> On 22.08.2014 21:48, Christian Manning wrote:
>> On Friday, 22 August 2014 at 17:45:03 UTC, Sönke Ludwig wrote:
>>> On 22.08.2014 19:27, "Marc Schütz" <schuetzm@gmx.net> wrote:
>>>> On Friday, 22 August 2014 at 16:56:26 UTC, Sönke Ludwig wrote:
>>>>> On 22.08.2014 18:31, Christian Manning wrote:
>>>>>> It would be nice to have integers treated separately to doubles. I
>>>>>> know
>>>>>> it makes the number parsing simpler to just treat everything as
>>>>>> double,
>>>>>> but still, it could be annoying when you expect an integer type.
>>>>>
>>>>> That's how I've done it for vibe.data.json, too. For the new
>>>>> implementation, I've just used the number parsing routine from
>>>>> Andrei's std.jgrandson module. Does anybody have reservations about
>>>>> representing integers as "long" instead?
>>>>
>>>> It should automatically fall back to double on overflow. Maybe even use
>>>> BigInt if applicable?
>>>
>>> I guess BigInt + exponent would be the only lossless way to represent
>>> any JSON number. That could then be converted to any desired smaller
>>> type as required.
>>>
>>> But checking for overflow during number parsing would definitely have
>>> an impact on parsing speed, as well as using a BigInt of course, so
>>> the question is how we want to set up the trade-off here (or if there is
>>> another way that is overhead-free).
>>
>> You could check for a decimal point and a 0 at the front (excluding a
>> possible - sign); either would indicate a double, making the reasonable
>> assumption that anything else will fit in a long.
>
> Yes, no decimal point + no exponent would work without overhead to detect integers, but that wouldn't solve the proposed automatic long->double overflow, which is what I meant. My current idea is to default to double and optionally support any of long, BigInt and "Decimal" (BigInt+exponent), where integer overflow only works for long->BigInt.

It might be the right choice anyway (seeing as json/js do overflow to double), but fwiw it's still atrocious.

import std.range, std.algorithm;

double a = long.max;  // long.max rounds up to 2.0^^63 when converted to double
assert(iota(1, 1000000).map!(d => (a+d)-a).until!"a != 0".walkLength == 1024);

Yuk.

Floating point numbers and integers are so completely different in behaviour that it's just dishonest to transparently switch between the two. This is especially the case for overflow from long -> double, where by definition you're 10 bits past being able to reliably represent the integer in question.
August 22, 2014
> Yes, no decimal point + no exponent would work without overhead to detect integers, but that wouldn't solve the proposed automatic long->double overflow, which is what I meant. My current idea is to default to double and optionally support any of long, BigInt and "Decimal" (BigInt+exponent), where integer overflow only works for long->BigInt.

Ah I see.

I have to say, if you are going to treat integers and floating point numbers differently, then you should store them differently. long should be used to store integers, double for floating point numbers. A 64-bit signed integer (long) is a totally reasonable limitation for integers, but even that would lose precision when stored as a double, as you are proposing (if I'm understanding right). I don't think BigInt needs to be brought into this at all, really.

In the case of integers encountered by the parser which are too large or too small to fit in a long, give an error IMO. Such integers should be (and are, by other libs IIRC) serialised in a form like "1.234e-123" to force double parsing, perhaps losing precision at that stage rather than invisibly inside the library. The size of JSON numbers is implementation-defined, and the whole thing shouldn't be degraded in both performance and usability to cover JSON serialisers that go beyond common native number types.
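
For illustration, a minimal sketch of that policy (the function name and signature are made up here, not part of any proposed API): classify the token by the presence of '.' or an exponent, and reject integers that don't fit in a long instead of silently widening them to double.

import std.conv : to, ConvOverflowException;
import std.string : indexOfAny;

// Returns true and fills `i` when the token is an integer that fits in a long;
// returns false and fills `d` for floating point tokens. Integers outside the
// long range are reported as an error instead of being converted to double.
bool parseJSONNumber(string tok, out long i, out double d)
{
    if (tok.indexOfAny(".eE") < 0)
    {
        try { i = tok.to!long; return true; }
        catch (ConvOverflowException)
            throw new Exception("integer does not fit in a long: " ~ tok);
    }
    d = tok.to!double;
    return false;
}

unittest
{
    long i; double d;
    assert(parseJSONNumber("9223372036854775807", i, d) && i == long.max);
    assert(!parseJSONNumber("1.5", i, d) && d == 1.5);
}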

Of course, you are free to do whatever you like :)
August 22, 2014
On 22.08.2014 20:08, Walter Bright wrote:
> On 8/21/2014 3:35 PM, Sönke Ludwig wrote:
>> Destroy away! ;)
>
> Thanks for taking this on! This is valuable work. On to destruction!
>
> I'm looking at:
>
> http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/lexJSON.html
>
> I anticipate this will be used a LOT and in very high speed demanding
> applications. With that in mind,
>
>
> 1. There's no mention of what will happen if it is passed malformed JSON
> strings. I presume an exception is thrown. Exceptions are both slow and
> consume GC memory. I suggest an alternative would be to emit an "Error"
> token instead; this would be much like how the UTF decoding algorithms
> emit a "replacement char" for invalid UTF sequences.

The latest version now features a LexOptions.noThrow option which causes an error token to be emitted instead. After popping the error token, the range is always empty.

>
> 2. The escape sequenced strings presumably consume GC memory. This will
> be a problem for high performance code. I suggest either leaving them
> undecoded in the token stream, and letting higher level code decide what
> to do about them, or provide a hook that the user can override with his
> own allocation scheme.

The problem is that which approach is more efficient really depends on the use case and on the type of input stream (storing the escaped version of a string might require *two* allocations if the input range cannot be sliced and the decoded string is then requested by the parser). My current idea is therefore to simply make this configurable, too.
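
To illustrate the sliceable case (this is not the module's API, just a sketch of the fast path): a string value that contains no escape sequences can be returned as a slice of the input, so only strings that actually contain escapes pay for an allocation.

import std.array : appender;
import std.string : indexOf;

// Decode the body of a JSON string (surrounding quotes already stripped).
// Input without a backslash is returned as-is, i.e. a slice of the original
// buffer. \uXXXX escapes and error handling are omitted to keep this short.
string unescapeJSONString(string s)
{
    if (s.indexOf('\\') < 0)
        return s;                          // zero-copy fast path

    auto result = appender!string();
    for (size_t i = 0; i < s.length; i++)
    {
        if (s[i] != '\\') { result.put(s[i]); continue; }
        switch (s[++i])
        {
            case '"':  result.put('"');  break;
            case '\\': result.put('\\'); break;
            case '/':  result.put('/');  break;
            case 'n':  result.put('\n'); break;
            case 't':  result.put('\t'); break;
            default:   result.put(s[i]); break;
        }
    }
    return result.data;
}

unittest
{
    string plain = "no escapes here";
    assert(unescapeJSONString(plain) is plain);   // same slice, no allocation
    assert(unescapeJSONString(`a\tb`) == "a\tb");
}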

Enabling the use of custom allocators should be easily possible as an add-on functionality later on. My suggestion, at least, would be to hold off on that until we have a finished std.allocator module.
August 22, 2014
On 22.08.2014 18:13, Sönke Ludwig wrote:
> On 22.08.2014 17:47, Jacob Carlborg wrote:
>>
>> * Opening braces should be put on their own line to follow Phobos style
>> guides
>
> Will do.
>
>> * I'm wondering about the assert in lexer.d, line 160. What happens if
>> two invalid tokens after each other occur?
>
> There are actually no invalid tokens at all; the "invalid" enum value is
> only used to denote that no token is currently stored in _front. If
> readToken() doesn't throw, there will always be a valid token.

Renamed from "invalid" to "none" now to avoid confusion ->

>
>> * I think we have talked about this before, when reviewing D lexers. I'm
>> thinking of how to handle invalid data. Is it the best solution to throw
>> an exception? Would it be possible to return an error token and have the
>> client decide what to do about? Shouldn't it be possible to build a JSON
>> validator on this?
>
> That would indeed be a possibility, it's how I used to handle it in my
> private version of std.lexer, too. It could also be made a compile time
> option.

and an additional "error" kind has been added, which implements the above. Enabled using LexOptions.noThrow.

>> * The lexer seems to always convert JSON types to their native D types,
>> is that wise to do? That's unnecessary if you're implementing syntax
>> highlighting
>
> It's basically the same trade-off as for unescaping string literals. For
> "string" inputs, it would be more efficient to just store a slice, but
> for generic input ranges it avoids the otherwise needed allocation. The
> proposed flag could make an improvement here, too.
>

August 23, 2014
On 8/22/2014 2:27 PM, Sönke Ludwig wrote:
> On 22.08.2014 20:08, Walter Bright wrote:
>> 1. There's no mention of what will happen if it is passed malformed JSON
>> strings. I presume an exception is thrown. Exceptions are both slow and
>> consume GC memory. I suggest an alternative would be to emit an "Error"
>> token instead; this would be much like how the UTF decoding algorithms
>> emit a "replacement char" for invalid UTF sequences.
> The latest version now features a LexOptions.noThrow option which causes an
> error token to be emitted instead. After popping the error token, the range is
> always empty.

Having a nothrow option may prevent the functions from being attributed as "nothrow".

But in any case, to worship at the Altar Of Composability, the error token could always be emitted, with a separate algorithm provided that passes through all non-error tokens and throws if it sees an error token.
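
A generic sketch of such an adapter (the error test is left as a predicate, so nothing here is tied to the actual token type): the lexer always emits tokens, and this wrapper reintroduces the exception only for callers who ask for it.

import std.algorithm.iteration : map;

// Forward every token unchanged, but throw as soon as the underlying range
// produces one that the predicate classifies as an error token.
auto throwOnError(alias isError, R)(R tokens)
{
    return tokens.map!((t) {
        if (isError(t))
            throw new Exception("error token in JSON input");
        return t;
    });
}

unittest
{
    import std.algorithm.comparison : equal;
    // toy usage: negative values stand in for error tokens
    assert([1, 2, 3].throwOnError!(t => t < 0).equal([1, 2, 3]));
}

The lexer itself then never throws; only this adapter does, and only on the error path.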


>> 2. The escape sequenced strings presumably consume GC memory. This will
>> be a problem for high performance code. I suggest either leaving them
>> undecoded in the token stream, and letting higher level code decide what
>> to do about them, or provide a hook that the user can override with his
>> own allocation scheme.
>
> The problem is that it really depends on the use case and on the type of input
> stream which approach is more efficient (storing the escaped version of a string
> might require *two* allocations if the input range cannot be sliced and if the
> decoded string is then requested by the parser). My current idea therefore is to
> simply make this configurable, too.
>
> Enabling the use of custom allocators should be easily possible as an add-on
> functionality later on. At least my suggestion would be to wait with this until
> we have a finished std.allocator module.

I'm worried that std.allocator is stalled and we'll be digging ourselves deeper into needing to revise things later to remove GC usage. I'd really like to find a way to abstract the allocation away from the algorithm.
August 23, 2014
First, thank you for your work. std.json is horrible to use right now, so a replacement is more than welcome.

I haven't played with your code yet, so I may be asking for something that already exists, but have you had a look at jsvar by Adam?

You can find it here: https://github.com/adamdruppe/arsd/blob/master/jsvar.d

One of the big pains when one works with a format like JSON is that you go from the untyped world to the typed world (the same problem occurs with XML and various config formats as well).

I think Adam got the right balance in jsvar. It behaves closely enough to JavaScript that it is convenient to manipulate, while removing the most dangerous behaviour (concatenation is still done using ~ and not + as in JS).

If that is not already the case, I'd love for the elements I get out of my JSON to behave that way. If you can do that, you have a user.
August 23, 2014
On 8/22/2014 6:05 PM, Walter Bright wrote:
>> The problem is that it really depends on the use case and on the type of input
>> stream which approach is more efficient (storing the escaped version of a string
>> might require *two* allocations if the input range cannot be sliced and if the
>> decoded string is then requested by the parser). My current idea therefore is to
>> simply make this configurable, too.
>>
>> Enabling the use of custom allocators should be easily possible as an add-on
>> functionality later on. At least my suggestion would be to wait with this until
>> we have a finished std.allocator module.

Another possibility is to have the user pass in a resizeable buffer which then will be used to store the strings in as necessary.

One example is std.internal.scopebuffer. The nice thing about that is the user can use the stack for the storage, which works out to be very, very fast.
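
Roughly what that looks like (the std.internal.scopebuffer member names below are recalled from memory and should be treated as approximate): the caller owns the storage, small outputs never leave the stack, and anything larger spills to the C heap until free() is called.

import std.internal.scopebuffer;

// Hypothetical decode helper: the caller provides the output buffer, so this
// function never allocates from the GC. The actual unescaping is elided.
void decodeInto(ref ScopeBuffer!char buf, string raw)
{
    foreach (c; raw)
        buf.put(c);
}

void example()
{
    char[128] stack = void;
    auto buf = ScopeBuffer!char(stack);  // storage lives in this stack frame
    scope(exit) buf.free();              // releases any heap spill-over
    decodeInto(buf, "hello world");
    assert(buf[] == "hello world");      // slice of the result, valid until free()
}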
August 23, 2014
On Sat, 23 Aug 2014 02:23:25 +0000
deadalnix via Digitalmars-d <digitalmars-d@puremagic.com> wrote:

> I haven't played with your code yet, so I may be asking for somethign that already exists, but did you had a look to jsvar by Adam ?

jsvar uses opDispatch, and Sönke wrote:
>  - No opDispatch() for JSONValue - this has shown to do more harm than
>    good in vibe.data.json


August 23, 2014
On Saturday, 23 August 2014 at 02:30:23 UTC, Walter Bright wrote:
> Another possibility is to have the user pass in a resizeable buffer which then will be used to store the strings in as necessary.
>
> One example is std.internal.scopebuffer. The nice thing about that is the user can use the stack for the storage, which works out to be very, very fast.

Does this mean that D is getting resizable stack allocations in lower stack frames? That has a lot of implications for code gen.
August 23, 2014
On 8/22/2014 9:01 PM, Ola Fosheim Gr wrote:
> On Saturday, 23 August 2014 at 02:30:23 UTC, Walter Bright wrote:
>> One example is std.internal.scopebuffer. The nice thing about that is the user
>> can use the stack for the storage, which works out to be very, very fast.
>
> Does this mean that D is getting resizable stack allocations in lower stack
> frames? That has a lot of implications for code gen.

scopebuffer does not require resizeable stack allocations.