August 25, 2014
On 08/25/2014 09:35 PM, Walter Bright wrote:
> On 8/23/2014 6:32 PM, Brad Roberts via Digitalmars-d wrote:
>>> I'm not convinced that using an adapter algorithm won't be just as fast.
>> Consider your own talks on optimizing the existing dmd lexer. In those
>> talks you've talked about the evils of additional processing on every
>> byte. That's what you're talking about here. While it's possible that
>> the inliner and other optimizer steps might be able to integrate the
>> two phases and remove some overhead, I'll believe it when I see the
>> resulting assembly code.
> 
> On the other hand, deadalnix demonstrated that the ldc optimizer was able to remove the extra code.
> 
> I have a reasonable faith that optimization can be improved where necessary to cover this.

I just happened to write a very small script yesterday and tested it with the three compilers (using dub --build=release).

dmd: 2.8 MB
gdc: 3.3 MB
ldc: 0.5 MB

So ldc can remove quite a substantial amount of code in some cases.
August 25, 2014
On Monday, 25 August 2014 at 19:38:05 UTC, Walter Bright wrote:
> The adaptation is to take arbitrary byte input in an unknown encoding and produce valid UTF.

I agree.

For a RESTful HTTP service, the encoding should be specified in the HTTP header and the input rejected if it isn't UTF compatible. For that use scenario you only want validation, not conversion. However, some validation is free: for example, if you only accept numbers, you could just turn off parsing of strings in the template…

If files are read from storage then you can reread the file if it fails validation on the first pass.

I wonder in which use scenario both of these conditions fail:

1. unspecified character set and UTF cannot be assumed for JSON
2. unable to re-parse
August 25, 2014
On Monday, 25 August 2014 at 19:42:03 UTC, Walter Bright wrote:
> Infinity. Mapping to max value would be a horrible bug.

Yes… but then you are reading an illegal value that JSON does not support…

>> For some reason D does not seem to support this aspect of IEEE754? I cannot find
>> ".nans" listed on the page http://dlang.org/property.html
>
> Because I tried supporting them in C++. It doesn't work for various reasons. Nobody else supports them, either.

I haven't tested, but Python is supposed to throw on NaNs.

GCC documents support for NaNs among its built-ins:
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

IBM Fortran supports it…

I think supporting signaling NaN is important for correctness.
August 25, 2014
On 25.08.2014 17:46, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
> On Monday, 25 August 2014 at 15:34:29 UTC, Sönke Ludwig wrote:
>> By default, floating-point special values are now output as 'null',
>> according to the ECMA-script standard. Optionally, they will be
>> emitted as 'NaN' and 'Infinity':
>
> ECMAScript presumes double. I think one should base Phobos on
> language-independent standards. I suggest:
>
> http://tools.ietf.org/html/rfc7159

Well, of course it's based on that RFC; did you seriously think otherwise? However, that standard has no mention of infinity or NaN, and since JSON is designed to be a subset of ECMAScript, the ECMAScript standard is basically the only thing that comes close.

>
> For a web server it would be most useful to get an exception since you
> risk ending up with web-clients not working with no logging. It is
> better to have an exception and log an error so the problem can be fixed.

Although you have a point there, of course, it's also highly unlikely that those clients would work correctly if we presumed that JSON supported infinity/NaN. So detecting a bug like that would really be just a coincidence.

But I generally agree; it's just that the anti-exception voices are pretty loud these days (including Walter's), so I opted for a non-throwing solution instead. I guess it wouldn't hurt, though, to default to throwing an exception while still providing the GeneratorOptions.specialFloatLiterals option to handle those values without exception overhead, albeit in a non-standard-conforming way.
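
A minimal sketch of what those modes could look like (only the name GeneratorOptions.specialFloatLiterals is from the actual proposal; the writer function and policy enum here are hypothetical):

import std.format : formattedWrite;
import std.math : isInfinity, isNaN;

// Hypothetical policy enum; mirrors the strategies discussed above.
enum SpecialFloatPolicy { emitNull, emitLiterals, throwException }

void writeJsonNumber(Out)(ref Out sink, double value,
    SpecialFloatPolicy policy = SpecialFloatPolicy.emitNull)
{
    if (value.isNaN || value.isInfinity)
    {
        final switch (policy)
        {
            case SpecialFloatPolicy.emitNull:
                sink.formattedWrite("null"); // ECMAScript-style fallback
                return;
            case SpecialFloatPolicy.emitLiterals:
                // Non-standard, but lenient parsers can round-trip it
                sink.formattedWrite(value.isNaN ? "NaN"
                    : (value > 0 ? "Infinity" : "-Infinity"));
                return;
            case SpecialFloatPolicy.throwException:
                throw new Exception(
                    "special float value has no JSON representation");
        }
    }
    sink.formattedWrite("%.17g", value);
}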
August 25, 2014
On Monday, 25 August 2014 at 20:04:10 UTC, Ola Fosheim Grøstad wrote:
> I think supporting signaling NaN is important for correctness.

It is defined in C++11:

http://en.cppreference.com/w/cpp/types/numeric_limits/signaling_NaN
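
For what it's worth, telling a signaling NaN apart from a quiet one can be done by bit inspection even without language support. A minimal D sketch for IEEE 754 binary64 (the helper name is made up):

// Hypothetical helper: a signaling NaN (binary64) has an all-ones
// exponent, a non-zero mantissa, and a clear quiet bit (bit 51).
bool isSignalingNaN(double x) @trusted
{
    ulong bits = *cast(ulong*) &x;
    bool expAllOnes = (bits & 0x7FF0_0000_0000_0000UL) == 0x7FF0_0000_0000_0000UL;
    ulong mantissa  =  bits & 0x000F_FFFF_FFFF_FFFFUL;
    bool quietBit   = (bits & 0x0008_0000_0000_0000UL) != 0;
    return expAllOnes && mantissa != 0 && !quietBit;
}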


August 25, 2014
> - It makes more sense for ECMAScript to turn illegal values into null
> since it runs on the client.

Like... node.js?

Sorry, just kidding.

I don't think it makes sense for clients to be less strict about such things, but I do agree with your assessment about being as strict as possible on the server. I also think that exceptions are a perfect tool, especially for server applications, and that instead of being avoided because they are slow, they should be made fast enough not to be an issue.
August 25, 2014
On 25.08.2014 21:50, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
> On Monday, 25 August 2014 at 19:38:05 UTC, Walter Bright wrote:
>> The adaptation is to take arbitrary byte input in an unknown encoding
>> and produce valid UTF.
>
> I agree.
>
> For a RESTful HTTP service, the encoding should be specified in the HTTP
> header and the input rejected if it isn't UTF compatible. For that use
> scenario you only want validation, not conversion. However, some
> validation is free: for example, if you only accept numbers, you could
> just turn off parsing of strings in the template…
>
> If files are read from storage then you can reread the file if it fails
> validation on the first pass.
>
> I wonder in which use scenario both of these conditions fail:
>
> 1. unspecified character set and UTF cannot be assumed for JSON
> 2. unable to re-parse

BTW, JSON is *required* to be UTF encoded anyway as per RFC 7159, which is another argument for just letting the lexer assume valid UTF.
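
To illustrate that division of labor, a sketch using Phobos' std.utf.validate (the lexJson entry point is hypothetical):

import std.utf : validate; // throws UTFException on malformed UTF

// Hypothetical entry point: validate once up front, then let the
// lexer assume well-formed UTF, as RFC 7159 requires anyway.
void lexJson(string input)
{
    validate(input);
    // ... hand `input` to a lexer that skips per-byte UTF checks
}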
August 25, 2014
On 25.08.2014 22:21, Sönke Ludwig wrote:
> that standard has no mention of infinity or
> NaN

Sorry, to be precise, it has no suggestion of how to *handle* infinity or NaN.

August 25, 2014
On Monday, 25 August 2014 at 20:21:01 UTC, Sönke Ludwig wrote:
> Well, of course it's based on that RFC; did you seriously think otherwise?

I made no assumptions, just responded to what you wrote :-). It would be reasonable in the context of vibe.d to assume the ECMAScript spec.

> But I generally agree, it's just that the anti-exception voices are pretty loud these days (including Walter's), so that I opted for a non-throwing solution instead.

Yes, the minimum requirement is to just get "did not validate" directly as a single value. One can create a wrapper to get exceptions.
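
As a sketch of such a wrapper (all names hypothetical, assuming a non-throwing parse API that returns an error value):

import std.exception : enforce;

// Hypothetical minimal result type for a non-throwing parser.
struct JsonResult { bool isError; string payload; }

JsonResult parseJson(string input)
{
    // stand-in for a non-throwing parser: flag empty input as invalid
    return JsonResult(input.length == 0, input);
}

// Wrapper: turn a "did not validate" value into an exception for
// callers that prefer a throwing API.
JsonResult parseJsonOrThrow(string input)
{
    auto result = parseJson(input);
    enforce(!result.isError, "JSON validation failed");
    return result;
}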

> I guess it wouldn't hurt though to default to throwing an exception, while still providing the GeneratorOptions.specialFloatLiterals option to handle those values without exception overhead, but in a non standard-conforming way.

What I care most about is getting all the free validation that can be added with no extra cost.

That will make writing web services easier, e.g. if you can define constraints like:

- root is array, values are strings
- root is array, second level only arrays, third level is numbers
- root is dict, all arrays contain only numbers

What is a bit annoying about generic libs is that you have no idea what you are getting, so you have to spend time writing dull validation code.

But maybe StructuredJSON should be a separate library. It would be useful for REST services to specify the grammar and auto-generate both javascript and D structures to hold it along with validation code.

However, just turning off parsing of "true", "false", "null", "[", "{", etc. seems like a cheap addition that could also improve parsing speed if the compiler can make do with two if statements instead of a switch.
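
As a sketch of what turning off token kinds could look like at compile time (all names hypothetical, not part of the proposed module):

// Hypothetical compile-time restriction: disallowed token kinds
// become a cheap error path instead of full parsing logic.
enum JsonToken { number, string_, true_, false_, null_, arrayStart, objectStart }

struct RestrictedLexer(JsonToken[] allowed)
{
    static bool isAllowed(JsonToken t)
    {
        foreach (a; allowed)
            if (a == t) return true;
        return false;
    }
    // ... the lexer proper would reject any token kind for which
    // isAllowed is false before doing any further work
}

// Usage: only numbers and arrays accepted, e.g. [[1, 2], [3]]
alias NumberArrayLexer = RestrictedLexer!([JsonToken.number, JsonToken.arrayStart]);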

Ola.
August 25, 2014
On Monday, 25 August 2014 at 20:35:32 UTC, Sönke Ludwig wrote:
> BTW, JSON is *required* to be UTF encoded anyway as per RFC 7159, which is another argument for just letting the lexer assume valid UTF.

The lexer cannot assume valid UTF since the client might be a rogue, but it can just bail out if the lookahead isn't JSON? So UTF validation is limited to strings.

You have to parse the strings because of the \uXXXX escapes of course, so some basic validation is unavoidable? But I guess full validation of string content could be another useful option, along with "ignore escapes" for cases where you want to avoid decode-encode scenarios (like for a proxy, or if you store pre-escaped Unicode in a database).
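
Decoding those escapes is where the unavoidable part lies; a minimal sketch (helper name made up, surrogate pairing deliberately left out):

import std.conv : to;

// Hypothetical helper: decode one \uXXXX escape to a UTF-16 code
// unit. Full validation would additionally pair surrogates
// (D800-DBFF followed by DC00-DFFF) before re-encoding, which is
// why string bodies can never be passed through entirely unchecked.
wchar decodeUnicodeEscape(const(char)[] hex4)
{
    assert(hex4.length == 4, "expected exactly 4 hex digits");
    return cast(wchar) to!ushort(hex4, 16); // throws ConvException on bad hex
}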