August 25, 2014
On 8/25/2014 1:21 PM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
> On Monday, 25 August 2014 at 20:04:10 UTC, Ola Fosheim Grøstad wrote:
>> I think supporting signaling NaN is important for correctness.
>
> It is defined in C++11:
>
> http://en.cppreference.com/w/cpp/types/numeric_limits/signaling_NaN


I didn't know that. But recall I did implement it in DMC++, and it turned out to simply not be useful. I'd be surprised if the new C++ support for it does anything worthwhile.
August 25, 2014
On 8/25/2014 1:35 PM, Sönke Ludwig wrote:
> BTW, JSON is *required* to be UTF encoded anyway as per RFC-7159, which is
> another argument for just letting the lexer assume valid UTF.

I think that settles it.
August 25, 2014
On 8/25/2014 12:49 PM, simendsjo wrote:
> I just happened to write a very small script yesterday and tested with
> the compilers (with dub --build=release).
>
> dmd: 2.8 MB
> gdc: 3.3 MB
> ldc: 0.5 MB
>
> So ldc can remove quite a substantial amount of code in some cases.
>

Speed optimizations are different.
August 25, 2014
On 8/25/2014 10:51 PM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
> On Monday, 25 August 2014 at 20:35:32 UTC, Sönke Ludwig wrote:
>> BTW, JSON is *required* to be UTF encoded anyway as per RFC-7159,
>> which is another argument for just letting the lexer assume valid UTF.
>
> The lexer cannot assume valid UTF since the client might be a rogue, but
> it can just bail out if the lookahead isn't JSON? So UTF validation is
> limited to strings.

But why should UTF validation be the job of the lexer in the first place? D's "string" type is defined to be UTF-8, so given a string input, the lexer would of course be free to assume valid UTF-8. I agree with Walter there that validation/conversion should be added as a separate proxy range. But if we do end up validating in the lexer, it would indeed be enough to validate inside strings, because the rest of the grammar only assumes a subset of ASCII.
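To make the proxy range idea concrete, here is a minimal sketch (the names are illustrative, not a proposed API, and it only checks UTF-8 sequence structure, not overlong encodings or surrogates):

import std.range.primitives : isInputRange, ElementType, empty, front, popFront;
import std.utf : UTFException;

// Wraps a range of ubyte and checks UTF-8 sequence structure as bytes
// stream through, so the lexer downstream can assume well-formed input.
struct Utf8StructureValidator(R)
    if (isInputRange!R && is(ElementType!R : ubyte))
{
    R src;
    private uint pending; // continuation bytes still owed by the last lead byte

    this(R src)
    {
        this.src = src;
        if (!this.src.empty)
            check(this.src.front);
    }

    @property bool empty() { return src.empty; }
    @property ubyte front() { return src.front; }

    void popFront()
    {
        src.popFront();
        if (!src.empty)
            check(src.front);
        else if (pending != 0)
            throw new UTFException("truncated UTF-8 sequence");
    }

    private void check(ubyte b)
    {
        if (pending > 0)
        {
            if ((b & 0xC0) != 0x80)
                throw new UTFException("expected UTF-8 continuation byte");
            --pending;
        }
        else if (b >= 0x80)
        {
            if ((b & 0xE0) == 0xC0)      pending = 1;
            else if ((b & 0xF0) == 0xE0) pending = 2;
            else if ((b & 0xF8) == 0xF0) pending = 3;
            else throw new UTFException("invalid UTF-8 lead byte");
        }
    }
}

The full check (overlongs, surrogates) could be dropped into check() without changing the interface.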

>
> You have to parse the strings because of the \uXXXX escapes of course,
> so some basic validation is unavoidable?

At least no UTF validation is needed. Since all non-ASCII characters will always be composed of bytes >0x7F, a sequence \uXXXX can be assumed to be valid wherever in the string it occurs, and all other bytes that don't belong to an escape sequence are just passed through as-is.
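A sketch of what such a string scan could look like (a hypothetical helper, not from an actual implementation; error handling reduced to plain Exceptions):

import std.ascii : isHexDigit;

// Scans a JSON string body without any UTF validation: bytes >0x7F are
// passed through untouched, only escape sequences are inspected.
// `input` starts just past the opening quote; returns the raw content.
const(ubyte)[] lexStringBody(const(ubyte)[] input)
{
    size_t i = 0;
    while (i < input.length)
    {
        if (input[i] == '"')
            return input[0 .. i];                // end of string
        if (input[i] == '\\')
        {
            if (i + 1 >= input.length)
                throw new Exception("unterminated escape sequence");
            if (input[i + 1] == 'u')
            {
                // \uXXXX: four hex digits, nothing UTF-specific to check
                if (i + 6 > input.length)
                    throw new Exception("truncated \\u escape");
                foreach (k; i + 2 .. i + 6)
                    if (!isHexDigit(input[k]))
                        throw new Exception("malformed \\u escape");
                i += 6;
            }
            else
                i += 2;   // \" \\ \/ \b \f \n \r \t (validity check omitted)
        }
        else
            i++;          // any other byte, including >0x7F: passed through as-is
    }
    throw new Exception("unterminated string");
}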

> But I guess full validation of
> string content could be another useful option along with "ignore
> escapes" for the case where you want to avoid decode-encode scenarios.
> (like for a proxy, or if you store pre-escaped unicode in a database)

August 25, 2014
On Monday, 25 August 2014 at 21:27:42 UTC, Sönke Ludwig wrote:
> But why should UTF validation be the job of the lexer in the first place?

Because you want to save time: it is faster to integrate validation. The most likely usage scenario is receiving REST data over HTTP, which needs validation.

Well, so then I agree with Andrei… array of bytes it is. ;-)

> added as a separate proxy range. But if we end up going for validating in the lexer, it would indeed be enough to validate inside strings, because the rest of the grammar assumes a subset of ASCII.

Not assumes, but defines! :-)

If you have to validate UTF before lexing, then you will end up needlessly scanning lots of ASCII if the file contains lots of non-string data or comes from an encoder that only sends pure ASCII.

If you want "plugin" validation of strings, then you also need to differentiate strings so that the user can select which data should be plain ASCII, UTF-8, numbers, IDs, etc. Otherwise the user will end up doing double validation (you have to scan past bytes >0x7F while looking for the string-end anyway).

The advantage of integrated validation is that you can use 16-byte SIMD registers on the buffer.

I presume you can load 16 bytes and do a bitwise AND on the MSBs, then match against string-end, and carefully use this to boost the performance of simultaneous UTF validation, escape scanning, and string-end scanning. A bit tricky, of course.
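In portable form (plain ulong instead of SIMD registers), the bit tests would look something like this; just a sketch of the tricks, not measured:

enum ulong ones  = 0x0101_0101_0101_0101UL;
enum ulong highs = 0x8080_8080_8080_8080UL;

// Any byte with its MSB set, i.e. a non-ASCII byte somewhere in the word.
bool hasNonAscii(ulong v) { return (v & highs) != 0; }

// Any byte equal to c: XOR with a broadcast of c zeroes the matching
// byte, then the classic SWAR zero-byte test detects it.
bool hasByte(ulong v, ubyte c)
{
    immutable x = v ^ (ones * c);
    return ((x - ones) & ~x & highs) != 0;
}

Three such tests per word (MSBs for UTF, '"' for string-end, '\' for escapes) let the inner loop skip runs of plain ASCII; the SIMD variants are the same idea with wider registers.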

> At least no UTF validation is needed. Since all non-ASCII characters will always be composed of bytes >0x7F, a sequence \uXXXX can be assumed to be valid wherever in the string it occurs, and all other bytes that don't belong to an escape sequence are just passed through as-is.

You cannot assume \u… to be valid if you convert it.
August 25, 2014
On Monday, 25 August 2014 at 21:53:50 UTC, Ola Fosheim Grøstad wrote:
> I presume you can load 16 bytes and do a bitwise AND on the MSBs, then match against string-end, and carefully use this to boost the performance of simultaneous UTF validation, escape scanning, and string-end scanning. A bit tricky, of course.

I think it is doable and worth it…

https://software.intel.com/sites/landingpage/IntrinsicsGuide/

e.g.:

__mmask16 _mm_cmpeq_epu8_mask (__m128i a, __m128i b)
__mmask32 _mm256_cmpeq_epu8_mask (__m256i a, __m256i b)
__mmask64 _mm512_cmpeq_epu8_mask (__m512i a, __m512i b)
__mmask16 _mm_test_epi8_mask (__m128i a, __m128i b)
etc.

So you can:

1. preload registers with "\\\\\\\\…", "\"\"…" and "\0\0\0…"
2. then compare signed/unsigned/equal as appropriate
3. then load 16, 32 or 64 bytes of data and stream until the masks trigger
4. test the masks
5. resolve any potential issues, goto 3
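Something like this, with the portable 8-byte tests from my previous post standing in for the SIMD compares (the 16/32/64-byte versions would just swap in the mask intrinsics); a sketch only:

enum ulong ones  = 0x0101_0101_0101_0101UL;
enum ulong highs = 0x8080_8080_8080_8080UL;

bool hasByte(ulong v, ubyte c)   // SWAR byte-equality test, as before
{
    immutable x = v ^ (ones * c);
    return ((x - ones) & ~x & highs) != 0;
}

// Returns the index of the closing quote of a string body, streaming
// word-sized chunks until one of the masks triggers (steps 3-5 above).
size_t scanStringBody(const(ubyte)[] input)
{
    size_t i = 0;
    while (i + 8 <= input.length)
    {
        ulong v = *cast(const(ulong)*) (input.ptr + i); // unaligned load, fine on x86
        // Test the masks: escapes, string-end, non-ASCII (where validation hooks in).
        if (hasByte(v, '\\') || hasByte(v, '"') || (v & highs) != 0)
            break;          // something interesting in this word: resolve below
        i += 8;
    }
    // Resolution loop; a real implementation would re-enter the fast
    // loop after resolving instead of finishing byte-by-byte.
    while (i < input.length && input[i] != '"')
        i += (input[i] == '\\') ? 2 : 1;
    return i;
}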
August 25, 2014
On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright wrote:
> I didn't know that. But recall I did implement it in DMC++, and it turned out to simply not be useful. I'd be surprised if the new C++ support for it does anything worthwhile.

Well, one should initialize with signaling NaN. Then you get an exception if you try to compute using uninitialized values.
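Something like this (a sketch; the bit pattern and the trap are platform/IEEE 754 assumptions, and note that D's own default initializer is a quiet NaN):

import std.math : FloatingPointControl;

void main()
{
    // Signaling NaN for double: exponent all ones, quiet bit clear,
    // non-zero payload.
    ulong bits = 0x7FF0_0000_0000_0001UL;
    double snan = *cast(double*) &bits;

    // Unmask the invalid-operation exception in the FPU control word.
    FloatingPointControl ctrl;
    ctrl.enableExceptions(FloatingPointControl.invalidException);

    double y = snan + 1.0;   // computing with the sNaN traps here
}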

August 25, 2014
On Monday, 25 August 2014 at 22:40:00 UTC, Ola Fosheim Grøstad wrote:
> On Monday, 25 August 2014 at 21:53:50 UTC, Ola Fosheim Grøstad wrote:
>> I presume you can load 16 bytes and do a bitwise AND on the MSBs, then match against string-end, and carefully use this to boost the performance of simultaneous UTF validation, escape scanning, and string-end scanning. A bit tricky, of course.
>
> I think it is doable and worth it…
>
> https://software.intel.com/sites/landingpage/IntrinsicsGuide/
>
> e.g.:
>
> __mmask16 _mm_cmpeq_epu8_mask (__m128i a, __m128i b)
> __mmask32 _mm256_cmpeq_epu8_mask (__m256i a, __m256i b)
> __mmask64 _mm512_cmpeq_epu8_mask (__m512i a, __m512i b)
> __mmask16 _mm_test_epi8_mask (__m128i a, __m128i b)
> etc.
>
> So you can:
>
> 1. preload registers with "\\\\\\\\…", "\"\"…" and "\0\0\0…"
> 2. then compare signed/unsigned/equal as appropriate
> 3. then load 16, 32 or 64 bytes of data and stream until the masks trigger
> 4. test the masks
> 5. resolve any potential issues, goto 3

D:YAML uses a similar approach, but with 8 bytes (a plain ulong, for portability) to detect how many ASCII chars there are before the first non-ASCII UTF-8 sequence, and it significantly improves performance. I didn't keep any numbers, unfortunately, but it decreases decoding overhead to a fraction for most inputs, since YAML (and JSON) files tend to be mostly ASCII with the occasional non-ASCII in strings. If we know that we have e.g. 100 incoming chars that are plain ASCII, we can use a fast path for them and only consider decoding after that.

See the countASCII() function in https://github.com/kiith-sa/D-YAML/blob/master/source/dyaml/reader.d

However, this approach is useful only if you decode the whole buffer at once, not if you do something like foreach(dchar ch; "asdsššdfáľäô") {}, which is the most obvious way to decode in D.

FWIW, decoding _was_ a significant overhead in D:YAML (again, I didn't keep numbers, but at one point it was around 10% in the profiler), and I didn't like the fact that it prevented making my code @nogc. I ended up copying chunks of std.utf and making them @nogc nothrow. (D:YAML as a whole is not @nogc, but I use @nogc in some parts basically as "@noalloc", to ensure I don't allocate anything.)
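To illustrate the kind of thing I mean, a simplified sketch (not the actual D:YAML code) of a @nogc nothrow decoder that returns the replacement character instead of throwing, so no exception allocation is needed:

// Decode one code point from UTF-8 starting at s[i], advancing i.
// Returns U+FFFD on malformed input. Overlong/surrogate checks omitted.
dchar decodeOne(const(ubyte)[] s, ref size_t i) @nogc nothrow
{
    immutable b = s[i++];
    if (b < 0x80)
        return b;                              // ASCII fast path

    uint len, c;
    if      ((b & 0xE0) == 0xC0) { len = 1; c = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 2; c = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 3; c = b & 0x07; }
    else return cast(dchar) 0xFFFD;            // invalid lead byte

    foreach (_; 0 .. len)
    {
        if (i >= s.length || (s[i] & 0xC0) != 0x80)
            return cast(dchar) 0xFFFD;         // truncated or bad continuation
        c = (c << 6) | (s[i++] & 0x3F);
    }
    return cast(dchar) c;
}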
August 25, 2014
On 8/25/2014 4:15 PM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
> On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright wrote:
>> I didn't know that. But recall I did implement it in DMC++, and it turned out
>> to simply not be useful. I'd be surprised if the new C++ support for it does
>> anything worthwhile.
>
> Well, one should initialize with signaling NaN. Then you get an exception if you
> try to compute using uninitialized values.


That's the theory. The practice doesn't work out so well.
August 25, 2014
Btw, maybe it would be a good idea to take a look at the JSON that various browsers generate to see if there are any differences?

Then one could tune optimizations to the most common encoding style, like this:

1. start parsing assuming a restricted "browser style" JSON grammar

2. on failure, jump to the slower generic JSON grammar

Chrome does not seem to generate whitespace in JSON.stringify(). And I would not be surprised if the encoding of doubles is similar across browsers.
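In code the dispatch could be as simple as this (JsonValue, tryParseCompact and parseGeneric being hypothetical names for the two parsers):

// Try the restricted "browser style" grammar first (no whitespace,
// known number format); fall back to the full RFC 7159 grammar.
JsonValue parse(const(ubyte)[] input)
{
    JsonValue result;
    if (tryParseCompact(input, result))   // hypothetical fast parser
        return result;
    return parseGeneric(input);           // hypothetical generic parser
}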

Ola.