August 26, 2014
On Monday, 25 August 2014 at 23:24:43 UTC, Kiith-Sa wrote:
> D:YAML uses a similar approach, but with 8 bytes (plain ulong - portable) to detect how many ASCII chars are there before the first non-ASCII UTF-8 sequence,  and it significantly improves performance (didn't keep any numbers unfortunately, but it

Cool!

I think you will often have an array of numbers, so you could subtract "000000000…" (i.e. ASCII '0' from each byte), then parse the offset bytes and convert the mantissa/exponent using shuffles and SIMD.

Somehow…
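
For reference, a rough scalar sketch of the 8-bytes-at-a-time ASCII scan mentioned in the quote — my own guess at the technique, not D:YAML's actual code, and it assumes unaligned ulong loads are fine (as on x86):

size_t asciiPrefixLength(const(ubyte)[] data)
{
    size_t i = 0;
    // Check 8 bytes at once: any byte with the high bit set is non-ASCII.
    while (i + 8 <= data.length)
    {
        ulong chunk = *cast(const(ulong)*)(data.ptr + i);
        if (chunk & 0x8080_8080_8080_8080UL)
            break;
        i += 8;
    }
    // Finish the tail (and locate the exact offset) byte by byte.
    while (i < data.length && data[i] < 0x80)
        ++i;
    return i;
}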
August 26, 2014
Hi!

Thanks for the effort you've put in this.

I am having problems building with LDC 0.14.0. DMD 2.066.0
seems to work fine (all unit tests pass). Do you have any ideas
why?

I am using Ubuntu 3.10 (Linux 3.11.0-15-generic x86_64).

Master was at 6a9f8e62e456c3601fe8ff2e1fbb640f38793d08.
$ dub fetch std_data_json --version=~master
$ cd std_data_json-master/
$ dub test --compiler=ldc2

Generating test runner configuration '__test__library__' for
'library' (library).
Building std_data_json ~master configuration "__test__library__",
build type unittest.
Running ldc2...
source/stdx/data/json/parser.d(77): Error: safe function
'stdx.data.json.parser.__unittestL68_22' cannot call system
function 'object.AssociativeArray!(string,
JSONValue).AssociativeArray.length'
source/stdx/data/json/parser.d(124): Error: safe function
'stdx.data.json.parser.__unittestL116_24' cannot call system
function 'object.AssociativeArray!(string,
JSONValue).AssociativeArray.length'
source/stdx/data/json/parser.d(341): Error: function
stdx.data.json.parser.JSONParserRange!(JSONLexerRange!string).JSONParserRange.opAssign
is not callable because it is annotated with @disable
source/stdx/data/json/parser.d(341): Error: safe function
'stdx.data.json.parser.__unittestL318_32' cannot call system
function
'stdx.data.json.parser.JSONParserRange!(JSONLexerRange!string).JSONParserRange.opAssign'
source/stdx/data/json/parser.d(633): Error: function
stdx.data.json.lexer.JSONToken.opAssign is not callable because
it is annotated with @disable
source/stdx/data/json/parser.d(633): Error:
'stdx.data.json.lexer.JSONToken.opAssign' is not nothrow
source/stdx/data/json/parser.d(630): Error: function
'stdx.data.json.parser.JSONParserNode.literal' is nothrow yet may
throw
FAIL
.dub/build/__test__library__-unittest-linux.posix-x86_64-ldc2-0F620B217010475A5A4E545A57CDD09A/
__test__library__ executable
Error executing command test: ldc2 failed with exit code 1.

Thanks
August 26, 2014
> ...
> I am using Ubuntu 3.10 (Linux 3.11.0-15-generic x86_64).
> ...

I meant Ubuntu 13.10 :D
August 26, 2014
On 25/08/14 21:49, simendsjo wrote:

> So ldc can remove quite a substantial amount of code in some cases.

That's because the latest release of LDC has the --gc-sections flag enabled by default.

-- 
/Jacob Carlborg
August 26, 2014
On Monday, 25 August 2014 at 23:29:21 UTC, Walter Bright wrote:
> On 8/25/2014 4:15 PM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com>" wrote:
>> On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright wrote:
>>> I didn't know that. But recall I did implement it in DMC++, and it turned out
>>> to simply not be useful. I'd be surprised if the new C++ support for it does
>>> anything worthwhile.
>>
>> Well, one should initialize with signaling NaN. Then you get an exception if you
>> try to compute using uninitialized values.
>
>
> That's the theory. The practice doesn't work out so well.

To be more concrete:

Processors from AMD have signalling NaN behaviour which is different from processors from Intel.

And the situation is worse on most other architectures. It's a lost cause, I think.
August 26, 2014
On Tuesday, 26 August 2014 at 07:24:19 UTC, Don wrote:
> Processors from AMD have signalling NaN behaviour which is different from processors from Intel.
>
> And the situation is worse on most other architectures. It's a lost cause, I think.

I disagree. AFAIK signaling NaN was standardized in IEEE 754-2008, so it is receiving attention.
August 26, 2014
Am 25.08.2014 23:53, schrieb "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com>":
> On Monday, 25 August 2014 at 21:27:42 UTC, Sönke Ludwig wrote:
>> But why should UTF validation be the job of the lexer in the first place?
>
> Because you want to save time, it is faster to integrate validation? The
> most likely use scenario is to receive REST data over HTTP that needs
> validation.
>
> Well, so then I agree with Andrei… array of bytes it is. ;-)
>
>> added as a separate proxy range. But if we end up going for validating
>> in the lexer, it would indeed be enough to validate inside strings,
>> because the rest of the grammar assumes a subset of ASCII.
>
> Not assumes, but defines! :-)

I guess it depends on whether you look at the grammar as productions or comprehensions (right term?) ;)

>
> If you have to validate UTF before lexing then you will end up
> needlessly scanning lots of ascii if the file contains lots of
> non-strings or is from an encoder that only sends pure ascii.

That's true. So the ideal solution would be to *assume* UTF-8 when the input is char based and to *validate* if the input is "numeric".

>
> If you want to have "plugin" validation of strings then you also need to
> differentiate strings so that the user can select which data should be
> just ascii, utf8, numbers, ids etc. Otherwise the user will end up doing
> double validation (you have to bypass >7F followed by string-end anyway).
>
> The advantage of integrated validation is that you can use 16 bytes SIMD
> registers on the buffer.
>
> I presume you can load 16 bytes and do BITWISE-AND on the MSB, then
> match against string-end and carefully use this to boost performance of
> simultaneous UTF validation, escape-scanning, and string-end scan. A bit
> tricky, of course.

Well, that's something that's definitely out of the scope of this proposal. Definitely an interesting direction to pursue, though.

>> At least no UTF validation is needed. Since all non-ASCII characters
>> will always be composed of bytes >0x7F, a sequence \uXXXX can be
>> assumed to be valid wherever in the string it occurs, and all other
>> bytes that don't belong to an escape sequence are just passed through
>> as-is.
>
> You cannot assume \u… to be valid if you convert it.

I meant "X" to stand for a hex digit. The point was just that you don't have to worry about interacting in a bad way with UTF sequences when you find "\uXXXX".
August 26, 2014
On Tuesday, 26 August 2014 at 07:51:04 UTC, Sönke Ludwig wrote:
> That's true. So the ideal solution would be to *assume* UTF-8 when the input is char based and to *validate* if the input is "numeric".

I think you should validate JSON strings to be UTF-8 encoded even if you allow illegal Unicode values. Basically, ensure that a byte >0x7f is followed by the right number of continuation bytes, so you don't get >0x7f as the last byte in a string, etc.
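
A minimal sketch of that structural check (sequence lengths and continuation bytes only, not full code point validation — my formulation, not taken from any existing library):

bool hasValidUtf8Structure(const(ubyte)[] s)
{
    size_t i = 0;
    while (i < s.length)
    {
        ubyte b = s[i];
        size_t len;
        if (b < 0x80)                len = 1;   // plain ASCII
        else if ((b & 0xE0) == 0xC0) len = 2;
        else if ((b & 0xF0) == 0xE0) len = 3;
        else if ((b & 0xF8) == 0xF0) len = 4;
        else return false;                      // stray continuation or invalid lead byte
        if (i + len > s.length) return false;   // sequence truncated at string end
        foreach (j; 1 .. len)
            if ((s[i + j] & 0xC0) != 0x80) return false;
        i += len;
    }
    return true;
}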

> Well, that's something that's definitely out of the scope of this proposal. Definitely an interesting direction to pursue, though.

Maybe the interface/code structure could be designed so that the implementation can later be version()'ed to use SIMD where possible.

>> You cannot assume \u… to be valid if you convert it.
>
> I meant "X" to stand for a hex digit. The point was just that you don't have to worry about interacting in a bad way with UTF sequences when you find "\uXXXX".

When you convert "\uXXXX" to UTF-8 bytes, is it then validated as a legal code point? I guess it is not necessary.

Btw, I believe rapidJSON achieves high speed by converting strings in situ, so if the prefix is escape-free it only starts converting in place when it hits the first escape, thus avoiding some copying.
August 26, 2014
Am 26.08.2014 03:31, schrieb Entusiastic user:
> Hi!
>
> Thanks for the effort you've put in this.
>
> I am having problems with building with LDC 0.14.0. DMD 2.066.0
> seems to work fine (all unit tests pass). Do you have any ideas
> why?

I've fixed all errors on DMD 2.065 now. Hopefully that should also fix LDC.

August 26, 2014
Am 26.08.2014 10:24, schrieb "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com>":
> On Tuesday, 26 August 2014 at 07:51:04 UTC, Sönke Ludwig wrote:
>> That's true. So the ideal solution would be to *assume* UTF-8 when the
>> input is char based and to *validate* if the input is "numeric".
>
> I think you should validate JSON-strings to be UTF-8 encoded even if you
> allow illegal unicode values. Basically ensuring that >0x7f has the
> right number of bytes after it, so you don't get >0x7f as the last byte
> in a string etc.

I think this is a misunderstanding. What I mean is that if the input range passed to the lexer is char/wchar/dchar based, the lexer should assume that the input is well-formed UTF. After all, this is how D strings are defined.

When on the other hand a ubyte/ushort/uint range is used, the lexer should validate all string literals.
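
A minimal sketch of that dispatch, with a hypothetical signature (not the actual std_data_json API):

import std.range : ElementType, isInputRange;
import std.traits : isSomeChar;

void lexJSON(R)(R input)
    if (isInputRange!R)
{
    static if (isSomeChar!(ElementType!R))
    {
        // char/wchar/dchar input: assume well-formed UTF, as D strings
        // are defined to be, and skip validation
    }
    else
    {
        // ubyte/ushort/uint input: validate string literals while lexing
    }
}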

>
>> Well, that's something that's definitely out of the scope of this
>> proposal. Definitely an interesting direction to pursue, though.
>
> Maybe the interface/code structure is or could be designed so that the
> implementation could later be version()'ed to SIMD where possible.

I guess that shouldn't be an issue. From the outside it's just a generic range that is passed in, and internally it's always possible to add special cases for array inputs. If someone else wants to play around with this idea, we could of course also integrate it right away; it's just that I personally don't have the time to go to the extreme here.

>>> You cannot assume \u… to be valid if you convert it.
>>
>> I meant "X" to stand for a hex digit. The point was just that you
>> don't have to worry about interacting in a bad way with UTF sequences
>> when you find "\uXXXX".
>
> When you convert "\uXXXX" to UTF-8 bytes, is it then validated as a
> legal code point? I guess it is not necessary.

What is validated is that it forms valid UTF-16 surrogate pairs, and those are converted to a single dchar instead (if applicable). This is necessary, because otherwise the lexer would produce invalid UTF-8 for valid inputs. Apart from that, the value is used verbatim as a dchar.
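
In other words, a \uXXXX\uXXXX surrogate pair is folded into one code point roughly like this (hypothetical helper, not the actual lexer code):

dchar combineSurrogates(wchar hi, wchar lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF, "not a high surrogate");
    assert(lo >= 0xDC00 && lo <= 0xDFFF, "not a low surrogate");
    return cast(dchar)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
}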

>
> Btw, I believe rapidJSON achieves high speed by converting strings in
> situ, so that if the prefix is escape free it just converts in place
> when it hits the first escape. Thus avoiding some moving.

The same is true for this lexer, at least for array inputs. It currently just stores a slice of the string literal in all cases and lazily decodes it on first access. While doing that, it first skips any escape-free prefix and returns a slice if the whole string is free of escape sequences.
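
A rough illustration of that lazy strategy (hypothetical helper, not the actual std_data_json code; \uXXXX decoding is omitted here):

string unescape(string raw)
{
    import std.array : appender;
    import std.string : indexOf;

    auto idx = raw.indexOf('\\');
    if (idx < 0)
        return raw;                          // escape-free: reuse the input slice

    auto app = appender!string();
    app.put(raw[0 .. idx]);                  // copy the escape-free prefix
    for (auto i = cast(size_t) idx; i < raw.length; ++i)
    {
        if (raw[i] != '\\') { app.put(raw[i]); continue; }
        switch (raw[++i])
        {
            case '"':  app.put('"');  break;
            case '\\': app.put('\\'); break;
            case '/':  app.put('/');  break;
            case 'b':  app.put('\b'); break;
            case 'f':  app.put('\f'); break;
            case 'n':  app.put('\n'); break;
            case 'r':  app.put('\r'); break;
            case 't':  app.put('\t'); break;
            default:   assert(false, "\\uXXXX decoding omitted in this sketch");
        }
    }
    return app.data;
}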