July 17, 2015 [Issue 14519] Get rid of unicode validation in string processing

https://issues.dlang.org/show_bug.cgi?id=14519

Martin Nowak <code@dawg.eu> changed:

           What    |Removed                    |Added
----------------------------------------------------------------------------
           Summary |[Enh] foreach on strings   |Get rid of unicode
                   |should return              |validation in string
                   |replacementDchar rather    |processing
                   |than throwing              |
July 17, 2015 [Issue 14519] Get rid of unicode validation in string processing

--- Comment #31 from Martin Nowak <code@dawg.eu> ---
(In reply to Martin Nowak from comment #30)
> Well, b/c they contain delimited binary and ASCII data, you'll have to find
> those delimiters, then validate and cast the ASCII part to a string, and can
> then use std.string functions.

BTW, this is what I already wrote in comment 23. Not sure why you only
partially quoted my answer to suggest a contradiction.
July 17, 2015 [Issue 14519] Get rid of unicode validation in string processing

--- Comment #32 from Martin Nowak <code@dawg.eu> ---
Summary:

We should adopt a new model of unicode validation. The current one, where
every string processing function decodes unicode characters and performs
validation, causes too much overhead. A better alternative would be to
perform unicode validation once when reading raw data (ubyte[]) and then
assume any char[]/wchar[]/dchar[] is a valid unicode string. Invalid
encodings introduced by string processing algorithms are programming bugs
and thus do not warrant runtime checks in release builds.

Also see https://github.com/D-Programming-Language/druntime/pull/1279
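[The "validate once at the I/O boundary" model proposed in comment 32 can be sketched as follows. This is an illustrative example in Python, not D or Phobos code; Python's str type happens to already work this way: raw bytes are validated exactly once when decoded, and every later string operation assumes validity. The function names `read_text` and `process` are invented for the sketch.]

```python
def read_text(raw: bytes) -> str:
    # Single validation point: invalid UTF-8 is rejected here, at the
    # boundary where raw data (the ubyte[] of comment 32) enters the program.
    return raw.decode("utf-8")  # raises UnicodeDecodeError on bad input

def process(s: str) -> str:
    # Downstream "string processing": no decoding errors are possible here,
    # so this function needs no encoding-related error handling at all
    # (the analogue of nothrow string functions in the proposed model).
    return s.upper()

valid = read_text("héllo".encode("utf-8"))
print(process(valid))  # HÉLLO

try:
    read_text(b"\xff\xfe garbage")  # invalid UTF-8: caught once, up front
except UnicodeDecodeError:
    print("rejected at the boundary")
```

Under this model, a corrupt char[] appearing past the boundary is a program bug, which is why comment 32 argues it warrants an assertion in debug builds rather than a runtime check in release builds.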
July 17, 2015 [Issue 14519] Get rid of unicode validation in string processing

--- Comment #33 from Sobirari Muhomori <dfj1esp02@sneakemail.com> ---
Removing autodecoding is good, but this issue is about making autodecode
@nothrow @nogc.
July 17, 2015 [Issue 14519] Get rid of unicode validation in string processing

--- Comment #34 from Vladimir Panteleev <thecybershadow@gmail.com> ---
(In reply to Martin Nowak from comment #31)
> BTW, this is what I already wrote in comment 23. Not sure why you only
> partially quoted my answer to suggest a contradiction.

Err, well, to be fair, you did not state this clearly in comment 23, which is
why I asked for a clarification. I was not trying to maliciously nitpick your
words, just trying to understand your point.
July 17, 2015 [Issue 14519] Get rid of unicode validation in string processing

--- Comment #35 from Jonathan M Davis <issues.dlang@jmdavisProg.com> ---
(In reply to Martin Nowak from comment #32)
> Summary:
>
> We should adopt a new model of unicode validation.
> The current one where every string processing function decodes unicode
> characters and performs validation causes too much overhead.
> A better alternative would be to perform unicode validation once when
> reading raw data (ubyte[]) and then assume any char[]/wchar[]/dchar[] is a
> valid unicode string.
> Invalid encodings introduced by string processing algorithms are programming
> bugs and thus do not warrant runtime checks in release builds.

Exactly.
July 17, 2015 [Issue 14519] Get rid of unicode validation in string processing

--- Comment #36 from Vladimir Panteleev <thecybershadow@gmail.com> ---
Question: is there any overhead in actually verifying the validity of UTF-8
streams, or is all overhead related to error handling (i.e. the inability to
be nothrow)?
August 19, 2015 [Issue 14519] Get rid of unicode validation in string processing

Vladimir Panteleev <thecybershadow@gmail.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
          See Also |        |https://issues.dlang.org/show_bug.cgi?id=14919
May 18, 2016 [Issue 14519] Get rid of unicode validation in string processing

Jack Stouffer <jack@jackstouffer.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                CC |        |jack@jackstouffer.com

--- Comment #37 from Jack Stouffer <jack@jackstouffer.com> ---
This entire discussion is moot unless you get Andrei on board with a breaking
change to a very fundamental part of the language.
May 20, 2016 [Issue 14519] Get rid of unicode validation in string processing

--- Comment #38 from Martin Nowak <code@dawg.eu> ---
(In reply to Vladimir Panteleev from comment #36)
> Question, is there any overhead in actually verifying the validity of UTF-8
> streams, or is all overhead related to error handling (i.e. inability to be
> nothrow)?

I think it's fairly measurable, b/c you need to add lots of additional checks
and branches (though highly predictable ones). While my initial decode
implementation
https://github.com/MartinNowak/phobos/blob/1b0edb728c/std/utf.d#L577-L651
was transmogrified into 200 lines in the meantime
https://github.com/dlang/phobos/blob/acafd848d8/std/utf.d#L1167-L1369,
you can still use it to benchmark validation. I did run a lot of benchmarks
when introducing that function, and the code path for decoding just remains
slow, even with the throwing code path moved out of normal control flow.
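[To see where the extra checks and branches mentioned in comment 38 come from, here is a rough sketch, in Python rather than D, and not the Phobos implementation. The checked decoder verifies lead and continuation bytes and rejects truncated sequences (overlong and surrogate checks omitted for brevity), while the trusting decoder assumes already-validated input and carries no error paths at all. Function names are invented for the sketch.]

```python
def decode_checked(buf: bytes, i: int) -> tuple[int, int]:
    """Decode one code point at buf[i]; raise ValueError on invalid UTF-8."""
    b0 = buf[i]
    if b0 < 0x80:                          # ASCII fast path
        return b0, i + 1
    if b0 < 0xC2 or b0 > 0xF4:             # validation branch: lead byte
        raise ValueError("invalid lead byte")
    n = 1 + (b0 >= 0xE0) + (b0 >= 0xF0)    # number of continuation bytes
    if i + n >= len(buf):                  # validation branch: truncation
        raise ValueError("truncated sequence")
    cp = b0 & (0x3F >> n)                  # payload bits of the lead byte
    for b in buf[i + 1 : i + 1 + n]:
        if b & 0xC0 != 0x80:               # validation branch: per byte
            raise ValueError("bad continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp, i + 1 + n

def decode_trusted(buf: bytes, i: int) -> tuple[int, int]:
    """Same decode, assuming buf is already-validated UTF-8: no error paths."""
    b0 = buf[i]
    if b0 < 0x80:
        return b0, i + 1
    n = 1 + (b0 >= 0xE0) + (b0 >= 0xF0)
    cp = b0 & (0x3F >> n)
    for b in buf[i + 1 : i + 1 + n]:
        cp = (cp << 6) | (b & 0x3F)
    return cp, i + 1 + n

buf = "héllo €".encode("utf-8")
print(hex(decode_checked(buf, 1)[0]))  # 0xe9 (é)
print(hex(decode_trusted(buf, 7)[0]))  # 0x20ac (€)
```

The branches are well predicted on valid input, as comment 38 notes, but they still appear on every multi-byte code point; moving them to a single up-front validation pass is exactly the trade comment 32 proposes.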
Copyright © 1999-2021 by the D Language Foundation