[Issue 14919] utf/unicode should only be validated once

August 14, 2015

Posted by Martin Nowak

https://issues.dlang.org/show_bug.cgi?id=14919

Martin Nowak <code@dawg.eu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|utf error                   |utf/unicode should only be
                   |                            |validated once

--- Comment #1 from Martin Nowak <code@dawg.eu> ---
(In reply to Vladimir Panteleev from comment
https://issues.dlang.org/show_bug.cgi?id=14519#c25)
> Although I think this approach is acceptable (as long as the program halts regardless of compilation flags, which shouldn't be a problem), I would like to note that there are situations in which it is impractical to either convert or validate the data. One example is implementations of text-based network protocols (e.g. HTTP, NNTP, SMTP). Here, neither converting everything to UTF-8 nor verifying that it is valid UTF-8 works, because text-based protocols often embed raw binary data. The program only needs to parse the ASCII text parts, so the ideal solution would be a string handling library which never decodes UTF-8 (something D doesn't have).

Such text protocols don't

--

https://issues.dlang.org/show_bug.cgi?id=14919

--- Comment #2 from Martin Nowak <code@dawg.eu> ---
Such text protocols don't randomly contain binary data. It's properly delimited either by text markers or by known offsets.
So what you need to do is lazily validate and convert ubyte[] to ASCII/UTF, find the delimiters (could probably be done on ubyte[]), and skip validation for the binary blob.
Vice versa for binary protocols that contain strings: first work on the binary data and then validate the extracted strings.

--
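A minimal sketch of the approach described in comment #2, assuming an HTTP-like message whose text header ends at the first "\r\n\r\n" and whose remainder is an opaque binary blob. The names `Message` and `splitMessage` are hypothetical and only serve the illustration; `find`, `representation` and `validate` are the existing Phobos functions.

```d
import std.algorithm.searching : find;
import std.string : representation;
import std.utf : validate;

/// Hypothetical result type: a validated text header plus an untouched binary payload.
struct Message
{
    string header;              // validated exactly once, in splitMessage
    immutable(ubyte)[] payload; // never decoded or validated
}

/// Splits `raw` at the first "\r\n\r\n" and validates only the text before it.
Message splitMessage(immutable(ubyte)[] raw)
{
    static immutable ubyte[4] delim = [13, 10, 13, 10]; // "\r\n\r\n"

    // The delimiter search runs on plain bytes, so nothing is decoded here.
    auto rest = raw.find(delim[]);
    if (rest.length == 0)
        throw new Exception("missing header/payload delimiter");

    auto header = cast(string) raw[0 .. raw.length - rest.length];
    validate(header); // throws std.utf.UTFException on malformed UTF-8

    return Message(header, rest[delim.length .. $]);
}

void main()
{
    immutable(ubyte)[] blob = [0xFF, 0xFE, 0x00]; // not valid UTF-8, and that is fine
    auto raw = "Content-Type: image/png\r\n\r\n".representation ~ blob;

    auto msg = splitMessage(raw);
    assert(msg.header == "Content-Type: image/png");
    assert(msg.payload == blob);
}
```

Only the header slice is ever passed to `validate`; the payload stays a `ubyte` slice, so its contents cannot trigger a UTF error.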

https://issues.dlang.org/show_bug.cgi?id=14919

--- Comment #3 from Martin Nowak <code@dawg.eu> ---
The transition could be done in the following order over several releases:

1. `deprecated("use UTFError instead") UTFException` and add `alias UTFError = UTFException`, so UTFError remains an Exception;
   use UTFError in all validations
2. make UTFError an Error;
   change all text reading functions (e.g. byLine) to eager validations
3. replace validations and UTFError with asserts

--
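The three stages above could look roughly like this from user code. This is only a sketch under stated assumptions: `UTFError` does not exist in Phobos (it is the name proposed in this issue), and `UTFErrorStage2`, `readValidated`, `isValidUTF` and `codePoints` are hypothetical names invented for the illustration. Only `validate`, `count` and `UTFException` are existing `std.utf` symbols.

```d
import std.utf : count, validate, UTFException;

// Stage 1: the new name is only an alias, so it is still an Exception and
// existing `catch (UTFException)` code keeps working.
alias UTFError = UTFException;

// Stage 2 (hypothetical name, to avoid clashing with the alias above): what
// UTFError could look like once it derives from Error instead of Exception.
class UTFErrorStage2 : Error
{
    this(string msg) pure nothrow @safe { super(msg); }
}

// Stage 2: text-reading functions validate eagerly, once, at the boundary.
string readValidated(immutable(ubyte)[] raw)
{
    auto text = cast(string) raw;
    try
    {
        validate(text);
    }
    catch (UTFException e)
    {
        throw new UTFErrorStage2(e.msg);
    }
    return text;
}

// Hypothetical helper so that validity can be stated as a boolean condition.
bool isValidUTF(const(char)[] s)
{
    try { validate(s); return true; }
    catch (UTFException) { return false; }
}

// Stage 3: downstream code assumes valid input and only asserts it, so the
// check disappears in -release builds.
size_t codePoints(string s)
{
    assert(isValidUTF(s), "text must be validated where it enters the program");
    return count(s);
}

void main()
{
    immutable(ubyte)[] good = [0x68, 0x69]; // "hi"
    immutable(ubyte)[] bad = [0xFF];        // 0xFF never occurs in valid UTF-8

    // Stage 1 behaviour: invalid input is still a catchable exception.
    bool caught = false;
    try { validate(cast(string) bad); }
    catch (UTFError e) { caught = true; }
    assert(caught);

    // Stage 2/3 behaviour: validate once on input, then only assert.
    auto text = readValidated(good);
    assert(codePoints(text) == 2);
}
```

The point of the ordering is that each release changes only one thing for callers: first the name, then the catchability, and finally the cost of repeated checks.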

https://issues.dlang.org/show_bug.cgi?id=14919

Vladimir Panteleev <thecybershadow@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |thecybershadow@gmail.com
           See Also|                            |https://issues.dlang.org/show_bug.cgi?id=14519

--

https://issues.dlang.org/show_bug.cgi?id=14919

Iain Buclaw <ibuclaw@gdcproject.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P1                          |P4

--

https://issues.dlang.org/show_bug.cgi?id=14919

--- Comment #4 from dlangBugzillaToGithub <robert.schadek@posteo.de> ---
THIS ISSUE HAS BEEN MOVED TO GITHUB

https://github.com/dlang/dmd/issues/19028

DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB

--
