[Issue 14919] utf/unicode should only be validated once
August 14, 2015
https://issues.dlang.org/show_bug.cgi?id=14919

Martin Nowak <code@dawg.eu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|utf error                   |utf/unicode should only be
                   |                            |validated once

--- Comment #1 from Martin Nowak <code@dawg.eu> ---
(In reply to Vladimir Panteleev from comment
https://issues.dlang.org/show_bug.cgi?id=14519#c25)
> Although I think this approach is acceptable (as long as the program halts regardless of compilation flags, which shouldn't be a problem), I would like to note that there are situations in which it is impractical to either convert or validate the data. One example is implementations of text-based network protocols (e.g. HTTP, NNTP, SMTP). Here, neither converting everything to UTF-8 nor verifying that it is valid UTF-8 works, because text-based protocols often embed raw binary data. The program only needs to parse the ASCII text parts, so the ideal solution would be a string handling library which never decodes UTF-8 (something D doesn't have).

Such text protocols don't

--
August 14, 2015
https://issues.dlang.org/show_bug.cgi?id=14919

--- Comment #2 from Martin Nowak <code@dawg.eu> ---
Such text protocols don't randomly contain binary data.
It's properly delimited, either by text markers or by known offsets.
So what you need to do is lazily validate and convert ubyte[] to ASCII/UTF,
find the delimiters (could probably be done on ubyte[]), and skip validation for
the binary blob.
Vice versa for binary protocols that contain strings: first work on the binary
data, then validate the extracted strings.
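
A minimal sketch of the approach described above, assuming an HTTP-like layout where a blank line (CR LF CR LF) separates the text header from a binary payload. The names `Message` and `parseMessage` are hypothetical and only illustrate the idea: the delimiter is located on the raw ubyte[] data, only the text part is validated, and the binary blob is never decoded.

```d
import std.algorithm.searching : findSplit;
import std.utf : validate;

/// Hypothetical result type: validated text header plus untouched binary payload.
struct Message
{
    string header;              // validated as UTF-8 exactly once
    immutable(ubyte)[] payload; // raw bytes, never decoded or validated
}

/// Splits `raw` at the first blank line and validates only the header.
/// Assumption: header and payload are separated by "\r\n\r\n".
Message parseMessage(immutable(ubyte)[] raw)
{
    immutable(ubyte)[] delim = [13, 10, 13, 10]; // CR LF CR LF

    // The search runs on bytes, so no UTF decoding happens here.
    auto parts = raw.findSplit(delim);
    if (parts[1].length == 0)
        throw new Exception("header/payload delimiter not found");

    // Reinterpret the header bytes as text, then validate them.
    auto header = cast(string) parts[0];
    validate(header); // throws std.utf.UTFException on invalid UTF-8

    return Message(header, parts[2]);
}
```

With this layout, a payload such as `[0xFF, 0xFE, 0x00]` passes through untouched even though it is not valid UTF-8, while a malformed header is still rejected.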

--
August 14, 2015
https://issues.dlang.org/show_bug.cgi?id=14919

Martin Nowak <code@dawg.eu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Hardware|x86_64                      |All
                 OS|Linux                       |All

--
August 18, 2015
https://issues.dlang.org/show_bug.cgi?id=14919

--- Comment #3 from Martin Nowak <code@dawg.eu> ---
The transition could be done in the following order over several releases:

1. mark UTFException as `deprecated("use UTFError instead")` and add
   `alias UTFError = UTFException`, so UTFError remains an Exception for now;
   use UTFError in all validations

2. make UTFError an Error
   change all text reading functions (e.g. byLine) to eager validation

3. replace validations and UTFError with asserts
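
The three steps above could look roughly like the following sketch. It is not the actual std.utf code. Note one deliberate deviation: instead of deprecating the class and aliasing the new name to it, the sketch declares the class under the new name and keeps the old name as a deprecated alias, which has the same effect for users. The helpers `readValidated`, `isValidUTF` and `countCodePoints` are hypothetical names introduced for illustration.

```d
import std.utf : validate, StdUTFException = UTFException;

// Step 1 (sketch): the new name, still an Exception at this stage.
// Step 2 would change the base class to Error.
class UTFError : Exception
{
    this(string msg, string file = __FILE__, size_t line = __LINE__)
    {
        super(msg, file, line);
    }
}

// The old name stays usable for a while but triggers a deprecation message.
deprecated("use UTFError instead")
alias UTFException = UTFError;

// Step 2 (sketch): validate eagerly, once, where text enters the program.
string readValidated(const(ubyte)[] raw)
{
    auto text = cast(const(char)[]) raw;
    try
        validate(text);
    catch (StdUTFException e)
        throw new UTFError(e.msg);
    return text.idup;
}

// Helper used only in the contract below.
bool isValidUTF(const(char)[] s) nothrow
{
    try
    {
        validate(s);
        return true;
    }
    catch (Exception)
    {
        return false;
    }
}

// Step 3 (sketch): downstream code only asserts validity, so the check
// disappears when contracts are compiled out (e.g. with -release).
size_t countCodePoints(string s)
in (isValidUTF(s))
{
    size_t n;
    foreach (char c; s)          // iterate code units, no decoding
        if ((c & 0xC0) != 0x80)  // count every non-continuation byte
            ++n;
    return n;
}
```

The point of the design is that `countCodePoints` can skip decoding entirely because validity was already established at the input boundary.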

--
August 19, 2015
https://issues.dlang.org/show_bug.cgi?id=14919

Vladimir Panteleev <thecybershadow@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |thecybershadow@gmail.com
           See Also|                            |https://issues.dlang.org/sh
                   |                            |ow_bug.cgi?id=14519

--
December 17, 2022
https://issues.dlang.org/show_bug.cgi?id=14919

Iain Buclaw <ibuclaw@gdcproject.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P1                          |P4

--
December 13
https://issues.dlang.org/show_bug.cgi?id=14919

--- Comment #4 from dlangBugzillaToGithub <robert.schadek@posteo.de> ---
THIS ISSUE HAS BEEN MOVED TO GITHUB

https://github.com/dlang/dmd/issues/19028

DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB

--