April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #21 from Sobirari Muhomori <dfj1esp02@sneakemail.com> ---
(In reply to Vladimir Panteleev from comment #16)
> > Global opt-in for foreach is not feasible.
> 
> I agree - some libraries will expect one thing, and others another.

Libraries don't determine what data the program operates on; that depends on the program and its environment. An encoding mismatch also has large-scale consequences: the program crashes or corrupts data. How to behave in such cases is not something a library can decide; it's a property of the program as a whole. Since libraries can't decide this, they shouldn't, and so they can't have differing expectations on the matter. It's a per-program aspect.

--
April 29, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #22 from Marc Schütz <schuetzm@gmx.net> ---
(In reply to Vladimir Panteleev from comment #20)
> (In reply to Marc Schütz from comment #18)
> > Data with other (or unknown) encodings needs to be stored in `ubyte[]`.
> 
> Have you tried using ubyte[] to process ASCII text? It's horrible, you have to cast at every step, and nothing in std.string works even when it should.

For ASCII text, char[] is okay, because UTF-8 is a superset of ASCII.

But you're right for other encodings. That's why those need to be converted "at the border": to UTF-8 when read from a file, stdin, main() args, or env vars, and from UTF-8 to whatever is required on writing. Internally, they need to be UTFx encoded. This is the only sane way to handle different text encodings, IMO.
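
A minimal sketch of converting "at the border", assuming the input is known to be Latin-1 (the helper name `latin1ToUtf8` is hypothetical; `transcode` and `Latin1String` come from Phobos' std.encoding):

```d
import std.encoding : Latin1String, transcode;
import std.utf : validate;

/// Hypothetical border helper: bytes assumed to be Latin-1 become UTF-8.
string latin1ToUtf8(immutable(ubyte)[] raw)
{
    string utf8;
    // Reinterpret the raw bytes as Latin-1 code units, then transcode.
    transcode(cast(Latin1String) raw, utf8);
    return utf8;
}

unittest
{
    // "café" in Latin-1: 0xE9 is 'é', which is two bytes (C3 A9) in UTF-8.
    immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9];
    string text = latin1ToUtf8(raw);
    validate(text); // throws UTFException if the result were not valid UTF-8
    assert(text == "café");
    assert(text.length == 5);
}
```

After this step the rest of the program only ever sees UTF-8 in `string`; the reverse transcoding would happen in the same way on output.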

--
April 30, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

Martin Nowak <code@dawg.eu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |code@dawg.eu

--- Comment #23 from Martin Nowak <code@dawg.eu> ---
(In reply to Vladimir Panteleev from comment #20)
> (In reply to Marc Schütz from comment #18)
> > Data with other (or unknown) encodings needs to be stored in `ubyte[]`.
> 
> Have you tried using ubyte[] to process ASCII text? It's horrible, you have to cast at every step, and nothing in std.string works even when it should.

No one is suggesting you operate on ubyte[] as string.
What people are saying is that you should validate a ubyte[] array before
converting it to a string. This is, by the way, how readText works. You'll have
to cast raw data to string to get strings with invalid UTF.
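
A rough sketch of that validate-then-convert step (the helper name `toValidatedString` is made up for illustration; `std.utf.validate` is the Phobos function that throws on invalid UTF):

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: returns the bytes as a string only if they are valid UTF-8.
string toValidatedString(immutable(ubyte)[] raw)
{
    auto s = cast(string) raw; // reinterpretation only: no copy, no checking
    validate(s);               // throws UTFException on invalid UTF-8
    return s;
}

unittest
{
    import std.exception : assertThrown;

    immutable(ubyte)[] good = [0x68, 0x69];       // "hi"
    immutable(ubyte)[] bad  = [0x68, 0xFF, 0x69]; // 0xFF never occurs in UTF-8
    assert(toValidatedString(good) == "hi");
    assertThrown!UTFException(toValidatedString(bad));
}
```

Skipping the `validate` call and keeping only the cast is exactly how a `string` with invalid UTF comes into existence.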

--
April 30, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #24 from Martin Nowak <code@dawg.eu> ---
If we validate encoding on data entry points such as readText or byLine, then decoding errors should be assertions rather than silent replacements, because it's a programming error to use unvalidated data as string.
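
To illustrate the two policies being contrasted, here is a sketch using existing Phobos decoders (the wrapper names `decodeStrict` and `decodeLenient` are hypothetical). Note that `std.utf.decode` reports bad input with an exception, not an assertion, so this shows loud failure versus silent replacement rather than the exact mechanism proposed:

```d
import std.algorithm.searching : canFind;
import std.array : array;
import std.utf : byDchar, decode, UTFException;

/// Strict: decode the code point starting at `index`; throws on invalid input.
dchar decodeStrict(string s, size_t index)
{
    return decode(s, index);
}

/// Lenient: decode everything, substituting U+FFFD for invalid sequences.
dstring decodeLenient(string s)
{
    return s.byDchar.array.idup;
}

unittest
{
    import std.exception : assertThrown;

    // Unvalidated data forced into a string: 'a', an invalid byte, 'b'.
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    auto s = cast(string) raw;

    assert(decodeStrict(s, 0) == 'a');
    assertThrown!UTFException(decodeStrict(s, 1)); // loud failure
    assert(decodeLenient(s).canFind('\uFFFD'));     // silent replacement
}
```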

--
May 02, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #25 from Vladimir Panteleev <thecybershadow@gmail.com> ---
(In reply to Martin Nowak from comment #24)
> If we validate encoding on data entry points such as readText or byLine, then decoding errors should be assertions rather than silent replacements, because it's a programming error to use unvalidated data as string.

Although I think this approach is acceptable (as long as the program halts regardless of compilation flags, which shouldn't be a problem), I would like to note that there are situations in which it is impractical to either convert or validate the data. One example is implementations of text-based network protocols (e.g. HTTP, NNTP, SMTP). Here, neither converting everything to UTF-8 nor verifying that it is valid UTF-8 works, because text-based protocols often embed raw binary data. The program only needs to parse the ASCII text parts, so the ideal solution would be a string handling library which never decodes UTF-8 (something D doesn't have).

--
May 02, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #26 from Vladimir Panteleev <thecybershadow@gmail.com> ---
(In reply to Sobirari Muhomori from comment #21)
> Libraries don't determine on which data the program operates, it depends on the program and its environment, encoding mismatch has large scale consequence too: program crashes or corrupts data, libraries don't decide how to behave in such cases, it's a property of the program as a whole. Since they can't decide how to behave in such cases, they shouldn't decide and thus can't have different expectations on this matter, it's a per-program aspect.

No. Almost nothing is a per-program aspect. A program may contain within itself a large number of big components, each functioning more or less independently, each of which might have been a single program or even a collection of programs. If something prevents you from designing such a system, this indicates underlying encapsulation flaws. Global changes of behavior such as the one you are proposing can affect a component which is used by a second component, which is used by a third component, etc., and something along that line is likely to expect failures to occur in a predictable way.

--
May 05, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #27 from Sobirari Muhomori <dfj1esp02@sneakemail.com> ---
If you want to request definite behavior in a fine-grained manner, that's always possible with configurable decoders; they would ignore the default behavior if necessary.

--
May 09, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #28 from Martin Nowak <code@dawg.eu> ---
(In reply to Vladimir Panteleev from comment #25)
> Here, neither converting everything to UTF-8 or verifying that it is valid UTF-8 works, because
> text-based protocols often embed raw binary data. The program only needs to
> parse the ASCII text parts, so the ideal solution would be a string handling
> library which never decodes UTF-8 (something D doesn't have).

Yes, and you would be better off to handle such protocols as ubyte.

--
May 16, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #29 from Vladimir Panteleev <thecybershadow@gmail.com> ---
(In reply to Martin Nowak from comment #28)
> Yes, and you would be better off to handle such protocols as ubyte.

What do you mean? Aren't you contradicting yourself from when you wrote:

> No one is suggesting you operate on ubyte[] as string.

?

--
July 17, 2015
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #30 from Martin Nowak <code@dawg.eu> ---
(In reply to Vladimir Panteleev from comment #29)
> (In reply to Martin Nowak from comment #28)
> > Yes, and you would be better off to handle such protocols as ubyte.
> 
> What do you mean? Aren't you contradicting yourself from when you wrote:
> 
> > No one is suggesting you operate on ubyte[] as string.
> 
> ?

Well, because they contain delimited binary and ASCII data, you'll have to find those delimiters, then validate the ASCII part and cast it to a string; after that you can use std.string functions.
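
A sketch of those steps for an HTTP-like message, assuming a blank line (`\r\n\r\n`) separates the ASCII header from a binary payload; the helper name `asciiHeader` and the sample message are made up for illustration:

```d
import std.algorithm.searching : all, findSplit, startsWith;
import std.exception : enforce;
import std.string : representation, splitLines;

/// Hypothetical helper: finds the header/payload delimiter in raw bytes,
/// checks that the header is pure ASCII, and only then exposes it as text.
const(char)[] asciiHeader(const(ubyte)[] message)
{
    auto parts = message.findSplit("\r\n\r\n".representation);
    enforce(parts[1].length != 0, "no header/payload delimiter found");
    enforce(parts[0].all!(b => b < 0x80), "header is not pure ASCII");
    return cast(const(char)[]) parts[0]; // safe: ASCII is valid UTF-8
}

unittest
{
    immutable(ubyte)[] payload = [0x89, 0xFF, 0x00]; // not valid UTF-8
    auto message =
        "Content-Type: image/png\r\nContent-Length: 3\r\n\r\n".representation
        ~ payload;

    // From here on, ordinary string functions work on the header.
    auto lines = asciiHeader(message).splitLines();
    assert(lines.length == 2);
    assert(lines[0].startsWith("Content-Type:"));
}
```

The payload stays in `ubyte[]` throughout and is never decoded as text.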

--