May 30, 2016
On Thu, 26 May 2016 16:23:16 -0700,
"H. S. Teoh via Digitalmars-d"
<digitalmars-d@puremagic.com> wrote:

> On Thu, May 26, 2016 at 12:00:54PM -0400, Andrei Alexandrescu via Digitalmars-d wrote: [...]
> > s.walkLength
> > s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> > s.count!(c => c >= 32) // non-control characters
> 
> Question: what should count return, given a string containing (1)
> combining diacritics, or (2) Korean text? Or (3) zero-width spaces?
> 
> 
> > Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.
> 
> The problem is that often such decisions can only be made by the user, because it depends on what the user wants to accomplish.  What should count return, given some Unicode string?  If the user wants to determine the size of a buffer (e.g., to store a string minus some characters to be stripped), then count should return the byte count. If the user wants to count the number of matching visual characters, then count should return the number of graphemes. If the user wants to determine the visual width of the (filtered) string, then count should not be used at all, but instead a font metric algorithm.  (I can't think of a practical use case where you'd actually need to count code points(!).)

Hey, I was about to answer exactly the same. It reminds me
that a few years ago I proposed making string iteration
explicit by code unit, code point and grapheme in "Rust",
and there was virtually no debate about doing it: to enable
people to write correct code they need to understand a bit
of Unicode and pick the right primitive. If you don't know
which one to pick, you look it up.
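
To make the ambiguity concrete, here is a quick sketch (my
own illustration, not from Teoh's post) of how the counts
diverge depending on the level you iterate at:

import std.range : walkLength;
import std.utf : byCodeUnit;
import std.uni : byGrapheme;
import std.stdio : writeln;

void main()
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT:
    // one grapheme, two code points, three UTF-8 code units.
    string s = "e\u0301";

    writeln(s.byCodeUnit.walkLength);  // 3 (code units)
    writeln(s.walkLength);             // 2 (code points, auto-decoded)
    writeln(s.byGrapheme.walkLength);  // 1 (grapheme)
}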

-- 
Marco

May 30, 2016
On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
> D1 -> D2 was a vastly more disruptive change than getting rid of auto-decoding would be.

Don't be so sure. All string handling code would become broken, even if it appears to work at first.
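
To illustrate (a contrived sketch, not anyone's real code): anything that treats walkLength as a "character count" keeps compiling but silently changes meaning once iteration stops decoding.

import std.range : walkLength;
import std.utf : byCodeUnit;

void main()
{
    string s = "résumé";
    assert(s.walkLength == 6);            // today: code points (auto-decoded)
    assert(s.byCodeUnit.walkLength == 8); // what a code-unit range would give
}
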
May 30, 2016
I like "make string iteration explicit" but I wonder about other constructs. E.g. What about "sort an array of strings"? How would you tell a generic sort function whether you want it to interpret strings by code unit vs code point vs grapheme?

May 30, 2016
On Monday, 30 May 2016 at 16:03:03 UTC, Marco Leise wrote:

> *** http://site.icu-project.org/home#TOC-What-is-ICU-

I was actually talking about ICU with a colleague today. Could it be that Unicode itself is broken? I've often heard criticism of Unicode but never looked into it.
May 30, 2016
On 05/30/2016 12:25 PM, Nick Sabalausky wrote:
> On 05/29/2016 09:58 PM, Jack Stouffer wrote:
>>
>> The problem is not active users. The problem is companies who have > 10K
>> LOC and libraries that are no longer maintained. E.g. It took
>> Sociomantic eight years after D2's release to switch only a few parts of
>> their projects to D2. With the loss of old libraries/old code (even old
>> answers on SO), all of a sudden you lose a lot of the network effect
>> that makes programming languages much more useful.
>>
>
> D1 -> D2 was a vastly more disruptive change than getting rid of
> auto-decoding would be.

It was also made at a time when the community was smaller by a couple orders of magnitude. -- Andrei

May 30, 2016
On 05/30/2016 12:34 PM, Jack Stouffer wrote:
> On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
>> D1 -> D2 was a vastly more disruptive change than getting rid of
>> auto-decoding would be.
>
> Don't be so sure. All string handling code would become broken, even if
> it appears to work at first.

That kind of makes this thread less productive than "How to improve autodecoding?" -- Andrei
May 30, 2016
On Monday, 30 May 2016 at 14:35:03 UTC, Seb wrote:
> That's a great idea - the compiler should also issue deprecation warnings when I try to do things like:

I don't agree with changing those. Indexing and slicing a char[] is really useful, and it's actually not hard to do correctly (at least with regard to handling code units). Besides, it'd be a much bigger change than the library transition.
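
For example, std.utf.stride is enough to slice on valid boundaries without decoding anything; a quick sketch of the kind of thing I mean:

import std.utf : stride;

void main()
{
    string s = "héllo";

    // stride returns the width in code units of the code point
    // starting at the given index, so this cut is always valid.
    auto rest = s[stride(s, 0) .. $];  // "éllo"
    assert(rest.length == s.length - 1);
}
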
May 30, 2016
On Monday, 30 May 2016 at 17:14:47 UTC, Andrew Godfrey wrote:
> I like "make string iteration explicit" but I wonder about other constructs. E.g. What about "sort an array of strings"? How would you tell a generic sort function whether you want it to interpret strings by code unit vs code point vs grapheme?

The comparison predicate does that...

import std.algorithm.sorting : sort;
sort!((string a, string b) {
  // interpret a and b here (code units, code points, or graphemes)
  // and return whether a should sort before b
  return a < b;
})(["hi", "there"]);
May 30, 2016
On 30-May-2016 21:24, Andrei Alexandrescu wrote:
> On 05/30/2016 12:34 PM, Jack Stouffer wrote:
>> On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
>>> D1 -> D2 was a vastly more disruptive change than getting rid of
>>> auto-decoding would be.
>>
>> Don't be so sure. All string handling code would become broken, even if
>> it appears to work at first.
>
> That kind of makes this thread less productive than "How to improve
> autodecoding?" -- Andrei

1. Generalize to all ranges of code units, i.e. ranges of char/wchar.

2. Operating on code units explicitly would then always involve a step through ubyte/byte (see the sketch below).
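
By way of illustration (my own sketch of what that explicit step could look like with today's Phobos, not part of the proposal), std.string.representation already gives you the ubyte view:

import std.string : representation;
import std.algorithm.searching : count;

void main()
{
    string s = "hëllo";

    // Explicit code-unit processing: reinterpret the char[] as
    // immutable(ubyte)[] and work on raw UTF-8 units, no decoding.
    immutable(ubyte)[] units = s.representation;
    auto asciiUnits = units.count!(b => b < 0x80);  // 4 of the 6 units are ASCII
}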

-- 
Dmitry Olshansky
May 30, 2016
On Monday, 30 May 2016 at 16:03:03 UTC, Marco Leise wrote:
> When on the other hand you work with real world international text, you'll want to work with graphemes.

Actually, my main rule of thumb is: don't mess with strings. Get them from the user, store them without modification, spit them back out again. Wherever possible, don't do anything more.

But if you do have to implement the rest, eh, it still depends on what you're doing. If I want an ellipsis, for example, I like to take font size into account too - basically, I do a dry run of the whole font render to get the length in pixels, then slice off the partial grapheme...

So yeah that's kinda complicated...
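
For the "slice off the partial grapheme" step, the kind of helper I have in mind looks roughly like this (illustrative only; the pixel measurement itself comes from whatever your font library reports):

import std.uni : graphemeStride;

// Given a code-unit cut-off (e.g. where the dry-run render says the
// text no longer fits), back it up to the nearest preceding grapheme
// boundary so we never cut a grapheme in half.
size_t snapToGraphemeBoundary(string s, size_t cutoff)
{
    assert(cutoff <= s.length);
    size_t i = 0;
    while (i < cutoff)
    {
        immutable next = i + graphemeStride(s, i);
        if (next > cutoff) break;
        i = next;
    }
    return i;
}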