June 02, 2016
On 06/02/2016 10:26 PM, Andrei Alexandrescu wrote:
> The goal is to operate on code units. -- Andrei

You sure you got the right word there? The code unit is the smallest building block. A code point is encoded with one or more code units.

Also, if you mean code points, that's where people disagree. Operating on code points by default is seen as not particularly useful.
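
To make the distinction concrete, a minimal sketch (assuming the status quo auto-decoding behavior; `walkLength` is from `std.range`):

```d
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    string s = "ö"; // one code point (U+00F6), two UTF-8 code units
    writeln(s.length);     // 2 -- .length counts code units
    writeln(s.walkLength); // 1 -- auto-decoding iterates code points
}
```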
June 02, 2016
On 02.06.2016 22:29, Andrei Alexandrescu wrote:
> On 06/02/2016 04:22 PM, cym13 wrote:
>>
>> A:“We should decode to code points”
>> B:“No, decoding to code points is a stupid idea.”
>> A:“No it's not!”
>> B:“Can you show a concrete example where it does something useful?”
>> A:“Sure, look at that!”
>> B:“This isn't working at all, look at all those counter-examples!”
>> A:“It may not work for your examples but look how easy it is to
>>     find code points!”
>
> With autodecoding all of std.algorithm operates correctly on code
> points. Without it all it does for strings is gibberish. -- Andrei

No, without it, it operates correctly on code units.
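
For instance, substring search already works at the code-unit level because UTF-8 is self-synchronizing. A minimal sketch (`representation`, from `std.string`, exposes the raw code units as `immutable(ubyte)[]`):

```d
import std.algorithm.searching : find;
import std.stdio : writeln;
import std.string : representation;

void main()
{
    // Byte-wise search over code units finds exactly the right match,
    // with no decoding anywhere.
    auto hay = "hëllö".representation;
    auto match = hay.find("ö".representation);
    writeln(match.length); // 2 -- the two code units of 'ö' remain
}
```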
June 02, 2016
On 6/2/2016 9:02 AM, Adam D. Ruppe wrote:
> Which gave us the list of places inside Phobos to fix, only about two hours of
> work, and proved that the version() method was viable (and REALLY easy to
> implement).

Nothing prevents anyone from doing that on their own (it's trivial) in order to find Phobos problems, and pick one or three to fix.
June 02, 2016
On Thursday, 2 June 2016 at 20:13:14 UTC, Andrei Alexandrescu wrote:
> On 06/02/2016 03:34 PM, tsbockman wrote:
>> Your 'ö' examples will NOT work reliably with auto-decoded code points,
>> and for nearly the same reason that they won't work with code units; you
>> would have to use byGrapheme.
>
> They do work per spec: find this code point. It would be surprising if 'ö' were found but the string were positioned at a different code point.

Your examples will pass or fail depending on how (and whether) the 'ö' grapheme is normalized. They only ever succeed because 'ö' happens to be one of the privileged graphemes that *can* be (but often isn't!) represented as a single code point. Many other graphemes have no such representation.

Working directly with code points is sometimes useful anyway - but then, working with code units can be, also. Neither will lead to inherently "correct" Unicode processing, and in the absence of a compelling context, your examples fall completely flat as an argument for the inherent superiority of processing at the code point level.
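
To make the normalization point concrete, a minimal sketch (using `normalize`, `NFC`, and `NFD` from `std.uni`; both strings below render as the same 'ö' grapheme):

```d
import std.algorithm.searching : canFind;
import std.stdio : writeln;
import std.uni : normalize, NFC, NFD;

void main()
{
    string composed   = "o\u0308".normalize!NFC; // single code point U+00F6
    string decomposed = "ö".normalize!NFD;       // 'o' + U+0308 combining diaeresis

    // A code-point search sees only one of the two canonically
    // equivalent forms:
    writeln(composed.canFind('ö'));   // true
    writeln(decomposed.canFind('ö')); // false
}
```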

>> The fact that you still don't get that, even after a dozen plus attempts
>> by the community to explain the difference, makes you unfit to direct
>> Phobos' Unicode support.
>
> Well there's gotta be a reason why my basic comprehension is under constant scrutiny whereas yours is safe.

Who said mine is safe? I *know* that I'm not qualified to be in charge of this.

Your comprehension is under greater scrutiny because you are proposing to overrule nearly all other active contributors combined.

>> Please, either go study Unicode until you
>> really understand it, or delegate this issue to someone else.
>
> Would be happy to. To whom would I delegate?

If you're serious, I would suggest Dmitry Olshansky. He seems to be our top Unicode expert, based on his contributions to `std.uni` and `std.regex`. But if he is unwilling or unsuitable for some reason, there are other candidates participating in this thread (not me).
June 02, 2016
On 06/02/2016 04:33 PM, ag0aep6g wrote:
> Operating on code points by default is seen as not particularly useful.

By whom? The "support level 1" folks yonder at the Unicode standard? :o) -- Andrei
June 02, 2016
On Thursday, 2 June 2016 at 20:30:34 UTC, Andrei Alexandrescu wrote:
> On 06/02/2016 04:23 PM, ag0aep6g wrote:
>> People are arguing that auto-decoding to code points is not useful.
>
> And want to return to the point where char[] is but an indiscriminate array, which would take std.algorithm back to the stone age. -- Andrei

Just make RCStr the most amazing string type of any standard library ever and everyone will be happy :o)
June 02, 2016
On 06/02/2016 04:36 PM, tsbockman wrote:
> Your examples will pass or fail depending on how (and whether) the 'ö'
> grapheme is normalized.

And that's fine. Want graphemes? .byGrapheme wags its tail in that corner. Otherwise, you work on code points, which is a completely meaningful way to go about things. What's not meaningful is the random results you get from operating on code units.

> They only ever succeeds because 'ö' happens to
> be one of the privileged graphemes that *can* be (but often isn't!)
> represented as a single code point. Many other graphemes have no such
> representation.

Then there's no dchar for them, so no problem to start with.

s.find(c) ----> "Find code point c in string s"
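
Concretely, a minimal sketch (the 'ö' here is deliberately built from two code points, so there is no single dchar for it):

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    string s = "o\u0308"; // 'o' + combining diaeresis, renders as 'ö'
    writeln(s.length);                // 3 -- code units
    writeln(s.walkLength);            // 2 -- code points; no single dchar is 'ö'
    writeln(s.byGrapheme.walkLength); // 1 -- grapheme, via .byGrapheme
}
```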


Andrei

June 02, 2016
On 06/02/2016 04:37 PM, default0 wrote:
> On Thursday, 2 June 2016 at 20:30:34 UTC, Andrei Alexandrescu wrote:
>> On 06/02/2016 04:23 PM, ag0aep6g wrote:
>>> People are arguing that auto-decoding to code points is not useful.
>>
>> And want to return to the point where char[] is but an indiscriminate
>> array, which would take std.algorithm back to the stone age. -- Andrei
>
> Just make RCStr the most amazing string type of any standard library
> ever and everyone will be happy :o)

Soon as this thread ends. -- Andrei
June 02, 2016
On Thursday, 2 June 2016 at 20:32:39 UTC, Walter Bright wrote:
> The first step is to adjust Phobos implementations and documentation so they do not rely on autodecoding.

The compiler can help you with that. That's the point of the "do not merge" PR: it got an actionable list out of the compiler and proved the way forward was viable.
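
Roughly, the mechanism (a hypothetical sketch - the version identifier is invented for illustration, not the actual PR's flag):

```d
// If std.range.primitives only defines the decoding front for narrow
// strings under the default version, then compiling Phobos with
// -version=NoAutodecode turns every site that treats a string as a
// range of dchar into a compile error -- an actionable list of fixes.
version (NoAutodecode)
{
    // No front/popFront/empty for char[]/wchar[]:
    // `"abc".front` simply won't compile.
}
else
{
    // Status quo: narrow strings auto-decode to dchar.
    @property dchar front(C)(const(C)[] str)
        if (is(C == char) || is(C == wchar))
    {
        import std.utf : decode;
        size_t i = 0;
        return decode(str, i);
    }
}
```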
June 02, 2016
On Thursday, 2 June 2016 at 20:36:12 UTC, Andrei Alexandrescu wrote:
> On 06/02/2016 04:33 PM, ag0aep6g wrote:
>> Operating on code points by default is seen as not particularly useful.
>
> By whom? The "support level 1" folks yonder at the Unicode standard? :o) -- Andrei

From the standard:

> Level 1 support works well in many circumstances. However, it does not handle more complex languages or extensions to the Unicode Standard very well. Particularly important cases are surrogates, canonical equivalence, word boundaries, grapheme boundaries, and loose matches. (For more information about boundary conditions, see The Unicode Standard, Section 5.15.)
>
> Level 2 support matches much more what user expectations are for sequences of Unicode characters. It is still locale independent and easily implementable. However, the implementation may be slower when supporting Level 2, and some expressions may require Level 1 matches. Thus it is usually required to have some sort of syntax that will turn Level 2 support on and off.

To me, that doesn't sound like much of an endorsement of defaulting to only Level 1 support - "it does not handle more complex languages or extensions to the Unicode Standard very well".