The Case Against Autodecode (page 35)

On 06/02/2016 10:30 PM, Andrei Alexandrescu wrote:
> And want to return to the point where char[] is but an indiscriminated
> array, which would take std.algorithm back to the stone age. -- Andrei

I think you'd have to substantiate how that would be worse than auto-decoding.

Your examples only show that treating code points as characters falls apart at a higher level than treating code units as characters. But it still falls apart. Failing early is a quality.

On 06/02/2016 04:47 PM, tsbockman wrote:
> That doesn't sound like much of an endorsement for defaulting to only
> level 1 support to me - "it does not handle more complex languages or
> extensions to the Unicode Standard very well".

Code point/Level 1 support sounds like a sweet spot between efficiency/complexity and conviviality. Level 2 is opt-in with byGrapheme. -- Andrei
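
For readers following along, the three granularities being argued over (code units, code points, graphemes) can be seen on one short string. This is only a sketch, assuming a Phobos version in which auto-decoding is still the default for `string`; the helper names `codeUnits`, `codePoints`, and `graphemes` are made up for illustration and are not Phobos APIs.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Number of UTF-8 code units (bytes) in s. Hypothetical helper.
size_t codeUnits(string s) { return s.length; }

/// Number of code points: walkLength auto-decodes a string to dchar.
/// Hypothetical helper.
size_t codePoints(string s) { return s.walkLength; }

/// Number of graphemes (user-perceived characters), opt-in via byGrapheme.
/// Hypothetical helper.
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT; renders as one character.
    string s = "e\u0301";

    assert(codeUnits(s) == 3);  // 1 byte for 'e' + 2 bytes for U+0301
    assert(codePoints(s) == 2); // 'e' and the combining mark
    assert(graphemes(s) == 1);  // one user-perceived character
}
```

The same string gives three different "lengths", which is the crux of the disagreement over which level a default should pick.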

On 02.06.2016 22:28, Andrei Alexandrescu wrote:
> On 06/02/2016 04:12 PM, Timon Gehr wrote:
>> It is not meaningful to compare utf-8 and utf-16 code units directly.
>
> But it is meaningful to compare Unicode code points. -- Andrei
>

It is also meaningful to compare two utf-8 code units or two utf-16 code units.

On 06/02/2016 04:47 PM, ag0aep6g wrote:
> On 06/02/2016 10:30 PM, Andrei Alexandrescu wrote:
>> And want to return to the point where char[] is but an indiscriminated
>> array, which would take std.algorithm back to the stone age. -- Andrei
>
> I think you'd have to substantiate how that would be worse than
> auto-decoding.

I gave a long list of std.algorithm uses that perform virtually randomly on char[].

> Your examples only show that treating code points as characters falls
> apart at a higher level than treating code units as characters. But it
> still falls apart. Failing early is a quality.

It does not fall apart for code points.

Andrei

On 06/02/2016 04:50 PM, Timon Gehr wrote:
> On 02.06.2016 22:28, Andrei Alexandrescu wrote:
>> On 06/02/2016 04:12 PM, Timon Gehr wrote:
>>> It is not meaningful to compare utf-8 and utf-16 code units directly.
>>
>> But it is meaningful to compare Unicode code points. -- Andrei
>>
>
> It is also meaningful to compare two utf-8 code units or two utf-16 code
> units.

By decoding them of course. -- Andrei
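
To make the comparison under discussion concrete, here is a minimal sketch, assuming default auto-decoding in Phobos. The helpers `sameCodeUnits` and `sameCodePoints` are hypothetical names introduced for illustration; only `equal` and `byCodeUnit` come from Phobos.

```d
import std.algorithm.comparison : equal;
import std.utf : byCodeUnit;

/// Compares raw code units across encodings, with no decoding.
/// Hypothetical helper.
bool sameCodeUnits(string a, wstring b)
{
    return equal(a.byCodeUnit, b.byCodeUnit);
}

/// Compares decoded code points; equal auto-decodes both strings.
/// Hypothetical helper.
bool sameCodePoints(string a, wstring b)
{
    return equal(a, b);
}

void main()
{
    string  utf8  = "\u00E9";  // U+00E9 as UTF-8: 0xC3 0xA9
    wstring utf16 = "\u00E9"w; // U+00E9 as UTF-16: 0x00E9

    assert(utf8.length == 2);
    assert(utf16.length == 1);

    // Mixing UTF-8 and UTF-16 code units gives the wrong answer here...
    assert(!sameCodeUnits(utf8, utf16));
    // ...while comparing decoded code points gives the expected one.
    assert(sameCodePoints(utf8, utf16));

    // For pure ASCII the code units happen to coincide.
    assert(sameCodeUnits("abc", "abc"w));
}
```

Comparing two UTF-8 ranges (or two UTF-16 ranges) code unit by code unit is a separate case that this sketch does not exercise.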

On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote:
> By whom? The "support level 1" folks yonder at the Unicode standard? :o)
> -- Andrei

Do they say that level 1 should be the default, and do they give a rationale for that? Would you kindly link or quote that?

On 06/02/2016 04:52 PM, ag0aep6g wrote:
> On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote:
>> By whom? The "support level 1" folks yonder at the Unicode standard? :o)
>> -- Andrei
>
> Do they say that level 1 should be the default, and do they give a
> rationale for that? Would you kindly link or quote that?

No, but that sounds agreeable to me, especially since it breaks no code of ours. We really should document this better. Kudos to Walter for finding all that Level 1 support.

Andrei

On 6/2/2016 1:46 PM, Adam D. Ruppe wrote:
> The compiler can help you with that. That's the point of the do not merge PR: it
> got an actionable list out of the compiler and proved the way forward was viable.

What is supposed to be done with "do not merge" PRs other than close them?

On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:
> What is supposed to be done with "do not merge" PRs other than close them?

Experimentally iterate until something workable comes about. This way it's done publicly and people can collaborate.

June 02, 2016

Re: The Case Against Autodecode

Posted by tsbockman
in reply to Andrei Alexandrescu

On Thursday, 2 June 2016 at 20:49:52 UTC, Andrei Alexandrescu wrote:
> On 06/02/2016 04:47 PM, tsbockman wrote:
>> That doesn't sound like much of an endorsement for defaulting to only
>> level 1 support to me - "it does not handle more complex languages or
>> extensions to the Unicode Standard very well".
>
> Code point/Level 1 support sounds like a sweet spot between efficiency/complexity and conviviality. Level 2 is opt-in with byGrapheme. -- Andrei

Actually, according to the document Walter Bright linked, level 1 does NOT operate at the code point level:

> Level 1: Basic Unicode Support. At this level, the regular expression engine provides support for Unicode characters as basic 16-bit logical units. (This is independent of the actual serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, or UTF-32.)
> ...
> Level 1 support works well in many circumstances. However, it does not handle more complex languages or extensions to the Unicode Standard very well. Particularly important cases are **surrogates** ...

So, level 1 appears to be UTF-16 code units, not code points. To do code points it would have to recognize surrogates, which are specifically mentioned as not supported.

Level 2 skips straight to graphemes, and there is no code point level.

However, this document is very old - from Unicode 3.0 and the year 2000:

> While there are no surrogate characters in Unicode 3.0 (outside of private use characters), future versions of Unicode will contain them...

Perhaps level 1 has since been redefined?
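
The surrogate issue raised in the quoted passage can be reproduced in D with a character outside the Basic Multilingual Plane. A rough sketch, assuming `wstring` auto-decodes as it does in current Phobos; the helper names `utf16CodeUnits` and `codePoints` are hypothetical and exist only for this illustration.

```d
import std.range : walkLength;

/// Number of UTF-16 code units in w. Hypothetical helper.
size_t utf16CodeUnits(wstring w) { return w.length; }

/// Number of code points: walkLength auto-decodes surrogate pairs.
/// Hypothetical helper.
size_t codePoints(wstring w) { return w.walkLength; }

void main()
{
    // U+1F600 lies outside the Basic Multilingual Plane,
    // so UTF-16 needs a surrogate pair to encode it.
    wstring w = "\U0001F600"w;

    assert(utf16CodeUnits(w) == 2); // two 16-bit units
    assert(w[0] == 0xD83D);         // high surrogate
    assert(w[1] == 0xDE00);         // low surrogate
    assert(codePoints(w) == 1);     // one code point once the pair is decoded
}
```

An engine that treats each 16-bit unit as a character would see two "characters" here, which is the limitation the quoted text describes.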
