June 02, 2016
On 6/2/16 5:20 PM, deadalnix wrote:
> The good thing when you define "works" by whatever it does right now

No, it works as it was designed. -- Andrei
June 02, 2016
On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:
> Nope, that's a radically different matter. As the examples show, the
> examples would be entirely meaningless at code unit level.

They're simply not possible. Won't compile. There is no single UTF-8 code unit for 'ö', so you can't (easily) search for it in a range of code units. Just as there is no single code point for 'a⃗', you can't search for it in a range of code points.

You can still search for 'a', and 'o', and the rest of ASCII in a range of code units.
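For illustration, a minimal sketch of that distinction (assuming std.utf.byCodeUnit from a current Phobos; the string contents are made up):

    import std.algorithm.searching : canFind;
    import std.utf : byCodeUnit;

    void main()
    {
        string s = "höme";

        // ASCII is unambiguous at the code unit level: 'h' is a single
        // UTF-8 code unit, so a code unit search finds it.
        assert(s.byCodeUnit.canFind('h'));

        // 'ö' (U+00F6) occupies two code units (0xC3 0xB6), so it can only
        // be matched at the code point level, i.e. on the autodecoded range.
        assert(s.canFind(cast(dchar) 'ö'));
    }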
June 02, 2016
On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:
> On 6/2/2016 12:34 PM, deadalnix wrote:
>> On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
>>> Pretty much everything. Consider s and s1 string variables with possibly
>>> different encodings (UTF8/UTF16).
>>>
>>> * s.all!(c => c == 'ö') works only with autodecoding. It always returns
>>> false without.
>>>
>>
>> False. Many characters can be represented by different sequences of codepoints.
>> For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. ö is
>> one such character.
>
> There are 3 levels of Unicode support. What Andrei is talking about is Level 1.
>
> http://unicode.org/reports/tr18/tr18-5.1.html
>
> I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.

To be able to convert back and forth between legacy encodings and Unicode in a lossless manner.
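A minimal sketch of two canonically equivalent spellings of 'ö' and how std.uni.normalize maps between them (the strings are just examples):

    import std.uni;

    void main()
    {
        string precomposed = "\u00F6";   // 'ö' as a single code point
        string decomposed  = "o\u0308";  // 'o' + COMBINING DIAERESIS

        // Canonically equivalent, but not equal as sequences of code
        // units or code points.
        assert(precomposed != decomposed);

        // Normalization maps each spelling onto the other's form.
        assert(normalize!NFC(decomposed) == precomposed);
        assert(normalize!NFD(precomposed) == decomposed);
    }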

June 02, 2016
On 06/02/2016 11:24 PM, ag0aep6g wrote:
> They're simply not possible. Won't compile. There is no single UTF-8
> code unit for 'ö', so you can't (easily) search for it in a range of
> code units. Just as there is no single code point for 'a⃗', you can't
> search for it in a range of code points.
>
> You can still search for 'a', and 'o', and the rest of ASCII in a range
> of code units.

I'm ignoring combining characters there. You can search for 'a' in code units in the same way that you can search for 'ä' in code points: more or less, depending on how serious you are about combining characters.
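A minimal sketch of that caveat (the decomposed spelling is chosen deliberately; names are just for the example):

    import std.algorithm.searching : canFind;

    void main()
    {
        // 'ä' spelled as 'a' + COMBINING DIAERESIS (the decomposed form).
        string s = "na\u0308her";

        // A code point search for the precomposed 'ä' (U+00E4) misses it...
        assert(!s.canFind(cast(dchar) '\u00E4'));

        // ...and a code point search for plain 'a' "finds" the base letter
        // of the combined character instead. Code unit searches for ASCII
        // have exactly the same blind spot.
        assert(s.canFind(cast(dchar) 'a'));
    }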
June 02, 2016
On 6/2/16 5:24 PM, ag0aep6g wrote:
> On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:
>> Nope, that's a radically different matter. As the examples show, the
>> examples would be entirely meaningless at code unit level.
>
> They're simply not possible. Won't compile.

They do compile.

> There is no single UTF-8
> code unit for 'ö', so you can't (easily) search for it in a range of
> code units.

Of course you can. Can you search for an int in a short[]? Oh yes you can. Can you search for a dchar in a char[]? Of course you can. Autodecoding also gives it meaning.

> Just like there is no single code point for 'a⃗' so you can't
> search for it in a range of code points.

Of course you can.

> You can still search for 'a', and 'o', and the rest of ASCII in a range
> of code units.

You can search for a dchar in a char[] because you can compare an individual dchar with either another dchar (correct, autodecoding) or with a char (incorrect, no autodecoding).

As I said: this thread produces an unpleasant amount of arguments in favor of autodecoding. Even I don't like that :o).
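A minimal sketch of the comparisons being made (types spelled out; the data is made up):

    import std.algorithm.searching : canFind;
    import std.utf : byCodeUnit;

    void main()
    {
        // Searching an int in a short[]: each short is promoted and compared.
        short[] values = [1, 2, 3];
        int intNeedle = 2;
        assert(values.canFind(intNeedle));

        // Searching a dchar in a char[]: with autodecoding each decoded
        // code point is compared against the dchar, so 'ö' is found.
        string s = "höme";
        dchar needle = 'ö';
        assert(s.canFind(needle));

        // Without autodecoding the elements are single code units; the
        // comparison is between a char and a dchar and never matches here.
        assert(!s.byCodeUnit.canFind(needle));
    }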


Andrei

June 02, 2016
On 02.06.2016 23:20, deadalnix wrote:
>
> The sample code won't count every instance of the grapheme 'ö', as some of
> its encodings won't be counted, which definitely counts as "doesn't work".

It also has false positives (you can combine 'ö' with some combining character in order to get some strange character that is not an 'ö', and not even NFC helps with that).
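A minimal sketch of such a false positive (the combining mark is picked only because it has no precomposed form with 'ö'):

    import std.algorithm.searching : canFind;
    import std.uni;

    void main()
    {
        // 'ö' followed by COMBINING RIGHT ARROW ABOVE: one grapheme that is
        // not an 'ö', yet it contains the code point U+00F6.
        string s = "\u00F6\u20D7";

        // A code point level search reports a hit: a false positive.
        assert(s.canFind(cast(dchar) '\u00F6'));

        // NFC does not help: there is no precomposed character to merge
        // into, so the sequence is left as is.
        assert(normalize!NFC(s) == s);
    }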
June 02, 2016
On 6/2/16 5:23 PM, Timon Gehr wrote:
> On 02.06.2016 22:51, Andrei Alexandrescu wrote:
>> On 06/02/2016 04:50 PM, Timon Gehr wrote:
>>> On 02.06.2016 22:28, Andrei Alexandrescu wrote:
>>>> On 06/02/2016 04:12 PM, Timon Gehr wrote:
>>>>> It is not meaningful to compare utf-8 and utf-16 code units directly.
>>>>
>>>> But it is meaningful to compare Unicode code points. -- Andrei
>>>>
>>>
>>> It is also meaningful to compare two utf-8 code units or two utf-16 code
>>> units.
>>
>> By decoding them of course. -- Andrei
>>
>
> That makes no sense, I cannot decode single code units.
>
> BTW, I guess the reason why char converts to wchar converts to dchar is
> that the lower half of code units in char and the lower half of code
> units in wchar are code points. Maybe code units and code points with
> low numerical values should have distinct types.

Then you lost me. (I'm sure you're making a good point.) -- Andrei
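A minimal sketch of the conversion chain in question (the code unit values are just for illustration):

    void main()
    {
        char  c = 'a';  // UTF-8 code unit 0x61
        wchar w = c;    // implicit char -> wchar; 0x61 is also a code point
        dchar d = w;    // implicit wchar -> dchar
        assert(d == 'a');

        // The conversion is purely numeric. It is only correct because the
        // ASCII range of UTF-8 code units coincides with the code points
        // U+0000..U+007F. For a lead byte like 0xC3 (first code unit of 'ö')
        // the same implicit conversion silently yields the unrelated code
        // point U+00C3 ('Ã').
        char lead = "ö"[0];
        dchar wrong = lead;
        assert(wrong == '\u00C3');
    }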
June 02, 2016
On 6/2/16 5:27 PM, Andrei Alexandrescu wrote:
> On 6/2/16 5:24 PM, ag0aep6g wrote:
>> Just like there is no single code point for 'a⃗' so you can't
>> search for it in a range of code points.
>
> Of course you can.

Correx, indeed you can't. -- Andrei
June 02, 2016
On Thursday, 2 June 2016 at 21:07:19 UTC, default0 wrote:
> The level 2 support description noted that it should be opt-in because it's slow.

1) It does not say that level 2 should be opt-in; it says that level 2 should be toggleable. Nowhere does it say which of levels 1 and 2 should be the default.

2) It says that working with graphemes is slower than working with UTF-16 code UNITS (level 1), but it says nothing about streaming decoding of code POINTS (which is what we have).

3) That document is from 2000, and its claims about performance are surely long outdated anyway. Computers and the Unicode standard have both changed a great deal since then.
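For what it's worth, the three levels line up like this in today's Phobos (a minimal sketch; the string is an arbitrary example):

    import std.range.primitives : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        // 'ö' precomposed, 'ë' spelled as 'e' + COMBINING DIAERESIS.
        string s = "n\u00F6e\u0308l";

        assert(s.byCodeUnit.walkLength == 7);  // level: UTF-8 code units
        assert(s.walkLength == 5);             // level: code points (autodecoding)
        assert(s.byGrapheme.walkLength == 4);  // level: graphemes
    }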

June 02, 2016
On 06/02/2016 11:27 PM, Andrei Alexandrescu wrote:
> On 6/2/16 5:24 PM, ag0aep6g wrote:
>> On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:
>>> Nope, that's a radically different matter. As the examples show, the
>>> examples would be entirely meaningless at code unit level.
>>
>> They're simply not possible. Won't compile.
>
> They do compile.

Yes, you're right, of course they do. char implicitly converts to dchar. I didn't think of that anti-feature.

> As I said: this thread produces an unpleasant amount of arguments in
> favor of autodecoding. Even I don't like that :o).

It's more of an argument against char : dchar, I'd say.
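A minimal sketch of what that implicit conversion does to the original example (assuming std.utf.byCodeUnit; the string is made up):

    import std.algorithm.searching : all;
    import std.utf : byCodeUnit;

    void main()
    {
        string s = "öö";

        // With autodecoding, c is a dchar and the comparison is meaningful.
        assert(s.all!(c => c == cast(dchar) 'ö'));

        // Without autodecoding, c is a single UTF-8 code unit. The predicate
        // still compiles because char implicitly converts to dchar, but it
        // quietly compares code units against a code point; no code unit of
        // "öö" equals U+00F6, so the result is false.
        assert(!s.byCodeUnit.all!(c => c == cast(dchar) 'ö'));
    }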