June 02, 2016
On 6/2/16 5:43 PM, Timon Gehr wrote:
>
> .̂ ̪.̂
>
> (Copy-paste it somewhere else, I think it might not be rendered
> correctly on the forum.)
>
> The point is that if I do:
>
> ".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."),Grapheme(",")])
>
> no match is returned.
>
> If I use your method with dchars, I will get spurious matches. I.e. the
> suggested method to look for punctuation symbols is incorrect:
>
> writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂"

Nice example.
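The spurious match is reproducible at the code-point level in any language; here is a sketch in Python (not D, but the Unicode facts are language-independent — the combining marks are U+0302 and U+032A, matching the string above):

```python
import unicodedata

# The string: '.', U+0302 COMBINING CIRCUMFLEX ACCENT, ' ',
# U+032A COMBINING BRIDGE BELOW, '.', U+0302
s = ".\u0302 \u032a.\u0302"

# NFC has no precomposed form for any of these sequences, so
# normalization leaves the string unchanged
assert unicodedata.normalize("NFC", s) == s

# A code-point-level search still "finds" '.', even though no
# grapheme in s is a plain full stop
print("." in s)  # True -- the spurious match described above
```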

> (Also, do you have a use case for this?)

Count delimited words. Did you also look at balancedParens?


Andrei
June 02, 2016
On Thursday, 2 June 2016 at 21:38:02 UTC, default0 wrote:
> On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote:
>> 1) It does not say that level 2 should be opt-in; it says that level 2 should be toggle-able. Nowhere does it say which of level 1 and 2 should be the default.
>>
>> 2) It says that working with graphemes is slower than UTF-16 code UNITS (level 1), but says nothing about streaming decoding of code POINTS (what we have).
>>
>> 3) That document is from 2000, and its claims about performance are surely extremely out-dated, anyway. Computers and the Unicode standard have both changed much since then.
>
> 1) Right because a special toggleable syntax is definitely not "opt-in".

It is not "opt-in" unless it is toggled off by default. The only reason it doesn't talk about toggling in the level 1 section is that the section is written with the assumption that many programs will *only* support level 1.

> 2) Several people in this thread noted that working on graphemes is way slower (which makes sense, because it's yet another processing step you need to do after decoding - therefore more work - therefore slower) than working on code points.

And working on code points is way slower than working on code units (the actual level 1).

> 3) Not an argument - doing more work makes code slower.

What do you think I'm arguing for? It's not graphemes-by-default.

What I actually want to see: permanently deprecate the auto-decoding range primitives. Force the user to explicitly specify whichever of `by!dchar`, `byCodePoint`, or `byGrapheme` their specific algorithm actually needs. Removing the implicit conversions between `char`, `wchar`, and `dchar` would also be nice, but isn't really necessary I think.

That would be a standards-compliant solution (one of several possible). What we have now is non-standard, at least going by the old version Walter linked.
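The "force the caller to pick a level" idea has analogues elsewhere; in Python, for instance, the code-unit/code-point distinction is already explicit in the bytes vs str types (a sketch of the analogy, not D code):

```python
s = "\u00f6"  # 'ö', a single code point (U+00F6)

# The caller chooses the iteration level explicitly:
code_units = list(s.encode("utf-8"))   # UTF-8 code units
code_points = [ord(c) for c in s]      # code points

print(code_units)   # [195, 182]
print(code_points)  # [246]
```

Grapheme-level iteration still requires an explicit extra step (segmentation per UAX #29), mirroring `byGrapheme`.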
June 02, 2016
On 6/2/2016 1:12 PM, Timon Gehr wrote:
> On 02.06.2016 22:07, Walter Bright wrote:
>> On 6/2/2016 12:05 PM, Andrei Alexandrescu wrote:
>>> * s.all!(c => c == 'ö') works only with autodecoding. It returns
>>> always false
>>> without.
>>
>> The o is inferred as a wchar. The lambda then is inferred to return a
>> wchar.
>
> No, the lambda returns a bool.

Thanks for the correction.


>> The algorithm can check that the input is char[], and is being
>> tested against a wchar. Therefore, the algorithm can specialize to do
>> the decoding itself.
>>
>> No autodecoding necessary, and it does the right thing.
>
> It still would not be the right thing. The lambda shouldn't compile. It is not
> meaningful to compare utf-8 and utf-16 code units directly.

Yes, you have a good point. But we do allow things like:

   byte b;
   if (b == 10000) ...
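The byte/int comparison and the code-unit/code-point comparison differ in a way a quick sketch can show (Python here, since the value relationships are language-independent): widening a small integer preserves its value, while a UTF-8 code unit is not the code point it helps encode.

```python
# Widening a byte to int preserves the value, so b == 10000 is a
# well-defined (if always-false) comparison:
b = 42
assert (b == 10000) is False

# A UTF-8 code unit, however, is not the code point it encodes:
unit = "\u00f6".encode("utf-8")[0]  # 0xC3, first code unit of 'ö'
assert unit == 0xC3
assert unit != 0xF6  # comparing it to the code point is meaningless
```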

June 02, 2016
On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:
> The lambda returns bool. -- Andrei

Yes, I was wrong about that. But the point still stands with:

> * s.balancedParens('〈', '〉') works only with autodecoding.
> * s.canFind('ö') works only with autodecoding. It returns always false without.

Can be made to work without autodecoding.

June 03, 2016
On 02.06.2016 23:46, Andrei Alexandrescu wrote:
> On 6/2/16 5:43 PM, Timon Gehr wrote:
>>
>> .̂ ̪.̂
>>
>> (Copy-paste it somewhere else, I think it might not be rendered
>> correctly on the forum.)
>>
>> The point is that if I do:
>>
>> ".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."),Grapheme(",")])
>>
>> no match is returned.
>>
>> If I use your method with dchars, I will get spurious matches. I.e. the
>> suggested method to look for punctuation symbols is incorrect:
>>
>> writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂"
>
> Nice example.
> ...

Thanks! :o)

>> (Also, do you have a use case for this?)
>
> Count delimited words. Did you also look at balancedParens?
>
>
> Andrei


On 02.06.2016 22:01, Timon Gehr wrote:
>
>> * s.balancedParens('〈', '〉') works only with autodecoding.
>> ...
>
> Doesn't work, e.g. s="⟨⃖". Shouldn't compile.

assert("⟨⃖".normalize!NFC.byGrapheme.balancedParens(Grapheme("⟨"),Grapheme("⟩")));

writeln("⟨⃖".balancedParens('⟨','⟩')); // false
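The code-point-level view of that string can be sketched in Python (an illustration of the same counts, not D): the lone '⟨' code unit looks like an unbalanced bracket, even though the user-perceived text is a single grapheme cluster that is not a bracket at all.

```python
s = "\u27e8\u20d6"  # '⟨' followed by U+20D6 COMBINING LEFT ARROW ABOVE

# Code-point level: one opening bracket, no closing one -> "unbalanced",
# which is why balancedParens returns false under autodecoding
print(s.count("\u27e8"), s.count("\u27e9"))  # 1 0
```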


June 02, 2016
On Thursday, 2 June 2016 at 21:51:51 UTC, tsbockman wrote:
> On Thursday, 2 June 2016 at 21:38:02 UTC, default0 wrote:
>> On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote:
>>> 1) It does not say that level 2 should be opt-in; it says that level 2 should be toggle-able. Nowhere does it say which of level 1 and 2 should be the default.
>>>
>>> 2) It says that working with graphemes is slower than UTF-16 code UNITS (level 1), but says nothing about streaming decoding of code POINTS (what we have).
>>>
>>> 3) That document is from 2000, and its claims about performance are surely extremely out-dated, anyway. Computers and the Unicode standard have both changed much since then.
>>
>> 1) Right because a special toggleable syntax is definitely not "opt-in".
>
> It is not "opt-in" unless it is toggled off by default. The only reason it doesn't talk about toggling in the level 1 section is that the section is written with the assumption that many programs will *only* support level 1.
>

*sigh* Reading comprehension. Needing to write .byGrapheme or similar to enable the behaviour qualifies as what that description was arguing for. I hope you understand that now that I am repeating this for you.

>> 2) Several people in this thread noted that working on graphemes is way slower (which makes sense, because it's yet another processing step you need to do after decoding - therefore more work - therefore slower) than working on code points.
>
> And working on code points is way slower than working on code units (the actual level 1).
>

Never claimed the opposite. Do note, however, that it's specifically talking about UTF-16 code units.

>> 3) Not an argument - doing more work makes code slower.
>
> What do you think I'm arguing for? It's not graphemes-by-default.

Unrelated. I was refuting the point you made about the relevance of the performance claims of the unicode level 2 support description, not evaluating your hypothetical design. Please do not take what I say out of context, thank you.

June 03, 2016
Am Thu, 2 Jun 2016 15:05:44 -0400
schrieb Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org>:

> On 06/02/2016 01:54 PM, Marc Schütz wrote:
> > Which practical tasks are made possible (and work _correctly_) if you decode to code points, that don't already work with code units?
> 
> Pretty much everything.
>
> s.all!(c => c == 'ö')

Andrei, your ignorance is really starting to grind on everyone's nerves. If after 350 posts you still don't see why this is incorrect: s.any!(c => c == 'o'), you must be actively skipping the informational content of this thread.
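The spurious match alluded to here is easy to reproduce at the code-point level; a Python sketch (using NFD to make the decomposed form explicit):

```python
import unicodedata

s = unicodedata.normalize("NFD", "\u00f6")  # 'ö' -> 'o' + U+0308

# A code-point search matches a plain 'o' the user never typed:
print("o" in s)  # True
print([hex(ord(c)) for c in s])  # ['0x6f', '0x308']
```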

You are in error, no one agrees with you, and you refuse to see it and in the end we have to assume you will make a decisive vote against any PR with the intent to remove auto-decoding from Phobos.

Your so called vocal minority is actually D's panel of Unicode experts who understand that auto-decoding is a false ally and should be on the deprecation track.

Remember final-by-default? You promised that your objection about breaking code meant D2 would only continue to be fixed in a backwards-compatible way, be it the implementation of shared or whatever else. Yet months later you opened a thread with the title "inout must go". So that must have been an appeasement back then. People don't forget these things easily, and RCStr seems to be a similar distraction, considering we haven't looked into borrowing/scoped enough and you promise wonders from it.

-- 
Marco

June 03, 2016
On 02.06.2016 23:56, Walter Bright wrote:
> On 6/2/2016 1:12 PM, Timon Gehr wrote:
>> ...
>> It is not
>> meaningful to compare utf-8 and utf-16 code units directly.
>
> Yes, you have a good point. But we do allow things like:
>
>     byte b;
>     if (b == 10000) ...
>

Well, this is a somewhat different case, because 10000 is just not representable as a byte. Every value that fits in a byte fits in an int though.

It's different for code units. They are incompatible both ways: dchar obviously does not fit in a char, and while the lower half of char is compatible with dchar, the upper half is specific to the encoding. dchar cannot represent upper-half char code units; you get the code points with the corresponding values instead.

E.g.:

void main(){
    import std.stdio,std.utf;
    foreach(dchar d;"ö".byCodeUnit)
        writeln(d); // "Ã", "¶"
}
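The same mojibake can be reproduced in Python, which makes the mechanism explicit: each UTF-8 code unit of U+00F6, reinterpreted as a code point, happens to be a Latin-1 character.

```python
units = "\u00f6".encode("utf-8")  # b'\xc3\xb6', the two UTF-8 code units

# Treating each code unit as a code point gives the same two characters
# the D snippet prints:
print([chr(u) for u in units])  # ['Ã', '¶']
```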

June 02, 2016
On Thursday, 2 June 2016 at 22:03:01 UTC, default0 wrote:
> *sigh* reading comprehension.
> ...
> Please do not take what I say out of context, thank you.

Earlier you said:

> The level 2 support description noted that it should be opt-in because its slow.

My main point is simply that you mischaracterized what the standard says. Making level 1 opt-in, rather than level 2, would be just as compliant as the reverse. The standard makes no suggestion as to which should be default.
June 03, 2016
On 02.06.2016 23:29, Andrei Alexandrescu wrote:
> On 6/2/16 5:23 PM, Timon Gehr wrote:
>> On 02.06.2016 22:51, Andrei Alexandrescu wrote:
>>> On 06/02/2016 04:50 PM, Timon Gehr wrote:
>>>> On 02.06.2016 22:28, Andrei Alexandrescu wrote:
>>>>> On 06/02/2016 04:12 PM, Timon Gehr wrote:
>>>>>> It is not meaningful to compare utf-8 and utf-16 code units directly.
>>>>>
>>>>> But it is meaningful to compare Unicode code points. -- Andrei
>>>>>
>>>>
>>>> It is also meaningful to compare two utf-8 code units or two utf-16
>>>> code
>>>> units.
>>>
>>> By decoding them of course. -- Andrei
>>>
>>
>> That makes no sense, I cannot decode single code units.
>>
>> BTW, I guess the reason why char converts to wchar converts to dchar is
>> that the lower half of code units in char and the lower half of code
>> units in wchar are code points. Maybe code units and code points with
>> low numerical values should have distinct types.
>
> Then you lost me. (I'm sure you're making a good point.) -- Andrei

Basically:

bool bad(char c,dchar d){ return c==d; } // ideally shouldn't compile
bool good(char c,char d){ return c==d; } // should compile
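A loose analogue of this distinction, sketched in Python rather than D: there the unit/point split shows up as bytes elements (ints) vs str elements, and the two never compare equal at runtime, mirroring the wish that bad() not compile.

```python
# unit vs point: a bytes element is an int, a str element is a str,
# and they never compare equal -- a runtime analogue of rejecting bad()
assert (b"\xc3"[0] == "\xc3") is False

# unit vs unit is fine, mirroring good()
assert b"a"[0] == b"a"[0]
```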