March 08, 2014
On 3/7/14, 2:26 PM, H. S. Teoh wrote:
> This illustrates one of my objections to Andrei's post: by auto-decoding
> behind the user's back and hiding the intricacies of unicode from him,
> it has masked the fact that codepoint-for-codepoint comparison of a
> unicode string is not guaranteed to always return the correct results,
> due to the possibility of non-normalized strings.
>
> Basically, to have correct behaviour in all cases, the user must be
> aware of, and use, the Unicode collation / normalization algorithms
> prescribed by the Unicode standard.

Which is a reasonable thing to ask for.
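
For instance, a minimal sketch of the explicit normalize-then-compare step being asked for, assuming std.uni.normalize (which defaults to NFC) behaves as documented:

import std.algorithm : canFind;
import std.uni : normalize;

void main()
{
    string s = "casse\u0301";          // "cassé" with a combining acute (NFD)
    // A code-point search for the precomposed character misses it on the NFD string:
    assert(!s.canFind('\u00E9'));
    // Normalizing the string to NFC first makes the search succeed:
    assert(s.normalize.canFind('\u00E9'));
}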

Andrei
March 08, 2014
On 3/7/14, 4:39 PM, Vladimir Panteleev wrote:
>> s.canFind('é')
>> s.endsWith('é')
>> s.find('é')
>> s.count('é')
>> s.countUntil('é')
>
> These should not compile post-change, because the sought element (dchar)
> is not of the same type as the string. So they will not fail silently.

The compared element need not have the same type (otherwise we'd break some other code).
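
(Presumably the kind of existing code meant here is a search where the needle is a dchar but the haystack is a char string, e.g. this sketch:)

import std.algorithm : canFind;

void main()
{
    string s = "hello, world";
    dchar needle = 'w';            // element type differs from the string's char
    assert(s.canFind(needle));     // accepted today via decoding to dchar
}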

Andrei


March 08, 2014
On Saturday, 8 March 2014 at 01:23:27 UTC, Andrei Alexandrescu wrote:
> Yup, the grapheme issue. This should work.

No. It does not work, because grapheme segmentation is not the same thing as normalization. Even if you fix the code (it should be: assert(s.byGrapheme.canFind!"a[] == b"("é"))), it will not work, because byGrapheme does not normalize (and not all graphemes can be normalized to a single code point anyway). And there is more than one kind of normalization - you need to choose the one that fits what you're trying to achieve.
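
For completeness, a minimal sketch of a grapheme-level search that does hold up, assuming std.uni's normalize and byGrapheme behave as documented (normalize both sides to the same form first, then compare Grapheme values):

import std.algorithm : canFind;
import std.uni : byGrapheme, normalize;

void main()
{
    string s = "casse\u0301";                           // "cassé" in decomposed form
    auto needle = "\u00E9".normalize.byGrapheme.front;  // the grapheme for 'é'
    // Normalize the haystack to the same form before segmenting into graphemes:
    assert(s.normalize.byGrapheme.canFind(needle));
}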

> Graphemes are the next level of Nirvana above code points, but that doesn't mean it's graphemes or nothing.

It's not about types, it's about algorithms. It's never "X or nothing" - unless X is "actually understanding Unicode". Everything else is a compromise.

Compromises are acceptable, but not when they are built into the language as the standard way of working with text, thus hiding the problems that come with them.
March 08, 2014
On Saturday, 8 March 2014 at 01:38:39 UTC, Andrei Alexandrescu wrote:
> On 3/7/14, 4:39 PM, Vladimir Panteleev wrote:
>>> s.canFind('é')
>>> s.endsWith('é')
>>> s.find('é')
>>> s.count('é')
>>> s.countUntil('é')
>>
>> These should not compile post-change, because the sought element (dchar)
>> is not of the same type as the string. So they will not fail silently.
>
> The compared element need not have the same type (otherwise we'd break some other code).

Do you think such code will appear often in practice? Even if the type is a dchar, in some cases the programmer may not have intended to do decoding (e.g. the "dchar" type was the result of type deduction from .front or something like that).
March 08, 2014
On Saturday, 8 March 2014 at 01:41:01 UTC, Vladimir Panteleev wrote:
> On Saturday, 8 March 2014 at 01:38:39 UTC, Andrei Alexandrescu wrote:
>> On 3/7/14, 4:39 PM, Vladimir Panteleev wrote:
>>> These should not compile post-change, because the sought element (dchar)
>>> is not of the same type as the string. So they will not fail silently.
>>
>> The compared element need not have the same type (otherwise we'd break some other code).
>
> Do you think such code will appear often in practice? Even if the type is a dchar, in some cases the programmer may not have intended to do decoding (e.g. the "dchar" type was a result of type deduction form .front OSLT).

Sorry, I see now that you were referring to algorithms in general. I think adding a temporary warning for character types only, as with .front, would be appropriate...
March 08, 2014
Vladimir Panteleev:

> It's not about types, it's about algorithms.

Given sufficiently refined types, it can be about types :-)

Bye,
bearophile
March 08, 2014
08-Mar-2014 05:23, Andrei Alexandrescu wrote:
> On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
>> On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
>>> On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>>>> No, it doesn't.
>>>>
>>>> import std.algorithm;
>>>>
>>>> void main()
>>>> {
>>>>    auto s = "cassé";
>>>>    assert(s.canFind('é'));
>>>> }
>>>>
>>>
>>> Hm, I'm not following? Works perfectly fine on my system?
>>
>> Something's messing with your Unicode. Try downloading and compiling
>> this file:
>> http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
>
> Yup, the grapheme issue. This should work.
>
> import std.algorithm, std.uni;
>
> void main()
> {
>      auto s = "cassé";
>      assert(s.byGrapheme.canFind('é'));
> }
>
> It doesn't compile, seems like a library bug.

Because Graphemes do not auto-magically convert to dchar and back? After all, they are just small strings.

>
> Graphemes are the next level of Nirvana above code points, but that
> doesn't mean it's graphemes or nothing.
>
>
> Andrei
>


-- 
Dmitry Olshansky
March 08, 2014
08-Mar-2014 12:09, Dmitry Olshansky wrote:
> 08-Mar-2014 05:23, Andrei Alexandrescu wrote:
>> On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
>>> On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
>>>> On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>>>>> No, it doesn't.
>>>>>
>>>>> import std.algorithm;
>>>>>
>>>>> void main()
>>>>> {
>>>>>    auto s = "cassé";
>>>>>    assert(s.canFind('é'));
>>>>> }
>>>>>
>>>>
>>>> Hm, I'm not following? Works perfectly fine on my system?
>>>
>>> Something's messing with your Unicode. Try downloading and compiling
>>> this file:
>>> http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
>>
>> Yup, the grapheme issue. This should work.
>>
>> import std.algorithm, std.uni;
>>
>> void main()
>> {
>>      auto s = "cassé";
>>      assert(s.byGrapheme.canFind('é'));
>> }
>>
>> It doesn't compile, seems like a library bug.
>
> Because Graphemes do not auto-magically convert to dchar and back? After
> all, they are just small strings.
>
>>
>> Graphemes are the next level of Nirvana above code points, but that
>> doesn't mean it's graphemes or nothing.
>>

Plus it won't help matters: you need both "é" and "cassé" to have the same normalization.
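
Concretely (a minimal sketch, assuming std.uni.normalize):

import std.uni;

void main()
{
    string composed   = "caf\u00E9";     // 'é' as one precomposed code point
    string decomposed = "cafe\u0301";    // 'e' followed by a combining acute
    assert(composed != decomposed);                              // raw code units differ
    assert(composed.normalize!NFC == decomposed.normalize!NFC);  // equal once both are NFC
}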


-- 
Dmitry Olshansky
March 08, 2014
08-Mar-2014 05:18, Andrei Alexandrescu wrote:
> On 3/7/14, 12:48 PM, Dmitry Olshansky wrote:
>> 07-Mar-2014 23:57, Andrei Alexandrescu wrote:
>>> On 3/6/14, 6:37 PM, Walter Bright wrote:
>>>> In "Lots of low hanging fruit in Phobos" the issue came up about the
>>>> automatic encoding and decoding of char ranges.
>>> [snip]
>>>
>>> Allow me to enumerate the functions of std.algorithm and how they work
>>> today and how they'd work with the proposed change. Let s be a variable
>>> of some string type.
>>
>> Special case was wrong though - special casing arrays of char[] and
>> throwing all other ranges of char out the window. The amount of code to
>> support this schizophrenia is enormous.
>
> I think this is a confusion. The code in e.g. std.algorithm is
> specialized for efficiency of stuff that already works.

Well, I've said it elsewhere - the specialization was too fine-grained. Either it's generic or it doesn't work.

>
>>> Making strings bidirectional ranges has been a very good choice within
>>> the constraints. There was already a string type, and that was
>>> immutable(char)[], and a bunch of code depended on that definition.
>>
>> Trying to make it work by blowing a hole in the generic range concept
>> now seems like it wasn't worth it.
>
> I disagree. Also what hole?

Let's say we keep it.
Yesterday I had to write constraints like this:

if ((isNarrowString!Range && is(Unqual!(ElementEncodingType!Range) == wchar)) ||
    (isRandomAccessRange!Range && is(Unqual!(ElementType!Range) == wchar)))

Just to accept anything that works like an array of wchar, buffers and whatnot included.

I expect that this should have been enough:
isRandomAccessRange!Range && is(Unqual!(ElementType!Range) == wchar)

Or maybe introduce something to indicate any "DualRange" of narrow chars.
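
Something along these lines (the name and exact shape are only a sketch, not existing Phobos API) would at least let that intent be written once:

import std.range : ElementEncodingType, ElementType, isRandomAccessRange;
import std.traits : Unqual, isNarrowString;

// Hypothetical: bundle the two checks above into one named constraint.
enum isWcharLikeRange(Range) =
    (isNarrowString!Range && is(Unqual!(ElementEncodingType!Range) == wchar)) ||
    (isRandomAccessRange!Range && is(Unqual!(ElementType!Range) == wchar));

// A signature constraint then reads as a single idea:
// void process(Range)(Range r) if (isWcharLikeRange!Range) { ... }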

-- 
Dmitry Olshansky
March 08, 2014
On Saturday, 8 March 2014 at 02:04:12 UTC, bearophile wrote:
> Vladimir Panteleev:
>
>> It's not about types, it's about algorithms.
>
> Given sufficiently refined types, it can be about types :-)
>
> Bye,
> bearophile

I think Bear is onto something: we already solved an analogous problem in an elegant way.

see SortedRange with assumeSorted etc.
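
For instance, by analogy (purely a sketch - neither NormalizedString nor assumeNormalized exist in Phobos):

import std.uni : normalize;

// A wrapper whose invariant is "data is already in NFC", mirroring SortedRange.
struct NormalizedString
{
    string data;
    alias data this;
}

// Eagerly normalizes, like sort() returning a SortedRange.
NormalizedString normalized(string s)
{
    return NormalizedString(s.normalize);
}

// The caller vouches that the string is already normalized, like assumeSorted.
NormalizedString assumeNormalized(string s)
{
    return NormalizedString(s);
}

void main()
{
    auto a = "cafe\u0301".normalized;       // NFD input, normalized to NFC
    auto b = "caf\u00E9".assumeNormalized;  // already NFC
    assert(a == b);                         // comparison is now well-defined
}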

But for this to be convenient to use, I still think we should expand the current 'String Literal Postfix' types to include both normalization and graphemes.

Postfix   Type                 Aka
c         immutable(char)[]    string
w         immutable(wchar)[]   wstring
d         immutable(dchar)[]   dstring