March 08, 2014
On 3/8/14, 12:09 AM, Dmitry Olshansky wrote:
> 08-Mar-2014 05:23, Andrei Alexandrescu wrote:
>> On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
>>> On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
>>>> On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>>>>> No, it doesn't.
>>>>>
>>>>> import std.algorithm;
>>>>>
>>>>> void main()
>>>>> {
>>>>>    auto s = "cassé";
>>>>>    assert(s.canFind('é'));
>>>>> }
>>>>>
>>>>
>>>> Hm, I'm not following? Works perfectly fine on my system?
>>>
>>> Something's messing with your Unicode. Try downloading and compiling
>>> this file:
>>> http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
>>
>> Yup, the grapheme issue. This should work.
>>
>> import std.algorithm, std.uni;
>>
>> void main()
>> {
>>      auto s = "cassé";
>>      assert(s.byGrapheme.canFind('é'));
>> }
>>
>> It doesn't compile; seems like a library bug.
>
> Because Graphemes do not auto-magically convert to dchar and back? After
> all, they are just small strings.

Yah but I think they should support comparison with individual characters. No?
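
(A rough sketch of what that could look like; graphemeEquals is a hypothetical helper, not something in std.uni today:)

import std.uni;

// Hypothetical helper: a grapheme compares equal to a dchar only when
// it holds exactly one code point matching that dchar.
bool graphemeEquals(ref Grapheme g, dchar c)
{
    return g.length == 1 && g[0] == c;
}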

Andrei

March 08, 2014
On 3/8/14, 12:14 AM, Dmitry Olshansky wrote:
> 08-Mar-2014 12:09, Dmitry Olshansky wrote:
>> 08-Mar-2014 05:23, Andrei Alexandrescu wrote:
>>> On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
>>>> On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
>>>>> On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>>>>>> No, it doesn't.
>>>>>>
>>>>>> import std.algorithm;
>>>>>>
>>>>>> void main()
>>>>>> {
>>>>>>    auto s = "cassé";
>>>>>>    assert(s.canFind('é'));
>>>>>> }
>>>>>>
>>>>>
>>>>> Hm, I'm not following? Works perfectly fine on my system?
>>>>
>>>> Something's messing with your Unicode. Try downloading and compiling
>>>> this file:
>>>> http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
>>>
>>> Yup, the grapheme issue. This should work.
>>>
>>> import std.algorithm, std.uni;
>>>
>>> void main()
>>> {
>>>      auto s = "cassé";
>>>      assert(s.byGrapheme.canFind('é'));
>>> }
>>>
>>> It doesn't compile; seems like a library bug.
>>
>> Because Graphemes do not auto-magically convert to dchar and back? After
>> all, they are just small strings.
>>
>>>
>>> Graphemes are the next level of Nirvana above code points, but that
>>> doesn't mean it's graphemes or nothing.
>>>
>
> Plus it won't help matters: you need both "é" and "cassé" to have
> the same normalization.

Why? Couldn't the grapheme compare true with the character? I.e. the byGrapheme iteration normalizes on the fly.

Andrei

March 08, 2014
08-Mar-2014 19:32, Andrei Alexandrescu wrote:
> On 3/8/14, 12:09 AM, Dmitry Olshansky wrote:
>> 08-Mar-2014 05:23, Andrei Alexandrescu wrote:
>>> On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
>>>> On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
>>>>> On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>>>>>> No, it doesn't.
>>>>>>
>>>>>> import std.algorithm;
>>>>>>
>>>>>> void main()
>>>>>> {
>>>>>>    auto s = "cassé";
>>>>>>    assert(s.canFind('é'));
>>>>>> }
>>>>>>
>>>>>
>>>>> Hm, I'm not following? Works perfectly fine on my system?
>>>>
>>>> Something's messing with your Unicode. Try downloading and compiling
>>>> this file:
>>>> http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
>>>
>>> Yup, the grapheme issue. This should work.
>>>
>>> import std.algorithm, std.uni;
>>>
>>> void main()
>>> {
>>>      auto s = "cassé";
>>>      assert(s.byGrapheme.canFind('é'));
>>> }
>>>
>>> It doesn't compile; seems like a library bug.
>>
>> Because Graphemes do not auto-magically convert to dchar and back? After
>> all, they are just small strings.
>
> Yah but I think they should support comparison with individual
> characters. No?
>

We could add one. I don't think the Grapheme interface is optimal or set in stone.

The following should work as-is, though:

s.byGrapheme.canFind(Grapheme("é"))
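
(A self-contained version of that call; note it only passes when the "é" needle and the é inside s end up in the same normalization form in the source file:)

import std.algorithm, std.uni;

void main()
{
    auto s = "cassé";
    // Holds only if both literals use the same normalization form,
    // e.g. both precomposed.
    assert(s.byGrapheme.canFind(Grapheme("é")));
}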


> Andrei
>


-- 
Dmitry Olshansky
March 08, 2014
On Saturday, 8 March 2014 at 15:33:34 UTC, Andrei Alexandrescu wrote:
> Why? Couldn't the grapheme compare true with the character? I.e. the byGrapheme iteration normalizes on the fly.

Grapheme segmentation and normalization are distinct Unicode algorithms:

http://www.unicode.org/reports/tr15/
http://www.unicode.org/reports/tr29/

There are also several normalization algorithms.

http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
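
For example, canonically equivalent strings only compare equal after both are normalized to the same form:

import std.uni;

void main()
{
    string precomposed = "\u00E9";  // é as a single code point
    string decomposed  = "e\u0301"; // 'e' + combining acute accent

    assert(precomposed != decomposed);                // code points differ
    assert(normalize!NFC(decomposed) == precomposed); // equal after NFC
    assert(normalize!NFD(precomposed) == decomposed); // equal after NFD
}
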
March 08, 2014
08-Mar-2014 19:33, Andrei Alexandrescu wrote:
> On 3/8/14, 12:14 AM, Dmitry Olshansky wrote:
>> 08-Mar-2014 12:09, Dmitry Olshansky wrote:
>>> 08-Mar-2014 05:23, Andrei Alexandrescu wrote:
>>>> On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
>>>>> On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
>>>>>> On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>>>>>>> No, it doesn't.
>>>>>>>
>>>>>>> import std.algorithm;
>>>>>>>
>>>>>>> void main()
>>>>>>> {
>>>>>>>    auto s = "cassé";
>>>>>>>    assert(s.canFind('é'));
>>>>>>> }
>>>>>>>
>>>>>>
>>>>>> Hm, I'm not following? Works perfectly fine on my system?
>>>>>
>>>>> Something's messing with your Unicode. Try downloading and compiling
>>>>> this file:
>>>>> http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
>>>>
>>>> Yup, the grapheme issue. This should work.
>>>>
>>>> import std.algorithm, std.uni;
>>>>
>>>> void main()
>>>> {
>>>>      auto s = "cassé";
>>>>      assert(s.byGrapheme.canFind('é'));
>>>> }
>>>>
>>>> It doesn't compile; seems like a library bug.
>>>
>>> Because Graphemes do not auto-magically convert to dchar and back? After
>>> all, they are just small strings.
>>>
>>>>
>>>> Graphemes are the next level of Nirvana above code points, but that
>>>> doesn't mean it's graphemes or nothing.
>>>>
>>
>> Plus it won't help matters: you need both "é" and "cassé" to have
>> the same normalization.
>
> Why? Couldn't the grapheme compare true with the character?

Iff it consists of one code point, it technically may.

> I.e. the
> byGrapheme iteration normalizes on the fly.

Oh crap, please no. It's not only _slow_ but also horribly complicated (even in an off-line, eager version). Plus, there are 4 normalization forms, of which 2 are lossy.
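
E.g. the compatibility forms (NFKC/NFKD) are the lossy two; they flatten formatting distinctions for good:

import std.uni;

void main()
{
    // Compatibility decomposition discards the superscripts permanently:
    assert(normalize!NFKD("2¹⁰") == "210");
    // The canonical form leaves them alone:
    assert(normalize!NFD("2¹⁰") == "2¹⁰");
}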

You simply can't be serious about this one, though seeing that you introduced auto-decoding, it follows that you'd propose normalizing on the fly too :)


-- 
Dmitry Olshansky
March 08, 2014
On Saturday, 8 March 2014 at 16:00:38 UTC, Vladimir Panteleev wrote:
> On Saturday, 8 March 2014 at 15:33:34 UTC, Andrei Alexandrescu wrote:
>> Why? Couldn't the grapheme compare true with the character? I.e. the byGrapheme iteration normalizes on the fly.
>
> Grapheme segmentation and normalization are distinct Unicode algorithms:
>
> http://www.unicode.org/reports/tr15/
> http://www.unicode.org/reports/tr29/
>
> There are also several normalization algorithms.
>
> http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

How about this?

s.normalize!NFKD

To return a range of normalized code points?

Clearly, no definition of string can handle this natively. As you say, there are multiple algorithms, so there is no one 'right' answer. byGrapheme is useful, but doesn't and cannot solve the normalization issue.
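
For instance, the eager normalize already in std.uni can patch over the original example, at the cost of allocating copies (a sketch; the same form must be applied to both sides):

import std.algorithm, std.uni;

void main()
{
    auto s = "cassé"; // the é may be stored decomposed in the source
    // Bring haystack and needle to one form before searching:
    assert(normalize!NFC(s).canFind(normalize!NFC("é")));
}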

I feel this discussion is tangential to the main debate: whether strings should be ranges of code points or code units. By code unit is faster by default and simpler to implement in Phobos (no more special-casing). By code point works better when searching for individual code points, but as you rightly point out this might not be useful in practice, since one rarely searches for individual non-ASCII code points, and it isn't a complete solution anyway because of normalization.

There are a few problems with iterating by code unit:

1. Searching string/wstring for dchar fails silently. You have suggested making this a compilation error, but Andrei argues this would break lots of code. You counter that it's possible people rarely search for dchar anyway, so it may not matter.

2. It's a fundamental change. Regardless of which is better, we need to consider the impact of such a change.

3. Ranges of code units are random access and sliceable, which means algorithms such as sort will accept them and just produce garbage strings. Maybe this isn't an issue.
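
For example, the failure mode in point 3 can be simulated today by sorting the raw code units (a sketch):

import std.algorithm, std.string;

void main()
{
    char[] s = "cassé".dup;
    // Sorting the UTF-8 code units tears apart the two-byte sequence
    // for 'é' (0xC3 0xA9), leaving invalid UTF-8 behind.
    sort(s.representation);
    // std.utf.validate(s) would now throw a UTFException.
}
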
March 08, 2014
On Saturday, 8 March 2014 at 15:56:08 UTC, Dmitry Olshansky wrote:
> The following should work as is though:
>
> s.byGrapheme.canFind(Grapheme("é"))

Doesn't work here. Not sure why.

Grapheme(1000065, 3, 0, 33554432, [101, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 2) // last byGrapheme

vs.

Grapheme(E9, 0, 0, 16777216, [233, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 1) // Grapheme("é")
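
Those dumps decode to different code point sequences: the string's last grapheme is the decomposed pair 'e' (101) + U+0301, while Grapheme("é") holds the single precomposed code point U+00E9 (233). Since Grapheme equality compares the stored code points, the two canonically equivalent graphemes never match. A minimal reproduction:

import std.uni;

void main()
{
    auto decomposed  = Grapheme("e\u0301"); // 'e' + combining acute
    auto precomposed = Grapheme("\u00E9");  // precomposed é
    // Canonically equivalent, yet built from different code points:
    assert(decomposed != precomposed);
}
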
March 08, 2014
On 3/8/14, 8:08 AM, Dmitry Olshansky wrote:
> 08-Mar-2014 19:33, Andrei Alexandrescu wrote:
>> I.e. the
>> byGrapheme iteration normalizes on the fly.
>
> Oh crap, please no. It's not only _slow_ but also horribly
> complicated (even in an off-line, eager version). Plus, there are 4
> normalization forms, of which 2 are lossy.
>
> You simply can't be serious about this one, though seeing that you
> introduced auto-decoding, it follows that you'd propose normalizing on
> the fly too :)

Yah, just pushing my luck :o). I don't know much about graphemes and normalization, so leaving that stuff to you guys.

Andrei


March 08, 2014
On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
> Andrei suggests that this change would destroy D by breaking too much existing code. He might be right. Can we afford the risk that he is right?

Perhaps not.  But I think the current approach is totally broken; it just also happens to be what people have coded to.  Andrei used algorithms operating on a code point level as an example of what would break if this change were made, and in that he's absolutely correct.  But what bothers me is whether it's appropriate to perform this sort of character-based operation on Unicode strings in the first place.

The current approach is a cut above treating strings as arrays of bytes for some languages, and still utterly broken for others.  If I'm operating on a right-to-left language like Hebrew, what would I expect the result to be from something like countUntil?  And how useful would such a result be?  I'm inclined to say that the correct approach is to state that algorithms operate explicitly on a T.sizeof basis, and that if the data contained in a particular range has some multi-element encoding then separate, specialized routines should be used, since the plain T.sizeof behavior will not produce the desired result.
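
A sketch of that explicit, per-unit choice with today's Phobos (representation exposes the raw code units; byGrapheme is the specialized multi-element view):

import std.algorithm, std.string, std.uni;

void main()
{
    auto s = "cassé";
    // Explicit code-unit view: operate on ubyte, per T.sizeof.
    auto unitIndex = s.representation.countUntil(cast(ubyte) 's');
    // Explicit specialized view for multi-element encodings.
    auto graphemeIndex = s.byGrapheme.countUntil(Grapheme("s"));
}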

So the problem to me is that we're stuck not fixing something that's horribly broken just because it's broken in a way that people presumably now expect.  I'd personally like to see this fixed and I think the new behavior is preferable overall, but I do share Andrei's concern that such a big change might hurt the language anyway.
March 08, 2014
On Friday, 7 March 2014 at 20:27:38 UTC, H. S. Teoh wrote:
> 	s.indexOf("a");			// for slicing
> 	s.byCodepoint.countUntil("a");	// count code points
> 	s.byGrapheme.countUntil("a");	// count graphemes

(BTW, byGrapheme is currently missing from the std.uni docs)
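
(For reference, a runnable approximation of those three levels with current Phobos names; byCodepoint above is hypothetical, and a plain string range already decodes to code points:)

import std.algorithm, std.string, std.uni;

void main()
{
    auto s = "cassé!";
    auto u = s.indexOf('!');    // code-unit index, usable for slicing
    auto c = s.countUntil('!'); // counts code points (auto-decoding)
    auto g = s.byGrapheme.countUntil(Grapheme("!")); // counts graphemes
}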