March 08, 2014 Re: Major performance problem with std.array.front() | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On 3/7/14, 2:26 PM, H. S. Teoh wrote:
> This illustrates one of my objections to Andrei's post: by auto-decoding
> behind the user's back and hiding the intricacies of unicode from him,
> it has masked the fact that codepoint-for-codepoint comparison of a
> unicode string is not guaranteed to always return the correct results,
> due to the possibility of non-normalized strings.
>
> Basically, to have correct behaviour in all cases, the user must be
> aware of, and use, the Unicode collation / normalization algorithms
> prescribed by the Unicode standard.
Which is a reasonable thing to ask for.
Andrei
|
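Editorial sketch of the point quoted above, assuming std.uni's `normalize` from current Phobos; the helper name `containsNormalized` is made up for illustration and is not part of any library. It shows that two spellings of "cassé" differ code point for code point, and that bringing both sides to the same normalization form makes the search succeed.

```d
import std.algorithm.searching : canFind;
import std.uni;

/// Hypothetical helper: searches after bringing both operands to NFC.
bool containsNormalized(string haystack, string needle)
{
    return normalize!NFC(haystack).canFind(normalize!NFC(needle));
}

void main()
{
    string precomposed = "cass\u00E9";  // é as the single code point U+00E9
    string decomposed  = "casse\u0301"; // e followed by U+0301 COMBINING ACUTE ACCENT

    // Code point comparison treats the two spellings as different text.
    assert(precomposed != decomposed);
    assert(!decomposed.canFind('\u00E9'));

    // Once both sides are in the same form, the search matches.
    assert(normalize!NFC(decomposed) == precomposed);
    assert(containsNormalized(decomposed, "\u00E9"));
}
```

NFC is only one of the available forms; which one is appropriate depends on what the comparison is meant to achieve.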
March 08, 2014 Re: Major performance problem with std.array.front() | ||||
---|---|---|---|---|
| ||||
Posted in reply to Vladimir Panteleev | On 3/7/14, 4:39 PM, Vladimir Panteleev wrote:
>> s.canFind('é')
>> s.endsWith('é')
>> s.find('é')
>> s.count('é')
>> s.countUntil('é')
>
> These should not compile post-change, because the sought element (dchar)
> is not of the same type as the string. So they will not fail silently.
The compared element need not have the same type (otherwise we'd break some other code).
Andrei
|
March 08, 2014 Re: Major performance problem with std.array.front() | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Saturday, 8 March 2014 at 01:23:27 UTC, Andrei Alexandrescu wrote:
> Yup, the grapheme issue. This should work.
No. It does not work because grapheme segmentation is not the same as normalization. Even if you fix the code (should be: assert(s.byGrapheme.canFind!"a[] == b"("é"))), it will not work because byGrapheme does not normalize (and not all graphemes can be normalized to a single code point anyway).
And there is more than one type of normalization - you need to use the one depending on what you're trying to achieve.
> Graphemes are the next level of Nirvana above code points, but that doesn't mean it's graphemes or nothing.
It's not about types, it's about algorithms.
It's never "X or nothing" - unless X is "actually understanding Unicode". Everything else is a compromise. Compromises are acceptable, but not when they are built into the language as the standard way of working with text, thus hiding the problems that come with them.
|
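Editorial sketch of the distinction drawn in this post, assuming std.uni's `byGrapheme` and `normalize` from current Phobos; `hasGrapheme` is a hypothetical helper written for this illustration. Segmentation groups code points into graphemes but leaves each grapheme's code points untouched, so a precomposed "é" still does not match a decomposed one until the text is normalized.

```d
import std.algorithm.comparison : equal;
import std.algorithm.searching : any;
import std.range.primitives : walkLength;
import std.uni;

/// Hypothetical helper: true if some grapheme of `s` consists of exactly
/// the code points in `needle`.
bool hasGrapheme(string s, string needle)
{
    return s.byGrapheme.any!(g => g[].equal(needle));
}

void main()
{
    string decomposed = "casse\u0301"; // e + U+0301 COMBINING ACUTE ACCENT

    // Segmentation: six code points, five graphemes.
    assert(decomposed.walkLength == 6);
    assert(decomposed.byGrapheme.walkLength == 5);

    // The last grapheme is the pair e + U+0301, not U+00E9.
    assert(hasGrapheme(decomposed, "e\u0301"));
    assert(!hasGrapheme(decomposed, "\u00E9"));

    // Normalizing first makes the precomposed form match.
    assert(hasGrapheme(normalize!NFC(decomposed), "\u00E9"));
}
```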
March 08, 2014 Re: Major performance problem with std.array.front() | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Saturday, 8 March 2014 at 01:38:39 UTC, Andrei Alexandrescu wrote:
> On 3/7/14, 4:39 PM, Vladimir Panteleev wrote:
>>> s.canFind('é')
>>> s.endsWith('é')
>>> s.find('é')
>>> s.count('é')
>>> s.countUntil('é')
>>
>> These should not compile post-change, because the sought element (dchar)
>> is not of the same type as the string. So they will not fail silently.
>
> The compared element need not have the same type (otherwise we'd break some other code).
Do you think such code will appear often in practice? Even if the type is a dchar, in some cases the programmer may not have intended to do decoding (e.g. the "dchar" type was a result of type deduction from .front or something like that).
|
March 08, 2014 Re: Major performance problem with std.array.front() | ||||
---|---|---|---|---|
| ||||
Posted in reply to Vladimir Panteleev | On Saturday, 8 March 2014 at 01:41:01 UTC, Vladimir Panteleev wrote:
> On Saturday, 8 March 2014 at 01:38:39 UTC, Andrei Alexandrescu wrote:
>> On 3/7/14, 4:39 PM, Vladimir Panteleev wrote:
>>> These should not compile post-change, because the sought element (dchar)
>>> is not of the same type as the string. So they will not fail silently.
>>
>> The compared element need not have the same type (otherwise we'd break some other code).
>
> Do you think such code will appear often in practice? Even if the type is a dchar, in some cases the programmer may not have intended to do decoding (e.g. the "dchar" type was a result of type deduction from .front or something like that).
Sorry, I see now that you were referring to algorithms in general. I think adding a temporary warning for character types only, as with .front, would be appropriate...
|
March 08, 2014 Re: Major performance problem with std.array.front() | ||||
---|---|---|---|---|
| ||||
Posted in reply to Vladimir Panteleev | Vladimir Panteleev:
> It's not about types, it's about algorithms.
Given sufficiently refined types, it can be about types :-)
Bye,
bearophile
|
March 08, 2014 Re: Major performance problem with std.array.front() | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | 08-Mar-2014 05:23, Andrei Alexandrescu wrote:
> On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
>> On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
>>> On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>>>> No, it doesn't.
>>>>
>>>> import std.algorithm;
>>>>
>>>> void main()
>>>> {
>>>>     auto s = "cassé";
>>>>     assert(s.canFind('é'));
>>>> }
>>>>
>>>
>>> Hm, I'm not following? Works perfectly fine on my system?
>>
>> Something's messing with your Unicode. Try downloading and compiling
>> this file:
>> http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
>
> Yup, the grapheme issue. This should work.
>
> import std.algorithm, std.uni;
>
> void main()
> {
>     auto s = "cassé";
>     assert(s.byGrapheme.canFind('é'));
> }
>
> It doesn't compile, seems like a library bug.
Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings.
>
> Graphemes are the next level of Nirvana above code points, but that
> doesn't mean it's graphemes or nothing.
>
>
> Andrei
>
--
Dmitry Olshansky
|
March 08, 2014 Re: Major performance problem with std.array.front() | ||||
---|---|---|---|---|
| ||||
Posted in reply to Dmitry Olshansky | 08-Mar-2014 12:09, Dmitry Olshansky wrote:
> 08-Mar-2014 05:23, Andrei Alexandrescu wrote:
>> On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
>>> On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
>>>> On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>>>>> No, it doesn't.
>>>>>
>>>>> import std.algorithm;
>>>>>
>>>>> void main()
>>>>> {
>>>>>     auto s = "cassé";
>>>>>     assert(s.canFind('é'));
>>>>> }
>>>>>
>>>>
>>>> Hm, I'm not following? Works perfectly fine on my system?
>>>
>>> Something's messing with your Unicode. Try downloading and compiling
>>> this file:
>>> http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
>>
>> Yup, the grapheme issue. This should work.
>>
>> import std.algorithm, std.uni;
>>
>> void main()
>> {
>>     auto s = "cassé";
>>     assert(s.byGrapheme.canFind('é'));
>> }
>>
>> It doesn't compile, seems like a library bug.
>
> Because Graphemes do not auto-magically convert to dchar and back? After
> all they are just small strings.
>
>>
>> Graphemes are the next level of Nirvana above code points, but that
>> doesn't mean it's graphemes or nothing.
>>
Plus it won't help matters: you need both "é" and "cassé" to have the same normalization.
--
Dmitry Olshansky
|
March 08, 2014 Re: Major performance problem with std.array.front() | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | 08-Mar-2014 05:18, Andrei Alexandrescu wrote:
> On 3/7/14, 12:48 PM, Dmitry Olshansky wrote:
>> 07-Mar-2014 23:57, Andrei Alexandrescu wrote:
>>> On 3/6/14, 6:37 PM, Walter Bright wrote:
>>>> In "Lots of low hanging fruit in Phobos" the issue came up about the
>>>> automatic encoding and decoding of char ranges.
>>> [snip]
>>>
>>> Allow me to enumerate the functions of std.algorithm and how they work
>>> today and how they'd work with the proposed change. Let s be a variable
>>> of some string type.
>>
>> Special case was wrong though - special casing arrays of char[] and
>> throwing all other ranges of char out the window. The amount of code to
>> support this schizophrenia is enormous.
>
> I think this is a confusion. The code in e.g. std.algorithm is
> specialized for efficiency of stuff that already works.
Well, I've said it elsewhere - specialization was too fine grained. Either it's generic or it doesn't work.
>
>>> Making strings bidirectional ranges has been a very good choice within
>>> the constraints. There was already a string type, and that was
>>> immutable(char)[], and a bunch of code depended on that definition.
>>
>> Trying to make it work by blowing a hole in the generic range concept
>> now seems like it wasn't worth it.
>
> I disagree. Also what hole?
Let's say we keep it.
Yesterday I had to write constraints like this:
if((isNarrowString!Range && is(Unqual!(ElementEncodingType!Range) == wchar)) ||
   (isRandomAccessRange!Range && is(Unqual!(ElementType!Range) == wchar)))
Just to accept anything that works like an array of wchar, buffers and whatnot included.
I expect that this should have been enough:
isRandomAccessRange!Range && is(Unqual!(ElementType!Range) == wchar)
Or maybe introduce something to indicate any "DualRange" of narrow chars.
--
Dmitry Olshansky
|
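Editorial sketch of why the two-part constraint in this post is needed, assuming the range traits of current Phobos (where narrow strings are still auto-decoded). The trait name `isWcharSource` is invented for this illustration, and `byCodeUnit` from std.utf stands in here for the "buffers and whatnot" mentioned in the post.

```d
import std.range.primitives : ElementEncodingType, ElementType, isRandomAccessRange;
import std.traits : isNarrowString, Unqual;
import std.utf : byCodeUnit;

// A wstring is an array, yet the range traits neither see it as
// random-access nor report wchar as its element type.
static assert(!isRandomAccessRange!wstring);
static assert(is(ElementType!wstring == dchar));
static assert(is(Unqual!(ElementEncodingType!wstring) == wchar));

/// Hypothetical trait bundling the constraint quoted in the post.
enum isWcharSource(Range) =
    (isNarrowString!Range && is(Unqual!(ElementEncodingType!Range) == wchar)) ||
    (isRandomAccessRange!Range && is(Unqual!(ElementType!Range) == wchar));

// The first branch admits wchar arrays, the second admits other
// random-access ranges whose elements are wchar code units.
static assert(isWcharSource!wstring);
static assert(isWcharSource!(wchar[]));
static assert(isWcharSource!(typeof("abc"w.byCodeUnit)));
static assert(!isWcharSource!string);
static assert(!isWcharSource!dstring);

void main() {}
```

Without the `isNarrowString` branch, the shorter constraint from the post would reject `wstring` itself, which is the special case being objected to.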
March 08, 2014 Re: Major performance problem with std.array.front() | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile | On Saturday, 8 March 2014 at 02:04:12 UTC, bearophile wrote:
> Vladimir Panteleev:
>
>> It's not about types, it's about algorithms.
>
> Given sufficiently refined types, it can be about types :-)
>
> Bye,
> bearophile
I think Bear is onto something; we already solved an analogous problem in an elegant way.
See SortedRange with assumeSorted etc.
But for this to be convenient to use, I still think we should expand the current 'String Literal Postfix' types to include both normalization and graphemes.
Postfix | Type | Aka
---|---|---
c | immutable(char)[] | string
w | immutable(wchar)[] | wstring
d | immutable(dchar)[] | dstring
|
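Editorial sketch of the SortedRange analogy from the post above. Everything here (`Normalized`, `normalized`, `assumeNormalized`, `sameText`) is hypothetical and exists in no library; only `normalize`, `NormalizationForm`, `NFC` and `NFD` come from std.uni. The idea is that the type records which normalization form a string is in, the way SortedRange records sortedness, so mixing forms is rejected at compile time.

```d
import std.uni;

/// Hypothetical wrapper: the type carries the normalization form.
struct Normalized(NormalizationForm form)
{
    string payload;

    /// Comparison is only defined between strings in the same form.
    bool sameText(const Normalized other) const
    {
        return payload == other.payload;
    }
}

/// Normalizes `s` and tags the result with its form.
auto normalized(NormalizationForm form = NFC)(string s)
{
    return Normalized!form(normalize!form(s));
}

/// Tags `s` without checking it, as assumeSorted does for sortedness.
auto assumeNormalized(NormalizationForm form = NFC)(string s)
{
    return Normalized!form(s);
}

void main()
{
    auto a = normalized("casse\u0301");      // decomposed input, converted to NFC
    auto b = assumeNormalized("cass\u00E9"); // caller promises this is already NFC
    assert(a.sameText(b));

    // Comparing an NFC string with an NFD string does not compile.
    static assert(!__traits(compiles, a.sameText(normalized!NFD("x"))));
}
```

As with assumeSorted, the unchecked constructor trades safety for speed: a wrong promise silently gives wrong comparisons.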