June 02, 2016
On Thursday, 2 June 2016 at 18:43:54 UTC, Andrei Alexandrescu wrote:
> I don't think the plan is realistic. How can I tell you this without you getting mad at me?

You get out of the way and let the community get to work. Actually delegate, let people take ownership of problems, success and failure alike.

If we fail then, at least it will be from our own experience instead of from executive meddling.


June 02, 2016
On 06/02/2016 09:05 PM, Andrei Alexandrescu wrote:
> Pretty much everything. Consider s and s1 string variables with possibly
> different encodings (UTF8/UTF16).
>
> * s.all!(c => c == 'ö') works only with autodecoding. It returns always
> false without.

Doesn't work with autodecoding (to code points) when a combining diaeresis (U+0308) is used in s.

Would actually work with UTF-16 and only precomposed 'ö's in s, because the precomposed character fits in a single UTF-16 code unit.
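
A minimal sketch of that failure mode, assuming a current D compiler with autodecoding (the literals are just illustrative test strings):

import std.algorithm : all;

void main()
{
    string precomposed = "\u00F6";  // 'ö' as a single code point
    string decomposed  = "o\u0308"; // 'o' followed by combining diaeresis (U+0308)

    // Iterating by auto-decoded code points:
    assert(precomposed.all!(c => c == 'ö')); // passes
    assert(!decomposed.all!(c => c == 'ö')); // the decomposed form never matches
}
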
June 02, 2016
ag0aep6g <anonymous@example.com> wrote:
> On 06/02/2016 09:05 PM, Andrei Alexandrescu wrote:
>> Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16).
>> 
>> * s.all!(c => c == 'ö') works only with autodecoding. It returns always
>> false without.
> 
> Doesn't work with autodecoding (to code points) when a combining diaeresis (U+0308) is used in s.

Works if s is normalized appropriately. No?




June 02, 2016
On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
> Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16).
> ...

Your 'ö' examples will NOT work reliably with auto-decoded code points, and for nearly the same reason that they won't work with code units; you would have to use byGrapheme.
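
To make the three levels concrete, a rough sketch (the decomposed literal is an assumed test string):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "o\u0308bar";               // decomposed 'ö' followed by "bar"
    assert(s.length == 6);                 // UTF-8 code units
    assert(s.walkLength == 5);             // auto-decoded code points
    assert(s.byGrapheme.walkLength == 4);  // user-perceived characters
}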

The fact that you still don't get that, even after a dozen plus attempts by the community to explain the difference, makes you unfit to direct Phobos' Unicode support. Please, either go study Unicode until you really understand it, or delegate this issue to someone else.
June 02, 2016
On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
> Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16).
>
> * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.
>

False. Many characters can be represented by different sequences of code points. For instance, ê can be a single precomposed code point, or an 'e' followed by a combining circumflex. ö is another such character.
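
A small sketch of those two representations, and how normalization (std.uni.normalize, which comes up later in the thread) reconciles them:

import std.uni;

void main()
{
    string precomposed = "\u00EA";  // 'ê' as one code point
    string decomposed  = "e\u0302"; // 'e' followed by combining circumflex (U+0302)
    assert(precomposed != decomposed);               // different code point sequences
    assert(precomposed == decomposed.normalize!NFC); // equal once normalized
}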

> * s.any!(c => c == 'ö') works only with autodecoding. It returns always false without.
>

False. (While this is pretty much the same as the first example, one can come up with as many examples as desired by tweaking it to produce endless variations.)

> * s.balancedParens('〈', '〉') works only with autodecoding.
>

Not sure, so I'll say OK.

> * s.canFind('ö') works only with autodecoding. It returns always false without.
>

False.

> * s.commonPrefix(s1) works only if they both use the same encoding; otherwise it still compiles but silently produces an incorrect result.
>

False.

> * s.count('ö') works only with autodecoding. It returns always zero without.
>

False.

> * s.countUntil(s1) is really odd - without autodecoding, whether it works at all, and the result it returns, depends on both encodings. With autodecoding it always works and returns a number independent of the encodings.
>

False.

> * s.endsWith('ö') works only with autodecoding. It returns always false without.
>

False.

> * s.endsWith(s1) works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.
>

False.

> * s.find('ö') works only with autodecoding. It never finds it without.
>

False.

> * s.findAdjacent is a very interesting one. It works with autodecoding, but without it it just does odd things.
>

Not sure, so I'll say OK, though I strongly suspect that, like the others, this will only work if the strings are normalized.

> * s.findAmong(s1) is also interesting. It works only with autodecoding.
>

False.

> * s.findSkip(s1) works only if s and s1 have the same encoding. Otherwise it compiles and runs but produces incorrect results.
>

False.

> * s.findSplit(s1), s.findSplitAfter(s1), s.findSplitBefore(s1) work only if s and s1 have the same encoding. Otherwise they compile and run but produce incorrect results.
>

False.

> * s.minCount, s.maxCount are unlikely to be terribly useful but with autodecoding it consistently returns the extremum numeric code unit regardless of representation. Without, they just return encoding-dependent and meaningless numbers.
>

Not sure, so I'll say OK.

> * s.minPos, s.maxPos follow a similar semantics.
>

Not sure, so I'll say OK.

> * s.skipOver(s1) only works with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.
>

False.

> * s.startsWith('ö') works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.
>

False.

> * s.startsWith(s1) works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.
>

False.

> * s.until!(c => c == 'ö') works only with autodecoding. Otherwise, it will span the entire range.
>

False.

> ===
>
> The intent of autodecoding was to make std.algorithm work meaningfully with strings. As it's easy to see I just went through std.algorithm.searching alphabetically and found issues literally with every primitive in there. It's an easy exercise to go forth with the others.
>
>
> Andrei

I mean, what a trainwreck. Your examples say it all, don't they? Almost none of them would work without normalizing the string first. And that is the point you've been refusing to hear so far: autodecoding doesn't pay for itself, as it is unable to do what it is supposed to do in the general case.

Really, there is not much you can do with anything Unicode-related without first going through normalization. If you want anything more than searching for a substring or the like, you'll also need a collation, which is locale-dependent (for sorting, for instance).

Supporting Unicode, IMO, would mean providing facilities to normalize (preferably lazily, as a range), to manage collations, and so on. Decoding to code points just doesn't cut it.

As a result, any algorithm that needs to support strings has to either fight against the language because it doesn't need decoding, use decoding and accept being incorrect without normalization, or do the correct thing by itself (which also requires working against the language).
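
As one concrete instance of the first option, bypassing decoding already requires an explicit opt-out today; a sketch using std.utf.byCodeUnit:

import std.algorithm : canFind;
import std.utf : byCodeUnit;

void main()
{
    string s = "schön";
    // Explicitly opt out of autodecoding; the search runs over UTF-8 code units.
    assert(s.byCodeUnit.canFind('n'));
}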

June 02, 2016
On 02.06.2016 21:26, Andrei Alexandrescu wrote:
> ag0aep6g <anonymous@example.com> wrote:
>> On 06/02/2016 09:05 PM, Andrei Alexandrescu wrote:
>>> Pretty much everything. Consider s and s1 string variables with possibly
>>> different encodings (UTF8/UTF16).
>>>
>>> * s.all!(c => c == 'ö') works only with autodecoding. It returns always
>>> false without.
>>
>> Doesn't work with autodecoding (to code points) when a combining
>> diaeresis (U+0308) is used in s.
>
> Works if s is normalized appropriately. No?
>

No. assert(!"ö̶".normalize!NFC.any!(c => c== 'ö'));
June 02, 2016
On 06/02/2016 09:26 PM, Andrei Alexandrescu wrote:
> ag0aep6g <anonymous@example.com> wrote:
>> On 06/02/2016 09:05 PM, Andrei Alexandrescu wrote:
>>> Pretty much everything. Consider s and s1 string variables with possibly
>>> different encodings (UTF8/UTF16).
>>>
>>> * s.all!(c => c == 'ö') works only with autodecoding. It returns always
>>> false without.
>>
>> Doesn't work with autodecoding (to code points) when a combining
>> diaeresis (U+0308) is used in s.
>
> Works if s is normalized appropriately. No?

Works when normalized to precomposed characters, yes.

That's not a given, of course. When the user is aware enough to normalize their strings that way, then they should be able to call byDchar explicitly.

And of course you can't do s.all!(c => c == 'a⃗'), despite a⃗ looking like one character. Need byGrapheme for that.
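
Roughly what the grapheme-level version looks like, with the needle cluster spelled out explicitly (an illustrative sketch, not an endorsement of this as the idiomatic API):

import std.algorithm : any, equal;
import std.uni : byGrapheme;

void main()
{
    string s = "x a\u20D7 y"; // contains 'a' + U+20D7 (combining right arrow above)
    // No single code point equals the visible character, so compare whole clusters:
    assert(s.byGrapheme.any!(g => g[].equal("a\u20D7")));
}
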
June 02, 2016
On 02.06.2016 21:05, Andrei Alexandrescu wrote:
> On 06/02/2016 01:54 PM, Marc Schütz wrote:
>> On Thursday, 2 June 2016 at 14:28:44 UTC, Andrei Alexandrescu wrote:
>>> That's not going to work. A false impression created in this thread
>>> has been that code points are useless
>>
>> They _are_ useless for almost anything you can do with strings. The only
>> places where they should be used are std.uni and std.regex.
>>
>> Again: What is the justification for using code points, in your opinion?
>> Which practical tasks are made possible (and work _correctly_) if you
>> decode to code points, that don't already work with code units?
>
> Pretty much everything. Consider s and s1 string variables with possibly
> different encodings (UTF8/UTF16).
>
> * s.all!(c => c == 'ö') works only with autodecoding. It returns always
> false without.
> ...

Doesn't work. Shouldn't compile. (char and wchar shouldn't be comparable.)

assert("ö".all!(c => c == 'ö')); // fails

> * s.any!(c => c == 'ö') works only with autodecoding. It returns always
> false without.
> ...

Doesn't work. Shouldn't compile.

assert("ö".any!(c => c == 'ö")); // fails
assert(!"̃ö⃖".any!(c => c== 'ö')); // fails

> * s.balancedParens('〈', '〉') works only with autodecoding.
> ...

Doesn't work, e.g. s="⟨⃖". Shouldn't compile.

> * s.canFind('ö') works only with autodecoding. It returns always false
> without.
> ...

Doesn't work. Shouldn't compile.

assert("ö".canFind!(c => c == 'ö")); // fails

> * s.commonPrefix(s1) works only if they both use the same encoding;
> otherwise it still compiles but silently produces an incorrect result.
> ...

Doesn't work. Shouldn't compile.

> * s.count('ö') works only with autodecoding. It returns always zero
> without.
> ....

Doesn't work. Shouldn't compile.

> * s.countUntil(s1) is really odd - without autodecoding, whether it
> works at all, and the result it returns, depends on both encodings.  With
> autodecoding it always works and returns a number independent of the
> encodings.
> ...

Doesn't work. Shouldn't compile.

> * s.endsWith('ö') works only with autodecoding. It returns always false
> without.
> ...

Doesn't work. Shouldn't compile.

> * s.endsWith(s1) works only with autodecoding.

Doesn't work.

> Otherwise it compiles and
> runs but produces incorrect results if s and s1 have different encodings.
>...

Shouldn't compile.

> * s.find('ö') works only with autodecoding. It never finds it without.
> ...

Doesn't work. Shouldn't compile.

> * s.findAdjacent is a very interesting one. It works with autodecoding,
> but without it it just does odd things.
> ....

Doesn't work. Shouldn't compile.

> * s.findAmong(s1) is also interesting. It works only with autodecoding.
> ...

Doesn't work. Shouldn't compile.

> * s.findSkip(s1) works only if s and s1 have the same encoding.
> Otherwise it compiles and runs but produces incorrect results.
> ...

Doesn't work. Shouldn't compile.

> * s.findSplit(s1), s.findSplitAfter(s1), s.findSplitBefore(s1) work only
> if s and s1 have the same encoding.

Doesn't work.

> Otherwise they compile and run but produce incorrect results.
> ...

Shouldn't compile.

> * s.minCount, s.maxCount are unlikely to be terribly useful but with
> autodecoding it consistently returns the extremum numeric code unit
> regardless of representation. Without, they just return
> encoding-dependent and meaningless numbers.
>
> * s.minPos, s.maxPos follow a similar semantics.
> ...

Hardly a point in favour of autodecoding.

> * s.skipOver(s1) only works with autodecoding.

Doesn't work. Shouldn't compile.

> Otherwise it compiles and
> runs but produces incorrect results if s and s1 have different encodings.
> ...

Shouldn't compile.

> * s.startsWith('ö') works only with autodecoding. Otherwise it compiles
> and runs but produces incorrect results if s and s1 have different
> encodings.
> ...

Doesn't work. Shouldn't compile.

> * s.startsWith(s1) works only with autodecoding. Otherwise it compiles
> and runs but produces incorrect results if s and s1 have different
> encodings.
> ...


Doesn't work. Shouldn't compile.

> * s.until!(c => c == 'ö') works only with autodecoding. Otherwise, it
> will span the entire range.
> ...

Doesn't work. Shouldn't compile.

> ===
>
> The intent of autodecoding was to make std.algorithm work meaningfully
> with strings. As it's easy to see I just went through
> std.algorithm.searching alphabetically and found issues literally with
> every primitive in there. It's an easy exercise to go forth with the
> others.
> ...

Basically all of those still don't work with UTF-32 (assuming your goal is to operate on characters). You need to normalize and possibly iterate on graphemes. Also, many of those functions actually have valid uses intentionally operating on code units.
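
One example of such a legitimate code-unit-level use: splitting on an ASCII delimiter is already correct on the raw bytes, because no multi-byte UTF-8 sequence can contain an ASCII code unit (a sketch using std.string.representation):

import std.algorithm : findSplit;
import std.string : representation;

void main()
{
    auto parts = "schön=ja".representation.findSplit("=".representation);
    assert(parts[0] == "schön".representation);
    assert(parts[2] == "ja".representation);
}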

The "shouldn't compile" remarks ideally would be handled at the language level: char/wchar/dchar should be incompatible types and char[], wchar[] and dchar[] should be handled like all arrays.


June 02, 2016
On 6/2/2016 12:05 PM, Andrei Alexandrescu wrote:
> * s.all!(c => c == 'ö') works only with autodecoding. It returns always false
> without.

The 'ö' literal is inferred as a wchar, so the lambda compares each element against a wchar. The algorithm can check that the input is char[], and is being tested against a wchar. Therefore, the algorithm can specialize to do the decoding itself.

No autodecoding necessary, and it does the right thing.
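
A minimal sketch of that kind of specialization (myAll is a hypothetical name, not a Phobos function; it decodes explicitly instead of relying on autodecoding):

import std.algorithm : all;
import std.utf : byDchar;

// Hypothetical specialization: the input is known to be UTF-8 code units,
// so decode to dchar lazily and apply the predicate per code point.
bool myAll(alias pred)(const(char)[] s)
{
    return s.byDchar.all!pred;
}

void main()
{
    assert("hör zu".myAll!(c => c != 'x'));
    // Still subject to the normalization caveats raised elsewhere in the
    // thread: a decomposed "o\u0308" will not compare equal to 'ö' here.
}
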
June 02, 2016
On 06/02/2016 03:13 PM, Adam D. Ruppe wrote:
> On Thursday, 2 June 2016 at 18:43:54 UTC, Andrei Alexandrescu wrote:
>> I don't think the plan is realistic. How can I tell you this without
>> you getting mad at me?
>
> You get out of the way and let the community get to work. Actually
> delegate, let people take ownership of problems, success and failure alike.

That's a good point. We plan to do more of that in the future.

> If we fail then, at least it will be from our own experience instead of
> from executive meddling.

This applies to high-risk work that is also of commensurately extraordinary value. My assessment is this is not it. If you were in my position you'd also do what you think is the best thing to do, and nobody should feel offended by that.


Andrei