June 02, 2016
On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:
> On 6/2/16 5:20 PM, deadalnix wrote:
>> The good thing when you define "works" as "whatever it does right now"
>
> No, it works as it was designed. -- Andrei

Nobody says it doesn't. Everybody says the design is crap.
June 02, 2016
On 6/2/16 5:35 PM, ag0aep6g wrote:
> On 06/02/2016 11:27 PM, Andrei Alexandrescu wrote:
>> On 6/2/16 5:24 PM, ag0aep6g wrote:
>>> On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:
>>>> Nope, that's a radically different matter. As the examples show, they
>>>> would be entirely meaningless at code unit level.
>>>
>>> They're simply not possible. Won't compile.
>>
>> They do compile.
>
> Yes, you're right, of course they do. char implicitly converts to dchar.
> I didn't think of that anti-feature.
>
>> As I said: this thread produces an unpleasant amount of arguments in
>> favor of autodecoding. Even I don't like that :o).
>
> It's more of an argument against char : dchar, I'd say.

I do think that's an interesting option in PL design space, but that would be super disruptive. -- Andrei
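(To make the anti-feature concrete, here is a minimal sketch; nothing here is Phobos-specific, it is purely a language-level conversion:)

void main()
{
    immutable char c = 0xC3; // first code unit of the UTF-8 encoding of 'ä'
    dchar d = c;             // compiles: char implicitly converts to dchar
    assert(d == '\u00C3');   // the code unit is now read as the code point
                             // U+00C3 'Ã', a different character entirely
}

This is why a char needle against a dchar range compiles at all, while silently comparing a code unit with code points.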
June 02, 2016
On 6/2/16 5:35 PM, deadalnix wrote:
> On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:
>> On 6/2/16 5:20 PM, deadalnix wrote:
>>> The good thing when you define "works" as "whatever it does right now"
>>
>> No, it works as it was designed. -- Andrei
>
> Nobody says it doesn't. Everybody says the design is crap.

I think I like it more after this thread. -- Andrei
June 02, 2016
On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote:
> On Thursday, 2 June 2016 at 21:07:19 UTC, default0 wrote:
>> The level 2 support description noted that it should be opt-in because it's slow.
>
> 1) It does not say that level 2 should be opt-in; it says that level 2 should be toggle-able. Nowhere does it say which of level 1 and 2 should be the default.
>
> 2) It says that working with graphemes is slower than UTF-16 code UNITS (level 1), but says nothing about streaming decoding of code POINTS (what we have).
>
> 3) That document is from 2000, and its claims about performance are surely extremely out-dated, anyway. Computers and the Unicode standard have both changed much since then.

1) Right, because a special toggleable syntax is definitely not "opt-in".
2) Several people in this thread have noted that working on graphemes is way slower than working on code points. That makes sense: it's yet another processing step you need to do after decoding, therefore more work, therefore slower.
3) Not an argument: doing more work makes code slower. The only thing that has changed since 2000 is what specific operations cost (memory access, for instance, is relatively much more expensive now than it was then). Considering how the process works, and judging from what others in this thread have said about it, I will stick with "always decoding to graphemes for all operations is very slow" and indulge in being too lazy to write benchmarks to show just how bad it is.
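That said, for anyone less lazy, here is a rough sketch of such a benchmark (it assumes a recent Phobos, where benchmark lives in std.datetime.stopwatch; on a 2016-era compiler the equivalent lived in std.datetime):

import std.datetime.stopwatch : benchmark;
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // Build a haystack with plenty of non-ASCII characters so that
    // decoding actually has work to do.
    string text;
    foreach (_; 0 .. 10_000)
        text ~= "Hëllö, Wörld! ";

    // walkLength over a string iterates code points (autodecoding);
    // byGrapheme adds grapheme segmentation on top of that decoding.
    auto r = benchmark!(
        () => text.walkLength,
        () => text.byGrapheme.walkLength
    )(100);

    writeln("code points: ", r[0]);
    writeln("graphemes:   ", r[1]);
}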
June 02, 2016
On Thursday, 2 June 2016 at 21:37:11 UTC, Andrei Alexandrescu wrote:
> On 6/2/16 5:35 PM, deadalnix wrote:
>> On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:
>>> On 6/2/16 5:20 PM, deadalnix wrote:
>>>> The good thing when you define "works" as "whatever it does right now"
>>>
>>> No, it works as it was designed. -- Andrei
>>
>> Nobody says it doesn't. Everybody says the design is crap.
>
> I think I like it more after this thread. -- Andrei

You're starting to remind me of the joke about that guy complaining that everybody else is going backward on the highway.

June 02, 2016
On Thursday, 2 June 2016 at 20:29:48 UTC, Andrei Alexandrescu wrote:
> On 06/02/2016 04:22 PM, cym13 wrote:
>>
>> A:“We should decode to code points”
>> B:“No, decoding to code points is a stupid idea.”
>> A:“No it's not!”
>> B:“Can you show a concrete example where it does something useful?”
>> A:“Sure, look at that!”
>> B:“This isn't working at all, look at all those counter-examples!”
>> A:“It may not work for your examples but look how easy it is to
>>     find code points!”
>
> With autodecoding all of std.algorithm operates correctly on code points. Without it all it does for strings is gibberish. -- Andrei

Allow me to try another angle:

- There are different levels of Unicode support and you don't want to
support them all transparently. That's understandable.

- The level you choose to support is the code point level. There are
many good arguments about why this isn't a good default but you won't
change your mind. I don't like that at all and I'm not alone but let's
forget the entirety of the vocal D community for a moment.

- A huge part of Unicode characters can be normalized to fit your
definition. That way not everything works (far from it), but a
sufficiently big subset does.

- On the other hand, without normalization it just doesn't make any
sense from a user perspective. The ö example has clearly shown that
much; you even admitted it yourself by stating that many
counter-arguments would have worked had the string been normalized.

- The most prominent problem is with graphemes that can have different
representations: those that can't be normalized can't be searched as
dchars either.

- If autodecoding to code points is to stay, then in an effort to find
a compromise, normalizing should be done by default. Sure, it would
take some more time, but it wouldn't break any code (I think) and
would actually make things more correct. They still wouldn't be fully
correct, but I feel that something as crazy as Unicode cannot be
tackled generically anyway.
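To illustrate that last point, a minimal sketch of what default NFC normalization would buy (imports spelled out; the literals are illustrative):

import std.algorithm : canFind;
import std.uni;

void main()
{
    string composed   = "\u00F6";   // 'ö' as a single precomposed code point
    string decomposed = "o\u0308";  // 'o' followed by a combining diaeresis

    // At code point level the two spellings of 'ö' do not match...
    assert(!composed.canFind(decomposed));

    // ...but they do once both sides are normalized to NFC.
    assert(composed.normalize!NFC.canFind(decomposed.normalize!NFC));
}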

June 02, 2016
On 6/2/16 5:37 PM, Andrei Alexandrescu wrote:
> On 6/2/16 5:35 PM, deadalnix wrote:
>> On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:
>>> On 6/2/16 5:20 PM, deadalnix wrote:
>>>> The good thing when you define "works" as "whatever it does right now"
>>>
>>> No, it works as it was designed. -- Andrei
>>
>> Nobody says it doesn't. Everybody says the design is crap.
>
> I think I like it more after this thread. -- Andrei

Meh, thinking of it again: I don't like it more, I'd still do it differently given a clean slate (viz. RCStr). But let's say I didn't get many compelling reasons to remove autodecoding from this thread. -- Andrei

June 02, 2016
On 6/2/16 5:38 PM, deadalnix wrote:
> On Thursday, 2 June 2016 at 21:37:11 UTC, Andrei Alexandrescu wrote:
>> On 6/2/16 5:35 PM, deadalnix wrote:
>>> On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:
>>>> On 6/2/16 5:20 PM, deadalnix wrote:
>>>>> The good thing when you define "works" as "whatever it does right now"
>>>>
>>>> No, it works as it was designed. -- Andrei
>>>
>>> Nobody says it doesn't. Everybody says the design is crap.
>>
>> I think I like it more after this thread. -- Andrei
>
> You're starting to remind me of the joke about that guy complaining
> that everybody else is going backward on the highway.

Touché. (Get it?) -- Andrei

June 02, 2016
On 02.06.2016 23:23, Andrei Alexandrescu wrote:
> On 6/2/16 5:19 PM, Timon Gehr wrote:
>> On 02.06.2016 23:16, Timon Gehr wrote:
>>> On 02.06.2016 23:06, Andrei Alexandrescu wrote:
>>>> As the examples show, they would be entirely meaningless at code
>>>> unit level.
>>>
>>> So far, I needed to count the number of characters 'ö' inside some
>>> string exactly zero times,
>>
>> (Obviously this isn't even what the example would do. I predict I will
>> never need to count the number of code points 'ö' by calling some
>> function from std.algorithm directly.)
>
> You may look for a specific dchar, and it'll work. How about
> findAmong("...") with a bunch of ASCII and Unicode punctuation symbols?
> -- Andrei
>
>

.̂ ̪.̂

(Copy-paste it somewhere else, I think it might not be rendered correctly on the forum.)

The point is that if I do (with std.uni and std.algorithm imported):

".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."), Grapheme(",")])

no match is returned.

If I use your method with dchars, I will get spurious matches. I.e. the suggested method to look for punctuation symbols is incorrect:

writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂"


(Also, do you have a use case for this?)
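For anyone who wants to reproduce this, here are the two snippets combined into a self-contained program (a sketch; the unusual literal is the same one from the top of this message, so it may not survive the forum rendering either):

import std.algorithm : findAmong;
import std.stdio : writeln;
import std.uni;

void main()
{
    auto s = ".̂ ̪.̂"; // no grapheme in here is a bare '.'

    // Grapheme level: nothing matches, as intended.
    auto g = s.normalize!NFC.byGrapheme
              .findAmong([Grapheme("."), Grapheme(",")]);
    writeln(g.empty); // true

    // Code point (dchar) level: the '.' base of '.̂' matches spuriously.
    writeln(s.findAmong(",.")); // prints ".̂ ̪.̂"
}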
June 02, 2016
On 6/2/16 5:38 PM, cym13 wrote:
> Allow me to try another angle:
>
> - There are different levels of Unicode support and you don't want to
> support them all transparently. That's understandable.

Cool.

> - The level you choose to support is the code point level. There are
> many good arguments about why this isn't a good default but you won't
> change your mind. I don't like that at all and I'm not alone but let's
> forget the entirety of the vocal D community for a moment.

You mean all 35 of them?

It's not about changing my mind! A massive factor is that code point level handling is the incumbent, and that changing it would need to bring an absolutely Earth-shattering improvement to be worth it!

> - A huge part of Unicode characters can be normalized to fit your
> definition. That way not everything works (far from it), but a
> sufficiently big subset does.

Cool.

> - On the other hand, without normalization it just doesn't make any
> sense from a user perspective. The ö example has clearly shown that
> much; you even admitted it yourself by stating that many
> counter-arguments would have worked had the string been normalized.

Yah, operating at code point level does not come free of caveats. It is vastly superior to operating on code units, and did I mention it's the incumbent?

> - The most prominent problem is with graphemes that can have different
> representations: those that can't be normalized can't be searched as
> dchars either.

Yah, I'd say if the program needs graphemes the option is there. Phobos by default deals with code points, which are not perfect but are independent of representation and produce meaningful, consistent results with std.algorithm etc.
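For instance, opting in is one range adapter away (a minimal sketch; the string is illustrative):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    auto s = "mo\u0308tley";              // 'o' followed by a combining diaeresis
    assert(s.length == 8);                // UTF-8 code units
    assert(s.walkLength == 7);            // code points: the autodecoded default
    assert(s.byGrapheme.walkLength == 6); // graphemes, opted into explicitly
}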

> - If autodecoding to code points is to stay, then in an effort to find
> a compromise, normalizing should be done by default. Sure, it would
> take some more time, but it wouldn't break any code (I think) and
> would actually make things more correct. They still wouldn't be fully
> correct, but I feel that something as crazy as Unicode cannot be
> tackled generically anyway.

Some more work on normalization at strategic points in Phobos would be interesting!


Andrei