March 09, 2014
On 3/9/14, 9:02 AM, bearophile wrote:
> Some time ago I even asked for a helper function:
> https://d.puremagic.com/issues/show_bug.cgi?id=10162

I commented on that and preapproved it.

Andrei

March 09, 2014
On 3/9/14, 10:21 AM, Vladimir Panteleev wrote:
> On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote:
>> On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
>>> So IIUC iterating over s.byChar would not encounter the decoding-related
>>> speed hits that Walter is concerned about?
>>
>> That is correct.
>
> Unless I'm missing something, all algorithms that can work faster on
> arrays will need to be adapted to also recognize byChar-wrapped arrays,
> unwrap them, perform the fast array operation, and wrap them back in a
> byChar.

Good point. Off the top of my head I can't remember any algorithm that relies on array representation to do better on arrays than on random-access ranges offering all of arrays' primitives. But I'm sure there are a few.

Andrei

March 09, 2014
On 3/9/14, 10:34 AM, Peter Alexander wrote:
> If we assume strings are normalized then substring search, equality
> testing, sorting all work the same with either code units or code points.

But others such as edit distance or equal(some_string, some_wstring) will not.
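A minimal illustration of the latter point (assuming both literals are in the same normalization form): under the current design both operands decode to dchar, so the comparison is over code points and holds across encodings, whereas code-unit iteration would pit UTF-8 units against UTF-16 units.

    import std.algorithm : equal;

    unittest
    {
        // Both operands are decoded to dchar, so code points are compared
        // and the assertion holds even though the encodings differ.
        assert(equal("résumé", "résumé"w));
    }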

>>> If you don't care about normalization then by code unit is just as good
>>> as by code point, but you don't need to specialise everywhere in Phobos.
>>>
>>> AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'),
>>> but as Vladimir correctly points out: (a) by code point, this is still
>>> broken in the face of normalization, and (b) are there any real
>>> applications that search a string for a specific non-ASCII character?
>>
>> What happened to counting characters and such?
>
> I can't think of any case where you would want to count characters.

wc

(Generally: I've always been very very very doubtful about arguments that start with "I can't think of..." because I've historically tried them so many times, and with terrible results.)


Andrei
March 09, 2014
On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote:
> wc

What should wc produce on a Sanskrit text?

The problem is that such questions quickly become philosophical.

> (Generally: I've always been very very very doubtful about arguments that start with "I can't think of..." because I've historically tried them so many times, and with terrible results.)

I agree, which is why I think that although such arguments are not unwelcome, it's much better to find out by experiment. Break something in Phobos and see how much of your code is affected :)
March 09, 2014
09-Mar-2014 21:16, Andrei Alexandrescu wrote:
> On 3/9/14, 4:34 AM, Peter Alexander wrote:
>> I think this is the main confusion: the belief that iterating by code
>> point has utility.
>>
>> If you care about normalization then neither by code unit, by code
>> point, nor by grapheme are correct (except in certain language subsets).
>
> I suspect that code point iteration is the worst as it works only with
> ASCII and perchance with ASCII single-byte extensions. Then we have code
> unit iteration that works with a larger spectrum of languages.

The two terms were clearly meant to be swapped: code point <--> code unit.

> One
> question would be how large that spectrum it is. If it's larger than
> English, then that would be nice because we would've made progress.
>

Code points help only insofar as many (nearly all) high-level Unicode algorithms are described in terms of code points. Code points have properties; code units have none. Code points with an assigned semantic value are "abstract characters".

It's up to the programmer implementing a particular algorithm to make it behave "as if" decoding really happened: either work directly on code units, or actually decode and work with code points, which is simpler.

The current std.uni offerings mostly work on code points and decode; the crucial building block for working directly on code units is in review:

https://github.com/D-Programming-Language/phobos/pull/1685
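To make the distinction concrete, a minimal sketch using only what's already in Phobos (std.utf.decode and std.uni.isAlpha): walking by code point yields dchars that carry Unicode properties, while the same string seen as code units is just bytes.

    import std.uni : isAlpha;
    import std.utf : decode;

    unittest
    {
        string s = "naïve";

        // By code point: decode advances the index past one whole code
        // point (1..4 UTF-8 code units) and yields a dchar with properties.
        size_t i = 0;
        while (i < s.length)
        {
            dchar cp = decode(s, i);
            assert(isAlpha(cp));
        }

        // By code unit: raw UTF-8, no properties attached; 'ï' alone
        // accounts for two of the six code units.
        assert(s.length == 6);
    }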

> I don't know about normalization beyond discussions in this group, but
> as far as I understand from
> http://www.unicode.org/faq/normalization.html, normalization would be a
> one-step process, after which code point iteration would cover still
> more human languages. No? I'm pretty sure it's more complicated than
> that, so please illuminate me :o).

Technically, most apps just assume something like "input comes in UTF-8 in normalization form C". Others, such as browsers, strive for a uniform representation regardless of input and normalize everything (oftentimes the normalization turns out to be just a no-op).
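For illustration, with the std.uni already in Phobos the "normalize up front" step is a one-liner (a sketch; for input already in form C it is effectively a no-op):

    import std.uni : NFC, normalize;

    unittest
    {
        // "é" as a precomposed code point vs. 'e' plus a combining accent.
        string precomposed = "\u00E9";
        string decomposed  = "e\u0301";

        // After normalizing to form C the two spellings compare equal at
        // the code-unit (and hence code-point) level.
        assert(normalize!NFC(decomposed) == precomposed);
    }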


>> If you don't care about normalization then by code unit is just as good
>> as by code point, but you don't need to specialise everywhere in Phobos.
>>
>> AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'),
>> but as Vladimir correctly points out: (a) by code point, this is still
>> broken in the face of normalization, and (b) are there any real
>> applications that search a string for a specific non-ASCII character?
>
> What happened to counting characters and such?

Counting chars is dubious. But, for instance, collation is defined in terms of code points, and regex pattern matching is _defined_ in terms of code points (even the mystical Level 3 Unicode support for it). So there is a certain merit in working at that level. But hacking it to be this way isn't the way to go.
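For instance (a small sketch relying only on the std.regex API as it stands), the pattern engine already operates on code points: \w consumes the whole multi-code-unit 'é' rather than a single UTF-8 byte of it.

    import std.regex : matchFirst;

    unittest
    {
        auto m = matchFirst("café", `caf\w`);
        // \w matches the single code point 'é', not one of its two
        // UTF-8 code units, so the whole word is captured.
        assert(!m.empty && m[0] == "café");
    }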

The least intrusive change would be to generalize the current choice with respect to random-access ranges of char/wchar.

-- 
Dmitry Olshansky
March 09, 2014
09-Mar-2014 21:45, Andrei Alexandrescu wrote:
> On 3/9/14, 10:21 AM, Vladimir Panteleev wrote:
>> On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote:
>>> On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
>>>> So IIUC iterating over s.byChar would not encounter the
>>>> decoding-related
>>>> speed hits that Walter is concerned about?
>>>
>>> That is correct.
>>
>> Unless I'm missing something, all algorithms that can work faster on
>> arrays will need to be adapted to also recognize byChar-wrapped arrays,
>> unwrap them, perform the fast array operation, and wrap them back in a
>> byChar.
>
> Good point. Off the top of my head I can't remember any algorithm that
> relies on array representation to do better on arrays than on
> random-access ranges offering all of arrays' primitives. But I'm sure
> there are a few.

copy, to begin with. And it's about 80x faster with plain arrays.
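A sketch of the shape of the problem (the 80x figure is a measurement, not something this snippet reproduces): copy between plain arrays can be specialized down to a block memory move, while the same call through a generic wrapper range degenerates to an element-by-element loop.

    import std.algorithm : copy;

    unittest
    {
        char[] src = "hello, world".dup;
        auto dst = new char[](src.length);

        // Array to array: eligible for a block memory move.
        copy(src, dst);
        assert(dst == src);

        // Through a generic wrapper (e.g. a byChar-style adapter) the same
        // operation would have to proceed one element at a time.
    }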


-- 
Dmitry Olshansky
March 09, 2014
On 3/9/14, 8:18 AM, Vladimir Panteleev wrote:
> On Sunday, 9 March 2014 at 05:10:26 UTC, Andrei Alexandrescu wrote:
>> On 3/8/14, 8:24 PM, Vladimir Panteleev wrote:
>>> On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu wrote:
>>>> What exactly is the consensus? From your wiki page I see "One of the
>>>> proposals in the thread is to switch the iteration type of string
>>>> ranges from dchar to the string's character type."
>>>>
>>>> I can tell you straight out: That will not happen for as long as I'm
>>>> working on D.
>>>
>>> Why?
>>
>> From the cycle "going in circles": because I think the breakage is way
>> too large compared to the alleged improvement.
>
> All right. I was wondering if there was something more fundamental
> behind such an ultimatum.

It's just factual information with no drama attached (i.e. I'm not threatening to leave the language; I'm just plainly explaining that I'll never approve that particular change).

That said, a larger explanation is in order. There have been cases in the past when our community worked itself into a froth over a non-issue and ultimately caused a language change imposed by "the faction that shouted the loudest". The "lazy" keyword and, more recently, the "virtual" keyword come to mind as cases in which the language leadership was essentially annoyed into making a change it didn't believe in.

I am all about listening to the community's needs and desires. But at some point there is a need to stick to one's guns on matters of judgment. See e.g. https://d.puremagic.com/issues/show_bug.cgi?id=11837 for a very recent example in which reasonable people may disagree, but ultimately you can't choose both options.

What we now have works as intended. As I mentioned, there is quite a bit more evidence that the design is useful to people than detrimental. Unicode is all about code points; code units are incidental to each encoding. The fact that we recognize code points at the language and library level is, in my opinion, a Good Thing(tm).

I understand that doesn't reach the ninth level of Nirvana, and there are still issues to work on, including issues where good-looking code is actually incorrect. But I think we're overall in good shape. A regression from that to the code-unit level would be very destructive. Even a clear but slight improvement that breaks backward compatibility would be destructive.

So I wanted to limit the potential damage of this discussion. It is made only more dangerous by the fact that Walter himself started it, something that others didn't fail to pick up on. The sheer fact that we got to contemplate an unbelievably massive breakage on no evidence other than one misuse case, and for the sake of a possibly illusory improvement - that's a sign we need to grow up. We can't go about changing the language like this and still aim to play in the big leagues.

>> In fact I believe that that design is inferior to the current one
>> regardless.
>
> I was hoping we could come to an agreement at least on this point.

Sorry to disappoint.

> ---
>
> BTW, a thought struck me while thinking about the problem yesterday.
>
> char and dchar should not be implicitly convertible between one another,
> or comparable to the other.

I think only the char -> dchar conversion works, and I can see arguments against it. Comparison of char with dchar is also dicey. But there are also cases in which it's legitimate to do that (e.g. assigning ASCII chars, etc.), and this would be a breaking change.

One good way to think about breaking changes is "if this change were executed to perfection, how much would that improve the overall quality of D?" Because breakages _are_ "overall" - users don't care whether they come from one part of the type system or another. That really puts things into perspective.

> void main()
> {
>      string s = "Привет";
>      foreach (c; s)
>          assert(c != 'Ñ');
> }
>
> Instead, std.conv.to should allow converting between character types,
> iff they represent one whole code point and fit into the destination
> type, and throw an exception otherwise (similar to how it deals with
> integer overflow). Char literals should be special-cased by the compiler
> to implicitly convert to any sufficiently large type.
>
> This would break more[1] code, but it would avoid the silent failures of
> the earlier proposal.
>
> [1] I went through my own larger programs. I actually couldn't find any
> uses of dchar which would be impacted by such a hypothetical change.

Generally I think we should steer away from slight improvements of the language at the cost of breaking existing code. Instead, we must think of ways to improve the language without the breakage. You may want to pursue (bugzilla + pull request) adding the std.conv routines with the semantics you mentioned.
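For illustration, here is a hypothetical sketch of the semantics you describe (toNarrow is a made-up name; the real routine would presumably hang off std.conv.to): the conversion succeeds only when the code point fits the destination character type, and throws otherwise, much like integer narrowing already does in to!().

    import std.conv : ConvException;

    // Hypothetical: dchar -> char succeeds only for code points that fit
    // in a single UTF-8 code unit; anything else throws.
    char toNarrow(dchar c)
    {
        if (c > 0x7F)
            throw new ConvException("code point does not fit in one char");
        return cast(char) c;
    }

    unittest
    {
        import std.exception : assertThrown;
        assert(toNarrow('a') == 'a');
        assertThrown!ConvException(toNarrow('\u00E9'));
    }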


Andrei

March 09, 2014
On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote:
> On 3/9/14, 10:34 AM, Peter Alexander wrote:
>> If we assume strings are normalized then substring search, equality
>> testing, sorting all work the same with either code units or code points.
>
> But others such as edit distance or equal(some_string, some_wstring) will not.

equal(string, wstring) should either not compile, or be overloaded to do the right thing. In an ideal world, char, wchar, and dchar would not be comparable.

Edit distance on code points is of questionable utility. Like Vladimir says, its meaning is pretty philosophical, even in ASCII (is "\r\n" really two "edits"? What is an "edit"?)


>> I can't think of any case where you would want to count characters.
>
> wc

% echo € | wc -c
4

:-)


> (Generally: I've always been very very very doubtful about arguments that start with "I can't think of..." because I've historically tried them so many times, and with terrible results.)

Fair point... but it's not as if we would be removing the ability (you could always do s.byCodePoint.count); we are talking about defaults. The argument that we shouldn't iterate by code unit by default because people might want to count code points is without substance. Also, with the proposal, string.count(dchar) would encode the dchar to a string first for performance, so it would still work.
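A sketch of that last idea (countDchar is a made-up name, and it assumes both sides are in the same normalization form): encode the needle once, then count occurrences as a plain substring search.

    import std.algorithm : count;
    import std.conv : to;

    // Hypothetical helper: encode the dchar needle to UTF-8 once, then
    // count its occurrences as a substring search.
    size_t countDchar(string haystack, dchar needle)
    {
        auto pattern = needle.to!string;
        return haystack.count(pattern);
    }

    unittest
    {
        assert(countDchar("héllo é", '\u00E9') == 2);
    }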

Anyway, I think this discussion isn't really going anywhere so I think I'll agree to disagree and retire.
March 09, 2014
On 3/9/14, 11:14 AM, Dmitry Olshansky wrote:
> 09-Mar-2014 21:45, Andrei Alexandrescu wrote:
>> On 3/9/14, 10:21 AM, Vladimir Panteleev wrote:
>>> On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote:
>>>> On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
>>>>> So IIUC iterating over s.byChar would not encounter the
>>>>> decoding-related
>>>>> speed hits that Walter is concerned about?
>>>>
>>>> That is correct.
>>>
>>> Unless I'm missing something, all algorithms that can work faster on
>>> arrays will need to be adapted to also recognize byChar-wrapped arrays,
>>> unwrap them, perform the fast array operation, and wrap them back in a
>>> byChar.
>>
>> Good point. Off the top of my head I can't remember any algorithm that
>> relies on array representation to do better on arrays than on
>> random-access ranges offering all of arrays' primitives. But I'm sure
>> there are a few.
>
> copy to begin with. And it's about 80x faster with plain arrays.

Question is if there are a bunch of them.

Andrei

March 09, 2014
On 3/9/14, 11:19 AM, Peter Alexander wrote:
> On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote:
>> On 3/9/14, 10:34 AM, Peter Alexander wrote:
>>> If we assume strings are normalized then substring search, equality
>>> testing, sorting all work the same with either code units or code
>>> points.
>>
>> But others such as edit distance or equal(some_string, some_wstring)
>> will not.
>
> equal(string, wstring) should either not compile, or be overloaded
> to do the right thing.

These would be possible designs, each with its pros and cons. The current design works out of the box across all encodings; it has its own pros and cons. That puts into perspective what should and shouldn't be.

> In an ideal world, char, wchar, and dchar should
> not be comparable.

Probably. But that has nothing to do with equal() working.

> Edit distance on code points is of questionable utility. Like Vladimir
> says, its meaning is pretty philosophical, even in ASCII (is "\r\n"
> really two "edits"? What is an "edit"?)

Nothing philosophical - it's as cut and dried as it gets. An edit is as defined by the Levenshtein algorithm using code points as the unit of comparison.
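Concretely, Phobos' levenshteinDistance already behaves that way (a minimal check, assuming precomposed literals):

    import std.algorithm : levenshteinDistance;

    unittest
    {
        // One edit at the code-point level: substitute 'e' with 'é',
        // even though 'é' occupies two UTF-8 code units.
        assert(levenshteinDistance("cafe", "café") == 1);
    }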

>>> I can't think of any case where you would want to count characters.
>>
>> wc
>
> % echo € | wc -c
> 4
>
> :-)

Noice.

>> (Generally: I've always been very very very doubtful about arguments
>> that start with "I can't think of..." because I've historically tried
>> them so many times, and with terrible results.)
>
> Fair point... but it's not as if we would be removing the ability (you
> could always do s.byCodePoint.count); we are talking about defaults. The
> argument that we shouldn't iterate by code unit by default because
> people might want to count code points is without substance. Also, with
> the proposal, string.count(dchar) would encode the dchar to a string
> first for performance, so it would still work.

That's a good enhancement for the current design as well - care to submit a request for it?

> Anyway, I think this discussion isn't really going anywhere so I think
> I'll agree to disagree and retire.

The part that advocates a breaking change will indeed not lead anywhere. The parts where we improve Unicode support for D are very fertile.


Andrei