March 09, 2014
Vladimir Panteleev:

>> Seriously, Bearophile suggested "ABCD".sort(), and it took about 6 pages (!) for someone to point out this would be wrong.
>
> Sorting a string has quite limited use in the general case,

It seems I am sorting arrays of mutable ASCII chars often enough :-)

Some time ago I even asked for a helper function:
https://d.puremagic.com/issues/show_bug.cgi?id=10162
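In the meantime the workaround looks like this (a sketch; it is only safe for ASCII data, because it sorts raw code units):

import std.algorithm : sort;
import std.string : representation;

void main()
{
    char[] s = "dcba".dup;
    // sort(s) doesn't compile: char[] is auto-decoded, so it's not a
    // random-access range. Sort the raw code units instead:
    sort(s.representation);
    assert(s == "abcd");
}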

Bye,
bearophile
March 09, 2014
On Sunday, 9 March 2014 at 16:02:55 UTC, bearophile wrote:
> Vladimir Panteleev:
>
>>> Seriously, Bearophile suggested "ABCD".sort(), and it took about 6 pages (!) for someone to point out this would be wrong.
>>
>> Sorting a string has quite limited use in the general case,
>
> It seems I am sorting arrays of mutable ASCII chars often enough :-)

What do you use this for?

I can think of sort being useful e.g. to see which characters appear in a string (and with which frequency), but as the concept does not apply to all languages, one would need to draw a line somewhere for which languages they want to support. I think this should be done explicitly in user code.
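For example, something like this sketch (it assumes pure-ASCII input):

import std.algorithm : group, sort;
import std.stdio : writefln;
import std.string : representation;

void main()
{
    char[] s = "abracadabra".dup;
    sort(s.representation); // sort the raw bytes; only valid for ASCII
    foreach (g; s.group)    // runs of equal characters
        writefln("%s: %s", g[0], g[1]);
}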
March 09, 2014
Vladimir Panteleev:

> What do you use this for?

For lots of different reasons: counting, testing, histograms, unique-ifying, enabling binary searches, etc. You can find alternative solutions for every one of those use cases.
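For example, unique-ifying (again a sketch that is only correct for ASCII):

import std.algorithm : sort, uniq;
import std.array : array;
import std.string : representation;

void main()
{
    char[] s = "mississippi".dup;
    sort(s.representation);
    assert(s.uniq.array == "imps"d); // the distinct characters
}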


> I can think of sort being useful e.g. to see which characters appear in a string (and with which frequency), but as the concept does not apply to all languages, one would need to draw a line somewhere for which languages they want to support. I think this should be done explicitly in user code.

So far I have needed to sort 7-bit ASCII chars.

Bye,
bearophile
March 09, 2014
On 3/9/14, 4:34 AM, Peter Alexander wrote:
> I think this is the main confusion: the belief that iterating by code
> point has utility.
>
> If you care about normalization then neither by code unit, by code
> point, nor by grapheme are correct (except in certain language subsets).

I suspect that code unit iteration is the worst, as it works only with ASCII and perchance with ASCII single-byte extensions. Then we have code point iteration, which works with a larger spectrum of languages. One question would be how large that spectrum is. If it's larger than English, then that would be nice, because we would've made progress.

I don't know about normalization beyond discussions in this group, but as far as I understand from http://www.unicode.org/faq/normalization.html, normalization would be a one-step process, after which code point iteration would cover still more human languages. No? I'm pretty sure it's more complicated than that, so please illuminate me :o).
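For what it's worth, std.uni already exposes normalization as such a one-step transform; a small example of the equality aspect:

import std.uni : normalize, NFC;

void main()
{
    string a = "é";        // U+00E9, precomposed
    string b = "e\u0301";  // 'e' plus U+0301 combining acute accent
    assert(a != b);                // different code units and code points
    assert(normalize!NFC(b) == a); // equal after normalizing to NFC
}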

> If you don't care about normalization then by code unit is just as good
> as by code point, but you don't need to specialise everywhere in Phobos.
>
> AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'),
> but as Vladimir correctly points out: (a) by code point, this is still
> broken in the face of normalization, and (b) are there any real
> applications that search a string for a specific non-ASCII character?

What happened to counting characters and such?

> To those that think the status quo is better, can you give an example of
> a real-life use case that demonstrates this?

split(ter) comes to mind.

> I do think it's probably too late to change this, but I think there is
> value in at least getting everyone on the same page.

Awesome.


Andrei

March 09, 2014
On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
> On 09/03/14 04:26, Andrei Alexandrescu wrote:
>>>> 2. Add byChar that returns a random-access range iterating a string by
>>>> character. Add byWchar that does on-the-fly transcoding to UTF16. Add
>>>> byDchar that accepts any range of char and does decoding. And such
>>>> stuff. Then whenever one wants to go through a string by code point
>>>> can just use str.byChar.
>>>
>>> This is confusing. Did you mean to say that byChar iterates a string by
>>> code unit (not character / code point)?
>>
>> Unit. s.byChar.front is a (possibly ref, possibly qualified) char.
>
> So IIUC iterating over s.byChar would not encounter the decoding-related
> speed hits that Walter is concerned about?

That is correct.
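For concreteness, a minimal sketch of such a non-decoding range (hypothetical code, not the proposed implementation):

auto byChar(string s)
{
    static struct ByChar
    {
        string s;
        @property bool empty() const { return s.length == 0; }
        @property char front() const { return s[0]; } // no decoding
        void popFront() { s = s[1 .. $]; }
    }
    return ByChar(s);
}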

Andrei
March 09, 2014
On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote:
> On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
>> So IIUC iterating over s.byChar would not encounter the decoding-related
>> speed hits that Walter is concerned about?
>
> That is correct.

Unless I'm missing something, all algorithms that can work faster on arrays will need to be adapted to also recognize byChar-wrapped arrays, unwrap them, perform the fast array operation, and wrap them back in a byChar.
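That is, something like this for each of them (a sketch, with a hypothetical ByChar wrapper that exposes its underlying array as `source`):

import std.algorithm : find;

struct ByChar { string source; /* range primitives elided */ }

// each array-optimized algorithm would need an overload like:
ByChar findSub(ByChar haystack, ByChar needle)
{
    // unwrap, run the fast array search on raw code units, re-wrap
    return ByChar(find(haystack.source, needle.source));
}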
March 09, 2014
On 3/9/14, 6:34 AM, Jakob Ovrum wrote:
> On Sunday, 9 March 2014 at 13:08:05 UTC, Marc Schütz wrote:
>> Also, `byCodeUnit` and `byCodePoint` would probably be better names
>> than `raw` and `decode`, to match the already existing `byGrapheme` in
>> std.uni.
>
> There already is a std.uni.byCodePoint. It is a higher order range that
> accepts ranges of graphemes and ranges of code points (such as strings).

noice

> `byCodeUnit` is essentially std.string.representation.

Actually not, because, for reasons that are unclear to me, people really want the element type to be char, not ubyte.
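That is, today:

import std.string : representation;

void main()
{
    string s = "hello";
    auto r = s.representation; // element type is immutable(ubyte), not char
    static assert(is(typeof(r) == immutable(ubyte)[]));
}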


Andrei

March 09, 2014
On Sunday, 9 March 2014 at 15:23:57 UTC, Vladimir Panteleev wrote:
> On Sunday, 9 March 2014 at 13:51:12 UTC, Marc Schütz wrote:
>> On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote:
>>> On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev wrote:
>>>> Can we look at some example situations that this will break?
>>>
>>> Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string?
>>
>> This would no longer compile, as dchar[] stops being a range. countUntil(range.byCodePoint) would have to be used instead.
>
> Why? There's no reason why dchar[] would stop being a range. It will be treated as now, like any other array.

This was under the assumption that Nick's proposal (and my "amendment" to extend it to dchar because of graphemes etc.) would be implemented.

But I made the mistake of replying to posts as I read them, only to notice a few posts later that someone else had already posted something to the same effect, or had made my point irrelevant. Sorry for the confusion.
March 09, 2014
On 3/9/14, 6:47 AM, "Marc Schütz" <schuetzm@gmx.net> wrote:
> On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
>> 2) It is regression back to C++ days of no-one-cares-about-Unicode
>> pain. Thinking about strings as character arrays is so natural and
>> convenient that if language/Phobos won't punish you for that, it will
>> be extremely widespread.
>
> Not with Nick Sabalausky's suggestion to remove the implementation of
> front from char arrays. This way, everyone will be forced to decide
> whether they want code units or code points or something else.

Such as giving up on that crappy language that keeps on breaking their code.

Andrei

March 09, 2014
On Sunday, 9 March 2014 at 17:15:59 UTC, Andrei Alexandrescu wrote:
> On 3/9/14, 4:34 AM, Peter Alexander wrote:
>> I think this is the main confusion: the belief that iterating by code
>> point has utility.
>>
>> If you care about normalization then neither by code unit, by code
>> point, nor by grapheme are correct (except in certain language subsets).
>
> I suspect that code unit iteration is the worst, as it works only with ASCII and perchance with ASCII single-byte extensions. Then we have code point iteration, which works with a larger spectrum of languages. One question would be how large that spectrum is. If it's larger than English, then that would be nice, because we would've made progress.
>
> I don't know about normalization beyond discussions in this group, but as far as I understand from http://www.unicode.org/faq/normalization.html, normalization would be a one-step process, after which code point iteration would cover still more human languages. No? I'm pretty sure it's more complicated than that, so please illuminate me :o).

It depends what you mean by "cover" :-)

If we assume strings are normalized, then substring search, equality testing, and sorting all work the same with either code units or code points.
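A quick check of the substring case on raw UTF-8 code units (a sketch; the match lands at byte offset 4, past the two-code-unit 'å'):

import std.algorithm : countUntil;
import std.string : representation;

void main()
{
    string s = "blåbærgrød";
    // searching code units finds the match without any decoding
    assert(s.representation.countUntil("bær".representation) == 4);
}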


>> If you don't care about normalization then by code unit is just as good
>> as by code point, but you don't need to specialise everywhere in Phobos.
>>
>> AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'),
>> but as Vladimir correctly points out: (a) by code point, this is still
>> broken in the face of normalization, and (b) are there any real
>> applications that search a string for a specific non-ASCII character?
>
> What happened to counting characters and such?

I can't think of any case where you would want to count characters.

* If you want an index to slice from, then you need code units.
* If you want a buffer size, then you need code units.
* If you are doing something like word wrapping then you need to count glyphs, which is not the same as counting code points (and that only works with mono-spaced fonts anyway -- with variable width fonts you need to add up the widths of those glyphs)


>> To those that think the status quo is better, can you give an example of
>> a real-life use case that demonstrates this?
>
> split(ter) comes to mind.

splitter is just an application of substring search, no? Substring search works the same with both code units and code points (e.g. strstr in C works with UTF-8-encoded strings without any need to decode).

All you need to do is ensure that mismatched encodings in the delimiter are re-encoded (you want to do this for performance anyway):

import std.algorithm;
import std.utf : encode;

auto splitter(string str, dchar delim)
{
    char[4] buf;
    immutable n = encode(buf, delim); // UTF-8 encode the delimiter
    // .idup copies out of the stack buffer so the lazy range stays valid
    return std.algorithm.splitter(str, buf[0 .. n].idup);
}
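Usage would then look like:

void main()
{
    import std.algorithm : equal;
    assert(splitter("a,b,c", ',').equal(["a", "b", "c"]));
}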