March 09, 2014
On Sunday, 9 March 2014 at 01:23:27 UTC, Andrei Alexandrescu wrote:
> On 3/8/14, 4:42 PM, Vladimir Panteleev wrote:
>> On Saturday, 8 March 2014 at 23:59:15 UTC, Andrei Alexandrescu wrote:
>>> My only claim is that recognizing and iterating strings by code point
>>> is better than doing things by the octet.
>>
>> Considering or disregarding the disadvantages of this choice?
>
> Doing my best to weigh everything with the right measures.

I think it would be good to get a comparison of the two approaches, and list the arguments presented so far. I'll look into starting a Wiki page.

> Okay, though when you opened with "devastating" I was hoping for nothing short of death and dismemberment.

In proportion. To the best of my knowledge, no one here writes software for military or industrial robots in D. On my scale, security issues rank as the worst kind of software bug.

> Anyhow the fix is obvious per this brief tutorial: http://www.youtube.com/watch?v=hkDD03yeLnU

I don't get it.

>> I'm quite sure that std.range and std.algorithm will lose a LOT of
>> weight if they were fixed to not treat strings specially.
>
> I'm not so sure. Most of the string-specific optimizations simply detect certain string cases and forward them to array algorithms that need be written anyway. You would, indeed, save a fair amount of isSomeString conditionals and stuff (thus simplifying on scaffolding), but probably not a lot of code. That's not useless work - it'd go somewhere in any design.

One way to find out.

>>> Besides if you want to do Unicode you gotta crack some eggs.
>>
>> No, I can't see how this justifies the choice. An explicit decoding
>> range would have simplified things greatly while offering much of the
>> same advantages.
>
> My point there is that there's no useless or duplicated code that would be thrown away. A better design would indeed make for better modular separation - would be great if the string-related optimizations in std.algorithm went elsewhere. They wouldn't disappear.

Why? Isn't the whole issue that std.range presents strings as dchar ranges, and std.algorithm needs to detect dchar ranges and then treat them as char arrays? As opposed to std.algorithm just detecting arrays and treating them all as arrays (which it should be doing now anyway)?

>>>> 3. Hidden, difficult-to-detect performance problems. The reason why this
>>>> thread was started. I've had to deal with them in several places myself.
>>>
>>> I disagree with "hidden, difficult to detect".
>>
>> Why? You can only find out that an algorithm is slower than it needs to
>> be via either profiling (at which point you're wondering why the @#$%
>> the thing is so slow), or feeding it invalid UTF. If you had made a
>> different choice for Unicode in D, this problem would not exist altogether.
>
> Disagree.

Could you please elaborate? This is the second uninformative reply to this argument.

>> Except we already do. Arguments have already been presented in this
>> thread that demonstrate correctness problems with the current approach.
>> I don't think that these can stand up to the problems that the simpler
>> by-char iteration approach would have.
>
> Sure there are, and you yourself illustrated a misuse of the APIs.

If UTF decoding was explicit, the problem would stand out. I don't think this is a valid argument.

> My point is: code point is better than code unit

This was debated... people should not be looking at individual code points, unless they really know what they're doing.

> Grapheme is better than code point but a lot slower.

We are going in circles. People should have very good reasons for looking at individual graphemes as well.

> It seems we're quite in a sweet spot here wrt performance/correctness.

This does not seem like an objective summary of this thread's arguments so far.

I guess I'll get working on that wiki page to organize the arguments. This discussion is starting to feel like a quicksand roundabout.

> With what has been put forward so far, that's not even close to justifying a breaking change. If that great better design is just get back to code unit iteration, the change will not happen while I work on D. It is possible, however, that a much better idea comes forward, and I'd be looking forward to such.

Actually, could you post some examples of real-world code that would be broken by a hypothetical sudden switch? I think I would be hard-pressed to find some in my own code, but I'd need to check for sure to find out.

> 2. Add byChar that returns a random-access range iterating a string by character. Add byWchar that does on-the-fly transcoding to UTF16. Add byDchar that accepts any range of char and does decoding. And such stuff. Then whenever one wants to go through a string by code point can just use str.byChar.

This is confusing. Did you mean to say that byChar iterates a string by code unit (not character / code point)?
March 09, 2014
On 2014-03-08 23:50:43 +0000, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

>> Graphemes do not appear to have a 1:1 mapping with dchars, and any
>> attempt to do so would likely be a giant mistake.
> 
> I think they may be comparable to dchar.

Dchars, aka code points, are much more clearly defined than graphemes. A quick search shows me there's more than one way to segment a string into graphemes. There are the legacy and extended boundary algorithms for general processing, and then there are some tailored algorithms that can segment code points differently depending on the locale, or other considerations.

Reference:
http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

There are three examples of locale-specific graphemes in the table in the section linked above. "Ch" is one of them. Quoting Wikipedia: "Ch is a digraph in the Latin script. It is treated as a letter of its own in Chamorro, Czech, Slovak, Igbo, Quechua, Guarani, Welsh, Cornish, Breton and Belarusian Łacinka alphabets."
https://en.wikipedia.org/wiki/Ch_(digraph)

Also, there are some code points that represent ligatures (such as “fl”), which are in theory two graphemes. I'm not sure what the general algorithm does with those, but depending on what you're doing (counting characters? spell checking?) you might want to split them in two.

So basically you just can't make an algorithm capable of counting letters/graphemes/characters in a universal fashion. There's no such thing as a universal grapheme segmentation algorithm, even though there is a general one. It'd be wise for any API to expose this subtlety whenever segmenting graphemes.
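To make the ambiguity concrete, here is a quick sketch (in Python, which conveniently exposes strings as sequences of code points; the same relationships hold for D's dchar):

```python
# The same rendered text can have different code point counts, and one
# code point (a ligature) can correspond to two "letters". This is why
# "how many characters?" has no single universal answer.
import unicodedata

composed = "caf\u00E9"      # 'é' as one precomposed code point, U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Identical to a reader, different at the code point level:
assert composed != decomposed
assert len(composed) == 4 and len(decomposed) == 5

# Canonical normalization reconciles the two forms:
assert unicodedata.normalize("NFC", decomposed) == composed

# U+FB02 LATIN SMALL LIGATURE FL: one code point, arguably two graphemes.
# Compatibility normalization splits it into 'f' and 'l':
assert unicodedata.normalize("NFKC", "\uFB02") == "fl"
```

So even before reaching for locale-tailored segmentation, plain code point counting already disagrees with what a reader would call a "character".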

Text is an interesting topic for never-ending discussions.

-- 
Michel Fortin
michel.fortin@michelf.ca
http://michelf.ca

March 09, 2014
On 3/8/14, 6:14 PM, Vladimir Panteleev wrote:
> On Sunday, 9 March 2014 at 01:23:27 UTC, Andrei Alexandrescu wrote:
>> On 3/8/14, 4:42 PM, Vladimir Panteleev wrote:
>> My point there is that there's no useless or duplicated code that
>> would be thrown away. A better design would indeed make for better
>> modular separation - would be great if the string-related
>> optimizations in std.algorithm went elsewhere. They wouldn't disappear.
>
> Why? Isn't the whole issue that std.range presents strings as dchar
> ranges, and std.algorithm needs to detect dchar ranges and then treat
> them as char arrays? As opposed to std.algorithm just detecting arrays
> and treating them all as arrays (which it should be doing now anyway)?

That's scaffolding, not actual executable code.

>>> Why? You can only find out that an algorithm is slower than it needs to
>>> be via either profiling (at which point you're wondering why the @#$%
>>> the thing is so slow), or feeding it invalid UTF. If you had made a
>>> different choice for Unicode in D, this problem would not exist
>>> altogether.
>>
>> Disagree.
>
> Could you please elaborate? This is the second uninformative reply to
> this argument.

What can I say? The answer is obvious to me; it's not hard to figure. Performance of D's UTF strings has never been a mystery to me. From where I stand, all this "hidden, difficult-to-detect performance problems" drama is just posturing. We'd do well to wean such out of the discussion.

No myriad of bug reports "D strings are awfully slow" on Bugzilla.

No long threads "Why are D strings so slow" on stack overflow.

No trolling on reddit or hackernews "D? Just look at their strings. How could anyone think that's a good idea lol."

And it's not like people aren't talking. In contrast, D has been (and often rightly) criticized in the past for things like floating point performance and garbage collection. No evidence we are having an acute performance problem with UTF strings.

>> Sure there are, and you yourself illustrated a misuse of the APIs.
>
> If UTF decoding was explicit, the problem would stand out. I don't think
> this is a valid argument.

Yours? It indeed isn't, if what you want is to iterate by code unit (= meaningless for all but ASCII strings) by default.

>> My point is: code point is better than code unit
>
> This was debated... people should not be looking at individual code
> points, unless they really know what they're doing.

Should they be looking at code units instead?

>> Grapheme is better than code point but a lot slower.
>
> We are going in circles. People should have very good reasons for
> looking at individual graphemes as well.

And it's good we have increasing support for graphemes. I don't think they should be the default.

>> It seems we're quite in a sweet spot here wrt performance/correctness.
>
> This does not seem like an objective summary of this thread's arguments
> so far.

What is an objective summary? Those who want to inflict massive breakage are not even done arguing we have a better design.

> I guess I'll get working on that wiki page to organize the arguments.
> This discussion is starting to feel like a quicksand roundabout.

That's great. Yes, we're exchanging jabs right now which is not our best use of time. Also in the interest of time, please understand you'd need to show the second coming if you want to break backward compatibility. Additions are a much better path.

>> With what has been put forward so far, that's not even close to
>> justifying a breaking change. If that great better design is just get
>> back to code unit iteration, the change will not happen while I work
>> on D. It is possible, however, that a much better idea comes forward,
>> and I'd be looking forward to such.
>
> Actually, could you post some examples of real-world code that would be
> broken by a hypothetical sudden switch? I think I would be hard-pressed
> to find some in my own code, but I'd need to check for sure to find out.

I'm afraid burden of proof is on you. Far as I'm concerned every breakage of string processing is unacceptable or at least very undesirable.

>> 2. Add byChar that returns a random-access range iterating a string by
>> character. Add byWchar that does on-the-fly transcoding to UTF16. Add
>> byDchar that accepts any range of char and does decoding. And such
>> stuff. Then whenever one wants to go through a string by code point
>> can just use str.byChar.
>
> This is confusing. Did you mean to say that byChar iterates a string by
> code unit (not character / code point)?

Unit. s.byChar.front is a (possibly ref, possibly qualified) char.


Andrei

March 09, 2014
On Sunday, 9 March 2014 at 03:26:40 UTC, Andrei Alexandrescu wrote:
> And it's not like people aren't talking. In contrast, D has been (and often rightly) criticized in the past for things like floating point performance and garbage collection. No evidence we are having an acute performance problem with UTF strings.

The size of this thread is one factor. But I see your point - I agree it is evidently not one of D's more glaring current problems. I hope I never implied otherwise. That doesn't mean the problem doesn't exist at all, though.

>> If UTF decoding was explicit, the problem would stand out. I don't think
>> this is a valid argument.
>
> Yours? Indeed isn't, if what you want is iterate by code unit (= meaningless for all but ASCII strings) by default.

I don't understand this argument. Iterating by code unit is not meaningless if you don't want to extract meaning from each unit iteration. For example, if you're parsing JSON or XML, you only care about the syntax characters, which are all ASCII. And there is no confusion of "what exactly are we counting here".

>> This was debated... people should not be looking at individual code
>> points, unless they really know what they're doing.
>
> Should they be looking at code units instead?

No. They should only be looking at substrings.

Unless they're e.g. parsing a computer language (regardless if it has international text data), as above.

>> We are going in circles. People should have very good reasons for
>> looking at individual graphemes as well.
>
> And it's good we have increasing support for graphemes. I don't think they should be the default.

I don't think so either. Did I somehow imply that?

> What is an objective summary? Those who want to inflict massive breakage are not even done arguing we have a better design.

From my POV, I could say I see consensus, with just you defending a decision you made a while ago :) But I'd prefer a constructive discussion.

Anyway, I don't want to "inflict massive breakage" either. I want the amount of breakage to be a justified cost of fixing a mistake and permanently improving the language's design going forward.

Here's what I have so far, BTW:
http://wiki.dlang.org/Element_type_of_string_ranges
I'll have to review it in the morning. Or rather, afternoon, given that it's 6 AM here.

> I'm afraid burden of proof is on you.

Why? I'm not saying that if you can't produce an example of breakage then your arguments are invalid. Rather, concrete examples give us a concrete problem to work with. I'm not trying to put any "burden of proof" on anyone.

> That's great. Yes, we're exchanging jabs right now which is not our best use of time. Also in the interest of time, please understand you'd need to show the second coming if you want to break backward compatibility. Additions are a much better path.

Even a teensy-weensy breakage? :)

> Far as I'm concerned every breakage of string processing is unacceptable or at least very undesirable.

In all seriousness, at this point I'm worried that you will defend the status quo even if the breakage turns out minimal. Instead of dealing with absolutes, advantages and disadvantages should be weighed against one another (even with the breaking-backwards-compatibility penalty being very high).

> Unit. s.byChar.front is a (possibly ref, possibly qualified) char.

So... does byChar for wstrings do the same thing as byWchar? And what if you want to iterate a wstring by char? Wouldn't it be better to have byChar/byWchar/byDchar be a range of char/wchar/dchar regardless of the string type, and have byCodeUnit which iterates by the code unit type?
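To illustrate the semantics I have in mind, here is a rough sketch in Python (models only; the names mirror the proposal in this thread, not any actual Phobos API): each adapter yields one element type regardless of the source string's own encoding.

```python
# byChar / byWchar / byDchar modeled as generators over a string
# (a sequence of code points, analogous to iterating any D string):
def by_char(s):      # UTF-8 code units (D: char)
    yield from s.encode("utf-8")

def by_wchar(s):     # UTF-16 code units (D: wchar)
    data = s.encode("utf-16-le")
    for i in range(0, len(data), 2):
        yield int.from_bytes(data[i:i + 2], "little")

def by_dchar(s):     # code points (D: dchar)
    for ch in s:
        yield ord(ch)

s = "a\U0001F600"                    # 'a' plus an emoji outside the BMP
assert len(list(by_char(s))) == 5    # 1 + 4 UTF-8 code units
assert len(list(by_wchar(s))) == 3   # 1 + a surrogate pair
assert len(list(by_dchar(s))) == 2   # 2 code points
```

Under this reading, the element type is fixed by the adapter, and a separate byCodeUnit would yield whatever code unit type the underlying string already stores, with no transcoding.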
March 09, 2014
On 3/8/14, 7:53 PM, Vladimir Panteleev wrote:
>  From my POV, I could say I see consensus, with just you defending a
> decision you made a while ago :) But I'd prefer a constructive discussion.

What exactly is the consensus? From your wiki page I see "One of the proposals in the thread is to switch the iteration type of string ranges from dchar to the string's character type."

I can tell you straight out: That will not happen for as long as I'm working on D. I'm ready to fight on this not only Walter Bright, but him and Walter White together. (Fortunately the former agrees the breakage is too large; haven't asked the latter yet.)

> Anyway, I don't want to "inflict massive breakage" either. I want the
> amount of breakage to be a justified cost of fixing a mistake and
> permanently improving the language's design going forward.

It seems you and I have a different view of the tradeoffs involved.

> In all seriousness, at this point I'm worried that you will defend the
> status quo even if the breakage turns out minimal. Instead of dealing
> with absolutes, advantages and disadvantages should be weighed against
> another (even with the breaking-backwards-compatibility penalty being
> very high).

Of course. If you come with something better, I'd be glad to take a look.

>> Unit. s.byChar.front is a (possibly ref, possibly qualified) char.
>
> So... does byChar for wstrings do the same thing as byWchar?

No, it transcodes from UTF16 to UTF8.

> And what if
> you want to iterate a wstring by char?

byChar.

> Wouldn't it be better to have
> byChar/byWchar/byDchar be a range of char/wchar/dchar regardless of the
> string type

that's right

>, and have byCodeUnit which iterates by the code unit type?

We must add that too. I agree the resulting design is roundabout (you have char[] which is by default iterated by code point, and you need to wrap it to get to its units that were there in the first place).

I also wanted to add some ASCII string love (by ascribing it a separate type) but Walter has good arguments opposing that.


Andrei

March 09, 2014
On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu wrote:
> What exactly is the consensus? From your wiki page I see "One of the proposals in the thread is to switch the iteration type of string ranges from dchar to the string's character type."
>
> I can tell you straight out: That will not happen for as long as I'm working on D.

Why?
March 09, 2014
On Sunday, 9 March 2014 at 01:23:27 UTC, Andrei Alexandrescu wrote:
>
> I was thinking of these too:
>
> 1. Revisit std.encoding and perhaps confer legitimacy to the character types defined there. The implementation in std.encoding is wanting, but I think the idea is sound. Essentially give more love to various encodings, including Ascii and "bypass encoding, I'll deal with stuff myself".
>
> 2. Add byChar that returns a random-access range iterating a string by character. Add byWchar that does on-the-fly transcoding to UTF16. Add byDchar that accepts any range of char and does decoding. And such stuff. Then whenever one wants to go through a string by code point can just use str.byChar.
>
>
> Andrei

I like these two points you make here. In particular, I like the recent addition of byGrapheme, and other ideas along this line which provide a custom range interface to a string. Such additions do not break code but add opt-in functionality for those who need it, while leaving the default case intact.

Overall, I think the current string design in D2 strikes a nice balance between performance and functionality. It does not reach Unicode perfection, but it gets rather close to good usability while still maintaining good C compatibility and performance in the default case.

As for Walter's original post regarding the use of decode by default in std.array.front: if I had it my way, I would prefer all performance hits to be explicit, so that I know what I am paying for simply by reading the code. Nonetheless, this change will break code in the wild relying on its current behavior.

As a result, I feel that making such a fundamental change would be better postponed until the next major version of D is considered. D currently carries much hope due to its potential, but is struggling to gain a reputation as a reliable, quality, production-ready language. If such fundamental changes are made at this point, it will do a lot of harm to D's reputation, which it may never recover from.

Rather than making such a change now, I feel that fixing all open issues in Bugzilla and 'completing' D2 would do much good. Then, near the close of implementing D2, a new library implementation of text capabilities could be prototyped for D3 and flagged as beta-please-test-but-avoid-use-in-production-code. Such an approach would benefit from the insights gained from implementing this version in D2 and also get much-needed input from actual usage.

Joseph
March 09, 2014
On 3/8/14, 8:24 PM, Vladimir Panteleev wrote:
> On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu wrote:
>> What exactly is the consensus? From your wiki page I see "One of the
>> proposals in the thread is to switch the iteration type of string
>> ranges from dchar to the string's character type."
>>
>> I can tell you straight out: That will not happen for as long as I'm
>> working on D.
>
> Why?

From the cycle "going in circles": because I think the breakage is way too large compared to the alleged improvement. In fact I believe that that design is inferior to the current one regardless.

Andrei

March 09, 2014
On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu wrote:
>> The current approach is a cut above treating strings as arrays of bytes
>> for some languages, and still utterly broken for others. If I'm
>> operating on a right to left language like Hebrew, what would I expect
>> the result to be from something like countUntil?
>
> The entire string processing paraphernalia is left to right. I figure RTL languages are under-supported, but s.retro.countUntil comes to mind.
>
> Andrei

I'm pretty sure that all string operations are actually "front to back". If I recall correctly, even languages that "read" right to left are stored in a front-to-back manner: e.g. string[0] would be the right-most displayed character. It is only a question of "display", and changes nothing in the code. As for "countUntil", it would still work perfectly fine, as an RTL reader would expect the counting to start at the "beginning", i.e. the "right" side.

I'm pretty confident RTL is 100% supported. The only issue is the "front"/"left" ambiguity, and the only case I know of is the oddly named "stripLeft" function, which actually does a "stripFront" anyway.

So I wouldn't worry about RTL.

But as mentioned, it is Indic languages, which have complex graphemes, or languages with accented characters, e.g. most European ones, that can have problems, such as canFind("cassé", 'e').
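To make the "cassé" pitfall concrete, a quick sketch (in Python, which iterates strings by code point, much as D's auto-decoding does):

```python
# Whether a by-code-point search for 'e' "finds" something in "cassé"
# depends on how the 'é' happens to be encoded -- a correctness trap
# that per-code-point iteration does not protect against.
import unicodedata

nfc = "cass\u00E9"     # 'é' as a single precomposed code point
nfd = "casse\u0301"    # 'e' + U+0301 COMBINING ACUTE ACCENT

# Same text to a reader:
assert unicodedata.normalize("NFD", nfc) == nfd

# But a code-point-level search for 'e' disagrees between the forms:
assert "e" not in nfc  # no bare 'e' code point here
assert "e" in nfd      # matches the 'e' *inside* the grapheme 'é'
```

So decoding to code points gives an answer, just not necessarily the answer a human would expect; only grapheme-aware (or normalization-aware) processing closes that gap.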

On topic, I think D's implicit default decode to dchar is *infinity* times better than C++'s char-based strings. While imperfect in terms of grapheme, it was still a design decision made of win.

I'd be tempted not to ask "how do we back out", but rather, "how can we take this further"? I'd love to ditch the whole "char"/"dchar" thing altogether, and work with graphemes. But that would be a massive undertaking.
March 09, 2014
On 3/7/2014 6:33 PM, H. S. Teoh wrote:
> On Fri, Mar 07, 2014 at 11:13:50PM +0000, Sarath Kodali wrote:
>> On Friday, 7 March 2014 at 22:35:47 UTC, Sarath Kodali wrote:
>>>
>>> +1
>>> In Indian languages, a character consists of one or more UNICODE
>>> code points. For example, in Sanskrit "ddhrya"
>>> http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
>>> consists of 7 UNICODE code points. So to search for this char I
>>> have to use string search.
>>>
>>> - Sarath
>>
>> Oops, incomplete reply ...
>>
>> Since a single "alphabet" in Indian languages can contain multiple
>> code-points, iterating over single code-points is like iterating
>> over char[] for non English European languages. So decode is of no
>> use other than decreasing the performance. A raw char[] comparison
>> is much faster.
>
> Yes. The more I think about it, the more auto-decoding sounds like a
> wrong decision. The question, though, is whether it's worth the massive
> code breakage needed to undo it. :-(
>

I'm leaning the same way too. But I also think Andrei is right that, at this point in time, it'd be a terrible move to change things so that "by code unit" is default. For better or worse, that ship has sailed.

Perhaps we *can* deal with the auto-decoding problem not by killing auto-decoding, but by marginalizing it in an additive way:

Convincing arguments have been made that any string-processing code which *isn't* done entirely with the official Unicode algorithms is likely wrong *regardless* of whether std.algorithm defaults to per-code-unit or per-code-point.
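Sarath's Sanskrit "ddhrya" example earlier in the thread shows this in miniature. Here it is sketched in Python with a shorter Devanagari conjunct (the relationships are the same for the 7-code-point case):

```python
# A single written unit in an Indic script can span several code points,
# so per-code-point iteration doesn't yield "characters" either. But a
# plain substring search works identically at the code point and code
# unit levels -- no decoding needed, and no decoding helps.
cluster = "\u0915\u094D\u0937"   # DEVANAGARI KA + VIRAMA + SSA: क्ष
text = "prefix " + cluster + " suffix"

assert len(cluster) == 3                  # three code points, one conjunct
assert cluster in text                    # substring search over code points
assert cluster.encode() in text.encode()  # substring search over UTF-8 units
```

In other words, for searching, the decode step buys nothing; correctness for such scripts comes from substring or grapheme-level operations, not from the element type of iteration.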

So...How's this?: We add any of these Unicode algorithms we may be missing, encourage their use for strings, discourage use of std.algorithm for string processing, and in the meantime, just do our best to reduce unnecessary decoding wherever possible. Then we call it a day and all be happy :)