March 08, 2014
On 3/8/14, 9:33 AM, Sean Kelly wrote:
> On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
>> Andrei suggests that this change would destroy D by breaking too much
>> existing code. He might be right. Can we afford the risk that he is
>> right?
>
> Perhaps not.  But I think the current approach is totally broken; it
> just also happens to be what people have coded to.

I think that's an exaggeration poorly supported by evidence. My definition of "totally broken" would be "essentially unusable", and I think we're well past the point of needing to prove that. Virtually all applications need to deal with strings to some extent, and I myself wrote a couple of relatively string-heavy ones. D strings work well. Even the most ardent detractors of D on e.g. reddit.com admit by omission that string processing is one of its strengths. Though they'll probably pick up on this thread soon :o).

> Andrei used
> algorithms operating on a code point level as an example of what would
> break if this change were made, and in that he's absolutely correct.
> But what bothers me is whether it's appropriate to perform this sort of
> character-based operation on Unicode strings in the first place.

Searching for characters in strings would be difficult to deem inappropriate.

When I designed std.algorithm I recall I put the following options on the table:

1. All algorithms would by default operate on strings at char/wchar level (i.e. code unit). That would cause the usual issues and confusions I was aware of from C++. Certain algorithms would require specialization and/or the user using byDchar for correctness. At some point I swear I had a byDchar definition somewhere; I've searched the recent git history for it, to no avail.

2. All algorithms would by default operate at code point level. That way correctness would be achieved by default, and certain algorithms would require specialization for efficiency. (Back then I didn't know about graphemes and normalization. I'm not sure how that would have affected the final decision.)

3. Change the alias string, wstring etc. to be some type that requires explicit access for code units/code points etc. instead of implicitly mixing the two.
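For concreteness, the gap between (1) and (2) can be sketched with current Phobos (byCodeUnit is a later addition, standing in here for an explicit code-unit adapter):

```d
import std.range : walkLength;
import std.utf : byCodeUnit;

void main()
{
    string s = "résumé";   // 6 code points, 8 UTF-8 code units
    assert(s.length == 8);                // .length counts code units
    assert(s.walkLength == 6);            // the range interface decodes to dchar
    assert(s.byCodeUnit.walkLength == 8); // explicit code-unit view, option (1)'s default
}
```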

My fave was (3). And not only mine - several people suggested alternative definitions of the "default" string type. Back then, however, we were in the middle of the D1/D2 transition, and one more aftershock didn't seem like a good idea at all. Walter opposed such a change, and didn't really have to convince me.

From experience with C++ I knew (1) had a bad track record, and (2) "generically conservative, specialize for speed" was a successful pattern.

What would you have chosen given that context?

> The current approach is a cut above treating strings as arrays of bytes
> for some languages, and still utterly broken for others. If I'm
> operating on a right to left language like Hebrew, what would I expect
> the result to be from something like countUntil?

The entire string processing paraphernalia is left-to-right. I figure RTL languages are under-supported, but s.retro.countUntil comes to mind.
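A sketch of that workaround (nothing truly RTL-aware; it just counts code points from the back):

```d
import std.algorithm : countUntil;
import std.range : retro;

void main()
{
    string s = "hello world";
    assert(s.countUntil('o') == 4);       // code points from the front
    assert(s.retro.countUntil('o') == 3); // code points from the back
}
```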

> And how useful would
> such a result be?

I don't know.

> I'm inclined to say that the correct approach is to
> state that algorithms operate explicitly on a T.sizeof basis and that if
> the data contained in a particular range has some multi-element encoding
> then separate, specialized routines should be used, as the T.sizeof
> behavior will not produce the desired result.

That sounds quite like C++ plus ICU. It doesn't strike me as the gold standard for Unicode integration.

> So the problem to me is that we're stuck not fixing something that's
> horribly broken just because it's broken in a way that people presumably
> now expect.

Clearly I'm being subjective here but again I'd find it difficult to get convinced we have something horribly broken from the evidence I gathered inside and outside Facebook.

> I'd personally like to see this fixed and I think the new behavior is
> preferable overall, but I do share Andrei's concern that such a big
> change might hurt the language anyway.

I've said this once and I'm saying it again: the best way to convert this discussion into something useful is to devise ideas for useful non-breaking additions.


Andrei

March 08, 2014
I'll admit that I'm probably not the best person to make suggestions here. As a back-end programmer, a large portion of my work is dealing with text streams of various types. The data I work with comes in any number of encodings, and none can be assumed to be in English. But literally all of my work is either parsing protocols whose symbols are single-byte, so the C way is appropriate, or working with blocks of text where I basically never operate at the per-character level. In fact I can think of only one case--trimming a block of text for display in a small frame. And there I use an explicit routine for trimming to a specific number of Unicode characters.

So regarding std.algorithm, I couldn't use it because I need to be able to slice based on the result. Knowing the number of multibyte code points between the beginning of the string and the thing I was searching for is utterly useless. Also, the performance is way too bad to make it a consideration.
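A sketch of that mismatch: countUntil walks code points, while slicing needs the code-unit offset that std.string.indexOf returns:

```d
import std.algorithm : countUntil;
import std.string : indexOf;

void main()
{
    string s = "αβγ:δ";
    assert(s.countUntil(':') == 3);          // code points before ':'
    assert(s.indexOf(':') == 6);             // code units before ':' (each Greek letter is 2 bytes)
    assert(s[0 .. s.indexOf(':')] == "αβγ"); // only the code-unit index can slice
}
```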

But you're right. I was being dramatic when I called it utterly broken. It's simply not useful to me as-is. The solution for me is fairly simple though if inelegant--cast the string to an array of ubyte. Having both options is nice I suppose. I just can't comment on the utility of the default behavior because I can't imagine a use for it.
March 08, 2014
On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu wrote:
> Searching for characters in strings would be difficult to deem inappropriate.

The notion of "character" exists only in certain writing systems. It is thus a flawed practice, and I think it should not be encouraged, as it will only make writing truly-international software more difficult. A more correct approach is searching for a certain substring. If non-exact matching is needed (normalization, case insensitivity etc.), then the appropriate solution is to use the Unicode algorithms.

If you look at the situation from this point of view, single code points become merely an implementation detail.
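For the non-exact-matching case, a sketch using std.uni.normalize (NFC chosen arbitrarily here; NFD/NFKC/NFKD are the other forms):

```d
import std.uni : normalize, NFC;

void main()
{
    string composed = "\u00E9";    // é as a single precomposed code point
    string decomposed = "e\u0301"; // 'e' + combining acute accent
    assert(composed != decomposed);                               // bytewise different
    assert(normalize!NFC(composed) == normalize!NFC(decomposed)); // equal after normalization
}
```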

> 1. All algorithms would by default operate on strings at char/wchar level (i.e. code unit). That would cause the usual issues and confusions I was aware of from C++. Certain algorithms would require specialization and/or the user using byDchar for correctness.

As previously discussed, "correctness" here is conditional. I would not use that word, it is another extreme.

> From experience with C++ I knew (1) had a bad track record, and (2) "generically conservative, specialize for speed" was a successful pattern.
>
> What would you have chosen given that context?

Ideally, we would have the Unicode algorithms in the standard library from day 1, and advocated their use throughout the documentation.

>> I'm inclined to say that the correct approach is to
>> state that algorithms operate explicitly on a T.sizeof basis and that if
>> the data contained in a particular range has some multi-element encoding
>> then separate, specialized routines should be used, as the T.sizeof
>> behavior will not produce the desired result.
>
> That sounds quite like C++ plus ICU. It doesn't strike me as the gold standard for Unicode integration.

Why not? Because it sounds like D needs exactly that. Plus its amazing slicing and range capabilities, of course.

>> So the problem to me is that we're stuck not fixing something that's
>> horribly broken just because it's broken in a way that people presumably
>> now expect.
>
> Clearly I'm being subjective here but again I'd find it difficult to get convinced we have something horribly broken from the evidence I gathered inside and outside Facebook.

Have you or anyone you personally know tried to process text in D containing a writing system such as Sanskrit's?

>> I'd personally like to see this fixed and I think the new behavior is
>> preferable overall, but I do share Andrei's concern that such a big
>> change might hurt the language anyway.
>
> I've said this once and I'm saying it again: the best way to convert this discussion into something useful is to devise ideas for useful non-breaking additions.

I disagree. As I've argued, I believe that currently most uses of dchars in an application are incorrect, and ultimately a time bomb for proper internationalization support. We need to apply the same procedure that we do with any language construct that was deemed to have been a poor decision: put it through a deprecation cycle and fix it.
March 08, 2014
On 3/8/14, 12:26 PM, Sean Kelly wrote:
> But you're right. I was being dramatic when I called it utterly broken.
> It's simply not useful to me as-is. The solution for me is fairly simple
> though if inelegant--cast the string to an array of ubyte.

Ain't nobody know nothing about http://dlang.org/phobos/std_string.html#.representation around here!
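A sketch of what it gives you (the ubyte view, no cast required):

```d
import std.string : representation;

void main()
{
    string s = "hello";
    immutable(ubyte)[] bytes = s.representation; // typed byte view of the string
    assert(bytes.length == 5);
    assert(bytes[0] == 'h');
}
```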

Andrei

March 08, 2014
On 3/8/14, 12:38 PM, Vladimir Panteleev wrote:
> On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu wrote:
>> 1. All algorithms would by default operate on strings at char/wchar
>> level (i.e. code unit). That would cause the usual issues and
>> confusions I was aware of from C++. Certain algorithms would require
>> specialization and/or the user using byDchar for correctness.
>
> As previously discussed, "correctness" here is conditional. I would not
> use that word, it is another extreme.

Agreed.

>> From experience with C++ I knew (1) had a bad track record, and (2)
>> "generically conservative, specialize for speed" was a successful
>> pattern.
>>
>> What would you have chosen given that context?
>
> Ideally, we would have the Unicode algorithms in the standard library
> from day 1, and advocated their use throughout the documentation.

It's not late to do a lot of that.

>>> I'm inclined to say that the correct approach is to
>>> state that algorithms operate explicitly on a T.sizeof basis and that if
>>> the data contained in a particular range has some multi-element encoding
>>> then separate, specialized routines should be used, as the T.sizeof
>>> behavior will not produce the desired result.
>>
>> That sounds quite like C++ plus ICU. It doesn't strike me as the
>> gold standard for Unicode integration.
>
> Why not? Because it sounds like D needs exactly that. Plus its amazing
> slicing and range capabilities, of course.

Pretty much everyone using ICU hates it.

>>> So the problem to me is that we're stuck not fixing something that's
>>> horribly broken just because it's broken in a way that people presumably
>>> now expect.
>>
>> Clearly I'm being subjective here but again I'd find it difficult to
>> get convinced we have something horribly broken from the evidence I
>> gathered inside and outside Facebook.
>
> Have you or anyone you personally know tried to process text in D
> containing a writing system such as Sanskrit's?

No. Point being?

>>> I'd personally like to see this fixed and I think the new behavior is
>>> preferable overall, but I do share Andrei's concern that such a big
>>> change might hurt the language anyway.
>>
>> I've said this once and I'm saying it again: the best way to convert
>> this discussion into something useful is to devise ideas for useful
>> non-breaking additions.
>
> I disagree. As I've argued, I believe that currently most uses of dchars
> in an application are incorrect, and ultimately a time bomb for proper
> internationalization support. We need to apply the same procedure that
> we do with any language construct that was deemed to have been a poor
> decision: put it through a deprecation cycle and fix it.

I think the risks of that are too large, and it's quite unclear that it solves a real problem. "Slightly better Unicode support" is hardly a good justification.


Andrei

March 08, 2014
On Sat, Mar 08, 2014 at 08:38:40PM +0000, Vladimir Panteleev wrote:
> On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu wrote:
> >Searching for characters in strings would be difficult to deem inappropriate.
> 
> The notion of "character" exists only in certain writing systems. It is thus a flawed practice, and I think it should not be encouraged, as it will only make writing truly-international software more difficult. A more correct approach is searching for a certain substring. If non-exact matching is needed (normalization, case insensitivity etc.), then the appropriate solution is to use the Unicode algorithms.

+1. Most "character"-based Unicode string operations are actually *substring* operations, because the notion of "character" is not universal to every writing system, and doesn't map 1-to-1 to Unicode code points anyway. I would argue that most instances of code that perform character-based operations on strings are incorrect, in the sense that they will fail to correctly process strings in certain languages.


[...]
> >From experience with C++ I knew (1) had a bad track record, and
> >(2) "generically conservative, specialize for speed" was a
> >successful pattern.
> >
> >What would you have chosen given that context?
> 
> Ideally, we would have the Unicode algorithms in the standard library from day 1, and advocated their use throughout the documentation.

+1. I came to D expecting this to be the case... and was a little let down when I discovered the actual state of affairs in std.uni at the time.  Thankfully, things have improved since, and all those who worked on that have my gratitude. But it's still not quite there yet.


[...]
> >>So the problem to me is that we're stuck not fixing something that's horribly broken just because it's broken in a way that people presumably now expect.
> >
> >Clearly I'm being subjective here but again I'd find it difficult to get convinced we have something horribly broken from the evidence I gathered inside and outside Facebook.
> 
> Have you or anyone you personally know tried to process text in D containing a writing system such as Sanskrit's?
[...]

Or more to the point, do you know of any experience that you can share about code that attempts to process these sorts of strings on a per character basis? My suspicion is that any code that operates on such strings, if they have any claim to correctness at all, must be substring-based, rather than character-based.


T

-- 
I think Debian's doing something wrong, `apt-get install pesticide', doesn't seem to remove the bugs on my system! -- Mike Dresser
March 08, 2014
On 3/8/2014 9:44 AM, "Luís Marques" <luis@luismarques.eu> wrote:
> (BTW, byGrapheme is currently missing in the std.uni docs)

https://github.com/D-Programming-Language/phobos/pull/1985
March 08, 2014
On Saturday, 8 March 2014 at 20:50:49 UTC, Andrei Alexandrescu wrote:
> On 3/8/14, 12:38 PM, Vladimir Panteleev wrote:
>> On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu wrote:
>>> That sounds quite like C++ plus ICU. It doesn't strike me as the
>>> gold standard for Unicode integration.
>>
>> Why not? Because it sounds like D needs exactly that. Plus its amazing
>> slicing and range capabilities, of course.
>
> Pretty much everyone using ICU hates it.

I admit I never used it personally. I just thought that what you described implied "D implementations of relevant Unicode algorithms, adapted to D style (range interface)". Is there more to this than the limitations of C++ or the implementers' design choices?

>> Have you or anyone you personally know tried to process text in D
>> containing a writing system such as Sanskrit's?
>
> No. Point being?

Point being, we don't have solid data to conclude whether D's current approach is actually good enough for such cases as you claim.

We do have one post in this thread:
http://forum.dlang.org/post/jlgfkxlrhlzdpwkpsrot@forum.dlang.org

> I think the risks of that are too large,

For what? We have not discussed a possible plan yet. Are you referring to Walter Bright's proposal?

> and it's quite unclear this is solving a problem. "Slightly better Unicode support" is hardly a good justification.

What this will solve:

1. Eliminating dangerous constructs, such as s.countUntil and s.indexOf both returning integers, yet possibly having different values in circumstances that the developer may not foresee.

2. Very high complexity of implementations (the ElementEncodingType problem previously mentioned).

3. Hidden, difficult-to-detect performance problems. The reason why this thread was started. I've had to deal with them in several places myself.

4. It would also encourage D programmers to write Unicode-capable code that is correct in the full sense of the word.

I think the above list has enough weight to merit at least considering *some* breaking changes.
March 08, 2014
On 3/8/2014 12:09 AM, Dmitry Olshansky wrote:
> Because Graphemes do not auto-magically convert to dchar and back? After all
> they are just small strings.

std.uni.Grapheme is a struct, and that struct contains a string of arbitrary length.

I don't know if that is the right design or not, or if a Grapheme should instead be an alias for a slice (rather than be a distinct type).

Graphemes do not appear to have a 1:1 mapping with dchars, and any attempt to do so would likely be a giant mistake.
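A sketch of that non-1:1-ness, using the simplest case of a combining mark:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "e\u0301"; // 'e' + COMBINING ACUTE ACCENT, rendered as é
    assert(s.walkLength == 2);            // two code points (dchars)
    assert(s.byGrapheme.walkLength == 1); // one user-perceived character
}
```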

March 08, 2014
On Saturday, 8 March 2014 at 20:52:40 UTC, H. S. Teoh wrote:
> Or more to the point, do you know of any experience that you can share
> about code that attempts to process these sorts of strings on a per
> character basis? My suspicion is that any code that operates on such
> strings, if they have any claim to correctness at all, must be
> substring-based, rather than character-based.

That's pretty much it. Unless you are working in the confines of certain languages (alphabets, scripts, etc.), many notions that are valid for English or European languages lose meaning in general. This includes the notion of "characters" - at full abstraction, you can only treat a string as a stream of code units (or code points, if you wish, but as has been discussed to death this is rarely useful).

An application which has to handle user text (which may be in any language) pretty much has to treat string variables as "holy":
- no indexing
- no slicing
- no counting anything
- no toUpper/toLower (std.ascii or std.uni)
etc.

All processing and transformations (line breaking, normalization, etc.) needs to be done using the relevant Unicode algorithms.
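A sketch of why the toUpper/toLower item matters: std.ascii silently passes non-ASCII through, and even std.uni's per-dchar overload is limited to simple 1:1 mappings:

```d
import std.ascii : asciiToUpper = toUpper;
import std.uni : uniToUpper = toUpper;

void main()
{
    assert(asciiToUpper('d') == 'D');
    assert(asciiToUpper('д') == 'д'); // std.ascii leaves non-ASCII unchanged: a quiet bug
    assert(uniToUpper('д') == 'Д');   // std.uni knows the simple Cyrillic mapping
}
```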

I've posted something earlier which I'd like to take back:

> [a-z] makes sense in English, and [а-я] makes sense in Russian

[а-я] makes sense for Russian, but it doesn't for Ukrainian, in the same way that [a-z] is useless for Portuguese. There are probably only a few such ranges in Unicode which encompass exactly one alphabet, given how much the alphabets of similar languages overlap.
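A sketch of the alternative, using std.uni's script property sets instead of hand-rolled code point ranges (membership via `in`, as in the std.uni docs):

```d
import std.uni : unicode;

void main()
{
    auto cyrillic = unicode.Cyrillic; // the whole script, not a guessed range
    assert('я' in cyrillic);
    assert('ї' in cyrillic);  // Ukrainian letter that [а-я] misses
    assert('a' !in cyrillic);
}
```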