March 08, 2014
On Saturday, 8 March 2014 at 20:50:49 UTC, Andrei Alexandrescu wrote:
>
> Pretty much everyone using ICU hates it.

I think the biggest problem with ICU is documentation.  It can take a long time to figure out how to do something if you've never done it before.  Also, the C interface in ICU seems better than the C++ interface.  And I'll grant that a few things are just far harder than they need to be.  I wanted a transcoding iterator and ICU almost has this but not quite, so I've got to write my own.  In fact, iterating across an arbitrary encoding in general is at least not intuitive and perhaps not possible.  I kinda gave up on that.  Um, and there's the use of UTF-16 as the standard internal encoding, which means many transcoding operations require two conversions.  Okay, I guess there are a lot of problems with ICU, but it handles nearly every requirement I have, which is in itself quite a lot.
March 08, 2014
On 08-Mar-2014 20:43, Vladimir Panteleev wrote:
> On Saturday, 8 March 2014 at 15:56:08 UTC, Dmitry Olshansky wrote:
>> The following should work as is though:
>>
>> s.byGrapheme.canFind(Grapheme("é"))
>
> Doesn't work here. Not sure why.
>
> Grapheme(1000065, 3, 0, 33554432, [101, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0], 2) // last byGrapheme
>
> vs.
>
> Grapheme(E9, 0, 0, 16777216, [233, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0], 1) // Grapheme("é")

Sounds like a bug; file it before we get derailed.

-- 
Dmitry Olshansky
March 08, 2014
On 09-Mar-2014 01:15, Walter Bright wrote:
> On 3/8/2014 12:09 AM, Dmitry Olshansky wrote:
>> Because Graphemes do not auto-magically convert to dchar and back?
>> After all
>> they are just small strings.
>
> std.uni.Grapheme is a struct, and that struct contains a string of
> arbitrary length.
>
> I don't know if that is the right design or not, or if a Grapheme should
> instead be an alias for a slice (rather than be a distinct type).

Graphemes use small-string optimization with great success, as indeed plenty of graphemes are just one code point, and many others are only a couple.
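
For illustration, here is the rough shape of that optimization. This is a simplified sketch only; TinyGrapheme is a hypothetical type, not std.uni.Grapheme's actual layout:

struct TinyGrapheme
{
    private union
    {
        ubyte[15] small;   // inline storage -- covers almost all clusters
        ubyte[] large;     // heap slice, only for the rare long cluster
    }
    private ubyte len;     // 0..15 = inline length; 0xFF = heap mode

    const(ubyte)[] codeUnits() const
    {
        return len == 0xFF ? large : small[0 .. len];
    }
}

With one byte of discriminator, the common one-or-two code point case never allocates.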

>
> Graphemes do not appear to have a 1:1 mapping with dchars, and any
> attempt to do so would likely be a giant mistake.
>


-- 
Dmitry Olshansky
March 08, 2014
On Saturday, 8 March 2014 at 22:42:20 UTC, Dmitry Olshansky wrote:
> Sounds like a bug; file it before we get derailed.

https://d.puremagic.com/issues/show_bug.cgi?id=12324
March 08, 2014
> 1. implement decode() as an algorithm for string types,

Decode is an incredibly generic name. What about byGlyph, similar to byLine?
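
To make the analogy concrete: byGlyph is the hypothetical name proposed here, while byLine (std.stdio) and byGrapheme (std.uni) already exist with exactly this lazy-range shape:

import std.stdio : stdin;
import std.uni : byGrapheme;

void demo()
{
    foreach (line; stdin.byLine) {}       // the existing naming convention
    foreach (g; "résumé".byGrapheme) {}   // nearest existing analogue of byGlyph
}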

March 08, 2014
> This would kill D. I am not exaggerating.

I don't know about kill, but it certainly feels awfully similar to the Python 2/3 split over strings and Unicode, which still doesn't seem to be resolved.
March 08, 2014
On 3/8/14, 1:15 PM, Walter Bright wrote:
> On 3/8/2014 12:09 AM, Dmitry Olshansky wrote:
>> Because Graphemes do not auto-magically convert to dchar and back?
>> After all
>> they are just small strings.
>
> std.uni.Grapheme is a struct, and that struct contains a string of
> arbitrary length.
>
> I don't know if that is the right design or not, or if a Grapheme should
> instead be an alias for a slice (rather than be a distinct type).

I think basic encapsulation suggests Grapheme should be a distinct type. It's a restricted slice, not just any slice.
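
A quick sketch of the distinction. CheckedGrapheme is a hypothetical illustration; std.uni.graphemeStride is a real Phobos function:

import std.uni : graphemeStride;

// A bare alias enforces nothing -- any slice qualifies:
alias GraphemeAlias = const(char)[];

// A distinct type can carry the invariant "exactly one grapheme cluster":
struct CheckedGrapheme
{
    private const(char)[] data;

    this(const(char)[] s)
    {
        // graphemeStride gives the code-unit length of the cluster
        // starting at the given index.
        assert(s.length && graphemeStride(s, 0) == s.length,
               "not a single grapheme cluster");
        data = s;
    }

    const(char)[] str() const { return data; }
}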

> Graphemes do not appear to have a 1:1 mapping with dchars, and any
> attempt to do so would likely be a giant mistake.

I think they may be comparable to dchar.


Andrei

March 08, 2014
On 3/8/14, 1:13 PM, Vladimir Panteleev wrote:
> On Saturday, 8 March 2014 at 20:50:49 UTC, Andrei Alexandrescu wrote:
>> On 3/8/14, 12:38 PM, Vladimir Panteleev wrote:
>>> On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu wrote:
>>>> That sounds quite like C++ plus ICU. It doesn't strike me as the
>>>> golden standard for Unicode integration.
>>>
>>> Why not? Because it sounds like D needs exactly that. Plus its amazing
>>> slicing and range capabilities, of course.
>>
>> Pretty much everyone using ICU hates it.
>
> I admit I never used it personally.

Time to do due diligence :o).

> I just thought you meant that
> implied "D implementations of relevant Unicode algorithms, adapted to D
> style (range interface)". Is there more to this than the limitations of
> C++ or the implementers' design choices?
>
>>> Have you or anyone you personally know tried to process text in D
>>> containing a writing system such as Sanskrit's?
>>
>> No. Point being?
>
> Point being, we don't have solid data to conclude whether D's current
> approach is actually good enough for such cases as you claim.

My only claim is that recognizing and iterating strings by code point is better than doing things by the octet.

> We do have one post in this thread:
> http://forum.dlang.org/post/jlgfkxlrhlzdpwkpsrot@forum.dlang.org
>
>> I think there are too large risks for that,
>
> For what? We have not discussed a possible plan yet. Are you referring
> to Walter Bright's proposal?

Any plan that inflicts a large breaking change on strings incurs risk. To add insult to injury, the improvement brought about by the change is debatable.

>> and it's quite unclear this is solving a problem. "Slightly better
>> Unicode support" is hardly a good justification.
>
> What this will solve:
>
> 1. Eliminating dangerous constructs, such as s.countUntil and s.indexOf
> both returning integers, yet possibly having different values in
> circumstances that the developer may not foresee.

I disagree there's any danger. They deal in code points, end of story.

> 2. Very high complexity of implementations (the ElementEncodingType
> problem previously mentioned).

I disagree with "very high". Besides if you want to do Unicode you gotta crack some eggs.

> 3. Hidden, difficult-to-detect performance problems. The reason why this
> thread was started. I've had to deal with them in several places myself.

I disagree with "hidden, difficult to detect". Also I'd add that I'd rather not have hidden, difficult to detect correctness problems.

> 4. Encourage D programmers to write Unicode-capable code that is correct
> in the full sense of the word.

I disagree we are presently discouraging them. I do agree a change would make certain things clearer. But not nearly enough to make up for the breakage.

> I think the above list has enough weight to merit at least considering
> *some* breaking changes.

I think a better approach is to figure out what to add.


Andrei

March 09, 2014
On Saturday, 8 March 2014 at 23:59:15 UTC, Andrei Alexandrescu wrote:
> My only claim is that recognizing and iterating strings by code point is better than doing things by the octet.

Considering or disregarding the disadvantages of this choice?

>> 1. Eliminating dangerous constructs, such as s.countUntil and s.indexOf
>> both returning integers, yet possibly having different values in
>> circumstances that the developer may not foresee.
>
> I disagree there's any danger. They deal in code points, end of story.

Perhaps I did not explain clearly enough.

auto pos = s.countUntil(sub);
writeln(s[pos..$]);

This will compile, and work for English text. For someone without complete knowledge of Phobos functions and how D handles Unicode, it is not obvious that this code is actually wrong. In certain situations, this can have devastating effects: consider, for example, if this code is extracting a slice from a string that elsewhere contains sensitive data (e.g. a configuration file containing, among other data, a password). An attacker could supply a Unicode string where the developer did not expect it, thus causing "pos" to have a smaller value than the corresponding indexOf result, thus revealing a slice of "s" which was not intended to be visible. Thus, a developer currently needs to tread very carefully wherever he is slicing strings, so as not to accidentally use indices obtained from functions that count code points.
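
A minimal self-contained demonstration of the mismatch (countUntil and indexOf are the actual Phobos functions; the data is of course made up):

import std.algorithm : countUntil;
import std.stdio : writeln;
import std.string : indexOf;

void main()
{
    // 'é' is one code point but two UTF-8 code units.
    string s = "héllo:secret";

    auto pos = s.countUntil(':');  // 5 -- counts code points
    auto idx = s.indexOf(':');     // 6 -- counts code units (bytes)

    // Slicing works in code units, so the code-point count lands one
    // byte short for each multi-byte character before the match:
    writeln(s[pos .. $]);  // "o:secret" -- leaks a character before ':'
    writeln(s[idx .. $]);  // ":secret"  -- intended
}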

>> 2. Very high complexity of implementations (the ElementEncodingType
>> problem previously mentioned).
>
> I disagree with "very high".

I'm quite sure that std.range and std.algorithm would lose a LOT of weight if they were fixed not to treat strings specially.

> Besides, if you want to do Unicode you gotta crack some eggs.

No, I can't see how this justifies the choice. An explicit decoding range would have simplified things greatly while offering many of the same advantages. Whether the fact that it is there "by default" is an advantage of the current approach at all is debatable.
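
To pin down what "explicit decoding range" means, here is a sketch; "decoded" is a hypothetical adapter, while std.utf.decode is the real Phobos primitive it builds on:

import std.utf : decode;

// Strings would iterate by code unit by default; decoding happens
// only where the caller asks for it:
struct Decoded
{
    string str;

    bool empty() { return str.length == 0; }

    dchar front()
    {
        size_t i = 0;
        return decode(str, i);   // decode one code point, don't advance
    }

    void popFront()
    {
        size_t i = 0;
        decode(str, i);          // measure the code point's byte length
        str = str[i .. $];
    }
}

Decoded decoded(string s) { return Decoded(s); }

// usage: foreach (dchar c; "héllo".decoded) { ... }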

>> 3. Hidden, difficult-to-detect performance problems. The reason why this
>> thread was started. I've had to deal with them in several places myself.
>
> I disagree with "hidden, difficult to detect".

Why? You can only find out that an algorithm is slower than it needs to be via either profiling (at which point you're wondering why the @#$% the thing is so slow), or feeding it invalid UTF. If you had made a different choice for Unicode in D, this problem would not exist at all.

> Also I'd add that I'd rather not have hidden, difficult to detect correctness problems.

Except we already do. Arguments have already been presented in this thread that demonstrate correctness problems with the current approach. I don't think that these can stand up to the problems that the simpler by-char iteration approach would have.

>> 4. Encourage D programmers to write Unicode-capable code that is correct
>> in the full sense of the word.
>
> I disagree we are presently discouraging them.

I did not say we are. The problem is that we aren't encouraging them either - we are instead setting an example of how to do it in a wrong (incomplete) way.

> I do agree a change would make certain things clearer.

I have an issue with all the counter-arguments presented in this thread being shoved behind the one word "clearer".

> But not nearly enough to make up for the breakage.

I would still like to go ahead with my suggestion to attempt some possible changes without releasing them. I'm going to try them with my own programs first to see how much it will break. I believe that you are too eagerly dismissing all proposals without even evaluating them.

>> I think the above list has enough weight to merit at least considering
>> *some* breaking changes.
>
> I think a better approach is to figure out what to add.

This is obvious:
- more Unicode algorithms (normalization, segmentation, etc.; see the example after this list)
- better documentation
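
As a taste of the first item, using what std.uni already ships (normalize and byGrapheme are real APIs; the strings are just examples):

import std.range : walkLength;
import std.uni : NFC, byGrapheme, normalize;

void main()
{
    string composed   = "é";        // U+00E9, precomposed
    string decomposed = "e\u0301";  // 'e' + U+0301 combining acute

    assert(composed != decomposed);                 // raw bytes differ
    assert(normalize!NFC(decomposed) == composed);  // equal after NFC
    assert(decomposed.byGrapheme.walkLength == 1);  // one visible character
}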
March 09, 2014
On 3/8/14, 4:42 PM, Vladimir Panteleev wrote:
> On Saturday, 8 March 2014 at 23:59:15 UTC, Andrei Alexandrescu wrote:
>> My only claim is that recognizing and iterating strings by code point
>> is better than doing things by the octet.
>
> Considering or disregarding the disadvantages of this choice?

Doing my best to weigh everything with the right measures.

>>> 1. Eliminating dangerous constructs, such as s.countUntil and s.indexOf
>>> both returning integers, yet possibly having different values in
>>> circumstances that the developer may not foresee.
>>
>> I disagree there's any danger. They deal in code points, end of story.
>
> Perhaps I did not explain clearly enough.
>
> auto pos = s.countUntil(sub);
> writeln(s[pos..$]);
>
> This will compile, and work for English text. For someone without
> complete knowledge of Phobos functions and how D handles Unicode, it is
> not obvious that this code is actually wrong.

I agree. At one point or another, the dual nature of strings (and dual means of iterating them) will cause trouble for the unwary.

> In certain situations,
> this can have devastating effects: consider, for example, if this code
> is extracting a slice from a string that elsewhere contains sensitive
> data (e.g. a configuration file containing, among other data, a
> password).

Whaaa, passwords in clear?

> An attacker could supply a Unicode string where the
> developer did not expect it, thus causing "pos" to have a smaller value
> than the corresponding indexOf result, thus revealing a slice of "s"
> which was not intended to be visible. Thus, a developer currently needs
> to tread very carefully wherever he is slicing strings, so as not to
> accidentally use indices obtained from functions that count code points.

Okay, though when you opened with "devastating" I was hoping for nothing short of death and dismemberment. Anyhow the fix is obvious per this brief tutorial: http://www.youtube.com/watch?v=hkDD03yeLnU

>>> 2. Very high complexity of implementations (the ElementEncodingType
>>> problem previously mentioned).
>>
>> I disagree with "very high".
>
> I'm quite sure that std.range and std.algorithm would lose a LOT of
> weight if they were fixed not to treat strings specially.

I'm not so sure. Most of the string-specific optimizations simply detect certain string cases and forward them to array algorithms that need to be written anyway. You would, indeed, save a fair number of isSomeString conditionals and such (thus simplifying the scaffolding), but probably not a lot of code. That's not useless work - it'd go somewhere in any design.
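
Roughly the pattern being described, as a simplified sketch (fastCount and its helpers are hypothetical; isSomeString and representation are real Phobos symbols):

import std.string : representation;
import std.traits : isSomeString;

// Stand-in "array algorithm" that has to exist regardless of strings:
size_t arrayCount(T)(T[] a) { return a.length; }

// Generic range path:
size_t genericCount(R)(R r)
{
    size_t n;
    foreach (e; r) ++n;
    return n;
}

// The special-casing: detect strings, forward their code units to the
// array path, and skip per-element decoding entirely.
size_t fastCount(R)(R r)
{
    static if (isSomeString!R)
        return arrayCount(r.representation);  // raw code units, no decode
    else
        return genericCount(r);
}

The array path exists in any design; a redesign would only change where this dispatch lives.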

>> Besides, if you want to do Unicode you gotta crack some eggs.
>
> No, I can't see how this justifies the choice. An explicit decoding
> range would have simplified things greatly while offering many of the
> same advantages.

My point there is that there's no useless or duplicated code that would be thrown away. A better design would indeed make for better modular separation - would be great if the string-related optimizations in std.algorithm went elsewhere. They wouldn't disappear.

> Whether the fact that it is there "by default" is an advantage of
> the current approach at all is debatable.

Clearly. If I had it to do over again, I'd definitely change a thing or two. (I wouldn't go with Walter's proposal, which I think is worse than what we have now.)

But the current approach has something very difficult to talk away: it's there. And that makes a whole lotta difference. Do I believe it's perfect? Hell no. Does it blunt much of the point of this debate? I'm afraid so.

>>> 3. Hidden, difficult-to-detect performance problems. The reason why this
>>> thread was started. I've had to deal with them in several places myself.
>>
>> I disagree with "hidden, difficult to detect".
>
> Why? You can only find out that an algorithm is slower than it needs to
> be via either profiling (at which point you're wondering why the @#$%
> the thing is so slow), or feeding it invalid UTF. If you had made a
> different choice for Unicode in D, this problem would not exist at all.

Disagree.

>> Also I'd add that I'd rather not have hidden, difficult to detect
>> correctness problems.
>
> Except we already do. Arguments have already been presented in this
> thread that demonstrate correctness problems with the current approach.
> I don't think that these can stand up to the problems that the simpler
> by-char iteration approach would have.

Sure there are, and you yourself illustrated a misuse of the APIs. My point is: code point is better than code unit and not all that much slower. Grapheme is better than code point but a lot slower. It seems we're quite in a sweet spot here wrt performance/correctness.
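
The three levels in that ranking, made concrete (representation and byGrapheme are real Phobos APIs; a plain walkLength over a string counts code points because of auto-decoding):

import std.range : walkLength;
import std.string : representation;
import std.uni : byGrapheme;

void main()
{
    string s = "noe\u0301l";                   // "noél" with a combining acute

    assert(s.representation.walkLength == 6);  // code units: cheapest
    assert(s.walkLength == 5);                 // code points: the default
    assert(s.byGrapheme.walkLength == 4);      // graphemes: most correct, slowest
}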

>>> 4. Encourage D programmers to write Unicode-capable code that is correct
>>> in the full sense of the word.
>>
>> I disagree we are presently discouraging them.
>
> I did not say we are. The problem is that we aren't encouraging them
> either - we are instead setting an example of how to do it in a wrong
> (incomplete) way.

Code unit is what it is. Those programming for natural languages for which code units are not sufficient would need to exercise due diligence. We ought to help them without crippling efficiency.

>> I do agree a change would make certain things clearer.
>
> I have an issue with all the counter-arguments presented in this thread
> being shoved behind the one word "clearer".

What is the issue you are having? I don't see a much better API being proposed. I see a marginally improved API at the very best, and possibly quite a bit more prone to error.

>> But not nearly enough to make up for the breakage.
>
> I would still like to go ahead with my suggestion to attempt some
> possible changes without releasing them. I'm going to try them with my
> own programs first to see how much it will break.

I think that's great.

> I believe that you are
> too eagerly dismissing all proposals without even evaluating them.

Perspective is everything, isn't it :o). I thought I was being reasonable and accepting in discussing a number of proposed points, although in my heart of hearts many arguments seem rather frivolous.

With what has been put forward so far, that's not even close to justifying a breaking change. If that great better design is just to get back to code-unit iteration, the change will not happen while I work on D. It is possible, however, that a much better idea comes forward, and I'd look forward to it.

>>> I think the above list has enough weight to merit at least considering
>>> *some* breaking changes.
>>
>> I think a better approach is to figure out what to add.
>
> This is obvious:
> - more Unicode algorithms (normalization, segmentation, etc.)
> - better documentation

I was thinking of these too:

1. Revisit std.encoding and perhaps confer legitimacy to the character types defined there. The implementation in std.encoding is wanting, but I think the idea is sound. Essentially give more love to various encodings, including ASCII and "bypass encoding, I'll deal with stuff myself". (See the first sketch below.)

2. Add byChar that returns a random-access range iterating a string by char (code unit). Add byWchar that does on-the-fly transcoding to UTF-16. Add byDchar that accepts any range of char and does decoding. And such stuff. Then whenever one wants to go through a string by code point, one can just use str.byDchar. (See the second sketch below.)
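
For item 1, a quick taste of the character types referred to, as std.encoding defines them today (transcode is its existing conversion entry point; the strings are just examples):

import std.encoding : AsciiString, Latin1String, transcode;

void main()
{
    string utf8 = "hello";  // plain ASCII content

    AsciiString a;
    transcode(utf8, a);     // explicit, checked change of encoding

    Latin1String l;
    transcode(utf8, l);
}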
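
For item 2, a sketch of what byWchar might look like; the range itself is hypothetical (none of these existed in Phobos at the time), while std.utf.decode and std.utf.encode are real:

import std.utf : decode, encode;

// Decode one code point from the UTF-8 input, re-encode it as UTF-16
// on the fly, yielding one or two wchars at a time.
struct ByWchar
{
    private string str;
    private wchar[2] buf;
    private size_t bufLen, bufPos;

    this(string s) { str = s; refill(); }

    bool empty() { return bufPos == bufLen; }
    wchar front() { return buf[bufPos]; }
    void popFront() { if (++bufPos == bufLen) refill(); }

    private void refill()
    {
        bufPos = bufLen = 0;
        if (str.length == 0) return;   // stays empty at end of input
        size_t i = 0;
        immutable dchar d = decode(str, i);
        str = str[i .. $];
        bufLen = encode(buf, d);       // 1 wchar, or 2 for a surrogate pair
    }
}

auto byWchar(string s) { return ByWchar(s); }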


Andrei