September 06, 2018
On Thursday, 6 September 2018 at 13:30:11 UTC, Chris wrote:
> And autodecode is a good example of experts getting it wrong, because, you know, you cannot be an expert in all fields. I think the problem was that it was discovered too late.

There are very valid reasons not to talk about auto-decoding again:

- it's too late to remove because of the breakage it would cause
- attempts at removing it were _already_ tried
- it has been debated to DEATH
- there is an easy work-around

So any discussion _now_ would have the very same structure as the discussion _then_, and would lead to the exact same result. It's quite tragic. And I urge the real D supporters to let such conversations (topics debated to death) die as soon as they appear.



> why shouldn't users be allowed to give feedback?
Straw-man.

If we can't get past _some_ technical debates, the only thing achieved is a loss of time for everyone involved.
September 07, 2018
On 07/09/2018 2:30 AM, Guillaume Piolat wrote:
> On Thursday, 6 September 2018 at 13:30:11 UTC, Chris wrote:
>> And autodecode is a good example of experts getting it wrong, because, you know, you cannot be an expert in all fields. I think the problem was that it was discovered too late.
> 
> There are very valid reasons not to talk about auto-decoding again:
> 
> - it's too late to remove because of the breakage it would cause
> - attempts at removing it were _already_ tried
> - it has been debated to DEATH
> - there is an easy work-around
> 
> So any discussion _now_ would have the very same structure as the discussion _then_, and would lead to the exact same result. It's quite tragic. And I urge the real D supporters to let such conversations (topics debated to death) die as soon as they appear.

+1
Either decide on a list of conditions under which we can introduce breakage to remove it, or yes, let's let this idea go. It isn't helping anyone.
September 06, 2018
On Thursday, 6 September 2018 at 14:17:28 UTC, aliak wrote:
> Hehe, it's already a bit laughable that correctness is not preferred.
>
> // Swift
> let a = "á"
> let b = "á"
> let c = "\u{200B}" // zero width space
> let x = a + c + a
> let y = b + c + b
>
> print(a.count) // 1
> print(b.count) // 1
> print(x.count) // 3
> print(y.count) // 3
>
> print(a == b) // true
> print("ááááááá".range(of: "á") != nil) // true
>
> // D
> auto a = "á";
> auto b = "á";
> auto c = "\u200B";
> auto x = a ~ c ~ a;
> auto y = b ~ c ~ b;
>
> writeln(a.length); // 2 wtf
> writeln(b.length); // 3 wtf
> writeln(x.length); // 7 wtf
> writeln(y.length); // 9 wtf
>
> writeln(a == b); // false wtf
> writeln("ááááááá".canFind("á")); // false wtf

writeln(cast(ubyte[]) a); // [195, 161]
writeln(cast(ubyte[]) b); // [97, 204, 129]

At least for equality, it doesn't seem far-fetched to me that the two are not considered equal when their underlying bytes are not the same.
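
For what it's worth, a minimal sketch of that (assuming, as the byte dumps suggest, that the two literals are the precomposed and the decomposed form of the same character): comparing normalized forms gives the equality one would intuitively expect.

import std.stdio;
import std.uni : normalize; // defaults to NFC

void main()
{
    string precomposed = "\u00E1";  // á as a single code point
    string decomposed  = "a\u0301"; // 'a' followed by a combining acute accent

    writeln(cast(ubyte[]) precomposed); // [195, 161]
    writeln(cast(ubyte[]) decomposed);  // [97, 204, 129]

    writeln(precomposed == decomposed);                     // false: different bytes
    writeln(precomposed.normalize == decomposed.normalize); // true: same canonical form
}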
September 06, 2018
On Thursday, 6 September 2018 at 14:30:38 UTC, Guillaume Piolat wrote:
> On Thursday, 6 September 2018 at 13:30:11 UTC, Chris wrote:
>> And autodecode is a good example of experts getting it wrong, because, you know, you cannot be an expert in all fields. I think the problem was that it was discovered too late.
>
> There are very valid reasons not to talk about auto-decoding again:
>
> - it's too late to remove because of the breakage it would cause
> - attempts at removing it were _already_ tried
> - it has been debated to DEATH
> - there is an easy work-around
>
> So any discussion _now_ would have the very same structure as the discussion _then_, and would lead to the exact same result. It's quite tragic. And I urge the real D supporters to let such conversations (topics debated to death) die as soon as they appear.

The real supporters? So it's a religion? For me it's about technology and finding a good tool for a job.

>> why shouldn't users be allowed to give feedback?
> Straw-man.

I meant in _general_, not necessarily about autodecode ;)

> If we can't get past _some_ technical debates, the only thing achieved is a loss of time for everyone involved.

Translation: "Nothing to see here, move along!" Usually a sign to move on...
September 06, 2018
On Thursday, 6 September 2018 at 14:33:27 UTC, rikki cattermole wrote:
> Either decide on a list of conditions under which we can introduce breakage to remove it, or yes, let's let this idea go. It isn't helping anyone.

Can't you just mark it as deprecated and provide a 100% compatible library range? Then people would just update their code to use the range...

This should be possible to achieve using automated source-to-source translation in most cases.
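
As a very rough sketch of what such a compatibility range could look like (the decoded helper below is purely hypothetical, just a thin wrapper over std.utf.byUTF):

import std.stdio;
import std.utf : byUTF;

// Hypothetical shim: if auto-decoding were deprecated, a library range
// like this could give code that still wants code-point iteration an
// explicit, drop-in replacement.
auto decoded(const(char)[] s)
{
    return s.byUTF!dchar; // lazily decodes UTF-8 into code points
}

void main()
{
    foreach (dchar c; "á".decoded)
        writeln(c); // prints each decoded code point
}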




September 06, 2018
On Thursday, 6 September 2018 at 14:17:28 UTC, aliak wrote:
> // D
> auto a = "á";
> auto b = "á";
> auto c = "\u200B";
> auto x = a ~ c ~ a;
> auto y = b ~ c ~ b;
>
> writeln(a.length); // 2 wtf
> writeln(b.length); // 3 wtf
> writeln(x.length); // 7 wtf
> writeln(y.length); // 9 wtf
>
> writeln(a == b); // false wtf
> writeln("ááááááá".canFind("á")); // false wtf
>

I had to copy-paste that because I wondered how the last two could be false. They are false because á is encoded differently in each literal. If you replace all occurrences of it with a grapheme that fits in a single code point, the results are:

2
2
7
7
true
true
September 06, 2018
On Thursday, 6 September 2018 at 14:42:14 UTC, Chris wrote:
> Usually a sign to move on...

You have said that at least 10 times in this very thread. Doomsayers are as old as D. It will be doing OK.
September 06, 2018
On Thu, Sep 6, 2018 at 4:45 PM Dukc via Digitalmars-d <digitalmars-d@puremagic.com> wrote:

> On Thursday, 6 September 2018 at 14:17:28 UTC, aliak wrote:
> > // D
> > auto a = "á";
> > auto b = "á";
> > auto c = "\u200B";
> > auto x = a ~ c ~ a;
> > auto y = b ~ c ~ b;
> >
> > writeln(a.length); // 2 wtf
> > writeln(b.length); // 3 wtf
> > writeln(x.length); // 7 wtf
> > writeln(y.length); // 9 wtf
> >
> > writeln(a == b); // false wtf
> > writeln("ááááááá".canFind("á")); // false wtf
> >
>
> I had to copy-paste that because I wondered how the last two could be false. They are false because á is encoded differently in each literal. If you replace all occurrences of it with a grapheme that fits in a single code point, the results are:
>
> 2
> 2
> 7
> 7
> true
> true
>

import std.stdio;
import std.algorithm : canFind;
import std.uni : normalize;

void main()
{
    auto a = "á".normalize;
    auto b = "á".normalize;
    auto c = "\u200B".normalize;
    auto x = a ~ c ~ a;
    auto y = b ~ c ~ b;

    writeln(a.length); // 2
    writeln(b.length); // 2
    writeln(x.length); // 7
    writeln(y.length); // 7

    writeln(a == b); // true
    writeln("ááááááá".canFind("á".normalize)); // true
}


September 06, 2018
On Thu, Sep 06, 2018 at 02:42:58PM +0000, Dukc via Digitalmars-d wrote:
> On Thursday, 6 September 2018 at 14:17:28 UTC, aliak wrote:
> > // D
> > auto a = "á";
> > auto b = "á";
> > auto c = "\u200B";
> > auto x = a ~ c ~ a;
> > auto y = b ~ c ~ b;
> > 
> > writeln(a.length); // 2 wtf
> > writeln(b.length); // 3 wtf
> > writeln(x.length); // 7 wtf
> > writeln(y.length); // 9 wtf
[...]

This is an unfair comparison.  In the Swift version you used .count, but here you used .length, which is the length of the array, NOT the number of characters or whatever you expect it to be.  You should rather use .count and specify exactly what you want to count, e.g., byCodePoint or byGrapheme.
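
For instance, here is a small sketch of counting the same text three different ways with Phobos (the numbers assume the decomposed form 'a' + U+0301):

import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "a\u0301"; // 'a' plus a combining acute accent

    writeln(s.length);                // 3: UTF-8 code units (array length)
    writeln(s.walkLength);            // 2: code points (via auto-decoding)
    writeln(s.byGrapheme.walkLength); // 1: graphemes (what the user sees)
}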

I suspect the Swift version will give you unexpected results if you did something like compare "á" to "a\u0301", for example (which, in case it isn't obvious, are visually identical to each other, and as far as an end user is concerned, should only count as 1 grapheme).

Not even normalization will help you if you have a string like "a\u0301\u0302": in that case, the *only* correct way to count the number of visual characters is byGrapheme, and I highly doubt Swift's .count will give you the correct answer in that case. (I expect that Swift's .count will count code points, as is the usual default in many languages, which is unfortunately wrong when you're thinking about visual characters, which are called graphemes in Unicode parlance.)
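
A quick sketch of that point (assuming the combining marks U+0301 and U+0302): normalization cannot collapse the cluster into a single code point, but byGrapheme still counts one visual character.

import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme, normalize;

void main()
{
    string s = "a\u0301\u0302"; // 'a' + combining acute + combining circumflex

    writeln(s.normalize.walkLength);  // 2: no precomposed code point exists for this cluster
    writeln(s.byGrapheme.walkLength); // 1: still a single grapheme
}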

And even in your given example, what should .count return when there's a zero-width character?  If you're counting the number of visual places taken by the string (e.g., you're trying to align output in a fixed-width terminal), then *both* versions of your code are wrong, because zero-width characters do not occupy any space when displayed. If you're counting the number of code points, though, e.g., to allocate the right buffer size to convert to dstring, then you want to count the zero-width character as 1 rather than 0.  And that's not to mention double-width characters, which should count as 2 if you're outputting to a fixed-width terminal.
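
To make the zero-width case concrete, a small sketch (the display width is only stated in a comment, since Phobos has no display-width helper):

import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string zwsp = "\u200B"; // zero-width space

    writeln(zwsp.length);                // 3: UTF-8 code units
    writeln(zwsp.walkLength);            // 1: one code point
    writeln(zwsp.byGrapheme.walkLength); // 1: one grapheme
    // ...yet it occupies 0 columns in a fixed-width terminal,
    // while a double-width CJK character would occupy 2.
}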

Again I say, you need to know how Unicode works. Otherwise you can easily deceive yourself into thinking that your code (both in D and in Swift and in any other language) is correct, when in fact it will fail miserably when it receives input that you didn't think of.  Unicode is NOT ASCII, and you CANNOT assume there's a 1-to-1 mapping between "characters" and display length. Or 1-to-1 mapping between any of the various concepts of string "length", in fact.

In ASCII, array length == number of code points == number of graphemes == display width.

In Unicode, array length != number of code points != number of graphemes != display width.

Code written by anyone who does not understand this is WRONG, because you will inevitably end up using the wrong value for the wrong thing: e.g., array length for number of code points, or number of code points for display length. Not even .byGrapheme will save you here; you *need* to understand that zero-width and double-width characters exist, and what they imply for display width. You *need* to understand the difference between code points and graphemes.  There is no single default that will work in every case, because there are DIFFERENT CORRECT ANSWERS depending on what your code is trying to accomplish. Pretending that you can just brush all this detail under the rug of a single number is just deceiving yourself, and will inevitably result in wrong code that will fail to handle Unicode input correctly.


T

-- 
It's amazing how careful choice of punctuation can leave you hanging:
September 06, 2018
On Thursday, 6 September 2018 at 16:44:11 UTC, H. S. Teoh wrote:
> On Thu, Sep 06, 2018 at 02:42:58PM +0000, Dukc via Digitalmars-d wrote:
>> On Thursday, 6 September 2018 at 14:17:28 UTC, aliak wrote:
>> > // D
>> > auto a = "á";
>> > auto b = "á";
>> > auto c = "\u200B";
>> > auto x = a ~ c ~ a;
>> > auto y = b ~ c ~ b;
>> > 
>> > writeln(a.length); // 2 wtf
>> > writeln(b.length); // 3 wtf
>> > writeln(x.length); // 7 wtf
>> > writeln(y.length); // 9 wtf
> [...]
>
> This is an unfair comparison.  In the Swift version you used .count, but here you used .length, which is the length of the array, NOT the number of characters or whatever you expect it to be.  You should rather use .count and specify exactly what you want to count, e.g., byCodePoint or byGrapheme.
>
> I suspect the Swift version will give you unexpected results if you did something like compare "á" to "a\u0301", for example (which, in case it isn't obvious, are visually identical to each other, and as far as an end user is concerned, should only count as 1 grapheme).
>
> Not even normalization will help you if you have a string like "a\u0301\u0302": in that case, the *only* correct way to count the number of visual characters is byGrapheme, and I highly doubt Swift's .count will give you the correct answer in that case. (I expect that Swift's .count will count code points, as is the usual default in many languages, which is unfortunately wrong when you're thinking about visual characters, which are called graphemes in Unicode parlance.)

No, Swift counts grapheme clusters by default, so it gives 1. I suggest you read the Swift chapter linked above. I think it's the wrong choice performance-wise, but they chose to emphasize intuitive behavior for the common case.

I agree with most of the rest of what you wrote about programmers having no silver bullet to avoid Unicode's and languages' complexity.