September 06, 2018
On Thursday, 6 September 2018 at 10:22:22 UTC, ag0aep6g wrote:
> On 09/06/2018 09:23 AM, Chris wrote:
>> Python 3 gives me this:
>> 
>> print(len("á"))
>> 1
>
> Python 3 also gives you this:
>
> print(len("á"))
> 2
>
> (The example might not survive transfer from me to you if Unicode normalization happens along the way.)
>
> That's when you enter the 'á' as 'a' followed by U+0301 (combining acute accent). So Python's `len` counts in code points, like D's std.range does (auto-decoding).

To avoid this, you have to normalize and recompose any decomposed characters. I remember that Mac OS X used (and still uses?) decomposed characters by default, so when you typed 'á' into your CLI, it would automatically decompose it to 'a' + acute. `string`, however, returns len=2 for composed characters too. If you do a lot of string handling, it will come back to bite you sooner or later.
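
For what it's worth, Phobos does ship a normalizer in std.uni. A minimal sketch of recomposing before counting (just a sketch, not a recommendation) could look like this:

import std.range.primitives : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme, normalize, NFC;

void main() {
  string decomposed = "a\u0301";              // 'a' + U+0301, as macOS tends to produce
  auto composed = normalize!NFC(decomposed);  // recompose to the single code point U+00E1

  writeln(decomposed.length);                 // 3 code units
  writeln(composed.length);                   // 2 code units - still not 1!
  writeln(composed.byGrapheme.walkLength);    // 1 grapheme, which is what the eye sees
}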
September 06, 2018
On Thursday, 6 September 2018 at 09:35:27 UTC, Chris wrote:
> On Thursday, 6 September 2018 at 08:44:15 UTC, nkm1 wrote:
>> On Wednesday, 5 September 2018 at 07:48:34 UTC, Chris wrote:
>>> On Tuesday, 4 September 2018 at 21:36:16 UTC, Walter Bright wrote:
>>>>
>>>> Autodecode - I've suffered under that, too. The solution was fairly simple. Append .byCodeUnit to strings that would otherwise autodecode. Annoying, but hardly a showstopper.
>>>
>>> import std.array : array;
>>> import std.stdio : writefln;
>>> import std.uni : byCodePoint, byGrapheme;
>>> import std.utf : byCodeUnit;
>>>
>>> void main() {
>>>
>>>   string first = "á";
>>>
>>>   writefln("%d", first.length);  // prints 2
>>>
>>>   auto firstCU = "á".byCodeUnit; // type is `ByCodeUnitImpl` (!)
>>>
>>>   writefln("%d", firstCU.length);  // prints 2
>>>
>>>   auto firstGr = "á".byGrapheme.array;  // type is `Grapheme[]`
>>>
>>>   writefln("%d", firstGr.length);  // prints 1
>>>
>>>   auto firstCP = "á".byCodePoint.array; // type is `dchar[]`
>>>
>>>   writefln("%d", firstCP.length);  // prints 1
>>>
>>>   dstring second = "á";
>>>
>>>   writefln("%d", second.length);  // prints 1 (That was easy!)
>>>
>>>   // DMD64 D Compiler v2.081.2
>>> }
>>
>> And this has what to do with autodecoding?
>
> Nothing. I was just pointing out how awkward some basic things can be. autodecoding just adds to it in the sense that it's a useless overhead but will keep string handling in a limbo forever and ever and ever.
>
>>
>> TBH, it looks like you're just confused about how Unicode works. None of that is something particular to D. You should probably address your concerns to the Unicode Consortium. Not that they care.
>
> I'm actually not confused since I've been dealing with Unicode (and encodings in general) for quite a while now. Although I'm not a Unicode expert, I know what the operations above do and why. I'd only expect a modern PL to deal with Unicode correctly and have some guidelines as to the nitty-gritty.

Since you understand Unicode well, enlighten us: what's the best default format to use for string iteration?

You can argue that D chose the wrong default by having the stdlib auto-decode to code points in several places; Walter and a host of the core D team would agree with you, and you can add me to the list too. But it's not clear there should be a default format at all, other than whatever you started off with, particularly for a programming language that values performance the way D does, since each format choice comes with its own speed vs. correctness trade-offs.

Therefore, the programmer has to understand that complexity and make his own choice. You're acting like there's some obvious choice for how to handle Unicode that we're missing here, when the truth is that _no programming language knows how to handle Unicode well_, since handling a host of world languages in a single format is _inherently unintuitive_ and has significant efficiency trade-offs between the different formats.
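
To make that trade-off concrete, here's a rough sketch of how one might measure the three obvious choices (the test string and repetition counts are arbitrary, and the actual numbers will vary by machine and compiler):

import std.array : replicate;
import std.datetime.stopwatch : benchmark;
import std.range.primitives : walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main() {
  auto text = "Weiße Grüße, señor! ".replicate(10_000);

  size_t sink;  // keep the result so the work isn't optimized away
  auto results = benchmark!(
    () { sink = text.byCodeUnit.walkLength; },   // code units: O(1), just the array length
    () { sink = text.walkLength; },              // auto-decoded code points
    () { sink = text.byGrapheme.walkLength; }    // grapheme clusters: most "correct", by far the slowest
  )(10);

  writefln("code units:  %s", results[0]);
  writefln("code points: %s", results[1]);
  writefln("graphemes:   %s", results[2]);
}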

> And once again, it's the user's fault as in having some basic assumptions about how things should work. The user is just too stoooopid to use D properly - that's all. I know this type of behavior from the management of pubs and shops that had to close down, because nobody would go there anymore.
>
> Do you know the book "Crónica de una muerte anunciada" (Chronicle of a Death Foretold) by Gabriel García Márquez?
>
> "The central question at the core of the novella is how the death of Santiago Nasar was foreseen, yet no one tried to stop it."[1]
>
> [1] https://en.wikipedia.org/wiki/Chronicle_of_a_Death_Foretold#Key_themes

You're not being fair here, Chris. I just saw this SO question that I think exemplifies how most programmers react to Unicode:

"Trying to understand the subtleties of modern Unicode is making my head hurt. In particular, the distinction between code points, characters, glyphs and graphemes - concepts which in the simplest case, when dealing with English text using ASCII characters, all have a one-to-one relationship with each other - is causing me trouble.

Seeing how these terms get used in documents like Matthias Bynens' JavaScript has a unicode problem or Wikipedia's piece on Han unification, I've gathered that these concepts are not the same thing and that it's dangerous to conflate them, but I'm kind of struggling to grasp what each term means.

The Unicode Consortium offers a glossary to explain this stuff, but it's full of "definitions" like this:

Abstract Character. A unit of information used for the organization, control, or representation of textual data. ...

...

Character. ... (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. ...

...

Glyph. (1) An abstract form that represents one or more glyph images. (2) A synonym for glyph image. In displaying Unicode character data, one or more glyphs may be selected to depict a particular character.

...

Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system. ...

Most of these definitions possess the quality of sounding very academic and formal, but lack the quality of meaning anything, or else defer the problem of definition to yet another glossary entry or section of the standard.

So I seek the arcane wisdom of those more learned than I. How exactly do each of these concepts differ from each other, and in what circumstances would they not have a one-to-one relationship with each other?"
https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme

Honestly, Unicode is a mess, and I believe we will all have to dump the Unicode standard and start over one day. Until that fine day, there is no neat solution for handling it, no matter how much you'd like to think so. Also, much of the complexity actually comes from the complexity of the various language alphabets, so it cannot be waved away no matter what standard you come up with, though Unicode certainly adds more unneeded complexity on top, which is why it should be dumped.
September 06, 2018
On Wednesday, 5 September 2018 at 07:48:34 UTC, Chris wrote:
>
> import std.array : array;
> import std.stdio : writefln;
> import std.uni : byCodePoint, byGrapheme;
> import std.utf : byCodeUnit;
>
> void main() {
>
>   string first = "á";
>
>   writefln("%d", first.length);  // prints 2
>
>   auto firstCU = "á".byCodeUnit; // type is `ByCodeUnitImpl` (!)
>
>   writefln("%d", firstCU.length);  // prints 2
>
>   auto firstGr = "á".byGrapheme.array;  // type is `Grapheme[]`
>
>   writefln("%d", firstGr.length);  // prints 1
>
>   auto firstCP = "á".byCodePoint.array; // type is `dchar[]`
>
>   writefln("%d", firstCP.length);  // prints 1
>
>   dstring second = "á";
>
>   writefln("%d", second.length);  // prints 1 (That was easy!)
>
>   // DMD64 D Compiler v2.081.2
> }
>

So Unicode in D works EXACTLY as expected, yet people in this thread act as if the house is on fire.

D dying because of auto-decoding? Who can possibly think that in their right mind?

The worst part of this forum is that suddenly everyone, by virtue of posting in a newsgroup, is an anointed language design expert.

Let me break it to you: core developers are language experts. The rest of us are users, which, yes, doesn't necessarily make us qualified to design a language.


September 06, 2018
On Thursday, 6 September 2018 at 10:44:45 UTC, Joakim wrote:
[snip]
>
> You're not being fair here, Chris. I just saw this SO question that I think exemplifies how most programmers react to Unicode:
>
> "Trying to understand the subtleties of modern Unicode is making my head hurt. In particular, the distinction between code points, characters, glyphs and graphemes - concepts which in the simplest case, when dealing with English text using ASCII characters, all have a one-to-one relationship with each other - is causing me trouble.
>
> Seeing how these terms get used in documents like Matthias Bynens' JavaScript has a unicode problem or Wikipedia's piece on Han unification, I've gathered that these concepts are not the same thing and that it's dangerous to conflate them, but I'm kind of struggling to grasp what each term means.
>
> The Unicode Consortium offers a glossary to explain this stuff, but it's full of "definitions" like this:
>
> Abstract Character. A unit of information used for the organization, control, or representation of textual data. ...
>
> ...
>
> Character. ... (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. ...
>
> ...
>
> Glyph. (1) An abstract form that represents one or more glyph images. (2) A synonym for glyph image. In displaying Unicode character data, one or more glyphs may be selected to depict a particular character.
>
> ...
>
> Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system. ...
>
> Most of these definitions possess the quality of sounding very academic and formal, but lack the quality of meaning anything, or else defer the problem of definition to yet another glossary entry or section of the standard.
>
> So I seek the arcane wisdom of those more learned than I. How exactly do each of these concepts differ from each other, and in what circumstances would they not have a one-to-one relationship with each other?"
> https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme
>
> Honestly, Unicode is a mess, and I believe we will all have to dump the Unicode standard and start over one day. Until that fine day, there is no neat solution for handling it, no matter how much you'd like to think so. Also, much of the complexity actually comes from the complexity of the various language alphabets, so it cannot be waved away no matter what standard you come up with, though Unicode certainly adds more unneeded complexity on top, which is why it should be dumped.

One problem imo is that they mixed the terms up: "Grapheme: A minimally distinctive unit of writing in the context of a particular writing system." In linguistics a grapheme is not a single character like "á" or "g". It may also be a combination of characters like in English spelling <sh> ("s" + "h") that maps to a phoneme (e.g. ship, shut, shadow). In German this sound is written as <sch> as in "Schiff" (ship) (but not always, cf. "s" in "Stange").
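
Just to illustrate the terminology clash with a quick sketch (Unicode's "grapheme clusters" only bundle combining sequences and the like; they know nothing about spelling units such as <sh> or <sch>):

import std.range.primitives : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main() {
  writeln("ship".byGrapheme.walkLength);     // 4: <sh> is two clusters to Unicode
  writeln("a\u0301".byGrapheme.walkLength);  // 1: 'a' + combining acute is one cluster
}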

Since Unicode is such a difficult beast to deal with, I'd say D (or any PL for that matter) needs, first and foremost, a clear policy about what the default behavior is - not ad hoc patches. Then maybe a strategy as to how the default behavior can be turned on and off, say for performance reasons. One way _could_ be a compiler switch to turn the default behavior on/off (-unicode, -uni, -utf8, or whatever), or, maybe better, a library solution like `ustring`.

If you need high performance and checks are no issue for the most part (web crawling, data harvesting, etc.), get rid of autodecoding. Once you need to check for character/grapheme correctness (e.g. translation tools), make it available through something like `to!ustring`. Whichever way: be clear about it. But don't let the unsuspecting user use `string` and get bitten by it.
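
Purely as a sketch of what such a library type _could_ look like (everything here is hypothetical, it is not an existing Phobos type, and a real design would also need iteration, slicing, hashing, etc.):

// compile with: dmd -unittest -main ustring_sketch.d
import std.range.primitives : walkLength;
import std.uni : byGrapheme, normalize, NFC;

/// Hypothetical grapheme-aware wrapper; NOT part of Phobos.
struct UString {
  string payload;

  /// Length in grapheme clusters rather than code units.
  @property size_t length() { return payload.byGrapheme.walkLength; }

  /// Equality up to canonical equivalence (composed == decomposed).
  bool opEquals(UString other) {
    return normalize!NFC(payload) == normalize!NFC(other.payload);
  }

  /// Escape hatch to the raw UTF-8 for performance-sensitive code.
  @property string raw() { return payload; }
}

unittest {
  auto a = UString("\u00E1");   // precomposed á
  auto b = UString("a\u0301");  // decomposed á
  assert(a.length == 1);
  assert(a == b);
}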
September 06, 2018
On Thursday, 6 September 2018 at 11:19:14 UTC, Chris wrote:

>
> One problem imo is that they mixed the terms up: "Grapheme: A minimally distinctive unit of writing in the context of a particular writing system." In linguistics a grapheme is not a single character like "á" or "g". It may also be a combination of characters like in English spelling <sh> ("s" + "h") that maps to a phoneme (e.g. ship, shut, shadow). In German this sound is written as <sch> as in "Schiff" (ship) (but not always, cf. "s" in "Stange").
>

Sorry, this should read "In linguistics a grapheme is not _necessarily_ _only_ a single character like "á" or "g"."
September 06, 2018
On 09/06/2018 12:40 PM, Chris wrote:
> To avoid this, you have to normalize and recompose any decomposed characters. I remember that Mac OS X used (and still uses?) decomposed characters by default, so when you typed 'á' into your CLI, it would automatically decompose it to 'a' + acute. `string`, however, returns len=2 for composed characters too. If you do a lot of string handling, it will come back to bite you sooner or later.

You say that D users shouldn't need a '"Unicode license" before they do anything with strings'. And you say that Python 3 gets it right (or maybe less wrong than D).

But here we see that Python requires a similar amount of Unicode knowledge. Without your Unicode license, you couldn't make sense of `len` giving different results for two strings that look the same.

So both D and Python require a Unicode license. But on top of that, D also requires an auto-decoding license. You need to know that `string` is both a range of code points and an array of code units. And you need to know that `.length` belongs to the array side, not the range side. Once you know that (and more), things start making sense in D.
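
A minimal illustration of that dual nature (a sketch; it assumes the 'á' literal is stored precomposed, i.e. as the single code point U+00E1):

import std.range.primitives : front, walkLength;
import std.stdio : writeln;

void main() {
  string s = "\u00E1";        // precomposed á: one code point, two UTF-8 code units

  writeln(s.length);          // 2 - array side: counts code units
  writeln(s.walkLength);      // 1 - range side: auto-decodes, counts code points
  writeln(s.front);           // á - range side: front is a decoded dchar
  writeln(cast(ubyte) s[0]);  // 195 - array side: just the first UTF-8 byte
}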

My point is: D doesn't require more Unicode knowledge than Python. But D's auto-decoding gives `string` a dual nature, and that can certainly be confusing. It's part of why everybody dislikes auto-decoding.

(Not saying that Python is free from such pitfalls. I simply don't know the language well enough.)
September 06, 2018
On Thursday, 6 September 2018 at 11:43:31 UTC, ag0aep6g wrote:
>
> You say that D users shouldn't need a '"Unicode license" before they do anything with strings'. And you say that Python 3 gets it right (or maybe less wrong than D).
>
> But here we see that Python requires a similar amount of Unicode knowledge. Without your Unicode license, you couldn't make sense of `len` giving different results for two strings that look the same.
>
> So both D and Python require a Unicode license. But on top of that, D also requires an auto-decoding license. You need to know that `string` is both a range of code points and an array of code units. And you need to know that `.length` belongs to the array side, not the range side. Once you know that (and more), things start making sense in D.

You'll need some basic knowledge of Unicode if you deal with strings, that's for sure. But you don't need a "license", and it certainly shouldn't be used as an excuse for D's confusing nature when it comes to strings. Unicode is confusing enough, so you don't need to add another layer of complexity to confuse users further. And most certainly you shouldn't blame the user for being confused. Afaik, there's no warning label with an accompanying user manual for string handling.

> My point is: D doesn't require more Unicode knowledge than Python. But D's auto-decoding gives `string` a dual nature, and that can certainly be confusing. It's part of why everybody dislikes auto-decoding.

D should be clear about it. I think it's too late for `string` to change its behavior (i.e. making "á".length == 1). If you want to change `string`'s behavior now, maybe a compiler switch would be an option for the transition period: -autodecode=off.

Maybe a new type of string could be introduced that behaves like one would expect, say `ustring` for correct Unicode handling. Or `string` does that and you introduce a new type for high performance tasks (`rawstring` would unfortunately be confusing).

The thing is that even basic things like string handling are complicated and flawed, so I don't want to use D for any future projects, and I don't have the time to wait until it gets fixed one day, if it ever gets fixed, that is. Nor does it seem to be a priority, as opposed to other things that are maybe less important for production. But at least I'm wiser after this thread, since it has been made clear that things are not gonna change soon, at least not soon enough for me.

This is why I'll file for D-vorce :) Will it be difficult? Maybe at the beginning, but it will make things easier in the long run. And at the end of the day, if you have to fix and rewrite parts of your code again and again due to frequent language changes, you might as well port it to a different PL altogether. But I have no hard feelings; it's a practical decision I had to make based on pros and cons.

[snip]


September 06, 2018
On Thursday, 6 September 2018 at 11:01:55 UTC, Guillaume Piolat wrote:
> Let me break it to you: core developers are language experts. The rest of us are users, which, yes, doesn't necessarily make us qualified to design a language.

Who?



September 06, 2018
On Thursday, 6 September 2018 at 11:01:55 UTC, Guillaume Piolat wrote:

>
> So Unicode in D works EXACTLY as expected, yet people in this thread act as if the house is on fire.

Expected by whom? The Unicode expert or the user?

> D dying because of auto-decoding? Who can possibly think that in their right mind?

Nobody, it's just another major issue to be fixed.

> The worst part of this forum is that suddenly everyone, by virtue of posting in a newsgroup, is an anointed language design expert.
>
> Let me break it to you: core developers are language experts. The rest of us are users, which, yes, doesn't necessarily make us qualified to design a language.

Calm down. I for my part never said I was an expert on language design.

Number one: experts make mistakes too; there is nothing wrong with that. And autodecode is a good example of experts getting it wrong, because, you know, you cannot be an expert in all fields. I think the problem was that it was discovered too late.

Number two: why shouldn't users be allowed to give feedback? Engineers and developers need feedback, else we'd still be using CLIs for everything, wouldn't we? The user doesn't need to be an expert to know what s/he likes and doesn't like, and developers/engineers often have a different point of view as to what is important or annoying. That's why IT companies introduced customer service: direct interaction between developers and users would often end badly (disgruntled customers).


September 06, 2018
On Wednesday, 5 September 2018 at 22:00:27 UTC, H. S. Teoh wrote:
> Because grapheme decoding is SLOW, and most of the time you don't even need it anyway.  SLOW as in, it will easily add a factor of 3-5 (if not worse!) to your string processing time, which will make your natively-compiled D code a laughing stock of interpreted languages like Python.  It will make autodecoding look like an optimization(!).

Hehe, it's already a bit laughable that correctness is not preferred.

// Swift
import Foundation  // needed for range(of:)

let a = "\u{00E1}"   // á, precomposed
let b = "a\u{0301}"  // á, decomposed: 'a' + combining acute
let c = "\u{200B}"   // zero width space
let x = a + c + a
let y = b + c + b

print(a.count) // 1
print(b.count) // 1
print(x.count) // 3
print(y.count) // 3

print(a == b) // true
print("ááááááá".range(of: "á") != nil) // true

// D
import std.algorithm.searching : canFind;
import std.stdio : writeln;

void main() {
  auto a = "\u00E1";   // á, precomposed
  auto b = "a\u0301";  // á, decomposed: 'a' + combining acute
  auto c = "\u200B";   // zero width space
  auto x = a ~ c ~ a;
  auto y = b ~ c ~ b;

  writeln(a.length); // 2 wtf
  writeln(b.length); // 3 wtf
  writeln(x.length); // 7 wtf
  writeln(y.length); // 9 wtf

  writeln(a == b); // false wtf
  writeln("ááááááá".canFind("a\u0301")); // false wtf (precomposed haystack, decomposed needle)
}

Tell me which one would cause the giggles again?

If speed is preferred over correctness (which I very much disagree with, but for argument's sake...), then code points are still the wrong choice. So speed was obviously (??) not the reason to prefer code points as the default.

Here's a read on how Swift 4 strings behave. Absolutely amazing work there: https://oleb.net/blog/2017/11/swift-4-strings/

>
> Grapheme decoding is really only necessary when (1) you're typesetting a Unicode string, and (2) you're counting the number of visual characters taken up by the string (though grapheme counting even in this case may not give you what you want, thanks to double-width characters, zero-width characters, etc. -- though it can form the basis of correct counting code).

Yeah nah. Those are not the only 2 cases *ever* where grapheme decoding is correct. I don't think one can list all the cases where grapheme decoding is the correct behavior. Off the top of my head, you've already forgotten comparisons. And on top of that, comparing and counting have a bajillion* use cases.

* number is an exaggeration.
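
To make the comparisons point concrete, a quick sketch (it assumes the explicit precomposed/decomposed escapes below, since bare literals may get normalized in transit):

import std.algorithm.searching : canFind;
import std.stdio : writeln;
import std.uni : normalize, NFC;

void main() {
  auto composed   = "\u00E1";   // á as one code point
  auto decomposed = "a\u0301";  // á as 'a' + combining acute

  writeln(composed == decomposed);                                // false: raw code units differ
  writeln(normalize!NFC(composed) == normalize!NFC(decomposed));  // true: canonically equivalent

  writeln("ááááááá".canFind(decomposed));                         // false (precomposed haystack)
  writeln(normalize!NFC("ááááááá").canFind(normalize!NFC(decomposed))); // true
}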

>
> For all other cases, you really don't need grapheme decoding, and being forced to iterate over graphemes when unnecessary will add a horrible overhead, worse than autodecoding does today.

As opposed to being forced to iterate with incorrect results? I understand that it's slower. I just don't think that justifies incorrect output. I agree with everything you've said next, though, that people should understand Unicode.

>
> //
>
> Seriously, people need to get over the fantasy that they can just use Unicode without understanding how Unicode works.  Most of the time, you can get the illusion that it's working, but actually 99% of the time the code is actually wrong and will do the wrong thing when given an unexpected (but still valid) Unicode string.  You can't drive without a license, and even if you try anyway, the chances of ending up in a nasty accident is pretty high.  People *need* to learn how to use Unicode properly before complaining about why this or that doesn't work the way they thought it should work.

I agree that you should know about Unicode. And maybe you can't be correct 100% of the time, but you can get much closer than where D is right now.

And yeah, you can't drive without a license, but most cars hopefully don't show you an incorrect speedometer reading because it produces faster drivers.

>
>
> T
> --
> Gone Chopin. Bach in a minuet.

Lol :D