September 05, 2018
On Wednesday, 5 September 2018 at 07:48:34 UTC, Chris wrote:
> On Tuesday, 4 September 2018 at 21:36:16 UTC, Walter Bright wrote:
>>
>> Autodecode - I've suffered under that, too. The solution was fairly simple. Append .byCodeUnit to strings that would otherwise autodecode. Annoying, but hardly a showstopper.
>
> import std.array : array;
> import std.stdio : writefln;
> import std.uni : byCodePoint, byGrapheme;
> import std.utf : byCodeUnit;
>
> void main() {
>
>   string first = "á";
>
>   writefln("%d", first.length);  // prints 2
>
>   auto firstCU = "á".byCodeUnit; // type is `ByCodeUnitImpl` (!)
>
>   writefln("%d", firstCU.length);  // prints 2
>
>   auto firstGr = "á".byGrapheme.array;  // type is `Grapheme[]`
>
>   writefln("%d", firstGr.length);  // prints 1
>
>   auto firstCP = "á".byCodePoint.array; // type is `dchar[]`
>
>   writefln("%d", firstCP.length);  // prints 1
>
>   dstring second = "á";
>
>   writefln("%d", second.length);  // prints 1 (That was easy!)
>
>   // DMD64 D Compiler v2.081.2
> }
>
> Welcome to my world!
>
> [snip]

The dstring is only ok because the 2 code units fit in a dchar right? But all the other ones are as expected right?

Seriously... why is it not graphemes by default for correctness whyyyyyyy!

September 05, 2018
On Wed, Sep 05, 2018 at 09:33:27PM +0000, aliak via Digitalmars-d wrote: [...]
> The dstring is only ok because the 2 code units fit in a dchar right? But all the other ones are as expected right?

And dstring will be wrong once you have non-precomposed diacritics and other composing sequences.
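
For example, here's a minimal sketch (assuming a recent DMD/Phobos, as in Chris' example above) where the same 'á' is entered as 'a' followed by U+0301, a combining acute accent:

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main() {
    // 'a' followed by U+0301 (combining acute accent):
    // visually one character, but two code points.
    dstring s = "a\u0301";

    writeln(s.length);                // 2 -- dchars are code points, not graphemes
    writeln(s.byGrapheme.walkLength); // 1 -- only grapheme segmentation sees one character
}

So even UTF-32 only buys you one code point per array element, not one user-perceived character.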


> Seriously... why is it not graphemes by default for correctness whyyyyyyy!

Because grapheme decoding is SLOW, and most of the time you don't even need it anyway.  SLOW as in, it will easily add a factor of 3-5 (if not worse!) to your string processing time, which will make your natively-compiled D code the laughing stock of interpreted languages like Python.  It will make autodecoding look like an optimization(!).
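
If you want to gauge the difference yourself, here's a rough benchmark sketch; the exact factor depends on the input text, the compiler, and the Unicode tables, so treat the numbers as illustrative only:

import std.array : replicate;
import std.datetime.stopwatch : benchmark;
import std.range : walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main() {
    // Mixed ASCII / non-ASCII text, repeated to get measurable times.
    auto text = "Hello, wörld! こんにちは. ".replicate(10_000);

    auto times = benchmark!(
        () => text.byCodeUnit.walkLength, // no decoding at all
        () => text.walkLength,            // auto-decoding to code points
        () => text.byGrapheme.walkLength  // full grapheme segmentation
    )(10);

    writefln("code units:  %s", times[0]);
    writefln("code points: %s", times[1]);
    writefln("graphemes:   %s", times[2]);
}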

Grapheme decoding is really only necessary when (1) you're typesetting a Unicode string, and (2) you're counting the number of visual characters taken up by the string (though grapheme counting even in this case may not give you what you want, thanks to double-width characters, zero-width characters, etc. -- though it can form the basis of correct counting code).
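
A small sketch of that caveat (the column widths in the comments are just what a typical terminal does; as far as I know Phobos has no display-width function, so they are not computed here):

import std.range : walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;

void main() {
    string narrow = "a";       // 1 grapheme, 1 terminal column
    string wide   = "世";      // 1 grapheme, but typically 2 columns (East Asian Wide)
    string zwsp   = "a\u200B"; // 2 graphemes, but the zero-width space takes no column

    writefln("%s -> %s grapheme(s)", narrow, narrow.byGrapheme.walkLength); // 1
    writefln("%s -> %s grapheme(s)", wide,   wide.byGrapheme.walkLength);   // 1
    writefln("%s -> %s grapheme(s)", zwsp,   zwsp.byGrapheme.walkLength);   // 2
}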

For all other cases, you really don't need grapheme decoding, and being forced to iterate over graphemes when unnecessary will add a horrible overhead, worse than autodecoding does today.

//

Seriously, people need to get over the fantasy that they can just use Unicode without understanding how Unicode works.  Most of the time, you can get the illusion that it's working, but 99% of the time the code is actually wrong and will do the wrong thing when given an unexpected (but still valid) Unicode string.  You can't drive without a license, and even if you try anyway, the chances of ending up in a nasty accident are pretty high.  People *need* to learn how to use Unicode properly before complaining about why this or that doesn't work the way they thought it should work.


T

-- 
Gone Chopin. Bach in a minuet.
September 06, 2018
On Wednesday, 5 September 2018 at 22:00:27 UTC, H. S. Teoh wrote:

>
> //
>
> Seriously, people need to get over the fantasy that they can just use Unicode without understanding how Unicode works.  Most of the time, you can get the illusion that it's working, but 99% of the time the code is actually wrong and will do the wrong thing when given an unexpected (but still valid) Unicode string.  You can't drive without a license, and even if you try anyway, the chances of ending up in a nasty accident are pretty high.  People *need* to learn how to use Unicode properly before complaining about why this or that doesn't work the way they thought it should work.
>
>
> T

Python 3 gives me this:

print(len("á"))
1

and so do other languages.

Is it asking too much to ask for `string` (not `dstring` or `wstring`) to behave as most people would expect it to behave in 2018 - and not like Python 2 from days of yore? But of course, D users should have a "Unicode license" before they do anything with strings. (I wonder is there a different license for UTF8 and UTF16 and UTF32, Big / Little Endian, BOM? Just asking.)

So again, for the umpteenth time, it's the users' fault. I see. Ironically enough, it was the language developers' lack of understanding of Unicode that led to string handling being a nightmare in D in the first place. Oh lads, if you were politicians I'd say that with this attitude you're gonna lose the next election. I say this because many times the posts by (core) developers remind me so much of politicians who are completely detached from the reality of the people. Right oh!




September 06, 2018
On Thursday, 6 September 2018 at 07:23:57 UTC, Chris wrote:

>> Seriously, people need to get over the fantasy that they can just use Unicode without understanding how Unicode works.  Most of the time, you can get the illusion that it's working, but 99% of the time the code is actually wrong and will do the wrong thing when given an unexpected (but still valid) Unicode string.

> Is it asking too much to ask for `string` (not `dstring` or `wstring`) to behave as most people would expect it to behave in 2018 - and not like Python 2 from days of yore?

I agree with Chris.

That boat has sailed, so D2 should just go full throttle with the original design and auto-decode to graphemes, regardless of the performance cost.

September 06, 2018
On Thursday, 6 September 2018 at 07:23:57 UTC, Chris wrote:
> On Wednesday, 5 September 2018 at 22:00:27 UTC, H. S. Teoh wrote:
>
>>
>> //
>>
>> Seriously, people need to get over the fantasy that they can just use Unicode without understanding how Unicode works.  Most of the time, you can get the illusion that it's working, but 99% of the time the code is actually wrong and will do the wrong thing when given an unexpected (but still valid) Unicode string.  You can't drive without a license, and even if you try anyway, the chances of ending up in a nasty accident are pretty high.  People *need* to learn how to use Unicode properly before complaining about why this or that doesn't work the way they thought it should work.
>>
>>
>> T
>
> Python 3 gives me this:
>
> print(len("á"))
> 1
>
> and so do other languages.

The same Python 3 that people criticize for having unintuitive unicode string handling?

https://learnpythonthehardway.org/book/nopython3.html

> Is it asking too much to ask for `string` (not `dstring` or `wstring`) to behave as most people would expect it to behave in 2018 - and not like Python 2 from days of yore? But of course, D users should have a "Unicode license" before they do anything with strings. (I wonder is there a different license for UTF8 and UTF16 and UTF32, Big / Little Endian, BOM? Just asking.)

Yes and no, unicode is a clusterf***, so every programming language is having problems with it.

> So again, for the umpteenth time, it's the users' fault. I see. Ironically enough, it was the language developers' lack of understanding of Unicode that led to string handling being a nightmare in D in the first place. Oh lads, if you were politicians I'd say that with this attitude you're gonna lose the next election. I say this because many times the posts by (core) developers remind me so much of politicians who are completely detached from the reality of the people. Right oh!

You have a point that it was D devs' ignorance of unicode that led to the current auto-decoding problem. But let's have some nuance here, the problem ultimately is unicode.
September 06, 2018
On 06/09/2018 7:54 PM, Joakim wrote:
> On Thursday, 6 September 2018 at 07:23:57 UTC, Chris wrote:
>> On Wednesday, 5 September 2018 at 22:00:27 UTC, H. S. Teoh wrote:
>>
>>>
>>> //
>>>
>>> Seriously, people need to get over the fantasy that they can just use Unicode without understanding how Unicode works.  Most of the time, you can get the illusion that it's working, but 99% of the time the code is actually wrong and will do the wrong thing when given an unexpected (but still valid) Unicode string.  You can't drive without a license, and even if you try anyway, the chances of ending up in a nasty accident are pretty high.  People *need* to learn how to use Unicode properly before complaining about why this or that doesn't work the way they thought it should work.
>>>
>>>
>>> T
>>
>> Python 3 gives me this:
>>
>> print(len("á"))
>> 1
>>
>> and so do other languages.
> 
> The same Python 3 that people criticize for having unintuitive unicode string handling?
> 
> https://learnpythonthehardway.org/book/nopython3.html
> 
>> Is it asking too much to ask for `string` (not `dstring` or `wstring`) to behave as most people would expect it to behave in 2018 - and not like Python 2 from days of yore? But of course, D users should have a "Unicode license" before they do anything with strings. (I wonder is there a different license for UTF8 and UTF16 and UTF32, Big / Little Endian, BOM? Just asking.)
> 
> Yes and no, unicode is a clusterf***, so every programming language is having problems with it.
> 
>> So again, for the umpteenth time, it's the users' fault. I see. Ironically enough, it was the language developers' lack of understanding of Unicode that led to string handling being a nightmare in D in the first place. Oh lads, if you were politicians I'd say that with this attitude you're gonna lose the next election. I say this because many times the posts by (core) developers remind me so much of politicians who are completely detached from the reality of the people. Right oh!
> 
> You have a point that it was D devs' ignorance of unicode that led to the current auto-decoding problem. But let's have some nuance here, the problem ultimately is unicode.

Let's also be realistic here: when D was being designed, UTF-16 was touted as 'the' solution you should support; e.g. Java had it retrofitted shortly before D. So it isn't anyone's fault on D's end.
September 06, 2018
On Thursday, 6 September 2018 at 07:54:09 UTC, Joakim wrote:
> On Thursday, 6 September 2018 at 07:23:57 UTC, Chris wrote:
>> On Wednesday, 5 September 2018 at 22:00:27 UTC, H. S. Teoh wrote:
>>
>>>
>>> //
>>>
>>> Seriously, people need to get over the fantasy that they can just use Unicode without understanding how Unicode works.  Most of the time, you can get the illusion that it's working, but 99% of the time the code is actually wrong and will do the wrong thing when given an unexpected (but still valid) Unicode string.  You can't drive without a license, and even if you try anyway, the chances of ending up in a nasty accident are pretty high.  People *need* to learn how to use Unicode properly before complaining about why this or that doesn't work the way they thought it should work.
>>>
>>>
>>> T
>>
>> Python 3 gives me this:
>>
>> print(len("á"))
>> 1
>>
>> and so do other languages.
>
> The same Python 3 that people criticize for having unintuitive unicode string handling?
>
> https://learnpythonthehardway.org/book/nopython3.html
>
>> Is it asking too much to ask for `string` (not `dstring` or `wstring`) to behave as most people would expect it to behave in 2018 - and not like Python 2 from days of yore? But of course, D users should have a "Unicode license" before they do anything with strings. (I wonder is there a different license for UTF8 and UTF16 and UTF32, Big / Little Endian, BOM? Just asking.)
>
> Yes and no, unicode is a clusterf***, so every programming language is having problems with it.
>
>> So again, for the umpteenth time, it's the users' fault. I see. Ironically enough, it was the language developers' lack of understanding of Unicode that led to string handling being a nightmare in D in the first place. Oh lads, if you were politicians I'd say that with this attitude you're gonna lose the next election. I say this because many times the posts by (core) developers remind me so much of politicians who are completely detached from the reality of the people. Right oh!
>
> You have a point that it was D devs' ignorance of unicode that led to the current auto-decoding problem. But let's have some nuance here, the problem ultimately is unicode.

Yes, Unicode is a beast that is hard to tame. But there is, afaik, not even a proper plan to tackle the whole thing in D, just patches. D has autodecoding, which slows things down but doesn't even work correctly at the same time. However, it cannot be removed due to massive code breakage. So you sacrifice speed for security (fine) - but the security doesn't even exist. So what's the point? Also, there aren't any guidelines about how to use strings in different contexts. So after a while your code ends up being a mess of .byCodePoint / .byGrapheme / string / dstring whatever, and you never know if you really got it right or not (performance-wise or otherwise).

We're talking about basic functionality like string handling. String handling is very important these days (data harvesting, translation tools), and IT is used all over the world, where you have to deal with alphabets that are outside the ASCII range. And because it's such basic functionality, you don't want to waste time having to think about it.
September 06, 2018
On Wednesday, 5 September 2018 at 07:48:34 UTC, Chris wrote:
> On Tuesday, 4 September 2018 at 21:36:16 UTC, Walter Bright wrote:
>>
>> Autodecode - I've suffered under that, too. The solution was fairly simple. Append .byCodeUnit to strings that would otherwise autodecode. Annoying, but hardly a showstopper.
>
> import std.array : array;
> import std.stdio : writefln;
> import std.uni : byCodePoint, byGrapheme;
> import std.utf : byCodeUnit;
>
> void main() {
>
>   string first = "á";
>
>   writefln("%d", first.length);  // prints 2
>
>   auto firstCU = "á".byCodeUnit; // type is `ByCodeUnitImpl` (!)
>
>   writefln("%d", firstCU.length);  // prints 2
>
>   auto firstGr = "á".byGrapheme.array;  // type is `Grapheme[]`
>
>   writefln("%d", firstGr.length);  // prints 1
>
>   auto firstCP = "á".byCodePoint.array; // type is `dchar[]`
>
>   writefln("%d", firstCP.length);  // prints 1
>
>   dstring second = "á";
>
>   writefln("%d", second.length);  // prints 1 (That was easy!)
>
>   // DMD64 D Compiler v2.081.2
> }

And this has what to do with autodecoding?

>
> Welcome to my world!
>

TBH, it looks like you're just confused about how Unicode works. None of that is something particular to D. You should probably address your concerns to the Unicode Consortium. Not that they care.
September 06, 2018
On Thursday, 6 September 2018 at 08:44:15 UTC, nkm1 wrote:
> On Wednesday, 5 September 2018 at 07:48:34 UTC, Chris wrote:
>> On Tuesday, 4 September 2018 at 21:36:16 UTC, Walter Bright wrote:
>>>
>>> Autodecode - I've suffered under that, too. The solution was fairly simple. Append .byCodeUnit to strings that would otherwise autodecode. Annoying, but hardly a showstopper.
>>
>> import std.array : array;
>> import std.stdio : writefln;
>> import std.uni : byCodePoint, byGrapheme;
>> import std.utf : byCodeUnit;
>>
>> void main() {
>>
>>   string first = "á";
>>
>>   writefln("%d", first.length);  // prints 2
>>
>>   auto firstCU = "á".byCodeUnit; // type is `ByCodeUnitImpl` (!)
>>
>>   writefln("%d", firstCU.length);  // prints 2
>>
>>   auto firstGr = "á".byGrapheme.array;  // type is `Grapheme[]`
>>
>>   writefln("%d", firstGr.length);  // prints 1
>>
>>   auto firstCP = "á".byCodePoint.array; // type is `dchar[]`
>>
>>   writefln("%d", firstCP.length);  // prints 1
>>
>>   dstring second = "á";
>>
>>   writefln("%d", second.length);  // prints 1 (That was easy!)
>>
>>   // DMD64 D Compiler v2.081.2
>> }
>
> And this has what to do with autodecoding?

Nothing. I was just pointing out how awkward some basic things can be. Autodecoding just adds to it, in the sense that it's a useless overhead but will keep string handling in limbo forever and ever and ever.

>
> TBH, it looks like you're just confused about how Unicode works. None of that is something particular to D. You should probably address your concerns to the Unicode Consortium. Not that they care.

I'm actually not confused since I've been dealing with Unicode (and encodings in general) for quite a while now. Although I'm not a Unicode expert, I know what the operations above do and why. I'd only expect a modern PL to deal with Unicode correctly and have some guidelines as to the nitty-gritty.

And once again, it's the user's fault as in having some basic assumptions about how things should work. The user is just too stoooopid to use D properly - that's all. I know this type of behavior from the management of pubs and shops that had to close down, because nobody would go there anymore.

Do you know the book "Crónica de una muerte anunciada" (Chronicle of a Death Foretold) by Gabriel García Márquez?

"The central question at the core of the novella is how the death of Santiago Nasar was foreseen, yet no one tried to stop it."[1]

[1] https://en.wikipedia.org/wiki/Chronicle_of_a_Death_Foretold#Key_themes
September 06, 2018
On 09/06/2018 09:23 AM, Chris wrote:
> Python 3 gives me this:
> 
> print(len("á"))
> 1

Python 3 also gives you this:

print(len("á"))
2

(The example might not survive transfer from me to you if Unicode normalization happens along the way.)

That's when you enter the 'á' as 'a' followed by U+0301 (combining acute accent). So Python's `len` counts in code points, like D's std.range does (auto-decoding).
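
For comparison, the D equivalent as a small sketch (assuming Phobos' std.uni/std.utf, as in Chris' earlier example):

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main() {
    string precomposed = "\u00E1";  // 'á' as the single code point U+00E1
    string combining   = "a\u0301"; // 'a' followed by U+0301 (combining acute accent)

    writeln(precomposed.walkLength);          // 1 code point (auto-decoding)
    writeln(combining.walkLength);            // 2 code points -- same as Python 3's len
    writeln(combining.byCodeUnit.walkLength); // 3 UTF-8 code units
    writeln(combining.byGrapheme.walkLength); // 1 grapheme
}

Only the byGrapheme count matches what a user would call "one character" for both spellings.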