May 31, 2016
On 05/31/2016 04:33 PM, Seb wrote:
> https://github.com/dlang/phobos/pull/4384
>
> Explicitly stating the type of iteration in the 132 places with
> auto-decoding in Phobos doesn't sound that terrible.

After checking some of those 132 places, I found they are in generic functions that take ranges: std.algorithm.equal, std.range.take, stuff like that.

That's expected, of course, as the range primitives are used there. But those places are not the ones we'd have to fix. We'd have to fix the code that uses those generic functions on strings.
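For illustration, here's a minimal sketch of what such a call-site fix could look like once strings stop auto-decoding (byCodeUnit and byGrapheme are the existing std.utf / std.uni primitives; the snippet is purely illustrative, not code from the PR):

    import std.algorithm.comparison : equal;
    import std.range : take;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    unittest
    {
        string s = "résumé";

        // Today: take() sees the auto-decoded range and yields 3 code points.
        auto byDefault = s.take(3);

        // Without auto-decoding, the caller spells out the intent:
        auto units  = s.byCodeUnit.take(3);   // first 3 UTF-8 code units
        auto graphs = s.byGrapheme.take(3);   // first 3 user-perceived characters

        assert(equal(s.byCodeUnit, "résumé".byCodeUnit));
    }
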
May 31, 2016
On Tuesday, 31 May 2016 at 13:33:14 UTC, Marc Schütz wrote:
> In an ideal world, the programs someone intuitively writes will do the right thing, and if they can't, they at least refuse to compile. If we agree that it's up to the user whether to iterate over a string by code unit or code points or graphemes, and that we shouldn't arbitrarily choose one of those (except when we know that it's what the user wants), then the same applies to indexing, slicing and counting.

If the user doesn't know how he wants to iterate and you leave the decision to the user... erm... it's not going to give a correct result :)
May 31, 2016
On 5/31/16 3:56 AM, Walter Bright wrote:
> On 5/30/2016 9:16 PM, Andrei Alexandrescu wrote:
>> On 5/30/16 5:51 PM, Walter Bright wrote:
>>> On 5/30/2016 8:34 AM, Marc Schütz wrote:
>>>> In an ideal world, we'd also want to change the way `length` and
>>>> `opIndex` work,
>>>
>>> Why? strings are arrays of code units. All the trouble comes from
>>> erratically pretending otherwise.
>>
>> That's not an argument.
>
> Consistency is a factual argument, and autodecode is not consistent.

Consistency with what? Consistent with what?

>> Objects are arrays of bytes, or tuples of their fields,
>> etc. The whole point of encapsulation is superimposing a more
>> structured view on
>> top of the representation. Operating on open-heart representation is
>> risky, and
>> strings are no exception.
>
> If there is an abstraction for strings that is efficient, consistent,
> useful, and hides the fact that it is UTF, I am not aware of it.

It's been mentioned several times: a string type that does not offer range primitives; instead it offers explicit primitives (such as byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges. -- Andrei
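A minimal sketch of what such a type could look like (the String wrapper itself is hypothetical; it only forwards to existing std.utf / std.uni primitives):

    static import std.uni;
    static import std.utf;
    import std.range : walkLength;

    /// Hypothetical wrapper: it deliberately has no range primitives,
    /// no opIndex and no length, so the caller must always pick an
    /// iteration level explicitly.
    struct String
    {
        private immutable(char)[] data;

        auto byCodeUnit()  { return std.utf.byCodeUnit(data); }  // range of char
        auto byCodePoint() { return std.utf.byDchar(data); }     // range of dchar
        auto byGrapheme()  { return std.uni.byGrapheme(data); }  // range of Grapheme
    }

    unittest
    {
        auto s = String("noël");                // precomposed ë
        assert(s.byCodeUnit.walkLength  == 5);  // UTF-8 code units
        assert(s.byCodePoint.walkLength == 4);  // code points
        assert(s.byGrapheme.walkLength  == 4);  // user-perceived characters
    }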

May 31, 2016
On 5/31/16 10:33 AM, Seb wrote:
> Explicitly stating the type of iteration in the 132 places with
> auto-decoding in Phobos doesn't sound that terrible.

It is terrible, no two ways about it. We've been very, very careful with changes that caused even a handful of breakages in Phobos. This one really means every D project on the planet will be broken. We can't contemplate that; it's suicide. -- Andrei

May 31, 2016
On Sunday, May 29, 2016 13:47:32 H. S. Teoh via Digitalmars-d wrote:
> On Sun, May 29, 2016 at 03:55:22PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> > So now code points are good? -- Andrei
>
> It depends on what you're trying to accomplish. That's the point we're trying to get at.  For some operations, working with code points makes the most sense. But for other operations, it does not.  There is no one representation that is best for all situations; it needs to be decided on a case-by-case basis.  Which is why forcing everything to decode to code points eventually leads to problems.

Exactly. And even a given function can't necessarily always be defined to use a specific level of Unicode, because whether that's correct or not depends on what the programmer is actually trying to do with the function. And then there are cases where the programmer knows enough about the data that they're dealing with that they're able to operate at a different level of Unicode than would normally be correct. The most obvious example of that is when you know that your strings are pure ASCII, but it's not the only case.

We should strive to make Phobos operate correctly on strings by default where we can, but there are cases where the programmer needs to know enough to specify the behavior that they want, and deciding for them is just going to lead to behavior that happens to be right some of the time while making it hard for code using Phobos to have the correct behavior the rest of the time. And the default behavior that we currently have is inefficient to boot.
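As a sketch of the "known ASCII" case (the helper below is hypothetical; std.ascii.toLower, std.utf.byCodeUnit, and the std.algorithm functions are existing Phobos symbols):

    import std.algorithm.comparison : equal;
    import std.algorithm.iteration : map;
    import std.ascii : toLower;      // ASCII-only, no Unicode tables
    import std.utf : byCodeUnit;

    // Hypothetical helper: only valid when both inputs are known to be
    // pure ASCII (e.g. HTTP header names). Then per-code-unit case
    // folding is correct, and no decoding is needed at all.
    bool asciiCaseInsensitiveEqual(string a, string b)
    {
        return equal(a.byCodeUnit.map!toLower, b.byCodeUnit.map!toLower);
    }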

- Jonathan M Davis

May 31, 2016
On Tuesday, 31 May 2016 at 15:07:09 UTC, Andrei Alexandrescu wrote:
> Consistency with what? Consistent with what?
>

It is a slice type. It should work as a slice type. Every other design stinks.

May 31, 2016
On Monday, 30 May 2016 at 17:35:36 UTC, Chris wrote:
> On Monday, 30 May 2016 at 16:03:03 UTC, Marco Leise wrote:
>
>> *** http://site.icu-project.org/home#TOC-What-is-ICU-
>
> I was actually talking about ICU with a colleague today. Could it be that Unicode itself is broken? I've often heard criticism of Unicode but never looked into it.

Part of it is the complexity of written language, part of it is bad technical decisions.  Building the default string type in D around the horrible UTF-8 encoding was a fundamental mistake, both in terms of efficiency and complexity.  I noted this in one of my first threads in this forum, and as Andrei said at the time, nobody agreed with me, with a lot of hand-waving about how efficiency wasn't an issue or that UTF-8 arrays were fine.  Fast-forward years later and exactly the issues I raised are now causing pain.

UTF-8 is an antiquated hack that needs to be eradicated.  It forces all languages other than English to be twice as long, for no good reason; have fun with that when you're downloading text over a 2G connection in the developing world.  It is unnecessarily inefficient, which is precisely why auto-decoding is a problem.  It is only a matter of time till UTF-8 is ditched.

D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable.  I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte.  Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_

The common string-handling use case, by far, is a string in a single language, with some substrings in a second language a distant second, yet here we are putting the overhead into every character to allow inserting characters from an arbitrary language!  This is madness.

Yes, the complexity of diacritics and combining characters will remain, but that is complexity that is inherent to the variety of written language.  UTF-8 is not: it is just a bad technical decision, likely chosen for ASCII compatibility and some misguided notion that being able to combine arbitrary language strings with no other metadata was worthwhile.  It is not.
May 31, 2016
On Tuesday, May 31, 2016 11:07:09 Andrei Alexandrescu via Digitalmars-d wrote:
> On 5/31/16 3:56 AM, Walter Bright wrote:
> > If there is an abstraction for strings that is efficient, consistent, useful, and hides the fact that it is UTF, I am not aware of it.
>
> It's been mentioned several times: a string type that does not offer range primitives; instead it offers explicit primitives (such as byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges.

Not exactly. Such a string type does not hide the fact that it's UTF. Rather, it forces you to deal with the fact that it's UTF. I have to agree with Walter in that there really isn't a way to automatically handle Unicode correctly and efficiently while hiding the fact that it's doing all of the stuff that has to be done for UTF.

That being said, while an array of code units is really what a string should be under the hood, having a string type that provides byCodeUnit, byCodePoint, and byGrapheme is an improvement over treating immutable(char)[] as string, even if byCodeUnit returns immutable(char)[], because it forces the programmer to decide what they want to do rather than blindly operate on immutable(char)[] as if a char were a full character. And as long as it provides access to each level of Unicode, it's possible for programmers who know what they're doing to operate on Unicode efficiently, while simultaneously making it much more obvious to those who don't know what they're doing that they don't know what they're doing, rather than having them blindly act like char is a full character.

There's really no reason why we couldn't define a string type that operated that way while continuing to treat arrays of char the way that we do now in the language, though transitioning to such a scheme is not at all straightforward in terms of avoiding code breakage. Defining a String type would be simple enough, and any function in Phobos which accepted a string could be changed to accept a String, but we'd have problems with many functions which currently returned string, since changing what they returned would break code.

But even if Phobos were somehow completely changed over to use a new String type, and even if the string alias were deprecated/removed, we'd still have to deal with arrays of char, wchar, and dchar and run the risk of someone using those and having problems because they didn't treat them as arrays of code units. We can't really prevent that; we can just make string/String something that puts the Unicode issue front and center so that folks are less likely to blindly treat chars as full characters. But even then, it's not like it would be hard for folks to just use the wrong Unicode level. All we'd really be doing is shoving the issue in their face so that they'd have to acknowledge it on some level and maybe then actually learn enough to operate on Unicode strings correctly.

But then again, since all you're really doing at that point is shoving the Unicode issues in folks' faces by not treating strings as ranges or indexable and forcing them to call byCodeUnit, byCodePoint, byGrapheme, etc., I don't know that it actually solves much over treating immutable(char)[] as string. Programmers still have to learn enough Unicode to handle it correctly, just like they do now (whether we have autodecoding or not). And such a string type really doesn't make the Unicode handling any easier. It just makes it harder to ignore the Unicode issues.

The Unicode problem is a lot like the floating point problems that have been discussed recently. Programmers want it to "just work" without them having to worry about the details, but that really doesn't work, and while the average programmer may not understand either floating point operations or Unicode properly, the average programmer does actually have to work with both on a regular basis.

I'm not at all convinced that having string be an alias of immutable(char)[] was a mistake, but having a struct that's not a range may very well be an improvement. It _would_ at least make some of the Unicode issues more obvious, but it doesn't really solve much from what I can see.

- Jonathan M Davis

May 31, 2016
On Tuesday, May 31, 2016 07:17:03 default0 via Digitalmars-d wrote:
> Thinking about this a bit more - what algorithms are actually
> correct when implemented on the level of code units?
> Off the top of my head I can only really think of copying and
> hashing, since you want to do that on the byte level anyways.
> I would also think that if you know your strings are normalized
> in the same normalization form (for example because they come
> from the same normalized source), you can check two strings for
> equality on the code unit level, but my understanding of unicode
> is still quite lacking, so I'm not sure on that.

Equality does not require decoding. Similarly, functions like find don't either. Something like filter generally would, but it's also not particularly normal to filter a string on a by-character basis. You'd probably want to get to at least the word level in that case.

To make matters worse, functions like find or splitter are frequently used to look for ASCII delimiters, even when the strings themselves contain Unicode characters. So, even if decoding were necessary when looking for a Unicode character, it's utterly wasteful when the character you're looking for is ASCII. But searching generally does not require decoding so long as the same character is always encoded the same way. So, Unicode normalization _can_ be a problem, but that's a problem with code points as well as code units (since the normalization has to do with the order of code points when multiple code points make up a single grapheme). You'd have to go to the grapheme level to avoid that problem. And that's why at least some of the time, string-processing code is going to need to normalize its strings before doing searches. But the searches themselves can then operate at the code unit level.
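A small sketch of both points (splitter, canFind, normalize, and NFC are existing Phobos symbols; the strings are just examples):

    import std.algorithm.iteration : splitter;
    import std.algorithm.searching : canFind;
    import std.uni;                   // normalize, NFC
    import std.utf : byCodeUnit;

    unittest
    {
        // Splitting on an ASCII comma needs no decoding, even though the
        // fields themselves contain non-ASCII text.
        auto fields = "köln,münchen,berlin".byCodeUnit.splitter(',');

        // A substring search only matches reliably when both sides are in
        // the same normalization form ("é" may be a single code point or
        // e followed by U+0301).
        auto haystack = normalize!NFC("re\u0301sume\u0301"); // e + combining acute
        auto needle   = normalize!NFC("\u00E9");             // precomposed é
        assert(haystack.canFind(needle));
    }
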

- Jonathan M Davis

May 31, 2016
On Friday, May 27, 2016 23:16:58 David Nadlinger via Digitalmars-d wrote:
> On Friday, 27 May 2016 at 22:12:57 UTC, Minas Mina wrote:
> > Those should be the same though, i.e. compare the same. In order to do that, there is normalization. What it does is to _expand_ the single codepoint Ä into A + ¨
>
> Unless I'm mistaken, this depends on the form used. For example, in NFKC you'd get the single codepoint Ä.

Yeah. For better or worse, there are different normalization schemes for Unicode. A normalization scheme makes the encodings consistent, but that doesn't mean that each of the different normalization schemes does the same thing, just that if you apply the same normalization scheme to two strings, then all graphemes within those strings will be encoded identically.
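For example (normalize, NFC, and NFD are existing std.uni symbols; a small sketch, not a full treatment of all four forms):

    import std.uni;   // normalize, NFC, NFD

    unittest
    {
        string precomposed = "\u00C4";   // Ä as a single code point
        string decomposed  = "A\u0308";  // A + combining diaeresis

        // Different forms produce different encodings...
        assert(normalize!NFC(precomposed) != normalize!NFD(precomposed));

        // ...but applying the *same* form to both spellings makes them
        // identical at the code unit level.
        assert(normalize!NFC(precomposed) == normalize!NFC(decomposed));
        assert(normalize!NFD(precomposed) == normalize!NFD(decomposed));
    }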

- Jonathan M Davis