September 06, 2018
On Thursday, September 6, 2018 10:44:11 AM MDT H. S. Teoh via Digitalmars-d wrote:
> On Thu, Sep 06, 2018 at 02:42:58PM +0000, Dukc via Digitalmars-d wrote:
> > On Thursday, 6 September 2018 at 14:17:28 UTC, aliak wrote:
> > > // D
> > > auto a = "á";
> > > auto b = "á";
> > > auto c = "\u200B";
> > > auto x = a ~ c ~ a;
> > > auto y = b ~ c ~ b;
> > >
> > > writeln(a.length); // 2 wtf
> > > writeln(b.length); // 3 wtf
> > > writeln(x.length); // 7 wtf
> > > writeln(y.length); // 9 wtf
>
> [...]
>
> This is an unfair comparison.  In the Swift version you used .count, but here you used .length, which is the length of the array, NOT the number of characters or whatever you expect it to be.  You should rather use .count and specify exactly what you want to count, e.g., byCodePoint or byGrapheme.
>
> I suspect the Swift version will give you unexpected results if you did something like compare "á" to "a\u301", for example (which, in case it isn't obvious, are visually identical to each other, and as far as an end user is concerned, should only count as 1 grapheme).
>
> Not even normalization will help you if you have a string like "a\u301\u302": in that case, the *only* correct way to count the number of visual characters is byGrapheme, and I highly doubt Swift's .count will give you the correct answer in that case. (I expect that Swift's .count will count code points, as is the usual default in many languages, which is unfortunately wrong when you're thinking about visual characters, which are called graphemes in Unicode parlance.)
>
> And even in your given example, what should .count return when there's a zero-width character?  If you're counting the number of visual places taken by the string (e.g., you're trying to align output in a fixed-width terminal), then *both* versions of your code are wrong, because zero-width characters do not occupy any space when displayed. If you're counting the number of code points, though, e.g., to allocate the right buffer size to convert to dstring, then you want to count the zero-width character as 1 rather than 0.  And that's not to mention double-width characters, which should count as 2 if you're outputting to a fixed-width terminal.
>
> Again I say, you need to know how Unicode works. Otherwise you can easily deceive yourself to think that your code (both in D and in Swift and in any other language) is correct, when in fact it will fail miserably when it receives input that you didn't think of.  Unicode is NOT ASCII, and you CANNOT assume there's a 1-to-1 mapping between "characters" and display length. Or 1-to-1 mapping between any of the various concepts of string "length", in fact.
>
> In ASCII, array length == number of code points == number of graphemes == display width.
>
> In Unicode, array length != number of code points != number of graphemes != display width.
>
> Code written by anyone who does not understand this is WRONG, because you will inevitably end up using the wrong value for the wrong thing: e.g., array length for number of code points, or number of code points for display length. Not even .byGrapheme will save you here; you *need* to understand that zero-width and double-width characters exist, and what they imply for display width. You *need* to understand the difference between code points and graphemes.  There is no single default that will work in every case, because there are DIFFERENT CORRECT ANSWERS depending on what your code is trying to accomplish. Pretending that you can just brush all this detail under the rug of a single number is just deceiving yourself, and will inevitably result in wrong code that will fail to handle Unicode input correctly.

Indeed. And unfortunately, the net result is that a large percentage of the string-processing code out there is going to be wrong, and I don't think that there's any way around that, because Unicode is simply too complicated for the average programmer to understand it (sad as that may be) - especially when most of them don't want to have to understand it.

Really, I'd say that there are only three options that might even be sane, if you really have the flexibility to design a proper solution:

1. Treat strings as ranges of code units by default.

2. Don't allow strings to be ranges, to be iterated, or indexed. They're opaque types.

3. Treat strings as ranges of graphemes.

If strings are treated as ranges of code units by default (particularly if they're UTF-8), you'll get failures very quickly if you're dealing with non-ASCII and you screw up the Unicode handling. It's also by far the most performant solution and in many cases is exactly the right thing to do. Obviously, something like byCodePoint or byGrapheme would then be needed in the cases where code points or graphemes are the appropriate level to iterate at.
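For concreteness, the gap between the counting levels can be sketched like this (Python is used here purely for illustration, since the counts are properties of Unicode itself rather than of any one language; the UTF-8 byte count is what D's string .length reports):

```python
import unicodedata

a = "\u00e1"   # "á" precomposed: a single code point (NFC form)
b = "a\u0301"  # "á" decomposed: 'a' + COMBINING ACUTE ACCENT (NFD form)

# Code units (UTF-8 bytes) -- what D's .length on a string counts:
assert len(a.encode("utf-8")) == 2
assert len(b.encode("utf-8")) == 3

# Code points -- what auto-decoding iterates over:
assert len(a) == 1
assert len(b) == 2

# The two spellings are canonically equivalent -- one grapheme either way:
assert unicodedata.normalize("NFC", b) == a
```

The same visually identical character thus yields three different "lengths" depending on which level you ask about, which is exactly why no single default can be right for every caller.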

If strings are opaque types (with ways to get ranges over code units, code points, etc.), that mostly works in that it forces you to at least try to understand Unicode well enough to make sane choices about how you iterate over the string. However, it doesn't completely get away from the issue of the default, because of ==. It would be a royal pain if == didn't work, and if it does work, you then have the question of what it's comparing. Code units? Code points? Graphemes? Assuming that the representation is always the same encoding, comparing code points wouldn't make any sense, but you'd still have the question of code units or graphemes. As such, I'm not sure that an opaque type really makes the most sense (though it's suggested often enough).
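The == question can be made concrete: raw comparison and comparison by canonical equivalence give different answers for visually identical strings. A minimal sketch (Python for illustration; canon_eq is a hypothetical helper, not an API from any library):

```python
import unicodedata

def canon_eq(s: str, t: str) -> bool:
    # Compare by canonical equivalence instead of raw code points
    return unicodedata.normalize("NFC", s) == unicodedata.normalize("NFC", t)

# Visually identical strings that compare unequal code point by code point:
assert "\u00e1" != "a\u0301"   # precomposed "á" vs. 'a' + combining acute
assert canon_eq("\u00e1", "a\u0301")
```

Whichever behavior == gets, some callers will consider it wrong, which is the crux of the opaque-type objection.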

If strings are treated as ranges of graphemes, then that should then be correct for everything that doesn't care about the visual representation (and thus doesn't care about the display width of characters), but it would be highly inefficient to do most things at the grapheme level, and it would likely have many of the same problems that we have with strings now with regards to stuff like them not being able to be random-access and how they don't really work as output ranges.

So, if we were doing things from scratch, and it were up to me, I would basically go with what Walter originally tried to do and make strings be arrays of code units but with them also being ranges of code units - thereby avoiding all of the pain that we get with trying to claim that strings don't have capabilities that they clearly do have (such as random-access or length). And then of course, all of the appropriate helper functions would be available for the different levels of Unicode handling. I think that this is the solution that quite a few of us want - though some have expressed interest in an opaque string type, and I think that that's the direction that RCString (or whatever it's called) may be going.

Unfortunately, right now, it's not looking like we're going to be able to implement what we'd like here because of the code breakage issues in removing auto-decoding. RCString may very well end up doing the right thing, and I know that Andrei wants to then encourage it to be the default string for everyone to use (much as we don't all agree with that idea), but we're still stuck with auto-decoding with regular strings and having to worry about it when writing generic code. _Maybe_ someone will be able to come up with a sane solution for moving away from auto-decoding, but it's not seeming likely at the moment.

Either way, what needs to be done first is making sure that Phobos in general works with ranges of char, wchar, dchar, and graphemes rather than assuming that all ranges of characters are ranges of dchar. Fortunately, some work has been done towards that, but it's not yet true of Phobos in general, and it needs to be. Once it is, then the impact of auto-decoding is reduced in general, and with Phobos depending on it as little as possible, it then makes it saner to discuss how we might remove auto-decoding. I'm not at all convinced that it would make it possible to sanely remove it, but until that work is done, we definitely can't remove it regardless. And actually, until that work is done, the workarounds for auto-decoding (e.g. byCodeUnit) don't work as well as they should. I've done some of that work (as have some others), but I really should figure out how to get through enough of my todo list that I can get more done towards that goal - particularly since I don't think that anyone is actively working the problem. For the most part, it's only been done when someone ran into a problem with a specific function, whereas in reality, we need to be adding the appropriate tests for all of the string-processing functions in Phobos and then ensure that they pass those tests.

- Jonathan M Davis




September 06, 2018
On Thursday, 6 September 2018 at 16:44:11 UTC, H. S. Teoh wrote:
> On Thu, Sep 06, 2018 at 02:42:58PM +0000, Dukc via Digitalmars-d wrote:
>> On Thursday, 6 September 2018 at 14:17:28 UTC, aliak wrote:
>> > // D
>> > auto a = "á";
>> > auto b = "á";
>> > auto c = "\u200B";
>> > auto x = a ~ c ~ a;
>> > auto y = b ~ c ~ b;
>> > 
>> > writeln(a.length); // 2 wtf
>> > writeln(b.length); // 3 wtf
>> > writeln(x.length); // 7 wtf
>> > writeln(y.length); // 9 wtf
> [...]
>
> This is an unfair comparison.  In the Swift version you used .count, but here you used .length, which is the length of the array, NOT the number of characters or whatever you expect it to be.  You should rather use .count and specify exactly what you want to count, e.g., byCodePoint or byGrapheme.
>
> I suspect the Swift version will give you unexpected results if you did something like compare "á" to "a\u301", for example (which, in case it isn't obvious, are visually identical to each other, and as far as an end user is concerned, should only count as 1 grapheme).
>
> Not even normalization will help you if you have a string like "a\u301\u302": in that case, the *only* correct way to count the number of visual characters is byGrapheme, and I highly doubt Swift's .count will give you the correct answer in that case. (I expect that Swift's .count will count code points, as is the usual default in many languages, which is unfortunately wrong when you're thinking about visual characters, which are called graphemes in Unicode parlance.)
>
> And even in your given example, what should .count return when there's a zero-width character?  If you're counting the number of visual places taken by the string (e.g., you're trying to align output in a fixed-width terminal), then *both* versions of your code are wrong, because zero-width characters do not occupy any space when displayed. If you're counting the number of code points, though, e.g., to allocate the right buffer size to convert to dstring, then you want to count the zero-width character as 1 rather than 0.  And that's not to mention double-width characters, which should count as 2 if you're outputting to a fixed-width terminal.
>
> Again I say, you need to know how Unicode works. Otherwise you can easily deceive yourself to think that your code (both in D and in Swift and in any other language) is correct, when in fact it will fail miserably when it receives input that you didn't think of.  Unicode is NOT ASCII, and you CANNOT assume there's a 1-to-1 mapping between "characters" and display length. Or 1-to-1 mapping between any of the various concepts of string "length", in fact.
>
> In ASCII, array length == number of code points == number of graphemes == display width.
>
> In Unicode, array length != number of code points != number of graphemes != display width.
>
> Code written by anyone who does not understand this is WRONG, because you will inevitably end up using the wrong value for the wrong thing: e.g., array length for number of code points, or number of code points for display length. Not even .byGrapheme will save you here; you *need* to understand that zero-width and double-width characters exist, and what they imply for display width. You *need* to understand the difference between code points and graphemes.  There is no single default that will work in every case, because there are DIFFERENT CORRECT ANSWERS depending on what your code is trying to accomplish. Pretending that you can just brush all this detail under the rug of a single number is just deceiving yourself, and will inevitably result in wrong code that will fail to handle Unicode input correctly.
>
>
> T

It's a totally fair comparison. .count in Swift is the equivalent of .length in D; you use it to get the size of an array, etc. They've just implemented string.length as string.byGrapheme.walkLength. So it's intuitively correct (and yes, slower). If you didn't want the default, though, you could also specify what "view" over characters you want. E.g.:

let a = "á̂"
a.count                // 1 <-- Yes, it is exactly as expected.
a.unicodeScalars.count // 3
a.utf8.count           // 5

I don't really see any issues with a zero-width character. If you want to deal with screen width (i.e. pixel space), that's not the same as how many characters are in a string. And it doesn't matter whether you go byGrapheme or byCodePoint or byCodeUnit, because none of those represent a single column on screen. A zero-width character is 0 *width* but it's still *one* character. There's no .length/size/count in any language (that I've heard of) that'll give you your screen space from its string type. You query the font API for that, as it depends on font size, kerning, style and face.
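For the fixed-width-terminal case raised earlier in the thread, the Unicode character database at least records which characters are zero-width format characters and which are double-width, which shows why no single count can stand in for display columns. A small illustration (Python's unicodedata; full wcwidth-style column logic is out of scope here):

```python
import unicodedata

zwsp = "\u200b"  # ZERO WIDTH SPACE
assert len(zwsp) == 1                      # one code point...
assert unicodedata.category(zwsp) == "Cf"  # ...but a format character, zero columns wide

# Fullwidth characters occupy two terminal columns despite being one code point:
assert unicodedata.east_asian_width("\uff21") == "F"  # FULLWIDTH LATIN CAPITAL LETTER A
```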

And again, I agree you need to know how unicode works. I don't argue that at all. I'm just saying that having the default be incorrect for application logic is just silly and when people have to do things like string.representation.normalize.byGrapheme or whatever to search for a character in a string *correctly* ... well, just, ARGH!
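The search pitfall is concrete: a code-point-level search can match inside a user-perceived character. A small illustration (Python, with a made-up string):

```python
s = "a\u0301bc"  # renders as "ábc": the first grapheme is 'a' + combining acute

# A naive code-point search "finds" the letter 'a' at index 0, even though no
# user-perceived character 'a' appears in the string -- only "á" does:
assert s.find("a") == 0
assert s[0:1] == "a"  # slicing at that index splits the grapheme in half
```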

D makes the code-point case the default, and hence that becomes the simplest to use. But unfortunately, the only thing I can think of that requires code point representations is dealing specifically with Unicode algorithms (normalization, etc.). Here's a good read on code points: https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/

tl;dr: application logic does not need or want to deal with code points. For speed, code units work, and for correctness, graphemes work.

Yes, you will fail miserably when you receive input you did not expect. That's always true. That's why we have APIs that make it easier or harder to fail more or less. Expecting people to be Unicode experts before using Unicode is also unreasonable - or rather, it just makes it easier to fail, much easier. I sit next to one of the guys who worked on Unicode in Qt and he couldn't explain the difference between a grapheme and an extended grapheme cluster... I'm not saying I can, btw... I'm just saying Unicode is frikkin hard. And we don't need APIs making it harder to get right - which is exactly what non-correct-by-default APIs do.

To boil it down to one sentence: I think it's silly to have a string type that is advertised as Unicode but optimized for latin1-ish, because people will use it for Unicode and get incorrect results from its naturally intuitive usage.

Cheers,
- Ali

September 06, 2018
On Thursday, September 6, 2018 1:04:45 PM MDT aliak via Digitalmars-d wrote:
> D makes the code-point case default and hence that becomes the
> simplest to use. But unfortunately, the only thing I can think of
> that requires code point representations is when dealing
> specifically with unicode algorithms (normalization, etc). Here's
> a good read on code points:
> https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
>
> tl;dr: application logic does not need or want to deal with code points. For speed units work, and for correctness, graphemes work.

I think that it's pretty clear that code points are objectively the worst level to be the default. Unfortunately, changing it to _anything_ else is not going to be an easy feat at this point. But if we can first ensure that Phobos in general doesn't rely on it (i.e. in general, it can deal with ranges of char, wchar, dchar, or graphemes correctly rather than assuming that all ranges of characters are ranges of dchar), then maybe we can figure something out. Unfortunately, while some work has been done towards that, what's mostly happened is that folks have complained about auto-decoding without doing much to improve the current situation. There's a lot more to this than simply ripping out auto-decoding even if every D user on the planet agreed that outright breaking almost every existing D program to get rid of auto-decoding was worth it. But as with too many things around here, there's a lot more talking than working. And actually, as such, I should probably stop discussing this and go do something useful.

- Jonathan M Davis



September 06, 2018
On Thursday, 6 September 2018 at 20:15:22 UTC, Jonathan M Davis wrote:
> On Thursday, September 6, 2018 1:04:45 PM MDT aliak via Digitalmars-d wrote:
>> D makes the code-point case default and hence that becomes the
>> simplest to use. But unfortunately, the only thing I can think of
>> that requires code point representations is when dealing
>> specifically with unicode algorithms (normalization, etc). Here's
>> a good read on code points:
>> https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
>>
>> tl;dr: application logic does not need or want to deal with code points. For speed units work, and for correctness, graphemes work.
>
> I think that it's pretty clear that code points are objectively the worst level to be the default. Unfortunately, changing it to _anything_ else is not going to be an easy feat at this point. But if we can first ensure that Phobos in general doesn't rely on it (i.e. in general, it can deal with ranges of char, wchar, dchar, or graphemes correctly rather than assuming that all ranges of characters are ranges of dchar), then maybe we can figure something out. Unfortunately, while some work has been done towards that, what's mostly happened is that folks have complained about auto-decoding without doing much to improve the current situation. There's a lot more to this than simply ripping out auto-decoding even if every D user on the planet agreed that outright breaking almost every existing D program to get rid of auto-decoding was worth it. But as with too many things around here, there's a lot more talking than working. And actually, as such, I should probably stop discussing this and go do something useful.
>
> - Jonathan M Davis

Is there a unittest somewhere in Phobos that you know of, that one can be pointed to, that shows the handling of these 4 variations you say should be dealt with first? Or maybe a PR that did some of this work that one could investigate?

I ask so I can see in code what it means to make something not rely on autodecoding and deal with ranges of char, wchar, dchar or graphemes.

Or a current "easy" bugzilla issue maybe that one could try a hand at?
September 06, 2018
On Thursday, 6 September 2018 at 17:19:01 UTC, Joakim wrote:
> No, Swift counts grapheme clusters by default, so it gives 1. I suggest you read the linked Swift chapter above. I think it's the wrong choice for performance, but they chose to emphasize intuitiveness for the common case.

I'd like to point out that Swift spent a lot of time reworking how strings are handled.

If my memory serves me well, they reworked strings from version 2 to 3 and finalized them in version 4.

> Swift 4 includes a faster, easier to use String implementation that retains Unicode correctness and adds support for creating, using and managing substrings.

It took them somewhere along the lines of two years to get string handling to an acceptable and predictable state. It annoyed the Swift user base greatly, but a lot of changes were made on the way to reaching a stable API.

Being honest, I personally find Swift an easier language, despite it lacking IDE support on several platforms and having no official Windows compiler.
September 08, 2018
On Thursday, 6 September 2018 at 20:15:22 UTC, Jonathan M Davis wrote:
> On Thursday, September 6, 2018 1:04:45 PM MDT aliak via Digitalmars-d wrote:
>> D makes the code-point case default and hence that becomes the
>> simplest to use. But unfortunately, the only thing I can think of
>> that requires code point representations is when dealing
>> specifically with unicode algorithms (normalization, etc). Here's
>> a good read on code points:
>> https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
>>
>> tl;dr: application logic does not need or want to deal with code points. For speed units work, and for correctness, graphemes work.
>
> I think that it's pretty clear that code points are objectively the worst level to be the default. Unfortunately, changing it to _anything_ else is not going to be an easy feat at this point. But if we can first ensure that Phobos in general doesn't rely on it (i.e. in general, it can deal with ranges of char, wchar, dchar, or graphemes correctly rather than assuming that all ranges of characters are ranges of dchar), then maybe we can figure something out. Unfortunately, while some work has been done towards that, what's mostly happened is that folks have complained about auto-decoding without doing much to improve the current situation. There's a lot more to this than simply ripping out auto-decoding even if every D user on the planet agreed that outright breaking almost every existing D program to get rid of auto-decoding was worth it. But as with too many things around here, there's a lot more talking than working. And actually, as such, I should probably stop discussing this and go do something useful.
>
> - Jonathan M Davis

A tutorial page linked from the front page with some examples would go a long way to making it easier for people.  If I had time and understood strings well enough to explain them to others, I would try to make a start, but unfortunately neither is true.

And if we are doing things right with RCString, then isn't it easier to make the change with that first - which is new, so it can't break code - and in some years, when people are used to working that way, update Phobos (a compiler switch in the beginning, and the big transition a few years after that)?

Isn't this one of the challenges created by the tension between D being both a high-level and a low-level language?  The higher the aim, the more problems you will encounter getting there.  That's okay.

And isn't the obstacle to removing auto-decoding that it seems to be a monolithic challenge of overwhelming magnitude, whereas if we could figure out some steps to eat the elephant one mouthful at a time (which might mean starting with RCString), then it would seem less intimidating?  It will take years anyway, perhaps - but so what?


September 08, 2018
On Thursday, 6 September 2018 at 14:42:14 UTC, Chris wrote:
> On Thursday, 6 September 2018 at 14:30:38 UTC, Guillaume Piolat wrote:
>> On Thursday, 6 September 2018 at 13:30:11 UTC, Chris wrote:
>>> And autodecode is a good example of experts getting it wrong, because, you know, you cannot be an expert in all fields. I think the problem was that it was discovered too late.
>>
>> There are very valid reasons not to talk about auto-decoding again:
>>
>> - it's too late to remove because breakage
>> - attempts at removing it were _already_ tried
>> - it has been debated to DEATH
>> - there is an easy work-around
>>
>> So any discussion _now_ would have the very same structure of the discussion _then_, and would lead to the exact same result. It's quite tragic. And I urge the real D supporters to let such conversation die (topics debated to death) as soon as they appear.
>
> The real supporters? So it's a religion? For me it's about technology and finding a good tool for a job.

Religions have believers but not supporters - in fact, saying you are a supporter says you are not a member of that faith or community.  If I support the Catholic Church's efforts to relieve poverty in XYZ country, I'm not a core part of that effort directly.

Social institutions need support to develop - language is a very old human institution, and programming languages have more similarity with natural languages along certain dimensions (I'm aware that NLP is your field) than some recognise.

So, why shouldn't a language have supporters?  I give some money to the D Foundation - this is called providing support.  Does that make me a zealot, or someone who confuses a computer programming language with a religion?  I don't think so.  I give money to the Foundation because it's a win-win.  It makes me happy to support the development of things that are beautiful and it's commercially a no-brainer because of the incidental benefits it brings.  Probably I would do so without those benefits, but on the other hand the best choices in life often end up solving problems you weren't even planning on solving and maybe didn't know you had.

Does that make me a monomaniac who thinks D should be used everywhere, and only D - the one true language?  I don't think so.  I confess to being excited by the possibility of writing web applications in D, but that has much more to do with Javascript and the ecosystem than it does D.  And on the other hand - even though I have supported the development of a Jupyter kernel for D (something that conceivably could make Julia less necessary) - I'm planning on doing more with Julia, because it's a better solution for some of our commercial problems than anything else I could find, including D.  Does using Julia mean we will write less D?  No - being able to do more work productively means writing more code, probably including more D, Python and C#.

I suggest the problem is in fact the entitlement of people who expect others to give them things for free without recognising that some appreciation would be in order, and that helping in whatever way one can is probably the right thing to do, even if it's in a small way in the beginning.  This is of course a well-known challenge of open-source projects in general, but it's my belief that it's a fleeting period already passing for D.

You know, sometimes it's clear from the way someone argues that it isn't really about what they say.  If the things they claim were problems were in fact anti-problems (merits), they would make different arguments, but with the same emotional tone.

It's odd - if something isn't useful for me then either I just move on and find something that is, or I try to directly act myself or organise others to improve it so it is useful.  I don't stand there grumbling at the toolmakers whilst taking no positive action to make that change happen.

September 08, 2018
On Thursday, September 6, 2018 3:15:59 PM MDT aliak via Digitalmars-d wrote:
> On Thursday, 6 September 2018 at 20:15:22 UTC, Jonathan M Davis
>
> wrote:
> > On Thursday, September 6, 2018 1:04:45 PM MDT aliak via
> >
> > Digitalmars-d wrote:
> >> D makes the code-point case default and hence that becomes the
> >> simplest to use. But unfortunately, the only thing I can think
> >> of
> >> that requires code point representations is when dealing
> >> specifically with unicode algorithms (normalization, etc).
> >> Here's
> >> a good read on code points:
> >> https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
> >>
> >> tl;dr: application logic does not need or want to deal with code points. For speed units work, and for correctness, graphemes work.
> >
> > I think that it's pretty clear that code points are objectively the worst level to be the default. Unfortunately, changing it to _anything_ else is not going to be an easy feat at this point. But if we can first ensure that Phobos in general doesn't rely on it (i.e. in general, it can deal with ranges of char, wchar, dchar, or graphemes correctly rather than assuming that all ranges of characters are ranges of dchar), then maybe we can figure something out. Unfortunately, while some work has been done towards that, what's mostly happened is that folks have complained about auto-decoding without doing much to improve the current situation. There's a lot more to this than simply ripping out auto-decoding even if every D user on the planet agreed that outright breaking almost every existing D program to get rid of auto-decoding was worth it. But as with too many things around here, there's a lot more talking than working. And actually, as such, I should probably stop discussing this and go do something useful.
> >
> > - Jonathan M Davis
>
> Is there a unittest somewhere in phobos you know that one can be pointed to that shows the handling of these 4 variations you say should be dealt with first? Or maybe a PR that did some of this work that one could investigate?
>
> I ask so I can see in code what it means to make something not rely on autodecoding and deal with ranges of char, wchar, dchar or graphemes.
>
> Or a current "easy" bugzilla issue maybe that one could try a hand at?

Not really. The handling of this has generally been too ad-hoc. There are plenty of examples of handling different string types, and there are a few handling different ranges of character types, but there's a distinct lack of tests involving graphemes. And the correct behavior for each is going to depend on what exactly the function does - e.g. almost certainly, the correct thing for filter to do is to not do anything special for ranges of characters at all and just filter on the element type of the range (even though it would almost always be incorrect to filter a range of char unless it's known to be all ASCII), while on the other hand, find is clearly designed to handle different encodings. So, it needs to be able to find a dchar or grapheme in a range of char. And of course, there's the issue of how normalization should be handled (if at all).
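On the normalization point: even NFC cannot always collapse a grapheme to a single code point, which is one reason a code-point-level find still isn't grapheme-correct. A sketch (Python for illustration; the string echoes the "a\u301\u302" example from earlier in the thread):

```python
import unicodedata

s = "a\u0301\u0302"  # 'a' + combining acute + combining circumflex: ONE grapheme

# NFC composes 'a' + U+0301 into U+00E1 ("á"), but no precomposed form exists
# for the full stack, so the grapheme still spans two code points afterwards:
n = unicodedata.normalize("NFC", s)
assert n == "\u00e1\u0302"
assert len(n) == 2
```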

A number of the tests in std.utf and std.string do a good job of testing Unicode strings of varying encodings, and std.utf does a good job overall of testing ranges of char, wchar, and dchar which aren't strings, but I'm not sure that anything in Phobos outside of std.uni currently does anything with ranges of graphemes.

std.conv.to does have some tests for ranges of char, wchar, and dchar due to a bug fix. e.g.

// bugzilla 15800
@safe unittest
{
    import std.utf : byCodeUnit, byChar, byWchar, byDchar;

    assert(to!int(byCodeUnit("10")) == 10);
    assert(to!int(byCodeUnit("10"), 10) == 10);
    assert(to!int(byCodeUnit("10"w)) == 10);
    assert(to!int(byCodeUnit("10"w), 10) == 10);

    assert(to!int(byChar("10")) == 10);
    assert(to!int(byChar("10"), 10) == 10);
    assert(to!int(byWchar("10")) == 10);
    assert(to!int(byWchar("10"), 10) == 10);
    assert(to!int(byDchar("10")) == 10);
    assert(to!int(byDchar("10"), 10) == 10);
}

but there are no grapheme tests, and no Unicode characters are involved (though I'm not sure that much in std.conv really needs to worry about Unicode characters).

So, there are tests scattered all over the place which do pieces of what they need to be doing, but I'm not sure that there are currently any that test the full range of character ranges that they really need to be testing. As with testing reference type ranges, such tests have generally been added only when fixing a specific bug, and there hasn't been a sufficient effort to just go through all of the affected functions and add appropriate tests.

And unfortunately, unlike with reference type ranges, the correct behavior of a function when faced with ranges of different character types is going to be highly dependent on what they do. Some of them shouldn't be doing anything special for processing ranges of characters, some shouldn't be doing anything special for processing arbitrary ranges of characters, but they still need to do something special for strings because of efficiency issues caused by auto-decoding, and yet others need to actually take Unicode into account and operate on each range type differently depending on whether it's a range of code units, code points, or graphemes.

So, completely aside from auto-decoding issues, it's a bit of a daunting task. I keep meaning to take the time to work on it. I've done some of the critical work for supporting arbitrary ranges of char, wchar, and dchar rather than just string types (as have some other folks), but I haven't spent the time to start going through the functions one by one and adding the appropriate tests and fixes, and no one else has gone that far either. So, I can't really point towards a specific set of tests and say "here, do what these do." And even if I could, whether what those tests do would be correct for another function would depend on what the functions do. So, sorry that I can't be more helpful.

Actually, if you're looking for something related to this to do, and you don't feel that you know enough to just start adding tests, you could try byCodeUnit, byDchar, and byGrapheme with various functions and see what happens. If a function doesn't even compile (which will probably be the case at least some of the time), then that's an easy bug report. If it does compile, then knowing whether it's doing the right thing requires a greater understanding, but in at least some cases, it may be obvious, and if the result is obviously wrong, you can create a bug report for that.
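As a sketch of that approach (std.string.strip is just an arbitrarily chosen example, and whether each line prints true or false depends on your Phobos version), you could mechanically probe which range flavors a given function even accepts:

```d
// Probe whether a Phobos function compiles with each flavor of
// character range. strip is only an example; the same pattern works
// for any candidate function. Nothing is asserted about the results,
// since the whole point is to discover what currently compiles.
import std.stdio : writeln;
import std.string : strip;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    enum s = "  hello  ";

    writeln("string:     ", __traits(compiles, strip(s)));
    writeln("byCodeUnit: ", __traits(compiles, strip(s.byCodeUnit)));
    writeln("byDchar:    ", __traits(compiles, strip(s.byDchar)));
    writeln("byGrapheme: ", __traits(compiles, strip(s.byGrapheme)));
}
```

Any false that looks like it ought to be true is a candidate for an easy bug report; any true still needs a human to check that the result is actually correct for that range type.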

Ultimately though, a pretty solid understanding of ranges and Unicode is going to be required to write a lot of these tests. And worse, a pretty solid understanding of ranges and Unicode is going to be required to use any of these functions correctly even if they all work correctly and have all of the necessary tests to prove it. Unicode is just plain too complicated, and trying to make things "just work" with it is frequently difficult - especially if efficiency matters, but even when efficiency doesn't matter, it's not always obvious how to make it "just work." :(

- Jonathan M Davis



September 08, 2018
On Saturday, September 8, 2018 8:05:04 AM MDT Laeeth Isharc via Digitalmars-d wrote:
> On Thursday, 6 September 2018 at 20:15:22 UTC, Jonathan M Davis
>
> wrote:
> > On Thursday, September 6, 2018 1:04:45 PM MDT aliak via Digitalmars-d wrote:
> >> D makes the code-point case default and hence that becomes the simplest to use. But unfortunately, the only thing I can think of that requires code point representations is when dealing specifically with unicode algorithms (normalization, etc). Here's a good read on code points:
> >> https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
> >>
> >> tl;dr: application logic does not need or want to deal with code points. For speed, units work, and for correctness, graphemes work.
> >
> > I think that it's pretty clear that code points are objectively the worst level to be the default. Unfortunately, changing it to _anything_ else is not going to be an easy feat at this point. But if we can first ensure that Phobos in general doesn't rely on it (i.e. in general, it can deal with ranges of char, wchar, dchar, or graphemes correctly rather than assuming that all ranges of characters are ranges of dchar), then maybe we can figure something out. Unfortunately, while some work has been done towards that, what's mostly happened is that folks have complained about auto-decoding without doing much to improve the current situation. There's a lot more to this than simply ripping out auto-decoding even if every D user on the planet agreed that outright breaking almost every existing D program to get rid of auto-decoding was worth it. But as with too many things around here, there's a lot more talking than working. And actually, as such, I should probably stop discussing this and go do something useful.
>
> A tutorial page linked from the front page with some examples would go a long way to making it easier for people.  If I had time and understood strings enough to explain to others I would try to make a start, but unfortunately neither are true.

Writing up an article on proper Unicode handling in D is on my todo list, but my todo list of things to do for D is long enough that I don't know when I'm going to get to it.

> And if we are doing things right with RCString, then isn't it easier to make the change with that first - which is new so can't break code - and in some years when people are used to working that way update Phobos (compiler switch in beginning and have big transition a few years after that).

Well, I'm not actually convinced that what we have for RCString right now _is_ doing the right thing, but even if it is, that doesn't fix the issue that string doesn't do the right thing, and code needs to take that into account - especially if it's generic code. The better job we do at making Phobos code work with arbitrary ranges of characters, the less of an issue that is, but you're still pretty much forced to deal with it in a number of cases if you want your code to be efficient or if you want a function to be able to accept a string and return a string rather than a wrapper range. Using RCString in your code would reduce how much you had to worry about it, but it doesn't completely solve the problem. And if you're doing stuff like writing a library for other people to use, then you definitely can't just ignore the issue. So, an RCString that handles Unicode sanely will definitely help, but it's not really a fix. And plenty of code is still going to be written to use strings (especially when -betterC is involved). RCString is going to be another option, but it's not going to replace string. Even if RCString became the most common string type to use (which I question is going to ever happen), dynamic arrays of char, wchar, etc. are still going to exist in the language and are still going to have to be handled correctly.

Phobos won't be able to assume that all of the code out there is using RCString and not string. The combination of improving Phobos so that it works properly with ranges of characters in general (and not just strings or ranges of dchar) and having an alternate string type that does the right thing will definitely help and need to be done if we have any hope of actually removing auto-decoding, but even with all of that, I don't see how it would be possible to really deprecate the old behavior. We _might_ be able to do something if we're willing to deprecate std.algorithm and std.range (since std.range gives you the current definitions of the range primitives for arrays, and std.algorithm publicly imports std.range), but you still then have the problem of two different definitions of the range primitives for arrays and all of the problems that that causes (even if it's only for the deprecation period). So, strings would end up behaving drastically differently with range-based functions depending on which module you imported. I don't know that that problem is insurmountable, but it's not at all clear that there is a path to fixing auto-decoding that doesn't outright break old code. If we're willing to break old code, then we could definitely do it, but if we don't want to risk serious problems, we really need a way to have a more gradual transition, and that's the big problem that no one has a clean solution for.

> Isn't this one of the challenges created by the tension between D being both a high-level and low-level language.  The higher the aim, the more problems you will encounter getting there.  That's okay.
>
> And isn't the obstacle to breaking auto-decoding because it seems to be a monolithic challenge of overwhelming magnitude, whereas if we could figure out some steps to eat the elephant one mouthful at a time (which might mean start with RCString) then it will seem less intimidating.  It will take years anyway perhaps - but so what?

Well, I think that it's clear at this point that before we can even consider getting rid of auto-decoding, we need to make sure that Phobos in general works with arbitrary ranges of code units, code points, and graphemes. With that done, we would have a standard library that could work with strings as ranges of code units if that's what they were. So, in theory, at that point, the only issue would be how on earth to make strings work as ranges of code units without just pulling the rug out from under everyone. I'm not at all convinced that that's possible, but I am very much convinced that unless we first improve Phobos so that it's fully correct in spite of the auto-decoding issues, we definitely can't remove auto-decoding. And as a group, we haven't done a good enough job with that. Most of us agree that auto-decoding was a huge mistake, but there hasn't been enough work done towards fixing what we have, and there's plenty of work there that needs to be done whether we later try to remove auto-decoding or not.

- Jonathan M Davis



September 08, 2018
On Saturday, 8 September 2018 at 14:20:10 UTC, Laeeth Isharc wrote:
> Religions have believers but not supporters - in fact saying you are a supporter says you are not a member of that faith or community.

If you are a supporter of Jesus Christ's efforts, then you most certainly are a Christian. If you are a supporter of the Pope, then you may or may not be Catholic, but you most likely are Christian or sympathise with the faith.

Programming languages are more like power tools. You may be a big fan of Makita and dislike using other power tools like Bosch and DeWalt, or you may have different preferences based on the situation, or you may accept whatever you have at hand. Being a supporter is stretching it, though... Although I am sure that people who only have Makita in their toolbox feel that they are supporting the company.

> Social institutions need support to develop - language is a very old human institution, and programming languages have more similarity with natural languages alongst certain dimensions (I'm aware that NLP is your field) than some recognise.

Sounds like a fallacy.

> So, why shouldn't a language have supporters?  I give some money to the D Foundation - this is called providing support.

If you hope to gain some kind of return for it or consequences that you benefit from then it is more like obtaining support and influence through providing funds. I.e. paying for support...

> It's odd - if something isn't useful for me then either I just move on and find something that is, or I try to directly act myself or organise others to improve it so it is useful.  I don't stand there grumbling at the toolmakers whilst taking no positive action to make that change happen.

Pointing out that there is a problem that needs to be solved in order to reach a state where the tool is applicable in a production line is not grumbling. It is healthy. Whether that leads to positive actions (changes in policies) can only be effected through politics, not "positive action". It doesn't help to buy a new, bigger and better motor if the transmission is broken.