April 29, 2015
On Tuesday, 28 April 2015 at 21:57:31 UTC, Vladimir Panteleev wrote:
> On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:
>> But of course, we'd want to do the transition in a way that didn't result in silent behavioral changes that would break code,
>
> One proposal is to make char and dchar comparisons illegal (after all, they are comparing different things - a UTF-8 code unit with a code point - and even though in some cases this comparison makes sense, in many it doesn't). That would solve most silent breakages at the expense of more not-so-silent breakages.

It would, but it doesn't necessarily play nicely with the promotion rules, and since the character types tend to be treated as integral types, I suspect that it would be problematic in a number of cases. I also suspect that it's not something that Walter would go for given his typical attitude about conversions (though I don't know). It's definitely an interesting thought, but I doubt that it would fly.
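
To illustrate the kind of comparison in question - this compiles today and silently compares a single UTF-8 code unit against a full code point (a made-up snippet, not from anyone's real code):

import std.stdio;

void main()
{
    string s = "é";  // two UTF-8 code units: 0xC3, 0xA9
    char c = s[0];   // just the first code unit, 0xC3
    dchar d = 'é';   // the code point U+00E9

    // char implicitly converts to dchar, so this compiles, but it
    // compares a code unit with a code point.
    writeln(c == d); // false
}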

>> And if we really wanted to do that, we could create a version flag that turned off autodecoding and version the changes in Phobos appropriately to see what we got.
>
> Shameless self-promotion alert: An alternative is a GitHub fork. You can easily install and try out D forks with Digger; it's two commands:
>
> digger build master+jmdavis/phobos/noautodecode
> digger install

Well, that may very well be what needs to happen as an experiment, but if we want to actually transition away from autodecoding, we need a transition plan in master itself rather than in a fork, and a temporary version flag would be one way to do that.

After thinking about the situation some over the past few days though, I think that what we need to do to begin with is make as many functions in Phobos as possible not care whether they're dealing with ranges of char or dchar, so that they'll work regardless of what front does on strings - either by simply not using front on strings, or by making it so that the code works whether front returns char or dchar. That would reduce the number of changes that would have to be made in Phobos via versioning or deprecation or whatever we'd have to do to actually remove autodecoding. I suspect that it would mean that very little would have to be versioned or deprecated if/when we make the switch.
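
As a rough sketch of what "not caring" looks like (a hypothetical helper, not an actual Phobos function) - the predicate works against whatever element type front yields, so the function behaves the same with or without autodecoding:

import std.range;

// Returns true if any element of the range satisfies pred. Written
// against the range's actual element type, so it doesn't matter
// whether front yields char (no autodecoding) or dchar (autodecoding).
bool anyElement(alias pred, R)(R r)
    if (isInputRange!R && is(typeof(pred(r.front)) : bool))
{
    for (; !r.empty; r.popFront())
    {
        if (pred(r.front))
            return true;
    }
    return false;
}

unittest
{
    assert("hello".anyElement!(c => c == 'e')); // ASCII: same result either way
}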

The bigger problem, though, is probably third-party range-based functions that use front on strings or that check for dchar element types, rather than Phobos itself or code that merely calls Phobos functions - much of the latter would just work even if we outright switched front from autodecoding to non-autodecoding, and most of what wouldn't can be made to work by making those functions not care whether they're dealing with autodecoded strings or not.

- Jonathan M Davis
April 29, 2015
On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:
> On Tuesday, 28 April 2015 at 09:11:10 UTC, Chris wrote:
>> Would it be much work to have some example code or even an experimental module that gets rid of auto-decoding, so we could see what would be affected in general and how the actual code we have would be affected by it?
>>
>> The topic keeps coming up again and again, and while I'm in favor of anything that enhances performance, I'm afraid of having to refactor large chunks of my code. This fear may be unfounded, but I would need some examples to visualize the problem.
>
> Honestly, most code won't care. If we switched out all of the auto-decoding right now, pretty much anything using only ASCII would just work, and most anything manipulating ASCII characters in a Unicode string would just work, whereas code that specifically manipulates Unicode characters might have problems (e.g. comparing front with a dchar would no longer have the same result, since front would be the first code unit rather than necessarily the first code point).
>
> Since most Phobos range-based functions which operate on strings are already special-cased on strings, many of them would continue to just work (e.g. find returns the same range type as what's passed to it even when given a string, so it might just work with the change, or it might need to be tweaked slightly). In cases where string types mix, such functions would generally need to encode an argument to match the string type (e.g. "foo".find("fo"d) would need to transcode "fo"d to a string for the comparison), or the caller would need to use std.utf.byDchar or std.uni.byGrapheme to operate on code points or graphemes rather than code units.
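>
> For instance, roughly (a sketch using std.conv.to for the transcoding, since std.utf.encode itself works on individual code points):
>
> import std.algorithm : find;
> import std.conv : to;
>
> void main()
> {
>     auto needle = to!string("fo"d); // transcode dstring -> string up front
>     auto r = "foo".find(needle);    // now both sides are ranges of char
>     assert(r == "foo");
> }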
>
> The two biggest places in Phobos that would potentially have problems are functions that special-case strings but still use front, and functions which have to return a new range type. filter is a good example of the latter, because it's forced to return a new range type. Right now, it filters on dchars, but with the change, it would filter on the code unit type (most typically char). If you're filtering on ASCII characters, it wouldn't matter aside from the fact that the resulting range would have an element type of char rather than dchar, but if you're filtering on Unicode characters, it wouldn't work anymore. For situations like that, you'd be forced to use std.utf.byDchar or std.uni.byGrapheme. However, since most string code tends to operate on substrings rather than characters, I don't know how common it even is to use a function like filter on a string (as opposed to a range of strings). Such code might actually be fairly rare.
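>
> E.g. a sketch of the workaround (the comments describe the hypothetical post-change behavior):
>
> import std.algorithm : equal, filter;
> import std.utf : byDchar;
>
> void main()
> {
>     // Without autodecoding, filter would see the code units of "héllo",
>     // and the two-code-unit 'é' could never compare equal to one dchar.
>     // byDchar restores filtering on code points explicitly.
>     auto filtered = "héllo".byDchar.filter!(c => c != 'é');
>     assert(filtered.equal("hllo"d));
> }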
>
> So, there _are_ a few functions which would stop working the same way - potentially silently - if we just made it so that front didn't autodecode anymore. In general though, because Phobos almost always special-cases strings, calls to Phobos functions probably wouldn't need to change in most cases, and when they do, a call to byDchar would restore the old behavior. But of course, we'd want to do the transition in a way that didn't result in silent behavioral changes that would break code, even though in most cases it wouldn't matter, because most code operates on ASCII characters and substrings even when the strings themselves contain Unicode - e.g. unicodeString.find(asciiString) is far more common than unicodeString.find(otherUnicodeString).
>
> I suspect that the code at the greatest risk is code that checks is(Unqual!(ElementType!Range) == dchar) to operate on strings and wrapper ranges around strings, since after the change it would only match cases where byDchar had been used. In general though, the code that's going to run into the most trouble is user code that contains range-based functions similar to what you might find in Phobos, rather than code that simply uses Phobos functions like startsWith and find - i.e. if you're writing range-based code that special-cases strings or that specifically needs to operate on code points, then you're going to have to make changes, whereas to a great extent, if all you're doing is passing strings to Phobos functions, your code will tend to just work.
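>
> That is, constraints of this shape (a made-up signature):
>
> import std.range;
> import std.traits : Unqual;
>
> // Today this matches string itself, because ElementType!string is dchar.
> // Without autodecoding, ElementType!string would be immutable(char),
> // so only explicitly wrapped ranges like str.byDchar would match.
> void process(Range)(Range r)
>     if (isInputRange!Range && is(Unqual!(ElementType!Range) == dchar))
> {
>     // ...
> }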
>
> To actually see what the impact would be, I think we'd have to just change Phobos and then see what happened to user code. It could be surprising how much or how little it affects things, though in most cases, I expect that code will just work. And if we really wanted to do that, we could create a version flag that turned off autodecoding and version the changes in Phobos appropriately to see what we got. In many cases, if we simply made sure that Phobos functions which special-case strings don't use front directly - and instead don't care whether they're operating on ranges of char, wchar, or dchar - then we wouldn't even need to version anything (e.g. find could easily be made to work that way if it doesn't already), but some functions (like filter) would need to be versioned differently.
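>
> The versioning itself could be as simple as a hypothetical NoAutodecode identifier, toggled with -version=NoAutodecode (the real front for narrow strings lives in std.range.primitives; this is just the shape of the idea):
>
> version (NoAutodecode)
> {
>     @property char front(const(char)[] s)
>     {
>         assert(s.length > 0);
>         return s[0]; // the first code unit, no decoding
>     }
> }
> else
> {
>     @property dchar front(const(char)[] s)
>     {
>         import std.utf : decode;
>         size_t i = 0;
>         return decode(s, i); // decode the first code point
>     }
> }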
>
> So, maybe what we need to do to start is to just go through Phobos and make as many functions as possible not care about whether they're dealing with strings as ranges of char, wchar, or dchar. And at least then, we'd minimize how much code would have to be versioned differently if we were to test out getting rid of autodecoding with versioning.
>
> - Jonathan M Davis

This sounds like a good starting point for a transition plan. One important thing, though, would be to do some benchmarking with and without autodecoding, to see if it really boosts performance in a way that would justify the transition.
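
Even something crude would be informative - e.g. timing the autodecoding path against raw code units via std.string.representation (a rough sketch; the test string and iteration count are made up):

import std.algorithm : count;
import std.datetime : benchmark;
import std.stdio : writefln;
import std.string : representation;

void main()
{
    auto s = "Hälfte des Lebens, von Friedrich Hölderlin";
    auto results = benchmark!(
        () => s.count('e'),                // autodecodes every code point
        () => s.representation.count('e')  // compares raw code units
    )(100_000);
    writefln("autodecoded: %s ms, code units: %s ms",
             results[0].msecs, results[1].msecs);
}

Counting an ASCII character gives the same answer either way, so the difference is purely the decoding overhead.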
April 29, 2015
On Wednesday, 29 April 2015 at 10:02:09 UTC, Chris wrote:
> This sounds like a good starting point for a transition plan. One important thing, though, would be to do some benchmarking with and without autodecoding, to see if it really boosts performance in a way that would justify the transition.

Well, personally, I think that it's worth it even if the performance were identical - and it's a guarantee that it's going to be better without autodecoding (it's just a question of how much better), since there's simply less work to do. Operating at the code point level like we do now is the worst of all worlds in terms of flexibility and correctness. As long as the Unicode is normalized, operating at the code unit level is the most efficient, and decoding is often unnecessary for correctness. And if you do need to decode, then you really need to go up to the grapheme level in order to operate on the full character, meaning that operating on code points has the same correctness problems as operating on code units. So, it's less performant without actually being correct. It just gives the illusion of correctness.
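
The three levels are easy to see with a combining character (a small, self-contained example):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    auto s = "e\u0301"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT

    assert(s.length == 3);                // 3 UTF-8 code units
    assert(s.walkLength == 2);            // 2 code points - what autodecoding yields
    assert(s.byGrapheme.walkLength == 1); // 1 grapheme - the é a user actually sees
}

Code points stop halfway: they decode, but they still split what the user perceives as a single character.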

By treating strings as ranges of code units, you don't take a performance hit when you don't need to, and it forces you to actually consider something like byDchar or byGrapheme if you want to operate on full Unicode characters. It's similar to how operating on UTF-16 code units as if they were characters (as Java and C# generally do) frequently gives the incorrect impression that you're handling Unicode correctly, because you have to work harder to come up with characters that don't fit in a single code unit, whereas with UTF-8, anything but ASCII is screwed if you treat code units as code points. Treating code points as if they were full characters, like we're doing now in Phobos with ranges, just makes it that much harder to notice that you're not handling Unicode correctly.
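
For example, with a character outside the Basic Multilingual Plane:

void main()
{
    wstring w = "\U0001F600"w; // one code point, two UTF-16 code units
    string  s = "\U0001F600";  // the same code point, four UTF-8 code units

    assert(w.length == 2); // a surrogate pair - easy to mistake for two characters
    assert(s.length == 4); // in UTF-8, any non-ASCII text hits the multi-unit case
}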

Also, treating strings as ranges of code units makes them no longer special: they get treated like every other type of array. That eliminates a lot of the special-casing that we're forced to do right now, and it eliminates the confusion that folks keep running into when string doesn't work with many functions - because it's not treated as a random-access range, or doesn't have length, or because the resulting range isn't the same type (copy is a prime example of a function that doesn't work with char[] when it should). By leaving in autodecoding, we're basically leaving technical debt in D permanently. We'll forever have to be explaining it to folks and forever have to be working around it in order to achieve either performance or correctness.
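
Concretely, the sort of thing that surprises people today (these static asserts hold with the current autodecoding behavior):

import std.range : hasLength, isRandomAccessRange;

void main()
{
    // As a range, a string today is not the array it actually is:
    static assert(!isRandomAccessRange!string);  // no random access
    static assert(!hasLength!string);            // no length as a range
    static assert( isRandomAccessRange!(int[])); // every other array has both
    static assert( hasLength!(int[]));
}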

What we have now isn't performant, correct, or flexible, and we'll be forever paying for that if we don't get rid of autodecoding.

I don't criticize Andrei in the least for coming up with it. If you don't take graphemes into account (and he didn't know about them at the time), it seems like a great idea that allows us to be correct by default and performant if we put some effort into it. But after having seen how it's worked out - how much code has to be special-cased, how much confusion there is over it, and how it's not actually correct anyway - I think that it's quite clear that autodecoding was a mistake. At this point, it's mainly a question of how we can get rid of it without being too disruptive, and whether we can convince Andrei that the change makes sense, since he seems to still think that autodecoding is fine in spite of the fact that it's neither performant nor correct.

It may be that the decision will be that it's too disruptive to remove autodecoding, but I think that that's really a question of whether we can find a way to do it that doesn't break tons of code rather than whether it's worth the performance or correctness gain.

- Jonathan M Davis
April 29, 2015
On Wednesday, 29 April 2015 at 15:13:15 UTC, Jonathan M Davis wrote:
> On Wednesday, 29 April 2015 at 10:02:09 UTC, Chris wrote:
>> This sounds like a good starting point for a transition plan. One important thing, though, would be to do some benchmarking with and without autodecoding, to see if it really boosts performance in a way that would justify the transition.
>
> [...]

Ok, I see. Well, if we don't want to repeat C++'s mistakes, we should fix it before it's too late. Since I deal a lot with (non-ASCII) strings and depend on Unicode (and correctness!), I would be more than happy to test any changes to Phobos with my programs to see if they screw anything up.
April 30, 2015
On Friday, 24 April 2015 at 20:44:34 UTC, Walter Bright wrote:
> Time has shown, however, that UTF-8 has pretty much won. wchar only exists for the Windows API and Java

Also NSString. It used to support UTF-16 and the C string encoding; AFAIK, the latter later evolved into UTF-8.