Range of chars (narrow string ranges) (page 2) - D Programming Language Discussion Forum

April 25, 2015

Re: Range of chars (narrow string ranges)

Posted by Jonathan M Davis
in reply to Steven Schveighoffer

On Saturday, 25 April 2015 at 02:04:02 UTC, Steven Schveighoffer wrote:
> On 4/24/15 9:02 PM, Walter Bright wrote:
>> On 4/24/2015 4:56 PM, Steven Schveighoffer wrote:
>>> This is pretty easy. We just have to create a string type that is
>>> backed by, but
>>> isn't simply an alias to, an array of char.
>>
>> Just shoot me now!
>>
>
> Yeah, that's the reaction I figured I'd get ;) But it doesn't hurt to keep trying since we keep coming back to this over, and over, and over, and over...

Honestly, even if that were the ideal way to go (and I don't think that it is), I'd expect that to be even more disruptive than trying to rearrange the modules so that front and friends don't autodecode for strings.

I suppose that a related alternative would be to change it so that strings aren't considered ranges anymore (at least temporarily), and force folks to use stuff like byChar or byDchar (or whatever those functions are) whenever they use strings as ranges. And actually, that _would_ allow us to get rid of the autodecoding without rearranging modules. Later, we could change them to being ranges of their actual element types, or we could just force folks to be explicit forever in an effort to make the Unicode issues clear, if we thought that that were better (though it would probably be better to just change front and friends later to work with strings again but not autodecode). And if an algorithm would work with either autodecoding or without it, then maybe it could be special-cased to accept strings as ranges, only forcing it in the cases where the behavior of the algorithm would change based on whether autodecoding were used or not.
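For reference, a minimal sketch of what being explicit looks like with current Phobos. It assumes the wrappers meant here are std.utf.byCodeUnit, std.utf.byChar and std.utf.byDchar; the sample string is only an illustration:

```d
import std.range : ElementType, walkLength;
import std.utf : byChar, byCodeUnit, byDchar;

void main()
{
    // "héllo": the 'é' (U+00E9) occupies two UTF-8 code units.
    string s = "h\u00E9llo";

    // With autodecoding, a string used as a range has element type dchar.
    static assert(is(ElementType!string == dchar));
    assert(s.length == 6);     // array length counts code units
    assert(s.walkLength == 5); // range iteration counts code points

    // The explicit wrappers state which view is wanted.
    assert(s.byCodeUnit.walkLength == 6); // code units, no decoding
    assert(s.byChar.walkLength == 6);     // UTF-8 code units
    assert(s.byDchar.walkLength == 5);    // decoded code points
}
```

If strings stopped being ranges, every call site would have to pick one of these views instead of getting the decoded one implicitly.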

Hmmm. I'm not sure what all of the repercussions of such an approach would be, but the more I think about it, the more tempting it seems to me.

- Jonathan M Davis

April 25, 2015

Re: Range of chars (narrow string ranges)

Posted by ketmar
in reply to Walter Bright

On Fri, 24 Apr 2015 13:44:43 -0700, Walter Bright wrote:

> I'm afraid we are stuck with autodecoding, as taking it out may be far too disruptive.

the more time passes, the harder autodecode is to kill. kill it while it's not too late. make the next DMD release 2.100 and KILL AUTODECODE for good.

April 27, 2015

Re: Range of chars (narrow string ranges)

Posted by H. S. Teoh
in reply to Jonathan M Davis

On Sat, Apr 25, 2015 at 02:27:45AM +0000, Jonathan M Davis via Digitalmars-d wrote: [...]
> I suppose that a related alternative would be to change it so that strings aren't considered ranges anymore (at least temporarily), and force folks to use stuff like byChar or byDchar (or whatever those functions are) whenever they use strings as ranges. And actually, that _would_ allow us to get rid of the autodecoding without rearranging modules. Later, we could change them to being ranges of their actual element types, or we could just force folks to be explicit forever in an effort to make the Unicode issues clear, if we thought that that were better (though it would probably be better to just change front and friends later to work with strings again but not autodecode). And if an algorithm would work with either autodecoding or without it, then maybe it could be special-cased to accept strings as ranges, only forcing it in the cases where the behavior of the algorithm would change based on whether autodecoding were used or not.
> 
> Hmmm. I'm not sure what all of the repercussions of such an approach would be, but the more I think about it, the more tempting it seems to me.
[...]

I would vote for this approach, if we ever decide to get rid of autodecoding. I'm OK with either option -- get rid of autodecoding, or keep it and use it consistently. What I am *not* OK with is the present, and growing, schizophrenic mixture of autodecoding and non-autodecoding string functions in Phobos. This inconsistency is going to come back to bite us later.


T

-- 
One reason that few people are aware there are programs running the internet is that they never crash in any significant way: the free software underlying the internet is reliable to the point of invisibility. -- Glyn Moody, from the article "Giving it all away"

April 27, 2015

Re: Range of chars (narrow string ranges)

Posted by Jonathan M Davis
in reply to H. S. Teoh

On Monday, 27 April 2015 at 17:01:03 UTC, H. S. Teoh wrote:
> On Sat, Apr 25, 2015 at 02:27:45AM +0000, Jonathan M Davis via Digitalmars-d wrote:
> [...]
>> I suppose that a related alternative would be to change it so that
>> strings aren't considered ranges anymore (at least temporarily), and
>> force folks to use stuff like byChar or byDchar (or whatever those
>> functions are) whenever they use strings as ranges. And actually, that
>> _would_ allow us to get rid of the autodecoding without rearranging
>> modules. Later, we could change them to being ranges of their actual
>> element types, or we could just force folks to be explicit forever in
>> an effort to make the Unicode issues clear, if we thought that that
>> were better (though it would probably be better to just change front and
>> friends later to work with strings again but not autodecode). And if
>> an algorithm would work with either autodecoding or without it, then
>> maybe it could be special-cased to accept strings as ranges, only
>> forcing it in the cases where the behavior of the algorithm would
>> change based on whether autodecoding were used or not.
>> 
>> Hmmm. I'm not sure what all of the repercussions of such an approach
>> would be, but the more I think about it, the more tempting it seems to
>> me.
> [...]
>
> I would vote for this approach, if we ever decide to get rid of
> autodecoding. I'm OK with either option -- get rid of autodecoding, or
> keep it and use it consistently. What I am *not* OK with is the present,
> and growing, schizophrenic mixture of autodecoding and non-autodecoding
> string functions in Phobos. This inconsistency is going to come back to
> bite us later.

I expect that the two biggest problems causing the current situation are

1. Andrei and Walter don't seem to agree on the issue (Andrei seems to think that it's not a big deal to leave in the autodecoding).

2. While most of the core devs want to get rid of the autodecoding, it's a big enough change that we're afraid to do it and/or aren't sure of how we could do it without being too disruptive.

So, Walter has been pushing the schizophrenic approach in an effort to work around the problem. If the core devs could agree on an approach to removing autodecoding that wasn't too disruptive and somehow get Andrei to go along with it, then we could do that and fix the problem, but otherwise, Walter is just going to push for the schizophrenic approach, because it at least partially fixes the autodecoding problem, and enough of the core devs want to ditch the autodecoding that at least some of those changes are likely to make it in.

Honestly, I think that we need to figure out what the best options are for killing autodecoding and then figure out how to convince Andrei of it, but I haven't a clue how to convince Andrei unless maybe a solution which isn't very disruptive can be found, but it seems like every time the issue comes up, he gets annoyed that we're spending time on something unimportant. I do think that this limbo needs to stop though, and I think that it's clear that while autodecoding seemed like a good idea at first (especially if code points really were full characters instead of having to worry about graphemes), ultimately, autodecoding is a mistake.

- Jonathan M Davis

April 28, 2015

Re: Range of chars (narrow string ranges)

Posted by Chris
in reply to Jonathan M Davis

On Monday, 27 April 2015 at 17:49:04 UTC, Jonathan M Davis wrote:
> On Monday, 27 April 2015 at 17:01:03 UTC, H. S. Teoh wrote:
>> On Sat, Apr 25, 2015 at 02:27:45AM +0000, Jonathan M Davis via Digitalmars-d wrote:
>> [...]
>>> I suppose that a related alternative would be to change it so that
>>> strings aren't considered ranges anymore (at least temporarily), and
>>> force folks to use stuff like byChar or byDchar (or whatever those
>>> functions are) whenever they use strings as ranges. And actually, that
>>> _would_ allow us to get rid of the autodecoding without rearranging
>>> modules. Later, we could change them to being ranges of their actual
>>> element types, or we could just force folks to be explicit forever in
>>> an effort to make the Unicode issues clear, if we thought that that
>>> were better (though it would probably be better to just change front and
>>> friends later to work with strings again but not autodecode). And if
>>> an algorithm would work with either autodecoding or without it, then
>>> maybe it could be special-cased to accept strings as ranges, only
>>> forcing it in the cases where the behavior of the algorithm would
>>> change based on whether autodecoding were used or not.
>>> 
>>> Hmmm. I'm not sure what all of the repercussions of such an approach
>>> would be, but the more I think about it, the more tempting it seems to
>>> me.
>> [...]
>>
>> I would vote for this approach, if we ever decide to get rid of
>> autodecoding. I'm OK with either option -- get rid of autodecoding, or
>> keep it and use it consistently. What I am *not* OK with is the present,
>> and growing, schizophrenic mixture of autodecoding and non-autodecoding
>> string functions in Phobos. This inconsistency is going to come back to
>> bite us later.
>
> I expect that the two biggest problems causing the current situation are
>
> 1. Andrei and Walter don't seem to agree on the issue (Andrei seems to think that it's not a big deal to leave in the autodecoding).
>
> 2. While most of the core devs want to get rid of the autodecoding, it's a big enough change that we're afraid to do it and/or aren't sure of how we could do it without being too disruptive.
>
> So, Walter has been pushing the schizophrenic approach in an effort to work around the problem. If the core devs could agree on an approach to removing autodecoding that wasn't too disruptive and somehow get Andrei to go along with it, then we could do that and fix the problem, but otherwise, Walter is just going to push for the schizophrenic approach, because it at least partially fixes the autodecoding problem, and enough of the core devs want to ditch the autodecoding that at least some of those changes are likely to make it in.
>
> Honestly, I think that we need to figure out what the best options are for killing autodecoding and then figure out how to convince Andrei of it, but I haven't a clue how to convince Andrei unless maybe a solution which isn't very disruptive can be found, but it seems like every time the issue comes up, he gets annoyed that we're spending time on something unimportant. I do think that this limbo needs to stop though, and I think that it's clear that while autodecoding seemed like a good idea at first (especially if code points really were full characters instead of having to worry about graphemes), ultimately, autodecoding is a mistake.
>
> - Jonathan M Davis

Would it be much work to have example code or even an experimental module that gets rid of auto-decoding, so we could see what would be affected in general and how actual code we have would be affected by it?

The topic keeps coming up again and again, and while I'm in favor of anything that enhances performance, I'm afraid of having to refactor large chunks of my code. This fear may be unfounded, but I would need some examples to visualize the problem.

April 28, 2015

Re: Range of chars (narrow string ranges)

Posted by Jonathan M Davis
in reply to Chris

On Tuesday, 28 April 2015 at 09:11:10 UTC, Chris wrote:
> Would it be much work to have example code or even an experimental module that gets rid of auto-decoding, so we could see what would be affected in general and how actual code we have would be affected by it?
>
> The topic keeps coming up again and again, and while I'm in favor of anything that enhances performance, I'm afraid of having to refactor large chunks of my code. This fear may be unfounded, but I would need some examples to visualize the problem.

Honestly, most code won't care. If we just switched out all of the auto-decoding right now, pretty much anything using only ASCII would just work, and most anything that's trying to manipulate ASCII characters in a Unicode string would just work, whereas code that's specifically manipulating Unicode characters might have problems (e.g. comparing front with a dchar will no longer have the same result, since front would just be the first code unit rather than necessarily the first code point). Since most Phobos range-based functions which operate on strings are special-cased on strings already, many of them would continue to just work (e.g. find returns the same range type as what's passed to it even if it's given a string, so it might just work with the change, or it might need to be tweaked slightly). Those that wouldn't would then generally either need to call encode on an argument to make it match the string type in the cases where string types mix (e.g. "foo".find("fo"d) would need to call encode on "fo"d to make it a string for comparison), or the caller would need to use std.utf.byDchar or std.uni.byGrapheme to operate on code points or graphemes rather than code units.
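A small sketch of the front comparison and the find cases described above, written against current (autodecoding) Phobos. Here std.utf.byCodeUnit stands in for what front would see without autodecoding; that substitution and the sample strings are assumptions made for illustration:

```d
import std.algorithm.searching : find;
import std.range.primitives : front;
import std.utf : byCodeUnit;

void main()
{
    string s = "\u00E9t\u00E9"; // "été"
    dchar eAcute = '\u00E9';

    // Today front decodes, so comparing it with a dchar works as expected.
    assert(s.front == eAcute);

    // Viewed as code units, the first element is the UTF-8 lead byte 0xC3,
    // so the same comparison compiles but no longer matches.
    assert(s.byCodeUnit.front == 0xC3);
    assert(s.byCodeUnit.front != eAcute);

    // Searching for an ASCII needle in a Unicode string gives the same
    // answer in either view.
    assert("caf\u00E9 au lait".find("au") == "au lait");

    // Mixed string types: today the char haystack is decoded and compared
    // against the dchar needle element by element.
    assert("foo".find("fo"d) == "foo");
}
```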

The two biggest places in Phobos that would potentially have problems are functions that special-cased strings but still used front and those which have to return a new range type. filter is a good example, because it's forced to return a new range type. Right now, it would filter on dchars, but with the change, it would filter on the code unit type (most typically char). If you're filtering on ASCII characters, it wouldn't matter aside from the fact that the resulting range would have an element type of char rather than dchar, but if you're filtering on Unicode characters, it wouldn't work anymore. For situations like that, you'd be forced to use std.utf.byDchar or std.uni.byGrapheme. However, since most string code tends to operate on substrings rather than characters, I don't know how common it even is to use a function like filter on a string (as opposed to a range of strings). Such code might actually be fairly rare.
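The filter case can be sketched like this. Again std.utf.byCodeUnit is used as a stand-in for a non-autodecoding front, which is an assumption about what the changed behavior would look like; the string and predicates are illustrative:

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : filter;
import std.range : ElementType;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string s = "na\u00EFve caf\u00E9"; // "naïve café"

    // ASCII predicate: the same characters survive in both views; only the
    // element type of the result differs (dchar vs. char).
    auto decoded = s.filter!(c => c != ' ');
    auto units = s.byCodeUnit.filter!(c => c != ' ');
    static assert(is(ElementType!(typeof(decoded)) == dchar));
    static assert(is(Unqual!(ElementType!(typeof(units))) == char));
    assert(decoded.equal("na\u00EFvecaf\u00E9"));
    assert(units.equal("na\u00EFvecaf\u00E9".byCodeUnit));

    // Non-ASCII predicate: on code points the 'é' is removed...
    assert(s.filter!(c => c != '\u00E9').equal("na\u00EFve caf"));

    // ...but on code units nothing is removed, because no single UTF-8
    // code unit in this string equals 0xE9. The change is silent.
    assert(s.byCodeUnit.filter!(c => c != '\u00E9').equal(s.byCodeUnit));

    // Asking for code points explicitly restores the old result.
    assert(s.byDchar.filter!(c => c != '\u00E9').equal("na\u00EFve caf"));
}
```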

So, there _are_ a few functions which stop working the same way in a potentially silent manner if we just made it so that front didn't autodecode anymore. However, in general, because Phobos almost always special-cases strings, calls to Phobos functions probably wouldn't need to change in most cases, and when they do, a call to byDchar would restore the old behavior. But of course, we'd want to do the transition in a way that didn't result in silent behavioral changes that would break code, even though in most cases, it wouldn't matter, because most code will be operating on ASCII strings even if the strings themselves contain Unicode - e.g. unicodeString.find(asciiString) is far more common than unicodeString.find(otherUnicodeString).

I suspect that the code that's at the greatest risk is code that checks for is(Unqual!(ElementType!Range) == dchar) to operate on strings and wrapper ranges around strings, since it would then only match the cases where byDchar had been used. In general though, the code that's going to run into the most trouble is user code that contains range-based functions similar to what you might find in Phobos rather than code that's simply using the Phobos functions like startsWith and find - i.e. if you're writing range-based code that worries about doing stuff like special-casing strings or which specifically needs to operate on code points, then you're going to have to make changes, whereas to a great extent, if all you're doing is passing strings to Phobos functions, your code will tend to just work.
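A sketch of the kind of user code at risk. The trait isCodePointRange and the function countCodePoints are hypothetical names invented for this example; only the is(Unqual!(ElementType!R) == dchar) check comes from the paragraph above:

```d
import std.range : ElementType, isInputRange, walkLength;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

// Hypothetical user-side trait: "this range yields decoded code points".
enum isCodePointRange(R) =
    isInputRange!R && is(Unqual!(ElementType!R) == dchar);

// Hypothetical user function constrained on that trait.
size_t countCodePoints(R)(R r)
if (isCodePointRange!R)
{
    return r.walkLength;
}

void main()
{
    // With autodecoding, a plain string passes the check.
    static assert(isCodePointRange!string);
    assert(countCodePoints("caf\u00E9") == 4);

    // A code-unit view does not pass, so the call is rejected at compile
    // time. Without autodecoding, plain strings would behave like this.
    static assert(!isCodePointRange!(typeof("".byCodeUnit())));
    static assert(!__traits(compiles, countCodePoints("caf\u00E9".byCodeUnit)));

    // Only an explicit byDchar would still match.
    static assert(isCodePointRange!(typeof("".byDchar())));
    assert(countCodePoints("caf\u00E9".byDchar) == 4);
}
```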

To actually see what the impact would be, we'd have to just change Phobos, I think, and then see what the impact was on user code. It could be surprising how much or how little it affects things, though in most cases, I expect that it'll mean that code will just work. And if we really wanted to do that, we could create a version flag that turned off autodecoding and version the changes in Phobos appropriately to see what we got. In many cases, if we simply made sure that Phobos functions which special-cased strings didn't use front directly but instead didn't care whether they were operating on ranges of char, wchar, or dchar, then we wouldn't even need to version anything (e.g. find could easily be made to work that way if it doesn't already), but some functions (like filter) would need to be versioned differently.
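As a rough user-side analogue of such a version flag (not actual Phobos code): the identifier NoAutodecode and the function firstElement are made up for this sketch, and the real change would live in Phobos' front for narrow strings rather than in a helper like this.

```d
import std.utf : decode;

version (NoAutodecode) // hypothetical identifier, enabled with -version=NoAutodecode
{
    // Without autodecoding: the first element is the first code unit.
    char firstElement(string s)
    {
        return s[0];
    }
}
else
{
    // With autodecoding: the first element is the first decoded code point.
    dchar firstElement(string s)
    {
        size_t index = 0;
        return decode(s, index);
    }
}

void main()
{
    string s = "\u00E9t\u00E9"; // "été"

    version (NoAutodecode)
        assert(firstElement(s) == 0xC3); // UTF-8 lead byte of 'é'
    else
        assert(firstElement(s) == 0xE9); // the code point U+00E9
}
```

Building the same file with and without `-version=NoAutodecode` shows both behaviors from one source, which is the kind of side-by-side comparison the paragraph above is after.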

So, maybe what we need to do to start is to just go through Phobos and make as many functions as possible not care about whether they're dealing with strings as ranges of char, wchar, or dchar. And at least then, we'd minimize how much code would have to be versioned differently if we were to test out getting rid of autodecoding with versioning.

- Jonathan M Davis

April 28, 2015

Re: Range of chars (narrow string ranges)

Posted by Vladimir Panteleev
in reply to Jonathan M Davis

On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:
> But of course, we'd want to do the transition in a way that didn't result in silent behavioral changes that would break code,

One proposal is to make char and dchar comparisons illegal (after all, they are comparing different things - a UTF-8 code unit with a code point, and even though in some cases this comparison makes sense, in many it doesn't). That would solve most silent breakages at the expense of more not-so-silent breakages.
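To illustrate why such comparisons are risky, here is a small sketch; countEqual is a hypothetical helper written for this example. Comparing individual UTF-8 code units with a code point compiles today and is only meaningful for ASCII:

```d
// Hypothetical helper: counts elements equal to `needle`, iterating the
// string either as code units (C = char) or as code points (C = dchar).
size_t countEqual(C)(string s, dchar needle)
{
    size_t n;
    foreach (C c; s) // foreach decodes when the loop variable is a dchar
    {
        if (c == needle)
            ++n;
    }
    return n;
}

void main()
{
    // ASCII: both views agree.
    assert(countEqual!char("abca", 'a') == 2);
    assert(countEqual!dchar("abca", 'a') == 2);

    // Non-ASCII: the code-unit comparison silently misses both 'é's.
    assert(countEqual!char("\u00E9t\u00E9", '\u00E9') == 0);
    assert(countEqual!dchar("\u00E9t\u00E9", '\u00E9') == 2);

    // It can also match by accident: U+9000 is encoded as E9 80 80, so its
    // lead byte compares equal to the code point U+00E9.
    assert(countEqual!char("\u9000", '\u00E9') == 1);
    assert(countEqual!dchar("\u9000", '\u00E9') == 0);
}
```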

> And if we really wanted to do that, we could create a version flag that turned off autodecoding and version the changes in Phobos appropriately to see what we got.

Shameless self-promotion alert: An alternative is a GitHub fork. You can easily install and try out D forks with Digger, it's two commands:

digger build master+jmdavis/phobos/noautodecode
digger install

April 28, 2015

Re: Range of chars (narrow string ranges)

Posted by H. S. Teoh
in reply to Vladimir Panteleev

On Tue, Apr 28, 2015 at 09:57:29PM +0000, Vladimir Panteleev via Digitalmars-d wrote:
> On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:
> >But of course, we'd want to do the transition in a way that didn't result in silent behavioral changes that would break code,
> 
> One proposal is to make char and dchar comparisons illegal (after all, they are comparing different things - a UTF-8 code unit with a code point, and even though in some cases this comparison makes sense, in many it doesn't).  That would solve most silent breakages at the expense of more not-so-silent breakages.
> 
> >And if we really wanted to do that, we could create a version flag that turned off autodecoding and version the changes in Phobos appropriately to see what we got.
> 
> Shameless self-promotion alert: An alternative is a GitHub fork. You can easily install and try out D forks with Digger, it's two commands:
> 
> digger build master+jmdavis/phobos/noautodecode
> digger install

Oooh, Jonathan has the code ready? Haha, maybe I'll start using that instead of git master! ;-)


T

-- 
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash, / The day and hour soon are coming / When all the IT folks say "Gosh!" / It isn't from a clever lawsuit / That Windowsland will finally fall, / But thousands writing open source code / Like mice who nibble through a wall. -- The Linux-nationale by Greg Baker

April 28, 2015

Re: Range of chars (narrow string ranges)

Posted by Damian
in reply to H. S. Teoh

On Tuesday, 28 April 2015 at 23:15:40 UTC, H. S. Teoh wrote:
> On Tue, Apr 28, 2015 at 09:57:29PM +0000, Vladimir Panteleev via Digitalmars-d wrote:
>> On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:
>> >But of course, we'd want to do the transition in a way that didn't
>> >result in silent behavioral changes that would break code,
>> 
>> One proposal is to make char and dchar comparisons illegal (after all,
>> they are comparing different things - a UTF-8 code unit with a code
>> point, and even though in some cases this comparison makes sense, in
>> many it doesn't).  That would solve most silent breakages at the
>> expense of more not-so-silent breakages.
>> 
>> >And if we really wanted to do that, we could create a version flag
>> >that turned off autodecoding and version the changes in Phobos
>> >appropriately to see what we got.
>> 
>> Shameless self-promotion alert: An alternative is a GitHub fork. You
>> can easily install and try out D forks with Digger, it's two commands:
>> 
>> digger build master+jmdavis/phobos/noautodecode
>> digger install
>
> Oooh, Jonathan has the code ready? Haha, maybe I'll start using that
> instead of git master! ;-)
>
>
> T

I second that! If we all make the switch, perhaps Walter will too? :D

April 28, 2015

Re: Range of chars (narrow string ranges)

Posted by Jonathan M Davis
in reply to Damian

On Tuesday, 28 April 2015 at 23:26:14 UTC, Damian wrote:
> I second that! If we all make the switch, perhaps Walter will too? :D

Walter isn't necessarily the one we have to convince in this case. He'll be very concerned about avoiding breaking existing code, so we'd need a solid transition plan, but he very much wants to get rid of autodecoding, so he'll welcome it if we can do it cleanly. The bigger problem is convincing Andrei, since he seems to think that even discussing the issue is a waste of time and takes away from more important topics. And I don't dispute that there are other important topics, and coming back to this one over and over again is arguably a problem, but if we can just figure out how to make the transition and get it over with, then it wouldn't need to keep getting discussed like this.

- Jonathan M Davis

Copyright © 1999-2021 by the D Language Foundation