October 23, 2012 Re: Narrow string is not a random access range | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jonathan M Davis | On Tuesday, 23 October 2012 at 23:07:28 UTC, Jonathan M Davis wrote:
> I think that Andrei was arguing for changing how the compiler itself handles arrays of char and wchar so that they wouldn't
As I said last time this came up, we could actually do this today without changing the compiler. Since string is a user defined type anyway, we could just define it differently.
http://arsdnet.net/dcode/test99.d
I'm pretty sure that changes to Phobos are even required. (The reason I called it "String" there instead of "string" is simply so it doesn't conflict with the string in object.d)
|
October 24, 2012 Re: Narrow string is not a random access range | ||||
---|---|---|---|---|
| ||||
Posted in reply to Timon Gehr | On Wednesday, October 24, 2012 01:33:28 Timon Gehr wrote:
> On 10/24/2012 01:07 AM, Jonathan M Davis wrote:
> > On Wednesday, October 24, 2012 00:28:28 Timon Gehr wrote:
> >> The other valid opinion is that the 'mistake' is in Phobos because it treats narrow character arrays specially.
> >
> > If it didn't, then range-based functions would be useless for strings in most cases, because it rarely makes sense to operate on code units.
> >
> >> Note that string is just a name for immutable(char)[]. It would have to become a struct if random access was to be deprecated.
> >
> > I think that Andrei was arguing for changing how the compiler itself handles arrays of char and wchar so that they wouldn't have direct random access or length anymore, forcing you to do something like str.rep[6] for random access regardless of what happens with range-based functions.
> >
> > - Jonathan M Davis
>
> That idea does not even deserve discussion.
Actually, it solves the problem quite well, because you then have to work at misusing strings (of any constness or char type), but it's still extremely easy to operate on code units if you want to. However, Walter seems to think that everyone should understand unicode and code for it, in which case it would be normal for the programmer to understand all of the quirks of code units vs code points and code accordingly, but I think that it's pretty clear that the average programmer doesn't have a clue about unicode, so if the normal string operations do anything which isn't unicode aware (e.g. length), then lots of programmers are going to screw it up. But since such a change would break tons of code, I think that there's pretty much no way that it's going to happen at this point even if it were generally agreed that it was the way to go.
The alternative, of course, is to create a string type which wraps arrays of the various character types, but no one has been able to come up with a design for it which was generally accepted. It also risks not working very well with string literals and the like, since a string literal would no longer be a string (similar to the nonsense that you have to put up with in C++ with regards to std::string vs string literals). But even if someone can come up with a solid solution, the amount of code which it would break could easily disqualify it anyway.
- Jonathan M Davis
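The remark above about length not being unicode aware can be illustrated with a minimal sketch. It assumes a UTF-8 source file and current Phobos; the helper names codeUnitCount and codePointCount are hypothetical and exist only for this example.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.string : representation;

/// Hypothetical helper: number of UTF-8 code units, which is just the array length.
size_t codeUnitCount(string s)
{
    return s.length;
}

/// Hypothetical helper: number of code points. walkLength iterates the
/// string as a range of dchar because string does not satisfy hasLength.
size_t codePointCount(string s)
{
    return s.walkLength;
}

void main()
{
    string beer = "Пиво"; // four Cyrillic letters, two code units each in UTF-8

    writeln(codeUnitCount(beer));  // 8
    writeln(codePointCount(beer)); // 4

    // Plain indexing yields one code unit, here the first byte of 'П'.
    assert(beer[0] == 0xD0);

    // representation exposes the raw code units explicitly, which is close in
    // spirit to the str.rep[6] idea mentioned in the thread.
    immutable(ubyte)[] raw = beer.representation;
    assert(raw.length == 8 && raw[1] == 0x9F);
}
```

Code that wants code units can still get them cheaply, but the default array operations shown here are the ones that are easy to misuse.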
|
October 24, 2012 Re: Narrow string is not a random access range | ||||
---|---|---|---|---|
| ||||
Posted in reply to Adam D. Ruppe | On 2012-10-24 01:10, Adam D. Ruppe <destructionator@gmail.com> wrote:
> On Tuesday, 23 October 2012 at 23:07:28 UTC, Jonathan M Davis wrote:
>> I think that Andrei was arguing for changing how the compiler itself handles arrays of char and wchar so that they wouldn't
>
> As I said last time this came up, we could actually do this today without changing the compiler. Since string is a user defined type anyway, we could just define it differently.
>
> http://arsdnet.net/dcode/test99.d
>
> I'm pretty sure that changes to Phobos are even required. (The reason I called it "String" there instead of "string" is simply so it doesn't conflict with the string in object.d)
As long as typeof("") != String, this is not going to work:
auto s = "";
--
Simen
|
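The objection about typeof("") can be made concrete with a small sketch. The String struct below is a hypothetical stand-in written for this example; it is not the code behind the linked test99.d, whose contents are not shown in the thread.

```d
import std.range.primitives : empty, front, popFront, isInputRange;

/// Hypothetical wrapper type, assumed for illustration only.
struct String
{
    immutable(char)[] rep; // raw code units, reachable only by naming them

    this(immutable(char)[] s)
    {
        rep = s;
    }

    // Range interface over code points, delegating to the Phobos
    // array primitives, which decode narrow strings.
    @property bool empty() { return rep.empty; }
    @property dchar front() { return rep.front; }
    void popFront() { rep.popFront(); }
}

void main()
{
    String a = "Пиво"; // fine: the declared type selects the constructor
    auto b = "Пиво";   // type inference still picks the built-in string

    static assert(isInputRange!String);
    static assert(is(typeof(b) == string));
    static assert(!is(typeof(b) == String));

    // The wrapper offers no direct indexing, the array does.
    static assert(!__traits(compiles, a[0]));
    assert(a.rep.length == 8);
    assert(a.front == 'П');
}
```

Under these assumptions a library type can hide indexing and length, but every auto declaration initialized from a literal keeps producing the built-in array type, which is the point being made.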
October 24, 2012 Re: Narrow string is not a random access range | ||||
---|---|---|---|---|
| ||||
Posted in reply to Simen Kjaeraas | On Tuesday, 23 October 2012 at 17:36:53 UTC, Simen Kjaeraas wrote:
> On 2012-10-23, 19:21, mist wrote:
>
>> Hm, and all phobos functions should operate on narrow strings as if they were not random-accessible? I am thinking about something like commonPrefix from std.algorithm, which operates on code points for strings.
>
> Preferably, yes. If there are performance (or other) benefits from
> operating on code units, and it's just as safe, then operating on code
> units is ok.
Probably I don't understand it fully, but D approach has always been "safe first, fast with some additional syntax". Back to commonPrefix and take:
==========================
import std.stdio, std.traits, std.algorithm, std.range;

void main()
{
    auto beer = "Пиво";
    auto r1 = beer.take(2);             // take walks the string by code point
    auto pony = "Пони";
    auto r2 = commonPrefix(beer, pony); // compared by code unit in this Phobos version
    writeln(r1);
    writeln(r2);
}
==========================
First one returns 2 symbols. Second one - 3 code units and broken string. There is no way such inconsistency by-default in standard library is understandable by a newbie.
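Why the second result is broken becomes visible at the byte level. The sketch below is an illustration only; sharedCodeUnits is a hypothetical helper, not a Phobos function, and the source file is assumed to be UTF-8.

```d
import std.exception : assertThrown;
import std.string : representation;
import std.utf : UTFException, validate;

/// Hypothetical helper: how many leading UTF-8 code units two strings share.
size_t sharedCodeUnits(string a, string b)
{
    auto ra = a.representation, rb = b.representation;
    size_t n = 0;
    while (n < ra.length && n < rb.length && ra[n] == rb[n])
        ++n;
    return n;
}

void main()
{
    string beer = "Пиво", pony = "Пони";

    // 'П' is D0 9F, 'и' is D0 B8, 'о' is D0 BE: the second letters differ,
    // but their first bytes are identical, so three code units match.
    assert(sharedCodeUnits(beer, pony) == 3);

    // Slicing after three code units cuts the second letter in half.
    string broken = beer[0 .. 3];
    assertThrown!UTFException(validate(broken));

    // Slicing on a code point boundary gives the expected prefix.
    assert(beer[0 .. 2] == "П");
}
```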
|
October 24, 2012 Re: Narrow string is not a random access range | ||||
---|---|---|---|---|
| ||||
Posted in reply to Timon Gehr | Actually it is awesome. But all the code breakage.. eh. |
October 24, 2012 Re: Narrow string is not a random access range | ||||
---|---|---|---|---|
| ||||
Posted in reply to mist | On Wednesday, October 24, 2012 12:42:59 mist wrote:
> On Tuesday, 23 October 2012 at 17:36:53 UTC, Simen Kjaeraas wrote:
> > On 2012-10-23, 19:21, mist wrote:
> >> Hm, and all phobos functions should operate on narrow strings as if they were not random-accessible? I am thinking about something like commonPrefix from std.algorithm, which operates on code points for strings.
> >
> > Preferably, yes. If there are performance (or other) benefits from operating on code units, and it's just as safe, then operating on code units is ok.
>
> Probably I don't understand it fully, but D approach has always been "safe first, fast with some additional syntax". Back to commonPrefix and take:
>
> ==========================
> import std.stdio, std.traits, std.algorithm, std.range;
>
> void main()
> {
>     auto beer = "Пиво";
>     auto r1 = beer.take(2);
>     auto pony = "Пони";
>     auto r2 = commonPrefix(beer, pony);
>     writeln(r1);
>     writeln(r2);
> }
> ==========================
>
> First one returns 2 symbols. Second one - 3 code units and broken string. There is no way such inconsistency by-default in standard library is understandable by a newbie.
We don't really have much choice here. As long as strings are arrays of code units, it wouldn't work to treat them as ranges of their elements, because that would be a complete disaster for unicode. You'd be operating on code units rather than code points, which is almost always wrong. Pretty much the only way to really solve the problem as long as strings are arrays with all of the normal array operations is for the std.range traits (hasLength, hasSlicing, etc.) and the range functions for arrays in std.array (e.g. front, popFront, etc.) to treat strings as ranges of code points (dchar), which is what they do. The result _is_ confusing, but as long as strings are arrays of code units like they are now, to do anything else would result in incorrect behavior. There just isn't a good solution given what strings currently are in the language itself.
Andrei's suggestion would work if Walter could be talked into it, but that doesn't look like it's going to happen. And making it so that strings are structs which hold arrays of code units could work, but without language support, it's likely to have major issues. String literals would have to become the struct type, which could cause issues with calling C functions, and the code breakage would be _way_ larger than with Andrei's suggestion, since arrays of code units would no longer be strings at all. It would be feasible, but it gets really messy. What we have is probably about the best that we can do without actually changing the language (and Andrei's suggestion is likely the best way to do that IMHO), but that's unlikely to happen at this point, especially since Walter seems to view unicode quite differently from your average programmer and expects your average programmer to actually understand it and handle it correctly (which just isn't going to happen).
The confusion could be reduced if we not only had an article on dlang.org explaining exactly what ranges were and how to use them with Phobos but also an article (maybe the same one, maybe another), which explained what this means for strings and why. That way, it would become easier to become educated. But no one has written (or at least finished writing) such an article for dlang.org (I keep meaning to, but I never get around to it). Some stuff has been written outside of dlang.org (e.g. http://www.drdobbs.com/architecture-and-design/component-programming-in-d/240008321 and http://ddili.org/ders/d.en/ranges.html ), but there's nothing on dlang.org, and I don't believe that there's really anything online aside from stray newsgroup posts or stackoverflow answers which discusses why strings are the way they are with regards to ranges. And there should be.
- Jonathan M Davis
|
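How the range traits treat strings, as described above, can be checked at compile time. This is a minimal sketch against current Phobos, where these traits live in std.range.primitives; it assumes a UTF-8 source file.

```d
import std.range.primitives : ElementType, front, hasLength, hasSlicing,
    isBidirectionalRange, isRandomAccessRange;

void main()
{
    // As ranges, narrow strings present decoded code points.
    static assert(is(ElementType!string == dchar));
    static assert(is(ElementType!wstring == dchar));

    // The array has length, indexing and slicing, but the range traits
    // deliberately report that the range does not.
    static assert(!hasLength!string);
    static assert(!hasSlicing!string);
    static assert(!isRandomAccessRange!string);
    static assert(isBidirectionalRange!string);

    // dstring has one code unit per code point, so nothing needs hiding.
    static assert(hasLength!dstring);
    static assert(isRandomAccessRange!dstring);

    string beer = "Пиво";
    assert(beer.front == 'П'); // range primitive: decoded code point
    assert(beer[0] == 0xD0);   // array indexing: first code unit
}
```

The last two assertions show the split that the thread calls confusing: the same variable answers differently depending on whether it is used as a range or as an array.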
October 24, 2012 Re: Narrow string is not a random access range | ||||
---|---|---|---|---|
| ||||
Posted in reply to mist | On 10/24/2012 12:45 PM, mist wrote:
> Actually it is awesome.
> But all the code breakage.. eh.
Obviously T[] should support indexing for any T.
This is the definition of an array.
|
October 24, 2012 Re: Narrow string is not a random access range | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jonathan M Davis | Ok, just one question to make an official position clear: is commonPrefix implementation buggy or it is a conscious decision to go for some speed breaking correct operations on narrow strings at the same time? |
October 24, 2012 Re: Narrow string is not a random access range | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jonathan M Davis | On 10/24/2012 01:07 PM, Jonathan M Davis wrote:
> On Wednesday, October 24, 2012 12:42:59 mist wrote:
>> On Tuesday, 23 October 2012 at 17:36:53 UTC, Simen Kjaeraas wrote:
>>> On 2012-10-23, 19:21, mist wrote:
>>>> Hm, and all phobos functions should operate on narrow strings as if they were not random-accessible? I am thinking about something like commonPrefix from std.algorithm, which operates on code points for strings.
>>>
>>> Preferably, yes. If there are performance (or other) benefits from operating on code units, and it's just as safe, then operating on code units is ok.
>>
>> Probably I don't understand it fully, but D approach has always been "safe first, fast with some additional syntax". Back to commonPrefix and take:
>>
>> ==========================
>> import std.stdio, std.traits, std.algorithm, std.range;
>>
>> void main()
>> {
>>     auto beer = "Пиво";
>>     auto r1 = beer.take(2);
>>     auto pony = "Пони";
>>     auto r2 = commonPrefix(beer, pony);
>>     writeln(r1);
>>     writeln(r2);
>> }
>> ==========================
>>
>> First one returns 2 symbols. Second one - 3 code units and broken string. There is no way such inconsistency by-default in standard library is understandable by a newbie.
>
> We don't really have much choice here. As long as strings are arrays of code units, it wouldn't work to treat them as ranges of their elements, because that would be a complete disaster for unicode. You'd be operating on code units rather than code points, which is almost always wrong.
There are plenty of cases where it makes no difference, or iterating by code point is harmful, or just as incorrect.
str.filter!(a=>a!='x'); // works for all str iterated by
                        // code point or by code unit
string x = str.filter!(a=>a!='x').array; // only works in the latter case
dstring s = "ÅA";
dstring g = s.filter!(a=>a!='A').array;
> Pretty much the only way to really solve the problem as long as strings are arrays with all of the normal array operations is for the std.range traits (hasLength, hasSlicing, etc.) and the range functions for arrays in std.array (e.g. front, popFront, etc.) to treat strings as ranges of code points (dchar), which is what they do. The result _is_ confusing, but as long as strings are arrays of code units like they are now, to do anything else would result in incorrect behavior.
It would result in by-code-unit behavior.
> There just isn't a good solution given what strings currently are in the language itself.
>
> Andrei's suggestion would work if Walter could be talked into it, but that doesn't look like it's going to happen. And making it so that strings are structs which hold arrays of code units could work, but without language support, it's likely to have major issues. String literals would have to become the struct type, which could cause issue with calling C functions, and the code breakage would be _way_ larger than with Andrei's suggestion, since arrays of code units would no longer be strings at all.
> ...
You realize that the proposed solution is that arrays of code units would no longer be arrays of code units?
|
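The filter cases above can be spelled out as a runnable sketch. Two assumptions are made here: the helper name without is hypothetical, and the "Å" is written as 'A' followed by U+030A (combining ring above), which is one way the post's example can go wrong under code point iteration.

```d
import std.algorithm : filter;
import std.array : array;
import std.conv : to;

/// Hypothetical helper: remove every occurrence of one code point from a
/// UTF-8 string and re-encode the result.
string without(string s, dchar unwanted)
{
    // filter sees the string as a range of dchar, so .array yields dchar[];
    // to!string transcodes that back to UTF-8.
    return s.filter!(c => c != unwanted).array.to!string;
}

void main()
{
    string str = "axbxc";

    // Iterating by code point changes the element type of the result.
    static assert(is(typeof(str.filter!(c => c != 'x').array) == dchar[]));
    static assert(!__traits(compiles,
        { string x = str.filter!(c => c != 'x').array; }));
    assert(without(str, 'x') == "abc");

    // Code points are not user-perceived characters either: removing 'A'
    // also strips the base letter of the decomposed "Å" and leaves a
    // dangling combining mark behind.
    dstring s = "A\u030AA"d;
    auto g = s.filter!(c => c != 'A').array;
    assert(g == "\u030A"d);
}
```

So a plain ASCII filter is equally correct at either level, the string-typed result only falls out of code unit iteration, and the combining-mark case is wrong at the code point level too.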
October 24, 2012 Re: Narrow string is not a random access range | ||||
---|---|---|---|---|
| ||||
Posted in reply to mist | On Wednesday, October 24, 2012 13:19:54 mist wrote:
> Ok, just one question to make an official position clear: is commonPrefix implementation buggy or it is a conscious decision to go for some speed breaking correct operations on narrow strings at the same time?
Strings are always ranges of dchar, but if a function can operate on them more efficiently by special casing them and then using array operations taking the correct unicode handling into account, then it generally will.
commonPrefix can't be made much more efficient by special casing strings, but it _can_ change its return type to be a string via slicing, since it can keep track of where it is in the string as it iterates over it. However, the documentation incorrectly states that the result of commonPrefix is always takeExactly. That's generally true but is _not_ true for strings. The documentation needs to be fixed.
That being said, there _is_ a bug in commonPrefix that I just noticed when looking it over. It currently operates on code units rather than code points. It can operate on strings just fine like it's doing now (even returning a slice), but it needs to decode the code points as it iterates over them, and it's not doing that.
- Jonathan M Davis
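The fix described above, decoding code points while tracking the position so that a slice can be returned, might look roughly like the sketch below. It is not the Phobos implementation; commonPrefixByCodePoint is a hypothetical name, and both inputs are assumed to be valid UTF-8 (std.utf.decode throws a UTFException otherwise).

```d
import std.utf : decode;

/// Hypothetical helper: longest common prefix of two UTF-8 strings,
/// compared code point by code point and returned as a slice of `a`.
string commonPrefixByCodePoint(string a, string b)
{
    size_t i = 0, j = 0;
    while (i < a.length && j < b.length)
    {
        size_t nextI = i, nextJ = j;
        // decode reads one whole code point and advances the index past it.
        immutable dchar ca = decode(a, nextI);
        immutable dchar cb = decode(b, nextJ);
        if (ca != cb)
            break;
        i = nextI;
        j = nextJ;
    }
    return a[0 .. i]; // i only ever stops on a code point boundary
}

void main()
{
    assert(commonPrefixByCodePoint("Пиво", "Пони") == "П");
    assert(commonPrefixByCodePoint("Пиво", "Пиво") == "Пиво");
    assert(commonPrefixByCodePoint("abc", "xyz") == "");
}
```

Because the index is committed only after a full code point matches, the returned slice can never end in the middle of a multi-byte sequence, while the result type stays a plain string.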
|
Copyright © 1999-2021 by the D Language Foundation