Jonathan M Davis
| Okay, I'm sorry, but this is a wall of text, and I don't know how to actually make it short. If you really want the TLDR version, look for the =='s at the bottom which mark off the conclusion, but you're not going to understand the reasoning if you skip the wall of text. Now, to the post...
Recently, there was a PR submitted for Phobos which tried to add isStringLike to std.traits. There was some debate about what the name should be, but ultimately, Walter pointed out that we really needed to step back and examine all of the various traits that we have for strings and related types in Phobos before we add anything like this, regardless of the name. So, I've been mulling over this and looking at the current situation and thought I should post about it.
The PR in question: https://github.com/dlang/phobos/pull/5137
First off, I think that the concept of anything "string-like" should be tabled for now. String-like implies that it acts like a string and is therefore at least attempting to be a drop-in replacement. Nothing in Phobos currently tries to do that, and I think that discussions on that sort of thing are better left for when we look at adding something like Andrei's RCString to druntime or Phobos. What we're concerned with at the moment is how strings, things that convert to strings, and ranges of characters are dealt with in templated code.
The first thing to notice when when looking at strings and traits in Phobos
is that we have too many traits that are very similiar but not quite the
same. Each of them was added over time to solve a particular problem that
the existing traits didn't solve well enough or simply because it was a
concept that the submitter thought was one that we'd need to reuse. The
result is a mish-mash of traits with varying - and often confusing uses. We
have isSomeString, isNarrowString, isAutodecodableString, and
isConvertibleToString. And of course, related to those, we have isSomeChar,
ElementType, ElementEncodingType, isDynamicArray, isInputRange,
isForwardRange, isBidirectionalRange, isRandomAccessRange, hasSlicing,
isDynamicArray, isAggregateType, isStaticArray, is(T == enum),
is(T : const char[]), etc. all of which can be used in combination with or
instead of the main string traits. Exactly which combination of all of those
traits get used depends on what the function in question is trying to do,
and they're often used inconsistently. But I think that the main concern is
the various is*String traits. So,
isSomeString: True for any dynamic array of any character type as well as any enum type whose base type is a dynamic array of any character type.
isNarrowString: The same as isSomeString except that arrays of dchar and enum types with a base type that's a dynamic array of dchar do not qualify as narrow strings.
isAutodecodableString: The same as isNarrowString except that it also accepts user-defined types which implicitly convert to a dynamic array of char or wchar.
isConvertibleToString: Any type that is _not_ a dynamic array of characters but implicitly converts to a dynamic array of characters - so enums with that as their base type, user-defined types which implicitly convert to such dynamic arrays, and static arrays which implicitly convert to such dynamic arrays.
Note that these are all similar but subtly different. Of particular note is how this affects implicit conversions. Implicit conversions in generic code are _bad_. It's trivial for a type to pass a template constraint due to an implicit conversion and then not actually compile with the template - or worse, to compile but have subtly wrong behavior. So, pretty much any time that an implicit conversion is allowed with a template constraint, that conversion needs to be forced. Once it's converted, then it can function as the targetted type, because it _is_ the targetted type.
Honestly, I'm inclined to think that in most cases that conversion should be done explicitly by the programmer rather than happening automatically, because that's far less error-prone (e.g. this is what we do when we require that you slice a static array to pass it to a range-based function rather than having the range-based function do it for you). Unfortunately, that's not always an option - e.g. when templatizing a function that previously accepted string, if you don't want to break code, then you have to accept anything that implicitly converts to string.
So, the first mistake here is that isSomeString accepts enums. Annoyingly,
that was actually my fault (Kenji actually fixed isSomeString years ago, and
at the time, I argued that it should accept enums, which in hindsight,
was dumb of me). :(
In some cases, having a templated function accept both strings and enums _can_ work, but in most cases, it's going to be wrong. You can examine an enum with a base type of string, but you can't mutate it or assign it without risking either not compiling or ending up with a variable of an enum type with an invalid value, and enums don't pass isInputRange even if their base types is a range, so unlike strings, you can't use range-based functions with them. So, having code like
auto foo(S)(S str)
if(isSomeString!S)
{...}
is actually a recipe for disaster. It'll work with strings, but it likely won't work properly with enums, and most programmers will likely forget to test it with enums and won't catch the problem. Fortunately, I think that isSomeString is used primarily in range-based code to differentiate strings from other ranges, but it _is_ a concern that isSomeString accepts enums. As such, I think that code using isSomeString in a context where an enum would pass needs to either be changed so that the function accepts only strings and uses !is(S == enum) to prevent enums from passing _or_, it should be templatized in a manner which accepts anything which implicitly converts to a string type while simultaneously forcing the conversion. e.g.
auto foo(C)(C[] str)
if(isSomeChar!C)
{...}
That gives you essentially the same semantics as
auto foo(string str)
{...}
except that it accepts any character type (as well as any constness for the character type), and the conversion is done cleanly without needing any overloads.
Ideally, isSomeString would be fixed to not accept enums, but that risks breaking code, and I'd argue that in most cases where the fact that a template accepts enums would be a problem, templatizing on the character type is a better solution anyway, because it works with more code and forces the conversion without needing any extra code.
Now, most of the code that would break if we changed isSomeString is probably buggy anyway, but we would be risking breaking code if we made the change. So, if we started from scratch, I would definitely do it, but I don't know that we can reasonably do so now. We could of course do what std.conv does internally and add isExactSomeString, which is what isSomeString _should_ be, but that would mean adding yet another trait, whereas ideally we'd be consolidating them. However, it's probably a better approach than changing isSomeString if we don't want to risk breaking code. Regardless, isSomeString's documentation should be updated to make this problem clearer.
isNarrowString has the same problem as isSomeString, but AFAIK, its only real use is in range-based code where the fact that it accepts enums isn't a problem. So, the risk of breaking code when making it not accept enums is probably much lower than doing that with isSomeString, but the benefit is also lower, since the risk of it being used in a buggy manner is less. Aside from the enum issue though, I think that isNarrowString is fine as-is.
However, I think that isAutodecodableString should definitely be deprecated, and I think that we should at least consider deprecating isConvertibleToString. But I'll go into that below, since to understand that, I think that we first need to look at what we're trying to accomplish with these traits.
In general, when you templatize a function involving strings, you're doing one of
- create a generic function that works on any range (which just so happens
to include strings)
- create a function that specifically operates on strings (but is templated
so that it can operate on different constness or operate on different
character types)
- create a function that specifically operates on a range of characters,
which happens to include strings
- create a function that operates on strings or ranges of characters or
operates on types that implicitly convert to a string type
- templatize a function that used to take string (or took any dynamic array
of characters) and make it take any range of characters
If you're creating a generic function, it may not need to do anything special for strings or arbitrary ranges of characters - this is especially true if byCodeUnit has been used, and no decoding is taking place - but sometimes, it may be necessary to specialize on narrow strings to avoid unnecessary decoding (and the function may or may not then do any decoding explicitly, depending on what it's trying to do). I think that that's largely covered by isNarrowString. The constraint usually makes sure that you're dealing with a range, and ideally, the overload dealing with isNarrowString would be inside a static if rather than at the top level. If it's at the top level, then there's some risk of an enum getting through, because the constraint assumed that isNarrowString was enough to guarantee that the type was a range, but that goes back to whether we should make isSomeString and isNarrowString accept enums, and switching to a simpler, top-level constraint and making the isNarrowString bit be internal with a static if largely solves the problem in this case. So, I don't think that this use case requires any changes to any of the string-related traits right now.
As for functions that specifically operate on strings. As I said above when discussing isSomeString, I think that the correct thing to do in their case is to just templatize on the character type
auto foo(C)(C[] str)
if(isSomeChar!C)
{...}
That does mean that the function will accept types that implicit convert to strings in addition to strings, but if you're looking for it to specifically operate on strings, that should be just as okay as it would be if the function explictly took string instead of being templated. Any code that wants to avoid that though can do
auto foo(S)(S str)
if(isSomeString!S && !(S == enum))
{...}
but there usually isn't any point in doing that instead of templatizing on the character type. It just makes the function work with less code.
As for functions that specifically operate on ranges of characters (and operate on strings as part of that), I think that they're pretty much in the same boat as generic range-based functions. They have isNarrowString to check if they want to specialize on narrow strings, but they also have byCodeUnit, which they can use without checking isNarrowString, completely avoiding having to check for narrow strings, though whether using byCodeUnit is appropriate or not depends on what the function is doing. Still, I don't think that anything special needs to be done with the existing traits for this use case.
The last two cases are essentially the same (in both cases, you end up with a function that accepts types that implicitly convert to a string type in addition to accepting arbitrary ranges of characters). The primary difference is whether you're creating a new function or templatizing an existing one. The reason that I separated them is because I'm inclined to argue that in most cases, when you're creating a range-based function, you simply shouldn't be making it accept implicit conversions. It's clearer and less error-prone for the caller to do the conversion. So, I don't think that we normally should be creating such functions. However, we _do_ need to templatize existing functions that take string, and what we need to do to make that work is pretty much the same thing that we'd need to do when creating a templated function that accepts types that implicitly convert to a string type in addition to accepting arbitrary ranges of characters.
isConvertibleToString was created for this case. Basically, you have a function that accepts string
auto foo(string str) {..}
and you want to templatize it, so you create an overload for ranges of characters and another for types that implicitly convert to string. e.g.
auto foo(R)(R range)
is(isInputRange!R &&
isSomeChar!(ElementType!R) &&
!isConvertibleToString!S)
{...}
auto foo(T)(auto ref T convertible)
is(isConvertibleToString!T)
{
return foo!(StringTypeOf!T)(convertible);
}
Then the types that work as-is are accepted by the first overload, and those that need to be converted are converted by the second overload and then passed to the first.
Note the auto ref on the second overload. This is so that if a static array is passed as a variable, then it won't be copied. And since we don't want the auto ref on the overload taking ranges (since that would potentially break code), we're forced to have two separate functions rather having a single function that has an internal static if.
What I would _like_ to see done here is to have the convertible overload be something more like
auto foo(C)(C[] str)
if(isSomeChar!C)
{...}
This makes the conversion automatic and negates the need for isConvertibleToString. However, this overload would also accept strings, which means that you either have to duplicate code between the functions, or you need a helper function. e.g.
auto foo(R)(R range)
is(!isSomeString!R &&
isInputRange!R &&
isSomeChar!(ElementType!R))
{
return _fooImpl(range);
}
auto foo(C)(C[] str)
if(isSomeChar!C)
{
return _fooImpl(str);
}
private auto _fooImpl(R)(R range)
is(isInputRange!R &&
isSomeChar!(ElementType!R))
{...}
And because one of the overloads is templated on character type, whereas the other is templated on the entire type, there's no way to combine them. If anything, it's harder to combine them than in the case where isConvertibleToString is used. So, unless someone sees something I'm missing here or comes up with a great idea that has escaped me (which is quite possible), I think that we're stuck with isConvertibleToString so that we can make functions that accept string accept arbitrary ranges of characters without breaking code.
Alternatively, we _could_ deprecate the overload which takes isConvertibleToString and ultimately make the function require explicit conversions to be used like we would normally do with a range-based function that we were writing from scratch, but you'd still need isConvertibleToString during the deprecation phase. And I'm not sure that it's worth going through that deprecation phase anyway.
Unfortunately, however, using this technique with a function that returns a portion of the range is going to be wrong when it's passed a static array by value (or if it's passed a user-defined type that implicitly converts to a string but does so in a manner similar to slicing static arrays such that the original variable can't go out of scope before the string does without having the string refer to memory that is no longer valid). In the case where you have
auto foo(C)(C[] str)
if(isSomeChar!C)
{...}
the conversion takes place at the callsite and thus the return value is potentially usable before the original goes out of scope - which is the exact same situtation that we had when the function was not templated and explicitly took string, whereas with
auto foo(T)(auto ref T convertible)
is(isConvertibleToString!T)
{
return foo!(StringTypeOf!T)(convertible);
}
unless it's passed by reference, it's going to go out of scope too early. So, the safe thing to do would be to change it to using the three function solution discussed above:
auto foo(R)(R range)
is(!isSomeString!R &&
isInputRange!R &&
isSomeChar!(ElementType!R))
{
return _fooImpl(range);
}
auto foo(C)(C[] str)
if(isSomeChar!C)
{
return _fooImpl(str);
}
private auto _fooImpl(R)(R range)
is(isInputRange!R &&
isSomeChar!(ElementType!R))
{...}
in which case, the only reason to keep isConvertibleToString is that in the case where you know that the return value is not a piece of the original range, then the two function solution works safely. But given how easy it is to screw up with the two function solution (e.g. there's been confusion on github about std.file functions that use this idiom and why they have auto ref), I'm inclined to argue that the 3 function solution is what should become the "official" way to templatize a function that accepts string and make it accept arbitrary ranges of characters in addition to what it accepted before. And in that case, it would make sense to deprecate isConvertibleToString and thus reduce the number of stray string traits that programmers have to understand use correctly.
So, that leaves us with isAutodecodableString. Notice that no use cases have used it. The only use case I'm aware that makes any sense is byCodeUnit, which does not work with static arrays (presumably, because they're not ranges and slicing them for byCodeUnit is likely to be unsafe - though how unsafe it is exactly the same as what we've been discussing for templatizing functions that take string). And actually, I'm inclined to think that byCodeUnit shouldn't have accepted types that implicitly converted to string and should have required the conversion just like other templated code normally does, but changing it now risks breaking code, and in the case of something like an enum, it's probably handy, since byCodeUnit is the sort of function that's going to be used at the beginning of a chain of ranges. That being said, if byCodeUnit were to use the three function solution that I just talked about for templatizing string functions, it could use static arrays just as safely as any function that has a string parameter and returns a slice of that string. So, isAutodecodable wouldn't be needed.
And if we didn't want to make that change,
isNarrowString!T || (isConvertibleToString&T && !isStaticArray!T)
could be used instead (the main reason that isConvertibleToString wasn't used with byCodeUnit in the first place was that isConvertibleToString was added later), or if we do decide to get rid of isConvertibleToString but don't want byCodeUnit to accept static arrays, it would simple enough to make its template constraint do the right thing without either isAutodecodableString or isConvertibleToString.
In general though, I think that templated functions operating on ranges of characters should either be accepting only ranges of characters and not stuff that implicitly converts to them (as we typically do with range-based functions), or they should accept arbitrary ranges of characters plus everything that a function accepting strings explicitly would. So, I don't think that using isAutodecodableString with byCodeUnit ultimately is the right thing to do, particularly since it's used almost nowhere else. Aside from isAutodecodableString's unit tests and byCodeUnit, the only other place in Phobos that isAutodecodableString is used is a static if inside of std.algorithm.comparison.equal which could have just as easily used isNarrowString.
As such, I think that isAutodecodableString should definitely be deprecated, and while it's more questionable with isConvertibleToString, given the safety concerns, I think that we'd be better of deprecating it as well. And fortunately, neither of them is particularly old, so not much code outside of Phobos is going to be using them yet.
As for adding a new trait, which is what started this whole mess, the
issue that we have that a new trait would help solve is that any range-based
code that specifically operates on strings is going to end up with something
like
isInputRange!R && isSomeChar!(ElementType!R)
So, that could be repeated a fair bit in templated code. We _could_ shorten that to something like isSomeCharRange!R, but as soon as you need a forward range or greater, you're going to end up with something like
isForwardRange!R && isSomeCharRange!R
whereas right now it would be
isForwardRange!R && isSomeChar!(ElementType!R)
And that really doesn't add much value. Alternatively, it could also require finite ranges (which is what the original PR proposed), but to make that clear, you'd probably need something like isSomeFiniteCharRange rather than isSomeCharRange, which eats away at its value even further (since it's even longer), and it's often the case that we want to accept infinite ranges, because we generally try and make algorithms lazy as much as we can.
So, I'm not at all convinced that we want to add a new trait for arbitrary ranges of characters. And it's not like we're looking at adding something like isIntegralRange. That sort of thing just doesn't scale, and beyond code that only accepts basic input ranges, it doesn't even tend to make template constraints much shorter. Regardless, as stated before, I think that talking about a range of characters as being "string-like" is incorrect, because strings are a lot more than finite, basic input ranges of characters.
============================================================================
So, in conclusion, with regards to isSomeString, isNarrowString, isAutodecodableString, and isConvertibleToString, I propose that we
1. Deprecate isAutodecodableString and make the few places in Phobos that
currently use it use isNarrowString or isConvertibleToString instead.
2. Strongly consider deprecating isConvertibleToString (given its safety
concerns with static arrays) and instead use overloads that are
templated on character type to accept implicit conversions to string
rather than being templated on the entire type (the overload taking
arbitrary ranges would of course still be templated on the entire type).
3. Consider changing isSomeString and isNarrowString so that they are false
for enums - since they really should be - but we probably can't do that
without breaking at least some code (albeit mostly code that's going to
be buggy anyway). If we don't make that change, we should consider adding
isExactSomeString to std.traits rather than just having it as internal to
std.conv, and then for any non-range based code, we can tell folks to use
isExactSomeString and avoid isSomeString if they really don't want to
accept implicit conversions (whereas code accepting implicit conversions
can just templatize on character type). Regardless, isSomeString and
isNarrowString's documentation should be updated to make the problem
with enums clear.
If it were entirely up to me, I would deprecate isAutodecodableString and isConvertibleToString as soon as Phobos were fixed to not use them, but I'm divided on whether I would fix isSomeString, since I don't know of a way to do that without risking breaking code - even if it is likely to be code that's buggy, and we'd prefer to not be breaking code without any kind of deprecation phase.
And with regards to adding a new trait specifically for ranges of characters, I don't think that it adds enough value to be worth it. It would shorten some constraints, but there are many it would not, and it even the ones that it would shorten wouldn't be shortened all that much.
- Jonathan M Davis
|