November 04, 2012
On Sunday, 4 November 2012 at 17:38:07 UTC, Andrei Alexandrescu wrote:
> Back to a simpler solution: what's wrong with adding alternative APIs for certain input ranges? We have byLine, byChunk, byChunkAsync. We may as well add eachLine, eachChunk, eachChunkAsync and let the documentation explain the differences.
>
>
> Andrei

Yes, please! (<nitpick>Although I don't necessarily agree with the naming</nitpick>).

I think we should go with transient ranges the same way we went with "non-size_t index" ranges. Acknowledge that they exist, but not guarantee their support when passed to Phobos algorithms.

byLine is a perfect example of this: You can use it all you want in a for loop, but pass it "copy" or "array" at your own risk.

At that point, all we do is "tag" byLine as "Warning: Transient.", and call it a day. Not as perfectly safe as we'd want, but good enough, IMO. We then offer "eachLine", which isn't transient, and everyone is happy.
November 04, 2012
On 11/4/12 11:16 AM, H. S. Teoh wrote:
>> Fetching lines should be solved by using types; trafficking in
>> char[] does not offer guarantees about the preservation of the
>> content. In contrast, trafficking in string formalizes a binding
>> agreement between the range and its user that the content is
>> immutable.
> [...]
>
> I think that this may be oversimplifying the issue.

The express purpose here is to simplify instead of offering highly granular notions with many odd corner cases and combinations. So the risk of oversimplifying is indeed material.

> What about the
> example of a range of in-place permutations over an array? That range
> can be made a forward range, or even bidirectional (by reversing the
> sense of the comparison used by nextPermutation). But it is inherently a
> transient range.

That's an interesting example indeed. It would keep a T[] as its state and each popFront() would mutate that array to its next permutation. Implementing .save would be achieved by duplicating the array.

I would argue this is a tenuous range to characterize as a forward range. Invoking .save on a "usual" forward range means saving the iteration state, with an understanding that the saved range will go through the same iteration steps as the initial one. But nonono, actually the saved range goes through DUPLICATES of the states that the original goes through. Clearly there is an isomorphism of the two sequences of states, but they're not the same states - for example, someone looking at the original array will see the original changing the array, but not the .save.

So what you're looking at is a range that can define .dup, not one that can define .save.

> And this is merely one example. I have some non-trivial code that
> evaluates an expression tree, each node of which may generate multiple
> elements; one may take a range over all possible values that the tree
> may have. For efficiency reasons, each output value (which is
> vector-valued) reuses the same buffer. It is a forward range with a
> transient .front (it has to be a forward range, because otherwise one
> can't range over all combinations of multiple values that subtrees may
> produce).

I'm not sure about this example. Are we looking again at a .dup kind of thing?

> While I'm perfectly fine with the decision that such ranges shouldn't be
> considered proper ranges at all, it would greatly limit the
> applicability of std.algorithm in my code. Either I would have to
> artificially limit the range to an input range (which is a bit silly, as
> it amounts to renaming .save to something else just so std.algorithm
> wouldn't recognize it as a forward range), or I wouldn't be able to use
> std.algorithm to manipulate these ranges, but would have to roll my own.
> Of course, D makes it so that rolling my own isn't all that hard, but it
> would be a bit disappointing that the supposedly-generic algorithms in
> the standard library aren't that generic after all.

I understand. At the same time, there is a point where an abstraction can't comprehend all possible instances. There's already significant aggravation moveFront and hasLvalueElements etc., which I'd want to simplify given a chance. Continuing down that path seems risky to me.


Andrei
November 04, 2012
The whole problem seems to me to be caused by one simple problem in D:

value vs. reference semantics

If it was clear that the output of a range has value semantics, then clearly, its copy would not be transient.

And if it was clear that the output has reference semantics, then clearly, its copy could be modified internally.



So really, the problem is not with ranges.
It's with an inherent ambiguity in D -- can you assume arrays & structs are values or no?
November 04, 2012
On Sunday, 4 November 2012 at 19:59:55 UTC, Mehrdad wrote:
> The whole problem seems to me to be caused by one simple problem in D:
>
> value vs. reference semantics
>
> If it was clear that the output of a range has value semantics, then clearly, its copy would not be transient.
>
> And if it was clear that the output has reference semantics, then clearly, its copy could be modified internally.
>
>
>
> So really, the problem is not with ranges.
> It's with an inherent ambiguity in D -- can you assume arrays & structs are values or no?

Not sure this is specific to D. Just because you pass something by value doesn't mean somebody else can't modify what's underneath.

C++ gets away with this with an obsessive ownership model. Copy always trigger duplication of internals everytime anyting is ever passed by value. In D, you have dup, to do it only when you really need it.

-------
The problem IS with ranges: in C++, stream operations require an output argument, defering to the caller to chose where the output is written to, avoiding the problem. Ranges, just like iterators, must return values.

Heck, D has it better than C++. If you want to use istream iterators in C++ (say, to interface with an algorithm), you can't chose char* as a re-useable output buffer. You HAVE to use std:::string, always allocating.

In D, you at least have the choice of byLine and byLine.map!"a.idup"().

Problem is the default choice is not the easiest to operate with...
November 04, 2012
On Sunday, 4 November 2012 at 21:30:39 UTC, monarch_dodra wrote:
> Not sure this is specific to D. Just because you pass something by value doesn't mean somebody else can't modify what's underneath.


By "D" I didn't mean the language, I meant the current practices in Phobos and other libraries, as compared to (say) the STL.


> C++ gets away with this with an obsessive ownership model. Copy always trigger duplication of internals everytime anyting is ever passed by value. In D, you have dup, to do it only when you really need it.


Yeah, that's exactly what I mean. It's never ambiguous in C++ (although you certainly COULD make it ambiguous if you want to), because copying always, well, copies.


In D it's a mishmash of so many things you can't figure out when anything is copied or not...



> Heck, D has it better than C++. If you want to use istream iterators in C++ (say, to interface with an algorithm), you can't chose char* as a re-useable output buffer. You HAVE to use std:::string, always allocating.


Better as in more potential for performance? Heck yeah.

Better as in, easier to reason about the code? Well, no, that's what's causing this whole thread...
November 04, 2012
On Sun, Nov 04, 2012 at 12:38:06PM -0500, Andrei Alexandrescu wrote:
> On 11/4/12 12:26 PM, deadalnix wrote:
> >I think it fit nicely D's phylosophy, in the sense it does provide a safe, easy to use interface while providing a backdoor when this isn't enough.
> 
> It doesn't fit the (admittedly difficult to fulfill) desideratum that the obvious code is safe and fast. And the obvious code uses byLine, not byLine.transient.

Actually, it does. The proposal was to modify byLine so that it returns strings instead of a reused char[] buffer. The current version of byLine that has transient front will be moved into a .transient property. This has the following results:

- All existing code doesn't break: it just becomes a tad slower from
  duplicating each line. Correctness is not broken.

- Code that wants to be fast can call .transient at the end of the
  construction, e.g., stdin.byLine().map!myFunc.transient. Assuming
  that map propagates .transient (which also implies it's safe to use
  with transient ranges), this will return an optimized range that
  reuses the transient buffer safely.

- Code that calls .transient on non-transient ranges get a no-op:
  [1,2,3].map!myFunc.transient is equivalent to [1,2,3].map!myFunc,
  because non-transient ranges just alias .transient to themselves (this
  can be done via UFCS so no actual change needs to be made to existing
  non-transient ranges).

IOW, code is always 100% safe by default. You can optionally use .transient with certain ranges if you'd like for it to be faster. Using .transient where it isn't supported simply defaults to the usual safe behaviour (no performance increase, but no safety bugs either).

Now, this can be taken one step further. User code need not even know what .transient is. As deadalnix said, the insight is that it is only invoked by range *consumers*. For example, one could in theory make canFind() transient-safe, so the user would write this:

	auto n = stdin.byLine().map!myFunc.canFind(myNeedle);

and canFind uses the .transient range to do the finding. Assuming map!myFunc correctly supports .transient, this request for the faster range automatically propagates back to byLine(), and so the code is fast *without the user even asking for .transient*.

And if the range being scanned is non-transient, then it behaves normally as it does today. For example, if map!myFunc does *not* support .transient, then when canFind asks for .transient, UFCS takes over and it just gets the (non-transient) mapped range back. Which in turn asks for the *non-transient* version of byLine(), so in no case will there be any safety breakage.

I think this solution is most elegant so far, in the sense that:

(1) Existing code doesn't break;
(2) Existing code doesn't even need to change, or can be slowly
optimized to use .transient on a case-by-case basis without any code
breakage;
(3) The user doesn't even need to know what .transient is to reap the
benefits, where it's supported;
(4) The algorithm writer decides whether or not .transient is supported,
guaranteeing that there will be no hidden bugs caused by assuming .front
is persistent and then getting a transient range passed to it. So it's
the algorithm that declares whether or not it supports .transient.


> Back to a simpler solution: what's wrong with adding alternative APIs for certain input ranges? We have byLine, byChunk, byChunkAsync. We may as well add eachLine, eachChunk, eachChunkAsync and let the documentation explain the differences.
[...]

This is less elegant than deadalnix's proposal, because there is a certain risk associated with using .transient ranges. Now the onus is on the user to make sure he doesn't pass a transient range to an algorithm that can't handle it.

With deadalnix's proposal, it's the *algorithm* that decides whether or not it knows how to handle .transient properly. The user doesn't have to know, and can still reap the benefits. An algorithm that doesn't know how to deal with transient ranges simply does nothing, and UFCS takes over to provide the default .transient, which just returns the original non-transient range.


T

-- 
Государство делает вид, что платит нам зарплату, а мы делаем вид, что работаем.
November 04, 2012
On 11/04/2012 10:52 PM, Mehrdad wrote:
>
> By "D" I didn't mean the language, ...

That is what D is. If you stick to this terminology you will be understood.
November 05, 2012
On Sunday, November 04, 2012 21:49:58 Dmitry Olshansky wrote:
> I was mostly going by:
>  > The simpler way would be to simply state that input ranges can't
> 
> guarantee the persistence of .front after calling .popFront.
> 
> and
> 
>  >The algorithms that need to worry about .front's transience are only
> 
> the ones that work with input ranges (as opposed to forward ranges or
> stronger).
> 
> And conclude that applying this simple rule to std.array.array* means it has to assume non-transient elements and somehow copy/dup'em. Problem is we don't have generic function to make 100% guaranteed unaliased copy (it could omit doing any work on non-aliased objects).
> 
> Also that it misses forward ranges that have aliased .front, assuming they are non-transient.

The reality is that in the general case, you can't know for certain whether the result of front is transient or not. In some cases, you know, because the type is clearly immutable or a value type. In others (e.g. char[]), you can't know. A particularly annoying one though is classes, because they're almost never going to be immutable, so in almost all cases, if input ranges can have transient fronts and the range's element types is a class, then you have to assume that its front is transient. And I'm not sure whether it's possible to determine whether a struct is a value type or a reference type or not. The mutable indirections bit probably makes it so that you can determine for sure that some structs are value types, but you can't be certain for some of them whether they're a reference type or value type without examining the postblit constructor.

So, what it ends up coming down to if we accept that input ranges can have transient fronts is that algorithms then either have to require forward ranges in the case where they need to keep the result of front around, or we need a template of some kind which can determine in at least some cases that front in not transient (but it's going to have to say that front is transient in many cases where it isn't), and such algorithms will have to use that trait when dealing with input ranges.

The big problem though is std.array.array. Based on the functions that an input range has, all it needs is an input range, but it'll never work with a transient front. So, what is likely one of the most heavily used range-based functions can't work with input ranges. This is a big problem, and fixing it _will_ break code. Because either, we need to create a hasTransientFront template or make std.array.array require at least a forward range, which will make dealing with input ranges that much worse.

Still, as much as I hate the idea of permitting transient fronts, I'd rather have it be part of input ranges and be able to assume that other range types have non-transient fronts than create isTransient. Ranges are already overly complicated as it is. Regardless, if we decide that input ranges can have transient fronts, then we need to make that very clear, which clearly hasn't been the case, because only Andrei thought that the transience of front had anything to do with the definition of an input range.

- Jonathan M Davis
November 05, 2012
On Sunday, November 04, 2012 15:40:18 deadalnix wrote:
> Le 04/11/2012 03:26, Jonathan M Davis a écrit :
> > 3. Make it so that ranges which can be transient are non-transient by default but provide a function to get at a transient version for speed (which was the fastRange proposal in this thread). The main problem here is that when the fast range gets wrapped, it's transient, and so anything using the wrapped range will be forced to use the transient version rather than using the non- transient version and only using the transient version when it's asked for. So, I don't think that this is particularly viable.
> 
> Can you elaborate on that. I definitively don't see a problem that big here.

The problem is that you can't opt out of the fast range once you've got it. If you use fastRange and then the result ends up getting wrapped by another range, that range's front is transient and has no way of indicating it. So, you're right back in the boat that we're in right now. In order to avoid that, virtually all range-based functions would have to not use fastRange, because they have no idea whether you're going to use their result to pass to another function or not.

- Jonathan M Davis
November 05, 2012
On Sunday, November 04, 2012 08:55:48 Andrei Alexandrescu wrote:
> On 11/3/12 9:02 PM, Jonathan M Davis wrote:
> > Andrei seems to think that algorithms should consider input ranges' fronts to be transient, and that forward ranges and greater shouldn't have transient fronts. And I don't think that he's said much about it after that (probably mostly because he was out of town and unable to communicate on the newsgroup much for a while). However, I think that he's the only one in this thread who had come to that conclusion. Most of the rest of us agree that transience isn't necessarily related to the type of range at all. And I think that most everyone has considered ranges like ByLine and ByChunk to be bizarre and abnormal and not something that normal algorithms need to worry about (and if anything that ByLine and ByChunk should be changed to be non-transient or to not be ranges at all).
> 
> (Let's drop the argumentum ad populum part and focus on the technical
> argument.)
> 
> These ranges are the only ones dealing with streaming input - the only veritable input ranges, so qualifying them as bizarre would pretty much raise the point that input ranges are bizarre.
> 
> An input range means that you're fetching an item, looking at it, and discard it. Transience is, in my opinion, quite intrinsic to input ranges. It is also a well understood fact that copying a char[] does not guarantee its contents are copied, in any API. So just by seeing as ByLine is an input range of char[], its properties are self-evident. If it were an input range of string, it would have other properties, equally self-evident. This destroys the argument that transience is unrelated to the range type and kind.

There is nothing inherent to an input range that requires that front be transient. Ranges like ByLine can work just fine with non-transient fronts. They're just less efficient that way. And as H. S. Teoh has pointed out, you can easily have forward ranges with transient fronts. So, whether you can save a range really doesn't have much to do with whether front can be non-transient or not.

Now, if we say that we need transient fronts for some input ranges but don't want to add an isTransient trait or something similar, and so we're going to say that transient fronts are only acceptable for input ranges and that algorithms have to assume that front can be transient for input ranges, then fine. I don't really like it, but I'm definitely in favor of simplifying things here.

However, that means that we need to be very clear about the fact that input ranges can have transient fronts and that algorithms cannot assume that their fronts are not transient (something that only you appear to have thought was clear). We then either have to change a bunch of algorithms so that they require forward ranges, or we need to create a trait which can determine whether front is transient in it least some cases (it would still be claiming that front is transient in some cases where it isn't, because it can't know for sure, but we'd at least be able to figure it out in some cases), and make algorithms use that in their constraints when they deal with input ranges. std.array.array would be a prime case where an algorithm would have to be changed, because it can't function with a transient front, and it's a prime example of a function that you'd normally expect to work with an input range. And this _will_ break code. It'll allow us to fix the ByLine problem though.

- Jonathan M Davis