Thread overview
let's talk about output ranges
Feb 06, 2014
Adam D. Ruppe
Feb 06, 2014
H. S. Teoh
Feb 06, 2014
Adam D. Ruppe
Feb 06, 2014
Adam D. Ruppe
Feb 06, 2014
Adam D. Ruppe
Feb 06, 2014
Dmitry Olshansky
February 06, 2014
splitting from the ARC thread

On Thursday, 6 February 2014 at 15:47:48 UTC, Johannes Pfau wrote:
> As they do not keep references there's no need for ARC or GC,
> we just need a way to tell every function how it should allocate.

yea. I think it is time we hashed out output ranges the same way we've done with input ranges. Right now, output ranges just have put(T). I think it would be useful to add extensions similarly to random access ranges, infinite ranges, etc..


What I want to see work is:
char[3] buffer;
char[] got = toLowerWithRange(buffer, "FOO"); // if we allow it w/o slicing
assert(got.ptr is buffer.ptr);
assert(got == "foo");

char[] got = toLowerWithRange(buffer, "FOOL"); // throws, out no more space

char[] got = toLowerWithRange(buffer[], "FOO"); // same result as above
char[] got = toLowerWithRange(buffer[], "FOOL"); // MAyBE reallocates

(I call it toLowerWithRange just to avoid any overloading ambiguities when reusing the toLower name. Names aren't important to me though.)


What do we have here? When passed a static array, it must not grow beyond its length - the only allocation allowed is the exception it throws.

When passed a slice though, things are a bit more interesting. I think then it should try to put it directly in the slice, but if that isn't possible, go ahead and reallocate.

Though that's not necessarily sane either. Maybe slice should also throw if there isn't enough room.


toLower would prolly just call put(), with std.array or std.range defining it for static vs dynamic arrays.


With user defined types, this behavior would be tweakable, similarly to how input ranges might hasLength!T. Other similarities:

* A slice should also be a random access output range. Not all values are generated in order, one char at a time.

* Allocators should *offer* output ranges, but not necessarily *be* output ranges, similarly to the relationship between containers and input ranges (a view into the container). This could argue that passing a static buffer without slicing it is wrong

* We want this stuff to just work without a bajillion explicit wrappers, again, like input ranges.

* Output ranges would probably make the most sense to pass places by ref (as is the case with them most the time right now).



Eh, I gotta go, but let's start talking about this.

BTW for the growable thing, I wrote this little array thingy:
http://arsdnet.net/dcode/array2.d

it uses a static buffer up to an estimated size, then switches to GC. The idea being it can use infinite length, but for the majority of the input, uses stack space and thus avoids allocating at all. I'd like to see this thing be easy to use with phobos functions (perhaps with some additions but the same basic idea) as I find this pattern of guessing with a prealloced stack buffer very useful.


> Exceptions currently can't work on a system without GC cause we always use 'throw new' and nobody ever explicitly frees Exceptions.

Exceptions only come in exceptional circumstances anyway, so I think they are ok to be an... exception. LOL

but seriously, the exception path is rarely fully optimized in the first place. If you want to avoid them, use nothrow or preallocate them or something.
February 06, 2014
On Thu, Feb 06, 2014 at 04:21:51PM +0000, Adam D. Ruppe wrote:
> splitting from the ARC thread
> 
> On Thursday, 6 February 2014 at 15:47:48 UTC, Johannes Pfau wrote:
> >As they do not keep references there's no need for ARC or GC,
> >we just need a way to tell every function how it should allocate.
> 
> yea. I think it is time we hashed out output ranges the same way we've done with input ranges. Right now, output ranges just have put(T). I think it would be useful to add extensions similarly to random access ranges, infinite ranges, etc..

+1. I think output ranges (as a concept, not necessarily what's currently defined in Phobos) are far underused, and have a lot of untapped potential.


> What I want to see work is:
> char[3] buffer;
> char[] got = toLowerWithRange(buffer, "FOO"); // if we allow it w/o
> slicing
> assert(got.ptr is buffer.ptr);
> assert(got == "foo");
> 
> char[] got = toLowerWithRange(buffer, "FOOL"); // throws, out no
> more space
> 
> char[] got = toLowerWithRange(buffer[], "FOO"); // same result as
> above
> char[] got = toLowerWithRange(buffer[], "FOOL"); // MAyBE
> reallocates
> 
> (I call it toLowerWithRange just to avoid any overloading ambiguities when reusing the toLower name. Names aren't important to me though.)
> 
> What do we have here? When passed a static array, it must not grow beyond its length - the only allocation allowed is the exception it throws.

Would it make sense to have a .full method/property analogous to input ranges' .empty? Perhaps something like:

	struct MyOutputRange {
		void put(...);
		@property bool full() { ... }
	}

Though it may not be a good idea for the caller to constantly have to check if something is full before writing to it. Maybe having put() throw is the better way (then output ranges that allocate will just allocate when full without the caller needing to worry about it).


> When passed a slice though, things are a bit more interesting. I think then it should try to put it directly in the slice, but if that isn't possible, go ahead and reallocate.
> 
> Though that's not necessarily sane either. Maybe slice should also throw if there isn't enough room.

I think we're jumping the gun to implementation details here. What about other logical concepts that output ranges should support? One thing that the current output range API doesn't do very well is chaining. For example, if you don't care about allocation strategy and just use the GC, you can write:

	auto result = "mystring".toUpper.translate("abc", "def");

But once you have to pass an explicit output range to it, you can't chain things anymore:

	auto result1 = ArcOutputRange!string();
	"mystring".toUpper(result1);	// need to pass in result explicitly

	auto result2 = ArcOutputRange!string();
	result.data.translate("abc", "def", result2);

This is a big usability hindrance. Ideally we'd want to write something like:

	auto result = "mystring".toUpper(ArcOutputRange!string())
				.translate("abc", "def");

But I'm not sure how this can be made to work.


> toLower would prolly just call put(), with std.array or std.range defining it for static vs dynamic arrays.

Yes.


> With user defined types, this behavior would be tweakable, similarly to how input ranges might hasLength!T. Other similarities:
> 
> * A slice should also be a random access output range. Not all values are generated in order, one char at a time.

So we should extend put() to take an index, then?

	struct MyOutputRange {
		void put(T item);	// put item at current index
		void put(T item, size_t idx); // put item at specified index
	}


> * Allocators should *offer* output ranges, but not necessarily *be* output ranges, similarly to the relationship between containers and input ranges (a view into the container). This could argue that passing a static buffer without slicing it is wrong

An allocator is definitely not an output range! You don't put data into an allocator, you get a memory buffer *from* an allocator that you can put data into. Big difference. We shouldn't conflate the two.

That's why I said in another post, that algorithms that produce data into a data sink should not care what an allocator is; they should take an output range. Logically, it makes more sense: what you want is something to put your result data into, you don't care how it's allocated, or even whether it's a chunk of memory (it may be a socket or pipe or whatever).  Doing it this way also eliminates some needless allocations -- why do you need to allocate a string if all you want is to take input data, transform it to lowercase, and write to stdout? Let stdout do the buffering, and let toLower send the data to stdout directly. Calling an allocator from toLower essentially amounts to buffering the data twice.


> * We want this stuff to just work without a bajillion explicit wrappers, again, like input ranges.

Yes, which is why I brought up the issue of how to chain function calls that takes output ranges. That's a big usability issue that we need to address if output ranges are going to be as cool as input ranges were.


> * Output ranges would probably make the most sense to pass places by ref (as is the case with them most the time right now).

They should probably be *always* passed by ref, otherwise you could end up with some pathological behaviour of data from multiple sources overwriting each other because they were operating on copies of output ranges instead of references to a single one.

Also, delegates and function pointers should be treated as output ranges as well (Phobos should define .put and whatever other needed methods for them via UFCS).


[...]
> >Exceptions currently can't work on a system without GC cause we always use 'throw new' and nobody ever explicitly frees Exceptions.
> 
> Exceptions only come in exceptional circumstances anyway, so I think they are ok to be an... exception. LOL
> 
> but seriously, the exception path is rarely fully optimized in the first place. If you want to avoid them, use nothrow or preallocate them or something.

Yeah, you could do something like this:

	void myFailureProneAlgorithm(T...)(T args) {
		auto prealloc_exception = ARCAllocator!Exception();
		auto stacktraceBuffer = /* preallocate stacktrace here */;

		auto result = doBigComplicatedCalculation();
		if (result.hasError) {
			// Fill in exception info without allocation
			prealloc_exception.msg = result.err_msg;
			createStackTrace(stacktraceBuffer);
			prealloc_exception.stacktrace = stacktraceBuffer;

			// N.B.: no 'new'.
			throw prealloc_exception;
		}
		return result;
	}

	void main() {
		try {
			auto input = ...;
			auto output = myFailureProneAlgorithm(input);
		} catch(Exception e) {
			... // you get a statically allocated exception here
		}
	}

Doesn't solve the case where you call some library function that throws, though. :-(


T

-- 
One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie. -- The Silicon Valley Tarot
February 06, 2014
I just slapped this together in the other thread, let me copy/paste it here, talk about it a bit, then I'll finish reading what you wrote:

struct GCSink(T) {
    // so this is a reference type
    private struct Impl {
        T[] data;
        void put(T t) { data ~= t; }
        T[] finish() { return data; }
    }
    Impl* impl;
    alias impl this;
    void start() {
        impl = new Impl;
    }
}

struct StaticSink(T) {
    T[] container;
    this(T[] c) { container = c; }
    size_t size;
    void start() { size = 0; }
    void put(T t) { container[size++] = t; }
    T[] finish() { return container[0 .. size]; }
}
StaticSink!T staticSink(T)(T[] t) {
    return StaticSink!T(t);
}

T[] toUpper(T, OR = GCSink!T)(in T[] data, OR output = OR()) {
    output.start();
    foreach(d; data)
       output.put(d & ~0x20);
    return output.finish();
}

void main() {
    import std.stdio;
    writeln(toUpper("cool")); // default: GC
    char[10] buffer;
    auto received = toUpper("cool", staticSink(buffer[])); // custom static sink
    assert(buffer.ptr is received.ptr);
    assert(received == "COOL");
}

====

In addition to put(), I also added start() and finish(). There's precedent for this in Phobos already: the std.digest output ranges have methods like this. Like std.digest, put builds it up, then finish returns the final product.

These wouldn't be required functions on all ranges. Finish might even return null or void, if it was a sink into a write function or something. It could also close a file or something like that.

But, if it does build into an array, finish ought to be defined to return an analogous input range (or slice) into the data it just built for easier chaining and basic simple usage.

(Appender in phobos has a kinda similar thing called data, but I think this breaks things since it might be called at any time. Finish, on the other hand, could be defined that any puts after it are invalid.

start is used to restart things. Calling start at any time should reset the range to reuse the buffer.


I used the Impl* on the GC one so the default worked. Ref with a default argument would fail because it isn't an lvalue, so that would kill the simplicity.


I called staticSink on the buffer to build the range... which my first post said I wanted to avoid but I kind like it this way better. It isn't a hassle to call and keeps a nice separation between the view and the container.
February 06, 2014
On Thursday, 6 February 2014 at 18:10:53 UTC, H. S. Teoh wrote:
> Would it make sense to have a .full method/property analogous to input ranges' .empty? Perhaps something like:

That could be ok but I agree that having to check is a pain. Besides, what are you going to do if it is full? Suppose I call

char[1] b;
toUpper("lol", staticSink(b[]));


The best toUpper could do is truncate or throw.... and I think that would be better encoded in the range itself so the caller can decide.

auto got = toUpper("lol", staticSinkTruncating(b[]));
if(got.length < "lol".length) {
   // we could perhaps call it again to process the rest
}

(put in the truncating one would just check its own length and noop if it is full)

Otherwise, throwing range violations/out of memory exceptions are what would most likely happen anyway.

> One thing that
> the current output range API doesn't do very well is chaining.

Indeed. In my other post, I just wrote about finish. Finish serves to flush the buffer (digests or compression algorithms for example might need to be padded to block size), could finalize things (suppose an appender which just puts the pieces into a static array, then calls join all at once at the end), and could also just generally return the result.

There are some cases where returning from finish doesn't make sense, such as if you sunk to a file, you wouldn't keep an array of the contents around... but finish is still potentially useful in that it could close the file or release a lock. (Of course, dtors could do that too. But destructors can never return data to the user - that's where finish is special.)

Anyway, not all output ranges would offer finish and not all would return T[]. But not all input ranges offer opSlice either so we're still in analogous territory.

> This is a big usability hindrance. Ideally we'd want to write something
> like:
>
> 	auto result = "mystring".toUpper(ArcOutputRange!string())
> 				.translate("abc", "def");
>
> But I'm not sure how this can be made to work.

hmmm.... finish doesn't account for all that.... well, I guess it could by returning a range.

tbh toUpper might be better as a higher-order input range. Like alias toUpper = map!charToUpper(...). Those chain, they don't allocate, and they are well-defined right now.

Then at the end we build the result lazily and just put it all at once into the output range.

"mystring".toUpper.translate("abc","def").array(ArcOutputRange!string());



Yeah, I actually think that's the way to go. And calling .array at the end is nothing new to Phobos anyway. I'd be a bit weird doing it with toUpper but I think it really is the best fit.

(BTW I would be PISSED of toUpper actually changed like this. It'd break a bunch of code and I don't really care that toUpper allocates. I want it to just work. But we could offer equivalent functionality via per-character functions and map so we don't have to break code to offer the new options.)

> So we should extend put() to take an index, then?

that would work.

> An allocator is definitely not an output range!

yup, and I don't think a static array is either. A static array is neither an input range, since you can't do a = a[1..$]. But offering easy getters for such is easy and it rox.


> into a data sink should not care what an allocator is; they should take an output range.

Actually, I think they should generate lazy input ranges whenever possible. Then only at the end do we send it to the output range. It's just input ranges aren't allowed to allocate, that would kill their complexity guarantee, so we need an example of a function which *must* allocate up front.

They want the random access output range. Otherwise we can just put at the end.

> Let
> stdout do the buffering, and let toLower send the data to stdout
> directly. Calling an allocator from toLower essentially amounts to buffering the data twice.

yes

> They should probably be *always* passed by ref, otherwise you could end up with some pathological behaviour of data from multiple sources overwriting each other because they were operating on copies of output ranges instead of references to a single one.

That won't necessarily work though, you can't have a ref default parameter. But we can use pimpl or something to force a regular struct to be a ref item. Lazy initialization can be surprising, but we deal with that already with array slices so I think it is ok.

> Also, delegates and function pointers should be treated as output ranges as well (Phobos should define .put and whatever
> other needed methods for them via UFCS).

Yes, indeed.


> Doesn't solve the case where you call some library function that throws, though. :-(

at least there's nothrow if it is really that important to us.
February 06, 2014
In the other thread:

Andrei said:
>.flush() or .done() to mark the end of several writes

similar to my finish

> .clear() to clear the range (useful if e.g. it's implemented as a
> slice with appending)

similar to my start, but start is allowed to do required initializtions too whereas clear doesn't sound as required.


Dmitry said:
> .reserve(n) to notify underlying sink that it n items are
> coming (it should preallocate etc.)

yes, this may also be a noop on some objects
February 06, 2014
06-Feb-2014 20:21, Adam D. Ruppe пишет:
> splitting from the ARC thread
>
> On Thursday, 6 February 2014 at 15:47:48 UTC, Johannes Pfau wrote:
>> As they do not keep references there's no need for ARC or GC,
>> we just need a way to tell every function how it should allocate.
>
> yea. I think it is time we hashed out output ranges the same way we've
> done with input ranges. Right now, output ranges just have put(T). I
> think it would be useful to add extensions similarly to random access
> ranges, infinite ranges, etc..
>

Just throwing a bit of related thoughts that came and go up w.r.t. output ranges.

We really should add multiplexer (one sink writes to many) and demultiplexer adapters (by applying n-ary predicate and putting the result into a single sink).

And a 3rd filter primitive, as a wrapper of a sink that applies some predicate before forwarding it down the line.

Together with other suggestions would make output ranges *much* more powerful/useful.

-- 
Dmitry Olshansky