Dealing with ranges where front and popFront do the same logic / eager ranges

I've always been curious around the design choice of ranges to make front and popFront separate functions, instead of having popFront return the front. I suppose it is useful sometimes to be able to access front multiple times without having to save it temporarily (can't think of a specific example though).

But sometimes doing popFront requires doing the same work as front. For example, for strings and autodecoding, both front and popFront need to determine the length of the next code point. Now for utf-8 that is still managable, but I'm making a tokenizer range for a programming language. Most of the logic in `front` needs to be duplicated for `popFront`! (and `empty` as well)

So I often end up with designs like:
```
struct Range
{
  string source;
  bool empty;
  Token front;
  this(string source) {
    this.source = source;
    popFront();
  }

  void popFront() {
    // *complicated logic*
    front = result;
  }
}
```

This increases the size of the Range struct, and makes the range one element eager. If the first token is invalid, and I use Exceptions, then merely constructing the range will immediately throw an Exception. I recall reading that constructors throwing exceptions are problematic, but I don't have any experience with that. So maybe I should add a try-catch in the constructor, and rethrow the Exception upon the first 'front' call? That feels hacky.

This is not a dealbreaker, I'm just curious on your tips and opinions regarding this.

October 16, 2018

Re: Dealing with ranges where front and popFront do the same logic / eager ranges

Posted by H. S. Teoh
in reply to Dennis

Permalink

H. S. Teoh

Posted in reply to Dennis

Permalink

On Tue, Oct 16, 2018 at 10:59:50PM +0000, Dennis via Digitalmars-d-learn wrote:
> I've always been curious around the design choice of ranges to make front and popFront separate functions, instead of having popFront return the front. I suppose it is useful sometimes to be able to access front multiple times without having to save it temporarily (can't think of a specific example though).

IMO, it is a good design choice, though I have also encountered the issues you mention myself.  The main advantage is the ability to make decisions with a 1-element lookahead -- you can examine the next element without consuming it, which is useful in writing parsers, when the next token may not be a part of the current parse subtree, so a subtree parsing routine can just gracefully exit without consuming the next token (and requiring stupid hacks like unputting the token back so that another parsing routine can pick it up), and let the upper-level routines take care of the case where the next token is actually invalid.

> But sometimes doing popFront requires doing the same work as front. For example, for strings and autodecoding, both front and popFront need to determine the length of the next code point. Now for utf-8 that is still managable, but I'm making a tokenizer range for a programming language. Most of the logic in `front` needs to be duplicated for `popFront`! (and `empty` as well)
> 
> So I often end up with designs like:
[...]

Your solution is how I would do it too.  There's no point in duplicating the work in .front; calling .front multiple times shouldn't recompute the same thing over and over again. Just do the work once and save it, then return the same result if the caller calls .front multiple times.

If you're worried about the size of the struct becoming too big, just add a fake Token type that marks EOF, and check for that in .empty instead of the `bool empty`.

[...]
> This increases the size of the Range struct, and makes the range one element eager. If the first token is invalid, and I use Exceptions, then merely constructing the range will immediately throw an Exception. I recall reading that constructors throwing exceptions are problematic, but I don't have any experience with that. So maybe I should add a try-catch in the constructor, and rethrow the Exception upon the first 'front' call? That feels hacky.
[...]

It *is* hacky.  I'd just throw the Exception in the ctor, and not bother with rethrowing it in .front.  I'm not sure what's the reasoning behind the saying that throwing exceptions in ctors is bad, but exceptions are exactly the kind of thing designed for handling this sort of situation. If the parser detects a problem early (i.e., at construction time, rather than the first call to .front), why not report the problem early instead of waiting?

Now, there *is* one situation where the first-element-eager behaviour of ranges isn't desirable, and that is if you're reading live user input from stdin (e.g., in a REPL), and you want to do something like print a prompt before the first input line is read.  In this case, the mere act of constructing the parser would block on input, likely before the code has a chance to print the prompt.  I've encountered this situation on multiple occasions, and have found it truly aggravating, esp. if you're using std.stdio.File.byLine, which has to read a line of input before it can know the value of .empty, so the mere act of calling .byLine requires you to print a prompt first, even if you structured your code differently for better encapsulation.

My eventual solution is to wrap .byLine inside a struct that lazily constructs the ByLine instance:

	// Implementation of byLinePrompted.
	private struct ByLinePrompted(alias writePrompt, File)
	{
	    private File f;
	    private alias ByLine = typeof(f.byLine());

	    import std.typecons : Nullable;
	    private Nullable!ByLine _impl;
	    private ref ByLine impl()
	    {
		if (_impl.isNull)
		{
		    writePrompt();
		    _impl = f.byLine;
		}
		return _impl;
	    }

	    @property bool empty() { return impl.empty; }
	    @property auto front() { return impl.front; }
	    void popFront()
	    {
		writePrompt();
		impl.popFront();
	    }

	    static assert(isInputRange!(typeof(this)));
	}

	/**
	 * Returns: A lazily-initialized wrapper around File.byLine that
	 * emits a prompt before every line read.
	 *
	 * Params:
	 *  writePrompt = A function invoked each time a prompt is called for.
	 *  f = The file to wrap around.
	 */
	auto byLinePrompted(alias writePrompt, File)(File f)
	    if (is(typeof(writePrompt())))
	{
	    return ByLinePrompted!(writePrompt, File)(f);
	}

Example usage:

	auto input = stdin.byLinePrompted!({
		stdout.write("> ");
		stdout.flush;
	}).parse;

	while (!input.empty) {
		auto token = input.front;
		...
	}

In essence, we write the prompt before constructing byLine. This also defers any exceptions from the parser until the first call to .front. The byLinePrompted wrapper gives us a nice API to hide away all of this ugly mess.

It does entail a null check every invocation, but the CPU branch predictor ought to be able to figure out pretty quickly that the null check will always be false after the first time, so you shouldn't see a significant performance hit.

T

-- 
Bomb technician: If I'm running, try to keep up.

Forums