Thread overview
[RFC] I/O and Buffer Range
Dec 29, 2013
Dmitry Olshansky
Dec 29, 2013
Vladimir Panteleev
Dec 30, 2013
Marco Leise
Dec 30, 2013
Dmitry Olshansky
Dec 30, 2013
Marco Leise
Dec 30, 2013
Dmitry Olshansky
Dec 31, 2013
Brad Anderson
Dec 31, 2013
Joseph Cassman
Dec 31, 2013
Dmitry Olshansky
Dec 31, 2013
Joseph Cassman
Dec 31, 2013
Dmitry Olshansky
Dec 31, 2013
Joseph Cassman
Jan 04, 2014
Dmitry Olshansky
Jan 05, 2014
Jason White
Jan 05, 2014
Dmitry Olshansky
Jan 05, 2014
Jason White
Jan 05, 2014
Dmitry Olshansky
Jan 06, 2014
Jason White
Jan 06, 2014
Dmitry Olshansky
Jan 07, 2014
Jason White
Jan 07, 2014
Dmitry Olshansky
Jan 16, 2014
Dmitry Olshansky
Jan 16, 2014
Dmitry Olshansky
Jan 17, 2014
Dmitry Olshansky
Jan 17, 2014
Dmitry Olshansky
Jan 17, 2014
Dmitry Olshansky
Jan 17, 2014
Kagamin
Jan 17, 2014
Dmitry Olshansky
Jan 17, 2014
Kagamin
Jan 17, 2014
Jakob Ovrum
Jan 04, 2014
Brian Schott
Jan 04, 2014
Dmitry Olshansky
Jan 09, 2014
Brian Schott
Jan 12, 2014
Dmitry Olshansky
Jan 16, 2014
Walter Bright
Jan 16, 2014
Dmitry Olshansky
Jan 16, 2014
Walter Bright
December 29, 2013
Didn't mean this to be kind of under the tree .. Planned to post long ago but I barely had the time to flesh it out.

TL;DR: Solve the input side of the I/O problem with a new kind of range.
I'm specifically taking on the buffering primitive on top of low-level unbuffered I/O sources.

Links

Prior related discussions:
http://forum.dlang.org/thread/op.wegg9vyneav7ka@steves-laptop?page=4

http://forum.dlang.org/thread/mailman.17.1317564725.22016.digitalmars-d@puremagic.com?page=1

http://forum.dlang.org/thread/vnpriguleebpbzhkpdwi@forum.dlang.org#post-kh5g46:242prc:241:40digitalmars.com

An early precursor of the proposed design found in DScanner:
https://github.com/Hackerpilot/Dscanner/blob/master/stdx/d/lexer.d#L2388

Intro and motivation

Anyone doing serious work on parsers (text & binary alike) in D soon finds out that while the slice-em-up approach with random-access ranges works very nicely, any attempt to extend it beyond arrays and in-memory stuff gets really awkward. It amounts to implementing no small amount of bookkeeping (buffering), all the while having to use primitives that just don't map well to the underlying "hardware" - a buffered stream.

For instance - want to peek 3 bytes ahead (as in LL(3) disambiguation)? Save the range, do 3 front/popFront calls with empty checks, then copy the saved range back. Painful, especially when you know full well it could be just a length check plus 3 direct array reads. Even if the underlying buffer allows you to slice up the contents and do array-wise operations on them, the dumbed-down forward range interface just won't let you.
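To make the contrast concrete, here's a sketch of that 3-byte peek both ways. The forward-range version is plain Phobos; the buffer version is written against the `lookahead` primitive from the proposal docs linked above, so treat that part as an assumption of the proposed API:

```d
import std.range;

// Forward-range way: fork the range and pop three times, with an empty
// check at every step - even when the source is a plain array.
bool peeksAbc(R)(R r) if (isForwardRange!R)
{
    auto s = r.save;
    foreach (c; "abc")
    {
        if (s.empty || s.front != c)
            return false;
        s.popFront();
    }
    return true;
}

// Buffer-range way (sketch against the proposed API): one call that
// yields up to 3 bytes of the underlying window as a plain array.
bool peeksAbc(B)(ref B buf)
    if (!isForwardRange!B) // stand-in for the proposal's isBuffer!B
{
    auto ahead = buf.lookahead(3); // empty slice if 3 bytes unavailable
    return ahead == cast(const(ubyte)[]) "abc";
}
```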

And once all of the hard (and useless) work to support forward ranges is done - you realize there is this 2nd parser that should support forward ranges as well. The conclusion is that parsers simply don't want to work with forward ranges; they need something bigger, and all the extra work you did on buffering simply belongs to that bigger primitive.

To put the last nail in the coffin - most of the interesting sources can't even be forward ranges! A stream over a TCP socket - how do you 'save' that, Sherlock? Keeping in mind that it must return the same type - we'd better have called it "fork" back then.

To sum it up - the current selection of ranges is not good enough to _efficiently_ work with I/O. Yet ranges are very neat and we'd better retain their strengths and capitalize on the existing work done in this area (e.g. a slew of good stuff in std.algorithm and std.range). Got to extend the notion then.

Goals

Introduce a better primitive aimed squarely at solving the _input_ side of the I/O problem. Output IMHO might need some tweaks, but with OutputRange it's in a much better shape than input.

The said new input primitive (from now on "buffer range") must naturally and efficiently support the following:
1) The zero-copy (insanely fast & popular) slice-em-up approach for in-memory use.
2) "Over the wire" sources such as pipes, sockets and the like
3) Memory mapped files esp. the ones that don't fit into memory
4) Primitives that (all kinds of) parsers need, including lookahead and lookbehind.

It would be awesome if we can:
4) Retain backwards compatibility and integrate well with existing ranges.
5) Avoid dependency on C run-time and surpass it performance-wise
6) Remove extra burden and decouple dependencies in the chain:

input-source <--> buffer range <--> parser/consumer

Meaning that if we can mix and match parsers with buffer ranges, and buffer ranges with input sources, we'll have grown something powerful indeed.

Spoiler

I have a proof of concept implemented. It *works*. What I'm looking for is to simplify the set of primitives and/or better suggestions.

Code: https://github.com/blackwhale/datapicked/tree/master/dpick/buffer
Docs: http://blackwhale.github.io/datapicked/

See it in action with e.g. a bit trimmed-down std.regex using this abstraction and working directly over files:

grep-like tool
https://github.com/blackwhale/datapicked/blob/master/dgrep.d

and the module itself
https://github.com/blackwhale/datapicked/blob/master/regex.d

(for usage see build.sh and bench.sh)

It's as efficient as it was with arrays, but there's no more need to work line by line and/or load the whole file.

In fact it's faster than the fastest line-by-line solution I had before: fgets + std.regex.matchFirst. Note that this (old) solution is cheating twice - it seeks only 1 match per line and it knows the line length ahead of time.

For me this proves that (5) is well within our reach.

Proposal

The BufferRange concept itself (for now called simply Buffer) is defined here:
http://blackwhale.github.io/datapicked/dpick.buffer.traits.html

I'm not comfortable with the sheer number of primitives but my goal was sufficient first, minimal second.

Rationale for the selection of the primitives follows.

1. Start with InputRange, as there is no need to break the good thing (and foreach). This gets us somewhat close to (4). :)

Accept that given requirements (1)-(3) we are working with a "sliding window" over whatever the true source of data is. Thus the sliding window can be a whole array, a mapped area of a file or the buffer of a network stream. A sliding window may be moved across the input stream (~ re-loading the buffer) or extended to reach further.

Now what we need is to properly exploit the capabilities of the sliding-window model and match them with the primitives a parser would "like".

2. Slicing "after the fact".

This means the ability to mark relevant parts of the buffer as the start of something "useful" and require the underlying implementation, when the time comes to move the window, to keep the data starting from there. Typically later "down the stream", once the boundaries of the slice are established, it's extracted, examined and (rarely!) copied over. The ability to avoid a copy unless absolutely necessary is _crucial_.

Motivating (pseudo-)code I've seen in lexers (e.g. DScanner)
with my primitives looks like this:

{
    // pin this position so that buffering won't lose it
    auto m = input.mark();
    while(!input.empty && isAlpha(input.front)){
        input.popFront();
    }
    // get a slice from 'm' to the current position
    auto word = input.slice(m);
    // grab a copy if we haven't seen it before
    if(word !in identifiers)
        identifiers.insert(word.idup);
} // here m's destructor unpins the position in input

To address slicing (a parser requirement) I had to introduce the marking and pinning concept explicitly. See mark/slice in the docs.

3. Cheap save/restore.

Copying the whole range to save iteration state was a bad idea. It's not just wasteful memory-wise, it's _semantically_ costly. In the case of a buffer range it would also imply some "re-reading" of the I/O source, as the copy has to be a completely independent view of the same input(!). Hence 'save' of forward ranges is an inadequate primitive that must be dropped.

However when the time comes to backtrack (and many parsers do that, quite often) the state has to be restored. To minimize the primitive count and not break requirement (2), reuse 'mark' to save the state and add 'seek' to restore the position to the marked point. As the mark was pinned it's always in the buffer and readily accessible, keeping the ability to work over streams.
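A sketch of that backtracking pattern, using `mark`/`seek` as defined in the linked traits docs (the function itself is hypothetical):

```d
// Try to consume a keyword; on failure rewind to where we started.
// mark/seek/front/popFront are the proposed Buffer primitives.
bool tryParseKeyword(Buffer)(ref Buffer input, const(ubyte)[] kw)
{
    auto m = input.mark();    // pin: buffering may not discard this spot
    foreach (b; kw)
    {
        if (input.empty || input.front != b)
        {
            input.seek(m);    // restore - the data is still in the window
            return false;
        }
        input.popFront();
    }
    return true;              // m's destructor unpins on scope exit
}
```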

4. Even cheaper save/restore.

Something discovered the hard way in a practical setting is that setting up the boundaries of stuff to slice (captures in regex) must be dirt cheap - integer-assignment cheap.

This means that using marking for every sub-slice is prohibitively costly, as each mark has to communicate with the buffer (currently by bumping a counter in some array). The idea that worked out in std.regex is to use a single "vantage point" mark, take one big slice off it, and then make sub-slices of that as the plain array it is. Then from time to time the "vantage point" is updated, whenever there are no sub-slices in mid-air.

This leads to a 'tell' primitive that gives offset from a given mark to the current position plus 'seek' overloads that work with relative offsets (positive and negative).

Another usage for relative seek is skipping data without looking at it, potentially dropping whole buffers away (there is overlap with popFrontN though). Also fixed-width backtracking is easily had with relative seek if we can instruct the buffer range to keep at least the last K bytes in the buffer (require a minimal history).
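The "vantage point" idea with `tell` could be sketched like this (`mark`/`tell`/`slice` are the proposed primitives; everything else here is hypothetical illustration):

```d
// Captures stored as integer offsets off a single pinned mark, so that
// opening/closing a capture boundary is integer-assignment cheap.
struct CaptureList(Buffer)
{
    Buffer* input;
    typeof(Buffer.init.mark()) vantage; // the one pinned mark
    size_t[] starts, ends;              // capture boundaries as offsets

    void openCapture()  { starts ~= input.tell(vantage); }
    void closeCapture() { ends   ~= input.tell(vantage); }

    // Materialize capture i lazily: one big slice off the vantage point,
    // then plain array slicing - no per-capture pinning.
    const(ubyte)[] opIndex(size_t i)
    {
        auto all = input.slice(vantage);
        return all[starts[i] .. ends[i]];
    }
}
```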

5. Cheap lookahead/lookbehind.

This exploits the fact that the underlying buffer is nothing but an array, hence one may easily take a peek at some portion of it as a plain ubyte[]. The implementation makes sure it buffers up enough data as required and/or returns an empty range if that's not possible. This supports things like LL(k) lookahead and the fixed-length lookbehind that is common in regex zero-width assertions.

These 2 can be implemented with relative 'seek' + front/popFront; the question remains how effective it'll be.

-- 
Dmitry Olshansky
December 29, 2013
On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
> [snip]

Hmm, just yesterday I was rewriting a parser to use a buffer instead of loading the whole file in memory, so this is quite timely for me.

Questions:

1. What happens when the distance between the pinned and current position exceeds the size of the buffer (sliding window)? Is the buffer size increased, or is the stream rewound if possible and the range returned by the slice does seeking?

2. I don't understand the rationale behind the current semantics of lookahead/lookbehind. If you want to e.g. peek ahead/behind to find the first whitespace char, you don't know how many chars to request. Wouldn't it be better to make these functions return the ENTIRE available buffer in O(1)?

I guess I see the point when applied to regular expressions, where the user explicitly specifies how many characters to look ahead/behind. However, I think in most use cases the amount is not known beforehand (without imposing arbitrary limitations on users like "Thou shalt not have variable identifiers longer than 32 characters"), so the pattern would be "try a cheap lookahead/behind, and if that fails, do an expensive one".

3. I think ideally the final design would use something like what std.allocator does with "unbounded" and "chooseAtRuntime" - some uses might not need lookahead or lookbehind or other features at all, so having a way to disable the relevant code would benefit those cases.
December 30, 2013
On Sun, 29 Dec 2013 22:45:35 +0000,
"Vladimir Panteleev" <vladimir@thecybershadow.net> wrote:

> [snip]
> 
> 2. I don't understand the rationale behind the current semantics of lookahead/lookbehind. If you want to e.g. peek ahead/behind to find the first whitespace char, you don't know how many chars to request. Wouldn't it be better to make these functions return the ENTIRE available buffer in O(1)?

Yeah, been there too. I guess after X years of programming chances are good you implemented some buffer with look-ahead. :)

I use those primitives:

   ubyte[] mapAvailable() pure nothrow
   ubyte[] mapAtLeast(in ℕ count)

See: https://github.com/mleise/piped/blob/master/src/piped/circularbuffer.d#L400

Both return a slice over all the available buffered data.
mapAtLeast() also waits till enough data is available from the
producer thread. The circular buffer is automatically extended
if the producer wants to write a larger chunk or the consumer
needs a larger window.
(I was mostly focused on lock-free operation in not-starved
situations and minimal bounds checking, so the code is a bit
more sophisticated than the average circular buffer.)

-- 
Marco

December 30, 2013
On 30-Dec-2013 02:45, Vladimir Panteleev wrote:
> On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
>> [snip]
>
> Hmm, just yesterday I was rewriting a parser to use a buffer instead of
> loading the whole file in memory, so this is quite timely for me.
>
> Questions:
>
> 1. What happens when the distance between the pinned and current
> position exceeds the size of the buffer (sliding window)? Is the buffer
> size increased, or is the stream rewound if possible and the range
> returned by the slice does seeking?

It's expected that the window is increased. The exact implementation may play any dirty tricks it sees fit, as long as it can provide a slice over the pinned area - in short, maintain the illusion that the window has increased. I would be against a seeking range and would most likely opt for memory-mapped files instead, but it all depends on the exact numbers.

>
> 2. I don't understand the rationale behind the current semantics of
> lookahead/lookbehind. If you want to e.g. peek ahead/behind to find the
> first whitespace char, you don't know how many chars to request.

If you want to 'find' just do front/popFront, no?
Or do you specifically want to do array-wise operations?

> Wouldn't it be better to make these functions return the ENTIRE
> available buffer in O(1)?

Indeed, now I think that 2 overloads would be better:
auto lookahead(size_t n); //exactly n bytes, re-buffering as needed
auto lookahead(); // all that is available in the window, no re-buffering

Similar for lookbehind.
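A usage sketch of the two overloads just described (the enclosing function and byte values are hypothetical; `countUntil` is real std.algorithm):

```d
import std.algorithm.searching : countUntil;

void demo(Buffer)(ref Buffer input)
{
    // Fixed-width: LL(3)-style disambiguation; re-buffers if the window
    // currently holds fewer than 3 bytes.
    auto a3 = input.lookahead(3);
    bool commentStart = a3 == cast(const(ubyte)[]) "<!-";

    // Variable-width: scan only what's already buffered for whitespace;
    // no re-buffering, so the check costs O(window) at most.
    auto win = input.lookahead();
    auto ws  = win.countUntil!(b => b == ' ' || b == '\t' || b == '\n');
    if (ws < 0)
    {
        // slow path: extend the window or fall back to front/popFront
    }
}
```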

> I guess I see the point when applied to regular expressions, where the
> user explicitly specifies how many characters to look ahead/behind.

Actually the user doesn't - our lookahead/lookbehind is variable length. One thing I would have to drop is unbounded lookbehind, not that it's so critical.

> However, I think in most use cases the amount is not known beforehand
> (without imposing arbitrary limitations on users like "Thou shalt not
> have variable identifiers longer than 32 characters"), so the pattern
> would be "try a cheap lookahead/behind, and if that fails, do an
> expensive one".

I would say that in case where you need arbitrary-length lookahead:
m = mark, seek + popFront x N, seek(m) should fit the bill.
Or as is the case in regex at the moment - mark once, and use seek back to some position relative to it. In one word - backtracking.

An example of where fixed lookahead rocks:
https://github.com/blackwhale/datapicked/blob/master/dpick/buffer/buffer.d#L421

>
> 3. I think ideally the final design would use something like what
> std.allocator does with "unbounded" and "chooseAtRuntime" - some uses
> might not need lookahead or lookbehind or other features at all, so
> having a way to disable the relevant code would benefit those cases.

It makes sense to make lookahead and lookbehind optional.
As for the code - for the moment it doesn't add much and builds on stuff already there. Though I suspect some other implementations would be able to "cut corners" more efficiently.

-- 
Dmitry Olshansky
December 30, 2013
On 30-Dec-2013 09:29, Marco Leise wrote:
> On Sun, 29 Dec 2013 22:45:35 +0000,
> "Vladimir Panteleev" <vladimir@thecybershadow.net> wrote:
>
>> [snip]
>>
>> 2. I don't understand the rationale behind the current semantics
>> of lookahead/lookbehind. If you want to e.g. peek ahead/behind to
>> find the first whitespace char, you don't know how many chars to
>> request. Wouldn't it be better to make these functions return the
>> ENTIRE available buffer in O(1)?
>
> Yeah, been there too. I guess after X years of programming
> chances are good you implemented some buffer with
> look-ahead. :)
>
> I use those primitives:
>
>     ubyte[] mapAvailable() pure nothrow
>     ubyte[] mapAtLeast(in ℕ count)
>

Yeah, that's a good idea.
I mean primitives, not Unicode in type names ;)


-- 
Dmitry Olshansky
December 30, 2013
On Mon, 30 Dec 2013 12:08:18 +0400,
Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:

> On 30-Dec-2013 09:29, Marco Leise wrote:
> > On Sun, 29 Dec 2013 22:45:35 +0000,
> > "Vladimir Panteleev" <vladimir@thecybershadow.net> wrote:
> >
> >> [snip]
> >>
> >> 2. I don't understand the rationale behind the current semantics of lookahead/lookbehind. If you want to e.g. peek ahead/behind to find the first whitespace char, you don't know how many chars to request. Wouldn't it be better to make these functions return the ENTIRE available buffer in O(1)?
> >
> > Yeah, been there too. I guess after X years of programming chances are good you implemented some buffer with look-ahead. :)
> >
> > I use those primitives:
> >
> >     ubyte[] mapAvailable() pure nothrow
> >     ubyte[] mapAtLeast(in ℕ count)
> >
> 
> Yeah, that's a good idea.
> I mean primitives, not Unicode in type names ;)

One day it'll catch on :D

-- 
Marco

December 31, 2013
On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
> Proposal

Having never written any parser I'm not really qualified to seriously give comments or review it but it all looks very nice to me.

Speaking as just an end user of these things, whenever I use ranges over files or from, say, std.net.curl, the byLine/byChunk interface always feels terribly awkward to use, which often leads to me just giving up and loading the entire file/resource into an array. It's the boundaries that I stumble over. byLine never fits when I want to extract something multiline, but byChunk doesn't fit either because if what I'm searching for lands on a chunk boundary I'll miss it.

Being able to just do a matchAll() on a file, std.net.curl, etc. without sacrificing performance and memory would be such a massive gain for usability.

Just a simple example of where I couldn't figure out how to utilize either byLine or byChunk without adding some clunky homegrown buffering solution. This is code that scrapes website titles from the pages of URLs in IRC messages.

---
auto scrapeTitles(M)(in M message)
{
    static url_re = ctRegex!(r"(https?|ftp)://[^\s/$.?#].[^\s]*", "i");
    static title_re = ctRegex!(r"<title.*?>(.*?)<", "si");
    static ws_re = ctRegex!(r"(\s{2,}|\n|\t)");

    auto utf8 = new EncodingSchemeUtf8;
    auto titles =
         matchAll(message, url_re)
        .map!(match => match.captures[0])
        .map!((url) => get(url).ifThrown([]))
        .map!(bytes => cast(string)
                       utf8.sanitize(cast(immutable(ubyte)[])bytes))
        .map!(content => matchFirst(content, title_re))
        .filter!(captures => !captures.empty)
        .map!(capture => capture[1].idup) // dup so original is GCed
        .map!(title => title.entitiesToUni.replace(ws_re, " "))
        .array;

    return titles;
}
---

I really, really didn't want to use that std.net.curl.get(). It causes all sorts of problems if someone links to a huge resource. I just could not figure out how to utilize byLine (the title regex capture can be multiline) or byChunk cleanly. Code elegance (a lot of it due to Jakob Ovrum's help in IRC) was really a goal here as this is just a toy, so I went with get() for the time being, but it's always sad to sacrifice elegance for performance. I certainly didn't want to add some elaborate ever-growing buffer in the middle of this otherwise clean UFCS chain (and I'm not even sure how to incrementally regex-search a growing buffer, or if that's even possible).

If I'm understanding your proposal correctly, that get(url) could be replaced with a hypothetical std.net.curl "buffer range" which could be passed directly to matchFirst. It would only take up, at most, the size of the buffer in memory (which could grow if the capture grows to be larger than the buffer) and wouldn't read the unneeded portion of the resource at all. That would be such a huge win for everyone, so I'm very excited about this proposal. It addresses all of my current problems.



P.S. I love std.regex more and more every day. It made that entitiesToUni function so easy to implement: http://dpaste.dzfl.pl/688f2e7d
December 31, 2013
On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
> [...]

Interesting idea. Seems to fill a need I have been facing with some parsing code.

Since I was unclear about how its functionality compares to ForwardRange I took a look through std.algorithm. If typed versions of lookahead/lookbehind were added it seems like ForwardRange could be replaced with this new range type, at least in certain places. For example, it seems to me that the following code (from the article "On Iteration" https://www.informit.com/articles/article.aspx?p=1407357&seqNum=7)

   ForwardRange findAdjacent(ForwardRange r){
      if (!r.empty()) {
         auto s = r.save();
         s.popFront();
         for (; !s.empty();
               r.popFront(), s.popFront()) {
            if (r.front() == s.front()) break;
         }
      }
      return r;
   }

also could be written as

   BufferedRange findAdjacent(BufferedRange r) {
      for(;!r.empty;r.popFront()) {
         if(r.front == r.lookahead(1)[0]) break;
      }
      return r;
   }

Perhaps the mark and/or slice functionality could be leveraged to do something similar but be more performant.

If anyone can see how the new range type compares to ForwardRange that would be cool.

Joseph

December 31, 2013
On Tuesday, 31 December 2013 at 01:51:23 UTC, Brad Anderson wrote:
> On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
>> Proposal
>
> Having never written any parser I'm not really qualified to seriously give comments or review it but it all looks very nice to me.
>
> Speaking as just an end user of these things, whenever I use ranges over files or from, say, std.net.curl, the byLine/byChunk interface always feels terribly awkward to use, which often leads to me just giving up and loading the entire file/resource into an array. It's the boundaries that I stumble over. byLine never fits when I want to extract something multiline, but byChunk doesn't fit either because if what I'm searching for lands on a chunk boundary I'll miss it.
>
> Being able to just do a matchAll() on a file, std.net.curl, etc. without sacrificing performance and memory would be such a massive gain for usability.
>
> Just a simple example of where I couldn't figure out how to utilize either byLine or byChunk without adding some clunky homegrown buffering solution. This is code that scrapes website titles from the pages of URLs in IRC messages.
>
> [...]
>
> If I'm understanding your proposal correctly that get(url) could be replaced with a hypothetical std.net.curl "buffer range" and that could be passed directly to matchFirst. It would only take up, at most, the size of the buffer in memory (which could grow if the capture grows to be larger than the buffer) and wouldn't read the unneeded portion of the resource at all. That would be such a huge win for everyone so I'm very excited about this proposal. It addresses all of my current problems.
>
> [...]

I find the same to be the case with the APIs used to process files. It would be a real win to have a simple, efficient, and possibly non-blocking range interface over a file, able to access a single byte or character at a time.

I remember a discussion on this forum that spoke positively about the async and await combination of keywords in .NET 4.5. If asynchronous and synchronous functionality could be combined with a flexible buffer in a UFCS call chain I think that would hit a sweet spot for code composability and understandability.

Joseph
December 31, 2013
On 31-Dec-2013 05:53, Joseph Cassman wrote:
> On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
>> [...]
>
> Interesting idea. Seems to fill a need I have been facing with some
> parsing code.
>
> Since I was unclear about how its functionality compares to ForwardRange
> I took a look through std.algorithm. If typed versions of
> lookahead/lookbehind were added it seems like ForwardRange could be
> replaced with this new range type, at least in certain places. For
> example, it seems to me that the following code (from the article "On
> Iteration"
> https://www.informit.com/articles/article.aspx?p=1407357&seqNum=7)
>
>     ForwardRange findAdjacent(ForwardRange r){
>        if (!r.empty()) {
>           auto s = r.save();
>           s.popFront();
>           for (; !s.empty();
>                 r.popFront(), s.popFront()) {
>              if (r.front() == s.front()) break;
>           }
>        }
>        return r;
>     }
>
The trick is that the name ForwardRange is actually misleading (I was surprised myself). A proper name would be ForkableRange; indeed, look at this code:
       auto s = r.save();
       s.popFront();

s is a fork of r (an independent view that has a life of its own).
The original algorithm even _reads_ easier if the correct term is used:

       auto s = r.fork();
       s.popFront();
       ...
Now it's clear that we fork a range and advance the forked copy one step ahead of the original.

> also could be written as
>
>     BufferedRange findAdjacent(BufferedRange r) {
>        for(;!r.empty;r.popFront()) {
>           if(r.front == r.lookahead(1)[0]) break;
>        }
>        return r;
>     }
>
Aye, just don't forget to test lookahead for empty instead of r.empty.

> Perhaps the mark and/or slice functionality could be leveraged to do
> something similar but be more performant.
>
> If anyone can see how the new range type compares to ForwardRange that
> would be cool.

I'm thinking there might be a way to bridge the new range type with ForwardRange but not directly as defined at the moment.

A possibility I consider is to separate a Buffer object (not a range), and let it be shared among views - light-weight buffer-ranges. Then if we imagine that these light-weight buffer-ranges are working as marks (i.e. they pin down the buffer) in the current proposal then they could be forward ranges.
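That possibility might look roughly like this (purely speculative sketch; `available`/`peekAt`/`pin` are hypothetical names for the shared Buffer's interface):

```d
// A shared Buffer object plus light-weight views; each view pins its
// position, so copying a view is the cheap 'save'/fork a forward range
// needs, without duplicating any buffered data.
struct BufferView(Buffer)
{
    Buffer* buf;   // shared sliding window
    size_t  pos;   // offset; doubles as a mark pinning the buffer

    @property bool empty()  { return !buf.available(pos); }
    @property ubyte front() { return buf.peekAt(pos); }
    void popFront()         { ++pos; }

    // ForwardRange 'save' becomes a trivial copy plus a pin-count bump.
    BufferView save() { buf.pin(pos); return this; }
}
```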

I need to think on this, as the ability to integrate well with forwardish algorithms would be a great improvement.

-- 
Dmitry Olshansky