December 31, 2013
31-Dec-2013 05:51, Brad Anderson writes:
> On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
>> Proposal
>
> Having never written any parser I'm not really qualified to seriously
> give comments or review it but it all looks very nice to me.
>
> Speaking as just an end user of these things whenever I use ranges over
> files or from, say, std.net.curl the byLine/byChunk interface always
> feels terribly awkward to use which often leads to me just giving up and
> loading the entire file/resource into an array. It's the boundaries that
> I stumble over. byLine never fits when I want to extract something
> multiline, and byChunk doesn't fit because if what I'm searching for
> lands on a chunk boundary I'll miss it.

Exactly, the situation is simply not good enough. I can assure you that on the side of parser writers it's even less appealing.
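The failure mode is easy to demonstrate; a minimal sketch of my own (not from the original post), using std.range.chunks as a stand-in for byChunk:

```d
import std.algorithm.searching : canFind;
import std.range : chunks;

void main()
{
    // "needle" straddles the boundary between the two 8-byte chunks.
    auto data = "xxxxxxneedlexxxx";

    bool found = false;
    foreach (chunk; data.chunks(8))     // like File.byChunk(8)
        if (chunk.canFind("needle"))
            found = true;

    assert(!found);                 // missed: no single chunk contains it
    assert(data.canFind("needle")); // yet the input as a whole does
}
```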

>
> Being able to just do a matchAll() on a file, std.net.curl, etc. without
> sacrificing performance and memory would be such a massive gain for
> usability.

.. and performance ;)

>
> Just a simple example of where I couldn't figure out how to utilize
> either byLine or byChunk without adding some clunky homegrown buffering
> solution. This is code that scrapes website titles from the pages of
> URLs in IRC messages.
[snip]
>
> I really, really didn't want to use that std.net.curl.get().  It causes
> all sorts of problems if someone links to a huge resource.

*Nods*

> I just could
> not figure out how to utilize byLine (the title regex capture can be
> multiline) or byChunk cleanly. Code elegance (a lot of it due to Jakob
> Ovrum's help in IRC) was really a goal here as this is just a toy so I
> went with get() for the time being but it's always sad to sacrifice
> elegance for performance. I certainly didn't want to add some elaborate
> evergrowing buffer in the middle of this otherwise clean UFCS chain (and
> I'm not even sure how to incrementally regex search the growing buffer
> or if that's even possible).

I thought about providing something like that: an incremental match that takes pieces of data slice by slice, with some kind of not-yet-matched state object to mess with. But it was solving the wrong problem. And it shows that backtracking engines simply can't work like that; they would want to go back to the prior pieces.

>
> If I'm understanding your proposal correctly that get(url) could be
> replaced with a hypothetical std.net.curl "buffer range" and that could
> be passed directly to matchFirst. It would only take up, at most, the
> size of the buffer in memory (which could grow if the capture grows to
> be larger than the buffer) and wouldn't read the unneeded portion of the
> resource at all. That would be such a huge win for everyone so I'm very
> excited about this proposal. It addresses all of my current problems.

That's indeed what the proposal is all about. Glad it makes sense :)

>
>
> P.S. I love std.regex more and more every day. It made that
> entitiesToUni function so easy to implement: http://dpaste.dzfl.pl/688f2e7d

Aye, replace with functor rox!

-- 
Dmitry Olshansky
December 31, 2013
On Tuesday, 31 December 2013 at 09:04:58 UTC, Dmitry Olshansky wrote:
> 31-Dec-2013 05:53, Joseph Cassman writes:
>> On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
>
> I'm thinking there might be a way to bridge the new range type with ForwardRange but not directly as defined at the moment.
>
> A possibility I consider is to separate a Buffer object (not a range), and let it be shared among views - light-weight buffer-ranges. Then if we imagine these light-weight buffer-ranges working as the marks of the current proposal (i.e. they pin down the buffer), they could be forward ranges.
>
> I need to think on this, as the ability to integrate well with forwardish algorithms would be a great improvement.

I think I now understand a bit better what you were thinking when you first posted:

> input-source <--> buffer range <--> parser/consumer
>
> Meaning that if we can mix and match parsers with buffer ranges, and buffer ranges with input sources, we will have grown something powerful indeed.

Being able to wrap an already-in-use range object with the buffer interface as you do in the sample code (https://github.com/blackwhale/datapicked/blob/master/dgrep.d) is good for composability. Also allows for existing functionality in std.algorithm to be reused as-is.

I think the new range type could also be added directly to some new, or perhaps retrofitted into existing, code to add the new functionality without sacrificing performance. In that way the internal payload already used to get the data (say by the input range) could be reused without having to allocate new memory to support the buffer API.

As one idea of using a buffer range from the start, a function template by(T) (where T is ubyte, char, wchar, or dchar) could be added to std.stdio. It would return a buffer range object providing more functionality than byChunk or byLine while adding access to the entire stream of data in a file in a contiguous and yet efficient manner. Seems to help with the issues faced in processing file data mentioned in previous comments in this thread.
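As a rough illustration of that idea (entirely hypothetical: no such `by` exists in std.stdio, and the regex line is only a comment), usage might look like:

```d
import std.stdio : File;

void main()
{
    // Hypothetical: a buffer range over the file's entire character stream.
    auto input = File("page.html").by!char;

    // A multiline pattern could then be matched directly against the file,
    // buffering only as much of it as the match in progress requires:
    // auto title = matchFirst(input, regex(`<title>(.*?)</title>`, "s"));
}
```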

Joseph
January 04, 2014
I've been rewriting a bit of the lexer in DScanner.

https://github.com/Hackerpilot/Dscanner/blob/NewLexer/stdx/lexer.d
(Ignore the "invariant" block. I've been trying to hunt down some unrelated memory corruption issue.)

One thing that I've found to be very useful is the ability to increment column or index counters inside of the lexer range's popFront method. I think that I'd end up using my own range type for arrays, but construct a lexer range on top of your buffer range for anything else.
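That pattern can be sketched as a thin wrapper (illustrative names, not DScanner's actual code) whose popFront maintains the counters:

```d
// A forward-only wrapper that tracks line, column and byte index
// as the underlying character range is consumed.
struct TrackingRange(R)
{
    R source;
    size_t line = 1;
    size_t column = 1;
    size_t index = 0;

    bool empty() { return source.empty; }
    auto front() { return source.front; }

    void popFront()
    {
        if (source.front == '\n')
        {
            ++line;
            column = 1;
        }
        else
            ++column;
        ++index;
        source.popFront();
    }
}
```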
January 04, 2014
04-Jan-2014 13:39, Brian Schott writes:
> I've been rewriting a bit of the lexer in DScanner.
>
> https://github.com/Hackerpilot/Dscanner/blob/NewLexer/stdx/lexer.d
> (Ignore the "invariant" block. I've been trying to hunt down some
> unrelated memory corruption issue)

startsWith & peek could be implemented with the proposed lookahead, so overall the buffer range looks like a good fit for your use case.

>
> One thing that I've found to be very useful is the ability to increment
> column or index counters inside of the lexer range's popFront method.

> I think that I'd end up using my own range type for arrays, but construct
> a lexer range on top of your buffer range for anything else.

I think it should be possible to wrap a buffer range, add the related accounting, and simplify the interface for your own use.

As an experiment I have a greatly simplified buffer range - it builds on top of forward range, making it instantly compatible with most of std.algorithm. The only problems at the moment are slightly worse performance with DMD (who cares) and a segfault with LDC (and this is a problem). It would be nice to test on LDC first, but I think I'll publish it anyway; stay tuned.


-- 
Dmitry Olshansky
January 04, 2014
31-Dec-2013 22:46, Joseph Cassman writes:
> On Tuesday, 31 December 2013 at 09:04:58 UTC, Dmitry Olshansky wrote:
>> 31-Dec-2013 05:53, Joseph Cassman writes:
>>> On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:

>> I'm thinking there might be a way to bridge the new range type with
>> ForwardRange but not directly as defined at the moment.
>>
>> A possibility I consider is to separate a Buffer object (not a range),
>> and let it be shared among views - light-weight buffer-ranges. Then if
>> we imagine that these light-weight buffer-ranges are working as marks
>> (i.e. they pin down the buffer) in the current proposal then they
>> could be forward ranges.

I've created a fork where I've implemented just that.
As a bonus I also tweaked stream primitives so it now works with pipes or whatever input stdin happens to be.

Links stay the same:
Docs: http://blackwhale.github.io/datapicked/dpick.buffer.traits.html
Code: https://github.com/blackwhale/datapicked/tree/fwd-buffer-range/dpick/buffer

The description is largely simplified and the primitive count is reduced.

1. A buffer range is a forward range. It has reference semantics.
2. A copy produced by _save_ is an independent view of the underlying buffer (or window).
3. No bytes can be discarded that are seen in some existing view. Thus each reference pins its position in the buffer.
4. The 3 new primitives are:
   Range slice(BufferRange r);
Returns a slice of the window between the current range position and r. It must be a random access range.

   ptrdiff_t tell(BufferRange r);
Returns the difference in window positions between the current range and r. Note that unlike slice(r).length this can be both positive and negative.

   bool seek(ptrdiff_t ofs);
Resets the buffer state to an offset from the current position. The return value indicates success. It may fail if there is not enough data, or (if ofs is negative) if that portion of the data was already discarded.

5. Lookahead and lookbehind are extra primitives that were left intact for the moment. Where applicable a range may provide lookahead:

Range lookahead(); //as much as available in the window
Range lookahead(size_t n); // either n exactly or nothing if not

And lookbehind:

Range lookbehind(); //as much as available in the window
Range lookbehind(size_t n); //either n exactly or nothing if not

These should probably be tested as separate traits.
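To make the primitives concrete, here is a small usage sketch of mine (Buf stands for any conforming buffer range; names are illustrative):

```d
// Scan a run of digits, then recover the scanned span via the primitives.
void skipDigits(Buf)(ref Buf buf)
{
    import std.ascii : isDigit;

    auto start = buf.save;          // independent view; pins the window
    while (!buf.empty && isDigit(buf.front))
        buf.popFront();

    auto digits = buf.slice(start); // random-access slice of the scanned run
    auto dist = buf.tell(start);    // signed distance between the two views

    auto next2 = buf.lookahead(2);  // peek: exactly 2 bytes, or empty

    // Rewind over the run; cannot fail while `start` still pins that data.
    bool ok = buf.seek(-cast(ptrdiff_t) digits.length);
    assert(ok);
}
```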

>> input-source <--> buffer range <--> parser/consumer
>>
>> Meaning that if we can mix and match parsers with buffer ranges, and
>> buffer ranges with input sources we had grown something powerful indeed.
>
> Being able to wrap an already-in-use range object with the buffer
> interface as you do in the sample code
> (https://github.com/blackwhale/datapicked/blob/master/dgrep.d) is good
> for composability. Also allows for existing functionality in
> std.algorithm to be reused as-is.

It was more about wrapping an array, but it's got to integrate well with what we have. I could imagine a use case for buffering an input range.
Then I think a buffer range of anything other than bytes would be in order.

> I think the new range type could also be added directly to some new, or
> perhaps retrofitted into existing, code to add the new functionality
> without sacrificing performance. In that way the internal payload
> already used to get the data (say by the input range) could be reused
> without having to allocate new memory to support the buffer API.
>
> As one idea of using a buffer range from the start, a function template
> by(T) (where T is ubyte, char, wchar, or dchar) could be added to
> std.stdio.

IMHO C run-time I/O has no use in D. The amount of work spent on special-casing the non-locking primitives of each C run-time,
repeating legacy mistakes (like text mode, codepages and locales) and stumbling on portability problems (getc is a macro we can't have) would have been better spent elsewhere - designing our own I/O framework.

I've put together something pretty simple and fast for a buffer range directly on top of native I/O:
https://github.com/blackwhale/datapicked/blob/fwd-buffer-range/dpick/buffer/stream.d

It needs somewhat better error messages than a naked enforce, and a few tweaks to memory management. It already runs circles around the existing std.stdio.

> It would return a buffer range object providing more
> functionality than byChunk or byLine while adding access to the entire
> stream of data in a file in a contiguous and yet efficient manner.

Drop 'efficient' if we're talking about interfacing with the C run-time. Otherwise, yes, absolutely.

> Seems
> to help with the issues faced in processing file data mentioned in
> previous comments in this thread.


-- 
Dmitry Olshansky
January 05, 2014
On Saturday, 4 January 2014 at 13:32:15 UTC, Dmitry Olshansky wrote:
> IMHO C run-time I/O has no use in D. The amount of work spent on special-casing the non-locking primitives of each C run-time,
> repeating legacy mistakes (like text mode, codepages and locales) and stumbling on portability problems (getc is a macro we can't have) would have been better spent elsewhere - designing our own I/O framework.

I agree. I wrote a (mostly complete) file stream implementation that uses the native I/O API:

    https://github.com/jasonwhite/io/blob/master/src/io/file.d

It allows for more robust open-flags than the fopen-style flags (like "r+"). For seeking, I adapted it to use your concept of marks (which I quite like) instead of SEEK_CUR and friends.

Please feel free to use this! (The Windows implementation hasn't been tested yet, so it probably doesn't work.)

BTW, I was also working on buffered streams, but couldn't figure out a good way to do it.
January 05, 2014
05-Jan-2014 09:22, Jason White writes:
> On Saturday, 4 January 2014 at 13:32:15 UTC, Dmitry Olshansky wrote:
>> IMHO C run-time I/O has no use in D. The amount of work spent on
>> special-casing the non-locking primitives of each C run-time,
>> repeating legacy mistakes (like text mode, codepages and locales) and
>> stumbling on portability problems (getc is a macro we can't have)
>> would have been better spent elsewhere - designing our own I/O framework.
>
> I agree. I wrote a (mostly complete) file stream implementation that
> uses the native I/O API:
>
>      https://github.com/jasonwhite/io/blob/master/src/io/file.d
>
As a piece of advice, I'd suggest dropping the 'Data' part in writeData/readData. It's obvious and adds no extra value.

> It allows for more robust open-flags than the fopen-style flags (like
> "r+"). For seeking, I adapted it to use your concept of marks (which I
> quite like) instead of SEEK_CUR and friends.
>
> Please feel free to use this! (The Windows implementation hasn't been
> tested yet, so it probably doesn't work.)

Will poke around. I like this (I mean composition):
https://github.com/jasonwhite/io/blob/master/src/io/stdio.d#L17

>
> BTW, I was also working on buffered streams, but couldn't figure out a
> good way to do it.


-- 
Dmitry Olshansky
January 05, 2014
On Sunday, 5 January 2014 at 09:33:46 UTC, Dmitry Olshansky wrote:
> As a piece of advice, I'd suggest dropping the 'Data' part in writeData/readData. It's obvious and adds no extra value.

You're right, but it avoids a name clash if it's composed with text writing. "write" would be used for text and "writeData" would be used for raw data. std.stdio.File uses the names rawRead/rawWrite to avoid that problem (which, I suppose, are more appropriate names).

> Will poke around. I like this (I mean composition):
> https://github.com/jasonwhite/io/blob/master/src/io/stdio.d#L17

Yeah, the idea is to separate buffering, text, and locking operations so that they can be composed with any other type of stream (e.g., files, in-memory arrays, or sockets). Currently, std.stdio has all three of those facets rolled into one.
January 05, 2014
05-Jan-2014 15:08, Jason White writes:
> On Sunday, 5 January 2014 at 09:33:46 UTC, Dmitry Olshansky wrote:
>> As a piece of advice, I'd suggest dropping the 'Data' part in
>> writeData/readData. It's obvious and adds no extra value.
>
> You're right, but it avoids a name clash if it's composed with text
> writing. "write" would be used for text and "writeData" would be used
> for raw data. std.stdio.File uses the names rawRead/rawWrite to avoid
> that problem (which, I suppose, are more appropriate names).
>

In my view text implies something like:

void write(const(char)[]);
size_t read(char[]);

And binary would be:

void write(const(ubyte)[]);
size_t read(ubyte[]);

Should not clash.

>> Will poke around. I like this (I mean composition):
>> https://github.com/jasonwhite/io/blob/master/src/io/stdio.d#L17
>
> Yeah, the idea is to separate buffering, text, and locking operations so
> that they can be composed with any other type of stream (e.g., files,
> in-memory arrays, or sockets).

An in-memory array IMHO had better not pretend to be a stream. This kind of wrapping goes in the wrong direction (losing capabilities). Instead, wrapping a stream and/or array as a buffer range has proved more natural to me (extending capabilities).

>Currently, std.stdio has all three of
> those facets rolled into one.

Locking, though, is the province of shared and may need a bit more thought.

-- 
Dmitry Olshansky
January 06, 2014
On Sunday, 5 January 2014 at 13:30:59 UTC, Dmitry Olshansky wrote:
> In my view text implies something like:
>
> void write(const(char)[]);
> size_t read(char[]);
>
> And binary would be:
>
> void write(const(ubyte)[]);
> size_t read(ubyte[]);
>
> Should not clash.

Those would do the same thing for either text or binary data. When I say text writing, I guess I mean the serialization of any type to text (like what std.stdio.write does):

    void write(T)(T value);         // Text writing
    void write(const(ubyte)[] buf); // Binary writing

    write([1, 2, 3]); // want to write "[1, 2, 3]"
                      // but writes "\x01\x02\x03"

This clashes. We need to be able to specify whether we want to write/read a text representation or just the raw binary data. In the above case, the most specialized overload will be called.
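A minimal sketch of my own (illustrative names) of resolving the clash the way std.stdio.File does, i.e. by giving the binary path a distinct rawWrite name:

```d
import std.conv : to;

struct Writer
{
    ubyte[] sink;

    // Text path: serialize any value to its string representation.
    void write(T)(T value)
    {
        sink ~= cast(const(ubyte)[]) value.to!string;
    }

    // Binary path: a distinct name sidesteps overload resolution entirely.
    void rawWrite(const(ubyte)[] buf)
    {
        sink ~= buf;
    }
}

unittest
{
    Writer w;
    w.write([1, 2, 3]);               // appends the text "[1, 2, 3]"
    w.rawWrite([0x01, 0x02, 0x03]);   // appends three raw bytes
}
```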

> An in-memory array IMHO had better not pretend to be a stream. This kind of wrapping goes in the wrong direction (losing capabilities). Instead, wrapping a stream and/or array as a buffer range has proved more natural to me (extending capabilities).

Shouldn't buffers/arrays provide a stream interface in addition to buffer-specific operations? I don't see why it would conflict with a range interface. As I understand it, ranges act on a single element at a time while streams act on multiple elements at a time. For ArrayBuffer in datapicked, a stream-style read is just lookahead(n) and cur += n. What capabilities are lost?

If buffers/arrays provide a stream interface, then they can be used by code that doesn't directly need the buffering capabilities but would still benefit from them.
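Sketched out (my names; assuming lookahead returns a slice of the window, as with ArrayBuffer in datapicked), such a stream-style read might be:

```d
// A stream-style read on top of a buffer range: copy from the window,
// then advance past what was copied.
size_t read(Buf)(ref Buf buf, ubyte[] dest)
{
    auto window = buf.lookahead(dest.length); // dest.length bytes, or empty
    if (window.length == 0)
        window = buf.lookahead();             // whatever the window still holds

    immutable n = window.length < dest.length ? window.length : dest.length;
    dest[0 .. n] = window[0 .. n];            // bulk copy out of the buffer
    buf.seek(n);                              // consume the copied bytes
    return n;
}
```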

>>Currently, std.stdio has all three of
>> those facets rolled into one.
>
> Locking, though, is the province of shared and may need a bit more thought.

Locking of streams is something that I haven't explored too deeply yet. Streams that communicate with the OS certainly need locking as thread locality makes no difference there.