Thread overview
byChunk odd behavior?
Mar 22, 2016
Hanh
Mar 22, 2016
Hanh
Mar 22, 2016
Taylor Hillegeist
Mar 22, 2016
Ali Çehreli
Mar 22, 2016
cy
Mar 23, 2016
Hanh
Mar 23, 2016
Chris Wright
Mar 23, 2016
cym13
Mar 24, 2016
Hanh
Mar 25, 2016
cym13
Mar 26, 2016
Hanh
Mar 26, 2016
cym13
Mar 26, 2016
Hanh
March 22, 2016
Hi all,

I'm trying to process a rather large file as an InputRange and run into something strange with byChunk / take.

void test() {
	auto file = new File("test.txt");
	auto input = file.byChunk(2).joiner;
	input.take(3).array;
	foreach (char c; input) {
		writeln(c);
	}
}

Let's say test.txt contains "123456".

The output will be
3
4
5
6

The "take" consumed one chunk from the file, but if I increase the chunk size to 4, then it won't.

It looks like if "take" spans two chunks, it affects the input range otherwise it doesn't.

Actually, what is the easiest way to read a large file as a stream? My file contains a bunch of serialized messages of variable length.

Thanks,
--h




March 22, 2016
On Tuesday, 22 March 2016 at 07:17:41 UTC, Hanh wrote:
> Hi all,
>
> I'm trying to process a rather large file as an InputRange and run into something strange with byChunk / take.
>
> void test() {
> 	auto file = new File("test.txt");
> 	auto input = file.byChunk(2).joiner;
> 	input.take(3).array;
> 	foreach (char c; input) {
> 		writeln(c);
> 	}
> }
>
> Let's say test.txt contains "123456".
>
> The output will be
> 3
> 4
> 5
> 6
>
> The "take" consumed one chunk from the file, but if I increase the chunk size to 4, then it won't.
>
> It looks like if "take" spans two chunks, it affects the input range otherwise it doesn't.
>
> Actually, what is the easiest way to read a large file as a stream? My file contains a bunch of serialized messages of variable length.
>
> Thanks,
> --h

I have the feeling that it's related to the forward only nature of an InputRange. All would be ok with a take(N)+popFrontN method. I'm going to keep looking.
March 22, 2016
On Tuesday, 22 March 2016 at 07:17:41 UTC, Hanh wrote:
> Hi all,
>
> I'm trying to process a rather large file as an InputRange and run into something strange with byChunk / take.
>
> void test() {
> 	auto file = new File("test.txt");
> 	auto input = file.byChunk(2).joiner;
> 	input.take(3).array;
> 	foreach (char c; input) {
> 		writeln(c);
> 	}
> }
>
> Let's say test.txt contains "123456".
>
> The output will be
> 3
> 4
> 5
> 6
>
> The "take" consumed one chunk from the file, but if I increase the chunk size to 4, then it won't.
>
> It looks like if "take" spans two chunks, it affects the input range otherwise it doesn't.
>
> Actually, what is the easiest way to read a large file as a stream? My file contains a bunch of serialized messages of variable length.
>
> Thanks,
> --h

I don't know if this helps, but it looks like take(3) doesn't consume the whole chunk, so the chunk is not removed from the range.

import std.stdio;
import std.algorithm;
import std.range;

void main() {
	auto file = stdin;
	auto input = file.byChunk(2).joiner;
	
	foreach (char c; input.take(3).array) {
		writeln(c);
	}
	
	foreach (char c; input) {
		writeln(c);
	}
}

Produces:
1
2
3 < Got data but didn't eat the chunk.
3
4
5
6
March 22, 2016
On 03/22/2016 12:17 AM, Hanh wrote:
> Hi all,
>
> I'm trying to process a rather large file as an InputRange and run into
> something strange with byChunk / take.
>
> void test() {
>      auto file = new File("test.txt");
>      auto input = file.byChunk(2).joiner;
>      input.take(3).array;
>      foreach (char c; input) {
>          writeln(c);
>      }
> }
>
> Let's say test.txt contains "123456".
>
> The output will be
> 3
> 4
> 5
> 6
>
> The "take" consumed one chunk from the file, but if I increase the chunk
> size to 4, then it won't.

I don't understand the issue fully but byChunk() will treat every character in the file. So, even the newline character(s) are considered.

> Actually, what is the easiest way to read a large file as a stream? My
> file contains a bunch of serialized messages of variable length.

If it's a text file I think I would start with File.byLine (or byLineCopy). Then it depends on how the messages are laid out. One per line? Do you know the size at the start? etc.

Alternatively, use (or examine) one of the great D serialization modules out there. :)

(We already need something like this in the standard library, which I think some people are already working on.)

Ali

March 22, 2016
On Tuesday, 22 March 2016 at 07:17:41 UTC, Hanh wrote:
> 	input.take(3).array;
> 	foreach (char c; input) {

Never use an input range twice. So, here's how to use it twice:

If it's a "forward range" you can use save() to get a copy to use later (but none of the std.stdio ranges implement that). You can also use "std.range.tee" to send the results to an "output range" (something implementing put(K)(K)) while iterating over them.

tee can't produce two input ranges, because without caching all iterated items in memory, only one range can request items on-demand; the other must take them passively.

You could write a thing that takes an InputRange and produces a ForwardRange, by caching those items in memory, but at that point you might as well use .array and get the whole thing.

ByChunk is an input range (not a forward range), so there's undefined behavior when you use it twice. No bugs there, since it wasn't meant to be reused anyway. What it does is cache the last seen chunk, first iterate over that, then read more chunks from the file. So every time you iterate, you'll get that same last chunk.

It's also tricky to use input ranges after mutating their underlying data structure. If you seek in the file, for instance, then a previously created ByChunk will produce the chunk it has cached, and only then start reading chunks from that exact position in the file. A range over some sort of list, if you delete the current item in the list, should the range produce the previous item? The next item? null?

So, as a general rule, never use input ranges twice, and never use them after mutating the underlying data structure. Just recreate them if you want to do something twice, or use tee as mentioned above.
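
If the goal is to consume a prefix and then carry on with the same range, one option not mentioned above is std.range.refRange, which wraps a pointer to the range so that algorithms advance the original instead of a copy. The following is only a sketch: it substitutes an in-memory range for file.byChunk(2).joiner (an assumption, so that it runs without test.txt), and the helper name takeFront is hypothetical. It has only been reasoned through for non-sliceable ranges such as the joiner result; for plain arrays, slicing is simpler.

```d
import std.algorithm : equal, joiner;
import std.array : array;
import std.range : chunks, refRange, take;

/// Hypothetical helper: copy up to n elements out of `r` and leave `r`
/// positioned just past them. refRange makes take() advance the caller's
/// range rather than a private copy of it.
auto takeFront(R)(ref R r, size_t n)
{
    return refRange(&r).take(n).array;
}

void main()
{
    // In-memory stand-in for file.byChunk(2).joiner (assumption).
    auto input = [1, 2, 3, 4, 5, 6].chunks(2).joiner;

    auto head = takeFront(input, 3);
    assert(head == [1, 2, 3]);

    // The original range resumes right after the elements taken.
    assert(input.equal([4, 5, 6]));
}
```

The range is still only traversed once; the wrapper just makes sure every step of that traversal happens on the same object.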
March 23, 2016
Thanks for your help everyone.

I agree that the issue is due to misuse of an InputRange, but what are the semantics of 'take' when applied to an InputRange? It seems that calling it invalidates the range, in which case what is the recommended way to get a few bytes and keep on advancing?

For instance, to read a ushort, I use
range.read!(ushort)()
Unfortunately, it reads a single value.

For now, I use a loop

foreach (i; 0..N) {
  buffer[i] = range.front;
  range.popFront();
}

Is there a more idiomatic way to do the same thing?

In Scala, 'take' consumes bytes from the iterator. So the same code would be
buffer = range.take(N).toArray

March 23, 2016
On Wed, 23 Mar 2016 03:17:05 +0000, Hanh wrote:
> In Scala, 'take' consumes bytes from the iterator. So the same code would be buffer = range.take(N).toArray

import std.range, std.array;
auto bytes = byteRange.takeExactly(N).array;

There's also take(N), but if the range contains fewer than N elements, it will only give you as many as the range contains. If you're trying to deserialize something, takeExactly is probably better.


http://dpldocs.info/experimental-docs/std.range.takeExactly.html
http://dpldocs.info/experimental-docs/std.array.array.1.html
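
A small sketch of the difference, using an int array as a stand-in for the byte range (an assumption, so that the contents are known up front):

```d
import std.array : array;
import std.range : take, takeExactly;

void main()
{
    int[] data = [1, 2, 3];

    // take(N) quietly yields fewer elements when the range runs short.
    assert(data.take(5).array == [1, 2, 3]);

    // takeExactly(N) promises exactly N elements, so its result always
    // has a length.
    auto firstTwo = data.takeExactly(2);
    assert(firstTwo.length == 2);
    assert(firstTwo.array == [1, 2]);
}
```

With takeExactly, making sure the range really holds N elements is the caller's job, so a truncated input should be checked for before relying on it.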
March 23, 2016
On Wednesday, 23 March 2016 at 03:17:05 UTC, Hanh wrote:
> Thanks for your help everyone.
>
> I agree that the issue is due to the misusage of an InputRange but what is the semantics of 'take' when applied to an InputRange? It seems that calling it invalidates the range; in which case what is the recommended way to get a few bytes and keep on advancing.

Doing *anything* to a range invalidates it (or at least you should expect it to): a range is read-once. Never reuse a range. Some ranges can be saved so that you can work on a copy, but never expect a range to be implicitly reusable.

> For instance, to read a ushort, I use
> range.read!(ushort)()
> Unfortunately, it reads a single value.
>
> For now, I use a loop
>
> foreach (i; 0..N) {
>   buffer[i] = range.front;
>   range.popFront();
> }
>
> Is there a more idiomatic way to do the same thing?

Two ways, the first one being for reference:

    import std.range: enumerate;
    // enumerate yields (index, element) pairs, in that order
    foreach (index, element; range.enumerate) {
        buffer[index] = element;
    }

And the other one

> In Scala, 'take' consumes bytes from the iterator. So the same code would be
> buffer = range.take(N).toArray

Then just do that!

    import std.range, std.array;
    auto buffer = range.take(N).array;

    auto example = iota(0, 200, 5).take(5).array;
    assert(example == [0, 5, 10, 15, 20]);

March 24, 2016
On Wednesday, 23 March 2016 at 19:07:34 UTC, cym13 wrote:

>> In Scala, 'take' consumes bytes from the iterator. So the same code would be
>> buffer = range.take(N).toArray
>
> Then just do that!
>
>     import std.range, std.array;
>     auto buffer = range.take(N).array;
>
>     auto example = iota(0, 200, 5).take(5).array;
>     assert(example == [0, 5, 10, 15, 20]);

Well, that's what I do in the first post but you can't call it twice with an InputRange.

auto buffer1 = range.take(4).array; // ok
range.popFrontN(4); // not ok
auto buffer2 = range.take(4).array; // not ok

March 25, 2016
On Thursday, 24 March 2016 at 07:52:27 UTC, Hanh wrote:
> On Wednesday, 23 March 2016 at 19:07:34 UTC, cym13 wrote:
>
>>> In Scala, 'take' consumes bytes from the iterator. So the same code would be
>>> buffer = range.take(N).toArray
>>
>> Then just do that!
>>
>>     import std.range, std.array;
>>     auto buffer = range.take(N).array;
>>
>>     auto example = iota(0, 200, 5).take(5).array;
>>     assert(example == [0, 5, 10, 15, 20]);
>
> Well, that's what I do in the first post but you can't call it twice with an InputRange.
>
> auto buffer1 = range.take(4).array; // ok
> range.popFrontN(4); // not ok
> auto buffer2 = range.take(4).array; // not ok

Please, take some time to reread cy's answer above.

    void main(string[] args) {
        import std.range;
        import std.array;
        import std.algorithm;

        auto range = iota(0, 25, 5);

        // Will not consume (forward ranges only)
        //
        // Note however that range elements are not stored in any way by default
        // so reusing the range will also need you to recompute them each time!
        auto buffer1 = range.save.take(4).array;
        assert(buffer1 == [0, 5, 10, 15]);

        // The solution to the recomputation problem, and often the best way to
        // handle range reuse, is to store the elements in an array
        //
        // This is reusable at will with no redundant computation
        auto arr = range.save.array;
        assert(arr == [0, 5, 10, 15, 20]);

        // And it has a range interface too
        auto buffer2 = arr.take(4).array;
        assert(buffer2 == [0, 5, 10, 15]);

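        // This consumes when the range has reference semantics (e.g. one
        // backed by a file); iota is a value type, so take() gets a copy here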
        auto buffer3 = range.take(4).array;
        assert(buffer3 == [0, 5, 10, 15]);
    }
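
For the original use case (variable-length messages read from a stream), here is a minimal sketch that pulls one message at a time from a single range without ever reusing it. The framing is an assumption, since the thread never states the real format: one length byte followed by that many payload bytes. The name nextMessage is hypothetical, and an in-memory array stands in for the file-backed range.

```d
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;

/// Hypothetical framing: one length byte, then that many payload bytes.
/// Reads the next message from `src` and advances `src` past it.
ubyte[] nextMessage(R)(ref R src)
if (isInputRange!R && is(ElementType!R : ubyte))
{
    immutable size_t len = src.front; // length prefix
    src.popFront();

    auto payload = new ubyte[](len);
    foreach (ref b; payload)
    {
        if (src.empty)
            throw new Exception("truncated message");
        b = src.front;
        src.popFront();
    }
    return payload;
}

void main()
{
    // Two messages, [10, 20] and [30]; stand-in for file.byChunk(n).joiner.
    ubyte[] stream = [2, 10, 20, 1, 30];

    ubyte[][] messages;
    while (!stream.empty)
        messages ~= nextMessage(stream);

    ubyte[] first = [10, 20];
    ubyte[] second = [30];
    assert(messages.length == 2);
    assert(messages[0] == first);
    assert(messages[1] == second);
}
```

Because the range is passed by ref and advanced only through front/popFront, the same function should work for any input range of bytes, whatever the chunk size.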
