Re: D2 byChunk
Dec 11, 2010
Matthias Walter
December 11, 2010
On 12/10/2010 09:57 PM, Matthias Walter wrote:
> Hi all,
>
> I am currently working on a parser for a file format. I wanted to use the std.stdio.ByChunk range to read from a file and extract tokens from the chunks. Obviously, the current chunk can end before a token has been fully extracted, in which case I can ask the range for the next chunk. To retain the part already read, I need to dup at least the unprocessed part of the old chunk and concatenate it in front of the next one, or at least write code that behaves as if they were concatenated. This looks like a stupid approach to me.
>
> Here is a small example:
>
> file contents: "Hello world"
> chunks: "Hello w" "orld"
>
> First I read the token "Hello" from the first chunk and maybe skip the whitespace. Then I am left with the "w" (which I need to move out of the buffer, because ByChunk will overwrite it) and get "orld".
>
> My idea was to have a ByChunk-like object that the user can tell how much of the buffer he/she actually used, so that it can move the unused remainder to the beginning of the buffer and append the next chunk after it. This would need no further allocations and would give the user contiguous data to work with.
I coded something that works like this:

foreach (ref ubyte[] data; byBuffer(file, 12))
{
  writefln("[%s]", cast(string) data);
  data = data[$-2 .. $];
}

The second line of the loop body tells ByBuffer that we didn't process the last two chars and would like to get them again along with the newly read data. As long as we process something in each iteration, the internal buffer is not reallocated.

It works and respects the formal requirements of ranges. Whether it respects the intended semantics is open to discussion. Any comments on whether the above makes sense, or is it an evil exploit of the provided syntactic sugar?
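
For readers who want to picture how the loop above could be driven, here is a minimal sketch. It is a guess, not the code from this post: the struct layout, the use of opApply, the growth policy and the helper signature are all assumptions. It only illustrates the idea of moving the unprocessed slice to the front of a reused buffer and reading new data behind it.

```d
import std.stdio : File;
import core.stdc.string : memmove;

/// Hypothetical reconstruction; the post does not show the real ByBuffer.
struct ByBuffer
{
    File file;
    ubyte[] buffer; // reused between iterations; must not be empty

    int opApply(scope int delegate(ref ubyte[]) dg)
    {
        assert(buffer.length > 0, "buffer must not be empty");
        size_t kept = 0; // unprocessed bytes sitting at the front of buffer
        for (;;)
        {
            // Append fresh bytes behind the kept ones.
            auto fresh = file.rawRead(buffer[kept .. $]);
            if (fresh.length == 0)
                break; // end of file; bytes still kept are not delivered again

            ubyte[] data = buffer[0 .. kept + fresh.length];
            if (auto result = dg(data))
                return result; // the loop body used break or return

            // Whatever slice the body left in data is the unprocessed rest.
            assert(data.length == 0 ||
                   (data.ptr >= buffer.ptr &&
                    data.ptr + data.length <= buffer.ptr + buffer.length),
                   "data must remain a slice of the internal buffer");
            kept = data.length;
            if (kept > 0 && data.ptr !is buffer.ptr)
                memmove(buffer.ptr, data.ptr, kept); // regions may overlap
            if (kept == buffer.length)
                buffer.length = 2 * buffer.length; // nothing processed: grow
        }
        return 0;
    }
}

/// Assumed helper matching the call in the loop above.
ByBuffer byBuffer(File file, size_t size)
{
    return ByBuffer(file, new ubyte[size]);
}
```

With a file containing "Hello world", a 7-byte buffer and a loop body that keeps the last byte (data = data[$-1 .. $]), this sketch hands out "Hello w" and then "world". Note that in this sketch the loop body must leave data pointing into the internal buffer, and anything still kept when the file ends is simply dropped.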
December 11, 2010
On 12/10/10 22:36, Matthias Walter wrote:
> I coded something that works like this:
> 
> foreach (ref ubyte[] data; byBuffer(file, 12))
> {
>   writefln("[%s]", cast(string) data);
>   data = data[$-2 .. $];
> }
> 
> The second line of the loop body tells ByBuffer that we didn't process the last two chars and would like to get them again along with the newly read data. As long as we process something in each iteration, the internal buffer is not reallocated.
> 
> It works and respects the formal requirements of ranges. Whether it respects the intended semantics is open to discussion. Any comments on whether the above makes sense, or is it an evil exploit of the provided syntactic sugar?

I don't think it's a bad approach, but I have a suggestion.

It leaves a lot of room for abuse or misuse if you require the user code to modify the data[] array in order to send this "protect some characters" message.  I think it would be better to provide an explicit function/method that means precisely that.  Maybe return a transparent struct wrapping a view to the buffer's data, that further provides a function for doing precisely this.

foreach( data; byBuffer( file, 12 )) {
  // do things with data, decide we need to keep 2 chars
  data.save( 2 );
}

Or something like it.  With regard to this, you may want to allow the internal buffer to grow as needed (if you aren't doing so already).  Imagine what would otherwise happen if you needed to 'save' the entire current buffer.
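
To make the suggestion a bit more concrete, here is a rough sketch of what such a wrapper might look like. Every name in it (Chunk, keepLast, ByBufferExplicit, byBufferExplicit) is made up for illustration, and the method is called keepLast rather than save only because save already has a meaning for forward ranges in std.range. The wrapper exposes the buffer view through alias this and reports the kept count through a pointer, so a by-value loop variable still works.

```d
import std.stdio : File;
import core.stdc.string : memmove;

/// Hypothetical view handed to the loop body; behaves like a ubyte[].
struct Chunk
{
    ubyte[] data;            // view into the internal buffer
    alias data this;         // makes the wrapper transparent
    private size_t* keepCount;

    /// Ask to see the last n bytes again, in front of the next read.
    void keepLast(size_t n)
    {
        assert(n <= data.length, "cannot keep more than was delivered");
        *keepCount = n;
    }
}

/// Hypothetical range-like type; not an existing Phobos API.
struct ByBufferExplicit
{
    File file;
    ubyte[] buffer; // must not be empty

    int opApply(scope int delegate(Chunk) dg)
    {
        assert(buffer.length > 0, "buffer must not be empty");
        size_t kept = 0;
        for (;;)
        {
            auto fresh = file.rawRead(buffer[kept .. $]);
            if (fresh.length == 0)
                break; // end of file

            immutable filled = kept + fresh.length;
            size_t keep = 0; // default: keep nothing, like byChunk
            if (auto result = dg(Chunk(buffer[0 .. filled], &keep)))
                return result;

            kept = keep;
            // Move the protected tail to the front (regions may overlap).
            memmove(buffer.ptr, buffer.ptr + (filled - kept), kept);
            if (kept == buffer.length)
                buffer.length = 2 * buffer.length; // whole buffer kept: grow
        }
        return 0;
    }
}

ByBufferExplicit byBufferExplicit(File file, size_t size)
{
    return ByBufferExplicit(file, new ubyte[size]);
}
```

If the loop body never calls keepLast, each iteration just sees the next chunk, which is the plain ByChunk behavior. If it keeps the whole buffer, the sketch doubles the buffer so that the next read still has room.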

-- Chris N-S
December 11, 2010

On 12/11/2010 01:00 AM, Christopher Nicholson-Sauls wrote:
> I don't think it's a bad approach, but I have a suggestion.
>
> It leaves a lot of room for abuse or misuse if you require the user code to modify the data[] array in order to send this "protect some characters" message.  I think it would be better to provide an explicit function/method that means precisely that.  Maybe return a transparent struct wrapping a view to the buffer's data, that further provides a function for doing precisely this.
>
> foreach( data; byBuffer( file, 12 )) {
>   // do things with data, decide we need to keep 2 chars
>   data.save( 2 );
> }
>
> Or something like it.  With regard to this, you may want to allow the internal buffer to grow as needed (if you aren't doing so already).  Imagine what would otherwise happen if you needed to 'save' the entire current buffer.
>
> -- Chris N-S
Thank you! This is a really good idea. So I basically wrap the buffer array and implement it so that the default behavior (when the user does nothing explicitly) matches the ByChunk mechanism.

Matthias