October 13, 2017
On Friday, 13 October 2017 at 21:53:12 UTC, Steven Schveighoffer wrote:
> On 10/13/17 4:27 PM, Andrew Edwards wrote:
>> On Friday, 13 October 2017 at 19:17:54 UTC, Steven Schveighoffer wrote:
>>> On 10/13/17 2:47 PM, Andrew Edwards wrote:
>>>> A bit of advice, please. I'm trying to parse a gzipped JSON file retrieved from the internet. The following naive implementation accomplishes the task:
>>>>
>>>>      auto url = "http://api.syosetu.com/novelapi/api/?out=json&lim=500&gzip=5";
>>>>      getContent(url)
>>>>          .data
>>>>          .unzip
>>>>          .runEncoded!((input) {
>>>>              ubyte[] content;
>>>>              foreach (line; input.byLineRange!true) {
>>>>                  content ~= cast(ubyte[])line;
>>>>              }
>>>>              auto json = (cast(string)content).parseJSON;
>>>
>>> input is an iopipe of char, wchar, or dchar. There is no need to cast it around.
>> 
>> In this particular case, all three types (char[], wchar[], and dchar[]) are being returned at different points in the loop. I don't know of any other way to generate a unified buffer than casting it to ubyte[].
>
> This has to be a misunderstanding. The point of runEncoded is to figure out the correct type (based on the BOM), and run your lambda function with the correct type for the whole thing.

Maybe I'm just not finding the correct words to express my thoughts. This is what I mean:

// ===========

void main()
{
	auto url = "http://api.syosetu.com/novelapi/api/?out=json&lim=500&gzip=5";
	getContent(url)
		.data
		.unzip
		.runEncoded!((input) {
			char[] content; // Line 20
			foreach (line; input.byLineRange!true) {
				content ~= line;
			}
		});
}

output:
source/app.d(20,13): Error: cannot append type wchar[] to type char[]

Changing line 20 to wchar[] yields:
source/app.d(20,13): Error: cannot append type char[] to type wchar[]

And changing it to dchar[] yields:
source/app.d(20,13): Error: cannot append type char[] to type dchar[]

> Actually, I'm not sure this is even needed, as the data could be coming through without a BOM. Without a BOM, it assumes UTF8.
>
>>> Note also that getContent returns a complete body, but unzip may not be so forgiving. But there definitely isn't a reason to create your own buffer here.
>>>
>>> this should work (something like this really should be in iopipe):
>>>
>>> while(input.extend(0) != 0) {} // get data until EOF
>> 
>> This!!! This is what I was looking for. Thank you. I incorrectly assumed that if I didn't process the content of input.window, it would be overwritten on each .extend(), so my implementation was:
>> 
>> ubyte[] json;
>> while(input.extend(0) != 0) {
>>      json ~= input.window;
>> }
>> 
>> This didn't work because it invalidated the Unicode data, so I ended up splitting by line instead.
>> 
>> Sure enough, this is trivial once one knows how to use it correctly, but I think it would be better to put this in the library as extendAll().
>
> ensureElems(size_t.max) should be equivalent, though I see you responded cryptically with something about JSON there :)

:) I'll have to blame it on my Security+ training. Replacing the while loop with ensureElems() in the following results in an error:

void main()
{
	auto url = "http://api.syosetu.com/novelapi/api/?out=json&lim=500&gzip=5";
	getContent(url)
		.data
		.unzip
		.runEncoded!((input) {
			// while(input.extend(0) != 0){} // this works
			input.ensureElems(size_t.max); // this doesn't
			auto json = input.window.parseJSON;
			foreach (size_t ndx, _; json) {
				if (ndx == 0) continue;
				auto title = json[ndx]["title"].str;
				auto author = json[ndx]["writer"].str;
				writefln("title: %s", title);
				writefln("author: %s\n", author);
			}
		});
}

output:

Running ./uhost
std.json.JSONException@std/json.d(1400): Unexpected end of data. (Line 1:8192)
----------------
4   uhost                               0x000000010b671112 pure @safe void std.json.parseJSON!(char[]).parseJSON(char[], int, std.json.JSONOptions).error(immutable(char)[]) + 86

[etc]


October 13, 2017
On 10/13/17 6:24 PM, Andrew Edwards wrote:
> On Friday, 13 October 2017 at 21:53:12 UTC, Steven Schveighoffer wrote:
>> This has to be a misunderstanding. The point of runEncoded is to figure out the correct type (based on the BOM), and run your lambda function with the correct type for the whole thing.
> 
> Maybe I'm just not finding the correct words to express my thoughts. This is what I mean:
> 
> // ===========
> 
> void main()
> {
>      auto url = "http://api.syosetu.com/novelapi/api/?out=json&lim=500&gzip=5";
>      getContent(url)
>          .data
>          .unzip
>          .runEncoded!((input) {
>              char[] content; // Line 20
>              foreach (line; input.byLineRange!true) {
>                  content ~= line;
>              }
>          });
> }
> 
> output:
> source/app.d(20,13): Error: cannot append type wchar[] to type char[]
> 
> Changing line 20 to wchar[] yields:
> source/app.d(20,13): Error: cannot append type char[] to type wchar[]
> 
> And changing it to dchar[] yields:
> source/app.d(20,13): Error: cannot append type char[] to type dchar[]

Ah, OK. So the way runEncoded works is that it necessarily instantiates your lambda with every iopipe type it might need, then decides at runtime which instantiation to call.

So for a single call, input may be any one of those three types, but it is always the same type for the duration of the loop.
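
Roughly, the dispatch looks like this (a simplified, self-contained sketch of the idea, not iopipe's actual source; the Encoding enum and runWithDetected are invented here for illustration):

import std.stdio;

enum Encoding { utf8, utf16, utf32 }

// The lambda template is instantiated once per possible character type
// at compile time; the encoding detected at run time picks which
// instantiation actually runs.
void runWithDetected(alias fun)(Encoding enc, const(ubyte)[] data)
{
    final switch (enc)
    {
    case Encoding.utf8:  fun(cast(const(char)[])data);  break;
    case Encoding.utf16: fun(cast(const(wchar)[])data); break;
    case Encoding.utf32: fun(cast(const(dchar)[])data); break;
    }
}

void main()
{
    auto bytes = cast(immutable(ubyte)[])"hello";
    runWithDetected!((input) {
        // Within a single instantiation, input has exactly one concrete
        // type; it only varies across instantiations, which is why no
        // single buffer type can absorb all three.
        writeln(typeof(input).stringof);
    })(Encoding.utf8, bytes);
}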

It might be tough to do it right, but moot point now, since it's not necessary anyway :)

-Steve
October 13, 2017
On Friday, 13 October 2017 at 22:29:39 UTC, Steven Schveighoffer wrote:
> It might be tough to do it right, but moot point now, since it's not necessary anyway :)
>
> -Steve

Yup. Thanks again.

Andrew
October 17, 2017
Hello, Steve

On Friday, 13 October 2017 at 22:22:54 UTC, Steven Schveighoffer wrote:
> On 10/13/17 6:18 PM, ikod wrote:
>> On Friday, 13 October 2017 at 19:17:54 UTC, Steven Schveighoffer wrote:
>>>
>>> Eventually, something like this will be possible with jsoniopipe (I need to update and release this too, it's probably broken with some of the changes I just put into iopipe). Hopefully combined with some sort of networking library you could process a JSON stream without reading the whole thing into memory.
>> 
>> This can be done with requests. You can ask it not to load the whole content into memory, but instead to produce an input range that continues loading data from the server as you are ready to consume it:
>> 
>>      auto rq = Request();
>>      rq.useStreaming = true;
>>      auto rs = rq.get("http://httpbin.org/image/jpeg");
>>      auto stream = rs.receiveAsRange();
>>      while(!stream.empty) {
>>          // stream.front contain next data portion
>>          writefln("Received %d bytes, total received %d of document length %d", stream.front.length, rq.contentReceived, rq.contentLength);
>>          stream.popFront; // continue to load from server
>>      }
>
> Very nice, I will add a component to iopipe that converts a "chunk-like" range like this into an iopipe source, as this is going to be needed to interface with existing libraries. I still will want to skip the middle man buffer at some point though :)
>
> Thanks!
>
> -Steve

Just to have the complete picture here: getContent returns not just a ubyte[], but a richer structure (which can be converted to ubyte[] if needed). Basically it is an immutable(immutable(ubyte)[]), and almost all the data in it is exactly the bytes received from the network, without any copying.

There are more details and docs at https://github.com/ikod/nbuff/blob/master/source/nbuff/buffer.d. The main goal behind Buffer is to minimize data movement, but it also supports many range properties, as well as some internal optimized methods.
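
The core idea, as a rough sketch (not nbuff's actual API; ChunkBuffer and its members are invented here for illustration): keep each received chunk as an immutable array, and only concatenate when contiguous data is explicitly requested.

import std.array : join;

struct ChunkBuffer
{
    // Each network read is appended as-is; nothing is copied per chunk.
    immutable(ubyte)[][] chunks;

    void put(immutable(ubyte)[] chunk) { chunks ~= chunk; }

    // Flattening happens only on demand; this is the one place the data
    // actually moves.
    immutable(ubyte)[] data() { return chunks.join; }
}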

Thanks,

Igor
October 17, 2017
On 10/17/17 4:33 AM, ikod wrote:
> Hello, Steve
> 
> On Friday, 13 October 2017 at 22:22:54 UTC, Steven Schveighoffer wrote:
>> On 10/13/17 6:18 PM, ikod wrote:
>>> On Friday, 13 October 2017 at 19:17:54 UTC, Steven Schveighoffer wrote:
>>>>
>>>> Eventually, something like this will be possible with jsoniopipe (I need to update and release this too, it's probably broken with some of the changes I just put into iopipe). Hopefully combined with some sort of networking library you could process a JSON stream without reading the whole thing into memory.
>>>
>>> This can be done with requests. You can ask it not to load the whole content into memory, but instead to produce an input range that continues loading data from the server as you are ready to consume it:
>>>
>>>      auto rq = Request();
>>>      rq.useStreaming = true;
>>>      auto rs = rq.get("http://httpbin.org/image/jpeg");
>>>      auto stream = rs.receiveAsRange();
>>>      while(!stream.empty) {
>>>          // stream.front contain next data portion
>>>          writefln("Received %d bytes, total received %d of document length %d", stream.front.length, rq.contentReceived, rq.contentLength);
>>>          stream.popFront; // continue to load from server
>>>      }
>>
>> Very nice, I will add a component to iopipe that converts a "chunk-like" range like this into an iopipe source, as this is going to be needed to interface with existing libraries. I still will want to skip the middle man buffer at some point though :)
>>
>> Thanks!
>>
> 
> Just to have the complete picture here: getContent returns not just a ubyte[], but a richer structure (which can be converted to ubyte[] if needed). Basically it is an immutable(immutable(ubyte)[]), and almost all the data in it is exactly the bytes received from the network, without any copying.

Right, iopipe can use it just fine, without copying, as all arrays are also iopipes. In that case, it skips allocating a buffer, because there is no need.

However, I'd prefer to avoid allocating the whole thing in memory, which is why I would favor the range interface. In that case, though, iopipe needs to copy each chunk into its own buffer.

In terms of maximum usefulness with the least copying, direct access to the stream itself would be best, which is why I said "skip the middle man". I suspect this won't be possible directly with requests and iopipe, because you need buffering to deal with parsing the headers. A system built on top of iopipe, using its buffers, is probably what would be optimal.
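
For illustration, such an adapter could look roughly like this (a hedged sketch, not the actual component; RangeSource is a made-up name, and it assumes a "source" is anything with a read(ubyte[]) method that returns the number of bytes delivered, with 0 meaning EOF):

import std.range.primitives : empty, front, popFront;

struct RangeSource(R)
{
    R chunks;                // input range yielding ubyte[] chunks
    const(ubyte)[] pending;  // remainder of the current chunk

    size_t read(ubyte[] buf)
    {
        size_t copied;
        while (copied < buf.length)
        {
            if (pending.length == 0)
            {
                if (chunks.empty) break; // no more chunks: EOF
                pending = chunks.front;
                chunks.popFront();
            }
            // copy as much of the current chunk as fits in the buffer;
            // this is the per-chunk copy mentioned above
            immutable n = pending.length < buf.length - copied
                        ? pending.length : buf.length - copied;
            buf[copied .. copied + n] = pending[0 .. n];
            pending = pending[n .. $];
            copied += n;
        }
        return copied;
    }
}

void main()
{
    import std.stdio : writeln;
    ubyte[][] chunks = [[1, 2, 3], [4, 5]];
    auto src = RangeSource!(ubyte[][])(chunks);
    ubyte[4] buf;
    writeln(src.read(buf[])); // 4 ([1,2,3] plus the first byte of [4,5])
    writeln(src.read(buf[])); // 1 (the leftover byte)
    writeln(src.read(buf[])); // 0 (EOF)
}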

> There are more details and docs at https://github.com/ikod/nbuff/blob/master/source/nbuff/buffer.d. The main goal behind Buffer is to minimize data movement, but it also supports many range properties, as well as some internal optimized methods.

I will take a look when I get a chance, thanks.

-Steve
October 28, 2017
On 10/13/17 6:18 PM, Steven Schveighoffer wrote:
> On 10/13/17 6:07 PM, Steven Schveighoffer wrote:

>> I reproduced it, and it comes down to some sort of bug when size_t.max is passed to ensureElems.
>>
>> I will find and eradicate it.
>>
> 
> I think I know: the buffered input source is attempting to allocate a buffer of size_t.max bytes to hold the expected new data, and (obviously) cannot do so. I need to figure out how to handle this properly; I shouldn't be prematurely extending the buffer just to read all that data.
> 
> The while loop does work, so I may change ensureElems(size_t.max) to do this. But I'm concerned about accidentally allocating huge buffers: for example, ensureElems(1_000_000_000) works, but probably allocates a GB of space in order to "work"!

This is now fixed. https://github.com/schveiguy/iopipe/pull/12
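
The idea behind the safer behavior, roughly (a sketch of what's described above, not the literal patch; ensureElemsSafe is a made-up name):

// Grow the buffer incrementally, in source-sized steps, instead of
// pre-allocating room for elems elements up front.
size_t ensureElemsSafe(Chain)(ref Chain chain, size_t elems)
{
    while (chain.window.length < elems)
    {
        if (chain.extend(0) == 0)
            break; // EOF: return what we actually have
    }
    return chain.window.length;
}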

-Steve