What is the best way to use requests and iopipe on gzipped JSON file
October 13, 2017
A bit of advice, please. I'm trying to parse a gzipped JSON file retrieved from the internet. The following naive implementation accomplishes the task:

	auto url = "http://api.syosetu.com/novelapi/api/?out=json&lim=500&gzip=5";
	getContent(url)
		.data
		.unzip
		.runEncoded!((input) {
			ubyte[] content;
			foreach (line; input.byLineRange!true) {
				content ~= cast(ubyte[])line;
			}
			auto json = (cast(string)content).parseJSON;
			foreach (size_t ndx, record; json) {
				if (ndx == 0) continue;
				auto title = json[ndx]["title"].str;
				auto author = json[ndx]["writer"].str;
				writefln("title: %s", title);
				writefln("author: %s\n", author);
			}
		});

However, I'm sure there is a much better way to accomplish this. Is there any way to accomplish something akin to:

	auto url = "http://api.syosetu.com/novelapi/api/?out=json&lim=500&gzip=5";
	getContent(url)
		.data
		.unzip
		.runEncoded!((input) {
			foreach (record; input.data.parseJSON[1 .. $]) {
				// use or update record as desired
			}
		});

Thanks,
Andrew

October 13, 2017
On 10/13/17 2:47 PM, Andrew Edwards wrote:
> A bit of advice, please. I'm trying to parse a gzipped JSON file retrieved from the internet. The following naive implementation accomplishes the task:
> 
>      auto url = "http://api.syosetu.com/novelapi/api/?out=json&lim=500&gzip=5";
>      getContent(url)
>          .data
>          .unzip
>          .runEncoded!((input) {
>              ubyte[] content;
>              foreach (line; input.byLineRange!true) {
>                  content ~= cast(ubyte[])line;
>              }
>              auto json = (cast(string)content).parseJSON;

input is an iopipe of char, wchar, or dchar. There is no need to cast it around.

Also, there is no need to split it by line; JSON doesn't care.

Note also that getContent returns a complete body, but unzip may not be so forgiving. But there definitely isn't a reason to create your own buffer here.

this should work (something like this really should be in iopipe):

while(input.extend(0) != 0) {} // get data until EOF

And then:
auto json = input.window.parseJSON;
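
Putting those pieces together, the whole thing might look like this (an untested sketch, using the same imports as your original):

	auto url = "http://api.syosetu.com/novelapi/api/?out=json&lim=500&gzip=5";
	getContent(url)
		.data
		.unzip
		.runEncoded!((input) {
			while (input.extend(0) != 0) {} // grow the window until EOF
			auto json = input.window.parseJSON; // parse the whole buffer at once
			foreach (size_t ndx, record; json) {
				if (ndx == 0) continue; // skip the first element, as in your original
				writefln("title: %s", record["title"].str);
				writefln("author: %s\n", record["writer"].str);
			}
		});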

>              foreach (size_t ndx, record; json) {
>                  if (ndx == 0) continue;
>                  auto title = json[ndx]["title"].str;
>                  auto author = json[ndx]["writer"].str;
>                  writefln("title: %s", title);
>                  writefln("author: %s\n", author);
>              }
>          });
> 
> However, I'm sure there is a much better way to accomplish this. Is there any way to accomplish something akin to:
> 
>      auto url = "http://api.syosetu.com/novelapi/api/?out=json&lim=500&gzip=5";
>      getContent(url)
>          .data
>          .unzip
>          .runEncoded!((input) {
>              foreach (record; input.data.parseJSON[1 .. $]) {
>                  // use or update record as desired
>              }
>          });

Eventually, something like this will be possible with jsoniopipe (I need to update and release this too, it's probably broken with some of the changes I just put into iopipe). Hopefully combined with some sort of networking library you could process a JSON stream without reading the whole thing into memory.

Right now, it works just like std.json.parseJSON: it parses an entire JSON message into a DOM form.

-Steve
October 13, 2017
On 10/13/17 3:17 PM, Steven Schveighoffer wrote:

> this should work (something like this really should be in iopipe):
> 
> while(input.extend(0) != 0) {} // get data until EOF

This should work today, actually. Didn't think about it before.

input.ensureElems(size_t.max);

-Steve
October 13, 2017
On Friday, 13 October 2017 at 19:17:54 UTC, Steven Schveighoffer wrote:
> On 10/13/17 2:47 PM, Andrew Edwards wrote:
>> A bit of advice, please. I'm trying to parse a gzipped JSON file retrieved from the internet. The following naive implementation accomplishes the task:
>> 
>>      auto url = "http://api.syosetu.com/novelapi/api/?out=json&lim=500&gzip=5";
>>      getContent(url)
>>          .data
>>          .unzip
>>          .runEncoded!((input) {
>>              ubyte[] content;
>>              foreach (line; input.byLineRange!true) {
>>                  content ~= cast(ubyte[])line;
>>              }
>>              auto json = (cast(string)content).parseJSON;
>
> input is an iopipe of char, wchar, or dchar. There is no need to cast it around.

In this particular case, all three types (char[], wchar[], and dchar[]) are being returned at different points in the loop. I don't know of any way to generate a unified buffer other than casting it to ubyte[].

> Also, there is no need to split it by line; JSON doesn't care.

I thought as much, but my mind was not open enough to see the solution.

> Note also that getContent returns a complete body, but unzip may not be so forgiving. But there definitely isn't a reason to create your own buffer here.
>
> this should work (something like this really should be in iopipe):
>
> while(input.extend(0) != 0) {} // get data until EOF

This!!! This is what I was looking for. Thank you. I incorrectly assumed that if I didn't process the content of input.window, it would be overwritten on each .extend(), so my implementation was:

ubyte[] json;
while(input.extend(0) != 0) {
    json ~= input.window;
}

This didn't work because it invalidated the Unicode data, so I ended up splitting by line instead.

Sure enough, this is trivial once one knows how to use it correctly, but I think it would be better to put this in the library as extendAll().
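
Something like this, say (a minimal sketch; extendAll is just a name I made up):

	// Hypothetical helper: extend until the source reports EOF,
	// leaving all the data in the pipe's window.
	void extendAll(Pipe)(ref Pipe pipe)
	{
		while (pipe.extend(0) != 0) {}
	}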

> Eventually, something like this will be possible with jsoniopipe (I need to update and release this too, it's probably broken with some of the changes I just put into iopipe). Hopefully combined with some sort of networking library you could process a JSON stream without reading the whole thing into memory.

That would be awesome. Again, thank you very much.

Andrew
October 13, 2017
On Friday, 13 October 2017 at 20:17:50 UTC, Steven Schveighoffer wrote:
> On 10/13/17 3:17 PM, Steven Schveighoffer wrote:
>
>> this should work (something like this really should be in iopipe):
>> 
>> while(input.extend(0) != 0) {} // get data until EOF
>
> This should work today, actually. Didn't think about it before.
>
> input.ensureElems(size_t.max);
>
> -Steve

No, it errored out:
std.json.JSONException@std/json.d(1400): Unexpected end of data. (Line 1:8192)
October 13, 2017
On 10/13/17 4:27 PM, Andrew Edwards wrote:
> On Friday, 13 October 2017 at 19:17:54 UTC, Steven Schveighoffer wrote:
>> On 10/13/17 2:47 PM, Andrew Edwards wrote:
>>> A bit of advice, please. I'm trying to parse a gzipped JSON file retrieved from the internet. The following naive implementation accomplishes the task:
>>>
>>>      auto url = "http://api.syosetu.com/novelapi/api/?out=json&lim=500&gzip=5";
>>>      getContent(url)
>>>          .data
>>>          .unzip
>>>          .runEncoded!((input) {
>>>              ubyte[] content;
>>>              foreach (line; input.byLineRange!true) {
>>>                  content ~= cast(ubyte[])line;
>>>              }
>>>              auto json = (cast(string)content).parseJSON;
>>
>> input is an iopipe of char, wchar, or dchar. There is no need to cast it around.
> 
> In this particular case, all three types (char[], wchar[], and dchar[]) are being returned at different points in the loop. I don't know of any way to generate a unified buffer other than casting it to ubyte[].

This has to be a misunderstanding. The point of runEncoded is to figure out the correct type (based on the BOM), and run your lambda function with the correct type for the whole thing.

Actually, I'm not sure this is even needed, as the data could be coming through without a BOM. Without a BOM, it assumes UTF-8.
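
For reference, the kind of sniffing involved looks roughly like this (a generic sketch, not iopipe's actual code):

	enum Encoding { utf8, utf16le, utf16be, utf32le, utf32be }

	Encoding detectBOM(const(ubyte)[] data)
	{
		// Order matters: UTF-32 LE's BOM begins with UTF-16 LE's bytes.
		if (data.length >= 4 && data[0] == 0xFF && data[1] == 0xFE && data[2] == 0 && data[3] == 0)
			return Encoding.utf32le;
		if (data.length >= 4 && data[0] == 0 && data[1] == 0 && data[2] == 0xFE && data[3] == 0xFF)
			return Encoding.utf32be;
		if (data.length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
			return Encoding.utf8;
		if (data.length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
			return Encoding.utf16le;
		if (data.length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
			return Encoding.utf16be;
		return Encoding.utf8; // no BOM: assume UTF-8
	}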

>> Note also that getContent returns a complete body, but unzip may not be so forgiving. But there definitely isn't a reason to create your own buffer here.
>>
>> this should work (something like this really should be in iopipe):
>>
>> while(input.extend(0) != 0) {} // get data until EOF
> 
> This!!! This is what I was looking for. Thank you. I incorrectly assumed that if I didn't process the content of input.window, it would be overwritten on each .extend(), so my implementation was:
> 
> ubyte[] json;
> while(input.extend(0) != 0) {
>      json ~= input.window;
> }
> 
> This didn't work because it invalidated the Unicode data, so I ended up splitting by line instead.
> 
> Sure enough, this is trivial once one knows how to use it correctly, but I think it would be better to put this in the library as extendAll().

ensureElems(size_t.max) should be equivalent, though I see you responded cryptically with something about JSON there :)

I will try and reproduce your error, and see if I can figure out why.

-Steve
October 13, 2017
On 10/13/17 4:30 PM, Andrew Edwards wrote:
> On Friday, 13 October 2017 at 20:17:50 UTC, Steven Schveighoffer wrote:
>> On 10/13/17 3:17 PM, Steven Schveighoffer wrote:
>>
>>> this should work (something like this really should be in iopipe):
>>>
>>> while(input.extend(0) != 0) {} // get data until EOF
>>
>> This should work today, actually. Didn't think about it before.
>>
>> input.ensureElems(size_t.max);
>>
> 
> No, it errored out:
> std.json.JSONException@std/json.d(1400): Unexpected end of data. (Line 1:8192)


I reproduced, and it comes down to some sort of bug when size_t.max is passed to ensureElems.

I will find and eradicate it.

-Steve
October 13, 2017
On Friday, 13 October 2017 at 19:17:54 UTC, Steven Schveighoffer wrote:
> On 10/13/17 2:47 PM, Andrew Edwards wrote:
>> A bit of advice, please. I'm trying to parse a gzipped JSON file retrieved from the internet. The following naive implementation accomplishes the task:
>> 
>>      auto url = "http://api.syosetu.com/novelapi/api/?out=json&lim=500&gzip=5";
>>      getContent(url)
>>          .data
>>          .unzip
>>          .runEncoded!((input) {
>>              ubyte[] content;
>>              foreach (line; input.byLineRange!true) {
>>                  content ~= cast(ubyte[])line;
>>              }
>>              auto json = (cast(string)content).parseJSON;
>
> input is an iopipe of char, wchar, or dchar. There is no need to cast it around.
>
> Also, there is no need to split it by line, json doesn't care.
>
> Note also that getContent returns a complete body, but unzip may not be so forgiving. But there definitely isn't a reason to create your own buffer here.
>
> this should work (something like this really should be in iopipe):
>
> while(input.extend(0) != 0) {} // get data until EOF
>
> And then:
> auto json = input.window.parseJSON;
>
>>              foreach (size_t ndx, record; json) {
>>                  if (ndx == 0) continue;
>>                  auto title = json[ndx]["title"].str;
>>                  auto author = json[ndx]["writer"].str;
>>                  writefln("title: %s", title);
>>                  writefln("author: %s\n", author);
>>              }
>>          });
>> 
>> However, I'm sure there is a much better way to accomplish this. Is there any way to accomplish something akin to:
>> 
>>      auto url = "http://api.syosetu.com/novelapi/api/?out=json&lim=500&gzip=5";
>>      getContent(url)
>>          .data
>>          .unzip
>>          .runEncoded!((input) {
>>              foreach (record; input.data.parseJSON[1 .. $]) {
>>                  // use or update record as desired
>>              }
>>          });
>
> Eventually, something like this will be possible with jsoniopipe (I need to update and release this too, it's probably broken with some of the changes I just put into iopipe). Hopefully combined with some sort of networking library you could process a JSON stream without reading the whole thing into memory.

This can be done with requests. You can ask it not to load the whole content into memory, but instead to produce an input range that continues to load data from the server as you are ready to consume it:

    auto rq = Request();
    rq.useStreaming = true;
    auto rs = rq.get("http://httpbin.org/image/jpeg");
    auto stream = rs.receiveAsRange();
    while(!stream.empty) {
        // stream.front contains the next data portion
        writefln("Received %d bytes, total received %d of document length %d", stream.front.length, rq.contentReceived, rq.contentLength);
        stream.popFront; // continue to load from server
    }


>
> Right now, it works just like std.json.parseJSON: it parses an entire JSON message into a DOM form.
>
> -Steve


October 13, 2017
On 10/13/17 6:07 PM, Steven Schveighoffer wrote:
> On 10/13/17 4:30 PM, Andrew Edwards wrote:
>> On Friday, 13 October 2017 at 20:17:50 UTC, Steven Schveighoffer wrote:
>>> On 10/13/17 3:17 PM, Steven Schveighoffer wrote:
>>>
>>>> this should work (something like this really should be in iopipe):
>>>>
>>>> while(input.extend(0) != 0) {} // get data until EOF
>>>
>>> This should work today, actually. Didn't think about it before.
>>>
>>> input.ensureElems(size_t.max);
>>>
>>
>> No, it errored out:
>> std.json.JSONException@std/json.d(1400): Unexpected end of data. (Line 1:8192)
> 
> 
> I reproduced, and it comes down to some sort of bug when size_t.max is passed to ensureElems.
> 
> I will find and eradicate it.
> 

I think I know: the buffered input source is attempting to allocate a buffer of size_t.max elements to hold the expected new data, and cannot do so (obviously). I need to figure out how to handle this properly. I shouldn't be prematurely extending the buffer to read all that data.

The while loop does work, so I may change ensureElems(size_t.max) to do this. But I'm concerned about accidentally allocating huge buffers. For example, ensureElems(1_000_000_000) works, but probably allocates a GB of space in order to "work"!
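
Roughly the idea (a sketch of the change, not the actual fix):

	// Sketch: extend incrementally until EOF or until enough elements
	// are buffered, instead of pre-allocating `elems` up front.
	size_t ensureElemsLoop(Pipe)(ref Pipe pipe, size_t elems)
	{
		while (pipe.window.length < elems && pipe.extend(0) != 0) {}
		return pipe.window.length;
	}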

-Steve
October 13, 2017
On 10/13/17 6:18 PM, ikod wrote:
> On Friday, 13 October 2017 at 19:17:54 UTC, Steven Schveighoffer wrote:
>>
>> Eventually, something like this will be possible with jsoniopipe (I need to update and release this too, it's probably broken with some of the changes I just put into iopipe). Hopefully combined with some sort of networking library you could process a JSON stream without reading the whole thing into memory.
> 
> This can be done with requests. You can ask it not to load the whole content into memory, but instead to produce an input range that continues to load data from the server as you are ready to consume it:
> 
>      auto rq = Request();
>      rq.useStreaming = true;
>      auto rs = rq.get("http://httpbin.org/image/jpeg");
>      auto stream = rs.receiveAsRange();
>      while(!stream.empty) {
>          // stream.front contains the next data portion
>          writefln("Received %d bytes, total received %d of document length %d", stream.front.length, rq.contentReceived, rq.contentLength);
>          stream.popFront; // continue to load from server
>      }

Very nice, I will add a component to iopipe that converts a "chunk-like" range like this into an iopipe source, as this is going to be needed to interface with existing libraries. I still want to skip the middleman buffer at some point, though :)
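
Roughly, such an adapter might look like this (a sketch; it assumes an iopipe source is anything with a read(buf) method, and that the range yields ubyte[] chunks, as receiveAsRange does):

	struct RangeSource(R)
	{
		R chunks;         // input range of ubyte[] chunks
		ubyte[] leftover; // unread remainder of the current chunk

		size_t read(ubyte[] buf)
		{
			size_t copied;
			while (copied < buf.length)
			{
				if (leftover.length == 0)
				{
					if (chunks.empty)
						break; // end of stream
					leftover = chunks.front;
					chunks.popFront;
				}
				auto n = leftover.length < buf.length - copied
					? leftover.length : buf.length - copied;
				buf[copied .. copied + n] = leftover[0 .. n];
				leftover = leftover[n .. $];
				copied += n;
			}
			return copied; // 0 signals EOF to the pipe
		}
	}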

Thanks!

-Steve