Using iopipe to stream a gzipped file
January 03, 2018
Hi,

I have a very large gzipped text file (all ASCII characters, ~500GB) that I want to stream and process line by line, and I thought the iopipe library would be perfect for this, but I can't seem to get it to work. This is the closest I've come so far:

import iopipe.textpipe;
import iopipe.zip;
import iopipe.bufpipe;
import iopipe.stream;

void main()
{

  auto fileToRead = openDev("file.gz").bufd.unzip(CompressionFormat.gzip);

  foreach (line; fileToRead.assumeText.byLineRange!false)
  {
     // do stuff
  }
}

but this only processes the first 200-odd lines (I guess just the initial read into the buffer). Can anyone help me out?

Thanks very much

Andrew
January 03, 2018
On 1/3/18 9:45 AM, Andrew wrote:
> Hi,
> 
> I have a very large gzipped text file (all ASCII characters, ~500GB) that I want to stream and process line by line, and I thought the iopipe library would be perfect for this, but I can't seem to get it to work. This is the closest I've come so far:
> 
> import iopipe.textpipe;
> import iopipe.zip;
> import iopipe.bufpipe;
> import iopipe.stream;
> 
> void main()
> {
> 
>    auto fileToRead = openDev("file.gz").bufd.unzip(CompressionFormat.gzip);
> 
>    foreach (line; fileToRead.assumeText.byLineRange!false)
>    {
>       // do stuff
>    }
> }
> 
> but this only processes the first 200-odd lines (I guess just the initial read into the buffer). Can anyone help me out?

Do you have a sample file I can play with? Your iopipe chain looks correct, so I'm not sure why it wouldn't work.

-Steve
January 03, 2018
On Wednesday, 3 January 2018 at 16:09:19 UTC, Steven Schveighoffer wrote:
> On 1/3/18 9:45 AM, Andrew wrote:
>> Hi,
>> 
>> I have a very large gzipped text file (all ASCII characters, ~500GB) that I want to stream and process line by line, and I thought the iopipe library would be perfect for this, but I can't seem to get it to work. This is the closest I've come so far:
>> 
>> import iopipe.textpipe;
>> import iopipe.zip;
>> import iopipe.bufpipe;
>> import iopipe.stream;
>> 
>> void main()
>> {
>> 
>>    auto fileToRead = openDev("file.gz").bufd.unzip(CompressionFormat.gzip);
>> 
>>    foreach (line; fileToRead.assumeText.byLineRange!false)
>>    {
>>       // do stuff
>>    }
>> }
>> 
>> but this only processes the first 200-odd lines (I guess just the initial read into the buffer). Can anyone help me out?
>
> Do you have a sample file I can play with? Your iopipe chain looks correct, so I'm not sure why it wouldn't work.
>
> -Steve

A sample file (about 250MB) can be found here:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

It should have 1,103,800 lines, but the following code only reports 256:

import iopipe.textpipe;
import iopipe.zip;
import iopipe.bufpipe;
import iopipe.stream;
import std.stdio;

void main()
{

   auto fileToRead = openDev("ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz").bufd.unzip(CompressionFormat.gzip);

   auto counter = 0;
   foreach (line; fileToRead.assumeText.byLineRange!false)
   {
      counter++;
   }
   writeln(counter);
}

Thanks for looking into this.

Andrew
January 03, 2018
On 1/3/18 12:03 PM, Andrew wrote:

> Thanks for looking into this.
> 

So it looks like the file you have is a concatenated gzip file. If I gunzip the file and recompress it, it works properly.

Looking at the docs of zlib inflate [1]:

" Unlike the gunzip utility and gzread() ..., inflate() will not automatically decode concatenated gzip streams. inflate() will return Z_STREAM_END at the end of the gzip stream.  The state would need to be reset to continue decoding a subsequent gzip stream."

So what is happening is the inflate function is returning Z_STREAM_END, and I'm considering the stream done from that return code.

I'm not sure yet how to fix this. I suppose I can check if any more data exists, and then re-init and continue. I have to look up what a concatenated gzip file is. gzread isn't good for generic purposes, because it requires an actual file input (I want to support any input type, including memory data).
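The shape of the fix would probably be something like this (a rough standalone sketch using Phobos's etc.c.zlib bindings directly, not iopipe's internals; the file name is just a placeholder):

import etc.c.zlib;
import std.file : read;
import std.stdio : writeln;

void main()
{
    // whole file in memory just to keep the sketch short
    auto input = cast(ubyte[]) read("file.gz"); // placeholder name
    ubyte[64 * 1024] outbuf;

    z_stream zs;
    zs.next_in = input.ptr;
    zs.avail_in = cast(uint) input.length;
    if (inflateInit2(&zs, 15 + 16) != Z_OK) // 15 window bits + 16 selects gzip
        return;
    scope(exit) inflateEnd(&zs);

    size_t total;
    while (true)
    {
        zs.next_out = outbuf.ptr;
        zs.avail_out = cast(uint) outbuf.length;
        auto ret = inflate(&zs, Z_NO_FLUSH);
        total += outbuf.length - zs.avail_out;
        if (ret == Z_STREAM_END)
        {
            if (zs.avail_in == 0)
                break;             // real end of input
            inflateReset(&zs);     // another gzip member follows; restart
        }
        else if (ret != Z_OK)
            break;                 // error; a real implementation would report it
    }
    writeln(total, " bytes decompressed");
}

The key bit is inflateReset: zlib leaves next_in/avail_in pointing at the unconsumed bytes, so after a reset the loop simply picks up the next gzip member.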

-Steve

[1] https://github.com/dlang/phobos/blob/master/etc/c/zlib.d#L874
January 04, 2018
On Thursday, 4 January 2018 at 02:44:09 UTC, Steven Schveighoffer wrote:
> On 1/3/18 12:03 PM, Andrew wrote:
>
>> Thanks for looking into this.
>> 
>
> So it looks like the file you have is a concatenated gzip file. If I gunzip the file and recompress it, it works properly.
>
> Looking at the docs of zlib inflate [1]:
>
> " Unlike the gunzip utility and gzread() ..., inflate() will not automatically decode concatenated gzip streams. inflate() will return Z_STREAM_END at the end of the gzip stream.  The state would need to be reset to continue decoding a subsequent gzip stream."
>
> So what is happening is the inflate function is returning Z_STREAM_END, and I'm considering the stream done from that return code.
>
> I'm not sure yet how to fix this. I suppose I can check if any more data exists, and then re-init and continue. I have to look up what a concatenated gzip file is. gzread isn't good for generic purposes, because it requires an actual file input (I want to support any input type, including memory data).
>
> -Steve
>
> [1] https://github.com/dlang/phobos/blob/master/etc/c/zlib.d#L874

Ah thank you, that makes sense. These types of files are compressed using the bgzip utility so that the file can be indexed, allowing specific rows to be extracted quickly (there are more details here: http://www.htslib.org/doc/tabix.html, and the code can be found here: https://github.com/samtools/htslib/blob/develop/bgzf.c).
January 04, 2018
On 1/4/18 7:01 AM, Andrew wrote:

> Ah thank you, that makes sense. These types of files are compressed using the bgzip utility so that the file can be indexed, allowing specific rows to be extracted quickly (there are more details here: http://www.htslib.org/doc/tabix.html, and the code can be found here: https://github.com/samtools/htslib/blob/develop/bgzf.c).

Hm... that utility seems to say it will result in a .bgz file extension? So this must be an extraction from one of those files?

In any case, I'll figure out how to deal with concatenated gzip files and update iopipe. The next version will focus on a bunch of stuff relating to the two zip threads recently posted here.

Thanks!

-Steve
January 04, 2018
On Thursday, 4 January 2018 at 12:15:27 UTC, Steven Schveighoffer wrote:
> On 1/4/18 7:01 AM, Andrew wrote:
>
>> Ah thank you, that makes sense. These types of files are compressed using the bgzip utility so that the file can be indexed, allowing specific rows to be extracted quickly (there are more details here: http://www.htslib.org/doc/tabix.html, and the code can be found here: https://github.com/samtools/htslib/blob/develop/bgzf.c).
>
> Hm... that utility seems to say it will result in a .bgz file extension? So this must be an extraction from one of those files?
>
> In any case, I'll figure out how to deal with concatenated gzip files and update iopipe. The next version will focus on a bunch of stuff relating to the two zip threads recently posted here.
>
> Thanks!
>
> -Steve

That would be really great for me, thank you! By default bgzip produces a file with the standard .gz extension. Looking at the code, it adds an extra field to the standard gzip header:

/* BGZF/GZIP header (specialized from RFC 1952; little endian):
 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
 | 31|139|  8|  4|              0|  0|255|      6| 66| 67|      2|BLK_LEN|
 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  BGZF extension:
                ^                              ^   ^   ^
                |                              |   |   |
               FLG.EXTRA                     XLEN  B   C
  BGZF format is compatible with GZIP. It limits the size of each compressed
  block to 2^16 bytes and adds an extra "BC" field in the gzip header which
  records the size.
*/
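
For illustration, here's a rough sketch of pulling the block size out of that header (my own code, not htslib's, with byte offsets matching the diagram above):

import std.bitmanip : littleEndianToNative;

// Returns the total compressed block size, or 0 if this isn't a BGZF header.
size_t bgzfBlockSize(const(ubyte)[] header)
{
    if (header.length < 18) return 0;
    // gzip magic (31, 139), deflate method (8), FLG.EXTRA set (4)
    if (header[0] != 31 || header[1] != 139 ||
        header[2] != 8 || header[3] != 4) return 0;
    // XLEN == 6, then the "BC" subfield with SLEN == 2
    if (header[10] != 6 || header[11] != 0) return 0;
    if (header[12] != 'B' || header[13] != 'C') return 0;
    if (header[14] != 2 || header[15] != 0) return 0;
    // BSIZE stores the total block length minus 1, little endian
    ubyte[2] bsize = header[16 .. 18];
    return littleEndianToNative!ushort(bsize) + 1;
}

That's how an indexer can hop from block to block without decompressing anything.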

Thanks again!

Andrew
January 04, 2018
On 1/4/18 7:23 AM, Andrew wrote:
> On Thursday, 4 January 2018 at 12:15:27 UTC, Steven Schveighoffer wrote:
>>
>> In any case, I'll figure out how to deal with concatenated gzip files and update iopipe. The next version will focus on a bunch of stuff relating to the two zip threads recently posted here.
>>
> 
> That would be really great for me, thank you! 

It's now been updated, see version 0.0.3.

Note: performance isn't something I focused on. gzcat | wc -l is 2x faster than your simple example on that file.

I can think of a few possible reasons:

1. gzcat may use mmap to increase read speed
2. gzcat may read larger chunks at once (this can be tuned using iopipe as well; just set the optimizedReadSize).
3. gzcat file.gz | iopipe_byline -nooutput is about 20% faster than using wc -l, so it's definitely not the line parsing.

Let me know if this works correctly for your other test cases! If not, file an issue: https://github.com/schveiguy/iopipe/issues

-Steve
January 04, 2018
On Thursday, 4 January 2018 at 15:48:21 UTC, Steven Schveighoffer wrote:
>
> It's now been updated, see version 0.0.3.
>
> Note: performance isn't something I focused on. gzcat | wc -l is 2x faster than your simple example on that file.
>
> I can think of a few possible reasons:
>
> 1. gzcat may use mmap to increase read speed
> 2. gzcat may read larger chunks at once (this can be tuned using iopipe as well; just set the optimizedReadSize).
> 3. gzcat file.gz | iopipe_byline -nooutput is about 20% faster than using wc -l, so it's definitely not the line parsing.
>
> Let me know if this works correctly for your other test cases! If not, file an issue: https://github.com/schveiguy/iopipe/issues
>
> -Steve

That works perfectly, thank you very much!