February 20, 2014
On Wednesday, 19 February 2014 at 16:36:29 UTC, Adam D. Ruppe wrote:
> And if you are writing, use new Compress(HeaderFormat.gzip) then call the compress method and write what it returns to the file.

I successfully read and printed the contents of a gzipped file, but the documentation is too sparse for me to figure out why I can't write a gzipped file.

http://dlang.org/phobos/std_zlib.html#.Compress

I'd appreciate any tips.


Here's the output:

- - -
$ echo -e "hi there\nhere's some text in a file\n-K" | gzip > test.gz

$ zcat test.gz
hi there
here's some text in a file
-K

$ ./zfile.d test.gz out.gz
hi there
here's some text in a file
-K

$ zcat out.gz

gzip: out.gz: unexpected end of file
- - -


And the code:

- - -
#!/usr/bin/env rdmd
// zfile.d
import std.stdio,
       std.stream,
       std.zlib,
       std.c.process,
       std.process,
       std.file;

void main(string[] args)
{
    if (args.length != 3) {
        writefln("Usage: ./%s <file> <output>", args[0]);
        exit(0);
    }

    // Read command line arguments.
    string filename = args[1];
    string outfile = args[2];
    auto len = filename.length;

    std.stdio.File input;
    // Automatically decompress the file if it ends with "gz".
    if (filename[len - 2 .. len] == "gz") {
        auto pipe = pipeShell("gunzip -c " ~ filename);
        input = pipe.stdout;
    } else {
        input = std.stdio.File(filename);
    }

    // Write data to a stream in memory
    auto mem = new MemoryStream();
    string line;
    while ((line = input.readln()) !is null) {
        mem.write(line);
        // Also write the line to stdout.
        write(line);
    }

    // Put the uncompressed data into a new gz file.
    auto comp = new Compress(HeaderFormat.gzip);
    auto compressed = comp.compress(mem.data);
    //comp.flush(); // Does not fix the problem.

    // See the raw compressed bytes.
    //writeln(cast(ubyte[])compressed);

    // Write compressed output to a file.
    with (new std.stream.File(outfile, FileMode.OutNew)) {
        writeExact(compressed.ptr, compressed.length);
        //write(cast(ubyte[])compressed); // Also does not work.
    }
}
- - -
February 20, 2014
On Thursday, 20 February 2014 at 03:58:01 UTC, Kamil Slowikowski wrote:
>     auto compressed = comp.compress(mem.data);
>     //comp.flush(); // Does not fix the problem.

You need to write each compressed block and then the output of flush. So more like:

writeToFile(comp.compress(mem.data)); // loop over all the data btw
writeToFile(comp.flush());

and that should do it.


flush returns the remainder of the data.
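
To make that concrete, here is a minimal sketch of the compress-then-flush pattern. The helper name gzipBytes is made up for illustration, and it assumes the whole input fits in memory; for large inputs you would call compress once per block and write each result as you go.

```d
import std.zlib : Compress, HeaderFormat;

/// Hypothetical helper: return `data` as a complete gzip stream.
ubyte[] gzipBytes(const(void)[] data)
{
    auto comp = new Compress(HeaderFormat.gzip);
    ubyte[] result;

    // compress() may hand back only part of the output (or nothing yet).
    result ~= cast(const(ubyte)[]) comp.compress(data);

    // flush() returns whatever zlib is still holding, including the
    // gzip trailer. Without it the stream ends too early.
    result ~= cast(const(ubyte)[]) comp.flush();
    return result;
}
```

Something like `File("out.gz", "wb").rawWrite(gzipBytes(data));` (with std.stdio imported) should then produce a file that zcat accepts.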
February 20, 2014
On Thursday, 20 February 2014 at 04:03:45 UTC, Adam D. Ruppe wrote:
> On Thursday, 20 February 2014 at 03:58:01 UTC, Kamil Slowikowski wrote:
>>    auto compressed = comp.compress(mem.data);
>>    //comp.flush(); // Does not fix the problem.
>
> You need to write each compressed block and then the output of flush. So more like:
>
> writeToFile(comp.compress(mem.data)); // loop over all the data btw
> writeToFile(comp.flush());
>
> and that should do it.
>
>
> flush returns the remainder of the data.

Hey Adam, thanks for the tip.

Next problem: the output has strange characters, as shown:

- - -
./zfile.d test.gz out.gz
hi there
here's some text in a file
-K

Thu Feb 20 00:07:52 kamil W530 ~/work/dlang
zcat out.gz
	hi there
here's some text in a file
-K

zcat test.gz | wc -c
39

zcat out.gz | wc -c
63

zcat test.gz | hexdump
0000000 6968 7420 6568 6572 680a 7265 2765 2073
0000010 6f73 656d 7420 7865 2074 6e69 6120 6620
0000020 6c69 0a65 4b2d 000a
0000027

zcat out.gz | hexdump
0000000 0009 0000 0000 0000 6968 7420 6568 6572
0000010 1b0a 0000 0000 0000 6800 7265 2765 2073
0000020 6f73 656d 7420 7865 2074 6e69 6120 6620
0000030 6c69 0a65 0003 0000 0000 0000 4b2d 000a
000003f
- - -


Code:

- - -
#!/usr/bin/env rdmd
import std.stdio,
       std.stream,
       std.zlib,
       std.c.process,
       std.process,
       std.file;

void main(string[] args)
{
    if (args.length != 3) {
        writefln("Usage: ./%s <file> <output>", args[0]);
        exit(0);
    }

    // Read command line arguments.
    string filename = args[1];
    string outfile = args[2];
    auto len = filename.length;

    std.stdio.File input;
    // Automatically decompress the file if it ends with "gz".
    if (filename[len - 2 .. len] == "gz") {
        auto pipe = pipeShell("gunzip -c " ~ filename);
        input = pipe.stdout;
    } else {
        input = std.stdio.File(filename);
    }

    // Write data to a stream in memory
    auto mem = new MemoryStream();
    string line;
    while ((line = input.readln()) !is null) {
        mem.write(line);
        // Also write the line to stdout.
        write(line);
    }

    // Put the data into a new gz file.
    auto comp = new Compress(HeaderFormat.gzip);
    // See the raw compressed bytes.
    //writeln(cast(ubyte[])compressed);

    // Write compressed output to a file.
    with (new std.stream.File(outfile, FileMode.OutNew)) {
        auto compressed = comp.compress(mem.data);
        writeExact(compressed.ptr, compressed.length);
        // Get any remaining data.
        compressed = comp.flush();
        writeExact(compressed.ptr, compressed.length);
    }
}
- - -
February 20, 2014
On Wednesday, 19 February 2014 at 15:51:53 UTC, Kamil Slowikowski wrote:
> Hi there, I'm new to D and have a lot of learning ahead of me. It would
> be extremely helpful to me if someone with D experience could show me
> some code examples.
>
> I'd like to neatly read and write gzipped files for my work. I have read
> several threads on these forums on the topic of std.zlib or std.zip and I haven't been able to figure it out.
>

Hi Kamil,
I am glad someone has the exact same problem as I had. I actually solved this, inspired by the python API you quoted above. I wrote these classes:
GzipInputRange, GzipByLine, and GzipOut.
Here is how I can now use them:

_____________________

import gzip;
import std.stdio;

void main() {
  auto byLine = new GzipByLine("test.gz");
  foreach(line; byLine)
    writeln(line);

  auto gzipOutFile = new GzipOut("testout.gz");
  gzipOutFile.compress("bla bla bla");
  gzipOutFile.finish();
}

That is all quite convenient, and I was wondering whether something like this would be useful in Phobos itself. It's clear, though, that getting it into Phobos would take a lot more work to meet the requirements. So far it has simply served my needs and is not as generic as it could be.

Here is the code:

___________gzip.d__________________
import std.zlib;
import std.stdio;
import std.range;
import std.traits;

class GzipInputRange {
  UnCompress uncompressObj;
  File f;
  auto CHUNKSIZE = 0x4000;
  ReturnType!(f.byChunk) chunkRange;
  bool exhausted;
  char[] uncompressedBuffer;
  size_t bufferIndex;

  this(string filename) {
    f = File(filename, "rb"); // binary mode: the data is compressed bytes
    chunkRange = f.byChunk(CHUNKSIZE);
    uncompressObj = new UnCompress();
    load();
  }

  void load() {
    if(!chunkRange.empty) {
      auto raw = chunkRange.front.dup;
      chunkRange.popFront();
      uncompressedBuffer = cast(char[])uncompressObj.uncompress(raw);
      bufferIndex = 0;
    }
    else {
      if(!exhausted) {
        uncompressedBuffer = cast(char[])uncompressObj.flush();
        exhausted = true;
        bufferIndex = 0;
      }
      else
        uncompressedBuffer.length = 0;
    }
  }

  @property char front() {
    return uncompressedBuffer[bufferIndex];
  }

  void popFront() {
    bufferIndex += 1;
    if(bufferIndex >= uncompressedBuffer.length) {
      load();
      bufferIndex = 0;
    }
  }

  @property bool empty() {
    return uncompressedBuffer.length == 0;
  }
}

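class GzipByLine {
  GzipInputRange range;
  char[] buf;
  bool done; // set once the underlying range has no more lines

  this(string filename) {
    this.range = new GzipInputRange(filename);
    popFront();
  }

  @property bool empty() {
    // Track the end explicitly so that blank lines do not stop iteration.
    return done;
  }

  void popFront() {
    buf.length = 0;
    if(range.empty) {
      done = true;
      return;
    }
    while(!range.empty && range.front != '\n') {
      buf ~= range.front;
      range.popFront();
    }
    // Skip the newline, if there is one.
    if(!range.empty)
      range.popFront();
  }

  @property string front() {
    return buf.idup;
  }
}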

class GzipOut {
  Compress compressObj;
  File f;

  this(string filename) {
    f = File(filename, "wb"); // binary mode: the data is compressed bytes
    compressObj = new Compress(HeaderFormat.gzip);
  }

  void compress(string s) {
    auto compressed = compressObj.compress(s.dup);
    f.rawWrite(compressed);
  }

  void finish() {
    auto compressed = compressObj.flush();
    f.rawWrite(compressed);
  }
}


February 20, 2014
On Thursday, 20 February 2014 at 10:35:50 UTC, Stephan Schiffels wrote:
> Hi Kamil,
> I am glad someone has the exact same problem as I had. I actually solved this, inspired by the python API you quoted above. I wrote these classes:
> GzipInputRange, GzipByLine, and GzipOut.

Stephan, awesome! Thank you very much for sharing your classes. It's nice to see how you've approached this problem. Your code is very clear and easy to understand (for me).

Also, I now see the error in my code: I believe I should use rawWrite to write compressed data and not writeExact.
February 20, 2014
On Thu, Feb 20, 2014 at 9:05 PM, Kamil Slowikowski <kslowikowski@gmail.com> wrote:

>
> Also, I now see the error in my code: I believe I should use rawWrite to write compressed data and not writeExact.
>

That's not an error; those are two different ways to access files: std.stream.File and std.stdio.File. The latter is the recommended one.
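
Looking at the hexdump earlier in the thread, each line in the bad output is preceded by 8 extra bytes holding its length (9, 27 and 3), which accounts for the 63 - 39 = 24 byte difference. That matches what std.stream's write(const(char)[]) is documented to do: it writes the string together with its length, while writeString writes only the characters. If you want to avoid std.stream altogether, a sketch along these lines collects the raw line bytes with std.array.appender instead (joinRaw is a hypothetical helper name, not a Phobos function):

```d
import std.array : appender;

/// Hypothetical helper: concatenate the bytes of every line in `lines`,
/// with nothing added in between.
ubyte[] joinRaw(R)(R lines)
{
    auto buf = appender!(ubyte[])();
    foreach (line; lines)
    {
        // Reinterpret the characters as bytes; no length prefix is written.
        buf.put(cast(const(ubyte)[]) line);
    }
    return buf.data;
}
```

With std.stdio imported, `joinRaw(input.byLine(KeepTerminator.yes))` would give the buffer to hand to Compress.compress.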


February 20, 2014
On Thursday, 20 February 2014 at 17:05:37 UTC, Kamil Slowikowski wrote:
> On Thursday, 20 February 2014 at 10:35:50 UTC, Stephan Schiffels wrote:
>> Hi Kamil,
>> I am glad someone has the exact same problem as I had. I actually solved this, inspired by the python API you quoted above. I wrote these classes:
>> GzipInputRange, GzipByLine, and GzipOut.
>
> Stephan, awesome! Thank you very much for sharing your classes. It's nice to see how you've approached this problem. Your code is very clear and easy to understand (for me).
>
> Also, I now see the error in my code: I believe I should use rawWrite to write compressed data and not writeExact.

You're welcome. If you manage to put GzipOut.finish() into the
destructor of the class to automatically flush the file upon
destruction of the object, let me know. I tried this and it gives
a SegFault… I was too lazy to try to understand it but I am sure
it must be in principle possible.

Stephan
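
One possible explanation for the crash, offered as a guess since it has not been reproduced here: when the GC finalizes an object it does not guarantee that other GC-managed objects it references (here compressObj) are still valid, and allocating GC memory inside a finalizer is not allowed either, which flush() does. A deterministic alternative is to keep finish() explicit and tie it to the scope with scope(exit). The sketch below is a hypothetical variant of GzipOut along those lines; it also closes the file in finish(), and writeGreeting is just an illustrative caller.

```d
import std.stdio : File;
import std.zlib : Compress, HeaderFormat;

/// Hypothetical variant of GzipOut: finish() must be called explicitly.
class GzipOut
{
    private Compress compressObj;
    private File f;

    this(string filename)
    {
        f = File(filename, "wb");
        compressObj = new Compress(HeaderFormat.gzip);
    }

    void compress(const(char)[] s)
    {
        // Copy the input, as the original class does, before handing it to zlib.
        f.rawWrite(cast(const(ubyte)[]) compressObj.compress(s.dup));
    }

    /// Write the remaining compressed bytes, then close the file.
    void finish()
    {
        f.rawWrite(cast(const(ubyte)[]) compressObj.flush());
        f.close();
    }
}

/// Illustrative caller: finish() runs when the scope is left,
/// even if an exception is thrown in between.
void writeGreeting(string path)
{
    auto gz = new GzipOut(path);
    scope (exit) gz.finish();
    gz.compress("bla bla bla\n");
}
```

This keeps the cleanup out of the destructor entirely, at the cost of one extra line at each call site.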
May 03, 2015
On Thursday, 20 February 2014 at 10:35:50 UTC, Stephan Schiffels wrote:
> Hi Kamil,
> I am glad someone has the exact same problem as I had. I actually solved this, inspired by the python API you quoted above. I wrote these classes:
> GzipInputRange, GzipByLine, and GzipOut.
> Here is how I can now use them:
>

I've polished your module a bit at:

https://github.com/nordlow/justd/blob/611ae3aac35a085af966e0c3b717deb0012f637b/zio.d

Reflections:

- Performance is terrible even with -release -noboundscheck -unittest. About 20 times slower than zcat $F | wc -l. I'm guessing

    _chunkRange.front.dup

slows things down. I tried removing the .dup but then I get

    std.zlib.ZlibException@std/zlib.d(59): data error

I don't believe we should have to copy _chunkRange.front, but I can't figure out how to avoid it. Does anybody understand how to fix this?

- Shouldn't GzipOut.finish() call this.close()? Otherwise the file remains unflushed.
- And what about calling this.close() in GzipOut.~this()? Is that needed too?
May 03, 2015
> I've polished your module a bit at:
https://github.com/nordlow/justd/blob/611ae3aac35a085af966e0c3b717deb0012f637b/zio.d

Latest at

https://github.com/nordlow/justd/blob/master/zio.d
May 03, 2015
And there is Zipios++

http://zipios.sourceforge.net/


On Sun, 2015-05-03 at 14:33 +0000, via Digitalmars-d wrote:
> On Thursday, 20 February 2014 at 10:35:50 UTC, Stephan Schiffels wrote:
> > Hi Kamil,
> > I am glad someone has the exact same problem as I had. I
> > actually solved this, inspired by the python API you quoted
> > above. I wrote these classes:
> > GzipInputRange, GzipByLine, and GzipOut.
> > Here is how I can now use them:
> > 
> 
> I've polished your module a bit at:
> 
> https://github.com/nordlow/justd/blob/611ae3aac35a085af966e0c3b717deb0012f637b/zio.d
> 
> Reflections:
> 
> - Performance is terrible even with -release -noboundscheck -unittest. About 20 times slower than zcat $F | wc -l. I'm guessing
> 
>      _chunkRange.front.dup
> 
> slows things down. I tried removing the .dup but then I get
> 
>      std.zlib.ZlibException@std/zlib.d(59): data error
> 
> I don't believe we should have to copy _chunkRange.front, but I can't figure out how to avoid it. Does anybody understand how to fix this?
> 
> - Shouldn't GzipOut.finish() call this.close()? Otherwise the
> file remains unflushed.
> - And what about calling this.close() in GzipOut.~this()? Is that
> needed too?
-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder