January 03, 2018
On 02.01.18 14:51, Adam D. Ruppe wrote:
> On Tuesday, 2 January 2018 at 10:27:11 UTC, Christian Köstlin wrote:
>> After this I analyzed the first step of the process (gunzipping the data from a file to memory), and found out that dlang's UnCompress is much slower than Java, Ruby, and plain C.
> 
> Yeah, std.zlib is VERY poorly written. You can get much better performance by just calling the C functions yourself instead. (You can just import etc.c.zlib; it is still included.)
> 
> Improving it would mean changing the public API. I think the one-shot compress/uncompress functions are ok, but the streaming class does a lot of unnecessary work inside like copying stuff around.
I added a version that uses the gzip low-level APIs (similar to my example C program). I am still having problems copying the data fast enough into a D array.

please see the updated page: https://github.com/gizmomogwai/benchmarks/tree/master/gunzip

--
Christian Köstlin

January 03, 2018
On 1/3/18 2:47 AM, Christian Köstlin wrote:
> On 02.01.18 21:13, Steven Schveighoffer wrote:
>> Well, you don't need to use appender for that (and doing so is copying a
>> lot of the data an extra time). All you need is to extend the pipe until
>> there isn't any more new data, and it will all be in the buffer.
>>
>> // almost the same line from your current version
>> auto mypipe = openDev("../out/nist/2011.json.gz")
>>                    .bufd.unzip(CompressionFormat.gzip);
>>
>> // This line here will work with the current release (0.0.2):
>> while(mypipe.extend(0) != 0) {}
> Thanks for this input, I updated the program to make use of this method
> and compare it to the appender thing as well.
> 

Hm.. the numbers are worse! I would have expected to be at least comparable. I'll have to look into it. Thanks for posting this.

-Steve
January 03, 2018
On 1/3/18 9:42 AM, Steven Schveighoffer wrote:
> On 1/3/18 2:47 AM, Christian Köstlin wrote:
>> On 02.01.18 21:13, Steven Schveighoffer wrote:
>>> Well, you don't need to use appender for that (and doing so is copying a
>>> lot of the data an extra time). All you need is to extend the pipe until
>>> there isn't any more new data, and it will all be in the buffer.
>>>
>>> // almost the same line from your current version
>>> auto mypipe = openDev("../out/nist/2011.json.gz")
>>>                    .bufd.unzip(CompressionFormat.gzip);
>>>
>>> // This line here will work with the current release (0.0.2):
>>> while(mypipe.extend(0) != 0) {}
>> Thanks for this input, I updated the program to make use of this method
>> and compare it to the appender thing as well.
>>
> 
> Hm.. the numbers are worse! I would have expected to be at least comparable. I'll have to look into it. Thanks for posting this.

Alright. I have spent some time examining the issues, and here are my findings:

1. The major differentiator between the C and D algorithms is the use of C realloc. This one thing saves the most time. I'm going to update iopipe so you can use it (stand by). I will also be examining how to simulate using realloc when not using C malloc in iopipe. I think it's the copying of data to the new buffer that is causing issues.

2. extend(0) always attempts to read 8k more bytes, so the buffer grows by 8k at a time, spilling into another couple of pages each time. Somehow this is choking the algorithm. I think the cost of extending pages actually ends up hurting performance (something we need to look at in the GC in general).

3. extendElems should allow extending all the elements in the most optimal fashion. Fixing that, iopipe performs as well as the Appender/iopipe version. This is coming out as well.

Stay tuned, there will be updates to iopipe to hopefully make it as fast in this microbenchmark as the C version :)

-Steve
January 03, 2018
On 1/3/18 3:28 PM, Steven Schveighoffer wrote:
> 1. The major differentiator between the C and D algorithms is the use of C realloc. This one thing saves the most time. I'm going to update iopipe so you can use it (stand by). I will also be examining how to simulate using realloc when not using C malloc in iopipe. I think it's the copying of data to the new buffer that is causing issues.

Looking at when C realloc actually moves the data, it appears it suddenly over-allocates very large blocks, much larger than the GC will over-allocate. This is why the GC is losing. After a certain size, the GC doesn't allocate blocks to grow into, so we start copying on every realloc.

So it's not so much the performance of the copying, but the number of times copying is required that is affecting the ultimate performance.

-Steve
January 04, 2018
On 03.01.18 22:33, Steven Schveighoffer wrote:
> On 1/3/18 3:28 PM, Steven Schveighoffer wrote:
>> 1. The major differentiator between the C and D algorithms is the use of C realloc. This one thing saves the most time. I'm going to update iopipe so you can use it (stand by). I will also be examining how to simulate using realloc when not using C malloc in iopipe. I think it's the copying of data to the new buffer that is causing issues.
> 
> Looking at when C realloc actually moves the data, it appears it all of a sudden over-allocates very very large blocks, much larger than the GC will over-allocate. This is why the GC is losing. After a certain size, the GC doesn't allocate blocks to grow into, so we start copying on every realloc.
That's a very interesting finding! If I find some time today, I will also try a pure malloc/memcpy/free solution in C and look at how many real reallocs are done. That's funny, because it means that the whole *2 growth pattern for buffers that you learn about is well taken care of inside libc? Or does it still mean that you should allocate in such a pattern for libc's algorithm to kick in? Is there an API to see how much realloc really allocated, or is it only possible to compare pointers to see whether realloc returned a new or the old pointer?

--
Christian Köstlin

January 04, 2018
I added a C variant now that always does malloc/memcpy/free. It's much
slower for sure.
I also put in some output that shows when a real realloc happens. It's
like you said:

did a real realloc
did a real realloc
did a real realloc
did a real realloc
did a real realloc
did a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did a real realloc
did not a real realloc

Funny thing is, I do not always get the same sequence of real reallocs.


--
Christian Köstlin
January 04, 2018
On 1/4/18 4:47 AM, Christian Köstlin wrote:
> I added a C variant now that always does malloc/memcpy/free. It's much
> slower for sure.
> I also put in some output that shows when a real realloc happens. It's
> like you said:
> 
> did a real realloc
> did a real realloc
> did a real realloc
> did a real realloc
> did a real realloc
> did a real realloc
> did not a real realloc
> did not a real realloc
> did not a real realloc
> did not a real realloc
> did not a real realloc
> did not a real realloc
> did not a real realloc
> did not a real realloc
> did a real realloc
> did not a real realloc
> 
> Funny thing is, I do not always get the same sequence of real reallocs.

Another thing I did was print out the size of the buffer before the realloc. In other words:

auto oldsize = buffer.length;
auto oldptr = buffer.ptr;
buffer = (cast(ubyte *)realloc(buffer.ptr, newsize))[0 .. newsize];
if(buffer.ptr != oldptr) writeln("moved: ", oldsize);

Even the sizes malloc chooses for its big leap are not consistent from run to run!

I'm wondering whether, when you keep reallocing a large block and it needs more memory from the system, it just tacks the new memory onto the end of the existing big block. This would make the performance for this microbenchmark highly dependent on not doing any other allocations!

BTW, my numbers are consistent, but there is a difference from yours. For example, my C runs are about 430ms, and my fast D runs are about 650-700ms. I never get the 200ms range runs that you are seeing. I have a similar machine, but still running Sierra.

-Steve
January 04, 2018
On 1/3/18 3:28 PM, Steven Schveighoffer wrote:

> Stay tuned, there will be updates to iopipe to hopefully make it as fast in this microbenchmark as the C version :)

v0.0.3 has been released. To take advantage of using malloc/realloc, you can use std.experimental.allocator.mallocator.Mallocator as the BufferManager's allocator.

e.g.:

auto myPipe = openDev("file.gz").bufd // not here
      .unzip!Mallocator;              // here

myPipe.ensureElems(); // this works now.

auto bytesRead = myPipe.window.length;

The reason you don't need it on the first bufd is because that is the buffer of the file stream, which isn't going to grow.

Note: the buffer manager does not free the data on destruction! You are responsible for freeing it (if you use Mallocator).

-Steve
January 04, 2018
On 04.01.18 16:53, Steven Schveighoffer wrote:
> On 1/3/18 3:28 PM, Steven Schveighoffer wrote:
> 
>> Stay tuned, there will be updates to iopipe to hopefully make it as fast in this microbenchmark as the C version :)
> 
> v0.0.3 has been released. To take advantage of using malloc/realloc, you can use std.experimental.allocator.mallocator.Mallocator as the BufferManager's allocator.
> 
> e.g.:
> 
> auto myPipe = openDev("file.gz").bufd // not here
>       .unzip!Mallocator;              // here
> 
> myPipe.ensureElems(); // this works now.
> 
> auto bytesRead = myPipe.window.length;
> 
> The reason you don't need it on the first bufd is because that is the buffer of the file stream, which isn't going to grow.
> 
> Note: the buffer manager does not free the data on destruction! You are responsible for freeing it (if you use Mallocator).
Thanks Steve,
this now runs faster, I will update the table.

Sorry, but I do not get how to clean up the malloc'ed memory :) Could you help me out here?

--
Christian Köstlin
January 04, 2018
On 1/4/18 1:57 PM, Christian Köstlin wrote:
> Sorry, but I do not get how to clean up the malloc'ed memory :)
> Could you help me out here?

free(mypipe.window.ptr);

At the moment this works because you haven't released any data from the front. But eventually, I will need to figure out how to handle this when it's not so neat.

-Steve