January 04, 2018
On 1/4/18 1:57 PM, Christian Köstlin wrote:
> Thanks Steve,
> this now runs faster; I will update the table.

Still a bit irked that I can't match the C speed :/

But I can't duplicate your C speed on my Mac, even with gcc, so I'm not sure where to start. I find it interesting that you are not using any optimization flags for gcc.

-Steve
January 05, 2018
On 04.01.18 20:46, Steven Schveighoffer wrote:
> On 1/4/18 1:57 PM, Christian Köstlin wrote:
>> Thanks Steve,
>> this now runs faster; I will update the table.
> 
> Still a bit irked that I can't match the C speed :/
> 
> But I can't duplicate your C speed on my Mac, even with gcc, so I'm not sure where to start. I find it interesting that you are not using any optimization flags for gcc.
I guess the code in my program is small enough that the optimization flags do not matter... most of the work is done in libz, which is dynamically linked against /usr/lib/libz.1.dylib.

I also cannot see what more I should do (I will try realloc with Mallocator) to get the dlang-low-level variant to C speed. Rust is doing quite well there.

--
Christian Köstlin

January 05, 2018
On 1/5/18 1:01 AM, Christian Köstlin wrote:
> On 04.01.18 20:46, Steven Schveighoffer wrote:
>> On 1/4/18 1:57 PM, Christian Köstlin wrote:
>>> Thanks Steve,
>>> this now runs faster; I will update the table.
>>
>> Still a bit irked that I can't match the C speed :/
>>
> But I can't duplicate your C speed on my Mac, even with gcc, so I'm not
> sure where to start. I find it interesting that you are not using any
> optimization flags for gcc.
> I guess the code in my program is small enough that the optimization
> flags do not matter... most of the work is done in libz, which is
> dynamically linked against /usr/lib/libz.1.dylib.

Yeah, I guess most of the bottlenecks are inside libz, or the memory allocator. There isn't much optimization to be done in the main program itself.

> I also cannot see what more I should do (I will try realloc with
> Mallocator) to get the dlang-low-level variant to C speed.

D compiles just the same as C. So theoretically you should be able to get the same performance with a ported version of your C code. It's worth a shot.
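For instance, a rough sketch of what such a port could look like, using etc.c.zlib and Mallocator directly (the gunzip function name, the buffer growth strategy, and the error handling are just placeholders, not your actual code):

import etc.c.zlib;
import std.experimental.allocator.mallocator : Mallocator;

// Decompress gzip data, growing the output buffer via realloc,
// roughly mirroring a typical C loop.
ubyte[] gunzip(const(ubyte)[] input)
{
    z_stream zs;
    // 15 + 32: max window size, auto-detect gzip/zlib headers
    if (inflateInit2(&zs, 15 + 32) != Z_OK)
        assert(0, "inflateInit2 failed");
    scope (exit) inflateEnd(&zs);

    void[] buf = Mallocator.instance.allocate(4096);
    size_t used = 0;
    zs.next_in = cast(typeof(zs.next_in)) input.ptr;
    zs.avail_in = cast(uint) input.length;

    int res;
    do
    {
        if (used == buf.length &&
            !Mallocator.instance.reallocate(buf, buf.length * 2))
            assert(0, "out of memory");
        zs.next_out = cast(ubyte*) buf.ptr + used;
        zs.avail_out = cast(uint) (buf.length - used);
        res = inflate(&zs, Z_NO_FLUSH);
        used = buf.length - zs.avail_out;
    } while (res == Z_OK);
    assert(res == Z_STREAM_END, "corrupt gzip stream");

    // The caller owns this memory and must free it with Mallocator --
    // the same leak issue you mention below.
    return (cast(ubyte*) buf.ptr)[0 .. used];
}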

> Rust is doing quite well there

I'll say a few words of caution here:

1. Almost all of these tests use the same C library to unzip. So it's really not a test of the performance of decompression, but the performance of memory management. And it appears that any test using malloc/realloc is in a different tier. Presumably because of the lack of copies (as discussed earlier).
2. Your Rust test (I think, I'm not sure) is testing 2 things in the same run, which could potentially have dramatic consequences for the second test. For instance, it could already have all the required memory blocks ready, and the allocation strategy suddenly gets better. Or maybe there is some kind of caching of the input being done. I think you'd get a fairer test of the second case by running it in a separate program. I've never used Rust, so I don't know what exactly your code is doing.
3. It's hard to make a decision based on such microbenchmarks as to which solution is "better" in an actual real-world program, especially when the state/usage of the memory allocator plays a huge role in this.

-Steve
January 05, 2018
On 05.01.18 15:39, Steven Schveighoffer wrote:
> Yeah, I guess most of the bottlenecks are inside libz, or the memory allocator. There isn't much optimization to be done in the main program itself.
>
> D compiles just the same as C. So theoretically you should be able to get the same performance with a ported version of your C code. It's worth a shot.
I added another version that tries to do the "same" as the C version
using Mallocator, but I am still way off; perhaps it's creating too many
ranges on the underlying array. But it's around the same speed as your
great iopipe thing.
My solution has the same memory leak, as I am not sure how best to get
the memory out of the FastAppender so that it is automagically cleaned
up. Perhaps if we get RC things, this gets easier?
I updated https://github.com/gizmomogwai/benchmarks/tree/master/gunzip
with the newest numbers from my machine, but I think your iopipe solution
is the best one we can get at the moment!

>> Rust is doing quite well there
> 
> I'll say a few words of caution here:
> 
> 1. Almost all of these tests use the same C library to unzip. So it's
> really not a test of the performance of decompression, but the
> performance of memory management. And it appears that any test using
> malloc/realloc is in a different tier. Presumably because of the lack of
> copies (as discussed earlier).
> 2. Your Rust test (I think, I'm not sure) is testing 2 things in the
> same run, which could potentially have dramatic consequences for the
> second test. For instance, it could already have all the required memory
> blocks ready, and the allocation strategy suddenly gets better. Or maybe
> there is some kind of caching of the input being done. I think you'd get
> a fairer test of the second case by running it in a separate program.
> I've never used Rust, so I don't know what exactly your code is doing.
> 3. It's hard to make a decision based on such microbenchmarks as to
> which solution is "better" in an actual real-world program, especially
> when the state/usage of the memory allocator plays a huge role in this.
Sure ... that's true.


January 05, 2018
On 1/5/18 3:09 PM, Christian Köstlin wrote:
> On 05.01.18 15:39, Steven Schveighoffer wrote:
>> Yeah, I guess most of the bottlenecks are inside libz, or the memory
>> allocator. There isn't much optimization to be done in the main program
>> itself.
>>
>> D compiles just the same as C. So theoretically you should be able to
>> get the same performance with a ported version of your C code. It's
>> worth a shot.
> I added another version that tries to do the "same" as the C version
> using Mallocator, but I am still way off; perhaps it's creating too many
> ranges on the underlying array. But it's around the same speed as your
> great iopipe thing.

Hm... I think really there is some magic initial state of the allocator, and that's what allows it to go so fast.

One thing about the D version, because druntime is also using malloc (the GC is backed by malloc'd data after all), the initial state of the heap is quite different from when you start in C. It may be impossible or nearly impossible to duplicate the performance. But the flipside (if this is indeed the case) is that you won't see the same performance in a real-world app anyway, even in C.

One thing to try, you preallocate the ENTIRE buffer. This only works if you know how many bytes it will decompress to (not always possible), but it will take the allocator out of the equation completely. And it's probably going to be the most efficient method (you aren't leaving behind smaller unused blocks when you realloc). If for some reason we can't beat/tie the C version doing that, then something else is going on.
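Roughly, something like this sketch (expectedSize is a placeholder for the known decompressed size):

import etc.c.zlib;
import std.experimental.allocator.mallocator : Mallocator;

enum expectedSize = 200 * 1024 * 1024; // placeholder

ubyte[] gunzipPrealloc(const(ubyte)[] input)
{
    // One allocation up front: the allocator never runs during inflation.
    auto buffer = cast(ubyte[]) Mallocator.instance.allocate(expectedSize);

    z_stream zs;
    inflateInit2(&zs, 15 + 32); // auto-detect gzip/zlib headers
    scope (exit) inflateEnd(&zs);

    zs.next_in = cast(typeof(zs.next_in)) input.ptr;
    zs.avail_in = cast(uint) input.length;
    zs.next_out = buffer.ptr;
    zs.avail_out = cast(uint) buffer.length;

    // A single inflate call suffices when the output buffer is big enough.
    auto res = inflate(&zs, Z_FINISH);
    assert(res == Z_STREAM_END, "buffer too small or corrupt stream");
    return buffer[0 .. $ - zs.avail_out];
}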

> My solution has the same memory leak, as I am not sure how best to get
> the memory out of the FastAppender so that it is automagically cleaned
> up. Perhaps if we get RC things, this gets easier?

I've been giving some thought to this. I think iopipe needs some buffer management primitives that allow you to finagle the buffer. I've been needing this for some time anyway (for file seeking). Right now, the buffer itself is buried in the chain, so it's hard to get at the actual buffer.

Alternatively, I probably also need to give some thought to a mechanism that auto-frees the memory when it can tell nobody is still using the iopipe. Given that iopipe's signature feature is direct buffer access, this would mean anything that uses such a feature would have to be unsafe.

-Steve
January 06, 2018
On 05.01.18 23:04, Steven Schveighoffer wrote:
> On 1/5/18 3:09 PM, Christian Köstlin wrote:
>> On 05.01.18 15:39, Steven Schveighoffer wrote:
>>> Yeah, I guess most of the bottlenecks are inside libz, or the memory allocator. There isn't much optimization to be done in the main program itself.
>>>
>>> D compiles just the same as C. So theoretically you should be able to get the same performance with a ported version of your C code. It's worth a shot.
>> I added another version that tries to do the "same" as the C version using Mallocator, but I am still way off; perhaps it's creating too many ranges on the underlying array. But it's around the same speed as your great iopipe thing.
> 
> Hm... I think really there is some magic initial state of the allocator, and that's what allows it to go so fast.
> 
> One thing about the D version, because druntime is also using malloc (the GC is backed by malloc'd data after all), the initial state of the heap is quite different from when you start in C. It may be impossible or nearly impossible to duplicate the performance. But the flipside (if this is indeed the case) is that you won't see the same performance in a real-world app anyway, even in C.
> 
> One thing to try, you preallocate the ENTIRE buffer. This only works if you know how many bytes it will decompress to (not always possible), but it will take the allocator out of the equation completely. And it's probably going to be the most efficient method (you aren't leaving behind smaller unused blocks when you realloc). If for some reason we can't beat/tie the C version doing that, then something else is going on.
Yes ... this is something I forgot to try out ... will do now :)
Mhh ... interesting numbers ... C is even faster, and my D low-level
solution is also a little bit faster, but still much slower than the
no-copy version (funnily, "no copy" is the wrong name; it just overwrites
all the data in a small buffer).

>> My solution has the same memory leak, as I am not sure how best to get the memory out of the FastAppender so that it is automagically cleaned up. Perhaps if we get RC things, this gets easier?
> 
> I've been giving some thought to this. I think iopipe needs some buffer management primitives that allow you to finagle the buffer. I've been needing this for some time anyway (for file seeking). Right now, the buffer itself is buried in the chain, so it's hard to get at the actual buffer.
> 
> Alternatively, I probably also need to give some thought to a mechanism that auto-frees the memory when it can tell nobody is still using the iopipe. Given that iopipe's signature feature is direct buffer access, this would mean anything that uses such a feature would have to be unsafe.
Yes ... that's tricky ...
One question about iopipe: is it possible to transform the elements in
the pipe as well ... e.g. from a buffer of bytes to JSON objects?

--
Christian Köstlin

January 07, 2018
On 1/6/18 11:14 AM, Christian Köstlin wrote:
> On 05.01.18 23:04, Steven Schveighoffer wrote:
>> One thing to try, you preallocate the ENTIRE buffer. This only works if
>> you know how many bytes it will decompress to (not always possible), but
>> it will take the allocator out of the equation completely. And it's
>> probably going to be the most efficient method (you aren't leaving
>> behind smaller unused blocks when you realloc). If for some reason we
>> can't beat/tie the C version doing that, then something else is going on.
> Yes ... this is something I forgot to try out ... will do now :)
> Mhh ... interesting numbers ... C is even faster, and my D low-level
> solution is also a little bit faster, but still much slower than the
> no-copy version (funnily, "no copy" is the wrong name; it just
> overwrites all the data in a small buffer).

Not from what I'm reading, the C solution is about the same (257 vs. 261). Not sure if you have averaged these numbers, especially on a real computer that might be doing other things.

Note: I would expect it to be a tiny bit faster, but not monumentally faster. From my testing with the reallocation, it only reallocates a large quantity of data once.

However, the D solution should be much faster. Part of the issue is that you still aren't low-level enough :)

Instead of allocating the ubyte array with this line:

ubyte[] buffer = new ubyte[200*1024*1024];

Try this instead:

// from std.array
auto buffer = uninitializedArray!(ubyte[])(200*1024*1024);

The difference is that the first one will have the runtime 0-initialize all the data.

> One question about iopipe: is it possible to transform the elements in
> the pipe as well ... e.g. from a buffer of bytes to JSON objects?

Yes! I am working on doing just that, but haven't had a chance to update the toy project I wrote: https://github.com/schveiguy/jsoniopipe

I was actually planning on having an iopipe of JsonItem, which would work just like a normal buffer, but reference the ubyte buffer underneath.

Eventually, the final product should have a range of JsonValue, which you would recurse into in order to parse its children. All of it will be lazy, and stream-based, so you don't have to load the whole file if it's huge.

Note, you can't have an iopipe of JsonValue, because it's a recursive format. JsonItems are just individual defined tokens, so they can be linear.
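To illustrate the difference (purely hypothetical types, not the actual jsoniopipe API):

// A flat token just records what it is and where it sits in the
// underlying ubyte buffer -- no copying, no recursion, so a linear
// stream of these works fine.
enum JsonToken { objectStart, objectEnd, arrayStart, arrayEnd, key, str, number, boolean, nil }

struct JsonItem
{
    JsonToken token;
    const(char)[] window; // slice of the iopipe's buffer
}

// A JsonValue, by contrast, would have to reference its children
// (object members, array elements), which is why it can't be a flat pipe.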

-Steve
January 09, 2018
On 07.01.18 14:44, Steven Schveighoffer wrote:
> Not from what I'm reading, the C solution is about the same (257 vs. 261). Not sure if you have averaged these numbers, especially on a real computer that might be doing other things.
Yes, you are right ... for proper benchmarking, proper statistics should be in place: taking out extreme values, averaging, ...
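e.g. a minimal sketch of reporting min/median over several runs (decompress is a placeholder for the benchmarked function):

import std.algorithm : sort;
import std.datetime.stopwatch : benchmark;
import std.stdio : writefln;

void report(alias fun)(uint runs = 10)
{
    long[] msecs;
    foreach (i; 0 .. runs)
        msecs ~= benchmark!fun(1)[0].total!"msecs"; // one timed call per run
    sort(msecs);
    // min and median are less noisy than a single run or a plain mean
    writefln("min=%sms median=%sms", msecs[0], msecs[$ / 2]);
}

// usage: report!decompress();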

> Note: I would expect it to be a tiny bit faster, but not monumentally faster. From my testing with the reallocation, it only reallocates a large quantity of data once.
> 
> However, the D solution should be much faster. Part of the issue is that you still aren't low-level enough :)
> 
> Instead of allocating the ubyte array with this line:
> 
> ubyte[] buffer = new ubyte[200*1024*1024];
> 
> Try this instead:
> 
> // from std.array
> auto buffer = uninitializedArray!(ubyte[])(200*1024*1024);
Thanks for that ... I just did not know how to get an uninitialized array. I was aware that dlang is nice and puts .init there :)

> Yes! I am working on doing just that, but haven't had a chance to update the toy project I wrote: https://github.com/schveiguy/jsoniopipe
> 
> I was actually planning on having an iopipe of JsonItem, which would
> work just like a normal buffer, but reference the ubyte buffer underneath.
> 
> Eventually, the final product should have a range of JsonValue, which you would recurse into in order to parse its children. All of it will be lazy, and stream-based, so you don't have to load the whole file if it's huge.
> 
> Note, you can't have an iopipe of JsonValue, because it's a recursive format. JsonItems are just individual defined tokens, so they can be linear.
Sounds really good. I played around with https://github.com/mleise/fast/blob/master/source/fast/json.d ... that's an interesting pull parser, unfortunately with the wrong license ... I wonder if something like this could be done on top of iopipe instead of a "real" buffer.

--
Christian Köstlin