iopipe v0.0.4 - RingBuffers!
May 10, 2018
OK, so at dconf I spoke with a few very smart guys about how I can use mmap to make a zero-copy buffer. And I implemented this on the plane ride home.

However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.

If anyone has any good use cases for it, I'm open to suggestions. Something that is going to potentially increase performance is an application that needs to keep the buffer mostly full when extending (i.e. something like 75% full or more).

The buffer is selected by using `rbufd` instead of just `bufd`. Everything should be a drop-in replacement except for that.

Note: I have ONLY tested on macOS, so if you find bugs on other OSes let me know. This is still a POSIX-only library for now, but more on that later...

As a test for ring buffers, I implemented a simple "grep-like" search program that doesn't use regex, but Phobos' canFind to look for lines that match. It also prints some lines of context, configurable on the command line. I thought the lines of context would show better performance with the RingBuffer than the standard buffer, since it has to keep a bunch of lines in the buffer. But alas, it's roughly the same, even with a large number of lines of context (like 200).

However, this example *does* show the power of iopipe -- it handles all flavors of unicode with one template function, is quite straightforward (though I want to abstract the line tracking code, that stuff is really tricky to get right). Oh, and it's roughly 10x faster than grep, and a bunch faster than fgrep, at least on my machine ;) I'm tempted to add regex processing to see if it still beats grep.

Next up (when my bug fix for dmd is merged, see https://issues.dlang.org/show_bug.cgi?id=17968) I will be migrating iopipe to depend on https://github.com/MartinNowak/io, which should unlock Windows support (and I will add RingBuffer Windows support at that point).

Enjoy!

https://github.com/schveiguy/iopipe
https://code.dlang.org/packages/iopipe
http://schveiguy.github.io/iopipe/

-Steve
May 11, 2018
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
> OK, so at dconf I spoke with a few very smart guys about how I can use mmap to make a zero-copy buffer. And I implemented this on the plane ride home.
>
> However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.

I’d start with something clinically synthetic.
Say your record size is exactly half of the buffer + 1 byte. If you were to extend the size of the buffer, it would amortize.

Basically:
16 Mb buffer fixed
vs
16 Mb mmap-ed ring

Where you read pieces in 8M+1 blocks. Yes, we are aiming to blow the CPU cache there. Otherwise the CPU cache is so fast that the occasional copy is zilch; once we hit primary memory it’s not. Adjust sizes for your CPU.

The amount of work done per byte though has to be minimal to actually see anything.


> 
> in the buffer. But alas, it's roughly the same, even with large number of lines for context (like 200).
>
> However, this example *does* show the power of iopipe -- it handles all flavors of unicode with one template function, is quite straightforward (though I want to abstract the line tracking code, that stuff is really tricky to get right). Oh, and it's roughly 10x faster than grep, and a bunch faster than fgrep, at least on my machine ;) I'm tempted to add regex processing to see if it still beats grep.

Should be mostly trivial in fact. I mean our first designs for IOpipe is where I wanted regex to work with it.

Basically - if we started a match, extend window until we get it or lose it. Then release up to the next point of potential start.

>
> Next up (when my bug fix for dmd is merged, see https://issues.dlang.org/show_bug.cgi?id=17968) I will be migrating iopipe to depend on https://github.com/MartinNowak/io, which should unlock Windows support (and I will add RingBuffer Windows support at that point).
>
> Enjoy!
>
> https://github.com/schveiguy/iopipe
> https://code.dlang.org/packages/iopipe
> http://schveiguy.github.io/iopipe/
>
> -Steve


May 11, 2018
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
> However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.

Depends on OS and hardware. I would expect the mmap implementation to be slower, as it reads the file in 4 KB chunks and relies on page faults.
May 11, 2018
On Friday, 11 May 2018 at 09:55:10 UTC, Kagamin wrote:
> On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
>> However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.
>
> Depends on OS and hardware. I would expect mmap implementation to be slower as it reads file in chunks of 4kb and relies on page faults.

It doesn’t. Instead it has a buffer mmaped twice side by side. Therefore you can avoid copy at the end when it wraps around.

Otherwise it’s the same buffering as usual.

May 11, 2018
On 5/11/18 1:30 AM, Dmitry Olshansky wrote:
> On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
>> OK, so at dconf I spoke with a few very smart guys about how I can use mmap to make a zero-copy buffer. And I implemented this on the plane ride home.
>>
>> However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.
> 
> I’d start with something clinicaly synthetic.
> Say your record size is exactly half of buffer + 1 byte. If you were to extend the size of buffer, it would amortize.

Hm.. this wouldn't work, because the idea is to keep some of the buffer full. What will happen here is that the buffer will extend to be able to accommodate the extra byte, and then you are back to having less of the buffer full at once. Iopipe is not afraid to increase the buffer :)

> 
> Basically:
> 16 Mb buffer fixed
> vs
> 16 Mb mmap-ed ring
> 
> Where you read pieces in 8M+1 blocks.Yes, we are aiming to blow the CPU cache there. Otherwise CPU cache is so fast that ocasional copy is zilch, once we hit primary memory it’s not. Adjust sizes for your CPU.

This isn't how it will work. The system looks at the buffer and says "oh, I can just read 8MB - 1 byte," which gives you 2 bytes less than you need. Then you need the extra 2 bytes, so it will increase the buffer to hold at least 2 records.

I do get the point of having to go outside the cache. I'll look and see if maybe specifying a 1000 line context helps ;)

Update: nope, still pretty much the same.

> The amount of work done per byte though has to be minimal to actually see anything.

Right, this is another part of the problem -- if copying is so rare compared to the other operations, then the difference is going to be lost in the noise.

What I have learned here is:

1. Ring buffers are really cool (I still love how it works) and perform as well as normal buffers
2. The use cases are much smaller than I thought
3. In most real-world applications, they are a wash, and not worth the OS tricks needed to use it.
4. iopipe makes testing with a different kind of buffer really easy, which was one of my original goals. So I'm glad that works!

I'm going to (obviously) leave them there, hoping that someone finds a good use case, but I can say that my extreme excitement at getting it to work was depressed quite a bit when I found it didn't really gain much in terms of performance for the use cases I have been doing.

>> in the buffer. But alas, it's roughly the same, even with large number of lines for context (like 200).
>>
>> However, this example *does* show the power of iopipe -- it handles all flavors of unicode with one template function, is quite straightforward (though I want to abstract the line tracking code, that stuff is really tricky to get right). Oh, and it's roughly 10x faster than grep, and a bunch faster than fgrep, at least on my machine ;) I'm tempted to add regex processing to see if it still beats grep.
> 
> Should be mostly trivial in fact. I mean our first designs for IOpipe is where I wanted regex to work with it.
> 
> Basically - if we started a match, extend window until we get it or lose it. Then release up to the next point of potential start.

I'm thinking it's even simpler than that. All matches are dead on a line break (it's how grep normally works), so you simply have to parse the lines and run each one via regex. What I don't know is how much it costs regex to startup and run on an individual line.

One thing I could do to amortize is keep 2N lines in the buffer, and run the regex on a whole context's worth of lines, then dump them all.

I don't get why grep is so bad at this, since it is supposedly doing the matching without line boundaries. I was actually quite shocked when iopipe was that much faster -- even when I'm not asking grep to print out line numbers (so it doesn't actually ever really have to keep track of lines).

-Steve
May 11, 2018
On 5/11/18 5:55 AM, Kagamin wrote:
> On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
>> However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.
> 
> Depends on OS and hardware. I would expect mmap implementation to be slower as it reads file in chunks of 4kb and relies on page faults.

As Dmitry hinted at, there actually is no file involved. I'm mapping just straight memory to 2 segments. In fact, in my test application, I'm using stdin as the input, which may not even involve a file.

It's just as fast as using memory, the only cool part is that you can write a buffer that wraps to the beginning as if it were a normal array.

What surprises me is that the copying for the normal buffer doesn't hurt performance that much. I suppose this should probably have been expected, as CPUs are really really good at processing consecutive memory, and the copying you end up having to do is generally small compared to the rest of your app.

-Steve
May 11, 2018
On 05/11/2018 06:28 AM, Steven Schveighoffer wrote:

> 1. Ring buffers are really cool (I still love how it works) and perform
> as well as normal buffers
> 2. The use cases are much smaller than I thought

There is the LMAX Disruptor, which was open sourced a few years ago along with a large number of articles describing its history and design in great detail. Because of the large number of articles like this one


https://mechanitis.blogspot.com/2011/06/dissecting-disruptor-whats-so-special.html

it's impossible to find the one that had left an impression on me at the time I read it. The article described their story from the beginning to finally arriving at their current design, starting from a simple std::map through lock contention and other concurrency pitfalls. They finally settled on a multi-producer-single-consumer design where the consumer works on one thread. This was giving them the biggest CPU cache advantage.

The producers and the consumer share a ring buffer for communication. Perhaps the example you're looking for is in there somewhere. :)

Ali

May 11, 2018
On Friday, 11 May 2018 at 13:28:58 UTC, Steven Schveighoffer wrote:
> [...]
> I do get the point of having to go outside the cache. I'll look and see if maybe specifying a 1000 line context helps ;)
>
> Update: nope, still pretty much the same.
>

I'm sure someone will find some good show off program.

>> The amount of work done per byte though has to be minimal to actually see anything.
>
> Right, this is another part of the problem -- if copying is so rare compared to the other operations, then the difference is going to be lost in the noise.
>
> What I have learned here is:
>
> 1. Ring buffers are really cool (I still love how it works) and perform as well as normal buffers
> 2. The use cases are much smaller than I thought
> 3. In most real-world applications, they are a wash, and not worth the OS tricks needed to use it.

Now I need to learn all about ring-buffers. Do you have any good starting points?

> 4. iopipe makes testing with a different kind of buffer really easy, which was one of my original goals. So I'm glad that works!

That satisfying feeling when the code works exactly the way you wanted it to!

> I'm going to (obviously) leave them there, hoping that someone finds a good use case, but I can say that my extreme excitement at getting it to work was depressed quite a bit when I found it didn't really gain much in terms of performance for the use cases I have been doing.

I'm sure someone will find a place where it's useful.


>>> However, this example *does* show the power of iopipe -- it handles all flavors of unicode with one template function, is quite straightforward (though I want to abstract the line tracking code, that stuff is really tricky to get right). Oh, and it's roughly 10x faster than grep, and a bunch faster than fgrep, at least on my machine ;) I'm tempted to add regex processing to see if it still beats grep.
>> 
>> Should be mostly trivial in fact. I mean our first designs for IOpipe is where I wanted regex to work with it.
>> 
>> Basically - if we started a match, extend window until we get it or lose it. Then release up to the next point of potential start.
>
> I'm thinking it's even simpler than that. All matches are dead on a line break (it's how grep normally works), so you simply have to parse the lines and run each one via regex. What I don't know is how much it costs regex to startup and run on an individual line.
>
> One thing I could do to amortize is keep 2N lines in the buffer, and run the regex on a whole context's worth of lines, then dump them all.

iopipe is looking like a great library!

> I don't get why grep is so bad at this, since it is supposedly doing the matching without line boundaries. I was actually quite shocked when iopipe was that much faster -- even when I'm not asking grep to print out line numbers (so it doesn't actually ever really have to keep track of lines).
>
> -Steve

That reminds me of this great blog post detailing grep's performance:
http://ridiculousfish.com/blog/posts/old-age-and-treachery.html

Also, one of the original authors of grep wrote about its performance optimizations, for anyone interested:
https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
May 11, 2018
On 5/11/18 10:04 AM, Uknown wrote:
> On Friday, 11 May 2018 at 13:28:58 UTC, Steven Schveighoffer wrote:
>> What I have learned here is:
>>
>> 1. Ring buffers are really cool (I still love how it works) and perform as well as normal buffers
>> 2. The use cases are much smaller than I thought
>> 3. In most real-world applications, they are a wash, and not worth the OS tricks needed to use it.
> 
> Now I need to learn all about ring-buffers. Do you have any good starting points?

I would start here: https://en.wikipedia.org/wiki/Circular_buffer

The point of a circular buffer is to avoid having to copy any data anywhere -- things stay in place as long as they are in-buffer.

So originally, I had intended to make a ring buffer (or circular buffer) where I have a random access range between 2 separate segments. In other words, the range would abstract the fact that the buffer was 2 segments, and give you random access to each element by checking to see which segment it's in.

I never actually made it, because I realized quickly while optimizing e.g. line processing that the huge benefit of having sequential memory far outweighs the drawback of occasionally having to copy data. In other words, you are paying for every element access to avoid paying for a rare small copy.

Consider a byline range. If you have an 8k buffer, and your lines are approximately 80 bytes long on average, then when you reach the end of the buffer and have to move the existing partial line to the front of the buffer to continue reading, you are really only copying 1% of the buffer, 1% of the time. But while you are searching for line endings (99% of the time), you are using a simple indexed pointer dereference.

Contrast that with a disjoint buffer where every access to an element first requires a check to see which segment you are in before dereferencing. You have moved the payment from the 1% into the 99%.

BUT, when at dconf, Dmitry and Shachar let me know about a technique to map the same memory segment to 2 consecutive address ranges. This allows you to look at the ring buffer without it ever being disjoint. Simply put, you have a 2x buffer, whereby each half looks at the same memory. Whenever your buffer start gets to the halfway point, you simply move the pointers back by half a buffer. Other than that, the code is nearly identical to a straight allocated buffer, and the memory access is just as fast.

So I decided to implement it, hoping that I would magically just get a bit better performance. I should have known better :)

>> I'm thinking it's even simpler than that. All matches are dead on a line break (it's how grep normally works), so you simply have to parse the lines and run each one via regex. What I don't know is how much it costs regex to startup and run on an individual line.
>>
>> One thing I could do to amortize is keep 2N lines in the buffer, and run the regex on a whole context's worth of lines, then dump them all.
> 
> iopipe is looking like a great library!

Thanks! I hope to get more utility out of it. I still need to finish/publish my json parser based on it, and I'm thinking we need some parsing tools really to go on top of it to make things easier to approach.

> That reminds me of this great blog post detailing grep's performance:
> http://ridiculousfish.com/blog/posts/old-age-and-treachery.html
> 
> Also, one of the original authors of grep wrote about its performance optimizations, for anyone interested:
> https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html

Thanks, I'll take a look at those.

-Steve
May 11, 2018
On 5/10/18 7:22 PM, Steven Schveighoffer wrote:

> However, this example *does* show the power of iopipe -- it handles all flavors of unicode with one template function, is quite straightforward (though I want to abstract the line tracking code, that stuff is really tricky to get right). Oh, and it's roughly 10x faster than grep, and a bunch faster than fgrep, at least on my machine ;) I'm tempted to add regex processing to see if it still beats grep.

Shameful note: macOS grep is BSD grep, and is not NEARLY as fast as GNU grep, which has much better performance (and is 2x as fast as iopipe_search on my Linux VM, even when printing line numbers).

So at least there is something to strive for :)

-Steve