iopipe v0.0.4 - RingBuffers!
May 10, 2018
OK, so at dconf I spoke with a few very smart guys about how I can use mmap to make a zero-copy buffer. And I implemented this on the plane ride home.

However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.

If anyone has any good use cases for it, I'm open to suggestions. Something that is going to potentially increase performance is an application that needs to keep the buffer mostly full when extending (i.e. something like 75% full or more).

The buffer is selected by using `rbufd` instead of just `bufd`. Everything should be a drop-in replacement except for that.

Note: I have ONLY tested on macOS, so if you find bugs on other OSes let me know. This is still a POSIX-only library for now, but more on that later...

As a test for ring buffers, I implemented a simple "grep-like" search program that doesn't use regex, but Phobos' canFind to look for lines that match. It also prints some lines of context, configurable on the command line. I thought the lines of context would show better performance with the RingBuffer than the standard buffer, since it has to keep a bunch of lines in the buffer. But alas, it's roughly the same, even with a large number of lines of context (like 200).

However, this example *does* show the power of iopipe -- it handles all flavors of unicode with one template function, is quite straightforward (though I want to abstract the line tracking code, that stuff is really tricky to get right). Oh, and it's roughly 10x faster than grep, and a bunch faster than fgrep, at least on my machine ;) I'm tempted to add regex processing to see if it still beats grep.

Next up (when my bug fix for dmd is merged, see https://issues.dlang.org/show_bug.cgi?id=17968) I will be migrating iopipe to depend on https://github.com/MartinNowak/io, which should unlock Windows support (and I will add RingBuffer Windows support at that point).

Enjoy!

https://github.com/schveiguy/iopipe
https://code.dlang.org/packages/iopipe
http://schveiguy.github.io/iopipe/

-Steve
May 11, 2018
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
> OK, so at dconf I spoke with a few very smart guys about how I can use mmap to make a zero-copy buffer. And I implemented this on the plane ride home.
>
> However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.

I’d start with something clinically synthetic.
Say your record size is exactly half of the buffer + 1 byte. If you were to extend the size of the buffer, it would amortize.

Basically:
16 Mb buffer fixed
vs
16 Mb mmap-ed ring

Where you read pieces in 8M+1 blocks. Yes, we are aiming to blow the CPU cache there. Otherwise the CPU cache is so fast that the occasional copy is zilch; once we hit primary memory it’s not. Adjust sizes for your CPU.

The amount of work done per byte though has to be minimal to actually see anything.


> 
> in the buffer. But alas, it's roughly the same, even with large number of lines for context (like 200).
>
> However, this example *does* show the power of iopipe -- it handles all flavors of unicode with one template function, is quite straightforward (though I want to abstract the line tracking code, that stuff is really tricky to get right). Oh, and it's roughly 10x faster than grep, and a bunch faster than fgrep, at least on my machine ;) I'm tempted to add regex processing to see if it still beats grep.

Should be mostly trivial in fact. I mean our first designs for IOpipe is where I wanted regex to work with it.

Basically - if we started a match, extend window until we get it or lose it. Then release up to the next point of potential start.

>
> Next up (when my bug fix for dmd is merged, see https://issues.dlang.org/show_bug.cgi?id=17968) I will be migrating iopipe to depend on https://github.com/MartinNowak/io, which should unlock Windows support (and I will add RingBuffer Windows support at that point).
>
> Enjoy!
>
> https://github.com/schveiguy/iopipe
> https://code.dlang.org/packages/iopipe
> http://schveiguy.github.io/iopipe/
>
> -Steve


May 11, 2018
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
> However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.

Depends on OS and hardware. I would expect the mmap implementation to be slower, as it reads the file in 4 KB chunks and relies on page faults.
May 11, 2018
On Friday, 11 May 2018 at 09:55:10 UTC, Kagamin wrote:
> On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
>> However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.
>
> Depends on OS and hardware. I would expect mmap implementation to be slower as it reads file in chunks of 4kb and relies on page faults.

It doesn’t. Instead it has a buffer mmaped twice side by side. Therefore you can avoid copy at the end when it wraps around.

Otherwise it’s the same buffering as usual.

May 11, 2018
On 5/11/18 1:30 AM, Dmitry Olshansky wrote:
> On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
>> OK, so at dconf I spoke with a few very smart guys about how I can use mmap to make a zero-copy buffer. And I implemented this on the plane ride home.
>>
>> However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.
> 
> I’d start with something clinicaly synthetic.
> Say your record size is exactly half of buffer + 1 byte. If you were to extend the size of buffer, it would amortize.

Hm.. this wouldn't work, because the idea is to keep some of the buffer full. What will happen here is that the buffer will extend to be able to accommodate the extra byte, and then you are back to having less of the buffer full at once. Iopipe is not afraid to increase the buffer :)

> 
> Basically:
> 16 Mb buffer fixed
> vs
> 16 Mb mmap-ed ring
> 
> Where you read pieces in 8M+1 blocks.Yes, we are aiming to blow the CPU cache there. Otherwise CPU cache is so fast that ocasional copy is zilch, once we hit primary memory it’s not. Adjust sizes for your CPU.

This isn't how it will work. The system looks at the buffer and says "oh, I can just read 8MB - 1 byte," which gives you 2 bytes less than you need. Then you need the extra 2 bytes, so it will increase the buffer to hold at least 2 records.

I do get the point of having to go outside the cache. I'll look and see if maybe specifying a 1000 line context helps ;)

Update: nope, still pretty much the same.

> The amount of work done per byte though has to be minimal to actually see anything.

Right, this is another part of the problem -- if copying is so rare compared to the other operations, then the difference is going to be lost in the noise.

What I have learned here is:

1. Ring buffers are really cool (I still love how it works) and perform as well as normal buffers
2. The use cases are much smaller than I thought
3. In most real-world applications, they are a wash, and not worth the OS tricks needed to use it.
4. iopipe makes testing with a different kind of buffer really easy, which was one of my original goals. So I'm glad that works!

I'm going to (obviously) leave them there, hoping that someone finds a good use case, but I can say that my extreme excitement at getting it to work was depressed quite a bit when I found it didn't really gain much in terms of performance for the use cases I have been doing.

>> in the buffer. But alas, it's roughly the same, even with large number of lines for context (like 200).
>>
>> However, this example *does* show the power of iopipe -- it handles all flavors of unicode with one template function, is quite straightforward (though I want to abstract the line tracking code, that stuff is really tricky to get right). Oh, and it's roughly 10x faster than grep, and a bunch faster than fgrep, at least on my machine ;) I'm tempted to add regex processing to see if it still beats grep.
> 
> Should be mostly trivial in fact. I mean our first designs for IOpipe is where I wanted regex to work with it.
> 
> Basically - if we started a match, extend window until we get it or lose it. Then release up to the next point of potential start.

I'm thinking it's even simpler than that. All matches are dead on a line break (it's how grep normally works), so you simply have to parse the lines and run each one via regex. What I don't know is how much it costs regex to startup and run on an individual line.

One thing I could do to amortize is keep 2N lines in the buffer, and run the regex on a whole context's worth of lines, then dump them all.

I don't get why grep is so bad at this, since it is supposedly doing the matching without line boundaries. I was actually quite shocked when iopipe was that much faster -- even when I'm not asking grep to print out line numbers (so it doesn't actually ever really have to keep track of lines).

-Steve
May 11, 2018
On 5/11/18 5:55 AM, Kagamin wrote:
> On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
>> However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.
> 
> Depends on OS and hardware. I would expect mmap implementation to be slower as it reads file in chunks of 4kb and relies on page faults.

As Dmitry hinted at, there actually is no file involved. I'm mapping just straight memory to 2 segments. In fact, in my test application, I'm using stdin as the input, which may not even involve a file.

It's just as fast as using memory, the only cool part is that you can write a buffer that wraps to the beginning as if it were a normal array.

What surprises me is that the copying for the normal buffer doesn't hurt performance that much. I suppose this should probably have been expected, as CPUs are really really good at processing consecutive memory, and the copying you end up having to do is generally small compared to the rest of your app.

-Steve
May 11, 2018
On 05/11/2018 06:28 AM, Steven Schveighoffer wrote:

> 1. Ring buffers are really cool (I still love how it works) and perform
> as well as normal buffers
> 2. The use cases are much smaller than I thought

There is the LMAX Disruptor, which was open sourced a few years ago along with a large number of articles describing its history and design in great detail. Because of the large number of articles like this one


https://mechanitis.blogspot.com/2011/06/dissecting-disruptor-whats-so-special.html

it's impossible to find the one that had left an impression on me at the time I read it. The article described their story from the beginning to finally arriving at their current design, starting from a simple std::map through lock contention and other concurrency pitfalls. They finally settled on a multi-producer-single-consumer design where the consumer works on one thread. This was giving them the biggest CPU cache advantage.

The producers and the consumer share a ring buffer for communication. Perhaps the example you're looking for is in there somewhere. :)

Ali

May 11, 2018
On Friday, 11 May 2018 at 13:28:58 UTC, Steven Schveighoffer wrote:
> [...]
> I do get the point of having to go outside the cache. I'll look and see if maybe specifying a 1000 line context helps ;)
>
> Update: nope, still pretty much the same.
>

I'm sure someone will find some good show off program.

>> The amount of work done per byte though has to be minimal to actually see anything.
>
> Right, this is another part of the problem -- if copying is so rare compared to the other operations, then the difference is going to be lost in the noise.
>
> What I have learned here is:
>
> 1. Ring buffers are really cool (I still love how it works) and perform as well as normal buffers
> 2. The use cases are much smaller than I thought
> 3. In most real-world applications, they are a wash, and not worth the OS tricks needed to use it.

Now I need to learn all about ring-buffers. Do you have any good starting points?

> 4. iopipe makes testing with a different kind of buffer really easy, which was one of my original goals. So I'm glad that works!

That satisfying feeling when the code works exactly the way you wanted it to!

> I'm going to (obviously) leave them there, hoping that someone finds a good use case, but I can say that my extreme excitement at getting it to work was depressed quite a bit when I found it didn't really gain much in terms of performance for the use cases I have been doing.

I'm sure someone will find a place where it's useful.


>>> However, this example *does* show the power of iopipe -- it handles all flavors of unicode with one template function, is quite straightforward (though I want to abstract the line tracking code, that stuff is really tricky to get right). Oh, and it's roughly 10x faster than grep, and a bunch faster than fgrep, at least on my machine ;) I'm tempted to add regex processing to see if it still beats grep.
>> 
>> Should be mostly trivial in fact. I mean our first designs for IOpipe is where I wanted regex to work with it.
>> 
>> Basically - if we started a match, extend window until we get it or lose it. Then release up to the next point of potential start.
>
> I'm thinking it's even simpler than that. All matches are dead on a line break (it's how grep normally works), so you simply have to parse the lines and run each one via regex. What I don't know is how much it costs regex to startup and run on an individual line.
>
> One thing I could do to amortize is keep 2N lines in the buffer, and run the regex on a whole context's worth of lines, then dump them all.

iopipe is looking like a great library!

> I don't get why grep is so bad at this, since it is supposedly doing the matching without line boundaries. I was actually quite shocked when iopipe was that much faster -- even when I'm not asking grep to print out line numbers (so it doesn't actually ever really have to keep track of lines).
>
> -Steve

That reminds me of this great blog post detailing grep's performance:
http://ridiculousfish.com/blog/posts/old-age-and-treachery.html

Also, one of the original authors of grep wrote about its performance optimizations, for anyone interested:
https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
May 11, 2018
On 5/11/18 10:04 AM, Uknown wrote:
> On Friday, 11 May 2018 at 13:28:58 UTC, Steven Schveighoffer wrote:
>> What I have learned here is:
>>
>> 1. Ring buffers are really cool (I still love how it works) and perform as well as normal buffers
>> 2. The use cases are much smaller than I thought
>> 3. In most real-world applications, they are a wash, and not worth the OS tricks needed to use it.
> 
> Now I need to learn all about ring-buffers. Do you have any good starting points?

I would start here: https://en.wikipedia.org/wiki/Circular_buffer

The point of a circular buffer is to avoid having to copy any data anywhere -- things stay in place as long as they are in-buffer.

So originally, I had intended to make a ring buffer (or circular buffer) where I have a random access range between 2 separate segments. In other words, the range would abstract the fact that the buffer was 2 segments, and give you random access to each element by checking to see which segment it's in.

I never actually made it, because I realized quickly while optimizing e.g. line processing that the huge benefit of having sequential memory far outweighs the drawback of occasionally having to copy data. In other words, you are paying for every element access to avoid paying for a rare small copy.

Consider a byline range. If you have an 8k buffer, and your lines are approximately 80 bytes long on average, then when you reach the end of the buffer and have to move the existing partial line to the front of the buffer to continue reading, you are really only copying 1% of the buffer, 1% of the time. But while you are searching for line endings (99% of the time), you are using a simple indexed pointer dereference.

Contrast that with a disjoint buffer where every access to an element first requires a check to see which segment you are in before dereferencing. You have moved the payment from the 1% into the 99%.

BUT, when at dconf, Dmitry and Shachar let me know about a technique to map the same memory segment to 2 consecutive address ranges. This allows you to look at the ring buffer without it ever being disjoint. Simply put, you have a 2x buffer, whereby each half looks at the same memory. Whenever your buffer start gets to the halfway point, you simply move the pointers back by half a buffer. Other than that, the code is nearly identical to a straight allocated buffer, and the memory access is just as fast.

So I decided to implement it, hoping that I would magically just get a bit better performance. I should have known better :)

>> I'm thinking it's even simpler than that. All matches are dead on a line break (it's how grep normally works), so you simply have to parse the lines and run each one via regex. What I don't know is how much it costs regex to startup and run on an individual line.
>>
>> One thing I could do to amortize is keep 2N lines in the buffer, and run the regex on a whole context's worth of lines, then dump them all.
> 
> iopipe is looking like a great library!

Thanks! I hope to get more utility out of it. I still need to finish/publish my json parser based on it, and I'm thinking we need some parsing tools really to go on top of it to make things easier to approach.

> That reminds me of this great blog post detailing grep's performance:
> http://ridiculousfish.com/blog/posts/old-age-and-treachery.html
> 
> Also, one of the original authors of grep wrote about its performance optimizations, for anyone interested:
> https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html

Thanks, I'll take a look at those.

-Steve
May 11, 2018
On 5/10/18 7:22 PM, Steven Schveighoffer wrote:

> However, this example *does* show the power of iopipe -- it handles all flavors of unicode with one template function, is quite straightforward (though I want to abstract the line tracking code, that stuff is really tricky to get right). Oh, and it's roughly 10x faster than grep, and a bunch faster than fgrep, at least on my machine ;) I'm tempted to add regex processing to see if it still beats grep.

Shameful note: macOS grep is BSD grep, and is not NEARLY as fast as GNU grep, which has much better performance (and is 2x as fast as iopipe_search on my Linux VM, even when printing line numbers).

So at least there is something to strive for :)

-Steve