May 12, 2018
On Saturday, 12 May 2018 at 14:48:58 UTC, Joakim wrote:
> On Saturday, 12 May 2018 at 12:45:16 UTC, Dmitry Olshansky wrote:
>> On Saturday, 12 May 2018 at 12:14:28 UTC, Steven Schveighoffer wrote:
>>>[...]
>>
>> I could offer a few tricks to fix that w/o getting too dirty. GNU grep is fast, but std.regex is faster than that in raw speed on a significant class of quite common patterns. But I loaded the file at once.
>>
>>> [...]
>>
>> As such initiatives go, it’s either now or never. Please get in touch directly over Slack or something, let’s make it roll. I have wanted to do a grep-like utility since 2012. Now at long last we have all the building blocks.
>
> If you're talking about writing a grep prototype in D, that's a great idea, especially for publicizing D. :)

For shaming others to beat us using some other language. Making life better for everyone. Taking a DMD to a gun fight ;)


May 12, 2018
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
> OK, so at dconf I spoke with a few very smart guys about how I can use mmap to make a zero-copy buffer. And I implemented this on the plane ride home.
>
> [...]

They can be problematic with some CPUs and OSes. Modern CPUs should have no problems, but on those with exotic caches and virtual memory configurations there can be some aliasing issues. Linus Torvalds talked a little about that case in this thread on realworldtech:
https://www.realworldtech.com/forum/?threadid=174426&curpostid=174731

May 12, 2018
On Friday, 11 May 2018 at 16:06:41 UTC, Jonathan M Davis wrote:
> [...]
Oh, there was an epic forum thread about the use of GNU grep for BSD. I don't remember the details, but it was long and heated (it was so epic that I even read it, as I normally don't care at all for BSD stuff).

May 12, 2018
On 5/12/18 3:38 PM, Patrick Schluter wrote:
> On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
>> OK, so at dconf I spoke with a few very smart guys about how I can use mmap to make a zero-copy buffer. And I implemented this on the plane ride home.
>>
>> [...]
> 
> They can be problematic with some CPUs and OSes. Modern CPUs should have no problems, but on those with exotic caches and virtual memory configurations there can be some aliasing issues. Linus Torvalds talked a little about that case in this thread on realworldtech:
> https://www.realworldtech.com/forum/?threadid=174426&curpostid=174731

Thanks for the tip. The nice thing about iopipe is that the buffer type is completely selectable, and nothing changes, except possibly some performance.

So on those architectures, I would expect people to select the normal AllocatedBuffer type.

-Steve

May 12, 2018
On 05/12/2018 08:14 AM, Steven Schveighoffer wrote:
> 
> But I am not expecting any miracles here. GNU grep does pretty much everything it can to achieve performance -- including eschewing the standard library buffering system as I am doing. I can probably match the performance at some point, but I doubt it's worth worrying about. 

I wonder if there are realistic cases where you could beat it by being a library solution and skipping the cost of launching grep as a new process. Granted, outside of Windows, process launching is considered to be fairly cheap, but it still isn't no-cost.

That would still be a nice feather in D's cap: Comparable to grep for large data, faster than spawning a grep process for smaller data.

May 14, 2018
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
> OK, so at dconf I spoke with a few very smart guys about how I can use mmap to make a zero-copy buffer. And I implemented this on the plane ride home.
>
> However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.
>
> If anyone has any good use cases for it, I'm open to suggestions. Something that is going to potentially increase performance is an application that needs to keep the buffer mostly full when extending (i.e. something like 75% full or more).
>
> The buffer is selected by using `rbufd` instead of just `bufd`. Everything should be a drop-in replacement except for that.
>
> Note: I have ONLY tested on Macos, so if you find bugs in other OSes let me know. This is still a Posix-only library for now, but more on that later...
>
> As a test for Ring buffers, I implemented a simple "grep-like" search program that doesn't use regex, but phobos' canFind to look for lines that match. It also prints some lines of context, configurable on the command line. The lines of context I thought would show better performance with the RingBuffer than the standard buffer since it has to keep a bunch of lines in the buffer. But alas, it's roughly the same, even with large number of lines for context (like 200).
>
> However, this example *does* show the power of iopipe -- it handles all flavors of unicode with one template function, is quite straightforward (though I want to abstract the line tracking code, that stuff is really tricky to get right). Oh, and it's roughly 10x faster than grep, and a bunch faster than fgrep, at least on my machine ;) I'm tempted to add regex processing to see if it still beats grep.
>
> Next up (when my bug fix for dmd is merged, see https://issues.dlang.org/show_bug.cgi?id=17968) I will be migrating iopipe to depend on https://github.com/MartinNowak/io, which should unlock Windows support (and I will add RingBuffer Windows support at that point).
>
> Enjoy!
>
> https://github.com/schveiguy/iopipe
> https://code.dlang.org/packages/iopipe
> http://schveiguy.github.io/iopipe/
>
> -Steve

Hi Steve,

This is exciting work that could help in the bioinformatics area.
Indeed, in bioinformatics we are I/O bound and we process a lot of big files; the amount of data can be in gigabytes, terabytes, and sometimes even petabytes.

So processing this amount of data efficiently is critical. Some years ago I got a request 'How to parse fastq file format in D?' and monarch_dodra wrote a really fast parser (code: http://dpaste.dzfl.pl/37b893ed )

It could be interesting to show how fast iopipe is.

You can grab a fastq file from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/ and take a look at iopipe's performance.

A fastq file is a plain text format, and it is usually a repetition of four lines:
1/ title and description
this line starts with @
2/ sequence line
this line usually contains DNA letters (ACGT)
3/ comment line
this line starts with +
4/ quality of the bases
this line has the same length as the sequence line (n°2)

Rarely, the comment section is over multiple lines.
Warning: the @ and + characters can appear inside the quality line, so I search for the two-character patterns '\n@' and '\n+'. I never split the file by line, as that is a waste of time; instead I read the content as a stream.
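
To make the record layout above concrete, here is a minimal sketch in Python (not D, and not iopipe) that walks FASTQ text held in memory. It is an illustration under assumptions, not monarch_dodra's parser: the name `parse_fastq` is made up, line endings are assumed to be plain '\n', and sequences are assumed non-empty. It finds the comment line with the '\n+' pattern, then relies on the rule from point 4 (quality length equals sequence length) to decide where a record ends, so an @ or + at the start of a quality line is not mistaken for a new record.

```python
from typing import Iterator, Tuple


def parse_fastq(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (title, sequence, quality) for each record in FASTQ text."""
    pos, end = 0, len(text)
    while pos < end:
        if text[pos] != "@":
            raise ValueError(f"expected '@' at offset {pos}")
        title_end = text.find("\n", pos)
        if title_end == -1:
            raise ValueError("truncated record: title only")
        title = text[pos + 1:title_end]

        # Sequence letters never include '+', so the first "\n+" after the
        # title marks the start of the comment line.
        plus = text.find("\n+", title_end)
        if plus == -1:
            raise ValueError("truncated record: no '+' line")
        sequence = text[title_end + 1:plus].replace("\n", "")

        comment_end = text.find("\n", plus + 1)
        if comment_end == -1:
            raise ValueError("truncated record: no quality")
        pos = comment_end + 1

        # Quality may contain '@' and '+', so read by length, not by marker.
        parts = []
        remaining = len(sequence)
        while remaining > 0:
            if pos >= end:
                raise ValueError("truncated record: quality too short")
            line_end = text.find("\n", pos)
            if line_end == -1:
                line_end = end
            line = text[pos:line_end]
            parts.append(line)
            remaining -= len(line)
            pos = line_end + 1
        if remaining < 0:
            raise ValueError("quality longer than sequence")
        yield title, sequence, "".join(parts)


if __name__ == "__main__":
    sample = "@r1 desc\nACGT\n+\n@+II\n"
    for record in parse_fastq(sample):
        print(record)
```

A streaming version would apply the same steps to a buffer window instead of one big string; the sketch only shows the record logic.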

I hope this showcase helps you.

Good luck :-)

May 14, 2018
On 5/14/18 6:02 AM, bioinfornatics wrote:
> On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
>> OK, so at dconf I spoke with a few very smart guys about how I can use mmap to make a zero-copy buffer. And I implemented this on the plane ride home.
>>
>> However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.
>>
>> If anyone has any good use cases for it, I'm open to suggestions. Something that is going to potentially increase performance is an application that needs to keep the buffer mostly full when extending (i.e. something like 75% full or more).
>>
>> The buffer is selected by using `rbufd` instead of just `bufd`. Everything should be a drop-in replacement except for that.
>>
>> Note: I have ONLY tested on Macos, so if you find bugs in other OSes let me know. This is still a Posix-only library for now, but more on that later...
>>
>> As a test for Ring buffers, I implemented a simple "grep-like" search program that doesn't use regex, but phobos' canFind to look for lines that match. It also prints some lines of context, configurable on the command line. The lines of context I thought would show better performance with the RingBuffer than the standard buffer since it has to keep a bunch of lines in the buffer. But alas, it's roughly the same, even with large number of lines for context (like 200).
>>
>> However, this example *does* show the power of iopipe -- it handles all flavors of unicode with one template function, is quite straightforward (though I want to abstract the line tracking code, that stuff is really tricky to get right). Oh, and it's roughly 10x faster than grep, and a bunch faster than fgrep, at least on my machine ;) I'm tempted to add regex processing to see if it still beats grep.
>>
>> Next up (when my bug fix for dmd is merged, see https://issues.dlang.org/show_bug.cgi?id=17968) I will be migrating iopipe to depend on https://github.com/MartinNowak/io, which should unlock Windows support (and I will add RingBuffer Windows support at that point).
>>
>> Enjoy!
>>
>> https://github.com/schveiguy/iopipe
>> https://code.dlang.org/packages/iopipe
>> http://schveiguy.github.io/iopipe/
>>
> 
> Hi Steve,
> 
> This is exciting work that could help in the bioinformatics area.
> Indeed, in bioinformatics we are I/O bound and we process a lot of big files; the amount of data can be in gigabytes, terabytes, and sometimes even petabytes.
> 
> So processing this amount of data efficiently is critical. Some years ago I got a request 'How to parse fastq file format in D?' and monarch_dodra wrote a really fast parser (code: http://dpaste.dzfl.pl/37b893ed )
> 
> It could be interesting to show how fast iopipe is.

Yeah, I have been working on and off with Vang Le (biocyberman) on using iopipe to parse such formats. He gave a good presentation at dconf this year on using D in bioinformatics, and I think it is a great fit for D!

At dconf, I threw together a crude fasta parser (with the intention of having it be the base for parsing fastq as well) to demonstrate how iopipe can perform while parsing such things. I have no idea how fast or slow it is, as I just barely got it to work (it passes unit tests I made up based on the Wikipedia entry for fasta), but IMO, the direct buffer access makes fast parsing much more pleasant than having to deal with your own buffering (using phobos makes parsing a bit difficult, however; I still see a need for some parsing tools for iopipe).

You can find that library here: https://github.com/schveiguy/fastaq

Not being in the field of bioinformatics, I can't really say that I am likely to continue development of it, but I'm certainly willing to help with iopipe for anyone who wants to use it in this field.

-Steve

May 14, 2018
On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
> However, I am struggling to find a use case for this that showcases why you would want to use it. While it does work, and works beautifully, it doesn't show any measurable difference vs. the array allocated buffer that copies data when it needs to extend.

I can think of a good use-case:
- Audio streaming (in an embedded environment)!

If you have something like a Bluetooth audio source, and ALSA or any audio hardware API as an audio sink to speakers, you can use the ring buffer as a FIFO between the two.

The Bluetooth source has its own pace (and you cannot control it) and has a variable bit-rate, whereas the sink has a constant bit-rate, so you have to have a buffer between them. And you want to reduce the CPU cost as much as possible due to embedded system constraints (or even real-time constraints, especially for audio).
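
A minimal sketch of that FIFO role, in Python for brevity rather than D. This is a plain fixed-capacity byte ring buffer, not iopipe's RingBuffer and not mmap-backed; the class name `RingFifo` and its `write`/`read` methods are hypothetical. The producer (the variable-rate source) writes whatever fits, and the consumer (the constant-rate sink) reads fixed-size chunks. Because the storage here is a single array, an operation that crosses the end of it is split into two copies; the sketch does not attempt the copy-avoiding behaviour discussed in this thread.

```python
class RingFifo:
    """Fixed-capacity byte FIFO backed by one preallocated bytearray."""

    def __init__(self, capacity: int) -> None:
        self._buf = bytearray(capacity)
        self._head = 0  # index of the oldest unread byte
        self._size = 0  # number of unread bytes

    def __len__(self) -> int:
        return self._size

    def write(self, data: bytes) -> int:
        """Store as much of data as fits; return the number of bytes taken."""
        cap = len(self._buf)
        n = min(len(data), cap - self._size)
        tail = (self._head + self._size) % cap
        first = min(n, cap - tail)  # bytes that fit before the wrap point
        self._buf[tail:tail + first] = data[:first]
        self._buf[0:n - first] = data[first:n]  # wrapped remainder, if any
        self._size += n
        return n

    def read(self, n: int) -> bytes:
        """Remove and return up to n of the oldest bytes."""
        cap = len(self._buf)
        n = min(n, self._size)
        first = min(n, cap - self._head)
        out = bytes(self._buf[self._head:self._head + first])
        out += bytes(self._buf[0:n - first])
        self._head = (self._head + n) % cap
        self._size -= n
        return out


if __name__ == "__main__":
    fifo = RingFifo(8)
    fifo.write(b"abcdef")       # source delivers a burst
    print(fifo.read(4))         # sink drains a fixed-size chunk
    fifo.write(b"ghijkl")       # this write wraps around the end
    print(fifo.read(8))
```

When `write` returns less than the length offered, the source is outrunning the sink and the caller has to decide whether to drop or wait; when `read` returns fewer bytes than requested, the sink is underrunning.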

May 15, 2018
On Monday, 14 May 2018 at 14:23:43 UTC, Steven Schveighoffer wrote:
> On 5/14/18 6:02 AM, bioinfornatics wrote:
>> On Thursday, 10 May 2018 at 23:22:02 UTC, Steven Schveighoffer wrote:
>>>[...]
>> 
>> Hi Steve,
>> 
>> This is exciting work that could help in the bioinformatics area.
>> Indeed, in bioinformatics we are I/O bound and we process a lot of big files; the amount of data can be in gigabytes, terabytes, and sometimes even petabytes.
>> 
>> So processing this amount of data efficiently is critical. Some years ago I got a request 'How to parse fastq file format in D?' and monarch_dodra wrote a really fast parser (code: http://dpaste.dzfl.pl/37b893ed )
>> 
>> It could be interesting to show how fast iopipe is.
>
> Yeah, I have been working on and off with Vang Le (biocyberman) on using iopipe to parse such formats. He gave a good presentation at dconf this year on using D in bioinformatics, and I think it is a great fit for D!
>
> At dconf, I threw together a crude fasta parser (with the intention of having it be the base for parsing fastq as well) to demonstrate how iopipe can perform while parsing such things. I have no idea how fast or slow it is, as I just barely got it to work (pass unit tests I made up based on wikipedia entry for fasta), but IMO, the direct buffer access makes fast parsing much more pleasant than having to deal with your own buffering (using phobos makes parsing a bit difficult, however, I still see a need for some parsing tools for iopipe).
>
> You can find that library here: https://github.com/schveiguy/fastaq
>
> Not being in the field of bioinformatics, I can't really say that I am likely to continue development of it, but I'm certainly willing to help with iopipe for anyone who wants to use it in this field.
>
> -Steve

Hi Steve

Great work continuing to improve iopipe. Thank you for the example implementation of a fasta/q parser with iopipe. I will definitely continue to work on this. I still need some more time to get over the beginner barriers in D. I am currently trying out some work over here: https://github.com/bioslaD.

@Johnathan(bioinformatics) It would be great if you could join bioslaD and offer some help to make things move faster.

Vang