eBay's TSV Utilities status update
Apr 29, 2019
Jon Degenhardt
April 29, 2019
An update on changes to this tool-set over the last year.

For those not familiar, tsv-utils are a set of command line tools for manipulating large tabular data files, the files of numeric and text data common in machine learning and data mining environments. The tools cover filtering, statistics, sampling, joins, and more. They are intended for large files: larger than ideal for loading in-memory in tools like R or Pandas, but not so big as to necessitate moving to distributed compute environments. The tools are quite fast, the fastest of their kind available.

Besides being real tools, tsv-utils have also provided an environment for exploring the D programming language and the D ecosystem.

In the past year there have been two main areas of work.

One area is the sampling and shuffling facilities provided by the tsv-sample program. New sampling methods are available and performance has been improved. tsv-sample is very similar to the excellent GNU shuf tool, but supports sampling methods not available in shuf. Sampling is a rich and diverse area, and the tsv-sample code is perhaps the most algorithmically interesting in the tool-set.
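
As an illustration of the kind of single-pass method this area includes, here is a minimal Python sketch of weighted reservoir sampling, using the random-key approach described by Efraimidis and Spirakis. It is not taken from the tsv-sample source (which is written in D); the function name `weighted_reservoir_sample` and the `(value, weight)` input shape are assumptions made for the example. The tsv-sample docs linked below describe the methods the tool actually provides.

```python
import heapq
import random


def weighted_reservoir_sample(items, k, rng=None):
    """Pick up to k values from an iterable of (value, weight) pairs.

    One pass, without replacement. Each item gets a random key
    u ** (1 / weight) with u uniform in [0, 1); the k largest keys win,
    so heavier items are more likely to be kept.
    Hypothetical helper for illustration only.
    """
    if k <= 0:
        return []
    rng = rng or random.Random()
    heap = []  # min-heap of (key, index, value); smallest key is evicted first
    for index, (value, weight) in enumerate(items):
        if weight <= 0:
            continue  # items with non-positive weight are never selected
        key = rng.random() ** (1.0 / weight)
        entry = (key, index, value)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)
    # Return values ordered from largest key to smallest.
    return [value for _, _, value in sorted(heap, reverse=True)]


if __name__ == "__main__":
    rows = [("line-a", 1.0), ("line-b", 5.0), ("line-c", 0.5), ("line-d", 2.0)]
    print(weighted_reservoir_sample(rows, 2, random.Random(42)))
```

Memory use is bounded by `k` rather than by the input size, which is why this style of algorithm suits files too large to load in full.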

The other main update is improved I/O read performance in many of the tools. This is from developing a buffered version of byLine. It is especially effective for skinny files (short lines). Most of the tools saw performance gains of 10-40%.
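
To make the idea concrete, here is a rough Python sketch of buffered line reading: pull large fixed-size chunks from the input and split lines out of the in-memory buffer, instead of going back to the stream for every line. This only illustrates the general technique and is not the tsv-utils implementation; `buffered_lines` and `chunk_size` are names invented for the example. (Python file objects already buffer internally, so the sketch is for explanation rather than speed.)

```python
import io


def buffered_lines(stream, chunk_size=1 << 16):
    """Yield lines (without the trailing newline) from a binary stream.

    Reads chunk_size bytes at a time and carries any partial line at the
    end of a chunk over to the next read. Hypothetical helper for
    illustration only.
    """
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        lines = pending.split(b"\n")
        # The last piece is either an incomplete line or empty; keep it.
        pending = lines.pop()
        yield from lines
    if pending:
        # Final line with no terminating newline.
        yield pending


if __name__ == "__main__":
    data = io.BytesIO(b"id\tvalue\n1\t10\n2\t20\n")
    for line in buffered_lines(data, chunk_size=8):
        print(line.split(b"\t"))
```

With short lines, each chunk holds many lines, so the per-read overhead is spread across all of them. That is consistent with the larger gains reported for skinny files.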

One of the earlier performance improvements came from buffering output lines. Combined, the line-by-line read-write performance is quite a bit faster than what is available in Phobos. The iopipe / std.io packages (Steven Schveighoffer, Martin Nowak) are faster still; these are the place to go for really high performance. (See the links below for a benchmark report.)

Links:
* tsv-utils repo: https://github.com/eBay/tsv-utils
* tsv-sample user docs: https://github.com/eBay/tsv-utils/blob/master/docs/ToolReference.md#tsv-sample-reference
* tsv-sample code docs: https://tsv-utils.dpldocs.info/tsv_utils.tsv_sample.html
* Performance benchmarks on line-oriented I/O facilities: https://github.com/jondegenhardt/dcat-perf/issues/1
May 02, 2019
On 4/29/19 11:23 AM, Jon Degenhardt wrote:
> An update on changes to this tool-set over the last year.
...
> The other main update is improved I/O read performance in many of the tools. This is from developing a buffered version of byLine. It is especially effective for skinny files (short lines). Most of the tools saw performance gains of 10-40%.
> 
> One of the earlier performance improvements came from buffering output lines. Combined, the line-by-line read-write performance is quite a bit faster than what is available in Phobos. The iopipe / std.io packages (Steven Schveighoffer, Martin Nowak) are faster still; these are the place to go for really high performance. (See the links below for a benchmark report.)
> 
> Links:
> * tsv-utils repo: https://github.com/eBay/tsv-utils
> * tsv-sample user docs: https://github.com/eBay/tsv-utils/blob/master/docs/ToolReference.md#tsv-sample-reference
> * tsv-sample code docs: https://tsv-utils.dpldocs.info/tsv_utils.tsv_sample.html
> * Performance benchmarks on line-oriented I/O facilities: https://github.com/jondegenhardt/dcat-perf/issues/1


Jon:

Thank you for this, and thanks for your blog post of a couple of years ago, which I referred to many times while learning D and writing fast(er) CLI tools.

Looking forward to trying Steve's iopipe as well as your bufferedByLineReader.

James
May 03, 2019
On Friday, 3 May 2019 at 03:54:14 UTC, James Blachly wrote:
> On 4/29/19 11:23 AM, Jon Degenhardt wrote:
>> An update on changes to this tool-set over the last year.
> ...
> Thank you for this, and thanks for your blog post of a couple of years ago, which I referred to many times while learning D and writing fast(er) CLI tools.
>
> Looking forward to trying Steve's iopipe as well as your bufferedByLineReader.
>
> James

Thanks for the kind words, James!