Posted by Jon Degenhardt
An update on changes to this tool-set over the last year.
For those not familiar, tsv-utils are a set of command-line tools for manipulating large tabular data files: files of numeric and text data of the kind common in machine learning and data mining environments. The tools cover filtering, statistics, sampling, joins, and more. They are intended for large files, larger than is ideal for loading in-memory in tools like R or Pandas, but not so big as to necessitate moving to distributed compute environments. The tools are quite fast, the fastest of their kind available.
Besides being real tools, tsv-utils have also provided an environment for exploring the D programming language and the D ecosystem.
In the past year there have been two main areas of work.
One area is the sampling and shuffling facilities provided by the tsv-sample program. New sampling methods are available and performance has been improved. tsv-sample is very similar to the excellent GNU shuf tool, but supports sampling methods not available in shuf. Sampling is a rich and diverse area, and the tsv-sample code is perhaps the most algorithmically interesting in the tool-set.
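To give a flavor of the kind of algorithm involved, here is a minimal sketch of reservoir sampling (the classic "Algorithm R"), which picks k lines uniformly at random from a stream whose length is not known in advance. It is written in Python purely for illustration. The function name and parameters are hypothetical, and this is not the tsv-sample implementation, which is written in D and supports several other sampling methods.

```python
import random


def reservoir_sample(lines, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Illustrative sketch only (hypothetical helper, not tsv-sample code).
    Memory use is O(k) regardless of how many lines are read.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir


if __name__ == "__main__":
    rows = (f"row{i}\t{i * i}" for i in range(100_000))
    for row in reservoir_sample(rows, 5, seed=1):
        print(row)
```

The appeal for large files is that the input is read once, front to back, and only k lines are ever held in memory.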
The other main update is improved I/O read performance in many of the tools. This is from developing a buffered version of byLine. It is especially effective for skinny files (short lines). Most of the tools saw performance gains of 10-40%.
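For readers wondering why buffering helps most on short lines, here is a rough sketch of the general idea, in Python for illustration only. It assumes the approach is to read the input in large blocks and split lines out of the in-memory buffer, so the per-read overhead is paid once per block rather than once per line. The function name and chunk size are hypothetical, and the actual D code in tsv-utils may differ in its details.

```python
import io


def buffered_lines(stream, chunk_size=1 << 16):
    """Yield lines (without the trailing newline) from a binary stream.

    Illustrative sketch only (hypothetical helper, not the tsv-utils code).
    Data is read chunk_size bytes at a time, then split on newlines.
    """
    remainder = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parts = (remainder + chunk).split(b"\n")
        # The last piece may be an incomplete line; carry it into the next chunk.
        remainder = parts.pop()
        yield from parts
    if remainder:
        # Final line that had no terminating newline.
        yield remainder


if __name__ == "__main__":
    data = io.BytesIO(b"a\t1\nb\t2\nc\t3\n")
    for line in buffered_lines(data, chunk_size=4):
        print(line.decode())
```

With short lines, a single block holds many lines, so the fixed cost of each read is spread across all of them.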
One of the earlier performance improvements came from buffering output lines. Combined, the line-by-line read-write performance is quite a bit faster than what is available in Phobos. The iopipe / std.io packages (Steven Schveighoffer, Martin Nowak) are faster still; these are the place to go for really high performance. (See the links below for a benchmark report.)
Links:
* tsv-utils repo: https://github.com/eBay/tsv-utils
* tsv-sample user docs: https://github.com/eBay/tsv-utils/blob/master/docs/ToolReference.md#tsv-sample-reference
* tsv-sample code docs: https://tsv-utils.dpldocs.info/tsv_utils.tsv_sample.html
* Performance benchmarks on line-oriented I/O facilities: https://github.com/jondegenhardt/dcat-perf/issues/1