August 12, 2020  Reading from stdin significantly slower than reading file directly?
Hi,

Relative beginner to D-lang here, and I'm very confused by the apparent performance disparity I've noticed between programs that do the following:

1) cat some-large-file | D-program-reading-stdin-byLine()

2) D-program-directly-reading-file-byLine() using File() struct

The D-lang difference I've noticed between options (1) and (2) is somewhere in the range of 80% wall time taken (7.5s vs 4.1s), which seems pretty extreme.

For comparison, I attempted the same using Perl with the same large file, and I only noticed a 25% difference (10s vs 8s) in performance, which I imagine to be partially attributable to the overhead incurred by using a pipe and its buffer.

So, is this difference in D-lang performance typical? Is this expected behavior?

Was wondering if this may have anything to do with the library definition for std.stdio.stdin (https://dlang.org/library/std/stdio/stdin.html)? Does global file-locking significantly affect read performance?

For reference: I'm trying to build a single-threaded application; my present use case cannot benefit from parallelism, because its ultimate purpose is to serve as a single-threaded downstream filter from an upstream application consuming (n-1) system threads.
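[Editor's note: the original program was not posted. The following is only a minimal sketch of the kind of program being compared, with a line-counting loop as a stand-in workload; variant 1 drives the same byLine() loop via `cat some-large-file | ./prog`, variant 2 via `./prog some-large-file`.]

```d
// Minimal sketch of the two variants; the workload (counting lines) is a
// stand-in, not the OP's actual processing.
import std.stdio;

void main(string[] args)
{
    // Variant 2: open the named file directly.
    // Variant 1: no argument given, so read from stdin (fed by cat or a pipe).
    auto input = args.length > 1 ? File(args[1], "r") : stdin;

    size_t lines;
    foreach (line; input.byLine())
        ++lines;
    writeln(lines, " lines");
}
```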
August 13, 2020  Re: Reading from stdin significantly slower than reading file directly?
Posted in reply to methonash

On Wednesday, 12 August 2020 at 22:44:44 UTC, methonash wrote:
> Hi,
>
> Relative beginner to D-lang here, and I'm very confused by the apparent performance disparity I've noticed between programs that do the following:
>
> 1) cat some-large-file | D-program-reading-stdin-byLine()
>
> 2) D-program-directly-reading-file-byLine() using File() struct
>
> The D-lang difference I've noticed between options (1) and (2) is somewhere in the range of 80% wall time taken (7.5s vs 4.1s), which seems pretty extreme.

I don't know enough details of the implementation to really answer the question, and I expect it's a bit complicated. However, it's an interesting question, and I have relevant programs and data files, so I tried to get some actuals. The tests I ran don't directly answer the question posed, but may be a useful proxy.

I used Unix 'cut' (latest GNU version) and 'tsv-select' from the tsv-utils package (https://github.com/eBay/tsv-utils). 'tsv-select' is written in D, and works like 'cut'. 'tsv-select' reads from stdin or a file via a 'File' struct. It's not using the built-in 'byLine' member though; it uses a version of 'byLine' that includes some additional buffering. Both stdin and a file system file are read this way.

I used a file from the Google ngram collection (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html) and the file TREE_GRM_ESTN.csv from https://apps.fs.usda.gov/fia/datamart/CSV/datamart_csv.html, converted to a tsv file. The ngram file is a narrow file (21 bytes/line, 4 columns); the TREE file is wider (206 bytes/line, 49 columns). In both cases I cut the 2nd and 3rd columns. This tends to focus processing on input rather than processing and output. I also timed 'wc -l' for another data point.

I ran the benchmarks 5 times each way and recorded the median time below. The machine used is a Mac Mini (so macOS) with 16 GB RAM and SSD drives. The numbers are very consistent for this test on this machine. Differences in the reported times are real deltas, not system noise.

The commands timed were:

* bash -c 'tsv-select -f 2,3 FILE > /dev/null'
* bash -c 'cat FILE | tsv-select -f 2,3 > /dev/null'
* bash -c 'gcut -f 2,3 FILE > /dev/null'
* bash -c 'cat FILE | gcut -f 2,3 > /dev/null'
* bash -c 'gwc -l FILE > /dev/null'
* bash -c 'cat FILE | gwc -l > /dev/null'

Note that 'gwc' and 'gcut' are the GNU versions of 'wc' and 'cut' installed by Homebrew.

Google ngram file (the 's' unigram file):

    Test                          Elapsed  System   User
    ----                          -------  ------   ----
    tsv-select -f 2,3 FILE          10.28    0.42   9.85
    cat FILE | tsv-select -f 2,3    11.10    1.45  10.23
    cut -f 2,3 FILE                 14.64    0.60  14.03
    cat FILE | cut -f 2,3           14.36    1.03  14.19
    wc -l FILE                       1.32    0.39   0.93
    cat FILE | wc -l                 1.18    0.96   1.04

The TREE file:

    Test                          Elapsed  System   User
    ----                          -------  ------   ----
    tsv-select -f 2,3 FILE           3.77    0.95   2.81
    cat FILE | tsv-select -f 2,3     4.54    2.65   3.28
    cut -f 2,3 FILE                 17.78    1.53  16.24
    cat FILE | cut -f 2,3           16.77    2.64  16.36
    wc -l FILE                       1.38    0.91   0.46
    cat FILE | wc -l                 2.02    2.63   0.77

What this shows is that 'tsv-select' (the D program) was faster when reading from a file than when reading from standard input. It doesn't indicate why, or whether the delta is due to code in the D library or code in 'tsv-select'. Interestingly, 'cut' showed the opposite behavior: it was faster when reading from standard input than when reading from the file. For 'wc', which method was faster depended on line length.

Again, I caution against reading too much into this regarding performance of reading from standard input vs a disk file. Much more definitive tests can be done. However, it is an interesting comparison. Also, the D program is still fast in both cases.

--Jon
August 13, 2020  Re: Reading from stdin significantly slower than reading file directly?
Posted in reply to Jon Degenhardt

On Thursday, 13 August 2020 at 07:08:21 UTC, Jon Degenhardt wrote:
> Test                          Elapsed  System   User
> ----                          -------  ------   ----
> tsv-select -f 2,3 FILE          10.28    0.42   9.85
> cat FILE | tsv-select -f 2,3    11.10    1.45  10.23
> cut -f 2,3 FILE                 14.64    0.60  14.03
> cat FILE | cut -f 2,3           14.36    1.03  14.19
> wc -l FILE                       1.32    0.39   0.93
> cat FILE | wc -l                 1.18    0.96   1.04
>
> The TREE file:
>
> Test                          Elapsed  System   User
> ----                          -------  ------   ----
> tsv-select -f 2,3 FILE           3.77    0.95   2.81
> cat FILE | tsv-select -f 2,3     4.54    2.65   3.28
> cut -f 2,3 FILE                 17.78    1.53  16.24
> cat FILE | cut -f 2,3           16.77    2.64  16.36
> wc -l FILE                       1.38    0.91   0.46
> cat FILE | wc -l                 2.02    2.63   0.77

Your table shows that when piping the output from one process to another, there's a lot more time spent in kernel mode. A switch from user mode to kernel mode is expensive [1]. It costs around 1000-1500 clock cycles for a call to getpid() on most systems. That's around 100 clock cycles for the actual switch; the rest is overhead.

My theory is this: one of the reasons for the slowdown is very likely mutex un/locking, of which there is more need when multiple processes and (global) resources are involved, compared to a single instance. Another is copying buffers. When you read a file, the data is first read into a kernel buffer, which is then copied to the user-space buffer, i.e. the buffer you allocated in your program (the reading part might not happen if the data is still in the cache). If you read the file directly in your program, the data is copied once from kernel space to user space. When you read from stdin (which is technically a file), it would seem that cat reads the file, which means a copy from kernel to user space (cat); then cat writes that buffer to stdout (also technically a file), which is another copy; then you read from stdin in your program, which causes another copy from stdout to stdin and finally into your allocated buffer. Each of those steps may involve a mutex un/lock.

Also, with pipes you start two programs. Starting a program takes a few ms.

PS. If you do your own caching, or if you don't care about it because you just read a file sequentially once, you may benefit from opening your file with the O_DIRECT flag, which basically means that the kernel copies directly into user-space buffers.

[1] https://en.wikipedia.org/wiki/Ring_(computer_security)
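[Editor's note: to make the PS concrete, here is a rough, Linux-only sketch of opening a file with O_DIRECT from D. The O_DIRECT value is hard-coded for x86-64 Linux (the flag is not part of core.sys.posix, and macOS uses F_NOCACHE via fcntl() instead), and the 4096-byte buffer alignment is an assumption about the device block size; treat this as an illustration, not a recommendation.]

```d
// Sketch only: sequential reads that bypass the page cache via O_DIRECT.
// Assumes Linux on x86-64.
import core.stdc.stdlib : free;
import core.sys.posix.fcntl : open, O_RDONLY;
import core.sys.posix.stdlib : posix_memalign;
import core.sys.posix.unistd : close, read;
import std.stdio : writeln;
import std.string : toStringz;

enum O_DIRECT = 0x4000; // Linux x86-64 value (octal 040000); not portable

void main(string[] args)
{
    auto path = args.length > 1 ? args[1] : "some-large-file";

    int fd = open(path.toStringz, O_RDONLY | O_DIRECT);
    if (fd < 0)
        throw new Exception("open failed");
    scope(exit) close(fd);

    // O_DIRECT requires the buffer (and transfer size) to be aligned to the
    // device block size; 4096 covers the common cases.
    enum bufSize = 1 << 20;
    void* buf;
    if (posix_memalign(&buf, 4096, bufSize) != 0)
        throw new Exception("allocation failed");
    scope(exit) free(buf);

    size_t total;
    for (;;)
    {
        auto n = read(fd, buf, bufSize);
        if (n <= 0)
            break;
        total += cast(size_t) n;
    }
    writeln("bytes read: ", total);
}
```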
August 13, 2020  Re: Reading from stdin significantly slower than reading file directly?
Posted in reply to methonash

On 8/12/20 6:44 PM, methonash wrote:
> Hi,
>
> Relative beginner to D-lang here, and I'm very confused by the apparent performance disparity I've noticed between programs that do the following:
>
> 1) cat some-large-file | D-program-reading-stdin-byLine()
>
> 2) D-program-directly-reading-file-byLine() using File() struct
>
> The D-lang difference I've noticed from options (1) and (2) is somewhere in the range of 80% wall time taken (7.5s vs 4.1s), which seems pretty extreme.
>
> For comparison, I attempted the same using Perl with the same large file, and I only noticed a 25% difference (10s vs 8s) in performance, which I imagine to be partially attributable to the overhead incurred by using a pipe and its buffer.
>
> So, is this difference in D-lang performance typical? Is this expected behavior?
>
> Was wondering if this may have anything to do with the library definition for std.stdio.stdin (https://dlang.org/library/std/stdio/stdin.html)? Does global file-locking significantly affect read-performance?
>
> For reference: I'm trying to build a single-threaded application; my present use-case cannot benefit from parallelism, because its ultimate purpose is to serve as a single-threaded downstream filter from an upstream application consuming (n-1) system threads.
Are we missing the obvious here? cat needs to read from disk, write the results into a pipe buffer, then context-switch into your D program, then the D program reads from the pipe buffer.
Whereas, reading from a file just needs to read from the file.
The difference does seem a bit extreme, so maybe there is another more complex explanation.
But for sure, reading from stdin doesn't do anything different than reading from a file if you are using the File struct.
A more appropriate test might be using the shell to feed the file into the D program:
dprogram < FILE
Which means the same code runs for both tests.
-Steve
August 13, 2020  Re: Reading from stdin significantly slower than reading file directly?
Posted in reply to Steven Schveighoffer

On Thursday, 13 August 2020 at 14:41:02 UTC, Steven Schveighoffer wrote:
> But for sure, reading from stdin doesn't do anything different than reading from a file if you are using the File struct.
>
> A more appropriate test might be using the shell to feed the file into the D program:
>
> dprogram < FILE
>
> Which means the same code runs for both tests.
Indeed, using the 'prog < file' approach rather than 'cat file | prog' removes any distinction for 'tsv-select'. 'tsv-select' uses File.rawRead rather than File.byLine.
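[Editor's note: for illustration, a rough sketch of chunked input via File.rawRead, the style of buffered reading mentioned above. This is not the actual tsv-utils code; the 1 MiB buffer size and the newline-counting workload are arbitrary choices.]

```d
// Sketch: read fixed-size chunks with rawRead and scan them; the same loop
// works whether the File is stdin or a named file.
import std.stdio;

void main(string[] args)
{
    auto input = args.length > 1 ? File(args[1], "rb") : stdin;

    auto buffer = new ubyte[](1024 * 1024);
    size_t newlines;
    for (;;)
    {
        auto chunk = input.rawRead(buffer);
        if (chunk.length == 0)
            break;
        foreach (b; chunk)
            if (b == '\n')
                ++newlines;
    }
    writeln(newlines, " lines");
}
```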
August 13, 2020  Re: Reading from stdin significantly slower than reading file directly?
Posted in reply to Steven Schveighoffer

Thank you all very much for your detailed feedback!

I wound up pulling the "TREE_GRM_ESTN.csv" file referred to by Jon and used it in subsequent tests. I created D programs for reading directly through a File() structure, versus reading byLine() from the stdin alias.

After copying the large CSV file to /dev/shm/ (i.e. a ramdisk), I re-ran the two programs repeatedly, and I was able to approach the 20-30% overhead margin I would expect to see for using a shell pipe and its buffer; my results now closely match Jon's above.

Lesson learned: be wary of networked I/O systems (e.g. Isilon storage arrays); all kinds of weirdness can happen there ...