A file reading benchmark - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » General » A file reading benchmark

Thread overview

A file reading benchmark
Feb 18, 2012 bearophile
Feb 18, 2012 bearophile
Feb 18, 2012 Walter Bright
Feb 18, 2012 bearophile
Feb 18, 2012 Walter Bright
Feb 18, 2012 Robert Jacques
Feb 18, 2012 Robert Jacques
Feb 23, 2012 Andrei Alexandrescu
Feb 23, 2012 Martin Nowak
Feb 23, 2012 Andrei Alexandrescu
Feb 23, 2012 Martin Nowak
Feb 24, 2012 H. S. Teoh

February 18, 2012

A file reading benchmark

Posted by bearophile

bearophile

A tiny little file lines reading benchmark I've just found on Reddit: http://www.reddit.com/r/programming/comments/pub98/a_benchmark_for_reading_flat_files_into_memory/

http://steve.80cols.com/reading_flat_files_into_memory_benchmark.html

The Ruby code that generates slowly the test data:
https://raw.github.com/lorca/flat_file_benchmark/master/gen_data.rb
But for my timings I have used only about a 40% of that file, the first 1_965_800 lines, because I have less memory.

My Python-Psyco version runs in 2.46 seconds, the D version in 4.65 seconds (the D version runs in 13.20 seconds if I don't disable the GC).

From many other benchmarks I've seen that file reading line-by-line is slow in D.

-------------------------
My D code:

import std.stdio, std.string, std.array;

void main(in string[] args) {
    Appender!(string[][]) rows;
    foreach (line; File(args[1]).byLine())
        rows.put(line.idup.split("\t"));
    writeln(rows.data[1].join(","));
}

-------------------------
My Python 2.6 code:

from sys import argv
from collections import deque
import gc
import psyco

def main():
    gc.disable()
    rows = deque()
    for line in open(argv[1]):
        rows.append(line[:-1].split("\t"))
    print ",".join(rows[1])

psyco.full()
main()

-------------------------
The test data generator in Ruby:

user_id=1
for user_id in (1..10000)
 payments = (rand * 1000).to_i
 for user_payment_id in (1..payments)
   payment_id = user_id.to_s + user_payment_id.to_s
   payment_amount = "%.2f" % (rand * 30);
   is_card_present = "N"
   created_at = (rand * 10000000).to_i
   if payment_id.to_i % 3 == 0
    is_card_present = "Y"
   end
   puts [user_id, payment_id, payment_amount, is_card_present, created_at].join("\t")
 end
end

-------------------------

Bye,
bearophile

February 18, 2012

Re: A file reading benchmark

Posted by bearophile
in reply to bearophile

bearophile

Posted in reply to bearophile

> (the D version runs in 13.20 seconds if I don't disable the GC).

Sorry, I have forgotten the version of the D code with GC disabling:


import std.stdio, std.string, std.array, core.memory;

void main(in string[] args) {
    GC.disable();
    Appender!(string[][]) rows;
    foreach (line; File(args[1]).byLine())
        rows.put(line.idup.split("\t"));
    writeln(rows.data[1].join(","));
}

Bye,
bearophile

February 18, 2012

Re: A file reading benchmark

Posted by Walter Bright
in reply to bearophile

Walter Bright

Posted in reply to bearophile

On 2/17/2012 5:44 PM, bearophile wrote:
> A tiny little file lines reading benchmark I've just found on Reddit:

If you're going to benchmark reading a file line by line, that's what your code should do. But that isn't what it does, it conflates it with appending to an array.

February 18, 2012

Re: A file reading benchmark

Posted by Robert Jacques
in reply to bearophile

Robert Jacques

Posted in reply to bearophile

On Fri, 17 Feb 2012 19:44:04 -0600, bearophile <bearophileHUGS@lycos.com> wrote:
> A tiny little file lines reading benchmark I've just found on Reddit:
> http://www.reddit.com/r/programming/comments/pub98/a_benchmark_for_reading_flat_files_into_memory/
>
> http://steve.80cols.com/reading_flat_files_into_memory_benchmark.html
>
> The Ruby code that generates slowly the test data:
> https://raw.github.com/lorca/flat_file_benchmark/master/gen_data.rb
> But for my timings I have used only about a 40% of that file, the first 1_965_800 lines, because I have less memory.
>
> My Python-Psyco version runs in 2.46 seconds, the D version in 4.65 seconds (the D version runs in 13.20 seconds if I don't disable the GC).
>
> From many other benchmarks I've seen that file reading line-by-line is slow in D.

Bearophile, when comparing a deque to a classic vector, of course the deque is going to win. This has nothing to do with D, and everything to do with writing a good algorithm.

Also, Appender has known performance problems with large appends (Issue 5813 http://d.puremagic.com/issues/show_bug.cgi?id=5813 , I'm currently folding in Vladimir's suggestions and generating a pull request)

February 18, 2012

Re: A file reading benchmark

Posted by bearophile
in reply to Walter Bright

bearophile

Posted in reply to Walter Bright

Walter Bright:

> If you're going to benchmark reading a file line by line, that's what your code should do. But that isn't what it does, it conflates it with appending to an array.

I have used the benchmark presented on Reddit.

I understand that a read-only benchmark is more useful for D developers. Removing the storing of lines the code is slower than equivalent Python code still.

----------------

>Bearophile, when comparing a deque to a classic vector, of course the deque is going to win. This has nothing to do with D, and everything to do with writing a good algorithm.<

In this specific case you are wrong: in Python if I replace the deque with a list (that is a dynamic array), the runtime grows only by less than 0.2 seconds. I know this benchmark is about Phobos not about D.


> Also, Appender has known performance problems with large appends (Issue 5813 http://d.puremagic.com/issues/show_bug.cgi?id=5813 , I'm currently folding in Vladimir's suggestions and generating a pull request)<

OK. We'll see. But the File.byLine() is slower than file by line reading, regardless appending to deques of arrays or appenders.

Bye,
bearophile

February 18, 2012

Re: A file reading benchmark

Posted by Walter Bright
in reply to bearophile

Walter Bright

Posted in reply to bearophile

On 2/17/2012 7:25 PM, bearophile wrote:
> I have used the benchmark presented on Reddit.

I have no issue with that, just with your conclusion that it's file reading that's slow. It's not supported by your data.

Benchmarking is not so easy, I've very often seen benchmarks purporting to measure X when they were actually measuring Y.

The scientific jargon for this is "isolating the variables".

February 18, 2012

Re: A file reading benchmark

Posted by Robert Jacques
in reply to bearophile

Robert Jacques

Posted in reply to bearophile

On Fri, 17 Feb 2012 21:25:05 -0600, bearophile <bearophileHUGS@lycos.com> wrote:
[snip]
>> Bearophile, when comparing a deque to a classic vector, of course the deque is going to win. This has nothing to do with D, and everything to do with writing a good algorithm.<
>
> In this specific case you are wrong: in Python if I replace the deque with a list (that is a dynamic array), the runtime grows only by less than 0.2 seconds. I know this benchmark is about Phobos not about D.

Whether or not deque vs vector was the primary performance problem, posting an 'apples to oranges' performance benchmark causes people to a) dismiss the thread and b) lower their opinion of the poster.

February 23, 2012

Re: A file reading benchmark

Posted by Andrei Alexandrescu
in reply to bearophile

Andrei Alexandrescu

Posted in reply to bearophile

On 2/17/12 7:44 PM, bearophile wrote:
> A tiny little file lines reading benchmark I've just found on Reddit:
> http://www.reddit.com/r/programming/comments/pub98/a_benchmark_for_reading_flat_files_into_memory/
>
> http://steve.80cols.com/reading_flat_files_into_memory_benchmark.html
>
> The Ruby code that generates slowly the test data:
> https://raw.github.com/lorca/flat_file_benchmark/master/gen_data.rb
> But for my timings I have used only about a 40% of that file, the first 1_965_800 lines, because I have less memory.
>
> My Python-Psyco version runs in 2.46 seconds, the D version in 4.65 seconds (the D version runs in 13.20 seconds if I don't disable the GC).
>
>  From many other benchmarks I've seen that file reading line-by-line is slow in D.
>
> -------------------------
> My D code:
[snip]

The thread in D.announce prompted me to go back to this, and I've run a simple test that isolates file reads from everything else. After generating the CSV data as described above, I ran this Python code:

import sys

rows = []
f = open(sys.argv[1])
for line in f:
    if len(line) > 10000: rows.append(line[:-1].split("\t"))

and this D code:

import std.stdio, std.string, std.array;

void main(in string[] args) {
    Appender!(string[][]) rows;
    auto f = File(args[1]);
    foreach (line; f.byLine()) {
        if (line.length > 10000) rows.put(line.idup.split("\t"));
    }
}

Both programs end up appending nothing because 10000 is larger than any line length.

On my machine (Mac OSX Lion), the Python code clocks around 1.2 seconds and the D code at a whopping 9.3 seconds. I looked around where the problem lies and sure enough the issue was with a slow loop in the generic I/O implementation of readln. The commit https://github.com/D-Programming-Language/phobos/commit/94b21d38d16e075d7c44b53015eb1113854424d0 brings the speed of the test to 2.1 seconds. We could and should reduce that further with taking buffering in our own hands, but for now this is a good low-hanging fruit to pick.

Andrei

February 23, 2012

Re: A file reading benchmark

Posted by Martin Nowak
in reply to Andrei Alexandrescu

Martin Nowak

Posted in reply to Andrei Alexandrescu

> On my machine (Mac OSX Lion), the Python code clocks around 1.2 seconds and the D code at a whopping 9.3 seconds. I looked around where the problem lies and sure enough the issue was with a slow loop in the generic I/O implementation of readln. The commit https://github.com/D-Programming-Language/phobos/commit/94b21d38d16e075d7c44b53015eb1113854424d0 brings the speed of the test to 2.1 seconds. We could and should reduce that further with taking buffering in our own hands, but for now this is a good low-hanging fruit to pick.
>
Nice, I just got shocked yesterday by seeing that we call fgetc for every char,
those are usually macros and as we already maintain the per system *_unlocked
functions we might probably use the macro expansions.

February 23, 2012

Re: A file reading benchmark

Posted by Andrei Alexandrescu
in reply to Martin Nowak

Andrei Alexandrescu

Posted in reply to Martin Nowak

On 2/23/12 3:40 PM, Martin Nowak wrote:
>> On my machine (Mac OSX Lion), the Python code clocks around 1.2
>> seconds and the D code at a whopping 9.3 seconds. I looked around
>> where the problem lies and sure enough the issue was with a slow loop
>> in the generic I/O implementation of readln. The commit
>> https://github.com/D-Programming-Language/phobos/commit/94b21d38d16e075d7c44b53015eb1113854424d0
>> brings the speed of the test to 2.1 seconds. We could and should
>> reduce that further with taking buffering in our own hands, but for
>> now this is a good low-hanging fruit to pick.
>>
> Nice, I just got shocked yesterday by seeing that we call fgetc for
> every char,
> those are usually macros and as we already maintain the per system
> *_unlocked
> functions we might probably use the macro expansions.

Yah, feel free to work opportunistically on this if you find the time. However, I think long-term we want to give byLine() the freedom of doing its own buffering.

Andrei

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation