Read text file fast, how?
July 25, 2015
Hi!

I am trying to port a program I have written earlier to D. My previous versions are in C++ and Python. I was hoping that a D version would be similar in speed to the C++ version, rather than similar to the Python version. But currently it isn't.

Part of the problem may be that I haven't learned the idiomatic way to do things in D. One such thing is perhaps: how do I read large text files in an efficient manner in D?

Currently I have created a little test-program that does the same job as the UNIX-command "wc -lc", i.e. counting the number of lines and characters in a file. The timings I get in different languages are:

D:        15s
C++:      1.1s
Python:   3.7s
Perl:     2.9s

The central loop in my D program looks like:

        foreach (line; f.byLine) {
            nlines += 1;
            nchars += line.length + 1;
        }

I have also tried another variant with this inner loop:

        char[] line;
        while(f.readln(line)) {
            nlines += 1;
            nchars += line.length;
        }

In both cases, however, the D program is much slower than any of the C++/Python/Perl versions. I don't understand what could cause such a dramatic difference from C++, or a factor of 4 compared to Python. My D programs are built with DMD 2.067.1 on MacOS Yosemite, using the flags "-O -release".

Is there something I can do to make the program run faster, and still be "idiomatic D"?

(I append the whole program for reference)

Regards,
/Johan Holmberg

=======================================
import std.stdio;
import std.file;

void main(string[] argv) {
    foreach (fname; argv[1..$]) {
        auto f = File(fname);
        int nlines = 0;
        int nchars = 0;
        foreach (line; f.byLine) {
            nlines += 1;
            nchars += line.length + 1;
        }
        writeln(nlines, "\t", nchars, "\t", fname);
    }
}
=======================================
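
A possible alternative to line-by-line reading, sketched here for comparison: since the task only needs line and byte totals, the file can be read in fixed-size chunks with `File.byChunk` and the newline bytes counted directly. The `Counts` struct, the `countChunked` helper, and the 64 KiB chunk size are illustrative assumptions, not taken from the thread, and no timings are claimed for it. Unlike the `byLine` loop above, it does not count a final line that lacks a trailing newline (the same convention as `wc -l`).

```d
import std.stdio;

// Hypothetical result type (not from the thread).
struct Counts { ulong lines; ulong bytes; }

// Hypothetical helper: counts newline bytes and total bytes by reading
// fixed-size chunks instead of individual lines.
Counts countChunked(File f, size_t chunkSize = 64 * 1024)
{
    Counts c;
    // Each iteration yields the next chunk of at most chunkSize bytes.
    foreach (ubyte[] chunk; f.byChunk(chunkSize))
    {
        c.bytes += chunk.length;
        foreach (b; chunk)
            if (b == '\n')
                ++c.lines;
    }
    return c;
}

void main(string[] argv)
{
    foreach (fname; argv[1 .. $])
    {
        auto c = countChunked(File(fname));
        writeln(c.lines, "\t", c.bytes, "\t", fname);
    }
}
```

This only fits tasks that do not need the line contents; a program that processes each line would still need `byLine` or `readln`.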


July 25, 2015
On 7/25/15 8:19 AM, Johan Holmberg via Digitalmars-d wrote:
> Hi!
>
> I am trying to port a program I have written earlier to D. My previous
> versions are in C++ and Python. I was hoping that a D version would be
> similar in speed to the C++ version, rather than similar to the Python
> version. But currently it isn't.
>
> Part of the problem may be that I haven't learned the idiomatic way to
> do things in D. One such thing is perhaps: how do I read large text
> files in an efficient manner in D?
>
> Currently I have created a little test-program that does the same job as
> the UNIX-command "wc -lc", i.e. counting the number of lines and
> characters in a file. The timings I get in different languages are:
>
> D:        15s
> C++:      1.1s
> Python:   3.7s
> Perl:     2.9s

I think this harkens back to the problem discussed here:

http://stackoverflow.com/questions/28922323/improving-line-wise-i-o-operations-in-d/29153508

As I discuss there, the performance bug has been fixed for 2.068. With your code:

$ time wc -l <(repeat 1000000 echo hello)
 1000000 /dev/fd/11
wc -l <(repeat 1000000 echo hello)  0.11s user 2.35s system 54% cpu 4.529 total
$ time ./test.d <(repeat 1000000 echo hello)
 1000000 6000000 /dev/fd/11
./test.d <(repeat 1000000 echo hello)  0.73s user 1.76s system 64% cpu 3.870 total

The compilation was flag free (no -O -inline -release etc).


Andrei

July 25, 2015
On Sat, Jul 25, 2015 at 7:14 PM, Andrei Alexandrescu via Digitalmars-d < digitalmars-d@puremagic.com> wrote:

> On 7/25/15 8:19 AM, Johan Holmberg via Digitalmars-d wrote:
>
>> Hi!
>>
>> I am trying to port a program I have written earlier to D. My previous versions are in C++ and Python. I was hoping that a D version would be similar in speed to the C++ version, rather than similar to the Python version. But currently it isn't.
>>
>> Part of the problem may be that I haven't learned the idiomatic way to do things in D. One such thing is perhaps: how do I read large text files in an efficient manner in D?
>>
>> Currently I have created a little test-program that does the same job as the UNIX-command "wc -lc", i.e. counting the number of lines and characters in a file. The timings I get in different languages are:
>>
>> D:        15s
>> C++:      1.1s
>> Python:   3.7s
>> Perl:     2.9s
>>
>
> I think this harkens back to the problem discussed here:
>
>
> http://stackoverflow.com/questions/28922323/improving-line-wise-i-o-operations-in-d/29153508
>
> As I discuss there, the performance bug has been fixed for 2.068. With your code:
>
> $ time wc -l <(repeat 1000000 echo hello)
>  1000000 /dev/fd/11
> wc -l <(repeat 1000000 echo hello)  0.11s user 2.35s system 54% cpu 4.529
> total
> $ time ./test.d <(repeat 1000000 echo hello)
>  1000000 6000000 /dev/fd/11
> ./test.d <(repeat 1000000 echo hello)  0.73s user 1.76s system 64% cpu
> 3.870 total
>
> The compilation was flag free (no -O -inline -release etc).
>
>
> Andrei
>
>

Thanks, my question seems like a carbon copy of the Stack Overflow article :) Somehow I had missed it when googling.

I downloaded a DMD 2.068 beta and retried with my input file: now the D program takes 1.6s (a 10x improvement).

/johan


July 25, 2015
On 7/25/15 1:53 PM, Johan Holmberg via Digitalmars-d wrote:
> Thanks, my question seems like a carbon copy of the Stack Overflow
> article :) Somehow I had missed it when googling.
>
> I downloaded a DMD 2.068 beta and retried with my input file: now the D
> program takes 1.6s (a 10x improvement).

Great, though it still seems to be behind the C++ version, which is a bummer. -- Andrei

July 25, 2015
On Saturday, 25 July 2015 at 20:12:26 UTC, Andrei Alexandrescu wrote:
> On 7/25/15 1:53 PM, Johan Holmberg via Digitalmars-d wrote:
>> Thanks, my question seems like a carbon copy of the Stack Overflow
>> article :) Somehow I had missed it when googling.
>>
>> I downloaded a DMD 2.068 beta and retried with my input file: now the D
>> program takes 1.6s (a 10x improvement).
>
> Great, though it still seems to be behind the C++ version, which is a bummer. -- Andrei

Do you happen to have a link to the source where you fixed it?

I feel like contributing some reading effort today.
July 25, 2015
On Saturday, 25 July 2015 at 22:40:55 UTC, Brandon Ragland wrote:
> On Saturday, 25 July 2015 at 20:12:26 UTC, Andrei Alexandrescu wrote:
>> On 7/25/15 1:53 PM, Johan Holmberg via Digitalmars-d wrote:
>>> Thanks, my question seems like a carbon copy of the Stack Overflow
>>> article :) Somehow I had missed it when googling.
>>>
>>> I downloaded a DMD 2.068 beta and retried with my input file: now the D
>>> program takes 1.6s (a 10x improvement).
>>
>> Great, though it still seems to be behind the C++ version, which is a bummer. -- Andrei
>
> Do you happen to have a link to the source where you fixed it?
>
> I feel like contributing some reading effort today.

https://github.com/D-Programming-Language/phobos/pull/3089
July 26, 2015
On Sat, Jul 25, 2015 at 10:12 PM, Andrei Alexandrescu via Digitalmars-d < digitalmars-d@puremagic.com> wrote:

> On 7/25/15 1:53 PM, Johan Holmberg via Digitalmars-d wrote:
>
>> Thanks, my question seems like a carbon copy of the Stack Overflow article :) Somehow I had missed it when googling.
>>
>> I downloaded a DMD 2.068 beta and retried with my input file: now the D program takes 1.6s (a 10x improvement).
>>
>
> Great, though it still seems to be behind the C++ version, which is a bummer. -- Andrei
>
>
My C++ program was actually doing C-style IO via <stdio.h>. I didn't think about the distinction C/C++ when reporting the earlier numbers.

If I switch to full C++ style (<fstream> + <string> + the C++ version of getline()), the C++ solution is even slower than Python: 5.2s. I think it is the C++ libraries of Clang on MacOS Yosemite that are slow.

This prompted me to re-run the tests on a Linux machine (Ubuntu 14.04), still with the same input file, a text file with 7M lines and total size of 466MB:

C++ with <stdio.h> style IO:    0.40s
C++ with <fstream> style IO:    0.31s
D 2.067:                        1.75s
D 2.068 beta 2:                 0.69s
Perl:                           1.49s
Python:                         1.86s

So on Ubuntu, the C++ <fstream> version was clearly the fastest, and the improvement from DMD 2.067 to the 2.068 beta was "only" a factor of 2.5.

/johan


July 26, 2015
On 7/26/15 10:35 AM, Johan Holmberg via Digitalmars-d wrote:
>
> On Sat, Jul 25, 2015 at 10:12 PM, Andrei Alexandrescu via Digitalmars-d
> <digitalmars-d@puremagic.com> wrote:
>
>     On 7/25/15 1:53 PM, Johan Holmberg via Digitalmars-d wrote:
>
>         Thanks, my question seems like a carbon copy of the Stack Overflow
>         article :) Somehow I had missed it when googling.
>
>         I downloaded a DMD 2.068 beta and retried with my input file:
>         now the D
>         program takes 1.6s (a 10x improvement).
>
>
>     Great, though it still seems to be behind the C++ version, which is
>     a bummer. -- Andrei
>
>
> My C++ program was actually doing C-style IO via <stdio.h>. I didn't
> think about the distinction C/C++ when reporting the earlier numbers.
>
> If I switch to full C++ style: <fstream> + <string> + C++ version of
> getline(), then the C++-solution is even slower than Python: 5.2s. I
> think it is the C++ libraries of Clang on MacOS Yosemite that are slow.
>
> This prompted me to re-run the tests on a Linux machine (Ubuntu 14.04),
> still with the same input file, a text file with 7M lines and total size
> of 466MB:
>
> C++ with <stdio.h> style IO:    0.40s
> C++ with <fstream> style IO:    0.31s
> D 2.067:                        1.75s
> D 2.068 beta 2:                 0.69s
> Perl:                           1.49s
> Python:                         1.86s
>
> So on Ubuntu, the C++ <fstream> version was clearly the fastest, and the
> improvement from DMD 2.067 to the 2.068 beta was "only" a factor of 2.5.
>
> /johan

I think we should investigate this and bring performance to par. Anyone interested? -- Andrei
July 26, 2015
On Sunday, 26 July 2015 at 14:36:09 UTC, Johan Holmberg wrote:
> On Sat, Jul 25, 2015 at 10:12 PM, Andrei Alexandrescu via Digitalmars-d < digitalmars-d@puremagic.com> wrote:
>
>>[...]
> My C++ program was actually doing C-style IO via <stdio.h>. I didn't think about the distinction C/C++ when reporting the earlier numbers.
>
> [...]

It would be interesting to see numbers for the stdio.h code in D, since it should be easy to translate and would show whether the issue lies in the compiler or in the library.
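
Such a translation might look roughly like the sketch below, which calls the C standard library from D through `core.stdc.stdio`. The original C-style program was not posted in the thread, so the use of `fgets`, the 4096-byte buffer, and the `Counts` / `countWithFgets` names are all assumptions made for illustration. Because it relies on `strlen`, it would miscount input containing NUL bytes.

```d
import core.stdc.stdio : FILE, fopen, fclose, fgets, printf;
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical result type (not from the thread).
struct Counts { ulong lines; ulong chars; }

// Hypothetical helper: counts lines and characters using only C stdio calls,
// so that Phobos I/O code is not involved in the measurement.
Counts countWithFgets(string fname)
{
    Counts c;
    FILE* fp = fopen(fname.toStringz, "r");
    if (fp is null)
        return c;               // unreadable file: report zero counts
    scope (exit) fclose(fp);

    char[4096] buf;
    // fgets returns at most buf.length - 1 characters per call, so a long
    // line may arrive in several pieces; only the last piece ends in '\n'.
    while (fgets(buf.ptr, cast(int) buf.length, fp) !is null)
    {
        immutable n = strlen(buf.ptr);
        c.chars += n;
        if (n > 0 && buf[n - 1] == '\n')
            ++c.lines;
    }
    return c;
}

void main(string[] argv)
{
    foreach (fname; argv[1 .. $])
    {
        auto c = countWithFgets(fname);
        printf("%llu\t%llu\t%s\n", c.lines, c.chars, fname.toStringz);
    }
}
```

If a version like this ran close to the C timing when built with the same DMD flags, that would point at the library rather than the compiler.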
July 26, 2015
On Sunday, 26 July 2015 at 15:36:29 UTC, Andrei Alexandrescu wrote:
> On 7/26/15 10:35 AM, Johan Holmberg via Digitalmars-d wrote:
>> [...]
>
> I think we should investigate this and bring performance to par. Anyone interested? -- Andrei

Here's the link to the fstream source in libstdc++ for GNU/Linux (Ubuntu / Debian):

https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.0/fstream-source.html

Not too sure who is familiar with it, but it uses basic_streambuf underneath.
