What are the best available D (not C) File input/output options?
November 03
I've ported a small script from C to D. The original C version takes roughly 6.5 minutes to parse a 12G file while the port originally took about 48 minutes. My naïve attempt to improve the situation pushed it over an hour and 15 minutes. However, replacing std.stdio:File with core.stdc.stdio:FILE* and changing my output code in this latest version from:

	outputFile.writefln("%c\t%u\t%u\t%d.%09u\t%c", ...)

to
	fprintf(outputFile, "%c,%u,%u,%llu.%09llu,%c\n", ...)

reduced the processing time to roughly 7.5 minutes. Why is File.writefln() so appallingly slow? Is there a better D alternative?

I tried std.io but write() only outputs ubyte[] while I'm trying to output text, so I abandoned the idea early. Now that I've got the program execution time within an acceptable range, I tried replacing core.stdc.stdio.fread() with std.io.read() but that increased the time to 24 minutes. Now I'm starting to think there is something seriously wrong with my understanding of how to use D correctly because there's no way D's input/output capabilities can suck so bad in comparison to C's.
November 02
On Thursday, 2 November 2023 at 15:46:23 UTC, confuzzled wrote:
> I've ported a small script from C to D. The original C version takes roughly 6.5 minutes to parse a 12G file while the port originally took about 48 minutes. My naïve attempt to improve the situation pushed it over an hour and 15 minutes. However, replacing std.stdio:File with core.stdc.stdio:FILE* and changing my output code in this latest version from:
>
> 	outputFile.writefln("%c\t%u\t%u\t%d.%09u\t%c", ...)
>
> to
> 	fprintf(outputFile, "%c,%u,%u,%llu.%09llu,%c\n", ...)
>
> reduced the processing time to roughly 7.5 minutes. Why is File.writefln() so appallingly slow? Is there a better D alternative?

First, strace your program. The slowest thing about I/O is the syscall itself. If the D program does more syscalls, it's going to be slower almost no matter what else is going on. Both D and C are using libc to buffer I/O to reduce syscalls, but you might be defeating that by constantly flushing the buffer.
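
As a rough illustration of the buffering point (a sketch only; the function name, file name, and 1 MiB buffer size are made up for the example, not taken from the original program), std.stdio's File lets you enlarge the libc buffer with setvbuf, and leaving flush() out of the hot loop keeps the number of write syscalls low:

```d
import std.stdio : File;

/// Hypothetical helper: writes `count` tab-separated records through a
/// fully buffered File without flushing inside the loop.
void writeRecords(string path, uint count)
{
    auto f = File(path, "wb");
    // Request a 1 MiB libc buffer (fully buffered is setvbuf's default mode).
    // This must happen before any other I/O on the stream.
    f.setvbuf(1 << 20);
    foreach (uint i; 0 .. count)
        f.writefln("%c\t%u\t%u", 'A', i, i * 2); // no flush() here
    f.flush(); // one explicit flush at the end; close() would also flush
}

void main()
{
    import std.file : remove;
    writeRecords("buffered_demo.tsv", 1000);
    remove("buffered_demo.tsv");
}
```

Running both the C and D versions under `strace -c` should show whether the write counts actually differ.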

>
> I tried std.io but write() only outputs ubyte[] while I'm trying to output text, so I abandoned the idea early.

string -> immutable(ubyte)[]: alias with std.string.representation(st)

'alias' meaning that this doesn't allocate. It gives you a byte slice of the same memory the string is using.

You'd still need to do the formatting, before writing.
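
A minimal sketch of that two-step approach, assuming a made-up record layout (recordBytes is a hypothetical helper, not part of any library): format first, then reinterpret the string as bytes.

```d
import std.format : format;
import std.string : representation;

/// Hypothetical helper: formats one record as text, then exposes the
/// same memory as bytes.
immutable(ubyte)[] recordBytes(char tag, uint a, uint b)
{
    string line = format("%c\t%u\t%u\n", tag, a, b); // the formatting step allocates
    return line.representation;                       // no copy, just a retyped slice
}

void main()
{
    auto bytes = recordBytes('A', 1, 2);
    assert(bytes.length == 6); // 'A', tab, '1', tab, '2', newline
    assert(bytes[0] == 'A');
}
```

The resulting slice can then be passed to any byte-oriented write call.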

> Now that I've got the program execution time within an acceptable range, I tried replacing core.stdc.stdio.fread() with std.io.read() but that increased the time to 24 minutes. Now I'm starting to think there is something seriously wrong with my understanding of how to use D correctly because there's no way D's input/output capabilities can suck so bad in comparison to C's.


November 02

On Thursday, 2 November 2023 at 15:46:23 UTC, confuzzled wrote:
> I tried std.io but write() only outputs ubyte[] while I'm trying to output text, so I abandoned the idea early.

To answer this one specifically: write() takes ubyte[] to make it clear that what goes into the file is bytes.

You should use a buffering library like iopipe to write properly here (it handles the encoding of text for you).

I don't really have a good formatting library to go with it; you can probably rely on formattedWrite. A lot of things need to get better before this solution is smooth, and it's one of the things I have to work on.
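
For the formattedWrite part, a small sketch using only Phobos (this is not iopipe's API; the helper name and record layout are invented for illustration): format into an in-memory appender, which can later be written out in one go.

```d
import std.array : appender;
import std.format : formattedWrite;

/// Hypothetical helper: formats `count` records into one in-memory buffer.
char[] formatRecords(uint count)
{
    auto buf = appender!(char[])();
    buf.reserve(64 * 1024); // avoid repeated reallocation while appending
    foreach (uint i; 0 .. count)
        buf.formattedWrite("%c\t%u\t%d.%09u\n", 'A', i, 1, 5u);
    return buf.data;
}

void main()
{
    assert(formatRecords(2) == "A\t0\t1.000000005\nA\t1\t1.000000005\n");
}
```

Because formattedWrite accepts any output range, the same call should work with other sinks that provide put().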

-Steve

November 02
On Thursday, 2 November 2023 at 15:46:23 UTC, confuzzled wrote:
> I've ported a small script from C to D. The original C version takes roughly 6.5 minutes to parse a 12G file while the port originally took about 48 minutes.

In my experience I/O in D is quite slow.
But you can try to improve it:

Try std.outbuffer instead of writeln, and flush the result only at the end.

Also check this article. It shows how manual buffers in D can speed up file processing significantly: https://tech.nextroll.com/blog/data/2014/11/17/d-is-for-data-science.html
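
A sketch of the OutBuffer idea, under assumptions: the function name, record layout, and flush threshold are all made up, and for a multi-gigabyte output it flushes whenever the buffer passes the threshold rather than holding everything until the end.

```d
import std.outbuffer : OutBuffer;
import std.stdio : File;

/// Hypothetical helper: accumulates formatted records in an OutBuffer and
/// writes them to disk in large blocks with rawWrite.
void writeBuffered(string path, uint count, size_t flushAt = 1 << 20)
{
    auto f = File(path, "wb");
    auto buf = new OutBuffer();
    buf.reserve(flushAt);
    foreach (uint i; 0 .. count)
    {
        buf.writef("%c\t%u\n", 'A', i);
        if (buf.offset >= flushAt)   // offset = number of bytes currently buffered
        {
            f.rawWrite(buf.toBytes());
            buf.clear();             // reuse the same memory for the next block
        }
    }
    if (buf.offset)
        f.rawWrite(buf.toBytes());   // whatever is left over
}

void main()
{
    import std.file : readText, remove;
    writeBuffered("outbuffer_demo.tsv", 3, 8);
    assert(readText("outbuffer_demo.tsv") == "A\t0\nA\t1\nA\t2\n");
    remove("outbuffer_demo.tsv");
}
```

Note that rawWrite sends the bytes as they are, so no second formatting pass happens at write time.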


November 06
Good morning,

First, thanks to you, Steve, and Julian for responding to my inquiry.

On 11/3/23 4:59 AM, Sergey wrote:
> On Thursday, 2 November 2023 at 15:46:23 UTC, confuzzled wrote:
>> I've ported a small script from C to D. The original C version takes roughly 6.5 minutes to parse a 12G file while the port originally took about 48 minutes.
> 
> In my experience I/O in D is quite slow.
> But you can try to improve it:
> 
> Try std.outbuffer instead of writeln, and flush the result only at the end.

Unless I did it incorrectly, this did nothing for me. My understanding is that I should first prepare an OutBuffer to which I write all my output. Once complete, I then write the OutBuffer to file, which still requires the use of writeln, albeit not as often.

First I tried buffering the entire thing, but that turned out to be a big mistake. Next I tried writing and clearing the buffer every 100_000 records (about 3000 writeln calls).

Not as bad as the first attempt but significantly worse than what I obtained with the fopen/fprintf combo. I even tried writing the buffer to disk with fprintf but jumped ship because it took far longer than fopen/fprintf. Can't say how much longer because I terminated execution at 14 minutes.

> Also check this article. It shows how manual buffers in D can speed up file processing significantly: https://tech.nextroll.com/blog/data/2014/11/17/d-is-for-data-science.html
> 
> 

The link above was quite helpful. Thanks. I am a bit slow on the uptake so it took a while to figure out how to apply the idea to my own use case. However, once I figured it out, the result was 2 minutes faster than the original C implementation and 3 minutes faster than the fopen/fprintf port.
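
For anyone following along, the general shape of a manual read buffer looks something like the sketch below. It is not my actual program; countLines, the chunk size, and the demo file name are made up for illustration. The idea is to read the file in large chunks into one reusable buffer and scan the bytes directly.

```d
import std.stdio : File;

/// Hypothetical helper: counts newlines by reading the file in large
/// chunks into a single reusable buffer.
ulong countLines(string path, size_t chunkSize = 4 << 20)
{
    auto f = File(path, "rb");
    auto buffer = new ubyte[](chunkSize); // allocated once, reused for every read
    ulong lines;
    while (true)
    {
        auto got = f.rawRead(buffer);     // slice of `buffer` that was actually filled
        if (got.length == 0)
            break;                        // end of file
        foreach (b; got)
            if (b == '\n')
                ++lines;
    }
    return lines;
}

void main()
{
    import std.file : remove, write;
    write("manual_buffer_demo.txt", "a\nb\nc\n");
    scope (exit) remove("manual_buffer_demo.txt");
    assert(countLines("manual_buffer_demo.txt", 4) == 3);
}
```

A real parser would also have to handle records that straddle two chunks, which this sketch sidesteps by only counting newlines.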

Whether it did anything for the writeln implementation or not, I don't know. I wasn't willing to wait 45+ minutes for something that can feasibly be done in 6 minutes. I gave up at 12.

I haven't played with std.string.representation as suggested by Julian yet, but I plan to.


Thanks again.
--Confuzzled
November 06
On 11/3/23 2:30 AM, Steven Schveighoffer wrote:
> On Thursday, 2 November 2023 at 15:46:23 UTC, confuzzled wrote:
> 
> You should use a buffering library like iopipe to write properly here (it handles the encoding of text for you).
> 

Thanks Steve, I will try that.