Thread overview
Reading large files, writing large files?
Mar 28, 2005
AEon
Mar 28, 2005
Regan Heath
Mar 28, 2005
Ben Hinkle
Mar 28, 2005
Regan Heath
Mar 28, 2005
AEon
Mar 28, 2005
Ben Hinkle
Mar 28, 2005
Regan Heath
Mar 28, 2005
Derek Parnell
Mar 29, 2005
Ben Hinkle
Mar 29, 2005
Regan Heath
Mar 29, 2005
Ben Hinkle
Mar 29, 2005
Regan Heath
Mar 29, 2005
Derek Parnell
Mar 29, 2005
Ben Hinkle
Mar 29, 2005
Regan Heath
Mar 30, 2005
Ben Hinkle
Mar 30, 2005
Regan Heath
Mar 28, 2005
Regan Heath
Mar 29, 2005
AEon
Mar 29, 2005
Regan Heath
Mar 29, 2005
AEon
Mar 29, 2005
Regan Heath
March 28, 2005
I am rethinking the way I normally handle files, since I am now faced with possibly very large (100 MB and up) log files. Ditto, I need to save large log files. So it does not seem to be a good idea to use my so far preferred method:

// Ensure file exists
if( ! std.file.exists(cfgPathFile) )
...

// Read complete cfg file into array, removes \r\n via splitlines()
char[][] cfgText = std.string.splitlines( cast(char[]) std.file.read(cfgPathFile) );

Etc... I have very much come to like splitlines, and read, but with
100 MB log files, loading all that into RAM may turn out ugly?


Let's say I'd ignore the RAM issue for a moment, how would I properly use std.file.write() to write into a file?


The method I fear will need to be applied for such huge files is something like this (posted by Martin in this newsgroup):

import std.stream;

void readfile(char[] fn)
{
    File f = new File();
    char[] l;
    f.open(fn);
    while(!f.eof())
    {
        l = f.readLine();
        printf("line: %.*s\n", l);
    }
    f.close();
}

That would be pretty much the ANSI C way... ieek :)... Is there any way to avoid the latter method? And go the nicer D way, as in the first code example?

AEon
March 28, 2005
On Mon, 28 Mar 2005 05:13:36 +0200, AEon <aeon2001@lycos.de> wrote:
> I am rethinking the way I normally handle files, since I am now faced with possibly very large (100 MB and up) log files. Ditto, I need to save large log files. So it does not seem to be a good idea to use my so far preferred method:
>
> // Ensure file exists
> if( ! std.file.exists(cfgPathFile) )
> ...
>
> // Read complete cfg file into array, removes \r\n via splitlines()
> char[][] cfgText = std.string.splitlines( cast(char[]) std.file.read(cfgPathFile) );
>
> Etc... I have very much come to like splitlines, and read, but with
> 100 MB log files, loading all that into RAM may turn out ugly?
>
>
> Let's say I'd ignore the RAM issue for a moment, how would I properly use std.file.write() to write into a file?
>
>
> The method I fear will need to be applied for such huge files is something like this (posted by Martin in this newsgroup):
>
> import std.stream;
>
> void readfile(char[] fn)
> {
>      File f = new File();
>      char[] l;
>      f.open(fn);
>      while(!f.eof())
>      {
>          l = f.readLine();
>          printf("line: %.*s\n", l);
>      }
>      f.close();
> }
>
> That would be pretty much the ANSI C way... ieek :)... Is there any way to avoid the latter method? And go the nicer D way, as in the first code example?

Try this...

import std.c.stdlib;
import std.stream;
import std.stdio;

class LineReader(Source) : Source
{
	int opApply(int delegate(inout char[]) dg)
	{
		int result = 0;
		char[] line;

		while(!eof())
		{
			line = readLine();
			if (!line) break;
			result = dg(line);
			if (result) break;
		}
		
		return result;
	}
	
	int opApply(int delegate(inout size_t, inout char[]) dg)
	{		
		int result = 0;
		size_t lineno;
		char[] line;

		for(lineno = 1; !eof(); lineno++)
		{
			line = readLine();
			if (!line) break;
			result = dg(lineno,line);
			if (result) break;
		}
				
		return result;
	}
}

int main(char[][] args)
{
	LineReader!(BufferedFile) f;
	
	if (args.length < 2) usage();	
	f = new LineReader!(BufferedFile)();
	
	f.open(args[1],FileMode.In);
	foreach(char[] line; f)
	{
		writefln("READ[",line,"]");
	}
	f.close();
	
	f.open(args[1],FileMode.In);
	foreach(size_t lineno, char[] line; f)
	{
		writefln("READ[",lineno,"][",line,"]");
	}
	f.close();
	
	return 0;	
}

void usage()
{
	writefln("USAGE: test29 <file>");
	writefln("");
	exit(1);
}

Regan
March 28, 2005
>> void readfile(char[] fn)
>> {
>>      File f = new File();
>>      char[] l;
>>      f.open(fn);
>>      while(!f.eof())
>>      {
>>          l = f.readLine();
>>          printf("line: %.*s\n", l);
>>      }
>>      f.close();
>> }

one tiny improvement would be to combine the new File() with the open(fn) into new File(fn).

>> That would be pretty much the ANSI C way... ieek :)... Is there any way to avoid the latter method? And go the nicer D way, as in the first code example?
>
> Try this...
> class LineReader(Source) : Source
> {
[snip]

That's pretty nice. Maybe opApply iterating over lines should be built into Stream. That would resemble the standard Perl style of reading a file line-by-line. I'll poke around with that. It should be pretty easy and it would make line processing with stream much easier to use.


March 28, 2005
On Sun, 27 Mar 2005 22:52:52 -0500, Ben Hinkle <ben.hinkle@gmail.com> wrote:
>>> That would be pretty much the ANSI C way... ieek :)... Is there any way
>>> to avoid the latter method? And go the nicer D way, as in the first code
>>> example?
>>
>> Try this...
>> class LineReader(Source) : Source
>> {
> [snip]
>
> That's pretty nice.

:)

> Maybe opApply iterating over lines should be built into Stream.

That would be nice.

> That would resemble the standard Perl style of reading a file
> line-by-line. I'll poke around with that. It should be pretty easy and it
> would make line processing with stream much easier to use.

Agreed.

Regan
March 28, 2005
Trying to understand what you did, here. There seem to be several concepts I am still missing...

> import std.c.stdlib;
> import std.stream;
> import std.stdio;
> 
> class LineReader(Source) : Source

You seem to be "shadowing" some parent class called Source?

> {
>     int opApply(int delegate(inout char[]) dg)

Alas I still have no idea what "delegate" does, and why it needs to be used?

>     {
>         int result = 0;
>         char[] line;
> 
>         while(!eof())
>         {
>             line = readLine();

How come readLine() knows of the stream?

>             if (!line) break;

"if line == null" then break... no idea what this is good for.

>             result = dg(line);
>             if (result) break;

Don't understand these lines either.

Can it be that you are filling up a "buffer" with all the lines of the stream, until you reach an empty line, to let foreach then scan that "buffer" like it does for any other array? If so that could possibly use up a lot of RAM?!

>         }
>
>         return result;
>     }
>
>     int opApply(int delegate(inout size_t, inout char[]) dg)
>     {
>         int result = 0;
>         size_t lineno;

Why did you use size_t for lineno, would int not also work? (I tested this and it works fine to replace all size_t with int).

>         char[] line;
> 
>         for(lineno = 1; !eof(); lineno++)
>         {
>             line = readLine();
>             if (!line) break;
>             result = dg(lineno,line);
>             if (result) break;
>         }
>                        return result;
>     }
> }

AFAICT you defined 2 "structures" that will let the user use foreach on "f.open" streams. One version that will "just" read lines another that will also let you retrieve the line numbers as well.


> int main(char[][] args)
> {

>     LineReader!(BufferedFile) f;
>     f = new LineReader!(BufferedFile)();

Can be reduced to:

    LineReader!(BufferedFile) f = new LineReader!(BufferedFile)();

making the equivalent coding to

    File f = new File();

more obvious. IOW you seem to have defined a new stream?

>     if (args.length < 2) usage();
>     f.open(args[1],FileMode.In);
>     foreach(char[] line; f)

Is this default behavior? I.e. that foreach can parse streams? AFAICT this is the new speciality of your stream, right? Very nice.

>     {
>         writefln("READ[",line,"]");
>     }
>     f.close();
>         f.open(args[1],FileMode.In);
>     foreach(size_t lineno, char[] line; f)

Neat.

>     {
>         writefln("READ[",lineno,"][",line,"]");
>     }
>     f.close();
>
>     return 0;
> }


I noted when testing this code, that it will only read the lines of a stream until an empty line is encountered. Is this indeed intended?

AEon
March 28, 2005
"AEon" <aeon2001@lycos.de> wrote in message news:d290lj$1ukr$1@digitaldaemon.com...
> Trying to understand what you did, here. There seem to be several concepts I am still missing...
>
>> import std.c.stdlib;
>> import std.stream;
>> import std.stdio;
>>
>> class LineReader(Source) : Source
>
> You seem to be "shadowing" some parent class called Source?

The class is templatized. It is a way of subclassing any stream subclass. I
think it would also work to do
class LineReader(Source : Stream) : Source
to force the class Source to be a Stream or Stream subclass.

>> {
>>     int opApply(int delegate(inout char[]) dg)
>
> Alas I still have no idea what "delegate" does, and why it needs to be used?

opApply is used to implement 'foreach' in classes. See
http://www.digitalmars.com/d/statement.html#foreach
Also for info about delegate see http://www.digitalmars.com/d/function.html

>>     {
>>         int result = 0;
>>         char[] line;
>>
>>         while(!eof())
>>         {
>>             line = readLine();
>
> How come readLine() knows of the stream?

It subclasses Stream.

>>             if (!line) break;
>
> "if line == null" then break... no idea what this is good for.

I think this isn't needed. I think it probably is why blank lines stop the foreach.

>>             result = dg(line);
>>             if (result) break;
>
> Don't understand these lines either.

This is part of the foreach magic.

> Can it be that you are filling up a "buffer" with all the lines of the stream, until you reach an empty line, to let foreach then scan that "buffer" like it does for any other array? If so that could possibly use up a lot of RAM?!
>
>>         }
>>
>>         return result;
>>     }
>>
>>     int opApply(int delegate(inout size_t, inout char[]) dg)
>>     {
>>         int result = 0;
>>         size_t lineno;
>
> Why did you use size_t for lineno, would int not also work? (I tested this and it works fine to replace all size_t with int).

On a 32-bit machine size_t is uint; on a 64-bit machine it is ulong.


March 28, 2005
On Mon, 28 Mar 2005 15:25:57 +0200, AEon <aeon2001@lycos.de> wrote:
> Trying to understand what you did, here. There seem to be several concepts I am still missing...

Ben has done a fairly good job of explaining it. I'll have a go too, the combination of our efforts will hopefully explain "everything". :)

>> import std.c.stdlib;
>> import std.stream;
>> import std.stdio;
>>  class LineReader(Source) : Source
>
> You seem to be "shadowing" some parent class called Source?

This technique is called a "Snap-On". I am creating a new template class "LineReader" which is a child class of an unspecified (at this stage) class.

Later when I say: "LineReader!(BufferedFile) f;"

it specifies that "Source" is "BufferedFile".
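
A minimal sketch of the snap-on idea, written for a present-day D compiler and using hypothetical stand-in classes (Reader, FileReader, Counting) rather than the real std.stream types. It uses the constrained form Ben suggested, "(Source : Stream)", so a type that is not a Reader is rejected at compile time:

```d
// Hypothetical stand-ins for Stream and BufferedFile.
class Reader
{
    int reads;
    int read() { return ++reads; }
}

class FileReader : Reader {}

// Snap-on: Counting!(X) derives from X, and X must be a Reader
// (or a subclass of Reader) because of the ": Reader" constraint.
class Counting(Source : Reader) : Source
{
    // Calls read() inherited from whatever Source turned out to be.
    int readTwice() { return read() + read(); }
}

void main()
{
    Counting!(FileReader) c = new Counting!(FileReader);
    assert(c.readTwice() == 3);   // first read gives 1, second gives 2

    Reader r = c;                 // still usable wherever a Reader is expected
    assert(r.read() == 3);

    // int is not a Reader, so the template does not match.
    static assert(!is(Counting!(int)));
}
```

The point of the pattern is that the added method is written once and can be snapped onto any class that satisfies the constraint.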

>> {
>>     int opApply(int delegate(inout char[]) dg)
>
> Alas I still have no idea what "delegate" does, and why it needs to be used?

A delegate is like a function pointer, except that a delegate points to a (non-static) class member function. So calling it is like calling a class member on a class.

In this case the delegate is part of the "magic" that makes foreach work on a custom class like LineReader.
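
To make the "function pointer plus object" idea concrete, here is a small sketch, assuming a present-day D compiler. Counter and callThrice are hypothetical names invented for the illustration:

```d
class Counter
{
    int total;

    // A non-static member function: it needs a 'this' to work on.
    void add(int n) { total += n; }
}

// Hypothetical helper: calls whatever delegate it is handed, three times.
void callThrice(void delegate(int) dg)
{
    for (int i = 1; i <= 3; i++)
        dg(i);
}

void main()
{
    Counter c = new Counter;

    // &c.add bundles the object 'c' together with the member function 'add'.
    void delegate(int) dg = &c.add;

    callThrice(dg);          // same effect as c.add(1); c.add(2); c.add(3);
    assert(c.total == 6);
}
```

callThrice knows nothing about Counter; it only sees something it can call with an int, which is the same position opApply is in when it is handed the foreach body.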

>>     {
>>         int result = 0;
>>         char[] line;
>>          while(!eof())
>>         {
>>             line = readLine();
>
> How come readLine() knows of the stream?

Because LineReader is a child class of BufferedFile, which is a stream. The readLine call above calls the readLine of the parent class BufferedFile.

>>             if (!line) break;
>
> "if line == null" then break... no idea what this is good for.

I was trying to stop at the end of the file, but it appears this also stops on blank lines. IMO readLine is broken: it is returning null for a blank line when it should return "".

The difference between null and "" in the case of char[] is that null has a null .ptr and "is null" is true, so...

if (!line.ptr) break;
if (line is null) break;

statements should only fire when line is null and not "". But it appears readLine does not differentiate between null and "".

>>             result = dg(line);
>>             if (result) break;
>
> Don't understand these lines either.

As Ben said, it's part of the foreach "magic", his links should explain it. If not, let us know how the docs are deficient and hopefully someone can improve them.
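
A rough sketch of what those two lines do, assuming a present-day D compiler (where the inout parameter in the code above is spelled ref). Countdown and sumUntil are hypothetical names; the class hands out its values one at a time and never builds an array:

```d
class Countdown
{
    int start;
    this(int start) { this.start = start; }

    // foreach over a Countdown calls this; 'dg' is the body of the loop.
    int opApply(int delegate(ref int) dg)
    {
        int result = 0;
        for (int i = start; i > 0; i--)
        {
            result = dg(i);      // run the loop body once with this value
            if (result) break;   // non-zero: the body left the loop early
        }
        return result;
    }
}

// Adds up the values until 'stopAt' is reached, then breaks out.
int sumUntil(Countdown c, int stopAt)
{
    int sum = 0;
    foreach (int n; c)
    {
        if (n == stopAt) break;  // makes dg return non-zero inside opApply
        sum += n;
    }
    return sum;
}

void main()
{
    assert(sumUntil(new Countdown(5), 2) == 12);  // 5 + 4 + 3
    assert(sumUntil(new Countdown(3), 0) == 6);   // never breaks: 3 + 2 + 1
}
```

So the result value is only a signal from the loop body back to opApply that the loop should stop; passing it on unchanged is what lets break work inside a foreach over a class.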

> Can it be that you are filling up a "buffer" with all the lines of the stream, until you reach an empty line, to let foreach then scan that "buffer" like it does for any other array? If so that could possibly use up a lot of RAM?!

No. I am reading one line at a time. When I call the delegate I am effectively executing the body of the foreach statement with the line I pass. Then I discard the line and read the next one. So only one line is in memory at a time.

>>         }
>>
>>         return result;
>>     }
>>
>>     int opApply(int delegate(inout size_t, inout char[]) dg)
>>     {
>>         int result = 0;
>>         size_t lineno;
>
> Why did you use size_t for lineno, would int not also work? (I tested this and it works fine to replace all size_t with int).

As Ben mentioned, size_t is either a 32 or 64 bit type depending on the underlying OS/processor. I believe the idea is that using it chooses the most "sensible" type for holding "size" values on the current OS/processor.
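
A tiny check of that, as a sketch assuming a present-day D compiler on an ordinary 32-bit or 64-bit target:

```d
void main()
{
    // size_t is as wide as a pointer: 4 bytes on 32-bit targets, 8 on 64-bit.
    static assert(size_t.sizeof == (void*).sizeof);

    int[] a = [1, 2, 3];

    // Array lengths are size_t, so a size_t counter always matches them.
    static assert(is(typeof(a.length) == size_t));

    size_t n = a.length;
    assert(n == 3);
}
```

An int works for line numbers in practice, as you found; size_t just avoids a signed/unsigned mismatch when the value is later compared with a .length.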

>>         char[] line;
>>          for(lineno = 1; !eof(); lineno++)
>>         {
>>             line = readLine();
>>             if (!line) break;
>>             result = dg(lineno,line);
>>             if (result) break;
>>         }
>>                        return result;
>>     }
>> }
>
> AFAICT you defined 2 "structures" that will let the user use foreach on "f.open" streams. One version that will "just" read lines another that will also let you retrieve the line numbers as well.

Not 2 structures in the sense of D structs but 2 methods allowing foreach on my new class LineReader, which extends BufferedFile (by adding the foreach ability).

>> int main(char[][] args)
>> {
>
>>     LineReader!(BufferedFile) f;
>>     f = new LineReader!(BufferedFile)();
>
> Can be reduced to:
>
>      LineReader!(BufferedFile) f = new LineReader!(BufferedFile)();
>
> making the equivalent coding to
>
>      File f = new File();
>
> more obvious.

You could, I have chosen not to allocate the class till after my error checking, but then I could have moved "LineReader!(BufferedFile) f;" to after the error checking also.. I guess I'm used to C. :)

> IOW you seem to have defined a new stream?

Yes. I have extended/added foreach-ability to any Stream class.

>>     if (args.length < 2) usage();
>>     f.open(args[1],FileMode.In);
>>     foreach(char[] line; f)
>
> Is this default behavior? I.e. that foreach can parse streams? AFAICT this is the new speciality of your stream, right? Very nice.

It's a new speciality of my stream. I think we should add it to Streams though.
In addition we could add

foreach(char c; f) {}

to read characters one at a time.

>>     {
>>         writefln("READ[",line,"]");
>>     }
>>     f.close();
>>
>>     f.open(args[1],FileMode.In);
>>     foreach(size_t lineno, char[] line; f)
>
> Neat.
>
>>     {
>>         writefln("READ[",lineno,"][",line,"]");
>>     }
>>     f.close();
>>
>>     return 0;
>> }
>
>
> I noted when testing this code, that it will only read the lines of a stream until an empty line is encountered. Is this indeed intended?

No it was not intended. IMO readLine is broken.

Regan
March 28, 2005
On Mon, 28 Mar 2005 13:43:08 -0500, Ben Hinkle <bhinkle@mathworks.com> wrote:
> "AEon" <aeon2001@lycos.de> wrote in message
> news:d290lj$1ukr$1@digitaldaemon.com...
>> Trying to understand what you did, here. There seem to be several concepts
>> I am still missing...
>>
>>> class LineReader(Source) : Source
>>
>> You seem to be "shadowing" some parent class called Source?
>
> The class is templatized. It is a way of subclassing any stream subclass. I
> think it would also work to do
> class LineReader(Source : Stream) : Source
> to force the class Source to be a Stream or Stream subclass.

Good point, that is probably more correct.

>>>             if (!line) break;
>>
>> "if line == null" then break... no idea what this is good for.
>
> I think this isn't needed. I think it probably is why blank lines stop the
> foreach.

I think readLine is broken. It needs to return "" and not null.
The difference being that "" has a non null "line.ptr" and "line is null" is not true.

Regan
March 28, 2005
On Tue, 29 Mar 2005 10:39:49 +1200, Regan Heath wrote:


[snip]


> I think readLine is broken. It needs to return "" and not null.
> The difference being that "" has a non null "line.ptr" and "line is null"
> is not true.

I've mentioned this before. D can not guarantee that a coder will always be able to distinguish between an empty line and an uninitialized line. I believe the two are distinct and useful idioms, and I know that it is theoretically possible, but sometimes when you pass a "", it gets received as null; however not in all situations. :-(

-- 
Derek Parnell
Melbourne, Australia
http://www.dsource.org/projects/build v1.16 released
29/03/2005 9:24:10 AM
March 29, 2005
>>>>             if (!line) break;
>>>
>>> "if line == null" then break... no idea what this is good for.
>>
>> I think this isn't needed. I think it probably is why blank lines stop
>> the
>> foreach.
>
> I think readLine is broken. It needs to return "" and not null.
> The difference being that "" has a non null "line.ptr" and "line is null"
> is not true.

IMO the right way to check if a string is empty is asking if the length is
0. Setting an array's length to 0 automatically sets the ptr to null. So
relying on any specific behavior of the ptr of a 0 length array is dangerous
at best (since it would rely on always slicing to resize). For example the
statement
  str.length = str.length;
does nothing if length > 0 and sets the ptr to null if length == 0.
One can argue about D's behavior about nulling the ptr but that's the
current situation. Perhaps it should be illegal to implicitly cast a dynamic
array to a ptr.
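
A short sketch of the length test, assuming a present-day D compiler. isBlank is a hypothetical helper, not something in Phobos:

```d
// Hypothetical helper: true when the line has no characters,
// whether it arrived as null or as "".
bool isBlank(const(char)[] line)
{
    return line.length == 0;
}

void main()
{
    char[] fromNull = null;
    char[] sliced = "abc".dup[0 .. 0];   // zero length, but sliced from real data

    assert(isBlank(fromNull));
    assert(isBlank(sliced));
    assert(isBlank(""));
    assert(!isBlank("abc"));
}
```

Written this way, the test never looks at the ptr, so a blank line is treated the same however it was produced. It cannot serve as an end-of-file test, though; that has to come from eof().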

