Thread overview: How to free memory allocated via double[][] using dmd-2.0.12?

Apr 08, 2008  Markus Dittrich
Apr 08, 2008  Regan Heath
Apr 08, 2008  Markus Dittrich
Apr 08, 2008  BCS
Apr 08, 2008  Markus Dittrich
Apr 08, 2008  BCS
Apr 08, 2008  Bill Baxter
Apr 08, 2008  Markus Dittrich
Apr 09, 2008  Bill Baxter
Apr 09, 2008  BCS
April 08, 2008
Hi,

For a data processing application I need to read a large number
of data sets from disk. Due to their size, they have to be read and
processed sequentially, i.e. in pseudocode

int main()
{
    while (some condition)
    {
         double[][] myLargeDataset = read_data();
         process_data(myLargeDataset);
         // free all memory here otherwise next cycle will
         // run out of memory
     }

   return 0;
}

Now, the "problem" is that each single data set saturates system
memory, and hence I need to make sure that all memory is freed
after each process_data step is complete. Unfortunately, using
dmd-2.012 I have not been able to achieve this. Whatever I do
(including nothing, i.e., letting the GC do its job), the resulting
binary keeps accumulating memory and crashing shortly after.
I've tried deleting the array, setting the array lengths to 0, and
manually forcing the GC to collect, all to no avail. Hence, is there
something I am doing terribly wrong, or is this a bug in dmd?
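
For reference, the cleanup variants I tried look roughly like this (just a
sketch; some_condition, read_data and process_data stand in for the real
code, and the GC is forced via std.gc.fullCollect from Phobos):

import std.gc;

int main()
{
    while (some_condition())
    {
        double[][] myLargeDataset = read_data();
        process_data(myLargeDataset);

        // attempt 1: delete the array explicitly
        delete myLargeDataset;

        // attempt 2: drop the lengths instead of deleting
        // myLargeDataset.length = 0;

        // attempt 3: force a collection on top of either attempt
        std.gc.fullCollect();
    }
    return 0;
}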

Thanks much,
Markus
April 08, 2008
Markus Dittrich wrote:
> Hi,
> 
> For a data processing application I need to read a large number
> of data sets from disk. Due to their size, they have to be read and processed sequentially, i.e. in pseudocode
> 
> int main()
> {
>     while (some condition)
>     {
>          double[][] myLargeDataset = read_data();
>          process_data(myLargeDataset);
>          // free all memory here otherwise next cycle will
>          // run out of memory
>      }
> 
>    return 0;
> }
> 
> Now, the "problem" is the fact that each single data-set saturates
> system memory and hence I need to make sure that all memory
> is freed after each process_data step is complete. Unfortunately,
> using dmd-2.012 I have not been able to achieve this. Whatever
> I do (including nothing, i.e., letting the GC do its job), the resulting binary keeps accumulating memory and crashing shortly after. I've tried deleting the array, setting the array lengths to 0, and manually
> forcing the GC to collect, to no avail. Hence, is there something I am doing terribly wrong or is this a bug in dmd?

Did you try just setting the array reference to null?  This should make the contents of the array unreachable, so they should be collected when you next allocate (and run short on memory).

i.e.

int main()
{
    double[][] myLargeDataset;
    while (some condition)
    {
        myLargeDataset = read_data();
        process_data(myLargeDataset);
        myLargeDataset = null;
    }

    return 0;
}

Regan
April 08, 2008
Regan Heath Wrote:

> Did you try just setting the array reference to null.  This should make the contents of the array unreachable and therefore it should be collected when you next allocate (and run short on memory).
> 
> i.e.
> 
> int main()
> {
>      double[][] myLargeDataset;
>      while (some condition)
>      {
> 	 myLargeDataset = read_data();
>           process_data(myLargeDataset);
>           myLargeDataset = null;
>       }
> 
>     return 0;
> }
> 
> Regan

Thanks for the hint. I tried this again just to make sure, and I also tried adding an std.gc.fullCollect() right after it to force the GC to collect. In both cases I can watch memory consumption grow continuously until the system eventually runs out of memory. Maybe it's a GC bug?

Markus
April 08, 2008
Markus Dittrich wrote:
> Hi,
> 
> For a data processing application I need to read a large number
> of data sets from disk. Due to their size, they have to be read and processed sequentially, i.e. in pseudocode
> 
> int main()
> {
>     while (some condition)
>     {
>          double[][] myLargeDataset = read_data();
>          process_data(myLargeDataset);
>          // free all memory here otherwise next cycle will
>          // run out of memory
>      }
> 
>    return 0;
> }
> 
> Now, the "problem" is the fact that each single data-set saturates
> system memory and hence I need to make sure that all memory
> is freed after each process_data step is complete. Unfortunately,
> using dmd-2.012 I have not been able to achieve this. Whatever
> I do (including nothing, i.e., letting the GC do its job), the resulting binary keeps accumulating memory and crashing shortly after. I've tried deleting the array, setting the array lengths to 0, and manually
> forcing the GC to collect, to no avail. Hence, is there something I am doing terribly wrong or is this a bug in dmd?
> 
> Thanks much,
> Markus

One "hack" would be to have read_data() allocate a big buffer and then slice the parts of the double[][] out of it. This has the advantage that you only need to keep track of the buffer, and on the next pass you just reuse it in its entirety; you never have to delete it.

double[][] read_data()
{
	static byte[] buff;
	if(buff.ptr is null) buff = new byte[huge];	// 'huge' = an upper bound on one data set's size

	byte[] left = buff;

	T[] Alloca(T)(int i)
	{
		T[] ret = (cast(T*)left.ptr)[0..i];
		left = left[i*T.sizeof..$];
		return ret;
	}

	// code uses Alloca!(double) and Alloca!(double[]) for
	// allocations. Don't use .length or ~=
}
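
To make that concrete, here is a hypothetical sketch of how the helper could
be used to build the double[][] out of the one buffer (rowCount, colCount and
readRow are made-up placeholders for whatever the real parsing does):

double[][] read_data()
{
	static byte[] buff;
	if(buff.ptr is null) buff = new byte[huge];

	byte[] left = buff;

	T[] Alloca(T)(int i)
	{
		T[] ret = (cast(T*)left.ptr)[0..i];
		left = left[i*T.sizeof..$];
		return ret;
	}

	// carve the outer array and every row out of the single buffer
	double[][] data = Alloca!(double[])(rowCount);
	foreach (r; 0 .. rowCount)
	{
		data[r] = Alloca!(double)(colCount);
		readRow(data[r]);	// fill the row in place; no ~= or .length
	}
	return data;
}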
April 08, 2008
BCS Wrote:

> Markus Dittrich wrote:
> > Hi,
> > 
> > For a data processing application I need to read a large number
> > of data sets from disk. Due to their size, they have to be read and
> > processed sequentially, i.e. in pseudocode
> > 
> > int main()
> > {
> >     while (some condition)
> >     {
> >          double[][] myLargeDataset = read_data();
> >          process_data(myLargeDataset);
> >          // free all memory here otherwise next cycle will
> >          // run out of memory
> >      }
> > 
> >    return 0;
> > }
> > 
> > Now, the "problem" is the fact that each single data-set saturates
> > system memory and hence I need to make sure that all memory
> > is freed after each process_data step is complete. Unfortunately,
> > using dmd-2.012 I have not been able to achieve this. Whatever
> > I do (including nothing, i.e., letting the GC do its job), the resulting
> > binary keeps accumulating memory and crashing shortly after).
> > I've tried deleting the array, setting the array lengths to 0, and manually
> > forcing the GC to collect, to no avail. Hence, is there something I am
> > doing terribly wrong or is this a bug in dmd?
> > 
> > Thanks much,
> > Markus
> 
> One "hack" would be to have read_data() allocate a big buffer and then slice the parts of the double[][] out of it. This has the advantage that you only need to keep track of the buffer, and on the next pass you just reuse it in its entirety; you never have to delete it.
> 
> double[][] read_data()
> {
> 	static byte[] buff;
> 	if(buff.ptr is null) buff = new byte[huge];
> 
> 	byte[] left = buff;
> 
> 	T[] Alloca(T)(int i)
> 	{
> 		T[] ret = (cast(T*)left.ptr)[0..i];
> 		left = left[i*T.sizeof..$];
> 		return ret;
> 	}
> 
> 	/// code uses Alloca!(double) and Alloca!(double[]) for
> 	/// allocations. Don't use .length or ~=
> 
> 
> }

Thanks much for your response! I could certainly roll my own buffer
management. Unfortunately, the "real" app is more complicated
than the "proof of concept" code I posted, and doing so would require a bit
more work. After all, the main reason for using D for this type of thing
was that I didn't want to deal with manual memory management ;)

From the posts I gather that I am not doing anything fundamentally wrong, so I'll probably file a bug for this later.


April 08, 2008
Markus Dittrich wrote:
>
> Unfortunately, the "real" app is more complicated than the "proof of concept" code I posted and doing so would require a bit
> more work.
> 

life would be so much nicer if real life didn't get in the way :b
April 08, 2008
Markus Dittrich wrote:
> Hi,
> 
> For a data processing application I need to read a large number
> of data sets from disk. Due to their size, they have to be read and processed sequentially, i.e. in pseudocode
> 
> int main()
> {
>     while (some condition)
>     {
>          double[][] myLargeDataset = read_data();
>          process_data(myLargeDataset);
>          // free all memory here otherwise next cycle will
>          // run out of memory
>      }
> 
>    return 0;
> }
> 
> Now, the "problem" is the fact that each single data-set saturates
> system memory and hence I need to make sure that all memory
> is freed after each process_data step is complete. Unfortunately,
> using dmd-2.012 I have not been able to achieve this. Whatever
> I do (including nothing, i.e., letting the GC do its job), the resulting binary keeps accumulating memory and crashing shortly after. I've tried deleting the array, setting the array lengths to 0, and manually
> forcing the GC to collect, to no avail. Hence, is there something I am doing terribly wrong or is this a bug in dmd?

Markus,  you do not show us what either read_data or process_data do. It is possible that one of those is somehow holding on to references to the data.  This would prevent the GC from collecting the memory.

Another problem: if you allocate the memory initially as void[], the GC will scan it for pointers, and in a big float buffer you'll get a lot of false hits.  To prevent that, allocate the buffer initially as byte[] (or double[], just not void).
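
For illustration, the difference looks roughly like this (just a sketch; n is
a placeholder size):

// conservatively scanned: double bit patterns can masquerade as pointers
// and accidentally keep other blocks alive
void[] raw = new void[n * double.sizeof];

// element type carries no pointers, so the GC does not scan the contents
double[] nums = new double[n];
byte[] bytes = new byte[n * double.sizeof];   // same effect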

Anyway, if you want a speedy fix, you'll need to distill this down into something that is actually reproducible by Walter.

--bb
April 08, 2008
Bill Baxter Wrote:

> 
> Markus,  you do not show us what either read_data or process_data do. It is possible that one of those is somehow holding on to references to the data.  This would prevent the GC from collecting the memory.
> 
> Another problem is if you allocate the memory initially as void[] then the GC will scan it for pointers, and in a big float buffer you'll get a lot of false hits.  To prevent that, allocate the buffer initially as byte[] (or double[] --- just not void).
> 
> Anyway, if you want a speedy fix, you'll need to distill this down into something that is actually reproducible by Walter.
> 
> --bb

Hi Bill,

You're of course absolutely correct! Below is proof-of-concept code
that still exhibits the issue I was describing. The parse code needs
to handle row-centric ASCII data with a variable number of columns.
The file "data_random.dat" contains a single row of random integers.
After a few iterations the code runs out of memory on my machine,
and no amount of deleting seems to help.

import std.stream;
import std.stdio;
import std.contracts;
import std.gc;


public double[][] parse(BufferedFile inputFile)
{

  double[][] array;
  foreach(char[] line; inputFile)
  {
    double[] temp;

    foreach(string item; std.string.split(assumeUnique(line)))
    {
       temp ~= std.string.atof(item);
    }

    array ~= temp;
  }

  /* rewind for next round */
  inputFile.seekSet(0);

  return array;
}



int main()
{
  BufferedFile inputFile = new BufferedFile("data_random.dat");

  while(1)
  {
    double[][] foo = parse(inputFile);
  }

  return 1;
}

Thanks much,
Markus

April 09, 2008
Markus Dittrich wrote:
> Bill Baxter Wrote:
> 
>> Markus,  you do not show us what either read_data or process_data do. It is possible that one of those is somehow holding on to references to the data.  This would prevent the GC from collecting the memory.
>>
>> Another problem is if you allocate the memory initially as void[] then the GC will scan it for pointers, and in a big float buffer you'll get a lot of false hits.  To prevent that, allocate the buffer initially as byte[] (or double[] --- just not void).
>>
>> Anyway, if you want a speedy fix, you'll need to distill this down into something that is actually reproducible by Walter.
>>
>> --bb
> 
> Hi Bill,
> 
> You're of course absolutely correct! Below is a proof of concept code
> that still exhibits the issue I was describing. The parse code needs to handle row centric ascii data with a variable number of columns.
> The file "data_random.dat" contains a single row of random integers.
> After a few iterations the code runs out of memory on my machine
> and no deleting seems to help.
> 
> import std.stream;
> import std.stdio;
> import std.contracts;
> import std.gc;
> 
> 
> public double[][] parse(BufferedFile inputFile)
> {
> 
>   double[][] array;
>   foreach(char[] line; inputFile)
>   {
>     double[] temp;
> 
>     foreach(string item; std.string.split(assumeUnique(line)))
>     {
>        temp ~= std.string.atof(item);
>     }
> 
>     array ~= temp;
>   }
> 
>   /* rewind for next round */
>   inputFile.seekSet(0);
> 
>   return array;
> }
> 
> 
> 
> int main()
> {
>   BufferedFile inputFile = new BufferedFile("data_random.dat");
> 
>   while(1)
>   {
>     double[][] foo = parse(inputFile);
>   }
> 
>   return 1;
> }

Ok.  You should add that to the bug report.

However, that test program works fine for me on Windows.
I tried it with
  DMD/Phobos 1.028,
  DMD/Tango/Tangobos 1.028, and
  DMD/Phobos 2.012.

--bb
April 09, 2008
Markus Dittrich wrote:
> Bill Baxter Wrote:
> 
> 
>>Markus,  you do not show us what either read_data or process_data do. It is possible that one of those is somehow holding on to references to the data.  This would prevent the GC from collecting the memory.
>>
>>Another problem is if you allocate the memory initially as void[] then the GC will scan it for pointers, and in a big float buffer you'll get a lot of false hits.  To prevent that, allocate the buffer initially as byte[] (or double[] --- just not void).
>>
>>Anyway, if you want a speedy fix, you'll need to distill this down into something that is actually reproducible by Walter.
>>
>>--bb
> 
> 
> Hi Bill,
> 
> You're of course absolutely correct! Below is a proof of concept code
> that still exhibits the issue I was describing. The parse code needs to handle row centric ascii data with a variable number of columns.
> The file "data_random.dat" contains a single row of random integers.
> After a few iterations the code runs out of memory on my machine
> and no deleting seems to help.
> 

Does it change things if you drop the ~= in favor of extending the array via .length? What about preallocating the array with the correct size to begin with? (I know this might not be doable in the general case.)
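
For example, a preallocating variant of the posted parse could look roughly
like this (just a sketch; it assumes the row and column counts are known up
front, e.g. from a cheap first pass over the file, and passes them in as
parameters):

public double[][] parse(BufferedFile inputFile, size_t rows, size_t cols)
{
  double[][] array;
  array.length = rows;            // preallocate the outer array once
  foreach (ref row; array)
    row.length = cols;            // and each row

  size_t r = 0;
  foreach (char[] line; inputFile)
  {
    size_t c = 0;
    foreach (string item; std.string.split(assumeUnique(line)))
      array[r][c++] = std.string.atof(item);
    ++r;
  }

  /* rewind for next round */
  inputFile.seekSet(0);
  return array;
}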