Jump to page: 1 2
Thread overview
Trying to reduce memory usage
Feb 12
Josh
Feb 12
mw
Feb 12
frame
Feb 12
frame
Feb 13
Daniel N
Feb 17
tsbockman
Feb 17
tsbockman
Feb 23
tsbockman
February 12
I'm trying to read in a text file that has many duplicated lines and output a file with all the duplicates removed. By the end of this code snippet, the memory usage is ~5x the size of the infile (which can be multiple GB each), and when this is in a loop the memory usage becomes unmanageable and often results in an OutOfMemory error or just a complete lock up of the system. Is there a way to reduce the memory usage of this code without sacrificing speed to any noticeable extent? My assumption is the .sort.uniq needs improving, but I can't think of an easier/not much slower way of doing it.

Windows 10 x64
LDC - the LLVM D compiler (1.21.0-beta1):
  based on DMD v2.091.0 and LLVM 10.0.0

-----------------------------------

auto filename = "path\\to\\file.txt.temp";
auto array = appender!(string[]);
File infile = File(filename, "r");
foreach (line; infile.byLine) {
  array ~= line.to!string;
}
File outfile = File(stripExtension(filename), "w");
foreach (element; (array[]).sort.uniq) {
  outfile.myrawWrite(element ~ "\n"); // used to not print the \r on windows
}
outfile.close;
array.clear;
array.shrinkTo(0);
infile.close;

-----------------------------------

Thanks.
February 12
On Friday, 12 February 2021 at 01:23:14 UTC, Josh wrote:
> I'm trying to read in a text file that has many duplicated lines and output a file with all the duplicates removed.

If you only need to remove duplicates, keep (and compare) a string hash for each line is good enough. Memory usage should be just n x integers.


February 11
On Fri, Feb 12, 2021 at 01:45:23AM +0000, mw via Digitalmars-d-learn wrote:
> On Friday, 12 February 2021 at 01:23:14 UTC, Josh wrote:
> > I'm trying to read in a text file that has many duplicated lines and output a file with all the duplicates removed.
> 
> If you only need to remove duplicates, keep (and compare) a string hash for each line is good enough. Memory usage should be just n x integers.
[...]

+1. This can be even done on-the-fly: you don't even need to use .sort or .uniq.  Just something like this:

	bool[size_t] hashes;
	foreach (line; stdin.byLine) {
		auto h = hashOf(line); // use a suitable hash function here
		if (h !in hashes) {
			outfile.writeln(line);
			hashes[h] = true;
		}
		// else this line already seen before; just skip it
	}

This turns the OP's O(n log n) algorithm into an O(n) algorithm, doesn't
need to copy the entire content of the file into memory, and also uses
much less memory by storing only hashes.


T

-- 
MASM = Mana Ada Sistem, Man!
February 12
On Friday, 12 February 2021 at 02:22:35 UTC, H. S. Teoh wrote:

> This turns the OP's O(n log n) algorithm into an O(n) algorithm, doesn't
> need to copy the entire content of the file into memory, and also uses
> much less memory by storing only hashes.

But this kind of hash is maybe insufficient to avoid hash collisions. For such big data slower but stronger algorithms like SHA are advisable.

Also associative arrays uses the same weak algorithm where you can run into collision issues. Thus using the hash from string data as key can be a problem. I always use a quick hash as key but hold actually a collection of hashes in them and do a lookup to be on the safe side.


February 12
On Friday, 12 February 2021 at 07:23:12 UTC, frame wrote:
> On Friday, 12 February 2021 at 02:22:35 UTC, H. S. Teoh wrote:
>
>> This turns the OP's O(n log n) algorithm into an O(n) algorithm, doesn't
>> need to copy the entire content of the file into memory, and also uses
>> much less memory by storing only hashes.
>
> But this kind of hash is maybe insufficient to avoid hash collisions. For such big data slower but stronger algorithms like SHA are advisable.
>
> Also associative arrays uses the same weak algorithm where you can run into collision issues. Thus using the hash from string data as key can be a problem. I always use a quick hash as key but hold actually a collection of hashes in them and do a lookup to be on the safe side.

Forgot to mention that this kind of solution needs a better approach if you don't want to miss a potential different line:

You can use a weak hash but track the line position and count how often the same hash occurs as a pre-process. In the post-process you look for this lines again and compare if they are really identical or hash collisions to correct.
February 12
On Fri, Feb 12, 2021 at 07:23:12AM +0000, frame via Digitalmars-d-learn wrote:
> On Friday, 12 February 2021 at 02:22:35 UTC, H. S. Teoh wrote:
> 
> > This turns the OP's O(n log n) algorithm into an O(n) algorithm,
> > doesn't need to copy the entire content of the file into memory, and
> > also uses much less memory by storing only hashes.
> 
> But this kind of hash is maybe insufficient to avoid hash collisions. For such big data slower but stronger algorithms like SHA are advisable.

I used toHash merely as an example. Obviously, you should use a hash that works well with the input data you're trying to process (i.e., minimal chances of collision, not too slow to compute, etc.). SHA hashes are probably a safe bet, as chances of collision are negligible.


> Also associative arrays uses the same weak algorithm where you can run into collision issues. Thus using the hash from string data as key can be a problem. I always use a quick hash as key but hold actually a collection of hashes in them and do a lookup to be on the safe side.
[...]

You can use a struct wrapper that implements its own toHash method.


T

-- 
Mediocrity has been pushed to extremes.
February 12
On 2/11/21 6:22 PM, H. S. Teoh wrote:

> 	bool[size_t] hashes;

I would start with an even simpler solution until it's proven that there still is a memory issue:

import std.stdio;

void main() {
	bool[string] lines;
	foreach (line; stdin.byLine) {
		if (line !in lines) {
			stdout.writeln(line);
			lines[line.idup] = true;
		}
		// else this line already seen before; just skip it
	}
}

(Grr... Thanks for the tab characters! :p)

Ali

February 13
On Saturday, 13 February 2021 at 04:19:17 UTC, Ali Çehreli wrote:
> On 2/11/21 6:22 PM, H. S. Teoh wrote:
>
> > 	bool[size_t] hashes;
>
> I would start with an even simpler solution until it's proven that there still is a memory issue:
>
> import std.stdio;
>
> void main() {
> 	bool[string] lines;
> 	foreach (line; stdin.byLine) {
> 		if (line !in lines) {
> 			stdout.writeln(line);
> 			lines[line.idup] = true;
> 		}
> 		// else this line already seen before; just skip it
> 	}
> }
>
> (Grr... Thanks for the tab characters! :p)
>
> Ali

Try this Boost Licensed tool.
https://github.com/eBay/tsv-utils/tree/master/tsv-uniq


February 17
On Friday, 12 February 2021 at 01:23:14 UTC, Josh wrote:
> I'm trying to read in a text file that has many duplicated lines and output a file with all the duplicates removed. By the end of this code snippet, the memory usage is ~5x the size of the infile (which can be multiple GB each), and when this is in a loop the memory usage becomes unmanageable and often results in an OutOfMemory error or just a complete lock up of the system. Is there a way to reduce the memory usage of this code without sacrificing speed to any noticeable extent? My assumption is the .sort.uniq needs improving, but I can't think of an easier/not much slower way of doing it.

I spent some time experimenting with this problem, and here is the best solution I found, assuming that perfect de-duplication is required. (I'll put the code up on GitHub / dub if anyone wants to have a look.)

--------------------------
    0) Memory map the input file, so that the program can pass around slices to it directly
without making copies. This also allows the OS to page it in and out of physical memory
for us, even if it is too large to fit all at once.

    1) Pre-compute the required space for all large data structures, even if an additional pass is required to do so. This makes the rest of the algorithm significantly more efficient with memory, time, and lines of code.

    2) Do a top-level bucket sort of the file using a small (8-16 bit) hash into some scratch space. The target can be either in RAM, or in another memory-mapped file if we really need to minimize physical memory use.

The small hash can be a few bits taken off the top of a larger hash (I used std.digest.murmurhash). The larger hash is cached for use later on, to accelerate string comparisons, avoid unnecessary I/O, and perhaps do another level of bucket sort.

If there is too much data to put in physical memory all at once, be sure to copy the full text of each line into a region of the scratch file where it will be together with the other lines that share the same small hash. This is critical, as otherwise the string comparisons in the next step turn into slow random I/O.

    3) For each bucket, sort, filter out duplicates, and write to the output file. Any sorting algorithm(s) may be used if all associated data fits in physical memory. If not, use a merge sort, whose access patterns won't thrash the disk too badly.

    4) Manually release all large data structures, and delete the scratch file, if one was used. This is not difficult to do, since their life times are well-defined, and ensures that the program won't hang on to GiB of space any longer than necessary.
--------------------------

I wrote an optimized implementation of this algorithm. It's fast, efficient, and really does work on files too large for physical memory. However, it is complicated at almost 800 lines.

On files small enough to fit in RAM, it is similar in speed to the other solutions posted, but less memory hungry. Memory consumption in this case is around (sourceFile.length + 32 * lineCount * 3 / 2) bytes. Run time is similar to other posted solutions: about 3 seconds per GiB on my desktop.

When using a memory-mapped scratch file to accommodate huge files, the physical memory required is around max(largestBucket.data.length + 32 * largestBucket.lineCount * 3 / 2, bucketCount * writeBufferSize) bytes. (Virtual address space consumption is far higher, and the OS will commit however much physical memory is available and not needed by other tasks.) The run time is however long it takes the disk to read the source file twice, write a (sourceFile.length + 32 * lineCount * 3 / 2) byte scratch file, read back the scratch file, and write the destination file.

I tried it with a 38.8 GiB, 380_000_000 line file on a magnetic hard drive. It needed a 50.2 GiB scratch file and took about an hour (after much optimization and many bug fixes).
February 17
On Wednesday, 17 February 2021 at 04:10:24 UTC, tsbockman wrote:
> On files small enough to fit in RAM, it is similar in speed to the other solutions posted, but less memory hungry. Memory consumption in this case is around (sourceFile.length + 32 * lineCount * 3 / 2) bytes. Run time is similar to other posted solutions: about 3 seconds per GiB on my desktop.

Oops, I think the memory consumption should be (sourceFile.length + 32 * (lineCount + largestBucket.lineCount / 2)) bytes. (In the limit where everything ends up in one bucket, it's the same, but that shouldn't normally happen unless the entire file has only one unique line in it.)
« First   ‹ Prev
1 2