Thread overview
4x speedup of recursive rmdir in std.file
Feb 04, 2012
Jay Norwood
Feb 05, 2012
Nick Sabalausky
Feb 05, 2012
Jay Norwood
Feb 05, 2012
Nick Sabalausky
Feb 05, 2012
Jay Norwood
Feb 07, 2012
Jay Norwood
Feb 07, 2012
deadalnix
February 04, 2012
It would be good if the std.file operations used the D multi-thread features, since you've done such a nice job of making them easy.  I hacked up your std.file recursive remove and got a 4x speed-up on a win7 system with a Core i7, using the examples from the D programming language book.  Code is below, with a hard-coded path to the directory I was using for the test.  I'm just learning this, so I know you can do better ...

Delete time dropped from 1 minute 5 secs to less than 15 secs. This was on an SSD.

module main;

import std.stdio;
import std.file;
import std.datetime;
import std.concurrency;
const int THREADS = 16;
int main(string[] argv)
{
   writeln("removing H:/pa10_120130/xx8");
	auto st1 = Clock.currTime(); //Current time in local time.
	rmdirRecurse2("H:/pa10_120130/xx8");
 	auto st2 = Clock.currTime(); //Current time in local time.
	auto dif = st2  - st1 ;
	auto ts= dif.toString();
	writeln("time:");
	writeln(ts);
	writeln("finished !");
   return 0;
}
void rmdirRecurse2(in char[] pathname){
    DirEntry de = dirEntry(pathname);
    rmdirRecurse2(de);
}
void rmdirRecurse2(ref DirEntry de){
	if(!de.isDir)
		throw new FileException( de.name, " is not a directory");
	if(de.isSymlink())
		remove(de.name);
	else    {
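		// D-style static array declaration (the C-style "Tid tid[THREADS]" form is not accepted by current compilers)
		Tid[THREADS] tid;
		int i=0;
		for(;i<THREADS;i++){
			tid[i]= spawn(&fileRemover);
		}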
		Tid tidd = spawn(&dirRemover);

		// all children, recursively depth-first
	    i=0;
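		foreach(DirEntry e; dirEntries(de.name, SpanMode.depth, false)) {
			string nm = e.name;
			if (attrIsDir(e.linkAttributes))
				tidd.send(nm);   // directories go to the single directory remover
			else
				tid[i].send(nm); // regular files are dealt out round-robin
			i=(i+1)%THREADS;
		}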

		// wait for the THREADS threads to complete their file removes
		// and acknowledge receipt of the tid
		for (i=0;i<THREADS;i++){
			tid[i].send(thisTid);
			receiveOnly!Tid();
		}
		tidd.send(thisTid);
		receiveOnly!Tid();

		// the dir itself
		rmdir(de.name);
	}
}
	void fileRemover() {
		for(bool running=true;running;){
		receive(
				(string s) {
					remove(s);
				}, // remove the files
				(Tid x) {
					x.send(thisTid);
					running=false;

				} // this is the terminator
				);
		}
	}

	void dirRemover() {
		string[] dirs;
		for(bool running=true;running;){
			receive(
					(string s) {
						dirs~=s;
					},
					(Tid x) {
						foreach(string d;dirs){
							rmdir(d);
						}
						x.send(thisTid);
						running = false;
					}
					);
		}
	}


February 05, 2012
"Jay Norwood" <jayn@prismnet.com> wrote in message news:jgkfdf$qb5$1@digitalmars.com...
> It would be good if the std.file operations used the D multi-thread features, since you've done such a nice job of making them easy.  I hacked up your std.file recursive remove and got a 4x speed-up on a win7 system with a Core i7, using the examples from the D programming language book.  Code is below, with a hard-coded path to the directory I was using for the test.  I'm just learning this, so I know you can do better ...
>
> Delete time dropped from 1 minute 5 secs to less than 15 secs. This was on an SSD.
>

Interesting. How does it perform when just running on one core?


February 05, 2012
== Quote from Nick Sabalausky (a@a.a)'s article
 > Interesting. How does it perform when just running on one core?

The library without the threads is 1 min 5 secs for the 1.5GB directory structure with about 32k files.  This is on an Intel 510 series SSD.  The win7 os removes it in almost exactly the same time, and you can see from their task manager it is also being done single core and only a small percentage of cpu.  In contrast, all 8 threads in the task manager max out for a period when running this multi-thread remove. The regular file deletes are occurring in parallel.  A single thread removes the directory structure after waiting for all the regular files to be deleted by the parallel threads.  I attached a screen capture.

I tried last night to do a similar thing with the unzip processing in std.zip, but the library code is written in such a way that the parallel threads would need to create the whole zip archive directory in order to process the elements.   I would hope to be able to solve this problem and provide a similar 4x speedup to the unzip of, for example 7zip, which is currently also showing execution on a single thread.  7zip takes about 50 seconds to unzip this file.

What is needed is probably a dumber archive element processing call that gets passed an archive element immutable structure read by the main thread.  The parallel threads could then seek to the position and process just each assigned single element without loading the whole file.
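
A rough sketch of what such a call might look like, not based on the actual std.zip code. ElementInfo, readElement and the demo file layout are all hypothetical names invented for illustration; the "archive" here is just a flat file with three members stored back to back, and no decompression is attempted. The main thread builds the descriptors, and each parallel worker opens the file, seeks to its own member, and reads only those bytes.

```d
import std.parallelism : parallel;
import std.stdio : File;

// Hypothetical descriptor for one archive member; not part of std.zip.
// Built once by the main thread and only read by the workers.
struct ElementInfo
{
    string name;    // member name, for reporting
    long offset;    // byte offset of the member's data in the archive file
    size_t length;  // number of bytes belonging to the member
}

// Read just one member: open, seek, read. The whole file is never loaded.
ubyte[] readElement(string archivePath, const ElementInfo info)
{
    if (info.length == 0)
        return null;            // rawRead requires a non-empty buffer
    auto f = File(archivePath, "rb");
    f.seek(info.offset);
    auto buf = new ubyte[info.length];
    return f.rawRead(buf);      // slice of buf that was actually filled
}

void main()
{
    import std.file : remove, tempDir, write;
    import std.path : buildPath;

    // Stand-in for an archive: three members stored back to back.
    auto path = buildPath(tempDir(), "elements_demo.bin");
    write(path, "aaaabbbbbbcc");
    scope (exit) remove(path);

    auto elements = [
        ElementInfo("a", 0, 4),
        ElementInfo("b", 4, 6),
        ElementInfo("c", 10, 2),
    ];

    // One work unit per member; each worker writes only its own slot.
    auto results = new ubyte[][elements.length];
    foreach (i, info; parallel(elements, 1))
        results[i] = readElement(path, info);

    assert(cast(string) results[0] == "aaaa");
    assert(cast(string) results[1] == "bbbbbb");
    assert(cast(string) results[2] == "cc");
}
```

Each worker opens its own File handle so that the seek positions do not interfere with one another; whether that is cheaper than sharing one handle behind a lock would have to be measured.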

Also, the current design requires a memory buffer with the whole zip archive in it before it can create the archive directory. There should instead be some way of sequentially processing the file.
February 05, 2012
"Jay Norwood" <jayn@prismnet.com> wrote in message news:jgm5vh$hbe$1@digitalmars.com...
> == Quote from Nick Sabalausky (a@a.a)'s article
> > Interesting. How does it perform when just running on one core?
>
> The library without the threads is 1 min 5 secs for the 1.5GB directory structure with about 32k files.  This is on an Intel 510 series SSD.  The win7 os removes it in almost exactly the same time, and you can see from their task manager it is also being done single core and only a small percentage of cpu.  In contrast, all 8 threads in the task manager max out for a period when running this multi-thread remove. The regular file deletes are occurring in parallel.  A single thread removes the directory structure after waiting for all the regular files to be deleted by the parallel threads.  I attached a screen capture.
>

What I'm wondering is this:

Suppose all the cores but one are already preoccupied with other stuff, or maybe you're even running on a single-core. Does the threading add enough overhead that it would actually go slower than the original single-threaded version?

If not, then this would indeed be a fantastic improvement to phobos. Otherwise, I wonder how such a situation could be mitigated?

> I tried last night to do a similar thing with the unzip processing in std.zip, but the library code is written in such a way that the parallel threads would need to create the whole zip archive directory in order to process the elements.   I would hope to be able to solve this problem and provide a similar 4x speedup to the unzip of, for example 7zip, which is currently also showing execution on a single thread.  7zip takes about 50 seconds to unzip this file.
>

That would be cool.


February 05, 2012
On 2/5/12 10:16 AM, Nick Sabalausky wrote:
> "Jay Norwood"<jayn@prismnet.com>  wrote in message
> news:jgm5vh$hbe$1@digitalmars.com...
>> == Quote from Nick Sabalausky (a@a.a)'s article
>>> Interesting. How does it perform when just running on one core?
>>
>> The library without the threads is 1 min 5 secs for the 1.5GB
>> directory structure with about 32k files.  This is on an Intel 510
>> series SSD.  The win7 os removes it in almost exactly the
>> same time, and you can see from their task manager it is also
>> being done single core and only a small percentage of cpu.  In
>> contrast, all 8 threads in the task manager max out for a period
>> when running this multi-thread remove. The regular file deletes
>> are occurring in parallel.  A single thread removes the directory
>> structure after waiting for all the regular files to be deleted by
>> the parallel threads.  I attached a screen capture.
>>
>
> What I'm wondering is this:
>
> Suppose all the cores but one are already preoccupied with other stuff, or
> maybe you're even running on a single-core. Does the threading add enough
> overhead that it would actually go slower than the original single-threaded
> version?
>
> If not, then this would indeed be a fantastic improvement to phobos.
> Otherwise, I wonder how such a situation could be mitigated?

There's a variety of ways, but the simplest approach is to pass a parameter to the function telling how many threads it's allowed to spawn. Jay?

Andrei


February 05, 2012
== Quote from Andrei Alexandrescu
> > Suppose all the cores but one are already preoccupied with other stuff, or
> > maybe you're even running on a single-core. Does the threading add enough
> > overhead that it would actually go slower than the original single-threaded
> > version?
> >
> > If not, then this would indeed be a fantastic improvement to phobos.
> > Otherwise, I wonder how such a situation could be mitigated?
> There's a variety of ways, but the simplest approach is to pass a parameter to the function telling how many threads it's allowed to
> spawn. Jay?
> Andrei

I can tell you that there are a couple of seconds improvement in the execution time running 16 threads vs 8 on the i7 on the ssd drive, so we aren't keeping all the cores busy with 8 threads. I suppose they are all blocked waiting for file system operations for some portion of time even with 8 threads.  I would guess that even on a single core it would be an advantage to have multiple threads available for the core to work on when it blocks waiting for the fs operations.

The previous results were on an SSD.  I tried again on a Seagate SATA3 7200 rpm hard drive: it took 2 minutes 12 sec to delete the same layout using the OS, and never used more than 10% cpu.

The one thread configuration of the D program similarly used less than 10% cpu but took only 1 minute 50 seconds to delete the same layout.

Any configuration above 1 thread began degrading the D program's performance on the hard drive. I'll have to scratch my head on this a while.  This is on an Optiplex 790, win7-64, using the board's SATA for both the SSD and the HD.

The extract of the zip using 7zip takes 1:55 on the seagate disk drive, btw ... vs about 50 secs on the ssd.





February 05, 2012
On 2/5/12 3:04 PM, Jay Norwood wrote:
> I can tell you that there are a couple of seconds improvement in
> the execution time running 16 threads vs 8 on the i7 on the ssd
> drive, so we aren't keeping all the cores busy with 8 threads. I
> suppose they are all blocked waiting for file system operations
> for some portion of time even with 8 threads.  I would guess that
> even on a single core it would be an advantage to have multiple
> threads available for the core to work on when it blocks waiting
> for the fs operations.
[snip]

That's why I'm saying - let's leave the decision to the user. Take a uint parameter for the number of threads to be used, where 0 means leave it to phobos, and default to 0.
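
A minimal sketch of that interface, assuming std.parallelism is used for the file removes. rmdirRecurseParallel is a hypothetical name, not an existing std.file function, and the work unit size of 100 is an arbitrary choice. Passing 0 uses the shared default taskPool; any other value creates a private pool. The thread that runs a parallel foreach also executes work units, so n threads are obtained with n - 1 workers.

```d
import std.file : attrIsDir, dirEntries, DirEntry, remove, rmdirRecurse, SpanMode;
import std.parallelism : TaskPool, taskPool;

// Hypothetical signature; std.file has no such overload.
// nThreads == 0 means "leave it to the library".
void rmdirRecurseParallel(string pathname, uint nThreads = 0)
{
    // Collect everything that is not a directory (symlinks are not followed).
    string[] files;
    foreach (DirEntry e; dirEntries(pathname, SpanMode.depth, false))
        if (!attrIsDir(e.linkAttributes))
            files ~= e.name;

    if (nThreads == 0)
    {
        // Shared default pool, sized by the library.
        foreach (fn; taskPool.parallel(files, 100))
            remove(fn);
    }
    else
    {
        // The calling thread takes work too, hence nThreads - 1 workers.
        auto pool = new TaskPool(nThreads - 1);
        scope (exit) pool.finish(true); // let the worker threads terminate
        foreach (fn; pool.parallel(files, 100))
            remove(fn);
    }

    // Only empty directories are left; the stock routine removes them.
    rmdirRecurse(pathname);
}

void main()
{
    import std.file : exists, mkdirRecurse, tempDir, write;
    import std.path : buildPath;

    // Build a small throwaway tree and delete it with two threads.
    auto root = buildPath(tempDir(), "rmdir_parallel_demo");
    mkdirRecurse(buildPath(root, "sub"));
    write(buildPath(root, "a.txt"), "a");
    write(buildPath(root, "sub", "b.txt"), "b");

    rmdirRecurseParallel(root, 2);
    assert(!exists(root));
}
```

With nThreads set to 1 the private pool has no workers and the calling thread does all the removes itself, which gives a single-threaded fallback for spinning disks.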

Andrei

February 07, 2012
Andrei Alexandrescu Wrote:
> That's why I'm saying - let's leave the decision to the user. Take a uint parameter for the number of threads to be used, where 0 means leave it to phobos, and default to 0.
> 
> Andrei
> 


ok, here is another version.  I was reading about the std.parallelism library, and I see I can do the parallel removes more cleanly.  Plus the library figures out the number of cores and limits the taskpool size accordingly. It is only a slight bit slower than the other code.  It looks like they choose 7 threads in the taskPool when you have 8 cores.
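
The pool size can be checked directly. This small program only prints what std.parallelism reports; the default pool is created with totalCPUs - 1 workers because the thread that submits the work also takes part in it.

```d
import std.parallelism : taskPool, totalCPUs;
import std.stdio : writeln;

void main()
{
    writeln("logical CPUs:         ", totalCPUs);
    writeln("default pool workers: ", taskPool.size);

    // Holds as long as defaultPoolThreads has not been changed
    // before the first use of taskPool.
    assert(taskPool.size == totalCPUs - 1);
}
```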

So, I do the regular files in parallel, then pass it back to the original library code which cleans up the  directory-only  tree non-parallel.  I also added in code to get the directory names from argv.


module main;

import std.stdio;
import std.file;
import std.datetime;
import std.parallelism;

int main(string[] argv)
{
	if (argv.length < 2){
		writeln ("need to specify one or more directories to remove");
		return 0;
	}
	foreach(string dir; argv[1..$]){
		writeln("removing directory: "~ dir );
		auto st1 = Clock.currTime(); //Current time in local time.
		rmdirRecurse2(dir);
 		auto st2 = Clock.currTime(); //Current time in local time.
		auto dif = st2  - st1 ;
		auto ts= dif.toString();
		writeln("time:"~ts);
	}
	writeln("finished !");
	return 0;
}
void rmdirRecurse2(in char[] pathname){
    DirEntry de = dirEntry(pathname);
    rmdirRecurse2(de);
}
void rmdirRecurse2(ref DirEntry de){
	string[] files;

	if(!de.isDir)
		throw new FileException( de.name, " is not a directory");
	if(de.isSymlink())
		remove(de.name);
	else    {
		// make an array of the regular files only
		foreach(DirEntry e; dirEntries(de.name, SpanMode.depth, false)){
			if (!attrIsDir(e.linkAttributes)){
				files ~= e.name;
			}
		}

		// parallel foreach for regular files
		foreach(fn ; taskPool.parallel(files,1000)) {
			remove(fn);
		}

		// let the original code remove the directories only
		rmdirRecurse(de);
	}
}


February 07, 2012
Le 05/02/2012 18:38, Andrei Alexandrescu a écrit :
> On 2/5/12 10:16 AM, Nick Sabalausky wrote:
>> "Jay Norwood"<jayn@prismnet.com> wrote in message
>> news:jgm5vh$hbe$1@digitalmars.com...
>>> == Quote from Nick Sabalausky (a@a.a)'s article
>>>> Interesting. How does it perform when just running on one core?
>>>
>>> The library without the threads is 1 min 5 secs for the 1.5GB
>>> directory structure with about 32k files. This is on an Intel 510
>>> series SSD. The win7 os removes it in almost exactly the
>>> same time, and you can see from their task manager it is also
>>> being done single core and only a small percentage of cpu. In
>>> contrast, all 8 threads in the task manager max out for a period
>>> when running this multi-thread remove. The regular file deletes
>>> are occurring in parallel. A single thread removes the directory
>>> structure after waiting for all the regular files to be deleted by
>>> the parallel threads. I attached a screen capture.
>>>
>>
>> What I'm wondering is this:
>>
>> Suppose all the cores but one are already preoccupied with other stuff, or
>> maybe you're even running on a single-core. Does the threading add enough
>> overhead that it would actually go slower than the original single-threaded
>> version?
>>
>> If not, then this would indeed be a fantastic improvement to phobos.
>> Otherwise, I wonder how such a situation could be mitigated?
>
> There's a variety of ways, but the simplest approach is to pass a
> parameter to the function telling how many threads it's allowed to
> spawn. Jay?
>
> Andrei
>
>

That could be a solution, but it is a bad separation of concerns IMO, and it shouldn't be done like that in phobos.

The parameter should be a thread pool or something similar. This allows you not only to choose the number of threads, but also to choose how the task is distributed over the threads, and possibly to mix those tasks with other tasks (by using the same thread pool in other places).

It basically separates the problem of deleting from the problem of spreading the task over multiple threads, and with which policy.
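
A sketch of that idea using std.parallelism's TaskPool as the "thread pool or something similar". rmdirRecurseWith is a hypothetical name and the work unit size is an arbitrary default; nothing like it exists in phobos. The delete routine knows nothing about how many threads there are. The caller creates the pool, decides its size, and can reuse it for other work.

```d
import std.file : attrIsDir, dirEntries, DirEntry, remove, rmdirRecurse, SpanMode;
import std.parallelism : TaskPool;

// Hypothetical helper: the caller owns the pool, its size and its lifetime.
void rmdirRecurseWith(TaskPool pool, string pathname, size_t workUnitSize = 100)
{
    // Collect everything that is not a directory (symlinks are not followed).
    string[] files;
    foreach (DirEntry e; dirEntries(pathname, SpanMode.depth, false))
        if (!attrIsDir(e.linkAttributes))
            files ~= e.name;

    // Distribution policy is the pool's business, not this function's.
    foreach (fn; pool.parallel(files, workUnitSize))
        remove(fn);

    // Only empty directories remain.
    rmdirRecurse(pathname);
}

void main()
{
    import std.file : exists, mkdirRecurse, tempDir, write;
    import std.path : buildPath;

    // One pool, shared by two separate delete jobs.
    auto pool = new TaskPool(3);
    scope (exit) pool.finish(true); // let the worker threads terminate

    foreach (name; ["pool_demo_a", "pool_demo_b"])
    {
        auto root = buildPath(tempDir(), name);
        mkdirRecurse(buildPath(root, "sub"));
        write(buildPath(root, "sub", "f.txt"), "x");

        rmdirRecurseWith(pool, root);
        assert(!exists(root));
    }
}
```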