February 12, 2006
"Walter Bright" <newshound@digitalmars.com> wrote in message news:dsobqi$2mgg$1@digitaldaemon.com...
>
> "Matthew" <matthew@hat.stlsoft.dot.org> wrote in message news:dso4rh$2d2u$1@digitaldaemon.com...
> > I decided that facts would be more powerful than conjecture. Enumerating
> > source files - something I commonly do - using both a D program using
> > listdir and a Ruby script using recls/Ruby, I get the following results
> > for
> > one of my main work directories H:/Dev. The search pattern is
> > "*.h;*.rb;*.hpp;*.java;*.d;*.py;*.vbs;*.c;*.cpp;*.cxx;*.idl;*.odl;*.rc".
>
> Please post the code of the benchmark you're using, as I can't comment on
it
> until I see exactly what it's doing.

Go wild





February 12, 2006
Had to split the post, as attachments combined exceeded NG limit. Latest whereis.exe can be obtained from http://synesis.com.au/downloads/recls/whereis.zip






February 12, 2006
"Matthew" <matthew@hat.stlsoft.dot.org> wrote in message news:dsod98$2o63$1@digitaldaemon.com...
> Go wild

Thanks. Let's have some fun with the analysis! Here's your version:
-------------------------------------------------
import std.file;
import std.stdio;
import std.string;

int main()
{
 char[]  PATTERNS =
"*.h;*.rb;*.hpp;*.java;*.d;*.py;*.vbs;*.c;*.cpp;*.cxx;*.idl;*.odl;*.rc";
// char[]  ROOT  = "H:/dev";
 char[]  ROOT  = "H:/";
// char[]  ROOT  = "H:/freelibs/openrj/current";
 char[][] filters  = split(PATTERNS, ";");
 int   n   = 0;

 foreach(char[] pattern; filters)
 {
//  writefln("pattern: %s", pattern);

  char[][] fs = listdir(ROOT, pattern);

  foreach(char[] fe; fs)
  {
//   writefln("File: %s", fe);
   ++n;
  }
 }

 printf("Number: %d\n", n);

 return 0;
}
--------------------------------------------------
There are 13 patterns searched for, so 13 trips through listdir. Taking your numbers, the D version takes roughly 6 times longer to run than the Ruby one. I adapted it slightly to my system so I could run it, and set it up for 8 patterns:
--------------------------------------------------
import std.file;
import std.stdio;
import std.string;

int main()
{
    char[]   PATTERNS = "*.h;*.hpp;*.bak;*.d;*.vbs;*.c;*.cpp;*.exe";
    char[]   ROOT     = r"c:\cbx";
    char[][] filters  = split(PATTERNS, ";");
    int n             = 0;

    foreach(char[] pattern; filters)
    {
//      writefln("pattern: %s", pattern);

        char[][] fs = listdir(ROOT, pattern);

        foreach(char[] fe; fs)
        {
//          writefln("File: %s", fe);
            ++n;
        }
    }

    printf("Number: %d\n", n);

    return 0;
}
--------------------------------------------------
As I think I said previously, the listdir one used in D just does the (primitive) fnmatch to do the pattern match, which is intended to behave like the operating system wildcard matching behavior. It does not support "OR" patterns. But we can write our own to use regular expressions, as in:
---------------------------------------------------
import std.file;
import std.stdio;
import std.string;

int main()
{
    char[] pattern = r"\.(h|hpp|bak|d|vbs|c|cpp|exe)$";
    char[] ROOT = r"c:\cbx";
    int    n    = 0;

//  writefln("pattern: %s", pattern);

    char[][] fs = listdir(ROOT, pattern);

    foreach(char[] fe; fs)
    {
//      writefln("File: %s", fe);
        ++n;
    }

    printf("Number: %d\n", n);

    return 0;
}

import std.regexp;

char[][] listdir(char[] pathname, char[] pattern)
{   char[][] result;
    auto r = new RegExp(pattern);

    bool callback(DirEntry* de)
    {
        if (de.isdir)
            std.file.listdir(de.name, &callback);
        else
        {   if (r.test(de.name))
                result ~= de.name;
        }
        return true; // continue
    }

    std.file.listdir(pathname, &callback);
    return result;
}
----------------------------------------------------
There's only one trip through listdir(). What are the numbers?

Matthew's: 2.67
Walter's: 0.36

Interestingly, that shows a 7.4 times speedup, but 8 patterns. This implies that the D version might be perhaps twice as fast as the Ruby one on your system. (I don't have Ruby on my system.) This post is long enough, I'll provide more design commentary in another post.


February 12, 2006
"Walter Bright" <newshound@digitalmars.com> wrote in message news:dsogst$2u8b$1@digitaldaemon.com...
>
> "Matthew" <matthew@hat.stlsoft.dot.org> wrote in message news:dsod98$2o63$1@digitaldaemon.com...
> > Go wild
>
> Thanks. Let's have some fun with the analysis! Here's your version:
> -------------------------------------------------
> import std.file;
> import std.stdio;
> import std.string;
>
> int main()
> {
>  char[]  PATTERNS =
> "*.h;*.rb;*.hpp;*.java;*.d;*.py;*.vbs;*.c;*.cpp;*.cxx;*.idl;*.odl;*.rc";
> // char[]  ROOT  = "H:/dev";
>  char[]  ROOT  = "H:/";
> // char[]  ROOT  = "H:/freelibs/openrj/current";
>  char[][] filters  = split(PATTERNS, ";");
>  int   n   = 0;
>
>  foreach(char[] pattern; filters)
>  {
> //  writefln("pattern: %s", pattern);
>
>   char[][] fs = listdir(ROOT, pattern);
>
>   foreach(char[] fe; fs)
>   {
> //   writefln("File: %s", fe);
>    ++n;
>   }
>  }
>
>  printf("Number: %d\n", n);
>
>  return 0;
> }
> --------------------------------------------------
> There are 13 patterns searched for, so 13 trips through listdir. Taking
your
> numbers, the D version takes roughly 6 times longer to run than the Ruby one. I adapted it slightly to my system so I could run it, and set it up
for
> 8 patterns:
> --------------------------------------------------
> import std.file;
> import std.stdio;
> import std.string;
>
> int main()
> {
>     char[]   PATTERNS = "*.h;*.hpp;*.bak;*.d;*.vbs;*.c;*.cpp;*.exe";
>     char[]   ROOT     = r"c:\cbx";
>     char[][] filters  = split(PATTERNS, ";");
>     int n             = 0;
>
>     foreach(char[] pattern; filters)
>     {
> //      writefln("pattern: %s", pattern);
>
>         char[][] fs = listdir(ROOT, pattern);
>
>         foreach(char[] fe; fs)
>         {
> //          writefln("File: %s", fe);
>             ++n;
>         }
>     }
>
>     printf("Number: %d\n", n);
>
>     return 0;
> }
> --------------------------------------------------
> As I think I said previously, the listdir one used in D just does the (primitive) fnmatch to do the pattern match, which is intended to behave like the operating system wildcard matching behavior. It does not support "OR" patterns. But we can write our own to use regular expressions, as in:
> ---------------------------------------------------
> import std.file;
> import std.stdio;
> import std.string;
>
> int main()
> {
>     char[] pattern = r"\.(h|hpp|bak|d|vbs|c|cpp|exe)$";
>     char[] ROOT = r"c:\cbx";
>     int    n    = 0;
>
> //  writefln("pattern: %s", pattern);
>
>     char[][] fs = listdir(ROOT, pattern);
>
>     foreach(char[] fe; fs)
>     {
> //      writefln("File: %s", fe);
>         ++n;
>     }
>
>     printf("Number: %d\n", n);
>
>     return 0;
> }
>
> import std.regexp;
>
> char[][] listdir(char[] pathname, char[] pattern)
> {   char[][] result;
>     auto r = new RegExp(pattern);
>
>     bool callback(DirEntry* de)
>     {
>         if (de.isdir)
>             std.file.listdir(de.name, &callback);
>         else
>         {   if (r.test(de.name))
>                 result ~= de.name;
>         }
>         return true; // continue
>     }
>
>     std.file.listdir(pathname, &callback);
>     return result;
> }
> ----------------------------------------------------
> There's only one trip through listdir(). What are the numbers?
>
> Matthew's: 2.67
> Walter's: 0.36
>
> Interestingly, that shows a 7.4 times speedup, but 8 patterns. This
implies
> that the D version might be perhaps twice as fast as the Ruby one on your system. (I don't have Ruby on my system.) This post is long enough, I'll provide more design commentary in another post.

LOL! So you've shown that you can rewrite what is essentially your own - please don't call it Matther's code; I wouldn't write something like that in a blue fit - code faster than you did previously after someone points out that it was not done well in the first place? Well, I guess we'll have to take some comfort in that.

But questions abound: How many matching files are under c:\cbx? What is 2.67? What is 0.36? How were these times taken? Why have you changed the number of patterns? Hey, why not just use 1? What do you think you've proved/demonstrated by a test of some D against some other D? Why haven't you installed Ruby and done a comparison? Where is the comparison of the memory hit? Why offer conjecture ("perhaps twice as fast as the Ruby one") when I spent a lot of time and effort producing cold hard facts?

Aside: I have little doubt that you will find a way to convince yourself that listdir() is perfectly good enough and that you can code-tweak its _design flaws_ into irrelevance, and that std.recls is therefore not worth it. But I wonder why you spend all this effort to convince yourself/the NG of something that is false, when just a portion of all that effort could have just been spent in upgrading std.recls? I have to suspect something more is at work? Is it that it is C/C++? Is it the download size? What? Really, I'm keen to know, as then I can stop working hard to present facts, and be content to shut up in the knowledge that no facts will be adequate.


Matthew

P.S. I recall a certain debate about openrj, in which you ended up having to convince me that I could not legitimately claim performance supremacy of the C version because I am an experienced (I think you said "expert") C programmer, and that writing efficient code is something that's largely unconscious. I look forward to your logical leaps in this case.




February 12, 2006
Matthew wrote:
> <SNIP>
> I would again, as I
> have many times over the years in which I've been involved with D, observe
> that being a good compiler designer/writer doesn't make you a good library
> designer/writer, and being a good library designer/writer doesn't make you a
> good application designer/writer, and so on and so forth. What D needs is a
> combination of talents in all such areas. Since there is hardly an overall
> lack of all such in this NG, perhaps we might consider organising for the
> common purpose of making D good and ready, rather than attempting to shoot
> each other's feet off?
> 
> Matthew
> 

Matthew, your argument is very persuasive. There does seem to be a certain amount of shooting at each other's feet, and it does get in the way of the goal - a successful D 1.0. It would really help to have some sort of a community structure organized at a common goal.
February 13, 2006
"Walter Bright" <newshound@digitalmars.com> wrote in message news:dsogst$2u8b$1@digitaldaemon.com...
>> There are 13 patterns searched for, so 13 trips through listdir. Taking
> your
>> numbers, the D version takes roughly 6 times longer to run than the Ruby one. I adapted it slightly to my system so I could run it, and set it up
> for
>> 8 patterns:


Where does this 6 come from? (Ignoring the first times for each to give a fair comparison) the D version takes 31 times longer than the Ruby version!

      80359 2540203
      83656 2733218
      88062 2630031
      86809 2864390

      84721.5 2691961 31.77423


>> Matthew's: 2.67
>> Walter's: 0.36
>>
>> Interestingly, that shows a 7.4 times speedup, but 8 patterns. This
> implies
>> that the D version might be perhaps twice as fast as the Ruby one on your system. (I don't have Ruby on my system.) This post is long enough, I'll provide more design commentary in another post.
>

So with your speed up of 8, it's still 4 times slower.

Reason: Design is bad.
Solution: Ditch listdir(), or remove its recursive abilities.


February 13, 2006
Forgive me, but from reading the samples in this argument, I see:

Matthew says the standard library should do as much of the task as conceivably possible, within reasonable bounds.  He says D's Phobos does not do this.

Walter says that the standard library should be simple, and support normal options with the option of adding code on the user-side for complex tasks.

Anyway, if I were to take a very large array, and run through it 8 times - each time pulling out just 1/16th of the array (half of what I want), I would clearly be a moron.  If, on the other hand, I went through it once and pulled out the 8/16ths (half) of the array all in one blow, it would seem I regained my senses.

This is not rocket science, and I fail to see why it was necessary to develop a testcase to show this.  Or why anyone might be surprised that doing it the better way would take a small fraction of the time. Obviously, really.

But, yes, the question is in the implementation.  And from my description above, if it is so correct... this is more of a philosophical question than one about performance.

This just seems like an awful amount of commotion for the relatively uncommon task of building a full-tree list of all files matching some extensions.  I can count the number of programs I'd consider useful that do that on one hand (and grep, which does not do this, is the first one.)  What a great measure of a library's design!

I wonder which better supports VESA.  That used to be big, didn't it?

-[Unknown]
February 13, 2006
"Unknown W. Brackets" <unknown@simplemachines.org> wrote in message news:dsokf0$3d$1@digitaldaemon.com...
> Forgive me, but from reading the samples in this argument, I see:
>
> Matthew says the standard library should do as much of the task as conceivably possible, within reasonable bounds.  He says D's Phobos does not do this.

I don't think that's an accurate representation of my position. In many cases I like a minimalist approach.

The point here is that the underlying operations are I/O bound, so any decent library needs to minimise the amount of I/O. (And avoid non-linear user-mode behaviour as well, of course.)

> Walter says that the standard library should be simple, and support normal options with the option of adding code on the user-side for complex tasks.

Great. In which case listdir() shouldn't be recursive. There's a good reason why no operating systems provide such recursive facilities, as the designer(s) of listdir() have failed to recognise. (Note: UNIX does have glob(), which is effectively what listdir() is trying to be, and that has the same caveats, but that's not a system call, and it's got a different, and generally more limited, kind of recursion.)

> Anyway, if I were to take a very large array, and run through it 8 times - each time pulling out just 1/16th of the array (half of what I want), I would clearly be a moron.  If, on the other hand, I went through it once and pulled out the 8/16ths (half) of the array all in one blow, it would seem I regained my senses.
>
> This is not rocket science, and I fail to see why it was necessary to develop a testcase to show this.  Or why anyone might be surprised that doing it the better way would take a small fraction of the time. Obviously, really.

True, but there's another flaw in the design. By combining recursion and en bloc results it can lead to huge latencies and memory consumption, suffering from non-linear behaviour. (Perhaps I didn't stress that aspect enough in my previous rants. The two aspects are not commensurate, and if listdir() is preserved then it needs to stop being recursive.)

> But, yes, the question is in the implementation.  And from my description above, if it is so correct... this is more of a philosophical question than one about performance.

Well, the question is in the design. If that tallies with your point, then ok. If not, then I disagree.

> This just seems like an awful amount of commotion for the relatively uncommon task of building a full-tree list of all files matching some extensions.  I can count the number of programs I'd consider useful that do that on one hand (and grep, which does not do this, is the first one.) What a great measure of a library's design!

It's horses for courses, surely? I do such things many many times a day. You do it rarely. Both are valid. A standard library needs general facilities, and all of those facilities should be well designed and implemented.

My point is that this case is but one example of bad design in the Phobos libraries. Yet Walter will refuse to acknowledge this until our bones are dry and dusty. And there's a library already in Phobos that does such things well. Yet Walter refuses to update it.

I cannot fathom either. To me, the three words "I am wrong" are perhaps the three most useful in the lexicon. So very liberating. So engendering of mutual respect, since if you admit your flaws to me, I can admit mine to you, and we each have much smaller and more manageable facades to project.

Broadly, I want to know why Phobos appears little more than a fetid stagnant heap. Nobody likes it. I'm sure contributors would like to be able to revise/remove/rework their contributions. (apart from std.recls, I'd like to junk / rewrite all mine.) But Walter neither appears to do much about it, nor will he cede its control to people who might be more expert in library design and/or have more time to work on it. (I am not meaning me, here, in case anyone thinks this is a sideways auto-backslap.)

>
> I wonder which better supports VESA.  That used to be big, didn't it?

No idea. It impacts not one whit on my programming life.


February 13, 2006
Now settle down, kiddies. You're only getting yourself all upset.

All we have gained from this unsightly tantrum is that listdir is not optimised and recls is optimised.

I have a gut feeling that optimising listdir is not going to be all that hard to do. I even had a 10-minute go at doing that and got a huge performance improvement.

BTW, this might just be a documentation 'bug', but the docs for DirEntry say that ...

  char[] name;
     file or directory name

but it is actually the name *plus* full path to the name. I tripped up on this when using the pattern "test.*" and it failed to find any files. It seems I needed to use the pattern "*\test.*" to find JUST files that started with 'test'.

For somebody's amusement I've included an expanded and better performing
listdir() (though I'm sure it can be improved and also its not a comparison
to either recls or Ruby)...

//------------------------
private
{
    import std.string;
    import std.path;
    import std.file;
}

char[][] filelist(char[] pPathname, char[] pPatterns, bool pRecurse = true)
{   char[][] lResults;
    char[][] lPatterns;
    int lSpot;

    // Pre-assign a large buffer for finds.
    lResults.length = 1000;

    bool callback(DirEntry* de)
    {
    	if (de.isdir)
    	{
    	    // sometimes I don't want recursion :-)
    	    if (pRecurse)
    	        listdir(de.name, &callback);
	    }
    	else
    	{
            bool lUsePath; // true: include path names else just file names
            bool lExtOnly;  // true: the patterns are file extentions only

    	    foreach(char[] lThisPattern; lPatterns)
    	    {
     	        char[] lBase;
     	        // First check for 'special' pPatterns qualifiers
        	    if (lThisPattern == "_P")
        	    {
            	    lUsePath = true;
        	    }
        	    else if (lThisPattern == "!P")
        	    {
            	    lUsePath = false;
        	    }
        	    else if (lThisPattern == "_E")
        	    {
            	    lExtOnly = true;
        	    }
        	    else if (lThisPattern == "!E")
        	    {
            	    lExtOnly = false;
        	    }
        	    else
        	    {
            	    // Are we testing the full name or just the file name?
            	    if (lUsePath)
            	    {
        	            lBase = de.name;
            	    }
            	    else
            	    {
            	        lBase = std.path.getBaseName(de.name);
        	        }

        	        // Is the current pattern only for extentions?
        	        if (lExtOnly)
        	            lThisPattern = "*." ~ lThisPattern;

        	        if (std.path.fnmatch(lBase, lThisPattern))
        	        {
        		        // Expand buffer if needed
        		        if (lSpot >= lResults.length)
        		            lResults.length = lResults.length + 1000;

        		        // capture the name
        		        lResults[lSpot] = de.name;
        		        lSpot++;
        		        // Stop looking at patterns; a file is use only once.
        		        break;
        		    }
                }
    		}
    	}
    	return true; // continue
    }

    // Cater for multiple patterns delimited by a semi-colon.
    lPatterns = std.string.split(pPatterns, ";");

    // Start searching.
    listdir(pPathname, &callback);

    // trim the buffer back to size before returning.
    lResults.length = lSpot;
    return lResults;
}
//-----------------

The 'special' pattern qualifier "_E" makes it easier to type a set of extents.

  _E;d;exe;map;obj;h;c;cpp;tst;tmp

And by default, only file names are examined, but if you use "_P" pattern qualifier then path names are also examined.

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocracy!"
13/02/2006 11:25:20 AM
February 13, 2006
I have a simple question, for Matthew largely.

Lets assume the latest version of recls is packaged with the latest version of the D compiler.

I download the latest D compiler, extract all the files and am at the stage where I can write a D program and compile with "dmd <file>" at command prompt.

I want to write a program which uses recls, what further steps do I have to take in order to do so?

I ask because I just tried to use the one which is currently part of the D compiler distribution and failed to get anything working, missing symbols. I suspect I need to build a recls library. I found some makefiles but I failed to get those to compile even after modifying them to find the DMC tools dmc, lib, link, etc as opposed to the microsoft ones I have install with VC6.0. FYI the error reported a duplicate symbol in my Microsoft headers. I wonder if it is including the wrong files or if I have old files.

Either way, that is not the point of this post, rather, I simply want to know what a user has to do in an ideal situation in order to use the latest recls with D. I suspect the C library is a requirement that is not going to vanish in the future?

If so, I would prefer not to have that requirement. In other words I would prefer the solution to be a native D/phobos one that does not rely on external libraries.

I am not making an argument _for_ "listdir" in it's current form. I am making an argument for a solution that does not use an external library, unless (there is always one of these) that solution has a clear advantage that the "non-external lib" solution cannot match.

Of course if I am wrong about the external library requirement, lemme know and we'll ignore this point :)

Regan

On Mon, 13 Feb 2006 11:55:41 +1100, Matthew <nowhere@noaddress.co.us> wrote:
> "Unknown W. Brackets" <unknown@simplemachines.org> wrote in message
> news:dsokf0$3d$1@digitaldaemon.com...
>> Forgive me, but from reading the samples in this argument, I see:
>>
>> Matthew says the standard library should do as much of the task as
>> conceivably possible, within reasonable bounds.  He says D's Phobos does
>> not do this.
>
> I don't think that's an accurate representation of my position. In many
> cases I like a minimalist approach.
>
> The point here is that the underlying operations are I/O bound, so any
> decent library needs to minimise the amount of I/O. (And avoid non-linear
> user-mode behaviour as well, of course.)
>
>> Walter says that the standard library should be simple, and support normal
>> options with the option of adding code on the user-side for complex tasks.
>
> Great. In which case listdir() shouldn't be recursive. There's a good reason
> why no operating systems provide such recursive facilities, as the
> designer(s) of listdir() have failed to recognise. (Note: UNIX does have
> glob(), which is effectively what listdir() is trying to be, and that has
> the same caveats, but that's not a system call, and it's got a different,
> and generally more limited, kind of recursion.)
>
>> Anyway, if I were to take a very large array, and run through it 8 times -
>> each time pulling out just 1/16th of the array (half of what I want), I
>> would clearly be a moron.  If, on the other hand, I went through it once
>> and pulled out the 8/16ths (half) of the array all in one blow, it would
>> seem I regained my senses.
>>
>> This is not rocket science, and I fail to see why it was necessary to
>> develop a testcase to show this.  Or why anyone might be surprised that
>> doing it the better way would take a small fraction of the time.
>> Obviously, really.
>
> True, but there's another flaw in the design. By combining recursion and en
> bloc results it can lead to huge latencies and memory consumption, suffering
> from non-linear behaviour. (Perhaps I didn't stress that aspect enough in my
> previous rants. The two aspects are not commensurate, and if listdir() is
> preserved then it needs to stop being recursive.)
>
>> But, yes, the question is in the implementation.  And from my description
>> above, if it is so correct... this is more of a philosophical question
>> than one about performance.
>
> Well, the question is in the design. If that tallies with your point, then
> ok. If not, then I disagree.
>
>> This just seems like an awful amount of commotion for the relatively
>> uncommon task of building a full-tree list of all files matching some
>> extensions.  I can count the number of programs I'd consider useful that
>> do that on one hand (and grep, which does not do this, is the first one.)
>> What a great measure of a library's design!
>
> It's horses for courses, surely? I do such things many many times a day. You
> do it rarely. Both are valid. A standard library needs general facilities,
> and all of those facilities should be well designed and implemented.
>
> My point is that this case is but one example of bad design in the Phobos
> libraries. Yet Walter will refuse to acknowledge this until our bones are
> dry and dusty. And there's a library already in Phobos that does such things
> well. Yet Walter refuses to update it.
>
> I cannot fathom either. To me, the three words "I am wrong" are perhaps the
> three most useful in the lexicon. So very liberating. So engendering of
> mutual respect, since if you admit your flaws to me, I can admit mine to
> you, and we each have much smaller and more manageable facades to project.
>
> Broadly, I want to know why Phobos appears little more than a fetid stagnant
> heap. Nobody likes it. I'm sure contributors would like to be able to
> revise/remove/rework their contributions. (apart from std.recls, I'd like to
> junk / rewrite all mine.) But Walter neither appears to do much about it,
> nor will he cede its control to people who might be more expert in library
> design and/or have more time to work on it. (I am not meaning me, here, in
> case anyone thinks this is a sideways auto-backslap.)
>
>>
>> I wonder which better supports VESA.  That used to be big, didn't it?
>
> No idea. It impacts not one whit on my programming life.
>
>