August 08, 2010
Jonathan M Davis wrote:

> void removeTabs(int tabSize, string fileName)
> {
>     auto file = File(fileName);
>     string[] output;
> 
>     foreach(line; file.byLine())
>     {
>         int lastTab = 0;
> 
>         while(lastTab != -1)
>         {
>             const int tab = line.indexOf('\t');
> 
>             if(tab == -1)
>                 break;
> 
>             const int numSpaces = tabSize - tab % tabSize;
> 
>             line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1
>             .. $];
> 
>             lastTab = tab + numSpaces;
>         }
> 
>         output ~= line.idup;
>     }
> 
>     std.file.write(fileName, output.join("\n"));
> }

Actually, looking at the code again, that while loop really should be while(1) rather than while(lastTab != -1), but it will work the same regardless.

- Jonathan M Davis

August 08, 2010
On 08/07/2010 11:04 PM, Jonathan M Davis wrote:
> On Friday 06 August 2010 18:50:52 Andrei Alexandrescu wrote:
>> A good exercise would be rewriting these tools in idiomatic D2 and
>> assessing the differences.
>>
>>
>> Andrei
>
> I didn't try and worry about multiline string literals, but here are my more
> idiomatic solutions:
>
>
>
> detab:
>
> /* Replace tabs with spaces, and remove trailing whitespace from lines.
>    */
>
> import std.conv;
> import std.file;
> import std.stdio;
> import std.string;
>
> void main(string[] args)
> {
>      const int tabSize = to!int(args[1]);
>      foreach(f; args[2 .. $])
>          removeTabs(tabSize, f);
> }
>
>
> void removeTabs(int tabSize, string fileName)
> {
>      auto file = File(fileName);
>      string[] output;
>
>      foreach(line; file.byLine())
>      {
>          int lastTab = 0;
>
>          while(lastTab != -1)
>          {
>              const int tab = line.indexOf('\t');
>
>              if(tab == -1)
>                  break;
>
>              const int numSpaces = tabSize - tab % tabSize;
>
>              line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];
>
>              lastTab = tab + numSpaces;
>          }
>
>          output ~= line.idup;
>      }
>
>      std.file.write(fileName, output.join("\n"));
> }

Very nice. Here's how I'd improve removeTabs:

#!/home/andrei/bin/rdmd
import std.conv;
import std.file;
import std.getopt;
import std.stdio;
import std.string;

void main(string[] args)
{
    uint tabSize = 8;
    getopt(args, "tabsize|t", &tabSize);
    foreach(f; args[1 .. $])
        removeTabs(tabSize, f);
}

void removeTabs(int tabSize, string fileName)
{
    auto file = File(fileName);
    string output;
    bool changed;

    foreach(line; file.byLine(File.KeepTerminator.yes))
    {
        int lastTab = 0;

        while(lastTab != -1)
        {
            const tab = line.indexOf('\t');
            if(tab == -1)
                break;
            const numSpaces = tabSize - tab % tabSize;
            line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];
            lastTab = tab + numSpaces;
            changed = true;
        }

        output ~= line;
    }

    file.close();
    if (changed)
        std.file.write(fileName, output);
}

> -------------------------------------------
>
> The three differences between mine and Walter's are that mine takes the tab size
> as the first argument, it doesn't put a newline at the end of the file, and it
> writes the file even if nothing changed (you could test for that, but when using
> byLine(), it's a bit harder). Interestingly enough, from the few tests that I
> ran, mine seems to be somewhat faster. I also happen to think that the code is
> clearer (it's certainly shorter), though that might be up for debate.
>
> -------------------------------------------
>
>
>
> tolf:
>
> /* Replace line endings with LF
>    */
>
> import std.file;
> import std.string;
>
> void main(string[] args)
> {
>      foreach(f; args[1 .. $])
>          fixEndLines(f);
> }
>
> void fixEndLines(string fileName)
> {
>      auto fileStr = std.file.readText(fileName);
>      auto result = fileStr.replace("\r\n", "\n").replace("\r", "\n");
>
>      std.file.write(fileName, result);
> }
>
> -------------------------------------------
>
> This version is ludicrously simple. And it was also faster than Walter's in the
> few tests that I ran. In either case, I think that it is definitely clearer code.

Very nice! You may as well guard the write with an if (result != fileStr). With source control etc. in the mix, it's always polite not to touch files unless you are actually modifying them.
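
For instance, a minimal sketch of the guarded version, reusing the fixEndLines code above:

import std.file;
import std.string;

void fixEndLines(string fileName)
{
    auto fileStr = std.file.readText(fileName);
    auto result = fileStr.replace("\r\n", "\n").replace("\r", "\n");

    // Only touch the file if something actually changed.
    if (result != fileStr)
        std.file.write(fileName, result);
}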

This makes me think we should have a range that detects and replaces patterns lazily and on the fly. I've always thought that loading entire files in memory and working on them is "cheating" in some sense, and a range would help with replacing patterns in streams.
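
As a strawman, a lazily-substituting range could look something like the sketch below. It only handles single-element patterns (a real pattern-replacing range would match whole subsequences such as "\r\n"), and the substitute helper is a made-up name, not an existing Phobos function:

import std.array;
import std.range;
import std.stdio;

// Yields the elements of r, replacing every element equal to `from` with `to`,
// lazily and without allocating a copy of the data.
struct Substitute(R)
{
    R r;
    ElementType!R from, to;

    @property bool empty() { return r.empty; }
    @property auto front() { return r.front == from ? to : r.front; }
    void popFront() { r.popFront(); }
}

auto substitute(R, E)(R r, E from, E to)
{
    return Substitute!R(r, from, to);
}

void main()
{
    // Replaces CR with LF on the fly, one element at a time.
    foreach (c; substitute("a\rb\rc", '\r', '\n'))
        write(c);
    writeln();
}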

> I would have thought that being more idiomatic would have resulted in slower code
> than what Walter did, but interestingly enough, both programs are faster with my
> code. They might take more memory though. I'm not quite sure how to check that.
> In any case, you wanted some idiomatic D2 solutions, so there you go.

Looking very good, thanks. I think we should feature these and a few others as examples on the website.


Andrei
August 08, 2010
On 08/07/2010 11:16 PM, bearophile wrote:
> Jonathan M Davis:
>> I would have thought that being more idiomatic would have resulted
>> in slower code than what Walter did, but interestingly enough, both
>> programs are faster with my code. They might take more memory
>> though. I'm not quite sure how to check that. In any case, you
>> wanted some idiomatic D2 solutions, so there you go.
>
> Your code looks better.
>
> My (probably controversial) opinion on this is that the idiomatic D
> solution for those text "scripts" is to use a scripting language, such as
> Python :-)
>
> In this case a Python version is more readable, shorter and probably
> faster too, because reading the lines of a _normal_ text file is
> faster in Python compared to D (Python is more optimized for
> such purposes; I can show benchmarks on request).
>
> On the other hand D2 is in its debugging phase, so it's good to use
> it even for purposes it's not the best language for, to catch bugs or
> performance bugs. So I think it's positive to write such scripts in
> D2, even if in a real-world setting I want to use Python to write
> them.

I think it's worth targeting D2 at tasks that are usually handled by scripting languages. I've done a lot of that, and it beats the hell out of rewriting in D a script that's grown out of control.

Andrei
August 08, 2010
On Saturday 07 August 2010 21:59:50 Andrei Alexandrescu wrote:
> Very nice. Here's how I'd improve removeTabs:
> 
> #!/home/andrei/bin/rdmd
> import std.conv;
> import std.file;
> import std.getopt;
> import std.stdio;
> import std.string;
> 
> void main(string[] args)
> {
>      uint tabSize = 8;
>      getopt(args, "tabsize|t", &tabSize);
>      foreach(f; args[1 .. $])
>          removeTabs(tabSize, f);
> }
> 
> void removeTabs(int tabSize, string fileName)
> {
>      auto file = File(fileName);
>      string output;
>      bool changed;
> 
>      foreach(line; file.byLine(File.KeepTerminator.yes))
>      {
>          int lastTab = 0;
> 
>          while(lastTab != -1)
>          {
>              const tab = line.indexOf('\t');
>              if(tab == -1)
>                  break;
>              const numSpaces = tabSize - tab % tabSize;
>              line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab +
> 1 .. $];
>              lastTab = tab + numSpaces;
>              changed = true;
>          }
> 
>          output ~= line;
>      }
> 
>      file.close();
>      if (changed)
>          std.file.write(fileName, output);
> }

Ah. I needed to close the file. I pretty much always just use readText(), so I didn't catch that. Also, it does look like detecting whether the file changed was a bit simpler than I thought that it would be. Quite simple really. Thanks.

> Very nice! You may as well guard the write with an if (result != fileStr). With source control etc. in the mix, it's always polite not to touch files unless you are actually modifying them.

Yes. That would be good. It's the kind of thing that I forget - probably because most of the code that I write generates new files rather than updating pre-existing ones.

> 
> This makes me think we should have a range that detects and replaces patterns lazily and on the fly. I've always thought that loading entire files in memory and working on them is "cheating" in some sense, and a range would help with replacing patterns in streams.

It would certainly be nice to have a way to reasonably process files with ranges without having to load the whole thing into memory at once. Most of the time, I wouldn't care too much, but if you start processing large files, having the whole thing in memory could be a problem (especially if you have multiple versions of it which were created along the way as you were manipulating it). Haskell does lazy loading of files by default and doesn't load the data until you read the appropriate part of the string. It shouldn't be all that hard to do something similar with D and ranges.

The hard part would be doing all of it in a way that doesn't force all of the processing of the file's data to load it into memory (let alone load it multiple times). I'm not sure that you could do that without explicitly processing the file line by line, writing it to disk after each line is processed, since you could be doing an arbitrary set of operations on the data. It could be interesting to try and find a solution for that though.
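
For instance, a line-by-line version of removeTabs along these lines would keep only one line in memory at a time (just a sketch; the temporary file and the rename at the end are my assumption about how the output would get written safely):

import std.file;    // rename
import std.stdio;   // File, byLine
import std.string;  // indexOf, repeat

void removeTabsStreaming(int tabSize, string fileName)
{
    auto input = File(fileName);
    auto tmpName = fileName ~ ".tmp";   // temporary name is made up
    auto output = File(tmpName, "w");

    foreach (line; input.byLine(File.KeepTerminator.yes))
    {
        for (;;)
        {
            const tab = line.indexOf('\t');
            if (tab == -1)
                break;
            const numSpaces = tabSize - tab % tabSize;
            line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];
        }
        output.write(line);   // each line is written out immediately
    }

    input.close();
    output.close();
    std.file.rename(tmpName, fileName);
}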

> 
> Looking very good, thanks. I think we should have a feature these and a few others as examples on the website.

Well, I, for one, much prefer the ability to program in a manner that's closer to telling the computer to do what I want rather than having to tell it how to do what I want (the replace end-of-line character program being a prime example). It makes life much simpler. Ranges certainly help a lot in that regard too. And having good example code of how to program that way could help encourage people to do so and to use std.range and std.algorithm and their ilk rather than more low-level solutions which aren't as easy to understand.

- Jonathan M Davis
August 08, 2010
Jonathan M Davis wrote:
> It would certainly be nice to have a way to reasonably process files with ranges without having to load the whole thing into memory at once.

Thanks to asynchronous I/O, being able to start processing and writing the new file before the old one has finished being read should speed things up.
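
A rough sketch of that idea, using std.concurrency threads rather than true asynchronous I/O (the file names and the null end-of-stream marker are made up for the example):

import std.concurrency;
import std.stdio;

// The writer thread: writes lines out as soon as they arrive.
void writer(string outName)
{
    auto outFile = File(outName, "w");
    for (;;)
    {
        auto line = receiveOnly!string();
        if (line is null)             // a null string marks end of stream
            break;
        outFile.write(line);
    }
}

void main()
{
    auto tid = spawn(&writer, "output.txt");

    // Lines are handed off as soon as they are read, so writing the new
    // file overlaps with reading the old one.
    foreach (line; File("input.txt").byLine(File.KeepTerminator.yes))
        send(tid, line.idup);

    send(tid, cast(string) null);     // signal end of input
}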
August 08, 2010
On 08/07/2010 11:16 PM, bearophile wrote:
> In this case a Python version is more readable, shorter and probably
> faster too, because reading the lines of a _normal_ text file is
> faster in Python compared to D (Python is more optimized for
> such purposes; I can show benchmarks on request).

That would be great so we can tune our approach. Thanks!

Andrei
August 08, 2010
I usually do the same thing with a shell pipe
	expand | sed 's/ *$//;s/\r$//;s/\r/\n/'


On 07/08/10 02:34, Walter Bright wrote:
> I wrote these two trivial utilities for the purpose of canonicalizing
> source code before checkins and to deal with FreeBSD's inability to deal
> with CRLF line endings, and because I can never figure out the right
> settings for git to make it do the canonicalization.
>
> tolf - converts LF, CR, and CRLF line endings to LF.
>
> detab - converts all tabs to the correct number of spaces. Assumes tabs
> are 8 column tabs. Removes trailing whitespace from lines.
>
> Posted here just in case someone wonders what they are.
> ---------------------------------------------------------
> /* Replace tabs with spaces, and remove trailing whitespace from lines.
> */
>
> import std.file;
> import std.path;
>
> int main(string[] args)
> {
>     foreach (f; args[1 .. $])
>     {
>         auto input = cast(char[]) std.file.read(f);
>         auto output = filter(input);
>         if (output != input)
>             std.file.write(f, output);
>     }
>     return 0;
> }
>
>
> char[] filter(char[] input)
> {
>     char[] output;
>     size_t j;
>
>     int column;
>     for (size_t i = 0; i < input.length; i++)
>     {
>         auto c = input[i];
>
>         switch (c)
>         {
>             case '\t':
>                 while ((column & 7) != 7)
>                 {
>                     output ~= ' ';
>                     j++;
>                     column++;
>                 }
>                 c = ' ';
>                 column++;
>                 break;
>
>             case '\r':
>             case '\n':
>                 while (j && output[j - 1] == ' ')
>                     j--;
>                 output = output[0 .. j];
>                 column = 0;
>                 break;
>
>             default:
>                 column++;
>                 break;
>         }
>         output ~= c;
>         j++;
>     }
>     while (j && output[j - 1] == ' ')
>         j--;
>     return output[0 .. j];
> }
> -----------------------------------------------------
> /* Replace line endings with LF
> */
>
> import std.file;
> import std.path;
>
> int main(string[] args)
> {
>     foreach (f; args[1 .. $])
>     {
>         auto input = cast(char[]) std.file.read(f);
>         auto output = filter(input);
>         if (output != input)
>             std.file.write(f, output);
>     }
>     return 0;
> }
>
>
> char[] filter(char[] input)
> {
>     char[] output;
>     size_t j;
>
>     for (size_t i = 0; i < input.length; i++)
>     {
>         auto c = input[i];
>
>         switch (c)
>         {
>             case '\r':
>                 c = '\n';
>                 break;
>
>             case '\n':
>                 if (i && input[i - 1] == '\r')
>                     continue;
>                 break;
>
>             case 0:
>                 continue;
>
>             default:
>                 break;
>         }
>         output ~= c;
>         j++;
>     }
>     return output[0 .. j];
> }
> ------------------------------------------

August 08, 2010
Norbert Nemec wrote:
> I usually do the same thing with a shell pipe
>     expand | sed 's/ *$//;s/\r$//;s/\r/\n/'

<g>
August 08, 2010
Andrei Alexandrescu:
> This makes me think we should have a range that detects and replaces patterns lazily and on the fly.

In Python there is a helper module: http://docs.python.org/library/fileinput.html


> I think it's worth targeting D2 at tasks that are usually handled by scripting languages. I've done a lot of that, and it beats the hell out of rewriting in D a script that's grown out of control.

Dynamic languages are handy but they require some rigour when you program. Python is probably unfit for writing million-line programs, but if you train yourself a little and keep your code clean, you usually become able to write clean largish programs in Python.


> That would be great so we can tune our approach. Thanks!

In my dlibs I have the xio module, which reads lines efficiently; it was faster than iterating on the lines of BufferedFile.
There are tons of different benchmarks you could use, but it's better to start with a simple one that just iterates over the file's lines. See below.

Related: experiments have shown that the (oldish) Java GC improves its performance if it is able to keep strings (which are immutable) in a separate memory pool _and_ to recognize duplicated strings, of course keeping only one string for each equality set. It would be worthwhile to do a similar experiment with the D GC, but first you need applications that use the GC, to test whether this idea is an improvement :-)
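
To illustrate the idea at the application level, here is a sketch of user-level string interning with an associative array (a GC-level version would do this transparently; the file name comes from the command line):

import std.stdio;

// A user-level interning table: equal strings end up sharing a single copy.
string intern(string s, ref string[string] pool)
{
    if (auto existing = s in pool)
        return *existing;   // already seen: reuse the stored copy
    pool[s] = s;            // first occurrence: remember this copy
    return s;
}

void main(string[] args)
{
    string[string] pool;
    size_t total;
    foreach (rawLine; File(args[1]).byLine())
        total += intern(rawLine.idup, pool).length;
    writeln("Total: ", total, " (", pool.length, " distinct lines)");
}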


So I have used a minimal benchmark:

--------------------------

# Python 2.7 code
from sys import argv

def process(file_name):
    total = 0
    for line in open(file_name):
        total += len(line)
    return total

print "Total:", process(argv[1])

--------------------------

// D2 code
import std.stdio: File, writeln;

int process(string fileName) {
    int total = 0;

    auto file = File(fileName);
    foreach (rawLine; file.byLine()) {
        string line = rawLine.idup;
        total += line.length;
    }
    file.close();

    return total;
}

void main(string[] args) {
    if (args.length == 2)
        writeln("Total: ", process(args[1]));
}

--------------------------

In the D code I have added an idup to make the comparison fairer, because in the Python code the "line" is a truly newly allocated string, so you can safely use it as a dictionary key.

I have used Python 2.7 without the Psyco JIT (http://psyco.sourceforge.net/ ) to speed up the Python code, because it's not yet available for Python 2.7. The D code was compiled with dmd 2.047, optimized build.

As test text data I have used a concatenation of all text files here (they are copyrighted, but freely usable): http://gnosis.cx/TPiP/

The result on Windows is a file of 1_116_552 bytes.

I have appended the file to itself several times, doubling its length each time; the result is a file of 71_459_328 bytes (this is not a fully realistic case, because you often have many small files to read instead of a single very large one).

The timings are taken with a warm disk cache, so the file is essentially read from RAM. This is not fully realistic, but if you want to write a benchmark you have to do this, because on Windows it's very hard for me to make sure that the disk cache is fully empty. So it's better to do the opposite and benchmark a warm file.


The output of the Python code is:
Total: 69789888
Found in 0.88 seconds (best of 6, the variance is minimal).


The output of the D code is:
Total: 69789888
Found in 1.28 seconds (best of 6, minimal variance).


If in the D2 code I comment out the idup like this:

    foreach (rawLine; file.byLine()) {
        total += rawLine.length;
    }

The output of the D code without idup is:
Total: 69789888
Found in 0.75 seconds (best of 6, minimal variance).

As you can see, it's a matter of GC efficiency too.

Besides the GC, the higher performance of the Python code comes from a tuned design; you can see the function getline_via_fgets here:
http://svn.python.org/view/python/trunk/Objects/fileobject.c?revision=81275&view=markup
It uses a "stack buffer" (char buf[MAXBUFSIZE]; where MAXBUFSIZE is 300) too.
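
A rough D analogue of that fixed stack-buffer trick, going through C's fgets via the File handle (just a sketch; lines longer than the buffer arrive in several pieces, and a real implementation would fall back to a growable buffer):

import core.stdc.stdio : fgets;
import core.stdc.string : strlen;
import std.stdio;

// Reads lines through fgets into a fixed-size stack buffer, like CPython does.
size_t sumLineLengths(string fileName)
{
    auto file = File(fileName);
    char[300] buf;                      // same small buffer size Python uses
    size_t total;
    while (fgets(buf.ptr, cast(int) buf.length, file.getFP()) !is null)
        total += strlen(buf.ptr);
    return total;
}

void main(string[] args)
{
    writeln("Total: ", sumLineLengths(args[1]));
}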

Bye,
bearophile
August 08, 2010
== Quote from bearophile (bearophileHUGS@lycos.com)'s article
> Jonathan M Davis:
> > I would have thought that being more idiomatic would have resulted in slower code than what Walter did, but interestingly enough, both programs are faster with my code. They might take more memory though. I'm not quite sure how to check that. In any case, you wanted some idiomatic D2 solutions, so there you go.
> Your code looks better.
> My (probably controversial) opinion on this is that the idiomatic D solution for
> those text "scripts" is to use a scripting language, such as Python :-)
> In this case a Python version is more readable, shorter and probably faster too,
> because reading the lines of a _normal_ text file is faster in Python compared to
> D (Python is more optimized for such purposes; I can show benchmarks on request).
> On the other hand D2 is in its debugging phase, so it's good to use it even for
> purposes it's not the best language for, to catch bugs or performance bugs. So I
> think it's positive to write such scripts in D2, even if in a real-world setting
> I want to use Python to write them.
> Bye,
> bearophile

I disagree completely.  D is clearly designed from the "simple things should be simple and complicated things should be possible" point of view.  If it doesn't work well for these kinds of short scripts then we've failed at making simple things simple and we're just like every other crappy "large scale, industrial strength" language like Java and C++ that's great for megaprojects but makes simple things complicated.

That said, I think D does a great job in this regard.  I actually use Python as my language of second choice for things D isn't good at.  Mostly this means needing Python's huge standard library, needing 64-bit support, or needing to share my code with people who don't know D.  Needing to write a very short script tends not to be a reason for me to switch over.  It's not that rare for me to start with a short script and then end up adding something that needs performance to it (like Monte Carlo simulation of a null probability distribution), and I don't find D substantially harder to use for these cases.