Thread overview
  #line decoder: BCS, Sep 25, 2008
  Text editing [Was: Re: #line decoder]: bearophile, Sep 25, 2008
  Sergey Gromov, Sep 25, 2008
  bearophile, Sep 25, 2008
  Sergey Gromov, Sep 27, 2008
  Sergey Gromov, Sep 27, 2008
  bearophile, Sep 27, 2008
  Sergey Gromov, Sep 27, 2008
  bearophile, Sep 28, 2008
  Sergey Gromov, Sep 28, 2008
  bearophile, Sep 28, 2008
  Denis Koroskin, Sep 29, 2008
  Sergey Gromov, Sep 27, 2008
  Christopher Wright, Sep 26, 2008
September 25, 2008 (BCS)
I’m doing a pile of code generation and got tired of manually translating the effects of #line directives, so I just whipped up a little tool to do it for me.

You give it the input file plus the “filename” and line number you are looking for, and it either spits out that line or the actual line number it’s on, depending on what you ask for.

source: http://www.dsource.org/projects/scrapple/browser/trunk/lines/lines.d
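
For example, given a generated file like this (the file and symbol names here are made up, just to show the mapping):

// gen.d, produced by some code generator
#line 1 "tmpl.d"
int foo() {
    return bar();   // suppose the compiler complains about this line
}
#line 10 "other.tmpl"
int baz() { return 0; }

the compiler reports the error against tmpl.d line 2, and asking the tool for "tmpl.d" line 2 gives back either the text of that line or its physical line number in gen.d (4), whichever you ask for.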


September 25, 2008 (bearophile)
BCS Wrote:
> so I just whipped up a little tool to do it for me.

I think that's a job better suited to a script, for example in Python (or Ruby or Perl).

-----------------

But trying to write such little text-oriented programs in D is useful anyway, because it helps spot some rough edges in D/Phobos that still deserve improvement.

For example, this is "data.txt":

jan        sun     12
feb        mon     14
mar        fri     23
aug        sat     3
jun        tue     15

The purpose is to read those 3 columns into separate arrays, so the output is the transposed data:

[["jan", "feb", "mar", "aug", "jun"],
 ["sun", "mon", "fri", "sat", "tue"],
 ["12", "14", "23", "3", "15"]]

With Python you can do it this way (this produces immutable sub-arrays); this is loader1:

r1 = zip(*(l.split() for l in file("data2.txt")))

If you duplicate data.txt many times to get a bigger file (for example a 44 MB text file) like this:

>>> t = file("data.txt").read()
>>> t2 = t * 400000
>>> len(t2)
43600000
>>> file("data2.txt", "w").write(t2)

you find that the transposing is far too slow. This is my faster Python+Psyco version:

# loader2
from collections import deque
def main():
    fin = file("data2.txt")
    cols = [deque([el]) for el in fin.readline().split()]
    for line in fin:
        col = 0
        for el in line.split():
            cols[col].append(el)
            col += 1
import psyco; psyco.full()
main()

My first D version:

// loader3
import std.stream;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
}

Disabling the GC doesn't help:

// loader4
import std.stream, std.gc;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
    enable();
}

Here using an ArrayBuilder of mine roughly doubles the running time, because the GC wastes a large amount of time scanning pointers (I have not converted the ArrayBuilders into true arrays at the end, because that operation is very fast, just a slicing, so hopefully it doesn't change the overall running time):

// loader5
import std.stream;
import std.string: split;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
}
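
For reference, the d.builders module is not shown in this post; a minimal sketch of an ArrayBuilder with the interface used in these loaders (just an illustration of the idea, not the real code) could be:

struct ArrayBuilder(T) {
    private T[] data;
    private size_t len;
    // append one element, growing the buffer geometrically
    void opCatAssign(T item) {
        if (len == data.length)
            data.length = data.length ? data.length * 2 : 16;
        data[len++] = item;
    }
    // return the filled part as a normal array (just a slice, no copying)
    T[] toarray() {
        return data[0 .. len];
    }
}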

Using the ArrayBuilder and disabling the GC helps a little:

// loader6
import std.stream;
import std.string: split;
import d.builders: ArrayBuilder;
import std.gc;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
    enable();
}

Using a lazy splitter generally helps a lot; this time it helps only a little, because the lines are made of just a few parts:

// loader7
import std.stream;
import std.string: split;
import d.string: xsplit;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
}
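
The d.string.xsplit function is not shown in this post either; a minimal sketch of a lazy whitespace splitter with the same usage (foreach over an index and a word, where each word is a slice of the original string, so nothing is allocated per element) could be:

struct LazySplitter {
    string s;
    // yields (index, word) pairs; each word is a slice into s
    int opApply(int delegate(ref size_t, ref string) dg) {
        size_t i = 0;
        size_t idx = 0;
        int result = 0;
        while (i < s.length) {
            // skip whitespace
            while (i < s.length && (s[i] == ' ' || s[i] == '\t' ||
                                    s[i] == '\r' || s[i] == '\n'))
                i++;
            if (i == s.length)
                break;
            size_t start = i;
            // scan to the end of the word
            while (i < s.length && s[i] != ' ' && s[i] != '\t' &&
                   s[i] != '\r' && s[i] != '\n')
                i++;
            string word = s[start .. i];
            result = dg(idx, word);
            if (result)
                break;
            idx++;
        }
        return result;
    }
}

LazySplitter xsplit(string s) {
    LazySplitter r;
    r.s = s;
    return r;
}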

But I don't understand why disabling the GC with the xsplit worsens the situation:

// loader8
import std.stream;
import std.string: split;
import d.string: xsplit;
import std.gc;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
    enable();
}

Using both an ArrayBuilder and lazy splitting improves the situation significantly:

// loader9
import std.stream;
import std.string: split;
import d.string: xsplit;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
}

This time disabling the GC helps a lot (see loader8 for comparison); we are now close to the timings of Python, so a lot of people can probably stop trying to optimize the code at this point:

// loader10
import std.stream;
import std.string: split;
import d.string: xsplit;
import std.gc;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
    enable();
}

But I have written a few other versions, trying to go faster than the Python version. In the following versions I have assumed the programmer knows the number of columns (3); in theory this helps the code. In theory this approach is very fast, because xsplit() is very fast and there is no need to dup the elements: you can just keep the whole txt string around (which read() loads quickly) plus a matrix of string slices into it. In practice the GC kills the performance, and I don't know why. Note that the running time of the foreach loop itself is very short (0.3-0.4 s); most of the time is spent in the final deallocation:

// loader11
import std.file: read;
import d.string: xsplit;
import d.builders: ArrayBuilder;
void main() {
    const int NCOLS = 3;
    auto txt = cast(string)read("data2.txt");
    auto cols = new ArrayBuilder!(string)[NCOLS];
    foreach (i, el; txt.xsplit())
        cols[i % NCOLS] ~= el;
}
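
To show where the time goes, here is loader11 again with a crude timer around the loop only (a sketch; it assumes D1's std.date.getUTCtime(), whose ticks are milliseconds in the Phobos of this period):

// loader11 with a timer around the loop (sketch)
import std.file: read;
import std.date: getUTCtime;
import std.stdio: writefln;
import d.string: xsplit;
import d.builders: ArrayBuilder;
void main() {
    const int NCOLS = 3;
    auto txt = cast(string)read("data2.txt");
    auto cols = new ArrayBuilder!(string)[NCOLS];
    auto t0 = getUTCtime();
    foreach (i, el; txt.xsplit())
        cols[i % NCOLS] ~= el;
    auto t1 = getUTCtime();
    // the loop itself finishes quickly; the rest of the run time goes into
    // the collection/deallocation that happens after main() returns
    writefln("loop: %s ms", t1 - t0);
}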

Disabling the GC helps:

// loader12
import std.file: read;
import d.string: xsplit;
import d.builders: ArrayBuilder;
import std.gc;
void main() {
    const int NCOLS = 3;
    auto txt = cast(string)read("data2.txt");
    auto cols = new ArrayBuilder!(string)[NCOLS];
    disable();
    foreach (i, el; txt.xsplit())
        cols[i % NCOLS] ~= el;
    enable();
}

I have tried a few other mad things without good results.

Timings, data2.txt, warm timings, best of 3:
  loader1:  23.05 s
  loader2:   3.00 s
  loader3:   8.82 s
  loader4:  11.44 s
  loader5:  21.31 s
  loader6:   7.20 s
  loader7:   7.51 s
  loader8:   8.45 s
  loader9:   5.46 s
  loader10:  3.73 s
  loader11: 82.54 s
  loader12: 38.87 s

I could now probably write a pure C version, but I am too lazy for that :-)

If you know ways to speed up the loading of such files in D, I'll gladly take a look at your code. In bioinformatics I process text files all the time.

Bye,
bearophile
September 25, 2008 (Sergey Gromov)
In article <gbg0uf$uka$1@digitalmars.com>, bearophileHUGS@lycos.com says...
> // loader3
> import std.stream;
> import std.string: split;
> void main() {
>     auto fin = new BufferedFile("data2.txt");
>     string[] first = fin.readLine().split();
>     auto cols = new string[first.length];
>     foreach (col, el; first)
>         cols[col] ~= el.dup;
>     foreach (string line; fin)
>         foreach (col, el; line.split())
>             cols[col] ~= el.dup;
> }

Specifically in this line:

>     auto cols = new string[first.length];

you probably actually want "new string[][first.length]", otherwise you simply get three huge strings.
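
A tiny snippet to illustrate the difference (appending to a string concatenates characters, while appending to a string[] adds an element); hypothetical example:

void main() {
    string a;      // per column, what the original declaration gives
    string[] b;    // per column, what was intended
    a ~= "jan";
    a ~= "feb";    // a is now "janfeb": one ever-growing string
    b ~= "jan";
    b ~= "feb";    // b now holds two separate strings
    assert(a == "janfeb");
    assert(b.length == 2 && b[1] == "feb");
}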

My timing of your original sample is 9.8s (best out of 5 runs).  The fixed sample timing is 48.9s.
September 25, 2008 (bearophile)
Sergey Gromov:
> you probably actually want "new string[][first.length]", otherwise you simply get three huge strings.

Yes, I had not noticed the missing commas in the test run output (this tells me to always avoid writeln and always use my putr, which puts "" around strings when they are printed inside containers).
The timings of the loader3/loader4 versions now show that the ArrayBuilder is quite useful.

Version 4 has the same bug too; here are the corrected versions:

// loader3
import std.stream;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[][first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
}


// loader4
import std.stream, std.gc;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[][first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
    enable();
}

And just to be sure, and to have a more realistic test, I have added a benchmark that shows the final array creation too (that final append is very quick, less than 0.01 s), derived from loader10:

// loader10b
import std.stream;
import std.string: split;
import d.string: xsplit;
import std.gc;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
    enable();
    string[][] truecols;
    foreach (col; cols)
        truecols ~= col.toarray;
}


Updated timings:
Timings, data2.txt, warm timings, best of 3:
  loader1:  23.05 s
  loader2:   3.00 s
  loader3:  44.79 s
  loader4:  39.28 s
  loader5:  21.31 s
  loader6:   7.20 s
  loader7:   7.51 s
  loader8:   8.45 s
  loader9:   5.46 s
  loader10:  3.73 s
  loader10b: 3.88 s
  loader11: 82.54 s
  loader12: 38.87 s

Bye and thank you,
bearophile
September 26, 2008 (Christopher Wright)
bearophile wrote:
> BCS Wrote:
>> so I just whipped up a little tool to do it for me.
> 
> I think that's a job better suited to a script, for example in Python (or Ruby or Perl).
> 
> -----------------
> 
> But trying to write such little text-oriented programs in D is useful anyway, because it helps spot some rough edges in D/Phobos that still deserve improvement.

I actually do most of my scripting in D (except for the odd line of sed/awk now and then). I just don't know any scripting languages well enough to write my scripts in, say, Python, unless I want to take more than one order of magnitude longer to write the code.

Even with sufficient familiarity, I'd only see small gains, so why bother?
September 27, 2008 (Sergey Gromov)
In article <gbgem1$1qvo$1@digitalmars.com>, bearophileHUGS@lycos.com says...
> Updated timings:
> Timings, data2.txt, warm timings, best of 3:
>   loader1:  23.05 s
>   loader2:   3.00 s
>   loader3:  44.79 s
>   loader4:  39.28 s
>   loader5:  21.31 s
>   loader6:   7.20 s
>   loader7:   7.51 s
>   loader8:   8.45 s
>   loader9:   5.46 s
>   loader10:  3.73 s
>   loader10b: 3.88 s
>   loader11: 82.54 s
>   loader12: 38.87 s

You've intrigued me, so I did some experimenting too.  I won't post my bad tries, only the better one.

import std.stream: BufferedFile;
import std.string: split;
import std.gc: enable, disable;
import rope: Rope;
void main()
{
    // disable();
    auto fin = new BufferedFile("data2.txt");
    alias Rope!(string) SRope;
    SRope[] result;
    foreach (el; split(cast(string) fin.readLine()))
    {
        result ~= new SRope;
        result[$-1] ~= el;
    }
    foreach (char[] line; fin)
        foreach (id, el; split(cast(string) line))
            result[id] ~= el;
    auto month = result[0].get();
    auto day = result[1].get();
    auto num = result[2].get();
    // enable();
}

The rope thing is my version of an ArrayBuilder ;) :

module rope;
class Rope(T)
{
    typeof(this) opCatAssign(T el)
    {
        if (pool >= current.length)
        {
            chunks ~= current;
            current = new T[current.length * 2];
            pool = 0;
        }
        current[pool++] = el;
        return this;
    }
    T[] get()
    {
        T[] result;
        foreach (c; chunks)
            result ~= c;
        return result ~ current[0 .. pool];
    }
    this()
    {
        current = new T[16];
        pool = 0;
    }
    private
    {
        T[][] chunks;
        T[] current;
        size_t pool;
    }
}

The timings are:
2.77s  the loader as it is posted here
2.99s  with garbage collector disabled (uncomment commented)
3.2s   Python version

On a side note, I've got a lot of weird timings today.  It turned out that on many occasions, when I had very bad timings, most of the processing time was spent on a closing brace of a function, or in an std.file.read that was obviously not reading the file but doing some memory handling.  Something definitely goes wrong sometimes.
September 27, 2008 (Sergey Gromov)
In article <MPG.2347c0e378c6cc6a98970c@news.digitalmars.com>, snake.scaly@gmail.com says...
> module rope;
> class Rope(T)
> {
>     typeof(this) opCatAssign(T el)
>     {
>         if (pool >= current.length)
>         {
>             chunks ~= current;
>             current = new T[current.length * 2];
>             pool = 0;
>         }
>         current[pool++] = el;
>         return this;
>     }
>     T[] get()
>     {
>         T[] result;
>         foreach (c; chunks)
>             result ~= c;
>         return result ~ current[0 .. pool];
>     }
>     this()
>     {
>         current = new T[16];
>         pool = 0;
>     }
>     private
>     {
>         T[][] chunks;
>         T[] current;
>         size_t pool;
>     }
> }

The simpler, the better: use Array instead of Rope and get 2.36s:

module array;
class Array(T)
{
    this()
    {
        data = new T[16];
    }
    typeof(this) opCatAssign(T el)
    {
        if (pos == data.length)
            data.length = data.length * 2;
        data[pos++] = el;
        return this;
    }
    T[] get()
    {
        return data[0 .. pos];
    }
    private
    {
        T[] data;
        size_t pos;
    }
}
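
A minimal usage sketch (hypothetical, just to show the interface; in the loader above only the import and the alias need to change):

import array;

void main()
{
    auto col = new Array!(string);
    col ~= "jan";
    col ~= "feb";
    string[] items = col.get();   // a slice of the internal buffer, no copying
    assert(items.length == 2 && items[1] == "feb");
}

The trade-off against Rope is that growing the buffer copies the elements already stored, while get() becomes a plain slice instead of a concatenation.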
September 27, 2008 (bearophile)
Sergey Gromov, there's a bug in your code: you don't perform the dup, so the stored string slices keep pointing into the line buffer that gets overwritten, and their contents are lost. This is a bug-prone aspect of the current D; I have made similar mistakes several times in the past.

The updated timings (data2.txt, warm timings, best of 3):
  loader1:    23.05 s
  loader2:     3.00 s
  loader3:    44.79 s
  loader4:    39.28 s
  loader5:    21.31 s
  loader6:     7.20 s
  loader7:     7.51 s
  loader8:     8.45 s
  loader9:     5.46 s
  loader10:    3.73 s
  loader10b:  3.88 s
  loader11:   82.54 s
  loader12:   38.87 s
  loader13:   28.28 s
  loader14:   15.37 s
  loader13:    6.74 s (no GC)
  loader14:    6.94 s (no GC)


Here is your code, as I have used it:

// loader13
import std.stream: BufferedFile;
import std.string: split;
//import std.gc: enable, disable;

class Rope(T) {
    private {
        T[][] chunks;
        T[] current;
        size_t pool;
    }
    this() {
        current = new T[16];
        pool = 0;
    }
    typeof(this) opCatAssign(T el) {
        if (pool >= current.length) {
            chunks ~= current;
            current = new T[current.length * 2];
            pool = 0;
        }
        current[pool++] = el;
        return this;
    }
    T[] get() {
        T[] result;
        foreach (c; chunks)
            result ~= c;
        return result ~ current[0 .. pool];
    }
}

void main() {
    //disable();
    auto fin = new BufferedFile("data2.txt");
    alias Rope!(string) SRope;
    SRope[] result;
    foreach (el; split(cast(string) fin.readLine())) {
        result ~= new SRope;
        result[$-1] ~= el;
    }

    foreach (char[] line; fin)
        foreach (id, el; split(cast(string) line))
            result[id] ~= el.dup;

    auto month = result[0].get();
    auto day = result[1].get();
    auto num = result[2].get();
    //enable();
}



// loader14
import std.stream: BufferedFile;
import std.string: split;
//import std.gc: enable, disable;

class Array(T) {
    private {
        T[] data;
        size_t pos;
    }
    this() {
        data = new T[16];
    }
    typeof(this) opCatAssign(T el) {
        if (pos == data.length)
            data.length = data.length * 2;
        data[pos++] = el;
        return this;
    }
    T[] get() {
        return data[0 .. pos];
    }
}

void main() {
    //disable();
    auto fin = new BufferedFile("data2.txt");
    alias Array!(string) SArray;
    SArray[] result;
    foreach (el; split(cast(string) fin.readLine())) {
        result ~= new SArray;
        result[$-1] ~= el;
    }

    foreach (char[] line; fin)
        foreach (id, el; line.split())
            result[id] ~= el.dup;

    auto month = result[0].get();
    auto day = result[1].get();
    auto num = result[2].get();
    //enable();
}

Bye,
bearophile
September 27, 2008 (Sergey Gromov)
Thu, 25 Sep 2008 12:36:17 -0400,
bearophile wrote:
> Updated timings:
> Timings, data2.txt, warm timings, best of 3:
>   loader1:  23.05 s
>   loader2:   3.00 s
>   loader3:  44.79 s
>   loader4:  39.28 s
>   loader5:  21.31 s
>   loader6:   7.20 s
>   loader7:   7.51 s
>   loader8:   8.45 s
>   loader9:   5.46 s
>   loader10:  3.73 s
>   loader10b: 3.88 s
>   loader11: 82.54 s
>   loader12: 38.87 s

And, for completeness' sake, a straightforward implementation in C++:

#include <fstream>
#include <string>
#include <memory>
#include <algorithm>
#include <ctype.h>
#include <string.h>
#include <vector>
#include <functional>

using namespace std;

int main()
{
    char buf[1024];
    char *pos, *end;
    ifstream ifs("data2.txt");
    // result
    vector<vector<string> > result;
    // number of columns
    ifs.getline(buf, sizeof buf);
    pos = buf;
    end = buf + strlen(buf);  // note: gcount() also counts the extracted '\n'
    // loop over the tokens
    while (pos != end)
    {
        char *wordEnd = find_if(pos, end, isspace);
        result.push_back(vector<string>());
        result.back().push_back(string(pos, wordEnd));
        pos = find_if(wordEnd, end, not1(ptr_fun(isspace)));
    }
    // rest of the lines
    while (ifs.good())
    {
        ifs.getline(buf, sizeof buf);
        pos = buf;
        end = buf + strlen(buf);
        // loop over the tokens
        size_t col = 0;
        while (pos != end)
        {
            char *wordEnd = find_if(pos, end, isspace);
            result[col].push_back(string(pos, wordEnd));
            pos = find_if(wordEnd, end, not1(ptr_fun(isspace)));
            ++col;
        }
    }
}

On my system:
C++:    3.93 s
best D: 2.36 s
Python: 3.20 s

Replacing the inner vector<> with a list<> makes matters worse: 5.68 s instead of 3.93 s.  I also guessed that the problem was the vectors copying all the strings on every re-allocation, so I tried replacing the strings with boost::shared_ptr<string>, which made for 13.35 s.  I'm not sure I can optimize this code any further without re-writing it in Cee.
September 27, 2008 (Sergey Gromov)
Sat, 27 Sep 2008 08:00:02 -0400,
bearophile wrote:
>   loader13:   28.28 s
>   loader14:   15.37 s
>   loader13:    6.74 s (no GC)
>   loader14:    6.94 s (no GC)

This will teach me not to do research late at night. D-: