September 25, 2008 #line decoder

I'm doing a pile of code generation and got tired of manually translating the effects of #line directives, so I just wiped out a little tool to do it for me. You give it the input file, the "filename" and the line number you are looking for, and it either spits out that line or the actual line number it's on, depending on what you ask for.

source: http://www.dsource.org/projects/scrapple/browser/trunk/lines/lines.d
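
Editor's note, for readers who haven't met them: a #line directive tells the compiler to pretend that the next line is line N of file F, which is why error messages from generated code point somewhere other than the physical file. The lines.d linked above is the real tool; what follows is only a rough sketch of one direction of that mapping (logical file/line to physical line) in present-day D, with an invented command-line interface:

// Rough sketch, not BCS's code: map a logical (filename, line) pair,
// as established by #line directives, back to the physical line.
// Usage (invented): lines <input-file> <logical-filename> <logical-line>
import std.algorithm : startsWith;
import std.array : split;
import std.conv : to;
import std.stdio;
import std.string : strip;

void main(string[] args)
{
    string wantFile = args[2];
    int wantLine = args[3].to!int;

    string curFile = args[1];   // before any #line, logical file == physical file
    int curLine = 0;            // logical line number of the current physical line
    int physical = 0;           // physical line number in the input file

    foreach (line; File(args[1]).byLineCopy) {
        physical++;
        curLine++;
        auto s = line.strip;
        if (s.startsWith("#line")) {
            // '#line <number> "<filename>"' applies to the *next* source line
            auto parts = s.split();
            curLine = parts[1].to!int - 1;
            if (parts.length > 2)
                curFile = parts[2][1 .. $ - 1];   // drop the surrounding quotes
            continue;
        }
        if (curFile == wantFile && curLine == wantLine) {
            writefln("physical line %s: %s", physical, line);
            return;
        }
    }
}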

September 25, 2008 Text editing [Was: Re: #line decoder]

Posted in reply to BCS

BCS Wrote:
> so I just wiped out a little tool to do it for me.

I think that's a job better suited to a script, for example in Python (or Ruby or Perl).

-----------------

But trying to write such little text-oriented programs in D is positive anyway, because it helps spot some rough edges in D/Phobos that still deserve improvement.

For example, this is "data.txt":

jan sun 12
feb mon 14
mar fri 23
aug sat 3
jun tue 15

The purpose is to read those 3 columns as separate arrays, so the output is transposed:

[["jan", "feb", "mar", "aug", "jun"],
 ["sun", "mon", "fri", "sat", "tue"],
 ["12", "14", "23", "3", "15"]]

With Python you can do it this way (this produces immutable sub-arrays), loader1:

r1 = zip(*(l.split() for l in file("data2.txt")))

If you duplicate data.txt many times to get a bigger file (for example a 44 MB text file) like this:

>>> t = file("data.txt").read()
>>> t2 = t * 400000
>>> len(t2)
43600000
>>> file("data2.txt", "w").write(t2)

you find that transposing is much too slow. This is my faster Python+Psyco version:

# loader2
from collections import deque

def main():
    fin = file("data2.txt")
    cols = [deque([el]) for el in fin.readline().split()]
    for line in fin:
        col = 0
        for el in line.split():
            cols[col].append(el)
            col += 1

import psyco; psyco.full()
main()

My first D version:

// loader3
import std.stream;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
}

Disabling the GC doesn't help:

// loader4
import std.stream, std.gc;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
    enable();
}

Here using an ArrayBuilder of mine roughly doubles the running time, because the GC wastes a large amount of time scanning pointers (I have not converted the ArrayBuilders into arrays because that operation is very fast, just a slicing, so hopefully it doesn't change the overall running time):

// loader5
import std.stream;
import std.string: split;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
}

Using the ArrayBuilder and disabling the GC helps a little:

// loader6
import std.stream;
import std.string: split;
import d.builders: ArrayBuilder;
import std.gc;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
    enable();
}

Using a lazy splitter generally helps a lot; this time it helps only a little because the lines are made of few parts:

// loader7
import std.stream;
import std.string: split;
import d.string: xsplit;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
}

But I don't understand why disabling the GC with xsplit worsens the situation:

// loader8
import std.stream;
import std.string: split;
import d.string: xsplit;
import std.gc;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
    enable();
}

Using both an ArrayBuilder and lazy splitting improves the situation significantly:

// loader9
import std.stream;
import std.string: split;
import d.string: xsplit;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
}

This time disabling the GC helps a lot (see loader8 for comparison); we are now close to the timings of Python, so a lot of people can probably stop optimizing at this point:

// loader10
import std.stream;
import std.string: split;
import d.string: xsplit;
import std.gc;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
    enable();
}

But I have done a few other versions, trying to go faster than the Python version. In the following versions I have assumed the programmer knows the number of columns (3); in theory this helps the code. In theory this is very fast, because xsplit() is very fast and there is no need to dup the elements: you can just keep the txt string around (which read() loads quickly) plus a matrix of string slices. In practice the GC kills the performance, and I don't know why. Note that the running time of the foreach loop is very short (0.3-0.4 s); most of the time is spent in the final deallocation:

// loader11
import std.file: read;
import d.string: xsplit;
import d.builders: ArrayBuilder;
void main() {
    const int NCOLS = 3;
    auto txt = cast(string)read("data2.txt");
    auto cols = new ArrayBuilder!(string)[NCOLS];
    foreach (i, el; txt.xsplit())
        cols[i % NCOLS] ~= el;
}

Disabling the GC helps:

// loader12
import std.file: read;
import d.string: xsplit;
import d.builders: ArrayBuilder;
import std.gc;
void main() {
    const int NCOLS = 3;
    auto txt = cast(string)read("data2.txt");
    auto cols = new ArrayBuilder!(string)[NCOLS];
    disable();
    foreach (i, el; txt.xsplit())
        cols[i % NCOLS] ~= el;
    enable();
}

I have tried a few other mad things without good results.

Timings, data2.txt, warm timings, best of 3:
loader1:  23.05 s
loader2:   3.00 s
loader3:   8.82 s
loader4:  11.44 s
loader5:  21.31 s
loader6:   7.20 s
loader7:   7.51 s
loader8:   8.45 s
loader9:   5.46 s
loader10:  3.73 s
loader11: 82.54 s
loader12: 38.87 s

I can now probably write a pure C version, but I am too lazy for that :-) If you know ways to speed up the loading of such a file in D I'll gladly take a look at your code. In bioinformatics I process text files all the time.

Bye,
bearophile
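
Editor's note: xsplit comes from bearophile's own d.string module and is never shown in the thread. As a guess at what it does (not his actual code), a lazy whitespace splitter can be written with an indexed opApply, so it yields slices of the original text without allocating per token, which is exactly why it beats the eager std.string.split in the loaders above:

// A guess at a lazy whitespace splitter in the spirit of d.string.xsplit:
// yields (index, slice) pairs over the original string, no per-token copies.
import std.ascii : isWhite;
import std.stdio;

struct XSplit {
    string s;

    int opApply(int delegate(ref size_t, ref string) dg) {
        size_t i = 0, n = 0;
        while (i < s.length) {
            while (i < s.length && isWhite(s[i])) i++;   // skip separators
            if (i >= s.length) break;
            size_t start = i;
            while (i < s.length && !isWhite(s[i])) i++;  // scan one token
            auto tok = s[start .. i];                    // a slice, not a copy
            if (auto r = dg(n, tok)) return r;
            n++;
        }
        return 0;
    }
}

XSplit xsplit(string s) { return XSplit(s); }

void main() {
    foreach (i, tok; xsplit("jan sun 12\nfeb mon 14"))
        writeln(i, ": ", tok);
}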

September 25, 2008 Re: Text editing [Was: Re: #line decoder]

Posted in reply to bearophile

In article <gbg0uf$uka$1@digitalmars.com>, bearophileHUGS@lycos.com says...

> // loader3
> import std.stream;
> import std.string: split;
> void main() {
>     auto fin = new BufferedFile("data2.txt");
>     string[] first = fin.readLine().split();
>     auto cols = new string[first.length];
>     foreach (col, el; first)
>         cols[col] ~= el.dup;
>     foreach (string line; fin)
>         foreach (col, el; line.split())
>             cols[col] ~= el.dup;
> }

Specifically in this line:

> auto cols = new string[first.length];

you probably actually want "new string[][first.length]", otherwise you simply get three huge strings.

My timing of your original sample is 9.8 s (best out of 5 runs). The timing of the fixed sample is 48.9 s.
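
Editor's note, to spell out the bug Sergey found: with cols typed as string[], each ~= concatenates characters onto one of three strings, while with string[][] it appends a new element to one of three arrays of strings. A tiny stand-alone illustration (not from the thread):

import std.stdio;

void main() {
    auto flat   = new string[3];    // three strings
    auto nested = new string[][3];  // three arrays of strings

    flat[0] ~= "jan";    // character concatenation: flat[0] == "jan"
    flat[0] ~= "feb";    // flat[0] == "janfeb" -- one huge string per column

    nested[0] ~= "jan";  // element append: nested[0] == ["jan"]
    nested[0] ~= "feb";  // nested[0] == ["jan", "feb"]

    writeln(flat[0]);    // janfeb
    writeln(nested[0]);  // ["jan", "feb"]
}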

September 25, 2008 Re: Text editing [Was: Re: #line decoder]

Posted in reply to Sergey Gromov

Sergey Gromov:
> you probably actually want "new string[][first.length]", otherwise you simply get three huge strings.

Yes, I had not noticed the missing commas in the test run output (this tells me to always avoid writeln and always use my putr, which puts "" around strings when they are printed inside containers). The timings of the loader3/loader4 versions now show that the ArrayBuilder is quite useful. Version 4 has the same bug too; the new versions:

// loader3
import std.stream;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[][first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
}

// loader4
import std.stream, std.gc;
import std.string: split;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new string[][first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.split())
            cols[col] ~= el.dup;
    enable();
}

And just to be sure and have a more realistic test, I have added a benchmark that shows the final array creation too (the final append is very quick, less than 0.01 s), derived from loader10:

// loader10b
import std.stream;
import std.string: split;
import d.string: xsplit;
import std.gc;
import d.builders: ArrayBuilder;
void main() {
    auto fin = new BufferedFile("data2.txt");
    string[] first = fin.readLine().split();
    auto cols = new ArrayBuilder!(string)[first.length];
    foreach (col, el; first)
        cols[col] ~= el.dup;
    disable();
    foreach (string line; fin)
        foreach (col, el; line.xsplit())
            cols[col] ~= el.dup;
    enable();
    string[][] truecols;
    foreach (col; cols)
        truecols ~= col.toarray;
}

Updated timings:

Timings, data2.txt, warm timings, best of 3:
loader1:   23.05 s
loader2:    3.00 s
loader3:   44.79 s
loader4:   39.28 s
loader5:   21.31 s
loader6:    7.20 s
loader7:    7.51 s
loader8:    8.45 s
loader9:    5.46 s
loader10:   3.73 s
loader10b:  3.88 s
loader11:  82.54 s
loader12:  38.87 s

Bye and thank you,
bearophile

September 26, 2008 Re: Text editing [Was: Re: #line decoder]

Posted in reply to bearophile

bearophile wrote:
> BCS Wrote:
>> so I just wiped out a little tool to do it for me.
>
> I think that's a job better suited to a script, for example in Python (or Ruby or Perl).
>
> -----------------
>
> But trying to write such little text-oriented programs in D is positive anyway, because it helps spot some rough edges in D/Phobos that still deserve improvement.
I actually do most of my scripting in D (except for the odd line of sed/awk now and then). I just don't know any scripting languages well enough to write my scripts in, say, Python, unless I want to take more than one order of magnitude longer to write the code.
Even with sufficient familiarity, I'd only see small gains, so why bother?

September 27, 2008 Re: Text editing [Was: Re: #line decoder]

Posted in reply to bearophile

In article <gbgem1$1qvo$1@digitalmars.com>, bearophileHUGS@lycos.com says...

> Updated timings:
> Timings, data2.txt, warm timings, best of 3:
> loader1:   23.05 s
> loader2:    3.00 s
> loader3:   44.79 s
> loader4:   39.28 s
> loader5:   21.31 s
> loader6:    7.20 s
> loader7:    7.51 s
> loader8:    8.45 s
> loader9:    5.46 s
> loader10:   3.73 s
> loader10b:  3.88 s
> loader11:  82.54 s
> loader12:  38.87 s

You've intrigued me, so I did some experimenting too. I won't post my bad tries, only the better one.

import std.stream: BufferedFile;
import std.string: split;
import std.gc: enable, disable;
import rope: Rope;

void main() {
    // disable();
    auto fin = new BufferedFile("data2.txt");
    alias Rope!(string) SRope;
    SRope[] result;
    foreach (el; split(cast(string) fin.readLine())) {
        result ~= new SRope;
        result[$-1] ~= el;
    }
    foreach (char[] line; fin)
        foreach (id, el; split(cast(string) line))
            result[id] ~= el;
    auto month = result[0].get();
    auto day = result[1].get();
    auto num = result[2].get();
    // enable();
}

The rope thing is my version of an ArrayBuilder ;) :

module rope;

class Rope(T)
{
    typeof(this) opCatAssign(T el)
    {
        if (pool >= current.length)
        {
            chunks ~= current;
            current = new T[current.length * 2];
            pool = 0;
        }
        current[pool++] = el;
        return this;
    }

    T[] get()
    {
        T[] result;
        foreach (c; chunks)
            result ~= c;
        return result ~ current[0 .. pool];
    }

    this()
    {
        current = new T[16];
        pool = 0;
    }

    private
    {
        T[][] chunks;
        T[] current;
        size_t pool;
    }
}

The timings are:

2.77 s  the loader as posted here
2.99 s  with the garbage collector disabled (uncomment the commented lines)
3.20 s  Python version

On a side note, I've got a lot of weird timings today. It turned out that on many occasions, when I had very bad timings, most of the processing time was spent on the closing brace of a function, or on an std.file.read call that was obviously not reading the file but doing some memory handling. Something definitely goes wrong sometimes.

September 27, 2008 Re: Text editing [Was: Re: #line decoder]

Posted in reply to Sergey Gromov

In article <MPG.2347c0e378c6cc6a98970c@news.digitalmars.com>, snake.scaly@gmail.com says...

> module rope;
>
> class Rope(T)
> {
>     typeof(this) opCatAssign(T el)
>     {
>         if (pool >= current.length)
>         {
>             chunks ~= current;
>             current = new T[current.length * 2];
>             pool = 0;
>         }
>         current[pool++] = el;
>         return this;
>     }
>
>     T[] get()
>     {
>         T[] result;
>         foreach (c; chunks)
>             result ~= c;
>         return result ~ current[0 .. pool];
>     }
>
>     this()
>     {
>         current = new T[16];
>         pool = 0;
>     }
>
>     private
>     {
>         T[][] chunks;
>         T[] current;
>         size_t pool;
>     }
> }

The simpler, the better: use Array instead of Rope and get 2.36 s:

module array;

class Array(T)
{
    this()
    {
        data = new T[16];
    }

    typeof(this) opCatAssign(T el)
    {
        if (pos == data.length)
            data.length = data.length * 2;
        data[pos++] = el;
        return this;
    }

    T[] get()
    {
        return data[0 .. pos];
    }

    private
    {
        T[] data;
        size_t pos;
    }
}
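
Editor's note: a quick usage sketch of the builder above, written in the D of the time and assuming the class sits in a module named array (my example, not Sergey's). Appends are amortized O(1) thanks to the doubling, and get() is just a slice of the internal buffer, whereas Rope.get() has to concatenate all of its chunks at the end; that cheaper get() is presumably where the 2.36 s vs 2.77 s difference comes from.

// Hypothetical usage of the Array builder (assumes it lives in array.d).
import array: Array;
import std.stdio;

void main() {
    auto b = new Array!(string);
    b ~= "jan";               // amortized O(1) appends, capacity doubles as needed
    b ~= "feb";
    b ~= "mar";
    string[] col = b.get();   // a slice of the builder's buffer, no copying
    writefln("%s", col);
}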

September 27, 2008 Re: Text editing [Was: Re: #line decoder]

Posted in reply to Sergey Gromov

Sergey Gromov, there's a bug in your code: you don't perform the dup, so the contents of all the string slices are lost. This is a bug-prone feature of the current D; I have made similar bugs several times in the past. The timings:

Updated timings:
Timings, data2.txt, warm timings, best of 3:
loader1:   23.05 s
loader2:    3.00 s
loader3:   44.79 s
loader4:   39.28 s
loader5:   21.31 s
loader6:    7.20 s
loader7:    7.51 s
loader8:    8.45 s
loader9:    5.46 s
loader10:   3.73 s
loader10b:  3.88 s
loader11:  82.54 s
loader12:  38.87 s
loader13:  28.28 s
loader14:  15.37 s
loader13:   6.74 s (no GC)
loader14:   6.94 s (no GC)

Your code as I have used it:

// loader13
import std.stream: BufferedFile;
import std.string: split;
//import std.gc: enable, disable;

class Rope(T) {
    private {
        T[][] chunks;
        T[] current;
        size_t pool;
    }

    this() {
        current = new T[16];
        pool = 0;
    }

    typeof(this) opCatAssign(T el) {
        if (pool >= current.length) {
            chunks ~= current;
            current = new T[current.length * 2];
            pool = 0;
        }
        current[pool++] = el;
        return this;
    }

    T[] get() {
        T[] result;
        foreach (c; chunks)
            result ~= c;
        return result ~ current[0 .. pool];
    }
}

void main() {
    //disable();
    auto fin = new BufferedFile("data2.txt");
    alias Rope!(string) SRope;
    SRope[] result;
    foreach (el; split(cast(string) fin.readLine())) {
        result ~= new SRope;
        result[$-1] ~= el;
    }
    foreach (char[] line; fin)
        foreach (id, el; split(cast(string) line))
            result[id] ~= el.dup;
    auto month = result[0].get();
    auto day = result[1].get();
    auto num = result[2].get();
    //enable();
}

// loader14
import std.stream: BufferedFile;
import std.string: split;
//import std.gc: enable, disable;

class Array(T) {
    private {
        T[] data;
        size_t pos;
    }

    this() {
        data = new T[16];
    }

    typeof(this) opCatAssign(T el) {
        if (pos == data.length)
            data.length = data.length * 2;
        data[pos++] = el;
        return this;
    }

    T[] get() {
        return data[0 .. pos];
    }
}

void main() {
    //disable();
    auto fin = new BufferedFile("data2.txt");
    alias Array!(string) SArray;
    SArray[] result;
    foreach (el; split(cast(string) fin.readLine())) {
        result ~= new SArray;
        result[$-1] ~= el;
    }
    foreach (char[] line; fin)
        foreach (id, el; line.split())
            result[id] ~= el.dup;
    auto month = result[0].get();
    auto day = result[1].get();
    auto num = result[2].get();
    //enable();
}

Bye,
bearophile
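
Editor's note, on the bug bearophile points out (the thread doesn't spell it out): foreach (char[] line; fin) over a std.stream file reuses one internal line buffer, so the slices produced by split() all point into memory that the next iteration overwrites, and they must be .dup'ed to survive the loop. A toy demonstration of the same hazard with an explicitly reused buffer, in present-day D:

// Toy demonstration of the slice-reuse hazard behind the missing .dup:
// slices into a buffer that gets overwritten all end up with the last contents.
import std.stdio;

void main() {
    char[] buf = "jan sun 12".dup;   // stands in for the stream's reused line buffer
    char[][] kept;

    kept ~= buf[0 .. 3];             // keep a slice, currently "jan"
    buf[0 .. 10] = "feb mon 14";     // the "next line" overwrites the same buffer
    kept ~= buf[0 .. 3];             // keep another slice, "feb"

    writeln(kept);                   // ["feb", "feb"] -- the first slice changed too

    // The fix is to copy what you keep: kept ~= buf[0 .. 3].dup;
}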

September 27, 2008 Re: Text editing [Was: Re: #line decoder]

Posted in reply to bearophile

Thu, 25 Sep 2008 12:36:17 -0400, bearophile wrote:

> Updated timings:
> Timings, data2.txt, warm timings, best of 3:
> loader1:   23.05 s
> loader2:    3.00 s
> loader3:   44.79 s
> loader4:   39.28 s
> loader5:   21.31 s
> loader6:    7.20 s
> loader7:    7.51 s
> loader8:    8.45 s
> loader9:    5.46 s
> loader10:   3.73 s
> loader10b:  3.88 s
> loader11:  82.54 s
> loader12:  38.87 s

And, for completeness' sake, a straightforward implementation in C++:

#include <fstream>
#include <string>
#include <memory>
#include <algorithm>
#include <ctype.h>
#include <vector>
#include <functional>
using namespace std;

int main()
{
    char buf[1024];
    char *pos, *end;
    ifstream ifs("data2.txt");

    // result
    vector<vector<string> > result;

    // number of columns
    ifs.getline(buf, sizeof buf);
    pos = buf;
    end = buf + ifs.gcount();
    // loop over the tokens
    while (pos != end)
    {
        char *wordEnd = find_if(pos, end, isspace);
        result.push_back(vector<string>());
        result.back().push_back(string(pos, wordEnd));
        pos = find_if(wordEnd, end, not1(ptr_fun(isspace)));
    }

    // rest of the lines
    while (ifs.good())
    {
        ifs.getline(buf, sizeof buf);
        pos = buf;
        end = buf + ifs.gcount();
        // loop over the tokens
        size_t col = 0;
        while (pos != end)
        {
            char *wordEnd = find_if(pos, end, isspace);
            result[col].push_back(string(pos, wordEnd));
            pos = find_if(wordEnd, end, not1(ptr_fun(isspace)));
            ++col;
        }
    }
}

On my system:

C++:    3.93 s
best D: 2.36 s
Python: 3.20 s

Replacing the inner vector<> with a list<> makes matters worse: 5.68 s instead of 3.93 s. I also guessed that the problem was in the vectors copying all the strings on every re-allocation and tried replacing the strings with boost::shared_ptr<string>, which made for 13.35 s. I'm not sure I can optimize this code any further without re-writing it in Cee.

September 27, 2008 Re: Text editing [Was: Re: #line decoder]

Posted in reply to bearophile

Sat, 27 Sep 2008 08:00:02 -0400,
bearophile wrote:
> loader13: 28.28 s
> loader14: 15.37 s
> loader13: 6.74 s (no GC)
> loader14: 6.94 s (no GC)
This will teach me not to do research late at night. D-: