Thread overview
Can I speed up this log parsing script further?
Jun 09, 2017
uncorroded
Jun 09, 2017
rikki cattermole
Jun 09, 2017
Daniel Kozak
Jun 09, 2017
Daniel Kozak
Jun 09, 2017
Daniel Kozak
Jun 09, 2017
uncorroded
Jun 09, 2017
Daniel Kozak
Jun 09, 2017
uncorroded
Jun 09, 2017
H. S. Teoh
Jun 09, 2017
Daniel Kozak
June 09, 2017
Hi guys,

I am a beginner in D. As a project, I converted a log-parsing script in Python which we use at work, to D. This link was helpful - ( https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ ) I compiled it with dmd and ldc. The log file is 52 MB. With dmd (not release build), it takes 1.1 sec and with ldc, it takes 0.3 sec.

The Python script (run with system python, not Pypy) takes 0.75 sec. The D and Python functions are here and on pastebin ( D - https://pastebin.com/SeUR3wFP , Python - https://pastebin.com/F5JbfBmE ).

Basically, i am reading a line, checking for 2 constants. If either one is found, some processing is done on line and stored to an array for later analysis. I tried reading the file entirely in one go using std.file : readText and using std.algorithm : splitter for lazily splitting newline but there is no difference in speed, so I used the byLine approach mentioned in the linked blog. Is there a better way of doing this in D?

Note:
I ran GC profiling as mentioned in linked blog. The results were:

Number of collections:  3
	Total GC prep time:  0 milliseconds
	Total mark time:  0 milliseconds
	Total sweep time:  0 milliseconds
	Total page recovery time:  0 milliseconds
	Max Pause Time:  0 milliseconds
	Grand total GC time:  2 milliseconds
GC summary:   12 MB,    3 GC    2 ms, Pauses    0 ms <    0 ms

So GC does not seem to be an issue.

Here's the D script:

import std.stdio;
import std.string;
import std.array;
import std.algorithm : splitter;
import std.typecons : tuple, Tuple;
import std.conv : to;

void read_log(string filename) {
    File file = File(filename, "r");
    Tuple!(char[], int, char[])[] npushed;
    Tuple!(int, char[], int, bool, bool)[] pushed;
    foreach (line; file.byLine) {
        if (line.indexOf("SOC_NOT_PUSHED") != -1) {
            auto tarr = line.split();
            npushed ~= tuple(tarr[2] ~ tarr[3], to!int(tarr[$ - 1]), tarr[$ - 2]);
            continue;
        }
        if (line.indexOf("SYNC_PUSH:") != -1) {
            auto rel = line.split("SYNC_PUSH:")[1].strip();
            auto att = rel.split(" at ");
            auto ina = att[1].split(" in ");
            auto msa = ina[1].split(" ms ");
            pushed ~= tuple(to!int(att[0]), ina[0], to!int(msa[0]),
                    msa[1].indexOf("PA-SOC_POP") != -1, msa[1].indexOf("CU-SOC_POP") != -1);
        }
    }
    // Using the arrays later on in production script
    writeln(npushed.length);
    writeln(pushed.length);
}


Here is Python function:

def read_log(fname):
    try:
        with open(fname, 'r') as f:
            raw = f.read().splitlines()
            ns = [s.split() for s in raw if 'SOC_NOT_PUSHED' in s]
            ss = [w.split("SYNC_PUSH:")[1].strip() for w in raw if 'SYNC_PUSH:' in w]
            not_pushed = [[s[2]+s[3], int(s[-1]), s[-2]] for s in ns]
            ww = [(int(e.split(' at ')[0]), e.split(' at ')[1].split(' in ')[0], int(e.split(' at ')[1].split(' in ')[1].split(' ms ')[0]), set(e.split(' at ')[1].split(' in ')[1].split(' ms ')[1].split())) for e in ss]
            pushed = [[w[0], w[1], w[2], 1 if 'PA-SOC_POP' in w[3] else 0, 1 if 'CU-SOC_POP' in w[3] else 0] for w in ww]
            return not_pushed, pushed
    except:
        return []

June 09, 2017
On 09/06/2017 8:34 AM, uncorroded wrote:
> Hi guys,
> 
> I am a beginner in D. As a project, I converted a log-parsing script in Python which we use at work, to D. This link was helpful - ( https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ ) I compiled it with dmd and ldc. The log file is 52 MB. With dmd (not release build), it takes 1.1 sec and with ldc, it takes 0.3 sec.
> 
> The Python script (run with system python, not Pypy) takes 0.75 sec. The D and Python functions are here and on pastebin ( D - https://pastebin.com/SeUR3wFP , Python - https://pastebin.com/F5JbfBmE ).
> 
> Basically, i am reading a line, checking for 2 constants. If either one is found, some processing is done on line and stored to an array for later analysis. I tried reading the file entirely in one go using std.file : readText and using std.algorithm : splitter for lazily splitting newline but there is no difference in speed, so I used the byLine approach mentioned in the linked blog. Is there a better way of doing this in D?
> 
> Note:
> I ran GC profiling as mentioned in linked blog. The results were:
> 
> Number of collections:  3
>      Total GC prep time:  0 milliseconds
>      Total mark time:  0 milliseconds
>      Total sweep time:  0 milliseconds
>      Total page recovery time:  0 milliseconds
>      Max Pause Time:  0 milliseconds
>      Grand total GC time:  2 milliseconds
> GC summary:   12 MB,    3 GC    2 ms, Pauses    0 ms <    0 ms
> 
> So GC does not seem to be an issue.
> 
> Here's the D script:
> 
> import std.stdio;
> import std.string;
> import std.array;
> import std.algorithm : splitter;
> import std.typecons : tuple, Tuple;
> import std.conv : to;
> 
> void read_log(string filename) {
>      File file = File(filename, "r");
>      Tuple!(char[], int, char[])[] npushed;
>      Tuple!(int, char[], int, bool, bool)[] pushed;
>      foreach (line; file.byLine) {
>          if (line.indexOf("SOC_NOT_PUSHED") != -1) {
>              auto tarr = line.split();
>              npushed ~= tuple(tarr[2] ~ tarr[3], to!int(tarr[$ - 1]), tarr[$ - 2]);
>              continue;
>          }
>          if (line.indexOf("SYNC_PUSH:") != -1) {
>              auto rel = line.split("SYNC_PUSH:")[1].strip();
>              auto att = rel.split(" at ");
>              auto ina = att[1].split(" in ");
>              auto msa = ina[1].split(" ms ");
>              pushed ~= tuple(to!int(att[0]), ina[0], to!int(msa[0]),
>                      msa[1].indexOf("PA-SOC_POP") != -1, msa[1].indexOf("CU-SOC_POP") != -1);
>          }
>      }
>      // Using the arrays later on in production script
>      writeln(npushed.length);
>      writeln(pushed.length);
> }
> 
> 
> Here is Python function:
> 
> def read_log(fname):
>      try:
>          with open(fname, 'r') as f:
>              raw = f.read().splitlines()
>              ns = [s.split() for s in raw if 'SOC_NOT_PUSHED' in s]
>              ss = [w.split("SYNC_PUSH:")[1].strip() for w in raw if 'SYNC_PUSH:' in w]
>              not_pushed = [[s[2]+s[3], int(s[-1]), s[-2]] for s in ns]
>              ww = [(int(e.split(' at ')[0]), e.split(' at ')[1].split(' in ')[0], int(e.split(' at ')[1].split(' in ')[1].split(' ms ')[0]), set(e.split(' at ')[1].split(' in ')[1].split(' ms ')[1].split())) for e in ss]
>              pushed = [[w[0], w[1], w[2], 1 if 'PA-SOC_POP' in w[3] else 0, 1 if 'CU-SOC_POP' in w[3] else 0] for w in ww]
>              return not_pushed, pushed
>      except:
>          return []
> 

The code isn't entirely 1:1. Any usage of IO (includes stdout via writeln) is expensive. Your python code doesn't write anything to stdout (or perform any calls). It would also be good to get the results of dmd -release as well.
June 09, 2017
I would considered using appender for pushed and npushed. Can you post file on which you are running benchmarking?

On Fri, Jun 9, 2017 at 9:50 AM, rikki cattermole via Digitalmars-d-learn < digitalmars-d-learn@puremagic.com> wrote:

> On 09/06/2017 8:34 AM, uncorroded wrote:
>
>> Hi guys,
>>
>> I am a beginner in D. As a project, I converted a log-parsing script in Python which we use at work, to D. This link was helpful - ( https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ ) I compiled it with dmd and ldc. The log file is 52 MB. With dmd (not release build), it takes 1.1 sec and with ldc, it takes 0.3 sec.
>>
>> The Python script (run with system python, not Pypy) takes 0.75 sec. The
>> D and Python functions are here and on pastebin ( D -
>> https://pastebin.com/SeUR3wFP , Python - https://pastebin.com/F5JbfBmE ).
>>
>> Basically, i am reading a line, checking for 2 constants. If either one is found, some processing is done on line and stored to an array for later analysis. I tried reading the file entirely in one go using std.file : readText and using std.algorithm : splitter for lazily splitting newline but there is no difference in speed, so I used the byLine approach mentioned in the linked blog. Is there a better way of doing this in D?
>>
>> Note:
>> I ran GC profiling as mentioned in linked blog. The results were:
>>
>> Number of collections:  3
>>      Total GC prep time:  0 milliseconds
>>      Total mark time:  0 milliseconds
>>      Total sweep time:  0 milliseconds
>>      Total page recovery time:  0 milliseconds
>>      Max Pause Time:  0 milliseconds
>>      Grand total GC time:  2 milliseconds
>> GC summary:   12 MB,    3 GC    2 ms, Pauses    0 ms <    0 ms
>>
>> So GC does not seem to be an issue.
>>
>> Here's the D script:
>>
>> import std.stdio;
>> import std.string;
>> import std.array;
>> import std.algorithm : splitter;
>> import std.typecons : tuple, Tuple;
>> import std.conv : to;
>>
>> void read_log(string filename) {
>>      File file = File(filename, "r");
>>      Tuple!(char[], int, char[])[] npushed;
>>      Tuple!(int, char[], int, bool, bool)[] pushed;
>>      foreach (line; file.byLine) {
>>          if (line.indexOf("SOC_NOT_PUSHED") != -1) {
>>              auto tarr = line.split();
>>              npushed ~= tuple(tarr[2] ~ tarr[3], to!int(tarr[$ - 1]),
>> tarr[$ - 2]);
>>              continue;
>>          }
>>          if (line.indexOf("SYNC_PUSH:") != -1) {
>>              auto rel = line.split("SYNC_PUSH:")[1].strip();
>>              auto att = rel.split(" at ");
>>              auto ina = att[1].split(" in ");
>>              auto msa = ina[1].split(" ms ");
>>              pushed ~= tuple(to!int(att[0]), ina[0], to!int(msa[0]),
>>                      msa[1].indexOf("PA-SOC_POP") != -1,
>> msa[1].indexOf("CU-SOC_POP") != -1);
>>          }
>>      }
>>      // Using the arrays later on in production script
>>      writeln(npushed.length);
>>      writeln(pushed.length);
>> }
>>
>>
>> Here is Python function:
>>
>> def read_log(fname):
>>      try:
>>          with open(fname, 'r') as f:
>>              raw = f.read().splitlines()
>>              ns = [s.split() for s in raw if 'SOC_NOT_PUSHED' in s]
>>              ss = [w.split("SYNC_PUSH:")[1].strip() for w in raw if
>> 'SYNC_PUSH:' in w]
>>              not_pushed = [[s[2]+s[3], int(s[-1]), s[-2]] for s in ns]
>>              ww = [(int(e.split(' at ')[0]), e.split(' at ')[1].split('
>> in ')[0], int(e.split(' at ')[1].split(' in ')[1].split(' ms ')[0]),
>> set(e.split(' at ')[1].split(' in ')[1].split(' ms ')[1].split())) for e in
>> ss]
>>              pushed = [[w[0], w[1], w[2], 1 if 'PA-SOC_POP' in w[3] else
>> 0, 1 if 'CU-SOC_POP' in w[3] else 0] for w in ww]
>>              return not_pushed, pushed
>>      except:
>>          return []
>>
>>
> The code isn't entirely 1:1. Any usage of IO (includes stdout via writeln) is expensive. Your python code doesn't write anything to stdout (or perform any calls). It would also be good to get the results of dmd -release as well.
>


June 09, 2017
Dne 9.6.2017 v 09:50 rikki cattermole via Digitalmars-d-learn napsal(a):

> On 09/06/2017 8:34 AM, uncorroded wrote:
>> Hi guys,
>>
>> ...
>
> The code isn't entirely 1:1. Any usage of IO (includes stdout via writeln) is expensive. Your python code doesn't write anything to stdout (or perform any calls). It would also be good to get the results of dmd -release as well.
Not in this case, he only write two numbers at the end, so I guess there will be no difference
June 09, 2017
On Fri, Jun 9, 2017 at 9:34 AM, uncorroded via Digitalmars-d-learn < digitalmars-d-learn@puremagic.com> wrote:

> Hi guys,
>
> I am a beginner in D. As a project, I converted a log-parsing script in Python which we use at work, to D. This link was helpful - ( https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ ) I compiled it with dmd and ldc. The log file is 52 MB. With dmd (not release build), it takes 1.1 sec and with ldc, it takes 0.3 sec.
>
> The Python script (run with system python, not Pypy) takes 0.75 sec. The D
> and Python functions are here and on pastebin ( D -
> https://pastebin.com/SeUR3wFP , Python - https://pastebin.com/F5JbfBmE ).
>
> Basically, i am reading a line, checking for 2 constants. If either one is found, some processing is done on line and stored to an array for later analysis. I tried reading the file entirely in one go using std.file : readText and using std.algorithm : splitter for lazily splitting newline but there is no difference in speed, so I used the byLine approach mentioned in the linked blog. Is there a better way of doing this in D?
>
> There is no difference in speed because you do not process your data
lazily, so you make many allocations, so this is main reason why it is so slow. I could improve that, but I will need to see some example data, which you are trying to parse.

But some rules,
1.) instead of ~= you shoud use std.array.appender
2.) instead of std.string.split you could use std.algorithm.splitter or
std.algorithm.findSplit
3.) instead of indexOf I would use std.algorithm.startsWith (in case it is
on the begining of the line)


June 09, 2017
On Friday, 9 June 2017 at 08:58:38 UTC, Daniel Kozak wrote:

>> There is no difference in speed because you do not process your data
> lazily, so you make many allocations, so this is main reason why it is so slow. I could improve that, but I will need to see some example data, which you are trying to parse.
>
> But some rules,
> 1.) instead of ~= you shoud use std.array.appender
> 2.) instead of std.string.split you could use std.algorithm.splitter or
> std.algorithm.findSplit
> 3.) instead of indexOf I would use std.algorithm.startsWith (in case it is
> on the begining of the line)

Thanks everyone for the tips.
The log file itself is 52 MB but I have added a sample in pastebin ( https://pastebin.com/vj778PK4 ). Will try the suggestions today evening.
June 09, 2017
import std.stdio;
import std.array: appender, array;
import std.algorithm : findSplit, splitter, joiner, canFind, map;
import std.typecons : tuple, Tuple;
import std.conv : to;
import std.range : dropOne, dropExactly, takeExactly, chain;

alias push_type = Tuple!(int, char[], int, bool, bool);
alias npush_type = Tuple!(char[], int, char[]);

void read_log(string filename) {
    File file = File(filename, "r");
    auto npushed = appender!(npush_type[])();
    auto pushed = appender!(push_type[])();
    foreach (line; file.byLine) {
        if (auto findResult = line.findSplit(" SYNC_PUSH: ")) {
            auto rel = findResult[2];
            auto att = rel.splitter(" ");

            auto firstVal = att.front.to!int;
            auto secondVal = att.dropExactly(2).takeExactly(2).joiner("
").to!(char[]).dup;
            auto thirdVal = att.dropExactly(5).front.to!int;
            auto fourthVal = findResult[2].canFind("PA-SOC_POP");
            auto fifthVal = findResult[2].canFind("CU-SOC_POP");
            pushed.put(tuple(firstVal, secondVal, thirdVal, fourthVal,
fifthVal));
            continue;
        }
        if (auto findResult = line.findSplit(" SOC_NOT_PUSHED: ")) {
            auto leftPart = findResult[0].splitter(" ").dropExactly(2)
                                                       .takeExactly(2);
            auto rightPart = findResult[2].splitter(" ").takeExactly(2);
            auto firstVal = chain(leftPart.front,
leftPart.dropOne.front).to!(char[]);
            auto thirdVal = rightPart.front.to!(char[]).dup;
            auto secondVal = rightPart.dropOne.front.to!int;
            npushed.put(tuple(firstVal, secondVal, thirdVal));
            continue;
        }
    }
    // Doing more stuff with these arrays later. For now, just printing
lengths
    writeln(npushed.data.length);
    writeln(pushed.data.length);
}

On Fri, Jun 9, 2017 at 12:01 PM, uncorroded via Digitalmars-d-learn < digitalmars-d-learn@puremagic.com> wrote:

> On Friday, 9 June 2017 at 08:58:38 UTC, Daniel Kozak wrote:
>
> There is no difference in speed because you do not process your data
>>>
>> lazily, so you make many allocations, so this is main reason why it is so slow. I could improve that, but I will need to see some example data, which you are trying to parse.
>>
>> But some rules,
>> 1.) instead of ~= you shoud use std.array.appender
>> 2.) instead of std.string.split you could use std.algorithm.splitter or
>> std.algorithm.findSplit
>> 3.) instead of indexOf I would use std.algorithm.startsWith (in case it is
>> on the begining of the line)
>>
>
> Thanks everyone for the tips.
> The log file itself is 52 MB but I have added a sample in pastebin (
> https://pastebin.com/vj778PK4 ). Will try the suggestions today evening.
>


June 09, 2017
On Friday, 9 June 2017 at 14:19:48 UTC, Daniel Kozak wrote:
> import std.stdio;
> import std.array: appender, array;
> import std.algorithm : findSplit, splitter, joiner, canFind, map;
> import std.typecons : tuple, Tuple;
> import std.conv : to;
> import std.range : dropOne, dropExactly, takeExactly, chain;
>
> alias push_type = Tuple!(int, char[], int, bool, bool);
> alias npush_type = Tuple!(char[], int, char[]);
>
> void read_log(string filename) {
>     File file = File(filename, "r");
>     auto npushed = appender!(npush_type[])();
>     auto pushed = appender!(push_type[])();
>     foreach (line; file.byLine) {
>         if (auto findResult = line.findSplit(" SYNC_PUSH: ")) {
>             auto rel = findResult[2];
>             auto att = rel.splitter(" ");
>
>             auto firstVal = att.front.to!int;
>             auto secondVal = att.dropExactly(2).takeExactly(2).joiner("
> ").to!(char[]).dup;
>             auto thirdVal = att.dropExactly(5).front.to!int;
>             auto fourthVal = findResult[2].canFind("PA-SOC_POP");
>             auto fifthVal = findResult[2].canFind("CU-SOC_POP");
>             pushed.put(tuple(firstVal, secondVal, thirdVal, fourthVal,
> fifthVal));
>             continue;
>         }
>         if (auto findResult = line.findSplit(" SOC_NOT_PUSHED: ")) {
>             auto leftPart = findResult[0].splitter(" ").dropExactly(2)
>                                                        .takeExactly(2);
>             auto rightPart = findResult[2].splitter(" ").takeExactly(2);
>             auto firstVal = chain(leftPart.front,
> leftPart.dropOne.front).to!(char[]);
>             auto thirdVal = rightPart.front.to!(char[]).dup;
>             auto secondVal = rightPart.dropOne.front.to!int;
>             npushed.put(tuple(firstVal, secondVal, thirdVal));
>             continue;
>         }
>     }
>     // Doing more stuff with these arrays later. For now, just printing
> lengths
>     writeln(npushed.data.length);
>     writeln(pushed.data.length);
> }
>

Hi Daniel,

Thanks a lot for the code. I tested it on our production log file and it takes 0.2 sec (50% improvement over old time)
I tried using just the appender and it did not make a significant difference. On the other hand, just changing the inner loop to use std.algorithm and range seems to make a big difference.Is there a good resource to read about the good stuff in std.algorithm and range? I tried going through the library docs but they are too exhaustive!

Thanks :)
June 09, 2017
On Fri, Jun 09, 2017 at 04:28:08PM +0000, uncorroded via Digitalmars-d-learn wrote: [...]
> Is there a good resource to read about the good stuff in std.algorithm and range? I tried going through the library docs but they are too exhaustive!
[...]

Try these:

	http://ddili.org/ders/d.en/ranges.html
	http://www.informit.com/articles/article.aspx?p=1407357


T

-- 
Gone Chopin. Bach in a minuet.
June 09, 2017
btw it is important to use right compiler with right flags I am using ldc
with this flags
ldmd2 -O -release -boundscheck=off

On Fri, Jun 9, 2017 at 6:28 PM, uncorroded via Digitalmars-d-learn < digitalmars-d-learn@puremagic.com> wrote:

> On Friday, 9 June 2017 at 14:19:48 UTC, Daniel Kozak wrote:
>
>> import std.stdio;
>> import std.array: appender, array;
>> import std.algorithm : findSplit, splitter, joiner, canFind, map;
>> import std.typecons : tuple, Tuple;
>> import std.conv : to;
>> import std.range : dropOne, dropExactly, takeExactly, chain;
>>
>> alias push_type = Tuple!(int, char[], int, bool, bool);
>> alias npush_type = Tuple!(char[], int, char[]);
>>
>> void read_log(string filename) {
>>     File file = File(filename, "r");
>>     auto npushed = appender!(npush_type[])();
>>     auto pushed = appender!(push_type[])();
>>     foreach (line; file.byLine) {
>>         if (auto findResult = line.findSplit(" SYNC_PUSH: ")) {
>>             auto rel = findResult[2];
>>             auto att = rel.splitter(" ");
>>
>>             auto firstVal = att.front.to!int;
>>             auto secondVal = att.dropExactly(2).takeExactly(2).joiner("
>> ").to!(char[]).dup;
>>             auto thirdVal = att.dropExactly(5).front.to!int;
>>             auto fourthVal = findResult[2].canFind("PA-SOC_POP");
>>             auto fifthVal = findResult[2].canFind("CU-SOC_POP");
>>             pushed.put(tuple(firstVal, secondVal, thirdVal, fourthVal,
>> fifthVal));
>>             continue;
>>         }
>>         if (auto findResult = line.findSplit(" SOC_NOT_PUSHED: ")) {
>>             auto leftPart = findResult[0].splitter(" ").dropExactly(2)
>>                                                        .takeExactly(2);
>>             auto rightPart = findResult[2].splitter(" ").takeExactly(2);
>>             auto firstVal = chain(leftPart.front,
>> leftPart.dropOne.front).to!(char[]);
>>             auto thirdVal = rightPart.front.to!(char[]).dup;
>>             auto secondVal = rightPart.dropOne.front.to!int;
>>             npushed.put(tuple(firstVal, secondVal, thirdVal));
>>             continue;
>>         }
>>     }
>>     // Doing more stuff with these arrays later. For now, just printing
>> lengths
>>     writeln(npushed.data.length);
>>     writeln(pushed.data.length);
>> }
>>
>>
> Hi Daniel,
>
> Thanks a lot for the code. I tested it on our production log file and it
> takes 0.2 sec (50% improvement over old time)
> I tried using just the appender and it did not make a significant
> difference. On the other hand, just changing the inner loop to use
> std.algorithm and range seems to make a big difference.Is there a good
> resource to read about the good stuff in std.algorithm and range? I tried
> going through the library docs but they are too exhaustive!
>
> Thanks :)
>