For a very long time I have been using bash, grep, sed, awk, the usual suspects on Unix, as they are super quick to type, incremental, etc. Once the complexity gets too big I usually switch to Python (decades ago it might have been Perl or PHP).
I often embed small snippets of grep or awk in other tools that just need to do something with some text files, for example some pre-processing for plotting in gnuplot.
I even wrote my own line-column processing "language", called kolumny (over a decade ago), to help with similar tasks. And while it does work well, I rarely use it (once a year these days, sadly), because it is not really a full language.
Yesterday I needed to do some simple processing before plotting in gnuplot:
set ylabel "locking rate [M/s]"
plot "<grep ^mx1 foo.txt" using 3:($3*$4/$9/1e6) title "RWMutex", \
"<grep ^mx2 foo.txt" using 3:($3*$4/$9/1e6) title "drwMutex"
where the file foo.txt has things like this:
mx1 32 1 10000000 0.0001 1 100 100 0.552091302 552.091302ms
mx1 32 1 10000000 0.0001 1 100 100 0.552518653 552.518653ms
mx1 32 1 10000000 0.0001 1 100 100 0.562133796 562.133796ms
...
mx2 32 1 10000000 0.0001 1 100 100 0.613519317 613.519317ms
mx2 32 1 10000000 0.0001 1 100 100 0.602255619 602.255619ms
...
mx1 32 2 10000000 0.0001 1 100 100 1.489152483 1.489152483s
mx1 32 2 10000000 0.0001 1 100 100 1.469110205 1.469110205s
...
...
mx2 32 64 10000000 0.0001 1 100 100 8.84282034 8.84282034s
Ok, so my gnuplot script works, but now I have a lot of points for each x.
I would like to take the max throughput (lowest time, in column 9), and only use that. Or maybe the median. (Definitely not the average tho.)
And I didn't feel like doing this in awk.
So I started exploring rdmd a bit:
First or second attempt:
rdmd --eval='float[][int] g; foreach (line; stdin.byLine.filter!(x=>x.matchFirst("^mx1"))) { auto a = line.split; auto c=a[2].to!int; auto rate=c * a[3].to!float / a[8].to!float; g[c] ~= rate; } foreach (c, values; g) { writeln(c, " ", values.reduce!max); }' < foo.txt
Removed redundant () to make the code shorter.
45 2.29471e+07
26 2.25617e+07
52 2.26505e+07
43 2.30352e+07
17 2.32184e+07
34 2.33697e+07
60 2.26649e+07
61 2.25918e+07
...
Ok. "Works"
That is not good for a few reasons.
- Still kind of long
- Cannot easily embed into a gnuplot script, because of the usage of both ' and ".
- I do group by c, using a map (associative array), but that means that during print it will be unordered. If I switch to plotting using lines instead of the default points, I want ascending order, otherwise the plot will be a chaos of lines. This could be fixed by piping the output to sort -n -k 1, but that a) is less efficient, b) makes things even longer. The obvious way would be to remember the previous c, and aggregate on the fly. Faster, ordered by design (because the input is ordered), less memory usage.
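To make the "remember the previous c and aggregate on the fly" idea concrete, here is a rough sketch in Python (the language of my dm runner script further down), not in D. The function name best_rate_per_group and the sample rows are made up for illustration; the column positions follow foo.txt. It assumes, like the idea itself, that rows with the same c are adjacent in the input.

```python
import itertools
import statistics


def best_rate_per_group(lines, prefix="mx1", agg=max):
    """Yield (c, aggregated rate) for each run of adjacent rows sharing c."""
    rows = (line.split() for line in lines if line.startswith(prefix))
    # groupby only merges *adjacent* rows with the same key, which is the
    # "remember the previous c" trick; it relies on the input being ordered.
    for c, group in itertools.groupby(rows, key=lambda a: int(a[2])):
        rates = [int(a[2]) * float(a[3]) / float(a[8]) for a in group]
        yield c, agg(rates)


# Made-up sample rows in the same layout as foo.txt.
sample = [
    "mx1 32 1 10000000 0.0001 1 100 100 0.5 500ms",
    "mx1 32 1 10000000 0.0001 1 100 100 0.25 250ms",
    "mx2 32 1 10000000 0.0001 1 100 100 0.125 125ms",
    "mx1 32 2 10000000 0.0001 1 100 100 1.0 1s",
]
for c, rate in best_rate_per_group(sample):
    print(c, rate)
```

Passing agg=statistics.median instead of max would give the median variant, at the cost of holding one group's rates in memory at a time.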
Next attempt (not fully correct), trying to rectify a few things incrementally, not shooting for the perfect solution yet, just exploring a bit more:
rdmd --eval='auto prev_c = 0; auto max_rate=0.0; foreach (a; stdin.byLine.filter!(x=>x.matchFirst(``^mx1``)).map!split) { auto c=a[2].to!int; auto rate=c * a[3].to!float / a[8].to!float; if (prev_c != c) { writeln(prev_c, `` ``, max_rate); max_rate=0;} prev_c=c;max_rate=max(max_rate,rate);}' < foo.txt
0 0
1 1.81999e+07
2 1.3897e+07
3 1.68113e+07
4 1.77501e+07
5 1.77466e+07
6 2.00162e+07
7 2.00754e+07
8 2.24083e+07
9 2.43998e+07
...
63 2.24421e+07
Some progress, but not quite there (obviously). We do not output a line for 64, because the check for prev_c != c is only inside the loop; we need another writeln after the loop.
Let's fix this then.
rdmd --eval='auto prev_c = 0; auto max_rate=0.0; foreach (a; stdin.byLine.filter!(x=>x.matchFirst(``^mx1``)).map!split) { auto c=a[2].to!int; auto rate=c * a[3].to!float / a[8].to!float; if (prev_c != c) { writeln(prev_c, `` ``, max_rate); max_rate=0;} prev_c=c;max_rate=max(max_rate,rate);} writeln(prev_c, `` ``, max_rate); max_rate=0;' < foo.txt
A bit hairy, but it does the job. (It still prints the initial 0 line, but that is easy to fix with something like if (prev_c != c && prev_c).)
Let's reimplement it in awk, for an unfair comparison:
awk 'BEGIN{prev_c = 0; max_rate=0.0;} /^mx1/{ c=$3; rate=c*$4/$9; if (prev_c != c) { print prev_c, max_rate; max_rate=0;} prev_c=c;if(rate>max_rate)max_rate=rate;} END{print prev_c, max_rate;}' < foo.txt
Quite a bit shorter.
There are things that would be hard to do in D, but are still possible.
- auto x = ..., replace with x := ... (like in Go). This could be done with a simple preprocessor (even just a sed -E -e 's/([a-zA-Z0-9_]+) *:=/auto \1=/g') before passing to gdmd.
- /regexp/{} /regexp/{}, and foreach (a......), replace with an abstraction that does this for us.
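As a quick illustration of the := preprocessing step, here is a small Python sketch using the same kind of substitution as the sed one-liner. The function name expand_short_decls is hypothetical. It is purely textual, so a := inside a string literal would be rewritten too, which I assume is acceptable for one-liners.

```python
import re

# An identifier followed by ":=", as in the sed expression above.
DECL = re.compile(r"([a-zA-Z_][a-zA-Z0-9_]*) *:=")


def expand_short_decls(code):
    """Rewrite Go-style `x := ...` into D's `auto x=...` (textual rewrite only)."""
    return DECL.sub(r"auto \1=", code)


print(expand_short_decls("prev_c := 0; max_rate:=0.0;"))
```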
Such an abstraction should be possible to implement, probably with an API like this:
each( // implicitly on stdin.byLine()
    "^mx1",
    (a, m) => { // a is just line split on whitespaces,
                // m is regexp match groups (optional)
        c := a[2].to!int;
        ...
    },
    ..., // more matchers.
    ..., // All matching matchers are executed in order, not just the first one.
    ..., // delegate with no preceding matcher, is equivalent to ".*" matching.
    ...);
We can accept both void delegates, or ones returning int, e.g. if we want to do something like a loop break. But in scripting, instead of a break in the main loop, you will usually just exit the whole script. So not super useful. (continue works by just returning from a void delegate, so not a concern.)
A more advanced each could allow multiple predicates, multiple regexps, and possibly some conditions (&&, ||). We can invent a mini DSL for this, or use operator overloading (maybe, as not all operators are overloadable in D, e.g. overloading comparison operators is very problematic in D; it was possible in D1, but not in D2).
We can also add the original full line (unsplit) as the first element of a, so a[0] is just like awk's $0 (whole line), and a[1] is just like $1 (columns, with the first one being $1).
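To pin down the semantics I have in mind for each (all matching handlers run in order, a handler without a preceding regexp matches every line, a[0] is the whole line), here is a rough Python sketch. It is not the D implementation; the explicit lines parameter is an assumption made to keep the example self-contained, whereas the D version would read stdin implicitly.

```python
import re


def each(lines, *args):
    """Run handlers over lines, awk-style.

    args is a mix of regexp strings and callables. A callable preceded by a
    regexp runs only on matching lines; a callable with no preceding regexp
    runs on every line. All matching handlers run, in the order given.
    """
    rules = []
    pending = None
    for arg in args:
        if callable(arg):
            # No preceding regexp: the empty pattern matches every line.
            rules.append((re.compile(pending if pending is not None else ""), arg))
            pending = None
        else:
            pending = arg
    for line in lines:
        line = line.rstrip("\n")
        a = [line] + line.split()  # a[0] is the whole line, a[1] is awk's $1
        for pattern, handler in rules:
            m = pattern.search(line)
            if m:
                handler(a, m)


seen = []
each(
    ["mx1 32 1", "mx2 32 1"],
    "^mx1", lambda a, m: seen.append(("mx1", a[3])),
    lambda a, m: seen.append(("any", a[0])),
)
print(seen)
```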
Note: We do not want to put this each implicitly into a runner script, because often we want to do things before it. This could be done with something like --begin and --end, but that is more verbose. Plus --begin and --end would make it harder to port command line code to a file based script.
On the other front, to!int, we can do better too. One option is to provide short helper functions for common type conversions like to!int and to!float.
So instead of:
c := a[2].to!int;
we do
c := a[2].INT;
rate:=c * a[3].F32 / a[8].F32;
Ok, how about making each smarter: instead of just splitting the input line into columns of strings (string[]), it puts each value into a custom library type that provides dynamic typing. Something like DynamicTypeValue[], with operator overloading for arithmetic and comparison, and a toString function.
c := a[2];
rate := c * a[3] / a[8];
Surely possible.
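The intended behaviour of such a value type can be sketched in Python. Python is already dynamically typed, so this class only illustrates the semantics I want from the D type: values are stored as strings, act as numbers when they parse as numbers, and raise an error otherwise (dynamic, but still strong typing). The helper _num is a name made up for this sketch.

```python
class DT:
    """A string-backed value that behaves as a number when it parses as one."""

    def __init__(self, x):
        self.x = str(x)

    def number(self):
        # Raises ValueError for non-numeric text instead of guessing.
        return float(self.x)

    @staticmethod
    def _num(v):
        # Accept another DT as well as plain ints, floats and numeric strings.
        return v.number() if isinstance(v, DT) else float(v)

    def __mul__(self, other):
        return DT(self.number() * DT._num(other))

    def __truediv__(self, other):
        return DT(self.number() / DT._num(other))

    def __eq__(self, other):
        return self.number() == DT._num(other)

    def __lt__(self, other):
        return self.number() < DT._num(other)

    def __str__(self):
        return self.x


a = [DT(s) for s in "mx1 32 2 10000000 0.0001 1 100 100 0.5 500ms".split()]
c = a[2]
rate = c * a[3] / a[8]
print(c, rate)
```

With this, c * a[3] / a[8] works without any explicit conversions, while something like a[0] * 2 fails loudly because "mx1" is not a number.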
Let's also add an awk-like print (similar to Python's print), which puts a space between each argument for us, and for good measure, let's use the old PHP echo construct, to save one extra character.
How this would look:
./dm 'prev_c := 0;max_rate:=0.0; each("^mx1", (DT[] a){ c:=a[2]; rate:=c * a[3] / a[8]; if (prev_c != c) { echo(prev_c, max_rate); max_rate=0;} prev_c=c;max_rate=max(max_rate,rate);}); echo(prev_c, max_rate);' ./foo.txt
That looks pretty nice. Not optimal, but not too bad. Only 14 more characters than awk (203 bytes, vs 189).
Note: I do not quite have a full solution for DynamicTypeValue (it is missing hashing support, which it needs before it can be used as a key in an associative array), but the prototype is kind of working.
Unfortunately it is not quite working, even with some tries:
$ ./dm ....
/tmp/.rdmd-1000/eval.75B7D99A106E2F3F190C7D5398C1A329.d(122): Error: cannot implicitly convert expression `c` of type `DT` to `int`
/tmp/.rdmd-1000/eval.75B7D99A106E2F3F190C7D5398C1A329.d(122): Error: cannot implicitly convert expression `rate` of type `DT` to `double`
Failed: ["/usr/bin/dmd", "-d", "-v", "-o-", "/tmp/.rdmd-1000/eval.75B7D99A106E2F3F190C7D5398C1A329.d", "-I/tmp/.rdmd-1000"]
...
This boils down to:
int prev_c = 0;
prev_c = DT("1");
not compiling. I defined opCast, but this is only used for explicit casts.
If I were able to allow semi-implicit casts for my type, that would work perfectly.
There was also a small issue with max: std.algorithm.comparison.max complains a bit about comparing DT and double:
/tmp/.rdmd-1000/eval.EA89F8F1475E6A614DCFA85E8098FEFF.d(122): Error: none of the overloads of template `std.algorithm.comparison.max` are callable using argument types `!()(double, DT)`
/usr/include/dmd/phobos/std/algorithm/comparison.d(1644): Candidates are: `max(T...)(T args)`
with `T = (double, DT)`
whose parameters have the following constraints:
`~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
` T.length >= 2
> !is(CommonType!T == void)
` `~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
/usr/include/dmd/phobos/std/algorithm/comparison.d(1681): `max(T, U)(T a, U b)`
with `T = double,
U = DT`
whose parameters have the following constraints:
`~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
` > is(T == U)
- is(typeof(a < b))
` `~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
Tip: not satisfied constraints are marked with `>`
Fair enough, I could provide my own max and min, and possibly a few more functions (e.g. functions like std.math.sqrt, abs, etc.), to operate easily on DT. It is hard to do this fully transparently for everything, but it should be possible to cover at least everything that awk has too.
Doing the $1 -> a[0] translation is trivial using some regular expressions. It could save 2 characters, but that is not a lot.
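A rough Python sketch of that translation, in the same spirit as the := rewrite in the dm script. The function name expand_field_refs is hypothetical. Only $ followed by a positive number is touched, so D's own $ in slices like args[$-1] is left alone, but a $1 inside a string literal would be rewritten too.

```python
import re


def expand_field_refs(code, name="a"):
    """Rewrite awk-style $N (N >= 1) into the zero-based index a[N-1]."""
    return re.sub(
        r"\$([1-9][0-9]*)",
        lambda m: f"{name}[{int(m.group(1)) - 1}]",
        code,
    )


print(expand_field_refs("c:=$3; rate:=c * $4 / $9;"))
```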
In summary:
So, in pure form, the D language and rdmd are usable, but rather verbose (mostly due to auto, long function names like writeln, and the extra arguments they require for putting spaces between arguments). But still usable. The script I wrote would probably be close to the limit of what would be acceptable, which is not great, because the example script does very little.
With some hacks, preprocessing, and an extra library type and functions, it is possible to make usage way easier, code way shorter, and very comparable to awk. (I didn't test other functions like open, or operating on files, but it should not be too dissimilar.)
Some operator overloading facilities of the D programming language are lacking to make it fully usable tho.
The inability to opt in to implicit opCast conversions makes it impossible to develop a fully dynamic and easy to use solution.
What do you think?
For reference, the dm script:
#!/usr/bin/env python3
import os
import re
import subprocess
import sys
code = sys.argv[1]
filenames = sys.argv[2:]
header = """
struct DT {
    string x_;
    this(string x) { x_ = x; }
    this(float x) { x_ = to!string(x); }
    this(int x) { x_ = to!string(x); }
    // string toString() const { return to!string(x_); }
    string toString() const { return x_; }
    bool can(T)() const {
        try { to!T(x_); } catch { return false; } return true;
    }
    bool numeric() const { return can!double(); }
    double number() const { return to!double(x_); }
    auto opBinary(string op)(const ref DT other) const {
        if (numeric() && other.numeric()) {
            const n = number();
            const m = other.number();
            return DT(to!string(mixin("n " ~ op ~ " m")));
        }
        throw new Exception("cannot perform " ~ op ~ " on string");
    }
    auto opBinary(string op, Other)(const ref Other other) const {
        if (numeric()) {
            // static assert(is(other : float, double, int, uint));
            // TODO(baryluk): We could maybe support adding string too. Not super useful tho.
            // I want dynamic typing, but still to be strong typing. Not weak like PHP or JavaScript.
            return DT(to!string(mixin("number() " ~ op ~ " other")));
        }
        // We could possibly allow number + string, and string + string, and string * int
        throw new Exception("cannot perform " ~ op ~ " on string");
    }
    // opUnary, -, ~
    // negation, ! - i.e. for !c, where c is a string representing an integer, the result is true if c == "0".
    // todo support some bool?
    int opCmp(const ref const(DT) other) const {
        if (numeric() && other.numeric()) {
            const n = number();
            const m = other.number();
            return (n > m) - (n < m);
        }
        if (!numeric() && !other.numeric()) {
            return x_ < other.x_;
        }
        throw new Exception("cannot compare string with other");
    }
    int opCmp(Other)(const ref Other other) const {
        // static if (is(Other: int, float, ...));
        if (numeric()) {
            const n = number();
            return (n > other) - (n < other); // Quick hack
        }
        static if (is(Other == string)) {
            return x_ < other;
        } else {
            throw new Exception("cannot compare string with other");
        }
    }
    bool opEquals(const ref DT other) const {
        return this.opCmp(other) == 0;
    }
    bool opEquals(Other)(const ref Other other) const {
        return this.opCmp(other) == 0;
    }
    // This also handles !value
    bool opCast(T)() const if (is(T == bool)) {
        if (numeric()) {
            return !number();
        }
        return !x_;
    }
    auto opCast(T)() const if (is(T == string)) {
        return x_;
    }
    auto opCast(T)() const { // if T is numeric, i.e. int, double
        pragma(msg, "casting to", T);
        return x_.number();
    }
    auto opAssign(const ref DT other) {
        x_ = other.x_;
        return this;
    }
    auto opAssign(Other)(const ref Other other) {
        x_ = to!string(other);
        return this;
    }
}
void echo(T...)(T args) {
    foreach (arg; args[0..$-1]) {
        write(arg);
        write(' ');
    }
    writeln(args[$-1]);
}
void each(D)(string re, D dg) { // just an initial prototype
    foreach (line; stdin.byLine) {
        if (line.matchFirst(re)) {
            dg(line.map!split().map!(x=>new DT(x))());
        }
    }
}
"""
code = re.sub(r"([a-zA-Z_][a-zA-Z0-9_]*) *:=", r" auto \1=", code)
# print(header+code)
with subprocess.Popen(["rdmd", f"--eval={header}{code}"], stdin=subprocess.PIPE, text=True) as p:
    for filename in filenames:
        with open(filename) as f:
            for line in f:
                p.stdin.write(line)
print(p)