dmd command line scripting experiments and observations
December 25

For a very long time I have been using bash, grep, sed, awk, the usual suspects on Unix, as they are super quick to type, incremental, etc. Once the complexity is too big, I usually switch to Python (decades ago it might have been Perl or PHP).

I often embed small snippets of grep or awk in other tools that just need to do something with some text files, for example to do some pre-processing for plotting in Gnuplot.

I even wrote my own line-column processing "language", called kolumny (over a decade ago), to help with similar tasks. And while it does work well, I rarely use it (once a year these days, sadly), because it is not really a full language.

Yesterday I needed to do some simple processing before plotting in gnuplot:

set ylabel "locking rate [M/s]"
plot "<grep ^mx1 foo.txt" using 3:($3*$4/$9/1e6) title "RWMutex", \
     "<grep ^mx2 foo.txt" using 3:($3*$4/$9/1e6) title "drwMutex"

where a file foo.txt has things like this:

mx1 32 1 10000000 0.0001 1 100 100 0.552091302 552.091302ms
mx1 32 1 10000000 0.0001 1 100 100 0.552518653 552.518653ms
mx1 32 1 10000000 0.0001 1 100 100 0.562133796 562.133796ms
...
mx2 32 1 10000000 0.0001 1 100 100 0.613519317 613.519317ms
mx2 32 1 10000000 0.0001 1 100 100 0.602255619 602.255619ms
...
mx1 32 2 10000000 0.0001 1 100 100 1.489152483 1.489152483s
mx1 32 2 10000000 0.0001 1 100 100 1.469110205 1.469110205s
...

...
mx2 32 64 10000000 0.0001 1 100 100 8.84282034 8.84282034s

Ok, so my gnuplot script works, but now I have a lot of points for each x.

I would like to take the max throughput (lowest time, in column 9) for each x, and only use that. Or maybe the median. (Definitely not the average tho.)

And I didn't feel like doing this in awk.

So I started exploring rdmd a bit:

First or second attempt:

rdmd --eval='float[][int] g; foreach (line; stdin.byLine.filter!(x=>x.matchFirst("^mx1"))) { auto a = line.split; auto c=a[2].to!int; auto rate=c * a[3].to!float / a[8].to!float; g[c] ~= rate; } foreach (c, values; g) { writeln(c, " ", values.reduce!max); }' < foo.txt

Removed redundant () to make code shorter.

45 2.29471e+07
26 2.25617e+07
52 2.26505e+07
43 2.30352e+07
17 2.32184e+07
34 2.33697e+07
60 2.26649e+07
61 2.25918e+07
...

Ok. "Works"

That is not good for a few reasons.

  1. Still kind of long.
  2. Cannot easily be embedded into a gnuplot script, because it uses both ' and ".
  3. I group by c using a map (associative array), which means the output is unordered. If I switch to plotting with lines instead of the default points, I want ascending order, otherwise the plot will be a chaos of lines. This could be fixed by piping the output to sort -n -k 1, but that a) is less efficient, and b) makes things even longer. The obvious way would be to remember the previous c and aggregate on the fly: faster, ordered by design (because the input is ordered), and with less memory usage.
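As a reference for what the one-liners should compute, here is a minimal Python sketch of the "aggregate on the fly" idea from point 3. It is only an illustration: the helper name best_rates and its reducer parameter are made up, the sample rows are invented in the foo.txt layout, and it assumes that rows with the same c (column 3) are adjacent in the input.

```python
import itertools
import statistics

def best_rates(lines, prefix="mx1", reducer=max):
    """Yield (c, rate) for each run of consecutive rows with the same c.

    Hypothetical helper: columns follow foo.txt, i.e. column 3 is c,
    column 4 the iteration count and column 9 the elapsed seconds.
    """
    rows = (line.split() for line in lines if line.startswith(prefix))
    # groupby only merges adjacent rows, so the output order follows the
    # input order and only one group is held in memory at a time.
    for c, group in itertools.groupby(rows, key=lambda a: int(a[2])):
        rates = [c * float(a[3]) / float(a[8]) for a in group]
        yield c, reducer(rates)

# Made-up rows in the foo.txt layout, not real measurements.
sample = [
    "mx1 32 1 10000000 0.0001 1 100 100 0.5 500ms",
    "mx1 32 1 10000000 0.0001 1 100 100 0.25 250ms",
    "mx2 32 1 10000000 0.0001 1 100 100 0.8 800ms",
    "mx1 32 2 10000000 0.0001 1 100 100 1.0 1s",
]
for c, rate in best_rates(sample):
    print(c, rate)
```

Passing reducer=statistics.median gives the median variant instead of the maximum. Because each group is emitted when it ends, there is no sentinel "previous c" value to print by accident.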

Next attempt (not fully correct), trying to rectify a few things incrementally, not shooting for the perfect solution yet, just exploring a bit more:

rdmd --eval='auto prev_c = 0; auto max_rate=0.0; foreach (a; stdin.byLine.filter!(x=>x.matchFirst(``^mx1``)).map!split) { auto c=a[2].to!int; auto rate=c * a[3].to!float / a[8].to!float; if (prev_c != c) { writeln(prev_c, `` ``, max_rate); max_rate=0;} prev_c=c;max_rate=max(max_rate,rate);}' < foo.txt

0 0
1 1.81999e+07
2 1.3897e+07
3 1.68113e+07
4 1.77501e+07
5 1.77466e+07
6 2.00162e+07
7 2.00754e+07
8 2.24083e+07
9 2.43998e+07
...
63 2.24421e+07

Some progress, but not quite there (obviously). We do not output a line for 64, because the check for prev_c != c is only inside the loop; we need another writeln after the loop.

Let's fix this then.

rdmd --eval='auto prev_c = 0; auto max_rate=0.0; foreach (a; stdin.byLine.filter!(x=>x.matchFirst(``^mx1``)).map!split) { auto c=a[2].to!int; auto rate=c * a[3].to!float / a[8].to!float; if (prev_c != c) { writeln(prev_c, `` ``, max_rate); max_rate=0;} prev_c=c;max_rate=max(max_rate,rate);} writeln(prev_c, `` ``, max_rate); max_rate=0;' < foo.txt

A bit hairy, but it does the job. (It still prints the initial 0 line, but that is easy to fix with something like if (prev_c != c && prev_c).)

Let's reimplement it in awk, for an unfair comparison:

awk 'BEGIN{prev_c = 0; max_rate=0.0;} /^mx1/{ c=$3; rate=c*$4/$9; if (prev_c != c) { print prev_c, max_rate; max_rate=0;} prev_c=c;if(rate>max_rate)max_rate=rate;} END{print prev_c, max_rate;}' < foo.txt

Quite a bit shorter.

There are things that would be hard to do in D, but are still possible.

auto x = ..., replace with x:=... (like in Go). This could be done with a simple preprocessor (even just a sed -E -e 's/([a-zA-Z0-9_]+) *:=/auto \1=/g') before passing the code to gdmd.
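A small Python sketch of that rewrite, using the same pattern as the sed expression. The names SHORT_DECL and expand_short_decls are made up for illustration, and the substitution is purely textual, so it would also rewrite a := that happens to appear inside a string literal.

```python
import re

# Same pattern as the sed expression: an identifier followed by ":=".
SHORT_DECL = re.compile(r"([a-zA-Z0-9_]+) *:=")

def expand_short_decls(code):
    """Rewrite Go-style `name := value` into D's `auto name=value`."""
    return SHORT_DECL.sub(r"auto \1=", code)

print(expand_short_decls("prev_c := 0; max_rate:=0.0;"))
```

Ordinary assignments with a plain = are left untouched, since the pattern requires the colon.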

/regexp/{} /regexp/{}, and foreach (a......), replace with an abstraction that does this for us.

Should be possible to implement, probably with API like this:

each(   // implicitly on stdin.byLine()
    "^mx1",
    (a, m) => {    // a is just line split on whitespaces,
                   // m is regexp match groups (optional)
       c := a[2].to!int;
       ...
    },
    ...,  // more matchers.
    ...,  // All matching matchers are executed in order, not just the first one.
    ...,  // delegate with no preceding matcher, is equivalent to ".*" matching.
    ...);

We can accept both void delegates and ones returning int, e.g. if we want to do something like a loop break. But in scripting, instead of break in the main loop, you will usually just exit the whole script, so it is not super useful. (continue works by just returning from a void delegate, so it is not a concern.)

A more advanced each could allow multiple predicates, multiple regexps, and possibly some conditions (&&, ||). We can invent a mini DSL for this, or use operator overloading (maybe, as not all operators are overloadable in D, e.g. overloading comparison operators is very problematic in D; it was possible in D1, but not in D2).

We can also add the original full line (unsplit) as the first element of a, so a[0] is just like awk's $0 (whole line), and a[1] is just like $1 (columns, with the first one being $1).
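To make the intended semantics concrete, here is a rough Python sketch of such an each. It is not the D implementation: the lines keyword argument is an assumption added so the example can run without stdin, and a pattern that is not followed by a handler is simply ignored.

```python
import re
import sys

def each(*args, lines=None):
    """Run handlers over input lines, awk-style.

    args is a sequence of regex strings and callables. A callable is paired
    with the pattern immediately before it; a callable with no preceding
    pattern runs on every line. All matching handlers run, in order.
    """
    rules, pattern = [], None
    for arg in args:
        if callable(arg):
            regex = re.compile(pattern) if pattern is not None else None
            rules.append((regex, arg))
            pattern = None
        else:
            pattern = arg
    for line in (sys.stdin if lines is None else lines):
        line = line.rstrip("\n")
        a = [line] + line.split()  # a[0] is the whole line, a[1] is awk's $1
        for regex, handler in rules:
            m = regex.search(line) if regex is not None else None
            if regex is None or m:
                handler(a, m)

seen = []
each("^mx1", lambda a, m: seen.append(("mx1", a[3])),
     lambda a, m: seen.append(("any", a[0])),
     lines=["mx1 32 1 10", "mx2 32 1 20"])
print(seen)
```

The first handler only fires for the mx1 line, while the second one, having no matcher in front of it, fires for both lines.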

Note: We do not want to put this each implicitly into the runner script, because often we want to do things before it. This could be done with something like --begin and --end, but that is more verbose. Plus, --begin and --end would make it harder to port command line code to a file-based script.

On the to!int front, we can do better too. We could provide short helper functions for common type conversions like to!int and to!float.

So instead of:

       c := a[2].to!int;

we do

       c := a[2].INT;
       rate:=c * a[3].F32 / a[8].F32;

Ok, how about making each smarter, so that it does not just split the input line into columns of strings (string[]), but instead puts each value into a custom library type that provides dynamic typing? Something like DynamicTypeValue[], with operator overloading for arithmetic and comparison, and a toString function.

       c := a[2];
       rate := c * a[3] / a[8];

Surely possible.
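The semantics I have in mind can be sketched in Python. Python is dynamically typed already, so this says nothing about how hard it is in D; it only illustrates the behaviour: values are stored as text, arithmetic and comparisons work when the text is numeric, and fail loudly otherwise (dynamic but still strong typing). The class below is a made-up illustration, not the prototype itself.

```python
import operator

class DT:
    """Hypothetical dynamically typed cell: stores text, computes as a number."""

    def __init__(self, text):
        self.text = str(text)

    def number(self):
        try:
            return float(self.text)
        except ValueError:
            raise TypeError(f"cannot do arithmetic on string {self.text!r}") from None

    @staticmethod
    def _num(other):
        # Accept another DT or a plain int/float on the right-hand side.
        return other.number() if isinstance(other, DT) else float(other)

    def _binary(self, other, op):
        return DT(op(self.number(), self._num(other)))

    def __mul__(self, other): return self._binary(other, operator.mul)
    def __truediv__(self, other): return self._binary(other, operator.truediv)
    def __add__(self, other): return self._binary(other, operator.add)
    def __sub__(self, other): return self._binary(other, operator.sub)

    def __eq__(self, other): return self.number() == self._num(other)
    def __lt__(self, other): return self.number() < self._num(other)
    def __gt__(self, other): return self.number() > self._num(other)

    def __str__(self): return self.text

# Made-up row in the foo.txt layout.
a = [DT(x) for x in "mx1 32 2 10000000 0.0001 1 100 100 0.5".split()]
c = a[2]
rate = c * a[3] / a[8]
print(c, rate)
```

With this, max(0.0, rate) also works, because the comparison falls back to the methods defined on DT.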

Let's also add an awk-like print (similar to Python's print), which puts a space between arguments for us, and for good measure, let's use the old PHP echo construct, to save one extra character.

How this would look:

./dm 'prev_c := 0;max_rate:=0.0; each("^mx1", (DT[] a){ c:=a[2]; rate:=c * a[3] / a[8]; if (prev_c != c) { echo(prev_c, max_rate); max_rate=0;} prev_c=c;max_rate=max(max_rate,rate);}); echo(prev_c, max_rate);' ./foo.txt

That looks pretty nice. Not optimal, but not too bad. Only 14 more characters than awk (203 bytes, vs 189).

Note: I do not quite have a full solution for DynamicTypeValue (hashing support is missing, so it cannot yet be used as a key in an associative array), but the prototype is kind of working.

Unfortunately it is not quite working, even with some tries:

$ ./dm ....
/tmp/.rdmd-1000/eval.75B7D99A106E2F3F190C7D5398C1A329.d(122): Error: cannot implicitly convert expression `c` of type `DT` to `int`
/tmp/.rdmd-1000/eval.75B7D99A106E2F3F190C7D5398C1A329.d(122): Error: cannot implicitly convert expression `rate` of type `DT` to `double`
Failed: ["/usr/bin/dmd", "-d", "-v", "-o-", "/tmp/.rdmd-1000/eval.75B7D99A106E2F3F190C7D5398C1A329.d", "-I/tmp/.rdmd-1000"]
...

This boils down to:

int prev_c = 0;
prev_c = DT("1");

not compiling. I defined opCast, but this is only used for explicit casts.

If I were able to allow semi-implicit casts for my type, that would work perfectly.

There was also a small issue with max, std.algorithm.comparison.max complains a bit about comparing DT and double:

/tmp/.rdmd-1000/eval.EA89F8F1475E6A614DCFA85E8098FEFF.d(122): Error: none of the overloads of template `std.algorithm.comparison.max` are callable using argument types `!()(double, DT)`
/usr/include/dmd/phobos/std/algorithm/comparison.d(1644):        Candidates are: `max(T...)(T args)`
  with `T = (double, DT)`
  whose parameters have the following constraints:
  `~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
`    T.length >= 2
  > !is(CommonType!T == void)
`  `~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
/usr/include/dmd/phobos/std/algorithm/comparison.d(1681):                        `max(T, U)(T a, U b)`
  with `T = double,
       U = DT`
  whose parameters have the following constraints:
  `~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
`  > is(T == U)
  - is(typeof(a < b))
`  `~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
  Tip: not satisfied constraints are marked with `>`

Fair enough, I could provide my own max and min, and possibly a few more functions (e.g. std.math.sqrt, abs, etc.), to operate easily on DT. It is hard to do this fully transparently for everything, but it should be possible to cover at least everything that awk has.

Translating $1 -> a[0] is trivial using some regular expressions. It could save 2 characters, but that is not a lot.

In summary:

So, in pure form, the D language and rdmd are usable, but rather verbose (mostly due to auto, long function names like writeln, and the extra arguments needed to put spaces between arguments). The script I wrote is probably close to the limit of what would be acceptable, which is not great, because the example script does very little.

With some hacks, preprocessing, and an extra library type and functions, it is possible to make usage way easier, code way shorter, and very comparable to awk. (I didn't test other functions, like open and operating on files, but it should not be too dissimilar.)

Some operator overloading facilities of the D programming language are lacking to make it fully usable tho.

The inability to opt in to implicit opCast conversions makes it impossible to develop a fully dynamic and easy-to-use solution.

What do you think?

For reference, dm script

#!/usr/bin/env python3

import os
import re
import subprocess
import sys


code = sys.argv[1]
filenames = sys.argv[2:]

header = """
struct DT {
  string x_;
  this(string x) { x_ = x; }
  this(float x) { x_ = to!string(x); }
  this(int x) { x_ = to!string(x); }
  // string toString() const { return to!string(x_); }
  string toString() const { return x_; }
  bool can(T)() const {
    try { to!T(x_); } catch (Exception) { return false; } return true;
  }
  bool numeric() const { return can!double(); }
  double number() const { return to!double(x_); }
  auto opBinary(string op)(const ref DT other) const {
    if (numeric() && other.numeric()) {
      const n = number();
      const m = other.number();
      return DT(to!string(mixin("n " ~ op ~ " m")));
    }
    throw new Exception("cannot perform " ~ op ~ " on string");
  }
  auto opBinary(string op, Other)(const ref Other other) const {
    if (numeric()) {
      // static assert(is(other : float, double, int, uint));
      // TODO(baryluk): We could maybe support adding string too. Not super useful tho.
      // I want dynamic typing, but still to be strong typing. Not weak like PHP or JavaScript.
      return DT(to!string(mixin("number() " ~ op ~ " other")));
    }
    // We could possibly allow number + string, and string + string, and string * int
    throw new Exception("cannot perform " ~ op ~ " on string");
  }

  // opUnary, -, ~
  // negation, !: e.g. !c, where c is a string representing an integer; !c is true if c == "0".
  // todo support some bool?

  int opCmp(const ref const(DT) other) const {
    if (numeric() && other.numeric()) {
      const n = number();
      const m = other.number();
      return (n > m) - (n < m);
    }
    if (!numeric() && !other.numeric()) {
      return (x_ > other.x_) - (x_ < other.x_);  // three-way result, as for numbers above
    }
    throw new Exception("cannot compare string with other");
  }
  int opCmp(Other)(const ref Other other) const {
    // static if (is(Other: int, float, ...));
    if (numeric()) {
      const n = number();
      return (n > other) - (n < other);  // Quick hack
    }
    static if (is(Other == string)) {
      return (x_ > other) - (x_ < other);  // three-way result, not just "less than"
    } else {
      throw new Exception("cannot compare string with other");
    }
  }
  bool opEquals(const ref DT other) const {
    return this.opCmp(other) == 0;
  }
  bool opEquals(Other)(const ref Other other) const {
    return this.opCmp(other) == 0;
  }

  // This also handles !value: the cast yields the truthiness, and ! negates it.
  bool opCast(T)() const if (is(T == bool)) {
    if (numeric()) {
      return number() != 0;
    }
    return x_.length != 0;
  }
  auto opCast(T)() const if (is(T == string)) {
    return x_;
  }
  auto opCast(T)() const if (!is(T == bool) && !is(T == string)) {  // T is numeric, e.g. int, double
    pragma(msg, "casting to", T);
    return cast(T) number();
  }

  auto opAssign(const ref DT other) {
    x_ = other.x_;
    return this;
  }
  auto opAssign(Other)(const ref Other other) {
    x_ = to!string(other);
    return this;
  }
}
void echo(T...)(T args) {
  foreach (arg; args[0..$-1]) {
      write(arg);
      write(' ');
  }
  writeln(args[$-1]);
}
void each(D)(string re, D dg) {  // just an initial prototype
  foreach (line; stdin.byLine) {
    if (line.matchFirst(re)) {
      dg(line.split.map!(x => DT(x.idup)).array);  // split on whitespace, wrap each column in DT
    }
  }
}
"""

code = re.sub(r"([a-zA-Z_][a-zA-Z0-9_]*) *:=", r" auto \1=", code)

# print(header+code)

with subprocess.Popen(["rdmd", f"--eval={header}{code}"], stdin=subprocess.PIPE, text=True) as p:
    for filename in filenames:
        with open(filename) as f:
            for line in f:
                p.stdin.write(line)

print(p)
December 25

On Monday, 25 December 2023 at 11:59:35 UTC, Witold Baryluk wrote:
> ...
> The inability to opt in to implicit opCast conversions makes it impossible to develop a fully dynamic and easy-to-use solution.

Was thinking a bit after posting, and maybe there is some hope:

So, one possible (if limited) hack would be to force a double return type on opBinary when it is used with arithmetic operators.

Instead of

  auto opBinary(string op)(const ref DT other) const {
    if (numeric() && other.numeric()) {
      const n = number();
      const m = other.number();
      return DT(to!string(mixin("n " ~ op ~ " m")));
    }
    throw new Exception("cannot perform " ~ op ~ " on string");
  }

we do

  double opBinary(string op)(const ref DT other) const if (op == "+" || op == "-" || op == "*" || op == "/" || op == "^^" || op == "|" || op == "&" || op == "^"){
    if (numeric() && other.numeric()) {
      const n = number();
      const m = other.number();
      return mixin("n " ~ op ~ " m");
    }
    throw new Exception("cannot perform " ~ op ~ " on string");
  }
  string opBinary(string op)(const ref DT other) const if (op == "~") {
    if (!numeric() && !other.numeric()) {
      return mixin("x_ " ~ op ~ " other.x_");
    }
    throw new Exception("cannot perform " ~ op ~ " on non-string");
  }
  ...
  // more overloads
  // ...

This is quite limiting in general tho. What if I also want to support more types than just double, and do it efficiently (cdouble, BigInt, other custom types)?

December 25

On Monday, 25 December 2023 at 12:16:46 UTC, Witold Baryluk wrote:
> On Monday, 25 December 2023 at 11:59:35 UTC, Witold Baryluk wrote:
> This is quite limiting in general tho. What if I also want to support more types than just double, and do it efficiently (cdouble, BigInt, other custom types)?

Maybe if you are able to replace spaces with tabs (creating a TSV file), this tool will help you: https://github.com/eBay/tsv-utils

December 25

On Monday, 25 December 2023 at 12:29:19 UTC, Sergey wrote:
> On Monday, 25 December 2023 at 12:16:46 UTC, Witold Baryluk wrote:
>> On Monday, 25 December 2023 at 11:59:35 UTC, Witold Baryluk wrote:
>> This is quite limiting in general tho. What if I also want to support more types than just double, and do it efficiently (cdouble, BigInt, other custom types)?
>
> Maybe if you are able to replace spaces with tabs (creating a TSV file), this tool will help you: https://github.com/eBay/tsv-utils

I am able to change spaces to tabs. But I do not want to. I strongly prefer spaces.

As I mentioned before, I have a custom tool called kolumny that does what tsv-utils does, and way more. I want something more generic. Also, the reason for my post is not to solve my particular problem (in case you missed the point of the post), but to discuss language design and make D useful in a wider area of applications.

December 25

On Monday, 25 December 2023 at 12:45:50 UTC, Witold Baryluk wrote:
> post), but to discuss language design and make D useful in a wider area of applications.

Yeah, I didn't get the point of the post.
Why not use a templated type to support not only double but any T?

December 25

On Monday, 25 December 2023 at 15:50:52 UTC, Sergey wrote:
> On Monday, 25 December 2023 at 12:45:50 UTC, Witold Baryluk wrote:
>> post), but to discuss language design and make D useful in a wider area of applications.
>
> Yeah, I didn't get the point of the post.
> Why not use a templated type to support not only double but any T?

It just doesn't work.

struct DT {
// ...
}

auto c = 0;   // or 0.0, doesn't matter

c = DT("1");

c will be inferred to be int or double. There is no way to convince the D compiler to call some operator to do the conversion. There is no opAssignRight.

I do not see how templates help.

As a hack, I can do:

c := 0;
c = DT("1")

And instead of converting varname := to auto varname = , do DT varname = . Then I could probably do something about it.

But I think sometimes you want to force a type, or have an empty, default-initialized variable like "string c;". Otherwise it does not look like D, and is very hacky in general.

December 25

Actually, I do not think DT varname = ... will work. It will only work for a limited number of types with value semantics. It will not work for reference types and other non-trivial types (e.g. from libraries, Phobos, arrays, etc.). For command line scripting I would still want to be able to use auto (via the := to auto translation) for those.

January 28

On Monday, 25 December 2023 at 11:59:35 UTC, Witold Baryluk wrote:
> For a very long time I have been using bash, grep, sed, awk, the usual suspects on Unix, as they are super quick to type, incremental, etc. Once the complexity is too big, I usually switch to Python (decades ago it might have been Perl or PHP).

I'm actually using Ruby instead of sed, awk and friends for these kinds of tasks. Python is whitespace sensitive and that's the reason why I don't like it in general. But Ruby is essentially a modernized Perl with a very expressive syntax. Your example with spaces removed and single-character variables:

awk 'BEGIN{p=0;m=0.0;}/^mx1/{c=$3;r=c*$4/$9;if(p!=c){print p,m;m=0;}p=c;if(r>m)m=r;}END{print p,m;}' < foo.txt

ruby -e'g={0=>0};while l=gets;l.scan(/^mx1/){a=l.split;c=a[2].to_i;r=c*a[3].to_f/a[8].to_f;g[c]=[g[c]||0,r].max}end;g.each{puts"%d %g"%_1}' < foo.txt

The following variant works with both Ruby and Crystal:

crystal eval 'g={0=>0.0};while l=gets;l.scan(/^mx1/){a=l.split;c=a[2].to_i;rate=c*a[3].to_f/a[8].to_f;g[c]||=0.0;g[c]=[g[c],rate].max}end;g.each{|v|puts "%d %g"%v}' < foo.txt

D is not the best language for very terse one-liner code golfing. And it doesn't look like many people are interested in adding special syntactic sugar tailored for this.