Thread overview
Efficiently streaming data to associative array
Aug 08, 2017
Guillaume Chatelet
Aug 08, 2017
Guillaume Chatelet
Aug 09, 2017
kerdemdemir
Aug 09, 2017
Guillaume Chatelet
Aug 08, 2017
Anonymouse
Aug 10, 2017
Jon Degenhardt
August 08, 2017
Let's say I'm processing megabytes of data. I'm lazily iterating over the incoming lines and storing data in an associative array, and I don't want to copy unless I have to.

Contrived example follows:

input file
----------
a,b,15
c,d,12
...

Efficient ingestion
-------------------
import std.format : formattedRead;
import std.stdio : stdin, writeln;

void main() {

  size_t[string][string] indexed_map;

  foreach(char[] line ; stdin.byLine) {
    char[] a;
    char[] b;
    size_t value;
    line.formattedRead!"%s,%s,%d"(a,b,value);

    auto pA = a in indexed_map;
    if(pA is null) {
      pA = &(indexed_map[a.idup] = (size_t[string]).init);
    }

    auto pB = b in (*pA);
    if(pB is null) {
      pB = &((*pA)[b.idup] = size_t.init);
    }

    // Technically unneeded but let's say we have more than 2 dimensions.
    (*pB) = value;
  }

  indexed_map.writeln;
}


I'd describe this code as ugly but fast. Any ideas on how to make it less ugly? Is there something in Phobos to help?
August 08, 2017
On 8/8/17 11:28 AM, Guillaume Chatelet wrote:
> I'd describe this code as ugly but fast. Any ideas on how to make it less ugly? Is there something in Phobos to help?

I wouldn't use formattedRead, as I think this is going to allocate temporaries for a and b.

Note, this is very close to Jon Degenhardt's blog post in May: https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

-Steve
August 08, 2017
On Tuesday, 8 August 2017 at 16:00:17 UTC, Steven Schveighoffer wrote:
> I wouldn't use formattedRead, as I think this is going to allocate temporaries for a and b.
>
> Note, this is very close to Jon Degenhardt's blog post in May: https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/
>
> -Steve

I haven't dug into formattedRead yet, but thanks for letting me know :)
I was mostly talking about the pattern with the AA. I guess the best I can do is a templated function to hide the ugliness.


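import std.format : formattedRead;
import std.stdio : stdin, writeln;

// Returns a reference to map[key], inserting Value.init first if the key is
// missing. The key is only copied (idup) when a new entry is created.
ref Value GetWithDefault(Value)(ref Value[string] map, const (char[]) key) {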
  auto pValue = key in map;
  if(pValue) return *pValue;
  return map[key.idup] = Value.init;
}

void main() {

  size_t[string][string] indexed_map;

  foreach(char[] line ; stdin.byLine) {
    char[] a;
    char[] b;
    size_t value;
    line.formattedRead!"%s,%s,%d"(a,b,value);

    indexed_map.GetWithDefault(a).GetWithDefault(b) = value;
  }

  indexed_map.writeln;
}


Not too bad, actually!
August 08, 2017
On Tuesday, 8 August 2017 at 16:00:17 UTC, Steven Schveighoffer wrote:
> I wouldn't use formattedRead, as I think this is going to allocate temporaries for a and b.

What would you suggest using in its stead? My use case is similar to the OP's in that I have a string of tokens that I want to split into variables.

import std.stdio;
import std.format;

void main()
{
    string abc, def;
    int ghi, jkl;

    string s = "abc,123,def,456";
    s.formattedRead!"%s,%d,%s,%d"(abc, ghi, def, jkl);

    writeln(abc);
    writeln(def);
    writeln(ghi);
    writeln(jkl);
}
August 09, 2017
> I haven't dug into formattedRead yet, but thanks for letting me know :)
> I was mostly talking about the pattern with the AA. I guess the best I can do is a templated function to hide the ugliness.

As a total beginner, I'm not entirely comfortable with basic operations on AAs.

First, although I'm very happy that we have pointers, using pointers in a common operation like this makes the language feel a bit unsafe, IMHO.

Second, the "in" keyword has always seemed very specialized to me.

I think I will use your solution "ref Value GetWithDefault(Value)" very often, since it hides both of the things above.
August 09, 2017
On Wednesday, 9 August 2017 at 10:00:14 UTC, kerdemdemir wrote:
> As a total beginner, I'm not entirely comfortable with basic operations on AAs.
>
> First, although I'm very happy that we have pointers, using pointers in a common operation like this makes the language feel a bit unsafe, IMHO.
>
> Second, the "in" keyword has always seemed very specialized to me.
>
> I think I will use your solution "ref Value GetWithDefault(Value)" very often, since it hides both of the things above.

You don't need this most of the time. If your keys already have the correct type, it's easy:

size_t[string][string] indexed_map;

string a, b; // a and b are strings not char[]
indexed_map[a][b] = value; // this will create the AA slots if needed

In my specific case the data is streamed from stdin and is not kept in memory.
byLine returns a view of a buffer that may be overwritten at the next foreach iteration, so I can't use the index operator directly: I need a string that does not change over time.

I could have used this code:

import std.format : formattedRead;
import std.stdio : stdin, writeln;

void main() {
  size_t[string][string] indexed_map;
  foreach(char[] line ; stdin.byLine) {
    char[] a;
    char[] b;
    size_t value;
    line.formattedRead!"%s,%s,%d"(a,b,value);
    indexed_map[a.idup][b.idup] = value;
  }
  indexed_map.writeln;
}

That's perfectly OK if the data is small. In my case the data is huge, and creating a copy of the strings at each iteration is costly.
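
A minimal sketch of why the lookup-then-copy pattern pays off, assuming a single reused char[] buffer as a stand-in for what byLine hands out. The helper name countLines and the one-character "lines" are invented for illustration only. The lookup works directly on the mutable buffer; idup is only needed when a new entry has to be stored.

```d
import std.stdio : writeln;

/// Hypothetical helper for illustration: treats every character of `input`
/// as one "line", written into a single reused buffer (as byLine does),
/// and counts occurrences. Returns the number of keys that had to be copied.
size_t countLines(string input, ref size_t[string] counts)
{
    size_t copies;
    char[] buffer = new char[](1);    // stands in for byLine's reused buffer

    foreach (char c; input)
    {
        buffer[0] = c;                // the "next line" overwrites the old one
        if (auto p = buffer in counts)
        {
            ++*p;                     // known key: no allocation
        }
        else
        {
            counts[buffer.idup] = 1;  // new key: take an immutable copy
            ++copies;
        }
    }
    return copies;
}

void main()
{
    size_t[string] counts;
    immutable copies = countLines("abab", counts);

    assert(copies == 2);              // only "a" and "b" were duplicated
    assert(counts["a"] == 2 && counts["b"] == 2);
    writeln(counts);
}
```

With "abab" as input, four lines are processed but only two keys are copied, which is the saving the pointer-based version is after.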
August 09, 2017
On 8/8/17 3:43 PM, Anonymouse wrote:
> On Tuesday, 8 August 2017 at 16:00:17 UTC, Steven Schveighoffer wrote:
>> I wouldn't use formattedRead, as I think this is going to allocate temporaries for a and b.
> 
> What would you suggest using in its stead? My use case is similar to the OP's in that I have a string of tokens that I want to split into variables.

Use splitter(",") and then parse each field with the appropriate function (e.g. to!).

For example, for the OP's code, I would do:

auto r = line.splitter(",");
a = r.front;                  // slice of line, no copy
r.popFront;
b = r.front;
r.popFront;
value = r.front.to!size_t;    // the OP's third field

It would be nice if formattedRead didn't use appender, and instead sliced, but I'm not sure it can be fixed.

Note, one could make a template that does this automatically in one line.

-Steve
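
One possible shape for such a template, as a rough sketch only: splitInto is a hypothetical name, not something in Phobos, and it assumes comma-separated fields and a char[] line such as byLine produces. char[] arguments receive slices of the line, so they are only valid until the buffer is overwritten.

```d
import std.algorithm : splitter;
import std.conv : to;
import std.exception : enforce;

/// Hypothetical helper (not part of Phobos): splits `line` on commas and
/// stores one field per argument. char[] arguments receive slices of `line`
/// (no copy); every other type is converted with std.conv.to.
void splitInto(Args...)(char[] line, ref Args args)
{
    auto fields = line.splitter(",");
    foreach (i, T; Args)
    {
        enforce(!fields.empty, "not enough fields in line");
        static if (is(T == char[]))
            args[i] = fields.front;        // slice of the input buffer
        else
            args[i] = fields.front.to!T;   // e.g. size_t, int, string
        fields.popFront();
    }
}

void main()
{
    char[] line = "a,b,15".dup;
    char[] a, b;
    size_t value;

    line.splitInto(a, b, value);

    assert(a == "a" && b == "b" && value == 15);
    assert(a.ptr is line.ptr);   // `a` aliases the original buffer
}
```

Passing a string argument instead of char[] goes through to!string, which copies the field, so the caller chooses per field between a view and an owned copy.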
August 10, 2017
On Wednesday, 9 August 2017 at 13:36:46 UTC, Steven Schveighoffer wrote:
> using splitter(","), and then parsing each field using appropriate function (e.g. to!)
> -Steve

The blog post Steve referred to has examples of this type of processing while iterating over lines in a file. A couple of different ways to access the elements are shown. AA access is also addressed: https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

--Jon