Thread overview
Associative array issue
Jan 23, 2013
Igor Kolesnik
Jan 23, 2013
H. S. Teoh
Jan 23, 2013
Igor Kolesnik
January 23, 2013
Hi;

I'm trying to run an example from the tutorial at http://www.informit.com/articles/article.aspx?p=1381876&seqNum=4
Here is the code:

import std.stdio, std.string;

void main() {
  uint[string] dic;
  foreach (line; stdin.byLine) {
    string[] words = cast(string[])split(strip(line));
    foreach (word; words) {
      if (word in dic)
        continue;
      uint id = dic.length;
      dic[word] = id;
      writeln(id, '\t', word);
    }
  }
  //foreach (k,v; dic)
  //  writeln(k, '|', v);
}

When run, it behaves somewhat strangely. Here is an example of the input/output I get:

the type of array
0       the
1       type
2       of
3       array
in d the type of array
4       in
5       d
6       the
7       type
8       of
9       array

It seems that 'word in dic' doesn't find words that were already added to the array.
If I print the contents of the 'dic' array on exit, I get the following:

d|5
in |0
e of |3
in|4
the|6
array|9
 the|1
type|7
ty|2
of|8

Can someone help me understand what is going wrong? Am I missing something here?

PS: dmd32 v2.061 on Win7 x64

Sincerely,
Igor
January 23, 2013
On Wed, Jan 23, 2013 at 09:07:24PM +0100, Igor Kolesnik wrote: [...]
> import std.stdio, std.string;
> 
> void main() {
>   uint[string] dic;
>   foreach (line; stdin.byLine) {
>     string[] words = cast(string[])split(strip(line));
>     foreach (word; words) {
>       if (word in dic)
>         continue;
>       uint id = dic.length;
>       dic[word] = id;
>       writeln(id, '\t', word);
>     }
>   }
>   //foreach (k,v; dic)
>   //  writeln(k, '|', v);
> }
> 
> When run, it behaves somewhat strangely. Here is an example of the input/output I get:
[...]

This is a known issue with stdin.byLine: it is a transient range, meaning it reuses the same buffer for each line read from the input. The problem is that split returns slices of the line, which ultimately refer back to the data in that buffer. By the time byLine advances to the next line, that data has been overwritten, so the keys already stored in the associative array change underneath it. That's why the associative array is messed up.
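To make the aliasing concrete, here is a minimal sketch that does not use byLine at all. It assumes a plain char[] standing in for byLine's internal buffer, and afterReuse is a hypothetical helper name chosen for illustration. It compares a stored slice with a stored copy after the buffer is overwritten:

```d
import std.stdio : writeln;

/// Hypothetical helper: returns what a stored slice and a stored copy
/// look like after the underlying buffer has been overwritten.
string[2] afterReuse()
{
    // Stands in for the buffer that byLine reuses (an assumption for this demo).
    char[] buffer = "the type".dup;

    const(char)[] slice = buffer[0 .. 3]; // refers to the buffer, no copy made
    string copy = buffer[0 .. 3].idup;    // independent immutable copy

    // The "next line" arrives and overwrites the same memory.
    buffer[0 .. 3] = "in ";

    // slice now reads "in ", while copy still reads "the".
    return [slice.idup, copy];
}

void main()
{
    auto result = afterReuse();
    writeln("slice: '", result[0], "'");
    writeln("copy:  '", result[1], "'");
}
```

The slice follows whatever the buffer currently holds, which is consistent with keys such as "in " appearing in the dump of dic in the original post.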

There's a slight hint of this problem in your code, in the line that starts with "string[] words = cast(string[])...": in normal D code, you should not need to perform this kind of cast. In this case it is an unsafe operation, because string is immutable(char)[], but the reused buffer returned by byLine is *not* immutable. By casting that mutable data to immutable, you've bypassed the type check that would have rejected the assignment, and inadvertently introduced yourself to the buffer reuse issue in byLine. :)

The correct way to write that line is:

	string[] words = split(strip(line.idup));

which will copy the buffer, thereby ensuring it's safe to keep slices of it in your associative array, and also return the correct type so that no cast is necessary.
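For reference, the whole program with that one-line fix applied might look like the sketch below. The function name buildVocabulary is hypothetical and was introduced only so the logic can be called with any range of lines. The cast(uint) on dic.length is also an addition, on the assumption that the code may be built for a 64-bit target, where length is wider than uint.

```d
import std.array : split;
import std.stdio : stdin, writeln;
import std.string : strip;

/// Hypothetical helper: gives each distinct word a sequential id in
/// order of first appearance, printing each new id as it is assigned.
uint[string] buildVocabulary(Lines)(Lines lines)
{
    uint[string] dic;
    foreach (line; lines)
    {
        // idup copies the (possibly reused) buffer into immutable memory,
        // so the slices produced by split stay valid as keys.
        string[] words = split(strip(line.idup));
        foreach (word; words)
        {
            if (word in dic)
                continue;
            // length is size_t; narrow it explicitly so this also
            // compiles on 64-bit targets.
            auto id = cast(uint) dic.length;
            dic[word] = id;
            writeln(id, '\t', word);
        }
    }
    return dic;
}

void main()
{
    buildVocabulary(stdin.byLine);
}
```

Fed the two input lines from the original post, this should assign ids 0 through 3 on the first line and only 4 (in) and 5 (d) on the second, since the remaining words are now found in dic.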


T

-- 
Notwithstanding the eloquent discontent that you have just respectfully expressed at length against my verbal capabilities, I am afraid that I must unfortunately bring it to your attention that I am, in fact, NOT verbose.
January 23, 2013
> The correct way to write that line is:
>
> 	string[] words = split(strip(line.idup));
>
> which will copy the buffer, thereby ensuring it's safe to keep slices of
> it in your associative array, and also return the correct type so that
> no cast is necessary.
>
>
> T

This makes sense. Thanks a lot!

Igor