Posted by H. S. Teoh in reply to Igor Kolesnik
On Wed, Jan 23, 2013 at 09:07:24PM +0100, Igor Kolesnik wrote: [...]
> import std.stdio, std.string;
>
> void main() {
>     uint[string] dic;
>     foreach (line; stdin.byLine) {
>         string[] words = cast(string[])split(strip(line));
>         foreach (word; words) {
>             if (word in dic)
>                 continue;
>             uint id = dic.length;
>             dic[word] = id;
>             writeln(id, '\t', word);
>         }
>     }
>     //foreach (k,v; dic)
>     //    writeln(k, '|', v);
> }
>
> When run it behaves somehow strange. Here is an example of the input/output I get
[...]
This is a known issue with stdin.byLine: it is a transient range, meaning it reuses the same buffer for each line it reads from the input. The problem is that split returns slices of the line, which ultimately refer back to the data in that buffer. By the time byLine advances to the next line, that data has been overwritten, so the keys already stored in the associative array now point at different text. That's why the associative array is messed up.
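To make the aliasing concrete, here is a minimal sketch that imitates the buffer reuse by hand. The `overwrite` helper and the `buffer` variable are hypothetical stand-ins of mine; byLine's real buffer is internal to the range, so this only illustrates the effect, not how byLine is implemented.

```d
import std.stdio;

// Hypothetical helper: writes `next` into the start of `buffer` in place,
// imitating a reader that reuses one buffer for every line.
void overwrite(char[] buffer, const(char)[] next)
{
    assert(next.length <= buffer.length);
    foreach (i, c; next)
        buffer[i] = c;
}

void main()
{
    char[] buffer = "first".dup;   // stand-in for the reused line buffer

    char[] kept = buffer[0 .. 5];  // a slice: no copy, same memory as buffer
    string copied = buffer.idup;   // an independent immutable copy

    overwrite(buffer, "other");    // "read the next line" into the same buffer

    writeln(kept);    // the slice now shows the new contents
    writeln(copied);  // the copy is unaffected
}
```

The slice changes underneath you while the copy does not, which is the same thing that happens to keys stored in the associative array.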
There's a slight hint of this problem in your code, in the line that starts with "string[] words = cast(string[])...": in normal D code, you should not need this kind of cast. Here it is an unsafe operation, because string is immutable(char)[], but the reused buffer returned by byLine is *not* immutable. By casting mutable data to immutable, you've promised the compiler that the data will never change when in fact it does, and so inadvertently introduced yourself to the buffer reuse issue in byLine. :)
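The type-system side of this can be checked directly. The following is a small sketch, assuming the line comes back as a mutable char[] (the variable `line` here is just a stand-in I made up, not a real byLine result):

```d
void main()
{
    // string is an alias for immutable(char)[].
    static assert(is(string == immutable(char)[]));

    // A mutable char[] does not convert implicitly to string,
    // which is why the original line only compiled with a cast.
    static assert(!is(char[] : string));

    char[] line = "some words".dup;  // stand-in for a line from byLine

    // .idup makes an immutable copy, so its result really is a string.
    static assert(is(typeof(line.idup) == string));

    // The cast changes only the type; the result still points at line's memory.
    string aliased = cast(string) line;
    assert(aliased.ptr == line.ptr);

    // The copy lives in its own memory.
    string copied = line.idup;
    assert(copied.ptr != line.ptr);
}
```

So the cast silences the compiler without making the data any safer to keep around.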
The correct way to write the declaration of words is:
string[] words = split(strip(line.idup));
which will copy the buffer, thereby ensuring it's safe to keep slices of it in your associative array, and also return the correct type so that no cast is necessary.
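For reference, here is a sketch of the whole program with that change applied. I have moved the loop into a helper named `assignIds` (a hypothetical name, not from your code) so it can be fed any range of lines, and added a cast on `dic.length`, on the assumption that you may be compiling for a 64-bit target where size_t does not convert implicitly to uint.

```d
import std.array : split;
import std.stdio, std.string;

// Hypothetical helper: gives each previously unseen word the next free ID
// and prints the ID/word pair, as in the original program.
uint[string] assignIds(R)(R lines)
{
    uint[string] dic;
    foreach (line; lines)
    {
        // idup copies the line first, so the slices returned by split
        // refer to the copy and stay valid after the next line is read.
        string[] words = split(strip(line.idup));
        foreach (word; words)
        {
            if (word in dic)
                continue;
            // Assumption: the cast is needed because length is size_t.
            uint id = cast(uint) dic.length;
            dic[word] = id;
            writeln(id, '\t', word);
        }
    }
    return dic;
}

void main()
{
    assignIds(stdin.byLine);
}
```

Copying every line costs an allocation per line; if that matters, another option is to call idup only on the words that are actually inserted into the associative array.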
T
--
Notwithstanding the eloquent discontent that you have just respectfully expressed at length against my verbal capabilities, I am afraid that I must unfortunately bring it to your attention that I am, in fact, NOT verbose.