Thread overview
Reading a file of words line by line
Jan 14, 2020
mark
Jan 14, 2020
mark
Jan 14, 2020
mipri
Jan 15, 2020
mark
Jan 15, 2020
dwdv
Jan 15, 2020
mark
Jan 15, 2020
H. S. Teoh
Jan 16, 2020
Jesse Phillips
Jan 16, 2020
dwdv
Jan 16, 2020
mark
January 14, 2020
As part of learning D I want to read a file that contains one word per line (plus optional junk after the word) and create a set of all the unique words of a particular length (uppercased).

D doesn't appear to have a set type, so I'm faking one with an associative array whose values are always 0.
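
For reference, the associative-array-as-set idiom described above can be sketched as follows. This is a minimal illustration, not taken from the thread's program: the name `seen` and the sample words are made up, and `bool` is used as the placeholder value type where the post uses `int`.

```d
import std.stdio : writeln;

void main() {
    // Hypothetical set of words: only the keys matter, the value is a placeholder.
    bool[string] seen;

    foreach (w; ["ABLE", "BAKE", "ABLE"])
        seen[w] = true;          // inserting an existing key just overwrites it

    assert(seen.length == 2);    // duplicates collapse
    assert("ABLE" in seen);      // `in` yields a pointer; non-null means present
    assert("CODE" !in seen);

    seen.remove("BAKE");         // remove a member
    writeln(seen.keys);          // iteration order is unspecified
}
```

Membership tests use `in` / `!in`, and `.keys` or `.byKey` give the members back.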

I can't help feeling that the foreach loop's block is rather more verbose than it could be?

----
#!/usr/bin/env rdmd
import std.stdio;

immutable WORDFILE = "/usr/share/hunspell/en_GB.dic";
immutable WORDSIZE = 4; // Should be even

alias WordSet = int[string]; // key = word; value = 0

void main() {
    import core.time;

    auto start = MonoTime.currTime;
    auto words = getWords(WORDFILE, WORDSIZE);
    // TODO
    writeln(words.length, " words");
    writeln(MonoTime.currTime - start);
}

WordSet getWords(string filename, int wordsize) {
    import std.conv;
    import std.regex;
    import std.uni;

    WordSet words;
    auto rx = ctRegex!(r"^[a-z]+", "i");
    auto file = File(filename);
    foreach (line; file.byLine) {
        auto match = matchFirst(line, rx);
        if (!match.empty()) {
            auto word = match.hit().to!string; // I hope this assumes UTF-8?
            if (word.length == wordsize) {
                words[word.toUpper] = 0;
            }
        }
    }
    return words;
}
----

PS I'm using ldc on Linux and think that rdmd is excellent. I have lots of small Python programs and I'm wondering how many would be faster using D and rdmd (which I think caches binaries). Also, I've now got Mike Parker's "Learning D" on order.
January 14, 2020
Should I have closed the file explicitly, like this?

    auto file = File(filename);
    scope(exit) file.close(); // Add this?
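
For context, `std.stdio.File` is reference-counted and closes the underlying handle when the last copy goes out of scope, so an explicit `close()` is optional here. A small sketch of that behaviour, assuming a hypothetical scratch file `words_demo.txt` in the current directory:

```d
import std.file : remove, write;
import std.stdio : File;

void main() {
    enum path = "words_demo.txt";   // hypothetical scratch file for this demo
    write(path, "able\nbake\n");    // std.file.write creates the file
    scope (exit) remove(path);      // clean up when main ends

    size_t lines;
    {
        auto file = File(path);     // opened for reading by default
        foreach (line; file.byLine)
            ++lines;
        assert(file.isOpen);        // still open inside the scope
    } // last reference to the File is destroyed here, which closes the handle

    assert(lines == 2);
}
```

`scope(exit) file.close();` is still harmless, and can be useful if close errors should surface at a known point.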

January 14, 2020
On Tuesday, 14 January 2020 at 16:39:16 UTC, mark wrote:
> I can't help feeling that the foreach loop's block is rather more verbose than it could be?
>

>     WordSet words;
>     auto rx = ctRegex!(r"^[a-z]+", "i");
>     auto file = File(filename);
>     foreach (line; file.byLine) {
>         auto match = matchFirst(line, rx);
>         if (!match.empty()) {
>             auto word = match.hit().to!string; // I hope this assumes UTF-8?
>             if (word.length == wordsize) {
>                 words[word.toUpper] = 0;
>             }
>         }
>     }
>     return words;
> }
> ----

One thing I picked up during Advent of Code last year was
std.file.slurp, which was great for reading 90% of the input
files from that contest. With that, I'd do this more like

  int[string] words;
  slurp!string("input.txt", "%s").each!(w => words[w] = 0);

Where "%s" is what slurp() expects to find on each line, and
'string' is the type it returns from that. With just a list of
words this isn't very interesting. Some of my uses from the
contest are:

  auto input = slurp!(int, int, int)(args[1], "<x=%d, y=%d, z=%d>")
      .map!(p => Moon([p[0], p[1], p[2]])).array;

  Tuple!(string, string)[] input =
      slurp!(string, string)("input.txt", "%s)%s");

Of course if you want to validate the input as you're reading
it, you still have to do extra work, but it could be in a
.filter!

January 15, 2020
Thanks for the ideas, I've now reduced the size of the getWords() function (even allowing for moving the imports to the top of the file) to this:

WordSet getWords(string filename, int wordsize) {
    string bareWord(string line) {
        auto rx = ctRegex!(r"^([a-z]+)", "i");
        auto match = matchFirst(line, rx);
        return match.empty ? "" : match.hit.to!string;
    }
    WordSet words;
    slurp!string(filename, "%s")
        .map!(line => bareWord(line))
        .filter!(word => word.length == wordsize)
        .each!(word => words[word.toUpper] = 0);
    return words;
}

Is this as compact as it _reasonably_ can be?
January 15, 2020
On 2020-01-15 16:34, mark via Digitalmars-d-learn wrote:
> Is this as compact as it _reasonably_ can be?

How about this?

auto uniqueWords(string filename, uint wordsize) {
    import std.algorithm, std.array, std.conv, std.functional, std.stdio, std.uni;

    return File(filename).byLine
        .map!(line => line.until!(not!isAlpha))
        .filter!(word => word.count == wordsize)
        .map!(word => word.to!string.toUpper)
        .array
        .sort
        .uniq;
}
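
To see what the `until!(not!isAlpha)` step in the function above yields, here is a small sketch. The sample line `"abbey/SM"` is a made-up stand-in for a hunspell-style dictionary entry, not taken from the actual file.

```d
import std.algorithm : count, equal, until;
import std.functional : not;
import std.uni : isAlpha;

void main() {
    auto line = "abbey/SM";                  // hypothetical dictionary line
    auto word = line.until!(not!isAlpha);    // lazy range, stops before '/'

    assert(word.equal("abbey"));
    assert(word.count == 5);                 // the lazy range has no .length

    // A line starting with a non-letter gives an empty word.
    assert("123".until!(not!isAlpha).count == 0);
}
```

Because `until` is lazy, nothing is copied until the later `to!string` call.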
January 15, 2020
I really do need a set for the next part of the program, but taking your code and ideas I have now reduced the function to this:

WordSet getWords(string filename, int wordsize) {
    WordSet words;
    File(filename).byLine
        .map!(line => line.until!(not!isAlpha))
        .filter!(word => word.count == wordsize)
        .each!(word => words[word.to!string.toUpper] = 0);
    return words;
}

This is also 4x faster than my version that used a regex -- thanks!

Why did you use string.count rather than string.length?

January 15, 2020
On Wed, Jan 15, 2020 at 07:50:31PM +0000, mark via Digitalmars-d-learn wrote: [...]
> Why did you use string.count rather than string.length?

The .length of a `string` is the number of bytes (UTF-8 code units) that it occupies, which is not necessarily the same as the number of characters in the string. For example, if you receive a Unicode string, there may be multi-byte characters in it.
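
A short sketch of the difference, using a made-up sample word with one non-ASCII letter (written as an escape so the source encoding does not matter):

```d
import std.algorithm : count;
import std.range : walkLength;

void main() {
    string word = "caf\u00E9";       // "café": é takes two bytes in UTF-8

    assert(word.length == 5);        // UTF-8 code units (bytes)
    assert(word.count == 4);         // decoded code points
    assert(word.walkLength == 4);    // same idea, from std.range

    // For pure ASCII the two numbers agree.
    assert("word".length == "word".count);
}
```

Note that the range returned by `until` in the code under discussion has no `.length` at all, so `count` (or `walkLength`) is the way to measure it.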


T

-- 
A computer doesn't mind if its programs are put to purposes that don't match their names. -- D. Knuth
January 16, 2020
On Wednesday, 15 January 2020 at 19:50:31 UTC, mark wrote:
> I really do need a set for the next part of the program, but taking your code and ideas I have now reduced the function to this:
>
> WordSet getWords(string filename, int wordsize) {
>     WordSet words;
>     File(filename).byLine
>         .map!(line => line.until!(not!isAlpha))
>         .filter!(word => word.count == wordsize)
>         .each!(word => words[word.to!string.toUpper] = 0);
>     return words;
> }
>
> This is also 4x faster than my version that used a regex -- thanks!
>
> Why did you use string.count rather than string.length?

Your solution is fine, but you could also do this:

import std.stdio;

void main() {
    auto file = ["word one", "my word", "word"];
    writeln(uniqueWords(file, 4));
}

auto uniqueWords(string[] file, uint wordsize) {
    import std.algorithm, std.array, std.conv, std.functional, std.typecons, std.uni;

    return file
        .map!(line => line.until!(not!isAlpha))
        .filter!(word => word.count == wordsize)
        .map!(word => word.to!string.toUpper)
        .array
        .sort
        .uniq
        .map!(x => tuple(x, 0)) // pair each word with a dummy value
        .assocArray;
}


January 16, 2020
On 2020-01-16 04:54, Jesse Phillips via Digitalmars-d-learn wrote:
> [...]
> .map!(word => word.to!string.toUpper)
> .array
> .sort
> .uniq
> .map!(x => tuple (x, 0))
> .assocArray ;
> 

.each!(word => words[word.to!string.toUpper] = 0);

isn't far off, but could also be (sans imports):

return File(filename).byLine
    .map!(line => line.until!(not!isAlpha))
    .filter!(word => word.count == wordsize)
    .map!(word => word.to!string.toUpper)
    .assocArray(0.repeat);
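
The two-argument form of `assocArray` pairs a range of keys with a range of values, and `0.repeat` supplies an endless stream of zeros. A self-contained sketch with made-up in-memory input instead of a file:

```d
import std.algorithm : map;
import std.array : assocArray;
import std.range : repeat;
import std.uni : toUpper;

void main() {
    auto words = ["able", "Able", "bake"];   // hypothetical sample input

    // Keys come from the mapped range, every value is 0.
    int[string] set = words
        .map!(w => w.toUpper)
        .assocArray(0.repeat);

    assert(set.length == 2);                 // "able" and "Able" collapse
    assert("ABLE" in set);
    assert(set["BAKE"] == 0);
}
```

Duplicate keys simply overwrite each other, which is why no separate sort/uniq step is needed.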
January 16, 2020
On Thursday, 16 January 2020 at 10:10:02 UTC, dwdv wrote:
> On 2020-01-16 04:54, Jesse Phillips via Digitalmars-d-learn wrote:
>> [...]
[...]
> isn't far off, but could also be (sans imports):
>
> return File(filename).byLine
>     .map!(line => line.until!(not!isAlpha))
>     .filter!(word => word.count == wordsize)
>     .map!(word => word.to!string.toUpper)
>     .assocArray(0.repeat);

That's what I'm now using -- thanks!
(Now I can try the next bit.)