Reading a file of words line by line

January 14, 2020

Posted by mark

As part of learning D I want to read a file that contains one word per line (plus optional junk after the word) and create a set of all the unique words of a particular length (uppercased).

D doesn't appear to have a set type, so I'm faking one using an associative array whose values are always 0.

I can't help feeling that the foreach loop's block is rather more verbose than it could be?

----
#!/usr/bin/env rdmd
import std.stdio;

immutable WORDFILE = "/usr/share/hunspell/en_GB.dic";
immutable WORDSIZE = 4; // Should be even

alias WordSet = int[string]; // key = word; value = 0

void main() {
    import core.time;

    auto start = MonoTime.currTime;
    auto words = getWords(WORDFILE, WORDSIZE);
    // TODO
    writeln(words.length, " words");
    writeln(MonoTime.currTime - start);
}

WordSet getWords(string filename, int wordsize) {
    import std.conv;
    import std.regex;
    import std.uni;

    WordSet words;
    auto rx = ctRegex!(r"^[a-z]+", "i");
    auto file = File(filename);
    foreach (line; file.byLine) {
        auto match = matchFirst(line, rx);
        if (!match.empty()) {
            auto word = match.hit().to!string; // I hope this assumes UTF-8?
            if (word.length == wordsize) {
                words[word.toUpper] = 0;
            }
        }
    }
    return words;
}
----
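For readers unfamiliar with the idiom, here is a minimal sketch of an associative array standing in for a set, using the same `int[string]` alias as the program above. The helper name `makeSet` is hypothetical and only there for illustration; the `in` operator, `remove`, and `.length` are built-in associative array operations.

```d
alias WordSet = int[string]; // key = word; value is ignored (always 0)

// Hypothetical helper: builds a set from a list that may contain duplicates.
WordSet makeSet(string[] items)
{
    WordSet set;
    foreach (item; items)
        set[item] = 0; // inserting an existing key just overwrites it
    return set;
}

void main()
{
    auto words = makeSet(["TREE", "WORD", "TREE"]);

    assert(words.length == 2);  // the duplicate was collapsed
    assert("TREE" in words);    // membership test
    assert("NOPE" !in words);

    words.remove("WORD");       // deletion
    assert(words.length == 1);
}
```

The stored value is never read, so any cheap value type would do; `int` simply matches the alias used in the post.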

PS I'm using ldc on Linux and think that rdmd is excellent. I have lots of small Python programs, and I'm wondering how many would be faster using D and rdmd (which I think caches binaries). Also, I've now got Mike Parker's "Learning D" on order.

On Tuesday, 14 January 2020 at 16:39:16 UTC, mark wrote:
> I can't help feeling that the foreach loop's block is rather more verbose than it could be?
>
>     WordSet words;
>     auto rx = ctRegex!(r"^[a-z]+", "i");
>     auto file = File(filename);
>     foreach (line; file.byLine) {
>         auto match = matchFirst(line, rx);
>         if (!match.empty()) {
>             auto word = match.hit().to!string; // I hope this assumes UTF-8?
>             if (word.length == wordsize) {
>                 words[word.toUpper] = 0;
>             }
>         }
>     }
>     return words;
> }
> ----

One thing I picked up during Advent of Code last year was std.file.slurp, which was great for reading 90% of the input files from that contest. With that, I'd do this more like

----
int[string] words;
slurp!string("input.txt", "%s").each!(w => words[w] = 0);
----

Where "%s" is what slurp() expects to find on each line, and 'string' is the type it returns from that. With just a list of words this isn't very interesting. Some of my uses from the contest are:

----
auto input = slurp!(int, int, int)(args[1], "<x=%d, y=%d, z=%d>")
    .map!(p => Moon([p[0], p[1], p[2]])).array;

Tuple!(string, string)[] input = slurp!(string, string)("input.txt", "%s)%s");
----

Of course if you want to validate the input as you're reading it, you still have to do extra work, but it could be in a .filter!
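To make the slurp() description above concrete, here is a small self-contained sketch. It assumes a scratch file in the system temp directory; the file name `slurp_sketch.txt` and the helper `loadPairs` are made up for this example. Each line holds an int and a double, so slurp returns an array of tuples, one per line.

```d
import std.file : remove, slurp, tempDir, write;
import std.path : buildPath;
import std.typecons : Tuple, tuple;

// Hypothetical helper: each line is expected to look like "<int> <double>".
Tuple!(int, double)[] loadPairs(string path)
{
    // The template arguments are the types to read from every line;
    // the format string describes the layout of a single line.
    return slurp!(int, double)(path, "%s %s");
}

void main()
{
    // Scratch file used only by this sketch; removed on exit.
    auto path = buildPath(tempDir, "slurp_sketch.txt");
    write(path, "12 12.25\n345 1.125");
    scope (exit) remove(path);

    auto rows = loadPairs(path);
    assert(rows.length == 2);
    assert(rows[0] == tuple(12, 12.25));
    assert(rows[1] == tuple(345, 1.125));
}
```

With a single type, as in `slurp!string(...)` above, the result is a plain array of that type rather than an array of tuples.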

Thanks for the ideas, I've now reduced the size of the getWords() function (even allowing for moving the imports to the top of the file) to this:

----
WordSet getWords(string filename, int wordsize) {
    string bareWord(string line) {
        auto rx = ctRegex!(r"^([a-z]+)", "i");
        auto match = matchFirst(line, rx);
        return match.empty ? "" : match.hit.to!string;
    }

    WordSet words;
    slurp!string(filename, "%s")
        .map!(line => bareWord(line))
        .filter!(word => word.length == wordsize)
        .each!(word => words[word.toUpper] = 0);
    return words;
}
----

Is this as compact as it _reasonably_ can be?

On 2020-01-15 16:34, mark via Digitalmars-d-learn wrote:
> Is this as compact as it _reasonably_ can be?

How about this?

----
auto uniqueWords(string filename, uint wordsize) {
    import std.algorithm, std.array, std.conv, std.functional, std.uni;

    return File(filename).byLine
        .map!(line => line.until!(not!isAlpha))
        .filter!(word => word.count == wordsize)
        .map!(word => word.to!string.toUpper)
        .array
        .sort
        .uniq;
}
----
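The tail of that pipeline (`.array.sort.uniq`) deduplicates by sorting first, because `uniq` only drops adjacent repeats. A minimal sketch of just that step, on an in-memory array; the function name `sortedUnique` is hypothetical:

```d
import std.algorithm : sort, uniq;
import std.array : array;

// Hypothetical helper: returns the distinct words in sorted order.
string[] sortedUnique(string[] words)
{
    // sort() works in place, so copy first to leave the caller's array alone.
    // uniq removes only consecutive duplicates, hence the sort before it.
    return words.dup.sort().uniq.array;
}

void main()
{
    auto result = sortedUnique(["WORD", "TREE", "WORD"]);
    assert(result == ["TREE", "WORD"]);
}
```

The result is a sorted list of unique words rather than a set, so membership tests would need a search instead of the `in` operator.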

I really do need a set for the next part of the program, but taking your code and ideas I have now reduced the function to this:

----
WordSet getWords(string filename, int wordsize) {
    WordSet words;
    File(filename).byLine
        .map!(line => line.until!(not!isAlpha))
        .filter!(word => word.count == wordsize)
        .each!(word => words[word.to!string.toUpper] = 0);
    return words;
}
----

This is also 4x faster than my version that used a regex -- thanks!

Why did you use string.count rather than string.length?

On Wed, Jan 15, 2020 at 07:50:31PM +0000, mark via Digitalmars-d-learn wrote:
[...]
> Why did you use string.count rather than string.length?

The .length of a `string` type is the number of bytes that it occupies, which is not necessarily the same thing as the number of characters in the string. E.g., if you receive a Unicode string, there may be multi-byte characters in it.

T

-- 
A computer doesn't mind if its programs are put to purposes that don't match their names. -- D. Knuth
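A quick sketch of the difference described above, using a word with one non-ASCII letter. The sample word is an assumption chosen for illustration; `count` here is the single-argument form from std.algorithm, which walks the string and counts decoded code points.

```d
import std.algorithm : count;

void main()
{
    // "café", with é written as an escape so the source encoding cannot matter.
    string word = "caf\u00E9";

    assert(word.length == 5); // UTF-8 code units (bytes): é takes two
    assert(word.count == 4);  // decoded code points

    // For pure ASCII words the two agree.
    assert("word".length == 4);
    assert("word".count == 4);
}
```

Note that `count` counts code points, so a letter built from a base character plus a combining mark would still be counted as two.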

On Wednesday, 15 January 2020 at 19:50:31 UTC, mark wrote:
> I really do need a set for the next part of the program, but taking your code and ideas I have now reduced the function to this:
>
> WordSet getWords(string filename, int wordsize) {
>     WordSet words;
>     File(filename).byLine
>         .map!(line => line.until!(not!isAlpha))
>         .filter!(word => word.count == wordsize)
>         .each!(word => words[word.to!string.toUpper] = 0);
>     return words;
> }
>
> This is also 4x faster than my version that used a regex -- thanks!
>
> Why did you use string.count rather than string.length?

Your solution is fine, but also

----
void main() {
    auto file = ["word one", "my word", "word"];
    writeln(uniqueWords(file, 4));
}

auto uniqueWords(string[] file, uint wordsize) {
    // std.typecons provides tuple
    import std.algorithm, std.array, std.conv, std.functional, std.typecons, std.uni;

    return file
        .map!(line => line.until!(not!isAlpha))
        .filter!(word => word.count == wordsize)
        .map!(word => word.to!string.toUpper)
        .array
        .sort
        .uniq
        .map!(x => tuple(x, 0))
        .assocArray;
}
----

On 2020-01-16 04:54, Jesse Phillips via Digitalmars-d-learn wrote:
> [...]
>     .map!(word => word.to!string.toUpper)
>     .array
>     .sort
>     .uniq
>     .map!(x => tuple (x, 0))
>     .assocArray ;

> .each!(word => words[word.to!string.toUpper] = 0);

isn't far off, but could also be (sans imports):

----
return File(filename).byLine
    .map!(line => line.until!(not!isAlpha))
    .filter!(word => word.count == wordsize)
    .map!(word => word.to!string.toUpper)
    .assocArray(0.repeat);
----
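For anyone wanting to try that last form without a dictionary file, here is a sketch of the same pipeline with the imports filled in, run over an in-memory array of lines instead of `File(filename).byLine`. The function name `wordsOfLength` and the sample lines are assumptions made for this example.

```d
import std.algorithm : count, filter, map, until;
import std.array : assocArray;
import std.conv : to;
import std.functional : not;
import std.range : repeat;
import std.uni : isAlpha, toUpper;

// Hypothetical stand-in for getWords(): takes lines directly rather than a file name.
int[string] wordsOfLength(string[] lines, size_t wordsize)
{
    return lines
        .map!(line => line.until!(not!isAlpha))  // keep the leading run of letters
        .filter!(word => word.count == wordsize) // compare by code points, not bytes
        .map!(word => word.to!string.toUpper)
        .assocArray(0.repeat);                   // pair every key with the value 0
}

void main()
{
    auto words = wordsOfLength(["word one", "my word", "word", "tree7", "ab"], 4);

    assert(words.length == 2); // WORD appears twice in the input but only once here
    assert("WORD" in words);
    assert("TREE" in words);
    assert("MY" !in words);
}
```

Because `0.repeat` is infinite, the associative array is simply as long as the stream of keys, and repeated keys overwrite each other.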

On Thursday, 16 January 2020 at 10:10:02 UTC, dwdv wrote:
> On 2020-01-16 04:54, Jesse Phillips via Digitalmars-d-learn wrote:
>> [...]
> [...]
> isn't far off, but could also be (sans imports):
>
> return File(filename).byLine
>     .map!(line => line.until!(not!isAlpha))
>     .filter!(word => word.count == wordsize)
>     .map!(word => word.to!string.toUpper)
>     .assocArray(0.repeat);

That's what I'm now using -- thanks! (Now I can try the next bit.)