January 14, 2020 Reading a file of words line by line

As part of learning D I want to read a file that contains one word per line (plus optional junk after the word) and create a set of all the unique words of a particular length (uppercased).

D doesn't appear to have a set type, so I'm faking one with an associative array whose values are always 0.

I can't help feeling that the foreach loop's block is rather more verbose than it could be?

----
#!/usr/bin/env rdmd
import std.stdio;

immutable WORDFILE = "/usr/share/hunspell/en_GB.dic";
immutable WORDSIZE = 4; // Should be even

alias WordSet = int[string]; // key = word; value = 0

void main() {
    import core.time;
    auto start = MonoTime.currTime;
    auto words = getWords(WORDFILE, WORDSIZE);
    // TODO
    writeln(words.length, " words");
    writeln(MonoTime.currTime - start);
}

WordSet getWords(string filename, int wordsize) {
    import std.conv;
    import std.regex;
    import std.uni;
    WordSet words;
    auto rx = ctRegex!(r"^[a-z]+", "i");
    auto file = File(filename);
    foreach (line; file.byLine) {
        auto match = matchFirst(line, rx);
        if (!match.empty()) {
            auto word = match.hit().to!string; // I hope this assumes UTF-8?
            if (word.length == wordsize) {
                words[word.toUpper] = 0;
            }
        }
    }
    return words;
}
----

PS I'm using ldc on Linux and think that rdmd is excellent. I have lots of small Python programs and I'm wondering how many would be faster using D and rdmd (which I think caches binaries). Also I've now got Mike Parker's "Learning D" on order.
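For readers following along, here is a minimal sketch of the "associative array as a set" idea described in the post above. It only assumes the WordSet alias from the original program; the helper name toSet and the sample words are made up for illustration.

```d
import std.algorithm : sort;
import std.array : array;
import std.stdio : writeln;

alias WordSet = int[string]; // key = word; the value is ignored

// Hypothetical helper: builds a set from a list of words.
WordSet toSet(string[] items) {
    WordSet set;
    foreach (item; items)
        set[item] = 0; // a repeated key just overwrites the same entry
    return set;
}

void main() {
    auto words = toSet(["ABLE", "BAKE", "ABLE"]);

    assert(words.length == 2);  // only unique keys are stored
    assert("ABLE" in words);    // membership test; `in` yields a pointer
    assert("CAKE" !in words);

    words.remove("BAKE");       // delete an element
    assert(words.length == 1);

    // Keys come back in no particular order, so sort them for display.
    writeln(words.byKey.array.sort());
}
```

The value type is arbitrary here; only the keys carry information, which is why every insertion stores 0.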
January 14, 2020 Re: Reading a file of words line by line

Posted in reply to mark

Should I have closed the file, i.e.:

    auto file = File(filename);
    scope(exit) file.close(); // Add this?
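Nobody in the thread answers this directly. As far as I know, std.stdio.File is reference-counted and closes the underlying handle when its last copy goes out of scope, so the explicit scope(exit) should be optional in a case like this. A minimal sketch of that behaviour follows; the function writeWord and the scratch file name are made up for the demo.

```d
import std.file : readText, remove, tempDir;
import std.path : buildPath;
import std.stdio : File;
import std.string : strip;

// Writes one word and relies on File's destructor to flush and close.
void writeWord(string path, string word) {
    auto file = File(path, "w");
    file.writeln(word);
} // `file` goes out of scope here; there is no explicit close() call

void main() {
    // Hypothetical scratch file, chosen only for this demo.
    auto path = buildPath(tempDir, "file_scope_demo.txt");
    scope(exit) remove(path);

    writeWord(path, "ABLE");
    // The word can be read back, so the data was flushed on scope exit.
    assert(readText(path).strip == "ABLE");
}
```

An explicit close() is still harmless, and it can be useful when you want close errors to surface at a known point.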
January 14, 2020 Re: Reading a file of words line by line

Posted in reply to mark

On Tuesday, 14 January 2020 at 16:39:16 UTC, mark wrote:
> I can't help feeling that the foreach loop's block is rather more verbose than it could be?
>
> WordSet words;
> auto rx = ctRegex!(r"^[a-z]+", "i");
> auto file = File(filename);
> foreach (line; file.byLine) {
>     auto match = matchFirst(line, rx);
>     if (!match.empty()) {
>         auto word = match.hit().to!string; // I hope this assumes UTF-8?
>         if (word.length == wordsize) {
>             words[word.toUpper] = 0;
>         }
>     }
> }
> return words;
> }
> ----

One thing I picked up during Advent of Code last year was std.file.slurp, which was great for reading 90% of the input files from that contest. With that, I'd do this more like

    int[string] words;
    slurp!string("input.txt", "%s").each!(w => words[w] = 0);

where "%s" is what slurp() expects to find on each line, and 'string' is the type each line is parsed into. With just a list of words this isn't very interesting. Some of my uses from the contest are:

    auto input = slurp!(int, int, int)(args[1], "<x=%d, y=%d, z=%d>")
        .map!(p => Moon([p[0], p[1], p[2]])).array;

    Tuple!(string, string)[] input = slurp!(string, string)("input.txt", "%s)%s");

Of course if you want to validate the input as you're reading it, you still have to do extra work, but it could be in a .filter!
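To make the slurp usage above concrete, here is a small self-contained sketch. The "word count" line format, the helper readCounts, and the scratch file name are all invented for the demo; the only library call it leans on is std.file.slurp as described in the post.

```d
import std.file : remove, slurp, tempDir, write;
import std.path : buildPath;
import std.stdio : writeln;

// Hypothetical helper: parses lines such as "able 3" into (string, int) tuples.
auto readCounts(string path) {
    // Two types are requested, so each line becomes a Tuple!(string, int).
    return slurp!(string, int)(path, "%s %d");
}

void main() {
    // Made-up input file, created only for this demo.
    auto path = buildPath(tempDir, "slurp_demo.txt");
    write(path, "able 3\nbaker 1\n");
    scope(exit) remove(path);

    auto entries = readCounts(path);
    foreach (entry; entries)
        writeln(entry[0], " -> ", entry[1]);

    assert(entries.length == 2);
    assert(entries[0][0] == "able" && entries[0][1] == 3);
}
```

Note that slurp reads the whole file into an array, so for very large inputs a lazy File.byLine pipeline may be the better fit.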
January 15, 2020 Re: Reading a file of words line by line

Posted in reply to mipri

Thanks for the ideas. I've now reduced the size of the getWords() function (even allowing for moving the imports to the top of the file) to this:

    WordSet getWords(string filename, int wordsize) {
        string bareWord(string line) {
            auto rx = ctRegex!(r"^([a-z]+)", "i");
            auto match = matchFirst(line, rx);
            return match.empty ? "" : match.hit.to!string;
        }
        WordSet words;
        slurp!string(filename, "%s")
            .map!(line => bareWord(line))
            .filter!(word => word.length == wordsize)
            .each!(word => words[word.toUpper] = 0);
        return words;
    }

Is this as compact as it _reasonably_ can be?
January 15, 2020 Re: Reading a file of words line by line

Posted in reply to mark

On 2020-01-15 16:34, mark via Digitalmars-d-learn wrote:
> Is this as compact as it _reasonably_ can be?

How about this?

    auto uniqueWords(string filename, uint wordsize) {
        import std.algorithm, std.array, std.conv, std.functional, std.uni;
        return File(filename).byLine
            .map!(line => line.until!(not!isAlpha))
            .filter!(word => word.count == wordsize)
            .map!(word => word.to!string.toUpper)
            .array
            .sort
            .uniq;
    }
January 15, 2020 Re: Reading a file of words line by line

Posted in reply to dwdv

I really do need a set for the next part of the program, but taking your code and ideas I have now reduced the function to this:

    WordSet getWords(string filename, int wordsize) {
        WordSet words;
        File(filename).byLine
            .map!(line => line.until!(not!isAlpha))
            .filter!(word => word.count == wordsize)
            .each!(word => words[word.to!string.toUpper] = 0);
        return words;
    }

This is also 4x faster than my version that used a regex -- thanks!

Why did you use string.count rather than string.length?
January 15, 2020 Re: Reading a file of words line by line

Posted in reply to mark

On Wed, Jan 15, 2020 at 07:50:31PM +0000, mark via Digitalmars-d-learn wrote:
[...]
> Why did you use string.count rather than string.length?

The .length of a `string` is the number of bytes (UTF-8 code units) that it occupies, which is not necessarily the same as the number of characters in the string. E.g., if you receive a Unicode string, there may be multi-byte characters in it.

T

--
A computer doesn't mind if its programs are put to purposes that don't match their names. -- D. Knuth
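A short sketch of the difference described above, using escaped literals so the byte counts are unambiguous. The sample strings are made up. The last part rests on my understanding that the lazy range returned by until has no .length property, which would be a second reason count was needed in the pipeline.

```d
import std.algorithm : count, until;
import std.functional : not;
import std.range : hasLength, walkLength;
import std.uni : byGrapheme, isAlpha;

void main() {
    // "naive" with a precomposed i-diaeresis, which takes two bytes in UTF-8.
    string word = "na\u00EFve";
    assert(word.length == 6); // UTF-8 code units (bytes)
    assert(word.count == 5);  // code points

    // 'e' followed by a combining acute accent: one visible character.
    string accented = "e\u0301";
    assert(accented.length == 3);                // bytes
    assert(accented.count == 2);                 // code points
    assert(accented.byGrapheme.walkLength == 1); // graphemes

    // The lazy prefix range produced by until offers no .length,
    // so counting its elements is the way to measure it.
    auto prefix = "able/MS".until!(not!isAlpha);
    static assert(!hasLength!(typeof(prefix)));
    assert(prefix.count == 4);
}
```

So count measures code points rather than bytes, and even that can differ from what a reader sees as characters when combining marks are involved.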
January 16, 2020 Re: Reading a file of words line by line

Posted in reply to mark

On Wednesday, 15 January 2020 at 19:50:31 UTC, mark wrote:
> I really do need a set for the next part of the program, but taking your code and ideas I have now reduced the function to this:
>
> WordSet getWords(string filename, int wordsize) {
> WordSet words;
> File(filename).byLine
> .map!(line => line.until!(not!isAlpha))
> .filter!(word => word.count == wordsize)
> .each!(word => words[word.to!string.toUpper] = 0);
> return words;
> }
>
> This is also 4x faster than my version that used a regex -- thanks!
>
> Why did you use string.count rather than string.length?

Your solution is fine, but also

    import std.stdio;

    void main() {
        auto file = ["word one", "my word", "word"];
        writeln(uniqueWords(file, 4));
    }

    auto uniqueWords(string[] file, uint wordsize) {
        // std.typecons provides tuple()
        import std.algorithm, std.array, std.conv, std.functional, std.typecons, std.uni;
        return file
            .map!(line => line.until!(not!isAlpha))
            .filter!(word => word.count == wordsize)
            .map!(word => word.to!string.toUpper)
            .array
            .sort
            .uniq
            .map!(x => tuple(x, 0))
            .assocArray;
    }
January 16, 2020 Re: Reading a file of words line by line

Posted in reply to Jesse Phillips

On 2020-01-16 04:54, Jesse Phillips via Digitalmars-d-learn wrote:
> [...]
> .map!(word => word.to!string.toUpper)
> .array
> .sort
> .uniq
> .map!(x => tuple (x, 0))
> .assocArray ;
>
.each!(word => words[word.to!string.toUpper] = 0);
isn't far off, but could also be (sans imports):

    return File(filename).byLine
        .map!(line => line.until!(not!isAlpha))
        .filter!(word => word.count == wordsize)
        .map!(word => word.to!string.toUpper)
        .assocArray(0.repeat);
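Since the snippet above is given "sans imports", here is one way it might look as a complete program. The import list is my best guess at what the pipeline needs, wordsize is taken as size_t instead of the thread's int/uint, and the sample dictionary lines and scratch file name are made up for the demo.

```d
import std.algorithm : count, filter, map, until;
import std.array : assocArray;
import std.conv : to;
import std.file : remove, tempDir, write;
import std.functional : not;
import std.path : buildPath;
import std.range : repeat;
import std.stdio : File, writeln;
import std.uni : isAlpha, toUpper;

alias WordSet = int[string]; // key = word; value = 0

WordSet getWords(string filename, size_t wordsize) {
    return File(filename).byLine
        .map!(line => line.until!(not!isAlpha))  // leading letters only
        .filter!(word => word.count == wordsize) // keep words of the wanted size
        .map!(word => word.to!string.toUpper)    // copy the line buffer, then uppercase
        .assocArray(0.repeat);                   // pair every key with 0
}

void main() {
    // Hypothetical dictionary-style input, created only for this demo.
    auto path = buildPath(tempDir, "words_demo.dic");
    write(path, "able/MS\nAbel\ntoo\nzebra\nable\n");
    scope(exit) remove(path);

    auto words = getWords(path, 4);
    writeln(words.length, " words");
    assert(words.length == 2); // ABLE (seen twice) and ABEL
    assert("ABLE" in words && "ABEL" in words);
}
```

The to!string call matters because byLine reuses its line buffer, so each key has to be copied before it is stored in the associative array.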
January 16, 2020 Re: Reading a file of words line by line

Posted in reply to dwdv

On Thursday, 16 January 2020 at 10:10:02 UTC, dwdv wrote:
> On 2020-01-16 04:54, Jesse Phillips via Digitalmars-d-learn wrote:
>> [...]
> [...]
> isn't far off, but could also be (sans imports):
>
> return File(filename).byLine
>     .map!(line => line.until!(not!isAlpha))
>     .filter!(word => word.count == wordsize)
>     .map!(word => word.to!string.toUpper)
>     .assocArray(0.repeat);

That's what I'm now using -- thanks! (Now I can try the next bit.)
Copyright © 1999-2021 by the D Language Foundation