Associative Array potential performance pitfall?

March 13, 2020
Adnan
On my machine, the following D code, compiled with LDC and the release flag, runs in over 230 ms, while the similar Go code runs in under 120 ms.

string smallestRepr(const string arg) {
	import std.format : format;

	const repeated = format!"%s%s"(arg, arg);
	string result;
	result.reserve(arg.length);
	result = arg;
	foreach (i; 1 .. arg.length) {
		const slice = repeated[i .. i + arg.length];
		if (slice < result)
			result = slice;
	}
	return result;
}

unittest {
	assert("cba".smallestRepr() == "acb");
}

void main(const string[] args) {
	string[][const string] wordTable;
	import std.stdio : writeln, lines, File;

	foreach (string word; File(args[1]).lines()) {
		word = word[0 .. $ - 1]; // strip the newline character
		const string key = word.smallestRepr();
		if (key in wordTable)
			wordTable[key] ~= word;
		else
			wordTable[key] = [word];
	}

	foreach (key; wordTable.keys) {
		if (wordTable[key].length == 4) {
			writeln(wordTable[key]);
			break;
		}
	}
}

---

package main

import (
        "bufio"
        "fmt"
        "os"
)

func canonicalize(s string) string {
        c := s + s[:len(s)-1]
        best := s
        for i := 1; i < len(s); i++ {
                if c[i:i+len(s)] < best {
                        best = c[i : i+len(s)]
                }
        }
        return best
}

func main() {
        seen := make(map[string][]string)

        inputFile, err := os.Open(os.Args[1])
        if err != nil {
                os.Exit(1)
        }
        defer inputFile.Close()

        scanner := bufio.NewScanner(inputFile)

        for scanner.Scan() {
                word := scanner.Text()
                n := canonicalize(word)
                seen[n] = append(seen[n], word)

        }

        for _, words := range seen {
                if len(words) == 4 {
                        fmt.Printf("%v\n", words)
                        break
                }

        }
}

The input file: https://raw.githubusercontent.com/dolph/dictionary/master/enable1.txt

Problem description: https://www.reddit.com/r/dailyprogrammer/comments/ffxabb/20200309_challenge_383_easy_necklace_matching

Where am I losing performance?
March 13, 2020
On Fri, Mar 13, 2020 at 03:40:11AM +0000, Adnan via Digitalmars-d-learn wrote:
> On my machine, the following D code, compiled with LDC and the release flag, runs in over 230 ms, while the similar Go code runs in under 120 ms.
> 
> string smallestRepr(const string arg) {
> 	import std.format : format;
> 
> 	const repeated = format!"%s%s"(arg, arg);

.format is dog slow, and not exactly something you want to be calling every iteration.  Try instead:

	auto repeated = arg ~ arg;


> 	string result;
> 	result.reserve(arg.length);
> 	result = arg;

This does not do what you think it does.  Assigning to a string is a shallow copy; the call to .reserve is superfluous. In fact, it's triggering another allocation, only to throw it away immediately. Just write:

	auto result = arg;


> 	foreach (i; 1 .. arg.length) {
> 		const slice = repeated[i .. i + arg.length];
> 		if (slice < result)
> 			result = slice;
> 	}
> 	return result;
> }
[...]
> Where am I losing performance?

1) Calling .format -- it's dog slow. Just use ~.

2) Calling .reserve only to throw away the results immediately. Just skip it.

I doubt your AA is the performance bottleneck here.


Try the following implementation of .smallestRepr that avoids (1) and
(2):

	string smallestRepr(const string arg) {
		auto repeated = arg ~ arg;
		string result = arg;
		foreach (i; 1 .. arg.length) {
			const slice = repeated[i .. i + arg.length];
			if (slice < result)
				result = slice;
		}
		return result;
	}

Note that `arg ~ arg` may allocate, but it also may not if the current buffer for `arg` is big enough to accommodate both. So you might actually save some allocations here compared to calling .format.

On my machine, a quick benchmark shows a 2x performance boost.


T

-- 
That's not a bug; that's a feature!
March 13, 2020
On Friday, 13 March 2020 at 03:40:11 UTC, Adnan wrote:
> Where am I losing performance?

It is impossible to say without measuring; humans are notoriously bad at predicting performance issues. Wrap all your hard-working code in a loop of, say, 100 iterations, compile, and run:

$ perf record -g ./app file.txt && perf report

Measure dmd debug build first, then ldc2 -release -O.

Besides what H. S. Teoh said, these things help:

- disable GC before doing the job and enable it afterwards
- use .byLineCopy instead of lines(), it's faster
- use .byKeyValue when printing the results, it's lazy and returns both key and value

90ms after vs 215ms before
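Combined with the `arg ~ arg` fix, those three tweaks might look roughly like this (a sketch, not the measured code; `smallestRepr` is the fixed version posted earlier in the thread):

```d
import core.memory : GC;
import std.stdio : File, writeln;

string smallestRepr(const string arg) {
	auto repeated = arg ~ arg;
	string result = arg;
	foreach (i; 1 .. arg.length) {
		const slice = repeated[i .. i + arg.length];
		if (slice < result)
			result = slice;
	}
	return result;
}

void main(const string[] args) {
	if (args.length < 2)
		return; // no input file given

	string[][string] wordTable;

	GC.disable(); // defer collections while building the table
	scope (exit) GC.enable();

	// byLineCopy strips the terminator and yields freshly allocated
	// strings, so no manual newline slicing is needed.
	foreach (word; File(args[1]).byLineCopy()) {
		// `~=` on a missing AA key creates the entry, so the
		// `in`/else branch from the original is unnecessary.
		wordTable[word.smallestRepr()] ~= word;
	}

	// byKeyValue iterates the table lazily, without allocating
	// an array of keys the way .keys does.
	foreach (pair; wordTable.byKeyValue()) {
		if (pair.value.length == 4) {
			writeln(pair.value);
			break;
		}
	}
}
```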

March 13, 2020
On 3/13/20 5:24 AM, H. S. Teoh wrote:
> Note that `arg ~ arg` may allocate, but it also may not if the current
> buffer for `arg` is big enough to accommodate both.

That always allocates. Only appending may avoid allocation:

arg ~= arg;

But, I would instead use ranges if possible to avoid all allocations.
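For example (my sketch of the idea, not Steve's code), std.range.cycle can express rotation i of a string without building `arg ~ arg` at all; returning the winning start index leaves any copying to the caller:

```d
import std.algorithm.comparison : cmp;
import std.range : cycle, drop, take;

// Return the start index of the lexicographically smallest rotation.
// cycle() gives an infinite repeating view of s, so rotation i is
// just drop(i).take(s.length) -- no concatenation, no allocation.
size_t smallestRotation(const(char)[] s) {
	size_t best = 0;
	foreach (i; 1 .. s.length) {
		auto candidate = cycle(s).drop(i).take(s.length);
		auto current = cycle(s).drop(best).take(s.length);
		if (cmp(candidate, current) < 0)
			best = i;
	}
	return best;
}
```

The caller then pays for at most one allocation per word, e.g. `auto key = word[i .. $] ~ word[0 .. i];` with `i = word.smallestRotation();`.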

-Steve
March 13, 2020
On Fri, Mar 13, 2020 at 09:30:16AM -0400, Steven Schveighoffer via Digitalmars-d-learn wrote:
> On 3/13/20 5:24 AM, H. S. Teoh wrote:
> > Note that `arg ~ arg` may allocate, but it also may not if the current buffer for `arg` is big enough to accommodate both.
> 
> That always allocates. Only appending may avoid allocation:
> 
> arg ~= arg;

Ah I see.  Mea culpa.


> But, I would instead use ranges if possible to avoid all allocations.
[...]

Initially I thought about avoiding all allocations, but I realized that no matter what, you need to allocate the key for the AA anyway, so I didn't go that route.  That said, though, we could probably reduce the number of allocations by various means, such as having the function return char[] instead of string, and calling .idup on-demand as opposed to allocating up-front. This would allow us to reuse a static buffer for `repeated` instead of incurring an allocation each time.  Given what the algorithm is doing, this should save some number of allocations if done correctly.
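A hypothetical sketch of that idea (the static buffer and the on-demand .idup are my illustration, not code from the thread):

```d
// Returns a slice into a reused thread-local buffer; the result is
// only valid until the next call, so callers must .idup it before
// storing it anywhere long-lived (e.g. as an AA key).
const(char)[] smallestReprTmp(const(char)[] arg) {
	static char[] buf; // thread-local, reused across calls
	if (buf.length < arg.length * 2)
		buf.length = arg.length * 2; // grow only when needed
	buf[0 .. arg.length] = arg[];
	buf[arg.length .. arg.length * 2] = arg[];

	const(char)[] result = buf[0 .. arg.length];
	foreach (i; 1 .. arg.length) {
		const slice = buf[i .. i + arg.length];
		if (slice < result)
			result = slice;
	}
	return result;
}
```

On the lookup side, the .idup then only happens for keys that are not yet in the table:

	const key = word.smallestReprTmp();
	if (auto p = key in wordTable)
		*p ~= word;
	else
		wordTable[key.idup] = [word];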


T

-- 
Creativity is not an excuse for sloppiness.