Andrei Alexandrescu needs to read this
October 23, 2019
I watched many of his talks and he frequently talks about optimizations that produce single-digit % speedups in frequently used algorithms, but he doesn't provide adequate proof that his change in algorithm was the reason we see the performance difference. Modern CPUs are sensitive to many things, and one of them is code layout in memory. Hot loops are the most susceptible to this, to the point where changing the user name under which the executable is run changes performance. The paper below goes deeper into this.

Producing Wrong Data Without Doing Anything Obviously Wrong!
https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf
October 23, 2019
On Wednesday, 23 October 2019 at 21:37:26 UTC, welkam wrote:
> I watched many of his talks and he frequently talks about optimizations that produce single-digit % speedups in frequently used algorithms, but he doesn't provide adequate proof that his change in algorithm was the reason we see the performance difference. Modern CPUs are sensitive to many things, and one of them is code layout in memory. Hot loops are the most susceptible to this, to the point where changing the user name under which the executable is run changes performance. The paper below goes deeper into this.
>
> Producing Wrong Data Without Doing Anything Obviously Wrong!
> https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf

That's why Andrei says "always measure". He understands how complex modern CPUs are and that it's basically pointless to try to predict performance.  He has some good talks where he shows that one of the biggest causes of performance problems is not understanding how the processor cache works, but his point was that you'll never be able to theoretically predict the performance of hardware in today's world.  Always measure.

What I find funny is that there are a lot of clever tricks you can do to make your code execute fewer operations, but with modern CPUs it's more about making your code more predictable so that the cache can predict what to load next and which branches you're more likely to take.  So in a way, as CPUs get smarter, you want to make your code "dumber" (i.e. more predictable) in order to get the best performance.  When hardware was "dumber", it was better to make your code smarter.  An odd switch in paradigms.
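
To illustrate with a minimal D sketch (hypothetical functions and data, not from any real codebase): the first loop's speed depends heavily on how predictable the comparison is, while the second does the same work every iteration and typically compiles down to a conditional move instead of a branch.

    import std.stdio : writeln;

    // Branchy version: fast when the branch predictor can guess the
    // comparison, slow on random data where it mispredicts often.
    long branchySum(const(int)[] data, int threshold)
    {
        long total;
        foreach (x; data)
            if (x >= threshold)
                total += x;
        return total;
    }

    // "Dumber", more predictable version: same work every iteration;
    // compilers typically lower the ternary to a conditional move, so
    // there is no branch to mispredict.
    long branchlessSum(const(int)[] data, int threshold)
    {
        long total;
        foreach (x; data)
            total += (x >= threshold) ? x : 0;
        return total;
    }

    void main()
    {
        auto data = [5, 1, 9, 3, 7, 2, 8];
        writeln(branchySum(data, 4), " ", branchlessSum(data, 4)); // 29 29
    }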
October 23, 2019
On Wednesday, 23 October 2019 at 22:03:29 UTC, Jonathan Marler wrote:

> That's why Andrei says "always measure".
I see you didn't read the paper. The performance measurements that you talk about, and that Andrei does, measure two things: 1. the change due to your code change, and 2. the change due to code layout changes.
When the performance change is an order of magnitude, then you can safely assume it was because of the code change you made, but when the difference is less than 10% it becomes unclear what is actually responsible for that difference. If you had read the paper you would have found that gcc's -O3 changes performance over -O2 by anywhere from -8% to +12% on the same application.

Simple measurement is not sufficient to conclude that your change in algorithm is what is responsible for a measured performance increase.
October 23, 2019
On 10/23/2019 3:03 PM, Jonathan Marler wrote:
> What I find funny is that there are a lot of clever tricks you can do to make your code execute fewer operations, but with modern CPUs it's more about making your code more predictable so that the cache can predict what to load next and which branches you're more likely to take.  So in a way, as CPUs get smarter, you want to make your code "dumber" (i.e. more predictable) in order to get the best performance.  When hardware was "dumber", it was better to make your code smarter.  An odd switch in paradigms.

Keep in mind that starting in the late 70's, CPUs started being designed around the way compilers generate code. (Before then, instruction sets were a wacky collection of seemingly unrelated instructions. Compilers like orthogonality, and specialized instructions to do things like stack frame setup / teardown.)

This means that unusual instruction sequences tend to perform less well than the ordinary stuff a compiler generates.

It's also true that code optimizers are tuned to what the local C/C++ compiler generates, even if the optimizer is designed to work with multiple diverse languages.
October 23, 2019
On Wed, Oct 23, 2019 at 04:22:08PM -0700, Walter Bright via Digitalmars-d wrote:
> On 10/23/2019 3:03 PM, Jonathan Marler wrote:
> > What I find funny is that there are a lot of clever tricks you can do to make your code execute fewer operations, but with modern CPUs it's more about making your code more predictable so that the cache can predict what to load next and which branches you're more likely to take.  So in a way, as CPUs get smarter, you want to make your code "dumber" (i.e. more predictable) in order to get the best performance.  When hardware was "dumber", it was better to make your code smarter.  An odd switch in paradigms.

Indeed!  In the old days it was all about minimizing instructions. But nowadays, minimizing instructions may make your code perform worse if it increases the number of branches, thereby causing more branch hazards.

On the flip side, some good optimizers can eliminate branch hazards in certain cases, e.g.:

	bool cond;
	x = cond ? y+1 : y;

can be rewritten by the optimizer as:

	x = y + cond;

which allows for a branchless translation into machine code.

Generally, though, it's a bad idea to write this sort of optimization in the source code: it runs the risk of confusing the optimizer, which may cause it to be disabled for that piece of code, resulting in poor generated code.  It's usually better to just trust the optimizer to do its job.

Another recent development is the occasional divergence of performance characteristics of CPUs across members of the same family, i.e., the same instruction on two different CPU models may perform quite differently.  Meaning that this sort of low-level optimization is really best left to the optimizer to optimize for the actual target CPU, rather than to choose a fixed series of instructions in an asm block that may perform poorly on some CPUs.  (This is also where JIT compilation can win over static compilation, if you ship a generic binary that isn't specifically targeted for the customer's CPU model.)


> Keep in mind that starting in the late 70's, CPUs started being designed around the way compilers generate code. (Before then, instruction sets were a wacky collection of seemingly unrelated instructions. Compilers like orthogonality, and specialized instructions to do things like stack frame setup / teardown.)
> 
> This means that unusual instruction sequences tend to perform less well than the ordinary stuff a compiler generates.

Yeah, nowadays with microcode, you can't trust the surface appearance of the assembly instructions anymore. What looks like the same number of instructions can have very different performance characteristics depending on how it's actually implemented in the microcode.


> It's also true that code optimizers are tuned to what the local C/C++ compiler generates, even if the optimizer is designed to work with multiple diverse languages.

Interesting, I didn't know this.


T

-- 
Guns don't kill people. Bullets do.
October 23, 2019
On Wed, Oct 23, 2019 at 11:20:07PM +0000, welkam via Digitalmars-d wrote: [...]
> When the performance change is an order of magnitude, then you can safely assume it was because of the code change you made, but when the difference is less than 10% it becomes unclear what is actually responsible for that difference. If you had read the paper you would have found that gcc's -O3 changes performance over -O2 by anywhere from -8% to +12% on the same application.
> 
> Simple measurement is not sufficient to conclude that your change in algorithm is what is responsible for a measured performance increase.

Yeah, I tend to get skeptical when people start micro-optimizing for small performance increases. When it's a 30%-40% or higher increase, then I'd say it's reasonably safe to conclude that the algorithm change was responsible. But if it's 2% or 5% then it's harder to be confident you aren't being misled by other factors.  Also, I tend to ignore differences of less than 1s or 0.5s in benchmarks, because it's hard to tell whether the 0.002s increase is caused by the code, or by other factors. When people start optimizing over sub-second performance increases I start getting suspicious.

As a general principle I'd say that if a set of benchmarks are being compared, they need to be run in the *exact* same environment and then compared. If a set of measurements were made 5 days ago and we compare them with measurements made today, there's no accounting for what subtle system differences may have crept in in the meantime, that may throw off the results.
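
As a minimal sketch of what "same environment" can look like in practice (the two candidate functions below are just placeholders, not any real Phobos code under discussion): measure both variants back to back in the same process, same build, same run, e.g. with std.datetime.stopwatch.benchmark:

    import std.datetime.stopwatch : benchmark;
    import std.stdio : writefln;

    int[] data;

    // Placeholder algorithm variants -- substitute the two implementations
    // actually being compared.
    long candidateA() { long s; foreach (x; data) s += x; return s; }
    long candidateB() { long s; foreach_reverse (x; data) s += x; return s; }

    void main()
    {
        import std.array : array;
        import std.range : iota;
        data = iota(1_000_000).array;

        // Both variants measured in the same process, same build, same
        // machine state -- not against numbers recorded on another day.
        auto results = benchmark!(candidateA, candidateB)(100);
        writefln("A: %s  B: %s", results[0], results[1]);
    }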

But this paper reveals even more interesting points, though, about performance differences across systems or CPU models. For this, I'd propose that any benchmarks we're basing algorithm decisions on ought to be verified on at least two (preferably more) divergent systems. E.g., if running benchmark B on a Debian Linux server leads to the conclusion that algorithm A1 is better than A2, then I'd say we should check whether running B on a Windows machine leads to the same conclusion. Or if the code change is Linux-specific, I'd say test it also on a Slackware desktop system to see if we get different conclusions from a Debian server system. The more variety incorporated into the sample set the better.

However, I feel the point the paper makes about address randomization can actually be turned into a *benefit*: rather than turning it off, run the exact same benchmark, say, 500 times with ASLR *enabled*, which should balance out any biases by incorporating into the results both beneficial and detrimental address alignments that may have resulted from ASLR.  If you make conclusions based on benchmarks taken with ASLR disabled, then you run the risk that the algorithm performs better without ASLR but worse in typical user environments, which *would* have ASLR enabled.
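
A rough sketch of that idea in D (./benchmark and the run count of 500 are placeholders): leave ASLR on, run the whole benchmark binary many times, and look at aggregate statistics instead of a single number:

    import std.algorithm : sort, sum;
    import std.datetime.stopwatch : AutoStart, StopWatch;
    import std.process : execute;
    import std.stdio : writefln;

    void main()
    {
        enum runs = 500;    // enough runs for ASLR's layout effects to average out
        double[] seconds;
        foreach (i; 0 .. runs)
        {
            auto sw = StopWatch(AutoStart.yes);
            auto r = execute(["./benchmark"]);   // placeholder benchmark binary
            sw.stop();
            if (r.status == 0)
                seconds ~= sw.peek.total!"usecs" / 1e6;
        }
        sort(seconds);
        writefln("runs: %s  median: %.4fs  mean: %.4fs",
                 seconds.length, seconds[$ / 2], seconds.sum / seconds.length);
    }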

Another problem with benchmark-based optimizations, which the paper didn't mention, is that you run the risk of optimizing for the benchmark at the expense of actual, real-world software. Typical software, for example, doesn't run memcpy 50 million times in a tight loop; typically memcpy calls are sprinkled among a whole bunch of other stuff. If you micro-optimize memcpy to beat the benchmark, you could be misled by CPU cache / branch predictor effects, which would *not* trigger in actual application software because the usage pattern is completely different. You could potentially be making memcpy run *slower* in Excel even though it runs faster in a benchmark, for example.

Anecdote.  Once I wrote several D analogues of the Unix wc utility and ran a comparison with my Debian distro's stock wc utility (which is the GNU version). For GNU wc, it basically amounted to calling memchr on newline characters. In the D versions I used various different algorithms, including using std.mmap, std.parallelism, reading the file the traditional way by blocks, etc.  Then I ran the various versions of wc on various sets of data with different characteristics.

I discovered something very interesting: GNU wc was generally on par with, or outperformed the D versions of the code for files that contained long lines, but performed more poorly when given files that contained short lines.

Glancing at the glibc source code revealed why: glibc's memchr used an elaborate bit-hack-based algorithm that scanned the target string 8 bytes at a time. This required the data to be aligned, however, so when the string was not aligned, it had to manually process up to 7 bytes at either end of the string with a different algorithm.  So when the lines were long, the overall performance was dominated by the 8-bytes-at-a-time scanning code, which was very fast for large buffers.  However, when given a large number of short strings, the overhead of setting up for the 8-byte scan became more costly than the savings, so it performed more poorly than a naïve byte-by-byte scan.

I confirmed this conclusion by artificially constructing an input file with extremely long lines, and another input file with extremely short lines.  Then I compared GNU wc's performance with a naïve D function that basically did a byte-by-byte scan.  As expected, wc lost to the naïve scanner on the input file with extremely short lines, but won by a large margin on the file with extremely long lines.  Subsequently I was able to duplicate this result by writing the same 8-bytes-at-a-time scanner in D.
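
For the curious, here's a rough D sketch of the two strategies (my own simplified reconstruction for counting matches such as newlines, not glibc's actual code; it uses one of the standard zero-byte-detection bit hacks):

    // Naive scan: one byte per iteration, no setup cost.
    size_t naiveCount(const(ubyte)[] buf, ubyte needle)
    {
        size_t n;
        foreach (b; buf)
            if (b == needle) ++n;
        return n;
    }

    // 8-bytes-at-a-time scan: align first, then use a bit hack that marks
    // the high bit of every byte equal to `needle`.  The head/tail handling
    // is the setup overhead that hurts on short inputs.
    size_t wordCount(const(ubyte)[] buf, ubyte needle)
    {
        enum ulong ones = 0x0101010101010101UL;
        enum ulong lows = 0x7F7F7F7F7F7F7F7FUL;
        immutable ulong pattern = ones * needle;   // broadcast needle to all 8 bytes

        size_t n, i;
        while (i < buf.length && (cast(size_t)(buf.ptr + i) & 7) != 0)
            if (buf[i++] == needle) ++n;           // unaligned head, byte by byte
        while (i + 8 <= buf.length)
        {
            ulong w = *cast(const(ulong)*)(buf.ptr + i) ^ pattern; // matches become 0x00
            ulong hits = ~(((w & lows) + lows) | w | lows);        // 0x80 at each zero byte
            for (; hits; hits &= hits - 1)
                ++n;                               // one set bit per matching byte
            i += 8;
        }
        for (; i < buf.length; ++i)
            if (buf[i] == needle) ++n;             // leftover tail bytes
        return n;
    }

    void main()
    {
        import std.stdio : writeln;
        auto text = cast(const(ubyte)[]) "one\ntwo\nthree\n";
        writeln(naiveCount(text, cast(ubyte) '\n'), " ",
                wordCount(text, cast(ubyte) '\n'));   // 3 3
    }

On long lines the inner 8-byte loop dominates; on many short lines the alignment head/tail work dominates, which matches the wc results described above.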

Taking a step back, I realized that this was a classic case of optimizing for the benchmark: it seemed likely that glibc's memchr was optimized for scanning large buffers, given the kind of algorithm used to implement it. Since benchmarks tend to be written for large test cases, my suspicion was that this algorithm was favored because the author(s) were testing the code on large buffers. But this optimization came at the expense of performance in the small buffer case.  Which means the actual benefit of this optimization depended on what your application uses memchr for.  If you regularly scan large buffers, then glibc's implementation will give you better performance.  However, if your application deals with a lot of small strings, you might be better off writing your own naïve byte-by-byte scanning code, because it will actually outperform glibc.

My gut feeling is that a lot of software actually frequently deals with short strings: think about your typical Facebook post or text message, or database of customer names. Your customers aren't going to have names that are several KB long, so if the software uses glibc's memchr on names, then it's performing rather poorly. A live chat system sends short messages at a time, and a lot of network protocols also center around short(ish) messages.  Large data like video or music generally don't use memchr() because they aren't textual. And even then, you tend to process them in blocks typically 4KB each or so.  So it's questionable whether glibc's choice of memchr implementation is really optimal in the sense of benefitting the most common applications, rather than just excelling at an artificial benchmark that doesn't accurately represent typical real-world usage.

Another corollary from all this, is that sometimes, there is no unique optimization that will work best for everybody.  There is no "optimal" code that's detached from its surrounding context; you optimize for one use case at the detriment of another.  And unless you have specific use cases in mind, it's hard, or even impossible, to make the "right" decisions -- and this is especially the case for standard libraries that must be as generic as possible. The more generic you get, the harder it is to choose the best algorithm. At a certain point it becomes outright impossible because "best" becomes ill-defined (best for who?).

And this comes back to my point that I get suspicious when people start trying to squeeze that last 5-10% performance out of their code. Beware, because you could be optimizing for your benchmark rather than the user's actual environment.


T

-- 
An old friend is better than two new ones.
October 23, 2019
On 10/23/2019 4:51 PM, H. S. Teoh wrote:
>> It's also true that code optimizers are tuned to what the local C/C++
>> compiler generates, even if the optimizer is designed to work with
>> multiple diverse languages.
> 
> Interesting, I didn't know this.

It's pretty straightforward why - the optimizer developers tend to be C/C++ developers who look at the generated code from their C/C++ compiler.

October 24, 2019
On Wed, Oct 23, 2019 at 11:40 PM welkam via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
>
> I watched many of his talks and he frequently talks about optimizations that produce single-digit % speedups in frequently used algorithms, but he doesn't provide adequate proof that his change in algorithm was the reason we see the performance difference. Modern CPUs are sensitive to many things, and one of them is code layout in memory. Hot loops are the most susceptible to this, to the point where changing the user name under which the executable is run changes performance. The paper below goes deeper into this.
>
> Producing Wrong Data Without Doing Anything Obviously Wrong! https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf

https://www.youtube.com/watch?v=r-TLSBdHe1A&t=20s
October 24, 2019
On Wednesday, 23 October 2019 at 23:22:08 UTC, Walter Bright wrote:
> instruction sets were a wacky collection of seemingly unrelated instructions.

Yeah, at some point compiler writers and chip designers were in competition over who could produce better code. You can look at those instructions as an attempt at peephole optimization.
October 24, 2019
On Thursday, 24 October 2019 at 00:53:27 UTC, H. S. Teoh wrote:
> you could be optimizing for your benchmark rather than the user's actual environment.

That's the feeling I get when reading blog posts on Rust compiler speed improvements.

The other thing to keep in mind is which CPU resources are actually limiting your performance: if you change your algorithm to use more D-cache and it shows a speed improvement in microbenchmarks, but your application is already starving for D-cache, you will reduce performance when you add that change to the whole application.