Thread overview
"BOLT" post-link optimizer gives 15% speed boost to Clang
Oct 23, 2018
Johan Engelen
Oct 23, 2018
Nicholas Wilson
Oct 24, 2018
Walter Bright
Oct 24, 2018
Johan
Oct 24, 2018
Walter Bright
Oct 24, 2018
welkam
October 23, 2018
Hi all,
  "Post-link optimization", what? Yes, indeed, optimization _after_ the _linker_ has generated a _binary_.
Read about this interesting work and the discussion here: https://lists.llvm.org/pipermail/llvm-dev/2018-October/126985.html

When applied to clang, the performance gain is 15% faster execution (note, that's the result of applying BOLT on a clang binary that was already built with PGO and LTO !)

Cheers,
  Johan

October 23, 2018
On Tuesday, 23 October 2018 at 22:07:21 UTC, Johan Engelen wrote:
> Hi all,
>   "Post-link optimization", what? Yes, indeed, optimization _after_ the _linker_ has generated a _binary_.
> Read about this interesting work and the discussion here: https://lists.llvm.org/pipermail/llvm-dev/2018-October/126985.html
>
> When applied to clang, the performance gain is 15% faster execution (note, that's the result of applying BOLT on a clang binary that was already built with PGO and LTO !)
>
> Cheers,
>   Johan

Ahh, another thing to add to the list of cool stuff to integrate into LDC.
Its getting rather long.
October 23, 2018
On 10/23/2018 3:07 PM, Johan Engelen wrote:
> Hi all,
>    "Post-link optimization", what? Yes, indeed, optimization _after_ the _linker_ has generated a _binary_.
> Read about this interesting work and the discussion here: https://lists.llvm.org/pipermail/llvm-dev/2018-October/126985.html
> 
> When applied to clang, the performance gain is 15% faster execution (note, that's the result of applying BOLT on a clang binary that was already built with PGO and LTO !)
> 
> Cheers,
>    Johan
> 

Digital Mars C++ had this in the 1990s.

https://www.digitalmars.com/ctg/trace.html

How it works is -gt switch is thrown to generate a runtime profile of the graph of how functions call each other. Then, a module definition file https://www.digitalmars.com/ctg/ctgDefFiles.html is generated that directs the linker to order the layout of functions in the executable to minimize cache misses.

A 15% speedup is about right for this sort of optimization.

Note that Digital Mars didn't invent this, it was invented by (I forgot who) who productized it as "The Segmentor".

From https://github.com/facebookincubator/BOLT/blob/master/docs/OptimizingClang.md :

"BOLT (Binary Optimization and Layout Tool) is designed to improve the application performance by laying out code in a manner that helps CPU better utilize its caching and branch predicting resources.

"Before we can run BOLT optimizations, we need to collect the profile for Clang,

Yup, it's reinvention of the same thing.
October 24, 2018
On Wednesday, 24 October 2018 at 01:57:38 UTC, Walter Bright wrote:
> On 10/23/2018 3:07 PM, Johan Engelen wrote:
>> Hi all,
>>    "Post-link optimization", what? Yes, indeed, optimization _after_ the _linker_ has generated a _binary_.
>> Read about this interesting work and the discussion here: https://lists.llvm.org/pipermail/llvm-dev/2018-October/126985.html
>> 
>> When applied to clang, the performance gain is 15% faster execution (note, that's the result of applying BOLT on a clang binary that was already built with PGO and LTO !)
>> 
>> Cheers,
>>    Johan
>> 
>
> Digital Mars C++ had this in the 1990s.
>
> https://www.digitalmars.com/ctg/trace.html
>
> How it works is -gt switch is thrown to generate a runtime profile of the graph of how functions call each other. Then, a module definition file https://www.digitalmars.com/ctg/ctgDefFiles.html is generated that directs the linker to order the layout of functions in the executable to minimize cache misses.
>
> A 15% speedup is about right for this sort of optimization.
>
> Note that Digital Mars didn't invent this, it was invented by (I forgot who) who productized it as "The Segmentor".
>
> From https://github.com/facebookincubator/BOLT/blob/master/docs/OptimizingClang.md :
>
> "BOLT (Binary Optimization and Layout Tool) is designed to improve the application performance by laying out code in a manner that helps CPU better utilize its caching and branch predicting resources.
>
> "Before we can run BOLT optimizations, we need to collect the profile for Clang,
>
> Yup, it's reinvention of the same thing.

Sorry, but such statements show a large lack of insight in what is going on here. This isn't about who invented something first, the basic idea is of course very old. DMC++ most certainly didn't and doesn't do what is being discussed.
I suggest you reread about it without your preconceptions.

BOLT does not use compiler-inserted instrumentation for execution tracing (which btw is more than just function call tracing), that's what (usually) PGO and LTO use. Instead, the profile that BOLT uses is a sampling-based profile obtained e.g. using `perf` and hardware counters. The code layout optimization performed is not just putting functions in a better order, but moreso the reordering of basic blocks inside functions. PGO and LTO already do that too (and can also work with sampling based profiling but have trouble using all profile data information inside functions). Because the sampling profile data is more accurate than compiler-inserted instrumentation added before many optimization transformations, and because the optimization happens on machinecode level (meaning that more of the sampling profile data can be used well) substantial additional gains are obtained by post-link optimization on top of PGO+LTO.

- Johan

October 24, 2018
On Wednesday, 24 October 2018 at 01:57:38 UTC, Walter Bright wrote:
> Yup, it's reinvention of the same thing.

The authors never claimed that they invented anything. What they claim is that doing reordering after linking has benefits that compiler cant easily do. Bolt improves performance on top of PGO+LTO by:

1. Using more accurate profile data from intel`s last branch record(LBR)
2. Inlining function some times changes hotness of code blocks based on where it was inlined so doing profiling after inlining gives better decision making for Bolt.

The negative of Bolt`s approach is that is limited on what allowed transformations it can do so people should use PGO+LTO+Bolt for maximum performance. Second negative is that it works with ELF format so windows users wont benefit from it.
October 24, 2018
On 10/24/2018 3:26 AM, Johan wrote:
> On Wednesday, 24 October 2018 at 01:57:38 UTC, Walter Bright wrote:
>> On 10/23/2018 3:07 PM, Johan Engelen wrote:
>>>    "Post-link optimization", what? Yes, indeed, optimization _after_ the _linker_ has generated a _binary_.
>> Yup, it's reinvention of the same thing.
> Sorry, but such statements show a large lack of insight in what is going on here. This isn't about who invented something first, the basic idea is of course very old. DMC++ most certainly didn't and doesn't do what is being discussed.


> "Post-link optimization", what? Yes, indeed, optimization _after_ the _linker_ has generated a _binary_.

That was the summary, implying it is some new idea. It's an incremental improvement. DMC++ does indeed do post-link optimization. If it was written as "BOLT improves on profile guided optimization after the linker has generated a binary by reordering basic blocks" I wouldn't have had any issue with that.

clang is an excellent implementation of a lot of ideas.