Thread overview
TSV Utilities release with LTO and PGO enabled
Jan 14, 2018 - Jon Degenhardt
Jan 16, 2018 - Martin Nowak
Jan 16, 2018 - Jon Degenhardt
Jan 16, 2018 - Johan Engelen
Jan 17, 2018 - Jon Degenhardt
Jan 17, 2018 - Johan Engelen
Jan 18, 2018 - Jon Degenhardt

January 14, 2018
I just released a new version of eBay's TSV Utilities. The cool thing about the release is not the changes in the toolkit, but that it was possible to build everything using LDC's support for Link Time Optimization (LTO) and Profile Guided Optimization (PGO). This includes running the optimizations on both the application code and the D standard libraries (druntime and phobos). Further, it was all doable on Travis-CI (Linux and macOS), including building the release binaries available from the GitHub release page.

Combined, LTO and PGO resulted in performance improvements greater than 25% on three of my standard six benchmarks, and five of the six improved at least 8%.
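
For readers unfamiliar with the workflow, a combined LTO and PGO build with LDC can be sketched roughly as below. This is a minimal sketch, not the project's actual build setup: the source file, binary names, and data file are hypothetical, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of a three-step LTO + PGO build with LDC.
# All file names below are hypothetical placeholders.
set -eu

APP_SRC="app.d"             # hypothetical application source
PROFDATA="app.profdata"     # merged profile used by the final build
# Full LTO across app code and the LTO-enabled druntime/phobos libraries.
LTO_FLAGS="-flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# 1. Instrumented build; the binary writes raw profile data when run.
run ldc2 -O -release $LTO_FLAGS -fprofile-instr-generate=profile.%p.raw \
    -of=app_instr "$APP_SRC"

# 2. Run on representative (training) input, then merge the raw profiles.
run ./app_instr training_data.tsv
run ldc-profdata merge -o "$PROFDATA" profile.*.raw

# 3. Final build, optimized using the collected profile.
run ldc2 -O -release $LTO_FLAGS -fprofile-instr-use="$PROFDATA" \
    -of=app "$APP_SRC"
```

Dropping the -fprofile-instr-* flags gives an LTO-only build, and dropping $LTO_FLAGS gives a PGO-only build, which is one way to measure the two separately.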

Release info: https://github.com/eBay/tsv-utils-dlang/releases/tag/v1.1.16

January 16, 2018
On Sunday, 14 January 2018 at 23:18:42 UTC, Jon Degenhardt wrote:
> Combined, LTO and PGO resulted in performance improvements greater than 25% on three of my standard six benchmarks, and five of the six improved at least 8%.

Yay, I'm usually seeing double digit improvements for PGO alone, and single digit improvements for LTO. Meaning PGO has more effect even though LTO seems to be the more hyped one.
Have you bothered benchmarking them separately?

January 16, 2018
On Tuesday, 16 January 2018 at 00:19:24 UTC, Martin Nowak wrote:
> On Sunday, 14 January 2018 at 23:18:42 UTC, Jon Degenhardt wrote:
>> Combined, LTO and PGO resulted in performance improvements greater than 25% on three of my standard six benchmarks, and five of the six improved at least 8%.
>
> Yay, I'm usually seeing double digit improvements for PGO alone, and single digit improvements for LTO. Meaning PGO has more effect even though LTO seems to be the more hyped one.
> Have you bothered benchmarking them separately?

Last spring I made a few quick tests of both separately. That was just against the app code, without druntime/phobos. Saw some benefit from LTO, mainly one of the tools, and not much from PGO.

More recently I tried LTO standalone and LTO plus PGO, both against app code and druntime/phobos, but not PGO standalone. The LTO benchmarks are here: https://github.com/eBay/tsv-utils-dlang/blob/master/docs/dlang-meetup-14dec2017.pdf. I haven't published the LTO + PGO benchmarks.

The takeaway from my tests is that LTO and PGO will benefit different apps differently, perhaps in ways not easily predicted. One of my tools benefited primarily from PGO, two primarily from LTO, and one materially from both. So, it is worth trying both.

For both, the big win was from optimizing across app code and libs (druntime/phobos in my case). It'd be interesting to see if other apps see similar behavior, either with phobos/druntime or other libraries, perhaps libraries from dub dependencies.
January 16, 2018
On Tuesday, 16 January 2018 at 02:45:39 UTC, Jon Degenhardt wrote:
> On Tuesday, 16 January 2018 at 00:19:24 UTC, Martin Nowak wrote:
>> On Sunday, 14 January 2018 at 23:18:42 UTC, Jon Degenhardt wrote:
>>> Combined, LTO and PGO resulted in performance improvements greater than 25% on three of my standard six benchmarks, and five of the six improved at least 8%.
>>
>> Yay, I'm usually seeing double digit improvements for PGO alone, and single digit improvements for LTO. Meaning PGO has more effect even though LTO seems to be the more hyped one.
>> Have you bothered benchmarking them separately?
>
> Last spring I made a few quick tests of both separately. That was just against the app code, without druntime/phobos. Saw some benefit from LTO, mainly one of the tools, and not much from PGO.

Because PGO optimizes for the given profile, it would help a lot if you clarified how you do your PGO benchmarking. What kind of test load profile you used for optimization and what test load you use for the time measurement.

Regardless, it's fun to hear your test results :-)
  Johan
January 17, 2018
On Tuesday, 16 January 2018 at 22:04:52 UTC, Johan Engelen wrote:
> Because PGO optimizes for the given profile, it would help a lot if you clarified how you do your PGO benchmarking. What kind of test load profile you used for optimization and what test load you use for the time measurement.

The profiling setup is checked into the repo and run as part of a PGO build, so it is available for inspection. The benchmarks used for the deltas are also documented; they are the ones used in the benchmark comparison to similar tools done in March 2017. That report is in the repo (https://github.com/eBay/tsv-utils-dlang/blob/master/docs/Performance.md).

However, it's hard to imagine anyone perusing the repo for this stuff, so I'll try to summarize what I did below.

Benchmarks - Six different tests of rather different but common operations run on large data files. The six tests were chosen because for each I was able to find at least three other tools, written in native compiled languages, with similar functionality. There are other valuable benchmarks, but I haven't published them.

Profiling - Profiling was developed separately for each tool. For each I generated several data files with data representative of typical use cases. Generally numeric or text data in several forms and distributions. The data was unrelated to the data used in benchmarks, which is from publicly available machine learning data sets. However, personal judgment was used in the generation of the data sets, so it's not free from bias.

After generating the data, I generated a set of run options specific to each tool. As an example, tsv-filter selects data file lines based on various numeric and text criteria (e.g. less-than). There are a bit over 50 comparison operations, plus a few meta operations. The profiling runs ensure every operation is run at least once, with the most important ones overweighted. The ldc.profile.resetAll call was used to exclude all the initial setup code (command line argument processing). This was nice because it meant the data files could be small relative to real-world sets, and the profiling runs fast enough to do as part of the build step (i.e. on Travis-CI).

Look at https://github.com/eBay/tsv-utils-dlang/tree/master/tsv-filter/profile_data to see a concrete example (tsv-filter). In that directory are five data files and a shell script that runs the commands and collects the data.
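
A collection script in this spirit might look like the sketch below. It is not the script in the repository: the binary name, data file, field numbers, and repeat counts are hypothetical, and repeating a run is only one assumed way of overweighting an operation. It prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of a profile-collection script (not the one in the repo).
set -eu

TOOL="./tsv-filter-instrumented"   # hypothetical instrumented binary
DATA="profile_data_1.tsv"          # hypothetical training data file

# Print each command; execute it (discarding output) only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@" > /dev/null; fi
}

# Run one option set COUNT times. Higher counts overweight the
# operations judged most important in typical use.
profile_op() {
    count=$1; shift
    i=0
    while [ "$i" -lt "$count" ]; do
        run "$TOOL" -H "$@" "$DATA"
        i=$((i + 1))
    done
}

profile_op 3 --lt 2:100           # common numeric test, overweighted
profile_op 3 --str-eq 3:abc       # common string test, overweighted
profile_op 1 --regex '4:^a.*z$'   # less common, run once for coverage

# Merge the raw profiles; the profile.*.raw pattern assumes the binary
# was built with -fprofile-instr-generate=profile.%p.raw.
run ldc-profdata merge -o app.profdata profile.*.raw
```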

This was done for four of the tools covering five of the benchmarks. I skipped one of the tools (tsv-join), as it's harder to come up with a concise set of profile operations for it.

I then ran the standard benchmarks I usually report on in various D venues.

Clearly personal judgment played a role. However, the tools are reasonably task focused, and I did take basic steps to ensure the benchmark data and tests were separate from the training data/tests. For these reasons, my confidence is good that the results are reasonable and well founded.

--Jon
January 17, 2018
On Wednesday, 17 January 2018 at 04:37:04 UTC, Jon Degenhardt wrote:
>
> Clearly personal judgment played a role. However, the tools are reasonably task focused, and I did take basic steps to ensure the benchmark data and tests were separate from the training data/tests. For these reasons, my confidence is good that the results are reasonable and well founded.

Great, thanks for the details, I agree.
Hope it's useful for others to see these details.

(btw, did you also check the performance gains when using the profile of the benchmark itself, to learn about the upper-bound of PGO for your program?)

I'll merge the IR PGO addition into LDC master soon. Don't know what difference it'll make.

-Johan

January 18, 2018
On Wednesday, 17 January 2018 at 21:49:52 UTC, Johan Engelen wrote:
> On Wednesday, 17 January 2018 at 04:37:04 UTC, Jon Degenhardt wrote:
>>
>> Clearly personal judgment played a role. However, the tools are reasonably task focused, and I did take basic steps to ensure the benchmark data and tests were separate from the training data/tests. For these reasons, my confidence is good that the results are reasonable and well founded.
>
> Great, thanks for the details, I agree.
> Hope it's useful for others to see these details.

Thanks Johan, much appreciated. :)

> (btw, did you also check the performance gains when using the profile of the benchmark itself, to learn about the upper-bound of PGO for your program?)
>
> I'll merge the IR PGO addition into LDC master soon. Don't know what difference it'll make.

No, I didn't do an upper-bounds check; that's a good idea. I plan to test the IR-based PGO when it's available, and I'll run an upper-bounds check as part of it.