May 15, 2020
On 5/14/20 9:26 AM, Mike Parker wrote:
> After reading a paper that grabbed his curiosity and wouldn't let go, Andrei set out to determine if Lomuto partitioning should still be considered inferior to Hoare for quicksort on modern hardware. This blog post details his results.
> 
> Blog:
> https://dlang.org/blog/2020/05/14/lomutos-comeback/
> 
> Reddit:
> https://www.reddit.com/r/programming/comments/gjm6yp/lomutos_comeback_quicksort_partitioning/ 
> 
> 
> HN:
> https://news.ycombinator.com/item?id=23179160

Fantastic article!

-Steve
May 17, 2020
On 5/14/20 11:57 AM, jmh530 wrote:
> On Thursday, 14 May 2020 at 13:40:24 UTC, Andrei Alexandrescu wrote:
>> [snip]
> 
> Really interesting. Thanks for sharing.
> 
> I have recently been spending some spare time learning more about D's topN and pivotPartition implementation, which led me to your paper on fast deterministic selection.
> 
> Would you consider changing the pivotPartition implementation based on this?

Yes, and I encourage you to look into putting together a PR.

> Would the insights gleamed from this paper mean that a branchless version of topN could be faster?

Yes. topN also uses partitioning.
May 17, 2020
On Thursday, 14 May 2020 at 14:11:57 UTC, SashaGreat wrote:
>
> If possible could you please next time share link with "old" instead of "www"? Like:
>
> https://old.reddit.com/r/programming/comments/gjm6yp/lomutos_comeback_quicksort_partitioning/
>

There is a Chrome extension that automatically redirects to the old version of Reddit

May 18, 2020
On Thursday, 14 May 2020 at 13:26:23 UTC, Mike Parker wrote:
> After reading a paper that grabbed his curiosity and wouldn't let go, Andrei set out to determine if Lomuto partitioning should still be considered inferior to Hoare for quicksort on modern hardware. This blog post details his results.
>
> Blog:
> https://dlang.org/blog/2020/05/14/lomutos-comeback/
>
> Reddit:
> https://www.reddit.com/r/programming/comments/gjm6yp/lomutos_comeback_quicksort_partitioning/
>
> HN:
> https://news.ycombinator.com/item?id=23179160

Got posted again to Hacker News earlier today. Currently at position 5.
May 17, 2020
On 5/14/20 9:26 AM, Mike Parker wrote:
> After reading a paper that grabbed his curiosity and wouldn't let go, Andrei set out to determine if Lomuto partitioning should still be considered inferior to Hoare for quicksort on modern hardware. This blog post details his results.
> 
> Blog:
> https://dlang.org/blog/2020/05/14/lomutos-comeback/
> 
> Reddit:
> https://www.reddit.com/r/programming/comments/gjm6yp/lomutos_comeback_quicksort_partitioning/ 
> 
> 
> HN:
> https://news.ycombinator.com/item?id=23179160

Looks like the blog post is enjoying a second wind after being posted by soneone else on hackernews. It's in top 10 right now.
May 31, 2020
On Thursday, 14 May 2020 at 13:26:23 UTC, Mike Parker wrote:
> After reading a paper that grabbed his curiosity and wouldn't let go, Andrei set out to determine if Lomuto partitioning should still be considered inferior to Hoare for quicksort on modern hardware. This blog post details his results.
>
> Blog:
> https://dlang.org/blog/2020/05/14/lomutos-comeback/
>
> Reddit:
> https://www.reddit.com/r/programming/comments/gjm6yp/lomutos_comeback_quicksort_partitioning/
>
> HN:
> https://news.ycombinator.com/item?id=23179160

A follow up article on this:

https://news.ycombinator.com/item?id=23363165
https://blog.reverberate.org/2020/05/29/hoares-rebuttal-bubble-sorts-comeback.html
August 03, 2020

On 15/05/2020 12:28, Joseph Rushton Wakeling via Digitalmars-d-announce wrote:
> On Thursday, 14 May 2020 at 13:26:23 UTC, Mike Parker wrote:
>> After reading a paper that grabbed his curiosity and wouldn't let go, Andrei set out to determine if Lomuto partitioning should still be considered inferior to Hoare for quicksort on modern hardware. This blog post details his results.
>>
>> Blog:
>> https://dlang.org/blog/2020/05/14/lomutos-comeback/
> 
> Nice stuff!
> 
> One curious question -- unless I've misread things horribly, it looks like the D benchmarks for Lomuto branch-free are consistently slower than for C++.  Any idea why that is?  I would expect gcc/gdc and clang/ldc to produce effectively identical results for code like this.

Sorry for the belated response, as far as I can see, gdc and g++ only differ on one line.

    auto delta = smaller & (read - first);

This is lowered as:

    delta = smaller & (read - first) / 8;

However, g++ uses an exact divide operator (semantically that ignores rounding), whereas gdc uses a truncating divide operator (semantically rounds the quotient towards zero).

I'm willing to bet a beer on tweaking pointer subtraction will get gdc in lockstep with g++.
August 03, 2020
On 03/08/2020 13:08, Iain Buclaw via Digitalmars-d-announce wrote:
> 
> 
> On 15/05/2020 12:28, Joseph Rushton Wakeling via Digitalmars-d-announce wrote:
>> On Thursday, 14 May 2020 at 13:26:23 UTC, Mike Parker wrote:
>>> After reading a paper that grabbed his curiosity and wouldn't let go, Andrei set out to determine if Lomuto partitioning should still be considered inferior to Hoare for quicksort on modern hardware. This blog post details his results.
>>>
>>> Blog:
>>> https://dlang.org/blog/2020/05/14/lomutos-comeback/
>>
>> Nice stuff!
>>
>> One curious question -- unless I've misread things horribly, it looks like the D benchmarks for Lomuto branch-free are consistently slower than for C++.  Any idea why that is?  I would expect gcc/gdc and clang/ldc to produce effectively identical results for code like this.
> 
> Sorry for the belated response, as far as I can see, gdc and g++ only differ on one line.
> 
>     auto delta = smaller & (read - first);
> 
> This is lowered as:
> 
>     delta = smaller & (read - first) / 8;
> 
> However, g++ uses an exact divide operator (semantically that ignores rounding), whereas gdc uses a truncating divide operator (semantically rounds the quotient towards zero).
> 
> I'm willing to bet a beer on tweaking pointer subtraction will get gdc in lockstep with g++.
> 

I doubt Andrei will re-run the benchmarks now, but here's the PR (problem reference) with patch attached.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96429

August 04, 2020
On 04/08/2020 03:14, Andrei Alexandrescu wrote:
> Interesting, thanks!
> 

Did a quick benchmark for n in `seq 1 10` ./lomuto.exe ${n}000000...

gdc-baseline:
min_milliseconds=53.2922
median_milliseconds=56.8761
min_milliseconds=111.2512
median_milliseconds=115.5812
min_milliseconds=168.8659
median_milliseconds=174.9421
min_milliseconds=228.9230
median_milliseconds=235.9950
min_milliseconds=291.4758
median_milliseconds=302.2652
min_milliseconds=354.6598
median_milliseconds=360.3230
min_milliseconds=420.6221
median_milliseconds=424.0275
min_milliseconds=490.9294
median_milliseconds=505.0082
min_milliseconds=535.9912
median_milliseconds=556.0680
min_milliseconds=619.8160
median_milliseconds=629.6446

gdc-pr96429:
min_milliseconds=44.1008
median_milliseconds=46.2956
min_milliseconds=96.0864
median_milliseconds=99.4665
min_milliseconds=146.2498
median_milliseconds=151.5661
min_milliseconds=201.9479
median_milliseconds=207.0937
min_milliseconds=253.2555
median_milliseconds=265.6178
min_milliseconds=309.5941
median_milliseconds=317.8178
min_milliseconds=364.9312
median_milliseconds=381.9105
min_milliseconds=427.2581
median_milliseconds=437.9506
min_milliseconds=464.2838
median_milliseconds=482.9127
min_milliseconds=537.3167
median_milliseconds=550.9250

g++:
min_milliseconds=44.0164
median_milliseconds=46.5057
min_milliseconds=91.2730
median_milliseconds=97.8507
min_milliseconds=142.8459
median_milliseconds=153.4782
min_milliseconds=196.4765
median_milliseconds=202.1877
min_milliseconds=251.5876
median_milliseconds=261.6350
min_milliseconds=299.8530
median_milliseconds=311.0719
min_milliseconds=351.9959
median_milliseconds=370.0437
min_milliseconds=412.4231
median_milliseconds=420.1336
min_milliseconds=462.4810
median_milliseconds=474.2444
min_milliseconds=539.3359
median_milliseconds=541.5321

Looks good, so committing patch. :-)
August 04, 2020
On 8/4/20 4:19 AM, Iain Buclaw wrote:
> On 04/08/2020 03:14, Andrei Alexandrescu wrote:
>> Interesting, thanks!
>>
> 
> Did a quick benchmark for n in `seq 1 10` ./lomuto.exe ${n}000000...
[snip]
> Looks good, so committing patch. :-)

Awesome, thanks! That does solve a puzzler I had while benchmarking.

I'm thinking the story of discovering and fixing this would be a great follow-up in the blog. It doesn't quite mesh with Mike's current introductory series, but it could be done as an intermezzo a la "Now For Something Completely Different (And Much Lower Level)".

Sketch of an intro:

Upon reading "Lomuto's Comeback" in the D blog, I noticed the performance were consistently juuust a bit worse for the D version than for the C++ version for the same source code. My own measurements confirmed the same. That bothered me at two levels. First, people unfamiliar with the D language would form the opinion that D cannot reach the efficiency of C++. Second, as the gdc creator and maintainer, I knew for a fact the produced code must be literally identical. Any difference would pin point a bug somewhere in the code generation pipeline. So I set out to find it and fix it. This is the story of that investigation, which will take us through looking through disassembly, finding the culprit, devising a fix, confirming with measurements, and patching the open-source gdc compiler.

...