Thread overview
M1 10x faster than Intel at integral division, throughput one 64-bit divide in two cycles
May 13, 2021
https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/

Integral division is the strongest arithmetic operation.

I have a friend who knows some M1 internals. He said it's really Star Trek stuff.

This will seriously challenge other CPU producers.

What perspectives do we have to run the compiler on M1 and produce M1 code?
May 13, 2021
On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu wrote:
> https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/
>
> Integral division is the strongest arithmetic operation.
>
> I have a friend who knows some M1 internals. He said it's really Star Trek stuff.
>
> This will seriously challenge other CPU producers.
>
> What perspectives do we have to run the compiler on M1 and produce M1 code?

LDC v1.26 is already doing a very good job IMHO. I'm running with native ARM (brew.sh) LDC on Mac mini M1 and compiles are not noticeably slower than DMD on my Intel MacBook. Impressive!
May 13, 2021

On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu wrote:
> https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/
>
> Integral division is the strongest arithmetic operation.
>
> I have a friend who knows some M1 internals. He said it's really Star Trek stuff.
>
> This will seriously challenge other CPU producers.
>
> What perspectives do we have to run the compiler on M1 and produce M1 code?

It's already winning, let alone challenging - although consider just how fucking enormous the M1's per-core transistor budget is. From what is publicly known, the M1 doesn't really have that much magic to it; it is rather an extremely wide (where it really matters) iteration of what already works elsewhere in the industry, combined with no x86 tax on the desktop for the first time. Intel's process engineers completely dropped the ball, so the M1 is on a process something like 4-5x denser than Intel 14nm.

Someone mentioned on Hacker News that Intel also improved integer division in the generation after this Xeon - that would be worth benchmarking - although expecting monster SPECint numbers from a 28-core Xeon is probably missing the point.

Someone on the Discord has an M1, and D already works fine there apparently; I'm aiming to get a blog post out of it.

The GCC project has M1 hardware and should apparently be getting support soon-ish. Apple don't like upstreaming their backends from what I can tell, so it could be a while before they get tuned much.

Apple also haven't published anything along the lines of an optimization manual for the M1, so I guess we'll find out via osmosis what it's really capable of as time goes on. I think it's more likely Apple get the Microsoft hidden-API treatment than actually go public on the extensions they have made to the ARM ISA - both new instructions, and an old trick SPARC had which basically turns TSO on underneath a program to aid x86 emulation.

May 13, 2021
On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu wrote:
> https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/
>
> Integral division is the strongest arithmetic operation.
>
> I have a friend who knows some M1 internals. He said it's really Star Trek stuff.
>
> This will seriously challenge other CPU producers.
>
> What perspectives do we have to run the compiler on M1 and produce M1 code?

LDC works out of the box.
GDC may support it with GDC11 (no idea on time frame).
DMD has no support for ARM whatsoever.
non-dmdfe based compilers, no idea. Probably not.
May 13, 2021
On Thursday, 13 May 2021 at 04:40:25 UTC, kookman wrote:
>
> LDC v1.26 is already doing a very good job IMHO. I'm running with native ARM (brew.sh) LDC on Mac mini M1 and compiles are not noticeably slower than DMD on my Intel MacBook. Impressive!

Even LDC 1.24 x86_64 cross-compiling to arm64 has very nice compilation time on the M1 here. Which is nice because you can then build the combined binary with one compiler.
May 13, 2021
On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu wrote:
> https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/
>
> Integral division is the strongest arithmetic operation.
>
> I have a friend who knows some M1 internals. He said it's really Star Trek stuff.
>
> This will seriously challenge other CPU producers.
>

No. It means nothing.

1) The M1 is built on a smaller manufacturing node, which allows it to "waste" more silicon area on such niche stuff.

2) He measured throughput in highly div-dense code; he didn't measure the actual latency of a divide. Anybody can raise integer-division throughput by throwing more execution units at it or by fully pipelining the divider. That costs a lot of silicon for near-zero gain, because real-world code doesn't issue a divide every next or second instruction.

He only checked one x86 CPU. What about 3rd-gen Xeons (i.e. Cooper Lake)? Zen 2? Did he isolate memory effects?

I see he used a very high `count` in the loop: `count=524k`. That is 2 MiB / 4 MiB of data. Granted, this access pattern is easy to predict and prefetching will work really well, but it will still touch multiple levels of cache.

There are data dependencies in the loop on `sum`, making speculation really hard.

We don't see the assembly, so we don't know how much the loops are unrolled, or how much potential there is for parallelism or hardware pipelining.

3) Apple can get away with this because they run on a leading-edge manufacturing node, clock lower to reduce power, and can afford to waste silicon.


My guess is: if you do a single divide outside a loop, the latency will most likely be about the same on both platforms.

There is nothing magical about M1 "fast" division.
May 13, 2021
On Thursday, 13 May 2021 at 11:58:50 UTC, Witold Baryluk wrote:
> On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu wrote:
>> https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/
>>
>> Integral division is the strongest arithmetic operation.
>>
>> I have a friend who knows some M1 internals. He said it's really Star Trek stuff.
>>
>> This will seriously challenge other CPU producers.
>>
>
> No. It means nothing.
>
> 1) The M1 is built on a smaller manufacturing node, which allows it to "waste" more silicon area on such niche stuff.
>
> 2) He measured throughput in highly div-dense code; he didn't measure the actual latency of a divide. Anybody can raise integer-division throughput by throwing more execution units at it or by fully pipelining the divider. That costs a lot of silicon for near-zero gain, because real-world code doesn't issue a divide every next or second instruction.
>
> He only checked one x86 CPU. What about 3rd-gen Xeons (i.e. Cooper Lake)? Zen 2? Did he isolate memory effects?
>
> I see he used a very high `count` in the loop: `count=524k`. That is 2 MiB / 4 MiB of data. Granted, this access pattern is easy to predict and prefetching will work really well, but it will still touch multiple levels of cache.
>
> There are data dependencies in the loop on `sum`, making speculation really hard.
>
> We don't see the assembly, so we don't know how much the loops are unrolled, or how much potential there is for parallelism or hardware pipelining.
>
> 3) Apple can get away with this because they run on a leading-edge manufacturing node, clock lower to reduce power, and can afford to waste silicon.
>
>
> My guess is: if you do a single divide outside a loop, the latency will most likely be about the same on both platforms.
>
> There is nothing magical about M1 "fast" division.

I just tested, using his benchmark code, on my somewhat older AMD Zen+ CPU clocked at 2.8 GHz (so actually slower than either the M1 or the tested Xeon):

I got 1.156 ns per u32 divide using the hardware divide. Normalized to 3.2 GHz, that is 1.01 ns.

0.399 ns (0.349 ns normalized to 3.2 GHz) when using `libdivide` - essentially the same speed as the M1 (0.351 ns).

So, no, M1 is not 10 times faster than "x86".

Next time, exercise more critical thinking when reading "benchmark" claims.
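
For context on what libdivide is doing: it replaces the divide with a multiply-high plus shifts, the same strength reduction compilers already apply when the divisor is a compile-time constant; libdivide's contribution is computing the magic constant at run time. A minimal C sketch for the fixed divisor 7 (constant per the standard Granlund-Montgomery / Hacker's Delight construction; function name is mine):

```c
#include <stdint.h>

/* Unsigned 32-bit division by the constant 7 with no divide
   instruction: multiply-high by a magic constant, then a fixup.
   magic = ceil(2^35 / 7) - 2^32; the full constant doesn't fit in
   32 bits, hence the add/shift fixup after the multiply-high. */
uint32_t div7(uint32_t n) {
    uint32_t q = (uint32_t)(((uint64_t)n * 0x24924925u) >> 32);
    uint32_t t = ((n - q) >> 1) + q;   /* overflow-safe fixup average */
    return t >> 2;
}
```

The multiply-plus-shifts sequence is a handful of 1-3 cycle ops, which is why it beats a 20-100 cycle hardware divide on either architecture.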

May 13, 2021
On Thursday, 13 May 2021 at 06:02:31 UTC, Nicholas Wilson wrote:
> LDC works out of the box.
> GDC may support it with GDC11 (no idea on time frame).
> DMD has no support for ARM whatsoever.
> non-dmdfe based compilers, no idea. Probably not.

Nope, it doesn't work (sorry it took 5 months to raise an issue)

https://issues.dlang.org/show_bug.cgi?id=21919

Fortunately, OSX is broken on both X86_64 and ARM64, so the sooner someone fixes it...
May 13, 2021
On Thursday, 13 May 2021 at 12:06:01 UTC, Witold Baryluk wrote:
> Next time, exercise more critical thinking when reading "benchmark" claims.

Indeed, proper benchmarks use application suites, not shoehorned synthetic garbage... Besides, most performance-sensitive code doesn't use division much if the programmers know what they are doing. And in this "benchmark" the division could have been moved out of the inner loop by a less-than-braindead compiler.
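
The hoisting point above can be sketched as follows (hypothetical function, not the benchmark's code; relies on the GCC/Clang `unsigned __int128` extension): since the divisor is loop-invariant, precompute a 64-bit fixed-point reciprocal once, and the loop body then contains no hardware divide at all.

```c
#include <stdint.h>

/* Sum of a[i]/d with the divide hoisted: one reciprocal computed
   outside the loop, one multiply-high per element inside it.
   UINT64_MAX/d + 1 == ceil(2^64/d), which gives the exact floor
   quotient for all 32-bit numerators when d >= 2. */
uint64_t sum_divided(const uint32_t *a, int n, uint32_t d) {
    uint64_t recip = UINT64_MAX / d + 1;   /* once, outside the loop */
    uint64_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += (uint64_t)(((unsigned __int128)a[i] * recip) >> 64);
    return sum;
}
```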

Looks like Intel is releasing a Clang-based C++ compiler with OpenMP offload to Intel GPUs... Does anyone know anything about it?



May 13, 2021
On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu wrote:
> https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/
>
> Integral division is the strongest arithmetic operation.
>
> I have a friend who knows some M1 internals. He said it's really Star Trek stuff.
>
> This will seriously challenge other CPU producers.

Integer division on Intel has always been excruciatingly slow: 64-bit idiv can take up to ~100 cycles in some cases, while DIVSD is around 20. It's much faster to convert to double, do the division, and convert back (if you are OK with slightly lower precision).
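
The convert-to-double trick can even be made exact with a one-step fixup, since the correctly rounded double quotient truncates to within one of the true result. A sketch (function name is mine; assumes both operands fit in a double's 53-bit mantissa, i.e. are below 2^53):

```c
#include <stdint.h>

/* Divide via double: one DIVSD instead of a slow 64-bit idiv.
   Truncating the (correctly rounded) double quotient can be off by
   one in either direction, so correct it; exact for n, d < 2^53. */
uint64_t div_via_double(uint64_t n, uint64_t d) {
    uint64_t q = (uint64_t)((double)n / (double)d);
    if (q * d > n)             q--;   /* rounded up past the true quotient */
    else if ((q + 1) * d <= n) q++;   /* rounded down below it */
    return q;
}
```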

Just for reference, I looked up timings for Zen 3; 64-bit idiv is:

9-17 cycles latency, one per 7-12 cycles throughput

For Skylake, which is what the Xeon 8275CL looks to be based on, it's:

42-95 cycles latency, one per 24-90 cycles throughput

So on paper a Zen 3 is maybe 5 to 8 times faster at idiv than the Xeon he's using.


