February 14, 2019
On Wednesday, 13 February 2019 at 23:26:48 UTC, Crayo List wrote:
> On Wednesday, 13 February 2019 at 19:55:05 UTC, Guillaume Piolat wrote:
>> On Wednesday, 13 February 2019 at 04:57:29 UTC, Crayo List wrote:
>>> On Wednesday, 6 February 2019 at 01:05:29 UTC, Guillaume Piolat wrote:
>>>> "intel-intrinsics" is a DUB package for people interested in x86 performance that want neither to write assembly, nor a LDC-specific snippet... and still have fastest possible code.
>>>>
>>> This is really cool and I appreciate your efforts!
>>>
>>> However (for those who are unaware) there is an alternative way that is (arguably) better:
>>> https://ispc.github.io/index.html
>>>
>>> You can write portable vectorized code that can be trivially invoked from D.
>>
>> ispc is another compiler in your build, and you'd write in another language, so it's not really the same thing.
>
> That's mostly what I said, except that I did not say it's the same thing.
> It's an alternative way to produce vectorized code in a deterministic and portable way.
> This is NOT an auto-vectorizing compiler!
>
>> I haven't used it (nor do I know anyone who does), so I don't really know why it would be any better
> And that's precisely why I posted here: so that people interested in vectorizing their code in a portable way are aware that there is another (arguably) better way.
> I highly recommend browsing through the walkthrough example:
> https://ispc.github.io/example.html
>
> For example, I have code that I can run on my Xeon Phi 7250 Knights Landing CPU by compiling with --target=avx512knl-i32x16, then I can run the exact same code with no change at all on my i7-5820k by compiling with --target=avx2-i32x8. Each time I get optimal code. This is not something you can easily do with intrinsics!


I don't disagree, but ispc sounds more like a host-only OpenCL to me, rather than a replacement for or competitor to intel-intrinsics.

Intrinsics are easy: if calling another compiler with another source language might be trivial, then importing a DUB package and starting to use it within the same source code is even more trivial!
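
For illustration, adding the dependency and writing SSE code looks something like this (a minimal sketch; the module and intrinsic names follow the Intel naming that intel-intrinsics mirrors):

// dub.sdl: dependency "intel-intrinsics" version="*"
import inteli.xmmintrin; // SSE intrinsics, mirroring <xmmintrin.h>

float[4] add4(const float[4] a, const float[4] b)
{
    __m128 va = _mm_loadu_ps(a.ptr); // unaligned load of 4 floats
    __m128 vb = _mm_loadu_ps(b.ptr);
    __m128 vr = _mm_add_ps(va, vb);  // 4 additions, 1 instruction
    float[4] r;
    _mm_storeu_ps(r.ptr, vr);
    return r;
}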

I take issue with the claim that Single Program Multiple Data yields much more performance than well-written intrinsics code: when your compiler auto-vectorizes (or you vectorize using SIMD semantics), you _also_ get one instruction operating on multiple data. The only gain I can see for SPMD would be the use of non-temporal writes, since they are so hard to use effectively in practice.
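
To make the non-temporal point concrete, a hedged sketch (assuming _mm_set1_ps, _mm_stream_ps and _mm_sfence are exposed as in <xmmintrin.h>; dst must be 16-byte aligned and count a multiple of 4):

import inteli.xmmintrin;

// Fill a large buffer with non-temporal stores, bypassing the cache.
// This only pays off when the data will not be read back soon.
void fillStreaming(float* dst, size_t count, float value)
{
    __m128 v = _mm_set1_ps(value);      // broadcast to 4 lanes
    for (size_t i = 0; i + 4 <= count; i += 4)
        _mm_stream_ps(dst + i, v);      // non-temporal 16-byte store
    _mm_sfence(); // order streamed stores before later memory ops
}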

I also take some issue with "portability": SIMD intrinsics optimize quite deterministically (some instructions have been generated reliably since LDC 1.0.0, even at -O0), and LLVM IR is portable to ARM, whereas ispc likely never will be, as admitted by its author: https://pharr.org/matt/blog/2018/04/29/ispc-retrospective.html

My interest in AVX-512 is subnormal: it can _slow things down_ on some x86 CPUs: https://gist.github.com/rygorous/32bc3ea8301dba09358fd2c64e02d774 In general, the latest instruction sets are increasingly hard to apply and have lower yield.

The newer Intel instruction sets are basically a scam for the performance-minded. Sponsored work rewriting parts of x265 with AVX-512 yielded abnormally low gains: https://software.intel.com/en-us/articles/accelerating-x265-with-intel-advanced-vector-extensions-512-intel-avx-512

As to compiling precisely for the host target: we are building B2C software here, so we don't control the host machine. Thankfully the ancient SIMD instruction sets yield most of the value, since much of the time memory throughput is the bottleneck!

I can see ispc being more useful when you know the precise model of your target Intel CPU. I would also like to see it compared to Intel's own software OpenCL implementation: it seems ispc started its life as internal competition to it.

February 14, 2019
On Wednesday, 13 February 2019 at 23:26:48 UTC, Crayo List wrote:
> And that's precisely why I posted here: so that people interested in vectorizing their code in a portable way are aware that there is another (arguably) better way.

All power to the people that have code that simple. But auto-vectorising in any capacity is the wrong way to do things in my field. An intrinsics library is vital to write highly specialised code.

The tl;dr here is that we *FINALLY* have a minimum-spec for x64 CPUs represented with SSE intrinsics. Instead of whatever core.simd is. That's really important, and talks about auto-vectorisation are really best saved for another thread.
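
As a concrete example of "highly specialised", here is the kind of thing no auto-vectoriser will conjure for you: approximate reciprocal square root with one Newton-Raphson refinement step (a sketch, assuming the SSE names from intel-intrinsics):

import inteli.xmmintrin;

// RSQRTPS gives roughly 12-bit precision; one Newton-Raphson step,
// y' = y * (1.5 - 0.5 * x * y * y), recovers most of the rest.
__m128 invSqrt(__m128 x)
{
    __m128 y = _mm_rsqrt_ps(x);
    __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y), y);
    __m128 t = _mm_sub_ps(_mm_set1_ps(1.5f),
                          _mm_mul_ps(_mm_set1_ps(0.5f), xyy));
    return _mm_mul_ps(y, t);
}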
February 14, 2019
On Thursday, 14 February 2019 at 16:13:21 UTC, Ethan wrote:
> On Wednesday, 13 February 2019 at 23:26:48 UTC, Crayo List wrote:
>> And that's precisely why I posted here: so that people interested in vectorizing their code in a portable way are aware that there is another (arguably) better way.
>
> All power to the people that have code that simple. But auto-vectorising in any capacity is the wrong way to do things in my field. An intrinsics library is vital to write highly specialised code.
>
> The tl;dr here is that we *FINALLY* have a minimum-spec for x64 CPUs represented with SSE intrinsics. Instead of whatever core.simd is. That's really important, and talks about auto-vectorisation are really best saved for another thread.

Please re-read my post carefully!
February 14, 2019
On Thursday, 14 February 2019 at 21:45:57 UTC, Crayo List wrote:
> On Thursday, 14 February 2019 at 16:13:21 UTC, Ethan wrote:
>> On Wednesday, 13 February 2019 at 23:26:48 UTC, Crayo List wrote:
>>> And that's precisely why I posted here: so that people interested in vectorizing their code in a portable way are aware that there is another (arguably) better way.
>>
>> All power to the people that have code that simple. But auto-vectorising in any capacity is the wrong way to do things in my field. An intrinsics library is vital to write highly specialised code.
>>
>> The tl;dr here is that we *FINALLY* have a minimum-spec for x64 CPUs represented with SSE intrinsics. Instead of whatever core.simd is. That's really important, and talks about auto-vectorisation are really best saved for another thread.
>
> Please re-read my post carefully!

I think ispc is interesting, and a very D-ish thing to have would be an ispc-like compiler at CTFE that outputs LLVM IR (or assembly, or intel-intrinsics). That would break the language boundary and allow inlining. Though we probably need newCTFE for this, as everything interesting seems to need newCTFE :) And it's a gigantic amount of work.
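
A toy of the idea with today's CTFE and string mixins (compileKernel is a hypothetical name; a real ispc-like compiler would parse a DSL and emit intrinsics or IR rather than a one-liner):

// Runs at compile time: "compile" a tiny expression DSL to D source.
string compileKernel(string expr)
{
    return "foreach (i; 0 .. dst.length) dst[i] = " ~ expr ~ ";";
}

void saxpy(float[] dst, const float[] x, const float[] y, float a)
{
    // mixin splices the generated code in at compile time,
    // so it inlines like hand-written D: no language boundary.
    mixin(compileKernel("a * x[i] + y[i]"));
}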
February 14, 2019
On Thu, Feb 14, 2019 at 10:15:19PM +0000, Guillaume Piolat via Digitalmars-d-announce wrote: [...]
> I think ispc is interesting, and a very D-ish thing to have would be an ispc-like compiler at CTFE that outputs LLVM IR (or assembly, or intel-intrinsics). That would break the language boundary and allow inlining. Though we probably need newCTFE for this, as everything interesting seems to need newCTFE :) And it's a gigantic amount of work.

Much as I love the idea of generating D code at compile-time and look forward to newCTFE, there comes a point when I'd really rather just run the DSL through some kind of preprocessing (i.e., compile with ispc) as part of the build, then link the result to the D code, rather than trying to shoehorn everything into (new)CTFE.
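
Something like this, following ispc's own walkthrough example (the kernel name and signature come from that walkthrough; the exact flags and the D-side declaration are assumptions about how the export maps):

// Build step:  ispc -O2 --target=avx2-i32x8 simple.ispc -o simple.o
// Then link simple.o into the D build and declare the export:

extern(C) void simple(const(float)* vin, float* vout, int count);

void main()
{
    float[64] vin, vout;
    foreach (i, ref v; vin) v = i;
    simple(vin.ptr, vout.ptr, cast(int) vin.length);
}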


T

-- 
You have to expect the unexpected. -- RL
February 14, 2019
On Thursday, 14 February 2019 at 21:45:57 UTC, Crayo List wrote:
> Please re-read my post carefully!

Or - even better - take the hint that not every use of SIMD can be expressed in a high-level manner.


February 15, 2019
On Thursday, 14 February 2019 at 22:28:46 UTC, H. S. Teoh wrote:
> trying to shoehorn everything into (new)CTFE.

Couldn't help but find a similarity between http://www.dsource.org/projects/mathextra/browser/trunk/blade/BladeDemo.d and ispc