April 17, 2016
Am Sat, 16 Apr 2016 21:46:08 -0700
schrieb Walter Bright <newshound2@digitalmars.com>:

> On 4/16/2016 2:40 PM, Marco Leise wrote:
> > Tell me again, what's more elegant!
> 
> If I wanted to write in assembler, I wouldn't write in a high level language, especially a weird one like the GNU version.

I hate the many pitfalls of extended asm: forget to mention a side effect in the "clobbers" list and the compiler assumes that register or memory location still holds the value from before the asm. Have an _input_ reg clobbered? Then it must NOT go in the clobber list; instead it has to be declared as a dummy output with a dummy variable assignment. The learning curve is steep and, as you said, the result is usually unintelligible without prior knowledge.
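A minimal C sketch of those two rules (my example, not from the original post): `mulq` destroys its RAX input and writes RDX, so both registers must appear as outputs rather than in the clobber list, while the trashed flags do go into "clobbers".

```c
#include <assert.h>
#include <stdint.h>

/* mulq consumes its RAX input (replaced by the low half of the
   product) and writes RDX, so both registers are declared as
   outputs, NOT clobbers; the modified flags go in the clobber list. */
static uint64_t mul_hi(uint64_t a, uint64_t b) {
    uint64_t hi, lo;
    __asm__("mulq %3"
            : "=a"(lo), "=d"(hi)  /* RAX/RDX come back as outputs */
            : "a"(a), "rm"(b)     /* the RAX input is overwritten */
            : "cc");              /* flags are trashed */
    (void)lo;                     /* low half unused in this sketch */
    return hi;
}
```

Forgetting the `"=d"(hi)` output (or demoting it to a clobber while still wanting the value) is exactly the kind of silent miscompilation described above.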

But what I really miss in the last generation of inline assemblers are these points:

1. In most cases you can make the asm transparent to the
   optimizer leading to:
   1.a Inlining of asm
   1.b Dead-code removal of asm blocks

2. Asm Template arguments (e.g. input variables) are bound via
   constraints:
   2.a Can use the output constraint `"=a" var` to mean any
       of "AL", "AX", "EAX" or "RAX", depending on the size
       of 'var'
   2.b `"r" ptr` can bind 32-bit and 64-bit pointers, often
       eliminating the need for duplicate asm blocks that
       differ only in one mention of e.g. RSI vs. ESI.
   2.c The compiler seamlessly integrates host code
       variables with the asm. No need to manually pick
       temporary registers to move parameters and outputs
       around. `"r" myUint` is all it takes for 'myUint' to
       end up in any of EAX, EDX, ... (whatever the register
       allocator deems efficient at that point)
   2.d As a net result, asm templates often reduce to a
       single mnemonic and work with X86, X32 and AMD64.
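To illustrate 2.b and 2.c with a C sketch (my example, hypothetical helper name): the same one-line template compiles for x86 and x86-64 because `"r"` binds the pointer at native width and the register allocator picks the registers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: load the first element through inline asm.
   "=r"/"r" let the register allocator choose any registers, and the
   pointer binds at whatever width the target uses, so this template
   needs no x86 vs. x86-64 duplication. */
static size_t first_elem(const size_t *p) {
    size_t v;
    __asm__("mov (%1), %0"
            : "=r"(v)        /* output: any register */
            : "r"(p)         /* input: pointer, native width */
            : "memory");     /* the asm reads memory */
    return v;
}
```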

3. In DMD I often see "naked" used to get rid of function
   prolog and epilog in an attempt to get an intrinsic-like,
   fast function. This requires extra care to get the calling
   convention right and may require more code duplication for
   e.g. Win32. Asm templates in GCC and LLVM benefit from this
   speedup automatically, because the backend will remove
   unneeded prolog/epilog code and even inline small functions.

GCC's historically grown template syntax, shaped by multiple _external_ assembler backends, isn't that great, and it is a PITA that it cannot understand the mnemonics and figure out side effects by itself the way DMD does. But I hope I could highlight a few points where classic assemblers, as found in Delphi or DMD, fall behind in modern convenience and native efficiency.

When C was invented it matched the CPUs quite well, but today we have dozens of instructions that C and D syntax have no expression for. All modern compilers devote a considerable amount of backend code to pattern matching code constructs like a layman's POPCNT and replacing them with the optimal CPU instruction. More and more we turn to browsing the list of readily available compiler built-ins first; the next step is to acknowledge the need and make inline assemblers powerful enough for programmers to efficiently implement missing intrinsics in library code.
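The "layman's POPCNT" in question is the classic bit-clearing loop; as a C sketch (mine, not from the post), GCC and Clang built with -mpopcnt recognize this idiom, as well as __builtin_popcount, and emit a single POPCNT instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Count set bits the portable way; a modern backend pattern-matches
   this loop into POPCNT when the target supports it. */
static int popcount32(uint32_t x) {
    int n = 0;
    while (x) {
        x &= x - 1;  /* clear the lowest set bit (Kernighan's trick) */
        n++;
    }
    return n;
}
```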

-- 
Marco

April 18, 2016
On Tuesday, 5 April 2016 at 10:27:46 UTC, Walter Bright wrote:
> Besides, I think it's a poor design to customize the app for only one SIMD type. A better idea (I've repeated this ad nauseum over the years) is to have n modules, one for each supported SIMD type. Compile and link all of them in, then detect the SIMD type at runtime and call the corresponding module. (This is how the D array ops are currently implemented.)

There are many organizations in the world that are building software in-house, where such software is targeted to modern CPU SIMD types, most typically AVX/AVX2 and crypto instructions.

In these settings -- many of them scientific compute or big data center operators -- they know what servers they have, what CPU platforms they have. They don't care about portability to the past, older computers and so forth. A runtime check would make no sense for them, not for their baseline, and it would probably be a waste of time for them to design code to run on pre-AVX silicon. (AVX is not new anymore -- it's been around for a few years.)

Good examples can be found on Cloudflare's blog, especially Vlad Krasnov's posts. Here's one where he accelerates Golang's crypto libraries: https://blog.cloudflare.com/go-crypto-bridging-the-performance-gap/

Companies like CF probably spend millions of dollars on electricity, and there are some workloads where AVX-optimized code can yield tangible monetary savings.

Someone else talked about targeting generation names like "Broadwell". As others have said, it's better to specify features. I wanted to chime in with a couple of additional examples. Intel's transactional memory instructions (TSX) are only available on some Broadwell parts because there was a bug in the original implementation (Haswell and early Broadwell) and it's disabled on most. But the new Broadwell server chips have it, and it's a big deal for some DB workloads. Similarly, only some Skylake chips have the Software Guard Extensions (SGX), which are very powerful for creating secure enclaves on an untrusted host.

On the broader SIMD-as-first-class-citizen issue, I think it would be worth thinking about how to bake SIMD into the language instead of bolting it on. If I were designing a new language in 2016, I would take a fresh look at how SIMD could be baked into a language's core constructs. I'd think about new loop abstractions that could make SIMD easier to exploit, and how to nudge programmers away from serial monotonic mindsets and into more of a SIMD/FMA way of reasoning.
April 18, 2016
On Monday, 18 April 2016 at 00:27:06 UTC, Joe Duarte wrote:
> On Tuesday, 5 April 2016 at 10:27:46 UTC, Walter Bright wrote:
>> Besides, I think it's a poor design to customize the app for only one SIMD type. A better idea (I've repeated this ad nauseum over the years) is to have n modules, one for each supported SIMD type. Compile and link all of them in, then detect the SIMD type at runtime and call the corresponding module. (This is how the D array ops are currently implemented.)
>
> There are many organizations in the world that are building software in-house, where such software is targeted to modern CPU SIMD types, most typically AVX/AVX2 and crypto instructions.
>

In addition, it's the COMPILER's work, not the programmer's!
The compiler SHOULD be able to vectorize the code using SSE/AVX depending on a command line switch. Why should I write all this crap? Let the compiler do its work.

Also, a compiler CAN generate multiple versions of one function using different SIMD instructions: the Intel C++ Compiler works this way, generating a few versions of a function, checking the CPU capabilities at run time and executing the fastest one.
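A minimal C sketch of that dispatch scheme (hypothetical names; the AVX2 variant is stubbed for brevity): detect the CPU's capabilities once at startup, then route calls through a function pointer to the best available implementation.

```c
#include <assert.h>

static long sum_scalar(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}

/* In a real build this would live in a separately compiled AVX2
   module; stubbed here so the sketch stays self-contained. */
static long sum_avx2(const int *a, int n) { return sum_scalar(a, n); }

static long (*sum_best)(const int *, int);

static void init_dispatch(void) {
    __builtin_cpu_init();  /* GCC/Clang: query host CPU features */
    sum_best = __builtin_cpu_supports("avx2") ? sum_avx2 : sum_scalar;
}
```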
April 23, 2016
On Monday, 18 April 2016 at 00:27:06 UTC, Joe Duarte wrote:
> 
> Someone else talked about targeting generation names like "Broadwell". As others have said, it's better to specify features. I wanted to chime in with a couple of additional examples. Intel's transactional memory instructions (TSX) are only available on some Broadwell parts because there was a bug in the original implementation (Haswell and early Broadwell) and it's disabled on most. But the new Broadwell server chips have it, and it's a big deal for some DB workloads. Similarly, only some Skylake chips have the Software Guard Extensions (SGX), which are very powerful for creating secure enclaves on an untrusted host.

Thanks, I've seen similar comments in LLVM code.

I have a question perhaps you can comment on?
With LLVM, it is possible to specify something like "+sse3,-sse2" (I did not test whether this actually results in SSE3 instructions being used, but no SSE2 instructions). What should be returned when querying whether the "sse3" feature is enabled?
Should __traits(targetHasFeature, "sse3") == true mean that implied features (such as sse and sse2) are also available?
April 24, 2016
Am Sat, 23 Apr 2016 10:40:12 +0000
schrieb Johan Engelen <j@j.nl>:

> I have a question perhaps you can comment on?
> With LLVM, it is possible to specify something like "+sse3,-sse2"
> (I did not test whether this actually results in SSE3
> instructions being used, but no SSE2 instructions). What should
> be returned when querying whether "sse3" feature is enabled?
> Should __traits(targetHasFeature, "sse3") == true mean that
> implied features (such as sse and sse2) are also available?

Please do test it. Activating sse3 and disabling sse2 likely causes the compiler to silently re-enable sse2 as a dependency or error out.

-- 
Marco

May 02, 2016
On Saturday, 23 April 2016 at 10:40:12 UTC, Johan Engelen wrote:
> On Monday, 18 April 2016 at 00:27:06 UTC, Joe Duarte wrote:
>> 
>> Someone else talked about targeting generation names like "Broadwell". As others have said, it's better to specify features. I wanted to chime in with a couple of additional examples. Intel's transactional memory instructions (TSX) are only available on some Broadwell parts because there was a bug in the original implementation (Haswell and early Broadwell) and it's disabled on most. But the new Broadwell server chips have it, and it's a big deal for some DB workloads. Similarly, only some Skylake chips have the Software Guard Extensions (SGX), which are very powerful for creating secure enclaves on an untrusted host.
>
> Thanks, I've seen similar comments in LLVM code.
>
> I have a question perhaps you can comment on?
> With LLVM, it is possible to specify something like "+sse3,-sse2" (I did not test whether this actually results in SSE3 instructions being used, but no SSE2 instructions). What should be returned when querying whether "sse3" feature is enabled?
> Should __traits(targetHasFeature, "sse3") == true mean that implied features (such as sse and sse2) are also available?

If you specify SSE3, you should definitely get SSE2 and plain old SSE with it. SSE3 is a superset of SSE2 and includes all of the SSE2 instructions (more than 100, I think).

I'm not sure about your syntax – I thought the hyphen meant to include the option, not remove it, and I haven't seen the addition sign used for those settings. But I haven't done much with those optimization flags.

You wouldn't want to exclude SSE2 support because it's becoming the bare minimum baseline for modern systems, the de facto FP unit. Windows 10 requires a CPU with SSE2, as do more and more applications on the archaic Unix-like platforms.
August 23, 2016
On Thursday, 31 March 2016 at 08:23:45 UTC, Martin Nowak wrote:
> I'm currently working on a templated arrayop implementation (using RPN
> to encode ASTs).
> So far things worked out great, but now I got stuck b/c apparently none
> of the D compilers has a working SIMD implementation (maybe GDC has but
> it's very difficult to work w/ the 2.066 frontend).
>
> https://github.com/MartinNowak/druntime/blob/arrayOps/src/core/internal/arrayop.d https://github.com/MartinNowak/dmd/blob/arrayOps/src/arrayop.d
>
> I don't want to do anything fancy, just unaligned loads, stores, and integral mul/div. Is this really the current state of SIMD or am I missing sth.?
>
> -Martin

ndslice.algorithm [1], [2] compiled with a recent LDC beta will do all the work for you. The vectorized flag should be turned on, and the last (row) dimension should have stride==1.

A generic matrix-matrix multiplication [3] is available in Mir version 0.16.0-beta2. It should be compiled with a recent LDC beta and the -mcpu=native flag.

[1] http://docs.mir.dlang.io/latest/mir_ndslice_algorithm.html
[2] https://github.com/dlang/phobos/pull/4652
[3] http://docs.mir.dlang.io/latest/mir_glas_gemm.html