September 19, 2015 Re: LDC 0.16.0 alpha3 is out! Get it, test it, give feedback! | ||||
---|---|---|---|---|
| ||||
Posted in reply to Kai Nacke | What can I say, it works nicely for me. I only see issues in arcane corners of SIMD, but I could work around them with inline assembly, which compiles to much shorter code than the corresponding sequence of intrinsics:
m_sse = __asm!(const(ubyte16)*)("
sub $2, $1
1:
add $2, $1
vpcmpistri $3, ($1), $4
cmp $2, %ecx
je 1b
add %rcx, $1
", "=r,0,I,K,x,~{ecx}", m_sse, 16, mode, SIMDFromString!cs);
It is just not very intelligible nor portable. Is there a way to turn the %rcx into a wild card or %rcx/%ecx depending on pointer width?
--
Marco
|
September 20, 2015 Re: LDC 0.16.0 alpha3 is out! Get it, test it, give feedback! | ||||
---|---|---|---|---|
| ||||
Posted in reply to Marco Leise | On 19 Sep 2015, at 20:09, Marco Leise via digitalmars-d-ldc wrote:
> It is just not very intelligible nor portable. Is there a way
> to turn the %rcx into a wild card or %rcx/%ecx depending on
> pointer width?
Not that I know of. My first advice would be to use the intrinsic corresponding to vpcmpistri, but you mentioned the generated asm would be longer?
– David
|
September 21, 2015 Re: LDC 0.16.0 alpha3 is out! Get it, test it, give feedback! | ||||
---|---|---|---|---|
| ||||
Posted in reply to David Nadlinger | On Sun, 20 Sep 2015 16:54:19 +0200, David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc@puremagic.com> wrote:
> On 19 Sep 2015, at 20:09, Marco Leise via digitalmars-d-ldc wrote:
> > It is just not very intelligible nor portable. Is there a way to turn the %rcx into a wild card or %rcx/%ecx depending on pointer width?
>
> Not that I know of. My first advice would be to use the intrinsic corresponding to vpcmpistri, but you mentioned the generated asm would be longer?
>
> – David

Hell yeah, the GCC intrinsics don't differentiate between a SIMD register argument and a memory reference. Basically they take a SIMD vector, but understand that `*simdptr` can be encoded as a memory reference. How an argument is passed to a SIMD instruction comes down to compiler flags and other circumstances.

Now the problem arises when the compilers blindly assume that all SIMD memory is aligned, although SSE4.2 introduces a few instructions that work on unaligned octet streams (this `vpcmp(i/e)str(i/m)` and at least a crc32 function IIRC) to make life easier. When these memory references get preloaded into SIMD registers with an aligned load you get a SEGFAULT - they require an unaligned load if anything.

Long story short, GCC knew about this 3 years ago and it was decided that using the intrinsic without manually putting an unaligned load in front is incorrect. (Unless you use it on aligned data.) But it kind of defeats the purpose of using a specialized instruction to speed up string scanning when you have to add bloat around it.

Maybe I'll post what LLVM or GCC would have generated if used with only intrinsics.
--
Marco
|
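For readers who think in C rather than D, the pattern described above (an explicit unaligned load placed in front of the string-compare intrinsic) might look roughly like the sketch below. It is an illustration and not code from the thread: `skip_until_any` is a hypothetical helper, it uses the Intel intrinsic names `_mm_loadu_si128` and `_mm_cmpistri` instead of the `__builtin_ia32_*` builtins, the equal-any comparison mode is chosen for the example and is not necessarily the `mode` used in the thread, and the scalar fallback is only there so the sketch builds without SSE4.2.

```c
#include <assert.h>
#include <string.h>
#if defined(__SSE4_2__)
#include <nmmintrin.h> /* SSE4.2 string-compare intrinsics */
#endif

/*
 * Hypothetical helper (not from the thread): advance `p` to the first byte
 * that occurs in `set` (1 to 16 characters).
 * Assumptions: a byte from `set` appears before any NUL in the input, and
 * 16 bytes are readable from every position that gets examined.
 */
static const char *skip_until_any(const char *p, const char *set)
{
    assert(set[0] != '\0' && strlen(set) <= 16);
#if defined(__SSE4_2__)
    char set_buf[16] = {0};
    memcpy(set_buf, set, strlen(set));
    const __m128i needles = _mm_loadu_si128((const __m128i *)set_buf);
    int pos;
    do {
        /* Explicit unaligned load: `p` may point anywhere in the stream. */
        const __m128i chunk = _mm_loadu_si128((const __m128i *)p);
        /* Index of the first byte of `chunk` found in `needles`, or 16. */
        pos = _mm_cmpistri(needles, chunk,
                           _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                           _SIDD_LEAST_SIGNIFICANT);
        p += pos;
    } while (pos == 16);
    return p;
#else
    /* Scalar fallback so the sketch also builds without SSE4.2. */
    while (strchr(set, *p) == NULL)
        ++p;
    return p;
#endif
}
```

Called as `skip_until_any(json, "\"\\")`, this would stop at the next quote or backslash. Whether the compiler folds the explicit load back into a memory operand of the compare instruction is up to the optimizer, which is the code-size concern raised in the post.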
September 21, 2015 Re: LDC 0.16.0 alpha3 is out! Get it, test it, give feedback! | ||||
---|---|---|---|---|
| ||||
Posted in reply to David Nadlinger | Here is the comparison. I omitted the part of the function that would be the same in both versions: prolog, epilog and loading the current position into %rax.

1st: hand-optimized ASM

m_sse = __asm!(const(ubyte16)*)("
1:
vpcmpistri $3, ($1), $4
add $2, $1
cmp $2, %ecx
je 1b
sub $2, $1
add %rcx, $1
", "=r,0,I,K,x,~{ecx}", m_sse, 16, mode, SIMDFromString!cs);

c5 f8 28 05 01 74 04 00      vmovaps xmm0,XMMWORD PTR [rip+0x47401]
c4 e3 79 63 00 08         L: vpcmpistri xmm0,XMMWORD PTR [rax],0x8
48 83 c0 10                  add rax,0x10
83 f9 10                     cmp ecx,0x10
74 f1                        je L
48 83 e8 10                  sub rax,0x10
48 01 c8                     add rax,rcx
48 89 07                     mov QWORD PTR [rdi],rax
(33 bytes)

Destructuring a ~200 MiB JSON file takes 348 ms with that.

2nd: "naive" approach using intrinsics

int pos;
do
{
    ubyte16 sse = __builtin_ia32_lddqu(m_json);
    pos = __builtin_ia32_pcmpistri128(SIMDFromString!cs, sse, mode);
    m_json += pos;
}
while (pos == 16);

48 89 7d f8                  mov QWORD PTR [rbp-0x8],rdi
48 89 45 f0                  mov QWORD PTR [rbp-0x10],rax
48 8b 45 f0               L: mov rax,QWORD PTR [rbp-0x10]
c5 fb f0 00                  vlddqu xmm0,[rax]
c5 f8 28 0d 81 74 04 00      vmovaps xmm1,XMMWORD PTR [rip+0x47481]
c4 e3 79 63 c8 08            vpcmpistri xmm1,xmm0,0x8
48 63 d1                     movsxd rdx,ecx
48 01 d0                     add rax,rdx
48 8b 55 f8                  mov rdx,QWORD PTR [rbp-0x8]
48 89 02                     mov QWORD PTR [rdx],rax
81 f9 10 00 00 00            cmp ecx,0x10
48 89 45 f0                  mov QWORD PTR [rbp-0x10],rax
74 d1                        je L
(55 bytes)

Time: 502 ms

Compiled with ldc-0.16-alpha3, LLVM 3.5.2
-O4 -mcpu=native -release -boundscheck=off -singleobj -disable-inlining
--
Marco
|
September 23, 2015 Re: LDC 0.16.0 alpha3 is out! Get it, test it, give feedback! | ||||
---|---|---|---|---|
| ||||
Posted in reply to Marco Leise | On 21 Sep 2015, at 11:16, Marco Leise via digitalmars-d-ldc wrote:
> Here is the comparison. I omitted the part of the function that would be the same in both versions: prolog, epilog and loading the current position into %rax. […]
Wow, thanks, that is indeed rather sobering.
– David
|