September 19, 2015
What can I say, it works nicely for me. I only see issues in arcane corners of SIMD, and I could work around them with inline assembly, which compiles to a much shorter instruction sequence than the corresponding intrinsics:

m_sse = __asm!(const(ubyte16)*)("
    sub        $2, $1
    1:
    add        $2, $1
    vpcmpistri $3, ($1), $4
    cmp        $2, %ecx
    je         1b
    add        %rcx, $1
    ", "=r,0,I,K,x,~{ecx}", m_sse, 16, mode, SIMDFromString!cs);

It is just neither very intelligible nor portable. Is there a way to turn the %rcx into a wildcard, or into %rcx/%ecx depending on pointer width?

-- 
Marco

September 20, 2015
On 19 Sep 2015, at 20:09, Marco Leise via digitalmars-d-ldc wrote:
> It is just neither very intelligible nor portable. Is there a
> way to turn the %rcx into a wildcard, or into %rcx/%ecx
> depending on pointer width?

Not that I know of. My first advice would be to use the intrinsic corresponding to vpcmpistri, but you mentioned the generated asm would be longer?
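
One workaround that might help, assuming LDC only needs the asm template to be a compile-time constant string (I believe that is the case, but have not checked it against every LDC version): build the template in D and splice in the register name according to the pointer width. The names `counterReg` and `scanLoop` are made up for this sketch; the instruction sequence is the one from the original post.

```d
import std.stdio : writeln;

// Hypothetical helper: pick the counter register name from the
// target's pointer width at compile time.
enum counterReg = size_t.sizeof == 8 ? "%rcx" : "%ecx";

// Same instruction sequence as in the original post. Only the register
// in the final `add` is spliced in; `cmp` keeps using %ecx because
// vpcmpistri writes its index to ECX.
enum scanLoop = "
    sub        $2, $1
    1:
    add        $2, $1
    vpcmpistri $3, ($1), $4
    cmp        $2, %ecx
    je         1b
    add        " ~ counterReg ~ ", $1
    ";

void main()
{
    // With LDC the string would presumably be passed on unchanged:
    //   m_sse = __asm!(const(ubyte16)*)(scanLoop, "=r,0,I,K,x,~{ecx}",
    //                                   m_sse, 16, mode, SIMDFromString!cs);
    // Here we only print the generated template to inspect it.
    writeln(scanLoop);
}
```

This does not make the asm any more intelligible, but it would keep the pointer-width dependency in one place.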

 – David

September 21, 2015
Am Sun, 20 Sep 2015 16:54:19 +0200
schrieb David Nadlinger via digitalmars-d-ldc
<digitalmars-d-ldc@puremagic.com>:

> On 19 Sep 2015, at 20:09, Marco Leise via digitalmars-d-ldc wrote:
> > It is just neither very intelligible nor portable. Is there a way to turn the %rcx into a wildcard, or into %rcx/%ecx depending on pointer width?
> 
> Not that I know of. My first advice would be to use the intrinsic corresponding to vpcmpistri, but you mentioned the generated asm would be longer?
>
>   – David

Hell yeah. The GCC intrinsics don't differentiate between a
SIMD register argument and a memory reference. Basically they
take a SIMD vector, but the compiler understands that `*simdptr`
can be encoded as a memory operand. Whether an argument reaches
the SIMD instruction as a register or as a memory operand comes
down to compiler flags and other circumstances. The problem
arises when compilers blindly assume that all SIMD memory is
aligned, although SSE4.2 introduces a few instructions that work
on unaligned octet streams (this `vpcmp(i/e)str(i/m)` family and
at least a crc32 function IIRC) to make life easier.
When these memory references get preloaded into SIMD registers
with an aligned load, you get a SEGFAULT; they require an
unaligned load if anything. Long story short, GCC knew about
this 3 years ago, and it was decided that using the intrinsic
without manually putting an unaligned load in front is
incorrect (unless you use it on aligned data). But it kind of
defeats the purpose of using a specialized instruction to
speed up string scanning when you have to add bloat around it.
Maybe I'll post what LLVM or GCC would have generated if used
with only intrinsics.

-- 
Marco

September 21, 2015
Here is the comparison. I omitted the part of the function that would be the same in both versions: prolog, epilog and loading the current position into %rax.

 1st: hand-optimized ASM

  m_sse = __asm!(const(ubyte16)*)("
    1:
    vpcmpistri $3, ($1), $4
    add        $2, $1
    cmp        $2, %ecx
    je         1b
    sub        $2, $1
    add        %rcx, $1
    ", "=r,0,I,K,x,~{ecx}", m_sse, 16, mode, SIMDFromString!cs);

    c5 f8 28 05 01 74 04 00     vmovaps xmm0,XMMWORD PTR [rip+0x47401]
    c4 e3 79 63 00 08        L: vpcmpistri xmm0,XMMWORD PTR [rax],0x8
    48 83 c0 10                 add    rax,0x10
    83 f9 10                    cmp    ecx,0x10
    74 f1                       je     L
    48 83 e8 10                 sub    rax,0x10
    48 01 c8                    add    rax,rcx
    48 89 07                    mov    QWORD PTR [rdi],rax
    (33 bytes)

  Destructuring a ~200 MiB JSON file takes 348 ms with that.

 2nd: "naive" approach using intrinsics

  int pos;
  do
  {
    ubyte16 sse = __builtin_ia32_lddqu(m_json);
    pos = __builtin_ia32_pcmpistri128(SIMDFromString!cs, sse, mode);
    m_json += pos;
  }
  while (pos == 16);

    48 89 7d f8                 mov    QWORD PTR [rbp-0x8],rdi
    48 89 45 f0                 mov    QWORD PTR [rbp-0x10],rax
    48 8b 45 f0              L: mov    rax,QWORD PTR [rbp-0x10]
    c5 fb f0 00                 vlddqu xmm0,[rax]
    c5 f8 28 0d 81 74 04 00     vmovaps xmm1,XMMWORD PTR [rip+0x47481]
    c4 e3 79 63 c8 08           vpcmpistri xmm1,xmm0,0x8
    48 63 d1                    movsxd rdx,ecx
    48 01 d0                    add    rax,rdx
    48 8b 55 f8                 mov    rdx,QWORD PTR [rbp-0x8]
    48 89 02                    mov    QWORD PTR [rdx],rax
    81 f9 10 00 00 00           cmp    ecx,0x10
    48 89 45 f0                 mov    QWORD PTR [rbp-0x10],rax
    74 d1                       je     L
    (55 bytes)

  Time: 502 ms

Compiled with ldc-0.16-alpha3, LLVM 3.5.2
 -O4 -mcpu=native -release -boundscheck=off -singleobj
 -disable-inlining

-- 
Marco

September 23, 2015
On 21 Sep 2015, at 11:16, Marco Leise via digitalmars-d-ldc wrote:
> Here is the comparison. I omitted the part of the function that would be the same in both versions: prolog, epilog and loading the current position into %rax. […]

Wow, thanks, that is indeed rather sobering.

 — David