October 12, 2016 Re: Can you shrink it further? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Stefan Koch | On 10/12/2016 10:39 AM, Stefan Koch wrote:
> On Wednesday, 12 October 2016 at 14:12:30 UTC, Andrei Alexandrescu wrote:
>> On 10/12/2016 09:39 AM, Stefan Koch wrote:
>
>>
>> Thanks! I'd say make sure there is exactly 0% loss on performance
>> compared to the popFront in the ASCII case, and if so make a PR with
>> the table version. -- Andrei
>
> I measured again.
> The table version DECREASES the performance for dmd by 1%.
> I think we should keep performance for dmd in mind.
> I could add the table version in a version (LDC) block.
No need. 1% for dmd is negligible. 25% would raise an eyebrow. -- Andrei
|
October 12, 2016 Re: Can you shrink it further? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Wednesday, 12 October 2016 at 14:46:32 UTC, Andrei Alexandrescu wrote:
>
> No need. 1% for dmd is negligible. 25% would raise an eyebrow. -- Andrei
Alright then, PR: https://github.com/dlang/phobos/pull/4849
|
October 12, 2016 Re: Can you shrink it further? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | My current favorites:

void popFront(ref char[] s) @trusted pure nothrow
{
    immutable byte c = s[0];
    if (c >= -2)
    {
        s = s.ptr[1 .. s.length];
    }
    else
    {
        import core.bitop;
        size_t i = 7u - bsr(~c);
        import std.algorithm;
        s = s.ptr[min(i, s.length) .. s.length];
    }
}

I also experimented with explicit speculation:

void popFront(ref char[] s) @trusted pure nothrow
{
    immutable byte c = s[0];
    s = s.ptr[1 .. s.length];
    if (c < -2)
    {
        import core.bitop;
        size_t i = 6u - bsr(~c);
        import std.algorithm;
        s = s.ptr[min(i, s.length) .. s.length];
    }
}

LDC and GDC both compile these to 23 instructions.
DMD does worse than with my other code.
You can influence GDC's block layout with __builtin_expect.

I notice that many other snippets posted use uint instead of size_t in the multi-byte branch.
This generates extra instructions for me.
|
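For readers who want to try the two variants, here is a minimal sketch of a correctness check. It is not part of the thread: the names popFrontBranch and popFrontSpeculative are made up for this example, and the check only covers well-formed UTF-8 by comparing against std.utf.stride. It says nothing about instruction counts.

```d
import core.bitop : bsr;
import std.algorithm.comparison : min;
import std.utf : stride;

// The branching variant from the post, renamed for this sketch.
void popFrontBranch(ref char[] s) @trusted pure nothrow
{
    immutable byte c = s[0];
    if (c >= -2) // ASCII (c >= 0) plus the invalid bytes 0xFE and 0xFF
    {
        s = s.ptr[1 .. s.length];
    }
    else
    {
        // Count the leading one bits of the first byte, clamp to the length.
        size_t i = 7u - bsr(~c);
        s = s.ptr[min(i, s.length) .. s.length];
    }
}

// The speculative variant: always skip one byte, then skip the rest.
void popFrontSpeculative(ref char[] s) @trusted pure nothrow
{
    immutable byte c = s[0];
    s = s.ptr[1 .. s.length];
    if (c < -2)
    {
        size_t i = 6u - bsr(~c);
        s = s.ptr[min(i, s.length) .. s.length];
    }
}

unittest
{
    // Mixed 1-, 2-, 3- and 4-byte code points.
    char[] a = "aé€😀z".dup;
    char[] b = a.dup;
    while (a.length)
    {
        immutable expected = stride(a); // Phobos' own sequence length
        immutable before = a.length;
        popFrontBranch(a);
        popFrontSpeculative(b);
        assert(before - a.length == expected);
        assert(a == b);
    }
}
```

Compiling with -unittest -main runs the check. Both functions clamp to the remaining length, so a truncated sequence at the end of the buffer empties the slice instead of reading past it.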
October 12, 2016 Re: Can you shrink it further? | ||||
---|---|---|---|---|
| ||||
Posted in reply to safety0ff | On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote:
> [Snip]
Didn't see the LUT implementation, nvm!
|
October 12, 2016 Re: Can you shrink it further? | ||||
---|---|---|---|---|
| ||||
Posted in reply to safety0ff | On 10/12/2016 01:05 PM, safety0ff wrote:
> On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote:
>> [Snip]
>
> Didn't see the LUT implementation, nvm!
Yah, that's pretty clever. Better yet, I suspect we can reuse the look-up table for front() as well. -- Andrei
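
The LUT code itself is not quoted in this part of the thread. As a rough sketch of the idea only (not necessarily what the PR contains), a table indexed by the lead byte could look like this. The names widthTable and popFrontLut are assumptions made for this example, and the table mirrors the widths the bsr variants produce, including 5 and 6 for the obsolete long forms.

```d
import std.algorithm.comparison : min;
import std.utf : stride;

// Hypothetical table: bytes to skip for each lead byte 0xC0 .. 0xFF.
// 0xFE and 0xFF are never valid lead bytes, so they advance by one.
static immutable ubyte[64] widthTable = [
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 1, 1,
];

void popFrontLut(ref char[] s) @trusted pure nothrow @nogc
{
    immutable c = s[0];
    // Bytes below 0xC0 (ASCII and stray continuation bytes) advance by one.
    immutable size_t width = c >= 0xC0 ? widthTable[c - 0xC0] : 1;
    s = s.ptr[min(width, s.length) .. s.length];
}

unittest
{
    char[] s = "aé€😀z".dup;
    while (s.length)
    {
        immutable expected = stride(s);
        immutable before = s.length;
        popFrontLut(s);
        assert(before - s.length == expected);
    }
}
```

Restricting the table to the 64 possible lead bytes keeps it within a single cache line on common hardware; whether that is faster than the bsr variants on a given compiler would have to be measured.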
|
October 14, 2016 Re: Can you shrink it further? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Wednesday, 12 October 2016 at 17:59:51 UTC, Andrei Alexandrescu wrote:
> On 10/12/2016 01:05 PM, safety0ff wrote:
>> On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote:
>>> [Snip]
>>
>> Didn't see the LUT implementation, nvm!
>
> Yah, that's pretty clever. Better yet, I suspect we can reuse the look-up table for front() as well. -- Andrei
The first results from stoke are in.
It turns out stoke likes to produce garbage :(
Its smallest result so far has around 100 instructions.
However it might get better if I give it a few more hours to explore.
|
October 14, 2016 Re: Can you shrink it further? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Stefan Koch | On Friday, 14 October 2016 at 04:21:28 UTC, Stefan Koch wrote:
> On Wednesday, 12 October 2016 at 17:59:51 UTC, Andrei Alexandrescu wrote:
>> On 10/12/2016 01:05 PM, safety0ff wrote:
>>> On Wednesday, 12 October 2016 at 16:48:36 UTC, safety0ff wrote:
>>>> [Snip]
>>>
>>> Didn't see the LUT implementation, nvm!
>>
>> Yah, that's pretty clever. Better yet, I suspect we can reuse the look-up table for front() as well. -- Andrei
>
> The first results from stoke are in.
> It turns out stoke likes to produce garbage :(
> Its smallest result so far has around 100 instructions.
> However it might get better if I give it a few more hours to explore.
Also I doubt that it is correct :(
testb $0x8, 0x200aa9(%rip)
movl $0x6, %eax
prefetchnta 0x200a9d(%rip)
je .L_400650
mulb -0x4(%rsp)
movb $0xfa, -0x5(%rsp)
vmovd (%rax), %xmm6
pmovzxbd -0x5(%rsp), %xmm11
psrad $0xf9, %xmm6
movl $0xef, %esp
pextrd $0xfe, %xmm6, (%rax)
.L_4005b0:
vrsqrtps 0x200a69(%rip), %ymm13
vzeroall
incl %edi
cmpb %ah, %dl
cmpq %rdi, %rdi
jbe .L_400640
ja .L_4005f0
pcmpeqq -0x4(%rsp), %xmm10
sbbb %ah, 0x200a4d(%rip)
jmpq .L_400643
.L_4005f0:
ja .L_40060c
jmpq .L_400643
.L_40060c:
ja .L_400628
minsd 0x200a3c(%rip), %xmm10
jmpq .L_400643
.L_400628:
vmovsldup %ymm3, %ymm3
vrcpps %ymm12, %ymm7
vrsqrtps -0x4(%rsp), %xmm0
fldl2t
vmovmskpd %xmm8, %r10
vrcpps %xmm6, %xmm13
rcrw $0xf7, %ax
jbe .L_400643
sbbq $0x40, %rax
xorb $0xfe, 0x200a0d(%rip)
adcw $0xf0, %r10w
.L_400640:
vmaskmovpd %xmm4, %xmm10, 0x2009ff(%rip)
pabsb %xmm12, %xmm15
.L_400643:
jne .L_4005b0
.L_400650:
retq
I am not quite sure what this does.
But I am certain it has nothing to do with UTF-8 decoding :)
Oh btw using an end pointer instead of a length reduces the table version to 12 instructions.
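
The end-pointer code is not shown in the post. A minimal sketch of what such a variant might look like, assuming a cursor that stores a current and an end pointer instead of a slice, follows. Utf8Cursor and widthTable are hypothetical names, and the 12-instruction figure from the post has not been reproduced here.

```d
// Hypothetical table: bytes to skip for each lead byte 0xC0 .. 0xFF.
static immutable ubyte[64] widthTable = [
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 1, 1,
];

// Hypothetical cursor that tracks [ptr, end) rather than pointer + length.
struct Utf8Cursor
{
    const(char)* ptr;
    const(char)* end;

    this(const(char)[] s) @trusted pure nothrow @nogc
    {
        ptr = s.ptr;
        end = s.ptr + s.length;
    }

    bool empty() const @trusted pure nothrow @nogc
    {
        return ptr >= end;
    }

    void popFront() @trusted pure nothrow @nogc
    {
        immutable c = *ptr;
        immutable size_t width = c >= 0xC0 ? widthTable[c - 0xC0] : 1;
        // Clamp so a truncated sequence never moves ptr past end.
        immutable size_t remaining = end - ptr;
        ptr += width < remaining ? width : remaining;
    }
}

unittest
{
    auto r = Utf8Cursor("aé€😀z");
    size_t count;
    for (; !r.empty; r.popFront())
        ++count;
    assert(count == 5); // five code points
}
```

With a slice, popFront has to update both the pointer and the length; with an end pointer only one field changes, which is a plausible reason for the smaller instruction count.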
|