January 11, 2017 Re: Mixin in Inline Assembly | ||||
---|---|---|---|---|
| ||||
Posted in reply to Chris M | On Wednesday, 11 January 2017 at 00:11:50 UTC, Chris M wrote:
> On Tuesday, 10 January 2017 at 13:13:17 UTC, Basile B. wrote:
>> On Tuesday, 10 January 2017 at 11:38:43 UTC, Guillaume Piolat wrote:
>>> On Tuesday, 10 January 2017 at 10:41:54 UTC, Basile B. wrote:
>>>>
>>>> don't forget to flag
>>>>
>>>> asm pure nothrow {}
>>>>
>>>> otherwise it's slow.
>>>
>>> Why?
>>
>> It's an empirical observation. In September I tried to work out why an inline asm function was slow. What happened was that I hadn't marked the asm block as nothrow.
>>
>> https://forum.dlang.org/post/xznocpxtalpayvkrwxey@forum.dlang.org
>>
>> I opened an issue asking for the specification to explain that clearly.
>
> Huh, that's really interesting, thanks for posting. I guess my other question would be how do I determine if a block of assembly is pure?
>
The game changer for performance is just "nothrow".
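A minimal sketch of what flagging the block looks like, assuming DMD-style inline asm on x86 or x86_64. The function name addOne is hypothetical and not from the thread. Because an unmarked asm statement is assumed to be impure and possibly throwing, the attributes on the block are also what allow the enclosing function to be declared pure nothrow @nogc.

```d
// Hypothetical helper, not from the thread: increments a value with inline asm.
uint addOne(uint x) pure nothrow @nogc
{
    uint result = x; // local copy so the asm block addresses a stack slot
    version (D_InlineAsm_X86)
    {
        asm pure nothrow @nogc
        {
            mov EAX, result;
            inc EAX;
            mov result, EAX;
        }
    }
    else version (D_InlineAsm_X86_64)
    {
        asm pure nothrow @nogc
        {
            mov EAX, result;
            inc EAX;
            mov result, EAX;
        }
    }
    else
    {
        result += 1; // portable fallback when DMD-style inline asm is unavailable
    }
    return result;
}
```

Removing the attributes from the asm statement while keeping them on the function should make the compiler reject the code, which is a quick way to check that the block is really being treated as nothrow.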
|
January 11, 2017 Re: Mixin in Inline Assembly | ||||
---|---|---|---|---|
| ||||
Posted in reply to Basile B. | On Tuesday, 10 January 2017 at 10:41:54 UTC, Basile B. wrote:
> don't forget to flag
>
> asm pure nothrow {}
>
> otherwise it's slow.
That suddenly reminds me of some speedup assembly I was writing for wideint, but it seems I lost my code. Too bad: the 128-bit multiply had sped up, and the division needed some work.
|
January 11, 2017 Re: Mixin in Inline Assembly | ||||
---|---|---|---|---|
| ||||
Posted in reply to Era Scarecrow | On Wednesday, 11 January 2017 at 06:14:35 UTC, Era Scarecrow wrote:
>
> That suddenly reminds me of some speedup assembly I was writing for wideint, but it seems I lost my code. Too bad: the 128-bit multiply had sped up, and the division needed some work.
I'm a taker if you have some algorithm to reuse 32-bit divide in wideint division instead of scanning bits :)
|
January 11, 2017 Re: Mixin in Inline Assembly | ||||
---|---|---|---|---|
| ||||
Posted in reply to Guillaume Piolat | On Wednesday, 11 January 2017 at 15:39:49 UTC, Guillaume Piolat wrote:
> On Wednesday, 11 January 2017 at 06:14:35 UTC, Era Scarecrow wrote:
>>
>> That suddenly reminds me of some speedup assembly I was writing for wideint, but it seems I lost my code. Too bad: the 128-bit multiply had sped up, and the division needed some work.
>
> I'm a taker if you have some algorithm to reuse 32-bit divide in wideint division instead of scanning bits :)
I remember the divide was giving me some trouble. The idea was to use the built-in registers and limits of the assembly to take advantage of full 128-bit division. Unfortunately, if the result is too large to fit in a 64-bit register it breaks, rather than giving me half the result and letting me work with it.
Still, I think I'll implement my own version, and then if it's faster I'll submit it.
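One way to reuse a narrow hardware divide without hitting that overflow is schoolbook short division, sketched below in plain D under some assumptions: the value is stored as four 32-bit limbs, least significant first, and the divisor fits in a single limb. The name divSmall is hypothetical. Since the running remainder is always smaller than the divisor, each 64-by-32 step yields a quotient that fits in 32 bits, so the divide never overflows. Dividing by a multi-limb divisor needs a more involved algorithm and is not covered here.

```d
// Hypothetical sketch: divide a 128-bit value (little-endian 32-bit limbs)
// in place by a 32-bit divisor and return the remainder.
uint divSmall(ref uint[4] n, uint d) pure nothrow @nogc
{
    assert(d != 0, "division by zero");
    ulong rem = 0;
    // Walk from the most significant limb down to the least significant one.
    foreach_reverse (ref limb; n)
    {
        // rem < d, so cur < d * 2^32 and cur / d always fits in 32 bits.
        ulong cur = (rem << 32) | limb;
        limb = cast(uint)(cur / d);
        rem = cur % d;
    }
    return cast(uint) rem;
}
```

For example, dividing 2^32 (limbs [0, 1, 0, 0]) by 2 leaves [0x8000_0000, 0, 0, 0] with remainder 0.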
|
May 23, 2017 Re: Mixin in Inline Assembly | ||||
---|---|---|---|---|
| ||||
Posted in reply to Era Scarecrow | On Wednesday, 11 January 2017 at 17:32:35 UTC, Era Scarecrow wrote:
> Still, I think I'll implement my own version, and then if it's faster I'll submit it.
I decided to try my hand at writing a 'ScaledInt', which is intended to allow basically any larger unsigned type. I've come across some assembly confusion.
Using a mixin with assembly, here's the final result of the mixin:
import std.traits : isUnsigned; // needed for the template constraint

alias UCent = ScaledInt!(uint, 4);

struct ScaledInt(I, int Size)
if (isUnsigned!(I) && Size > 1) {
    I[Size] val;

    ScaledInt opBinary(string op)(const ScaledInt rhs) const
    if (op == "+") {
        ScaledInt t;
        asm pure nothrow { //mixin generated from another function, for simplicity
            mov EBX, this;
            clc;
            mov EAX, rhs[EBP+0];
            adc EAX, val[EBX+0];
            mov t[EBP+0], EAX;
            mov EAX, rhs[EBP+4];
            adc EAX, val[EBX+4];
            mov t[EBP+4], EAX;
            mov EAX, rhs[EBP+8];
            adc EAX, val[EBX+8];
            mov t[EBP+8], EAX;
            mov EAX, rhs[EBP+12];
            adc EAX, val[EBX+12];
            mov t[EBP+12], EAX;
        }
        return t;
    }
}
Raw disassembly for my asm code shows this:
mov EBX,-4[EBP]
clc
mov EAX,0Ch[EBP]
adc EAX,[EBX]
mov -014h[EBP],EAX
mov EAX,010h[EBP]
adc EAX,4[EBX]
mov -010h[EBP],EAX
mov EAX,014h[EBP]
adc EAX,8[EBX]
mov -0Ch[EBP],EAX
mov EAX,018h[EBP]
adc EAX,0Ch[EBX]
mov -8[EBP],EAX
From what I'm seeing, it should be 8, 0ch, 10h, then 14h, all positive. I'm really scratching my head why I'm having this issue... When I do a plain add, t[0] = val[0] + rhs[0];, I get this disassembly:
mov EDX,-4[EBP] //mov EDX, this;
mov EBX,[EDX] //val[0]
add EBX,0Ch[EBP]//+ rhs.val[0]
mov ECX,8[EBP] //mov ECX, ???[???]
mov [ECX],EBX //t.val[0] =
If I do "mov ECX,t[EBP]", I get "mov ECX,-014h[EBP]". If I try to reference the exact variable val within t, it complains that it doesn't know it at compile time (although it's a fixed location).
What am I missing here?
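For comparison, the carry chain that the clc/adc sequence is meant to perform can be sketched in plain D, which gives a reference result to check the asm against. This is only an illustration, not the ScaledInt implementation: the name addLimbs is hypothetical, and it assumes four 32-bit limbs stored least significant first, as the ascending offsets in the asm suggest.

```d
// Hypothetical reference: add two 128-bit values held as little-endian
// 32-bit limbs, propagating the carry the way an adc chain would.
uint[4] addLimbs(const uint[4] a, const uint[4] b) pure nothrow @nogc
{
    uint[4] r;
    ulong carry = 0; // plays the role of the CPU carry flag (clc sets it to 0)
    foreach (i; 0 .. 4)
    {
        ulong sum = cast(ulong) a[i] + b[i] + carry;
        r[i] = cast(uint) sum; // low 32 bits are the result limb
        carry = sum >> 32;     // bit 32 is the carry into the next limb
    }
    return r; // a carry out of the top limb is discarded, as in the asm
}
```

Adding [0xFFFF_FFFF, 0, 0, 0] and [1, 0, 0, 0] should give [0, 1, 0, 0], which exercises the carry between the first two limbs.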
|
June 01, 2017 Re: Mixin in Inline Assembly | ||||
---|---|---|---|---|
| ||||
Posted in reply to Era Scarecrow | On Tuesday, 23 May 2017 at 03:33:38 UTC, Era Scarecrow wrote:
> From what I'm seeing, it should be 8, 0ch, 10h, then 14h, all positive. I'm really scratching my head why I'm having this issue...
>
> What am I missing here?
After more experiments, I think it comes down to static arrays.
The following function code
int[4] fun2() {
    int[4] x = void;
    asm {
        mov dword ptr x, 100;
    }
    x[0] = 200; //get example of real offset
    return x;
}
Produces the following (from obj2asm)
int[4] x.fun2() comdat
assume CS:int[4] x.fun2()
enter 014h,0
mov -4[EBP],EAX
mov dword ptr -014h[EBP],064h
mov EAX,-4[EBP]
mov dword ptr [EAX],0C8h // x[0]=200, offset +0
mov EAX,-4[EBP]
leave
ret
int[4] x.fun2() ends
So why is the offset off by 14h (20 bytes)? It's not like we need to set a ptr first.
Go figure, I probably found a bug...
|
June 02, 2017 Re: Mixin in Inline Assembly | ||||
---|---|---|---|---|
| ||||
Posted in reply to Era Scarecrow | On Thursday, 1 June 2017 at 12:00:45 UTC, Era Scarecrow wrote:
> So why is the offset off by 14h (20 bytes)? It's not like we need to set a ptr first.
>
> Go figure, I probably found a bug...
Well, as a side note, a simple yet unsatisfying workaround is to make a new array slice of the memory and then use that pointer directly. Looking at the Intel opcodes and memory addressing conventions, I could have used a very compact instruction sequence with scaling. Instead I'm forced to ignore scaling, and I'm also forced to push/pop the flags to save the carry when advancing the two pointers in parallel. Plus there are 3 instructions that don't need to be there.
Yeah this is probably nitpicking... I can't help wanting to be as optimized and small as possible.
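A rough sketch of that workaround applied to the earlier fun2 test, assuming DMD-style inline asm. The function name fun2ViaPointer and the pointer variable p are hypothetical, and this is not necessarily the exact code used. The idea is that the asm block never names the static array itself; it loads an ordinary pointer variable and stores through it.

```d
// Hypothetical rewrite of fun2: the asm block goes through a pointer
// instead of referencing the static array x by name.
int[4] fun2ViaPointer()
{
    int[4] x = 0;   // zero-initialise all four elements
    int* p = x.ptr; // address of the array as seen by normal D code
    version (D_InlineAsm_X86)
    {
        asm
        {
            mov EAX, p;               // load the pointer value
            mov dword ptr [EAX], 100; // store through it: x[0] = 100
        }
    }
    else version (D_InlineAsm_X86_64)
    {
        asm
        {
            mov RAX, p;
            mov dword ptr [RAX], 100;
        }
    }
    else
    {
        *p = 100; // portable fallback without inline asm
    }
    return x;
}
```

The extra load of p is one of the instructions that would not be needed if the array could be addressed directly.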
|