January 11, 2017
On Wednesday, 11 January 2017 at 00:11:50 UTC, Chris M wrote:
> On Tuesday, 10 January 2017 at 13:13:17 UTC, Basile B. wrote:
>> On Tuesday, 10 January 2017 at 11:38:43 UTC, Guillaume Piolat wrote:
>>> On Tuesday, 10 January 2017 at 10:41:54 UTC, Basile B. wrote:
>>>>
>>>> don't forget to flag
>>>>
>>>> asm pure nothrow {}
>>>>
>>>> otherwise it's slow.
>>>
>>> Why?
>>
>> It's an empirical observation. In September I tried to work out why an inline asm function was slow. It turned out that I hadn't marked the asm block as nothrow:
>>
>> https://forum.dlang.org/post/xznocpxtalpayvkrwxey@forum.dlang.org
>>
>> I opened an issue asking for the specification to explain this clearly.
>
> Huh, that's really interesting, thanks for posting. I guess my other question would be: how do I determine whether a block of assembly is pure?
>

The game changer for performance is just "nothrow".
January 11, 2017
On Tuesday, 10 January 2017 at 10:41:54 UTC, Basile B. wrote:
> don't forget to flag
>
> asm pure nothrow {}
>
> otherwise it's slow.

This suddenly reminds me of some of the speedup assembly I was writing for wideint, but it seems I lost my code. Too bad; the 128-bit multiply had sped up, and the division still needed some work.
January 11, 2017
On Wednesday, 11 January 2017 at 06:14:35 UTC, Era Scarecrow wrote:
>
> This suddenly reminds me of some of the speedup assembly I was writing for wideint, but it seems I lost my code. Too bad; the 128-bit multiply had sped up, and the division still needed some work.

I'd gladly take it if you have an algorithm that reuses 32-bit divides in wideint division instead of scanning bits :)
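Not the full wideint-by-wideint case, but for a wide dividend and a single 32-bit divisor the classic trick is to chain 64/32 divides from the most significant word down, feeding each step's remainder into the next. A minimal C sketch of the idea (names and layout are mine, not from the actual wideint code):

```c
#include <stdint.h>

/* Divide an n-word unsigned integer (most significant word first) by a
 * 32-bit divisor, in place, returning the remainder. Each iteration
 * divides a 64-bit value (previous remainder : current word) by the
 * 32-bit divisor, so the quotient word always fits in 32 bits. */
uint32_t divmod_u32(uint32_t *words, int n, uint32_t divisor)
{
    uint64_t rem = 0;
    for (int i = 0; i < n; ++i) {
        uint64_t cur = (rem << 32) | words[i];
        words[i] = (uint32_t)(cur / divisor);
        rem = cur % divisor;
    }
    return (uint32_t)rem;
}
```

Because the remainder is always smaller than the divisor, each per-word quotient fits in 32 bits, which is exactly the constraint that keeps a hardware DIV from faulting.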
January 11, 2017
On Wednesday, 11 January 2017 at 15:39:49 UTC, Guillaume Piolat wrote:
> On Wednesday, 11 January 2017 at 06:14:35 UTC, Era Scarecrow wrote:
>>
>> This suddenly reminds me of some of the speedup assembly I was writing for wideint, but it seems I lost my code. Too bad; the 128-bit multiply had sped up, and the division still needed some work.
>
> I'd gladly take it if you have an algorithm that reuses 32-bit divides in wideint division instead of scanning bits :)

 I remember the divide was giving me some trouble. The idea was to use the CPU's registers and its full-width divide instruction directly; unfortunately, if the quotient is too large to fit in a 64-bit register, the instruction faults rather than giving me half the result and letting me work with it.

 Still, I think I'll implement my own version, and if it's faster I'll submit it.
May 23, 2017
On Wednesday, 11 January 2017 at 17:32:35 UTC, Era Scarecrow wrote:
>  Still, I think I'll implement my own version, and if it's faster I'll submit it.


Decided I'd try my hand at writing a 'ScaledInt', which is intended to allow arbitrarily large unsigned types. I'm running into some assembly confusion.

Using a mixin with assembly, here's the final result of the mixin:

alias UCent = ScaledInt!(uint, 4);

struct ScaledInt(I, int Size)
if (isUnsigned!(I) && Size > 1) {
    I[Size] val;

    ScaledInt opBinary(string op)(const ScaledInt rhs) const
    if (op == "+") {
        ScaledInt t;
        asm pure nothrow { //mixin generated from another function, for simplicity
            mov EBX, this;
            clc;
            mov EAX, rhs[EBP+0];
            adc EAX, val[EBX+0];
            mov t[EBP+0], EAX;
            mov EAX, rhs[EBP+4];
            adc EAX, val[EBX+4];
            mov t[EBP+4], EAX;
            mov EAX, rhs[EBP+8];
            adc EAX, val[EBX+8];
            mov t[EBP+8], EAX;
            mov EAX, rhs[EBP+12];
            adc EAX, val[EBX+12];
            mov t[EBP+12], EAX;
        }

        return t;
    }
}



Raw disassembly for my asm code shows this:
    mov     EBX,-4[EBP]
    clc
    mov     EAX,0Ch[EBP]
    adc     EAX,[EBX]
    mov     -014h[EBP],EAX
    mov     EAX,010h[EBP]
    adc     EAX,4[EBX]
    mov     -010h[EBP],EAX
    mov     EAX,014h[EBP]
    adc     EAX,8[EBX]
    mov     -0Ch[EBP],EAX
    mov     EAX,018h[EBP]
    adc     EAX,0Ch[EBX]
    mov     -8[EBP],EAX


From what I'm seeing, the offsets should be 8, 0Ch, 10h, then 14h, all positive. I'm really scratching my head over why I'm having this issue... For a plain add of t[0] = val[0] + rhs[0]; I get this disassembly:

    mov     EDX,-4[EBP] //mov EDX, this;
    mov     EBX,[EDX]   //val[0]
    add     EBX,0Ch[EBP]//+ rhs.val[0]
    mov     ECX,8[EBP]  //mov ECX, ???[???]
    mov     [ECX],EBX   //t.val[0] =

If I do "mov ECX,t[EBP]", I get "mov ECX,-014h[EBP]". If I try to reference the val variable within t directly, the compiler complains that it isn't known at compile time (although it's at a fixed location).

What am I missing here?
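As an aside, the carry chain the asm block above implements can be written in portable C as well; this is a sketch of my own (not from ScaledInt), where a 64-bit intermediate plays the role that adc plays in the assembly:

```c
#include <stdint.h>

enum { SIZE = 4 };   /* four 32-bit words, like ScaledInt!(uint, 4) */

/* t = a + b over SIZE little-endian 32-bit words, propagating the
 * carry from each word into the next, as the adc chain does. */
void wide_add(uint32_t t[SIZE], const uint32_t a[SIZE], const uint32_t b[SIZE])
{
    uint32_t carry = 0;
    for (int i = 0; i < SIZE; ++i) {
        uint64_t sum = (uint64_t)a[i] + b[i] + carry;
        t[i]  = (uint32_t)sum;            /* low 32 bits of the sum */
        carry = (uint32_t)(sum >> 32);    /* carry out, 0 or 1 */
    }
}
```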
June 01, 2017
On Tuesday, 23 May 2017 at 03:33:38 UTC, Era Scarecrow wrote:
> From what I'm seeing, it should be 8, 0ch, 10h, then 14h, all positive. I'm really scratching my head why I'm having this issue...
>
> What am I missing here?

More experiments, and I think it comes down to static arrays.

The following function code

int[4] fun2() {
    int[4] x = void;
    asm {
        mov dword ptr x, 100;
    }
    x[0] = 200; //get example of real offset
    return x;
}

Produces the following (from obj2asm)

int[4] x.fun2() comdat
        assume  CS:int[4] x.fun2()
                enter   014h,0
                mov     -4[EBP],EAX
                mov     dword ptr -014h[EBP],064h
                mov     EAX,-4[EBP]
                mov     dword ptr [EAX],0C8h        // x[0]=200, offset +0
                mov     EAX,-4[EBP]
                leave
                ret
int[4] x.fun2() ends


 So why is the offset off by 14h (20 bytes)? It's not like we need to set a pointer first.

 Go figure, I probably found a bug...
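For what it's worth, one possible reading (an assumption on my part, not verified against the DMD source; the names below are mine): a static array seems to be returned through a hidden pointer parameter, roughly like this in C:

```c
/* A sketch of what 'int[4] fun2()' may lower to: the caller passes a
 * hidden pointer to the return slot, which is also handed back in EAX. */
int *fun2_lowered(int *ret)   /* hidden pointer, spilled to -4[EBP] */
{
    /* With the named-return-value optimization, ordinary D statements
     * like x[0] = 200 store straight through the hidden pointer (the
     * 'mov dword ptr [EAX],0C8h' above), while the inline asm names a
     * separate stack slot for x at -014h[EBP]. */
    ret[0] = 200;
    return ret;               /* EAX = hidden pointer on return */
}
```

If that reading is right, the 14h discrepancy is the gap between the local stack slot the asm addresses and the caller-provided return slot that normal code writes through.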
June 02, 2017
On Thursday, 1 June 2017 at 12:00:45 UTC, Era Scarecrow wrote:
>  So why is the offset off by 14h (20 bytes)? It's not like we need to set a pointer first.
>
>  Go figure, I probably found a bug...

 Well, as a side note, a simple if unsatisfying workaround is making a new array slice over the memory and then using that pointer directly. Looking at the Intel opcodes and memory addressing conventions, I could have used a very compact instruction sequence with scaling. Instead I'm forced to ignore scaling, and I'm also forced to push/pop the flags to preserve the carry while advancing the two pointers in parallel. Plus there are 3 instructions that don't need to be there.

 Yeah, this is probably nitpicking... I can't help wanting it to be as optimized and small as possible.