June 01, 2016 Can I get a more in-depth guide about the inline assembler?
Here's the assembly code for my alpha-blending routine:
ubyte[4] src = *cast(ubyte[4]*)(palette.ptr + 4 * *c);
ubyte[4] *p = cast(ubyte[4]*)(workpad + (offsetX + x)*4 + offsetY);
asm{ //moving the values to their destinations
movd MM0, p;
movd MM1, src;
movq MM5, alpha;
movq MM7, alphaMMXmul_const1;
movq MM6, alphaMMXmul_const2;
punpcklbw MM2, MM0;
punpcklbw MM3, MM1;

paddw MM6, MM5; //1 + alpha
psubw MM7, MM5; //256 - alpha

pmulhuw MM2, MM6; //src * (1 + alpha)
pmulhuw MM3, MM7; //dest * (256 - alpha)
paddw MM3, MM2; //(src * (1 + alpha)) + (dest * (256 - alpha))
psrlw MM3, 8; //(src * (1 + alpha)) + (dest * (256 - alpha)) / 256
//moving the result to its place;
packuswb MM4, MM3;
movd p, MM4;
emms;
}

The two constants being referred here:
static immutable ushort[4] alphaMMXmul_const1 = [256,256,256,256];
static immutable ushort[4] alphaMMXmul_const2 = [1,1,1,1];

alpha is a ushort[4] containing the alpha value four times.

After some debugging, I found out that the p pointer becomes null at the end instead of pointing to a value. I have no experience with using in-line assemblers (although I made a few Hello World programs for MS-Dos with a stand-alone assembler), so I don't know when and how the compiler will interpret the types from D.

June 01, 2016 Re: Can I get a more in-depth guide about the inline assembler?
Posted in reply to ZILtoid1991 | On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:
> After some debugging, I found out that the p pointer becomes null at the end instead of pointing to a value. I have no experience with using in-line assemblers (although I made a few Hello World programs for MS-Dos with a stand-alone assembler), so I don't know when and how the compiler will interpret the types from D.
In the assembler the variable names actually become just the offset to where they are in the stack in relation to BP. So if you want the full pointer you actually need to convert it into a register first and then just use that register instead. So.... This should be correct.

//unless you are going to actually use ubyte[4] here, just making a pointer will work instead, so cast(uint*) probably
> ubyte[4] src = *cast(ubyte[4]*)(palette.ptr + 4 * *c);
> ubyte[4] *p = cast(ubyte[4]*)(workpad + (offsetX + x)*4 + offsetY);
> asm{ //moving the values to their destinations
movd ESI, src[EBP]; //get source pointer
movd EDI, p[EBP]; //get destination pointer
movd MM0, [EDI]; //use directly
movd MM1, [ESI];
> movq MM5, alpha;
> movq MM7, alphaMMXmul_const1;
> movq MM6, alphaMMXmul_const2;

> <snip>

movd [EDI], MM4;
}
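
As a minimal sketch of the advice in this reply (load the pointer into a register first, then dereference the register), the function below reads a value through a pointer using DMD-style inline assembly. The name `loadThroughPointer` is made up for illustration, the choice of ECX/RCX as the scratch register is arbitrary, and the sketch assumes a compiler that accepts DMD-style `asm` blocks on x86 (DMD or LDC); everywhere else it falls back to plain D. It uses the general-purpose `mov` for the integer registers, since `movd` is meant for moves that involve an MMX or XMM register.

```d
/// Hypothetical helper: returns the uint that `p` points to.
uint loadThroughPointer(uint* p)
{
    uint result;
    version (D_InlineAsm_X86)
    {
        asm
        {
            mov ECX, p;       // ECX = the address stored in the variable p
            mov EAX, [ECX];   // EAX = the 32 bits found at that address
            mov result, EAX;  // write back to the local variable
        }
    }
    else version (D_InlineAsm_X86_64)
    {
        asm
        {
            mov RCX, p;       // pointers are 64 bits wide here
            mov EAX, [RCX];
            mov result, EAX;
        }
    }
    else
    {
        result = *p;          // no DMD-style inline assembler available
    }
    return result;
}

// Run with: dmd -unittest -main -run <file>.d
unittest
{
    uint pixel = 0x11223344;
    assert(loadThroughPointer(&pixel) == 0x11223344);
}
```

Naming `p` directly inside the `asm` block refers to the memory slot that holds the pointer, which is why a second instruction is needed to reach the pixel itself.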

June 02, 2016 Re: Can I get a more in-depth guide about the inline assembler?
Posted in reply to Era Scarecrow | On Wednesday, 1 June 2016 at 23:35:40 UTC, Era Scarecrow wrote:
> On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:
>> After some debugging, I found out that the p pointer becomes null at the end instead of pointing to a value. I have no experience with using in-line assemblers (although I made a few Hello World programs for MS-Dos with a stand-alone assembler), so I don't know when and how the compiler will interpret the types from D.
>
> In the assembler the variable names actually become just the offset to where they are in the stack in relation to BP. So if you want the full pointer you actually need to convert it into a register first and then just use that register instead. So.... This should be correct.
>
> //unless you are going to actually use ubyte[4] here, just making a pointer will work instead, so cast(uint*) probably
>> ubyte[4] src = *cast(ubyte[4]*)(palette.ptr + 4 * *c);
>> ubyte[4] *p = cast(ubyte[4]*)(workpad + (offsetX + x)*4 + offsetY);
>> asm{ //moving the values to their destinations
> movd ESI, src[EBP]; //get source pointer
> movd EDI, p[EBP]; //get destination pointer
> movd MM0, [EDI]; //use directly
> movd MM1, [ESI];
>> movq MM5, alpha;
>> movq MM7, alphaMMXmul_const1;
>> movq MM6, alphaMMXmul_const2;
>
>> <snip>
>
> movd [EDI], MM4;
> }
I could get the code working with a bug after replacing pmulhuw with pmullw, but due to integer overflow I get a glitched image. I try to get around the fact that pmulhuw stores the high bits of the result either with multiplication or with bit shifting.
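
To make the high-bits versus low-bits distinction concrete, here is a small scalar model in plain D of what the two instructions keep for a single 16-bit lane. The function names `mulLow` and `mulHighUnsigned` are invented for this sketch; they only imitate the per-lane arithmetic and are not the instructions themselves. The model assumes each lane holds a zero-extended byte (0 to 255) and a factor of at most 256, in which case one product is at most 255 * 256 = 65280 and fits in 16 bits.

```d
/// Low 16 bits of an unsigned 16x16 multiply: the part pmullw keeps per lane.
ushort mulLow(ushort a, ushort b)
{
    return cast(ushort)((cast(uint) a * b) & 0xFFFF);
}

/// High 16 bits of an unsigned 16x16 multiply: the part pmulhuw keeps per lane.
ushort mulHighUnsigned(ushort a, ushort b)
{
    return cast(ushort)((cast(uint) a * b) >> 16);
}

// Run with: dmd -unittest -main -run <file>.d
unittest
{
    // src = 200, alpha = 100, so the factor is alpha + 1 = 101.
    assert(mulLow(200, 101) == 20_200);       // the whole product survives
    assert(mulHighUnsigned(200, 101) == 0);   // the high half is empty

    // Largest product for a byte-sized channel and a factor of 256.
    assert(mulLow(255, 256) == 65_280);
    assert(mulHighUnsigned(255, 256) == 0);

    // If the byte ends up in the high half of the lane (200 << 8),
    // the low half of the product no longer holds the full result.
    assert(mulLow(200 << 8, 101) != 200 * 101);
}
```

Under those assumptions the low half is the one that carries the useful result, so overflow with `pmullw` suggests that the lanes hold something larger than a zero-extended byte.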

June 02, 2016 Re: Can I get a more in-depth guide about the inline assembler?
Posted in reply to ZILtoid1991 | On Thursday, 2 June 2016 at 00:51:15 UTC, ZILtoid1991 wrote:
> On Wednesday, 1 June 2016 at 23:35:40 UTC, Era Scarecrow wrote:
>> On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:
> I could get the code working with a bug after replacing pmulhuw with pmullw, but due to integer overflow I get a glitched image. I try to get around the fact that pmulhuw stores the high bits of the result either with multiplication or with bit shifting.
I forgot to mention that I had to make pointers for the arrays I used in order to be able to load them.

June 02, 2016 Re: Can I get a more in-depth guide about the inline assembler?
Posted in reply to ZILtoid1991 | On Thursday, 2 June 2016 at 00:52:48 UTC, ZILtoid1991 wrote:
> On Thursday, 2 June 2016 at 00:51:15 UTC, ZILtoid1991 wrote:
>> I could get the code working with a bug after replacing pmulhuw with pmullw, but due to integer overflow I get a glitched image. I try to get around the fact that pmulhuw stores the high bits of the result either with multiplication or with bit shifting.
>
> I forgot to mention that I had to make pointers for the arrays I used in order to be able to load them.
I'm not familiar with the MMX instruction set, but glancing at the source again I notice that the constants are (of course) arrays. For the two instructions that load them, you should probably create or convert pointers as appropriate; or, if they are enum values, I think they will be dropped in directly and work fine (the values being 0x0100_0100_0100_0100 and 0x0001_0001_0001_0001, I believe, based on the ushort layout).
Maybe you already had that solved or the compiler does something I don't know...
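
The two 64-bit values mentioned above can be checked with a few lines of plain D. The helper `packLanes` is a made-up name for this sketch; it places lane 0 in the lowest 16 bits, which is the order in which a little-endian x86 machine would see a `ushort[4]` loaded as one 64-bit quantity.

```d
/// Hypothetical helper: combines four 16-bit lanes into one 64-bit value,
/// with lanes[0] in the lowest 16 bits.
ulong packLanes(const ushort[4] lanes)
{
    ulong value = 0;
    foreach (i, lane; lanes)
        value |= cast(ulong) lane << (16 * i);
    return value;
}

// Run with: dmd -unittest -main -run <file>.d
unittest
{
    static immutable ushort[4] const256 = [256, 256, 256, 256];
    static immutable ushort[4] const1 = [1, 1, 1, 1];

    assert(packLanes(const256) == 0x0100_0100_0100_0100);
    assert(packLanes(const1) == 0x0001_0001_0001_0001);
}
```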

June 02, 2016 Re: Can I get a more in-depth guide about the inline assembler?
Posted in reply to ZILtoid1991 | On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:
> Here's the assembly code for my alpha-blending routine:
Could you also paste the D version of your code? Perhaps the compiler (LDC, GDC) will generate similarly vectorized code that is inlinable, etc.
-Johan

June 02, 2016 Re: Can I get a more in-depth guide about the inline assembler?
Posted in reply to Johan Engelen | On Thursday, 2 June 2016 at 07:17:23 UTC, Johan Engelen wrote:
> On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:
>> Here's the assembly code for my alpha-blending routine:
>
> Could you also paste the D version of your code? Perhaps the compiler (LDC, GDC) will generate similarly vectorized code that is inlinable, etc.
>
> -Johan
ubyte[4] dest2 = *p;
dest2[1] = to!ubyte((src[1] * (src[0] + 1) + dest2[1] * (256 - src[0]))>>8);
dest2[2] = to!ubyte((src[2] * (src[0] + 1) + dest2[2] * (256 - src[0]))>>8);
dest2[3] = to!ubyte((src[3] * (src[0] + 1) + dest2[3] * (256 - src[0]))>>8);
*p = dest2;

The main problem with this is that it's much slower, even if I would calculate the alpha blending values once. The assembly code does not seem to have higher impact than the "replace if alpha = 255" algorithm:

if(src[0] == 255){
*p = src;
}

It also seems I have a quite few problems with the assembly code, mostly with the pmulhuw command (it returns the higher 16 bit of the result, I need the lower 16 bit as unsigned), also with the pointers, as the read outs and write backs doesn't land to their correct places, sometimes resulting in a flickering screen or wrong colors affecting neighboring pixels.
Current assembly code:
//ushort[4] alpha = [src[0],src[0],src[0],src[0]]; //replace it if there's a faster method for this
ushort[4] alpha = [100,100,100,100];
//src[3] = 255;
ubyte[4] *p2 = cast(ubyte[4]*)src2.ptr;
ushort[4] *p3 = cast(ushort[4]*)alpha.ptr;
ushort[4] *pc_1 = cast(ushort[4]*)alphaMMXmul_const1.ptr;
ushort[4] *pc_256 = cast(ushort[4]*)alphaMMXmul_const256.ptr;
asm{
//moving the values to their destinations
mov ESI, p2[EBP];
mov EDI, p[EBP];
movd MM0, [ESI];
movd MM1, [EDI];
mov ESI, p3[EBP];
movq MM5, [ESI];
mov ESI, pc_256[EBP];
movq MM7, [ESI];
mov ESI, pc_1[EBP];
movq MM6, [ESI];
punpcklbw MM2, MM0;
punpcklbw MM3, MM1;
paddw MM6, MM5; //1 + alpha
psubw MM7, MM5; //256 - alpha
//psllw MM2, 2;
//psllw MM3, 2;
psrlw MM6, 1;
psrlw MM7, 1;
pmullw MM2, MM6; //src * (1 + alpha)
pmullw MM3, MM7; //dest * (256 - alpha)
paddw MM3, MM2; //(src * (1 + alpha)) + (dest * (256 - alpha))
psrlw MM3, 8; //(src * (1 + alpha)) + (dest * (256 - alpha)) / 256
//moving the result to its place;
packuswb MM4, MM3;
movd [EDI-3], MM4;
emms;
}
Tried to get the correct result with trial and error, but there's no real improvement.
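
When an assembly version is tuned by trial and error, a scalar reference to compare against can help. The sketch below wraps the D formula from this post in two functions; the names `blendChannel` and `blendPixel` are invented here, it uses a plain cast where the post uses `to!ubyte`, and it assumes the same pixel layout as the post (alpha in element 0, color channels in elements 1 to 3).

```d
/// One channel: (src * (alpha + 1) + dest * (256 - alpha)) >> 8.
ubyte blendChannel(ubyte src, ubyte dest, ubyte alpha)
{
    // The sum is at most 255 * 257 = 65535, so the shifted value fits a byte.
    immutable uint sum = src * (alpha + 1) + dest * (256 - alpha);
    return cast(ubyte)(sum >> 8);
}

/// Blends channels 1..3 of src over dest, using src[0] as alpha.
/// Element 0 of the result is taken from dest unchanged.
ubyte[4] blendPixel(ubyte[4] src, ubyte[4] dest)
{
    ubyte[4] result = dest;
    foreach (i; 1 .. 4)
        result[i] = blendChannel(src[i], dest[i], src[0]);
    return result;
}

// Run with: dmd -unittest -main -run <file>.d
unittest
{
    ubyte[4] dest = [7, 200, 100, 50];

    // alpha = 255 reproduces the source colors.
    ubyte[4] opaque = [255, 10, 20, 30];
    ubyte[4] expectedOpaque = [7, 10, 20, 30];
    assert(blendPixel(opaque, dest) == expectedOpaque);

    // alpha = 0 leaves the destination untouched.
    ubyte[4] transparent = [0, 10, 20, 30];
    assert(blendPixel(transparent, dest) == dest);

    // A value in between: (200 * 101 + 50 * 156) >> 8 = 28000 >> 8 = 109.
    assert(blendChannel(200, 50, 100) == 109);
}
```

The alpha = 255 case matches the "replace if alpha = 255" shortcut, so the shortcut and the general formula agree at that point.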

June 02, 2016 Re: Can I get a more in-depth guide about the inline assembler?
Posted in reply to ZILtoid1991 | On Thursday, 2 June 2016 at 13:32:51 UTC, ZILtoid1991 wrote:
> On Thursday, 2 June 2016 at 07:17:23 UTC, Johan Engelen wrote:
>> Could you also paste the D version of your code? Perhaps the compiler (LDC, GDC) will generate similarly vectorized code that is inlinable, etc.
>
> ubyte[4] dest2 = *p;
> dest2[1] = to!ubyte((src[1] * (src[0] + 1) + dest2[1] * (256 - src[0]))>>8);
> dest2[2] = to!ubyte((src[2] * (src[0] + 1) + dest2[2] * (256 - src[0]))>>8);
> dest2[3] = to!ubyte((src[3] * (src[0] + 1) + dest2[3] * (256 - src[0]))>>8);
> *p = dest2;
>
> The main problem with this is that it's much slower, even if I would calculate the alpha blending values once. The assembly code does not seem to have higher impact than the "replace if alpha = 255" algorithm:
>
> if(src[0] == 255){
> *p = src;
> }
>
> It also seems I have a quite few problems with the assembly code, mostly with the pmulhuw command (it returns the higher 16 bit of the result, I need the lower 16 bit as unsigned), also with the pointers, as the read outs and write backs doesn't land to their correct places, sometimes resulting in a flickering screen or wrong colors affecting neighboring pixels. Current assembly code:
I'd say most of your speedup comes from doing three things at once. More specifically: you're working with three 8-bit color channels, so you have 24 bits of data, and by giving each channel 8 extra bits of fixed-point headroom you can do four small multiplies in a single instruction.
You'd probably get a similar effect without MMX by packing the three color channels into one integer with bit shifts, multiplying by the alpha factor, and shifting the results back out. That reduces 6 multiplies to a mere 2:
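//cast before shifting: a ubyte is promoted to int, and an int cannot be shifted by 32
ulong tmp1 = (cast(ulong)src[1] << 32) | (cast(ulong)src[2] << 16) | src[3];
ulong tmp2 = (cast(ulong)dest2[1] << 32) | (cast(ulong)dest2[2] << 16) | dest2[3];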
tmp1 *= src[0]+1;
tmp1 += tmp2*(256 - src[0]);
src[3] = (tmp1 >> 8) & 0xff;
src[2] = (tmp1 >> 24) & 0xff;
src[1] = (tmp1 >> 40) & 0xff;
You could also widen the fields so that further adds or other calculations have more room to work with, though not by much. Say you gave each channel 20 bits rather than 16: each field can then hold values 16x higher, enough for getting, say, the average of x values at no cost (if x is a power of 2, so the divide is just a shift), other than a little difference in how you write it :)
You might still get a better result from MMX instructions once you have them in the right order. Don't forget, though, that the MMX registers share their storage with the x87 floating-point registers, so mixing the two without an emms in between is a big no-no.
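
The packing idea from this reply can be written as a self-contained function and checked against the per-channel formula. This is a sketch under stated assumptions, not the poster's exact code: the names `blendPacked` and `blendScalar` are invented, the three color channels are passed as a `ubyte[3]` with alpha as a separate argument, and the result is returned instead of being written back into `src`.

```d
/// Per-channel reference: (src * (alpha + 1) + dest * (256 - alpha)) >> 8.
ubyte blendScalar(ubyte src, ubyte dest, ubyte alpha)
{
    return cast(ubyte)((src * (alpha + 1) + dest * (256 - alpha)) >> 8);
}

/// Blends three channels with two multiplies by giving each channel
/// its own 16-bit field inside one 64-bit integer.
ubyte[3] blendPacked(ubyte[3] src, ubyte[3] dest, ubyte alpha)
{
    // Widen to ulong before shifting so that the shift by 32 is valid.
    immutable ulong s = (cast(ulong) src[0] << 32) | (cast(ulong) src[1] << 16) | src[2];
    immutable ulong d = (cast(ulong) dest[0] << 32) | (cast(ulong) dest[1] << 16) | dest[2];

    // Each field ends up at most 255 * 257 = 65535, so no field
    // carries into its neighbor.
    immutable ulong sum = s * (alpha + 1) + d * (256 - alpha);

    ubyte[3] result;
    result[0] = cast(ubyte)(sum >> 40); // high byte of the field at bits 32..47
    result[1] = cast(ubyte)(sum >> 24); // high byte of the field at bits 16..31
    result[2] = cast(ubyte)(sum >> 8);  // high byte of the field at bits 0..15
    return result;
}

// Run with: dmd -unittest -main -run <file>.d
unittest
{
    ubyte[3] src = [200, 0, 255];
    ubyte[3] dest = [50, 255, 255];
    foreach (alpha; 0 .. 256)
    {
        immutable a = cast(ubyte) alpha;
        immutable packed = blendPacked(src, dest, a);
        foreach (i; 0 .. 3)
            assert(packed[i] == blendScalar(src[i], dest[i], a));
    }
}
```

The 16-bit field width is what keeps the channels from bleeding into each other; a narrower field would overflow for bright pixels.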

June 04, 2016 Re: Can I get a more in-depth guide about the inline assembler?
Posted in reply to ZILtoid1991 | On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:
> Here's the assembly code for my alpha-blending routine:
> ubyte[4] src = *cast(ubyte[4]*)(palette.ptr + 4 * *c);
> ubyte[4] *p = cast(ubyte[4]*)(workpad + (offsetX + x)*4 + offsetY);
> asm{ //moving the values to their destinations
> movd MM0, p;
> movd MM1, src;
> movq MM5, alpha;
> movq MM7, alphaMMXmul_const1;
> movq MM6, alphaMMXmul_const2;
> punpcklbw MM2, MM0;
> punpcklbw MM3, MM1;
>
> paddw MM6, MM5; //1 + alpha
> psubw MM7, MM5; //256 - alpha
>
> pmulhuw MM2, MM6; //src * (1 + alpha)
> pmulhuw MM3, MM7; //dest * (256 - alpha)
> paddw MM3, MM2; //(src * (1 + alpha)) + (dest * (256 - alpha))
> psrlw MM3, 8; //(src * (1 + alpha)) + (dest * (256 - alpha)) / 256
> //moving the result to its place;
> packuswb MM4, MM3;
> movd p, MM4;
> emms;
> }
>
> The two constants being referred here:
> static immutable ushort[4] alphaMMXmul_const1 = [256,256,256,256];
> static immutable ushort[4] alphaMMXmul_const2 = [1,1,1,1];
>
> alpha is a ushort[4] containing the alpha value four times.
>
> After some debugging, I found out that the p pointer becomes null at the end instead of pointing to a value. I have no experience with using in-line assemblers (although I made a few Hello World programs for MS-Dos with a stand-alone assembler), so I don't know when and how the compiler will interpret the types from D.
Problem solved. Current assembly code:
asm{
//moving the values to their destinations
mov EBX, p[EBP];
movd MM0, src;
movd MM1, [EBX];
movq MM5, alpha;
movq MM7, alphaMMXmul_const256;
movq MM6, alphaMMXmul_const1;
pxor MM2, MM2;
punpcklbw MM0, MM2;
punpcklbw MM1, MM2;
paddusw MM6, MM5; //1 + alpha
psubusw MM7, MM5; //256 - alpha
pmullw MM0, MM6; //src * (1 + alpha)
pmullw MM1, MM7; //dest * (256 - alpha)
paddusw MM0, MM1; //(src * (1 + alpha)) + (dest * (256 - alpha))
psrlw MM0, 8; //(src * (1 + alpha)) + (dest * (256 - alpha)) / 256
//moving the result to its place;
//pxor MM2, MM2;
packuswb MM0, MM2;
movd [EBX], MM0;
emms;
}
The actual problem was the poor documentation of MMX instructions as it never really caught on, and the disappearance of assembly programming from the mainstream. The end result was a quick alpha-blending algorithm that barely has any extra performance penalty compared to just copying the pixels. I currently have no plans on translating the whole sprite displaying algorithm to assembly, instead I'll work on the editor for the game engine.
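
One visible difference between the first and the final listing is the operand order of `punpcklbw` and the added `pxor MM2, MM2`. The scalar model below imitates what the instruction does to the low four bytes of its operands: it interleaves them, destination byte first, so each 16-bit word is the destination byte plus the source byte shifted up by 8. The function name and the array-based signature are inventions for this sketch, and reading this as the cause of the earlier glitches is an interpretation rather than something stated in the thread.

```d
/// Scalar model of punpcklbw on the low four bytes of both operands:
/// word i = dest[i] | (src[i] << 8).
ushort[4] punpcklbw(ubyte[4] dest, ubyte[4] src)
{
    ushort[4] words;
    foreach (i; 0 .. 4)
        words[i] = cast(ushort)(dest[i] | (src[i] << 8));
    return words;
}

// Run with: dmd -unittest -main -run <file>.d
unittest
{
    ubyte[4] pixel = [1, 2, 3, 4];
    ubyte[4] zero = [0, 0, 0, 0];

    // Final listing: pixel as destination, zeroed register as source.
    // Every byte is zero-extended to a 16-bit word.
    ushort[4] zeroExtended = [1, 2, 3, 4];
    assert(punpcklbw(pixel, zero) == zeroExtended);

    // First listing: pixel as source. Even with a zeroed destination,
    // every byte lands in the high half of its word (value * 256).
    ushort[4] shifted = [256, 512, 768, 1024];
    assert(punpcklbw(zero, pixel) == shifted);
}
```

With zero-extended words, each product in the blend stays within 16 bits, which fits the switch to `pmullw` in the final listing.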

June 04, 2016 Re: Can I get a more in-depth guide about the inline assembler?
Posted in reply to ZILtoid1991 | On Saturday, 4 June 2016 at 01:44:38 UTC, ZILtoid1991 wrote:
> Problem solved. Current assembly code:
>
> //moving the values to their destinations
> mov EBX, p[EBP];
> movd MM0, src;
> movd MM1, [EBX];
>
> <snip>
>
> The actual problem was the poor documentation of MMX instructions as it never really caught on, and the disappearance of assembly programming from the mainstream. The end result was a quick alpha-blending algorithm that barely has any extra performance penalty compared to just copying the pixels. I currently have no plans on translating the whole sprite displaying algorithm to assembly, instead I'll work on the editor for the game engine.
So... Why did you need to dereference the pointer for p and move it to EBX, but didn't need to do it for src (no [])?
Maybe you should explain your experiences with the MMX instruction set, follies and what you succeeded on? Where does the documentation fail? And are we talking about the Intel manuals and instruction sets or another source?