Thread overview | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
|
November 22, 2013 Optimization tips for alpha blending / rasterization loop | ||||
---|---|---|---|---|
| ||||
I'm trying to learn some software rasterization stuff. Here's what I'm doing: 32-bit DMD on 64-bit Windows Framebuffer is an int[], each int is a pixel of format 0xAABBGGRR (this seems fastest to my CPU + GPU) Framebuffer is thrown as is to OpenGL, rendered as textured quad. Here's a simple rectangle drawing algorithm that also does alpha blending. I tried quite a many variations (for example without the byte casting, using ints and shifting instead), but none was as fast as this: class Framebuffer { int[] data; int width; int height; } void drawRectangle(Framebuffer framebuffer, int x, int y, int width, int height, int color) { foreach (i; y .. y + height) { int start = x + i * framebuffer.width; foreach(j; 0 .. width) { byte* bg = cast(byte*)&framebuffer.data[start + j]; byte* fg = cast(byte*)&color; int alpha = (fg[3] & 0xff) + 1; int inverseAlpha = 257 - alpha; bg[0] = cast(byte)((alpha * (fg[0] & 0xff) + inverseAlpha * (bg[0] & 0xff)) >> 8); bg[1] = cast(byte)((alpha * (fg[1] & 0xff) + inverseAlpha * (bg[1] & 0xff)) >> 8); bg[2] = cast(byte)((alpha * (fg[2] & 0xff) + inverseAlpha * (bg[2] & 0xff)) >> 8); bg[3] = cast(byte)0xff; } } } I would like to make this as fast as possible as it is done for almost every pixel every frame. Am I doing something stupid that is slowing things down? Cache trashing, or even branch prediction errors? :) Is this kind of algorith + data even a candidate for SIMD usage? Even if fg is of type byte, fg[0] would return greater value than 0xff. It needs to be (fg[0] & 0xff) to make things work. I wonder why? |
November 22, 2013 Re: Optimization tips for alpha blending / rasterization loop | ||||
---|---|---|---|---|
| ||||
Posted in reply to Mikko Ronkainen | On Friday, 22 November 2013 at 02:24:56 UTC, Mikko Ronkainen
wrote:
> I'm trying to learn some software rasterization stuff. Here's what I'm doing:
>
> 32-bit DMD on 64-bit Windows
> Framebuffer is an int[], each int is a pixel of format 0xAABBGGRR (this seems fastest to my CPU + GPU)
> Framebuffer is thrown as is to OpenGL, rendered as textured quad.
>
> Here's a simple rectangle drawing algorithm that also does alpha blending. I tried quite a many variations (for example without the byte casting, using ints and shifting instead), but none was as fast as this:
>
> class Framebuffer
> {
> int[] data;
> int width;
> int height;
> }
>
> void drawRectangle(Framebuffer framebuffer, int x, int y, int width, int height, int color)
> {
> foreach (i; y .. y + height)
> {
> int start = x + i * framebuffer.width;
>
> foreach(j; 0 .. width)
> {
> byte* bg = cast(byte*)&framebuffer.data[start + j];
> byte* fg = cast(byte*)&color;
>
> int alpha = (fg[3] & 0xff) + 1;
> int inverseAlpha = 257 - alpha;
>
> bg[0] = cast(byte)((alpha * (fg[0] & 0xff) + inverseAlpha * (bg[0] & 0xff)) >> 8);
> bg[1] = cast(byte)((alpha * (fg[1] & 0xff) + inverseAlpha * (bg[1] & 0xff)) >> 8);
> bg[2] = cast(byte)((alpha * (fg[2] & 0xff) + inverseAlpha * (bg[2] & 0xff)) >> 8);
> bg[3] = cast(byte)0xff;
> }
> }
> }
>
> I would like to make this as fast as possible as it is done for almost every pixel every frame.
>
> Am I doing something stupid that is slowing things down? Cache trashing, or even branch prediction errors? :)
> Is this kind of algorith + data even a candidate for SIMD usage?
> Even if fg is of type byte, fg[0] would return greater value than 0xff. It needs to be (fg[0] & 0xff) to make things work. I wonder why?
Do you want to use a ubyte instead of a byte here?
Also, for your alpha channel:
int alpha = (fg[3] & 0xff) + 1;
int inverseAlpha = 257 - alpha;
If fg[3] = 0 then inverseAlpha = 256, which is out of the range
that can be stored in a ubyte.
Craig
|
November 22, 2013 Re: Optimization tips for alpha blending / rasterization loop | ||||
---|---|---|---|---|
| ||||
Posted in reply to Craig Dillabaugh | On Friday, 22 November 2013 at 03:36:38 UTC, Craig Dillabaugh wrote:
> On Friday, 22 November 2013 at 02:24:56 UTC, Mikko Ronkainen
> wrote:
>> I'm trying to learn some software rasterization stuff. Here's what I'm doing:
>>
>> 32-bit DMD on 64-bit Windows
>> Framebuffer is an int[], each int is a pixel of format 0xAABBGGRR (this seems fastest to my CPU + GPU)
>> Framebuffer is thrown as is to OpenGL, rendered as textured quad.
>>
>> Here's a simple rectangle drawing algorithm that also does alpha blending. I tried quite a many variations (for example without the byte casting, using ints and shifting instead), but none was as fast as this:
>>
>> class Framebuffer
>> {
>> int[] data;
>> int width;
>> int height;
>> }
>>
>> void drawRectangle(Framebuffer framebuffer, int x, int y, int width, int height, int color)
>> {
>> foreach (i; y .. y + height)
>> {
>> int start = x + i * framebuffer.width;
>>
>> foreach(j; 0 .. width)
>> {
>> byte* bg = cast(byte*)&framebuffer.data[start + j];
>> byte* fg = cast(byte*)&color;
>>
>> int alpha = (fg[3] & 0xff) + 1;
>> int inverseAlpha = 257 - alpha;
>>
>> bg[0] = cast(byte)((alpha * (fg[0] & 0xff) + inverseAlpha * (bg[0] & 0xff)) >> 8);
>> bg[1] = cast(byte)((alpha * (fg[1] & 0xff) + inverseAlpha * (bg[1] & 0xff)) >> 8);
>> bg[2] = cast(byte)((alpha * (fg[2] & 0xff) + inverseAlpha * (bg[2] & 0xff)) >> 8);
>> bg[3] = cast(byte)0xff;
>> }
>> }
>> }
>>
>> I would like to make this as fast as possible as it is done for almost every pixel every frame.
>>
>> Am I doing something stupid that is slowing things down? Cache trashing, or even branch prediction errors? :)
>> Is this kind of algorith + data even a candidate for SIMD usage?
>> Even if fg is of type byte, fg[0] would return greater value than 0xff. It needs to be (fg[0] & 0xff) to make things work. I wonder why?
>
> Do you want to use a ubyte instead of a byte here?
>
> Also, for your alpha channel:
>
> int alpha = (fg[3] & 0xff) + 1;
> int inverseAlpha = 257 - alpha;
>
> If fg[3] = 0 then inverseAlpha = 256, which is out of the range
> that can be stored in a ubyte.
>
> Craig
If I'm right all of these lines:
byte* fg = cast(byte*)&color;
int alpha = (fg[3] & 0xff) + 1;
int inverseAlpha = 257 - alpha;
are constant, and you put it outside the both foreach using an enum;
you can also pre-calculate this:
(alpha * (fg[0] & 0xff)
before foreach.
|
November 22, 2013 Re: Optimization tips for alpha blending / rasterization loop | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrea Fontana | On Friday, 22 November 2013 at 08:44:06 UTC, Andrea Fontana wrote:
> On Friday, 22 November 2013 at 03:36:38 UTC, Craig Dillabaugh wrote:
>> On Friday, 22 November 2013 at 02:24:56 UTC, Mikko Ronkainen
>> wrote:
>>> I'm trying to learn some software rasterization stuff. Here's what I'm doing:
>>>
>>> 32-bit DMD on 64-bit Windows
>>> Framebuffer is an int[], each int is a pixel of format 0xAABBGGRR (this seems fastest to my CPU + GPU)
>>> Framebuffer is thrown as is to OpenGL, rendered as textured quad.
>>>
>>> Here's a simple rectangle drawing algorithm that also does alpha blending. I tried quite a many variations (for example without the byte casting, using ints and shifting instead), but none was as fast as this:
>>>
>>> class Framebuffer
>>> {
>>> int[] data;
>>> int width;
>>> int height;
>>> }
>>>
>>> void drawRectangle(Framebuffer framebuffer, int x, int y, int width, int height, int color)
>>> {
>>> foreach (i; y .. y + height)
>>> {
>>> int start = x + i * framebuffer.width;
>>>
>>> foreach(j; 0 .. width)
>>> {
>>> byte* bg = cast(byte*)&framebuffer.data[start + j];
>>> byte* fg = cast(byte*)&color;
>>>
>>> int alpha = (fg[3] & 0xff) + 1;
>>> int inverseAlpha = 257 - alpha;
>>>
>>> bg[0] = cast(byte)((alpha * (fg[0] & 0xff) + inverseAlpha * (bg[0] & 0xff)) >> 8);
>>> bg[1] = cast(byte)((alpha * (fg[1] & 0xff) + inverseAlpha * (bg[1] & 0xff)) >> 8);
>>> bg[2] = cast(byte)((alpha * (fg[2] & 0xff) + inverseAlpha * (bg[2] & 0xff)) >> 8);
>>> bg[3] = cast(byte)0xff;
>>> }
>>> }
>>> }
>>>
>>> I would like to make this as fast as possible as it is done for almost every pixel every frame.
>>>
>>> Am I doing something stupid that is slowing things down? Cache trashing, or even branch prediction errors? :)
>>> Is this kind of algorith + data even a candidate for SIMD usage?
>>> Even if fg is of type byte, fg[0] would return greater value than 0xff. It needs to be (fg[0] & 0xff) to make things work. I wonder why?
>>
>> Do you want to use a ubyte instead of a byte here?
>>
>> Also, for your alpha channel:
>>
>> int alpha = (fg[3] & 0xff) + 1;
>> int inverseAlpha = 257 - alpha;
>>
>> If fg[3] = 0 then inverseAlpha = 256, which is out of the range
>> that can be stored in a ubyte.
>>
>> Craig
>
> If I'm right all of these lines:
>
> byte* fg = cast(byte*)&color;
> int alpha = (fg[3] & 0xff) + 1;
> int inverseAlpha = 257 - alpha;
>
> are constant, and you put it outside the both foreach using an enum;
>
> you can also pre-calculate this:
> (alpha * (fg[0] & 0xff)
>
> before foreach.
Of course I mean immutable, not enum :)
|
November 22, 2013 Re: Optimization tips for alpha blending / rasterization loop | ||||
---|---|---|---|---|
| ||||
Posted in reply to Craig Dillabaugh | Craig Dillabaugh: > Do you want to use a ubyte instead of a byte here? See: http://d.puremagic.com/issues/show_bug.cgi?id=3850 Bye, bearophile |
November 22, 2013 Re: Optimization tips for alpha blending / rasterization loop | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile | On Friday, 22 November 2013 at 10:27:12 UTC, bearophile wrote:
> Craig Dillabaugh:
>
>> Do you want to use a ubyte instead of a byte here?
>
> See:
> http://d.puremagic.com/issues/show_bug.cgi?id=3850
>
> Bye,
> bearophile
Yes it is pretty easy to mix that up. A lot of my work is with
images with single byte pixels, so I am pretty used to using
ubyte now. I can't remember if I ever used byte, and if I did I
was likely a bug.
Craig
|
November 22, 2013 Re: Optimization tips for alpha blending / rasterization loop | ||||
---|---|---|---|---|
| ||||
Posted in reply to Craig Dillabaugh | Craig Dillabaugh:
> Yes it is pretty easy to mix that up. A lot of my work is with
> images with single byte pixels, so I am pretty used to using
> ubyte now. I can't remember if I ever used byte, and if I did I
> was likely a bug.
Vote for the bug if you like it :-)
Bye,
bearophile
|
November 22, 2013 Re: Optimization tips for alpha blending / rasterization loop | ||||
---|---|---|---|---|
| ||||
Posted in reply to Craig Dillabaugh | > Do you want to use a ubyte instead of a byte here? Yes, that was a silly mistake. It seems that fixing that removed the need for all the masking operations, which had the biggest speedup. > Also, for your alpha channel: > > int alpha = (fg[3] & 0xff) + 1; > int inverseAlpha = 257 - alpha; > > If fg[3] = 0 then inverseAlpha = 256, which is out of the range > that can be stored in a ubyte. I think my logic should be correct. The calculations are done with ints, and the result is then just casted/clamped to the byte. The reason for the +1 is the >> 8, which divides by 256. class Framebuffer { uint[] framebufferData; uint framebufferWidth; uint framebufferHeight; } void drawRectangle(Framebuffer framebuffer, uint x, uint y, uint width, uint height, uint color) { immutable ubyte* fg = cast(immutable ubyte*)&color; immutable uint alpha = fg[3] + 1; immutable uint invAlpha = 257 - alpha; immutable uint afg0 = alpha * fg[0]; immutable uint afg1 = alpha * fg[1]; immutable uint afg2 = alpha * fg[2]; foreach (i; y .. y + height) { uint start = x + i * framebuffer.width; foreach(j; 0 .. width) { ubyte* bg = cast(ubyte*)(&framebuffer.data[start + j]); bg[0] = cast(ubyte)((afg0 + invAlpha * bg[0]) >> 8); bg[1] = cast(ubyte)((afg1 + invAlpha * bg[1]) >> 8); bg[2] = cast(ubyte)((afg2 + invAlpha * bg[2]) >> 8); bg[3] = 0xff; } } } Can this be made faster with SIMD? (I don't know much about it, maybe the data and algorithm doesn't fit it?) Can this be parallelized with any real gains? |
Copyright © 1999-2021 by the D Language Foundation