Thread overview
Optimization tips for alpha blending / rasterization loop
Nov 22, 2013
Mikko Ronkainen
Nov 22, 2013
Craig Dillabaugh
Nov 22, 2013
Andrea Fontana
Nov 22, 2013
Andrea Fontana
Nov 22, 2013
bearophile
Nov 22, 2013
Craig Dillabaugh
Nov 22, 2013
bearophile
Nov 22, 2013
Mikko Ronkainen
November 22, 2013
I'm trying to learn some software rasterization stuff. Here's what I'm doing:

32-bit DMD on 64-bit Windows
Framebuffer is an int[], each int is a pixel of format 0xAABBGGRR (this seems fastest to my CPU + GPU)
Framebuffer is thrown as is to OpenGL, rendered as textured quad.

Here's a simple rectangle drawing algorithm that also does alpha blending. I tried quite a many variations (for example without the byte casting, using ints and shifting instead), but none was as fast as this:

class Framebuffer
{
  int[] data;
  int width;
  int height;
}

void drawRectangle(Framebuffer framebuffer, int x, int y, int width, int height, int color)
{
  foreach (i; y .. y + height)
  {
    int start = x + i * framebuffer.width;

    foreach(j; 0 .. width)
    {
      byte* bg = cast(byte*)&framebuffer.data[start + j];
      byte* fg = cast(byte*)&color;

      int alpha = (fg[3] & 0xff) + 1;
      int inverseAlpha = 257 - alpha;

      bg[0] = cast(byte)((alpha * (fg[0] & 0xff) + inverseAlpha * (bg[0] & 0xff)) >> 8);
      bg[1] = cast(byte)((alpha * (fg[1] & 0xff) + inverseAlpha * (bg[1] & 0xff)) >> 8);
      bg[2] = cast(byte)((alpha * (fg[2] & 0xff) + inverseAlpha * (bg[2] & 0xff)) >> 8);
      bg[3] = cast(byte)0xff;
    }
  }
}

I would like to make this as fast as possible as it is done for almost every pixel every frame.

Am I doing something stupid that is slowing things down? Cache trashing, or even branch prediction errors? :)
Is this kind of algorith + data even a candidate for SIMD usage?
Even if fg is of type byte, fg[0] would return greater value than 0xff. It needs to be (fg[0] & 0xff) to make things work. I wonder why?
November 22, 2013
On Friday, 22 November 2013 at 02:24:56 UTC, Mikko Ronkainen
wrote:
> I'm trying to learn some software rasterization stuff. Here's what I'm doing:
>
> 32-bit DMD on 64-bit Windows
> Framebuffer is an int[], each int is a pixel of format 0xAABBGGRR (this seems fastest to my CPU + GPU)
> Framebuffer is thrown as is to OpenGL, rendered as textured quad.
>
> Here's a simple rectangle drawing algorithm that also does alpha blending. I tried quite a many variations (for example without the byte casting, using ints and shifting instead), but none was as fast as this:
>
> class Framebuffer
> {
>   int[] data;
>   int width;
>   int height;
> }
>
> void drawRectangle(Framebuffer framebuffer, int x, int y, int width, int height, int color)
> {
>   foreach (i; y .. y + height)
>   {
>     int start = x + i * framebuffer.width;
>
>     foreach(j; 0 .. width)
>     {
>       byte* bg = cast(byte*)&framebuffer.data[start + j];
>       byte* fg = cast(byte*)&color;
>
>       int alpha = (fg[3] & 0xff) + 1;
>       int inverseAlpha = 257 - alpha;
>
>       bg[0] = cast(byte)((alpha * (fg[0] & 0xff) + inverseAlpha * (bg[0] & 0xff)) >> 8);
>       bg[1] = cast(byte)((alpha * (fg[1] & 0xff) + inverseAlpha * (bg[1] & 0xff)) >> 8);
>       bg[2] = cast(byte)((alpha * (fg[2] & 0xff) + inverseAlpha * (bg[2] & 0xff)) >> 8);
>       bg[3] = cast(byte)0xff;
>     }
>   }
> }
>
> I would like to make this as fast as possible as it is done for almost every pixel every frame.
>
> Am I doing something stupid that is slowing things down? Cache trashing, or even branch prediction errors? :)
> Is this kind of algorith + data even a candidate for SIMD usage?
> Even if fg is of type byte, fg[0] would return greater value than 0xff. It needs to be (fg[0] & 0xff) to make things work. I wonder why?

Do you want to use a ubyte instead of a byte here?

Also, for your alpha channel:

int alpha = (fg[3] & 0xff) + 1;
int inverseAlpha = 257 - alpha;

If fg[3] = 0 then inverseAlpha = 256, which is out of the range
that can be stored in a ubyte.

Craig
November 22, 2013
On Friday, 22 November 2013 at 03:36:38 UTC, Craig Dillabaugh wrote:
> On Friday, 22 November 2013 at 02:24:56 UTC, Mikko Ronkainen
> wrote:
>> I'm trying to learn some software rasterization stuff. Here's what I'm doing:
>>
>> 32-bit DMD on 64-bit Windows
>> Framebuffer is an int[], each int is a pixel of format 0xAABBGGRR (this seems fastest to my CPU + GPU)
>> Framebuffer is thrown as is to OpenGL, rendered as textured quad.
>>
>> Here's a simple rectangle drawing algorithm that also does alpha blending. I tried quite a many variations (for example without the byte casting, using ints and shifting instead), but none was as fast as this:
>>
>> class Framebuffer
>> {
>>  int[] data;
>>  int width;
>>  int height;
>> }
>>
>> void drawRectangle(Framebuffer framebuffer, int x, int y, int width, int height, int color)
>> {
>>  foreach (i; y .. y + height)
>>  {
>>    int start = x + i * framebuffer.width;
>>
>>    foreach(j; 0 .. width)
>>    {
>>      byte* bg = cast(byte*)&framebuffer.data[start + j];
>>      byte* fg = cast(byte*)&color;
>>
>>      int alpha = (fg[3] & 0xff) + 1;
>>      int inverseAlpha = 257 - alpha;
>>
>>      bg[0] = cast(byte)((alpha * (fg[0] & 0xff) + inverseAlpha * (bg[0] & 0xff)) >> 8);
>>      bg[1] = cast(byte)((alpha * (fg[1] & 0xff) + inverseAlpha * (bg[1] & 0xff)) >> 8);
>>      bg[2] = cast(byte)((alpha * (fg[2] & 0xff) + inverseAlpha * (bg[2] & 0xff)) >> 8);
>>      bg[3] = cast(byte)0xff;
>>    }
>>  }
>> }
>>
>> I would like to make this as fast as possible as it is done for almost every pixel every frame.
>>
>> Am I doing something stupid that is slowing things down? Cache trashing, or even branch prediction errors? :)
>> Is this kind of algorith + data even a candidate for SIMD usage?
>> Even if fg is of type byte, fg[0] would return greater value than 0xff. It needs to be (fg[0] & 0xff) to make things work. I wonder why?
>
> Do you want to use a ubyte instead of a byte here?
>
> Also, for your alpha channel:
>
> int alpha = (fg[3] & 0xff) + 1;
> int inverseAlpha = 257 - alpha;
>
> If fg[3] = 0 then inverseAlpha = 256, which is out of the range
> that can be stored in a ubyte.
>
> Craig

If I'm right all of these lines:

byte* fg = cast(byte*)&color;
int alpha = (fg[3] & 0xff) + 1;
int inverseAlpha = 257 - alpha;

are constant, and you put it outside the both foreach using an enum;

you can also pre-calculate this:
(alpha * (fg[0] & 0xff)

before foreach.


November 22, 2013
On Friday, 22 November 2013 at 08:44:06 UTC, Andrea Fontana wrote:
> On Friday, 22 November 2013 at 03:36:38 UTC, Craig Dillabaugh wrote:
>> On Friday, 22 November 2013 at 02:24:56 UTC, Mikko Ronkainen
>> wrote:
>>> I'm trying to learn some software rasterization stuff. Here's what I'm doing:
>>>
>>> 32-bit DMD on 64-bit Windows
>>> Framebuffer is an int[], each int is a pixel of format 0xAABBGGRR (this seems fastest to my CPU + GPU)
>>> Framebuffer is thrown as is to OpenGL, rendered as textured quad.
>>>
>>> Here's a simple rectangle drawing algorithm that also does alpha blending. I tried quite a many variations (for example without the byte casting, using ints and shifting instead), but none was as fast as this:
>>>
>>> class Framebuffer
>>> {
>>> int[] data;
>>> int width;
>>> int height;
>>> }
>>>
>>> void drawRectangle(Framebuffer framebuffer, int x, int y, int width, int height, int color)
>>> {
>>> foreach (i; y .. y + height)
>>> {
>>>   int start = x + i * framebuffer.width;
>>>
>>>   foreach(j; 0 .. width)
>>>   {
>>>     byte* bg = cast(byte*)&framebuffer.data[start + j];
>>>     byte* fg = cast(byte*)&color;
>>>
>>>     int alpha = (fg[3] & 0xff) + 1;
>>>     int inverseAlpha = 257 - alpha;
>>>
>>>     bg[0] = cast(byte)((alpha * (fg[0] & 0xff) + inverseAlpha * (bg[0] & 0xff)) >> 8);
>>>     bg[1] = cast(byte)((alpha * (fg[1] & 0xff) + inverseAlpha * (bg[1] & 0xff)) >> 8);
>>>     bg[2] = cast(byte)((alpha * (fg[2] & 0xff) + inverseAlpha * (bg[2] & 0xff)) >> 8);
>>>     bg[3] = cast(byte)0xff;
>>>   }
>>> }
>>> }
>>>
>>> I would like to make this as fast as possible as it is done for almost every pixel every frame.
>>>
>>> Am I doing something stupid that is slowing things down? Cache trashing, or even branch prediction errors? :)
>>> Is this kind of algorith + data even a candidate for SIMD usage?
>>> Even if fg is of type byte, fg[0] would return greater value than 0xff. It needs to be (fg[0] & 0xff) to make things work. I wonder why?
>>
>> Do you want to use a ubyte instead of a byte here?
>>
>> Also, for your alpha channel:
>>
>> int alpha = (fg[3] & 0xff) + 1;
>> int inverseAlpha = 257 - alpha;
>>
>> If fg[3] = 0 then inverseAlpha = 256, which is out of the range
>> that can be stored in a ubyte.
>>
>> Craig
>
> If I'm right all of these lines:
>
> byte* fg = cast(byte*)&color;
> int alpha = (fg[3] & 0xff) + 1;
> int inverseAlpha = 257 - alpha;
>
> are constant, and you put it outside the both foreach using an enum;
>
> you can also pre-calculate this:
> (alpha * (fg[0] & 0xff)
>
> before foreach.

Of course I mean immutable, not enum :)
November 22, 2013
Craig Dillabaugh:

> Do you want to use a ubyte instead of a byte here?

See:
http://d.puremagic.com/issues/show_bug.cgi?id=3850

Bye,
bearophile
November 22, 2013
On Friday, 22 November 2013 at 10:27:12 UTC, bearophile wrote:
> Craig Dillabaugh:
>
>> Do you want to use a ubyte instead of a byte here?
>
> See:
> http://d.puremagic.com/issues/show_bug.cgi?id=3850
>
> Bye,
> bearophile

Yes it is pretty easy to mix that up.  A lot of my work is with
images with single byte pixels, so I am pretty used to using
ubyte now.  I can't remember if I ever used byte, and if I did I
was likely a bug.

Craig
November 22, 2013
Craig Dillabaugh:

> Yes it is pretty easy to mix that up.  A lot of my work is with
> images with single byte pixels, so I am pretty used to using
> ubyte now.  I can't remember if I ever used byte, and if I did I
> was likely a bug.

Vote for the bug if you like it :-)

Bye,
bearophile
November 22, 2013
> Do you want to use a ubyte instead of a byte here?

Yes, that was a silly mistake. It seems that fixing that removed the need for all the masking operations, which had the biggest speedup.

> Also, for your alpha channel:
>
> int alpha = (fg[3] & 0xff) + 1;
> int inverseAlpha = 257 - alpha;
>
> If fg[3] = 0 then inverseAlpha = 256, which is out of the range
> that can be stored in a ubyte.

I think my logic should be correct. The calculations are done with ints, and the result is then just casted/clamped to the byte. The reason for the +1 is the >> 8, which divides by 256.

class Framebuffer
{
  uint[] framebufferData;
  uint framebufferWidth;
  uint framebufferHeight;
}

void drawRectangle(Framebuffer framebuffer, uint x, uint y, uint width, uint height, uint color)
{
  immutable ubyte* fg = cast(immutable ubyte*)&color;
  immutable uint alpha = fg[3] + 1;
  immutable uint invAlpha = 257 - alpha;
  immutable uint afg0 = alpha * fg[0];
  immutable uint afg1 = alpha * fg[1];
  immutable uint afg2 = alpha * fg[2];

  foreach (i; y .. y + height)
  {
    uint start = x + i * framebuffer.width;

    foreach(j; 0 .. width)
    {
      ubyte* bg = cast(ubyte*)(&framebuffer.data[start + j]);

      bg[0] = cast(ubyte)((afg0 + invAlpha * bg[0]) >> 8);
      bg[1] = cast(ubyte)((afg1 + invAlpha * bg[1]) >> 8);
      bg[2] = cast(ubyte)((afg2 + invAlpha * bg[2]) >> 8);
      bg[3] = 0xff;
    }
  }
}

Can this be made faster with SIMD? (I don't know much about it, maybe the data and algorithm doesn't fit it?)

Can this be parallelized with any real gains?