Optimization tips for alpha blending / rasterization loop - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » Learn » Optimization tips for alpha blending / rasterization loop

Thread overview

Optimization tips for alpha blending / rasterization loop
Nov 22, 2013 Mikko Ronkainen
Nov 22, 2013 Craig Dillabaugh
Nov 22, 2013 Andrea Fontana
Nov 22, 2013 Andrea Fontana
Nov 22, 2013 bearophile
Nov 22, 2013 Craig Dillabaugh
Nov 22, 2013 bearophile
Nov 22, 2013 Mikko Ronkainen

November 22, 2013

Optimization tips for alpha blending / rasterization loop

Posted by Mikko Ronkainen

Mikko Ronkainen

I'm trying to learn some software rasterization stuff. Here's what I'm doing:

32-bit DMD on 64-bit Windows
Framebuffer is an int[], each int is a pixel of format 0xAABBGGRR (this seems fastest to my CPU + GPU)
Framebuffer is thrown as is to OpenGL, rendered as textured quad.

Here's a simple rectangle drawing algorithm that also does alpha blending. I tried quite a many variations (for example without the byte casting, using ints and shifting instead), but none was as fast as this:

class Framebuffer
{
  int[] data;
  int width;
  int height;
}

void drawRectangle(Framebuffer framebuffer, int x, int y, int width, int height, int color)
{
  foreach (i; y .. y + height)
  {
    int start = x + i * framebuffer.width;

    foreach(j; 0 .. width)
    {
      byte* bg = cast(byte*)&framebuffer.data[start + j];
      byte* fg = cast(byte*)&color;

      int alpha = (fg[3] & 0xff) + 1;
      int inverseAlpha = 257 - alpha;

      bg[0] = cast(byte)((alpha * (fg[0] & 0xff) + inverseAlpha * (bg[0] & 0xff)) >> 8);
      bg[1] = cast(byte)((alpha * (fg[1] & 0xff) + inverseAlpha * (bg[1] & 0xff)) >> 8);
      bg[2] = cast(byte)((alpha * (fg[2] & 0xff) + inverseAlpha * (bg[2] & 0xff)) >> 8);
      bg[3] = cast(byte)0xff;
    }
  }
}

I would like to make this as fast as possible as it is done for almost every pixel every frame.

Am I doing something stupid that is slowing things down? Cache trashing, or even branch prediction errors? :)
Is this kind of algorith + data even a candidate for SIMD usage?
Even if fg is of type byte, fg[0] would return greater value than 0xff. It needs to be (fg[0] & 0xff) to make things work. I wonder why?

November 22, 2013

Re: Optimization tips for alpha blending / rasterization loop

Posted by Craig Dillabaugh
in reply to Mikko Ronkainen

Craig Dillabaugh

Posted in reply to Mikko Ronkainen

On Friday, 22 November 2013 at 02:24:56 UTC, Mikko Ronkainen
wrote:
> I'm trying to learn some software rasterization stuff. Here's what I'm doing:
>
> 32-bit DMD on 64-bit Windows
> Framebuffer is an int[], each int is a pixel of format 0xAABBGGRR (this seems fastest to my CPU + GPU)
> Framebuffer is thrown as is to OpenGL, rendered as textured quad.
>
> Here's a simple rectangle drawing algorithm that also does alpha blending. I tried quite a many variations (for example without the byte casting, using ints and shifting instead), but none was as fast as this:
>
> class Framebuffer
> {
>   int[] data;
>   int width;
>   int height;
> }
>
> void drawRectangle(Framebuffer framebuffer, int x, int y, int width, int height, int color)
> {
>   foreach (i; y .. y + height)
>   {
>     int start = x + i * framebuffer.width;
>
>     foreach(j; 0 .. width)
>     {
>       byte* bg = cast(byte*)&framebuffer.data[start + j];
>       byte* fg = cast(byte*)&color;
>
>       int alpha = (fg[3] & 0xff) + 1;
>       int inverseAlpha = 257 - alpha;
>
>       bg[0] = cast(byte)((alpha * (fg[0] & 0xff) + inverseAlpha * (bg[0] & 0xff)) >> 8);
>       bg[1] = cast(byte)((alpha * (fg[1] & 0xff) + inverseAlpha * (bg[1] & 0xff)) >> 8);
>       bg[2] = cast(byte)((alpha * (fg[2] & 0xff) + inverseAlpha * (bg[2] & 0xff)) >> 8);
>       bg[3] = cast(byte)0xff;
>     }
>   }
> }
>
> I would like to make this as fast as possible as it is done for almost every pixel every frame.
>
> Am I doing something stupid that is slowing things down? Cache trashing, or even branch prediction errors? :)
> Is this kind of algorith + data even a candidate for SIMD usage?
> Even if fg is of type byte, fg[0] would return greater value than 0xff. It needs to be (fg[0] & 0xff) to make things work. I wonder why?

Do you want to use a ubyte instead of a byte here?

Also, for your alpha channel:

int alpha = (fg[3] & 0xff) + 1;
int inverseAlpha = 257 - alpha;

If fg[3] = 0 then inverseAlpha = 256, which is out of the range
that can be stored in a ubyte.

Craig

November 22, 2013

Re: Optimization tips for alpha blending / rasterization loop

Posted by Andrea Fontana
in reply to Craig Dillabaugh

Andrea Fontana

Posted in reply to Craig Dillabaugh

On Friday, 22 November 2013 at 03:36:38 UTC, Craig Dillabaugh wrote:
> On Friday, 22 November 2013 at 02:24:56 UTC, Mikko Ronkainen
> wrote:
>> I'm trying to learn some software rasterization stuff. Here's what I'm doing:
>>
>> 32-bit DMD on 64-bit Windows
>> Framebuffer is an int[], each int is a pixel of format 0xAABBGGRR (this seems fastest to my CPU + GPU)
>> Framebuffer is thrown as is to OpenGL, rendered as textured quad.
>>
>> Here's a simple rectangle drawing algorithm that also does alpha blending. I tried quite a many variations (for example without the byte casting, using ints and shifting instead), but none was as fast as this:
>>
>> class Framebuffer
>> {
>>  int[] data;
>>  int width;
>>  int height;
>> }
>>
>> void drawRectangle(Framebuffer framebuffer, int x, int y, int width, int height, int color)
>> {
>>  foreach (i; y .. y + height)
>>  {
>>    int start = x + i * framebuffer.width;
>>
>>    foreach(j; 0 .. width)
>>    {
>>      byte* bg = cast(byte*)&framebuffer.data[start + j];
>>      byte* fg = cast(byte*)&color;
>>
>>      int alpha = (fg[3] & 0xff) + 1;
>>      int inverseAlpha = 257 - alpha;
>>
>>      bg[0] = cast(byte)((alpha * (fg[0] & 0xff) + inverseAlpha * (bg[0] & 0xff)) >> 8);
>>      bg[1] = cast(byte)((alpha * (fg[1] & 0xff) + inverseAlpha * (bg[1] & 0xff)) >> 8);
>>      bg[2] = cast(byte)((alpha * (fg[2] & 0xff) + inverseAlpha * (bg[2] & 0xff)) >> 8);
>>      bg[3] = cast(byte)0xff;
>>    }
>>  }
>> }
>>
>> I would like to make this as fast as possible as it is done for almost every pixel every frame.
>>
>> Am I doing something stupid that is slowing things down? Cache trashing, or even branch prediction errors? :)
>> Is this kind of algorith + data even a candidate for SIMD usage?
>> Even if fg is of type byte, fg[0] would return greater value than 0xff. It needs to be (fg[0] & 0xff) to make things work. I wonder why?
>
> Do you want to use a ubyte instead of a byte here?
>
> Also, for your alpha channel:
>
> int alpha = (fg[3] & 0xff) + 1;
> int inverseAlpha = 257 - alpha;
>
> If fg[3] = 0 then inverseAlpha = 256, which is out of the range
> that can be stored in a ubyte.
>
> Craig

If I'm right all of these lines:

byte* fg = cast(byte*)&color;
int alpha = (fg[3] & 0xff) + 1;
int inverseAlpha = 257 - alpha;

are constant, and you put it outside the both foreach using an enum;

you can also pre-calculate this:
(alpha * (fg[0] & 0xff)

before foreach.

November 22, 2013

Re: Optimization tips for alpha blending / rasterization loop

Posted by Andrea Fontana
in reply to Andrea Fontana

Andrea Fontana

Posted in reply to Andrea Fontana

On Friday, 22 November 2013 at 08:44:06 UTC, Andrea Fontana wrote:
> On Friday, 22 November 2013 at 03:36:38 UTC, Craig Dillabaugh wrote:
>> On Friday, 22 November 2013 at 02:24:56 UTC, Mikko Ronkainen
>> wrote:
>>> I'm trying to learn some software rasterization stuff. Here's what I'm doing:
>>>
>>> 32-bit DMD on 64-bit Windows
>>> Framebuffer is an int[], each int is a pixel of format 0xAABBGGRR (this seems fastest to my CPU + GPU)
>>> Framebuffer is thrown as is to OpenGL, rendered as textured quad.
>>>
>>> Here's a simple rectangle drawing algorithm that also does alpha blending. I tried quite a many variations (for example without the byte casting, using ints and shifting instead), but none was as fast as this:
>>>
>>> class Framebuffer
>>> {
>>> int[] data;
>>> int width;
>>> int height;
>>> }
>>>
>>> void drawRectangle(Framebuffer framebuffer, int x, int y, int width, int height, int color)
>>> {
>>> foreach (i; y .. y + height)
>>> {
>>>   int start = x + i * framebuffer.width;
>>>
>>>   foreach(j; 0 .. width)
>>>   {
>>>     byte* bg = cast(byte*)&framebuffer.data[start + j];
>>>     byte* fg = cast(byte*)&color;
>>>
>>>     int alpha = (fg[3] & 0xff) + 1;
>>>     int inverseAlpha = 257 - alpha;
>>>
>>>     bg[0] = cast(byte)((alpha * (fg[0] & 0xff) + inverseAlpha * (bg[0] & 0xff)) >> 8);
>>>     bg[1] = cast(byte)((alpha * (fg[1] & 0xff) + inverseAlpha * (bg[1] & 0xff)) >> 8);
>>>     bg[2] = cast(byte)((alpha * (fg[2] & 0xff) + inverseAlpha * (bg[2] & 0xff)) >> 8);
>>>     bg[3] = cast(byte)0xff;
>>>   }
>>> }
>>> }
>>>
>>> I would like to make this as fast as possible as it is done for almost every pixel every frame.
>>>
>>> Am I doing something stupid that is slowing things down? Cache trashing, or even branch prediction errors? :)
>>> Is this kind of algorith + data even a candidate for SIMD usage?
>>> Even if fg is of type byte, fg[0] would return greater value than 0xff. It needs to be (fg[0] & 0xff) to make things work. I wonder why?
>>
>> Do you want to use a ubyte instead of a byte here?
>>
>> Also, for your alpha channel:
>>
>> int alpha = (fg[3] & 0xff) + 1;
>> int inverseAlpha = 257 - alpha;
>>
>> If fg[3] = 0 then inverseAlpha = 256, which is out of the range
>> that can be stored in a ubyte.
>>
>> Craig
>
> If I'm right all of these lines:
>
> byte* fg = cast(byte*)&color;
> int alpha = (fg[3] & 0xff) + 1;
> int inverseAlpha = 257 - alpha;
>
> are constant, and you put it outside the both foreach using an enum;
>
> you can also pre-calculate this:
> (alpha * (fg[0] & 0xff)
>
> before foreach.

Of course I mean immutable, not enum :)

November 22, 2013

Re: Optimization tips for alpha blending / rasterization loop

Posted by bearophile
in reply to Craig Dillabaugh

bearophile

Posted in reply to Craig Dillabaugh

Craig Dillabaugh:

> Do you want to use a ubyte instead of a byte here?

See:
http://d.puremagic.com/issues/show_bug.cgi?id=3850

Bye,
bearophile

November 22, 2013

Re: Optimization tips for alpha blending / rasterization loop

Posted by Craig Dillabaugh
in reply to bearophile

Craig Dillabaugh

Posted in reply to bearophile

On Friday, 22 November 2013 at 10:27:12 UTC, bearophile wrote:
> Craig Dillabaugh:
>
>> Do you want to use a ubyte instead of a byte here?
>
> See:
> http://d.puremagic.com/issues/show_bug.cgi?id=3850
>
> Bye,
> bearophile

Yes it is pretty easy to mix that up.  A lot of my work is with
images with single byte pixels, so I am pretty used to using
ubyte now.  I can't remember if I ever used byte, and if I did I
was likely a bug.

Craig

November 22, 2013

Re: Optimization tips for alpha blending / rasterization loop

Posted by bearophile
in reply to Craig Dillabaugh

bearophile

Posted in reply to Craig Dillabaugh

Craig Dillabaugh:

> Yes it is pretty easy to mix that up.  A lot of my work is with
> images with single byte pixels, so I am pretty used to using
> ubyte now.  I can't remember if I ever used byte, and if I did I
> was likely a bug.

Vote for the bug if you like it :-)

Bye,
bearophile

November 22, 2013

Re: Optimization tips for alpha blending / rasterization loop

Posted by Mikko Ronkainen
in reply to Craig Dillabaugh

Mikko Ronkainen

Posted in reply to Craig Dillabaugh

> Do you want to use a ubyte instead of a byte here?

Yes, that was a silly mistake. It seems that fixing that removed the need for all the masking operations, which had the biggest speedup.

> Also, for your alpha channel:
>
> int alpha = (fg[3] & 0xff) + 1;
> int inverseAlpha = 257 - alpha;
>
> If fg[3] = 0 then inverseAlpha = 256, which is out of the range
> that can be stored in a ubyte.

I think my logic should be correct. The calculations are done with ints, and the result is then just casted/clamped to the byte. The reason for the +1 is the >> 8, which divides by 256.

class Framebuffer
{
  uint[] framebufferData;
  uint framebufferWidth;
  uint framebufferHeight;
}

void drawRectangle(Framebuffer framebuffer, uint x, uint y, uint width, uint height, uint color)
{
  immutable ubyte* fg = cast(immutable ubyte*)&color;
  immutable uint alpha = fg[3] + 1;
  immutable uint invAlpha = 257 - alpha;
  immutable uint afg0 = alpha * fg[0];
  immutable uint afg1 = alpha * fg[1];
  immutable uint afg2 = alpha * fg[2];

  foreach (i; y .. y + height)
  {
    uint start = x + i * framebuffer.width;

    foreach(j; 0 .. width)
    {
      ubyte* bg = cast(ubyte*)(&framebuffer.data[start + j]);

      bg[0] = cast(ubyte)((afg0 + invAlpha * bg[0]) >> 8);
      bg[1] = cast(ubyte)((afg1 + invAlpha * bg[1]) >> 8);
      bg[2] = cast(ubyte)((afg2 + invAlpha * bg[2]) >> 8);
      bg[3] = 0xff;
    }
  }
}

Can this be made faster with SIMD? (I don't know much about it, maybe the data and algorithm doesn't fit it?)

Can this be parallelized with any real gains?

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation