Align a variable on the stack.
November 03, 2015
Is there a built in way to do this in dmd?

Basically I want to do this:

auto decode(T)(...)
{
   while(...)
   {
      T t = T.init; //I want this aligned to 64 bytes.
   }
}


Currently I am using:

align(64) struct Aligner(T)
{
   T value;
}

auto decode(T)(...)
{
   Aligner!T t = void;
   while(...)
   {
      t.value = T.init;
   }
}

But is there a less hacky way? From the documentation of align it seems I cannot use it for this kind of thing. Also, I don't want to put align(64) on my T struct type, since for my use case I am decoding arrays of T.

The reason I want to do this in the first place is that if the variable is aligned I get about a 2.5x speedup (I don't really know why; I found it by accident).
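
To check whether the wrapper really yields a 64-byte-aligned stack address, a small test along these lines might help. This is only a sketch: Pos stands in for the real T, and isAligned is a hypothetical helper, not something from Phobos.

```d
import std.stdio : writeln;

// Stand-in for the real T (an assumption made for this sketch).
struct Pos { float x, y, z, w; }

align(64) struct Aligner(T)
{
    T value;
}

// Hypothetical helper: true if the address p is a multiple of alignment.
bool isAligned(const(void)* p, size_t alignment)
{
    return (cast(size_t) p) % alignment == 0;
}

void main()
{
    Aligner!Pos t = void;
    t.value = Pos.init;

    // Whether this prints true depends on the compiler and target
    // honoring the requested alignment for stack variables.
    writeln("wrapper aligned to 64: ", isAligned(&t, 64));
    writeln("Aligner!Pos.alignof:   ", Aligner!Pos.alignof);
}
```

If the first line prints false, the wrapper is not doing what it appears to do on that compiler and target, and any speedup has to come from somewhere else.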






November 04, 2015
On Tuesday, 3 November 2015 at 23:29:45 UTC, TheFlyingFiddle wrote:
> Is there a built in way to do this in dmd?
>
> Basically I want to do this:
>
> auto decode(T)(...)
> {
>    while(...)
>    {
>       T t = T.init; //I want this aligned to 64 bytes.
>    }
> }
>
>
> Currently I am using:
>
> align(64) struct Aligner(T)
> {
>    T value;
> }
>
> auto decode(T)(...)
> {
>    Aligner!T t = void;
>    while(...)
>    {
>       t.value = T.init;
>    }
> }
>
> But is there a less hacky way? From the documentation of align it seems I cannot use it for this kind of thing. Also, I don't want to put align(64) on my T struct type, since for my use case I am decoding arrays of T.
>
> The reason I want to do this in the first place is that if the variable is aligned I get about a 2.5x speedup (I don't really know why; I found it by accident).

Note that there are two different alignments:
         to control padding between instances on the stack (arrays)
         to control padding between members of a struct

align(64) //arrays
struct foo
{
      align(16) short baz; //between members
      align (1) float quux;
}

Your 2.5x speedup is due to aligned vs. unaligned loads and stores, which has a really big effect for SIMD-type code. Basically, misaligned accesses are really slow. IIRC there was a blog post or paper about someone on a microcontroller spending a vast amount of time in ONE misaligned integer assignment, which caused traps and got the kernel involved. It's not quite as bad on x86, but still worth doing.

As to a less hacky solution, I'm not sure there is one.
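
To see the two kinds of alignment separately, the compiler can be asked for the numbers directly. A small sketch (the name Foo is just for illustration, and the printed values may depend on the compiler version and target):

```d
import std.stdio : writeln;

align(64) // requested alignment of the struct as a whole
struct Foo
{
    align(16) short baz;  // requested alignment of this member
    align(1)  float quux; // needs no padding before it
}

void main()
{
    // Alignment and size of whole instances (what matters for arrays).
    writeln("Foo.alignof       = ", Foo.alignof);
    writeln("Foo.sizeof        = ", Foo.sizeof);

    // Placement of the members inside one instance.
    writeln("Foo.baz.offsetof  = ", Foo.baz.offsetof);
    writeln("Foo.quux.offsetof = ", Foo.quux.offsetof);
}
```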


November 05, 2015
On Wednesday, 4 November 2015 at 01:14:31 UTC, Nicholas Wilson wrote:
> Note that there are two different alignments:
>          to control padding between instances on the stack (arrays)
>          to control padding between members of a struct
>
> align(64) //arrays
> struct foo
> {
>       align(16) short baz; //between members
>       align (1) float quux;
> }
>
> Your 2.5x speedup is due to aligned vs. unaligned loads and stores, which has a really big effect for SIMD-type code. Basically, misaligned accesses are really slow. IIRC there was a blog post or paper about someone on a microcontroller spending a vast amount of time in ONE misaligned integer assignment, which caused traps and got the kernel involved. It's not quite as bad on x86, but still worth doing.
>
> As to a less hacky solution, I'm not sure there is one.

Thanks for the reply. I did some more checking and found that it was not really an alignment problem; it was caused by using the default init value of my type.

My starting type:
align(64) struct Phys
{
   float x, y, z, w;
   //More stuff.
} //Was 64 bytes in size at the time.

The above worked fine; it was fast and all. But after a while I wanted the data in a different format, so I started decoding positions and other variables into separate arrays.

Something like this:
align(16) struct Pos { float x, y, z, w; }

Counter to my limited knowledge of how CPUs work, this was much slower. Doing the same thing lots of times, touching less memory with fewer branches, should in theory at least be faster, right? So after I ruled out bottlenecks in the parser I assumed there was an alignment problem and did my Aligner hack. This caused the code to run faster, so I assumed alignment was the cause... Naive! (There was a typo in the code I originally submitted: I actually used a = Align!(T).init and not a.value = T.init.)

The slowdown was actually caused by the line t = T.init, whether it was aligned or not. I solved the problem by changing the struct to look like this:
align(16) struct Pos
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

Basically T.init gets explicit values. But... this should be the same Pos.init as the default Pos.init, so I really fail to understand how this could fix the problem. I guessed that the compiler generates slightly different code if I do it this way, and that this slightly different code avoids some bottleneck in the CPU. But when I looked at the assembly of the function I could not find any difference in the generated code...

I don't really know where to go from here to figure out the underlying cause. Does anyone have any suggestions?






November 05, 2015
On Thursday, 5 November 2015 at 03:52:47 UTC, TheFlyingFiddle wrote:
> I don't really know where to go from here to figure out the underlying cause. Does anyone have any suggestions?

Can you publish two compilable and runnable versions of the code that exhibit the difference? Then we can have a look at the generated assembly. If there's really different code being generated depending on whether the .init value is explicitly set to float.nan or not, then this suggests there is a bug in DMD.
November 05, 2015
On Thursday, 5 November 2015 at 03:52:47 UTC, TheFlyingFiddle wrote:
> [...]
> I solved the problem by changing the struct to look like this.
> align(16) struct Pos
> {
>     float x = float.nan;
>     float y = float.nan;
>     float z = float.nan;
>     float w = float.nan;
> }
>

Wow, that's quite strange. FP members should be initialized to NaN even without an initializer! E.g. you should get the same with

align(16) struct Pos
{
     float x, y, z, w;
}



November 05, 2015
On Thursday, 5 November 2015 at 11:14:50 UTC, Marc Schütz wrote:
> On Thursday, 5 November 2015 at 03:52:47 UTC, TheFlyingFiddle wrote:
> Can you publish two compilable and runnable versions of the code that exhibit the difference? Then we can have a look at the generated assembly. If there's really different code being generated depending on whether the .init value is explicitly set to float.nan or not, then this suggests there is a bug in DMD.

I created a simple example here:

struct A { float x, y, z, w; }
struct B
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

void initVal(T)(ref T t, ref float k)
{
    pragma(inline, false);
    t.x = k;
    t.y = k * 2;
    t.z = k / 2;
    t.w = k^^3;
}


__gshared A[] a;
void benchA()
{
    A val;
    foreach(float f; 0 .. 1000_000)
    {
        val = A.init;
        initVal(val, f);
        a ~= val;
    }
}

__gshared B[] b;
void benchB()
{
    B val;
    foreach(float f; 0 .. 1000_000)
    {
        val = B.init;
        initVal(val, f);
        b ~= val;
    }
}


int main(string[] argv)
{
    import std.datetime;
    import std.stdio;

    auto res = benchmark!(benchA, benchB)(1);
    writeln("Default: ", res[0]);
    writeln("Explicit: ", res[1]);

    return 0;
}

output:

Default:  TickDuration(1637842)
Explicit: TickDuration(167088)

~10x slowdown...


November 05, 2015
On Thursday, 5 November 2015 at 21:22:18 UTC, TheFlyingFiddle wrote:
> On Thursday, 5 November 2015 at 11:14:50 UTC, Marc Schütz wrote:
> ~10x slowdown...

I forgot to mention this, but I am using DMD 2.069.0-rc2 for x86 Windows.


November 05, 2015
On Thursday, 5 November 2015 at 21:24:03 UTC, TheFlyingFiddle wrote:
> On Thursday, 5 November 2015 at 21:22:18 UTC, TheFlyingFiddle wrote:
>> On Thursday, 5 November 2015 at 11:14:50 UTC, Marc Schütz wrote:
>> ~10x slowdown...
>
> I forgot to mention this, but I am using DMD 2.069.0-rc2 for x86 Windows.

I reduced it further:

struct A { float x, y, z, w; }
struct B
{
   float x=float.nan;
   float y=float.nan;
   float z=float.nan;
   float w=float.nan;
}

void initVal(T)(ref T t, ref float k) { pragma(inline, false); }

void benchA()
{
   foreach(float f; 0 .. 1000_000)
   {
      A val = A.init;
      initVal(val, f);
   }
}

void benchB()
{
   foreach(float f; 0 .. 1000_000)
   {
      B val = B.init;
      initVal(val, f);
   }
}

int main(string[] argv)
{
   import std.datetime;
   import std.stdio;

   auto res = benchmark!(benchA, benchB)(1);
   writeln("Default:  ", res[0]);
   writeln("Explicit: ", res[1]);

   readln;
   return 0;
}

Also, I am compiling with dmd -release -boundscheck=off -inline

The pragma(inline, false) is there to prevent it from removing the assignment in the loop.
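
For anyone trying this on a newer compiler: as far as I know, benchmark later moved to std.datetime.stopwatch and returns Duration values, so the same test would look roughly like the sketch below. The templated bench function and the run count of 10 are my own simplifications, not part of the code above.

```d
import std.datetime.stopwatch : benchmark;
import std.stdio : writeln;

struct A { float x, y, z, w; }
struct B
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

// Kept out of line so the assignment in the loop is not optimized away.
void initVal(T)(ref T t, ref float k) { pragma(inline, false); }

// One templated loop instead of separate benchA/benchB (a simplification).
void bench(T)()
{
    foreach (i; 0 .. 1_000_000)
    {
        float f = i;
        T val = T.init;
        initVal(val, f);
    }
}

void main()
{
    // 10 runs per function is an arbitrary choice for this sketch.
    auto res = benchmark!(bench!A, bench!B)(10);
    writeln("Default:  ", res[0]);
    writeln("Explicit: ", res[1]);
}
```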
November 06, 2015
On Thursday, 5 November 2015 at 23:37:45 UTC, TheFlyingFiddle wrote:
> On Thursday, 5 November 2015 at 21:24:03 UTC, TheFlyingFiddle wrote:
>> [...]
>
> I reduced it further:
>
> [...]

These run at the exact same speed for me and, from a quick glance, produce identical assembly output.
dmd 2.069, -O -release -inline
November 06, 2015
On Friday, 6 November 2015 at 00:43:49 UTC, rsw0x wrote:
> On Thursday, 5 November 2015 at 23:37:45 UTC, TheFlyingFiddle wrote:
>> On Thursday, 5 November 2015 at 21:24:03 UTC, TheFlyingFiddle wrote:
>>> [...]
>>
>> I reduced it further:
>>
>> [...]
>
> These run at the exact same speed for me and, from a quick glance, produce identical assembly output.
> dmd 2.069, -O -release -inline

Are you running on Windows?

I tested on Windows x64, and there I also get the exact same speed for both functions.