Thread overview
Re: DMD 1.034 and 2.018 releases
Aug 11, 2008  Pete
Aug 11, 2008  Walter Bright
Aug 13, 2008  Georg Lukas
Aug 13, 2008  Don
Aug 14, 2008  Dave
Aug 14, 2008  JAnderson
August 11, 2008
Walter Bright Wrote:

> This one has (finally) got array operations implemented. For those who want to show off their leet assembler skills, the initial assembler implementation code is in phobos/internal/array*.d. Burton Radons wrote the assembler. Can you make it faster?
> 
> http://www.digitalmars.com/d/1.0/changelog.html http://ftp.digitalmars.com/dmd.1.034.zip
> 
> http://www.digitalmars.com/d/2.0/changelog.html http://ftp.digitalmars.com/dmd.2.018.zip

Not sure if someone else has already mentioned this, but would it be possible for the compiler to align these arrays on 16-byte boundaries in order to maximise any possible vector efficiency? AFAIK you can't actually specify anything higher than align 8 at the moment, which is a bit of a problem.

Regards,
August 11, 2008
Pete wrote:
> Not sure if someone else has already mentioned this but would it be
> possible for the compiler to align these arrays on 16 byte boundaries
> in order to maximise any possible vector efficiency. AFAIK you can't
> actually specify align anything higher than align 8 at the moment
> which is a bit of a problem.

Anything allocated with new will be aligned on 16 byte boundaries.
August 13, 2008
On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:
> Walter Bright Wrote:
>> This one has (finally) got array operations implemented. For those who want to show off their leet assembler skills, the initial assembler implementation code is in phobos/internal/array*.d. Burton Radons wrote the assembler. Can you make it faster?
> 
> Not sure if someone else has already mentioned this but would it be possible for the compiler to align these arrays on 16 byte boundaries in order to maximise any possible vector efficiency. AFAIK you can't actually specify align anything higher than align 8 at the moment which is a bit of a problem.

From a short look at the array*.d source code, it would be better to check if source and destination have the same alignment, e.g.:

a = 0xf00d0013 (3 mod 16)
b = 0xdeaffff3 (3 mod 16)

In that case, the first 16-3 = 13 bytes can be handled using regular D code, and the aligned SSE version can be used for the rest.

This would also work for slices, at least when both slices have the same alignment remainder. I'm just not sure what overhead such a solution would impose for small arrays.
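The arithmetic in this example can be checked with a few lines of D. This is only a sketch: the helper names rem16, headBytes and sameAlignment are made up for illustration and do not come from array*.d.

```d
import std.stdio : writefln;

/// Remainder of an address modulo 16 (hypothetical helper name).
size_t rem16(size_t addr) pure nothrow @nogc @safe
{
    return addr & 15;
}

/// Bytes to handle with plain D code before addr reaches the next
/// 16-byte boundary (0 if it is already aligned).
size_t headBytes(size_t addr) pure nothrow @nogc @safe
{
    return (16 - rem16(addr)) & 15;
}

/// True when both addresses have the same remainder, so a single
/// scalar prefix aligns both of them at once.
bool sameAlignment(size_t a, size_t b) pure nothrow @nogc @safe
{
    return rem16(a) == rem16(b);
}

void main()
{
    // Addresses taken from the example above.
    size_t a = 0xf00d0013;
    size_t b = 0xdeaffff3;
    writefln("a mod 16 = %s, b mod 16 = %s", rem16(a), rem16(b));
    writefln("same alignment: %s, scalar head: %s bytes",
             sameAlignment(a, b), headBytes(a));
}
```

For the two addresses above this reports a remainder of 3 for both and a 13-byte scalar head.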

Georg
-- 
|| http://op-co.de ++  GCS/CM d? s: a-- C+++ UL+++ !P L+++ E--- W++  ++
|| gpg: 0x962FD2DE ||  N++ o? K- w---() O M V? PS+ PE-- Y+ PGP++ t*  ||
|| Ge0rG: euIRCnet ||  5 X+ R tv b+(+++) DI+(+++) D+ G e* h! r* !y+  ||
++ IRCnet OFTC OPN ||________________________________________________||
August 13, 2008
Georg Lukas wrote:
> On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:
>> Walter Bright Wrote:
>>> This one has (finally) got array operations implemented. For those who
>>> want to show off their leet assembler skills, the initial assembler
>>> implementation code is in phobos/internal/array*.d. Burton Radons wrote
>>> the assembler. Can you make it faster?
>> Not sure if someone else has already mentioned this but would it be
>> possible for the compiler to align these arrays on 16 byte boundaries in
>> order to maximise any possible vector efficiency. AFAIK you can't
>> actually specify align anything higher than align 8 at the moment which
>> is a bit of a problem.
> 
> From a short look at the array*.d source code, it would be better to check if source and destination have the same alignment, i.e.:
> 
> a = 0xf00d0013 (3 mod 16)
> b = 0xdeaffff3 (3 mod 16)
> 
> In that case, the first 16-3 = 13 bytes can be handled using regular D code, and the aligned SSE version can be used for the rest.
> 
> This would also work for slices, at least when both slices have the same alignment remainder. I'm just not sure what overhead such a solution would impose for small arrays.

Just begin with a check for minimal size. If less than that size, don't use SSE at all.
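A minimal sketch of such a size check, in plain D. The cutoff minSimdLength is an assumed value that would have to come from benchmarking, the Path enum and addInPlace are hypothetical names, and a D array operation stands in for the real SSE routine.

```d
import std.stdio : writeln;

/// Assumed cutoff; the right value would have to be measured.
enum size_t minSimdLength = 32;

/// Which code path handled the call (for illustration only).
enum Path { scalar, simd }

/// Adds b to a element-wise and reports which path was taken.
Path addInPlace(int[] a, const(int)[] b)
{
    assert(a.length == b.length);

    if (a.length < minSimdLength)
    {
        // Small array: skip the SSE setup entirely.
        foreach (i; 0 .. a.length)
            a[i] += b[i];
        return Path.scalar;
    }

    // Stand-in for the SSE routine: a D array operation.
    a[] += b[];
    return Path.simd;
}

void main()
{
    auto small = new int[4];
    auto large = new int[1024];
    writeln(addInPlace(small, new int[4]));
    writeln(addInPlace(large, new int[1024]));
}
```

With the assumed cutoff of 32, the first call should take the scalar path and the second the SIMD stand-in.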

August 14, 2008
"Don" <nospam@nospam.com.au> wrote in message news:g7u36h$20j0$1@digitalmars.com...
> Georg Lukas wrote:
>> On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:
>>> Walter Bright Wrote:
>>>> This one has (finally) got array operations implemented. For those who
>>>> want to show off their leet assembler skills, the initial assembler
>>>> implementation code is in phobos/internal/array*.d. Burton Radons wrote
>>>> the assembler. Can you make it faster?
>>> Not sure if someone else has already mentioned this but would it be
>>> possible for the compiler to align these arrays on 16 byte boundaries in
>>> order to maximise any possible vector efficiency. AFAIK you can't
>>> actually specify align anything higher than align 8 at the moment which
>>> is a bit of a problem.
>>
>> From a short look at the array*.d source code, it would be better to check if source and destination have the same alignment, i.e.:
>>
>> a = 0xf00d0013 (3 mod 16)
>> b = 0xdeaffff3 (3 mod 16)
>>
>> In that case, the first 16-3 = 13 bytes can be handled using regular D code, and the aligned SSE version can be used for the rest.

Good idea. Right now in that code there is (usually) a separate case for each of aligned and unaligned.

It typically goes like this:

if (cpu_has_sse2 && a.length > min_size)
{
    if (((cast(size_t) aptr | cast(size_t) bptr | cast(size_t) cptr) & 15) != 0)
    {   // Unaligned case: at least one pointer is not on a 16-byte boundary
        asm
        {
            ...
            movdqu  XMM0, [EAX];
            ...
        }
    }
    else
    {   // Aligned case: all three pointers are 16-byte aligned
        asm
        {
            ...
            movdqa  XMM0, [EAX];
            ...
        }
    }
}

The two blocks of asm code are basically identical except for the un/aligned SSE opcodes.

With your idea, one could get rid of the test for alignment, probably some bloat and a whole lot of duplication. I guess the question would be whether the overhead of your idea would be less than that of the current design.
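For illustration, here is roughly how the head/body/tail split could look in plain D for byte arrays. It is a sketch under assumptions, not the code in array*.d: addBytes and headBytes are hypothetical names, the aligned body is an ordinary loop standing in for the movdqa block, and when the remainders differ it simply falls back to scalar code where the movdqu version would otherwise go.

```d
import std.stdio : writeln;

/// Number of leading bytes before p reaches a 16-byte boundary.
size_t headBytes(const(void)* p) pure nothrow @nogc
{
    return (16 - (cast(size_t) p & 15)) & 15;
}

/// c[i] = a[i] + b[i] for byte arrays, split into head, body and tail.
/// Returns the number of bytes handled by the aligned body.
size_t addBytes(ubyte[] c, const(ubyte)[] a, const(ubyte)[] b)
{
    assert(a.length == c.length && b.length == c.length);
    immutable n = c.length;

    // The split only works if all three pointers share one remainder.
    immutable ra = cast(size_t) a.ptr & 15;
    immutable same = ra == (cast(size_t) b.ptr & 15)
                  && ra == (cast(size_t) c.ptr & 15);

    // Remainders differ: treat everything as "head" (scalar fallback).
    size_t head = same ? headBytes(c.ptr) : n;
    if (head > n)
        head = n;
    immutable bodyLen = same ? (n - head) & ~cast(size_t) 15 : 0;

    // Head: plain D code up to the first 16-byte boundary.
    foreach (i; 0 .. head)
        c[i] = cast(ubyte)(a[i] + b[i]);

    // Body: a multiple of 16 bytes with all pointers aligned.
    // A real implementation would use movdqa here.
    foreach (i; head .. head + bodyLen)
        c[i] = cast(ubyte)(a[i] + b[i]);

    // Tail: what is left after the last full 16-byte block.
    foreach (i; head + bodyLen .. n)
        c[i] = cast(ubyte)(a[i] + b[i]);

    return bodyLen;
}

void main()
{
    auto buf = new ubyte[256];
    foreach (i, ref x; buf)
        x = cast(ubyte) i;

    // The offsets differ by multiples of 16, so the remainders match
    // whatever the base address of buf turns out to be.
    auto a = buf[3 .. 43];
    auto b = buf[67 .. 107];
    auto c = buf[131 .. 171];
    writeln("bytes in aligned body: ", addBytes(c, a, b));
    writeln("c[0] = ", c[0]); // 3 + 67 = 70
}
```

The extra cost compared with the current design is the remainder comparison plus two short scalar loops, which is the overhead that would need measuring.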

- Dave


August 14, 2008
Georg Lukas wrote:
> On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:
>> Walter Bright Wrote:
>>> This one has (finally) got array operations implemented. For those who
>>> want to show off their leet assembler skills, the initial assembler
>>> implementation code is in phobos/internal/array*.d. Burton Radons wrote
>>> the assembler. Can you make it faster?
>> Not sure if someone else has already mentioned this but would it be
>> possible for the compiler to align these arrays on 16 byte boundaries in
>> order to maximise any possible vector efficiency. AFAIK you can't
>> actually specify align anything higher than align 8 at the moment which
>> is a bit of a problem.
> 
> From a short look at the array*.d source code, it would be better to check if source and destination have the same alignment, i.e.:
> 
> a = 0xf00d0013 (3 mod 16)
> b = 0xdeaffff3 (3 mod 16)
> 
> In that case, the first 16-3 = 13 bytes can be handled using regular D code, and the aligned SSE version can be used for the rest.
> 
> This would also work for slices, at least when both slices have the same alignment remainder. I'm just not sure what overhead such a solution would impose for small arrays.

There would be some overhead for small arrays. However, as I said in my previous email, if you're using a small array then it's likely that you're not doing much. If it is a performance issue you should switch to a larger array (by grouping all your smaller ones together). Of course there's the edge case where someone actually needs to do a g-billion operations on exactly the same small array.
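A small sketch of what grouping could look like: many logical 4-element arrays stored in one contiguous block, so that a single array operation covers all of them. The sizes and the helper name view are assumptions made up for the example.

```d
import std.stdio : writeln;

/// Returns the i-th small array as a slice (a view, not a copy)
/// into the shared storage block.
float[] view(float[] storage, size_t i, size_t width)
{
    return storage[i * width .. (i + 1) * width];
}

void main()
{
    // Hypothetical setup: 1000 logical arrays of 4 floats each.
    enum size_t count = 1000, width = 4;

    // One contiguous block instead of 1000 separate small arrays.
    auto storage = new float[count * width];
    storage[] = 1.0f;

    // One array operation over the whole block replaces 1000 small ones.
    storage[] *= 2.0f;

    // Individual arrays remain addressable through their slices.
    view(storage, 7, width)[0] += 0.5f;
    writeln(view(storage, 7, width));
}
```

The slices keep per-array access cheap while the bulk operation runs over a buffer long enough for the vector path to pay off.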

-Joel