January 15, 2012
On 15 January 2012 19:01, bearophile <bearophileHUGS@lycos.com> wrote:
> Iain Buclaw:
>
>> Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up with -O2 and above.  My oh my...
>
> Please, show me the assembly code produced, with its relative D source :-)
>
> Bye,
> bearophile

For those who can't read AT&T:
----
.LC5:
        .long   1067030938
        .long   1067030938
        .long   1067030938
        .long   1067030938
        .align 16

_D4test5test2FZNhG4f:
        .cfi_startproc
        mov     eax, 3
        cvtsi2ss        xmm0, eax
        mov     al, 7
        cvtsi2ss        xmm1, eax
        unpcklps        xmm0, xmm0
        unpcklps        xmm1, xmm1
        movlhps xmm0, xmm0
        movlhps xmm1, xmm1
        mulps   xmm0, XMMWORD PTR .LC5[rip]
        addps   xmm0, xmm1
        ret
        .cfi_endproc
----


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
January 16, 2012
I just built 32 & 64 bit DMD (latest commit on git tree is
f800f6e342e2d9ab1ec9a6275b8239463aa1cee8)

Using the 32-bit version, I got this error:
Internal error: backend/cg87.c 1702

The 64-bit version went fine.

Previously, both the 32 and 64 bit versions had no problem.

On 01/15/2012 01:56 PM, Walter Bright wrote:
> I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with.
> -----------------------
> import core.simd;
> 
> void test1a(float[4] a) { }
> 
> void test1()
> {
>     float[4] a = 1.2;
>     a[] = a[] * 3 + 7;
>     test1a(a);
> }
> 
> void test2a(float4 a) { }
> 
> void test2()
> {
>     float4 a = 1.2;
>     a = a * 3 + 7;
>     test2a(a);
> }
> 
> import std.stdio;
> import std.datetime;
> 
> int main()
> {
>     test1();
>     test2();
>     auto b = comparingBenchmark!(test1, test2, 100);
>     writeln(b.point);
>     return 0;
> }

January 16, 2012
On 1/16/2012 12:59 AM, Andre Tampubolon wrote:
> I just built 32 & 64 bit DMD (latest commit on git tree is
> f800f6e342e2d9ab1ec9a6275b8239463aa1cee8)
>
> Using the 32-bit version, I got this error:
> Internal error: backend/cg87.c 1702
>
> The 64-bit version went fine.
>
> Previously, both 32 and 64 bit version had no problem.

Which machine?
January 16, 2012
Well I only have 1 machine, a laptop running 64 bit Arch Linux.
Yesterday I did a git pull, built both 32 & 64 bit DMD, and this code
compiled fine using those.
But now, the 32 bit version fails.

Walter Bright <newshound2@digitalmars.com> wrote:
> On 1/16/2012 12:59 AM, Andre Tampubolon wrote:
>> I just built 32 & 64 bit DMD (latest commit on git tree is
>> f800f6e342e2d9ab1ec9a6275b8239463aa1cee8)
>> 
>> Using the 32-bit version, I got this error:
>> Internal error: backend/cg87.c 1702
>> 
>> The 64-bit version went fine.
>> 
>> Previously, both 32 and 64 bit version had no problem.
> 
> Which machine?
January 16, 2012
32 bit SIMD for Linux is not implemented.

SIMD is currently supported on all 64 bit platforms, and on 32 bit OS X.

On 1/16/2012 2:35 AM, Andre Tampubolon wrote:
> Well I only have 1 machine, a laptop running 64 bit Arch Linux.
> Yesterday I did a git pull, built both 32 & 64 bit DMD, and this code
> compiled fine using those.
> But now, the 32 bit version fails.
>
> Walter Bright <newshound2@digitalmars.com> wrote:
>> On 1/16/2012 12:59 AM, Andre Tampubolon wrote:
>>> I just built 32 & 64 bit DMD (latest commit on git tree is
>>> f800f6e342e2d9ab1ec9a6275b8239463aa1cee8)
>>>
>>> Using the 32-bit version, I got this error:
>>> Internal error: backend/cg87.c 1702
>>>
>>> The 64-bit version went fine.
>>>
>>> Previously, both 32 and 64 bit version had no problem.
>>
>> Which machine?

January 16, 2012
On 1/15/12 12:56 AM, Walter Bright wrote:
> I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
> Anyhow, it's good enough now to play around with. Consider it alpha
> quality. Expect bugs - but make bug reports, as there's a serious lack
> of source code to test it with.
> -----------------------
> import core.simd;
>
> void test1a(float[4] a) { }
>
> void test1()
> {
>     float[4] a = 1.2;
>     a[] = a[] * 3 + 7;
>     test1a(a);
> }
>
> void test2a(float4 a) { }
>
> void test2()
> {
>     float4 a = 1.2;
>     a = a * 3 + 7;
>     test2a(a);
> }

These two functions should have the same speed. The function that ought to be slower is:

void test1()
{
    float[5] a = 1.2;
    float[] b = a[1 .. $];
    b[] = b[] * 3 + 7;
    test1a(a);
}


Andrei
January 16, 2012
On 16 January 2012 18:17, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 1/15/12 12:56 AM, Walter Bright wrote:
>
>> I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux. Anyhow, it's good enough now to play around with. Consider it alpha quality. Expect bugs - but make bug reports, as there's a serious lack of source code to test it with.
>> -----------------------
>> import core.simd;
>>
>> void test1a(float[4] a) { }
>>
>> void test1()
>> {
>>     float[4] a = 1.2;
>>     a[] = a[] * 3 + 7;
>>     test1a(a);
>> }
>>
>> void test2a(float4 a) { }
>>
>> void test2()
>> {
>>     float4 a = 1.2;
>>     a = a * 3 + 7;
>>     test2a(a);
>> }
>>
>
> These two functions should have the same speed.


A function using float arrays and a function using hardware vectors should certainly not be the same speed.


January 16, 2012
On 1/16/12 10:46 AM, Manu wrote:
> A function using float arrays and a function using hardware vectors
> should certainly not be the same speed.

My point was that the version using float arrays should opportunistically use hardware ops whenever possible.

Andrei
January 16, 2012
On Mon, 16 Jan 2012 17:17:44 +0100, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 1/15/12 12:56 AM, Walter Bright wrote:
>> I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
>> Anyhow, it's good enough now to play around with. Consider it alpha
>> quality. Expect bugs - but make bug reports, as there's a serious lack
>> of source code to test it with.
>> -----------------------
>> import core.simd;
>>
>> void test1a(float[4] a) { }
>>
>> void test1()
>> {
>>     float[4] a = 1.2;
>>     a[] = a[] * 3 + 7;
>>     test1a(a);
>> }
>>
>> void test2a(float4 a) { }
>>
>> void test2()
>> {
>>     float4 a = 1.2;
>>     a = a * 3 + 7;
>>     test2a(a);
>> }
>
> These two functions should have the same speed. The function that ought to be slower is:
>
> void test1()
> {
>      float[5] a = 1.2;
>      float[] b = a[1 .. $];
>      b[] = b[] * 3 + 7;
>      test1a(a);
> }
>
>
> Andrei

Unfortunately druntime's array ops are a mess and fail
to speed up anything below 16 floats.
Additionally there is overhead for a function call and
they have to check alignment at runtime.
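
To make the runtime alignment check concrete, here is a minimal sketch in D. It is not druntime's actual array-op code; isAligned16 and mulAdd are hypothetical names made up for illustration, and it assumes a target where core.simd's float4 is available (version D_SIMD).

```d
import core.simd;

// Hypothetical helper, not part of druntime:
// true when p sits on a 16-byte boundary.
bool isAligned16(const(void)* p)
{
    return (cast(size_t) p & 15) == 0;
}

// Computes a[] = a[] * 3 + 7, taking the vector path only when it is safe.
void mulAdd(float[] a)
{
    size_t i = 0;
    version (D_SIMD)
    {
        // This test happens on every call; it is the runtime cost
        // that a float4 parameter avoids by being aligned by type.
        if (isAligned16(a.ptr))
        {
            float4 three = 3.0f;
            float4 seven = 7.0f;
            for (; i + 4 <= a.length; i += 4)
            {
                auto v = cast(float4*)(a.ptr + i);
                *v = *v * three + seven;
            }
        }
    }
    // Scalar path: handles the unaligned case and any leftover tail.
    for (; i < a.length; ++i)
        a[i] = a[i] * 3 + 7;
}
```

A slice such as a[1 .. $] of a 16-byte-aligned buffer starts 4 bytes past the boundary, so it would fail the check and take the scalar loop for every element.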

martin
January 16, 2012
On 16 January 2012 18:48, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 1/16/12 10:46 AM, Manu wrote:
>
>> A function using float arrays and a function using hardware vectors should certainly not be the same speed.
>>
>
> My point was that the version using float arrays should opportunistically use hardware ops whenever possible.


I think this is a mistake, because such a piece of code never exists
outside of some context. If the context it exists within is all FPU code
(and it is, it's a float array), then swapping between the FPU and SIMD
execution units will probably make the function slower than the
original (the float array is also unaligned). The SIMD version, however, must
exist within a SIMD context, since the API can't implicitly interact with
floats. This guarantees that the context of each function matches that
within which it lives.
This is fundamental to fast vector performance. Using SIMD is an all or
nothing decision; you can't just mix it in here and there.
You don't go casting back and forth between floats and ints on every other
line... obviously it's imprecise, but it's also a major performance hazard.
There is no difference here, except the performance hazard is much worse.
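
A rough sketch of the distinction, in D with core.simd as used earlier in the thread. The names Lanes, allSimd and mixed are made up for illustration, and the code assumes a target where float4 is available. No timings are claimed; how much the second form costs depends on the compiler and the CPU.

```d
import core.simd;

// Overlays a vector with its four lanes so scalar code can reach them.
union Lanes
{
    float4 v;
    float[4] f;
}

// Stays in vector form from start to finish: (v * 3 + 7) squared.
float4 allSimd(float4 v)
{
    float4 three = 3.0f;
    float4 seven = 7.0f;
    v = v * three + seven;
    return v * v;
}

// Same arithmetic, but the squaring step drops to per-lane scalar code,
// so the data is taken apart lane by lane and reassembled afterwards.
float4 mixed(float4 v)
{
    float4 three = 3.0f;
    float4 seven = 7.0f;
    Lanes l;
    l.v = v * three + seven;
    foreach (ref x; l.f)
        x = x * x;
    return l.v;
}
```

Both functions compute the same values. The only difference is whether the intermediate result leaves the vector domain, which is the kind of mixing described above.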