August 09, 2008
JAnderson wrote:
> Sweet!  I love the way you put this forth as a challenge.  Maybe D will have the worlds fastest array operations :)

I thought a little competition might bring out the best in people!
August 09, 2008
Jarrett Billingsley wrote:

> "Lars Ivar Igesund" <larsivar@igesund.net> wrote in message news:g7ias2$2kbo$1@digitalmars.com...
>> Walter Bright wrote:
>>
>>> This one has (finally) got array operations implemented. For those who want to show off their leet assembler skills, the initial assembler implementation code is in phobos/internal/array*.d. Burton Radons wrote the assembler. Can you make it faster?
>>>
>>> http://www.digitalmars.com/d/1.0/changelog.html http://ftp.digitalmars.com/dmd.1.034.zip
>>
>> The array op docs aren't actually on the 1.0 array page. But great! I remember trying to use these 3 years ago :D
> 
> Too bad Tango doesn't support them yet.  :C

Are you suggesting that Walter should have told us that he was implementing this feature ahead of releasing 1.034?

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
August 09, 2008
On Fri, 08 Aug 2008 22:43:08 -0700, Walter Bright wrote:

> bearophile wrote:
>> Probably I am missing something important, I have tried this code with the 1.034 (that compiles my large d libs fine), but I have found many problems:
>> 
>> import std.stdio: putr = writefln;
>> 
>> void main() { int[] a1 = [1, 2, 3]; int[] a2 = [2, 4, 6];
>> 
>> //putr(a1[] + a2[]); // test.d(6): Error: Array operations not
>> implemented
> 
> It only works if the top level is an assignment operation.
> 
> 
[..]
> 
>> int[] a5 = [3, 5, 7, 9]; int[] a6 = a1 + a5; // test.d(16): Error:
>> Array operations not implemented
> 
> Doesn't work for initializers.
> 
> 
[..]

Looks like there is room for improvement.
It does put work on the programmers nerves when things doesn't work as
expected. :)

Anyway - Good work!
August 09, 2008
bearophile wrote:
>> import std.stdio: putr = writefln;


Walter Bright:
>It only works if the top level is an assignment operation.<

Then I have not seen such comment in the docs, if this is absent from the docs, then this deserves to be added. And the error message too given DMD can be improved.


>Doesn't work for initializers.<

Both the docs (if not already present) and the error message have to explain this.

This output looks like a bug of the compiler anyway: [1,2,3,0,0,0,0]


>> int[] a7; a7[] = a1[] + a2[]; putr(a7); // prints: []
>I don't know what putr is.

It's just a shorter alias of the writefln.


>>auto a8 = a1 + a2; // test.d(21): Error: Array operations not implemented<<

>Have to use slice [] operator.<

I'd like a less wrong error message then.


>>Is it able to compute a+b+c with a single loop (as all Fortran compilers do)?<<

>Yes.<

This is very positive :-)


>D already distinguishes operations on the array handle, a, from operations on the contents of a, a[]. I think this is a good distinction.<

I understand and I agree, but the [] make the code a little less natural to write.

----------------------------

For reference this is the shortened code, it compiles and runs but the results and error messages are bogus:

import std.stdio: writefln;

void main() {
    int[] a1 = [1, 2, 3];
    int[] a2 = [2, 4, 6];

    auto a3 = a1[] + 4;
    writefln(a3); // prints: [1,2,3,0,0,0,0]

    int[] a7;
    a7[] = a1[] + a2[];
    writefln(a7); // prints: []

    // a7 = a1 + a2; // test2.d(14): Error: Array operations not implemented
}

The last line gives a wrong message error (well, the message errors in the precedent code were all wrong).

-------------------

The following code works, yay! :-)

import std.stdio: writefln;

void main() {
    int[] a1 = [1, 2, 3];
    int[] a2 = [2, 4, 6];
    auto a3 = new int[2];

    a3[] = a1[] + a2[];
    writefln(a3); // prints: [3,6]
}

Later,
bearophile
August 09, 2008
First benchmark, just D against itself, not used GCC yet, the results show that vector ops are generally slower, but maybe there's some bug/problem in my benchmark (note it needs just Phobos!), not tested on Linux yet:


import std.stdio: put = writef, putr = writefln;
import std.conv: toInt;

version (Win32) {
    import std.c.windows.windows: QueryPerformanceCounter, QueryPerformanceFrequency;

    double clock() {
        long t;
        QueryPerformanceCounter(&t);

        return cast(double)t / queryPerformanceFrequency;
    }

    long queryPerformanceFrequency;

    static this() {
        QueryPerformanceFrequency(&queryPerformanceFrequency);
    }
}

version (linux) {
    import std.c.linux.linux: time;

    double clock() {
        return cast(double)time(null);
    }
}


void main(string[] args) {
    int n = args.length >= 2 ? toInt(args[1]) : 10;
    n *= 8; // to avoid problems with SSE2
    int nloops = args.length >= 3 ? toInt(args[2]) : 1;
    bool use_vec = args.length == 4 ? cast(bool)toInt(args[3]) : true;

    putr("array len= ", n, "  nloops= ", nloops, "  Use vec ops: ", use_vec);

    auto a1 = new int[n]; // void?
    auto a2 = new int[n]; // void?
    auto a3 = new int[n];

    foreach (i, ref el; a1)
        el = i * 7 + 1;
    foreach (i, ref el; a2)
        el = i + 1;

    auto t = clock();
    if (use_vec)
        for (int j = 0; j < nloops; j++)
            a3[] = a1[] / a2[];
    else
        for (int j = 0; j < nloops; j++)
            for (int i; i < a3.length; i++)
                a3[i] = a1[i] / a2[i];
    putr("time= ", clock() - t, " s");

    if (a3.length < 300)
        putr("\nResult:\n", a3);
}

/*
D code with /:
    C:\>array_benchmark.exe 10000 10000 0
    array len= 80000  nloops= 10000  Use vec ops: false
    time= 7.10563 s

    C:\>array_benchmark.exe 10000 10000 1
    array len= 80000  nloops= 10000  Use vec ops: true
    time= 7.222 s


    C:\>array_benchmark.exe 12000000 1 0
    array len= 96000000  nloops= 1  Use vec ops: false
    time= 0.654696 s

    C:\>array_benchmark.exe 12000000 1 1
    array len= 96000000  nloops= 1  Use vec ops: true
    time= 0.655401 s


D code with *:
    C:\>array_benchmark.exe 10000 10000 0
    array len= 80000  nloops= 10000  Use vec ops: false
    time= 7.10615 s

    C:\>array_benchmark.exe 10000 10000 1
    array len= 80000  nloops= 10000  Use vec ops: true
    time= 7.21904 s


    C:\>array_benchmark.exe 12000000 1 0
    array len= 96000000  nloops= 1  Use vec ops: false
    time= 0.65515 s

    C:\>array_benchmark.exe 12000000 1 1
    array len= 96000000  nloops= 1  Use vec ops: true
    time= 0.65566 s
    (Note that 0.65566 > 0.65515 isn't due to noise)


D code with +:
    C:\>array_benchmark.exe 10000 10000 0
    array len= 80000  nloops= 10000  Use vec ops: false
    time= 7.10848 s

    C:\>array_benchmark.exe 10000 10000 1
    array len= 80000  nloops= 10000  Use vec ops: true
    time= 7.22527 s


    C:\>array_benchmark.exe 12000000 1 0
    array len= 96000000  nloops= 1  Use vec ops: false
    time= 0.654797 s

    C:\>array_benchmark.exe 12000000 1 1
    array len= 96000000  nloops= 1  Use vec ops: true
    time= 0.654991 s

*/


Bye,
bearophile
August 09, 2008
Second version, just a bit cleaner code, less bug-prone, etc: http://codepad.org/BlwSIBKl

Timings on linux on DMD 2.0 with * as operation seems much better.

Bater,
bearophile
August 09, 2008
== Quote from bearophile (bearophileHUGS@lycos.com)'s article
> First benchmark, just D against itself, not used GCC yet, the results show that
vector ops are generally slower, but maybe there's some bug/problem in my benchmark (note it needs just Phobos!), not tested on Linux yet:

I see at least part of the problem.  When you use such huge arrays, it ends up being more a test of your memory bandwidth than of the vector ops.  Three arrays of 80000 ints comes to a total of about 960k.  This is not going to fit in any L1 cache for a long time.  Heck, my CPU only has 512k L2 cache per core.  Here are my results using smaller arrays designed to fit in my 64k L1 data cache, and the same code as Bearophile.

+ operator:
D:\code>array_benchmark.exe 500 1000000 0
array len= 4000  nloops= 1000000  Use vec ops: false
time= 4.82841 s

D:\code>array_benchmark.exe 500 1000000 1
array len= 4000  nloops= 1000000  Use vec ops: true
time= 2.32902 s

* operator :

D:\code>array_benchmark.exe 500 1000000 0
array len= 4000  nloops= 1000000  Use vec ops: false
time= 6.1556 s

D:\code>array_benchmark.exe 500 1000000 1
array len= 4000  nloops= 1000000  Use vec ops: true
time= 6.16539 s

/ operator:

D:\code>array_benchmark.exe 500 100000 0
array len= 4000  nloops= 100000  Use vec ops: false
time= 7.02435 s

D:\code>array_benchmark.exe 500 100000 1
array len= 4000  nloops= 100000  Use vec ops: true
time= 6.84251 s

BTW, for the sake of comparison, here's my CPU specs from CPU-Z. Also note that I'm running in 32-bit mode.

Number of processors	1
Number of cores	2 per processor
Number of threads	2 (max 2) per processor
Name	AMD Athlon 64 X2 3600+
Code Name	Brisbane
Specification	AMD Athlon(tm) 64 X2 Dual Core Processor 3600+
Package	Socket AM2 (940)
Family/Model/Stepping	F.B.1
Extended Family/Model	F.6B
Brand ID	4
Core Stepping	BH-G1
Technology	65 nm
Core Speed	2698.1 MHz
Multiplier x Bus speed	9.5 x 284.0 MHz
HT Link speed	852.0 MHz
Stock frequency	1900 MHz
Instruction sets	MMX (+), 3DNow! (+), SSE, SSE2, SSE3, x86-64
L1 Data cache (per processor)	2 x 64 KBytes, 2-way set associative, 64-byte line size
L1 Instruction cache (per processor)	2 x 64 KBytes, 2-way set associative, 64-byte
line size
L2 cache (per processor)	2 x 512 KBytes, 16-way set associative, 64-byte line size
August 09, 2008
C version too:

#include "stdlib.h"
#include "stdio.h"
#include "time.h"

#define MYOP *
typedef int T;
#define TFORM "%d "

void error(char *string) {
    fprintf(stderr, "ERROR: %s\n", string);
    exit(EXIT_FAILURE);
}

double myclock() {
    clock_t t = clock();
    if (t == -1)
        return 0.0;
    else
        return t / (double)CLOCKS_PER_SEC;
}

int main(int argc, char** argv) {
    int n = argc >= 2 ? atoi(argv[1]) : 10;

    n *= 8; // to avoid problems with SSE2
    int nloops = argc >= 3 ? atoi(argv[2]) : 1;

    printf("array len= %d  nloops= %d\n", n, nloops);

    //__attribute__((aligned(16)))
    T* __restrict a1 = (T*)malloc(sizeof(T) * n + 16);
    T* __restrict a2 = (T*)malloc(sizeof(T) * n + 16);
    T* __restrict a3 = (T*)malloc(sizeof(T) * n + 16);
    if (a1 == NULL || a2 == NULL || a3 == NULL)
        error("memory overflow");

    int i, j;
    for (i = 0; i < n; i++) {
        a1[i] = i * 7 + 1;
        a2[i] = i + 1;
    }

    double t = myclock();
    for (j = 0; j < nloops; j++)
        for (i = 0; i < n; i++) // Alignment of access forced using peeling.
            a3[i] = a1[i] MYOP a2[i];
    printf("time= %f s\n", myclock() - t);

    if (n < 300) {
        printf("\nResult:\n");
        for (i = 0; i < n; i++)
            printf(TFORM, a3[i]);
        putchar('\n');
    }

    return 0;
}

/*

MYOP = *, compiled with:
gcc -Wall -O3 -s benchmark.c -o benchmark
    C:\>benchmark 100 3000000
    array len= 800  nloops= 3000000
    time= 3.656000 s

    C:\>benchmark 10000 10000
    array len= 80000  nloops= 10000
    time= 1.374000 s

    C:\>benchmark 12000000 1
    array len= 96000000  nloops= 1
    time= 0.547000 s


MYOP = *, compiled with:
gcc -Wall -O3 -s -ftree-vectorize -msse3 -ftree-vectorizer-verbose=5 benchmark.c -o benchmark
    C:\>benchmark 100 3000000
    array len= 800  nloops= 3000000
    time= 3.468000 s

    C:\>benchmark 10000 10000
    array len= 80000  nloops= 10000
    time= 1.156000 s

    C:\>benchmark 12000000 1
    array len= 96000000  nloops= 1
    time= 0.531000 s

In the larger array the cache effects may dominate over computing time.

*/

August 09, 2008
dsimcha:
> I see at least part of the problem.  When you use such huge arrays, it ends up being more a test of your memory bandwidth than of the vector ops.

Right. Finding good benchmarks is not easy, and I have shown the code here for people to spot problems in it. I have added a C version too now.

Bye,
bearophile
August 09, 2008
Lars Ivar Igesund wrote:
> Jarrett Billingsley wrote:
> 
>> "Lars Ivar Igesund" <larsivar@igesund.net> wrote in message
>> news:g7ias2$2kbo$1@digitalmars.com...
>>> Walter Bright wrote:
>>>
>>>> This one has (finally) got array operations implemented. For those who
>>>> want to show off their leet assembler skills, the initial assembler
>>>> implementation code is in phobos/internal/array*.d. Burton Radons wrote
>>>> the assembler. Can you make it faster?
>>>>
>>>> http://www.digitalmars.com/d/1.0/changelog.html
>>>> http://ftp.digitalmars.com/dmd.1.034.zip
>>> The array op docs aren't actually on the 1.0 array page. But great! I
>>> remember trying to use these 3 years ago :D
>> Too bad Tango doesn't support them yet.  :C
> 
> Are you suggesting that Walter should have told us that he was implementing
> this feature ahead of releasing 1.034?

All Tango needs to do is copy the internal\array*.d files over and add them to the makefile.