August 09, 2008 Re: DMD 1.034 and 2.018 releases | ||||
---|---|---|---|---|
| ||||
Posted in reply to JAnderson | JAnderson wrote:
> Sweet! I love the way you put this forth as a challenge. Maybe D will have the worlds fastest array operations :)
I thought a little competition might bring out the best in people!
|
August 09, 2008 Re: DMD 1.034 and 2.018 releases | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jarrett Billingsley | Jarrett Billingsley wrote: > "Lars Ivar Igesund" <larsivar@igesund.net> wrote in message news:g7ias2$2kbo$1@digitalmars.com... >> Walter Bright wrote: >> >>> This one has (finally) got array operations implemented. For those who want to show off their leet assembler skills, the initial assembler implementation code is in phobos/internal/array*.d. Burton Radons wrote the assembler. Can you make it faster? >>> >>> http://www.digitalmars.com/d/1.0/changelog.html http://ftp.digitalmars.com/dmd.1.034.zip >> >> The array op docs aren't actually on the 1.0 array page. But great! I remember trying to use these 3 years ago :D > > Too bad Tango doesn't support them yet. :C Are you suggesting that Walter should have told us that he was implementing this feature ahead of releasing 1.034? -- Lars Ivar Igesund blog at http://larsivi.net DSource, #d.tango & #D: larsivi Dancing the Tango |
August 09, 2008 Re: DMD 1.034 and 2.018 releases | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Fri, 08 Aug 2008 22:43:08 -0700, Walter Bright wrote: > bearophile wrote: >> Probably I am missing something important, I have tried this code with the 1.034 (that compiles my large d libs fine), but I have found many problems: >> >> import std.stdio: putr = writefln; >> >> void main() { int[] a1 = [1, 2, 3]; int[] a2 = [2, 4, 6]; >> >> //putr(a1[] + a2[]); // test.d(6): Error: Array operations not >> implemented > > It only works if the top level is an assignment operation. > > [..] > >> int[] a5 = [3, 5, 7, 9]; int[] a6 = a1 + a5; // test.d(16): Error: >> Array operations not implemented > > Doesn't work for initializers. > > [..] Looks like there is room for improvement. It does put work on the programmers nerves when things doesn't work as expected. :) Anyway - Good work! |
August 09, 2008 Re: DMD 1.034 and 2.018 releases | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | bearophile wrote: >> import std.stdio: putr = writefln; Walter Bright: >It only works if the top level is an assignment operation.< Then I have not seen such comment in the docs, if this is absent from the docs, then this deserves to be added. And the error message too given DMD can be improved. >Doesn't work for initializers.< Both the docs (if not already present) and the error message have to explain this. This output looks like a bug of the compiler anyway: [1,2,3,0,0,0,0] >> int[] a7; a7[] = a1[] + a2[]; putr(a7); // prints: [] >I don't know what putr is. It's just a shorter alias of the writefln. >>auto a8 = a1 + a2; // test.d(21): Error: Array operations not implemented<< >Have to use slice [] operator.< I'd like a less wrong error message then. >>Is it able to compute a+b+c with a single loop (as all Fortran compilers do)?<< >Yes.< This is very positive :-) >D already distinguishes operations on the array handle, a, from operations on the contents of a, a[]. I think this is a good distinction.< I understand and I agree, but the [] make the code a little less natural to write. ---------------------------- For reference this is the shortened code, it compiles and runs but the results and error messages are bogus: import std.stdio: writefln; void main() { int[] a1 = [1, 2, 3]; int[] a2 = [2, 4, 6]; auto a3 = a1[] + 4; writefln(a3); // prints: [1,2,3,0,0,0,0] int[] a7; a7[] = a1[] + a2[]; writefln(a7); // prints: [] // a7 = a1 + a2; // test2.d(14): Error: Array operations not implemented } The last line gives a wrong message error (well, the message errors in the precedent code were all wrong). ------------------- The following code works, yay! :-) import std.stdio: writefln; void main() { int[] a1 = [1, 2, 3]; int[] a2 = [2, 4, 6]; auto a3 = new int[2]; a3[] = a1[] + a2[]; writefln(a3); // prints: [3,6] } Later, bearophile |
August 09, 2008 Re: DMD 1.034 and 2.018 releases | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile | First benchmark, just D against itself, not used GCC yet, the results show that vector ops are generally slower, but maybe there's some bug/problem in my benchmark (note it needs just Phobos!), not tested on Linux yet: import std.stdio: put = writef, putr = writefln; import std.conv: toInt; version (Win32) { import std.c.windows.windows: QueryPerformanceCounter, QueryPerformanceFrequency; double clock() { long t; QueryPerformanceCounter(&t); return cast(double)t / queryPerformanceFrequency; } long queryPerformanceFrequency; static this() { QueryPerformanceFrequency(&queryPerformanceFrequency); } } version (linux) { import std.c.linux.linux: time; double clock() { return cast(double)time(null); } } void main(string[] args) { int n = args.length >= 2 ? toInt(args[1]) : 10; n *= 8; // to avoid problems with SSE2 int nloops = args.length >= 3 ? toInt(args[2]) : 1; bool use_vec = args.length == 4 ? cast(bool)toInt(args[3]) : true; putr("array len= ", n, " nloops= ", nloops, " Use vec ops: ", use_vec); auto a1 = new int[n]; // void? auto a2 = new int[n]; // void? auto a3 = new int[n]; foreach (i, ref el; a1) el = i * 7 + 1; foreach (i, ref el; a2) el = i + 1; auto t = clock(); if (use_vec) for (int j = 0; j < nloops; j++) a3[] = a1[] / a2[]; else for (int j = 0; j < nloops; j++) for (int i; i < a3.length; i++) a3[i] = a1[i] / a2[i]; putr("time= ", clock() - t, " s"); if (a3.length < 300) putr("\nResult:\n", a3); } /* D code with /: C:\>array_benchmark.exe 10000 10000 0 array len= 80000 nloops= 10000 Use vec ops: false time= 7.10563 s C:\>array_benchmark.exe 10000 10000 1 array len= 80000 nloops= 10000 Use vec ops: true time= 7.222 s C:\>array_benchmark.exe 12000000 1 0 array len= 96000000 nloops= 1 Use vec ops: false time= 0.654696 s C:\>array_benchmark.exe 12000000 1 1 array len= 96000000 nloops= 1 Use vec ops: true time= 0.655401 s D code with *: C:\>array_benchmark.exe 10000 10000 0 array len= 80000 nloops= 10000 Use vec ops: false time= 7.10615 s C:\>array_benchmark.exe 10000 10000 1 array len= 80000 nloops= 10000 Use vec ops: true time= 7.21904 s C:\>array_benchmark.exe 12000000 1 0 array len= 96000000 nloops= 1 Use vec ops: false time= 0.65515 s C:\>array_benchmark.exe 12000000 1 1 array len= 96000000 nloops= 1 Use vec ops: true time= 0.65566 s (Note that 0.65566 > 0.65515 isn't due to noise) D code with +: C:\>array_benchmark.exe 10000 10000 0 array len= 80000 nloops= 10000 Use vec ops: false time= 7.10848 s C:\>array_benchmark.exe 10000 10000 1 array len= 80000 nloops= 10000 Use vec ops: true time= 7.22527 s C:\>array_benchmark.exe 12000000 1 0 array len= 96000000 nloops= 1 Use vec ops: false time= 0.654797 s C:\>array_benchmark.exe 12000000 1 1 array len= 96000000 nloops= 1 Use vec ops: true time= 0.654991 s */ Bye, bearophile |
August 09, 2008 Re: DMD 1.034 and 2.018 releases | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile | Second version, just a bit cleaner code, less bug-prone, etc: http://codepad.org/BlwSIBKl Timings on linux on DMD 2.0 with * as operation seems much better. Bater, bearophile |
August 09, 2008 Re: DMD 1.034 and 2.018 releases | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile | == Quote from bearophile (bearophileHUGS@lycos.com)'s article
> First benchmark, just D against itself, not used GCC yet, the results show that
vector ops are generally slower, but maybe there's some bug/problem in my benchmark (note it needs just Phobos!), not tested on Linux yet:
I see at least part of the problem. When you use such huge arrays, it ends up being more a test of your memory bandwidth than of the vector ops. Three arrays of 80000 ints comes to a total of about 960k. This is not going to fit in any L1 cache for a long time. Heck, my CPU only has 512k L2 cache per core. Here are my results using smaller arrays designed to fit in my 64k L1 data cache, and the same code as Bearophile.
+ operator:
D:\code>array_benchmark.exe 500 1000000 0
array len= 4000 nloops= 1000000 Use vec ops: false
time= 4.82841 s
D:\code>array_benchmark.exe 500 1000000 1
array len= 4000 nloops= 1000000 Use vec ops: true
time= 2.32902 s
* operator :
D:\code>array_benchmark.exe 500 1000000 0
array len= 4000 nloops= 1000000 Use vec ops: false
time= 6.1556 s
D:\code>array_benchmark.exe 500 1000000 1
array len= 4000 nloops= 1000000 Use vec ops: true
time= 6.16539 s
/ operator:
D:\code>array_benchmark.exe 500 100000 0
array len= 4000 nloops= 100000 Use vec ops: false
time= 7.02435 s
D:\code>array_benchmark.exe 500 100000 1
array len= 4000 nloops= 100000 Use vec ops: true
time= 6.84251 s
BTW, for the sake of comparison, here's my CPU specs from CPU-Z. Also note that I'm running in 32-bit mode.
Number of processors 1
Number of cores 2 per processor
Number of threads 2 (max 2) per processor
Name AMD Athlon 64 X2 3600+
Code Name Brisbane
Specification AMD Athlon(tm) 64 X2 Dual Core Processor 3600+
Package Socket AM2 (940)
Family/Model/Stepping F.B.1
Extended Family/Model F.6B
Brand ID 4
Core Stepping BH-G1
Technology 65 nm
Core Speed 2698.1 MHz
Multiplier x Bus speed 9.5 x 284.0 MHz
HT Link speed 852.0 MHz
Stock frequency 1900 MHz
Instruction sets MMX (+), 3DNow! (+), SSE, SSE2, SSE3, x86-64
L1 Data cache (per processor) 2 x 64 KBytes, 2-way set associative, 64-byte line size
L1 Instruction cache (per processor) 2 x 64 KBytes, 2-way set associative, 64-byte
line size
L2 cache (per processor) 2 x 512 KBytes, 16-way set associative, 64-byte line size
|
August 09, 2008 Re: DMD 1.034 and 2.018 releases | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile | C version too: #include "stdlib.h" #include "stdio.h" #include "time.h" #define MYOP * typedef int T; #define TFORM "%d " void error(char *string) { fprintf(stderr, "ERROR: %s\n", string); exit(EXIT_FAILURE); } double myclock() { clock_t t = clock(); if (t == -1) return 0.0; else return t / (double)CLOCKS_PER_SEC; } int main(int argc, char** argv) { int n = argc >= 2 ? atoi(argv[1]) : 10; n *= 8; // to avoid problems with SSE2 int nloops = argc >= 3 ? atoi(argv[2]) : 1; printf("array len= %d nloops= %d\n", n, nloops); //__attribute__((aligned(16))) T* __restrict a1 = (T*)malloc(sizeof(T) * n + 16); T* __restrict a2 = (T*)malloc(sizeof(T) * n + 16); T* __restrict a3 = (T*)malloc(sizeof(T) * n + 16); if (a1 == NULL || a2 == NULL || a3 == NULL) error("memory overflow"); int i, j; for (i = 0; i < n; i++) { a1[i] = i * 7 + 1; a2[i] = i + 1; } double t = myclock(); for (j = 0; j < nloops; j++) for (i = 0; i < n; i++) // Alignment of access forced using peeling. a3[i] = a1[i] MYOP a2[i]; printf("time= %f s\n", myclock() - t); if (n < 300) { printf("\nResult:\n"); for (i = 0; i < n; i++) printf(TFORM, a3[i]); putchar('\n'); } return 0; } /* MYOP = *, compiled with: gcc -Wall -O3 -s benchmark.c -o benchmark C:\>benchmark 100 3000000 array len= 800 nloops= 3000000 time= 3.656000 s C:\>benchmark 10000 10000 array len= 80000 nloops= 10000 time= 1.374000 s C:\>benchmark 12000000 1 array len= 96000000 nloops= 1 time= 0.547000 s MYOP = *, compiled with: gcc -Wall -O3 -s -ftree-vectorize -msse3 -ftree-vectorizer-verbose=5 benchmark.c -o benchmark C:\>benchmark 100 3000000 array len= 800 nloops= 3000000 time= 3.468000 s C:\>benchmark 10000 10000 array len= 80000 nloops= 10000 time= 1.156000 s C:\>benchmark 12000000 1 array len= 96000000 nloops= 1 time= 0.531000 s In the larger array the cache effects may dominate over computing time. */ |
August 09, 2008 Re: DMD 1.034 and 2.018 releases | ||||
---|---|---|---|---|
| ||||
Posted in reply to dsimcha | dsimcha:
> I see at least part of the problem. When you use such huge arrays, it ends up being more a test of your memory bandwidth than of the vector ops.
Right. Finding good benchmarks is not easy, and I have shown the code here for people to spot problems in it. I have added a C version too now.
Bye,
bearophile
|
August 09, 2008 Re: DMD 1.034 and 2.018 releases | ||||
---|---|---|---|---|
| ||||
Posted in reply to Lars Ivar Igesund | Lars Ivar Igesund wrote:
> Jarrett Billingsley wrote:
>
>> "Lars Ivar Igesund" <larsivar@igesund.net> wrote in message
>> news:g7ias2$2kbo$1@digitalmars.com...
>>> Walter Bright wrote:
>>>
>>>> This one has (finally) got array operations implemented. For those who
>>>> want to show off their leet assembler skills, the initial assembler
>>>> implementation code is in phobos/internal/array*.d. Burton Radons wrote
>>>> the assembler. Can you make it faster?
>>>>
>>>> http://www.digitalmars.com/d/1.0/changelog.html
>>>> http://ftp.digitalmars.com/dmd.1.034.zip
>>> The array op docs aren't actually on the 1.0 array page. But great! I
>>> remember trying to use these 3 years ago :D
>> Too bad Tango doesn't support them yet. :C
>
> Are you suggesting that Walter should have told us that he was implementing
> this feature ahead of releasing 1.034?
All Tango needs to do is copy the internal\array*.d files over and add them to the makefile.
|
Copyright © 1999-2021 by the D Language Foundation