Wayne Scott
I was doing some tests using the gdc compiler and comparing it to gcc.
First I created a C version of the example wc program:
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
char *
readfile(char *file)
{
    char *ret;
    int fd;
    struct stat sb;

    stat(file, &sb);
    ret = malloc(sb.st_size + 1);
    fd = open(file, O_RDONLY);
    read(fd, ret, sb.st_size);
    ret[sb.st_size] = 0;
    close(fd);
    return (ret);
}

int
main(int ac, char **av)
{
    int w_total = 0;
    int l_total = 0;
    int c_total = 0;
    int i;

    printf(" lines words bytes file\n");
    for (i = 1; i < ac; i++) {
        char *input;
        int w_cnt = 0, l_cnt = 0, c_cnt = 0;
        int inword = 0;
        char *p;

        input = readfile(av[i]);
        p = input;
        while (*p) {
            if (*p == '\n') ++l_cnt;
            if (*p != ' ') {
                if (!inword) {
                    inword = 1;
                    ++w_cnt;
                }
            } else {
                inword = 0;
            }
            ++c_cnt;
            ++p;
        }
        free(input);
        printf("%8u%8u%8u %s\n", l_cnt, w_cnt, c_cnt, av[i]);
        l_total += l_cnt;
        w_total += w_cnt;
        c_total += c_cnt;
    }
    if (ac > 2) {
        printf("--------------------------------------\n"
               "%8u%8u%8u total\n",
               l_total, w_total, c_total);
    }
    return (0);
}
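The counting loop above is a small state machine worth noting: only a literal space ends a word, so a newline inside a run of characters does not. Here it is extracted into a standalone function for illustration (count_words is my name, not part of the original program):

```c
/* Word-counting state machine from the wc program above, extracted for
 * illustration.  Only ' ' resets the in-word flag, so "a\nb" counts as
 * ONE word -- this is the exact behavior the benchmark measures, not
 * what the system wc does. */
int count_words(const char *p)
{
    int w_cnt = 0;
    int inword = 0;

    while (*p) {
        if (*p != ' ') {
            if (!inword) {      /* entering a word: count it once */
                inword = 1;
                ++w_cnt;
            }
        } else {
            inword = 0;         /* a space ends the current word */
        }
        ++p;
    }
    return w_cnt;
}
```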
Then I compiled both versions with -O2 and used cachegrind to find out exactly how many instructions each one needed to run. Here are the results, with the C version first.
(This is over 2 megs of C source)
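The compile step isn't shown above; it would have been something like the following (the source file names and exact gdc invocation are my guesses, not from the original post):

```shell
# C version, built with gcc at -O2 (file name wc_c.c is assumed)
gcc -O2 -o wc_c wc_c.c

# D version, built with gdc at -O2 (file name wc.d is assumed)
gdc -O2 -o wc_d wc.d

# Both binaries are then run under cachegrind, which counts every
# instruction fetch and data reference, as in the output below.
valgrind --tool=cachegrind ./wc_c *.c > /dev/null
```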
$ valgrind --tool=cachegrind ./wc_c ~/bk/bk-3.3.x/src/*.c > /dev/null
==3349== I refs: 22,529,481
==3349== I1 misses: 784
==3349== L2i misses: 778
==3349== I1 miss rate: 0.0%
==3349== L2i miss rate: 0.0%
==3349==
==3349== D refs: 2,393,366 (2,262,770 rd + 130,596 wr)
==3349== D1 misses: 10,159 ( 9,671 rd + 488 wr)
==3349== L2d misses: 9,680 ( 9,315 rd + 365 wr)
==3349== D1 miss rate: 0.4% ( 0.4% + 0.3% )
==3349== L2d miss rate: 0.4% ( 0.4% + 0.2% )
==3349==
==3349== L2 refs: 10,943 ( 10,455 rd + 488 wr)
==3349== L2 misses: 10,458 ( 10,093 rd + 365 wr)
==3349== L2 miss rate: 0.0% ( 0.0% + 0.2% )
farm Dlang $ valgrind --tool=cachegrind ./wc_d ~/bk/bk-3.3.x/src/*.c > /dev/null
==3351== Cachegrind, an I1/D1/L2 cache profiler for x86-linux.
==3351== I refs: 29,081,497
==3351== I1 misses: 1,216
==3351== L2i misses: 1,199
==3351== I1 miss rate: 0.0%
==3351== L2i miss rate: 0.0%
==3351==
==3351== D refs: 4,891,118 (3,663,754 rd + 1,227,364 wr)
==3351== D1 misses: 61,871 ( 24,677 rd + 37,194 wr)
==3351== L2d misses: 60,880 ( 23,757 rd + 37,123 wr)
==3351== D1 miss rate: 1.2% ( 0.6% + 3.0% )
==3351== L2d miss rate: 1.2% ( 0.6% + 3.0% )
==3351==
==3351== L2 refs: 63,087 ( 25,893 rd + 37,194 wr)
==3351== L2 misses: 62,079 ( 24,956 rd + 37,123 wr)
==3351== L2 miss rate: 0.1% ( 0.0% + 3.0% )
As you can see, the D version of the code used about 30% more instructions and about 100% more data accesses.
(BTW the system wc program was a lot slower than both of these...)
That is not too bad given the benefits, but I was hoping they would be closer. Originally I was seeing MUCH more divergent results, but I was using smaller input sets; D has a much higher startup overhead.
Next I tried making the D code look like my C version, dropping the dynamic arrays and just using pointers. It didn't really change the numbers at all, and neither did adding -fno-bounds-check. That is a good sign, because it means the array code generates the same code you would write by hand with pointers.
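The equivalence described above, index-based array code compiling down to the same thing as hand-written pointer code, can be sketched in C (both functions and their names are mine, for illustration; at -O2 a modern compiler typically emits the same machine loop for both):

```c
#include <stddef.h>

/* Index-based newline counting -- the style that D's dynamic arrays
 * encourage, with an explicit bound check in the loop condition. */
size_t count_newlines_indexed(const char *buf, size_t len)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++)
        if (buf[i] == '\n')
            n++;
    return n;
}

/* Pointer-walking version, in the style of the C wc program above.
 * An optimizing compiler usually reduces both functions to the same
 * loop, which is why rewriting the D code with pointers (and adding
 * -fno-bounds-check) changed none of the numbers. */
size_t count_newlines_pointer(const char *buf, size_t len)
{
    const char *p = buf, *end = buf + len;
    size_t n = 0;

    while (p < end) {
        if (*p == '\n')
            n++;
        p++;
    }
    return n;
}
```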
Anyway I thought the result was interesting...
-Wayne