September 29, 2010
I have a AMD Phenom(tm) II X4 955 Processor and this script works only on one core of my quad
(800Mhz)
September 29, 2010
bioinfornatics:

> with ldc and tango (up to date)
> $ ldc -O5 -release -enable-inlining test.d
> $ time ./test
> 1500000 1300000 497200000
> 
> real	0m4.376s
> user	0m4.373s
> sys	0m0.001s

LDC has llvm back-end, that doesn't share that dmd problem.
But to give us meaningful data, I suggest you to time the C program too on your machine, so we can compare :-)

Bye,
bearophile
September 29, 2010
== Quote from bearophile (bearophileHUGS@lycos.com)'s article
> Iain Buclaw:
> > Oh, it was my *original* intention to test runtime speed. However, the time to compile just stood out little more like a sore thumb than what I anticipated.
> If your purpose is to test runtime speed, use a more natural number of cases
like 10 or 20 or even 50 :-)
> So I have done a better test, see below for the code.
> Timings, NLOOPS = 100_000, best of 6, seconds:
>   DMD: 7.70
>   GCC: 2.42
> gcc  4.5.1, -Wall -O3 -s
> dmd 2.049, -O -release -inline
--snip--
> Bye,
> bearophile

Well, this should give you a good comparison between case jump tables on and off in GCC.

gdc-4.4 -O0
real	0m9.994s
user	0m9.957s
sys	0m0.028s

gdc-4.4 -O0 -fno-jump-tables
real	0m10.222s
user	0m10.197s
sys	0m0.012s

gdc-4.4 -O0 -funroll-loops
real	0m10.004s
user	0m9.965s
sys	0m0.032s

gdc-4.4 -O0 -funroll-loops -fno-jump-tables
real	0m10.011s
user	0m9.981s
sys	0m0.020s

gdc-4.4 -O3
real	0m7.107s
user	0m7.092s
sys	0m0.008s

gdc-4.4 -O3 -fno-jump-tables
real	0m7.136s
user	0m7.112s
sys	0m0.008s

gdc-4.4 -O3 -funroll-loops
real	0m7.127s
user	0m7.104s
sys	0m0.008s

gdc-4.4 -O3 -funroll-loops -fno-jump-tables
real	0m7.237s
user	0m7.184s
sys	0m0.044s

Differences are pretty minimal...


In comparison to DMD:

dmd -O
real	0m15.049s
user	0m14.989s
sys	0m0.044s


Iain
September 29, 2010
On Tue, 28 Sep 2010 20:45:27 -0400, bearophile <bearophileHUGS@lycos.com> wrote:

> retard:
>
>> Instead of O(n) linear search or O(ln n) binary search, why not use O(1)
>> jump tables in this case?
>
> I don't exactly know. But you must take into account the constants too, it's not just a matter of worst-case computational complexity. Probably when the density of a large jump table becomes too much low, its experimental performance on modern CPUs gets worse than a binary search among few entries. But I am not sure, I have not written&run benchmarks on this.
>
> Bye,
> bearophile

Well there are 28 labeled cases and ~16kb of jump table address space. (32kb on 64-bit platforms)
September 29, 2010
Robert Jacques:
> Well there are 28 labeled cases and ~16kb of jump table address space. (32kb on 64-bit platforms)

32 kb are enough to fill the code part of the L1 cache on most CPUs.

Bye,
bearophile
September 29, 2010
Iain Buclaw:

> gdc-4.4 -O3 -fno-jump-tables
> real	0m7.136s
> user	0m7.112s
> sys	0m0.008s
...
> gdc-4.4 -O3 -funroll-loops -fno-jump-tables
> real	0m7.237s
> user	0m7.184s
> sys	0m0.044s
> 
> Differences are pretty minimal...

Probably because gcc/gdc is not using a jump table here, but a binary search tree :-)

Bye,
bearophile
September 29, 2010
Tue, 28 Sep 2010 22:00:06 -0400, bearophile wrote:

> Robert Jacques:
>> Well there are 28 labeled cases and ~16kb of jump table address space.
>> (32kb on 64-bit platforms)
> 
> 32 kb are enough to fill the code part of the L1 cache on most CPUs.

28 cases and 32 kB of space seems like a waste. I'd use some sort of hashing before the jump to eliminate unnecessary blank slots. However, even if this kind of solution was cache friendly, I have no idea how this would affect the operation of modern branch predictors.
1 2
Next ›   Last »