Thread overview
Optimizing a raytracer
Oct 16, 2013
Jacob Carlborg
Oct 16, 2013
finalpatch
Oct 16, 2013
ponce
Oct 17, 2013
bearophile
Mar 26, 2014
Bienlein
October 16, 2013
Hello!

I am writing an unbiased raytracing renderer in D. I have made good progress, but I want to make it as fast as possible where I can do so without compromises.

I use a struct with three doubles for vector and color calculations, and I have operator overloading for them. Many vectors and colors are created during the tracing calculations.
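
Roughly like this, simplified (the real struct has more operations):

struct Vec3
{
    double x = 0, y = 0, z = 0;

    // element-wise +, -, * for vector/color math
    Vec3 opBinary(string op)(in Vec3 rhs) const
        if (op == "+" || op == "-" || op == "*")
    {
        return mixin("Vec3(x" ~ op ~ "rhs.x, y" ~ op ~ "rhs.y, z" ~ op ~ "rhs.z)");
    }

    // scaling by a scalar
    Vec3 opBinary(string op : "*")(double s) const
    {
        return Vec3(x * s, y * s, z * s);
    }
}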

I thought using classes might require too much memory, because they are not destroyed at scope end, and there might be a speed reduction when the GC kicks in.

Is my assumption correct that structs are the wiser choice in this case?

To avoid constructing many vectors and colors, I thought about using ref arguments, but I have also heard that functions with ref parameters are not inlined. What would generate the fastest code for a cross product, for example?
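
For the cross product, for example, I am weighing two variants like these (sketch):

// by value: returns a new Vec3, easy for the compiler to inline
Vec3 cross(in Vec3 a, in Vec3 b)
{
    return Vec3(a.y * b.z - a.z * b.y,
                a.z * b.x - a.x * b.z,
                a.x * b.y - a.y * b.x);
}

// with ref parameters: avoids the copies, writes into an existing result
void cross(ref const Vec3 a, ref const Vec3 b, ref Vec3 result)
{
    result.x = a.y * b.z - a.z * b.y;
    result.y = a.z * b.x - a.x * b.z;
    result.z = a.x * b.y - a.y * b.x;
}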

What compiler and compilation flags should I use to generate the fastest code? My main target is 64-bit machines, cross-platform. What optimizations can I assume from the various compilers? Are local variables that are used only once inlined, i.e. is it safe to extract local variables just to make the code easier to understand?

Thanks in advance!
Róbert László Páli
October 16, 2013
On 2013-10-16 14:02, "Róbert László Páli" wrote:
> Hello!
>
> I am writing an unbiased raytracing renderer in D. I have made good
> progress, but I want to make it as fast as possible where I can do so
> without compromises.
>
> I use a struct with three doubles for vector and color calculations,
> and I have operator overloading for them. Many vectors and colors are
> created during the tracing calculations.
>
> I thought using classes might require too much memory, because they
> are not destroyed at scope end, and there might be a speed reduction
> when the GC kicks in.
>
> Is my assumption correct that structs are the wiser choice in this case?
>
> To avoid constructing many vectors and colors, I thought about using
> ref arguments, but I have also heard that functions with ref
> parameters are not inlined. What would generate the fastest code for
> a cross product, for example?
>
> What compiler and compilation flags should I use to generate the
> fastest code? My main target is 64-bit machines, cross-platform. What
> optimizations can I assume from the various compilers? Are local
> variables that are used only once inlined, i.e. is it safe to extract
> local variables just to make the code easier to understand?

I would say use structs. For the compiler I would go with LDC or GDC; both of these are faster than DMD for floating point calculations. You can always benchmark.
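
For quick numbers something like this is enough (untested sketch; on older Phobos releases StopWatch lives directly in std.datetime):

import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writeln;

double work()
{
    // stand-in for the real render call
    double sum = 0;
    foreach (i; 0 .. 10_000_000)
        sum += i * 1e-7;
    return sum;
}

void main()
{
    auto sw = StopWatch(AutoStart.yes);
    auto r = work();
    writeln("took ", sw.peek.total!"msecs", " ms (result ", r, ")");
}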

-- 
/Jacob Carlborg
October 16, 2013
I find it critical to ensure all loops are unrolled in the basic vector ops (copy/arithmetic/dot etc.).
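
For example, for a dot product over a double[3] I'd write the second form rather than trust the optimizer with the first (sketch):

// looped version: same result, but it leans on the optimizer to unroll
double dotLoop(const double[3] a, const double[3] b)
{
    double s = 0;
    foreach (i; 0 .. 3)
        s += a[i] * b[i];
    return s;
}

// manually unrolled: no loop overhead even in unoptimized builds
double dotUnrolled(const double[3] a, const double[3] b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}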

On Wednesday, 16 October 2013 at 12:02:15 UTC, Róbert László Páli wrote:
> Hello!
>
> I am writing an unbiased raytracing renderer in D. I have made good progress, but I want to make it as fast as possible where I can do so without compromises.

October 16, 2013
On Wednesday, 16 October 2013 at 12:02:15 UTC, Róbert László Páli wrote:
> I thought using classes might require too much memory, because they are not destroyed at scope end, and there might be a speed reduction when the GC kicks in.
>
> Is my assumption correct that structs are the wiser choice in this case?

Yes, by all means use structs.


> What would generate the fastest code for a cross-product for example?

If you are on x86, SSE 4.1 introduced an instruction called DPPS which performs a dot product. Maybe you can force it into doing a cross-product with clever swizzles and masks.
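
Very roughly, and untested: with core.simd the portable way is an element-wise multiply plus a horizontal add; whether the compiler actually turns the horizontal part into DPPS depends on the compiler and flags.

import core.simd;

// element-wise multiply in one SSE instruction, then a horizontal add
float dot(float4 a, float4 b)
{
    float4 m = a * b;
    return m.array[0] + m.array[1] + m.array[2] + m.array[3];
}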
October 17, 2013
@Jacob Carlborg
> I would say use structs. For the compiler I would go with LDC or GDC; both of these are faster than DMD for floating point calculations. You can always benchmark.

Thank you for the advice!
I installed LDC and used ldmd2.
The benchmarks are amazing! :O

DMD > compile = 2503 > run = 26210
LDMD > compile = 3953 > run = 8935

These are in milliseconds,
benchmarked with the time command.
Both were compiled with the same flags:
-O -inline -release -noboundscheck

@finalpatch
> I find it critical to ensure all loops are unrolled in the basic vector ops (copy/arithmetic/dot etc.).

In these crucial parts I don't use loops;
I wrote these operations out by hand.
They are simply 3 named doubles.
But thanks for the advice.

@ponce
> If you are on x86, SSE 4.1 introduced an instruction called DPPS which performs a dot product. Maybe you can force it into doing a cross-product with clever swizzles and masks.

Could you give me a hint how this could
be implemented in D to use that dot product?
I am not experienced with such low-level programming.

And would you suggest trying to use
SIMD double4 for 3D vectors? It would
take some time to change the code.
October 17, 2013
Róbert László Páli:

> And would you suggest trying to use
> SIMD double4 for 3D vectors? It would
> take some time to change the code.

Using a double4 could improve the performance of your code, but it must be used wisely. (One general tip is to avoid mixing SIMD and serial code: if you want to use SIMD code, then it's often better to keep using SIMD registers even if you only have one value.)
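
An untested sketch of that idea, assuming a target where core.simd's double4 exists (it needs AVX, e.g. -mcpu=avx on DMD or -mcpu=native on LDC):

import core.simd;

// 3-D vector kept in a 4-lane register; the 4th lane is padding.
struct Vec3
{
    double4 v;   // x, y, z, unused

    this(double x, double y, double z)
    {
        v.array[0] = x;
        v.array[1] = y;
        v.array[2] = z;
        v.array[3] = 0;
    }

    // +, -, * stay entirely in SIMD registers
    Vec3 opBinary(string op)(in Vec3 rhs) const
        if (op == "+" || op == "-" || op == "*")
    {
        Vec3 r;
        r.v = mixin("v" ~ op ~ "rhs.v");
        return r;
    }

    // dot() drops back to scalar code: exactly the SIMD/serial
    // mixing to keep to a minimum
    double dot(in Vec3 rhs) const
    {
        double4 m = v * rhs.v;
        return m.array[0] + m.array[1] + m.array[2];
    }
}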

Bye,
bearophile
March 26, 2014
> Using a double4 could improve the performance of your code, but it
> must be used wisely. (One general tip is to avoid mixing SIMD and
> serial code: if you want to use SIMD code, then it's often better to
> keep using SIMD registers even if you only have one value.)

Sadly I could not get it to work properly, but the performance
seems good so far. Theoretically I would only need to adjust the
Vector struct and its operations (a small layer of the code; the rest
uses only the Vector type and the operations, not its internals).

In case you are interested:
http://palaes.rudanium.org/SubSpace/render.php
March 26, 2014
Oh, thanks for all of your help. Nice
to see that the D guys really do help. :)
March 26, 2014
You can also achieve significant speed-ups by doing things in parallel, for example see https://groups.google.com/forum/?hl=de#!searchin/golang-nuts/ray$20tracer/golang-nuts/mxYzHQSV3rw/dOA78aeVLgEJ
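
In D, std.parallelism gives you most of this for free. A minimal sketch (traceSample is just a placeholder for the real per-pixel work):

import std.parallelism : parallel;
import std.range : iota;

// placeholder for the real per-pixel tracing routine
float traceSample(int x, int y)
{
    return (x ^ y) * 1e-3f;
}

void renderRows(float[] image, int width, int height)
{
    // rows are independent, so each one can go to a worker thread
    foreach (y; parallel(iota(height)))
        foreach (x; 0 .. width)
            image[y * width + x] = traceSample(x, y);
}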
March 26, 2014
Thanks! I already trace the samples in parallel.
Strangely, I have a Core 2 Duo and it seems that using
3 threads is best (slightly better than 2), although
this might be accidental. Maybe the more complex
samples end up more evenly distributed across separate threads.
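
If you use std.parallelism, the thread count can be tuned explicitly; an untested sketch:

import std.parallelism : defaultPoolThreads, totalCPUs;

void main()
{
    // std.parallelism defaults to totalCPUs - 1 workers, and the thread
    // running the parallel foreach also does work, so a dual core uses
    // 2 threads by default. One extra worker gives the 3 threads above.
    defaultPoolThreads = totalCPUs;
}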