Thread overview
Optimizing a raytracer
Oct 16, 2013
Jacob Carlborg
Oct 16, 2013
finalpatch
Oct 16, 2013
ponce
Oct 17, 2013
bearophile
Mar 26, 2014
Bienlein
October 16, 2013
Hello!

I am writing an unbiased raytracing renderer in D. I have made good progress, but I want to make it as fast as possible where I can do so without compromises.

I use a struct with three doubles for vector and color calculations, and I have operator overloading for them. Many vectors and colors are created during the tracing calculations.
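
Roughly like this, simplified (the real struct has more operations):

struct Vec3
{
    double x = 0, y = 0, z = 0;

    // element-wise +, -, * for vector/color math
    Vec3 opBinary(string op)(in Vec3 rhs) const
        if (op == "+" || op == "-" || op == "*")
    {
        return mixin("Vec3(x" ~ op ~ "rhs.x, y" ~ op ~ "rhs.y, z" ~ op ~ "rhs.z)");
    }

    // scaling by a scalar
    Vec3 opBinary(string op : "*")(double s) const
    {
        return Vec3(x * s, y * s, z * s);
    }
}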

I thought using classes might require too much memory, because they are not destroyed at scope end, and there might be a speed reduction when the GC kicks in.

Is my assumption correct that structs are the wiser choice in this case?

To avoid constructing many vectors and colors, I thought about using ref arguments, but I have also heard that functions with ref parameters are not inlined. What would generate the fastest code for a cross product, for example?
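
For the cross product, for example, I am weighing two variants like these (sketch):

// by value: returns a new Vec3, easy for the compiler to inline
Vec3 cross(in Vec3 a, in Vec3 b)
{
    return Vec3(a.y * b.z - a.z * b.y,
                a.z * b.x - a.x * b.z,
                a.x * b.y - a.y * b.x);
}

// with ref parameters: avoids the copies, writes into an existing result
void cross(ref const Vec3 a, ref const Vec3 b, ref Vec3 result)
{
    result.x = a.y * b.z - a.z * b.y;
    result.y = a.z * b.x - a.x * b.z;
    result.z = a.x * b.y - a.y * b.x;
}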

What compiler and compilation flags should I use to generate the fastest code? My main target is 64-bit machines, cross-platform. What optimizations can I assume from the various compilers? Are local variables that are used only once inlined, i.e. is it safe to extract local variables just to make the code easier to understand?

Thanks in advance!
Róbert László Páli
October 16, 2013
On 2013-10-16 14:02, "Róbert László Páli" wrote:
> Hello!
>
> I am writing an unbiased raytracing renderer in D. I have made good
> progress, but I want to make it as fast as possible where I can do so
> without compromises.
>
> I use a struct with three doubles for vector and color calculations,
> and I have operator overloading for them. Many vectors and colors are
> created during the tracing calculations.
>
> I thought using classes might require too much memory, because they
> are not destroyed at scope end, and there might be a speed reduction
> when the GC kicks in.
>
> Is my assumption correct that structs are the wiser choice in this case?
>
> To avoid constructing many vectors and colors, I thought about using
> ref arguments, but I have also heard that functions with ref
> parameters are not inlined. What would generate the fastest code for
> a cross product, for example?
>
> What compiler and compilation flags should I use to generate the
> fastest code? My main target is 64-bit machines, cross-platform. What
> optimizations can I assume from the various compilers? Are local
> variables that are used only once inlined, i.e. is it safe to extract
> local variables just to make the code easier to understand?

I would say use structs. For the compiler I would go with LDC or GDC; both of these are faster than DMD for floating point calculations. You can always benchmark.
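
For quick numbers something like this is enough (untested sketch; on older Phobos releases StopWatch lives directly in std.datetime):

import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writeln;

double work()
{
    // stand-in for the real render call
    double sum = 0;
    foreach (i; 0 .. 10_000_000)
        sum += i * 1e-7;
    return sum;
}

void main()
{
    auto sw = StopWatch(AutoStart.yes);
    auto r = work();
    writeln("took ", sw.peek.total!"msecs", " ms (result ", r, ")");
}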

-- 
/Jacob Carlborg
October 16, 2013
I find it critical to ensure all loops are unrolled in the basic vector ops (copy/arithmetic/dot etc.).
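
For example, for a dot product over a double[3] I'd write the second form rather than trust the optimizer with the first (sketch):

// looped version: same result, but it leans on the optimizer to unroll
double dotLoop(const double[3] a, const double[3] b)
{
    double s = 0;
    foreach (i; 0 .. 3)
        s += a[i] * b[i];
    return s;
}

// manually unrolled: no loop overhead even in unoptimized builds
double dotUnrolled(const double[3] a, const double[3] b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}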

On Wednesday, 16 October 2013 at 12:02:15 UTC, Róbert László Páli wrote:
> Hello!
>
> I am writing an unbiased raytracing renderer in D. I have made good progress, but I want to make it as fast as possible where I can do so without compromises.

October 16, 2013
On Wednesday, 16 October 2013 at 12:02:15 UTC, Róbert László Páli wrote:
> I thought using classes might require too much memory, because they are not destroyed at scope end, and there might be a speed reduction when the GC kicks in.
>
> Is my assumption correct that structs are the wiser choice in this case?

Yes, by all means use structs.


> What would generate the fastest code for a cross-product for example?

If you are on x86, SSE 4.1 introduced an instruction called DPPS which performs a dot product. Maybe you can force it into doing a cross-product with clever swizzles and masks.
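
Very roughly, and untested: with core.simd the portable way is an element-wise multiply plus a horizontal add; whether the compiler actually turns the horizontal part into DPPS depends on the compiler and flags.

import core.simd;

// element-wise multiply in one SSE instruction, then a horizontal add
float dot(float4 a, float4 b)
{
    float4 m = a * b;
    return m.array[0] + m.array[1] + m.array[2] + m.array[3];
}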
October 17, 2013
@Jacob Carlborg
> I would say use structs. For the compiler I would go with LDC or GDC; both of these are faster than DMD for floating point calculations. You can always benchmark.

Thank you for the advice!
I installed LDC and used ldmd2.
The benchmarks are amazing! :O

DMD > compile = 2503 > run = 26210
LDMD > compile = 3953 > run = 8935

These are in milliseconds,
benchmarked with the time command.
Both were compiled with the same flags:
-O -inline -release -noboundscheck

@finalpatch
> I find it critical to ensure all loops are unrolled in the basic vector ops (copy/arithmetic/dot etc.).

In these crucial parts I don't use loops;
I wrote these operations out by hand.
They are simply 3 named doubles.
But thanks for the advice.

@ponce
> If you are on x86, SSE 4.1 introduced an instruction called DPPS which performs a dot product. Maybe you can force it into doing a cross-product with clever swizzles and masks.

Could you give me a hint how this could
be implemented in D to use that dot product?
I am not experienced with such low-level programming.

And would you suggest trying to use
SIMD double4 for 3D vectors? It would
take some time to change the code.
October 17, 2013
Róbert László Páli:

> And would you suggest trying to use
> SIMD double4 for 3D vectors? It would
> take some time to change the code.

Using a double4 could improve the performance of your code, but it must be used wisely. (One general tip is to avoid mixing SIMD and serial code: if you want to use SIMD code, then it's often better to keep using SIMD registers even if you only have one value.)
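
An untested sketch of that idea, assuming a target where core.simd's double4 exists (it needs AVX, e.g. -mcpu=avx on DMD or -mcpu=native on LDC):

import core.simd;

// 3-D vector kept in a 4-lane register; the 4th lane is padding.
struct Vec3
{
    double4 v;   // x, y, z, unused

    this(double x, double y, double z)
    {
        v.array[0] = x;
        v.array[1] = y;
        v.array[2] = z;
        v.array[3] = 0;
    }

    // +, -, * stay entirely in SIMD registers
    Vec3 opBinary(string op)(in Vec3 rhs) const
        if (op == "+" || op == "-" || op == "*")
    {
        Vec3 r;
        r.v = mixin("v" ~ op ~ "rhs.v");
        return r;
    }

    // dot() drops back to scalar code: exactly the SIMD/serial
    // mixing to keep to a minimum
    double dot(in Vec3 rhs) const
    {
        double4 m = v * rhs.v;
        return m.array[0] + m.array[1] + m.array[2];
    }
}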

Bye,
bearophile
March 26, 2014
> Using a double4 could improve the performance of your code, but it
> must be used wisely. (One general tip is to avoid mixing SIMD and
> serial code: if you want to use SIMD code, then it's often better to
> keep using SIMD registers even if you only have one value.)

Sadly I could not get it to work properly, but the performance
seems good so far. Theoretically I would only need to adjust the
Vector struct and its operations (a small layer of the code; the rest
uses only the Vector type and the operations, not its internals).

In case you are interested:
http://palaes.rudanium.org/SubSpace/render.php
March 26, 2014
Oh, thanks for all of your help. Nice
to see that the D guys really do help. :)
March 26, 2014
You can also achieve significant speed-ups by doing things in parallel, for example see https://groups.google.com/forum/?hl=de#!searchin/golang-nuts/ray$20tracer/golang-nuts/mxYzHQSV3rw/dOA78aeVLgEJ
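
In D, std.parallelism gives you most of this for free. A minimal sketch (traceSample is just a placeholder for the real per-pixel work):

import std.parallelism : parallel;
import std.range : iota;

// placeholder for the real per-pixel tracing routine
float traceSample(int x, int y)
{
    return (x ^ y) * 1e-3f;
}

void renderRows(float[] image, int width, int height)
{
    // rows are independent, so each one can go to a worker thread
    foreach (y; parallel(iota(height)))
        foreach (x; 0 .. width)
            image[y * width + x] = traceSample(x, y);
}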
March 26, 2014
Thanks! I already trace the samples in parallel.
Strangely, I have a Core 2 Duo and it seems that using
3 threads is best (slightly better than 2), although
this might be accidental. Maybe the more complex
samples end up more evenly distributed across separate threads.
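
If you use std.parallelism, the thread count can be tuned explicitly; an untested sketch:

import std.parallelism : defaultPoolThreads, totalCPUs;

void main()
{
    // std.parallelism defaults to totalCPUs - 1 workers, and the thread
    // running the parallel foreach also does work, so a dual core uses
    // 2 threads by default. One extra worker gives the 3 threads above.
    defaultPoolThreads = totalCPUs;
}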