Requesting Help with Optimizing Code
April 08, 2021 (Kyle Ingraham)

Hello all. I have been working on learning computational photography and have been using D to do that. I recently got some code running that performs chromatic adaptation (white balancing). The output is still not ideal (image is overexposed) but it does correct color casts. The issue I have is with performance. With a few optimizations found with profiling I have been able to drop processing time from ~10.8 to ~6.2 seconds for a 16 megapixel test image. That still feels like too long however. Image editing programs are usually much faster.

The optimizations that I've implemented:

Is there anything else I can do to improve performance?

I tested the code under the following conditions:

  • Compiled with dub build --build=release --compiler=ldmd2
  • dub v1.23.0, ldc v1.24.0
  • Intel Xeon W-2170B 2.5GHz (4.3GHz turbo)
  • Test image
  • Test code:
#!/usr/bin/env dub
/+ dub.sdl:
    name "photog-test"
    dependency "photog" version="~>0.1.1-alpha"
    dependency "jpeg-turbod" version="~>0.2.0"
+/

import std.datetime.stopwatch : AutoStart, StopWatch;
import std.file : read, write;
import std.stdio : writeln, writefln;

import jpeg_turbod;
import mir.ndslice : reshape, sliced;

import photog.color : chromAdapt, Illuminant, rgb2Xyz;
import photog.utils : imageMean, toFloating, toUnsigned;

void main()
{
    const auto jpegFile = "image-in.jpg";
    auto jpegInput = cast(ubyte[]) jpegFile.read;

    auto dc = new Decompressor();
    ubyte[] pixels;
    int width, height;
    bool decompressed = dc.decompress(jpegInput, pixels, width, height);

    if (!decompressed)
    {
        dc.errorInfo.writeln;
        return;
    }

    auto image = pixels.sliced(height, width, 3).toFloating;

    int err;
    double[] srcIlluminant = image
        .imageMean
        .reshape([1, 1, 3], err)
        .rgb2Xyz
        .field;
    assert(err == 0);

    auto sw = StopWatch(AutoStart.no);

    sw.start;
    auto ca = chromAdapt(image, srcIlluminant, Illuminant.d65).toUnsigned;
    sw.stop;

    auto timeTaken = sw.peek.split!("seconds", "msecs");
    writefln("%d.%03d seconds", timeTaken.seconds, timeTaken.msecs);

    auto c = new Compressor();
    ubyte[] jpegOutput;
    bool compressed = c.compress(ca.field, jpegOutput, width, height, 90);

    if (!compressed)
    {
        c.errorInfo.writeln;
        return;
    }

    "image-out.jpg".write(jpegOutput);
}

Functions found through profiling to be taking most time:

A profile for the test code is here. The trace.log.dot file can be viewed with xdot. The PDF version is here. The profile was generated using:

  • Compiled with dub build --build=profile --compiler=ldmd2
  • Visualized with profdump - dub run profdump -- -f -d -t 0.1 trace.log trace.log.dot

The branch containing the optimized code is here: https://github.com/kyleingraham/photog/tree/up-chromadapt-perf
The corresponding release is here: https://github.com/kyleingraham/photog/releases/tag/v0.1.1-alpha

If you've gotten this far thank you so much for reading. I hope there's enough information here to ease thinking about optimizations.

April 08, 2021 (tsbockman)

On Thursday, 8 April 2021 at 01:24:23 UTC, Kyle Ingraham wrote:

> The issue I have is with performance.

This belongs in the "Learn" forum, I think.

> With a few optimizations found with profiling I have been able to drop processing time from ~10.8 to ~6.2 seconds for a 16 megapixel test image. That still feels like too long however. Image editing programs are usually much faster.

I agree, intuitively that sounds like there is a lot of room for further optimization.

> Functions found through profiling to be taking most time:

This is very high level code, of a sort which depends heavily on inlining, loop unrolling and/or auto-vectorization, and a few other key optimizations to perform well. So, I really can't tell which parts are likely to compile down to good ASM just from skimming it.

Personally, I try to stick to static foreach and normal runtime loops for the inner loops of really CPU-intensive stuff like this, so that it's easier to see from the source code what the ASM will look like. With inlining and value range propagation it's definitely possible to get good performance out of range-based or functional-style code, but it's harder because the D source code is that much further away from what the CPU will actually do.

All that said, I can offer a few tips which may or may not help here:

The only concrete floating-point type I see mentioned in those functions is double, which is almost always extreme overkill for color calculations. You should probably be using float (32 bits per channel), instead. Even half-precision (16 bits per channel) is more than the human eye can distinguish. The extra 13 bits of precision offered by float should be more than sufficient to absorb any rounding errors.

Structure your loops (or functional equivalents) such that the "for each pixel" loop is the outermost, or as far out as you can get it. Do as much as you can with each pixel while it is still in registers or L1 cache. Otherwise, you may end up bottle-necked by memory bandwidth.

Make sure you have optimizations enabled, especially cross module inlining, -O3, the most recent SIMD instruction set you are comfortable requiring (almost everyone in the developed world now has Intel SSE4 or ARM Neon), and fused multiply add/subtract.
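With LDC, for example, that might look something like the following (flag spellings are LDC's; treat it as a sketch rather than a verified recipe, and note that -ffast-math also enables FMA contraction):

ldc2 -O3 -release -mattr=+sse4.2 -ffast-math -enable-cross-module-inlining app.d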

There will always be three primary colors throughout the entire maintenance life of your code, so just go ahead and specialize for that. You don't need a generic matrix multiplication algorithm, for instance. A specialized 3x3 or 4x4 version could be branchless and SIMD accelerated.
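To make the last two tips concrete, here is a minimal sketch (not photog's actual code, and the function name is made up) of a single pass over packed RGB data with the 3x3 colour-matrix multiply fully unrolled via static foreach:

void applyColorMatrix(float[] pixels, const float[3][3] m)
{
    assert(pixels.length % 3 == 0, "expects packed RGB data");

    // "For each pixel" is the outermost (and only) runtime loop.
    for (size_t i = 0; i < pixels.length; i += 3)
    {
        const float[3] inPx = [pixels[i], pixels[i + 1], pixels[i + 2]];
        float[3] outPx = 0.0f;

        // Fully unrolled 3x3 multiply: no generic matmul, no inner runtime loop,
        // so everything can stay in registers.
        static foreach (r; 0 .. 3)
            static foreach (c; 0 .. 3)
                outPx[r] += m[r][c] * inPx[c];

        pixels[i .. i + 3] = outPx[];
    }
}

With -O3 and a reasonable -mcpu/-mattr setting, LDC should turn the nine multiply-adds into straight-line SIMD code; whether it also vectorizes across pixels is something you'd have to confirm in the disassembly.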

April 08, 2021 (Max Haughton)

On Thursday, 8 April 2021 at 01:24:23 UTC, Kyle Ingraham wrote:

> [...] With a few optimizations found with profiling I have been able to drop processing time from ~10.8 to ~6.2 seconds for a 16 megapixel test image. That still feels like too long however. Image editing programs are usually much faster. [...]

I am away from a proper dev machine at the moment so I can't really delve into this in much detail, but some thoughts:

Are you making the compiler aware of your machine? Although the obvious point here is vector width (you have AVX-512 from what I can see, however I'm not sure if this is actually a win or not on Skylake W), the compiler is also forced to use a completely generic scheduling model which may or may not yield good code for your processor. For LDC, you'll want -mcpu=native. Also use cross module inlining if you aren't already.
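For the dub single-file setup in the original post, one way to pass those through (untested sketch; the dflags spellings are LDC's, so this assumes an LDC-based compiler) would be to extend the embedded dub.sdl:

/+ dub.sdl:
    name "photog-test"
    dependency "photog" version="~>0.1.1-alpha"
    dependency "jpeg-turbod" version="~>0.2.0"
    dflags "-mcpu=native" "-enable-cross-module-inlining"
+/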

I also notice that in your hot code you are using function contracts. When contracts are being checked, LDC actually does use them to optimize the code, but in release builds, when they are turned off, this is not the case. If you are happy to assume that they hold in release builds (I believe you should, at least), you can give the compiler this additional information via an intrinsic or similar (with LDC I just use inline IR; on GDC you have __builtin_unreachable()). LDC will assume asserts in release builds as far as I have tested, however this still invokes a branch (just not to a proper assert handler).

Via pragma(inline, true) you can wrap this idea into a little "function" like this:

// Declare LDC's inline IR hook so we can emit a raw `unreachable` instruction.
pragma(LDC_inline_ir)
    R inlineIR(string s, R, P...)(P);

pragma(inline, true)
void assume()(bool condition)
{
    // Flesh this out if you want to actually do something in debug builds.
    if (!condition)
    {
        inlineIR!("unreachable;", void)();
    }
}

So you call this with information you know about your code, and the optimizer will assume that the condition is true. In your case it looks like the arrays are supposed to be of length 3, so if you let the optimizer assume this it can yield a much, much better function: compilers these days will usually generate a big unrolled loop using the full vector width of the machine, plus some smaller ones that (if you put your modular arithmetic hat on) make up the difference for the whole array, but in your case you might just want 3 movs. Using this method can also yield much more aggressive autovectorization for some loops.
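For example, at the top of a hot function (hypothetical signature, just to show where the calls go):

// Names are illustrative, not photog's actual API.
void xyzToRgbPixel(const double[] xyz, double[] rgb)
{
    assume(xyz.length == 3);
    assume(rgb.length == 3);
    // The optimizer can now drop the generic "any length" loop tails and
    // bounds checks, and unroll to exactly three lanes.
    // ... actual conversion ...
}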

Helping the optimizer in this manner will often not yield dramatic performance increases, at least in throughput, but it will yield code that is smaller and has a simpler control flow graph, and both of those are kind to latency and your instruction cache. (The I$ is quite big in 2021, but processors now use a so-called L0 cache for decoded micro-ops which is not very big, and which can yield counterintuitive results on admittedly contrived loop examples when the tricks the CPU uses to make typical code faster cease to apply.) I think AMD Zen has some interesting loop alignment requirements, but I could be misremembering.

As for the usual performance traps, you'll need to look at the program running in perf, or VTune if you can be bothered (VTune is like a rifle to perf's air pistol, but it is enormous on disk; it does, however, understand D code pretty well, so I am finding it quite interesting to play with).

April 08, 2021 (tsbockman)

On Thursday, 8 April 2021 at 03:27:12 UTC, Max Haughton wrote:

> Although the obvious point here is vector width (you have AVX-512 from what I can see, however I'm not sure if this is actually a win or not on Skylake W)

From what I've seen, LLVM's code generation and optimization for AVX-512 auto-vectorization is still quite bad and immature compared to AVX2 and earlier, and the wider the SIMD register the more that data structures and algorithms have to be specifically tailored to really benefit from them. Also, using AVX-512 instructions forces the CPU to downclock.

So, I wouldn't expect much benefit from AVX-512 for the time being, unless you're going to hand optimize for it.

> For LDC, you'll want `-mcpu=native`.

Only do this if you don't care about the binary working on any CPU but your own. Otherwise, you need to look at something like the Steam Hardware survey and decide what percentage of the market you want to capture (open the "Other Settings" section): https://store.steampowered.com/hwsurvey

April 08, 2021 (Max Haughton)

On Thursday, 8 April 2021 at 03:45:06 UTC, tsbockman wrote:

> From what I've seen, LLVM's code generation and optimization for AVX-512 auto-vectorization is still quite bad and immature compared to AVX2 and earlier [...] Also, using AVX-512 instructions forces the CPU to downclock.
>
> Only do this if you don't care about the binary working on any CPU but your own. [...]

You can do multiversioning fairly easily these days.
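For reference, a low-tech way to do it without any UDA support is to pick an implementation once at startup with core.cpuid. A rough sketch (function names are made up, and the AVX2 kernel would need @target("avx2") from ldc.attributes or hand-written intrinsics):

import core.cpuid : avx2;

void chromAdaptBaseline(float[] pixels) { /* portable build */ }
void chromAdaptAvx2(float[] pixels) { /* AVX2-targeted build */ }

// One process-wide function pointer, chosen from what the running CPU reports.
__gshared void function(float[]) chromAdaptImpl;

shared static this()
{
    chromAdaptImpl = avx2 ? &chromAdaptAvx2 : &chromAdaptBaseline;
}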

And AVX-512 downclocking can be quite complicated; I have seen benchmarks where one can still achieve a decent speedup even with downclocking. At the very least it's worth profiling. The reason I brought up Skylake W specifically is that some of the earlier parts actually emulated the 512-bit vector instructions rather than having proper support in the functional units, IIRC.

D needs finer grained control of the optimizer inside loops - e.g. I don't care about inlining writeln, but if something doesn't get inlined inside a hot loop you're fucked.

April 08, 2021 (Guillaume Piolat)

On Thursday, 8 April 2021 at 01:24:23 UTC, Kyle Ingraham wrote:

> Is there anything else I can do to improve performance?

  • profiling?
  • you can use _mm_pow_ps in intel-intrinsics package to have 4x pow at once for the price of one.
April 08, 2021 (Guillaume Piolat)

On Thursday, 8 April 2021 at 14:17:09 UTC, Guillaume Piolat wrote:

> • profiling?
> • you can use _mm_pow_ps in intel-intrinsics package to have 4x pow at once for the price of one.

Also, if you don't need the precision: always use powf instead of the double-precision pow, expf instead of exp, etc. Otherwise you pay extra.
In particular, llvm_pow with a float argument is a lot faster.
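For example, with LDC (sketch only; the 1/2.2 exponent is just an illustrative gamma value, not photog's actual curve):

import ldc.intrinsics : llvm_pow;

// Gamma-encode one channel entirely in single precision; no round trip
// through a double-precision pow.
float gammaEncode(float linear)
{
    return llvm_pow(linear, 1.0f / 2.2f);
}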

April 08, 2021 (Iain Buclaw)

On Thursday, 8 April 2021 at 03:58:36 UTC, Max Haughton wrote:

> You can do multiversioning fairly easily these days.

Perhaps a @target_clones UDA in {gcc/ldc}.attributes. ;-)

April 08, 2021 (Kyle Ingraham)

On Thursday, 8 April 2021 at 03:25:39 UTC, tsbockman wrote:

> > The issue I have is with performance.
>
> This belongs in the "Learn" forum, I think.

In hindsight you are right here. I'll divert this sort of post there in the future.

> Personally, I try to stick to static foreach and normal runtime loops for the inner loops of really CPU-intensive stuff like this, so that it's easier to see from the source code what the ASM will look like.

Are compilers able to take loops and parallelize them?

> The only concrete floating-point type I see mentioned in those functions is double, which is almost always extreme overkill for color calculations.

Great tip here. I'll think about the needs for the data I'm working on before choosing the first type on the shelf.

> Structure your loops (or functional equivalents) such that the "for each pixel" loop is the outermost, or as far out as you can get it. Do as much as you can with each pixel while it is still in registers or L1 cache. Otherwise, you may end up bottle-necked by memory bandwidth.
>
> Make sure you have optimizations enabled, especially cross module inlining, -O3, the most recent SIMD instruction set you are comfortable requiring (almost everyone in the developed world now has Intel SSE4 or ARM Neon), and fused multiply add/subtract.

Are there any sources you would recommend for learning more about these techniques e.g. code bases, articles etc.? I'll see how far I can get with creative googling.

> There will always be three primary colors throughout the entire maintenance life of your code, so just go ahead and specialize for that. You don't need a generic matrix multiplication algorithm, for instance. A specialized 3x3 or 4x4 version could be branchless and SIMD accelerated.

I had thought about this for one of my functions but didn't think to extend it further to a function I can use everywhere. I'll do that.

Thank you for taking the time to write this up. I'm sure these tips will go a long way for improving performance.

April 08, 2021 (Bastiaan Veelo)

On Thursday, 8 April 2021 at 16:37:57 UTC, Kyle Ingraham wrote:

> Are compilers able to take loops and parallelize them?

No, but you can quite easily:

import std.math : log;
import std.parallelism : parallel;

// Find the logarithm of every number from
// 1 to 1_000_000 in parallel, using the
// default TaskPool instance.
auto logs = new double[1_000_000];

foreach (i, ref elem; parallel(logs))
{
    elem = log(i + 1.0);
}

https://dlang.org/phobos/std_parallelism.html#.parallel
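Applied to the image case, you could parallelise over rows so each work unit is a decent-sized chunk. A rough sketch (names are illustrative; processRow stands in for whatever per-row work chromAdapt does):

import std.parallelism : parallel;

void adaptAllRows(float[][] rows, void delegate(float[]) processRow)
{
    // Each row becomes one work item on the default TaskPool.
    foreach (row; parallel(rows))
    {
        processRow(row);
    }
}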

-- Bastiaan.
