Thread overview
Compiler optimization breaks multi-threaded code
Nov 14, 2010
Michal Minich
Nov 14, 2010
Sean Kelly
Nov 15, 2010
Kagamin
Nov 16, 2010
Sean Kelly
Nov 16, 2010
stephan
November 14, 2010
There is one question on SO which seems like a serious problem for atomic ops.

http://stackoverflow.com/questions/4165149/compiler-optimization-breaks-multi-threaded-code

in short:

shared uint cnt;
void atomicInc  ( ) { uint o; while ( !cas( &cnt, o, o + 1 ) ) o = cnt; }

is compiled with dmd -O to something like:

shared uint cnt;
void atomicInc  ( ) { while ( !cas( &cnt, cnt, cnt + 1 ) ) { } }

see the web page for details.
November 14, 2010
Michal Minich Wrote:

> There is one question on SO which seems like a serious problem for atomic ops.
> 
> http://stackoverflow.com/questions/4165149/compiler-optimization-breaks-multi-threaded-code
> 
> in short:
> 
> shared uint cnt;
> void atomicInc  ( ) { uint o; while ( !cas( &cnt, o, o + 1 ) ) o = cnt; }
> 
> is compiled with dmd -O to something like:
> 
> shared uint cnt;
> void atomicInc  ( ) { while ( !cas( &cnt, cnt, cnt + 1 ) ) { } }
> 
> see the web page for details.

What a mess.  DMD isn't supposed to optimize across asm blocks.
November 15, 2010
Sean Kelly Wrote:

> > shared uint cnt;
> > void atomicInc  ( ) { uint o; while ( !cas( &cnt, o, o + 1 ) ) o = cnt; }
> > 
> > is compiled with dmd -O to something like:
> > 
> > shared uint cnt;
> > void atomicInc  ( ) { while ( !cas( &cnt, cnt, cnt + 1 ) ) { } }
> What a mess.  DMD isn't supposed to optimize across asm blocks.

There are no asm blocks in the code. It's a violation of the contract for shared data access.
November 16, 2010
Kagamin <spam@here.lot> wrote:
> Sean Kelly Wrote:
> 
>>> shared uint cnt;
>>> void atomicInc  ( ) { uint o; while ( !cas( &cnt, o, o + 1 ) ) o = cnt; }
>>> 
>>> is compiled with dmd -O to something like:
>>> 
>>> shared uint cnt;
>>> void atomicInc  ( ) { while ( !cas( &cnt, cnt, cnt + 1 ) ) { } }
>> What a mess.  DMD isn't supposed to optimize across asm blocks.
> 
> There are no asm blocks in the code. It's a violation of the contract for shared data access.

cas() contains an asm block. Though I guess in this case the compiler isn't actually optimizing across it. Does atomic!"+="(&cnt, 1) work correctly?  I know the issue with shared would still have to be fixed, but that code uses asm for the load as well, so it probably won't be optimized the same way.
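
For anyone who wants to try the atomicOp route, here is a minimal sketch. It assumes the core.atomic spelling atomicOp!"+="(cnt, 1), which takes the shared variable by reference rather than the pointer form written above. The runDemo driver and its parameters are made up for illustration, and nothing here says how a particular dmd release optimizes it.

```d
import core.atomic : atomicLoad, atomicOp;
import core.thread : Thread;

shared uint cnt;

// Increment the shared counter with a single library call, so there is
// no hand-written load + cas loop for the optimizer to rearrange.
void atomicInc()
{
    atomicOp!"+="(cnt, 1);
}

// Hypothetical driver (not from the thread): start `threads` threads,
// have each call atomicInc() `perThread` times, then read the counter.
uint runDemo(uint threads, uint perThread)
{
    Thread[] pool;
    foreach (i; 0 .. threads)
    {
        auto t = new Thread({
            foreach (j; 0 .. perThread)
                atomicInc();
        });
        t.start();
        pool ~= t;
    }
    foreach (t; pool)
        t.join();
    return atomicLoad(cnt);
}
```

If cnt starts at zero, runDemo should return threads * perThread when no increment is lost.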
November 16, 2010
On 16.11.2010 18:09, Sean Kelly wrote:
> cas() contains an asm block. Though I guess in this case the compiler
> isn't actually optimizing across it. Does atomic!"+="(&cnt, 1) work
> correctly?  I know the issue with shared would still have to be fixed,
> but that code uses asm for the load as well, so it probably won't be
> optimized the same way.

Thanks for looking into the issue around here. Just three comments from my side, Sean.

Disclaimer: based on a couple of hours chasing a bug and not much D experience (but some optimizing C++ compiler experience - so the issue looked familiar :-) )

1) atomicOp is not affected. It reads memory only once per function call. Whether that read comes from a local variable that was loaded from a global, or directly from the global, doesn't really matter (except for timing, maybe).

2) You are right, the compiler seems to not optimize across asm statements. So, the example can be fixed with the following hack:
    void atomicInc  ( ) {
        uint o;
        while ( !cas( &cnt, o, o + 1 ) ) {
            asm { nop; } o = cnt;
        }
    }
This is, however, more brittle than it looks, because it is not always clear what "optimizing across an asm block" means. This version has the issue again:
    void atomicInc  ( ) {
        uint o = cnt;
        do {
            asm { nop; } o = cnt;
        } while ( !cas( &cnt, o, o + 1 ) );
    }
While this case might look somewhat obvious, I encountered some problems in more complex code, and finally went for the all-inline-assembler solution to be on the safe side.

3) During my debugging, I believe I saw the optimizer re-ordering not only reads of shared variables, but also writes to shared variables. IIRC, my Dekker example on SO (which fails because of the missing s/l/mfence instructions) also shows a re-ordering of the lines
        cnt++;
        turn2 = true;
        flag1 = false;
into
        turn2 = true;
        cnt++;
        flag1 = false;
which is not really important in this case, but might introduce another bug if I were prepared to live with the risk of starvation (and removed turn2). If the compiler still re-ordered the statements then (I haven't tested this), cnt++ would end up outside of the critical section.
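
As a library-level counterpart to the all-inline-assembler approach from point 2, the reload in the loop can be sketched with core.atomic.atomicLoad instead of a plain `o = cnt`. This is only a sketch: the name atomicIncCas is made up here, and whether the dmd optimizer discussed in this thread leaves it alone is an assumption that has not been tested.

```d
import core.atomic : atomicLoad, cas;

shared uint cnt;

// CAS-based increment where every read of cnt goes through core.atomic.
// atomicIncCas is a hypothetical name, not something from the thread.
void atomicIncCas()
{
    uint o;
    do
    {
        // Explicit atomic load on every iteration, in place of the
        // plain `o = cnt` read that got folded into the cas() arguments.
        o = atomicLoad(cnt);
    } while (!cas(&cnt, o, o + 1)); // retry if cnt changed in between
}
```

The loop only exits once cas() has seen cnt equal to the value just loaded, so each successful call adds exactly one to cnt.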

Hope this helps & cheers,
Stephan