Thread overview
Memory corruption with -O3, but not -O2 (and not with DMD)
Aug 17, 2019
James Blachly
Aug 20, 2019
David Nadlinger
Aug 22, 2019
James Blachly
Aug 21, 2019
Kagamin
Aug 22, 2019
James Blachly
Aug 23, 2019
Kagamin
August 17, 2019
Hi all,

First, as always, thanks for LDC2, without which we couldn't write high-performance D software for our lab.

I've run into a problem wherein after ~60,000 iterations of a loop we get memory corruption, but only when building with LDC2 and -O3; there are no problems AFAICT with -O2, or when building optimized versions with DMD. -enable-inlining does not make a difference. None of this rules out a pointer or memory error on my part, but all seems well except with LDC2 -O3.

Debugging has been difficult because -O3 optimizes away a lot, so lldb cannot show me the tracking and debugging variables I need to isolate the problematic code. Disassembling, I've found the register storing the pointer to the corrupt string; interestingly, the correct string appears just slightly lower on the heap (maybe 32 bytes, IIRC).

The manifestation is slightly nondeterministic: adding tracking variables and code makes the problem intermittent.

Indeed, I placed some simple guards (e.g. if (pre_string != post_string) throw new Exception() ) near the place where the corrupt memory manifests. Sometimes the guard is triggered; other times it is _not_, but the bad string shows up just a few statements later (inside a function).

Am at my wits' end, so help or next steps are greatly appreciated. I can provide disassembly of whatever combination of -O/-O2/-O3 and triggering/nontriggering code blocks if it would be helpful.

If this should move to GitHub, let me know.

Kind regards
August 20, 2019
Dear James,

As mentioned by kinke elsewhere, this is pretty much impossible to track down from our end without more information.

Is the corruption deterministic across multiple runs of one particular executable? Putting a memory watchpoint to the address that gets corrupted might provide some extra clues as to where it comes from.

Having a look at the LLVM IR (-output-ll) might also be illuminating; I personally find it easier to read than assembly. In particular, you could use the LLVM `opt` tool to apply the -O3 passes one by one, compiling to object code, linking and testing every step of the way, and compare the IR before/after the pass that first introduces the crash. (The `bugpoint` tool has some support for this, but you might be quicker doing this manually.)
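
A rough sketch of how that search could be automated, under several assumptions. Instead of applying passes by hand, it uses LLVM's `-opt-bisect-limit` option of `opt`, which runs only the first N optimization steps, and binary-searches for the first N that misbehaves. The file names `suspect.ll` (unoptimized IR of the suspect module, e.g. from -output-ll), `rest.o` and `./repro.sh` (a script that exits non-zero when the corruption shows up) are hypothetical. Note also that `opt -O3` may not match LDC's own -O3 pipeline exactly, so the bug might not reproduce this way.

```shell
#!/bin/sh
# Sketch only: suspect.ll, rest.o, app_bisect and ./repro.sh are hypothetical names.
OPT="${OPT:-opt}"
LLC="${LLC:-llc}"
LDC="${LDC:-ldc2}"

# try_limit N: exit 0 if the program still behaves when only the first N
# optimization steps are applied to the suspect module.
# A failed build also counts as "bad", so check the log if results look odd.
try_limit() {
    "$OPT" -O3 -opt-bisect-limit="$1" suspect.ll -o suspect.bc 2>"bisect.$1.log" &&
    "$LLC" -filetype=obj suspect.bc -o suspect.o &&
    "$LDC" suspect.o rest.o -of=app_bisect &&
    ./repro.sh ./app_bisect
}

# first_bad LO HI CMD...: print the smallest N in (LO, HI] for which
# "CMD N" fails, assuming it succeeds at LO and fails at HI.
first_bad() {
    local lo="$1" hi="$2" mid
    shift 2
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if "$@" "$mid"; then lo="$mid"; else hi="$mid"; fi
    done
    echo "$hi"
}

# Usage (HI must be at least the total number of steps; running opt with
# -opt-bisect-limit=-1 prints every step without skipping any):
#   first_bad 0 5000 try_limit
```

Since the corruption is intermittent, `./repro.sh` would need to run the program several times before declaring a given N good, otherwise the search can converge on the wrong step.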

Best regards,
David


August 21, 2019
On Saturday, 17 August 2019 at 20:33:08 UTC, James Blachly wrote:
> Am at my wits' end, so help or next steps are greatly appreciated. I can provide disassembly of whatever combination of -O/-O2/-O3 and triggering/nontriggering code blocks if it would be helpful.

Did you check that only this block is responsible for it? Compile all code with -O2 and only this particular block with -O3 and see if it still happens.
August 21, 2019
On 8/21/19 2:29 AM, Kagamin wrote:
> On Saturday, 17 August 2019 at 20:33:08 UTC, James Blachly wrote:
>> Am at my wits' end, so help or next steps are greatly appreciated. I can provide disassembly of whatever combination of -O/-O2/-O3 and triggering/nontriggering code blocks if it would be helpful.
> 
> Did you check that only this block is responsible for it? Compile all code with -O2 and only this particular block with -O3 and see if it still happens.

I had never considered the possibility of optimizing different parts at different levels :-O  I guess you just break it out into a separate .d/.o file and link?
August 21, 2019
On 8/20/19 4:18 AM, David Nadlinger wrote:
> Dear James,
> 
> As mentioned by kinke elsewhere, this is pretty much impossible to track down from our end without more information.
> 
> Is the corruption deterministic across multiple runs of one particular executable? Putting a memory watchpoint to the address that gets corrupted might provide some extra clues as to where it comes from.
> 
> Having a look at the LLVM IR (-output-ll) might also be illuminating; I personally find it easier to read than assembly. In particular, you could use the LLVM `opt` tool to apply the -O3 passes one by one, compiling to object code, linking and testing every step of the way, and compare the IR before/after the pass that first introduces the crash. (The `bugpoint` tool has some support for this, but you might be quicker doing this manually.)
> 
> Best regards,
> David

Thank you, David. We will open up the repo soon whether we get the bug nailed or not. The corruption has been nondeterministic, which of course makes it all the more frustrating. I have not used the opt tool, nor had I considered examining the IR instead of the assembly -- thank you for that.
August 23, 2019
On Thursday, 22 August 2019 at 00:34:12 UTC, James Blachly wrote:
> I had never considered the possibility of optimizing different parts at different levels :-O  I guess you just break it out into a separate .d/.o file and link?

You can start at file granularity.
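
For illustration, file granularity could look roughly like the sketch below: every module is compiled separately at -O2 except one suspect, which gets -O3, and the objects are then linked. The function name `build_mixed`, the output name `app_mixed` and the example file names are made up; a real project will likely need extra flags (import paths, libraries), and file names containing spaces are not handled.

```shell
#!/bin/sh
# Sketch only: build_mixed and app_mixed are hypothetical names.
LDC="${LDC:-ldc2}"

# build_mixed SUSPECT SRC...: compile each SRC to its own object file,
# using -O3 for SUSPECT and -O2 for everything else, then link them.
build_mixed() {
    local suspect="$1" objs="" src obj level
    shift
    for src in "$@"; do
        obj="${src%.d}.o"
        if [ "$src" = "$suspect" ]; then level=-O3; else level=-O2; fi
        "$LDC" "$level" -c "$src" -of="$obj" || return 1
        objs="$objs $obj"
    done
    # $objs is deliberately unquoted so it splits into one word per object.
    "$LDC" $objs -of=app_mixed
}

# Usage, with hypothetical module names:
#   build_mixed parser.d main.d parser.d util.d
```

Rotating the suspect through the modules one at a time should show which file needs -O3 for the corruption to appear. Compiling modules separately can itself change inlining decisions compared with a single-invocation build, so a run that no longer reproduces the problem is not conclusive.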