Memory corruption with -O3, but not -O2 (and not with DMD)

Aug 17, 2019

James Blachly

Aug 20, 2019

David Nadlinger

Aug 22, 2019

Aug 21, 2019

Aug 22, 2019

Aug 23, 2019

Hi all,

First, as always, thanks for LDC2, without which we couldn't write high performance D software for our lab.

I've run into a problem wherein after ~60,000 iterations of a loop we get memory corruption, but only when building with LDC2 and -O3; there are no problems AFAICT with -O2, or when building optimized versions with DMD. -enable-inlining does not make a difference. All of that being said, it does not rule out me making a pointer or memory error, but all seems well except with LDC2 -O3.

Debugging has been difficult because -O3 optimizes away a lot, and thus lldb is not able to show me the tracking/debugging variables I need to isolate the problematic code. Disassembling, I've found the register storing the pointer to the corrupt string; interestingly, the correct string appears just slightly lower on the heap (maybe 32 bytes IIRC).

The manifestation is slightly nondeterministic: adding tracking variables and code makes the problem intermittent.

Indeed, I placed some simple guards (e.g.: if (pre_string != post_string) throw new Exception() ) near the place where the corrupt memory manifests, and sometimes the guard is triggered, while other times it is _not_, but the bad string shows up just a few statements later (inside a function).

I am at my wits' end, so help or next steps are greatly appreciated. I can provide disassembly of whatever combination of -O/-O2/-O3 and triggering/nontriggering code blocks if it would be helpful.

If this should move to GitHub, let me know.

Kind regards

August 20, 2019

Re: Memory corruption with -O3, but not -O2 (and not with DMD)

Posted by David Nadlinger
in reply to James Blachly

Dear James,

As mentioned by kinke elsewhere, this is pretty much impossible to track down from our end without more information.

Is the corruption deterministic across multiple runs of one particular executable? Putting a memory watchpoint to the address that gets corrupted might provide some extra clues as to where it comes from.

Having a look at the LLVM IR (-output-ll) might also be illuminating; I personally find it easier to read than assembly. In particular, you could use the LLVM `opt` tool to apply the -O3 passes one by one, compiling to object code, linking and testing every step of the way, and compare the IR before/after the pass that first introduces the crash. (The `bugpoint` tool has some support for this, but you might be quicker doing this manually.)

Best regards,
David
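The pass-by-pass approach described above can be sketched as a small script. This is only an illustration, not something taken from the thread: `suspect.d` is a hypothetical module name, the four-entry `PASSES` list is a placeholder (a real -O3 pipeline is far longer, and LDC's pipeline may not match what `opt` runs on its own), and the `-passes=` syntax assumes a reasonably recent LLVM; older releases take individual pass flags instead. It also assumes `ldc2 -output-ll` writes `suspect.ll` next to the source. By default the script only prints the commands (`DRY_RUN=1`).

```shell
#!/bin/sh
# Sketch: run only the first N passes of a pass list over unoptimized IR,
# so the first N that reproduces the corruption can be narrowed down.
# suspect.d and PASSES are hypothetical placeholders.
DRY_RUN="${DRY_RUN:-1}"
PASSES="${PASSES:-sroa instcombine simplifycfg gvn}"

# Print the command in dry-run mode, execute it otherwise.
run() { if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi; }

# Join the first $1 entries of PASSES with commas.
first_passes() {
    n="$1"; out=""; i=0
    for p in $PASSES; do
        [ "$i" -ge "$n" ] && break
        out="${out:+$out,}$p"
        i=$((i + 1))
    done
    echo "$out"
}

build_with_first() {
    # Emit unoptimized LLVM IR for the suspect module.
    run ldc2 -c -O0 -output-ll suspect.d
    # Apply only the first $1 passes and keep the result for diffing.
    run opt -S "-passes=$(first_passes "$1")" suspect.ll -o "suspect.$1.ll"
    # Lower the transformed IR to an object file for linking and testing.
    run llc -filetype=obj "suspect.$1.ll" -o suspect.o
}

build_with_first "${1:-2}"
```

Keeping one `suspect.N.ll` per step means the IR just before and just after the first failing step can be compared with an ordinary `diff`.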

On 17 Aug 2019, at 21:33, James Blachly via digitalmars-d-ldc wrote:

> Hi all,
>
> First, as always, thanks for LDC2, without which we couldn't write high performance D software for our lab.
>
> I've run into a problem wherein after ~60,000 iterations of a loop we get memory corruption, but only when building with LDC2 and -O3; there are no problems AFAICT with -O2, or when building optimized versions with DMD. -enable-inlining does not make a difference. All of that being said, it does not rule out me making a pointer or memory error, but all seems well except with LDC2 -O3.
>
> Debugging has been difficult because -O3 optimizes away a lot and thus lldb is not able to show me the tracking debugging variables I need to isolate the problematic code. Disassembling, I've found the register storing pointer to the corrupt string; interestingly, the correct string appears just slightly lower on the heap (maybe 32 bytes IIRC).
>
> Manifestation slightly nondeterministic -- adding tracking variables and code makes the problem intermittent.
>
> Indeed, I placed some simple guards (e.g.: if (pre_string != post_string) throw new Exception() ) near the place where the corrupt memory is manifest and sometimes the guard is triggered, while other times it is _not_, but the bad string shows up just a few statements later (inside a function).
>
> Am at my wits' end, so help or next steps are greatly appreciated. I can provide disassembly of whatever combination of -O/-O2/-O3 and triggering/nontriggering code blocks if it would be helpful.
>
> If this should move to github, let me know.
>
> Kind regards

On Saturday, 17 August 2019 at 20:33:08 UTC, James Blachly wrote:

> Am at my wits' end, so help or next steps are greatly appreciated. I can provide disassembly of whatever combination of -O/-O2/-O3 and triggering/nontriggering code blocks if it would be helpful.

Did you check that only this block is responsible for it? Compile all code with -O2 and only this particular block with -O3 and see if it still happens.

On 8/21/19 2:29 AM, Kagamin wrote:

> On Saturday, 17 August 2019 at 20:33:08 UTC, James Blachly wrote:
>> Am at my wits' end, so help or next steps are greatly appreciated. I can provide disassembly of whatever combination of -O/-O2/-O3 and triggering/nontriggering code blocks if it would be helpful.
>
> Did you check that only this block is responsible for it? Compile all code with -O2 and only this particular block with -O3 and see if it still happens.

I had never considered the possibility of optimizing different parts at different levels :-O I guess you just break it out into a separate .d/.o file and link?

On 8/20/19 4:18 AM, David Nadlinger wrote:

> Dear James,
>
> As mentioned by kinke elsewhere, this is pretty much impossible to track down from our end without more information.
>
> Is the corruption deterministic across multiple runs of one particular executable? Putting a memory watchpoint to the address that gets corrupted might provide some extra clues as to where it comes from.
>
> Having a look at the LLVM IR (-output-ll) might also be illuminating; I personally find it easier to read than assembly. In particular, you could use the LLVM `opt` tool to apply the -O3 passes one by one, compiling to object code, linking and testing every step of the way, and compare the IR before/after the pass that first introduces the crash. (The `bugpoint` tool has some support for this, but you might be quicker doing this manually.)
>
> Best regards,
> David

Thank you, David. We will open up the repo soon whether we get the bug nailed or not.

The corruption has been nondeterministic, which of course makes it all the more frustrating. I have not used the opt tool, nor had I considered examining the IR instead of the assembly -- thank you for that.

On Thursday, 22 August 2019 at 00:34:12 UTC, James Blachly wrote:

> I had never considered the possibility of optimizing different parts at different levels :-O I guess you just break it out into a separate .d/.o file and link?

You can start at file granularity.
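Starting at file granularity might look roughly like the script below: every module is compiled to its own object file at -O2, except one suspect module at -O3, and the objects are then linked. The file names `app.d`, `util.d` and `suspect.d` and the output name `app_mixed` are hypothetical, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: per-file optimization levels, assuming hypothetical modules
# app.d, util.d and suspect.d. Only SUSPECT is built at -O3.
DRY_RUN="${DRY_RUN:-1}"
SUSPECT="${SUSPECT:-suspect.d}"
SOURCES="${SOURCES:-app.d util.d suspect.d}"

# Print the command in dry-run mode, execute it otherwise.
run() { if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi; }

# Pick the optimization flag for one source file.
opt_level_for() {
    if [ "$1" = "$SUSPECT" ]; then echo "-O3"; else echo "-O2"; fi
}

build_mixed() {
    objs=""
    for src in $SOURCES; do
        obj="${src%.d}.o"
        # Compile each module separately so it can have its own -O level.
        run ldc2 -c "$(opt_level_for "$src")" "$src" "-of=$obj"
        objs="$objs $obj"
    done
    # Link all object files into one executable.
    run ldc2 $objs -of=app_mixed
}

build_mixed
```

Changing `SUSPECT` and rebuilding moves the -O3 build to a different module. Compiling modules separately can also change what gets inlined across module boundaries, so a bug that disappears in such a build is a hint rather than proof.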
