Tricky DMD bug, but I have no idea how to report
December 17
Hey guys,

while working on my game engine project, I encountered a DMD codegen bug. It occurs only when compiling in release mode, debug works. Unfortunately I am unable to minimize the code, since it's quite a bit of code, and changing the code changes the bug occurrence. Basically my faulty piece of code looks like this

class Texture2D {}

auto a = new Texture2D();
auto b = new Texture2D();
auto c = new Texture2D();
Texture2D[int] textureBindings;
writeln(a, b, c);
textureBindings[0] = a;
textureBindings[1] = b;
textureBindings[2] = c;
writeln(textureBindings);

and the output is:

Texture2DTexture2DTexture2D
[0:null, 2:null, 1:null]

I'd expect it to output:

Texture2DTexture2DTexture2D
[0:Texture2D, 2:Texture2D, 1:Texture2D]

depending on what I change around this code, for example changing it to

writeln(a, " ", b, " ", c);

results in output of:

Texture2D Texture2D Texture2D
[0:Texture2D, 2:null, 1:null]

It feels completely random. Adding or removing calls completely unrelated to these lines changes the result. My guess is that the compiler somehow reorders the calls incorrectly, changing the semantics. Trick is, LDC works correctly and produces the expected result, both when compiling in debug and release mode.

I tried to play around with assoc arrays on run.dlang.io but could never reproduce it. It has something to do with the way my code works and possibly interacts with other C libraries. Does anyone have an idea what it could be and how to reproduce it so that it can be reported and fixed? For now, I'll just switch to LDC, but I feel bad leaving a possible bug intact and unreported.

This is with DMD32 D Compiler v2.083.1, on Windows, x86_64 compilation target.
December 17
On Mon, Dec 17, 2018 at 09:59:59PM +0000, JN via Digitalmars-d-learn wrote: [...]
> class Texture2D {}
> 
> auto a = new Texture2D();
> auto b = new Texture2D();
> auto c = new Texture2D();
> Texture2D[int] textureBindings;
> writeln(a, b, c);
> textureBindings[0] = a;
> textureBindings[1] = b;
> textureBindings[2] = c;
> writeln(textureBindings);
> 
> and the output is:
> 
> Texture2DTexture2DTexture2D
> [0:null, 2:null, 1:null]
> 
> I'd expect it to output:
> 
> Texture2DTexture2DTexture2D
> [0:Texture2D, 2:Texture2D, 1:Texture2D]
> 
> depending on what I change around this code, for example changing it to
> 
> writeln(a, " ", b, " ", c);
> 
> results in output of:
> 
> Texture2D Texture2D Texture2D
> [0:Texture2D, 2:null, 1:null]

Ah, a pointer bug.  Lovely. :-/

My first guess is that you have a bunch of references to local variables that have gone out of scope.


> It feels completely random. Adding or removing calls completely unrelated to these lines changes the result.

Typical symptoms of a pointer bug of some kind.  Could be an uninitialized pointer, if you have used `T* p = void;` anywhere.


> My guess is that the compiler somehow reorders the calls incorrectly, changing the semantics.

Possible, but unlikely.  My bet is that you have dangling pointers, most likely to local variables that have gone out of scope.  Perhaps somewhere in the code you ran into the evil implicit conversion of static arrays into slices, which results in dangling pointers if said slice persists beyond the lifetime of the static array.

Another likely candidate is that if you're calling C/C++ libraries somewhere in your code, you may have passed in a wrong size, perhaps a byte count where an array length ought to be used, or vice versa, and as a result you got a buffer overrun.  I ran into similar bugs when writing OpenGL code.
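
To make the length mix-up concrete, here is a minimal sketch (not taken from the code in question) of the two quantities that are easy to confuse when calling into C. `byteSize` is a hypothetical helper introduced only for this illustration; `memcpy` is the standard C function as declared in `core.stdc.string`.

```d
import core.stdc.string : memcpy;

// Hypothetical helper: size of a slice's contents in bytes.
size_t byteSize(T)(const(T)[] a)
{
    return a.length * T.sizeof;
}

void main()
{
    float[3] src = [1.0f, 2.0f, 3.0f];
    float[3] dst;

    // src.length is the element count (3); memcpy wants bytes (12).
    assert(src.length == 3);
    assert(byteSize(src[]) == 12);

    // Passing src.length here would copy only 3 bytes; passing a byte
    // count where an element count is expected would overrun the buffer.
    memcpy(dst.ptr, src.ptr, byteSize(src[]));
    assert(dst == src);
}
```

The same distinction applies to OpenGL-style calls that take a size in bytes: the D `.length` of a `float[]` is rarely the number such a function wants.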


> Trick is, LDC works correctly and produces the expected result, both when compiling in debug and release mode.
[...]

I bet the bug is still there, just latent because of the slightly different memory layout when compiling with LDC.  You probably want to be absolutely sure it's a compiler bug before moving on, as it could very well be a bug in your code.

A less likely possibility might be an optimizer bug -- do you get different results if you add / remove '-O' (and/or '-inline') from your dmd command-line?  If some combination of -O and -inline (or the removal thereof) "fixes" the problem, it could be an optimizer bug. But those are rare, and usually only show up when you use an obscure D feature combined with another obscure corner case, in a way that people haven't thought of.  My bet is still on a pointer bug somewhere in your code.


T

-- 
If the comments and the code disagree, it's likely that *both* are wrong. -- Christopher
December 18
On Monday, 17 December 2018 at 21:59:59 UTC, JN wrote:
> Hey guys,
>
> while working on my game engine project, I encountered a DMD codegen bug. It occurs only when compiling in release mode, debug works. Unfortunately I am unable to minimize the code, since it's quite a bit of code, and changing the code changes the bug occurrence. Basically my faulty piece of code looks like this
>
> [...]

I remember a couple of months ago someone complaining about similar issues when switching to a newer dmd. I tried looking for the thread but can’t find it. Think it was on the general list.

Have you tried previous compiler versions yet?
December 18
On Monday, 17 December 2018 at 22:22:05 UTC, H. S. Teoh wrote:
> A less likely possibility might be an optimizer bug -- do you get different results if you add / remove '-O' (and/or '-inline') from your dmd command-line?  If some combination of -O and -inline (or the removal thereof) "fixes" the problem, it could be an optimizer bug. But those are rare, and usually only show up when you use an obscure D feature combined with another obscure corner case, in a way that people haven't thought of.  My bet is still on a pointer bug somewhere in your code.
>

I played around with dmd commandline. It works with -O. Works with -O -inline. As soon as I add -boundscheck=off it breaks.

As I understand it, out of bounds access is UB. Which would fit my problems because they look like UB. But if I run without boundscheck=off, shouldn't I get a RangeError somewhere?
December 18
On Tue, Dec 18, 2018 at 10:29:07PM +0000, JN via Digitalmars-d-learn wrote:
> On Monday, 17 December 2018 at 22:22:05 UTC, H. S. Teoh wrote:
> > A less likely possibility might be an optimizer bug -- do you get different results if you add / remove '-O' (and/or '-inline') from your dmd command-line?  If some combination of -O and -inline (or the removal thereof) "fixes" the problem, it could be an optimizer bug. But those are rare, and usually only show up when you use an obscure D feature combined with another obscure corner case, in a way that people haven't thought of.  My bet is still on a pointer bug somewhere in your code.
> > 
> 
> I played around with dmd commandline. It works with -O. Works with -O -inline. As soon as I add -boundscheck=off it breaks.
> 
> As I understand it, out of bounds access is UB. Which would fit my problems because they look like UB. But if I run without boundscheck=off, shouldn't I get a RangeError somewhere?

In theory, yes.  But I wonder if there's some corner case where some combination of -O or -inline may cause a bounds check to be elided, but still hit UB. Perhaps the optimizer skipped a bounds check even though it shouldn't have.  What about compiling with -boundscheck=off but without -O -inline?  Does that make a difference?

Barring that, it might be one of those really evil pointer bugs where the problem has already happened far away from the site where the symptoms first appear, usually an undetected memory corruption that only shows up as invalid data long after the actual corruption happened. Very hard to trace.

Are you sure you didn't accidentally do something like escape a pointer to a local variable, or a slice of a local static array that has since gone out of scope?  Because that's what your symptoms most closely resemble.  The last time I ran into this in my own D code, it was caused by D's really evil implicit conversion of static arrays to slices, where passing a local static array implicitly passes a slice instead, e.g.:

	SomeObject persistentStorage;

	auto someFunc(int[] data)
	{
		... // stuff
		persistentStorage.insert(data); // retains reference to data
		...
	}

	void buggyCode()
	{
		int[16] arr = ...;
		...
		someFunc(arr);	// <--- implicit conversion happens here
		...
		// uh oh, arr is going out of scope, but
		// persistentStorage holds a reference to it
	}

	void main()
	{
		...
		buggyCode(); // escaped reference to local variable
		...

		// Crash when it tries to access the slice to
		// out-of-scope data:
		doSomething(persistentStorage);
		...
	}

Since no explicit slicing was done, there was no compiler error / warning of any sort, and it wasn't obvious from the code what had happened. By the time doSomething() was called, it was already long past the source of the problem in buggyCode(), and it was almost impossible to trace the problem back to its source.
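
For comparison, here is a small runnable sketch of one way to avoid the problem: copy the data before retaining it. This is only an illustration under simplifying assumptions, not the actual code from my project; `persistentStorage` is reduced to a plain array of slices, and `storeCopy` and `fixedCode` are hypothetical names.

```d
import std.stdio : writeln;

// Simplified stand-in for persistentStorage: a module-level list of slices.
int[][] persistentStorage;

// Retains a heap-allocated copy instead of a reference to the caller's buffer.
void storeCopy(int[] data)
{
    persistentStorage ~= data.dup;
}

void fixedCode()
{
    int[4] arr = [1, 2, 3, 4];
    // arr[] makes the slicing visible at the call site. The implicit
    // conversion would also compile, which is exactly the trap.
    storeCopy(arr[]);
    // arr goes out of scope here, but the stored copy lives on the GC heap.
}

void main()
{
    fixedCode();
    writeln(persistentStorage); // prints [[1, 2, 3, 4]]
}
```

The `.dup` costs an allocation, but the stored slice no longer depends on the lifetime of the caller's stack frame.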

Theoretically, -dip25 and -dip1000 are supposed to prevent this sort of problem, but I don't know how fully-implemented they are, whether they would catch the specific instance in your code, or whether your code even compiles with these options.


T

-- 
There's light at the end of the tunnel. It's the oncoming train.
February 06
On Tuesday, 18 December 2018 at 22:56:19 UTC, H. S. Teoh wrote:
> Since no explicit slicing was done, there was no compiler error / warning of any sort, and it wasn't obvious from the code what had happened. By the time doSomething() was called, it was already long past the source of the problem in buggyCode(), and it was almost impossible to trace the problem back to its source.
>
> Theoretically, -dip25 and -dip1000 are supposed to prevent this sort of problem, but I don't know how fully-implemented they are, whether they would catch the specific instance in your code, or whether your code even compiles with these options.
>
>
> T

No luck. Actually, I avoid pointers in my code in general; I write my code very "Java-like" with objects everywhere etc. I gave up on the issue actually; perhaps I am encountering this bug https://issues.dlang.org/show_bug.cgi?id=16511 in my own code. Anyway, 32-bit and 64-bit debug work, and so does LDC. That's good enough for me.
February 06
On Wed, Feb 06, 2019 at 09:50:44PM +0000, JN via Digitalmars-d-learn wrote:
> On Tuesday, 18 December 2018 at 22:56:19 UTC, H. S. Teoh wrote:
> > Since no explicit slicing was done, there was no compiler error / warning of any sort, and it wasn't obvious from the code what had happened. By the time doSomething() was called, it was already long past the source of the problem in buggyCode(), and it was almost impossible to trace the problem back to its source.
> > 
> > Theoretically, -dip25 and -dip1000 are supposed to prevent this sort of problem, but I don't know how fully-implemented they are, whether they would catch the specific instance in your code, or whether your code even compiles with these options.
[...]
> No luck. Actually, I avoid pointers in my code in general; I write my code very "Java-like" with objects everywhere etc.
[...]

The nasty thing about the implicit static array -> slice conversion is that your code can have no bare pointers in sight, yet you still end up with an invalid reference to an out-of-scope local variable.

Some of us have argued that this conversion ought to be prohibited. But we haven't actually tried going in that direction yet, because it *will* break existing code (though IMO such code is suspect to begin with, and besides, all you have to do is to explicitly slice the static array to get around the newly-introduced compile error).

Of course, I've no clue whether this is the cause of your problems -- it's just one of many possibilities.  Pointer bugs are nasty things to debug, regardless of whether or not they've been abstracted away in nicer clothing.  I still remember pointer bugs that took literally months just to get a clue on, because it was nigh impossible to track down where they happened -- the symptoms are too far removed from the cause.  You pretty much have to take a wild guess and get lucky.

They are just as bad as race condition bugs. (Once, a race condition bug took me almost half a year to fix, because it only showed up in the customer's live environment and we could never reproduce it locally. We knew there was a race somewhere, but it was impossible to locate it. Eventually, by pure accident, an unrelated code change subtly altered the timings of certain things that made the bug more likely to manifest under certain conditions -- and only then were we finally able to reliably reproduce the problem and track down its root cause.)


T

-- 
"I suspect the best way to deal with procrastination is to put off the procrastination itself until later. I've been meaning to try this, but haven't gotten around to it yet. " -- swr
February 06
On Wednesday, 6 February 2019 at 22:22:26 UTC, H. S. Teoh wrote:
> Of course, I've no clue whether this is the cause of your problems -- it's just one of many possibilities.  Pointer bugs are nasty things to debug, regardless of whether or not they've been abstracted away in nicer clothing.  I still remember pointer bugs that took literally months just to get a clue on, because it was nigh impossible to track down where they happened -- the symptoms are too far removed from the cause.  You pretty much have to take a wild guess and get lucky.
>
> They are just as bad as race condition bugs. (Once, a race condition bug took me almost half a year to fix, because it only showed up in the customer's live environment and we could never reproduce it locally. We knew there was a race somewhere, but it was impossible to locate it. Eventually, by pure accident, an unrelated code change subtly altered the timings of certain things that made the bug more likely to manifest under certain conditions -- and only then were we finally able to reliably reproduce the problem and track down its root cause.)
>
>
> T

I am not sure if it's a pointer bug. What worries me is that it breaks at the start of the program, but uncommenting code at the end of the program influences it. Unless there's some crazy reordering going on, this shouldn't normally have an effect. I still believe the bug is on the compiler side, but it's a bit of code in my case, and if I try to minimize the case, the issue disappears. Oh well.
February 06
On Wed, Feb 06, 2019 at 10:37:27PM +0000, JN via Digitalmars-d-learn wrote: [...]
> I am not sure if it's a pointer bug. What worries me is that it breaks at the start of the program, but uncommenting code at the end of the program influences it. Unless there's some crazy reordering going on, this shouldn't normally have an effect.

As I've said before, this kind of "spooky" action-at-a-distance symptom is exactly the kind of behaviour you'd expect from a pointer bug.  Of course, it doesn't mean that it *must* be a pointer bug, but it does look awfully similar to one.


> I still believe the bug is on the compiler side, but it's a bit of code in my case, and if I try to minimize the case, the issue disappears. Oh well.

That's another typical symptom of a pointer bug.  It seems less likely to be a codegen bug, because I'd expect a codegen bug to exhibit more consistent symptoms: if a particular code is triggering a compiler codegen bug, then it shouldn't matter what other code is being compiled, the bug should show up in all cases.  This kind of sensitivity to minute, unrelated changes is closer to how pointer bugs tend to behave.

Of course, it's possible that there's a pointer bug in the *compiler*, so there's that.  It's hard to tell either way at this point.  Though given how much the compiler is used by so many people on a daily basis, it's also less likely though not impossible. Unless your code just happens to contain a particularly rare combination of language features that causes the compiler to go down a rarely-tested code path that contains the bug.

Anyway, given what you said about how moving (or minimizing) seemingly-unrelated code around seems to affect the symptoms, we could do a little educated guesswork to try to narrow it down a little more. You said commenting out code at the end of the program affects whether it crashes at the beginning.  Is this in the same function (presumably main()), or is it in different functions?

If it's in the same function, one possibility is that you have some local variables that are being overrun by a buffer overflow or some bad pointer.  Commenting out code at the end of the function changes the layout of variables on the stack, so it would change what gets overwritten.  Possibly, the bug gets hidden by the bad pointer being redirected to some innocuous variable whose value is no longer used, or some such, so the presence of the bug is masked.

If the commented-out code is in a different function from the location of the crash, and you're sure that the commented out code is not being run before the crash, then it would appear to be something related to the layout of global variables.  Perhaps there's some module static ctor that's being triggered / not triggered, that changes the global state in some way that affects the code at the beginning of the program?  If there's a bad pointer that points to some heap location, the action of module ctors running vs. not running could alter the heap state enough to mask the bug in some cases.

Another possibility is if you're interfacing with C code and have a non null-terminated D string that's being cast to char*, and the presence of more code in the executable may perturb the data/code segment layout just enough to push the string somewhere that happens to contain a null shortly afterwards.
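
As a rough illustration of that last case (a sketch only; `cLength` is a hypothetical helper, and I'm not claiming your code does this), a slice of a D string is generally not followed by a null byte, so it should go through `std.string.toStringz` before being handed to C rather than being passed as `.ptr` or cast to `char*`:

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical helper: length of s as seen by C code, after a
// null-terminated copy has been made with toStringz.
size_t cLength(string s)
{
    return strlen(s.toStringz);
}

void main()
{
    string name = "texture.png";
    string stem = name[0 .. 7]; // "texture"; the next byte in memory is '.', not '\0'

    // strlen(stem.ptr) would keep reading past the slice until it happens
    // to find a null byte, so the result would depend on memory layout.
    assert(cLength(stem) == stem.length);
}
```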

Just some guesses based on my experience with pointer bugs.


T

-- 
Written on the window of a clothing store: No shirt, no shoes, no service.
February 07
On Monday, 17 December 2018 at 21:59:59 UTC, JN wrote:
> while working on my game engine project, I encountered a DMD codegen bug. It occurs only when compiling in release mode, debug works.

Old thread, but FWIW, such bugs can be easily and precisely reduced with DustMite. In your test script, just compile with and without the compiler option which causes the bug to manifest, and check that one works and the other doesn't.

I put together a short article on the DustMite wiki describing how to do this:
https://github.com/CyberShadow/DustMite/wiki/Reducing-a-bug-with-a-specific-compiler-option
