Unittests and extern(C)

June 21, 2017
Posted by H. S. Teoh
Permalink
H. S. Teoh
Permalink
Never thought to mention these two things in the same subject line? Haha, well today I finally have reason to. This post is about an obscure bug I encountered today in one of my projects, with a moral lesson on why you really, really, ought to be using unittest blocks everywhere.

First, a bit of a background.  The program in which said bug occurred consists of a module that takes user input, preprocesses it, and instantiates a code template that produces a D code snippet. This snippet is then saved into a temporary file, and compiled with the local D compiler to produce a shared object. Subsequently, it makes use of Posix's dlopen() family of functions to load the shared object, lookup the symbol of the generated function, and return a function pointer to it.  The main module then does its own processing in which it calls the generated code via this function pointer.

The actual code is, of course, rather involved, but here's a highly-simplified version of it that captures the essentials:

	// The code template.
	//
	// The q"..." syntax is D's built-in heredoc syntax, convenient
	// for multi-line string literals.
	//
	// Basically, this template is just a boilerplate function
	// declaration to wrap around the generated D code snippet. It's
	// written as a format string, with the "%s" specifying where
	// the code snippet should be inserted.
	static immutable string codeTemplate = q"ENDTEMPLATE
	module funcImpl;
	double funcImpl(double x, double y)
	{
		return %s;
	}
	ENDTEMPLATE";

	// A convenient alias for the function pointer type that will be
	// returned by the JIT compiler.
	alias FuncImpl = double function(double, double);

	// Compiles the given input into a shared object, load it, and
	// return a function pointer to its entry point.
	FuncImpl compile(string preprocessedInput)
	{
		// Instantiate code template and write it into a
		// temporary source file.
		import std.format : format;
		string code = format(codeTemplate, preprocessedInput);

		import std.file : write;
		enum srcFile = "/path/to/tmpfile.d";
		write(srcFile, code);

		// Compile it into a shared object with the D compiler.
		// Thanks to the wonderful API of std.process, this is a
		// cinch, no need to fiddle with fork(), execv(),
		// waitpid(), etc..
		import std.process;
		enum soFile = "/path/to/tmpfile.so";
		auto ret = execute([
			"/path/to/dmd",
			"-fPIC",
			"-shared",	// make it a shared object
			"-of" ~ soFile,
			srcFile
		]);
		if (ret.status != 0) ... // compile failed

		// Load the result as a shared library
		import core.sys.posix.dlfcn;
		import std.string : toStringz;

		void* lib = dlopen(soFile.toStringz, RTLD_LAZY | RTLD_LOCAL);
		if (lib is null) ... // handle error

		// Lookup the symbol of the generated function
		auto symbol = "_D8funcImpl8funcImplFddZd"; // mangled name of funcImpl()
		impl = cast(FuncImpl) dlsym(lib, symbol);
		if (impl is null) ... // handle error
		return impl;
	}

	void main(string[] args)
	{
		auto input = getUserInput(...);
		auto snippet = preprocessInput(input);
		auto impl = compile(snippet);

		... // do stuff
		auto result = impl(x, y); // call generated function
		... // more stuff
	}

The symbol "_D8funcImpl8funcImplFddZd" is basically the mangled version of funcImpl(). A mangled name is basically a way of encoding a function signature into a legal identifier for an object file symbol -- the system linker does not (and should not) understand D overloading rules, for example, so the compiler needs to generate a unique name for every function overload. Generally, the D compiler takes care of this for us, so we never have to worry about it in usual D code, and can simply use the human-readable name "funcImpl", or the fully-qualified name "funcImpl.funcImpl". However, since dlsym() doesn't understand the D mangling scheme, in this case we need to tell it the mangled name so that it can find the right symbol in the shared object.

This was the original version of the program. So far so good.

In this version of the program, I didn't write many unittests for compile(), because I felt it was ugly for unittests to have side-effects on the host system (creating / deleting files, running external programs, etc.). So, to my shame, this part of the code had rather poor unittest coverage. Which will eventually lead to trouble...

...

Then recently, I rewrote preprocessInput to do smarter things with the user input.  In the course of doing that, I decided to clean up the above bit of code in compile() that deals with mangled function names. Since the code template doesn't actually need to do any overloading, we don't actually have to deal with mangled names at all; we could just use D's handy "extern(C)" declaration to tell the compiler to use C-style function names (i.e., no mangling, no overloading) for the function instead of the usual D mangling scheme. Then we don't have to work with the ugly unreadable mangled name and can just ask dlsym() directly for the symbol "funcImpl".

So all we need is to add "extern(C)" to the code template:

	static immutable string codeTemplate = q"ENDTEMPLATE
	module funcImpl;
	extern(C) double funcImpl(double x, double y)
	{
		return %s;
	}
	ENDTEMPLATE";

Then change the declaration of `symbol` to just:

	auto symbol = "funcImpl"; // much more readable!

Should be a harmless change, right? Right...?

Well, initially, I didn't notice anything amiss.  No thanks to compile() not having sufficient unittest coverage, e.g., run the compilation code on a trivial code fragment like "y - x*x" then checking that calling impl() returns the expected answer, I didn't notice at first that the program output has now gone horribly wrong.  As luck would have it, at the time I just so happened to be testing inputs whose corresponding functions were symmetrical with respect to the arguments x and y, i.e., funcImpl(x,y) == funcImpl(y,x).  No errors were found in the output.

But later when I ran it on something that *wasn't* symmetrical, I suddenly noticed that the results were all wrong.  It seemed that x and y were getting swapped for some reason.  My first guess was that the new preprocessing code had some bugs in it.  So I ran git bisect to find the error... unfortunately, I had done the rewrite of the preprocessing code in a separate module, which wasn't integrated with the main program until the very end.  So when git bisect indicated that the bug first appeared in the commit that first connected the new code to the main program (the extern(C) change was made as part of this commit, since it was such a small change) , it still didn't answer the question of where exactly the bug was.  I checked and double-checked the code to make sure that I didn't screw up somewhere and exchanged x and y... but found nothing.

Long story short, eventually I found that the offending change was when I added "extern(C)" to codeTemplate. And then, it all became instantly clear:

The code template says:

	extern(C) double funcImpl(double x, double y)

But the function pointer type is declared as:

	alias FuncImpl = double function(double, double);

Notice the lack of `extern(C)` in the latter. The call to dlsym(), of
course, simply casts the return value to FuncImpl, because dlsym()
doesn't know anything about what the symbol's type might be, and just
returns void*:

	FuncImpl impl = cast(FuncImpl) dlsym(lib, symbol);

Here's a lesson I learned that I didn't think about before: since extern(C) is intended for C interoperability, it not only suppresses D symbol mangling, but also *changes the ABI* of the function to be compatible with whatever the host OS's C calling conventions are.  In particular, the C calling convention may or may not be the same as the D calling convention. I had assumed that they were more-or-less the same as long as you weren't passing D-specific things between the functions, but in this case, I was wrong: the *order* of parameters when calling a D function is the *opposite* of the order of parameters when calling a C function, and naturally this applied to all extern(C) functions, including those written in D but annotated with extern(C).

As a result, when the main program invoked impl(x,y), the reversed
parameter order of extern(C) caused the effective call to become
impl(y,x), even though nowhere in the main program are x and y ever
swapped!

After realizing this, the fix is trivial: add `extern(C)` to the
function pointer type:

	alias FuncImpl = extern(C) double function(double, double);

And everything worked as it should again.

Finally, the moral of the story is that had I written unittests for compile(), I would have caught this bug much earlier than I did. It's as simple as writing:

	unittest
	{
		auto impl = compile("y - x*x");
		assert(impl(2.0, 4.0) == 0.0);
	}

Then when I forgot to add extern(C) to the declaration of FuncImpl, the unittest failure would have pinpointed the problem immediately and saved me the trouble of git bisecting and inspecting the code to find the bug. Needless to say, this unittest is now in my codebase -- who cares if it's ugly that the unittest writes to the filesystem and invokes an external program; the benefit of catching obscure bugs like this one (and preventing such bugs from resurfacing later) far outweighs any ivory tower idealistic notions about unittests having no side-effects.

TL;DR: (1) extern(C) not only suppresses symbol mangling, it also
changes the ABI of the annotated function; (2) always, *always*, write
unittests.  Just do it. Even supposedly "ugly" or "dirty" ones. Compile
with -cov to shame yourself into writing more unittests. (My compile()
function had at one point 0% test coverage. That should have raised a
big red flag.)


P.S. @MikeParker: this time I wrote this post with the D blog in mind. I know it probably needs more polish, but hopefully this is something you can work with? ;-)


T

-- 
Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder
Forums