Possible solution to template bloat problem? - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » General » Possible solution to template bloat problem?

Thread overview

Possible solution to template bloat problem?
Aug 19, 2013 H. S. Teoh
Aug 19, 2013 Dicebot
Aug 19, 2013 John Colvin
Aug 19, 2013 Dicebot
Aug 20, 2013 Ramon
Aug 20, 2013 Dicebot
Aug 20, 2013 H. S. Teoh
Aug 20, 2013 Daniel Murphy
Aug 20, 2013 Dicebot
Aug 20, 2013 Timon Gehr
Aug 20, 2013 Ramon
Aug 20, 2013 Dicebot
Aug 20, 2013 Ramon
Aug 20, 2013 H. S. Teoh
Aug 20, 2013 John Colvin
Aug 20, 2013 Ramon
Aug 20, 2013 H. S. Teoh
Aug 20, 2013 Ramon
Aug 20, 2013 Andrei Alexandrescu
Aug 20, 2013 Ramon
Aug 20, 2013 John Colvin
Aug 20, 2013 H. S. Teoh
Aug 21, 2013 Ramon
Aug 21, 2013 Dicebot
Aug 21, 2013 Ramon
Aug 21, 2013 BS
Aug 21, 2013 BS
Aug 21, 2013 Dicebot
Aug 21, 2013 H. S. Teoh
Aug 21, 2013 Ramon
Aug 21, 2013 Andrei Alexandrescu
Aug 21, 2013 Ramon
Aug 21, 2013 Tyler Jameson Little
Aug 22, 2013 Regan Heath
Aug 22, 2013 Ramon
Aug 22, 2013 eles
Aug 22, 2013 Regan Heath
Aug 22, 2013 Ramon
Aug 22, 2013 Dicebot
Aug 22, 2013 Ramon
Aug 22, 2013 1100110
Aug 22, 2013 Regan Heath
Aug 22, 2013 Ramon
Aug 22, 2013 Regan Heath
Aug 22, 2013 Ramon
Aug 23, 2013 WiseWords
Aug 23, 2013 bsd
Aug 23, 2013 Ramon
Aug 23, 2013 bsd
Aug 23, 2013 BS
Aug 23, 2013 Regan Heath
Aug 22, 2013 Dicebot
Aug 20, 2013 Jacob Carlborg

August 19, 2013

Possible solution to template bloat problem?

Posted by H. S. Teoh

H. S. Teoh

With D's honestly awesome metaprogramming features, templates are liable to be (and in fact are) used a LOT. This leads to the unfortunate situation of template bloat: every time you instantiate a template, it adds yet another copy of the templated code into your object file. This gets worse when you use templated structs/classes, each of which may define some number of methods, and each instantiation adds yet another copy of all those methods.

This is doubly bad if these templates are used only during compile-time, and never referenced during runtime. That's a lot of useless baggage in the final executable. Plus, it leads to issues like this one:

	http://d.puremagic.com/issues/show_bug.cgi?id=10833

While looking at this bug, I got an idea: what if, instead of emitting template instantiations into the same object file as non-templated code, the compiler were to emit each instantiation into a separate static *library*? For instance, if you have code in program.d, then the compiler would emit non-templated code like main() into program.o, but all template instantiations get put in, say, libprogram.a. Then during link time, the compiler runs `ld -oprogram program.o libprogram.a`, and then the linker will pull in symbols from libprogram.a that are referenced by program.o.

If we were to set things up so that libprogram.a contains a separate unit for each instantiated template function, then the linker would actually pull in only code that is actually referenced at runtime. For example, say our code looks like this:

	struct S(T) {
		T x;
		T method1(T t) { ... }
		T method2(T t) { ... }
		T method3(T t) { ... }
	}
	void main() {
		auto sbyte  = S!byte();
		auto sint   = S!int();
		auto sfloat = S!float();

		sbyte.method1(1);
		sint.method2(2);
		sfloat.method3(3.0);
	}

Then the compiler would put main() in program.o, and *nothing else*. In program.o, there would be undefined references to S!byte.method1, S!int.method2, and S!float.method3, but not the actual code. Instead, when the compiler sees S!byte, S!int, and S!float, it puts all of the instantiated methods inside libprogram.a as separate units:

	libprogram.a:
		struct_S_byte_method1.o:
			S!byte.method1
		struct_S_byte_method2.o:
			S!byte.method2
		struct_S_byte_method3.o:
			S!byte.method3
		struct_S_int_method1.o:
			S!int.method1
		struct_S_int_method2.o:
			S!int.method2
		struct_S_int_method3.o:
			S!int.method3
		struct_S_float_method1.o:
			S!float.method1
		struct_S_float_method2.o:
			S!float.method2
		struct_S_float_method3.o:
			S!float.method3

Since the compiler doesn't know at instantiation time which of these methods will actually be used, it simply emits all of them and puts them into the static library.

Then at link-time, the compiler tells the linker to include libprogram.a when linking program.o. So the linker goes through each undefined reference, and resolves them by linking in the module in libprogram.a that defines said reference. So it would link in the code for S!byte.method1, S!int.method2, and S!float.method3. The other 6 instantiations are not linked into the final executable, because they are never actually referenced by the runtime code.

So this way, we minimize template bloat to only the code that's actually used at runtime. If a particular template function instantiation is only used during CTFE, for example, it would be present in libprogram.a but won't get linked, because none of the runtime code references it. This would fix bug 10833.

Is this workable? Is it implementable in DMD?


T

-- 
Nearly all men can stand adversity, but if you want to test a man's character, give him power. -- Abraham Lincoln

August 19, 2013

Re: Possible solution to template bloat problem?

Posted by Dicebot
in reply to H. S. Teoh

Dicebot

Posted in reply to H. S. Teoh

Just two words: "separate compilation".

Any solution that is going to address template problem needs to improve current state with such compilation model, not make it even worse.

As an alternative, I have proposed one of two approaches (or both):

1) Stop dumping all symbols into root module supplied from the command line. Emit symbols to object files that match modules they were instantiated from. If symbol has no valid source point (== constraint or CTFE) than don't emit it at all.

2) Create object files in format that allows usage of `ld --gc-sections` (garbage collection of unused symbols upon linking). Don't know if similar thing exists for Windows.

Latter should be relatively easy to do but it is not cross-platform and it does not help build systems with tracking rebuild conditions.

Former feels like a proper approach and I have been working on it (also @eskimor) for some time. But it is rather hard as relevant code does not seem to track required information at all and probably no one but Walter knows all minor details about its design. To sum it up - I have not been able to produce a pull request that passes the test suite so far (though I have put it on hold for some time, going to return to this).

August 19, 2013

Re: Possible solution to template bloat problem?

Posted by John Colvin
in reply to H. S. Teoh

John Colvin

Posted in reply to H. S. Teoh

On Monday, 19 August 2013 at 20:23:46 UTC, H. S. Teoh wrote:
> With D's honestly awesome metaprogramming features, templates are liable
> to be (and in fact are) used a LOT. This leads to the unfortunate
> situation of template bloat: every time you instantiate a template, it
> adds yet another copy of the templated code into your object file. This
> gets worse when you use templated structs/classes, each of which may
> define some number of methods, and each instantiation adds yet another
> copy of all those methods.
>
> This is doubly bad if these templates are used only during compile-time,
> and never referenced during runtime. That's a lot of useless baggage in
> the final executable. Plus, it leads to issues like this one:
>
> 	http://d.puremagic.com/issues/show_bug.cgi?id=10833
>
> While looking at this bug, I got an idea: what if, instead of emitting
> template instantiations into the same object file as non-templated code,
> the compiler were to emit each instantiation into a separate static
> *library*? For instance, if you have code in program.d, then the
> compiler would emit non-templated code like main() into program.o, but
> all template instantiations get put in, say, libprogram.a. Then during
> link time, the compiler runs `ld -oprogram program.o libprogram.a`, and
> then the linker will pull in symbols from libprogram.a that are
> referenced by program.o.
>
> If we were to set things up so that libprogram.a contains a separate
> unit for each instantiated template function, then the linker would
> actually pull in only code that is actually referenced at runtime. For
> example, say our code looks like this:
>
> 	struct S(T) {
> 		T x;
> 		T method1(T t) { ... }
> 		T method2(T t) { ... }
> 		T method3(T t) { ... }
> 	}
> 	void main() {
> 		auto sbyte  = S!byte();
> 		auto sint   = S!int();
> 		auto sfloat = S!float();
>
> 		sbyte.method1(1);
> 		sint.method2(2);
> 		sfloat.method3(3.0);
> 	}
>
> Then the compiler would put main() in program.o, and *nothing else*. In
> program.o, there would be undefined references to S!byte.method1,
> S!int.method2, and S!float.method3, but not the actual code. Instead,
> when the compiler sees S!byte, S!int, and S!float, it puts all of the
> instantiated methods inside libprogram.a as separate units:
>
> 	libprogram.a:
> 		struct_S_byte_method1.o:
> 			S!byte.method1
> 		struct_S_byte_method2.o:
> 			S!byte.method2
> 		struct_S_byte_method3.o:
> 			S!byte.method3
> 		struct_S_int_method1.o:
> 			S!int.method1
> 		struct_S_int_method2.o:
> 			S!int.method2
> 		struct_S_int_method3.o:
> 			S!int.method3
> 		struct_S_float_method1.o:
> 			S!float.method1
> 		struct_S_float_method2.o:
> 			S!float.method2
> 		struct_S_float_method3.o:
> 			S!float.method3
>
> Since the compiler doesn't know at instantiation time which of these
> methods will actually be used, it simply emits all of them and puts them
> into the static library.
>
> Then at link-time, the compiler tells the linker to include libprogram.a
> when linking program.o. So the linker goes through each undefined
> reference, and resolves them by linking in the module in libprogram.a
> that defines said reference. So it would link in the code for
> S!byte.method1, S!int.method2, and S!float.method3. The other 6
> instantiations are not linked into the final executable, because they
> are never actually referenced by the runtime code.
>
> So this way, we minimize template bloat to only the code that's actually
> used at runtime. If a particular template function instantiation is only
> used during CTFE, for example, it would be present in libprogram.a but
> won't get linked, because none of the runtime code references it. This
> would fix bug 10833.
>
> Is this workable? Is it implementable in DMD?
>
>
> T

Without link-time optimisation, this prevents inlining doesn't it?

August 19, 2013

Re: Possible solution to template bloat problem?

Posted by Dicebot
in reply to John Colvin

Dicebot

Posted in reply to John Colvin

On Monday, 19 August 2013 at 22:11:39 UTC, John Colvin wrote:
> Without link-time optimisation, this prevents inlining doesn't it?

It does not _prevent_ inlining, but it breaks incremental builds in case inlining has ever happened.

August 20, 2013

Re: Possible solution to template bloat problem?

Posted by Ramon
in reply to Dicebot

Ramon

Posted in reply to Dicebot

Well, I'm afraid that's what templates are. One (or the compiler) fills them in and that's it.

In other words: Templates are compile time while (real) generics are run time. This basically comes down to have some way of designating classes as, for instance, comparable and then either running along the object chain comparing all built in objects (with built in compare functionality) or having a compare implemented (Of course, there is also arithmetic functions, etc.).

While this sounds great it actually carries some code weight ("bloat") with it, too because all that functionality must be somewhere. It gets (relatively) cheaper though when being heavily used.

August 20, 2013

Re: Possible solution to template bloat problem?

Posted by Dicebot
in reply to Ramon

Dicebot

Posted in reply to Ramon

On Tuesday, 20 August 2013 at 00:34:38 UTC, Ramon wrote:
> Well, I'm afraid that's what templates are. One (or the compiler) fills them in and that's it.
>
> In other words: Templates are compile time while (real) generics are run time. This basically comes down to have some way of designating classes as, for instance, comparable and then either running along the object chain comparing all built in objects (with built in compare functionality) or having a compare implemented (Of course, there is also arithmetic functions, etc.).
>
> While this sounds great it actually carries some code weight ("bloat") with it, too because all that functionality must be somewhere. It gets (relatively) cheaper though when being heavily used.

What you speak is true for languages with "generics" where amount of generated code is pretty much equal to one that would have been written by hand. But in D templates do much more and there is no practical reasons other one quality of implementation to keep it that way.

For example, template constraints and stuff used during CTFE are low-hanging fruits. Those don't need to be emitted in resulting executable at all being only used at compile time.

More theoretically complex problem is stuff like std.algorithm - simply using something like map will result in several hundreds (!) of trivial template instances most of which will be inlined and never actually used in resulting binary. That is something that link-stage gargabe collection can take care of with a totally awesome results. I doubt it can be done by compiler itself but maybe there are some options I have missed.

In a perfect world using templates implies generic algorithms not generic code generation. it same thing as manual assembly - with modern optimizers and inlining capabilities all this crap must be boiled down to same code as carefully crafted manual one. No reasons to not do it.

Reminds me: how hard is writing own linker is again? :)

August 20, 2013

Re: Possible solution to template bloat problem?

Posted by H. S. Teoh
in reply to Dicebot

H. S. Teoh

Posted in reply to Dicebot

On Tue, Aug 20, 2013 at 02:48:27AM +0200, Dicebot wrote:
> On Tuesday, 20 August 2013 at 00:34:38 UTC, Ramon wrote:
> >Well, I'm afraid that's what templates are. One (or the compiler)
> >fills them in and that's it.
> >
> >In other words: Templates are compile time while (real) generics are run time. This basically comes down to have some way of designating classes as, for instance, comparable and then either running along the object chain comparing all built in objects (with built in compare functionality) or having a compare implemented (Of course, there is also arithmetic functions, etc.).

In that case, the "generics" you're talking about amounts basically to OO polymorphism. You have the same machine code that can process diverse types by using indirection to abstract away the implementation details. This is no doubt useful, as OO itself proves, but it does come with a cost: using indirection incurs a (small, but nevertheless non-zero) performance hit. Inside inner loops, this can be a performance killer.

At the machine code level, it's actually not possible to use the same code for, e.g., comparing two ints vs. comparing two floats. You need different machine instructions for them, so there is no single piece of code that can be reused for both types. You have to somehow switch between them at runtime depending on what types are being passed in. (It gets even hairier if you're comparing, say, ints and floats, in which case additional instructions must be used for promoting one type to another so that they are comparable.) So, say you call your function "less". In order to actually run, you need one version for comparing ints, another for comparing floats, etc.. This is what templates do.

Alternatively, you use indirection: int and float can be associated with some static data structure that describes each type, say it has a function pointer that, for ints, point to the function that compares int, and for floats, point to the function that compares floats. Then the caller doesn't actually directly call the low-level int/float-specific functions, but they always look up the function pointer. This is, in essence, how OO polymorphism works: each class has a vtable with function pointers to the specific implementation of each overloaded method. Then at runtime, you call whatever function the vtable points to, thus achieving runtime genericity.

The problem with indirection is that it's expensive: given two objects to be compared, you need to dereference the pointer to those objects, then dereference the pointer to their respective vtables, then dereference the function pointer in the vtables in order to call the function.

Templates, OTOH, are compile-time bound: the compiler ascertains at compile-time that your particular piece of code is comparing two ints, so it bypasses all of the expensive runtime dereferencing, and directly calls the function for comparing ints. The resulting code is inflexible, in the sense that you can't change, at runtime, the arguments to floats -- it would fail horribly -- but it avoids 3 pointer dereferences. When inside an inner loop, this can mean the difference between smooth animation and unusably jerky animation.

The cost, of course, is that if you need the same piece of code for comparing both ints and floats, then the compiler has to generate two copies of the code, one to handle ints, and the other to handle floats.

The saving grace, as Dicebot points out, is that if these copies of code are small enough, they will be inlined, so you can even save on the cost of a function call. This somewhat reduces the resulting code size -- you save on function call instructions and stack push/pops, etc., but you're still paying for the duplicated code.

This is the current state of the art.

Now, my dream is that one day, perhaps compilers will get smart enough that you don't even need to worry about the distinction between templates and runtime polymorphism anymore -- you specify what you want to get done, and the compiler determines, based on characteristics of the target machine and how the program as a whole will use that particular piece of code, whether to use indirection or template expansion. Or maybe even a mix of both, depending on the situation.

[...]
> Reminds me: how hard is writing own linker is again? :)

Honestly, I think it's about time linker technology is rethought and developed further. Possible developements are automatic elision of unused code sections (already implemented in some linkers), automatic merging of identical sections (not sure if implemented yet -- may require language support), link-time inlining, reordering of symbols to increase code locality during execution (optimize for CPU caches), etc..

Or more ambitiously, better integration with compiler so that the linker has access to compile-time structures to help it make decisions about optimization. Present-day object file formats are too far along the process to their executable form to permit many optimizations that could in theory be performed by the linker, requiring instead hacks like weak symbols, unused section GC, etc..

On the more mundane side, we need better algorithms for improving linker performance. Current symbol resolution algorithms don't scale very well when your object files are large or have large numbers of symbols. Surely there are ways of improving the asymptotic complexity of these things!

T

-- 
It is impossible to make anything foolproof because fools are so ingenious. -- Sammy

August 20, 2013

Re: Possible solution to template bloat problem?

Posted by Jacob Carlborg
in reply to H. S. Teoh

Jacob Carlborg

Posted in reply to H. S. Teoh

On 2013-08-19 22:22, H. S. Teoh wrote:

> If we were to set things up so that libprogram.a contains a separate
> unit for each instantiated template function, then the linker would
> actually pull in only code that is actually referenced at runtime. For
> example, say our code looks like this:

Doesn't the compiler already do something like that with the -lib flag?

-- 
/Jacob Carlborg

August 20, 2013

Re: Possible solution to template bloat problem?

Posted by Daniel Murphy
in reply to H. S. Teoh

Daniel Murphy

Posted in reply to H. S. Teoh

"H. S. Teoh" <hsteoh@quickfur.ath.cx> wrote in message news:mailman.213.1376962388.1719.digitalmars-d@puremagic.com...
>
> [...]
>> Reminds me: how hard is writing own linker is again? :)
>
> Honestly, I think it's about time linker technology is rethought and developed further. Possible developements are automatic elision of unused code sections (already implemented in some linkers), automatic merging of identical sections (not sure if implemented yet -- may require language support), link-time inlining, reordering of symbols to increase code locality during execution (optimize for CPU caches), etc..
>
> Or more ambitiously, better integration with compiler so that the linker has access to compile-time structures to help it make decisions about optimization. Present-day object file formats are too far along the process to their executable form to permit many optimizations that could in theory be performed by the linker, requiring instead hacks like weak symbols, unused section GC, etc..
>
> On the more mundane side, we need better algorithms for improving linker performance. Current symbol resolution algorithms don't scale very well when your object files are large or have large numbers of symbols. Surely there are ways of improving the asymptotic complexity of these things!
>
>

Check out llvm's lld.  Their choice of language sucks, but they do appear to be trying to rethink the whole mess.

August 20, 2013

Re: Possible solution to template bloat problem?

Posted by Dicebot
in reply to H. S. Teoh

Dicebot

Posted in reply to H. S. Teoh

On Tuesday, 20 August 2013 at 01:33:08 UTC, H. S. Teoh wrote:
> [...]
>> Reminds me: how hard is writing own linker is again? :)
>
> Honestly, I think it's about time linker technology is rethought and
> developed further.

Actually, I was asking about this not because there are critical issues with existing linker technology. While it definitely has a lot of space for improvements, even good old `ld` has plenty of features that do matter. Collection of unused sections that was already mentioned can result in huge binary size improvements. I'd love to enable this by default - problem is we can't rely on platform-specific tools to define such an important feature tightly coupled with language itself (think about `export`).

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation