April 08, 2012
On 08.04.2012 18:18, H. S. Teoh wrote:
[snip]
>>
>> 1. Every time a function is generated (or pretty much any symbol)
>> not only a size calculated but also a checksum* of it's data.
>> (If we go for link-time optimization we should find a place to stick
>> it to in the object file)
>
> We'd have to make sure the checksum doesn't end up in the final
> executable though, otherwise the bloat may negate any gains we've made.

Easy: the symbol size is in the object file (obviously), but it's surely not present in the executable. If I'm worth anything, legacy formats have plenty of slack space reserved for the future. Now is the future :)
[snip]
>> 3. After any function was generated compiler checks an entry in the
>> duplicate table that matches size, followed by matching checksum and
>> only then (if required) doing a straight memcmp. If it happens that
>> there is a match compiler just throws generated code away and
>> _aliases_ it's symbol to that of a matched entry.
>> (so there has to be an alias table if there isn't one already)
>
> I think you don't even need an alias table; IIRC the OS dynamic linker
> can easily handle symbols that have the same value (i.e. that point to
> the same place). All you have to do is to change the value of the
> duplicated symbols so that they all point to the same address.
>

Nice to know.
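A minimal sketch of the quoted three-step filter (size, then checksum, then memcmp) in C++. The table layout and the FNV-1a stand-in hash are assumptions for illustration, not any compiler's actual data structures:

```cpp
#include <cstdint>
#include <cstring>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Bytes = std::vector<unsigned char>;

// Stand-in 64-bit checksum (FNV-1a); a real compiler could compute this
// incrementally during codegen, while the bytes are still in cache.
uint64_t checksum(const Bytes& b) {
    uint64_t h = 1469598103934665603ULL;
    for (unsigned char c : b) { h ^= c; h *= 1099511628211ULL; }
    return h;
}

struct DupTable {
    // (size, checksum) -> symbols already emitted with that code
    std::multimap<std::pair<size_t, uint64_t>,
                  std::pair<std::string, Bytes>> seen;

    // Returns the name of an identical earlier symbol (to alias to),
    // or "" if this code is new and was recorded.
    std::string add(const std::string& name, const Bytes& code) {
        auto key = std::make_pair(code.size(), checksum(code));
        auto range = seen.equal_range(key);
        for (auto it = range.first; it != range.second; ++it)
            if (std::memcmp(it->second.second.data(), code.data(),
                            code.size()) == 0)
                return it->second.first;  // full match: discard new code
        seen.emplace(key, std::make_pair(name, code));
        return "";
    }
};
```

Only exact matches by size and checksum ever reach the memcmp, so the expensive comparison stays rare.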
>
> [...]
>> Applicability:
>> It's not only const-immutable bloat, it can be alleviated with
>> inout. Yet there are plenty of places the exact same code is being
>> generated: e.g. sealed containers of int vs uint, std.find on
>> dchar[] vs uint[]/int[] an so on.
>> In general, the coarse grained parametrization is the root of all
>> evil and it is inevitable since we are just humans after all.
>
> I'm not sure I understand the last sentence there, but duplicate code
> elimination is definitely a big plus. It will also mean that we can use
> templates more freely without having to worry about template bloat.
>

It's easy - define a template on type T. Code it up.
Now, how many times did you consider that e.g. you could parametrize on the size of the type instead of the type itself? My point is that this is almost always the case with sealed containers of PODs, for instance.
(now multiply the argument by the total number of parameters)
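To illustrate the point, here is a hypothetical C++ analogue (C++ rather than D only so it can stand alone): routing a typed container through a size-parametrized core means int and uint share a single instantiation of the real code.

```cpp
#include <cstddef>
#include <cstring>

// Size-parametrized core: every POD type with the same sizeof shares
// exactly one instantiation of this code.
template <size_t N>
struct RawSlot {
    unsigned char data[N];
    void store(const void* src) { std::memcpy(data, src, N); }
    void load(void* dst) const { std::memcpy(dst, data, N); }
};

// Thin typed wrapper; Slot<int> and Slot<unsigned> both reuse RawSlot<4>.
template <typename T>
struct Slot {
    RawSlot<sizeof(T)> raw;
    void put(const T& v) { raw.store(&v); }
    T get() const { T v; raw.load(&v); return v; }
};
```

The merging proposal would achieve the same effect automatically, without anyone having to write the size-keyed layer by hand.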

>
>> Notes:
>> 1. If we do checksum calculation on the fly during codegen it gets
>> at virtually no cost as the data is in CPU data cache. Preliminary
>> version can avoid hacking this part of backend though.
>>
>> 2. By _alias_ I mean the ability of compiler to emit references to a
>> given symbol as if it was some other symbol (should be really
>> straight forward).
>
> Like I said, I think this isn't even necessary, if the compiler can just
> generate the same value for the duplicated symbols.

OK, I'm not much into how *exactly* the linker works these days. I know the basics though.


>
>> 4. Ironically the same exact thing works with any kind of immutable
>> data structures. It looks like string pooling is superseded by this
>> proposal.
> [...]
>
> Not really... string pooling can take advantage of overlapping
> (sub)strings, but I don't think you can do that with code. But I think
> your idea has a lot of merit. I'm for making linkers smarter than they
> currently are.
>

Sorry, it's just me running ahead of the train somewhat. Basically, once this initial version is in place I have one cool refinement for it in mind. For now we just need to keep the hash function transitive and associative, for the gods' sake. A 128/64-bit checksum please ;)

-- 
Dmitry Olshansky
April 08, 2012
On 08.04.2012 16:37, Marco Leise wrote:
[snip]
> Template bloat could be especially important to 'fix' on embedded systems.

I think this idea largely formed years ago when I was working with C++ on 8-bit micros. You wouldn't believe the amount of code size one can save by using one separate generic save-it-all-then-load-it-all prolog/epilog for all functions (and esp. ISRs).


>  Let's wait till the bugs and important features are implemented or hack the compiler ourselves.

Let's hack the compiler ;)
BTW I think it should be possible to apply the idea at the IR level, not on the actual machine code.

-- 
Dmitry Olshansky
April 08, 2012
On Sun, Apr 08, 2012 at 08:45:19PM +0400, Dmitry Olshansky wrote:
> On 08.04.2012 18:18, H. S. Teoh wrote:
> [snip]
> >We'd have to make sure the checksum doesn't end up in the final executable though, otherwise the bloat may negate any gains we've made.
> 
> Easy: the symbol size is in the object file (obviously), but it's surely not present in the executable. If I'm worth anything, legacy formats have plenty of slack space reserved for the future. Now is the future :)

I agree!


[...]
> >>Applicability:
> >>It's not only const-immutable bloat, it can be alleviated with
> >>inout. Yet there are plenty of places the exact same code is being
> >>generated: e.g. sealed containers of int vs uint, std.find on
> >>dchar[] vs uint[]/int[] an so on.
> >>In general, the coarse grained parametrization is the root of all
> >>evil and it is inevitable since we are just humans after all.
> >
> >I'm not sure I understand the last sentence there, but duplicate code elimination is definitely a big plus. It will also mean that we can use templates more freely without having to worry about template bloat.
> 
> It's easy - define a template on type T. Code it up.
> Now, how many times did you consider that e.g. you could parametrize on
> the size of the type instead of the type itself? My point is that this
> is almost always the case with sealed containers of PODs, for
> instance.
> (now multiply the argument by the total number of parameters)

Yeah, that's what I was thinking of. This would be a very big gain for the new AA implementation, for example. I wouldn't have to worry so much about template bloat if most of the instantiations are going to get merged anyway. :-)


[...]
> >Not really... string pooling can take advantage of overlapping (sub)strings, but I don't think you can do that with code. But I think your idea has a lot of merit. I'm for making linkers smarter than they currently are.
> >
> 
> Sorry, it's just me running ahead of the train somewhat. Basically, once this initial version is in place I have one cool refinement for it in mind. For now we just need to keep the hash function transitive and associative, for the gods' sake. A 128/64-bit checksum please ;)
[...]

And what would the cool refinement be? :-)


T

-- 
It's bad luck to be superstitious. -- YHL
April 08, 2012
Am Sun, 08 Apr 2012 20:58:15 +0400
schrieb Dmitry Olshansky <dmitry.olsh@gmail.com>:

> On 08.04.2012 16:37, Marco Leise wrote:
> [snip]
> > Template bloat could be especially important to 'fix' on embedded systems.
> 
> I think this idea largely formed years ago when I was working with C++ on 8-bit micros. You wouldn't believe the amount of code size one can save by using one separate generic save-it-all-then-load-it-all prolog/epilog for all functions (and esp. ISRs).
> 
> 
> >  Let's wait till the bugs and important features are implemented or hack the compiler ourselves.
> 
> Let's hack the compiler ;)
> BTW I think it should be possible to apply the idea on the IR level not
> on the actual machine code.

I don't know compilers, but doesn't the IR level still distinguish between signed and unsigned, for example, even though the generated machine code will be the same? Anyway, that's what I had in mind for the caching of CTFE as well: create a hash over the IR (or AST?) of a function (or anything else) and store the hash and the serialized IR in a file next to the object files.
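The cache keying described above could be sketched like this; the function names, the FNV-1a hash, and the ".ircache" file layout are all invented for illustration:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hash the serialized IR of a function; FNV-1a is only a stand-in for
// whatever hash the compiler would actually use.
uint64_t irHash(const std::string& serializedIR) {
    uint64_t h = 1469598103934665603ULL;
    for (unsigned char c : serializedIR) { h ^= c; h *= 1099511628211ULL; }
    return h;
}

// Cache file next to the object files, keyed by the hash; the ".ircache"
// extension is made up for this sketch.
std::string cachePath(const std::string& objDir, uint64_t hash) {
    std::ostringstream p;
    p << objDir << "/" << std::hex << hash << ".ircache";
    return p.str();
}
```

On a later build, a hit on the hash would let the compiler load the cached IR instead of re-running CTFE.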

-- 
Marco

April 08, 2012
On 08.04.2012 21:24, H. S. Teoh wrote:
>
> Yeah, that's what I was thinking of. This would be a very big gain for
> the new AA implementation, for example. I wouldn't have to worry so much
> about template bloat if most of the instantiations are going to get
> merged anyway. :-)
>

Right, the advantage is that one doesn't have to use tricks. One can
simply assume the compiler is doing its job. That's what happened with inlining some 10+ years ago.

>
> [...]
>>> Not really... string pooling can take advantage of overlapping
>>> (sub)strings, but I don't think you can do that with code. But I
>>> think your idea has a lot of merit. I'm for making linkers smarter
>>> than they currently are.
>>>
>>
>> Sorry, it's just me running ahead of the train somewhat. Basically once
>> this initial version is in place I have one cool refinement for it in
>> mind. For now we just need to keep the hash function transitive and
>> associative, for the gods' sake. A 128/64-bit checksum please ;)
> [...]
>
> And what would the cool refinement be? :-)
>
>
OK, you sort of forced my hand.
I hope you have been warned about spoilers before ;)


The refinement is merging prefixes and suffixes, of course.
And for that one needs to calculate hashes for all prefixes and all suffixes. I will define _all_ later on.

First observation is that if you have calculated partial checksums for prefixes, you also have the sums for all suffixes, and vice versa.
Namely:
//prefix ends on i, suffix start on i
sum_prefix[i] = total - sum_suffix[i];

that is derived from the fact that:
total = sum_prefix[i] + sum_suffix[i];
(taking into account the properties of +/- in the finite field)

Now take the original idea, substitute the checksum with an array of partial sums and lengths (btw lengths follow the same prefix<->suffix rule) and you know what this refinement is all about.

In fact this all was easy; the trick is fitting it into the time frame of compilation.

To that end one should do full duplicate elimination first, then mess with merging.

Then we see that generally we are about to do O((M1+...+Mn)^2) work, where n is the number of functions and Mi is the number of prefixes (= suffixes) we defined for each. The constant C in front of (M1+...+Mn)^2 is not that big - it's comparing checksums & lengths, with only *sometimes* a memcmp - but keep in mind that the total number of functions is big.

So we have to get around this monstrous workload. At the moment the only way I see of doing this is to use coarse-grained prefixes and introduce some threshold on the maximum number of prefixes we account for. That's about defining the above-mentioned _all_ for prefixes.

Coarse-grained means we store partial checksums only on a fixed block boundary (say every 16 or 32 bytes), if that aligns properly with instructions; if not, we just skip this prefix and move on.
(this also hopefully limits memory usage on the partial sums array)

Another threshold is that we don't mess with partial sums for really BIG functions. (might not be needed)

I think there is some space for other heuristics to try out but they are mostly along the same lines.

P.S. Damn, I could have done a nice paper on that... too late :)

-- 
Dmitry Olshansky
April 08, 2012
On 4/8/12 1:49 PM, Dmitry Olshansky wrote:
> P.S. Damn, I could have done a nice paper on that... too late :)

You may always do.

Andrei

April 08, 2012
On 4/8/2012 4:01 AM, Dmitry Olshansky wrote:
> I think it's been ages since I meant to ask why nobody (as in compiler vendors)
> does what I think is a rather simple optimization.

I worked out how to do it a while ago, but there's been no time to implement it.

(You can't do a memcmp because of all the relocations.)

The main difficulty is not being able to modify the linker. So you're pretty much limited to what the compiler is able to do before linking. D does allow the compiler to deal with all the modules at compile time, so this helps a lot.
April 08, 2012
On 08.04.2012 22:51, Walter Bright wrote:
> On 4/8/2012 4:01 AM, Dmitry Olshansky wrote:
>> I think it's been ages since I meant to ask why nobody (as in compiler
>> vendors) does what I think is a rather simple optimization.
>
> I worked out how to do it a while ago, but there's been no time to
> implement it.
>
> (You can't do a memcmp because of all the relocations.)

Yes, I thought there must have been some problem with it.

>
> The main difficulty is not being able to modify the linker. So you're
> pretty much limited to what the compiler is able to do before linking. D
> does allow the compiler to deal with all the modules at compile time, so
> this helps a lot.

I'm thinking: what if we do it as an extension, so that your linker (e.g. optlink) knows about these checksums (or whatever you use instead), while a normal linker just works the way it always did?

-- 
Dmitry Olshansky
April 08, 2012
On 08.04.2012 22:49, Dmitry Olshansky wrote:
>
> The refinement is merging prefixes and suffixes of course.
> And for that one needs to calculate hashes for all of prefixes and all
> of suffixes. I will define _all_ later on.
>
> First observation is that if you have calculated partial checksums for
> prefixes, you also have the sums for all suffixes, and vice versa.
> Namely:
> //prefix ends on i, suffix start on i
> sum_prefix[i] = total - sum_suffix[i];
>
> that is derived from the fact that:
> total = sum_prefix[i] + sum_suffix[i];
> (taking into account the properties of +/- in the finite field)
>

Mm, btw it's just one step from here to using it on arbitrary common slices (i, j), where i < j:
slice_sum(i, j) = sum_prefix[j] - sum_prefix[i]

P.S. (Goes for a walk, citing a dynamic programming maxim)

-- 
Dmitry Olshansky
April 08, 2012
On Sun, Apr 08, 2012 at 10:56:43PM +0400, Dmitry Olshansky wrote:
> On 08.04.2012 22:51, Walter Bright wrote:
[...]
> >The main difficulty is not being able to modify the linker. So you're pretty much limited to what the compiler is able to do before linking. D does allow the compiler to deal with all the modules at compile time, so this helps a lot.
> 
> I'm thinking: what if we do it as an extension, so that your linker
> (e.g. optlink) knows about these checksums (or whatever you use
> instead), while a normal linker just works the way it always did?
[...]

I agree. Either that, or have an external utility for doing it.

I must say that IMO current linker technology is quite dumb, and no harm could come from experimenting with smarter linkers.  Things like duplicate template instantiation elimination have been introduced due to the influence of C++; I don't see why D shouldn't introduce its own advancements to linkers too. :-)


T

-- 
Laissez-faire is a French term commonly interpreted by Conservatives to mean 'lazy fairy,' which is the belief that if governments are lazy enough, the Good Fairy will come down from heaven and do all their work for them.