reduce mangled name sizes via link-time symbol renaming

January 26, 2018
Re: reduce mangled name sizes via link-time symbol renaming
Posted by H. S. Teoh
in reply to Johannes Pfau
On Fri, Jan 26, 2018 at 08:34:50AM +0100, Johannes Pfau via Digitalmars-d wrote: [...]
> What is the benefit of using link-time renaming (a linker specific feature) instead of directly renaming the symbol in the compiler? We could be quite radical and hash all symbols > a certain threshold. As long as we have a hash function with strong enough collision resistance there shouldn't be any problem.

I think this is something worthwhile to implement, or at least try out. Huge symbols have been an ongoing source of trouble in D code, esp. when there's heavy template usage.  Even after Rainer's symbol backref PR was merged, which largely alleviated the recursive symbol bloat problem, we still have cases like object.__switch that need to be addressed.


> AFAICS we only need the mapping hashed_name ==> full name for debugging. So maybe we can simply stuff the full, mangled name somehow into dwarf debug information? We can even keep dwarf debug information in external files and support for this is just being added to GCCs libbacktrace, so even stack traces could work fine.
[...]

I dunno, I'm skeptical that a 10,000-character symbol is of any use to anyone, even for debugging. I mean, what are you going to do with it? Visually scan 10,000 characters to see if it's the same symbol as another 10,000-character symbol in the program? If the only way to make practical use of it is to use a program to compare it, then substituting it with a hash is not any different.

It seems to me that the most useful parts of a long symbol are basically its initial segment, which is usually the module name, useful for narrowing down where the symbol came from, and the ending segment, usually the last symbol(s) of a UFCS chain, or some argument types, useful for determining the function name, or which overload is being called. Given a long enough symbol, the middle portion is pretty much never looked at; it might as well be random characters.  Which suggests the following scheme: if a symbol S exceeds N characters, for a suitably-chosen N (I'd say somewhere around 500 or 1000, as a rough initial stab), then replace it with:

	S[0 .. 80] ~ hashOf(S) ~ S[$-80 .. $]

This gives you 160 human-readable characters of the most useful parts of the symbol, with the largely-useless middle part replaced with a fixed-length hash, so in the worst case, the symbol will be around 2-3 lines long and no more.
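
As a rough sketch of the scheme above (not an actual compiler patch): the helper name `compressSymbol`, its default parameters, and the use of SHA-256 rendered as hex for the hash are all assumptions made for illustration. The scheme itself only asks for some fixed-length hash with good enough collision resistance.

```d
import std.digest : toHexString;
import std.digest.sha : sha256Of;

/// Hypothetical helper: if `sym` is longer than `threshold` characters,
/// keep `keep` characters from each end and replace the middle with a
/// hash of the whole symbol. SHA-256 is an assumed choice of hash.
string compressSymbol(string sym, size_t threshold = 500, size_t keep = 80)
{
    // Leave short symbols alone; also guard against overlapping slices.
    if (sym.length <= threshold || sym.length <= 2 * keep)
        return sym;

    // Hash the full symbol, so two symbols that differ only in the
    // middle still end up with different compressed names.
    ubyte[32] digest = sha256Of(sym);
    char[64] hexBuf = toHexString(digest);
    string hex = hexBuf[].idup;

    // Readable head ~ fixed-length hash ~ readable tail.
    return sym[0 .. keep] ~ hex ~ sym[$ - keep .. $];
}
```

With these assumed defaults, any symbol over 500 characters comes out at 80 + 64 + 80 = 224 characters, and shorter symbols pass through unchanged.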

I chose 80 arbitrarily; it can be longer or shorter, but it's approximately the length of 1 line of code, which presumably should be enough to uniquely identify the source module of the symbol as well as the last function name / parameter types.  Perhaps it can be increased to about 200 or so, give or take, so that compressed symbols are approximately N characters long. Or N can be reduced to match 160 + the ASCII-encoded size of the hash.


T

-- 
In a world without fences, who needs Windows and Gates? -- Christian Surchi