March 31, 2016
On 3/31/16 9:38 AM, Adam D. Ruppe wrote:
> On Thursday, 31 March 2016 at 13:10:49 UTC, Steven Schveighoffer wrote:
>> Voldemort types are what cause the bloat, templates inside templates
>> aren't as much of a problem.
>
>
> So here's an idea if other things don't work out: voldemort types don't
> have a name that can be said... that could be true in the mangle too.
>
> We could potentially just hash it to some fixed length. Take the
> existing name, SHA1 it, and call the mangle
> function_containing_type$that_hash. Demangling the inside is useless
> anyway, so we lose nothing from that.

Ugh, let's try the huffman coding thing first :) Stack traces would be unusable.

However, this does seem promising if that doesn't work out. The hash would definitely solve the linking and binary size issue.

A possible thing the compiler *could* do is place inside the binary a hash-to-actual-symbol table that the exception printer can utilize to print a nicer stack trace...
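
A rough sketch of how the hashed name and the lookup table could fit together, using SHA-1 from Phobos. The helper `hashedMangle`, the `symbolTable` variable, and the exact `name$hash` layout are assumptions made up for illustration; no compiler does this today.

```d
import std.digest : toHexString;
import std.digest.sha : sha1Of;
import std.stdio : writeln;

// Hypothetical table from the short hashed name back to the full symbol,
// the kind of thing an exception printer could consult.
string[string] symbolTable;

// Hypothetical helper: keep the enclosing function's name readable and
// replace the (possibly huge) inner type name with a fixed-length SHA-1.
string hashedMangle(string functionName, string innerName)
{
    ubyte[20] digest = sha1Of(innerName);
    auto hex = toHexString(digest);   // char[40] static array
    string result = functionName ~ "$" ~ hex[].idup;
    symbolTable[result] = functionName ~ "." ~ innerName;
    return result;
}

void main()
{
    auto name = hashedMangle("foo", "Result!(SomeVeryLongTemplateArgumentList)");
    writeln(name);               // "foo$" followed by 40 hex digits
    writeln(symbolTable[name]);  // full name recovered from the table
}
```

The table costs binary size again, so it would probably belong in debug info rather than in every build.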

-Steve
March 31, 2016
On 3/31/16 9:38 AM, Adam D. Ruppe wrote:

> Of course, a chain of voldemort types will still include that type as a
> template argument in the next call, so names as a whole can still be
> very long, but how long are your chains? I can see this becoming a 10 KB
> long name in extreme circumstances (still yikes) but not megabytes.

If you are hashing anyway, just hash the hash :) In other words, this function:

auto foo(T)(T t)
{
   static struct R {}
   return R();
}

would result in a return type of a hash, with no care about whether T was a hashed symbol or not. This puts a cap on the size of the name, it would never get that big.
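
To illustrate the cap, here is a small sketch that feeds each hashed name back in as the next template argument. `returnTypeName` and the `foo$` prefix are hypothetical stand-ins for whatever the compiler would actually emit.

```d
import std.digest : toHexString;
import std.digest.sha : sha1Of;
import std.stdio : writeln;

// Hypothetical: name of the type returned by foo!T, where only a hash of
// T's (possibly already hashed) name is embedded, never the name itself.
string returnTypeName(string argumentTypeName)
{
    auto hex = toHexString(sha1Of(argumentTypeName)); // char[40]
    return "foo$" ~ hex[].idup;
}

void main()
{
    string name = "int";
    foreach (depth; 0 .. 5)
    {
        name = returnTypeName(name);   // foo(foo(foo(...)))
        writeln(depth, ": ", name.length, " chars");
    }
}
```

Every level reports the same length (4 characters of prefix plus 40 hex digits), no matter how deep the chain goes.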

-Steve
March 31, 2016
On Thursday, 31 March 2016 at 14:00:38 UTC, Steven Schveighoffer wrote:
> Ugh, let's try the huffman coding thing first :)

Do that after and along with!

> Stack traces would be unusable.

That's why I'd keep the function name outside the hash. The inner spam wouldn't be readable (not like a megabyte long name is readable anyway...), but the function name (which includes template arguments*) still is and that's probably the most useful part anyway.

* I just thought of another thing though.... string arguments to templates are included in the mangle, and with CTFE mixin stuff, they can become VERY long.

void foo(string s)() {}

pragma(msg, foo!"hi there, friend".mangleof);

_D1p48__T3fooVAyaa16_68692074686572652c20667269656e64Z3fooFNaNbNiNfZv

That "68692074686572652c20667269656e64" portion of it is the string represented as hexadecimal.
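
For reference, a short sketch that reproduces just that string-value portion. The layout (`a`, the length, `_`, then two hex digits per byte) is read off the example above, and `hexStringArg` is a made-up helper name, not a compiler function.

```d
import std.conv : to;
import std.format : format;
import std.stdio : writeln;

// Builds the string-value part of the mangle as it appears in the example:
// 'a', the number of bytes, '_', then each byte as two lowercase hex digits.
string hexStringArg(string s)
{
    string result = "a" ~ s.length.to!string ~ "_";
    foreach (char c; s)
        result ~= format("%02x", cast(ubyte) c);
    return result;
}

void main()
{
    // prints a16_68692074686572652c20667269656e64
    writeln(hexStringArg("hi there, friend"));
}
```

Every input byte costs two characters in the name, which is why long strings hurt so much.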

If you do a CTFE thing, the code string you pass in may be kept in ALL the names generated from it.

We should probably do something about these too. Simply gzipping before converting to hex is a possible option if we want it reversible. Or, of course, hashing too if we don't care about that.

And IMO a huge string in a stack trace is unreadable anyway... I'd be tempted to say if the string is longer than like 64 chars, just hash it. But this is debatable.
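
A sketch of that cut-off rule, assuming SHA-1 as the hash and a made-up `H` prefix to mark hashed values. The 64 is the debatable threshold from above; `encodeStringArg` is hypothetical.

```d
import std.ascii : LetterCase;
import std.digest : toHexString;
import std.digest.sha : sha1Of;
import std.stdio : writeln;

enum maxReadableLength = 64;   // the "like 64 chars" threshold, debatable

// Hypothetical encoder: short strings stay reversible hex,
// long ones collapse to a fixed-size hash.
string encodeStringArg(string s)
{
    auto bytes = cast(const(ubyte)[]) s;
    if (s.length <= maxReadableLength)
        return toHexString!(LetterCase.lower)(bytes);
    auto hex = toHexString!(LetterCase.lower)(sha1Of(bytes)); // char[40]
    return "H" ~ hex[].idup;   // 'H' marker is an assumption
}

void main()
{
    writeln(encodeStringArg("hi"));   // 6869

    string big;
    foreach (i; 0 .. 1000)
        big ~= "mixin code ";
    writeln(encodeStringArg(big).length);   // 41, regardless of input size
}
```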

> A possible thing the compiler *could* do is place inside the binary a hash-to-actual-symbol table that the exception printer can utilize to print a nicer stack trace...

Indeed. I actually just emailed Liran with this suggestion as well. I don't think it is ideal for D in general, but for an internal project, it might solve some problems.
March 31, 2016
On Thursday, 31 March 2016 at 13:10:49 UTC, Steven Schveighoffer wrote:
>
> I too like Voldemort types, but I actually found moving the types outside the functions quite straightforward. It's just annoying to have to repeat the template parameters. If you make them private, then you can simply avoid all the constraints. It's a bad leak of implementation, since now anything in the file has access to that type directly, but it's better than the issues with voldemort types.
>

If you move each thing that has a Voldemort type into its own module and then do what you say, there is no longer an access issue. The downside is a proliferation of modules.
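
A minimal sketch of what that looks like for a templated range, assuming it lives alone in its own module. `Doubled` and `doubled` are invented names for illustration.

```d
import std.stdio : writeln;

// Module-level and private: code in other modules still cannot name it,
// but its mangled name no longer nests inside doubled's mangled name.
private struct Doubled(R)
{
    R source;
    @property bool empty() { return source.length == 0; }
    @property auto front() { return source[0] * 2; }
    void popFront() { source = source[1 .. $]; }
}

// The template parameter R has to be repeated on the struct,
// which is the annoyance mentioned in the quoted post.
auto doubled(R)(R values)
{
    return Doubled!R(values);
}

void main()
{
    foreach (v; doubled([1, 2, 3]))
        writeln(v);   // 2, 4, 6 on separate lines
    writeln(typeof(doubled([1, 2, 3])).mangleof);
}
```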

I can think of another alternative, but it is probably a needless complexity. Suppose there is a protection attribute with the property that things in the module can only access it if given permission explicitly. For instance, taking the D wiki Voldemort type example and modifying it to your approach would give

struct TheUnnameable
{
	int value;
	this(int x) {value = x;}
	int getValue() { return value; }
}

auto createVoldemortType(int value)
{
    return TheUnnameable(value);
}

TheUnnameable would then be changed to

explicit struct TheUnnameable
{
	explicit(createVoldemortType);
	int value;
	this(int x) {value = x;}
	int getValue() { return value; }
}

where explicit, used as a protection attribute, would restrict access to TheUnnameable to only those things that its explicit(...) declaration names (here, createVoldemortType).
March 31, 2016
On 3/31/16 10:30 AM, Adam D. Ruppe wrote:
> On Thursday, 31 March 2016 at 14:00:38 UTC, Steven Schveighoffer wrote:
>> Ugh, let's try the huffman coding thing first :)
>
> Do that after and along with!
>
>> Stack traces would be unusable.
>
> That's why I'd keep the function name outside the hash. The inner spam
> wouldn't be readable (not like a megabyte long name is readable
> anyway...), but the function name (which includes template arguments*)
> still is and that's probably the most useful part anyway.

Moving types outside the function actually results in a pretty readable stack trace.

I noticed the stack trace printer just gives up after so many characters anyway.

-Steve
March 31, 2016
On Thursday, 31 March 2016 at 11:15:18 UTC, Johan Engelen wrote:
> Hi Anon,
>   I've started implementing your idea. But perhaps you already have a beginning of an implementation? If so, please contact me :)
> https://github.com/JohanEngelen
>
> Thanks,
>   Johan

No, I haven't started implementing anything for that idea. The experiments I did with it were by manually altering mangled names in Vim.

I've been spending my D time thinking about potential changes to how template string value parameters are encoded. My code is a bit messy (and not integrated with the compiler at all), but I use a bootstring technique (similar to Punycode[1]) to encode Unicode text using only [a-zA-Z0-9_].

The results are always smaller than base16 and base64 encodings. For plain ASCII text, the encoding tends to grow by a small amount. For text containing larger UTF-8 code points, the encoding usually ends up smaller than the raw UTF-8 string.
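
For comparison, the base16 and base64 sizes can be computed directly. This sketch only does the size arithmetic with Phobos; it does not implement the bootstring encoder, and the helper names are made up.

```d
import std.base64 : Base64;
import std.stdio : writeln;

// base16 spends two characters per input byte.
size_t base16Length(string s) { return s.length * 2; }

// Standard padded base64 spends four characters per three input bytes.
size_t base64Length(string s) { return Base64.encodeLength(s.length); }

void main()
{
    foreach (text; ["some_identifier", "こんにちは世界"])
    {
        writeln(text.length, " bytes raw, ",
                base16Length(text), " as base16, ",
                base64Length(text), " as base64");
    }
    // 15 bytes raw, 30 as base16, 20 as base64
    // 21 bytes raw, 42 as base16, 28 as base64
}
```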

A couple of examples of my encoder at work:

---
some_identifier
some_identifier_

/usr/include/d/std/stdio.d
usrincludedstdstdiod_jqacdhbd

Hello, World!
HelloWorld_0far4i


こんにちは世界 (UTF-8: 21 bytes)
XtdCDr5mL02g3rv (15 bytes)
---

I still need to clean up the encoder/decoder and iron out some specifics on how this could fit into the mangling, but I should have time to work on this some more later today/tomorrow.

[1]: https://en.wikipedia.org/wiki/Punycode
March 31, 2016
On Thursday, 31 March 2016 at 16:38:59 UTC, Anon wrote:
> I've been spending my D time thinking about potential changes to how template string value parameters are encoded.


How does it compare to simply gzipping the string and writing it out with base62?

March 31, 2016
On Thursday, 31 March 2016 at 16:46:42 UTC, Adam D. Ruppe wrote:
> On Thursday, 31 March 2016 at 16:38:59 UTC, Anon wrote:
>> I've been spending my D time thinking about potential changes to how template string value parameters are encoded.
>
>
> How does it compare to simply gzipping the string and writing it out with base62?

My encoding is shorter in the typical use case, at least when using xz instead of gzip. (xz was quicker/easier to get raw compressed data without a header.)

1= Raw UTF-8, 2= my encoder, 3= `echo -n "$1" | xz -Fraw | base64`

---
1. some_identifier
2. some_identifier_
3. AQA0c29tZV9pZGVudGlmaWVyAA==

1. /usr/include/d/std/stdio.d
2. usrincludedstdstdiod_jqacdhbd
3. AQAZL3Vzci9pbmNsdWRlL2Qvc3RkL3N0ZGlvLmQa

1. Hello, World!
2. HelloWorld_0far4i
3. AQAMSGVsbG8sIFdvcmxkIQA=

1. こんにちは世界
2. XtdCDr5mL02g3rv
3. AQAU44GT44KT44Gr44Gh44Gv5LiW55WMAA==
---

The problem is that compression isn't magical, and a string needs to be long enough and have enough repetition to compress well. If it isn't, compression causes the data to grow, and base64 compounds that. For the sake of fairness, let's also do a larger (compressible) string.

Input: 1000 lines, each with the text "Hello World"

1. 12000 bytes
2. 12008 bytes
3. 94 bytes

However, my encoding is still fairly compressible, so we *could* route it through the same compression if/when a symbol is determined to be compressible. For the 1000-line input above, that yields 114 bytes.

The other thing I really like about my encoder is that plain C identifiers are left verbatim visible in the result. That would be especially nice with, e.g., opDispatch.

Would a hybrid approach (my encoding, optionally using compression when it would be advantageous) make sense? My encoder already has to process the whole string, so it could do some sort of analysis to estimate how compressible the result would be. I don't know what that would look like, but it could work.
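
The simplest version of that idea is to compress and keep the result only when it is actually smaller. The sketch below assumes zlib from Phobos rather than xz, and operates on the raw string instead of the bootstring output; `Encoded` and `encodeMaybeCompressed` are hypothetical names.

```d
import std.array : replicate;
import std.stdio : writeln;
import std.zlib : compress;

// Hypothetical result type: a flag plus whichever payload was smaller.
struct Encoded
{
    bool compressed;
    const(ubyte)[] payload;
}

// Try compression, but fall back to the raw bytes when it does not help.
Encoded encodeMaybeCompressed(string s)
{
    auto raw = cast(const(ubyte)[]) s;
    const(ubyte)[] packed = compress(raw);
    return packed.length < raw.length
        ? Encoded(true, packed)
        : Encoded(false, raw);
}

void main()
{
    auto small = encodeMaybeCompressed("some_identifier");
    writeln(small.compressed, " ", small.payload.length);

    auto big = encodeMaybeCompressed(replicate("Hello World\n", 1000));
    writeln(big.compressed, " ", big.payload.length);
}
```

This compresses every string once just to measure it; an estimate based on repetition would be cheaper, at the cost of occasionally guessing wrong.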

Alternately, we could do the compression on whole mangled names, not just the string values, but I don't know how desirable that is.
March 31, 2016
On Thursday, 31 March 2016 at 17:30:44 UTC, Anon wrote:
> My encoding is shorter in the typical use case


Yeah, but my thought is the typical use case isn't actually the problem - it is OK as it is. Longer strings are where it gets concerning to me.

> Would a hybrid approach (my encoding, optionally using compression when it would be advantageous) make sense?

Yeah, that might be cool too.

> Alternately, we could do the compression on whole mangled names, not just the string values, but I don't know how desirable that is.

There are often a lot of repeats in there... so maybe.
March 31, 2016
On Thursday, 31 March 2016 at 17:52:43 UTC, Adam D. Ruppe wrote:
> Yeah, but my thought is the typical use case isn't actually the problem - it is OK as it is. Longer strings are where it gets concerning to me.

Doubling the size of UTF-8 (the effect of the current base16 encoding) bothers me regardless of string length. Especially when use of __MODULE__ and/or __FILE__ as template arguments seems to be fairly common.

Having thought about it a bit more, I am now of the opinion that super-long strings have no business being in template args, so we shouldn't cater to them.

The main case with longer strings going into template arguments I'm aware of is for when strings will be processed, then fed to `mixin()`. However, that compile time string processing is much better served with a CTFE-able function using a Rope for manipulating the string until it is finalized into a normal string. If you are doing compile-time string manipulation with templates, the big symbol is the least of your worries. The repeated allocation and reallocation will quickly make your code uncompilable due to soaring RAM usage. The same is (was?) true of non-rope CTFE string manipulation. Adding a relatively memory-intensive operation like compression isn't going to help in that case.
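
As a small illustration of the CTFE route, the sketch below builds the code with an ordinary function and mixes in the result, so the string is never a template argument. It uses plain concatenation rather than a rope to stay short; `generateFields` and `Point` are invented for the example.

```d
import std.stdio : writeln;

// Ordinary function; it runs at compile time when called inside mixin().
string generateFields(string[] names)
{
    string code;
    foreach (name; names)
        code ~= "int " ~ name ~ ";\n";
    return code;
}

struct Point
{
    // The generated code is consumed here and never enters a mangled name.
    mixin(generateFields(["x", "y", "z"]));
}

void main()
{
    auto p = Point(1, 2, 3);
    writeln(p.x + p.y + p.z);   // 6
    writeln(Point.mangleof);    // contains only the module and struct name
}
```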

Granted, the language shouldn't have a cut-off for string length in template arguments, but if you load a huge string as a template argument, I think something has gone wrong in your code. Catering to that seems to me to be encouraging it, despite the existence of much better approaches.

The only other case I can think of where you might want a large string as a template argument is something like:

```
struct Foo(string s)
{
    mixin(s);
}

Foo!q{...} foo;
```

But that is much better served as something like:

```
mixin template Q()
{
    mixin(q{...}); // String doesn't end up in mangled name
}

struct Foo(alias A)
{
    mixin A;
}

Foo!Q foo;
```

Or better yet (when possible):

```
mixin template Q()
{
    ... // No string mixin needed
}

struct Foo(alias A)
{
    mixin A;
}

Foo!Q foo;
```

The original mangling discussion started from the need to either fix a mangling problem or officially discourage Voldemort types. Most of the ideas we've been discussing and/or working on have been toward keeping Voldemort types, since many here want them. I'm not sure what use case would actually motivate compressing strings/symbols.

My motivations for bootstring encoding:

* I mostly care about opDispatch, and use of __FILE__/__MODULE__ as compile-time parameters. Symbol bloat from their use isn't severe, but it could be better.
* Template string parameters take roughly 50% of their current mangled size
* Plain C identifier strings (so, most identifiers) will end up directly readable in the mangled name even without a demangler
* Retains current ability to access D symbols from C (in contrast to ideas that would use characters like '$' or '?')
* I already needed bootstring encoding for an unrelated project, and figured I could offer to share it with D, since it seems like it would fit here, too