July 28, 2017
I've already asked on the main newsgroup, but it seems this didn't catch the attention of our GC experts:
http://forum.dlang.org/thread/oka1vo$4sr$1@digitalmars.com

Basically I want to get emulated TLS working in GDC and I wonder whether we could somehow integrate with the GCC emutls code. We'd need to post some patches for the libgcc emutls code, so I'm interested in the best way to implement the GC scanning, particularly regarding performance.

The main problem is that GCC emutls allocates every single TLS variable in every thread using a malloc call. So we have lots of independent memory ranges. How does the GC perform in such situations, assuming I add an interface to libgcc to iterate all allocated memory ranges and use the scanDG delegate in rt.sections / rt.tlsgc?
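
To make option 1 concrete, here's a minimal D sketch of how the runtime side could hook up. The __emutls_iterate_ranges entry point and its callback signature are assumptions, i.e. exactly the interface we'd have to add to libgcc; ScanDg matches the delegate type used in rt.tlsgc:

    alias ScanDg = void delegate(void* pbeg, void* pend) nothrow;
    alias EmutlsRangeCb = extern (C) void function(void* pbeg, void* pend, void* ctx) nothrow;

    // Assumed new libgcc entry point: invokes `cb` once per malloc'd
    // TLS block of the current thread.
    extern (C) void __emutls_iterate_ranges(EmutlsRangeCb cb, void* ctx) nothrow;

    private extern (C) void scanTrampoline(void* pbeg, void* pend, void* ctx) nothrow
    {
        // Forward each emutls range to the GC's scan delegate.
        (*cast(ScanDg*) ctx)(pbeg, pend);
    }

    // Would be called from rt.tlsgc / rt.sections while the world is stopped.
    void scanEmutlsRanges(scope ScanDg dg) nothrow
    {
        __emutls_iterate_ranges(&scanTrampoline, cast(void*) &dg);
    }

The GC would then see one root range per TLS variable per thread, which is why I'm worried about the scanning overhead of many small, scattered ranges.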

An alternative could be to implement support for custom allocators in GCC emutls and allocate all of our D TLS variables using the GC. We'd still have to scan the per-thread TLS pointer array to avoid pinning all GC allocations, but this should work. The main drawback is considerable bloat in the data segment, since we'd have to store a pointer to the allocation function for every variable.
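
A rough sketch of what option 2 could look like, again with hypothetical names: the struct is a simplified mirror of GCC's __emutls_object control record, and the alloc field is the assumed extension that causes the data segment bloat mentioned above:

    import core.memory : GC;

    alias AllocFn = extern (C) void* function(size_t size, size_t alignment) nothrow;

    // Control record the compiler emits per TLS variable (simplified).
    struct EmutlsObject
    {
        size_t size;   // size of the variable
        size_t align_; // required alignment
        void* loc;     // per-thread offset or pointer, managed by emutls
        void* templ;   // initializer image, null if zero-initialized
        AllocFn alloc; // new: per-variable allocator, the source of the bloat
    }

    // Allocator GDC would reference for D TLS variables, so their
    // storage lives on the GC heap instead of the C heap.
    extern (C) void* _d_emutlsAlloc(size_t size, size_t alignment) nothrow
    {
        // Note: GC.malloc makes no guarantees for over-aligned
        // variables; those would need extra handling.
        return GC.malloc(size);
    }

With this, scanning reduces to registering the per-thread pointer array as a single root range, since the variables themselves become ordinary GC allocations.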

(FYI, more details about the GCC emutls implementation are given in the linked forum thread)

So what do you think is best for GC performance? Option 1 would be a rather simple extension to libgcc; option 2 is more intrusive.

-- Johannes