December 05, 2008
== Quote from dsimcha (dsimcha@yahoo.com)'s article
>
> Thanks, though I'm way ahead of you in that I already did this.  Works great,
> except it's a little bit slow.
> I'm actually working on an implementation of the SuperStack proposed by Andrei
> about a month ago, which was why I needed good TLS.  It seems like with the
> current implementation (using the faster explicit key solution instead of the
> slower class-based solution), about 1/3 of my time is being spent on retrieving
> TLS.  I got this number by caching the stuff from TLS on the stack of the calling
> function and passing it in as a parameter.  This may become a semi-hidden feature
> for wringing out that last bit of performance from SuperStack.  Is TLS inherently
> slow, or is the druntime implementation relatively quick and dirty and likely to
> improve in the future?

The druntime implementation is about as fast as user-level TLS can get, I'm afraid.  If you look at the implementation:

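// Excerpt: only the members on the lookup path are shown.
class ThreadLocal( T )
{
    T val()
    {
        // Fetch this thread's slot; fall back to the default if it is unset.
        Wrap* wrap = cast(Wrap*) Thread.getLocal( m_key );
        return wrap ? wrap.val : m_def;
    }
}

class Thread
{
    static void* getLocal( uint key )
    {
        // Second array index: the per-thread slot table.
        return getThis().m_local[key];
    }

    static Thread getThis()
    {
        // First array index: the OS-level TLS lookup.
        version( Posix )
            return cast(Thread) pthread_getspecific( sm_this );
    }

    void*[LOCAL_MAX] m_local;
}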

The OS-level TLS call is typically implemented as an array indexing operation, so to get a TLS value you're looking at indexing into two arrays, a cast, and then an additional cast and conditional jump if you use ThreadLocal.  Error checking is even omitted for performance reasons.  If I knew of a way to make it faster then I would :-)
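
The workaround dsimcha describes (fetch the thread-local state once in the caller and pass it in as a parameter) might look roughly like the sketch below. It is written against current D, where module-level variables are thread-local by default, and the names TempState, tlsState, bumpAlloc and bumpMany are hypothetical and not taken from druntime or TempAlloc.

```d
// Hypothetical per-thread allocator state; names are illustrative only.
struct TempState
{
    size_t used; // bytes handed out so far on this thread
}

// Module-level variables are thread-local by default in current D.
TempState tlsState;

// Convenience entry point: pays for one TLS lookup per call.
size_t bumpAlloc(size_t nbytes)
{
    return bumpAlloc(nbytes, &tlsState);
}

// Fast path: the caller already holds the state and passes it in,
// so no TLS lookup happens here.
size_t bumpAlloc(size_t nbytes, TempState* state)
{
    immutable offset = state.used;
    state.used += nbytes;
    return offset;
}

// Fetch the state once, then reuse the pointer for many calls.
size_t bumpMany(size_t count, size_t nbytes)
{
    TempState* state = &tlsState; // single TLS lookup
    size_t last;
    foreach (i; 0 .. count)
        last = bumpAlloc(nbytes, state);
    return last;
}
```

The two-argument overload is the "semi-hidden" variant: ordinary callers use the one-argument form, while hot loops can hoist the lookup themselves.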


Sean
December 05, 2008
dsimcha wrote:
> Thanks, though I'm way ahead of you in that I already did this.  Works great,
> except it's a little bit slow.

TLS is always going to be slow. Beating the old drum about how freakin' useful a tool obj2asm is and why doesn't anyone use it, here's what it looks like:

--------------------

__thread int foo;
void main()
{
    foo = 3;
}

---------------------
__Dmain comdat
        assume  CS:__Dmain
                mov     EAX,__tls_index
                mov     ECX,FS:__tls_array
                mov     EDX,[EAX*4][ECX]
                mov     dword ptr _D5test63fooi[EDX],3
                xor     EAX,EAX
                ret
__Dmain ends
---------------------

So you see, it takes 4 instructions to reference a TLS global vs 1 instruction for regular static data. The lesson is to minimize directly referencing such globals. Instead, take a pointer to them, or cache the value into a local.
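
A minimal sketch of that advice, assuming current D where a module-level variable is thread-local without the __thread keyword. The function names are made up for illustration, and an optimizing compiler may already hoist the lookup out of the loop in the first version.

```d
int foo; // thread-local by default in current D

// Direct: every iteration names the TLS global, so each access
// may repeat the lookup sequence.
void addDirect(const(int)[] values)
{
    foreach (v; values)
        foo += v;
}

// Cache the value in a local and write it back once at the end.
void addCached(const(int)[] values)
{
    int local = foo;
    foreach (v; values)
        local += v;
    foo = local;
}

// Take the address once and work through the pointer.
void addViaPointer(const(int)[] values)
{
    int* p = &foo; // the TLS address is computed here, once
    foreach (v; values)
        *p += v;
}
```

All three leave the same result in foo; the cached and pointer versions only differ in how often the thread-local address has to be computed.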

As for whether __thread is completely implemented, yes it is completely implemented in the compiler. I obviously forgot about the gc, though, and I'm glad you found the problem so I can fix it. In the meantime, you can call the gc directly to register your __thread variable as a 'root', then the gc will recognize it properly.
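
One way to do that registration, sketched against the core.memory API as it exists in current druntime (GC.addRange and GC.removeRange); whether the 2008 interface matched this exactly is an assumption. The variable and function names are hypothetical. Current runtimes scan thread-local storage on their own, so there the extra registration is redundant but harmless.

```d
import core.memory : GC;

// Thread-local variable holding the only reference to GC-allocated data.
int[] threadBuffer;

void initBuffer(size_t n)
{
    threadBuffer = new int[n];
    // Ask the collector to scan the variable's own storage for pointers.
    GC.addRange(&threadBuffer, threadBuffer.sizeof);
}

void releaseBuffer()
{
    // Undo the registration before dropping the reference.
    GC.removeRange(&threadBuffer);
    threadBuffer = null;
}
```

GC.addRange is used here rather than GC.addRoot because the thing being registered is the variable's storage, which the collector should scan, not a single GC block to pin.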

If you want to read about how TLS works under Windows, see:

http://www.nynaeve.net/?p=180

On Linux it behaves equivalently, but works completely differently under the hood:

http://people.redhat.com/drepper/tls.pdf
December 05, 2008
On Fri, Dec 5, 2008 at 5:38 PM, Walter Bright <newshound1@digitalmars.com> wrote:
> TLS is always going to be slow. Beating the old drum about how freakin' useful a tool obj2asm is and why doesn't anyone use it, here's what it looks like:

Er, Walter, you realize it's not free, right?  Meaning that even if the EUP is only $15 there's still going to be a lot of people who don't have it just because they don't feel bothered to buy it.
December 05, 2008
Thanks, guys.  I've found ways to speed things up a decent amount, and put an alpha of my SuperStack up on Scrapple, though I renamed it to TempAlloc because I don't like the name SuperStack.  See D.announce and http://dsource.org/projects/scrapple/browser/trunk/tempAlloc
December 06, 2008
Jarrett Billingsley wrote:
> On Fri, Dec 5, 2008 at 5:38 PM, Walter Bright
> <newshound1@digitalmars.com> wrote:
>> TLS is always going to be slow. Beating the old drum about how freakin'
>> useful a tool obj2asm is and why doesn't anyone use it, here's what it looks
>> like:
> 
> Er, Walter, you realize it's not free, right?  Meaning that even if
> the EUP is only $15 there's still going to be a lot of people who
> don't have it just because they don't feel bothered to buy it.

That's like saying one works as an auto mechanic but prefers to use a rock rather than a hammer because a hammer costs $15 !! It's just far too useful to not buy at such a reasonable price.

Even so, obj2asm is free on the linux version.
December 06, 2008
On Fri, Dec 5, 2008 at 7:29 PM, Walter Bright <newshound1@digitalmars.com> wrote:
> That's like saying one works as an auto mechanic but prefers to use a rock rather than a hammer because a hammer costs $15 !! It's just far too useful to not buy at such a reasonable price.
>
> Even so, obj2asm is free on the linux version.

That doesn't really help Windows DMD users who are stuck using an outdated object format that almost nothing else seems to understand. Or on Linux, for that matter, since there are - and always have been - free disassemblers for ELF.
December 06, 2008
Jarrett Billingsley wrote:
> On Fri, Dec 5, 2008 at 7:29 PM, Walter Bright
> <newshound1@digitalmars.com> wrote:
>> That's like saying one works as an auto mechanic but prefers to use a rock
>> rather than a hammer because a hammer costs $15 !! It's just far too useful
>> to not buy at such a reasonable price.
>>
>> Even so, obj2asm is free on the linux version.
> 
> That doesn't really help Windows DMD users who are stuck using an
> outdated object format that almost nothing else seems to understand.
> Or on Linux, for that matter, since there are - and always have been -
> free disassemblers for ELF.

I updated Agner Fog's objconv so that the -fasm option now works with DMD .obj files on Windows. He hasn't released it on his site yet, but I can give it to anyone who's interested.
It disassembles all instructions, even the newly defined ones that don't exist on any current processors. But it still has a few problems, and it won't give you D source code interleaved with the asm output.
December 07, 2008
On Sat, 06 Dec 2008 02:57:20 +0200, Jarrett Billingsley <jarrett.billingsley@gmail.com> wrote:

> On Fri, Dec 5, 2008 at 7:29 PM, Walter Bright
> <newshound1@digitalmars.com> wrote:
>> That's like saying one works as an auto mechanic but prefers to use a rock
>> rather than a hammer because a hammer costs $15 !! It's just far too useful
>> to not buy at such a reasonable price.
>>
>> Even so, obj2asm is free on the linux version.
>
> That doesn't really help Windows DMD users who are stuck using an
> outdated object format that almost nothing else seems to understand.
> Or on Linux, for that matter, since there are - and always have been -
> free disassemblers for ELF.

IDA 4.9 is now free for non-commercial purposes, and it understands DMD's .obj files.

http://www.hex-rays.com/idapro/idadownfreeware.htm

-- 
Best regards,
 Vladimir                          mailto:thecybershadow@gmail.com
December 08, 2008
Walter Bright wrote:
> Steven Schveighoffer wrote:
>> I'd say most likely that the GC doesn't see anything declared as __thread, so when you use that pointer as the only reference to GC allocated data, it doesn't see that it's still in use, and will collect.
> 
> Looks like I need to do some research to see how the gc can discover the extent of tls data.

I've got this working now for Windows and Linux for the main program (not for dll's or shared libraries).
December 08, 2008
Walter Bright, on December 7 at 16:04 you wrote to me:
> Walter Bright wrote:
> >Steven Schveighoffer wrote:
> >>I'd say most likely that the GC doesn't see anything declared as __thread, so when you use that pointer as the only reference to GC allocated data, it doesn't see that it's still in use, and will collect.
> >Looks like I need to do some research to see how the gc can discover the extent of tls data.
> 
> I've got this working now for Windows and Linux for the main program (not for dll's or shared libraries).

I saw the change[1] and I wonder why it mentions the DMD implementation. Shouldn't that code be implementation-agnostic, given that it lives in the "common" part of the runtime? I guess _tlsstart and _tlsend should be added to the runtime specification[2] too, right?

BTW, the change broke the indentation style of druntime :S

Thank you.

[1] http://www.dsource.org/projects/druntime/changeset/57
[2] http://www.dsource.org/projects/druntime/wiki/RuntimeSpec

-- 
Leandro Lucarella (luca) | Group blog: http://www.mazziblog.com.ar/blog/
----------------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145  104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------------
The coin machine, look how it leaves you!
	-- Sidharta Kiwi