February 19, 2015
On Tuesday, 17 February 2015 at 22:50:29 UTC, Kai Nacke wrote:
> On Windows, LLVM uses the segment registers for TLS storage (fs: for 32-bit and gs: for 64-bit). There is no other impact.
>

Hijacking this as I'm investigating how TLS plays with shared objects.

So, my understanding is that the segment register is used as a base for TLS, and TLS globals are indexed using this register as a base.

This sounds like it can work until you have to cross shared object boundaries. What is done in this case?
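
For the single-module case described above, here is a minimal C sketch (not D, and not taken from the thread) of what "one copy per thread" looks like in practice. It assumes a C11 compiler with `_Thread_local` and POSIX threads; the names `tls_counter`, `worker` and `run_worker_thread` are made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* One copy of this variable exists per thread (C11 _Thread_local). */
static _Thread_local int tls_counter = 0;

struct worker_result {
    uintptr_t tls_address; /* where the worker's copy lives */
    int tls_value;         /* value the worker saw after its increments */
};

static void *worker(void *arg)
{
    struct worker_result *out = arg;
    for (int i = 0; i < 1000; ++i)
        ++tls_counter;     /* touches only this thread's copy */
    out->tls_address = (uintptr_t)&tls_counter;
    out->tls_value = tls_counter;
    return NULL;
}

/* Runs the worker on a second thread and reports what it saw. */
static struct worker_result run_worker_thread(void)
{
    struct worker_result result = {0, 0};
    pthread_t thread;
    int rc = pthread_create(&thread, NULL, worker, &result);
    assert(rc == 0);
    rc = pthread_join(thread, NULL);
    assert(rc == 0);
    return result;
}
```

If the calling thread sets `tls_counter` before calling `run_worker_thread()`, its own value should be unchanged afterwards, and the address reported by the worker should differ from `&tls_counter` in the caller. How the compiler turns `&tls_counter` into "segment base plus offset" depends on the platform and TLS model.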
February 20, 2015
On Wednesday, 18 February 2015 at 20:05:58 UTC, Jonathan Marler wrote:
>
> If I turn on optimization they both take 7 milliseconds.

You cannot benchmark it like this. To make it more realistic you should use multiple compilation units, add fences and cache invalidation.
February 20, 2015
On 2015-02-19 23:27, deadalnix wrote:

> Hijacking this as I'm investigating how TLS plays with shared objects.

I guess you can read the documentation at [1] and the source code in druntime for Linux and FreeBSD.

> So, my understanding is that the segment register is used as a base for
> TLS, and TLS globals are indexed using this register as a base.
>
> This sounds like it can work until you have to cross shared object
> boundaries. What is done in this case?

I think that is when a runtime helper function is used, e.g. __tls_get_addr. Multiple TLS models exist, and they vary between platforms and depending on which features are needed, e.g. shared objects. There's some documentation [1].
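
As a rough illustration of "multiple models", GCC and Clang let you pick the ELF TLS model per variable with the `tls_model` attribute. This is a C sketch assuming one of those compilers on an ELF platform; the variable and function names are made up, and the comments describe typical behaviour rather than a guarantee.

```c
/* global-dynamic: the most general model. Code built for a shared object
   may resolve the address through a helper such as __tls_get_addr. */
_Thread_local int counter_general
    __attribute__((tls_model("global-dynamic"))) = 0;

/* initial-exec: the variable's offset from the thread pointer is fixed
   when the module is loaded, so no helper call is needed per access. */
_Thread_local int counter_initial
    __attribute__((tls_model("initial-exec"))) = 0;

/* Both functions behave the same; only the generated access code differs. */
int bump_general(void) { return ++counter_general; }
int bump_initial(void) { return ++counter_initial; }
```

Comparing the disassembly of the two functions when built with `-fPIC` is probably the quickest way to see the difference. When linking a plain executable, the linker is allowed to relax the general model to a cheaper one, so the two may end up looking alike.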

[1] http://www.akkadia.org/drepper/tls.pdf

-- 
/Jacob Carlborg
February 22, 2015
"Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> writes:

> On Wednesday, 18 February 2015 at 20:05:58 UTC, Jonathan Marler wrote:
>>
>> If I turn on optimization they both take 7 milliseconds.
>
> You cannot benchmark it like this. To make it more realistic you should use multiple compilation units, add fences and cache invalidation.

Hmm, you got me thinking. An mfence should not be needed for TLS, so in a MT program an expensive TLS lookup could still win. If the cache is blown, wouldn't the time to reload the cache begin to dominate? I know all of this is very architecture dependent, but I have been wary of the number of instructions needed to do a TLS lookup compared to shared. Perhaps I should not be. Am I thinking correctly?
--
Dan
February 22, 2015
On Sunday, 22 February 2015 at 04:33:58 UTC, Dan Olson wrote:
> Hmm, you got me thinking.  A mfence should not be needed for TLS so in a
> MT program, expensive TLS lookup could still win.  If cache is blown,
> wouldn't time to reload cache begin to dominate?  I know all of this is
> very architecture dependent, but I have been wary of the number of
> instructions to do TLS lookup compared to shared.  Perhaps I should not.
> Am I thinking correctly?

The problem is really in synthetic benchmarks that compare apples and oranges. The "problem" may disappear once the TLS tables are loaded into the cache, or if the compiler has moved the "problem" outside of the loop and retained it in a register (which also has a hidden cost). An x86 cache miss is perhaps 100-200 cycles and a 3rd level cache load/full barrier is 30-40 cycles, but a pure read or write barrier is only a few cycles... What is the hidden cost of D TLS versus the optimal codegen for a program? I guess you have to compare C vs D on a set of complex programs to figure it all out.
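
To make the hoisting point concrete, here is a small C timing harness, a sketch only and not the benchmark from this thread. It assumes GCC or Clang (for `noinline`) and POSIX `clock_gettime`; all names are made up. The `noinline` attribute is a stand-in for putting the accessors in a separate compilation unit, so the caller cannot move the TLS address computation out of its loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

static _Thread_local uint64_t tls_total; /* one copy per thread */
static uint64_t global_total;            /* one copy per process */

/* noinline discourages the optimizer from merging these into the loop. */
__attribute__((noinline)) static void add_tls(uint64_t x) { tls_total += x; }
__attribute__((noinline)) static void add_global(uint64_t x) { global_total += x; }

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Calls `add` once per iteration and returns the elapsed nanoseconds. */
static uint64_t time_loop(void (*add)(uint64_t), uint64_t iterations)
{
    uint64_t start = now_ns();
    for (uint64_t i = 0; i < iterations; ++i)
        add(i);
    return now_ns() - start;
}
```

Calling `time_loop(add_tls, n)` and `time_loop(add_global, n)` gives two numbers to compare, but everything is hot in the cache and single-threaded, so it says little about a real program. It only avoids the case where the whole loop collapses to nothing.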
February 23, 2015
On Sunday, 22 February 2015 at 17:36:49 UTC, Ola Fosheim Grøstad wrote:
> The problem is really in synthetic benchmarks that compare apples and oranges. The "problem" may disappear once the TLS tables are loaded into the cache, or if the compiler has moved the "problem" outside of the loop and retained it in a register (which also has a hidden cost). An x86 cache miss is perhaps 100-200 cycles and a 3rd level cache load/full barrier is 30-40 cycles, but a pure read or write barrier is only a few cycles... What is the hidden cost of D TLS versus the optimal codegen for a program? I guess you have to compare C vs D on a set of complex programs to figure it all out.

Yes I agree that you can't determine the general performance of TLS from such a simple program.

Here's what happened: I was writing a program that could optionally use TLS memory. When I turned on TLS memory it slowed down considerably, but only when using an LLVM compiler. No matter how I used TLS, it was much much slower when using LLVM. The simple program is just a simple way to demonstrate that TLS is very slow in one specific type of program. It would be great to see another program that could demonstrate that TLS is actually faster in some use cases. However, since it is sooo much slower, I think you'll have a hard time finding such an example. The simple program demonstrates that TLS is almost 2 orders of magnitude slower...it may not be that much slower in other types of programs...but with numbers like that it seems obvious that something is wrong.
February 23, 2015
On Monday, 23 February 2015 at 04:10:29 UTC, Jonathan Marler wrote:
> Here's what happened: I was writing a program that could optionally use TLS memory.  When I turned on TLS memory it slowed down considerably, but only when using an LLVM compiler.
>  No matter how I used TLS, it was much much slower when using LLVM.  The simple program is just a simple way to demonstrate that TLS is very slow in one specific type of program.

Yeah, demonstrating that it is slow is reasonable. I was thinking more about the other direction: that either globals or TLS is fast is hard to show without a multi-threaded best-of-breed baseline to compare against. (i.e. that TLS is faster than globals, or the other way around, does not say much since they both can be too slow if the code gen is lacking...)

> It would be great to see another program that could demonstrate that TLS is actually faster in some use cases. However, since it is sooo much slower, I think you'll have a hard time finding such an example. The simple program demonstrates that TLS is almost 2 orders of magnitude slower...it may not be that much slower in other types of programs...but with numbers like that it seems obvious that something is wrong.

Some other problems with naive TLS are that every thread gets the same dataset, that you pollute the 3rd level cache compared to globals, and that globals can be fetched without a register (absolute addressing or relative to the program counter). I'd be wary of using TLS for larger data structures, but putting a pointer there instead gives you YET another indirection, and so more cache misses...
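
A minimal C sketch of the "pointer in TLS" variant, to show where the extra indirection comes from. The names and the 64 KiB size are made up for illustration, and it assumes a C11 compiler with `_Thread_local`.

```c
#include <assert.h>
#include <stdlib.h>

enum { SCRATCH_SIZE = 1 << 16 }; /* assumed size of the per-thread buffer */

/* Only a pointer lives in TLS; the buffer itself is on the heap and is
   allocated only for threads that actually ask for it. */
static _Thread_local unsigned char *scratch;

/* Returns this thread's buffer, allocating it (zeroed) on first use.
   Every access costs a TLS load of the pointer plus a dereference. */
static unsigned char *get_scratch(void)
{
    if (scratch == NULL) {
        scratch = calloc(SCRATCH_SIZE, 1);
        assert(scratch != NULL);
    }
    return scratch;
}

/* Frees this thread's buffer; call before the thread exits. */
static void release_scratch(void)
{
    free(scratch);
    scratch = NULL;
}
```

The trade-off is the one described above: threads that never touch the buffer do not carry a copy of it, but each use goes through one more pointer.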
February 23, 2015
"Jonathan Marler" <johnnymarler@gmail.com> writes:

> Here's what happened: I was writing a program that could optionally use TLS memory.  When I turned on TLS memory it slowed down considerably, but only when using an LLVM compiler.  No matter how I used TLS, it was much much slower when using LLVM.

Hi Jonathan.

The reason for slowness on OS X is here in the source code on Apple's website. A TLS access has the extra cost of address lookup by a call to _tlv_get_addr:

_tlv_get_addr:
	movq	8(%rdi),%rax			// get key from descriptor
	movq	%gs:0x0(,%rax,8),%rax	// get thread value
	testq	%rax,%rax				// if NULL, lazily allocate
	je		LlazyAllocate
	addq	16(%rdi),%rax			// add offset from descriptor
	ret
LlazyAllocate:
        ...

http://www.opensource.apple.com/source/dyld/dyld-210.2.3/src/threadLocalHelpers.s
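
Read as C, the fast path above is roughly the following. This is only a sketch of the logic, not Apple's code: the struct mirrors the fields the assembly reads (key at offset 8, variable offset at 16), `pthread_getspecific` stands in for the `%gs`-relative load, and the lazy path here just allocates a zeroed block of an assumed size, where the elided original does whatever `LlazyAllocate` does.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the descriptor that %rdi points to. */
struct tlv_descriptor {
    void *(*thunk)(struct tlv_descriptor *); /* resolver to call */
    pthread_key_t key;                        /* per-module thread key */
    size_t offset;                            /* variable's offset in the block */
};

enum { TLS_BLOCK_SIZE = 64 }; /* assumed size of the per-thread block */

static void *tlv_get_addr_sketch(struct tlv_descriptor *d)
{
    char *base = pthread_getspecific(d->key); /* get thread value */
    if (base == NULL) {                       /* if NULL, lazily allocate */
        base = calloc(1, TLS_BLOCK_SIZE);
        assert(base != NULL);
        int rc = pthread_setspecific(d->key, base);
        assert(rc == 0);
    }
    return base + d->offset;                  /* add offset from descriptor */
}
```

So every access to a thread-local variable goes through an indirect call plus a couple of dependent loads, unless the compiler can cache the returned address.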

--
Dan
February 23, 2015
On 2015-02-23 17:18, Dan Olson wrote:

> Hi Jonathan.
>
> The reason for slowness on OS X is here in the source code on Apple's
> website. A TLS access has the extra cost of address lookup by a call to
> _tlv_get_addr:

Other platforms also use an extra function for some models, e.g. when support for dynamic libraries is required.

-- 
/Jacob Carlborg