[dmd-concurrency] word tearing status in today's processors
January 27, 2010
Hello,


I'm looking for _hard data_ on how today's processors address word tearing. As usual, googling for word tearing yields the usual mix of vague information, folklore, and opinionated newsgroup discussions.

In particular:

a) Can we assume that all or most of today's processors are able to write memory at byte level?

b) If not, is it reasonable to have the compiler insert for sub-word shared assignments a call to a function that avoids word tearing by means of a CAS loop?

c) For 64-bit data (long and double), am I right in assuming that all non-ancient Intel32 processors do offer a means to atomically assign 64-bit data? (What are those asm instructions?) For processors that don't (Intel or not), can we/should we guarantee at the language level that 64-bit writes are atomic? We could effect that by using e.g. a federation of hashed locks, or even (gasp!) two global locks, one for long and one for double, and do something cleverer when public outrage puts our lives in danger. Java guarantees atomic assignment for volatile data, but I'm not sure what mechanisms implementations use.
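
For concreteness, a minimal sketch of the hashed-lock idea, assuming a fixed-size table of mutexes indexed by address (all names here are illustrative, not a proposed API):

	import core.sync.mutex : Mutex;

	// Illustrative only: a small federation of hashed locks guarding
	// 64-bit assignment on targets without native 64-bit atomics.
	__gshared Mutex[64] longLocks;

	shared static this()
	{
		foreach (ref m; longLocks) m = new Mutex;
	}

	void assignLong(shared(long)* p, long value)
	{
		auto m = longLocks[((cast(size_t) p) >> 3) % longLocks.length];
		m.lock();
		scope (exit) m.unlock();
		// Safe only if every access to *p goes through the same lock.
		*cast(long*) p = value;
	}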


Thanks,

Andrei
January 27, 2010
On Wed, Jan 27, 2010 at 10:10 AM, Andrei Alexandrescu <andrei at erdani.com> wrote:

> Hello,
>
>
> I'm looking for _hard data_ on how today's processors address word tearing. As usual, googling for word tearing yields the usual mix of vague information, folklore, and opinionated newsgroup discussions.
>
> In particular:
>
> a) Can we assume that all or most of today's processors are able to write memory at byte level?
>
> b) If not, is it reasonable to have the compiler insert for sub-word shared assignments a call to a function that avoids word tearing by means of a CAS loop?
>
> c) For 64-bit data (long and double), am I right in assuming that all non-ancient Intel32 processors do offer a means to atomically assign 64-bit data? (What are those asm instructions?) For processors that don't (Intel or not), can we/should we guarantee at the language level that 64-bit writes are atomic? We could effect that by using e.g. a federation of hashed locks, or even (gasp!) two global locks, one for long and one for double, and do something cleverer when public outrage puts our lives in danger. Java guarantees atomic assignment for volatile data, but I'm not sure what mechanisms implementations use.
>
>
> Thanks,
>
> Andrei

I've been thinking about this a little.

I think a basic hashed locking scheme like the one you mention is a good idea, especially if you can come up with one that can deal sensibly with different object sizes.  Using spin locks, you could do almost anything with a few CAS operations.

I like the idea of a tiered approach as a starting point:

A. Make a list of operations D would like to do with shared objects, fields, etc.
B. Create a basic implementation of those ops using the global locking scheme.
C. Anything the processor can help you do better can be considered an optimization for that platform.

I'm sure some people would see this as too complex, but this approach separates language capability from platform capability.  It also allows reasoning and performance comparison of layer B, which different vendors / D compilers could do differently.

Tier A, the shared operations, could include byte and int64 operations, but
also anything else you'd like to do with a shared object, such as modifying
an associative array's contents or swapping the contents of two adjacent
fields of the same object.  Maybe this should be limited to things
that might conceivably be possible using atomic CPU operations, such as
atomically modifying two adjacent fields together (I'm thinking of array
assignment but it's probably better for layer 1 to be conceived in terms of
what physical operation is being done), but maybe not something like sorting
an array.

Tier B implements these ops using a simple two-layer locking scheme to lock
memory areas.  Two layers of locks should be enough to handle variable sized
objects if the spinlocks have both a shared and exclusive locking mode.
Use layer 1 to lock disk-page sized regions and layer 2 to lock 8 or 16
byte regions.  To lock anything larger than 8/16 bytes, you lock the layer 1
lock in exclusive mode.  For anything 8/16 bytes or smaller, lock the layer
1 lock in shared mode and the layer 2 lock in exclusive mode.
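
A rough sketch of the shared/exclusive spinlock these two layers could be built from, assuming a single counter where positive values count shared holders and -1 marks an exclusive holder (illustrative and untested):

	import core.atomic : atomicLoad, atomicOp, cas;

	struct SharedExclusiveSpinLock
	{
		shared int count; // > 0: that many shared holders; -1: exclusive

		void lockShared()
		{
			for (;;)
			{
				int c = atomicLoad(count);
				if (c >= 0 && cas(&count, c, c + 1))
					return; // joined the shared holders
			}
		}

		void unlockShared() { atomicOp!"-="(count, 1); }

		void lockExclusive()
		{
			while (!cas(&count, 0, -1)) {} // spin until nobody holds it
		}

		void unlockExclusive() { atomicOp!"+="(count, 1); } // -1 -> 0
	}

	// Layer 1 guards disk-page-sized regions, layer 2 guards 16-byte
	// slots; both hashed by address, table sizes picked arbitrarily.
	__gshared SharedExclusiveSpinLock[256] pageLocks;
	__gshared SharedExclusiveSpinLock[1024] slotLocks;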

Tier C is a cookbook of tricks for modifying things on particular platforms, e.g. if you know that byte-wise writing or int64 assignment is possible without tearing, you can define replacements for the generic ops for that platform.

Kevin
January 27, 2010
On Wed, 27 Jan 2010 10:10:49 -0500, Andrei Alexandrescu <andrei at erdani.com> wrote:

> Hello,
>
>
> I'm looking for _hard data_ on how today's processors address word tearing. As usual, googling for word tearing yields the usual mix of vague information, folklore, and opinionated newsgroup discussions.
>
> In particular:
>
> a) Can we assume that all or most of today's processors are able to write memory at byte level?

Not sure. Both x86 and ARM seem to have byte store instructions.

> b) If not, is it reasonable to have the compiler insert for sub-word shared assignments a call to a function that avoids word tearing by means of a CAS loop?

Yes, in general, though on x86 xchg (not CAS) should be used instead.
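
For platforms without byte stores, such a compiler-inserted helper might look roughly like this, splicing the byte into its containing word with a CAS loop (a sketch; the name and signature are made up):

	import core.atomic : atomicLoad, cas;

	// Store one byte inside a shared, aligned 32-bit word without
	// disturbing its neighbors. byteIndex is 0..3.
	void storeByteViaCas(shared(uint)* word, uint byteIndex, ubyte value)
	{
		immutable shift = byteIndex * 8;
		immutable uint mask = 0xFFu << shift;
		uint oldWord, newWord;
		do
		{
			oldWord = atomicLoad(*word); // snapshot the whole word
			newWord = (oldWord & ~mask) | (cast(uint) value << shift);
		}
		while (!cas(word, oldWord, newWord)); // retry if a neighbor changed
	}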

> c) For 64-bit data (long and double), am I right in assuming that all non-ancient Intel32 processors do offer a means to atomically assign 64-bit data? (What are those asm instructions?) For processors that don't (Intel or not), can we/should we guarantee at the language level that 64-bit writes are atomic? We could effect that by using e.g. a federation of hashed locks, or even (gasp!) two global locks, one for long and one for double, and do something cleverer when public outrage puts our lives in danger. Java guarantees atomic assignment for volatile data, but I'm not sure what mechanisms implementations use.

The instruction you're looking for is CMPXCHG8B for 32-bit x86 CPUs; it's been available since the original Pentium. Other CPUs generally use load-linked/store-conditional (LL/SC). From Wikipedia:
All of Alpha, PowerPC, MIPS, and ARM have LL/SC instructions: ldl_l/stl_c
and ldq_l/stq_c (Alpha), lwarx/stwcx (PowerPC), ll/sc (MIPS), and
ldrex/strex (ARM version 6 and above).

Most platforms provide multiple sets of instructions for different data
sizes, e.g. ldarx/stdcx for doubleword on the PowerPC.
Some CPUs require the address being accessed exclusively to be configured
in write-through mode.
Some CPUs track the load-linked address at a cache-line or other
granularity, such that any modification to any portion of the cache line
(whether via another core's store-conditional or merely by an ordinary
store) is sufficient to cause the store-conditional to fail.
All of these platforms provide weak LL/SC. The PowerPC implementation is
the strongest, allowing an LL/SC pair to wrap loads and even stores to
other cache lines. This allows it to implement, for example, lock-free
reference counting in the face of changing object graphs with arbitrary
counter reuse (which otherwise requires DCAS).

And from an ARM website (STREXD is 64-bit):
ARM LDREX and STREX are available in ARMv6 and above.
ARM LDREXB, LDREXH, LDREXD, STREXB, STREXD, and STREXH are available in
ARMv6K and above.
All these 32-bit Thumb instructions are available in ARMv6T2 and above,
except that LDREXD and STREXD are not available in the ARMv7-M profile.

ARM has also had a swap-byte instruction (SWPB) since v4, which may or may not be equivalent to LDREXB/STREXB.

So I think it's safe to say that 64-bit writes will be efficient on most CPUs out there, and making a language-level guarantee is okay.
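
As a sketch, a 64-bit assignment on 32-bit x86 can then be written as a CAS loop; I'm assuming here that core.atomic's 8-byte cas compiles down to LOCK CMPXCHG8B:

	import core.atomic : cas;

	// Sketch: atomic 64-bit store for IA-32 built on an 8-byte CAS.
	void atomicStoreLong(shared(long)* p, long value)
	{
		long old;
		do
		{
			old = *cast(long*) p; // may tear, but cas() validates it below
		}
		while (!cas(p, old, value)); // fails and retries on a stale/torn read
	}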

Warning: most of this came from some quick Google searches, so I don't know whether there are other gotchas out there.

>
> Thanks,
>
> Andrei

January 27, 2010
On 2010-01-27, at 11:40, Kevin Bealer wrote:

> Tier A, the shared operations, could include byte and int64 operations, but also anything else you'd like to do with a shared object, such as modifying an associative array's contents or swapping the contents of two adjacent fields of the same object.

Hum, nice idea, but we need to be careful with that. Taking a lock from the lock pool to perform a simple data operation is fine. But since you never know in what order each thread will take its locks from the lock pool, no thread should be allowed to take another lock, even a separate object lock, while it holds a lock from the pool.


> Maybe this should be limited to things that might conceivably be possible using atomic CPU operations, such as atomically modifying two adjacent fields together (I'm thinking of array assignment but it's probably better for layer 1 to be conceived in terms of what physical operation is being done), but maybe not something like sorting
> an array.

As I explained, it needs to be limited to specific operations to avoid deadlocks.

If you see atomic operations *as an optimization*, then for that optimization to be possible you need to restrict the lock pool to only what is possible using atomics. That's because you can't mix atomic and lock pool operations on the same variable. If only one of the allowed operations can't be done with atomics, you need to use the lock pool for all others too.

Note that I said "as an optimization". If using the lock pool is done explicitly, the programmer can ensure correct usage:

	shared int x;
	shared long y;

	++atomic(x); // x always accessed as atomic
	++locked(y); // y always accessed as locked

Although in this case it might be safer if this were part of the type information, perhaps through a template:

	shared Atomic!int x;
	shared Locked!long y;

	++x; // x always accessed as atomic
	++y; // y always accessed as locked

Static if could be used to choose between Atomic!T and Locked!T when specific atomic capabilities are missing:

	// check for atomic increment
	static if (is(typeof(Atomic!long.init++)))
		Atomic!long y;
	else
		Locked!long y;
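
A rough sketch of what the two wrappers might look like for increment, assuming integral T and a global hashed mutex pool (none of this is an existing API, and shared qualifiers are elided at the use site for brevity):

	import core.atomic : atomicOp;
	import core.sync.mutex : Mutex;

	struct Atomic(T)
	{
		private shared T value;
		T opUnary(string op : "++")()
		{
			return atomicOp!"+="(value, 1); // hardware atomic increment
		}
	}

	__gshared Mutex[64] lockPool;
	shared static this() { foreach (ref m; lockPool) m = new Mutex; }

	struct Locked(T)
	{
		private T value;
		T opUnary(string op : "++")()
		{
			auto m = lockPool[((cast(size_t) &value) >> 4) % lockPool.length];
			m.lock();
			scope (exit) m.unlock();
			return ++value;
		}
	}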


-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/



January 27, 2010
On 2010-01-27, at 10:10, Andrei Alexandrescu wrote:

> I'm looking for _hard data_ on how today's processors address word tearing. As usual, googling for word tearing yields the usual mix of vague information, folklore, and opinionated newsgroup discussions.
> 
> In particular:
> 
> a) Can we assume that all or most of today's processors are able to write memory at byte level?

I'd like to have an answer to that. You're right that it's terribly difficult to get reliable information on this.

While I believe most have a write instruction at the byte level, I'm not sure whether memory coherency between processors works at this granularity.

I think the best route would be to not make any assumption and offer an API where the user can detect atomic capabilities and seamlessly switch to the lock pool mechanism for variables where the needed atomic operations aren't available.

See my reply to Kevin.

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/



January 27, 2010
Michel Fortin wrote:
> On 2010-01-27, at 10:10, Andrei Alexandrescu wrote:
> 
>> I'm looking for _hard data_ on how today's processors address word tearing. As usual, googling for word tearing yields the usual mix of vague information, folklore, and opinionated newsgroup discussions.
>> 
>> In particular:
>> 
>> a) Can we assume that all or most of today's processors are able to write memory at byte level?
> 
> I'd like to have an answer to that. You're right that it's terribly difficult to get reliable information on this.
> 
> While I believe most have a write instruction at the byte level, I'm not sure whether memory coherency between processors works at this granularity.

Well, that only adds to the pile of folklore I've pored through already :o). We sorely miss a hardcore low-level threads expert on this list.

> I think the best route would be to not make any assumption and offer an API where the user can detect atomic capabilities and seamlessly switch to the lock pool mechanism for variables where the needed atomic operations aren't available.
> 
> See my reply to Kevin.

I don't think we should pursue that path.


Andrei
January 27, 2010
On Wed, Jan 27, 2010 at 12:36 PM, Michel Fortin <michel.fortin at michelf.com> wrote:

> On 2010-01-27, at 11:40, Kevin Bealer wrote:
>
> > Tier A, the shared operations, could include byte and int64 operations,
> > but also anything else you'd like to do with a shared object, such as modifying an associative array's contents or swapping the contents of two adjacent fields of the same object.
>
> Hum, nice idea, but we need to be careful with that. Taking a lock from the lock pool to perform a simple data operation is fine. But since you never know in what order each thread will take its locks from the lock pool, no thread should be allowed to take another lock, even a separate object lock, while it holds a lock from the pool.


What I was thinking of is two non-overlapping lock pools, and the order would always be that you either take a (shared) lock in the first, then (an exclusive in) the second, or you only take the (exclusive) lock in the first.  So all lock ops should be ordered and safe from deadlock as long as the thread doesn't go belly up while holding the lock.  If you needed to take two locks in either pool, you'd need to take them together in each pool in order, and you could take both of the disk-page level locks in numerical order to avoid an ordering issue there -- I'm thinking of these locks as only being accessible to this shared-qualifier implementation layer, not taken from a general pool of locks the user can dip into.  The disk-page locks are not in the same pool as the 8-byte-range locks.

E.g., updating a shared long on a platform that doesn't have 8-byte atomics, assuming a secondary protection granularity of 16 bytes:

address = 12345;

get_shared_lock(table1[address / disk_page_size]);
get_exclusive_lock(table2[address / 16]);

// modify the long in some way
*address = *address + 10;

clear_exclusive_lock(table2[address / 16]);
clear_shared_lock(table1[address / disk_page_size]);

I was thinking of the disk-page locks as a way to support any operation affecting an entire shared struct, no matter how big it is (up to the disk page size).  If we assume that all ops will be field-wise then you could just use one lock that is at the granularity you want to lock at.

This should be deadlock free as long as the set of locks needed for any op is known before getting any of them, which I think is doable if the operations each apply to one location.  This scheme wouldn't work if you wanted to combine arbitrary sets of operations under a single atomicity constraint -- it only works for simply defined operations, like reading a 16-byte value or storing a single byte into a word.

One nice thing about spinlocks is that they don't actually "block", so you can avoid deadlocks based on lock order anyway -- if you can't get the last lock you need after N operations, you could back out the ones you have and wait for a few milliseconds.  But I was trying to avoid that kind of complexity.
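
The back-out idea might look roughly like this, with a bounded try-lock on the second lock (a sketch; the test-and-set spinlock and all names are made up):

	import core.atomic : atomicStore, cas;

	struct SpinLock
	{
		shared bool held;
		bool tryLock() { return cas(&held, false, true); }
		void unlock()  { atomicStore(held, false); }
	}

	// Take a and b together; if b can't be had after maxSpins attempts,
	// back out of a so other threads can make progress, then retry.
	void lockBoth(ref SpinLock a, ref SpinLock b, int maxSpins = 1000)
	{
		for (;;)
		{
			while (!a.tryLock()) {}
			foreach (i; 0 .. maxSpins)
				if (b.tryLock()) return; // got both locks
			a.unlock(); // back out and let others run
		}
	}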


> > Maybe this should be limited to things that might conceivably be possible
> > using atomic CPU operations, such as atomically modifying two adjacent fields together (I'm thinking of array assignment but it's probably better for layer 1 to be conceived in terms of what physical operation is being done), but maybe not something like sorting
> > an array.
>
> As I explained, it needs to be limited to specific operations to avoid deadlocks.
>
> If you see atomic operations *as an optimization*, then for that optimization to be possible you need to restrict the lock pool to only what is possible using atomics. That's because you can't mix atomic and lock pool operations on the same variable. If only one of the allowed operations can't be done with atomics, you need to use the lock pool for all others too.
>
Right.  I was thinking that each operation, defined in terms like "increment an int32", is either a lock-pool op or an atomic op for a given platform.  If your platform allows single-byte writes without tearing then writing shared 'byte' fields should be done using that mechanism.  If it doesn't support 8-byte atomic operations, then those fields have to be handled with the lock pool mechanism in all cases.

This implies that every piece of the data is part of exactly one protection domain -- e.g. if you have an 8-byte region of memory, you need to treat it either as one long or as two ints; you can't mix.  It also implies that if you want byte-level access and don't have it natively, objects falling in the same word need to share their locking strategy -- so maybe every struct would need a map of which fields are locked and which are atomic.


> Note that I said "as an optimization". If using the lock pool is done explicitly, the programmer can ensure correct usage:
>
>        shared int x;
>        shared long y;
>
>        ++atomic(x); // x always accessed as atomic
>        ++locked(y); // y always accessed as locked
>
> Although in this case it might be safer if this was part of the type information, perhaps through a template:
>
>        shared Atomic!int x;
>        shared Locked!long y;
>
>        ++x; // x always accessed as atomic
>        ++y; // y always accessed as locked
>
> Static if could be used to choose between Atomic!T and Locked!T when specific atomic capabilities are missing:
>
>        // check for atomic increment
>        static if (is(typeof(Atomic!long.init++)))
>                Atomic!long y;
>        else
>                Locked!long y;
>
>
> --
> Michel Fortin
> michel.fortin at michelf.com
> http://michelf.com/
>
>
>
January 27, 2010
On Wed, 27 Jan 2010 10:10:49 -0500, Andrei Alexandrescu <andrei at erdani.com> wrote:
> Hello,
>
>
> I'm looking for _hard data_ on how today's processors address word tearing. As usual, googling for word tearing yields the usual mix of vague information, folklore, and opinionated newsgroup discussions.
>
> In particular:
>
> a) Can we assume that all or most of today's processors are able to write memory at byte level?

From an article on the Java Memory Model:
http://javaboutique.internet.com/articles/java_memory/
Word Tearing
Word Tearing occurs when some processors do not allow for a single byte to
be updated individually. In old processors writes to memory are done
through a word (two bytes), so in these cases the processor would read the
whole word from memory and update the appropriate byte and then write the
word back into memory. This is called word tearing. In modern processors
this problem does not exist as they allow a single byte to be written to
memory.

Summarized from a table: http://en.wikipedia.org/wiki/Word_(computing)
8-bit addressing resolution has been standard since the late '70s (with the exception of the Cray).

From Re: JavaMemoryModel: Word tearing
(http://www.cs.umd.edu/~pugh/java/memoryModel/archive/0974.html) Circa 2002
3) Word tearing is not an issue on most machines. SPARC (as of my V9 manual) specifies that loads and stores, even double-word, are atomic. PowerPC specifies that all loads and stores up to the native word size are atomic. And while Intel may not specify the semantics explicitly, mainstream manufacturers are unlikely to go against the standard implementation.

From an Intel forum:
http://software.intel.com/en-us/forums/showthread.php?t=47453
"Recently I learned about word tearing in threaded programs and that it
can occur on alpha processors."

"If a variable crosses the boundary between memory units, which can happen
if the machine supports unaligned memory access, the computer may have to
send the data in two bus transactions. An unaligned 32-bit value, for
example, may be sent by writing the two adjacent 32-bit memory units. If
either memory unit involved in the transaction is simultaneously written
from another processor, half of the value may be lost. This is called
"word tearing.""

"Here are some extracts from the IA-32 Intel Architecture Software
Developer's Manual Volume 3 : System Programming Guide
[start of quote]
7.1.1 Guaranteed Atomic Operations
The Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors
guarantee that the following basic memory operations will always be
carried out atomically:
. Reading or writing a byte
. Reading or writing a word aligned on a 16-bit boundary
. Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium 4, Intel Xeon, and P6 family, and Pentium processors guarantee
that the following additional memory operations will always be carried out
atomically:
. Reading or writing a quadword aligned on a 64-bit boundary.
. 16-bit accesses to uncached memory locations that fit within a 32-bit
data bus.
The P6 family processors guarantee that the following additional memory
operations will always be carried out atomically:
. Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within
a 32-byte cache line.
Accesses to cacheable memory that are split across bus widths, cache lines
and page boundaries are not guaranteed to be atomic by the Pentium 4,
Intel Xeon, P6 family, Pentium and Intel486 processors. The Pentium 4,
Intel Xeon and P6 family processors provide bus control signals that
permit external memory subsystems to make split accesses atomic; however,
nonaligned data accesses will seriously impact the performance of the
processor and should be avoided.
[end of quote]
"

January 27, 2010
On Jan 27, 2010, at 10:00 AM, Andrei Alexandrescu wrote:

> Michel Fortin wrote:
>> On 2010-01-27, at 10:10, Andrei Alexandrescu wrote:
>>> I'm looking for _hard data_ on how today's processors address word
>>> tearing. As usual, googling for word tearing yields the usual mix
>>> of vague information, folklore, and opinionated newsgroup
>>> discussions.
>>> In particular:
>>> a) Can we assume that all or most of today's processors are able to
>>> write memory at byte level?
>> I'd like to have an answer to that. You're right that it's terribly
>> difficult to get reliable information on this.
>> While I believe most have a write instruction at the byte level, I'm
>> not sure whether memory coherency between processors works at this
>> granularity.
> 
> Well, that only adds to the pile of folklore I've pored through already :o). We sorely miss a hardcore low-level threads expert on this list.

I'm nearly certain this works on x86 because x86 can do atomic unaligned writes, which is basically the same as byte-level atomic writes.  I really couldn't say about other CPUs though.  I'd like to believe it's possible, but don't have data to back it up.  Might be worth looking at Joe Seigh's atomics library and seeing if it offers atomic ops at byte-level granularity.
January 27, 2010
On Wed, Jan 27, 2010 at 10:10 AM, Andrei Alexandrescu <andrei at erdani.com> wrote:

> Hello,
>
>
> I'm looking for _hard data_ on how today's processors address word tearing. As usual, googling for word tearing yields the usual mix of vague information, folklore, and opinionated newsgroup discussions.
>
> In particular:
>
> a) Can we assume that all or most of today's processors are able to write memory at byte level?
>
> b) If not, is it reasonable to have the compiler insert for sub-word shared assignments a call to a function that avoids word tearing by means of a CAS loop?
>
> c) For 64-bit data (long and double), am I right in assuming that all non-ancient Intel32 processors do offer a means to atomically assign 64-bit data? (What are those asm instructions?) For processors that don't (Intel or not), can we/should we guarantee at the language level that 64-bit writes are atomic? We could effect that by using e.g. a federation of hashed locks, or even (gasp!) two global locks, one for long and one for double, and do something cleverer when public outrage puts our lives in danger. Java guarantees atomic assignment for volatile data, but I'm not sure what mechanisms implementations use.
>
>
> Thanks,
>
> Andrei


In my previous response, I got sidetracked into thinking about how to protect arbitrary ranges of memory, e.g. whole shared objects.  Rereading this, I realized you're only asking about word tearing.

If I was making this choice, I would probably just use direct atomic ops for fields that had atomicity guarantees and a global spinlock for operations on fields that didn't have atomicity guarantees on that platform.

From Robert Jacques' recent message, it looks like Intel and others are moving in the direction of making more things atomic that weren't, and that recent processors handle all the normal cases.  If so, there's no point in architecting a solution for the various cases that won't matter for long.

Kevin