January 27, 2010
On 2010-01-27, at 14:33, Sean Kelly wrote:

> I'm nearly certain this works on x86 because x86 can do atomic unaligned writes, which is basically the same as byte-level atomic writes.

You're probably right. I guess I read some things wrong before.

About unaligned writes, what happens if you're crossing the boundary of a memory page?


-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/



January 27, 2010
Thanks, Robert. This is very useful!

Andrei

Robert Jacques wrote:
> On Wed, 27 Jan 2010 10:10:49 -0500, Andrei Alexandrescu <andrei at erdani.com> wrote:
> 
>> Hello,
>>
>>
>> I'm looking for _hard data_ on how today's processors address word tearing. As usual, googling for word tearing yields the usual mix of vague information, folklore, and opinionated newsgroup discussions.
>>
>> In particular:
>>
>> a) Can we assume that all or most of today's processors are able to write memory at byte level?
> 
> Not sure. Both x86 and ARM seem to have byte store instructions.
> 
>> b) If not, is it reasonable to have the compiler insert for sub-word shared assignments a call to a function that avoids word tearing by means of a CAS loop?
> 
> Yes, in general, though on x86 xchg (not CAS) should be used instead.
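
To make the CAS-loop idea concrete, here is a minimal sketch in D (assuming core.atomic; atomicStoreByte is a hypothetical helper, not an existing druntime function). It read-modify-writes the aligned 32-bit word containing the byte, so the neighbouring bytes are never torn; on x86 a plain byte store or xchg would of course do the job directly.

import core.atomic;

// Hypothetical helper (not in druntime): store one byte without tearing
// its neighbours, by CASing the aligned 32-bit word that contains it.
void atomicStoreByte(shared(uint)* word, uint byteIndex, ubyte value)
{
    assert(byteIndex < 4);
    immutable shift = byteIndex * 8;
    immutable uint mask = 0xFFu << shift;
    uint oldWord, newWord;
    do
    {
        oldWord = atomicLoad(*word);                        // snapshot the whole word
        newWord = (oldWord & ~mask) | (cast(uint) value << shift);
    }
    while (!cas(word, oldWord, newWord));                   // retry if another thread raced us
}
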
> 
>> c) For 64-bit data (long and double), am I right in assuming that all non-ancient Intel32 processors do offer a means to atomically assign 64-bit data? (What are those asm instructions?) For processors that don't (Intel or not), can we/should we guarantee at the language level that 64-bit writes are atomic? We could effect that by using e.g. a federation of hashed locks, or even (gasp!) two global locks, one for long and one for double, and do something cleverer when public outrage puts our lives in danger. Java guarantees atomic assignment for volatile data, but I'm not sure what mechanisms implementations use.
> 
> The instruction you're looking for is CMPXCHG8B for 32-bit x86 CPUs.
> It's been around since the 486. Other CPUs generally use load-linked/
> store-conditional (LL/SC). From Wikipedia:
> All of Alpha, PowerPC, MIPS, and ARM have LL/SC instructions:
> ldl_l/stl_c and ldq_l/stq_c (Alpha), lwarx/stwcx (PowerPC), ll/sc
> (MIPS), and ldrex/strex (ARM version 6 and above).
> 
> Most platforms provide multiple sets of instructions for different data
> sizes, e.g. ldarx/stdcx for doubleword on the PowerPC.
> Some CPUs require the address being accessed exclusively to be
> configured in write-through mode.
> Some CPUs track the load-linked address at a cache-line or other
> granularity, such that any modification to any portion of the cache line
> (whether via another core's store-conditional or merely by an ordinary
> store) is sufficient to cause the store-conditional to fail.
> All of these platforms provide weak LL/SC. The PowerPC implementation is
> the strongest, allowing an LL/SC pair to wrap loads and even stores to
> other cache lines. This allows it to implement, for example, lock-free
> reference counting in the face of changing object graphs with arbitrary
> counter reuse (which otherwise requires DCAS).
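
As a sketch of how such a 64-bit guarantee could be lowered (again assuming core.atomic; atomicStore64 is a hypothetical name, not an existing druntime symbol), a CAS loop covers every platform listed above: on 32-bit x86 it compiles down to LOCK CMPXCHG8B, and on LL/SC machines to a linked load / conditional store pair.

import core.atomic;

// Hypothetical lowering of a shared 64-bit assignment on targets where a
// plain store may tear (e.g. 32-bit x86): loop until the CAS succeeds.
void atomicStore64(shared(long)* dest, long value)
{
    long seen;
    do
    {
        seen = atomicLoad(*dest);        // read the current contents
    }
    while (!cas(dest, seen, value));     // swap in the new value, retry on interference
}

In practice core.atomic's atomicStore should already cover essentially this, so the compiler would mostly need to decide when to emit the call for shared 64-bit lvalues.
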
> 
> And from an ARM website (STREXD is 64-bit):
> ARM LDREX and STREX are available in ARMv6 and above.
> ARM LDREXB, LDREXH, LDREXD, STREXB, STREXD, and STREXH are available in
> ARMv6K and above.
> All these 32-bit Thumb instructions are available in ARMv6T2 and above,
> except that LDREXD and STREXD are not available in the ARMv7-M profile.
> 
> ARM has also had a swap-byte instruction (SWPB) since v4, which may or may not be equivalent to LDREXB/STREXB.
> 
> So I think it's safe to say that 64-bit writes will be efficient on most CPUs out there and making a language level guarantee is okay.
> 
> Warning: most of this came from some quick Google searches, so I don't know whether there are other gotchas out there.
> 
>>
>> Thanks,
>>
>> Andrei
>> _______________________________________________
>> dmd-concurrency mailing list
>> dmd-concurrency at puremagic.com
>> http://lists.puremagic.com/mailman/listinfo/dmd-concurrency
> 
> _______________________________________________
> dmd-concurrency mailing list
> dmd-concurrency at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/dmd-concurrency
January 27, 2010
I can't provide contradictory data, but be careful with the assumption that just because the CPU provides a byte-assign instruction, it's atomic.

On Wed, 27 Jan 2010, Andrei Alexandrescu wrote:

> Thanks, Robert. This is very useful!
> 
> Andrei
> 
> Robert Jacques wrote:
> > On Wed, 27 Jan 2010 10:10:49 -0500, Andrei Alexandrescu <andrei at erdani.com> wrote:
> > 
> > > Hello,
> > > 
> > > 
> > > I'm looking for _hard data_ on how today's processors address word tearing. As usual, googling for word tearing yields the usual mix of vague information, folklore, and opinionated newsgroup discussions.
> > > 
> > > In particular:
> > > 
> > > a) Can we assume that all or most of today's processors are able to write memory at byte level?
> > 
> > Not sure. Both x86 and ARM seem to have byte store instructions.
> > 
> > > b) If not, is it reasonable to have the compiler insert for sub-word shared assignments a call to a function that avoids word tearing by means of a CAS loop?
> > 
> > Yes, in general, though on x86 xchg (not CAS) should be used instead.
> > 
> > > c) For 64-bit data (long and double), am I right in assuming that all non-ancient Intel32 processors do offer a means to atomically assign 64-bit data? (What are those asm instructions?) For processors that don't (Intel or not), can we/should we guarantee at the language level that 64-bit writes are atomic? We could effect that by using e.g. a federation of hashed locks, or even (gasp!) two global locks, one for long and one for double, and do something cleverer when public outrage puts our lives in danger. Java guarantees atomic assignment for volatile data, but I'm not sure what mechanisms implementations use.
> > 
> > The instruction you're looking for is CMPXCHG8B for 32-bit x86 CPUs. It's
> > been around since the 486. Other CPUs generally use load-linked/store-conditional (LL/SC).
> > From Wikipedia:
> > All of Alpha, PowerPC, MIPS, and ARM have LL/SC instructions: ldl_l/stl_c
> > and ldq_l/stq_c (Alpha), lwarx/stwcx (PowerPC), ll/sc (MIPS), and
> > ldrex/strex (ARM version 6 and above).
> > 
> > Most platforms provide multiple sets of instructions for different data
> > sizes, e.g. ldarx/stdcx for doubleword on the PowerPC.
> > Some CPUs require the address being accessed exclusively to be configured in
> > write-through mode.
> > Some CPUs track the load-linked address at a cache-line or other
> > granularity, such that any modification to any portion of the cache line
> > (whether via another core's store-conditional or merely by an ordinary
> > store) is sufficient to cause the store-conditional to fail.
> > All of these platforms provide weak LL/SC. The PowerPC implementation is the
> > strongest, allowing an LL/SC pair to wrap loads and even stores to other
> > cache lines. This allows it to implement, for example, lock-free reference
> > counting in the face of changing object graphs with arbitrary counter reuse
> > (which otherwise requires DCAS).
> > 
> > And from an ARM website (STREXD is 64-bit):
> > ARM LDREX and STREX are available in ARMv6 and above.
> > ARM LDREXB, LDREXH, LDREXD, STREXB, STREXD, and STREXH are available in
> > ARMv6K and above.
> > All these 32-bit Thumb instructions are available in ARMv6T2 and above,
> > except that LDREXD and STREXD are not available in the ARMv7-M profile.
> > 
> > ARM has also had a swap-byte instruction (SWPB) since v4, which may or may not be equivalent to LDREXB/STREXB.
> > 
> > So I think it's safe to say that 64-bit writes will be efficient on most CPUs out there and making a language level guarantee is okay.
> > 
> > Warning: most of this came from some quick Google searches, so I don't know whether there are other gotchas out there.
> > 
> > > 
> > > Thanks,
> > > 
> > > Andrei
> > > _______________________________________________
> > > dmd-concurrency mailing list
> > > dmd-concurrency at puremagic.com
> > > http://lists.puremagic.com/mailman/listinfo/dmd-concurrency
> > 
> > _______________________________________________
> > dmd-concurrency mailing list
> > dmd-concurrency at puremagic.com
> > http://lists.puremagic.com/mailman/listinfo/dmd-concurrency
> _______________________________________________
> dmd-concurrency mailing list
> dmd-concurrency at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/dmd-concurrency
> 
January 27, 2010
On Wed, 27 Jan 2010 16:51:22 -0500, Brad Roberts <braddr at puremagic.com> wrote:
> I can't provide contradictory data, but be careful with the assumption that just because the CPU provides a byte-assign instruction, it's atomic.

Tearing is a different issue from atomicity. A regular store will always suffer from publication-safety (memory-ordering) issues on modern hardware.
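
A small illustration of that distinction, assuming D's core.atomic (publish, consume, and the two globals are made-up names for this sketch): both stores below are individually atomic and tear-free, yet the consumer only sees a consistent payload because of the release/acquire pairing, which a regular store does not provide.

import core.atomic;

shared int  payload;
shared bool ready;

// Publisher: the release store makes `payload` visible before `ready`
// can be observed as true.
void publish(int value)
{
    atomicStore!(MemoryOrder.raw)(payload, value);
    atomicStore!(MemoryOrder.rel)(ready, true);
}

// Consumer: the acquire load pairs with the release store above.
int consume()
{
    while (!atomicLoad!(MemoryOrder.acq)(ready)) { }
    return atomicLoad!(MemoryOrder.raw)(payload);
}
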

January 27, 2010
On Jan 27, 2010, at 11:47 AM, Michel Fortin wrote:

> On 2010-01-27, at 14:33, Sean Kelly wrote:
> 
>> I'm nearly certain this works on x86 because x86 can do atomic unaligned writes, which is basically the same as byte-level atomic writes.
> 
> You're probably right. I guess I read some things wrong before.
> 
> About unaligned writes, what happens if you're crossing the boundary of a memory page?

It doesn't work.  But unaligned atomic writes are so slow anyway that you never actually want to use them, so it's not really a big deal.
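
The practical rule that falls out of this: keep anything you intend to access atomically naturally aligned, so a single access can never straddle a cache line or a page. A tiny sanity check along those lines (naturallyAligned64 is a hypothetical helper, not something from druntime):

// Hypothetical helper: true if a 64-bit location is naturally aligned,
// so a single access to it can never cross a cache-line or page boundary.
bool naturallyAligned64(shared(long)* p)
{
    return (cast(size_t) p & 7) == 0;
}

unittest
{
    shared long x;
    assert(naturallyAligned64(&x));
}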
January 27, 2010
Yeah, I know older (non-x86?) processors did an RMW operation at the word level to accomplish this.

On Jan 27, 2010, at 1:51 PM, Brad Roberts wrote:

> I can't provide contradictory data, but be careful with the assumption that just because the CPU provides a byte-assign instruction, it's atomic.
> 
> On Wed, 27 Jan 2010, Andrei Alexandrescu wrote:
> 
>> Thanks, Robert. This is very useful!
>> 
>> Andrei
>> 
>> Robert Jacques wrote:
>>> On Wed, 27 Jan 2010 10:10:49 -0500, Andrei Alexandrescu <andrei at erdani.com> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> 
>>>> I'm looking for _hard data_ on how today's processors address word tearing. As usual, googling for word tearing yields the usual mix of vague information, folklore, and opinionated newsgroup discussions.
>>>> 
>>>> In particular:
>>>> 
>>>> a) Can we assume that all or most of today's processors are able to write memory at byte level?
>>> 
>>> Not sure. Both x86 and ARM seem to have byte store instructions.
>>> 
>>>> b) If not, is it reasonable to have the compiler insert for sub-word shared assignments a call to a function that avoids word tearing by means of a CAS loop?
>>> 
>>> Yes, in general, though on x86 xchg (not CAS) should be used instead.
>>> 
>>>> c) For 64-bit data (long and double), am I right in assuming that all non-ancient Intel32 processors do offer a means to atomically assign 64-bit data? (What are those asm instructions?) For processors that don't (Intel or not), can we/should we guarantee at the language level that 64-bit writes are atomic? We could effect that by using e.g. a federation of hashed locks, or even (gasp!) two global locks, one for long and one for double, and do something cleverer when public outrage puts our lives in danger. Java guarantees atomic assignment for volatile data, but I'm not sure what mechanisms implementations use.
>>> 
>>> The instruction you're looking for is CMPXCHG8B for 32-bit x86 CPUs. It's
>>> been around since the 486. Other CPUs generally use load-linked/store-conditional (LL/SC).
>>> From Wikipedia:
>>> All of Alpha, PowerPC, MIPS, and ARM have LL/SC instructions: ldl_l/stl_c
>>> and ldq_l/stq_c (Alpha), lwarx/stwcx (PowerPC), ll/sc (MIPS), and
>>> ldrex/strex (ARM version 6 and above).
>>> 
>>> Most platforms provide multiple sets of instructions for different data
>>> sizes, e.g. ldarx/stdcx for doubleword on the PowerPC.
>>> Some CPUs require the address being accessed exclusively to be configured in
>>> write-through mode.
>>> Some CPUs track the load-linked address at a cache-line or other
>>> granularity, such that any modification to any portion of the cache line
>>> (whether via another core's store-conditional or merely by an ordinary
>>> store) is sufficient to cause the store-conditional to fail.
>>> All of these platforms provide weak LL/SC. The PowerPC implementation is the
>>> strongest, allowing an LL/SC pair to wrap loads and even stores to other
>>> cache lines. This allows it to implement, for example, lock-free reference
>>> counting in the face of changing object graphs with arbitrary counter reuse
>>> (which otherwise requires DCAS).
>>> 
>>> And from an ARM website (STREXD is 64-bit):
>>> ARM LDREX and STREX are available in ARMv6 and above.
>>> ARM LDREXB, LDREXH, LDREXD, STREXB, STREXD, and STREXH are available in
>>> ARMv6K and above.
>>> All these 32-bit Thumb instructions are available in ARMv6T2 and above,
>>> except that LDREXD and STREXD are not available in the ARMv7-M profile.
>>> 
>>> ARM has also had a swap-byte instruction (SWPB) since v4, which may or may not be equivalent to LDREXB/STREXB.
>>> 
>>> So I think it's safe to say that 64-bit writes will be efficient on most CPUs out there and making a language level guarantee is okay.
>>> 
>>> Warning: most of this came from some quick Google searches, so I don't know whether there are other gotchas out there.
>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> Andrei
>>>> _______________________________________________
>>>> dmd-concurrency mailing list
>>>> dmd-concurrency at puremagic.com
>>>> http://lists.puremagic.com/mailman/listinfo/dmd-concurrency
>>> 
>>> _______________________________________________
>>> dmd-concurrency mailing list
>>> dmd-concurrency at puremagic.com
>>> http://lists.puremagic.com/mailman/listinfo/dmd-concurrency
>> _______________________________________________
>> dmd-concurrency mailing list
>> dmd-concurrency at puremagic.com
>> http://lists.puremagic.com/mailman/listinfo/dmd-concurrency
>> 
> _______________________________________________
> dmd-concurrency mailing list
> dmd-concurrency at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/dmd-concurrency
