January 28, 2010
Walter Bright wrote:
> You say the 2 word atomics lead to an entirely different design than 1 word atomics. I agree and think that is pretty clear.

In fact it's pretty darn subtle. It took the academic community years to figure that out. For a while it was believed that e.g. CAS2 was strictly more powerful than CAS. It turned out it's just as powerful, just more convenient.

> My point was that in trying to make 16 <=> 32 bit portability, the problems with allocating 100,000 byte arrays led one inevitably to using entirely different designs between the two, even if the language tried to make such code portable. You couldn't hide those design changes behind a simple macro.

I agree that having a larger space also enables different designs. That doesn't make machine-dependent designs a good thing!

> But very few people advocated forcing 32 bit designs to be coded as if they were 16 bit ones. Similarly, I can't see disabling support for 128 bit atomics just because they don't work on some CPUs.

We could and should allow machine-specific support via intrinsics that may or may not be available, not by enabling or disabling assignments!

Let me clarify what the most dangerous situation is.

1. A developer develops on a certain machine with good capabilities.

2. As we all know software design is like a gas - it will fill everything it can, so sooner or later the programmer writes a shared array assignment. She may not remember it's a dicey feature, or she may use it without realizing it (e.g. by calling a function against a shared array).

3. Now the code will only compile on a subset of platforms. This has two problematic consequences.

3a. There was no way during development to turn on a flag saying "Portability issues are errors".

3b. Rewriting is much more costly now that the entire program has been already designed to take advantage of the shaky functionality.

This _is_ a problem, and please - parallels with 16 bit etc. do not hold. Let's talk about the durned CAS2!

> The user does have a choice

NO! That is a false choice! Nonportabilities that go undetected until you port are NOT a choice, they are a problem with the language.

> - he can program it using 32 bit constraints, and it will work successfully on 64 bit machines.

If you want this to be remotely palatable, you must define a flag that enforces portability. Can we negotiate something in that direction?


Andrei
January 28, 2010
On Thu, Jan 28, 2010 at 2:52 PM, Walter Bright <walter at digitalmars.com> wrote:

> I don't think the message D should send is: "Yes, I know you're using a machine that can do atomic 128 bits, but because some machines you don't have and don't care about don't, you have to go the long way around the block and use a clumsy, awkward, inefficient workaround."
>
> The choices are:
>
> 1. allow atomic access for basic types where the CPU supports it, issue error when it does not. The compiler makes it clear where the programmer needs to pay attention on machines that need it, and so the programmer can select whether to use a (slow) mutex or to redesign the algorithm.
>
> 2. allow atomic access only for basic types supported by the lowest common denominator CPU. This requires the user to use workarounds even on machines that support native operations. It's like the old days where programmers were forced to suffer with emulated floating point even if they'd spent the $$$ for a floating point coprocessor.
>
> 3. allow atomic access for all basic types, emit mutexes for those types where the CPU does not allow atomic access. Keep in mind that mutexes can make access 100 times slower or more. Bartosz suggested to me that silently inserting such mutexes is a bad idea because the programmer would likely prefer to redesign the code than accept such a tremendous slowdown, except that the compiler hides such and makes it hard for him to find.
>
>
> As I've said before, I prefer (1) because D is a systems programming language. Its mission is not the Java "compile once, run everywhere." D should cater to the people who want to get the most out of their machines, not the least. For example, someone writing a device driver *needs* to get all the performance possible. Having some operations be 100x slower just because of portability is not acceptable.
>

I like this analysis in principle, but the #3 option has a factor - 100x slower - has this really been tested?  I'll grant that full pthreads-style mutexes, which are function calls with a lot of overhead and logic built in, not to mention system calls in some cases, are pretty darn slow.  But once we assume that atomics require a memory barrier of some kind on read, and also that a simple spinlock is good enough for a mutex, I wonder if the gap is that large.  Contrast these two designs to implement "shared real x; x = x + 1;"

No magic:

   <memory barrier>
   CAS loop to do x = x + 1
   <memory barrier>

Versus emulated:

   <memory barrier>
   register int sl_index = int(& x) & 0xFF;
   CAS loop to set _spinlock_[sl_index] from 0 to 1
   x = x + 1 // no CAS needed here this time
   _spinlock_[sl_index] = 0 // no CAS needed to unlock
   <memory barrier>

I assume some of these memory barriers are not needed, but is the second design really 100x slower?  I'd think the CAS is the slowest part followed by the memory barrier, and the rest is fairly minor, right?  The sl_index calculation should be cheap since &x must be in a register already.
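[Editor's note: Kevin's emulated design can be made concrete. Below is a minimal, single-threaded sketch using C++11 atomics so it is self-contained and compilable; the 256-entry table, the 0xFF mask, and the names spinlocks/shared_increment are illustrative choices, not anything dmd actually emits.]

```cpp
#include <atomic>
#include <cstdint>
#include <cassert>

static std::atomic<int> spinlocks[256];  // zero-initialized: all unlocked

// Hash the variable's address into the lock table, exactly as in the
// pseudocode above: index = (uintptr_t)&x & 0xFF.
static std::atomic<int>& lock_for(const void* addr) {
    return spinlocks[reinterpret_cast<std::uintptr_t>(addr) & 0xFF];
}

// Emulated atomic increment of a type with no native atomic support.
void shared_increment(double& x) {
    std::atomic<int>& lk = lock_for(&x);
    int expected = 0;
    // CAS loop to take the spinlock (0 -> 1)
    while (!lk.compare_exchange_weak(expected, 1, std::memory_order_acquire))
        expected = 0;  // compare_exchange overwrote it with the current value
    x = x + 1;                               // plain update, no CAS needed
    lk.store(0, std::memory_order_release);  // unlock, no CAS needed
}
```

Note that hashing addresses into a shared lock table means unrelated variables can contend for the same lock; that is the usual trade-off of this scheme.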

Kevin
January 29, 2010
May I suggest that Walter you might not be understanding one important thing (and Andrei isn't helping by skirting the obvious):

Under your rule, Walter, on a 32-bit machine you will be able to use shared arrays.  Arrays are 64-bit items on 32-bit machines, and 32-bit machines have 64-bit atomic operations.

But on a 64-bit machine you will not.  Arrays are 128-bit items on 64-bit machines, and there are no 128-bit atomic operations on a 64-bit machine.

So the transition is backwards -- 64-bit will in turn be *less* capable than 32-bit.  This does not analogize with your examples of trying to emulate 32 bits in 16.  It's more like refusing to emulate 32-bit behavior in 64 bits.  D will be saying "we have full support for shared arrays on 32-bit processors, but no support for them on 64-bit."  In essence it will be saying: don't use shared arrays if you want your code to survive the future, or at least the immediate future in which 64-bit architectures lack 128-bit atomic operations.

So you might as well disable shared array operations until you can get them to work on all platforms.

BUT, I think it still will be possible to use shared array operations in 32-bit land because you can always use casting :)

i.e.:

shared int[] arr;
void updateArr()
{
   // this only works for 32-bit code, where a slice (pointer + length)
   // fits in 64 bits and can be read in one piece
   long tmp = *(cast(shared(long)*) &arr);
   shared(int)[] localcopy = *(cast(shared(int)[]*) &tmp);

   localcopy[2] = 1;
}

-Steve



----- Original Message ----
> From: Walter Bright <walter at digitalmars.com>
> 
> You say the 2 word atomics lead to an entirely different design than 1 word atomics. I agree and think that is pretty clear. My point was that in trying to make 16 <=> 32 bit portability, the problems with allocating 100,000 byte arrays led one inevitably to using entirely different designs between the two, even if the language tried to make such code portable. You couldn't hide those design changes behind a simple macro.
> 
> But very few people advocated forcing 32 bit designs to be coded as if they were 16 bit ones. Similarly, I can't see disabling support for 128 bit atomics just because they don't work on some CPUs.
> 
> The user does have a choice - he can program it using 32 bit constraints, and it will work successfully on 64 bit machines. Or he can code to 64 bit CPU advantages and accept that he's going to have to redesign some of it for 32 bit systems. We shouldn't make that decision for him, as he quite possibly has no intention of trying to port it to 32 bits, and instead wants to exploit the machine he bought.
> 
> Andrei Alexandrescu wrote:
> > That's a very different topic than shared arrays, and you tend to conflate the
> two.
> > 
> > I find the availability or lack thereof of certain types on certain platforms
> quite reasonable. Assume "cent" were still in the language - I'd have no trouble saying that in D cent-based code may or may not work depending on the machine.
> > 
> > Also, I find it reasonable to leave the length of certain types to the
> implementation, as long as we don't fall in C's mistake of making a hash of everything. For example, it's entirely reasonable to say that real is 80 bit on Intel but 64 bit on other machines or 128 bit on other machines. It's also reasonable to say that the size of size_t, ptrdiff_t, and void* depends on the platform.
> > 
> > The issue with atomic assignment of shared arrays is a different beast
> entirely, and no amount of eloquent rhetoric, historical examples, comparisons, and metaphors will counter that. The problem is that using two-word atomics (assignment and CAS) leads to _entirely_ different designs than using one-word atomics. It turns out that the power of the two is the same, but two-word atomics simplify design a fair amount.
> > 
> > So we're at a crossroad. We could say: D offers two-word atomics, go ahead and
> design with it. Or we could say: D offers one-word atomics, go ahead and design with it. That is it.
> > 
> > What is entirely unreasonable is: whatever machine you're on, we're allowing
> maximum possible without warning you of any portability issue. It's not acceptable! Again: D is not assembly language with different instructions for different machines!!!
> > 
> > The maximum we could do is to offer certain intrinsics a la assignDword() and
> clarify that they are nonportable. Then people look up its documentation, grep for it, etc. But throwing such a basic feature as assignment into the fray is very wrong.
> > 
> > 
> > Andrei
> > 
> > Walter Bright wrote:
> >> I'm still ticked that many languages and compilers dropped support for 80 bit
> long doubles on x86 in the 90's because non-x86 CPUs didn't have it.
> >> 
> >> Don Clugston wrote:
> >>> 2010/1/28 Steve Schveighoffer :
> >>> 
> >>>> By that logic, shared arrays must be usable on 32-bit systems and not on 64
> bit.  I think this is a huge portability issue, and an unacceptable inconsistency.
> >>>> 
> >>>> Reals are not the same however, there the capabilities of the processor are
> specifically targeted.
> >>>> 
> >>> 
> >>> Agreed. The IEEE describes 80-bit extended real as a non-interchange
> >>> format. I think it's perfectly acceptable to make its sharing
> >>> behaviour processor-specific. (It could be defined as an ABI issue).
> >>> In general, 'real' does not support atomic operations -- certainly not
> >>> for SPARC 128-bit reals, which are implemented via hardware
> >>> interrupts.
> >>> 
> >> 
> >> ------------------------------------------------------------------------
> >> 
> >> _______________________________________________
> >> dmd-concurrency mailing list
> >> dmd-concurrency at puremagic.com
> >> http://lists.puremagic.com/mailman/listinfo/dmd-concurrency




January 29, 2010
Steve Schveighoffer wrote:
> But on a 64-bit machine you will not.  Arrays are 128-bit items on 64-bit machines, and there are no 128-bit atomic operations on a 64-bit machine.

There are on some, which is exactly what makes the decision difficult :o/. The rest of your assessment is correct.

Andrei
January 29, 2010
Kevin Bealer wrote:
> I like this analysis in principle but the #3 option has a factor - 100x slower - has this really been tested?

Great point. Don't decide til you test. Particularly when the decision is so far-reaching.

> I'll grant that full pthreads-style mutexes, which are function calls with a lot of overhead and logic built in, not to mention system calls in some cases, are pretty darn slow.  But once we assume that atomics require a memory barrier of some kind on read, and also that a simple spinlock is good enough for a mutex, I wonder if the gap is that large.  Contrast these two designs to implement "shared real x; x = x + 1;"
> 
> No magic:
> 
>    <memory barrier>
>    CAS loop to do x = x + 1
>    <memory barrier>
> 
> Versus emulated:
> 
>    <memory barrier>
>    register int sl_index = int(& x) & 0xFF;
>    CAS loop to set _spinlock_[sl_index] from 0 to 1
>    x = x + 1 // no CAS needed here this time
>    _spinlock_[sl_index] = 0 // no CAS needed to unlock
>    <memory barrier>
> 
> I assume some of these memory barriers are not needed, but is the second design really 100x slower?  I'd think the CAS is the slowest part followed by the memory barrier, and the rest is fairly minor, right? The sl_index calculation should be cheap since &x must be in a register already.

This is an excellent point, Kevin. Could someone on this list write and run some test code pronto? Pretty please? With sugar on top?

Also note that shared array assignment is more important to look at than real.

Andrei
January 29, 2010
On Fri, 29 Jan 2010 11:06:11 -0500, Andrei Alexandrescu <andrei at erdani.com> wrote:

> Steve Schveighoffer wrote:
>> But on a 64-bit machine you will not.  Arrays are 128-bit items on 64-bit machines, and there are no 128-bit atomic operations on a 64-bit machine.
>
> There are on some, which is exactly what makes the decision difficult :o/. The rest of your assessment is correct.
>
> Andrei

Actually, atomic read/write is supported for aligned 128-bit values (via SSE2). It's just that some CPUs don't support unaligned atomic 128-bit reads.
January 29, 2010
Robert Jacques wrote:
> On Wed, 27 Jan 2010 18:07:54 -0500, Benjamin Shropshire <benjamin at precisionsoftware.us> wrote:
>> What does MPI do?
>
> MPI stands for message passing interface. It knows nothing of shared state.
>
I was asking whether MPI allows passing 80-bit reals in messages. IIRC it does have ways to convert text from one encoding to another.
January 29, 2010
On Fri, 29 Jan 2010 15:51:53 -0500, Benjamin Shropshire <benjamin at precisionsoftware.us> wrote:
> Robert Jacques wrote:
>> On Wed, 27 Jan 2010 18:07:54 -0500, Benjamin Shropshire <benjamin at precisionsoftware.us> wrote:
>>> What does MPI do?
>>
>> MPI stands for message passing interface. It knows nothing of shared state.
>>
> I was asking whether MPI allows passing 80-bit reals in messages. IIRC it does have ways to convert text from one encoding to another.

Oh, sorry. Actually, to the best of my knowledge MPI doesn't have any conversion routines, text or otherwise. What it does have is some flags for the data type: char, int, float, double and complex. It also supports structs and other data structures, binary blobs, etc. (hmm, it makes me wonder about float endianness..)
January 29, 2010
Robert Jacques wrote:
> On Fri, 29 Jan 2010 11:06:11 -0500, Andrei Alexandrescu <andrei at erdani.com> wrote:
> 
>> Steve Schveighoffer wrote:
>>> But on a 64-bit machine you will not.  Arrays are 128-bit items on 64-bit machines, and there are no 128-bit atomic operations on a 64-bit machine.
>>
>> There are on some, which is exactly what makes the decision difficult :o/. The rest of your assessment is correct.
>>
>> Andrei
> 
> Actually, atomic read/write is supported for aligned 128-bit values (via SSE2). It's just that some CPUs don't support unaligned atomic 128-bit reads.

CMPXCHG16B works on Intel-64 and late-model AMD, but not on earlier AMDs. It does not require special alignment and can be used to emulate straight assignment.

http://en.wikipedia.org/wiki/X86-64

"Early AMD64 processors lacked the CMPXCHG16B instruction, which is an extension of the CMPXCHG8B instruction present on most post-486 processors. Similar to CMPXCHG8B, CMPXCHG16B allows for atomic operations on 128-bit double quadword (or oword) data types. This is useful for parallel algorithms that use compare and swap on data larger than the size of a pointer, common in lock-free and wait-free algorithms. Without CMPXCHG16B one must use workarounds, such as a critical section or alternative lock-free approaches."

Just to restate what I think after having given the matter more thought: it would be an unmitigated disaster and a shame if we allowed atomic shared array assignments on some machines but not others.


Andrei