July 08, 2004
Arcane Jill wrote:

> In article <ccivu1$pgc$1@digitaldaemon.com>, Norbert Nemec says...
> 
>>If you are interested in seeing the code, I can try to dig it up and bring it to a presentable shape. Will take some time, though.
> 
> Well, there's "interested" and there's "having time". Personally I would *love* to take on a head-to-head optimization challenge - my assembler versus your compiled C. But, I am just too busy right now. I have things to do that are more important than my pride.
> 
> But one day, when I'm less busy (Free time? What's that?), I'll take you up on that challenge. And to make it interesting, the loser buys the winner a Guinness, yes?

I think we can save that time by giving some numbers:

The program we were optimizing used a fixed number of floating point operations per run. For the given size of the data this was, say, 10 billion floating point additions, multiplications, etc. No matter what the optimization, these operations had to be done. (A different algorithm might have needed fewer, but the task was to implement this algorithm.)

The theoretical speed limit of a processor can be given in MFLOPS (million floating point operations per second). The processor we were using back then had a peak performance of roughly 1000 MFLOPS. Therefore it is clear that - no matter what - the program will run for at least 10 billion / 1000 million = 10 seconds. The only problem is to come as near to that limit as possible.

The original version achieved a performance of ~50 MFLOPS, thereby exploiting roughly 5% of the potential of the processor. This is a typical value for naively written C code using simple nested loops over data that is much larger than the cache. 10 billion ops at 50 MFLOPS gives 200 sec running time.
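To make the shape of such code concrete, here is a minimal sketch in D (matrix multiplication is only a stand-in for our actual algorithm, and N is an illustrative size):

    const int N = 1024;

    // c = a * b, all row-major N*N flat arrays. The inner loop walks b
    // column-wise, so once N is large, nearly every access to b misses
    // the cache - this is the "naive nested loops" pattern.
    void naiveMultiply(double[] a, double[] b, double[] c)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
            {
                double r = 0;
                for (int k = 0; k < N; k++)
                    r += a[i * N + k] * b[k * N + j];
                c[i * N + j] = r;
            }
    }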

In the course of the semester, we hand-optimized this code up to ~600 MFLOPS, i.e. exploiting ~60% of the processor's capabilities. This is about the limit that can be achieved for data-intensive real-world calculations. (Programs on small data sets that fit into the L1 cache can often get near the theoretical limit, but they are rather academic.)

Now, 600 MFLOPS means ~17 seconds of run time. 10 seconds of this were spent productively; 7 were wasted somewhere. A hardware profiler showed where the time was lost:

* integer operations and flow control are negligible. They happen in a separate unit of the CPU, so they effectively execute in parallel with the productive floating-point operations.

* very little time was lost to pipeline stalls (the operations were very well interlaced)

* very little time was lost to branch mispredictions (the loops were so simply structured that the CPU's branch prediction unit had no problem)

* the vast majority of the lost time was still spent on cache misses. Even though we did everything to exploit the cache as much as possible, some bottleneck remained there.

Rewriting the code in assembler might perhaps gain you a few more percent, but the only way to really go beyond what we already achieved would be to further optimize the locality of the data access, making more use of the cache. And that means doing the whole thing in assembler would only get more and more complicated.
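For illustration, here is the same stand-in multiply from above with blocked (tiled) loops; the block size B is a made-up tuning parameter and would have to be chosen so that the tiles in flight fit into the cache:

    const int B = 64;   // illustrative; assumes N is a multiple of B

    // Same c = a * b as before (c assumed zeroed), but processed tile by
    // tile so the data touched by the inner loops stays cache-resident.
    void blockedMultiply(double[] a, double[] b, double[] c)
    {
        for (int ii = 0; ii < N; ii += B)
            for (int kk = 0; kk < N; kk += B)
                for (int jj = 0; jj < N; jj += B)
                    for (int i = ii; i < ii + B; i++)
                        for (int k = kk; k < kk + B; k++)
                        {
                            double r = a[i * N + k];
                            for (int j = jj; j < jj + B; j++)
                                c[i * N + j] += r * b[k * N + j];
                        }
    }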

As a last word on the topic: nothing I said here is special to the algorithm we were looking at; it is the daily bread of numerics. "Numerics" means that there are three possible bottlenecks:

* The floating point unit itself
* Pipeline-stalls
* Data access.

The first can only be overcome by mathematics, i.e. by searching for different algorithms. The second means that fine-grained instruction reordering is important; the third, that data accesses should be reordered on a larger scale. None of these problems can be solved by hand-coded assembler (except in rare, very simple cases). Using plain C cleverly is possible but ugly. In the long run, optimizing compilers for well-designed languages might take this ugly job off the user's hands.
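To give one concrete instance of the second bottleneck: a reduction written as a single running sum serializes on the latency of the floating point add. A sketch of the usual cure, splitting it into independent accumulators (the function name and the count of four are arbitrary):

    // Four independent partial sums let the FPU overlap additions
    // instead of stalling on one long dependency chain.
    double sum4(double[] data)
    {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i;
        for (i = 0; i + 4 <= data.length; i += 4)
        {
            s0 += data[i];
            s1 += data[i + 1];
            s2 += data[i + 2];
            s3 += data[i + 3];
        }
        for (; i < data.length; i++)   // pick up the remainder
            s0 += data[i];
        return (s0 + s1) + (s2 + s3);
    }

(Note that this changes the order of the additions, which is only acceptable if you can tolerate the slightly different rounding.)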

July 08, 2004
Arcane Jill wrote:
> 
> Well, there's "interested" and there's "having time". Personally I would *love*
> to take on a head-to-head optimization challenge - my assembler versus your
> compiled C. But, I am just too busy right now. I have things to do that are more
> important than my pride.

It's one thing to code a few small routines in assembler.  It's another for the language to integrate them well enough that it really makes a difference.  For instance, you may still have temporaries floating around, or, because you coded a few small atomic operations in assembler, you may be giving up the opportunity to reduce and combine those operations.

Assembler, though it may help, is not the solution to the above mentioned issues.

--Steve
July 08, 2004
Norbert Nemec wrote:
> 
> Rewriting the code in assembler might perhaps gain you a few more
> percent, but the only way to really go beyond what we already achieved
> would be to further optimize the locality of the data access, making
> more use of the cache. And that means doing the whole thing in
> assembler would only get more and more complicated.

You hit the nail on the head here.

My experience in game programming has been exactly this.

Typically we may have a few very small inner loops hand-coded in assembler by the time a game is finished.

Cache misses, and hence (relatively) painfully slow memory access, always seem to end up as our bottleneck (on a main CPU, anyway).

--Steve
July 08, 2004
Derek wrote:
> On Wed, 07 Jul 2004 09:14:32 +0200, Norbert Nemec wrote:
> 
> 
>>I don't know about Walters reasoning in that respect, but I can state my
>>position:
>>
>>I think that even the opSlice that is there should be used with care. As I
>>see it, there is no way to cleanly extend it to multiple dimensions
> 
> 
> So what? Single dimension is just fine for my needs. As far as I can see, a
> slice operation is one that is only done on a vector (single-dimension
> array). To me it's just a way of specifying a subset of vector elements.
> 
> Once one moves into multi-dimensional array operations, e.g. a matrix (a
> vector of vectors), we need a new syntax anyway. opSlice can remain the
> same (a vector slice) and D can have some new opXXX function for 'matrix
> slicing'.
>.....
> 
> 
> My methods currently look a lot like the longhand above ...
> 
>     void opSlice(int i, int j, char[] x)
>     {
>         // copy x into Elem[i .. j], bounds-checked element by element
>         for (int k = 0; i < j; i++, k++)
>         {
>             if ((i >= 0) && (i < Elem.length) && (k < x.length))
>                 Elem[i] = cast(int)x[k];
>         }
>     }
> 
> in anticipation of being able to write ...
> 
>    Foo[i..j] = x;
> 
> whereas I now write ...
> 
>    Foo.opSlice(i, j, x);
> 

Perhaps this is primarily an argument that [] should be a more flexibly overloadable operator?  Perhaps a templated one? So, e.g., a hash could be indexed with something like:
	x	=	myHash["Betty".."Bill"]
which would definitely require overloading the [] operation, as hashes don't usually have any sensibly defined ordering.  (And if you want that, a tree is a better data structure.)
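As a sketch of how such an overload might look, assuming the compiler lets opSlice take non-integer parameters (SortedMap and everything in it is invented for illustration):

    class SortedMap
    {
        char[][] keys;   // kept sorted
        int[]    vals;   // vals[i] belongs to keys[i]

        // would be invoked for myMap["Betty" .. "Bill"]; half-open,
        // mirroring the built-in slice semantics
        int[] opSlice(char[] lo, char[] hi)
        {
            int[] result;
            for (size_t i = 0; i < keys.length; i++)
                if (keys[i] >= lo && keys[i] < hi)   // lexicographic compare
                    result ~= vals[i];
            return result;
        }
    }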

I will admit to a fond remembrance of Fortran multi-dimensional array subscripting, which I still consider superior for most purposes to the C version.  But Fortran never dealt with ranges, or with user-defined types (or even strings) used as indexes.  So something new is needed to accomplish what is intended.

If tuples were a part of D, I would suggest that array indices be implicit tuples of ranges.  I.e., every index would be implicitly converted to a tuple, even if only of one element, and each tuple member would implicitly be a range, even if the start and finish were the same.  This would produce a kind of orthogonal simplicity in thinking about what is intended here, and simplify the parsing.
Thus one could have:

Foo[i..j, k..l, m..n] = x

And (using "(" as a tuple marker) this would be understood as:
Foo[(i..j), (k..l), (m..n)] = x
which would be (approximately) the same as:
Foo[i..j][k..l][m..n] = x

But note that for this to work, a range would need to be a more general object, so that one could specify how it acts over unexpected types, e.g. over strings, or names.  (We would want Betty..Bill to mean all of the people I'm thinking of between Betty and Bill, inclusive; this means the strings would need to be references to some particular type of string that has a range operation defined over itself.)
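A rough sketch of what such a first-class range might look like, specialized here to string keys (NameRange is invented, not existing D):

    struct NameRange
    {
        char[] first;
        char[] last;    // inclusive, as suggested above for names

        // the membership test a container could apply when handed
        // this range as an index
        bool contains(char[] v)
        {
            return v >= first && v <= last;
        }
    }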

Ranges:  Integers are the relatively simple case of a range, and even there the documentation is quite confusing.  I don't understand why x[1..3] should mean (x[1], x[2]).  That isn't three elements long, and it doesn't end with the third index of x (which would be x[3]).  All I can think of is that in this case we have decided to use 1-based indexing rather than 0-based indexing, which seems quite strange.
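For comparison, here is what actually seems to happen today, as far as I can tell - a half-open interval, where the upper bound is excluded:

    void demo()
    {
        int[4] x;
        x[0] = 10; x[1] = 20; x[2] = 30; x[3] = 40;
        int[] y = x[1..3];           // y holds x[1] and x[2]; x[3] is excluded
        assert(y.length == 3 - 1);   // length == upper bound - lower bound
    }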
July 08, 2004
Arcane Jill wrote:
> In article <miydn03kpeka$.4mtnnmprlmi5.dlg@40tude.net>, Derek says...
> 
> 
>>If I were to reword my example above (with a pedantic streak)...
>>..
> Pretty much every language on the planet can do this. D can do this too. The
> statement:
> 
> #    myData[4..9] = "Derek";
> 
> does actually compile and execute correctly. But what we DON'T have is the
> ability to overload that expression.
> 
> I don't get why adding an opSliceAssign() overload would be a problem though.
> Why does anyone think it would be a problem?
> 
> Arcane Jill

What if you don't have integers as your index?
What I'd like is an extension of what you are proposing, plus an extension of what a range is (i.e., allow ranges to be defined over different types of arguments), so that one would be able to write:
#	myData["Bill"..."Ermitrude"]

Note that here myData would be defined as
	DType[Name]	myData;
and it would be Name that was required to define what a range meant.  And Name would need a special access method (or constructor?) that took a string as a value.  This may be a bit of a stretch beyond what we would want to add before D 1.0 comes out, but it might be reasonable to consider for D 2.0.

Access method vs. constructor:  I tend to think of this as being most useful when one already has a list containing the indices one is attempting to access, so this could just be syntactic sugar for looking up the index of the name (or id#?) and using that as an access method.

P.S.:  What I'm REALLY thinking of, sort of, is a B+Tree or B*Tree that's accessed as if it were an array.  But I want the tree to have keys of any comparable type, not just strings or integers.
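If I had to sketch the shape of it, it would be something like this hypothetical interface (none of this is existing D; the parameterized interface is assumed):

    interface OrderedMap(Key, Value)
    {
        Value opIndex(Key k);                 // exact lookup: map[k]
        Value[] opSlice(Key lo, Key hi);      // range scan:  map[lo..hi]
        void opIndexAssign(Value v, Key k);   // insertion:   map[k] = v
    }

with a B+Tree or B*Tree behind it doing the actual work.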