April 13, 2010
On Tue, Apr 13, 2010 at 11:10:24AM -0400, Clemens wrote:
> That's strange. Looking at src/backend/cod4.c, function cdbscan, in the dmd sources, bsr seems to be implemented in terms of the bsr opcode [1] (which I guess is the reason it's an intrinsic in the first place). I would have expected this to be much, much faster than a user function. Anyone care enough to check the generated assembly?

The opcode is fairly slow anyway (as far as opcodes go) - odds are the implementation inside the processor is similar to Jerome's method, and the main savings come from it loading fewer bytes into the pipeline.

I remember a line from a blog, IIRC it was the author of the C++ FQA writing it, saying hardware and software are pretty much the same thing - moving an instruction to hardware doesn't mean it will be any faster, since it is the same algorithm, just done in processor microcode instead of user opcodes.

-- 
Adam D. Ruppe
http://arsdnet.net
April 13, 2010
Adam D. Ruppe wrote:
> On Tue, Apr 13, 2010 at 11:10:24AM -0400, Clemens wrote:
>> That's strange. Looking at src/backend/cod4.c, function cdbscan, in the dmd sources, bsr seems to be implemented in terms of the bsr opcode [1] (which I guess is the reason it's an intrinsic in the first place). I would have expected this to be much, much faster than a user function. Anyone care enough to check the generated assembly?
> 
> The opcode is fairly slow anyway (as far as opcodes go) - odds are the
> implementation inside the processor is similar to Jerome's method, and
> the main savings come from it loading fewer bytes into the pipeline.
> 
> I remember a line from a blog, IIRC it was the author of the C++ FQA
> writing it, saying hardware and software are pretty much the same thing -
> moving an instruction to hardware doesn't mean it will be any faster,
> since it is the same algorithm, just done in processor microcode instead of
> user opcodes.
> 
It's fast on Intel, slow on AMD. I bet the speed difference comes from inlining max().
April 13, 2010
Don wrote:
> Adam D. Ruppe wrote:
>> On Tue, Apr 13, 2010 at 11:10:24AM -0400, Clemens wrote:
>>> That's strange. Looking at src/backend/cod4.c, function cdbscan, in the dmd sources, bsr seems to be implemented in terms of the bsr opcode [1] (which I guess is the reason it's an intrinsic in the first place). I would have expected this to be much, much faster than a user function. Anyone care enough to check the generated assembly?
>>
>> The opcode is fairly slow anyway (as far as opcodes go) - odds are the
>> implementation inside the processor is similar to Jerome's method, and
>> the main savings come from it loading fewer bytes into the pipeline.
>>
>> I remember a line from a blog, IIRC it was the author of the C++ FQA
>> writing it, saying hardware and software are pretty much the same thing -
>> moving an instruction to hardware doesn't mean it will be any faster,
>> since it is the same algorithm, just done in processor microcode instead of
>> user opcodes.
>>
> It's fast on Intel, slow on AMD. I bet the speed difference comes from inlining max().

Specifically, bsr is 7 uops on AMD, 1 uop on Intel since the original Pentium. AMD's performance is shameful.

And bsr() is supported in the compiler; in fact DMC uses it extensively, which is why it's included in DMD!
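For readers following along, the kind of user-level bit-scan being compared against the opcode looks roughly like this - a generic binary-search version, not necessarily the exact code from the thread:

```c
#include <assert.h>
#include <stdint.h>

/* Portable bit-scan-reverse: 0-based index of the highest set bit.
 * Result is undefined for x == 0, mirroring the bsr opcode's contract.
 * A binary search over halves of the word - likely similar in spirit
 * to what a microcoded bsr does internally. */
static int highbit(uint32_t x)
{
    int r = 0;
    if (x & 0xFFFF0000u) { x >>= 16; r += 16; }
    if (x & 0x0000FF00u) { x >>=  8; r +=  8; }
    if (x & 0x000000F0u) { x >>=  4; r +=  4; }
    if (x & 0x0000000Cu) { x >>=  2; r +=  2; }
    if (x & 0x00000002u) {           r +=  1; }
    return r;
}
```

On GCC one would normally reach for __builtin_clz instead; the point is that even the open-coded version is only a handful of cheap ALU operations, which is why a 7-uop hardware bsr barely wins.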
April 13, 2010
Steven Schveighoffer wrote:
> Jérôme M. Berger Wrote:
> 
>> Steven Schveighoffer wrote:
>>> When we're talking about the difference between O(1) and O(lgn), I'll
>>> take accuracy over speed in my compiler any day.
>> 	And when we're talking about the difference between 10s and 55s for
>> a minimal loss of accuracy, which will you take? Especially if the
>> accuracy loss is less than is lost elsewhere (due to holes in the
>> ranges).
> 
> Really?  You rebuilt the compiler with your range propagation algorithm and verified that it adds 10 seconds versus an accurate one that adds 55s?  How much time did the compiler spend to compile?  I'd hazard to guess that a code base that adds 10s worth of your algorithm takes at least a few hours to compile.  Is 55s that bad at that point?
> 
> Again, if it takes the compiler an extra insignificant amount of time to be more accurate, I'm all for accuracy over speed when you get down to that level of insignificance.  I'd say the point of pain has to be at least 10% of the compile time before it makes any sizable difference.
> 
	My point is that if you always choose an algorithm that is 5 to 6
times slower just because it brings extra precision which you may
not really need, then you will wind up with a compiler that is 5 to
6 times slower than it needs to be. Sure the difference on *one*
function is not great in absolute terms, but if you make the same
choice for *all* functions, then where do you go?

		Jerome
-- 
mailto:jeberger@free.fr
http://jeberger.free.fr
Jabber: jeberger@jabber.fr



April 13, 2010
Fawzi Mohamed wrote:
> 
> On 13-apr-10, at 12:02, Don wrote:
>> It's been part of DMD2 for a while now. It allows you to do things like:
>>
>> ubyte lowbits(int x)
>> {
>>    return x & 7;
>> }
>>
>> without an explicit cast. The compiler knows that x&7 can safely fit
>> inside a single byte.  Whereas   ((x&7) << 12) | (x&3);
>> does not fit, and requires an explicit cast.
> 
> ah ok, I understand. I thought that was treated like x & cast(ubyte)7,
> and so as a comparison of the compile-time value with the range of
> ubyte (no range propagation needed).
> But I can understand why it is treated as cast(ubyte)((cast(int)x) & 7),
> as it is probably easier for the compiler, since it upcasts by default.
> 
> Thanks for the explanation.
> 
	Note that in Don's example, "x" is an int, not a ubyte, so you
don't need the cast to "int" in your second example, but you still
need range propagation in the first...
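The rule being applied can be sketched in a few lines of C (hypothetical helper names, an illustration of the idea rather than DMD's actual implementation):

```c
#include <assert.h>
#include <stdint.h>

/* Value range of an expression, as the compiler might track it. */
typedef struct { int64_t min, max; } range_t;

/* For y = x & mask with a non-negative constant mask, every bit
 * outside the mask is cleared, so the result lies in [0, mask]
 * no matter what range x had. */
static range_t and_mask_range(int64_t mask)
{
    range_t r = { 0, mask };
    return r;
}

/* Implicit conversion to ubyte is allowed iff the range fits. */
static int fits_in_ubyte(range_t r)
{
    return r.min >= 0 && r.max <= 255;
}
```

So `x & 7` gets the range [0, 7] and converts implicitly, while `((x & 7) << 12) | (x & 3)` has a maximum of (7 << 12) | 3 = 28675 and needs the explicit cast.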

		Jerome
-- 
mailto:jeberger@free.fr
http://jeberger.free.fr
Jabber: jeberger@jabber.fr



April 13, 2010
Steven Schveighoffer wrote:
> Steven Schveighoffer Wrote:
> 
>> I'll have to double check, I thought I copied your code verbatim (I did switch around the arguments to be minA, maxA, minB, maxB to fit my test harness, and I changed the uint_32_t to uint).  I'll check tomorrow at work where the computer I used to test is.
>>
> 
> I think I see the issue, I was using an older version of your highbit, at some point you changed it, but didn't repeat it in this solution, so I searched for it on the newsgroup and found your first version.  That differs from later versions you posted, but I can't find where you identified there was a bug.
> 
> I'll test again tomorrow, but it looks like that probably accounts for the problem.
> 
> The version I was using: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=108788
> 
> The later version: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=108855
> 
> It looks like in that second post that Ali had the same problem I did.
> 
> Next time, you should include your whole code each time, especially parts you changed.
> 
	Oops, I hadn't noticed that I sent two different versions (except
for the C<->D translation). Sorry about that.

	Note that there is no bug: both versions are correct. The first one
uses 1-based indexes, while the second uses 0-based indexes and
processes 0 differently.
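The difference between the two conventions, sketched in C (illustrative reimplementations, not the exact code from either post):

```c
#include <assert.h>
#include <stdint.h>

/* 0-based convention: index of the highest set bit, with 0 handled
 * specially (returns -1 here). highbit0(1) == 0. */
static int highbit0(uint32_t x)
{
    int r = -1;
    while (x) { x >>= 1; ++r; }
    return r;
}

/* 1-based convention: position of the highest set bit counting from 1,
 * with highbit1(0) == 0 falling out naturally. highbit1(1) == 1. */
static int highbit1(uint32_t x)
{
    int r = 0;
    while (x) { x >>= 1; ++r; }
    return r;
}
```

Both are correct on their own terms; the results simply differ by one, which is exactly the kind of mismatch that bites when one version is silently swapped in for the other.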

		Jerome
-- 
mailto:jeberger@free.fr
http://jeberger.free.fr
Jabber: jeberger@jabber.fr



April 13, 2010
Steven Schveighoffer wrote:
> using your highbit function, which is very neat BTW
> 
	Thanks, but I can't claim credit for it. It's a classical algorithm
which comes up regularly in programming discussions and newsgroups...

> uint maxor(uint mina, uint maxa, uint minb, uint maxb)
> {
>     return maxa | maxb | ((1 << highbit(((1 << (max(highbit(mina ^
> maxa), highbit(minb ^ maxb)) + 1)) - 1) & maxa & maxb)) - 1);
> }
> 
	Neat!
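Transcribed into compilable C, the one-liner checks out against a brute-force search - assuming a 0-based highbit that maps 0 to 0, an assumption on my part since the thread doesn't pin down which highbit version the formula was written against:

```c
#include <assert.h>
#include <stdint.h>

/* 0-based index of the highest set bit; 0 maps to 0, which makes the
 * correction term below vanish for constant ranges. */
static uint32_t highbit(uint32_t x)
{
    uint32_t r = 0;
    while (x >>= 1) ++r;
    return r;
}

static uint32_t max2(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Steven's closed form: upper bound of a | b for a in [mina, maxa],
 * b in [minb, maxb] (unsigned). The inner mask covers the bits where
 * either input can still vary; the highest bit set in both maxa and
 * maxb within that region lets all lower bits be filled in. 64-bit
 * intermediates avoid undefined shifts by 32. */
static uint32_t maxor(uint32_t mina, uint32_t maxa,
                      uint32_t minb, uint32_t maxb)
{
    uint64_t vary = ((uint64_t)1 << (max2(highbit(mina ^ maxa),
                                          highbit(minb ^ maxb)) + 1)) - 1;
    return maxa | maxb
         | (uint32_t)(((uint64_t)1
               << highbit((uint32_t)(vary & maxa & maxb))) - 1);
}

/* Exhaustive reference for small ranges. */
static uint32_t maxor_brute(uint32_t mina, uint32_t maxa,
                            uint32_t minb, uint32_t maxb)
{
    uint32_t best = 0;
    for (uint32_t a = mina; a <= maxa; ++a)
        for (uint32_t b = minb; b <= maxb; ++b)
            if ((a | b) > best) best = a | b;
    return best;
}
```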

		Jerome
-- 
mailto:jeberger@free.fr
http://jeberger.free.fr
Jabber: jeberger@jabber.fr



April 13, 2010
On Tue, 13 Apr 2010 13:37:13 -0400, Jérôme M. Berger <jeberger@free.fr> wrote:

> Steven Schveighoffer wrote:
>> Jérôme M. Berger Wrote:
>>
>>> Steven Schveighoffer wrote:
>>>> When we're talking about the difference between O(1) and O(lgn), I'll
>>>> take accuracy over speed in my compiler any day.
>>> 	And when we're talking about the difference between 10s and 55s for
>>> a minimal loss of accuracy, which will you take? Especially if the
>>> accuracy loss is less than is lost elsewhere (due to holes in the
>>> ranges).
>>
>> Really?  You rebuilt the compiler with your range propagation algorithm and verified that it adds 10 seconds versus an accurate one that adds 55s?  How much time did the compiler spend to compile?  I'd hazard to guess that a code base that adds 10s worth of your algorithm takes at least a few hours to compile.  Is 55s that bad at that point?
>>
>> Again, if it takes the compiler an extra insignificant amount of time to be more accurate, I'm all for accuracy over speed when you get down to that level of insignificance.  I'd say the point of pain has to be at least 10% of the compile time before it makes any sizable difference.
>>
> 	My point is that if you always choose an algorithm that is 5 to 6
> times slower just because it brings extra precision which you may
> not really need, then you will wind up with a compiler that is 5 to
> 6 times slower than it needs to be. Sure the difference on *one*
> function is not great in absolute terms, but if you make the same
> choice for *all* functions, then where do you go?

In my opinion?  Yes, slower compilers that make code easier to write are better.  I don't spend lots of time compiling, I spend it writing code.  And I don't need to babysit the compiler, it goes off and does its thing.

Performance is only important in the end result.  I'm not saying I want my compiler to be slow, but I want it to be accurate and useful more than I want it to be quick.  In this case, there is a quick and fully accurate solution, so it doesn't matter.  But, for instance, if the compiler could do a full analysis to check if variables escape their scope, and that makes the compiler 5x slower, then I'd rather have the compiler verify my work instead of quickly producing memory-corrupting code.

-Steve
April 13, 2010
Steven Schveighoffer wrote:
> In my opinion?  Yes, slower compilers that make code easier to write are better.  I don't spend lots of time compiling, I spend it writing code. And I don't need to babysit the compiler, it goes off and does its thing.
> 
> Performance is only important in the end result.  I'm not saying I want my compiler to be slow, but I want it to be accurate and useful more than I want it to be quick.  In this case, there is a quick and fully accurate solution, so it doesn't matter.  But, for instance, if the compiler could do a full analysis to check if variables escape their scope, and that makes the compiler 5x slower, then I'd rather have the compiler verify my work instead of quickly producing memory-corrupting code.
> 
	This is why there are static verification tools. I'm all in favor
of this kind of tool, but I don't want to have to put up with its
slowness every time I compile something. Having a quick
write-compile-test cycle is very important IMO. Then when I have a
rough outline of the program, I can run another tool (or add a
command line option) to enable extra checking (and then it doesn't
matter if the tool takes the whole night to run and I get the
results the next morning when I get to work).

		Jerome
-- 
mailto:jeberger@free.fr
http://jeberger.free.fr
Jabber: jeberger@jabber.fr



April 17, 2010
On 04/12/2010 12:57 PM, "Jérôme M. Berger" wrote:
> Andrei Alexandrescu wrote:
>> On 04/11/2010 09:18 PM, Adam D. Ruppe wrote:
>>> On Sun, Apr 11, 2010 at 10:00:41PM -0400, Steven Schveighoffer wrote:
>>>> We are talking about range propagation, a function of the compiler,
>>>> not a function of the compiled program.  Therefore, slower but more
>>>> exact functions should be preferred, since they only affect the
>>>> compile time.
>>>
>>> I agree with this. While my solution has a certain beauty to it and is
>>> fast, your solution accurately covers all the bases, whereas mine
>>> fails with
>>> min > 1. I say we go with yours.
>>
>> Agreed. Steve's solution is the best we have for unsigned numbers.
>>
>> Andrei
>>
> 	Nope, not anymore...
>
> 		Jerome

Looks great, thanks Jerome!

Now, how about that case in which some or all of the ranges involved include negative values? A simple approach is to define them in terms of ranges for unsigned numbers. Here are the cases I identified:

a) If both ranges cross zero, then:

minOR == setmsb(min(
  minOR(nomsb(min_a), nomsb(-1), nomsb(min_b), nomsb(max_b)),
  minOR(nomsb(min_a), nomsb(max_a), nomsb(min_b), nomsb(-1))))
maxOR == maxOR(0, max_a, 0, max_b)

b) If both ranges are negative, then:

minOR == setmsb(minOR(nomsb(max_a), nomsb(min_a), nomsb(max_b), nomsb(min_b)))
maxOR == setmsb(maxOR(nomsb(max_a), nomsb(min_a), nomsb(max_b), nomsb(min_b)))

c) If a crosses zero and b doesn't:

minOR == setmsb(minOR(nomsb(min_a), nomsb(-1), min_b, max_b))
maxOR == maxOR(0, max_a, min_b, max_b)

The primitive nomsb clears off the sign bit in a number, and setmsb sets it.
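A minimal sketch of the two primitives for 32-bit values (the names follow the text; the surrounding minOR/maxOR machinery is as defined earlier in the thread):

```c
#include <assert.h>
#include <stdint.h>

/* nomsb clears the sign bit of a 32-bit value; setmsb sets it.
 * These operate on the raw two's-complement bit pattern, so a
 * negative int is first reinterpreted as uint32_t. */
static uint32_t nomsb(uint32_t x)  { return x & 0x7FFFFFFFu; }
static uint32_t setmsb(uint32_t x) { return x | 0x80000000u; }
```

For instance, nomsb(-1) yields 0x7FFFFFFF, the largest magnitude the unsigned helpers can be asked about, which is presumably why -1 appears as a bound in cases (a) and (c).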

Makes sense?



Andrei