April 13, 2010
Fawzi Mohamed wrote:
> 
> On 12-apr-10, at 21:40, Steven Schveighoffer wrote:
> 
>> On Mon, 12 Apr 2010 13:45:14 -0400, Jérôme M. Berger <jeberger@free.fr> wrote:
>>
>>> Steven Schveighoffer wrote:
>>>> Jérôme M. Berger Wrote:
>>>>
>>>>> Jérôme M. Berger wrote:
>>>>>>     OK, I found a solution that is optimal in over 99% of the cases
>>>>>> (99.195% to be precise), while still being fast (because it avoids
>>>>>> looping over the bits):
>>>>>>
>>>>>     I've run my function and Andrei's on all possible min, max pairs in
>>>>> 0..299 without checking, just for the timing. On my computer (Athlon
>>>>> X2 64 @2GHz), my function took 50s and Andrei's took 266s. The
>>>>> difference should be even bigger for larger numbers.
>>>>
>>>> Can I ask, what is the point of having a non-exact solution (unless you are just doing this for fun)?
>>>>
>>>> We are talking about range propagation, a function of the compiler, not a function of the compiled program.  Therefore, slower but more exact functions should be preferred, since they only affect the compile time.  Note that you will only need to do this range propagation on lines that "or" two values together, and something that reduces the runtime of the compiler by 216 seconds only when compiling enough code to contain over 8 billion 'or' operations (300^4) is not, I think, really that important.  Give me the exact solution that prevents me from having to cast when the compiler can prove it, even if the runtime of the compiler is slightly slower.
>>>>
>>>     Funny you should say that given the current thread comparing the
>>> speed of the D compiler to that of the Go compiler...
>>
>> My point was simply that the amount of time saved is relative to the size of the program being compiled.  If we assume conservatively that every other line in a program has a bitwise or in it, then you are talking about a 16 billion line program, meaning the 216 seconds you save is insignificant compared to the total compile time.  My point was that your fast but inaccurate solution is not preferable, because nobody notices the speed; they just notice the inaccuracy.
>>
>>>     We are talking about range propagation, a function of the compiler,
>>> not a function of the compiled program. Since we can't get a 100%
>>> accurate representation of the possible values anyway (even yours
>>> might leave holes in the middle after all), then it might be better
>>> to choose a faster, slightly less precise algorithm if the
>>> difference is not too great. That's the difference between a
>>> compiler and a full-fledged static code analysis and program prover.
>>
>> When we're talking about the difference between O(1) and O(lg n), I'll take accuracy over speed in my compiler any day.  The solution should be 100% accurate for the problem statement.  If the problem statement is not what we need, then we need a new problem statement :)  Solving the problem statement for 99% of values is not good enough.
>>
>>>
>>>     Anyway, the point is moot, I have a new version of my algorithm
>>> with 100% precision and high speed.
>>
>> Yes, I'm still trying to understand how it works :)
> 
> Sorry for the probably stupid question, but I don't really understand the need for all this range propagation: in the compiler, either you have a constant (and then you have the value at compile time, so you don't need any range propagation, you can compare with the value directly), or you have a runtime value.


It's been part of DMD2 for a while now. It allows you to do things like:

ubyte lowbits(int x)
{
    return x & 7;
}

without an explicit cast.  The compiler knows that x&7 can safely fit inside a single byte.  Whereas

    ((x&7) << 12) | (x&3);

does not fit, and requires an explicit cast.
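
For comparison, the second case in compilable form (a minimal sketch; the function name is made up for illustration):

ubyte needsCast(int x)
{
    // (x&7) << 12 can be as large as 0x7000, which is outside ubyte's
    // 0..255 range, so the implicit conversion is rejected:
    return cast(ubyte)(((x & 7) << 12) | (x & 3));
}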




April 13, 2010
On 13-apr-10, at 12:02, Don wrote:

> Fawzi Mohamed wrote:
>> On 12-apr-10, at 21:40, Steven Schveighoffer wrote:
>>> On Mon, 12 Apr 2010 13:45:14 -0400, Jérôme M. Berger <jeberger@free.fr> wrote:
>>>
>>>> Steven Schveighoffer wrote:
>>>>> Jérôme M. Berger Wrote:
>>>>>
>>>>>> Jérôme M. Berger wrote:
>>>>>>>    OK, I found a solution that is optimal in over 99% of the cases
>>>>>>> (99.195% to be precise), while still being fast (because it avoids
>>>>>>> looping over the bits):
>>>>>>>
>>>>>>    I've run my function and Andrei's on all possible min, max pairs in
>>>>>> 0..299 without checking, just for the timing. On my computer (Athlon
>>>>>> X2 64 @2GHz), my function took 50s and Andrei's took 266s. The
>>>>>> difference should be even bigger for larger numbers.
>>>>>
>>>>> Can I ask, what is the point of having a non-exact solution (unless you are just doing this for fun)?
>>>>>
>>>>> We are talking about range propagation, a function of the compiler, not a function of the compiled program.  Therefore, slower but more exact functions should be preferred, since they only affect the compile time.  Note that you will only need to do this range propagation on lines that "or" two values together, and something that reduces the runtime of the compiler by 216 seconds only when compiling enough code to contain over 8 billion 'or' operations (300^4) is not, I think, really that important.  Give me the exact solution that prevents me from having to cast when the compiler can prove it, even if the runtime of the compiler is slightly slower.
>>>>>
>>>>    Funny you should say that given the current thread comparing the
>>>> speed of the D compiler to that of the Go compiler...
>>>
>>> My point was simply that the amount of time saved is relative to the size of the program being compiled.  If we assume conservatively that every other line in a program has a bitwise or in it, then you are talking about a 16 billion line program, meaning the 216 seconds you save is insignificant compared to the total compile time.  My point was that your fast but inaccurate solution is not preferable, because nobody notices the speed; they just notice the inaccuracy.
>>>
>>>>    We are talking about range propagation, a function of the compiler,
>>>> not a function of the compiled program. Since we can't get a 100%
>>>> accurate representation of the possible values anyway (even yours
>>>> might leave holes in the middle after all), then it might be better
>>>> to choose a faster, slightly less precise algorithm if the
>>>> difference is not too great. That's the difference between a
>>>> compiler and a full-fledged static code analysis and program prover.
>>>
>>> When we're talking about the difference between O(1) and O(lg n), I'll take accuracy over speed in my compiler any day.  The solution should be 100% accurate for the problem statement.  If the problem statement is not what we need, then we need a new problem statement :)  Solving the problem statement for 99% of values is not good enough.
>>>
>>>>
>>>>    Anyway, the point is moot, I have a new version of my algorithm
>>>> with 100% precision and high speed.
>>>
>>> Yes, I'm still trying to understand how it works :)
>> Sorry for the probably stupid question, but I don't really understand the need for all this range propagation: in the compiler, either you have a constant (and then you have the value at compile time, so you don't need any range propagation, you can compare with the value directly), or you have a runtime value.
>
>
> It's been part of DMD2 for a while now. It allows you to do things like:
>
> ubyte lowbits(int x)
> {
>    return x & 7;
> }
>
> without an explicit cast.  The compiler knows that x&7 can safely fit inside a single byte.  Whereas
>
>     ((x&7) << 12) | (x&3);
>
> does not fit, and requires an explicit cast.

Ah, OK, I understand. I thought that was treated like x & cast(ubyte)7, i.e. as a comparison of the compile-time value against the range of ubyte (no range propagation needed).
But I can see why it is treated as cast(ubyte)((cast(int)x) & 7) instead: it is probably easier for the compiler, since it upcasts by default.
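
To spell that out (a minimal sketch of the promotion rule as described above; names are made up for illustration):

void test(int x)
{
    // Both operands of & are promoted to int first, so the expression
    // is seen as cast(int)x & 7; range propagation then proves the
    // result fits in 0..7, and the implicit conversion is accepted:
    ubyte a = x & 7;

    // With a wider mask the proven range (0..0xFFF) no longer fits in
    // a ubyte, so an explicit cast is required:
    ubyte b = cast(ubyte)(x & 0xFFF);
}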

Thanks for the explanation.

April 13, 2010
On 13-apr-10, at 12:01, bearophile wrote:

> Fawzi Mohamed:
>> I guess that I am missing something obvious, so I don't see the
>> reason
>> for range propagation, but maybe I am not the only one, so thanks for
>> an explanation...
>
> Range propagation can improve the situation a little, so it's not bad. But it can't replace proper integral overflow checks. You need overflow detection at runtime.

Integral overflows are helpful only if you have automatic conversion to a larger type, but that breaks the compile-time knowledge of the size of such an integer, so you have to assume that it might need to be pushed to the heap.
Yes, you might use some tricks (like using size_t.sizeof*8-1 bits, or a pointer to save some space), but I don't think that D wants to go that way for its basic integers...
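
Runtime overflow detection without promoting to a larger type is possible, though. A minimal sketch using core.checkedint (an assumption here: that module postdates this discussion):

import core.checkedint : adds;
import std.stdio : writeln;

void main()
{
    bool overflow = false;
    // adds performs a wrapping add and raises the flag on overflow.
    int sum = adds(int.max, 1, overflow);
    writeln(overflow ? "overflowed" : "ok", ": ", sum);
}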

>
> Bye,
> bearophile



April 13, 2010
On Mon, 12 Apr 2010 22:37:00 -0400, Steven Schveighoffer <schveiguy@yahoo.com> wrote:

> Jérôme M. Berger Wrote:
>
>> Steven Schveighoffer wrote:
>> > Fails for test case:
>> >
>> > minA = 4, maxA = 4, minB = 4, maxB = 6 (outputs 7, accurate result is  
>> 6).
>> >
>> 	Nope, outputs 6. Note that I've run an exhaustive search for all
>> combinations in the [0, 63] range, so if there are mistakes they
>> have to be outside that range...
>>
>
> I'll have to double check, I thought I copied your code verbatim (I did switch around the arguments to be minA, maxA, minB, maxB to fit my test harness, and I changed the uint_32_t to uint).  I'll check tomorrow at work where the computer I used to test is.

I confirm, with the updated highbit, your solution works.

Also, inspired by your solution (and using your highbit function, which is very neat BTW) I came up with a one-liner.  Enjoy :)

uint maxor(uint mina, uint maxa, uint minb, uint maxb)
{
    return maxa | maxb | ((1 << highbit(((1 << (max(highbit(mina ^ maxa), highbit(minb ^ maxb)) + 1)) - 1) & maxa & maxb)) - 1);
}
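
The one-liner leans on Jérôme's highbit, which the thread doesn't show; here is a minimal sketch of a binary-search version with the semantics assumed above (index of the highest set bit, with highbit(0) == 0; max is std.algorithm's):

uint highbit(uint x)
{
    // Binary search: test the upper half of the remaining bits at
    // each step, accumulating the index of the highest set bit.
    uint result = 0;
    if (x & 0xFFFF0000) { result += 16; x >>= 16; }
    if (x & 0xFF00)     { result += 8;  x >>= 8;  }
    if (x & 0xF0)       { result += 4;  x >>= 4;  }
    if (x & 0xC)        { result += 2;  x >>= 2;  }
    if (x & 0x2)        { result += 1; }
    return result;
}

With this definition, maxor(4, 4, 4, 6) indeed yields 6, matching the test case discussed earlier.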

Anyone want to do the equivalent minor?  I've spent way too much time on this :)

-Steve
April 13, 2010
Steven Schveighoffer wrote:
> On Mon, 12 Apr 2010 22:37:00 -0400, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
> 
>> Jérôme M. Berger Wrote:
>>
>>> Steven Schveighoffer wrote:
>>> > Fails for test case:
>>> >
>>> > minA = 4, maxA = 4, minB = 4, maxB = 6 (outputs 7, accurate result 
>>> is 6).
>>> >
>>>     Nope, outputs 6. Note that I've run an exhaustive search for all
>>> combinations in the [0, 63] range, so if there are mistakes they
>>> have to be outside that range...
>>>
>>
>> I'll have to double check, I thought I copied your code verbatim (I did switch around the arguments to be minA, maxA, minB, maxB to fit my test harness, and I changed the uint_32_t to uint).  I'll check tomorrow at work where the computer I used to test is.
> 
> I confirm, with the updated highbit, your solution works.
> 
> Also, inspired by your solution (and using your highbit function, which is very neat BTW) I came up with a one-liner.  Enjoy :)
> 
> uint maxor(uint mina, uint maxa, uint minb, uint maxb)
> {
>     return maxa | maxb | ((1 << highbit(((1 << (max(highbit(mina ^ maxa), highbit(minb ^ maxb)) + 1)) - 1) & maxa & maxb)) - 1);
> }
> 
> Anyone want to do the equivalent minor?  I've spent way too much time on this :)
> 
> -Steve

Awesome stuff, guys.
I think that's worthy of inclusion as an exercise in Knuth's Volume 4, fascicle 1. Suggest it to someone at ACCU?
(Knuth has highbit(x) but calls it lambda(x). He notes the result that highbit(x)==highbit(y) iff x^y <= x&y).
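
That identity is easy to sanity-check exhaustively over small positive values (a minimal sketch, standing core.bitop.bsr in for highbit):

import core.bitop : bsr;

void main()
{
    // highbit(x) == highbit(y)  iff  x^y <= x&y, for x, y > 0.
    foreach (uint x; 1 .. 1024)
        foreach (uint y; 1 .. 1024)
            assert((bsr(x) == bsr(y)) == ((x ^ y) <= (x & y)));
}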

April 13, 2010
Jerome's highbit function is the same as std.intrinsic.bsr. I wonder which is faster?

I ran a test, and for 100 million iterations (1..10000000), the intrinsic beat out his function by a mere 300 milliseconds on my box: highbit ran in an average of 1897 ms and bsr did the same in an average of 1534 ms.

Recompiling with -inline -O -release cuts the raw numbers about in half, but keeps about the same difference, leading me to think that call overhead accounts for a fair share of the time rather than the actual implementation. The new averages are 1134 and 853.

I've gotta say, your implementation of the function is better than I would have done though. Without bsr, I probably would have looped, shifted, and tested each individual bit...
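
A sketch of that kind of harness in modern D (assumptions: core.bitop.bsr stands in for std.intrinsic.bsr, and std.datetime.stopwatch handles the timing; numbers will obviously differ from those above):

import core.bitop : bsr;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

void main()
{
    auto sw = StopWatch(AutoStart.yes);
    uint sink; // accumulate results so the loop isn't optimized away
    foreach (uint i; 1 .. 10_000_000)
        sink += bsr(i);
    sw.stop();
    writeln("bsr: ", sw.peek.total!"msecs", " ms (checksum ", sink, ")");
}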

On 4/13/10, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
> On Mon, 12 Apr 2010 22:37:00 -0400, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
>
>> Jérôme M. Berger Wrote:
>>
>>> Steven Schveighoffer wrote:
>>> > Fails for test case:
>>> >
>>> > minA = 4, maxA = 4, minB = 4, maxB = 6 (outputs 7, accurate result is
>>> 6).
>>> >
>>> 	Nope, outputs 6. Note that I've run an exhaustive search for all
>>> combinations in the [0, 63] range, so if there are mistakes they
>>> have to be outside that range...
>>>
>>
>> I'll have to double check, I thought I copied your code verbatim (I did switch around the arguments to be minA, maxA, minB, maxB to fit my test harness, and I changed the uint_32_t to uint).  I'll check tomorrow at work where the computer I used to test is.
>
> I confirm, with the updated highbit, your solution works.
>
> Also, inspired by your solution (and using your highbit function, which is
> very neat BTW) I came up with a one-liner.  Enjoy :)
>
> uint maxor(uint mina, uint maxa, uint minb, uint maxb)
> {
>      return maxa | maxb | ((1 << highbit(((1 << (max(highbit(mina ^ maxa),
> highbit(minb ^ maxb)) + 1)) - 1) & maxa & maxb)) - 1);
> }
>
> Anyone want to do the equivalent minor?  I've spent way too much time on this :)
>
> -Steve
>
April 13, 2010
On Tue, 13 Apr 2010 09:56:34 -0400, Adam Ruppe <destructionator@gmail.com> wrote:

> Jerome's highbit function is the same as std.intrinsic.bsr. I wonder
> which is faster?

Just a note, this code should be runnable in C++, because the compiler is written in C++.  Is bsr available there?

> Recompiling with -inline -O -release cuts the raw numbers about in
> half, but keeps about the same difference, leading me to think
> overhead amounts for a fair amount of the percentage instead of actual
> implementation. The new averages are 1134 and 853.

Probably the inlining accounts for the savings here.  Calling a function is rather expensive.

> I've gotta say, your implementation of the function is better than I
> would have done though. Without bsr, I probably would have looped,
> shifted, and tested each individual bit...

I never would have thought of doing a binary search for the high bit, it is definitely a novel idea to me :)

-Steve
April 13, 2010
Steven Schveighoffer:
> I never would have thought of doing a binary search for the high bit, it is definitely a novel idea to me :)

You can learn something from this page (it can also be useful for the translation to C++): http://graphics.stanford.edu/~seander/bithacks.html

Bye,
bearophile
April 13, 2010
Adam Ruppe Wrote:

> Jerome's highbit function is the same as std.intrinsic.bsr. I wonder which is faster?
> 
> I ran a test, and for 100 million iterations (1..10000000), the intrinsic beat out his function by a mere 300 milliseconds on my box: highbit ran in an average of 1897 ms and bsr did the same in an average of 1534 ms.
> 
> Recompiling with -inline -O -release cuts the raw numbers about in half, but keeps about the same difference, leading me to think that call overhead accounts for a fair share of the time rather than the actual implementation. The new averages are 1134 and 853.

That's strange. Looking at src/backend/cod4.c, function cdbscan, in the dmd sources, bsr seems to be implemented in terms of the bsr opcode [1] (which I guess is the reason it's an intrinsic in the first place). I would have expected this to be much, much faster than a user function. Anyone care enough to check the generated assembly?

[1] http://www.itis.mn.it/linux/quarta/x86/bsr.htm

-- Clemens
April 13, 2010
On Tue, Apr 13, 2010 at 10:52:14AM -0400, Steven Schveighoffer wrote:
> Just a note, this code should be runnable in C++, because the compiler is written in C++.  Is bsr available there?

I just assumed since it was in D that it was probably
in DMC too, but I can't find it in the docs, so maybe not.
Well, there's always inline asm :D
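
A minimal sketch of that fallback, in DMD-style x86 inline asm (assumed to be roughly what the intrinsic emits; not checked against the backend):

uint bsrAsm(uint x)
{
    asm
    {
        mov EAX, x;
        bsr EAX, EAX; // EAX = index of the highest set bit (x != 0)
    }
    // A DMD function containing inline asm returns the value left in EAX.
}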

But 300 milliseconds over 100 million iterations is negligible anyway.




-- 
Adam D. Ruppe
http://arsdnet.net