June 09, 2010
On 06/09/2010 01:28 AM, retard wrote:
> Wed, 09 Jun 2010 01:13:43 -0400, Nick Sabalausky wrote:
>
>> "retard"<re@tard.com.invalid>  wrote in message
>> news:hun6ok$13s5$1@digitalmars.com...
>>> Tue, 08 Jun 2010 16:14:51 -0500, Andrei Alexandrescu wrote:
>>>
>>>> On 06/08/2010 04:05 PM, Walter Bright wrote:
>>>>> Andrei Alexandrescu wrote:
>>>>>> On 06/08/2010 01:27 PM, "Jérôme M. Berger" wrote:
>>>>>>> Please define "reasonable performance"...
>>>>>>
>>>>>> Within 15% of hand-optimized code specialized for the types at hand.
>>>>>
>>>>> I would have said O(n) or O(log n), as opposed to, say, O(n*n).
>>>>>
>>>>> General rules for performance improvements:
>>>>>
>>>>> 1. nobody notices a 10% improvement
>>>>>
>>>>> 2. users will start noticing speedups when they exceed 2x
>>>>>
>>>>> 3. a 10x speedup is a game changer
>>>>
>>>> max of n elements is O(n).
>>>
>>> This probably means that D 2 won't be very efficient on multicore until
>>> the authors learn some basic parallel programming skills. Now where did
>>> you get your PhD - I'm collecting a list of toy universities people
>>> should avoid.
>>
>> You used to have meaningful things to say. Now you're just trolling.
>
> Max of n unordered elements can be solved in O(log log n) time assuming
> you have enough cores and constant time memory access. Happy now?

When calculating the complexity of an operation you don't consider cores in unlimited supply; it's a constant. Complexity being calculated in terms of the number of inputs, max of n elements is O(n) because you need to look at each element at least once.

Cores being a constant k, you can then consider that max can be distributed such that each core looks at n/k elements. What's left is k intermediate results, which can be further processed in log(k) time (see below); since we consider k a constant, this approach could claim a net k-fold speedup.
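To make that concrete, here is roughly how such a distribution might look in D (a sketch only; chunkedMax is a made-up name, and this is not necessarily how a library would implement it):

import std.algorithm : max, reduce;
import std.parallelism : parallel, totalCPUs;
import std.range : chunks;

// Sketch: with k cores, each worker scans about n/k elements sequentially,
// then one thread combines the k partial results.
int chunkedMax(const(int)[] data)
{
    assert(data.length > 0);
    immutable k = totalCPUs;                     // cores: a constant, not a function of n
    immutable blockLen = (data.length + k - 1) / k;
    auto blocks = data.chunks(blockLen);
    auto partials = new int[](blocks.length);

    foreach (i, block; parallel(blocks))         // one block of ~n/k elements per core
        partials[i] = reduce!max(block);

    return reduce!max(partials);                 // k intermediate results, combined serially
}

The total work is still O(n); only the wall-clock time improves, and by at most the constant factor k.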

If available cores are proportional to n (an unusual assumption), you first compare the n elements pairwise using n/2 cores, then compare the n/2 surviving candidates pairwise using n/4 cores, and so on. Each step halves the number of candidates, so the complexity is O(log n). I'd be curious to find out how that can be reduced to O(log log n).
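For illustration, that pairwise scheme can be simulated sequentially like this (treeMax is a made-up name; each inner pass stands for one parallel step):

import std.algorithm : max;

// Each pass compares disjoint pairs; with n/2 cores a pass would take O(1),
// and the number of passes is O(log n).
int treeMax(const(int)[] a)
{
    assert(a.length > 0);
    auto work = a.dup;
    for (size_t len = work.length; len > 1; len = (len + 1) / 2)
    {
        foreach (i; 0 .. len / 2)                // conceptually, one core per pair
            work[i] = max(work[2 * i], work[2 * i + 1]);
        if (len % 2)                             // odd element survives to the next pass
            work[len / 2] = work[len - 1];
    }
    return work[0];
}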

Finally, for small values of n, you could consider the number of cores sufficiently large, but, as was pointed out, using threads for max is actually impractical, so superscalar execution may be a better avenue. In practice this means exploiting ILP by reducing data dependencies between intermediate results. I suggest you take a look at a thread entitled "challenge: implement the max function" that I started on 01/21/2007. That thread discusses ILP issues, and continues with the thread "challenge #2: implement the varargs_reduce metafunction". Allow me to quote from my own post on 01/23/2007:

==================
That's a good point, and goes right to the core of my solution, which (similarly to Bruno Medeiros') arranges operations in an optimal way for superscalar evaluation, e.g. max(a, b, c, d) is expanded not to:

max2(max2(max2(a, b), c), d)

but instead to:

max2(max2(a, b), max2(c, d))

The number of operations is the same but in the latter case there is logarithmically less dependency between partial results. When max2 expands into a primitive comparison, the code is easy prey for a superscalar machine to execute in parallel.

This won't count for much in a great deal of real code, but the fact that this will go in the standard library warrants the thought. Besides, the fact that the language makes it so easy that we can actually think of such subtlety is very encouraging.
==================

The messages that follow further discuss how associative reduce should be ordered to take advantage of superscalar execution.
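For concreteness, here is one way such a balanced expansion could be written in today's D (just a sketch, not the code from that thread; balancedMax is a made-up name, and all arguments are assumed to share one comparable type):

import std.algorithm : max;

// Split the compile-time argument list in half, so the two sub-results carry
// no data dependency on each other and their comparisons can overlap.
T balancedMax(T, Ts...)(T a, Ts rest)
{
    static if (rest.length == 0)
        return a;
    else static if (rest.length == 1)
        return max(a, rest[0]);
    else
    {
        enum mid = (1 + rest.length) / 2;        // half of the total argument count
        return max(balancedMax(a, rest[0 .. mid - 1]),
                   balancedMax(rest[mid - 1 .. $]));
    }
}

With four arguments, balancedMax(a, b, c, d) expands to max(max(a, b), max(c, d)), i.e. the second form above.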


Andrei
June 09, 2010
Walter Bright wrote:
> Andrei Alexandrescu wrote:
>> On 06/08/2010 01:27 PM, "Jérôme M. Berger" wrote:
>>>     Please define "reasonable performance"...
>>
>> Within 15% of hand-optimized code specialized for the types at hand.
> 
> I would have said O(n) or O(log n), as opposed to, say, O(n*n).
> 
	Well, then my python function is Ok (O(n)) ;)

		Jerome
-- 
mailto:jeberger@free.fr
http://jeberger.free.fr
Jabber: jeberger@jabber.fr



June 09, 2010
Andrei Alexandrescu wrote:
> On 06/08/2010 01:27 PM, "Jérôme M. Berger" wrote:
>> Andrei Alexandrescu wrote:
>>> On 06/07/2010 04:35 PM, "Jérôme M. Berger" wrote:
>>>> Andrei Alexandrescu wrote:
>>>>> On 06/07/2010 12:57 PM, "Jérôme M. Berger" wrote:
>>>>>>> Do this in any dynamic language ->    FAIL because looping is so
>>>>>>> slow that you might
>>>>>>> die of old age before it executes.  Besides, who wants to do
>>>>>>> computationally
>>>>>>> intensive, multithreaded work in a dynamic language?
>>>>>>
>>>>>>       In python: max (map (max, args)) should have reasonable
>>>>>> performances and is *much* more elegant...
>>>>>
>>>>> I very much doubt that.
>>>>>
>>>>      What do you doubt? That it has reasonable performance or that
>>>> it is
>>>> more elegant?
>>>
>>> That it has reasonable performance. Then, there are a number of things that can't be compared such as figuring out the tightest static types, something that Python doesn't worry about (at the expense of precision and performance).
>>>
>>     Please define "reasonable performance"...
> 
> Within 15% of hand-optimized code specialized for the types at hand.
> 
	Then almost nothing has "reasonable performance"...

		Jerome

PS: I should maybe point out that I also do quite a lot of Verilog development. Therefore I can hand optimize specialized code a lot better than anything you can do on a standard CPU...
-- 
mailto:jeberger@free.fr
http://jeberger.free.fr
Jabber: jeberger@jabber.fr



June 09, 2010
"Simen kjaeraas" <simen.kjaras@gmail.com> wrote in message news:op.vd05x514vxi10f@biotronic-pc.lan...
>
>  ...I was apparently too drunk to write bad code.
>

That's officially my favorite sentence of the day :)


June 11, 2010
Wed, 09 Jun 2010 09:37:51 -0500, Andrei Alexandrescu wrote:

> On 06/09/2010 01:28 AM, retard wrote:
>> Max of n unordered elements can be solved in O(log log n) time assuming you have enough cores and constant time memory access. Happy now?
> 
> When calculating the complexity of an operation you don't consider cores in unlimited supply; it's a constant. Complexity being calculated in terms of the number of inputs, max of n elements is O(n) because you need to look at each element at least once.
> 
> Cores being a constant k, you can then consider that max can be distributed such that each core looks at n/k elements. What's left is k intermediate results, which can be further processed in log(k) time (see below); since we consider k a constant, this approach could claim a net k-fold speedup.

With this logic, all parallel algorithms have the same time complexity as their sequential versions.

> If available cores are proportional to n (an unusual assumption), you first compare pairs of the n elements with n/2 cores, then n/4 comparisons against pairs of the remaining elements etc. Each step halves the number of candidates so the complexity is O(log n). I'd be curious to find out how that can be reduced to O(log log n).

First, divide the problem into blocks of size log(log(n)) and solve them sequentially; this stage is O(log log n), because each block has size log(log(n)) and the sequential algorithm works in O(n). Then, use a divide and conquer algorithm for the n/log(log(n)) maximums obtained from the blocks. With the correct algorithm this stage is also O(log log n), thus the overall complexity is O(log log n). And yes, it assumes you have n/log(log(n)) cores.
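In code, the decomposition is something like this (a sketch only; twoStageMax is a made-up name, and an ordinary reduce stands in for the second stage, so this does not achieve the bound on a real machine):

import std.algorithm : map, max, reduce;
import std.array : array;
import std.math : log2;
import std.range : chunks;

// Stage 1: blocks of size ~log log n, each reduced sequentially.
// Stage 2: combine the n/log(log(n)) block maxima; on the PRAM this step is
// also O(log log n) with the right algorithm.
int twoStageMax(int[] data)
{
    assert(data.length >= 2);
    immutable blockSize = cast(size_t) log2(log2(cast(real) data.length)) + 1;
    auto blockMaxima = data.chunks(blockSize)
                           .map!(b => reduce!max(b))
                           .array;
    return reduce!max(blockMaxima);              // stand-in for the second stage
}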

> Finally, for small values of n, you could consider the number of cores sufficiently large, but, as was pointed out, using threads for max is actually impractical, so superscalar execution may be a better venue. Practically this means exploiting ILP by reducing data dependencies between intermediate results. I suggest you take a look at a thread entitled "challenge: implement the max function" that I started on 01/21/2007. That thread discusses ILP issues, and continues with the thread "challenge #2: implement the varargs_reduce metafunction".

I believe this is true in practice. I only mentioned the time complexity of the parallel version, I didn't make any claims about its real world performance. If the problem is small enough, I'd use Python instead.
June 11, 2010
retard wrote:
> First, divide the problem into blocks of size log(log(n)) and solve them sequentially; this stage is O(log log n), because each block has size log(log(n)) and the sequential algorithm works in O(n). Then, use a divide and conquer algorithm for the n/log(log(n)) maximums obtained from the blocks. With the correct algorithm this stage is also O(log log n), thus the overall complexity is O(log log n). And yes, it assumes you have n/log(log(n)) cores.

What is the correct algorithm for the second stage? Do you have a reference at hand?

Andrei
June 11, 2010
Fri, 11 Jun 2010 08:37:31 -0700, Andrei Alexandrescu wrote:

> retard wrote:
>> First, divide the problem into blocks of size log(log(n)) and solve them
>> sequentially; this stage is O(log log n), because each block has size
>> log(log(n)) and the sequential algorithm works in O(n). Then, use a
>> divide and conquer algorithm for the n/log(log(n)) maximums obtained
>> from the blocks. With the correct algorithm this stage is also O(log
>> log n), thus the overall complexity is O(log log n). And yes, it
>> assumes you have n/log(log(n)) cores.
> 
> What is the correct algorithm for the second stage? Do you have a reference at hand?
> 
> Andrei

Unfortunately I can't remember. It was probably explained in Jaja's book, which I skimmed through when there was the last parallel programming course 7-8 years ago - I only have some notes left.
June 12, 2010
retard wrote:
> Fri, 11 Jun 2010 08:37:31 -0700, Andrei Alexandrescu wrote:
> 
>> retard wrote:
>>> First, divide the problem into blocks of size log(log(n)) and solve them
>>> sequentially; this stage is O(log log n), because each block has size
>>> log(log(n)) and the sequential algorithm works in O(n). Then, use a
>>> divide and conquer algorithm for the n/log(log(n)) maximums obtained
>>> from the blocks. With the correct algorithm this stage is also O(log
>>> log n), thus the overall complexity is O(log log n). And yes, it
>>> assumes you have n/log(log(n)) cores.
>> What is the correct algorithm for the second stage? Do you have a
>> reference at hand?
>>
>> Andrei
> 
> Unfortunately I can't remember. It was probably explained in Jaja's book, which I skimmed through when there was the last parallel programming course 7-8 years ago - I only have some notes left.

That's right. I'm now glad I followed this through; I learned a couple of things.

www.cs.berkeley.edu/~satishr/cs273.03/scribe9.ps
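To make the structure concrete, here is a quick sequential simulation of the doubly-logarithmic scheme described there (a sketch only; doublyLogMaxDemo is a made-up name, and on a real machine this is of course still O(n) work, it merely shows the round count):

import std.algorithm : map, max, reduce;
import std.array : array;
import std.math : sqrt;
import std.range : chunks;
import std.stdio : writefln;

// Each round groups the current candidates into groups of size about sqrt(m)
// and keeps one maximum per group. On a CRCW PRAM with ~n processors a round
// takes O(1), because all pairs within a group are compared at once, so the
// total time is the number of rounds, which grows like log log n.
void doublyLogMaxDemo(int[] data)
{
    assert(data.length > 0);
    auto cur = data.dup;
    size_t rounds;
    while (cur.length > 1)
    {
        immutable g = cast(size_t) sqrt(cast(real) cur.length) + 1;  // group size ~ sqrt(m)
        cur = cur.chunks(g).map!(c => reduce!max(c)).array;          // one O(1) PRAM round
        ++rounds;
    }
    writefln("max = %s after %s rounds", cur[0], rounds);
}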

Focusing on posting more content and fewer of those bizarre outbursts would make everyone on this group happier - primarily yourself.


Andrei
June 13, 2010
retard wrote:
> Mon, 07 Jun 2010 14:06:24 -0400, Nick Sabalausky wrote:
> 
>> ""Jérôme M. Berger"" <jeberger@free.fr> wrote in message
>> news:hujboe$tp2$1@digitalmars.com...
>>> Nick Sabalausky wrote:
>>>> I actually find that funny. Something in Java that isn't an Object? I
>>>> remember "Everything's an object!" being paraded around as a selling
>>>> point.
>>>>
>>> Yes, in Java, everything is an object except where that bothered
>>> the language "designers". There are several such design decisions in
>>> Java...
>> Yea, that's a good example of why I've grown a distaste towards
>> hard-and-fast religious design strategies. The designer inevitably comes
>> across cases where it just doesn't work particularly well, and then
>> they're forced to either stay true to their misguided principles by
>> accepting an awkward problematic design, or contradict their alleged
>> principles and go with a better design. And when they do the latter,
>> that runs the risk of causing problems in other areas that had been
>> relying on the old principle being rigidly followed.
> 
> Part of the "religious" feel of Java comes from the fact that it runs on a VM. The safe memory model imposes some restrictions as you might have noticed in SafeD. The 'everything is an Object' idea is a bit broken, performance-wise. That's why C# added reified types. The reason they added primitive types has a rationale behind it. The decision made the VM a useful tool for real enterprise applications.

Hear, hear.  The idea that "everything is an Object" is definitely broken.  Java cannot fix this mistake, and that leaves an opportunity for newcomer languages (or oldcomer languages in new clothes).
June 13, 2010
Walter Bright wrote:
> Nick Sabalausky wrote:
>> Yea, that's a good example of why I've grown a distaste towards hard-and-fast religious design strategies. The designer inevitably comes across cases where it just doesn't work particularly well, and then they're forced to either stay true to their misguided principles by accepting an awkward problematic design, or contradict their alleged principles and go with a better design. And when they do the latter, that runs the risk of causing problems in other areas that had been relying on the old principle being rigidly followed.
> 
> D has design principles, but those principles are often contradictory. I don't see a good reason to follow a design principle out of principle if it destroys the utility of the language.

Methinks many readers of this ng would enjoy a recap of just exactly what those design principles are.  Of more interest, though, will be the ensuing discussion from your response!!! :-)

