Sign in to follow this  
Bombshell_M

Selecting K smallest elements

Recommended Posts

Bombshell_M    122
In the wikipedia entry, a bound is mentioned: http://en.wikipedia.org/wiki/Selection_algorithm#Lower_bounds Where can I find an algorithm that actually achieves that lower bound?

Share this post


Link to post
Share on other sites
yadango    567
it tells you right there in the wiki... doing a partial sort. a partial heapsort will do the trick. you just make a heap which can be done in O(logn) and then place the smallest element in the front. then reheap. then repeat k times for the k smallest elements. you should have yourself a kO(logn) ~ O(logn) algorithm if k is small. with heapsort you can find the k-smallest and largest elements as well by making minheaps and maxheaps.

Share this post


Link to post
Share on other sites
SaltyGoodness    128
Quote:
Original post by Starfox
Er.. As far as I know, fastest option for building a heap is O(n), and not O(log n).


Yup. If it was O(log n), that would imply you're only touching log n elements!

One of the simplest ways to achieve decent performance when k is reasonably smaller than n is to build a heap in O(n) time, then remove k elements from the heap in O(log (n-i)) time for each i'th element.

I'm not sure that's the lowest bound, Knuth's description of the problem is pretty complicated (although I've skim read over that section). To get an idea of how complicated, Knuth states it took mathematicians several attempts to even determine what the lower bound is!

Share this post


Link to post
Share on other sites
Bombshell_M    122
That bound is not a big-oh. It's the exact number of comparisons involved. In other words, and for the simplest case of k = 2: What algorithm would find the second smallest elements with (n-k+log(n)) comparisons?

Share this post


Link to post
Share on other sites
Antheus    2409
Quote:
Original post by Bombshell_M

What algorithm would find the second smallest elements with (n-k+log(n)) comparisons?


For k=2 it evaluates to n - 2 + log n.

I would guess that quickselect that returns only k-th (presented in article) element achieves that, although I cannot prove it right now.

Lower bound might also not be universally attainable. Quicksort in general is known as optimal, yet it will never reach lower bound. I seem to recall that it's some 40% away from optimal algorithm for arbitrary N.

Share this post


Link to post
Share on other sites
yadango    567
ah, yeah, my goof, heapify is O(n). got it confused with reheaping. just trying to tell the guy though that he can use a partial heapsort to solve that problem.

so let's see... heapify = O(n)
then k - 1 reheap operations = (k - 1)O(lg(n))

so lowerbound should be O(n) right (if k is small that is)?

Share this post


Link to post
Share on other sites
Prefect    373
As I understand the article, the given formula is a lower bound on the complexity of the problem, meaning that it is impossible to solve the problem using fewer comparisons. This does not mean that it is possible to solve the problem with this many comparisons.

In fact, the statement
Quote:
(from Wikipedia)
This bound is achievable for k=2 but better, more complex bounds exist for larger k.

suggests that it has in fact been proven that for k>2 no algorithm exists that uses only the number of comparisons given by the formula in that article.

If my back-of-the-envelope calculations are correct, an optimal algorithm for k=2 can be obtained using a divide-and-conquer strategy similar to the "Median of medians" algorithm described on the Wikipedia page (it's simpler than that algorithm, in fact). The same strategy should give good algorithms for k>2 as well, but I don't know if they're still optimal.

Try to develop that algorithm yourself, it's a good exercise. If you're stuck, feel free to ask for more ideas.

cu,
Prefect

Share this post


Link to post
Share on other sites
Bombshell_M    122
Wikipedia's version is not optimal. Here's my implementation:


int PerfCount = 0;

int select_partition(int* A, int L, int R, int PivotIndex)
{
int PivotValue = A[PivotIndex];
swap(A[PivotIndex], A[R]);
int StoreIndex = L;
for(int i = L; i <= R-1; ++i)
{
++PerfCount;
if(A[i] < PivotValue)
{
swap(A[StoreIndex], A[i]);
++StoreIndex;
}
}
swap(A[R], A[StoreIndex]);
return StoreIndex;
}

int select(int* A, int K, int L, int R)
{
while(true)
{
int PivotIndex = (L+R)/2;
int PivotNewIndex = select_partition(A, L, R, PivotIndex);
if(K == PivotNewIndex)
{
return A[K];
}
else if(K < PivotNewIndex)
{
R = PivotNewIndex-1;
}
else
{
L = PivotNewIndex+1;
}
}
}



for n = 8, K = 2, PerfCount = 13 > n + log(n) - 2.

Share this post


Link to post
Share on other sites
SaltyGoodness    128
Quote:
Original post by Bombshell_M
Wikipedia's version is not optimal. Here's my implementation:

*** Source Snippet Removed ***

for n = 8, K = 2, PerfCount = 13 > n + log(n) - 2.


The problem with any algorithm of this nature is that its performance is dependent on the ordering of elements in the array. What's required to achieve the lower bound (if that's even possible) would be something more like a sorting network.

Share this post


Link to post
Share on other sites
Prefect    373
On the site you mentioned yourself, [a]http://en.wikipedia.org/wiki/Selection_algorithm[/a], there's a section with "Median of Medians" in the title. That's the algorithm I was talking about.

As for the algorithm you posted, I think you're a little confused about the problem you're trying to solve. Consider these two different problems:
(1) Select the k-th smallest element
(2) Select the k smallest elements
Notice the plural in problem (2)?

Your algorithm solves problem (1). SaltyGoodness correctly points out that the runtime of your algorithm depends significantly on the ordering of the input array. Think about this: What will happen when A[PivotIndex] is the largest element? What will happen if after the first select_partition A[PivotIndex] is again the largest element among the remaining elements? You'll soon realize that your algorithm is very inefficient in the worst case (Omega(n^2), like quicksort).

But again, we weren't even talking about problem (1) yesterday. Problem (1) can be solved in linear time using the "Median of Medians" algorithm. Try groking that algorithm, as it uses a very interesting and clever technique. A simple search for "Median of Medians" on the internet will reveal more sources, though the best explanation I've seen so far is in Introduction to Algorithms by Cormen et al. Check it out in a library.

Now: The lower bound that you originally cited refers to the complexity of problem (2), and a similar technique can be used to obtain an efficient algorithm at least for k = 2.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this