Quicksort Pivot Selection

I'm working with Quicksort, and am wondering: what are some methods that you use to choose the pivot to optimize performance? I'd like to set up two quicksort functions and compare their performance with different pivot choices. Any thoughts/ideas? :-)
Pivot choices can be good on average given a certain distribution of keys, but there is no way to improve the worst case significantly: for example, if you pick the median of 2n+1 arbitrarily chosen keys as a pivot, a partition can have as few as n items.
Are you sure you want to use Quicksort?


Quote:Original post by LorenzoGatti
Pivot choices can be good on average given a certain distribution of keys, but there is no way to improve the worst case significantly: for example, if you pick the median of 2n+1 arbitrarily chosen keys as a pivot, a partition can have as few as n items.


I'm not too concerned about improving the worst case - just looking for ways to either avoid it (randomizing the data usually does the trick) or for other ways to choose the pivot. I'm sure there are articles written on this.

Quote:Are you sure you want to use Quicksort?


Yep, I'm not implementing this for anything in particular...just messing around to try and find decent results for sorting a list of integers.
The optimum pivot is the value which splits the data set into two equal halves, i.e. half the values will be less than the pivot and half greater.

Naive quicksort implementations select the element in the middle of the range and hope that on average this is OK. With an entirely random distribution of values, sometimes you get very lucky, sometimes you get very unlucky, but usually you're somewhere in the middle. Hackers can attack a quicksort by deliberately seeding the data with values which produce the worst possible pivot selection, meaning that the quicksort gets extremely slow or recurses so deeply it crashes.

Scanning the entire range for a good pivot is obviously too expensive.

Typical solutions to improve pivot choice include:
1. Picking a random element as the pivot (hackers can predict this if they can predict your random number generator, so it's at least more effort on their part).
2. Examining 3 values from the range (first, last, middle) and picking the median of the three as the pivot, typically known as median-of-three or tri-median; see the sketch below. I've tried choosing from 5 values in the range but got worse results on average. Again, hackers can get around this, but again, it's more difficult.
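For what it's worth, option 2 costs at most three comparisons per partition. Here's a minimal sketch for plain int arrays (the function and parameter names are mine, not from any particular implementation):

```cpp
#include <cstddef>
#include <utility>

// Median-of-three pivot selection: returns the index of whichever of
// a[lo], a[mid], a[hi] holds the median value (lo and hi are inclusive).
// Only the index variables are swapped; the array is left untouched.
std::size_t medianOfThree(const int* a, std::size_t lo, std::size_t hi)
{
    std::size_t mid = lo + (hi - lo) / 2;
    if (a[mid] < a[lo]) std::swap(lo, mid);  // ensure a[lo] <= a[mid]
    if (a[hi]  < a[lo]) std::swap(lo, hi);   // a[lo] now holds the smallest
    if (a[hi]  < a[mid]) std::swap(mid, hi); // a[mid] now holds the median
    return mid;
}
```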

My quicksort implementation bails to a merge sort at a given depth, which makes it much more resistant to hacking and ensures that even for the worst possible data set I will not stack overflow. In practice, I've never seen it drop to the merge sort since I converted to using tri-median pivot selection.
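A rough sketch of that bail-out idea, assuming a Lomuto-style partition and with std::stable_sort standing in for the merge sort (the actual threshold isn't stated above; twice log2(n) is the limit introsort conventionally uses):

```cpp
#include <algorithm>
#include <cstddef>

// Lomuto partition around the last element; returns the pivot's final index.
static std::size_t partitionLomuto(int* a, std::size_t lo, std::size_t hi)
{
    int pivot = a[hi];
    std::size_t i = lo;
    for (std::size_t j = lo; j < hi; ++j)
        if (a[j] < pivot)
            std::swap(a[i++], a[j]);
    std::swap(a[i], a[hi]);
    return i;
}

// Quicksort that gives up once recursion exceeds its budget and falls
// back to a guaranteed O(n log n) sort on the remaining range.
static void sortLimited(int* a, std::size_t lo, std::size_t hi, int depthLeft)
{
    if (lo >= hi)
        return;
    if (depthLeft == 0) {                       // suspiciously deep: bail
        std::stable_sort(a + lo, a + hi + 1);   // stand-in for a merge sort
        return;
    }
    std::size_t p = partitionLomuto(a, lo, hi);
    if (p > lo)
        sortLimited(a, lo, p - 1, depthLeft - 1);
    sortLimited(a, p + 1, hi, depthLeft - 1);
}

void quicksortWithFallback(int* a, std::size_t n)
{
    int depth = 0;
    for (std::size_t m = n; m > 1; m >>= 1)
        depth += 2;                             // roughly 2 * log2(n)
    if (n > 1)
        sortLimited(a, 0, n - 1, depth);
}
```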

If you're only interested in performance, I recommend trying tri-median selection and seeing if it works for you. Quicksort performance is entirely dependent on the data set provided. Other sorts might be faster for your cases, but quicksort is a good general-purpose sorter and will cope with most data sets very well.

Hope this helps
Cheers,
Martin

If I remember right, you can eliminate the chance of a stack overflow very simply - always recurse into the smallest partition, and then split the bigger one and repeat.
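In code, that looks something like this (reusing partitionLomuto() from the earlier sketch; the function name below is mine):

```cpp
#include <cstddef>

// Sketch of the suggestion above: recurse only into the smaller
// partition and loop on the larger one. The recursed half is at most
// half the current range, so stack depth is bounded by about log2(n).
void quicksortSmallerFirst(int* a, std::size_t lo, std::size_t hi)
{
    while (lo < hi) {
        std::size_t p = partitionLomuto(a, lo, hi);
        if (p - lo < hi - p) {                   // left half is smaller
            if (p > lo)
                quicksortSmallerFirst(a, lo, p - 1);
            lo = p + 1;                          // loop over the right half
        } else {                                 // right half is smaller
            quicksortSmallerFirst(a, p + 1, hi);
            hi = p - 1;                          // p >= lo + 1 here
        }
    }
}
```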
Quote:Original post by Adam_42
If I remember right, you can eliminate the chance of a stack overflow very simply - always recurse into the smallest partition, and then split the bigger one and repeat.


That's a good optimisation, yes. It doesn't eliminate the chance of stack overflow, but it does significantly reduce stack usage, i.e. it will take a much, much larger data set to generate a stack overflow.

Quote:Original post by LorenzoGatti
Pivot choices can be good on average given a certain distribution of keys, but there is no way to improve the worst case significantly[...]


I believe this is not correct. It may not be practical, but you can actually turn quicksort into a worst-case O(n*log(n)) algorithm by picking the pivot carefully. For instance, you can pick the median, which can be computed in linear time using the "median of medians" algorithm.

You can find a description of that algorithm on Wikipedia.
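To illustrate, here's that idea with std::nth_element standing in for the selection step (typical implementations use introselect, so this is expected rather than guaranteed linear time; a true median-of-medians selection would make the O(n log n) bound hold in the worst case):

```cpp
#include <algorithm>
#include <cstddef>

// Quicksort that always pivots on the exact median, so every partition
// is a near-perfect split and recursion depth is O(log n).
void quicksortExactMedian(int* a, std::size_t lo, std::size_t hi)
{
    if (lo >= hi)
        return;
    std::size_t mid = lo + (hi - lo) / 2;
    // Puts the median at a[mid], with no larger element to its left
    // and no smaller element to its right.
    std::nth_element(a + lo, a + mid, a + hi + 1);
    if (mid > lo)
        quicksortExactMedian(a, lo, mid - 1);
    quicksortExactMedian(a, mid + 1, hi);
}
```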
Quote:Original post by Martin
Quote:Original post by Adam_42
If I remember right, you can eliminate the chance of a stack overflow very simply - always recurse into the smallest partition, and then split the bigger one and repeat.

That's a good optimisation, yes. It doesn't eliminate the chance of stack overflow, but it does significantly reduce stack usage, i.e. it will take a much, much larger data set to generate a stack overflow.
I wouldn't worry about that; in practice you'll run out of address space *long* before causing a stack overflow (e.g. one stack frame per address bit at worst).
There are certainly exceptions, such as microcontrollers with tiny fixed-size stacks, but hopefully anyone writing software for one of those will know to avoid recursion.
This question is akin to asking for a comprehensive overview of quantum physics: a topic that has been covered from every angle on thousands of pages by just about all of the brightest minds.

Books by Knuth (The Art of Computer Programming) or Sedgewick (Algorithms in C++) are a good start.

There is also a casual talk that gives an overview for someone who merely wants trivia knowledge; it is a nice hint at how many aspects beyond the pivot there are.

But to "optimize" quicksort, start with books above. Or, use sort() provided with language of choice.

Quote:Scanning the entire range for a good pivot is obviously too expensive.
Not necessarily.

For example, one can scan while filling the original array - it needs to come from somewhere, and it needs to be filled one by one.
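One way to read that, purely as an illustration (the class, its names and the sampling policy below are mine): while each element is stored, also drop every k-th value into a small side sample, then use the sample's median as the first pivot.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Collects every stride-th value as the array is filled; the median of
// that sample is a cheap, usually decent first pivot. stride must be >= 1.
class PivotSampler {
public:
    explicit PivotSampler(std::size_t stride) : stride_(stride), seen_(0) {}

    void push(int value)                 // call once per element stored
    {
        if (seen_++ % stride_ == 0)
            sample_.push_back(value);
    }

    int pivotGuess()                     // median of the sample; assumes
    {                                    // push() was called at least once
        std::nth_element(sample_.begin(),
                         sample_.begin() + sample_.size() / 2,
                         sample_.end());
        return sample_[sample_.size() / 2];
    }

private:
    std::vector<int> sample_;
    std::size_t stride_, seen_;
};
```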

Quote:hackers
Hackers are not really a problem; non-determinism is a bigger issue. Often a slower but reliable alternative is preferred, especially given the gap between the n log n average case and the n^2 worst case.

Quote:I'm not implementing this for anything in particular
Which is very unfortunate, since conclusions can then only be drawn at the theoretical level, by studying the characteristics on paper.

Unfortunately, obtaining conclusive results via experimentation is not viable in this case, especially considering how completely quicksort has already been covered from every angle.
Quote:Original post by implicit
Quote:Original post by Martin
Quote:Original post by Adam_42
If I remember right, you can eliminate the chance of a stack overflow very simply - always recurse into the smallest partition, and then split the bigger one and repeat.

That's a good optimisation, yes. It doesn't eliminate the chance of stack overflow, but it does significantly reduce stack usage, i.e. it will take a much, much larger data set to generate a stack overflow.
I wouldn't worry about that; in practice you'll run out of address space *long* before causing a stack overflow (e.g. one stack frame per address bit at worst).
There are certainly exceptions, such as microcontrollers with tiny fixed-size stacks, but hopefully anyone writing software for one of those will know to avoid recursion.

Since it reduces the stack usage to O(log n), it's not so much a case of requiring a significantly larger data set to cause stack overflow: multiplying the number of items by 1000 adds only about ten extra levels of recursion (log2(1000) is roughly 10).
It's more a case of only overflowing the stack if it was already extremely close to the limit at the point of the call to perform the sort.

Practically, the optimisation mentioned does eliminate the likelihood of blowing the stack: unless the stack was already extremely close to being exhausted, even being asked to sort 4 billion items in the pathologically worst ordering won't be a problem, since that is only about 32 levels of recursion.
Quicksort stack frames are fairly small in typical implementations.

My question to the OP is: Is this for sorting an array or a linked list?
I have a kick-ass method of optimising linked-list-based quicksort such that the pivot selection turns sorted and reverse-sorted input into the best case! Doesn't apply to arrays though.
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms

