**0**

# Quicksort Pivot Selection

Started by Novark, Mar 03 2010 06:38 PM

17 replies to this topic

###
#1
Members - Reputation: **133**

Posted 03 March 2010 - 06:38 PM

I'm working with Quicksort, and am wondering:
What are some methods that you use to choose the pivot to optimize performance?
I'd like to set up two quicksort functions and try comparing their performances with different pivot choices.
Any thoughts/ideas? :-)

Sponsor:

###
#2
Crossbones+ - Reputation: **2709**

Posted 03 March 2010 - 10:01 PM

Pivot choices can be good on average given a certain distribution of keys, but there is no way to improve the worst case significantly: for example, if you pick the median of 2n+1 arbitrarily chosen keys as a pivot a partition can have as few as n items.

Are you sure you want to use Quicksort?

Are you sure you want to use Quicksort?

###
#3
Members - Reputation: **133**

Posted 03 March 2010 - 10:31 PM

Quote:

Original post by LorenzoGatti

Pivot choices can be good on average given a certain distribution of keys, but there is no way to improve the worst case significantly: for example, if you pick the median of 2n+1 arbitrarily chosen keys as a pivot a partition can have as few as n items.

I'm not too concerned about improving the worst case - just looking for ways to either avoid it (randomizing the data usually does the trick), or thinking of other ways to choose the pivot. I'm sure there are probably some articles written on this.

Quote:

Are you sure you want to use Quicksort?

Yep, I'm not implementing this for anything in particular...just messing around to try and find decent results for sorting a list of integers.

###
#4
Members - Reputation: **194**

Posted 03 March 2010 - 10:53 PM

The optimum pivot is value which splits the data set into two equal halves. i.e. half the values will be less than the pivot, half greater.

Naive quicksort implementations selects the sample in the middle of the range and hopes that on average, this is ok, if you have an entirely normal distribution of random values the midpoint sometimes you get very lucky, sometimes you get very unlucky but usually you're somewhere in the middle. Hackers can attack a quicksort by deliberately seeding the data which values which produce the worst possible case pivot selection meaning that the quick sort gets extremely slow / produces so many recursions it crashes.

Scanning the entire range for a good pivot is obviously too expensive.

Typically solutions to improve pivot choice include:

1. Picking a random sample as the pivot (hackers can predict this if they can predict your random number generator, so more effort on their part)

2. Examining 3 values from the range, first, last, middle and picking the one which looks like the likely best pivot. (Typically known as tri-median) I've tried choosing from 5 values in the range but got worse results (on average) Again, hackers can get around this but again, it's more difficult.

My quick sort implementation bails to a merge sort at a given depth which is much more resistant to hacking / ensures that even for the worst possible case dataset I will not stack overflow etc.. In practice, I've never seen it drop to the merge sort since I coverted to using tri-median pivot selection

If you're only interested in performance, I recommend trying tri-median selection and see if it works for you. Quick sort performance is entirely dependant the data set provided. It is possible that other sorts might be faster for your cases, quick sort is a good general purpose sorter, it will cope with most data sets very well.

Hope this helps

Cheers,

Martin

[Edited by - Martin on March 4, 2010 9:53:25 AM]

Naive quicksort implementations selects the sample in the middle of the range and hopes that on average, this is ok, if you have an entirely normal distribution of random values the midpoint sometimes you get very lucky, sometimes you get very unlucky but usually you're somewhere in the middle. Hackers can attack a quicksort by deliberately seeding the data which values which produce the worst possible case pivot selection meaning that the quick sort gets extremely slow / produces so many recursions it crashes.

Scanning the entire range for a good pivot is obviously too expensive.

Typically solutions to improve pivot choice include:

1. Picking a random sample as the pivot (hackers can predict this if they can predict your random number generator, so more effort on their part)

2. Examining 3 values from the range, first, last, middle and picking the one which looks like the likely best pivot. (Typically known as tri-median) I've tried choosing from 5 values in the range but got worse results (on average) Again, hackers can get around this but again, it's more difficult.

My quick sort implementation bails to a merge sort at a given depth which is much more resistant to hacking / ensures that even for the worst possible case dataset I will not stack overflow etc.. In practice, I've never seen it drop to the merge sort since I coverted to using tri-median pivot selection

If you're only interested in performance, I recommend trying tri-median selection and see if it works for you. Quick sort performance is entirely dependant the data set provided. It is possible that other sorts might be faster for your cases, quick sort is a good general purpose sorter, it will cope with most data sets very well.

Hope this helps

Cheers,

Martin

*If I've helped you, a rating++ would be appreciated*[Edited by - Martin on March 4, 2010 9:53:25 AM]

###
#6
Members - Reputation: **194**

Posted 03 March 2010 - 11:14 PM

Quote:

Original post by Adam_42

If I remember right, you can eliminate the chance of a stack overflow very simply - always recurse into the smallest partition, and then split the bigger one and repeat.

Thats a good optimisation yes, it doesn't eliminate the chance of stack overflow, but it does significantly reduce stack usage, i.e. it will take a much much larger data set to generate a stack overflow.

[Edited by - Martin on March 4, 2010 5:14:51 AM]

###
#7
Crossbones+ - Reputation: **13367**

Posted 04 March 2010 - 03:54 AM

Quote:

Original post by LorenzoGatti

Pivot choices can be good on average given a certain distribution of keys, but there is no way to improve the worst case significantly[...]

I believe this is not correct. It may not be practical, but you can actually turn quicksort into a worst-case O(n*log(n)) algorithm by picking the pivot carefully. For instance, you can pick the median, which can be computed in linear time using the "median of medians" algorithm.

You can find a description that algorithm in this Wikipedia page.

###
#8
Members - Reputation: **504**

Posted 04 March 2010 - 04:14 AM

Quote:I wouldn't worry about that, in practice you'll run out of address space *long* before causing a stack overflow (e.g. one stack frame per address bit at worst.)

Original post by MartinQuote:

Original post by Adam_42

If I remember right, you can eliminate the chance of a stack overflow very simply - always recurse into the smallest partition, and then split the bigger one and repeat.

Thats a good optimisation yes, it doesn't eliminate the chance of stack overflow, but it does significantly reduce stack usage, i.e. it will take a much much larger data set to generate a stack overflow.

There are certainly exceptions, such as microcontrollers with tiny fixed-depth stacks or possible, but hopefully no one anyone writing software for one of those will know to avoid recursion.

###
#9
Members - Reputation: **2397**

Posted 04 March 2010 - 04:36 AM

This question is akin to: "Comprehensive overview of quantum physics". Aka, a topic that has been covered from every angle on thousands of pages by just about all of brightest minds.

Books by Knuth (Art of Computer Programming) or Sedgewick (Algorithms in C++) are a good start.

There is also casual overview for someone who merely wants trivia knowledge. This talk is a nice hint at how many aspects beyond pivot there are.

But to "optimize" quicksort, start with books above. Or, use sort() provided with language of choice.

For example, one can scan while filling the original array - it needs to come from somewhere, and it needs to be filled one by one.

Unfortunately, obtaining conclusive results via experimentation is not viable in this case, especially considering that quicksort could be considered to have been covered completely from every angle.

Books by Knuth (Art of Computer Programming) or Sedgewick (Algorithms in C++) are a good start.

There is also casual overview for someone who merely wants trivia knowledge. This talk is a nice hint at how many aspects beyond pivot there are.

But to "optimize" quicksort, start with books above. Or, use sort() provided with language of choice.

Quote:Not necessarily.

Scanning the entire range for a good pivot is obviously too expensive.

For example, one can scan while filling the original array - it needs to come from somewhere, and it needs to be filled one by one.

Quote:Hackers are not really a problem, non-determinism is a bigger issue. Often a slower but reliable alternative is preferred, especially when looking at logn vs. n^2 case.

hackers

Quote:Which is very unfortunate, since the only conclusions can then be gotten at theoretical level by studying characteristics on paper.

I'm not implementing this for anything in particular

Unfortunately, obtaining conclusive results via experimentation is not viable in this case, especially considering that quicksort could be considered to have been covered completely from every angle.

###
#10
Crossbones+ - Reputation: **2306**

Posted 04 March 2010 - 06:08 AM

Quote:

Original post by implicitQuote:I wouldn't worry about that, in practice you'll run out of address space *long* before causing a stack overflow (e.g. one stack frame per address bit at worst.)

Original post by MartinQuote:

Original post by Adam_42

If I remember right, you can eliminate the chance of a stack overflow very simply - always recurse into the smallest partition, and then split the bigger one and repeat.

Thats a good optimisation yes, it doesn't eliminate the chance of stack overflow, but it does significantly reduce stack usage, i.e. it will take a much much larger data set to generate a stack overflow.

There are certainly exceptions, such as microcontrollers with tiny fixed-depth stacks or possible, but hopefully no one anyone writing software for one of those will know to avoid recursion.

Since it reduces the stack usage to O(log n) it's not so much a case of requiring a significantly larger data set to cause stack overflow. Multiplying the number of items by 1000 would require bugger all extra recursive calls.

It's more a case of only overflowing the stack if it was already extremely close to reaching the limit already at the point of the call to perform the sort.

Practically, the optimisation mentioned

*does*eliminate the liklihood of blowing the stack, as unless the stack was already extremely close to being exhausted, even being asked to sort 4 billion items in the pathologically worst ordering, wont be a problem.

Quicksort stack frames are fairly small in typical implementations.

My question to the OP is: Is this for sorting an array or a linked list?

I have a kick-ass method of optimising linked-list-based quicksort such that the pivot selection shifts sorted and reverse sorted to the best case! Doesn't apply to arrays though.

###
#11
Members - Reputation: **504**

Posted 04 March 2010 - 06:11 AM

Quote:Well, yeah... Wasn't that what I said?

Original post by iMalcQuote:

Original post by implicitQuote:I wouldn't worry about that, in practice you'll run out of address space *long* before causing a stack overflow (e.g. one stack frame per address bit at worst.)

Original post by Martin

Original post by Adam_42

If I remember right, you can eliminate the chance of a stack overflow very simply - always recurse into the smallest partition, and then split the bigger one and repeat.

Thats a good optimisation yes, it doesn't eliminate the chance of stack overflow, but it does significantly reduce stack usage, i.e. it will take a much much larger data set to generate a stack overflow.

There are certainly exceptions, such as microcontrollers with tiny fixed-depth stacks or possible, but hopefully no one anyone writing software for one of those will know to avoid recursion.

Since it reduces the stack usage to O(log n) it's not so much a case of requiring a significantly larger data set to cause stack overflow. Multiplying the number of items by 1000 would require bugger all extra recursive calls.

It's more a case of only overflowing the stack if it was already extremely close to reaching the limit already at the point of the call to perform the sort.

Practically, the optimisation mentioneddoeseliminate the liklihood of blowing the stack, as unless the stack was already extremely close to being exhausted, even being asked to sort 4 billion items in the pathologically worst ordering, wont be a problem.

Quicksort stack frames are fairly small in typical implementations.

###
#12
Crossbones+ - Reputation: **2306**

Posted 04 March 2010 - 06:12 AM

Quote:

Original post by implicitQuote:Well, yeah... Wasn't that what I said?

Original post by iMalcQuote:

Original post by implicit

Original post by Martin

Original post by Adam_42

If I remember right, you can eliminate the chance of a stack overflow very simply - always recurse into the smallest partition, and then split the bigger one and repeat.

Thats a good optimisation yes, it doesn't eliminate the chance of stack overflow, but it does significantly reduce stack usage, i.e. it will take a much much larger data set to generate a stack overflow.

There are certainly exceptions, such as microcontrollers with tiny fixed-depth stacks or possible, but hopefully no one anyone writing software for one of those will know to avoid recursion.

Since it reduces the stack usage to O(log n) it's not so much a case of requiring a significantly larger data set to cause stack overflow. Multiplying the number of items by 1000 would require bugger all extra recursive calls.

It's more a case of only overflowing the stack if it was already extremely close to reaching the limit already at the point of the call to perform the sort.

Practically, the optimisation mentioneddoeseliminate the liklihood of blowing the stack, as unless the stack was already extremely close to being exhausted, even being asked to sort 4 billion items in the pathologically worst ordering, wont be a problem.

Quicksort stack frames are fairly small in typical implementations.

Yeah. Sorry I wasn't disagreeing with you. I was disagreing with the other guy and quoting you for support.

###
#14
Members - Reputation: **2397**

Posted 04 March 2010 - 06:16 AM

Quote:

Original post by iMalc

Practically, the optimisation mentioneddoeseliminate the liklihood of blowing the stack, as unless the stack was already extremely close to being exhausted, even being asked to sort 4 billion items in the pathologically worst ordering, wont be a problem.

Quicksort stack frames are fairly small in typical implementations.

When designing such algorithms, there are two cases:

- You need absolute, precise-to-one guarantee, proven and verifiable

- Everything else

"Chance" is a very bad thing if reliability matters. Entire Java ecosystem, especially anything open source in there is designed around that. Which means you can almost bet that any of such solution will blow up hard when pressed just slightly, or when coming close to memory bounds.

Practice however has shown that very few actually need reliable algorithms (aka will pay for them), at which point anything goes, even bubble sort is actually a valid solution.

From business perspective - bubble sort (or more realistically, heap sort) is better - since it is deterministic and does not rely on "chance". Even probabilistic algorithms can have their bounds analyzed if needed. Otherwise, chance means "hopefully the machine is big enough, but we don't know or care how to test that" - which it frequently is.

This doesn't apply to QS (it's too dead horse like problem) but to general pragmatic approach to application design.

As far as QS goes, it is possible to determine the upper bound on stack, and implement it manually, or fail predictably if there is not enough memory. It doesn't necessarily solve implicit stack smashing, but neither do other function calls (if needed, they can be validated by making platform-specific assumptions and using benevolent hacks to assert them), but it does give a fail fast version which at very least covers the algorithm part.

###
#16
Members - Reputation: **194**

Posted 04 March 2010 - 08:40 PM

Quote:

Original post by implicit

I wouldn't worry about that, in practice you'll run out of address space *long* before causing a stack overflow (e.g. one stack frame per address bit at worst.)

There are certainly exceptions, such as microcontrollers with tiny fixed-depth stacks or possible, but hopefully no one anyone writing software for one of those will know to avoid recursion.

Humm, this is true, challenging my own assumptions. The worst case data set you could provide (from the point of view of generating the most recursions) would be one which always split the data set exactly in two, if the split was uneven the smaller set would be dealt with recursively -> less recursion.

So for a 64 bit address space, sorting bytes, we have a maximum of 2^64-1 bytes so sort, well, you can see immediately, the maximum recursion depth is 64-1.

[Edited by - Martin on March 5, 2010 4:40:41 AM]

Cheers,Martin

*If I've helped you, a rating++ would be appreciated*###
#17
Crossbones+ - Reputation: **2306**

Posted 05 March 2010 - 06:40 PM

Quote:Yup, you nailed it.

Original post by Martin

So for a 64 bit address space, sorting bytes, we have a maximum of 2^64-1 bytes so sort, well, you can see immediately, the maximum recursion depth is 64-1.

Novark, perhaps you missed my question earlier. Is this for an array or a linked-list?