Quicksort Pivot Selection



#1 Novark   Members   -  Reputation: 133


Posted 03 March 2010 - 06:38 PM

I'm working with Quicksort, and am wondering: What are some methods that you use to choose the pivot to optimize performance? I'd like to set up two quicksort functions and try comparing their performances with different pivot choices. Any thoughts/ideas? :-)

#2 LorenzoGatti   Crossbones+   -  Reputation: 2709


Posted 03 March 2010 - 10:01 PM

Pivot choices can be good on average given a certain distribution of keys, but there is no way to improve the worst case significantly: for example, if you pick the median of 2n+1 arbitrarily chosen keys as a pivot, a partition can have as few as n items.
Are you sure you want to use Quicksort?


#3 Novark   Members   -  Reputation: 133


Posted 03 March 2010 - 10:31 PM

Quote:
Original post by LorenzoGatti
Pivot choices can be good on average given a certain distribution of keys, but there is no way to improve the worst case significantly: for example, if you pick the median of 2n+1 arbitrarily chosen keys as a pivot a partition can have as few as n items.


I'm not too concerned about improving the worst case - just looking for ways to either avoid it (randomizing the data usually does the trick) or to choose the pivot in other ways. I'm sure there are articles written on this.

Quote:
Are you sure you want to use Quicksort?


Yep, I'm not implementing this for anything in particular...just messing around to try and find decent results for sorting a list of integers.

#4 Martin   Members   -  Reputation: 194


Posted 03 March 2010 - 10:53 PM

The optimum pivot is the value which splits the data set into two equal halves, i.e. half the values will be less than the pivot and half greater.

Naive quicksort implementations select the element in the middle of the range and hope that, on average, this is OK. With an entirely random distribution of values, sometimes you get very lucky, sometimes you get very unlucky, but usually you land somewhere in the middle. Attackers can attack a quicksort by deliberately seeding the data with values which produce the worst possible pivot selection, meaning that the quicksort gets extremely slow or produces so many recursions that it crashes.

Scanning the entire range for a good pivot is obviously too expensive.

Typical solutions to improve pivot choice include:
1. Picking a random element as the pivot (attackers can still predict this if they can predict your random number generator, so it takes more effort on their part).
2. Examining three values from the range - first, last, middle - and picking the one which looks like the best pivot, i.e. the median of the three (typically known as median-of-three or "tri-median"). I've tried choosing from five values in the range but got worse results on average. Again, attackers can get around this, but it's more difficult.

My quicksort implementation bails out to a merge sort at a given recursion depth, which is much more resistant to attack and ensures that even for the worst possible data set I will not overflow the stack. In practice, I've never seen it drop to the merge sort since I converted to median-of-three pivot selection.

If you're only interested in performance, I recommend trying median-of-three selection and seeing if it works for you. Quicksort performance is entirely dependent on the data set provided. It is possible that other sorts might be faster for your cases, but quicksort is a good general-purpose sorter and will cope with most data sets very well.
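
For illustration, here is a minimal sketch of the median-of-three selection described above (C++, assuming a plain vector of ints; the name medianOfThree is just an illustrative choice, not something from the post):

#include <vector>

// Return whichever of the first, middle and last indices holds the median
// of those three sampled values.
int medianOfThree(const std::vector<int>& a, int lo, int hi)
{
    int mid = lo + (hi - lo) / 2;
    int x = a[lo], y = a[mid], z = a[hi];
    if ((x <= y && y <= z) || (z <= y && y <= x)) return mid; // middle value is the median
    if ((y <= x && x <= z) || (z <= x && x <= y)) return lo;  // first value is the median
    return hi;                                                // last value is the median
}

The chosen element would then typically be swapped to one end of the range and used as the pivot for the partition step.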

Hope this helps
Cheers,
Martin

If I've helped you, a rating++ would be appreciated

[Edited by - Martin on March 4, 2010 9:53:25 AM]

#5 Adam_42   Crossbones+   -  Reputation: 2512


Posted 03 March 2010 - 11:10 PM

If I remember right, you can eliminate the chance of a stack overflow very simply - always recurse into the smallest partition, and then split the bigger one and repeat.
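
A minimal sketch of that scheme (C++, assuming a plain int array; the Lomuto-style partition helper is just one possible choice of partition, not something specified in the thread):

#include <utility> // std::swap

// Partition around the last element (Lomuto scheme) and return the pivot's
// final index. Any in-place partition scheme would work here.
int partition(int* a, int lo, int hi)
{
    int pivot = a[hi];
    int i = lo;
    for (int j = lo; j < hi; ++j)
        if (a[j] < pivot)
            std::swap(a[i++], a[j]);
    std::swap(a[i], a[hi]);
    return i;
}

// Recurse only into the smaller partition and loop on the larger one,
// which bounds the recursion depth to roughly log2(n).
void quicksort(int* a, int lo, int hi)
{
    while (lo < hi)
    {
        int p = partition(a, lo, hi);
        if (p - lo < hi - p)
        {
            quicksort(a, lo, p - 1); // smaller left half: recurse
            lo = p + 1;              // larger right half: iterate
        }
        else
        {
            quicksort(a, p + 1, hi); // smaller right half: recurse
            hi = p - 1;              // larger left half: iterate
        }
    }
}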

#6 Martin   Members   -  Reputation: 194


Posted 03 March 2010 - 11:14 PM

Quote:
Original post by Adam_42
If I remember right, you can eliminate the chance of a stack overflow very simply - always recurse into the smallest partition, and then split the bigger one and repeat.


That's a good optimisation, yes. It doesn't eliminate the chance of stack overflow, but it does significantly reduce stack usage, i.e. it will take a much, much larger data set to generate a stack overflow.

[Edited by - Martin on March 4, 2010 5:14:51 AM]

#7 Álvaro   Crossbones+   -  Reputation: 13367


Posted 04 March 2010 - 03:54 AM

Quote:
Original post by LorenzoGatti
Pivot choices can be good on average given a certain distribution of keys, but there is no way to improve the worst case significantly[...]


I believe this is not correct. It may not be practical, but you can actually turn quicksort into a worst-case O(n*log(n)) algorithm by picking the pivot carefully. For instance, you can pick the median, which can be computed in linear time using the "median of medians" algorithm.

You can find a description of that algorithm on this Wikipedia page.
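
As a sketch of that idea (C++; note that std::nth_element is typically introselect, which is only linear on average, so a true median-of-medians selection routine would be needed for the guaranteed worst-case bound described above):

#include <algorithm>
#include <vector>

// Move the median of a[lo..hi] to the middle index and return that index,
// so it can be used as the quicksort pivot.
int exactMedianPivot(std::vector<int>& a, int lo, int hi)
{
    int mid = lo + (hi - lo) / 2;
    std::nth_element(a.begin() + lo, a.begin() + mid, a.begin() + hi + 1);
    return mid;
}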

#8 implicit   Members   -  Reputation: 504


Posted 04 March 2010 - 04:14 AM

Quote:
Original post by Martin
Quote:
Original post by Adam_42
If I remember right, you can eliminate the chance of a stack overflow very simply - always recurse into the smallest partition, and then split the bigger one and repeat.

That's a good optimisation, yes. It doesn't eliminate the chance of stack overflow, but it does significantly reduce stack usage, i.e. it will take a much, much larger data set to generate a stack overflow.
I wouldn't worry about that, in practice you'll run out of address space *long* before causing a stack overflow (e.g. one stack frame per address bit at worst).
There are certainly exceptions, such as microcontrollers with tiny fixed-depth stacks, but hopefully anyone writing software for one of those will know to avoid recursion.

#9 Antheus   Members   -  Reputation: 2397


Posted 04 March 2010 - 04:36 AM

This question is akin to asking for a "comprehensive overview of quantum physics" - that is, a topic that has been covered from every angle, on thousands of pages, by just about all of the brightest minds.

Books by Knuth (Art of Computer Programming) or Sedgewick (Algorithms in C++) are a good start.

There are also casual overviews for someone who merely wants trivia-level knowledge. This talk is a nice hint at how many aspects beyond the pivot there are.

But to "optimize" quicksort, start with the books above. Or use the sort() provided with your language of choice.

Quote:
Scanning the entire range for a good pivot is obviously too expensive.
Not necessarily.

For example, one can scan while filling the original array - it needs to come from somewhere, and it needs to be filled one by one.

Quote:
hackers
Hackers are not really the problem; non-determinism is a bigger issue. Often a slower but reliable alternative is preferred, especially when looking at the n log n vs. n^2 case.

Quote:
I'm not implementing this for anything in particular
Which is very unfortunate, since conclusions can then only be reached at a theoretical level, by studying the characteristics on paper.

Obtaining conclusive results via experimentation is not viable in this case, especially considering that quicksort has already been covered completely from every angle.

#10 iMalc   Crossbones+   -  Reputation: 2306


Posted 04 March 2010 - 06:08 AM

Quote:
Original post by implicit
Quote:
Original post by Martin
Quote:
Original post by Adam_42
If I remember right, you can eliminate the chance of a stack overflow very simply - always recurse into the smallest partition, and then split the bigger one and repeat.

Thats a good optimisation yes, it doesn't eliminate the chance of stack overflow, but it does significantly reduce stack usage, i.e. it will take a much much larger data set to generate a stack overflow.
I wouldn't worry about that, in practice you'll run out of address space *long* before causing a stack overflow (e.g. one stack frame per address bit at worst.)
There are certainly exceptions, such as microcontrollers with tiny fixed-depth stacks or possible, but hopefully no one anyone writing software for one of those will know to avoid recursion.

Since it reduces the stack usage to O(log n), it's not so much a case of requiring a significantly larger data set to cause a stack overflow; multiplying the number of items by 1000 would add hardly any extra recursive calls.
It's more a case of only overflowing the stack if it was already extremely close to the limit at the point of the call to perform the sort.

Practically, the optimisation mentioned does eliminate the likelihood of blowing the stack: unless the stack was already extremely close to being exhausted, even being asked to sort 4 billion items in the pathologically worst ordering won't be a problem.
Quicksort stack frames are fairly small in typical implementations.

My question to the OP is: Is this for sorting an array or a linked list?
I have a kick-ass method of optimising linked-list-based quicksort such that the pivot selection turns sorted and reverse-sorted input into the best case! It doesn't apply to arrays though.

#11 implicit   Members   -  Reputation: 504


Posted 04 March 2010 - 06:11 AM

Quote:
Original post by iMalc
Quote:
Original post by implicit
Quote:
Original post by Martin
Quote:
Original post by Adam_42
If I remember right, you can eliminate the chance of a stack overflow very simply - always recurse into the smallest partition, and then split the bigger one and repeat.

Thats a good optimisation yes, it doesn't eliminate the chance of stack overflow, but it does significantly reduce stack usage, i.e. it will take a much much larger data set to generate a stack overflow.
I wouldn't worry about that, in practice you'll run out of address space *long* before causing a stack overflow (e.g. one stack frame per address bit at worst.)
There are certainly exceptions, such as microcontrollers with tiny fixed-depth stacks or possible, but hopefully no one anyone writing software for one of those will know to avoid recursion.

Since it reduces the stack usage to O(log n) it's not so much a case of requiring a significantly larger data set to cause stack overflow. Multiplying the number of items by 1000 would require bugger all extra recursive calls.
It's more a case of only overflowing the stack if it was already extremely close to reaching the limit already at the point of the call to perform the sort.

Practically, the optimisation mentioned does eliminate the liklihood of blowing the stack, as unless the stack was already extremely close to being exhausted, even being asked to sort 4 billion items in the pathologically worst ordering, wont be a problem.
Quicksort stack frames are fairly small in typical implementations.
Well, yeah... Wasn't that what I said?

#12 iMalc   Crossbones+   -  Reputation: 2306


Posted 04 March 2010 - 06:12 AM

Quote:
Original post by implicit
Quote:
Original post by iMalc
Quote:
Original post by implicit
Quote:
Original post by Martin
Quote:
Original post by Adam_42
If I remember right, you can eliminate the chance of a stack overflow very simply - always recurse into the smallest partition, and then split the bigger one and repeat.

Thats a good optimisation yes, it doesn't eliminate the chance of stack overflow, but it does significantly reduce stack usage, i.e. it will take a much much larger data set to generate a stack overflow.
I wouldn't worry about that, in practice you'll run out of address space *long* before causing a stack overflow (e.g. one stack frame per address bit at worst.)
There are certainly exceptions, such as microcontrollers with tiny fixed-depth stacks or possible, but hopefully no one anyone writing software for one of those will know to avoid recursion.

Since it reduces the stack usage to O(log n) it's not so much a case of requiring a significantly larger data set to cause stack overflow. Multiplying the number of items by 1000 would require bugger all extra recursive calls.
It's more a case of only overflowing the stack if it was already extremely close to reaching the limit already at the point of the call to perform the sort.

Practically, the optimisation mentioned does eliminate the liklihood of blowing the stack, as unless the stack was already extremely close to being exhausted, even being asked to sort 4 billion items in the pathologically worst ordering, wont be a problem.
Quicksort stack frames are fairly small in typical implementations.
Well, yeah... Wasn't that what I said?


Yeah. Sorry, I wasn't disagreeing with you. I was disagreeing with the other guy and quoting you for support.

#13 implicit   Members   -  Reputation: 504


Posted 04 March 2010 - 06:15 AM

Quote:
Original post by iMalc
Yeah. Sorry, I wasn't disagreeing with you. I was disagreeing with the other guy and quoting you for support.
Whoa... You can do that? ;)

#14 Antheus   Members   -  Reputation: 2397


Posted 04 March 2010 - 06:16 AM

Quote:
Original post by iMalc

Practically, the optimisation mentioned does eliminate the liklihood of blowing the stack, as unless the stack was already extremely close to being exhausted, even being asked to sort 4 billion items in the pathologically worst ordering, wont be a problem.
Quicksort stack frames are fairly small in typical implementations.


When designing such algorithms, there are two cases:
- You need an absolute, precise guarantee, proven and verifiable
- Everything else

"Chance" is a very bad thing if reliability matters. Entire Java ecosystem, especially anything open source in there is designed around that. Which means you can almost bet that any of such solution will blow up hard when pressed just slightly, or when coming close to memory bounds.

Practice, however, has shown that very few actually need reliable algorithms (i.e. will pay for them), at which point anything goes; even bubble sort is actually a valid solution.

From a business perspective, bubble sort (or more realistically, heap sort) is better, since it is deterministic and does not rely on "chance". Even probabilistic algorithms can have their bounds analyzed if needed. Otherwise, chance means "hopefully the machine is big enough, but we don't know or care how to test that", which it frequently is.

This doesn't apply so much to quicksort (it's too much of a dead-horse problem) as to the general pragmatic approach to application design.

As far as quicksort goes, it is possible to determine the upper bound on the stack and implement the stack manually, or fail predictably if there is not enough memory. It doesn't necessarily solve implicit stack smashing, but neither do other function calls (if needed, they can be validated by making platform-specific assumptions and using benevolent hacks to assert them); it does give a fail-fast version which at the very least covers the algorithm part.
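
As a rough illustration of the manual-stack idea (C++; the function names and the Lomuto partition are illustrative choices, not anything from the post): pushing only the larger partition and processing the smaller one immediately bounds the explicit stack to roughly log2(n) entries, so a small fixed-size stack covers any array that fits in a 64-bit address space, and a production version could check the stack index and fail predictably instead of overflowing.

#include <utility> // std::swap, std::pair

// Lomuto partition around the last element; returns the pivot's final index.
static long partitionRange(int* a, long lo, long hi)
{
    int pivot = a[hi];
    long i = lo;
    for (long j = lo; j < hi; ++j)
        if (a[j] < pivot)
            std::swap(a[i++], a[j]);
    std::swap(a[i], a[hi]);
    return i;
}

// Iterative quicksort with an explicit, fixed-size stack. The larger
// partition is deferred (pushed) and the smaller one is processed first,
// so the stack never grows beyond roughly log2(n) entries.
void quicksortIterative(int* a, long lo, long hi)
{
    std::pair<long, long> stack[64]; // ample for any 64-bit address space
    int top = 0;
    stack[top++] = { lo, hi };
    while (top > 0)
    {
        std::pair<long, long> range = stack[--top];
        lo = range.first;
        hi = range.second;
        while (lo < hi)
        {
            long p = partitionRange(a, lo, hi);
            if (p - lo < hi - p)
            {
                stack[top++] = { p + 1, hi }; // defer the larger right side
                hi = p - 1;                   // keep working on the left
            }
            else
            {
                stack[top++] = { lo, p - 1 }; // defer the larger left side
                lo = p + 1;                   // keep working on the right
            }
        }
    }
}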

#15 Novark   Members   -  Reputation: 133


Posted 04 March 2010 - 07:02 PM

Thanks for all the replies everyone,

I'll take a look at some of the suggested methods!

#16 Martin   Members   -  Reputation: 194


Posted 04 March 2010 - 08:40 PM

Quote:
Original post by implicit
I wouldn't worry about that, in practice you'll run out of address space *long* before causing a stack overflow (e.g. one stack frame per address bit at worst).
There are certainly exceptions, such as microcontrollers with tiny fixed-depth stacks, but hopefully anyone writing software for one of those will know to avoid recursion.


Hmm, this is true; it challenges my own assumptions. The worst-case data set you could provide (from the point of view of generating the deepest recursion) would be one which always splits the data set exactly in two; if the split were uneven, the smaller set would be dealt with recursively, so there would be less recursion.

So for a 64-bit address space, sorting bytes, we have a maximum of 2^64 - 1 bytes to sort, and you can see immediately that the maximum recursion depth is 64 - 1 = 63.

[Edited by - Martin on March 5, 2010 4:40:41 AM]
Cheers,
Martin
If I've helped you, a rating++ would be appreciated

#17 iMalc   Crossbones+   -  Reputation: 2306


Posted 05 March 2010 - 06:40 PM

Quote:
Original post by Martin
So for a 64-bit address space, sorting bytes, we have a maximum of 2^64 - 1 bytes to sort, and you can see immediately that the maximum recursion depth is 64 - 1 = 63.
Yup, you nailed it.

Novark, perhaps you missed my question earlier. Is this for an array or a linked-list?

#18 Novark   Members   -  Reputation: 133


Posted 07 March 2010 - 11:46 AM

Quote:
Novark, perhaps you missed my question earlier. Is this for an array or a linked-list?


I'm using arrays in my implementation, but feel free to share your method using linked lists if you want!



