QuickSort questions

utilae:
Hi, I am using C++. I want to use a sorting algorithm, and after having a good look at all of them, I think QuickSort would be the fastest, and its disadvantages could be removed. If QuickSort's chosen pivot is bad then its speed is O(n^2), i.e. slow. QuickSort would always be O(n*log(n)), i.e. fast, if the median was always chosen. In the numbers 1-10, 5 is the median. In the letters A-Z, M or N is the median. QuickSort performs slowly if the list is already perfectly sorted.

Some questions:

1) Does picking the median as the pivot always mean that QuickSort will be fast, even if the list is sorted? If the list is sorted and the pivot picked happens to be the median, will QuickSort go fast, or will it still go slow because the list is sorted?

2) The best way to pick the median is to know where the median is, which is easier if the list is already sorted. So, wouldn't it be a good idea to store the index of the median after the sort (e.g. if you are using vectors), so that if you sort again it still goes very fast, because you can use the median that you stored?

3) How would you make QuickSort stable, and would it lose too much speed? Can it be made stable without speed loss?

4) What are the best sort algorithms to use for the following numbers of items in the list (n)? n = 10, 100, 1000, 10000, 100000, 1000000

About 4), the best choice also depends on the objects you are sorting. I read in a programming book (I don't remember which one) that it is possible to sort a vector of integers in O(n) time, provided that you know the maximum value beforehand and assuming it is not too large compared with the size of the vector (i.e. a vector of 10 elements with 1000000 as the maximum value would not be a good candidate for this).

The algorithm was something like this:
1) Given a vector (VOriginal) of integers, find its maximum value.
2) Generate an auxiliary vector (VAuxiliar) of integers with size equal to that maximum value plus one, and initialize all its values to 0.
3) For every position (pos) on the original vector:
VAuxiliar[ VOriginal[pos] ]++;
4) You will end up with VAuxiliar containing all the sorted information:
VAuxiliar[0] = number of 0s in the original array.
VAuxiliar[1] = number of 1s in the original array.
And so on.
(Of course, you can remap the values to handle negative values.)
Just walk VAuxiliar to rebuild VOriginal in sorted order.

It is a simple but efficient algorithm, despite being restricted in its use. However, you might find it useful in some situations.
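(For reference, what is described above is essentially a counting sort.) Here is a minimal sketch of those steps, assuming non-negative integers and reusing the VOriginal/VAuxiliar names from the post:

#include <algorithm>
#include <vector>

// Counting sort as described above: only suitable when the values are
// non-negative integers and the maximum value is not much larger than
// the number of elements. Runs in O(n + maxValue) time.
void countingSort(std::vector<int>& VOriginal)
{
    if (VOriginal.empty())
        return;

    // 1) Find the maximum value.
    const int maxValue = *std::max_element(VOriginal.begin(), VOriginal.end());

    // 2) Auxiliary vector of counts, one slot per possible value (0..maxValue).
    std::vector<int> VAuxiliar(maxValue + 1, 0);

    // 3) Count how many times each value occurs.
    for (std::size_t pos = 0; pos < VOriginal.size(); ++pos)
        ++VAuxiliar[VOriginal[pos]];

    // 4) Rewrite the original vector in sorted order from the counts.
    std::size_t out = 0;
    for (int value = 0; value <= maxValue; ++value)
        for (int count = 0; count < VAuxiliar[value]; ++count)
            VOriginal[out++] = value;
}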

Quote:
Original post by utilae
I want to use a sorting algorithm, and after having a good look at all of them, I think QuickSort would be the fastest, and its disadvantages could be removed.


Quick sort is generally considered to be the best general-purpose sorting routine; however, it is always possible to find data sets which favour one sorting algorithm over another.

Quote:
Original post by utilae
If QuickSort's chosen pivot is bad then its speed is O(n^2), i.e. slow. QuickSort would always be O(n*log(n)), i.e. fast, if the median was always chosen. In the numbers 1-10, 5 is the median. In the letters A-Z, M or N is the median. QuickSort performs slowly if the list is already perfectly sorted.

Some questions.
1) Does picking the median as the pivot always mean that QuickSort will be fast, even if the list is sorted? If the list is sorted and the pivot picked happens to be the median, will QuickSort go fast, or will it still go slow because the list is sorted?


The best choice of pivot is one that divides the current data set exactly in half so that half the elements are lower than the pivot and half are higher. Given a data set of 1, 2, 3, 4, 5, 50, 100, you can see that 4 would be the best choice, not 50.

Quote:
Original post by utilae
2) The best way to pick the median is to know where the median is, which is easier if the list is already sorted. So, wouldn't it be a good idea to store the index of the median after the sort (e.g. if you are using vectors), so that if you sort again it still goes very fast, because you can use the median that you stored?


Finding the true median of any partition can be done in O(n) time, so theoretically the worst-case performance of quick sort can be altogether removed. In practice, finding the median (which always commits you to O(n) time) often takes more time than letting quick sort naively struggle on.

If worst case performance is a real issue other sorts may be of more interest to you.

Hackers often attempt to exploit quick sort by deliberately feeding in a large data set pre-ordered to present quick sort with a series of poor pivot choices, hoping to cause a stack overflow. Choosing a random pivot from a partition can prevent this, unless the hackers replicate your random number generator.
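A minimal sketch of both ideas over a sub-range [first, last) of a vector: std::nth_element gives the true median in O(n) average time (the guaranteed-linear selection algorithms are more involved), while a random pivot defends against deliberately ordered input. The function names here are illustrative, not any library's API:

#include <algorithm>
#include <random>
#include <vector>

// Option A: the true median of the range, found in O(n) average time.
// Note that std::nth_element rearranges the range so the median lands at 'mid'.
int medianPivot(std::vector<int>& values, std::size_t first, std::size_t last)
{
    const std::size_t mid = first + (last - first) / 2;
    std::nth_element(values.begin() + first, values.begin() + mid,
                     values.begin() + last);
    return values[mid];
}

// Option B: a pivot chosen at random, which makes adversarial
// "worst case" inputs extremely unlikely to hurt.
int randomPivot(const std::vector<int>& values, std::size_t first, std::size_t last)
{
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> pick(first, last - 1);
    return values[pick(rng)];
}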

Quote:
Original post by utilae
3) How would you make QuickSort stable, and would it lose too much speed? Can it be made stable without speed loss?


No, it can't be made stable without some minor speed loss. Making it stable is simple: keep an index to the original order. Then perform another pass through the data, finding where 2 or more elements with the same sort value occur, and sort those runs by their original index. How slow this is depends entirely on your data set, as does the choice of sort algorithm to use. Finding which ranges to sort is trivial and can be done in O(n) time.

Quote:
Original post by utilae
4) What are the best sort algorithms to use for the following numbers of items in the list (n)?
n = 10, 100, 1000, 10000, 100000, 1000000


Again, it's very hard to come up with a single best sort algorithm; it depends on your data set. I would suggest an insertion sort for 10 items, and quick sort for anything greater than 10. Suck it and see.

Shameless plug time: <sorry> our quick sort routine is actually a hybrid of 3 different sorting algorithms with a semi-intelligent pivot choice and a hybrid stack/recursion implementation. Downloadable for non-commercial use for free.

[Edited by - Martin Fuller on March 17, 2006 6:35:28 AM]

Quote:
Original post by utilae
1) Does picking the median as the pivot always mean that QuickSort will be fast


No Free Lunch Theorem. You make some cases faster, you make other cases slower.

Quote:
So if the list is sorted and the pivot picked happens to be the median, will QuickSort go fast, or will it still go slow because the list is sorted?


It'll be fast. But remember you'll be picking another pivot for each partition.

Quote:
2) The best way to pick the median is to know where the median is, which is easier if the list is already sorted.


Typically, you don't actually use the median. That's too slow to compute -- quicksort is quick partly because its inner loop is so simple. Instead, most algorithms use the median of 3 values, typically taken from the beginning, middle and end of the range. It's a compromise. No free lunch.
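A minimal sketch of median-of-three selection over an index range of a vector (the function name is illustrative):

#include <vector>

// Median-of-three: return the index of the median of the first, middle and
// last elements of the range [first, last]. Cheap to compute, and it avoids
// the worst case on already-sorted or reverse-sorted input.
std::size_t medianOfThree(const std::vector<int>& values,
                          std::size_t first, std::size_t last)
{
    const std::size_t mid = first + (last - first) / 2;
    const int a = values[first], b = values[mid], c = values[last];

    if ((a <= b && b <= c) || (c <= b && b <= a)) return mid;
    if ((b <= a && a <= c) || (c <= a && a <= b)) return first;
    return last;
}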

Quote:
So, wouldn't it be a good idea to store the index of the median after the sort (e.g. if you are using vectors), so that if you sort again it still goes very fast, because you can use the median that you stored?


Median-of-three works just as well. And where would you store that median? What do you do when the data is modified? Don't get me wrong, caching results is often useful, but here it is unnecessary.

Quote:
3) How would you make QuickSort stable, and would it lose too much speed?


Add the original position of the element in the list as a secondary sort key, so that instead of sorting "BCBA", you really are sorting (B,0) (C,1) (B,2) (A,3) to (A,3) (B,0) (B,2) (C,1).
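A minimal sketch of that idea, using std::sort (unstable in general) over (value, original index) pairs so that equal values keep their original order; the function name is illustrative:

#include <algorithm>
#include <utility>
#include <vector>

// Make any comparison sort behave stably by using the original position
// as a secondary key: equal values keep their original relative order.
void sortStably(std::vector<char>& items)
{
    std::vector<std::pair<char, std::size_t>> keyed;
    keyed.reserve(items.size());
    for (std::size_t i = 0; i < items.size(); ++i)
        keyed.push_back({items[i], i});

    // std::pair compares the value first, then the original index,
    // so "BCBA" sorts as (A,3) (B,0) (B,2) (C,1).
    std::sort(keyed.begin(), keyed.end());

    for (std::size_t i = 0; i < items.size(); ++i)
        items[i] = keyed[i].first;
}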

Quote:
Can it be made stable without speed loss?


No.

Quote:
4) What are the best sort algorithms to use for the following numbers of items in the list (n)?
n = 10, 100, 1000, 10000, 100000, 1000000


In C++? std::sort or std::stable_sort. std::partial_sort and std::partial_sort_copy are also available.

std::make_heap and std::sort_heap also work, as would an appropriate combination of std::partition or std::stable_partition and std::merge.

And yes, I know I haven't answered your question. I'd probably use std::sort unless I need something more specific for some reason or other.
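For reference, a minimal usage sketch of those standard algorithms (the data here is just an example):

#include <algorithm>
#include <vector>

int main()
{
    std::vector<int> data = {5, 3, 9, 1, 3, 7};

    // Unstable; in modern library implementations this is typically an introsort.
    std::sort(data.begin(), data.end());

    // Stable alternative: equal elements keep their relative order.
    std::stable_sort(data.begin(), data.end());

    // Heap-based sorting via the heap algorithms mentioned above.
    std::make_heap(data.begin(), data.end());
    std::sort_heap(data.begin(), data.end());
}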

Guest Anonymous Poster
Quote:
Original post by Fruny
Quote:
Original post by utilae
1)Does picking the median as the pivot always mean that QuickSort will be fast


No Free Lunch Theorem. You make some cases faster, you make other cases slower.
NFLT doesn't apply in a case like this. Your statement is correct, but it has nothing to do with NFLT.

Quote:
Original post by Anonymous Poster
NFLT doesn't apply in a case like this. Your statement is correct, but it has nothing to do with NFLT.


While the NFLT is formulated for search algorithms, it generalizes to other algorithms. In fact, it works just fine for sorting if you define your cost function to be the number of out-of-order elements in the array.

The act of picking the median does slow down the algorithm when compared to the cases where a naive choice would have done the right thing on its own, faster.

Quote:
Original post by Fruny
Quote:
Original post by utilae
4) What are the best sort algorithms to use for the following numbers of items in the list (n)?
n = 10, 100, 1000, 10000, 100000, 1000000


In C++? std::sort or std::stable_sort. std::partial_sort and std::partial_sort_copy are also available.

std::make_heap and std::sort_heap also work, as would an appropriate combination of std::partition or std::stable_partition and std::merge.

And yes, I know I haven't answered your question. I'd probably use std::sort unless I need something more specific for some reason or other.


I would say you answered it quite well.

The thing about picking a splitter with quicksort is that the larger the dataset, the less the chance becomes that you pick bad splitter values. Given that the chance of picking a large number of terrible splitters is very small, it's more efficient to "cross that bridge when you come to it" so to speak.[smile]

Elaborating on that:
Rather than spending a lot of time beforehand trying to pick the ideal splitter, simply realise that almost all the time it will pick a reasonable splitter. Then you just have to have a plan B should the unlikely ever occur. You simply keep an eye on the recursion depth and if the worst comes to the worst, you switch to another sorting algorithm before you end up taking longer than bubblesort.

The process I've just described is known as the Introspective Sort (IntroSort) algorithm. The worst case of O(n*n) is avoided, so you get O(n log n) guaranteed running time, but with practically the speed of just using quicksort. Throw in Sedgewick's final insertion-sort pass and you've got what the standard library typically uses.[cool]
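A rough sketch of that depth-limited scheme, assuming std::partition for the partitioning step and the standard heap algorithms as the fallback; the cutoffs and names are illustrative, not how any particular library implements it:

#include <algorithm>
#include <vector>

using Iter = std::vector<int>::iterator;

// Depth-limited quicksort: if the recursion gets suspiciously deep (a sign of
// repeatedly poor pivots), fall back to heapsort, which guarantees O(n log n).
void introSort(Iter first, Iter last, int depthLimit)
{
    if (last - first <= 16)
    {
        std::sort(first, last);  // a real introsort finishes small ranges with insertion sort
        return;
    }
    if (depthLimit == 0)
    {
        std::make_heap(first, last);
        std::sort_heap(first, last);
        return;
    }

    // Three-way partition around the value of the middle element, so that
    // elements equal to the pivot are excluded from both recursive calls.
    const int pivot = *(first + (last - first) / 2);
    Iter middle1 = std::partition(first, last,
                                  [pivot](int x) { return x < pivot; });
    Iter middle2 = std::partition(middle1, last,
                                  [pivot](int x) { return !(pivot < x); });

    introSort(first, middle1, depthLimit - 1);
    introSort(middle2, last, depthLimit - 1);
}

It would be called with a depth limit of roughly 2*log2(n), which mirrors the usual bound used in practice.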

For only 10 items, insertion sort will be reasonably optimal, which is of course also what the standard library would typically use.

So, don't reinvent the wheel. std::sort (or std::stable_sort if you want it to be stable) will already do what you want, probably much faster than your own code would. That said...


The Quicksort technique does not make for the fastest algorithm, though. If you are using an integral type of key (or something compatible) for your items, then you can perform sorting in O(n log log n) or O(n log k) time, etc. If you find std::sort too slow (which I highly doubt), then you can use algorithms such as flashsort, bucketsort, or radixsort.

Really, there's one absolutely critical piece of information we're missing here: are you sorting an array or a linked list? When it comes to a linked list, many people will tell you that mergesort was practically made for lists. And really it is very good; however, for any decent number of items, bucketsort really blows anything else out of the water! And it can easily be stable too![cool]

Guest Anonymous Poster
Quote:
Original post by Fruny
Quote:
Original post by Anonymous Poster
NFLT doesn't apply in a case like this. Your statement is correct, but it has nothing to do with NFLT.


While the NFLT is formulated for search algorithms, it generalizes to other algorithms. In fact, it works just fine for sorting if you define your cost function to be the number of out-of-order elements in the array.
NFLT states that "[...] all algorithms that search for an extremum of a cost function perform exactly the same, when averaged over all possible cost functions.". What are the *other* cost functions in this case? Perhaps that non-sorted arrays are preferred over sorted ones? Yeah, NFLT can be applied to sorting in the sense that a sorted list may not be the desired outcome in all cases. Thus sorting an array may make things worse with a different cost function than one which prefers a sorted array. But that's completely irrelevant.

We're talking about the performance here. Performance is the real-life bias needed to draw true conclusions and to make NFLT irrelevant. If I add 1 million NOP operations to a sort algorithm, does it perform better in some cases and worse in others? No. It performs worse in all cases. If I didn't make that assumption (that is, if it was possible that adding 1 million NOPs might sometimes make programs run faster), then one could reason that no algorithm can be faster than another. But that's clearly not the case in our world.
Quote:
The act of picking the median does slow down the algorithm when compared to the cases where a naive choice would have done the right thing on its own, faster.
Yes. That's why I said "Your statement is correct, but it has nothing to do with NFLT."

Guest Anonymous Poster
Quote:
Original post by utilae
4)What are the best sort algorithms to use for the following numbers of items in the list (n).
n=10, 100, 1000, 10000, 100000, 1000000


In a lecture about sorting algorithms my prof showed us charts about that.

There, for small numbers of items (less than 1000), insertion sort beats the others in nearly all areas.

The larger the numbers the more quicksort and heapsort catch up. They were both quite close in average case.

I guess it is easier to just use heapsort than to make quicksort more stable ;)
HeapSort is a real O(n log n) algorithm

Radix sort from the front is best because it's stable.
Insertion sort works for arrays smaller than 120.
For 32-bit (or so) integer keys, a reasonable consideration is radix sort from the back, as it has guaranteed running time and is stable on top of that.

CPU cache behaviour is, however, something that hurts the performance of most of these algorithms.
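A rough sketch of the "from the back" (least-significant-digit-first) variant for unsigned 32-bit keys, processing one byte per pass; the names are illustrative:

#include <array>
#include <cstdint>
#include <vector>

// LSD (least-significant-digit-first) radix sort on unsigned 32-bit keys,
// one byte per pass: 4 passes, each O(n), and each pass is stable.
void radixSortLSD(std::vector<std::uint32_t>& keys)
{
    std::vector<std::uint32_t> buffer(keys.size());

    for (int shift = 0; shift < 32; shift += 8)
    {
        // Count how many keys fall into each of the 256 buckets.
        std::array<std::size_t, 256> count{};
        for (std::uint32_t key : keys)
            ++count[(key >> shift) & 0xFF];

        // Turn counts into starting offsets (prefix sums).
        std::size_t offset = 0;
        for (std::size_t& c : count)
        {
            std::size_t next = offset + c;
            c = offset;
            offset = next;
        }

        // Scatter keys into the buffer in bucket order, preserving relative order.
        for (std::uint32_t key : keys)
            buffer[count[(key >> shift) & 0xFF]++] = key;

        keys.swap(buffer);
    }
}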

Quote:
Original post by Anonymous Poster
Quote:
Original post by utilae
4)What are the best sort algorithms to use for the following numbers of items in the list (n).
n=10, 100, 1000, 10000, 100000, 1000000


In a lecture about sorting algorithms my prof showed us charts about that.

There, for small numbers of items (less than 1000), insertion sort beats the others in nearly all areas.

The larger the numbers the more quicksort and heapsort catch up. They were both quite close in average case.

I guess it is easier to just use heapsort than to make quicksort more stable ;)
HeapSort is a real O(n log n) algorithm
There is no such thing as "more stable"; it either is stable or it isn't. You must not know what a stable sorting algorithm is, so look it up, or read this.

HeapSort is NOT stable. Furthermore, heapsort is a constant factor of around 2 to 4 times slower than quicksort.

Quote:
Original post by Raghar
Radix sort from the front is best because it's stable.
Insertion sort works for arrays smaller than 120.
For 32-bit (or so) integer keys, a reasonable consideration is radix sort from the back, as it has guaranteed running time and is stable on top of that.

CPU cache behaviour is, however, something that hurts the performance of most of these algorithms.
Insertion sort "works" for arrays of any size. If you mean that it is the size where it becomes slower than other algorithms, then you'll find you've overestimated InertionSort. For a random array, the cutoff point where other algorithms typically outperform Insertionsort is above somewhere between 8 and 32 items.
If the array is very close to already sorted however, insertionsort can always win.
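For reference, a minimal insertion sort sketch; this is the kind of routine a library-quality quicksort typically hands small or nearly sorted ranges to:

#include <vector>

// Straight insertion sort: O(n^2) in general, but very fast for small
// arrays and O(n) when the input is already (nearly) sorted.
void insertionSort(std::vector<int>& values)
{
    for (std::size_t i = 1; i < values.size(); ++i)
    {
        const int current = values[i];
        std::size_t j = i;
        // Shift larger elements one slot to the right, then drop 'current' in.
        while (j > 0 && values[j - 1] > current)
        {
            values[j] = values[j - 1];
            --j;
        }
        values[j] = current;
    }
}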
