Finding mode and median

Started by
7 comments, last by alvaro 12 years, 2 months ago
Anyone know of any efficient algorithms for finding the mode or median of an unsorted array? (Both of these are trivial if the array is sorted)

I can't find much on the web regarding this. A few forum threads and a mention of an algorithm called "median of medians" which I believe is flawed and does not produce the correct result.

I do not agree that the median of a bunch of medians of a bunch of sets is the true median of the union of all the sets. On average it will probably be close, but I'm not interested in that.

Here are some stabs at algorithms. Please let me know if there are any better:

median:
two lists and a current_median_value
iterate through our array, if element is less than or equal to current_median, put it in min list. if it's greater, put it in max list. when the difference between the number of elements in one list and the other list becomes two, push the smallest (from the max list) or the largest (from the min list) into the current_median_value spot and shove the previous current_median_value into the other list.
-you have to keep both lists sorted which is linear per insert but n^2 in total.
diminishing merge sort
a vanishing (1/2^n) merge sort where you pick your pivot. move everything to its partition, then if a sub-list is smaller than half the total elements of the original array, you discard it. basically you do merge sort where you don't care about the lower half of the array. so it's faster than quick sort. final answer is a O(1) lookup

mode:
hash table
iterate through array. for each element, insert it into hash table if it doesn't exist. otherwise, increment its counter by 1.
-seems like there should be a more efficient way. obviously if an element occurs over half the time, then you can just do bit-wise counting and do the whole thing in linear time and constant space. but that's a clever optimization in that special case. is there any adaptation to the general case?
Advertisement
You can compute the median using QuickSelect, which is easy to implement and takes average linear time. If you want a guaranteed linear-time algorithm, use "median of medians".

For the mode, using the hash is the best solution I know of.
You can compute the median using QuickSelect, which is easy to implement and takes average linear time. If you want a guaranteed linear-time algorithm, use "median of medians". For the mode, using the hash is the best solution I know of.


How does median of medians give you the true median?

If you take an array, break it into sub arrays, find the median of all of them, then take the median of all the medians u computed, you will NOT have the actual median 100% of the time. Probably not even the majority of the time. I think it's an approximation. A statistical thing that works when you don't necessarily need the exact median but just something close.
You need to read the article I quoted, not try to infer its content by its name.

EDIT: To be more helpful, the median of medians is used as the pivot in QuickSelect to guarantee linear worst-case performance.
Oh I see. So median of medians finds a good pivot to guarantee that quickselect runs in linear time. It'll be the quick select that finds us our actual median.

Anyone know of any efficient algorithms for finding the mode or median of an unsorted array? (Both of these are trivial if the array is sorted)


What does efficient mean?

O(nlogn) is typical sorting complexity.

But radix or merge sort are O(n), which is minimum required for exact solution.

Median-of-medians or similar specialized solutions might have lower constant factors, but are the same as far as complexity goes.

The rest would be implementation trade-offs.

O(nlogn) is typical sorting complexity.

But radix or merge sort are O(n), which is minimum required for exact solution.


Hmmm... radix sort is not universally applicable (your data may not be integers), and merge sort is O(n*log(n)). Perhaps you meant something else instead of "merge sort"...
Merge has O(n) best case, so it's not ideal.

radix sort is not universally applicable (your data may not be integers)[/quote]

A little though experiment - is median/mode sensible for items which cannot be radix sorted?

We're still at Big-O level here, radix sorting floats can be done by using their string representation if need be.

A little though experiment - is median/mode sensible for items which cannot be radix sorted?

We're still at Big-O level here, radix sorting floats can be done by using their string representation if need be.


I am not sure if I have a practical example, but if you implement a data type that represents algebraic real numbers (as a polynomial and an approximate value of the root), there are algorithms that allow you to compare any two numbers, so you can use a comparison-based selection algorithm to compute the median. However, radix sort is kind of useless in this case.

This topic is closed to new replies.

Advertisement