# Separation By Frequency

This topic is 4498 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

I'm currently in need of a (preferably simple to implement) algorithm to take a set of items with different probabilities and optimally split them into x groups (where x is something that is dictated to the algorithm), where each group has items of roughly the same probability (optimal meaning the lowest variance in probability of items within each group). Does anybody know of such an algorithm?

##### Share on other sites
Why don't you just sort them regularly and equally split them into x groups?

EDIT: Ow, i think i understand your problem. Offcourse you wouldnt want the same number of elements in each group.

I don't really know an exsisting method that does this, however, i can think of one if you'd want 2^n groups.

First you could sort the entire list using your favorite sorting method:)
Then you could split the list into 2 groups of closest matching variance. Looks a little like quick sort.

You need 2 pointers (pRight,pLeft)
Two variables to keep track of the variance in the 2 subsets (vRight, vLeft)

(crappy untested pseudo code below, just to give you an idea)
int getSplitPoint(float* array, int len){  int pLeft = 0;  int pRight = len-1;  int vLeft = 0;  int vRight = 0;  while (pLeft < pRight) {    while (vLeft <= vRight && pLeft < pRight) {      pLeft++;      // calculate absolute variance of subset 0 -> pLeft      // for instance: nasty piece of code below       // (calculates average - initial)      vLeft = (((vLeft * pLeft-1) + array[pLeft])/pLeft) - array[0];    }    while (vRight <= vLeft && pLeft < pRight) {      pRight--;      //same nasty code here      vLeft = (((vRight * ((len-pRight)-1)) + array[pRight])/(len-pRight)) - array[len-1];    }  }  //now 0..pLeft and pRight..len(array)-1 have closest matching variances  return pRight; (use this to split in 2)}

if you want 4 sets you would call this function once on the entire array to get split1, then on array[0..split1] and array[split1,len-split1+1]. to get split2 and split3.
Now you have 4 groups:
0..split2,
split2..split1 ,
split1..split3
split3..len.

After some optimization this seems to me a nice way to split into 2^n groups.

Good luck,
Limitz;

[Edited by - Limitz on March 27, 2006 9:03:29 AM]

##### Share on other sites
The variance of each group or "bucket" will be minimized by placing new items in buckets whose current mean is closest to the item under consideration. Let P = {p1, ..., pn} the items to be sorted and k the number of buckets into which they should be placed. If you only get to see the elements of P one at a time, then about the best you can do for simplicity is as follows:

If n <= k, put one item in each bucket until there are no more items. The sample variance of each bucket is zero (since the sample mean is equal to the sample), so this is optimal. If n > k, put one item in each bucket until there are no more empty buckets. For each of the remaining items, put that item in the bucket whose sample mean is closest to the item's value. In this way, you minimize the change in variance at each step (but not necessarily the overall variance).

Even if you do get to see all the items at once, I don't think you can guarantee an optimal packing. This looks suspiciously like a bin-packing problem, which are notoriously NP-hard problems in computational complexity.

hope that helps,

##### Share on other sites
So you have n points.

You want to divide these n points into k buckets.

Naive: Try them all. Then there are (n-k)choose(k) ways of doing this.

You want the division to be "optimal" in that your choice has the minimium variance in each bucket.

Let v(buckets) be the vector of variances of the bucket.

Do you want to minimize sup(v), L_1(v) (sum of the variances), L_2(v) (sum of the squares of the variances)?

By variance, I assume you mean statistical variance? Not sup(x-y) for x,y from bucket (the "width" of each bucket)?

I do strongly suspect your problem is in NP, but it is worth getting a clearer description of the problem. Maybe we'll get lucky!

1. 1
Rutin
25
2. 2
3. 3
4. 4
JoeJ
18
5. 5

• 14
• 14
• 11
• 11
• 9
• ### Forum Statistics

• Total Topics
631757
• Total Posts
3002141
×