weighted random numbers

Started by
15 comments, last by alvaro 15 years, 11 months ago
Hello, I have searched Google for a couple hours looking for help in this issue, but they were highly mathematical. What I have is a table with letters and occurrences. For example: a 1200 b 500 c 700 d 350 ... z 35 that I obtained by analyzing a word list of the English language with specifically American words included. Now want I want to do is choose a random letter with their occurrences weighting them. Its for a text twist like game. The easiest solution I've come across is to scale down the distribution to 100 or maybe 1000 element array and place the letters in the array according to their distribution. For example: [a,a,a,b,c,c,d] and choosing a random element from that. However for letters with small probabilities this prevents them from being chosen, not to mention the space for the array. Anyone know a simple yet effective solution for this?
Advertisement
Sum up all the occurrences to get the total amount of different letters, N.
Then calculate a random number n between 1 and N, and iterate over your table of occurrences:
If 1 <= n <= 1200, output a.
Otherwise if 1 <= n-1200 <= 500, output b.
Otherwise if 1 <= n-1200-500 <= 700, output c.
...

I think you get the idea.
While clb's proposed solution is the easiest, it runs in O(n) time, where n is the number of entries in your list. You can do it in O(log(n)) time by building a tree where each entry in the list ends up being a leaf and where each intermediate node has an associated weight that is the sum of the weights of all the leaves that hang from it. You can then start at the root with a random number between 0 and the sum of the weights. You look at the weights of the descendants of the root to see which of them you should follow. If you build an example, it's pretty obvious how to proceed from there.

If this is not clear enough, I'll write some code for you tomorrow.
clb's solution is workable but is arbitrarily slow. alvaro's solution is the right one, but I think there's a better way to understand it.

First of all, let's put the occurrences in probability form. I'm just going to make these up, and now english has only four letters. Bad cad a baba dab.

a: 2/10.
b: 4/10.
c: 1/10.
d: 3/10.

Now let's express it in a stranger way.

a: 2/10.
a or b: 6/10.
a or b or c: 7/10.
a or b or c or d: 10/10.

Think of this like the probabilities all being bars stacked up on top of each other. the "a" bar is 0.2 units high. The "b" unit is 0.4 units high, reaching up to 0.6 units above the ground (because it's on top of the "a" unit). The "c" unit reaches up to 0.7 units. The "d" unit is on top and reaches up to the 1.0 mark.

So let's pick a number between 0 and 1... say, 0.35423. What bar do you find at the point 0.35423 units above the ground? Survey says.... "b". So that's your pick. Any time you want to sample your random table, in other words, you simply pick a random number between 0 and 1, and see where it falls in your "cumulative probability distribution" (what that summing up is called). Computing the CPD is trivial to do with a for-loop. The lookup is usually done with a binary search.
Quote:Original post by alvaro
While clb's proposed solution is the easiest, it runs in O(n) time, where n is the number of entries in your list. You can do it in O(log(n)) time by building a tree where each entry in the list ends up being a leaf and where each intermediate node has an associated weight that is the sum of the weights of all the leaves that hang from it. You can then start at the root with a random number between 0 and the sum of the weights. You look at the weights of the descendants of the root to see which of them you should follow. If you build an example, it's pretty obvious how to proceed from there.

If this is not clear enough, I'll write some code for you tomorrow.


This is related to my interests. It sounds a lot like the algorithm for Huffman coding. Please do provide more information.

Sneftel:

I like where this is going, but I don't quite understand the bars stacking on top of each other. Also if your solution is like alvaro's, how do you suppose I store the tree so that I can binary search it? My understanding is that binary trees have at best log2(n) behavior which is similar to binary search, which part of this am I missing?
Quote:Original post by alvaro
While clb's proposed solution is the easiest, it runs in O(n) time, where n is the number of entries in your list. You can do it in O(log(n)) time by building a tree where each entry in the list ends up being a leaf and where each intermediate node has an associated weight that is the sum of the weights of all the leaves that hang from it. You can then start at the root with a random number between 0 and the sum of the weights. You look at the weights of the descendants of the root to see which of them you should follow. If you build an example, it's pretty obvious how to proceed from there.

If this is not clear enough, I'll write some code for you tomorrow.


With n = 26 (the numbers of letters in the alphabet), you're method is not an improvement over clb's solution. Your method also introduces a lot of complexity that is unnessary. For this problem, clb's method if fine both in complexity and performance.
I don't think my solution is that complicated. Here's some code:
#include <iostream>#include <cstdlib>#include <vector>#include <ctime>class WeightedRandomNumberGenerator {  // We use a vector to represent a binary tree, where 1 is the root  // and the children of i are 2*i and 2*i+1 (this is common when  // implementing heaps)  std::vector<double> weight_tree;  unsigned n;  public:  WeightedRandomNumberGenerator(std::vector<double> const &weights)    : weight_tree(2*weights.size()), n(weights.size()) {    // Copy the original weights at the leaves    for(unsigned i=0;i<n;++i)      weight_tree[i+n] = weights;        // Accumulate the weights through the tree    for(unsigned i=n-1; i>0; --i)      weight_tree = weight_tree[2*i]+weight_tree[2*i+1];  }    // This is where the magic happens  unsigned pick_random() {    double r = (double(std::rand())/RAND_MAX)*weight_tree[1];    unsigned i=1;    while(i<n){      if(r<weight_tree[2*i])        i=2*i;      else {        r-=weight_tree[2*i];        i=2*i+1;      }    }        return i-n;  }};// Example of code that uses the class aboveint main(){  std::srand(std::time(0));  char letters[26]={'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'};  unsigned weights[26]={1200,500,700,350,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,35};    WeightedRandomNumberGenerator wrng(std::vector<double>(weights, weights+26));    for(int i=0;i<20;++i)    std::cout << letters[wrng.pick_random()] << '\n';}

EDIT: You don't have to use doubles if your weights are integers. Modifying the code I provided for integer weights should be trivial.

EDIT: Added a few comments

[Edited by - alvaro on May 5, 2008 10:05:55 PM]
Quote:Original post by egwenejs
With n = 26 (the numbers of letters in the alphabet), you're method is not an improvement over clb's solution. Your method also introduces a lot of complexity that is unnessary. For this problem, clb's method if fine both in complexity and performance.

I didn't say that clb's solution wasn't appropriate in this case. The general problem of generating random numbers following a particular distribution is an interesting one, and I thought it would be instructive to show a solution that is fast for large values of n.

Quote:Original post by egwenejs
Quote:Original post by alvaro
While clb's proposed solution is the easiest, it runs in O(n) time, where n is the number of entries in your list. You can do it in O(log(n)) time by building a tree where each entry in the list ends up being a leaf and where each intermediate node has an associated weight that is the sum of the weights of all the leaves that hang from it. You can then start at the root with a random number between 0 and the sum of the weights. You look at the weights of the descendants of the root to see which of them you should follow. If you build an example, it's pretty obvious how to proceed from there.

If this is not clear enough, I'll write some code for you tomorrow.


With n = 26 (the numbers of letters in the alphabet), you're method is not an improvement over clb's solution. Your method also introduces a lot of complexity that is unnessary. For this problem, clb's method if fine both in complexity and performance.


Thank you for pointing that out. However, I did somewhat understate the problem. I do wish for it to be efficient and usefull in the general case because it will go in my library for other projects with n far greater than 26. So I do hope we can bypass the complexity issue in favor of efficiency.
Sneftel's algorithm is at least as fast as mine and uses less memory. The extra advantage of mine is that one can modify the probability of an element on the fly in O(log(n)) time (I didn't provide the code for this, but it's easy). We actually needed this to be fast in a Monte Carlo go program.

This topic is closed to new replies.

Advertisement