Sign in to follow this  
atonalpanic

weighted random numbers

Recommended Posts

Hello, I have searched Google for a couple hours looking for help in this issue, but they were highly mathematical. What I have is a table with letters and occurrences. For example: a 1200 b 500 c 700 d 350 ... z 35 that I obtained by analyzing a word list of the English language with specifically American words included. Now want I want to do is choose a random letter with their occurrences weighting them. Its for a text twist like game. The easiest solution I've come across is to scale down the distribution to 100 or maybe 1000 element array and place the letters in the array according to their distribution. For example: [a,a,a,b,c,c,d] and choosing a random element from that. However for letters with small probabilities this prevents them from being chosen, not to mention the space for the array. Anyone know a simple yet effective solution for this?

Share this post


Link to post
Share on other sites
Sum up all the occurrences to get the total amount of different letters, N.
Then calculate a random number n between 1 and N, and iterate over your table of occurrences:
If 1 <= n <= 1200, output a.
Otherwise if 1 <= n-1200 <= 500, output b.
Otherwise if 1 <= n-1200-500 <= 700, output c.
...

I think you get the idea.

Share this post


Link to post
Share on other sites
While clb's proposed solution is the easiest, it runs in O(n) time, where n is the number of entries in your list. You can do it in O(log(n)) time by building a tree where each entry in the list ends up being a leaf and where each intermediate node has an associated weight that is the sum of the weights of all the leaves that hang from it. You can then start at the root with a random number between 0 and the sum of the weights. You look at the weights of the descendants of the root to see which of them you should follow. If you build an example, it's pretty obvious how to proceed from there.

If this is not clear enough, I'll write some code for you tomorrow.

Share this post


Link to post
Share on other sites
clb's solution is workable but is arbitrarily slow. alvaro's solution is the right one, but I think there's a better way to understand it.

First of all, let's put the occurrences in probability form. I'm just going to make these up, and now english has only four letters. Bad cad a baba dab.

a: 2/10.
b: 4/10.
c: 1/10.
d: 3/10.

Now let's express it in a stranger way.

a: 2/10.
a or b: 6/10.
a or b or c: 7/10.
a or b or c or d: 10/10.

Think of this like the probabilities all being bars stacked up on top of each other. the "a" bar is 0.2 units high. The "b" unit is 0.4 units high, reaching up to 0.6 units above the ground (because it's on top of the "a" unit). The "c" unit reaches up to 0.7 units. The "d" unit is on top and reaches up to the 1.0 mark.

So let's pick a number between 0 and 1... say, 0.35423. What bar do you find at the point 0.35423 units above the ground? Survey says.... "b". So that's your pick. Any time you want to sample your random table, in other words, you simply pick a random number between 0 and 1, and see where it falls in your "cumulative probability distribution" (what that summing up is called). Computing the CPD is trivial to do with a for-loop. The lookup is usually done with a binary search.

Share this post


Link to post
Share on other sites
Quote:
Original post by alvaro
While clb's proposed solution is the easiest, it runs in O(n) time, where n is the number of entries in your list. You can do it in O(log(n)) time by building a tree where each entry in the list ends up being a leaf and where each intermediate node has an associated weight that is the sum of the weights of all the leaves that hang from it. You can then start at the root with a random number between 0 and the sum of the weights. You look at the weights of the descendants of the root to see which of them you should follow. If you build an example, it's pretty obvious how to proceed from there.

If this is not clear enough, I'll write some code for you tomorrow.


This is related to my interests. It sounds a lot like the algorithm for Huffman coding. Please do provide more information.

Sneftel:

I like where this is going, but I don't quite understand the bars stacking on top of each other. Also if your solution is like alvaro's, how do you suppose I store the tree so that I can binary search it? My understanding is that binary trees have at best log2(n) behavior which is similar to binary search, which part of this am I missing?

Share this post


Link to post
Share on other sites
Quote:
Original post by alvaro
While clb's proposed solution is the easiest, it runs in O(n) time, where n is the number of entries in your list. You can do it in O(log(n)) time by building a tree where each entry in the list ends up being a leaf and where each intermediate node has an associated weight that is the sum of the weights of all the leaves that hang from it. You can then start at the root with a random number between 0 and the sum of the weights. You look at the weights of the descendants of the root to see which of them you should follow. If you build an example, it's pretty obvious how to proceed from there.

If this is not clear enough, I'll write some code for you tomorrow.


With n = 26 (the numbers of letters in the alphabet), you're method is not an improvement over clb's solution. Your method also introduces a lot of complexity that is unnessary. For this problem, clb's method if fine both in complexity and performance.

Share this post


Link to post
Share on other sites
I don't think my solution is that complicated. Here's some code:
#include <iostream>
#include <cstdlib>
#include <vector>
#include <ctime>

class WeightedRandomNumberGenerator {
// We use a vector to represent a binary tree, where 1 is the root
// and the children of i are 2*i and 2*i+1 (this is common when
// implementing heaps)
std::vector<double> weight_tree;
unsigned n;

public:
WeightedRandomNumberGenerator(std::vector<double> const &weights)
: weight_tree(2*weights.size()), n(weights.size()) {
// Copy the original weights at the leaves
for(unsigned i=0;i<n;++i)
weight_tree[i+n] = weights[i];

// Accumulate the weights through the tree
for(unsigned i=n-1; i>0; --i)
weight_tree[i] = weight_tree[2*i]+weight_tree[2*i+1];
}

// This is where the magic happens
unsigned pick_random() {
double r = (double(std::rand())/RAND_MAX)*weight_tree[1];
unsigned i=1;
while(i<n){
if(r<weight_tree[2*i])
i=2*i;
else {
r-=weight_tree[2*i];
i=2*i+1;
}
}

return i-n;
}
};

// Example of code that uses the class above
int main(){
std::srand(std::time(0));
char letters[26]={'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'};
unsigned weights[26]={1200,500,700,350,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,35};

WeightedRandomNumberGenerator wrng(std::vector<double>(weights, weights+26));

for(int i=0;i<20;++i)
std::cout << letters[wrng.pick_random()] << '\n';
}





EDIT: You don't have to use doubles if your weights are integers. Modifying the code I provided for integer weights should be trivial.

EDIT: Added a few comments

[Edited by - alvaro on May 5, 2008 10:05:55 PM]

Share this post


Link to post
Share on other sites
Quote:
Original post by egwenejs
With n = 26 (the numbers of letters in the alphabet), you're method is not an improvement over clb's solution. Your method also introduces a lot of complexity that is unnessary. For this problem, clb's method if fine both in complexity and performance.

I didn't say that clb's solution wasn't appropriate in this case. The general problem of generating random numbers following a particular distribution is an interesting one, and I thought it would be instructive to show a solution that is fast for large values of n.

Share this post


Link to post
Share on other sites
Quote:
Original post by egwenejs
Quote:
Original post by alvaro
While clb's proposed solution is the easiest, it runs in O(n) time, where n is the number of entries in your list. You can do it in O(log(n)) time by building a tree where each entry in the list ends up being a leaf and where each intermediate node has an associated weight that is the sum of the weights of all the leaves that hang from it. You can then start at the root with a random number between 0 and the sum of the weights. You look at the weights of the descendants of the root to see which of them you should follow. If you build an example, it's pretty obvious how to proceed from there.

If this is not clear enough, I'll write some code for you tomorrow.


With n = 26 (the numbers of letters in the alphabet), you're method is not an improvement over clb's solution. Your method also introduces a lot of complexity that is unnessary. For this problem, clb's method if fine both in complexity and performance.


Thank you for pointing that out. However, I did somewhat understate the problem. I do wish for it to be efficient and usefull in the general case because it will go in my library for other projects with n far greater than 26. So I do hope we can bypass the complexity issue in favor of efficiency.

Share this post


Link to post
Share on other sites
Sneftel's algorithm is at least as fast as mine and uses less memory. The extra advantage of mine is that one can modify the probability of an element on the fly in O(log(n)) time (I didn't provide the code for this, but it's easy). We actually needed this to be fast in a Monte Carlo go program.

Share this post


Link to post
Share on other sites
Quote:
Original post by alvaro
I don't think my solution is that complicated. Here's some code:
*** Source Snippet Removed ***
EDIT: You don't have to use doubles if your weights are integers. Modifying the code I provided for integer weights should be trivial.

EDIT: Added a few comments


I appreciate the code, I'll look over it more thoroughly tomorrow. I forget but I think I've read that in a for loop var++ and ++var do the same thing, is this correct?

edit: Also, is there a way to force a templated argument to be of a numeric type?

Share this post


Link to post
Share on other sites
Quote:
Original post by atonalpanic
I like where this is going, but I don't quite understand the bars stacking on top of each other.

What don't you understand about it? Picture four bars, of different heights, all stacked up.
Quote:
Also if your solution is like alvaro's, how do you suppose I store the tree so that I can binary search it?
The CPD, by definition, is a sequence of numbers which are in order from smallest to biggest. You binary search it in the normal way.

BTW, I misunderstood cjb's approach on a first reading. His way is less efficient, but it is not "arbitrarily slow".

Share this post


Link to post
Share on other sites
Quote:
Original post by atonalpanic
I appreciate the code, I'll look over it more thoroughly tomorrow. I forget but I think I've read that in a for loop var++ and ++var do the same thing, is this correct?

For simple iterator types (integers or pointers) yes, they are about the same. However, if you have a non-trivial iterator type, var++ will make a copy of the old value of the iterator, while ++var won't. In general, I try to use ++var in loops since it never hurts and it sometimes helps.

Quote:
edit: Also, is there a way to force a templated argument to be of a numeric type?

If you try to instantiate a template that expects a numeric type using something else, you'll get an ugly compiler error about some operation not being defined. There are two things that might allow you to say you want it to be numeric explicitly:
1) Use boost's static asserts (I don't know enough about the subject to know if this is even possible)
2) Wait for the new version of C++, which should have something called "concepts" that will allow you to specify some requirements on template types.

Share this post


Link to post
Share on other sites
Quote:
Original post by atonalpanic
This is related to my interests. It sounds a lot like the algorithm for Huffman coding. Please do provide more information.

It is exactly like Huffman coding. Just as with Huffman coding, you want the most common elements to be at the top of the tree. The optimal tree in both cases is the same.

Share this post


Link to post
Share on other sites
Quote:
Original post by Sneftel
a: 2/10.
a or b: 6/10.
a or b or c: 7/10.
a or b or c or d: 10/10.


Quote:
Original post by Sneftel
BTW, I misunderstood clb's approach on a first reading. His way is less efficient, but it is not "arbitrarily slow".


Ours are exactly same, except I did a linear search, whereas you performed binary search. Difference, as we all know, is O(N) vs O(logN), and I think it would show in timings already with N as small as 26.

If you're doing a general purpose library, I suggest going alvaro's route, as you can modify frequencies in O(logN) time as well, instead of O(N) updates with the array approach.

Share this post


Link to post
Share on other sites
Hello again,

I noticed that when I run the code you posted, it always returns the same 20 letters each run of the program. I added the line: std::srand( time( NULL ) );
in the constructor to seen the rand() function but now all it does is spew back a bunch of letter a's. I was wondering how to make it so that every run of the program spits back new random letters.

Edit: I take that back now when I run the program it exhibits the behavior I wanted...yet I didn't change anything. Ohh well, there is still the smaller problem that every run of the program starts with the same letter. I should also say that the letter that always shows first is not a letter with a high probability of showing.

Share this post


Link to post
Share on other sites
You should call `srand(time(0))' only once at the beginning of the program (I usually do it at the top of main()).

EDIT: Actually, what I do (when I am not being lazy) is call time(0), print the result in the log and then call srand with that value. That way if I find a bug I can call srand with that value and reproduce the problem.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this