owl

How to randomly select an item, taking probability into account


Say I have 5 items.
ITEM      PROBABILITY
 A            50/100
 B            25/100
 C            10/100
 D            10/100
 E             5/100    
What I'm doing right now is building an array of size 100 and filling it with 50 A's, 25 B's, 10 C's, 10 D's and 5 E's. Then I generate a random number from 0 to 99 and pick the value from the array at that index. The most repeated element (the one with higher probability) is selected more often. Is there some other (less expensive) way of selecting an element based on its probability of being selected? Thank you in advance.
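For reference, a minimal sketch of the lookup-array approach described above (the function and container names are just illustrative):

#include <cstdlib>
#include <vector>

// Build the table once: repeat each item as many times as its weight.
std::vector<char> build_table(const std::vector<char>& items,
                              const std::vector<int>& weights)
{
    std::vector<char> table;
    for (size_t i = 0; i < items.size(); ++i)
        for (int j = 0; j < weights[i]; ++j)
            table.push_back(items[i]);
    return table;
}

// Pick: a uniform index into the table yields a weighted item.
char pick(const std::vector<char>& table)
{
    return table[rand() % table.size()];
}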

I usually just repeatedly subtract percentages from a random number, until it ends up less than 0. As in:


int r = rand() % 100;   // random value in [0, 99]
int picked;
for (picked = 0; picked < num_items; picked++)
{
    r -= prob[picked];
    if (r < 0)
        break;
}
// 'picked' is now the index of the chosen item

Sure...in this case, generate a random integer between 1 and 100. If it's 1-50, choose A; if it's 51-75, choose B; and so on.

This of course doesn't hold up when your probabilities don't all add up to 100%, but it works fine in the scenario you describe.

Cheers,
Twilight Dragon

I'd say the lookup-array method is fast when used in a tight loop, but it has the drawback of costing some memory.

I'd do something like this:

int a = rand() / ( RAND_MAX >> 8 );   // scales rand() to roughly 0-255

if( a < 128 )
{
    // ~50% probability
}
else if( a < 192 )
{
    // ~25% probability
}
else
{
    // remaining ~25% probability
}

What about just picking a number from 1 to 100 and choosing the item based on the range it falls in, with the ranges sized to match the probabilities?

Ex:
A : 0-50
B : 50-75
C : 75-85
D : 85-95
E : 95-100

Pseudo Code:
Var x = rand[0-100];
// x == 39

The selected value is then A.

Of course my number ranges aren't exactly right (the boundaries overlap), but you get the idea [wink]. Would that work?

Just get a random number between 0 and 99, then set up an if/else chain:

if( random < 5 )
    // do E
else if( random < 15 )
    // do D
else if( random < 25 )
    // do C
else if( random < 50 )
    // do B
else
    // do A

I'm sorry, the example I posted isn't completely accurate.

The list of probabilities I have isn't sorted, it changes from time to time, and it's A LOT bigger than that.

PS: the probabilities are floats (e.g. 1.7542).

Then pinacolada's method (the first answer) is what you need. You only need the list of objects and the probability of each one; subtract each probability from the random value as you walk the list, until it drops below 0.

If your probabilities are floats, generate a random float instead. The maximum value you get from rand() is RAND_MAX, so your float is:
p = (float)rand() / (float)RAND_MAX;

That gives you a value in 0-1... multiply it by your range to get a number in 0-100:
p = 100.0f * (float)rand() / (float)RAND_MAX;

Now, a variant may be to use accumulated probabilities. So your list:
ITEM PROBABILITY
A 50/100
B 25/100
C 10/100
D 10/100
E 5/100

Becomes:
A 50/100
B 75/100
C 85/100
D 95/100
E 100/100

Then walk the list until the accumulated probability exceeds your value:

int i;
for (i = 0; i < maxID; i++)
    if (list[i] > p) break;

i holds your index.
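Putting those pieces together, a minimal self-contained sketch of the accumulated-probability selection, assuming the cumulative values end at 100 (the array and function names are just for illustration):

#include <cstdlib>

// Illustrative cumulative table for the A/B/C/D/E example above.
float cumulative[] = { 50.0f, 75.0f, 85.0f, 95.0f, 100.0f };
const int maxID = 5;

int pick_index()
{
    float p = 100.0f * (float)rand() / ((float)RAND_MAX + 1.0f);  // [0, 100)

    int i;
    for (i = 0; i < maxID; i++)
        if (cumulative[i] > p)
            break;

    return i;   // index of the selected item
}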

This may perform better, as the only work per step is a comparison (which maps to a compare and a JG in assembly).

Now if you want faster... go for assembly.

Luck!
Guimo



The way I'd go about solving this problem depends heavily on how often the probabilities change, how many there are, and what sort of pre-processing you can do beforehand.

If there aren't a lot of different categories, or if a few of them make up the bulk of all possible choices (for example, if one of three choices will be picked 95% of the time), then the simple approach (like pinacolada's code) will be adequate (and that's what I've used when faced with a similar problem). It's best if you can sort your choices in that case, so the most probable ones come first in the chain.

Other approaches would be to precalculate a lookup table or a tree, but this depends on exactly what you are doing. Can you give us some more information as to why the simple approach won't be adequate?

I'm doing genetic programming. The individuals in the population are tested for fitness and then a probability is applied to each based on its fitness.

The list can go from 1000 to 10000 entries and above.

The list isn't sorted by probability, and I feel that sorting such a list and then traversing it, as pinacolada suggests, would be slower than doing:


for each individual
{
    for 1 to probability
    {
        fill_probability_array()
    }
}


then

rnum = random(probability_array_size)

Have the computer do the grunt work. Just assign each outcome a number; if you really want to use percentages, have the computer calculate them during initialization, but otherwise just calculate a cumulative total for each entry. Using your table as an example, start with {{'A',50,},{'B',25,},{'C',10,},{'D',10,},{'E',5,}}, then during initialization fill out the last column so you have {{'A',50,50},{'B',25,75},{'C',10,85},{'D',10,95},{'E',5,100}}. Then you just use the same method the first reply gave. The numbers you assign can translate directly to percentages, as this example shows, but they don't have to.

With 10k entries performance may be a critical issue. Really the only change needed is to do a binary search rather than a sequential one. I believe a standard binary search returns the first entry greater than the value searched for when a match is not found, which is what you want.
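For illustration, a minimal sketch of that cumulative-table-plus-binary-search idea (the Entry struct and function names are assumptions, not the poster's actual code):

#include <cstdlib>
#include <vector>

struct Entry { char item; float weight; float cumulative; };

// Fill in the cumulative column once, during initialization.
void build_cumulative(std::vector<Entry>& table)
{
    float total = 0.0f;
    for (size_t i = 0; i < table.size(); ++i)
    {
        total += table[i].weight;
        table[i].cumulative = total;
    }
}

// Pick an entry: draw a value in [0, total) and binary-search for the
// first cumulative value greater than it.
const Entry& pick(const std::vector<Entry>& table)
{
    float total = table.back().cumulative;
    float p = total * (float)rand() / ((float)RAND_MAX + 1.0f);

    size_t lo = 0, hi = table.size() - 1;
    while (lo < hi)
    {
        size_t mid = (lo + hi) / 2;
        if (table[mid].cumulative > p)
            hi = mid;
        else
            lo = mid + 1;
    }
    return table[lo];
}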

[Edited by - LilBudyWizer on September 7, 2005 7:24:00 PM]

Quote:
Original post by owl
I'm doing genetic programming. The individuals in the population are tested for fitness and then a probability is applied to each based on its fitness.


Thought it might be that. But when I was playing around with genetic programming, I was culling the bottom percent from the population with every generation, which required sorting based on fitness. Thus I already had sorting for free, and the simple algorithm (pretty close to pinacolada's code) worked quite fast.

Are you having problems with this aspect of the algorithm in terms of speed? I'd have thought that the other parts of the genetic algorithm would take the brunt of the CPU work.

Quote:
Original post by Trapper Zoid
I was culling the bottom percent from the population with every generation, which required sorting based on fitness. Thus I already had sorting for free,


What does it mean to "cull the bottom percent"? How did that give you sorting for free? *blinks*

Quote:
Original post by Trapper Zoid
Are you having problems with this aspect of the algorithm in terms of speed? I'd have thought that the other parts of the genetic algorithm would take the brunt of the CPU work.


Well, I made a pretty robust library that encapsulates almost everything one needs to do genetic programming. I just finished it today and it is damn slow because it's completely unoptimized. The selection by probability is just one thing I want to fix. Allocation and deallocation of nodes (when spawning) is probably another thing (if not the main one) that makes it slow.

Quote:
Original post by owl
Quote:
Original post by Trapper Zoid
I was culling the bottom percent from the population with every generation, which required sorting based on fitness. Thus I already had sorting for free,


What does it mean to "cull the bottom percent"? How did that give you sorting for free? *blinks*


Whoops, a bad choice of words there on my part. What I meant is that in my version of the genetic algorithm, after I calculated the fitness of the entire population I would only keep the top-scoring values (I think it was about half). So I had to sort the population anyway. The sorting wasn't really "for free"; it was just that since I was doing it anyway I might as well use it. Sorting really shouldn't be that expensive for a population of up to a few thousand or so.
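Roughly, that culling step looks like this (just a sketch; the Individual struct here is a stand-in, not my real code):

#include <algorithm>
#include <vector>

struct Individual { float fitness; /* genome, etc. */ };

bool fitter(const Individual& a, const Individual& b)
{
    return a.fitness > b.fitness;   // best first
}

void cull(std::vector<Individual>& population)
{
    std::sort(population.begin(), population.end(), fitter);
    population.resize(population.size() / 2);   // keep only the top half
}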


Quote:

Well, I made a pretty robust library that encapsulates almost everything one needs to do genetic programming. I just finished it today and it is damn slow because it's completely unoptimized. The selection by probability is just one thing I want to fix. Allocation and deallocation of nodes (when spawning) is probably another thing (if not the main one) that makes it slow.


I'd run a profiler over your library to see where the bottlenecks are. While I'm not the most experienced optimiser, I find with my slow code it's usually the bits that deal with allocating and deallocating memory that are the real killer. It might be that the random selection part you have now is fine.

Although I haven't read each post in depth, it seems to me that people are making a one-to-one correspondence between a random number and an object. That seems inefficient, since random number generation is relatively expensive. You could make more than one decision per random number, like so:

// pick a group of objects based on probability p

random = random()   // random float in [0, 1)

for (i = 0 to maxObjects)
{
    if (random < probability[i])
        choose object i;
}

// now each chosen object is chosen with probability p[object]

for (each chosen object)
    do your thing...

This takes only one random number and produces a list of objects instead of a single one. Sometimes the list will be large, which may be undesirable, but in the long run each object will be picked about p(obj) * ntrials times.

Of course, the events are not independent; I don't know your statistical constraints, as I've never done this type of programming before.

If you don't have access to a profiler you can use QueryPerformanceCounter. It basically returns a tick count, and another function (QueryPerformanceFrequency) will tell you how long a tick is if you want to convert to seconds or whatever. You query once for the start time, again for the end time, calculate the difference and accumulate it. Keeping a count of the number of times the section of code executed is generally a good idea too.
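A minimal sketch of that kind of manual timing (Windows-only; the accumulator names are just illustrative):

#include <windows.h>
#include <cstdio>

LONGLONG g_totalTicks = 0;   // accumulated ticks for the measured section
long     g_callCount  = 0;   // how many times the section ran

void measured_section()
{
    LARGE_INTEGER start, end;
    QueryPerformanceCounter(&start);

    // ... the code being measured goes here ...

    QueryPerformanceCounter(&end);
    g_totalTicks += end.QuadPart - start.QuadPart;
    ++g_callCount;
}

void report()
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);   // ticks per second
    printf("%ld calls, %f seconds total\n",
           g_callCount, (double)g_totalTicks / (double)freq.QuadPart);
}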

You just take a top-down approach. You've been working on a performance problem, so chances are you have some idea of a function below which the problem lies. So you time that function, then break that time down by the time spent in the function calls and loops it performs. If it spends most of its time in one function, repeat the process on that function. Just be sure you started in the right place; that just requires checking the program's total time against the time for the section you are looking at.

A good general rule is that 90% of the time is spent in 10% of the code. Nothing you do to the other 90% of the code can make more than a 10% difference in the overall program. When you have a performance problem it's more like 1% of the code taking 99% of the time. It can be time consuming to figure out what to do about that 1%, but finding it doesn't take much time if you take a methodical approach.

