# How to randomly select an item, taking probability into account

## Recommended Posts

Say I have 5 items.
ITEM      PROBABILITY
A            50/100
B            25/100
C            10/100
D            10/100
E             5/100

What I'm doing right now is building an array of size 100 and filling it with 50 A's, 25 B's, 10 C's, 10 D's, and 5 E's. Then I generate a random number from 0 to 99 and pick the value from the array at that index. The most repeated element (the one with the highest probability) gets selected most often. Is there some other (less expensive) way of selecting an element based on its probability of being selected? Thank you in advance.

##### Share on other sites
I usually just repeatedly subtract percentages from a random number until it ends up less than 0. As in:

int r = rand() % 100;
int picked;
for (picked = 0; picked < num_items; picked++)
{
    r -= prob[picked];
    if (r < 0) break;
}

##### Share on other sites
Sure...in this case, generate a random integer between 1 and 100. If it's 1-50, choose A; if it's 51-75, choose B; and so forth.

This of course doesn't hold up under systems where your probabilities don't all add up to 100%, but it would work fine in the scenario you suggest.

Cheers,
Twilight Dragon

##### Share on other sites
I'd say this method is fast when used in a tight loop, but has the drawback of costing some memory.

I'd do something like this:

int a = rand() / ( RAND_MAX >> 8 );

if( a < 128 )
{
// 50% probability
}
else if( a < 192 )
{
// 25% probability
}
else
{
// 25% probability
}

##### Share on other sites
What about just picking a number from 0-99 and checking which range it falls in, with the ranges aligned to the probabilities? Then you have your answer.

Ex:
A : 0-49
B : 50-74
C : 75-84
D : 85-94
E : 95-99

Pseudo Code:
Var x = rand[0-99];
// x == 39

The selected value is then A.

You get the idea [wink]. Would that work?

##### Share on other sites
Just get a random number between 0 and 99, then set up an if/else:

if( random < 5 )
    // do E
else if( random < 15 )
    // do D
else if( random < 25 )
    // do C
else if( random < 50 )
    // do B
else
    // do A

##### Share on other sites
I'm sorry, the example I posted isn't completely accurate.

The list of probabilities I have isn't sorted, it changes from time to time and it's A LOT bigger than that.

PS: and the probabilities are floats (e.g. 1.7542)

##### Share on other sites
Then the method of pinacolada (the first answer) is what you need. You only need the list of objects and the probability of each one. Then walk your list, subtracting, until the running value drops below 0.

If your probabilities are floats, divide by RAND_MAX to get a float. The maximum value you get from rand() is RAND_MAX (guaranteed to be at least 32767, depending on the implementation), so your float is:
p = (float)rand() / (float)RAND_MAX;

That gives you a 0-1 value... multiply it by your total in order to get a number in the 0-100 range:
p = 100.0f * (float)rand() / (float)RAND_MAX;

Now, a variant may be to use an accumulated prob. So your list:
ITEM PROBABILITY
A 50/100
B 25/100
C 10/100
D 10/100
E 5/100

Becomes:
A 50/100
B 75/100
C 85/100
D 95/100
E 100/100

Then scan until the accumulated probability is greater than your value:
for(int i=0; i<maxID; i++)
    if(list[i]>p) break;

This may perform better, as the loop body does no arithmetic at all, just a single comparison per entry.

Now if you want faster... go for assembly.

Luck!
Guimo

##### Share on other sites
The way I'd go about solving this problem depends heavily on how often the probabilities change, how many there are, and what sort of pre-processing you can do beforehand.

If there aren't a lot of different categories, or if there are a few that make up the bulk of all possible choices (for example, if one of three choices will be picked 95% of the time), then the simple approach (like pinacolada's code) will be adequate (it's what I've used when faced with a similar problem). It would be best if you could sort your choices in this case, so the most probable ones are first in the chain.

Other approaches would be to precalculate a lookup table or a tree, but this will depend on exactly what you are doing. Can you give us some more information as to why the simple approach won't be adequate?

##### Share on other sites
I'm doing genetic programming. The population's individuals are tested for fitness and then a probability is applied to them based on their fitness.

The list can go from 1000-10000 entries and above.

The list isn't sorted by probability. And I feel that sorting such a list and then traversing it, as pinacolada suggests, would be slower than:

for each individual
{
    for 1 to probability
    {
        fill_probability_array()
    }
}
then rnum = random(probability_array_size)

##### Share on other sites
Have the computer do the grunt work. Just assign each outcome a number. If you really want to use percentages then have the computer calculate them during initialization; otherwise just calculate a cumulative total for each entry. Using your table as an example, start with {{'A',50},{'B',25},{'C',10},{'D',10},{'E',5}}, then during initialization fill out a third column so you have {{'A',50,50},{'B',25,75},{'C',10,85},{'D',10,95},{'E',5,100}}. Then you just use the same method the first reply gave. The numbers you assign can translate directly to percentages, as this example shows, but you are freed from them having to do so.

With 10k entries, performance may be a critical issue. Really the only change needed is to do a binary search rather than a sequential one. I believe a standard binary search returns the first entry greater than the value searched for when a match is not found, which is exactly what you want.

[Edited by - LilBudyWizer on September 7, 2005 7:24:00 PM]

##### Share on other sites
Quote:
Original post by owl
I'm doing genetic programming. The population's individuals are tested for fitness and then a probability is applied to them based on their fitness.

Thought it might be that. But when I was playing around with genetic programming, I was culling the bottom percent from the population with every generation, which required sorting based on fitness. Thus I already had sorting for free, and the simple algorithm (pretty close to pinacolada's code) worked quite fast.

Are you having problems with this aspect of the algorithm in terms of speed? I'd have thought that the other parts of the genetic algorithm would take the brunt of the CPU work.

##### Share on other sites
Quote:
Original post by Trapper Zoid
I was culling the bottom percent from the population with every generation, which required sorting based on fitness. Thus I already had sorting for free,

What does it mean to "cull the bottom percent"? How did that give you sorting for free? *blinks*

Quote:
Original post by Trapper Zoid
Are you having problems with this aspect of the algorithm in terms of speed? I'd have thought that the other parts of the genetic algorithm would take the brunt of the CPU work.

Well, I made a pretty robust library that encapsulates almost everything one needs to do genetic programming. I just finished it today and it is damn slow because it's completely unoptimized. The selection by probability is just one thing I want to fix. Allocation and deallocation of nodes (when spawning) is probably another thing (if not the main one) that makes it slow.

##### Share on other sites
Quote:
Original post by owl
Quote:
Original post by Trapper Zoid
I was culling the bottom percent from the population with every generation, which required sorting based on fitness. Thus I already had sorting for free,

What does it mean to "cull the bottom percent"? How did that give you sorting for free? *blinks*

Whoops, a bad choice of words there on my part. What I meant is that with my version of a genetic algorithm, after I calculated the fitness of the entire population I would only keep the top scoring values (I think it was about half). So I had to sort the population anyway. The sorting wasn't really "for free", it was just that since I was doing it anyway I might as well use it. Sorting really shouldn't be that expensive for a population of up to a few thousand or so.

Quote:
Well, I made a pretty robust library that encapsulates almost everything one needs to do genetic programming. I just finished it today and it is damn slow because it's completely unoptimized. The selection by probability is just one thing I want to fix. Allocation and deallocation of nodes (when spawning) is probably another thing (if not the main one) that makes it slow.

I'd run a profiler over your library to see where the bottlenecks are. While I'm not the most experienced optimiser, I usually find with my slow code it's usually the bits that deal with allocating and deallocating memory that are the real killer. It might be that the random selection part you have now is fine.

##### Share on other sites
Although I haven't read each post in depth, it seems to me that people are making a one-to-one correspondence between a random number and an object. That seems inefficient, as random number generation is a relatively expensive thing. You could make more than one decision per random number, like so:

// pick a group of objects based on probability p

random = random() // random number, 0-1 float

for(i = 0 to maxObjects)
{
    if(random < probability[i])
        choose object i;
}

// now each chosen object was chosen with probability p[object]

This takes only one random number and produces a list of objects instead of one object. Sometimes you get a large object list, which is undesirable, but in the long run it converges to each object getting picked p(obj)*ntrials times.

##### Share on other sites
Of course, the events are not independent. I don't know your statistical constraints; I've never done this type of programming before.

##### Share on other sites
If you don't have access to a profiler you can use QueryPerformanceCounter. It basically returns a tick count, and another function will tell you how long a tick is if you want to convert to seconds or whatever. You query once for the start time, again for the end time, calculate the difference and accumulate it. A count of the number of times the section of code executed is generally a good idea as well.

You just take a top-down approach. You've been working on a performance problem, so chances are you have some idea of a function below which the problem lies. So you time that function. You break that time down by the time spent in the function calls and loops that it performs. If it spent most of its time in one function then repeat the process on that function. Just be sure you started in the right place; that just requires checking the program's total time against the time for the section you are looking at.

A good general rule is that 90% of the time is spent in 10% of the code. Nothing you do to the other 90% of the code can make more than a 10% difference in the overall program. When you have a performance problem it's more like 1% of the code taking 99% of the time. It can be time consuming to figure out what to do about that 1%, but finding it doesn't take much time if you simply take a methodical approach.
