Hello all,
There is a following problem:
1)you get an array of say 1000 floats which represent probability distribution over 1000 possible items
2)every item is made of two different elements which are drawn (without returning) from a pool of say 50 elements
3)at any given point in a game you and your opponent have one of the items according to the distribution in 1)
Now the task is to calculate for every item how likely you are to hold them or in other words in how many ways it's possible to construct a pair of (our_item, opponent's_item). The answer isn't just a distribution in 1) because as the elements are drawn from a common pool, opponent's distribution influences ours.
To be a bit less general, it's the best to think about a card game. Any card game would do but let's stick to a traditional deck of 52 cards and hands made of two cards.
If for example our distribution of hands is: AsAh:0.5, KsKh:0.5 and our opponent's distribution of hands is: AsAd:1 our real probabilities are 0% for AsAh and 1 for KsKh (as AsAh is not possible for us to hold).
The algoritm to calculate it in linear time is this (it's based on inclusion/exclusion principle):
Elem* table = retrieve_proper_table(game_state);
float dupli[52] = {0};
float matchups[1000] = {0};
float sum = 0.0f;
for (int i = 0; i < 1000; i++){
dupli[table[i]->card1] += dist[i];
dupli[table[i]->card2] += dist[i];
sum += dist[i];
}
for (int i = 0; i < 1000; i++){
matchups[i] += sum + dist[i]; /*inclusion/exclusion principle*/
matchups[i] -= dupli[table[i]->card1] + dupli[table[i]->card2];
}
Where table is an array of structs like this:
typedef struct{
uint16 card1;
uint16 card2;
} Elem;
which represents elements (in this case cards) the item at n'th position is made from.
dist is an array of floats from 1)
This algorithm takes about 8 CPU cycles per item in an array to execute. This is because you first need to read all the elements from the distribution, add weights at corresponsing positions to an array of duplicates ("dupli") and iterate again this time subtracting corresponding elements.
This is a lot of writes and reads from memory. Even if it's all in L1 cache it's too slow. Especially painful are the reads from dupli array in 2nd iteration as they make vectorizing the loop impossible (GCC vectorizer says: "data not suitable for gather" and it's no surprise as we need to read 2 elements there)
The thing is in my case there are only 52 possible elements the items are made from. I am thinking that maybe it would be faster if we could keep all the 52 required floats in registers instead of an array. The problem is I don't know how to go about it. I can change the way Elem struct looks like and keep w/e info is needed there. is it possible to keep name of registers there? I imagine something like that should be possible.
The way it would work: we iterate over the distribution, then in corresponding array of Elem we have names of registers we want to keep the sum of given cards in, then we use the same registers when iterating 2nd time to subtract. That would spare a lot of reads from L1 in 2nd iteration as we wouldn't need to read from dupli array anymore (as it would be in registers) and only read from table array.
I realize this is quite a domain specific problem but maybe it's interesting for someone and maybe it would be applicable for different card games or some other problems. I've spent quite a lot of time experimenting/optimizing it and I am unable to beat the times I am getting (again, about 8 CPU cycles per item on i7 3770 compiled with GCC 4.8).