# Is it possible to optimize this? (keeping a small array in registers)


## Recommended Posts

Hello all,

There is a following problem:

1) you get an array of, say, 1000 floats which represents a probability distribution over 1000 possible items

2) every item is made of two different elements which are drawn (without replacement) from a pool of, say, 50 elements

3) at any given point in a game you and your opponent each hold one of the items, according to the distribution in 1)

Now the task is to calculate, for every item, how likely you are to hold it, or in other words, in how many ways it's possible to construct a pair (our_item, opponent's_item). The answer isn't just the distribution in 1): because the elements are drawn from a common pool, the opponent's distribution influences ours.

To be a bit less general, it's best to think of a card game. Any card game would do, but let's stick to a traditional deck of 52 cards and hands made of two cards.

If, for example, our distribution of hands is AsAh:0.5, KsKh:0.5 and our opponent's distribution is AsAd:1, our real probabilities are 0 for AsAh and 1 for KsKh (as AsAh is impossible for us to hold).

The algorithm to calculate it in linear time is this (it's based on the inclusion/exclusion principle):

```c
Elem* table = retrieve_proper_table(game_state);
float dupli[52] = {0};
float matchups[1000] = {0};
float sum = 0.0f;

for (int i = 0; i < 1000; i++) {
    dupli[table[i].card1] += dist[i];
    dupli[table[i].card2] += dist[i];
    sum += dist[i];
}
for (int i = 0; i < 1000; i++) {
    matchups[i] += sum + dist[i]; /* inclusion/exclusion principle */
    matchups[i] -= dupli[table[i].card1] + dupli[table[i].card2];
}
```

Where table is an array of structs like this:

```c
typedef struct {
    uint16_t card1;  /* from <stdint.h> */
    uint16_t card2;
} Elem;
```

which holds the elements (in this case cards) the n-th item is made from.

dist is the array of floats from 1).

This algorithm takes about 8 CPU cycles per item to execute. This is because you first need to read all the elements from the distribution, adding the weights at the corresponding positions in an array of duplicates ("dupli"), and then iterate again, this time subtracting the corresponding elements.

This is a lot of writes and reads to memory. Even if it's all in L1 cache, it's too slow. Especially painful are the reads from the dupli array in the 2nd iteration, as they make vectorizing the loop impossible (GCC's vectorizer says "data not suitable for gather", and that's no surprise, as we need to read 2 elements per item there).

The thing is, in my case there are only 52 possible elements the items are made from. I am thinking that maybe it would be faster if we could keep all 52 required floats in registers instead of an array. The problem is I don't know how to go about it. I can change the way the Elem struct looks and keep whatever info is needed there. Is it possible to keep register names there? I imagine something like that should be possible.

The way it would work: we iterate over the distribution, and in the corresponding array of Elem we have the names of the registers we want to keep the sums for the given cards in; then we use the same registers when iterating the 2nd time to subtract. That would spare a lot of reads from L1 in the 2nd iteration, as we wouldn't need to read from the dupli array anymore (it would be in registers) and would only read from the table array.

I realize this is quite a domain-specific problem, but maybe it's interesting for someone, and maybe it would be applicable to different card games or some other problems. I've spent quite a lot of time experimenting/optimizing and I am unable to beat the times I am getting (again, about 8 CPU cycles per item on an i7 3770, compiled with GCC 4.8).

Edited by OneTrickPony82

##### Share on other sites
AFAIK, there's no such thing as an array of registers, other than memory. If that kind of computation really is at the core of what you are doing and you need to do it faster, you may want to consider using a GPU.

##### Share on other sites
An array of 1000 items doesn't sound very useful. Maybe if you add the next step too, and just compute the items you really need?

##### Share on other sites

>>AFAIK, there's no such thing as an array of registers, other than memory

Yes, I know, but here I have 52 elements only. If it were 2 elements, that would be easy:

```c
if (card1 == 0)
    write_to_register1;
else
    write_to_register2;
```

Unfortunately, putting a 52-branch if-else chain in there isn't too appealing. If it were somehow possible to keep the register's name in memory, though, then it would be doable.

>>you may want to consider using a GPU.

The thing is that those arrays are relatively short (1100 elements or less) and I am skeptical about a GPU being faster on this than the CPU (because the memory copying alone would already be quite costly relative to the total execution time).

It is really a bottleneck; going down to, say, 6 cycles per item would make everything way better.

>>An array of 1000 items doesn't sound very useful. Maybe if you add the next step too, and just compute the items you really need?

This is a probability distribution and the results are needed for every item. I REALLY need all of them. This is not part of the problem; the code is part of an algorithm which calculates results for all possible items. "The items I really need" doesn't make sense in this context.

Edited by OneTrickPony82

##### Share on other sites

>>The thing is that those arrays are relatively short (1100 elements or less) and I am skeptical about GPU being faster on this than CPU is (because memory copying alone would already be quite costly in relation to total execution time). It is really a bottleneck, going down to say 6 cycles per item would make everything way better.

GPU would be faster than CPU for that size of array, but not if you have to ship the data over and back. Using the GPU only makes sense if you can execute enough of your program on the GPU that you only need to ship limited amounts of data between CPU and GPU.

What is it your program is doing, anyway?

##### Share on other sites
How would you implement the write_to_register that you mention?

It's irrelevant anyway. If there is a bottleneck, big if, it will be in getting the items from memory into cache.

##### Share on other sites

>>What is it your program is doing, anyway?

Solving a popular card game.

>>GPU would be faster than CPU for that size of array, but not if you have to ship the data over and back.

The problem is that the algorithm I am using (and all the fast known algorithms in my domain) is sequential in nature, so the CPU is better at the other parts. I would need to copy back and forth here.

>>It's irrelevant any way. If there is a bottle neck, big if, it will be in getting the items from memory into cache.

This is incorrect. The 2nd line of the 2nd loop takes 3.3 cycles out of the 8 the whole thing takes.

Reading into cache is not a bottleneck here. The problem is that in those loops you need to write to (in the 1st half) and read from (in the 2nd half) a short local array. You need to write/read 2 elements at a time, which makes it not vectorizable. This array is in L1, but the reads from L1 are still a bottleneck.

There are enough registers in the CPU to hold 52 floats. It kinda sorta looks like it should be possible to pull off, but I am not good enough at assembly.

Edited by OneTrickPony82

##### Share on other sites

>>How slow is too slow? 8 thousand cycles (is that right?) is small enough to be measured in microseconds. How fast does it need to be?

The faster the better. This function and one very similar to it (more complicated but with a similar structure) are the bottleneck.

The algorithm works on a huge decision tree and calls those functions hundreds of thousands, or sometimes millions, of times per iteration.

>>It's often good practice to perform all reads from memory at the top of a block, do compute in the middle, then perform writes to memory at the bottom. e.g. you can be sure that this block won't constrain the optimizer:

Thanks, this is a useful suggestion, but in this case it doesn't change anything.

When I measure only parts of the function, I get that the first loop takes 4.4 cycles per item and the 2nd part 3.5 cycles per item. That makes sense, I guess, seeing how there are 3 loads and 2 writes in the 1st loop while the 2nd one has 3 loads and 1 write.

The thing is that in this function we are first adding numbers to 52 values in memory just to keep loading them again in the 2nd part.

My idea was that there is enough space in the SIMD registers to keep 52 floats, so that maybe something like this would work: retrieve the values of the registers corresponding to card1/card2, add dist there, and store them again.

Then, when iterating the 2nd time, we have those in registers as well, and maybe retrieving them and doing operations on them would be faster than reading from cache. I realize you can't just add, say, the 3rd float of the 6th SIMD register directly, but you can first extract it, then add, then store it again (I think?).

Anyway, thanks very much for your help. I've been looking at this function for a year now and had basically lost all hope already, but sometimes I wake up with an idea and I can't sleep until someone debunks it :)
