Back to General and Gameplay Programming

Is it possible to optimize this? (keeping a small array in registers)

General and Gameplay Programming Programming

Started by OneTrickPony82 September 20, 2015 08:59 AM

33 comments, last by iMalc 8 years, 6 months ago

OneTrickPony82

106

Author

September 20, 2015 08:59 AM

Hello all,

There is a following problem:

1)you get an array of say 1000 floats which represent probability distribution over 1000 possible items

2)every item is made of two different elements which are drawn (without returning) from a pool of say 50 elements

3)at any given point in a game you and your opponent have one of the items according to the distribution in 1)

Now the task is to calculate for every item how likely you are to hold them or in other words in how many ways it's possible to construct a pair of (our_item, opponent's_item). The answer isn't just a distribution in 1) because as the elements are drawn from a common pool, opponent's distribution influences ours.

To be a bit less general, it's the best to think about a card game. Any card game would do but let's stick to a traditional deck of 52 cards and hands made of two cards.

If for example our distribution of hands is: AsAh:0.5, KsKh:0.5 and our opponent's distribution of hands is: AsAd:1 our real probabilities are 0% for AsAh and 1 for KsKh (as AsAh is not possible for us to hold).

The algoritm to calculate it in linear time is this (it's based on inclusion/exclusion principle):


Elem* table = retrieve_proper_table(game_state);
float dupli[52] = {0};
float matchups[1000] = {0};
float sum = 0.0f;

for (int i = 0; i < 1000; i++){
    dupli[table[i]->card1] += dist[i];
    dupli[table[i]->card2] += dist[i];
    sum += dist[i];
}
for (int i = 0; i < 1000; i++){
    matchups[i] += sum + dist[i]; /*inclusion/exclusion principle*/
    matchups[i] -= dupli[table[i]->card1] + dupli[table[i]->card2];
}

Where table is an array of structs like this:


typedef struct{
    uint16 card1;
    uint16 card2;
} Elem;

which represents elements (in this case cards) the item at n'th position is made from.

dist is an array of floats from 1)

This algorithm takes about 8 CPU cycles per item in an array to execute. This is because you first need to read all the elements from the distribution, add weights at corresponsing positions to an array of duplicates ("dupli") and iterate again this time subtracting corresponding elements.

This is a lot of writes and reads from memory. Even if it's all in L1 cache it's too slow. Especially painful are the reads from dupli array in 2nd iteration as they make vectorizing the loop impossible (GCC vectorizer says: "data not suitable for gather" and it's no surprise as we need to read 2 elements there)

The thing is in my case there are only 52 possible elements the items are made from. I am thinking that maybe it would be faster if we could keep all the 52 required floats in registers instead of an array. The problem is I don't know how to go about it. I can change the way Elem struct looks like and keep w/e info is needed there. is it possible to keep name of registers there? I imagine something like that should be possible.

The way it would work: we iterate over the distribution, then in corresponding array of Elem we have names of registers we want to keep the sum of given cards in, then we use the same registers when iterating 2nd time to subtract. That would spare a lot of reads from L1 in 2nd iteration as we wouldn't need to read from dupli array anymore (as it would be in registers) and only read from table array.

I realize this is quite a domain specific problem but maybe it's interesting for someone and maybe it would be applicable for different card games or some other problems. I've spent quite a lot of time experimenting/optimizing it and I am unable to beat the times I am getting (again, about 8 CPU cycles per item on i7 3770 compiled with GCC 4.8).

alvaro

21,604

September 20, 2015 10:48 AM

AFAIK, there's no such thing as an array of registers, other than memory. If that kind of computation really is at the core of what you are doing and you need to do it faster, you may want to consider using a GPU.

Alberth

10,209

September 20, 2015 10:56 AM

An array of 1000 items doesn't sound very useful. Maybe if you add the next step too, and just compute the items you really need?

OneTrickPony82

106

Author

September 20, 2015 11:00 AM

>>AFAIK, there's no such thing as an array of registers, other than memory

Yes I know, but here I have 52 elements only. If it was 2 elements, that would be easy:

if (card1 == 0)

write_to_register1;

else

write_to_register2;

Unfortunately putting 52 elements long if-else statement in there isn't too appealing. If somehow it was possible to keep name of the register in memory though then it would make it possible.

>>you may want to consider using a GPU.

The thing is that those arrays are relatively short (1100 elements or less) and I am skeptical about GPU being faster on this than CPU is (because memory copying alone would already by quite costly in relation to total execution time).

It is really a bottleneck, going down to say 6 cycles per item would make everything way better.

>>An array of 1000 items doesn't sound very useful. Maybe if you add the next step too, and just compute the items you really need?

This is a probability distribution and the results are needed for every item. I REALLY need all of them. This is not part of the problem, the code is a part of an algorithm which calculates results for all possible items. The "items I really need" doesn't make sense in the context.

alvaro

21,604

September 20, 2015 11:19 AM

The thing is that those arrays are relatively short (1100 elements or less) and I am skeptical about GPU being faster on this than CPU is (because memory copying alone would already by quite costly in relation to total execution time).
It is really a bottleneck, going down to say 6 cycles per item would make everything way better.

GPU would be faster than CPU for that size of array, but not if you have to ship the data over and back. Using the GPU only makes sense if you can execute enough of your program on the GPU that you only need to ship limited amounts of data between CPU and GPU.

What is it your program is doing, anyway?

Aardvajk

13,217

September 20, 2015 12:51 PM

How would you implement the write_to_register that you mention?

It's irrelevant any way. If there is a bottle neck, big if, it will be in getting the items from memory into cache.

journal | website | twitter | youtube

OneTrickPony82

106

Author

September 20, 2015 01:47 PM

>>What is it your program is doing, anyway?

Solving a popular card game.

>>GPU would be faster than CPU for that size of array, but not if you have to ship the data over and back.

The problem is the algorithm I am using (and all the fast known algorithms in my domain) are sequential in nature so CPU is better at other parts. I would need to copy back and forth here.

>>It's irrelevant any way. If there is a bottle neck, big if, it will be in getting the items from memory into cache.

This is incorrect. The 2nd line from 2nd loop takes 3.3 cycles out of 8 the whole thing takes.

Reading to cache is not a bottleneck here. The problem is that in those loops you need to write (in 1st half) and read (2nd half) from a short local array. You need to write/read 2 elements at the time which makes it not vectorizable. This array is in L1 but the reads from L1 are still a bottle neck.

There are enough registers in the CPU to hold 52 floating points. It kinda sorta looks it should be possible to pull off but I am not good enough at assembly

Hodgman

52,717

September 20, 2015 03:03 PM

This algorithm takes about 8 CPU cycles per item in an array to execute.
This is a lot of writes and reads from memory. Even if it's all in L1 cache it's too slow.

How slow is too slow? 8 thousand cycles (is that right?) is small enough to be measured in microseconds. How fast does it need to be?

Aliasing rules might be tripping you up. Duplicated code blocks won't necessarily be merged when doing so would break the language rules. e.g. in this block, dist is read 3 times, but the compiler might actually have to emit 3 load instructions instead of merging those loads into 1 shared one.


for (int i = 0; i < 1000; i++){
  dupli[table[i]->card1] += dist[i]; //fetch dist[i] as float, then write to a float somewhere in memory
                                     //Maybe dist[i] was modified by the above line, better read it again to be safe
  dupli[table[i]->card2] += dist[i]; //fetch dist[i] as float, then write to a float somewhere in memory
                                     //Maybe dist[i] was modified by the above line, better read it again to be safe
  sum += dist[i];                    //fetch dist[i] as float
}

It's often good practice to perform all reads from memory at the top of a block, do compute in the middle, then perform writes to memory at the bottom. e.g. you can be sure that this block won't constrain the optimizer:


for (int i = 0; i < 1000; i++){
  Elem* e = table + i;
  int card1 = e->card1;//read uint16 from from pointer
  int card2 = e->card2;//read uint16 from from pointer
  float dist_i = dist[i];   //read float from from pointer
  float dupli_1 = dupli[card1];//read float from from pointer
  float dupli_2 = dupli[card2];//read float from from pointer
  dupli[card1] = dupli_1 + dist_i; //write float to pointer
  dupli[card2] = dupli_2 + dist_i; //write float to pointer
  sum += dist_i;//write to local (won't alias)
}

. 22 Racing Series .

OneTrickPony82

106

Author

September 20, 2015 05:43 PM

>>How slow is too slow? 8 thousand cycles (is that right?) is small enough to be measured in microseconds. How fast does it need to be?

The faster the better. This function and one very similar one (more complicated but with similar structure) are the bottleneck.

The algorithm works on huge decision tree and calls those functions hundred thousands or sometimes millions of times per iteration.

>>It's often good practice to perform all reads from memory at the top of a block, do compute in the middle, then perform writes to memory at the bottom. e.g. you can be sure that this block won't constrain the optimizer:

Thanks, this is a useful suggestion but in this case it doesn't change anything.

When I am measuring only parts of the function I get that the first loop takes 4.4 cycles per item and 2nd part 3.5 cycles per item. That makes sense I guess seeing how there are 3 loads and 2 writes 1st loop while 2nd one has 3 loads and 1 write.

The thing is that in this function we are first adding numbers to 52 values in memory just to keep loading them in 2nd part.

My idea was that there is enough space in SIMD registers to keep 52 floats, so that maybe something like:

1)read card1, card2

2)load a dist

3)retrieve value of registers corresponding to card1/card2 and add dist there, store again

and then when iterating again we have those in registers as well and maybe retrieving them and doing opeartions on them would be faster than reading from cache. I realize you can't just add say 3rd float from 6th SIMD register but you can first extract it, then add, then store again (I think?).

Anyway thanks much for your help. I am looking at this function for a year now and basically lost all hope already but sometimes I wake up with an idea and I can't sleep until someone debunks it :)