# Is it possible to optimize this? (keeping a small array in registers)

This topic is 878 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Hello all,

There is a following problem:

1)you get an array of say 1000 floats which represent probability distribution over 1000 possible items

2)every item is made of two different elements which are drawn (without returning) from a pool of say 50 elements

3)at any given point in a game you and your opponent have one of the items according to the distribution in 1)

Now the task is to calculate for every item how likely you are to hold them or in other words in how many ways it's possible to construct a pair of (our_item, opponent's_item). The answer isn't just a distribution in 1) because as the elements are drawn from a common pool, opponent's distribution influences ours.

To be a bit less general, it's the best to think about a card game. Any card game would do but let's stick to a traditional deck of 52 cards and hands made of two cards.

If for example our distribution of hands is: AsAh:0.5, KsKh:0.5 and our opponent's distribution of hands is: AsAd:1 our real probabilities are 0% for AsAh and 1 for KsKh (as AsAh is not possible for us to hold).

The algoritm to calculate it in linear time is this (it's based on inclusion/exclusion principle):

Elem* table = retrieve_proper_table(game_state);
float dupli[52] = {0};
float matchups[1000] = {0};
float sum = 0.0f;

for (int i = 0; i < 1000; i++){
dupli[table[i]->card1] += dist[i];
dupli[table[i]->card2] += dist[i];
sum += dist[i];
}
for (int i = 0; i < 1000; i++){
matchups[i] += sum + dist[i]; /*inclusion/exclusion principle*/
matchups[i] -= dupli[table[i]->card1] + dupli[table[i]->card2];
}

Where table is an array of structs like this:

typedef struct{
uint16 card1;
uint16 card2;
} Elem;

which represents elements (in this case cards) the item at n'th position is made from.

dist is an array of floats from 1)

This algorithm takes about 8 CPU cycles per item in an array to execute. This is because you first need to read all the elements from the distribution, add weights at corresponsing positions to an array of duplicates ("dupli") and iterate again this time subtracting corresponding elements.

This is a lot of writes and reads from memory. Even if it's all in L1 cache it's too slow. Especially painful are the reads from dupli array in 2nd iteration as they make vectorizing the loop impossible (GCC vectorizer says: "data not suitable for gather" and it's no surprise as we need to read 2 elements there)

The thing is in my case there are only 52 possible elements the items are made from. I am thinking that maybe it would be faster if we could keep all the 52 required floats in registers instead of an array. The problem is I don't know how to go about it. I can change the way Elem struct looks like and keep w/e info is needed there. is it possible to keep name of registers there? I imagine something like that should be possible.

The way it would work: we iterate over the distribution, then in corresponding array of Elem we have names of registers we want to keep the sum of given cards in, then we use the same registers when iterating 2nd time to subtract. That would spare a lot of reads from L1 in 2nd iteration as we wouldn't need to read from dupli array anymore (as it would be in registers) and only read from table array.

I realize this is quite a domain specific problem but maybe it's interesting for someone and maybe it would be applicable for different card games or some other problems. I've spent quite a lot of time experimenting/optimizing it and I am unable to beat the times I am getting (again, about 8 CPU cycles per item on i7 3770 compiled with GCC 4.8).

Edited by OneTrickPony82

##### Share on other sites
AFAIK, there's no such thing as an array of registers, other than memory. If that kind of computation really is at the core of what you are doing and you need to do it faster, you may want to consider using a GPU.

##### Share on other sites
An array of 1000 items doesn't sound very useful. Maybe if you add the next step too, and just compute the items you really need?

##### Share on other sites

>>AFAIK, there's no such thing as an array of registers, other than memory

Yes I know, but here I have 52 elements only. If it was 2 elements, that would be easy:

if (card1 == 0)

write_to_register1;

else

write_to_register2;

Unfortunately putting 52 elements long if-else statement in there isn't too appealing. If somehow it was possible to keep name of the register in memory though then it would make it possible.

>>you may want to consider using a GPU.

The thing is that those arrays are relatively short (1100 elements or less) and I am skeptical about GPU being faster on this than CPU is (because memory copying alone would already by quite costly in relation to total execution time).

It is really a bottleneck, going down to say 6 cycles per item would make everything way better.

>>An array of 1000 items doesn't sound very useful. Maybe if you add the next step too, and just compute the items you really need?

This is a probability distribution and the results are needed for every item. I REALLY need all of them. This is not part of the problem, the code is a part of an algorithm which calculates results for all possible items. The "items I really need" doesn't make sense in the context.

Edited by OneTrickPony82

##### Share on other sites

The thing is that those arrays are relatively short (1100 elements or less) and I am skeptical about GPU being faster on this than CPU is (because memory copying alone would already by quite costly in relation to total execution time).
It is really a bottleneck, going down to say 6 cycles per item would make everything way better.

GPU would be faster than CPU for that size of array, but not if you have to ship the data over and back. Using the GPU only makes sense if you can execute enough of your program on the GPU that you only need to ship limited amounts of data between CPU and GPU.

What is it your program is doing, anyway?

##### Share on other sites
How would you implement the write_to_register that you mention?

It's irrelevant any way. If there is a bottle neck, big if, it will be in getting the items from memory into cache.

##### Share on other sites

>>What is it your program is doing, anyway?

Solving a popular card game.

>>GPU would be faster than CPU for that size of array, but not if you have to ship the data over and back.

The problem is the algorithm I am using (and all the fast known algorithms in my domain) are sequential in nature so CPU is better at other parts. I would need to copy back and forth here.

>>It's irrelevant any way. If there is a bottle neck, big if, it will be in getting the items from memory into cache.

This is incorrect. The 2nd line from 2nd loop takes 3.3 cycles out of 8 the whole thing takes.

Reading to cache is not a bottleneck here. The problem is that in those loops  you need to write (in 1st half) and read (2nd half) from a short local array. You need to write/read 2 elements at the time which makes it not vectorizable. This array is in L1 but the reads from L1 are still a bottle neck.

There are enough registers in the CPU to hold 52 floating points. It kinda sorta looks it should be possible to pull off but I am not good enough at assembly

Edited by OneTrickPony82

##### Share on other sites

>>How slow is too slow? 8 thousand cycles (is that right?) is small enough to be measured in microseconds. How fast does it need to be?

The faster the better. This function and one very similar one (more complicated but with similar structure) are the bottleneck.

The algorithm works on huge decision tree and calls those functions hundred thousands or sometimes millions of times per iteration.

>>It's often good practice to perform all reads from memory at the top of a block, do compute in the middle, then perform writes to memory at the bottom. e.g. you can be sure that this block won't constrain the optimizer:

Thanks, this is a useful suggestion but in this case it doesn't change anything.

When I am measuring only parts of the function I get that the first loop takes 4.4 cycles per item and 2nd part 3.5 cycles per item. That makes sense I guess seeing how there are 3 loads and 2 writes 1st loop while 2nd one has 3 loads and 1 write.

The thing is that in this function we are first adding numbers to 52 values in memory just to keep loading them in 2nd part.

My idea was that there is enough space in SIMD registers to keep 52 floats, so that maybe something like:

3)retrieve value of registers corresponding to card1/card2 and add dist there, store again

and then when iterating again we have those in registers as well and maybe retrieving them and doing opeartions on them would be faster than reading from cache. I realize you can't just add say 3rd float from 6th SIMD register but you can first extract it, then add, then store again (I think?).

Anyway thanks much for your help. I am looking at this function for a year now and basically lost all hope already but sometimes I wake up with an idea and I can't sleep until someone debunks it :)

##### Share on other sites
Take a step back. Can you find any heuristics? Divide and conquer? Or parallelize?

##### Share on other sites

>>No expert either, but I'm pretty sure counting cycles is a fairly meaningless metric on today's technology.

It's quite meaningful. I am measuring stuff in "wall clock" time as well but I've found measuring cycles is more convenient and very reliable.

Calling rdtsc is better than calling a OS API function to get time so there is that.

>>Take a step back. Can you find any heuristics? Divide and conquer? Or parallelize?

Guys, I am programming this thing for more than a year. I am working on it full-time and the project is quite successful.. I talk regularly to other people who program in the same domain and we all agree those functions are a bottleneck. Posting "the big picture" here is pointless as it would take you at least several weeks to get on speed (that's assuming you are 10x more talented than I am which I am sure you are).

I find the attitude a bit condescending. I've asked a well defined question about in my view interesting optimization problem and you are assuming I am missing something in the big picture. Maybe I am or maybe I am not, it isn't meaningful. I am very interested if it's possible to optimize this piece of code beyond what I was able to do myself. It's an interesting problem in itself if anything.

There are no magic shortcuts in what I am doing. There are places which could be optimized and this is one of them. The code is already highly parallel, divide and conquer doesn't apply to algorithms of this class, heuristics may or may not apply but it's completely different issue.

I've posted here because I know some game developers are very good at low level optimization. I am not very good at it and I have no experience in assembly so I am sure there are tricks which I am unaware of there. I really don't know why condescending answers with the assumption that I don't know what I am doing get upvoted on this forum.

Edited by OneTrickPony82

##### Share on other sites

I find the attitude a bit condescending. I've asked a well defined question about in my view interesting optimization problem and you are assuming I am missing something in the big picture.

Okay, fair enough. I wanted to be sure all the bases were covered.

Feel free to down vote our posts if you feel we've been unhelpful.

##### Share on other sites

If Hodgmans suggestion doesn't make much of a difference then probably manually making it vectorized is the only way to get it faster. You can fit your 52 floats into registers but I would guess it will be slower.. as you always have to handle the case where you have to sample from 7 registers, so that's 6 mask-combines (if you keep 8 floats per vector).

If you settle for 4 floats per vector then maybe, but then you need 13 registers and only have 3 left for other stuff..

If you get a newer CPU with AVX2 I think you can make the second loop significantly shorter when vectorized to do 8 floats at a time as there are scatter/gather load/store ops.. so by tweaking the table to be laid out as a proper mask that could be quite quick.

Also I'm just guessing as I can't really say without actually trying it.

(I'm assuming all combinations of 52 cards are possible for the 2 picked, if you could do some high-level changes that divided the problem in parts where each inner-loop only handled part of that range at a time, then that would probably help a lot).

Edited by Erik Rufelt

##### Share on other sites
Do you need full floating point accuracy, or would a smaller fixed-point format do?

It's definitely possible to hand-vectorize. If you ported to GPU that's what you'd be doing :)
Whats the oldest CPU you need it to run on?

Do you already make use of multithreading at a higher level of the algorithm?

Have you tried partially unrolling the loops? e.g. use [i..i+9] within the loop body and increment the counter by 10 at a time. That will increase your register usage while also helping out the CPU's internal scheduler and prefetcher.

You could always port your entire algorithm to the GPU instead of just this one part :D
Or if you're that dedicated, to Intel MIC / Xeon Phi.

##### Share on other sites

Just a quick remark: even if it's unfeasible to keep everything into registers, it may be way more feasible to try to keep things in the cache, right? Since accessing the actual RAM is a severe performance penalty, keeping everything you need into the cache would definitely make for a huge speed improvement.

Also on unrolling the loops and such... you have never seen what GCC will pull off when you leave it unbounded, right? =P (I've seen the damn thing cram in two whole matrices entirely into registers then work with that, despite the code being written in an extremely naive way)

##### Share on other sites

>>Low level optimization is only useful after high level optimization is exhausted.

I find low level optimization a very interesting problem in itself. There is surely more to win in high level things and we are working on them as well.

Still, no matter what kind of algorithmic improvement we make, those functions will always be a bottle neck.

So yeah, I find it both important and interesting.

>>Replacing or tuning high level algorithms can reach many orders of magnitude of difference

We are already doing something faster than publicly known "state of the art". Surely there is way more to be done (and I know some people in my domain got signficiant high level improvement) but getting there is very difficult as well.

>>I'm sorry you feel that is condescending, but it is the truth.

Low level optimization is still interesting. It still can bring very big improvements. High level algorithms are important but it's not that like we (and other people) are not working on them. It's just not a topic of this conversation.

>>. No matter how much you tune your 18 speed bike it will never race around the track like an F1 racer. No matter how fast you get your F1 racer to run on land speed records, rockets will reach faster speeds.

This can be used to argue you should never do low level optimization. It's just meaningless. Low level things matter to some people. I am one of them.

>>When looking for major performance improvements you must first look at the biggest practical high level changes. Only after that is done do you look to low level tunings.

I am working on it for more than a year and I already based my work on what other people in the field got to during many years of work. You may say it's "after that" already even though there are for sure a lot of high level improvements to make as well.

Why can't we just discuss an interesting low level optimization problem?

I find low level optimization to be interesting in itself. I am not very good at it and I hope to learn more about it.

>>If you get a newer CPU with AVX2 I think you can make the second loop significantly shorter when vectorized to do 8 floats at a time as there are scatter/gather load/store ops.. so by tweaking the table to be laid out as a proper mask that could be quite quick.

Wow, this is very interesting. I wonder if GCC will be able to apply those gather/scater instructions by itself. Unfortunatley right now I am working on ivy bridge CPU but I think I am going to get a newer one just to see.

>>(I'm assuming all combinations of 52 cards are possible for the 2 picked, if you could do some high-level changes that divided the problem in parts where each inner-loop only handled part of that range at a time, then that would probably help a lot).

card1 is always different than card2 and I can force card1 being bigger (or smaller) than card2 but that's unfortunately all.

>>Do you need full floating point accuracy, or would a smaller fixed-point format do?

Hard to say but probably here I need full floating point as I am already suffering a bit for having only 32 bits here (those numbers get summed up over huge tree).

>>Whats the oldest CPU you need it to run on?

It's already fast enough for most use cases. Making it faster is important for people who can afford very powerful hardware so even if I make improvement which only works on say Haswell it will be worth something for me.

>>Do you already make use of multithreading at a higher level of the algorithm?

The algorithm is very easy parallelizable and I can basically use as many cores as I am given by splitting the task way before those functions are called (basically every thread gets its own independent branch of the tree).

>>Have you tried partially unrolling the loops? e.g. use [i..i+9] within the loop body and increment the counter by 10 at a time. That will increase your register usage while also helping out the CPU's internal scheduler and prefetcher.

I tried and it was the same time. That's why I am pretty sure reads from L1 are the bottleneck, even with unrolling it the values are written to memory and then read from it in 2nd part.

>>You could always port your entire algorithm to the GPU instead of just this one part

It's not an easy problem. While the whole thing is possible to reduce to linear equations it's not very efficient. (the resulting problem is very big) So far most people in the field think CPUs are better suited for it but we never know.

>>Just a quick remark: even if it's unfeasible to keep everything into registers, it may be way more feasible to try to keep things in the cache, right? Since accessing the actual RAM is a severe performance penalty, keeping everything you need into the cache would definitely make for a huge speed improvement.

##### Share on other sites

>>If you get a newer CPU with AVX2 I think you can make the second loop significantly shorter when vectorized to do 8 floats at a time as there are scatter/gather load/store ops.. so by tweaking the table to be laid out as a proper mask that could be quite quick.

Wow, this is very interesting. I wonder if GCC will be able to apply those gather/scater instructions by itself. Unfortunatley right now I am working on ivy bridge CPU but I think I am going to get a newer one just to see.

You could port some performance sensitive parts of your code from C/GCC to ISPC. It's a new language from Intel that extends C to treat vectorized code as a first-class language concept. It will automatically emit SSE/AVX/etc instructions. Though yeah, on SSE it will give gather warnings seeing as SSE has no vectorized-load.

I find that rewriting algorithms in ISPC gives me good insight into how I should rearrange my data structures in order to best make use of vector instruction sets.

##### Share on other sites

Caveats: I'm not really good at this kind of optimization. It should properly be performed at the assembly level, and I'm just not that good at it, especially on modern CPUs where performance results are often surprising. (Fewer instructions doesn't mean faster). My level of skill is about "Don't do anything unnecessary, and make sure everything fits into the cache".

But here at some things that I would try, if confronted with this problem.

1) sum is a tempting target. Could you calculate sum when you calculate dist, instead of in this loop? If you use the same dist array for more than one run, that could save a little time.

2) In the second loop, you add sum to every element of matchups. Could your algorithm just assume that matchups is normalized to start at sum? Or maybe apply sum when you access the matchups array later, instead of adding it every time?

3) You are having trouble getting your compiler to vectorize loops for you. I'm not good at this either. But you could try a couple of things. One would be to try a different compiler. I'm completely out of date here. But certainly you could try Visual C++. Intel's C compiler used to be known for optimization. Even if your target platform isn't supported you could "borrow" the assembler output.

4) Or, split your loops into multiple loops until the compiler vectorizes at least one of the loops. matchups = sum + dist certainly should vectorize on its own, I would think.

Anyways, just some thoughts. As I said, I'm not great at this, so I would just blunder around trying stuff, and maybe something will give a performance boost. Actually, I guess I would try to find someone good at assembly optimization, but you're probably already looking.

Enjoy!

Geoff

##### Share on other sites

It's a very interesting optimization problem.

I somehow doubt it can get very much faster though.. are those 'table' and 'dist' arrays very large and only used once, or are they used multiple times?

If the same elements aren't re-used without a very large number of separate elements being iterated over first then they will disappear from cache entirely. As your first loop reads 8 bytes per iteration and the second reads the same 8 (should be in cache) + writes 4, if those memory locations are "new" then they will be loaded from RAM. At 8 cycles total for 8 bytes read + 4 bytes written you can maybe set an absolute maximum performance as (bandwidth of RAM) divided by (the number of Ghz your CPU runs at times the number of threads).

(This is of course not true if the function is run multiple times on the same arrays while they are still in cache).

Are you sure you get any vectorization at all from GCC?

I pasted your code into VC++ 2015 and it didn't vectorize anything.

(Also I was mistaken about scatter, only gather is currently available, scatter is only in Xeon Phi, though you don't need it for the second loop, but in order to vectorize the first efficiently you would need scatter).

EDIT: Removed likely mistake about interleaving.

Edited by Erik Rufelt

##### Share on other sites

>>You could port some performance sensitive parts of your code from C/GCC to ISPC

Thanks, I have't heard about it. I tried Intel C compiler but it produced slower code than GCC 4.8 for me (but that could be the results of me developing/benchmarking under GCC 4.8 in the first place)

>>1) sum is a tempting target. Could you calculate sum when you calculate dist, instead of in this loop? If you use the same dist array for more than one run, that could save a little time.

Yes, actually it's more than one but not massively more (usually 2 or 3 per iteration). I think it will save some cycles. Thanks.

>>2) In the second loop, you add sum to every element of matchups. Could your algorithm just assume that matchups is normalized to start at sum? Or maybe apply sum when you access the matchups array later, instead of adding it every time?

It adds it because inclusion/exclusion principle (the sum is sum of all minus some of those who overlap at one card, plus some of those which overlap at two which is only value at given position) and I could actually save adding sum right now in some places (but not all) but this addition just isn't that tempting to optimize away.

>>3) You are having trouble getting your compiler to vectorize loops for you. I'm not good at this either. But you could try a couple of things. One would be to try a different compiler. I'm completely out of date here. But certainly you could try Visual C++. Intel's C compiler used to be known for optimization. Even if your target platform isn't supported you could "borrow" the assembler output.

I develop for Windows and I switched to GCC because VS is in my view a crappy compiler (doesn't support C99, only supports OpenMP 2.0, produces slow code). Intel is decent but for my case produces slower code as well.

I've found this is a very controversial subject and a lot of better programmers than me swear that eitehr VS or Intel are better so I am not looking for religious war.

>>4) Or, split your loops into multiple loops until the compiler vectorizes at least one of the loops. matchups = sum + dist certainly should vectorize on its own, I would think.

Yes, I've tried it but it's slower sadly.

>>I somehow doubt it can get very much faster though.. are those 'table' and 'dist' arrays very large and only used once, or are they used multiple times?

there are many table arrays. When we go over huge decision tree (millions of nodes) we pick the proper one. There are about 2000 of those arrays. Dist is different at every leaf in the tree (it's a probability distribution at current point). As table repeats over many leaves (but not dist) I am pretty sure it's in cache already (but sadly I never figured out how to measure on Windows and as my customer base uses 95% Windows and 5% Mac (but they have parallels/bootcamp anyway) there wasn't enough incentives to port to Linux yet. Having access to better profiling tools might be one though

>>Are you sure you get any vectorization at all from GCC?

Yes, I am sure. Overall my program benefits about 15% from including AVX alone.

Maybe I write very simple code but I've found that auto-vectorizer from GCC really does a decent job.

>>I pasted your code into VC++ 2015 and it didn't vectorize anything.

VC is in my view a crappy compiler but again, a lot of people disagree - often people I have a lot of respect for.

I code in C so I have no opinion how it handles all the C++ madness but at numerical code it is just slower on everything I've ever tested.

>>(Also I was mistaken about scatter, only gather is currently available, scatter is only in Xeon Phi, though you don't need it for the second loop, but in order to vectorize the first efficiently you would need scatter).

I can't wait for a new Xeon Phi. I have 0 experience with the older ones but wha tthey say about them in the presentations looks like perfect fit for my problem (way better than GPUs). Gather alone would speed it up significantly I believe (and even more so the other similar function I have).

Thanks for your help guys. I was really mostly interested in the "array in the registers" part but I've learnt a few more things on the way :)

Edited by OneTrickPony82

##### Share on other sites

The thing is in my case there are only 52 possible elements the items are made from. I am thinking that maybe it would be faster if we could keep all the 52 required floats in registers instead of an array.

C++ from some up high compilers has implemented this :

http://stackoverflow.com/questions/6335023/c-how-to-allocate-memory-dynamically-on-stack

I have never used it, but it could accept 52 structs , this way you would page break less, but you should also pool the arrays you revisit in every iteration at least to do neighbour -  but in first place to try not do it as adviced, if it is possible.

In my old times I have used this approach to allocate on stack some larger unpractical data

class Stacky14Float{

public:

Stacky7Float(){};

float a1;float a2;float a3;float a4;float a5;float a6;float a7;

float b1;float b2;float b3;float b4;float b5;float b6;float b7;

}

......and than just used as a local instanced class

Stacky7Float st();

##### Share on other sites

Is it possible to solve multiple instances of this problem at once (like 4)? Maybe that would allow using SIMD instructions?