Branchless code

Correct. Branchless code means no if's, no else's, no case statements, etc. Function pointers aren't going to help. What you should do is research CPU branch prediction, pipelining, and caching techniques. Then you will be in a better position to write code with far fewer branches in it. You won't eliminate them, but you can make them cheaper.
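To give a concrete (and hedged) picture of the kind of rewrite "branchless" usually means, here is an if/else replaced by arithmetic. This sketch assumes a 32-bit two's-complement int and an arithmetic right shift, which the C standard leaves implementation-defined:

int abs_branchy(int x)
{
    if (x < 0)
        return -x;
    return x;
}

int abs_branchless(int x)
{
    int mask = x >> 31;           /* 0 if x >= 0, -1 (all bits set) if x < 0 */
    return (x ^ mask) - mask;     /* negates x only when the mask is -1 */
}

Both versions return the same value (ignoring INT_MIN, which overflows either way); the second simply never asks the branch predictor for help.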

On the other hand, your compiler should arrange the code so it's pretty good. If you find that something isn't fast enough, then you most likely have an algorithm problem, not a 'code' problem.

Essentially, a CPU fetches multiple instructions at a time and then, when processing them, executes them in parallel as far as the code allows. Consider the following code on a CPU with two pipelines that are 5 stages long and a two-instruction fetch.

int a=0;
int b=0;
a++;
a++;
b++;
b++;

Technically, that is slower than this code:

int a=0;
int b=0;
a++;
b++;
a++;
b++;

Because what happens is that the integers are declared and set to zero. In the first example, the two a++ instructions are fetched together, then decoded. But in the third stage, the CPU finds that they touch the same location, so they can't be executed in parallel: the second a++ needs the result of the first, which is a read-after-write (RAW) dependency, with a write-after-write (WAW) hazard on top of it. So the CPU stalls for a cycle or two while the first a++ is calculated. Then, depending on the CPU, it may forward the value back into the execution stage, or it might not, and only then is the second a++ calculated. On this pipelined model, the first example takes a minimum of 8 cycles while the second takes only 6.

Unfortunately, I think that what you're trying to do is essentially set yourself up for an exercise in futility. Your compiler knows this is a problem. Not only does it know it's a problem, it will fix it, so unless you write this in assembly and don't let the compiler touch it, you're going to execute in 6 cycles regardless of the way that you write it.

Even if the compiler doesn't do anything about it, many CPUs have what is called out-of-order execution. The CPU can see the code coming in and, through additional stages, buffers, and registers buried in the different stages of the pipeline, it will dynamically reorder the instructions it receives in an attempt to avoid stalls.

I think that you are going to be much better off worrying about loops within loops (which, again, your compiler might unroll) and using better algorithms for parts of your code. Using an algorithm of order log N is much better than using one of order N squared.
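To make that concrete, here is a small illustrative sketch (not from the original post): looking a value up in a sorted table by binary search costs O(log N) comparisons per query, where a naive scan costs O(N), and that kind of gap dwarfs anything branch tweaking can buy.

#include <stddef.h>

/* returns 1 if key is present in the sorted array, 0 otherwise */
int contains(const int *sorted, size_t n, int key)
{
    size_t lo = 0, hi = n;                 /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;   /* avoids overflow on huge arrays */
        if (sorted[mid] < key)
            lo = mid + 1;
        else if (sorted[mid] > key)
            hi = mid;
        else
            return 1;                      /* found */
    }
    return 0;                              /* not found */
}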




Looking for an honest video game publisher? Visit www.gamethoughts.com
Shameless plug: Game Thoughts
quote:Technically, that is slower than this code:

Except that:
there is no requirement to emit separate instructions
there is no requirement to emit any instructions (which is to say, the variables could be initialized to 2)
there is no requirement to have any variables (which is to say, if the variables are unused they can be entirely eliminated; see the sketch after this list)
the order of execution within the processor is not constrained by the order of the instructions fed to it.
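A hedged sketch of what those freedoms can mean for the earlier a/b example; what any particular compiler actually emits will vary, so the generated assembly is the only real authority:

int example(void)
{
    int a = 0;
    int b = 0;
    a++; a++;        /* constant folding can turn this into a = 2 ...       */
    b++; b++;        /* ... and this into b = 2, with no increments emitted */
    return a + b;    /* or the whole function can collapse to a constant 4  */
}
/* and if a and b were never used at all, dead-code elimination
   could remove them entirely, emitting no instructions */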
char a[99999],*p=a;int main(int c,char**V){char*v=c>0?1[V]:(char*)V;if(c>=0)for(;*v&&93!=*v;){62==*v&&++p||60==*v&&--p||43==*v&&++*p||45==*v&&--*p||44==*v&&(*p=getchar())||46==*v&&putchar(*p)||91==*v&&(*p&&main(0,(char**)(--v+2))||(v=(char*)main(-1,(char**)++v)-1));++v;}else for(c=1;c;c+=(91==*v)-(93==*v),++v);return(int)v;}  /*** drpizza@battleaxe.net ***/
Eliminating branches is always good.
WRT the chess stuff: what I did in my engine was replace stuff like

if (side == white)
    infront = sq + 8;
else
    infront = sq - 8;

with infront = sq + NEXTRANK; (which could be compiled to CMOVcc or arithmetic trickery).
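One branch-free way to get a side-dependent NEXTRANK is a tiny table indexed by side. This is only a guess at the shape of it, not necessarily how Jan's engine defines it, and it assumes the common encoding of white = 0, black = 1:

enum { WHITE = 0, BLACK = 1 };                 /* assumed side encoding */

static const int next_rank[2] = { +8, -8 };    /* white moves up the board, black down */

static int square_in_front(int sq, int side)
{
    return sq + next_rank[side];               /* an indexed load and an add, no branch */
}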
Also, just keep 2 versions of all bitboard masks, so you can do

static const BitBoard NB_start[2] = { 0x6600000000000000, 0x0000000000000066 };

/* penalize non-developed knights and bishops */
c = (b[knight] | b[bishop]) & NB_start[side];
s += -8 * nbits(c);
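nbits() above is the engine's own bit-count routine. For reference, one common branchless way to count the set bits of a 64-bit word looks like the following; this is a generic sketch, not necessarily what Jan's engine uses, and it assumes BitBoard is a 64-bit unsigned type:

typedef unsigned long long BitBoard;    /* assumption: a 64-bit unsigned bitboard */

static int nbits(BitBoard x)
{
    /* classic SWAR population count: add bits in progressively wider groups */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (int)((x * 0x0101010101010101ULL) >> 56);
}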


HTH
Jan
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
quote:
the order of execution within the processor is not constrained by the order of the instructions fed to it.


Oh yes, there is.

Certain kinds of code are better suited to exploiting processor resources than others. There's only so much rearranging a processor can do. Sequences of instructions with lots of dependencies can be improved by breaking the dependency chains (doing useful, independent work between dependent instructions).
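A minimal sketch of what breaking a dependency chain looks like (illustrative only; whether it actually helps depends on the compiler and the CPU): in the first loop every addition waits on the previous one, while the second keeps two independent chains in flight.

int sum_serial(const int *v, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];               /* each add depends on the one before it */
    return s;
}

int sum_two_chains(const int *v, int n)
{
    int s0 = 0, s1 = 0;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += v[i];              /* these two adds are independent... */
        s1 += v[i + 1];          /* ...so they can overlap in the pipelines */
    }
    if (n & 1)
        s0 += v[n - 1];          /* pick up the leftover element */
    return s0 + s1;
}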

Of course, these considerations can only really be taken into account at the assembly language level.

Realistically, branches aren't _that_ bad. Some of these elaborate methods of removing branches aren't worth the trouble.

NickB's example might be more trouble than it's worth, especially on a register-starved architecture like the x86.

NickB: Are you certain your code runs faster than an alternative method using a branch?

I think a balance has to be struck between using CMOVcc instructions (or conditional/predicated instructions on those architectures that support them -- IA-64, ARM, etc.) and branches. In some cases (especially when there is a large difference between the taken and not-taken paths), branches will be quicker than a long string of wasteful conditional instructions that all end up evaluating false.

I haven't done any tests on this, but I'd like to in the future in order to determine what is optimal and what isn't.

Of course, minimizing branches is a good thing... Just make sure the alternative doesn't turn out to be worse!
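As a hedged sketch of that trade-off: the first function below is the kind of simple select a compiler will often turn into CMOVcc, while the second evaluates both sides unconditionally, which only pays off when both sides are cheap. Whether either actually beats a well-predicted branch is exactly the kind of thing that needs measuring.

static int clamp_low(int x, int lo)
{
    return (x < lo) ? lo : x;           /* simple select: a natural CMOVcc candidate */
}

static int pick_square(int cond, int a, int b)
{
    int taken     = a * a;              /* both sides are computed regardless of cond, */
    int not_taken = b * b;              /* so expensive expressions here would be wasted work */
    return cond ? taken : not_taken;    /* may compile to a conditional move */
}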

---
Bart
quote:Original post by DrPizza
Technically, that is slower than this code:

Except that:
there is no requirement to emit separate instructions
there is no requirement to emit any instructions (which is to say, the variables could be initialized to 2)
there is no requirement to have any variables (which is to say, if the variables are unused they can be entirely eliminated)
the order of execution within the processor is not constrained by the order of the instructions fed to it.


Uhh… wrong. Just because you write two instructions on the same line doesn't mean they're not separate instructions. And if you look at my example, it requires the use of variables, because 'Technically, that is slower than this code'.

I was using it as an example to show that the only real way to optimize something to use fewer branches is at the assembly level, but since not a lot of people have extensive assembly experience, and Russell appears to be one of those people, I figured I'd use C syntax to emulate what you might see in assembly.

Looking for an honest video game publisher? Visit www.gamethoughts.com
Shameless plug: Game Thoughts

