Branchless code

Correct. Branchless code means no if's, no else's, no case statements, etc. Function pointers aren't going to help. What you should do is research CPU branch prediction, pipelining, and caching techniques. Then you will be in a better position to write code with far fewer branches in it. You won't eliminate them, but you can make them cheaper.
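To give a concrete (and hedged) picture of the kind of rewrite "branchless" usually means, here is an if/else replaced by arithmetic. This sketch assumes a 32-bit two's-complement int and an arithmetic right shift, which the C standard leaves implementation-defined:

int abs_branchy(int x)
{
    if (x < 0)
        return -x;
    return x;
}

int abs_branchless(int x)
{
    int mask = x >> 31;           /* 0 if x >= 0, -1 (all bits set) if x < 0 */
    return (x ^ mask) - mask;     /* negates x only when the mask is -1 */
}

Both versions return the same value (ignoring INT_MIN, which overflows either way); the second simply never asks the branch predictor for help.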

On the other hand, your compiler should arrange the code so it's pretty good. If you find that something isn't fast enough, then you most likely have an algorithm problem, not a 'code' problem.

Essentially, a CPU fetches multiple instructions at a time and then, when processing them, executes them in parallel as far as the code allows. Consider the following code on a CPU with two pipelines that are 5 stages long and a two-instruction fetch.

int a=0;
int b=0;
a++;
a++;
b++;
b++;

Technically, that is slower than this code:

int a=0;
int b=0;
a++;
b++;
a++;
b++;

Because what happens is that the integers are declared and set to zero. In the first example, the two a++ instructions are fetched together, then decoded. But in the third stage, the CPU finds that they touch the same location, so they can't be executed in parallel: the second a++ needs the result of the first, which is a read-after-write (RAW) dependency, with a write-after-write (WAW) hazard on top of it. So the CPU stalls for a cycle or two while the first a++ is calculated. Then, depending on the CPU, it may forward the value back into the execution stage, or it might not, and only then is the second a++ calculated. On this pipelined model, the first example takes a minimum of 8 cycles while the second takes only 6.

Unfortunately, I think that what you're trying to do is essentially set yourself up for an exercise in futility. Your compiler knows this is a problem. Not only does it know it's a problem, it will fix it, so unless you write this in assembly and don't let the compiler touch it, you're going to execute in 6 cycles regardless of the way that you write it.

Even if the compiler doesn't do anything about it, many CPUs have what is called out-of-order execution. The CPU can see the code coming in and, through additional stages, buffers, and registers buried in the different stages of the pipeline, it will dynamically reorder the instructions it receives in an attempt to avoid stalls.

I think that you are going to be much better off worrying about loops within loops (which, again, your compiler might unroll) and using better algorithms for parts of your code. Using an algorithm of order log N is much better than using one of order N squared.
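To make that concrete, here is a small illustrative sketch (not from the original post): looking a value up in a sorted table by binary search costs O(log N) comparisons per query, where a naive scan costs O(N), and that kind of gap dwarfs anything branch tweaking can buy.

#include <stddef.h>

/* returns 1 if key is present in the sorted array, 0 otherwise */
int contains(const int *sorted, size_t n, int key)
{
    size_t lo = 0, hi = n;                 /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;   /* avoids overflow on huge arrays */
        if (sorted[mid] < key)
            lo = mid + 1;
        else if (sorted[mid] > key)
            hi = mid;
        else
            return 1;                      /* found */
    }
    return 0;                              /* not found */
}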




Looking for an honest video game publisher? Visit www.gamethoughts.com
Shameless plug: Game Thoughts
quote:Technically, that is slower than this code:

Except that:
there is no requirement to emit separate instructions
there is no requirement to emit any instructions (which is to say, the variables could be initialized to 2)
there is no requirement to have any variables (which is to say, if the variables are unused they can be entirely eliminated; see the sketch after this list)
the order of execution within the processor is not constrained by the order of the instructions fed to it.
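A hedged sketch of what those freedoms can mean for the earlier a/b example; what any particular compiler actually emits will vary, so the generated assembly is the only real authority:

int example(void)
{
    int a = 0;
    int b = 0;
    a++; a++;        /* constant folding can turn this into a = 2 ...       */
    b++; b++;        /* ... and this into b = 2, with no increments emitted */
    return a + b;    /* or the whole function can collapse to a constant 4  */
}
/* and if a and b were never used at all, dead-code elimination
   could remove them entirely, emitting no instructions */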
char a[99999],*p=a;int main(int c,char**V){char*v=c>0?1[V]:(char*)V;if(c>=0)for(;*v&&93!=*v;){62==*v&&++p||60==*v&&--p||43==*v&&++*p||45==*v&&--*p||44==*v&&(*p=getchar())||46==*v&&putchar(*p)||91==*v&&(*p&&main(0,(char**)(--v+2))||(v=(char*)main(-1,(char**)++v)-1));++v;}else for(c=1;c;c+=(91==*v)-(93==*v),++v);return(int)v;}  /*** drpizza@battleaxe.net ***/
Eliminating branches is always good.
WRT the chess stuff: what I did in my engine was replace stuff like

if (side == white)
    infront = sq + 8;
else
    infront = sq - 8;

with infront = sq + NEXTRANK; (which could be compiled to CMOVcc or arithmetic trickery).
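One branch-free way to get a side-dependent NEXTRANK is a tiny table indexed by side. This is only a guess at the shape of it, not necessarily how Jan's engine defines it, and it assumes the common encoding of white = 0, black = 1:

enum { WHITE = 0, BLACK = 1 };                 /* assumed side encoding */

static const int next_rank[2] = { +8, -8 };    /* white moves up the board, black down */

static int square_in_front(int sq, int side)
{
    return sq + next_rank[side];               /* an indexed load and an add, no branch */
}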
Also, just keep 2 versions of all bitboard masks, so you can do

static const BitBoard NB_start[2] = { 0x6600000000000000, 0x0000000000000066 };

/* penalize non-developed knights and bishops */
c = (b[knight] | b[bishop]) & NB_start[side];
s += -8 * nbits(c);
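nbits() above is the engine's own bit-count routine. For reference, one common branchless way to count the set bits of a 64-bit word looks like the following; this is a generic sketch, not necessarily what Jan's engine uses, and it assumes BitBoard is a 64-bit unsigned type:

typedef unsigned long long BitBoard;    /* assumption: a 64-bit unsigned bitboard */

static int nbits(BitBoard x)
{
    /* classic SWAR population count: add bits in progressively wider groups */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (int)((x * 0x0101010101010101ULL) >> 56);
}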


HTH
Jan
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
quote:
the order of execution within the processor is not constrained by the order of the instructions fed to it.


Oh yes, there is.

Certain kinds of code are better suited to exploiting processor resources than others. There's only so much rearranging a processor can do. Sequences of instructions with lots of dependencies can be improved by breaking the dependency chains (doing useful, independent work between dependent instructions).
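A minimal sketch of what breaking a dependency chain looks like (illustrative only; whether it actually helps depends on the compiler and the CPU): in the first loop every addition waits on the previous one, while the second keeps two independent chains in flight.

int sum_serial(const int *v, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];               /* each add depends on the one before it */
    return s;
}

int sum_two_chains(const int *v, int n)
{
    int s0 = 0, s1 = 0;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += v[i];              /* these two adds are independent... */
        s1 += v[i + 1];          /* ...so they can overlap in the pipelines */
    }
    if (n & 1)
        s0 += v[n - 1];          /* pick up the leftover element */
    return s0 + s1;
}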

Of course, these considerations can only really be taken into account at the assembly language level.

Realistically, branches aren't _that_ bad. Some of these elaborate methods of removing branches aren't worth the trouble.

NickB's example might be more trouble than it's worth, especially on a register-starved architecture like the x86.

NickB: Are you certain your code runs faster than an alternative method using a branch?

I think a balance has to be struck between using CMOVcc instructions (or conditional/predicated instructions on those architectures that support them -- IA-64, ARM, etc.) and branches. In some cases (especially when there is a large difference between the taken and not-taken paths), branches will be quicker than a long string of wasteful conditional instructions that all end up evaluating false.

I haven't done any tests on this, but I'd like to in the future in order to determine what is optimal and what isn't.

Of course, minimizing branches is a good thing... Just make sure the alternative doesn't turn out to be worse!
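As a hedged sketch of that trade-off: the first function below is the kind of simple select a compiler will often turn into CMOVcc, while the second evaluates both sides unconditionally, which only pays off when both sides are cheap. Whether either actually beats a well-predicted branch is exactly the kind of thing that needs measuring.

static int clamp_low(int x, int lo)
{
    return (x < lo) ? lo : x;           /* simple select: a natural CMOVcc candidate */
}

static int pick_square(int cond, int a, int b)
{
    int taken     = a * a;              /* both sides are computed regardless of cond, */
    int not_taken = b * b;              /* so expensive expressions here would be wasted work */
    return cond ? taken : not_taken;    /* may compile to a conditional move */
}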

---
Bart
quote:Original post by DrPizza
Technically, that is slower than this code:

Except that:
there is no requirement to emit separate instructions
there is no requirement to emit any instructions (which is to say, the variables could be initialized to 2)
there is no requirement to have any variables (which is to say, if the variables are unused they can be entirely eliminated)
the order of execution within the processor is not constrained by the order of the instructions fed to it.


Uhh… wrong. Just because you write two instructions on the same line doesn't mean they're not separate instructions. And if you look at my example, it requires the use of variables, because 'Technically, that is slower than this code'.

I was using it as an example to show that the only real way to optimize something to use fewer branches is at the assembly level, but since not a lot of people have extensive assembly experience, and Russell appears to be one of those people, I figured I'd use C syntax to emulate what you might see in assembly.

Looking for an honest video game publisher? Visit www.gamethoughts.com
Shameless plug: Game Thoughts

