Archived

This topic is now archived and is closed to further replies.

Branchless code

This topic is 5503 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Branchless code has the potential to run faster than the equivalent code that uses conditionals, or so I''ve heard. I was wondering about how people go about getting rid of conditional code, since it would seem to make a significant difference in some cases (such as frequently used code). One idea that came to mind is to use an array of function pointers instead of a conditional. For example, instead of something like (in my chess program): if (side_to_move == WHITE) { // do white stuff here } else { // do black stuff here } You could do something like: // Forgive me if my syntax for using function pointers is wrong typedef ... Function; Function do_stuff[] = { DoWhiteStuff, DoBlackStuff }; do_stuff[side_to_move](); That would eliminate the potential branch misprediction penalty, and only cost a call and ret instruction. So as long as the call+ret instructions don''t cost more than 10-20 cycles that you would suffer for a branch misprediction, you win here. In thinking about this, it sounded similar to the OOP refactoring technique known as "replace conditional with polymorphism". Is that essentially what is happening when I replace a conditional with a lookup table of function pointers? Anyway, I was wondering if someone more knowledgable than myself about these things could shed some light on which method would be more efficient, and about the real worth of branchless code. Is it worth a great amount of extra work to achieve branchless code? Or is it only going to provide minimal speedups? Thanks, Russell

Share this post


Link to post
Share on other sites
I''m not really knowledgeable about this, but I''ll tak an educated guess, and if someone knows more than I do, I''d love to hear it.

For the example situation where there are only two cases, I would suspect the if statement is faster. With an array of function pointers, there will be an array reference, which I would guess, takes at least as long as a conditional in the if. Regardless, in the situation with only two cases, it depends on what is more readable. I personally find the if statement easier to read.

However, if you had something like

switch(side_to_move)
{
case white:
...
case red:
...
case blue:
...
}

where you have a lot of case statements, then I would go with the function pointers, because in that case, they will be faster, especially if the switch statement is rather large. 2 or 3 I wouldn''t worry about, 20 or 30, I''d worry. That''s an extreme though.

My opinion, if you only have one conditional, don''t bother changing your code, it''s probably not worth it. If you have a lot of conditionals that are called frequently, then you probably would change to function pointers if speed is an issue.

________________________
Grab your sword, and get prepared for the SolidSteel RPG engine, coming soon...(i.e. when it''s done)

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
Only optimize places where you need increased efficiency. The example the original poster gave works in theory, aside from a few syntax issues. The trade-off is a taking a possible branch-prediction failure and turning it into an indirect jump. Determing whether or not it''s worth it is the difficlt part, but it works something like this.

x -- the time the original code took to execute
y -- the time addition incurred by a branch-prediction failure
z -- the time of an indirect jump
A -- the percent of the time you expect a branch-prediction failure to occur

So if x+y*A >= z, then you can argue that in most cases the function-pointer method is more efficient. You wouldn''t want to apply this method to anything but some portion of code that is executed repeatedly though, as the average case argument falls apart. You would never want to do this on code that will be executed only a few times or simply not very often. The other drawback to this is that since C and C++ do not support heterogenous arrays, the function signature is limited to one case. This may or may not pose a problem for your design.

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
quote:

the time the original code took to execute



By this I meant the time it takes for a regular if-statement to be executed, which is just a direct jump or executed the next instruction of code, not the code contained in the if-statement itself. Anyway, it depends mostly on the percentage of time that you expect a branch-predition failure to occur.

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
Calls/ret *also* need to be predicted, just like branches. If the CPU can''t predict a call, because the call will depend on the value of a register (or variable), then it ends up being as expensive (if not more) than a mispredicted branch.

The same applies to the ret too, so don''t think about messing up stack values to go around this.

Unless the code inside if (color == WHITE) {} else {} is dead simple, don''t think for a second that it will run faster without the branch. It most likely won''t. Your best bet is to move the conditionals outside of the inner loop (assuming your code is inside an inner loop). If that code isn''t inside an inner loop, then you are most likely trying to optimize the wrong code.

As always, profile before you decide to optimize. Think before you code

Share this post


Link to post
Share on other sites
Function pointers aren''t the solution -- they''ll make things much worse.

They won''t be predicted properly (I don''t think branch prediction algorithms even try to make predictions for indirect branches -- or maybe they do?)

Spend your effort on minimizing the amount of conditions to test and maximizing the amount of work done within the conditional code blocks.

---
Bart

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
Like the lowermost AP said, compiler can''t predict the jump until it has fetched the address from a register.. This efficiently halts the pipeline and is much harder to predict than a normal conditional branch which can only go to two fixed addresses.

Share this post


Link to post
Share on other sites
No; branchless code really means you don't jump - at all!

You'll probably need to do this kind of thing at assembly level (to derive any befit from it, anyhow). It's fairly common in MMX code for example. The basic principle is that you have multiple answers to a formula (say, you wish to get the absolute value of an integer, where you would have the original number and it's negative). You'd then generate a mask that when anded would eliminate the incorrect answer(s).


    
//eg: get absolute value of integer

// start with number in eax:


xor edx, edx
mov ecx, eax
test eax, eax
setl dl
neg ecx
dec edx // if eax is less than zero, mask is 0, otherwise it's 0xFFFFFFFF

xor ebx, ebx
test ecx, ecx
setl bl // generate mask for ecx

and eax, edx // mask eax

dec ebx
and ecx, ebx // mask ecx

or eax, ecx // combine answers, only positive answer can result



Of course, as I'm sure someone will point out this could be done faster by doing using the P6 instruction CMOVcc - but this is just to illustrate the idea! as you can see this has no jumps or calls in it - my guess is that this should execute in about 6 cycles or so.


  
mov ecx, eax
neg ecx
cmovge eax, ecx


Hope that gives you an idea of what it's about

[EDIT: I'll try to actually put the correct code in this time!]

[edited by - NickB on November 19, 2002 5:25:57 AM]

Share this post


Link to post
Share on other sites
Correct. Branchless code means no if''s, no else, no case statements, etc. Function pointers aren''t going to help. What you should do is research cpu branch prediction, pipelining, and caching techniques. Then you will be in a better position to write code that has a lot less branches in it. You won''t eliminate them, but you can make them better.

On the other hand, your compiler should arrange the code so it''s pretty good. If you find that something isn''t fast enough, then you most likely have an algorithm problem, not a ''code'' problem.

Essentially, what happens is a CPU can fetch multiple instructions with each fetch. Then, when processing them, it does them in parallel as the code allows. Consider the following code with two pipelines that are 5 stages long and a two instruction fetch.

int a=0;
int b=0;
a++;
a++;
b++;
b++;

Technically, that is slower than this code:

int a=0;
int b=0;
a++;
b++;
a++;
b++;

Because what happens is that the integers are declared, and set to zero. In ex1, the two instructions of a++ are fetched, then decoded. But in the third stage, it is found that they access the same memory, so they can''t be executed in parallel. It''s called a WAW or Write after Write condition. So, the CPU stalls for a cycle or two while a++ is calculated. Then, depending on the CPU it may feed the value back into the third stage registers, or it might not. Then the second a++ value is calculated. If the CPU is pipelined, the first example will take a minimum of 8 cycles while the second will only take 6 cycles.

Unfortunately, I think that what you''re trying to do is essentially set yourself up for an exercise in futility. Your compiler knows this is a problem. Not only does it know it''s a problem, it will fix it, so unless you write this in assembly and don''t let the compiler touch it, you''re going to execute in 6 cycles regardless of the way that you write it.

Even if the compiler doesn''t do anything about it, many CPU''s have what is called out of order execution. It can see the code coming in, and through the use of additional stages, buffers, and memory registers buried in the different stages of the CPU, it will dynamically reorder the code that is being received in an attempt to avoid memory stalls.

I think that you are going to be much better off worrying about loops within loops (which again, your compiler might unroll) and using better algorithms for parts of your code. Using a log N order algorithm is much better than using an algorithm that is of the order N squared.





Looking for an honest video game publisher? Visit www.gamethoughts.com

Share this post


Link to post
Share on other sites
quote:
Technically, that is slower than this code:

Except that:
there is no requirement to emit separate instructions
there is no requirement to emit any instructions (which is to say, the variables could be initialized to 2)
there is no requirement to have any variables (which is to say, if the variables are unused they can be entirely eliminated)
the order of execution within the processor is not constrained by the order of the instructions fed to it.

Share this post


Link to post
Share on other sites
Eliminating branches is always good.
WRT to chess stuff: what I did in my engine was replace stuff like

  if(side == white)
infront = sq+8;
else
infront = sq-8;

with infront = sq+NEXTRANK; (which could be compiled to cmovcc or arithmetic trickery).
Also, just keep 2 versions of all bitboard masks, so you can do

  static const BitBoard NB_start[2] = { 0x6600000000000000, 0x0000000000000066 };

/* penalize non-developed knights and bishops */
c = (b[knight] | b[bishop]) & NB_start[side];
s += -8*nbits(c);


HTH
Jan

Share this post


Link to post
Share on other sites
quote:

the order of execution within the processor is not constrained by the order of the instructions fed to it.



Oh yes, there is.

Certain kinds of code are better suited to exploit more processor resources than others. There''s only so much rearranging a processor can do. Sequences of instructions with lots of dependencies can be improved by breaking the dependency chains (doing useful work between dependencies.)

Of course, these considerations can only really be taken into account at the assembly language level.

Realistically, branches aren''t _that_ bad. Some of these elaborate methods of removing branches aren''t worth the trouble.

NickB''s example might be more trouble than it''s worth, especially on a register-starved architecture like the X86.

NickB: Are you certain your code runs faster than an alternative method using a branch?

I think a balance has to be struck between using CMOVcc instructions (or conditional/predicated instructions on those architectures that support them -- IA-64, ARM, etc.) and branches. In some cases (especially when there is a large difference between the taken and not-taken paths), branches will be quicker than a long string of wasteful conditional instructions that all end up evaluating false.

I haven''t done any tests on this, but I''d like to in the future in order to determine what is optimal and what isn''t.

Of course, minimizing branches is a good thing... Just make sure the alternative doesn''t turn out to be worse!

---
Bart

Share this post


Link to post
Share on other sites
quote:
Original post by DrPizza
[quote]Technically, that is slower than this code:

Except that:
there is no requirement to emit separate instructions
there is no requirement to emit any instructions (which is to say, the variables could be initialized to 2)
there is no requirement to have any variables (which is to say, if the variables are unused they can be entirely eliminated)
the order of execution within the processor is not constrained by the order of the instructions fed to it.


Uhh... wrong. Just because you write two instructions on the same line doesn''t mean they''re not separate instructions. And if you look at my example, it requires the use of variables, because ''Tehcnically, that is slower than this code''.

I was using it as an example to show that the only real way to optimize something to use less branches is at the assembly level, but since not a lot of people have extensive assembly experience, and Russell appears to be one of those people, I figured I''d use C syntax to emulate what you might see in assembly.


Looking for an honest video game publisher? Visit www.gamethoughts.com

Share this post


Link to post
Share on other sites