Sign in to follow this  
lukabratzi

SSE Branching Question

Recommended Posts

Hey everyone,

So I'm working on a relatively simple SSE function. Essentially, I give it three inputs and compute three outputs based on the inputs. The issue is that one of these outputs has three potential values. For instance:
Output = < a1, b, c > if component 1 is the largest
Output = < a2, b, c > if component 2 is the largest
Output = < a3, b, c > if component 3 is the largest

Now my main goals is to be able to implement this completely without branching. To do this I use SSE to compute all three possible outputs <a1, a2, a3 > in parallel. This part works perfectly and I've verified its correct.

My problem is that I then need to choose an output value for a without branching. I initially tried to do this completely without using any SSE cmp instructions because I figured that intrinsic instructions like max, min, cmplt, etc would all use branching in some fashion behind the scenes. But I've been reading up on a few sites (Intel, lecture slides, etc) and I'm not so sure that these instructions actually use branching, which kind of blows my mind. I was hoping that someone on here would have a definitive answer they could give me.

If the SSE comparison instructions DO use branches, does anyone have an idea for how I could select my output without branching? I've been working on this for like the past 24 hours and am almost at my wits end.

If anyone needs additional info just let me know and I can answer your questions.

Share this post


Link to post
Share on other sites
There are a number of SSE instructions that generate bit masks (all ones or all zeros) depending on the result of a comparison. These masks are then used to choose between two options.

An example using intrinsics:
[code]
// result = (lhs < rhs) ? option1 : option2;
__m128 mask = _mm_cmplt_ps(lhs, rhs);
__m128 result = _mm_or_ps(_mm_and_ps(mask, option1), _mm_andnot_ps(mask, option2));
[/code]

Share this post


Link to post
Share on other sites
[quote name='edd²' timestamp='1333833206' post='4929152']
There are a number of SSE instructions that generate bit masks (all ones or all zeros) depending on the result of a comparison. These masks are then used to choose between two options.

An example using intrinsics:
[code]
// result = (lhs < rhs) ? option1 : option2;
__m128 mask = _mm_cmplt_ps(lhs, rhs);
__m128 result = _mm_or_ps(_mm_and_ps(mask, option1), _mm_andnot_ps(mask, option2));
[/code]
[/quote]
Right, I saw that. But here's what concerns me:
// result = (lhs < rhs) ? option 1 : option2

How is that not a branch? Does SSE do some special bit algorithm? Otherwise I can't see how that's not the same as:

if ( lhs < rhs )
result = option1
else
result = option 2

Share this post


Link to post
Share on other sites
A branch means moving the program counter forwards (beyond the next instruction, typically) or backwards. There's no branching going on. _mm_cmplt_ps is an instruction that can be viewed as taking two inputs and generating a single output. Once that instruction has finished, the program counter advances to the next instruction. No branch.

The same holds for the other instructions/intrinsics used in the example.

Here's the same algorithm for scalar variables in regular C:

[code]
const unsigned choice_masks[2] = { 0, (unsigned)-1 };

unsigned mask = choice_masks[lhs < rhs];
unsigned result = (mask & option1) | ((~mask) & option2); // assuming option1 and option2 are also of type unsigned
[/code]

No conditionals, no branching.

Share this post


Link to post
Share on other sites
[quote name='Nypyren' timestamp='1333838874' post='4929178']
That line is a comment. It's just explaining what the branchless code is up to.
[/quote]
[quote name='edd²' timestamp='1333839394' post='4929180']
A branch means moving the program counter forwards (beyond the next instruction, typically) or backwards. There's no branching going on. _mm_cmplt_ps is an instruction that can be viewed as taking two inputs and generating a single output. Once that instruction has finished, the program counter advances to the next instruction. No branch.

The same holds for the other instructions/intrinsics used in the example.

Here's the same algorithm for scalar variables in regular C:

[code]
const unsigned choice_masks[2] = { 0, (unsigned)-1 };

unsigned mask = choice_masks[lhs < rhs];
unsigned result = (mask & option1) | ((~mask) & option2); // assuming option1 and option2 are also of type unsigned
[/code]

No conditionals, no branching.
[/quote]
Fantastic, that really clears things up for me and will let me eliminate a lot of gross bit manipulations I had to do. Thanks for clearing this up for me guys!

Share this post


Link to post
Share on other sites
The 'move mask' instruction is basically the only simd instruction that leads up to a 'real branch' in that the mask is then testable with the condition registers on the main CPU.


All the other functions that look like they branch are just setting per-component masks in the SIMD register and aren't real branches that mess with the CPU pipeline. i.e. _mm_gt_ps for vectorized floating point greater than, sets 0xFFFFFFFF or 0 in each slot in the register, so you can do permuting (via blend instructions, or and and nand/or on older cpus) with other data to generate a combined correct answer without any branches. Edited by Zoner

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this