# SSE Branching Question

This topic is 2267 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Hey everyone,

So I'm working on a relatively simple SSE function. Essentially, I give it three inputs and compute three outputs based on the inputs. The issue is that one of these outputs has three potential values. For instance:
Output = < a1, b, c > if component 1 is the largest
Output = < a2, b, c > if component 2 is the largest
Output = < a3, b, c > if component 3 is the largest

Now my main goals is to be able to implement this completely without branching. To do this I use SSE to compute all three possible outputs <a1, a2, a3 > in parallel. This part works perfectly and I've verified its correct.

My problem is that I then need to choose an output value for a without branching. I initially tried to do this completely without using any SSE cmp instructions because I figured that intrinsic instructions like max, min, cmplt, etc would all use branching in some fashion behind the scenes. But I've been reading up on a few sites (Intel, lecture slides, etc) and I'm not so sure that these instructions actually use branching, which kind of blows my mind. I was hoping that someone on here would have a definitive answer they could give me.

If the SSE comparison instructions DO use branches, does anyone have an idea for how I could select my output without branching? I've been working on this for like the past 24 hours and am almost at my wits end.

##### Share on other sites
There are a number of SSE instructions that generate bit masks (all ones or all zeros) depending on the result of a comparison. These masks are then used to choose between two options.

An example using intrinsics:
 // result = (lhs < rhs) ? option1 : option2; __m128 mask = _mm_cmplt_ps(lhs, rhs); __m128 result = _mm_or_ps(_mm_and_ps(mask, option1), _mm_andnot_ps(mask, option2)); 

##### Share on other sites

There are a number of SSE instructions that generate bit masks (all ones or all zeros) depending on the result of a comparison. These masks are then used to choose between two options.

An example using intrinsics:
 // result = (lhs < rhs) ? option1 : option2; __m128 mask = _mm_cmplt_ps(lhs, rhs); __m128 result = _mm_or_ps(_mm_and_ps(mask, option1), _mm_andnot_ps(mask, option2)); 

Right, I saw that. But here's what concerns me:
// result = (lhs < rhs) ? option 1 : option2

How is that not a branch? Does SSE do some special bit algorithm? Otherwise I can't see how that's not the same as:

if ( lhs < rhs )
result = option1
else
result = option 2

##### Share on other sites
That line is a comment. It's just explaining what the branchless code is up to.

##### Share on other sites
A branch means moving the program counter forwards (beyond the next instruction, typically) or backwards. There's no branching going on. _mm_cmplt_ps is an instruction that can be viewed as taking two inputs and generating a single output. Once that instruction has finished, the program counter advances to the next instruction. No branch.

The same holds for the other instructions/intrinsics used in the example.

Here's the same algorithm for scalar variables in regular C:

 const unsigned choice_masks[2] = { 0, (unsigned)-1 }; unsigned mask = choice_masks[lhs < rhs]; unsigned result = (mask & option1) | ((~mask) & option2); // assuming option1 and option2 are also of type unsigned 

No conditionals, no branching.

##### Share on other sites

That line is a comment. It's just explaining what the branchless code is up to.

A branch means moving the program counter forwards (beyond the next instruction, typically) or backwards. There's no branching going on. _mm_cmplt_ps is an instruction that can be viewed as taking two inputs and generating a single output. Once that instruction has finished, the program counter advances to the next instruction. No branch.

The same holds for the other instructions/intrinsics used in the example.

Here's the same algorithm for scalar variables in regular C:

 const unsigned choice_masks[2] = { 0, (unsigned)-1 }; unsigned mask = choice_masks[lhs < rhs]; unsigned result = (mask & option1) | ((~mask) & option2); // assuming option1 and option2 are also of type unsigned 

No conditionals, no branching.

Fantastic, that really clears things up for me and will let me eliminate a lot of gross bit manipulations I had to do. Thanks for clearing this up for me guys!

##### Share on other sites
The 'move mask' instruction is basically the only simd instruction that leads up to a 'real branch' in that the mask is then testable with the condition registers on the main CPU.

All the other functions that look like they branch are just setting per-component masks in the SIMD register and aren't real branches that mess with the CPU pipeline. i.e. _mm_gt_ps for vectorized floating point greater than, sets 0xFFFFFFFF or 0 in each slot in the register, so you can do permuting (via blend instructions, or and and nand/or on older cpus) with other data to generate a combined correct answer without any branches. Edited by Zoner

1. 1
Rutin
25
2. 2
3. 3
4. 4
JoeJ
18
5. 5

• 14
• 14
• 11
• 11
• 9
• ### Forum Statistics

• Total Topics
631757
• Total Posts
3002140
×