Jump to content
  • Advertisement
Sign in to follow this  
lukabratzi

SSE Branching Question

This topic is 2411 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Hey everyone,

So I'm working on a relatively simple SSE function. Essentially, I give it three inputs and compute three outputs based on the inputs. The issue is that one of these outputs has three potential values. For instance:
Output = < a1, b, c > if component 1 is the largest
Output = < a2, b, c > if component 2 is the largest
Output = < a3, b, c > if component 3 is the largest

Now my main goals is to be able to implement this completely without branching. To do this I use SSE to compute all three possible outputs <a1, a2, a3 > in parallel. This part works perfectly and I've verified its correct.

My problem is that I then need to choose an output value for a without branching. I initially tried to do this completely without using any SSE cmp instructions because I figured that intrinsic instructions like max, min, cmplt, etc would all use branching in some fashion behind the scenes. But I've been reading up on a few sites (Intel, lecture slides, etc) and I'm not so sure that these instructions actually use branching, which kind of blows my mind. I was hoping that someone on here would have a definitive answer they could give me.

If the SSE comparison instructions DO use branches, does anyone have an idea for how I could select my output without branching? I've been working on this for like the past 24 hours and am almost at my wits end.

If anyone needs additional info just let me know and I can answer your questions.

Share this post


Link to post
Share on other sites
Advertisement
There are a number of SSE instructions that generate bit masks (all ones or all zeros) depending on the result of a comparison. These masks are then used to choose between two options.

An example using intrinsics:

// result = (lhs < rhs) ? option1 : option2;
__m128 mask = _mm_cmplt_ps(lhs, rhs);
__m128 result = _mm_or_ps(_mm_and_ps(mask, option1), _mm_andnot_ps(mask, option2));

Share this post


Link to post
Share on other sites

There are a number of SSE instructions that generate bit masks (all ones or all zeros) depending on the result of a comparison. These masks are then used to choose between two options.

An example using intrinsics:

// result = (lhs < rhs) ? option1 : option2;
__m128 mask = _mm_cmplt_ps(lhs, rhs);
__m128 result = _mm_or_ps(_mm_and_ps(mask, option1), _mm_andnot_ps(mask, option2));


Right, I saw that. But here's what concerns me:
// result = (lhs < rhs) ? option 1 : option2

How is that not a branch? Does SSE do some special bit algorithm? Otherwise I can't see how that's not the same as:

if ( lhs < rhs )
result = option1
else
result = option 2

Share this post


Link to post
Share on other sites
A branch means moving the program counter forwards (beyond the next instruction, typically) or backwards. There's no branching going on. _mm_cmplt_ps is an instruction that can be viewed as taking two inputs and generating a single output. Once that instruction has finished, the program counter advances to the next instruction. No branch.

The same holds for the other instructions/intrinsics used in the example.

Here's the same algorithm for scalar variables in regular C:


const unsigned choice_masks[2] = { 0, (unsigned)-1 };

unsigned mask = choice_masks[lhs < rhs];
unsigned result = (mask & option1) | ((~mask) & option2); // assuming option1 and option2 are also of type unsigned


No conditionals, no branching.

Share this post


Link to post
Share on other sites

That line is a comment. It's just explaining what the branchless code is up to.


A branch means moving the program counter forwards (beyond the next instruction, typically) or backwards. There's no branching going on. _mm_cmplt_ps is an instruction that can be viewed as taking two inputs and generating a single output. Once that instruction has finished, the program counter advances to the next instruction. No branch.

The same holds for the other instructions/intrinsics used in the example.

Here's the same algorithm for scalar variables in regular C:


const unsigned choice_masks[2] = { 0, (unsigned)-1 };

unsigned mask = choice_masks[lhs < rhs];
unsigned result = (mask & option1) | ((~mask) & option2); // assuming option1 and option2 are also of type unsigned


No conditionals, no branching.

Fantastic, that really clears things up for me and will let me eliminate a lot of gross bit manipulations I had to do. Thanks for clearing this up for me guys!

Share this post


Link to post
Share on other sites
The 'move mask' instruction is basically the only simd instruction that leads up to a 'real branch' in that the mask is then testable with the condition registers on the main CPU.


All the other functions that look like they branch are just setting per-component masks in the SIMD register and aren't real branches that mess with the CPU pipeline. i.e. _mm_gt_ps for vectorized floating point greater than, sets 0xFFFFFFFF or 0 in each slot in the register, so you can do permuting (via blend instructions, or and and nand/or on older cpus) with other data to generate a combined correct answer without any branches. Edited by Zoner

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

GameDev.net is your game development community. Create an account for your GameDev Portfolio and participate in the largest developer community in the games industry.

Sign me up!