SSE Branching Question

Started by
5 comments, last by Zoner 11 years, 11 months ago
Hey everyone,

So I'm working on a relatively simple SSE function. Essentially, I give it three inputs and compute three outputs based on the inputs. The issue is that one of these outputs has three potential values. For instance:
Output = < a1, b, c > if component 1 is the largest
Output = < a2, b, c > if component 2 is the largest
Output = < a3, b, c > if component 3 is the largest

Now my main goals is to be able to implement this completely without branching. To do this I use SSE to compute all three possible outputs <a1, a2, a3 > in parallel. This part works perfectly and I've verified its correct.

My problem is that I then need to choose an output value for a without branching. I initially tried to do this completely without using any SSE cmp instructions because I figured that intrinsic instructions like max, min, cmplt, etc would all use branching in some fashion behind the scenes. But I've been reading up on a few sites (Intel, lecture slides, etc) and I'm not so sure that these instructions actually use branching, which kind of blows my mind. I was hoping that someone on here would have a definitive answer they could give me.

If the SSE comparison instructions DO use branches, does anyone have an idea for how I could select my output without branching? I've been working on this for like the past 24 hours and am almost at my wits end.

If anyone needs additional info just let me know and I can answer your questions.
Advertisement
There are a number of SSE instructions that generate bit masks (all ones or all zeros) depending on the result of a comparison. These masks are then used to choose between two options.

An example using intrinsics:

// result = (lhs < rhs) ? option1 : option2;
__m128 mask = _mm_cmplt_ps(lhs, rhs);
__m128 result = _mm_or_ps(_mm_and_ps(mask, option1), _mm_andnot_ps(mask, option2));

There are a number of SSE instructions that generate bit masks (all ones or all zeros) depending on the result of a comparison. These masks are then used to choose between two options.

An example using intrinsics:

// result = (lhs < rhs) ? option1 : option2;
__m128 mask = _mm_cmplt_ps(lhs, rhs);
__m128 result = _mm_or_ps(_mm_and_ps(mask, option1), _mm_andnot_ps(mask, option2));


Right, I saw that. But here's what concerns me:
// result = (lhs < rhs) ? option 1 : option2

How is that not a branch? Does SSE do some special bit algorithm? Otherwise I can't see how that's not the same as:

if ( lhs < rhs )
result = option1
else
result = option 2
That line is a comment. It's just explaining what the branchless code is up to.
A branch means moving the program counter forwards (beyond the next instruction, typically) or backwards. There's no branching going on. _mm_cmplt_ps is an instruction that can be viewed as taking two inputs and generating a single output. Once that instruction has finished, the program counter advances to the next instruction. No branch.

The same holds for the other instructions/intrinsics used in the example.

Here's the same algorithm for scalar variables in regular C:


const unsigned choice_masks[2] = { 0, (unsigned)-1 };

unsigned mask = choice_masks[lhs < rhs];
unsigned result = (mask & option1) | ((~mask) & option2); // assuming option1 and option2 are also of type unsigned


No conditionals, no branching.

That line is a comment. It's just explaining what the branchless code is up to.


A branch means moving the program counter forwards (beyond the next instruction, typically) or backwards. There's no branching going on. _mm_cmplt_ps is an instruction that can be viewed as taking two inputs and generating a single output. Once that instruction has finished, the program counter advances to the next instruction. No branch.

The same holds for the other instructions/intrinsics used in the example.

Here's the same algorithm for scalar variables in regular C:


const unsigned choice_masks[2] = { 0, (unsigned)-1 };

unsigned mask = choice_masks[lhs < rhs];
unsigned result = (mask & option1) | ((~mask) & option2); // assuming option1 and option2 are also of type unsigned


No conditionals, no branching.

Fantastic, that really clears things up for me and will let me eliminate a lot of gross bit manipulations I had to do. Thanks for clearing this up for me guys!
The 'move mask' instruction is basically the only simd instruction that leads up to a 'real branch' in that the mask is then testable with the condition registers on the main CPU.


All the other functions that look like they branch are just setting per-component masks in the SIMD register and aren't real branches that mess with the CPU pipeline. i.e. _mm_gt_ps for vectorized floating point greater than, sets 0xFFFFFFFF or 0 in each slot in the register, so you can do permuting (via blend instructions, or and and nand/or on older cpus) with other data to generate a combined correct answer without any branches.
http://www.gearboxsoftware.com/

This topic is closed to new replies.

Advertisement