# SSE Bit Shift?


## Recommended Posts

As far as I can tell, there are no instructions in the various SSE versions to do a bit shift by a non-uniform amount, i.e.:

`o0 = i0 << n0`
`o1 = i1 << n1`
`o2 = i2 << n2`
`o3 = i3 << n3`

There are only instructions to shift all packed members by a single amount (i.e. n0 = n1 = n2 = n3). Are there any relatively fast ways to emulate this behavior? By "fast" I mean much faster than extracting each packed member and performing the shift on the main CPU ALU.
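For reference, the operation being asked about can be written serially as below. This is only a sketch of the intended semantics; the function name `shift_left_4x32_scalar` and the array-based interface are assumptions made for illustration, and the shift counts are assumed to be in the range 0..31 (larger counts are undefined behavior in C).

```c
#include <stdint.h>

/* Scalar reference: shift each 32-bit lane left by its own count.
   Hypothetical helper name; assumes every n[k] is in 0..31. */
void shift_left_4x32_scalar(const uint32_t in[4], const uint32_t n[4],
                            uint32_t out[4])
{
    for (int k = 0; k < 4; ++k)
        out[k] = in[k] << n[k];
}
```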

##### Share on other sites
pmulld (SSE4.1), i.e. multiply each lane by its own power of two, or 4 shifts + masking (the obvious way). Perhaps 16/32-bit muls if constraints allow.
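A minimal sketch of the pmulld idea, since a left shift by n is the same as a multiply by 2^n modulo 2^32. The way the per-lane powers of two are built here (writing n into a float exponent field and converting back to integer) is an assumption of this sketch, not something stated in the thread; the function names are hypothetical, and counts are assumed to be in 0..31. The `target` attribute is a GCC/Clang convenience so the block compiles without `-msse4.1`.

```c
#include <stdint.h>
#include <immintrin.h>

#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Per-lane left shift via multiply: v << n == v * 2^n (mod 2^32).
   Hypothetical helper; assumes each lane of n is in 0..31. */
TARGET_SSE41
static __m128i sllv_epi32_sse41(__m128i v, __m128i n)
{
    /* Put n into the exponent field of a float, giving 2.0^n per lane. */
    __m128i bits = _mm_add_epi32(_mm_slli_epi32(n, 23),
                                 _mm_set1_epi32(0x3f800000));
    /* Convert to integer. For n == 31 the conversion overflows and yields
       0x80000000, which happens to be the bit pattern of 2^31. */
    __m128i pow2 = _mm_cvttps_epi32(_mm_castsi128_ps(bits));
    /* pmulld: low 32 bits of each 32x32 product. */
    return _mm_mullo_epi32(v, pow2);
}

/* Array wrapper (hypothetical) for easy calling and testing. */
TARGET_SSE41
void shift_left_4x32_sse41(const uint32_t in[4], const uint32_t n[4],
                           uint32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i c = _mm_loadu_si128((const __m128i *)n);
    _mm_storeu_si128((__m128i *)out, sllv_epi32_sse41(v, c));
}
```

If the shift counts are known ahead of time, the powers of two can be precomputed once and the whole shift reduces to a single `_mm_mullo_epi32`.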

##### Share on other sites
Ugh. 4 separate shifts is slow. In the algorithm I'm trying to port to SSE, the functions I wrote to act like an SSE version of ROL take 68% of the running time; in contrast, the multiply function, which does a 4x4 int32 multiplication (using several multiplies and shuffles), takes 18%, and the rest (a combination of arithmetic, logical, and uniform shift ops) takes the remaining 14%.

So yeah, using 4 separate shifts is very slow. At least 10x as slow as uniform shifting.
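For comparison, the "4 shifts + masking" approach under discussion might look roughly like the following SSE2 sketch. The function name and the choice to pass the counts as a plain array are assumptions for illustration; if the counts live in an SSE register, extra shuffles are needed to move each one into the low bits of a count operand, which adds to the cost.

```c
#include <stdint.h>
#include <emmintrin.h>

/* Per-lane left shift using four uniform shifts, then masking and
   merging. Hypothetical helper name; counts are read from memory. */
void shift_left_4x32_masked(const uint32_t in[4], const uint32_t n[4],
                            uint32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* Shift the whole vector once per count (count sits in the low 64 bits). */
    __m128i r0 = _mm_sll_epi32(v, _mm_cvtsi32_si128((int)n[0]));
    __m128i r1 = _mm_sll_epi32(v, _mm_cvtsi32_si128((int)n[1]));
    __m128i r2 = _mm_sll_epi32(v, _mm_cvtsi32_si128((int)n[2]));
    __m128i r3 = _mm_sll_epi32(v, _mm_cvtsi32_si128((int)n[3]));

    /* Keep only lane k from shift result k. */
    r0 = _mm_and_si128(r0, _mm_set_epi32(0, 0, 0, -1));
    r1 = _mm_and_si128(r1, _mm_set_epi32(0, 0, -1, 0));
    r2 = _mm_and_si128(r2, _mm_set_epi32(0, -1, 0, 0));
    r3 = _mm_and_si128(r3, _mm_set_epi32(-1, 0, 0, 0));

    /* Merge the four lanes back into one vector. */
    __m128i r = _mm_or_si128(_mm_or_si128(r0, r1), _mm_or_si128(r2, r3));
    _mm_storeu_si128((__m128i *)out, r);
}
```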

##### Share on other sites
You've asked for a shift, not a rol. And yes, surprisingly, doing 4x{shift, and, or} takes about twelve times more cycles than a single shift.
Again, if you're that desperate, there's pmulld.
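Since the underlying goal turned out to be a rotate, here is a hedged sketch of how the multiply trick could extend to a per-lane ROL. It swaps pmulld for the SSE2 widening multiply `_mm_mul_epu32` (pmuludq), which is not what the thread proposes: the 64-bit product x * 2^n holds `x << n` in its low half and `x >> (32 - n)` in its high half, so OR-ing the halves gives the rotation. Function names are hypothetical and counts are assumed to be in 0..31.

```c
#include <stdint.h>
#include <emmintrin.h>

/* 2^n per 32-bit lane via the float exponent field (assumes n in 0..31;
   n == 31 relies on the overflow result 0x80000000). */
static __m128i pow2_epi32(__m128i n)
{
    __m128i bits = _mm_add_epi32(_mm_slli_epi32(n, 23),
                                 _mm_set1_epi32(0x3f800000));
    return _mm_cvttps_epi32(_mm_castsi128_ps(bits));
}

/* Per-lane rotate left using 32x32 -> 64 bit unsigned multiplies. */
static __m128i rolv_epi32_sse2(__m128i v, __m128i n)
{
    __m128i p = pow2_epi32(n);
    /* 64-bit products for lanes 0 and 2, then for lanes 1 and 3. */
    __m128i even = _mm_mul_epu32(v, p);
    __m128i odd  = _mm_mul_epu32(_mm_srli_epi64(v, 32),
                                 _mm_srli_epi64(p, 32));
    /* OR high and low halves: result lands in the low half for even
       lanes and in the high half for odd lanes. */
    even = _mm_or_si128(even, _mm_srli_epi64(even, 32));
    odd  = _mm_or_si128(odd, _mm_slli_epi64(odd, 32));
    /* Interleave: lanes 0,2 from even, lanes 1,3 from odd. */
    __m128i lo_mask = _mm_set_epi32(0, -1, 0, -1);
    return _mm_or_si128(_mm_and_si128(even, lo_mask),
                        _mm_andnot_si128(lo_mask, odd));
}

/* Array wrapper (hypothetical) for easy calling and testing. */
void rotate_left_4x32(const uint32_t in[4], const uint32_t n[4],
                      uint32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i c = _mm_loadu_si128((const __m128i *)n);
    _mm_storeu_si128((__m128i *)out, rolv_epi32_sse2(v, c));
}
```

Whether this beats the shift-and-mask version depends on the target CPU's multiply latency, so it is worth profiling rather than assuming.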

##### Share on other sites
I don't understand exactly what you want to do. Can you be clearer?
Perhaps write what you want in serial form as an example.

Also what version of SSE are you coding for?
