SSE Bit Shift?

3 comments, last by scratt 15 years, 3 months ago
As far as I can tell, there are no instructions in the various SSE versions to do a bit shift by a non-uniform amount. E.g.:

    o0 = i0 << n0
    o1 = i1 << n1
    o2 = i2 << n2
    o3 = i3 << n3

There are only instructions to shift all packed members by a single amount (i.e. n0 = n1 = n2 = n3). Are there any relatively fast ways to emulate this behavior? By "fast" I mean much faster than extracting each packed member and performing the shifting on the main CPU ALU.
pmulld (SSE4.1), since a left shift by n is the same as a multiply by 2^n; or 4 shifts + masking (the obvious way). Perhaps 16/32-bit muls if constraints allow.
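For reference, here is a rough sketch of how the pmulld route might look with intrinsics. Nothing in it comes from the original posts: the function name `shl_var_pmulld` is made up, the trick of building 2^n through the float exponent field is just one common way to get the multipliers, and the sketch assumes every shift count is between 0 and 30. The `target` attribute is a GCC/Clang extension used here so the file compiles without extra flags.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_mullo_epi32 (pmulld) */
#include <stdint.h>

/* Hypothetical helper, not from the thread.
   out[i] = in[i] << n[i], assuming 0 <= n[i] <= 30.
   The target attribute is a GCC/Clang extension; with other compilers,
   enable SSE4.1 through the build flags instead. */
__attribute__((target("sse4.1")))
void shl_var_pmulld(const uint32_t in[4], const uint32_t n[4], uint32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i c = _mm_loadu_si128((const __m128i *)n);

    /* Build the float 2^n in each lane by writing n + 127 into the
       exponent field, then convert that float to the integer 2^n. */
    __m128i bits = _mm_slli_epi32(_mm_add_epi32(c, _mm_set1_epi32(127)), 23);
    __m128i pow2 = _mm_cvttps_epi32(_mm_castsi128_ps(bits));

    /* The low 32 bits of v * 2^n are exactly v << n. */
    _mm_storeu_si128((__m128i *)out, _mm_mullo_epi32(v, pow2));
}
```

If the shift counts are fixed ahead of time, the 2^n vector can presumably be precomputed once, which leaves a single multiply per shift.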
Ugh. 4 separate shifts is slow. In the algorithm I'm trying to port to SSE, the functions I wrote to act as an SSE version of ROL take 68% of the running time; in contrast, the multiply function, which does 4x4 int32 multiplication (using several multiplies and shuffles), takes 18%, and the rest (a combination of arithmetic, logical, and uniform shift ops) takes the remaining 14%.

So yeah, using 4 separate shifts is very slow. At least 10x as slow as uniform shifting.
You've asked for a shift, not a rol. And yes, surprisingly, doing 4x{shift, and, or} takes about twelve times more cycles than a single shift.
Again, if you're that desperate, there's pmulld.
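To make the 4x{shift, and, or} variant concrete, here is a minimal SSE2-only sketch. It is only an illustration under assumptions: the name `shl_var_sse2` and the array-based interface are invented for this example and are not the code either poster was timing.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper, not from the thread.
   out[i] = in[i] << n[i], done as four uniform shifts, each masked
   down to the one lane it is meant for, then ORed together. */
void shl_var_sse2(const uint32_t in[4], const uint32_t n[4], uint32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* One all-ones lane per mask; _mm_set_epi32 lists lane 3 first. */
    const __m128i m0 = _mm_set_epi32(0, 0, 0, -1);
    const __m128i m1 = _mm_set_epi32(0, 0, -1, 0);
    const __m128i m2 = _mm_set_epi32(0, -1, 0, 0);
    const __m128i m3 = _mm_set_epi32(-1, 0, 0, 0);

    /* _mm_sll_epi32 shifts every lane by the count held in the low
       64 bits of its second operand. */
    __m128i r0 = _mm_and_si128(_mm_sll_epi32(v, _mm_cvtsi32_si128((int)n[0])), m0);
    __m128i r1 = _mm_and_si128(_mm_sll_epi32(v, _mm_cvtsi32_si128((int)n[1])), m1);
    __m128i r2 = _mm_and_si128(_mm_sll_epi32(v, _mm_cvtsi32_si128((int)n[2])), m2);
    __m128i r3 = _mm_and_si128(_mm_sll_epi32(v, _mm_cvtsi32_si128((int)n[3])), m3);

    __m128i r = _mm_or_si128(_mm_or_si128(r0, r1), _mm_or_si128(r2, r3));
    _mm_storeu_si128((__m128i *)out, r);
}
```

That is four shifts, four ANDs and three ORs for one variable shift, which is in line with it costing roughly an order of magnitude more than a single uniform shift.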
I don't actually understand exactly what you want to do. Can you be more clear?
Perhaps code what you want in a serial form as an example.
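A serial form might look something like the following. This is a guess based on the first post and the later mention of ROL, not the original poster's code: the names `shl4_serial`, `rol32` and `rol4_serial` are hypothetical, and masking the counts to 0..31 is an assumption made to keep the C well defined.

```c
#include <stdint.h>

/* Hypothetical serial reference: out[k] = in[k] << n[k] per element.
   Counts are masked to 0..31 (an assumption) to avoid undefined shifts. */
void shl4_serial(const uint32_t in[4], const uint32_t n[4], uint32_t out[4])
{
    for (int k = 0; k < 4; ++k)
        out[k] = in[k] << (n[k] & 31u);
}

/* Rotate left by n bits; the second mask keeps n == 0 well defined. */
uint32_t rol32(uint32_t x, uint32_t n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Hypothetical serial reference for a per-element ROL. */
void rol4_serial(const uint32_t in[4], const uint32_t n[4], uint32_t out[4])
{
    for (int k = 0; k < 4; ++k)
        out[k] = rol32(in[k], n[k]);
}
```

A scalar version like this is also handy as a baseline for checking any SSE variant lane by lane.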

Also what version of SSE are you coding for?
Feel free to 'rate me down', especially when I prove you wrong, because it will make you feel better for a second....

