SSE Bit Shift?
As far as I can tell, there are no instructions in the various SSE versions to do a bit shift by a non-uniform amount. E.g.:
o0 = i0 << n0
o1 = i1 << n1
o2 = i2 << n2
o3 = i3 << n3
There are only instructions to shift all packed members by a single amount (i.e. n0 = n1 = n2 = n3).
Are there any relatively fast ways to emulate this behavior? By "fast" I mean much faster than extracting each packed member and performing the shifting on the main CPU ALU.
Ugh. Four separate shifts is slow. In the algorithm I'm trying to make an SSE version of, the functions I wrote to act like an SSE version of ROL take 68% of the running time; in contrast, the multiply function, which does 4x4 int32 multiplication (using several multiplies and shuffles), takes 18%, and the rest (a combination of arithmetic, logical, and uniform shift ops) takes the remaining 14%.
So yeah, using 4 separate shifts is very slow. At least 10x as slow as uniform shifting.
You've asked for a shift, not a rol. And yes, surprisingly, doing 4x{shift, and, or} takes about twelve times more cycles than a single shift.
Again, if you're that desperate, there's pmulld (SSE4.1): multiplying each element by 2^n gives the same result as shifting it left by n.
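For anyone who wants to try the multiply route, here is a rough sketch in C intrinsics. It is not code from this thread: the names `pow2_epi32`, `mullo_epi32`, `sllv_epi32` and `sllv_array` are made up for illustration, and it assumes every shift count is in the range 0..30. The power of two is built by writing n + 127 into a float's exponent field and converting back to an integer. When SSE4.1 is not enabled at compile time, the sketch falls back to the usual SSE2 pair of pmuludq multiplies plus shuffles.

```c
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#ifdef __SSE4_1__
#include <smmintrin.h>   /* SSE4.1: _mm_mullo_epi32 (pmulld) */
#endif

/* Hypothetical helper: 2^n in each 32-bit lane.
   Assumption: every n is in 0..30. */
static __m128i pow2_epi32(__m128i n)
{
    /* (n + 127) << 23 is the IEEE-754 bit pattern of the float 2^n. */
    __m128i bits = _mm_slli_epi32(_mm_add_epi32(n, _mm_set1_epi32(127)), 23);
    /* Converting that float to an integer is exact for 2^0 .. 2^30. */
    return _mm_cvttps_epi32(_mm_castsi128_ps(bits));
}

/* Hypothetical helper: low 32 bits of each lane-wise product. */
static __m128i mullo_epi32(__m128i a, __m128i b)
{
#ifdef __SSE4_1__
    return _mm_mullo_epi32(a, b);                 /* single pmulld */
#else
    /* SSE2 fallback: pmuludq on lanes 0 and 2, then on lanes 1 and 3. */
    __m128i even = _mm_mul_epu32(a, b);
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));
    /* Keep the low halves of the 64-bit products and interleave them. */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    return _mm_unpacklo_epi32(even, odd);
#endif
}

/* Per-lane left shift: out[i] = x[i] << n[i], done as x[i] * 2^n[i]. */
static __m128i sllv_epi32(__m128i x, __m128i n)
{
    return mullo_epi32(x, pow2_epi32(n));
}

/* Array wrapper so the sketch can be called without touching __m128i. */
void sllv_array(const uint32_t x[4], const uint32_t n[4], uint32_t out[4])
{
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vn = _mm_loadu_si128((const __m128i *)n);
    _mm_storeu_si128((__m128i *)out, sllv_epi32(vx, vn));
}
```

Calling `sllv_array` with x = {1, 3, 0x80000001, 0xFFFFFFFF} and n = {0, 5, 1, 30} should produce the same values as four scalar `<<` operations on 32-bit unsigned integers. Whether this beats four separate shifts depends on the CPU, so it is worth profiling before committing to it.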
I actually don't understand exactly what you want to do. Can you be clearer?
Perhaps code what you want in a serial form as an example.
Also what version of SSE are you coding for?
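For reference, a serial form of what the original post seems to describe might look like the sketch below. This is a guess at the intent rather than the poster's actual code: the function names `shift_left_4` and `rotate_left_4` are invented here, the elements are assumed to be 32-bit unsigned integers, and the shift counts are assumed to be in 0..31. The rotate variant is included because the thread mentions ROL.

```c
#include <stdint.h>

/* Per-element left shift: out[i] = in[i] << n[i].
   Assumption: every n[i] is in 0..31. */
void shift_left_4(const uint32_t in[4], const uint32_t n[4], uint32_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = in[i] << n[i];
}

/* Per-element rotate left (ROL) by n[i] bits, taken modulo 32. */
void rotate_left_4(const uint32_t in[4], const uint32_t n[4], uint32_t out[4])
{
    for (int i = 0; i < 4; ++i) {
        uint32_t s = n[i] & 31u;
        /* The "& 31" on the right shift avoids an undefined shift by 32
           when s is 0. */
        out[i] = (in[i] << s) | (in[i] >> ((32u - s) & 31u));
    }
}
```

A version like this is also handy as a correctness check for any SSE replacement: feed both the same inputs and compare the four outputs.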