Sign in to follow this  

To the assembly gurus: more efficient bit rotates with SSE2?

Recommended Posts

Hi, I'm new to GD, so sorry if I'm posting in the wrong board. To the point: I'm writing some performance-critical code that does SHA-1 hashing. I'm using integer SSE2 to do four SHA-1 hashes at once (so each XMM reg holds four 32-bit state variables; thus the algorithm doesn't differ much from the scalar version). Anyways, a thing that bugs me is that you can't perform efficient bit rotates. Since you don't have the handy ROR/ROL instructions for XMM regs, you need to do it this way #define ROTATELEFT(x, bits) ((x << bits) | (x >> (32 - bits)) which costs 4 instructions and clobbers an extra temp register. So I am looking for a faster way to do these bit rotations. The general case seems to be unsolvable. However, in the SHA-1 algorithm, one of the steps is a bit rotate left by a single bit. Basically, you need to get a snapshot of the sign bits of the doublewords in the XMM reg, shift the reg to the left 1 bit, and update the LSBs with the snapshot. So I came up with a scheme like this: 1. Get the MSBs of the four doublewords in the XMM and store them in, e.g. EAX 2. Shift left the XMM one bit 3. Use EAX to index a lookup table with the correct m128 vector to OR with the XMM. I'm stuck at step 1. I can't find an efficient way to fetch those MSBs. There is the instruction PMOVMSKB, which fetches the MSBs of all 16 bytes of the XMM, however it gives me way too much information; a 65536-entry lookup table would be too costly to do. If there's a clever way to fetch only the bits I need, so that I can use a 16-entry lookup, it would be great. I think I might be missing something important here, so I'm open to any suggestions.

Share this post

Link to post
Share on other sites
Hmm, samoth you're probably right... I wonder where can one see an accurate instruction clock timings reference anyway (since the guys here: <> disapprove intel's official manuals)?

In my case, instruction combining is not an issue at all. The SHA1 algorithm allows a lot of stuff to be done independently, so different calculations combine quite nicely for the most part. I was more concerned about the clobbered scratch register.

Anyway, I went on to implement the following suboptimal algorithm:

1. Use PMOVMSKB to fetch MSBs.
2. Mask it so that only bits I care about remain
3. Use this value to index the lookup table.

Using a few tricks, step 3 uses no additional instructions, and doesn't pollute the cache much (the LUT is 128K, but only 16 entries of it are ever used). So it's still 4 instructions to do a bit rotate, and no additional register is needed. Still, the performance is a bit lower on the i7 (haven't got time to test on other PCs). I guess if no better idea arrives and benchmarks don't support the "optimization", I should scrape this approach altogether.

Share this post

Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this