# Critique This SSE2 Code

This topic is 4089 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

So, I was bored and... well, I decided to write something with 4x SSE2 (my first program with SSE2). I decided I'd write an implementation of RC6, as it appeared pretty parallelizable - the algorithm has two math pipelines with two values each per block and blocks are independent, so my implementation could process two blocks in one pass to use 4x parallel math. Those of you more familiar with SSE2 and know the RC6 algorithm might be starting to laugh right about now. It turned out (I found out a fair ways in) that RC6 uses two operations that aren't supported in packed math - 4x parallel multiply, and 4x parallel shift with a different amount for each value in the vector. Anyway, I did my best (at least for a first draft), and came up with this:
#define ALIGN16 __declspec(align(16))
#define ROUNDS 20
#define SCHED_SIZE (2 * ROUNDS + 4)

typedef unsigned char BYTE;
typedef unsigned long WORD;	// Unsigned 32-bit integer
typedef WORD KEY[SCHED_SIZE];

// Performs an ABCD + ABCD -> ACAC + BDBD permutation
inline void DeinterleaveBlocks(__m128i ABCD1, __m128i ABCD2, __m128i &ACAC12, __m128i &BDBD12)
{
__m128i AABB = _mm_unpacklo_epi32(ABCD1, ABCD2),
CCDD = _mm_unpackhi_epi32(ABCD1, ABCD2);

ACAC12 = _mm_unpacklo_epi32(AABB, CCDD);
BDBD12 = _mm_unpackhi_epi32(AABB, CCDD);
}

// Simply a vector version of _rotl
inline __m128i _mm_rotl_epi32(__m128i x, int bits)
{
assert(bits <= 32);

return _mm_or_si128(_mm_slli_epi32(x, bits), _mm_srli_epi32(x, 32 - bits));
}

void EncryptTwoBlocksSIMD(WORD lpBlock[8], WORD lpKey[SCHED_SIZE])
{
__m128i *lpBlocks = (__m128i *)lpBlock;

// Load the constants - initial and final adjustment values, set of 1s, and the post-multiply mask
ALIGN16 static const WORD nOnes[4] = {1, 1, 1, 1},
nMulMask[4] = {WORD(-1), 0, WORD(-1), 0};
__m128i S0101 = _mm_shuffle_epi32(_mm_loadl_epi64((__m128i *)lpKey), 0x44),
S_3434 = _mm_shuffle_epi32(_mm_loadl_epi64((__m128i *)&lpKey[SCHED_SIZE - 2]), 0x44),

// Segregate As and Cs from Bs and Ds
__m128i ACAC12, BDBD12;
DeinterleaveBlocks(ABCD1, ABCD2, ACAC12, BDBD12);

// Pre-encryption processing

for (size_t i = 1; i <= ROUNDS; i++)
{
// Initial (pre-rotate) values
__m128i tutu12 = _mm_slli_epi32(BDBD12, 1);	// Multiply by 2

// There's no packed multiply, so we have have to do it the hard way: 32x32 = 64 multiplication twice. That'll give us BXBX + DXDX. Then shift and OR them together to get the BDBD12 we want. What a pain.
__m128i mulLow = _mm_and_si128(_mm_mul_epu32(BDBD12, tutu12), mulMask),
mulHigh = _mm_and_si128(_mm_mul_epu32(_mm_srli_si128(BDBD12, 4), _mm_srli_si128(tutu12, 4)), mulMask);

tutu12 = _mm_or_si128(mulLow, _mm_slli_si128(mulHigh, 4));
// Rotate left
tutu12 = _mm_rotl_epi32(tutu12, 5);

// Compute the new ACAC
ACAC12 = _mm_xor_si128(ACAC12, tutu12);

// Now the really sucky part: there's no parallel shift, so we have to save it to memory and do it with integer math, then load it again.
ALIGN16 WORD ntutu12[4], nACAC12[4];
_mm_store_si128((__m128i *)nACAC12, ACAC12);
_mm_store_si128((__m128i *)ntutu12, tutu12);

nACAC12[0] = _rotl(nACAC12[0], ntutu12[1]);
nACAC12[1] = _rotl(nACAC12[1], ntutu12[0]);
nACAC12[2] = _rotl(nACAC12[2], ntutu12[3]);
nACAC12[3] = _rotl(nACAC12[3], ntutu12[2]);

// Now add the stuff from the key schedule
__m128i S2i0101 = _mm_shuffle_epi32(_mm_loadl_epi64((__m128i *)&lpKey[i * 2]), 0x44);

// Do the shift: BDBD becomes ACAC, ACAC becomes DBDB (swap order)
__m128i temp12 = BDBD12;
BDBD12 = _mm_shuffle_epi32(ACAC12, 0xB1);
ACAC12 = temp12;
}

// Post-encryption processing

// Reassemble the original blocks and store them
ABCD1 = _mm_unpacklo_epi32(ACAC12, BDBD12);
ABCD2 = _mm_unpackhi_epi32(ACAC12, BDBD12);

_mm_store_si128(&lpBlocks[0], ABCD1);
_mm_store_si128(&lpBlocks[1], ABCD2);
}

This implementation has been verified for correctness. I haven't yet gone through and pinched cycles yet, though you're welcome to comment about stuff like that if you want (especially inside the loop). What I most need to know is if there's any better way to accomplish the 4x parallel integer multiplication, and the 4x parallel shift by variable amount. And no, I haven't benchmarked the code to see if it's any faster than the regular version, yet :P And yes, I do get frequent hate mail from my coworkers because I use word-wrap. [Edited by - Catafriggm on May 9, 2007 1:02:09 AM]

##### Share on other sites
[ source ] > [ code ] at not table breaking

##### Share on other sites
I'm not familiar with RC6 and I haven't looked over your code in detail but have you looked at implementing the 4 way parallel shift using multiplies by 2^n rather than left shifts by n? You'll have to do a fair bit of masking and shuffling to get things in the right place but I'd guess it's probably faster than storing to memory and using the integer registers to do the shifts.

1. 1
Rutin
26
2. 2
3. 3
JoeJ
20
4. 4
5. 5

• 10
• 10
• 9
• 9
• 10
• ### Forum Statistics

• Total Topics
631752
• Total Posts
3002091
×