Sign in to follow this  

Critique This SSE2 Code

Recommended Posts

Catafriggm    296
So, I was bored and... well, I decided to write something with 4x SSE2 (my first program with SSE2). I decided I'd write an implementation of RC6, as it appeared pretty parallelizable - the algorithm has two math pipelines with two values each per block and blocks are independent, so my implementation could process two blocks in one pass to use 4x parallel math. Those of you more familiar with SSE2 and know the RC6 algorithm might be starting to laugh right about now. It turned out (I found out a fair ways in) that RC6 uses two operations that aren't supported in packed math - 4x parallel multiply, and 4x parallel shift with a different amount for each value in the vector. Anyway, I did my best (at least for a first draft), and came up with this:
#define ALIGN16 __declspec(align(16))
#define ROUNDS 20
#define SCHED_SIZE (2 * ROUNDS + 4)

typedef unsigned char BYTE;
typedef unsigned long WORD;	// Unsigned 32-bit integer

// Performs an ABCD + ABCD -> ACAC + BDBD permutation
inline void DeinterleaveBlocks(__m128i ABCD1, __m128i ABCD2, __m128i &ACAC12, __m128i &BDBD12)
	__m128i AABB = _mm_unpacklo_epi32(ABCD1, ABCD2),
		CCDD = _mm_unpackhi_epi32(ABCD1, ABCD2);

	ACAC12 = _mm_unpacklo_epi32(AABB, CCDD);
	BDBD12 = _mm_unpackhi_epi32(AABB, CCDD);

// Simply a vector version of _rotl
inline __m128i _mm_rotl_epi32(__m128i x, int bits)
	assert(bits <= 32);

	return _mm_or_si128(_mm_slli_epi32(x, bits), _mm_srli_epi32(x, 32 - bits));

void EncryptTwoBlocksSIMD(WORD lpBlock[8], WORD lpKey[SCHED_SIZE])
	__m128i *lpBlocks = (__m128i *)lpBlock;

	// Load the constants - initial and final adjustment values, set of 1s, and the post-multiply mask
	ALIGN16 static const WORD nOnes[4] = {1, 1, 1, 1},
		nMulMask[4] = {WORD(-1), 0, WORD(-1), 0};
	__m128i S0101 = _mm_shuffle_epi32(_mm_loadl_epi64((__m128i *)lpKey), 0x44),
		S_3434 = _mm_shuffle_epi32(_mm_loadl_epi64((__m128i *)&lpKey[SCHED_SIZE - 2]), 0x44),
		ones = _mm_load_si128((__m128i *)nOnes),
		mulMask = _mm_load_si128((__m128i *)nMulMask);

	// Load the two blocks
	__m128i ABCD1 = _mm_load_si128(&lpBlocks[0]),
		ABCD2 = _mm_load_si128(&lpBlocks[1]);

	// Segregate As and Cs from Bs and Ds
	__m128i ACAC12, BDBD12;
	DeinterleaveBlocks(ABCD1, ABCD2, ACAC12, BDBD12);

	// Pre-encryption processing
	BDBD12 = _mm_add_epi32(BDBD12, S0101);

	for (size_t i = 1; i <= ROUNDS; i++)
		// Initial (pre-rotate) values
		__m128i tutu12 = _mm_slli_epi32(BDBD12, 1);	// Multiply by 2
		tutu12 = _mm_add_epi32(tutu12, ones);	// Add 1
		// There's no packed multiply, so we have have to do it the hard way: 32x32 = 64 multiplication twice. That'll give us BXBX + DXDX. Then shift and OR them together to get the BDBD12 we want. What a pain.
		__m128i mulLow = _mm_and_si128(_mm_mul_epu32(BDBD12, tutu12), mulMask),
			mulHigh = _mm_and_si128(_mm_mul_epu32(_mm_srli_si128(BDBD12, 4), _mm_srli_si128(tutu12, 4)), mulMask);

		tutu12 = _mm_or_si128(mulLow, _mm_slli_si128(mulHigh, 4));
		// Rotate left
		tutu12 = _mm_rotl_epi32(tutu12, 5);

		// Compute the new ACAC
		ACAC12 = _mm_xor_si128(ACAC12, tutu12);

		// Now the really sucky part: there's no parallel shift, so we have to save it to memory and do it with integer math, then load it again.
		ALIGN16 WORD ntutu12[4], nACAC12[4];
		_mm_store_si128((__m128i *)nACAC12, ACAC12);
		_mm_store_si128((__m128i *)ntutu12, tutu12);

		nACAC12[0] = _rotl(nACAC12[0], ntutu12[1]);
		nACAC12[1] = _rotl(nACAC12[1], ntutu12[0]);
		nACAC12[2] = _rotl(nACAC12[2], ntutu12[3]);
		nACAC12[3] = _rotl(nACAC12[3], ntutu12[2]);

		ACAC12 = _mm_load_si128((__m128i *)nACAC12);

		// Now add the stuff from the key schedule
		__m128i S2i0101 = _mm_shuffle_epi32(_mm_loadl_epi64((__m128i *)&lpKey[i * 2]), 0x44);

		ACAC12 = _mm_add_epi32(ACAC12, S2i0101);

		// Do the shift: BDBD becomes ACAC, ACAC becomes DBDB (swap order)
		__m128i temp12 = BDBD12;
		BDBD12 = _mm_shuffle_epi32(ACAC12, 0xB1);
		ACAC12 = temp12;

	// Post-encryption processing
	ACAC12 = _mm_add_epi32(ACAC12, S_3434);

	// Reassemble the original blocks and store them
	ABCD1 = _mm_unpacklo_epi32(ACAC12, BDBD12);
	ABCD2 = _mm_unpackhi_epi32(ACAC12, BDBD12);

	_mm_store_si128(&lpBlocks[0], ABCD1);
	_mm_store_si128(&lpBlocks[1], ABCD2);
This implementation has been verified for correctness. I haven't yet gone through and pinched cycles yet, though you're welcome to comment about stuff like that if you want (especially inside the loop). What I most need to know is if there's any better way to accomplish the 4x parallel integer multiplication, and the 4x parallel shift by variable amount. And no, I haven't benchmarked the code to see if it's any faster than the regular version, yet :P And yes, I do get frequent hate mail from my coworkers because I use word-wrap. [Edited by - Catafriggm on May 9, 2007 1:02:09 AM]

Share this post

Link to post
Share on other sites
mattnewport    1038
I'm not familiar with RC6 and I haven't looked over your code in detail but have you looked at implementing the 4 way parallel shift using multiplies by 2^n rather than left shifts by n? You'll have to do a fair bit of masking and shuffling to get things in the right place but I'd guess it's probably faster than storing to memory and using the integer registers to do the shifts.

Share this post

Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this