movsd vs SSE2 implementation


I'm trying to optimize memory copying. I'm comparing __movsd and the following SSE2 code:
#include <emmintrin.h> // SSE2 intrinsics
#include <cstddef>

// Copies count bytes in 96-byte (16*6) chunks, interleaving loads with
// streaming stores. Both pointers must be 16-byte aligned; any remainder
// beyond a multiple of 96 bytes is left uncopied.
void TestSSE2MovInterleave(unsigned char* pDest, unsigned char* pSrc, std::size_t count) {
	__m128i* pD = reinterpret_cast<__m128i*>(pDest);
	__m128i* pS = reinterpret_cast<__m128i*>(pSrc);

	std::size_t times = count / (16*6);
	for(std::size_t i = 0; i < times; ++i) {
		__m128i val  = _mm_load_si128(pS);
		__m128i val2 = _mm_load_si128(pS+1);
		_mm_stream_si128(pD, val);
		__m128i val3 = _mm_load_si128(pS+2);
		_mm_stream_si128(pD+1, val2);
		__m128i val4 = _mm_load_si128(pS+3);
		_mm_stream_si128(pD+2, val3);
		__m128i val5 = _mm_load_si128(pS+4);
		_mm_stream_si128(pD+3, val4);
		__m128i val6 = _mm_load_si128(pS+5);

		_mm_stream_si128(pD+4, val5);
		_mm_stream_si128(pD+5, val6);

		pD += 6;
		pS += 6;
	}
	_mm_sfence(); // make the streaming stores globally visible before returning
}

However, the SSE2 version is faster when I move a lot of data, and __movsd is faster when I copy less data. Why is that? Do I need to make a separate implementation for when I move lots of data and one for when I move less?
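(For reference, by __movsd I mean MSVC's intrinsic from <intrin.h>, which compiles to rep movsd. A minimal sketch of that kind of baseline, assuming count is a multiple of 4; the TestMovsd wrapper is just illustrative:)

#include <intrin.h>  // MSVC-specific: __movsd compiles to rep movsd
#include <cstddef>

// Illustrative baseline wrapper: copies count bytes as 32-bit doublewords.
// Assumes count is a multiple of 4; leftover bytes would need __movsb.
void TestMovsd(unsigned char* pDest, unsigned char* pSrc, std::size_t count) {
	__movsd(reinterpret_cast<unsigned long*>(pDest),
	        reinterpret_cast<unsigned long*>(pSrc),
	        count / 4);
}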

I only recently discovered the technique you are using, from an AMD article (http://developer.amd.com/documentation/articles/pages/PerformanceOptimizationofWindowsApplicationsonAMDProcessors2.aspx), so I am not an expert. If I have understood correctly, __movsd keeps the data in the cache, while the other method bypasses it. If the buffer is small, then with __movsd the data will still be in the cache when you next use it, while with the streaming method it will be in main memory and you will have to load it again.

How does your code compare against the code from the AMD article?


#include <emmintrin.h> // SSE2 intrinsics (_mm_loadu_si128, _mm_stream_si128, _mm_mfence)
#include <string.h>

void nontemporal_copy(char* outbuff, char* inbuff, int size) {
    const int step = 64; // a handy unroll factor, equal to WC buffer size
    while (size > step) {
        _mm_prefetch(inbuff + 320, _MM_HINT_NTA); // non-temporal
        __m128i A = _mm_loadu_si128((__m128i*) (inbuff + 0));
        __m128i B = _mm_loadu_si128((__m128i*) (inbuff + 16));
        __m128i C = _mm_loadu_si128((__m128i*) (inbuff + 32));
        __m128i D = _mm_loadu_si128((__m128i*) (inbuff + 48));
        // destination must be 16-byte aligned for streaming store!
        _mm_stream_si128((__m128i*) (outbuff + 0), A);
        _mm_stream_si128((__m128i*) (outbuff + 16), B);
        _mm_stream_si128((__m128i*) (outbuff + 32), C);
        _mm_stream_si128((__m128i*) (outbuff + 48), D);
        inbuff += step;
        outbuff += step;
        size -= step;
    }
    _mm_mfence(); // ensure last WC buffers get flushed to memory
    memcpy(outbuff, inbuff, size); // copy the tail the loop leaves behind (up to 64 bytes)
}



From my understanding of these two approaches, __movsd is the native method for moving memory data and uses only a single instruction. The instruction is optimized but still limited by the available CPU resources, i.e., registers, caches, and memory addressing capabilities. So the burst rate is high for small memory copies.

On the SSE side, the CPU needs more cycles to prepare the execution state, which limits the burst rate, but after the environment is set up, the 128-bit (or 2 x 64-bit on older CPUs) memory loads and stores should give a huge performance boost in the long run.

Apatriarca is correct. movsd writes out through the cache, so if you write fewer bytes than fit in your cache, you are really testing your cache write bandwidth (you can measure L1 and L2/L3 bandwidth separately by writing up to the limit of each cache's size).
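Here is a minimal sketch of that kind of bandwidth test (an assumed harness for illustration, not code from the article; the sizes, rep count, and the memcpy stand-in are placeholders you would swap for your own routine and tune to your cache sizes):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Times a copy at a given buffer size and prints effective bandwidth.
// Swap std::memcpy for the routine under test (__movsd or the SSE2 version).
void TimeCopy(std::size_t bytes) {
    std::vector<unsigned char> src(bytes, 1), dst(bytes);
    const int reps = 1000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        std::memcpy(dst.data(), src.data(), bytes);
    auto t1 = std::chrono::steady_clock::now();
    volatile unsigned char sink = dst[bytes - 1]; // keep the copies from being optimized away
    (void)sink;
    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%.0f KB: %.2f GB/s\n", bytes / 1024.0, (double)bytes * reps / secs / 1e9);
}

int main() {
    TimeCopy(16 * 1024);         // well inside a typical L1
    TimeCopy(256 * 1024);        // inside a typical L2
    TimeCopy(64 * 1024 * 1024);  // far larger than any cache: main-memory bandwidth
}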

The streaming stores do not go through the cache at all, meaning that you write directly to main memory. For large copies this is a big win: writing via the cache means that each new cacheline you are about to write first requires a wasteful read of the entire line (the read-for-ownership), so you read once for every write on top of reading the source data. As rough arithmetic, a cached copy of N bytes moves about 3N bytes over the bus (read source, read destination lines, write them back), while a streaming copy moves about 2N.

pcwlai: I think you are thinking of the good ol' days when the FPU stack had to be saved and restored for the MMX registers. SSE does not really have any significant setup cost (excluding AoS to SoA conversion, where applicable, of course).

> Do I need to make a separate implementation for when I move lots of data and one for when I move less?

That would be a start, but really the important bit here is whether or not you intend to use the data shortly after the copy. If you do need to use the data afterwards, then movsd (or SSE using temporal/non-streaming stores) would be the best bet for small copies. If you don't intend to access the data, then streaming stores would be best even for small copies: although the copy may appear to take longer, you are just paying up front for the deferred (but still costly) write-back that would occur later anyway, and you avoid polluting your cache in the process. For large blocks, always use streaming stores, since only a fraction of the data fits in cache anyway.
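To make that concrete, here is a rough sketch of such a dispatch (SmartCopy, the reuse_soon flag, and the 256 KB threshold are all hypothetical placeholders to tune; it reuses the nontemporal_copy routine from the AMD article post above):

#include <cstddef>
#include <cstring>

// Defined in the AMD article code posted above: streaming (non-temporal) copy.
void nontemporal_copy(char* outbuff, char* inbuff, int size);

// Picks a copy strategy from size and intended reuse. reuse_soon means the
// caller expects to read the destination shortly after the copy.
void SmartCopy(unsigned char* dst, unsigned char* src, std::size_t count, bool reuse_soon) {
    const std::size_t kStreamingThreshold = 256 * 1024; // placeholder; tune against your cache sizes
    if (reuse_soon && count < kStreamingThreshold) {
        std::memcpy(dst, src, count); // temporal path: leaves the data hot in cache
    } else {
        // streaming path: dst must be 16-byte aligned for the streaming stores
        nontemporal_copy(reinterpret_cast<char*>(dst), reinterpret_cast<char*>(src),
                         static_cast<int>(count));
    }
}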

