SSE Optimizations 25X slower!! Yeeeehhaaaw!!!

This might be a little more advanced. I have some code here that interpolates between two integer arrays. I'm using SSE intrinsics to try and wield the power of Intel CPUs. I have written two functions that both do the same thing... one using traditional methods and one that attempts to use SSE optimization so that it can process 4 elements of the array at a time (ahh... the beauty of parallelism). When you run this little gem, the optimized function goes about 25x slower... that makes perfect sense, doesn't it?? In order for this to compile in VC++ 6, you have to have SP5 installed as well as the Processor Pack that provides the intrinsics (direct access to low-level instructions). Just copy and paste as a console app, and tell me what conclusions you come to. Am I being a total moron?

#include <windows.h>
#include <iostream.h>
#include <stdio.h>

#include <xmmintrin.h>

// Prototypes
void InterpolateSSE(int* iStrip1, int* iStrip2, int* iDest, float fPercent1, int iSize);
void Interpolate(int* iStrip1, int* iStrip2, int* iDest, float fPercent1, int iSize);

int main()
{
	// Allocate memory for buffers
	int* test1 = (int*) _aligned_malloc(256 * sizeof(int), 16);
	int* test2 = (int*) _aligned_malloc(256 * sizeof(int), 16);
	int* test3 = (int*) _aligned_malloc(256 * sizeof(int), 16);
	int i;
	for(i=0; i < 256; i++)
	{
		// Fill arrays with arbitrary values
		test1[i] = (i+5) * 3;
		test2[i] = (i+2) * 5;
		test3[i] = 0;
	}

	// Setup performance counter
	LARGE_INTEGER perf_frequency;
	LARGE_INTEGER start;
	LARGE_INTEGER stop;
	QueryPerformanceFrequency(&perf_frequency);
	float event_time ;

	// Test without SSE
	QueryPerformanceCounter(&start);
	for(i=0; i < 100000; i++)
		Interpolate(test1, test2, test3, 0.32f, 256);
	QueryPerformanceCounter(&stop);
	event_time = ( (float)stop.QuadPart - (float)start.QuadPart ) / perf_frequency.QuadPart;
	printf ("\n\n\n\n\t\t100,000 loops took: %4.6f seconds without SSE\n", event_time);
	printf ("\n\n\n\t\tPleas wait while 2nd test completes…");


	// Test with SSE
	QueryPerformanceCounter(&start);
	for(i=0; i < 100000; i++)
		InterpolateSSE(test1, test2, test3, 0.32f, 256);
	QueryPerformanceCounter(&stop);
	event_time = ( (float)stop.QuadPart - (float)start.QuadPart ) / perf_frequency.QuadPart;
	printf ("\n\n\n\n\t\t100,000 loops took: %4.6f seconds with SSE\n\n\n\n\n", event_time);
	cin >> i;

	// Deallocate memory
	_aligned_free(test1);
	_aligned_free(test2);
	_aligned_free(test3);

	

	return 0;
}


// Interpolate between int arrays using SSE 'optimizations'
void InterpolateSSE(int* iStrip1, int* iStrip2, int* iDest, float fPercent1, int iSize)
{
	// This function assumes sizeof(iStrip1) = sizeof(iStrip2) = sizeof(iDest)
	__m128 a = _mm_set_ps1(fPercent1); 
	__m128 b = _mm_set_ps1((1.0f - fPercent1));
	

	__m128 low, high;
	__m128* pSrc1 = (__m128*) iStrip1;
    __m128* pSrc2 = (__m128*) iStrip2;
    __m128* pDest = (__m128*) iDest;

	
	int nLoop = iSize >> 2;

	for(int i=0; i < nLoop; i++)
	{
		low =  _mm_mul_ps(*pSrc1 , b);
		high = _mm_mul_ps(*pSrc2 , a);
		*pDest = _mm_add_ps(low, high);          
		//*pDest = _mm_add_ps(_mm_mul_ps(*pSrc1 , a), _mm_mul_ps(*pSrc2 , b));          
		//int total = (int)((m_iDereferenceBuffer1[j] * m_fP1) + (m_iDereferenceBuffer2[j] * m_fP2) - m_iCloudCover);

		pSrc1++;
		pSrc2++;
		pDest++;
	}
}

// Interpolate between int arrays using traditional methods
void Interpolate(int* iStrip1, int* iStrip2, int* iDest, float fPercent1, int iSize)
{
	for (int i=0; i < iSize; i++)
		iDest[i] = (int)(iStrip1[i]*(1.0f - fPercent1) + iStrip2[i]*fPercent1);
}

Your two cases aren't the same.

Your buffers are integer, yet you use SSE to load them as floats, while your scalar interp correctly uses them as integers.

Of course SSE is slower if you don't give it valid numbers to play with. When I changed everything to float (both your buffers and the scalar interp function) I get:

100,000 loops took: 4.279575 seconds without SSE
Please wait while 2nd test completes...
100,000 loops took: 0.018331 seconds with SSE
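
For illustration only, here is a minimal sketch of what that change might look like (the function and variable names are my own, not the exact code I ran); the SSE routine itself needs no changes, since __m128 already holds four floats:

// Hypothetical float version of the scalar interpolation, matching the
// float data the SSE routine already assumes.
void InterpolateFloat(float* fStrip1, float* fStrip2, float* fDest, float fPercent1, int iSize)
{
	for (int i = 0; i < iSize; i++)
		fDest[i] = fStrip1[i] * (1.0f - fPercent1) + fStrip2[i] * fPercent1;
}

// The buffers would likewise be allocated as floats, e.g.:
// float* test1 = (float*) _aligned_malloc(256 * sizeof(float), 16);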
OK... I am a moron. Thanks for your help!
Wow!

That is quite an improvement!

I may have to look into SSE programming, it seems very simple, yet powerful.

I've always wondered how those new CPU technologies improve performance; this thread definitely sheds some light.

Not to hijack the thread, but I am guessing a good application of this would be to write an SSE(2) function to modify (float) vertices..... seeing as how SSE is great for SIMD float stuff (I think it's even good with integers too).

Am I following the correct train of thought in using SSE?

Original Poster: Thanks for the heads up here.
The time measurement is way wrong; you don't go from 4 to 0.018 seconds.

Doing it right yields almost exactly the expected 4x speedup in this very simple case.

Nonetheless, quite impressive for very little work.

SSE2 would, for most 3D engines, be a really bad alternative, since the only thing it gives you is SSE with doubles, basically giving you greater accuracy and less parallelism.

SSE does indeed work for integer math too, basically by extending the MMX instruction set to the XMM registers.
quote:Original post by Anonymous Poster
The time measurement is way wrong; you don't go from 4 to 0.018 seconds.


How is (stop-start)/tps wrong?
Actually the AP is right; the time for non-SSE was wrong (an errant breakpoint, or my breakpoint at the end of main being hit mid non-SSE timing).

Actual results were:

100,000 loops took: 0.056946 seconds without SSE
Please wait while 2nd test completes...
100,000 loops took: 0.015388 seconds with SSE

I just didn't think it mattered that much, as the correction to his code was the more relevant part of my post.

MMX (Pentium MMX and up) works with packed integers to a maximum of 64 bits in size,

e.g.
8 * int8, 4 * int16 or 2 * int32.

SSE (Pentium III/Athlon MP/XP and up) works with packed floats, 128 bits, so 4 * float.
There are also instructions to convert to an MMX register (e.g. 4 floats -> 4 int16s, or 2 of the floats -> int32s, etc.).

SSE also has "scalar" math where the operation is only performed on the first component of the vector, allowing you to mix MMX (integer) with scalar floating point.

SSE2 (Pentium 4 and up) added 2 * packed double and 4 * packed __int32 instructions.

SSE2 would only be relevant here for its 4 * int32 capability.
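
As a concrete (purely illustrative) sketch of that idea, assuming an SSE2-capable CPU and the emmintrin.h header from the Processor Pack, the original int buffers could be kept and converted to floats inside the XMM registers; the function name and loop structure here are my own, not code from the thread:

#include <emmintrin.h>	// SSE2 intrinsics

// Hypothetical SSE2 variant: load 4 packed int32s, convert to floats,
// interpolate, convert back and store. Assumes 16-byte aligned pointers
// and iSize being a multiple of 4.
void InterpolateSSE2(int* iStrip1, int* iStrip2, int* iDest, float fPercent1, int iSize)
{
	__m128 a = _mm_set_ps1(fPercent1);
	__m128 b = _mm_set_ps1(1.0f - fPercent1);

	for (int i = 0; i < iSize; i += 4)
	{
		// CVTDQ2PS: 4 packed int32 -> 4 packed floats
		__m128 s1 = _mm_cvtepi32_ps(_mm_load_si128((__m128i*)(iStrip1 + i)));
		__m128 s2 = _mm_cvtepi32_ps(_mm_load_si128((__m128i*)(iStrip2 + i)));

		__m128 r = _mm_add_ps(_mm_mul_ps(s1, b), _mm_mul_ps(s2, a));

		// CVTPS2DQ: 4 packed floats -> 4 packed int32 (rounded)
		_mm_store_si128((__m128i*)(iDest + i), _mm_cvtps_epi32(r));
	}
}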


quote:Original post by Narcist
Actually the AP is right; the time for non-SSE was wrong (an errant breakpoint, or my breakpoint at the end of main being hit mid non-SSE timing).


Ahhh. I thought I had gone braindead for a minute there. And please forgive the semi-off-topic intrusion.
OK, once again, really slowly so even Narcist can hear:

SSE extended MMX to work with 128-bit quantities.

Do I need to repeat it once more for it to stick?

So on an SSE-compatible CPU,
paddb xmm0, xmm1

would be a legal instruction working on packed bytes in a 128-bit block. The vanilla MMX variant working with 64 bits is
paddb mm0, mm1

Also, when using the XMM registers you don't need to use EMMS, since they're not aliased onto the original FPU registers.

cheers...
//DrunkenCoder
Once again, so even the Anonymous Poster can understand:

SSE did NOT add 128-bit packed integers; SSE *2* added 128-bit packed integers. That is NOT a valid instruction on SSE.

Straight from MSDN:

The packed arithmetic intrinsics supporting the 128-bit integer MMX technology enhancements provided by the Streaming SIMD Extensions 2 (SSE2) instructions are listed in the Integer Arithmetic Operations table.

_mm_add_epi8     PADDB    Addition
_mm_add_epi16    PADDW    Addition
_mm_add_epi32    PADDD    Addition
_mm_add_si64     PADDQ    Addition
_mm_add_epi64    PADDQ    Addition
_mm_adds_epi8    PADDSB   Addition
_mm_adds_epi16   PADDSW   Addition
_mm_adds_epu8    PADDUSB  Addition
_mm_adds_epu16   PADDUSW  Addition
_mm_avg_epu8     PAVGB    Computes average
_mm_avg_epu16    PAVGW    Computes average
_mm_madd_epi16   PMADDWD  Multiplication/addition
_mm_max_epi16    PMAXSW   Computes maxima
_mm_max_epu8     PMAXUB   Computes maxima
_mm_min_epi16    PMINSW   Computes minima
_mm_min_epu8     PMINUB   Computes minima
_mm_mulhi_epi16  PMULHW   Multiplication
_mm_mulhi_epu16  PMULHUW  Multiplication
_mm_mullo_epi16  PMULLW   Multiplication
_mm_mul_su32     PMULUDQ  Multiplication
_mm_mul_epu32    PMULUDQ  Multiplication
_mm_sad_epu8     PSADBW   Computes difference/adds
_mm_sub_epi8     PSUBB    Subtraction
_mm_sub_epi16    PSUBW    Subtraction
_mm_sub_epi32    PSUBD    Subtraction
_mm_sub_si64     PSUBQ    Subtraction
_mm_sub_epi64    PSUBQ    Subtraction
_mm_subs_epi8    PSUBSB   Subtraction
_mm_subs_epi16   PSUBSW   Subtraction
_mm_subs_epu8    PSUBUSB  Subtraction
_mm_subs_epu16   PSUBUSW  Subtraction


And from the IA-32 Intel Architecture Software Developer's Manual, Volume 1, Chapter 11.1:

The SSE2 extensions use the same single instruction multiple data (SIMD) execution model that is used with the MMX technology and the SSE extensions. It extends this model with support for packed double-precision floating-point values and for 128-bit packed integers. The SSE2 extensions add the following features to the IA-32 architecture, while maintaining backward compatibility with all existing IA-32 processors, applications and operating systems.

Six data types:
- 128-bit packed double-precision floating point (two IEEE standard 754 double-precision floating-point values packed into a double quadword)
- 128-bit packed byte integers
- 128-bit packed word integers
- 128-bit packed doubleword integers
- 128-bit packed quadword integers



All the above is SSE *2*, not plain SSE.

The only integer instructions SSE provides are:

PEXTRW   - Extracts one of four words
PINSRW   - Inserts a word
PMAXSW   - Computes the maximum
PMAXUB   - Computes the maximum, unsigned
PMINSW   - Computes the minimum
PMINUB   - Computes the minimum, unsigned
PMOVMSKB - Creates an 8-bit mask
PMULHUW  - Multiplies, returning high bits
PSHUFW   - Returns a combination of four words
MASKMOVQ - Computes conditional store
PAVGB    - Computes rounded average
PAVGW    - Computes rounded average
PSADBW   - Computes sum of absolute differences


ALL of which ONLY work on the mm0-mm7 registers on a CPU that doesn't support SSE2.

Any integer operations on xmm0-xmm7 are purely SSE2.

On an SSE2 CPU (e.g. P4), all the MMX instructions also have XMM equivalents.

All of this information is from the Intel docs and from MSDN.

I also didn't say you had to use an EMMS instruction after SSE; I simply stated you could interleave scalar SSE (addss, mulss, etc.) with MMX, whereas you can't interleave FP code with MMX (due to the MMX registers being aliases for the FP registers and the time-consuming EMMS instruction needed to switch between the two).
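
To make the distinction concrete, here is a tiny illustrative example (my own, not from the docs): the 128-bit packed integer add is an SSE2 intrinsic, while on a plain-SSE CPU only the 64-bit MMX form (paddd mm0, mm1) exists.

#include <emmintrin.h>	// SSE2: 128-bit packed integer operations

// Hypothetical helper: PADDD on XMM registers via _mm_add_epi32 -
// four 32-bit integer additions in a single SSE2 instruction.
__m128i AddPackedInt32(__m128i v1, __m128i v2)
{
	return _mm_add_epi32(v1, v2);
}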


Edit: fixed formatting

[edited by - Narcist on January 15, 2004 2:52:20 PM]

This topic is closed to new replies.
