samgzman

SSE Optimizations 25X slower!! Yeeeehhaaaw!!!


This might be a little more advanced. I have some code here that interpolates between two integer arrays. I'm using SSE intrinsics to try to wield the power of Intel CPUs. I have written two functions that both do the same thing: one using traditional methods, and one that attempts to use SSE so that it can process 4 elements of the array at a time (ahh, the beauty of parallelism). When you run this little gem, the optimized function goes about 25X slower... that makes perfect sense, doesn't it?

To compile this in VC++ 6, you need SP5 installed as well as the Processor Pack, which provides the intrinsics (direct access to low-level instructions). Just copy and paste it into a console app and tell me what conclusions you come to. Am I being a total moron?
#include <windows.h>
#include <iostream.h>
#include <stdio.h>

#include <xmmintrin.h>

// Prototypes
void InterpolateSSE(int* iStrip1, int* iStrip2, int* iDest, float fPercent1, int iSize);
void Interpolate(int* iStrip1, int* iStrip2, int* iDest, float fPercent1, int iSize);

int main()
{
	// Allocate memory for buffers
	int* test1 = (int*) _aligned_malloc(256 * sizeof(int), 16);
	int* test2 = (int*) _aligned_malloc(256 * sizeof(int), 16);
	int* test3 = (int*) _aligned_malloc(256 * sizeof(int), 16);
	for(int i=0; i < 256; i++)
	{
		// Fill arrays with arbitrary values
		test1[i] = (i+5) * 3;
		test2[i] = (i+2) *5 ;
		test3[i] = 0;
	}	

	// Setup performance counter
	LARGE_INTEGER perf_frequency;
	LARGE_INTEGER start;
	LARGE_INTEGER stop;
	QueryPerformanceFrequency(&perf_frequency);
	float event_time ;

	// Test without SSE
	QueryPerformanceCounter(&start);
	for(i=0; i < 100000; i++)
		Interpolate(test1, test2, test3, 0.32f, 256);
	QueryPerformanceCounter(&stop);
	// Subtract the 64-bit counters first; casting each one to float before subtracting throws away low-order bits
	event_time = (float)(stop.QuadPart - start.QuadPart) / (float)perf_frequency.QuadPart;
	printf ("\n\n\n\n\t\t100,000 loops took: %4.6f seconds without SSE\n", event_time);
	printf ("\n\n\n\t\tPlease wait while 2nd test completes...");


	// Test with SSE
	QueryPerformanceCounter(&start);
	for(i=0; i < 100000; i++)
		InterpolateSSE(test1, test2, test3, 0.32f, 256);
	QueryPerformanceCounter(&stop);
	event_time = (float)(stop.QuadPart - start.QuadPart) / (float)perf_frequency.QuadPart;
	printf ("\n\n\n\n\t\t100,000 loops took: %4.6f seconds with SSE\n\n\n\n\n", event_time);
	cin >> i;

	// Deallocate memory
	_aligned_free(test1);
	_aligned_free(test2);
	_aligned_free(test3);

	

	return 0;
}


// Interpolate between int arrays using SSE 'Optimizations'
void InterpolateSSE(int* iStrip1, int* iStrip2, int* iDest, float fPercent1, int iSize)
{
	// This function assumes sizeof(iStrip1) = sizeof(iStrip2) = sizeof(iDest)
	__m128 a = _mm_set_ps1(fPercent1); 
	__m128 b = _mm_set_ps1((1.0f - fPercent1));
	

	__m128 low, high;
	__m128* pSrc1 = (__m128*) iStrip1;
    __m128* pSrc2 = (__m128*) iStrip2;
    __m128* pDest = (__m128*) iDest;

	
	int nLoop = iSize >> 2;

	for(int i=0; i < nLoop; i++)
	{
		low =  _mm_mul_ps(*pSrc1 , b);
		high = _mm_mul_ps(*pSrc2 , a);
		*pDest = _mm_add_ps(low, high);          
		//*pDest = _mm_add_ps(_mm_mul_ps(*pSrc1 , a), _mm_mul_ps(*pSrc2 , b));          
		//int total = (int)((m_iDereferenceBuffer1[j] * m_fP1) + (m_iDereferenceBuffer2[j] * m_fP2) - m_iCloudCover);

		pSrc1++;
		pSrc2++;
		pDest++;
	}
}

// Interpolate between int arrays using traditional methods
void Interpolate(int* iStrip1, int* iStrip2, int* iDest, float fPercent1, int iSize)
{
	for (int i=0; i < iSize; i++)
		iDest[i] = (int)(iStrip1[i]*(1.0f - fPercent1) + iStrip2[i]*(fPercent1));
}

 

your 2 cases aren't the same

your buffers are integer, yet you use SSE to load them as if they were floats, while your scalar interp correctly uses them as integers

of course SSE is slower if you don't give it valid numbers to play with. when i changed everything to be float (both your buffers and the scalar interp function) i get

100,000 loops took: 4.279575 seconds without SSE
Please wait while 2nd test completes...
100,000 loops took: 0.018331 seconds with SSE
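
The post above doesn't say why "invalid" numbers are slow. A likely explanation, not confirmed anywhere in the thread, is that the bit pattern of a small positive int, read as an IEEE-754 single, has an all-zero exponent field, which makes it a subnormal (denormal) value, and many x86 CPUs process subnormals far more slowly than normal floats. The sketch below only checks that classification; the helper names are made up for illustration.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reads the 32 bits of an int as a float, which is
// effectively what casting an int* to __m128* does for each lane.
inline float reinterpret_as_float(std::int32_t value)
{
    static_assert(sizeof(float) == sizeof(std::int32_t), "expects a 32-bit float");
    float result;
    std::memcpy(&result, &value, sizeof(result));
    return result;
}

// True when the int's bit pattern, read as a float, is a subnormal number.
inline bool is_subnormal_when_reinterpreted(std::int32_t value)
{
    return std::fpclassify(reinterpret_as_float(value)) == FP_SUBNORMAL;
}
```

Every value the original test stores ((i+5)*3 and (i+2)*5 for i below 256) is a small positive int, so `is_subnormal_when_reinterpreted` would report true for all of them.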

Wow!

That is quite an improvement!

I may have to look into SSE programming, it seems very simple, yet powerful.

I've always wondered how those new cpu technologies improve performance, this thread definitely sheds some light.

Not to hijack the thread, but I am guessing a good application of this would be to write an SSE(2) function to modify (float) vertices ..... seeing as how SSE is great for SIMD float stuff (I think it's even good with integers too).

Am I following the correct train of thought in using SSE?

Original Poster: Thanks for the heads up here.
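
As a rough illustration of the vertex idea raised in this post, here is a minimal sketch, not taken from the thread, that scales a flat array of floats four at a time. The function name `scale_floats` and the flat x,y,z layout are assumptions; the code falls back to a plain loop when SSE is not available at compile time.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Hypothetical helper: multiplies every float in `values` by `scale`.
// `values` can be a flat x,y,z,x,y,z,... vertex array; no alignment is assumed.
inline void scale_floats(float* values, std::size_t count, float scale)
{
    assert(values != nullptr || count == 0);
    std::size_t i = 0;
#if SKETCH_HAS_SSE
    const __m128 factor = _mm_set1_ps(scale);
    for (; i + 4 <= count; i += 4)
    {
        __m128 v = _mm_loadu_ps(values + i);  // unaligned load of 4 floats
        v = _mm_mul_ps(v, factor);            // 4 multiplies in one instruction
        _mm_storeu_ps(values + i, v);         // unaligned store of 4 floats
    }
#endif
    for (; i < count; ++i)                    // scalar tail, or everything without SSE
        values[i] *= scale;
}
```

Because the data really is float here, the loads feed the multiplier valid values, which is the situation SSE was designed for.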

Guest Anonymous Poster
The time measurement is way wrong, you don't go from 4 to 0.018 seconds.

Doing it right yields almost exactly the expected 4 times speedup in this very simple case.

Nonetheless quite impressive for very little work.

SSE2 would for most 3D engines be a really bad alternative since the only thing it gives you is SSE with doubles, basically giving you greater accuracy and less parallelism.

SSE does indeed work for integer math too, basically by extending the MMX instruction set to the xmm registers.

quote:
Original post by Anonymous Poster
The time measurement is way wrong, you don't go from 4 to 0.018 seconds.



How is (stop-start)/tps wrong?

Actually AP is right, the time for non SSE was wrong (errant breakpoint or my breakpoint at the end of main being hit mid non-sse timing)

actual results were :-

100,000 loops took: 0.056946 seconds without SSE
Please wait while 2nd test completes...
100,000 loops took: 0.015388 seconds with SSE

I just didn't think it mattered that much as it was more the correction to his code that was the relevant part of my post.
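
The (stop - start) / ticks-per-second formula itself is fine; what skewed the first run was the debugger stopping inside the timed region. As a hedged sketch of the same measurement with standard C++ instead of QueryPerformanceCounter (the thread does not use this, and `time_iterations` is a made-up name), something like the following could be run outside the debugger:

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: calls `work` `iterations` times and returns the
// elapsed wall-clock time in seconds.
template <typename Work>
double time_iterations(Work work, int iterations)
{
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        work();
    const auto stop = std::chrono::steady_clock::now();
    // Subtract the time points first, then convert the difference to seconds.
    return std::chrono::duration<double>(stop - start).count();
}
```

A call such as `time_iterations([&] { Interpolate(test1, test2, test3, 0.32f, 256); }, 100000)` would mirror the loops in the original post, assuming those buffers are in scope.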

and MMX (Pentium MMX and up) works with packed integers to a maximum of 64 bits in size

eg
8 * int8, 4 * int16 or 2 * int32

SSE (Pentium III / Athlon XP/MP and up) works with packed floats, 128 bits so 4 * float
there are also instructions to convert to an mmx register (eg 4 floats -> 4 int16s, or 2 of the floats -> int32 etc)

SSE also has "scalar" math where the operation is only performed on the first component of the vector, allowing you to mix mmx (integer) with scalar floating point.

SSE2 (Pentium 4 and up) added 2 * packed double, and 4 * packed __int32 instructions

SSE2 would only be relevant for its 4 * int32 capability
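
To show what that 4 * int32 capability could look like for the original problem, here is a minimal sketch, assuming an SSE2-capable compiler target, that keeps the buffers as int and converts each group of four to float before blending. It is not code from the thread; `interpolate_ints` is a hypothetical name, and unaligned loads are used so no alignment is required.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Hypothetical SSE2 variant of the thread's Interpolate(): converts each group
// of 4 ints to floats, blends them, and truncates the result back to int.
inline void interpolate_ints(const int* src1, const int* src2, int* dest,
                             float percent, std::size_t count)
{
    static_assert(sizeof(int) == 4, "sketch assumes 32-bit int");
    assert(count == 0 || (src1 && src2 && dest));
    std::size_t i = 0;
#if SKETCH_HAS_SSE2
    const __m128 weight2 = _mm_set1_ps(percent);
    const __m128 weight1 = _mm_set1_ps(1.0f - percent);
    for (; i + 4 <= count; i += 4)
    {
        // Load 4 ints and convert them to real float values (CVTDQ2PS).
        const __m128 a = _mm_cvtepi32_ps(
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src1 + i)));
        const __m128 b = _mm_cvtepi32_ps(
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src2 + i)));
        const __m128 blended =
            _mm_add_ps(_mm_mul_ps(a, weight1), _mm_mul_ps(b, weight2));
        // Truncate toward zero, like the scalar (int) cast (CVTTPS2DQ).
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dest + i),
                         _mm_cvttps_epi32(blended));
    }
#endif
    for (; i < count; ++i)  // scalar tail, or the whole array without SSE2
        dest[i] = static_cast<int>(src1[i] * (1.0f - percent) + src2[i] * percent);
}
```

The conversions cost extra instructions per group, so whether this beats simply storing the data as float, as suggested earlier in the thread, would need measuring.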


quote:
Original post by Narcist
Actually AP is right, the time for non SSE was wrong (errant breakpoint or my breakpoint at the end of main being hit mid non-sse timing)


Ahhh. I thought I had gone braindead for a minute there. And please forgive the semi-off-topic intrusion.

Guest Anonymous Poster
Ok once again really slow so even Narcist can hear:

SSE extended MMX to work with 128bit quantities.

Do I need to repeat it once more for it to stick?

so on a SSE compatible CPU
paddb xmm0, xmm1

would be a legal instruction working on packed bytes in a 128bit block. the vanilla MMX variant working with 64bits is
paddb mm0, mm1

also when using the XMM registers you don't need to use EMMS since they're not aliased upon the original FPU registers.

cheers...
//DrunkenCoder

Once again so even the Anonymous poster can understand

SSE did NOT add 128bit packed integer, SSE *2* added 128 bit packed integer, that is NOT a valid instruction on SSE

straight from MSDN


The packed arithmetic intrinsics supporting the 128-bit integer
MMX technology enhancements provided by the Streaming SIMD
Extensions 2 (SSE2) instructions process are listed in the
Integer Arithmetic Operations table.

Intrinsic         Instruction  Operation
_mm_add_epi8      PADDB        Addition
_mm_add_epi16     PADDW        Addition
_mm_add_epi32     PADDD        Addition
_mm_add_si64      PADDQ        Addition
_mm_add_epi64     PADDQ        Addition
_mm_adds_epi8     PADDSB       Addition
_mm_adds_epi16    PADDSW       Addition
_mm_adds_epu8     PADDUSB      Addition
_mm_adds_epu16    PADDUSW      Addition
_mm_avg_epu8      PAVGB        Computes average
_mm_avg_epu16     PAVGW        Computes average
_mm_madd_epi16    PMADDWD      Multiplication/addition
_mm_max_epi16     PMAXSW       Computes maxima
_mm_max_epu8      PMAXUB       Computes maxima
_mm_min_epi16     PMINSW       Computes minima
_mm_min_epu8      PMINUB       Computes minima
_mm_mulhi_epi16   PMULHW       Multiplication
_mm_mulhi_epu16   PMULHUW      Multiplication
_mm_mullo_epi16   PMULLW       Multiplication
_mm_mul_su32      PMULUDQ      Multiplication
_mm_mul_epu32     PMULUDQ      Multiplication
_mm_sad_epu8      PSADBW       Computes difference/adds
_mm_sub_epi8      PSUBB        Subtraction
_mm_sub_epi16     PSUBW        Subtraction
_mm_sub_epi32     PSUBD        Subtraction
_mm_sub_si64      PSUBQ        Subtraction
_mm_sub_epi64     PSUBQ        Subtraction
_mm_subs_epi8     PSUBSB       Subtraction
_mm_subs_epi16    PSUBSW       Subtraction
_mm_subs_epu8     PSUBUSB      Subtraction
_mm_subs_epu16    PSUBUSW      Subtraction


and from IA-32 Intel Architecture Software Developer's Manual, Volume 1 Chapter 11.1


The SSE2 extensions use the same single instruction multiple
data (SIMD) execution model that is used with the MMX technology
and the SSE extensions. It extends this model with support for
packed double-precision floating point values and for 128-bit
packed integers.

The SSE2 extensions add the following features to the IA-32
architecture, while maintaining backward compatibility with all
existing IA-32 processors, applications and operating systems.

Six data types:
- 128-bit packed double precision floating point (two IEEE
standard 754 double-precision floating-point values packed into
a double quadword)
- 128-bit packed byte integers.
- 128-bit packed word integers.
- 128-bit packed doubleword integers.
- 128-bit packed quadword integers.



all the above is SSE *2* not plain SSE

The only integer instructions SSE provides are


PEXTRW - Extracts one of four words
PINSRW - Inserts a word
PMAXSW - Computes the maximum
PMAXUB - Computes the maximum, unsigned
PMINSW - Computes the minimum
PMINUB - Computes the minimum, unsigned
PMOVMSKB - Creates an 8-bit mask
PMULHUW - Multiplies, returning high bits
PSHUFW - Returns a combination of four words
MASKMOVQ - Computes conditional store
PAVGB - Computes rounded average
PAVGW - Computes rounded average
PSADBW - Computes sum of absolute differences


ALL of which ONLY work on the mm0->mm7 registers on a CPU that doesn't support SSE2.

Any integer operations on xmm0-xmm7 are purely SSE2

On an SSE2 cpu (eg p4) all the mmx instructions also have xmm equivalents

All of this information is from the intel docs and from msdn

I also didn't say you had to use an EMMS instruction after SSE, i simply stated you could interleave scalar SSE (addss, mulss etc) with MMX, whereas you can't interleave FP code with MMX (due to the mmx registers being aliases for the fp registers and the time consuming emms instruction to switch between the 2)


Edit: fixed formatting

[edited by - Narcist on January 15, 2004 2:52:20 PM]
