SSE profiling and results

I usually work in debug mode, which means I've done most of my SSE tests in debug mode, and I have not been impressed. That is, until I tried release mode: to be honest, I don't see how the results are even possible. Here's the test case:

	__int64 p1, p2;

	p1 = __rdtsc();
	for(DWORD i = 0; i < (1 << 10); i++)
		IVector v3 = v * i;
	p2 = __rdtsc();
	lout << "PROFILE1: " << (DWORD)(p2 - p1) << " cycles" << endl;

	p1 =__rdtsc();
	for(DWORD i = 0; i < (1 << 10); i++)
		TVector3D rv3 = rv1 * i;
	p2 = __rdtsc();
	lout << "PROFILE2: " << (DWORD)(p2 - p1) << " cycles" << endl;

IVector is a single-precision SSE-enhanced vector class and TVector3D is a double-precision non-SSE version. The multiplication code for either class looks like this:

__forceinline 
IVector IVector::operator*(const float & _v)
{
	__align16 float value[4] = { _v, _v, _v, _v };
	return IVector(_mm_mul_ps(_sse, *(__m128*)value));
}

------------------------------

//TReal is a double!
TVector3D TVector3D::operator*(TReal pNumber)
{
	return TVector3D(x * pNumber, y * pNumber, z * pNumber);
}
The results of the above profiling test (pretty much consistent across multiple runs; SSE2 is enabled in the compiler settings) are:

DEBUG MODE
SSE-enabled: ~163000 cycles
SSE-disabled: ~75000 cycles
Conclusion: double-precision brute force beats single-precision SSE hands down. I'm not 100% sure why SSE bombs here, but I'm not really worried, as this is debug mode. If someone can elaborate on this I'd appreciate it.

RELEASE MODE
SSE-enabled: ~30 cycles
SSE-disabled: ~22000 cycles
Conclusion: WTF

In other words, I'm not at all dismayed by the debug mode results or the SSE-disabled release mode result, but I'm having a really hard time accepting the release mode SSE-enabled result. Am I blind? Or is this actually normal? The number of iterations is (1 << 10) = 1024, which would make it ~34 vector multiplications per clock cycle. I'm relatively sure this isn't a caching issue, as I got these numbers the first time I ran the test; I've also re-ordered the code to rule out caching effects and the results stay the same.
What compiler are you using?

If you're using Dev Studio, it might be worth enabling SSE in your project properties and seeing what results you get:
Project Properties -> C/C++ -> Code Generation -> Enable Enhanced Instruction Set (/arch:SSE or /arch:SSE2)

And if you can, also try running the same code on a different platform (Linux or otherwise).
I think the code of the loop isn't generated at all in release mode when using SSE. That loop does, in fact, nothing: its result is never used, so the compiler is free to remove it entirely. This is a common optimization in modern compilers. In the other case the compiler generates the loop because the function isn't inlined. You have to read the assembly generated by the compiler.
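To illustrate the point, here is a minimal sketch (v, IVector and lout are taken from the code above; the X() component accessor is hypothetical):

	// With optimizations on, this loop is a dead store: v3 never escapes,
	// so the compiler may legally emit no code for it at all.
	for(DWORD i = 0; i < (1 << 10); i++)
		IVector v3 = v * (float)i;

	// Accumulating into a value that is printed afterwards makes the work
	// observable, so the loop can no longer be eliminated.
	float sink = 0.0f;
	for(DWORD i = 0; i < (1 << 10); i++)
		sink += (v * (float)i).X();
	lout << "sink: " << sink << endl;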
Quote:Original post by apatriarca
I think the code of the loop isn't generated at all in release mode when using SSE. That loop does, in fact, nothing: its result is never used, so the compiler is free to remove it entirely. This is a common optimization in modern compilers. In the other case the compiler generates the loop because the function isn't inlined. You have to read the assembly generated by the compiler.


Ah! Good observation!

Indeed, when replacing the loops with something a bit less deterministic, e.g.:

	for(DWORD i = 0; i < (1 << 10); i++)
		IVector v3 = v * (rand() % (i + 1));


The results become:

72000 (SSE) vs 103000 (non-SSE) cycles in release mode. I'm still not impressed: the non-SSE version is double-precision, it's not inlined and it's not accelerated, yet it takes less than 1.5x as long as the SSE version. Bleh. Granted, though, most of the time is now spent inside rand().

EDIT:

Okay, now I'm confused: I removed the rand() from the loop by changing the test code to this:

	__int64 p1, p2;
	float rn_f[10];
	double rn_d[10];
	for(int i = 0; i < 10; i++)
	{
		rn_f[i] = (float)rand();
		rn_d[i] = (double)rand();
	}

	IVector v3;
	DWORD t1 = timeGetTime();
	p1 = __rdtsc();
	for(DWORD i = 0; i < (1 << 20); i++)
		v3 = v * (rn_f[i % 10]);
	p2 = __rdtsc();
	DWORD t2 = timeGetTime();
	lout << "PROFILE1: " << (DWORD)(p2 - p1) << " cycles (" << (t2 - t1) << " ms)" << endl;

	TVector3D rv3;
	t1 = timeGetTime();
	p1 = __rdtsc();
	for(DWORD i = 0; i < (1 << 20); i++)
		rv3 = rv1 * (rn_d[i % 10]);
	p2 = __rdtsc();
	t2 = timeGetTime();
	lout << "PROFILE2: " << (DWORD)(p2 - p1) << " cycles (" << (t2 - t1) << " ms)" << endl;


Results in release mode with optimizations disabled (1 << 20 iterations):

SSE = ~120M cycles (47ms)
non-SSE = ~60M cycles (23ms)

Once more, SSE fails miserably. Is it because disabling optimizations also disables the instruction set enhancements? The disassembly suggests it doesn't:

	return IVector(_mm_mul_ps(_sse, *(__m128*)value));
000E1396  movaps      xmm0,xmmword ptr [ebp-20h]
000E139A  mov         ecx,dword ptr [ebp-8]
000E139D  movaps      xmm1,xmmword ptr [ecx+10h]
000E13A1  mulps       xmm1,xmm0
000E13A4  movaps      xmmword ptr [ebp-30h],xmm1
000E13A8  movaps      xmm0,xmmword ptr [ebp-30h]
000E13AC  mov         ecx,dword ptr [ebx+8]
000E13AF  call        IVector::IVector (0E1180h)
000E13B4  mov         eax,dword ptr [ebx+8]


I'm using VS2008 on Windows 7.

Hm, I also got some weird results with SSE. I used the same measurement method as you, and after playing around with my code I realized that a small change in the order of two movaps instructions resulted in 10x worse performance. Additionally, after testing several scenarios, it dawned on me that the compiler-generated version which ran after mine took the same performance hit.

I searched the internet and came across an Intel specification which said that the processor I have (a Core 2 Duo) has some kind of automatic prefetch. I can't remember all of it, and sadly I'm no longer able to find the article, but I think it said the processor keeps track of data streams and can optimize big loops with automatic prefetches.

This led me to the conclusion that the processor "learned" what I was doing and prefetched it. When I changed the order of the two movaps instructions, it could no longer keep track of the streams and therefore did no prefetching, resulting in 10x worse performance in both my version and the compiler's.

My advice is to not simply test one operation, like multiplying or adding some vectors, but to test "in the wild". By that I mean you should set up a test case simulating the future usage of your vector class, to directly measure the benefit you gain from using SSE instructions.
If you want a meaningful benchmark, you should tag both of your functions as noinline to prevent the compiler from the excessive optimization that would make your tests meaningless.

You don't need a call to rand(), and using a '%' operator in your loop performs an integer division, which will take longer than the mulps you're trying to benchmark.

I doubt the compiler will detect your functions as "pure" if they're forced to noinline, but if it still removes the benchmark loop, you can try:

	volatile TVector3D dummy;
	for(DWORD i = 0; i < (1 << 10); i++)
		dummy = rv1 * i;

By the way, in your SSE code you can use the _mm_set1_ps() intrinsic, which broadcasts a scalar into all four vector components.
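Something like this, as a minimal sketch assuming the IVector layout shown earlier (with _sse being the __m128 member):

	__forceinline
	IVector IVector::operator*(const float & _v)
	{
		// _mm_set1_ps broadcasts the scalar into all four lanes, avoiding
		// the round trip through an aligned stack array
		return IVector(_mm_mul_ps(_sse, _mm_set1_ps(_v)));
	}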

EDIT:
Just saw the disassembly in your post above.
What's the code of IVector's constructor? Why isn't it inlined?
Also, that looks like a debug disassembly; what does the release one look like?
Quote:Original post by momotte
If you want a meaningful benchmark, you should tag both of your functions as noinline to prevent the compiler from the excessive optimization that would make your tests meaningless.
No. That is the wrong way to do it.
No benchmark should require you to cross your fingers and hope some bit doesn't get optimised out, nor should you tie the compiler's hands behind its back and expect it to give similar results in real-world scenarios.

The right way to do it is to write your benchmark code such that the compiler has no choice but to actually run the code, i.e. it cannot produce the correct result without running the code. Every part of every line of code must have some impact on the final result.
For example, one could take the output from one iteration and feed it into the next iteration. Or one could add the x, y, and z values of each result to some total variable. Or you could simply toggle a bool depending on something about the vector, e.g. its length being greater than a certain amount.

You're also much better off trying to put the specific operation inside a much more realistic scenario. For example, calculating the component of one vector perpendicular to another inside the loop requires a reasonable number of operations.
Otherwise what you get is something that tells you that in the specific scenario you chose the difference was, say, 30%, and then when you try it in any more realistic scenario you instead find that it drops to 20%, because as soon as the code has to interact with other code, the compiler can't optimise it as well as it could in isolation.
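As a minimal sketch of the feed-forward idea (names are taken from the thread; X() is a hypothetical component accessor):

	IVector acc = v;
	__int64 p1 = __rdtsc();
	for(DWORD i = 0; i < (1 << 20); i++)
		acc = acc * 1.0001f;	// each iteration consumes the previous result
	__int64 p2 = __rdtsc();
	// printing a component makes the final value observable, so none of the
	// iterations can be optimised away
	lout << "PROFILE: " << (DWORD)(p2 - p1) << " cycles, acc.x = " << acc.X() << endl;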
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
Shouldn't the compiler be smart enough to SSE-optimise "vector * scalar" without needing the intrinsics? Just enable SSE2, /fp:fast and intrinsic functions in the settings.

One way of stopping the compiler from optimising out benchmark code is to std::cout one of the elements of the array. This is enough to trick VS2008, but other compilers might realise the rest of the array does nothing and collapse it to a single iteration. In that case you could always cout a random element of the array.
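A rough sketch of that trick (the array names and size are illustrative only, and <iostream> and <cstdlib> are assumed to be included):

	const unsigned N = 1 << 20;
	static float src[N], results[N];
	// ... fill src and start timing ...
	for(unsigned i = 0; i < N; i++)
		results[i] = src[i] * 3.0f;
	// ... stop timing ...
	// printing a randomly chosen element keeps the whole array live
	std::cout << results[rand() % N] << std::endl;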
iMalc> For the example given by the OP, I disagree. The guy wants to benchmark a mulps.
Oh, sure, you can accumulate the whole computation into a single value and display this value to stdout afterwards, but the bench result for something like a vector mul will be drowned out by all the side-ops going on. (And these generated side-ops, with inlined functions, might lead to more performant code in the non-SSE version, leading him to think SSE's mul instructions perform badly compared to regular scalar code.)

And one of the advantages of using a non-inlined external function to be benchmarked is that you can also benchmark an empty function, to get an idea of the loop and other computational overhead inherent to the benchmark.

Of course, a more meaningful benchmark would be to execute, multiple times, a function that does a batched multiplication of two arrays of numbers, one version with SSE, the other without.
And I agree there is little point in benchmarking a single vector mul like this, without context, except curiosity.

But if he wants to precisely time some function, he will have to tie the compiler's hands, so that the benchmark-specific code does not bleed and blend into what he's actually trying to benchmark. When you can't control that, and aren't sure what the compiler is really going to do, you get strange results...

And if he actually wants to benchmark a stream of vector muls, then the benchmark loop should call something like this:

extern void BatchedMul(const float *src0, const float *src1, float *dst, u32 count);

plus a BatchedMulSSE, a BatchedMulDummy that does nothing, a BatchedMulVMX, whatever...
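As a minimal sketch of the SSE version under those assumptions (u32 standing for unsigned int as in the declaration above; the buffers are assumed 16-byte aligned and count a multiple of 4), with a scalar reference version for comparison:

	#include <xmmintrin.h>
	typedef unsigned int u32;

	void BatchedMulSSE(const float *src0, const float *src1, float *dst, u32 count)
	{
		for(u32 i = 0; i < count; i += 4)
		{
			__m128 a = _mm_load_ps(src0 + i);	// aligned load of 4 floats
			__m128 b = _mm_load_ps(src1 + i);
			_mm_store_ps(dst + i, _mm_mul_ps(a, b));	// 4 multiplies per iteration
		}
	}

	void BatchedMulScalar(const float *src0, const float *src1, float *dst, u32 count)
	{
		for(u32 i = 0; i < count; i++)
			dst[i] = src0[i] * src1[i];
	}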
