
# To use SSE, or not



Okay, not an SSE-specific question here. I've created a math library that has SSE support. On initialization, the library queries the CPU to determine whether SSE is available and, if it is, sets a global variable (g_bSSE) to true. The problem is that every operation (a matrix multiplication, say) checks that flag, so the check runs on every single operation on every Vector/Matrix/Quaternion. That isn't a very smart move. Is there a more efficient way of handling this? Example:
void ShtrMatrix4::operator *= (float f)
{
if(!g_bSSE) {
_11 *= f;
_12 *= f;
_13 *= f;
_14 *= f;

_21 *= f;
_22 *= f;
_23 *= f;
_24 *= f;

_31 *= f;
_32 *= f;
_33 *= f;
_34 *= f;

_41 *= f;
_42 *= f;
_43 *= f;
_44 *= f;
} else {
__asm {
mov		esi,	this
movaps	xmm0,	[esi]
movss	xmm1,	f		; f is passed by value, so load it directly instead of dereferencing it
shufps	xmm1,	xmm1,	0
mulps	xmm0,	xmm1
movaps	[esi],	xmm0
movups	xmm0,	[esi + 16]
mulps	xmm0,	xmm1
movaps	[esi + 16],	xmm0
movups	xmm0,	[esi + 32]
mulps	xmm0,	xmm1
movaps	[esi + 32],	xmm0
movups	xmm0,	[esi + 48]
mulps	xmm0,	xmm1
movaps	[esi + 48],	xmm0
}
}
}


##### Share on other sites
You could use a function pointer instead, and have two functions; one SSE and one non-SSE. Then, you could have a startup function that checks for SSE support and assigns the function pointers appropriately.

If this is in a DLL, you could do some pretty ugly thunking too, to get rid of the function pointer overhead completely.

I wonder what the D3DX DLLs do actually, since they support SSE, 3DNow and MMX versions of the functions...

##### Share on other sites
A plugin structure comes to mind. You're already using classes, so create an abstract base class and implementation classes for standard, SSE, 3DNow!, etc. Then use a factory method to construct the right implementation class. The processor check is needed only inside the factory method.

##### Share on other sites
There are many solutions. None of them is the best, but be aware of these:

1) SSE builds. This is the fastest. Make two different builds, one with SSE and another without it. Make a third program (the "launcher") that detects the presence of SSE. If it is there, run the SSE build; otherwise run the compatible one.
The problem arises when you mix SSE, SSE2, SSE3, 3DNow!, etc., because you can't make a build for each combination.

2) Go the XviD way. Declare one function pointer and two functions:
void (*MyFunc)();
void MySSE_Func();
void MyC_Func();

Then initialize:
void init() { if (SSE) MyFunc = MySSE_Func; else MyFunc = MyC_Func; }

And then call MyFunc()

Calling MyFunc() should be slightly slower than calling a normal function, so you'll have to benchmark which is faster in each case: checking a conditional every time, or using this technique. My bet is on this technique, especially if the function is called many, many times. It should also be more cache-friendly, as there is less code inside each function compared to your approach. Another drawback is that calls through the pointer can't be inlined, so the inline keyword won't help.
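A slightly fuller sketch of the same idea, with a hypothetical signature that takes the data to process. The names `ScaleFn`, `scale_c`, `scale_sse`, `g_scale` and `init_dispatch` are made up for illustration, and the "SSE" variant is written in plain C++ (four floats per iteration) so that it compiles on any target:

```cpp
// Pointer type shared by every implementation of the operation.
using ScaleFn = void (*)(float* data, int count, float f);

// Portable fallback: one float per iteration.
void scale_c(float* data, int count, float f) {
    for (int i = 0; i < count; ++i) {
        data[i] *= f;
    }
}

// Stand-in for the SSE version: four floats per iteration, then the remainder.
void scale_sse(float* data, int count, float f) {
    int i = 0;
    for (; i + 4 <= count; i += 4) {
        data[i]     *= f;
        data[i + 1] *= f;
        data[i + 2] *= f;
        data[i + 3] *= f;
    }
    for (; i < count; ++i) {
        data[i] *= f;
    }
}

// Defaults to the portable path so calls are safe even before init_dispatch().
ScaleFn g_scale = scale_c;

// Call once at startup with the result of the CPU feature check.
void init_dispatch(bool hasSSE) {
    g_scale = hasSSE ? scale_sse : scale_c;
}
```

After `init_dispatch(...)` has run, callers simply write `g_scale(data, count, f)` and never test the flag again.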

Hope this helps
Dark Sylinc

##### Share on other sites
The author of libSIMDx86 calls a solution to that very problem "code overlay".
Basically we're talking of self-modifying code, so in other words, it's getting close to a hack.

Self-modifying code isn't necessarily as easy and straightforward as one would imagine. Leaving aside the 20 million bad things that could happen, you may have quite a struggle with your operating system to get it done in the first place.
In any case, you must mark the memory page you wrote to as "executable" on nearly every system. However, some systems don't allow you to mark a page writeable if it's executable, so you would have to first make it non-executable, writeable, do your modifications, and make the page non-writeable, executable.
Then of course, your code has to be relocatable, or you may have to map a memory region to a known address, which may not be possible.

Personally, I very much prefer writing two code paths, one using the SSE functions, and one not using them. So, for example, if you're matrix-skinning 20 models with 2000 vertices each, you check for SSE before entering the loop, and you process all 20 models either one way or the other.
This may be a nanosecond or two slower, and yes, it duplicates some code. However, it's faster than using function pointers (or virtual functions, which amount to the same thing) or many thousands of individual branches that kill all the benefits of using SSE in the first place. It also works reliably, with no hacks and no tampering with page access rights.
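As a rough sketch of hoisting the check out of the loop: the names below (`Vertex`, `Model`, `processModelScalar`, `processModelSse`, `processAll`) are hypothetical, a uniform scale stands in for real matrix skinning, and the "SSE" path is plain C++ so the example stays portable.

```cpp
#include <vector>

// Hypothetical vertex and model types for illustration only.
struct Vertex { float x, y, z, w; };
using Model = std::vector<Vertex>;

// Portable path: processes one whole model.
static void processModelScalar(Model& m, float s) {
    for (Vertex& v : m) {
        v.x *= s; v.y *= s; v.z *= s; v.w *= s;
    }
}

// Stand-in for the SSE path; a real version would use SSE instructions.
static void processModelSse(Model& m, float s) {
    for (Vertex& v : m) {
        v.x *= s; v.y *= s; v.z *= s; v.w *= s;
    }
}

// The capability flag is tested once per batch, not once per vertex or per operation.
void processAll(std::vector<Model>& models, float s, bool hasSSE) {
    if (hasSSE) {
        for (Model& m : models) processModelSse(m, s);
    } else {
        for (Model& m : models) processModelScalar(m, s);
    }
}
```

Because each path calls its function directly, the compiler is free to inline the per-model routines, which is not possible through a function pointer or a virtual call.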
