Jump to content
  • Advertisement
Sign in to follow this  

Rip up my SSE Sin function

This topic is 4644 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

So I've been obsessed about SIMD functionally lately. Simple things like a function to add a long array and whatnot. But Ive found that memory access limitations are a big deal, so Ive trying to find more complex functions. Things that take more then two or three ops per variable. So far the biggest challange to that Ive found is trig. Below is my first stab at implimenting a sin() as a Taylor series, using SSE. It does the first 8 terms, unrolls because throwing any if or loop into the main loop makes things crawl. The biggest downside Ive found to this is that it gives completely useless answers if the input is outside of the -2pi to 2pi range. And I can not for the life of me figure out a way to convert higher or lower radians into their -pi to pi range that doesn't require an if or loop. Aside from that little problem... its 7 times faster then doing sin() with a big set. Doing the exact same taylor series in normal C++, one at a time, its almost exactly 4 times faster. So its using the SIMD feature to its maximum, memory access limits (no kidding :p). Anyway. Here it is. Any suggestions before I move to finishing up the trig set like this guy?
int NVec::Sin_f32_SSE(f32 *Src, f32 *Dest, size_t Len)
{
    const int ElemsInVec = 4;

    if(!HasSSE)
        return UNSUPPORTED_ERROR;
    #ifndef __SSE__
        return UNSUPPORTED_ERROR;
    #else
	
    if(Src == NULL || Dest == NULL)
    return NULL_ERROR;

    if(Len <= 0)
        return SIZE_ERROR;
	
    size_t Loop, VecLen, RemainderLen;
    VecLen = Len/(ElemsInVec); // default floors
    RemainderLen = Len % (ElemsInVec);

    f32 *VS, *VD;

    VS = (f32*)Src;
    VD = (f32*)Dest;

	// This is so fucking disgusting, an unrolled Taylor series for sin.
	// to 8 fucking terms. Is there even a word for thats 8 term diviors 15! ??
	// But by god is it fast compared to NoVec calling sin()
	// 23.2183s vs 3.53527s in 540x1000000 run
    //
    // However the function goes to infinity if you leave the -360 to +360 range	

	
	v4sf Rads, Sins, Pows, Term;
	// constant storage...
	__v4sf OneOverSix = {1.0/6.0, 1.0/6.0, 1.0/6.0, 1.0/6.0};  
	__v4sf OneOverOneTwenty = {1.0/120.0, 1.0/120.0, 1.0/120.0, 1.0/120.0};
	__v4sf OneOverFiftyFourty = {1.0/5040.0, 1.0/5040.0, 1.0/5040.0, 1.0/5040.0};
	__v4sf OneOver362880 = {1.0/362880.0, 1.0/362880.0, 1.0/362880.0, 1.0/362880.0};
	__v4sf OneOver39916800 = {1.0/39916800.0, 1.0/39916800.0, 1.0/39916800.0, 1.0/39916800.0}; 
	__v4sf OneOver6227020800 = {1.0/6227020800.0, 1.0/6227020800.0, 1.0/6227020800.0, 1.0/6227020800.0};
	__v4sf OneOver1307674368000 = {1.0/1307674368000.0, 1.0/1307674368000.0, 1.0/1307674368000.0, 1.0/1307674368000.0}; 
	
	for(Loop = 0; Loop < VecLen; ++Loop) {
		
		Rads = _mm_load_ps(VS);
        VS += ElemsInVec;

		// sin(x) = x - (x^3)/6 + (x^5)/120 - (x^7)/5040 + (x^9)/362880 
		//          - (x^11)/39916800 + (x^13)/6227020800
		//          - (x^15)/1307674368000;
		 
		
		// x
                Sins = Rads;

		// (x^3)/6
		Pows = _mm_mul_ps(Rads, Rads);        // Pows = x^2
		Pows = _mm_mul_ps(Pows, Rads);        // Pows = x^3
		Term = _mm_mul_ps(Pows, OneOverSix);
		Sins = _mm_sub_ps(Sins, Term);
		
		// (x^5)/120
		Pows = _mm_mul_ps(Pows, Rads);        // Pows = x^4
		Pows = _mm_mul_ps(Pows, Rads);        // Pows = x^5
		Term = _mm_mul_ps(Pows, OneOverOneTwenty);
		Sins = _mm_add_ps(Sins, Term);
		
		// (x^7)/5040
		Pows = _mm_mul_ps(Pows, Rads);        // Pows = x^6
		Pows = _mm_mul_ps(Pows, Rads);        // Pows = x^7
		Term = _mm_mul_ps(Pows, OneOverFiftyFourty);
		Sins = _mm_sub_ps(Sins, Term);
		
		// (x^9)/362880
		Pows = _mm_mul_ps(Pows, Rads);        // Pows = x^8
		Pows = _mm_mul_ps(Pows, Rads);        // Pows = x^9
		Term = _mm_mul_ps(Pows, OneOver362880);
		Sins = _mm_add_ps(Sins, Term);
		
		// (x^11)/39916800
		Pows = _mm_mul_ps(Pows, Rads);        // Pows = x^10
		Pows = _mm_mul_ps(Pows, Rads);        // Pows = x^11
		Term = _mm_mul_ps(Pows, OneOver39916800);
		Sins = _mm_sub_ps(Sins, Term);
		
		// (x^13)/6227020800
		Pows = _mm_mul_ps(Pows, Rads);        // Pows = x^12
		Pows = _mm_mul_ps(Pows, Rads);        // Pows = x^13
		Term = _mm_mul_ps(Pows, OneOver6227020800);
		Sins = _mm_add_ps(Sins, Term);
		
		// (x^15)/1307674368000
		Pows = _mm_mul_ps(Pows, Rads);        // Pows = x^14
		Pows = _mm_mul_ps(Pows, Rads);        // Pows = x^15
		Term = _mm_mul_ps(Pows, OneOver1307674368000);
		Sins = _mm_sub_ps(Sins, Term);
		
		
        _mm_store_ps(VD, Sins);
        VD += ElemsInVec;	
	}
	
	// finish off the remainders
	Sin_f32_NoVec(VS, VD, RemainderLen);
	
	return SUCCESS;
	#endif	
}

Share this post


Link to post
Share on other sites
Advertisement
Post the assembly.

I think you can make this faster by doing 2 rows of your Loop at once. In fact, you could probably make it even faster by 3 or 4 rows...

// (x^3)/6
Pows = _mm_mul_ps(Rads, Rads); // Pows = x^2
Pows = _mm_mul_ps(Pows, Rads); // Pows = x^3
Term = _mm_mul_ps(Pows, OneOverSix);
Sins = _mm_sub_ps(Sins, Term);


becomes:

// (x^3)/6
Pows0 = _mm_mul_ps(Rads0, Rads0); // Pows = x^2
Pows1 = _mm_mul_ps(Rads1, Rads1); // Pows = x^2

Pows0 = _mm_mul_ps(Pows0, Rads0); // Pows = x^3
Pows1 = _mm_mul_ps(Pows1, Rads1); // Pows = x^3

Term0 = _mm_mul_ps(Pows0, OneOverSix);
Term1 = _mm_mul_ps(Pows1, OneOverSix);

Sins0 = _mm_sub_ps(Sins0, Term0);
Sins1 = _mm_sub_ps(Sins1, Term1);



That's how I'd write the assembly. I've never trusted intrinsics much... VS7.1 handles them like an osterich with its head in the sand. If it's converted properly, what I just wrote will saturate the entire SSE register file, which is good. Your dependency chain is horrificly long, and I would expect even my version (in proper assembly) to be stalling about 40% of the time.

You picked a particularly hard SSE problem to start with, and I would think you'll not be happy with your stalling rate, especially if you had to provide functionality to accept arbitrary input domain (even without that, it's still bad, and to improve it is unintuitive).

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

Participate in the game development conversation and more when you create an account on GameDev.net!

Sign me up!