Newbie SSE Question

This topic is 4302 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.


After several posters recommended SSE, I tried it. I found a simple tutorial, yet I get an "Access violation reading location 0xffffffff" when I try something simple.

#include <xmmintrin.h>

class MyClass
{
public:
    void Test();

protected:
    __declspec(align(16)) float mRow_1[MAX_LEN];
    __declspec(align(16)) float mRow_2[MAX_LEN];
};

void MyClass::Test()
{
    int xy = 0;
    __m128 tmp;
    __m128* pSource1 = (__m128*) mRow_1;
    __m128* pSource2 = (__m128*) mRow_2;

    for(xy = 0; xy < 100; ++xy)
    {
        tmp = _mm_add_ps(*pSource1 , *pSource2); //failure here!!
    }
}

I must be making a very simple error. I saw an example that does the same thing. Any suggestions?

1) How big is MAX_LEN, and why are you using 100 instead of it in the loop?

2) In this piece of code:


__m128* pSource1 = (__m128*) mRow_1;
__m128* pSource2 = (__m128*) mRow_2;




Shouldn't there be an & before mRow_1 and mRow_2? (I can't remember my pointer arithmetic; 2 hours of sleep does that to you.)

Quote:
Original post by fathom88
Isn't &Array[0] the same as Array?


Like I said, I've had 2 hours of sleep :). You are, of course, correct.

I think it's a memory alignment error. When I declared

__declspec(align(16)) float Row_1[MAX_LEN];
__declspec(align(16)) float Row_2[MAX_LEN];


as locals in the function, the function worked fine (sort of). I can't access the results. Do I need to align the entire class or something?
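One plausible explanation, which nothing in this thread confirms, is that the MyClass object itself was not 16-byte aligned: an aligned member only ends up aligned if the object containing it is. Locals on the stack are placed correctly by the compiler, but before C++17 a plain new ignores the extra alignment, and 32-bit allocators typically guarantee only 8 bytes. Here is a minimal sketch for checking addresses. It uses standard alignas as a stand-in for __declspec(align(16)), and MAX_LEN = 128, AlignedRows, is_aligned_16 and stack_instance_is_aligned are made-up names and values for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed value: the thread never says how big MAX_LEN is.
const std::size_t MAX_LEN = 128;

// alignas(16) on a member raises the alignment requirement of the whole
// struct, so the compiler aligns stack and static instances for us.
struct AlignedRows {
    alignas(16) float row1[MAX_LEN];
    alignas(16) float row2[MAX_LEN];
};

// True when the address is a multiple of 16 bytes.
bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Checks both member arrays of an instance that lives on the stack.
bool stack_instance_is_aligned() {
    AlignedRows rows;
    return is_aligned_16(rows.row1) && is_aligned_16(rows.row2);
}
```

Calling is_aligned_16(this) at the top of the crashing member function would show quickly whether the object's address is the culprit.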

I've also had problems when accessing 'aligned' data as class members. What compiler are you using? Some compilers just choke on SSE-aligned members.

When I had the above problems I was allocating the offending data structures on the stack. By dynamically allocating from the heap, the problem was solved. However this may not be an option when you have lots of small structures.

The only other option is to use unaligned (much slower) loads from your class member arrays.

arm.

PS. You don't update the array pointers in your loop example above.
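Both points above can be sketched in a few lines: step through the arrays four floats at a time, and use the unaligned load and store intrinsics (_mm_loadu_ps, _mm_storeu_ps), which accept any address. The function name add_rows_unaligned and the output array are assumptions for illustration; the original snippet never stored its result anywhere.

```cpp
#include <xmmintrin.h>
#include <cassert>
#include <cstddef>

// Computes out[i] = a[i] + b[i], four floats per iteration.
// The pointers may have any alignment because only the unaligned
// load/store intrinsics are used.
// Assumption for this sketch: count is a multiple of 4.
void add_rows_unaligned(const float* a, const float* b,
                        float* out, std::size_t count) {
    assert(count % 4 == 0);
    for (std::size_t i = 0; i < count; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   // load 4 floats from a
        __m128 vb = _mm_loadu_ps(b + i);   // load 4 floats from b
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
}
```

Note that the index advances by 4 floats each pass, so 100 iterations of a loop like this would cover 400 floats; MAX_LEN would need to be at least that large.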

I fixed it doing this:

1. Declare the row as a pointer.
float *mRow_1;

2. Dynamically allocate the memory.
mRow_1 = (float*) _aligned_malloc(MAX_LEN * sizeof(float), 16);
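For anyone repeating this fix: memory from _aligned_malloc has to be released with _aligned_free, not free or delete. A small wrapper keeps the two paired. The sketch below uses _mm_malloc and _mm_free, which come in through <xmmintrin.h> on MSVC, GCC and Clang, instead of the MSVC-only calls; the class name AlignedRow and the function add_rows_aligned are made up for illustration.

```cpp
#include <xmmintrin.h>  // SSE intrinsics; also brings in _mm_malloc / _mm_free
#include <cassert>
#include <cstddef>
#include <new>          // std::bad_alloc

// Owns a 16-byte-aligned float buffer and frees it automatically.
class AlignedRow {
public:
    explicit AlignedRow(std::size_t count)
        : data_(static_cast<float*>(_mm_malloc(count * sizeof(float), 16))),
          count_(count) {
        if (!data_) throw std::bad_alloc();
    }
    ~AlignedRow() { _mm_free(data_); }

    // Copying would free the same buffer twice, so it is disabled.
    AlignedRow(const AlignedRow&) = delete;
    AlignedRow& operator=(const AlignedRow&) = delete;

    float* data() { return data_; }
    const float* data() const { return data_; }
    std::size_t size() const { return count_; }

private:
    float* data_;
    std::size_t count_;
};

// out[i] = a[i] + b[i] using aligned loads and stores.
// Assumption: all three rows have the same size, a multiple of 4.
void add_rows_aligned(const AlignedRow& a, const AlignedRow& b, AlignedRow& out) {
    assert(a.size() == b.size() && a.size() == out.size());
    assert(a.size() % 4 == 0);
    for (std::size_t i = 0; i < a.size(); i += 4) {
        __m128 sum = _mm_add_ps(_mm_load_ps(a.data() + i),
                                _mm_load_ps(b.data() + i));
        _mm_store_ps(out.data() + i, sum);
    }
}
```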


Yes, I realize my code snippet is wrong. My code is actually more complex than my snippet would suggest. I wrote a quick snippet which only captures the part where I'm having trouble.

I'm using VC++ (Visual Studio 7). When I compile in debug, my code actually runs a little slower than my pure C++ version. However, it runs super fast when I run in release. Does SSE only show up in release mode?

Yeah, SSE code is generated in debug mode; it's just that, like normal code, the assembly generated by the compiler is unoptimised.

arm.

There is a critically important step you missed:

Profile your code before and after.

You need to remember that optimizing compilers have options to automatically vectorize your code (in other words, use SSE, SSE2, and possibly other parallelizations) for you. All you have to do is search the documentation and set a compiler switch.
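As a rough illustration of what "let the compiler do it" looks like: a plain loop over non-overlapping arrays is the kind of code auto-vectorizers handle well. The function name add_rows_plain is made up, and __restrict is a compiler extension (accepted by GCC, Clang and MSVC), not standard C++. Whether the loop actually gets vectorized depends on the compiler and its version: GCC does it at -O3, and MSVC only gained an auto-vectorizer in Visual Studio 2012, so older versions will not do it regardless of the /arch setting.

```cpp
#include <cstddef>

// Plain scalar code: out[i] = a[i] + b[i].
// __restrict promises the compiler that the three arrays do not overlap,
// which removes one common obstacle to automatic vectorization.
void add_rows_plain(const float* __restrict a,
                    const float* __restrict b,
                    float* __restrict out,
                    std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Looking at the generated assembly (or the compiler's vectorization report, where available) is the only reliable way to see what happened.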

You also need to remember that using SSE requires FPU state changes, which can cause a CPU pipeline stall and very easily make your code slower rather than faster.


When doing this sort of change in the name of 'optimization', you absolutely must use a profiler to figure out if this math is even a bottleneck (it probably isn't) and if it is a bottleneck, determine if your new fancy SSE code actually makes your program run faster.

I have been involved in two cases (one in 1997 and another in 1999) where some programmers decided to spend a ton of effort in writing SSE vector libraries. After profiling, we found that the optimizing compilers generated better, faster code than the 'optimized' MMX and XMM inlined classes.


After a few years of real-world experience, looming deadlines, maximum CPU and memory quotas, and everything else you will likely encounter, you will probably learn that these are the steps involved in optimizing:

1. Make the section of code work
2. Profile to identify slow spots
3. If something is surprisingly slow, figure out why
4. Remove major unexpected slowdowns
5. Profile again, moving back to step 3 until it runs 'fast enough'
6. Move on to the next section of code
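A real profiler is the right tool for steps 2 and 5, but for a quick before/after check of a single routine, a stopwatch helper can be enough. This is only a sketch: best_time_ms is a made-up name, and it reports the fastest of several runs in milliseconds, which reduces noise but says nothing about where the time goes.

```cpp
#include <chrono>

// Runs work() the given number of times and returns the fastest run in
// milliseconds. Returns -1.0 if repetitions is zero or negative.
template <typename Work>
double best_time_ms(Work work, int repetitions) {
    double best = -1.0;
    for (int r = 0; r < repetitions; ++r) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        if (best < 0.0 || ms < best) {
            best = ms;
        }
    }
    return best;
}
```

Time the scalar and the SSE version with the same input in an optimized build, and make sure the results are actually used afterwards, otherwise the optimizer may delete the work being measured.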

The changes almost always fall into two types: use a better algorithm, and use a different API call. I can count on one hand (less than 32) instances in the past decade where we absolutely had to rewrite code in asm or use other CPU intrinsics directly. It used to be a big issue -- I had to do this type of asm optimization coding myself back on 12-, 33-, and 50-MHz machines. But now that machine speeds are measured in GHz, you normally don't need to bother.

frob.

To frob,

Thanks for the suggestion. You are correct. I really don't want to use the SSE extension if I don't have to. I tried setting the compiler flag to use SSE in the hope that it would convert my code for me. However, it didn't work. As for improving the algorithm, it is pretty simple: pull N numbers from an array table, sum them up, and then multiply by 1/N. I was able to get a good performance increase by accessing the array directly via a pointer instead of making a local copy of the area of interest. I still need to boost performance more. I think I'll have to pack my data into a new array and try to come up with an algorithm that uses SSE to perform the sum.
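Assuming the N values have been packed into one contiguous float array, the sum-and-scale step could look something like the sketch below. The name mean_sse is made up for illustration. It keeps four running partial sums in one register, adds them together at the end, and handles any leftover elements with a scalar loop.

```cpp
#include <xmmintrin.h>
#include <cassert>
#include <cstddef>

// Mean of n floats: sum them four at a time, then multiply by 1/n.
// Unaligned loads are used, so data may have any alignment.
float mean_sse(const float* data, std::size_t n) {
    assert(n > 0);
    __m128 acc = _mm_setzero_ps();          // four partial sums
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);              // spill the partial sums
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    for (; i < n; ++i) {                    // leftover 0 to 3 elements
        sum += data[i];
    }
    return sum * (1.0f / static_cast<float>(n));
}
```

Because the additions happen in a different order than in a simple scalar loop, the result can differ from the scalar version in the last few bits.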

Quote:
Original post by frob
I can count on one hand (less than 32) instances in the past decade where we absolutely had to rewrite code in asm or use other CPU intrinsics directly.

Wow, how many fingers do you have?

Quote:
Original post by CGameProgrammer
Quote:
Original post by frob
I can count on one hand (less than 32) instances in the past decade where we absolutely had to rewrite code in asm or use other CPU intrinsics directly.

Wow, how many fingers do you have?
Assuming he has 10, he can get all the way up to 1023. If he's one of those lucky people with extras, he might be able to go as high as 2047 or even 4095.

Quote:
Original post by Promit
Quote:
Original post by CGameProgrammer
Quote:
Original post by frob
I can count on one hand (less than 32) instances in the past decade where we absolutely had to rewrite code in asm or use other CPU intrinsics directly.

Wow, how many fingers do you have?
Assuming he has 10, he can get all the way up to 1023. If he's one of those lucky people with extras, he might be able to go as high as 2047 or even 4095.

Technically he said one hand, but you're basically right, he can count up to 31, as he said. I feel foolish.
