asm vs intrinsics for SSE

This topic is 4394 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Just curious as to what people's thoughts/experiences have been with using SSE intrinsics. I have a loop that I'm rewriting that is perfect for SSE, but I'm wondering if I should use the intrinsics instead of inline asm. In an older post on this forum, a poster said that compilers did a horrible job with SSE intrinsics. I have a feeling my question has already been discussed to death, so I apologize in advance if it's been asked already :)

I haven't experimented with this myself. However, I would assume that with intrinsics the compiler is free to manage register usage and do instruction scheduling in ways it couldn't if you hand-coded the assembly yourself. This could be particularly useful for small inline functions that get injected into different parts of your code, where different scheduling might produce better performance.
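To make the "small inline function" case concrete, here is a minimal sketch. The names `madd4` and `madd_array` are made up for illustration; only the `_mm_*` calls are real SSE intrinsics from `<xmmintrin.h>`. The point is that register allocation and scheduling around the inlined helper are left to the compiler at each call site.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (__m128, _mm_*_ps)
#include <cstddef>

// Hypothetical helper: computes a * b + c on four packed floats.
// Being a plain inline function, it is scheduled per call site.
static inline __m128 madd4(__m128 a, __m128 b, __m128 c) {
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}

// Hypothetical loop: out[i] = x[i] * y[i] + z[i] for i in [0, n).
void madd_array(float* out, const float* x, const float* y,
                const float* z, std::size_t n) {
    std::size_t i = 0;
    // Main loop: four floats per iteration, unaligned loads/stores
    // so no alignment is assumed about the caller's buffers.
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        __m128 vz = _mm_loadu_ps(z + i);
        _mm_storeu_ps(out + i, madd4(vx, vy, vz));
    }
    // Scalar tail for the remaining 0..3 elements.
    for (; i < n; ++i) {
        out[i] = x[i] * y[i] + z[i];
    }
}
```

Whether the generated code is actually good is something to check in the disassembly for your compiler and flags.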

Also, intrinsics can be replaced with generic functions on architectures that don't support these instructions. Or, as a reverse example, the early Xbox 360 devkits shipped with intrinsics for instructions that weren't yet available, but would be on the final hardware.
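A rough sketch of the fallback idea, assuming the usual predefined macros are how your compiler signals SSE support (`__SSE__` on GCC/Clang, `_M_X64` / `_M_IX86_FP` on MSVC). `EXAMPLE_HAVE_SSE` and `add4` are names invented for this example.

```cpp
// Assumption: these predefined macros indicate SSE support on the
// target compiler; adjust the test for other toolchains.
#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>
  #define EXAMPLE_HAVE_SSE 1
#else
  #define EXAMPLE_HAVE_SSE 0
#endif

// Hypothetical function: out[0..3] = a[0..3] + b[0..3].
// Callers see the same signature on every architecture.
inline void add4(float* out, const float* a, const float* b) {
#if EXAMPLE_HAVE_SSE
    // SSE path: one packed add on four floats.
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
#else
    // Generic path for targets without SSE.
    for (int i = 0; i < 4; ++i) {
        out[i] = a[i] + b[i];
    }
#endif
}
```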

Guest Anonymous Poster
Quote:

besides, inline asm doesn't work when compiling for x64.


Didn't know that, I'll go look that up. But that pretty much convinced me to go with the intrinsics, because I sure as hell am not going to write the entire function in assembly.

Thanks!

Oh the anon poster above is me, forgot to fill in name and password field.

[edit]
Oh wait, thought about it for a sec: I can use an inline function for my inner loop and just write that in assembler, right? hmmmm

That would be neat, cause I could just let the compiler generate the assembler for the function and make changes as I see fit.

[Edited by - Unfadable on December 4, 2005 11:59:17 AM]

Quote:
Original post by Unfadable
In an older post on this forum, a poster said that compilers did a horrible job with SSE intrinsics.

There are several reasons why intrinsics are (or should be) better than inline assembly, but without a link to the post it is hard to know why the poster wrote that.

I did some research on this myself. Your mileage may vary, so you should just try using intrinsics and see if the compiler generates what you want. Anyway, what I came up with for intrinsic functions is:

pros: portability, ease of use

cons: doesn't optimize as well as smart hand-coding.


Quote:
Original post by Code-R
you're talking REALLY smart here...

The pipeline analysis at this level is not difficult to do, and is well documented. You have to do most of it to write the intrinsics in the first place, so writing the fully optimal assembly isn't that much harder. Probably the best thing is to write the intrinsic version and see whether the compiler does the right thing in the generated code. If not, replace it with pure assembly. (Only problem is that this cycle is so much work that you might be better off in assembly anyway.)

I rewrote my math library to use SSE intrinsics a couple of years ago.
I then profiled/looked at a lot of code generated by the compiler.
In my first version the compiler did some stupid things and used a lot of extra loads/stores.
Most of these came from pointer aliasing and not using the temporal return optimization (is that the name for it?).
After "fixing" these issues using the restrict keyword and fiddling with the code the compiler did a VERY good job.
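As a sketch of the aliasing point (not the poster's actual code): in the hypothetical `scale_points` below, without the restrict qualifiers the compiler has to assume a store to `points` might overwrite `scale`, so it may reload `scale` on every iteration. With them it is free to keep it in a register. Note that `__restrict` is a compiler extension in C++ (accepted by GCC, Clang and MSVC); standard `restrict` exists only in C99.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Hypothetical function: multiplies each 4-float point by `scale`.
// Assumption: the caller guarantees `points` and `scale` do not
// overlap; breaking that promise is undefined behaviour.
void scale_points(float* __restrict points,
                  const float* __restrict scale,
                  std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        // Loop-invariant load: hoistable out of the loop only if the
        // compiler knows the store below cannot modify scale[0..3].
        __m128 s = _mm_loadu_ps(scale);
        __m128 p = _mm_loadu_ps(points + 4 * i);
        _mm_storeu_ps(points + 4 * i, _mm_mul_ps(p, s));
    }
}
```

Comparing the generated assembly with and without the qualifiers is the easiest way to see whether it makes a difference on your compiler.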

In many cases I couldn't make a better hand optimized version.
For some special cases (ray-tri intersection for instance), I could gain a few percent by doing it by hand.
Most likely because I had a better understanding of the whole algorithm.

My advice is to use intrinsics, and if your application's run time is dominated by one small, frequently called function, maybe hand tune that function.
For a raytracer you might be able to gain a few % using hand tuned code for the ray-tri test, IF you can first work away the memory speed bottleneck, which IMO is the bigger gain.

The last games I worked on had very few isolated hot functions. I don't recall seeing any function above the 5% mark, so rewriting such a function in asm to make it 3% faster would increase the total speed of the game by close to nothing.
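The "close to nothing" estimate can be checked with the usual Amdahl's law formula. This is just a back-of-the-envelope sketch; `overall_speedup` is a name made up here, and reading "3% more speed" as the function itself running 1.03 times faster is an assumption.

```cpp
// Amdahl's law: overall speedup when a part of the program that takes
// `fraction` of the total run time (0..1) becomes `local_speedup`
// times faster (1.03 means 3% faster).
inline double overall_speedup(double fraction, double local_speedup) {
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup);
}
```

For a function at the 5% mark made 3% faster, `overall_speedup(0.05, 1.03)` comes out at roughly 1.0015, i.e. about a 0.15% gain for the whole program.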

Just my 2c

Well my first attempt at using the intrinsics yesterday seemed to go poorly. Guess I'll need to look at the generated code, but I bet my problems are the same as yours.

I'll need to google some of those terms you mentioned, but would you be able to give me some basic tips on 'do's and don'ts' concerning the intrinsics?

Microsoft strongly encourages using intrinsics rather than inline assembly. The main problem with inline assembly is that the compiler cannot reschedule instructions within and around an inline assembly block. It also reduces the compiler's options when choosing where and when to inline, and potentially causes sub-optimal register allocation. The compiler is unable to perform higher level optimizations on inline assembly (such as common sub-expression elimination, loop unrolling, etc.), although hopefully that will be less of an issue, since your assembly should already be pretty optimal in that respect if you're going to the trouble of optimizing at this level.

That's the theory; in practice there may be cases where it's possible to get better performance using inline assembly. I've generally found the compiler is pretty good, provided you are careful about using the restrict keyword where appropriate and make sure to do things like always defining your own assignment operators and copy constructors when using wrapped intrinsic types. Fairly subtle mistakes or omissions in your class definitions can cause very sub-optimal code generation. Intrinsics work best when you use the intrinsic 4-component vector types directly rather than wrapping them in classes, but sometimes the benefits of working with higher level types outweigh the slight efficiency loss.
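For reference, a minimal sketch of what such a wrapped type might look like. `Vec4` and `store` are hypothetical names, and the explicit copy constructor and assignment operator simply follow the advice above; whether they change the generated code depends on the compiler and its version.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical wrapper around the raw __m128 vector type.
struct Vec4 {
    __m128 v;

    Vec4() : v(_mm_setzero_ps()) {}
    explicit Vec4(__m128 x) : v(x) {}
    // _mm_set_ps takes elements from highest to lowest, so x ends up
    // in element 0.
    Vec4(float x, float y, float z, float w) : v(_mm_set_ps(w, z, y, x)) {}

    // Explicit copy operations that copy the register value directly.
    Vec4(const Vec4& o) : v(o.v) {}
    Vec4& operator=(const Vec4& o) { v = o.v; return *this; }
};

// Component-wise arithmetic, kept inline so it can collapse to a
// single SSE instruction at the call site.
inline Vec4 operator+(const Vec4& a, const Vec4& b) {
    return Vec4(_mm_add_ps(a.v, b.v));
}
inline Vec4 operator*(const Vec4& a, const Vec4& b) {
    return Vec4(_mm_mul_ps(a.v, b.v));
}

// Writes the four components to out[0..3] (no alignment assumed).
inline void store(float* out, const Vec4& a) {
    _mm_storeu_ps(out, a.v);
}
```

The same expressions can also be written against `__m128` directly, which is the form suggested above when efficiency matters most.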
