
SSE2 optimisation

taz0010 · 2011-01-22T06:55:01

I've never used the SSE intrinsics in MSVC before, so I'm looking for advice on whether my particular algorithm can be made faster. It determines whether a plane intersects a convex polygon in 3D space. While the algorithm runs a very large number of times, I don't think I can parallelise across separate operations, as subsequent stages of the process modify the data set. Plus I'm using MSVC's STL, so I don't have 16-byte alignment on my Vec3s.

The plane to convex polygon algorithm is as follows:

- Iterate over the polygon's points, calculating the point-to-plane distance of each
- The plane intersects the polygon if:
  a) Distances p and p+1 have opposite signs and both have magnitude greater than epsilon, or
  b) Distance p is within epsilon of zero, but p-1 and p+1 have magnitude greater than epsilon and opposite signs
- An edge of the polygon lies on the plane if distances p and p+1 are both within epsilon of zero

First I was thinking of using intrinsics to evaluate dv = (planePt - v).dot(planeN) each iteration, but maybe it would be better to calculate the operation for 4 points in parallel. But I don't know if the above process can be adapted to SSE, or if it's even worth trying. Could anyone with experience using SSE give me some advice here? Is this an avenue worth pursuing at all?
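For illustration, here is a rough sketch of the "4 points in parallel" idea. It is not code from the thread, and it rests on some assumptions: the polygon's points have been copied into separate x/y/z float arrays (structure-of-arrays), the count is padded to a multiple of 4, and the names planeDistances4, classify4 and SideMasks are made up for the example. Unaligned loads (_mm_loadu_ps) are used so that 16-byte alignment is not required.

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2 header; also provides the SSE float intrinsics

// Hypothetical result type: one bit per point in a group of 4.
struct SideMasks {
    int front;  // bit i set when distance i >  epsilon
    int back;   // bit i set when distance i < -epsilon
};

// Computes dv = (planePt - v).dot(planeN) for 4 points per iteration.
// Assumes count is a multiple of 4 (pad the arrays if it is not).
void planeDistances4(const float* xs, const float* ys, const float* zs,
                     std::size_t count,
                     const float planePt[3], const float planeN[3],
                     float* outDist)
{
    // Broadcast the plane point and normal into all four lanes once.
    const __m128 px = _mm_set1_ps(planePt[0]);
    const __m128 py = _mm_set1_ps(planePt[1]);
    const __m128 pz = _mm_set1_ps(planePt[2]);
    const __m128 nx = _mm_set1_ps(planeN[0]);
    const __m128 ny = _mm_set1_ps(planeN[1]);
    const __m128 nz = _mm_set1_ps(planeN[2]);

    for (std::size_t i = 0; i < count; i += 4) {
        // Unaligned loads, so the arrays need no special alignment.
        const __m128 dx = _mm_sub_ps(px, _mm_loadu_ps(xs + i));
        const __m128 dy = _mm_sub_ps(py, _mm_loadu_ps(ys + i));
        const __m128 dz = _mm_sub_ps(pz, _mm_loadu_ps(zs + i));

        // Four dot products at once: dx*nx + dy*ny + dz*nz.
        const __m128 d = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(dx, nx), _mm_mul_ps(dy, ny)),
            _mm_mul_ps(dz, nz));

        _mm_storeu_ps(outDist + i, d);
    }
}

// Classifies 4 distances against +/- epsilon without branching.
SideMasks classify4(const float* dist, float epsilon)
{
    const __m128 d = _mm_loadu_ps(dist);
    SideMasks m;
    m.front = _mm_movemask_ps(_mm_cmpgt_ps(d, _mm_set1_ps(epsilon)));
    m.back  = _mm_movemask_ps(_mm_cmplt_ps(d, _mm_set1_ps(-epsilon)));
    return m;
}
```

A point with neither bit set is within epsilon of the plane, so the intersection and edge-on-plane rules above become comparisons between neighbouring bits of the two masks. Whether this beats the scalar loop would need measuring, since small polygons may not have enough points to amortise the copy into separate arrays.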


Started by taz0010 January 11, 2011 09:47 AM

8 comments, last by Zoner 13 years, 3 months ago

Zoner


January 22, 2011 06:55 AM

The info on eliminating branching is interesting. I haven't had any luck with these optimisations in the past. It seems that correctly predicted branches cost virtually nothing on the i7, and emitting extra code to remove a branch normally results in a performance hit. I wonder if it's similar on the iPhone platform.

Each chunk of code is its own little universe

I work a lot on the consoles, where the cost of branching, load-hit-stores, and pointer dereferences when the cache isn't warmed up with the data are all painfully high.

PC (aka Intel) hardware is a lot more advanced in many ways: Intel's branch prediction scheme is pretty good, and their hardware prefetch look-ahead borders on 'scary good' most of the time.

http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts/

In many cases it's not the fact that there is a big decision gate in the code that's the problem, it's the number of tests. A lot of code tests 2-5 things at a time (i.e. chains of && and ||, etc.), and except in the case of operator overloading, which breaks short-circuit evaluation, this turns into a cluster of branches. A good real-world example is a standard ray-box slabs test, which is full of > and <, which are really lots of little branches. In some cases this can be fixed by using the ternary operator instead of an 'if'; in others it requires platform intrinsics (fsel in the case of the PowerPC). We have a branch-free version which does all the tests without any early outs, and it performs much faster.
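To make the slabs example concrete, here is a minimal sketch of what a test without early outs can look like. It is not the version mentioned above, just an illustration under some assumptions: the Ray and Box structs and the rayHitsBox name are invented for the example, the ray stores a precomputed 1/direction, and no direction component is exactly zero. The min/max calls stand in for the comparisons; compilers commonly turn them into select or min/max instructions rather than jumps, though that is worth checking in the generated assembly.

```cpp
#include <algorithm>

// Hypothetical types for the sketch.
struct Ray {
    float origin[3];
    float invDir[3];  // 1 / direction, precomputed per ray
};

struct Box {
    float lo[3];
    float hi[3];
};

// Slabs test with no early outs: every axis is always evaluated and the
// only decision is the final comparison.
bool rayHitsBox(const Ray& r, const Box& b, float tMax)
{
    float tNear = 0.0f;  // latest entry into any slab so far
    float tFar  = tMax;  // earliest exit from any slab so far

    for (int axis = 0; axis < 3; ++axis) {  // fixed trip count
        const float t0 = (b.lo[axis] - r.origin[axis]) * r.invDir[axis];
        const float t1 = (b.hi[axis] - r.origin[axis]) * r.invDir[axis];

        // min/max order the two plane hits regardless of ray direction,
        // replacing the usual "if (t0 > t1) swap" branch.
        tNear = std::max(tNear, std::min(t0, t1));
        tFar  = std::min(tFar,  std::max(t0, t1));
    }

    // The ray overlaps all three slabs only if it enters before it exits.
    return tNear <= tFar;
}
```

The trade-off is the one discussed in this thread: the version with early outs does less work when the first test fails, while this one does a fixed amount of work with nothing to mispredict.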

http://www.gearboxsoftware.com/
