SSE2 optimisation

8 comments, last by Zoner 13 years, 3 months ago


The info on eliminating branching is interesting. I haven't had any luck with these optimisations in the past. It seems that correctly predicted branches cost virtually nothing on the i7, and emitting extra code to remove a branch normally results in a performance hit. I wonder if it's similar on the iPhone platform.
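For anyone who hasn't seen the kind of "extra code" being talked about here, this is a minimal sketch of one common form: replacing an 'if' with a mask built from the comparison result. The function names are made up for illustration, and whether the branch-free version wins depends entirely on the CPU and on how predictable the branch is. Many compilers will turn the branchy version into a conditional move on their own.

```cpp
#include <cstdint>

// Branchy version: may compile to a compare plus a conditional jump.
inline std::int32_t clampToZeroBranchy(std::int32_t v)
{
    if (v < 0)
        return 0;
    return v;
}

// Branch-free version (hypothetical name): the comparison yields 0 or 1,
// negating it gives a mask of all zeros or all ones, and the AND either
// keeps v or clears it. This is a few extra ALU ops instead of a jump.
inline std::int32_t clampToZeroBranchless(std::int32_t v)
{
    const std::int32_t mask = -static_cast<std::int32_t>(v >= 0);
    return v & mask;
}
```

On a well-predicted branch the first version can easily be the faster of the two, which matches the experience described above, so measure before committing to either.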


Each chunk of code is its own little universe :)

I work a lot on the consoles, where the cost of branching, load-hit-stores, and pointer dereferences when the cache hasn't been warmed up with the data are all painfully high.

PC (aka Intel) hardware is a lot more advanced in many ways: Intel's branch prediction scheme is pretty good, and their hardware prefetch look-ahead borders on 'scary good' most of the time.

http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts/


In many cases it's not the fact that there is a big decision gate in the code that's the problem, it's the number of tests. A lot of code tests 2-5 things at a time (i.e. chains of && and ||, etc.), and except in the case of operator overloading, which breaks short-circuit evaluation, this turns into a cluster of branches. A good real-world example is the standard ray-box slabs test, which is full of > and < comparisons that are really lots of little branches. In some cases this can be fixed by using the ternary operator instead of an 'if'; in others it requires platform intrinsics (fsel in the case of the PowerPC). We have a branch-free version which does all the tests without any early outs, and it performs much faster.
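To make the slabs example concrete, here is a rough sketch of what a branch-free ray-box test can look like. This is not the version mentioned above, just an illustration of the idea: Vec3, rayHitsBox and the parameter names are all assumptions, and it relies on the compiler turning std::min/std::max on floats into min/max instructions (typical on SSE targets, but not guaranteed).

```cpp
#include <algorithm>

// Hypothetical minimal vector type for the sketch.
struct Vec3 { float x, y, z; };

// Slab test with no early outs. invDir is 1/direction per axis,
// precomputed by the caller. Every axis is always evaluated and the
// only comparison that matters is the single one at the end.
// Caveat: if a direction component is zero and the origin lies exactly
// on that slab plane, 0 * inf gives NaN; this sketch does not handle it.
inline bool rayHitsBox(const Vec3& origin, const Vec3& invDir,
                       const Vec3& boxMin, const Vec3& boxMax,
                       float tMaxRay)
{
    // Distances along the ray to the two planes of each slab.
    const float tx1 = (boxMin.x - origin.x) * invDir.x;
    const float tx2 = (boxMax.x - origin.x) * invDir.x;
    const float ty1 = (boxMin.y - origin.y) * invDir.y;
    const float ty2 = (boxMax.y - origin.y) * invDir.y;
    const float tz1 = (boxMin.z - origin.z) * invDir.z;
    const float tz2 = (boxMax.z - origin.z) * invDir.z;

    // Latest entry and earliest exit across the three slabs.
    float tNear = std::max(std::max(std::min(tx1, tx2), std::min(ty1, ty2)),
                           std::min(tz1, tz2));
    float tFar  = std::min(std::min(std::max(tx1, tx2), std::max(ty1, ty2)),
                           std::max(tz1, tz2));

    // Clip to the valid ray interval [0, tMaxRay].
    tNear = std::max(tNear, 0.0f);
    tFar  = std::min(tFar, tMaxRay);

    return tNear <= tFar;
}
```

Calling it with origin (-5,-5,-5), invDir (1,1,1) and a box from (-1,-1,-1) to (1,1,1) enters at t = 4 and leaves at t = 6, so it reports a hit for any tMaxRay of 4 or more. The same six subtractions and multiplies map naturally onto SSE registers if you want to test several boxes or rays at once.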
http://www.gearboxsoftware.com/

This topic is closed to new replies.
