Awesome, this works right out of the box! This code is around 30% faster than the native C++ code. Thanks for the fast response, Zoner!
[quote name='Zoner' timestamp='1327960607' post='4907781']
The loop will likely need to be unrolled 2-4 more times as to pipeline better (i.e. use more registers until it starts spilling over onto the stack)
If the data is aligned, the load and store can use the aligned 'non-u' versions instead.
I am using the non-u versions, but it didn't make much difference. Unrolling the loop (4 times) also didn't have a significant impact, although I re-used the same variables. By "use more registers", did you mean I should introduce a separate set of variables for each unrolled iteration?
[/quote]
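For reference, here's the kind of unrolled loop I'm asking about — a minimal sketch, where the scale-by-2.0f kernel and the function name are made up for illustration, assuming 16-byte-aligned float arrays so the aligned 'non-u' load/store can be used, and with a separate `__m128` variable per unrolled step:

```cpp
#include <immintrin.h>
#include <cstddef>

// Hypothetical kernel: dst[i] = src[i] * 2.0f, unrolled 4x.
// Each unrolled step uses its own __m128 variable so the compiler can
// assign each step its own register instead of serializing on one.
void scale2x_unrolled(const float* src, float* dst, std::size_t n)
{
    const __m128 k = _mm_set1_ps(2.0f);
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128 a = _mm_load_ps(src + i);       // aligned 'non-u' loads
        __m128 b = _mm_load_ps(src + i + 4);
        __m128 c = _mm_load_ps(src + i + 8);
        __m128 d = _mm_load_ps(src + i + 12);
        a = _mm_mul_ps(a, k);
        b = _mm_mul_ps(b, k);
        c = _mm_mul_ps(c, k);
        d = _mm_mul_ps(d, k);
        _mm_store_ps(dst + i,      a);         // aligned stores
        _mm_store_ps(dst + i + 4,  b);
        _mm_store_ps(dst + i + 8,  c);
        _mm_store_ps(dst + i + 12, d);
    }
    for (; i < n; ++i)                         // scalar tail
        dst[i] = src[i] * 2.0f;
}
```

Is that what you meant, or should the unrolled copies also be interleaved differently?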
SIMD intrinsics can only be audited by looking at optimized code (unoptimized SIMD code is pretty horrific). Basically, when an algorithm gets too complicated, the compiler has to spill various XMM registers onto the stack. So you have to build the code, check out the asm in a debugger, and see whether it is doing that or not. This is much less of a problem with 64-bit code, as there are twice as many registers to work with.
Re-using the same variables should work for a lot of code, although marking the pointers __restrict will probably be necessary so the compiler can schedule the code more aggressively. If the restrict is helping, the resulting asm should look something like:
read A
do work A
read B
do work B
store A
do more work on B
read C
store B
do work C
store C
vs
read A
do work A
store A
read B
do work B
store B
read C
do work C
store C
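As a concrete sketch of where the qualifier goes (the scale-by-2 kernel and names are made up; the only point is __restrict on the pointers, which MSVC, GCC, and Clang all accept as an extension):

```cpp
#include <xmmintrin.h>
#include <cstddef>

// __restrict promises the compiler that src and dst never alias, so it is
// free to hoist the next iteration's load above the previous iteration's
// store and emit the interleaved read/work/store schedule shown above.
// Handles n in multiples of 4 only, for brevity.
void scale2x_restrict(const float* __restrict src,
                      float* __restrict dst,
                      std::size_t n)
{
    const __m128 k = _mm_set1_ps(2.0f);
    for (std::size_t i = 0; i + 4 <= n; i += 4) {
        __m128 v = _mm_load_ps(src + i);       // aligned load
        _mm_store_ps(dst + i, _mm_mul_ps(v, k));
    }
}
```

Without the restrict, the compiler has to assume a store through dst might change what the next load from src reads, which forces the strictly sequential read/work/store pattern.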