Which of the following will perform best across the range of N and M (and, more importantly, why)? Does it even matter?
void f0(A a) { ... }
void f1(A& a) { ... } //and pointer variant
void f2(B b) { ... }
void f3(B& b) { ... } //and pointer variant
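(The actual definitions of A and B aren't reproduced here; for the discussion below, assume shapes roughly like the following, with N and M as the element counts - these exact layouts are hypothetical:)

struct A { __m256 data[N]; }; // struct of SIMD vectors
struct B { float data[M]; };  // struct containing a plain float array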
Some context:
I have a really hot loop with function calls in it at the moment, and I'm facing the choices above. The function might or might not get inlined.
I ran some scenarios in Intel Amplifier but didn't really get any useful information out of it. I also remember reading something about __m256 / __m128 always being register aliases (or something along those lines).
For A it will depend on the platform and compiler used. On platforms, and for array sizes, where all the parameters can be passed in registers, I'd expect passing by value to be a bit more efficient than passing by reference, since passing by reference adds extra stores and loads when the value is already in a register before the function call. However, most compilers and platforms won't pass a struct by value in registers - see page 19 of http://www.agner.org/optimize/calling_conventions.pdf
For B you're probably best off passing by reference all of the time, since the array of floats is unlikely to be passed in registers.
However, none of that should matter: if the functions are simple enough for performance to be significantly affected by the cost of passing parameters, then you want to persuade the compiler to inline them wherever possible. That gives it the best opportunity to optimize the code.
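As a minimal sketch of that last point (the B layout here is hypothetical, taking M = 8 for concreteness; __forceinline is the ICC/MSVC spelling, GCC and Clang use __attribute__((always_inline)) instead):

#include <immintrin.h>

struct B { float data[8]; }; // hypothetical layout, M = 8

// Once the helper is inlined, the by-value vs by-reference question largely
// disappears: the compiler can keep b's elements in a ymm register across
// what used to be the call boundary.
__forceinline __m256 accumulate(const B& b, __m256 acc)
{
    return _mm256_add_ps(acc, _mm256_loadu_ps(b.data));
}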
Just some fun info. I'm using the Intel C++ Compiler by the way.
I spent a day writing a crazy AVX-based implementation of my project, and it was super painful. I then happened to write a serial version for comparison purposes. Compiled with max optimizations + the fast=2 floating-point model + profile-guided optimization, the serial version is faster than my AVX version by ~38% (with equivalent compile settings). Both versions use SOA data structures.
I went to look at the ASM because I was in shock... it turns out everything had been AVX'ified. The compiler also managed to cut down the heavy store-load blocks that had been giving me a headache earlier.
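For illustration only (this is not the actual project code), the sort of thing I mean is that a plain scalar loop over SOA data like the one below typically comes out of ICC as AVX code by itself at high optimization levels (e.g. /O3 /QxAVX):

// Structure-of-arrays layout: each field is its own contiguous float array.
struct ParticlesSOA {
    float* x;
    float* y;
    float* vx;
    float* vy;
};

void integrate(ParticlesSOA& p, int n, float dt)
{
    // Written as plain scalar code; the compiler usually turns this into
    // vmovups/vmulps/vaddps over ymm registers without any intrinsics.
    for (int i = 0; i < n; ++i) {
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
    }
}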
I've had similar experiences. Whenever I've tried to apply a micro-optimization, I've often found the compiler was already applying a better one.
Out of interest, how much longer does your application take to compile with the Intel optimizations turned on versus off?
I haven't timed it, but it always feels considerably longer than MSVC. Compared to baseline ICC it feels about the same.
By reference, as 32-bit compilers (MSVC, Intel, etc.) can't cope with passing them by value.
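A quick sketch of what that means in practice (my own example, not from the original post): on a 32-bit target the by-value signature typically fails to compile because the compiler can't guarantee the 32-byte stack alignment of the __m256 argument, while the by-reference version is fine:

#include <immintrin.h>

// void g(__m256 v);        // 32-bit MSVC/ICC: rejected with an "aligned formal parameter" error
void g(const __m256& v);    // passing by reference side-steps the alignment problem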
My experiments with 256-bit AVX have come out pretty poor, but using the 128-bit AVX instructions is a huge win, because the three-operand forms don't need nearly as many shuffles and moves to get their work done.
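A tiny example of the three-operand point (my own illustration, not code from the thread): the same 128-bit intrinsic expression compiled once as legacy SSE and once with VEX encodings enabled (/arch:AVX, /QxAVX or -mavx):

#include <immintrin.h>

// (a + b) * (a - b): both the add and the sub need a and b intact, so with the
// destructive two-operand SSE forms the compiler has to movaps one input into
// a scratch register first; with the VEX three-operand encodings
// (vaddps/vsubps/vmulps) each result goes to a fresh register and the copy
// disappears.
__m128 sum_times_diff(__m128 a, __m128 b)
{
    return _mm_mul_ps(_mm_add_ps(a, b), _mm_sub_ps(a, b));
}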
I recently put together this mini benchmark in an attempt to max out my i5-2500. Compiled with ICC 12 it hits flat-out 100% peak performance (59 Gflop/s on a single core running at 3.7 GHz in turbo mode). With 4 threads it does 200 Gflop/s, which is about 95% of theoretical peak...
But the operations are somewhat synthetic. It basically performs iterated matrix multiplications on an array of float4. Not sure if this compiles in MSVC (I think sys/time.h is a "unix thing", right?), but it should be easy to adapt.
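The source itself isn't included here, but a rough reconstruction of the kind of kernel being described (iterated 4x4-matrix-times-float4 over an array) might look like the sketch below - the sizes, iteration count and matrix are made up, and <chrono> stands in for sys/time.h so it should also build under MSVC:

#include <immintrin.h>
#include <chrono>
#include <cstdio>

int main()
{
    const int n = 4096, iters = 20000;
    static __m128 v[4096];
    for (int i = 0; i < n; ++i) v[i] = _mm_set1_ps(1.0f);

    // Columns of a 4x4 matrix; a permutation here, so values stay bounded.
    const __m128 c0 = _mm_set_ps(0, 0, 1, 0), c1 = _mm_set_ps(0, 0, 0, 1),
                 c2 = _mm_set_ps(1, 0, 0, 0), c3 = _mm_set_ps(0, 1, 0, 0);

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int it = 0; it < iters; ++it)
        for (int i = 0; i < n; ++i) {
            __m128 x = v[i];
            // Broadcast each component of x, multiply by its matrix column, accumulate.
            __m128 r = _mm_mul_ps(_mm_shuffle_ps(x, x, 0x00), c0);
            r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(x, x, 0x55), c1));
            r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(x, x, 0xAA), c2));
            r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(x, x, 0xFF), c3));
            v[i] = r;
        }
    auto t1 = std::chrono::high_resolution_clock::now();

    // 4 muls + 3 adds per vector, 4 lanes each: 28 flops per element per iteration.
    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%.2f Gflop/s (check %.1f)\n",
                28.0 * n * iters / secs / 1e9, _mm_cvtss_f32(v[0]));
    return 0;
}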
Is there a specific reason you put it into a struct? From looking at the assembly output of my compiler, I concluded that "naked" __m256 values get mapped to registers almost 1:1 where possible, so passing by reference or by value didn't actually make a whole lot of difference.
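Something like the following is the comparison being made (hypothetical example, not code from the thread); whether the wrapped version still travels in a ymm register depends on the ABI, the struct's contents, and whether the call gets inlined:

#include <immintrin.h>

// A bare __m256 parameter normally maps straight onto a ymm register
// (System V does this by default; Win64 needs __vectorcall for it).
__m256 scale_naked(__m256 v, __m256 s) { return _mm256_mul_ps(v, s); }

// The same value wrapped in a struct may instead be passed through memory,
// depending on the ABI and struct layout - which is where the by-value vs
// by-reference question starts to matter.
struct Wrapped { __m256 v; };
__m256 scale_wrapped(const Wrapped& w, __m256 s) { return _mm256_mul_ps(w.v, s); }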