SSE/AVX Pass by Ref or Value?

Started by
5 comments, last by japro 12 years, 5 months ago
I have the following code


//N >= 1
struct A
{
    __m256 a[N]; //or __m128
};
struct B
{
    __m256 a[N]; //or __m128
    float b[8*N];
};




Which of the following will perform best over a range of N, and more importantly why? Does it even matter?



void f0(A a) { ... }
void f1(A& a) { ... } //and pointer variant
void f2(B b) { ... }
void f3(B& b) { ... } //and pointer variant


Some context:
I have a really hot loop with function calls at the moment and I'm facing the choices above. The function may or may not get inlined.

I ran some scenarios in Intel Amplifier but didn't really get any useful info. I also remember reading something about __m256 / __m128 values always being treated as register aliases (or something like that).
For A it will depend on the platform and compiler used. On platforms and for array sizes where all parameters can be passed in registers, I'd expect passing by value to be a bit more efficient than passing by reference, since passing by reference adds extra stores and loads when the value is already in a register before the call. However, most compilers and platforms won't pass a struct by value in registers - see page 19 of http://www.agner.org/optimize/calling_conventions.pdf

For B you're probably best off always passing by reference, since the array of floats is unlikely to be passed in registers.

However, none of that should matter - if the functions are simple enough for performance to be significantly affected by the cost of passing parameters, then you want to persuade the compiler to inline them where possible. That gives the compiler the best opportunity to optimize the code.
Thanks!

Just some fun info. I'm using the Intel C++ Compiler by the way.

I spent a day writing a crazy AVX-based implementation of my project and it was super painful. I then happened to write a serial version for comparison purposes. Compiled with max optimizations + the fast=2 floating point model + profile-guided optimizations, the serial version is faster than my AVX version by ~38% (with equivalent compile settings). :( Both versions had SOA data structures.

I went and looked at the ASM because I was in shock... turns out everything's been AVX'ified. The compiler also managed to reduce some heavy store-load blocks that I was having a headache over earlier.

So yea....
I've had similar experiences. When ever I've tried to apply a micro optimization I've often found the compiler to have been already applying a better one :)

Out of interest how much longer does your application take to compile with the intel optimizations turned on and off?

Haven't timed it, but it always feels considerably longer than MSVC. Compared to baseline ICC it feels similar.
By reference, as 32-bit compilers can't cope with passing them by value (MSVC, Intel etc.).

My experiments with 256-bit AVX have come out pretty poor, but using the 128-bit AVX instructions is a huge win, since the three-operand forms don't need nearly as many shuffles and moves to get their work done.
I recently made this mini benchmark in an attempt to max out my i5-2500. Compiled with ICC 12 it hits flat-out 100% peak performance (59 Gflop/s on a single core running at 3.7 GHz in turbo mode). With 4 threads it does 200 Gflop/s, which is like 95% of theoretical peak...
But the operations are somewhat synthetic. It basically performs iterated matrix multiplications on an array of float4s. Not sure if this compiles in MSVC (I think sys/time.h is a "unix thing", right?) but it should be easy to adapt.

Is there a specific reason you put it into a struct? From looking at my compiler's assembly output I concluded that "naked" __m256 values get mapped to registers almost 1:1 where possible, so passing by reference or by value didn't actually make a whole lot of difference.

[source lang="cpp"]
#include <iostream>
#include <sys/time.h>
#include <omp.h>
#include <immintrin.h>
#include <cmath>
#include <cstdlib>
#include <algorithm>
#include <vector>

/*
* compile with:
*
* g++ ComputePowerAVX.cpp -o ComputePowerAVX -O3 -mavx -fopenmp
*
* or
*
* icpc ComputePowerAVX.cpp -o ComputePowerAVX -O3 -mavx -openmp
*
* takes the number of threads as argument (default 1)
* e.g. (for 4 threads):
* ComputePowerAVX 4
*
*/

double mysecond()
{
struct timeval tp;
struct timezone tzp;

gettimeofday(&tp, &tzp);
return (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6;
}

void ComputePowerAVX(float *v, long iSize)
{
const float alpha = 1.3;
const float c = std::cos(alpha);
const float s = std::sin(alpha);

float __attribute__((aligned(32))) rotation[] =
{
1, 0, 0, 0,
0, c,-s, 0,
0, s, c, 0,
0, 0, 0, 1
};

const int reps = iSize/(1024);

#pragma omp parallel
{
//16 kb chunk, so it fits into L1
const int chunk = 4*1024*omp_get_thread_num();

__m128 m0128, m1128, m2128, m3128;
m0128 = _mm_load_ps(&(rotation[ 0]));
m1128 = _mm_load_ps(&(rotation[ 4]));
m2128 = _mm_load_ps(&(rotation[ 8]));
m3128 = _mm_load_ps(&(rotation[12]));

__m256 m0, m1, m2, m3;
m0 = _mm256_broadcast_ps(&m0128);
m1 = _mm256_broadcast_ps(&m1128);
m2 = _mm256_broadcast_ps(&m2128);
m3 = _mm256_broadcast_ps(&m3128);

for(int k = 0;k<reps;++k)
{
for(int i = chunk;i<chunk + 4*1024;i+=128)
{//8 blocks with 8 mul-add pairs with 8 operands each
__m256 v0 = _mm256_load_ps(v + i + 0);
__m256 v1 = _mm256_load_ps(v + i + 8);
__m256 v2 = _mm256_load_ps(v + i +16);
__m256 v3 = _mm256_load_ps(v + i +24);

v0 = _mm256_add_ps(_mm256_mul_ps(v0, m0), v0);
v1 = _mm256_add_ps(_mm256_mul_ps(v1, m0), v1);

__m256 v4 = _mm256_load_ps(v + i +32);
__m256 v5 = _mm256_load_ps(v + i +40);

v2 = _mm256_add_ps(_mm256_mul_ps(v2, m0), v2);
v3 = _mm256_add_ps(_mm256_mul_ps(v3, m0), v3);

__m256 v6 = _mm256_load_ps(v + i +48);
__m256 v7 = _mm256_load_ps(v + i +56);

v4 = _mm256_add_ps(_mm256_mul_ps(v4, m0), v4);
v5 = _mm256_add_ps(_mm256_mul_ps(v5, m0), v5);
v6 = _mm256_add_ps(_mm256_mul_ps(v6, m0), v6);
v7 = _mm256_add_ps(_mm256_mul_ps(v7, m0), v7);

v0 = _mm256_add_ps(_mm256_mul_ps(v0, m1), v0);
v1 = _mm256_add_ps(_mm256_mul_ps(v1, m1), v1);
v2 = _mm256_add_ps(_mm256_mul_ps(v2, m1), v2);
v3 = _mm256_add_ps(_mm256_mul_ps(v3, m1), v3);
v4 = _mm256_add_ps(_mm256_mul_ps(v4, m1), v4);
v5 = _mm256_add_ps(_mm256_mul_ps(v5, m1), v5);
v6 = _mm256_add_ps(_mm256_mul_ps(v6, m1), v6);
v7 = _mm256_add_ps(_mm256_mul_ps(v7, m1), v7);

v0 = _mm256_add_ps(_mm256_mul_ps(v0, m2), v0);
v1 = _mm256_add_ps(_mm256_mul_ps(v1, m2), v1);
v2 = _mm256_add_ps(_mm256_mul_ps(v2, m2), v2);
v3 = _mm256_add_ps(_mm256_mul_ps(v3, m2), v3);
v4 = _mm256_add_ps(_mm256_mul_ps(v4, m2), v4);
v5 = _mm256_add_ps(_mm256_mul_ps(v5, m2), v5);
v6 = _mm256_add_ps(_mm256_mul_ps(v6, m2), v6);
v7 = _mm256_add_ps(_mm256_mul_ps(v7, m2), v7);

v0 = _mm256_add_ps(_mm256_mul_ps(v0, m3), v0);
v1 = _mm256_add_ps(_mm256_mul_ps(v1, m3), v1);
v2 = _mm256_add_ps(_mm256_mul_ps(v2, m3), v2);
v3 = _mm256_add_ps(_mm256_mul_ps(v3, m3), v3);

_mm256_store_ps(v + i + 0, v0);
_mm256_store_ps(v + i + 8, v1);

v4 = _mm256_add_ps(_mm256_mul_ps(v4, m3), v4);
v5 = _mm256_add_ps(_mm256_mul_ps(v5, m3), v5);

_mm256_store_ps(v + i +16, v2);
_mm256_store_ps(v + i +24, v3);

v6 = _mm256_add_ps(_mm256_mul_ps(v6, m3), v6);
v7 = _mm256_add_ps(_mm256_mul_ps(v7, m3), v7);

_mm256_store_ps(v + i +32, v4);
_mm256_store_ps(v + i +40, v5);
_mm256_store_ps(v + i +48, v6);
_mm256_store_ps(v + i +56, v7);

v0 = _mm256_load_ps(v + i + 64);
v1 = _mm256_load_ps(v + i + 72);
v2 = _mm256_load_ps(v + i + 80);
v3 = _mm256_load_ps(v + i + 88);

v0 = _mm256_add_ps(_mm256_mul_ps(v0, m0), v0);
v1 = _mm256_add_ps(_mm256_mul_ps(v1, m0), v1);

v4 = _mm256_load_ps(v + i + 96);
v5 = _mm256_load_ps(v + i +104);

v2 = _mm256_add_ps(_mm256_mul_ps(v2, m0), v2);
v3 = _mm256_add_ps(_mm256_mul_ps(v3, m0), v3);

v6 = _mm256_load_ps(v + i +112);
v7 = _mm256_load_ps(v + i +120);

v4 = _mm256_add_ps(_mm256_mul_ps(v4, m0), v4);
v5 = _mm256_add_ps(_mm256_mul_ps(v5, m0), v5);
v6 = _mm256_add_ps(_mm256_mul_ps(v6, m0), v6);
v7 = _mm256_add_ps(_mm256_mul_ps(v7, m0), v7);

v0 = _mm256_add_ps(_mm256_mul_ps(v0, m1), v0);
v1 = _mm256_add_ps(_mm256_mul_ps(v1, m1), v1);
v2 = _mm256_add_ps(_mm256_mul_ps(v2, m1), v2);
v3 = _mm256_add_ps(_mm256_mul_ps(v3, m1), v3);
v4 = _mm256_add_ps(_mm256_mul_ps(v4, m1), v4);
v5 = _mm256_add_ps(_mm256_mul_ps(v5, m1), v5);
v6 = _mm256_add_ps(_mm256_mul_ps(v6, m1), v6);
v7 = _mm256_add_ps(_mm256_mul_ps(v7, m1), v7);

v0 = _mm256_add_ps(_mm256_mul_ps(v0, m2), v0);
v1 = _mm256_add_ps(_mm256_mul_ps(v1, m2), v1);
v2 = _mm256_add_ps(_mm256_mul_ps(v2, m2), v2);
v3 = _mm256_add_ps(_mm256_mul_ps(v3, m2), v3);
v4 = _mm256_add_ps(_mm256_mul_ps(v4, m2), v4);
v5 = _mm256_add_ps(_mm256_mul_ps(v5, m2), v5);
v6 = _mm256_add_ps(_mm256_mul_ps(v6, m2), v6);
v7 = _mm256_add_ps(_mm256_mul_ps(v7, m2), v7);

v0 = _mm256_add_ps(_mm256_mul_ps(v0, m3), v0);
v1 = _mm256_add_ps(_mm256_mul_ps(v1, m3), v1);
v2 = _mm256_add_ps(_mm256_mul_ps(v2, m3), v2);
v3 = _mm256_add_ps(_mm256_mul_ps(v3, m3), v3);

_mm256_store_ps(v + i + 64, v0);
_mm256_store_ps(v + i + 72, v1);

_mm256_store_ps(v + i + 80, v2);
_mm256_store_ps(v + i + 88, v3);

v6 = _mm256_add_ps(_mm256_mul_ps(v6, m3), v6);
v7 = _mm256_add_ps(_mm256_mul_ps(v7, m3), v7);

_mm256_store_ps(v + i + 96, v4);
_mm256_store_ps(v + i +104, v5);
_mm256_store_ps(v + i +112, v6);
_mm256_store_ps(v + i +120, v7);
}
}
}
}

int main(int argc, char *argv[])
{
int nthreads = 1;
if(argc > 1)
nthreads = std::atoi(argv[1]);

omp_set_num_threads(nthreads);
std::cout << "using " << nthreads << " threads" << std::endl;

float *s = (float*)_mm_malloc(sizeof(float)*1024*1024, 32);
std::fill(s, s+1024*1024, 42);

long size = 1l<<30;

int runs = 10;

std::vector<double> gflops(runs);
ComputePowerAVX(s, size); //"warm up"-run (preloads cache and such)

for(int i = 0;i<runs;++i)
{
double start = mysecond();
ComputePowerAVX(s, size);
double end = mysecond();
gflops.at(i) = size*4.*8.*nthreads/(end-start)*1.e-9;
std::cout << "run " << i+1 << " achieved " << gflops.at(i) << " Gflop/s" << std::endl;
}

std::sort(gflops.begin(), gflops.end());
std::cout << "\nmin: " << gflops.at(0)
<< "\nmax: " << gflops.at(runs-1)
<< "\nmedian: " << gflops.at(runs/2)
<< std::endl;

_mm_free(s);
return 0;
}
[/source]

This topic is closed to new replies.
