jameszhao00

SSE/AVX Pass by Ref or Value?


I have the following code:

// N >= 1
struct A
{
    __m256 a[N]; // or __m128
};

struct B
{
    __m256 a[N]; // or __m128
    float b[8*N];
};




Which of the following will perform best across the range of N (and, more importantly, why)? Does it even matter?



void f0(A a)  { /* ... */ }
void f1(A& a) { /* ... */ } // and pointer variant
void f2(B b)  { /* ... */ }
void f3(B& b) { /* ... */ } // and pointer variant


Some context:
I have a really hot loop with function calls at the moment, and I'm facing the choices above. The functions may or may not get inlined.

I ran some scenarios in Intel VTune Amplifier but didn't really get any useful info. I also remember reading something about __m256 / __m128 always being register aliases (or something like that).

For A it will depend on the platform and compiler used. On platforms, and for array sizes, where all parameters are passed in registers, I'd expect passing by value to be a bit more efficient than passing by reference, because passing by reference adds extra stores and loads whenever the value is already in a register before the call. However, most compilers and platforms won't pass a struct by value in registers - see page 19 of http://www.agner.org/optimize/calling_conventions.pdf
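To make that concrete, here's a minimal sketch (hypothetical function names, with N fixed at 1) of the three flavours; comparing your compiler's assembly output for each is the quickest way to see what your platform actually does:

[source lang="cpp"]
#include <immintrin.h>

struct A { __m256 a[1]; };

// A "naked" __m256 by value: on ABIs that pass vectors in registers,
// v arrives in ymm0 and no memory traffic is needed.
__m256 step_naked(__m256 v) { return _mm256_add_ps(v, v); }

// The wrapping struct by value: many compilers spill the aggregate to
// the stack at the call site instead of keeping it in a register.
__m256 step_value(A s) { return _mm256_add_ps(s.a[0], s.a[0]); }

// By const reference: the callee reads through a pointer, so a value
// that lived in a register before the call gets stored and reloaded.
__m256 step_ref(const A& s) { return _mm256_add_ps(s.a[0], s.a[0]); }
[/source]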

For B you're probably best off passing by reference all of the time, since the array of floats is unlikely to be passed in registers.

However, all of that shouldn't matter: if the functions are simple enough for performance to be significantly affected by the cost of passing parameters, then you want to persuade the compiler to inline them where possible. That gives it the best opportunity to optimize the code.
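If the compiler won't cooperate, both ICC and MSVC accept __forceinline, and GCC has an attribute for the same thing. A minimal sketch (the macro name is made up):

[source lang="cpp"]
#include <immintrin.h>

// Compiler-specific inlining hints behind one macro (the name is ours).
#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
  #define FORCE_INLINE inline __attribute__((always_inline))
#else
  #define FORCE_INLINE __forceinline
#endif

// A hot helper that should disappear into its callers.
FORCE_INLINE __m256 mul_add(__m256 v, __m256 m)
{
    return _mm256_add_ps(_mm256_mul_ps(v, m), v);
}
[/source]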

Thanks!

Just some fun info. I'm using the Intel C++ Compiler, by the way.

I spent a day writing a crazy AVX-based implementation of my project and it was super painful. I then happened to write a serial version for comparison purposes. Compiled with max optimizations + the fast=2 floating-point model + profile-guided optimizations, the serial version is faster than my AVX version by ~38% (with equivalent compile settings). :( Both versions used SOA data structures.
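For reference, on the Linux side the flags I mean are roughly these (the Windows spellings are /O3, /QxAVX, /fp:fast=2, /Qprof-gen and /Qprof-use); treat this as a sketch rather than my exact command line:

icpc -O3 -xAVX -fp-model fast=2 -prof-gen main.cpp -o app   # instrumented build
./app                                                       # training run to collect a profile
icpc -O3 -xAVX -fp-model fast=2 -prof-use main.cpp -o app   # final PGO build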

I went to look at the ASM because I was in shock... it turns out everything had been AVX'ified. The compiler also managed to eliminate some heavy store-load blocks that I was having a headache over earlier.

So yea....

I've had similar experiences. Whenever I've tried to apply a micro-optimization, I've often found the compiler had already applied a better one :)

Out of interest, how much longer does your application take to compile with the Intel optimizations turned on versus off?


I haven't timed it, but it always feels considerably longer than MSVC. Compared to baseline ICC it feels similar.

By reference, as 32-bit compilers can't cope with passing them by value (MSVC, Intel, etc.).

My experiments with AVX-256 have come out pretty poor, but using the AVX-128 instructions is a huge win, due to the three-operand versions not needing nearly as many shuffles and moves to get their work done.
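For anyone wondering where that win comes from: legacy SSE instructions are destructive (two operands, the destination is also a source), so keeping a source value alive around a multiply costs an extra register copy, while the VEX-encoded 128-bit forms take a separate destination. A rough sketch of the effect (hypothetical function):

[source lang="cpp"]
#include <immintrin.h>

// 'a' must survive the multiply. With legacy SSE the compiler emits
// something like:
//     movaps xmm2, xmm0    ; copy to preserve a
//     mulps  xmm2, xmm1
// With AVX enabled the same intrinsic becomes the three-operand form,
//     vmulps xmm2, xmm0, xmm1
// and the extra move disappears.
__m128 mul_keep(__m128 a, __m128 b, __m128* a_out)
{
    __m128 r = _mm_mul_ps(a, b);
    *a_out = a;
    return r;
}
[/source]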

I recently made this mini benchmark in an attempt to max out my i5-2500. Compiled with ICC 12 it flat out hits 100% of peak performance (59 Gflop/s on a single core running at 3.7 GHz in turbo mode). With 4 threads it does 200 Gflop/s, which is about 95% of the theoretical peak...
But the operations are somewhat synthetic. It basically performs iterated matrix multiplications on an array of float4s. I'm not sure if this compiles in MSVC (I think sys/time.h is a "Unix thing", right?), but it should be easy to adapt.
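(The single-core peak figure checks out if you assume Sandy Bridge can issue one 8-wide AVX multiply and one 8-wide AVX add per cycle: 2 × 8 flops × 3.7 GHz = 59.2 Gflop/s.)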

Is there a specific reason you put it into a struct? From looking at my compiler's assembly output, I concluded that "naked" __m256 values get mapped to registers almost 1:1 where possible, so passing by reference or by value didn't actually make a whole lot of difference.

[source lang="cpp"]
#include <iostream>
#include <sys/time.h>
#include <omp.h>
#include <immintrin.h>
#include <cmath>
#include <cstdlib>
#include <algorithm>
#include <vector>

/*
* compile with:
*
* g++ ComputePowerAVX.cpp -o ComputePowerAVX -O3 -mavx -fopenmp
*
* or
*
* icpc ComputePowerAVX.cpp -o ComputePowerAVX -O3 -mavx -openmp
*
* takes the number of threads as argument (default 1)
* e.g. (for 4 threads):
* ComputePowerAVX 4
*
*/

double mysecond()
{
    struct timeval tp;
    struct timezone tzp;
    int i;

    i = gettimeofday(&tp, &tzp);
    return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}

void ComputePowerAVX(float *v, long iSize)
{
    const float alpha = 1.3;
    const float c = std::cos(alpha);
    const float s = std::sin(alpha);

    float __attribute__((aligned(32))) rotation[] =
    {
        1, 0, 0, 0,
        0, c,-s, 0,
        0, s, c, 0,
        0, 0, 0, 1
    };

    const int reps = iSize/1024;

    #pragma omp parallel
    {
        // 16 KB chunk per thread, so it fits into L1
        const int chunk = 4*1024*omp_get_thread_num();

        // broadcast each row of the rotation matrix into both lanes of a ymm
        __m128 m0128, m1128, m2128, m3128;
        m0128 = _mm_load_ps(&(rotation[ 0]));
        m1128 = _mm_load_ps(&(rotation[ 4]));
        m2128 = _mm_load_ps(&(rotation[ 8]));
        m3128 = _mm_load_ps(&(rotation[12]));

        __m256 m0, m1, m2, m3;
        m0 = _mm256_broadcast_ps(&m0128);
        m1 = _mm256_broadcast_ps(&m1128);
        m2 = _mm256_broadcast_ps(&m2128);
        m3 = _mm256_broadcast_ps(&m3128);

        for(int k = 0; k < reps; ++k)
        {
            for(int i = chunk; i < chunk + 4*1024; i += 128)
            {
                // 8 blocks with 8 mul-add pairs with 8 operands each
                __m256 v0 = _mm256_load_ps(v + i + 0);
                __m256 v1 = _mm256_load_ps(v + i + 8);
                __m256 v2 = _mm256_load_ps(v + i +16);
                __m256 v3 = _mm256_load_ps(v + i +24);

                v0 = _mm256_add_ps(_mm256_mul_ps(v0, m0), v0);
                v1 = _mm256_add_ps(_mm256_mul_ps(v1, m0), v1);

                __m256 v4 = _mm256_load_ps(v + i +32);
                __m256 v5 = _mm256_load_ps(v + i +40);

                v2 = _mm256_add_ps(_mm256_mul_ps(v2, m0), v2);
                v3 = _mm256_add_ps(_mm256_mul_ps(v3, m0), v3);

                __m256 v6 = _mm256_load_ps(v + i +48);
                __m256 v7 = _mm256_load_ps(v + i +56);

                v4 = _mm256_add_ps(_mm256_mul_ps(v4, m0), v4);
                v5 = _mm256_add_ps(_mm256_mul_ps(v5, m0), v5);
                v6 = _mm256_add_ps(_mm256_mul_ps(v6, m0), v6);
                v7 = _mm256_add_ps(_mm256_mul_ps(v7, m0), v7);

                v0 = _mm256_add_ps(_mm256_mul_ps(v0, m1), v0);
                v1 = _mm256_add_ps(_mm256_mul_ps(v1, m1), v1);
                v2 = _mm256_add_ps(_mm256_mul_ps(v2, m1), v2);
                v3 = _mm256_add_ps(_mm256_mul_ps(v3, m1), v3);
                v4 = _mm256_add_ps(_mm256_mul_ps(v4, m1), v4);
                v5 = _mm256_add_ps(_mm256_mul_ps(v5, m1), v5);
                v6 = _mm256_add_ps(_mm256_mul_ps(v6, m1), v6);
                v7 = _mm256_add_ps(_mm256_mul_ps(v7, m1), v7);

                v0 = _mm256_add_ps(_mm256_mul_ps(v0, m2), v0);
                v1 = _mm256_add_ps(_mm256_mul_ps(v1, m2), v1);
                v2 = _mm256_add_ps(_mm256_mul_ps(v2, m2), v2);
                v3 = _mm256_add_ps(_mm256_mul_ps(v3, m2), v3);
                v4 = _mm256_add_ps(_mm256_mul_ps(v4, m2), v4);
                v5 = _mm256_add_ps(_mm256_mul_ps(v5, m2), v5);
                v6 = _mm256_add_ps(_mm256_mul_ps(v6, m2), v6);
                v7 = _mm256_add_ps(_mm256_mul_ps(v7, m2), v7);

                v0 = _mm256_add_ps(_mm256_mul_ps(v0, m3), v0);
                v1 = _mm256_add_ps(_mm256_mul_ps(v1, m3), v1);
                v2 = _mm256_add_ps(_mm256_mul_ps(v2, m3), v2);
                v3 = _mm256_add_ps(_mm256_mul_ps(v3, m3), v3);

                _mm256_store_ps(v + i + 0, v0);
                _mm256_store_ps(v + i + 8, v1);

                v4 = _mm256_add_ps(_mm256_mul_ps(v4, m3), v4);
                v5 = _mm256_add_ps(_mm256_mul_ps(v5, m3), v5);

                _mm256_store_ps(v + i +16, v2);
                _mm256_store_ps(v + i +24, v3);

                v6 = _mm256_add_ps(_mm256_mul_ps(v6, m3), v6);
                v7 = _mm256_add_ps(_mm256_mul_ps(v7, m3), v7);

                _mm256_store_ps(v + i +32, v4);
                _mm256_store_ps(v + i +40, v5);
                _mm256_store_ps(v + i +48, v6);
                _mm256_store_ps(v + i +56, v7);

                // second half of the 128-float strip, same pattern
                v0 = _mm256_load_ps(v + i + 64);
                v1 = _mm256_load_ps(v + i + 72);
                v2 = _mm256_load_ps(v + i + 80);
                v3 = _mm256_load_ps(v + i + 88);

                v0 = _mm256_add_ps(_mm256_mul_ps(v0, m0), v0);
                v1 = _mm256_add_ps(_mm256_mul_ps(v1, m0), v1);

                v4 = _mm256_load_ps(v + i + 96);
                v5 = _mm256_load_ps(v + i +104);

                v2 = _mm256_add_ps(_mm256_mul_ps(v2, m0), v2);
                v3 = _mm256_add_ps(_mm256_mul_ps(v3, m0), v3);

                v6 = _mm256_load_ps(v + i +112);
                v7 = _mm256_load_ps(v + i +120);

                v4 = _mm256_add_ps(_mm256_mul_ps(v4, m0), v4);
                v5 = _mm256_add_ps(_mm256_mul_ps(v5, m0), v5);
                v6 = _mm256_add_ps(_mm256_mul_ps(v6, m0), v6);
                v7 = _mm256_add_ps(_mm256_mul_ps(v7, m0), v7);

                v0 = _mm256_add_ps(_mm256_mul_ps(v0, m1), v0);
                v1 = _mm256_add_ps(_mm256_mul_ps(v1, m1), v1);
                v2 = _mm256_add_ps(_mm256_mul_ps(v2, m1), v2);
                v3 = _mm256_add_ps(_mm256_mul_ps(v3, m1), v3);
                v4 = _mm256_add_ps(_mm256_mul_ps(v4, m1), v4);
                v5 = _mm256_add_ps(_mm256_mul_ps(v5, m1), v5);
                v6 = _mm256_add_ps(_mm256_mul_ps(v6, m1), v6);
                v7 = _mm256_add_ps(_mm256_mul_ps(v7, m1), v7);

                v0 = _mm256_add_ps(_mm256_mul_ps(v0, m2), v0);
                v1 = _mm256_add_ps(_mm256_mul_ps(v1, m2), v1);
                v2 = _mm256_add_ps(_mm256_mul_ps(v2, m2), v2);
                v3 = _mm256_add_ps(_mm256_mul_ps(v3, m2), v3);
                v4 = _mm256_add_ps(_mm256_mul_ps(v4, m2), v4);
                v5 = _mm256_add_ps(_mm256_mul_ps(v5, m2), v5);
                v6 = _mm256_add_ps(_mm256_mul_ps(v6, m2), v6);
                v7 = _mm256_add_ps(_mm256_mul_ps(v7, m2), v7);

                v0 = _mm256_add_ps(_mm256_mul_ps(v0, m3), v0);
                v1 = _mm256_add_ps(_mm256_mul_ps(v1, m3), v1);
                v2 = _mm256_add_ps(_mm256_mul_ps(v2, m3), v2);
                v3 = _mm256_add_ps(_mm256_mul_ps(v3, m3), v3);

                _mm256_store_ps(v + i + 64, v0);
                _mm256_store_ps(v + i + 72, v1);

                v4 = _mm256_add_ps(_mm256_mul_ps(v4, m3), v4);
                v5 = _mm256_add_ps(_mm256_mul_ps(v5, m3), v5);

                _mm256_store_ps(v + i + 80, v2);
                _mm256_store_ps(v + i + 88, v3);

                v6 = _mm256_add_ps(_mm256_mul_ps(v6, m3), v6);
                v7 = _mm256_add_ps(_mm256_mul_ps(v7, m3), v7);

                _mm256_store_ps(v + i + 96, v4);
                _mm256_store_ps(v + i +104, v5);
                _mm256_store_ps(v + i +112, v6);
                _mm256_store_ps(v + i +120, v7);
            }
        }
    }
}

int main(int argc, char *argv[])
{
    int nthreads = 1;
    if(argc > 1)
        nthreads = std::atoi(argv[1]);

    omp_set_num_threads(nthreads);
    std::cout << "using " << nthreads << " threads" << std::endl;

    float *s = (float*)_mm_malloc(sizeof(float)*1024*1024, 32);
    std::fill(s, s + 1024*1024, 42);

    long size = 1l<<30;

    int runs = 10;

    std::vector<double> gflops(runs);
    ComputePowerAVX(s, size); // "warm up" run (preloads cache and such)

    for(int i = 0; i < runs; ++i)
    {
        double start = mysecond();
        ComputePowerAVX(s, size);
        double end = mysecond();
        gflops.at(i) = size*4.*8.*nthreads/(end - start)*1.e-9;
        std::cout << "run " << i+1 << " achieved " << gflops.at(i) << " Gflop/s" << std::endl;
    }

    std::sort(gflops.begin(), gflops.end());
    std::cout << "\nmin: "    << gflops.at(0)
              << "\nmax: "    << gflops.at(runs-1)
              << "\nmedian: " << gflops.at(runs/2)
              << std::endl;

    _mm_free(s);
    return 0;
}
[/source]
