SSE/AVX Pass by Ref or Value?

Started by
5 comments, last by japro 12 years, 5 months ago
I have the following code


//N >= 1
struct A
{
    __m256 a[N]; //or __m128
};
struct B
{
    __m256 a[N]; //or __m128
    float b[8*N];
};




Which of the following will perform best over a range of N, and more importantly why? Does it even matter?



void f0(A a) { ... }
void f1(A& a) { ... } //and pointer variant
void f2(B b) { ... }
void f3(B& b) { ... } //and pointer variant


Some context:
I have a really hot loop with function calls at the moment and I'm facing the choices above. The function may or may not get inlined.

I ran some scenarios in Intel Amplifier but didn't really get any useful info. I also remember reading something about __m256 / __m128 values always being treated as register aliases (or something like that).
For A it will depend on the platform and compiler used. On platforms and for array sizes where all parameters can be passed in registers, I'd expect passing by value to be a bit more efficient than passing by reference, since passing by reference adds extra stores and loads when the value is already in a register before the call. However, most compilers and platforms won't pass a struct by value in registers - see page 19 of http://www.agner.org/optimize/calling_conventions.pdf

For B you're probably best off always passing by reference, since the array of floats is unlikely to be passed in registers.

However, none of that should matter - if the functions are simple enough for performance to be significantly affected by the cost of passing parameters, then you want to persuade the compiler to inline them where possible. That gives the compiler the best opportunity to optimize the code.
Thanks!

Just some fun info. I'm using the Intel C++ Compiler by the way.

I spent a day writing a crazy AVX-based implementation of my project and it was super painful. I then happened to write a serial version for comparison purposes. Compiled with max optimizations + the fast=2 floating point model + profile-guided optimizations, the serial version is faster than my AVX version by ~38% (with equivalent compile settings). :( Both versions had SOA data structures.

I went and looked at the ASM because I was in shock... turns out everything's been AVX'ified. The compiler also managed to reduce some heavy store-load blocks that I was having a headache over earlier.

So yea....
I've had similar experiences. When ever I've tried to apply a micro optimization I've often found the compiler to have been already applying a better one :)

Out of interest how much longer does your application take to compile with the intel optimizations turned on and off?

Haven't timed it, but it always feels considerably longer than MSVC. Compared to baseline ICC it feels similar.
By reference, as 32-bit compilers can't cope with passing them by value (MSVC, Intel etc.).

My experiments with 256-bit AVX have come out pretty poor, but using the 128-bit AVX instructions is a huge win, since the three-operand forms don't need nearly as many shuffles and moves to get their work done.
I recently made this mini benchmark in an attempt to max out my i5-2500. Compiled with ICC 12 it hits flat-out 100% peak performance (59 Gflop/s on a single core running at 3.7 GHz in turbo mode). With 4 threads it does 200 Gflop/s, which is like 95% of theoretical peak...
But the operations are somewhat synthetic. It basically performs iterated matrix multiplications on an array of float4s. Not sure if this compiles in MSVC (I think sys/time.h is a "unix thing", right?) but it should be easy to adapt.

Is there a specific reason you put it into a struct? From looking at my compiler's assembly output I concluded that "naked" __m256 values get mapped to registers almost 1:1 where possible, so passing by reference or by value didn't actually make a whole lot of difference.

[source lang="cpp"]
#include <iostream>
#include <sys/time.h>
#include <omp.h>
#include <immintrin.h>
#include <cmath>
#include <cstdlib>
#include <algorithm>
#include <vector>

/*
* compile with:
*
* g++ ComputePowerAVX.cpp -o ComputePowerAVX -O3 -mavx -fopenmp
*
* or
*
* icpc ComputePowerAVX.cpp -o ComputePowerAVX -O3 -mavx -openmp
*
* takes the number of threads as argument (default 1)
* e.g. (for 4 threads):
* ComputePowerAVX 4
*
*/

double mysecond()
{
struct timeval tp;
struct timezone tzp;

gettimeofday(&tp, &tzp);
return (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6;
}

void ComputePowerAVX(float *v, long iSize)
{
const float alpha = 1.3;
const float c = std::cos(alpha);
const float s = std::sin(alpha);

float __attribute__((aligned(32))) rotation[] =
{
1, 0, 0, 0,
0, c,-s, 0,
0, s, c, 0,
0, 0, 0, 1
};

const int reps = iSize/(1024);

#pragma omp parallel
{
//16 kb chunk, so it fits into L1
const int chunk = 4*1024*omp_get_thread_num();

__m128 m0128, m1128, m2128, m3128;
m0128 = _mm_load_ps(&(rotation[ 0]));
m1128 = _mm_load_ps(&(rotation[ 4]));
m2128 = _mm_load_ps(&(rotation[ 8]));
m3128 = _mm_load_ps(&(rotation[12]));

__m256 m0, m1, m2, m3;
m0 = _mm256_broadcast_ps(&m0128);
m1 = _mm256_broadcast_ps(&m1128);
m2 = _mm256_broadcast_ps(&m2128);
m3 = _mm256_broadcast_ps(&m3128);

for(int k = 0;k<reps;++k)
{
for(int i = chunk;i<chunk + 4*1024;i+=128)
{//8 blocks with 8 mul-add pairs with 8 operands each
__m256 v0 = _mm256_load_ps(v + i + 0);
__m256 v1 = _mm256_load_ps(v + i + 8);
__m256 v2 = _mm256_load_ps(v + i +16);
__m256 v3 = _mm256_load_ps(v + i +24);

v0 = _mm256_add_ps(_mm256_mul_ps(v0, m0), v0);
v1 = _mm256_add_ps(_mm256_mul_ps(v1, m0), v1);

__m256 v4 = _mm256_load_ps(v + i +32);
__m256 v5 = _mm256_load_ps(v + i +40);

v2 = _mm256_add_ps(_mm256_mul_ps(v2, m0), v2);
v3 = _mm256_add_ps(_mm256_mul_ps(v3, m0), v3);

__m256 v6 = _mm256_load_ps(v + i +48);
__m256 v7 = _mm256_load_ps(v + i +56);

v4 = _mm256_add_ps(_mm256_mul_ps(v4, m0), v4);
v5 = _mm256_add_ps(_mm256_mul_ps(v5, m0), v5);
v6 = _mm256_add_ps(_mm256_mul_ps(v6, m0), v6);
v7 = _mm256_add_ps(_mm256_mul_ps(v7, m0), v7);

v0 = _mm256_add_ps(_mm256_mul_ps(v0, m1), v0);
v1 = _mm256_add_ps(_mm256_mul_ps(v1, m1), v1);
v2 = _mm256_add_ps(_mm256_mul_ps(v2, m1), v2);
v3 = _mm256_add_ps(_mm256_mul_ps(v3, m1), v3);
v4 = _mm256_add_ps(_mm256_mul_ps(v4, m1), v4);
v5 = _mm256_add_ps(_mm256_mul_ps(v5, m1), v5);
v6 = _mm256_add_ps(_mm256_mul_ps(v6, m1), v6);
v7 = _mm256_add_ps(_mm256_mul_ps(v7, m1), v7);

v0 = _mm256_add_ps(_mm256_mul_ps(v0, m2), v0);
v1 = _mm256_add_ps(_mm256_mul_ps(v1, m2), v1);
v2 = _mm256_add_ps(_mm256_mul_ps(v2, m2), v2);
v3 = _mm256_add_ps(_mm256_mul_ps(v3, m2), v3);
v4 = _mm256_add_ps(_mm256_mul_ps(v4, m2), v4);
v5 = _mm256_add_ps(_mm256_mul_ps(v5, m2), v5);
v6 = _mm256_add_ps(_mm256_mul_ps(v6, m2), v6);
v7 = _mm256_add_ps(_mm256_mul_ps(v7, m2), v7);

v0 = _mm256_add_ps(_mm256_mul_ps(v0, m3), v0);
v1 = _mm256_add_ps(_mm256_mul_ps(v1, m3), v1);
v2 = _mm256_add_ps(_mm256_mul_ps(v2, m3), v2);
v3 = _mm256_add_ps(_mm256_mul_ps(v3, m3), v3);

_mm256_store_ps(v + i + 0, v0);
_mm256_store_ps(v + i + 8, v1);

v4 = _mm256_add_ps(_mm256_mul_ps(v4, m3), v4);
v5 = _mm256_add_ps(_mm256_mul_ps(v5, m3), v5);

_mm256_store_ps(v + i +16, v2);
_mm256_store_ps(v + i +24, v3);

v6 = _mm256_add_ps(_mm256_mul_ps(v6, m3), v6);
v7 = _mm256_add_ps(_mm256_mul_ps(v7, m3), v7);

_mm256_store_ps(v + i +32, v4);
_mm256_store_ps(v + i +40, v5);
_mm256_store_ps(v + i +48, v6);
_mm256_store_ps(v + i +56, v7);

v0 = _mm256_load_ps(v + i + 64);
v1 = _mm256_load_ps(v + i + 72);
v2 = _mm256_load_ps(v + i + 80);
v3 = _mm256_load_ps(v + i + 88);

v0 = _mm256_add_ps(_mm256_mul_ps(v0, m0), v0);
v1 = _mm256_add_ps(_mm256_mul_ps(v1, m0), v1);

v4 = _mm256_load_ps(v + i + 96);
v5 = _mm256_load_ps(v + i +104);

v2 = _mm256_add_ps(_mm256_mul_ps(v2, m0), v2);
v3 = _mm256_add_ps(_mm256_mul_ps(v3, m0), v3);

v6 = _mm256_load_ps(v + i +112);
v7 = _mm256_load_ps(v + i +120);

v4 = _mm256_add_ps(_mm256_mul_ps(v4, m0), v4);
v5 = _mm256_add_ps(_mm256_mul_ps(v5, m0), v5);
v6 = _mm256_add_ps(_mm256_mul_ps(v6, m0), v6);
v7 = _mm256_add_ps(_mm256_mul_ps(v7, m0), v7);

v0 = _mm256_add_ps(_mm256_mul_ps(v0, m1), v0);
v1 = _mm256_add_ps(_mm256_mul_ps(v1, m1), v1);
v2 = _mm256_add_ps(_mm256_mul_ps(v2, m1), v2);
v3 = _mm256_add_ps(_mm256_mul_ps(v3, m1), v3);
v4 = _mm256_add_ps(_mm256_mul_ps(v4, m1), v4);
v5 = _mm256_add_ps(_mm256_mul_ps(v5, m1), v5);
v6 = _mm256_add_ps(_mm256_mul_ps(v6, m1), v6);
v7 = _mm256_add_ps(_mm256_mul_ps(v7, m1), v7);

v0 = _mm256_add_ps(_mm256_mul_ps(v0, m2), v0);
v1 = _mm256_add_ps(_mm256_mul_ps(v1, m2), v1);
v2 = _mm256_add_ps(_mm256_mul_ps(v2, m2), v2);
v3 = _mm256_add_ps(_mm256_mul_ps(v3, m2), v3);
v4 = _mm256_add_ps(_mm256_mul_ps(v4, m2), v4);
v5 = _mm256_add_ps(_mm256_mul_ps(v5, m2), v5);
v6 = _mm256_add_ps(_mm256_mul_ps(v6, m2), v6);
v7 = _mm256_add_ps(_mm256_mul_ps(v7, m2), v7);

v0 = _mm256_add_ps(_mm256_mul_ps(v0, m3), v0);
v1 = _mm256_add_ps(_mm256_mul_ps(v1, m3), v1);
v2 = _mm256_add_ps(_mm256_mul_ps(v2, m3), v2);
v3 = _mm256_add_ps(_mm256_mul_ps(v3, m3), v3);

_mm256_store_ps(v + i + 64, v0);
_mm256_store_ps(v + i + 72, v1);

_mm256_store_ps(v + i + 80, v2);
_mm256_store_ps(v + i + 88, v3);

v6 = _mm256_add_ps(_mm256_mul_ps(v6, m3), v6);
v7 = _mm256_add_ps(_mm256_mul_ps(v7, m3), v7);

_mm256_store_ps(v + i + 96, v4);
_mm256_store_ps(v + i +104, v5);
_mm256_store_ps(v + i +112, v6);
_mm256_store_ps(v + i +120, v7);
}
}
}
}

int main(int argc, char *argv[])
{
int nthreads = 1;
if(argc > 1)
nthreads = std::atoi(argv[1]);

omp_set_num_threads(nthreads);
std::cout << "using " << nthreads << " threads" << std::endl;

float *s = (float*)_mm_malloc(sizeof(float)*1024*1024, 32);
std::fill(s, s+1024*1024, 42);

long size = 1l<<30;

int runs = 10;

std::vector<double> gflops(runs);
ComputePowerAVX(s, size); //"warm up"-run (preloads cache and such)

for(int i = 0;i<runs;++i)
{
double start = mysecond();
ComputePowerAVX(s, size);
double end = mysecond();
gflops.at(i) = size*4.*8.*nthreads/(end-start)*1.e-9;
std::cout << "run " << i+1 << " achieved " << gflops.at(i) << " Gflop/s" << std::endl;
}

std::sort(gflops.begin(), gflops.end());
std::cout << "\nmin: " << gflops.at(0)
<< "\nmax: " << gflops.at(runs-1)
<< "\nmedian: " << gflops.at(runs/2)
<< std::endl;

_mm_free(s);
return 0;
}
[/source]

This topic is closed to new replies.
