fir

SSE and branching


I am wondering about one thing.

Some code (loops) is well suited to SSE because it involves no scattered branching, only straightforward calculations.

But some code does involve branching (for example, in a rasterization pipeline some triangles are clipped out and not calculated any further).

1) Is it better to run such branched stages as normal scalar code, then 'collect' the results and run the rest on SSE?

2) Or is it generally better to try to run everything in SSE parallel mode in some way (which would probably involve processing some "zombie" values that were already clipped out)?

Is it possible to answer this meaningfully?

 


Let the compiler decide what is best.

You can help the compiler with its decision, but that is it. Use the __m128 data types and look up how to use them. The rest will be done by the compiler.
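For readers who have not seen __m128 in use, here is a minimal sketch of what that looks like with the standard SSE intrinsics from xmmintrin.h. The function name add_div_sse and its parameters are made up for illustration; it computes out[i] = c[i] + a[i] / b[i] four floats at a time, with a scalar loop for the leftover elements.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_div_ps, ... */

/* Hypothetical helper: out[i] = c[i] + a[i] / b[i] for i in [0, count). */
void add_div_sse(const float *a, const float *b, const float *c,
                 float *out, size_t count)
{
    assert(a && b && c && out);

    size_t i = 0;
    /* Main loop: 4 floats per iteration (unaligned loads, so any pointer works). */
    for (; i + 4 <= count; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        __m128 vd = _mm_add_ps(vc, _mm_div_ps(va, vb));
        _mm_storeu_ps(out + i, vd);
    }
    /* Scalar tail for the last 0..3 elements. */
    for (; i < count; ++i)
        out[i] = c[i] + a[i] / b[i];
}
```

A plain scalar loop of the same shape is also something compilers can often auto-vectorize on their own, so it is worth comparing against before committing to intrinsics.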


ISPC does your option #2. If you're processing 4 objects at once and items A, B and C need to take a branch while D does not, then the branch is processed for all 4 items, but the results calculated for item D are thrown out. D is a zombie object: it is masked out for the duration of the branch.
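The same masking idea can be written by hand with SSE intrinsics. The sketch below is not what ISPC generates, just an illustration of the technique under made-up assumptions: a hypothetical rule "if x > threshold then x * scale, else keep x", and a count that is a multiple of 4. The branch body is computed for every lane, and a comparison mask selects which lanes keep the result.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical rule: out[i] = (x[i] > threshold) ? x[i] * scale : x[i]. */
void scale_above_masked(const float *x, float *out, size_t count,
                        float threshold, float scale)
{
    assert(count % 4 == 0);  /* assumption: no tail handling in this sketch */

    const __m128 vthr   = _mm_set1_ps(threshold);
    const __m128 vscale = _mm_set1_ps(scale);

    for (size_t i = 0; i < count; i += 4) {
        __m128 v    = _mm_loadu_ps(x + i);
        __m128 mask = _mm_cmpgt_ps(v, vthr);  /* all-ones in lanes that take the branch */

        /* Cheap early-out: no lane takes the branch, so skip its work. */
        if (_mm_movemask_ps(mask) == 0) {
            _mm_storeu_ps(out + i, v);
            continue;
        }

        /* Branch body runs for all 4 lanes, "zombie" lanes included. */
        __m128 taken = _mm_mul_ps(v, vscale);

        /* Select: masked lanes get the branch result, the rest keep v. */
        __m128 r = _mm_or_ps(_mm_and_ps(mask, taken),
                             _mm_andnot_ps(mask, v));
        _mm_storeu_ps(out + i, r);
    }
}
```

The wasted work on masked-out lanes is the price of this approach; it tends to pay off when the branch body is cheap or most lanes usually agree.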

 

Option #1 is also good, though. If you can do a first pass where you find out which objects need to take the branch, you can then re-sort the data so that in a second pass you operate on nice groups of 4 objects at a time.
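A rough sketch of that two-pass idea, with hypothetical names and a made-up survival test (x > threshold) standing in for the real clipping test: pass 1 is ordinary branchy scalar code that packs the survivors into a dense array, and pass 2 runs branch-free SSE over that array.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Pass 1 (scalar, branchy): copy the elements that survive the test into
 * a dense array. Returns how many survived. dense must hold count floats. */
size_t collect_survivors(const float *x, size_t count, float threshold,
                         float *dense)
{
    assert(x && dense);
    size_t n = 0;
    for (size_t i = 0; i < count; ++i)
        if (x[i] > threshold)
            dense[n++] = x[i];
    return n;
}

/* Pass 2 (SSE, no branches): process the survivors 4 at a time. */
void scale_dense_sse(float *dense, size_t n, float scale)
{
    const __m128 vscale = _mm_set1_ps(scale);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(dense + i,
                      _mm_mul_ps(_mm_loadu_ps(dense + i), vscale));
    for (; i < n; ++i)  /* scalar tail */
        dense[i] *= scale;
}
```

If the results have to go back to their original positions, pass 1 would also need to record the source indices; that is left out here.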

 

On that note, I highly recommend using ISPC rather than C (or actually, in conjunction with C - they inter-operate nicely) when trying to write SIMD code... it's so much easier than learning SSE/etc... plus it works on other instruction sets such as AVX (float8 and float16 architectures) as well as SSE (float4).

 

Intrinsics don't seem that hard. Of course it means a couple of days of headache, but once learned they should not be so hard, I hope.

As for an extension, for some years I have been in favor of simply adding some hardware-accelerated types as primitive types in the C language:

 

float4 a, b, c, d;

a = {1,2,3,4};
b = {1,2,3,5};
c = {1,3,3,5};

d = c + a / b;      // hardware-accelerated types at the C language level


"As for an extension, for some years I have been in favor of simply adding some hardware-accelerated types as primitive types in the C language"

That's what C++ math libraries such as GLM will give you. Since C doesn't have operator overloading, you can't do this with just a library.
If you want to extend the language with niceties like that, maybe you should mix some C++ into your C code.
 

That example written with the ISPC extensions would look like this. On SSE the foreach loop increments by 4 each time and the "varying float" values are float4s; on AVX the loop increments by 8 each time and the varying float values are float8s.
//export means this function is callable from regular C code:
export void Example( uniform const float aIn[], uniform const float bIn[], uniform const float cIn[], uniform float dOut[], uniform uint count )  // dOut is written to, so it must not be const
{
	//Assuming you've just got 4 values:
	//  programCount and programIndex are magic variables telling us how wide the SIMD instructions are.
	assert( programCount == 4 );          //assert we're on SSE float4 instructions... On AVX this would be 8.
	varying float a = aIn[programIndex];  // load 4 floats
	varying float b = bIn[programIndex];  // load 4 floats
	varying float c = cIn[programIndex];  // load 4 floats
	varying float d = c + a / b;          // do 4-wide SIMD math
	dOut[programIndex] = d;               // store 4 floats


	//OR: If you've got any number of values to process:
	foreach( index = 0 ... count )         // increment N at a time
	{
		varying float a = aIn[index];  // load N floats
		varying float b = bIn[index];  // load N floats
		varying float c = cIn[index];  // load N floats
		varying float d = c + a / b;   // do N-wide SIMD math
		dOut[index] = d;               // store N floats
	}
}
Edited by Hodgman


Clang is very good at loop vectorization with SSE/AVX, and it also has a nice vector extension that gives you math and logic operators as well as HLSL/GLSL-like swizzling.

typedef float __attribute__((ext_vector_type(4))) float4;
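As a small sketch of how that typedef can be used from C: the earlier d = c + a / b example becomes ordinary operator syntax. The GCC fallback to vector_size and the function name add_div are assumptions added here so the snippet also builds without Clang; swizzles such as v.xy are only available with Clang's ext_vector_type, so they are left out.

```c
#include <assert.h>  /* static_assert (C11) */

#if defined(__clang__)
typedef float __attribute__((ext_vector_type(4))) float4;
#else
/* Assumption: GCC fallback. Same operators, but no .xyzw swizzles. */
typedef float __attribute__((vector_size(16))) float4;
#endif

static_assert(sizeof(float4) == 16, "float4 should be four packed floats");

/* Element-wise c + a / b; the compiler picks the SIMD instructions. */
float4 add_div(float4 a, float4 b, float4 c)
{
    return c + a / b;
}
```

Individual lanes can be read back with array-style indexing, for example d[0], in both compilers.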
Edited by Chris_F

