SSE and branching


I wonder about one thing.

Some code (loops) is well suited to SSE because it involves no scattered branching, only straightforward calculations. But some code does branch; for example, in a rasterization pipeline some triangles are clipped out and not calculated further.

1) Is it better to calculate such branched stages in normal scalar code, then 'collect' the results and run them on SSE?

2) Or is it generally better to try to run everything in SSE parallel mode, which would probably involve some processing of "zombie" values, i.e. the ones that were clipped out?

Is this possible to answer meaningfully?


Let the compiler decide what is best.

You can help the compiler with its decision, but that's it. Use the __m128 data types and look up how to use them; the rest will be done by the compiler.
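
For reference, a minimal sketch of what hand-written __m128 code looks like with the SSE intrinsics (the function and array names here are made up for illustration):

#include <xmmintrin.h> // SSE intrinsics and the __m128 type

// Adds two float arrays four lanes at a time.
// Assumes count is a multiple of 4 and the pointers are 16-byte aligned.
void add4( const float* a, const float* b, float* out, int count )
{
	for( int i = 0; i < count; i += 4 )
	{
		__m128 va = _mm_load_ps( a + i );              // load 4 floats
		__m128 vb = _mm_load_ps( b + i );              // load 4 floats
		_mm_store_ps( out + i, _mm_add_ps( va, vb ) ); // add and store 4 floats
	}
}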

ISPC does your option #2. If you're processing 4 objects at once, but items A, B, and C need to branch while D does not, then the branch is processed for all 4 items and the results calculated for item D are thrown out. It is a zombie object / is masked out during the branch area.
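
With plain SSE you can do that same masking by hand: evaluate both sides of the branch for all four lanes, then blend the results with a comparison mask. A hypothetical sketch (select4 and example are made-up names):

#include <xmmintrin.h>

// Per-lane select: where mask bits are set, take 'thenVal', else 'elseVal'.
static __m128 select4( __m128 mask, __m128 thenVal, __m128 elseVal )
{
	return _mm_or_ps( _mm_and_ps( mask, thenVal ),
	                  _mm_andnot_ps( mask, elseVal ) );
}

// d[i] = (a[i] > 0) ? a[i] * 2 : -a[i], four lanes at once.
__m128 example( __m128 a )
{
	__m128 mask    = _mm_cmpgt_ps( a, _mm_setzero_ps() );   // lane-wise a > 0
	__m128 thenVal = _mm_mul_ps( a, _mm_set1_ps( 2.0f ) );  // computed for ALL lanes
	__m128 elseVal = _mm_mul_ps( a, _mm_set1_ps( -1.0f ) ); // the "zombie" lanes too
	return select4( mask, thenVal, elseVal );               // discard the masked-out results
}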

Option #1 is also good, though. If you can do a first pass where you find out which objects need to take the branch, you can then re-sort the data so that in a second pass you can operate on nice groups of 4 objects at a time.
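
A sketch of that first pass, assuming a hypothetical needsBranch() predicate (standing in for e.g. a triangle-clipping test):

#include <vector>

// Hypothetical predicate: does this object need the expensive branch?
static bool needsBranch( int objectIndex )
{
	return ( objectIndex & 1 ) != 0; // placeholder test for illustration
}

// First pass: bucket object indices by which path they take, so a second
// SIMD pass can chew through each bucket 4 objects at a time, branch-free.
static void partitionObjects( int count,
                              std::vector<int>& branching,
                              std::vector<int>& straightLine )
{
	for( int i = 0; i < count; ++i )
	{
		if( needsBranch( i ) ) branching.push_back( i );
		else                   straightLine.push_back( i );
	}
}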

On that note, I highly recommend using ISPC rather than C (or actually, in conjunction with C; they interoperate nicely) when trying to write SIMD code. It's so much easier than learning SSE etc., plus it works on other instruction sets such as AVX (float8 and float16 architectures) as well as SSE (float4).


Intrinsics don't seem that hard. Of course it's a couple of days of headache, but once learned it shouldn't be so bad, I hope.

As for extensions, for some years I've been in favor of just including hardware-accelerated types as primitive types in the C language:

float4 a, b, c, d;
a = {1, 2, 3, 4};
b = {1, 2, 3, 5};
c = {1, 3, 3, 5};
d = c + a / b; // hardware-accelerated types at the C ground level

As for extensions, for some years I've been in favor of just including hardware-accelerated types as primitive types in the C language

That's what C++ math libraries such as GLM will give you. Since C doesn't have operator overloading, you can't do this with just a library.
If you want to extend the language with niceties like that, maybe you should mix some C++ into your C code.
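
For instance, with GLM the wished-for syntax above already works through operator overloading (a minimal sketch):

#include <glm/glm.hpp>

glm::vec4 example()
{
	glm::vec4 a( 1, 2, 3, 4 );
	glm::vec4 b( 1, 2, 3, 5 );
	glm::vec4 c( 1, 3, 3, 5 );
	return c + a / b; // overloaded operators: element-wise divide, then add
}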


That example with the ISPC extensions would look like this: on SSE the foreach loop will increment by 4 each time and the "varying float" values are float4s (on AVX the loop increments by 8 each time and the varying float values are float8s).
//export means this function is callable from regular C code:
export void Example( uniform const float aIn[], uniform const float bIn[], uniform const float cIn[], uniform float dOut[], uniform unsigned int count )
{
	//Assuming you've just got 4 values:
	//  programCount and programIndex are magic variables telling us how wide the SIMD instructions are.
	assert( programCount == 4 );          //assert we're on SSE float4 instructions... On AVX this would be 8.
	varying float a = aIn[programIndex];  // load 4 floats
	varying float b = bIn[programIndex];  // load 4 floats
	varying float c = cIn[programIndex];  // load 4 floats
	varying float d = c + a / b;          // do 4-wide SIMD math
	dOut[programIndex] = d;               // store 4 floats


	//OR: If you've got any number of values to process:
	foreach( index = 0 ... count )         // increment N at a time
	{
		varying float a = aIn[index];  // load N floats
		varying float b = bIn[index];  // load N floats
		varying float c = cIn[index];  // load N floats
		varying float d = c + a / b;   // do N-wide SIMD math
		dOut[index] = d;               // store N floats
	}
}

Clang is very good at loop vectorization with SSE/AVX, and it also has a nice vector extension that gives you math and logic operators as well as HLSL/GLSL-style swizzling:


typedef float __attribute__((ext_vector_type(4))) float4;
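
A small sketch of what that extension gives you (clang only; the function name is illustrative):

typedef float __attribute__((ext_vector_type(4))) float4;

float4 example( float4 a, float4 b, float4 c )
{
	float4 d = c + a / b; // element-wise math, maps to SSE/AVX instructions
	d.xy = d.zw;          // GLSL/HLSL-style swizzling
	return d;
}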

