
GPU vector processing

dpadam450    2357

Will GL/DX compile floats as vec4s? Are vec3s basically vec4s?

I assume that, whether or not they are compiled as vec4s, these are both 1 instruction?

vec4 vector;

vector.x += .5;

Or

vector += vec4(.5,.5,.5,.5);

 

 

And even if both take 1 instruction and the same time, does anyone know whether a standard float gets compiled to a vec4 or not? Not doing so would save memory, but I just assume everything goes through a float4 vector processor in the end, i.e.

float a;

a += .5; // would be (a,a,a,a) += (.5,.5,.5,.5);

 

Hodgman    51223
I assume you're talking about GLSL/HLSL?

On DX9-era cards, all instructions operated on vec4s, like SSE etc.
The processors would often shade 4 pixels at once, meaning they'd internally use a vec16 register to hold the vec4 values from 4 pixels.
e.g.
Value.xyzw += (1).xxxx
Is the same speed as:
Value.x += 1
So, vectorizing your code to use all 4 components as much as possible was very important. The HLSL compiler was pretty good at auto-vectorization. On other platforms I've gotten >2x speedups from hand-vectorizing my code (which often renders it unreadable).
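
As a rough illustration of what hand-vectorizing means on that kind of hardware, here is a small C++ sketch that emulates a vec4-wide ALU where every operation costs one instruction no matter how many components carry useful data. The `Vec4` type, the `g_instructions` counter and both `square_*` functions are hypothetical names made up for this example; they are not taken from any real shader compiler or driver.

```cpp
#include <cstddef>

// Hypothetical emulation of a vec4-wide ALU: each call to mul() counts as one
// instruction, whether 1 or 4 of the components hold useful data.
struct Vec4 { float f[4]; };

static int g_instructions = 0;  // assumption: 1 call to mul() == 1 instruction

Vec4 mul(const Vec4& a, const Vec4& b) {
    ++g_instructions;
    Vec4 r{};
    for (std::size_t i = 0; i < 4; ++i) r.f[i] = a.f[i] * b.f[i];
    return r;
}

// Unvectorized: squares four values one at a time, using only .x each time.
Vec4 square_scalar_style(const Vec4& d) {
    Vec4 out{};
    for (std::size_t i = 0; i < 4; ++i) {
        Vec4 one{{d.f[i], 0.0f, 0.0f, 0.0f}};
        out.f[i] = mul(one, one).f[0];  // 1 instruction, 3 components wasted
    }
    return out;
}

// Hand-vectorized: the four values are packed into one vec4 and squared together.
Vec4 square_packed(const Vec4& d) {
    return mul(d, d);  // 1 instruction, all 4 components useful
}
```

In this toy model the packed version issues one instruction where the scalar-style version issues four, which is the kind of saving the old vec4 hardware rewarded.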

Modern cards are all scalar, like the FPU of a CPU. The GPU might shade 64 pixels at once, internally using 4 vec64 registers to hold the x,y,z and w values for 64 pixels.
e.g.
Value.xyzw += (1).xxxx
Is 4 times as expensive as:
Value.x += 1

In-between, there were some cards that worked with weird hybrid vec4+vec1 (vec5?) registers, dual issuing a vector and scalar instruction each cycle.

dpadam450    2357

 

Value.xyzw += (1).xxxx
Is 4 times as expensive as:
Value.x += 1

vec4 + vec4 is not a single instruction? There is no vector SIMD? It seems wrong to me that it wouldn't have that as an optimization. Maybe you are referring to vec4 + float?

Hodgman    51223


On D3D9-era cards, it was generally a single instruction, using SIMD registers. Actually, they're "SIMD SIMD floats" (SIMDMD?).

e.g. Each pixel has a typical SIMD-vec4 looking variable:

struct Vec4 { float f[4]; };

but the GPU works on a SIMD collection of these variables at a time, to shade multiple pixels in parallel:

struct GpuRegister { Vec4 v[NumPixelsShadedAtOnce]; };

 

The GPU works on a "GpuRegister" at a time, with each array element belonging to a pixel. Each pixel is then working with a Vec4 variable (no matter if it's actually a float, vec2, vec3 or vec4).

This architecture requires that the code be vectorized fully -- for example, say that NumPixelsShadedAtOnce is 4:

* if you're ever operating on a float, then 4 slots out of the 16 contained in the GpuRegister are used, and 12 are wasted.

* if you're ever operating on a vec2, then 8 slots out of the 16 contained in the GpuRegister are used, and 8 are wasted.

etc...
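
The slot counts above can be reproduced with a few lines of C++. This is only a sketch of the bookkeeping, assuming a register of `4 * NumPixelsShadedAtOnce` floats as described; `LaneUsage`, `lane_usage` and `kComponentsPerPixel` are names invented for the example.

```cpp
#include <cassert>

// Slot usage for the old "one vec4 per pixel" register layout.
// kComponentsPerPixel = 4 models the vec4-wide ALU (hypothetical name).
constexpr int kComponentsPerPixel = 4;

struct LaneUsage {
    int used;    // slots holding useful data
    int wasted;  // slots that are computed but ignored
};

// `components` is 1 for float, 2 for vec2, 3 for vec3, 4 for vec4.
LaneUsage lane_usage(int components, int numPixelsShadedAtOnce) {
    assert(components >= 1 && components <= kComponentsPerPixel);
    const int total = kComponentsPerPixel * numPixelsShadedAtOnce;
    const int used = components * numPixelsShadedAtOnce;
    return LaneUsage{used, total - used};
}
```

With `numPixelsShadedAtOnce = 4`, `lane_usage(1, 4)` gives 4 used and 12 wasted, and `lane_usage(2, 4)` gives 8 used and 8 wasted, matching the two bullets.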

 

 

These days, GPUs have generally abandoned SIMD within each pixel, and just use SIMD across many pixels:

e.g. 

struct GpuRegister { float f[NumPixelsShadedAtOnce]; };

Yes, this means that vec4 + vec4 is now 4 instructions; however, it also means that there's absolutely zero waste when you need vec2 + vec2 (in the old model, this would waste 50% of your compute power).
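
To make the "one instruction per component" idea concrete, here is a minimal C++ emulation of the scalar-across-pixels model. It is a sketch under assumptions, not how any particular GPU is implemented: the group size of 16, the `g_issued` counter and the `add`/`add_vec` helpers are all made up for illustration.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kNumPixelsShadedAtOnce = 16;  // assumption for this sketch

// One register holds a single float for every pixel in the group.
struct GpuRegister { std::array<float, kNumPixelsShadedAtOnce> f; };

static int g_issued = 0;  // hypothetical counter: one per emulated instruction

// One emulated instruction: adds one component for all pixels at once.
GpuRegister add(const GpuRegister& a, const GpuRegister& b) {
    ++g_issued;
    GpuRegister r{};
    for (std::size_t p = 0; p < kNumPixelsShadedAtOnce; ++p) {
        r.f[p] = a.f[p] + b.f[p];
    }
    return r;
}

// A shader-level vecN occupies N registers (x, y, z, w), so vecN + vecN
// issues exactly N adds, and none of their lanes are wasted.
template <std::size_t N>
std::array<GpuRegister, N> add_vec(const std::array<GpuRegister, N>& a,
                                   const std::array<GpuRegister, N>& b) {
    std::array<GpuRegister, N> r{};
    for (std::size_t c = 0; c < N; ++c) r[c] = add(a[c], b[c]);
    return r;
}
```

Calling `add_vec<4>` bumps the counter by 4 and `add_vec<2>` by 2, and in both cases every lane of every register belongs to a real pixel.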

 

Say that NumPixelsShadedAtOnce is now 16 (this is the same memory usage as 4 in the old architecture, because now it's 16x1 instead of 4x4).

* If you're working with vec4's it works out the same as before -- you require 4x the instructions, but you're also shading 4x more pixels at once, so it balances out evenly.

* If you're working with vec2's it works out better -- you require 2x the instructions, but you're working on 4x more pixels at once, so you come out 2x ahead.

* Same for floats -- you require the same number of instructions as before, but as above, you're working on 16 pixels at once instead of 4, so you come out 4x ahead.
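
The arithmetic behind those three bullets can be written out as a tiny C++ model. It assumes exactly the numbers used above (4 pixels per instruction in the old layout, 16 in the new one) and ignores everything else that affects real performance; the function names are hypothetical.

```cpp
// Instructions needed per pixel for one operation on a vecN value.

// Old model: one vec4-wide instruction covers any N <= 4, for a few pixels.
double old_instr_per_pixel(int /*components*/, int pixelsAtOnce) {
    return 1.0 / pixelsAtOnce;
}

// New model: one scalar instruction per component, for many pixels.
double new_instr_per_pixel(int components, int pixelsAtOnce) {
    return static_cast<double>(components) / pixelsAtOnce;
}

// How many times faster the new model is in this simplified accounting,
// with 4 pixels at once in the old model and 16 in the new one.
double speedup(int components) {
    return old_instr_per_pixel(components, 4) / new_instr_per_pixel(components, 16);
}
```

Under these assumptions `speedup(4)` is 1 (break-even for vec4), `speedup(2)` is 2 and `speedup(1)` is 4.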

 

In SIMD CPU code, SSE, etc, this is also a common pattern -- the SoA vs AoS style debate.

e.g. I've seen a lot of CPU side code that instead of using struct Vec4 { float x, y, z, w; };, it uses struct SoAVec4 { float x[4], y[4], z[4], w[4]; }; as this layout makes doing a lot of things much more efficient in SSE (especially for Vec3 or Vec2 processing).
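
For a CPU-side picture of why the SoA layout helps with Vec3 work, here is a short C++ sketch that computes four 3-component dot products at once. It uses plain loops rather than SSE intrinsics to stay portable, on the assumption that a compiler can map each step onto 4-wide operations; `dot3_x4` is a name invented for this example.

```cpp
#include <cstddef>

// Structure-of-arrays: four vectors, stored component by component.
struct SoAVec4 { float x[4], y[4], z[4], w[4]; };

// Four 3-component dot products at once. Every multiply and add works on four
// useful values (one per vector), and the w arrays are simply never touched,
// so nothing is wasted the way the 4th lane is with an AoS Vec3-in-Vec4.
void dot3_x4(const SoAVec4& a, const SoAVec4& b, float out[4]) {
    for (std::size_t i = 0; i < 4; ++i) {
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
    }
}
```

With the AoS `Vec4` layout the same dot product needs a horizontal add across the components of a single register, which is the part SSE handles least gracefully.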

Edited by Hodgman
