GPU vector processing

3 comments, last by Hodgman 10 years, 3 months ago

Will GL/DX compile floats as vec4's? Are vec3's basically vec4's?

I assume that, whether compiled as vec4's or not, these are both 1 instruction?

vec4 vector;

vector.x += .5;

Or

vector += vec4(.5,.5,.5,.5);

And even if they both take 1 instruction/the same time, does anyone know if it compiles a standard float to a vec4 or not? It saves memory not to, but I just assume everything goes through a float4 vector processor in the end. i.e.

float a;

a += .5; // would be (a,a,a,a) += (.5,.5,.5,.5);

NBA2K, Madden, Maneater, Killing Floor, Sims http://www.pawlowskipinball.com/pinballeternal


Scalars are not "compiled" as vectors, they're handled as scalars. In some cases a scalar may get promoted to a vector type, but this only happens if you use them in conjunction with vector types.

Perhaps you're a bit confused because of the vector registers used in DirectX shader assembly. All general-purpose registers in DX assembly are 4-component, and many of the math instructions operate on 4 components. Consequently you may see it pack multiple scalar ops into a single vector op, or you may see it mask out components when scalar operations are performed. Either way it's important to keep in mind that this is just an intermediate format, and doesn't necessarily correspond to what the hardware does once it converts the assembly to hardware-specific bytecode. In fact, all Nvidia GPUs from the past 8 years and all of the most recent AMD GPUs actually operate in terms of scalar operations on registers, so the vector ops will get converted into a sequence of scalar instructions.

I assume you're talking about GLSL/HLSL?

On DX9-era cards, all instructions operated on vec4's, like SSE/etc.
The processors would often shade 4 pixels at once, meaning they'd internally use a vec16 register to hold the vec4 values from 4 pixels.
e.g.
Value.xyzw += (1).xxxx
Is the same speed as:
Value.x += 1
So, vectorizing your code to use all 4 components as much as possible was very important. The HLSL compiler was pretty good at auto-vectorization. On other platforms I'd gotten >2x speed ups from hand-vectorizing my code (which often renders it unreadable).

Modern cards are all scalar, like the FPU of a CPU. The GPU might shade 64 pixels at once, internally using 4 vec64 registers to hold the x,y,z and w values for 64 pixels.
e.g.
Value.xyzw += (1).xxxx
Is 4 times more expensive than:
Value.x += 1

In-between, there were some cards that worked with weird hybrid vec4+vec1 (vec5?) registers, dual issuing a vector and scalar instruction each cycle.

Value.xyzw += (1).xxxx
Is 4 times more expensive than:
Value.x += 1

vec4 + vec4 is not a single instruction? There is no vector SIMD? That seems wrong to me that it wouldn't have that as an optimization. Maybe you are referring to Vec4 + float?


On D3D9-era cards, it was generally a single instruction, using SIMD registers. Actually, they're "SIMD SIMD floats" (SIMDMD?).

e.g. Each pixel has a typical SIMD-vec4 looking variable:

struct Vec4 { float f[4]; };

but the GPU works on a SIMD collection of these variables at a time, to shade multiple pixels in parallel:

struct GpuRegister { Vec4 v[NumPixelsShadedAtOnce]; };

The GPU works on a "GpuRegister" at a time, with each array element belonging to a pixel. Each pixel is then working with a Vec4 variable (no matter if it's actually a float, vec2, vec3 or vec4).

This architecture requires that the code be vectorized fully -- for example, say that NumPixelsShadedAtOnce is 4:

* if you're ever operating on a float, then 4 slots out of the 16 contained in the GpuRegister are used, and 12 are wasted.

* if you're ever operating on a vec2, then 8 slots out of the 16 contained in the GpuRegister are used, and 8 are wasted.

etc...

These days, GPUs have generally abandoned SIMD within each pixel, and just use SIMD across many pixels:

e.g.

struct GpuRegister { float f[NumPixelsShadedAtOnce]; };

Yes this means that vec4 + vec4 is now 4 instructions, however, it also means that there's absolutely zero waste when you need vec2 + vec2 (in the old model, this would waste 50% of your compute power).

Say that NumPixelsShadedAtOnce is now 16 (this is the same memory usage as 4 in the old architecture, because now it's 16x1 instead of 4x4).

* If you're working with vec4's it works out the same as before -- you require 4x the instructions, but you're also shading 4x more pixels at once, so it balances out evenly.

* If you're working with vec2's it works out better -- you require 2x the instructions, and you're still working on 4x more pixels at once.

* Same for floats -- you require the same number of instructions as before, but as above, you're working on 16 pixels at once instead of 4.

In SIMD CPU code, SSE, etc, this is also a common pattern -- the SoA vs AoS style debate.

e.g. I've seen a lot of CPU side code that instead of using struct Vec4 { float x, y, z, w; };, it uses struct SoAVec4 { float x[4], y[4], z[4], w[4]; }; as this layout makes doing a lot of things much more efficient in SSE (especially for Vec3 or Vec2 processing).

This topic is closed to new replies.
