dpadam450

GPU vector processing


Will GL/DX compile floats as vec4s? Are vec3s basically vec4s?

I assume that, whether they're compiled as vec4s or not, these are both 1 instruction?

vec4 vector;

vector.x += .5;

Or

vector += vec4(.5,.5,.5,.5);

 

 

And even if they are both 1 instruction / take the same time, does anyone know whether it compiles a standard float to a vec4 or not? It saves memory not to, but I just assume everything goes through a float4 vector processor in the end, i.e.:

float a;

a += .5; // would be (a,a,a,a) += (.5,.5,.5,.5);

 

I assume you're talking about GLSL/HLSL?

On DX9-era cards, all instructions operated on vec4s, like SSE, etc.
The processors would often shade 4 pixels at once, meaning they'd internally use a vec16 register to hold the vec4 values from 4 pixels.
e.g.
Value.xyzw += (1).xxxx
Is the same speed as:
Value.x += 1
So, vectorizing your code to use all 4 components as much as possible was very important. The HLSL compiler was pretty good at auto-vectorization. On other platforms I'd gotten >2x speed-ups from hand-vectorizing my code (which often renders it unreadable).
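For illustration, here's a rough sketch of the idea in plain C++ (my own stand-in Vec4 type, not actual HLSL), showing why packing independent scalar work into one vec4 op paid off on that hardware:

#include <cstdio>

// Hypothetical stand-in for a shader vec4 -- illustration only.
struct Vec4 { float x, y, z, w; };
static Vec4 Add(Vec4 a, Vec4 b) { return { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w }; }

int main()
{
    // Scalar style: four independent values, four separate adds.
    // On a vec4 ALU each of these still occupies a full instruction slot.
    float a0 = 1, a1 = 2, a2 = 3, a3 = 4;
    a0 += 0.5f; a1 += 0.5f; a2 += 0.5f; a3 += 0.5f;        // 4 instruction slots

    // Hand-vectorized: pack the four scalars into one vec4 and add once --
    // a single vec4 instruction, which is where the >2x wins came from.
    Vec4 packed = { 1, 2, 3, 4 };
    packed = Add(packed, Vec4{ 0.5f, 0.5f, 0.5f, 0.5f });  // 1 instruction slot

    std::printf("%g %g\n", a0, packed.x);  // both print 1.5
    return 0;
}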

Modern cards are all scalar, like the FPU of a CPU. The GPU might shade 64 pixels at once, internally using 4 vec64 registers to hold the x, y, z and w values for 64 pixels.
e.g.
Value.xyzw += (1).xxxx
Is 4 times as expensive as:
Value.x += 1

In-between, there were some cards that worked with weird hybrid vec4+vec1 (vec5?) registers, dual-issuing a vector and a scalar instruction each cycle.


 

Value.xyzw += (1).xxxx
Is 4 times as expensive as:
Value.x += 1

vec4 + vec4 is not a single instruction? There is no vector SIMD? It seems wrong to me that it wouldn't have that as an optimization. Maybe you are referring to vec4 + float?


Value.xyzw += (1).xxxx

Is 4 times as expensive as:
Value.x += 1

vec4 + vec4 is not a single instruction? There is no vector SIMD? It seems wrong to me that it wouldn't have that as an optimization. Maybe you are referring to vec4 + float?

On D3D9-era cards, it was generally a single instruction, using SIMD registers. Actually, they're "SIMD SIMD floats" (SIMDMD?).

e.g. Each pixel has a typical SIMD-vec4 looking variable:

struct Vec4 { float f[4]; };

but the GPU works on a SIMD collection of these variables at a time, to shade multiple pixels in parallel:

struct GpuRegister { Vec4 v[NumPixelsShadedAtOnce]; };

 

The GPU works on a "GpuRegister" at a time, with each array element belonging to a pixel. Each pixel is then working with a Vec4 variable (no matter if it's actually a float, vec2, vec3 or vec4).

This architecture requires that the code be vectorized fully -- for example, say that NumPixelsShadedAtOnce is 4:

* If you're ever operating on a float, then 4 slots out of the 16 contained in the GpuRegister are used, and 12 are wasted.

* If you're ever operating on a vec2, then 8 slots out of the 16 contained in the GpuRegister are used, and 8 are wasted.

etc...
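To make that concrete, here's a rough C++ emulation of the old model (a sketch of my own, not real GPU code), using the structs above with NumPixelsShadedAtOnce = 4:

#include <cstddef>

constexpr std::size_t NumPixelsShadedAtOnce = 4;

struct Vec4        { float f[4]; };
struct GpuRegister { Vec4 v[NumPixelsShadedAtOnce]; };

// One "instruction" on the old architecture: a component-wise add across the
// whole register, i.e. 4 pixels x 4 components = 16 float lanes per issue.
void AddRegisters(GpuRegister& dst, const GpuRegister& a, const GpuRegister& b)
{
    for (std::size_t p = 0; p < NumPixelsShadedAtOnce; ++p)
        for (std::size_t c = 0; c < 4; ++c)
            dst.v[p].f[c] = a.v[p].f[c] + b.v[p].f[c];
}

// If the shader only needed a plain "float += 0.5", the hardware still issues
// the full 16-lane add above -- one component per pixel (4 lanes) does useful
// work and the other 12 lanes are wasted, exactly the float case listed above.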

 

 

These days, GPUs have generally abandoned SIMD within each pixel, and just use SIMD across many pixels:

e.g. 

struct GpuRegister { float f[NumPixelsShadedAtOnce]; };

Yes, this means that vec4 + vec4 is now 4 instructions; however, it also means that there's absolutely zero waste when you need vec2 + vec2 (in the old model, this would waste 50% of your compute power).
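Sketching the modern model the same way (again my own C++ emulation, not real GPU code), with NumPixelsShadedAtOnce = 16:

#include <cstddef>

constexpr std::size_t NumPixelsShadedAtOnce = 16;

struct GpuRegister { float f[NumPixelsShadedAtOnce]; };

// One scalar instruction: 16 pixels, one float lane each, no lane ever unused.
void AddRegisters(GpuRegister& dst, const GpuRegister& a, const GpuRegister& b)
{
    for (std::size_t p = 0; p < NumPixelsShadedAtOnce; ++p)
        dst.f[p] = a.f[p] + b.f[p];
}

// A shader-side vec4 + vec4 now becomes 4 of these instructions (one per
// component), vec2 + vec2 becomes 2, and a plain float add is just 1 --
// and in every case all 16 lanes carry useful work.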

 

Say that NumPixelsShadedAtOnce is now 16 (this is the same register storage as NumPixelsShadedAtOnce = 4 in the old architecture, because now it's 16x1 floats instead of 4x4).

* If you're working with vec4s, it works out the same as before -- you require 4x the instructions, but you're also shading 4x more pixels at once, so it balances out evenly.

* If you're working with vec2s, it works out better -- you require 2x the instructions, and you're still working on 4x more pixels at once.

* Same for floats -- you require the same number of instructions as before, but, as above, you're working on 16 pixels at once instead of 4 (rough numbers below).
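Putting rough numbers on that (old model: 4 pixels x 4-wide vec4 lanes, new model: 16 pixels x scalar lanes):

* vec4 work: old = 4 pixels per instruction; new = 16 pixels / 4 instructions = 4 pixels per instruction (a wash).

* vec2 work: old = 4 pixels per instruction with half the lanes idle; new = 16 pixels / 2 instructions = 8 pixels per instruction (2x better).

* float work: old = 4 pixels per instruction with 3/4 of the lanes idle; new = 16 pixels / 1 instruction = 16 pixels per instruction (4x better).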

 

In SIMD CPU code (SSE, etc.) this is also a common pattern -- the SoA vs AoS (structure-of-arrays vs array-of-structures) style debate.

e.g. I've seen a lot of CPU-side code that, instead of using struct Vec4 { float x, y, z, w; };, uses struct SoAVec4 { float x[4], y[4], z[4], w[4]; };, as this layout makes a lot of operations much more efficient in SSE (especially for Vec3 or Vec2 processing).
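To sketch what that looks like in practice (my own example, assuming SSE intrinsics from <xmmintrin.h> and a made-up SoAVec3x4 layout in the spirit of the SoAVec4 above):

#include <xmmintrin.h>   // SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps

// SoA layout: each component stored contiguously for 4 vectors at a time.
struct alignas(16) SoAVec3x4 { float x[4], y[4], z[4]; };

// Adds 4 Vec3s to 4 Vec3s in 3 SSE adds (one per component) with no wasted
// lanes -- whereas an AoS Vec3 padded into a 4-float register idles 1 lane in 4.
void AddSoA(SoAVec3x4& out, const SoAVec3x4& a, const SoAVec3x4& b)
{
    _mm_store_ps(out.x, _mm_add_ps(_mm_load_ps(a.x), _mm_load_ps(b.x)));
    _mm_store_ps(out.y, _mm_add_ps(_mm_load_ps(a.y), _mm_load_ps(b.y)));
    _mm_store_ps(out.z, _mm_add_ps(_mm_load_ps(a.z), _mm_load_ps(b.z)));
}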

