Value.xyzw += (1).xxxx
Is 4 times as complex as:
Value.x += 1
vec4 + vec4 isn't a single instruction? There's no vector SIMD? It seems wrong to me that a GPU wouldn't have that as an optimization. Maybe you're referring to vec4 + float?
On D3D9-era cards, it was generally a single instruction, using SIMD registers. Actually, they're "SIMD SIMD floats" (SIMDMD?).
e.g. Each pixel has a typical SIMD-vec4 looking variable:
struct Vec4 { float f[4]; };
but the GPU works on a SIMD collection of these variables at a time, to shade multiple pixels in parallel:
struct GpuRegister { Vec4 v[NumPixelsShadedAtOnce]; };
The GPU works on a "GpuRegister" at a time, with each array element belonging to a pixel. Each pixel is then working with a Vec4 variable (no matter if it's actually a float, vec2, vec3 or vec4).
This architecture requires the code to be fully vectorized -- for example, say that NumPixelsShadedAtOnce is 4:
* if you're ever operating on a float, then 4 slots out of the 16 contained in the GpuRegister are used, and 12 are wasted.
* if you're ever operating on a vec2, then 8 slots out of the 16 contained in the GpuRegister are used, and 8 are wasted.
etc...
These days, GPUs have generally abandoned SIMD within each pixel, and just use SIMD across many pixels:
e.g.
struct GpuRegister { float f[NumPixelsShadedAtOnce]; };
Yes, this means that vec4 + vec4 is now 4 instructions; however, it also means that there's absolutely zero waste when you need vec2 + vec2 (in the old model, that would waste 50% of your compute power).
Say that NumPixelsShadedAtOnce is now 16 (this is the same memory usage as 4 in the old architecture, because now it's 16x1 instead of 4x4).
* If you're working with vec4's it works out the same as before -- you require 4x the instructions, but you're also shading 4x more pixels at once, so it balances out evenly.
* If you're working with vec2's it works out better -- you require 2x the instructions, but you're shading 4x more pixels at once, so you come out 2x ahead.
* Same for floats -- you require the same number of instructions as before, but as above, you're working on 16 pixels at once instead of 4.
In CPU SIMD code (SSE, etc.) this is also a common pattern -- the AoS vs SoA style debate.
e.g. I've seen a lot of CPU-side code that, instead of using struct Vec4 { float x, y, z, w; };, uses struct SoAVec4 { float x[4], y[4], z[4], w[4]; };, as this layout makes a lot of operations much more efficient in SSE (especially for Vec3 or Vec2 processing).