Shader matrix mul - dp4 or mad?


Hello

I've been wondering about this for some time: when I compile an HLSL shader and inspect the assembly, a vector-matrix multiplication can result in either four dp4 or four mad instructions, depending on which side I multiply from.

My question is: is there a difference in performance, and should I worry about it? At the moment all the shaders in my engine use the dp4 form for this (because I started from a tutorial that did it that way...). I know the final hardware-specific code produced after CreateXSShader isn't documented, but are there any guidelines to follow?
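For reference, here's a minimal sketch of the two cases (gWVP is a placeholder name; the comments describe the typical output with the default column_major packing, but the exact assembly depends on packing and compiler version):

```hlsl
float4x4 gWVP; // assumed default column_major packing: one column per register

float4 TransformA(float4 pos : POSITION) : SV_Position
{
    // Row vector times matrix: each output component is a dot product
    // against a stored column register, so this typically compiles to 4x dp4.
    return mul(pos, gWVP);
}

float4 TransformB(float4 pos : POSITION) : SV_Position
{
    // Matrix times column vector: the result is accumulated as
    // c0*pos.x + c1*pos.y + c2*pos.z + c3*pos.w, i.e. one mul + 3x mad.
    // Note this computes the transposed product of TransformA, so your
    // CPU-side matrix layout has to match whichever order you pick.
    return mul(gWVP, pos);
}
```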

You can also influence it with your choice of "column_major float4x4" or "row_major float4x4" matrices in HLSL.
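For example (a sketch; the cbuffer and variable names are made up), the qualifier goes on the declaration:

```hlsl
cbuffer PerObject
{
    // One column per register: mul(v, M) tends to compile to dp4s.
    column_major float4x4 gWVP_col;

    // One row per register: mul(v, M) tends to compile to a mul/mad chain.
    row_major float4x4 gWVP_row;
};
```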

10 years ago it would've made a difference as GPUs worked on float4 types at the hardware level. These days GPUs operate on individual floats so it actually works out the same either way.

Thanks, that's what I thought, so it doesn't matter any more.

I remember that back in the old days we were encouraged (by the hardware vendors) to choose the dp4 version, as the order of the instructions doesn't change the outcome. Back then the compiler(s) had a tendency to shuffle instructions around, and the order of the mad instructions could produce numerically different results, which in turn could cause z-fighting when using a multipass approach.

I don't know how good (or bad) the compiler is nowadays at maintaining the order, as I'm still using the dp4 approach ;-)
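To make the precision point concrete, here is the kind of reordering that used to bite (a sketch, not actual compiler output):

```hlsl
// Two evaluation orders of the same dot product.
float DotInOrder(float4 v, float4 c)
{
    return ((v.x * c.x + v.y * c.y) + v.z * c.z) + v.w * c.w;
}

float DotReversed(float4 v, float4 c)
{
    return ((v.w * c.w + v.z * c.z) + v.y * c.y) + v.x * c.x;
}
// Under float rounding the two can differ in the last bits. dp4 evaluated
// the whole dot product as a single instruction with a fixed internal
// order, so every pass produced bit-identical depth values.
```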

Around what generation of GPUs did it stop mattering?

-potential energy is easily made kinetic-

I think the AMD 7xxx series (plus PS4 and XBONE) was the first to switch to scalars for everything (scalar meaning a float4 is processed not as one wide type but as 4 individual floats in sequence); I don't know about NV.

What is really important is that, when it makes sense, we load big things like matrices into scalar registers on AMD (scalar here means one register shared by all threads in a wave, versus VGPRs, which are unique to each thread and therefore scarce). See here: http://gpuopen.com/optimizing-gpu-occupancy-resource-usage-large-thread-groups/

I did not know that this only works for some buffer types with D3D, so if anyone knows how this applies to Vulkan let me know :)
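A rough HLSL illustration of the distinction the linked article draws (names are made up, and whether the scalar path is actually taken depends on the compiler proving the access is uniform across the wave):

```hlsl
cbuffer PerDraw : register(b0)
{
    float4x4 gWorldViewProj; // same value for all threads in the draw
};

StructuredBuffer<float4x4> gBones : register(t0);

float4 TransformUniform(float4 p)
{
    // Uniform access: on GCN the compiler can load the matrix once
    // into scalar registers (SGPRs) shared by the whole wave.
    return mul(p, gWorldViewProj);
}

float4 TransformDivergent(float4 p, uint boneIndex)
{
    // boneIndex varies per thread, so the matrix must live in
    // per-thread vector registers (VGPRs), a much scarcer resource.
    return mul(p, gBones[boneIndex]);
}
```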

NVidia's first SIMT architecture (as opposed to the older SIMD/vector style) was the Geforce 8 series (if I remember correctly).

By SIMT you basically mean unified shaders, right? Going by what you linked, that still appears to be a vector architecture. I looked at its successor, Fermi (linked from your link), and that appears to be scalar.

-potential energy is easily made kinetic-

Nah, unified shaders means they no longer have separate hardware for vertex shading vs pixel shading. The GeForce 7 had this split, so you could have pixel-shading cores sitting idle while the vertex-shading cores were maxed out... :(
GeForce 7 pixel shaders used SIMD instructions (like SSE), so each pixel could operate on a float4 per clock. GeForce 8 (Tesla) ran 8 pixels per "core" using scalar instructions (float), and Fermi bumped that up to 32 pixels per "core", still using scalar instructions per pixel.
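In other words (an illustrative sketch, not real ISA):

```hlsl
float4 Add(float4 a, float4 b)
{
    // Vector GPU (e.g. GeForce 7 pixel pipe): one SIMD add over the float4.
    // SIMT GPU (Tesla/Fermi onward): four independent scalar adds, as if:
    //   r.x = a.x + b.x;  r.y = a.y + b.y;
    //   r.z = a.z + b.z;  r.w = a.w + b.w;
    // Either way the work is 4 float adds, which is why dp4 vs mad no
    // longer changes the instruction count.
    return a + b;
}
```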

