Shader matrix mul - dp4 or mad?


Hello

I've been wondering about this for some time: when I compile an HLSL shader and inspect the assembly, a vector-matrix multiplication can result in either four dp4 or four mad instructions, depending on which side I multiply from.

My question is: is there a difference in performance, and should I worry about it? At the moment all the shaders in my engine use the dp4 form for this (because I started from a tutorial that did it that way...). I know the final hardware-specific code produced after CreateXSShader isn't documented, but are there any guidelines to follow?
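For reference, here's a minimal sketch of the two cases (gWVP is a placeholder name; the comments describe the typical output with the default column_major packing, but the exact assembly depends on packing and compiler version):

```hlsl
float4x4 gWVP; // assumed default column_major packing: one column per register

float4 TransformA(float4 pos : POSITION) : SV_Position
{
    // Row vector times matrix: each output component is a dot product
    // against a stored column register, so this typically compiles to 4x dp4.
    return mul(pos, gWVP);
}

float4 TransformB(float4 pos : POSITION) : SV_Position
{
    // Matrix times column vector: the result is accumulated as
    // c0*pos.x + c1*pos.y + c2*pos.z + c3*pos.w, i.e. one mul + 3x mad.
    // Note this computes the transposed product of TransformA, so your
    // CPU-side matrix layout has to match whichever order you pick.
    return mul(gWVP, pos);
}
```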

You can also influence it with your choice of "column_major float4x4" or "row_major float4x4" matrices in HLSL.
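For example (a sketch; the cbuffer and variable names are made up), the qualifier goes on the declaration:

```hlsl
cbuffer PerObject
{
    // One column per register: mul(v, M) tends to compile to dp4s.
    column_major float4x4 gWVP_col;

    // One row per register: mul(v, M) tends to compile to a mul/mad chain.
    row_major float4x4 gWVP_row;
};
```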

10 years ago it would've made a difference as GPUs worked on float4 types at the hardware level. These days GPUs operate on individual floats so it actually works out the same either way.

Thanks, that's what I thought, so it doesn't matter any more.

I remember that back in the old days we were encouraged (by the hardware vendors) to choose the dp4 version, as the order of the instructions doesn't change the outcome. Back then the compiler(s) had a tendency to shuffle instructions around, and the order of the mad instructions could produce numerically different results, which in turn could cause z-fighting when using a multipass approach.

I don't know how good (or bad) the compiler is nowadays at maintaining the order, as I'm still using the dp4 approach ;-)
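To make the precision point concrete, here is the kind of reordering that used to bite (a sketch, not actual compiler output):

```hlsl
// Two evaluation orders of the same dot product.
float DotInOrder(float4 v, float4 c)
{
    return ((v.x * c.x + v.y * c.y) + v.z * c.z) + v.w * c.w;
}

float DotReversed(float4 v, float4 c)
{
    return ((v.w * c.w + v.z * c.z) + v.y * c.y) + v.x * c.x;
}
// Under float rounding the two can differ in the last bits. dp4 evaluated
// the whole dot product as a single instruction with a fixed internal
// order, so every pass produced bit-identical depth values.
```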

Around what generation of GPUs did it stop mattering?

-potential energy is easily made kinetic-

I think the AMD 7xxx series (plus PS4 and XBONE) was the first to switch to scalars for everything (scalar meaning a float4 is processed not as one wide type but as 4 individual floats in sequence); I don't know about NV.

What is really important is that, when it makes sense, we load big things like matrices into scalar registers on AMD (scalar here means one register shared by all threads in a wave, versus VGPRs, which are unique to each thread and therefore scarce). See here: http://gpuopen.com/optimizing-gpu-occupancy-resource-usage-large-thread-groups/

I did not know that this only works for some buffer types with D3D, so if anyone knows how this applies to Vulkan let me know :)
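A rough HLSL illustration of the distinction the linked article draws (names are made up, and whether the scalar path is actually taken depends on the compiler proving the access is uniform across the wave):

```hlsl
cbuffer PerDraw : register(b0)
{
    float4x4 gWorldViewProj; // same value for all threads in the draw
};

StructuredBuffer<float4x4> gBones : register(t0);

float4 TransformUniform(float4 p)
{
    // Uniform access: on GCN the compiler can load the matrix once
    // into scalar registers (SGPRs) shared by the whole wave.
    return mul(p, gWorldViewProj);
}

float4 TransformDivergent(float4 p, uint boneIndex)
{
    // boneIndex varies per thread, so the matrix must live in
    // per-thread vector registers (VGPRs), a much scarcer resource.
    return mul(p, gBones[boneIndex]);
}
```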

NVidia's first SIMT architecture (as opposed to the older SIMD/vector style) was the Geforce 8 series (if I remember correctly).

By SIMT you basically mean unified shaders, right? Going by what you linked, that still appears to be a vector architecture. I looked at its successor, Fermi (linked from your link), and that appears to be scalar.

-potential energy is easily made kinetic-

Nah, unified shaders means they no longer have separate hardware for vertex shading vs pixel shading. The GeForce 7 had this split, so you could have pixel-shading cores sitting idle while the vertex-shading cores were maxed out... :(
GeForce 7 pixel shaders used SIMD instructions (like SSE), so each pixel could operate on a float4 per clock. GeForce 8 (Tesla) ran 8 pixels per "core" using scalar instructions (float), and Fermi bumped that up to 32 pixels per "core", still using scalar instructions per pixel.
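In other words (an illustrative sketch, not real ISA):

```hlsl
float4 Add(float4 a, float4 b)
{
    // Vector GPU (e.g. GeForce 7 pixel pipe): one SIMD add over the float4.
    // SIMT GPU (Tesla/Fermi onward): four independent scalar adds, as if:
    //   r.x = a.x + b.x;  r.y = a.y + b.y;
    //   r.z = a.z + b.z;  r.w = a.w + b.w;
    // Either way the work is 4 float adds, which is why dp4 vs mad no
    // longer changes the instruction count.
    return a + b;
}
```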

