
.fx, and asm instructions' costs

This topic is 3593 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


Hi, I'm a little curious about what's going on with shaders. Recently, I noticed that you can transform points using mul(point, matrix); or mul(transpose_matrix, point); Looking at the output asm code, I noticed that the second way used one less instruction, and that the code was completely different. The first method basically does 4 dot products, whereas the second uses a combination of mads and other instructions (I don't remember exactly which ones :)

So, my question is: where can I find the GPU cycles used by each asm instruction on modern hardware? (I know it depends on the GPU, I'm just looking for NVIDIA's latest GPUs :) At first glance the second method seems optimal, unless it uses instructions which are slower than dot products :p

This is important for me, since my current bottleneck is the pixel shader and I need to optimise it as much as possible. Thanks in advance for any help!

It's more complex than that. Different graphics cards will have different instruction sets internally, which the drivers convert to at runtime (and do some optimization while they are doing so).

ATI/AMD make their GPU ShaderAnalyzer available which shows the complete process for their cards, but I don't know of any equivalent for any other cards.

Just one question about the matrix multiplication code: can you show the assembly source lines?

I found the following lines on the net:

// in HLSL
Out.Pos = mul( view_proj_matrix, inPos );

// in asm (for this to equal the mul above, c0..c3 must hold the columns of view_proj_matrix)
mul r0, v0.x, c0
mad r2, v0.y, c1, r0
mad r4, v0.z, c2, r2
mad oPos, v0.w, c3, r4

As far as I know, doing the mul the other way around, mul( inPos, view_proj_matrix ), will produce four dp4 operations. So both ways use the same number of instructions.
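
To make the equivalence concrete, here is a minimal sketch that simulates both instruction sequences on plain Python lists. The helper names (`dp4`, `mul`, `mad`, `transform_dp4`, `transform_mad`) and the matrix values are made up for illustration; the only assumption is that the dp4 path reads the rows of the matrix from its registers while the mul/mad path reads the columns.

```python
# Simulates the two shader instruction sequences on plain Python lists.
# The helpers mimic the asm ops; the matrix and vector values are arbitrary.

def dp4(a, b):
    """4-component dot product (one dp4 instruction)."""
    return sum(x * y for x, y in zip(a, b))

def mul(s, v):
    """Scalar times vector, like: mul r0, v0.x, c0"""
    return [s * x for x in v]

def mad(s, v, acc):
    """Multiply-add s * v + acc, like: mad r2, v0.y, c1, r0"""
    return [s * x + a for x, a in zip(v, acc)]

def transform_dp4(v, regs):
    """Four dp4s: each output component is dot(v, one register)."""
    return [dp4(v, r) for r in regs]

def transform_mad(v, regs):
    """mul + 3 mads: registers weighted by the components of v and summed."""
    r = mul(v[0], regs[0])
    r = mad(v[1], regs[1], r)
    r = mad(v[2], regs[2], r)
    return mad(v[3], regs[3], r)

def transpose(m):
    return [list(c) for c in zip(*m)]

M = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]]
v = [1.0, 2.0, 3.0, 1.0]

by_dp4 = transform_dp4(v, M)             # registers hold the rows of M
by_mad = transform_mad(v, transpose(M))  # registers hold the columns of M
print(by_dp4, by_mad)
```

Both paths execute four instructions and produce the same vector; the only difference is which layout of the matrix sits in the constant registers.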

[edit] I have seen in some places that instead of using 4x dp4, there is just one m4x4 instruction (which may be translated as 4 dp4's after all).

Cheers!

[Edited by - kauna on February 11, 2008 7:55:17 AM]

Errr... I'd love to, but for the past few days I haven't been able to use fxc: "Failed to create D3D Device" ... (although all my D3D projects work perfectly ...)

But I'm pretty sure there was a one-instruction difference between the two mul(...) versions ... and the order of the vector / matrix was the only thing I changed between them, so I assumed the additional instruction came from that ...

I'll try to get fxc to work and post some code.


> Adam_42 : I know what I'm asking is GPU-dependent, that's why I wrote "I know it depends on the GPU, I'm just looking for NVidia's latest GPUs"
I'm working with the 8800 or 7800 only, and I think both processors take the same number of cycles for simple instructions.

Back in the day, when I coded in asm for x86, I had an Intel book detailing all the asm instructions available in their processors, and the approximate CPU cycles each one takes ... isn't there any book like that for NVIDIA's GPUs?

Quote:
Original post by paic
Back in the day, when I coded in asm for x86, I had an Intel book detailing all the asm instructions available in their processors, and the approximate CPU cycles each one takes ... isn't there any book like that for NVIDIA's GPUs?

You won't find one, nor will you find it as easy to get the Intel details these days. On the GeForce 8x00 family in particular, calculations are scalar, so doing vector calculations means using several units. These units are used for processing several pixels at once, and also for vertex shaders (unlike the 7x00 and older families, which had distinct pixel and vertex shader units).

In short, a lot of things can affect speed. My suggestion would be to try both methods and benchmark them.
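
As a rough illustration of why the instruction count alone says little on a scalar design, the sketch below counts scalar operations under a deliberately simplified model. The model is an assumption for illustration, not measured data: an n-component mul/mad/add is counted as n scalar ops, and an n-wide dot product as 1 mul plus n - 1 mads. The function names are hypothetical.

```python
# Simplified cost model (an assumption, not vendor data): on a scalar
# architecture each component of a vector instruction is one scalar op.

def scalar_ops(instr, width=4):
    """Scalar operations needed for one vector instruction."""
    if instr in ("mul", "mad", "add"):
        return width          # one scalar op per output component
    if instr in ("dp3", "dp4"):
        return int(instr[2])  # 1 mul + (n - 1) mads for an n-wide dot
    raise ValueError(f"unknown instruction: {instr}")

def total_scalar_ops(program, width=4):
    """Sum the scalar-op cost of a list of instruction names."""
    return sum(scalar_ops(i, width) for i in program)

dp4_path = ["dp4"] * 4                   # mul(vector, matrix)
mad_path = ["mul", "mad", "mad", "mad"]  # mul(matrix, vector)
print(total_scalar_ops(dp4_path), total_scalar_ops(mad_path))
```

Under this model both sequences cost the same number of scalar operations. Scheduling, register pressure, and driver optimizations are not captured at all, which is why benchmarking on the target GPU remains the reliable test.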

ET3D: thx for the input, even if it's a bit disappointing :)

kauna: I did the test, and you're right, it's 4 instructions for each method ... the difference probably came from another part of the code ... strange though, I'm pretty sure I only changed the vector / matrix order when I did the test ...


I've found NVShaderPerf (funny it's not included in the NVPerfKit ...) and it seems to give the number of cycles used, so that's perfect (well, not really user friendly, but it's better than nothing :))

You can save an instruction if you're using smaller matrices. Projection uses the last column, and translation uses the last row.

For something that just rotates, without translation, you can use a 3x3 matrix, which can be 3 dp3s or mul, mad, mad.

For typical world and view matrices you can use a 4x3 matrix, which can be 3 dp4s, or mul, mad, mad, mad. The dp4 version is 1 instruction shorter, and uses one less constant. Using fewer constants allows modern cards to run more instances of the shader at once.

When you have a projection matrix (or ViewProj, or WorldViewProj) you need the entire matrix, which will be 4 dp4s or mul, mad, mad, mad.

The only time you get savings is when using a float4x3 matrix for world, view, or bone matrices, where you save both an instruction and a register. As this is a very common practice, I'm sure all hardware is designed to be optimal for the dp4 code path. Can't say I've ever tested it though. Back on SM1.1 hardware, the register limits were so tight that doing anything other than float4x3 wasn't practical.
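
The float4x3 case can be sketched the same way. The snippet below is an illustration only: the `WORLD` matrix, the point, and the function names are hypothetical. It assumes the rotation sits in the first three rows and the translation in the last row, as described above, and that each instruction reads exactly one constant register.

```python
# A float4x3 transform done two ways, returning the result together with
# the number of instructions and constant registers each path needs.

def dp4(a, b):
    return sum(x * y for x, y in zip(a, b))

def mad(s, v, acc):
    return [s * x + a for x, a in zip(v, acc)]

# Hypothetical world matrix: 90 degree rotation about z, then a translation.
WORLD = [[0.0, 1.0, 0.0],
         [-1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0],
         [10.0, 20.0, 30.0]]  # last row = translation

def transform_dp4(p, m):
    """3 dp4s: one 4-component constant per output component."""
    cols = [list(c) for c in zip(*m)]
    result = [dp4(p, c) for c in cols]
    return result, len(cols), len(cols)  # (result, instructions, constants)

def transform_mad(p, m):
    """mul, mad, mad, mad: one 3-component constant per input component."""
    r = [p[0] * x for x in m[0]]         # mul
    for i in range(1, 4):
        r = mad(p[i], m[i], r)           # mad x3
    return r, len(m), len(m)             # (result, instructions, constants)

point = [1.0, 2.0, 3.0, 1.0]
res_dp4, n_instr_dp4, n_const_dp4 = transform_dp4(point, WORLD)
res_mad, n_instr_mad, n_const_mad = transform_mad(point, WORLD)
print(res_dp4, n_instr_dp4, n_const_dp4)
print(res_mad, n_instr_mad, n_const_mad)
```

Both paths give the same transformed point, but the dp4 version needs three instructions and three constants against four of each for the mul/mad version, which is the saving described above.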

