Profiler for HLSL Shaders? (like Intel GPA, PerfHud)

4 comments, last by Hodgman 12 years, 2 months ago
Hey there!

Is there any way to measure execution time of individual lines or sections of HLSL code? So far I tried Intel GPA and Nvidia PerfHud. I was only able to get timings for whole Draw-Calls that way.

I have to optimize a long and complicated shader, and it is not possible to comment out sections of it to test for possible performance gains, as the different sections' runtimes are pretty much dependent on each other...

Thanks in advance,
XBTC
You can't really profile individual sections of a shader like that because the whole is much more than the sum of its parts.

Yeah you can pull out one section and replace it with a simpler version...
e.g. instead of
    float isInShadow = /*code that involves 9 texture fetches*/;
    float3 proceduralColor = /*code that does a ton of math*/;
you could write
    float isInShadow = /*code that simply reads a uniform variable*/;
    float3 proceduralColor = /*code that simply reads a uniform variable*/;
but now you've completely changed the overall scheduling of the shader.

For example, even if the above two lines are unrelated (the procedural colour does not depend on the shadow calc, and vice versa) it could be that removing just one of those lines has no impact on the total shader time, but removing both has a large impact (i.e. even though they're unrelated, they become intertwined as far as performance is concerned).
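To put rough numbers on that, here is a toy cost model in Python. It is only a sketch and not a GPU simulator: it assumes ALU work and texture-fetch latency overlap perfectly, so the shader costs whichever of the two is larger. The function name `shader_cost` and all cycle counts are made up for illustration.

```python
# Toy cost model, NOT a GPU simulator. Assumption: ALU work and texture-fetch
# latency overlap perfectly, so the shader costs whichever one is larger.

def shader_cost(alu_cycles, fetch_cycles):
    """Estimated cost when math and memory latency fully overlap."""
    return max(alu_cycles, fetch_cycles)

# Hypothetical per-section costs in arbitrary cycle units.
SHADOW_FETCH = 100              # the shadow test with its texture fetches
COLOR_ALU = 100                 # the procedural colour math
BASE_ALU, BASE_FETCH = 10, 10   # everything else in the shader

full = shader_cost(BASE_ALU + COLOR_ALU, BASE_FETCH + SHADOW_FETCH)
no_shadow = shader_cost(BASE_ALU + COLOR_ALU, BASE_FETCH)
no_color = shader_cost(BASE_ALU, BASE_FETCH + SHADOW_FETCH)
neither = shader_cost(BASE_ALU, BASE_FETCH)

print(full, no_shadow, no_color, neither)  # prints: 110 110 110 10
```

Under these assumed numbers, stubbing out either section alone changes nothing, because the other one still sets the cost, while stubbing out both drops the estimate from 110 to 10. Real hardware overlaps work far less neatly, but the effect has the same shape.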

When a GPU runs into a memory stall (e.g. it wants to use the result of a texture fetch, but the data hasn't arrived yet), it acts a lot like a HyperThreaded CPU -- it switches over to another thread. In the ideal case these switches cost nothing, meaning texture fetches are 'free', but the ideal case can only occur if the texture cache is big enough and your shader uses a small enough number of local/temporary variables. Also, a switch might not need to occur if there are enough ALU ops to be done while the fetch is in flight anyway, meaning that if you're doing enough unrelated math, then your fetches are also 'free'. Or, if fetching is your main bottleneck, you could say that you're getting your ALU ops for free, as they don't affect the critical path (the fetching isn't dependent on them). Also, all of your vector-based math will be broken apart and re-vectorized to the GPU's preferred width -- e.g. the GPU might operate on vec5's internally, meaning some extra (seemingly unrelated) math might get squeezed into earlier instructions for free.
So... that's a long, rambling way of saying that it's hard to profile individual lines. In short:
* one line of assembly can be doing work for multiple different (unrelated) lines of HLSL.
* the GPU can be executing several assembly instructions per cycle, depending on how many independent calculations you have, giving you some extra work for no extra cost.
* adding more calculations, fetches or temporaries (local variables) can drastically alter the assembly that is produced by the entire shader.
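The point about temporaries can be sketched the same way. The toy model below assumes a fixed-size register file shared by all resident threads, so more registers per thread means fewer threads to switch to while a fetch is outstanding. The function names and every number are hypothetical and not taken from any real GPU.

```python
# Toy occupancy model, NOT real hardware behaviour. Assumption: all resident
# threads share one register file, and a fetch is "hidden" if the other
# resident threads have enough ALU work to cover its latency.

def threads_in_flight(register_file_size, registers_per_thread, max_threads):
    """How many threads fit in the register file at once."""
    return min(max_threads, register_file_size // registers_per_thread)

def fetch_latency_hidden(threads, alu_cycles_per_thread, fetch_latency):
    """True if the other threads' ALU work covers one thread's fetch."""
    return (threads - 1) * alu_cycles_per_thread >= fetch_latency

# Hypothetical hardware limits and shader characteristics.
REGISTER_FILE = 256
MAX_THREADS = 16
FETCH_LATENCY = 200
ALU_PER_THREAD = 20

lean = threads_in_flight(REGISTER_FILE, 16, MAX_THREADS)   # few temporaries
heavy = threads_in_flight(REGISTER_FILE, 64, MAX_THREADS)  # many temporaries

print(lean, fetch_latency_hidden(lean, ALU_PER_THREAD, FETCH_LATENCY))
print(heavy, fetch_latency_hidden(heavy, ALU_PER_THREAD, FETCH_LATENCY))
```

With these assumed numbers, the lean shader keeps 16 threads resident and hides the fetch, while the heavy one keeps only 4 and stalls, even though the fetch itself did not change.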


For draw-call timing, you can also use PIX and Parallel Nsight.

I'm not sure if it's still supported, but nVidia used to have a tool called NVShaderPerf, which would compile your shaders and show you the actual nVidia assembly code produced, and how it would be scheduled on their hardware (how many instructions per cycle and in which order), which you could use to perform trial-and-error optimisations to find better schedules -- I think their FX Composer app also used to provide this function.
For AMD hardware there's GPU ShaderAnalyzer which shows you the real hardware instructions the driver creates, and performance figures for a variety of hardware. It makes it easier to try out changes and see how they will affect performance.
@Hodgman: Thanks a lot for the detailed explanation. Gave me a new perspective on shader programming...

@Adam: Unfortunately I am not on an AMD platform, but let's see what Nvidia has to offer in that department...


Sadly, they don't offer a similar tool. The AMD one is just an indication, as you still need to care about NVidia cards on PC anyway. Also, the AMD tool will work on Windows with an NVidia card: it doesn't actually go to the driver, it just knows how the cards work.

Worked on titles: CMR:DiRT2, DiRT 3, DiRT: Showdown, GRID 2, theHunter, theHunter: Primal, Mad Max, Watch Dogs: Legion

You should still be able to use AMD's analysis tools despite not having an AMD GPU. The results won't show exactly what's happening under the hood on your own PC, but they will be similar. You'll still be able to get an idea of what kind of assembly is produced by a saturate or pow, etc...
