How to understand GPU profiler data and use it to track down a suspicious, abnormally slow GPU task?

Hey Guys,

 

If I understand correctly, the reason we use/implement a GPU profiler is to identify any suspicious render passes/GPU tasks in our pipeline that are abnormally slow, so we can optimize the renderer more efficiently. However, sometimes I find it very hard to identify those suspicious slow GPU jobs from the GPU profiler. For example, my GPU profiler told me that it takes 1.5ms for a compute shader to copy a 1080p R8G8B8A8 image from CPU-writable memory to a Texture2D of the same format on the default heap every frame. Does that sound normal, given that I am using a GTX 680M?

 

I feel like most experienced graphics programmers know roughly how long it should take the GPU to do certain tasks. I'm just wondering how they know it B-)

 

Thanks 

You need to do the math. Get the maximum data-transfer specs of your system: the GPU, the PCI-E bus, system RAM, etc. (theoretical on-paper specs are fine, whether you got them online or via a tool like GPU-Z, but it's much better to work with data from a specialized benchmark tool you ran on your own system). Once you've got the transfer speeds of your system, do the math and check whether you're hitting one of the limits. FYI, a 1080p RGBA8888 image needs 8 bits x 4 channels x 1920 x 1080 = 66,355,200 bits, which is 8,294,400 bytes, 8,100 KB, or about 7.9 MB.
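To make that concrete, here is a small C++ sketch (the image size and the 1.5ms are the numbers from this thread; the limit you compare the result against should come from your own specs or benchmark, not from anything here):

// Just to make "do the math" concrete: turn a measured time into an effective
// bandwidth, then compare it against what the slowest link involved can do.
double EffectiveGBPerSec(double bytesMoved, double milliseconds)
{
    return bytesMoved / (milliseconds * 1.0e-3) / 1.0e9;
}
// EffectiveGBPerSec(1920.0 * 1080.0 * 4.0, 1.5) is roughly 5.5 GB/s for the copy
// in question; if that is close to the limit of the slowest link involved,
// the pass is already near optimal.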


Thanks Matias, that is very helpful. 

So I checked my system specs (800MHz DDR3 dual-channel memory) and calculated that my machine ideally could transfer 36.6MB from CPU memory in those 1.5ms. It seems 7.9MB is not even close to that limit, so could I conclude that I've done something totally wrong, or are there other aspects I should be aware of?

 

Also, does the same math apply to pure GPU execution? For example, if I want to know how long it should take the GPU to reset an R16 256^3 Texture3D, do I take my GPU clock rate * shader cores * something?

 

Thanks


My maths gives rather different numbers.

 

800MHz * 8 bytes per clock = 6.4GB/s.

Double that for dual channel memory = 12.8GB/s

 

You need to read 8MB of memory and write 8MB of memory, so that's 16MB total.

 

At 12.8GB/s you can move 12.8MB per millisecond. So already it's clear that 16MB will take longer than 1ms to move.

 

Even if you managed to hit 100% of the theoretical memory transfer speed (unlikely), it would still take at least 1.25ms. I'd say you're not too far away from exactly the right speed.
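As a tiny sketch of that arithmetic (assuming, as above, that both the 8MB read and the 8MB write go through the same dual-channel DDR3-800 system memory):

// Best case for the copy if both the read and the write go through
// dual-channel DDR3-800 system memory (6.4 GB/s per channel).
double BestCaseCopyMs()
{
    const double bytesPerSecond = 2.0 * 800.0e6 * 8.0;         // 12.8 GB/s total
    const double bytesMoved     = 2.0 * 1920.0 * 1080.0 * 4.0; // read + write, ~16.6 MB
    return bytesMoved / bytesPerSecond * 1000.0;               // ~1.3 ms
}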

Thanks Adam, but I just want to know why my number is different from yours. I used this page as guidance:

bandwidth (bit/s) = 800M (base DRAM clock frequency) * 2 (data transfers per clock) * 64 (memory bus width in bits) * 2 (number of interfaces) = 204,800,000,000 bit/s

bandwidth (MB/ms) = 204,800,000,000 / 8 / 1024 / 1024 / 1000 = 24.414 MB/ms
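(For what it's worth, the factor-of-two difference between these two calculations seems to come down to whether the 800MHz figure is the base DRAM clock, which still gets doubled for DDR, or already the effective transfer rate. The page's formula as a small helper, so either interpretation can be plugged in:)

// Generic DDR bandwidth formula from the page above. Pass transfersPerClock = 2
// if clockHz is the base/IO clock, or 1 if the figure is already the effective rate.
double MemoryBandwidthBytesPerSec(double clockHz, double transfersPerClock,
                                  double busWidthBits, double channels)
{
    return clockHz * transfersPerClock * (busWidthBits / 8.0) * channels;
}
// MemoryBandwidthBytesPerSec(800.0e6, 2, 64, 2) == 25.6e9 (the calculation above)
// MemoryBandwidthBytesPerSec(800.0e6, 1, 64, 2) == 12.8e9 (Adam's DDR3-800 number)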

 

Also, I can't understand why I have to count the 8MB write in my case. Since the compute shader writes to GPU memory (the default heap), I guess the bandwidth for that write is different (much faster, right?).

 

Thanks in advance


You may be right about the 8MB write if your mobile GPU has dedicated VRAM. I'm used to thinking of mobile GPUs as sharing system memory with the CPU, in which case you would have to count both.

 

I was using this page for my DDR3-800 bandwidth numbers. I took the triple channel memory number, divided by 3 and multiplied by 2 to get back to Dual Channel.

 

If your GPU definitely has dedicated VRAM and it's writing into it then perhaps you're at only around 50% of the theoretical mark you might expect. That too might not be that surprising given that the tiling mode (i.e. 'layout') of the source memory is surely linear. GPUs will often only be able to hit peak throughput on things like texture fetching / bandwidth when the memory access is swizzled/tiled in such a way that it hits all memory banks exactly as intended.

 

Have you tried doing a raw Buffer to Buffer copy between your UPLOAD heap and the DEFAULT heap just to test how long it takes?
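In case it helps, a rough D3D12 sketch of that test (all of the resource, heap, and query objects are assumed to have been created elsewhere; the names are placeholders, and defaultBuffer is assumed to already be in the COPY_DEST state):

// Time a raw UPLOAD -> DEFAULT buffer copy with GPU timestamps.
// uploadBuffer, defaultBuffer, queryHeap, readbackBuffer and cmdList exist already.
cmdList->EndQuery(queryHeap, D3D12_QUERY_TYPE_TIMESTAMP, 0);
cmdList->CopyBufferRegion(defaultBuffer, 0, uploadBuffer, 0, bufferSizeInBytes);
cmdList->EndQuery(queryHeap, D3D12_QUERY_TYPE_TIMESTAMP, 1);
cmdList->ResolveQueryData(queryHeap, D3D12_QUERY_TYPE_TIMESTAMP, 0, 2, readbackBuffer, 0);
// After executing the list and waiting on a fence: map readbackBuffer, read the two
// UINT64 ticks, and divide the delta by the value returned from
// ID3D12CommandQueue::GetTimestampFrequency() to convert to seconds.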


I haven't tried a pure Buffer to Buffer copy, since the feedback from my previous post suggested using a Texture because there will be multiple non-linear GPU reads during later passes. But I should probably profile it (if doing a Buffer to Buffer copy could save around 0.8ms, it may compensate for the slightly slower reads later).

 

Also, it would be nice if you could share how you roughly estimate the time cost of more general GPU jobs like a Gaussian blur pass, a full-quad volume raycast, etc. I feel those would be much harder to predict theoretically.

 

Thanks


Buffer to Buffer copy is probably the best test for measuring bandwidth, but it doesn't represent the best layout for accessing data that is logically two or three dimensional, so stick to textures for that.

 

There really isn't a good one-estimate-fits-all-hardware approach to knowing how long things should take. The difference alone between 720p and 1080p is 2.25x and the hardware disparity between a reasonable mobile GPU and a high end discrete GPU is > 10x. So depending on whether you're talking about a Titan X rendering at 720p or a mobile GPU rendering at 1080p you could be talking about a 20x difference in GPU time.

 

I have a pretty good idea of how long typical tasks should take on Xbox One at common resolutions (720p, 900p, 1080p), but that just comes from years of looking at PIX captures of AAA titles from the best developers. If you asked me how long X should take on hardware Y at resolution Z, I'd probably start with the numbers I know from Xbox One, divide them by however much faster I think the hardware is, and then multiply up by the increased pixel count.
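That kind of scaling is easy to write down; a sketch (the baseline time and the speed-up factor are whatever you happen to know or guess about the two GPUs, nothing from this thread):

// Scale a pass timing you know from one GPU/resolution to another GPU/resolution.
double EstimateMs(double knownMs, double hardwareSpeedup,
                  double knownPixels, double targetPixels)
{
    return knownMs / hardwareSpeedup * (targetPixels / knownPixels);
}
// e.g. EstimateMs(2.0, 4.0, 1280.0 * 720.0, 1920.0 * 1080.0): a 2ms 720p pass on
// hardware you know becomes roughly 1.1ms at 1080p on a GPU you guess is ~4x faster.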

 

It doesn't hurt to try to figure out how close you might be coming to the various hardware limits a GPU has, just to see if you're approaching any of them. Metrics like vertices/second, fill rate, texture fetches, bandwidth, TFLOPS, etc. are all readily available for AMD/NVIDIA cards. The only tricky one to work out is how many floating-point operations your shader might cost per pixel/thread, since you don't always have access to the raw GPU instructions (at least on NVIDIA cards you don't), but you can approximate it from the DXBC.
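A sketch of those limit checks (the inputs are the on-paper numbers for whichever card you're targeting; flopsPerPixel is your own estimate from the DXBC):

// Lower bounds implied by a few common hardware limits; the real pass can never
// be faster than the slowest of these.
struct GpuLimits { double bytesPerSec; double flopsPerSec; double texelsPerSec; };

double AluBoundMs(double pixels, double flopsPerPixel, const GpuLimits& gpu)
{
    return pixels * flopsPerPixel / gpu.flopsPerSec * 1000.0;
}
double BandwidthBoundMs(double bytesReadAndWritten, const GpuLimits& gpu)
{
    return bytesReadAndWritten / gpu.bytesPerSec * 1000.0;
}
double TextureFetchBoundMs(double textureFetches, const GpuLimits& gpu)
{
    return textureFetches / gpu.texelsPerSec * 1000.0;
}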

Edited by Adam Miles


When looking at a PC, the transfer between main memory and the graphics card almost certainly isn't being limited by the bandwidth of your RAM (unless it's really slow). The limit will be the speed of the PCI express bus.

 

Given the age of the card, we can probably assume PCI Express version 2.0. Using 16 lanes at 500MB/s per lane gives a total of 8GB/s of memory bandwidth (in each direction). I think those numbers are 8 billion bytes per second, not 8 * 1024 * 1024 * 1024 bytes per second.

 

(1920 * 1080 * 4) / 0.0015 seconds = 5,529,600,000 bytes/s

 

The optimal duration assuming you fully utilize the bus is: (1920 * 1080 * 4) / (8,000,000,000) = 0.0010368 seconds, or 1.0368 milliseconds.

 

Based on that, you're running at about 69% of the maximum theoretical speed. You might be able to speed it up a bit more, but you're not that far off, and actually hitting the theoretical maximum usually isn't a realistic goal.
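The same numbers as a couple of lines of code, for completeness (the 1.5ms and PCI-E 2.0 x16 being the figures discussed above):

// PCI-E 2.0 x16: 16 lanes * 500 MB/s per lane, each direction.
const double busBytesPerSec = 16.0 * 500.0e6;                        // 8 GB/s
const double imageBytes     = 1920.0 * 1080.0 * 4.0;                 // 8,294,400 bytes
const double achievedBps    = imageBytes / 0.0015;                   // ~5.53 GB/s observed
const double bestCaseMs     = imageBytes / busBytesPerSec * 1000.0;  // ~1.04 ms
const double efficiency     = achievedBps / busBytesPerSec;          // ~0.69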

Edited by Adam_42


Initially I ignored PCI-E bandwidth because I had it in my head that his mobile/notebook GPU was not actually doing a System->VRAM copy. As it turns out, though, the 680m is a PCI-E 3.0 x16 card, so it should be a 16GB/s bus, exceeding the bandwidth of his system memory by a few GB/s.
