CPU and GPU running in parallel?

Started by transwarp
22 comments, last by legalize 16 years, 7 months ago
Hi, I've got a little question about DirectX's inner handling of draw commands: If I tell the device to draw some primitives, will this block the calling process until the GPU is done with that specific draw command, or will the process just continue executing - effectively having CPU and GPU running in parallel?

In case DirectX draw commands are handled asynchronously, this implies two additional questions for me:

- How do I do proper performance profiling? Currently I have a function in my engine that stores QueryPerformanceCounter time values before and after each entity is rendered. If the application code simply continues executing while the GPU draws, this will only profile the performance on the CPU and not on the GPU, right?
- Where should I place non-GPU related code for maximum efficiency? Would it be a good idea to handle all draw calls at the beginning of the frame and then - while the GPU is handling them - do all the game logic?

Regards, Bastian
Quote:Original post by transwarp
If I tell the Device to draw some primitives, will this lock up the calling process until the GPU is done with this specific draw command or will the process just continue executing - effectively having CPU and GPU running in parallel?
It'll parallelise it. Calling DrawPrimitive() just submits a list of the triangles to the driver to deal with whenever it can.

Quote:Original post by transwarp
In case DirectX draw commands are handled asynchronously, this implies two additional questions for me:
- How do I do proper performance profiling? Currently I have a function in my engine that stores QueryPerformanceCounter time values before and after each entity is rendered. If the application code simply continues executing while the GPU draws, this will only profile the performance on the CPU and not on the GPU, right?
With difficulty [smile] This is what PIX, NVPerfHUD, and other tools are for. Using QueryPerformanceCounter doesn't tell you anything about the speed of the draw calls at all, just how long it takes to submit them to the driver.

Quote:Original post by transwarp
- Where should I place non-GPU related code for maximum efficiency? Would it be a good idea to handle all draw calls at the beginnig of the frame and then - while the GPU is handling them - do all the game logic?
You want to identify serialisation points. These occur when you're touching or locking data that the GPU needs. The main causes are touching vertex and index buffers and textures when the graphics card needs them to render the current frame (Dynamic resources help with this a lot).
The key thing is to use PIX et al. to see where the problems are, then fix them. There's no point trying to fix something that ain't broke.
In theory, updating resources right at the start of the frame, then drawing triangles, then doing game logic, then calling Present() is the most efficient, since it gives the GPU the maximum amount of time to do rendering for that frame. In practice, that's not really practical (it introduces a frame of lag) and the performance increase isn't that substantial.
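A rough sketch of that in-theory ordering (the function names here are hypothetical placeholders that just record the sequence, not real D3D calls):

```cpp
#include <string>
#include <vector>

std::vector<std::string> callOrder; // records the sequence, for illustration

void updateDynamicResources() { callOrder.push_back("update"); }  // fill dynamic VBs first
void submitDrawCalls()        { callOrder.push_back("draw"); }    // hand the GPU its work early
void updateGameLogic()        { callOrder.push_back("logic"); }   // CPU work overlaps GPU rendering
void present()                { callOrder.push_back("present"); } // flip at the very end

// One frame arranged so the GPU gets its commands as early as possible
// and the game logic runs while the GPU is busy rendering.
void runFrame() {
    updateDynamicResources();
    submitDrawCalls();
    updateGameLogic();
    present();
}
```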

So, if you update a vertex buffer with vertices, then draw using those vertices, the GPU can't render until the vertices are uploaded, and Present() can't complete until all primitives are drawn (not strictly true: the driver is allowed to buffer up to three frames of commands).
What's even worse is if you do something like locking a static vertex buffer while D3D is using it. That forces serialisation, and means the GPU has to flush its buffer of frames before it can give you the lock.

A dynamic vertex buffer is a bit more clever. Say you're rendering a cube of 24 vertices (4 verts/face, 6 faces). Instead of making one VB of 24 verts, it'd be more efficient to create a VB of, say, 96 verts and lock it like so:
Frame 0: Lock vertices 0..23 with D3DLOCK_DISCARD flag
Frame 1: Lock vertices 24..47 with D3DLOCK_NOOVERWRITE flag
Frame 2: Lock vertices 48..71 with D3DLOCK_NOOVERWRITE flag
Frame 3: Lock vertices 72..95 with D3DLOCK_NOOVERWRITE flag
Frame 4: Lock vertices 0..23 with D3DLOCK_DISCARD flag
Etc.
The D3DLOCK_DISCARD flag means "I'm done with the entire contents of this VB. I no longer care what's in it. Just chuck the contents away for all I care and give me a lock". The D3DLOCK_NOOVERWRITE flag means "I promise not to overwrite any of the data that's currently being used in a pending draw call. I will only overwrite vertices that you're not using".
The D3DLOCK_NOOVERWRITE flag in particular is a huge hint to the driver here. Without that flag, the driver sees you locking the vertex buffer, and has to wait until it's drawn the cube from the previous frame before it can give you the lock you requested.
The D3DLOCK_DISCARD flag allows the driver to perform buffer renaming, so it can maintain several vertex buffers internally and just use them in round-robin fashion.
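The bookkeeping behind that round-robin pattern can be sketched in plain C++ (the LockFlag enum below is an illustrative stand-in for the real D3DLOCK_* constants, not the SDK values):

```cpp
// Illustrative stand-ins for the D3DLOCK_* flags (not the real SDK constants).
enum LockFlag { LOCK_DISCARD, LOCK_NOOVERWRITE };

const int kVertsPerCube = 24; // 4 verts per face * 6 faces
const int kBufferVerts  = 96; // room for 4 frames' worth of vertices

// Returns the first vertex to lock for a given frame number,
// wrapping around once the oversized buffer is full.
int lockOffsetForFrame(int frame) {
    return (frame * kVertsPerCube) % kBufferVerts;
}

// DISCARD when we wrap back to the start of the buffer (throw the old
// contents away); NOOVERWRITE while appending behind pending draw calls.
LockFlag lockFlagForFrame(int frame) {
    return lockOffsetForFrame(frame) == 0 ? LOCK_DISCARD : LOCK_NOOVERWRITE;
}
```

Frame 0 locks verts 0..23 with the discard flag, frames 1-3 append with the no-overwrite flag, and frame 4 wraps back to 0 and discards again, exactly matching the table above.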
Hi Steve,
thanks for the quick reply!

Quote:Original post by Evil Steve
With difficulty [smile] This is what PIX, NVPerfHUD, and other tools are for. Using QueryPerformanceCounter doesn't tell you anything about the speed of the draw calls at all, just how long it takes to submit them to the driver.

I've tried to use PIX for profiling, but I must admit I haven't yet understood how to determine how long a particular draw command takes to handle.
I've used a lot of D3DPERF_BeginEvent and D3DPERF_EndEvent calls in my code to make PIX's frame log more readable, but the time values it gives me are clearly wrong. It says the whole frame takes less than a second to render, but a certain part of that frame supposedly takes over 30 seconds!

Quote:Original post by Evil Steve
So, if you update a vertex buffer with vertices, then draw using those vertices, the GPU can't render until the vertices are uploaded, and Present() can't complete until all primitives are drawn (not strictly true: the driver is allowed to buffer up to three frames of commands).
What's even worse is if you do something like locking a static vertex buffer while D3D is using it. That forces serialisation, and means the GPU has to flush its buffer of frames before it can give you the lock.

Hmm, not sure whether this is bad manners for a forum, but might I refer to this other post I have open regarding dynamic vertex buffers?
http://www.gamedev.net/community/forums/topic.asp?topic_id=463223

Bastian
Oh, I didn't realise that other post was by you too, I've already replied to it [smile]

I have to admit that I've not used PIX before (I'm actually looking at it as we speak), so I can't really comment. Are you using the debug runtimes? It's possible that it would get confused by the retail runtimes.
I actually wrote a detailed bug report regarding these weird time values to the DirectX SDK team. Naturally, I never received a reply. ;)
Quote:Original post by Evil Steve
Oh, I didn't realise that other post was by you too, I've already replied to it [smile]

I have to admit that I've not used PIX before (I'm actually looking at it as we speak), so I can't really comment. Are you using the debug runtimes? It's possible that it would get confused by the retail runtimes.


Unfortunately, from what I remember from the last time I used PIX, it only provided timing for when a method was called. While this is sometimes useful, it's nothing a regular profiler can't do.

By far the best tool for profiling your rendering is NVPerfHUD. It does, however, only work on NVIDIA hardware. It uses an instrumented driver which, when turned on, lets PerfHUD see how long each DP call took, as well as a breakdown of how long each of the GPU stages (VS, PS, IA, etc.) took. It should already be clear by now that this tool is absolutely awesome for finding bottlenecks in your rendering.

Looking over this post, it looks like you're asking a lot of very general questions that don't have a simple, always-correct answer. It all depends, and your job is to use the tools available to find out where things can be improved.

Hope this helps.
Hi sirob,

Quote:Original post by sirob
Unfortunately, from what I remember from the last time I used PIX, it only provided timing for when a method was called. While this is sometimes useful, it's nothing a regular profiler can't do.

So I suppose I can keep using QueryPerformanceCounter for this and use PIX only for shader debugging and stuff like that.

Quote:Original post by sirob
The truely best tool for profiling your rendering is, by far, NVPerfHud. It does, however, only work on NV hardware. It utilizes an instrumented driver, which when turned on, allows PerfHud to know how long each DP call took, as well as a breakdown of how long it took each of the stages in the GPU (VS, PS, IA, etc.). It should already be clear by now that this tool is absolutely awesome for finding bottlenecks in your rendering.

I regret buying my Radeon X1950XT now. :(

Quote:Original post by sirob
Looking over this post, it looks like you're asking a lot of very general questions, that don't have a simple, always-correct answer. It all depends, and your job is to use the tools available to find out where things can be improved.

What other tools for performance profiling could you recommend in general? It seems there's not all too much I can do from within my own code.
Quote:Original post by transwarp
I regret buying my Radeon X1950XT now. :(

Not the end of the world. It's a good tool, but you can do without it.
Quote:Original post by transwarp
What other tools for performance profiling could you recommend in general? It seems there's not all too much I can do from within my own code.

I use VS's built-in profiler for profiling my local C++ code. As for HLSL, I haven't needed a profiler in a while, so I'm not too sure what I'd use if I needed one.
Is there actually any good PIX tutorial out there? I can view how many DIP/DP/DUP calls, state changes, texture changes, etc. I'm doing per frame, but I can't really tell from these values where the real bottlenecks are.
I've just had a (potentially stupid) idea.
If I decide that I want to log how long each draw call takes to handle completely, could I not do the following?

1. Store the current QueryPerformanceCounter result.
2. Do the draw call.
3. Lock the vertex buffer this draw call used (thus forcing a synchronisation).
4. Now log the QueryPerformanceCounter result again.

Of course I'd only do this when I specifically turn on the logging feature for a single frame, not during normal operation.
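Sketching those four steps: the timing helper below is generic and runnable, while the D3D-specific parts are left as comments, since the exact device/buffer calls depend on your setup (those commented names are hypothetical):

```cpp
#include <chrono>
#include <functional>

// Generic helper: runs fn and returns its wall-clock time in milliseconds.
double timeMs(const std::function<void()>& fn) {
    auto t0 = std::chrono::steady_clock::now();   // 1. store the start time
    fn();
    auto t1 = std::chrono::steady_clock::now();   // 4. read the time again
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// The idea above, with the D3D calls sketched as comments:
// double ms = timeMs([&] {
//     device->DrawPrimitive(...);   // 2. submit the draw call
//     vb->Lock(...);                // 3. blocking lock forces the GPU
//     vb->Unlock();                 //    to finish with the buffer
// });
// Note the forced flush distorts normal CPU/GPU pipelining, so the
// numbers only make sense in a dedicated single-frame profiling mode.
```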

This topic is closed to new replies.
