Multithreaded use of D3D

Have you read Accurately Profiling Direct3D API Calls?

Quote: Original post by Coder
Have you read Accurately Profiling Direct3D API Calls?


If people are having trouble wrapping their heads around the idea of command buffering and the CPU aspect of the graphics pipeline, might I also suggest some content we put forth for Meltdown 2004, titled The CPU Aspect of the D3D Pipeline? I've also dropped the speech script here for reference. I've seen good presentations from ATI and NVidia related to this topic, linked in other threads as well.

As for efficiently adding CPU threads to the graphics pipeline, that's a notoriously hard problem. As highlighted by other people here, just setting the MULTITHREAD flag typically "just adds overhead" (about 100 clocks per API call to acquire and release a crit-sec). The app has to add its own crit-sec, since there's no way to hold D3D's crit-sec across multiple API calls: Set, Set, Draw.
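To make that concrete, here's a minimal sketch of the kind of app-owned lock I mean. The device pointer, vertex type, and function are hypothetical stand-ins, and std::mutex stands in for a critical section:

// Sketch: D3DCREATE_MULTITHREADED serializes individual API calls, but the
// app must guard multi-call sequences (Set, Set, Draw) itself so another
// thread can't interleave its own state changes between them.
#include <d3d9.h>
#include <mutex>

struct MyVertex { float x, y, z; DWORD color; };   // hypothetical layout

std::mutex        g_deviceMutex;   // app-owned lock, not D3D's internal one
IDirect3DDevice9* g_device;        // created with D3DCREATE_MULTITHREADED

void DrawBatch(IDirect3DTexture9* tex, IDirect3DVertexBuffer9* vb, UINT triCount)
{
    std::lock_guard<std::mutex> lock(g_deviceMutex); // held across the whole sequence
    g_device->SetTexture(0, tex);
    g_device->SetStreamSource(0, vb, 0, sizeof(MyVertex));
    g_device->DrawPrimitive(D3DPT_TRIANGLELIST, 0, triCount);
}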

Interesting techniques for adding multiple CPU threads tend to gravitate toward adding more frames of latency and extending the graphics pipeline across more than one CPU. (Have CPU 1 work on visibility culling for frame n, while CPU 2 works on rendering the (culled) frame n-1 by calling the API.) Or give the other CPU thread another Device Context to work with and share Resources across Device Contexts. Unfortunately, the second method can't be done easily with D3D9.
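A purely illustrative sketch of the first technique. VisibleSet and the cull/render functions are placeholders I made up, and std::thread stands in for whatever threading API you use:

// Sketch: CPU 1 culls frame n while CPU 2 renders the already-culled frame n-1.
#include <thread>
#include <vector>

struct VisibleSet { std::vector<int> objects; };     // hypothetical culling output

VisibleSet CullFrame(int n)          // CPU 1's job: visibility culling for frame n
{ VisibleSet s; s.objects.push_back(n); return s; }

void RenderFrame(const VisibleSet&)  // CPU 2's job: make the D3D calls for the set
{ /* Set/Draw calls go here */ }

void RunPipelined(int frameCount)
{
    VisibleSet pending = CullFrame(0);                    // prime the pipeline
    for (int n = 1; n < frameCount; ++n)
    {
        VisibleSet next;
        std::thread culler([&] { next = CullFrame(n); }); // CPU 1: frame n
        RenderFrame(pending);                             // CPU 2: frame n-1
        culler.join();               // the extra frame of latency lives here
        pending = std::move(next);
    }
    RenderFrame(pending);            // drain the final culled frame
}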

Realistically, multiple CPU threads are good at managing Resource IO load, as others have mentioned. That's probably the more proven area that CPU threads are useful.
Hi Brian,

Thank you very much for your comments and pointers to your resources.
==========================================================
I have read that "Accurately Profiling…" paper many, many times… I guess I have not totally understood it.

-----To be viable, it seems the technique (of flushing the command buffer) is trying to make the GPU work time negligible.

-----And then in the PPT file (The CPU Aspect of the D3D Pipeline) there is a study comparing the graphs of GPU A and GPU B. So I guess the timing is not that negligible after all. And I assume you are using the technique in the paper to force the work to become more or less synchronous (receive immediate attention and processing).

I wonder how these two tie in together… I must be missing something…

=======================================================================

But then --- am I right that in general things are pretty much asynchronous (except if we do this query-event-flush technique), and the only grounds for blocking on an API call would be Lock* (potentially) and Present (more frames of work queued up than allowed)? (Just to get it from the D3D people's mouths.) So the D3DCREATE_MULTITHREADED flag: all it incurs is the global critical section and nothing else?

thanks
chia
Folks,

Another thing: from my various readings I am under the impression that the D3D runtime's work and the driver's work are rather thin, mainly validation and preparation, and that most of the work would be done on the GPU anyway.

So I guess I wasn't very convinced about all this profiling-the-D3D-runtime stuff? How much each API call costs, etc.

Please... somebody enlighten me...

thanks
chia
It seems like you're still struggling with the asynchronous/parallel processing aspect of the GPU. If you have experience with multi-threaded programming, you should use that as an analogy. Imagine the GPU is another CPU thread, except the GPU is more of a slave than a full-fledged sibling.

Now, how do multiple threads communicate?

Sticking with the analogy, imagine that CPU thread 1 builds up a command buffer. When the buffer is full, CPU thread 1 can hand off the whole buffer to CPU thread 2 (the GPU) for consumption. Handing off the whole buffer is analogous to an API command buffer flush. If that flush does not happen, the commands will never get acted upon ('cuz they are not in the GPU's IN box yet).

A BAD thing to do related to this concept would be to call Draw(), then Sleep( 1000 ). An app should call Draw(), then Flush, before Sleep( 1000 ). Otherwise, the GPU will never make progress on the Draw.
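In D3D9 terms, the usual way to force that flush is an EVENT query. A minimal sketch, assuming 'device' already exists and with error checking omitted:

#include <d3d9.h>

void DrawThenSleep(IDirect3DDevice9* device, UINT triCount)
{
    IDirect3DQuery9* query = nullptr;
    device->CreateQuery(D3DQUERYTYPE_EVENT, &query);

    device->DrawPrimitive(D3DPT_TRIANGLELIST, 0, triCount);

    query->Issue(D3DISSUE_END);                   // fence placed after the Draw
    query->GetData(nullptr, 0, D3DGETDATA_FLUSH); // kicks the command buffer out;
                                                  // returns S_FALSE, we don't wait
    query->Release();

    Sleep(1000);   // the GPU can now make progress on the Draw while we sleep
}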

DO NOT confuse this and assume that you must litter your application with lots of flushing (just in case the GPU is idle). If you look at the big picture, the GPU will be working on frame N, while the CPU is working on frame N + 1. Flushing will happen automatically for the app when needed (like during invoking Present).

When dealing with CPU/GPU synchronization, the most popular reasons for the CPU "busy-waiting" on the GPU are: in response to a Lock() call (GPU not done with the Resource yet), when too many command buffers have been submitted to the GPU (i.e. the GPU is way too far behind the CPU), and the imposed restriction on Present where the driver must ensure the GPU is within 3 frames. All these places are implemented with a busy-wait, so you will not see the CPU thread yield... Naturally, there is also GetData (which explicitly exposes this type of busy-wait to the app). An application can reclaim the time lost to busy-waiting itself, either by using GetData with EVENT Queries or the DONOTWAIT flag for Lock.
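For example, both escape hatches look roughly like this. A sketch; DoOtherUsefulWork is a hypothetical placeholder for whatever the app can overlap:

#include <d3d9.h>

void DoOtherUsefulWork() { /* AI, sound, next frame's setup, ... */ }

void LockWithoutBusyWait(IDirect3DSurface9* surface)
{
    // Probe with DONOTWAIT instead of letting Lock spin inside D3D.
    D3DLOCKED_RECT lr;
    HRESULT hr = surface->LockRect(&lr, nullptr, D3DLOCK_DONOTWAIT);
    while (hr == D3DERR_WASSTILLDRAWING)   // GPU still using the Resource
    {
        DoOtherUsefulWork();               // reclaim the would-be busy-wait time
        hr = surface->LockRect(&lr, nullptr, D3DLOCK_DONOTWAIT);
    }
    // ... read/write lr.pBits ...
    surface->UnlockRect();
}

void WaitWithoutBusyWait(IDirect3DQuery9* eventQuery)
{
    // Same idea with an EVENT query: poll instead of spinning inside D3D.
    eventQuery->Issue(D3DISSUE_END);
    while (eventQuery->GetData(nullptr, 0, D3DGETDATA_FLUSH) == S_FALSE)
        DoOtherUsefulWork();               // S_OK means the GPU has caught up
}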

BTW, I should make it clear that the MULTITHREAD flag takes a crit-sec per Device object. So, each Device object owns its own crit-sec when created with the MULTITHREAD flag. There is another crit-sec in the kernel to prevent multiple CPU threads from entering the kernel-mode driver. That crit-sec can never be avoided anyway, because it's system-wide.

[Edited by - Brian Klamik on April 5, 2005 7:14:10 PM]
thanks.
In the "Accurately Profiling..." paper. Some numbers seem to just come out of the blue ("Assuming ...")
e.g. driver costs for SetTextures .... etc are (approx) priced at 2964, 3600 etc.
Can somebody explain how those numbers come from?
thanks!
The Meltdown presentation shows actual results using the method detailed in the SDK documentation.

What is happening at the beginning and end (for small batches) is:

Start: IF = (implicit flush)
CPU: |---calling 1000 Draw-----|IF, calling 1000 Draw--------|IF, calling ....
GPU: ----------Idle------------|--Drawing--|----Idle---------|--Drawing--|---Idle-

End: PGD = (Poll GetData with explicit flush, returned S_FALSE)
CPU: ----|PGD, PGD, ...-|GetData returned S_OK...
GPU: Idle|--Drawing ----|---Idle-----

"Keeping GPU work negligable" means cpu-limited rendering. Therefore, the length of time the GPU takes to execute is less than the length of time it takes the CPU to even make the calls. That's why the GPU is idle so often and finishes the work before the cpu can submit another buffer.

The numbers for driver costs come from actual costs discovered with the method detailed in the SDK documentation. They are roughly representative of the cost to expect, but naturally, drivers change and there are different vendors, so only you can know how much your current driver will cost.
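For reference, the SDK method boils down to something like the following sketch. The call count, the two-texture alternation (to keep redundant-state filtering from skipping calls), and the helper name are my own assumptions; the SDK documentation then converts the measured ticks to CPU clocks:

#include <d3d9.h>

LONGLONG ProfileSetTexture(IDirect3DDevice9* dev, IDirect3DQuery9* evt,
                           IDirect3DTexture9* texA, IDirect3DTexture9* texB,
                           int calls = 1000)
{
    // 1. Flush and drain so no earlier work pollutes the measurement.
    evt->Issue(D3DISSUE_END);
    while (evt->GetData(nullptr, 0, D3DGETDATA_FLUSH) == S_FALSE) {}

    LARGE_INTEGER start, stop;
    QueryPerformanceCounter(&start);

    // 2. Time a large batch of calls (alternate textures so none are no-ops).
    for (int i = 0; i < calls; ++i)
        dev->SetTexture(0, (i & 1) ? texA : texB);

    // 3. Flush and drain again so buffered runtime/driver work is counted too.
    evt->Issue(D3DISSUE_END);
    while (evt->GetData(nullptr, 0, D3DGETDATA_FLUSH) == S_FALSE) {}

    QueryPerformanceCounter(&stop);
    return (stop.QuadPart - start.QuadPart) / calls;  // average ticks per call
}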

