Map/unmap, CopyStructureCount and slow down

8 comments, last by Paul__ 11 years, 11 months ago
Hey all,

Profiling has shown that there's a massive slowdown at one point in my game app.

In each frame, I use the compute shader to create vertices which are written to a default usage append buffer. Then the code reads the number of vertices written by the compute shader with CopyStructureCount(). The target buffer for CopyStructureCount() is a D3D11_USAGE_STAGING buffer which is four bytes long, created with D3D11_CPU_ACCESS_READ. Then my app calls Map() -> memcpy() -> Unmap(). This last step causes the cpu to stop for 4 ms and the gpu to stop for 1 ms.

Without the call to the staging buffer's map/unmap, other dx calls and the app generally seem to take the right amount of time.

It's possible for me to calculate from the game data how many verts should be written, and therefore not call CopyStructureCount(). But it's a huge headache, involving tracking lots of data that I otherwise wouldn't need to.

The amount of pause is directly related to the length of the compute shader call. More vertices to create, longer pause. Seems likely the cpu is waiting for it to finish.

Now, I know that with some dx calls the cpu is forced to wait for the gpu, because the gpu is already using that resource. But why does the GPU pause too? And surely double buffering won't help? Because the *same* frame needs to know how many primitives to write in the soon-to-follow draw() call.

Any other suggestions? I'm sort of guessing here, but could I swap the order of each frame? Maybe:
- <Frame starts>
- Get the struct count from last frame
- Draw the verts
- Generate the next frame's verts
- Present

It's very hard to get *general* info about dx11 and the temporal relationship between the gpu and cpu, so any experienced help would be great!
Normally the CPU and GPU work asynchronously, with the CPU submitting commands way ahead of when the GPU actually executes them. When you read back a value on the CPU (which is what you're doing with the staging buffer), you force a sync point where the CPU flushes the command buffer and then sits around waiting for the GPU to execute all pending commands. The amount of time it has to wait depends on the number of pending commands and how long they take to execute, which means it could potentially get much worse as your frames get more complex. I'm not sure how you're determining that the GPU is "pausing", but I would doubt that is the actual case.
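To make the sync point concrete, here is a minimal timeline sketch. It is a toy model rather than Direct3D code: `Command`, `readbackStallMs`, and the per-command costs are all hypothetical, and it assumes the GPU runs commands strictly in order. It only illustrates why the stall grows with the amount of GPU work still pending when the readback is requested.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy timeline model, NOT Direct3D code. All names here are hypothetical.
// Each command costs the CPU `cpuMs` to record and the GPU `gpuMs` to execute.
struct Command {
    double cpuMs;
    double gpuMs;
};

// Returns how long the CPU would stall if it requested a readback right after
// recording every command in `cmds`. Assumption: the GPU executes commands in
// order and cannot start a command before the CPU has recorded it.
double readbackStallMs(const std::vector<Command>& cmds) {
    double cpuClock = 0.0;  // time at which the CPU finishes recording
    double gpuClock = 0.0;  // time at which the GPU finishes executing
    for (const Command& c : cmds) {
        assert(c.cpuMs >= 0.0 && c.gpuMs >= 0.0);
        cpuClock += c.cpuMs;
        // The GPU starts this command once it is both recorded and the
        // previous command has finished.
        gpuClock = std::max(gpuClock, cpuClock) + c.gpuMs;
    }
    // The readback can only complete once the GPU has caught up with the CPU.
    return std::max(0.0, gpuClock - cpuClock);
}
```

In this model, making the last command (think of the compute dispatch) more expensive on the GPU lengthens the stall by the same amount, which matches the "more vertices, longer pause" observation.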

Swapping the order can potentially help, if you can keep the CPU busy enough to absorb some of the GPU latency.
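One common way to structure a delayed readback is a small ring of staging slots: each frame the new count goes into one slot, and the CPU reads the slot that was filled a few frames earlier. The sketch below is a plain C++ model of that bookkeeping, not Direct3D code; `DelayedReadback` and `swap` are hypothetical names. In a real D3D11 app each slot would presumably be its own staging buffer targeted by CopyStructureCount(), and the value you get is stale by `Latency` frames, so this only helps if the draw can use an older count.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy model of a ring of staging slots, NOT Direct3D code. Names are
// hypothetical. Assumption: a slot written `Latency` frames ago has long
// since been finished by the GPU, so reading it would not stall.
template <std::size_t Latency>
class DelayedReadback {
    static_assert(Latency >= 1, "need at least one frame of latency");

public:
    // Call once per frame with the count produced this frame.
    // Returns true and fills `outOld` with the count produced `Latency`
    // frames ago; returns false during the first `Latency` warm-up frames.
    bool swap(std::uint32_t producedThisFrame, std::uint32_t& outOld) {
        const std::size_t slot = frame_ % Latency;
        const bool haveOld = frame_ >= Latency;
        if (haveOld) {
            outOld = slots_[slot];  // read the old value before overwriting
        }
        slots_[slot] = producedThisFrame;
        ++frame_;
        return haveOld;
    }

private:
    std::array<std::uint32_t, Latency> slots_{};
    std::size_t frame_ = 0;
};
```

With `DelayedReadback<2>`, the third call to `swap` returns the count passed to the first call. If you want to check whether a staging buffer is ready without blocking, D3D11 also offers D3D11_MAP_FLAG_DO_NOT_WAIT on Map().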
Thanks for your answer. I'm not sure if I can really reorganise the way a frame is structured. Which means I might have to go the hard way and maintain counts of all the geometry, rather than read the count from the append buffer. Damn!

So just to clarify, when an app *reads* a GPU buffer using map/unmap(), it will *always* cause the CPU to wait for the GPU? Compared to when an app *writes* to a dynamic buffer, which doesn't always cause the cpu to wait (I guess because under the hood dx seems to maintain multiple buffers for dynamic writes).

Also, when you say that the CPU "sits around waiting for the GPU to execute all pending commands", does that truly mean that all dx commands queued up for that frame have to be executed before a buffer can be read, or does it mean that only commands involving the particular append buffer to be read have to be waited for?

I'm using dx queries to time the gpu. I could well have made a mistake though!

Thanks again.
Paul
Until the GPU has executed the instruction queue, the data that you are going to read doesn't exist.

However, you don't have to wait for read access to resources that are not used as targets for the currently running operations.

Niko Suni

Okay, thanks Nik02, I think I understand the GPU/CPU relationship a bit better now.
It is best to think about the GPU as a remote machine to which you send requests, and from which you can then download the responses (if you need them). It actually is a remote machine, even though the physical distance from the CPU isn't usually very long.

Niko Suni


It is best to think about the GPU as a remote machine to which you send requests, and from which you can then download the responses (if you need them). It actually is a remote machine, even though the physical distance from the CPU isn't usually very long.

That's quite a good analogy.

Another that may work is that it's like sending radio signals to the moon. At the speed of light, a signal arrives in about 1.3 seconds. If all you're doing is sending signals, you can send them as fast as you possibly can, one every millisecond if you wish. However, if at any point you need to wait on a response before you can send the next signal, you have a 1.3 second wait for the signal to reach the moon, an unknown amount of time while it's being processed and acted on there, and another 1.3 seconds before the response gets back to you. During this time you're sitting there doing nothing; you can't send the next signal until you get the response.



So just to clarify, when an app *reads* a GPU buffer using map/unmap(), it will *always* cause the CPU to wait for the GPU?


Yup. The data you need doesn't exist until the GPU actually writes it, which means that the command that writes the data (and all previous dependent commands) have to be executed before the data is available for readback.


Compared to when an app *writes* to a dynamic buffer, which doesn't always cause the cpu to wait (I guess because under the hood dx seems to maintain multiple buffers for dynamic writes).


Indeed, the driver can transparently swap through multiple buffers using a technique known as buffer renaming. This allows the CPU to write to one buffer while the GPU is currently reading from a different buffer.
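As a rough illustration of the idea, here is a toy model of renaming in plain C++. It is not how any particular driver works and not Direct3D code; `RenamedBuffer` and its methods are hypothetical. It assumes a fixed pool of copies, where a discard-style map (D3D11_MAP_WRITE_DISCARD in D3D11) hands the CPU a copy the GPU is not currently reading.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of driver-side buffer renaming, NOT Direct3D code.
// All names are hypothetical; real drivers manage this internally.
class RenamedBuffer {
public:
    explicit RenamedBuffer(std::size_t copies) : inUseByGpu_(copies, false) {
        assert(copies > 0);
    }

    // Models a discard-style map: returns the index of a copy the GPU is not
    // reading, or -1 if every copy is busy (the CPU would have to wait).
    int mapDiscard() {
        const std::size_t n = inUseByGpu_.size();
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t candidate = (current_ + 1 + i) % n;
            if (!inUseByGpu_[candidate]) {
                current_ = candidate;
                return static_cast<int>(candidate);
            }
        }
        return -1;
    }

    // Models submitting a draw that reads the most recently mapped copy.
    void submitDraw() { inUseByGpu_[current_] = true; }

    // Models the GPU finishing its reads from the given copy.
    void gpuFinished(std::size_t index) { inUseByGpu_[index] = false; }

private:
    std::vector<bool> inUseByGpu_;
    std::size_t current_ = 0;
};
```

With two copies, two map-and-draw rounds succeed without waiting, and a third `mapDiscard()` returns -1 until `gpuFinished()` releases one of them. That is the point at which a write would start to stall as well.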


Also, when you say that the CPU "sits around waiting for the GPU to execute all pending commands", does that truly mean that all dx commands queued up for that frame have to be executed before a buffer can be read, or does it mean that only commands involving the particular append buffer to be read have to be waited for?


That would depend on the driver I suppose. I couldn't answer that for sure.
Do you actually need to have the number of vertices available on the CPU? If you could use a DrawIndirect call instead (DrawInstancedIndirect in D3D11), you wouldn't need to read back the buffer count on the CPU and you would avoid the sync.
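For reference, the arguments that DrawInstancedIndirect reads from its buffer are four 32-bit values matching the parameters of DrawInstanced. The sketch below models that layout on the CPU to show where the count would land. It is not Direct3D code: `DrawInstancedArgs` and `copyCountInto` are hypothetical names, and `copyCountInto` merely stands in for the GPU-side CopyStructureCount(pDstBuffer, DstAlignedByteOffset, pSrcView) call. In D3D11 the real args buffer would also need the D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS flag.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// CPU-side model of the indirect draw arguments, NOT Direct3D code.
// The struct name is hypothetical; the field order follows DrawInstanced.
struct DrawInstancedArgs {
    std::uint32_t vertexCountPerInstance;
    std::uint32_t instanceCount;
    std::uint32_t startVertexLocation;
    std::uint32_t startInstanceLocation;
};
static_assert(sizeof(DrawInstancedArgs) == 16,
              "args must be four packed 32-bit values");

// Stand-in for the GPU copying an append buffer's hidden counter to a byte
// offset inside the args buffer. On real hardware this happens on the GPU
// timeline, so the CPU never sees the value and never has to wait for it.
void copyCountInto(DrawInstancedArgs& args, std::uint32_t byteOffset,
                   std::uint32_t count) {
    assert(byteOffset % 4 == 0);
    assert(byteOffset + sizeof(count) <= sizeof(args));
    std::memcpy(reinterpret_cast<unsigned char*>(&args) + byteOffset, &count,
                sizeof(count));
}
```

If each appended element is one vertex, the count would go to offset 0 (the vertex count) with the instance count preset to 1. If each element is, say, a tile expanded to a fixed number of vertices, one option is to preset the vertex count and copy the counter to offset 4 (the instance count) instead.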
Thanks for all your replies -- a great help.

MJP: thanks for clarifying about the app reading gpu resources and the effect it has. I guess this means that programmers avoid reading the gpu in an app if possible, because such an app can't have the cpu working many many frames ahead of the gpu. I suppose it does kind of make the cpu and gpu stuck to each other for every single frame, and means they can't operate independently.

Also, with the driver using buffer renaming, I guess also there's no point in an app multi-buffering its dynamic buffers, because it's already done for it?

About the primitive count and why it's important. In my app, the compute shader generates a variable amount of primitives. Variable, because it's creating water tiles and each chunk of terrain has a variable amount of water tiles. On top of that, the amount of water tiles for each chunk changes throughout the game, based on water physics and other factors. So regardless of whether I use DrawIndirect or Draw, I think I still need to know the amount of water tiles in order to render them, either through reading how many tiles the compute shader made, or by the app keeping track of each chunk's amount of water tiles and updating those counts when the water behaviour changes. Keeping track is difficult, because terrain data is duplicated in video ram, and is updated from the main ram version only when there's a change. But I can and probably will maintain such a tile count, even though it'll be a bit of a pain.

Anyway, thought I'd say why reading the gpu would simplify the code so much. But I'm now persuaded it's probably not a good idea!

