Paul__

DX11 Map/unmap, CopyStructureCount and slow down


Hey all,

Profiling has shown that there's a massive slowdown at one point in my game app.

In each frame, I use a compute shader to create vertices, which are written to a default-usage append buffer. The code then reads the number of vertices written by the compute shader with CopyStructureCount(). The target buffer for CopyStructureCount() is a four-byte D3D11_USAGE_STAGING buffer created with D3D11_CPU_ACCESS_READ. My app then calls Map() -> memcpy() -> Unmap(). This last step stalls the CPU for 4 ms and the GPU for 1 ms.
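To make it concrete, the readback path looks roughly like this (a simplified sketch with placeholder names, not my exact code):

```cpp
#include <d3d11.h>
#include <cstring>

// Simplified sketch of the readback path (placeholder names).
// stagingBuf: 4-byte buffer, D3D11_USAGE_STAGING, D3D11_CPU_ACCESS_READ, no bind flags.
// appendUAV:  the UAV of the default-usage append buffer the compute shader writes to.
UINT ReadAppendCount(ID3D11DeviceContext* context,
                     ID3D11UnorderedAccessView* appendUAV,
                     ID3D11Buffer* stagingBuf)
{
    // Ask the GPU to copy the append buffer's hidden counter into the staging buffer.
    context->CopyStructureCount(stagingBuf, 0, appendUAV);

    // Mapping the staging buffer for read is where the stall shows up.
    UINT count = 0;
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(context->Map(stagingBuf, 0, D3D11_MAP_READ, 0, &mapped)))
    {
        memcpy(&count, mapped.pData, sizeof(UINT));
        context->Unmap(stagingBuf, 0);
    }
    return count;
}
```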

Without the staging buffer's Map()/Unmap() call, the other DX calls and the app in general seem to take the right amount of time.

It's possible for me to calculate from the game data how many verts should be written, and therefore avoid calling CopyStructureCount(). But it's a huge headache, involving tracking lots of data that I otherwise wouldn't need to track.

The length of the pause is directly related to the length of the compute shader dispatch: the more vertices there are to create, the longer the pause. It seems likely the CPU is waiting for it to finish.

Now, I know that with some DX calls the CPU is forced to wait for the GPU, because the GPU is still using the resource in question. But why does the GPU pause too? And surely double buffering won't help, because the *same* frame needs to know how many primitives to draw in the soon-to-follow Draw() call?

Any other suggestions? I'm sort of guessing here, but could I swap the order of each frame? Maybe something like this (rough sketch after the list):
- <Frame starts>
- Get the struct count from last frame
- Draw the verts
- Generate the next frame's verts
- Present
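In code, maybe something like the following (a very rough sketch with made-up names). The count mapped at the start of frame N was copied at the end of frame N-1, so it matches the verts that frame N-1's dispatch left in the append buffer, which are exactly the verts drawn below. The Map() could still wait if the GPU is running more than a frame behind the CPU.

```cpp
#include <d3d11.h>
#include <cstring>

// Very rough sketch of the reordered frame (made-up names).
void Frame(ID3D11DeviceContext* ctx,
           ID3D11UnorderedAccessView* appendUAV,
           ID3D11Buffer* stagingCount)
{
    // 1. Get the struct count from last frame (copied at the end of frame N-1).
    UINT vertexCount = 0;
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(ctx->Map(stagingCount, 0, D3D11_MAP_READ, 0, &mapped)))
    {
        memcpy(&vertexCount, mapped.pData, sizeof(UINT));
        ctx->Unmap(stagingCount, 0);
    }

    // 2. Draw the verts that last frame's dispatch wrote.
    // ... ctx->Draw(vertexCount, 0); ...

    // 3. Generate the next frame's verts, then queue the counter copy
    //    so it can be read back at the start of the next frame.
    // ... ctx->Dispatch(...); ...
    ctx->CopyStructureCount(stagingCount, 0, appendUAV);

    // 4. Present (on the swap chain, not shown here).
}
```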

It's very hard to get *general* info about dx11 and the temporal relationship between the gpu and cpu, so any experienced help would be great!

Normally the CPU and GPU work asynchronously, with the CPU submitting commands way ahead of when the GPU actually executes them. When you read back a value on the CPU (which is what you're doing with the staging buffer), you force a sync point where the CPU flushes the command buffer and then sits around waiting for the GPU to execute all pending commands. The amount of time it has to wait depends on the number of pending commands and how long they take to execute, which means it could potentially get much worse as your frames get more complex. I'm not sure how you're determining that the GPU is "pausing", but I would doubt that is the actual case.

Swapping the order can potentially help, if you can keep the CPU busy enough to absorb some of the GPU latency.

Thanks for your answer. I'm not sure I can really reorganise the way a frame is structured. Which means I might have to go the hard way and maintain counts of all the geometry, rather than reading the count from the append buffer. Damn!

So just to clarify: when an app *reads* a GPU buffer using Map()/Unmap(), will that *always* cause the CPU to wait for the GPU? Compared to when an app *writes* to a dynamic buffer, which doesn't always cause the CPU to wait (I guess because under the hood DX maintains multiple buffers for dynamic writes).

Also, when you say that the CPU "sits around waiting for the GPU to execute all pending commands", does that truly mean that all dx commands queued up for that frame have to be executed before a buffer can be read, or does it mean that only commands involving the particular append buffer to be read have to be waited for?

I'm using dx queries to time the gpu. I could well have made a mistake though!
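For reference, my timing is roughly the standard disjoint/timestamp query pattern (a sketch, not my exact code). One thing I should double-check: reading the query results back in the same frame is itself a readback, so the timing code forces its own wait on the GPU.

```cpp
#include <d3d11.h>

// Roughly the standard GPU timing pattern with timestamp queries (a sketch).
double TimeGpuWork(ID3D11Device* device, ID3D11DeviceContext* ctx)
{
    D3D11_QUERY_DESC qd = {};
    ID3D11Query *disjointQ = nullptr, *startQ = nullptr, *endQ = nullptr;

    qd.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
    device->CreateQuery(&qd, &disjointQ);
    qd.Query = D3D11_QUERY_TIMESTAMP;
    device->CreateQuery(&qd, &startQ);
    device->CreateQuery(&qd, &endQ);

    ctx->Begin(disjointQ);
    ctx->End(startQ);                 // timestamp before the work
    // ... Dispatch() / CopyStructureCount() / Map() etc. go here ...
    ctx->End(endQ);                   // timestamp after the work
    ctx->End(disjointQ);

    // Spin until the GPU has produced the results (this itself waits on the GPU).
    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT disjoint = {};
    while (ctx->GetData(disjointQ, &disjoint, sizeof(disjoint), 0) == S_FALSE) {}
    UINT64 t0 = 0, t1 = 0;
    while (ctx->GetData(startQ, &t0, sizeof(t0), 0) == S_FALSE) {}
    while (ctx->GetData(endQ, &t1, sizeof(t1), 0) == S_FALSE) {}

    disjointQ->Release(); startQ->Release(); endQ->Release();

    if (disjoint.Disjoint)
        return 0.0;                   // timings unreliable this frame
    return double(t1 - t0) / double(disjoint.Frequency) * 1000.0;   // milliseconds
}
```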

Thanks again.
Paul

Until the GPU has executed the instruction queue, the data that you are going to read doesn't exist.

However, you don't have to wait for read access to resources that are not used as targets for the currently running operations.

It is best to think about the GPU as a remote machine to which you send requests, and from which you can then download the responses (if you need them). It actually is a remote machine, even though the physical distance from the CPU isn't usually very long.

[quote name='Nik02' timestamp='1337259480' post='4940920']
It is best to think about the GPU as a remote machine to which you send requests, and from which you can then download the responses (if you need them). It actually is a remote machine, even though the physical distance from the CPU isn't usually very long.
[/quote]
That's quite a good analogy.

Another that may work is that it's like sending radio signals to the moon. Assuming the speed of light, a signal will arrive in about 2 seconds. If all you're doing is sending signals you can just send them as fast as you possibly can - one signal every millisecond if you so wish. However, if at any point you need to wait on a response before you can send the next signal you've a 2 second wait for the signal to reach the moon, an unknown amount of time while it's being processed and acted on there, and another 2 seconds before it can get back to you. During this time you're sitting there doing nothing; you can't send the next signal until you get the response.

[quote name='Paul__' timestamp='1337241494' post='4940863']
So just to clarify: when an app *reads* a GPU buffer using Map()/Unmap(), will that *always* cause the CPU to wait for the GPU?
[/quote]

Yup. The data you need doesn't exist until the GPU actually writes it, which means that the command that writes the data (and all of the dependent commands before it) has to be executed before the data is available for readback.
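One thing you could consider, if a count that's a frame or so stale is acceptable: Map the staging buffer with D3D11_MAP_FLAG_DO_NOT_WAIT, which returns DXGI_ERROR_WAS_STILL_DRAWING instead of blocking when the GPU hasn't produced the data yet, so you can keep using the previous value. A sketch with placeholder names:

```cpp
#include <d3d11.h>
#include <cstring>

// Sketch: non-blocking readback. Returns false (and leaves *outCount untouched)
// if the GPU hasn't finished writing the staging buffer yet.
bool TryReadCount(ID3D11DeviceContext* ctx, ID3D11Buffer* stagingBuf, UINT* outCount)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    HRESULT hr = ctx->Map(stagingBuf, 0, D3D11_MAP_READ,
                          D3D11_MAP_FLAG_DO_NOT_WAIT, &mapped);
    if (hr == DXGI_ERROR_WAS_STILL_DRAWING)
        return false;                    // data not ready; reuse the old count
    if (FAILED(hr))
        return false;

    memcpy(outCount, mapped.pData, sizeof(UINT));
    ctx->Unmap(stagingBuf, 0);
    return true;
}
```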

[quote name='Paul__' timestamp='1337241494' post='4940863']
Compared to when an app *writes* to a dynamic buffer, which doesn't always cause the CPU to wait (I guess because under the hood DX maintains multiple buffers for dynamic writes).
[/quote]

Indeed, the driver can transparently swap through multiple buffers using a technique known as buffer renaming. This allows the CPU to write to one buffer while the GPU is currently reading from a different buffer.
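For the write case, the usual pattern is something like this (a sketch with placeholder names, not your actual code):

```cpp
#include <d3d11.h>
#include <cstring>

// Sketch of a dynamic-buffer write. D3D11_MAP_WRITE_DISCARD tells the driver the
// old contents can be thrown away, so it is free to hand back a freshly "renamed"
// allocation immediately instead of waiting for the GPU to finish with the old one.
void UpdateDynamicBuffer(ID3D11DeviceContext* ctx, ID3D11Buffer* dynamicBuf,
                         const void* data, size_t bytes)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(ctx->Map(dynamicBuf, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        memcpy(mapped.pData, data, bytes);
        ctx->Unmap(dynamicBuf, 0);
    }
}
```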

[quote name='Paul__' timestamp='1337241494' post='4940863']
Also, when you say that the CPU "sits around waiting for the GPU to execute all pending commands", does that truly mean that all dx commands queued up for that frame have to be executed before a buffer can be read, or does it mean that only commands involving the particular append buffer to be read have to be waited for?
[/quote]

That would depend on the driver I suppose. I couldn't answer that for sure.

Do you actually need to have the number of vertices available on the CPU? If you could use a DrawIndirect call instead, you wouldn't need to read back the buffer count on the CPU and you would avoid the sync.
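Roughly, something like this (a sketch with placeholder names; it assumes the vertex shader fetches the generated vertices from the structured buffer via SV_VertexID rather than through the input assembler):

```cpp
#include <d3d11.h>

// Sketch of the indirect-draw route. The args buffer layout for DrawInstancedIndirect
// is four UINTs: { VertexCountPerInstance, InstanceCount, StartVertexLocation,
// StartInstanceLocation }. The vertex count never has to touch the CPU.

// Created once: pre-initialised to { 0, 1, 0, 0 }, i.e. one instance,
// vertex count filled in per frame by CopyStructureCount().
ID3D11Buffer* CreateArgsBuffer(ID3D11Device* device)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = 4 * sizeof(UINT);
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.MiscFlags = D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS;

    const UINT initial[4] = { 0, 1, 0, 0 };
    D3D11_SUBRESOURCE_DATA init = { initial, 0, 0 };

    ID3D11Buffer* argsBuffer = nullptr;
    device->CreateBuffer(&desc, &init, &argsBuffer);
    return argsBuffer;
}

// Per frame: copy the append buffer's counter straight into the args buffer
// (offset 0 = VertexCountPerInstance), then draw. The GPU reads the count itself
// when it executes the draw, so no CPU/GPU sync point is created.
void DrawGeneratedVerts(ID3D11DeviceContext* ctx,
                        ID3D11UnorderedAccessView* appendUAV,
                        ID3D11Buffer* argsBuffer)
{
    ctx->CopyStructureCount(argsBuffer, 0, appendUAV);
    ctx->DrawInstancedIndirect(argsBuffer, 0);
}
```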

Thanks for all your replies -- a great help.

MJP: thanks for clarifying what happens when the app reads GPU resources and the effect it has. I guess this means that programmers avoid reading back from the GPU if at all possible, because such an app can't have the CPU working many frames ahead of the GPU. I suppose it effectively ties the CPU and GPU together for every single frame, and means they can't operate independently.

Also, with the driver using buffer renaming, I guess there's no point in an app multi-buffering its dynamic buffers itself, because that's already done for it?

About the primitive count and why it's important. In my app, the compute shader generates a variable number of primitives: it's creating water tiles, and each chunk of terrain has a variable number of water tiles. On top of that, the number of water tiles for each chunk changes throughout the game, based on water physics and other factors. So regardless of whether I use DrawIndirect or Draw, I think I still need to know the number of water tiles in order to render them, either by reading back how many tiles the compute shader made, or by having the app keep track of each chunk's water tile count and update those counts whenever the water behaviour changes. Keeping track is difficult, because the terrain data is duplicated in video RAM and is updated from the main RAM version only when there's a change. But I can and probably will maintain such a tile count, even though it'll be a bit of a pain.

Anyway, I thought I'd explain why reading back from the GPU would simplify the code so much. But I'm now persuaded it's probably not a good idea!
