hiya83

DX11 Update textures every frame


Hi,

 

I am trying to upload multiple (say n) BC7 textures to the GPU each frame. The data is read from the CPU every frame and there is no way around this, so I am trying to minimize the upload time as much as possible. I was wondering if anyone has any insights beyond what I've already tried:

 

- each texture is dynamic, with 2 copies (2n textures in total); I alternate between a CPU-mapped copy (D3D11_MAP_WRITE_DISCARD) to copy data into and an unmapped copy the GPU renders from

- each texture has 2 corresponding resources, a default and a staging version (2n staging, 2n default); I map the staging one with D3D11_MAP_WRITE and CopyResource (n times each) from staging to default

- a staging and a default Texture2DArray (array size = n; 2 staging, 2 default); I Map with D3D11_MAP_WRITE once per frame on the staging array, Unmap once, and call CopyResource once

- I also want to try 3D textures, but the 2048x2048x2048 limit means I can't use them.

 

All of these take approximately the same time. Does anyone have thoughts on how I can hide/reduce this time?

I am aware the GPU has separate compute/copy/3D engines (exposed in D3D12), but is there any way on D3D11 to move whatever Unmap/CopyResource is doing onto a separate engine from the 3D engine? If not, any suggestions or thoughts?

 

Thanks


Copying textures every frame from CPU to GPU memory will be bottlenecked by bus bandwidth, so check the bandwidth of your target platform (e.g. PCIe) and do some theorycrafting about how many times you could theoretically transfer your textures from CPU memory to GPU memory. If that turns out to be an issue, try to rethink your approach.

 

Data transfer will use DMA most of the time, so you can hide the transfer cost (i.e. avoid stalling your pipeline) if you can live with a delay of one or two frames. If so, look into double/triple buffering.

 

Also consider reducing the amount of data transferred: update only parts, use some compression, or even do packing/unpacking.

 

Why is 2048x2048x2048 limiting? Do you need larger textures? I mean, 2k^3 is ~32GB for an RGBA texture without mipmaps.
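As a rough check on that figure, the size arithmetic can be sketched in C++. The helper names are made up for illustration. The only facts assumed are that RGBA8 uses 4 bytes per texel and that BC7 stores each 4x4 texel block in 16 bytes, with every depth slice of a volume texture compressed as its own 2D grid of blocks.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration; not part of any D3D header.

// Bytes for one mip level of an uncompressed volume texture.
inline std::uint64_t uncompressedBytes(std::uint64_t width, std::uint64_t height,
                                       std::uint64_t depth, std::uint64_t bytesPerTexel) {
    assert(bytesPerTexel > 0);
    return width * height * depth * bytesPerTexel;
}

// Bytes for one mip level of a BC7 volume texture. BC7 packs every 4x4 texel
// block of a slice into 16 bytes, so partial blocks at the edges round up.
inline std::uint64_t bc7Bytes(std::uint64_t width, std::uint64_t height,
                              std::uint64_t depth) {
    const std::uint64_t blocksX = (width + 3) / 4;
    const std::uint64_t blocksY = (height + 3) / 4;
    return blocksX * blocksY * 16 * depth;
}
```

With these, `uncompressedBytes(2048, 2048, 2048, 4)` comes to 32 GiB and `bc7Bytes(2048, 2048, 2048)` to 8 GiB, so even the compressed volume would not fit on most cards.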

Edited by Ashaman73


I am not sure PCIe is the problem: I have maybe 40MB per frame (with PCIe 3.0 x16 at 32GB/s), and I am already double buffering (with a frame of delay) to hide the memcpy operation. However, thinking about it some more, with the staging/default approach the time is not in the Unmap but in the CopyResource. Does D3D11's CopyResource automatically use the copy engine and avoid stalling the 3D engine (if there is no dependency)? Or would I have to use D3D12 for that? I guess I'll have to test that with triple buffering and 2 frames of delay. :D
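For a back-of-the-envelope check of those numbers, here is a minimal sketch. The helper name is hypothetical and the model ignores protocol overhead, driver work and contention. One caveat worth hedging: as far as I know, the 32GB/s figure for PCIe 3.0 x16 counts both directions together, and a single direction (CPU to GPU) is closer to 15.75GB/s.

```cpp
#include <cassert>

// Hypothetical helper: milliseconds needed to move `megabytes` (10^6 bytes)
// over a link sustaining `gigabytesPerSecond` (10^9 bytes/s). This is an
// idealized lower bound, not a prediction of real driver behavior.
inline double transferMs(double megabytes, double gigabytesPerSecond) {
    assert(gigabytesPerSecond > 0.0);
    return megabytes / gigabytesPerSecond; // MB / (GB/s) works out to ms
}
```

`transferMs(40.0, 32.0)` gives 1.25 ms, while `transferMs(40.0, 15.75)` gives roughly 2.5 ms, so the bus alone can account for a noticeable slice of a 16.7 ms frame at 60 Hz.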

 

2048^3 is limiting because my widths are > 2048 (height and depth are fine).

Edited by hiya83


D3D11 has no concept of a "copy engine", so the driver is free to implement CopyResource however it wants as long as it has the correct behavior. It might implement it with an asynchronous DMA; it might not. It might even be doing the same thing for all 3 of your approaches.

 

When you say that all of your approaches are "approximately the same times", what do you mean by that? Are you measuring CPU timing? GPU timing?


I am measuring GPU times. 

 

- in the dynamic case, I put GPU ticks around Unmap

- in the default & staging case, Unmap doesn't take time; CopyResource is where the time goes

- in the default/staging case with Texture2DArray, same as the 2nd case.


Quote: "- in default & staging case, unmap doesn't take time, but CopyResource is where the time is"

Unmap will only trigger the upload, which, when done with DMA, will not involve the GPU. But CopyResource, when it tries to access the memory block, will spend time in

1. waiting until the data has been uploaded (=>stalling your pipeline)

2. actually copying your data

 

To measure the first delay, try to use a fence and measure the time spent waiting for it:

unmap buffer A -> fence A ->... -> start GPU timer -> wait for fence B -> end GPU timer -> CopyResource ->... ->  unmap buffer B -> fence B ->...



Quote: "I also want to try 3d textures, but the limitation of 2048x2048x2048 means i can't use it"

Perhaps this is not a relevant suggestion, but is it possible to use a texture array instead of one big volume texture?

Edited by vanka78bg


How much time exactly is your 40MiB copy operation currently taking with any of your methods?

Quote: "2048^3 is limiting cause my widths are > 2048 (height and depth are fine)."

Well, another limit is that you'd need a video card with over 8GiB of RAM, which pretty much limits your min-spec hardware to the US$999 GeForce GTX Titan X :wink: :lol:


Quote: "Unmap will only trigger the upload, which, when done with DMA, will not involve the GPU. But CopyResource, when you try to access the memory block, will spent time in"

Quote: "To measure the first delay try to use some fence and try to measure the time spend in waiting for the fence:"

Does that mean Unmap for dynamic textures internally triggers some sort of copy from CPU-accessible GPU memory to default GPU memory, since it takes about the same time as Unmap/CopyResource for staging/default textures?

Also, a possibly dumb question: how would you set up a memory fence in DX from the CPU? There is no query for that, and everything seems to be implicit...

 

 

Quote: "Perhaps this is not a relevant suggestion, but is it possible to use a texture array instead of one big volume texture?"

I did try texture arrays already; that was the 3rd thing I tried in my original post. Sorry if it was misleading.

 

 

Quote: "How much time exactly is your 40MiB copy operation currently taking with any of your methods?"

Quote: "Well, another limit is that you'd need a video card with over 8GiB of RAM, which pretty much limits your min-spec hardware to the US$999 GeForce GTX Titan X :wink: :lol:"

Hey, sorry, but I'm not sure what you mean by how long the 40MB copy operation is taking. If you mean the methods I've tried above, they are all in the upper 3 ms ballpark (3.6 - 3.9).

Yeah, I am aware of the large video memory requirement. I am working on other forms of compression as well, but I just want to get this down with BC7 for now :D

Edited by hiya83


So I tried the triple buffering approach, hoping CopyResource would be an async DMA, but it still stalls the GPU commands. :(

 

Also, since the D3D11 device is free-threaded, I tried something really "dumb": creating another thread that just keeps deleting old textures and creating new ones (with new content), hoping that texture creation/deletion is asynchronous to the GPU graphics engine. That plan fell flat as well. Even though the device, unlike the context, is free-threaded, creating/deleting resources apparently still runs in the same pipeline as the context commands. :(

 

Any other thoughts/ideas would be appreciated.

Have you tried UpdateSubresource from a CPU memory pointer? In certain very specific circumstances I've found this efficient, despite the dire warnings about it in the documentation & elsewhere, because it will manage resource contention automatically for you, which is where I suspect your primary bottleneck is.


Have you tried uploading less data? Depending on what your data looks like, you could compute dirty regions on the CPU and only upload that data (potentially via UpdateSubresource as called out above). Is your data really changing all over the place, non-uniformly, every frame?
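To make the dirty-region idea concrete, here is a minimal sketch that diffs two tightly packed BC7 images on the CPU and returns the bounding rectangle of the changed blocks. All names are hypothetical, and a single bounding rectangle is only one possible policy (several smaller rectangles may upload less).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical types and names for illustration only.
struct BlockRect {            // half-open rectangle, measured in 4x4 blocks
    std::uint32_t left, top, right, bottom;
    bool empty() const { return left >= right || top >= bottom; }
};

constexpr std::size_t kBc7BlockBytes = 16;

// Compares two BC7 images (tightly packed, blocksX * blocksY blocks each) and
// returns the smallest rectangle of blocks containing every changed block.
inline BlockRect dirtyBlockRect(const std::vector<std::uint8_t>& previous,
                                const std::vector<std::uint8_t>& current,
                                std::uint32_t blocksX, std::uint32_t blocksY) {
    const std::size_t expected = std::size_t(blocksX) * blocksY * kBc7BlockBytes;
    assert(previous.size() == expected && current.size() == expected);

    BlockRect r{blocksX, blocksY, 0, 0}; // starts out empty
    for (std::uint32_t y = 0; y < blocksY; ++y) {
        for (std::uint32_t x = 0; x < blocksX; ++x) {
            const std::size_t offset = (std::size_t(y) * blocksX + x) * kBc7BlockBytes;
            if (std::memcmp(&previous[offset], &current[offset], kBc7BlockBytes) != 0) {
                r.left = std::min(r.left, x);
                r.top = std::min(r.top, y);
                r.right = std::max(r.right, x + 1);
                r.bottom = std::max(r.bottom, y + 1);
            }
        }
    }
    return r;
}
```

Multiplying the rectangle by 4 gives texel coordinates that could fill a D3D11_BOX (left, top, front, right, bottom, back) for UpdateSubresource or CopySubresourceRegion. As far as I know, boxes on block-compressed resources need to be aligned to the 4x4 block size, which is why the sketch works in whole blocks.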


I've also been looking into this for days.  My use case is slightly different:  I'm writing a video application and an external source is decoding the video, leaving me with a 4K RGBA texture.  I need to display this texture in my 3D App (it's Unity, but I'm writing a native plug-in which means I'm using DX11).

 

I'm always getting hitches, no matter what I do.  The worst case is an Intel HD 4600 which can take up to 25ms just to upload a 1080p texture.  As Ashaman73 has mentioned, bus bandwidth is probably playing a large role in this.

 

I'm using the normally advocated method: use a DYNAMIC texture, write to it, then CopyResource into the real texture. Here's an article where someone has gone through all of the scenarios and benchmarked them: https://eatplayhate.me/2013/09/29/d3d11-texture-update-costs/.

 

My problem is that even the memcpy() of a 1080p RGBA texture into Map()'d memory takes a really long time (5+ms), so when I get up to 4K it's substantial.  What I could really use, I think, is a way to begin this copy process asynchronously.  Right now the copy blocks the GPU thread (since you must Map()/Unmap() on GPU thread, I'm also generally doing my memcpy there).

 

I've read this may be possible in OpenGL with some kind of PixelBufferObject? Is there anything like this in DirectX? I haven't tried reverting my code to UpdateSubresource for this case, but are there any other suggestions?



Quote: "My problem is that even the memcpy() of a 1080p RGBA texture into Map()'d memory takes a really long time (5+ms), so when I get up to 4K it's substantial. What I could really use, I think, is a way to begin this copy process asynchronously. Right now the copy blocks the GPU thread (since you must Map()/Unmap() on GPU thread, I'm also generally doing my memcpy there)."

To be honest, I am more familiar with OGL, so some DX11 expert should have better tips.

 

For one, once the memory is mapped, you can access it from any other thread; just avoid calling API functions from multiple threads. The basic setup for a memory-to-buffer copy could be:

  1. GPU thread: map buffer A
  2. Worker thread: decode video frame into buffer A
  3. GPU thread: when decoded, unmap buffer A

This will most likely trigger an asynchronous upload from CPU to GPU memory, or might do nothing if DX11 decides to keep the texture in CPU memory for now (shared memory on the HD 4600?).
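The worker-thread step above boils down to a plain memory copy into the mapped pointer, which can be sketched without any D3D calls. The function name is hypothetical. In D3D11 the destination pointer and pitch would come from the pData and RowPitch members of D3D11_MAPPED_SUBRESOURCE, and RowPitch may be larger than the tightly packed row size, so the copy goes row by row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies rows [firstRow, lastRow) of an image from src to
// dst, where each side has its own pitch (bytes between the starts of rows).
// It only touches memory, so a worker thread can run it on the mapped pointer
// while the render thread keeps making the actual Map/Unmap API calls.
inline void copyRows(std::uint8_t* dst, std::size_t dstPitch,
                     const std::uint8_t* src, std::size_t srcPitch,
                     std::size_t rowBytes,
                     std::size_t firstRow, std::size_t lastRow) {
    assert(dstPitch >= rowBytes && srcPitch >= rowBytes);
    for (std::size_t y = firstRow; y < lastRow; ++y) {
        std::memcpy(dst + y * dstPitch, src + y * srcPitch, rowBytes);
    }
}
```

Because each call handles an independent range of rows, several workers could each take a slice of the image, provided they all finish before the render thread calls Unmap. For BC7 data a "row" would be one row of 4x4 blocks rather than one row of texels.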

 

The next issue is when to access the buffer. If you access it too early, e.g. by copying the buffer content to the target texture, the asynchronous upload will suddenly turn into a synchronous stall of your rendering pipeline. So I would test using multiple buffers, 3 at least. This kind of delay should not be critical for displaying a video.
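One way to organize the "3 buffers at least" idea is a small ring that refuses to hand out a buffer the GPU may still be reading. The sketch below is only bookkeeping with hypothetical names and makes no D3D calls. The `gpuDone` callback stands in for a fence check; in D3D11 one option is an ID3D11Query of type D3D11_QUERY_EVENT that is End()-ed after the copy and polled with GetData(), which returns S_OK once the GPU has passed that point.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical bookkeeping for N staging buffers.
class UploadRing {
public:
    UploadRing(std::size_t slots, std::function<bool(std::size_t)> gpuDone)
        : inFlight_(slots, false), gpuDone_(std::move(gpuDone)) {
        assert(slots > 0);
    }

    // On success, writes a slot that is safe to Map and fill into `slot`.
    // Returns false if the GPU has not finished with the oldest buffer yet;
    // the caller can then skip or postpone the upload instead of stalling.
    bool tryAcquire(std::size_t& slot) {
        const std::size_t candidate = next_;
        if (inFlight_[candidate] && !gpuDone_(candidate)) return false;
        inFlight_[candidate] = false;
        next_ = (next_ + 1) % inFlight_.size();
        slot = candidate;
        return true;
    }

    // Call after Unmap + CopyResource have been issued for `slot` and the
    // fence for that slot has been placed in the command stream.
    void submit(std::size_t slot) { inFlight_[slot] = true; }

private:
    std::vector<bool> inFlight_;
    std::function<bool(std::size_t)> gpuDone_;
    std::size_t next_ = 0;
};
```

With three slots the CPU can run up to two uploads ahead of the GPU before `tryAcquire` starts returning false, which matches the one or two frames of delay discussed earlier in the thread.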

 

Another option would be to look for a codec which can be decoded on the GPU. I'm not familiar with video codecs, but there might be one which allows you to use the GPU to decode it. In that case it could work like this:

  1. map buffer X
  2. copy delta frame (whatever) to buffer (much smaller than full frame)
  3. unmap buffer X
  4. fence X
  5. ..
  6. if(fence X has been reached) start decode shader (buffer->target texture)
  7. swap target texture with rendered texture


Quote: "I've read this may be possible in OpenGL with some kind of PixelBufferObject? Is there anything like this in DirectX? I haven't tried reverting my code to UpdateSubResource for this case, but are there any other suggestions?"


An OpenGL PBO is the equivalent of using two textures in D3D, either via CopyResource or CopySubresourceRegion.

 

To summarise, in OpenGL the workflow with a PBO is (1) map the PBO, (2) write data to it, (3) unmap the PBO and (4) update the texture via glTexImage2D/glTexSubImage2D.

 

The D3D equivalent is (1) map a staging resource, (2) write data to it, (3) unmap the staging resource, and (4) update the texture via CopyResource/CopySubresourceRegion.


Just a final update:  I got it working using the Ashaman73 approach:  Map / MemCopy / Unmap / CopyResource.  For a bit better performance I've added multi-threading for the Memcopy and fences at the Unmap and CopyResource stages to ensure I never touch the texture until it's ready (avoiding all stalls).  Performance went through the roof after enforcing no writes to the texture until the fence is finished.

 

I've talked with a few people who are much more familiar with the issue than I am, and they let me know that OpenGL does have a performance benefit because you don't have to unmap the texture when you perform the upload (you can leave it mapped, reducing some of the complexity and contention).  Another issue is that for 4K textures it's better to upload in a compressed format (for video like I'm doing, that's a YUV format as opposed to RGBA because it's about 1/2 the data depending on your encoding scheme).  You can then perform the final conversion via shaders (this saves the memory bandwidth and trades it for computation).
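To put rough numbers on the YUV point, here is a small sketch comparing per-frame sizes and showing the kind of per-pixel math a conversion shader would do. The names are hypothetical, and the conversion uses full-range BT.601 (JFIF) constants purely as an example; real video is often limited-range BT.709, which needs different constants.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration.

// Bytes per frame for three common layouts (width and height assumed even).
inline std::size_t rgba8Bytes(std::size_t w, std::size_t h) { return w * h * 4; }
inline std::size_t yuy2Bytes(std::size_t w, std::size_t h) { return w * h * 2; }     // 4:2:2
inline std::size_t nv12Bytes(std::size_t w, std::size_t h) { return w * h * 3 / 2; } // 4:2:0

struct Rgb { std::uint8_t r, g, b; };

// Full-range BT.601 (JFIF) YCbCr -> RGB for one pixel, written on the CPU
// here; a pixel shader would apply the same arithmetic per texel.
inline Rgb yCbCrToRgb(std::uint8_t y, std::uint8_t cb, std::uint8_t cr) {
    const float Y = y;
    const float Cb = cb - 128.0f;
    const float Cr = cr - 128.0f;
    auto clamp8 = [](float v) {
        // Clamp to [0, 255], then round to nearest.
        return static_cast<std::uint8_t>(std::min(255.0f, std::max(0.0f, v)) + 0.5f);
    };
    return { clamp8(Y + 1.402f * Cr),
             clamp8(Y - 0.344136f * Cb - 0.714136f * Cr),
             clamp8(Y + 1.772f * Cb) };
}
```

For a 3840x2160 frame this gives 33,177,600 bytes as RGBA8, 16,588,800 bytes as YUY2 and 12,441,600 bytes as NV12, which is where the "about 1/2 the data, depending on your encoding scheme" estimate comes from.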
