My problem is that even the memcpy() of a 1080p RGBA texture into Map()'d memory takes a really long time (5+ ms), so when I get up to 4K it's substantial. What I could really use, I think, is a way to begin this copy process asynchronously. Right now the copy blocks the GPU thread (since you must Map()/Unmap() on the GPU thread, I'm generally doing my memcpy there as well).
To be honest, I am more familiar with OGL, so some DX11 expert should have better tips.
For one, once the memory is mapped, you can access it from any other thread; just avoid calling API functions from multiple threads. The basic setup for a memory-to-buffer copy could be:
- GPU thread: map buffer A
- Worker thread: decode video frame into buffer A
- GPU thread: when decoded, unmap buffer A
This will most likely trigger an asynchronous upload from CPU to GPU memory, or it might do nothing if DX11 decides to keep the texture in CPU memory for now (shared memory on the HD4600?).
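The map / worker fill / unmap handoff above can be sketched in plain C++. This is only a rough sketch, not real D3D11 code: `MappedRegion`, `copy_frame` and `begin_fill` are names I made up, and the struct stands in for the `pData` / `RowPitch` pair that `Map()` returns in `D3D11_MAPPED_SUBRESOURCE`. It assumes a tightly packed RGBA source frame and copies row by row, because the mapped row pitch can be larger than width * 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <future>
#include <vector>

// Stand-in for what Map() hands back (pData and RowPitch in real code).
struct MappedRegion {
    std::uint8_t* data;
    std::size_t   row_pitch;  // bytes per row, may exceed width * 4
};

// Copy a tightly packed RGBA frame into the mapped region, one row at a time.
inline void copy_frame(const MappedRegion& dst, const std::uint8_t* src,
                       std::size_t width, std::size_t height) {
    const std::size_t row_bytes = width * 4;
    assert(dst.row_pitch >= row_bytes);
    for (std::size_t y = 0; y < height; ++y)
        std::memcpy(dst.data + y * dst.row_pitch, src + y * row_bytes, row_bytes);
}

// Start the copy on a worker thread. No graphics API call happens in here,
// only writes to memory that the GPU thread has already mapped.
inline std::future<void> begin_fill(MappedRegion dst, const std::uint8_t* src,
                                    std::size_t width, std::size_t height) {
    return std::async(std::launch::async, copy_frame, dst, src, width, height);
}
```

The GPU thread would keep the returned future, carry on rendering, and only call Unmap() once the future reports ready. The source frame has to stay alive until then.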
The next issue is when you access the buffer. If you access it too early, e.g. by copying the buffer content to the target texture, the asynchronous upload will suddenly turn into a synchronous one and stall your rendering pipeline. So I would try using multiple buffers, at least 3. The extra latency this adds should not be critical for displaying a video.
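To make the multiple-buffer idea concrete, here is a minimal sketch of the slot bookkeeping only, with no graphics calls. `StagingRing` is a hypothetical helper, not part of any API: each frame you map and fill the write slot, and copy to the target texture from the read slot, which is the buffer that was unmapped N - 1 frames ago and so has had the most time to finish uploading.

```cpp
#include <cassert>
#include <cstddef>

template <std::size_t N>
class StagingRing {
    static_assert(N >= 2, "need at least two buffers to decouple write and read");
public:
    // Slot the CPU should map and fill during this frame.
    std::size_t write_slot() const { return frame_ % N; }

    // False during the first N - 1 frames, while the ring is still filling up.
    bool has_read_slot() const { return frame_ + 1 >= N; }

    // Oldest filled slot: written N - 1 frames ago.
    std::size_t read_slot() const {
        assert(has_read_slot());
        return (frame_ + 1) % N;
    }

    // Call once at the end of every frame.
    void advance() { ++frame_; }

private:
    std::size_t frame_ = 0;
};
```

With `StagingRing<3>` the video runs two frames behind the decoder, which is the delay mentioned above.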
Another option would be to look for a codec which can be decoded on the GPU. I'm not familiar with video codecs, but there might be one which allows you to use the GPU to decode it. In this case it could work like this:
- map buffer X
- copy delta frame (whatever) to buffer (much smaller than full frame)
- unmap buffer X
- fence X
- if(fence X has been reached) start decode shader (buffer->target texture)
- swap target texture with rendered texture
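The steps above boil down to a per-frame poll that never blocks. Below is a rough sketch of that control flow in plain C++; `PendingFrame`, `FrameSlots` and `tick` are hypothetical names, and the two callbacks are placeholders for the real fence check and shader dispatch. I am assuming that on DX11 the fence would be an event query (`D3D11_QUERY_EVENT`) polled with `GetData()`, but I have not tested that part.

```cpp
#include <cassert>
#include <functional>
#include <utility>

struct PendingFrame {
    std::function<bool()>    upload_done;  // placeholder: polls the fence / event query
    std::function<void(int)> run_decode;   // placeholder: runs the decode shader into a texture
    bool decoded = false;
};

struct FrameSlots {
    int displayed = 0;  // index of the texture currently shown
    int target    = 1;  // index of the texture the decode shader writes to
};

// Call once per rendered frame. Returns true only on the frame where the
// decode was started and the textures were swapped.
inline bool tick(PendingFrame& f, FrameSlots& s) {
    assert(f.upload_done && f.run_decode);
    if (f.decoded || !f.upload_done())
        return false;                  // keep showing the previous texture
    f.run_decode(s.target);
    f.decoded = true;
    std::swap(s.displayed, s.target);
    return true;
}
```

As long as the fence has not been reached, rendering simply continues with the old texture, so the upload never stalls the pipeline.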