When you use UpdateSubresource, you rely on DX and the driver to schedule an asynchronous transfer. If the data to upload is too big (or you have already exhausted the internal scheduling queue), DX will stall. This is very bad for performance.
Because the pointer you provide to UpdateSubresource may be freed at any moment after the call returns, the DX runtime/driver can't assume it will still be valid by the time the async transfer actually happens, and thus needs to copy your data to a temporary internal buffer.
Therefore:
- Best case scenario: DX memcpys your data CPU->GPU, and at a later moment performs a GPU->GPU transfer. That's two memcpys.
- Common case scenario: DX memcpys your data CPU->CPU into a temporary buffer, and at a later moment performs a CPU->GPU transfer immediately followed by a GPU->GPU transfer. That's three memcpys.
- Worst case scenario: DX stalls, then memcpys CPU->GPU followed by GPU->GPU. That's two memcpys plus a stall.
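For reference, here's roughly what that path looks like at the call site (a minimal sketch; `ctx`, `defaultUsageBuffer` and `cpuData` are placeholder names):

```cpp
#include <d3d11.h>

void UploadViaUpdateSubresource(ID3D11DeviceContext* ctx,
                                ID3D11Buffer* defaultUsageBuffer,
                                const void* cpuData)
{
    // DX may memcpy cpuData into an internal temporary right here (the
    // common case above), because it can't assume the pointer outlives
    // the deferred GPU transfer.
    ctx->UpdateSubresource(defaultUsageBuffer, 0, nullptr, cpuData, 0, 0);
    // cpuData is safe to free or reuse as soon as this call returns.
}
```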
For data that is modified sporadically, or small data that gets uploaded often, this works well. For large amounts of data that need to be uploaded every frame, the two/three extra memcpys (+ potential stall) can hurt you badly. In those cases you use a USAGE_DYNAMIC buffer, where the GPU reads directly from your CPU-visible buffer.
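A minimal sketch of that alternative (the function names are illustrative; the creation flags are the standard ones for a mappable dynamic buffer):

```cpp
#include <d3d11.h>
#include <cstring>

ID3D11Buffer* CreateDynamicBuffer(ID3D11Device* device, UINT byteWidth)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth      = byteWidth;
    desc.Usage          = D3D11_USAGE_DYNAMIC;        // CPU-writable, GPU-readable
    desc.BindFlags      = D3D11_BIND_VERTEX_BUFFER;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
    ID3D11Buffer* buffer = nullptr;
    device->CreateBuffer(&desc, nullptr, &buffer);
    return buffer;
}

void UploadDynamic(ID3D11DeviceContext* ctx, ID3D11Buffer* buffer,
                   const void* cpuData, size_t bytes)
{
    D3D11_MAPPED_SUBRESOURCE mapped;
    if (SUCCEEDED(ctx->Map(buffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        // One memcpy, straight into CPU-visible memory the GPU reads from.
        memcpy(mapped.pData, cpuData, bytes);
        ctx->Unmap(buffer, 0);
    }
}
```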
Note that mapping with DISCARD (often associated with USAGE_DYNAMIC) has an internal memory limit before it stalls too (e.g. don't map more than 4 MB per frame using DISCARD on AMD drivers), which is why you should use NO_OVERWRITE as much as possible and issue a DISCARD only every now and then.
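The usual shape of that pattern is a ring buffer: append with NO_OVERWRITE until you run out of space, then DISCARD and start over. A hedged sketch (the `DynamicRing` bookkeeping is illustrative, not a fixed API; note that in D3D 11.0 NO_OVERWRITE is only allowed on vertex/index buffers, constant buffers need 11.1):

```cpp
#include <d3d11.h>
#include <cstring>
#include <cstdint>

struct DynamicRing
{
    ID3D11Buffer* mBuffer   = nullptr;  // USAGE_DYNAMIC, CPU_ACCESS_WRITE
    size_t        mCapacity = 0;        // total buffer size in bytes
    size_t        mOffset   = 0;        // next free byte

    // Returns the byte offset where `bytes` were written (assumes
    // bytes <= mCapacity), or SIZE_MAX on failure.
    size_t Write(ID3D11DeviceContext* ctx, const void* data, size_t bytes)
    {
        D3D11_MAP mapType = D3D11_MAP_WRITE_NO_OVERWRITE;
        if (mOffset + bytes > mCapacity)
        {
            // Out of space: DISCARD hands us a freshly renamed buffer
            // and we start writing from the beginning again.
            mapType = D3D11_MAP_WRITE_DISCARD;
            mOffset = 0;
        }
        D3D11_MAPPED_SUBRESOURCE mapped;
        if (FAILED(ctx->Map(mBuffer, 0, mapType, 0, &mapped)))
            return SIZE_MAX;
        memcpy(static_cast<char*>(mapped.pData) + mOffset, data, bytes);
        ctx->Unmap(mBuffer, 0);
        const size_t writtenAt = mOffset;
        mOffset += bytes;
        return writtenAt;
    }
};
```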
Of course, hardware is complex and nothing is as straightforward as it seems. On GCN hardware, writing your data to a staging buffer (or a USAGE_DYNAMIC buffer mapped with NO_OVERWRITE) and then issuing a CopySubresourceRegion to a USAGE_DEFAULT buffer can end up being faster than using a USAGE_DYNAMIC buffer directly, because GCN may perform the CPU->GPU transfer in the background using its DMA engines; by the time it's time to render, the data is already in GPU memory (which has much higher bandwidth than the PCIe bus).
But this trick works like crap on Intel hardware, because you're just adding extra memcpys in system memory (which has much lower bandwidth than a discrete GPU's memory, and is shared with the CPU).
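A sketch of that staging route, assuming a USAGE_STAGING buffer created with CPU_ACCESS_WRITE and a same-sized USAGE_DEFAULT buffer (all names are placeholders):

```cpp
#include <d3d11.h>
#include <cstring>

void UploadViaStaging(ID3D11DeviceContext* ctx,
                      ID3D11Buffer* staging,     // USAGE_STAGING, CPU_ACCESS_WRITE
                      ID3D11Buffer* defaultBuf,  // USAGE_DEFAULT, bound for rendering
                      const void* cpuData, size_t bytes)
{
    // CPU writes into CPU-visible staging memory...
    D3D11_MAPPED_SUBRESOURCE mapped;
    if (SUCCEEDED(ctx->Map(staging, 0, D3D11_MAP_WRITE, 0, &mapped)))
    {
        memcpy(mapped.pData, cpuData, bytes);
        ctx->Unmap(staging, 0);
    }
    // ...then the driver schedules the CPU->GPU transfer; on GCN this can
    // ride the DMA engines in the background while the GPU keeps rendering.
    // For buffers, D3D11_BOX left/right are byte offsets.
    D3D11_BOX srcBox = { 0, 0, 0, static_cast<UINT>(bytes), 1, 1 };
    ctx->CopySubresourceRegion(defaultBuf, 0, 0, 0, 0, staging, 0, &srcBox);
}
```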
The ideal is to abstract the buffer-mapping interface so you can switch strategies based on what's fastest on each piece of hardware.
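One way to pick between the two paths is by the adapter's PCI vendor ID (0x1002 is AMD, 0x8086 is Intel; the enum and the heuristic itself are illustrative, not a recipe):

```cpp
#include <d3d11.h>
#include <dxgi.h>

enum class UploadStrategy { MapDynamicDirectly, StagingPlusCopy };

UploadStrategy PickUploadStrategy(IDXGIAdapter* adapter)
{
    DXGI_ADAPTER_DESC desc = {};
    adapter->GetDesc(&desc);
    // Discrete GCN parts tend to prefer staging + copy; integrated Intel
    // parts prefer mapping the dynamic buffer directly.
    if (desc.VendorId == 0x1002)  // AMD
        return UploadStrategy::StagingPlusCopy;
    if (desc.VendorId == 0x8086)  // Intel
        return UploadStrategy::MapDynamicDirectly;
    return UploadStrategy::MapDynamicDirectly;  // conservative default
}
```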
Fun times.