Hi, I'm writing a shader that requires writing a large amount of data to a buffer. As I increase the size of the memory allocated for the buffer, the frame rate drops significantly, even though the number of writes does not increase and the geometry rendered and shader code remain exactly the same. I'm using a D3D11_QUERY_EVENT to measure the frame rate by calling ID3D11Device::CreateQuery before the call to ID3D11DeviceContext::DrawIndexed and then immediately calling ID3D11DeviceContext::End. I then have the CPU spin in a while loop until ID3D11DeviceContext::GetData no longer returns S_FALSE. Is this the correct way to measure how much time a draw call requires to finish? Or is it somehow capturing the time it takes to allocate the memory as well? If the timer isn't wrong, then what is causing the reduction in performance? I'm using structured buffers created with D3D11_USAGE_DEFAULT and no CPU access, and I have a UAV to these buffers.
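In case it helps, here's a minimal sketch of the timing pattern I'm describing. `device` and `context` are assumed to be a valid ID3D11Device and immediate-context pair, and all draw state is assumed to be bound already:

```cpp
// Sketch of timing a draw with a D3D11_QUERY_EVENT fence.
// Note: this measures CPU wall time until the GPU passes the fence,
// which includes any earlier work still queued in the command buffer.
#include <d3d11.h>
#include <chrono>

double TimeDrawCallMs(ID3D11Device* device, ID3D11DeviceContext* context,
                      UINT indexCount)
{
    D3D11_QUERY_DESC desc = {};
    desc.Query = D3D11_QUERY_EVENT;

    ID3D11Query* query = nullptr;
    device->CreateQuery(&desc, &query);

    auto start = std::chrono::high_resolution_clock::now();

    context->DrawIndexed(indexCount, 0, 0);
    context->End(query); // event queries have no Begin; End marks the fence point

    // Spin on the CPU until the GPU has reached the fence.
    BOOL done = FALSE;
    while (context->GetData(query, &done, sizeof(done), 0) == S_FALSE) {}

    auto stop = std::chrono::high_resolution_clock::now();
    query->Release();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```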
After some more testing, I noticed that the first few frames after program startup are much slower than the frames afterwards, so I suspected that the memory was still being allocated or paged in during this time. I tried putting the CPU to sleep for a couple of seconds before invoking the draw command to let the GPU catch up, but that didn't change anything. I also tried making the CPU loop a billion times, but that was also to no avail. On a hunch that the GPU needed to render a few frames before it got up to speed, I wrote a simple shader that just outputs a single color. After about 10 frames, I switched to my actual shader. This also didn't do anything.
Finally, I wondered whether the GPU doesn't load the buffers until they are actually used inside a shader, so I wrote a simple shader that writes values of 0 into the buffer. After about 10 frames, I switched to my actual shader, and voila! The frame rate was much higher (600 fps compared to the 15 fps I was measuring for the first few frames). Changing the buffer size also did not significantly change the frame rate. From what I can see, it seems the GPU will not load a buffer into memory until a shader accesses it, and by the time that occurs, large buffers incur a large overhead for the first 10 frames or so. This strikes me as quite strange behavior, since I'd like to think the GPU would be able to load all the resources when given enough time (i.e. stalling the CPU to allow the GPU to catch up). Maybe something else is at work here, so please chime in if you have any ideas.
If you're wondering why it was necessary for me to measure the first frame (and not just the average fps), it's because my shader builds a sparse octree. The shader is much more demanding when the octree has to be subdivided many times, so the first frame (or frames with dynamic objects) requires much more time, and I needed a way of checking whether the changes in my shaders were improving how quickly the tree could be built. So if you're looking for a way to profile the first few frames of a shader, make sure you have a "warmup" shader that writes values of 0 (or something that would have no effect) into the buffers you're using. After a few frames, switch to your actual shader and the frame rate should be more indicative of what you would get in a true continuous run.
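For the curious, the warmup shader can be trivial. Something along these lines is what I mean (the buffer name, register, and thread counts are just placeholders for whatever your real setup uses):

```hlsl
// Hypothetical warmup compute shader: touches every element of the structured
// buffer so the driver commits its memory before the real shader runs.
RWStructuredBuffer<uint> gNodeBuffer : register(u0);

[numthreads(64, 1, 1)]
void WarmupCS(uint3 id : SV_DispatchThreadID)
{
    // Writing zero has no lasting effect on the algorithm, but forces the
    // allocation to become resident on the GPU.
    gNodeBuffer[id.x] = 0;
}
```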
DirectX manages CPU and GPU memory on its own. It may move blocks to the GPU, reallocate on the CPU, etc. These decisions are mostly determined by how you create your vertex buffers (their description, i.e. dynamic, write-only, etc.), but I'm sure there are other factors too.
The best way to determine a bottleneck is to use PIX to profile your engine/code. This will give you a better idea of what DirectX is doing with CPU/GPU memory.
The driver attempts to virtualize GPU memory to some degree, and so you won't actually allocate memory when you create the resource. Instead it will try to move things in and out of memory depending on how the resource gets used. This can be nice, since your app isn't strictly limited to allocating within the size of GPU memory, and other apps can also have GPU resources allocated concurrently. But it can also be nasty, since you can run into some very strange performance problems related to memory usage. In my experience, if you start to use more memory than the driver can allocate from GPU memory, it will try to page things in and out mid-frame, which results in erratic performance. This would seem to explain the behavior you've encountered: perhaps after you start using your very large buffer it takes the driver a few frames to evict some other stale resources out of memory, which is why the performance levels out after a little bit. But of course it's hard to tell just by looking at the raw performance numbers.
Also, with regard to profiling with queries: you have to be careful with the results from that. It really only gives you the latency from when the GPU starts the query to when it reaches the end of the query in the command buffer, which doesn't necessarily give you the total time the GPU spends on all of the commands within that query, since it might be executing unrelated tasks concurrently. Things can also get really tricky when an expensive decompression or synchronization step is involved, since that won't happen until you do something later that triggers it. For instance, I've seen this when trying to profile how long it takes my AMD GPU to fill an MSAA G-Buffer. I thought it was only taking a short amount of time, but when I forced a sync/decompression by running a short compute shader that samples from the G-Buffer textures with one thread, it showed up as taking much longer. I don't think you'll run into anything like that with buffers, since as far as I'm aware neither AMD nor Nvidia does anything fancy for buffers in terms of memory layout or compression. But there may be some synchronization costs that you could miss with a query.
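If you want timings on the GPU timeline rather than CPU wait latency, timestamp queries are usually a better fit than an event query. A rough sketch, assuming a valid `device`/`context` (and note the result is still subject to the concurrency caveats above):

```cpp
// Sketch: GPU-side timing of a draw with D3D11 timestamp queries.
// Timestamps are only meaningful inside a TIMESTAMP_DISJOINT begin/end pair,
// which also supplies the tick frequency for converting to milliseconds.
#include <d3d11.h>

double GpuTimeDrawMs(ID3D11Device* device, ID3D11DeviceContext* context,
                     UINT indexCount)
{
    D3D11_QUERY_DESC tsDesc = { D3D11_QUERY_TIMESTAMP, 0 };
    D3D11_QUERY_DESC djDesc = { D3D11_QUERY_TIMESTAMP_DISJOINT, 0 };
    ID3D11Query *tsBegin = nullptr, *tsEnd = nullptr, *disjoint = nullptr;
    device->CreateQuery(&tsDesc, &tsBegin);
    device->CreateQuery(&tsDesc, &tsEnd);
    device->CreateQuery(&djDesc, &disjoint);

    context->Begin(disjoint);
    context->End(tsBegin);            // timestamp queries only use End
    context->DrawIndexed(indexCount, 0, 0);
    context->End(tsEnd);
    context->End(disjoint);

    // Spin until the results are ready (in real code, read back a frame later).
    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj;
    while (context->GetData(disjoint, &dj, sizeof(dj), 0) == S_FALSE) {}
    UINT64 t0 = 0, t1 = 0;
    context->GetData(tsBegin, &t0, sizeof(t0), 0);
    context->GetData(tsEnd, &t1, sizeof(t1), 0);

    double ms = dj.Disjoint ? -1.0   // disjoint interval: timestamps unreliable
              : double(t1 - t0) / double(dj.Frequency) * 1000.0;
    tsBegin->Release(); tsEnd->Release(); disjoint->Release();
    return ms;
}
```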