Powerful GPU - Send more do less, send less do more?
Members - Reputation: 4037
Posted 23 September 2012 - 09:32 AM
Ideally you want all data that your GPU is going to work with to be contained in static buffers which are initialized at load time and from then on never touched by the CPU. This lets your GPU operate completely isolated from your CPU and run as fast as it can without ever having to wait for the CPU or for data to be ready. Like all other idealized scenarios this isn't really obtainable in the real world, so instead you look at keeping communication between the two processors as low as possible. So your GPU stores the data in static buffers, your CPU tells it what to do with that data, but wherever possible that's the limit of communication - the CPU sends commands, not data, the GPU stores data, the GPU works on data according to the commands it's been given.
It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.
Members - Reputation: 1210
Posted 23 September 2012 - 04:50 PM
Of course, if you do more than copying (ie. recalculating some data) then you are using the resources of your weaker processing unit which may reduce the performance.
Moderators - Reputation: 5656
Posted 23 September 2012 - 04:52 PM
One way to look at CPU/GPU interaction is the same way you'd look at two CPU cores working concurrently. In the case of two CPU's you achieve peak performance when both processors are working concurrently without any communication or synchronization required between the two. For the most apart this applies to CPU/GPU as well, since they're also parallel processors. So in general, reducing the amount of communication/synchronization between the two is a good thing. However in reality a GPU is incapable of operating completely independently from the GPU, which is unfortunate. The GPU always requires the CPU to, at minimum, submit a buffer (or buffers) containing a stream of commands for the GPU to execute. These commands include draw calls, state changes, and other things you'd normally perform using a graphics API.
The good news is that the hardware and driver are somewhat optimized for the case of CPU to GPU data flow, and thus can handle it in most cases with requiring stalling/locking for synchronization. The hardware enables this by being able to access CPU memory across the PCI-e bus, and/or by allowing the CPU write access to a small section of dedicated memory on the video card itself. However in general read or write speeds for either the CPU or GPU will be diminished when reading or writing to these areas, since data will have to be transferred across the PCI-e bus. For the command buffer itself the hardware will typically use some sort of FIFO setup where the driver can be writing commands to one area of memory, while the GPU trails behind executing commands from a different are of memory. This allows the GPU and CPU to work independently of each other, as long as the CPU is running fast enough to be working ahead of the GPU.
As for the drivers, they will also use a technique known as buffer renaming to to enable the CPU to send data to the GPU without explicit synchronization. It's primarily used when you have some sort "dynamic" buffer where the CPU has write access and the GPU has read access, for instance when you create a buffer with D3D11_USAGE_DYNAMIC in D3D11. What happens with these buffer is that the driver doesn't explicitly allocate them memory when you create them, it defers the allocation until the point when you lock/map it. At this point it allocates some memory that GPU isn't currently using, and allows the CPU to write its data there. Then the GPU later reads the data when it executes a command that uses the buffer, which is typically some time later on (perhaps even as much as a frame or two). Then if the CPU locks/maps the buffer again the driver will allocate a different area of memory than the last time it was locked/mapped, so the CPU is again writing to an area of memory that's not currently in use by the GPU. This is why such buffers require the DISCARD flag in D3D: the buffer is using a new piece of memory, therefore it won't have the same data that you previously filled it with. By using such buffers you can typically avoid stalls, but you still may pay some penalties in terms of access speeds or in the form of driver overhead. It's also possible that the driver may run out of memory to allocate, in which case it will be forced to stall. Another technique employed by drivers as an alternative to buffer renaming is to store data in the command buffer itself. This is how the old "DrawPrimitiveUP" stuff was implemented in D3D9. This can be slower than dynamic buffers depending on how the command buffers are set up. In some cases the driver will let you update a non-renamed buffer without an explicit sync as long as you "promise" not to write any data that the GPU is currently using. This is exposed by the WRITE_NO_OVERWRITE pattern in D3D.
For going the other way and having the GPU provide data to the CPU, you don't have the benefit of these optimizations. In all such cases (reading back render targets, getting query data, etc.) the CPU will be forced to sync with the GPU and flush all pending commmands, and then wait for them to execute. The only way to avoid the stall is to wait long enough for the GPU to finish before requesting access to the data.
So getting back to your original question, whether or not its better to pre-calculate on the CPU depends on a few things. For instance, how much data does the CPU need to send? Will doing so require the GPU to access an area of memory that's slower than its primary memory pool? How much time will the CPU spend computing the data, and writing the data to an area that's GPU-accessible? Is it faster for the GPU to compute the result on the fly, or to access the result from memory? Like I said before, these things can all vary depending on the exact architecture and your algorithms.
Edited by MJP, 30 September 2012 - 10:19 AM.