
Powerful GPU - Send more do less, send less do more?



#1 iYossi   Members   -  Reputation: 128


Posted 23 September 2012 - 08:35 AM

What is better (performance-wise) for a powerful GPU - send more data to the GPU and do more work on it, or do more calculations on the CPU and send the final result to the shaders?


#2 SimonForsman   Crossbones+   -  Reputation: 5719


Posted 23 September 2012 - 09:14 AM

It depends on the workload, keeping things balanced is usually a good idea.
I don't suffer from insanity, I'm enjoying every minute of it.
The voices in my head may not be real, but they have some good ideas!

#3 mhagain   Crossbones+   -  Reputation: 7338


Posted 23 September 2012 - 09:32 AM

I'd say neither. Your first option is part-way there (the "do more work on the GPU" part), but you want to be able to do this without sending data at all (or at worst while sending as little data as possible). Sending data to the GPU is a performance killer; this isn't so much on account of bandwidth (although that too is a finite resource) but more on account of pipeline stalls, synchronization and resource contention. The more you send, the more likely it is that one piece of data is going to trip you up on this. Note that this also applies to your second option (which also involves bottlenecking on your slower processor - not good), as you still need to send the results to the GPU. Note also that even as little as 1 byte can be enough to incur a pipeline stall unless you're careful about what you send and how you send it - resource contention kills.

Ideally you want all data that your GPU is going to work with to be contained in static buffers which are initialized at load time and from then on never touched by the CPU. This lets your GPU operate completely isolated from your CPU and run as fast as it can without ever having to wait for the CPU or for data to be ready. Like all other idealized scenarios this isn't really obtainable in the real world, so instead you look at keeping communication between the two processors as low as possible. So your GPU stores the data in static buffers, your CPU tells it what to do with that data, but wherever possible that's the limit of communication - the CPU sends commands, not data, the GPU stores data, the GPU works on data according to the commands it's been given.
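As a concrete illustration of the "static buffers, initialized at load time" idea (a minimal sketch for this thread, not code from the post; it uses D3D11 since that API comes up below, and the Vertex struct and function name are placeholders):

// Sketch: a static vertex buffer that is filled once at load time and
// never touched by the CPU again. IMMUTABLE usage means the CPU cannot
// map it later, so the GPU can read it without any synchronization.
#include <d3d11.h>

struct Vertex
{
    float position[3];
    float texcoord[2];
};

ID3D11Buffer* CreateStaticVertexBuffer(ID3D11Device* device,
                                       const Vertex* vertices,
                                       UINT vertexCount)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = vertexCount * sizeof(Vertex);
    desc.Usage = D3D11_USAGE_IMMUTABLE;       // GPU read-only after creation
    desc.BindFlags = D3D11_BIND_VERTEX_BUFFER;
    desc.CPUAccessFlags = 0;                  // no CPU access from here on

    D3D11_SUBRESOURCE_DATA initData = {};
    initData.pSysMem = vertices;              // the one and only upload

    ID3D11Buffer* buffer = nullptr;
    if (FAILED(device->CreateBuffer(&desc, &initData, &buffer)))
        return nullptr;
    return buffer;
}

After load time, the CPU only issues draw commands that reference this buffer; the data itself never crosses the bus again.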

It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.


#4 kauna   Crossbones+   -  Reputation: 2142


Posted 23 September 2012 - 04:50 PM

Of course, whenever somebody says that "sending data to the GPU is bad", one has to keep in mind that you can copy a few megabytes' worth of data to dynamic buffers every frame for the GPU to use in rendering and still maintain excellent interactive frame rates, without stalling your pipeline. As said before, it depends on how you do it.

Of course, if you do more than just copying (i.e. recalculating some data on the CPU), then you are using the resources of your weaker processing unit, which may reduce performance.
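(For illustration only, not kauna's code: a minimal sketch of the per-frame dynamic-buffer copy described above, in D3D11. The function names and the assumption that the data fits in the buffer are mine.)

// Sketch: a dynamic buffer the CPU rewrites every frame via Map with
// WRITE_DISCARD, which lets the driver hand back memory the GPU is not
// currently reading, so neither side has to stall.
#include <d3d11.h>
#include <cstring>

ID3D11Buffer* CreateDynamicVertexBuffer(ID3D11Device* device, UINT byteWidth)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = byteWidth;
    desc.Usage = D3D11_USAGE_DYNAMIC;              // CPU write, GPU read
    desc.BindFlags = D3D11_BIND_VERTEX_BUFFER;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;

    ID3D11Buffer* buffer = nullptr;
    device->CreateBuffer(&desc, nullptr, &buffer);
    return buffer;
}

// Called once per frame: copy this frame's data into the buffer.
// 'size' is assumed to be no larger than the buffer's ByteWidth.
bool UpdateDynamicBuffer(ID3D11DeviceContext* context, ID3D11Buffer* buffer,
                         const void* data, size_t size)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (FAILED(context->Map(buffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
        return false;
    std::memcpy(mapped.pData, data, size);
    context->Unmap(buffer, 0);
    return true;
}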

Best regards!

#5 MJP   Moderators   -  Reputation: 10098


Posted 23 September 2012 - 04:52 PM

There's no simple answer to this question because in reality the situation is very complicated, and thus depends on the specifics of the hardware and what you're doing.

One way to look at CPU/GPU interaction is the same way you'd look at two CPU cores working concurrently. In the case of two CPUs you achieve peak performance when both processors are working concurrently without any communication or synchronization required between the two. For the most part this applies to the CPU/GPU case as well, since they're also parallel processors. So in general, reducing the amount of communication/synchronization between the two is a good thing. However in reality a GPU is incapable of operating completely independently from the CPU, which is unfortunate. The GPU always requires the CPU to, at minimum, submit a buffer (or buffers) containing a stream of commands for the GPU to execute. These commands include draw calls, state changes, and other things you'd normally perform using a graphics API.

The good news is that the hardware and driver are somewhat optimized for the case of CPU-to-GPU data flow, and thus can handle it in most cases without requiring stalling/locking for synchronization. The hardware enables this by being able to access CPU memory across the PCI-e bus, and/or by allowing the CPU write access to a small section of dedicated memory on the video card itself. However in general, read or write speeds for either the CPU or GPU will be diminished when reading or writing to these areas, since data will have to be transferred across the PCI-e bus. For the command buffer itself the hardware will typically use some sort of FIFO setup where the driver can be writing commands to one area of memory while the GPU trails behind, executing commands from a different area of memory. This allows the GPU and CPU to work independently of each other, as long as the CPU is running fast enough to be working ahead of the GPU.

As for the drivers, they will also use a technique known as buffer renaming to enable the CPU to send data to the GPU without explicit synchronization. It's primarily used when you have some sort of "dynamic" buffer where the CPU has write access and the GPU has read access, for instance when you create a buffer with D3D11_USAGE_DYNAMIC in D3D11. What happens with these buffers is that the driver doesn't explicitly allocate memory for them when you create them; it defers the allocation until the point when you lock/map the buffer. At that point it allocates some memory that the GPU isn't currently using, and allows the CPU to write its data there. The GPU later reads the data when it executes a command that uses the buffer, which is typically some time later on (perhaps even as much as a frame or two). Then if the CPU locks/maps the buffer again, the driver will allocate a different area of memory than the last time it was locked/mapped, so the CPU is again writing to an area of memory that's not currently in use by the GPU. This is why such buffers require the DISCARD flag in D3D: the buffer is using a new piece of memory, therefore it won't have the same data that you previously filled it with. By using such buffers you can typically avoid stalls, but you may still pay some penalties in terms of access speeds or in the form of driver overhead. It's also possible that the driver may run out of memory to allocate, in which case it will be forced to stall.

Another technique employed by drivers as an alternative to buffer renaming is to store data in the command buffer itself. This is how the old "DrawPrimitiveUP" stuff was implemented in D3D9. This can be slower than dynamic buffers, depending on how the command buffers are set up. In some cases the driver will let you update a non-renamed buffer without an explicit sync as long as you "promise" not to write any data that the GPU is currently using. This is exposed by the WRITE_NO_OVERWRITE pattern in D3D.
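(A minimal sketch of the usual DISCARD / NO_OVERWRITE pattern described above, assuming a vertex buffer created with D3D11_USAGE_DYNAMIC and D3D11_CPU_ACCESS_WRITE; the struct and names are placeholders, not code from the post.)

// Sketch: append data into a dynamic vertex buffer used as a ring buffer.
// NO_OVERWRITE is the "promise" not to touch data the GPU may still be
// reading; DISCARD asks the driver for a fresh (renamed) allocation when
// the buffer runs out of room.
#include <d3d11.h>
#include <cstring>

struct RingVertexBuffer
{
    ID3D11Buffer* buffer = nullptr;
    UINT capacity = 0;   // total size in bytes
    UINT offset = 0;     // next free byte

    // Copies 'size' bytes in and returns the byte offset to draw from,
    // or UINT(-1) on failure.
    UINT Append(ID3D11DeviceContext* context, const void* data, UINT size)
    {
        D3D11_MAP mapType = D3D11_MAP_WRITE_NO_OVERWRITE;
        if (offset + size > capacity)
        {
            // Out of room: DISCARD gives us a new renamed allocation and
            // lets us start writing from the beginning again.
            mapType = D3D11_MAP_WRITE_DISCARD;
            offset = 0;
        }

        D3D11_MAPPED_SUBRESOURCE mapped = {};
        if (FAILED(context->Map(buffer, 0, mapType, 0, &mapped)))
            return UINT(-1);

        std::memcpy(static_cast<char*>(mapped.pData) + offset, data, size);
        context->Unmap(buffer, 0);

        UINT drawOffset = offset;
        offset += size;
        return drawOffset;
    }
};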

For going the other way and having the GPU provide data to the CPU, you don't have the benefit of these optimizations. In all such cases (reading back render targets, getting query data, etc.) the CPU will be forced to sync with the GPU, flush all pending commands, and then wait for them to execute. The only way to avoid the stall is to wait long enough for the GPU to finish before requesting access to the data.
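(A sketch of that "wait long enough" approach, again my illustration rather than MJP's code: the GPU result is copied into a STAGING buffer, and the CPU only maps it on a later frame with DO_NOT_WAIT, retrying instead of stalling. The function names and buffer setup are assumptions.)

// Sketch: non-blocking readback of GPU results via a staging buffer.
#include <d3d11.h>
#include <cstring>

// Issue the copy right after the GPU work that produces 'gpuResult'.
// 'stagingBuffer' is assumed to have been created with the same ByteWidth,
// D3D11_USAGE_STAGING and D3D11_CPU_ACCESS_READ.
void QueueReadback(ID3D11DeviceContext* context,
                   ID3D11Buffer* gpuResult, ID3D11Buffer* stagingBuffer)
{
    context->CopyResource(stagingBuffer, gpuResult);
}

// Poll this on later frames; returns true once the data was ready and copied out.
bool TryReadback(ID3D11DeviceContext* context, ID3D11Buffer* stagingBuffer,
                 void* dest, size_t size)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    HRESULT hr = context->Map(stagingBuffer, 0, D3D11_MAP_READ,
                              D3D11_MAP_FLAG_DO_NOT_WAIT, &mapped);
    if (hr == DXGI_ERROR_WAS_STILL_DRAWING)
        return false;                       // GPU not done yet; try again next frame
    if (FAILED(hr))
        return false;

    std::memcpy(dest, mapped.pData, size);
    context->Unmap(stagingBuffer, 0);
    return true;
}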

So getting back to your original question, whether or not it's better to pre-calculate on the CPU depends on a few things. For instance, how much data does the CPU need to send? Will doing so require the GPU to access an area of memory that's slower than its primary memory pool? How much time will the CPU spend computing the data, and writing the data to an area that's GPU-accessible? Is it faster for the GPU to compute the result on the fly, or to access the result from memory? Like I said before, these things can all vary depending on the exact architecture and your algorithms.

Edited by MJP, 30 September 2012 - 10:19 AM.


#6 RobMaddison   Members   -  Reputation: 636


Posted 30 September 2012 - 03:24 AM

Great writeup, MJP



