Is it possible for a compute shader to be slower than CPU code if the CPU is better than the GPU, or must the GPU always be faster?
Posted 27 January 2014 - 12:21 PM
CPUs are good at executing relatively complex programs on a few threads. GPUs are good at executing relatively simple programs on many threads.
If you can find a way to efficiently parallelize the problem you're trying to solve, a compute shader is probably going to be faster. On the other hand, if your algorithm is inherently sequential, the CPU probably has the upper hand.
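To make the distinction concrete, here is a minimal sketch in Python. Both functions are hypothetical illustrations, not code from any real project: `brighten` does independent per-element work (the kind that maps well onto many GPU threads), while `iterate_chain` needs each result before it can compute the next, so extra threads don't help.

```python
def brighten(pixels, amount):
    """Parallel-friendly: each output depends only on its own input,
    so every element could be handled by a different thread."""
    return [min(255, p + amount) for p in pixels]


def iterate_chain(seed, steps, modulus):
    """Inherently sequential: step i cannot start until step i-1 is done."""
    x = seed
    for _ in range(steps):
        x = (x * x + 1) % modulus
    return x


print(brighten([0, 100, 250], 10))   # [10, 110, 255]
print(iterate_chain(2, 3, 1000))     # 2 -> 5 -> 26 -> 677
```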
Posted 27 January 2014 - 07:03 PM
The way to really understand when one is better than the other is to understand the kinds of problems that each was developed to solve.
CPUs have been developed to provide a single answer at a time with as low a latency as possible -- that is, do one single thing at a time with one set of input as quickly as possible, per available thread of execution.
GPUs have been developed to provide multiple answers at a time while giving away low-latency operation in favor of higher aggregate throughput -- that is, do one single thing to many sets of input at a lower frequency, but with performance multiplied by having a high number of threads (one per set of input) running in parallel.
You also need to understand that the threads of execution on a GPU have been simplified in various ways so that they occupy less silicon real-estate compared to threads of execution on a CPU.
A thread on a GPU doesn't really have its own program counter or cache; it shares them, in lock-step, with the rest of its sibling threads (even its registers are just slices of wide registers that the whole group operates on together). For example, if you run code on a GPU that has a single "if" statement, and even a single thread chooses a different branch than all the rest of its siblings, then the code inside both branches has to be executed by the whole batch of siblings (the unwanted results are then masked off, the wanted ones are recombined, and execution can continue). If there's another "if" inside each branch and one of the siblings goes its own way again, both new branches have to be executed and recombined too. In the worst case, nested if statements on a GPU make the execution time grow exponentially with nesting depth for as long as the threads "diverge" (take different paths). Sometimes you can come up with a clever solution to make sure all the threads go the same way, and then a GPU will do great, but it's not always possible or predictable to do so.
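The cost of divergence can be sketched with a toy model in Python. This is only an illustration of the lock-step idea described above, not how any particular GPU is implemented; the function names and the instruction counts `THEN_COST` and `ELSE_COST` are made up.

```python
# Toy model of lock-step execution: a group of sibling threads shares one
# program counter, so the group pays for every path that ANY thread takes.
THEN_COST = 10  # hypothetical cost of the "if" body
ELSE_COST = 10  # hypothetical cost of the "else" body


def gpu_group_cost(conditions):
    """Lock-step group: each path runs once if at least one thread needs it."""
    cost = 0
    if any(conditions):       # someone takes the "if" path
        cost += THEN_COST     # threads on the other path idle, masked off
    if not all(conditions):   # someone takes the "else" path
        cost += ELSE_COST
    return cost


def cpu_thread_cost(condition):
    """Independent CPU thread: only the taken path is executed."""
    return THEN_COST if condition else ELSE_COST


uniform = [True] * 32              # all 32 siblings agree
one_stray = [True] * 31 + [False]  # a single sibling diverges

print(gpu_group_cost(uniform))     # 10
print(gpu_group_cost(one_stray))   # 20
print(cpu_thread_cost(False))      # 10
```

A single stray thread doubles the cost for the whole group in this model, which is why arranging the data so that siblings agree on their branches matters so much.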
A thread on a CPU has its own program counter, registers, and caches. It doesn't really share with anyone, so it can go wherever it wants, whenever it wants, for the most part. (Hyper-threading does compete for execution resources on a CPU, but it generally only hops in when the other thread is stalled waiting for memory, so it's a win.) When a CPU takes a branch, it completely ignores the untaken path and wastes no time executing it. Because of this, "branchy" code on a CPU doesn't suffer the exponential growth in execution time that it can on a GPU.
If I can indulge in a car analogy: a 4-core CPU is like a pack of motorcycles, where everyone goes exactly to their own destination as quickly as possible. A GPU execution core (which groups 16-32 "passengers" inside each "vehicle") is like a school bus, where everyone goes to the same destination, and more slowly. When you have a few people going to different places, the motorcycles will get everyone there more quickly, but when lots of people are going to the same place, a bus delivers more people sooner (remember, your CPU only has 4-8 threads, so you can only have 4-8 motorcycles on your roadway at once -- but a modern, high-end GPU is like a fleet of 32-44 buses, each carrying 16 people).
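The analogy can be put into rough numbers with a small Python sketch. The vehicle counts follow the figures in the analogy (4 motorcycles, 32 buses of 16 seats); the trip times are invented purely for illustration, and the model assumes every passenger on a bus wants the same destination (no divergence).

```python
import math

# Hypothetical trip times in arbitrary units: the bus is 10x slower per trip.
MOTORCYCLE_TRIP = 1
BUS_TRIP = 10

CPU_SEATS = 4          # 4 motorcycles, one rider each
GPU_SEATS = 32 * 16    # 32 buses, 16 passengers each


def total_time(passengers, seats, trip_time):
    """Everyone is delivered in rounds; each round moves up to `seats` people."""
    return math.ceil(passengers / seats) * trip_time


for n in (4, 100_000):
    cpu = total_time(n, CPU_SEATS, MOTORCYCLE_TRIP)
    gpu = total_time(n, GPU_SEATS, BUS_TRIP)
    print(n, cpu, gpu)
# 4 1 10            -> a handful of riders: the motorcycles win
# 100000 25000 1960 -> a crowd: the buses win despite the slower trips
```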
You also have vector instruction sets on the CPU core, which are kind of a middle ground between CPUs and GPUs -- like CPUs they operate at higher speeds, but like GPUs their threads share the program counter, registers, and caches with their siblings. However, because they combine fewer threads at once (4 for SSE, Altivec, or NEON, and 8 for AVX), it's a bit more manageable to load up the vehicle so that everyone is going to the same place. Vector instructions are like SUVs in my car analogy.
The other factor that can make GPUs slower is that to get data onto a discrete GPU, you have to copy it across the relatively slow PCIe bus, perform the calculation, and then copy the result back again. The transfer itself has a bandwidth that's about a quarter of the bandwidth between the CPU and main memory, which already puts you at a disadvantage. But there's also a fixed cost per transfer that results from the driver telling the GPU's hardware to prepare and execute the transfer. As a result, the less data you move per transfer, the higher the amortized cost per unit of work accomplished. In practice this means that you need to have a lot of data to operate on, a lot of operations to perform, or both, in order to overcome this penalty and achieve a performance win. Lots of interesting problems fall into that category, and lots don't.
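A back-of-the-envelope model in Python shows how the fixed cost and the bus bandwidth interact. Every constant here is a placeholder chosen for illustration, not a measurement of any real hardware, and `faster_device` is a hypothetical helper.

```python
# All constants are hypothetical placeholders, not benchmarks.
FIXED_COST_PER_TRANSFER_S = 20e-6  # driver/hardware setup per copy
PCIE_BYTES_PER_S = 6e9             # bus bandwidth
CPU_OPS_PER_S = 1e10               # CPU arithmetic throughput
GPU_OPS_PER_S = 1e12               # GPU arithmetic throughput
BYTES_PER_ITEM = 4                 # one 32-bit value in, one out


def cpu_time(items, ops_per_item):
    return items * ops_per_item / CPU_OPS_PER_S


def gpu_time(items, ops_per_item):
    # One copy to the GPU and one copy back, each paying the fixed cost.
    copy = 2 * (FIXED_COST_PER_TRANSFER_S
                + items * BYTES_PER_ITEM / PCIE_BYTES_PER_S)
    return copy + items * ops_per_item / GPU_OPS_PER_S


def faster_device(items, ops_per_item):
    if cpu_time(items, ops_per_item) <= gpu_time(items, ops_per_item):
        return "CPU"
    return "GPU"


print(faster_device(1_000, 100))       # CPU: fixed cost dominates
print(faster_device(10_000_000, 100))  # GPU: lots of data and lots of work
print(faster_device(10_000_000, 1))    # CPU: lots of data, too little work
```

With these made-up numbers, the GPU only wins when there is both enough data to amortize the fixed cost and enough work per byte to outweigh the time spent on the bus.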
Just last week AMD introduced their new APUs, which integrate a smallish GPU into the CPU and, for the first time, share the same memory space with it (Intel's latest Haswell i-series processors do too). That can reduce or eliminate the transfer overhead of PCIe. These integrated GPUs are only about 1/6th the size of a high-end GPU, and they don't have dedicated GDDR5 to lend them super-high bandwidth, so some really big problems still win on a discrete GPU. They also still don't do branches and divergent code well, so those problems are still best left to the CPU. But there are some problems -- facial recognition, for instance -- that do a moderate amount of computation on a moderately-sized data set, and those kinds of problems do better on these kinds of APUs than they do on either CPUs or discrete GPUs. My car analogy begins to break down here, because these APU execution units are still buses; they just have much more efficient protocols for loading and unloading their passengers.
Edited by Ravyne, 27 January 2014 - 07:09 PM.