ALL code is executed, including ALL branches, all function calls, etc. (...) This is how all graphics cards work, AMD, NVIDIA, etc.
I'm not sure what you're basing that on, but almost all graphics cards from the last 3 or 4 years work this way. Here's a quote from NVIDIA:
Any flow control instruction (if, switch, do, for, while) can significantly affect
the instruction throughput by causing threads of the same warp to diverge; that is, to
follow different execution paths. If this happens, the different execution paths must be
serialized, since all of the threads of a warp share a program counter; this increases the
total number of instructions executed for this warp. When all the different execution
paths have completed, the threads converge back to the same execution path.
To obtain best performance in cases where the control flow depends on the thread ID,
the controlling condition should be written so as to minimize the number of divergent warps.
This is possible because the distribution of the warps across the block is deterministic as
mentioned in SIMT Architecture of the CUDA C Programming Guide. A trivial example is
when the controlling condition depends only on (threadIdx / WSIZE) where WSIZE is
the warp size.
In this case, no warp diverges because the controlling condition is perfectly aligned with the warps.
So the different execution paths are serialized only when threads within the same warp actually take different branches. If every thread in a warp takes the same branch, that warp executes only that one path.
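To make the difference concrete, here is a rough CPU-side sketch in Python. It is not GPU code, just a simulation of how thread IDs are grouped into warps. The function name divergent_warps and the block size of 256 are made up for the illustration, and the warp size of 32 is an assumption that holds for current NVIDIA hardware. It counts how many warps would have to serialize for a given branch condition:

```python
WARP_SIZE = 32  # assumed warp size (WSIZE in the quote above)

def divergent_warps(block_size, condition):
    """Count the warps in which threads disagree on `condition`.

    Hypothetical helper: it only models the grouping of consecutive
    thread IDs into warps, not real GPU scheduling.
    """
    count = 0
    for warp_start in range(0, block_size, WARP_SIZE):
        lanes = range(warp_start, min(warp_start + WARP_SIZE, block_size))
        # Collect the distinct branch outcomes seen within this warp.
        outcomes = {bool(condition(tid)) for tid in lanes}
        if len(outcomes) > 1:
            count += 1  # both paths taken: this warp must serialize them
    return count

# Condition varies from thread to thread: every warp diverges.
per_thread = divergent_warps(256, lambda tid: tid % 2 == 0)

# Condition depends only on tid // WARP_SIZE: no warp diverges.
per_warp = divergent_warps(256, lambda tid: (tid // WARP_SIZE) % 2 == 0)

print(per_thread, per_warp)  # prints "8 0"
```

With 256 threads there are 8 warps. The even/odd condition splits every one of them, while the warp-aligned condition splits none, so in the second case each warp runs only the branch it actually takes.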