[quote name='maxgpgpu' timestamp='1339445552' post='4948286']
Oh no! A 4x4 matrix multiply, or transforming a vertex with 1 position and 3 vectors, takes much more than 1 cycle on the GPU (and GPU cycles are typically ~4 times slower than CPU cycles). Modern GPUs are not 4-wide any more. They found a few years ago that making each core 1-wide and increasing the number of GPU cores was a better tradeoff, especially as shaders got more sophisticated and did a lot of other work besides 4-wide operations. The fact is, one core of a 64-bit CPU can perform a matrix multiply or vertex transformation much faster than a GPU core can. However, a GPU has about 512~1024 cores these days, while the CPU only has 8.
Measuring a single transformation in isolation is not going to give you a meaningful comparison. As you noted, GPUs are much more parallel than CPUs (and that's not going to change for the foreseeable future), and they also have the advantage that this parallelism comes for free: there's no thread sync overhead or other nastiness in your program. At the most simplistic level, an 8-core CPU would need to transform vertexes 64 times faster than a 512-core GPU (the actual number will be different, but in a similar ballpark) for this to be advantageous.
Another crucial factor is that performing these transforms on the GPU allows you to use static vertex buffers. Do them on the CPU and you need to stream data to the GPU every frame, which can quickly become a bottleneck. Yes, bandwidth is plentiful and fast, but it certainly doesn't come for free; keeping as much data as possible 100% static for the lifetime of your program is still a win. Not to mention the reduction in code complexity in your program, which I think we can all agree is a good thing.
[/quote]
From articles by NVIDIA, I infer a single core of a fast CPU is about 8x faster than a GPU core, not 64x (so a 512-core GPU still has roughly 512 / 8 / 8 = 8x the aggregate transform throughput of an 8-core CPU). That's why I believe we all agree that our 3D engines should perform every operation on the GPU that doesn't carry some unfortunate (and possibly inefficient) consequence. A perfect example of such an unfortunate/inefficient consequence of doing everything the conventional way is collision-detection and collision-response. This is not an issue in trivial games, or in games that just don't need to detect or respond to collisions in any significant or detailed way. But when any aspect of collision-detection OR collision-response needs to be handled on the CPU, the "simple way" or "conventional way" of keeping object local-coordinates in VBOs in GPU memory, and having the GPU transform those objects directly to screen-coordinates with one matrix, has serious problems.
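Just to pin down what I mean by the "one matrix" path, the conventional way in shader form is a single transformation straight from local coordinates to clip space. This is only a sketch; the names are mine, not from any particular engine:

[code]
/* The "conventional way" in shader form (names invented): local coordinates go
 * straight to clip space with one combined model-view-projection matrix. */
static const char *vsConventional =
    "#version 330 core\n"
    "layout(location = 0) in vec3 a_positionLocal;\n"
    "uniform mat4 u_localToClip;\n"    /* model * view * projection, per object */
    "void main() {\n"
    "    gl_Position = u_localToClip * vec4(a_positionLocal, 1.0);\n"
    "}\n";
[/code]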
First, collision-detection must be performed with all objects in the same coordinate-system... which for many practical reasons is world-coordinates. Collision-response has the same requirement: all objects must be available in the same [non-rotating, non-accelerating] coordinate-system, and world-coordinates is again the most convenient. So the "solution" proposed by advocates of the "conventional way" is to make the GPU perform two matrix transformations: first transform each vertex to world-coordinates and write the result (or at least the world-coordinate position, and possibly the transformed surface vectors: normal, tangent, bitangent) back into CPU memory, then transform from world-coordinates to screen-coordinates in a separate transformation.
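To make that concrete, here is roughly what the two-step path looks like with OpenGL 3.x transform feedback. This is only a sketch, and every name in it (u_modelToWorld, u_worldToClip, v_positionWorld, setupTwoStepProgram) is invented for illustration:

[code]
#include <GL/glew.h>   /* any GL function loader */

/* Sketch of the "two transformations" vertex shader (names invented).
 * First local -> world, captured for the CPU; then world -> clip for rasterization. */
static const char *vsTwoStep =
    "#version 330 core\n"
    "layout(location = 0) in vec3 a_positionLocal;\n"
    "uniform mat4 u_modelToWorld;\n"     /* per-object model matrix          */
    "uniform mat4 u_worldToClip;\n"      /* view * projection                */
    "out vec3 v_positionWorld;\n"        /* captured with transform feedback */
    "void main() {\n"
    "    vec4 worldPos   = u_modelToWorld * vec4(a_positionLocal, 1.0);\n"
    "    v_positionWorld = worldPos.xyz;\n"
    "    gl_Position     = u_worldToClip * worldPos;\n"
    "}\n";

/* Ask GL to capture v_positionWorld; this must happen before the program is linked.
 * Compiling and attaching the shaders (including vsTwoStep above) is omitted. */
static void setupTwoStepProgram(GLuint program)
{
    const char *varyings[] = { "v_positionWorld" };
    glTransformFeedbackVaryings(program, 1, varyings, GL_INTERLEAVED_ATTRIBS);
    glLinkProgram(program);
}
[/code]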
So now we do have a transfer of all vertices between CPU and GPU... only this time, the transfer is from GPU to CPU. The CPU must wait for this information to be returned before it starts processing it, and of course this must be synchronized (not terribly difficult).
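The CPU-side half of that readback might look something like this; again just a sketch with invented names, and the fence and map calls are exactly where the waiting happens:

[code]
#include <GL/glew.h>   /* any GL function loader */

/* Sketch (OpenGL 3.x): run the two-step program, capture world-space positions
 * into worldPosBuffer during the draw, then read them back for collision work.
 * All names here are invented for illustration. */
static const float *drawAndReadBackWorldPositions(GLuint twoStepProgram,
                                                  GLuint vao,
                                                  GLuint worldPosBuffer,
                                                  GLsizei indexCount,
                                                  GLsizei vertexCount)
{
    glUseProgram(twoStepProgram);
    glBindVertexArray(vao);
    glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, worldPosBuffer);

    glBeginTransformFeedback(GL_TRIANGLES);
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);
    glEndTransformFeedback();

    /* The CPU has to wait until the GPU has actually produced the data. */
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, (GLuint64)1000000000); /* up to 1 second */
    glDeleteSync(fence);

    /* World-space positions (3 floats per vertex) are now readable on the CPU. */
    const float *worldPositions = (const float *)glMapBufferRange(
        GL_TRANSFORM_FEEDBACK_BUFFER, 0,
        (GLsizeiptr)vertexCount * 3 * sizeof(float), GL_MAP_READ_BIT);

    /* Caller runs collision-detection/response on worldPositions, then calls
     * glUnmapBuffer(GL_TRANSFORM_FEEDBACK_BUFFER) when finished. */
    return worldPositions;
}
[/code]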
I'm not sure what you mean by "keeping the data as static as possible". In my scheme, where the CPU always holds every vertex in both local-coordinates and world-coordinates, and the GPU holds every vertex in world-coordinates, only those vertices that changed on a given frame (usually by a rotate or translate) are transferred to the GPU. Isn't that also "keeping data as static as possible"? It is true that in the "conventional way" the vertices in GPU memory are almost never changed, which is the ultimate in "static". However, that approach inherently requires the CPU to update the transformation matrix of every object every frame and send it to the GPU.

While transformation matrices are smaller than the vertices of most objects, this aspect of the "conventional way" forces another inefficiency: the necessity to call glDrawElements() or glDrawRangeElements() or a similar function for every object. In other words, "small batches" (except for huge objects with boatloads of vertices, where one object in and of itself is a large batch).

Furthermore, since the "conventional way" inherently requires a new glDrawElements() call for every object, we might as well have boatloads of shaders, one for every conceivable type of rendering, and [potentially] change shaders for every object... or at least "often". From what I can tell, changing shaders has substantial overhead in the GPU driver [and maybe in the GPU itself], so this is an often-ignored gotcha of the "conventional way". Also important is that different shaders often expect different uniform buffers, with different contents in different layouts. So that uniform-buffer data must be sent to the GPU driver and GPU too, potentially as often as every object (if rendering order is sorted by anything other than "which shader").
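In code, the per-object loop that the "conventional way" forces on you looks roughly like this (the ConvObject structure and drawConventional are purely illustrative, not from any real engine):

[code]
#include <GL/glew.h>   /* any GL function loader */

/* Hypothetical per-object record for the "conventional way" sketch. */
typedef struct {
    GLuint  program;         /* shader picked for this object's material       */
    GLint   mvpLocation;     /* location of the local->clip matrix uniform     */
    GLuint  uniformBuffer;   /* per-material UBO; layout depends on the shader */
    GLuint  vao;             /* static VBO/IBO bindings                        */
    GLsizei indexCount;
    float   localToClip[16]; /* rebuilt by the CPU every frame                 */
} ConvObject;

static void drawConventional(const ConvObject *objects, int objectCount)
{
    for (int i = 0; i < objectCount; ++i) {
        const ConvObject *obj = &objects[i];

        /* Potential shader switch per object: driver overhead. */
        glUseProgram(obj->program);

        /* Per-object uniform data: the matrix plus whatever UBO this shader expects. */
        glUniformMatrix4fv(obj->mvpLocation, 1, GL_FALSE, obj->localToClip);
        glBindBufferBase(GL_UNIFORM_BUFFER, 0, obj->uniformBuffer);

        /* One (small) batch per object. */
        glBindVertexArray(obj->vao);
        glDrawElements(GL_TRIANGLES, obj->indexCount, GL_UNSIGNED_INT, 0);
    }
}
[/code]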
I very much agree that keeping code complexity low is important. However, in my experience, my approach is simpler than the "conventional way". Every object looks the same, has the same elements, is transformed to world-coordinates and sent to the GPU when modified (but only when modified), exists in CPU memory in world-coordinates for the convenience of collision-detection and collision-response computations, and whole batches of objects are rendered with a single glDrawElements(). Are there exceptions? Yes, though not many, and I resist them. Examples of exceptions I cannot avoid are rendering points and lines (each gets put into its own batch). My vertices contain a few fields that tell the pixel shader how to render each vertex: which texture and normalmap to use, plus bits that specify the rendering type (color-only, texture-only, color-modifies-texture, enable-bumpmapping, etc). So if anything, I think my proposed way (implemented in my engine) is simpler. And all things considered, for a general-purpose engine, I suspect it is faster too.
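A rough sketch of the scheme I'm describing, with purely illustrative names (BatchObject, transformLocalToWorld, updateAndDrawBatch): only modified objects get re-transformed on the CPU and re-uploaded with glBufferSubData(), and the whole batch then renders with one glDrawElements():

[code]
#include <GL/glew.h>   /* any GL function loader */

#define FLOATS_PER_VERTEX 12   /* position + surface vectors + misc fields (example value) */

/* Hypothetical per-object record for the batched, world-coordinate scheme. */
typedef struct {
    float   *localVerts;   /* local-coordinate copy kept on the CPU              */
    float   *worldVerts;   /* world-coordinate copy kept on the CPU              */
    GLsizei  vertexCount;
    GLintptr byteOffset;   /* where this object lives in the shared batch VBO    */
    int      modified;     /* set when the object rotated/translated this frame  */
} BatchObject;

/* transformLocalToWorld() is a hypothetical CPU routine that applies the object's
 * current pose to localVerts and writes the result into worldVerts. */
void transformLocalToWorld(BatchObject *obj);

static void updateAndDrawBatch(GLuint sharedVBO, GLuint batchVAO,
                               BatchObject *objects, int objectCount,
                               GLsizei batchIndexCount)
{
    glBindBuffer(GL_ARRAY_BUFFER, sharedVBO);

    /* Only objects that actually moved this frame are re-transformed and re-uploaded. */
    for (int i = 0; i < objectCount; ++i) {
        if (!objects[i].modified)
            continue;
        transformLocalToWorld(&objects[i]);
        glBufferSubData(GL_ARRAY_BUFFER, objects[i].byteOffset,
                        (GLsizeiptr)objects[i].vertexCount * FLOATS_PER_VERTEX * sizeof(float),
                        objects[i].worldVerts);
        objects[i].modified = 0;
    }

    /* One draw call for the whole batch; the shader only needs world -> clip. */
    glBindVertexArray(batchVAO);
    glDrawElements(GL_TRIANGLES, batchIndexCount, GL_UNSIGNED_INT, 0);
}
[/code]

Note that in this scheme the vertex shader never needs a per-object matrix, which is exactly what makes the single-draw-call batching possible.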