Even older hardware was capable of executing many shaders at the same time. It could have been working on the vertex shader stage for one batch of input at the same time as it was shading pixels for an earlier batch and blending for an even earlier one - that's a classic pipelined architecture, and the Wiki page has a decent diagram that can help illustrate it: http://en.wikipedia....uction_pipeline.
It could also have multiple pipes for each shader stage, e.g. a GeForce 6600GT had 8 pixel pipes and 3 vertex pipes, meaning it was capable of running the vertex shader for each vertex of an entire triangle simultaneously (GeForce 6 series specs comparison: http://en.wikipedia....ries_comparison).
Modern hardware just generalises/unifies the shader stages and massively increases the number of cores, but otherwise the pipelined/simultaneous architecture has been with us for a long, long time.
I knew shaders had been [dynamically] assigned to act as vertex shaders or pixel shaders on the same object for several years. What I had not realized is that the vertices and pixels of many objects, with many different transformation matrices [and even shaders], could be processed simultaneously. I'd very much like to know the practical limits of this, so that any modifications I make to my engine are well chosen.
I know this is not a simple, one-dimensional tradeoff. But there must be certain specific limits that an engine designer could be told about that would inform his strategies. Like how many registers are available (and at what granularity they can be allocated to shaders). How many different vertex shaders can be run simultaneously. How many different pixel shaders can be run simultaneously. And so forth.
Do you guys load a transformation matrix for an object, then render the object, then repeat? Or do you load a pile of transformation matrices into one [or more] uniform blocks first, then somehow specify to the shaders which one to reference for each vertex? Or do you load a pile of transformation matrices into a texture instead of a uniform block? What makes the association between a vertex and a specific transformation matrix? I've got a 4-bit field in my vertices to specify "which transformation matrix", but that's not to specify the primary transformation matrix (which in my engine is the same for all objects (except "large moving objects")). I could expand that field, I suppose, though I'm running low on free bits in my 64-byte vertices.
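To make the vertex-to-matrix association concrete, here's a minimal CPU-side sketch of the matrix-palette idea. The names (`Vertex`, `shadeVertex`, the palette vector) are all hypothetical, and a plain `std::vector` stands in for the uniform block; in real code the matrices would be uploaded to a UBO (or texture) once per frame, and the shader itself would do the indexed fetch with something like `mats[idx]`. The key point is just that a small index packed into spare vertex bits selects a matrix from a shared array:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// A 4x4 matrix stored column-major, as GLSL/OpenGL expects.
struct Mat4 {
    std::array<float, 16> m;
    static Mat4 identity() {
        Mat4 r{}; r.m[0] = r.m[5] = r.m[10] = r.m[15] = 1.0f; return r;
    }
    static Mat4 translation(float x, float y, float z) {
        Mat4 r = identity(); r.m[12] = x; r.m[13] = y; r.m[14] = z; return r;
    }
};

struct Vec4 { float x, y, z, w; };

// Column-major matrix * vector.
Vec4 transform(const Mat4& a, const Vec4& v) {
    return { a.m[0]*v.x + a.m[4]*v.y + a.m[8]*v.z  + a.m[12]*v.w,
             a.m[1]*v.x + a.m[5]*v.y + a.m[9]*v.z  + a.m[13]*v.w,
             a.m[2]*v.x + a.m[6]*v.y + a.m[10]*v.z + a.m[14]*v.w,
             a.m[3]*v.x + a.m[7]*v.y + a.m[11]*v.z + a.m[15]*v.w };
}

// Each vertex carries a small matrix index packed into spare bits;
// here a 4-bit field selects one of up to 16 matrices in the palette.
struct Vertex {
    Vec4 pos;
    uint32_t packedBits; // low 4 bits = matrix index
};

// What the vertex shader would do per vertex: extract the index,
// fetch palette[idx], and transform the position.
Vec4 shadeVertex(const std::vector<Mat4>& palette, const Vertex& v) {
    uint32_t idx = v.packedBits & 0xFu; // extract the 4-bit field
    return transform(palette[idx], v.pos);
}
```

With this layout the per-object state never changes between draw calls: only the palette is uploaded, and every vertex self-selects its matrix, which is exactly what lets you batch many objects into one call.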
I can see a parallel between what I do now and one possible approach along the lines you guys are talking about. As it is now, on each frame I compile a list of objects that have been "modified" (rotated, translated, or with 1+ vertices otherwise changed). Once all modifications have been performed, I process only the modified objects (transform vertices, upload to VBO, draw). If I leave local coordinates in the VBOs, then I don't need to upload the vertices of modified objects, but I do need to upload their transformation matrices. If those transformation matrices are in uniform blocks, or in a texture, I can upload the matrices all at once at the beginning, then draw in whatever [sort] order I wish. This should [potentially] make it possible to render many sequential objects in a VBO with one draw call.

This might not work all that well in average cases if I test every object against the frustum, because a large percentage of objects will be culled, and the stream of consecutive objects to draw is usually rather limited by that fact. However, if I keep my current scheme of mapping a volume of space to each VBO, then most often all objects or no objects will be drawn. Furthermore, I could test the whole 3D volume against the frustum first, and in many cases reject an entire VBO full of objects with that one fast test.
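The coarse-rejection scheme described above can be sketched as follows. This is a simplified illustration, not your engine's actual code: the names (`VboBatch`, `drawVisible`) are made up, an axis-aligned box stands in for the view frustum (a real frustum test would check the box against six planes), and the actual `glDrawArrays`/`glDrawElements` call over the batch's range is replaced by a counter:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Axis-aligned box; used here both for a VBO's bounding volume and,
// as a simplification, for the view frustum itself.
struct AABB { float min[3], max[3]; };

bool overlaps(const AABB& a, const AABB& b) {
    for (int i = 0; i < 3; ++i)
        if (a.max[i] < b.min[i] || a.min[i] > b.max[i]) return false;
    return true;
}

// A VBO mapped to a volume of space: one bounding box, one contiguous
// run of objects that can go out in a single draw call.
struct VboBatch {
    AABB bounds;
    size_t firstObject, objectCount;
};

// One coarse test per VBO: if its volume intersects the (simplified)
// frustum, draw the whole contiguous range with one call; otherwise
// reject every object in it at once.
size_t drawVisible(const std::vector<VboBatch>& batches, const AABB& frustum,
                   size_t& drawCallsOut) {
    size_t objectsDrawn = 0;
    drawCallsOut = 0;
    for (const VboBatch& b : batches) {
        if (!overlaps(b.bounds, frustum)) continue; // whole VBO culled
        // Real code: one glDrawArrays/glDrawElements over the batch's range.
        ++drawCallsOut;
        objectsDrawn += b.objectCount;
    }
    return objectsDrawn;
}
```

The payoff is in the ratio: one box test and at most one draw call per VBO, instead of one frustum test (and potentially one draw call) per object.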
Yeah, the ability to upload all modified transformation matrices at the start, then not upload anything between objects, seems promising. Which is the more efficient place to access them (uniform blocks, textures, or somewhere else)?