Why not just use the API properly, and then take measures to merge objects into single batches later when profiling says you have to?
"All of these actions will interrupt the pipeline and - while they won't directly cause pipeline stalls - they will cause a break in the flow of commands and data from the CPU to the GPU."
That's a bit of an exaggeration. All that any GL function that talks to the GPU does is write command packets into a queue (which is read by the GPU many dozens of milliseconds later). Writing new command packets can't cause a break in the flow of already-written packets, nor will it somehow stall later packets.
On the GPU end, there is hardware reading/decoding this queue in parallel with actually doing work. As long as you're submitting large enough groups of work (and GPU groups/batches don't correspond to 'batches' on the CPU, which are usually regarded as individual glDraw* calls -- the GPU can merge multiple CPU draws into a single group depending on conditions), then executing the groups will take longer than decoding them, so decoding (which includes applying/preparing state changes) is pretty much free.
i.e. when you're giving the GPU enough work per batch, the pipeline looks like:
Decode #1|Decode #2| |Decode #3|
| Run #1 | Run #2 | Run #3 |
and if you're not giving it enough work, it might look like:
Decode #1| Decode #2 | Decode #3 |
|Run #1|stall|Run #2|stall|Run #3|
And yes, if you run into the second case, then to fix it you may want to increase the number of pixels/vertices processed per draw call, and one way to do that may be to merge shaders, which may in turn require the merging of vertex formats... But all that is an optimisation topic, which means it should be done under the supervision of a profiling tool.
N.B. the first pipeline diagram above actually has a 'break' between Decode #2 and Decode #3 (i.e. in the flow of commands from the CPU->GPU), but that isn't a bad thing ;)
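To make the two diagrams above a bit more concrete, here's a rough toy model in Python. It's only a sketch under my own assumptions: one decoder and one execution unit, each working through the groups in order, with the decoder never waiting on anything. The function name and the time units are made up for illustration; real hardware is a lot more complicated than this.

```python
def total_stall(decode_times, run_times):
    """Toy model: one decoder and one execution unit, each working serially.

    A group can only start running once it has been decoded and the
    previous group has finished running. Returns the total time the
    execution unit sits idle between groups.
    """
    decode_end = 0.0
    run_end = 0.0
    stall = 0.0
    for i, (decode, run) in enumerate(zip(decode_times, run_times)):
        decode_end += decode                  # decoder works through the queue in order
        run_start = max(decode_end, run_end)  # wait for both the decode and the previous run
        if i > 0:
            stall += run_start - run_end      # idle gap before this group starts
        run_end = run_start + run
    return stall

# Enough work per group: running takes longer than decoding.
print(total_stall([1, 1, 1], [3, 3, 3]))

# Too little work per group: decoding takes longer than running.
print(total_stall([3, 3, 3], [1, 1, 1]))
```

In the first call the execution unit never waits, so decoding is hidden entirely; in the second it idles before every group after the first, which is the 'stall' case from the second diagram.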
As for "saving precious space", this isn't about saving RAM. Yep, RAM is cheap and ever growing. The reason you want to save space is bandwidth.
Below are the specs on a high-end and low-end model GPU from 3 different generations of nVidia cards:
Model               Bandwidth@60Hz  Memory
------------------  --------------  -------
GeForce 8400 GS     109 MiB/frame   512 MiB
GeForce 8800 Ultra  1.73 GiB/frame  768 MiB
GeForce 205         137 MiB/frame   512 MiB
GeForce GTX 285     2.65 GiB/frame  2 GiB
GeForce GT 620      246 MiB/frame   1 GiB
GeForce GTX 680     3.2 GiB/frame   4 GiB
As you can see, the high-end cards can pretty much read or write every byte of their memory around once per frame, but the low-end cards can only touch about a quarter of their RAM in any given frame.
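If you want to check that arithmetic yourself, here's a small Python sketch that divides each card's per-frame bandwidth by its memory size. The figures are copied straight from the table; the `coverage` helper is just a name I made up for this example.

```python
# Per-frame bandwidth and memory size, copied from the table above (in MiB).
GIB = 1024  # MiB per GiB

cards = [
    ("GeForce 8400 GS", 109, 512),
    ("GeForce 8800 Ultra", 1.73 * GIB, 768),
    ("GeForce 205", 137, 512),
    ("GeForce GTX 285", 2.65 * GIB, 2 * GIB),
    ("GeForce GT 620", 246, 1 * GIB),
    ("GeForce GTX 680", 3.2 * GIB, 4 * GIB),
]

def coverage(bandwidth_mib_per_frame, memory_mib):
    """How many times the whole memory could be touched in one frame."""
    return bandwidth_mib_per_frame / memory_mib

for name, bandwidth, memory in cards:
    print(f"{name:<20} {coverage(bandwidth, memory):.2f}x of memory per frame")
```

The low-end cards all land between roughly 0.2x and 0.3x, while the high-end ones range from about 0.8x to 2.3x.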
Moreover, large parts of your RAM have to be read/written more than once in a frame -- render targets with blending will require multiple reads/writes per frame, texels will likely be read many times, VBOs are shared between different models and thus reused, and even within the drawing of a single mesh, verts are shared between triangles (and will be redundantly reshaded upon cache miss, about half the time).
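As a back-of-the-envelope illustration of why vertex size matters here, the Python sketch below compares a fat vertex layout with a lean one against the 8400 GS per-frame figure from the table. The vertex count and both layouts are hypothetical numbers I picked for the example, and it optimistically assumes every vertex is fetched exactly once, so real traffic would likely be higher.

```python
MIB = 1024 * 1024

def vertex_fetch_mib(vertex_count, bytes_per_vertex):
    """MiB read per frame if every vertex is fetched exactly once."""
    return vertex_count * bytes_per_vertex / MIB

# Hypothetical scene and vertex layouts, not taken from any real game.
VERTS_PER_FRAME = 500_000
FAT_VERTEX_BYTES = 64    # e.g. one layout carrying every attribute any mesh might need
LEAN_VERTEX_BYTES = 24   # e.g. position + normal only (2 x 3 floats)

BUDGET_MIB = 109         # GeForce 8400 GS per-frame bandwidth from the table

for label, size in (("fat", FAT_VERTEX_BYTES), ("lean", LEAN_VERTEX_BYTES)):
    mib = vertex_fetch_mib(VERTS_PER_FRAME, size)
    print(f"{label}: {mib:.1f} MiB/frame, {100 * mib / BUDGET_MIB:.0f}% of budget")
```

Under those assumptions the fat layout eats well over twice the share of the frame's bandwidth before a single texel or render target has been touched.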
When you get to profiling, it's just as likely that some of the fixes you'll have to apply will be bandwidth-saving measures, which could be the opposite of the above -- e.g. splitting a single shader into multiple ones that take different vertex inputs and sample different numbers of textures.
Edited by Hodgman, 18 November 2012 - 08:13 PM.
To add a counter-viewpoint --- using a single fat vertex format for all meshes, to avoid some possible future performance problem, is definitely a premature optimisation in my eyes.