[quote]See, this is the problem. And frankly, I can't blame you, because --- as I'm sure you're aware --- once any of us adopts a significantly non-standard approach, that radically changes the nature of many other interactions and tradeoffs. Since you and virtually everyone else have so thoroughly internalized the tradeoffs and connections that exist in more conventional approaches, you don't (because you can't, without extraordinary time and effort) see how the change I propose impacts everything else.[/quote]
No, it doesn't take that much effort to evaluate other approaches -- there is no "conventional" approach; we're always looking for alternatives. I've used methods like this on the Wii, where optimised CPU transform routines were better than its GPU transforms (and also on the PS3, which has way more CPU power than GPU power), but without knowing the details of the rest of those specific projects, we can't judge the cost/value of those decisions here.
The problem is that your alternative isn't as great or original as you think it is.
However, if we consider two extremes -- complete CPU vertex processing vs entirely GPU-driven animation/instancing of vertices -- then there is always a continuous and interesting middle ground between them. Depending on your target platforms, you'll find that different kinds of scenes/objects are faster one way or the other. Different games also place different demands on the CPU/GPU -- some need all the CPU time they can get, and some leave the CPU idle while the GPU bottle-necks. Our job should be to always consider this entire spectrum, depending on the current situation/requirements.
[quote]I understand why this seems counterintuitive. Once you adopt the "conventional way" (meaning, you store local-coordinates in GPU memory), you are necessarily stuck changing state between every object. You cannot avoid it. At the very least you must change the transformation matrix to transform that object from its local coordinates to screen coordinates.[/quote]
No, this is utter bullshit. You haven't even understood "conventional" methods yet, and you're justifying your clever avoidance of them with these straw-man reasons? Decent multi-key sorting plus the simple state-cache on page#1 removes the majority of state changes, and is easy to implement. I use a different approach, but it still just boils down to sorting and filtering, which can be optimised greatly if you care.
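For illustration only -- the key layout and names here are my own assumptions, not the page-1 code -- a minimal C++ sketch of what "multi-key sorting plus a state-cache" boils down to:
[code]
#include <algorithm>
#include <cstdint>
#include <vector>

struct DrawItem {
    uint64_t key;       // e.g. shader/texture/VBO/depth packed into one sortable integer
    uint32_t shaderId;
    uint32_t textureId;
    uint32_t vboId;
};

struct StateCache {
    uint32_t shader = ~0u, texture = ~0u, vbo = ~0u;
};

void Submit(std::vector<DrawItem>& items)
{
    // Multi-key sort: items that share state become adjacent in the queue.
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) { return a.key < b.key; });

    StateCache cache;
    for (const DrawItem& item : items)
    {
        // The state-cache filters out redundant binds between adjacent items.
        if (item.shaderId  != cache.shader)  { /* bind shader here */  cache.shader  = item.shaderId; }
        if (item.textureId != cache.texture) { /* bind texture here */ cache.texture = item.textureId; }
        if (item.vboId     != cache.vbo)     { /* bind VBO here */     cache.vbo     = item.vboId; }
        // issue the draw call for this item here
    }
}
[/code]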
Moreover, there's no reason that you have to have "the transform matrix" -- it's not the fixed-function pipeline any more. Many static objects don't have one, and many dynamic objects have 100 of them. Objects that use a shared storage technique between them can be batched together. There's no reason I couldn't take a similar approach to your all-verts-together technique, but instead store all-transforms-together.
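As a rough sketch of that "all-transforms-together" idea -- the struct names and layout are assumptions, not an actual engine's API -- batched draws would just index into one shared buffer instead of binding a matrix per object:
[code]
#include <cstdint>
#include <vector>

struct float4x4 { float m[16]; };

// One shared pool holding every instance's matrices for the frame; it gets
// uploaded as a single constant/structured buffer rather than re-bound per object.
struct TransformPool {
    std::vector<float4x4> transforms;

    // Reserve a contiguous block of `count` matrices and return its first index.
    uint32_t Allocate(uint32_t count) {
        const uint32_t first = static_cast<uint32_t>(transforms.size());
        transforms.resize(transforms.size() + count);
        return first;
    }
};

// A draw only records where its matrices live -- static objects can use zero,
// skinned objects 100+ -- so objects sharing this storage scheme batch together.
struct BatchedDraw {
    uint32_t meshId;
    uint32_t firstTransform;
    uint32_t transformCount;
};
[/code]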
[quote]However, your argument more-or-less presumes lots of state-changes, and correctly states that with modern GPUs and modern drivers, it is a fool's game to attempt to predict much about how to optimize for state changes.
However, it is still true that a scheme that HAS [almost] NO STATE CHANGES is still significantly ahead of those that do. Unfortunately, it is a fool's game to attempt to predict how far ahead for every combination of GPU and driver![/quote]
I don't agree with how you've characterised that. I meant to imply that state-change costs are unintuitive, but that they are still something that can be measured. If it can be measured, it can be optimised. If it can't be measured, then your optimisations are just voodoo.
Also, just because you've got 0 state changes, that in no way automatically grants better performance. Perhaps the 1000 state-changes that you avoided wouldn't have been a bottle-neck anyway; they might have been free in the GPU pipeline and only taken 0.1ms of CPU time to submit? Maybe the optimisation you've made to avoid them is 1000x slower? Have you measured both techniques with a set of different scenes?
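For what it's worth, a hedged sketch of what "measure both techniques" could look like on the CPU side; the two submission paths here are hypothetical stand-ins, and GPU cost would additionally need the API's timestamp/timer queries:
[code]
#include <chrono>
#include <cstdio>

template <typename Frame>
double MillisecondsFor(Frame&& frame)
{
    const auto t0 = std::chrono::high_resolution_clock::now();
    frame();
    const auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    // Stand-ins for the two real submission paths being compared (hypothetical).
    auto submitWithStateChanges = [] { /* conventional path: sort, bind, draw */ };
    auto submitNoStateChanges   = [] { /* pre-transformed path: one big draw  */ };

    const double a = MillisecondsFor(submitWithStateChanges);
    const double b = MillisecondsFor(submitNoStateChanges);
    std::printf("CPU submission: %.3fms with state changes, %.3fms without\n", a, b);

    // This only measures CPU submission cost; repeat over several
    // representative scenes (and add GPU timing) before drawing conclusions.
    return 0;
}
[/code]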
[quote]I really don't know what you mean when you say Battlefield 3 instances everything. The only thing I can figure is... there are ZERO objects that only exist once. If you have a [[font=courier new,courier,monospace]Soldier0[/font]] object, then you have several instances of [[font=courier new,courier,monospace]Soldier0[/font]]... only with the kind of variations from one displayed-instance to another that instancing provides. If that's what you mean, then I understand what you mean. But that sounds like a mighty strange game! Having said that, I can certainly imagine lots of games where a large majority of displayed objects are instanced... blades of grass, trees, leaves, bricks, stepping stones, and so forth. Maybe you don't mean "everything" literally.[/quote]
Maybe you don't want to waste VBO memory on storing a unique mesh for every "[font=courier new,courier,monospace]Soldier[/font]" in the game, so you only store a single (local space) "base mesh" that's kind of an average soldier. Your art team still models 8 unique soldiers that the game requires, as usual, but they are remapped as deltas from the base mesh. You then store your 1 base-mesh and 8 delta/morph-meshes together, and then instance 100 characters using the 1 base-mesh, each of which will be drawn as one of the 8 variations as required/indexed.
In your system, we not only need the original 9 data sets, but also enough memory for the 100 transformed data-sets. We then need enough GPU memory for another ~200 (n-hundred) transformed meshes, because it's dynamic data and the driver is going to want to ~double buffer (n-buffer) it.
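To make that contrast concrete, here's a hedged sketch of the "conventional" layout from two paragraphs up -- the struct names and vertex formats are my assumptions for illustration. The static streams are stored once, and the per-instance data never scales with vertex count:
[code]
#include <cstdint>
#include <vector>

struct float2   { float x, y; };
struct float3   { float x, y, z; };
struct float4x4 { float m[16]; };

// Static data, stored once and shared by every drawn soldier:
struct BaseVertex  { float3 position; float3 normal; float2 uv; };  // the "average soldier"
struct DeltaVertex { float3 positionDelta; float3 normalDelta; };   // per-variation morph data

struct SoldierMeshSet {
    std::vector<BaseVertex>               baseMesh;        // 1 x vertex count
    std::vector<std::vector<DeltaVertex>> variationDeltas; // 8 x vertex count
};

// Dynamic, per drawn instance -- the only data that grows with instance count:
struct SoldierInstance {
    float4x4 worldTransform;  // or an offset into a shared bone-palette buffer
    uint32_t variationIndex;  // which of the 8 delta streams the vertex shader adds
};
[/code]
The vertex shader would then reconstruct each vertex as base + delta[variationIndex] before transforming it, so the per-instance footprint stays at a handful of bytes rather than a full mesh copy.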
I really don't know why I bothered with this, as you're being very dismissive of the real criticism that you solicited, but again, look at your memory trade-offs... The "conventional" requirements look like:
[attachment=9490:1.png]
whereas yours looks like:
[attachment=9491:2.png]
i.e. conventional memory usage
= Models.Verts * sizeof(Vert) // local data
+ Instances * Model.Bones * sizeof(float4x4) // transforms
your memory usage
= Models.Verts * sizeof(Vert) // local data
+ Instances * Model.Bones * sizeof(float4x4) // transforms
+ Instances * Models.Verts * sizeof(Vert) // world data
+ Instances * Models.Verts * sizeof(Vert) * N // driver dynamic-VBO buffering
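Plugging some assumed sizes into both formulas (32-byte vertices, 64-byte matrices, N = 2 -- none of these numbers come from the thread) for the 100-character / 100-bone / 1M-vertex example mentioned below:
[code]
#include <cstdint>
#include <cstdio>

int main()
{
    const uint64_t verts      = 1000000;  // Models.Verts (1M-vertex model)
    const uint64_t instances  = 100;      // Instances
    const uint64_t bones      = 100;      // Model.Bones
    const uint64_t vertSize   = 32;       // sizeof(Vert), assumed
    const uint64_t matrixSize = 64;       // sizeof(float4x4)
    const uint64_t N          = 2;        // driver dynamic-VBO buffering, assumed

    const uint64_t conventional = verts * vertSize                  // local data
                                + instances * bones * matrixSize;   // transforms

    const uint64_t proposed     = verts * vertSize                  // local data
                                + instances * bones * matrixSize    // transforms
                                + instances * verts * vertSize      // world data
                                + instances * verts * vertSize * N; // driver buffering

    std::printf("conventional: %llu MB\n", (unsigned long long)(conventional / 1000000));
    std::printf("proposed:     %llu MB\n", (unsigned long long)(proposed / 1000000));
    // Roughly 32 MB vs ~9.6 GB with these assumptions.
    return 0;
}
[/code]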
If you're going next-gen, you're looking at huge object counts and huge vertex counts. In the first equation above, there is no relation between the two -- increasing instances does not impact vertex storage, and increasing vertices does not increase the per-instance storage cost. However, in the latter, they are connected -- increasing the amount of detail in a model also increases your per-instance memory costs.
Seeing as modern games are getting more and more dynamic (where more things are moving every frame) and more and more detailed (where both vertex and instance counts are increasing greatly), I'd give your users the option to keep their costs independent of each other.
e.g. the example I gave earlier -- of 100 animated characters, with 100 bones each, using a 1M vertex model -- clearly is much more efficient in the "conventional" method than in yours. So if you wanted to support games based around animated characters, you might want to support the option of a different rendering mode for them.