
most efficient general rendering strategies for new GPUs

Started by June 09, 2012 12:12 AM
92 comments, last by maxgpgpu 12 years, 3 months ago

Please clarify. You are talking about these variables incrementing once per object on utterly diverse objects? Not incrementing from tree #1 to tree #1000 as per normal instancing practice, but instance #1 is a tree, instance #2 is a car, instance #3 is your mother-in-law, etc.? If the latter is true, exactly what causes this built-in gl_InstanceID to increment? And when it does increment, where is the array of transformation matrices, if not in a uniform block?

I know about instancing, but if this applies in some convenient way to diverse objects, then you are correct --- I did not notice this, and it is very cool.

I'm not sure just where you're getting that idea from - what I am talking about here is standard instancing.

You seem to be missing the point that even with standard instancing you will have multiple instances of the same object. You won't just have 10,000 completely unique objects in your scene - you'll have many cars, many trees, many people. So you sort them by object type and draw them using standard instancing.
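A minimal sketch of that sort-and-batch step, in plain C. The `Instance` record and mesh ids are hypothetical illustrations; the actual draw would be one glDrawElementsInstanced() per contiguous run of the same mesh:

```c
#include <stdlib.h>

/* Hypothetical per-instance record: which mesh this instance uses, plus
 * its per-instance data (here just a transform). */
typedef struct {
    int   mesh_id;     /* tree = 0, car = 1, soldier = 2, ... */
    float world[16];   /* per-instance transform */
} Instance;

static int by_mesh(const void *a, const void *b) {
    return ((const Instance *)a)->mesh_id - ((const Instance *)b)->mesh_id;
}

/* Sort a mixed scene by mesh type and return the number of instanced
 * draw calls needed: one glDrawElementsInstanced() per contiguous run. */
int count_instanced_draws(Instance *objs, int n) {
    qsort(objs, n, sizeof(Instance), by_mesh);
    int draws = 0;
    for (int i = 0; i < n; ++i)
        if (i == 0 || objs[i].mesh_id != objs[i - 1].mesh_id)
            ++draws;   /* issue one draw for the whole run here */
    return draws;
}
```

This way, 10,000 mixed objects drawn from, say, 50 distinct meshes collapse to 50 draw calls, with the transforms fed in as per-instance vertex attributes or a uniform block indexed by gl_InstanceID.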

You also seem to have this odd idea that re-upping vertexes is going to be much faster than re-upping matrixes. Maybe you have a hypothetical best case where it is (as you mention, only when objects change), but your worst case (when everything needs to change) is going to be at least an order of magnitude slower. And that matters: maintaining a stable and consistent framerate gives a much better player experience than having a framerate that's see-sawing all over the place.
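As a rough sanity check on that worst case, here is the per-frame upload arithmetic for each scheme, assuming a 32-byte vertex (position + normal + uv) and one float 4x4 matrix per object; the numbers are illustrative, not measurements:

```c
/* Rough worst-case per-frame upload cost of each scheme, in bytes.
 * The 32-byte vertex size is an assumption for illustration. */
unsigned long long matrix_upload_bytes(unsigned long long moving_objects) {
    return moving_objects * 16ULL * 4ULL;             /* one float 4x4 each */
}
unsigned long long vertex_upload_bytes(unsigned long long moving_objects,
                                       unsigned long long verts_per_object) {
    return moving_objects * verts_per_object * 32ULL; /* whole mesh re-upload */
}
```

With all 10,000 objects moving and 1,000 vertices each, that's 640 KB of matrices versus 320 MB of vertices per frame -- a 500x gap, consistent with "at least an order of magnitude".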

[quote name='maxgpgpu']That's clearly more overhead uploading than the conventional way, but then again the conventional way must perform massively more transfers of matrices to the GPU (one per object versus one for all objects), and must call some kind of glDrawElements() function 10,000 times (once per object) instead of just once (or more commonly 10 to 100 times).[/quote]

Again, no, no, no, no, no. The example of Battlefield 3 was given to you - it instances everything. It does not make one draw call per object per frame. What part of that is not going into your brain?

You have a theoretical setup - BF3 is out there saying "Eppur si muove" ("And yet it moves"). This works. It's a solved problem.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.



I really don't know what you mean when you say Battlefield 3 instances everything. The only thing I can figure is... there are ZERO objects that only exist once. If you have a GeneralPatton object, then you have several instances of GeneralPatton... only with the kind of variations from one displayed instance to another that instancing provides. If that's what you mean, then I understand what you mean. But that sounds like a mighty strange game! Having said that, I can certainly imagine lots of games where a large majority of displayed objects are instanced... blades of grass, trees, leaves, bricks, stepping stones, and so forth. Maybe you don't mean "everything" literally.

[quote name='maxgpgpu' timestamp='1339696164' post='4949215']
If the CPU doesn't transform ALL object vertices to world-coordinates, it doesn't know the object AABB.


It does - you calculate an initial bbox from the initially untransformed vertexes at load time, then you just transform the corners of the bbox. 2 transforms versus one per-vertex is all that you need.

Even Quake 2 did that back in 1997. Really, you're coming across as if you've created an artificially complex solution to a problem that has already been solved in a much simpler, cleaner and more performant manner well over a decade ago.
[/quote]
Yes, that works if you accept a larger AABB than is actually required by the object in some to many orientations of the object. And for lots of purposes in lots of applications that's probably just fine. And yes, the framerate must remain constant.
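For reference, the corner-transform trick described above can be sketched like this (8 corners in the general rotating case; the 2-transform figure applies when the transform is translation-only). Column-major 4x4 matrices are assumed:

```c
#include <float.h>

/* 4x4 column-major matrix * point (w = 1), position part only. */
static void xform_point(const float m[16], const float p[3], float out[3]) {
    for (int r = 0; r < 3; ++r)
        out[r] = m[r] * p[0] + m[4 + r] * p[1] + m[8 + r] * p[2] + m[12 + r];
}

/* World-space AABB from a precomputed local AABB: transform the 8 corners
 * and take the min/max -- 8 transforms instead of one per vertex. */
void transform_aabb(const float m[16], const float lo[3], const float hi[3],
                    float out_lo[3], float out_hi[3]) {
    for (int i = 0; i < 3; ++i) { out_lo[i] = FLT_MAX; out_hi[i] = -FLT_MAX; }
    for (int c = 0; c < 8; ++c) {
        float p[3] = { (c & 1) ? hi[0] : lo[0],
                       (c & 2) ? hi[1] : lo[1],
                       (c & 4) ? hi[2] : lo[2] }, w[3];
        xform_point(m, p, w);
        for (int i = 0; i < 3; ++i) {
            if (w[i] < out_lo[i]) out_lo[i] = w[i];
            if (w[i] > out_hi[i]) out_hi[i] = w[i];
        }
    }
}
```

The result is conservative: under rotation the rotated box's AABB can be larger than the AABB of the rotated vertices, which is exactly the looseness being conceded above.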

[quote name='maxgpgpu']So on a typical frame, somewhere between 0 and 10~20 objects out of 10,000 get uploaded to the GPU in my proposed technique. That's clearly more overhead uploading than the conventional way, but then again the conventional way must perform massively more transfers of matrices to the GPU (one per object versus one for all objects),[/quote]
Yeah, that makes no sense. First, if you want to maximize shaders and whatnot, nobody wants to upload vertices each frame. You're trading the cost of sending a matrix vs the cost of sending, say, 1,000 vertices per object (for moving objects, which you say may be 10 per frame)? Really? Are you doing any AA, SSAO, anisotropic filtering?

I'm pretty sure I had a discussion on here before and it must have been you. Are you putting your whole scene into 1 VBO and calling draw on it? Do you perform any culling? If not, you are the dude I am thinking of, and again I will say you are completely wrong. I want to see some snapshots, because yes, stuff is fast, so theoretically you can do whatever you want, but if you are rendering in 1080p with some cranked-up effects, then you are missing performance. You also imply that calling glDraw... for each object is bad. Well, it's not so bad if the GPU is queued up for work, because it's probably already so far behind in draw commands that your new ones are not even slowing it down.

[quote name='maxgpgpu']So, if no objects rotate or move, no objects are uploaded to the GPU.[/quote]
Furthermore, if you are the one without culling and 1 VBO for the whole scene, then you're telling me you agree with wasting the GPU's time drawing, say, a 10K-poly statue model (or any model for that matter) even when you can't see it, just to save a tiny draw call. Not to mention that for indoor scenes you see 2% of the world, and outdoors you only see 1/2 of the world.

Clarify whether you are or aren't the guy I'm thinking of, and if so, I want to see pics. Because the trees in my game are like 20K polys each. I could get away with your method if they were like 100 polys each, but I want my stuff to look good, which means tons more polys, and that means you will never achieve 1 VBO with 20K-poly trees x say 100 trees.

NBA2K, Madden, Maneater, Killing Floor, Sims


[quote name='maxgpgpu']See, this is the problem. And frankly, I can't blame you, because --- as I'm sure you're aware --- once any of us adopts a significantly non-standard approach, that radically changes the nature of many other interactions and tradeoffs. Since you and virtually everyone else have so thoroughly internalized the tradeoffs and connections that exist in more conventional approaches, you don't (because you can't, without extraordinary time and effort) see how the change I propose impacts everything else.[/quote]
No, it doesn't take that much effort to evaluate other approaches -- there is no "conventional" approach; we're always looking for alternatives. I've used methods like this on the Wii, where optimised CPU transform routines were better than its GPU transforms (and also on the PS3, which has way more CPU power than GPU power), but without knowing the details of the rest of those specific projects, we can't judge the cost/value of those decisions here.

The problem is that your alternative isn't as great or original as you think it is. If we consider two extremes -- complete CPU vertex processing vs entirely GPU-driven animation/instancing of vertices -- then there is always a continuous and interesting middle ground between them. Depending on your target platforms, you'll find different kinds of scenes/objects are faster one way or the other. Different games also place different demands of their own on the CPU/GPU -- some need all the CPU time they can get, and some leave the CPU idle while the GPU bottlenecks. Our job should be to always consider this entire spectrum, depending on the current situation/requirements.

[quote name='maxgpgpu']I understand why this seems counterintuitive. Once you adopt the "conventional way" (meaning, you store local-coordinates in GPU memory), you are necessarily stuck changing state between every object. You cannot avoid it. At the very least you must change the transformation matrix to transform that object from its local coordinates to screen coordinates.[/quote]
No, this is utter bullshit. You haven't even understood "conventional" methods yet, and you're justifying your clever avoidance of them with these straw-man reasons? Decent multi-key sorting plus the simple state-cache on page #1 removes the majority of state changes, and is easy to implement. I use a different approach, but it still just boils down to sorting and filtering, which can be optimised greatly if you care.

Moreover, there's no reason that you have to have "the transform matrix" -- it's not the fixed-function pipeline any more. Many static objects don't have one, and many dynamic objects have 100 of them. Objects that use a shared storage technique between them can be batched together. There's no reason I couldn't take a similar approach to your all-verts-together technique, but instead store all-transforms-together.

[quote name='maxgpgpu']However, your argument more-or-less presumes lots of state changes, and correctly states that with modern GPUs and modern drivers, it is a fool's game to attempt to predict much about how to optimize for state changes.

However, it is still true that a scheme that has [almost] NO state changes is still significantly ahead of those that do. Unfortunately, it is a fool's game to attempt to predict how far ahead for every combination of GPU and driver![/quote]
I don't agree with how you've characterised that. I meant to imply that state-change costs are unintuitive, but they are still something that can be measured. If it can be measured, it can be optimised. If it can't be measured, then your optimisations are just voodoo.

Also, just because you've got 0 state changes, that in no way automatically grants better performance. Perhaps the 1000 state changes that you avoided wouldn't have been a bottleneck anyway; they might have been free in the GPU pipeline and only taken 0.1ms of CPU time to submit. Maybe the optimisation you've made to avoid them is 1000x slower? Have you measured both techniques with a set of different scenes?

[quote name='maxgpgpu']I really don't know what you mean when you say Battlefield 3 instances everything. The only thing I can figure is... there are ZERO objects that only exist once. If you have a Soldier0 object, then you have several instances of Soldier0... only with the kind of variations from one displayed instance to another that instancing provides. If that's what you mean, then I understand what you mean. But that sounds like a mighty strange game! Having said that, I can certainly imagine lots of games where a large majority of displayed objects are instanced... blades of grass, trees, leaves, bricks, stepping stones, and so forth. Maybe you don't mean "everything" literally.[/quote]
Maybe you don't want to waste VBO memory on storing a unique mesh for every "[font=courier new,courier,monospace]Soldier[/font]" in the game, so you only store a single (local space) "base mesh" that's kind of an average soldier. Your art team still models 8 unique soldiers that the game requires, as usual, but they are remapped as deltas from the base mesh. You then store your 1 base-mesh and 8 delta/morph-meshes together, and then instance 100 characters using the 1 base-mesh, each of which will be drawn as one of the 8 variations as required/indexed.
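A CPU-side sketch of that base-plus-delta idea, assuming positions stored as flat xyz arrays (on the GPU the add would live in the vertex shader, with the variant index supplied per instance):

```c
#include <stddef.h>

/* Hypothetical morph-variant scheme: one shared base mesh plus a small
 * delta mesh per variant; every instance stores only a variant index.
 * Resolving a variant is just a per-component add. */
void resolve_variant(const float *base, const float *delta,
                     size_t vert_count, float *out) {
    for (size_t i = 0; i < vert_count * 3; ++i)   /* xyz per vertex */
        out[i] = base[i] + delta[i];
}
```

The payoff is that the 100 drawn characters share the storage of 1 base mesh plus 8 deltas, instead of 100 full meshes.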

In your system, we not only need the original 9 data sets, but also enough memory for the 100 transformed data sets. We then need enough GPU memory for another ~200 (n-hundred) transformed meshes, because it's dynamic data and the driver is going to want to ~double-buffer (n-buffer) it.

I really don't know why I bothered with this, as you're being very dismissive of the real criticism that you solicited, but again, look at your memory trade-offs... The "conventional" requirements look like:
[attachment=9490:1.png]
whereas yours looks like:
[attachment=9491:2.png]

i.e. conventional memory usage
= Model.Verts * sizeof(Vert) // local data
+ Instances * Model.Bones * sizeof(float4x4) // transforms

your memory usage
= Model.Verts * sizeof(Vert) // local data
+ Instances * Model.Bones * sizeof(float4x4) // transforms
+ Instances * Model.Verts * sizeof(Vert) // world data
+ Instances * Model.Verts * sizeof(Vert) * N // driver dynamic VBO buffering

If you're going next-gen, you're looking at huge object counts and huge vertex counts. In the first equation above, there is no relation between the two -- increasing instances does not impact vertex storage, and increasing vertices does not increase the per-instance storage cost. However, in the latter, they are connected -- increasing the amount of detail in a model also increases your per-instance memory costs.
Seeing as modern games are getting more and more dynamic (where more things are moving every frame) and more and more detailed (where both vertex and instance counts are increasing greatly), I'd give your users the option to keep their costs independent of each other.

e.g. the example I gave earlier -- of 100 animated characters, with 100 bones each, using a 1M vertex model -- clearly is much more efficient in the "conventional" method than in yours. So if you wanted to support games based around animated characters, you might want to support the option of a different rendering mode for them.
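Plugging that example into the two memory equations is straightforward; the sketch below assumes sizeof(Vert) = 32, sizeof(float4x4) = 64, and N = 2 driver buffers (all illustrative sizes):

```c
/* Direct transcription of the two memory equations above, in bytes.
 * A 32-byte vertex and 64-byte float4x4 are assumed sizes. */
unsigned long long conventional_bytes(unsigned long long verts,
                                      unsigned long long instances,
                                      unsigned long long bones) {
    return verts * 32ULL                 /* local vertex data */
         + instances * bones * 64ULL;    /* transforms        */
}
unsigned long long world_copy_bytes(unsigned long long verts,
                                    unsigned long long instances,
                                    unsigned long long bones,
                                    unsigned long long n_buffers) {
    return conventional_bytes(verts, instances, bones)
         + instances * verts * 32ULL                 /* world-space copies */
         + instances * verts * 32ULL * n_buffers;    /* driver N-buffering */
}
```

For 100 characters with 100 bones each on a 1M-vertex model, the conventional scheme comes to roughly 32.6 MB, while the world-copy scheme comes to roughly 9.6 GB.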

How is the resolution you're rendering at affected AT ALL by the number of vertices you're uploading/saving? You're drawing the same amount either way; the only speed difference is in upload time and call time.
Your GPU is running pixel shaders on more pixels. Do you want it to spend more time shading those, or taking a piss while you upload a ton of vertices before it can even start to draw them? By leaving the data on the GPU, it is immediately able to start transforms and shaders. What I'm getting at is his stuff might work for a scene with low-poly objects, low resolution, and low effects.


[quote]How is the resolution you're rendering at affected AT ALL by the number of vertices you're uploading/saving? You're drawing the same amount either way; the only speed difference is in upload time and call time.[/quote]
This is completely off-topic, but... as you lower your resolution, you should also lower the LOD level of your detailed models (lower number of triangles, larger size each) -- the GPU performs pixel-shading in 2x2 pixel quads, triangles smaller than 2x2 pixels still effectively run the pixel shader 4 times for the whole 2x2 quad. So for example, if you are using single-pixel-sized triangles, then your pixel-shaders all suddenly become 4x more expensive. I've seen a huge speed-up in pixel-shading times before by simply adding LOD support to a game.
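A toy model of that 2x2-quad effect, for illustration only (real rasterizers also pay partial-quad overhead along the edges of large triangles):

```c
/* Toy model of 2x2-quad shading cost: every quad a triangle touches runs
 * the pixel shader on all 4 lanes, so tiny triangles pay for 4 lanes per
 * touched quad. Illustrative, not a rasterizer simulation. */
double shader_invocations(double pixels_covered, double pixels_per_triangle) {
    double triangles = pixels_covered / pixels_per_triangle;
    double quads_per_tri = pixels_per_triangle / 4.0;
    if (quads_per_tri < 1.0) quads_per_tri = 1.0; /* a triangle touches >= 1 quad */
    return triangles * quads_per_tri * 4.0;       /* invocations, 4 lanes/quad */
}
```

One-pixel triangles filling a 1M-pixel screen cost about 4M invocations in this model, versus about 1M for 16-pixel triangles -- the 4x penalty described above.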
[quote]Your GPU is running pixel shaders on more pixels. Do you want it to spend more time shading those, or taking a piss while you upload a ton of vertices before it can even start to draw them? By leaving the data on the GPU, it is immediately able to start transforms and shaders. What I'm getting at is his stuff might work for a scene with low-poly objects, low resolution, and low effects.[/quote]
A high-spec PC usually won't block on uploads like this; it will instead just add latency by increasing the number of frames it buffers commands before execution (which results in input lag and higher memory requirements).

