
most efficient general rendering strategies for new GPUs

Started by June 09, 2012 12:12 AM
92 comments, last by maxgpgpu 12 years, 3 months ago

Please clarify. You are talking about these variables incrementing once per object on utterly diverse objects? Not incrementing from tree #1 to tree #1000 as per normal instancing practice, but instance #1 is a tree, instance #2 is a car, instance #3 is your mother-in-law, etc.? If the latter is true, exactly what causes this built-in gl_InstanceID to increment? And when it does increment, where is the array of transformation matrices, if not in a uniform block?

I know about instancing, but if this applies in some convenient way to diverse objects, then you are correct --- I did not notice this, and it is very cool.

I'm not sure just where you're getting that idea from - what I am talking about here is standard instancing.

You seem to be missing the point that even with standard instancing you will have multiple instances of the same object. You won't just have 10,000 completely unique objects in your scene - you'll have many cars, many trees, many people. So you sort them by object type and draw them using standard instancing.
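A minimal sketch of that sort-and-batch step, in plain C. The `Instance` record and mesh ids are hypothetical illustrations; the actual draw would be one glDrawElementsInstanced() per contiguous run of the same mesh:

```c
#include <stdlib.h>

/* Hypothetical per-instance record: which mesh this instance uses, plus
 * its per-instance data (here just a transform). */
typedef struct {
    int   mesh_id;     /* tree = 0, car = 1, soldier = 2, ... */
    float world[16];   /* per-instance transform */
} Instance;

static int by_mesh(const void *a, const void *b) {
    return ((const Instance *)a)->mesh_id - ((const Instance *)b)->mesh_id;
}

/* Sort a mixed scene by mesh type and return the number of instanced
 * draw calls needed: one glDrawElementsInstanced() per contiguous run. */
int count_instanced_draws(Instance *objs, int n) {
    qsort(objs, n, sizeof(Instance), by_mesh);
    int draws = 0;
    for (int i = 0; i < n; ++i)
        if (i == 0 || objs[i].mesh_id != objs[i - 1].mesh_id)
            ++draws;   /* issue one draw for the whole run here */
    return draws;
}
```

This way, 10,000 mixed objects drawn from, say, 50 distinct meshes collapse to 50 draw calls, with the transforms fed in as per-instance vertex attributes or a uniform block indexed by gl_InstanceID.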

You also seem to have this odd idea that re-upping vertexes is going to be much faster than re-upping matrixes. Maybe you have a hypothetical best case where it is (as you mention, only when objects change), but your worst case (when everything needs to change) is going to be at least an order of magnitude slower. And that matters: maintaining a stable and consistent framerate gives a much better player experience than having a framerate that's see-sawing all over the place.
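As a rough sanity check on that worst case, here is the per-frame upload arithmetic for each scheme, assuming a 32-byte vertex (position + normal + uv) and one float 4x4 matrix per object; the numbers are illustrative, not measurements:

```c
/* Rough worst-case per-frame upload cost of each scheme, in bytes.
 * The 32-byte vertex size is an assumption for illustration. */
unsigned long long matrix_upload_bytes(unsigned long long moving_objects) {
    return moving_objects * 16ULL * 4ULL;             /* one float 4x4 each */
}
unsigned long long vertex_upload_bytes(unsigned long long moving_objects,
                                       unsigned long long verts_per_object) {
    return moving_objects * verts_per_object * 32ULL; /* whole mesh re-upload */
}
```

With all 10,000 objects moving and 1,000 vertices each, that's 640 KB of matrices versus 320 MB of vertices per frame -- a 500x gap, consistent with "at least an order of magnitude".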

[quote name='maxgpgpu']That's clearly more overhead uploading than the conventional way, but then again the conventional way must perform massively more transfers of matrices to the GPU (one per object versus one for all objects), and must call some kind of glDrawElements() function 10,000 times (once per object) instead of just once (or more commonly 10 to 100 times).[/quote]

Again, no, no, no, no, no. The example of Battlefield 3 was given to you - it instances everything. It does not make one draw call per object per frame. What part of that is not going into your brain?

You have a theoretical setup - BF3 is out there saying "Eppur si muove" ("And yet it moves"). This works. It's a solved problem.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.



I really don't know what you mean when you say Battlefield 3 instances everything. The only thing I can figure is... there are ZERO objects that only exist once. If you have a GeneralPatton object, then you have several instances of GeneralPatton... only with the kind of variations from one displayed instance to another that instancing provides. If that's what you mean, then I understand what you mean. But that sounds like a mighty strange game! Having said that, I can certainly imagine lots of games where a large majority of displayed objects are instanced... blades of grass, trees, leaves, bricks, stepping stones, and so forth. Maybe you don't mean "everything" literally.

[quote name='maxgpgpu' timestamp='1339696164' post='4949215']
If the CPU doesn't transform ALL object vertices to world-coordinates, it doesn't know the object AABB.


It does - you calculate an initial bbox from the initially untransformed vertexes at load time, then you just transform the corners of the bbox. 2 transforms versus one per-vertex is all that you need.

Even Quake 2 did that back in 1997. Really, you're coming across as if you've created an artificially complex solution to a problem that has already been solved in a much simpler, cleaner and more performant manner well over a decade ago.
[/quote]
Yes, that works if you accept a larger AABB than is actually required by the object in some to many orientations of the object. And for lots of purposes in lots of applications that's probably just fine. And yes, the framerate must remain constant.
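For reference, the corner-transform trick described above can be sketched like this (8 corners in the general rotating case; the 2-transform figure applies when the transform is translation-only). Column-major 4x4 matrices are assumed:

```c
#include <float.h>

/* 4x4 column-major matrix * point (w = 1), position part only. */
static void xform_point(const float m[16], const float p[3], float out[3]) {
    for (int r = 0; r < 3; ++r)
        out[r] = m[r] * p[0] + m[4 + r] * p[1] + m[8 + r] * p[2] + m[12 + r];
}

/* World-space AABB from a precomputed local AABB: transform the 8 corners
 * and take the min/max -- 8 transforms instead of one per vertex. */
void transform_aabb(const float m[16], const float lo[3], const float hi[3],
                    float out_lo[3], float out_hi[3]) {
    for (int i = 0; i < 3; ++i) { out_lo[i] = FLT_MAX; out_hi[i] = -FLT_MAX; }
    for (int c = 0; c < 8; ++c) {
        float p[3] = { (c & 1) ? hi[0] : lo[0],
                       (c & 2) ? hi[1] : lo[1],
                       (c & 4) ? hi[2] : lo[2] }, w[3];
        xform_point(m, p, w);
        for (int i = 0; i < 3; ++i) {
            if (w[i] < out_lo[i]) out_lo[i] = w[i];
            if (w[i] > out_hi[i]) out_hi[i] = w[i];
        }
    }
}
```

The result is conservative: under rotation the rotated box's AABB can be larger than the AABB of the rotated vertices, which is exactly the looseness being conceded above.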

[quote name='maxgpgpu']So on a typical frame, somewhere between 0 and 10~20 objects out of 10,000 get uploaded to the GPU in my proposed technique. That's clearly more overhead uploading than the conventional way, but then again the conventional way must perform massively more transfers of matrices to the GPU (one per object versus one for all objects),[/quote]
Yeah, that makes no sense. First, if you want to maximize shaders and whatnot, nobody wants to upload vertices each frame. You're trading the cost of sending a matrix vs the cost of sending, say, 1,000 vertices per object (for moving objects, which you say may be 10 per frame)? Really? Are you doing any AA, SSAO, anisotropic filtering?

I'm pretty sure I had a discussion on here before and it must have been you. Are you putting your whole scene into 1 VBO and calling draw on it? Do you perform any culling? If not, you are the dude I am thinking of, and again I will say you are completely wrong. I want to see some snapshots, because yes, stuff is fast, so theoretically you can do whatever you want, but if you are rendering in 1080p with some cranked-up effects, then you are missing performance. You also imply that calling glDraw... for each object is bad. Well, it's not so bad if the GPU is queued up for work, because it's probably already so far behind in draw commands that your new ones are not even slowing it down.

[quote name='maxgpgpu']So, if no objects rotate or move, no objects are uploaded to the GPU.[/quote]
Furthermore, if you are the one without culling and 1 VBO for the whole scene, then you're telling me you agree with wasting the GPU's time drawing, say, a 10K-poly statue model (or any model for that matter) even when you can't see it, just to save a tiny draw call. Not to mention that for indoor scenes you see 2% of the world, and outdoors you only see 1/2 of the world.

Clarify whether you are or aren't the guy I'm thinking of, and if so, I want to see pics. Because the trees in my game are like 20K polys each. I could get away with your method if they were like 100 polys each, but I want my stuff to look good, which means tons more polys, and that means you will never achieve 1 VBO with 20K-poly trees x say 100 trees.

NBA2K, Madden, Maneater, Killing Floor, Sims


[quote name='maxgpgpu']See, this is the problem. And frankly, I can't blame you, because --- as I'm sure you're aware --- once any of us adopts a significantly non-standard approach, that radically changes the nature of many other interactions and tradeoffs. Since you and virtually everyone else have so thoroughly internalized the tradeoffs and connections that exist in more conventional approaches, you don't (because you can't, without extraordinary time and effort) see how the change I propose impacts everything else.[/quote]
No, it doesn't take that much effort to evaluate other approaches -- there is no "conventional" approach; we're always looking for alternatives. I've used methods like this on the Wii, where optimised CPU transform routines were better than its GPU transforms (and also on the PS3, which has way more CPU power than GPU power), but without knowing the details of the rest of those specific projects, we can't judge the cost/value of those decisions here.

The problem is that your alternative isn't as great or original as you think it is. If we consider two extremes -- complete CPU vertex processing vs entirely GPU-driven animation/instancing of vertices -- then there is always a continuous and interesting middle ground between them. Depending on your target platforms, you'll find different kinds of scenes/objects are faster one way or the other. Different games also place different demands of their own on the CPU/GPU -- some need all the CPU time they can get, and some leave the CPU idle while the GPU bottlenecks. Our job should be to always consider this entire spectrum, depending on the current situation/requirements.

[quote name='maxgpgpu']I understand why this seems counterintuitive. Once you adopt the "conventional way" (meaning, you store local-coordinates in GPU memory), you are necessarily stuck changing state between every object. You cannot avoid it. At the very least you must change the transformation matrix to transform that object from its local coordinates to screen coordinates.[/quote]
No, this is utter bullshit. You haven't even understood "conventional" methods yet, and you're justifying your clever avoidance of them with these straw-man reasons? Decent multi-key sorting plus the simple state-cache on page #1 removes the majority of state changes, and is easy to implement. I use a different approach, but it still just boils down to sorting and filtering, which can be optimised greatly if you care.

Moreover, there's no reason that you have to have "the transform matrix" -- it's not the fixed-function pipeline any more. Many static objects don't have one, and many dynamic objects have 100 of them. Objects that use a shared storage technique between them can be batched together. There's no reason I couldn't take a similar approach to your all-verts-together technique, but instead store all-transforms-together.

[quote name='maxgpgpu']However, your argument more-or-less presumes lots of state changes, and correctly states that with modern GPUs and modern drivers, it is a fool's game to attempt to predict much about how to optimize for state changes.

However, it is still true that a scheme that has [almost] NO state changes is still significantly ahead of those that do. Unfortunately, it is a fool's game to attempt to predict how far ahead for every combination of GPU and driver![/quote]
I don't agree with how you've characterised that. I meant to imply that state-change costs are unintuitive, but they are still something that can be measured. If it can be measured, it can be optimised. If it can't be measured, then your optimisations are just voodoo.

Also, just because you've got 0 state changes, that in no way automatically grants better performance. Perhaps the 1000 state changes that you avoided wouldn't have been a bottleneck anyway; they might have been free in the GPU pipeline and only taken 0.1ms of CPU time to submit. Maybe the optimisation you've made to avoid them is 1000x slower? Have you measured both techniques with a set of different scenes?

[quote name='maxgpgpu']I really don't know what you mean when you say Battlefield 3 instances everything. The only thing I can figure is... there are ZERO objects that only exist once. If you have a Soldier0 object, then you have several instances of Soldier0... only with the kind of variations from one displayed instance to another that instancing provides. If that's what you mean, then I understand what you mean. But that sounds like a mighty strange game! Having said that, I can certainly imagine lots of games where a large majority of displayed objects are instanced... blades of grass, trees, leaves, bricks, stepping stones, and so forth. Maybe you don't mean "everything" literally.[/quote]
Maybe you don't want to waste VBO memory on storing a unique mesh for every "[font=courier new,courier,monospace]Soldier[/font]" in the game, so you only store a single (local space) "base mesh" that's kind of an average soldier. Your art team still models 8 unique soldiers that the game requires, as usual, but they are remapped as deltas from the base mesh. You then store your 1 base-mesh and 8 delta/morph-meshes together, and then instance 100 characters using the 1 base-mesh, each of which will be drawn as one of the 8 variations as required/indexed.
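A CPU-side sketch of that base-plus-delta idea, assuming positions stored as flat xyz arrays (on the GPU the add would live in the vertex shader, with the variant index supplied per instance):

```c
#include <stddef.h>

/* Hypothetical morph-variant scheme: one shared base mesh plus a small
 * delta mesh per variant; every instance stores only a variant index.
 * Resolving a variant is just a per-component add. */
void resolve_variant(const float *base, const float *delta,
                     size_t vert_count, float *out) {
    for (size_t i = 0; i < vert_count * 3; ++i)   /* xyz per vertex */
        out[i] = base[i] + delta[i];
}
```

The payoff is that the 100 drawn characters share the storage of 1 base mesh plus 8 deltas, instead of 100 full meshes.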

In your system, we not only need the original 9 data sets, but also enough memory for the 100 transformed data sets. We then need enough GPU memory for another ~200 (n-hundred) transformed meshes, because it's dynamic data and the driver is going to want to ~double-buffer (n-buffer) it.

I really don't know why I bothered with this, as you're being very dismissive of the real criticism that you solicited, but again, look at your memory trade-offs... The "conventional" requirements look like:
[attachment=9490:1.png]
whereas yours looks like:
[attachment=9491:2.png]

i.e. conventional memory usage
= Model.Verts * sizeof(Vert) // local data
+ Instances * Model.Bones * sizeof(float4x4) // transforms

your memory usage
= Model.Verts * sizeof(Vert) // local data
+ Instances * Model.Bones * sizeof(float4x4) // transforms
+ Instances * Model.Verts * sizeof(Vert) // world data
+ Instances * Model.Verts * sizeof(Vert) * N // driver dynamic VBO buffering

If you're going next-gen, you're looking at huge object counts and huge vertex counts. In the first equation above, there is no relation between the two -- increasing instances does not impact vertex storage, and increasing vertices does not increase the per-instance storage cost. However, in the latter, they are connected -- increasing the amount of detail in a model also increases your per-instance memory costs.
Seeing as modern games are getting more and more dynamic (where more things are moving every frame) and more and more detailed (where both vertex and instance counts are increasing greatly), I'd give your users the option to keep their costs independent of each other.

e.g. the example I gave earlier -- of 100 animated characters, with 100 bones each, using a 1M vertex model -- clearly is much more efficient in the "conventional" method than in yours. So if you wanted to support games based around animated characters, you might want to support the option of a different rendering mode for them.
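Plugging that example into the two memory equations is straightforward; the sketch below assumes sizeof(Vert) = 32, sizeof(float4x4) = 64, and N = 2 driver buffers (all illustrative sizes):

```c
/* Direct transcription of the two memory equations above, in bytes.
 * A 32-byte vertex and 64-byte float4x4 are assumed sizes. */
unsigned long long conventional_bytes(unsigned long long verts,
                                      unsigned long long instances,
                                      unsigned long long bones) {
    return verts * 32ULL                 /* local vertex data */
         + instances * bones * 64ULL;    /* transforms        */
}
unsigned long long world_copy_bytes(unsigned long long verts,
                                    unsigned long long instances,
                                    unsigned long long bones,
                                    unsigned long long n_buffers) {
    return conventional_bytes(verts, instances, bones)
         + instances * verts * 32ULL                 /* world-space copies */
         + instances * verts * 32ULL * n_buffers;    /* driver N-buffering */
}
```

For 100 characters with 100 bones each on a 1M-vertex model, the conventional scheme comes to roughly 32.6 MB, while the world-copy scheme comes to roughly 9.6 GB.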

How is the resolution you're rendering at affected AT ALL by the number of vertices you're uploading/saving? You're drawing the same amount either way; the only speed difference is in upload time and call time.
Your GPU is running pixel shaders on more pixels. Do you want it to spend more time shading those, or taking a piss while you upload a ton of vertices before it can even start to draw them? By leaving the data on the GPU, it is immediately able to start transforms and shaders. What I'm getting at is his stuff might work for a scene with low-poly objects, low resolution, and low effects.


[quote]How is the resolution you're rendering at affected AT ALL by the number of vertices you're uploading/saving? You're drawing the same amount either way; the only speed difference is in upload time and call time.[/quote]
This is completely off-topic, but... as you lower your resolution, you should also lower the LOD level of your detailed models (lower number of triangles, larger size each) -- the GPU performs pixel-shading in 2x2 pixel quads, triangles smaller than 2x2 pixels still effectively run the pixel shader 4 times for the whole 2x2 quad. So for example, if you are using single-pixel-sized triangles, then your pixel-shaders all suddenly become 4x more expensive. I've seen a huge speed-up in pixel-shading times before by simply adding LOD support to a game.
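A toy model of that 2x2-quad effect, for illustration only (real rasterizers also pay partial-quad overhead along the edges of large triangles):

```c
/* Toy model of 2x2-quad shading cost: every quad a triangle touches runs
 * the pixel shader on all 4 lanes, so tiny triangles pay for 4 lanes per
 * touched quad. Illustrative, not a rasterizer simulation. */
double shader_invocations(double pixels_covered, double pixels_per_triangle) {
    double triangles = pixels_covered / pixels_per_triangle;
    double quads_per_tri = pixels_per_triangle / 4.0;
    if (quads_per_tri < 1.0) quads_per_tri = 1.0; /* a triangle touches >= 1 quad */
    return triangles * quads_per_tri * 4.0;       /* invocations, 4 lanes/quad */
}
```

One-pixel triangles filling a 1M-pixel screen cost about 4M invocations in this model, versus about 1M for 16-pixel triangles -- the 4x penalty described above.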
[quote]Your GPU is running pixel shaders on more pixels. Do you want it to spend more time shading those, or taking a piss while you upload a ton of vertices before it can even start to draw them? By leaving the data on the GPU, it is immediately able to start transforms and shaders. What I'm getting at is his stuff might work for a scene with low-poly objects, low resolution, and low effects.[/quote]
A high-spec PC usually won't block on uploads like this; it will instead just add latency by increasing the number of frames it buffers commands before execution (which results in input lag and higher memory requirements).

