most efficient general rendering strategies for new GPUs

92 comments, last by maxgpgpu 11 years, 9 months ago
Re: Draw calls.

This is, and remains, an important issue.
Draw calls, while cheaper than in D3D9, can still suck up CPU power depending on what you are doing. If you are updating buffers and using them in a draw call, the driver has to shuffle memory under the hood, copy things around, and update other things. Avoiding them to the extreme of the OP is a bit crazy, but even so you should be careful, as they can suck away CPU time pretty quickly.

With D3D11, using multi-threaded deferred contexts (something GL doesn't have), you top out, on something like an i7, at ~15,000 draw calls per frame if you want to maintain 60fps. BF3 tops out at around 7,500 per frame, if memory serves.

In short; don't go crazy, keeping your draw calls down remains a good thing due to driver overhead.

(For reference using a very CPU heavy test loop and performing 50,000 draw calls per frame; a 2.6Ghz i7 with a NV GTX470 GPU can't even clear 30fps using 6 cores to render. An X360, using the same code base and same test, will happily do 60fps. This is purely CPU overhead causing the problem and most of the time is the driver doing work to shuffle data around. Clearly this wouldn't play out in a real game situation but there is still reason to be concerned about CPU cost.)

Re: varying verts per instance

You CAN do this... although you probably shouldn't.
However, you don't do it via traditional instancing; instead you use the geometry shader to create the extra vertex data. This comes at a cost, as the output of a GS has to be serialised correctly, which can introduce significant bottlenecks on the GPU.

Generally, unless you have very little work on the GPU and are totally rammed on the CPU, you won't want to do this; instead, just take the hit of an extra draw call per model type. Chances are you aren't going to have that many models that require this anyway, so it's not going to be a huge CPU cost and you avoid a large GPU cost.

[quote name='maxgpgpu' timestamp='1340132594' post='4950675']
I should probably do my homework rather than ask this silly question, but... here goes anyway. Feel free to ignore it.

Can instancing support in the latest generation of GPUs make it possible to pervert instancing in the following way? Assume we do instancing the regular way, so this question only refers to a large set of unique objects [that exist only once in the game/simulation]. Can the number of vertices be a per-instance variable? What I'm wondering here is whether it might be possible to consider all these diverse objects as separate instances of some general amorphous object.

In the instancing I'm familiar with, every instance has the same number of vertices. This is for jobs like rendering a crapload of leaves on trees, and the per-instance data tells, for each leaf: position, orientation, color (for example). However, if the per-instance data can include the # of vertices and maybe a couple more items, perhaps every object with any number of vertices could be rendered with instancing. That sounds wacko off hand, but then effectively instanceID means objectID, so instanceID can double as the index into an array of general local-to-view transformation matrices.

This probably exceeds the flexibility of the instancing mechanism, but then again, maybe it doesn't. Any comments?


I doubt it - if the number of verts needs to change then it seems a reasonably good bet that the texture also needs to change (otherwise your texcoords would be out of whack) so you're looking at a separate draw call anyway.
[/quote]
My vertex structures contain a textureID field that indexes into the texture array, so that's not a killer even now. Obviously this is more-or-less necessary in my current scheme, which renders unlimited objects in a single draw call --- each object can have different textures, different normalmaps, different specularmaps, etc. My 64-byte vertex structure is running low on free bits at this point, so my alternative is to eliminate those textureID fields [and the matrixID field] and replace them with an objectID field that indexes into a texture to get all the information that could possibly be needed (at the expense of an extra texture fetch per vertex). Unfortunately, that eliminates one nice feature of the current scheme --- the ability to specify texture, normalmap, and specularmap on a per-vertex basis, not just per-object.
The moral of the story is that draw calls are still not free, but you don't need to pathologically avoid them as much as you did before. 7.5k calls in a shipping AAA title would certainly have given everyone the horrors not so long ago. "Going crazy" can work in both directions...

Indexing into a texture array is a decent way of avoiding changes and keeping calls down, but it adds the constraint that all of your textures must be the same size. You're not going to use the same texture size for a small pebble or for a particle as you use for a brick wall, I hope. Aiming for the entire scene in a single call also constrains you to using the same material properties for all of your objects. If you're happy with that tradeoff, then sure, go for it, but it really reduces your setup's general-purpose utility. You can't even do something as simple as enable alpha blending for a window but keep it disabled for everything else. That makes the objective something more of theoretical interest than practical utility.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.


[quote]
The moral of the story is that draw calls are still not free, but you don't need to pathologically avoid them as much as you did before. 7.5k calls in a shipping AAA title would certainly have given everyone the horrors not so long ago. "Going crazy" can work in both directions...

Indexing into a texture array is a decent way of avoiding changes and keeping calls down, but it adds the constraint that all of your textures must be the same size. You're not going to use the same texture size for a small pebble or for a particle as you use for a brick wall, I hope. Aiming for the entire scene in a single call also constrains you to using the same material properties for all of your objects. If you're happy with that tradeoff, then sure, go for it, but it really reduces your setup's general-purpose utility. You can't even do something as simple as enable alpha blending for a window but keep it disabled for everything else. That makes the objective something more of theoretical interest than practical utility.
[/quote]

Well, we still have more than one texture unit to work with. I assume 4 texture units, and hopefully GPUs won't ever drop below that number.

What I do is pack 4 to hundreds of textures (or normalmaps, heightmaps, etc.) into each texture, more or less "texture atlas" style. So my tcoords for a given object don't range from 0.000 to 1.000 on each axis; they span some tiny fraction of that range. Of course, my approach means I can't create repeating textures (for tiled floors and such) by letting the tcoords extend far below 0.000 and far above 1.000.

Clearly I need to rethink my balancing act. You guys are probably correct that only putting local-coordinates into the VBOs for "large moving objects" is not the optimal tradeoff. But some comments seem to indicate that going whole-hog the opposite direction isn't very good either.

In many games and simulations, the [vast/large/substantial] majority of objects are fixed. These objects probably render at the same speed with either local or world coordinates in the VBOs, everything else being done the same. Probably the simplest test I can perform is to break my draw calls up, one per object in each VBO, and perform frustum tests on each. I can certainly measure how much CPU time that adds. Unfortunately, I'm not very proficient at figuring out the impact on GPU execution time.

This topic is closed to new replies.
