most efficient general rendering strategies for new GPUs

92 comments, last by maxgpgpu 11 years, 9 months ago
Are you suggesting that if I have a single VBO and, say, triple-buffered rendering, and I update that VBO on all 3 of those frames, the driver would buffer that VBO into temporary arrays? I guess it would have to store each frame's version of the VBO in a temporary buffer. Or is it just going to overwrite the single VBO's memory with the most recently buffered command? I would assume texture/VBO uploads are not put into the command buffer and take place immediately.

As for blocking: if I'm only using double buffering, then all my commands are being drawn to the current back buffer. So if I want to send 50K verts right as the GPU finishes its current commands (going idle), I'd rather just say "draw the 50K model" than send 50K verts and then, after all the verts have made it across, tell it to draw them. And the GPU already has a ton of work to do; the point I was making about resolution is that boosting it to HD gives the GPU even more hard work, and if you're stalling it at the same time, your framerate goes down fast. If you upgrade to next-gen-quality models at HD, you need to help your GPU do its job faster, not waste time sending thousands (maybe even 100K) of vertices. It's a complete waste; in that time I could have processed full-scene SSAO or something.

I'm just pretty sure this guy is the one who says "throw the entire scene (static objects only) into 1 giant VBO and draw it all the time; you don't need to cull because it's faster to use 1 draw call than 100." Realistically, when does anyone ever draw more than 1,000 objects on screen at once anyway? If you cull, then 1 draw call vs. 100 or even 1,000 is negligible. 1,000 draw calls is almost nothing; collapsing that to 1 is not much of a performance win, and in most cases I bet it's unmeasurable.

NBA2K, Madden, Maneater, Killing Floor, Sims http://www.pawlowskipinball.com/pinballeternal


[quote]Are you suggesting that if I have a single VBO and, say, triple-buffered rendering, and I update that VBO on all 3 of those frames, the driver would buffer that VBO into temporary arrays? I guess it would have to store each frame's version of the VBO in a temporary buffer. Or is it just going to overwrite the single VBO's memory with the most recently buffered command? I would assume texture/VBO uploads are not put into the command buffer and take place immediately.[/quote]
If you're updating and rendering from a VBO each frame, the driver will probably n-buffer it depending on how much latency is in the command stream.
In the general case, when a command consumes some bound resource (like a VBO), the driver inserts a resource fence after the draw call: a command in the command stream that quickly writes back to the CPU driver, letting it know that the command using that VBO has completed (so the VBO's VRAM can now be re-used). If you map/lock a resource, the driver checks its fences to see if that resource is still waiting to be consumed by the GPU, and if so, it has to allocate another copy to map/return to you.
For example, if you're locking a VBO every frame and there's 2.5 frames of latency between the two processors, then data that's written during frame #1 might not be consumed until the CPU is up to frame #4! Uploading more data per frame will likely increase the latency, which increases the amount of buffering RAM required... In a good case there's only 1 frame of latency, so only double buffering is required for dynamic buffers. However, for all we know, the driver might be very optimised for dynamically handing out buffers from a pool like this...
[attachment=9493:3.png]
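The usual way to cooperate with that renaming behaviour from the application side is the "orphaning" idiom. A minimal sketch in C/OpenGL (assuming an extension loader is in place; the Vertex layout and updateDynamicVbo are hypothetical names for this illustration):

[code]
typedef struct { float pos[3], norm[3], uv[2]; } Vertex;  /* hypothetical layout */

/* Per-frame update of a dynamic VBO created earlier with glGenBuffers. */
void updateDynamicVbo(GLuint vbo, const Vertex *verts, GLsizei vertCount)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    /* Orphan the old store: the GPU can keep reading the old allocation
       while the driver hands the CPU a fresh one, avoiding a stall. */
    glBufferData(GL_ARRAY_BUFFER, vertCount * sizeof(Vertex), NULL, GL_STREAM_DRAW);
    /* Fill the fresh allocation with this frame's data. */
    glBufferSubData(GL_ARRAY_BUFFER, 0, vertCount * sizeof(Vertex), verts);
}
[/code]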
[quote]As for blocking, if I'm only using double buffering, then all my commands are being drawn to the current back buffer...[/quote]The internal buffering of GPU commands is different from back/front buffering -- the mechanisms for single/double/triple-buffered flipping are implemented as commands in the command stream just like everything else. The driver will determine the amount of latency to buffer commands for based on the conditions of your app. If you've got a lot of upload traffic, it's probably going to have to cover for that with latency automatically.
[quote]Realistically, when does anyone ever draw more than 1,000 objects on screen at once anyway? If you cull, then 1 draw call vs. 100 or even 1,000 is negligible. 1,000 draw calls is almost nothing; collapsing that to 1 is not much of a performance win, and in most cases I bet it's unmeasurable.[/quote]On my last game, we had about 2000 D3D9 draw calls per frame (plus the required state changes per draw call), which cost less than 1ms of CPU time in our optimised renderer.
The diagram was basically what I was mentioning. But if the VBO is set to STATIC_DRAW and never updated, I would have to assume there wouldn't be those copies lying around. A simple test would be to fill a command buffer with glBindBuffer/glBufferData calls over and over again and see if VBO memory goes up.

What I was also getting at is: what happens when the command buffer hits the end? glSwapBuffers obviously takes care of some syncing, because the command buffer isn't going to just fill up with 800 frames to render while the GPU is still working on frame 500. So the command buffer has to be limited per frame, and once it's too far ahead, wait for a previous frame to finish. But with double buffering, the commands it's receiving are the ones it's getting ready to put up after glSwapBuffers, so at some point the GPU can catch up and stall with no commands left.

NBA2K, Madden, Maneater, Killing Floor, Sims http://www.pawlowskipinball.com/pinballeternal

Yes, the driver will choose a different allocation strategy based on the flags/hints you give it. Static buffers aren't likely to suffer this memory-bloat penalty, but might do something worse if used in a dynamic manner, like blocking if used by both processors in nearby/alternate frames.
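For instance (a sketch; the sizes and pointers are hypothetical), the usage hint is the main lever you have over that strategy:

[code]
/* Upload once, draw many times: the driver can park this in VRAM
   and should never need to rename it. */
glBufferData(GL_ARRAY_BUFFER, size, staticVerts, GL_STATIC_DRAW);

/* Rewritten every frame: the driver expects to rename/multi-buffer this. */
glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW);
[/code]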

The command buffer is probably implemented kind of like a ring buffer, so either the GPU can catch up to the CPU flush marker, or the CPU can catch up to the GPU read marker. N.B. the flush marker sits somewhere between the read/write markers, as shown below; the driver periodically moves the flush marker up to the write marker.

[][][][][][][]
 ^    ^    ^
 |    |    CPU Write cursor
 |    CPU Flush marker
 GPU Read cursor
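For illustration, the bookkeeping might look something like this in C (all names hypothetical -- a real driver's version lives in driver/kernel memory and is far more involved):

[code]
#include <stddef.h>

typedef struct {
    unsigned char *base;  /* start of command memory */
    size_t size;          /* total ring size in bytes */
    size_t read;          /* GPU Read cursor, advanced via fence write-backs */
    size_t flush;         /* CPU Flush marker: last byte made visible to the GPU */
    size_t write;         /* CPU Write cursor: where new commands are appended */
} CmdRing;

/* Bytes the CPU may still append before it would overrun unread GPU data
   (one byte is kept free so that read == write always means "empty"). */
size_t ringFreeSpace(const CmdRing *r)
{
    return (r->read + r->size - r->write - 1) % r->size;
}
[/code]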
The CPU can fill up the command buffer during any API call, which will either block, allocate more memory for the ring, or send the command to a background thread for processing. During glSwapBuffers/etc, to keep the CPU from getting too far ahead (to the point where it will potentially use up all command memory), the driver can sync it to a particular frame/vblank by waiting for a fence object in the command stream again, which it placed after a particular flip command. After the GPU flips, it processes the fence, which lets the driver know that this particular frame has been flipped. By waiting on different fences, the driver can keep the CPU from getting further than 0,1,2,etc... frames ahead. If we then allocate enough ring memory to cover 1,2,3,etc... frames worth of commands, then the CPU running out of buffer memory becomes an exceptional event, which you can debug.
The opposite shouldn't happen, where the GPU catches up to the CPU. If it does, then either (1) you need to be more efficient - 3ms of CPU GL calls can easily produce 33ms of GPU work to keep it busy, or (2) some other part of your game is hogging the CPU too much and you need to make sure the renderer runs reliably once every 33ms/etc...
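An application-level version of that fence trick is straightforward to sketch with GL sync objects (GL 3.2+, assuming an extension loader provides them); MAX_FRAMES_AHEAD and endFrame are made-up names for this illustration:

[code]
#include <stdint.h>

#define MAX_FRAMES_AHEAD 2
static GLsync frameFence[MAX_FRAMES_AHEAD];
static unsigned frame;

void endFrame(void)
{
    unsigned slot = frame % MAX_FRAMES_AHEAD;
    if (frameFence[slot]) {
        /* Block until the GPU finishes the frame that last used this slot,
           capping the CPU at MAX_FRAMES_AHEAD frames of latency. */
        glClientWaitSync(frameFence[slot], GL_SYNC_FLUSH_COMMANDS_BIT, UINT64_MAX);
        glDeleteSync(frameFence[slot]);
    }
    frameFence[slot] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    /* the platform's swap-buffers call goes here */
    ++frame;
}
[/code]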

What I'm getting at is that his stuff might work for a scene with low-poly objects at low resolution with low effects.

"If you lower the amount of processing in the bottleneck, it will work better!"
So, maxgpgpu, let me get this straight. You have the most powerful hardware on the planet available to you - your GPU - yet you're flat-out refusing to use it for the kind of processing it's best at. You have well-known, tried-and-tested solutions for collision and bboxes that have been proven to work for over a decade, yet you're also flat-out refusing to use them. You have a VBO solution that involves needing to re-up verts to arbitrary positions in the buffer in a non-predictable manner - congratulations, you've just re-invented the worst-case scenario for VBO usage - you really should profile for pipeline stalls sometime real soon.

None of this is theoretical fairyland stuff. This is all Real; this is all used in Real programs that Real people use every hour of every day of every week. You, on the other hand, have a whole bunch of theoretical fairyland stuff. It's not the case that you're a visionary whose ideas are too radical for the conservative majority to accept. It is the case that your ideas are insane.

Am I missing anything here?

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.


[quote]Am I missing anything here?[/quote]


What's missing is what people who are serious about their claims usually produce: proof.
I have exactly the same feeling that maxgpgpu here is completely missing the point of having a GPU in the first place, and I am surprised that, in 3 pages, nobody has actually called on him to show what he has actually accomplished with his "peculiar" approach.

I don't understand how it is possible to be serious and propose uploading an entire CPU-transformed VB to avoid setting a cbuffer with some values. The guy is missing the entire point of why vertex shaders, geometry shaders and instancing exist in the first place. Looks like plain old trolling to me... at least until we see a moving demo of this ahemmm weird approach in action. But every dev with some real experience knows this isn't going to happen.
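To put rough numbers on the trade-off (my own figures, assuming a typical 32-byte vertex): updating a cbuffer with one 4x4 float matrix moves 64 bytes per object, while re-uploading a CPU-transformed 1,000-vertex mesh moves about 32 KB -- roughly 500x the traffic, before the CPU time spent transforming is even counted.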

Stefano Casillo
TWITTER: [twitter]KunosStefano[/twitter]
AssettoCorsa - netKar PRO - Kunos Simulazioni


[quote]So on a typical frame, somewhere between 0 and 10~20 objects out of 10,000 get uploaded to the GPU in my proposed technique. That's clearly more uploading overhead than the conventional way, but then again the conventional way must perform massively more transfers of matrices to the GPU (one per object versus one for all objects).[/quote]

[quote]Yeah, that makes no sense. First, if you want to maximize shaders and whatnot, then nobody wants to upload vertices each frame. You're trading the cost of sending a matrix vs. the cost of sending, say, 1,000 vertices per object (for moving objects, which you say may be 10 per frame)? Really? Are you doing any AA, SSAO, anisotropic filtering?

I'm pretty sure I had a discussion on here before, and it must have been you. Are you putting your whole scene into 1 VBO and calling draw on it? Do you perform any culling? If not, you are the dude I am thinking of, and again I will say you are completely wrong. I want to see some snapshots, because -- yes, stuff is fast, so theoretically you can do whatever you want -- but if you are rendering in 1080p with some cranked-up effects, then you are leaving performance on the table. You also imply that calling glDraw... for each object is bad. Well, it's not so bad if the GPU is queued up with work, because it's probably already behind on draw commands, so your new ones aren't even slowing it down.[/quote]
No, I have many VBOs. Each contains the objects in a specific volume of 3D space. For the "conventional way" (excluding instancing for the moment), you set the transformation matrix and call glDrawRangeElements() or equivalent once per object. So it is completely convenient to test the object AABB (or OOBB) against the frustum, and draw only if they intersect. In my scheme, I usually draw the entire VBO, so I test the entire 3D volume assigned to the VBO against the frustum and draw --- or not draw --- ALL objects in the VBO.

So yeah, I cull. I just don't cull as fine-grained as the conventional way. That does mean I render some objects outside the frustum... just not very many.
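A sketch of that coarse test in C (regionBounds, frustum, aabbIntersectsFrustum and drawRegion are hypothetical stand-ins for whatever the real code uses):

[code]
/* One bounds test per spatial-region VBO instead of one per object. */
for (int i = 0; i < regionCount; ++i) {
    if (aabbIntersectsFrustum(&regionBounds[i], &frustum))
        drawRegion(i);  /* a single draw submits every object in this VBO */
}
[/code]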

[quote]So, if no objects rotate or move, no objects are uploaded to the GPU.[/quote]
[quote]Furthermore, if you are the one without culling and 1 VBO for the whole scene, then you're telling me you agree with wasting the GPU's time drawing, say, a 10K-poly statue model (or any model, for that matter) even when you can't see it, just to save a tiny draw call. Not to mention that for indoor scenes you see 2% of the world, and outdoors you only see half of it.

Clarify whether you are or aren't the guy I'm thinking of, and if so, I want to see pics. Because the trees in my game are like 20K polys each; I could get away with your method if they were like 100 polys each, but I want my stuff to look good, which means tons more polys, and that means you will never achieve 1 VBO with 20K-poly trees x say 100 trees.
[/quote]

No, I do cull, just not as fine-grained as others. So yeah, I lose a little there, but not that much.

[quote]So, maxgpgpu, let me get this straight. You have the most powerful hardware on the planet available to you - your GPU - yet you're flat-out refusing to use it for the kind of processing it's best at. You have well-known, tried-and-tested solutions for collision and bboxes that have been proven to work for over a decade, yet you're also flat-out refusing to use them. You have a VBO solution that involves needing to re-up verts to arbitrary positions in the buffer in a non-predictable manner - congratulations, you've just re-invented the worst-case scenario for VBO usage - you really should profile for pipeline stalls sometime real soon.

None of this is theoretical fairyland stuff. This is all Real; this is all used in Real programs that Real people use every hour of every day of every week. You, on the other hand, have a whole bunch of theoretical fairyland stuff. It's not the case that you're a visionary whose ideas are too radical for the conservative majority to accept. It is the case that your ideas are insane.

Am I missing anything here?[/quote]


Yes. The most important thing you're missing is this. I'm here for brainstorming. If that means I learn 100x more from others than they learn from me... that's great for me! And I'm happy. What I'm not here for is to prove anything to anyone. I couldn't care less about that. But apparently you and others think my purpose here is to convince you that I'm a genius and I've figured out the greatest idea since sliced bread. Hell no! My whole purpose here is to try to make sure I'm headed down the right path, learn other perspectives, hear new ideas (where "new" means "new to me"). Maybe I'll have to run my own benchmarks and find out for myself. Or maybe someone will say something that rings a bell, and I can tell what's better (for situations x or y or z) without implementing and benchmarking everything myself.

What else you seem to be missing... maybe... is that neither of us is [fully] understanding the other's points. In my case, I am aware of some of these cases. And often the points people make assume some variation of the conventional way, which doesn't apply to the alternative I propose. But often it seems like others are aware of NONE of my points and are purposely ignoring them. I'll write that off to not wanting to read all my messages; I understand that. But also, sometimes I say "x is good" for something or in some cases, and people go apeshit and post in reply that "he says we should do everything the x way". I never said that, but that is, for some reason, what people take away. If I try to over-qualify everything, people will hassle me about "writing too much" -- believe me, I've been beaten to death for that too. My favorite is when someone falsely claims I said something I never said, then others pile on without ever finding where I said it -- which I never did. That's a lot of fun, and a waste of everyone's time.

[quote name='dpadam450' timestamp='1339734822' post='4949426']
What I'm getting at is that his stuff might work for a scene with low-poly objects at low resolution with low effects.

"If you lower the amount of processing in the bottleneck, it will work better!"
[/quote]
That is a universal truth. In my approach, though, the bottleneck situation is a bit more complex than in the conventional approach.

In the normal case, where few objects rotate or move each frame, the bottleneck is inherently the GPU, because the CPU has little to do.

In one abnormal case, where most objects rotate or move each frame, the bottleneck can become the CPU, for the reasons people here are screaming bloody murder about. That is, the CPU must transform the vertices of all rotated/moved objects in each batch and transfer them to VBOs in the GPU before it draws the batch. That's why I went to the trouble of writing the fastest possible 64-bit (and 32-bit) vertex transformation routines in SIMD/AVX/FMA4 assembly language.
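For illustration only -- the real routines are hand-written assembly -- a C-intrinsics sketch of that kind of transform loop might look like this (column-major matrix; all names hypothetical):

[code]
#include <xmmintrin.h>

/* Transform n xyzw vertices by a 4x4 column-major matrix whose columns
   are col[0..3]: dst[i] = col0*x + col1*y + col2*z + col3*w. */
void transformVerts(const __m128 col[4],
                    const float (*src)[4], float (*dst)[4], int n)
{
    for (int i = 0; i < n; ++i) {
        __m128 r = _mm_mul_ps(col[0], _mm_set1_ps(src[i][0]));
        r = _mm_add_ps(r, _mm_mul_ps(col[1], _mm_set1_ps(src[i][1])));
        r = _mm_add_ps(r, _mm_mul_ps(col[2], _mm_set1_ps(src[i][2])));
        r = _mm_add_ps(r, _mm_mul_ps(col[3], _mm_set1_ps(src[i][3])));
        _mm_storeu_ps(dst[i], r);
    }
}
[/code]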

