most efficient general rendering strategies for new GPUs

92 comments, last by maxgpgpu 11 years, 9 months ago
Are you suggesting that if I have a single VBO and, say, triple-buffered rendering, and I update that VBO on all 3 of those frames, the driver would buffer that VBO into temporary arrays? I guess it would have to store each frame's version of the VBO in a temporary buffer. Or is it just going to overwrite the single VBO's memory with the most recently buffered command? I would assume texture/VBO uploads are not put into the command buffer and take place immediately.

As for blocking: if I'm only using double buffering, then all my commands are being drawn to the current back buffer. So if I want to send 50K verts right as the GPU finishes its current commands (going idle), I'd rather just say "draw the 50K model" than send 50K verts and then, after all the verts have made it across, tell it to draw them. And the GPU already has a ton of work to do; the point I was making about resolution is that boosting it to HD gives the GPU even more hard work, and if you're stalling it at the same time, your framerate goes down fast. If you upgrade to next-gen-quality models at HD, you need to help your GPU do its job faster, not waste time sending thousands (maybe even 100K) of vertices. It's a complete waste; in that time I could have processed full-scene SSAO or something.

I'm just pretty sure this guy is the one who says "throw the entire scene (static objects only) into 1 giant VBO and draw it all the time; you don't need to cull because it's faster to use 1 draw call than 100." Realistically, when does anyone ever draw more than 1,000 objects on screen at once anyway? If you cull, then 1 draw call vs. 100 or even 1,000 is negligible. 1,000 draw calls is almost nothing; collapsing that to 1 is not much of a performance win, and in most cases I bet it's unmeasurable.

NBA2K, Madden, Maneater, Killing Floor, Sims http://www.pawlowskipinball.com/pinballeternal


[quote]Are you suggesting that if I have a single VBO and, say, triple-buffered rendering, and I update that VBO on all 3 of those frames, the driver would buffer that VBO into temporary arrays? I guess it would have to store each frame's version of the VBO in a temporary buffer. Or is it just going to overwrite the single VBO's memory with the most recently buffered command? I would assume texture/VBO uploads are not put into the command buffer and take place immediately.[/quote]
If you're updating and rendering from a VBO each frame, the driver will probably n-buffer it depending on how much latency is in the command stream.
In the general case, when a command consumes some bound resource (like a VBO), the driver inserts a resource fence after the draw call: a command in the command stream that quickly writes back to the CPU driver, letting it know that the command using that VBO has completed (so the VBO's VRAM can now be re-used). If you map/lock a resource, the driver checks its fences to see if that resource is still waiting to be consumed by the GPU, and if so, it has to allocate another copy to map/return to you.
For example, if you're locking a VBO every frame and there's 2.5 frames of latency between the two processors, then data that's written during frame #1 might not be consumed until the CPU is up to frame #4! Uploading more data per frame will likely increase the latency, which increases the amount of buffering RAM required... In a good case there's only 1 frame of latency, so only double buffering is required for dynamic buffers. However, for all we know, the driver might be very optimised for dynamically handing out buffers from a pool like this...
[attachment=9493:3.png]
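The usual way to cooperate with that renaming behaviour from the application side is the "orphaning" idiom. A minimal sketch in C/OpenGL (assuming an extension loader is in place; the Vertex layout and updateDynamicVbo are hypothetical names for this illustration):

[code]
typedef struct { float pos[3], norm[3], uv[2]; } Vertex;  /* hypothetical layout */

/* Per-frame update of a dynamic VBO created earlier with glGenBuffers. */
void updateDynamicVbo(GLuint vbo, const Vertex *verts, GLsizei vertCount)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    /* Orphan the old store: the GPU can keep reading the old allocation
       while the driver hands the CPU a fresh one, avoiding a stall. */
    glBufferData(GL_ARRAY_BUFFER, vertCount * sizeof(Vertex), NULL, GL_STREAM_DRAW);
    /* Fill the fresh allocation with this frame's data. */
    glBufferSubData(GL_ARRAY_BUFFER, 0, vertCount * sizeof(Vertex), verts);
}
[/code]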
[quote]As for blocking, if I'm only using double buffering, then all my commands are being drawn to the current back buffer...[/quote]The internal buffering of GPU commands is different from back/front buffering -- the mechanisms for single/double/triple-buffered flipping are implemented as commands in the command stream just like everything else. The driver will determine the amount of latency to buffer commands for based on the conditions of your app. If you've got a lot of upload traffic, it's probably going to have to cover for that with latency automatically.
[quote]Realistically, when does anyone ever draw more than 1,000 objects on screen at once anyway? If you cull, then 1 draw call vs. 100 or even 1,000 is negligible. 1,000 draw calls is almost nothing; collapsing that to 1 is not much of a performance win, and in most cases I bet it's unmeasurable.[/quote]On my last game, we had about 2000 D3D9 draw calls per frame (plus the required state changes per draw call), which cost less than 1ms of CPU time in our optimised renderer.
The diagram was basically what I was mentioning. But if the VBO is set to STATIC_DRAW and never updated, I would have to assume there wouldn't be those copies lying around. A simple test would be to fill a command buffer with glBindBuffer/glBufferData calls over and over again and see if VBO memory goes up.

What I was also getting at is: what happens when the command buffer hits the end? glSwapBuffers obviously takes care of some syncing, because the command buffer isn't going to just fill up with 800 frames to render while the GPU is still working on frame 500. So the command buffer has to be limited per frame, and once it's too far ahead, wait for a previous frame to finish. But with double buffering, the commands it's receiving are the ones it's getting ready to put up after glSwapBuffers, so at some point the GPU can catch up and stall with no commands left.

NBA2K, Madden, Maneater, Killing Floor, Sims http://www.pawlowskipinball.com/pinballeternal

Yes, the driver will choose a different allocation strategy based on the flags/hints you give it. Static buffers aren't likely to suffer this memory-bloat penalty, but might do something worse if used in a dynamic manner, like blocking if used by both processors in nearby/alternate frames.
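For instance (a sketch; the sizes and pointers are hypothetical), the usage hint is the main lever you have over that strategy:

[code]
/* Upload once, draw many times: the driver can park this in VRAM
   and should never need to rename it. */
glBufferData(GL_ARRAY_BUFFER, size, staticVerts, GL_STATIC_DRAW);

/* Rewritten every frame: the driver expects to rename/multi-buffer this. */
glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW);
[/code]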

The command buffer is probably implemented kind of like a ring buffer, so either the GPU can catch up to the CPU flush marker, or the CPU can catch up to the GPU read marker. N.B. the flush marker sits somewhere between the read/write markers, as shown below; the driver periodically moves the flush marker up to the write marker.

[][][][][][][]
 ^    ^    ^
 |    |    CPU Write cursor
 |    CPU Flush marker
 GPU Read cursor
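For illustration, the bookkeeping might look something like this in C (all names hypothetical -- a real driver's version lives in driver/kernel memory and is far more involved):

[code]
#include <stddef.h>

typedef struct {
    unsigned char *base;  /* start of command memory */
    size_t size;          /* total ring size in bytes */
    size_t read;          /* GPU Read cursor, advanced via fence write-backs */
    size_t flush;         /* CPU Flush marker: last byte made visible to the GPU */
    size_t write;         /* CPU Write cursor: where new commands are appended */
} CmdRing;

/* Bytes the CPU may still append before it would overrun unread GPU data
   (one byte is kept free so that read == write always means "empty"). */
size_t ringFreeSpace(const CmdRing *r)
{
    return (r->read + r->size - r->write - 1) % r->size;
}
[/code]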
The CPU can fill up the command buffer during any API call, which will either block, allocate more memory for the ring, or send the command to a background thread for processing. During glSwapBuffers/etc, to keep the CPU from getting too far ahead (to the point where it will potentially use up all command memory), the driver can sync it to a particular frame/vblank by waiting for a fence object in the command stream again, which it placed after a particular flip command. After the GPU flips, it processes the fence, which lets the driver know that this particular frame has been flipped. By waiting on different fences, the driver can keep the CPU from getting further than 0,1,2,etc... frames ahead. If we then allocate enough ring memory to cover 1,2,3,etc... frames worth of commands, then the CPU running out of buffer memory becomes an exceptional event, which you can debug.
The opposite shouldn't happen, where the GPU catches up to the CPU. If it does, then either (1) you need to be more efficient - 3ms of CPU GL calls can easily produce 33ms of GPU work to keep it busy, or (2) some other part of your game is hogging the CPU too much and you need to make sure the renderer runs reliably once every 33ms/etc...
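An application-level version of that fence trick is straightforward to sketch with GL sync objects (GL 3.2+, assuming an extension loader provides them); MAX_FRAMES_AHEAD and endFrame are made-up names for this illustration:

[code]
#include <stdint.h>

#define MAX_FRAMES_AHEAD 2
static GLsync frameFence[MAX_FRAMES_AHEAD];
static unsigned frame;

void endFrame(void)
{
    unsigned slot = frame % MAX_FRAMES_AHEAD;
    if (frameFence[slot]) {
        /* Block until the GPU finishes the frame that last used this slot,
           capping the CPU at MAX_FRAMES_AHEAD frames of latency. */
        glClientWaitSync(frameFence[slot], GL_SYNC_FLUSH_COMMANDS_BIT, UINT64_MAX);
        glDeleteSync(frameFence[slot]);
    }
    frameFence[slot] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    /* the platform's swap-buffers call goes here */
    ++frame;
}
[/code]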

What I'm getting at is that his stuff might work for a scene with low-poly objects at low resolution with low effects.

"If you lower the amount of processing in the bottleneck, it will work better!"
So, maxgpgpu, let me get this straight. You have the most powerful hardware on the planet available to you - your GPU - yet you're flat-out refusing to use it for the kind of processing it's best at. You have well-known, tried-and-tested solutions for collision and bboxes that have been proven to work for over a decade, yet you're also flat-out refusing to use them. You have a VBO solution that involves needing to re-up verts to arbitrary positions in the buffer in a non-predictable manner - congratulations, you've just re-invented the worst-case scenario for VBO usage - you really should profile for pipeline stalls sometime real soon.

None of this is theoretical fairyland stuff. This is all Real; this is all used in Real programs that Real people use every hour of every day of every week. You, on the other hand, have a whole bunch of theoretical fairyland stuff. It's not the case that you're a visionary whose ideas are too radical for the conservative majority to accept. It is the case that your ideas are insane.

Am I missing anything here?

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.


[quote]Am I missing anything here?[/quote]


What's missing is what people who are serious about their claims usually produce: proof.
I have exactly the same feeling that maxgpgpu here is completely missing the point of having a GPU in the first place, and I am surprised that, in 3 pages, nobody has actually called on him to show what he has actually accomplished with his "peculiar" approach.

I don't understand how it is possible to be serious and propose uploading an entire CPU-transformed VB to avoid setting a cbuffer with some values. The guy is missing the entire point of why vertex shaders, geometry shaders and instancing exist in the first place. Looks like plain old trolling to me... at least until we see a moving demo of this ahemmm weird approach in action. But every dev with some real experience knows this isn't going to happen.
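To put rough numbers on the trade-off (my own figures, assuming a typical 32-byte vertex): updating a cbuffer with one 4x4 float matrix moves 64 bytes per object, while re-uploading a CPU-transformed 1,000-vertex mesh moves about 32 KB -- roughly 500x the traffic, before the CPU time spent transforming is even counted.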

Stefano Casillo
TWITTER: [twitter]KunosStefano[/twitter]
AssettoCorsa - netKar PRO - Kunos Simulazioni


[quote]So on a typical frame, somewhere between 0 and 10~20 objects out of 10,000 get uploaded to the GPU in my proposed technique. That's clearly more uploading overhead than the conventional way, but then again the conventional way must perform massively more transfers of matrices to the GPU (one per object versus one for all objects).[/quote]

[quote]Yeah, that makes no sense. First, if you want to maximize shaders and whatnot, then nobody wants to upload vertices each frame. You're trading the cost of sending a matrix vs. the cost of sending, say, 1,000 vertices per object (for moving objects, which you say may be 10 per frame)? Really? Are you doing any AA, SSAO, anisotropic filtering?

I'm pretty sure I had a discussion on here before, and it must have been you. Are you putting your whole scene into 1 VBO and calling draw on it? Do you perform any culling? If not, you are the dude I am thinking of, and again I will say you are completely wrong. I want to see some snapshots, because -- yes, stuff is fast, so theoretically you can do whatever you want -- but if you are rendering in 1080p with some cranked-up effects, then you are leaving performance on the table. You also imply that calling glDraw... for each object is bad. Well, it's not so bad if the GPU is queued up with work, because it's probably already behind on draw commands, so your new ones aren't even slowing it down.[/quote]
No, I have many VBOs. Each contains the objects in a specific volume of 3D space. For the "conventional way" (excluding instancing for the moment), you set the transformation matrix and call glDrawRangeElements() or equivalent once per object. So it is completely convenient to test the object AABB (or OOBB) against the frustum, and draw only if they intersect. In my scheme, I usually draw the entire VBO, so I test the entire 3D volume assigned to the VBO against the frustum and draw --- or not draw --- ALL objects in the VBO.

So yeah, I cull. I just don't cull as fine-grained as the conventional way. That does mean I render some objects outside the frustum... just not very many.
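A sketch of that coarse test in C (regionBounds, frustum, aabbIntersectsFrustum and drawRegion are hypothetical stand-ins for whatever the real code uses):

[code]
/* One bounds test per spatial-region VBO instead of one per object. */
for (int i = 0; i < regionCount; ++i) {
    if (aabbIntersectsFrustum(&regionBounds[i], &frustum))
        drawRegion(i);  /* a single draw submits every object in this VBO */
}
[/code]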

[quote]So, if no objects rotate or move, no objects are uploaded to the GPU.[/quote]
[quote]Furthermore, if you are the one without culling and 1 VBO for the whole scene, then you're telling me you agree with wasting the GPU's time drawing, say, a 10K-poly statue model (or any model, for that matter) even when you can't see it, just to save a tiny draw call. Not to mention that for indoor scenes you see 2% of the world, and outdoors you only see half of it.

Clarify whether you are or aren't the guy I'm thinking of, and if so, I want to see pics. Because the trees in my game are like 20K polys each; I could get away with your method if they were like 100 polys each, but I want my stuff to look good, which means tons more polys, and that means you will never achieve 1 VBO with 20K-poly trees x say 100 trees.
[/quote]

No, I do cull, just not as fine-grained as others. So yeah, I lose a little there, but not that much.

[quote]So, maxgpgpu, let me get this straight. You have the most powerful hardware on the planet available to you - your GPU - yet you're flat-out refusing to use it for the kind of processing it's best at. You have well-known, tried-and-tested solutions for collision and bboxes that have been proven to work for over a decade, yet you're also flat-out refusing to use them. You have a VBO solution that involves needing to re-up verts to arbitrary positions in the buffer in a non-predictable manner - congratulations, you've just re-invented the worst-case scenario for VBO usage - you really should profile for pipeline stalls sometime real soon.

None of this is theoretical fairyland stuff. This is all Real; this is all used in Real programs that Real people use every hour of every day of every week. You, on the other hand, have a whole bunch of theoretical fairyland stuff. It's not the case that you're a visionary whose ideas are too radical for the conservative majority to accept. It is the case that your ideas are insane.

Am I missing anything here?[/quote]


Yes. The most important thing you're missing is this. I'm here for brainstorming. If that means I learn 100x more from others than they learn from me... that's great for me! And I'm happy. What I'm not here for is to prove anything to anyone. I couldn't care less about that. But apparently you and others think my purpose here is to convince you that I'm a genius and I've figured out the greatest idea since sliced bread. Hell no! My whole purpose here is to try to make sure I'm headed down the right path, learn other perspectives, hear new ideas (where "new" means "new to me"). Maybe I'll have to run my own benchmarks and find out for myself. Or maybe someone will say something that rings a bell, and I can tell what's better (for situations x or y or z) without implementing and benchmarking everything myself.

What else you seem to be missing... maybe... is that neither of us is [fully] understanding the other's points. In my case, I am aware of some of these cases. And often the points people make assume some variation of the conventional way, which doesn't apply to the alternative I propose. But often it seems like others are aware of NONE of my points and are purposely ignoring them. I'll write that off to not wanting to read all my messages; I understand that. But also, sometimes I say "x is good" for something or in some cases, and people go apeshit and post in reply that "he says we should do everything the x way". I never said that, but that is, for some reason, what people take away. If I try to over-qualify everything, people will hassle me about "writing too much" -- believe me, I've been beaten to death for that too. My favorite is when someone falsely claims I said something I never said, then others pile on without ever finding where I said it -- which I never did. That's a lot of fun, and a waste of everyone's time.

[quote name='dpadam450' timestamp='1339734822' post='4949426']
What I'm getting at is that his stuff might work for a scene with low-poly objects at low resolution with low effects.

"If you lower the amount of processing in the bottleneck, it will work better!"
[/quote]
That is a universal truth. In my approach, though, the bottleneck situation is a bit more complex than in the conventional approach.

In the normal case, where few objects rotate or move each frame, the bottleneck is inherently the GPU, because the CPU has little to do.

In one abnormal case, where most objects rotate or move each frame, the bottleneck can become the CPU, for the reasons people here are screaming bloody murder about. That is, the CPU must transform the vertices of all rotated/moved objects in each batch and transfer them to VBOs in the GPU before it draws the batch. That's why I went to the trouble of writing the fastest possible 64-bit (and 32-bit) vertex transformation routines in SIMD/AVX/FMA4 assembly language.
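For illustration only -- the real routines are hand-written assembly -- a C-intrinsics sketch of that kind of transform loop might look like this (column-major matrix; all names hypothetical):

[code]
#include <xmmintrin.h>

/* Transform n xyzw vertices by a 4x4 column-major matrix whose columns
   are col[0..3]: dst[i] = col0*x + col1*y + col2*z + col3*w. */
void transformVerts(const __m128 col[4],
                    const float (*src)[4], float (*dst)[4], int n)
{
    for (int i = 0; i < n; ++i) {
        __m128 r = _mm_mul_ps(col[0], _mm_set1_ps(src[i][0]));
        r = _mm_add_ps(r, _mm_mul_ps(col[1], _mm_set1_ps(src[i][1])));
        r = _mm_add_ps(r, _mm_mul_ps(col[2], _mm_set1_ps(src[i][2])));
        r = _mm_add_ps(r, _mm_mul_ps(col[3], _mm_set1_ps(src[i][3])));
        _mm_storeu_ps(dst[i], r);
    }
}
[/code]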

