most efficient general rendering strategies for new GPUs

Started by
92 comments, last by maxgpgpu 11 years, 10 months ago


You are talking about memory. I'm specifically talking about overdraw: trying to draw stuff that is outside the frustum. By grouping everything into one VBO, you are wasting time drawing/shading triangles that are outside the frustum or occluded.


A draw call does not have to touch all vertices in a VBO. Read Hodgman's post.
Correct. Read maxgpgpu's posts and mine. We are talking about drawing all the objects in a given VBO that are enclosed in a volume.

NBA2K, Madden, Maneater, Killing Floor, Sims http://www.pawlowskipinball.com/pinballeternal

I suspect part of the problem is our mental images of "a house". In some games, "a house" may only be 50 triangles and the game has no way to get you inside the house at all. In other games, "a house" or building might be the entire universe of the game, and you can't even get outside to look at its outer surfaces - and it might contain millions of triangles. Therefore, I think we're talking past each other's vision of the situation. Still, it is true that in the worst-case scenario only 1/4 of the contents of a VBO would be visible, and the rest of the objects would be early rejected by the GPU for being non-visible, but they would still consume GPU time.

If it turns out modern GPUs can execute many batches simultaneously, potentially with different shaders, my approach is going to be "more efficient" for a considerably smaller percentage of the total universe... and I'll have to do some re-thinking and re-designing [and benchmarking].

What I really, really, really need is a coherent and detailed discussion of what modern GPUs can do. Where do I find that?
Even older hardware was capable of executing many shaders at the same time. It could have been working on the vertex shader stage for one batch of input at the same time as it was shading pixels for an earlier batch and blending for an even earlier one - that's a classic pipelined architecture, and the Wiki page has a decent diagram that can help illustrate it: http://en.wikipedia.org/wiki/Instruction_pipeline.

It could also have multiple pipes for each shader stage, e.g. a GeForce 6600GT had 8 pixel pipes and 3 vertex pipes, meaning it was capable of running the vertex shader for each vertex of an entire triangle simultaneously (GeForce 6 series specs comparison: http://en.wikipedia.org/wiki/Geforce_6#Geforce_6_Series_comparison).

Modern hardware just generalises/unifies the shader stages and massively increases the number of cores, but otherwise the pipelined/simultaneous architecture has been with us for a long long time.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.


Even older hardware was capable of executing many shaders at the same time. It could have been working on the vertex shader stage for one batch of input at the same time as it was shading pixels for an earlier batch and blending for an even earlier one - that's a classic pipelined architecture, and the Wiki page has a decent diagram that can help illustrate it: http://en.wikipedia.org/wiki/Instruction_pipeline.

It could also have multiple pipes for each shader stage, e.g. a GeForce 6600GT had 8 pixel pipes and 3 vertex pipes, meaning it was capable of running the vertex shader for each vertex of an entire triangle simultaneously (GeForce 6 series specs comparison: http://en.wikipedia.org/wiki/Geforce_6#Geforce_6_Series_comparison).

Modern hardware just generalises/unifies the shader stages and massively increases the number of cores, but otherwise the pipelined/simultaneous architecture has been with us for a long long time.

I knew shaders have been [dynamically] assigned to act as vertex shaders or pixel shaders on the same object for several years. What I had not realized is that the vertices and pixels of many objects with many different transformation matrices [and even shaders] could be processed simultaneously. I'd very much like to know the practical limits of this, so any modifications I make to my engine are well chosen.

I know this is not a simple, one-dimensional tradeoff. But there must be certain specific limits that an engine designer could be told that would inform his strategies. Like how many registers are available (and at what granularity can they be allocated to shaders). How many different vertex shaders can be run simultaneously. How many different pixel shaders can be run simultaneously. And so forth.

Do you guys load a transformation matrix for an object, then render the object, then repeat? Or do you load a pile of transformation matrices into one [or more] uniform blocks first, then somehow specify to the shaders which to reference for each vertex? Or do you load a pile of transformation matrices into a texture instead of a uniform block? What makes the association between a vertex and a specific transformation matrix? I've got a 4-bit field in my vertices to specify "which transformation matrix", but that's not to specify the primary transformation matrix (which in my engine is the same for all objects except "large moving objects"). I could expand that field, I suppose, though I'm running low on free bits in my 64-byte vertices.

I can see a parallel between what I do now and one possible approach along the lines you guys are talking about. As it is now, on each frame I compile a list of objects that have been "modified" (rotated, translated, or with 1+ vertices otherwise changed). Once all modifications have been performed, I process only the modified objects (transform vertices, upload to VBO, draw). If I leave local coordinates in the VBOs, then I don't need to upload the vertices of modified objects, but I do need to upload the transformation matrices of modified objects. If those transformation matrices are in uniform blocks, or in a texture, I can upload the matrices all at once at the beginning, then draw in whatever [sort] order I wish. This should [potentially] make it possible to render many sequential objects in a VBO with one draw call.

This might not work all that great in average cases if I test every object against the frustum, because a large percentage of objects will be culled, and the stream of consecutive objects to draw is usually rather limited by that fact. However, if I keep my current scheme of mapping a volume of space to each VBO, then most often all objects or no objects will be drawn. Furthermore, I could test the whole 3D volume against the frustum first, then reject entire VBOs full of objects with this one fast test in many cases.
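The "test the whole 3D volume against the frustum first" idea reduces to a single bounding-sphere-vs-frustum check per VBO. A minimal sketch of that check (all names hypothetical; the frustum is represented as six inward-facing planes, which is only one of several ways to encode it):

```cpp
#include <array>

// Hypothetical sketch: one bounding sphere per VBO "volume", so a whole
// VBO full of objects can be accepted or rejected with a single test.
// Plane normals point into the frustum.
struct Vec3  { float x, y, z; };
struct Plane { Vec3 n; float d; };   // inside when n.x*x + n.y*y + n.z*z + d >= 0

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Returns true if the sphere is at least partially inside all six planes.
bool sphereVisible(const std::array<Plane, 6>& frustum,
                   const Vec3& center, float radius) {
    for (const Plane& p : frustum)
        if (dot(p.n, center) + p.d < -radius)
            return false;            // completely behind one plane: cull the VBO
    return true;
}
```

If this test fails, every object in that VBO can be skipped without touching its matrices or issuing any draw call; if it passes, the per-object tests (or no tests at all, as discussed above) proceed as usual.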

Yeah, the ability to upload all modified transformation matrices at the start, then not upload them between objects seems promising. Where is the more efficient place to access them (uniform blocks, textures, or somewhere else)?
There's no one single solution that is going to work best in all cases, so you need to adapt according to what type of object you're drawing and what effect you want to achieve.

Uploading a new transformation matrix to the GPU is not the huge performance overhead you seem to think it is. Any modifications to the matrix are always done locally on the CPU, and it's just the final matrix that gets uploaded - the driver will typically set a dirty bit, then detect that and upload if needed before a draw call. In the old days there were a lot of other calculations done as well, but drivers are really, really good at detecting your usage patterns and optimizing accordingly - you can bet that if the driver can detect that you never use the inverse-transpose-modelview-projection matrix then it's not even going to bother calculating and uploading it.

Nowadays that doesn't even apply. The old fixed-pipeline matrix functions are gone from modern GL (and were never even in D3D) so you use an external matrix library instead and only do the calculations you need (D3D has included such a library for a long time, but you don't have to use it; with GL you need to find one yourself - you can even use D3D's library with it if you're so inclined; I've used it before and it works quite nicely).

So a transformation matrix upload is just 16 floats - the equivalent of 4 glUniform4f calls. And it will typically only happen once per-object, compared to vertex transforms which may happen thousands of times per object. If you do some profiling on a typical program you'll see that transformation matrix uploads are right down in the noise - they won't even register.

Options for loading matrices include using regular uniforms, putting them into UBOs, encoding them as textures or packing them into your vertex streams as per-instance data. You just use whichever option works well with whatever you're currently doing.
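As a rough illustration of the UBO option, here is a CPU-side sketch (class and member names are made up) that packs all per-object matrices into one staging buffer. Under std140 rules a mat4 array has a 64-byte (16-float) element stride, so this layout would line up with a shader-side block like `layout(std140) uniform Transforms { mat4 model[MAX_OBJECTS]; };`, and a vertex can then pick its matrix with a small integer attribute:

```cpp
#include <cstring>
#include <vector>

// Hypothetical sketch: stage every per-object transform in one contiguous
// buffer so a single upload (e.g. glBufferSubData on a UBO) replaces
// per-object glUniformMatrix4fv calls.
struct Mat4 { float m[16]; };                 // column-major, as GL expects

class MatrixBlock {
public:
    explicit MatrixBlock(std::size_t maxObjects)
        : data_(maxObjects * 16, 0.0f) {}

    // Store object i's transform at its std140 array slot.
    void set(std::size_t i, const Mat4& mat) {
        std::memcpy(&data_[i * 16], mat.m, sizeof(mat.m));
    }

    // Byte offset of object i inside the UBO (for glBindBufferRange etc.).
    static std::size_t byteOffset(std::size_t i) {
        return i * 16 * sizeof(float);
    }

    const float* raw() const { return data_.data(); }
    std::size_t byteSize() const { return data_.size() * sizeof(float); }

private:
    std::vector<float> data_;
};

// After filling the block once per frame, one call would upload everything:
// glBufferSubData(GL_UNIFORM_BUFFER, 0, block.byteSize(), block.raw());
```

The same staging scheme works if you decide to store the matrices in a texture instead; only the upload call and the shader-side fetch change.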

For actual transforms you typically have a number of options. Let's take an example - say you're lighting a model. Option 1 is to move each vertex of the model into world space, then calculate your lighting, then move the final vertices to view space. Not going to work well. So you do option 2 instead - which is to move the light into the same space as the model (transform its position - and it will just be a single position - by the inverse of the model's world-space transform) then calculate your lighting entirely in the model's local space based on its untransformed positions. You've just saved potentially several thousand matrix * position transforms.
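Option 2 can be sketched in plain C++ with no GL at all. For a rigid transform M = (R, t) the inverse is simply (R^T, -R^T t), so moving the light into the model's local space is cheap, and the vertex-to-light geometry comes out identical whether computed in world or local space. All names below are illustrative:

```cpp
#include <cmath>

// Hypothetical sketch: transform one light position into a model's local
// space instead of transforming thousands of vertices into world space.
struct V3    { float x, y, z; };
struct Rigid { float r[9]; V3 t; };   // row-major 3x3 rotation + translation

V3 apply(const Rigid& m, const V3& p) {        // world = R * local + t
    return { m.r[0]*p.x + m.r[1]*p.y + m.r[2]*p.z + m.t.x,
             m.r[3]*p.x + m.r[4]*p.y + m.r[5]*p.z + m.t.y,
             m.r[6]*p.x + m.r[7]*p.y + m.r[8]*p.z + m.t.z };
}

V3 applyInverse(const Rigid& m, const V3& p) { // local = R^T * (world - t)
    V3 d{ p.x - m.t.x, p.y - m.t.y, p.z - m.t.z };
    return { m.r[0]*d.x + m.r[3]*d.y + m.r[6]*d.z,
             m.r[1]*d.x + m.r[4]*d.y + m.r[7]*d.z,
             m.r[2]*d.x + m.r[5]*d.y + m.r[8]*d.z };
}

float dist(const V3& a, const V3& b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx*dx + dy*dy + dz*dz);
}
```

Because rigid transforms preserve distances and angles, the Lambert term computed from the untransformed vertices and the locally-transformed light matches the world-space result, at the cost of one inverse transform per light instead of one transform per vertex.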

So - sometimes you need to turn your thinking on its head to arrive at the best solution for something.

The focus on draw calls is regrettable but perhaps understandable owing to the heavy emphasis on this that came from the vendors some time back. Being able to take everything in a single draw call is not the huge advantage it might seem on the surface. It's all a balancing act - in order to reduce draw calls you need to do a certain amount of extra work, and that extra work can very easily overwhelm any saving you get.

All that a draw call really does is add a few bytes to the command buffer - nothing much else. It will also do some rummaging around for dirty states and send any such changed states to the GPU, but if you don't have any state changes between draw calls then they become really really cheap operations. The advice to reduce draw call counts must be viewed in the context of old-school code, where you'd typically have thousands of glBegin/glEnd pairs per frame. That's what needed to die.

That old advice also applied to nasty old versions of D3D which also did a lot of state validation for each draw call, and which made calls a lot more expensive than they otherwise would have been. That's also gone now.

So, as long as you don't do anything stupid, you're fine. You can take each model in its own draw call, giving you a few hundred per frame, and everything will still work out OK. It's not a big overhead at all.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.


That old advice also applied to nasty old versions of D3D which also did a lot of state validation for each draw call, and which made calls a lot more expensive than they otherwise would have been. That's also gone now.

So, as long as you don't do anything stupid, you're fine. You can take each model in its own draw call, giving you a few hundred per frame, and everything will still work out OK. It's not a big overhead at all.


This is so very true. In my old D3D9 project I often found myself gaining a lot of performance by avoiding as many interactions with the driver as I could... state changes, VB/IB changes (these were pretty slow in D3D9, I found) and so on, and I mean BIG gains. I always end up disappointed in DX11 because no matter what I try to change I pretty much end up with the same performance... it seems to be a very well-oiled system.

Stefano Casillo
TWITTER: KunosStefano
AssettoCorsa - netKar PRO - Kunos Simulazioni

I should probably do my homework rather than ask this silly question, but... here goes anyway. Feel free to ignore it.

Can instancing support in the latest generation of GPUs make it possible to pervert instancing in the following way? Assume we do instancing the regular way, so this question only refers to a large set of unique objects [that exist only once in the game/simulation]. Can the number of vertices be a per-instance variable? What I'm wondering here is whether it might be possible to consider all these diverse objects as separate instances of some general amorphous object.

In the instancing I'm familiar with, every instance has the same number of vertices. This is for jobs like rendering a crapload of leaves on trees, and the per-instance data tells for each leaf: position, orientation, color (for example). However, if per-instance data can include the number of vertices and maybe a couple more items, perhaps every object with every number of vertices could be rendered with instancing. That sounds wacko offhand, but then effectively instanceID means objectID, so instanceID can double as the index into an array of general local-to-view transformation matrices.

This probably exceeds the flexibility of the instancing mechanism, but then again, maybe it doesn't. Any comments?
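For what it's worth, one way to get a similar effect without bending instancing is a multi-draw call such as glMultiDrawElements, which takes a per-draw index count and offset, so objects with different vertex counts sharing one VBO/IBO can go out in a single submission. A hypothetical CPU-side sketch of building those per-draw arrays (Object and DrawList are made-up names; the actual GL call appears only as a comment):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: batch unique objects with differing vertex counts by
// building one count and one byte offset per visible object, then submitting
// them all with a single multi-draw call.
struct Object {
    unsigned indexCount;   // number of indices this object uses
    unsigned firstIndex;   // where its indices start in the shared index buffer
};

struct DrawList {
    std::vector<int>         counts;    // per-draw index counts
    std::vector<const void*> offsets;   // per-draw byte offsets into the IBO
};

DrawList buildDrawList(const std::vector<Object>& visible) {
    DrawList dl;
    for (const Object& o : visible) {
        dl.counts.push_back(static_cast<int>(o.indexCount));
        // GL takes each offset as a pointer-sized byte offset into the bound IBO.
        dl.offsets.push_back(reinterpret_cast<const void*>(
            static_cast<std::size_t>(o.firstIndex) * sizeof(unsigned)));
    }
    return dl;
}

// One call then replaces N separate draw calls, e.g.:
// glMultiDrawElements(GL_TRIANGLES, dl.counts.data(), GL_UNSIGNED_INT,
//                     dl.offsets.data(), (GLsizei)dl.counts.size());
```

This sidesteps the fixed-vertex-count limitation of instancing, though it does nothing about the texture/state concern raised in the reply below: everything in the batch must still share the same bound state.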

This is so very true. In my old D3D9 project I often found myself gaining a lot of performance by avoiding as many interactions with the driver as I could... state changes, VB/IB changes (these were pretty slow in D3D9, I found) and so on, and I mean BIG gains. I always end up disappointed in DX11 because no matter what I try to change I pretty much end up with the same performance... it seems to be a very well-oiled system.


I find that the biggest performance gains with D3D11 come when you use the API the way it's obviously been designed to be used. As a rule, if you're fighting the API and trying to wrestle it into conforming with your code structure, then your code structure probably needs to be changed.

It seems reasonable to suppose that the same also applies to modern OpenGL.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.


I should probably do my homework rather than ask this silly question, but... here goes anyway. Feel free to ignore it.

Can instancing support in the latest generation of GPUs make it possible to pervert instancing in the following way? Assume we do instancing the regular way, so this question only refers to a large set of unique objects [that exist only once in the game/simulation]. Can the number of vertices be a per-instance variable? What I'm wondering here is whether it might be possible to consider all these diverse objects as separate instances of some general amorphous object.

In the instancing I'm familiar with, every instance has the same number of vertices. This is for jobs like rendering a crapload of leaves on trees, and the per-instance data tells for each leaf: position, orientation, color (for example). However, if per-instance data can include the number of vertices and maybe a couple more items, perhaps every object with every number of vertices could be rendered with instancing. That sounds wacko offhand, but then effectively instanceID means objectID, so instanceID can double as the index into an array of general local-to-view transformation matrices.

This probably exceeds the flexibility of the instancing mechanism, but then again, maybe it doesn't. Any comments?


I doubt it - if the number of verts needs to change then it seems a reasonably good bet that the texture also needs to change (otherwise your texcoords would be out of whack) so you're looking at a separate draw call anyway.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.
