most efficient general rendering strategies for new GPUs



#81 teutonicus   Members   -  Reputation: 518


Posted 18 June 2012 - 05:04 PM

You are talking about memory. I'm specifically talking about overdraw and trying to draw stuff that is outside the frustum. By grouping everything into one VBO, you are wasting time drawing/shading triangles that are outside the frustum or occluded.


A draw call does not have to touch all vertices in a VBO. Read Hodgman's post.
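For illustration, a minimal sketch of what that means in GL terms: several objects packed into one shared VBO/IBO, with each draw call naming only its own index range. The struct and names here are assumptions for the sketch, not anything from Hodgman's post.

    // Sketch: many objects share one vertex/index buffer; a draw call only
    // touches the range it names. Assumes a GL 3.2+ context with the shared
    // buffers already filled and bound to the current VAO.
    // (Include your GL loader header, e.g. glew or glad, for the GL types.)
    #include <vector>
    #include <cstddef>

    struct PackedObject {
        GLsizei indexCount;   // how many indices belong to this object
        GLsizei firstIndex;   // where this object's indices start in the shared IBO
        GLint   baseVertex;   // where its vertices start in the shared VBO
    };

    void drawVisibleObjects(const std::vector<PackedObject>& objects,
                            const std::vector<bool>& visible)
    {
        for (std::size_t i = 0; i < objects.size(); ++i) {
            if (!visible[i]) continue;   // culled on the CPU: never submitted
            const PackedObject& o = objects[i];
            glDrawElementsBaseVertex(GL_TRIANGLES, o.indexCount, GL_UNSIGNED_INT,
                (const void*)(std::size_t(o.firstIndex) * sizeof(GLuint)),
                o.baseVertex);
        }
    }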


#82 dpadam450   Members   -  Reputation: 945


Posted 18 June 2012 - 06:22 PM

Correct. Read maxgpgpu's posts and mine. We are talking about drawing all the objects in a given VBO that are enclosed in a volume.

Edited by dpadam450, 18 June 2012 - 06:23 PM.


#83 maxgpgpu   Crossbones+   -  Reputation: 281


Posted 18 June 2012 - 10:17 PM

I suspect part of the problem is our mental images of "a house". In some games, "a house" may be only 50 triangles and the game has no way to get you inside the house at all. In other games, "a house" or building might be the entire universe of the game, and you can't even get outside to look at its outer surfaces - and it might contain millions of triangles. Therefore, I think we're talking past each other's vision of the situation. Still, it is true that in the worst-case scenario only 1/4 of the contents of a VBO would be visible, and the rest of the objects would be early-rejected by the GPU for being non-visible, but they would still consume GPU time.

If it turns out modern GPUs can be executing many batches simultaneously, potentially with different shaders, my approach is going to be "more efficient" for a considerably smaller percentage of the total universe... and I'll have to do some re-thinking and re-designing [and benchmarking].

What I really, really, really need is a coherent and detailed discussion of what modern GPUs can do. Where do I find that?

#84 mhagain   Crossbones+   -  Reputation: 8275


Posted 19 June 2012 - 07:24 AM

Even older hardware was capable of executing many shaders at the same time. It could have been working on the vertex shader stage for one batch of input at the same time as it was shading pixels for an earlier batch and blending for an even earlier one - that's a classic pipelined architecture, and the Wiki page has a decent diagram that can help illustrate it: http://en.wikipedia.org/wiki/Instruction_pipeline.

It could also have multiple pipes for each shader stage, e.g. a GeForce 6600GT had 8 pixel pipes and 3 vertex pipes, meaning it was capable of running the vertex shader for each vertex of an entire triangle simultaneously (GeForce 6 series specs comparison: http://en.wikipedia.org/wiki/Geforce_6#Geforce_6_Series_comparison).

Modern hardware just generalises/unifies the shader stages and massively increases the number of cores, but otherwise the pipelined/simultaneous architecture has been with us for a long long time.



#85 maxgpgpu   Crossbones+   -  Reputation: 281


Posted 19 June 2012 - 10:00 AM

mhagain, on 19 June 2012 - 07:24 AM, said:

Even older hardware was capable of executing many shaders at the same time. It could have been working on the vertex shader stage for one batch of input at the same time as it was shading pixels for an earlier batch and blending for an even earlier one - that's a classic pipelined architecture, and the Wiki page has a decent diagram that can help illustrate it: http://en.wikipedia.org/wiki/Instruction_pipeline.

It could also have multiple pipes for each shader stage, e.g. a GeForce 6600GT had 8 pixel pipes and 3 vertex pipes, meaning it was capable of running the vertex shader for each vertex of an entire triangle simultaneously (GeForce 6 series specs comparison: http://en.wikipedia.org/wiki/Geforce_6#Geforce_6_Series_comparison).

Modern hardware just generalises/unifies the shader stages and massively increases the number of cores, but otherwise the pipelined/simultaneous architecture has been with us for a long long time.

I knew shaders have been [dynamically] assigned to act as vertex shaders or pixel shaders on the same object for several years. What I had not realized is that the vertices and pixels of many objects, with many different transformation matrices [and even shaders], could be processed simultaneously. I'd very much like to know the practical limits of this, so any modifications I make to my engine are well chosen.

I know this is not a simple, one-dimensional tradeoff. But there must be certain specific limits that an engine designer could be told that would inform his strategies. Like how many registers are available (and at what granularity they can be allocated to shaders). How many different vertex shaders can be run simultaneously. How many different pixel shaders can be run simultaneously. And so forth.

Do you guys load a transformation matrix for an object, then render the object, then repeat? Or do you load a pile of transformation matrices into one [or more] uniform blocks first, then somehow specify to the shaders which one to reference for each vertex? Or do you load a pile of transformation matrices into a texture instead of a uniform block? What makes the association between a vertex and a specific transformation matrix? I've got a 4-bit field in my vertices to specify "which transformation matrix", but that's not to specify the primary transformation matrix (which in my engine is the same for all objects except "large moving objects"). I could expand that field, I suppose, though I'm running low on free bits in my 64-byte vertices.

I can see a parallel between what I do now and one possible approach along the lines you guys are talking about. As it is now, on each frame I compile a list of objects that have been "modified" (rotated, translated, or with 1+ vertices otherwise changed). Once all modifications have been performed, I process only the modified objects (transform vertices, upload to VBO, draw).

If I leave local coordinates in the VBOs, then I don't need to upload the vertices of modified objects, but I do need to upload their transformation matrices. If those transformation matrices are in uniform blocks, or in a texture, I can upload the matrices all at once at the beginning, then draw in whatever [sort] order I wish. This should [potentially] make it possible to render many sequential objects in a VBO with one draw call.

This might not work all that great in average cases if I test every object against the frustum, because a large percentage of objects will be culled, and the stream of consecutive objects to draw is usually rather limited by that fact. However, if I keep my current scheme of mapping a volume of space to each VBO, then most often all objects or no objects will be drawn. Furthermore, I could test the whole 3D volume against the frustum first, and in many cases reject an entire VBO full of objects with this one fast test.

Yeah, the ability to upload all modified transformation matrices at the start, then not upload them between objects, seems promising. Where is the more efficient place to access them (uniform blocks, textures, or somewhere else)?

#86 mhagain   Crossbones+   -  Reputation: 8275


Posted 19 June 2012 - 12:13 PM

There's no one single solution that is going to work best in all cases, so you need to adapt according to what type of object you're drawing and what effect you want to achieve.

Uploading a new transformation matrix to the GPU is not the huge performance overhead you seem to think it is. Any modifications to the matrix are always done locally on the CPU, and it's just the final matrix that gets uploaded - the driver will typically set a dirty bit, then detect that and upload if needed before a draw call. In the old days there were a lot of other calculations done as well, but drivers are really, really good at detecting your usage patterns and optimizing accordingly - you can bet that if the driver can detect that you never use the inverse-transpose-modelview-projection matrix then it's not even going to bother calculating and uploading it.

Nowadays that doesn't even apply. The old fixed-pipeline matrix functions are gone from modern GL (and were never even in D3D), so you use an external matrix library instead and only do the calculations you need (D3D has included such a library for a long time, but you don't have to use it; with GL you need to find one yourself - you can even use D3D's math library with GL if you're so inclined; I've done so before and it works quite nicely).

So a transformation matrix upload is just 16 floats - the equivalent of four glUniform4f calls. And it will typically happen only once per object, compared to vertex transforms, which may happen thousands of times per object. If you do some profiling on a typical program you'll see that transformation matrix uploads are right down in the noise - they won't even register.

Options for loading matrices include using regular uniforms, putting them into UBOs, encoding them as textures, or packing them into your vertex streams as per-instance data. You just use whichever option works well with whatever you're currently doing.
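As a concrete sketch of the UBO option (the block layout, the 256-matrix cap, and the variable names are assumptions; 256 mat4s is 16 KB, which fits GL's guaranteed minimum uniform block size):

    // Sketch: upload every object's matrix once per frame into one UBO, then
    // select per draw with a single integer uniform. Assumes GL 3.1+ and that
    // matrixUbo / matrixIndexLoc / matrices / numObjects are already set up.
    const char* vsSource = R"(
        #version 140
        uniform MatrixBlock { mat4 modelView[256]; };  // 256 * 64 B = 16 KB
        uniform mat4 projection;
        uniform int  matrixIndex;   // which object's matrix this draw uses
        in vec4 position;
        void main() {
            gl_Position = projection * modelView[matrixIndex] * position;
        }
    )";

    // Once per frame, after the CPU has rebuilt the modified matrices:
    glBindBuffer(GL_UNIFORM_BUFFER, matrixUbo);
    glBufferSubData(GL_UNIFORM_BUFFER, 0, numObjects * 16 * sizeof(float), matrices);

    // Per draw call: one tiny uniform instead of a full 16-float matrix upload.
    glUniform1i(matrixIndexLoc, objectIndex);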

For actual transforms you typically have a number of options. Let's take an example - say you're lighting a model. Option 1 is to move each vertex of the model into world space, then calculate your lighting, then move the final vertices to view space. Not going to work well. So you do option 2 instead - which is to move the light into the same space as the model (transform its position - and it will just be a single position - by the inverse of the model's world-space transform), then calculate your lighting entirely in the model's local space based on its untransformed positions. You've just saved potentially several thousand matrix * position transforms.
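A minimal sketch of that option 2, using glm as a stand-in matrix library (the function name is made up for illustration):

    // Sketch: one CPU-side inverse + transform replaces a per-vertex
    // model-to-world transform for lighting.
    #include <glm/glm.hpp>

    // worldFromModel: the model's local-to-world matrix; lightWorld: world space.
    glm::vec3 lightToModelSpace(const glm::mat4& worldFromModel,
                                const glm::vec3& lightWorld)
    {
        glm::mat4 modelFromWorld = glm::inverse(worldFromModel);
        return glm::vec3(modelFromWorld * glm::vec4(lightWorld, 1.0f));
    }
    // The shader then lights using the model's untransformed positions and
    // this model-space light position - no per-vertex world transform needed.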

So - sometimes you need to turn your thinking on its head to arrive at the best solution for something.

The focus on draw calls is regrettable but perhaps understandable owing to the heavy emphasis on this that came from the vendors some time back. Being able to take everything in a single draw call is not the huge advantage it might seem on the surface. It's all a balancing act - in order to reduce draw calls you need to do a certain amount of extra work, and that extra work can very easily overwhelm any saving you get.

All that a draw call really does is add a few bytes to the command buffer - nothing much else. It will also do some rummaging around for dirty states and send any such changed states to the GPU, but if you don't have any state changes between draw calls then they become really really cheap operations. The advice to reduce draw call counts must be viewed in the context of old-school code, where you'd typically have thousands of glBegin/glEnd pairs per frame. That's what needed to die.

That old advice also applied to nasty old versions of D3D which also did a lot of state validation for each draw call, and which made calls a lot more expensive than they otherwise would have been. That's also gone now.

So, as long as you don't do anything stupid, you're fine. You can take each model in its own draw call, giving you a few hundred per frame, and everything will still work out OK. It's not a big overhead at all.

It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.


#87 kunos   Crossbones+   -  Reputation: 2207


Posted 19 June 2012 - 12:38 PM

mhagain, on 19 June 2012 - 12:13 PM, said:

That old advice also applied to nasty old versions of D3D which also did a lot of state validation for each draw call, and which made calls a lot more expensive than they otherwise would have been. That's also gone now.

So, as long as you don't do anything stupid, you're fine. You can take each model in its own draw call, giving you a few hundred per frame, and everything will still work out OK. It's not a big overhead at all.


This is so very true. In my old D3D9 project I often found myself gaining a lot of performance by avoiding as many interactions with the driver as I could... state changes, VB/IB changes (these were pretty slow in D3D9, I found) and so on, and I mean BIG gains. I always end up disappointed in DX11 because no matter what I try to change I pretty much end up with the same performance... it seems to be a very well-oiled system.

#88 maxgpgpu   Crossbones+   -  Reputation: 281


Posted 19 June 2012 - 01:03 PM

I should probably do my homework rather than ask this silly question, but... here goes anyway. Feel free to ignore it.

Can instancing support in the latest generation of GPUs make it possible to pervert instancing in the following way? Assume we do instancing the regular way, so this question only refers to a large set of unique objects [that exist only once in the game/simulation]. Can the number of vertices be a per-instance variable? What I'm wondering here is whether it might be possible to consider all these diverse objects as separate instances of some general amorphous object.

In the instancing I'm familiar with, every instance has the same number of vertices. This is for jobs like rendering a crapload of leaves on trees, where the per-instance data gives, for each leaf: position, orientation, color (for example). However, if the per-instance data can include the # of vertices and maybe a couple more items, perhaps every object with any number of vertices could be rendered with instancing. That sounds wacko offhand, but then effectively instanceID means objectID, so instanceID can double as the index into an array of general local-to-view transformation matrices.

This probably exceeds the flexibility of the instancing mechanism, but then again, maybe it doesn't. Any comments?
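For reference, the conventional fixed-vertex-count instancing described above looks roughly like this, with gl_InstanceID doubling as the index into a matrix array (a sketch; the 256-instance array and the names are assumptions):

    // Sketch: one mesh, many instances; gl_InstanceID selects each instance's
    // transform. Assumes GL 3.3+ with buffers and uniforms already set up.
    const char* vsSource = R"(
        #version 330
        uniform InstanceBlock { mat4 worldFromLocal[256]; };
        uniform mat4 viewProjection;
        in vec4 position;
        void main() {
            gl_Position = viewProjection * worldFromLocal[gl_InstanceID] * position;
        }
    )";

    // One call draws every leaf: the same index range, numLeaves instances.
    glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT,
                            nullptr, numLeaves);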

#89 mhagain   Crossbones+   -  Reputation: 8275


Posted 19 June 2012 - 01:13 PM

kunos, on 19 June 2012 - 12:38 PM, said:

This is so very true. In my old D3D9 project I often found myself gaining a lot of performance by avoiding as many interactions with the driver as I could... state changes, VB/IB changes (these were pretty slow in D3D9, I found) and so on, and I mean BIG gains. I always end up disappointed in DX11 because no matter what I try to change I pretty much end up with the same performance... it seems to be a very well-oiled system.


I find that the biggest performance gains with D3D11 come when you use the API the way it's obviously been designed to be used. As a rule, if you're fighting the API and trying to wrestle it into conforming with your code structure, then your code structure probably needs to be changed.

It seems reasonable to suppose that the same also applies to modern OpenGL.



#90 mhagain   Crossbones+   -  Reputation: 8275


Posted 19 June 2012 - 01:16 PM

maxgpgpu, on 19 June 2012 - 01:03 PM, said:

I should probably do my homework rather than ask this silly question, but... here goes anyway. Feel free to ignore it.

Can instancing support in the latest generation of GPUs make it possible to pervert instancing in the following way? Assume we do instancing the regular way, so this question only refers to a large set of unique objects [that exist only once in the game/simulation]. Can the number of vertices be a per-instance variable? What I'm wondering here is whether it might be possible to consider all these diverse objects as separate instances of some general amorphous object.

In the instancing I'm familiar with, every instance has the same number of vertices. This is for jobs like rendering a crapload of leaves on trees, where the per-instance data gives, for each leaf: position, orientation, color (for example). However, if the per-instance data can include the # of vertices and maybe a couple more items, perhaps every object with any number of vertices could be rendered with instancing. That sounds wacko offhand, but then effectively instanceID means objectID, so instanceID can double as the index into an array of general local-to-view transformation matrices.

This probably exceeds the flexibility of the instancing mechanism, but then again, maybe it doesn't. Any comments?


I doubt it - if the number of verts needs to change then it seems a reasonably good bet that the texture also needs to change (otherwise your texcoords would be out of whack), so you're looking at a separate draw call anyway.



#91 phantom   Moderators   -  Reputation: 7558


Posted 19 June 2012 - 01:52 PM

Re: Draw calls.

This is, and remains, an important issue.
Draw calls, while cheaper than in D3D9, can still suck up CPU power depending on what you are doing. If you are updating buffers and using them in a draw call, the driver has to shuffle memory under the hood, copy things around, and update other things. While avoiding them to the extreme the OP does is a bit crazy, even so you should be careful, as they can suck away CPU time pretty quickly.

With D3D11, using multi-threaded deferred contexts (something GL doesn't have), on something like an i7 you top out at ~15,000 draw calls per frame if you want to maintain 60fps. BF3 tops out at around 7,500 per frame, if memory serves.

In short; don't go crazy, keeping your draw calls down remains a good thing due to driver overhead.

(For reference, using a very CPU-heavy test loop and performing 50,000 draw calls per frame: a 2.6GHz i7 with an NV GTX470 GPU can't even clear 30fps using 6 cores to render. An X360, using the same code base and same test, will happily do 60fps. This is purely CPU overhead causing the problem, and most of the time is spent in the driver shuffling data around. Clearly this wouldn't play out in a real game situation, but there is still reason to be concerned about CPU cost.)
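(For anyone unfamiliar with them, the deferred contexts mentioned above work roughly as below - a minimal sketch, error handling omitted, with variables like device and immediate assumed to already exist.)

    // Sketch: record draw calls on a worker thread via a deferred context,
    // then play the recorded command list back on the immediate context.
    #include <d3d11.h>

    ID3D11DeviceContext* deferred = nullptr;
    device->CreateDeferredContext(0, &deferred);

    // Worker thread: record state + draws (nothing reaches the GPU yet).
    deferred->IASetVertexBuffers(0, 1, &vertexBuffer, &stride, &offset);
    deferred->Draw(vertexCount, 0);

    ID3D11CommandList* commands = nullptr;
    deferred->FinishCommandList(FALSE, &commands);

    // Main thread: submit everything recorded above in one go.
    immediate->ExecuteCommandList(commands, TRUE);
    commands->Release();
    deferred->Release();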

Re: varying verts per instance

You CAN do this... although you probably shouldn't.
However, you don't do it via traditional instancing; instead you use the geometry shader to create the extra vertex data - but this comes at a cost, as the output of a GS has to be serialised correctly, which can introduce significant bottlenecks in the GPU.

Generally, unless you have very little work on the GPU and are totally rammed on the CPU, you won't want to do this; instead, just take the hit of an extra draw call per model type. Chances are you aren't going to have that many model types that require this anyway, so it's not going to be a huge CPU cost, and you avoid a large GPU cost.
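For completeness, the geometry-shader route looks something like this - a sketch only, bearing phantom's warning in mind; the per-point count input is hypothetical:

    // Sketch: a GS can emit a variable number of vertices per input primitive,
    // bounded by max_vertices - that variable output length is what forces the
    // serialisation cost mentioned above. GLSL 150 / GL 3.2.
    const char* gsSource = R"(
        #version 150
        layout(points) in;
        layout(triangle_strip, max_vertices = 64) out;
        flat in int vertCount[];   // hypothetical per-point count from the VS
        void main() {
            for (int i = 0; i < vertCount[0] && i < 64; ++i) {
                gl_Position = gl_in[0].gl_Position
                            + vec4(0.01 * float(i), 0.0, 0.0, 0.0);
                EmitVertex();
            }
            EndPrimitive();
        }
    )";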

#92 maxgpgpu   Crossbones+   -  Reputation: 281


Posted 19 June 2012 - 02:26 PM


maxgpgpu, on 19 June 2012 - 01:03 PM, said:

I should probably do my homework rather than ask this silly question, but... here goes anyway. Feel free to ignore it.

Can instancing support in the latest generation of GPUs make it possible to pervert instancing in the following way? Assume we do instancing the regular way, so this question only refers to a large set of unique objects [that exist only once in the game/simulation]. Can the number of vertices be a per-instance variable? What I'm wondering here is whether it might be possible to consider all these diverse objects as separate instances of some general amorphous object.

In the instancing I'm familiar with, every instance has the same number of vertices. This is for jobs like rendering a crapload of leaves on trees, where the per-instance data gives, for each leaf: position, orientation, color (for example). However, if the per-instance data can include the # of vertices and maybe a couple more items, perhaps every object with any number of vertices could be rendered with instancing. That sounds wacko offhand, but then effectively instanceID means objectID, so instanceID can double as the index into an array of general local-to-view transformation matrices.

This probably exceeds the flexibility of the instancing mechanism, but then again, maybe it doesn't. Any comments?

mhagain, on 19 June 2012 - 01:16 PM, said:

I doubt it - if the number of verts needs to change then it seems a reasonably good bet that the texture also needs to change (otherwise your texcoords would be out of whack), so you're looking at a separate draw call anyway.

My vertex structures contain a textureID field that indexes into the texture array. Therefore, that's not a killer even now. Obviously this is more-or-less necessary in my current scheme, which renders unlimited objects in a single draw call --- each object can have different textures, different normalmaps, different specularmaps, etc. My 64-byte vertex structure is running low on free bits at this point, so my alternative is to eliminate those textureID fields [and the matrixID field] and replace them with an objectID field that indexes into a texture to get all the information that could possibly be needed (at the expense of an extra texture-fetch per vertex). Unfortunately, that eliminates one nice feature of the current scheme --- the ability to specify texture, normalmap, and specularmap on a per-vertex basis, not just per-object.
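A sketch of that objectID alternative - one per-vertex index, with the per-object IDs fetched from a buffer texture (the names and the RGBA32F packing are assumptions):

    // Sketch: a single per-vertex objectID replaces several ID fields; the
    // vertex shader fetches {textureID, normalmapID, specularmapID, matrixID}
    // packed into one RGBA32F texel per object. GL 3.1+ for samplerBuffer.
    const char* vsSource = R"(
        #version 140
        uniform samplerBuffer objectTable;  // one texel of IDs per object
        uniform mat4 viewProjection;
        in vec4 position;
        in int  objectID;                   // fed with glVertexAttribIPointer
        flat out ivec4 materialIDs;         // forwarded to the fragment shader
        void main() {
            materialIDs = ivec4(texelFetch(objectTable, objectID));
            gl_Position = viewProjection * position;
        }
    )";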

Edited by maxgpgpu, 19 June 2012 - 02:31 PM.


#93 mhagain   Crossbones+   -  Reputation: 8275


Posted 19 June 2012 - 06:47 PM

The moral of the story is that draw calls are still not free, but you don't need to pathologically avoid them as much as you did before. 7.5k calls in a shipping AAA title would certainly have given everyone the horrors not so long ago. "Going crazy" can work in both directions...

Indexing into a texture array is a decent way of avoiding changes and keeping calls down, but it adds the constraint that all of your textures must be the same size. You're not going to use the same texture size for a small pebble or for a particle as you use for a brick wall, I hope. Aiming for the entire scene in a single call also constrains you to using the same material properties for all of your objects. If you're happy with that tradeoff, then sure, go for it, but it really reduces your setup's general-purpose utility. You can't even do something as simple as enabling alpha blending for a window while keeping it disabled for everything else. That makes the objective something more of theoretical interest than practical utility.



#94 maxgpgpu   Crossbones+   -  Reputation: 281


Posted 04 July 2012 - 01:05 PM

mhagain, on 19 June 2012 - 06:47 PM, said:

The moral of the story is that draw calls are still not free, but you don't need to pathologically avoid them as much as you did before. 7.5k calls in a shipping AAA title would certainly have given everyone the horrors not so long ago. "Going crazy" can work in both directions...

Indexing into a texture array is a decent way of avoiding changes and keeping calls down, but it adds the constraint that all of your textures must be the same size. You're not going to use the same texture size for a small pebble or for a particle as you use for a brick wall, I hope. Aiming for the entire scene in a single call also constrains you to using the same material properties for all of your objects. If you're happy with that tradeoff, then sure, go for it, but it really reduces your setup's general-purpose utility. You can't even do something as simple as enabling alpha blending for a window while keeping it disabled for everything else. That makes the objective something more of theoretical interest than practical utility.

Well, we still have more than one texture unit to work with. I assume 4 texture units, and hopefully GPUs don't ever drop below that number.

What I do is pack 4 to hundreds of textures (or normalmaps, heightmaps, etc.) into each texture (more or less "texture atlas" style). So the tcoords for a given object don't range from 0.000 to 1.000 on each axis; they span only a small fraction of that range. Of course my approach means I can't create repeating textures (for tiled floors and such) by letting the tcoords extend far below 0.000 and far above 1.000.
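The remapping that implies, as a tiny sketch (the rectangle struct is an assumption):

    // Sketch: an object's [0,1] texcoords are squeezed into its atlas rectangle.
    // Because results never leave the rectangle, GL_REPEAT-style tiling is lost.
    struct AtlasRect { float u0, v0, u1, v1; };  // object's sub-rectangle in the atlas

    void remapToAtlas(float& u, float& v, const AtlasRect& r)
    {
        u = r.u0 + u * (r.u1 - r.u0);
        v = r.v0 + v * (r.v1 - r.v0);
    }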

Clearly I need to rethink my balancing act. You guys are probably correct that putting local coordinates into the VBOs only for "large moving objects" is not the optimal tradeoff. But some comments seem to indicate that going whole-hog in the opposite direction isn't very good either.

In many games and simulations, the [vast/large/substantial] majority of objects are fixed. These objects probably render at the same speed with either local or world coordinates in the VBOs, everything else being equal. Probably the simplest test I can perform is to break my draw calls up, one per object in each VBO, and perform frustum tests on each object. I can certainly measure how much CPU time that adds. Unfortunately, I'm not very proficient at figuring out the impact on GPU execution time.
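That per-object test might look like this with bounding spheres - a sketch; the plane representation is an assumption, and the cost is six dot products per object:

    // Sketch: cull an object when its bounding sphere lies fully outside any
    // of the six frustum planes (normals pointing into the frustum).
    #include <glm/glm.hpp>

    struct Plane { glm::vec3 n; float d; };  // plane: dot(n, p) + d == 0

    bool sphereVisible(const Plane planes[6], const glm::vec3& center, float radius)
    {
        for (int i = 0; i < 6; ++i)
            if (glm::dot(planes[i].n, center) + planes[i].d < -radius)
                return false;  // entirely outside this plane: cull
        return true;           // inside or intersecting: draw
    }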



