most efficient general rendering strategies for new GPUs

Started by
92 comments, last by maxgpgpu 11 years, 10 months ago

[quote name='maxgpgpu' timestamp='1339835382' post='4949733']
Certain facts simply exist. A modern GPU has about 512~1024 cores. If the engine is rendering a bunch of small objects, or far-away objects (meaning "few vertices" rendered per object), and a new transformation matrix gets uploaded between each object, then each object might not keep all 512 cores busy for even an instant. That's gotta put a crimp in bandwidth. The GPU finishes the object in "no time", then all 512 cores are stalled while the CPU uploads a new transformation matrix [and maybe other state]. That's precisely what doesn't happen in the alternate approach, where dozens to hundreds of objects get rendered by a single glDrawElements(). However, if you have alternate ways to assure the GPU can keep on trucking without [many] breaks, that's great. Maybe I haven't given enough attention to figuring out those alternative ways, but I don't see why that should stimulate so much vitriol.


The problem is your 'facts' are wrong.
Utterly.

As we are talking about next generation GPUs, let's take one I know about: the AMD Southern Islands, aka Radeon HD7000 series.
A 7970, the card I have, has 2048 'cores'; however, when you issue a draw call the GPU doesn't go 'all 2048 threads, do this!' Instead it operates on wavefronts (NV calls them 'warps'; the general term is 'work groups').

AMD breaks the SI core up into 'compute units'.
Each compute unit has four SIMD units in it.
Each SIMD unit can deal with up to 10 wavefronts.
Wavefronts are made up of 64 work items.

So each compute unit can have up to 2560 work items in flight. (And if I'm reading my clinfo output correctly, the HD7970 has 32 such units.)
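Those figures multiply out as follows (this is just the arithmetic from the numbers above, not a hardware model; the constants are the ones quoted in the post):

```c
#include <assert.h>

/* Back-of-the-envelope occupancy math for the HD7970 numbers above. */
enum {
    SIMD_UNITS_PER_CU   = 4,   /* SIMD units per compute unit */
    WAVEFRONTS_PER_SIMD = 10,  /* wavefronts each SIMD unit can track */
    ITEMS_PER_WAVEFRONT = 64,  /* work items per wavefront */
    COMPUTE_UNITS       = 32   /* HD7970, as reported by clinfo */
};

static int work_items_per_cu(void)
{
    return SIMD_UNITS_PER_CU * WAVEFRONTS_PER_SIMD * ITEMS_PER_WAVEFRONT;
}

static int work_items_whole_gpu(void)
{
    return work_items_per_cu() * COMPUTE_UNITS;
}
```

So the whole GPU can have on the order of 80K work items in flight - which is why a single small draw call nowhere near saturates it on its own.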

But these don't all have to come from the same draw call, nor do they have to be running the same shader, as each wavefront can have its own instruction buffer and data.

Now add in what Hodgman said about the asynchronous nature of the GPU and CPU (by default I believe the drivers can buffer up to 3 frames ahead) and you can see that your outline above simply isn't true: you don't issue a call which takes up the whole device and then waits; you build a command stream which can take up sections of the device, with threads being swapped in and out as resources demand it.
[/quote]

Okay, if I understand you guys correctly, I have a big misconception. Are you saying that current generation GPUs are set up to execute several (or many) different shaders (with different vertex-layouts and different uniforms and so forth) simultaneously? As long as the GPU has enough memory to hold several shaders and all the blocks of uniforms needed for the several shaders, I guess I don't see anything to prevent a GPU architecture from doing entirely different things in parallel like that.

If new GPUs can indeed handle this, then I didn't know that, and that certainly can change things. I'll have to think a while how that impacts different situations, but clearly it helps as long as the CPU or driver isn't thrashing to death to keep things rolling in the GPU.

Does this also mean I can be running one or two OpenCL processes on the GPU at the same time too, or are they not able to run in parallel with simultaneous graphics shaders?

At what generation of nvidia GPUs did they start to handle this (I'm not familiar with the AMD series)?
[quote]At what generation of nvidia GPUs did they start to handle this?[/quote]
Even back on "old" (current-gen console) cards like a GeForce7, it's based on having lots of threads executing passes of shader programs, which can be interrupted/stored/resumed (actually, one big performance-factor when writing shaders is the number of temporary registers used -- lower temp reg count == more threads can be serialised into the same sized buffer == more chances to hide texture-fetch stalls, etc...).

It's just that these old cards also have lots of opportunities for stalling the pipeline -- e.g. change a pixel shader constant without drawing x-thousand pixels in-between, and you'd leave the pixel-units sitting idle (N.B. this is still before the unified shading model too, where pixel and vertex shaders used different hardware)
Today that's probably still true for some operations, so make sure each batch covers a decent amount of screen pixels to be safe (unknown amount... hundreds, thousands?) and also make sure that your triangles stay larger than 1px in size, using LOD for detailed objects where necessary.
[quote name='maxgpgpu']No, I do cull, just not as fine-grain as others. So yeah, I lose a little there, but not that much.[/quote]
I understand your approach: you are grouping portions of the scene into vbos. That makes sense: say in BF/MW you have a bunch of houses/huts and you want to just put all the chairs/tables/cups/lamps into 1 vbo and call it "Room1 VBO". But I wouldn't dare stick 2 full houses into a VBO when I can only go into 1 house at a time, and view 1 room in that house at a time. So you need to clarify that. Grouping static objects is fine, but it sounds like you are grouping WAY too much.

Also, you can generate culling hierarchies: for instance, if those chairs/tables/cups/lamps belong to "Room1", then assume that if you can see at least a portion of "Room1", you just draw it all - you don't need to cull a 20-poly coffee cup. But again, it sounds like you have a whole mini-city in a VBO.

NBA2K, Madden, Maneater, Killing Floor, Sims http://www.pawlowskipinball.com/pinballeternal


[quote name='dpadam450']
No, I do cull, just not as fine-grain as others. So yeah, I lose a little there, but not that much.

I understand your approach: you are grouping portions of the scene into vbos. That makes sense: say in BF/MW you have a bunch of houses/huts and you want to just put all the chairs/tables/cups/lamps into 1 vbo and call it "Room1 VBO". But I wouldn't dare stick 2 full houses into a VBO when I can only go into 1 house at a time, and view 1 room in that house at a time. So you need to clarify that. Grouping static objects is fine, but it sounds like you are grouping WAY too much.

Also, you can generate culling hierarchies: for instance, if those chairs/tables/cups/lamps belong to "Room1", then assume that if you can see at least a portion of "Room1", you just draw it all - you don't need to cull a 20-poly coffee cup. But again, it sounds like you have a whole mini-city in a VBO.
[/quote]
Depends. If the environment of the game or simulation is a whole island (say, 10 km square), then yes, every house could be in a VBO... unless it had lots of detail inside (vertices), in which case it would be distributed over multiple VBOs. If the environment of the game was a single building, then individual rooms would be in VBOs... though here too a room will be subdivided if all the vertices don't fit in a 65536-vertex VBO (16-bit indices). The exception is single objects that have more than (or approaching) 65536 vertices, which might get assigned to a larger VBO (32-bit indices). However, if the object isn't supposed to move (very often, or ever), it might get split across multiple 65536-vertex VBOs. I started this engine quite a while ago, and have put it aside now and then to do contract work to generate spending money, so putting so much emphasis on 16-bit indices is probably an [almost] obsolete consideration.
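For concreteness, the 16-bit-index budget above works out like this (a sketch; the 64-byte vertex size is the figure given later in the thread, and the helper names are made up):

```c
#include <assert.h>
#include <stddef.h>

/* A GLushort index can address at most 2^16 = 65536 vertices, so a
 * 16-bit-indexed VBO "chunk" caps out at 65536 vertices. */
#define MAX_VERTS_16BIT 65536u
#define VERTEX_SIZE     64u   /* bytes per vertex, as stated in the thread */

static size_t chunk_bytes(void)
{
    /* largest vertex buffer one 16-bit IBO can fully address */
    return (size_t)MAX_VERTS_16BIT * VERTEX_SIZE;
}

/* How many 16-bit-index chunks does a mesh of vertex_count vertices need? */
static unsigned chunks_needed(unsigned vertex_count)
{
    return (vertex_count + MAX_VERTS_16BIT - 1) / MAX_VERTS_16BIT;
}
```

That puts each 16-bit chunk at 4 MiB of vertex data, which shows why the cap matters less on modern cards with hundreds of MiB of VRAM.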

[quote name='dpadam450']That makes sense for say in BF/MW you have a bunch of houses/huts and you want to just put all the chairs/tables/cups/lamps into 1 vbo and just call it "Room1 VBO" but I wouldn't dare stick 2 full houses into a VBO when I can only go into 1 house at a time, and view 1 room in that house at a time.[/quote]
[quote name='maxgpgpu']yes, every house could be in a VBO... unless it had lots of details inside (vertices), in which case it would be distributed over multiple VBOs[/quote]
You're both a bit off track with VBO allocation. Creating a VBO is equivalent to calling [font=courier new,courier,monospace]malloc[/font] in CPU land - it just gives you an array of bytes. Depending on your flags/hints, it will either [font=courier new,courier,monospace]malloc[/font] some VRAM, some main-RAM, or both (and as before, may have to use a lot of RAM in the dynamic case). The GPU may be able to read from both VRAM and main-RAM (remember, in order to cover the latency caused by slow memory-read times, you need more GPU registers to hold more threads), but the driver hides the details.

I can malloc enough memory to store two "Rooms" worth of vertices, and copy both those data sets into different parts of this same allocation, and then either render the two rooms in one go, or draw them as two separate rooms. It's exactly the same as if I called malloc twice - once for each Room. The only difference is that with the single-malloc version, I've got the option of making draw-calls that use verts from both rooms at once.
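A minimal sketch of that single-allocation idea, with plain malloc/memcpy standing in for the driver's buffer allocation and glBufferSubData (the Arena type and all names are invented for illustration):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One backing allocation; each "room" gets a region identified only by
 * its byte offset - the same offsets a draw-call would later use. */
typedef struct {
    unsigned char *base;   /* the single backing allocation */
    size_t         used;   /* bump-pointer: next free offset */
    size_t         size;   /* total capacity in bytes */
} Arena;

static int arena_init(Arena *a, size_t size)
{
    a->base = malloc(size);
    a->used = 0;
    a->size = size;
    return a->base != NULL;
}

/* Copy vertex data in; return its byte offset, or (size_t)-1 if full. */
static size_t arena_upload(Arena *a, const void *data, size_t bytes)
{
    if (a->used + bytes > a->size) return (size_t)-1;
    size_t off = a->used;
    memcpy(a->base + off, data, bytes);
    a->used += bytes;
    return off;
}
```

Rendering "both rooms in one go" is then just a draw-call spanning both regions, versus two draw-calls each starting at its own offset.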

When you load a graphics asset, the file usually contains a blob of bytes that need to end up in main-RAM, and a blob of bytes that need to end up in VRAM. The simple solution is to create a new VRAM buffer for each asset (e.g. new model = new VBO), but this is basically just a detail of your engine's VRAM management system. You could instead make a few big VRAM allocations, and dish out different regions/offsets to different assets. There's a lot of ways to build this part of an engine.


If you're following the "typical" every-object-is-a-draw-call technique, then it can be even more important to use a smaller number of (shared, large) buffers, because then you won't have to issue buffer-binding state-change commands in-between each draw-call! Remember that draw-calls specify an offset into the VBO (or IBO), so many different bits of data can be grouped.

Also, your vertex attribute bindings (aka vertex declaration - the thing that plugs your VBOs into your vertex shader) are also largely based on offsets into your VBO's, allowing a lot of flexibility (exact details depend on the API).
E.g. sometimes you want to separate out your positions and normals into 2 separate (non-interleaved) streams -- they can still actually be stored within the same VBO if you like (not saying you should), with all positions first, followed by all normals after. Just use the right buffer-offset when binding the normal attribute.
Or, you could also have a VBO that contains interleaved data, but bind your VBO twice in such a way where the GPU is still reading two separate streams as if you would with 2 VBO's (the 2nd stream is just offset sizeof(position) from the 1st stream).
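That interleaved-but-two-streams trick boils down to stride/offset arithmetic. A sketch, assuming a hypothetical 24-byte position+normal vertex (in GL these are the values you would hand to glVertexAttribPointer):

```c
#include <assert.h>
#include <stddef.h>

/* A hypothetical interleaved vertex layout. */
typedef struct {
    float position[3];
    float normal[3];
} Vertex;

/* Both attributes walk the buffer with the same stride; the "second
 * stream" is just the same buffer bound at a different starting byte. */
static size_t position_offset(void) { return offsetof(Vertex, position); }
static size_t normal_offset(void)   { return offsetof(Vertex, normal); }
static size_t stream_stride(void)   { return sizeof(Vertex); }
```

The normal stream starts sizeof(position) = 12 bytes into the buffer, exactly as described above.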


To go off on a bit of a tangent -- when different models share vertex data, such as a low-LOD model and its original high-poly model, then they can share buffered vertices too. In this case, the low-LOD model could be created using only the high-poly model's vertices as input, generating a new set of draw-calls and indices (which could be stored in any particular IBO object) that reference the original high-poly vertices.
So assuming I've already got the local-space high-poly model in VRAM, then drawing a crowd of LODed versions has very minimal memory impact. I've only got to load the LOD index data and the crowd transform matrix array. Each model in the crowd can be a unique variation on the model, in quite an extreme way with the specs in the OP. A model of a sphere can be tessellated and displaced into almost anything, so a T-pose of a human could be morphed into almost all your humanoid characters (by modelling them from the human base mesh).


Or back to the House example -- you could have a set of indices which draws the exterior of each house, and you could also have a set of indices which draws the interior of each. If you put the two 'exterior' index lists next to each other in the IBO, then you can draw the exterior of either house individually, or draw both houses together in a single draw-call.
You can actually pre-generate multiple index lists for different viewpoints, or for likely pairs of objects (e.g. one list for when A,B,C are visible, and one for when D,B,E are visible). You can get much better results on individual models with many layers of transparency (i.e. needing back-to-front order), such as foliage or glass-heavy designs, if you pre-generate a few different index lists for different angles. At runtime, you just need to measure the viewing angle and look up the right IB offset.
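A rough sketch of that angle-based lookup, assuming the pre-sorted index lists for N view-angle buckets are packed back-to-back in one IBO with 16-bit indices (the bucket count and all names are invented):

```c
#include <assert.h>

#define ANGLE_BUCKETS 8                       /* pre-generated index lists */
static const double TWO_PI = 6.283185307179586;

/* Quantise a view angle in [0, 2*pi) to a bucket in [0, ANGLE_BUCKETS). */
static int angle_bucket(double angle)
{
    double t = angle / TWO_PI;                /* normalise to [0, 1) */
    int b = (int)(t * ANGLE_BUCKETS);
    return b % ANGLE_BUCKETS;                 /* guard rounding to 1.0 */
}

/* Byte offset of that bucket's list in the shared IBO, assuming every
 * list holds indices_per_list 16-bit (2-byte) indices. */
static long bucket_ibo_offset(double angle, long indices_per_list)
{
    return angle_bucket(angle) * indices_per_list * 2L;
}
```

At draw time the offset goes straight into the index-buffer pointer argument of the draw-call; no sorting happens per frame.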

I actually store 'draw-call' objects inside my model files, which may cover the same area as other draw-calls (i.e. [font=courier new,courier,monospace]foreach( model.draw as draw ) draw();[/font] would make a mess). Using a bit of meta-data or game-specific logic though, you can do things like separate the draw-calls out into LOD layers, etc...
You are talking about memory. I'm specifically talking about overdraw/ trying to draw stuff that is outside the frustum. By grouping everything into 1 vbo, you are wasting time drawing/shading triangles that are outside the frustum or occluded.



[quote name='dpadam450' timestamp='1339870022' post='4949847']
You're both a bit off track with VBO allocation. Creating a VBO is equivalent to calling [font=courier new,courier,monospace]malloc[/font] in CPU land - it just gives you an array of bytes. ...
[/quote]
I very much appreciate your helpful attitude and thoughtful posts.

I understand most of your comments about the VBO. Maybe you can critique what I do. I don't do graphics full time, and I started planning this engine circa the GeForce 6800 era, so I think some of my design decisions were based upon outdated assumptions. First of all, would you say my emphasis on 65536-vertex VBOs (to keep the indices at 16 bits in the IBO) is more-or-less obsolete at this point in GPU history? The vertices are typically 64 bytes.

What I do is allocate VBOs in the GPU, then manage what is in them myself. Thus, once a game or simulation has more-or-less gotten going, my engine doesn't often create or destroy any VBO. What it does do is move objects from VBO to VBO once in a while. This is not a big overhead, since most objects don't move and thus stay in the same VBO forever. Since my VBOs hold objects in the same 3D volume of space (nominally a cube), when that 3D volume becomes very far away from any active camera, the VBO is not deleted, but instead will eventually be filled with objects in some other 3D volume that is being created or coming into view of a camera.

My current approach "accidentally" has one other convenient feature. The IBO for each VBO contains multiple sets of indices, one set per LOD level. So as the 3D volume moves further from a given camera, the engine simply draws via a different set of indices to render all the objects at the appropriate LOD level. The other advantage of my approach is this: GPU memory does not fragment, at least not GPU memory allocated to VBOs and IBOs. If what I'm learning here convinces me to switch to some other scheme in which the GPU holds local-coordinates for all objects, my practice of assigning VBOs to 3D volumes probably isn't a wise choice, and I may also need to implement my LOD strategy somewhat differently.
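The per-distance index-set selection described above might be sketched like this (the thresholds and names are invented; the point is just picking which index range in the IBO gets handed to the draw-call):

```c
#include <assert.h>

#define LOD_LEVELS 4

/* Invented distance thresholds (world units) at which each coarser
 * LOD's index set takes over. */
static const float lod_distance[LOD_LEVELS - 1] = { 50.0f, 150.0f, 400.0f };

/* 0 = full detail ... LOD_LEVELS-1 = coarsest index set in the IBO. */
static int lod_for_distance(float dist)
{
    int lod = 0;
    while (lod < LOD_LEVELS - 1 && dist >= lod_distance[lod])
        ++lod;
    return lod;
}
```

The chosen level then selects one of the per-LOD index sets stored in the volume's IBO; the VBO contents never change.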

This "clean" way of allocating VBOs might be difficult to keep if I switch over, too. I'll have to think about that. I don't want to cause a slow but never-ending loss of GPU memory due to fragmentation. The idea of having 2 or more VAOs for each VBO is interesting. I'll have to think about that idea further too. Currently the contents of all my VBOs are interleaved.

Can you point me at any reference that explicitly says nvidia GPUs can be executing multiple batches with different shaders and uniforms and vertex specifications simultaneously? I am surprised I could miss such an important piece of information - even though I'm not doing 3D all the time. I was just reading "Game Engine Architecture" last night, and that book implied what I thought - that all cores of the GPU were executing the same shader on the same object. Or better yet, can you point me at some book or articles or whitepapers that explain the capabilities of GTX680-generation GPUs, and next-generation GPUs too if possible, with an emphasis on nvidia GPUs? I've always been annoyed that I was never able to find any coherent presentation of this type for GPUs. The closest I ever found was a 20- or 30-odd-page PDF about the GeForce 8800 series. But even that was very scant on details or explicit statements, and I'm worried the architecture has changed significantly since then, especially in these sorts of ways (becoming more flexible).

[quote name='dpadam450']You are talking about memory. I'm specifically talking about overdraw / trying to draw stuff that is outside the frustum. By grouping everything into 1 vbo, you are wasting time drawing/shading triangles that are outside the frustum or occluded.[/quote]

Let me see if I understand. Because I put all objects in a given 3D volume of space into the same VBO, then draw the entire VBO by calling the glDrawElements() function once, I sometimes draw objects in that VBO that are not visible to the camera (because part of that 3D volume falls outside the frustum). Is that what you're saying? If so, you are correct, because I don't test the AABB or OOBB of every object against the frustum. However, do you really think that costs me more bandwidth than it saves? I suppose I could alter the code to draw every object separately (without even a frustum test at first) and see how much slower that is --- then add the frustum test, skip the glDrawRangeElements() call when objects are outside the frustum, and see how much faster (or slower) that is.

That is, assuming I am understanding your comment correctly.
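For reference, the per-object frustum test being weighed here is cheap on the CPU side. A minimal sketch of the standard AABB-vs-plane ("positive vertex") test - not code from anyone's engine in this thread; planes are ax+by+cz+d with the inside where the value is >= 0:

```c
#include <assert.h>

typedef struct { float x, y, z; } Vec3;
typedef struct { Vec3 min, max; } AABB;
typedef struct { float a, b, c, d; } Plane;

/* The box is entirely behind the plane if even its corner furthest
 * along the plane normal is behind it. */
static int aabb_outside_plane(const AABB *box, const Plane *p)
{
    Vec3 v = {
        p->a >= 0 ? box->max.x : box->min.x,
        p->b >= 0 ? box->max.y : box->min.y,
        p->c >= 0 ? box->max.z : box->min.z,
    };
    return p->a * v.x + p->b * v.y + p->c * v.z + p->d < 0;
}

/* Cull only when the box is fully behind some plane; a return of 1
 * means "not culled" (conservative - false positives are possible). */
static int aabb_visible(const AABB *box, const Plane planes[6])
{
    for (int i = 0; i < 6; ++i)
        if (aabb_outside_plane(box, &planes[i])) return 0;
    return 1;
}
```

Six dot products per object is the worst case, which is the cost being traded against vertex-shading everything off-screen.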
No again, culling a cup or chair is fairly pointless. Grouping a room of chairs and cups into a vbo, and deciding to draw all or none of the objects in that room is fine.
[quote name='maxgpgpu']given 3D volume of space[/quote]
Well, I believe months ago you were saying to put everything into a single vbo and draw everything. But you just said you would consider putting 2 houses (including interior rooms) into a single vbo. By default you might have, say, 20K verts of objects inside each house.

So you are outside, and only 1 house is in view. You have to draw 2 houses, and there is no vertex order, so you may end up drawing the whole interior (20K verts), those all get shaded, and then the outside of the house draws last and overwrites everything you just did. That is not negligible work to me. You drew 20K verts for the first house outside the frustum and 20K verts that were overdrawn for the house that's in view. If you don't do that, you can spend 40K verts somewhere else - maybe you can increase polys in some of your car models or tree models.

To me it sounds like you might throw some other objects into this vbo as well, which again have no order in a vbo. So if you are on the side of one house, the other house is on the other side, and there is a tree even further back, then you should really be sorting to determine: this is the object closest to me, draw it first. I don't get why you wouldn't still sort to draw it first.



[quote name='maxgpgpu']Let me see if I understand. Because I put all objects in a given 3D volume of space into the same VBO, then draw the entire VBO by calling the glDrawElements() function once, I sometimes draw objects in that VBO that are not visible... However, do you really think that costs me more bandwidth than it saves?[/quote]

Few flaws with this.

VBOs don't necessarily promise to put your data into GPU memory; the driver will select an optimal location based on your usage hints, current conditions and memory usage, phase of the moon, which side of bed you got out of this morning, did you remember to sacrifice a goat last night, and so on. So when you select e.g. GL_STATIC_DRAW, you're really saying to the driver "give me some memory and this is how I'd like to use it" but you really have no control over which memory the driver chooses, and it may even vary between two successive glBufferData calls. This is explicit in the documentation for glBufferData: http://www.opengl.or...lBufferData.xml.

Based on that there will always be a risk of some CPU-GPU bandwidth cost because drivers are perfectly free to put even a GL_STATIC_DRAW VBO into system memory if they consider that to be the most suitable location at the time you call glBufferData. There's also an internal GPU-bandwidth cost, even if the VBO is allocated in the best-case memory pool, which comes from moving data from the VBO to your vertex shader registers. This cost is obviously going to be very low, but it still doesn't come for free.

That brings me to the last point, which is that - yes, the GPU can cull on its own, and fragments outside the frustum won't get shaded, but it must still run every vertex through your vertex shaders before it can do that (or the equivalent fixed pipeline stage if not using vertex shaders - although that's going to be emulated by vertex shaders on all modern hardware). If a high proportion of your scene is in fact outside the frustum, then you're incurring a higher than necessary per-vertex cost which will be tying up GPU resources that are best used elsewhere.

That's why coarse frustum culling on the CPU is still used, and that's why accepting a slightly higher draw call count is considered a fair tradeoff. If all that you measure is draw call cost then it might look bad, but you have to weigh those extra costs against the savings gained, and the savings will always more than counterbalance them (unless you have a really odd, possibly contrived situation).

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

