most efficient general rendering strategies for new GPUs


[quote name='maxgpgpu' timestamp='1339445552' post='4948286']
Oh no! A 4x4 matrix multiply, or transforming a vertex with 1 position and 3 vectors takes much more than 1 cycle on the GPU (and GPU cycles are typically ~4 times slower than CPU cycles). Modern GPUs are not 4-wide any more. They found a few years ago that making each core 1-wide and increasing the number of GPU cores was a better tradeoff, especially as shaders got more sophisticated and did a lot of other work besides 4-wide operation. The fact is, one-core in a 64-bit CPU can perform a matrix-multiply or vertex transformation much faster than a GPU. However, a GPU has about 512~1024 cores these days, while the CPU only has 8.

Measuring a single transformation in isolation is not going to give you a meaningful comparison. As you noted, GPUs are much more parallel than CPUs (that's not going to change for the foreseeable future), and they also have the advantage that this parallelism comes for free - there's no thread sync overhead or other nastiness in your program. At the most simplistic level, each core of an 8-core CPU would need to transform vertexes 64 times faster than each core of a 512-core GPU (the actual number will be different but in a similar ballpark) for this to be advantageous.

Another crucial factor is that performing these transforms on the GPU allows you to use static vertex buffers. Do them on the CPU and you need to stream data to the GPU every frame, which can quickly become a bottleneck. Yes, bandwidth is plentiful and fast, but it certainly doesn't come for free - keeping as much data as possible 100% static for the lifetime of your program is still a win. Not to mention the reduction in code complexity in your program, which I think we can all agree is a good thing.
[/quote]

From articles by nvidia, I infer a single core of a fast CPU is about 8x faster than a GPU core, not 64x. That's why I believe we all agree that our 3D engines should perform every operation on the GPU that does not have some unfortunate (and possibly inefficient) consequence. A perfect example of an unfortunate/inefficient consequence of doing everything the conventional way is the problem of collision-detection and collision-response. This is not an issue in trivial games, or games that just don't need to detect or respond to collisions in any significant or detailed way. But when any aspect of collision-detection OR collision-response needs to be handled on the CPU, the "simple way" or "conventional way" (keeping object local-coordinates in VBOs in GPU memory, and having the GPU transform those objects directly to screen-coordinates with one matrix) has serious problems.

First, collision-detection must be performed with all objects in the same coordinate-system... which for many practical reasons is world-coordinates. Also, collision-response has the same requirements... all objects must be available in the same [non-rotating, non-accelerating] coordinate-system, and world-coordinates is again most convenient. So the "solution" proposed by advocates of "conventional ways" is to make the GPU perform two matrix transformations --- first transform each vertex to world-coordinates, then write the vertex (or at least the world-coordinates and possibly transformed surface vectors (normal, tangent, bitangent)) back into CPU memory, then transform from world coordinates to screen coordinates in a separate transformation.

So now we do have a transfer of all vertices between CPU and GPU... only this time, the transfer is from GPU to CPU. The CPU must wait for this information to be returned before it starts processing it, and of course this must be synchronized (not terribly difficult).
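For concreteness, this GPU-to-CPU round trip could look something like the following with OpenGL 3.x transform feedback. This is only a minimal sketch; the varying name, buffer handles, and sizes are my own assumptions, not anything from this thread:

[code]
/* capture the vertex shader's world-space output into a buffer object */
const GLchar *varyings[] = { "world_position" };               /* assumed varying name */
glTransformFeedbackVaryings(program, 1, varyings, GL_INTERLEAVED_ATTRIBS);
glLinkProgram(program);                                        /* takes effect at link time */

glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, feedback_buffer);
glBeginTransformFeedback(GL_TRIANGLES);
glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_INT, 0);
glEndTransformFeedback();

/* this read-back is the synchronization point: the CPU waits here for the GPU */
glGetBufferSubData(GL_TRANSFORM_FEEDBACK_BUFFER, 0, buffer_size, cpu_world_vertices);
[/code]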

I'm not sure what you mean by "keeping the data as static as possible". In my scheme, where the CPU always holds every vertex in both local-coordinates and world-coordinates, and the GPU holds every vertex in world-coordinates --- only those vertices that changed (usually by rotate or translate) on a given frame are transferred to the GPU. Isn't that also "keeping data as static as possible"? Of course, it is true that in the "conventional way" the vertices in GPU memory are almost never changed --- which is the ultimate in "static". However, that approach inherently requires that the transformation matrix of every object be updated every frame by the CPU and sent to the GPU. While transformation matrices are smaller than the vertices for most objects, this aspect of the "conventional way" inherently forces another inefficiency - the necessity to call glDrawElements() or glDrawRangeElements() or similar function for every object. In other words, "small batches" (except for huge objects with boatloads of vertices, where one object in-and-of itself is a large batch). Furthermore, in the "conventional way", since a new glDrawElements() call is inherently required for every object, we might as well have boatloads of shaders, one for every conceivable type of rendering, and [potentially] change shaders for every object... or at least "often". From what I can tell, changing shaders has substantial overhead in the GPU driver [and maybe in the GPU itself], so this is an often ignored gotcha of the "conventional way". Also important is that different shaders often expect different uniform buffers with different content in different layouts. So this information and uniform buffer data also must be sent to GPU-driver and GPU too, potentially as often as every object (if rendering order is sorted by anything other than "which shader").

I very much agree that keeping code complexity low is important. However, from my experience, my approach is simpler than the "conventional way". Every object looks the same, has the same elements, is transformed to world-coordinates and sent to GPU when modified (but only when modified), exists in memory in world-coordinates for the convenience of collision-detection and collision-response computations, and whole batches of objects are rendered with a single glDrawElements(). Are there exceptions? Yes, though not many, and I resist them. Examples of exceptions I cannot avoid are rendering points and lines (each gets put into its own batch). My vertices contain a few fields that tell the pixel shader how to render each vertex (which texture and normalmap, and bits to specify rendering type (color-only, texture-only, color-modifies-texture, enable-bumpmapping, etc)). So if anything, I think my proposed way (implemented in my engine) is simpler. And all things considered, for a general-purpose engine, I suspect it is faster too.

[quote name='AgentC' timestamp='1339537035' post='4948637']
- Sort rendering operations. Speed this up using multiple cores. For opaque geometry, sort both by state and distance for front-to-back ordering. The method I use is to check the closest geometry rendered with each unique renderstate, and use that info to determine the order of renderstates. This needs experimental testing to determine whether for example minimizing shader changes is more important than more strict front-to-back order.

Thanks for a good summary of optimization techniques; it looks like you really know what you are talking about.

I am now trying to improve my application's optimization of state changes. Is there an easy rule of thumb for which state changes are important to optimize? In the worst case, I suppose a state change will delay the pipeline.
[/quote]
For what it's worth, I took the ultimate "brain-dead" approach to address this issue.

I have a "glstate" structure [for each rendering context] that contains every OpenGL state variable. Then, when my code needs to set any OpenGL state, it first checks whether the value it needs is already set, and if so, doesn't bother. Obviously the code for this approach is "braindead simple" (which means, no errors due to complexity), is the same for every situtation and every OpenGL state variable, and quite efficient and readable (test and conditional assign on a single line of C code).

And yes, in a large percentage of cases the "new state" == "old state", and GPU-driver and GPU overhead is avoided.
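Not the engine's actual code, just a minimal sketch of that idea, assuming one shadow field per GL capability and macro names of my own:

[code]
typedef struct {
    GLboolean depth_test;
    GLboolean blend;
    GLboolean cull_face;
    /* ...one field for every OpenGL state variable... */
} glstate;

static glstate state;     /* one instance per rendering context */

/* test-and-conditionally-set on a single line, as described above */
#define GLSTATE_ENABLE(field, cap)  do { if (!state.field) { glEnable(cap);  state.field = GL_TRUE;  } } while (0)
#define GLSTATE_DISABLE(field, cap) do { if ( state.field) { glDisable(cap); state.field = GL_FALSE; } } while (0)

/* usage: GLSTATE_ENABLE(depth_test, GL_DEPTH_TEST); is a no-op when already enabled */
[/code]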

I have a "glstate" structure [for each rendering context] that contains every OpenGL state variable. Then, when my code needs to set any OpenGL state, it first checks whether the value it needs is already set, and if so, doesn't bother. Obviously the code for this approach is "braindead simple" (which means, no errors due to complexity), is the same for every situtation and every OpenGL state variable, and quite efficient and readable (test and conditional assign on a single line of C code).


It will work, and be efficient CPU wise, but I don't think this is good enough. You need to sort the drawing operations for some state changes, or you can get pipeline delays.

I think a state change from glEnable or glDisable that changes to the same state will have a low cost on CPU, and zero cost on the GPU. If so, there is no need for a glState to optimize this. Maybe someone can confirm this.


[quote name='maxgpgpu' timestamp='1339653554' post='4949056']First, collision-detection must be performed with all objects in the same coordinate-system...[/quote]


I don't see what collision detection has to do with graphics and drawing. It is physical modelling. Maybe it is possible to use the GPU to compute this, but I think it is more common to use a CPU-local quadtree.
Current project: Ephenation.
Sharing OpenGL experiences: http://ephenationopengl.blogspot.com/
[quote name='maxgpgpu' timestamp='1339653554' post='4949056']I'm not sure what you mean by "keeping the data as static as possible". In my scheme, where the CPU always holds every vertex in both local-coordinates and world-coordinates, and the GPU holds every vertex in world-coordinates ... vertices that [move] on a given frame are transferred to the GPU. Isn't that also "keeping data as static as possible"? Of course, it is true that in the "conventional way" the vertices in GPU memory are almost never changed --- which is the ultimate in "static".[/quote]
I think you answered your own question there.
[quote name='maxgpgpu' timestamp='1339653554' post='4949056']However, that approach inherently requires that the transformation matrix of every object be updated every frame by the CPU and sent to the GPU. While transformation matrices are smaller than the vertices for most objects....[/quote]
Transforms that haven't changed don't need to be re-uploaded to the GPU, so except in the simplest scenes, the overhead of computing and uploading the changed transforms is sure to be far less than that of computing and uploading the changed vertices.
For example, if we scale this up to a modern scene with 100 characters, each with 1 million vertices, animated by 100 bones, we could implement it several ways:
Case A) CPU vertex transforms
CPU load = 10000 bone transforms, 100 million vertex transforms and uploads.
GPU load = 100 million vertex transforms.
Case B) GPU skinning
CPU load = 10000 bone transforms and uploads.
GPU load = 100 million vertex transforms.
Case C) GPU animation and skinning
CPU load = 100 character transform uploads.
GPU load = 10000 bone transforms, 100 million vertex transforms.
There's no general best solution. If you wanted to get the theoretical best performance, you'd probably end up using all of these approaches in different places.
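For illustration, the CPU side of Case B above might be little more than this per frame (a sketch; the buffer handle and array names are assumptions, and the matrix palette is whatever the animation system produced):

[code]
#define MAX_BONES 100
float bone_palette[MAX_BONES * 16];        /* 100 bone matrices computed on the CPU */

/* upload only the palette; the million-vertex mesh stays static in GPU memory */
glBindBuffer(GL_UNIFORM_BUFFER, bone_ubo);
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(bone_palette), bone_palette);
/* the vertex shader then blends each vertex by its bone indices and weights */
[/code]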

However, a big difference in the above is the memory overhead -- in the GPU-skinned versions, there's only a single (static) copy of the character vertex data required, in GPU memory -- but in the CPU case, the CPU has 200 copies of the data (untransformed and transformed, for each player), and the GPU/driver require another n copies of the total data, where n is the number of frames that your driver buffers commands for.

Also, if you're sending world-space vertices to the GPU, then you'll lose too much precision in 16-bit vertex formats, so your vertex buffers will all have to double in size to use full 32-bit floats.
[quote name='maxgpgpu' timestamp='1339653554' post='4949056']this aspect of the "conventional way" inherently forces another inefficiency - the necessity to call glDrawElements() or glDrawRangeElements() or similar function for every object. In other words, "small batches".[/quote]
Batches are a CPU overhead (the GPU forms its own idea of 'batches' from the work/stall points in its pipeline independently from draw calls), so if you put the same optimisation effort into your systems that call glDraw*, you won't be troubled by them until you've got similar numbers of objects that are also going to choke your CPU-vertex-transform routines.
Also, there is always a middle ground -- e.g. my last game always drew 4 shadow-casters per batch, with 4 different light transforms.
[quote name='maxgpgpu' timestamp='1339653554' post='4949056']Also important is that different shaders often expect different uniform buffers with different content in different layouts. So this information and uniform buffer data also must be sent to GPU-driver and GPU too, potentially as often as every object (if rendering order is sorted by anything other than "which shader").[/quote]
Focus on some algorithmic optimisation in your CPU GL code, and the details behind uniform-buffers won't impact you greatly - it still boils down to this: only data that the CPU needs to change each frame needs to be re-uploaded each frame.

[quote name='maxgpgpu']#3: Since we have accurate world-coordinates in CPU memory for every object every frame, we can perform accurate collision-detection [and collision-response if desired] on every object, every frame.[/quote]
Do you really use triangle vs triangle collision detection routines for the majority of your objects? There's usually a good reason that dynamic (especially animated) objects are usually represented by something other than triangle-soups in collision engines. I guess that's useful if you need accuracy though, just be aware of the costs.


[quote]I think a state change from glEnable or glDisable that changes to the same state will have a low cost on CPU, and zero cost on the GPU. If so, there is no need for a glState to optimize this.[/quote]
You don't want to waste time with redundant state-changes and there are very efficient ways to avoid them, just make sure the optimisation isn't more expensive than the original redundancy ;). Usually sorting as well as this kind of filtering would be used.
To quote TomF:
[quote]
1. Typically, a graphics-card driver will try to take the entire state of the rendering pipeline and optimise it like crazy in a sort of "compilation" step. In the same way that changing a single line of C can produce radically different code, you might think you're "just" changing the AlphaTestEnable flag, but actually that changes a huge chunk of the pipeline. Oh but sir, it is only a wafer-thin renderstate... In practice, it's extremely hard to predict anything about the relative costs of various changes beyond extremely broad generalities - and even those change fairly substantially from generation to generation.

2. Because of this, the number of state changes you make between rendering calls is not all that relevant any more. This used to be true in the DX7 and DX8 eras, but it's far less so in these days of DX9, and it will be basically irrelevant on DX10. The card treats each unique set of states as an indivisible unit, and will often upload the entire pipeline state. There are very few incremental state changes any more - the main exceptions are rendertarget and some odd non-obvious ones like Z-compare modes.

3. On a platform like the PC, you often have no idea what sort of card the user is running on. Even if you ID'd the card, there's ten or twenty possible graphics card architectures, and each has a succession of different drivers. Which one do you optimise for? Do you try to make the edge-traversal function change according to the card installed? That sounds expensive. Remembering that most games are limited by the CPU, not the GPU, and you've just added to the asymmetry of that load.
[/quote]


[quote name='maxgpgpu' timestamp='1339654045' post='4949057']
I have a "glstate" structure [for each rendering context] that contains every OpenGL state variable. Then, when my code needs to set any OpenGL state, it first checks whether the value it needs is already set, and if so, doesn't bother. Obviously the code for this approach is "braindead simple" (which means, no errors due to complexity), is the same for every situtation and every OpenGL state variable, and quite efficient and readable (test and conditional assign on a single line of C code).


It will work, and be efficient CPU wise, but I don't think this is good enough. You need to sort the drawing operations for some state changes, or you can get pipeline delays.

I think a state change from glEnable or glDisable that changes to the same state will have a low cost on CPU, and zero cost on the GPU. If so, there is no need for a glState to optimize this. Maybe someone can confirm this.[/quote]
Oh, absolutely no question we also need to perform certain kinds of sorts, but that is a different issue. Of course in my extreme case, there is rarely any need to change OpenGL state between batches... and my batches are huge. Of course there is a need to change state between a shadow-generating pass and later rendering passes, but those are very coarse-grain and therefore not very significant.

When the general approach is to perform a separate draw call for every object, the need for many state-changes arises because there's a temptation to have many different rendering styles and even multiple shaders.


[quote][quote name='maxgpgpu' timestamp='1339653554' post='4949056']First, collision-detection must be performed with all objects in the same coordinate-system...[/quote]
I don't see what collision detection has to do with graphics and drawing. It is physical modelling. Maybe it is possible to use the GPU to compute this, but I think it is more common to use a CPU-local quadtree.[/quote]
Really? Hahahaha!

Why do you think the "conventional way" is to put the local-coordinates of all vertices of all objects into the GPU, and then have the GPU transform from local-coordinates to screen-coordinates? Two answers: So the program only needs to transfer object vertices to GPU memory once (and certainly not every frame), and so the CPU doesn't need to transform object vertices from local-coordinates to world-coordinates (or some other coordinate-system), because the GPU does that. Of course the CPU does compute the transformation matrix and transfer that to the GPU for every object before it renders the object.

And THAT is what collision-detection and collision-response have to do with this "conventional way". Why? Because the CPU cannot perform collision-detection OR collision-response unless it has access to all object vertices in world-coordinates (or some other single, consistent, non-rotating, non-accelerating coordinate system).

But wait! The whole freaking POINT of the "conventional way" is to eliminate the need for the CPU to transform vertices OR transfer vertices between CPU and GPU memory. But collision-detection and collision-response REQUIRE all those vertices in world-coordinates in CPU memory... every single frame. Which means:

#1: So much for the CPU not needing to transform vertices.
#2: So much for the CPU not needing to transfer vertices between CPU and GPU.

Now, to be fair, you can arrange things so only one of the above needs to be done. You can keep the original scheme, where the GPU holds only local coordinates, and the CPU computes TWO (instead of one) transformation matrices for each object, and transfers that to the GPU before it renders each object. However, this requires the GPU compute world-coordinates of all vertices (the first transformation matrix supplied by the CPU), and then transfer those world-coordinate vertices back into CPU memory --- so the CPU has world-coordinates of all objects to perform collision-detection and collision-response with. Note that this scheme still requires a separate batch for every object. Furthermore, unless you switch shaders [potentially] every object, the GPU transfers the world-coordinates of every single vertex back to CPU memory, because it has no freaking idea which objects rotated or moved. In contrast, in my scheme the CPU only transfers vertices of objects that rotated or moved, because it doesn't even bother to process objects that didn't move. As a consequence, there is no need for the CPU application to make any distinction between objects --- no separate shaders, no extra transfers between CPU and GPU, etc.

This is [some of] what "collision-detection" has to do with "graphics and drawing".
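A rough sketch of the "only transfer what moved" part of that scheme (the types, field names, and the transform_vertices() helper are mine, purely for illustration):

[code]
typedef struct {
    int     moved;            /* set when the object rotated or translated this frame */
    int     first_vertex;     /* this object's range within the shared world-space VBO */
    int     vertex_count;
    Vertex *local;            /* local-coordinate copy  (CPU only)   */
    Vertex *world;            /* world-coordinate copy  (CPU mirror) */
    float   local_to_world[16];
} Object;

void update_world_vbo(GLuint world_vbo, Object *obj, int object_count)
{
    glBindBuffer(GL_ARRAY_BUFFER, world_vbo);
    for (int i = 0; i < object_count; i++) {
        if (!obj[i].moved)
            continue;                                   /* unmoved objects cost nothing */
        transform_vertices(obj[i].world, obj[i].local,  /* CPU (SIMD) local -> world */
                           obj[i].vertex_count, obj[i].local_to_world);
        glBufferSubData(GL_ARRAY_BUFFER,
                        (GLintptr)obj[i].first_vertex * sizeof(Vertex),
                        (GLsizeiptr)obj[i].vertex_count * sizeof(Vertex),
                        obj[i].world);
        obj[i].moved = 0;
    }
}
[/code]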

[quote][quote name='maxgpgpu' timestamp='1339653554' post='4949056']I'm not sure what you mean by "keeping the data as static as possible". In my scheme, where the CPU always holds every vertex in both local-coordinates and world-coordinates, and the GPU holds every vertex in world-coordinates ... vertices that [move] on a given frame are transferred to the GPU. Isn't that also "keeping data as static as possible"? Of course, it is true that in the "conventional way" the vertices in GPU memory are almost never changed --- which is the ultimate in "static".[/quote]
I think you answered your own question there.[/quote]

Good one! :-) But as we see when we continue, that's only one consideration....

[quote name='maxgpgpu' timestamp='1339653554' post='4949056']However, that approach inherently requires that the transformation matrix of every object be updated every frame by the CPU and sent to the GPU. While transformation matrices are smaller than the vertices for most objects....[/quote]

[quote]Transforms that haven't changed don't need to be re-uploaded to the GPU, so except in the simplest scenes, the overhead of computing and uploading the changed transforms is sure to be far less than that of computing and uploading the changed vertices.[/quote]

You may be right, but your premise is wrong, at least in any scenario my application ever faces. In my application, and I would guess most other applications, the camera moves every frame. Therefore, every transformation matrix for every object must change in the "conventional way". This is also true in my scheme too, except there is only one transformation for all objects, since GPU memory contains world-coordinates for all objects (so to the GPU, literally all objects are one object in this respect). Furthermore, you totally ignore the very real, practical issue that I keep raising! The CPU needs world-coordinates to perform collision-detection and collision-response, and advocates of the "conventional way" always, always, always, always, always ignore this, pretend the problem doesn't exist, ignore people like me when we point it out, etc. I'm sorry, but a real general-purpose engine must perform collision-detection and collision-response. So at least you must include overhead for setting two matrices to the GPU and you must include overhead for the GPU writing out world-coordinate vertices to CPU memory... for every object every frame! Now we return control to your normal channel! :-)

[quote]For example, if we scale this up to a modern scene with 100 characters, each with 1 million vertices, animated by 100 bones, we could implement it several ways:[/quote]

Okay, I must interrupt with an objection here, but then we'll return to your thoughts. In most games and simulations, most objects contain far fewer vertices, and most objects do not move (a wall, a chair, a table, a dish, a fork, a baseball, a bat, a blade of grass, etc). Then a small minority of objects are large, or in your scenario - huge!

I must also admit that I tend to think in terms of small details and their self-shadows implemented with a parallax-mapping technique like "relaxed cone-step mapping" rather than 27-zillion tiny little triangles. However, I admit this is probably a matter of choice with both approaches having advantages and disadvantages in different situations. One thing I'm sure we can both agree upon is that there's no need to perform collision-detection or collision-response on fine details like the wrinkles in the face of an old person. But a naive approach that implements these features with zillions of micro-triangles and then tries to perform collision-detection and collision-response upon them is certainly doomed --- whether vertices are transformed by CPU or GPU.


[quote]Case A) CPU vertex transforms
CPU load = 10000 bone transforms, 100 million vertex transforms and uploads.
GPU load = 100 million vertex transforms.
Case B) GPU skinning
CPU load = 10000 bone transforms and uploads.
GPU load = 100 million vertex transforms.
Case C) GPU animation and skinning
CPU load = 100 character transform uploads.
GPU load = 10000 bone transforms, 100 million vertex transforms.
There's no general best solution. If you wanted to get the theoretical best performance, you'd probably end up using all of these approaches in different places.

However, a big difference in the above is the memory overhead -- in the GPU skinned versions, there's only a single (static) copy of the character vertex data required, in GPU memory -- but in the CPU case, the CPU has 200 copies of the data (untransformed and transformed, for each player), and the GPU/driver require another n-copies of the total data where n is number of frames that your driver buffers commands for.[/quote]
That's one place you're wrong. If we're gonna perform collision-detection, then you need ALL vertices in world-coordinates, and probably you need them in CPU memory. So you not only have all that storage, you also need to have the GPU transfer all those world-coordinates back to CPU memory so the CPU can perform the collision-detection. However, let's be fair to you, even though it was you who created these extreme cases (100 super-detailed bending, twisting, moving, flexible, organic characters). In any such scenario, whether processed on CPU or GPU or cooperatively, we would certainly [manually or automatically] select a subset of character vertices to represent each character for purposes of collision-detection and collision-response.

[quote]Also, if you're sending world-space vertices to the GPU, then you'll lose too much precision in 16-bit vertex formats, so your vertex buffers will all have to double in size to use full 32-bit floats.[/quote]
Yeah, I've never even considered 16-bit floats for coordinates! Hell, I dislike 32-bit coordinates for coordinates! Hell, even 64-bit coordinates make me nervous! Seriously! In one of my applications, I need to have thousands of entirely separate "world coordinate systems", one for each solar system, because the range of 64-bit floating-point is nowhere near sufficient to contain galactic scale as well as tiny details. Fortunately we can't see details within other solar systems (just the point star, and limited detail with high-power telescopes), so I am able to avoid the inherent mess that multiple world coordinate systems implies in a rigorous application of this scheme.

[quote name='maxgpgpu' timestamp='1339653554' post='4949056']this aspect of the "conventional way" inherently forces another inefficiency - the necessity to call glDrawElements() or glDrawRangeElements() or similar function for every object. In other words, "small batches".[/quote]

[quote]Batches are a CPU overhead (the GPU forms its own idea of 'batches' from the work/stall points in its pipeline independently from draw calls), so if you put the same optimisation effort into your systems that call glDraw*, you won't be troubled by them until you've got similar numbers of objects that are also going to choke your CPU-vertex-transform routines.
Also, there is always a middle ground -- e.g. my last game always drew 4 shadow-casters per batch, with 4 different light transforms.[/quote]

Well, first of all, I must admit that as much as I try, I am not confident I understand what is most efficient for OpenGL on nvidia GPUs (much less other GPUs on OpenGL or D3D). However, I suspect you oversimplified. If I understand this correctly, if you make no state changes between batches, then batch overhead in the newest GPUs can be quite modest (depending on batch size, etc). In fact, maybe the driver is smart enough to merge those batches into one batch. I also have the impression that a limited number of simple state changes between batches won't cause terrible overhead. Having said this, however, any change that makes any difference to the results the GPU cores produce must inherently require the GPU to have completely finished all processing of the previous batch (both vertex and pixel shaders). This necessarily involves at least a modest loss of efficiency over continuous processing. And when something like a shader needs to be changed, even more performance loss is unavoidable. True, top-end GPUs today aren't nearly as bad as previous generations, since (I infer) they can hold multiple shaders in the GPU, so the process of switching between them isn't horrific any more, even including changes to the uniform buffer objects required to support the new shader. But it is still significant compared to continuous processing.

I also note the following. My engine makes a distinction between "huge moving objects" and all others (non-moving objects and small-to-modest-size moving objects). For practical purposes (and in fact in my engine), a huge moving object has so many vertices that it is a batch, all by itself. So even my engine takes a middle road in this case. It leaves local-coordinates in the GPU and sets the appropriate transformation matrix to the GPU before it renders that object. However, at this point anyway, the CPU also transforms local-coordinates to world-coordinates itself, rather than have the GPU perform two transformations and feedback the intermediate world-coordinates of these objects back to CPU memory for further [collision] processing. Of course, if the object has craploads of tiny details implemented as microtriangles, then the CPU only transforms that subset that reasonably describes the boundary of the object.


[quote name='maxgpgpu' timestamp='1339653554' post='4949056']Also important is that different shaders often expect different uniform buffers with different content in different layouts. So this information and uniform buffer data also must be sent to GPU-driver and GPU too, potentially as often as every object (if rendering order is sorted by anything other than "which shader").[/quote]
[quote]Focus on some algorithmic optimisation in your CPU GL code, and the details behind uniform-buffers won't impact you greatly - it still boils down to this: only data that the CPU needs to change each frame needs to be re-uploaded each frame.[/quote]


[quote name='maxgpgpu']#3: Since we have accurate world-coordinates in CPU memory for every object every frame, we can perform accurate collision-detection [and collision-response if desired] on every object, every frame.[/quote]
[quote]Do you really use triangle vs triangle collision detection routines for the majority of your objects? There's usually a good reason that dynamic (especially animated) objects are usually represented by something other than triangle-soups in collision engines. I guess that's useful if you need accuracy though, just be aware of the costs.[/quote]

The usual approach of my engine is this (on most objects):

#1: broad phase with insertion-sorted overlap testing of object AABBs in world-coordinates.

#2: narrow phase with highly optimized GJK "convex-hull" technique (whether object is convex or not).

#3: if #2 detects collision OR object is marked as "arbitrary irregular/concave shape" then triangle-triangle intersection.

The application running on the engine can globally or per-object specify that step #1 alone is adequate, or step #1 + #2 is adequate. Otherwise all 3 steps are executed. I put a huge effort into making these routines super-efficient, but as you clearly understand, even the most brilliant implementation of triangle-triangle-soup intersection is relatively slow. Depending on the objects, mine is 2 to 20 times slower than the GJK implementation of #2, but of course has the huge advantage of working on completely arbitrary, irregular "concave" objects. However, I have not yet implemented efficiencies possible by #3 inspecting and optimizing itself on the basis of results saved by #2 (deepest penetration locations). So #3 should speed up somewhat~to~considerably.... someday. Like I don't have enough other work to do, right?
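To make phase #1 concrete, here is a minimal sweep-and-prune sketch (my own types and names, not the engine's code): objects stay sorted by the minimum x of their world-space AABB, and insertion sort is nearly O(n) here because the order barely changes from frame to frame.

[code]
typedef struct { float min[3], max[3]; int id; } AABB;

static int aabb_overlap(const AABB *a, const AABB *b) {
    return a->min[0] <= b->max[0] && b->min[0] <= a->max[0] &&
           a->min[1] <= b->max[1] && b->min[1] <= a->max[1] &&
           a->min[2] <= b->max[2] && b->min[2] <= a->max[2];
}

void broad_phase(AABB *box, int n, void (*report_pair)(int, int)) {
    /* insertion sort by min x: cheap when the list is already nearly sorted */
    for (int i = 1; i < n; i++) {
        AABB key = box[i];
        int  j   = i - 1;
        while (j >= 0 && box[j].min[0] > key.min[0]) { box[j + 1] = box[j]; j--; }
        box[j + 1] = key;
    }
    /* sweep: a pair can only collide while its x-intervals overlap */
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n && box[j].min[0] <= box[i].max[0]; j++)
            if (aabb_overlap(&box[i], &box[j]))
                report_pair(box[i].id, box[j].id);      /* candidates for phase #2 (GJK) */
}
[/code]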

PS: Just for fun (or to make myself look stupid) I'll mention the following example of going to [absurd] extremes to make certain parts of my engine efficient. The transformation of each vertex from local-coordinates to world-coordinates by the CPU requires only 3 SIMD/AVX machine-language instructions per vertex to update the object AABB. That's less than 1 nanosecond per vertex. :-) Yeah, yeah, I know... some modern scenes have a crapload of vertices.
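For example, with SSE intrinsics the per-vertex AABB update is just a min and a max on the freshly transformed position. This is my reading of the idea, not the author's exact 3-instruction sequence:

[code]
#include <xmmintrin.h>

static inline void aabb_accumulate(__m128 world_pos, __m128 *aabb_min, __m128 *aabb_max)
{
    *aabb_min = _mm_min_ps(*aabb_min, world_pos);   /* grow the lower corner */
    *aabb_max = _mm_max_ps(*aabb_max, world_pos);   /* grow the upper corner */
}
[/code]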

But just to say it again... if we're talking about collisions of million-vertex objects like in your example, clearly we have no reasonable choice but to specify or automatically compute a vertex-subset of every "huge" object to "reasonably" mimic the outer surface of the object. Otherwise, we need to speed up CPUs 100x or so! :-o


[quote]I think a state change from glEnable or glDisable that changes to the same state will have a low cost on CPU, and zero cost on the GPU. If so, there is no need for a glState to optimize this.[/quote]


For that kind of state-change, I suspect you're correct. To cover my ass from "what I don't know, and what they don't tell us" cases, I decided to perform the same test and action on every OpenGL state, then "not worry about it".

[quote]You don't want to waste time with redundant state-changes and there are very efficient ways to avoid them, just make sure the optimisation isn't more expensive than the original redundancy ;). Usually sorting as well as this kind of filtering would be used.[/quote]

The one very nasty fact that people rarely mention is... we can only perform this global sort in one way (on one basis). So, for example, in my engine I prefer to global sort on the basis of location in 3D space (or 2D in a game/scenario that takes place mostly/exclusively on a surface). In other words, each "globally sorted batch" contains objects in the same 3D volume of space. This is just about the simplest possible basis to sort upon, I suppose. Each 3D volume of space (a world-coordinates AABB for practical purposes) is very efficiently tested against the frustum by the CPU. If the AABB is definitely outside the frustum, my engine ignores the entire batch (does not: send vertices to GPU, set up uniforms, change state, or render). It does not test each object against the frustum like the "conventional way", which means the GPU renders a few more unnecessary objects than a conventional approach, but the CPU test is super simple and super fast.
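As a sketch of that batch-level test (the plane representation and names are mine), a batch is skipped as soon as its AABB is entirely behind any one frustum plane:

[code]
typedef struct { float nx, ny, nz, d; } Plane;    /* nx*x + ny*y + nz*z + d >= 0 means "inside" */

int aabb_outside_frustum(const float mn[3], const float mx[3], const Plane plane[6])
{
    for (int p = 0; p < 6; p++) {
        /* pick the AABB corner farthest along the plane normal (the "positive vertex") */
        float x = plane[p].nx >= 0.0f ? mx[0] : mn[0];
        float y = plane[p].ny >= 0.0f ? mx[1] : mn[1];
        float z = plane[p].nz >= 0.0f ? mx[2] : mn[2];
        if (plane[p].nx * x + plane[p].ny * y + plane[p].nz * z + plane[p].d < 0.0f)
            return 1;    /* whole batch is outside: don't touch its vertices or state */
    }
    return 0;            /* possibly visible: submit the batch */
}
[/code]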

To be sure, it is true that we can perform "sub-sorts" within our "global-sort", and in some engines this will help, sometimes substantially. Especially in engines where the shader-writers go insane and create craploads of super-optimized special-purpose shaders, it would be worth sorting all objects within each unit/batch/section of the global-sort by which shader is required. Since I have so few shaders (and only one shader that checks conditional-rendering bits in each vertex for normal objects), this is irrelevant for me.


[quote]To quote TomF:
1. Typically, a graphics-card driver will try to take the entire state of the rendering pipeline and optimise it like crazy in a sort of "compilation" step. In the same way that changing a single line of C can produce radically different code, you might think you're "just" changing the AlphaTestEnable flag, but actually that changes a huge chunk of the pipeline. Oh but sir, it is only a wafer-thin renderstate... In practice, it's extremely hard to predict anything about the relative costs of various changes beyond extremely broad generalities - and even those change fairly substantially from generation to generation.

2. Because of this, the number of state changes you make between rendering calls is not all that relevant any more. This used to be true in the DX7 and DX8 eras, but it's far less so in these days of DX9, and it will be basically irrelevant on DX10. The card treats each unique set of states as an indivisible unit, and will often upload the entire pipeline state. There are very few incremental state changes any more - the main exceptions are rendertarget and some odd non-obvious ones like Z-compare modes.

3. On a platform like the PC, you often have no idea what sort of card the user is running on. Even if you ID'd the card, there's ten or twenty possible graphics card architectures, and each has a succession of different drivers. Which one do you optimise for? Do you try to make the edge-traversal function change according to the card installed? That sounds expensive. Remembering that most games are limited by the CPU, not the GPU, and you've just added to the asymmetry of that load.
[/quote]

I mostly agree with the above. However, your argument more-or-less presumes lots of state-changes, and correctly states that with modern GPUs and modern drivers, it is a fool's game to attempt to predict much about how to optimize for state changes.

However, it is still true that a scheme that HAS [almost] NO STATE CHANGES is still significantly ahead of those that do. Unfortunately, it is a fool's game to attempt to predict how far ahead for every combination of GPU and driver!
Three things;

Firstly your whole premise around collision detection/response is utterly flawed. Tri-tri intersections are the very lowest and last level of intersection done, most games won't even go to this level for collision detection. If you are performing collisions in the world then you'll start off with a broad AABB sweep, then OBB sweep, then sub-OBBs as required and finally, maybe, a tri-tri intersection if you really require that. So for an object of a few 1000 triangles you end up performing a few vertex transforms before worrying about triangles. So you DONT need the transformed vertex data back from the GPU because you simply don't need that data.

Secondly you keep throwing around the term 'conventional' without any real consideration as to what 'conventional' even means. The last game I worked on, for example, did draw rejection with a few schemes - distance based, frustum based per object & frustum based bounding box depending on the group of objects being rendered. There is no 'conventional' - there is 'basic' which is the per-object frustum based without consideration for the world structure but most games are going to grow beyond that pretty quickly. Hell, the current game I'm working on is using a static PVS solution to reject volumes of the world and associated objects from getting anywhere near being drawn.

Finally, while I'm too busy to think deeply about your sorting it sounds... wrong.
While your visibility detection based on volumes seems sane, beyond that you want to sort batches by state. If you are rendering by volume (as my first read implies) then you are causing both CPU and GPU overhead as you'll be swapping states too often and repeating binding operations etc too.
Example; two volumes have two models each in them, model A and B.
If you render per volume then you'll do:

1. Volume 1
  - Render A
  - Render B
2. Volume 2
  - Render A
  - Render B

Each of those render calls would require setting the various per-instance states each time. Whereas with a simple sort on the object you can detect that they are both using the same state and reduce it to:

1. Render two instances of A
2. Render two instances of B

Scale this up to a few more volumes and you'll quickly see the problem.

As reference EVERYTHING in Battlefield 3 is instance rendered, which for modern (read: DX11-class) hardware is remarkably easy to do. In fact depending on bottlenecks you might well be well served with instancing models with different textures on them and using various techniques to bundle the different data together (texture arrays, UAVs of source data etc).

The point is spending a bit of time sorting for batches across the whole visible scene (something which you can bucket and do via multiple threads too) can save you more time on the CPU and GPU later by reducing submission workload and GPU state changes.
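The instanced path described above can be sketched like this on GL 3.3-class hardware (the attribute slots, InstanceData layout, and names are my assumptions): each instance pulls its own transform and texture-array layer from a per-instance buffer, so one draw call covers many objects.

[code]
typedef struct { float world[16]; float texture_layer; float pad[3]; } InstanceData;

glBindBuffer(GL_ARRAY_BUFFER, instance_vbo);          /* one InstanceData per object */
for (int i = 0; i < 4; i++) {                         /* a mat4 spans four vec4 attributes */
    glEnableVertexAttribArray(4 + i);
    glVertexAttribPointer(4 + i, 4, GL_FLOAT, GL_FALSE, sizeof(InstanceData),
                          (const void *)(sizeof(float) * 4 * i));
    glVertexAttribDivisor(4 + i, 1);                  /* advance per instance, not per vertex */
}
glEnableVertexAttribArray(8);                         /* texture-array layer */
glVertexAttribPointer(8, 1, GL_FLOAT, GL_FALSE, sizeof(InstanceData),
                      (const void *)(sizeof(float) * 16));
glVertexAttribDivisor(8, 1);

glDrawElementsInstanced(GL_TRIANGLES, index_count, GL_UNSIGNED_INT, 0, instance_count);
[/code]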
[quote]You may be right, but your premise is wrong, at least in any scenario my application ever faces. In my application, and I would guess most other applications, the camera moves every frame. Therefore, every transformation matrix for every object must change in the "conventional way".[/quote]
No, not at all -- in the "conventional way", the many objects' local-to-world matrices are untouched, and only the single world-to-projection matrix is changed. So, the minimum change is 64 bytes...
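That 64 bytes is one 4x4 float matrix; the whole per-frame camera update in that scheme can be as small as the following sketch (the uniform name and matrix variable are assumptions):

[code]
GLint loc = glGetUniformLocation(program, "u_world_to_clip");   /* assumed uniform name */
glUseProgram(program);
glUniformMatrix4fv(loc, 1, GL_FALSE, world_to_clip);            /* 16 floats = 64 bytes */
[/code]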

[quote]Furthermore, you totally ignore the very real, practical issue that I keep raising! The CPU needs world-coordinates to perform collision-detection and collision-response, and advocates of the "conventional way" always, always, always, always, always ignore this, pretend the problem doesn't exist, ignore people like me when we point it out, etc. I'm sorry, but a real general-purpose engine must perform collision-detection and collision-response. So at least you must include overhead for setting two matrices to the GPU and you must include overhead for the GPU writing out world-coordinate vertices to CPU memory... for every object every frame! Now we return control to your normal channel! :-)[/quote]
I didn't completely ignore it - I asked in jest if you really need triangle-soup based collision detection. In my experience, it's standard practice to use an entirely different data representation for physics/collision than for rendering -- the two systems are entirely decoupled.
Trying to use your visual representation for most game physics tasks seems entirely overcomplicated, especially as graphics assets continue to increase in detail. You seem to be designing huge portions of your framework around the assumption that this is important, and creating dangerous couplings in the process.

[quote]Okay, I must interrupt with an objection here, but then we'll return to your thoughts. In most games and simulations, most objects contain far fewer vertices, and most objects do not move (a wall, a chair, a table, a dish, a fork, a baseball, a bat, a blade of grass, etc). Then a small minority of objects are large, or in your scenario - huge![/quote]
For the target hardware in this thread, those numbers are sane enough, and they're only going to get worse. If you want something for 6 years from now, make sure that it scales - in memory usage and processing time - as the art density increases. Your current system suffers a memory explosion in complex, dynamic scenes, so I'd at least recommend a hybrid approach, where certain objects can opt-out of your batch-optimisation process and be rendered more "conventionally".

[quote]That's one place you're wrong. If we're gonna perform collision-detection, then you need ALL vertices in world-coordinates, and probably you need them in CPU memory. So you not only have all that storage, you also need to have the GPU transfer all those world-coordinates back to CPU memory so the CPU can perform the collision-detection.[/quote]
This is crazy, I've never seen this in a game. The status quo is to use an entirely different "Physics Mesh", which is like one of your visual LODs. As your visual LODs scale up to millions of vertices, your physics mesh usually remains very simple.
Most physics meshes aren't dynamic and there shouldn't be a time where you'd need to be transforming their vertices. Physics meshes that are dynamic are usually constructed out of articulated rigid body shapes, rather than arbitrarily deforming polygon soups.
    <snip>


This collision setup breaks under a client/server model so it's most definitely not suitable as any kind of general-purpose approach. With client/server you have 2 fundamental core assumptions: (1) the client is not to be trusted, and (2) the server may be somewhere on the other side of the world. It follows from those that the server does physics (at least the important gameplay-affecting stuff) and the client does rendering (unless you really want to send screen updates over the internet), so each needs a different data set.

    It also starts out by assuming the very worst case - that you're going to need a per-triangle level of detail for collision detection with everything. There are actually at least two coarser tests that you run to trivially reject objects before going to this level of detail (even if you ever do go that far) - "can these objects collide?" (they may be in different areas of the map or moving in opposite directions) and a coarser test (sphere/bbox) for "do these objects collide?" Only if both of those are passed (and over 90% of your game objects are going to fail) do you need a more detailed test. The fastest code is the code that never runs - avoiding transforms altogether for over 90% of objects is going to trump any multithreaded SIMD setup.

    Regarding collision response, if you need to re-animate as a result of that you don't re-animate the vertexes, you re-animate the bones.

    Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.


[quote]Three things;

Firstly your whole premise around collision detection/response is utterly flawed. Tri-tri intersections are the very lowest and last level of intersection done, most games won't even go to this level for collision detection. If you are performing collisions in the world then you'll start off with a broad AABB sweep, then OBB sweep, then sub-OBBs as required and finally, maybe, a tri-tri intersection if you really require that. So for an object of a few 1000 triangles you end up performing a few vertex transforms before worrying about triangles. So you DONT need the transformed vertex data back from the GPU because you simply don't need that data.[/quote]

    My whole premise is utterly flawed? Interesting, because my post explicitly stated (with #1, #2, #3) the stages of my collision-detection. It said phase #1 is the broad AABB overlap sweep, just as you say above. It also says, "that's all that's required for some types of applications". However, I went to great pains at the beginning of this thread to emphasize that we're talking about next-generation general-purpose 3D game/simulation engines. They must be able to handle all kinds of situations. Having said that, I invite alternatives. For example, it is certainly arguable that sub-dividing irregular/concave objects into multiple convex objects so they can be processed by faster convex algorithms (like GJK) is superior to writing a totally general algorithm that handles completely arbitrary, irregular, concave objects (as I did).

    I might not exactly understand what you mean by OBB and sub-OBB, so maybe you can elaborate. I assume OBB means OOBB (object oriented bounding box), the alternative to AABB (axis aligned bounding box). But I'm not sure what you mean by sub-OBB. I am not certain whether inserting an OOBB test on object-pairs that the AABB test classifies as "overlapping" speeds up collision-detection or not. It might, though I doubt by very much. However, I could be wrong about that --- you might have a blazing fast ways to compute OOBBs and test a pair against each other, in which case the speedup might be significant and worthwhile (worth the extra complexity). However, after this level you lose me --- I just don't follow what you're talking about, unless maybe you have a nested set of more-and-more, smaller-and-smaller OOBBs inside the outer OOBB. If so, my off-the-cuff impression is... that must be quite a bit slower and more complex than a good, fast, optimized GJK algorithm like the one I have.

    You are correct, of course, that an engine that works the "conventional way" (which I clearly stated means keeping local-coordinates of vertices in the GPU), can have the CPU perform local to world coordinate transformations of various subsets of object vertices "on demand". But guess what? If the CPU doesn't transform ALL object vertices to world-coordinates, it doesn't know the object AABB. Of course, you can perform an analysis on non-deforming objects to find the few vertices that can never be an outlier on ANY of the three axes, and exclude those from transformation because they will never set any of the 6 limits on the AABB. You could also exclude more vertices from this test, but artificially increase the size of the AABB to counteract the largest possible error this scheme might introduce (though I'm not exactly clear how to write such a routine, off hand).

[quote]Secondly you keep throwing around the term 'conventional' without any real consideration as to what 'conventional' even means.[/quote]
    I did explain what I mean by "conventional". It only means keeping the local-coordinates of vertices in GPU memory, and always making the GPU transform them from local to screen coordinates every frame. I believe that is conventional in the sense that 90% or more of existing engines do this.

[quote]The last game I worked on, for example, did draw rejection with a few schemes - distance based, frustum based per object & frustum based bounding box depending on the group of objects being rendered. There is no 'conventional' - there is 'basic' which is the per-object frustum based without consideration for the world structure but most games are going to grow beyond that pretty quickly. Hell, the current game I'm working on is using a static PVS solution to reject volumes of the world and associated objects from getting anywhere near being drawn.[/quote]
    About those objects that you "draw-rejected". Do you also exclude them from casting shadows onto objects that are in the frustum? Do you also exclude them from colliding with each other? Sorry, I mean do you exclude your engine from recognizing when they collide with each other, just because they were "draw rejected" [at an early stage]?

[quote]Finally, while I'm too busy to think deeply about your sorting it sounds... wrong.
While your visibility detection based on volumes seems sane, beyond that you want to sort batches by state. If you are rendering by volume (as my first read implies) then you are causing both CPU and GPU overhead as you'll be swapping states too often and repeating binding operations etc too.
Example; two volumes have two models each in them, model A and B.
If you render per volume then you'll do:

1. Volume 1
  - Render A
  - Render B
2. Volume 2
  - Render A
  - Render B

Each of those render calls would require setting the various per-instance states each time. Whereas with a simple sort on the object you can detect that they are both using the same state and reduce it to:

1. Render two instances of A
2. Render two instances of B

Scale this up to a few more volumes and you'll quickly see the problem.[/quote]
Yikes! See, this is the problem. And frankly, I can't blame you, because --- as I'm sure you're aware --- once any of us adopts a significantly non-standard approach, that radically changes the nature of many other interactions and tradeoffs. Since you and virtually everyone else have so thoroughly internalized the tradeoffs and connections that exist in more conventional approaches, you don't (because you can't, without extraordinary time and effort) see how the change I propose impacts everything else.

    In this case, what you are failing to see, is that once you adopt the proposed unconventional approach, there is rarely if ever any need to change ANY state (including shaders) between batches (where each batch contains objects in one volume of space).

To help clarify this, the following is what my frame looks like:

[code]
/* set up state once per frame: shader, texture arrays, view-projection matrix, ... */
for (int b = 0; b < batch_count; b++) {     /* each batch holds the objects in one volume of space */
    if (aabb_overlaps_frustum(&batch[b].volume, &frustum)) {
        /* one call draws the entire batch, including every object within it */
        glDrawElements(GL_TRIANGLES, batch[b].index_count, GL_UNSIGNED_INT,
                       (const void *)(batch[b].first_index * sizeof(GLuint)));
    }
}
[/code]

    That's it! So, where are these "state changes" you're talking about? That's right, there aren't any! Every object in every volume is rendered with the exact same state. Yes, the exact same state. Yes, that's correct, this means every object is rendered with the exact same transformation matrix.

    How can this be? It "be", because the CPU transformed the vertices of every object that rotated or moved to world coordinates, and transferred them to GPU memory. Every vertex of every object that did not rotate or move since last frame still has the same world coordinates, and therefore that information does not need to be transferred to the GPU --- the vertices in GPU memory are already correct. And since ALL vertices in the GPU are in world-coordinates (not hundreds of different local-coordinates), ALL vertices of ALL objects get rendered by the exact same transformation matrix. Get it? No state changes. Not between individual objects, and not even between batches!

    Now, my engine does have a couple special cases, so in a complex scenario that has every possible kind of object (plus points and lines too), there might be some state changes. At the very least, glDrawElements() gets called with an argument that indicates "points" or "lines" are being drawn. Even then, however, the same shader draws points and lines and triangles, so even in these cases there are no state changes. Only in a couple situations (that don't often exist) is any state change needed. Each frame, since the camera probably rotated or moved, a new transformation matrix must be specified, and possibly a few other items. But that's it! They are specified once, before anything is rendered for that frame, and then [typically] NO state is changed the entire frame.

    I understand why this seems counterintuitive. Once you adopt the "conventional way" (meaning, you store local-coordinates in GPU memory), you are necessarily stuck changing state between every object. You cannot avoid it. At the very least you must change the transformation matrix to transform that object from its local coordinates to screen coordinates. And because you cannot avoid it, and it is an utterly unavoidable aspect of the conventional approach, you do not even imagine a system that requires no state changes for entire batches containing many objects, much less every object in the entire frame.

    This is why I always find it difficult to discuss this topic with anyone who "knows too much"... or "has too much experience". You have a well founded set of assumptions that apply quite well to the conventional approach. And this makes it difficult to imagine a whole new set of interactions and tradeoffs and dynamics of a different approach. On the surface, the two approaches might not seem that different --- one simply stores local-coordinates in the GPU, while the other stores world-coordinates in the GPU. But look at how much the interactions and tradeoffs and dynamics change as a consequence.

[quote]As reference EVERYTHING in Battlefield 3 is instance rendered, which for modern (read: DX11-class) hardware is remarkably easy to do. In fact depending on bottlenecks you might well be well served with instancing models with different textures on them and using various techniques to bundle the different data together (texture arrays, UAVs of source data etc).

The point is spending a bit of time sorting for batches across the whole visible scene (something which you can bucket and do via multiple threads too) can save you more time on the CPU and GPU later by reducing submission workload and GPU state changes.[/quote]
Yes, I understand. As you might imagine, my engine depends on features like texture arrays. After all, in my approach, a single glDrawElements() might draw 100 or even 1000 different objects, with dozens if not hundreds of textures, bumpmaps, parallax-maps, specular-maps and other goober on them. How on earth could an engine possibly do this without texture-arrays? I suppose one mammoth texture could be subdivided into hundreds of smaller textures, ala "texture atlas", but texture arrays are a much better, more convenient solution, and my scheme wouldn't even work if I had to switch texture state and call glDrawElements() again for every object. Perish the thought! Good riddance to the problems of the "old days"! Of course, this means every vertex of every object needs a texture-ID field to identify where to find the textures for that object (plus "rules" that specify where other textures are found based on the specified texture-ID and tcoords). In all, my vertex has 32-bits of small ID fields, plus a bunch of bits that specify how to render (like "enable texture", "enable color", "emission", etc). So if the color-bit and texture-bit are both set, then each pixel of the applied texture is "colored" by the object color, and so forth (so a great many combinations are possible for each triangle given just a few control bits).
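Purely as illustration, the kind of vertex layout being described might look like this (my guess at the fields, not the engine's actual structure):

[code]
#include <stdint.h>

typedef struct {
    float    position[3];      /* world coordinates (local only for "huge" objects) */
    float    normal[3], tangent[3], bitangent[3];
    float    tcoord[2];
    uint32_t texture_ids;      /* packed small IDs: texture/normalmap layers in the texture arrays */
    uint32_t control_bits;     /* per-vertex render flags: enable-texture, enable-color,
                                  color-modifies-texture, enable-bumpmapping, emission, ... */
} Vertex;
[/code]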

