# OpenGL most efficient general rendering strategies for new GPUs


## Recommended Posts

Rendering performance of 3D game/graphics/simulation engines can be improved by quite a few techniques. Examples include culling (backface, obscured, frustum, etc.), simple/fast shaders for deferred processing, uber-shaders to support large batches, etc.

In this thread, I'd like experienced 3D programmers to brainstorm and try to identify the set of techniques that will most speed up rendering of high-quality general scenes on current and next-generation high-end CPUs, GPUs and OpenGL/GLSL APIs (let's assume a 5-year timeframe). Complexity of implementation should also be considered important.

The goal is to come up with a set of techniques, and their order of execution (including parallel execution), that best suits high-quality, general-purpose scenes with large numbers of objects. In other words, imagine you're writing a 3D engine that needs to execute a variety of common game and simulation scenarios efficiently (not just one specific game). The nominal scenarios should range between:

#1: A very large outdoor [or indoor] environment in which most objects do not move on a typical frame, but dozens of objects are moving each frame.

#2: A game in outer space in which most or all objects move every frame.

Let's assume the engine supports 1 ambient and several point light sources, and that some form of soft shadows is required.

The following reduce efficiency and should be considered:
- small batches
- rendering objects outside the frustum
- rendering objects entirely obscured by closer opaque objects
- rendering objects behind semi-transparent objects
- some form of parallax mapping vs detailed geometry
- add more here

There are many possible "dynamics" to consider.

For example, if we write one or more "uber-shaders" that test bit-fields and/or texture-IDs and/or matrix-IDs in each vertex structure to control how the pixel shader renders each triangle, it is possible to render huge quantities of objects with a single call to glDrawElements() or equivalent. On the other hand, every triangle takes a little longer to execute, due to the multiple paths in the pixel shader.

Another dynamic is the complexity of culling objects outside the frustum when they do or might cast shadows, when the environment contains mirrors or [semi-]reflective surfaces, and when the environment contains virtual cameras that point in arbitrary directions and whose views are rendered on video displays at various places [possibly] within the scene. Furthermore, shouldn't collision detection and response be computed for all objects, even those outside the frustum?

At one end of the spectrum of possibilities is an approach in which every possible efficiency is tested for and potentially executed every frame. Considering how the various possible efficiencies and aspects of a scene interact, this approach could be extremely complex, tricky, and prone to cases that are not handled correctly due to that complexity.

At the other end of the spectrum is an approach in which every object that has moved is transformed every frame, without testing for visibility in the frustum, shadow-casting onto objects in the frustum, etc. Instead, this approach would attempt to find a way to perform every applicable computation on every object as efficiently as possible, and possibly even render every object. Perhaps this approach could support one type of culling without risking unwanted interactions: group objects near each other into individual batches, then skip rendering into the backbuffer those batches that fall entirely outside the frustum. But this culling would apply only to the final rendering phase, not the collision phase or shadow-computing phase, etc.

I consider this a difficult problem! I've brainstormed this issue with myself for years, and have never felt confident I have the best answer... or even close to the best answer. I won't bias this brainstorming session by stating my nominal working opinion before others have voiced their observations and opinions.

Please omit discussions that apply to CPUs older than current high-end CPUs, GPUs older than GTX-680 class, and OpenGL/GLSL older than v4.20, because the entire point of this thread is to design something that's efficient 2~4 years from now, and likely for years beyond that. Also omit discussions that apply to non-general environments or non-general rendering.

OTOH, if you know of new features of next-generation CPUs/GPUs/OpenGL/GLSL that are important to this topic, please DO discuss these.

Assume the computer contains:
- one 4GHz 8-core AMD/Intel CPU
- 8GB to 32GB of fast-ish system RAM
- one GTX-680 class GPU with 2GB~4GB RAM
- one 1920x1200 LCD display (or higher resolution)
- no other applications running simultaneously
- up to 8~16 simultaneously active engine threads on the CPU

Assume the 3D engine supports the following conventional features:
- some kind of background (mountains, ocean, sky)
- many thousands of objects
- millions of triangles
- several point lights (or many point lights, but only the closest*brightest applied to each triangle)
- ambient lighting
- texture mapping
- bump mapping
- parallax mapping (maybe, vs real geometry)
- collision detection (broad & narrow phase, convex/concave, fairly accurate)
- collision response (basic physics)
- objects support hierarchies (rotate/translate against each other)
- semi-transparent objects == optional

##### Share on other sites
Using queries to find whether objects are occluded is interesting, but difficult. The problem is that a query may report that an object was indeed occluded, but using that as a decision for the next frame may be invalid when things move around. Another problem with queries is that you have to be careful about when to ask for the result. If you ask too soon, you can force the pipeline to slow down.

But there are some ways in which I have used queries to great success. Suppose the world contains objects with many vertices, and the objects are sorted from near to far. I draw a frame like this:

1. Draw the first 10% of the objects.
2. Draw invisible bounding boxes representing the next 10%, with queries enabled.
3. For each object in 2, check the query result (which should now be available), and draw the full object if the box was visible.
4. Repeat from step 2 for the next 10% of objects.

There is an overhead to drawing the boxes, but it is small compared to drawing the full objects. A problem with this algorithm is finding a good subset of objects to test in each iteration. If the percentage is too small, there is a risk that the queries may slow down the pipeline. If the percentage is too big, the number of occluded objects will go down.
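The scheduling logic of the steps above can be sketched as follows. This is a minimal stand-in: `queryBoxVisible` is a plain callback representing a GPU bounding-box occlusion query (e.g. one issued with `glBeginQuery(GL_ANY_SAMPLES_PASSED, ...)`), so only the chunked interleaving is shown, not the GL calls themselves.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Chunked occlusion-query scheme: objects are assumed sorted near-to-far.
// Returns the indices of the objects that would be fully drawn.
std::vector<std::size_t> drawWithChunkedQueries(
    std::size_t objectCount,
    double chunkFraction,  // e.g. 0.10 for 10% per iteration
    const std::function<bool(std::size_t)>& queryBoxVisible)
{
    std::vector<std::size_t> drawn;
    std::size_t chunk = std::max<std::size_t>(
        1, static_cast<std::size_t>(objectCount * chunkFraction));

    // 1. The nearest chunk is drawn unconditionally.
    std::size_t i = 0;
    for (; i < chunk && i < objectCount; ++i)
        drawn.push_back(i);

    // 2-4. For each following chunk, the invisible bounding boxes are "drawn"
    // with queries enabled; only objects whose box passed are drawn in full.
    // By the time a chunk's real geometry is needed, the box-pass results
    // should already be available, so the pipeline is not stalled.
    while (i < objectCount) {
        std::size_t end = std::min(objectCount, i + chunk);
        for (std::size_t j = i; j < end; ++j)
            if (queryBoxVisible(j))   // stands in for the GPU query result
                drawn.push_back(j);
        i = end;
    }
    return drawn;
}
```

The interesting tuning knob is `chunkFraction`, exactly as described: smaller chunks risk stalling on query results, larger chunks reduce how much occlusion can be exploited.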

##### Share on other sites
I find it suspicious that so many articles are published about one way or another of culling, or improving rendering efficiency, but in my experience they are all narrow in focus - pointing at one possibility or another, but not the whole picture. Unfortunately, many approaches interact with other approaches, or are not mutually compatible. For example, you can only choose or organize the content of each VBO [or batch] on one primary basis. Also, if you work hard to avoid drawing objects that are not visible from the current camera, you might end up calling glDrawElements() zillions of times (so much for "large batches"), and you might spend a lot of CPU figuring out which objects are indeed "occluded" or "outside the frustum" or otherwise excluded.

In practice, most decisions create substantial complexity in ways that become a combinatorial explosion - and almost always upon the CPU. Here are a couple of examples of practical problems and interactions that seem to be ignored in analyses of this topic. #1: When you cull objects because they are outside the frustum (or for "occlusion", or another "efficiency" reason), those objects might cast shadows into the visible frame. Oops! In the silence of a realistic space game/simulation especially, these shadows can be key. #2: When you cull objects for any reason, how will your game/simulation know whether they've crashed into each other?

And these are just two "practical problems". My engine supports in-game/simulation cameras and displays. In other words, the scene may contain "security cameras" (and other kinds of cameras) whose output is displayed on fixed or moving LCD screens. Exactly how complex does the frustum and/or occlusion scheme become with lots of cameras pointing in all directions, and their images being shown on lots of displays, in real time, in each view of each camera? Blank out! But I suppose that complexity is a bit esoteric compared to the problem of not computing collisions or showing shadows (of potentially huge or numerous objects) because they (and the sun) are behind the camera.

I guess now is the time to stir the pot by floating my extremely unconventional non-solution solution. Before I state it, let me admit that special-purpose engines (engines for one game, or one narrow type of game) may be able to adopt more efficiency measures. For example, in some situations "players won't usually notice" that shadows are not cast by objects outside the frustum, and "players won't usually notice" when objects never collide outside the primary frustum.

I was driven to remove efficiency-measure after efficiency-measure for the kinds of reasons I just mentioned. My eventual "non-solution solution" was to remove or limit most efficiency measures, and focus on generalizing and streamlining the existing computation and rendering processes (where "computation processes" includes collision detection-and-response and other [mostly non-rendering] processes).

Anyway, I ended up with something like the following. Well, except, hopefully not literally "ended up", for perhaps some of the geniuses here will identify and explain better strategies and approaches.

Rather than try to rip out all sorts of techniques that I made work together to the extent possible, I "started over" and created the most streamlined general processes I could.

#1: On every frame, when the position or rotation of an object changes, mark the object "geometry modified".

#2: After all object changes are complete for each frame, the CPU performs transformations of "modified" objects (and their attached [articulatable] "child objects") from local-coordinates to world-coordinates. This updates the copy of the world-coordinates vertices in CPU memory. Then transfer just these modified vertices into the VBO that contains the object in GPU memory.
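A minimal sketch of steps #1-#2, with hypothetical names: objects carry a "geometry modified" flag, and once per frame only flagged objects (usually a small minority) are re-transformed from local to world coordinates. The real engine would then upload just those vertex ranges to the VBO (e.g. with `glBufferSubData()`); that GL call is omitted here, and each object is reduced to a single position vertex and a pure translation for brevity.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct Object {
    std::array<double, 3> local{};     // local-coordinate position (1 vertex for brevity)
    std::array<double, 3> world{};     // cached world-coordinate position
    std::array<double, 3> translate{}; // this object's world placement
    bool modified = false;             // set whenever position/rotation changes
};

// Returns how many objects were actually transformed this frame.
std::size_t updateWorldCoordinates(std::vector<Object>& objects)
{
    std::size_t transformed = 0;
    for (Object& o : objects) {
        if (!o.modified) continue;     // stationary objects cost nothing
        for (int i = 0; i < 3; ++i)
            o.world[i] = o.local[i] + o.translate[i];
        o.modified = false;            // clean until it moves again
        ++transformed;
    }
    return transformed;
}
```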

#3: Make each batch contain up to 65536 vertices so the GPU need only access a 16-bit index to render each vertex. Objects with more than 65536 vertices can be rendered in a single batch with 32-bit indices, or divided into multiple batches with up to 65536 indices in each. Making batches 8192 elements or larger has substantial rendering speed advantages, while batches larger than 65536 vertices have minor speed benefits with current-generation GPUs (which contain roughly 512~1024 cores divided between vertex shaders and pixel shaders [and sometimes also geometry shaders, gpgpu/compute shaders, and others]).
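The batch-splitting rule in #3 can be sketched as a small helper (a hypothetical name): keep sub-batches at or below 65536 vertices so their indices fit in 16 bits (`GL_UNSIGNED_SHORT`); anything larger either switches to 32-bit indices or is split.

```cpp
#include <cstdint>
#include <vector>

// Split an object's vertex count into sub-batches of at most maxPerBatch
// vertices, so each sub-batch can be indexed with 16-bit indices.
// Returns the vertex count of each sub-batch.
std::vector<std::uint32_t> splitIntoBatches(std::uint32_t vertexCount,
                                            std::uint32_t maxPerBatch = 65536)
{
    std::vector<std::uint32_t> batches;
    while (vertexCount > 0) {
        std::uint32_t n = vertexCount < maxPerBatch ? vertexCount : maxPerBatch;
        batches.push_back(n);   // this sub-batch can use GL_UNSIGNED_SHORT
        vertexCount -= n;
    }
    return batches;
}
```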

#4: Write super-efficient 64-bit SIMD/SSE/AVX/FMA3/FMA4/etc. functions in assembly language to multiply 4x4 matrices, and also functions to transform arrays of vertices [from local to world coordinates]. These routines are amazingly fast - only a dozen nanoseconds per vertex --- per CPU core. And I can make 4 to 8 cores transform vertices in parallel for truly amazing performance. This includes the memory-access overhead to load the local-coordinate vertices into CPU/SIMD registers, and the memory-access overhead to store world-coordinate vertices back into memory (two of them in fact, one with f64 elements and one with f32 elements for the GPU). And my vertices contain four f64x4 vectors to transform (position, zenith vector (normal), north vector (tangent), east vector (bi-tangent)).
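As a scalar stand-in for the assembly routines in #4 (the SIMD/AVX versions and the multi-threaded split are not shown), the batched local-to-world transform might look like this: one 4x4 matrix applied in place to an array of homogeneous positions.

```cpp
#include <array>
#include <vector>

using Vec4 = std::array<double, 4>;
using Mat4 = std::array<Vec4, 4>;   // row-major 4x4 matrix

// Apply one local-to-world matrix to an array of positions, in place.
// A real implementation would use AVX/FMA intrinsics and hand the array
// out in slices to 4~8 worker threads.
void transformPositions(const Mat4& m, std::vector<Vec4>& v)
{
    for (Vec4& p : v) {
        Vec4 r{};
        for (int row = 0; row < 4; ++row)
            for (int col = 0; col < 4; ++col)
                r[row] += m[row][col] * p[col];
        p = r;
    }
}
```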

#5: In general, each batch contains objects in a specific region of 3D space (or 2D space in primarily "flat" environments like the surface of a planet). All batches can be processed by the CPU for collision detection, shadow casting, and other purposes --- OR --- any batches can be excluded if appropriate to the game/simulation/environment. However, the CPU need not render any batch that lies outside the frustum of a given camera, so the one primary sort of visibility culling performed is excluding regions that are entirely outside the relevant frustum.

##### some benefits of this approach #####
##### others please state drawbacks #####
#1: Since all objects are in VBOs in GPU memory, and they are always in world coordinates, there is only one transformation matrix for all objects. This means that the same transformation matrix can be applied to every vertex in every object, and so any mix of objects can be rendered in a single "batch" (a single call to glDrawElements() or equivalent). In some scenes a few types of objects benefit from "instancing". To support up to 16 instanced object types in any rendered frame, each vertex has a 4-bit field to specify an alternate or additional matrix for that object type.

#2: In most games and simulations, the vast majority of objects are stationary on a random frame (if not permanently stationary). Therefore, the CPU needs to transform and move to VBO in GPU memory only those objects that were modified during the last frame, which is usually a tiny minority of objects.

#3: Since we have accurate world-coordinates in CPU memory for every object every frame, we can perform accurate collision-detection [and collision-response if desired] on every object, every frame.

#4: Since we have accurate world-coordinates in CPU memory for every object every frame, we can perform accurate shadow casting by any-or-every object, every frame, whether those objects are rendered to the backbuffer or not.

#5: Since cameras and lights are also objects, we also have accurate world-coordinates in CPU memory for every camera (viewpoint) and every light source. We therefore can, if and when convenient, perform frustum and other visibility/shadowing computations on the CPU without the difficulty and complexity of having the GPU spew world-coordinates of all objects back to CPU memory.
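The 4-bit per-vertex matrix field from benefit #1 could be packed into spare bits of an existing integer vertex attribute, selecting one of up to 16 instance matrices in the shader. This is a sketch under assumptions: the bit position and helper names are hypothetical, not taken from the poster's engine.

```cpp
#include <cstdint>

// Pack/unpack a 4-bit matrix index into the top nibble of a 32-bit
// integer vertex attribute (bit position chosen for illustration).
constexpr std::uint32_t MATRIX_ID_SHIFT = 28;            // top 4 bits
constexpr std::uint32_t MATRIX_ID_MASK  = 0xFu << MATRIX_ID_SHIFT;

std::uint32_t packMatrixId(std::uint32_t attr, std::uint32_t matrixId)
{
    return (attr & ~MATRIX_ID_MASK) |
           ((matrixId & 0xFu) << MATRIX_ID_SHIFT);
}

std::uint32_t unpackMatrixId(std::uint32_t attr)
{
    return (attr & MATRIX_ID_MASK) >> MATRIX_ID_SHIFT;
}
```

The vertex shader would do the equivalent of `unpackMatrixId` on the integer attribute and index into a uniform array of up to 16 matrices.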

===============

Okay, that ought to stir up a hornet's nest, since my approach is almost the opposite of "conventional wisdom" in most ways. I invite specific alternatives that should work better for a general-purpose game/simulation engine designed to render general environments as realistically as possible (not cartoons or other special-purpose scenarios, for example).

##### Share on other sites

> I find it suspicious that so many articles are published about one way or another about culling, or improving rendering efficiency, but in my experience they are all narrow focus - pointing at one possibility or another, but not the whole picture.

True. But games are different, usually with different requirements and different ways that the game engine is constructed. So there is no "one size fits all". Rather, there is a palette of algorithms to choose from.
> ... and you might spend a lot of CPU figuring out which objects are indeed "occluded" or "outside frustum" or otherwise excluded.
Certainly a problem, and a compromise is required. A problem is to find a good compromise that stays the same on various combinations of graphics cards and CPU architectures. I think multi-threaded solutions will gradually push this in the direction where it is possible to do more and more pre-computation.
> #1: When you cull objects because they are outside the frustum (or "occlusion", or other "efficiency" reason), those objects might cast shadows into the visible frame. Oops! In the silence of a realistic space game/simulation especially, these shadows can be key.
Are you sure that is a problem? You cull the object, not the shadow of the object. That means that the object will not be shown on your screen, but the shadow will. The computation of shadows depends on light sources, and these have to do their own, independent, culling and rendering.
> #2: When you cull objects for any reason, how will your game/simulation know whether they've crashed into each other?
Culling is used for deciding what shall be drawn. It is not used for collision detection, which is an independent algorithm (for example, using an octree). I think collision detection is usually done on the main CPU, not the GPU.
> My engine supports in-game/simulation cameras and displays.
Pictures in cameras are created using a separate render pass, which has its own culling algorithm.
> ...Before I state it, let me admit that special-purpose engines (engines for one game, or one narrow type of game) may be able to adopt more efficiency measures. For example, in some situations "players won't usually notice" that shadows are not cast by objects outside the frustum, and "players won't usually notice" when objects never collide outside the primary frustum.
Graphics optimization is a lot about cheating. You can, and have to, do the most "funny" things, as long as the player doesn't notice.
> #2: After all object changes are complete for each frame, the CPU performs transformations of "modified" objects (and their attached [articulatable] "child objects") from local-coordinates to world-coordinates. This updates the copy of the world-coordinates vertices in CPU memory. Then transfer just these modified vertices into the VBO that contains the object in GPU memory.
A very common technique is to define an object as a mesh. You can then draw the same mesh in many places in the world, simply using different transformation matrices. This would not fit into the use of world coordinates for every vertex. When an object moves, you only have to update the model matrix instead of updating all vertices.
> #3: Make each batch contain up to 65536 vertices so the GPU need only access a 16-bit index to render each vertex. Objects with more than 65536 vertices can be rendered in a single batch with 32-bit indices, or divided into multiple batches with up to 65536 indices in each. Making batches 8192 elements or larger has substantial rendering speed advantages, while batches larger than 65536 vertices have minor speed benefits with current-generation GPUs (which contain roughly 512~1024 cores divided between vertex shaders and pixel shaders [and sometimes also geometry shaders, gpgpu/compute shaders, and others]).
Sure about that? One advantage of indexed drawing is the automatic use of cached results from the vertex shader, and that works just as well with 32-bit indices. But I don't know for sure.
> #4: Write super-efficient 64-bit SIMD/SSE/AVX/FMA3/FMA4/etc. functions in assembly language to multiply 4x4 matrices, and also functions to transform arrays of vertices [from local to world coordinates]. These routines are amazingly fast - only a dozen nanoseconds per vertex --- per CPU core. And I can make 4 to 8 cores transform vertices in parallel for truly amazing performance. This includes the memory-access overhead to load the local-coordinate vertices into CPU/SIMD registers, and the memory-access overhead to store world-coordinate vertices back into memory (two of them in fact, one with f64 elements and one with f32 elements for the GPU). And my vertices contain four f64x4 vectors to transform (position, zenith vector (normal), north vector (tangent), east vector (bi-tangent)).
Isn't it better to let the GPU do this instead? I think a 4x4 matrix multiplication takes only one cycle? Hard to beat.

##### Share on other sites

> Isn't it better to let the GPU do this instead? I think a 4x4 matrix multiplication takes only one cycle? Hard to beat.

Depends on the use case. If, for example, you're transforming a point light into the same space as a model, then sending the light's transformed position to the GPU, it's going to be more efficient to do this transformation once per model on the CPU than once per vertex on the GPU. Even so, you're going to need a hell of a lot of point lights for that transformation to become in any way measurable on your performance graphs (and even then it's still going to be a minuscule fraction of your overall cost).

But yeah, for general position/matrix transforms you do them on the GPU - any kind of CPU-side SIMD optimization of that is way back in the realms of software T&L.
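The per-model light transform mentioned above can be illustrated with a deliberately simplified sketch (hypothetical names; a model placed by a pure translation, so the inverse transform is just a subtraction — a rotated model would need the inverse rotation applied as well):

```cpp
#include <array>

using Vec3 = std::array<double, 3>;

// Move a world-space point light into a model's local space once per model,
// instead of moving every vertex into world space once per vertex.
Vec3 lightToModelSpace(const Vec3& lightWorld, const Vec3& modelTranslation)
{
    return { lightWorld[0] - modelTranslation[0],
             lightWorld[1] - modelTranslation[1],
             lightWorld[2] - modelTranslation[2] };
}
```

The result would then be uploaded once as a uniform, and the vertex shader lights untransformed model-space vertices directly.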

##### Share on other sites

> > maxgpgpu: I find it suspicious that so many articles are published about one way or another about culling, or improving rendering efficiency, but in my experience they are all narrow focus - pointing at one possibility or another, but not the whole picture.
>
> True. But games are different, usually with different requirements and different ways that the game engine is constructed. So there is no "one size fits all". Rather, there is a palette of algorithms to choose from.
True enough for special-purpose engines. However, the topic of general-purpose engines is not irrelevant --- after all, there are lots of general-purpose engines!

> > ... and you might spend a lot of CPU figuring out which objects are indeed "occluded" or "outside frustum" or otherwise excluded.
>
> Certainly a problem, and a compromise is required. A problem is to find a good compromise that stays the same on various combinations of graphics cards and CPU architectures. I think multi-threaded solutions will gradually push this in the direction where it is possible to do more and more pre-computation.
I agree; that's one of the conclusions I came to. That's why I do a lot more work on the CPU than anyone I've heard give advice. My development system has an 8-core CPU, and I figure in 2~3 years 8-core will be mainstream and I'll have a 16-core CPU in my box.

> > #1: When you cull objects because they are outside the frustum (or "occlusion", or other "efficiency" reason), those objects might cast shadows into the visible frame. Oops! In the silence of a realistic space game/simulation especially, these shadows can be key.
>
> Are you sure that is a problem? You cull the object, not the shadow of the object. That means that the object will not be shown on your screen, but the shadow will. The computation of shadows depends on light sources, and these have to do their own, independent, culling and rendering.
Yes, I suppose separate culling schemes per pass/phase can work. On the other hand, performing culling at all assumes the CPU is figuring out which objects can be culled... for shadow-casting too. And if the CPU needs to know where things are every cycle to perform this process, then where is the savings of having the GPU transform everything on pass #0 in order to figure out what casts shadows into the frame and what doesn't? This is one of the considerations that drove me to just transform everything to world coordinates as efficiently as possible on the CPU.

> > #2: When you cull objects for any reason, how will your game/simulation know whether they've crashed into each other?
>
> Culling is used for deciding what shall be drawn. It is not used for collision detection, which is an independent algorithm (for example, using an octree). I think collision detection is usually done on the main CPU, not the GPU.
Yes, but you are just confirming my thesis now, because collision detection (and many other computations) must be performed in world coordinates. So if culling means deciding which objects the GPU need not render, then the CPU needs to know which objects can be culled every frame, which means the CPU needs to transform every object that has moved every frame. Thus culling saves GPU time, but the CPU must at least transform the AABB or bounding sphere to world coordinates to perform any kind of culling at all.

> > My engine supports in-game/simulation cameras and displays.
>
> Pictures in cameras are created using a separate render pass, which has its own culling algorithm.
Yes, my engine must perform a rendering loop for each camera. However, since my engine transforms every object that moved since the last frame to world coordinates on the first iteration of that loop, the CPU need not transform any objects for the 2nd, 3rd, 4th... to final camera. The transformation process also automatically updates the AABB for every transformed object, so that information is also available for other forms of culling. Of course, on each loop the CPU must call glDrawElements() as many times as necessary to render the camera view into a texture attached to an FBO. No way to avoid that, though objects outside the frustum need not be drawn (a fairly quick and simple AABB-versus-frustum check).
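The "fairly quick and simple AABB-versus-frustum check" can be sketched with the standard positive-vertex test (a common formulation, not necessarily the poster's exact code): for each of the six planes, test only the AABB corner farthest along the plane normal; if that corner is behind any plane, the whole box is outside.

```cpp
#include <array>

struct AABB { std::array<double, 3> min, max; };
using Plane = std::array<double, 4>;   // a, b, c, d with ax+by+cz+d >= 0 meaning "inside"

// Returns true if the box is entirely outside at least one frustum plane
// (false means intersecting or inside, i.e. the box must be drawn).
bool aabbOutsideFrustum(const AABB& box, const std::array<Plane, 6>& planes)
{
    for (const Plane& p : planes) {
        std::array<double, 3> v;       // "positive vertex" for this plane
        for (int i = 0; i < 3; ++i)
            v[i] = p[i] >= 0.0 ? box.max[i] : box.min[i];
        if (p[0]*v[0] + p[1]*v[1] + p[2]*v[2] + p[3] < 0.0)
            return true;               // fully outside this plane
    }
    return false;
}
```

Note the test is conservative: a box outside the frustum but not fully behind any single plane is still reported as potentially visible, which is harmless for culling.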

> > ...Before I state it, let me admit that special-purpose engines (engines for one game, or one narrow type of game) may be able to adopt more efficiency measures. For example, in some situations "players won't usually notice" that shadows are not cast by objects outside the frustum, and "players won't usually notice" when objects never collide outside the primary frustum.
>
> Graphics optimization is a lot about cheating. You can, and have to, do the most "funny" things, as long as the player doesn't notice.
My opinion is... those days are numbered. Of course, in a few years they'll say that not performing full-bore ray tracing for every pixel is cheating, so in that sense you'll be correct for a long time.

> > #2: After all object changes are complete for each frame, the CPU performs transformations of "modified" objects (and their attached [articulatable] "child objects") from local-coordinates to world-coordinates. This updates the copy of the world-coordinates vertices in CPU memory. Then transfer just these modified vertices into the VBO that contains the object in GPU memory.
>
> A very common technique is to define an object as a mesh. You can then draw the same mesh in many places in the world, simply using different transformation matrices. This would not fit into the use of world coordinates for every vertex. When an object moves, you only have to update the model matrix instead of updating all vertices.
Yes, that is the case I addressed with my comments on instancing. There's more than one way to accomplish instancing, but newer GPUs and versions of the OpenGL API support instancing natively. The rotation, translation and other twist-spindle-mutilate that gets done to objects is usually a function of code in the shader plus a few global boundary variables (to specify, say, the bounds that all the blades of grass stay within).

> > #3: Make each batch contain up to 65536 vertices so the GPU need only access a 16-bit index to render each vertex. Objects with more than 65536 vertices can be rendered in a single batch with 32-bit indices, or divided into multiple batches with up to 65536 indices in each. Making batches 8192 elements or larger has substantial rendering speed advantages, while batches larger than 65536 vertices have minor speed benefits with current-generation GPUs (which contain roughly 512~1024 cores divided between vertex shaders and pixel shaders [and sometimes also geometry shaders, gpgpu/compute shaders, and others]).
>
> Sure about that? One advantage of indexed drawing is the automatic use of cached results from the vertex shader, and that will work just as well with 32-bit indices. I don't know for sure.
32-bit indices work fine. They're just a tad slower than 16-bit indices.

> > #4: Write super-efficient 64-bit SIMD/SSE/AVX/FMA3/FMA4/etc. functions in assembly language to multiply 4x4 matrices, and also functions to transform arrays of vertices [from local to world coordinates]. These routines are amazingly fast - only a dozen nanoseconds per vertex --- per CPU core. And I can make 4 to 8 cores transform vertices in parallel for truly amazing performance. This includes the memory-access overhead to load the local-coordinate vertices into CPU/SIMD registers, and the memory-access overhead to store world-coordinate vertices back into memory (two of them in fact, one with f64 elements and one with f32 elements for the GPU). And my vertices contain four f64x4 vectors to transform (position, zenith vector (normal), north vector (tangent), east vector (bi-tangent)).
>
> Isn't it better to let the GPU do this instead? I think a 4x4 matrix multiplication takes only one cycle? Hard to beat.
Oh no! A 4x4 matrix multiply, or transforming a vertex with 1 position and 3 vectors, takes much more than 1 cycle on the GPU (and GPU cycles are typically ~4 times slower than CPU cycles). Modern GPUs are not 4-wide any more. They found a few years ago that making each core 1-wide and increasing the number of GPU cores was a better tradeoff, especially as shaders got more sophisticated and did a lot of other work besides 4-wide operations. The fact is, one core of a 64-bit CPU can perform a matrix multiply or vertex transformation much faster than one GPU core. However, a GPU has about 512~1024 cores these days, while the CPU only has 8.

##### Share on other sites

> Oh no! A 4x4 matrix multiply, or transforming a vertex with 1 position and 3 vectors, takes much more than 1 cycle on the GPU (and GPU cycles are typically ~4 times slower than CPU cycles). Modern GPUs are not 4-wide any more. They found a few years ago that making each core 1-wide and increasing the number of GPU cores was a better tradeoff, especially as shaders got more sophisticated and did a lot of other work besides 4-wide operations. The fact is, one core of a 64-bit CPU can perform a matrix multiply or vertex transformation much faster than one GPU core. However, a GPU has about 512~1024 cores these days, while the CPU only has 8.

Measuring a single transformation in isolation is not going to give you a meaningful comparison. As you noted, GPUs are much more parallel than CPUs (that's not going to change for the foreseeable future), and they also have the advantage that this parallelism comes for free - there's no thread-sync overhead or other nastiness in your program. At the most simplistic level, an 8-core CPU would need to transform vertices 64 times faster (the actual number will be different but in a similar ballpark) than a 512-core GPU for this to be advantageous.

Another crucial factor is that performing these transforms on the GPU allows you to use static vertex buffers. Do them on the CPU and you need to stream data to the GPU every frame, which can quickly become a bottleneck. Yes, bandwidth is plentiful and fast, but it certainly doesn't come for free - keeping as much data as possible 100% static for the lifetime of your program is still a win. Not to mention the reduction in code complexity in your program, which I think we can all agree is a good thing. Edited by mhagain

##### Share on other sites
Here's my take, on a fairly high level, on a generic rendering mechanism. It may however be more like last-gen than next-gen. Some of it is implemented in Urho3D. It only considers separate objects; static world geometry does not get special handling. No automatic data reorganization for improved batching is done.

- Update animation for scene objects such as animated meshes and particle emitters. Use multiple cores for this. For each animated object, check the previous frame's visibility result and skip the update, or reduce its frequency, if the object was invisible. This may still cause objects' bounds to change, meaning their position in the spatial graph has to be re-evaluated.

- Make sure the spatial graph (I use an octree) is up to date. Reinsert objects that have moved or have changed their bounds since last frame. This I don't know how to multi-thread safely.

- Query the spatial graph for objects and lights in the main view frustum. If occlusion is to be utilized, query occluders first (for me, there's a low-resolution CPU depth-only rasterizer in use). Detect possible sub-views, such as shadow map frustums or camera monitor views. Process sub-views and occlusion testing on multiple cores. Aggressively optimize away unnecessary shadow casters by checking whether their extruded bounds (extruded in the light's perspective) actually cross the main view frustum.

- Construct rendering operations from the visible objects. Automatically group operations for instancing when possible (same geometry/same shader/same material/same light). NB: I use either deferred rendering, or primitive multipass forward rendering, so the grouping is relatively easy. For a multi-light single-pass forward renderer I imagine it is harder. Instancing objects with too many vertices can actually be detrimental so a threshold value can be used.

- Sort rendering operations. Speed this up using multiple cores. For opaque geometry, sort both by state and by distance for front-to-back ordering. The method I use is to check the closest geometry rendered with each unique renderstate, and use that info to determine the order of renderstates. This needs experimental testing to determine whether, for example, minimizing shader changes is more important than stricter front-to-back order.

- Render. Optimize where possible, such as using the stencil buffer & scissor to optimize light influences, render shadow maps first to avoid a lot of render target changes, and skip unnecessary buffer clears.

Btw. back when I did the 10000-cube experiment on various rendering engines, I found that Unity rewrites a dynamic vertex buffer each frame for small batched objects (apparently it pre-transforms them on the CPU), so it seems they found that optimal at least for some cases. However, I found hardware instancing of small meshes to perform much better, so I believe they do that to optimize for old API versions *without* hardware instancing support, and simply chose not to implement a separate hardware-instancing path, at least for the time being. Edited by AgentC

##### Share on other sites

- Sort rendering operations. Speed this up using multiple cores. For opaque geometry, sort both by state and distance for front-to-back ordering. The method I use is to check the closest geometry rendered with each unique renderstate, and use that info to determine the order of renderstates. This needs experimental testing to determine whether for example minimizing shader changes is more important than more strict front-to-back order.

Thanks for a good summary of optimization techniques; it looks like you really know what you are talking about.

I am now trying to improve my application's optimization for state changes. Is there some easy rule of thumb for which state changes are important to optimize? In the worst case, I suppose a state change will stall the pipeline.

##### Share on other sites
I am now trying to improve my application optimization for state changes. Is there some easy rule of thumb what states changes are important to optimize for? In worst case, I suppose a state change will delay the pipeline.

In general, the CPU/GPU communicate by sharing a ring-buffer in RAM. The CPU writes command packets into this buffer, and the GPU reads these command packets out of the buffer (much like sending packets over TCP, etc...).
[font=courier new,courier,monospace]
                  || CPU write cursor
                  \/
[...][set texture][set stream][draw primitives][...]
                  /\
                  || GPU read cursor
[/font]
Graphics APIs like D3D/GL hide these details from the user, but as an approximation, you can think of every D3D call which starts with "Set" or "Draw" as being a fairly simple function that writes command packets into a queue... However, it can be a bit more tricky, because D3D/GL might be lazy when it comes to sending out packets -- it's possible, for example, that when I call [font=courier new,courier,monospace]SetStreamSource[/font], the 'SetStream' packet isn't immediately written; instead, they might set a "stream dirty" flag to true, and then later, when I call [font=courier new,courier,monospace]DrawIndexedPrimitive[/font], it checks that the flag is true and sends out both the 'SetStreamSource' command and the 'DrawIndexedPrimitive' command at the same time.
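The lazy "dirty flag" behaviour described above can be modelled in a few lines of C. In this toy sketch a counter stands in for the real ring buffer, and the function names merely echo the D3D calls mentioned; nothing here is real driver code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of lazy command-packet flushing: Set* calls only mark
 * state dirty; the Draw call writes the pending packet(s) plus the
 * draw packet itself. 'packets_written' stands in for the ring buffer. */
typedef struct {
    uint32_t stream;
    bool     stream_dirty;
    int      packets_written;
} cmd_queue;

static void set_stream_source(cmd_queue *q, uint32_t stream)
{
    q->stream = stream;
    q->stream_dirty = true;      /* no packet written yet */
}

static void draw_indexed_primitive(cmd_queue *q)
{
    if (q->stream_dirty) {
        q->packets_written++;    /* flush the pending 'SetStream' packet */
        q->stream_dirty = false;
    }
    q->packets_written++;        /* the draw packet itself */
}
```

Note how two redundant Set calls before a Draw collapse into a single packet, which is why measuring the Set call alone can be misleading.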

Whenever you're reducing the number of D3D/GL commands or re-sorting their order, you're changing the optimisation of both the CPU and the GPU -- you're affecting the way in which D3D/GL actually writes out its command buffer, and you're changing the command stream that the GPU will execute.

Some optimisation strategies will make only one of these processors run faster and will actually slow down the other, and this can be valid -- e.g. if your CPU-side code is very fast, then you might choose to perform some GPU optimisations which have the effect of adding 1ms to your CPU execution times. And in other situations, you might do the opposite.

That said, the CPU-side cost of a state change is usually fairly predictable, and fairly constant. You can easily measure these costs by measuring the amount of time spent inside each D3D/GL call. Just note that if the calls seem very fast, make sure that you're correlating them with the draw call which follows them, and to be extra safe, issue a command-buffer flush after each draw call so you know that the command packets have actually been constructed and sent.

On the GPU-side, the costs vary by extreme amounts depending on (1) what you are rendering and (2) the actual GPU hardware and drivers.
Advice which makes Game A run twice as fast may make Game B run twice as slow.

For example, let's say that when a 700MHz GPU switches shader programs, it requires a 100,000 cycle delay (100k@700MHz = 0.14ms). This means that every time you switch the shader program state, you're adding 0.14ms to your frame-time.
However, now let's also say that this GPU is pipelined in a way where it can be shading pixels at the same time that it performs this 100000-cycle state-change. Now, as long as the GPU has over 0.14ms of pixel-shading work queued up (which depends on the complexity of the pixel shader, the amount/size of triangles drawn, etc...) when it reaches the change-shaders command, then there will be no delay at all! If it's only got 0.04ms of pixel-shading work queued up when it reaches this command, then there will be a 0.1ms delay (pipeline bubble).
So you can see that even if you know the exact cost of a piece of GPU work, the actual time along your critical path depends greatly on all of the surrounding pieces of GPU work as well (in this case, the total waste from switching shaders depends on the pixel operations already being worked on, which depend on the vertex operations, etc.).
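The arithmetic in that example boils down to a single expression: the visible stall is whatever part of the state-change cost is not hidden by already-queued work. As a sketch (the numbers mirror the hypothetical 700MHz GPU above, not any real hardware):

```c
/* Visible pipeline bubble from a state change: the portion of the
 * switch cost not overlapped with work already queued behind it. */
static double bubble_ms(double switch_cost_ms, double queued_work_ms)
{
    double d = switch_cost_ms - queued_work_ms;
    return d > 0.0 ? d : 0.0;
}
```

With 0.14ms of switch cost, 0.14ms of queued shading hides the change entirely, while 0.04ms queued leaves a 0.1ms bubble, as in the example.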

You could have 10 items of work, which all take 1ms each when run in isolation... but when you run them all together, they only take 3ms, thanks to pipelining... So simply attaching a 'cost' to these items doesn't mean anything. In this case, each 'item' takes 1ms OR 0.3ms (or maybe your item cost 0ms and another item cost 3ms) -- and you can't tell which, without knowing what else is being processed at the same time as your item, or by using a good profiling tool.

If you want to optimise your GPU usage, you've really got to get a good GPU profiler which can tell you what the hardware is doing internally, where any stalls are in its pipeline, which features are used at capacity, which operations form the bottleneck, etc...

Is there some easy rule of thumb what states changes are important to optimize for?
http://developer.dow...g_Guide_G80.pdf
http://origin-develo...hBatchBatch.pdf
http://developer.amd...mming_guide.pdf Edited by Hodgman

##### Share on other sites

[quote name='maxgpgpu' timestamp='1339445552' post='4948286']
Oh no! A 4x4 matrix multiply, or transforming a vertex with 1 position and 3 vectors takes much more than 1 cycle on the GPU (and GPU cycles are typically ~4 times slower than CPU cycles). Modern GPUs are not 4-wide any more. They found a few years ago that making each core 1-wide and increasing the number of GPU cores was a better tradeoff, especially as shaders got more sophisticated and did a lot of other work besides 4-wide operation. The fact is, one-core in a 64-bit CPU can perform a matrix-multiply or vertex transformation much faster than a GPU. However, a GPU has about 512~1024 cores these days, while the CPU only has 8.

Measuring a single transformation in isolation is not going to give you a meaningful comparison. As you noted, GPUs are much more parallel than CPUs (that's not going to change for the foreseeable), and they also have the advantage that this parallelism comes for free - there's no thread sync overhead or other nastiness in your program. At the most simplistic level, an 8-core CPU would need to transform vertexes 64 times faster (the actual number will be different but in a similar ballpark) than a 512-core GPU for this to be advantageous.

Another crucial factor is that performing these transforms on the GPU allows you to use static vertex buffers. Do them on the CPU and you're needing to stream data to the GPU every frame, which can fast become a bottleneck. Yes, bandwidth is plentiful and fast, but it certainly doesn't come for free - keeping as much data as possible 100% static for the lifetime of your program is still a win. Not to mention a gain in terms of reduction of code complexity in your program, which I think we can all agree is a good thing.
[/quote]

From articles by nvidia, I infer a single core of a fast CPU is about 8x faster than a GPU core, not 64x. That's why I believe we all agree that our 3D engines should perform on the GPU all operations that do not carry some unfortunate (and possibly inefficient) consequences. A perfect example of an unfortunate/inefficient consequence of doing everything the conventional way is the problem of collision-detection and collision-response. This is not an issue in trivial games, or games that just don't need to detect or respond to collisions in any significant or detailed way. But when any aspect of collision-detection OR collision-response needs to be handled on the CPU, the "simple way" or "conventional way" of keeping object local-coordinates in VBOs in GPU memory, and having the GPU transform those objects directly to screen-coordinates with one matrix, has serious problems.

First, collision-detection must be performed with all objects in the same coordinate-system... which for many practical reasons is world-coordinates. Collision-response has the same requirement: all objects must be available in the same [non-rotating, non-accelerating] coordinate-system, and world-coordinates is again most convenient. So the "solution" proposed by advocates of the "conventional way" is to make the GPU perform two matrix transformations --- first transform each vertex to world-coordinates and write the vertex (or at least the world-coordinate position, and possibly the transformed surface vectors (normal, tangent, bitangent)) back into CPU memory, then transform from world-coordinates to screen-coordinates in a separate transformation.

So now we do have a transfer of all vertices between CPU and GPU... only this time, the transfer is from GPU to CPU. The CPU must wait for this information to be returned before it starts processing it, and of course this must be synchronized (not terribly difficult).

I'm not sure what you mean by "keeping the data as static as possible". In my scheme, where the CPU always holds every vertex in both local-coordinates and world-coordinates, and the GPU holds every vertex in world-coordinates --- only those vertices that changed (usually by rotate or translate) on a given frame are transferred to the GPU. Isn't that also "keeping data as static as possible"? Of course, it is true that in the "conventional way" the vertices in GPU memory are almost never changed --- which is the ultimate in "static".

However, that approach inherently requires that the transformation matrix of every object be updated every frame by the CPU and sent to the GPU. While transformation matrices are smaller than the vertices for most objects, this aspect of the "conventional way" inherently forces another inefficiency - the necessity to call glDrawElements() or glDrawRangeElements() or a similar function for every object. In other words, "small batches" (except for huge objects with boatloads of vertices, where one object in and of itself is a large batch).

Furthermore, in the "conventional way", since a new glDrawElements() call is inherently required for every object, we might as well have boatloads of shaders, one for every conceivable type of rendering, and [potentially] change shaders for every object... or at least "often". From what I can tell, changing shaders has substantial overhead in the GPU driver [and maybe in the GPU itself], so this is an often-ignored gotcha of the "conventional way". Also important is that different shaders often expect different uniform buffers with different content in different layouts. So this information and uniform buffer data must also be sent to the GPU driver and GPU, potentially as often as every object (if rendering order is sorted by anything other than "which shader").

I very much agree that keeping code complexity low is important. However, from my experience, my approach is simpler than the "conventional way". Every object looks the same, has the same elements, is transformed to world-coordinates and sent to GPU when modified (but only when modified), exists in memory in world-coordinates for the convenience of collision-detection and collision-response computations, and whole batches of objects are rendered with a single glDrawElements(). Are there exceptions? Yes, though not many, and I resist them. Examples of exceptions I cannot avoid are rendering points and lines (each gets put into its own batch). My vertices contain a few fields that tell the pixel shader how to render each vertex (which texture and normalmap, and bits to specify rendering type (color-only, texture-only, color-modifies-texture, enable-bumpmapping, etc)). So if anything, I think my proposed way (implemented in my engine) is simpler. And all things considered, for a general-purpose engine, I suspect it is faster too.

##### Share on other sites

[quote name='AgentC' timestamp='1339537035' post='4948637']
- Sort rendering operations. Speed this up using multiple cores. For opaque geometry, sort both by state and distance for front-to-back ordering. The method I use is to check the closest geometry rendered with each unique renderstate, and use that info to determine the order of renderstates. This needs experimental testing to determine whether for example minimizing shader changes is more important than more strict front-to-back order.

Thanks for a good summary of optimization technologies, it looks like you really know are talking about.

I am now trying to improve my application optimization for state changes. Is there some easy rule of thumb what states changes are important to optimize for? In worst case, I suppose a state change will delay the pipeline.
[/quote]
For what it's worth, I took the ultimate "brain-dead" approach to address this issue.

I have a "glstate" structure [for each rendering context] that contains every OpenGL state variable. Then, when my code needs to set any OpenGL state, it first checks whether the value it needs is already set, and if so, doesn't bother. Obviously the code for this approach is "braindead simple" (which means no errors due to complexity), is the same for every situation and every OpenGL state variable, and is quite efficient and readable (a test and conditional assignment on a single line of C code).

And yes, in a large percentage of cases the "new state" == "old state", and GPU-driver and GPU overhead is avoided. Edited by maxgpgpu
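A minimal sketch of the redundant-state filter described above, with a counter standing in for the real glEnable/glDisable call (since no GL context is involved here, and the structure shown is illustrative rather than the poster's actual code):

```c
#include <stdbool.h>

/* A one-variable version of the "glstate" cache: remember the last
 * value set, and skip the driver call when nothing changed.
 * 'driver_calls' stands in for the real glEnable/glDisable. */
typedef struct {
    bool depth_test;     /* cached GL_DEPTH_TEST state */
    int  driver_calls;   /* how many calls actually reached the driver */
} gl_state_cache;

static void set_depth_test(gl_state_cache *s, bool enable)
{
    if (s->depth_test == enable)
        return;                  /* redundant: skip the driver call */
    s->depth_test = enable;
    s->driver_calls++;           /* would be glEnable/glDisable here */
}
```

One caveat of this pattern: the cached values must start out matching the context's actual default state, or the first "redundant" skip will be wrong.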

##### Share on other sites

I have a "glstate" structure [for each rendering context] that contains every OpenGL state variable. Then, when my code needs to set any OpenGL state, it first checks whether the value it needs is already set, and if so, doesn't bother. Obviously the code for this approach is "braindead simple" (which means, no errors due to complexity), is the same for every situtation and every OpenGL state variable, and quite efficient and readable (test and conditional assign on a single line of C code).

It will work, and be efficient CPU wise, but I don't think this is good enough. You need to sort the drawing operations for some state changes, or you can get pipeline delays.

I think a state change from glEnable or glDisable that sets the same state again will have a low cost on the CPU, and zero cost on the GPU. If so, there is no need for a glState cache to optimize this. Maybe someone can confirm.

First, collision-detection must be performed with all objects in the same coordinate-system...

I don't see what collision detection has to do with graphics and drawing. It is physical modelling. Maybe it is possible to use the GPU to compute this, but I think it is more common to use a quadtree on the CPU. Edited by larspensjo

##### Share on other sites
I'm not sure what you mean by "keeping the data as static as possible". In my scheme, where the CPU always holds every vertex in both local-coordinates and world-coordinates, and the GPU holds every vertex in world-coordinates ... vertices that [move] on a given frame are transferred to the GPU. Isn't that also "keeping data as static as possible"? Of course, it is true that in the "conventional way" the vertices in GPU memory are almost never changed --- which is the ultimate in "static".
However, that approach inherently requires that the transformation matrix of every object be updated every frame by the CPU and sent to the GPU. While transformation matrices are smaller than the vertices for most objects....[/quote]
Transforms that haven't changed don't need to be re-uploaded to the GPU, so except in the simplest scenes, the overhead of computing and uploading the changed transforms is sure to be far less than that of computing and uploading the changed vertices.
For example, if we scale this up to a modern scene with 100 characters, each with 1 million vertices, animated by 100 bones, we could implement it several ways:
Case A) CPU vertex transforms
CPU load = 10000 bone transforms, 100 million vertex transforms and uploads.
GPU load = 100 million vertex transforms.
Case B) GPU skinning
GPU load = 100 million vertex transforms.
Case C) GPU animation and skinning
GPU load = 10000 bone transforms, 100 million vertex transforms.
There's no general best solution. If you wanted to get the theoretical best performance, you'd probably end up using all of these approaches in different places.

However, a big difference in the above is the memory overhead -- in the GPU skinned versions, there's only a single (static) copy of the character vertex data required, in GPU memory -- but in the CPU case, the CPU has 200 copies of the data (untransformed and transformed, for each player), and the GPU/driver require another n-copies of the total data where n is number of frames that your driver buffers commands for.

Also, if you're sending world-space vertices to the GPU, then you'll lose too much precision in 16-bit vertex formats, so your vertex buffers will all have to double in size to use full 32-bit floats.
this aspect of the "conventional way" inherently forces another inefficiency - the necessity to call glDrawElements() or glDrawRangeElements() or a similar function for every object. In other words, "small batches".[/quote]
Batches are a CPU overhead (the GPU forms its own idea of 'batches' from the work/stall points in its pipeline, independently of draw calls), so if you put the same optimisation effort into your systems that call [font=courier new,courier,monospace]glDraw[/font]*, you won't be troubled by them until you've got similar numbers of objects that are also going to choke your CPU-vertex-transform routines.
Also, there is always a middle ground -- e.g. my last game always drew 4 shadow-casters per batch, with 4 different light transforms.
Also important is that different shaders often expect different uniform buffers with different content in different layouts. So this information and uniform buffer data must also be sent to the GPU driver and GPU, potentially as often as every object (if rendering order is sorted by anything other than "which shader").[/quote]
Focus on some algorithmic optimisation in your CPU GL code, and the details behind uniform buffers won't impact you greatly - it still boils down to this: only data that the CPU needs to change each frame needs to be re-uploaded each frame.
#3: Since we have accurate world-coordinates in CPU memory for every object every frame, we can perform accurate collision-detection [and collision-response if desired] on every object, every frame.[/quote]
Do you really use triangle-vs-triangle collision detection routines for the majority of your objects? There's usually a good reason that dynamic (especially animated) objects are represented by something other than triangle soups in collision engines. I guess that's useful if you need accuracy, though; just be aware of the costs.

I think a state change from glEnable or glDisable that change to the same state will have a low cost on CPU, and zero cost on the GPU. If so, there is no need for a glState to optimize this.
You don't want to waste time with redundant state-changes and there are very efficient ways to avoid them, just make sure the optimisation isn't more expensive than the original redundancy ;). Usually sorting as well as this kind of filtering would be used.
To quote TomF:
3. On a platform like the PC, you often have no idea what sort of card the user is running on. Even if you ID'd the card, there are ten or twenty possible graphics-card architectures, and each has a succession of different drivers. Which one do you optimise for? Do you try to make the edge-traversal function change according to the card installed? That sounds expensive. Remember that most games are limited by the CPU, not the GPU, and you've just added to the asymmetry of that load.[/quote] Edited June 14, 2012 by Hodgman

##### Share on other sites

[quote name='maxgpgpu' timestamp='1339654045' post='4949057']
I have a "glstate" structure [for each rendering context] that contains every OpenGL state variable. Then, when my code needs to set any OpenGL state, it first checks whether the value it needs is already set, and if so, doesn't bother. Obviously the code for this approach is "braindead simple" (which means, no errors due to complexity), is the same for every situtation and every OpenGL state variable, and quite efficient and readable (test and conditional assign on a single line of C code).

It will work, and be efficient CPU wise, but I don't think this is good enough. You need to sort the drawing operations for some state changes, or you can get pipeline delays.

I think a state change from glEnable or glDisable that change to the same state will have a low cost on CPU, and zero cost on the GPU. If so, there is no need for a glState to optimize this. Maybe someone can confirm this.[/quote]
Oh, absolutely no question we also need to perform certain kinds of sorts, but that is a different issue. Of course in my extreme case, there is rarely any need to change OpenGL state between batches... and my batches are huge. Of course there is a need to change state between a shadow-generating pass and later rendering passes, but those are very coarse-grain and therefore not very significant.

When the general approach is to perform a separate draw call for every object, the need for many state-changes arises because there's a temptation to have many different rendering styles and even multiple shaders.

First, collision-detection must be performed with all objects in the same coordinate-system...

I don't see what collision detection has to do with graphics and drawing. It is a physical modelling. Maybe it is possible to use the GPU to compute this, but I think it is more common to use a CPU local quadtree.
[/quote]
Really? Hahahaha!

Why do you think the "conventional way" is to put the local-coordinates of all vertices of all objects into the GPU, and then have the GPU transform from local-coordinates to screen-coordinates? Two answers: So the program only needs to transfer object vertices to GPU memory once (and certainly not every frame), and so the CPU doesn't need to transform object vertices from local-coordinates to world-coordinates (or some other coordinate-system), because the GPU does that. Of course the CPU does compute the transformation matrix and transfer that to the GPU for every object before it renders the object.

And THAT is what collision-detection and collision-response have to do with this "conventional way". Why? Because the CPU cannot perform collision-detection OR collision-response unless it has access to all object vertices in world-coordinates (or some other single, consistent, non-rotating, non-accelerating coordinate system).

But wait! The whole freaking POINT of the "conventional way" is to eliminate the need for the CPU to transform vertices OR transfer vertices between CPU and GPU memory. But collision-detection and collision-response REQUIRE all those vertices in world-coordinates in CPU memory... every single frame. Which means:

#1: So much for the CPU not needing to transform vertices.
#2: So much for the CPU not needing to transfer vertices between CPU and GPU.

Now, to be fair, you can arrange things so only one of the above needs to be done. You can keep the original scheme, where the GPU holds only local coordinates, and the CPU computes TWO (instead of one) transformation matrices for each object, and transfers that to the GPU before it renders each object. However, this requires the GPU compute world-coordinates of all vertices (the first transformation matrix supplied by the CPU), and then transfer those world-coordinate vertices back into CPU memory --- so the CPU has world-coordinates of all objects to perform collision-detection and collision-response with. Note that this scheme still requires a separate batch for every object. Furthermore, unless you switch shaders [potentially] every object, the GPU transfers the world-coordinates of every single vertex back to CPU memory, because it has no freaking idea which objects rotated or moved. In contrast, in my scheme the CPU only transfers vertices of objects that rotated or moved, because it doesn't even bother to process objects that didn't move. As a consequence, there is no need for the CPU application to make any distinction between objects --- no separate shaders, no extra transfers between CPU and GPU, etc.

This is [some of] what "collision-detection" has to do with "graphics and drawing".

##### Share on other sites

[quote name='maxgpgpu' timestamp='1339653554' post='4949056']I'm not sure what you mean by "keeping the data as static as possible". In my scheme, where the CPU always holds every vertex in both local-coordinates and world-coordinates, and the GPU holds every vertex in world-coordinates ... vertices that [move] on a given frame are transferred to the GPU. Isn't that also "keeping data as static as possible"? Of course, it is true that in the "conventional way" the vertices in GPU memory are almost never changed --- which is the ultimate in "static".

Good one! :-) But as we see when we continue, that's only one consideration....

However, that approach inherently requires that the transformation matrix of every object be updated every frame by the CPU and send to the GPU. While transformation matrices are smaller than the vertices for most objects....[/quote]

Transforms that haven't changed don't need to be re-uploaded to the GPU, so except in the simplest scenes, the overhead of computing and uploading the changed transforms is sure to be far less than that of computing and uploading the changed vertices.[/quote]

You may be right, but your premise is wrong, at least in any scenario my application ever faces. In my application, and I would guess most other applications, the camera moves every frame. Therefore, in the "conventional way", every transformation matrix for every object must change every frame. This is also true in my scheme, except there is only one transformation matrix for all objects, since GPU memory contains world-coordinates for all objects (so to the GPU, literally all objects are one object in this respect). Furthermore, you totally ignore the very real, practical issue that I keep raising! The CPU needs world-coordinates to perform collision-detection and collision-response, and advocates of the "conventional way" always, always, always ignore this, pretend the problem doesn't exist, ignore people like me when we point it out, etc. I'm sorry, but a real general-purpose engine must perform collision-detection and collision-response. So at the least, you must include the overhead of sending two matrices to the GPU, and the overhead of the GPU writing world-coordinate vertices back to CPU memory... for every object, every frame! Now we return control to your normal channel! :-)

For example, if we scale this up to a modern scene with 100 characters, each with 1 million vertices, animated by 100 bones, we could implement it several ways:[/quote]

Okay, I must interrupt with an objection here, but then we'll return to your thoughts. In most games and simulations, most objects contain far fewer vertices, and most objects do not move (a wall, a chair, a table, a dish, a fork, a baseball, a bat, a blade of grass, etc). Then a small minority of objects are large, or in your scenario - huge!

I must also admit that I tend to think in terms of small details and their self-shadows implemented with a parallax-mapping technique like "relaxed cone-step mapping", rather than 27-zillion tiny little triangles. However, I admit this is probably a matter of choice, with both approaches having advantages and disadvantages in different situations. One thing I'm sure we can both agree upon is that there's no need to perform collision-detection or collision-response on fine details like the wrinkles in the face of an old person. But a naive approach that implements these features with zillions of micro-triangles and then tries to perform collision-detection and collision-response upon them is certainly doomed --- whether vertices are transformed by CPU or GPU.

Case A) CPU vertex transforms
CPU load = 10000 bone transforms, 100 million vertex transforms and uploads.
GPU load = 100 million vertex transforms.
Case B) GPU skinning
GPU load = 100 million vertex transforms.
Case C) GPU animation and skinning
GPU load = 10000 bone transforms, 100 million vertex transforms.
There's no general best solution. If you wanted to get the theoretical best performance, you'd probably end up using all of these approaches in different places.

However, a big difference in the above is the memory overhead -- in the GPU skinned versions, there's only a single (static) copy of the character vertex data required, in GPU memory -- but in the CPU case, the CPU has 200 copies of the data (untransformed and transformed, for each player), and the GPU/driver require another n-copies of the total data where n is number of frames that your driver buffers commands for.[/quote]
That's one place you're wrong. If we're gonna perform collision-detection, then you need ALL vertices in world-coordinates, and probably you need them in CPU memory. So you not only have all that storage, you also need to have the GPU transfer all those world-coordinates back to CPU memory so the CPU can perform the collision-detection. However, let's be fair to you, even though it was you who created these extreme cases (100 super-detailed bending, twisting, moving, flexible, organic characters). In any such scenario, whether processed on CPU or GPU or cooperatively, we would certainly [manually or automatically] select a subset of character vertices to represent each character for purposes of collision-detection and collision-response.

Also, if you're sending world-space vertices to the GPU, then you'll lose too much precision in 16-bit vertex formats, so your vertex buffers will all have to double in size to use full 32-bit floats.[/quote]
Yeah, I've never even considered 16-bit floats for coordinates! Hell, I dislike 32-bit coordinates for coordinates! Hell, even 64-bit coordinates make me nervous! Seriously! In one of my applications, I need to have thousands of entirely separate "world coordinate systems", one for each solar system, because the range of 64-bit floating-point is nowhere near sufficient to contain galactic scale as well as tiny details. Fortunately we can't see details within other solar systems (just the point star, and limited detail with high-power telescopes), so I am able to avoid the inherent mess that multiple world coordinate systems implies in a rigorous application of this scheme.

this aspect of the "conventional way" inherently forces another inefficiency - the necessity to call glDrawElements() or glDrawElementsRange() or similar function for every object. In other words, "small batches".[/quote]

Batches are a CPU overhead (the GPU forms its own idea of 'batches' from the work/stall points in its pipeline independently from draw calls), so if you put the same optimisation effort into your systems that call glDraw*, you won't be troubled by them until you've got similar numbers of objects that are also going to choke your CPU-vertex-transform routines.
Also, there is always a middle ground -- e.g. my last game always drew 4 shadow-casters per batch, with 4 different light transforms.[/quote]

Well, first of all, I must admit that as much as I try, I am not confident I understand what is most efficient for OpenGL on nvidia GPUs (much less other GPUs on OpenGL or D3D). However, I suspect you oversimplified. If I understand this correctly, if you make no state changes between batches, then batch overhead in the newest GPUs can be quite modest (depending on batch size, etc). In fact, maybe the driver is smart enough to merge those batches into one batch. I also have the impression that a limited number of simple state changes between batches won't cause terrible overhead. Having said this, however, any change that makes any difference to the results the GPU cores produce must inherently require that the GPU completely finish all processing of the previous batch (both vertex and pixel shaders). This necessarily involves at least a modest loss of efficiency over continuous processing. And when something like a shader needs to be changed, even more performance loss is unavoidable. True, top-end GPUs today aren't nearly as bad as previous generations, since (I infer) they can hold multiple shaders in the GPU, so the process of switching between them isn't horrific any more, even including changes to the uniform buffer objects required to support the new shader. But it is still significant compared to continuous processing.

I also note the following. My engine makes a distinction between "huge moving objects" and all others (non-moving objects and small-to-modest-size moving objects). For practical purposes (and in fact in my engine), a huge moving object has so many vertices that it is a batch, all by itself. So even my engine takes a middle road in this case. It leaves local-coordinates in the GPU and sets the appropriate transformation matrix to the GPU before it renders that object. However, at this point anyway, the CPU also transforms local-coordinates to world-coordinates itself, rather than have the GPU perform two transformations and feed back the intermediate world-coordinates of these objects to CPU memory for further [collision] processing. Of course, if the object has craploads of tiny details implemented as microtriangles, then the CPU only transforms that subset that reasonably describes the boundary of the object.

Also important is that different shaders often expect different uniform buffers with different content in different layouts. So this information and uniform buffer data also must be sent to GPU-driver and GPU too, potentially as often as every object (if rendering order is sorted by anything other than "which shader").[/quote]
Focus on some algorithmic optimisation in your CPU GL code, and the details behind uniform-buffers won't impact you greatly - it still boils down to this: only data that the CPU changes each frame needs to be re-uploaded each frame.

#3: Since we have accurate world-coordinates in CPU memory for every object every frame, we can perform accurate collision-detection [and collision-response if desired] on every object, every frame.[/quote]
Do you really use triangle vs triangle collision detection routines for the majority of your objects? There's usually a good reason that dynamic (especially animated) objects are usually represented by something other than triangle-soups in collision engines. I guess that's useful if you need accuracy though, just be aware of the costs.[/quote]

The usual approach of my engine is this (on most objects):

#1: broad phase with insertion-sorted overlap testing of object AABBs in world-coordinates.

#2: narrow phase with highly optimized GJK "convex-hull" technique (whether object is convex or not).

#3: if #2 detects collision OR object is marked as "arbitrary irregular/concave shape" then triangle-triangle intersection.

The application running on the engine can globally or per-object specify that step #1 alone is adequate, or step #1 + #2 is adequate. Otherwise all 3 steps are executed. I put a huge effort into making these routines super-efficient, but as you clearly understand, even the most brilliant implementation of triangle-triangle-soup intersection is relatively slow. Depending on the objects, mine is 2 to 20 times slower than the GJK implementation of #2, but of course has the huge advantage of working on completely arbitrary, irregular "concave" objects. However, I have not yet implemented efficiencies possible by #3 inspecting and optimizing itself on the basis of results saved by #2 (deepest penetration locations). So #3 should speed up somewhat~to~considerably.... someday. Like I don't have enough other work to do, right?
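To make step #1 concrete, a broad phase of this kind is often written as a sweep-and-prune: keep the objects insertion-sorted by their AABB minimum on one axis (near-linear work on a nearly-sorted list), then only test pairs whose intervals actually overlap on that axis. A minimal sketch, with hypothetical names (`Aabb`, `broadPhasePairs`), not the engine's actual code:

```cpp
#include <vector>
#include <utility>

struct Aabb {
    float min[3];
    float max[3];
};

// Full 3-axis overlap test, used only on candidate pairs.
static bool overlaps(const Aabb& a, const Aabb& b) {
    for (int axis = 0; axis < 3; ++axis)
        if (a.max[axis] < b.min[axis] || b.max[axis] < a.min[axis])
            return false;
    return true;
}

// Phase #1: insertion-sort indices by min.x (cheap frame-to-frame since
// the order barely changes), then sweep: once box j starts past box i's
// max.x, no later box can overlap i on x, so the inner loop breaks early.
std::vector<std::pair<int,int>> broadPhasePairs(std::vector<Aabb>& boxes,
                                                std::vector<int>& order) {
    for (std::size_t k = 1; k < order.size(); ++k) {   // insertion sort
        int idx = order[k];
        std::size_t j = k;
        while (j > 0 && boxes[order[j-1]].min[0] > boxes[idx].min[0]) {
            order[j] = order[j-1];
            --j;
        }
        order[j] = idx;
    }
    std::vector<std::pair<int,int>> pairs;
    for (std::size_t i = 0; i < order.size(); ++i) {
        for (std::size_t j = i + 1; j < order.size(); ++j) {
            if (boxes[order[j]].min[0] > boxes[order[i]].max[0])
                break;                                 // sweep cutoff
            if (overlaps(boxes[order[i]], boxes[order[j]]))
                pairs.emplace_back(order[i], order[j]);
        }
    }
    return pairs;
}
```

Only the pairs this returns ever reach the GJK narrow phase of step #2.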

PS: Just for fun (or to make myself look stupid) I'll mention the following example of going to [absurd] extremes to make certain parts of my engine efficient. The transformation of each vertex from local-coordinates to world-coordinates by the CPU requires only 3 SIMD/AVX machine-language instructions per vertex to update the object AABB. That's less than 1 nanosecond per vertex. :-) Yeah, yeah, I know... some modern scenes have a crapload of vertices.
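A portable scalar sketch of that transform-and-bound pass (in an AVX build, the min/max fold below maps to packed min/max instructions on several vertices at once, which is where the "3 instructions per vertex" figure comes from; all names here are my own illustration):

```cpp
#include <algorithm>
#include <cfloat>

struct Vec3 { float x, y, z; };

// Row-major 3x4 local-to-world transform (rotation + translation).
struct Mat3x4 { float m[3][4]; };

static Vec3 transformPoint(const Mat3x4& t, const Vec3& p) {
    return {
        t.m[0][0]*p.x + t.m[0][1]*p.y + t.m[0][2]*p.z + t.m[0][3],
        t.m[1][0]*p.x + t.m[1][1]*p.y + t.m[1][2]*p.z + t.m[1][3],
        t.m[2][0]*p.x + t.m[2][1]*p.y + t.m[2][2]*p.z + t.m[2][3],
    };
}

// Transform every vertex to world coordinates and fold the result into
// the object AABB in the same pass, so bounding comes almost for free.
void transformAndBound(const Mat3x4& t,
                       const Vec3* local, Vec3* world, int count,
                       Vec3& aabbMin, Vec3& aabbMax) {
    aabbMin = {  FLT_MAX,  FLT_MAX,  FLT_MAX };
    aabbMax = { -FLT_MAX, -FLT_MAX, -FLT_MAX };
    for (int i = 0; i < count; ++i) {
        world[i] = transformPoint(t, local[i]);
        aabbMin.x = std::min(aabbMin.x, world[i].x);
        aabbMin.y = std::min(aabbMin.y, world[i].y);
        aabbMin.z = std::min(aabbMin.z, world[i].z);
        aabbMax.x = std::max(aabbMax.x, world[i].x);
        aabbMax.y = std::max(aabbMax.y, world[i].y);
        aabbMax.z = std::max(aabbMax.z, world[i].z);
    }
}
```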

But just to say it again... if we're talking about collisions of million-vertex objects like in your example, clearly we have no reasonable choice but to specify or automatically compute a vertex-subset of every "huge" object to "reasonably" mimic the outer surface of the object. Otherwise, we need to speed up CPUs 100x or so! :-o

I think a glEnable or glDisable call that sets a state to its current value will have a low cost on the CPU, and zero cost on the GPU. If so, there is no need for a glState wrapper to optimize this.

For that kind of state-change, I suspect you're correct. To cover my ass from "what I don't know, and what they don't tell us" cases, I decided to perform the same test and action on every OpenGL state, then "not worry about it".

You don't want to waste time with redundant state-changes and there are very efficient ways to avoid them, just make sure the optimisation isn't more expensive than the original redundancy ;). Usually sorting as well as this kind of filtering would be used.[/quote]

The one very nasty fact that people rarely mention is... we can only perform this global sort in one way (on one basis). So, for example, in my engine I prefer to global sort on the basis of location in 3D space (or 2D in a game/scenario that takes place mostly/exclusively on a surface). In other words, each "globally sorted batch" contains objects in the same 3D volume of space. This is just about the simplest possible basis to sort upon, I suppose. Each 3D volume of space (a world-coordinates AABB for practical purposes) is very efficiently tested against the frustum by the CPU. If the AABB is definitely outside the frustum, my engine ignores the entire batch (does not: send vertices to GPU, set up uniforms, change state, or render). It does not test each object against the frustum like the "conventional way", which means the GPU renders a few more unnecessary objects than a conventional approach, but the CPU test is super simple and super fast.
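The per-batch frustum rejection described above is commonly implemented with a "p-vertex" plane test: for each frustum plane, check only the AABB corner farthest along the plane normal. A minimal sketch, assuming a plane convention of n·x + d >= 0 meaning inside (the `Plane`, `Box`, and `definitelyOutside` names are my own illustration, not the engine's actual code):

```cpp
struct Plane { float nx, ny, nz, d; };   // inside when nx*x + ny*y + nz*z + d >= 0

struct Box { float min[3], max[3]; };

// Conservative test: returns true only when the box is entirely outside
// some frustum plane.  A box that straddles a plane is kept, which may
// render a few extra objects -- exactly the trade-off described above.
bool definitelyOutside(const Box& b, const Plane* planes, int planeCount) {
    for (int i = 0; i < planeCount; ++i) {
        const Plane& p = planes[i];
        // Pick the box corner farthest along the plane normal (the "p-vertex").
        float px = (p.nx >= 0.0f) ? b.max[0] : b.min[0];
        float py = (p.ny >= 0.0f) ? b.max[1] : b.min[1];
        float pz = (p.nz >= 0.0f) ? b.max[2] : b.min[2];
        if (p.nx*px + p.ny*py + p.nz*pz + p.d < 0.0f)
            return true;   // even the most "inside" corner is outside this plane
    }
    return false;
}
```

With six planes extracted from the view-projection matrix, this is a handful of multiplies per batch, which matches the "super simple and super fast" CPU test above.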

To be sure, it is true that we can perform "sub-sorts" within our "global-sort", and in some engines this will help, sometimes substantially. Especially in engines where the shader-writers go insane and create craploads of super-optimized special-purpose shaders, it would be worth sorting all objects within each unit/batch/section of the global-sort by which shader is required. Since I have so few shaders (and only one shader that checks conditional-rendering bits in each vertex for normal objects), this is irrelevant for me.

To quote TomF:
1. Typically, a graphics-card driver will try to take the entire state of the rendering pipeline and optimise it like crazy in a sort of "compilation" step. In the same way that changing a single line of C can produce radically different code, you might think you're "just" changing the AlphaTestEnable flag, but actually that changes a huge chunk of the pipeline. Oh but sir, it is only a wafer-thin renderstate... In practice, it's extremely hard to predict anything about the relative costs of various changes beyond extremely broad generalities - and even those change fairly substantially from generation to generation.

2. Because of this, the number of state changes you make between rendering calls is not all that relevant any more. This used to be true in the DX7 and DX8 eras, but it's far less so in these days of DX9, and it will be basically irrelevant on DX10. The card treats each unique set of states as an indivisible unit, and will often upload the entire pipeline state. There are very few incremental state changes any more - the main exceptions are rendertarget and some odd non-obvious ones like Z-compare modes.

3. On a platform like the PC, you often have no idea what sort of card the user is running on. Even if you ID'd the card, there's ten or twenty possible graphics card architectures, and each has a succession of different drivers. Which one do you optimise for? Do you try to make the edge-traversal function change according to the card installed? That sounds expensive. Remembering that most games are limited by the CPU, not the GPU, and you've just added to the asymmetry of that load.
[/quote]

I mostly agree with the above. However, your argument more-or-less presumes lots of state-changes, and correctly states that with modern GPUs and modern drivers, it is a fool's game to attempt to predict much about how to optimize for state changes.

However, it is still true that a scheme that HAS [almost] NO STATE CHANGES is still significantly ahead of those that do. Unfortunately, it is a fool's game to attempt to predict how far ahead for every combination of GPU and driver!

##### Share on other sites
Three things;

Firstly your whole premise around collision detection/response is utterly flawed. Tri-tri intersections are the very lowest and last level of intersection done, most games won't even go to this level for collision detection. If you are performing collisions in the world then you'll start off with a broad AABB sweep, then OBB sweep, then sub-OBBs as required and finally, maybe, a tri-tri intersection if you really require that. So for an object of a few 1000 triangles you end up performing a few vertex transforms before worrying about triangles. So you DON'T need the transformed vertex data back from the GPU because you simply don't need that data.

Secondly you keep throwing around the term 'conventional' without any real consideration as to what 'conventional' even means. The last game I worked on, for example, did draw rejection with a few schemes - distance based, frustum based per object & frustum based bounding box depending on the group of objects being rendered. There is no 'conventional' - there is 'basic' which is the per-object frustum based without consideration for the world structure but most games are going to grow beyond that pretty quickly. Hell, the current game I'm working on is using a static PVS solution to reject volumes of the world and associated objects from getting anywhere near being drawn.

Finally, while I'm too busy to think deeply about your sorting it sounds... wrong.
While your visibility detection based on volumes seems sane, beyond that you want to sort batches by state. If you are rendering by volume (as my first read implies) then you are causing both CPU and GPU overhead as you'll be swapping states too often and repeating binding operations etc too.
Example: two volumes each contain two models, model A and B.
If you render per volume then you'll do:

1. Volume 1
   - Render A
   - Render B
2. Volume 2
   - Render A
   - Render B

Each of those render calls would require setting the various per-instance states each time.
Whereas with a simple sort on the object you can detect that they are both using the same state and reduce it to:

1. Render two instances of A
2. Render two instances of B

Scale this up to a few more volumes and you'll quickly see the problem.

As reference, EVERYTHING in Battlefield 3 is instance rendered, which on modern (read: DX11-class) hardware is remarkably easy to do. In fact depending on bottlenecks you might well be well served with instancing models with different textures on them and using various techniques to bundle the different data together (texture arrays, UAVs of source data etc).

The point is spending a bit of time sorting for batches across the whole visible scene (something which you can bucket and do via multiple threads too) can save you more time on the CPU and GPU later by reducing submission workload and GPU state changes.

##### Share on other sites
You may be right, but your premise is wrong, at least in any scenario my application ever faces. In my application, and I would guess most other applications, the camera moves every frame. Therefore, every transformation matrix for every object must change in the "conventional way".
No, not at all -- in the "conventional way", the many objects' local-to-world matrices are untouched, and only the single world-to-projection matrix is changed. So, the minimum change is 64 bytes...
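As a sketch of that conventional per-frame update (column-major 4x4 matrices, as OpenGL expects; the `Mat4`/`updateFrame` helpers are my own names): only the single world-to-clip matrix is rebuilt when the camera moves, while every object's model matrix is left untouched.

```cpp
// Column-major 4x4 matrix, matching OpenGL's convention.
struct Mat4 { float m[16]; };

Mat4 identity() {
    Mat4 r{};
    r.m[0] = r.m[5] = r.m[10] = r.m[15] = 1.0f;
    return r;
}

Mat4 mul(const Mat4& a, const Mat4& b) {   // r = a * b
    Mat4 r{};
    for (int c = 0; c < 4; ++c)
        for (int row = 0; row < 4; ++row)
            for (int k = 0; k < 4; ++k)
                r.m[c*4 + row] += a.m[k*4 + row] * b.m[c*4 + k];
    return r;
}

// Conventional per-frame work: N model (local-to-world) matrices stay
// as they are; only worldToClip changes when the camera moves -- a
// 64-byte (16 float) upload, e.g. one glUniformMatrix4fv call.
void updateFrame(const Mat4& worldToClip,
                 const Mat4* model, Mat4* mvp, int objectCount) {
    for (int i = 0; i < objectCount; ++i)
        mvp[i] = mul(worldToClip, model[i]);
}
```

(In practice the per-object multiply usually happens in the vertex shader, not on the CPU; this sketch just makes the data flow explicit.)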

Furthermore, you totally ignore the very real, practical issue that I keep raising! The CPU needs world-coordinates to perform collision-detection and collision-response, and advocates of the "conventional way" always, always, always, always, always ignore this, pretend the problem doesn't exist, ignore people like me when we point it out, etc. I'm sorry, but a real general-purpose engine must perform collision-detection and collision-response. So at least you must include overhead for setting two matrices to the GPU and you must include overhead for the GPU writing out world-coordinate vertices to CPU memory... for every object every frame! Now we return control to your normal channel! :-)[/quote]I didn't completely ignore it - I asked in jest if you really need triangle-soup based collision detection. In my experience, it's standard practice to use an entirely different data representation for physics/collision than for rendering -- the two systems are entirely decoupled.
Trying to use your visual-representation for most game physics tasks seems entirely overcomplicated, especially as graphics assets continue to increase in detail. You seem to be designing huge portions of your framework around the assumption that this is important, and creating dangerous couplings in the process.

Okay, I must interrupt with an objection here, but then we'll return to your thoughts. In most games and simulations, most objects contain far fewer vertices, and most objects do not move (a wall, a chair, a table, a dish, a fork, a baseball, a bat, a blade of grass, etc). Then a small minority of objects are large, or in your scenario - huge![/quote]For the target hardware in this thread, those numbers are sane enough, and they're only going to get worse. If you want something for 6 years from now, make sure that it scales - in memory usage and processing time - as the art density increases. Your current system suffers a memory explosion in complex, dynamic scenes, so I'd at least recommend a hybrid approach, where certain objects can opt-out of your batch-optimisation process and be rendered more "conventionally".

That's one place you're wrong. If we're gonna perform collision-detection, then you need ALL vertices in world-coordinates, and probably you need them in CPU memory. So you not only have all that storage, you also need to have the GPU transfer all those world-coordinates back to CPU memory so the CPU can perform the collision-detection.[/quote]This is crazy, I've never seen this in a game. The status quo is to use an entirely different "Physics Mesh", which is like one of your visual LODs. As your visual LODs scale up to millions of vertices, your physics mesh usually remains very simple.
Most physics meshes aren't dynamic and there shouldn't be a time where you'd need to be transforming their vertices. Physics meshes that are dynamic are usually constructed out of articulated rigid body shapes, rather than arbitrarily deforming polygon soups.

##### Share on other sites
<snip>

This collision setup breaks under a client/server model so it's most definitely not suitable as any kind of general-purpose approach. With client/server you have 2 fundamental core assumptions: (1) the client is not to be trusted, and (2) the server may be somewhere on the other side of the world. It follows from those that the server does physics (at least the important gameplay-affecting stuff) and the client does rendering (unless you really want to send screen updates over the internet), so each needs a different data set.

It also starts out by assuming the very worst case - that you're going to need a per-triangle level of detail for collision detection with everything. There are actually at least two coarser tests that you run to trivially reject objects before going to this level of detail (even if you ever do go that far) - "can these objects collide?" (they may be in different areas of the map or moving in opposite directions) and a coarser test (sphere/bbox) for "do these objects collide?" Only if both of those are passed (and over 90% of your game objects are going to fail) do you need a more detailed test. The fastest code is the code that never runs - avoiding transforms altogether for over 90% of objects is going to trump any multithreaded SIMD setup.
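The "can these objects collide?" test described above can be as cheap as a bounding-sphere check inflated by how far each object can move in one step. A hypothetical sketch (the `Sphere`/`mightCollide` names and the speed-inflation heuristic are my own illustration):

```cpp
struct Sphere { float x, y, z, r; };

// "Can these objects collide this frame?"  Cheap rejection using
// bounding spheres inflated by each object's maximum travel in one
// timestep; only pairs that pass this are worth any finer test.
bool mightCollide(const Sphere& a, float aSpeed,
                  const Sphere& b, float bSpeed, float dt) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    float reach = a.r + b.r + (aSpeed + bSpeed) * dt;
    return dx*dx + dy*dy + dz*dz <= reach * reach;   // no sqrt needed
}
```

Since over 90% of object pairs fail this test, the expensive per-triangle code (and any vertex transforms feeding it) simply never runs for them.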

Regarding collision response, if you need to re-animate as a result of that you don't re-animate the vertexes, you re-animate the bones.

##### Share on other sites

Three things;

Firstly your whole premise around collision detection/response is utterly flawed. Tri-tri intersections are the very lowest and last level of intersection done, most games won't even go to this level for collision detection. If you are performing collisions in the world then you'll start off with a broad AABB sweep, then OBB sweep, then sub-OBBs as required and finally, maybe, a tri-tri intersection if you really require that. So for an object of a few 1000 triangles you end up performing a few vertex transforms before worrying about triangles. So you DONT need the transformed vertex data back from the GPU because you simply don't need that data.

My whole premise is utterly flawed? Interesting, because my post explicitly stated (with #1, #2, #3) the stages of my collision-detection. It said phase #1 is the broad AABB overlap sweep, just as you say above. It also says, "that's all that's required for some types of applications". However, I went to great pains at the beginning of this thread to emphasize that we're talking about next-generation general-purpose 3D game/simulation engines. They must be able to handle all kinds of situations. Having said that, I invite alternatives. For example, it is certainly arguable that sub-dividing irregular/concave objects into multiple convex objects so they can be processed by faster convex algorithms (like GJK) is superior to writing a totally general algorithm that handles completely arbitrary, irregular, concave objects (as I did).

I might not exactly understand what you mean by OBB and sub-OBB, so maybe you can elaborate. I assume OBB means OOBB (object oriented bounding box), the alternative to AABB (axis aligned bounding box). But I'm not sure what you mean by sub-OBB. I am not certain whether inserting an OOBB test on object-pairs that the AABB test classifies as "overlapping" speeds up collision-detection or not. It might, though I doubt by very much. However, I could be wrong about that --- you might have a blazing fast way to compute OOBBs and test a pair against each other, in which case the speedup might be significant and worthwhile (worth the extra complexity). However, after this level you lose me --- I just don't follow what you're talking about, unless maybe you have a nested set of more-and-more, smaller-and-smaller OOBBs inside the outer OOBB. If so, my off-the-cuff impression is... that must be quite a bit slower and more complex than a good, fast, optimized GJK algorithm like the one I have.

You are correct, of course, that an engine that works the "conventional way" (which I clearly stated means keeping local-coordinates of vertices in the GPU), can have the CPU perform local to world coordinate transformations of various subsets of object vertices "on demand". But guess what? If the CPU doesn't transform ALL object vertices to world-coordinates, it doesn't know the object AABB. Of course, you can perform an analysis on non-deforming objects to find the few vertices that can never be an outlier on ANY of the three axes, and exclude those from transformation because they will never set any of the 6 limits on the AABB. You could also exclude more vertices from this test, but artificially increase the size of the AABB to counteract the largest possible error this scheme might introduce (though I'm not exactly clear how to write such a routine, off hand).

Secondly you keep throwing around the term 'conventional' without any real consideration as to what 'conventional' even means.[/quote]
I did explain what I mean by "conventional". It only means keeping the local-coordinates of vertices in GPU memory, and always making the GPU transform them from local to screen coordinates every frame. I believe that is conventional in the sense that 90% or more of existing engines do this.

The last game I worked on, for example, did draw rejection with a few schemes - distance based, frustum based per object & frustum based bounding box depending on the group of objects being rendered. There is no 'conventional' - there is 'basic' which is the per-object frustum based without consideration for the world structure but most games are going to grow beyond that pretty quickly. Hell, the current game I'm working on is using a static PVS solution to reject volumes of the world and associated objects from getting anywhere near being drawn.[/quote]
About those objects that you "draw-rejected". Do you also exclude them from casting shadows onto objects that are in the frustum? Do you also exclude them from colliding with each other? Sorry, I mean do you exclude your engine from recognizing when they collide with each other, just because they were "draw rejected" [at an early stage]?

Finally, while I'm too busy to think deeply about your sorting it sounds... wrong.
While your visibility detection based on volumes seems sane, beyond that you want to sort batches by state. If you are rendering by volume (as my first read implies) then you are causing both CPU and GPU overhead as you'll be swapping states too often and repeating binding operations etc too.
Example: two volumes each contain two models, model A and B.
If you render per volume then you'll do:

1. Volume 1
   - Render A
   - Render B
2. Volume 2
   - Render A
   - Render B

Each of those render calls would require setting the various per-instance states each time.
Whereas with a simple sort on the object you can detect that they are both using the same state and reduce it to:

1. Render two instances of A
2. Render two instances of B
Scale this up to a few more volumes and you'll quickly see the problem.[/quote]
Yikes! See, this is the problem. And frankly, I can't blame you, because --- as I'm sure you're aware --- once any of us adopts a significantly non-standard approach, that radically changes the nature of many other interactions and tradeoffs. Since you and virtually everyone else has so thoroughly internalized the tradeoffs and connections that exist in more conventional approaches, you don't (because you can't, without extraordinary time and effort), see how the change I propose impacts everything else.

In this case, what you are failing to see, is that once you adopt the proposed unconventional approach, there is rarely if ever any need to change ANY state (including shaders) between batches (where each batch contains objects in one volume of space).

To help clarify this, the following is what my frame looks like:

set up state
for (batch = 0; batch < batch_count; batch++) {   // each batch contains objects in one volume of space
    if (batch.volume overlaps frustum.volume) {
        glDrawElements(the entire batch, including every object within the batch);
    }
}

That's it! So, where are these "state changes" you're talking about? That's right, there aren't any! Every object in every volume is rendered with the exact same state. Yes, the exact same state. Yes, that's correct, this means every object is rendered with the exact same transformation matrix.

How can this be? It "be", because the CPU transformed the vertices of every object that rotated or moved to world coordinates, and transferred them to GPU memory. Every vertex of every object that did not rotate or move since last frame still has the same world coordinates, and therefore that information does not need to be transferred to the GPU --- the vertices in GPU memory are already correct. And since ALL vertices in the GPU are in world-coordinates (not hundreds of different local-coordinates), ALL vertices of ALL objects get rendered by the exact same transformation matrix. Get it? No state changes. Not between individual objects, and not even between batches!
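The per-frame CPU side of this world-coordinate scheme might look like the following sketch (all names are mine, not the author's actual code; the upload call is left as a comment because it needs a live GL context). Only objects flagged as moved are re-transformed; static objects' vertices already sit in the shared VBO in world coordinates.

```cpp
#include <vector>
#include <cstddef>

struct Vtx { float x, y, z; };

struct Object {
    std::vector<Vtx> local;      // local-space vertices, kept on the CPU
    std::size_t      firstVert;  // offset into the shared world-space VBO
    bool             moved;      // rotated or translated since last frame?
    float            tx, ty, tz; // this frame's translation (rotation omitted for brevity)
};

// Re-transform only the objects that moved; everything else in the VBO
// is already correct and is not touched or re-uploaded.
// Returns how many vertices would need re-uploading this frame.
std::size_t updateWorldVerts(std::vector<Object>& objs,
                             std::vector<Vtx>& worldVbo) {
    std::size_t dirty = 0;
    for (Object& o : objs) {
        if (!o.moved) continue;
        for (std::size_t i = 0; i < o.local.size(); ++i) {
            worldVbo[o.firstVert + i] = { o.local[i].x + o.tx,
                                          o.local[i].y + o.ty,
                                          o.local[i].z + o.tz };
        }
        // glBufferSubData(GL_ARRAY_BUFFER,
        //                 o.firstVert * sizeof(Vtx),
        //                 o.local.size() * sizeof(Vtx),
        //                 &worldVbo[o.firstVert]);   // needs a GL context
        dirty += o.local.size();
        o.moved = false;
    }
    return dirty;
}
```

Because every vertex in the VBO is already in world coordinates, the single draw per batch needs no per-object matrix change, which is the whole point of the scheme.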

Now, my engine does have a couple special cases, so in a complex scenario that has every possible kind of object (plus points and lines too), there might be some state changes. At the very least, glDrawElements() gets called with an argument that indicates "points" or "lines" are being drawn. Even then, however, the same shader draws points and lines and triangles, so even in these cases there are no state changes. Only in a couple situations (that don't often exist) is any state change needed. Each frame, since the camera probably rotated or moved, a new transformation matrix must be specified, and possibly a few other items. But that's it! They are specified once, before anything is rendered for that frame, and then [typically] NO state is changed the entire frame.

I understand why this seems counterintuitive. Once you adopt the "conventional way" (meaning, you store local-coordinates in GPU memory), you are necessarily stuck changing state between every object. You cannot avoid it. At the very least you must change the transformation matrix to transform that object from its local coordinates to screen coordinates. And because you cannot avoid it, and it is an utterly unavoidable aspect of the conventional approach, you do not even imagine a system that requires no state changes for entire batches containing many objects, much less every object in the entire frame.

This is why I always find it difficult to discuss this topic with anyone who "knows too much"... or "has too much experience". You have a well founded set of assumptions that apply quite well to the conventional approach. And this makes it difficult to imagine a whole new set of interactions and tradeoffs and dynamics of a different approach. On the surface, the two approaches might not seem that different --- one simply stores local-coordinates in the GPU, while the other stores world-coordinates in the GPU. But look at how much the interactions and tradeoffs and dynamics change as a consequence.

As reference EVERYTHING in Battlefield 3 is instance rendered, which for a modern (read DX11 class hardware) is remarkable easy to do. In fact depending on bottlenecks you might well be well served with instancing models with difference textures on them and using various techniques to bundle the different data together (texture arrays, UAVs of source data etc).

The point is spending a bit of time sorting for batches across the whole visable scene (something which you can bucket and do via multiple threads too) can save you more time on the CPU and GPU later by reducing submission workload and GPU state changes.
[/quote]
Yes, I understand. As you might imagine, my engine depends on features like texture arrays. After all, in my approach, a single glDrawElements() might draw 100 or even 1000 different objects, with dozens if not hundreds of textures, bumpmaps, parallax-maps, specular-maps and other goober on them. How on earth could an engine possibly do this without texture-arrays? I suppose one mammoth texture could be subdivided into hundreds of smaller textures, ala "texture atlas", but texture arrays are a much better, more convenient solution, and my scheme wouldn't even work if I had to switch texture state and call glDrawElements() again for every object. Perish the thought! Good riddance to the problems of the "old days"! Of course, this means every vertex of every object needs a texture-ID field to identify where to find the textures for that object (plus "rules" that specify where other textures are found based on the specified texture-ID and tcoords). In all, my vertex has 32-bits of small ID fields, plus a bunch of bits that specify how to render (like "enable texture", "enable color", "emission", etc). So if the color-bit and texture-bit are both set, then each pixel of the applied texture is "colored" by the object color, and so forth (so a great many combinations are possible for each triangle given just a few control bits).
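One plausible packing of such a per-vertex control word (this exact layout is my guess, not the author's actual format): a texture-array layer index in the low bits plus render-control flags in the high bits, so one shader can branch per vertex instead of the CPU switching state.

```cpp
#include <cstdint>

// Hypothetical layout: low 16 bits = texture-array layer, high bits = flags.
enum : std::uint32_t {
    FLAG_TEXTURE  = 1u << 16,   // sample the texture array
    FLAG_COLOR    = 1u << 17,   // modulate by the object color
    FLAG_EMISSION = 1u << 18,   // add emissive term
};

constexpr std::uint32_t packControl(std::uint16_t textureLayer,
                                    std::uint32_t flags) {
    return static_cast<std::uint32_t>(textureLayer) | flags;
}

constexpr std::uint16_t textureLayer(std::uint32_t ctrl) {
    return static_cast<std::uint16_t>(ctrl & 0xFFFFu);
}

constexpr bool hasFlag(std::uint32_t ctrl, std::uint32_t f) {
    return (ctrl & f) != 0;
}
```

The shader would decode the same word (e.g. index a sampler2DArray by the layer field) so that combining texture, color and emission per triangle needs no state change between objects.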

##### Share on other sites
Firstly I'm not touching the physics things because, as pointed out, physics data shouldn't be using the same mesh as rendered data anyway and, more importantly, the two subsystems are hardly related.

'State' is far more than a matrix; any time you have a different shader, a different set of material constants, a different set of textures you require some state to be uploaded to the GPU. Your fixation on the transformation matrix is just that; a fixation.

As for the rest I really only have one question; how many objects are you testing your "engine" with and how many verts does each object hold?

##### Share on other sites

If the CPU doesn't transform ALL object vertices to world-coordinates, it doesn't know the object AABB.

It does - you calculate an initial bbox from the untransformed vertexes at load time, then you just transform the 8 corners of the bbox. 8 corner transforms versus one transform per vertex is all that you need.

Even Quake 2 did that back in 1997. Really, you're coming across as if you've created an artificially complex solution to a problem that has already been solved in a much simpler, cleaner and more performant manner well over a decade ago. Edited by mhagain
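The bbox trick mentioned above is small enough to show in full. A minimal sketch with my own illustrative types (a row-major rotation plus translation; a full 4x4 matrix works the same way):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Xform { float m[3][3]; Vec3 t; };   // rotation + translation
struct AABB { Vec3 lo, hi; };

inline Vec3 apply(const Xform& xf, const Vec3& p) {
    return { xf.m[0][0]*p.x + xf.m[0][1]*p.y + xf.m[0][2]*p.z + xf.t.x,
             xf.m[1][0]*p.x + xf.m[1][1]*p.y + xf.m[1][2]*p.z + xf.t.y,
             xf.m[2][0]*p.x + xf.m[2][1]*p.y + xf.m[2][2]*p.z + xf.t.z };
}

// Re-derive a world-space AABB by transforming only the 8 corners of the
// load-time local-space AABB, instead of every vertex of the mesh.
AABB transformAABB(const Xform& xf, const AABB& local) {
    AABB out{{ 1e30f, 1e30f, 1e30f}, {-1e30f,-1e30f,-1e30f}};
    for (int i = 0; i < 8; ++i) {           // 8 corners, not N vertices
        Vec3 c{ (i & 1) ? local.hi.x : local.lo.x,
                (i & 2) ? local.hi.y : local.lo.y,
                (i & 4) ? local.hi.z : local.lo.z };
        Vec3 w = apply(xf, c);
        out.lo.x = std::min(out.lo.x, w.x); out.hi.x = std::max(out.hi.x, w.x);
        out.lo.y = std::min(out.lo.y, w.y); out.hi.y = std::max(out.hi.y, w.y);
        out.lo.z = std::min(out.lo.z, w.z); out.hi.z = std::max(out.hi.z, w.z);
    }
    return out;
}
```

The resulting box is conservative (it can be looser than the tight AABB of the rotated mesh), which is exactly what a broad-phase test needs.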

##### Share on other sites

[quote name='maxgpgpu' timestamp='1339670012' post='4949105']<snip>

This collision setup breaks under a client/server model, so it's most definitely not suitable as any kind of general-purpose approach. With client/server you have 2 fundamental core assumptions: (1) the client is not to be trusted, and (2) the server may be somewhere on the other side of the world. It follows from those that the server does physics (at least the important gameplay-affecting stuff) and the client does rendering (unless you really want to send screen updates over the internet), so each needs a different data set.

It also starts out by assuming the very worst case - that you're going to need a per-triangle level of detail for collision detection with everything. There are actually at least two coarser tests that you run to trivially reject objects before going to this level of detail (even if you ever do go that far) - "can these objects collide?" (they may be in different areas of the map or moving in opposite directions) and a coarser test (sphere/bbox) for "do these objects collide?" Only if both of those are passed (and over 90% of your game objects are going to fail) do you need a more detailed test. The fastest code is the code that never runs - avoiding transforms altogether for over 90% of objects is going to trump any multithreaded SIMD setup.

Regarding collision response, if you need to re-animate as a result of that you don't re-animate the vertexes, you re-animate the bones.
[/quote]
Interesting... I designed my engine in a fundamentally client-server manner, even though most often the client and server will be the same computer, and even in the same process. This does present a few situations and cases that are annoying or problematic, especially when the client and server are in fact thousands of miles apart on the internet.

However, I suspect we have switched what we call the client and server. In my engine, the server is a 3D game/graphics/simulation server, and the client is a game, or an application that wants to run a simulation.

I am trying to design a highly generalized system, and this has greatly increased the work the server is expected to perform. For example, when the client specifies and builds an object hierarchy, it specifies what material each piece is made of, what the material characteristics are (if a non-standard material), what its mass is, how thick its surfaces are (if not solid), and so forth. Therefore, when the server detects a collision between a stainless steel ball bearing and a drinking glass, it performs collision-response AKA physics appropriate for those materials. This makes it vastly easier to write realistic games and simulations, but makes writing the engine much more demanding.

However, as you clearly understand, there will be hell to pay if the server must detect the collision, then inform the client half-way around the world, then wait (how many frames) for the client to decide and specify what is the collision-response AKA physics.

In my scheme, of course the client does get informed that "the drinking glass AKA object 3853 just got shattered by object 7418", along with some information about where they were, what velocities they collided at, and anything non-obvious about the collision-response chosen by the server. But the client may get this information a few frames after the results show up on the display, assuming the client is thousands of miles away.

Oh yes, I totally agree with your comments about collision detection. As I described elsewhere, my approach is exactly as you suggest: first perform the quickest and easiest test that excludes the vast majority of objects. In my case, that's the insertion-sort driven sweep-and-prune test. Actually, that's not even strictly correct. The first and most effective test for most games and simulations is this: if an object didn't move, it didn't collide. Of course that's not strictly true, because some moving object might have collided with it, but the test of the moving object will detect all those collisions, so we can start by simply excluding all objects that did not move. In practice that becomes part of the insertion-sort driven sweep-and-prune test, which excludes almost all object-pairs without testing those that didn't move. This passes only object-pairs that have overlapping AABBs on all 3 axes (x, y, z).
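The sweep-and-prune broad phase described above can be sketched on a single axis as follows. Types and names are illustrative; a production version keeps the endpoint list sorted incrementally across frames with an insertion sort (near O(n) when few objects move) and confirms overlap on y and z before passing a pair on:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

struct Box { int id; float minX, maxX; };   // one axis of an object AABB

// Sort by min endpoint, then sweep left to right, reporting only pairs
// whose x intervals overlap. Because the list is sorted, the inner loop
// can stop at the first box that starts past the current box's end.
std::vector<std::pair<int,int>> sweepAndPruneX(std::vector<Box> boxes) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.minX < b.minX; });
    std::vector<std::pair<int,int>> overlaps;
    for (std::size_t i = 0; i < boxes.size(); ++i)
        for (std::size_t j = i + 1; j < boxes.size(); ++j) {
            if (boxes[j].minX > boxes[i].maxX) break;  // no later box can overlap i
            overlaps.push_back({boxes[i].id, boxes[j].id});
        }
    return overlaps;
}
```

Only the pairs this returns (after the y and z checks) go on to the expensive narrow-phase test such as GJK.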

For the few object-pairs that are overlapping on all 3 axes (and thus are not rejected), my engine runs a super-optimized implementation of the GJK algorithm. This is an amazingly efficient test for intersection of 2 "convex" objects. If it finds they are not colliding, it provides the closest features on each object (which could be a point, an edge, or a face). This provides a better starting point for that object-pair on subsequent frames (as the objects get closer), so the GJK gets even faster thereafter. GJK is an accurate algorithm. When GJK reports "no-collision" or "collision", it is always accurate. If you want to be super-precise and GJK reports the objects are "considerably overlapped" (meaning they got rather far inside the other since not being in collision last frame), then finding the moment and positions of "first contact" can be... ehhh... "fun" --- as in "not easy", especially if one or both objects are rotating. Until I get a bit smarter, or have some brilliant idea, I decided to resort to "brute force", which means a successive approximation (binary search) for [close-enough-to] the moment of first contact.
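The successive-approximation search for the moment of first contact mentioned above amounts to a plain bisection over the frame interval. A minimal sketch, with the overlap test (e.g. GJK at interpolated poses) abstracted into a callback:

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Binary search for the time of first contact within [t0, t1], given that
// the objects are separated at t0 and overlapping at t1. The predicate
// evaluates overlap at an interpolated time (in a real engine, by running
// GJK on the objects at their poses at time t).
double firstContactTime(const std::function<bool(double)>& overlapsAt,
                        double t0, double t1, double tolerance) {
    // Precondition: !overlapsAt(t0) && overlapsAt(t1)
    while (t1 - t0 > tolerance) {
        double mid = 0.5 * (t0 + t1);
        if (overlapsAt(mid)) t1 = mid;   // contact happens at or before mid
        else                 t0 = mid;   // still separated at mid
    }
    return t1;  // earliest time known to be in contact, within tolerance
}
```

For example, two unit spheres whose center distance shrinks as 4 - 3t touch (distance 2) at t = 2/3, and the bisection converges to that value. Note that rotation during the interval makes the motion non-linear, which is exactly why a closed-form answer is hard and bisection is the pragmatic fallback.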

As for my triangle-versus-triangle intersection test, remember this only happens AT ALL in rare circumstances. First of all, if both objects are "convex", the GJK is a perfectly accurate test, and produces the closest features (point, edge, face/triangle) at the moment of first-contact. So the triangle-versus-triangle intersection test adds nothing, and is therefore not worth executing in these cases.

Finally, if one or both objects is substantially irregular/concave AND the first two tests pass them as "in collision", then my triangle-versus-triangle routine can be executed... IF... the client chooses to enable it. If an object is only "slightly concave", then usually GJK is plenty good enough, and it is still exact when the collision happens on the convex portions of the concave object.

So yeah, you are totally correct. First and foremost collision-detection excludes non-colliding objects and object-pairs in the quickest and cheapest manner. Of course this still requires world-coordinates to perform every collision test... except the first step that says "if it didn't rotate or move, it didn't collide". That test doesn't require anything except a flag that gets set whenever the object is rotated or moved, and cleared sometime before next frame begins.

And yeah, probably 99.9% of objects don't collide on a typical frame, and most frames no objects collide.

##### Share on other sites

#2: After all object changes are complete for each frame, the CPU performs transformations of "modified" objects (and their attached [articulatable] "child objects") from local-coordinates to world-coordinates. This updates the copy of the world-coordinates vertices in CPU memory. Then transfer just these modified vertices into the VBO that contains the object in GPU memory.

This is bonkers.
Transform vertices on the CPU? Unfortunately you spread this fallacy over so many posts I have a ton more quoting to do before I can make any point.

From articles by nvidia, I infer a single core of a fast CPU is about 8x faster than a GPU core, not 64x. That's why I believe we all agree that our 3D engines should perform all operations on the GPU that do not involve some unfortunate (and possibly inefficient) consequences.

We should avoid doing any operations anywhere on anything that cause unfortunate and inefficient consequences.

But when any aspect of collision-detection OR collision-response needs to be handled on the CPU, the "simple way" or "conventional way" of keeping object local-coordinates in VBOs in the GPU memory, and having the GPU transform those objects directly to screen-coordinates with one matrix has serious problems.

Physics has nothing to do with graphics. You would realize that except you are completely stuck in the world of vertices.
Everything you say is vertices this and vertices that. Have to transform them on the CPU because the CPU is faster (???) (in chess we mark “stupid moves” with question marks, with more question marks implying a stupider move).
Have to transform vertices on the CPU to get a correct AABB (???).
Get out of it.

You cannot transform vertices on the CPU as fast as you can on the GPU. Firstly, the CPU is shared. Stop reading every little piece of documentation completely literally. A CPU has a higher clock speed. Great. Now give half of that to the OS, some of it to everything running in the background, and a little slice is left over for your game. And then you are talking about re-updating a VBO every frame! More overhead!
With you at the helm, last generation’s games would run at about 15 FPS on next-generation hardware.

First, collision-detection must be performed with all objects in the same coordinate-system

No, actually, objects are checked with one being within the local coordinate system of the other.
But again we are talking about graphics, which has absolutely nothing, not even vertices, to do with physics.
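For reference, checking one object inside the other's local coordinate system boils down to composing inverse(A) * B once per pair, then mapping B's local-space features directly into A's frame, with no world-space vertex transforms at all. A minimal sketch for rigid transforms (illustrative types, not any particular engine's API):

```cpp
#include <cassert>
#include <cmath>

struct V3 { double x, y, z; };
struct Rigid { double r[3][3]; V3 t; };   // row-major rotation + translation

V3 mul(const Rigid& x, const V3& p) {
    return { x.r[0][0]*p.x + x.r[0][1]*p.y + x.r[0][2]*p.z + x.t.x,
             x.r[1][0]*p.x + x.r[1][1]*p.y + x.r[1][2]*p.z + x.t.y,
             x.r[2][0]*p.x + x.r[2][1]*p.y + x.r[2][2]*p.z + x.t.z };
}

// For a rigid transform the inverse is transpose(R) and -transpose(R)*t.
Rigid inverse(const Rigid& x) {
    Rigid inv{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            inv.r[i][j] = x.r[j][i];
    inv.t = { -(inv.r[0][0]*x.t.x + inv.r[0][1]*x.t.y + inv.r[0][2]*x.t.z),
              -(inv.r[1][0]*x.t.x + inv.r[1][1]*x.t.y + inv.r[1][2]*x.t.z),
              -(inv.r[2][0]*x.t.x + inv.r[2][1]*x.t.y + inv.r[2][2]*x.t.z) };
    return inv;
}

Rigid compose(const Rigid& a, const Rigid& b) {   // a * b
    Rigid c{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                c.r[i][j] += a.r[i][k] * b.r[k][j];
    c.t = mul(a, b.t);
    return c;
}

// Maps points from B's local frame straight into A's local frame.
Rigid bIntoA(const Rigid& aToWorld, const Rigid& bToWorld) {
    return compose(inverse(aToWorld), bToWorld);
}
```

This is why pairwise narrow-phase tests do not require a world-coordinate copy of every vertex: one 3x3-plus-translation composition per pair replaces per-vertex transforms.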

##### Share on other sites

Firstly I'm not touching the physics things because, as pointed out, physics data shouldn't be using the same mesh as rendered data anyway and, more importantly, the two subsystems are hardly related.

'State' is far more than a matrix; any time you have a different shader, a different set of material constants, a different set of textures you require some state to be uploaded to the GPU. Your fixation on the transformation matrix is just that; a fixation.

As for the rest I really only have one question; how many objects are you testing your "engine" with and how many verts does each object hold?

I can see that in some cases, maybe many cases, it is reasonable to have separate rendering and physics meshes. However, I'm not sure how to understand a statement that "they should not be using the same mesh". That seems to me a rather bold blanket statement to make. As I said elsewhere, in an engine and game that renders boatloads of very fine detail with zillions of microtriangles, there is no question that performing physics on every freaking microtriangle is overkill. If you show damage in a slightly wrong place, nobody will ever notice in practice. However, in an engine and game that renders boatloads of very fine detail with a self-shadowing parallax-mapping technique like "relaxed cone-step mapping" (or even just bumpmapping), objects don't need zillions of vertices to display exceptionally fine details. In this case, I don't see why "physics data should not be in the same mesh as render data". In my view the two systems are rather closely related --- they describe the exact same object. So even when you adopt microtriangles for fine detail and therefore DO have separate data for physics purposes, I still consider them "closely related", even if processed separately.

I have no fixation on the transformation matrix. However, I recognize this much. The one state that inherently must be changed for each and every object (in the conventional approach) is the transformation matrix. Once the engine is required to change this state, the efficiency of "continuous flow" is broken, and the additional overhead of changing other state isn't that big a deal. So I don't have any fixation on the transformation matrix whatsoever. I simply recognize that the LACK of any need to change the transformation matrix for every object is the reason I can render 100s or 1000s of objects in one batch with one glDrawElements() call. That's just a fact, not a fixation. I guess you want to pretend "it doesn't matter", but I keep explaining why consequences flow from the requirement to change transformation matrices for every object, or the lack of that requirement.

As for testing my engine, give me a number of objects and a number of vertices, and I'll run a test for you.

However, consider this consequence of the unconventional approach in my engine. I'll describe a not-so-unusual environment and situation. Assume we have a large environment with thousands of visible objects, each of which has as many triangles as necessary to look good. Assume the camera is flying around through the environment, so the viewpoint changes every frame. But let's say no other objects besides the camera are moving, as in an airplane flying over and through mountains and valleys with buildings here and there.

In my engine, here is what must happen from frame to frame:
#1: a new transformation matrix must be computed to transform world-coordinates to screen-coordinates given the new camera rotation and position.
#2: this new transformation matrix must be sent to the GPU.
#3: the CPU performs ZERO additional work.
#4: the CPU calls glDrawElements() once for each batch of 100s or 1000s of objects if that batch might intersect the frustum.
#5: DONE

That's all there is. I am not kidding. The CPU did SQUAT. The CPU transformed ZERO vertices. The CPU transferred ZERO vertices to the GPU. The CPU checked ONE object for collisions (only the camera, since no other object moved, and non-moving objects do not collide with other non-moving objects). The CPU computed and transferred ONE transformation matrix to the GPU (the world-to-screen matrix). Therefore the rendering speed is ENTIRELY a function of GPU-speed. Period. End of story.
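For concreteness, the per-frame CPU work claimed above (steps #1 and #2) is just rebuilding one world-to-clip matrix from the new camera pose and uploading it once. A sketch of that matrix math, using an identity camera orientation for brevity (conventions and names are my own; the real code would also call glUniformMatrix4fv and then issue one glDrawElements per batch):

```cpp
#include <cassert>
#include <cmath>

struct M4 { double m[4][4]; };
struct V4 { double x, y, z, w; };

M4 mulM4(const M4& a, const M4& b) {
    M4 c{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                c.m[i][j] += a.m[i][k] * b.m[k][j];
    return c;
}

V4 mulPoint(const M4& a, const V4& p) {
    double in[4] = {p.x, p.y, p.z, p.w}, out[4] = {0, 0, 0, 0};
    for (int i = 0; i < 4; ++i)
        for (int k = 0; k < 4; ++k)
            out[i] += a.m[i][k] * in[k];
    return {out[0], out[1], out[2], out[3]};
}

// Standard OpenGL-style perspective matrix (column-vector convention,
// right-handed, camera looking down -Z).
M4 perspective(double fovY, double aspect, double zn, double zf) {
    double f = 1.0 / std::tan(fovY * 0.5);
    M4 p{};
    p.m[0][0] = f / aspect;
    p.m[1][1] = f;
    p.m[2][2] = (zf + zn) / (zn - zf);
    p.m[2][3] = 2.0 * zf * zn / (zn - zf);
    p.m[3][2] = -1.0;
    return p;
}

// View matrix for a camera at (ex, ey, ez) with identity orientation;
// a full look-at would add the rotation rows.
M4 translateView(double ex, double ey, double ez) {
    M4 v{};
    v.m[0][0] = v.m[1][1] = v.m[2][2] = v.m[3][3] = 1.0;
    v.m[0][3] = -ex; v.m[1][3] = -ey; v.m[2][3] = -ez;
    return v;
}
```

Composing `perspective(...) * translateView(...)` once per frame is the entirety of step #1; everything else in the static-scene case is the GPU's problem.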

So go ahead and give me a scenario like this, and my answer will simply be a measure of how fast the GPU can render my objects. Period.

Or go ahead and make a few objects move. This will make very little difference. It doesn't take very long for my fast transformation routines to transform one, two, five, twenty objects from local-coordinates to world-coordinates, and transfer their vertices to GPU with one glBufferSubData() call per object (assuming I remember that function name correctly). That's the only extra work the CPU needs to do to render this frame. And hey, in most games and simulations, 10 or 20 moving objects is a very dynamic scene. We'll cheat (as-is-usual) and assume cases like "subtle fluttering of leaves in the breeze" are implemented as usual, and thus do not invoke collision detection on a leaf-to-leaf basis. The little extra work my engine performs to move a dozen or two objects is insignificant.
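The per-moving-object work described above can be sketched as a dirty-flag pass: transform only flagged objects' vertices to world coordinates, then upload just those vertex ranges. Types and names below are illustrative, and the glBufferSubData call is left as a comment since this sketch is CPU-only:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct P3 { float x, y, z; };

struct Object {
    bool moved = false;                       // set on rotate/move, cleared here
    std::vector<P3> local;                    // authoritative local-coordinate mesh
    std::vector<P3> world;                    // world-coordinate copy mirrored in the VBO
    float rot[3][3] = {{1,0,0},{0,1,0},{0,0,1}};
    P3 pos{0.0f, 0.0f, 0.0f};
};

// Transform and (conceptually) upload only the objects that moved this
// frame; returns how many uploads were issued.
int updateMovedObjects(std::vector<Object>& objects) {
    int uploads = 0;
    for (Object& o : objects) {
        if (!o.moved) continue;               // untouched objects cost nothing
        o.world.resize(o.local.size());
        for (std::size_t i = 0; i < o.local.size(); ++i) {
            const P3& p = o.local[i];
            o.world[i] = { o.rot[0][0]*p.x + o.rot[0][1]*p.y + o.rot[0][2]*p.z + o.pos.x,
                           o.rot[1][0]*p.x + o.rot[1][1]*p.y + o.rot[1][2]*p.z + o.pos.y,
                           o.rot[2][0]*p.x + o.rot[2][1]*p.y + o.rot[2][2]*p.z + o.pos.z };
        }
        // Real engine: glBufferSubData(GL_ARRAY_BUFFER, byteOffsetOfObject,
        //                              o.world.size() * sizeof(P3), o.world.data());
        o.moved = false;
        ++uploads;
    }
    return uploads;
}
```

With a dozen moving objects out of thousands, this loop touches a dozen vertex ranges and leaves the rest of the VBO alone, which is the basis of the "very little CPU work" claim.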

HOWEVER, if you want to make me and my engine look stupid, you can easily throw a worst-case scenario at me in which every last object is moving around wildly in random directions. That will certainly raise the CPU temperature a few degrees --- no doubt about it.

I'll run the tests if you wish. I just hope you understand the above explains, in terms you can understand, WHY the CPU in my engine does very, very, very little work in common situations. That's all I want: for you guys to understand what I'm talking about... to understand which cases cause what work to be required in both approaches. Notice this. Even in the case where nothing but the camera is moving (or the camera and a few objects), the conventional approach MUST change state between EVERY object (at the very least to change the transformation matrix, but possibly other state too). You must call glDrawRangeElements() or a similar function for EVERY object. You must do all this EVEN WHEN NOTHING HAPPENS --- even when no object rotates or moves, even the camera!

If you can finally understand how different this case is, and how much LESS my engine needs to do in these cases, then we can agree that this whole conversation should be an analysis and discussion of the tradeoffs of the different approaches, or hybrid approaches, or entirely new approaches none of us even thought of yet. That's what I'm hoping for. I'm not convinced of anything yet. This is a complicated situation, with so many ways to skin a cat that sometimes it makes me dizzy (hence this thread). But nobody will sit back, reflect, and openly discuss tradeoffs and possibilities if they imagine there is one way.... or the highway. Clearly I took the highway on this one so far, and though I keep trying to get you guys to see the merits, I am not myself convinced I have the best overall solution. But nobody will get anywhere unless you understand that basic nature of the difference between these approaches, and some of the remarkable consequences.

• ### Similar Content

• By EddieK
Hello. I'm trying to make an Android game and I have come across a problem. I want to draw different map layers at different Z depths so that some of the tiles are drawn above the player while others are drawn under him. But there's an issue where the pixels that should be transparent are drawn opaque above the player. This is the code I'm using:
int setup(){
    GLES20.glEnable(GLES20.GL_DEPTH_TEST);
    GLES20.glEnable(GL10.GL_ALPHA_TEST);
    GLES20.glEnable(GLES20.GL_TEXTURE_2D);
}

int render(){
    GLES20.glClearColor(0, 0, 0, 0);
    GLES20.glClear(GLES20.GL_ALPHA_BITS);
    GLES20.glClear(GLES20.GL_COLOR_BUFFER_BIT);
    GLES20.glClear(GLES20.GL_DEPTH_BUFFER_BIT);
    GLES20.glBlendFunc(GLES20.GL_ONE, GL10.GL_ONE_MINUS_SRC_ALPHA);
    // do the binding of textures and drawing vertices
}
My vertex shader:
uniform mat4 MVPMatrix; // model-view-projection matrix
uniform mat4 projectionMatrix;
attribute vec4 position;
attribute vec2 textureCoords;
attribute vec4 color;
attribute vec3 normal;
varying vec4 outColor;
varying vec2 outTexCoords;
varying vec3 outNormal;

void main()
{
    outNormal = normal;
    outTexCoords = textureCoords;
    outColor = color;
    gl_Position = MVPMatrix * position;
}
My fragment shader:
precision highp float;
uniform sampler2D texture;
varying vec4 outColor;
varying vec2 outTexCoords;
varying vec3 outNormal;

void main()
{
    vec4 color = texture2D(texture, outTexCoords) * outColor;
    gl_FragColor = vec4(color.r, color.g, color.b, color.a);
}
I have attached a picture of how it looks. You can see the black squares near the tree. These squares should be transparent as they are in the png image:

It's strange that in this picture, instead of alpha or just black color, it displays the grass texture beneath the player and the tree:

Any ideas on how to fix this?

• This article uses material originally posted on Diligent Graphics web site.
Introduction
Graphics APIs have come a long way from a small set of basic commands allowing limited control of the configurable stages of early 3D accelerators to very low-level programming interfaces exposing almost every aspect of the underlying graphics hardware. The next-generation APIs, Direct3D12 by Microsoft and Vulkan by Khronos, are relatively new and have only started gaining widespread adoption and support from hardware vendors, while Direct3D11 and OpenGL are still considered the industry standard. The new APIs can provide substantial performance and functional improvements, but may not be supported by older hardware. An application targeting a wide range of platforms needs to support Direct3D11 and OpenGL. The new APIs will not give any advantage when used with old paradigms. It is entirely possible to add Direct3D12 support to an existing renderer by implementing the Direct3D11 interface on top of Direct3D12, but this will give zero benefit. Instead, new approaches and rendering architectures that leverage the flexibility provided by the next-generation APIs need to be developed.
There are at least four APIs (Direct3D11, Direct3D12, OpenGL/GLES, Vulkan, plus Apple's Metal for iOS and macOS platforms) that a cross-platform 3D application may need to support. Writing separate code paths for all APIs is clearly not an option for any real-world application, and the need for a cross-platform graphics abstraction layer is evident. The following is the list of requirements that I believe such a layer needs to satisfy:
- Lightweight abstraction: the API should be as close to the underlying native APIs as possible to allow an application to leverage all available low-level functionality. In many cases this requirement is difficult to achieve because specific features exposed by different APIs may vary considerably.
- Low performance overhead: the abstraction layer needs to be efficient from a performance point of view. If it introduces a considerable amount of overhead, there is no point in using it.
- Convenience: the API needs to be convenient to use. It needs to assist developers in achieving their goals, not limit their control of the graphics hardware.
- Multithreading: the ability to efficiently parallelize work is at the core of Direct3D12 and Vulkan and one of the main selling points of the new APIs. Support for multithreading in a cross-platform layer is a must.
- Extensibility: no matter how well the API is designed, it still introduces some level of abstraction. In some cases the most efficient way to implement certain functionality is to use the native API directly. The abstraction layer needs to provide seamless interoperability with the underlying native APIs to give the app a way to add features that may be missing.

Diligent Engine is designed to solve these problems. Its main goal is to take advantage of the next-generation APIs such as Direct3D12 and Vulkan, but at the same time provide support for older platforms via Direct3D11, OpenGL and OpenGLES. Diligent Engine exposes a common C++ front-end for all supported platforms and provides interoperability with the underlying native APIs. It also supports integration with Unity and is designed to be used as the graphics subsystem in a standalone game engine, a Unity native plugin or any other 3D application. Full source code is available for download at GitHub and is free to use.
Overview
Diligent Engine API takes some features from Direct3D11 and Direct3D12 as well as introduces new concepts to hide certain platform-specific details and make the system easy to use. It contains the following main components:
Render device (IRenderDevice  interface) is responsible for creating all other objects (textures, buffers, shaders, pipeline states, etc.).
Device context (IDeviceContext interface) is the main interface for recording rendering commands. Similar to Direct3D11, there are immediate and deferred contexts (which in the Direct3D11 implementation map directly to the corresponding context types). The immediate context combines command queue and command list recording functionality. It records commands and submits the command list for execution when it contains a sufficient number of commands. Deferred contexts are designed to only record command lists that can then be submitted for execution through the immediate context.
An alternative way to design the API would be to expose command queue and command lists directly. This approach however does not map well to Direct3D11 and OpenGL. Besides, some functionality (such as dynamic descriptor allocation) can be much more efficiently implemented when it is known that a command list is recorded by a certain deferred context from some thread.
The approach taken in the engine does not limit scalability as the application is expected to create one deferred context per thread, and internally every deferred context records a command list in lock-free fashion. At the same time this approach maps well to older APIs.
In the current implementation, only one immediate context, which uses the default graphics command queue, is created. To support multiple GPUs or multiple command queue types (compute, copy, etc.), it is natural to have one immediate context per queue. Cross-context synchronization utilities will be necessary.
Swap Chain (ISwapChain interface). Swap chain interface represents a chain of back buffers and is responsible for showing the final rendered image on the screen.
Render device, device contexts and swap chain are created during the engine initialization.
Resources (ITexture and IBuffer interfaces). There are two types of resources - textures and buffers. There are many different texture types (2D textures, 3D textures, texture arrays, cubemaps, etc.) that can all be represented by the ITexture interface.
Resources Views (ITextureView and IBufferView interfaces). While textures and buffers are mere data containers, texture views and buffer views describe how the data should be interpreted. For instance, a 2D texture can be used as a render target for rendering commands or as a shader resource.
Pipeline State (IPipelineState interface). The GPU pipeline contains many configurable stages (depth-stencil, rasterizer and blend states, different shader stages, etc.). Direct3D11 uses coarse-grain objects to set all stage parameters at once (for instance, a rasterizer object encompasses all rasterizer attributes), while OpenGL contains myriad functions to fine-grain control every individual attribute of every stage. Neither method maps very well to modern graphics hardware, which combines all states into one monolithic state under the hood. Direct3D12 directly exposes the pipeline state object in the API, and Diligent Engine uses the same approach.
Shader Resource Binding (IShaderResourceBinding interface). Shaders are programs that run on the GPU. Shaders may access various resources (textures and buffers), and setting correspondence between shader variables and actual resources is called resource binding. Resource binding implementation varies considerably between different API. Diligent Engine introduces a new object called shader resource binding that encompasses all resources needed by all shaders in a certain pipeline state.
API Basics
Creating Resources
Device resources are created by the render device. The two main resource types are buffers, which represent linear memory, and textures, which use memory layouts optimized for fast filtering. Graphics APIs usually have a native object that represents linear buffer. Diligent Engine uses IBuffer interface as an abstraction for a native buffer. To create a buffer, one needs to populate BufferDesc structure and call IRenderDevice::CreateBuffer() method as in the following example:
BufferDesc BuffDesc;
BuffDesc.Name           = "Uniform buffer";
BuffDesc.BindFlags      = BIND_UNIFORM_BUFFER;
BuffDesc.Usage          = USAGE_DYNAMIC;
BuffDesc.uiSizeInBytes  = sizeof(ShaderConstants);
BuffDesc.CPUAccessFlags = CPU_ACCESS_WRITE;
m_pDevice->CreateBuffer( BuffDesc, BufferData(), &m_pConstantBuffer );
While there is usually just one buffer object, different APIs use very different approaches to represent textures. For instance, in Direct3D11, there are ID3D11Texture1D, ID3D11Texture2D, and ID3D11Texture3D objects. In OpenGL, there is an individual object for every texture dimension (1D, 2D, 3D, Cube), which may be a texture array, which may also be multisampled (i.e. GL_TEXTURE_2D_MULTISAMPLE_ARRAY). As a result there are nine different GL texture types that Diligent Engine may create under the hood. In Direct3D12, there is only one resource interface. Diligent Engine hides all these details in the ITexture interface. There is only one IRenderDevice::CreateTexture() method, which is capable of creating all texture types. Dimension, format, array size and all other parameters are specified by the members of the TextureDesc structure:
TextureDesc TexDesc;
TexDesc.Name      = "My texture 2D";
TexDesc.Type      = TEXTURE_TYPE_2D;
TexDesc.Width     = 1024;
TexDesc.Height    = 1024;
TexDesc.Format    = TEX_FORMAT_RGBA8_UNORM;
TexDesc.Usage     = USAGE_DEFAULT;
TexDesc.BindFlags = BIND_SHADER_RESOURCE | BIND_RENDER_TARGET | BIND_UNORDERED_ACCESS;
m_pRenderDevice->CreateTexture( TexDesc, TextureData(), &m_pTestTex );
If the native API supports multithreaded resource creation, textures and buffers can be created by multiple threads simultaneously.
Interoperability with native API provides access to the native buffer/texture objects and also allows creating Diligent Engine objects from native handles. It allows applications seamlessly integrate native API-specific code with Diligent Engine.
Next-generation APIs allow fine-level control over how resources are allocated. Diligent Engine does not currently expose this functionality, but it can be added by implementing an IResourceAllocator interface that encapsulates the specifics of resource allocation and providing this interface to the CreateBuffer() or CreateTexture() methods. If null is provided, the default allocator is used.
Initializing the Pipeline State
As mentioned earlier, Diligent Engine follows the next-gen APIs to configure the graphics/compute pipeline. One big Pipeline State Object (PSO) encompasses all required states (all shader stages, input layout description, depth-stencil, rasterizer and blend state descriptions, etc.). This approach maps directly to Direct3D12/Vulkan, but is also beneficial for older APIs as it eliminates pipeline misconfiguration errors. With many individual calls tweaking various GPU pipeline settings it is very easy to forget to set one of the states, or to assume a stage is already properly configured when in fact it is not. Using a pipeline state object helps avoid these problems, as all stages are configured at once.
While in earlier APIs shaders were bound separately, in the next-generation APIs as well as in Diligent Engine shaders are part of the pipeline state object. The biggest challenge when authoring shaders is that Direct3D and OpenGL/Vulkan use different shader languages (while Apple uses yet another language in their Metal API). Maintaining two versions of every shader is not an option for real applications, so Diligent Engine implements a shader source code converter that allows shaders authored in HLSL to be translated to GLSL. To create a shader, one needs to populate a ShaderCreationAttribs structure. The SourceLanguage member of this structure tells the system which language the shader is authored in:
When sampling a texture in a shader, the texture sampler was traditionally specified as a separate object that was bound to the pipeline at run time, or set as part of the texture object itself. However, in most cases it is known beforehand what kind of sampler will be used in the shader. Next-generation APIs expose a new type of sampler called a static sampler that can be initialized directly in the pipeline state. Diligent Engine exposes this functionality: when creating a shader, textures can be assigned static samplers. If a static sampler is assigned, it will always be used instead of the one initialized in the texture shader resource view. To initialize static samplers, prepare an array of StaticSamplerDesc structures and initialize the StaticSamplers and NumStaticSamplers members. Static samplers are more efficient, and it is highly recommended to use them whenever possible. On older APIs, static samplers are emulated via generic sampler objects.
The following is an example of shader initialization:
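A minimal sketch of what shader creation might look like. Member names such as FilePath, EntryPoint, and Desc.ShaderType, as well as the file and entry point names, are assumptions for illustration; only the SourceLanguage member is named in the text above.

```cpp
// Sketch of shader creation; file name and entry point are hypothetical.
ShaderCreationAttribs CreationAttribs;
CreationAttribs.FilePath        = "MyShader.fx";
CreationAttribs.EntryPoint      = "MyPixelShader";
CreationAttribs.Desc.ShaderType = SHADER_TYPE_PIXEL;
// Tell the converter which language the source is authored in:
CreationAttribs.SourceLanguage  = SHADER_SOURCE_LANGUAGE_HLSL;
RefCntAutoPtr<IShader> pPixelShader;
m_pDevice->CreateShader(CreationAttribs, &pPixelShader);
```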
Creating the Pipeline State Object
After all required shaders are created, the rest of the fields of the PipelineStateDesc structure provide depth-stencil, rasterizer, and blend state descriptions, the number and formats of render targets, the input layout format, etc. For instance, the rasterizer state can be described as follows:
```cpp
PipelineStateDesc PSODesc;
RasterizerStateDesc &RasterizerDesc = PSODesc.GraphicsPipeline.RasterizerDesc;
RasterizerDesc.FillMode              = FILL_MODE_SOLID;
RasterizerDesc.CullMode              = CULL_MODE_NONE;
RasterizerDesc.FrontCounterClockwise = True;
RasterizerDesc.ScissorEnable         = True;
RasterizerDesc.AntialiasedLineEnable = False;
```

Depth-stencil and blend states are defined in a similar fashion.
Another important thing the pipeline state object encompasses is the input layout description, which defines how inputs to the vertex shader (the very first shader stage) should be read from memory. The input layout may define several vertex streams that contain values of different formats and sizes:
```cpp
// Define input layout
InputLayoutDesc &Layout = PSODesc.GraphicsPipeline.InputLayout;
LayoutElement TextLayoutElems[] =
{
    LayoutElement( 0, 0, 3, VT_FLOAT32, False ),
    LayoutElement( 1, 0, 4, VT_UINT8,   True ),
    LayoutElement( 2, 0, 2, VT_FLOAT32, False ),
};
Layout.LayoutElements = TextLayoutElems;
Layout.NumElements    = _countof( TextLayoutElems );
```

Finally, the pipeline state defines the primitive topology type. When all required members are initialized, a pipeline state object can be created by the IRenderDevice::CreatePipelineState() method:
```cpp
// Define primitive topology and shaders
PSODesc.GraphicsPipeline.PrimitiveTopologyType = PRIMITIVE_TOPOLOGY_TYPE_TRIANGLE;
PSODesc.GraphicsPipeline.pVS = pVertexShader;
PSODesc.GraphicsPipeline.pPS = pPixelShader;
PSODesc.Name = "My pipeline state";
m_pDev->CreatePipelineState(PSODesc, &m_pPSO);
```

When a PSO object is bound to the pipeline, the engine invokes all API-specific commands to set the states specified by the object. In the case of Direct3D12 this maps directly to setting the D3D12 PSO object. In the case of Direct3D11, this involves setting individual state objects (such as rasterizer and blend states), shaders, input layout, etc. In the case of OpenGL, this requires a number of fine-grain state-tweaking calls. Diligent Engine keeps track of the currently bound states and only calls the functions that update states that have actually changed.
Direct3D11 and OpenGL utilize fine-grain resource binding models, where an application binds individual buffers and textures to certain shader or program resource binding slots. Direct3D12 uses a very different approach, where resource descriptors are grouped into tables, and an application can bind all resources in a table at once by setting the table in the command list. The resource binding model in Diligent Engine is designed to leverage this new method. It introduces an object called a shader resource binding that encapsulates all resource bindings required by all shaders in a certain pipeline state. It also introduces a classification of shader variables based on their expected frequency of change, which helps the engine group them into tables under the hood:
- Static variables (SHADER_VARIABLE_TYPE_STATIC) are expected to be set only once. They may not be changed once a resource is bound to the variable. Such variables are intended to hold global constants, such as constant buffers with camera or global light attributes.
- Mutable variables (SHADER_VARIABLE_TYPE_MUTABLE) define resources that are expected to change with per-material frequency. Examples include diffuse textures, normal maps, etc.
- Dynamic variables (SHADER_VARIABLE_TYPE_DYNAMIC) are expected to change frequently and randomly.

The shader variable type must be specified during shader creation by populating an array of ShaderVariableDesc structures and initializing the ShaderCreationAttribs::Desc::VariableDesc and ShaderCreationAttribs::Desc::NumVariables members (see example of shader creation above).
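Specifying variable types can be sketched as follows. The structure and member names come from the text above; the variable names themselves are hypothetical:

```cpp
// Sketch: classifying shader variables by expected update frequency.
ShaderVariableDesc VarDesc[] =
{
    { "g_GlobalConstants", SHADER_VARIABLE_TYPE_STATIC  },  // set once
    { "tex2DDiffuse",      SHADER_VARIABLE_TYPE_MUTABLE },  // per-material
    { "cbRandomAttribs",   SHADER_VARIABLE_TYPE_DYNAMIC }   // per-draw
};
ShaderCreationAttribs CreationAttribs;
CreationAttribs.Desc.VariableDesc = VarDesc;
CreationAttribs.Desc.NumVariables = _countof(VarDesc);
```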
Static variables cannot be changed once a resource is bound to the variable. They are bound directly to the shader object. For instance, a shadow map texture is not expected to change after it is created, so it can be bound directly to the shader:
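A minimal sketch of such a direct binding; the GetShaderVariable accessor and the variable name are assumptions, not confirmed by this post:

```cpp
// Sketch: binding a static resource directly to the shader object.
// "g_tex2DShadowMap" and GetShaderVariable() are assumed names.
m_pPixelShader->GetShaderVariable("g_tex2DShadowMap")->Set(pShadowMapSRV);
```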
Mutable and dynamic variables are bound via a shader resource binding (SRB) object, which is created by the pipeline state:

```cpp
m_pPSO->CreateShaderResourceBinding(&m_pSRB);
```

Note that an SRB is only compatible with the pipeline state it was created from. The SRB object inherits all static bindings from the shaders in the pipeline, but is not allowed to change them.
Mutable resources can only be set once for every instance of a shader resource binding. Such resources are intended to define specific material properties. For instance, a diffuse texture for a specific material is not expected to change once the material is defined and can be set right after the SRB object has been created:
```cpp
m_pSRB->GetVariable(SHADER_TYPE_PIXEL, "tex2DDiffuse")->Set(pDiffuseTexSRV);
```

In some cases it is necessary to bind a new resource to a variable every time a draw command is invoked. Such variables should be labeled as dynamic, which allows setting them multiple times through the same SRB object:
```cpp
m_pSRB->GetVariable(SHADER_TYPE_VERTEX, "cbRandomAttribs")->Set(pRandomAttrsCB);
```

Under the hood, the engine pre-allocates descriptor tables for static and mutable resources when an SRB object is created. Space for dynamic resources is allocated dynamically at run time. Static and mutable resources are thus more efficient and should be used whenever possible.
As you can see, Diligent Engine does not expose the low-level details of how resources are bound to shader variables. One reason is that these details vary significantly between APIs. The other is that low-level binding methods are extremely error-prone: it is very easy to forget to bind a resource, or to bind an incorrect one, such as binding a buffer to a variable that is in fact a texture, especially during shader development when everything changes fast. Diligent Engine instead relies on a shader reflection system to automatically query the list of all shader variables. Grouping variables into the three types mentioned above allows the engine to create an optimized layout and to do the heavy lifting of matching each resource to its API-specific location, register, or descriptor in the table.
This post gives more details about the resource binding model in Diligent Engine.
Setting the Pipeline State and Committing Shader Resources
Before any draw or compute command can be invoked, the pipeline state needs to be bound to the context:
```cpp
m_pContext->SetPipelineState(m_pPSO);
```

Under the hood, the engine sets the internal PSO object in the command list or calls all the required native API functions to properly configure all pipeline stages.
The next step is to bind all required shader resources to the GPU pipeline, which is accomplished by IDeviceContext::CommitShaderResources() method:
```cpp
m_pContext->CommitShaderResources(m_pSRB, COMMIT_SHADER_RESOURCES_FLAG_TRANSITION_RESOURCES);
```

The method takes a pointer to the shader resource binding object and makes all resources it holds available to the shaders. In the case of D3D12, this only requires setting the appropriate descriptor tables in the command list. For older APIs, this typically requires setting all resources individually.
Next-generation APIs require the application to track the state of every resource and explicitly inform the system about all state transitions. For instance, if a texture was previously used as a render target, while the next draw command is going to use it as a shader resource, a transition barrier needs to be executed. Diligent Engine does the heavy lifting of state tracking. When the CommitShaderResources() method is called with the COMMIT_SHADER_RESOURCES_FLAG_TRANSITION_RESOURCES flag, the engine commits resources and transitions them to the correct states at the same time. Note that transitioning resources does introduce some overhead. The engine tracks the state of every resource and will not issue a barrier if the state is already correct, but checking resource state is itself overhead that can sometimes be avoided. The engine provides the IDeviceContext::TransitionShaderResources() method that only transitions resources:
```cpp
m_pContext->TransitionShaderResources(m_pPSO, m_pSRB);
```

In some scenarios it is more efficient to transition resources once and then only commit them.
Invoking Draw Command
The final step is to set the states that are not part of the PSO, such as render targets and vertex and index buffers. Diligent Engine uses a Direct3D11-style API that is translated into the native API calls under the hood:
```cpp
// Set render target and depth-stencil buffer
ITextureView *pRTVs[] = {m_pRTV};
m_pContext->SetRenderTargets(_countof( pRTVs ), pRTVs, m_pDSV);

// Clear render target and depth buffer
const float zero[4] = {0, 0, 0, 0};
m_pContext->ClearRenderTarget(nullptr, zero);
m_pContext->ClearDepthStencil(nullptr, CLEAR_DEPTH_FLAG, 1.f);

// Set vertex and index buffers
IBuffer *buffer[] = {m_pVertexBuffer};
Uint32 offsets[] = {0};
Uint32 strides[] = {sizeof(MyVertex)};
m_pContext->SetVertexBuffers(0, 1, buffer, strides, offsets, SET_VERTEX_BUFFERS_FLAG_RESET);
m_pContext->SetIndexBuffer(m_pIndexBuffer, 0);
```

Different native APIs use different sets of functions to execute draw commands depending on the command details (whether the command is indexed, instanced, or both, what offsets into the source buffers are used, etc.). For instance, there are 5 draw commands in Direct3D11 and more than 9 in OpenGL, with functions like glDrawElementsInstancedBaseVertexBaseInstance not uncommon. Diligent Engine hides all these details behind a single IDeviceContext::Draw() method that takes a DrawAttribs structure as an argument. The structure members define all attributes required to perform the command (primitive topology, number of vertices or indices, whether the draw call is indexed, instanced, or indirect, etc.). For example:
```cpp
DrawAttribs attrs;
attrs.IsIndexed  = true;
attrs.IndexType  = VT_UINT16;
attrs.NumIndices = 36;
attrs.Topology   = PRIMITIVE_TOPOLOGY_TRIANGLE_LIST;
pContext->Draw(attrs);
```

For compute commands, there is the IDeviceContext::DispatchCompute() method, which takes a DispatchComputeAttribs structure that defines the compute grid dimensions.
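A compute dispatch can be sketched similarly; the ThreadGroupCount member names and the grid size are assumptions for illustration:

```cpp
// Sketch: dispatching a 16x16x1 grid of thread groups; member names assumed.
DispatchComputeAttribs DispatchAttrs;
DispatchAttrs.ThreadGroupCountX = 16;
DispatchAttrs.ThreadGroupCountY = 16;
DispatchAttrs.ThreadGroupCountZ = 1;
m_pContext->DispatchCompute(DispatchAttrs);
```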
Source Code
Full engine source code is available on GitHub and is free to use. The repository contains two samples, an asteroids performance benchmark, and an example Unity project that uses Diligent Engine in a native plugin.
The AntTweakBar sample is Diligent Engine's “Hello World” example.

The atmospheric scattering sample is a more advanced example. It demonstrates how Diligent Engine can be used to implement various rendering tasks: loading textures from files, using complex shaders, rendering to multiple render targets, using compute shaders and unordered access views, etc.

The asteroids performance benchmark is based on this demo developed by Intel. It renders 50,000 unique textured asteroids and allows comparing the performance of the Direct3D11 and Direct3D12 implementations. Every asteroid is a combination of one of 1000 unique meshes and one of 10 unique textures.

Finally, there is an example project that shows how Diligent Engine can be integrated with Unity.

Future Work
The engine is under active development. It currently supports the Windows desktop, Universal Windows, and Android platforms. The Direct3D11, Direct3D12, and OpenGL/GLES backends are now feature complete. A Vulkan backend is coming next, and support for more platforms is planned.