# OpenGL most efficient general rendering strategies for new GPUs



Rendering performance of 3D game/graphics/simulation engines can be improved by quite a few techniques. Examples include culling (backface, obscured, frustum, etc.), simple/fast shaders for deferred processing, uber-shaders to support large batches, etc.

In this thread, I'd like experienced 3D programmers to brainstorm and try to identify the set of techniques that will most speed up rendering of high-quality general scenes on current and next-generation high-end CPUs, GPUs and OpenGL/GLSL APIs (let's assume a 5 year timeframe). Complexity of implementation should also be considered important.

The goal is to come up with a set of techniques, and their order of execution (including parallel execution), that best suits high-quality, general purpose scenes with large numbers of objects. In other words, imagine you're writing a 3D engine that needs to execute a variety of common game and simulation scenarios efficiently (not just one specific game). The nominal scenarios should range between:

#1: A very large outdoor [or indoor] environment in which most objects do not move on a typical frame, but dozens of objects are moving each frame.

#2: A game in outer space in which most or all objects move every frame.

Let's assume the engine supports 1 ambient light and several point light-sources, and that some form of soft shadows is required.

The following reduce efficiency and should be considered:
- small batches
- switching shaders
- rendering objects outside frustum
- rendering objects entirely obscured by closer opaque objects
- rendering objects behind semi-transparent objects
- some form of parallax mapping vs detailed geometry
- add more here

There are many possible "dynamics" that people consider.

For example, if we write one or more "uber-shaders" that test bit-fields and/or texture-IDs and/or matrix-IDs in each vertex structure to control how the pixel shader renders each triangle, it is possible to render huge quantities of objects with a single call to glDrawElements() or equivalent. On the other hand, every triangle takes a little bit longer to execute, due to the multiple paths in the pixel shader.

Another dynamic is the complexity of culling objects outside the frustum when they do or might cast shadows, and when the environment contains mirrors or [semi]-reflective surfaces, and when the environment contains virtual cameras that point in random directions and their view is rendered on video displays at various places [possibly] within the scene. Furthermore, shouldn't collision detection and response be computed for all objects, even those outside the frustum?

At one end of the spectrum of possibilities is an approach in which every possible efficiency is tested-for and potentially executed every frame. Considering how various possible efficiencies and possible aspects of a scene interact, this approach could be extremely complex, tricky and prone to cases that are not handled correctly due to that complexity.

At the other end of the spectrum of possibilities is an approach in which every object that has moved is transformed every frame, without testing for being visible in the frustum, casting a shadow onto any object in the frustum, etc. Instead, this approach would attempt to find a way to most efficiently perform every applicable computation on every object, and possibly even render every object. Perhaps this approach could support one type of culling without risking unwanted interactions - by grouping objects near to each other into individual batches, then not rendering into the backbuffer those batches that fall entirely outside the frustum. But this culling would only be on the final rendering phase, not the collision-phase or shadow computing phase, etc.

I consider this a difficult problem! I've brainstormed this issue with myself for years, and have never felt confident I have the best answer... or even close to the best answer. I won't bias this brainstorming session by stating my nominal working opinion before others have voiced their observations and opinions.

Please omit discussions that apply to CPUs older than current high-end CPUs, GPUs older than GTX-680 class, and OpenGL/GLSL older than v4.20, because the entire point of this thread is to design something that's efficient 2~4 years from now, and likely for years beyond that. Also omit discussions that apply to non-general environments or non-general rendering.

OTOH, if you know of new features of next-generation CPUs/GPUs/OpenGL/GLSL that are important to this topic, please DO discuss these.

Assume the computer contains:
- one 4GHz 8-core AMD/Intel CPU
- 8GB to 32GB fastish system RAM
- one GTX-680 class GPU with 2GB~4GB RAM
- one 1920x1200 LCD display (or higher-resolution)
- computer is not running other applications simultaneously
- up to 8~16 simultaneously active engine threads on CPU

Assume the 3D engine supports the following conventional features:
- some kind of background (mountains, ocean, sky)
- many thousand objects
- millions of triangles
- several point lights (or many point lights but only closest*brightest applied to each triangle)
- ambient lighting
- diffuse shading
- reflective shading
- soft shadows (variance shadow mapping or alternative)
- texture mapping
- bump mapping
- parallax mapping (maybe, vs real geometry)
- collision detection (broad & narrow phase, convex/concave, fairly accurate)
- collision response (basic physics)
- objects support hierarchies (rotate/translate against each other)
- semi-transparent objects == optional

##### Share on other sites
Using queries to find out whether objects are occluded is interesting, but difficult. The problem is that a query may report that an object was occluded, but using that result as a decision for the next frame may be invalid when things move around. Another problem with queries is that you have to be careful about when you ask for the result. If you ask too soon, you can force the pipeline to stall.

But there are some ways in which I have used queries with great success. Suppose the world contains objects with many vertices, and the objects are sorted from near to far. I draw a frame like this:

1. Draw the first 10% of the objects.
2. Draw invisible bounding boxes representing the next 10%, with queries enabled.
3. For each object in step 2, check the query result (which should now be available), and draw the full object if the box was visible.
4. Repeat from step 2 for the next 10% of objects.

There is an overhead of drawing boxes, but it is small compared to drawing the full objects. A difficulty with this algorithm is finding a good subset of objects to test in each iteration. If the percentage is too small, there is a risk that waiting on the queries will stall the pipeline. If the percentage is too big, fewer objects will be found occluded.
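The control flow of those four steps can be sketched in C++ roughly as follows. This is only a CPU-side outline under some assumptions: `OcclusionHooks`, `drawWithStagedQueries` and the fixed `chunkSize` are names made up for illustration, and the three callbacks stand in for the real GL work (issuing the box between `glBeginQuery(GL_SAMPLES_PASSED, id)` and `glEndQuery(GL_SAMPLES_PASSED)`, then reading the result with `glGetQueryObjectuiv`).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical hooks standing in for the GL calls of a real renderer.
struct OcclusionHooks {
    std::function<void(std::size_t)> drawObject;  // draw the full mesh of object i
    std::function<void(std::size_t)> testBox;     // issue an occlusion query for object i's bounding box
    std::function<bool(std::size_t)> boxVisible;  // read back the query result for object i
};

// Objects 0..objectCount-1 are assumed to be sorted near to far already.
// Returns the number of full objects that were drawn.
std::size_t drawWithStagedQueries(std::size_t objectCount,
                                  std::size_t chunkSize,
                                  const OcclusionHooks& hooks) {
    assert(chunkSize > 0);
    std::size_t drawn = 0;

    // Step 1: draw the nearest chunk unconditionally so later boxes have occluders.
    const std::size_t first = std::min(chunkSize, objectCount);
    for (std::size_t i = 0; i < first; ++i) {
        hooks.drawObject(i);
        ++drawn;
    }

    // Steps 2-4: test one chunk of bounding boxes, then draw the visible objects.
    for (std::size_t begin = first; begin < objectCount; begin += chunkSize) {
        const std::size_t end = std::min(begin + chunkSize, objectCount);
        for (std::size_t i = begin; i < end; ++i) {
            hooks.testBox(i);      // issue all queries of the chunk first
        }
        for (std::size_t i = begin; i < end; ++i) {
            if (hooks.boxVisible(i)) {   // results are read only after the whole chunk was issued
                hooks.drawObject(i);
                ++drawn;
            }
        }
    }
    return drawn;
}
```

Issuing every query of a chunk before reading any result is what gives the GPU time to finish; the chunk size is the tuning knob discussed above.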

##### Share on other sites
I find it suspicious that so many articles are published about one approach or another to culling, or to improving rendering efficiency, but in my experience they are all narrow in focus - pointing at one possibility or another, but not the whole picture. Unfortunately, many approaches interact with other approaches, or are mutually incompatible. For example, you can only choose or organize the content of each VBO [or batch] on one primary basis. Also, if you work hard to avoid drawing objects that are not visible from the current camera, you might end up calling glDrawElements() zillions of times (so much for "large batches"), and you might spend a lot of CPU figuring out which objects are indeed "occluded" or "outside frustum" or otherwise excluded.

In practice, most decisions create substantial complexity in ways that become a combinatorial explosion of complexity - and almost always upon the CPU. Here are a couple of examples of practical problems and interactions that seem to be ignored in analyses of this topic. #1: When you cull objects because they are outside the frustum (or "occlusion", or other "efficiency" reason), those objects might cast shadows into the visible frame. Oops! In the silence of a realistic space game/simulation especially, these shadows can be key. #2: When you cull objects for any reason, how will your game/simulation know whether they've crashed into each other?

And these are just two "practical problems". My engine supports in-game/simulation cameras and displays. In other words, the scene may contain "security cameras" (and other kinds of cameras) whose images are displayed on fixed or moving LCD screens. Exactly how complex is the frustum and/or occlusion scheme with lots of cameras pointing in all directions, and their images being shown on lots of displays in real time in each view of each camera? Blank out! But I suppose that complexity is a bit esoteric compared to the problem of not computing collisions or showing shadows (of potentially huge or numerous objects) because they (and the sun) are behind the camera.

I guess now is time to stir the pot by floating my extremely unconventional non-solution solution. Before I state it, let me admit that special-purpose engines (engines for one game, or one narrow type of game) may be able to adopt more efficiency measures. For example, in some situations "players won't usually notice" that shadows are not cast by objects outside the frustum, and "players won't usually notice" when objects never collide outside the primary frustum.

I was driven to remove efficiency-measure after efficiency-measure for the kinds of reasons I just mentioned. My eventual "non-solution solution" was to remove or limit most efficiency measures, and focus on generalizing and streamlining the existing computation and rendering processes (where "computation processes" includes "collision detection-and-response" and other [mostly-non-rendering] processes).

Anyway, I ended up with something like the following. Well, except, hopefully not literally "ended up", for perhaps some of the geniuses here will identify and explain better strategies and approaches.

Rather than try to rip out all sorts of techniques that I made work together to the extent possible, I "started over" and created the most streamlined general processes I could.

#1: On every frame, when the position or rotation of an object changes, mark the object "geometry modified".

#2: After all object changes are complete for each frame, the CPU performs transformations of "modified" objects (and their attached [articulatable] "child objects") from local-coordinates to world-coordinates. This updates the copy of the world-coordinates vertices in CPU memory. Then transfer just these modified vertices into the VBO that contains the object in GPU memory.

#3: Make each batch contain up to 65536 vertices so the GPU only needs a 16-bit index to render each vertex. Objects with more than 65536 vertices can be rendered in a single batch with 32-bit indices, or divided into multiple batches with up to 65536 vertices in each. Making batches 8192 elements or larger has substantial rendering speed advantages, while batches larger than 65536 vertices have only a minor speed benefit with current-generation GPUs (which contain roughly 512 ~ 1024 cores divided between vertex shaders and pixel shaders [and sometimes also geometry shaders, gpgpu/compute shaders, and others]).

#4: Write super-efficient 64-bit SIMD/SSE/AVX/FMA3/FMA4/etc. functions in assembly language to multiply 4x4 matrices, and also functions to transform arrays of vertices [from local to world coordinates]. These routines are amazingly fast - only a dozen nanoseconds per vertex --- per CPU. And I can make 4 to 8 cores transform vertices in parallel for truly amazing performance. This includes memory-access overhead to load the local-coordinates vertex into CPU/SIMD registers, and memory-access overhead to store world-coordinate vertices back into memory (two of them in fact, one with f64 elements and one with f32 elements for the GPU). And my vertices contain four f64x4 vectors to transform (position, zenith vector (normal), north vector (tangent), east vector (bi-tangent)).

#5: In general, each batch contains objects in a specific region of 3D space (or 2D space in primarily "flat" environments like the surface of a planet). All batches can be processed by the CPU for collision-detection, shadow-casting, and for other purposes --- OR --- any batches can be excluded if appropriate to the game/simulation/environment. However, the CPU need not render any batch that is outside the frustum of a given camera, so the one primary sort of visibility culling performed is excluding regions that are entirely outside the relevant frustum.
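To make steps #1 and #2 concrete, here is a minimal C++ sketch of the "geometry modified" flag and the CPU-side transform to world coordinates. It is plain scalar code, not the SIMD/assembly routines described in #4, and the names (`Affine`, `Object`, `UploadRange`, `updateModifiedObjects`) are hypothetical. It handles positions only, with no normals, tangents, child objects or f32 copy, and it merely returns the vertex ranges that would then be sent to the VBO (for example with `glBufferSubData`).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// Row-major 3x4 affine transform: 3x3 rotation/scale plus a translation column.
struct Affine {
    std::array<double, 12> m;
    Vec3 apply(const Vec3& v) const {
        return { m[0] * v.x + m[1] * v.y + m[2]  * v.z + m[3],
                 m[4] * v.x + m[5] * v.y + m[6]  * v.z + m[7],
                 m[8] * v.x + m[9] * v.y + m[10] * v.z + m[11] };
    }
};

struct Object {
    Affine localToWorld;
    std::vector<Vec3> localVertices;
    std::vector<Vec3> worldVertices;   // CPU copy in world coordinates
    std::size_t vboOffset = 0;         // first vertex of this object in the shared VBO
    bool geometryModified = true;      // set whenever position or rotation changes
};

// A contiguous run of vertices that must be re-uploaded to the GPU.
struct UploadRange { std::size_t firstVertex; std::size_t vertexCount; };

// Transforms only the objects marked "geometry modified" and reports
// which vertex ranges of the VBO need to be refreshed this frame.
std::vector<UploadRange> updateModifiedObjects(std::vector<Object>& objects) {
    std::vector<UploadRange> uploads;
    for (Object& obj : objects) {
        if (!obj.geometryModified) continue;   // stationary objects cost nothing
        obj.worldVertices.resize(obj.localVertices.size());
        for (std::size_t i = 0; i < obj.localVertices.size(); ++i) {
            obj.worldVertices[i] = obj.localToWorld.apply(obj.localVertices[i]);
        }
        assert(obj.worldVertices.size() == obj.localVertices.size());
        uploads.push_back({obj.vboOffset, obj.localVertices.size()});
        obj.geometryModified = false;
    }
    return uploads;
}
```

On a frame where nothing moved, the returned list is empty and no data is transferred at all.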

##### some benefits of this approach #####
##### others please state drawbacks #####
#1: Since all objects are in VBOs in GPU memory, and they are always in world coordinates, there is only one transformation matrix for all objects. This means that the same transformation matrix can be applied to every vertex in every object, and so any mix of objects can be rendered in a single "batch" (a single call of glDrawElements() or equivalent). In some scenes a few types of objects benefit from "instancing". To support up to 16 instanced object-types in any rendered frame, each vertex has a 4-bit field to specify an alternate or additional matrix for that object-type.

#2: In most games and simulations, the vast majority of objects are stationary on a random frame (if not permanently stationary). Therefore, the CPU needs to transform and move to VBO in GPU memory only those objects that were modified during the last frame, which is usually a tiny minority of objects.

#3: Since we have accurate world-coordinates in CPU memory for every object every frame, we can perform accurate collision-detection [and collision-response if desired] on every object, every frame.

#4: Since we have accurate world-coordinates in CPU memory for every object every frame, we can perform accurate shadow casting by any-or-every object, every frame, whether those objects are rendered to the backbuffer or not.

#5: Since cameras and lights are also objects, we also have accurate world-coordinates in CPU memory for every camera (viewpoint) and every light source. We therefore can, if and when convenient, perform frustum and other visibility/shadowing computations on the CPU without the difficulty and complexity of having the GPU spew world-coordinates of all objects back to CPU memory.

===============

Okay, that ought to stir up a hornet's nest, since my approach is almost the opposite of "conventional wisdom" in most ways. I invite specific alternatives that should work better for a general purpose game/simulation engine designed to render general environments in as realistic a manner as possible (not cartoons or other special-purpose scenarios, for example).

##### Share on other sites

> I find it suspicious that so many articles are published about one way or another about culling, or improving rendering efficiency, but in my experience they are all narrow focus - pointing at one possibility or another, but not the whole picture.

True. But games are different, usually with different requirements and different ways that the game engine is constructed. So there is no "one size fits all". Rather, there is a palette of algorithms to choose from.

> ... and you might spend a lot of CPU figuring out which objects are indeed "occluded" or "outside frustum" or otherwise excluded.

Certainly a problem, and a compromise is required. The difficulty is finding a good compromise that stays the same on various combinations of graphics cards and CPU architectures. I think multi-threaded solutions will gradually push this in the direction where it is possible to do more and more pre-computation.

> #1: When you cull objects because they are outside the frustum (or "occlusion", or other "efficiency" reason), those objects might cast shadows into the visible frame. Oops! In the silence of a realistic space game/simulation especially, these shadows can be key.

Are you sure that is a problem? You cull the object, not the shadow of the object. That means that the object will not be shown on your screen, but the shadow will. The computation of shadows depends on light sources, and these have to do their own, independent, culling and rendering.

> #2: When you cull objects for any reason, how will your game/simulation know whether they've crashed into each other?

Culling is used for deciding what shall be drawn. It is not used for collision detection, which is an independent algorithm (for example, using an octree). I think collision detection is usually done on the main CPU, not the GPU.

> My engine supports in-game/simulation cameras and displays.

Pictures in cameras are created using a separate render pass, which has its own culling algorithm.

> ...Before I state it, let me admit that special-purpose engines (engines for one game, or one narrow type of game) may be able to adopt more efficiency measures. For example, in some situations "players won't usually notice" that shadows are not cast by objects outside the frustum, and "players won't usually notice" when objects never collide outside the primary frustum.

Graphics optimization is a lot about cheating. You can, and have to, do the most "funny" things, as long as the player doesn't notice.

> #2: After all object changes are complete for each frame, the CPU performs transformations of "modified" objects (and their attached [articulatable] "child objects") from local-coordinates to world-coordinates. This updates the copy of the world-coordinates vertices in CPU memory. Then transfer just these modified vertices into the VBO that contains the object in GPU memory.

A very common technique is to define an object as a mesh. You can then draw the same mesh in many places in the world, simply using different transformation matrices. This would not fit into the use of world coordinates for every vertex. When an object moves, you only have to update the model matrix instead of updating all vertices.

> #3: Make each batch contain up to 65536 vertices so the GPU only need access a 16-bit index to render each vertex. Objects with more than 65536 vertices can be rendered in a single batch with 32-bit indices, or divided into multiple batches with up to 65536 indices in each. Making batches 8192 elements or larger has substantial rendering speed advantages, while batches larger than 65536 vertices have minor speed benefit with current-generation GPUs (which contain roughly 512 ~ 1024 cores divided between vertex shaders and pixel shaders [and sometimes also geometry shaders, gpgpu/compute shaders, and others]).

Sure about that? One advantage of indexed drawing is the automatic use of cached results from the vertex shader. This will work just as well with 32-bit indices. I don't know for sure.

> #4: Write super-efficient 64-bit SIMD/SSE/AVX/FMA3/FMA4/etc functions in assembly-language to multiply 4x4 matrices, and also functions to transform arrays of vertices [from local to world coordinates]. These routines are amazingly fast - only a dozen nanoseconds per vertex --- per CPU. And I can make 4 to 8 cores transform vertices in parallel for truly amazing performance. This includes memory-access overhead to load the local-coordinates vertex into CPU/SIMD registers, and memory-access overhead to store world-coordinate vertices back into memory (two of them in fact, one with f64 elements and one with f32 elements for the GPU). And my vertices contain four f64x4 vectors to transform (position, zenith vector (normal), north vector (tangent), east vector (bi-tangent)).

Isn't it better to let the GPU do this instead? I think a 4x4 matrix multiplication takes one cycle only? Hard to beat.

##### Share on other sites

> Isn't it better to let the GPU do this instead? I think a 4x4 matrix multiplication takes one cycle only? Hard to beat.

Depends on the use case. If, for example, you're transforming a point light into the same space as a model, then sending the light's transformed position to the GPU, it's going to be more efficient to do this transformation once per-model on the CPU than once per-vertex on the GPU. Even so, you're going to need a hell of a lot of point lights for that transformation to become in any way measurable on your performance graphs (and even then it's still going to be a minuscule fraction of your overall cost).

But yeah, for general position/matrix transforms you do them on the GPU - any kind of CPU-side SIMD optimization of that is way back in the realms of software T&L.

##### Share on other sites

> > maxgpgpu wrote:
> > I find it suspicious that so many articles are published about one way or another about culling, or improving rendering efficiency, but in my experience they are all narrow focus - pointing at one possibility or another, but not the whole picture.
>
> True. But games are different, usually with different requirements and different ways that the game engine is constructed. So there is no "one size fits all". Rather, there is a palette of algorithms to choose from.

True enough for special-purpose engines. However, the topic of general-purpose engines is not irrelevant --- after all, there are lots of general-purpose engines!

> > ... and you might spend a lot of CPU figuring out which objects are indeed "occluded" or "outside frustum" or otherwise excluded.
>
> Certainly a problem, and a compromise is required. A problem is to find a good compromise that stays the same on various combinations of graphics cards and CPU architecture. I think multi threaded solutions will gradually push this in the direction where it is possible to do more and more pre-computation.

I agree; that's one of the conclusions I came to. That's why I do a lot more work on the CPU than anyone I've heard give advice. My development system has an 8-core CPU, and I figure in 2~3 years 8-core will be mainstream and I'll have a 16-core CPU in my box.

> > #1: When you cull objects because they are outside the frustum (or "occlusion", or other "efficiency" reason), those objects might cast shadows into the visible frame. Oops! In the silence of a realistic space game/simulation especially, these shadows can be key.
>
> Are you sure that is a problem? You cull the object, not the shadow of the object. That means that the object will not be shown on your screen, but the shadow will. The computation of shadows depends on light sources, and these have to do their own, independent, culling and rendering.

Yes, I suppose separate culling schemes per pass/phase can work. On the other hand, performing culling at all assumes the CPU is figuring out which objects can be culled... for shadow-casting too. And if the CPU needs to know where things are every cycle to perform this process, then where is the savings in having the GPU transform everything on pass #0 in order to figure out what casts shadows into the frame, and what doesn't? This is one of the considerations that drove me to just transform everything to world-coordinates as efficiently as possible on the CPU.

> > #2: When you cull objects for any reason, how will your game/simulation know whether they've crashed into each other?
>
> Culling is used for decided what shall be drawn. It is not used for collision detection, which is an independent algorithm (for example, using octree). I think collision detection is usually done on the main CPU, not the GPU.

Yes, but you are just confirming my thesis now, because collision detection (and many other computations) must be performed in world-coordinates. So if culling means "not having the GPU render every object", then the CPU needs to know which objects can be culled every frame, which means the CPU needs to transform every object that has moved every frame. Thus culling saves GPU time, but the CPU must at least transform the AABB or bounding sphere to world-coordinates to perform any kind of culling at all.

> > My engine supports in-game/simulation cameras and displays.
>
> Pictures in cameras are created using a separate render pass, that have its own culling algorithm.

Yes, my engine must perform a rendering loop for each camera. However, since my engine transforms every object that moved since the last frame to world-coordinates on the first iteration of that loop, the CPU need not transform any objects for the 2nd, 3rd, 4th... to final camera. The transformation process also automatically updates the AABB for every transformed object, so that information is also available for other forms of culling. Of course, on each loop the CPU must call glDrawElements() as many times as necessary to render the camera view into a texture attached to an FBO. No way to avoid that, though objects outside the frustum need not be drawn (a fairly quick and simple AABB versus frustum check).
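For reference, one common form of that AABB versus frustum check looks roughly like the sketch below. It assumes the six frustum planes are already available in world space with normals pointing into the frustum; the type and function names are made up for illustration, and the test is conservative (it can report "not outside" for a few boxes near frustum corners that are in fact invisible, but never the reverse).

```cpp
#include <array>
#include <cassert>

struct Vec3 { double x, y, z; };

// Plane in the form dot(n, p) + d >= 0 for points p on the inner side.
struct Plane { Vec3 n; double d; };

struct AABB { Vec3 min, max; };

using Frustum = std::array<Plane, 6>;

// Returns true only if the box lies entirely on the outer side of at least one plane.
bool aabbOutsideFrustum(const AABB& box, const Frustum& frustum) {
    assert(box.min.x <= box.max.x && box.min.y <= box.max.y && box.min.z <= box.max.z);
    for (const Plane& p : frustum) {
        // Pick the box corner that lies farthest along the plane normal.
        const Vec3 corner{ p.n.x >= 0.0 ? box.max.x : box.min.x,
                           p.n.y >= 0.0 ? box.max.y : box.min.y,
                           p.n.z >= 0.0 ? box.max.z : box.min.z };
        // If even that corner is behind the plane, the whole box is outside.
        if (p.n.x * corner.x + p.n.y * corner.y + p.n.z * corner.z + p.d < 0.0) {
            return true;
        }
    }
    return false;  // inside or intersecting: draw it
}
```

With world-space AABBs kept up to date on the CPU, the same function can be reused for every camera and for the light frustums of shadow passes.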

> > ...Before I state it, let me admit that special-purpose engines (engines for one game, or one narrow type of game) may be able to adopt more efficiency measures. For example, in some situations "players won't usually notice" that shadows are not cast by objects outside the frustum, and "players won't usually notice" when objects never collide outside the primary frustum.
>
> Graphics optimization is a lot about cheating. You can, and have to, do the most "funny" things, as long as the player doesn't notice.

My opinion is... those days are numbered. Of course, in a few years they'll say not performing full-bore ray-tracing to every pixel is cheating, so in that sense you'll be correct for a long time.

> > #2: After all object changes are complete for each frame, the CPU performs transformations of "modified" objects (and their attached [articulatable] "child objects") from local-coordinates to world-coordinates. This updates the copy of the world-coordinates vertices in CPU memory. Then transfer just these modified vertices into the VBO that contains the object in GPU memory.
>
> A very common technique is to define an object as a mesh. You can then draw the same mesh in many places in the world, simply using different transformation matrices. This would not fit into the use of world coordinates for every vertex. When an object moves, you only have to update the model matrix instead of updating all vertices.

Yes, that is the case I addressed with my comments on instancing. There's more than one way to accomplish instancing, but newer GPUs and versions of the OpenGL API support instancing natively. The rotation, translation and other twist-spindle-mutilate that gets done to objects is usually a function of code in the shader plus a few global boundary variables (to specify the limits that all the blades of grass must not go beyond).

> > #3: Make each batch contain up to 65536 vertices so the GPU only need access a 16-bit index to render each vertex. Objects with more than 65536 vertices can be rendered in a single batch with 32-bit indices, or divided into multiple batches with up to 65536 indices in each. Making batches 8192 elements or larger has substantial rendering speed advantages, while batches larger than 65536 vertices have minor speed benefit with current-generation GPUs (which contain roughly 512 ~ 1024 cores divided between vertex shaders and pixel shaders [and sometimes also geometry shaders, gpgpu/compute shaders, and others]).
>
> Sure about that? One advantage of indexed drawing is the automatic use of cached results from the vertex shader. This will work just as fine with 32-bit indices. I don't know for sure.

32-bit indices work fine. They're just a tad slower than 16-bit indices.
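As a rough illustration of the batching rule in #3, the sketch below packs consecutive objects into batches of at most 65536 vertices, so that each batch can be drawn with 16-bit indices (`GL_UNSIGNED_SHORT` in `glDrawElements`), and puts any larger object into its own batch flagged for 32-bit indices (`GL_UNSIGNED_INT`). The names `Batch` and `packBatches` are hypothetical, and real batching in my scheme would also group by spatial region, which is left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A 16-bit index can address vertices 0..65535.
constexpr std::size_t kMax16BitVertices = 65536;

struct Batch {
    std::vector<std::size_t> objectIds;  // indices into the input list
    std::size_t vertexCount = 0;
    bool needs32BitIndices = false;
};

// vertexCounts[i] is the number of vertices of object i.
std::vector<Batch> packBatches(const std::vector<std::size_t>& vertexCounts) {
    std::vector<Batch> batches;
    Batch current;
    for (std::size_t i = 0; i < vertexCounts.size(); ++i) {
        const std::size_t n = vertexCounts[i];
        if (n > kMax16BitVertices) {
            // Too big for 16-bit indices: give the object its own 32-bit batch.
            Batch big;
            big.objectIds.push_back(i);
            big.vertexCount = n;
            big.needs32BitIndices = true;
            batches.push_back(big);
            continue;
        }
        if (current.vertexCount + n > kMax16BitVertices) {
            batches.push_back(current);  // close the full batch, start a new one
            current = Batch{};
        }
        current.objectIds.push_back(i);
        current.vertexCount += n;
        assert(current.vertexCount <= kMax16BitVertices);
    }
    if (!current.objectIds.empty()) {
        batches.push_back(current);
    }
    return batches;
}
```

The alternative mentioned in #3, splitting a large object into several 16-bit batches, would need the object's index buffer to be partitioned and is not shown.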

> > #4: Write super-efficient 64-bit SIMD/SSE/AVX/FMA3/FMA4/etc functions in assembly-language to multiply 4x4 matrices, and also functions to transform arrays of vertices [from local to world coordinates]. These routines are amazingly fast - only a dozen nanoseconds per vertex --- per CPU. And I can make 4 to 8 cores transform vertices in parallel for truly amazing performance. This includes memory-access overhead to load the local-coordinates vertex into CPU/SIMD registers, and memory-access overhead to store world-coordinate vertices back into memory (two of them in fact, one with f64 elements and one with f32 elements for the GPU). And my vertices contain four f64x4 vectors to transform (position, zenith vector (normal), north vector (tangent), east vector (bi-tangent)).
>
> Isn't it better to let the GPU do this instead? I think a 4x4 matrix multiplication takes one cycle only? Hard to beat.

Oh no! A 4x4 matrix multiply, or transforming a vertex with 1 position and 3 vectors, takes much more than 1 cycle on the GPU (and GPU cycles are typically ~4 times slower than CPU cycles). Modern GPUs are not 4-wide any more. They found a few years ago that making each core 1-wide and increasing the number of GPU cores was a better tradeoff, especially as shaders got more sophisticated and did a lot of other work besides 4-wide operations. The fact is, one core in a 64-bit CPU can perform a matrix-multiply or vertex transformation much faster than one core in a GPU. However, a GPU has about 512~1024 cores these days, while the CPU only has 8.

##### Share on other sites

> Oh no! A 4x4 matrix multiply, or transforming a vertex with 1 position and 3 vectors takes much more than 1 cycle on the GPU (and GPU cycles are typically ~4 times slower than CPU cycles). Modern GPUs are not 4-wide any more. They found a few years ago that making each core 1-wide and increasing the number of GPU cores was a better tradeoff, especially as shaders got more sophisticated and did a lot of other work besides 4-wide operation. The fact is, one-core in a 64-bit CPU can perform a matrix-multiply or vertex transformation much faster than a GPU. However, a GPU has about 512~1024 cores these days, while the CPU only has 8.

Measuring a single transformation in isolation is not going to give you a meaningful comparison. As you noted, GPUs are much more parallel than CPUs (that's not going to change for the foreseeable future), and they also have the advantage that this parallelism comes for free - there's no thread sync overhead or other nastiness in your program. At the most simplistic level, each core of an 8-core CPU would need to transform vertices 64 times faster than each core of a 512-core GPU (512 / 8 = 64; the actual number will be different but in a similar ballpark) for this to be advantageous.
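Spelled out as code, that back-of-the-envelope estimate is just the ratio of core counts. The helper below is purely illustrative (the function name is made up) and deliberately ignores clock speed, SIMD width, memory bandwidth and sync overhead, which is why the real figure will differ.

```cpp
#include <cassert>

// Per-core speed-up a CPU core needs over a GPU core just to match
// total throughput, assuming perfect scaling on both sides.
double breakEvenPerCoreSpeedup(unsigned cpuCores, unsigned gpuCores) {
    assert(cpuCores > 0);
    return static_cast<double>(gpuCores) / static_cast<double>(cpuCores);
}
```

For the hardware discussed in this thread, `breakEvenPerCoreSpeedup(8, 512)` gives 64 and `breakEvenPerCoreSpeedup(8, 1024)` gives 128.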

Another crucial factor is that performing these transforms on the GPU allows you to use static vertex buffers. Do them on the CPU and you're needing to stream data to the GPU every frame, which can fast become a bottleneck. Yes, bandwidth is plentiful and fast, but it certainly doesn't come for free - keeping as much data as possible 100% static for the lifetime of your program is still a win. Not to mention a gain in terms of reduction of code complexity in your program, which I think we can all agree is a good thing. Edited by mhagain

##### Share on other sites
Here's my take, at a fairly high level, on a generic rendering mechanism. It may however be more like last-gen than next-gen. Some of it is implemented in Urho3D. It only considers separate objects; static world geometry does not have special handling. No automatic data reorganization for improved batching is done.

- Update animation for scene objects such as animated meshes and particle emitters. Use multiple cores for this. For each animated object, check previous frame's visibility result and skip or reduce update frequency if invisible. This may yet cause objects' bounds to change, meaning their position in the spatial graph has to be re-evaluated.

- Make sure the spatial graph (I use an octree) is up to date. Reinsert objects that have moved or have changed their bounds since last frame. This I don't know how to multi-thread safely.

- Query the spatial graph for objects and lights in the main view frustum. If occlusion is to be utilized, query occluders first (for me, there's a low-resolution CPU depth-only rasterizer in use). Detect possible sub-views, such as shadow map frustums or camera monitor views. Process sub-views and occlusion testing on multiple cores. Aggressively optimize unnecessary shadowcasters away by checking if their extruded bounds (extruded in light perspective) actually cross the main view frustum.

- Construct rendering operations from the visible objects. Automatically group operations for instancing when possible (same geometry/same shader/same material/same light). NB: I use either deferred rendering, or primitive multipass forward rendering, so the grouping is relatively easy. For a multi-light single-pass forward renderer I imagine it is harder. Instancing objects with too many vertices can actually be detrimental so a threshold value can be used.

- Sort rendering operations. Speed this up using multiple cores. For opaque geometry, sort both by state and distance for front-to-back ordering. The method I use is to check the closest geometry rendered with each unique renderstate, and use that info to determine the order of renderstates. This needs experimental testing to determine whether for example minimizing shader changes is more important than more strict front-to-back order.

- Render. Optimize where possible, such as using the stencil buffer & scissor to optimize light influences, render shadow maps first to avoid a lot of render target changes, and skip unnecessary buffer clears.
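
The sorting step in the list above (order renderstates by the closest geometry drawn with each one, then go front-to-back within a state) could be sketched roughly like this. The `RenderOp` struct and its fields are hypothetical and not taken from Urho3D; a real engine would likely pack this into a sort key instead of using a hash map.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical render operation: which renderstate it needs and how far away it is.
struct RenderOp {
    std::uint32_t state_id;  // identifies a unique renderstate (shader, material, ...)
    float distance;          // distance from the camera
    int object_id;
};

// Sort opaque operations by state and distance.
void sort_opaque(std::vector<RenderOp>& ops) {
    // Pass 1: find the closest geometry rendered with each unique renderstate.
    std::unordered_map<std::uint32_t, float> closest;
    for (const RenderOp& op : ops) {
        assert(op.distance == op.distance);  // NaN would break the ordering below
        const auto it = closest.find(op.state_id);
        if (it == closest.end() || op.distance < it->second) {
            closest[op.state_id] = op.distance;
        }
    }
    // Pass 2: order states by their closest geometry, keep each state's
    // operations together, and go front-to-back inside a state.
    std::sort(ops.begin(), ops.end(),
              [&closest](const RenderOp& a, const RenderOp& b) {
                  const float ca = closest.at(a.state_id);
                  const float cb = closest.at(b.state_id);
                  if (ca != cb) return ca < cb;
                  if (a.state_id != b.state_id) return a.state_id < b.state_id;
                  return a.distance < b.distance;
              });
}

// Number of renderstate switches the sorted list would cause.
int count_state_changes(const std::vector<RenderOp>& ops) {
    int changes = 0;
    for (std::size_t i = 1; i < ops.size(); ++i) {
        if (ops[i].state_id != ops[i - 1].state_id) ++changes;
    }
    return changes;
}
```

This ordering gives up strict front-to-back order in exchange for one switch per renderstate, which is the tradeoff the step says needs experimental testing.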

Btw. back when I did the 10000 cube experiment on various rendering engines I found that Unity rewrites a dynamic vertex buffer each frame for small batched objects (apparently it pre-transforms them on the CPU), so it seems they found that optimal at least for some cases. However I found hardware instancing of small meshes to perform much better, so I believe they do that to optimize for old API versions *without* hardware instancing support, and simply chose not to implement a separate hardware instancing path at least for the time being. Edited by AgentC
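
The grouping step from the list above (same geometry, shader, material and light, with a vertex-count threshold above which instancing is skipped) might look something like the following. All type and function names here are invented for the sketch, and the threshold value is something you would have to find by measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical description of one visible object to draw.
struct DrawItem {
    int geometry_id;
    int shader_id;
    int material_id;
    int light_id;
    int vertex_count;
};

// One draw call; instance_count > 1 means a hardware-instanced draw.
struct Batch {
    int geometry_id;
    int shader_id;
    int material_id;
    int light_id;
    int instance_count;
};

// Merge items that share geometry, shader, material and light into one batch,
// unless the geometry has more vertices than the instancing threshold.
std::vector<Batch> group_for_instancing(const std::vector<DrawItem>& items,
                                        int max_instanced_vertices) {
    assert(max_instanced_vertices >= 0);
    using Key = std::tuple<int, int, int, int>;
    std::map<Key, std::size_t> batch_index;  // key -> position in `batches`
    std::vector<Batch> batches;
    for (const DrawItem& item : items) {
        const bool too_big = item.vertex_count > max_instanced_vertices;
        const Key key{item.geometry_id, item.shader_id, item.material_id, item.light_id};
        const auto found = batch_index.find(key);
        if (!too_big && found != batch_index.end()) {
            ++batches[found->second].instance_count;  // join the existing batch
            continue;
        }
        if (!too_big) batch_index[key] = batches.size();
        batches.push_back(Batch{item.geometry_id, item.shader_id,
                                item.material_id, item.light_id, 1});
    }
    return batches;
}
```

Objects over the threshold each get their own non-instanced draw, so large meshes are never merged.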

##### Share on other sites

Thanks for a good summary of optimization techniques; it looks like you really know what you are talking about.

I am now trying to optimize my application for state changes. Is there some easy rule of thumb for which state changes are important to optimize for? In the worst case, I suppose a state change will delay the pipeline.

##### Share on other sites
In general, the CPU/GPU communicate by sharing a ring-buffer in RAM. The CPU writes command packets into this buffer, and the GPU reads these command packets out of the buffer (much like sending packets over TCP, etc...).

                                      || CPU write cursor
                                      \/
    [...][set texture][set stream][draw primitives][...]
      /\
      || GPU read cursor

Graphics APIs like D3D/GL hide these details from the user, but as an approximation, you can think of every D3D call which starts with "Set" or "Draw" as being a fairly simple function that writes command packets into a queue... However, it can be a bit more tricky, because D3D/GL might be lazy when it comes to sending out packets -- it's possible, for example, that when I call `SetStreamSource` the 'SetStreamSource' packet isn't immediately written; instead, the API might set a "stream dirty" flag to true, and then later, when I call `DrawIndexedPrimitive`, it checks that the flag is true, and sends out both the 'SetStreamSource' command and the 'DrawIndexedPrimitive' command at the same time.
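
To make the lazy "dirty flag" behaviour concrete, here is a toy model of it. `ToyDevice` is invented for this sketch: it only borrows the two D3D method names, uses a plain vector of strings instead of a real ring buffer, and says nothing about how any actual driver is implemented.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a graphics device that writes command packets lazily.
class ToyDevice {
public:
    // Only records the request; no packet is written yet.
    void SetStreamSource(int buffer_id) {
        pending_stream_ = buffer_id;
        stream_dirty_ = true;
    }

    // Flushes any dirty state first, then writes the draw packet.
    void DrawIndexedPrimitive(int index_count) {
        assert(index_count > 0);
        if (stream_dirty_) {
            packets_.push_back("SetStreamSource " + std::to_string(pending_stream_));
            stream_dirty_ = false;
        }
        packets_.push_back("DrawIndexedPrimitive " + std::to_string(index_count));
    }

    // The command packets the GPU would eventually read, in order.
    const std::vector<std::string>& packets() const { return packets_; }

private:
    std::vector<std::string> packets_;
    int pending_stream_ = 0;
    bool stream_dirty_ = false;
};
```

In this model, calling `SetStreamSource` twice before a draw produces a single 'SetStreamSource' packet, and timing the `SetStreamSource` call on its own would make it look almost free because the work happens inside the following draw call.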

Whenever you're reducing the number of D3D/GL commands or re-sorting the order of these commands, you're changing the optimisation of both the CPU and the GPU -- You're affecting the way in which D3D/GL actually writes out its command buffer, and you're changing the command stream that the GPU will execute.

Some optimisation strategies will make only one of these processors run faster, and will actually slow down the other processor, and this can be valid -- e.g. if your CPU-side code is very fast, then you might choose to perform some GPU optimisations which have the effect of adding 1ms to your CPU execution times. And in other situations, you might do the opposite.
There are no general answers here, only situation-dependent advice.

That said, the CPU-side cost of a state-change is usually fairly predictable, and fairly constant. You can easily measure these costs by timing each D3D/GL call. Just note that if the state-change calls seem very fast, the API may have deferred their work, so make sure that you're correlating them with the draw-call which follows them, and to be extra safe, issue a command-buffer flush after each draw-call so you know that the command packets have actually been computed and sent.

On the GPU-side, the costs vary by extreme amounts depending on (1) what you are rendering and (2) the actual GPU hardware and drivers.
Advice which makes Game A run twice as fast may make Game B run twice as slow.

For example, let's say that when a 700MHz GPU switches shader programs, it requires a 100,000 cycle delay (100k@700MHz = 0.14ms). This means that every time you switch the shader program state, you're adding 0.14ms to your frame-time.
However, now let's also say that this GPU is pipelined in a way where it can be shading pixels at the same time that it performs this 100000-cycle state-change. Now, as long as the GPU has over 0.14ms of pixel-shading work queued up (which depends on the complexity of the pixel shader, the amount/size of triangles drawn, etc...) when it reaches the change-shaders command, then there will be no delay at all! If it's only got 0.04ms of pixel-shading work queued up when it reaches this command, then there will be a 0.1ms delay (pipeline bubble).
So you can see, that even if you know the exact cost of a piece of GPU work, the actual time along your critical path depends greatly on all of the surrounding pieces of GPU work as well (in this case, the total waste from switching shaders depends on the pixel operations already being worked on, which depend on the vertex operations, etc..).
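
The arithmetic in this example can be written down as a tiny model. The function names are made up for the sketch, and the model assumes the state change overlaps perfectly with whatever pixel-shading work is already queued, which is the simplification used in the example above.

```cpp
#include <algorithm>
#include <cassert>

// Convert a cycle count at a given clock rate into milliseconds.
double cycles_to_ms(double cycles, double clock_hz) {
    assert(clock_hz > 0.0);
    return cycles / clock_hz * 1000.0;
}

// Stall added to the critical path by a state change that can overlap with
// queued work: only the part not hidden behind that work shows up.
double pipeline_bubble_ms(double state_change_ms, double queued_work_ms) {
    return std::max(0.0, state_change_ms - queued_work_ms);
}
```

Here `cycles_to_ms(100000, 700e6)` is about 0.14 ms; with 0.04 ms of work queued the bubble is about 0.1 ms, and with 0.14 ms or more queued it is zero.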

You could have 10 items of work, which all take 1ms each when run in isolation... but when you run them all together, they only take 3ms, thanks to pipelining... So simply attaching a 'cost' to these items doesn't mean anything. In this case, each 'item' takes 1ms OR 0.3ms (or maybe your item cost 0ms and another item cost 3ms) -- and you can't tell which, without knowing what else is being processed at the same time as your item, or by using a good profiling tool.

If you want to optimise your GPU usage, you've really got to get a good GPU profiler which can tell you about what the hardware is doing internally, where any stalls are in its pipeline, which features are used at capacity, which operations are forming the bottleneck, etc...

...but to answer the question:
"Is there some easy rule of thumb for which state changes are important to optimize for?"
http://developer.dow...g_Guide_G80.pdf
http://origin-develo...hBatchBatch.pdf
http://developer.amd...mming_guide.pdf Edited by Hodgman
