Renderer Design

As I rework my renderer into something more sophisticated than just brute-forcing its way through all the geometry in a scene, I've come up with a few questions.

- My first question involves rendering order. I read quite frequently that opaque geometry should be rendered front to back, and translucent geometry back to front. I also read quite frequently that render data should be sorted per material, texture and shader to avoid OpenGL state changes. Which of these sorts should take priority? I'm assuming it's the former, the spatial sort, for a couple of reasons. Translucent geometry obviously HAS to be rendered in a particular order if you want transparency to be correct. I'd also guess that the benefit of early out offered by rendering opaque geometry from front to back is greater than (obviously this can vary) that offered by limiting state changes, though the benefit here is a lot less obvious and seemingly dependent upon the situation.

That said, my current plan is to sort spatially and then sort that sorted data by material. I'm certain this will still offer some benefit, as static render data especially is likely to share material, texture and shader with spatially nearby geometry. (For example, my terrain geometry is broken up into small chunks to allow for only rendering visible terrain, but all this terrain shares the same material and chunks are right next to one another). I'm really just curious as to how these two methods coincide.
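To make that concrete, here's a rough sketch of one way I can imagine folding the two sorts together, using a single 64-bit key per draw (the names, key layout and DrawItem record are invented for illustration, not something I've implemented): with the material in the high bits, one sort gives material-major ordering and front-to-back order within each material, and swapping the two fields would give the spatial sort priority instead.


#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-draw record; meshId stands in for whatever handle the
// renderer uses to actually issue the draw.
struct DrawItem
{
    uint64_t key;
    uint32_t meshId;
};

// Material in the high 32 bits, quantized view-space depth in the low 32 bits.
uint64_t MakeOpaqueKey(uint32_t materialId, float viewDepth, float maxDepth)
{
    float t = viewDepth / maxDepth;
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    uint64_t depthBits = (uint64_t)(t * 4294967295.0); // 2^32 - 1
    return ((uint64_t)materialId << 32) | depthBits;
}

void SortDrawList(std::vector<DrawItem>& items)
{
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) { return a.key < b.key; });
}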

This is also relevant to where I store data, as data that is likely to be rendered at the same time should be stored in the same VBO to limit the number of bind buffer calls.

- My next question is also somewhat related to rendering order. When I have data sorted by material this should allow me to make significantly fewer gl*Pointer/draw calls, since instead of having to do this per object I can do it per material (hopefully encapsulating several objects). However, to do so I would need to append vertices to objects depending on their rendering order, such that an object rendered first renders a degenerate triangle "going into" the next object. I'm not sure how I would do this if rendering order is dynamic. If it were fixed I could do this when the geometry is initially loaded into a VBO, but since it is not, I'd have to change this degenerate vertex every frame, which seems silly and makes me think I'm going about this incorrectly. This is something I haven't been able to wrap my head around for a while. Sorting by material without worrying about draw calls would at least mean I don't have to bind textures, activate shaders and send material data as frequently, which is good, but it also seems like it should be able to hugely limit the number of draw calls that must be made (which, from my understanding, is a big performance boost). I am just not clear how to go about it.


- Third Question

This is less important in my opinion, but currently I use the following structure to store vertex data:


struct RenderVertex
{
    float x, y, z;      // position
    float nx, ny, nz;   // normal
    float r, g, b, a;   // color
    float s0, t0;       // multi tex coords
    float s1, t1;
    float s2, t2;
};


This works fine, but every time I only use position and tex coords, or position and normals, or some small subset of the struct, I cringe. Would it be worth it to have several different structures depending on the data I actually need for an object? Using this struct exclusively seems like a pretty monstrous waste of memory. I also recall the suggestion that your vertex structure size should align to 32 bytes, but I am not sure how important that is.

These are the questions I can think of right now, hopefully they are clear. Thanks for any advice.
Opaque geometry can be written in any order provided there is a Z-buffer. If you have slow shaders for this part, a depth pre-pass or rendering front-to-back helps speed it up. Alpha-blended translucent objects just need to be rendered back to front in order to look correct, and additive objects don't need to be ordered (unless mixed with alpha-blended objects as well).
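In OpenGL terms a depth pre-pass is roughly this (just a sketch; the two draw functions are placeholders for your own scene traversal, and it assumes an extension loader such as GLEW is already set up):


#include <GL/glew.h>

// Hypothetical scene-traversal hooks; replace with your own draw code.
void DrawOpaqueDepthOnly();
void DrawOpaqueShaded();

void RenderOpaqueWithDepthPrepass()
{
    // Pass 1: depth only. Disable color writes and lay down the Z-buffer
    // with the cheapest shader available.
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LESS);
    DrawOpaqueDepthOnly();

    // Pass 2: full shading. Depth writes off, GL_LEQUAL so only fragments
    // that survived the pre-pass run the expensive shaders.
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);
    glDepthFunc(GL_LEQUAL);
    DrawOpaqueShaded();
}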

For a long time changing pixel shader constants was as expensive as changing the shader itself (i.e. Geforce 9800 and older), so there isn't much to be gained by sorting by pixel shader. The hardware cost in changing the shader basically boils down to the hardware (possibly, or not!) needing to have finished processing all the pixels with the old shader before the new one can be loaded and executed. It isn't exactly a stall, but some of the shader cores are going to be idle for a tiny amount of time when that happens. The pathological case would be rendering a very small object (composed of fewer pixels than shader cores), switching the shader, and repeating the process over and over. The throughput approaches some low fraction of the hardware's best case (say 5-10%). I would imagine the newer hardware has addressed this to some degree, but nobody likes talking about it :) If you are rendering a lot of pixels per draw call then the switch ends up only costing something like 0.01% of the draw instead of 90%.

Most hardware has a way of doing a tri-strip restart index so you don't have to encode a degenerate connector with 2 vertices. This includes D3D (there is a query object to get the index if it is supported). The advantage to the restart is it doesn't pollute the post transform cache and chew up 2 of the vertices it stores. The other possibility is to just use indexed tri-lists all the time instead of strips. They are certainly much easier to work with, though the index buffers are 2-3x larger.
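In OpenGL the restart version looks roughly like this (sketch only; it needs GL 3.1+ or the NV_primitive_restart extension, and assumes the index buffer containing all the strips is already bound):


#include <GL/glew.h>

// Any index value not used by real vertices can serve as the restart marker.
const GLuint kRestartIndex = 0xFFFF;

void DrawStripsWithRestart(GLsizei indexCount)
{
    glEnable(GL_PRIMITIVE_RESTART);
    glPrimitiveRestartIndex(kRestartIndex);

    // Index buffer layout: strip0 indices, 0xFFFF, strip1 indices, 0xFFFF, ...
    glDrawElements(GL_TRIANGLE_STRIP, indexCount, GL_UNSIGNED_SHORT, 0);

    glDisable(GL_PRIMITIVE_RESTART);
}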

As for unused vertex data, you should be able to split them into separate streams and have unused components be bound to a stream containing a single vertex with a stride of 0, and have it initialized to some nice value that works for all vertices (all 0's, etc). This cuts down on the vertex shader permutations, in that the shader can just assume most of the attributes exist and operate on them as if they do, as long as the math works with the dummy data it is a good tradeoff.
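The closest OpenGL equivalent I know of is to leave the unused attribute array disabled and supply a constant current value instead, which has the same effect of letting one shader handle meshes with or without that attribute (a sketch; the attribute location and helper name are just illustrative):


#include <GL/glew.h>

// If a mesh has no per-vertex colors, disable the array and supply a constant
// value; every vertex then sees the same dummy color.
void BindColorAttribute(bool hasVertexColors, GLuint colorLoc, GLsizei stride, const void* offset)
{
    if (hasVertexColors)
    {
        glEnableVertexAttribArray(colorLoc);
        glVertexAttribPointer(colorLoc, 4, GL_UNSIGNED_BYTE, GL_TRUE, stride, offset);
    }
    else
    {
        glDisableVertexAttribArray(colorLoc);
        glVertexAttrib4f(colorLoc, 1.0f, 1.0f, 1.0f, 1.0f); // constant white
    }
}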
http://www.gearboxsoftware.com/

I'd also guess that the benefit of early out offered by rendering opaque geometry from front to back is greater than (obviously this can vary) that offered by limiting state changes, though the benefit here is a lot less obvious and seemingly dependent upon the situation.
Rendering front-to-back is a GPU optimisation (assuming HiZ/EarlyZ is functioning). Reducing state-changes is largely a CPU optimisation.
It would be best to write your sorting code in such a way that it's easy for a user of the library to configure it.
My next question is also somewhat related to rendering order. When I have data sorted by material this should allow me to make significantly fewer gl*Pointer/draw calls, since instead of having to do this per object I can do it per material (hopefully encapsulating several objects). However, to do so I would need to append vertices to objects depending on their rendering order, such that an object rendered first renders a degenerate triangle "going into" the next object. I'm not sure how I would do this if rendering order is dynamic.[/quote]
Dynamically building VBOs based on the rendering order sounds like a lot of work... I'd probably rather put up with the extra draw calls.
This is less important in my opinion, but currently I use the following structure to store vertex data:
*snip*[/quote]
64 bytes per vertex is way too big. Don't use floats unless you need them, e.g. 128-bit colour is most likely overkill. Use bytes or shorts or half-floats where possible.
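For example, something along these lines (only a sketch; the exact layout and the attribute locations 0-3 are arbitrary) gets the same kind of data down to 36 bytes per vertex, and half-float texcoords or packed normals would shrink it further:


#include <GL/glew.h>
#include <cstddef>
#include <cstdint>

struct CompactVertex
{
    float   x, y, z;      // position
    float   nx, ny, nz;   // normal
    uint8_t r, g, b, a;   // color, normalized to [0,1] by GL_TRUE below
    float   s0, t0;       // single texcoord set
};

void SetupCompactVertexPointers()
{
    const GLsizei stride = sizeof(CompactVertex);
    glVertexAttribPointer(0, 3, GL_FLOAT,         GL_FALSE, stride, (void*)offsetof(CompactVertex, x));
    glVertexAttribPointer(1, 3, GL_FLOAT,         GL_FALSE, stride, (void*)offsetof(CompactVertex, nx));
    glVertexAttribPointer(2, 4, GL_UNSIGNED_BYTE, GL_TRUE,  stride, (void*)offsetof(CompactVertex, r));
    glVertexAttribPointer(3, 2, GL_FLOAT,         GL_FALSE, stride, (void*)offsetof(CompactVertex, s0));
}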
Would it be worth it to have several different structures depending on the data I actually need for an object?[/quote]
Yes.

Opaque geometry can be written in any order provided there is a Z-buffer. If you have slow shaders for this part, a depth pre-pass or rendering front-to-back helps speed it up.
[/quote]


[color="#1C2837"]Rendering front-to-back is a GPU optimisation (assuming HiZ/EarlyZ is functioning). Reducing state-changes is largely a CPU optimisation.
It would be best to write your sorting code in a such a way that it's easy for a user of the library to configure it.
[/quote]

This makes sense. I guess which sort (front-to-back or by shader) is optimal depends on the situation, then. I suppose I will take Hodgman's suggestion and make it configurable; that will make it easy to profile and compare performance as well.

In general I tend to understand what kinds of things will offer a performance boost, but to what extent, or how one specific optimization compares to another, I can rarely say for sure. That's especially true in the context of OpenGL calls, where it's hard to really see what is going on.

For example, I know that in general:
minimizing state changes,
minimizing buffer binds,
minimizing texture binds,
minimizing shader changes,
minimizing shader uniform updates,
rendering in a specific order,
minimizing draw calls,
and basically minimizing anything that requires data to be sent over the bus to the GPU

all lead to performance gains, but doing ALL of these things simultaneously strikes me as difficult and quite complicated. I guess a broader question is how much I should be trying to accommodate all of these versus just a couple, and if the latter, which should be focused on above the others?


For a long time changing pixel shader constants was as expensive as changing the shader itself (i.e. Geforce 9800 and older), so there isn't much to be gained by sorting by pixel shader. The hardware cost in changing the shader basically boils down to the hardware (possibly, or not!) needing to have finished processing all the pixels with the old shader before the new one can be loaded and executed. It isn't exactly a stall, but some of the shader cores are going to be idle for a tiny amount of time when that happens. The pathological case would be rendering a very small object (composed of fewer pixels than shader cores), switching the shader, and repeating the process over and over. The throughput approaches some low fraction of the hardware's best case (say 5-10%). I would imagine the newer hardware has addressed this to some degree, but nobody likes talking about it :) If you are rendering a lot of pixels per draw call then the switch ends up only costing something like 0.01% of the draw instead of 90%.
[/quote]

This is quite useful, thank you. I'll take it to mean that as long as I'm not frequently shading a very small number of pixels before switching fragment shaders, I shouldn't stress too much about swapping the shader. This is the kind of in-depth information that seems useful for answering the question above!


Most hardware has a way of doing a tri-strip restart index so you don't have to encode a degenerate connector with 2 vertices. This includes D3D (there is a query object to get the index if it is supported). The advantage to the restart is it doesn't pollute the post transform cache and chew up 2 of the vertices it stores. The other possibility is to just use indexed tri-lists all the time instead of strips. They are certainly much easier to work with, though the index buffers are 2-3x larger.
[/quote]

As far as I can tell from a bit of googling, my options are indeed to use primitive restart or glMultiDrawElements to achieve this. I'm thinking of going with your latter suggestion, since my COLLADA importer already loads models as indexed triangles, and my other geometry-generating code (for planes and other primitives) can easily be changed to output triangles instead of triangle strips.
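For reference, the glMultiDrawElements route would look something like this (a sketch only; it assumes one element buffer holding every sub-mesh's indices is already bound, and the counts/offsets arrays describe where each sub-mesh starts):


#include <GL/glew.h>

// One call draws every sub-mesh as an indexed triangle list, with no
// degenerate triangles or restart indices needed.
void DrawSubMeshes(const GLsizei* counts, const void* const* indexOffsets, GLsizei subMeshCount)
{
    glMultiDrawElements(GL_TRIANGLES, counts, GL_UNSIGNED_INT, indexOffsets, subMeshCount);
}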


As for unused vertex data, you should be able to split them into separate streams and have unused components be bound to a stream containing a single vertex with a stride of 0, and have it initialized to some nice value that works for all vertices (all 0's, etc). This cuts down on the vertex shader permutations, in that the shader can just assume most of the attributes exist and operate on them as if they do, as long as the math works with the dummy data it is a good tradeoff.
[/quote]

By "split them into separate streams" do you mean not pack them into interleaved structures? I see what you're saying; again, I'm just trying to figure out what is most optimal. My understanding is that interleaved arrays are generally faster (for caching purposes?), but that if you are going to update one attribute of a vertex frequently and not the others, the frequently updated attribute should be stored elsewhere. I could be wrong, and I could be (probably am) overthinking this big time.
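Something like this is what I picture for the "frequently updated attribute stored elsewhere" case (again just a sketch, with made-up buffer names and attribute locations):


#include <GL/glew.h>

void BindMeshStreams(GLuint staticVbo, GLuint dynamicColorVbo)
{
    // Static, interleaved position + normal + texcoord (32-byte stride).
    glBindBuffer(GL_ARRAY_BUFFER, staticVbo);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 32, (void*)0);
    glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 32, (void*)12);
    glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, 32, (void*)24);

    // Colors that change every frame live in their own tightly packed buffer,
    // so re-uploading them never touches the static data.
    glBindBuffer(GL_ARRAY_BUFFER, dynamicColorVbo);
    glVertexAttribPointer(3, 4, GL_UNSIGNED_BYTE, GL_TRUE, 0, (void*)0);
}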


[color="#1C2837"]64bytes per vertex is way too big. Don't use floats unless you need them. e.g. 128-bit colour is most likely overkill. Use bytes or shorts or half-floats where possible.
[/quote]

Doh. I probably should have been able to figure that one out myself. 128-bit color is most definitely overkill for what I'm doing.

For vertex data I have been tinkering with making the vertex "structure" configurable on a per-vertex array basis. For example a VertexArray will contain an array of



struct VertexAttribute
{
    GLenum type;        // e.g. GL_FLOAT, GL_UNSIGNED_BYTE
    GLint  components;  // components per vertex, e.g. 3 for an XYZ position
    uint32 offset;      // byte offset into the vertex where this attribute starts
};


each indicating a different attribute (position, color, normal, etc.), which can be added dynamically to a VertexArray depending on what data is pulled from COLLADA or what the geometry-generation code determines is necessary for rendering. After loading, some function will use these structs to determine the actual stride of a vertex and pack everything into interleaved (or non-interleaved, I haven't decided yet) arrays for VBOs.
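The idea is that once the stride is known, a loop over those structs would drive the actual attribute setup, something like this sketch (using the array index as the attribute location is an assumption about how the shader inputs get bound):


#include <GL/glew.h>
#include <cstdint>
#include <vector>

typedef uint32_t uint32; // assumption: stands in for whatever typedef the engine uses

struct VertexAttribute
{
    GLenum type;        // e.g. GL_FLOAT, GL_UNSIGNED_BYTE
    GLint  components;  // components per vertex, e.g. 3 for an XYZ position
    uint32 offset;      // byte offset of this attribute within a vertex
};

void SetupVertexPointers(const std::vector<VertexAttribute>& attribs, GLsizei stride)
{
    for (GLuint i = 0; i < (GLuint)attribs.size(); ++i)
    {
        const VertexAttribute& a = attribs[i];
        glEnableVertexAttribArray(i);
        glVertexAttribPointer(i, a.components, a.type, GL_FALSE, stride,
                              (const void*)(uintptr_t)a.offset);
    }
}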

Thanks again, to both of you.
