Advanced Render Queue API


Let me preface this by saying I have read everything I could find on the web on this topic, but have been unable to answer my question.

This list includes most notably:

I understand sorting the draw calls and how to use this system under the fixed-function pipeline. However, when I change to a programmable pipeline, I can't seem to come up with how a "Render Operation" would be structured.

The best solution I could come up with was an object list for both uniforms and attributes. However, I can't see how a VAO or VBO would fit into this scheme.

Roughly, my code might look like:


class ShaderProgram {
public:
  void Create(char*,char*);
  void MakeCurrent();
  ShaderAttributeInfo const & GetAttribute(const char* name);
  ...

private:
  List<ShaderAttributeInfo> m_Attributes;
  List<ShaderUniformInfo> m_Uniforms;
};

class ShaderAttributeInfo {
  GLint handle;
  GLenum type;

public:
  void Set(void* data);
};

class ShaderAttrib {
  ShaderAttributeInfo* info;
  void* data;
  ...
};

... // Same basic structure for Uniforms

class RenderOperation {
  List<ShaderAttrib> m_Attributes;
  List<ShaderUniform> m_Uniforms;
  ShaderProgram* m_Program;
};

I'm trying to make my system as versatile and efficient as possible.

To reiterate, my question is: how would a "Render Operation" be structured for a render queue using a programmable pipeline? The operation should allow for any sort of uniform or attribute, and should accommodate VAOs and VBOs. I don't need to support the fixed-function pipeline, so don't worry about that.

I hope I made my confusion clear enough. Thanks a ton.


For my render queue, which targets OpenGL 3.2+, I have a RenderTask that carries the following information (a rough struct sketch follows after the list):

  • VAO (a pointer to my VAO class instance, but it could be just the GLuint of the VAO)
  • Shader (same as above: I use a pointer to my Shader class instance, but it could be simplified to the GLuint of the program)
  • Uniform map (rarely used, see below)
  • Texture unit map
  • RenderState
  • Number of indices and vertices
  • Base vertex (optional)
  • Base index (optional)
  • UBO (pointer and size of a buffer)
  • Sorting key (it starts out empty and is computed by the queue itself; depending on which queue it is, different task attributes are taken into account to create the key)
  • PrimitiveType (TRIANGLES, LINES, POINTS)

That's about it, I think. This list grew over time, as I started with a different setup (for example, I kept info about the VBO/IBO, but then realized I'm only using VAOs anyway, so I don't really need that information here).
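To make that concrete, here is a rough sketch of such a task as plain data. The type names (VertexArray, Shader, RenderState) are just illustrative stand-ins, not my exact classes:

#include <cstdint>
#include <cstddef>
#include <map>

class VertexArray;   // wraps a GLuint VAO name
class Shader;        // wraps a GLuint program name
struct RenderState;  // blending, scissor, depth state, ...

enum class PrimitiveType { Triangles, Lines, Points };

struct RenderTask
{
    VertexArray*  vao    = nullptr;            // or just the GLuint of the VAO
    Shader*       shader = nullptr;            // or just the GLuint of the program
    RenderState*  state  = nullptr;
    std::map<int, unsigned> textures;          // texture unit -> GL texture name
    // uniform map omitted here, see the "Uniform Map" section below
    const void*   uboData = nullptr;           // pointer and size of a uniform buffer
    std::size_t   uboSize = 0;
    int           indexCount  = 0;
    int           vertexCount = 0;
    int           baseVertex  = 0;             // optional
    int           baseIndex   = 0;             // optional
    std::uint64_t sortKey     = 0;             // filled in by the queue, not the task
    PrimitiveType primitive   = PrimitiveType::Triangles;
};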

Now I will explain a bit most important aspects.

Uniform Map

This is a class I'm not really happy with (it uses void* and similar tricks to store different values in a variant-like style). It lets me store uniform values in a way that defers the glUniform* calls to a later stage (rendering). It works like this:


task.uniforms["myuniform"] = value;

I don't think I even use it now, except for quick prototyping; I started using UBOs, so I put all my uniforms there.
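For completeness, a deferred uniform map like that could be sketched with std::variant instead of the void* storage I actually use - something like this (simplified, assuming an OpenGL loader such as glad or GLEW is already set up):

#include <array>
#include <string>
#include <unordered_map>
#include <variant>

using UniformValue = std::variant<int, float, std::array<float, 4>>;

struct UniformMap
{
    std::unordered_map<std::string, UniformValue> values;

    // Deferred application: the renderer calls this at submit time,
    // after the program has been bound with glUseProgram.
    void Apply(GLuint program) const
    {
        for (const auto& [name, value] : values)
        {
            const GLint location = glGetUniformLocation(program, name.c_str());
            if (location < 0)
                continue; // uniform not active in this program

            if (std::holds_alternative<int>(value))
                glUniform1i(location, std::get<int>(value));
            else if (std::holds_alternative<float>(value))
                glUniform1f(location, std::get<float>(value));
            else
                glUniform4fv(location, 1, std::get<std::array<float, 4>>(value).data());
        }
    }
};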

Texture unit map

My textures have pre-defined units that correspond to various texture types (base/diffuse texture, normal map, g-buffer textures, shadow maps, etc.), so when I set textures for a certain render task I use this convention:


task.textures[TextureUnit::BASE] = baseTextureId;
task.textures[TextureUnit::NORMAL] = normalTextureId;
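At render time the renderer just walks that map and binds each texture to its fixed unit - roughly like this (a simplified sketch, assuming 2D textures and a unit-to-texture-name map as in the struct sketch above):

#include <map>

void BindTaskTextures(const std::map<int, unsigned>& textures)
{
    for (const auto& [unit, textureName] : textures)
    {
        glActiveTexture(GL_TEXTURE0 + unit);      // e.g. TextureUnit::BASE == 0 (assumed for this sketch)
        glBindTexture(GL_TEXTURE_2D, textureName);
    }
}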

Render state

Here I put things like blending, scissor tests and other settings that affect rendering and can be grouped this way. I have a RenderState class that lets me set it all easily and works well with my GLStateCache class, which tracks state changes and prevents unnecessary calls (like setting the same scissor test over and over across a few draws, when you could just skip setting it until the scissor test is disabled or the values change).
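For example, just the scissor part of such a cache could look roughly like this (illustrative only, not my actual class):

class GLStateCache
{
public:
    // Skips glEnable/glDisable/glScissor calls that would not change anything.
    void SetScissor(bool enabled, GLint x, GLint y, GLsizei width, GLsizei height)
    {
        if (enabled != m_scissorEnabled)
        {
            if (enabled) glEnable(GL_SCISSOR_TEST);
            else         glDisable(GL_SCISSOR_TEST);
            m_scissorEnabled = enabled;
        }
        if (enabled && (x != m_x || y != m_y || width != m_width || height != m_height))
        {
            glScissor(x, y, width, height);
            m_x = x; m_y = y; m_width = width; m_height = height;
        }
    }

private:
    bool    m_scissorEnabled = false;
    GLint   m_x = 0, m_y = 0;
    GLsizei m_width = 0, m_height = 0;
};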

Base vertex/base index

This is optional information for drawing with glDrawElementsBaseVertex/glDrawRangeElementsBaseVertex, or with glDrawElements using an index offset. Basically, I don't specify at this point which draw command is to be used; it's determined based on whether these attributes are non-zero, etc. The most appropriate call for a given task is used.
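As a sketch, that selection could look something like this (assuming unsigned int indices and the field names from the struct sketch above; the index offset is a byte offset into the index buffer bound through the VAO):

void IssueDraw(const RenderTask& task, GLenum mode)
{
    // Byte offset of the first index to use within the bound index buffer.
    const void* indexOffset = (const void*)(task.baseIndex * sizeof(unsigned int));

    if (task.baseVertex != 0)
        glDrawElementsBaseVertex(mode, task.indexCount, GL_UNSIGNED_INT,
                                 indexOffset, task.baseVertex);
    else
        glDrawElements(mode, task.indexCount, GL_UNSIGNED_INT, indexOffset);
}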

Sorting key

The sorting key is generated by the queue, not the task itself. I do this because one task may be sent to different queues (DEBUG, OVERLAY, TERRAIN, OBJECTS, FOLIAGE) and each queue may have different sorting criteria, so when a RenderTask is added to a queue it first goes through key generation, which builds the sortingKey based on what matters for that queue.
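For example, a key generator for an opaque-objects queue might pack the fields that queue cares about (shader first, then VAO) into a 64-bit integer; the field widths here are made up purely for illustration:

#include <cstdint>

// Sort primarily by shader, then by VAO, to minimize state changes.
// A transparent, depth-sorted queue would pack depth into the high bits instead.
std::uint64_t MakeOpaqueSortKey(std::uint32_t shaderId, std::uint32_t vaoId)
{
    return (std::uint64_t(shaderId & 0xFFFF) << 32) | std::uint64_t(vaoId);
}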


I'm still not 100% happy with it, and this setup hasn't been tested under any heavy load - my game is at a very early stage and all I've done so far is render UI and simple world geometry - but it works pretty well. It lets different renderers fetch tasks from whichever queue they need (I separated the renderers into their own classes: OverlayRenderer, DeferredWorldRenderer, DebugRenderer) and they each fetch from one (sometimes more than one) specific queue.

This allows for fun stuff like adding the same task to both the WORLD and DEBUG queues, which renders the object with two renderers - one normal, and the other as wireframe or with visualised normals.

Hope it helps. I'm open to any discussion about this, because the topic is very interesting and finding information is pretty hard - there are a few big GameDev threads and some information scattered around the web, but it's hard to piece it all together. My implementation emerged from some reading, but also from my own ideas that came up once I actually started writing it and using it for rendering.


Thanks for replying Noizex. You gave some great input. I was wondering if you perform any batching of your Render Tasks and if so, how? I don't see any place for attributes that aren't in your VAO.

I don't really batch them. I plan to draw some objects with instancing (not sure yet whether I'll submit one RenderTask that has additional info for instanced drawing, or many RenderTasks and somehow determine in the renderer that they should be instanced and collapsed into one draw call). I do batch things before I submit a RenderTask, too, because that's often very specific to the thing being drawn.

The whole thing is flexible enough that I don't optimize much yet (like batching everything), because it's more convenient to have one VAO per object and just draw each object with its own draw call. If I ever run into a draw-call problem I can batch things just by modifying how tasks are processed by the renderer. I will definitely batch things like foliage / grass / particles and other things that would otherwise waste too many draw calls, but for normal objects / terrain I don't really want to optimize yet, unless I see that I end up with too many calls.



If possible, use UBOs instead of plain uniforms... The GL2.x method of managing uniforms is over-complicated and doesn't match the hardware at all, causing you to write a heap of complex code in your engine, and causing the driver to then have to emulate a bunch of cruft...
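For reference, the basic UBO flow is roughly this (a simplified GL sketch; the block name, layout and binding point are arbitrary):

// GLSL side, std140 so the C++ struct below matches the block layout:
//   layout(std140) uniform PerObject { mat4 world; vec4 tint; };

struct PerObject
{
    float world[16];
    float tint[4];
};

void UploadPerObject(GLuint program, GLuint ubo, const PerObject& data)
{
    const GLuint bindingPoint = 0; // arbitrary choice for this sketch

    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferData(GL_UNIFORM_BUFFER, sizeof(PerObject), &data, GL_DYNAMIC_DRAW);

    const GLuint blockIndex = glGetUniformBlockIndex(program, "PerObject");
    if (blockIndex != GL_INVALID_INDEX)
    {
        glUniformBlockBinding(program, blockIndex, bindingPoint);
        glBindBufferBase(GL_UNIFORM_BUFFER, bindingPoint, ubo);
    }
}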

In my engine:

A RenderItem is a DrawCall and an array of StateGroup pointers.

StateGroups are sets of States, which map to all the GL/D3D parameters that affect how a draw-call functions, including a dozen or so UBO (AKA cbuffer) binding-slots (per shader stage), the VAO (AKA vertex-declaration/input-layout + a dozen or so vertex-buffer slots), a dozen or so texture slots (per shader stage), the rasterization state, the blend mode, the depth-stencil state, etc...

RenderItems are then put into (sortable) collections (RenderGroups), which are submitted with an associated RenderStage (which describes the render-target/MRT/FBO state and any clear/resolve operations).
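As data, that might look very roughly like this (heavily simplified compared to the real thing):

#include <cstddef>
#include <cstdint>
#include <vector>

struct DrawCall            // e.g. draw-indexed / draw-indexed-instanced parameters
{
    std::uint32_t indexCount    = 0;
    std::uint32_t startIndex    = 0;
    std::uint32_t instanceCount = 1;
};

struct State { /* one parameter group: a UBO slot, a texture slot, blend mode, ... */ };

struct StateGroup          // a set of States, shared between many items
{
    std::vector<State> states;
};

struct RenderItem          // a draw call plus the state groups it needs
{
    DrawCall           draw;
    const StateGroup** stateGroups     = nullptr;  // array of StateGroup pointers
    std::size_t        stateGroupCount = 0;
};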

Hodgman, are your states then inherited from a State base class? If not, how do you determine how to set the state the correct way? I would think you would want to avoid the overhead of virtual functions on something so low-level in the engine.
Yeah they're pretty much implementations of a base-class interface, but they don't use virtual.

In theory, virtual shouldn't be that slow, so it's important to understand what makes it "slow" in practice.
When calling a regular function, the compiler can hard-code a "jump to this address" instruction -- a completely predictable branch, with the main performance pitfall that the jump may cause an i-cache miss, if you haven't called that function recently (i.e. if the function's code isn't already in the cache).
When calling a virtual function, it needs instructions to read the first word of the object (the vtable address), then add some hard-coded offset to that word to get the address of the vtable entry for the function we're after, then fetch the word at that address, and finally jump to the address in that word.
This has the same performance issue as above (an i-cache miss if the function isn't resident), but also the branch is harder to predict because the branch-target isn't known until right before you perform the jump (this matters more on PPC than x86). Further, we need to perform two additional reads from memory, which could each cause d-cache misses -- the first one (the vtable address fetch) is likely if you've not read a member from this object recently, and the second one (the function pointer fetch) is likely to miss if you've not called a virtual function on this type of object recently. If the function-call has to read members of the object, then the 1st one doesn't matter, as you've just moved the cache miss from inside the function to right before it! Cache misses are really slow (hundreds of cycles), so they're a much bigger deal than the few extra instructions or the branch misprediction.

So, the main reason that virtual functions are slow, is that you might cause a cache-miss when fetching a function pointer from a vtable. There's not much you can do about this, because the compiler implements vtables behind the scenes, leaving you with no control over where they live in RAM.

So, for this particular case (renderer commands), I implemented a vtable-type system myself, where every different command class shares the same "vtable" -- this means that if I execute 100 commands in a row (from a pre-prepared command buffer), then the first one will generate a cache-miss when reading this "vtable", but the following commands are less likely to cause the same cache-miss.
e.g.
namespace Commands
{
	enum Type
	{
		Foo,
		Bar0,
		Bar1,
		Bar2,
	};
}

struct Command
{
	u8 id;
};
struct Foo
{
	u8 id;
	int value;
};
struct Bar
{
	u8 id;
	int value;
};

void Submit_Foo(Device& device, Command& command)
{
	assert( command.id == Commands::Foo );
	Foo& foo = *(Foo*)&command;
	device.DoFoo( foo.value );
}

void Submit_Bar(Device& device, Command& command)
{
	assert( command.id >= Commands::Bar0 && command.id <= Commands::Bar2 );
	Bar& bar = *(Bar*)&command;
	device.SetBarSlot( bar.id - Commands::Bar0, bar.value );
}

typedef void(*PFnSubmit)(Device&, Command&);
static PFnSubmit g_CommandTable[] =
{
	&Submit_Foo,//Commands::Foo
	&Submit_Bar,//Commands::Bar0
	&Submit_Bar,//Commands::Bar1
	&Submit_Bar,//Commands::Bar2
};

inline void SubmitCommand(Device& device, Command& command)
{
	g_CommandTable[command.id](device, command);
}

//then, if you're really fancy, you can cut down on cache misses by pre-fetching or unrolling when executing a batch of commands
void SubmitCommands(Device& device, CommandIterator commands)
{//n.b. pseudo code
	prefetch g_CommandTable
	foreach command in commands
		prefetch next command.id
		SubmitCommand( device, command )
}

So if everything is a command, do you batch ahead of time like Noizex? Say for example, geometry instancing, how would you collect each instance for creation of the commands? And thanks for such a good example. You've helped clarify a ton already :D

So if everything is a command, do you batch ahead of time like Noizex? Say for example, geometry instancing, how would you collect each instance for creation of the commands?

Yeah, like Noizex, I don't perform any merging of commands at this level of the library.
e.g. As I mentioned earlier, my RenderItem contains a DrawCall -- I never take two items with DrawIndexed calls and merge them into a single DrawIndexedInstanced call; that's the responsibility of the next layer up, which generates RenderItems.

So, I've got multiple layers within the renderer:
1st -- you can submit a sequence of commands to a device.
2nd -- you can take a sequence of RenderItems and submit them, which sends a (culled) sequence of commands to the 1st layer.
3rd+ -- you can generate a sequence of RenderItems somehow, and pass it to the 2nd layer.

n.b. DrawCall and State both "inherit" from Command, so at the 2nd layer they're different things, but at the 1st layer, everything is the same.

The 1st layer is very simple, quite similar to the example code I posted above.
The 2nd layer can perform sorting of RenderItems, and does redundant state-change optimizations -- e.g. if a render-item contains a state that was previously submitted, then it won't be submitted again. The 2nd layer can also write commands to a "command buffer" instead of submitting them directly to the device, which is used when performing rendering tasks on a background thread that isn't able to access the device itself.
The 3rd (and higher) layers are where the "higher level" rendering ideas live, like scene management, etc...

There can be different rendering systems in the 3rd layer. e.g. I might have a system that performs culling of the static world geometry, and collects the static RenderItem objects required to draw the level itself. Then, I might have a different system that procedurally generates RenderItems required to draw particle systems, etc...
Say I've got a crowd rendering system, where the characters are all instanced, then this system might find all the visible characters, then generate a single RenderItem that contains a DrawIndexedInstanced call, and the necessary state to draw all those characters.
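As a rough sketch, reusing the simplified RenderItem/StateGroup types from above (UploadInstanceBuffer and the index count are made up for illustration):

#include <cstddef>
#include <cstdint>
#include <vector>

struct InstanceData { float world[16]; };                     // per-character transform

void UploadInstanceBuffer(const std::vector<InstanceData>&);  // hypothetical helper
constexpr std::uint32_t kCharacterIndexCount = 3000;          // example mesh size

// Gathers the visible characters' per-instance data and emits one instanced item.
RenderItem BuildCrowdRenderItem(const std::vector<InstanceData>& visibleInstances,
                                const StateGroup** crowdStateGroups,
                                std::size_t crowdStateGroupCount)
{
    UploadInstanceBuffer(visibleInstances);   // fill the shared per-instance buffer

    RenderItem item;
    item.draw.indexCount    = kCharacterIndexCount;
    item.draw.startIndex    = 0;
    item.draw.instanceCount = std::uint32_t(visibleInstances.size());
    item.stateGroups        = crowdStateGroups;   // shader, mesh VAO, instance buffer, ...
    item.stateGroupCount    = crowdStateGroupCount;
    return item;
}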

Hodgman, wouldn't this violate the strict aliasing rule when you cast a Command reference to a Foo or Bar reference, or vice versa?

