31 replies to this topic

### #1melbow  Members   -  Reputation: 215


Posted 27 December 2012 - 12:37 AM

Let me preface this by saying I have read everything I could find on the web on this topic, but have been unable to answer my question.

This list includes most notably:

I understand sorting the draw calls and how to use this system under the fixed-function pipeline. However, when I change to a programmable pipeline, I can't seem to come up with how a "Render Operation" would be structured.

The best solution I could come up with was an object list for both uniforms and attributes. However, I can't see how a VAO or VBO would fit into this scheme.

Roughly, my code might look like:

class ShaderProgram {
public:
    void Create(char*, char*);
    void MakeCurrent();
    ShaderAttributeInfo const& GetAttribute(const char* name);
    ...

private:
};

    GLint handle;
    GLenum type;

public:
    void Set(void* data);
};

    void* data;
    ...
};

... // Same basic structure for Uniforms

class RenderOperation {
};

I'm trying to make my system as versatile, yet efficient as possible.

To reiterate, my question is: How would a "Render Operation" be formatted for a Render Queue using a programmable pipeline? The operation should allow for any sort of uniform or attribute and allow for VAOs and VBOs. And I don't feel the need to support a fixed function pipeline, so don't worry about that.

I hope I made my confusion clear enough. Thanks a ton.

Edited by melbow, 27 December 2012 - 12:42 AM.

### #2noizex  Members   -  Reputation: 777


Posted 27 December 2012 - 07:59 AM

For my render queue, which targets OpenGL 3.2+, I have a RenderTask that holds the following information:

• VAO (pointer to my VAO class instance, but could be just GLuint of VAO)
• Shader (same as above, I use pointer to my Shader class instance, could be simplified to GLuint of program)
• Uniform map (rarely used, see below)
• Texture unit map
• RenderState
• Number of indices and vertices
• Base vertex (optional)
• Base index (optional)
• UBO (pointer and size of a buffer)
• Sorting key (it's initially empty and is computed by the queue itself; depending on which queue it is, different task attributes are taken into account to create the sorting key)
• PrimitiveType (TRIANGLES, LINES, POINTS)

That's about it, I think. This list grew over time, as I started with a different setup (for example, I kept info about the VBO/IBO, but then I realized that I'm only using VAOs anyway, so I don't really need that information here).

Now I'll explain the most important aspects in a bit more detail.
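As a rough illustration, the field list above could be laid out as a plain struct. This is only a sketch: the field list comes from the post, but every type, name and default below is an assumption, and the uniform map is left out because it is rarely used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical stand-ins for the poster's own classes and enums.
enum class PrimitiveType : std::uint8_t { Triangles, Lines, Points };
enum class TextureUnit : std::uint8_t { Base, Normal, Shadow };

struct RenderState {
    bool blendEnabled = false;
    bool scissorEnabled = false;
};

struct RenderTask {
    std::uint32_t vao = 0;                          // GLuint of the VAO
    std::uint32_t shader = 0;                       // GLuint of the program
    std::map<TextureUnit, std::uint32_t> textures;  // texture unit map
    RenderState state;
    std::uint32_t indexCount = 0;
    std::uint32_t vertexCount = 0;
    std::int32_t baseVertex = 0;                    // optional
    std::uint32_t baseIndex = 0;                    // optional
    const void* uboData = nullptr;                  // pointer to UBO contents
    std::size_t uboSize = 0;                        // size of the UBO contents
    std::uint64_t sortKey = 0;                      // filled in by the queue
    PrimitiveType primitive = PrimitiveType::Triangles;
};
```

Keeping the task a plain value type like this would make it cheap to copy into several queues, which matters for the multi-queue usage described further down.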

Uniform Map

This is a class I'm not really happy with (uses void* and stuff like that to store different values in a variant-like style). This allows me to store uniform values in a way that defers calls of glUniformX to a later stage (rendering). It works like:

task.uniforms["myuniform"] = value;

I don't think I even use it now, except for quick prototyping. I started using UBOs so I put all uniforms there.
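For comparison, here is a minimal sketch of a deferred uniform map that avoids void* by using std::variant (C++17). This is a swapped-in technique, not what the post describes; the type names are hypothetical, and the visitor only reports which glUniform* call would be made instead of calling into GL.

```cpp
#include <array>
#include <cassert>
#include <map>
#include <string>
#include <variant>

// The set of supported uniform types is an assumption for this sketch.
using UniformValue = std::variant<int, float, std::array<float, 4>>;
using UniformMap = std::map<std::string, UniformValue>;

// Stand-in for the real GL calls: returns the name of the call that the
// render stage would issue for the stored value.
struct UniformApplier {
    std::string operator()(int) const { return "glUniform1i"; }
    std::string operator()(float) const { return "glUniform1f"; }
    std::string operator()(const std::array<float, 4>&) const { return "glUniform4fv"; }
};

// Called later, at render time, once the shader is bound.
std::string Apply(const UniformValue& value) {
    return std::visit(UniformApplier{}, value);
}
```

Usage keeps the same shape as in the post: `uniforms["myuniform"] = 0.5f;` stores the value, and `Apply` picks the matching call when the task is rendered.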

Texture unit map

My textures have pre-defined units that correspond to various texture types (base/diffuse texture, normal map, g-buffer textures, shadow maps etc.) So when I set textures for a certain render task I use this convention:

task.textures[TextureUnit::BASE] = baseTextureId;


Render state

Here I put things like blending, scissor tests and other settings that affect rendering and can be grouped like this. I have a RenderState class that allows me to set it all easily and works well with my GLStateCache class, which keeps track of state changes and prevents unnecessary calls (like setting the same scissor test over and over for a few renders, when you could just skip setting it until the scissor test is disabled or its values change).
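A state cache of this kind might look roughly like the following. It is a sketch under assumptions: StateCache and ScissorState are hypothetical names, only the scissor test is covered, and a counter stands in for the real glEnable/glScissor calls.

```cpp
#include <cassert>

struct ScissorState {
    bool enabled = false;
    int x = 0, y = 0, width = 0, height = 0;

    bool operator==(const ScissorState& o) const {
        return enabled == o.enabled && x == o.x && y == o.y &&
               width == o.width && height == o.height;
    }
};

class StateCache {
public:
    // Returns true when the state actually changed and GL would be called.
    bool SetScissor(const ScissorState& wanted) {
        if (known_ && current_ == wanted) {
            return false;  // redundant request, skip the GL call
        }
        current_ = wanted;
        known_ = true;
        ++applied_;        // stand-in for glEnable/glDisable + glScissor
        return true;
    }

    int AppliedCount() const { return applied_; }

private:
    ScissorState current_;
    bool known_ = false;   // false until the first state has been applied
    int applied_ = 0;
};
```

The `known_` flag makes the very first request go through even if it happens to equal the default-constructed state, since the real GL state is not known at that point.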

Base vertex/base index

This is optional information used when drawing with glDrawElementsBaseVertex/glDrawRangeElementsBaseVertex, or with glDrawElements and an index offset. Basically, I do not specify the draw command to be used at this point; it's determined based on whether these attributes are non-zero etc. The most appropriate call for a given task is used.
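One possible way to make that choice is sketched below. The exact rules are not given in the post, so the ordering of the checks and the DrawCall labels are assumptions; the function only returns which GL entry point would be used.

```cpp
#include <cassert>

// Hypothetical labels for the GL entry point a task would end up using.
enum class DrawCall {
    Arrays,              // glDrawArrays
    Elements,            // glDrawElements
    ElementsWithOffset,  // glDrawElements with a non-zero index offset
    ElementsBaseVertex   // glDrawElementsBaseVertex
};

DrawCall ChooseDrawCall(unsigned indexCount, int baseVertex, unsigned baseIndex) {
    if (indexCount == 0) return DrawCall::Arrays;              // non-indexed geometry
    if (baseVertex != 0) return DrawCall::ElementsBaseVertex;  // this call also takes an index offset
    if (baseIndex != 0) return DrawCall::ElementsWithOffset;
    return DrawCall::Elements;
}
```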

Sorting key

The sorting key is generated by the queue, not the task itself. I do this because one task may be sent to different queues (DEBUG, OVERLAY, TERRAIN, OBJECTS, FOLIAGE) and each queue may have different sorting criteria, so when a RenderTask is added to a queue it first goes through key generation, which produces the sortingKey based on what is important for that queue.
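To make the idea concrete, here is a minimal sketch of per-queue key generation. The bit layout, the TaskInfo fields and the two queue policies are all assumptions chosen for illustration; the post does not say what each queue sorts by.

```cpp
#include <cassert>
#include <cstdint>

enum class QueueType { Objects, Overlay };

struct TaskInfo {            // hypothetical subset of a RenderTask
    std::uint16_t shader;    // program id, assumed to fit in 16 bits
    std::uint16_t texture;   // base texture id, assumed to fit in 16 bits
    std::uint16_t layer;     // UI layer, only meaningful for overlays
    std::uint16_t depth;     // quantized view-space depth
};

// The most significant bits hold the most important sorting criterion.
std::uint64_t MakeSortKey(QueueType queue, const TaskInfo& t) {
    switch (queue) {
    case QueueType::Objects:
        // Group by shader, then by texture, then front to back.
        return (std::uint64_t{t.shader} << 32) |
               (std::uint64_t{t.texture} << 16) |
               std::uint64_t{t.depth};
    case QueueType::Overlay:
        // Layer order dominates; state changes are secondary.
        return (std::uint64_t{t.layer} << 32) |
               (std::uint64_t{t.shader} << 16) |
               std::uint64_t{t.texture};
    }
    return 0;
}
```

With keys packed like this, each queue can sort its tasks with a plain integer comparison, and the same task can carry a different key in each queue it is added to.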

I'm still not 100% happy with it, and this setup hasn't been tested under any heavy load - my game is at a very early stage and all I've done is render UI / simple world geometry, but it works pretty well. It allows different renderers to fetch tasks from any queue they need (I separated the renderers into their own classes: OverlayRenderer, DeferredWorldRenderer, DebugRenderer), and each of them fetches from one or sometimes several specific queues.

This allows for fun stuff like adding the same task to both the WORLD and DEBUG queues, which renders the object using 2 renderers - one normal, and another as wireframe or with visualised normals.

Hope it helps. I'm open to any discussion about it because the topic is very interesting and finding information is pretty hard - there are a few big GameDev threads and some information can be found on the web, but it's hard to piece it all together. My implementation emerged from some reading but also from my own ideas that came up once I actually started writing it and using it for rendering.

### #3melbow  Members   -  Reputation: 215


Posted 27 December 2012 - 08:32 PM

Thanks for replying Noizex. You gave some great input. I was wondering if you perform any batching of your Render Tasks and if so, how? I don't see any place for attributes that aren't in your VAO.

### #4noizex  Members   -  Reputation: 777


Posted 28 December 2012 - 03:38 AM

Thanks for replying Noizex. You gave some great input. I was wondering if you perform any batching of your Render Tasks and if so, how? I don't see any place for attributes that aren't in your VAO.

I don't really batch them. I plan to draw some objects with instancing (not sure yet if I will just submit 1 RenderTask that has additional info for instanced drawing, or many RenderTasks and somehow determine in the renderer that they should be instanced and collapsed into 1 draw call). I also batch things before I submit a RenderTask, because batching is often very specific to the thing that's being drawn.

The whole thing is so flexible that I don't optimize too much yet (like batching everything), because it's more convenient to have 1 VAO per object and just draw it with 1 draw call per object. If I ever run into a draw call problem, I can batch them just by modifying how tasks are processed by the renderer. I will for sure batch things like foliage / grass / particles and other things that would otherwise waste too many draw calls, but for normal objects / terrain I don't really want to optimize that yet, unless I see that I end up with too many calls from this.

### #5Hodgman  Moderators   -  Reputation: 22477


Posted 28 December 2012 - 08:09 AM

If possible, use UBOs instead of plain uniforms... The GL2.x method of managing uniforms is over-complicated and doesn't match the hardware at all, causing you to write a heap of complex code in your engine, and causing the driver to then have to emulate a bunch of cruft...

In my engine:

A RenderItem is a DrawCall and an array of StateGroup pointers.

StateGroups are sets of States, which map to all the GL/D3D parameters that affect how a draw-call functions, including a dozen or so UBO (AKA cbuffer) binding-slots (per shader stage), the VAO (AKA vertex-declaration/input-layout + a dozen or so vertex-buffer slots), a dozen or so texture slots (per shader stage), the rasterization state, the blend mode, the depth-stencil state, etc...

RenderItems are then put into (sortable) collections (RenderGroups), which are submitted with an associated RenderStage (which describes the render-target/MRT/FBO state and any clear/resolve operations).

### #6melbow  Members   -  Reputation: 215


Posted 28 December 2012 - 12:18 PM

Hodgman, are your states then inherited from a State base class? If not, how do you determine how to set the state the correct way? I would think you would want to avoid the overhead of virtual functions in something so low-level in the engine.

### #7Hodgman  Moderators   -  Reputation: 22477


Posted 29 December 2012 - 12:51 AM

Hodgman, are your states then inherited from a State base class? If not, how do you determine how to set the state the correct way? I would think you would want to avoid the overhead of virtual functions in something so low-level in the engine.
Yeah they're pretty much implementations of a base-class interface, but they don't use virtual.

In theory, virtual shouldn't be that slow, so it's important to understand what makes it "slow" in practice.
When calling a regular function, the compiler can hard-code a "jump to this address" instruction -- a completely predictable branch, with the main performance pitfall that the jump may cause an i-cache miss, if you haven't called that function recently (i.e. if the function's code isn't already in the cache).
When calling a virtual function, it needs instructions to read the first word of the object (the vtable address), then add some hard-coded offset to that word to get the address of the vtable entry for the function we're after, then fetch the word at that address, and finally jump to the address in that word.
This has the same performance issue as above (an i-cache miss if the function isn't resident), but also the branch is harder to predict because the branch-target isn't known until right before you perform the jump (this matters more on PPC than x86). Further, we need to perform two additional reads from memory, which could each cause d-cache misses -- the first one (the vtable address fetch) is likely if you've not read a member from this object recently, and the second one (the function pointer fetch) is likely to miss if you've not called a virtual function on this type of object recently. If the function-call has to read members of the object, then the 1st one doesn't matter, as you've just moved the cache miss from inside the function to right before it! Cache misses are really slow (hundreds of cycles), so they're a much bigger deal than the few extra instructions or the branch misprediction.

So, the main reason that virtual functions are slow, is that you might cause a cache-miss when fetching a function pointer from a vtable. There's not much you can do about this, because the compiler implements vtables behind the scenes, leaving you with no control over where they live in RAM.

So, for this particular case (renderer commands), I implemented a vtable-type system myself, where every different command class shares the same "vtable" -- this means that if I execute 100 commands in a row (from a pre-prepared command buffer), then the first one will generate a cache-miss when reading this "vtable", but the following commands are less likely to cause the same cache-miss.
e.g.
namespace Commands
{
    enum Type
    {
        Foo,
        Bar0,
        Bar1,
        Bar2,
    };
}

struct Command
{
    u8 id;
};
struct Foo
{
    u8 id;
    int value;
};
struct Bar
{
    u8 id;
    int value;
};

void Submit_Foo(Device& device, Command& command)
{
    assert( command.id == Commands::Foo );
    Foo& foo = *(Foo*)&command;
    device.DoFoo( foo.value );
}

void Submit_Bar(Device& device, Command& command)
{
    assert( command.id >= Commands::Bar0 && command.id <= Commands::Bar2 );
    Bar& bar = *(Bar*)&command;
    device.SetBarSlot( bar.id - Commands::Bar0, bar.value );
}

typedef void(*PFnSubmit)(Device&, Command&);
static PFnSubmit g_CommandTable[] =
{
    &Submit_Foo,//Commands::Foo
    &Submit_Bar,//Commands::Bar0
    &Submit_Bar,//Commands::Bar1
    &Submit_Bar,//Commands::Bar2
};

inline void SubmitCommand(Device& device, Command& command)
{
    g_CommandTable[command.id](device, command);
}

//then, if you're really fancy, you can cut down on cache misses by pre-fetching or unrolling when executing a batch of commands
void SubmitCommands(Device& device, CommandIterator commands)
{//n.b. pseudo code
    prefetch g_CommandTable
    foreach command in commands
        prefetch next command.id
        SubmitCommand( device, command )
}
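For readers who want something they can compile, here is a self-contained sketch of the same table-driven dispatch over a packed command buffer. It swaps in a different technique for reading commands: each handler copies its command out of a byte stream with memcpy instead of casting references, and returns the command size so the loop can advance. Device, Append and the buffer layout are assumptions made for this example, and the prefetching is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

namespace Commands { enum Type : std::uint8_t { Foo, Bar0, Bar1, Bar2, Count }; }

struct Device {  // hypothetical device that just records what it was told
    int fooValue = 0;
    int barSlots[3] = {0, 0, 0};
    void DoFoo(int v) { fooValue = v; }
    void SetBarSlot(int slot, int v) { barSlots[slot] = v; }
};

struct Foo { std::uint8_t id; int value; };
struct Bar { std::uint8_t id; int value; };

// Each handler copies its command out of the stream and returns its size.
std::size_t Submit_Foo(Device& device, const std::uint8_t* bytes) {
    Foo foo;
    std::memcpy(&foo, bytes, sizeof foo);
    assert(foo.id == Commands::Foo);
    device.DoFoo(foo.value);
    return sizeof foo;
}

std::size_t Submit_Bar(Device& device, const std::uint8_t* bytes) {
    Bar bar;
    std::memcpy(&bar, bytes, sizeof bar);
    assert(bar.id >= Commands::Bar0 && bar.id <= Commands::Bar2);
    device.SetBarSlot(bar.id - Commands::Bar0, bar.value);
    return sizeof bar;
}

using PFnSubmit = std::size_t (*)(Device&, const std::uint8_t*);
static const PFnSubmit g_CommandTable[Commands::Count] = {
    &Submit_Foo,  // Commands::Foo
    &Submit_Bar,  // Commands::Bar0
    &Submit_Bar,  // Commands::Bar1
    &Submit_Bar,  // Commands::Bar2
};

// Appends the raw bytes of a command; the id is its first byte.
template <class T>
void Append(std::vector<std::uint8_t>& buffer, const T& command) {
    const auto* p = reinterpret_cast<const std::uint8_t*>(&command);
    buffer.insert(buffer.end(), p, p + sizeof command);
}

void SubmitCommands(Device& device, const std::vector<std::uint8_t>& buffer) {
    std::size_t offset = 0;
    while (offset < buffer.size()) {
        const std::uint8_t id = buffer[offset];
        assert(id < Commands::Count);
        offset += g_CommandTable[id](device, buffer.data() + offset);
    }
}
```

A buffer would be filled with calls such as `Append(buffer, Foo{Commands::Foo, 1337});` and then handed to `SubmitCommands` in one go, so every command in the batch goes through the same small function table.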

Edited by Hodgman, 29 December 2012 - 01:02 AM.

### #8melbow  Members   -  Reputation: 215


Posted 29 December 2012 - 07:29 PM

So, for this particular case (renderer commands), I implemented a vtable-type system myself, where every different command class shares the same "vtable" -- this means that if I execute 100 commands in a row (from a pre-prepared command buffer), then the first one will generate a cache-miss when reading this "vtable", but the following commands are less likely to cause the same cache-miss.

So if everything is a command, do you batch ahead of time like Noizex? Say for example, geometry instancing, how would you collect each instance for creation of the commands? And thanks for such a good example. You've helped clarify a ton already

### #9Hodgman  Moderators   -  Reputation: 22477


Posted 29 December 2012 - 08:37 PM

So if everything is a command, do you batch ahead of time like Noizex? Say for example, geometry instancing, how would you collect each instance for creation of the commands?

Yeah, like Noizex, I don't perform any merging of commands at this level of the library.
e.g. As I mentioned earlier, my RenderItem contains a DrawCall -- I never take two items with DrawIndexed calls and merge them into a single DrawIndexedInstanced call; that's the responsibility of the next layer up, which generates RenderItems.

So, I've got multiple layers within the renderer:
1st -- you can submit a sequence of commands to a device.
2nd -- you can take a sequence of RenderItems and submit them, which sends a (culled) sequence of commands to the 1st layer.
3rd+ -- you can generate a sequence of RenderItems somehow, and pass it to the 2nd layer.

n.b. DrawCall and State both "inherit" from Command, so at the 2nd layer they're different things, but at the 1st layer, everything is the same.

The 1st layer is very simple, quite similar to the example code I posted above.
The 2nd layer can perform sorting of RenderItems, and does redundant state-change optimizations -- e.g. if a render-item contains a state that was previously submitted, then it won't be submitted again. The 2nd layer can also write commands to a "command buffer" instead of submitting them directly to the device, which is used when performing rendering tasks on a background thread that isn't able to access the device itself.
The 3rd (and higher) layers are where the "higher level" rendering ideas live, like scene management, etc...

There can be different rendering systems in the 3rd layer. e.g. I might have a system that performs culling of the static world geometry, and collects the static RenderItem objects required to draw the level itself. Then, I might have a different system that procedurally generates RenderItems required to draw particle systems, etc...
Say I've got a crowd rendering system, where the characters are all instanced, then this system might find all the visible characters, then generate a single RenderItem that contains a DrawIndexedInstanced call, and the necessary state to draw all those characters.

Edited by Hodgman, 29 December 2012 - 08:47 PM.

### #10IncidentRay  Members   -  Reputation: 148


Posted 29 December 2012 - 09:57 PM

Hodgman, wouldn't this violate the strict aliasing rule when you cast a Command reference to a Foo or Bar reference, or vice versa?

### #11Hodgman  Moderators   -  Reputation: 22477


Posted 30 December 2012 - 01:49 AM

Hodgman, wouldn't this violate the strict aliasing rule when you cast a Command reference to a Foo or Bar reference, or vice versa?

Yes. Technically, accessing a Foo through a Command* (or vice versa) is undefined behaviour, but in practice it will work in most situations.

We're never writing to an aliased Command and reading from an aliased Foo (or vice versa) inside the one function, which minimizes the risks.
e.g. this code would be dangerous:

assert( command.id == 0 );//assume the command is actually a "Foo"
command.id = 42;//change the id value
Foo& foo = *(Foo*)&command;
assert( foo.id == 42 );//the id value should be changed on the "Foo" also, but this might fail in optimized builds!

The worst thing in the earlier code is a sub-optimal assertion:

assert( command.id >= Commands::Bar0 && command.id <= Commands::Bar2 );//this will load command.id from RAM
Bar& bar = *(Bar*)&command;
device.SetBarSlot( bar.id - Commands::Bar0, bar.value );//bar.id will generate another "load" instruction here, even though the value was loaded above

Also, the only value that we actually need to "alias" is the first member -- u8 id -- and it doesn't actually need to be aliased as a different type, so it's possible to write this system in a way that doesn't violate strict aliasing if you need to -- e.g.

//Instead of this:
Foo foo = { Commands::Foo, 1337 };
Command* cmd = (Command*)&foo;
SubmitCommand( device, *cmd );

//We could use
Foo foo = { Commands::Foo, 1337 };
u8* cmd = &foo.id;
SubmitCommand( device, cmd );

//with:
inline void SubmitCommand(Device& device, u8* command)
{
    g_CommandTable[*command](device, command);
}
void Submit_Foo(Device& device, u8* command)
{
    assert( *command == Commands::Foo );
    Foo& foo = *(Foo*)(command - offsetof(Foo,id));
    device.DoFoo( foo.value );
}

P.S. u8* (my version of unsigned char*) is allowed to alias any other type (the strict aliasing rule doesn't apply to it), but the above version would work even if this weren't true.

Edited by Hodgman, 30 December 2012 - 01:52 AM.

### #12IncidentRay  Members   -  Reputation: 148


Posted 30 December 2012 - 07:43 PM

Also, the only value that we actually need to "alias" is the first member -- u8 id -- and it doesn't actually need to be aliased as a different type, so it's possible to write this system in a way that doesn't violate strict aliasing if you need to -- e.g.

Thanks for the example.  Would you still need the Command struct with this design?  Also, I was wondering whether you think it's worth trying to always avoid breaking the strict aliasing rule, or do you think it's better to just risk the undefined behavior if it's the simplest option?

Edited by IncidentRay, 30 December 2012 - 07:45 PM.

### #13Hodgman  Moderators   -  Reputation: 22477


Posted 01 January 2013 - 07:52 AM

Thanks for the example.  Would you still need the Command struct with this design?  Also, I was wondering whether you think it's worth trying to always avoid breaking the strict aliasing rule, or do you think it's better to just risk the undefined behavior if it's the simplest option?

No, the command struct has been replaced with a pointer to the id's primitive type.

Yes, breaking the strict-aliasing rule can be very bad, because it can cause the compiler to emit code that doesn't do what you intended it to! So it should be avoided.

I've taken this thread off-topic enough already, so I've started a new topic just about the strict aliasing rule over here

### #14TiagoCosta  Crossbones+   -  Reputation: 1655


Posted 02 January 2013 - 06:24 PM

I would like to add another question in this topic:

How do you handle RenderItems (objects) that require a Texture that is generated by a different RenderStage?

Example:

In a deferred renderer, every light source needs to have its shadow map generated, but you only have a single GPU resource to store the shadow map, so you have to:

-Draw Light 1;

-Draw Light 2;

...

Currently I handle this with a command called ExecuteRenderStage that stops the rendering of the current render stage, executes another stage and then restores the "main" one, but I would like to hear how you do it.

Tiago Costa
Aqua Engine - my DirectX 11 game "engine" - In development

### #15melbow  Members   -  Reputation: 215


Posted 09 January 2013 - 08:10 PM

All this talk of unpredictable behavior has me questioning this approach. What if a Command was simply a sort of container, like:
struct Command {
    Commands::Type id;
    union {
        Foo* foo;
        Bar* bar;
    } u;
};

### #16Hodgman  Moderators   -  Reputation: 22477


Posted 09 January 2013 - 08:27 PM

I don't know why I didn't mention it before, but in my own engine I get around the potential aliasing issues with inheritance...

struct Command { Commands::Type id; };
struct Foo : public Command { int value; };

How do you handle RenderItems (objects) that require a Texture that is generated by a different RenderStage

I just submit a series of stages, e.g. the stage to generate a shadow-map, then a stage that draws the light (which is a draw-call whose paired state-group sets the texture generated by the first stage).

Edited by Hodgman, 11 January 2013 - 04:57 AM.

### #17IncidentRay  Members   -  Reputation: 148


Posted 09 January 2013 - 11:02 PM

I get around the undefined behaviour with inheritance...

But if you use inheritance, don't the structs become non-POD types?  That might create more undefined behavior to deal with -- for example, I was thinking of using memcmp for detecting redundant state-changes in the RenderGroup class, but that would only work if the structs were POD.

### #18melbow  Members   -  Reputation: 215


Posted 11 January 2013 - 02:06 AM

I too am puzzled by how redundant state changes are eliminated in this model. Am I correct in that states may be submitted in any order? And if this is the case, then states may be sorted and then linearly compared. However, this seems expensive considering how many states may be set per frame. I'm sure you have a much more clever way of doing this.

### #19Hodgman  Moderators   -  Reputation: 22477

Like
Posted 11 January 2013 - 04:51 AM

But if you use inheritance, don't the structs become non-POD types?  That might create more undefined behavior to deal with -- for example, I was thinking of using memcmp for detecting redundant state-changes in the RenderGroup class, but that would only work if the structs were POD.
You've got a good eye for C++ details ;) I should've said inheritance avoids the strict-aliasing issues, but you're right, the standard says that using inheritance like that means they're now non-POD.
However, on the compilers that I support, they still act as if they were POD, so I can still memcmp/memcpy them on these compilers. Relying on compiler details should generally be avoided, but it's something you can choose to do

Instead of inheritance, I guess I could've used composition to be fully compliant, e.g.
struct Command { int id; };
struct FooCommand { Command baseClass; int fooValue; };
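The difference between the two layouts can be checked at compile time with the standard type traits. In this sketch the struct names FooInherited and FooComposed are hypothetical, and SameState is an assumed helper showing the kind of memcmp comparison discussed above; it relies on the structs having no padding, which holds here because both members are ints.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

struct Command { int id; };
struct FooInherited : Command { int fooValue; };          // inheritance version
struct FooComposed { Command baseClass; int fooValue; };  // composition version

// Data members in both base and derived class: not standard-layout, hence not POD.
static_assert(!std::is_standard_layout<FooInherited>::value, "inherited is not standard-layout");
// It is still trivially copyable, which is what memcpy formally requires.
static_assert(std::is_trivially_copyable<FooInherited>::value, "inherited is trivially copyable");

// The composed version satisfies both properties.
static_assert(std::is_standard_layout<FooComposed>::value, "composed is standard-layout");
static_assert(std::is_trivially_copyable<FooComposed>::value, "composed is trivially copyable");

// Byte-wise comparison; only meaningful when the struct has no padding bytes.
bool SameState(const FooComposed& a, const FooComposed& b) {
    return std::memcmp(&a, &b, sizeof a) == 0;
}
```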

I too am puzzled by how redundant state changes are eliminated in this model. Am I correct in that states may be submitted in any order? And if this is the case, then states may be sorted and then linearly compared. However, this seems expensive considering how many states may be set per frame. I'm sure you have a much more clever way of doing this.
I haven't really mentioned redundant state removal, except that I do it at the "second level". The 1st level takes a stream of commands, and can't do any redundant state removal besides the traditional technique, which is to check the value of every state before submitting it, something like:
if( 0!=memcmp(&cache[state.id], &state, sizeof(State)) ) { cache[state.id]=state; Apply(state); }

A lot of renderers do do redundant state checking at that level, which pretty much means having an if like the above every time you go to set a state. I do a little bit of this kind of state caching, but try to avoid it.
Instead, I do redundant state checking at the next level up -- the part that generates the sequences of commands in the first place. This part of the code also submits commands to set states back to their default values if a particular draw-call hasn't been paired with any values for that state.
After sorting my render-items, the "2nd layer" which produces the stream of commands for the 1st layer looks like:
defaults[maxStates] = {/*states to apply if a value doesn't exist for them*/}

previousState[maxStates] = {NULL} // a cache of which states are 'current'

nonDefaultState[maxStates] = {true} // which states have a non-default value

for each item in renderItems

    draw = item.draw
    stateGroups = item.stateGroups

    statesSet[maxStates] = {false} //which states have been set by this item
    for each group in stateGroups
        for each state in group
            if statesSet[state.id] == false //this state not set by a previous group in this item
            then
                statesSet[state.id] = true
                if previousState[state.id] != state //this state not set by a previous item and still current
                then
                    Submit(state) // add to command buffer, or send to device
                    previousState[state.id] = state
                endif
            endif
        endfor
    endfor

    setToDefault = nonDefaultState & ~statesSet
    nonDefaultState = statesSet
    for each id in setToDefault
        Submit(defaults[id]) // add to command buffer, or send to device
        previousState[id] = defaults[id]
    endfor

    Submit(draw) // add to command buffer, or send to device

endfor
Except the actual C++ code uses a lot of bitmasks instead of arrays of bools, and uses pointers to identify state value equality, and everything is tightly laid out to be cache-friendly, etc...
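A compilable sketch of this second layer, using std::bitset for the masks and pointer identity for state equality, might look like the following. The types (State, StateGroup, RenderItem, Submitted) and the fixed slot count are assumptions for illustration, and the output is collected in a vector rather than sent to a device. The sketch marks a slot as set by the item as soon as the item mentions it, so a state that is already current is neither resubmitted nor reset to its default.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kMaxStates = 4;            // hypothetical number of state slots

struct State { std::size_t id; int value; };     // id selects the slot
using StateGroup = std::vector<const State*>;

struct RenderItem {
    int draw;                                    // stand-in for a DrawCall
    std::vector<const StateGroup*> stateGroups;  // earlier groups take priority
};

struct Submitted {                               // one entry of the output stream
    const State* state;                          // null for a draw
    int draw;
};

std::vector<Submitted> BuildCommands(const std::vector<RenderItem>& items,
                                     const State (&defaults)[kMaxStates]) {
    std::vector<Submitted> out;
    const State* previous[kMaxStates] = {};      // state object currently bound per slot
    std::bitset<kMaxStates> nonDefault;          // slots holding a non-default value
    nonDefault.set();                            // device state unknown at start

    for (const RenderItem& item : items) {
        std::bitset<kMaxStates> setByItem;
        for (const StateGroup* group : item.stateGroups) {
            for (const State* state : *group) {
                if (setByItem[state->id]) continue;   // overridden by an earlier group
                setByItem.set(state->id);
                if (previous[state->id] != state) {   // not already current
                    out.push_back({state, 0});
                    previous[state->id] = state;
                }
            }
        }

        // Slots left non-default by the previous item and untouched by this one.
        const std::bitset<kMaxStates> toDefault = nonDefault & ~setByItem;
        for (std::size_t id = 0; id < kMaxStates; ++id) {
            if (toDefault[id]) {
                out.push_back({&defaults[id], 0});
                previous[id] = &defaults[id];
            }
        }
        nonDefault = setByItem;

        out.push_back({nullptr, item.draw});
    }
    return out;
}
```

Because equality is decided by comparing pointers, two items only share a state without resubmission when they reference the same State object, which is why shared, long-lived state objects fit this scheme well.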

Edited by Hodgman, 11 January 2013 - 04:54 AM.

### #20melbow  Members   -  Reputation: 215


Posted 11 January 2013 - 10:30 PM

Thanks again Hodgman. I really appreciate how detailed yet concise your responses are. The only thing that is still not completely clear to me is the generation of RenderItems. Are they allocated each frame from a data cache (like what is described here http://docs.madewithmarmalade.com/native/api_reference/iwgxapidocumentation/iwgxapioverview/datacache.html )? And would a higher level object like a GLShader or GeometryPacket class then maintain their respective Commands? I am not seeing a way to check for duplicate states by comparing pointers unless the Commands are maintained by global, shared resources, or I guess if Commands ARE global shared resources, but the first option seems cleaner.

Again, I really appreciate everyone's input on this thread, it has helped me a great deal.
