Frostbite rendering architecture question.

75 comments, last by n3Xus 12 years, 8 months ago
[quote name='Eric Lengyel' timestamp='1309832790' post='4831213']What mainstream GPU, specifically, do you believe doesn't have alpha test capabilities outside the pixel shader?
Is there any way to know??[/quote]

Yes, as a matter of fact, there is. It is not too difficult to reverse-engineer the command buffer by stepping through assembly code with the Visual Studio debugger and seeing exactly what information the driver is sending to the hardware. Once you know how to locate the command buffer, the extraction of hardware register data can be automated. You can learn many interesting things by doing this, and it will change the way you think about the hardware. (Also, AMD has actually published their register specs for hardware up to R700.) I can tell you the register numbers and formats for the alpha test functionality on any GPU that I have physical access to.

[quote name='Eric Lengyel' timestamp='1309915357' post='4831584']I agree that having the kill instruction early in a shader can provide a performance increase for complex shaders
Doesn't texkill/clip/discard just set a bit indicating that the ROP should discard, and not actually skip the shader instructions that come after it? Or has this been improved on newer cards?[/quote]

Generally, yes, but that bit can also be used to suppress texture fetches later in the shader, saving memory bandwidth, and GPUs have done this since at least 2004.
@Hodgman
1. You can ask the GPU vendors, but then you wouldn't be allowed to tell anyone, which might explain why Krypton doesn't say anything specific.
2. You can develop for consoles; then you might get a little insight, depending on the console.
3. You can also check the open GPU specifications that ATI/AMD and Intel released. For ATI, as an example:
http://developer.amd.com/documentation/guides/Pages/default.aspx#open_gpu
You will see that the R3xx family of GPUs has "Alpha Functions", which refers to alpha test. The various HD2x00 graphics cards have alpha blend, but I can't find any alpha test information any more. I've seen a Linux driver mailing list where some people were wondering how to apply that to their driver; it makes some things quite complicated.
I think the PowerVR chips used in Atom chipsets support D3D10.1, and it's no secret that, due to the deferred pipeline, all the computations are done on the chip. There is no real ROP: even if you output antialiasing and use alpha blending, it's all done in the shader units, down to the point where the AA samples are merged into one final pixel. That is the only moment the part you could call a "ROP" does anything, by converting the color to the final format.

Regarding the alpha test mask:
The driver's compiler can decide on that. Branching is usually free, so you don't waste any performance in that case. You are right, it still needs to set the masking bits and the ROPs need to merge all fragment streams, but it seems they have no way to compare and just mask pixels based on the bitmask. But that's what they do all the time anyway, be it due to the fine raster mask or alpha-to-coverage.

Conceptually, mine looks more like[source lang=cpp]#include <vector>
#include <d3d9.h> // IDirect3DVertexBuffer9, D3DPRIMITIVETYPE

class RenderDevice;
class RenderState;

class StateGroup
{
public:
    typedef std::vector<RenderState*> StateVec;

    void Add(RenderState* s) { states.push_back(s); }
    StateVec::const_iterator Begin() const { return states.begin(); }
    StateVec::const_iterator End() const { return states.end(); } // was states.begin()
private:
    StateVec states;
};

class RenderCommand
{
public:
    virtual ~RenderCommand(){}
    virtual void Execute( RenderDevice& ) = 0;
};

class DrawCall : public RenderCommand {};
class RenderState : public RenderCommand
{
public: // the enum and GetType must be visible to derived classes and callers
    enum StateType
    {
        BlendMode,
        VertexBuffer,
        CBuffer0,
        CBuffer1,
        /*etc*/
    };
    virtual StateType GetType() const = 0;
};

//Dx9 implementation
class BindVertexBuffer : public RenderState
{
public:
    void Execute(RenderDevice&);
    StateType GetType() const { return VertexBuffer; } // const, so it overrides the base
private:
    IDirect3DVertexBuffer9* buffer;
};
class DrawIndexedPrimitives : public DrawCall
{
public:
    void Execute(RenderDevice&);
private:
    D3DPRIMITIVETYPE Type;
    INT BaseVertexIndex;
    UINT MinIndex;
    UINT NumVertices;
    UINT StartIndex;
    UINT PrimitiveCount;
};[/source]In practice though, for performance reasons there's no std::vectors of pointers or virtual functions -- the state-group is a blob of bytes that looks something like:
| size  | bitfield  | number    | state #0 | state #0 | state #1 | state #1 | ...
| in    | of states | of states | type     | data     | type     | data     | ...
| bytes | contained | contained | enum     |          | enum     |          | ...

[quote]What stops you from using a single StateGroup for the whole hierarchy? I guess in the MaterialRes you could get the StateGroup from the ShaderRes and so on.
Nothing, it's perfectly valid to merge groups together like that if you want to ;)
However, in this case, the instance-group might be shared between a couple of draw-calls (the number that make up a particular model), the geometry group might be shared between dozens of draw-calls (that model times the number of instances of that model), the material group might be shared between hundreds of draw-calls (if the same material is used by different models) and the shader group might be shared between thousands (if the same shader is used by different materials).
The 'stack' kinda forms a pyramid of specialization/sharing, where the bottom layers are more likely to be shared between items, and the top layers are more likely to be specialized for a particular item.
[/quote]

I'm interested in learning a bit more about how you've created a setup that avoids virtual functions and vectors. I hardly slept last night trying to figure out how I would do that. A system like the one you propose would kill performance, with each render command requiring a virtual call plus a lot of vector iterations. I've read your blog post regarding the blobs, but I'm having a hard time figuring out how that fits into this. Would you just have the renderer which receives the commands switch on type and reinterpret_cast the memory?
[quote]Would you just have the renderer which receives the commands switch on type and reinterpret_cast the memory?[/quote]
Pretty much, yes. It's a bit like writing a VM.
[quote]A system like the one you propose would kill performance, with each render command requiring a virtual call plus a lot of vector iterations.[/quote]It's important to note why it would be slow in a naive implementation. The main bottleneck here is not the clock-cycles required to perform arithmetic, but the time spent waiting for data to be moved from RAM to the CPU.

Iterating through a vector is actually the best kind of iteration -- you're accessing memory in a completely linear fashion, allowing great cache utilization. Iterating through a blob is the same (except that each item is of variable size) -- you start reading at the beginning of a contiguous blob, consuming items in a linear fashion until you reach the end.

Virtual functions on the other hand, are not cache friendly. To call a virtual, you read the vtable pointer (often the first 4 bytes of the object), add an offset to that pointer, read another pointer located at that address, and then execute the code referenced by the second pointer... a process which is not at all friendly to your cache -- it involves jumping to random memory locations, instead of predictable, linear iteration.

Going back to vectors -- iterating through a single vector is fast. However, each individual vector is going to be in a random memory location, meaning that when you finish with one vector and decide to start iterating another, you'll end up stalling as data is pulled down into the cache... If you could somehow put all of your vectors into the same region, then they could all be pulled into the cache in advance, and the amount of time stalling could be reduced dramatically.

I use a linear allocator to create the state-groups. This is where "blobs" come in -- like in the ascii-diagram, a header is allocated, followed by a sequence of commands, each of which is an ID followed by data. This means the group, although made up of several allocations, is all actually allocated in a single contiguous block. To read the blob, it's a bit like working with a serialized file-format.
This also means that all of the different state-groups are in the same region of memory; each group is allocated just after the end of the last group. The state-group allocations of an entire level fit comfortably into your L1 cache (or a SPU...)
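
To make the allocation pattern concrete, here is a minimal sketch of a bump allocator that lays a header and its states out back to back. The names (`LinearAllocator`, `GroupHeader`, `BeginGroup`, `WriteState`) and the exact header fields are assumptions for illustration, loosely following the ASCII diagram; they are not the actual engine code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>
#include <vector>

// Hypothetical bump allocator: each allocation lands directly after the previous one.
class LinearAllocator {
public:
    explicit LinearAllocator(std::size_t capacity) : buffer_(capacity), used_(0) {}

    // Returns `bytes` bytes of storage, or nullptr when the block is exhausted.
    void* Alloc(std::size_t bytes) {
        if (used_ + bytes > buffer_.size()) return nullptr;
        void* p = buffer_.data() + used_;
        used_ += bytes;
        return p;
    }
    std::size_t Used() const { return used_; }
    const std::uint8_t* Base() const { return buffer_.data(); }
private:
    std::vector<std::uint8_t> buffer_;  // fixed capacity, never resized
    std::size_t used_;
};

// Assumed header layout: size in bytes, bitfield of contained states, state count.
struct GroupHeader {
    std::uint32_t sizeInBytes;
    std::uint32_t stateBits;
    std::uint32_t stateCount;
};

// Starts a new group; its header is placed at the current end of the allocator.
inline GroupHeader* BeginGroup(LinearAllocator& alloc) {
    void* p = alloc.Alloc(sizeof(GroupHeader));
    if (!p) return nullptr;
    return new (p) GroupHeader{static_cast<std::uint32_t>(sizeof(GroupHeader)), 0u, 0u};
}

// Appends one state (type ID followed by its data) directly after the group so far.
// Data sizes are assumed to be multiples of 4 so later headers stay aligned.
inline bool WriteState(LinearAllocator& alloc, GroupHeader& header,
                       std::uint32_t typeId, const void* data, std::uint32_t dataSize) {
    assert(typeId < 32 && dataSize % 4 == 0);
    void* dst = alloc.Alloc(sizeof(typeId) + dataSize);
    if (!dst) return false;
    std::memcpy(dst, &typeId, sizeof(typeId));
    std::memcpy(static_cast<std::uint8_t*>(dst) + sizeof(typeId), data, dataSize);
    header.sizeInBytes += static_cast<std::uint32_t>(sizeof(typeId)) + dataSize;
    header.stateBits |= 1u << typeId;
    header.stateCount += 1;
    return true;
}
```

Calling `BeginGroup` again after the last `WriteState` places the next group immediately after the previous one, which is what keeps all groups in one region of memory.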

As an aside - because "blobs" are just bits (i.e. POD), another nicety is that you can simply [font="Courier New"]memcpy[/font] them into an array of bytes somewhere, and you've got yourself a "command buffer". This lets you easily multi-thread your draw submission on APIs without native multi-threading capabilities.
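
A rough sketch of that idea, assuming a hypothetical POD command (`BindTextureCmd`) and a plain byte vector as the buffer: each worker thread could fill its own buffer, and the thread that owns the device would replay them later.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical POD command: an ID followed by its data.
struct BindTextureCmd {
    std::uint32_t id;       // command ID (value chosen for illustration only)
    std::uint32_t slot;
    std::uint32_t texture;  // stand-in for an API handle
};
static_assert(std::is_trivially_copyable<BindTextureCmd>::value,
              "commands must be copyable as plain bytes");

using CommandBuffer = std::vector<std::uint8_t>;

// Copies the raw bytes of a blob onto the end of the command buffer.
inline void Append(CommandBuffer& out, const void* blob, std::size_t size) {
    const std::uint8_t* src = static_cast<const std::uint8_t*>(blob);
    out.insert(out.end(), src, src + size);
}

// Reads a command back out of the buffer at the given byte offset.
inline BindTextureCmd ReadBindTexture(const CommandBuffer& in, std::size_t offset) {
    assert(offset + sizeof(BindTextureCmd) <= in.size());
    BindTextureCmd cmd;
    std::memcpy(&cmd, in.data() + offset, sizeof(cmd));
    return cmd;
}
```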

To iterate through a state-group (or a command buffer), you can [font="Courier New"]switch[/font] on the ID value to figure out how many bytes there are until the next command. Alternatively, you can use fixed-width commands to remove this detail.
When you want to execute a command, you can again [font="Courier New"]switch[/font] on the ID value to figure out which function to call (and pass the data section to). If you're iterating and executing, you can [font="Courier New"]switch[/font] once to call the execution function, and have it return the size of the data block that needs to be jumped over to reach the next command.
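
The iterate-and-execute loop described above could look roughly like this. The command IDs, data structs and `FakeDevice` are made up for the example; the point is only that one [font="Courier New"]switch[/font] both runs the command and reports how far to advance.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum CommandId : std::uint32_t { kSetBlend = 0, kBindVertexBuffer = 1 };

struct SetBlendData { std::uint32_t mode; };
struct BindVertexBufferData { std::uint32_t buffer; std::uint32_t stride; };

// Stand-in for the real device; it only records what was executed.
struct FakeDevice {
    std::uint32_t blendMode = 0;
    std::uint32_t vertexBuffer = 0;
    int executed = 0;
};

// Executes the command at `cursor` and returns the bytes it occupied (ID + data).
inline std::size_t ExecuteOne(const std::uint8_t* cursor, FakeDevice& device) {
    std::uint32_t id;
    std::memcpy(&id, cursor, sizeof(id));
    const std::uint8_t* data = cursor + sizeof(id);
    switch (id) {
    case kSetBlend: {
        SetBlendData d;
        std::memcpy(&d, data, sizeof(d));
        device.blendMode = d.mode;
        ++device.executed;
        return sizeof(id) + sizeof(d);
    }
    case kBindVertexBuffer: {
        BindVertexBufferData d;
        std::memcpy(&d, data, sizeof(d));
        device.vertexBuffer = d.buffer;
        ++device.executed;
        return sizeof(id) + sizeof(d);
    }
    default:
        return 0;  // unknown ID: the caller stops iterating
    }
}

// Walks the blob front to back, like reading a serialized file.
inline void ExecuteAll(const std::vector<std::uint8_t>& blob, FakeDevice& device) {
    std::size_t offset = 0;
    while (offset < blob.size()) {
        const std::size_t consumed = ExecuteOne(blob.data() + offset, device);
        if (consumed == 0) break;
        offset += consumed;
    }
}

// Helper for the example: appends an ID and its data to the blob.
template <typename T>
inline void Push(std::vector<std::uint8_t>& blob, std::uint32_t id, const T& data) {
    const std::uint8_t* p = reinterpret_cast<const std::uint8_t*>(&id);
    blob.insert(blob.end(), p, p + sizeof(id));
    p = reinterpret_cast<const std::uint8_t*>(&data);
    blob.insert(blob.end(), p, p + sizeof(T));
}
```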

Unlike virtual functions, you've got a lot more control over how things are laid out in memory when using a simple "[font="Courier New"]switch[/font]" based solution.
Some potential examples (not necessarily good ideas) --
* instead of using a [font="Courier New"]switch[/font], you could manually write a table of function-pointers, and use the ID as an index into that array. This allows you to choose where in memory you allocate the table, and compared to [font="Courier New"]virtual[/font], you've only got a single table instead of one per-class.
* you could write a [font="Courier New"]switch[/font] containing [font="Courier New"]goto[/font]'s, which will automatically be compiled into a jump-table. At the [font="Courier New"]goto[/font] labels, you could force-inline your command-handling functions, in order to ensure that all your instructions are in the same region to get good i-cache performance.
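
As a sketch of the first option above (a hand-written table of function pointers indexed by command ID), assuming made-up handlers and a placeholder `Device` type:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Device { int blendSets = 0; int vbBinds = 0; };

// Every handler consumes its data and reports the data size in bytes.
using Handler = std::size_t (*)(const void* data, Device& device);

inline std::size_t HandleSetBlend(const void*, Device& d) { ++d.blendSets; return 4; }
inline std::size_t HandleBindVB(const void*, Device& d) { ++d.vbBinds; return 8; }

enum : std::uint32_t { kSetBlend = 0, kBindVB = 1, kCommandCount = 2 };

// One table for all commands; you decide where it lives in memory.
static const Handler kHandlers[kCommandCount] = { &HandleSetBlend, &HandleBindVB };

// Looks up the handler by ID and calls it, with no per-object vtable involved.
inline std::size_t Dispatch(std::uint32_t id, const void* data, Device& device) {
    assert(id < kCommandCount);
    return kHandlers[id](data, device);
}
```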
Fantastic, Hodgman, thank you very much for the answer. I was also wondering how you go about sorting your resulting command buffer, as it consists of several connected commands which cannot be moved around independently?
On the previous page, I mentioned submitting draw/state pairs... Let's call them [font="Courier New"]RenderInstance[/font]s:[source lang=cpp]struct RenderInstance
{
    u32 sortingKey;
    DrawCall* draw;
    vector<StateGroup*> states;//not really a vector ;)
};[/source]
It's the queue of [font="Courier New"]RenderInstance[/font]s which gets sorted (not the command buffers). The sorted [font="Courier New"]RenderInstance[/font] queue is then used to generate a stream of commands.
Afterwards, another job takes the sorted instances and submits their commands to either the device or to a command buffer. Something like:
submit instances
sort instances
for each instance
    for each state-group
        for each state
            if state is not redundant
                submit state
    submit draw-call
The [font="Courier New"]submit[/font] part is either switching on the type to execute the command then and there, or it's copying it into a buffer that can be executed later.
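
The pseudocode above can be sketched in C++ as follows. This is an illustration under assumptions: states are simplified to a (type, value) pair, "redundant" is taken to mean either overridden higher in the instance's stack or already set to the same value on the device, and the "device" is just a list of submitted items.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct State { std::uint32_t type; std::uint32_t value; };
using StateGroup = std::vector<State>;  // not really a vector in practice

struct RenderInstance {
    std::uint32_t sortingKey;
    std::uint32_t drawCall;                 // stand-in for a DrawCall*
    std::vector<const StateGroup*> states;  // top of the stack first
};

struct Submitted { bool isDraw; std::uint32_t type; std::uint32_t value; };

inline std::vector<Submitted> Submit(std::vector<RenderInstance> instances) {
    // Sort the instances, not the commands.
    std::stable_sort(instances.begin(), instances.end(),
        [](const RenderInstance& a, const RenderInstance& b) {
            return a.sortingKey < b.sortingKey;
        });

    std::vector<Submitted> out;
    const std::uint32_t kUnset = 0xFFFFFFFFu;
    std::vector<std::uint32_t> current(32, kUnset);  // last value submitted per type

    for (const RenderInstance& inst : instances) {
        std::uint32_t seen = 0;  // types already handled higher in this stack
        for (const StateGroup* group : inst.states) {
            for (const State& s : *group) {
                assert(s.type < 32);
                const std::uint32_t bit = 1u << s.type;
                if (seen & bit) continue;                  // overridden higher up
                seen |= bit;
                if (current[s.type] == s.value) continue;  // device already has it
                current[s.type] = s.value;
                out.push_back({false, s.type, s.value});
            }
        }
        out.push_back({true, 0u, inst.drawCall});
    }
    return out;
}
```

With two instances that share the same groups, the second one submits only its draw-call, which is the saving that sorting by material or shader is meant to expose.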

To sort the instances, I let the "submitter" specify a 32-bit number, which can be anything. The lower level rendering systems don't care what the numbers mean, they're just used to sort items into the right order.
The higher level rendering systems might put material-hashes in there, or depth values, or a combination of both, with some bits specifying layers, some specifying depth, some specifying a material ID, etc....
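
One possible way a higher-level system might pack such a key is sketched below. The bit split (4 bits layer, 12 bits depth, 16 bits material ID) and the function names are assumptions picked for the example, not something the lower-level renderer requires.

```cpp
#include <cassert>
#include <cstdint>

// Assumed split of the 32 bits: 4 bits layer | 12 bits depth | 16 bits material ID.
// Fields placed in higher bits dominate the sort order.
inline std::uint32_t MakeSortKey(std::uint32_t layer, std::uint32_t depth,
                                 std::uint32_t materialId) {
    assert(layer < (1u << 4));
    assert(depth < (1u << 12));
    assert(materialId < (1u << 16));
    return (layer << 28) | (depth << 16) | materialId;
}

// Quantises a view-space depth in [0, farPlane] to the 12-bit depth field.
inline std::uint32_t QuantiseDepth(float depth, float farPlane) {
    float t = depth / farPlane;
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    return static_cast<std::uint32_t>(t * 4095.0f);
}
```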
Thanks for information guys, this has really cleared things up.

[source lang=cpp]class CommandBindVAO
{
private:
    uint m_uiVAO;

public:
    void Execute(Context* pkContext) const
    {
        pkContext->BindVAO(m_uiVAO);
    }
};

class CommandUnbindVAO
{
public:
    void Execute(Context* pkContext) const
    {
        pkContext->UnbindVAO();
    }
};

class CommandBindProgram
{
private:
    RFShaderProgram* m_pkProgram;

public:
    void Execute(Context* pkContext) const
    {
        pkContext->BindProgram(m_pkProgram);
    }
};

class CommandSetRenderState
{
private:
    RFRenderState* m_pkState;

public:
    void Execute(Context* pkContext) const
    {
        pkContext->ApplyRenderState(m_pkState);
    }
};

class CommandGroup
{
public:
    enum ECmdType
    {
        ST_BIND_VAO          = 1 << 0,
        ST_UNBIND_VAO        = 1 << 1,
        ST_SET_PASS_UNIFORMS = 1 << 2,
        ST_BIND_PROGRAM      = 1 << 3,
        ST_SET_RENDERSTATE   = 1 << 4
    };

private:
    size_t m_szCmdsSize;
    uint64 m_uiCmdFlags;
    uint m_uiCmdCount;
    void* m_pvCmd;

public:
    size_t GetCmdSize() const { return m_szCmdsSize; }
    uint64 GetCmdFlags() const { return m_uiCmdFlags; }
    uint GetCmdCount() const { return m_uiCmdCount; }
    const void* GetCmds() const { return m_pvCmd; }
};[/source]

[source lang=cpp]////////////////////////////////////////////////////////

void Renderer::Render()
{
    // Create sort list
    uint uiIndex = 0;
    for (RenderQueue::InstanceVector::const_iterator kIter = m_pkQueue->Begin();
        kIter != m_pkQueue->End(); ++kIter)
    {
        m_kSortList.push_back(SortListItem((*kIter).GetSortKey(), uiIndex));
        uiIndex++;
    }

    // Sort render queue
    std::stable_sort(m_kSortList.begin(), m_kSortList.end(), QueueSorter);

    // Iterate render instances in sorted order
    for (std::vector<SortListItem>::const_iterator kIter = m_kSortList.begin(); kIter != m_kSortList.end(); ++kIter)
    {
        const RenderInstance& kInstance = m_pkQueue->Get(kIter->m_uiIndex);

        // Iterate command groups
        uint64 uiUsedCommands = 0;
        for (RenderInstance::CommandGroupVector::const_iterator kCmdIter = kInstance.Begin();
            kCmdIter != kInstance.End(); ++kCmdIter)
        {
            const CommandGroup* pkCmdGroup = *kCmdIter;

            // Iterate commands and execute on context
            const void* pvCmds = pkCmdGroup->GetCmds();
            uint uiCmdCount = pkCmdGroup->GetCmdCount();

            for (uint ui = 0; ui < uiCmdCount; ++ui)
            {
                // Get command type
                const CommandGroup::ECmdType eType = *reinterpret_cast<const CommandGroup::ECmdType*>(pvCmds);
                pvCmds = static_cast<const void*>(static_cast<const char*>(pvCmds) + sizeof(CommandGroup::ECmdType));

                // Apply only if this command type was not already applied earlier in the stack
                // (the original test was "!= 0", which would never execute any command)
                bool bApply = (uiUsedCommands & eType) == 0;

                // Remember type
                uiUsedCommands |= eType;

                // Handle command type correctly
                switch (eType)
                {
                case CommandGroup::ST_BIND_VAO:
                    {
                        // Execute command
                        if (bApply)
                        {
                            const CommandBindVAO& kCmd = *reinterpret_cast<const CommandBindVAO*>(pvCmds);
                            kCmd.Execute(m_pkContext);
                        }

                        // Offset command stream
                        pvCmds = static_cast<const void*>(static_cast<const char*>(pvCmds) + sizeof(CommandBindVAO));
                    }
                    break;

                case CommandGroup::ST_UNBIND_VAO:
                    {
                        // Execute command
                        if (bApply)
                        {
                            const CommandUnbindVAO& kCmd = *reinterpret_cast<const CommandUnbindVAO*>(pvCmds);
                            kCmd.Execute(m_pkContext);
                        }

                        // Offset command stream
                        pvCmds = static_cast<const void*>(static_cast<const char*>(pvCmds) + sizeof(CommandUnbindVAO));
                    }
                    break;

                case CommandGroup::ST_BIND_PROGRAM:
                    {
                        // Execute command
                        if (bApply)
                        {
                            const CommandBindProgram& kCmd = *reinterpret_cast<const CommandBindProgram*>(pvCmds);
                            kCmd.Execute(m_pkContext);
                        }

                        // Offset command stream
                        pvCmds = static_cast<const void*>(static_cast<const char*>(pvCmds) + sizeof(CommandBindProgram));
                    }
                    break;

                case CommandGroup::ST_SET_RENDERSTATE:
                    {
                        // Execute command
                        if (bApply)
                        {
                            const CommandSetRenderState& kCmd = *reinterpret_cast<const CommandSetRenderState*>(pvCmds);
                            kCmd.Execute(m_pkContext);
                        }

                        // Offset command stream
                        pvCmds = static_cast<const void*>(static_cast<const char*>(pvCmds) + sizeof(CommandSetRenderState));
                    }
                    break;
                }
            }
        }

        // Switch on drawcall and execute
        const DrawCall* pkDrawCall = kInstance.GetDrawCall();
        switch (pkDrawCall->GetType())
        {
        case DrawCall::DCT_DRAW_ARRAYS:
            static_cast<const DrawCallDrawArrays*>(pkDrawCall)->Execute(m_pkContext);
            break;
        }
    }
}[/source]


CommandGroups are what you would call StateGroups, as that is what they are: commands to change state, as far as I understand.

Right now I'm manually iterating the command groups, which would obviously be done using a proper iterator when the time comes. Same goes for the use of vectors. :)

Just a quick mockup of a renderer::render method. Am I completely on the wrong track? Obviously my framework is written in OpenGL, though that shouldn't change much. Context is a context proxy which keeps track of which VAO / state is set etc.

I'm having a hard time figuring out which commands I could define, as all I could come up with were the 5 I've shown. I'm also a bit in doubt why you would make a separate Drawcall class instead of having it as a command.

The uniforms are causing me problems as well. In my setup a material contains x techniques, which in turn contain x passes, which contain x uniforms (default values / auto values set by the framework) and a shader program. Each MeshRes (in the sense you're using it) contains a pointer to a material. Come command queue execution I have to apply / update these uniforms after having bound the shader program. Would that result in a new command type? Would this defeat the purpose of having this highly compacted in-memory command queue, as that would require me to jump to the MaterialPass and iterate all the uniforms, updating / uploading them to the GPU?

The following is basically what I think I need to do

for each MeshInstance in MeshInstanceList
{
    for each SubMesh in MeshInstance
    {
        Store ShaderProgram // Which program to render using
        Store UniformDefaults // Material pass defined uniform defaults
        Store UniformAuto // Material pass defined auto-filled uniforms using context state (view, viewprojection, time etc.)
        Store TextureDefaults // Material pass defined textures - Set in material definition
        Store UniformInstance // Submesh Instance defined uniforms
        Store TextureInstance // Submesh Instance defined textures
        Store VAO // Submesh buffer data binding
        Store DrawCall // Encapsulated

        Add RenderInstance to queue
    }
}

Sort renderqueue

Submit renderqueue to renderer

for each RenderInstance in renderqueue
{
    Update UniformAuto from context

    Find and apply WorldTransform on context (used for auto uniforms)

    Apply ShaderProgram on context

    Apply UniformDefaults
    Apply UniformAuto
    Apply UniformInstance

    Bind TextureDefaults
    Bind TextureInstance

    Bind VAO
    Dispatch DrawCall
    Unbind VAO
}


Is the above sensible and would it make sense in the context of what Hodgman has proposed?

Oh and thank you very much for all the help you've given me - And the community!
[quote]Just a quick mockup of a renderer::render method. Am I completely on the wrong track?[/quote]
Yeah that looks similar to what I'm used to. I use something analogous to your "[font="Courier New"]uiUsedCommands[/font]/[font="Courier New"]bApply[/font]" code to ensure commands at the top of the stack take precedence over commands of the same type lower in the stack.

My [font="Courier New"]bApply[/font] test is a bit more complicated though, as it also checks if the command being inspected was already set by the previous render-instance. i.e. if two consecutive render instances use the same material, then all the states from the material's state-group can usually be ignored when drawing the 2nd instance.

My "Iterate render instances" loop is also passed a "default" state-group, which is conceptually put at the bottom of every state-stack. If an instance doesn't set a particular state and the default group contains that state, then the default value will be used.
If you don't do this, then you end up with behaviours like -- one object enables alpha blending, and then all following objects also end up being alpha-blended, because they didn't manually specify a "disable alpha blending" command.
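
A small sketch of the "default group at the bottom of the stack" idea, with states simplified to a (type, value) pair and hypothetical names (`Resolve`, `StateGroup`). The first group to mention a state type wins, and the defaults are consulted last, so a state nobody sets falls back to its default instead of leaking from the previous object.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct State { std::uint32_t type; std::uint32_t value; };
using StateGroup = std::vector<State>;

// Resolves a state stack (top first) to one value per state type.
// The default group is treated as the bottom-most layer of every stack.
inline std::vector<std::uint32_t> Resolve(const std::vector<const StateGroup*>& stack,
                                          const StateGroup& defaults,
                                          std::uint32_t typeCount) {
    const std::uint32_t kUnset = 0xFFFFFFFFu;
    std::vector<std::uint32_t> result(typeCount, kUnset);
    auto applyGroup = [&](const StateGroup& group) {
        for (const State& s : group) {
            assert(s.type < typeCount);
            if (result[s.type] == kUnset) result[s.type] = s.value;  // higher layers win
        }
    };
    for (const StateGroup* group : stack) applyGroup(*group);
    applyGroup(defaults);
    return result;
}
```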

Also, with the way your code is at the moment, only a single [font="Courier New"]SetRenderState[/font] command will be applied per instance. If you want to set two different render-states, only the first one will actually be set at the moment (the second will be ignored). For this reason, I have every different render-state as a different command ID.
[quote]I'm having a hard time figuring out which commands I could define, as all I could come up with were the 5 I've shown. I'm also a bit in doubt why you would make a separate Drawcall class instead of having it as a command.[/quote]As above, I've got commands for each different render-state. I've also got commands for each different CBuffer slot and each texture-binding slot (for each type of shader).

I've limited myself to 14 CBuffer slots each for the vertex and pixel shader, so, there's actually 28 different IDs that are associated with the "bind cbuffer" command.
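
For illustration, the 28 IDs could be derived from a (shader stage, slot) pair along these lines. The enum, the constants and the choice to start the range at 0 are assumptions for the sketch.

```cpp
#include <cassert>
#include <cstdint>

enum class ShaderStage : std::uint32_t { Vertex = 0, Pixel = 1 };

const std::uint32_t kCBufferSlotsPerStage = 14;
const std::uint32_t kFirstCBufferId = 0;  // assumed start of the cbuffer ID range
const std::uint32_t kCBufferIdCount = 2 * kCBufferSlotsPerStage;  // 28 IDs in total

// Maps a (stage, slot) pair to its own command ID, so each slot can be
// overridden and redundancy-filtered independently of the others.
inline std::uint32_t CBufferCommandId(ShaderStage stage, std::uint32_t slot) {
    assert(slot < kCBufferSlotsPerStage);
    return kFirstCBufferId + static_cast<std::uint32_t>(stage) * kCBufferSlotsPerStage + slot;
}
```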

My draw-calls are actually a command, just like state-changes. However, I split commands into 3 different categories -- general state-changes, draw-calls, and per-pass state-changes.
State-groups can only contain general state-changes. Actual render-instances must use a draw-call command (not a state-change command).
The 3rd category are stored in something similar to a state-group, which is used to set up an entire "pass" of the rendering pipeline -- commands such as binding render-targets, depth-buffers, viewports, scissor tests, etc go into this category.
[quote]Come command queue execution I have to apply / update these uniforms after having bound the shader program. Would that result in a new command type?[/quote]There's a bunch of different abstractions for how uniforms are set, depending on your API... GL uses this model you're familiar with, you set the uniforms on the currently bound program... DX9 uses a model where there's a set of ~200 global registers, and any changes made to them persist from one shader to the next... DX10/11 are similar to 9, but you've got a set of bound CBuffers instead of individually bound uniforms.

So, I looked at these abstractions, and decided that the cbuffer approach made the most sense to me. No matter what the back-end rendering API actually is, my renderer deals with cbuffers -- and as above, I've got 14 cbuffer binding slots/commands per shader type.

The way this is used generally, is that a "shader" state-group on the bottom of the stack contains commands to bind cbuffers that contain default values. The "material" and "object/instance" state-groups then contain commands to bind their own cbuffers (which override the "default" commands).

On APIs that don't actually use the cbuffer abstraction, then yes, there's a step that looks at the currently bound cbuffers and sets all of the individual uniforms. I do this step prior to every draw call (with a whole bunch of optimisations to skip unnecessary work).
Regarding memory layout, I allocate all my cbuffer blocks (which are blobs containing uniforms) from a separate linear allocator.

This topic is closed to new replies.
