Frostbite rendering architecture question.

Started by
75 comments, last by n3Xus 12 years, 8 months ago

[quote name='TiagoCosta' timestamp='1313418697' post='4849392']You say that you don't use virtual functions. So what classes do you use to bind cbuffers, shaders resources and vertex/index buffers?
I have a class that wraps the D3D interface. To support multiple platforms, a different version of this class can be chosen at compile time (i.e. compile time polymorphism).

Internally, the D3D device (depending on platform/environment) may be implemented using either virtual functions, function pointers, a jump-table or a switch... but that's out of my hands and is the same for everyone using D3D.
[/quote]

You misunderstood me. I wasn't talking about D3D in general... On the second page of this topic you posted some classes including virtual functions, and you wrote:
[quote name='Hodgman']
In practice though, for performance reasons there's no std::vectors of pointers or virtual functions[/quote]
I could write a header specifying the state type in the blob, and then at run-time use a switch or if-statements to bind each resource correctly, but is there a faster way? Identifying the type of each resource every time I need to bind it doesn't sound fast.
You misunderstood me, I wasn't talking about D3D in general...
Yeah I was going off on a tangent about D3D possibly having virtuals, because I thought you were asking about my device wrapper.
The device wrapper actually binds things, and doesn't use any virtuals. The device wrapper is called by commands that are put in the command stream, which are executed by a virtual-like mechanism.
[quote name='TiagoCosta']In the second page of this topic you posted some classes including virtual functions ... I can write a header specifying the state type in the blob and then at run-time use a switch or if-statements to bind each resource correctly but is there a faster way? because identifying the type of each resource every time I need to bind it doesn't sound fast.[/quote]In the post you're referencing, I mention that all my bindings (and all other state-changes) go into a stream of commands, structured as command-type-enum values followed by command data.
These command-type values are used to implement dynamic dispatch using a switch statement (which is compiled into a jump table) instead of using a vtable. It's important to note why using virtuals would be slow, and why an alternative is preferable.

To call a function through a vtable, you:
* fetch the v-table address, stored in the first 4 bytes of the object (8 bytes on 64-bit).
* add an offset to the v-table address (depending on which virtual function you want to call -- e.g. [font="Courier New"]vtable+0[/font] may be [font="Courier New"]ExecuteMessage[/font]).
* fetch the function address stored at the computed address.
* jump to the function address.

This actually seems pretty fast -- it's just adding a few numbers together before doing the jump!

However, every different object type has a different v-table pointer in its first 4 bytes. These v-tables are placed at arbitrary locations by the compiler, meaning the computed addresses above could be in any part of RAM. The v-tables for two different objects could be right next to each other (good), or on completely different pages (bad). This memory layout is 100 times more important for performance than the actual CPU instructions that carry out the four points above.

To call a function through a local jump table, you:
* fetch the table offset (i.e. type-enum), stored in the first byte of the object.
* add the offset to your local jump table address.
* fetch the function address stored at the computed address.
* jump to the function address.

This seems almost exactly the same!

The important difference is that every object is using the same table of function pointers -- the computed address for different types will be in the same region of memory, which means that after executing the first command, the table is sure to be in the cache.
It does pretty much the same computations as the vtable would, except it exploits locality of reference by having a single table, at a known memory location, which every different type shares. The reason it's faster than a vtable is that it avoids cache misses.
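As a minimal sketch of this kind of dispatch (the enum values and function names here are my own invention, not Hodgman's actual code), a command stream driven by a switch could look like:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical command types -- one enum value per state-change/draw command.
enum CommandType : uint8_t { CMD_SET_SHADER, CMD_SET_VERTEX_BUFFER, CMD_DRAW };

// A command is its type-enum followed by command data; here just an int payload.
struct Command { CommandType type; int data; };

// Dispatch via a switch on the type-enum. The compiler typically turns a dense
// switch into a single shared jump table, so every command's dispatch touches
// the same small region of memory -- unlike per-class vtables scattered in RAM.
int ExecuteCommands(const std::vector<Command>& stream) {
    int drawCalls = 0;
    for (const Command& c : stream) {
        switch (c.type) {
        case CMD_SET_SHADER:        /* device.BindShader(c.data) */       break;
        case CMD_SET_VERTEX_BUFFER: /* device.BindVertexBuffer(c.data) */ break;
        case CMD_DRAW:              ++drawCalls;                          break;
        }
    }
    return drawCalls;
}
```

Every command type shares the one jump table, so after the first command executes, the table is almost certainly in cache.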


If you're not interested in implementing a command-queue though, then this isn't that important to you.
Simple question: a command-queue can be an std::vector containing pairs of draw calls and state-group arrays, right? Or is there another magic class?

[quote name='TiagoCosta' timestamp='1313418697' post='4849392']You say that you don't use virtual functions. So what classes do you use to bind cbuffers, shaders resources and vertex/index buffers?
I have a class that wraps the D3D interface. To support multiple platforms, a different version of this class can be chosen at compile time (i.e. compile time polymorphism).

[/quote]

How do you manage releases for each platform? On PC you could have the ability to run OGL, Dx9, and Dx11. Do you have completely separate engine/game builds for each of these? I can't see how you can decide whether to use Dx11 if a machine supports it, or fall back to Dx9 if it doesn't, if you are deciding at compile time.

I ask because in my engine I currently have an engine DLL which contains all the core functionality and I then have a DLL for each of the different renderers. I can then dynamically choose which renderer I want to load. This has the problem of many virtual calls though because of the abstract interface and so I might be inclined to go with a compile time system instead if I can find an elegant way to manage releases and what renderer is compiled in.
Simple question: a command-queue can be an std::vector containing pairs of draw calls and state-group arrays, right? Or is there another magic class?
Yeah, could be.

I need this kind of 'switch', because I don't know what kinds of commands the state-groups contain.
If you had a 'fixed' state-group like below, you wouldn't need any kind of switch/virtual, because you know what kind of commands you're receiving:

struct StateGroup
{
    Shader* shader;
    VertexDeclaration* vd;
    VertexBuffer* vb;
    IndexBuffer* ib;
    int numCb;
    CBuffers** cb;
};


I have a queue that's basically a vector of draw/state-array pairs (e.g. [font="Courier New"]struct Item { Draw* d; StateGroup** s; int numStates; }[/font]), which I can submit for drawing (which sends commands to D3D).

However, I also support converting the above queue structure into an array of commands (e.g. [font="Courier New"]vector<Command*>[/font]). N.B. draw-calls and individual states are both commands. When doing this, any redundant state-changes are omitted (e.g. if two consecutive items share the same shader, only the first 'Shader' command will be written into the array).

This allows the first kind of queue (of draw/state pairs) to be converted into a queue of raw commands on a background thread, before being submitted on the main thread, which allows even a D3D9 renderer to take advantage of multi-core CPUs.
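A rough sketch of that conversion step, simplified to a single state-group per item and a single redundancy check on the shader (all type names here are hypothetical, not the actual engine code):

```cpp
#include <cassert>
#include <vector>

// Stand-ins for the draw/state pairs described above, simplified to one
// state-group per item.
struct Draw { int id; };
struct StateGroup { int shader; };
struct Item { Draw* d; StateGroup* s; };

enum CommandType { CMD_SET_SHADER, CMD_DRAW };
struct Command { CommandType type; int data; };

// Convert the high-level queue into a flat command array, omitting redundant
// state changes: if two consecutive items share the same shader, only the
// first 'set shader' command is written. This pass can run on a background
// thread, with the resulting raw commands submitted on the main thread.
std::vector<Command> Compile(const std::vector<Item>& queue) {
    std::vector<Command> out;
    int lastShader = -1; // -1 means "nothing bound yet"
    for (const Item& it : queue) {
        if (it.s->shader != lastShader) {
            out.push_back({CMD_SET_SHADER, it.s->shader});
            lastShader = it.s->shader;
        }
        out.push_back({CMD_DRAW, it.d->id});
    }
    return out;
}
```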
I can then dynamically choose which renderer I want to load.
I've basically given up on this feature. It used to be cool back in 1998 when some cards were faster at GL and some at D3D, and when games still shipped with a software-rendered version, but these days there's basically only one good renderer per platform.
[quote]How do you manage releases for each platform? On PC you could have the ability to run OGL, Dx9, and Dx11. Do you have completely separate engine/game builds for each of these? I can't see how you can decide whether to use Dx11 if a machine supports it or fallback to Dx9 if it doesn't, if you are deciding during compile time.[/quote]Ignoring Windows compatibility between versions for a moment, each platform has one renderer we care about:
XP - D3D9
Vista/7 - D3D11
Mac - GL
Mobiles - GLES
Consoles - their proprietary ones.

So it only becomes complicated if you have a Windows version that includes XP support. In this case, I'd opt for two different EXEs - [font="Courier New"]GameDX9.exe[/font] and [font="Courier New"]GameDX11.exe[/font]. You can then make a "launcher" [font="Courier New"]Game.exe[/font] which picks the right one based on the Windows environment -- if they're on XP, have a DX9-class GPU, or for some reason don't have the D3D11 runtime installed, launch the DX9 exe; otherwise launch the DX11 exe.
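The launcher's decision could be sketched like this; the real platform checks (OS version, presence of the D3D11 runtime, GPU feature level) are platform-specific and stand in here as plain booleans:

```cpp
#include <cassert>
#include <string>

// Sketch of the launcher's decision logic only. The inputs would come from
// platform-specific queries (e.g. OS version checks, trying to load the
// D3D11 runtime); here they are just parameters.
std::string PickRenderer(bool isXP, bool hasD3D11Runtime, bool hasDX11ClassGPU) {
    if (isXP || !hasD3D11Runtime || !hasDX11ClassGPU)
        return "GameDX9.exe"; // fall back to the D3D9 build
    return "GameDX11.exe";
}
```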

I need this kind of 'switch', because I don't know what kinds of commands the state-groups contain.
If you had a 'fixed' state-group like below, you wouldn't need any kind of switch/virtual, because you know what kind of commands you're receiving:

struct StateGroup
{
    Shader* shader;
    VertexDeclaration* vd;
    VertexBuffer* vb;
    IndexBuffer* ib;
    int numCb;
    CBuffers** cb;
};


I have a queue that's basically a vector of draw/state-array pairs (e.g. [font="Courier New"]struct Item { Draw* d; StateGroup** s; int numStates; }[/font]), which I can submit for drawing (which sends commands to D3D).


I would recommend breaking that up even a bit more.
I keep vertex positions in their own buffer, normals in another, tangents and bi-normals in another buffer, and texture coordinates in a buffer of their own.
The reason is to avoid sending unnecessary data to the card when, for example, lighting is disabled.
When doing any standard multi-pass forward rendering, the first pass is to gather ambient details only, so unless you use hemisphere lighting or similar, you don’t need to submit normals, tangents, and binormals.
Then you will generate shadow maps. Submit only the vertices, omit normals, texture coordinates, etc. This can save you tons of bandwidth, making the generation of shadow maps virtually free.

This also helps with instancing. If an instance of your master model wants to change one part but keep the rest, the instance can generate a replacement for only that attribute. So I could submit a different set of texture coordinates for this one instance, without needing to duplicate the rest of the attributes of the master model.
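To illustrate the bandwidth argument with a toy example (the struct and function are hypothetical, not L. Spiro's actual code), per-attribute streams let each pass pay only for the streams it actually binds:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each attribute lives in its own stream, so a pass binds only what it needs.
struct Mesh {
    std::vector<float> positions;  // always needed
    std::vector<float> normals;    // lighting passes only
    std::vector<float> texcoords;  // textured passes only
};

// Bytes submitted for a pass = sum of the streams that pass binds. A shadow
// map pass binds positions only; a full lit/textured pass binds everything.
size_t BytesForPass(const Mesh& m, bool needNormals, bool needTexcoords) {
    size_t bytes = m.positions.size() * sizeof(float);
    if (needNormals)   bytes += m.normals.size() * sizeof(float);
    if (needTexcoords) bytes += m.texcoords.size() * sizeof(float);
    return bytes;
}
```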


[In Passing]
Because system states can change at run-time, my system is to perform a binary search for which shader needs to be active and activate it at render-time. I have dirty flags to avoid unnecessary searching, so if the system states are the same as they were last frame, the previous shader is kept.

Then I organize shaders into classes (here I mean classifications, not the C++ class type). One class of shaders for shadow mapping, one for ambient-only lighting, one for lighting, etc. This reduces the bucket sizes for the shared list of shaders and allows the binary search to be negligible (assuming a search is performed at all).
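A minimal sketch of that lookup, assuming each classification keeps its shaders sorted by a state key and caches the last result as the dirty-flag check (the names are mine, not the actual engine code):

```cpp
#include <cassert>
#include <algorithm>
#include <vector>

// One entry per shader permutation in this classification, sorted by key.
struct ShaderEntry { unsigned key; int shaderId; };

struct ShaderClass {
    std::vector<ShaderEntry> sorted;   // must stay sorted by key
    unsigned lastKey    = ~0u;         // dirty flag: last key looked up
    int      lastShader = -1;

    // Binary search for the shader matching the current system states.
    // If the states haven't changed since last time, skip the search.
    int Find(unsigned key) {
        if (key == lastKey) return lastShader;
        auto it = std::lower_bound(sorted.begin(), sorted.end(), key,
            [](const ShaderEntry& e, unsigned k) { return e.key < k; });
        lastKey = key;
        lastShader = (it != sorted.end() && it->key == key) ? it->shaderId : -1;
        return lastShader;
    }
};
```

Splitting shaders into classifications keeps each sorted list short, so the binary search stays cheap even when it does run.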


Deciding which shader to use at render-time gives me a lot of flexibility, and I save some RAM by generating only the permutations that are requested.
But I mention this system in passing. Mainly I want to make a strong suggestion to split up the vertex buffers and use streams.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

YogurtEmperor: Did you do any performance tests comparing the multistreamed rendering that you use versus the traditional "everything in one vertex buffer" approach?

I heard there is a performance hit when data is separated into multiple buffers, but is that outweighed in the big picture (if you use shadow mapping, etc.)?
I started off with everything in one buffer for the sake of getting a result and moving on, then split them into multiple buffers.
No performance change was noticed from just splitting them (before applying them to faster shadow mapping, etc.). Tested in OpenGL and DirectX 9 so far.
The gains from not sending the extra data far outweigh whatever slowdown is supposed to come from splitting them. For me that was a noticeable gain in performance -- around 50% faster creation of shadow maps alone.
That makes sense: the difference between a full vertex buffer and a position-only one is about 62% (meaning the position-only buffer is roughly 38% the size of the original in a standard case), and because the shaders are simple and there is no shader or texture swapping during the entire render, most of the time spent generating shadow maps is just the upload of vertices.

The gains get more serious as you do more complex renders, since, for almost every type of pass, you can omit at least one attribute from the upload.


L. Spiro


Nice to hear, thanks!
Can someone please give me some examples of objects that need different shader permutations? In this topic we've been talking about using 32/64 boolean options, and I can only think of a few, like diffuse, normal, and spec maps, parallax mapping, and reflectivity...


Virtual texturing vs direct texturing
SH lighting enable
lightmap enable
GPU Skinning
Cubemap / envmap
shadowmap enable (only need to receive shadows if a shadowcaster volume intersects the object!)
Decal/Detail textures

And as Hodgman mentioned, picking out individual passes for multi-pass shaders.
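One common way to represent those options -- sketched here with invented flag names, not any engine's actual list -- is a bitmask, where each boolean option is one bit and the combined 32/64-bit mask identifies a permutation; only the masks actually requested ever get compiled:

```cpp
#include <cassert>
#include <cstdint>

// Each boolean shader option is one bit of a permutation mask.
enum PermFlags : uint32_t {
    PERM_NORMAL_MAP = 1u << 0,
    PERM_SPEC_MAP   = 1u << 1,
    PERM_SKINNING   = 1u << 2,
    PERM_SHADOWS    = 1u << 3,
    PERM_LIGHTMAP   = 1u << 4,
};

// Combine the options an object needs into a single permutation key. A cache
// keyed on this mask would compile each permutation on first request only.
uint32_t MakePermutation(bool normalMap, bool skinned, bool receivesShadows) {
    uint32_t mask = 0;
    if (normalMap)       mask |= PERM_NORMAL_MAP;
    if (skinned)         mask |= PERM_SKINNING;
    if (receivesShadows) mask |= PERM_SHADOWS;
    return mask;
}
```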

This topic is closed to new replies.
