DrawItems for temporary rendering

Started by
7 comments, last by Juliean 7 years, 3 months ago

Hello,

so I'm reworking my low-level rendering architecture towards what Hodgman presented in his wonderful presentation "Designing a modern GPU Interface" (https://www.dropbox.com/sh/4uvmgy14je7eaxv/AABboa6UE5Pzfg3d9updwUexa?dl=0).

One thing wasn't answered there though, and neither could I find an answer in the forums: If you generally compile your DrawItems once, how do you handle temporary draw calls, like from a sprite-batcher?

In my old implementation I already had StateGroups, but I just interpreted them every time something was rendered instead of compiling them. That meant there was an overhead for every non-temporary DrawItem, but temporary batches from my sprite-batcher weren't much slower:


RenderStateHandler::BindTexture(m_states, pTexture, 0);
m_pEffect->BindEffect(m_states);

const render::StateGroup* groups[2] = { &m_states, &m_pEffect->GetState(m_permutation) };
render::Instance instance(m_sortKey, groups, 2);

OnSubmitInstance(m_numBatches, m_currentBatch, instance);
m_pStage->SubmitTempInstance(std::move(instance));

Now that I use ResourceLists and DrawItems like in the presentation, there are two problems here:

1) I cannot bind textures directly anymore. However, the SpriteBatch takes a Texture pointer as an argument, which I want to keep for the high-level user interface. Internally, if I have to batch, say, 100 different textures (which can happen when rendering my editor's UI), that means I have to acquire 100 resource-lists, keep a map to them, and deal with when and how to remove them. It works, but it isn't that performant, and seems a bit hacky.

2) Obviously, I have to compile a DrawItem for every batch that is submitted, then process that DrawItem, effectively adding unnecessary overhead for temporary draws.

I can see a few different solutions for problem 2:

- Precompile a number of DrawItems, sufficient to handle a maximum number of batches. Then process and submit them in order, modifying the attributes that differ per batch (which are going to be a lot => resource lists, draw-command, possibly program/blend/rasterizer-id if the effect changes).

- Implement a second render-path for immediate-mode rendering, bypassing the render-queue and DrawItem-system completely. This could work, but would involve some code duplication, lose all benefits of the queue (though the batcher has its own sorting, so this should be minimal), and I'm not even sure if this would still produce the same results (if I just render at the spriteBatch.Draw()-command, instead of later when resolving the whole queue).

Now how do you solve this? (Either Hodgman himself, or anyone with a similar system in place :) ) None of the solutions I could come up with seem particularly good, so I'm wondering if there is something more solid.

Thanks!


One thing wasn't answered there though, and neither could I find an answer in the forums: If you generally compile your DrawItems once, how do you handle temporary draw calls, like from a sprite-batcher?

In all the games I've used this architecture on, we generally start by compiling DrawItems on demand, every frame... and then over time, we rewrite systems to pre-generate and re-use as much data as possible. Often a system will ship on its first game using on-demand DrawItems, and then on the next game it will be optimized to use persistent DrawItems :D
For immediate-mode style systems, it's pretty easy to just keep them using per-frame DrawItems.

1) I cannot bind textures directly anymore. However, the SpriteBatch takes a Texture pointer as an argument, which I want to keep for the high-level user interface. Internally, if I have to batch, say, 100 different textures (which can happen when rendering my editor's UI), that means I have to acquire 100 resource-lists, keep a map to them, and deal with when and how to remove them. It works, but it isn't that performant, and seems a bit hacky.

It's actually quite common for simple systems to want to use a ResourceList with only one texture in it -- e.g. a sprite renderer that binds a single texture.
Note that I use ResourceListId's and TextureId's in my back-end. To optimize for this particular use-case, if the high-bit of a ResourceListId is set, then that means that it's actually a TextureId in disguise. This lets these simple rendering systems bind single textures without the overhead of managing ResourceList objects.
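
A minimal sketch of how such a tagged ID could look (the type names and bit layout here are illustrative assumptions, not the actual engine types):

#include <cstdint>

typedef uint32_t TextureId;
typedef uint32_t ResourceListId;

static const uint32_t kSingleTextureFlag = 0x80000000u; // high bit marks "this is really a TextureId"

inline ResourceListId MakeSingleTextureList( TextureId tex )  { return tex | kSingleTextureFlag; }
inline bool           IsSingleTexture( ResourceListId id )    { return ( id & kSingleTextureFlag ) != 0; }
inline TextureId      GetSingleTexture( ResourceListId id )   { return id & ~kSingleTextureFlag; }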

2) Obviously, I have to compile a DrawItem for every batch that is submitted, then process that DrawItem, effectively adding unnecessary overhead for temporary draws.

This should have very similar overhead to your previous system -- evaluating state groups upon rendering.
FWIW, our first implementation of this architecture also evaluated state-groups upon rendering, not upon draw-item creation :)
For certain systems, the overhead of per-frame DrawItem creation can be tolerated - it's only a little extra CPU time, and you can do it on any thread.

- Precompile a number of DrawItems, sufficient to handle a maximum number of batches. Then process and submit them in order, modifying the attributes that differ per batch (which are going to be a lot => resource lists, draw-command, possibly program/blend/rasterizer-id if the effect changes).

We do have a few hacks where pre-created DrawItems are modified in order to get more re-use out of them... but I usually try to avoid this as having mutable draw-items decreases the amount of optimizations you can do on certain platforms.
I've struck a trade-off here, where when you create a DrawItem, you can specify some flags for special behaviour -- e.g. that you want a mutable DrawItem. This allows said platforms to avoid doing excessive optimizations on that DrawItem that wouldn't be compatible with mutability.

One example of this is software skinning (yep, we've shipped software skinned characters in 2016 in order to optimize for shitty GPU's :o). There's a big ring-buffer that holds skinned vertices, and every frame, the vertices of each character are written into a different location in that buffer, meaning that the vertex-buffer offset of their DrawItems needs to be updated.
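
A rough illustration of that mutable-item idea (these structs are simplified stand-ins, not the real DrawItem/DrawItemOptions types):

#include <cstdint>

struct MutableDrawOptions
{
	enum Flags : uint32_t { None = 0, Mutable = 1u << 0 }; // requested at draw-item creation time
	uint32_t flags = None;
};

// Stand-in for a DrawItem that was created with the Mutable flag set.
struct MutableDrawItem
{
	uint32_t vbOffset; // in vertices, not bytes
};

// Each frame the skinned vertices land in a different spot in the ring buffer,
// so the pre-built item's vertex-buffer offset is patched before submission.
inline void PatchSkinnedDraw( MutableDrawItem& item, uint32_t ringBufferVertexOffset )
{
	item.vbOffset = ringBufferVertexOffset;
}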

For immediate-mode style systems, it's pretty easy to just keep them using per-frame DrawItems.

Okay, that's how I'll keep it for now then. I can always optimize if it turns out to be a problem.

Note that I use ResourceListId's and TextureId's in my back-end. To optimize for this particular use-case, if the high-bit of a ResourceListId is set, then that means that it's actually a TextureId in disguise. This lets these simple rendering systems bind single textures without the overhead of managing ResourceList objects.

Uh, that's pretty clever, seems like a good solution. I'm in the process of changing everything to IDs; I didn't like the idea at first, but now that I'm doing it, it makes many things much easier.

Now while continuing to work on it, I came up with a few more questions:

1) What does a DrawCall actually look like? You mention type, offset and primitive count, but what about indexed/non-indexed, instanced, multi-draw-indirect etc...? Obviously you can figure out whether you need to draw indexed or not by looking at whether an index-buffer is bound, but for indexed draw calls there is an additional index-buffer-offset parameter, for instanced rendering the number of instances to draw, etc... Do you just have a bunch of generic uint32_t parameters to handle this? (I don't think so, since your DrawItem-structs are really small)

2) Regarding InputAssemblerConfig. In your StateGroup, you have VertexData and InstanceData separated, but then you have an InputAssemblerConfigID in your DrawItem. How does that work? You mention that InputAssemblerConfigs can and should be shared between different draw calls, but what does the logic for figuring that out look like in this case? If you just bind the InputAssemblerConfigID to the StateGroup, I can tell my meshes and instanced meshes to generate their complete config and reuse that, but if I have to do the generation when compiling the state-group, do you have some sort of hashing/lookup, or am I misunderstanding how that works? Furthermore, does it even make sense to separate InstanceData, since VertexData holds the stream format, which already has to know to use an instance buffer for certain attributes (doesn't it?)

3) In order to compile a DrawItem, you need to use a specific RenderPass (for override & defaults). For immediate-mode this is pretty simple, but what about e.g. meshes that are rendered in a deferred pass, shadow-pass, reflection, ...? Any tips on how to implement this? Up until now, I could just generate my StateGroups and insert them into any pass. The pass would then first evaluate a unique DrawItem, binding its own state (cbuffers, ...), followed by the items submitted. Now I need to somehow register the renderable with the pass, to create a state-item for the pass/renderable combination. Does that sound about right? Where do I store the generated DrawItem for the pass: inside a lookup-table/array in the Renderable, or inside the Pass itself?

4) Furthermore, upon execution of a DrawItem I have to submit the RenderPass-information (render-targets, viewport). How do you handle this in combination with your render-queue/DrawItem-sorting? Again, in my old system I would have one queue/array that I could sort, and the RenderPass-bindings would be done by the "DrawItem" inserted by the state group with the right key. Now, how do you handle executing/binding the RenderPass-information? Storing it alongside the DrawItem with, say, std::pair<RenderPass*, DrawItem*> seems like a waste. Maybe instead of array<DrawItem*>, use array<pair<RenderPass*, array<DrawItem*>>>, where each pass's items are sorted and executed separately? Does that sound right?

5) Where/How does the scissor rect fit in the StateGroup/DrawItem? I need it at least for my UI, where I iteratively refine a scissor-rect for each widget that is drawn, so making it part of the RenderPass-data is not a good option. In DX11 it's part of the rasterizer-stage, but making it part of the Rasterizer-config seems clumsy: as I would definitely need more than 256 scissor-rects, I would need to increase the ID-size of the rasterizer config from 8 to 16 bit, and I'd also have to re-create the whole rasterizer-config when just the scissor-rect changed. Making it just part of the DrawItem would increase its size by 16 bytes, so that's not an option either. Maybe make it a unique type of resource with a 16-bit ID, and add that to the render-item? Do you have any clever way of handling this in your system, or at least hints for how I can approach this?

Thanks again!

1) What does a DrawCall actually look like? You mention type, offset and primitive count, but what about indexed/non-indexed, instanced, multi-draw-indirect etc...? Obviously you can figure out whether you need to draw indexed or not by looking at whether an index-buffer is bound, but for indexed draw calls there is an additional index-buffer-offset parameter, for instanced rendering the number of instances to draw, etc... Do you just have a bunch of generic uint32_t parameters to handle this? (I don't think so, since your DrawItem-structs are really small)

I have a separate draw-call-descriptor struct for each type, which the user fills in prior to compiling a draw-item.
The interface to create draw-items looks something like:


struct LinearDrawDesc
{
	PrimitiveType::Type primitive;
	u32 primitiveCount;
	u32 vbOffset; // - vbOffset is counting in the number-of-vertices from the start of the buffer, NOT in bytes
	u8 stencilRef;
	bool useStencilRef;
};
struct IndexedDrawDesc
{
	PrimitiveType::Type primitive;
	u32 primitiveCount;
	u32 vbOffset; // - vbOffset is counting in the number-of-vertices from the start of the buffer, NOT in bytes
	u32 ibOffset; // - ibOffset is in number-of-indices, NOT in bytes
	u8 stencilRef;
	bool useStencilRef;
};

class DrawItemWriter
{
public:
	DrawItemWriter();
	// Either pass a Scope allocator, or pass 'Persistent'.
	// In the Persistent case: Each DrawItem must have its Release function called, and
	//                         the DrawItemSharedResources must have its Release function called.
	// In the Scope case: the DrawItems and DrawItemSharedResource will be released automatically by the supplied Scope. Do not call Release on them.
	void Begin( GpuDevice& gpu, Scope& alloc );
	void Begin( GpuDevice& gpu, Persistent_tag, DrawItemSharedResources* reuseExistingSharedData=0 );

	void BeginPass( u32 pass, const PassState*, const RenderTargetState& );
	void BeginPass( const RenderPass& );

	void PreFlattenStates( FlattenedDrawStates& output, u32 stateGroupCount, const StateGroup*const* stateGroups );//If you're going to use the same state-group stack for multiple draws within a pass, this lets you pay the stack-flattening cost once.

	DrawItem* Add( const char* name, const DrawDesc&,         u32 stateGroupCount, const StateGroup*const* stateGroups, const DrawItemOptions& opt = DrawItemOptions() );
	DrawItem* Add( const char* name, const DrawDesc&,         const FlattenedDrawStates&,                               const DrawItemOptions& opt = DrawItemOptions() );
	
	void EndPass();
	DrawItemSharedResources* End();
};
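
To make the flow concrete, a usage sketch of that interface might look like the following. The state-groups, pass object and descriptor values are placeholders, 'Persistent' is assumed to be a value of the Persistent_tag type, and LinearDrawDesc is assumed to be usable where a DrawDesc is expected:

DrawItemWriter writer;
writer.Begin( gpu, Persistent );                // persistent items: Release them manually later

writer.BeginPass( gbufferPass );                // the pass supplies the state defaults/overrides

LinearDrawDesc desc;
desc.primitive      = PrimitiveType::TriangleList;
desc.primitiveCount = triangleCount;
desc.vbOffset       = 0;                        // in vertices, not bytes
desc.stencilRef     = 0;
desc.useStencilRef  = false;

const StateGroup* groups[] = { &meshStates, &materialStates };
DrawItem* item = writer.Add( "myMesh", desc, 2, groups );

writer.EndPass();
DrawItemSharedResources* shared = writer.End(); // must outlive 'item' in the persistent case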

Internally, every draw-item starts with a 64-bit header, which mostly contains state ID's, but it also contains a jump-table index.
Jump tables are kind of like a vtable used for virtual function calls - an array of function pointers. The array itself differs per platform, but looks something like:


	typedef void(*PfnDraw)(void*, const void*);
	const static PfnDraw s_drawJumpTable[] = 
	{
		&DL <0>, &DI <0>, &IDL<0>, &IDI<0>,//non instanced, no per-draw stencil-ref
		&DLI<0>, &DII<0>, &IDL<0>, &IDI<0>,//    instanced, no per-draw stencil-ref
		&DL <1>, &DI <1>, &IDL<1>, &IDI<1>,//non instanced,    per-draw stencil-ref
		&DLI<1>, &DII<1>, &IDL<1>, &IDI<1>,//    instanced,    per-draw stencil-ref
	};

^That's 16 different drawing function permutations, depending on linear/indexed, instanced, indirect, and whether the stencil-ref value is set per pass or per draw(!)... More on that last one later.

When building a draw-item, the DrawItemWriter asks the back-end for the appropriate jump-table index, based on the type of draw-call that it's building, and then stores this index in the header. Note that a table of 16 entries requires 4 bits in the header to store this info.
The actual draw-item itself can then be one of 16 different structures, as it will be interpreted by the corresponding function in that table.
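
A sketch of what that dispatch could look like at submission time (the header layout and the position of the index within it are assumptions):

struct DrawItemHeader
{
	u64 bits; // state IDs + jump-table index packed together
};

inline u32 GetJumpIndex( const DrawItemHeader& h )
{
	return (u32)( h.bits & 0xF ); // a 16-entry table needs 4 bits
}

inline void ExecuteDrawItem( void* context, const void* drawItem )
{
	const DrawItemHeader* header = (const DrawItemHeader*)drawItem;
	// The selected function knows which of the 16 draw-item layouts it is reading.
	s_drawJumpTable[ GetJumpIndex( *header ) ]( context, drawItem );
}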

2) Regarding InputAssemblerConfig. In your StateGroup, you have VertexData and InstanceData separated, but then you have an InputAssemblerConfigID in your DrawItem. How does that work?
Furthermore, does it even make sense to separate InstanceData, since VertexData holds the stream format, which already has to know to use an instance buffer for certain attributes (doesn't it?)

See DrawItemSharedResources, above. The DrawItemWriter keeps track of these potentially reusable structures while building a collection of draw-items (between one Begin/End pair). Begin also takes a pointer to an existing DrawItemSharedResources, if you want a new set of draw-items to continue using the same pool as an earlier batch of draw-items. That class itself is basically a pool / hash-table, yep.

With the separate instance/vertex data, I split those because they tend to come from separate sources, which means separate state-groups. The mesh itself has a state-group that binds the per-vertex buffers, and an instancing system will have another state-group that binds the per-instance buffers.
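
In practice that split just means the two state-groups end up stacked together when the draw-item is built, along the lines of the following (continuing the writer sketch above; the member names are placeholders):

const StateGroup* groups[] =
{
	&mesh.perVertexStates,         // per-vertex buffers + StreamFormat
	&instancing.perInstanceStates, // per-instance buffer(s)
	&material.states,
};
DrawItem* item = writer.Add( "instancedMesh", desc, 3, groups );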

3) In order to compile a DrawItem, you need to use a specific RenderPass (for override & defaults). For immediate-mode this is pretty simple, but what about e.g. meshes that are rendered in a deferred pass, shadow-pass, reflection, ...? Any tips on how to implement this? Up until now, I could just generate my StateGroups and insert them into any pass. The pass would then first evaluate a unique DrawItem, binding its own state (cbuffers, ...), followed by the items submitted. Now I need to somehow register the renderable with the pass, to create a state-item for the pass/renderable combination. Does that sound about right? Where do I store the generated DrawItem for the pass: inside a lookup-table/array in the Renderable, or inside the Pass itself?
4) Furthermore, upon execution of a DrawItem I have to submit the RenderPass-information (render-targets, viewport). How do you handle this in combination with your render-queue/DrawItem-sorting?

That problem simply isn't solved in my low-level API -- it just trusts that you submit a draw-item alongside the same render-pass as you created it with, and the behaviour is undefined otherwise.
The GpuContext submit function looks like:


	//Submit a list of draw-calls, optionally clearing before the first draw
	void Submit( const RenderPass&, const DrawList&, const ClearCommand* c=0 );

Where DrawList is a lightweight class that basically contains a DrawItem** and a count. So yes, you submit a collection of DrawItems and explicitly state the RenderPass to use with them.

At a higher level, rendering systems will query their model's shaders as to which passes that model is compatible with, and will query the rendering pipeline as to which passes it intends to render models with. The rendering systems will then pre-generate several draw-items for each mesh -- one for each pass that it will be used in.
A scene traversal / sorting system can then get the list of required passes from the rendering pipeline, and then fill in one array per pass from these rendering systems, sort them, and hand the sorted arrays over to the rendering pipeline.
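
A rough sketch of that per-pass flow (the pipeline/scene methods, the sort predicate and the DrawList constructor shape are assumptions for illustration):

for( u32 passIdx = 0; passIdx < pipeline.GetPassCount(); ++passIdx )
{
	const RenderPass& pass = pipeline.GetPass( passIdx );

	// Each rendering system contributes the DrawItems it pre-generated for this pass.
	std::vector<DrawItem*>& items = perPassItems[ passIdx ];
	scene.CollectItemsForPass( passIdx, items );

	std::sort( items.begin(), items.end(), SortByKey ); // per-pass sort order (e.g. by sort key)

	DrawList list( items.data(), (u32)items.size() );
	gpuContext.Submit( pass, list );                    // the pass is stated explicitly
}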

5) Where/How does the scissor rect fit in the StateGroup/DrawItem? In DX11 it's part of the rasterizer-stage, but making it part of the Rasterizer-config seems clumsy: as I would definitely need more than 256 scissor-rects, I would need to increase the ID-size of the rasterizer config from 8 to 16 bit, and I'd also have to re-create the whole rasterizer-config when just the scissor-rect changed.

The draw-item header has one bit indicating whether a per-draw-item scissor rect is being supplied or not. If not, there's no extra overhead, otherwise there's four u16's added to the end of the draw-item containing the scissor rect (I don't support floating point rect coords :( ).

The scissor rect is not part of the rasterizer-state object in DX11 - only a bool saying whether a scissor rect is in use or not is part of that object. So, I always compile a pair of D3D rasterizer-state objects for each of my own rasterizer-states, so I can pick the right one depending on whether a draw-item/pass wants scissoring or not.
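
For reference, creating such a pair in D3D11 boils down to something like this (a sketch; 'device' is the ID3D11Device and error handling is omitted):

D3D11_RASTERIZER_DESC rd = {};
rd.FillMode        = D3D11_FILL_SOLID;
rd.CullMode        = D3D11_CULL_BACK;
rd.DepthClipEnable = TRUE;

ID3D11RasterizerState* noScissor   = nullptr;
ID3D11RasterizerState* withScissor = nullptr;

rd.ScissorEnable = FALSE;
device->CreateRasterizerState( &rd, &noScissor );   // used when no scissoring is wanted

rd.ScissorEnable = TRUE;
device->CreateRasterizerState( &rd, &withScissor ); // used when the draw-item/pass supplies a rect

// At draw time:
//   context->RSSetState( wantsScissor ? withScissor : noScissor );
//   if( wantsScissor ) context->RSSetScissorRects( 1, &rect );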

Also see the above example of having the draw submission code vary based on whether the draw-item has a per-draw stencil-ref or not -- you can build similar permutations that deal with extra data like a scissor rect being present or not.

Thanks again for taking the time to explain, and for showing some code.

Internally, every draw-item starts with a 64-bit header, which mostly contains state ID's, but it also contains a jump-table index. Jump tables are kind of like a vtable used for virtual function calls - an array of function pointers. The array itself differs per platform, but looks something like:

That's 16 different drawing function permutations, depending on linear/indexed, instanced, indirect, and whether the stencil-ref value is set per pass or per draw(!)... More on that last one later.

Ah, that sounds just about right, I'm going to use something similar. I think I don't need per-draw stencil for now, so I can use the bit for a per-draw scissor rect instead (that's a pretty good solution too).

See DrawItemSharedResources, above. The DrawItemWriter keeps track of these potentially reusable structures while building a collection of draw-items (between one Begin/End pair). Begin also takes a pointer to an existing DrawItemSharedResources, if you want a new set of draw-items to continue using the same pool as an earlier batch of draw-items. That class itself is basically a pool / hash-table, yep.

To make sure I understand this right - what's the reason behind the DrawItemSharedResources in the first place? I'm working under the assumption that the device stores some linear arrays for the different resources (textures, buffers, samplers, states, ...) to later bind from an ID, when the draw item is to be executed.

Then I assumed that this sort of reusing of structures would happen in the writer itself, which would have different hash_maps for rasterizer-states, input-assembler configs etc... but apparently you use a DrawItemSharedResources for that.

So my questions:

1) Is this close to how it would work?

2) What's the reason behind the DrawItemSharedResources? I fail to see why you would have different batches of DrawItems not share those reusable structures (why would I generate a different InputAssemblerConfig/RasterizerConfig with the same parameters)? So assuming my description of the process is correct, what's the use-case for not sharing resources between different batches of DrawItems? If my description of the process is wrong, a correction should probably make this more clear.

With the separate instance/vertex data, I split those because they tend to come from separate sources, which means separate state-groups. The mesh itself has a state-group that binds the per-vertex buffers, and an instancing system will have another state-group that binds the per-instance buffers.

That does make sense, but I'm still not sure how to handle the StreamFormat for instancing then. Say a mesh has just one vertex buffer for positions. The StreamFormat will only have one float3-position attribute, from that one buffer. Now you want to instance this mesh with per-instance transform matrices - you now have to bind a different InputLayout that includes this instance-streamed buffer. In your presentation though, StreamFormat comes exclusively from the VertexBuffer. How does that work? I can see two ways:

- Actually, StreamFormat is its own StateGroup-item, and can be overridden by the instancing-system, which will generate a combined StreamFormat from the mesh & instancing data (that's along the lines of what I've been doing so far) OR

- You submit a separate StreamFormat alongside the Instance-buffers, and the Writer will then combine those two and use it for the InputAssemblerConfig

3) Which one do you use? Or is there some other clever trick to handle this?

Where DrawList is a lightweight class that basically contains a DrawItem** and a count. So yes, you submit a collection of DrawItems and explicitly state the RenderPass to use with them.

I see, that sounds good. I used to have one render-queue for all passes, but it really doesn't make much sense to have it that way now that I think about it - why would I submit all items into one queue, when I explicitly sort them by pass-id afterwards anyway? (In general, while refactoring my render-API, I've come across so many WTF-why-did-I-do-that moments, not only regarding the DrawItem-compilation... it's about time I'm tackling it.)

The draw-item header has one bit indicating whether a per-draw-item scissor rect is being supplied or not. If not, there's no extra overhead, otherwise there's four u16's added to the end of the draw-item containing the scissor rect (I don't support floating point rect coords :( ). The scissor rect is not part of the rasterizer-state object in DX11 - only a bool saying whether a scissor rect is in use or not is part of that object. So, I always compile a pair of D3D rasterizer-state objects for each of my own rasterizer-states, so I can pick the right one depending on whether a draw-item/pass wants scissoring or not.

That's pretty neat and exactly what I need, too. I've always wondered why DX11 even uses floating-point for the scissor-rect; u16 sounds good enough.

So I hope those will be the last questions I have :) Thanks again!

EDIT: Well, since I'm at it:

4) You mentioned redundant-state checking by XORing the last and current draw-item. So it sounds like you only perform the redundancy checks at this granularity - i.e. if I bind a resource list to slot 0, you only check whether the resource lists are the same, and not whether single textures are different?

I currently have a system in place where I e.g. check for single textures, and even for single shader-stages (in combination with looking at which shader stages are currently active, and which textures are being used). Even though that's currently 800 bytes of information for both current & last state (it would go down once I've fully switched to IDs), and a lot of checks, I assumed (I think I even profiled it) that it's still faster than e.g. binding a full resource-list to all stages, when in reality something like the Domain/Hull-shader is really only used rarely.

Did you do any profiling, and/or have any theoretical knowledge as to whether the gain from not paying the cost of that fine-grained state checking outweighs the gains from not issuing some redundant state-changes (or even just from not binding resources that are not needed); or did you just assume that the amount of loops/branches required would kill any benefit over plain XOR/bit-compares right away?

I've always wondered why DX11 even uses floating-point for the scissor-rect; u16 sounds good enough.

I've never used the feature, but AFAIK it means that you can make a viewport that slices through a fraction of an MSAA pixel, covering just some of its samples.

This is actually every viewport in D3D9 due to its stupid pixel coordinate system :o

Say a mesh has just one vertex buffer for positions. The StreamFormat will only have one float3-position attribute, from that one buffer. Now you want to instance this mesh with per-instance transform matrices - you now have to bind a different InputLayout that includes this instance-streamed buffer. In your presentation though, StreamFormat comes exclusively from the VertexBuffer. How does that work?

My StreamFormat would contain all the attributes, including both per-vertex and per-instance ones.

The mesh's state-group sets up the per-vertex buffer bindings and the StreamFormat - so if you render this mesh without another state-group to bind the per-instance buffers, then you'll simply be rendering with NULL buffers for the per-instance attributes.

When I author a shader or a StreamFormat, I'm explicit about whether it's an instanced shader or not. I don't have a switch/permutation system to support both non-instanced and instanced draws with the one shader -- though that would be a valid choice too and many engines support such a thing.
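
As a D3D11 illustration of such a combined layout, the per-vertex and per-instance attributes can simply live in one input-layout declaration, with the instance attributes sourced from a second slot (the semantic names here are made up for the example):

const D3D11_INPUT_ELEMENT_DESC layout[] =
{
	// per-vertex attributes, sourced from slot 0 (the mesh's vertex buffer)
	{ "POSITION",  0, DXGI_FORMAT_R32G32B32_FLOAT,    0, 0,  D3D11_INPUT_PER_VERTEX_DATA,   0 },
	// per-instance attributes, sourced from slot 1 (the instance buffer) -- a float4x4 spans four elements
	{ "TRANSFORM", 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 1, 0,  D3D11_INPUT_PER_INSTANCE_DATA, 1 },
	{ "TRANSFORM", 1, DXGI_FORMAT_R32G32B32A32_FLOAT, 1, 16, D3D11_INPUT_PER_INSTANCE_DATA, 1 },
	{ "TRANSFORM", 2, DXGI_FORMAT_R32G32B32A32_FLOAT, 1, 32, D3D11_INPUT_PER_INSTANCE_DATA, 1 },
	{ "TRANSFORM", 3, DXGI_FORMAT_R32G32B32A32_FLOAT, 1, 48, D3D11_INPUT_PER_INSTANCE_DATA, 1 },
};
// If nothing is bound to slot 1 at draw time, the per-instance attributes read from a NULL buffer.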

To make sure I understand this right - what's the reason behind the DrawItemSharedResources in the first place?

Yeah I've led you astray a bit here. It's basically a proxy for DrawItem memory management.

The device does keep a hash-map of raster-states, blend-states, textures, resource lists, buffers, etc... but it does not keep around a map of InputAssemblerConfigs.

An IAConfig structure is a bit like a ResourceList, but for vertex attribute bindings + input layout config, so I certainly could have the device keep track of them via ID's just like ResourceLists...

But, lifetime management is a bit funky -- the user doesn't create IAConfig structures because they're internal to the implementation of a DrawItem... and DrawItems are POD, so I can't use RAII to keep track of the lifetime (and doing so would add bloat anyway...).

So instead, when you create a DrawItem with the writer object, it also creates (or reuses/modifies) a DrawItemResources structure, which does the lifetime management for those IAConfig structures. Basically, this is how I let the user know that DrawItems allocate some internal objects which must have a lifetime that is the same or longer than the DrawItems. The user doesn't have to know/care what these internal allocations are -- they just have to know to clean them up for me after the DrawItems are no longer in use.

The DrawItems themselves are POD -- the user can memcpy them if they like -- but this DrawItemResources structure is a bona fide class, requiring the user to keep a pointer to it and call Release at the appropriate time.
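
In the persistent case, that contract boils down to something like this (continuing the earlier writer sketch; names are placeholders):

// ... draw with 'item' for as many frames as needed ...

// Once the DrawItems are no longer in use:
item->Release();   // each persistently-created DrawItem
shared->Release(); // then the shared internal allocations (IAConfigs etc.)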

This is basically a bit of a leaky abstraction. The writers are supposed to just return an opaque POD blob to you with no memory management strings attached... but I went and made the IAConfig shared between draws, which required its lifetime to be tracked, which required me to return this proxy object back to the user.

There's other things this can be used for too. On some platforms, a DrawItem can be pre-compiled into a small command buffer / bundle of GPU commands, which also requires memory management (allocations to be free'ed after the DrawItems are no longer in use), which the DrawItemResources structure can provide.

This does create a hassle for the user though, as now they have to hold onto this pointer... but I mostly use scope-stack allocation, in which case the user doesn't have to hold onto that pointer, as mentioned in the comment :)

I've never used the feature, but AFAIK it means that you can make a viewport that slices through a fraction of an MSAA pixel, covering just some of its samples. This is actually every viewport in D3D9 due to its stupid pixel coordinate system :o

I'm glad I don't have to work with D3D9 full-time anymore, I remember all the shenanigans with stupid stuff like that :D I've been thinking of dropping support for the (currently utterly broken) D3D9 implementation anyway, though some of the players of a game I make are still using WinXP... can't people just upgrade to Win7 and be fine with it? :/

My StreamFormat would contain all the attributes, including both per-vertex and per-instance ones. The mesh's state-group sets up the per-vertex buffer bindings and the StreamFormat - so if you render this mesh without another state-group to bind the per-instance buffers, then you'll simply be rendering with NULL buffers for the per-instance attributes. When I author a shader or a StreamFormat, I'm explicit about whether it's an instanced shader or not. I don't have a switch/permutation system to support both non-instanced and instanced draws with the one shader -- though that would be a valid choice too and many engines support such a thing.

Okay, that makes more sense. TBH my current solution isn't so far from that. I was/am having headaches over how to handle e.g. instancing meshes loaded from files this way - obviously the modelling tools won't export instancing-information; and also for stuff like mesh-preview, using instancing seems like overkill. I think somebody once told me they evaluate their input-layout/stream format from the combination of mesh & shader it's used with; I guess that makes sense now, and I'll build something like that.

So instead, when you create a DrawItem with the writer object, it also creates (or reuses/modifies) a DrawItemResources structure, which does the lifetime management for those IAConfig structures. Basically, this is how I let the user know that DrawItems allocate some internal objects which must have a lifetime that is the same or longer than the DrawItems. The user doesn't have to know/care what these internal allocations are -- they just have to know to clean them up for me after the DrawItems are no longer in use. The DrawItems themselves are POD -- the user can memcpy them if they like -- but this DrawItemResources structure is a bona fide class, requiring the user to keep a pointer to it and call Release at the appropriate time.

Ah, I see, now it makes more sense. For getting things running I think I'll leave it out, shouldn't be too hard to implement somewhere along the lines.

That should be it for now - see my edit about redundant state checking, otherwise I should be good to go. Thanks a lot for answering all those questions :)

4) You mentioned redundant-state checking by XORing the last and current draw-item. So it sounds like you only perform the redundancy checks at this granularity - i.e. if I bind a resource list to slot 0, you only check whether the resource lists are the same, and not whether single textures are different? I currently have a system in place where I e.g. check for single textures, and even for single shader-stages (in combination with looking at which shader stages are currently active, and which textures are being used). Even though that's currently 800 bytes of information for both current & last state (it would go down once I've fully switched to IDs), and a lot of checks, I assumed (I think I even profiled it) that it's still faster than e.g. binding a full resource-list to all stages, when in reality something like the Domain/Hull-shader is really only used rarely. Did you do any profiling, and/or have any theoretical knowledge as to whether the gain from not paying the cost of that fine-grained state checking outweighs the gains from not issuing some redundant state-changes (or even just from not binding resources that are not needed); or did you just assume that the amount of loops/branches required would kill any benefit over plain XOR/bit-compares right away?

I do 64bit XOR'ing of the pipeline state, and SSE XOR'ing of the resource bindings to quickly detect if any resource lists have changed since the previous draw. Once I've figured out which resource lists need to be bound, I iterate through them to actually bind the textures to D3D. In this step, I do actually have a cache of D3D texture pointers, which is used to avoid redundant calls to D3D (if two different resource lists happened to bind textures to the same slots).

The XOR'ing is there to very quickly handle the case where consecutive draw calls don't require any D3D state/resource updates.
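
As a rough sketch of that fast-path test, assuming a 64-bit pipeline-state header plus a 16-byte block of resource-list IDs per draw:

#include <emmintrin.h> // SSE2

// Returns true if neither the pipeline state nor any resource-list binding changed
// since the previous draw, so all D3D state/resource updates can be skipped.
inline bool NothingChanged( u64 prevHeader, u64 curHeader,
                            __m128i prevLists, __m128i curLists )
{
	if( ( prevHeader ^ curHeader ) != 0 )
		return false;

	__m128i diff = _mm_xor_si128( prevLists, curLists );
	// An all-zero difference means every resource-list ID matched.
	return _mm_movemask_epi8( _mm_cmpeq_epi8( diff, _mm_setzero_si128() ) ) == 0xFFFF;
}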

I don't keep a cache for each stage, and IIRC only support 32 textures bound at once, so that's 32 * sizeof(pointer) == 256 bytes (or 128 bytes for 32bit OS's, which I've dropped :) ).

I do avoid binding textures to stages that don't need them though. The shader program has a bitmask of which t# registers are in use for each stage (so 5 stages * 32 texture registers = 20 bytes of bitmask info), which I use during the binding process to skip certain *SSetShaderResources calls when possible. I actually have three different versions of my resource binding code, optimized for PS/VS (90% of all shaders), all stages, and compute :)

The code is pretty streamlined, to make only a single PSSetShaderResources call no matter how many textures are bound per draw, which means that it can redundantly bind a bunch of textures -- e.g. if t0 and t32 both need to be set, every SRV in the middle of that range will also be set.
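
A sketch of that ranged bind, driven by the per-stage bitmask mentioned above (the function shape and the flat SRV cache are assumptions; the D3D11 call itself is real):

#include <intrin.h> // _BitScanForward / _BitScanReverse

// 'usedMask' has one bit per t# register the pixel shader actually reads;
// 'srvCache' is the flat cache of views for slots t0..t31.
void BindPixelShaderSRVs( ID3D11DeviceContext* ctx, unsigned long usedMask,
                          ID3D11ShaderResourceView* const srvCache[32] )
{
	if( usedMask == 0 )
		return;

	unsigned long lo, hi;
	_BitScanForward( &lo, usedMask ); // lowest used register
	_BitScanReverse( &hi, usedMask ); // highest used register

	// One call covering the whole span -- slots in the middle get (re)bound too.
	ctx->PSSetShaderResources( lo, hi - lo + 1, srvCache + lo );
}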

This caching introduces a problem though -- if I skip binding textures to certain stages, but only keep a cache that's shared by all stages, then the cache can become wrong when switching shaders. E.g. if shader A uses t0 in the PS only, and then shader B uses t0 in the VS only, and two consecutive draws use those shaders with the same texture, then the second one will think that the texture has been bound because the cache says it's bound, but it will actually only have been bound to the PS and not the VS.

I solve this by simply invalidating my cache whenever a new shader program is bound which has a different resource bitmask to the previous shader program -- which can cause some redundant resource bindings on the first draw of any shader program.

I actually used to have one cache per stage, which doesn't suffer from this drawback, and I can't remember right now why I switched to this version... :o

At one point I did all this caching via a macro, so I could easily replace if(NeedsBinding) with if(true). I don't remember any exact details, but I remember that the redundant call checking logic was a small win in general.

The code is pretty streamlined, to make only a single PSSetShaderResources call no matter how many textures are bound per draw, which means that it can redundantly bind a bunch of textures -- e.g. if t0 and t32 both need to be set, every SRV in the middle of that range will also be set.

Cool, that's what I'm doing already, just that I'm not using bitmasks, but loops. Guess I'll keep it that way, and just convert everything to bitmask-comparisons & add the fast draw-item check once I've got that down.

I actually used to have one cache per stage, which doesn't suffer from this drawback, and I can't remember right now why I switched to this version... :o

Maybe size? I'm only supporting 5 textures right now and still ended up with 400 bytes overall (on 32-bit), and seeing how your 32 textures are already 256 bytes, times 5 shader-stages = 1.28kB, that seems a bit rough.

