# Material and pass sorting - over my head!

This topic is 4778 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

I've been working on a material/shader system, and trying to work out a sorting algorithm for material passes. Unfortunately I'm kind of stuck on this. Here's where I'm at: Set of materials: S = { M1, M2, ... Mm } For each material Mi, ordered set of passes: Ti = { MiP1, MiP2, ... MiPp } where p >= 1 but may be different for each material. Render process consists of ordered sequence of passes, eg Given S = { { M1P1 }, { M2P1, M2P2 }, { M3P1, M3P2 } } Possible render sequence: M2P1, M2P2, M3P1, M1P1, M3P2 Passes from different materials may appear in any order, but passes from within the same material must keep their relative ordering. Passes may be interleaved from different materials. Given a set S, how many possible render sequences are there? For bonus points, (and more importantly really,) how would you iterate them? I've been struggling with this problem for hours. Any help greatly appreciated :-) [Edited by - posit on January 17, 2005 1:36:20 PM]

##### Share on other sites
Hey posit, I'm a little busy to discuss this right now, but I'll meet you later in IRC. Or you can get my IM off my site alternatively I'm connected to AfterNET right now if you want to PM. I think a real time conversation would be more appropriate for this topic. We can post a log of our conversation if others in the community request it. Any takers?

##### Share on other sites
After some investigation, this looks very much like the Travelling Salesman Problem, but with a catch: passes within a material must remain in order. I don't yet know whether this will make things easier or harder.

##### Share on other sites
Quote:
 Original post by positAfter some investigation, this looks very much like the Travelling Salesman Problem, but with a catch: passes within a material must remain in order. I don't yet know whether this will make things easier or harder.Any thoughts? Anyone? Please?

I'd be interested in hearing others thoughts on this. The TSP problem has many applications in the world of software engineering. If we view the cities as the set of render states required to render a given pass. Then the cost (distance) to get from one set of states (city) to another could be used to heuristically calculate the optimal order of rendering. Questions? Comments? Concerns?

##### Share on other sites
A diagram of an example having two materials, to help visualise the problem. M1 has 3 passes, M2 has 2 passes.

##### Share on other sites
With the ordering constraint, it appears to be more of a shortest-path problem in a DAG, where the graph is an N-dimensional hypergrid, with N being the number of materials and the size of each dimension being the number of passes involved.

##### Share on other sites
Quote:
 I'd be interested in hearing others thoughts on this. The TSP problem has many applications in the world of software engineering. If we view the cities as the set of render states required to render a given pass. Then the cost (distance) to get from one set of states (city) to another could be used to heuristically calculate the optimal order of rendering. Questions? Comments? Concerns?

The system I'm planning to use to sort my render states (just a general theory at this point, I haven't looked at implementation), is where I don't try and consider everything at once. I add new items to the list one at a time, using a kind of tree structure, where a new item comes to rest in the heirachy when it reaches a leaf.

Ok, a little more detail. I prioritize certain render state changes, based on how much time a state change of that type will eat up, on average. This sort priority list might go something like this:

-Texture
-Something else

I'd want to do some careful testing to decide what state changes I add to this list. I can't imagine I'd end up with more than 4 or 5. The further down the list a change is, the less you gain from sorting by it, the less likely you are to encounter cases where you can actually sort by it, and the longer it'll take you to sort.

Anyway, the sorter adds new items to the graph one at a time, traversing the graph to the branch where that renderable item belongs. When the heirachy is filled with all the items that are to be rendered, the sorter simply does a complete traversal of the heirachy, and sends the items off to the renderer to be drawn. Because all renderable items have been grouped together based on the most important common state properties however, the expensive state changes are minimized, based on the priority of the state change. To demonstrate, here's some test data:

Name	Shader	Texture	SomethingA: 	2	1	3B: 	2	2	1C: 	2	2	2D: 	3	1	3E: 	3	3	1F:	1	1	2G:	2	2	1

In this case, we have 6 renderable items that are to be sorted, based on three state changes of interest: Shader, Texture, and Something. Note that items like shader and texture now have numbers associated with them. Those numbers should probably be handles or referances to objects, but in the case that a string is of interest, a hash will be generated for that string so it can be quickly compared. Anyway, from that set of test data, here's how the structure would be generated:

		      Root		       |		-------+---------		|      |         \  Shader		1      2          3		|     / \        / \   		|    /   \      /   \  Texture		1   1     2     1   3		|   |    / \    |   |		|   |   /   \   |   |Something	2   3   1    2  3   1		F   A   BG   C  D   E

The internal grouping nodes represent a perticular state value. If a new asset is being added that doesn't have a matching grouping node for its state value, that grouping node is created, so when the first asset "A", was added to the list, three grouping nodes were created for each of the three states I'm sorting by. When B was added however, only two were created, because A shared the same shader, and had already created that grouping node. When sending data to the renderer, each branch is fully explored before returning to a higher level of the heirachy, so the topmost state change will change the absolute minumum number of times.

This is just the theoretical idea. I haven't given all that much thought to the data structures I'll have to employ to allow fast comparison of state values to grouping nodes, and I haven't run tests to measure the efficiency of this structure, but I think it has promise. It may be more of a hinderance than a help when there are just a few objects, but when there are a few thousand renderables being thrown at it, it should really provide a boost, as long as the time required to add new items can be minimized.

EDIT: Note that this doesn't solve the TSP problem to find the absolute minimum number of state changes. If state changes of both priorities 2 and 3 changing at once is more expensive than a single priority 1 state change, it won't handle it. This is slightly different to the TSP problem though, because in TSP, a change in any direction is equivelant. In this case, some are more expensive, hence the priority ordering. Also, in this, you don't have to end up back where you started.

##### Share on other sites
Nemesis2k2 brings out some interesting points, unfortunately I'm still at work and don't have time to write a real response. I'll respond again in a few hours.

##### Share on other sites
Quote:
 Original post by Nemesis2k2

Okay, I honestly wanted to reply to this with a long drawn out explanation, but unfortunately I'm swamped at work and I keep telling myself "I'll just get this last feature in before I go home". I've been saying that for over 5 hours now, and I'm still not ready to go home. But I said I'd respond again so I will, it'll just be a bit shorter and less explanatory.

Briefly: The cost of inserting every object into that tree every frame is more than likely going to outweigh any benefit you'd get from the sorting. Insertion into the tree gets more expensive as the tree grows, so the more objects you have in your world the longer it takes to build the tree. You'd probably be better off doing a quick or merge sort on a list of all visible objects. Alternatively, you could rely on temporal coherence and possibly be better off with an insertion sort.

The TSP problem is a unique way of looking at this issue, but I'm not completely convinced as to it's practicality on modern hardware. As far as optimization is concerned you're probably better off sorting a simple list of visible objects as was stated above. This method would be especially efficient if you took advantage of the state capturing features of your preferred graphics APIs.

##### Share on other sites
Just thought I'd point out that it doesn't actually need to be a tree structure in memory. It can be easily implemented using a radix sort, by treating each component as a 'digit'.

-bodisiw

##### Share on other sites
Quote:
 Original post by NoahAdlerJust thought I'd point out that it doesn't actually need to be a tree structure in memory. It can be easily implemented using a radix sort, by treating each component as a 'digit'.

I'm not sure if you were referring directly to Nemesis2k2's proposed method or speaking in general. In a previous project I did use a radix sort to handle render state sorting. It worked out pretty well but I eventually moved away from it when I found more efficient methods.

##### Share on other sites
I think you guys are needlessly overcomplicating what's really a fundamentally much simpler problem. At least, it seems this way to me, and I'll attempt to prove why I think I'm right in the passage below, in a semi-rigorous way.

Background:

Here's a toned-down approach to the task, what I'll call (tongue-in-cheek) the "Barbecue Problem". You have a set S of n shiskebobs, S1 ... Sn. Each shiskebob has a number of meat cubes on it. You may not remove a given cube from a shiskebob unless you have removed all other cubes above it. Let S(i,j) denote the jth cube on the ith shiskebob and |Si| denote the number of meat cubes on shiskebob i. (The |Si|th meat cube on Si -- that is, S(i,|Si|) -- is the last one to be removed from that shiskebob.)

Each of the individual cubes thus corresponds to one of the passes; each shiskebob corresponds to a material. The stacking of the meat cubes on the shiskebob determines the order in which the passes must be completed.

Problem Statement:

The problem can now be restated as, "Given n and |S1| ... |Sn|, determine the number of possible ways to leave all shiskebobs empty."

Solution and Proof Attempt:

1.) For all i, j, j < k <= |Si| : S(i,j) must be removed before S(i,k).
2.) By (1), each S(i,j) on a given Si is thus uniquely determined.
3.) By (2), the distinction between any two meat cubes from a given shiskebob is irrelevant. You can determine the identity of the topmost cube merely by looking at how many have previously been drawn from the same shiskebob.
4.) The number of different permutations of n objects, where there are n1 indistinguishable objects of type 1, n2 indistinguishable objects of type 2, ..., nk indistinguishable objects of type k, is (n!)/(n1! n2! ... nk!). (This proof is left as an exercise to the reader, but is simple and well-known.)
5.) Let |S| = SUM[1 -> n : |Si|].
6.) By (4) and (5), the number of possible orderings is therefore
             |S|!T = ---------------------    |S1|! |S2|! ... |Sn|!

QED.

Experimental Confirmation:

Let's test it out for some sample values. Suppose we have one shiskebob of one meat cube (which we'll call a1) and one shiskebob of two meat cubes (which we'll call b1 and b2). The set of possible orderings is { {a1 b1 b2}, {b1 a1 b2}, {b1 b2 a1} }, so our expected value for T is 3.

Sure enough, |S|! = 3! = 6, |S1|! = 1! = 1, and |S2|! = 2! = 2, and 6/2 = 3. The expected results are confirmed. How about a more complicated example? Let's take the one originally posed by posit:

Quote:
 Given S = { { M1P1 }, { M2P1, M2P2 }, { M3P1, M3P2 } }

This gives |S| = 5, |S1| = 1, |S2| = 2, and |S3| = 2. Plugging in, we have 5!/(1! 2! 2!) = 120/4 = 30, which seems like a reasonable answer. Indeed, you could write a program to enumerate all thirty possibilities, written below:
## This program solves the Barbecue Problem.## Enter number of meat cubes on each shiskebob, separated by commas:>> 1, 2, 2>> 30 possible sequences. Enumerate? (y/n) ya1 b1 b2 c1 c2b1 a1 b2 c1 c2b1 b2 a1 c1 c2b1 b2 c1 a1 c2b1 b2 c1 c2 a1  ---> 5 sequencesa1 b1 c1 b2 c2b1 a1 c1 b2 c2b1 c1 a1 b2 c2b1 c1 b2 a1 c2b1 c1 b2 c2 a1  ---> 10 sequencesa1 c1 b1 b2 c2c1 a1 b1 b2 c2c1 b1 a1 b2 c2c1 b1 b2 a1 c2c1 b1 b2 c2 a1  ---> 15 sequencesa1 c1 b1 c2 b2c1 a1 b1 c2 b2c1 b1 a1 c2 b2c1 b1 c2 a1 b2c1 b1 c2 b2 a1  ---> 20 sequencesa1 b1 c1 c2 b2b1 a1 c1 c2 b2b1 c1 a1 c2 b2b1 c1 c2 a1 b2b1 c1 c2 b2 a1  ---> 25 sequencesa1 c1 c2 b1 b2c1 a1 c2 b1 b2c1 c2 a1 b1 b2c1 c2 b1 a1 b2c1 c2 b1 b2 a1  ---> 30 sequences## End program.

You can see this boils down to the problem we've been discussing above. What is required now to make it into a weighted digraph is a table of costs involved to switch between each of the passes described in the original problem.

HTH,

##### Share on other sites
Quote:
 Briefly: The cost of inserting every object into that tree every frame is more than likely going to outweigh any benefit you'd get from the sorting.

This is my main concern with the approach. There are a few things that work in my favour though. First of all, in a real situation, there are only going to be so many render states, or at least ones that matter. Expensive render state changes usually involve a limited number of options, or large sets of data, which there will be only a limited number of. It's unlikely there'll be a few hundred different shaders flying around, or a few thousand different textures, unless the number of objects being rendered is magnitudes larger. This should keep the length of the lists being iterated through reasonably small, when compared to the number of renderable items being sorted.

Secondly, the efficiency of the sorting of renderables could be greatly improved, if the internal group nodes were themselves sorted, allowing a more efficient search algorithm to be used, rather than a simple iteration and comparison on all group nodes. This has the problem however of requiring direct access to nodes in the list, as well as efficient insertion. I have a few different ideas floating around to tackle that, which I would want to test in code, to see if they really are more efficient, and which is the best one. I haven't got that far in the planning however, as I'm working on other things, and haven't got to sorting yet. In fact, this is the first thing I've written about it, I don't even have any notes jotted down on this yet, let alone detailed planning. I formed most of the details of this method while I was writing the post.

Quote:
 Insertion into the tree gets more expensive as the tree grows, so the more objects you have in your world the longer it takes to build the tree.

The insertion into the tree only gets more expensive when a new grouping node is created, not when a new renderable asset is added. Since the number of different state values is likely to be limited with the expensive state changes, this should reduce the effect. Also, it is important to note that as the number of objects in your world increases, so does the potential for wasted state changes, even if no new state values are added, and hence, so does the performance penalty for not sorting them. In fact, I argue that the negative effect of no sorting with a large number of objects in the world, becomes larger than that of an inefficient sorting algorithm. Consider the case where I have 10,000 different objects in my world, but only 3 different sets of state values. A sorter could probably limit the number of state changes in that case to the absolute minimum. Rendering them with absolutely no sorting however could result in changing state values every single object. Clearly this is an extreme example, but a valid one none the less. As the number of different values for expensive state changes is likely to be comparatively quite small in any large set of data, I think the cost of building the tree is likely to be small compared to the savings for doing so.

I think a tree structure has the potential to be the fastest sorting method in this situation. With a bit of thought on the design, aided by testing and profiling, I think a sorter could be designed from this method which balances these concerns quite nicely. I'm open to any ideas for improvement, or alternate methods you know of though.

##### Share on other sites
Quote:
 I think you guys are needlessly overcomplicating what's really a fundamentally much simpler problem.

Actually, I've just realised that halfway through the thread, I'd forgotten what the original question was actually about. I've been talking about sorting data before drawing, in order to minimize time lost due to expensive state changes, which is not what the original problem was now that I've gone back and read it again (after spending 15 minutes trying to figure out how your answer related to what I thought the issue was). Bugger.

##### Share on other sites
Nemesis2k2: Yes, you're not alone in thinking that I'm worrying about this too much :-) However, it strikes me that this is a load-time optimization that gives an effectively free run-time efficiency boost, so it's worth some investigation. It remains to be seen how practical this will be in the long run.

kSquared:

Quote:
 Original post by kSquared |S|!T = --------------------- |S1|! |S2|! ... |Sn|!

Nice job! Now that you mention it, I do remember that equation, you're right.

Quote:
 ## This program solves the Barbecue Problem.## Enter number of meat cubes on each shiskebob, separated by commas:>> 1, 2, 2>> 30 possible sequences. Enumerate? (y/n) ya1 b1 b2 c1 c2b1 a1 b2 c1 c2b1 b2 a1 c1 c2b1 b2 c1 a1 c2b1 b2 c1 c2 a1 ---> 5 sequencesa1 b1 c1 b2 c2b1 a1 c1 b2 c2b1 c1 a1 b2 c2b1 c1 b2 a1 c2b1 c1 b2 c2 a1 ---> 10 sequencesa1 c1 b1 b2 c2c1 a1 b1 b2 c2c1 b1 a1 b2 c2c1 b1 b2 a1 c2c1 b1 b2 c2 a1 ---> 15 sequencesa1 c1 b1 c2 b2c1 a1 b1 c2 b2c1 b1 a1 c2 b2c1 b1 c2 a1 b2c1 b1 c2 b2 a1 ---> 20 sequencesa1 b1 c1 c2 b2b1 a1 c1 c2 b2b1 c1 a1 c2 b2b1 c1 c2 a1 b2b1 c1 c2 b2 a1 ---> 25 sequencesa1 c1 c2 b1 b2c1 a1 c2 b1 b2c1 c2 a1 b1 b2c1 c2 b1 a1 b2c1 c2 b1 b2 a1 ---> 30 sequences## End program.

I'm not sure if you're implying a method of iterating here, but a recursive one comes to mind: f(S) : for each i, call f(a copy of S minus the first pass of Si). The leaves of this call tree will be the various permutations. If there's a truly iterative way, I can't see it at the moment, but the recursive way will do.

Thanks very much for taking the time to look into this. Now the kebab analogy's making me hungry, so time for breakfast...

##### Share on other sites
Seems to me that Assassin's response nailed it, but another difficult piece of the puzzle remains, that is: how should the weights of the graph edges be calculated? That's gonna be a hard problem to solve, as it will almost certainly be card dependent. For instance, weights could be given to shader swaps, texture swaps, etc., and simply summed in order to find the entire weight, but this would ignore issues such as if the cost of switching two textures is less than two times the cost of switching one texture, or other anomolies. It would probably take some genetic algorithm get an empirically good heuristic. Who's up for writing such a beast? :-)

-bodisiw

##### Share on other sites
Do you guys think that ANY tree/list, to sort the passes/materials is suitable for this problem? My guess is that no matter what kind of sorting you use, anything will be too slow if its specifically recreated each frame.
Also the traversal of big Linked Lists/Trees ( as they are probably going to occur in this case ) is probably going to be slow. How do you solve that problem in order to use any of the suggestions above?

##### Share on other sites
Quote:
 Original post by DtagDo you guys think that ANY tree/list, to sort the passes/materials is suitable for this problem? My guess is that no matter what kind of sorting you use, anything will be too slow if its specifically recreated each frame.

Does it have to be re-created each frame, though? It seems to me you could probably build the tree of state changes needed to render the whole scene (adding the nodes as needed when new objects are inserted in the scene) ... then for each frame you'd just mark as 'active' the branches of states needed to render things which are actually visible in the given frame, and then traverse down these 'active' branches, without actually having to touch the tree structure itself.

##### Share on other sites
Well its not only the building of a tree that is too slow, its also the traversal. Linked lists / trees have the disadvantage that the processor can't really optimize it because all the memory is located somewhere else. So a traversal of a big tree on a per frame basis will definately significantly reduce your application speed.