Stack-based Kd traversal

Graphics and GPU Programming Programming

Started by ajas95 January 07, 2005 03:44 PM

32 comments, last by Eelco 19 years, 3 months ago

767

Author

January 07, 2005 03:44 PM

Havram offers a recursive implementation of a Kd-tree traversal implemented with a stack. The branch prediction is particularly horrific, so I've been optimizing that aspect with great success so far. One part this is very troubling though is the part:


int tmp = exPt;
Increment(exPt);

/* possibly skip current entry point so not to overwrite the data */
if (exPt == enPt)
    Increment(exPt);

This seems really hacky, and I don't understand what it does, or what conditions en_idx would equal ex_idx. I'm looking for a way to remove that if statement. Can anyone help me?

ajas95

767

Author

January 07, 2005 06:59 PM

HAH! I figured it out... well, not really, but I figured out a way to mimic that behavior (I'd still really like to know what that section is there for). I now have my non-leaf-node traversal completely branchless, always prefetched, and down to 52 cycles per loop-iteration in every case (i.e. never better, never worse).

Of course it's easy to get it faster than that if you don't count the branch misprediction penalties, but for this algorithm particularly, they are really bad.

phantomus

741

January 08, 2005 08:11 AM

Hm given my current project I would be very interested in exchanging some thoughts on this subject. Could you share your implementation with us (me :) )?

ajas95

767

Author

January 08, 2005 10:42 AM

Sure. It uses conditional moves instead of branches to test the en/ex points against the split plane and assign to cur_node, far_node and decide whether to do the stack computation.

That was the easy part, and I had it running in 45 cycles in the happy case where just one side is traversed... When it actually needed to do the full computation and push the stack (which was about 25% of the time), it was really expensive. It can't start computing the next split point until it knows to not take the branch (after mispredicting it!), so it stalls badly.

I decided if I started the split_plane intersection computation early, I could fill lots of stall cycles. So it does the whole computation as if it's going to traverse the far node, assigning all the needed values into the next stack position then conditionally increments the ex_pt value based on whether the first computation decided that the far node actually needed to be traversed.

The way I actually solved the problem I posted was by doing yet another conditional move to increment the copy of ex_pt stored in the register. Like:

	mov	eax,	ex_idx	mov	edx,	ex_idx	mov	esi,	eax	// esi : temp (old ex_idx)	inc	eax	add	edx,	2	cmp	eax,	en_idx	cmove	eax,	edx     // if (++ex_idx == en_idx)  ex_idx++;// do stack assignments.... then,	cmp	no_continue, 1  // check if no_continue flag was set (i.e. traverse far node)	cmove	esi,	eax     // if so, then increment ex_idx to the new value	mov	ex_idx,	esi

Now, don't get me wrong. I know this whole idea sounds retarded. The whole point of Havram's traversal is that you have to do this split plane computation in the fewest cases possible! However, everything schedules very nicely when you know beforehard you're going to do the full traversal. The computations themselves are very fast, but you can't just wait til the last second to do it, and then expect the result to be ready immediately.

There are a few more tricks, but this is the gist of it. If you want more code, I'd be happy to post it.

Eelco

301

January 09, 2005 02:11 PM

ive often hear people about this branching problem, but ive never understood why its a bad thing, or what it is precisely.

is branching another word for an if statement? a place where the execution of your code branches? why is this bad? because the compiler cant make optimazations like prefeching data, because it cant predict what will happen at runtime?

the way i want to solve the prefetching/cache problem: as soon as you decide to traverse a node, prefetch both children into L1 cache using the asm prefectcht1 instruction on the child pointer (only one cache line to prefetch if you make sure two children are always on the same cacheline) before anything else. sounds a little too simple though, am i missing something?

anyway, 52 cycles flat sounds very impressive. not that im very good at assembler, but still id like to see it.

DudeMiester

156

January 09, 2005 02:21 PM

It's because the CPU has a series of instructions within it's internal pipeline. If there is a branch it has to try and guess the program flow to continue adding instructions into the pipeline. If it makes a mistake with this guess, then it has to flush all the incorrect instructions out of the pipeline and add the correct ones, this wastes a good chunck of time, and that's the problem with branh mis-prediction.

[s] [/s]
I can see the fnords.

Eelco

301

January 09, 2005 02:34 PM

sorry if im going a little OT, but while were talking about optimizing kd-tree traversal, i came up with a slight improvement of my own i havnt yet seen before.

instead of storing leaf data in a linked list of some sort, i plan to store them in an array of pointers. the size of the leaf can be store in the splitposition field. these arrays, which shouldnt be allowed to get any bigger than one cacheline, can then be placed in the same memory pool as the nodes, in the best case placed on the same cacheline as the node referencing them, so that in most cases it will already be cached. then you can prefectch all needed data at once without any further delay. optimally, you could start with the primitive closest to the processor which will arive first (buti dont know if thats even possible), and you can do your mailboxing before you even have to fetch any data.

ajas95

767

Author

January 09, 2005 03:08 PM

Quote:Original post by Eelco
is branching another word for an if statement? a place where the execution of your code branches? why is this bad?

Yes it's an 'if' statement, or 'for' or 'while' condition, where the CPU has to decide between executing 2 different next instructions. The processor won't know for several cycles which to choose, it has to decode the branch instruction and process the condition... so meanwhile, it predicts one and starts processing it right away. If it's wrong, it has to wait for those instructions it started to finish and flush the pipeline before processing the correct branch. This can be a non-trivial penalty.

But that doesn't mean that 'if' statements are necessarily bad. CPU branch predictors have gotten very good at detecting jump patterns... however, they can't infer more predictability than there is in the input. In the case of this algorithm, every branch has roughly equal probability to be taken, and the previous jump outcome lends no hint to what the next will be, so prediction will do no better than 50-50. (That's actually not entirely true. A given ray, for instance if all components of the direction vector are positive, may give good prediction... however that prediction will only be valid for the duration of the ray, and may 'poison' the predictors for future rays, so I don't think it will be a net gain in the long run).

A good rule is that if you, as a programmer, have a good idea of which branch will be taken, the processor will be able to figure it out also. For this algorithm, that is not the case. One quick thing to try is to write 2 additional versions of the algorithm, one for all components of ray direction are positive, and one for all negative. At the very beginning of traversal, if you detect one of these cases, call the appropriate special traversal function, otherwise use the generic one. I would bet that would give you a quick 5-10% performance improvement.

Eelco

301

January 09, 2005 03:18 PM

thank you, thats interesting to know.

phantomus

741

January 09, 2005 04:27 PM

Quote:Original post by Eelco
these arrays, which shouldnt be allowed to get any bigger than one cacheline, can then be placed in the same memory pool as the nodes, in the best case placed on the same cacheline as the node referencing them, so that in most cases it will already be cached.

I had this same idea a while back and actually tested it. It doesn't work. :) It isn't slower either, it just doesn't help. Apparently the cache hits that you get for the primitives compensate for the cache misses that you get for nodes (that are further away because of this interleaving trick). Given the fact that tree construction isn't exactly eased by this trick, I took it out.

Stack-based Kd traversal

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Stack-based Kd traversal

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines