Stack-based Kd traversal



Havran offers a Kd-tree traversal that replaces the recursive implementation with an explicit stack. The branch prediction is particularly horrific, so I've been optimizing that aspect with great success so far. One part is really troubling me though:
int tmp = exPt;
Increment(exPt);

/* possibly skip current entry point so not to overwrite the data */
if (exPt == enPt)
Increment(exPt);

This seems really hacky, and I don't understand what it does, or under what conditions exPt would equal enPt. I'm looking for a way to remove that if statement. Can anyone help me?
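For context, here is a hedged C sketch of what that index bump does in a Havran-style traversal. The skip exists because entry and exit points share one stack array, so the slot holding the live entry point must not be overwritten; variable names beyond the snippet's are invented:

```c
#include <assert.h>

/* Hypothetical sketch of the index bump from the snippet above:
 * entry/exit points live in one stack array, indexed by enPt/exPt.
 * When pushing the far child we need a fresh slot for the new exit
 * point, but the slot right after exPt may be the one enPt still
 * occupies (its data is needed when it is popped later), so the
 * increment skips over it. */
int next_exit_slot(int exPt, int enPt)
{
    int tmp = exPt;   /* remember the old exit point */
    exPt++;           /* claim the next slot */
    if (exPt == enPt) /* would clobber the live entry point... */
        exPt++;       /* ...so skip over it */
    (void)tmp;
    return exPt;
}
```

The skip is rare but necessary: without it, pushing a far node while the entry point sits in the very next slot would destroy data still needed by the traversal.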

HAH! I figured it out... well, not really, but I found a way to mimic that behavior (I'd still really like to know what that section is there for). I now have my non-leaf-node traversal completely branchless, always prefetched, and down to 52 cycles per loop iteration in every case (i.e. never better, never worse).

Of course it's easy to get it faster than that if you don't count the branch misprediction penalties, but for this algorithm particularly, they are really bad.

Hm, given my current project I would be very interested in exchanging some thoughts on this subject. Could you share your implementation with us (me :) )?

Sure. It uses conditional moves instead of branches to test the en/ex points against the split plane, assign cur_node and far_node, and decide whether to do the stack computation.

That was the easy part, and I had it running in 45 cycles in the happy case where just one side is traversed. When it actually needed to do the full computation and push the stack (about 25% of the time), it was really expensive: it can't start computing the next split point until it knows not to take the branch (after mispredicting it!), so it stalls badly.

I decided that if I started the split-plane intersection computation early, I could fill lots of stall cycles. So it does the whole computation as if it's going to traverse the far node, assigning all the needed values into the next stack position, then conditionally increments the ex_pt value based on whether the first computation decided the far node actually needed to be traversed.
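As a hedged C sketch of that trick (the struct layout and names are invented, not the poster's actual code): the far-node stack entry is always written, and only the advance of the stack index is made conditional, so the write itself never needs a branch:

```c
#include <assert.h>

struct StackEntry { const void *node; float t; int prev; };

/* Sketch (hypothetical names): unconditionally fill the next stack
 * slot as if the far node will be visited, then advance the stack
 * index only when it actually will be.  Writing the unused slot is
 * harmless because it is either overwritten later or never read,
 * and the split-point math can start early instead of waiting on a
 * mispredicted branch. */
int push_far_speculatively(struct StackEntry *stack, int exPt, int enPt,
                           const void *far_node, float split_t,
                           int need_far /* 0 or 1 */)
{
    int next = exPt + 1;
    if (next == enPt) next++;        /* skip the live entry-point slot */
    stack[next].node = far_node;     /* speculative stores */
    stack[next].t    = split_t;
    stack[next].prev = exPt;
    /* branchless commit: keep the old exPt unless need_far is set */
    return need_far ? next : exPt;
}
```

The final ternary is the kind of select that compilers typically lower to a conditional move, which is the whole point of the exercise.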

The way I actually solved the problem I posted was by doing yet another conditional move to increment the copy of ex_pt stored in the register. Like:

	mov	eax, ex_idx
	mov	edx, ex_idx
	mov	esi, eax	// esi : temp (old ex_idx)
	inc	eax
	add	edx, 2
	cmp	eax, en_idx
	cmove	eax, edx	// if (++ex_idx == en_idx)  ex_idx++;
	// do stack assignments.... then,
	cmp	no_continue, 1	// check if no_continue flag was set (i.e. traverse far node)
	cmove	esi, eax	// if so, then increment ex_idx to the new value
	mov	ex_idx, esi

Now, don't get me wrong. I know this whole idea sounds backwards. The whole point of Havran's traversal is that you do this split-plane computation in the fewest cases possible! However, everything schedules very nicely when you know beforehand you're going to do the full traversal. The computations themselves are very fast, but you can't wait till the last second to do them and then expect the result to be ready immediately.

There are a few more tricks, but this is the gist of it. If you want more code, I'd be happy to post it.


is branching another word for an if statement? a place where the execution of your code branches? why is this bad? because the compiler can't make optimizations like prefetching data, because it can't predict what will happen at runtime?

the way i want to solve the prefetching/cache problem: as soon as you decide to traverse a node, prefetch both children into L1 cache using the asm prefetcht1 instruction on the child pointer (only one cache line to prefetch if you make sure two children are always on the same cache line) before anything else. sounds a little too simple though, am i missing something?
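A sketch of that idea in C, using the portable `__builtin_prefetch` builtin (GCC/Clang) in place of raw prefetcht1; the node layout here is invented for illustration:

```c
#include <assert.h>

struct Node {
    float split;           /* split position (unused in this sketch) */
    int axis;              /* split axis, or a leaf marker */
    struct Node *children; /* the two children, allocated adjacently */
};

/* Hint the cache about a node's child pair.  Because the pair is
 * contiguous, one prefetch covers both children whenever they share a
 * cache line.  A prefetch is only a hint: if the CPU drops it, the
 * traversal still computes the same result. */
static inline void prefetch_children(const struct Node *n)
{
    if (n->children)
        __builtin_prefetch(n->children, 0 /* read */, 2 /* keep cached */);
}

/* Trivial walk just to show where the hint goes: issue the prefetch
 * for the next level as soon as we step onto a node. */
int depth_of(const struct Node *n)
{
    int d = 0;
    while (n->children) {
        n = &n->children[0];  /* dummy descent rule */
        prefetch_children(n); /* hint one level ahead */
        d++;
    }
    return d;
}
```

Whether this actually helps is exactly what the thread goes on to debate; the hint costs decode bandwidth even when it does nothing.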

anyway, 52 cycles flat sounds very impressive. not that i'm very good at assembler, but still i'd like to see it.

It's because the CPU has a series of instructions within its internal pipeline. If there is a branch, it has to guess the program flow to keep adding instructions into the pipeline. If it guesses wrong, it has to flush all the incorrect instructions out of the pipeline and add the correct ones; this wastes a good chunk of time, and that's the problem with branch misprediction.
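To make that concrete, here's a small illustrative C example of the usual remedy discussed in this thread: turning a data-dependent jump into a branchless select. The mask form below is one standard way to force it; compilers also frequently emit a cmov for the plain ternary:

```c
#include <assert.h>

/* Branchy version: the CPU must predict the jump. */
int max_branchy(int a, int b)
{
    if (a > b) return a;
    return b;
}

/* Branchless version: build an all-zeros/all-ones mask from the
 * comparison and blend the two inputs.  Nothing to mispredict, at
 * the cost of always evaluating both sides. */
int max_branchless(int a, int b)
{
    int take_a = -(a > b);               /* 0 or all ones */
    return (a & take_a) | (b & ~take_a); /* select without jumping */
}
```

The trade-off is the one described above: the branchless form does slightly more work per call, but its latency is constant, so it wins when the branch would be unpredictable.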

sorry if i'm going a little OT, but while we're talking about optimizing kd-tree traversal, i came up with a slight improvement of my own i haven't yet seen before.

instead of storing leaf data in a linked list of some sort, i plan to store it in an array of pointers. the size of the leaf can be stored in the split-position field. these arrays, which shouldn't be allowed to get any bigger than one cache line, can then be placed in the same memory pool as the nodes, in the best case on the same cache line as the node referencing them, so that in most cases they will already be cached. then you can prefetch all needed data at once without any further delay. optimally, you could start with the primitive closest to the processor, which will arrive first (but i don't know if that's even possible), and you can do your mailboxing before you even have to fetch any data.
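That layout can be sketched in C as follows. Everything here is illustrative: the field names are invented, and a 64-byte cache line is assumed:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the proposed layout.  An interior node's split-position
 * field is reused to hold the primitive count when the node is a
 * leaf, and a leaf's pointer array is capped so that it never spans
 * more than one (assumed 64-byte) cache line. */
#define CACHE_LINE 64
#define MAX_LEAF_PRIMS (CACHE_LINE / sizeof(void *))

union SplitOrCount {
    float split_pos;      /* interior node: split plane position */
    unsigned prim_count;  /* leaf: number of primitives in the array */
};

struct Leaf {
    const void *prims[MAX_LEAF_PRIMS]; /* exactly one cache line */
};
```

With this cap in place, a single prefetch of the leaf's pointer array is guaranteed to pull in every primitive pointer the leaf owns.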

Quote:
 Original post by Eelco: is branching another word for an if statement? a place where the execution of your code branches? why is this bad?

Yes, it's an 'if' statement, or a 'for' or 'while' condition: anywhere the CPU has to decide between two different next instructions. The processor won't know for several cycles which to choose, since it has to decode the branch instruction and evaluate the condition... so meanwhile, it predicts one and starts processing it right away. If it's wrong, it has to wait for the instructions it started, flush the pipeline, and only then process the correct branch. This can be a non-trivial penalty.

But that doesn't mean 'if' statements are necessarily bad. CPU branch predictors have gotten very good at detecting jump patterns; however, they can't infer more predictability than there is in the input. In this algorithm, every branch has roughly equal probability of being taken, and the previous jump outcome gives no hint about the next one, so prediction will do no better than 50-50. (That's actually not entirely true. A given ray, for instance one where all components of the direction vector are positive, may predict well... but that prediction is only valid for the duration of the ray, and may 'poison' the predictors for future rays, so I don't think it will be a net gain in the long run.)

A good rule is that if you, as a programmer, have a good idea of which branch will be taken, the processor will be able to figure it out too. For this algorithm, that is not the case. One quick thing to try is to write 2 additional versions of the algorithm: one for when all components of the ray direction are positive, and one for all negative. At the very beginning of traversal, if you detect one of these cases, call the appropriate specialized traversal function; otherwise use the generic one. I would bet that gives you a quick 5-10% performance improvement.
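A hedged sketch of that dispatch in C (function and enum names are invented): classify the ray's direction signs once per ray, then route the two perfectly predictable cases to dedicated loops in which the near/far child choice never flips:

```c
#include <assert.h>

enum TraversalKind { TRAV_ALL_POS, TRAV_ALL_NEG, TRAV_GENERIC };

/* Pack the sign of each direction component into a 3-bit mask. */
static int sign_mask(float dx, float dy, float dz)
{
    return (dx < 0.0f) | ((dy < 0.0f) << 1) | ((dz < 0.0f) << 2);
}

/* Pick a traversal specialization once, at the start of the ray.
 * Inside TRAV_ALL_POS / TRAV_ALL_NEG loops the near-child branch
 * always goes the same way, so prediction is perfect there. */
enum TraversalKind pick_traversal(float dx, float dy, float dz)
{
    int m = sign_mask(dx, dy, dz);
    if (m == 0) return TRAV_ALL_POS;   /* all components >= 0 */
    if (m == 7) return TRAV_ALL_NEG;   /* all components < 0  */
    return TRAV_GENERIC;               /* mixed signs: generic loop */
}
```

Note that rays with a zero component land in the all-positive bucket here; whether that is acceptable depends on how the traversal handles axis-parallel rays.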

thank you, thats interesting to know.

Quote:
 Original post by Eelco: these arrays, which shouldnt be allowed to get any bigger than one cacheline, can then be placed in the same memory pool as the nodes, in the best case placed on the same cacheline as the node referencing them, so that in most cases it will already be cached.

I had this same idea a while back and actually tested it. It doesn't work. :) It isn't slower either; it just doesn't help. Apparently the cache hits that you get for the primitives merely compensate for the cache misses that you get for nodes (which are further away because of this interleaving trick). Given that tree construction isn't exactly eased by this trick, I took it out.

Quote:
Original post by phantomus
Quote:
 Original post by Eelco: these arrays, which shouldnt be allowed to get any bigger than one cacheline, can then be placed in the same memory pool as the nodes, in the best case placed on the same cacheline as the node referencing them, so that in most cases it will already be cached.

I had this same idea a while back and actually tested it. It doesn't work. :) It isn't slower either, it just doesn't help. Apparently the cache hits that you get for the primitives compensate for the cache misses that you get for nodes (that are further away because of this interleaving trick). Given the fact that tree construction isn't exactly eased by this trick, I took it out.

well, i must say i haven't thought about how to implement it yet either, and i can see it complicating things indeed. but my plan is not to have any cache misses for nodes anyway, by always prefetching one tree depth ahead. since prefetching one cache line into L1 takes about as long as traversing one node (if my math is correct), i don't see the caching of tree nodes being much of a problem. however, this isn't possible for the leaf data: you can prefetch the leaf, but there isn't any time associated with traversing it, so the wait for the leaf data to arrive can't be filled with any other useful activity.

however, another point of concern is that i want pairs of children to be 16-byte aligned, so one pair is always on the same cache line and thus easy to prefetch. this means, however, that putting the arbitrarily sized leaf-data pointer arrays in between might create quite a lot of empty space, which is always a bad thing in terms of memory transfer, of course.

Hm, I tried this prefetching stuff also (_mm_prefetch( _MM_HINT_T0 ) if I remember correctly) but I couldn't get it to make a difference. I'm particularly miserable at SSE coding atm, so that should not hold you back from trying. :) And please post here if you do manage to get a gain out of prefetching in the tree traversal code.


I've just tried to see if removing the "push case" branch in Wald's traversal/leaf locator was a win, and so far it's a loss (on my k8).
Everything else was already conditionally moved etc., but it seems that the cost of systematically storing things on top of the stack outweighs the cost of the branch. The problem is that removing that branch also changed the way the rest of the code is unrolled by gcc.

Need to instrument a bit and (maybe) give PGO a try :)

Quote:
 Original post by phantomus: Hm I tried this prefetching stuff also (_mm_prefetch( _MM_HINT_T0 ) if I'm correct) but I couldn't get it to make a difference. I'm particularly miserable at sse coding atm, so this should not hold you back from trying. :) And please post here if you do manage to get a gain out of prefetching in the tree traversal code.

well, i'm not sure it's going to make a difference either. in theory it should, but it's always rather complex. maybe using T0 isn't a good idea. if i remember correctly, T1 reads it into L1 cache, and T0 'into the processor', although i don't fully understand what was meant by that. if it means loading the cache line into actual registers, i can see it being more of a waste of registers than a speed gain, so prefetching into L1 seems like a better thing to do. also, prefetch is not guaranteed to do anything; it just hints at what might be done, so maybe being more explicit, by requesting actual memory from that cache line to force the fetch, will actually do something.

however, i must say i also think it sounds almost too easy. why would havran dedicate a complete chapter to tree structuring for best cache performance if it could be solved so easily by just prefetching a bit ahead?

well, i'll keep you informed when i actually get there.

Quote:
 Original post by Eelco: maybe using T0 isnt a good idea. if i remember correctly, T1 reads it into L1 cache, and T0 'into the processor', although i dont fully understand what was meant by this. (snip)... so maybe being more explicit by requesting actual memory from that cacheline to force prefetching will actually do anything.

Straight from Intel's doc:
• T0 (temporal data)—prefetch data into all levels of the cache hierarchy.
— Pentium III processor—1st- or 2nd-level cache.
— Pentium 4 and Intel Xeon processors—2nd-level cache.
• T1 (temporal data with respect to first level cache)—prefetch data into level 2 cache and
higher.
— Pentium III processor—2nd-level cache.
— Pentium 4 and Intel Xeon processors—2nd-level cache.
• T2 (temporal data with respect to second level cache)—prefetch data into level 2 cache and
higher.
— Pentium III processor—2nd-level cache.
— Pentium 4 and Intel Xeon processors—2nd-level cache.
• NTA (non-temporal data with respect to all cache levels)—prefetch data into non-temporal
cache structure and into a location close to the processor, minimizing cache pollution.
— Pentium III processor—1st-level cache
— Pentium 4 and Intel Xeon processors—2nd-level cache

And for AMD (the cache line size is 64):
Current AMD Athlon 64 and AMD Opteron processors implement the PREFETCHT0,
PREFETCHT1 and PREFETCHT2 instructions in exactly the same way as the PREFETCH
instructions. That is, the data is brought into the L1 data cache. This functionality could be changed in
future implementations.

It has nothing to do with registers; there can be 6 prefetches in flight at a given time (as far as i remember, for k8), and you're right, it's just a hint.

For the traversal, you want to load data into the L1. The only trouble is locality/predictability.
Like Phantomus, I had no luck so far with kd-trees; maybe I'll come up with a nicer tree layout. It was much easier with BVHs (much more predictable, and the nodes were way bigger); I got a 20% speedup if I remember correctly.

thanks for the info. however, i find it quite confusing.
Quote:
Original post by tbp
Quote:
 Original post by Eelco: maybe using T0 isnt a good idea. if i remember correctly, T1 reads it into L1 cache, and T0 'into the processor', although i dont fully understand what was meant by this. (snip)... so maybe being more explicit by requesting actual memory from that cacheline to force prefetching will actually do anything.

Straight from Intel's doc:
• T0 (temporal data)—prefetch data into all levels of the cache hierarchy.
— Pentium III processor—1st- or 2nd-level cache.
— Pentium 4 and Intel Xeon processors—2nd-level cache.

errr, i'm confused. so it isn't possible to prefetch into L1 on anything newer than a PIII? that would suck badly, but i don't buy it.
Quote:
 • T1 (temporal data with respect to first level cache)—prefetch data into level 2 cache and higher.
 (snip)
 For the traversal, you want to load data into the L1. The only trouble is locality/predictability.

why is it hard to predict? as soon as you start with a node, you have the pointer to the cache line containing both children. so you can prefetch all your data with just one prefetch, at least 52 cycles before it's needed. according to my math, i shouldn't have any delays at my processor speed. for faster processors it might be a concern, but even then the majority of the time should be spent usefully, and don't forget you should already get plenty of natural cache hits with 8 nodes on one cache line and slightly coherent rays. i think it would be hard to get the total time spent waiting on RAM much lower than by just prefetching one node ahead.
Quote:
 Like Phantomus i had no luck for now with kdtrees, maybe i'll come up with a nicer tree layout. It was much easier with BVH (it's much more predictable and nodes were way bigger); got a 20% speedup if i remember.

havran describes a way of structuring trees where he claims near 100% cache hits. however, he assumes a 128-bit cache line size and doesn't use wald's node layout. i don't remember the details though.

Quote:
 Original post by tbp: but it seems that the cost of systematically storing things on top of the stack outweights the cost for the branch. The problem is that removing that branch also changed the way the rest of the code is unrolled by gcc.

The key is that you can start the split-point calculation almost at the very beginning of the loop. The way it works for me is that I use cmpps and movmskps to generate bits for en/ex against the split, and use those to do the conditional assignments for cur_node and far_node. I also do the normal split_t calculation and the split-point calculation. Each of those creates a very long dependency chain, but by interleaving the operations it ends up not stalling.

I would post an image of the pipeline sim that CodeAnalyst gives me, but I don't have a hosting site.

The point is, the last instruction to calculate the split point actually dispatches just before the movmskps, but doesn't complete until nearly 30 instructions later, just before it's written to the stack.

Also, I just got rid of my prefetching code, and there are startlingly few cache misses during traversal. Interesting. And ditching the prefetch sped it up to 50 cycles (There's plenty of room for improvement, still) :)

Quote:
 Original post by ajas95: Also, I just got rid of my prefetching code, and there are startlingly few cache misses during traversal. Interesting. And ditching the prefetch sped it up to 50 cycles (There's plenty of room for improvement, still) :)

it might be due to using a small test model that fits nicely in cache, or maybe it's just not that important.

50 cycles, that's pretty neat. my lack of ASM skills prevents me from understanding half of what you're saying, but i intend to eventually.

Quote:
 Original post by Eelco: it might be due to using a small test model that fits nicely in cache, or maybe its just not that important.

Well, it's 125,000 triangles in a 20-deep / 10-tris-per-node tree, so I don't think it ALL fits in the cache :)

I've shaved off a few more instructions that became redundant when I got rid of prefetching, so it's down to 46 now. The big obvious bottleneck now is the conditional move that increments ex_idx again when ex_idx == en_idx. So I'd like to go back to my original question about how to get rid of that.

Any ideas?

Quote:
 Original post by Eelco: errr, im confused. so it isnt possible to prefetch into L1 on anything > PIII? that would suck badly, but i dont buy it.

Buy it or not, that's the law.
Sue Intel, if you've bought a P4 then that's a very valid reason :p

Quote:
 Original post by Eelco: why is it hard to predict? as soon as you start with a node, you have the pointer to the cacheline containing both children. so you can prefetch all your data with just one prefetch atleast 52 cycles before its needed. according to my math, i shouldnt have any delays with my processor speed then. (snip)

Eh? You need to prefetch way before you need the data; that's the only purpose of prefetching. Otherwise, just let the processor do its magic.
If you don't believe me, once again, just try. You'll just waste decoding bandwidth with no other effect :)
So it's a bit more tricky than what you describe.

Side note:
I've instrumented my "locate leaf" function (Wald traversal); for a 500k model on a 21-node-deep search (with 9 pushes) i'm at about 20 cycles (give or take a couple) for mono rays. That's a bit artificial though, because it's normally inlined and the instrumentation changed the codegen a bit.
Anyway that's a good indication.

Quote:
 Original post by tbp: Eh? you need to prefetch way before you need the data, that's the only purpose of prefetching. Otherwise, just let the processor do its magic. If you don't believe me once again, just try. And you'll just waste away decoding bandwidth with no other effect :) So it's a bit more tricky than what you describe.

it's not that i don't believe you, but i'd like to understand the reasons for it.

one of my CPU's cycles (at 1.3 GHz) is about 0.77*10^-9 seconds, or 0.77 ns, so processing one node covers roughly 0.77 * 52 ≈ 40 ns (probably much more, since i don't think i have the guts to dive deeply into ASM after all). so i should be able to fetch my data without much stalling, assuming 60 ns RAM.
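Spelling out that back-of-envelope arithmetic as a tiny helper (the 1.3 GHz, 52-cycle and 60 ns figures are the poster's assumptions, not measurements):

```c
#include <assert.h>

/* At clock_ghz GHz, each cycle takes 1/clock_ghz ns, so a node that
 * costs `cycles` cycles of useful work can hide cycles/clock_ghz ns
 * of memory latency behind it.  At 1.3 GHz and ~52 cycles per node
 * that's ~40 ns, in the same ballpark as an assumed ~60 ns RAM
 * access, which is why a one-node-ahead prefetch looks plausible. */
double ns_covered(double clock_ghz, int cycles)
{
    return (double)cycles / clock_ghz;
}
```

The gap between ~40 ns of cover and ~60 ns of latency is exactly the residual stall the question is about.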

am i completely overlooking something? are you guys designing for processors three times faster? is this 60 ns figure very optimistic? otherwise i don't see why prefetching isn't going to work for me.

but i agree, there is only one way to find out for sure :).

Quote:
 Original post by Eelco: its not that i dont believe you, but id like to understand the reasons for it. one of my CPU (1.3ghz) cycles is 0.7*10^-9 seconds, or 0.7nS, so in the time it takes to process one node, 0.7*50 = 40nS

See "IA-32 Intel® Architecture Optimization" chap. 6 and appendix E, "Mathematics of Prefetch Scheduling Distance", for a complete discussion + formula.

AMD says, more pragmatically:
Determining Prefetch Distance
When determining how far ahead to prefetch, the basic guideline is to initiate the prefetch early enough so that the data is in the cache by the time it is needed, under the constraint that there can’t be more than eight prefetches in flight at any given time.
To determine the optimal prefetch distance, use empirical benchmarking when possible. Prefetching three or four cache lines ahead (192 or 256 bytes) is a good starting point and usually gives good results. Trying to prefetch too far ahead impairs performance.
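That guideline maps onto a streaming loop roughly like this sketch, using the portable `__builtin_prefetch` builtin in place of the raw instruction; the four-line distance is the doc's suggested starting point, not a tuned value:

```c
#include <assert.h>
#include <stddef.h>

#define LINE 64              /* AMD k8 cache line size, per the doc */
#define PF_DIST (4 * LINE)   /* prefetch ~4 cache lines ahead */

/* Sum an array while hinting the lines we'll need shortly.  The CPU
 * may drop the hint (and prefetches of addresses past the end of the
 * array do not fault); the result is identical either way. */
long sum_with_prefetch(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch((const char *)&a[i] + PF_DIST, 0, 1);
        s += a[i];
    }
    return s;
}
```

Note this is the easy, predictable-stream case; the whole difficulty in the tree traversal above is that the next node's address isn't known far enough in advance.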

Quote:
 (probably much higher, since i dont think i have the gut to be diving deeply into ASM afterall)

No need to dive into asm, that's what good compilers are for.

Quote:
 but i agree, there is only one way to find out for sure :).

;)

Quote:
Original post by tbp
Quote:
 (probably much higher, since i dont think i have the gut to be diving deeply into ASM afterall)

No need to dive into asm, that's what good compilers are for.

i think ajas95 would disagree...

Quote:
 Original post by Eelco: i think ajas95 would disagree...

Perhaps.
But if the motivation for going to asm is scheduling/branch removal, then it's pretty moot: good compilers have patterns for cmov*, fcmov* and SSE and/andn/or sequences; e.g. in my leaf-search routine the split-plane distance is always computed but nicely scheduled away, and then there are various cmovs...
The generated code executes in 20 cycles (avg.) with a total of 2 branches (one is used for looping; removing the other isn't a win, at least on my box).

Quote:
Original post by tbp
Quote:
 Original post by Eelco: i think ajas95 would disagree...

Perhaps.
But if the motivation for going to asm is scheduling/branch removal, then it's pretty moot: good compilers have patterns for cmov*, fcmov* and SSE and/andn/or sequences; ie in my leaf search routine the split plane distance is always computed but nicely scheduled away, then there's various cmovs...
The generated code executes in 20 cycles (avg.) with a total of 2 branches (one being used for looping, removing the other isn't a win at least on my box).

what's a "leaf search routine"? are you referring to node traversal? in that case, 20 cycles average seems very fast to me. i was under the impression node traversal typically takes quite a bit longer, but i could be horribly wrong. anyway, i'd love to know what algorithm you've used.