Stack-based Kd traversal

Started by
32 comments, last by Eelco 19 years, 3 months ago
Quote:Original post by tbp
Eh? you need to prefetch way before you need the data, that's the only purpose of prefetching. Otherwise, just let the processor do its magic.
If you don't believe me, once again: just try. You'll just waste decode bandwidth to no other effect :)
So it's a bit more tricky than what you describe.

It's not that I don't believe you, but I'd like to understand the reasons for it.

one of my CPU's (1.3 GHz) cycles is about 0.77*10^-9 seconds, or 0.77 ns, so the time it takes to process one node is 0.77*50 ≈ 38 ns (probably much higher, since I don't think I have the guts to dive deeply into ASM after all). I should be able to fetch my data without much stalling, assuming 60 ns RAM.

Am I completely overlooking something? Are you guys designing for three-times-faster processors? Is this 60 ns figure very optimistic? Otherwise I don't see why prefetching isn't going to work for me.

But I agree, there is only one way to find out for sure :).
Quote:Original post by Eelco
It's not that I don't believe you, but I'd like to understand the reasons for it.

one of my CPU's (1.3 GHz) cycles is about 0.77*10^-9 seconds, or 0.77 ns, so the time it takes to process one node is 0.77*50 ≈ 38 ns


See "IA-32 Intel® Architecture Optimization" chap. 6 and appendix E, "Mathematics of Prefetch Scheduling Distance", for a complete discussion + formula.

Amd says, more pragmatically:
Determining Prefetch Distance
When determining how far ahead to prefetch, the basic guideline is to initiate the prefetch early enough so that the data is in the cache by the time it is needed, under the constraint that there can’t be more than eight prefetches in flight at any given time.
To determine the optimal prefetch distance, use empirical benchmarking when possible. Prefetching three or four cache lines ahead (192 or 256 bytes) is a good starting point and usually gives good results. Trying to prefetch too far ahead impairs performance.

Quote:
(probably much higher, since I don't think I have the guts to dive deeply into ASM after all)

No need to dive into asm, that's what good compilers are for.

Quote:But I agree, there is only one way to find out for sure :).

;)
Quote:Original post by tbp
Quote:
(probably much higher, since I don't think I have the guts to dive deeply into ASM after all)

No need to dive into asm, that's what good compilers are for.

I think ajas95 would disagree...
Quote:Original post by Eelco
I think ajas95 would disagree...


Perhaps.
But if the motivation for going to asm is scheduling/branch removal, then it's pretty moot: good compilers have patterns for cmov*, fcmov* and SSE and/andn/or sequences. E.g. in my leaf search routine the split plane distance is always computed but nicely scheduled away, and then there are various cmovs...
The generated code executes in 20 cycles (avg.) with a total of 2 branches (one is used for looping; removing the other isn't a win, at least on my box).
Quote:Original post by tbp
Quote:Original post by Eelco
I think ajas95 would disagree...


Perhaps.
But if the motivation for going to asm is scheduling/branch removal, then it's pretty moot: good compilers have patterns for cmov*, fcmov* and SSE and/andn/or sequences. E.g. in my leaf search routine the split plane distance is always computed but nicely scheduled away, and then there are various cmovs...
The generated code executes in 20 cycles (avg.) with a total of 2 branches (one is used for looping; removing the other isn't a win, at least on my box).

What's a leaf search routine? Are you referring to node traversal? In that case, 20 cycles average seems very fast to me. I was under the impression node traversal typically takes quite a bit longer, but I could be horribly wrong. Anyway, I'd love to know what algorithm you've used.
Quote:Original post by tbp
good compilers have patterns for cmov*, fcmov* and SSE and/andn/or sequences;


Not my compiler! I've looked at the code VC 2003 generates. It can be really, really bad at FPU code. Even when I try intrinsics, it schedules in all sorts of unnecessary movaps to store intermediate values on the stack. And no, it doesn't use cmov... and this is with every sort of optimization turned on.

So if GCC is generating that much better code, then it has become very smart. Could you post the asm it generates for the non-leaf portion of the loop?
Quote:Original post by Eelco
What's a leaf search routine? Are you referring to node traversal?


Yes, that's part of the node traversal. You can more or less decompose such a traversal as:
1) locate a leaf (pushing the other "path" along the way)
2) intersect with the leaf content; if that meets some criteria (distance etc.) you're done, otherwise pop the stack and goto 1).

See Wald's thesis p. 107 or Havran's thesis TAseq implementation (appendix A, p. 153) for such a decomposition (explicit in TAseq implementation).

I supposed that was what ajas95 measured, as he talked about distance computation and stack pushes, though I was unsure whether it was for mono rays or packets.

I've given some context for that 20-cycles-per-node measure, and yes, I think it's pretty OK (and compiler generated :p).
Quote:Original post by ajas95
Not my compiler! I've looked at the code VC 2003 generates. It can be really, really bad at FPU code. Even when I try intrinsics, it schedules in all sorts of unnecessary movaps to store intermediate values on the stack.

It's not too bad with the FPU, but it's totally asinine with intrinsics/SSE (more on that below).
Sometimes you can convince it to output a cmov, but it's very limited. GCC is much better at that (and at intrinsics ;).

Here's a pathological code gen case for msvc2k3 & intrinsics:
http://www.flipcode.com/cgi-bin/fcmsg.cgi?thread_show=24087#p194723

Quote:And no, it doesn't use cmov... and this is with every sort of optimization turned on.

You've got to hit the right optimization pattern, but as I've said it's both rare and very limited.

Quote:So if GCC is generating that much better code, then it has become very smart. Could you post the asm it generates for the non-leaf portion of the loop?

Sure, but first some disclaimers:
. that function is inlined in real code (in fact I've cut it out to benchmark it), so this is a bit artificial (and the code gen varies a lot wrt branching etc.)
. experimental gcc used (and I mean it)
. you'll notice some inefficiencies here & there, but that's related to the previous point
. that's not the instrumented version
. edit: revised timings for the instrumented version: 26 cycles per iteration/node; perhaps it's worth nuking the branch then.
. edit: it's not 26 cycles either; it's less, but it depends not only on the push/no-push ratio but also on how deep the search was. It's a micro-benchmark and not very useful on its own; let's say it's around 20 cycles - that is, fast enough - when we're not waiting for memory (the root of all Evil in the Real World).


Comments are from a mail sent to Jacco The Mighty:
004011d0 <locate_leaf(unsigned long const*, kdtree::node_t const*, rt::mono::ray_t const&, rt::mono::ray_segment_t&)>:
  4011d0:       push   %ebp
  4011d1:       push   %edi
  4011d2:       push   %esi
  4011d3:       push   %ebx
  4011d4:       mov    %edx,%ebx
  4011d6:       sub    $0x8,%esp
  4011d9:       mov    0x423420,%ebp
  4011df:       mov    %eax,0x4(%esp)
  4011e3:       mov    %ecx,(%esp)
  4011e6:       mov    (%ebx),%edx             # load the node offset + flag bits
  4011e8:       test   %edx,%edx               # tricky way to check for a leaf
  4011ea:       js     401241                  # gcc found that on its own
  4011ec:       mov    (%esp),%ecx
  4011ef:       mov    %edx,%eax
  4011f1:       and    $0xfffffffc,%edx        # offset
  4011f4:       and    $0x3,%eax               # axis
  4011f7:       movss  0x4(%ebx),%xmm0         # split
  4011fc:       add    %edx,%ebx
  4011fe:       xor    %edx,%edx
  401200:       inc    %ebp
  401201:       subss  (%ecx,%eax,4),%xmm0     # part of the distance computation
  401206:       lea    0x8(%ebx),%edi
  401209:       mulss  0x20(%ecx,%eax,4),%xmm0 # distance
  40120f:       mov    0x4(%esp),%ecx
  401213:       mov    (%ecx,%eax,4),%esi
  401216:       mov    0x1c(%esp),%eax
  40121a:       movss  0x4(%eax),%xmm1
  40121f:       mov    %esi,%ecx
  401221:       comiss %xmm1,%xmm0             # compare to far
  401224:       seta   %dl
  401227:       xor    %edx,%ecx               # node addr computation, haha
  401229:       comiss (%eax),%xmm0            # compare to near
  40122c:       setb   %al
  40122f:       movzbl %al,%eax
  401232:       or     %edx,%eax               # that's the skip condition
  401234:       je     401251                  # that's the "push on stack" branch
  401236:       test   %ecx,%ecx
  401238:       cmove  %edi,%ebx
  40123b:       mov    (%ebx),%edx
  40123d:       test   %edx,%edx
  40123f:       jns    4011ec                  # again a test for leaf (it's unrolled a bit)
  401241:       mov    %ebp,0x423420
  401247:       mov    %ebx,%eax
  401249:       add    $0x8,%esp
  40124c:       pop    %ebx
  40124d:       pop    %esi
  40124e:       pop    %edi
  40124f:       pop    %ebp
  401250:       ret
  401251:       mov    0x423410,%eax           # from here is the push on stack branch
  401256:       test   %esi,%esi
  401258:       mov    %ebx,%ecx
  40125a:       cmove  %edi,%ecx
  40125d:       mov    %eax,%edx
  40125f:       inc    %eax
  401260:       shl    $0x4,%edx
  401263:       test   %esi,%esi
  401265:       mov    %eax,0x423410
  40126a:       mov    %ecx,0x423010(%edx)
  401270:       mov    0x1c(%esp),%ecx
  401274:       cmovne %edi,%ebx               # the new node
  401277:       movss  %xmm0,0x423014(%edx)
  40127f:       movss  %xmm0,0x4(%ecx)
  401284:       movss  %xmm1,0x423018(%edx)
  40128c:       jmp    4011e6


[Edited by - tbp on January 11, 2005 12:33:39 PM]
tbp,
thank you very much for posting that. It made me realize something... obviously, we're using different algorithms. I'm using Havran's TbRec that he bragged so hard about in his thesis... but it's clearly flawed performance-wise.

The idea behind it is that the stack-push case happens roughly 22% of the time, so if, by making that operation more expensive, we make the majority of traversal cases faster, then it's a win. But he measured 'faster' in terms of flops, and flops are actually fast... the branch complexity is the real slow-down, and he makes the stack-push case MUCH more expensive (that was what I've been optimizing against).

So I think the outcome of this is that Havran's method is not the best. However, I still refuse to believe that the branch mis-predict penalty is less than writing 3 values to the stack and doing a conditional increment to the stack_idx. Mis-predict alone is 18-20 cycles.

And if that is the kind of code experimental gcc is producing, then its future is very bright.
Quote:Original post by ajas95
tbp,
thank you very much for posting that. It made me realize something... obviously, we're using different algorithms. I'm using Havran's TbRec that he bragged so hard about in his thesis... but it's clearly flawed performance-wise.

The idea behind it is that the stack-push case happens roughly 22% of the time, so if, by making that operation more expensive, we make the majority of traversal cases faster, then it's a win. But he measured 'faster' in terms of flops, and flops are actually fast... the branch complexity is the real slow-down, and he makes the stack-push case MUCH more expensive (that was what I've been optimizing against).


Definitely.
And that's what Wald tackled in his thesis's traversal (fewer branches/cases, and recognizing that computing the plane distance is not an issue on a modern CPU); the only catch being that the pseudo-code he presents is deliberately sub-optimal and can be condensed to what I've presented (to be fair, he gives some hints about it).

Quote:
So I think the outcome of this is that Havran's method is not the best. However, I still refuse to believe that the branch mis-predict penalty is less than writing 3 values to the stack and doing a conditional increment to the stack_idx. Mis-predict alone is 18-20 cycles.


There's a bit more work than that on the push branch: the node & ray segment are updated. So in order to nuke it you have to conditionalize another bunch of values.
I'm running this on an opteron and branching isn't that Evil (even when mispredicted).
Anyway it's not simple to evaluate that in a sensible fashion... You may very well be right :)


Quote:And if that is the kind of code experimental gcc is producing, then its future is very bright.


Indeed :)

