Quote:Original post by ajas95
Not my compiler! I've looked at the code VC 2003 generates. It can be really really bad at fpu code. Even when I try intrinsics, it schedules in all sorts of unnecessary movaps to store intermediate values on the stack.
It's not too bad with fpu, but it's totally asinine with intrinsics/SSE (more on that below).
Sometimes you can convince it to output a cmov but it's very limited. GCC is much better for that (and intrinsics ;).
Here's a pathological code gen case for msvc2k3 & intrinsics:
http://www.flipcode.com/cgi-bin/fcmsg.cgi?thread_show=24087#p194723
Quote:And no, it doesn't use cmov... and this is with every sort of optimization turned on.
You've got to hit the right optimization pattern, but as I've said, that's both rare and very limited.
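To make "the right optimization pattern" a bit more concrete, here's a minimal sketch of the kind of source that tends to end up as a cmov: a plain select between two values that are already computed, with no side effects on either side. The helper names are made up for illustration. Whether a given compiler actually emits cmov for the first form depends on the compiler, the flags and the target (cmov needs a P6-class target, e.g. -march=i686 or higher with 32-bit GCC), so check the disassembly rather than trusting this. The second form gets a branch-free result by hand with a mask, whatever the compiler decides.

```cpp
#include <cstdint>

// Hypothetical helper: a side-effect-free select. Both operands are already
// evaluated, which is the shape compilers most readily turn into cmov.
// No guarantee: the compiler is free to emit a branch instead.
inline std::uint32_t select_u32(bool cond, std::uint32_t a, std::uint32_t b)
{
    return cond ? a : b;
}

// Hypothetical helper: the same select done with a mask, so the result is
// branch-free by construction (at the cost of a few extra ALU ops).
inline std::uint32_t select_mask(bool cond, std::uint32_t a, std::uint32_t b)
{
    // cond == true  -> mask = 0xFFFFFFFF, cond == false -> mask = 0
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond);
    return (a & mask) | (b & ~mask);
}
```

Both helpers return the same value for the same inputs; the only difference is how much is left to the code generator.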
Quote:So if GCC is generating that much better code, then it has become very smart. Could you post the asm it generates for the non-leaf portion of the loop?
Sure, but first some disclaimers:
. that function is inlined in real code (in fact i've cut it away to bench), so that's a bit artificial (and the code gen varies a lot wrt branching etc)
. experimental gcc used (and i mean it)
. you'll notice some inefficiencies here & there but that's related to previous point
. that's not the instrumented version
. edit: revised timings for the instrumented version: 26 cycles per iteration/node; perhaps it's worth nuking the branch then.
. edit: it's not 26 cycles either; it's less, but it depends not only on the push/no-push ratio but also on how deep the search went. It's a micro bench and not very useful on its own; let's say it's around 20 cycles (that is, fast enough) when we're not waiting for memory (the root of all Evil in Real World).
Comments are from a mail sent to Jacco The Mighty:
004011d0 <locate_leaf(unsigned long const*, kdtree::node_t const*,rt::mono::ray_t const&, rt::mono::ray_segment_t&)>:
  4011d0: push %ebp
  4011d1: push %edi
  4011d2: push %esi
  4011d3: push %ebx
  4011d4: mov %edx,%ebx
  4011d6: sub $0x8,%esp
  4011d9: mov 0x423420,%ebp
  4011df: mov %eax,0x4(%esp)
  4011e3: mov %ecx,(%esp)
  4011e6: mov (%ebx),%edx                 # load the node offset + flag bits
  4011e8: test %edx,%edx                  # tricky way to check for a leaf
  4011ea: js 401241                       # gcc found that on its own
  4011ec: mov (%esp),%ecx
  4011ef: mov %edx,%eax
  4011f1: and $0xfffffffc,%edx            # offset
  4011f4: and $0x3,%eax                   # axis
  4011f7: movss 0x4(%ebx),%xmm0           # split
  4011fc: add %edx,%ebx
  4011fe: xor %edx,%edx
  401200: inc %ebp
  401201: subss (%ecx,%eax,4),%xmm0       # part of the distance computation
  401206: lea 0x8(%ebx),%edi
  401209: mulss 0x20(%ecx,%eax,4),%xmm0   # distance
  40120f: mov 0x4(%esp),%ecx
  401213: mov (%ecx,%eax,4),%esi
  401216: mov 0x1c(%esp),%eax
  40121a: movss 0x4(%eax),%xmm1
  40121f: mov %esi,%ecx
  401221: comiss %xmm1,%xmm0              # compare to far
  401224: seta %dl
  401227: xor %edx,%ecx                   # node addr computation, haha
  401229: comiss (%eax),%xmm0             # compare to near
  40122c: setb %al
  40122f: movzbl %al,%eax
  401232: or %edx,%eax                    # that's the skip condition
  401234: je 401251                       # that's the "push on stack" branch
  401236: test %ecx,%ecx
  401238: cmove %edi,%ebx
  40123b: mov (%ebx),%edx
  40123d: test %edx,%edx
  40123f: jns 4011ec                      # again a test for leaf (it's unrolled a bit)
  401241: mov %ebp,0x423420
  401247: mov %ebx,%eax
  401249: add $0x8,%esp
  40124c: pop %ebx
  40124d: pop %esi
  40124e: pop %edi
  40124f: pop %ebp
  401250: ret
  401251: mov 0x423410,%eax               # from here is the push on stack branch
  401256: test %esi,%esi
  401258: mov %ebx,%ecx
  40125a: cmove %edi,%ecx
  40125d: mov %eax,%edx
  40125f: inc %eax
  401260: shl $0x4,%edx
  401263: test %esi,%esi
  401265: mov %eax,0x423410
  40126a: mov %ecx,0x423010(%edx)
  401270: mov 0x1c(%esp),%ecx
  401274: cmovne %edi,%ebx                # the new node
  401277: movss %xmm0,0x423014(%edx)
  40127f: movss %xmm0,0x4(%ecx)
  401284: movss %xmm1,0x423018(%edx)
  40128c: jmp 4011e6
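For anyone who'd rather not read AT&T syntax, here's a rough C++ reconstruction of what that listing appears to do. It's only a sketch read back from the disassembly, not the actual source. Every name in it (Node, Ray, Segment, StackEntry, dir_negative) is made up, and the struct layouts are simplified; the real ray_t clearly has its reciprocal direction at offset 0x20, for instance. I'm also assuming the first argument is a per-axis flag that says which child is the near one, that the traversal stack is passed in rather than being a global, and I've left out the global counter the listing increments.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical 8-byte node, layout guessed from the listing:
// bit 31 = leaf flag, bits 0-1 = split axis, the rest = byte offset
// from this node to its first child (the second child sits right after it).
struct Node {
    std::uint32_t word;
    float split;
};
static_assert(sizeof(Node) == 8, "sketch assumes 8-byte nodes");

struct Ray        { float origin[3]; float inv_dir[3]; };  // simplified layout
struct Segment    { float t_near, t_far; };
struct StackEntry { const Node* node; float t_near, t_far; };

const Node* locate_leaf(const std::uint32_t dir_negative[3], const Node* node,
                        const Ray& ray, Segment& seg,
                        std::vector<StackEntry>& stack)
{
    // Sign bit clear means inner node (the "test / js" trick in the listing).
    while (static_cast<std::int32_t>(node->word) >= 0) {
        const std::uint32_t axis   = node->word & 3u;
        const std::uint32_t offset = node->word & ~3u;

        // Distance along the ray to the split plane.
        const float d = (node->split - ray.origin[axis]) * ray.inv_dir[axis];

        const Node* first = reinterpret_cast<const Node*>(
            reinterpret_cast<const char*>(node) + offset);
        const Node* second = first + 1;

        const bool neg = dir_negative[axis] != 0;
        const Node* near_child = neg ? second : first;
        const Node* far_child  = neg ? first : second;

        if (d > seg.t_far) {
            node = near_child;            // plane beyond the segment: near side only
        } else if (d < seg.t_near) {
            node = far_child;             // plane before the segment: far side only
        } else {
            // Plane cuts the segment: remember the far side, continue in the near one.
            stack.push_back({far_child, d, seg.t_far});
            seg.t_far = d;
            node = near_child;
        }
    }
    return node;
}
```

The two skip cases correspond to the seta/setb/or sequence plus the cmove in the listing; the last case is the "push on stack" branch, where the far child and the [d, far] interval get stored and the current segment is clipped to d.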
[Edited by - tbp on January 11, 2005 12:33:39 PM]