Reducing the penalty for branching?

Started by
12 comments, last by Trienco 20 years, 3 months ago
let's say i'm weird and don't like a single "if" making a function take twice as long as it should, i.e. (N, V, A, P being 3-float vectors):

float s = N*V;
if (!s) return -1;
float d = N*(P-A);
if (d < 0) r = -r;
float t = (d-r)/s;
if (t > 1 || t < 0) return -1; // offender
if (C) C = A + V*t;
if (nV) nV = (V - N*(2*s))*damp;
return 1-t;

this one will return at the offending line in 99% of all cases (no collision). simply returning without the test makes it require .14-.20µs; with the test and the following code commented out it's .25-.35, while with the following code it's even .35-.40µs. that annoying jump seems to take quite a while and probably causes a cache miss as well. is there any way to avoid this? i'm thinking that if the resulting asm code placed the return right behind the test and only branched if the condition is false, it might work better. but as asm isn't my strong side and always feels a bit like breaking things, i wonder if there's a trick to make the compiler see this opportunity.
f@dz
http://festini.device-zero.de
A good starting point where you can read about it is here.
There are some articles (at a very low level) explaining how it works; a bit outdated, but if you're curious you'll surely enjoy the reading.

-Mat
perfect.. though the fact that the cpu is trying to predict the jump is a bit frustrating. it doesn't sound like messing with the asm output and reordering it would do much good. with probably hundreds of collisions each frame that return early, it's pretty likely to predict that correctly already.
f@dz
http://festini.device-zero.de
doesn't msvc have directives to re-order the instructions for you?

in my gcc project I have macros that let me write things like:

if(likely(some_test))

and gcc will reorder the instructions appropriately.
just noticed i'm an idiot. the .14µs was only possible because the compiler noticed the function always returns 0 and just replaced the whole call with 0 (just returning 0 from the function took exactly those .14).
checking with this in mind reduced the difference quite a lot, so it's not really worth the trouble i guess (15% faster doesn't matter as much as 100%).

guess i should have profiled the single lines in the first place instead of guessing and second-guessing ,-)

at least it resulted in learning about those macros and getting an interesting link.. thanks for that.

ok, did some profiling, and either my tsc stuff is crap or measuring times in the .x/.0x microsecond range is rather pointless. it's interesting how the values change every time the program runs but remain consistent within one run, or how N*V can take .12 while N*(A-P) takes only .05 (vectors, naive implementation). there must be a lot of optimizing going on inside that cpu. is it caching instructions somehow, and can that make such a difference? *back to reading a bit more*

[edited by - Trienco on January 18, 2004 6:45:31 AM]
f@dz
http://festini.device-zero.de
Hehe, profiling code these days is really, really tricky. I don't know what tools you are using, but I believe none of the existing ones is 100% accurate; just a task switch *will* affect your profiling results if it is not counting thread time only, and even the thread time is not accurate, so the error propagates on each thread/task switch.
Another nice thing is when the OS hits the disk (because of a page fault in your app or a background task); the results are really crazy then.
I never tried VTune (from Intel), but I presume it is about the best you can get. As far as I remember, it analyses how your code can best be 'pipelined' or 'parallelized' for the target CPU, and it does that knowing exactly how the CPU works, so it 'calculates' how long it will take (in a figurative way; it won't tell you xxx ns). There may be an evaluation version floating around, but I don't know.
But remember that you can spend your whole life optimizing a piece of code :D

-Mat




hm.. i remember intel's vtune making a better impression than amd's codeanalyst (though i think that one was free at least). in theory cpuid should wait until all instructions are completed, so i'm still a bit amazed how the same thing (a dot product) can be 4 times faster the second time around.. guess correct profiling is a bit more work than it's worth for now (some problems might be solved by setting priority to realtime, hoping that the os will actually care about it, and praying that it won't end up in an infinite loop during that phase). maybe it's time to "evaluate" vtune again. had to reinstall a while ago anyway, so at least there won't be any traces from last time ,-)
f@dz
http://festini.device-zero.de
quote: Original post by Trienco
how N*V can take .12 and N*(A-P) only .05 (vectors, naive implementation).


N is almost sure to be cached (or even still in a register) the second time it is used in that function. Another thing is that since pretty much every modern CPU has multiple execution pipelines, the second expression might execute faster than you think. N*(A-P) is equivalent to (N*A)-(N*P), so the CPU could execute N*A and N*P simultaneously, since their results do not depend on one another; then it is simply a 1-cycle subtract.
that's what i hate about programming: the shortest line doesn't have to produce the fastest code.

i tried that, and in fact n*p - n*a was close to twice as fast. that is, if values of .03µs even have any meaning considering all the things that mess with them.

(glad those profiling counters are so easy to switch between logging every value and logging every x ms.) using the old version was varying wildly between .03 and .25 (no real pattern), while yours is mostly at .03 with single peaks of .25 (sounds like a cache miss?)

i begin to understand why intel is charging money for software that is just "taking some numbers". mental note: keep parallel execution in mind rather than minimizing the number of operations.
f@dz
http://festini.device-zero.de
Really, you should probably just worry about writing correct code and let the compiler do its thing. When you do want to optimize something, use the profiler to check whether the change actually made it faster. The thing about modern computers is that it is usually difficult to tell which method will be faster without testing both and seeing for yourself. You may find that annoying, but the alternative (where shorter code meant shorter run times) would mean your program would run slower than it does today.

This topic is closed to new replies.
