Archived

This topic is now archived and is closed to further replies.

Trienco

Reducing the penalty for branching?

Recommended Posts

Trienco    2555
Let's say I'm weird and don't like a single "if" making a function take twice as long as it should. I.e. (N, V, A, P being 3-float vectors):

    float s = N*V;
    if (!s) return -1;
    float d = N*(P-A);
    if (d < 0) r = -r;
    float t = (d-r)/s;
    if (t > 1 || t < 0) return -1; // offender
    if (C) C = A + V*t;
    if (nV) nV = (V - N*(2*s))*damp;
    return 1-t;

This one will return at the offending line in 99% of all cases (no collision). Simply returning without the test makes it require .14-.20µs. With the test and the following code commented out it's .25-.35, while with the following code it's even .35-.40µs. That annoying jump seems to take quite a while and probably causes a cache miss as well. Is there any way to avoid this? I'm thinking that if the resulting asm code placed the return right behind the test and only branched if the condition is false, it might work better. But as asm isn't my strong side and always feels a bit like breaking things, I wonder if there's a trick to make the compiler see this opportunity.

matias suarez    122
A good starting point where you can read about it is here.
There are some articles (at a very low level) explaining how it works; a bit outdated, but if you're curious you'll surely enjoy the reading.

-Mat

Trienco    2555
Perfect... though the fact that the CPU is trying to predict the jump is a bit frustrating. Doesn't sound like messing with the asm output and reordering it would do much good. With probably hundreds of collisions each frame that return early, it's pretty likely to already predict that right.

C-Junkie    1099
Doesn't MSVC have directives to re-order the instructions for you?

in my gcc project I have macros that let me do things like,

if(likely(some_test))

and gcc will reorder the instructions appropriately.
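A minimal sketch of such macros, assuming GCC's `__builtin_expect` (these are the usual Linux-kernel-style definitions; the `misses()` example function is mine):

```c
/* Linux-kernel-style branch hints: __builtin_expect is a GCC extension
   that tells the compiler which way a branch usually goes, so it can
   lay out the common path as straight-line fall-through code. */
#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)   /* other compilers: plain condition, no hint */
#define unlikely(x) (x)
#endif

/* In the collision function above, the early-out fires in 99% of calls,
   so there it would be the *likely* branch (hypothetical illustration): */
int misses(float t)
{
    if (likely(t > 1.0f || t < 0.0f))
        return 1;  /* common case: no collision this frame */
    return 0;
}
```

Note that the hint only affects code layout and static prediction; a hardware predictor that has already seen the branch many times will usually get it right on its own.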

Trienco    2555
Just noticed I'm an idiot. The .14µs was only possible because the compiler noticed the function always returns 0 and just replaced the whole thing with 0 (just returning 0 from the function was taking exactly those .14).
Checking with this in mind reduced the difference quite a lot, so it's not really worth the trouble I guess (15% faster doesn't matter as much as 100%).

Guess I should have profiled the single lines in the first place instead of guessing and second-guessing ;-)

At least it resulted in learning about those macros and getting an interesting link... thanks for that.

OK, did some profiling, and either my TSC code is crap or measuring times in the .x/.0x microsecond range is rather pointless. It's interesting how the values change every time the program runs but remain consistent within one run, or how N*V can take .12 and N*(A-P) only .05 (vectors, naive implementation). There must be a lot of optimizing going on inside that CPU. Is it caching instructions somehow, and can that make such a difference? *back to reading a bit more*
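Since single cpuid+rdtsc readings bounce around with task switches and cache state, one common trick is to repeat the measurement many times and keep the minimum. A sketch using std::chrono instead of raw TSC reads (the helper and demo names are mine):

```cpp
#include <chrono>

// Time a callable many times and keep the minimum reading: the minimum is
// the sample least contaminated by interrupts, task switches and cache-cold
// first runs -- exactly what makes single-shot TSC numbers jump around.
template <class F>
double min_time_ns(F&& f, int runs = 10000)
{
    using clock = std::chrono::steady_clock;
    double best = 1e30;
    for (int i = 0; i < runs; ++i) {
        auto t0 = clock::now();
        f();
        auto t1 = clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        if (ns < best) best = ns;
    }
    return best;
}

// volatile keeps the measured work from being optimized away -- the very
// trap described above, where the whole function was folded to a constant.
double demo()
{
    volatile float sink = 0.0f;
    return min_time_ns([&] { sink = sink + 1.0f; });
}
```

For sub-microsecond snippets it also helps to time a loop of many iterations and divide, since a single iteration can be shorter than the clock's resolution.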

[edited by - Trienco on January 18, 2004 6:45:31 AM]

matias suarez    122
Hehe, profiling code these days is something really, really tricky. I don't know what tools you're using for profiling, but I believe none of the existing ones is 100% accurate. Just a task switch *will* affect your profiling results if you're not measuring thread time only, and even thread time is not accurate, so the error propagates with each thread/task switch.
Another nice thing is when the OS hits the disk (because of a page fault in your app or a background task); the results get really crazy then.
I never tried VTune (from Intel), but I presume it's about the best you can get. As far as I remember it analyses how your code can best be "pipelined" or "parallelized" for the target CPU, and it does that knowing exactly how the CPU works, so it "calculates" how long the code will take (in a figurative way; it won't tell you xxx ns). There may be an evaluation version floating around, but I don't know.
But remember that you can spend your whole life optimizing a piece of code :D

-Mat




Trienco    2555
Hm... I remember Intel's VTune making a better impression than AMD's CodeAnalyst (though I think that one was free at least). In theory cpuid should wait until all instructions are completed, so I'm still a bit amazed how the same thing (a dot product) can be 4 times faster the second time around. Guess correct profiling is a bit more work than it's worth for now (some problems might be solved by setting priority to realtime, hoping that the OS will actually care about it and praying that it won't end up in an infinite loop during that phase). Maybe it's time to "evaluate" VTune again. Had to reinstall a while ago anyway, so at least there won't be any traces from last time ;-)

Russell    118
quote:
Original post by Trienco
how N*V can take .12 and N*(A-P) only .05 (vectors, naive implementation).


N is almost sure to be cached (or even still in a register) the second time it is used in that function. Another thing is that since pretty much every modern CPU has multiple execution pipelines, the second one might execute faster than you'd think. N*(A-P) is equivalent to (N*A)-(N*P), so the CPU could execute N*A and N*P simultaneously, since their results do not depend on one another; what remains is simply a 1-cycle subtract.

Trienco    2555
That's what I hate about programming: the shortest line doesn't have to produce the fastest code.

I tried that, and in fact N*P - N*A was close to twice as fast. That is, if values of .03µs even have any meaning considering all the things that mess around with them.

(Glad those profiling counters are so easy to switch between logging every value and logging every x ms.) The old version was varying wildly between .03 and .25 (no real pattern), while yours is mostly at .03 with single peaks of .25 (sounds like a cache miss?).

I begin to understand why Intel is charging money for software that is just "taking some numbers". Mental note: keep parallel execution in mind rather than just minimizing the number of operations.

Russell    118
Really you should probably just worry about writing correct code, and let the compiler do its thing. When you do want to optimize something, use the profiler to see if the change you made actually made it faster. The thing about modern computers is that it is usually difficult to tell for sure which method will be faster without testing both methods and seeing for yourself which is faster. You may find that annoying, but the alternative (where shorter code meant shorter run times) would mean your program would run slower than it does today.

Trienco    2555
I think I prefer the first option by now. All those "hidden" things in the compiler and CPU seem to make it a pain to figure out what's actually happening and causing the numbers you see. At least it's just a way to kill some time, so I can afford doing pointless micro-optimizations on functions that will most likely never ever be a problem. I hoped it would be a good way to figure out how a few things work, but it only showed me how hard it is to know/guess what's really going on. That makes optimizing a function a bit of a trial-and-error thing: change something, check the time, change it again, check the time, notice that the same version varies by 200% every time you run the program, and then just decide it's not worth it. One more reason to optimize algorithms rather than code.

Guest Anonymous Poster
quote:
Original post by Trienco
That's what I hate about programming: the shortest line doesn't have to produce the fastest code.
And when you start changing algorithms, you may end up writing 100 times more code than the original, and it'll be faster.

matias suarez    122
From here:

---
Our code is expertly written in hand-optimized assembly language. It is thoroughly documented and virtually bug-free.
---
from the benchmark page:
---
...is faster than Direct3D software rendering. In some cases, more than 1000% faster.
---

Hand-made optimization can really make a difference; compiler optimizations are still really far from producing "the best possible optimized code", they just produce "an acceptable result in a general-purpose way".
This is just one of the many examples where hand-made optimizations are present as of today.
Anyway, sometimes it's better to focus more on content than speed.

-Mat
