Ultimape

How to profile code


I've searched all over the forums for how to profile code, but I always end up with just suggestions TO profile, nothing else. Is this a feature of some IDE, or a utility kind of like the GNU debugger (gdb)? How do you profile your code?

It really depends on three things:
1. How much money you want to spend
2. What platform you are on
3. What language you are developing in.

The poor man's profiler is to just do something like this (GetTickCount returns elapsed milliseconds as a DWORD):


#include <windows.h>

DWORD start = GetTickCount();

// ... run the code you want to time ...

DWORD end = GetTickCount();

DWORD totalTime = end - start;  // elapsed time in milliseconds


This works, but it's a pain to use. So you can take the time to write your own full-featured profiler like this.

Compuware gives out a toned-down freeware (or is it just a trial now?) version of its fantastic code profiler. You can find it here.

If you don't have the time or skills for that, then shelling out some cash might be best. There are a few alternatives; Wikipedia has a huge selection of links, so peruse them.

If you are on Linux, there is one built into the system (I can't remember the name, though...).

Hope this helps somewhat.

If you use Linux/gcc:

- compile (and link) using -pg
- run the executable, which writes a gmon.out profile file
- do $ gprof executablenamehere > resultsfilehere
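
For example, with a single-file C program (the file and program names here are just illustrative):

$ gcc -pg -O2 main.c -o myprog       # -pg is needed at both compile and link time
$ ./myprog                           # running it writes gmon.out in the current directory
$ gprof ./myprog gmon.out > results.txt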

Maybe there is a Windows port of gprof; I don't really know.

Quote:
Original post by hydroo
If you use Linux/gcc:

- compile (and link) using -pg
- run the executable, which writes a gmon.out profile file
- do $ gprof executablenamehere > resultsfilehere

Maybe there is a Windows port of gprof; I don't really know.


Ah, that's right. =) Can't believe I'd forgotten that, considering I was using it in my last semester's class!

Don't use GetTickCount. It has a resolution in the millisecond range. Unless your code is really slow, or you're timing entire frames, you need a higher resolution such as from QueryPerformanceCounter.
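
A minimal sketch of the same start/end pattern using QueryPerformanceCounter instead (untested, but the API usage is standard):

#include <windows.h>

LARGE_INTEGER freq, start, end;
QueryPerformanceFrequency(&freq);   // counter ticks per second

QueryPerformanceCounter(&start);
// ... run the code you want to time ...
QueryPerformanceCounter(&end);

double microseconds = 1.0e6 * (end.QuadPart - start.QuadPart) / (double)freq.QuadPart;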

Quote:
Original post by Deyja
Don't use GetTickCount. It has a resolution in the millisecond range. Unless your code is really slow, or you're timing entire frames, you need a higher resolution such as from QueryPerformanceCounter.

It's more than that, though; arbitrary profiling like that tends to produce results that are biased either for or against the code in question. Your code needs to be profiled in the usage domain that is applicable to the situation. The standard method of

Get start time
Do something that is expensive
Get end time

has several problems, in that it doesn't actually mimic a real-world usage pattern. For instance, profiling dot product code by performing a dot product many thousands of times in a row likely isn't a very good benchmark, because in real code you will usually be doing something with the results of each dot product that could introduce more overhead, or that could cause events like cache flushes which end up being more expensive than the dot product itself.

Quote:
Original post by Deyja
Don't use GetTickCount. It has a resolution in the millisecond range. Unless your code is really slow, or you're timing entire frames, you need a higher resolution such as from QueryPerformanceCounter.


A lot of people have noticed QPC is buggy on modern processors. timeGetTime with 1ms granularity is probably a safer option. Anyway, continue.
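
A sketch of that approach; note that timeGetTime only reaches 1 ms granularity if you raise the timer resolution first with timeBeginPeriod:

#include <windows.h>
#pragma comment(lib, "winmm.lib")   // timeGetTime and friends live in winmm

timeBeginPeriod(1);                 // request 1 ms timer granularity
DWORD start = timeGetTime();
// ... code to time ...
DWORD elapsedMs = timeGetTime() - start;
timeEndPeriod(1);                   // restore the previous granularity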

Quote:
Original post by skittleo
Quote:
Original post by Deyja
Don't use GetTickCount. It has a resolution in the millisecond range. Unless your code is really slow, or you're timing entire frames, you need a higher resolution such as from QueryPerformanceCounter.

A lot of people have noticed QPC is buggy on modern processors. timeGetTime with 1ms granularity is probably a safer option. Anyway, continue.

Actually, it's not buggy; it's just that if you do not set the processor affinity, your QPC results will be dependent upon the core that your thread executes on. Windows will attempt to keep your threads localized to a single core, but...not always. As such, due to different execution speeds of the various cores (since clock rates will vary per core), you may end up with different values being returned by QPC. Fixing this is simply a matter of setting your processor affinity.
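
A minimal sketch of that fix, pinning the measuring thread to core 0 for the duration of the measurement (the core choice is arbitrary):

#include <windows.h>

// Pin the current thread to a single core so every QPC read
// comes from the same core's counter.
DWORD_PTR oldMask = SetThreadAffinityMask(GetCurrentThread(), 1);

LARGE_INTEGER freq, start, end;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&start);
// ... code to time ...
QueryPerformanceCounter(&end);

SetThreadAffinityMask(GetCurrentThread(), oldMask);  // restore the original affinity

double ms = 1000.0 * (end.QuadPart - start.QuadPart) / (double)freq.QuadPart;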

Quote:
Original post by Washu
Quote:
Original post by skittleo
Quote:
Original post by Deyja
Don't use GetTickCount. It has a resolution in the millisecond range. Unless your code is really slow, or you're timing entire frames, you need a higher resolution such as from QueryPerformanceCounter.

A lot of people have noticed QPC is buggy on modern processors. timeGetTime with 1ms granularity is probably a safer option. Anyway, continue.

Actually, it's not buggy; it's just that if you do not set the processor affinity, your QPC results will be dependent upon the core that your thread executes on. Windows will attempt to keep your threads localized to a single core, but...not always. As such, due to different execution speeds of the various cores (since clock rates will vary per core), you may end up with different values being returned by QPC. Fixing this is simply a matter of setting your processor affinity.


But that's not the only problem with QPC: see the Microsoft KB article "Performance counter value may unexpectedly leap forward". Without an understanding of both of these issues (which just about every article I've encountered seems to miss), it would be easy to say that QPC is buggy. And actually it is buggy, since the cause of the problem I linked is a design defect.

Quote:
Original post by Washu

Actually, it's not buggy; it's just that if you do not set the processor affinity, your QPC results will be dependent upon the core that your thread executes on.


It's a bit more complicated than that... oh, and...

Quote:
Original post by Washu
Windows will attempt to keep your threads localized to a single core, but...not always.


No, it won't... at least not under Windows XP64 Pro. XP64 on a dual-core AMD64 in fact attempts to balance the usage of each core even if it means swapping a single thread back and forth many times per second, so that, for example, each core has a running 60% average utilization. The only way to prevent this is to set a thread or process affinity.

I am assuming that this is to theoretically balance heat generation between cores, but perhaps I have their motive wrong.

In any event, you would NOT want to avoid this behavior unless the production code will specifically avoid it as well. Profile the real thing, not a mock trial with specially allocated thread affinities and so forth.

Quad cores are here, so unless you plan on having special cases for 1-core, 2-core, 4-core, and soon 8-core systems... you really SHOULD let the OS manage the cores itself.

Quote:
Original post by Washu
As such, due to different execution speeds of the various cores (since clock rates will vary per core), you may end up with different values being returned by QPC. Fixing this is simply a matter of setting your processor affinity.


I believe you are referring to the output of the RDTSC instruction, which on unpatched dual-core AMD64s exhibited this specific problem. Additionally, on Pentium processors with energy-saving features, or with AMD's Cool'n'Quiet technology, the clock speed can change dynamically.
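
For reference, reading the raw cycle counter looks like this with the MSVC intrinsic (gcc has an equivalent __rdtsc in x86intrin.h); just keep in mind the caveats above about cores and dynamic clock speeds:

#include <intrin.h>   // MSVC __rdtsc intrinsic

unsigned __int64 start = __rdtsc();
// ... code to time ...
unsigned __int64 cycles = __rdtsc() - start;   // raw cycles, not wall time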

If you cannot use GetTickCount (which often has an accuracy of +/- 10ms or 55ms) to profile an optimization, then you are probably profiling something that doesn't need optimization. It's a blink of an eye, after all.

Quote:
Original post by Rockoon1
If you cannot use GetTickCount (which often has an accuracy of +/- 10ms or 55ms) to profile an optimization, then you are probably profiling something that doesn't need optimization. It's a blink of an eye, after all.

This is patently false. I've used profilers many times (DevPartner's free edition mostly, which unfortunately doesn't work with VS2005, just VS2003). Oftentimes I find out that one function that runs in 0.01 ms is being called more times than I'd imagined (by several orders of magnitude) and was the culprit.

GetTickCount wouldn't have helped me much in finding out that micro-optimizations in that function would produce a 50% or more speed increase in the entire application.

Of course, the real solution is to call the function fewer times [smile] But sometimes that's just not an option.

Guest Anonymous Poster
Use QPC, and when you notice a jump forward/backwards, use another, less accurate timing routine instead. Easy and better solution.
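
A sketch of that idea (the 100 ms disagreement threshold and the helper name are just illustrative):

#include <windows.h>
#include <cmath>

// Compare the QPC delta against the GetTickCount delta; if QPC has
// leapt, fall back to the coarser but steadier tick count.
double ElapsedMs(LARGE_INTEGER qpcStart, DWORD tickStart)
{
    LARGE_INTEGER freq, qpcNow;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&qpcNow);

    double qpcMs  = 1000.0 * (qpcNow.QuadPart - qpcStart.QuadPart) / (double)freq.QuadPart;
    double tickMs = (double)(GetTickCount() - tickStart);

    // Distrust QPC if it disagrees with the tick count by more than 100 ms.
    return (std::fabs(qpcMs - tickMs) > 100.0) ? tickMs : qpcMs;
}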

Quote:
Original post by BeanDog
This is patently false. I've used profilers many times (DevPartner's free edition mostly, which unfortunately doesn't work with VS2005, just VS2003). Oftentimes I find out that one function that runs in 0.01 ms is being called more times than I'd imagined (by several orders of magnitude) and was the culprit.


If your profiler is inserting timing calls into every function, then you aren't timing anything resembling production code. That kind of obsessive observation can, and often does, affect the results.

The kind of profiling you are describing is best done with an interrupt-driven approach that simply interrupts your thread at regular intervals and sees where the instruction pointer is, collecting statistics... You were looking for a potential optimization target, correct? What did the absolute time have to do with it?
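
A conceptual sketch of that approach on Windows (x86-specific because of CONTEXT::Eip; the histogram type is just illustrative): a watcher thread repeatedly suspends the target thread, tallies its instruction pointer, and resumes it. Hot code shows up as addresses with many hits.

#include <windows.h>
#include <map>

void SampleThread(HANDLE target, std::map<DWORD, int>& hits, int samples)
{
    for (int i = 0; i < samples; ++i) {
        if (SuspendThread(target) == (DWORD)-1)
            break;
        CONTEXT ctx;
        ctx.ContextFlags = CONTEXT_CONTROL;
        if (GetThreadContext(target, &ctx))
            ++hits[ctx.Eip];        // tally this instruction-pointer address
        ResumeThread(target);
        Sleep(1);                   // roughly 1000 samples per second
    }
}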

AMD's CodeAnalyst does precisely this. I highly recommend it for this purpose. Intel's alternative probably does something similar, although I have never used it.

Timing, IMHO, is best reserved for comparing alternatives.

I've been doing some research based on what you all have said; it's helped me a lot in understanding profiling. Thank you.

To me, it seems that it's not JUST how long any particular piece of code takes, but also how many times that code is called. If you can get the two (time, count), you simply multiply them together and figure out what percentage that is of the total execution time. Of course, it would likely be important to take into account sleep time (like in a GUI program waiting for input).

While adding in the overhead wouldn't make it quite like the intended running environment, if it is applied universally, the difference shouldn't be noticeable in terms of the percentages.

I imagine something like this could be compiled in using a construct similar to the way the assert() macro can be compiled out.

The AMD tool mentioned, if indeed it takes outside samples of the code's instruction pointer... because it samples at a constant rate, you are, in a sense, indirectly getting the same percentage. Code that runs more often, as a percentage of execution time, will have its instruction pointer recognized more often, in a near 1:1 ratio. Because it doesn't involve any real timing systems, it would be much more robust against thread pauses and other OS-related things. If you accidentally alt-tab (or, in the Linux case, another user on the system starts a heavy load), the timing wouldn't be representative of your process and would incorporate hidden and unrelated variables. Whereas in the CodeAnalyst case, the percentage is how many times the sampler caught you inside the code versus how many samples it took, and that doesn't really care about absolute time.
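
A sketch of that assert()-style idea (the macro names are made up): with PROFILE_BUILD defined, each block accumulates elapsed time and a call count; without it, both macros compile away to nothing, just like assert() under NDEBUG.

#include <windows.h>

#ifdef PROFILE_BUILD
#define PROFILE_BEGIN(name) \
    static DWORD profTotal_##name = 0, profCount_##name = 0; \
    DWORD profStart_##name = GetTickCount();
#define PROFILE_END(name) \
    profTotal_##name += GetTickCount() - profStart_##name; \
    ++profCount_##name;
#else
#define PROFILE_BEGIN(name)
#define PROFILE_END(name)
#endif

// Usage (both macros must be in the same scope):
//   PROFILE_BEGIN(update);
//   UpdateWorld();   // hypothetical expensive call
//   PROFILE_END(update);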

Indeed. But I must stress that adding timing code to all your functions is not going to have a consistent impact on each function. It can affect function inlining, branch prediction, and L1 data cache efficiency, just to name a few. It will also, guaranteed, affect L1 code/trace cache efficiency.

Intel's profiler is called VTune, and it does basically the same stuff that AMD's CodeAnalyst does. If you are looking for hotspots to put on the table for optimization consideration, then there really is no substitute for VTune or CodeAnalyst. These tools can also often tell you why a portion of the code is surprisingly slow (cache misses, branch mispredictions, etc.).

If you are looking to compare the performance of alternative algorithms, then a pair of GetTickCount calls wrapped around a major portion of the program is almost always an acceptable method. Simple and easy with games: just look at the FPS!
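
For example, a minimal FPS counter, called once per frame (purely illustrative):

#include <windows.h>
#include <cstdio>

void CountFrame()
{
    static DWORD lastReport = GetTickCount();
    static int frames = 0;

    ++frames;
    DWORD now = GetTickCount();
    if (now - lastReport >= 1000) {          // report about once a second
        printf("FPS: %d\n", frames * 1000 / (int)(now - lastReport));
        frames = 0;
        lastReport = now;
    }
}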
