lochnesssnowman

Help with debugging question


Ok troops, anyone help me out with some advice/solutions to this question? I was asked on the spot, and even with time to think about it, it's pretty difficult. Anyhow, question...

Imagine you are a software engineer on a game team that is very close to deadline. The playtesters have uncovered a big problem - the game will occasionally and unexpectedly crash. This has been observed:

- The crash cannot be repeated at will (it appears to happen once or twice a day during playtesting)
- When viewing through the debugger, it appears that a certain structure in memory has been overwritten with some invalid data

Bearing in mind the impending deadline, what steps would you take to resolve this mysterious scribble?

Any suggestions? Cheers in advance.

P.S. - This is also my first post so hello to everybody.

Welcome to gamedev =)

First you should really try to make it reproducible. Perhaps it can be "caught" in a replay?

Then I would try to see if some advanced tools might catch it for me, like Valgrind or BoundsChecker.
If that fails I'd add a separate verification system for the class that gets invalidated and insert checkpoints into my code, i.e. run a binary search over the checkpoints until you find the problem's source.
Also, many debuggers allow you to set breakpoints that trigger when data is modified.

But there are plenty of other methods around. Just don't be tempted to give up; there's a systematic approach to solving almost any bug.
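
To make the "separate verification system plus checkpoints" idea a bit more concrete, here is a minimal C++ sketch. `GameState`, `StateGuard` and the choice of an FNV-1a checksum are all assumptions for illustration; nothing here comes from the original question, and any hash over the raw bytes would do. The idea is to call `commit()` after every write you intended and `intact()` at each checkpoint. The first checkpoint that fails tells you the scribble happened since the previous one, and you can then bisect that stretch of code with more checkpoints.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the structure that keeps getting trashed.
struct GameState {
    std::int32_t health;
    std::int32_t ammo;
    float        position[3];
};

// FNV-1a hash over the raw bytes of the structure.
inline std::uint32_t checksum(const GameState& s) {
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&s);
    std::uint32_t hash = 2166136261u;          // FNV offset basis
    for (std::size_t i = 0; i < sizeof(GameState); ++i) {
        hash ^= bytes[i];
        hash *= 16777619u;                     // FNV prime
    }
    return hash;
}

// Remembers the checksum taken after the last legitimate write.
class StateGuard {
public:
    explicit StateGuard(const GameState& s) : state_(s), expected_(checksum(s)) {}

    // Call after every write that is *supposed* to change the structure.
    void commit() { expected_ = checksum(state_); }

    // Call at checkpoints; false means something changed the bytes
    // without going through commit().
    bool intact() const { return checksum(state_) == expected_; }

private:
    const GameState& state_;
    std::uint32_t    expected_;
};
```

One caveat with this approach: if the real structure contains padding bytes, their contents are unspecified and can make the checksum change spuriously, so you may need to hash the members individually instead of the raw bytes.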

Find all the code that edits that structure and put breakpoints on it. If the data gets edited very often, try something like this after each edit instead.

I'm using pseudocode, by the way:

if (!is_valid(data)) {
    breakpoint_here();   // put your breakpoint on this line
}

Then when you know where the problem is, it should be easily fixed, and you can remove that test :)
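
A slightly fuller version of that check could look like the sketch below. The `Player` structure, its valid ranges and the `CHECK_PLAYER` macro are invented for the example; the real check would test whatever invariants your own structure is supposed to hold. The macro records the file and line of the call site, so a failure in a log tells you which check tripped.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical structure with made-up valid ranges.
struct Player {
    int   health;      // expected 0..100
    int   weaponSlot;  // expected 0..3
    float speed;       // expected 0..1000 (a NaN also fails the test below)
};

// True when every member holds a value that makes sense.
inline bool isValid(const Player& p) {
    return p.health >= 0 && p.health <= 100
        && p.weaponSlot >= 0 && p.weaponSlot <= 3
        && p.speed >= 0.0f && p.speed < 1000.0f;
}

// Returns true if the structure is sane; otherwise reports the call site.
inline bool checkPlayer(const Player& p, const char* file, int line) {
    if (isValid(p)) {
        return true;
    }
    std::fprintf(stderr, "%s(%d): Player corrupted (health=%d, slot=%d)\n",
                 file, line, p.health, p.weaponSlot);
    return false;   // a convenient line for a breakpoint
}

#define CHECK_PLAYER(p) checkPlayer((p), __FILE__, __LINE__)
```

Sprinkle `CHECK_PLAYER(player);` after each piece of code that edits the structure, and strip the calls out again once the culprit is found.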

Sounds like a pretty horrible bug. My advice: start panicking. No, actually, some debuggers can alert you whenever a certain data structure is written to, which might be useful. It sounds like some code is writing out of bounds somewhere, so double-check any dodgy code that comes to mind.

Heh, most professional developers have a war story or two about similar scenarios. It's often worse with console development too (I remember a nasty GameCube bug I had that would only ever happen on standalone test/debug units and never on the devkit regardless of whether it was set to retail or development mode - divide and conquer debugging and single step dry running isn't fun). It's things like this that cause you to be in the office until 1am the night before alpha.


This sounds very much like an "interview" question rather than a specific problem you're trying to fix right now; for this reason I'll give a more complete and correct answer rather than tell you about the evil bodges you can do to make the crash go away without fixing the real cause [evil].


With hindsight (a.k.a. experience) you'll have done things to prevent the problem in the first place and put things into the code base to make the problem easier to track down:

1) sensible use of the language, asserts, unit testing, regression testing, using BoundsChecker, getting rid of warnings etc. - they're the kind of good habits to get into which do genuinely catch most problems!

2) where possible, some sort of recording system for the test sessions can be invaluable - half a day spent implementing something like that can save a few days' worth of debugging in those cases where the problem is caused by a very specific combination of game events.

3) assuming a PC or Xbox version of the game, the game will have been set up with the ability to automatically produce full minidumps from the test machine, the source code for that release would have been archived, and you'd be running a symbol server with symbols for each release. That way, as soon as there was a crash, the whole call stack and relevant source version could be remotely debugged.


Assuming I had the crashed game stopped in the debugger and had no easy way of reproducing the steps to cause the crash (i.e. no recording system), I'd probably go about things the following way:

0) Check the (hopefully detailed) reports from all the testers who found a crash in the same piece of code to see if I spot any patterns. Hopefully there are enough testers that a few are testing each level, and all levels are being tested simultaneously. Does it always happen with a particular level? Does it always happen when a certain type of object/character is on screen? Does it always happen when they do something which starts a particular type of effect? The testers might not notice patterns individually, but correlating reports can often reveal obvious areas of investigation.

1) Re-compile a debug build of that code with maximum warnings and maximum compiler provided checks.

2) Determine what type of crash it was and look at the code down the callstack to make sure it wasn't actually a problem with the code which was meant to use that structure.

3) Check up the callstack. If there are any functions in there which are allowed to change the structure, check that they're doing what they should and have been passed valid data.

4) Look at the structure that's been scribbled on in the memory viewer. Which parts of the structure have been overwritten, at what address, and what have they been overwritten by?

5) If you find text in the scribble, do a search in the codebase for that text and dry run any code using that text. Check for sub-patterns and search through any relevant data files which use text.

6) If there's non-text data in the scribble, check for any obvious patterns or sub-patterns, particularly for any memory fill patterns used by your memory allocator (0xCDCDCDCD, 0xCCCCCCCC, 0xFEFEFEFE, etc). Those won't lead to the exact cause, but will give you some more clues about the cause.

7) If the structure is global/static, look in a MAP file (or similar) to see which few structures immediately precede and follow the structure being trashed. Then look in the memory viewer and watch window at those structures: have they been trashed? Have any been released/deallocated? Do any relate to what the tester was doing at the time of the crash? Does every element have a value which makes sense? Things like bugs in linked list code are pretty common causes of random trashes of neighbouring structures that weren't trapped by noMansLand checks by the allocator.

8) If the structure that's being trashed is dynamically allocated, check with the allocator to see who owns the memory above and below the block that's been trashed. Do the same tests as you would with a global structure. This kind of debugging is also a good reason for your allocator to not recycle freed memory immediately and a good reason to have trace/context information stored with debug allocations.

9) If the structure is on the stack (automatic variable), once again do the same checking of neighbouring structures - though you should have spotted any problems in the current function during your callstack walk.

10) If you have multiple crash dumps of the same problem, "diff" the values in the structure to see if the trash is random or always the same. Random trashes are much rarer (thankfully, they're sometimes swines to catch).
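
Step 6 in the list above (looking for allocator fill patterns in the scribble) is easy to automate once you have the trashed bytes in a buffer. The sketch below is only an illustration: the table holds the fill values used by the MSVC debug runtime and the Win32 heap as far as I know them, which is an assumption about your toolchain, so swap in whatever values your own allocator writes. `FillPattern` and `findFillPattern` are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct FillPattern {
    std::uint32_t value;
    const char*   meaning;
};

// Assumed values: MSVC debug runtime / Win32 heap conventions.
// Other allocators and platforms use different fills.
static const FillPattern kPatterns[] = {
    {0xCDCDCDCDu, "MSVC debug heap: allocated, never initialised"},
    {0xDDDDDDDDu, "MSVC debug heap: freed block"},
    {0xFDFDFDFDu, "MSVC debug heap: no-man's-land guard bytes"},
    {0xCCCCCCCCu, "MSVC /RTC: uninitialised stack memory"},
    {0xFEEEFEEEu, "Win32 heap: freed block"},
};

// Scans the buffer in 4-byte steps from its start and returns a description
// of the first known fill pattern found, or nullptr if there is none.
inline const char* findFillPattern(const void* data, std::size_t size) {
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    for (std::size_t offset = 0; offset + 4 <= size; offset += 4) {
        std::uint32_t word;
        std::memcpy(&word, bytes + offset, sizeof word);  // avoids alignment issues
        for (const FillPattern& p : kPatterns) {
            if (word == p.value) {
                return p.meaning;
            }
        }
    }
    return nullptr;
}
```

A hit on a "freed block" pattern points towards a use-after-free, while guard bytes in the middle of your structure suggest a neighbouring allocation was copied over it. Either way it is a clue, not a diagnosis.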


Depending on the nature of the crash and the code, there isn't too much more you can do in terms of pure "in place"/non-invasive debugging of a currently crashed build. It's then probably time to start with some "invasive" debugging. This will unfortunately mean changing/adding code and restarting the test process:

1) if your game doesn't already have some record and playback facility to help track down problems, see if a simple one can be cobbled together. It might help get things reproducible in better time.


2) if this is the highest priority "class A" bug, get as much of the test department as possible to start trying to reproduce it (and find patterns) - preferably with a recording system in place. IMO this is one advantage of having some testers on your team, and others in a different timezone.


3) if you're on a platform/CPU which offers "memory page protection", you could ensure the structure ends up on a page boundary and write a "protect"/"unprotect" function to change the page attributes of the structure. Then any function which is allowed to write to the structure calls unprotect(), does its business and calls protect() again.

With any luck, next time the code is tested, as soon as something scribbles over the structure you'll get a memory protection exception at the exact point where the scribble happens!


4a) if you don't have hardware page protection facilities, make a macro/function of the kind _Phalanx_ describes to check the validity of the structure. If the scribble is always the same, it can sometimes be better to make the function check for the data being equal to the scribble data rather than checking for validity of the members of the structure.

4b) in your highest level main loop, in between each call to a subsystem of the game, put a call to your structure checker in. Make it log pass or failure to a file.

4c) e.g.:

void mainloop()
{
    CHECK_STRUCTURE_AND_LOG("start of frame");
    player.update();
    CHECK_STRUCTURE_AND_LOG("after player.update");
    physics.update();
    CHECK_STRUCTURE_AND_LOG("after physics.update");
    // ... and so on for each remaining subsystem
}


4d) Test until you get a crash, and look through the log file to see which sub-system the scribble was in.

4e) Once you've found the faulty sub-system, repeat the above procedure for that sub-system (i.e. more macros into the code and re-test).

4f) This divide-and-conquer method tends to be surprisingly fast at catching the culprit.
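
For the page protection idea in item 3 above, here is a rough sketch of what the protect()/unprotect() pair could look like. It assumes a POSIX system (`mmap`, `mprotect`, `sysconf`); on Windows the same idea would be built on VirtualAlloc and VirtualProtect instead. `Inventory` and `ProtectedInventory` are hypothetical names, and the structure is assumed to fit in one page.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical structure that is being scribbled on.
struct Inventory {
    int gold;
    int items[8];
};

// Keeps the structure alone on its own page(s), read-only by default.
class ProtectedInventory {
public:
    ProtectedInventory() {
        size_ = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
        void* mem = mmap(nullptr, size_, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        assert(mem != MAP_FAILED);
        data_ = new (mem) Inventory();   // value-initialised to zero
        (void)protect();
    }
    ~ProtectedInventory() { munmap(data_, size_); }

    ProtectedInventory(const ProtectedInventory&) = delete;
    ProtectedInventory& operator=(const ProtectedInventory&) = delete;

    // Make the page read-only: any write now faults at the writing instruction.
    bool protect()   { return mprotect(data_, size_, PROT_READ) == 0; }
    // Make the page writable again for code that is allowed to modify it.
    bool unprotect() { return mprotect(data_, size_, PROT_READ | PROT_WRITE) == 0; }

    const Inventory& read() const { return *data_; }
    // Only write through this between unprotect() and protect().
    Inventory& writable() { return *data_; }

private:
    Inventory*  data_;
    std::size_t size_;
};
```

Legitimate writers bracket their changes with `unprotect()` and `protect()`. Anything else that writes to the page gets a SIGSEGV on the spot, so the debugger stops with the guilty code on top of the call stack. Changing page attributes is not free, so this is something to compile into test builds only.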

Guest Anonymous Poster
If you are using a language that can allocate memory or objects on the heap (I'm going by my knowledge of C/C++), look for any use of an object after it has been freed. I think S1CA implicitly referred to this kind of error.

Because of the way memory allocators work, you can't predict where a new object will be placed. So your old object may keep working without problems even though it has been freed, and only at some later time be overwritten with other data. That's because free doesn't zero the memory or otherwise alter it, and the system usually doesn't validate the objects you refer to.
It's been a problem on an earlier project of mine.

Excuse any spelling or grammar errors; English isn't my native language. Good luck finding your bug.
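
One way to make this kind of use-after-free show itself is a debug allocator that poisons freed blocks and holds on to them for a while instead of recycling them straight away, as S1CA also suggested. The sketch below is a toy version under that assumption; `QuarantineAllocator`, the 0xDD poison byte and the fixed-size quarantine are all choices made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <deque>

// Debug allocator that delays the real free() and fills released blocks
// with a poison byte so stale reads and writes become visible.
class QuarantineAllocator {
public:
    explicit QuarantineAllocator(std::size_t maxHeld) : maxHeld_(maxHeld) {}
    ~QuarantineAllocator() {
        for (const Block& b : held_) std::free(b.ptr);
    }
    QuarantineAllocator(const QuarantineAllocator&) = delete;
    QuarantineAllocator& operator=(const QuarantineAllocator&) = delete;

    void* allocate(std::size_t size) { return std::malloc(size); }

    // Poison the block and park it; only the oldest parked block is
    // actually handed back to the heap once the quarantine is full.
    void release(void* ptr, std::size_t size) {
        if (!ptr) return;
        std::memset(ptr, kPoison, size);
        held_.push_back(Block{ptr, size});
        if (held_.size() > maxHeld_) {
            std::free(held_.front().ptr);
            held_.pop_front();
        }
    }

    // False if any parked block no longer contains pure poison, i.e.
    // somebody wrote through a pointer to memory they had already released.
    bool quarantineIntact() const {
        for (const Block& b : held_) {
            const unsigned char* bytes = static_cast<const unsigned char*>(b.ptr);
            for (std::size_t i = 0; i < b.size; ++i) {
                if (bytes[i] != kPoison) return false;
            }
        }
        return true;
    }

private:
    struct Block { void* ptr; std::size_t size; };
    static constexpr unsigned char kPoison = 0xDD;
    std::size_t       maxHeld_;
    std::deque<Block> held_;
};
```

Calling `quarantineIntact()` once per frame catches stale writes, and stale reads tend to show up as obviously bogus 0xDDDDDDDD values. Stock debug heaps and tools like Valgrind do a far more thorough job of this, so a home-made version mostly makes sense on platforms where those aren't available.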

Easiest is to use a product like BoundsChecker. They have a 14-day free trial, and considering your time constraints that might be your best bet. You should check what your development environment already provides as well. As an example, C++Builder has CodeGuard.

The most likely cause is using a pointer after what it points to has been released. The easiest way to catch that is to zero your pointers when you deallocate things. Then when you try to access the deallocated structure you crash instead of modifying some structure you didn't intend to.
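
In C++ the "zero your pointers" habit can be wrapped in a tiny helper, sketched below. `destroyAndNull` and `Enemy` are made-up names for the example.

```cpp
#include <cassert>

// Hypothetical game object.
struct Enemy {
    int health = 10;
};

// Deletes the object and nulls the caller's pointer in one step, so a later
// access through that pointer faults instead of silently touching freed memory.
template <typename T>
void destroyAndNull(T*& ptr) {
    delete ptr;      // deleting a null pointer is a harmless no-op
    ptr = nullptr;
}
```

Bear in mind that this only clears the one pointer you pass in. Any copies of the pointer held elsewhere still dangle, so it reduces the problem rather than eliminating it.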

Hello. Just want to say thanks much for the replies. S1CA is right concerning this being an interview question :)

Thanks again. I'll definitely be around with more questions (and advice/help if I can) in future.

Guest Anonymous Poster
Memory managers are a godsend in those cases: bounds checks, the works. Problems like that usually arise from an array overflow, or from objects that have been deleted while references to them are still hanging about the place and haven't been cleared. Also, formatted strings are very good candidates for memory stomping and overflows.
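
On the formatted string point, the usual defence is to format through `snprintf` with the real buffer size and to check the return value for truncation, rather than using unbounded `sprintf`. A small sketch, where `formatScoreLabel` and the label format are invented for the example:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Formats a HUD label into a caller-supplied buffer without ever writing
// past outSize bytes. Returns true if the whole text fitted, false if it
// had to be truncated (the buffer is still null-terminated in that case).
inline bool formatScoreLabel(char* out, std::size_t outSize,
                             const char* playerName, int score) {
    const int needed = std::snprintf(out, outSize, "%s: %d pts", playerName, score);
    return needed >= 0 && static_cast<std::size_t>(needed) < outSize;
}
```

With `sprintf` an over-long player name would have run off the end of the buffer and scribbled over whatever happened to live next to it, which is exactly the kind of bug being discussed here.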

The callstack may not be helpful, since that memory corruption can happen any time and not be detected for a while, but you never know.

So it should not be too difficult to track with a memory manager. Check what was allocated just before the structure in memory. Check which objects reference the structure, and what was allocated just before each of those objects.

Also, things like "breakpoint on data change" or whatever it's called should be the first thing to try. And do regular sanity checks on the data, make sure it's not corrupted or being corrupted. If you have a replay system, then that should be of great help too. Basically, put bug traps all over the place, and assert, assert, assert....

Personally, I don't have too many of those, since our engine uses handles and GUIDs 99% of the time and we do bounds checking every time, so memory corruption is rare.
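
For anyone wondering what "handles instead of pointers" might look like, below is one common way to do it: a pool where each slot carries a generation counter. This is only a sketch of the general technique, not the poster's actual engine code; `Handle` and `HandlePool` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A handle is an index plus the generation the slot had when it was issued.
struct Handle {
    std::uint32_t index;
    std::uint32_t generation;
};

template <typename T>
class HandlePool {
public:
    Handle create(const T& value) {
        std::uint32_t index;
        if (!freeList_.empty()) {            // reuse a released slot if possible
            index = freeList_.back();
            freeList_.pop_back();
        } else {
            index = static_cast<std::uint32_t>(slots_.size());
            slots_.push_back(Slot{});
        }
        slots_[index].value = value;
        slots_[index].alive = true;
        return Handle{index, slots_[index].generation};
    }

    void destroy(Handle h) {
        if (!get(h)) return;                 // stale or invalid handle: ignore
        slots_[h.index].alive = false;
        ++slots_[h.index].generation;        // invalidates all outstanding handles
        freeList_.push_back(h.index);
    }

    // Bounds-checked and generation-checked lookup.
    // Returns nullptr for out-of-range or stale handles.
    T* get(Handle h) {
        if (h.index >= slots_.size()) return nullptr;
        Slot& slot = slots_[h.index];
        if (!slot.alive || slot.generation != h.generation) return nullptr;
        return &slot.value;
    }

private:
    struct Slot {
        T             value{};
        std::uint32_t generation = 0;
        bool          alive = false;
    };
    std::vector<Slot>          slots_;
    std::vector<std::uint32_t> freeList_;
};
```

A stale handle simply resolves to `nullptr` (which you can assert on) instead of pointing into memory that now belongs to something else. Note that pointers returned by `get()` should not be kept around, since they can be invalidated when the pool grows.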
