Stackwalking across DLL boundaries?

3 comments, last by moeron 18 years, 2 months ago
OK, let me lay out some background. I've been getting bug reports from beta testers that some user data is not being retained in files and mysteriously disappears from time to time. The real problem is that I cannot reproduce this on my own machine, so I need to get some kind of debug information from my beta testers.

Since this user data is always held by boost::shared_ptrs, I've written my own deleter that tries to walk the stack to figure out where the call to delete is coming from. This is all based on the article located here. My stack walker appears to be working OK; it has trouble with function names, but whatever, at least I have the function addresses. The thing is, it doesn't seem to walk across DLL boundaries. In my debugger I can see a couple of function calls from outside my module, but my stack walker's output stops at the last function inside my DLL.

Has anyone encountered problems like this before (bugs on other people's machines that you can't see), or is there a better technique I should be using to squash a bug I cannot reproduce? The stack walker is about the only thing I can think of right now, but there must be some other way to track this damn thing...

Thanks for any ideas,
moe.ron
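For readers who have not written a custom deleter before, here is a minimal sketch of the idea described in the post above. It uses std::shared_ptr as a stand-in for boost::shared_ptr, and the names UserData, log_deletion_site, TracingDeleter and make_traced are hypothetical, not taken from the original code. The logging hook only counts and prints; in the real program this is where the stack walker would run.

```cpp
#include <cstdio>
#include <memory>

// Hypothetical stand-in for whatever user data the shared_ptr owns.
struct UserData {
    int value;
};

// Counts logged deletions so the effect of the deleter is observable.
static int g_deletions_logged = 0;

// Hypothetical hook: a real build would capture a stack trace here
// and append it to a log file that beta testers can send back.
inline void log_deletion_site(const void* object) {
    ++g_deletions_logged;
    std::fprintf(stderr, "deleting UserData at %p\n", object);
}

// Custom deleter: record that the delete happened, then really delete.
struct TracingDeleter {
    void operator()(UserData* p) const {
        log_deletion_site(p);
        delete p;
    }
};

// Every UserData created through this helper is deleted via TracingDeleter.
inline std::shared_ptr<UserData> make_traced(int value) {
    return std::shared_ptr<UserData>(new UserData{value}, TracingDeleter{});
}
```

The deleter runs only when the last owner releases the object, so the log shows who dropped the final reference, not every place a copy of the pointer went out of scope.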
What a coincidence, I ran into a problem with our stack walker yesterday at work. Here's the problem:

Stack walking by following EBP chains only works if the "Omit frame pointers" optimization is OFF. You're in control of that for YOUR code, but not for third party .LIBs and .DLLs, which may have omitted stack frame pointers.
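To make "following EBP chains" concrete, here is a small sketch that walks a simulated chain of frames. It assumes the usual 32-bit x86 layout in which [ebp] holds the caller's saved EBP and [ebp+4] holds the return address; the Frame struct and walk_frame_chain are illustrative names, and a real walker would start from the live EBP register instead of a hand-built chain.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Layout left behind by a function that sets up a frame pointer:
// [ebp] = caller's saved EBP, [ebp+4] = return address.
struct Frame {
    const Frame* caller_frame;      // saved EBP
    std::uintptr_t return_address;  // where the function returns to
};

// Follow the saved-EBP chain and collect return addresses.
// Stops at a null link, after max_frames entries, or when the next frame
// is not at a higher address than the current one. The stack grows
// downward, so callers must live higher up; anything else means the
// chain is broken.
inline std::vector<std::uintptr_t> walk_frame_chain(const Frame* frame,
                                                    std::size_t max_frames) {
    std::vector<std::uintptr_t> trace;
    while (frame != nullptr && trace.size() < max_frames) {
        trace.push_back(frame->return_address);
        const Frame* next = frame->caller_frame;
        if (next != nullptr &&
            reinterpret_cast<std::uintptr_t>(next) <=
                reinterpret_cast<std::uintptr_t>(frame)) {
            break;  // chain went backwards: treat it as corrupt
        }
        frame = next;
    }
    return trace;
}
```

When a module was built with frame pointers omitted, the slot the walker treats as the saved EBP holds an unrelated value, which is why a walk like this tends to stop or go wrong at that module's boundary.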


Here's the rest of the story about stack frame stuff that may or may not help:

The assembly code that often occurs at the beginning of a function looks like this:
push ebp
mov ebp, esp

This is only used so that later in the function, local variables and function arguments can be found at a constant offset from EBP:
mov eax, [ebp-0Ch]

This was necessary because you might push or pop stuff onto the stack at any point in your function, which moves ESP. So it would be hard to figure out where ESP was in relation to local variables and arguments.

However, compilers are smarter than that. When they're putting the assembly code together, they can arrange it into groups called "basic blocks". Basic blocks are clumps of instructions that have a single entry instruction and execute straight through from there (jumps and calls are always placed at the end of basic blocks).

It turns out that when you do stack analysis on basic blocks, you'll find that a given block always changes the stack pointer by the same amount between its start and end. The compiler therefore tracks the stack offset at the entrance and exit of every block in a function, and enforces that when a block is reachable from multiple predecessors, they all leave the stack at the same place (you will often see things like "add esp, 0Ch" immediately before a basic block ends). At that point, it knows where ESP points at ALL times, and can replace
mov eax, [ebp-0Ch]

with
mov eax, [esp+114h]

or something along those lines. Notice that EBP is no longer relied on. This means that the compiler can use it as a temporary register, and doesn't need to swap around EAX, EDX, ECX and EBX as often!
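As a worked example of that rewrite, the arithmetic can be sketched as below. It assumes, purely for illustration, that ESP currently sits 120h bytes below the address EBP would have held; the function name is hypothetical. Because locals live above ESP, the resulting ESP-relative offset is positive.

```cpp
#include <cstdint>

// Convert an EBP-relative offset into an ESP-relative one.
// esp_below_ebp is how many bytes ESP currently sits below the address
// EBP would have held; the compiler knows this at every instruction.
constexpr std::int32_t esp_relative_offset(std::int32_t ebp_offset,
                                           std::int32_t esp_below_ebp) {
    return ebp_offset + esp_below_ebp;
}

// [ebp-0Ch] with ESP 120h below EBP is the same location as [esp+114h].
static_assert(esp_relative_offset(-0x0C, 0x120) == 0x114,
              "ebp-0Ch equals esp+114h when esp = ebp-120h");
```

After one more 4-byte push the same local would be at [esp+118h], which is why the compiler has to track the offset instruction by instruction.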

This is exactly what goes on in the compiler when you enable "omit stack frame pointers". This also essentially eliminates the chances of doing a call stack walk with EBP chaining because the compiler doesn't write the prolog
push ebp // This is what you need for a good stack walker!
mov ebp, esp


The alternate method is to crawl down the stack 4 bytes at a time (or 8 in 64 bit mode), finding any pointer that lies on a memory region that's marked as executable. This is easy, but very prone to error (what if someone uses a function pointer as an argument or a local variable?) The good news is that false positives like this are acceptable - you can just keep walking down the stack. Sure, you'll get stuff that you didn't want, but you WILL get all the real return addresses as well. Since everything on the stack is aligned, and the probability of encountering a value that isn't a return address and within the range of your code pages is very small, this actually works really well on fully optimized code without stack frames.
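The scan described above can be sketched as follows. This version works on a copied or simulated block of stack words and a list of executable address ranges that the caller supplies; AddressRange, points_into_code and scan_for_code_pointers are illustrative names, and how the ranges are gathered on a real system is left out here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Half-open range [begin, end) of memory known to be executable.
struct AddressRange {
    std::uintptr_t begin;
    std::uintptr_t end;
};

// True if value falls inside any of the executable ranges.
inline bool points_into_code(std::uintptr_t value,
                             const std::vector<AddressRange>& ranges) {
    for (const AddressRange& r : ranges) {
        if (value >= r.begin && value < r.end) return true;
    }
    return false;
}

// Walk the stack one pointer-sized word at a time and keep every word
// that looks like a pointer into executable memory. The result contains
// the real return addresses plus some false positives.
inline std::vector<std::uintptr_t> scan_for_code_pointers(
        const std::uintptr_t* stack_words, std::size_t word_count,
        const std::vector<AddressRange>& executable_ranges) {
    std::vector<std::uintptr_t> candidates;
    for (std::size_t i = 0; i < word_count; ++i) {
        if (points_into_code(stack_words[i], executable_ranges)) {
            candidates.push_back(stack_words[i]);
        }
    }
    return candidates;
}
```

Because the ranges can cover every loaded module and not just your own, this approach does not care whether a caller lives in another DLL or was built without frame pointers.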

You can do further analysis on the results if you know the starting location of all of your functions: Since a return address on the stack is always the instruction *after* the CALL instruction, and a function will almost never *end* in a CALL instruction (a call to a function that never returns is the rare exception), the return addresses will essentially never match function start addresses. All function pointers passed as arguments or used as local variables *WILL* match function start addresses. So then your problem boils down to keeping track of executable regions of memory at runtime and keeping track of function start addresses offline when someone sends you a call stack.
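The offline filtering step might look something like this sketch. It assumes you already have a list of function start addresses (for example, extracted from a map file), and the name drop_function_starts is hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Remove every candidate that exactly matches a known function start.
// Such values are function pointers stored on the stack, not return
// addresses. function_starts is taken by value so it can be sorted.
inline std::vector<std::uintptr_t> drop_function_starts(
        const std::vector<std::uintptr_t>& candidates,
        std::vector<std::uintptr_t> function_starts) {
    std::sort(function_starts.begin(), function_starts.end());
    std::vector<std::uintptr_t> likely_returns;
    for (std::uintptr_t c : candidates) {
        if (!std::binary_search(function_starts.begin(),
                                function_starts.end(), c)) {
            likely_returns.push_back(c);
        }
    }
    return likely_returns;
}
```

The order of the surviving candidates is preserved, so the output still reads from innermost to outermost caller.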

I know this was probably overkill, but it might help.

[Edited by - Nypyren on January 28, 2006 1:43:46 AM]
Thanks for the response, I am curious though what you mean by this
Quote: finding any pointer that lies on a memory region that's marked as executable

How would I determine something like that?
moe.ron
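On Windows, use VirtualQuery or VirtualQueryEx. Examine the Protect member of the returned MEMORY_BASIC_INFORMATION structure, which holds the region's current protection (AllocationProtect only records the protection the region was originally allocated with).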
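A minimal sketch of that check follows. The flag test is written against local copies of the PAGE_EXECUTE* values so it compiles on any platform, and the actual VirtualQuery call is only built on Windows. This sketch reads the Protect member, which reports the region's current protection; the helper names are illustrative.

```cpp
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Values of the Windows PAGE_EXECUTE* constants, repeated under local
// names so the flag test does not depend on the Windows headers.
constexpr std::uint32_t kPageExecute          = 0x10;
constexpr std::uint32_t kPageExecuteRead      = 0x20;
constexpr std::uint32_t kPageExecuteReadWrite = 0x40;
constexpr std::uint32_t kPageExecuteWriteCopy = 0x80;

// True if a protection value has any of the executable bits set.
// Modifier bits such as PAGE_GUARD (0x100) are ignored.
constexpr bool protection_is_executable(std::uint32_t protect) {
    return (protect & (kPageExecute | kPageExecuteRead |
                       kPageExecuteReadWrite | kPageExecuteWriteCopy)) != 0;
}

#ifdef _WIN32
// Ask the OS about the region containing address and report whether it
// is committed, executable memory.
inline bool address_is_executable(const void* address) {
    MEMORY_BASIC_INFORMATION info;
    if (VirtualQuery(address, &info, sizeof(info)) == 0) {
        return false;  // query failed, e.g. a kernel-mode address
    }
    return info.State == MEM_COMMIT && protection_is_executable(info.Protect);
}
#endif
```

Calling VirtualQuery for every word on the stack is slow, so it is probably worth caching the regions (BaseAddress and RegionSize) it returns.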
"I thought what I'd do was, I'd pretend I was one of those deaf-mutes." - the Laughing Man
What is a foolproof way to get my current location on the stack? I'm not too advanced with assembly, so I was originally thinking I could use the EIP register, but apparently I can't read EIP in MSVC 6.0 inline asm. So is there any register I can use in inline asm to figure out where I am on the stack? If not, do I have to write an actual asm file or something to be able to do this?

[edit]
Is ESP what I'm looking for?

Edit2
yup it was =) Thanks for the help guys

[Edited by - moeron on January 31, 2006 10:24:49 AM]
moe.ron
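For anyone landing here with the same question, reading ESP could look roughly like the sketch below. The inline-asm branch assumes a 32-bit MSVC build; on other compilers the sketch falls back to the address of a local variable, which is only an approximation of the current stack position. The function name is hypothetical, and the code is written for a modern compiler (MSVC 6.0 does not ship <cstdint>).

```cpp
#include <cstdint>

// Return (approximately) where the stack pointer is right now.
inline std::uintptr_t approximate_stack_position() {
#if defined(_MSC_VER) && defined(_M_IX86)
    // 32-bit MSVC: copy ESP straight into a local variable.
    std::uintptr_t stack_ptr = 0;
    __asm mov stack_ptr, esp
    return stack_ptr;
#else
    // Fallback: a local lives in this function's frame, so its address
    // is close to the current stack pointer.
    volatile char marker = 0;
    return reinterpret_cast<std::uintptr_t>(&marker);
#endif
}
```

Either value is a reasonable starting point for scanning upward through the stack toward the callers.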

