StackWalk64 and x86

8 comments, last by Jan Wassenberg 16 years, 3 months ago
Hi all,

I discovered today that the reason my debug builds were so slow is that my memory manager obtains a stack trace for every allocation at the time the allocation is made, rather than only for allocations that leak. So, to speed things up, I've been trying to modify my code to just store the current CONTEXT when an allocation is made, and then walk the stack when reporting leaks. However, StackWalk64() seems to be walking the stack as it is at the time of the StackWalk64() call, not as it was when the CONTEXT was captured.

According to the documentation, the CONTEXT parameter is not required on x86, which leads me to think that on x86 it's ignored and StackWalk64() always gets the current context and walks that, which is a problem for me. I can't walk the stack at allocation time (to get a stack trace), because that involves looking up debug symbols, which is the slow part, and I can't just grab the top of the stack and dump that, because it'll always be inside my memory manager, making the output pretty useless.

I have an x64 build of my app, but I'm unable to test it just now (no 64-bit machine to test it on). I'll give it a go tomorrow and see if the problem exists in an x64 build (which I doubt).

Does anyone know if this is the case, and StackWalk64() grabs the current context on x86? And is there any way around this?

Cheers,
Steve
In a nutshell, a CONTEXT only captures the state of the CPU, which means it contains only minimal information about the stack. Unless you use the CONTEXT to perform a stack walk then and there, it becomes useless, since the state of the stack will change as soon as any functions are called or returned from.

I've never done this, but in theory you could perform a stack walk and only store the program counter for each stack frame. This doesn't require looking up the debug symbol information, and the addresses should still be good to obtain the relevant symbol information later.
Why not just pass the function and line number of the caller into the allocation routine? You can easily set up a macro to do it using __LINE__ and __FUNCTION__.
Doing a stack trace seems a rather complicated way to do things.
The slow bit tends to be resolving the addresses into function names. As long as you aren't doing that on the call to new it should be relatively quick.

If you need some example code just look at VLD.
Quote:According to the Documentation, the CONTEXT parameter is not required on x86, which leads me to think that on x86 it's ignored and it'll always get the current context and then stack walk that, which is a problem for me.

As a side note, it's definitely not always ignored on ia32, as I've seen results differ according to whether or not registers are correctly set.

Quote:I can't just grab the top of the stack and dump that, because it'll always be inside my memory manager, making the output pretty useless.

Since you know the number of frames between your StackWalk and the calling function, you can just skip that amount.

Quote:I've never done this, but in theory you could perform a stack walk and only store the program counter for each stack frame. This doesn't require looking up the debug symbol information, and the addresses should still be good to obtain the relevant symbol information later.

Yep, that works well :)
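As a rough illustration of the two points above (store only raw addresses, and skip a known number of frames), the sketch below uses `CaptureStackBackTrace`, which is a different Windows API from the `StackWalk64` discussed in this thread; it takes a frames-to-skip count directly and does no symbol lookup. `RawStack`, `CaptureRawStack`, `kMaxFrames` and `kSkipFrames` are hypothetical names, and the skip count of 1 is an assumption that has to be checked per build, since inlining changes it. On non-Windows platforms the sketch simply records no frames.

```cpp
#include <cstddef>
#if defined(_WIN32)
#include <windows.h>
#endif

// Hypothetical limits: how many frames to keep, and how many of the
// innermost frames (the capture function itself) to drop.
const unsigned kMaxFrames = 16;
const unsigned kSkipFrames = 1;

// Raw return addresses only; names are resolved later, at report time.
struct RawStack {
    void* pc[kMaxFrames];
    unsigned count;
};

// Fills 'out' with up to kMaxFrames return addresses of the callers.
void CaptureRawStack(RawStack& out) {
    out.count = 0;
    for (unsigned i = 0; i < kMaxFrames; ++i) {
        out.pc[i] = nullptr;
    }
#if defined(_WIN32)
    // Returns the number of frames actually captured.
    out.count = CaptureStackBackTrace(kSkipFrames, kMaxFrames,
                                      out.pc, nullptr);
#endif
}
```

Calling `CaptureRawStack` once per allocation and keeping the `RawStack` in the allocation header leaves all symbol resolution for the moment a leak is actually reported.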

Quote:Why not just pass the function and line number of the caller into the allocation routine? You can easily set up a macro to do it using __LINE__ and __FUNCTION__.

That's fine until you get sick and tired of wrapping each instance of placement new in ugly #include "nommgr.h" / #include "mmgr.h". It also requires more work if you need std::nothrow_t.
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
Thanks for the replies. I've tried the code on x64, and it fails in the same way as on x86 (although after reading SiCrane's reply, that makes sense).

Quote:Original post by SiCrane
In a nutshell, a CONTEXT only captures the state of the CPU, which means it contains only minimal information about the stack. Unless you use the CONTEXT to perform a stack walk then and there, it becomes useless, since the state of the stack will change as soon as any functions are called or returned from.

I've never done this, but in theory you could perform a stack walk and only store the program counter for each stack frame. This doesn't require looking up the debug symbol information, and the addresses should still be good to obtain the relevant symbol information later.
Ah, good point; I thought that it'd capture the whole stack, but I suppose that would be overkill...
I'll try doing the stack walk and storing the top 10 frames or something (Well, just EIP / the PC).

Quote:Original post by Jan Wassenberg
Quote:I can't just grab the top of the stack and dump that, because it'll always be inside my memory manager, making the output pretty useless.

Since you know the number of frames between your StackWalk and the calling function, you can just skip that amount.
I use my memory manager in a release build sometimes too (Well, release build + debug symbols), so the number of functions from the original caller varies, due to inlining.

Quote:Original post by Jan Wassenberg
Quote:Why not just pass the function and line number of the caller into the allocation routine? You can easily set up a macro to do it using __LINE__ and __FUNCTION__.

That's fine until you get sick and tired of wrapping each instance of placement new in ugly #include "nommgr.h" / #include "mmgr.h". It also requires more work if you need std::nothrow_t.
Yup. I used to use mmgr, and had all sorts of issues like this. I like the ability to just drop a header and source file into my app and have complete memory manager functionality.

In my old code, I walked the stack, resolving function names until I hit a function name that didn't start with "PMemory::" or "operator new", and then assumed that was the calling function, which worked fine. I'll try walking the stack and storing (up to) the top 10 frames or so, then resolve them at leak time, and let you know how that works.

Cheers,
Steve
When I had this same problem I just stored the offset, then resolved the names later. This was actually really fast. I used a DEBUG_WALK_DEPTH macro and a DEBUG_WALK_SKIP macro to define how deep to store and how many frames to skip before storing the stack. In my program it's DEBUG_WALK_DEPTH = 10 and DEBUG_WALK_SKIP = 2. I only created support for x86 tracing, but it would not be hard to add 64-bit support. Maybe this can help:

void CCallStack::WalkCallStack(DebugDataStruct *debugStruct)
{
   if (debugStruct == NULL)
      return;

   CONTEXT context;

   // Grab the current context (state of the EBP, EIP and ESP registers)
   memset(&context, 0, sizeof(CONTEXT));
   context.ContextFlags = CONTEXT_ALL;
   _asm {
         call x
      x: pop eax
         mov context.Eip, eax
         mov context.Ebp, ebp
         mov context.Esp, esp
   }
   //RtlCaptureContext(&context);

   STACKFRAME64 stackFrame;
   memset(&stackFrame, 0, sizeof(STACKFRAME64));

   // Stack frame must be set based on architecture
#ifdef _M_IX86
   stackFrame.AddrPC.Offset = context.Eip;
   stackFrame.AddrPC.Mode = AddrModeFlat;
   stackFrame.AddrFrame.Offset = context.Ebp;
   stackFrame.AddrFrame.Mode = AddrModeFlat;
   stackFrame.AddrStack.Offset = context.Esp;
   stackFrame.AddrStack.Mode = AddrModeFlat;
#else
   #error "Platform not supported!"
#endif

   debugStruct->stackCount = 0;
   HANDLE hThread = GetCurrentThread();

   for (int frameNum = 0; frameNum < (DEBUG_WALK_DEPTH + DEBUG_WALK_SKIP); ++frameNum)
   {
      if (!StackWalk64(IMAGE_FILE_MACHINE_I386, m_hProcess, hThread, &stackFrame, &context,
                       CCallStack::ReadMemoryRoutine, SymFunctionTableAccess64, SymGetModuleBase64, NULL))
         break;

      if (stackFrame.AddrPC.Offset == stackFrame.AddrReturn.Offset)
         break;

      // A zero program counter marks the end of the valid call stack
      if (stackFrame.AddrPC.Offset == 0)
         break;

      // Valid call stack frame: store it once the skipped frames are behind us
      if (frameNum >= DEBUG_WALK_SKIP)
      {
         debugStruct->stackOffset[debugStruct->stackCount] = stackFrame.AddrPC.Offset;
         debugStruct->stackCount++;
      }
   }
}

Using stackFrame.AddrPC.Offset you can later resolve the symbols during output.
Quote:Original post by Evil Steve
Quote:I've never done this, but in theory you could perform a stack walk and only store the program counter for each stack frame. This doesn't require looking up the debug symbol information, and the addresses should still be good to obtain the relevant symbol information later.
Ah, good point; I thought that it'd capture the whole stack, but I suppose that would be overkill...
I'll try doing the stack walk and storing the top 10 frames or something (Well, just EIP / the PC).
That seems to be what DevPartner's Error Checking does. Though it's a configurable number of frames. That manages to not slow it down much, so it should work well for you.
"In order to understand recursion, you must first understand recursion."
Well, just saving the program counter seems to work great. Loading a BSP file (which makes about 6000 allocations, mostly STL ones) originally took more than 30 seconds, and now it takes about 2 seconds with the code that walks the stack but does not resolve function names. Without the stack walking at all, it takes about half a second.

If anyone is interested in the code I have:
// Allocation struct (irrelevant fields removed)
struct AllocHeader
{
#ifdef USE_STACKTRACE
    static const size_t cnMaxStackFrames = 16;
    size_t nPC[cnMaxStackFrames];
#endif
};

// Headers and libs:
#ifdef USE_STACKTRACE
    #include <dbghelp.h>
    #pragma comment(lib,"dbghelp.lib")
#endif // USE_STACKTRACE

// Memory manager init time (from constructor):
#ifdef USE_STACKTRACE
    SymInitialize(GetCurrentProcess(), NULL, TRUE);
#endif


And then the main code. RecordStackTrace is called for every allocation, and GetCallerForAllocation is called when memory leaks are detected:
void PMemory::RecordStackTrace(AllocHeader* pAllocation)
{
#ifdef USE_STACKTRACE
    // Capture context
    CONTEXT ctx;
    RtlCaptureContext(&ctx);

    // Init the stack frame for this function
    STACKFRAME64 theStackFrame;
    memset(&theStackFrame, 0, sizeof(theStackFrame));
#if defined(_M_IX86)
    DWORD dwMachineType = IMAGE_FILE_MACHINE_I386;
    theStackFrame.AddrPC.Offset = ctx.Eip;
    theStackFrame.AddrPC.Mode = AddrModeFlat;
    theStackFrame.AddrFrame.Offset = ctx.Ebp;
    theStackFrame.AddrFrame.Mode = AddrModeFlat;
    theStackFrame.AddrStack.Offset = ctx.Esp;
    theStackFrame.AddrStack.Mode = AddrModeFlat;
#elif defined(_M_X64)
    DWORD dwMachineType = IMAGE_FILE_MACHINE_AMD64;
    theStackFrame.AddrPC.Offset = ctx.Rip;
    theStackFrame.AddrPC.Mode = AddrModeFlat;
    theStackFrame.AddrFrame.Offset = ctx.Rsp;
    theStackFrame.AddrFrame.Mode = AddrModeFlat;
    theStackFrame.AddrStack.Offset = ctx.Rsp;
    theStackFrame.AddrStack.Mode = AddrModeFlat;
#elif defined(_M_IA64)
    DWORD dwMachineType = IMAGE_FILE_MACHINE_IA64;
    theStackFrame.AddrPC.Offset = ctx.StIIP;
    theStackFrame.AddrPC.Mode = AddrModeFlat;
    theStackFrame.AddrFrame.Offset = ctx.IntSp;
    theStackFrame.AddrFrame.Mode = AddrModeFlat;
    theStackFrame.AddrBStore.Offset = ctx.RsBSP;
    theStackFrame.AddrBStore.Mode = AddrModeFlat;
    theStackFrame.AddrStack.Offset = ctx.IntSp;
    theStackFrame.AddrStack.Mode = AddrModeFlat;
#else
#    error "Platform not supported!"
#endif

    // Walk up the stack, storing only the program counter of each frame
    memset(pAllocation->nPC, 0, sizeof(pAllocation->nPC));
    for(size_t i=0; i<AllocHeader::cnMaxStackFrames; ++i)
    {
        pAllocation->nPC[i] = (size_t)theStackFrame.AddrPC.Offset;
        if(!StackWalk64(dwMachineType, GetCurrentProcess(), GetCurrentThread(), &theStackFrame,
            &ctx, NULL, SymFunctionTableAccess64, SymGetModuleBase64, NULL))
        {
            break;
        }
    }
#endif
    UNREFERENCED_PARAMETER(pAllocation);
}

const char* PMemory::GetCallerForAllocation(AllocHeader* pAllocation)
{
#ifdef USE_STACKTRACE
    const size_t cnBufferSize = 512;
    char szFile[cnBufferSize];
    char szFunc[cnBufferSize];
    unsigned int nLine;
    static char szBuff[cnBufferSize*3];

    // Initialise allocation source
    strcpy(szFile, "??");
    strcpy(szFunc, "??");
    nLine = 0;

    // Resolve PC to function names
    size_t nPC = 0;
    for(size_t i=0; i<AllocHeader::cnMaxStackFrames; ++i)
    {
        // Check for end of stack walk
        nPC = pAllocation->nPC[i];
        if(nPC == 0)
            break;

        // Get function name
        unsigned char byBuffer[sizeof(IMAGEHLP_SYMBOL64) + cnBufferSize];
        IMAGEHLP_SYMBOL64* pSymbol = (IMAGEHLP_SYMBOL64*)byBuffer;
        DWORD64 dwDisplacement;
        memset(pSymbol, 0, sizeof(IMAGEHLP_SYMBOL64) + cnBufferSize);
        pSymbol->SizeOfStruct = sizeof(IMAGEHLP_SYMBOL64);
        pSymbol->MaxNameLength = cnBufferSize;
        if(!SymGetSymFromAddr64(GetCurrentProcess(), nPC, &dwDisplacement, pSymbol))
            strcpy(szFunc, "??");
        else
        {
            pSymbol->Name[cnBufferSize-1] = '\0';

            // See if we need to go further up the stack
            if(strncmp(pSymbol->Name, "PMemory::", 9) == 0)
            {
                // In PMemory, keep going...
            }
            else if(strncmp(pSymbol->Name, "operator new", 12) == 0)
            {
                // In operator new or new[], keep going...
            }
            else if(strncmp(pSymbol->Name, "std::", 5) == 0)
            {
                // In STL code, keep going...
            }
            else
            {
                // Found the allocator (or near to it)
                strcpy(szFunc, pSymbol->Name);
                break;
            }
        }
    }

    // Get file/line number
    if(nPC != 0)
    {
        IMAGEHLP_LINE64 theLine;
        DWORD dwDisplacement;
        memset(&theLine, 0, sizeof(theLine));
        theLine.SizeOfStruct = sizeof(theLine);
        if(!SymGetLineFromAddr64(GetCurrentProcess(), nPC, &dwDisplacement, &theLine))
        {
            strcpy(szFile, "??");
            nLine = 0;
        }
        else
        {
            // Strip the directory part of the path
            const char* pszFile = strrchr(theLine.FileName, '\\');
            if(!pszFile) pszFile = theLine.FileName;
            else ++pszFile;
            strncpy(szFile, pszFile, cnBufferSize);
            szFile[cnBufferSize-1] = '\0';
            nLine = theLine.LineNumber;
        }
    }

    // Format into buffer and return
    sprintf(szBuff, "%s:%u (%s)", szFile, nLine, szFunc);
    return szBuff;
#else
    UNREFERENCED_PARAMETER(pAllocation);
    return "Stack trace unavailable";
#endif // USE_STACKTRACE
}
I use RtlCaptureContext, which I know doesn't work on pre-XP systems, because I didn't want to mess around with SEH or x64 assembly (which needs to be in a separate asm file, ugh).

Thanks again,
Steve
Quote:I use my memory manager in a release build sometimes too (Well, release build + debug symbols), so the number of functions from the original caller varies, due to inlining.

Yes, but the number is either under your control (-> __declspec(noinline) or inline) or known to you (simply count them in debug/release for each compiler you support).
This is kind of hacky, but no less hacky than a max. number of stack frames (exact same problem, just now up for the library user to handle)
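To make the "under your control" option concrete, here is a small sketch of pinning the frame count with a no-inline attribute. `MM_NOINLINE`, `MemAlloc`, `RecordAndAllocate` and `kFramesToSkip` are hypothetical names, not from either memory manager in this thread. Note that an optimizer may still turn the inner call into a tail call and drop a frame, so the count should still be verified once per compiler and configuration.

```cpp
#include <cstddef>
#include <cstdlib>

// Portable spelling of "never inline this function".
#if defined(_MSC_VER)
#  define MM_NOINLINE __declspec(noinline)
#elif defined(__GNUC__)
#  define MM_NOINLINE __attribute__((noinline))
#else
#  define MM_NOINLINE
#endif

// Hypothetical: with both functions below kept out of line, a stack
// walk started inside RecordAndAllocate sees two manager frames
// (RecordAndAllocate and MemAlloc) before the real caller.
const unsigned kFramesToSkip = 2;

// Innermost manager function; the stack walk would start here.
MM_NOINLINE void* RecordAndAllocate(std::size_t size) {
    // ... capture the stack here, skipping kFramesToSkip frames ...
    return std::malloc(size);
}

// Public entry point; also kept out of line so the count stays fixed.
MM_NOINLINE void* MemAlloc(std::size_t size) {
    return RecordAndAllocate(size);
}
```

The other option mentioned above, counting the frames by hand in debug and release, needs no attribute but has to be repeated for every compiler the library supports.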

Quote:Yup. I used to use mmgr, and had all sorts of issues like this. I like the ability to just drop a header and source file into my app and have complete memory manager functionality.

Indeed. I am currently evaluating VLD, which appears very nice but causes an evil race condition with our scripting engine's /highest-priority/ GC thread *sigh*
