So, I've stumbled upon my first serious bug in years. As usual, my knowledge of C++ has tricked me into a false sense of safety, and I wanted to explain what happened.
My program executes a function call on a given object twice, with some additional work in-between the two calls. The call looks like this:
Kernel CUDA_Module::GetKernel(const std::string & name) const
{
// The call which causes the error is "implGetKernel"
return this->implGetKernel(name);
}
// The function above is called by:
Kernel Module::GetKernel(const std::string & name) const
{
// 'this->module' is a boost::shared_ptr<iModule>
// and 'CUDA_Module' inherits from 'iModule'
return this->module->GetKernel(name);
}
The first call works correctly, but the second results in an error:
Unhandled exception at 0x00e2fb83 in CUtest.exe: 0xC0000005: Access violation writing location 0xbd55d0c2.
The first scary thing about the error above is that the this pointer in the call:
- Is at address 0x00e2fb80
- Points to address 0x00e2fb80
By examining the assembly at the call point (not the violation point), I see the typical explore-vtable-and-call sequence Visual C++ does:
00403769 mov edx,dword ptr [this]
0040376C mov eax,dword ptr [edx]
0040376E mov ecx,dword ptr [this]
00403771 mov edx,dword ptr [eax]
00403773 call edx
Which, obviously, results in the program executing the vtable and dying a loud death because the instructions don't make sense. So, I seek the reason for the modification using a breakpoint (because, no matter how you look at it, the px member of a boost::shared_ptr should not be altered to equal its own address), and I suspect an overflow or similar random memory bombing. A memory breakpoint later, the modification appears to occur, deterministically (praise debuggers) in the following code:
0041BB23 call operator new (44FE80h)
Oops. After half an hour of puzzling (and noticing that the breakpoint gets triggered about once per function) it finally occurs to me that the the memory for the Module might have been deallocated. The precise details of this elude me, because the Module is a non-pointer, non-reference member variable of a stack-based instance of another class, which just so happens to currently be alive and well on the stack for the entire duration of the program.
Until I notice a small detail:
// What the stack-based instance actually calls is:
Native::Kernel Program::GetKernel(const CUlib::Extra::Symbol & s) const
{
return module.kernel[symbols][dimensions];
}
<span class="cpp-comment">// Where Module::kernel is a syntactic sugar implemented as:</span>
<span class="cpp-keyword">class</span> Module
{
<span class="cpp-keyword">public</span>:
<span class="cpp-keyword">const</span> <span class="cpp-keyword">class</span> KernelGetter
{
<span class="cpp-keyword">const</span> Module & m;
KernelGetter(<span class="cpp-keyword">const</span> Module & m) : m(m) {}
<span class="cpp-keyword">friend</span> <span class="cpp-keyword">class</span> Module;
<span class="cpp-keyword">public</span>:
Kernel <span class="cpp-keyword">operator</span>[](<span class="cpp-keyword">const</span> std::string & name) <span class="cpp-keyword">const</span>
{
<span class="cpp-keyword">return</span> m.GetKernel(name);
}
} kernel;
Module(<span class="cpp-keyword">const</span> boost::shared_ptr<iModule> & module) :
module(module), symbol(*<span class="cpp-keyword">this</span>), kernel(*<span class="cpp-keyword">this</span>) {}
<span class="cpp-comment">// other members, but no copy constructor.</span>
}
</pre></div><!–ENDSCRIPT–>
The error now seems obvious: a first Module is instantiated on the stack somewhere, and its <tt>KernelGetter</tt> gets a reference to it. Then, a copy is made in the constructor of the <tt>Program</tt>, but the <tT>KernelGetter</tt> keeps a reference to the old one, and results in undefined behaviour once the old one is wiped off the stack.
The reason for the error? I expected the const-reference-member of the <tt>KernelGetter</tt> to prevent the creation of a default copy constructor, and thus prevent copies (warning me when I would try to make a copy, so I can decide whether I must implement copy construction or forbid it). The compiler did, however, create a copy constructor (as could have been expected if I has actually used my brain for a second there), which proved to be semantically invalid.
Lessons learned:
<ul><li>Always explicitly disable constructors and assignment operators when you don't need them. Especially when using non-RAII members, and especially when using an exotic pattern.
<li>Do not feel safe just because you're using references instead of pointers.
<li>Learn how to read a stack trace correctly. The problem was clearly in there all along, and merely looking at what became of the <tt>this</tt> pointer along the call would have been a sufficient hint.
<li>Don't think that, just because the first call works and the second crashes, the first call is correct. In this case, it just didn't have the time to be corrupted as well, though it could have.</ul>
So far, so good. Back to work!