[C++] Bug postmortem

Started by
5 comments, last by T1Oracle 16 years, 9 months ago
So, I've stumbled upon my first serious bug in years. As usual, my knowledge of C++ has tricked me into a false sense of safety, and I wanted to explain what happened. My program executes a function call on a given object twice, with some additional work in-between the two calls. The call looks like this:
Kernel CUDA_Module::GetKernel(const std::string & name) const
{
  // The call which causes the error is "implGetKernel"
  return this->implGetKernel(name);
}

// The function above is called by:
Kernel Module::GetKernel(const std::string & name) const
{
  // 'this->module' is a boost::shared_ptr<iModule>
  // and 'CUDA_Module' inherits from 'iModule'
  return this->module->GetKernel(name);
}
The first call works correctly, but the second results in an error:
Unhandled exception at 0x00e2fb83 in CUtest.exe: 0xC0000005: Access violation writing location 0xbd55d0c2.
The first scary thing about the error above is that the this pointer in the call:
  • Is at address 0x00e2fb80
  • Points to address 0x00e2fb80
By examining the assembly at the call point (not the violation point), I see the typical explore-vtable-and-call sequence Visual C++ does:

00403769  mov         edx,dword ptr [this] 
0040376C  mov         eax,dword ptr [edx] 
0040376E  mov         ecx,dword ptr [this] 
00403771  mov         edx,dword ptr [eax] 
00403773  call        edx  
Which, obviously, results in the program executing the vtable and dying a loud death because the instructions don't make sense. So, I seek the reason for the modification using a breakpoint (because, no matter how you look at it, the px member of a boost::shared_ptr should not be altered to equal its own address), and I suspect an overflow or similar random memory bombing. A memory breakpoint later, the modification appears to occur, deterministically (praise debuggers) in the following code:
0041BB23  call        operator new (44FE80h) 
Oops. After half an hour of puzzling (and noticing that the breakpoint gets triggered about once per function) it finally occurs to me that the the memory for the Module might have been deallocated. The precise details of this elude me, because the Module is a non-pointer, non-reference member variable of a stack-based instance of another class, which just so happens to currently be alive and well on the stack for the entire duration of the program. Until I notice a small detail:

// What the stack-based instance actually calls is:
Native::Kernel Program::GetKernel(const CUlib::Extra::Symbol & s) const
{
  return module.kernel[symbols][dimensions];
}

<span class="cpp-comment">// Where Module::kernel is a syntactic sugar implemented as:</span>
<span class="cpp-keyword">class</span> Module
{
<span class="cpp-keyword">public</span>:
  <span class="cpp-keyword">const</span> <span class="cpp-keyword">class</span> KernelGetter
  {
    <span class="cpp-keyword">const</span> Module &amp; m;
    KernelGetter(<span class="cpp-keyword">const</span> Module &amp; m) : m(m) {}
    <span class="cpp-keyword">friend</span> <span class="cpp-keyword">class</span> Module;
  <span class="cpp-keyword">public</span>: 
    Kernel <span class="cpp-keyword">operator</span>[](<span class="cpp-keyword">const</span> std::string &amp; name) <span class="cpp-keyword">const</span>
    { 
      <span class="cpp-keyword">return</span> m.GetKernel(name); 
    }
  } kernel;


  Module(<span class="cpp-keyword">const</span> boost::shared_ptr&lt;iModule&gt; &amp; module) : 
    module(module), symbol(*<span class="cpp-keyword">this</span>), kernel(*<span class="cpp-keyword">this</span>) {}

  <span class="cpp-comment">// other members, but no copy constructor.</span>
}
</pre></div><!–ENDSCRIPT–>

The error now seems obvious: a first Module is instantiated &#111;n the stack somewhere, and its <tt>KernelGetter</tt> gets a reference to it. Then, a copy is made in the constructor of the <tt>Program</tt>, but the <tT>KernelGetter</tt> keeps a reference to the old &#111;ne, and results in undefined behaviour &#111;nce the old &#111;ne is wiped off the stack.

The reason for the error? I expected the const-reference-member of the <tt>KernelGetter</tt> to prevent the creation of a default copy constructor, and thus prevent copies (warning me when I would try to make a copy, so I can decide whether I must implement copy construction or forbid it). The compiler did, however, create a copy constructor (as could have been expected if I has actually used my brain for a second there), which proved to be semantically invalid.

Lessons learned:
<ul><li>Always explicitly disable constructors and assignment operators when you don't need them. Especially when using non-RAII members, and especially when using an exotic pattern.
<li>Do not feel safe just because you're using references instead of pointers.
<li>Learn how to read a stack trace correctly. The problem was clearly in there all along, and merely looking at what became of the <tt>this</tt> pointer along the call would have been a sufficient hint.
<li>Don't think that, just because the first call works and the second crashes, the first call is correct. In this case, it just didn't have the time to be corrupted as well, though it could have.</ul>

So far, so good. Back to work!
Advertisement
Thanks for taking the time to share that.
Just thought I'd let you know that somebody read it and appreciated the time taken to post it.
I have recently been making sure I disable the copy-constructor and assignment operator of classes that I don't expect to use them. Definitely a good idea.
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
Quote:Original post by iMalc
Thanks for taking the time to share that.
Just thought I'd let you know that somebody read it and appreciated the time taken to post it.


Steve McConnell discusses in Code Complete the power of talking about your problems. He gives the example of going into someone else's office, starting to describe your problem, and realizing halfway through what the problem is.

The above worked this way as well. I realized, as I was describing the bug, that the Module was probably wiped off the stack before being used, which was enough to get me going about discovering the real cause [smile] I already had two-thirds of the post done, so I finished it up.
You always have very pedagogical presentations Vyk [smile]
Was an interesting read, thanks for sharing! Hopefully, I learned something and can avoid making a similar mistake =]
Best regards, Omid
Doesn't compiler report returning of a local variable? Or did it get puzzled by all this and missed it?
Quote:Original post by Antheus
Doesn't compiler report returning of a local variable? Or did it get puzzled by all this and missed it?


I believe it's too complex to notice, as I'm not returning a direct reference but rather a reference through another object.
Most of my runtime errors with boost:shared_ptr's boiled down to objects on the stack going out of scope that I did not account for. Each time I felt silly putting all those break points up only to see that a reference was passed somewhere that did have the necessary lifespan. I love boost shared_ptr's, they're such an elegant solution to sharing objects and applying reference counting.
Programming since 1995.

This topic is closed to new replies.

Advertisement