Sign in to follow this  

[C++] Bug postmortem

This topic is 3808 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

So, I've stumbled upon my first serious bug in years. As usual, my knowledge of C++ has tricked me into a false sense of safety, and I wanted to explain what happened. My program executes a function call on a given object twice, with some additional work in-between the two calls. The call looks like this:
Kernel CUDA_Module::GetKernel(const std::string & name) const
{
  // The call which causes the error is "implGetKernel"
  return this->implGetKernel(name);
}

// The function above is called by:
Kernel Module::GetKernel(const std::string & name) const
{
  // 'this->module' is a boost::shared_ptr<iModule>
  // and 'CUDA_Module' inherits from 'iModule'
  return this->module->GetKernel(name);
}
The first call works correctly, but the second results in an error:
Unhandled exception at 0x00e2fb83 in CUtest.exe: 0xC0000005: Access violation writing location 0xbd55d0c2.
The first scary thing about the error above is that the this pointer in the call:
  • Is at address 0x00e2fb80
  • Points to address 0x00e2fb80
By examining the assembly at the call point (not the violation point), I see the typical explore-vtable-and-call sequence Visual C++ does:
00403769  mov         edx,dword ptr [this] 
0040376C  mov         eax,dword ptr [edx] 
0040376E  mov         ecx,dword ptr [this] 
00403771  mov         edx,dword ptr [eax] 
00403773  call        edx  
Which, obviously, results in the program executing the vtable and dying a loud death because the instructions don't make sense. So, I seek the reason for the modification using a breakpoint (because, no matter how you look at it, the px member of a boost::shared_ptr should not be altered to equal its own address), and I suspect an overflow or similar random memory bombing. A memory breakpoint later, the modification appears to occur, deterministically (praise debuggers) in the following code:
0041BB23  call        operator new (44FE80h) 
Oops. After half an hour of puzzling (and noticing that the breakpoint gets triggered about once per function) it finally occurs to me that the the memory for the Module might have been deallocated. The precise details of this elude me, because the Module is a non-pointer, non-reference member variable of a stack-based instance of another class, which just so happens to currently be alive and well on the stack for the entire duration of the program. Until I notice a small detail:
// What the stack-based instance actually calls is:
Native::Kernel Program::GetKernel(const CUlib::Extra::Symbol & s) const
{
  return module.kernel[symbols[s]][dimensions];
}

// Where Module::kernel is a syntactic sugar implemented as:
class Module
{
public:
  const class KernelGetter
  {
    const Module & m;
    KernelGetter(const Module & m) : m(m) {}
    friend class Module;
  public: 
    Kernel operator[](const std::string & name) const
    { 
      return m.GetKernel(name); 
    }
  } kernel;


  Module(const boost::shared_ptr<iModule> & module) : 
    module(module), symbol(*this), kernel(*this) {}

  // other members, but no copy constructor.
}
The error now seems obvious: a first Module is instantiated on the stack somewhere, and its KernelGetter gets a reference to it. Then, a copy is made in the constructor of the Program, but the KernelGetter keeps a reference to the old one, and results in undefined behaviour once the old one is wiped off the stack. The reason for the error? I expected the const-reference-member of the KernelGetter to prevent the creation of a default copy constructor, and thus prevent copies (warning me when I would try to make a copy, so I can decide whether I must implement copy construction or forbid it). The compiler did, however, create a copy constructor (as could have been expected if I has actually used my brain for a second there), which proved to be semantically invalid. Lessons learned:
  • Always explicitly disable constructors and assignment operators when you don't need them. Especially when using non-RAII members, and especially when using an exotic pattern.
  • Do not feel safe just because you're using references instead of pointers.
  • Learn how to read a stack trace correctly. The problem was clearly in there all along, and merely looking at what became of the this pointer along the call would have been a sufficient hint.
  • Don't think that, just because the first call works and the second crashes, the first call is correct. In this case, it just didn't have the time to be corrupted as well, though it could have.
So far, so good. Back to work!

Share this post


Link to post
Share on other sites
Thanks for taking the time to share that.
Just thought I'd let you know that somebody read it and appreciated the time taken to post it.
I have recently been making sure I disable the copy-constructor and assignment operator of classes that I don't expect to use them. Definitely a good idea.

Share this post


Link to post
Share on other sites
Quote:
Original post by iMalc
Thanks for taking the time to share that.
Just thought I'd let you know that somebody read it and appreciated the time taken to post it.


Steve McConnell discusses in Code Complete the power of talking about your problems. He gives the example of going into someone else's office, starting to describe your problem, and realizing halfway through what the problem is.

The above worked this way as well. I realized, as I was describing the bug, that the Module was probably wiped off the stack before being used, which was enough to get me going about discovering the real cause [smile] I already had two-thirds of the post done, so I finished it up.

Share this post


Link to post
Share on other sites
Quote:
Original post by Antheus
Doesn't compiler report returning of a local variable? Or did it get puzzled by all this and missed it?


I believe it's too complex to notice, as I'm not returning a direct reference but rather a reference through another object.

Share this post


Link to post
Share on other sites
Most of my runtime errors with boost:shared_ptr's boiled down to objects on the stack going out of scope that I did not account for. Each time I felt silly putting all those break points up only to see that a reference was passed somewhere that did have the necessary lifespan. I love boost shared_ptr's, they're such an elegant solution to sharing objects and applying reference counting.

Share this post


Link to post
Share on other sites

This topic is 3808 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this