Smart Pointers Aren't Always So Smart

Hieroglyph 3 D3D11

The title may seem more controversial than I really intend, but I think it fits the situation that I recently encountered. I was going through each of the sample applications in Hieroglyph 3 after applying a user-supplied patch, and I noticed that the frame rate of the MirrorMirror sample had dropped off significantly. This particular sample is designed to make multi-threaded rendering in Direct3D 11 pay off by rendering three reflective spheres surrounded by a large number of simple objects. Here is a screenshot of the sample to give you a visual:

[Image: screenshot of the MirrorMirror sample]

Since each reflective sphere requires the scene to be rendered for its paraboloid map, the sample effectively renders the scene a total of four times - once for each sphere (both paraboloid maps are generated simultaneously) and then once to render the final scene. Thus if I specify 200 objects floating around the spheres, you end up performing approximately 800 draw calls over four rendering passes. This is effectively the best use case for parallel rendering - the workloads are more or less evenly distributed over four threads, and there is a corresponding speed boost when using multi-threaded as opposed to single-threaded rendering.

However, when I ran my test I found that the frame rate in debug mode had dropped from somewhere around 70 to about 9. After going back through my source code tree, I found that performance was as expected before some recent changes to how I handle the input assembler stage's state. This seemed really strange, since the new state management should have been more or less equivalent to the old method.

To investigate further, I stepped through the drawing operation with the debugger and immediately found the issue. I had changed to using a standalone object to represent all of the input assembler state within the engine, including all of the available vertex buffer slots. To set up the situation, there are a maximum of D3D11_IA_VERTEX_INPUT_RESOURCE_SLOT_COUNT slots in the IA, which is currently defined as 32... I typically reference resources in the engine with a smart pointer to a proxy object. The proxy object contains the indices of a resource, plus any resource views that it would need. The proxy is used by applications as a very easy way to reference several pieces of data as one, and has overall worked out very well.

To get to the point, I was declaring an input assembler state object on the stack in each draw call, which initialized the state object to have null pointers for all of those 32 vertex buffer slots. Even though I was only using a single vertex buffer, I was still initializing all the other slots to null, which amounts to 32 assignments of the smart pointer per draw call. With a little math, we can see what was going on: 32 slots x 4 rendering passes x 200 objects = 25,600 smart pointer assignments per frame.
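The arithmetic above can be checked with a small sketch. Everything here is a hypothetical stand-in: `CountingPtr` is a toy wrapper that only counts assignments, and `PerDrawInputAssemblerState` merely imitates the shape of the state object described in the post, since Hieroglyph 3's real classes are not shown.

```cpp
#include <array>
#include <cstddef>

// All names below are hypothetical stand-ins, not Hieroglyph 3's real types.
constexpr std::size_t kSlotCount   = 32;   // D3D11_IA_VERTEX_INPUT_RESOURCE_SLOT_COUNT
constexpr std::size_t kPassCount   = 4;    // three sphere passes + the final scene
constexpr std::size_t kObjectCount = 200;  // objects drawn in each pass

std::size_t g_pointerAssignments = 0;      // running tally of pointer assignments

// Toy pointer wrapper that only counts how often it is assigned.
struct CountingPtr {
    const void* target = nullptr;
    CountingPtr& operator=(std::nullptr_t) {
        target = nullptr;
        ++g_pointerAssignments;
        return *this;
    }
};

// State object that nulls every slot when constructed, as described above.
struct PerDrawInputAssemblerState {
    std::array<CountingPtr, kSlotCount> vertexBuffers;
    PerDrawInputAssemblerState() {
        for (CountingPtr& slot : vertexBuffers) slot = nullptr;
    }
};

// Builds one state object per draw call and returns the assignment tally.
std::size_t countAssignmentsPerFrame() {
    g_pointerAssignments = 0;
    for (std::size_t pass = 0; pass < kPassCount; ++pass) {
        for (std::size_t object = 0; object < kObjectCount; ++object) {
            PerDrawInputAssemblerState state;  // lives on the stack for one draw
            (void)state;
        }
    }
    return g_pointerAssignments;
}
```

The count depends only on the slot count, the pass count, and the object count, so it grows linearly with scene size even when a single vertex buffer is actually bound.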

In the end, I simply switched the stored state to directly use the index of the vertex buffer (since the input assembler doesn't use resource views, there is no need to use the proxy object anyway). Just a few short changes popped the speed right back up to where it should have been. So the moral of the story is this - smart pointers are only as smart as the person using them, and sometimes (especially in my case) they end up being not so smart :)
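A minimal sketch of what index-based state might look like follows. The names `IndexedInputAssemblerState`, `kVertexBufferSlots`, and `kNoBuffer` are assumptions made for illustration, as is the use of -1 to mark an empty slot; the post does not say how Hieroglyph 3 actually stores or flags the indices.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical names; the engine's real state class is not shown in the post.
constexpr std::size_t kVertexBufferSlots = 32;  // D3D11_IA_VERTEX_INPUT_RESOURCE_SLOT_COUNT
constexpr int kNoBuffer = -1;                   // assumed marker for an unused slot

struct IndexedInputAssemblerState {
    std::array<int, kVertexBufferSlots> vertexBuffers;

    // Filling 32 plain ints involves no reference counting at all.
    IndexedInputAssemblerState() { vertexBuffers.fill(kNoBuffer); }

    void setVertexBuffer(std::size_t slot, int resourceIndex) {
        assert(slot < kVertexBufferSlots);
        vertexBuffers[slot] = resourceIndex;
    }

    int vertexBuffer(std::size_t slot) const {
        assert(slot < kVertexBufferSlots);
        return vertexBuffers[slot];
    }
};

// Plain indices keep the state object cheap to copy between cached states.
static_assert(std::is_trivially_copyable<IndexedInputAssemblerState>::value,
              "state should be copyable as raw bytes");
```

Because the struct holds only integers, constructing it on the stack in every draw call costs a small fill rather than 32 smart pointer operations.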

Anyways, this has opened my eyes to some state management issues with Hieroglyph 3, which I am now working on. My goal is to reduce the number of API calls to as few as possible, all the while properly managing the cached pipeline state with respect to multithreaded rendering... This will be the subject of my next post!

Nov 28 2011 01:40 PM
Heheh, in my game AdventureFar, I used smart pointers for handling chunks of the world. The speed wasn't fast enough, and profiling showed Boost's smart pointers as part of the problem. I now use regular pointers where speed is an issue, and smart pointers in less-critical areas.
In my case, it was the creation of the smart pointers (boost::make_shared()) that was part of the slowdown.

Smart pointers are very good, and I use them a lot in my code, but they shouldn't just be used with a "replace every pointer with a smart pointer!" line of thought. Like everything in C++, used properly, they are of great benefit. Used improperly, your head asplodes.
Nov 28 2011 02:54 PM
I completely agree - they are a great way to simplify and secure high level code. However, as soon as you dip into anything that is high frequency then you are really playing with fire. In my case, only the high level objects should have been using the smart pointers, it was just my mistake to use them on something at a lower level.

At least I can go to sleep tonight and know that I wasn't the only one to make the same mistake :)
Nov 28 2011 05:14 PM
Also, it is important to keep in mind a bit of the structuring of your application. shared_ptr can be used to forget about responsibility. The Google C++ guidelines advise using them rarely, the reason being that no one has clear ownership of the data. They advise using a manager pattern instead. One can also resort to a kind of manager that issues weak_ptr instead of ID objects... Well, in any case, shared_ptr everywhere is just a bad Java (because at least Java is quite clever about circular connections, floating graph components, etc.; C++ isn't).
But don't get me wrong, I love shared_ptr and I use them... where it makes sense.
Nov 30 2011 11:16 AM
The main reason shared_ptr has a performance impact is that the standard specifies it as thread-safe (the smart pointer only, not the object inside, so for example reset() is not thread-safe) and so it has to use atomic instructions to update the reference count--which imply memory barriers that will ruin any compiler and CPU reordering optimization of instructions around the atomic operation.
Nov 30 2011 01:15 PM
That's good to know - I didn't realize that atomics were used in the smart pointer implementation. That would really explain a lot, and is a good thing to keep in mind in the future. Thanks for "pointing" it out :)
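As a rough illustration of where those reference count updates come from, the sketch below compares passing a `std::shared_ptr` by value with passing it by const reference. `Mesh` and both function names are hypothetical. Only the by-value call copies the pointer and so bumps the count; whether that update is an atomic instruction depends on the standard library implementation and build settings.

```cpp
#include <memory>

// Hypothetical resource type used only for this illustration.
struct Mesh {
    int id = 0;
};

// Taking the pointer by value copies it: the reference count is
// incremented on entry and decremented again on exit.
long useCountWhenPassedByValue(std::shared_ptr<Mesh> mesh) {
    return mesh.use_count();
}

// Taking it by const reference makes no copy, so the count is untouched.
long useCountWhenPassedByReference(const std::shared_ptr<Mesh>& mesh) {
    return mesh.use_count();
}
```

In a hot path such as a per-draw-call state setup, preferring the by-reference form (or a raw pointer or index) avoids paying for the count update on every call.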
Nov 30 2011 01:25 PM
IIRC another reason shared_ptr can be slow is that a heap alloc is used for the ref counter. It's a one-time hit, but it can be non-trivial to work out when it will occur unless your design is tight.

(Edit: it's actually one time per ref counted class instance, of course)

Check out Alexandrescu's Loki library and "Modern C++ Design" book for an in depth look at designing smart pointers, and a very customisable implementation which has template parameters controlling thread-safety, intrusiveness, etc. I've never used Loki, but Alexandrescu's discussion of the issues has definitely helped me when choosing a smart pointer for a particular situation.
Nov 30 2011 01:40 PM

The main reason shared_ptr has a performance impact is that the standard specifies it as thread-safe (the smart pointer only, not the object inside, so for example reset() is not thread-safe) and so it has to use atomic instructions to update the reference count--which imply memory barriers that will ruin any compiler and CPU reordering optimization of instructions around the atomic operation.

Interesting. I wonder how practical it would be to create a non-thread-safe version from Boost's source for programs that aren't multi-threaded; or better yet, use #ifdef to enable or disable thread-safety depending on your application's needs.
Nov 30 2011 03:06 PM
I try to use shared_ptr as little as possible, but to avoid the extra heap allocation you can use make_shared.

Most of the time I use unique_ptr (which has basically no overhead at all), or intrusive_ptr (which also has no overhead), since D3D resources have built-in reference counts.
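The allocation difference mentioned above can be made visible with a counting allocator. This is only a sketch under stated assumptions: `CountingAllocator`, `Texture`, and the two helper functions are hypothetical, and `std::allocate_shared` stands in for `std::make_shared` purely so the allocation can be counted. The standard recommends, but does not strictly require, that these functions perform a single allocation; the common implementations do.

```cpp
#include <cstddef>
#include <memory>
#include <new>

std::size_t g_allocations = 0;  // allocations made through CountingAllocator

// Minimal allocator that counts every allocation routed through it.
template <class T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() noexcept = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        ++g_allocations;
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }
};
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

struct Texture {  // hypothetical resource type
    int width = 0;
    int height = 0;
};

// Object created with plain new, control block allocated separately.
std::size_t allocationsWithNew() {
    g_allocations = 0;
    std::shared_ptr<Texture> texture(new Texture, std::default_delete<Texture>(),
                                     CountingAllocator<Texture>());
    return g_allocations + 1;  // +1 for the uncounted `new Texture`
}

// Object and control block created together in one block of memory.
std::size_t allocationsWithAllocateShared() {
    g_allocations = 0;
    auto texture = std::allocate_shared<Texture>(CountingAllocator<Texture>());
    return g_allocations;
}
```

The combined form saves one heap allocation per object, but it does nothing about the reference count updates on every copy.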
Nov 30 2011 03:29 PM
And of course there are other types of smart pointers that are a bit more lightweight, like boost::scoped_ptr. I always check whether I can use one of those before resorting to raw pointers.
Nov 30 2011 05:27 PM

The main reason shared_ptr has a performance impact is that the standard specifies it as thread-safe and so it has to use atomic instructions to update the reference count

And despite this, they're still not completely "thread safe" (terrible vague term that it is) -- the objects they point to can be owned by multiple threads, but a shared_ptr itself cannot be shared between multiple threads (if any of them have write access).
[If you've got a shared_ptr that is read/writable by two threads, and one sets it to null at the same time as another thread attempts to copy it, the reference count can reach zero (and the object deleted) just before the pointer is copied and the recently-deleted ref-counter incremented back to 1]
Dec 01 2011 10:27 PM
This has become a rant about shared_ptr, not smart ptrs in general?

We use intrusive ref counting to eliminate the extra heap allocations. Our pointer wrapper can be defined to be either thread safe or not (by default it is not).

I agree that smart pointers can cause performance issues, mostly with cache misses since you have to fetch the object (or ref counter location) just to copy the pointer address.

It would indeed be wise to use raw pointers for low level systems such as rendering. The render manager may store a single smart pointer as objects are registered. This will preserve the object while it is inside the renderer. Then, internally raw pointers can be "safely" used.

Smart pointers also often cause circular references and memory leaks. But imo, their benefits FAR outweigh their pitfalls.
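The registration idea described above might be sketched as follows. `RenderManager` and `Renderable` are hypothetical names, not classes from any engine discussed here. The manager keeps one `std::shared_ptr` per registered object to keep it alive, and the per-frame loop walks raw pointers only.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical renderable object that just counts how often it is drawn.
struct Renderable {
    int drawCalls = 0;
    void draw() { ++drawCalls; }
};

class RenderManager {
public:
    // One reference count increment per object, paid once at registration.
    void registerObject(std::shared_ptr<Renderable> object) {
        assert(object != nullptr);
        raw_.push_back(object.get());
        owners_.push_back(std::move(object));
    }

    // The hot loop touches raw pointers only: no reference count traffic.
    void renderFrame() {
        for (Renderable* renderable : raw_) renderable->draw();
    }

    std::size_t objectCount() const { return raw_.size(); }

private:
    std::vector<std::shared_ptr<Renderable>> owners_;  // keeps objects alive
    std::vector<Renderable*> raw_;                     // used while rendering
};
```

The raw pointers stay valid only while the matching entry in `owners_` exists, so a fuller version would need an unregister function that removes both entries together.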
/ __Homer__
Dec 02 2011 02:21 AM
I hate them, I hate them, I hate them. You need a smart pointer manager to manage your smart pointers. Why don't smart pointer classes implement simple code to log where they were allocated in debug builds? How stupid is it that a smart pointer leaks, somewhere, and here's the object on the heap - oh, you weren't tracing all heap allocations? Poor u! Smart is as smart does, it's a misnomer!
/ __Homer__
Dec 02 2011 02:23 AM
Oh, I forgot - here's a worthless counter representing all the places where you copy-constructed it and incremented the ref counter, now fix it.
Dec 02 2011 01:39 PM
I have to agree with bzroom - the performance issues were certainly caused by me. I simply used the wrong tool for the job, due to a lack of experience with them. However, this string of comments has been extremely informative and interesting.

I'm happy to see that there is such a huge amount of experience roaming around GDNet!
