A shared_ptr contains two members: a pointer to the object and a reference counter object. The reference counter object contains a pointer to the shared count implementation. The implementation object is polymorphic and contains, as members, another copy of the pointer to the object, a copy of the deleter object, and two counters: one for shared_ptr references and one for weak_ptr references. As a polymorphic object, it also has a vtable pointer.
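To make that layout concrete, here is a rough sketch of it. All of the names (counted_base, counted_impl, shared_count, shared_ptr_layout, release) are made up for illustration and are not the library's actual source. I'm also assuming plain long counters with no thread safety, which a real implementation would not get away with.

```cpp
#include <cassert>

// Hypothetical polymorphic control block: a vtable pointer plus two counters.
struct counted_base {
    long use_count = 1;   // number of shared_ptr references
    long weak_count = 1;  // weak_ptr references, plus one while use_count > 0
    virtual ~counted_base() = default;
    virtual void dispose() = 0;  // destroy the managed object
};

// Hypothetical implementation object, allocated on the heap per managed object.
template <class T, class D>
struct counted_impl : counted_base {
    T* ptr;  // second copy of the pointer to the object
    D del;   // copy of the deleter object (may carry state)
    counted_impl(T* p, D d) : ptr(p), del(d) {}
    void dispose() override { del(ptr); }
};

// The reference counter object just wraps a pointer to the control block.
struct shared_count {
    counted_base* pi;
};

// The two members of the smart pointer itself.
template <class T>
struct shared_ptr_layout {
    T* px;            // pointer to the object
    shared_count pn;  // reference counter object
};

// Drop one shared reference (single-threaded sketch only).
inline void release(counted_base* pi) {
    assert(pi->use_count > 0);
    if (--pi->use_count == 0) {
        pi->dispose();                         // virtual call runs the deleter
        if (--pi->weak_count == 0) delete pi;  // no weak_ptr left: free the block
    }
}
```

The point to notice is that everything in counted_impl, including the vtable pointer and the deleter copy, lives in the heap allocation.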
My current experiment differs from shared_ptr by making the shared count object non-polymorphic. Instead of containing a copy of the deleter object, it contains a pointer to a statically allocated polymorphic object that contains the deleter object. So, at the cost of some flexibility (no longer being able to have deleter objects with state), the object allocated on the heap gets a little smaller. When I tried it out, it seemed to go from 20 bytes to 16 bytes. While 4 bytes isn't a big deal in itself, it does bring the size down to a power of two, which makes it suitable for use with a binary buddy suballocator.
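A minimal sketch of that idea follows. It is not my actual experiment code: the names (deleter_base, static_deleter, deleter_for, plain_count, make_count, release) are invented for illustration, the counters are plain long with no thread safety, and the deleter type is assumed to be default-constructible.

```cpp
#include <cassert>
#include <type_traits>

// Polymorphic interface; instances live in static storage, never on the heap.
struct deleter_base {
    virtual void destroy(void* p) = 0;
protected:
    ~deleter_base() = default;  // never deleted through the base pointer
};

// One statically allocated object per (T, D) pair, holding the deleter.
template <class T, class D>
struct static_deleter final : deleter_base {
    D del{};  // assumed default-constructible, so no per-pointer state
    void destroy(void* p) override { del(static_cast<T*>(p)); }
};

template <class T, class D>
deleter_base& deleter_for() {
    static static_deleter<T, D> instance;  // created once, on first use
    return instance;
}

// Non-polymorphic shared count object: this is what goes on the heap.
struct plain_count {
    void* ptr;          // copy of the pointer to the object
    deleter_base* del;  // points at the static deleter holder
    long use_count;     // shared references
    long weak_count;    // weak references, plus one while use_count > 0
};
static_assert(!std::is_polymorphic<plain_count>::value,
              "the count object must not carry a vtable pointer");

template <class T, class D>
plain_count* make_count(T* p) {
    return new plain_count{p, &deleter_for<T, D>(), 1, 1};
}

// Drop one shared reference (single-threaded sketch only).
inline void release(plain_count* pc) {
    assert(pc->use_count > 0);
    if (--pc->use_count == 0) {
        pc->del->destroy(pc->ptr);             // virtual call on the static object
        if (--pc->weak_count == 0) delete pc;  // no weak references left
    }
}
```

In this sketch, with 4-byte pointers and a 4-byte long, the four fields of plain_count come to 16 bytes. The deleter still gets called through a virtual function, but the vtable pointer and the deleter object now sit in a static object shared by every pointer of that type rather than in each heap block.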