With D3D11 instancing, you stream the per-instance data through a second vertex buffer that is described in the same input layout as the per-vertex data. This can make the vertex structure look pretty big byte-wise (for example, four extra float4s, or 64 bytes, just for the world matrix). I've read that you want to keep vertex structures as small as possible to reduce memory bandwidth.
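To make the sizes concrete, here is a rough sketch of the kind of layout being described. The struct names and the position/normal/texcoord vertex format are assumptions chosen purely for illustration, not something D3D11 requires.

```cpp
#include <cstddef>

// Hypothetical per-vertex data, read from the first vertex buffer
// (one element per vertex).
struct Vertex {
    float position[3];
    float normal[3];
    float texcoord[2];
};

// Hypothetical per-instance data, read from the second vertex buffer
// (one element per instance): a world matrix stored as four float4 rows.
struct InstanceData {
    float world[4][4];
};

// Strides of the two streams, and the size of the combined element
// that the vertex shader sees as its input.
constexpr std::size_t kVertexStride   = sizeof(Vertex);
constexpr std::size_t kInstanceStride = sizeof(InstanceData);
constexpr std::size_t kCombinedSize   = kVertexStride + kInstanceStride;

static_assert(kInstanceStride == 4 * 4 * sizeof(float),
              "the world matrix is four float4s");
```

With 4-byte floats, the per-vertex stream is 32 bytes per element, the per-instance stream is 64 bytes per element, and the shader input totals 96 bytes.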
In terms of memory bandwidth, is a large vertex structure that results from instancing equivalent to the same large vertex structure without instancing? Or are GPUs efficient at assembling vertices from the non-instanced and instanced data, so that I don't need to worry about this? I'm assuming it is efficient, since instancing is a recommended optimization, but I wanted to check.
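One way to frame the question in numbers is to count the unique bytes each approach has to source per draw. The sketch below is only back-of-the-envelope arithmetic: the function names are made up for illustration, and it deliberately ignores vertex caches, index reuse, and however a particular GPU actually fetches per-instance data, so it says nothing definitive about real hardware behaviour.

```cpp
#include <cstdint>

// Two streams: every instance walks the shared per-vertex buffer once
// and needs one element from the per-instance buffer.
constexpr std::uint64_t instancedBytes(std::uint64_t vertexStride,
                                       std::uint64_t instanceStride,
                                       std::uint64_t vertexCount,
                                       std::uint64_t instanceCount) {
    return instanceCount * (vertexCount * vertexStride + instanceStride);
}

// Single "fat" stream: the per-instance data is copied into every
// vertex of every copy of the mesh.
constexpr std::uint64_t fatVertexBytes(std::uint64_t vertexStride,
                                       std::uint64_t instanceStride,
                                       std::uint64_t vertexCount,
                                       std::uint64_t instanceCount) {
    return instanceCount * vertexCount * (vertexStride + instanceStride);
}
```

For example, with a 32-byte vertex, a 64-byte world matrix, 1000 vertices, and 10 instances, `instancedBytes(32, 64, 1000, 10)` gives 320,640 bytes while `fatVertexBytes(32, 64, 1000, 10)` gives 960,000 bytes. Whether the hardware gets close to the first figure is exactly the open question here.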
The reason I'm a bit concerned is that I want to add generic instancing support to my engine, but I'm wondering whether there is too much overhead when the instance count is small.