Did you originally come from an OpenGL background, Juliean? I think the hardware abstraction layer optimizes the draw calls. You may have a bottleneck in your render loop. Don't do any updates in the render loop; do them on a separate thread instead.
No, I started out with DirectX, and only recently adopted an OpenGL 4 renderer. I gotta say I can't make much sense of what you are implying with your post right now, so if you could elaborate on that in more detail I'd be very grateful. What I was implying in my post is that
a) in order to achieve what the OP wants at the draw-call level, each draw call has to have either 1) 5 additional parameters merely for cbuffer offsets, one for each shader stage, or 2) a double array covering each shader stage and all bound buffers. This is not so much a problem on the performance end, but more for the API bloat it imposes.
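To make the bloat a bit more concrete, here is a rough sketch of what the two options could look like on a draw-call struct. All names here are hypothetical and not taken from any real engine or from the OP's code. I'm reading "double array" as a two-dimensional array (stage by buffer slot), and assuming the 14 constant buffer slots per stage that D3D11 exposes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical names throughout; this is an illustration, not a real API.
constexpr std::size_t kStageCount = 5;     // VS, HS, DS, GS, PS
constexpr std::size_t kSlotsPerStage = 14; // assumed: D3D11 cbuffer slots per stage

// Option 1: one cbuffer offset per shader stage.
struct DrawCallPerStageOffsets {
    std::uint32_t indexCount = 0;
    std::uint32_t startIndex = 0;
    std::array<std::uint32_t, kStageCount> cbufferOffset{};
};

// Option 2: one offset per shader stage AND per bound buffer slot.
struct DrawCallPerSlotOffsets {
    std::uint32_t indexCount = 0;
    std::uint32_t startIndex = 0;
    std::array<std::array<std::uint32_t, kSlotsPerStage>, kStageCount> cbufferOffset{};
};

// Number of extra values every single draw call has to carry around.
constexpr std::size_t extraValuesOption1() { return kStageCount; }
constexpr std::size_t extraValuesOption2() { return kStageCount * kSlotsPerStage; }
```

Either way, every draw call carries offset data whether it needs it or not, which is the bloat I mean.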
It is a strict fact, though, that you get more performance if you map one 128 MB cbuffer and copy data to it in one pass than you would if you mapped, say, 1024 separate buffers of the same total size. That's why I was saying that what DirectX 11.1 proposes definitely makes sense, but having offsets passed into the draw call doesn't gain much, and can make matters worse on an architectural level.
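For illustration, a minimal sketch of the "one big buffer, one pass" idea could look like the following. It is plain C++ with a `std::vector` standing in for the mapped buffer memory, so `ConstantArena` and `ConstantRange` are hypothetical names, not D3D types. As far as I know, the D3D11.1 `*SetConstantBuffers1` calls take the offset and count in 16-byte constants and want both to be multiples of 16, so the sketch rounds every block up to 256 bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::uint32_t kBytesPerConstant = 16;      // one float4
constexpr std::uint32_t kConstantsPerAlignment = 16; // assumed D3D11.1 granularity

// Offset and size of one object's data, expressed in 16-byte constants.
struct ConstantRange {
    std::uint32_t firstConstant;
    std::uint32_t numConstants;
};

// Hypothetical linear allocator over one large buffer.
// storage_ stands in for the pointer returned by a single Map call.
class ConstantArena {
public:
    explicit ConstantArena(std::uint32_t capacityBytes) : storage_(capacityBytes) {}

    // Copies sizeBytes from data and returns where it landed.
    ConstantRange push(const void* data, std::uint32_t sizeBytes) {
        const std::uint32_t block = kBytesPerConstant * kConstantsPerAlignment; // 256
        const std::uint32_t padded = (sizeBytes + block - 1) / block * block;
        assert(padded <= storage_.size() - cursorBytes_ && "arena is full");
        std::memcpy(storage_.data() + cursorBytes_, data, sizeBytes);
        const ConstantRange range{cursorBytes_ / kBytesPerConstant,
                                  padded / kBytesPerConstant};
        cursorBytes_ += padded;
        return range;
    }

    std::uint32_t usedBytes() const { return cursorBytes_; }

private:
    std::vector<std::uint8_t> storage_;
    std::uint32_t cursorBytes_ = 0;
};
```

In a real renderer the returned `firstConstant` and `numConstants` would be what you hand to something like `VSSetConstantBuffers1` when binding, once per state change rather than as extra arguments on every draw call.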