Yes, GPUs can be very counter-intuitive at times.
At work we have an API test demo, which I wrote, that draws 1,000,000 cubes via instancing (so one draw call). The vertex shader loads a matrix4x4 from an indexed cbuffer for each vertex and uses it to transform that vertex. The matrix buffer is static and never touched after creation. It ran at just over 30fps on an NV GTX 470, with all of the load on the GPU.
Last week I wrote another test based on that demo; this time, however, the cbuffer held 4x float4, one of which was an axis/angle whose angle component was updated by a compute shader. At draw time that value was used to construct a quaternion for the rotation part of the transform, and the remaining values were used to translate the cubes in various ways (a few sin/cos calculations with per-instance phase and amplitude amounts). After that the workload for the shader was the same as before.
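For anyone who wants to see the rotation step spelled out, here is a rough CPU-side sketch in C++ rather than the actual shader code. The type and function names (Float3, Float4, quatFromAxisAngle, rotate) are made up for illustration, and it assumes the axis is already normalized and the angle is stored in radians in the w component.

```cpp
#include <cmath>

// Hypothetical stand-ins for the shader's float3/float4 types.
struct Float3 { float x, y, z; };
struct Float4 { float x, y, z, w; };

// Build a unit quaternion (x, y, z, w) from an axis/angle value.
// Assumption: axisAngle.xyz is a normalized axis, axisAngle.w is the angle in radians.
Float4 quatFromAxisAngle(const Float4& axisAngle) {
    const float half = 0.5f * axisAngle.w;
    const float s = std::sin(half);
    return { axisAngle.x * s, axisAngle.y * s, axisAngle.z * s, std::cos(half) };
}

// Standard 3D cross product.
Float3 cross(const Float3& a, const Float3& b) {
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}

// Rotate v by the unit quaternion q using
//   v' = v + 2*w*(u x v) + 2*(u x (u x v)),  where u = q.xyz and w = q.w.
Float3 rotate(const Float4& q, const Float3& v) {
    const Float3 u{ q.x, q.y, q.z };
    const Float3 t = cross(u, v);   // u x v
    const Float3 tt = cross(u, t);  // u x (u x v)
    return { v.x + 2.0f * (q.w * t.x + tt.x),
             v.y + 2.0f * (q.w * t.y + tt.y),
             v.z + 2.0f * (q.w * t.z + tt.z) };
}
```

Rotating (1, 0, 0) by a quarter turn about the Z axis with these helpers should give (0, 1, 0) to within float precision. The point is just that the quaternion path costs one sin, one cos and a couple of cross products per vertex on top of the original matrix transform.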
This new version, despite transferring more data and doing more per-vertex work, ran at ~40+fps...
(I might, if I have time, modify the demo this week to do some point expansion for the cubes in the GS to get a feel for the slowdown... then maybe throw in some HS/DS work too...)
Wait, so the cbuffer stream size was the same in both tests? (You said matrix4x4 compared to 4x float4.) That does sound interesting...
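The arithmetic behind that question can be checked with a quick C++ sketch. This only assumes tightly packed 32-bit floats; the struct names and the layout of the three non-rotation float4s are guesses for illustration, not the demo's real declarations.

```cpp
#include <cstddef>

// Hypothetical layout of one element in the first test: a 4x4 float matrix.
struct Matrix4x4 { float m[16]; };

// Hypothetical layout of one element in the second test: 4x float4,
// one holding axis/angle and three holding the other parameters.
struct InstanceParams {
    float axisAngle[4];
    float other[3][4];
};

static_assert(sizeof(Matrix4x4) == 64, "16 floats * 4 bytes");
static_assert(sizeof(InstanceParams) == sizeof(Matrix4x4),
              "4x float4 occupies the same 64 bytes as a matrix4x4");

// Total buffer size in bytes for a given element count and stride.
constexpr std::size_t totalBytes(std::size_t count, std::size_t stride) {
    return count * stride;
}
```

Under those assumptions both layouts are 64 bytes per element, so 1,000,000 elements come to 64,000,000 bytes either way. The difference between the two tests would then be in how the data is updated and used, not in how much of it is read per element.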