You're right; a jitter buffer would be a good solution.
There's one optimization to this model that I'd like to implement, though I suspect it may be a problem: given that a large proportion of input frames are likely to be empty, there would be a considerable amount of wasted bandwidth (e.g., at a rate of 25 Hz, about 1 KB/s of protocol overhead (IP + TCP) alone). Would it be possible to omit this redundant traffic? Assuming minimal jitter, I figure the client could set a time threshold for receiving input each frame; if it's exceeded, assume there was no input during that frame and optimistically advance the simulation. If that assumption turns out to be wrong, however, it would require reversing the simulation, integrating the missed input, and then somehow fixing the discrepancy between the current (erroneous) state and the correct state. I haven't heard of this being done, so I'd be interested in hearing about any experiences with such a method.
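To make the idea concrete, here's roughly the client loop I have in mind (just a sketch, not working code; GameState, PlayerInput, step() and try_receive_input() are placeholders for the real types and network layer, and a real loop would keep draining the socket until the frame deadline):

```cpp
#include <chrono>
#include <cstdint>
#include <map>
#include <optional>
#include <utility>
#include <vector>

struct PlayerInput {};            // default-constructed == "no input this frame"
struct GameState {};              // stand-in for the full simulation state

// Advance the simulation by one frame (stub).
GameState step(GameState s, const PlayerInput&) { return s; }

// Poll the socket for up to `wait`; returns (frame, input) if a message arrived (stub).
std::optional<std::pair<uint32_t, PlayerInput>>
try_receive_input(std::chrono::milliseconds wait) { (void)wait; return std::nullopt; }

int main() {
    GameState state{};
    std::vector<GameState> history;              // state saved *before* each simulated frame
    std::map<uint32_t, PlayerInput> inputs;      // inputs received so far, keyed by frame
    const auto threshold = std::chrono::milliseconds(10);   // per-frame jitter allowance

    for (uint32_t frame = 0; frame < 1000; ++frame) {       // 25 Hz tick loop in practice
        if (auto msg = try_receive_input(threshold)) {
            auto [msg_frame, in] = *msg;
            inputs[msg_frame] = in;
            if (msg_frame < frame) {
                // The optimistic "no input" guess for msg_frame was wrong:
                // rewind to the snapshot taken before it and re-simulate.
                state = history[msg_frame];
                history.resize(msg_frame);
                for (uint32_t f = msg_frame; f < frame; ++f) {
                    history.push_back(state);
                    state = step(state, inputs.count(f) ? inputs[f] : PlayerInput{});
                }
            }
        }
        // On-time input, or no input at all: advance optimistically.
        history.push_back(state);
        state = step(state, inputs.count(frame) ? inputs[frame] : PlayerInput{});
    }
}
```

The part I'm unsure about is how visible the correction step is to the player, since the re-simulated state can differ from what was already rendered.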
I should point out that the game in question is effectively a carbon copy of Diablo II; the simulation's computational requirements are minimal, so it would be quite feasible to dump the entire game state (~250 KB client-side) every frame (which is what I'm considering for implementing the reversal).
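For the state dump itself, I'm picturing something like a fixed-size ring of full snapshots (again, GameState here is just a placeholder for the real ~250 KB state). At 25 Hz, keeping half a second of history would be about 13 snapshots, or roughly 3 MB:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

struct GameState {};   // stand-in for the ~250 KB, trivially copyable world state

template <std::size_t N>
class SnapshotRing {
public:
    // Store the state as it was at the start of `frame`.
    void save(uint32_t frame, const GameState& s) {
        slots_[frame % N] = {frame, s, true};
    }
    // Retrieve the state for `frame`, if it is still within the retained window.
    std::optional<GameState> load(uint32_t frame) const {
        const Slot& slot = slots_[frame % N];
        if (slot.valid && slot.frame == frame) return slot.state;
        return std::nullopt;   // too old: the slot has been overwritten
    }
private:
    struct Slot { uint32_t frame = 0; GameState state{}; bool valid = false; };
    std::array<Slot, N> slots_{};
};
```

The rollback path in the loop above would then call load() for the late frame instead of keeping an unbounded history, and the ring size bounds how late an input can arrive before it simply has to be dropped.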