If you have some memory to spare then it's worth building a position-only VB and IB for each mesh. As well as storing just the position component of the original mesh, it can usually contain far fewer vertices.
Think of the case of a cube with hard edges. This requires 24 vertices in total (6 faces * 4 vertices). However the position-only mesh requires just 8 vertices. This cuts down on the bandwidth for vertex fetching, needs fewer transforms and makes better use of the post-transform cache.
By using the position only VB and IB the original mesh can then be a standard interleaved format.
For the above don't forget to optimise for both post-TC and pre-TC, and use a position-only decl and a VS that only transforms position.
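As a sketch of how the position-only buffers might be built offline, here's one way to collapse vertices that share a position (function and struct names are my own, not from any particular engine). Run on the hard-edged cube above, 24 input vertices collapse to 8:

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct Vec3 {
    float x, y, z;
    bool operator<(const Vec3& o) const {
        if (x != o.x) return x < o.x;
        if (y != o.y) return y < o.y;
        return z < o.z;
    }
};

// Build a position-only VB plus a remap table by collapsing vertices that
// share a position. 'positions' is the position stream of the full
// interleaved mesh; 'outRemap[i]' gives the position-only index for original
// vertex i, which you would then run the original index buffer through.
void BuildPositionOnlyMesh(const std::vector<Vec3>& positions,
                           std::vector<Vec3>& outVB,
                           std::vector<uint32_t>& outRemap)
{
    std::map<Vec3, uint32_t> seen;
    outVB.clear();
    outRemap.clear();
    for (const Vec3& p : positions) {
        auto it = seen.find(p);
        if (it == seen.end()) {
            it = seen.emplace(p, (uint32_t)outVB.size()).first;
            outVB.push_back(p);  // first time we've seen this position
        }
        outRemap.push_back(it->second);
    }
}
```

An exact-compare map is fine when the duplicated vertices came from the same source positions; if the positions have drifted through processing you'd want a welding tolerance instead.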
Secondly, when doing your frustum culling pass, use a different frustum for the prepass with a much closer far plane so the number of meshes accepted for drawing is much lower. There is little gain in drawing meshes far away: firstly they are likely to cover few pixels on the screen, secondly they are less likely to occlude many pixels, and thirdly HiZ buffers tend to really lose precision at far distances. It's not uncommon to set the far plane as close as, say, 150m.
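The two-far-plane idea can be folded into a single cull pass, something like this (the distances and the view-space convention are assumptions for illustration):

```cpp
// Bounding sphere in view space; view space looks down +z here,
// so s.z is the distance in front of the camera.
struct Sphere { float x, y, z, radius; };

struct CullResult {
    bool drawMain;     // passes the camera's real far plane
    bool drawPrepass;  // also passes the much closer prepass far plane
};

// One culling pass, two far planes: the base pass uses the camera's real
// far plane, the depth prepass a much closer one (e.g. 150 units) so
// distant meshes never reach the prepass at all.
CullResult CullAgainstFarPlanes(const Sphere& s,
                                float mainFar, float prepassFar)
{
    CullResult r;
    r.drawMain    = (s.z - s.radius) < mainFar;
    r.drawPrepass = (s.z - s.radius) < prepassFar;
    return r;
}
```

The rest of the frustum test (near plane and side planes) is shared, so the extra cost over a normal cull is one comparison per mesh.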
You can also cull meshes from the prepass using some heuristics. For example, don't even bother considering meshes which are unlikely to cover many screen pixels.
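One such heuristic is a rough screen-coverage estimate from the bounding sphere. This is a sketch under my own assumptions (the 100-pixel threshold in particular is just an illustrative cut-off, not a recommendation):

```cpp
// Rough estimate of how many pixels a bounding sphere covers on screen.
// A sphere of radius r at view-space distance d projects to a screen
// radius of roughly (r / d) * projScaleY * (viewportHeight / 2), where
// projScaleY = cot(fovY / 2) is the projection matrix's Y scale.
float EstimatePixelCoverage(float radius, float distance,
                            float projScaleY, float viewportHeight)
{
    if (distance <= radius)  // camera inside the sphere: treat as full screen
        return viewportHeight * viewportHeight;
    float screenRadius = (radius / distance) * projScaleY
                       * (viewportHeight * 0.5f);
    return 3.14159265f * screenRadius * screenRadius;  // area in pixels
}

// Hypothetical threshold: skip the prepass for meshes under ~100 pixels.
bool WorthPrepassing(float radius, float distance,
                     float projScaleY, float viewportHeight)
{
    return EstimatePixelCoverage(radius, distance,
                                 projScaleY, viewportHeight) > 100.0f;
}
```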
I usually find it a gain not to draw alpha-tested objects in the prepass, but to ensure they get drawn first in the base pass. That way they benefit from the opaque objects in the prepass, and then update the depth buffer before the opaque objects in the base pass.
When you are 6000 units from the camera you have effectively chopped off up to 13 bits of fractional precision. Typically this shouldn't matter for the rendering, as you should have aggregate transforms which work in camera space. Is it possible you are transforming from local to world and then world to view? If so, combine them into a single matrix on the CPU beforehand.
A) Yes, there will usually be a stall here. But rather than letting the GPU sit idle, it will start to work on other pixels/vertices instead. GPUs can have many thousands of pixels/vertices in some stage of execution at any point in time. One of the limiting factors is that each element currently in progress requires some registers to store intermediate values, so optimizing the shader to use fewer registers can help ensure there are enough elements in flight to hide these stalls.
B) Typically RTs are not in cache but they do have local ROP tiles which can cache data. These ROP tiles are flushed to VRAM when they are finished being written to or there is a RT switch.
C) Some render states can be pipelined with the draw call. Some can't and are set in one of many state contexts. Potentially some render state changes could cause the pipeline to flush or partially flush, leading to bubbles of the GPU going idle. Which states can and can't be pipelined is very much hardware dependent. Also note that some render state switches could potentially cause a lot of work in the driver on the CPU side if the hardware doesn't directly support the feature or the CPU has to do some kind of processing on the data first.
Typically the way to fix this is to quantize your screen-space vertices to some fixed grid. This grid can be finer than the size of a pixel. Then when creating your interpolants you can step at a finer resolution than this quantized grid. If done correctly the scanline interpolation should never under- or overshoot the endpoints. I use fixed-point math to store screen-space positions and interpolants, though you can use floating-point math when calculating intermediate values.
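As a concrete sketch of the quantization step: snapping to a grid with a few sub-pixel bits is a common choice (4 bits, i.e. a 1/16-pixel grid, is an assumption here, not something from the post above):

```cpp
#include <cmath>
#include <cstdint>

// Snap a screen-space coordinate to a fixed-point grid finer than a pixel.
// 4 sub-pixel bits gives a 1/16-pixel grid; coordinates are then stored
// as 28.4 fixed point, so every vertex position is exactly representable
// and endpoints of shared edges land on identical grid values.
constexpr int   kSubPixelBits  = 4;
constexpr float kSubPixelScale = float(1 << kSubPixelBits);  // 16

int32_t QuantizeToGrid(float screenCoord)
{
    return (int32_t)std::lround(screenCoord * kSubPixelScale);
}

float DequantizeFromGrid(int32_t fixedCoord)
{
    return float(fixedCoord) / kSubPixelScale;
}
```

Because both endpoints of an edge are snapped to the same grid before interpolant setup, stepping in fixed point between them can't drift past the endpoint values.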
You cannot generally do it in the VS because a triangle can cross over cascade boundaries. If you can guarantee that it doesn't (using frustum checks on the CPU) then you can do it in the VS.
Typically for orthographic projected cascades you only need to do a single vector by matrix transform. Each cascade can then be done via a bias and scale operation which is a single MAD operation per cascade check.
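The scale-and-bias trick looks something like this (the cascade parameters below are illustrative assumptions; only the structure matters). You transform once into the first cascade's shadow space, then each further cascade is reached with one MAD:

```cpp
struct Vec3 { float x, y, z; };

// Orthographic cascades differ only by a scale and a translation, so once
// a position is in cascade 0's shadow space, every other cascade is a
// single per-component multiply-add away.
struct CascadeScaleBias { Vec3 scale; Vec3 bias; };

Vec3 ToCascade(const Vec3& shadowPos0, const CascadeScaleBias& c)
{
    return { shadowPos0.x * c.scale.x + c.bias.x,    // one MAD per component
             shadowPos0.y * c.scale.y + c.bias.y,
             shadowPos0.z * c.scale.z + c.bias.z };
}

// A point falls inside a cascade if its rescaled coords land in [0, 1].
bool InsideCascade(const Vec3& p)
{
    return p.x >= 0.f && p.x <= 1.f &&
           p.y >= 0.f && p.y <= 1.f &&
           p.z >= 0.f && p.z <= 1.f;
}
```

In a pixel shader the same idea vectorizes nicely: apply all cascade scale/biases, test each result against [0, 1], and pick the first cascade that passes.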