The papers I've found recommend that, for the shadow map, you render to an MSAA 32-bit float RGBA render target to store the moments (m1, m1*m1, m2, m2*m2). You also need a separate depth buffer, since the RGBA buffer can't be used as one. This is incredibly expensive. With 4xMSAA, we get 16+4 bytes per MSAA sample. That's 80 MB just for a 4-sample 1024^2 variance map! We also need a resolved variance map, so add another 16 MB there. In total: 96 MB just for a 1024^2 shadow map. Don't you dare try a resolution of 2048^2...
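To double-check the arithmetic above, here's a quick sketch in Python. The helper name `shadow_map_mib` is made up for this post, and it only counts raw texel storage (16 bytes per RGBA32F sample, 4 bytes per depth sample, 16 bytes per resolved texel), so any driver padding or alignment is ignored.

```python
MIB = 1024 * 1024

def shadow_map_mib(size, samples, bytes_per_sample, resolved_bytes_per_texel=16):
    """Return (multisampled, resolved, total) storage in MiB for a size x size map.

    Hypothetical helper: counts raw texel storage only.
    """
    texels = size * size
    msaa = texels * samples * bytes_per_sample / MIB
    resolved = texels * resolved_bytes_per_texel / MIB
    return msaa, resolved, msaa + resolved

# Standard layout: RGBA32F colour (16 bytes) + 32-bit depth (4 bytes) per sample.
print(shadow_map_mib(1024, 4, 16 + 4))  # (80.0, 16.0, 96.0)
print(shadow_map_mib(2048, 4, 16 + 4))  # (320.0, 64.0, 384.0)
```

The second line shows why 2048^2 is out of the question with this layout: everything scales by 4x.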
However, I found that I can reduce this memory footprint a lot. We don't have to calculate the moments until we resolve the MSAA texture! Instead of having a 32-bit float RGBA texture + a depth texture, we can get by with only a 32-bit float depth texture. By outputting view-space depth and modifying the depth range, we can just write it to gl_FragDepth (in the case of OpenGL) in the shadow rendering shader. When resolving, we simply read the depth samples, calculate the moments for each one and average them together! The result is that we only need 4 bytes per MSAA sample, period. That's 16 MB for a 4xMSAA 1024^2 variance map. The resolved variance texture is identical to before, so that's another 16 MB. In total: 32 MB, which is a LOT better than 96 MB.
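To make the resolve step concrete, here's a rough CPU-side sketch in Python of what happens per pixel. In practice this would run in a shader over the multisampled depth texture; the version below only illustrates the math. The warp function, the exponents `c_pos` and `c_neg`, and the assumption that depth is normalized to [0, 1] are all illustrative choices on my part, not values from the post above.

```python
import math

def warp(depth, c_pos=40.0, c_neg=5.0):
    """Exponentially warp a depth in [0, 1]; the exponents are illustrative."""
    d = 2.0 * depth - 1.0  # remap to [-1, 1]
    return math.exp(c_pos * d), -math.exp(-c_neg * d)

def resolve_pixel(depth_samples, c_pos=40.0, c_neg=5.0):
    """Average (m1, m1*m1, m2, m2*m2) over the MSAA depth samples of one pixel."""
    sums = [0.0, 0.0, 0.0, 0.0]
    for depth in depth_samples:
        m1, m2 = warp(depth, c_pos, c_neg)
        for i, value in enumerate((m1, m1 * m1, m2, m2 * m2)):
            sums[i] += value
    n = len(depth_samples)
    return tuple(s / n for s in sums)

# Four identical samples: the averaged moments equal the per-sample moments.
print(resolve_pixel([0.5, 0.5, 0.5, 0.5]))  # (1.0, 1.0, -1.0, 1.0)
```

The important part is that the moments are computed per sample and then averaged. Averaging the depths first and computing the moments afterwards would throw away the variance that the MSAA samples provide.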
This not only reduces VRAM usage a lot, it also massively reduces the bandwidth needed for the shadow map rendering and resolve passes. On my GTX 295 (using only one GPU, so roughly equal to a GTX 260/275), the performance is a lot better with my optimization. I'm getting 240 FPS using the standard technique (1024^2 + 4xMSAA) and 440 FPS with my new one, which is almost twice as fast (about 1.8x). Quality is the same as with the normal technique, since everything after resolving the MSAA texture is plain EVSM.
I hope someone finds this useful, and if something's unclear feel free to ask!
EDIT: Fun fact: It's not possible to create an 8xMSAA 32-bit float RGBA texture, but it is possible to create an 8xMSAA 32-bit float depth buffer, so my technique works with 8xMSAA while the standard technique does not (at least on my hardware). Even funnier: Mine with 8xMSAA is 50% faster than the original with 4xMSAA and uses half as much memory.
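The "half as much memory" figure can be checked with a few lines of Python. The function name `total_mib` is made up for this sketch, and it assumes a 1024^2 map with a 16-byte-per-texel resolved texture, counting raw storage only.

```python
def total_mib(samples, bytes_per_sample, size=1024, resolved_bytes=16):
    """Total MiB per shadow map: multisampled samples plus one resolved texel."""
    bytes_per_texel = samples * bytes_per_sample + resolved_bytes
    return size * size * bytes_per_texel / (1024 * 1024)

standard_4x = total_mib(4, 16 + 4)  # RGBA32F + depth per sample
depth_only_4x = total_mib(4, 4)     # 32-bit float depth only
depth_only_8x = total_mib(8, 4)

print(standard_4x, depth_only_4x, depth_only_8x)  # 96.0 32.0 48.0
```

So the depth-only version at 8xMSAA needs 48 MB, exactly half of the 96 MB the standard layout needs at 4xMSAA.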
Edited by theagentd, 21 November 2012 - 09:19 PM.