1) For my initial draw surface, I am using D3DFMT_A16B16G16R16 and want to make sure that's an appropriate choice.
3) And perhaps most important, how do you actually -draw- at "higher than 1.0" intensities? Do you really return RGB color components > 1.0, or do the extra HDR bits only get used from combined blending and so forth?
D3DFMT_A16B16G16R16 only stores values from 0-1, but it does so in increments of 1/65535th, instead of increments of 1/255th for a regular 8-bit texture.
When rendering in HDR with this format, 1.0 is your maximum intensity, so if you want to use some other maximum, you have to divide by it at the end of your pixel shader.
e.g. if you want to store RGB intensity values from 0-10, you'd use something like:
return result * float4(0.1, 0.1, 0.1, 1)
Alternatively, you can use the D3DFMT_A16B16G16R16F format, which stores values from ~-65k to ~+65k, with floating point precision (small numbers are very precise, large numbers are less precise).
If you know what your maximum intensity value is, and it's a fairly small range, then the first method works very well.
If you don't know what your maximum intensity is, or it's a very large range, then the second method may be easier.
2) I'd like to confirm that my intermediate textures (where you step down to half size and gaussian blur) do not need extra bits; or do they? For the set of 4 or 5 bloom textures, what pixel format should I use for them?
That depends. In my renderer, I use the floating-point texture format, because I'm representing a variety of scenes with no upper limit on intensity values (besides the format's limitation of ~65k as a max value). I then assume that "bloom" occurs due to unwanted reflections in the camera lens, and arbitrarily say that this consists of 3% of the light hitting the lens.
So when I copy the main rendering to my small "bloom" texture, I multiply it by 0.03 (this is often called the "threshold" step), which gives me a max range of 0 to ~1950.
At the moment I'm also using 16F textures for these small bloom targets out of simplicity, but as an optimisation, I could use 8-bit textures instead.
What I'd do here is the same as above; I'd divide by the maximum value to remap them to the range of 0-1.
e.g. assuming the maximum bloom value is 1950:
instead of bloom = original * 0.03, I'd use bloom = original * 0.03 / 1950.
This should produce a number from 0-1. When the GPU tries to save that number into an 8-bit texture, it will round it to the nearest 1/255th.
If I was using the non-floating point HDR method (so my original values were in the range of 0-1), with a max range of 10 as above:
decoded = original * 10;// unpack stored 0-1 value to our actual intensity range
bloom = decoded * 0.03 / 0.3;//maximum bloom value is 10*0.03 == 0.3. Remap 0.3 to 1.0 for storage in 8 bit texture.
Then later when applying the bloom to the screen:
decodedBloom = texture * 0.3;//the texture stored 0.3 as 1.0. Remap 1.0 back to 0.3