I have written some code that takes video frames, runs a series of HLSL pixel shaders on them, and returns the processed video.
I'd like to review the overall memory transfers to check whether they are done properly and whether they could be optimized, and then I have some questions.
In this sample, we have 3 input textures out of 9 slots, and we run 3 shaders.
D = D3DPOOL_DEFAULT, S = D3DPOOL_SYSTEMMEM, R = D3DUSAGE_RENDERTARGET
m_InputTextures
    Index  0  1  2  3  4  5  6  7  8  9 10 11
    CPU    D  D  D                          S
    GPU    R  R  R                 R  R  R
m_RenderTargets contains one R texture per output resolution. The render target then gets copied into the next available input index: Command 1 outputs to RenderTarget[0] and is then copied to index 9, Command 2 outputs to index 10, Command 3 outputs to index 11, and so on. Only the final output needs to be copied from the GPU back to the CPU, which requires a D3DPOOL_SYSTEMMEM texture.
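The per-command flow above can be sketched roughly as follows. This is illustrative only (function and variable names are mine, not from the actual code), with error handling omitted:

```cpp
#include <d3d9.h>

// Hypothetical sketch of one command's flow: render into the render target,
// then copy the result into the next available input slot via StretchRect.
void RunCommand(IDirect3DDevice9* dev,
                IDirect3DTexture9* renderTarget,   // D3DPOOL_DEFAULT + D3DUSAGE_RENDERTARGET
                IDirect3DTexture9* nextInputSlot)  // e.g. m_InputTextures[9]
{
    IDirect3DSurface9* rtSurf  = nullptr;
    IDirect3DSurface9* dstSurf = nullptr;
    renderTarget->GetSurfaceLevel(0, &rtSurf);
    nextInputSlot->GetSurfaceLevel(0, &dstSurf);

    dev->SetRenderTarget(0, rtSurf);
    // ... SetTexture / SetPixelShader / draw the full-screen quad ...

    // Both surfaces must live in D3DPOOL_DEFAULT, and the destination must
    // be a render target, for StretchRect to succeed.
    dev->StretchRect(rtSurf, nullptr, dstSurf, nullptr, D3DTEXF_NONE);

    rtSurf->Release();
    dstSurf->Release();
}

// Final readback: GPU render target -> CPU D3DPOOL_SYSTEMMEM surface.
// GetRenderTargetData requires the two surfaces to match in size and format.
void ReadBack(IDirect3DDevice9* dev,
              IDirect3DSurface9* gpuSurf, IDirect3DSurface9* sysmemSurf)
{
    dev->GetRenderTargetData(gpuSurf, sysmemSurf);
}
```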
All the processing is done with D3DFMT_A16B16G16R16 format.
The full code is here
Any comments so far?
Here are a few questions.
1. Do all GPU-side textures need to be created as render targets? If I don't, the StretchRect calls I use to move the data around fail.
2. Can I do the processing in D3DFMT_A16B16G16R16 and then return the result in D3DFMT_X8R8G8B8 to avoid converting from 16-bit to 8-bit on the CPU? If so, which textures do I have to change?
3. If I want to work with half-float data, can the input textures be D3DFMT_A32B32G32R32F and then all the processing be done in D3DFMT_A16B16G16R16F, to avoid doing the float to half-float conversion on the CPU? If so, which textures do I have to change? Similarly, I could pass input frames as D3DFMT_X8R8G8B8 and process them in D3DFMT_A16B16G16R16.
Edit: To be more specific, I support 3 pixel formats: D3DFMT_X8R8G8B8, D3DFMT_A16B16G16R16 and D3DFMT_A16B16G16R16F. I'd like to be able to take the input in D3DFMT_X8R8G8B8, do the processing in D3DFMT_A16B16G16R16F and give back the result in D3DFMT_A16B16G16R16 (or use any other combination). How can I do that?
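One approach I'm considering (a sketch, not verified on all drivers): do the last StretchRect into a render target created in the desired output format, and read that one back. Whether StretchRect may convert between two given formats can apparently be queried first:

```cpp
#include <d3d9.h>

// Ask the driver whether StretchRect can convert from the processing format
// to the desired output format. d3d is the IDirect3D9 used to create the
// device (placeholder here).
HRESULT CanConvert(IDirect3D9* d3d)
{
    return d3d->CheckDeviceFormatConversion(
        D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL,
        D3DFMT_A16B16G16R16F,   // processing format
        D3DFMT_X8R8G8B8);       // desired output format
}
// If this succeeds: create the final render target and the matching
// D3DPOOL_SYSTEMMEM readback texture as D3DFMT_X8R8G8B8, StretchRect the
// last shader result into it, then GetRenderTargetData as usual.
```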
Edit2: By changing the last RenderTarget and the last m_InputTextures slot to D3DFMT_A16B16G16R16F, I'm able to read the result back as half-float successfully. If I change them to D3DFMT_X8R8G8B8, however, the R and B channels are swapped and the image is repeated twice side by side. Other than that, comparing it with the regular 16-bit processing, the image is exactly the same, which indicates it was internally processed as D3DFMT_A16B16G16R16. Creating the texture as D3DFMT_A8B8G8R8 fails.
I've also tried changing the input textures to D3DPOOL_SYSTEMMEM and using UpdateSurface instead of StretchRect. It works, but performance is no better and memory usage is higher.
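For reference, the UpdateSurface path I tried looks roughly like this (variable names are illustrative). Unlike StretchRect, UpdateSurface requires a D3DPOOL_SYSTEMMEM source and a D3DPOOL_DEFAULT destination with matching formats, and performs no stretching or format conversion:

```cpp
#include <d3d9.h>

// Upload a CPU-side frame into a GPU-side texture. sysmemSurf must come
// from a D3DPOOL_SYSTEMMEM texture, defaultSurf from a D3DPOOL_DEFAULT one,
// and both must have the same format and dimensions.
void UploadFrame(IDirect3DDevice9* dev,
                 IDirect3DSurface9* sysmemSurf,
                 IDirect3DSurface9* defaultSurf)
{
    // nullptr source rect / dest point => copy the entire surface.
    dev->UpdateSurface(sysmemSurf, nullptr, defaultSurf, nullptr);
}
```

This explains the higher memory usage: every input now needs both a SYSTEMMEM copy and a DEFAULT copy.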