I have this code that plugs into an AviSynth video processing chain to apply HLSL filters on each frame.
https://github.com/mysteryx93/AviSynthShader
I create various memory buffers, send the input frame to the GPU, run a series of 5 to 10 HLSL shaders one after the other through various texture buffers, and then transfer the result back from the GPU to the CPU once it's all done. Needless to say, the bottleneck is the memory transfers between the GPU and the CPU.
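Roughly, the per-frame flow looks like the sketch below. This is not the project's actual code; `ProcessFrame`, `stagingIn`/`stagingOut`, `texSrc`, `rt` and `DrawFullScreenQuad` are illustrative names, and it assumes all surfaces were created with matching size and format:

```cpp
#include <d3d9.h>
#include <cstring>
#include <vector>

void DrawFullScreenQuad(IDirect3DDevice9* dev);  // assumed helper: draws one textured quad

// stagingIn/stagingOut: D3DPOOL_SYSTEMMEM offscreen plain surfaces
// texSrc:               plain D3DPOOL_DEFAULT texture holding the input frame
// rt[2]:                D3DUSAGE_RENDERTARGET textures used as a ping-pong pair
void ProcessFrame(IDirect3DDevice9* dev,
                  IDirect3DSurface9* stagingIn, IDirect3DSurface9* stagingOut,
                  IDirect3DTexture9* texSrc, IDirect3DTexture9* rt[2],
                  const std::vector<IDirect3DPixelShader9*>& shaders,
                  const BYTE* src, BYTE* dst, int height, int rowBytes)
{
    // 1. CPU -> GPU: fill the system-memory staging surface, then upload it.
    D3DLOCKED_RECT lr;
    stagingIn->LockRect(&lr, nullptr, 0);
    for (int y = 0; y < height; ++y)
        memcpy((BYTE*)lr.pBits + y * lr.Pitch, src + y * rowBytes, rowBytes);
    stagingIn->UnlockRect();

    IDirect3DSurface9 *srcSurf, *rtSurf[2];
    texSrc->GetSurfaceLevel(0, &srcSurf);
    rt[0]->GetSurfaceLevel(0, &rtSurf[0]);
    rt[1]->GetSurfaceLevel(0, &rtSurf[1]);
    dev->UpdateSurface(stagingIn, nullptr, srcSurf, nullptr);

    // 2. Run the shader chain, ping-ponging between the two render targets.
    IDirect3DTexture9* input = texSrc;
    int out = 0;
    dev->BeginScene();
    for (IDirect3DPixelShader9* ps : shaders) {
        dev->SetTexture(0, input);            // also unbinds the previous target
        dev->SetRenderTarget(0, rtSurf[out]);
        dev->SetPixelShader(ps);
        DrawFullScreenQuad(dev);
        input = rt[out];
        out = 1 - out;
    }
    dev->EndScene();

    // 3. GPU -> CPU: copy the last render target back to system memory.
    dev->GetRenderTargetData(rtSurf[1 - out], stagingOut);
    stagingOut->LockRect(&lr, nullptr, D3DLOCK_READONLY);
    for (int y = 0; y < height; ++y)
        memcpy(dst + y * rowBytes, (BYTE*)lr.pBits + y * lr.Pitch, rowBytes);
    stagingOut->UnlockRect();

    srcSurf->Release(); rtSurf[0]->Release(); rtSurf[1]->Release();
}
```

Note that `GetRenderTargetData` is a synchronous read-back: the CPU stalls until the GPU has drained everything queued on that surface.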
To get the best performance, I must create an instance of my class (including a DX9 engine and all video buffers) for each CPU core (8x in my case).
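A minimal sketch of that setup, assuming a hypothetical `Engine` class that owns its own DX9 device plus all frame buffers:

```cpp
#include <memory>
#include <thread>
#include <vector>

class Engine { /* owns an IDirect3DDevice9 plus all staging/texture buffers */ };

int main() {
    // One fully independent engine per hardware thread (8 here).
    std::vector<std::unique_ptr<Engine>> engines;
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i)
        engines.emplace_back(std::make_unique<Engine>());
}
```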
There is another way to do the multi-threading, where 8 threads call the GetFrame function at the same time on the same class instance. Using a unique_lock, I can ensure only one frame gets processed at a time through this class, which effectively builds a work queue.
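A minimal sketch of that variant, with illustrative names (`SharedEngine`, `m_mutex`): every caller blocks on the mutex, so the single device only ever handles one frame at a time:

```cpp
#include <cstdint>
#include <mutex>

class SharedEngine {
    std::mutex m_mutex;  // one device, so one frame in flight at a time
public:
    void GetFrame(int n, const uint8_t* src, uint8_t* dst) {
        std::unique_lock<std::mutex> lock(m_mutex);  // the 8 callers queue up here
        // Upload frame n, run the shader chain, read the result back
        // (see ProcessFrame above); only one thread touches the device.
    }
};
```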
Using this method, memory usage is MUCH lower, but I also get ~3 fps instead of ~10 fps, which is definitely not good. My guess is that in this second implementation, the GPU sits idle while data is being transferred back and forth, but I'm not sure.
1. Why is performance so much lower when queuing the work through a single device?
2. Is there a way to get maximum performance by using a single DX9 device?