I have this code that processes all frames of a video file through HLSL pixel shaders. This is the previous version of the code.
https://github.com/mysteryx93/AviSynthShader/tree/master/Src
One issue I had is that each command had to create its own device. Since I was running on 8 threads, every command instance was multiplied by 8, so I ended up with MANY devices that took a massive amount of memory.
So I thought I could chain all the commands to execute them all at once on the same device, for each thread. I got the command chain to work, but performance isn't as good as I expected.
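To make the device arithmetic concrete, here is a minimal sketch in Python. It is not part of the plugin, and the function names are made up for illustration. It only assumes what is described above: the old design creates one device per command on every thread, while the chained design creates one device per thread.

```python
def devices_per_command(commands: int, threads: int) -> int:
    """Old design: every command creates its own device on every thread."""
    return commands * threads


def devices_chained(threads: int) -> int:
    """Chained design: each thread runs the whole chain on a single device."""
    return threads


# 6 shader commands on 8 threads, as in the test scripts below
print(devices_per_command(6, 8))  # 48
print(devices_chained(8))         # 8
```

So for this test, chaining should cut the device count from 48 to 8.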
Here's a script that executes all the shaders with the old code, where each command creates its own device.
function Test(clip input) {
input = input.Shader(path="Test.hlsl", shaderModel="ps_3_0")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
return input
}
Here's a script that creates a command chain to execute them all at once on the same device.
function Test(clip input) {
cmd = input.Shader(path="Test.hlsl", shaderModel="ps_3_0")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
return cmd.ExecuteShader(input)
}
Here are some benchmarks of both, running as 8 threads.
New code (command chain, 8 devices)
FPS (cur | min | max | avg): 16.00 | 1.333 | 1000000 | 39.12
Memory usage (phys | virt): 605 | 623 MB
Thread count: 137
CPU usage (current | average): 17% | 19%
Old code (device for each command, 48 devices)
FPS (cur | min | max | avg): 16.00 | 0.269 | 1000000 | 59.73
Memory usage (phys | virt): 728 | 803 MB
Thread count: 278
CPU usage (current | average): 28% | 24%
As you can see, the old code performs considerably better (59.73 vs. 39.12 FPS on average), which may be due to its thread count being about twice as high. The command chain uses fewer threads, as expected, but only about 2x fewer threads for 6x fewer devices. Memory usage is also lower with the command chain, as expected, but not by much (605 vs. 728 MB physical).
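For reference, the old/new ratios can be computed directly from the figures above. This is just a small Python sketch over the reported numbers, with dictionary keys chosen for illustration. It says nothing about why the numbers differ.

```python
# Figures copied from the two benchmark runs above (8 threads each).
old = {"devices": 48, "avg_fps": 59.73, "threads": 278, "phys_mb": 728}
new = {"devices": 8, "avg_fps": 39.12, "threads": 137, "phys_mb": 605}


def ratio(key: str) -> float:
    """How many times larger the old code's figure is than the new code's."""
    return old[key] / new[key]


for key in old:
    print(f"{key}: old/new = {ratio(key):.2f}")
# devices: old/new = 6.00
# avg_fps: old/new = 1.53
# threads: old/new = 2.03
# phys_mb: old/new = 1.20
```

In other words, 6x fewer devices only bought about 2x fewer threads and about 1.2x less physical memory, while average FPS dropped by a factor of about 1.5.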
Is there something I'm missing? Should I expect a considerable performance improvement with the command chain design, or is this the performance I should expect? Perhaps something is wrong with my new code or with the benchmark.
EDIT: Here are some more benchmarks
New code, executing each command as a chain and reconfiguring the same device
function Test(clip input) {
cmd = input.Shader(path="Test.hlsl", shaderModel="ps_3_0")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=.5f", param2="InputSize=352,288f")
return cmd.ExecuteShader(input)
}
Result:
FPS (min | max | average): 2.000 | 1000000 | 77.34
Memory usage (phys | virt): 599 | 621 MB
Thread count: 127
CPU usage (average): 25%
New code, executing each command individually
function Test(clip input) {
input = input.Shader(path="Test.hlsl", shaderModel="ps_3_0").ExecuteShader(input)
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f").ExecuteShader(input)
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=.5f", param2="InputSize=352,288f").ExecuteShader(input)
return input
}
Result:
FPS (cur | min | max | avg): 32.00 | 0.571 | 1000000 | 110.7
Memory usage (phys | virt): 650 | 710 MB
Thread count: 209
CPU usage (current | average): 39% | 34%
Old code, executing each command on its own device
function Test(clip input) {
input = input.Shader(path="Test.hlsl", shaderModel="ps_3_0")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=.5f", param2="InputSize=352,288f")
return input
}
Result:
FPS (min | max | average): 17.78 | 320.0 | 116.4
Memory usage (phys | virt): 652 | 689 MB
Thread count: 183
CPU usage (average): 58%