MysteryX

Pixel Shader Chain Performance


I have code that processes all the frames of a video file through HLSL pixel shaders. This is the previous version of the code:

https://github.com/mysteryx93/AviSynthShader/tree/master/Src

 

One issue I was having is that each command had to create its own device, and since I was running 8 threads, each instance was multiplied by 8. I ended up with MANY devices, and it took a massive amount of memory.

 

So I thought I could chain all the commands to execute them all at once on the same device, for each thread. I got the command chain to work, but performance isn't as good as I expected.

 

Here's a script that executes the shaders with the old code, where each command creates its own device.

function Test(clip input) {
input = input.Shader(path="Test.hlsl", shaderModel="ps_3_0")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
return input
}

Here's a script that creates a command chain to execute them all at once on the same device (the new code).

function Test(clip input) {
cmd = input.Shader(path="Test.hlsl", shaderModel="ps_3_0")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
return cmd.ExecuteShader(input)
}

Here are some benchmarks of both, running with 8 threads.

 

New code (command chain, 8 devices)

FPS (cur | min | max | avg):    16.00 | 1.333 | 1000000 | 39.12
Memory usage (phys | virt):     605 | 623 MB
Thread count:                   137
CPU usage (current | average):  17% | 19%

Old code (device for each command, 48 devices)

FPS (cur | min | max | avg):    16.00 | 0.269 | 1000000 | 59.73
Memory usage (phys | virt):     728 | 803 MB
Thread count:                   278
CPU usage (current | average):  28% | 24%

As you can see, the old code performs considerably better, which may be due to its higher thread count (double). The command chain uses fewer threads as expected, but only 2x fewer threads for 6x fewer devices. As for memory usage, the command chain does use less memory as expected, but not by much.

 

Is there something I'm missing? Should I expect a considerable performance improvement from the command chain design, or is this the performance I should expect? Perhaps something is wrong with my new code or benchmark.

 

 

EDIT: Here are some more benchmarks

 

New code, executing each command as a chain and reconfiguring the same device

function Test(clip input) {
cmd = input.Shader(path="Test.hlsl", shaderModel="ps_3_0")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=.5f", param2="InputSize=352,288f")
return cmd.ExecuteShader(input)
}

Result:

FPS (min | max | average):      2.000 | 1000000 | 77.34
Memory usage (phys | virt):     599 | 621 MB
Thread count:                   127
CPU usage (average):            25%

New code, executing each command individually

function Test(clip input) {
input = input.Shader(path="Test.hlsl", shaderModel="ps_3_0").ExecuteShader(input)
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f").ExecuteShader(input)
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=.5f", param2="InputSize=352,288f").ExecuteShader(input)
return input
}

Result:

FPS (cur | min | max | avg):    32.00 | 0.571 | 1000000 | 110.7
Memory usage (phys | virt):     650 | 710 MB
Thread count:                   209
CPU usage (current | average):  39% | 34%

Old code, executing each command on its own device

function Test(clip input) {
input = input.Shader(path="Test.hlsl", shaderModel="ps_3_0")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=.5f", param2="InputSize=352,288f")
return input
}

Result:

FPS (min | max | average):      17.78 | 320.0 | 116.4
Memory usage (phys | virt):     652 | 689 MB
Thread count:                   183
CPU usage (average):            58%
Edited by MysteryX

Are you still creating/destroying devices every frame?

It will probably be better to launch a new process, which creates its own persistent device, and then have all of your AviSynth threads enqueue their workloads with that one external device. That process could terminate itself if no one sends any work to it for N seconds.

If your bottleneck isn't the GPU, then you're not fully benefiting from this GPU-based approach :)

I'm not creating the device for every frame, but for every thread (e.g. 8 devices when running 8 threads). However, I'm reconfiguring that device with different shaders for every frame (not recreating the pixel shader, but re-assigning it).

 

Yeah... I'm not getting much performance benefit. Each HLSL shader, even a very simple one, is considerably expensive. I'm not sure where the bottleneck is.

 

If I were to take the single-device queue approach, each thread is blocked until it gets its result back before it can move on to the next frame, so with 8 threads the queue would hold at most 8 items. It would be a *lot* of work to make that design work; would I really see a considerable performance gain?

 

With your design, how do you pass video frame data between processes?

 

 

It seems that by design, the device renders the previous frame while I'm filling in the data of the current frame, which is why I have to call Present twice to get the current frame's data. When displaying to the screen, nobody cares about a 1-frame delay (or even notices it's happening). But in my case, by calling Present twice, I'm losing this optimization.

Edited by MysteryX


Is this being called multiple times per frame, and what is it doing?

 

input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")

 

What I'm asking is, are you doing file IO, shader compilation and text parsing multiple times per frame?  Because if so, that's your bottleneck right there.


This line creates an instance of my "Shader" class through AviSynth scripting.

 

I create the device and configure the shader when the class is instantiated.

 

Then, the GetFrame method gets called for every frame, and the class instance is the same for every frame.
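
To illustrate the split, here's a compressed sketch (the names and the "main" entry point are illustrative, not my actual code; the real code is in the repo linked above):

// Compressed sketch: compile once at construction, bind per frame.
// Names and the "main" entry point are illustrative.
#include <d3d9.h>
#include <d3dx9.h>  // link with d3d9.lib and d3dx9.lib

struct ShaderPass {
    IDirect3DDevice9*      dev;
    IDirect3DPixelShader9* ps;

    // Runs once, when the AviSynth filter instance is constructed:
    // file IO, text parsing and shader compilation all happen here.
    HRESULT Init(IDirect3DDevice9* device, const char* hlslPath, const char* model) {
        dev = device;
        ID3DXBuffer* code = NULL;
        HRESULT hr = D3DXCompileShaderFromFileA(hlslPath, NULL, NULL, "main",
                                                model, 0, &code, NULL, NULL);
        if (FAILED(hr)) return hr;
        hr = dev->CreatePixelShader((DWORD*)code->GetBufferPointer(), &ps);
        code->Release();
        return hr;
    }

    // Runs in GetFrame, once per frame: just re-binds the compiled shader.
    HRESULT Bind() { return dev->SetPixelShader(ps); }
};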


I just did another test.

 

Normal processing, 8 chains of commands in different threads (8 devices created)

FPS (min | max | average):      1.280 | 1000000 | 38.57
Memory usage (phys | virt):     568 | 586 MB
Thread count:                   137
CPU usage (average):            23%

If I remove this code, which runs after the first Present, the result is corrupt because it always returns the previous frame, but let's look at the performance difference:

HR(m_pDevice->Clear(D3DADAPTER_DEFAULT, NULL, D3DCLEAR_TARGET, D3DCOLOR_XRGB(0, 0, 0), 1.0f, 0));
HR(m_pDevice->BeginScene());
HR(m_pDevice->EndScene());
SCENE_HR(m_pDevice->DrawPrimitive(D3DPT_TRIANGLEFAN, 0, 2), m_pDevice);
return m_pDevice->Present(NULL, NULL, NULL, NULL);

Benchmark

FPS (min | max | average):      1.882 | 1000000 | 78.47
Memory usage (phys | virt):     567 | 586 MB
Thread count:                   137
CPU usage (average):            28%

More than double the performance! This would qualify as a bottleneck.

 

The performance difference is either in the processing of the scene, or in the fact that I'm not filling in the next frame while it is processing the current frame, or both.

Is there a more effective way of flushing the command buffer to return the data that was just passed in?
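
One option I've read about is an event query: Issue(D3DISSUE_END) followed by GetData with D3DGETDATA_FLUSH submits the pending commands and lets me wait until the GPU is actually done, without rendering a throwaway scene. A sketch of the idea (untested in my code):

// Sketch: flush the command buffer and wait for GPU completion with an
// event query instead of a second dummy scene + Present. Untested here.
HRESULT FlushAndWait(IDirect3DDevice9* dev) {
    IDirect3DQuery9* query = NULL;
    HRESULT hr = dev->CreateQuery(D3DQUERYTYPE_EVENT, &query);
    if (FAILED(hr))
        return hr;
    query->Issue(D3DISSUE_END);  // marks the end of the commands to wait for
    // D3DGETDATA_FLUSH submits pending commands; S_FALSE means "not done yet".
    while (query->GetData(NULL, 0, D3DGETDATA_FLUSH) == S_FALSE)
        Sleep(0);  // yield instead of burning the core
    query->Release();
    return S_OK;
}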

 

If I wanted to fill in the next frame while waiting for the current frame, I only see 2 options.

 

1. Pass in the data of the next frame when calling my function to process the frame. However, this would only work when reading the video linearly in single-threaded mode. With multi-threading, the next 8 frames will be split across 8 different threads, so that won't work, unless I find a way to work with the frame cache, which I don't know much about.

 

2. Communicate with a single shared device in a separate process. With 8 threads, the device's processing queue would have up to 8 frames filled in advance, awaiting processing. But then, how can I communicate frame data with that separate process? Even passing buffer pointers wouldn't work, as pointers can only be used by the process that created them (AFAIK).

 

Any other ideas?

 

 

About option #2, after a quick search I found out that inter-process communication can be done with Shared Memory. Still, this would be a lot of work that I would like to avoid.

https://msdn.microsoft.com/en-us/library/aa366551%28VS.85%29.aspx

http://www.codeproject.com/Articles/835818/Ultimate-Shared-Memory-A-flexible-class-for-interp
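
The Win32 side of it looks short enough. A minimal sketch of sharing a raw frame buffer between two processes (the mapping name and size are made up for illustration):

// Minimal sketch of Win32 shared memory for a raw frame buffer.
// The mapping name and buffer size are made up for illustration.
#include <windows.h>

int main() {
    const DWORD size = 352 * 288 * 4;  // one RGBA frame at the test resolution
    // CreateFileMapping opens the existing mapping if another process
    // already created one with the same name, so both sides run this.
    HANDLE map = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                    0, size, "Local\\AvsShaderFrame");
    BYTE* shared = (BYTE*)MapViewOfFile(map, FILE_MAP_ALL_ACCESS, 0, 0, size);

    // ... copy frame data in/out of 'shared', synchronized with a named event ...

    UnmapViewOfFile(shared);
    CloseHandle(map);
    return 0;
}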

 

Even with that option, which allows sharing a raw data buffer, it would be difficult. I can't just pass a struct array containing pointers to the various resources, as the memory those pointers reference wouldn't be shared. Even sharing strings would be difficult.

Edited by MysteryX

HR(m_pDevice->Clear(D3DADAPTER_DEFAULT, NULL, D3DCLEAR_TARGET, D3DCOLOR_XRGB(0, 0, 0), 1.0f, 0));
HR(m_pDevice->BeginScene());
HR(m_pDevice->EndScene());
SCENE_HR(m_pDevice->DrawPrimitive(D3DPT_TRIANGLEFAN, 0, 2), m_pDevice);
return m_pDevice->Present(NULL, NULL, NULL, NULL);

This should be causing errors because DrawPrimitive should be called between BeginScene and EndScene.  Also, that "SCENE_HR" macro seems suspect - what's going on in there?  Look, it's really difficult to help you troubleshoot this when you're obviously doing weird things but not telling us what those weird things are - you need to be providing more information.


You're right, DrawPrimitive was at the wrong place. Fixed it, but performance is still exactly the same. SCENE_HR is like HR except that it cancels the scene in case of failure.

I just found out that I can share resources across all threads. Creating a worker thread would be MUCH easier than creating a separate process.
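
Something along these lines: one worker thread owns the device, and the AviSynth threads feed it through a locked queue, each blocking on its own event until its frame is done (a rough sketch; names and the device work are placeholders):

// Rough sketch: a single worker thread owns the device; AviSynth threads
// enqueue jobs and block until their result is ready. Placeholder names.
#include <windows.h>
#include <deque>

struct Job {
    const BYTE* src;   // input frame data
    BYTE*       dst;   // output buffer
    HANDLE      done;  // per-job event, signaled when the worker finishes
};

CRITICAL_SECTION g_lock;           // InitializeCriticalSection at startup
HANDLE           g_workAvailable;  // CreateEvent(NULL, FALSE, FALSE, NULL)
std::deque<Job*> g_queue;

// Called from each AviSynth thread's GetFrame.
void Submit(Job* job) {
    EnterCriticalSection(&g_lock);
    g_queue.push_back(job);
    LeaveCriticalSection(&g_lock);
    SetEvent(g_workAvailable);
    WaitForSingleObject(job->done, INFINITE);  // block until processed
}

// The worker thread: the only code that ever touches the device.
DWORD WINAPI Worker(LPVOID) {
    for (;;) {
        WaitForSingleObject(g_workAvailable, INFINITE);
        EnterCriticalSection(&g_lock);
        while (!g_queue.empty()) {
            Job* job = g_queue.front();
            g_queue.pop_front();
            LeaveCriticalSection(&g_lock);
            // RunShaderChain(job->src, job->dst);  // hypothetical device work
            SetEvent(job->done);
            EnterCriticalSection(&g_lock);
        }
        LeaveCriticalSection(&g_lock);
    }
}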

Edited by MysteryX


I got it "almost" working to chain up the commands under the same device. However, filling up the commands into the same device all chained up one after the other has LOWER performance than when I'm running 32 devices at the same time, although that takes a lot more memory and CPU.

 

If I run AviSynth with 8 threads and use 4 worker threads instead of 1, and chain the commands so that command 2 takes the output of command 1 as its input (so I don't need dummy scenes to flush the GPU), performance is better, but it is still lower than when running 32 devices.
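
For reference, the chaining itself boils down to render-to-texture ping-pong, roughly like this (a simplified sketch, not the actual code; it assumes the quad's vertices are already bound, 'ping' initially holds the uploaded input frame, and both scratch textures were created with D3DUSAGE_RENDERTARGET):

// Simplified sketch of the chaining: pass N renders into a texture that
// pass N+1 samples, so only the last pass needs a CPU readback.
#include <d3d9.h>

void RunChain(IDirect3DDevice9* dev, IDirect3DPixelShader9* shaders[],
              int count, IDirect3DTexture9* ping, IDirect3DTexture9* pong) {
    IDirect3DTexture9* src = ping;
    IDirect3DTexture9* dst = pong;
    for (int i = 0; i < count; i++) {
        IDirect3DSurface9* target = NULL;
        dst->GetSurfaceLevel(0, &target);
        dev->SetRenderTarget(0, target);  // draw into the scratch texture
        dev->SetTexture(0, src);          // sample the previous pass's output
        dev->SetPixelShader(shaders[i]);
        dev->BeginScene();
        dev->DrawPrimitive(D3DPT_TRIANGLEFAN, 0, 2);  // full-screen quad
        dev->EndScene();
        target->Release();
        IDirect3DTexture9* tmp = src; src = dst; dst = tmp;  // swap ping/pong
    }
    // 'src' now holds the final image; read it back to system memory once here.
}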

 

Why is a single device behaving so poorly?
