Pixel Shader Chain Performance


I have code that processes all frames of a video file through HLSL pixel shaders. The previous version of the code is here:

https://github.com/mysteryx93/AviSynthShader/tree/master/Src

One issue I was having is that each command had to create its own device, and since I was running 8 threads, each instance got multiplied by 8. I ended up with MANY devices, and it took a massive amount of memory.

So I thought I could chain all the commands to execute them all at once on the same device, for each thread. I got the command chain to work, but performance isn't as good as I expected.

Here's a script that runs each shader as a separate command, each on its own device (the old code).


function Test(clip input) {
input = input.Shader(path="Test.hlsl", shaderModel="ps_3_0")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
return input
}

Here's a script that creates a command chain to execute them all at once on the same device.


function Test(clip input) {
cmd = input.Shader(path="Test.hlsl", shaderModel="ps_3_0")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
return cmd.ExecuteShader(input)
}

Here are some benchmarks of both, running as 8 threads.

New code (command chain, 8 devices)


FPS (cur | min | max | avg):    16.00 | 1.333 | 1000000 | 39.12
Memory usage (phys | virt):     605 | 623 MB
Thread count:                   137
CPU usage (current | average):  17% | 19%

Old code (device for each command, 48 devices)


FPS (cur | min | max | avg):    16.00 | 0.269 | 1000000 | 59.73
Memory usage (phys | virt):     728 | 803 MB
Thread count:                   278
CPU usage (current | average):  28% | 24%

As you can see, the old code performs considerably better, which may be due to the higher thread count (double). The command chain uses fewer threads, as expected, but only 2x fewer threads for 6x fewer devices. As for memory usage, the command chain does take less memory, as expected, but not by much.

Is there something I'm missing? Should I expect a considerable performance improvement from the command-chain design, or is this the level of performance I should expect? Perhaps something is wrong with my new code or my benchmark.

EDIT: Here are some more benchmarks

New code, executing each command as a chain and reconfiguring the same device


function Test(clip input) {
cmd = input.Shader(path="Test.hlsl", shaderModel="ps_3_0")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
cmd = cmd.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=.5f", param2="InputSize=352,288f")
return cmd.ExecuteShader(input)
}

Result:


FPS (min | max | average):      2.000 | 1000000 | 77.34
Memory usage (phys | virt):     599 | 621 MB
Thread count:                   127
CPU usage (average):            25%

New code, executing each command individually


function Test(clip input) {
input = input.Shader(path="Test.hlsl", shaderModel="ps_3_0").ExecuteShader(input)
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f").ExecuteShader(input)
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=.5f", param2="InputSize=352,288f").ExecuteShader(input)
return input
}

Result:


FPS (cur | min | max | avg):    32.00 | 0.571 | 1000000 | 110.7
Memory usage (phys | virt):     650 | 710 MB
Thread count:                   209
CPU usage (current | average):  39% | 34%

Old code, executing each command on its own device


function Test(clip input) {
input = input.Shader(path="Test.hlsl", shaderModel="ps_3_0")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")
input = input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=.5f", param2="InputSize=352,288f")
return input
}

Result:


FPS (min | max | average):      17.78 | 320.0 | 116.4
Memory usage (phys | virt):     652 | 689 MB
Thread count:                   183
CPU usage (average):            58%
Are you still creating/destroying devices every frame?

It will probably be better to launch a new process, which creates its own persistent device, and then have all of your AviSynth threads enqueue their workloads with that one external device. That process could terminate itself if no one sends any work to it for N seconds.

If your bottleneck isn't the GPU, then you're not fully benefiting from this GPU-based approach :)

I'm not creating the device for every frame, but for every thread (e.g. 8 devices when running 8 threads). However, I'm reconfiguring that device with different shaders for every frame (not recreating the pixel shaders, but re-assigning them).

Yeah... I'm not getting much performance benefit. Each HLSL shader, even a very simple one, is considerably expensive. I'm not sure where the bottleneck is.

If I were to take the single-device queue approach, each thread is blocked until it gets its result back before it can move on to the next frame, so with 8 threads the queue would hold at most 8 items. It would be a *lot* of work to get it working with that design; would I really see a considerable performance gain?

With your design, how do you pass video frame data between processes?

It seems that by design, the device renders the previous frame while I'm filling in the data of the current frame, which is why I have to call Present twice to get the current frame's data. When displaying to the screen, nobody cares about a one-frame delay (or even notices it's happening). But in my case, by calling Present twice, I'm losing that optimization.

Is this being called multiple times per frame, and what is it doing?

input.Shader(path="Sharpen.hlsl", shaderModel="ps_3_0", param1="Amount=1f", param2="InputSize=352,288f")

What I'm asking is, are you doing file IO, shader compilation and text parsing multiple times per frame? Because if so, that's your bottleneck right there.


This line creates an instance of my "Shader" class through AviSynth scripting.

I create the device and configure the shader when the class gets instantiated.

Then, the GetFrame method gets called for every frame, and the class instance is the same for every frame.
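
To make that concrete, here is a minimal sketch of the split I'm describing: the shader is compiled once when the instance is created, and GetFrame only re-assigns it. The class name, the "main" entry point and the error handling are placeholders, not the actual plugin code.


#include <d3d9.h>
#include <d3dx9.h>

// Illustrative sketch only; names and the "main" entry point are placeholders.
class ShaderFilter {
    IDirect3DDevice9*      m_pDevice;
    IDirect3DPixelShader9* m_pShader;

public:
    // Called once when the filter instance is created: file IO and compilation happen here.
    HRESULT Init(IDirect3DDevice9* device, LPCWSTR hlslPath) {
        m_pDevice = device;
        m_pShader = NULL;
        LPD3DXBUFFER code = NULL, errors = NULL;
        HRESULT hr = D3DXCompileShaderFromFileW(hlslPath, NULL, NULL,
            "main", "ps_3_0", 0, &code, &errors, NULL);
        if (SUCCEEDED(hr))
            hr = m_pDevice->CreatePixelShader((DWORD*)code->GetBufferPointer(), &m_pShader);
        if (code)   code->Release();
        if (errors) errors->Release();
        return hr;
    }

    // Called for every frame: no file IO, no compilation, only re-assignment.
    HRESULT GetFrame() {
        HRESULT hr = m_pDevice->SetPixelShader(m_pShader);
        // ... copy the frame into the input texture, draw, read the result back ...
        return hr;
    }
};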

I just did another test.

Normal processing, 8 chains of commands in different threads (8 devices created)


FPS (min | max | average):      1.280 | 1000000 | 38.57
Memory usage (phys | virt):     568 | 586 MB
Thread count:                   137
CPU usage (average):            23%

If I remove this code, which runs after the first Present call, the result is corrupt because the device always returns the previous frame, but let's look at the performance difference:


HR(m_pDevice->Clear(D3DADAPTER_DEFAULT, NULL, D3DCLEAR_TARGET, D3DCOLOR_XRGB(0, 0, 0), 1.0f, 0));
HR(m_pDevice->BeginScene());
HR(m_pDevice->EndScene());
SCENE_HR(m_pDevice->DrawPrimitive(D3DPT_TRIANGLEFAN, 0, 2), m_pDevice);
return m_pDevice->Present(NULL, NULL, NULL, NULL);

Benchmark


FPS (min | max | average):      1.882 | 1000000 | 78.47
Memory usage (phys | virt):     567 | 586 MB
Thread count:                   137
CPU usage (average):            28%

More than double performance!! This would qualify as a bottleneck.

The performance difference is either in the processing of the scene, or in the fact that I'm not filling in the next frame while it is processing the current frame, or both.

Is there a more effective way of flushing the command buffer to return the data that was just passed in?
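
For reference, one alternative to a dummy Present that is commonly suggested for flushing D3D9 is an event query: issue it after the draw call, then poll GetData with D3DGETDATA_FLUSH until the GPU reports completion. This is only a sketch of that idea, reusing the same m_pDevice as above, and I don't know yet whether it avoids the one-frame-behind behavior here:


// Sketch: flush with an event query instead of a second Present. Error handling omitted.
IDirect3DQuery9* pQuery = NULL;
if (SUCCEEDED(m_pDevice->CreateQuery(D3DQUERYTYPE_EVENT, &pQuery))) {
    pQuery->Issue(D3DISSUE_END);  // mark the end of the commands queued so far
    // D3DGETDATA_FLUSH pushes the command buffer to the GPU; S_FALSE means "not finished yet"
    while (pQuery->GetData(NULL, 0, D3DGETDATA_FLUSH) == S_FALSE)
        SwitchToThread();
    pQuery->Release();
}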

If I wanted to fill in the next frame while waiting for the current one, I see only two options.

1. Pass in the data of the next frame when calling my function to process the frame. However, this would only work when reading the video linearly in single-threaded mode. With multi-threading, the next 8 frames are split across 8 different threads, so that won't work unless I find a way to work with the frame cache, which I don't know much about.

2. Communicate with a single shared device in a separate process. With 8 threads, the device's processing queue would have up to 8 frames filled in advance, awaiting processing. But then, how can I communicate frame data with that separate process? Even passing buffer pointers wouldn't work, as pointers can only be used by the process that created them (AFAIK).

Any other ideas?

About option #2, after a quick search I found out that inter-process communication can be done with Shared Memory. Still, this would be a lot of work that I would like to avoid.

https://msdn.microsoft.com/en-us/library/aa366551%28VS.85%29.aspx

http://www.codeproject.com/Articles/835818/Ultimate-Shared-Memory-A-flexible-class-for-interp

Even with that option, which allows sharing a raw data buffer, it would be difficult. I can't just pass a struct array containing pointers to the various resources, as the destinations of those pointers wouldn't be shared. Even sharing strings would be difficult.
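
For the record, here is roughly what the named shared memory from that MSDN article looks like; only raw bytes go through the mapping, never pointers, and something like an event would still be needed to signal when a frame is ready. The mapping name, the size and the helper functions are made up for the sketch, and error handling is omitted:


#include <windows.h>
#include <cstring>

// Hypothetical mapping name and size, for illustration only.
static const wchar_t* kMapName = L"Local\\AviSynthShaderFrame";
static const DWORD    kMapSize = 4 * 1024 * 1024;  // must be large enough for one frame

// Producer process (the AviSynth plugin): create the mapping and copy raw frame bytes into it.
void SendFrame(const BYTE* frameData, DWORD frameSize) {
    HANDLE hMap = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, kMapSize, kMapName);
    BYTE* pView = (BYTE*)MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, kMapSize);
    memcpy(pView, frameData, frameSize);  // raw data only, never pointers
    UnmapViewOfFile(pView);
    // keep hMap open for the session, or CloseHandle(hMap) when done
}

// Consumer process (the one owning the device): open the same mapping by name and read it back.
void ReceiveFrame(BYTE* dest, DWORD frameSize) {
    HANDLE hMap = OpenFileMappingW(FILE_MAP_ALL_ACCESS, FALSE, kMapName);
    BYTE* pView = (BYTE*)MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, kMapSize);
    memcpy(dest, pView, frameSize);
    UnmapViewOfFile(pView);
    CloseHandle(hMap);
}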


HR(m_pDevice->Clear(D3DADAPTER_DEFAULT, NULL, D3DCLEAR_TARGET, D3DCOLOR_XRGB(0, 0, 0), 1.0f, 0));
HR(m_pDevice->BeginScene());
HR(m_pDevice->EndScene());
SCENE_HR(m_pDevice->DrawPrimitive(D3DPT_TRIANGLEFAN, 0, 2), m_pDevice);
return m_pDevice->Present(NULL, NULL, NULL, NULL);

This should be causing errors because DrawPrimitive should be called between BeginScene and EndScene. Also, that "SCENE_HR" macro seems suspect - what's going on in there? Look, it's really difficult to help you troubleshoot this when you're obviously doing weird things but not telling us what those weird things are - you need to be providing more information.


You're right, DrawPrimitive was in the wrong place. I fixed it, but performance is still exactly the same. SCENE_HR is like HR, except that it also cancels the scene in case of failure.

I just found out that I can share resources across all threads. Creating a worker thread would be MUCH easier than creating a separate process.
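
For reference, this is one possible shape for that worker thread: the AviSynth threads enqueue a job and block on a future until their own frame comes back, while every D3D call stays on the single device-owning thread. The sketch uses standard C++ primitives rather than the Win32 events I actually tried (see below), and the names are illustrative:


#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>

// Sketch only: a single queue serviced by the one thread that owns the device.
struct RenderQueue {
    std::mutex m;
    std::condition_variable cv;
    std::deque<std::function<void()>> jobs;
    bool stop = false;

    // Runs on the device-owning worker thread; every D3D call happens here.
    void Worker() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return stop || !jobs.empty(); });
                if (stop && jobs.empty()) return;
                job = std::move(jobs.front());
                jobs.pop_front();
            }
            job();  // render one frame on the shared device
        }
    }

    // Called from the 8 AviSynth threads; the caller waits on the returned future.
    std::future<void> Submit(std::function<void()> work) {
        auto task = std::make_shared<std::packaged_task<void()>>(std::move(work));
        std::future<void> done = task->get_future();
        {
            std::lock_guard<std::mutex> lock(m);
            jobs.push_back([task] { (*task)(); });
        }
        cv.notify_one();
        return done;  // e.g. Submit(renderJob).wait() before returning the frame
    }
};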

I finally re-implemented it with events. It's kind of working, but in a weird and unstable way with poor performance... I wrote about it here:

http://www.gamedev.net/topic/672898-c-weird-behaviors-with-multi-threading/

I got chaining the commands under the same device "almost" working. However, feeding the commands into the same device, all chained one after the other, gives LOWER performance than running 32 devices at the same time, even though the 32 devices take a lot more memory and CPU.

If I run AviSynth with 8 threads and run 4 worker threads instead of 1, and chain the commands so that command 2 reads command 1's output and I don't need dummy scenes to flush the GPU, performance is better, but it is still lower than when running 32 devices.
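
To be clear about what the chaining means, this is roughly the per-frame sequence I'm aiming for: two render-target textures, with pass 2 sampling pass 1's output directly and no dummy scene between passes. rtA, rtB, inputTexture, shader1 and shader2 are placeholders, and the HR error-checking macros are omitted:


// Pass 1: inputTexture -> rtA (rtA and rtB are textures created with D3DUSAGE_RENDERTARGET)
IDirect3DSurface9* surfA = NULL;
IDirect3DSurface9* surfB = NULL;
rtA->GetSurfaceLevel(0, &surfA);
rtB->GetSurfaceLevel(0, &surfB);

m_pDevice->SetRenderTarget(0, surfA);
m_pDevice->SetTexture(0, inputTexture);
m_pDevice->SetPixelShader(shader1);
m_pDevice->BeginScene();
m_pDevice->DrawPrimitive(D3DPT_TRIANGLEFAN, 0, 2);
m_pDevice->EndScene();

// Pass 2: rtA -> rtB, issued immediately; the GPU only has to be flushed once,
// after the last pass, when the final result is read back.
m_pDevice->SetRenderTarget(0, surfB);
m_pDevice->SetTexture(0, rtA);
m_pDevice->SetPixelShader(shader2);
m_pDevice->BeginScene();
m_pDevice->DrawPrimitive(D3DPT_TRIANGLEFAN, 0, 2);
m_pDevice->EndScene();

surfA->Release();
surfB->Release();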

Why is a single device behaving so poorly?

