Using Max Capacity with 1 Engine


I have this code that plugs into an AviSynth video processing chain to apply HLSL filters on each frame.

https://github.com/mysteryx93/AviSynthShader

I'm creating various memory buffers to send the input frame to the GPU, process a series of 5 or 10 HLSL shaders one after the other through various texture buffers, and then transfer the result back from the GPU to the CPU once it's all done. Needless to say, the bottleneck is the memory transfers between the GPU and the CPU.
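Roughly, each frame goes through something like this (all names below are hypothetical stand-ins for illustration, not the actual AviSynthShader code):

```cpp
#include <cstdint>

// Hypothetical helpers, just to illustrate the per-frame flow described above.
void UploadToGpu(const std::uint8_t* src);   // CPU -> GPU transfer of the input frame
void RunShaderPass(int pass);                // render one HLSL pass into a texture buffer
void ReadbackFromGpu(std::uint8_t* dst);     // GPU -> CPU transfer (the bottleneck)

void ProcessFrame(const std::uint8_t* src, std::uint8_t* dst, int numPasses) {
    UploadToGpu(src);
    for (int pass = 0; pass < numPasses; ++pass)
        RunShaderPass(pass);                 // 5 to 10 passes chained through texture buffers
    ReadbackFromGpu(dst);
}
```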

To get the best performance, I must create an instance of my class (including a DX9 engine and all video buffers) for each CPU core (8x in my case).

There is another way to do the multi-threading, where 8 threads call the GetFrame function at the same time on the same class instance. Using a unique_lock, I can ensure only one frame gets processed at a time through this class, which effectively builds a work queue.
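For reference, that single-instance variant looks roughly like this (a minimal sketch with hypothetical names, not the actual plugin code):

```cpp
#include <mutex>

// Every worker thread calls GetFrame on the same object, but the
// unique_lock serializes the work so only one frame is processed at a time.
class ShaderEngine {
public:
    void GetFrame(int n) {
        std::unique_lock<std::mutex> lock(m_mutex); // one frame at a time
        // upload frame n, run the shader passes, read the result back
    }
private:
    std::mutex m_mutex;
};
```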

Using this method, memory usage is MUCH lower, but I also get ~3fps instead of ~10fps, which is definitely not good. I'm guessing the difference is that in this second implementation, the GPU isn't working while the data is being transferred back and forth? Maybe.

1. Why am I getting such lower performance when queuing the work through a single device?

2. Is there a way to get maximum performance by using a single DX9 device?


Woah... here's something weird I just realized.

When I run a video processing chain, if I remove one heavy video processing component, the performance DROPS from 6fps to 2.2fps!??

How performance can drop by reducing the workload is beyond me. It appears there is a work synchronization issue, and that delaying the parallel execution of my code with other work reduces these synchronization issues. Or something.

Any idea on this one!?

Edit:

The x86 version runs at 6fps with both components and 2.2fps if I take out the other component.

The x64 version runs at 6.7fps with both components and 8fps if I take out the other component.

For some reason, the x64 build is less affected by this thread synchronization issue. But then, if I run the x64 code with 16 threads, it will happily eat 5GB of RAM, while the x86 code takes half of that!

Here's something else that is strange. I have dual graphics: an Intel HD 4000 with a Radeon 7670M. I get 6fps on the Intel and 5.8fps on the Radeon.

I found the solution. The main issue was that I was using the flag D3DPRESENT_INTERVAL_DEFAULT instead of D3DPRESENT_INTERVAL_IMMEDIATE, which limited performance as if I wanted to display at 60Hz (even though I don't display anything on screen).
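In the device setup, that amounts to changing the presentation interval in the present parameters (the flag and field are standard D3D9; the surrounding values here are just illustrative):

```cpp
#include <d3d9.h>

D3DPRESENT_PARAMETERS pp = {};
pp.Windowed         = TRUE;
pp.SwapEffect       = D3DSWAPEFFECT_DISCARD;
pp.BackBufferFormat = D3DFMT_A8R8G8B8;
// The fix: don't wait for vsync when presenting (nothing is shown on screen anyway).
pp.PresentationInterval = D3DPRESENT_INTERVAL_IMMEDIATE; // was D3DPRESENT_INTERVAL_DEFAULT
```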

From there, I did several optimizations.

With one engine and 8 threads calling it, and a lock held only for the part that uses the renderer (excluding the memory transfers in/out), I can now get decent performance.
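In other words, the lock now covers only the rendering part, roughly like this (hypothetical names, a sketch of the idea rather than the actual code):

```cpp
#include <cstdint>
#include <mutex>

std::mutex g_deviceMutex;

// Hypothetical helpers: only the work that touches the shared renderer is
// serialized; the CPU-side preparation and copy-out happen outside the lock.
void PrepareInput(const std::uint8_t* src);
void RunShaderPasses();
void CopyOutput(std::uint8_t* dst);

void ProcessFrame(const std::uint8_t* src, std::uint8_t* dst) {
    PrepareInput(src);              // outside the lock
    {
        std::unique_lock<std::mutex> lock(g_deviceMutex);
        RunShaderPasses();          // only the renderer work is serialized
    }
    CopyOutput(dst);                // outside the lock
}
```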

I get the best performance by creating 2 devices and alternating between them: frame 1 goes to device A, frame 2 goes to device B, and so on.
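Something like this (hypothetical names, just to illustrate the round-robin idea):

```cpp
// Two complete engine instances, each with its own DX9 device; even frames
// use one, odd frames the other, so one device can render while the other
// is busy with transfers.
struct Engine { /* device, buffers, mutex, ... */ };
Engine engines[2];

Engine& EngineForFrame(int n) {
    return engines[n % 2]; // frame 0 -> A, frame 1 -> B, frame 2 -> A, ...
}
```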

Now the performance is pretty good! Almost twice as fast as before.

