
Using Max Capacity with 1 Engine


I have this code that plugs into an AviSynth video processing chain to apply HLSL filters on each frame.



I create several memory buffers, send the input frame to the GPU, run a series of 5 or 10 HLSL shaders one after the other through various texture buffers, and then transfer the result back from the GPU to the CPU once it's all done. Needless to say, the bottleneck is in memory transfers between the GPU and the CPU.


To get the best performance, I must create an instance of my class (including a DX9 engine and all video buffers) for each CPU core (8x in my case).


There is another way to do the multi-threading, where 8 threads call the GetFrame function at the same time on the same class instance. Using a unique_lock, I can ensure only one frame is processed at a time through this class, which effectively builds a work queue.
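For readers who want to see the shape of this second approach, here is a minimal sketch, assuming the lock is held for the whole GetFrame call. `SharedEngine`, `Process`, and `RunWorkers` are hypothetical names, and the placeholder work stands in for the real upload, shader passes, and readback. It only shows that with such a lock, no two frames ever overlap inside the instance.

```cpp
#include <algorithm>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the shared filter instance. The real class would
// own the DX9 device and all texture buffers.
class SharedEngine {
public:
    // Called concurrently by the host's worker threads.
    int GetFrame(int n) {
        std::unique_lock<std::mutex> lock(mutex_);  // held for the whole call
        ++active_;
        maxActive_ = std::max(maxActive_, active_);
        int result = Process(n);  // upload + shaders + readback, all serialized
        --active_;
        return result;
    }

    // Peak number of frames that were inside GetFrame at once.
    // Read this only after all worker threads have been joined.
    int MaxConcurrentFrames() const { return maxActive_; }

private:
    static int Process(int n) { return n * 2; }  // placeholder work

    std::mutex mutex_;
    int active_ = 0;
    int maxActive_ = 0;
};

// Runs `threads` workers against one engine and returns the peak overlap.
inline int RunWorkers(SharedEngine& engine, int threads, int framesPerThread) {
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&engine, t, framesPerThread] {
            for (int i = 0; i < framesPerThread; ++i)
                engine.GetFrame(t * framesPerThread + i);
        });
    }
    for (auto& th : pool) th.join();
    return engine.MaxConcurrentFrames();
}
```

With 8 workers the peak overlap stays at 1, so the other 7 threads simply wait in line. Memory use is that of a single instance, but any CPU-side work done inside the lock is serialized too.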


With this method, memory usage is MUCH lower, but I also get ~3fps instead of ~10fps, which is definitely not good. My guess is that in this second implementation the GPU sits idle while data is being transferred back and forth.


1. Why am I getting such lower performance when queuing the work through a single device?


2. Is there a way to get maximum performance by using a single DX9 device?


Woah... here's something weird I just realized.


When I run a video processing chain and remove one heavy video processing component, the performance DROPS from 6fps to 2.2fps!?


How performance can drop when the workload is reduced is beyond me. It looks like there is a work synchronization issue, and delaying the parallel execution of my code with other work makes that issue less severe. Or something.


Any idea on this one!?




The x86 version runs at 6fps with both components and 2.2fps if I take out the heavy component.

The x64 version runs at 6.7fps with both components and 8fps if I take out the heavy component.


For some reason, the x64 build is less affected by this thread synchronization issue. But then, if I run the x64 code with 16 threads, it will happily eat 5GB of RAM, while the x86 code takes half of that.


Here's something else that is strange. I have a dual-graphics setup with an Intel HD 4000 and a Radeon 7670M. I get 6fps on the Intel and 5.8fps on the Radeon.

Edited by MysteryX


I found the solution. The main issue was that I was using the flag D3DPRESENT_INTERVAL_DEFAULT instead of D3DPRESENT_INTERVAL_IMMEDIATE, which throttled performance as if I were presenting to a 60 Hz display (even though I don't display anything on screen).
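A minimal sketch of where that flag lives, assuming a windowed device used only for off-screen processing. `MakeOffscreenPresentParams` is a hypothetical helper, and the other field values are illustrative rather than taken from the post. The `#else` branch holds small stand-in definitions so the sketch compiles off Windows. On Windows the real ones come from `<d3d9.h>`.

```cpp
#ifdef _WIN32
#include <d3d9.h>
#else
// Stand-ins for non-Windows builds only. Values mirror the d3d9 headers.
#include <cstdint>
using HWND = void*;
constexpr std::uint32_t D3DPRESENT_INTERVAL_DEFAULT   = 0x00000000u;
constexpr std::uint32_t D3DPRESENT_INTERVAL_IMMEDIATE = 0x80000000u;
enum D3DSWAPEFFECT { D3DSWAPEFFECT_DISCARD = 1 };
struct D3DPRESENT_PARAMETERS {
    int Windowed;
    D3DSWAPEFFECT SwapEffect;
    HWND hDeviceWindow;
    std::uint32_t PresentationInterval;
};
#endif

// Hypothetical helper: fills the present parameters for a device that never
// shows anything on screen.
inline D3DPRESENT_PARAMETERS MakeOffscreenPresentParams(HWND window) {
    D3DPRESENT_PARAMETERS pp = {};  // zero-init leaves the interval at DEFAULT
    pp.Windowed = 1;                // TRUE
    pp.SwapEffect = D3DSWAPEFFECT_DISCARD;
    pp.hDeviceWindow = window;
    // Do not wait for the vertical retrace when presenting.
    pp.PresentationInterval = D3DPRESENT_INTERVAL_IMMEDIATE;
    return pp;
}
```

Because D3DPRESENT_INTERVAL_DEFAULT is zero, a zero-initialized structure selects it without any explicit assignment, which may be how the throttled setting ends up in code unnoticed. The structure is what gets passed to `IDirect3D9::CreateDevice`.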


From there, I did several optimizations.


With a single engine called by 8 threads, and a lock held only around the part that uses the renderer (not around the memory transfers in and out), I now get decent performance.
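The narrower lock scope can be sketched roughly as follows, assuming each calling thread works on its own staging copy so that only the shader stage touches shared state. `NarrowLockEngine`, `Frame`, and `RunShaders` are hypothetical names, and the doubling loop is a placeholder for the real passes. Whether the real transfers are safe outside the lock depends on how the device and buffers are set up, which the post does not say.

```cpp
#include <mutex>
#include <vector>

using Frame = std::vector<float>;  // hypothetical CPU-side frame buffer

class NarrowLockEngine {
public:
    // Safe to call from several threads at once.
    Frame GetFrame(const Frame& input) {
        Frame staging = input;  // "transfer in": per-call copy, no lock held
        {
            std::lock_guard<std::mutex> lock(renderMutex_);
            RunShaders(staging);  // shared renderer, one thread at a time
        }
        // "transfer out": copy to the caller's buffer, again without the lock
        return Frame(staging.begin(), staging.end());
    }

    // Number of frames rendered so far. Read after worker threads finish.
    int RenderCount() const { return renderCount_; }

private:
    void RunShaders(Frame& frame) {
        for (float& v : frame) v *= 2.0f;  // placeholder for the shader passes
        ++renderCount_;
    }

    std::mutex renderMutex_;
    int renderCount_ = 0;
};
```

The copies before and after the locked block can overlap across threads, so one thread can be staging its next frame while another holds the renderer.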


I get best performance by creating 2 devices and alternating between them. Frame 1 goes to device A, frame 2 goes to device B and so on.
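One way to express that alternation is to pick the device from the frame number. This is a sketch under assumptions: `Device` and `DevicePool` are hypothetical stand-ins, each device has its own lock, and frame numbers are non-negative. The counter increment is a placeholder for the actual rendering.

```cpp
#include <array>
#include <cstddef>
#include <mutex>

// Hypothetical stand-in for one DX9 engine instance with its own buffers.
struct Device {
    std::mutex mutex;
    int framesRendered = 0;
};

template <std::size_t N>
class DevicePool {
public:
    // Consecutive frame numbers map to different devices (n % N).
    // Assumes frameNumber >= 0.
    std::size_t IndexFor(int frameNumber) const {
        return static_cast<std::size_t>(frameNumber) % N;
    }

    // Renders on the device chosen for this frame and returns its index.
    std::size_t Render(int frameNumber) {
        const std::size_t i = IndexFor(frameNumber);
        std::lock_guard<std::mutex> lock(devices_[i].mutex);
        ++devices_[i].framesRendered;  // placeholder for the real work
        return i;
    }

    // Read after all worker threads have finished.
    int FramesOn(std::size_t i) const { return devices_[i].framesRendered; }

private:
    std::array<Device, N> devices_;
};
```

With `DevicePool<2>`, two threads working on neighbouring frames take different locks, so one device can render while the other is busy. Memory cost is two sets of buffers rather than one per CPU core.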


Now the performance is pretty good! Almost twice as fast as before.
