MysteryX

DX9 Creating Many Threads


I have this code that processes video frames in memory through HLSL pixel shaders and DirectX 9. There is no graphical interface; it is a DLL that runs within a video processing chain.
https://github.com/mysteryx93/AviSynthShader/blob/master/Src/D3D9RenderImpl.cpp

It is now working stably.

However, I realized that DX9 is creating 8 threads. That wouldn't be much of an issue on its own, but if I run 8 instances of the video processing script, the processing chain contains 12 shader calls, and each shader instance creates 8 threads, then I end up with 768 threads and massive memory consumption.

 

Why is DirectX 9 creating so many threads, and is there a way to prevent this?


In a debugger, can you see what those other threads' call stacks look like?

It's most likely your video driver, not D3D itself, which is creating them... Depending on the vendor, there might be some special hack where you can send a hint to the driver that you don't want it to attempt to perform any multi-threading optimizations. Alternatively, the user can provide such hints via their control panel.

What kind of GPU are you testing with?


I have a dual-graphics setup: an Intel HD 4000 with a Radeon 7670M.

 

Debugging a single instance, I see these threads: the main thread followed by 3 ntdll.dll threads, then 7 d3d9.dll threads, each followed by 2 ntdll.dll threads.


Creating the DX9 device with the D3DCREATE_DISABLE_PSGP_THREADING flag removes the extra threads... at least when running in VirtualDub with a Release build and pausing in Visual Studio.
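
For reference, here's roughly how I pass the flag at device creation. This is just a sketch; the names and presentation parameters are illustrative, not exactly what's in D3D9RenderImpl.cpp:

#include <d3d9.h>

// Sketch only: creates the device with the extra behavior flag.
HRESULT CreateDeviceWithoutPsgpThreads(IDirect3D9* pD3D9, HWND hWnd, IDirect3DDevice9** ppDevice)
{
    D3DPRESENT_PARAMETERS pp = {};
    pp.Windowed = TRUE;
    pp.SwapEffect = D3DSWAPEFFECT_DISCARD;
    pp.BackBufferFormat = D3DFMT_UNKNOWN;
    pp.BackBufferWidth = 1;    // dummy back buffer; rendering goes to off-screen surfaces
    pp.BackBufferHeight = 1;

    return pD3D9->CreateDevice(
        D3DADAPTER_DEFAULT,
        D3DDEVTYPE_HAL,
        hWnd,
        D3DCREATE_HARDWARE_VERTEXPROCESSING |
        D3DCREATE_DISABLE_PSGP_THREADING,    // keep the PSGP/software pipeline work off worker threads
        &pp,
        ppDevice);
}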

When I run a script in VirtualDub, I get 17 threads when I Pause in Visual Studio.

Running the exact same script with AvsMeter profiler gives me 37 threads!?

What the heck


I found out that the difference is whether the application is configured as "High Performance" (Radeon 7670M) or "Power Saving" (Intel HD 4000) in ATI's dual-graphics control panel.

 

If using the Radeon 7670M, my full script creates 471 threads, processes ~4.3fps @ 67% CPU usage, and has memory usage of 1170MB phys / 1723MB virt

 

If using the Intel HD 4000, my full script creates 150 threads, processes ~4.5fps @ 67% CPU usage, and has memory usage of 1604MB phys / 2189MB virt

 

I find it weird that the memory usage is higher when the number of threads is lower.


I find it weird that the memory usage is higher when the number of threads is lower.

 

Because resource usage isn't something you measure on a scale with only one axis. Take a totally hypothetical scenario: on platform A you get 5 threads each using 10MB of memory; on platform B you get 10 threads each using 5MB of memory. You shouldn't be surprised that two different hardware platforms have different resource usage characteristics.

 

The other question is: is this actually a problem? You shouldn't equate resource usage with performance: while you may well be able to pull a few tricks to get memory usage down, it may be at the expense of some (or much) of the performance that your program needs.

Edited by mhagain


I've been playing with the number of threads. I get similar performance with 4 and 8 threads, but 5, 6 or 7 result in lower performance. With 4 threads, the CPU is only running at 80%.

 

Because of the way threads and resources are being managed, I'm getting slightly better performance out of my Intel HD 4000 than out of my Radeon 7670M.

 

However, the Intel HD 4000's GPU usage is only 10-25%. Is there a way I can optimize the GPU usage?


If you're creating the threads yourself, then you probably want to specify a smaller stack size when you create them. The default is 1MB of address space per thread (which isn't the same as 1MB of memory allocated - the memory allocated will grow as needed up to the 1MB limit).
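
Something like this, for example (a sketch using plain Win32 CreateThread; the thread procedure and names are placeholders):

#include <windows.h>

// Placeholder thread procedure; the real one would do the frame processing.
DWORD WINAPI WorkerProc(LPVOID param);

// Sketch: create a worker thread with a 256KB stack reservation instead of the default 1MB.
HANDLE StartWorker()
{
    return CreateThread(
        nullptr,                            // default security attributes
        256 * 1024,                         // requested stack size
        WorkerProc,
        nullptr,                            // no thread parameter
        STACK_SIZE_PARAM_IS_A_RESERVATION,  // treat the size as the reservation, not the initial commit
        nullptr);                           // thread id not needed
}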

 

You can also change the default thread stack size, which can influence the stack size of threads that other code creates, if they use the default. It also changes the size of the main thread's stack. To do that, use the /STACK linker option. Note that stack sizes on x86 Windows are always a multiple of 64KB. There's also not much point in doing this for 64-bit code, as you have more than enough address space for 1MB stack reservations.
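
For example, to reserve 512KB by default (sizes are in bytes; 512KB is just an illustration):

Linker command line:  /STACK:524288
Or in a .def file:    STACKSIZE 524288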

 

If your 32-bit code is running out of address space, you can also move from 2GB to 4GB of address space by using the /LARGEADDRESSAWARE linker option. However this only gives you extra address space if the program is running on 64-bit Windows (or specially configured 32-bit Windows). You may also need to pass the D3DXCONSTTABLE_LARGEADDRESSAWARE flag to the D3DX shader compiler to make it 4GB compatible.
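
The linker side of that is just the /LARGEADDRESSAWARE switch; you can verify it took effect by running dumpbin /headers on the DLL (the name below is a placeholder) and checking that the file characteristics say it can handle large (>2GB) addresses.

link /LARGEADDRESSAWARE ...
dumpbin /headers YourPlugin.dll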


Adding a .def file with "STACKSIZE 512KB" doesn't change memory usage, but it does increase performance, especially when using the Radeon, making it a bit faster than the Intel.

 

After applying the 4GB patch, however, I still get an "access violation in d3dx9_43.dll".

 

I changed this code to use the flag, and it doesn't help. In fact, adding that flag makes it crash every time even if it's not running out of memory.

HR(D3DXGetShaderConstantTableEx(buffer, D3DXCONSTTABLE_LARGEADDRESSAWARE, &m_pPixelConstantTable));

 

Overall, performance is still very slow and GPU usage low.
 

Edit: After running a profiler, the bottleneck is in the ConvertToFloat() and ConvertFromFloat() functions, which convert YUV frame data into half-float RGB data. It might take someone who is good with assembly code to optimize that one...

Edited by MysteryX


If for some strange reason you can't use the GPU for the conversion, recent CPUs have intrinsics to convert back and forth between float and half quickly. See http://blogs.msdn.com/b/chuckw/archive/2012/09/11/directxmath-f16c-and-fma.asp

 

The intrinsics are _mm_cvtps_ph() and _mm_cvtph_ps(), which require F16C instruction set support.

 

Even if you can't use those intrinsics the other DirectXMath functions may be quicker than the functions you're currently using.
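
A rough sketch of what the F16C path could look like for a packed float buffer (untested; assumes F16C support has already been checked and that the count is a multiple of 4, and the function and buffer names are made up):

#include <immintrin.h>
#include <cstdint>

void FloatToHalfBuffer(const float* src, uint16_t* dst, size_t count)
{
    for (size_t i = 0; i < count; i += 4)
    {
        __m128 f = _mm_loadu_ps(src + i);                          // load 4 floats
        __m128i h = _mm_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT);    // convert to 4 halfs in the low 64 bits
        _mm_storel_epi64(reinterpret_cast<__m128i*>(dst + i), h);  // store those 8 bytes
    }
}

You'll need compiler settings that enable F16C (e.g. -mf16c with GCC/Clang), plus a scalar fallback for CPUs without it.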


I was doing the conversion pixel by pixel, which was taking forever. The updated code processes everything at once through a buffer, which is much faster.

 

Is there a performance difference between DirectX Math and the DX9 conversion function?

 

Or is there a way to convert straight from INT into half-float? Converting within the shader could be a good idea, but converting UINT16 to float16 would cause cropping.

 

 

As for the memory usage and the many threads and devices being created, I'll see if I can chain the various shaders one after the other within the same instance, reconfiguring the same device with a different shader at each step in the chain. This would probably drastically improve memory usage and performance.

 

Edit: I replaced the DX9 function with DirectX Math. Performance went up from 18.5fps to 20fps.
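
Roughly what the DirectXMath call looks like (the wrapper function and buffer names here are just for illustration, and it assumes tightly packed data):

#include <DirectXPackedVector.h>
using namespace DirectX::PackedVector;

// Convert a tightly packed float buffer to half-floats in one call.
// floatCount is the total number of float values (pixels * channels).
void FloatBufferToHalf(const float* floatBuffer, HALF* halfBuffer, size_t floatCount)
{
    XMConvertFloatToHalfStream(
        halfBuffer,     // destination half-float buffer
        sizeof(HALF),   // output stride in bytes
        floatBuffer,    // source float buffer
        sizeof(float),  // input stride in bytes
        floatCount);    // number of floats to convert
}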

Edited by MysteryX

