DX9 Creating Many Threads


I have this code that processes memory video frames through HLSL pixel shaders and DirectX 9. There is no graphical interface; it is a DLL that runs within a video processing chain.
https://github.com/mysteryx93/AviSynthShader/blob/master/Src/D3D9RenderImpl.cpp

It is now working stably.

However, I realized that DX9 is creating 8 threads. That wouldn't be much of an issue, except that if I run 8 instances of the video processing script, and the processing chain contains 12 calls to shaders, and each shader instance creates 8 threads, I end up with 768 threads and massive memory consumption.

Why is DirectX 9 creating so many threads, and is there a way to prevent this?


In a debugger, can you see what those other threads' call stacks look like?

It's most likely your video driver, not D3D itself, which is creating them... Depending on the vendor, there might be some special hack where you can send a hint to the driver that you don't want it to attempt to perform any multi-threading optimizations. Alternatively, the user can provide such hints via their control panel.

What kind of GPU are you testing with?

I have a dual-graphics setup: an Intel HD 4000 with a Radeon 7670M.

By debugging a single instance, I see these threads: the main thread followed by 3 ntdll.dll threads, then 7x d3d9.dll, each followed by 2 ntdll.dll threads.

Creating the DX9 device with the flag D3DCREATE_DISABLE_PSGP_THREADING removes the extra threads... at least when running in VirtualDub with a Release build and pausing in Visual Studio.
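
Here's roughly what that looks like at device creation. This is just a sketch with placeholder setup (the window handle and the other behavior flags are generic), not my actual D3D9RenderImpl code; only the D3DCREATE_DISABLE_PSGP_THREADING flag is the point:

#include <d3d9.h>

// Minimal sketch, not the real AviSynthShader setup.
IDirect3D9* d3d = Direct3DCreate9(D3D_SDK_VERSION);

D3DPRESENT_PARAMETERS pp = {};
pp.Windowed = TRUE;
pp.SwapEffect = D3DSWAPEFFECT_DISCARD;
pp.BackBufferFormat = D3DFMT_UNKNOWN;

IDirect3DDevice9* device = nullptr;
HRESULT hr = d3d->CreateDevice(
    D3DADAPTER_DEFAULT,
    D3DDEVTYPE_HAL,
    GetDesktopWindow(),                     // placeholder focus window for an offscreen device
    D3DCREATE_HARDWARE_VERTEXPROCESSING |
    D3DCREATE_DISABLE_PSGP_THREADING,       // keep PSGP computation on the calling thread
    &pp,
    &device);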

When I run a script in VirtualDub, I get 17 threads when I pause in Visual Studio.

Running the exact same script with AvsMeter profiler gives me 37 threads!?

What the heck

I found out that the difference is whether the application is configured as "High Performance" (Radeon 7670M) or "Power Saving" (Intel HD 4000) in ATI's dual-graphics control panel.

If using the Radeon 7670M, my full script creates 471 threads, processes ~4.3 fps at 67% CPU usage, and uses 1170 MB physical / 1723 MB virtual memory.

If using the Intel HD 4000, my full script creates 150 threads, processes ~4.5 fps at 67% CPU usage, and uses 1604 MB physical / 2189 MB virtual memory.

I find it weird that memory usage is higher when the number of threads is lower.

Have you tried whether it's faster when you run fewer than 8 instances of the script (since there are already more than enough threads for parallelization)?

I find it weird that memory usage is higher when the number of threads is lower.

Because resource usage isn't something you measure on a scale with only one axis. Take a totally hypothetical scenario: on platform A you get 5 threads each using 10 MB of memory, on platform B you get 10 threads each using 5 MB of memory. You shouldn't be surprised that two different hardware platforms have different resource usage characteristics.

The other question is: is this actually a problem? You shouldn't equate resource usage with performance: while you may well be able to pull a few tricks to get memory usage down, it may be at the expense of some (or much) of the performance that your program needs.


I've been playing with the number of threads. I get similar performance with 4 and 8 threads, but 5, 6 or 7 result in lower performance. With 4 threads, the CPU is only running at 80%.

Because of the way threads and resources are being managed, I'm getting slightly better performance off my Intel HD 4000 than off my Radeon 7670M.

However, the Intel HD 4000's GPU usage is only 10-25%. Is there a way I can optimize the GPU usage?

If you're creating the threads yourself, then you probably want to specify a smaller stack size when you create them. The default is 1MB of address space per thread (which isn't the same as 1MB of memory allocated - the memory allocated will grow as needed up to the 1MB limit).
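
For example, a sketch assuming you create the threads yourself with _beginthreadex (the worker function and the 256 KB figure are just placeholders):

#include <windows.h>
#include <process.h>

unsigned __stdcall WorkerThread(void*)
{
    // ... per-thread work ...
    return 0;
}

// Reserve 256 KB of address space for this thread's stack instead of the default 1 MB.
HANDLE hThread = (HANDLE)_beginthreadex(
    nullptr,
    256 * 1024,
    WorkerThread,
    nullptr,
    STACK_SIZE_PARAM_IS_A_RESERVATION,  // treat the size as the reservation, not just the initial commit
    nullptr);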

You can also change the default thread stack size, which can influence the size of threads that other code creates, if they use the default. It also changes the size of the stack on the main thread. To do that use the /stack linker option. Note that stack sizes on x86 Windows are always a multiple of 64K. There's also not much point in doing this for 64-bit code as you have more than enough address space for 1MB stack reservations.
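
For example, to reserve 512 KB instead of the default 1 MB (524288 bytes, a multiple of 64K; the rest of the link command is elided here):

link /STACK:524288 ...rest of the link command...

In Visual Studio the equivalent setting is under Linker > System > Stack Reserve Size.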

If your 32-bit code is running out of address space, you can also move from 2GB to 4GB of address space by using the /LARGEADDRESSAWARE linker option. However this only gives you extra address space if the program is running on 64-bit Windows (or specially configured 32-bit Windows). You may also need to pass the D3DXCONSTTABLE_LARGEADDRESSAWARE flag to the D3DX shader compiler to make it 4GB compatible.
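
As a sketch (MSVC toolchain assumed; the DLL name is just a placeholder), the flag can be set at link time, or patched onto an already-built 32-bit binary with editbin:

link /LARGEADDRESSAWARE ...rest of the link command...
editbin /LARGEADDRESSAWARE MyFilter.dll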

Adding a .def file with "STACKSIZE 512KB" doesn't change memory usage but does increase performance, especially when using the Radeon, making it a bit faster than the Intel.
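
In case it helps anyone, the .def file I mean is just a couple of lines like this (STACKSIZE takes a byte count; the LIBRARY name here is a placeholder for whatever the DLL is called):

LIBRARY Shader
STACKSIZE 524288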

After applying the 4GB patch, however, I still get an "access violation in d3dx9_43.dll".

I changed this code to use the flag, and it doesn't help. In fact, adding that flag makes it crash every time even if it's not running out of memory.

HR(D3DXGetShaderConstantTableEx(buffer, D3DXCONSTTABLE_LARGEADDRESSAWARE, &m_pPixelConstantTable));

Overall, performance is still very slow and GPU usage low.

Edit: After running a profiler, the bottleneck is in the ConvertToFloat() and ConvertFromFloat() functions, which convert YUV frame data into half-float RGB data. This might take someone who is good with assembly code to optimize.
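
Roughly, the conversion boils down to a per-plane loop like this. This is a simplified sketch, not my actual code: the plane layout, strides, and the YUV-to-RGB matrix step are assumed/omitted, and D3DXFloat32To16Array is used for the float-to-half step. The scalar inner loop is the natural candidate for SSE2 optimization:

#include <d3dx9.h>
#include <vector>

// Convert one 8-bit plane to half-float, one row at a time (illustrative only).
void ConvertPlaneToHalf(const BYTE* src, int srcPitch,
                        D3DXFLOAT16* dst, int dstPitchInElements,
                        int width, int height)
{
    std::vector<float> row(width);            // temporary row of 32-bit floats
    for (int y = 0; y < height; ++y)
    {
        const BYTE* s = src + y * srcPitch;
        for (int x = 0; x < width; ++x)
            row[x] = s[x] * (1.0f / 255.0f);  // 8-bit sample -> normalized float
        // Pack the whole row into 16-bit half-floats in one call.
        D3DXFloat32To16Array(dst + y * dstPitchInElements, row.data(), width);
    }
}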

