DX9 Doubling Memory Usage in x64

Started by
16 comments, last by MysteryX 7 years, 10 months ago

I have this code that takes video frame data, creates a series of in-memory video buffers, runs a series of HLSL shaders and returns the output frame data. It's working fine when running in x86 mode.

https://github.com/mysteryx93/AviSynthShader/blob/master/Src/D3D9RenderImpl.cpp

If I compile in x64, it still works, BUT I'm getting a memory usage of 2337MB instead of 1117MB in x86!!

What could be causing such an increase in memory usage? What takes a lot of memory is that the code runs 8 instances in parallel, so 8 DX9 engines get created and all the video buffers get created 8 times.

But getting MORE than double the memory usage in x64? Why?

Here are some benchmarks to give a better idea. In this case, I disabled Multi-Threading so it's running a single instance.

x86


Frames processed:               154 (0 - 153)
FPS (min | max | average):      4.267 | 5.818 | 5.094
Memory usage (phys | virt):     171 | 228 MiB
Thread count:                   26
CPU usage (average):            4%

x64


Frames processed:               160 (0 - 159)
FPS (min | max | average):      4.417 | 6.071 | 5.441
Memory usage (phys | virt):     333 | 381 MiB
Thread count:                   26
CPU usage (average):            4%

In this case the memory usage isn't so drastic but still is a lot higher.
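To put numbers on "a lot higher", the figures quoted above can be turned into x64/x86 ratios. This is only a small sketch: `usageRatio` and `printRatios` are hypothetical helper names, and the inputs are simply the values reported in this post.

```cpp
#include <cassert>
#include <cstdio>

// Ratio of x64 to x86 memory usage; both values must use the same unit.
double usageRatio(double x64Usage, double x86Usage) {
    assert(x86Usage > 0.0);
    return x64Usage / x86Usage;
}

// Prints the ratios for the figures quoted in the post.
void printRatios() {
    std::printf("8 instances, process total: %.2fx\n", usageRatio(2337.0, 1117.0));
    std::printf("1 instance, physical:       %.2fx\n", usageRatio(333.0, 171.0));
    std::printf("1 instance, virtual:        %.2fx\n", usageRatio(381.0, 228.0));
}
```

Calling `printRatios()` from `main` gives roughly 2.09x for the 8-instance run and 1.95x for the single-instance physical figure, so by this measure the relative growth is similar in both cases.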

Why do you think it's strictly about d3d? It might be your data.

If you operate with lots of pointers in structures - those will double their size in x86-64.
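As a rough illustration of that point, here is a minimal sketch comparing a pointer-heavy struct with a pointer-free one. `FrameNode` and `FrameInfo` are hypothetical types made up for this example; they are not taken from the linked project.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical node type: three pointers plus a 32-bit payload.
struct FrameNode {
    FrameNode* next;
    FrameNode* prev;
    unsigned char* pixels;
    std::uint32_t frameIndex;
};

// Hypothetical pointer-free type: four 32-bit fields.
struct FrameInfo {
    std::uint32_t width;
    std::uint32_t height;
    std::uint32_t pitch;
    std::uint32_t frameIndex;
};

// Prints the sizes for whichever target this is compiled for.
void printSizes() {
    // The node must at least hold its three pointers and its index.
    assert(sizeof(FrameNode) >= 3 * sizeof(void*) + sizeof(std::uint32_t));
    std::printf("pointer:   %zu bytes\n", sizeof(void*));
    std::printf("FrameNode: %zu bytes\n", sizeof(FrameNode));
    std::printf("FrameInfo: %zu bytes\n", sizeof(FrameInfo));
}
```

With common compilers `FrameNode` is typically 16 bytes in a 32-bit build and 32 bytes in a 64-bit build (pointers grow from 4 to 8 bytes, plus padding), while `FrameInfo` stays at 16 bytes in both. Only data that actually stores pointers or pointer-sized integers grows this way.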

You'd better first check whether it's really d3d data (with PIX, maybe) or your own.

Adding to what vstrakh suggested, your VERTEX structure could be responsible if it uses some datatype that expands, such as 32-bit floats on x86 to 64-bit floats on x64. If you have big vertex buffers, that would potentially double their total size.

I don't have lots of pointers. What takes a lot of memory are video frame buffers and DX9-related objects so I'm pretty sure the difference is related to that.

As for VERTEX... I honestly don't understand how they work. They just have to be set up that way for the code to work with plain 2D frames. Could this really be causing so much memory usage? If so, can it be worked around?

You're saying that FLOAT becomes 64-bit instead of 32-bit? That could definitely be a problem.

Searching Google, however, doesn't turn up anything indicating that FLOAT behaves any differently on the x64 platform. Float is 32-bit, Double is 64-bit, and from what I'm reading, that remains the same.

Although C++ doesn't clearly define the size of floating-point data types, Microsoft has it well-defined here

I just ran a quick test on the ConvertToShader and ConvertFromShader code alone (running 3 times), without running through DX9 shaders.

x86 gave a memory usage of 28MB and 153fps. x64 gave a memory usage of 27MB and 185fps. No problem here. Which means the problem is definitely related to the code executing DX9 shaders.

I can guess at one possibility. Some time ago Microsoft optimized the address space usage of D3D9 on Vista (see https://support.microsoft.com/en-gb/kb/940105). It's possible that this optimization was only applied to x86, as you're not going to run out of address space on x64.

Is this extra memory usage actually causing a significant, measurable performance issue? If not I wouldn't worry about it.

If you really want to investigate what's going on, I'd suggest creating the simplest possible test program that shows the memory usage difference, and using a tool like https://technet.microsoft.com/en-us/sysinternals/vmmap.aspx to investigate how memory gets allocated differently.

I have run VMMap. Here is the result.

32-bit


Type         Size        Committed   Private   Total WS  Private WS  Shareable WS  Shared WS  Locked WS  Blocks  Largest     
Total        1,332,764   1,092,604   910,776   549,472   511,088     38,384        11,308                2701    
Image        196,132     195,828     24,736    44,500    8,752       35,748        8,728                 616     39,016
Mapped File  4,956       4,956                 456                   456           448                   4       3,292
Shareable    25,840      5,780                 2,176                 2,176         2,128                 39      20,480
Heap         686,452     660,384     660,384   331,516   331,516                                         1042    16,192
Managed Heap                                                                                                     
Stack        167,680     9,084       9,084     4,008     4,008                                           786     1,024
Private Data 210,428     192,480     192,480   142,724   142,720     4             4                     214     8,192
Page Table   24,092      24,092      24,092    24,092    24,092                                                  
Unusable     17,184                                                                                              60
Free         2,885,568                                                                                   66      2,079,936

64-bit


Type         Size              Committed   Private     Total WS    Private WS  Shareable WS  Shared WS  Locked WS  Blocks  Largest           
Total        2,719,536         2,496,020   2,274,976   1,922,640   1,874,980   47,660        11,260                2405    
Image        232,656           232,656     22,280      48,456      3,560       44,896        8,588                 723     46,888
Mapped File  4,956             4,956                   476                     476           432                   4       3,292
Shareable    25,708            5,648                   2,280                   2,280         2,232                 35      20,480
Heap         735,564           705,844     705,780     385,196     385,192     4             4                     761     16,192
Managed Heap                                                                                                               
Stack        139,264           5,520       5,520       3,888       3,888                                           408     1,024
Private Data 1,552,624         1,535,908   1,535,908   1,476,856   1,476,852   4             4                     474     12,384
Page Table   5,488             5,488       5,488       5,488       5,488                                                   
Unusable     23,276                                                                                                        60
Free         137,436,239,360                                                                                       64      137,393,457,984

I have also tried running it with no extra memory available, and it didn't reduce the memory usage of this process.

Any idea from here?

The VERTEX structure contains floats, which are 32-bit on both platforms.
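This is easy to check at compile time. The sketch below assumes a typical pre-transformed 2D vertex layout; the `Vertex` struct and `vertexBufferBytes` helper are hypothetical stand-ins, and the project's real VERTEX may have different fields.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>

// Hypothetical vertex layout for a screen-space quad; an assumption, not the project's actual struct.
struct Vertex {
    float x, y, z, rhw;  // pre-transformed position
    float u, v;          // texture coordinates
};

// These fail to compile if float is not 32 bits or if the struct picks up padding.
static_assert(sizeof(float) * CHAR_BIT == 32, "float is expected to be 32 bits");
static_assert(sizeof(Vertex) == 6 * sizeof(float), "no padding expected in an all-float struct");

// Bytes needed for a vertex buffer holding `count` vertices.
std::size_t vertexBufferBytes(std::size_t count) {
    assert(count <= SIZE_MAX / sizeof(Vertex));  // guard against overflow
    return count * sizeof(Vertex);
}
```

If both the x86 and x64 builds compile with these assertions in place, the vertex data is the same size on both. A single quad is also only a handful of vertices, so under this layout the vertex buffer would be tiny next to the frame textures.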

Looking at the VMMap data, the main difference between the two is in the "Private Data" section - there's over 1GB extra in there in the x64 version.

According to the VMMap documentation:

Private memory is memory allocated by VirtualAlloc and not suballocated either by the Heap Manager or the .NET run time. It cannot be shared with other processes, is charged against the system commit limit, and typically contains application data.

Assuming you're not calling VirtualAlloc() directly yourself, it's probably allocated by either D3D or the graphics driver.

In addition, if you look at the details of that private data, it's made up of large numbers of small allocations. On x64 there are about twice as many of them, and they average about three times the size (210,428 KB / 214 blocks ≈ 983.3 KB; 1,552,624 KB / 474 blocks ≈ 3,275.6 KB).
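The same arithmetic as a small sketch, using the "Private Data" rows from the two VMMap dumps above (VMMap reports sizes in KB). `averageBlockKB` and `printPrivateDataSummary` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdio>

// Average size of one block in KB, given a VMMap "Size" total (KB) and a "Blocks" count.
double averageBlockKB(double totalKB, unsigned blocks) {
    assert(blocks > 0);
    return totalKB / blocks;
}

// Compares the Private Data rows of the 32-bit and 64-bit dumps.
void printPrivateDataSummary() {
    const double x86Avg = averageBlockKB(210428.0, 214);
    const double x64Avg = averageBlockKB(1552624.0, 474);
    std::printf("x86: %.1f KB per block\n", x86Avg);
    std::printf("x64: %.1f KB per block\n", x64Avg);
    std::printf("block count ratio: %.2fx\n", 474.0 / 214.0);
    std::printf("block size ratio:  %.2fx\n", x64Avg / x86Avg);
}
```

This works out to roughly 2.2 times as many blocks, each about 3.3 times as large on average.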

It could either be memory leaks (which the debug runtime should complain about), or there could be a difference in behaviour between the two.

I believe WPA / XPERF should be able to give you call stacks for the VirtualAlloc calls, but I'm not sure on the details. You could also try breakpointing it in the debugger.

The code itself is fairly simple, behaves the same way and provides the same output.

I'm 99% sure that the difference is in the way DX9 or the video driver handles its memory internally.

Yet knowing this doesn't solve the problem.

What eats up memory are the many texture buffers used to pass video frames through the various processing steps. The memory gets allocated during initialization by calling CreateInputTexture and gets released only after it is done processing all the frames.

I tried creating the device with D3DCREATE_DISABLE_PSGP_THREADING but that doesn't help.

I ran it through the Visual Studio 2015 debugger to analyze the memory allocations. Here is the result.

[Screenshot: Avs_Shader_Memory64.png]

Although the process is taking 2GB, what Visual Studio reports here seems more like the normal memory usage that I should expect. It's not tracking the excess memory.

This topic is closed to new replies.
