DirectX 11 - Dual GPU (5970) Accumulation problems

Started by OctavianTheFirst
8 comments, last by OctavianTheFirst 14 years, 1 month ago
Hi, I'm trying to accumulate a texture over time and it looks like it's missing some frames. I suspect it's related to the two GPUs working on duplicate resources, with each GPU computing a different frame: GPU0 computes frames 0, 2, 4, 6, 8... and GPU1 computes 1, 3, 5, 7... This works great if nothing in the renderer depends on the previous frame, but I'm in a situation where I need to accumulate frames, so I need 0 + 1 + 2 + 3 + ...

I'm rendering the current frame to a texture, then I use the "Accumulate" shader to add it to the accumulation texture (A + B -> C; C -> B). It looks like a basic thing to do, but the results are 0 + 2 + 4... for even frames and 1 + 3 + 5... for odd frames. Here's a link where the first pic shows the accumulation with missing frames, and a second where two consecutive frames are blended together: http://jalbum.net/browse/recent/album/506418/

So what would be the best way to set up a bug-free accumulation buffer on a dual-GPU card? I'm considering using an RWStructuredBuffer instead of a texture and doing the summing in a compute shader. The N-Body sample (where each frame obviously depends on the previous frame) runs smoothly on dual GPUs at 2x100% usage, so I suppose RWStructuredBuffers don't suffer from the problem I'm having.

System: ATI 5970, Windows 7, DirectX SDK (February 2010), VS 2008

Accumulation shader:

float4 PS_Accumulate(VSOut sv) : SV_Target
{
    float4 v0 = t0.SampleLevel(s0p, sv.tcoord0.xy, 0); // new frame
    float4 v1 = t1.SampleLevel(s0p, sv.tcoord0.xy, 0); // accumulation buffer
    return v0 * 1.f + v1 * .5f;
}

The textures are created in a standard way, using D3D11_BIND_RENDER_TARGET | D3D11_BIND_SHADER_RESOURCE.

PS: The same thing works just fine in DX9, but only uses half of the GPU.

Thank you

[Edited by - OctavianTheFirst on February 26, 2010 12:51:27 AM]
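For reference, the C++ side of the pass looks roughly like this (a sketch only; the variable names and DrawFullscreenQuad are placeholders, not my exact code):

// A = current frame (t0), B = accumulation so far (t1), C = output target.
// All three textures use D3D11_BIND_RENDER_TARGET | D3D11_BIND_SHADER_RESOURCE.
void AccumulatePass(ID3D11DeviceContext* ctx,
                    ID3D11ShaderResourceView* frameSRV,  // t0: new frame (A)
                    ID3D11ShaderResourceView* accumSRV,  // t1: accumulation (B)
                    ID3D11RenderTargetView*   outRTV)    // C: receives A + B
{
    ID3D11ShaderResourceView* srvs[2] = { frameSRV, accumSRV };
    ctx->OMSetRenderTargets(1, &outRTV, NULL);
    ctx->PSSetShaderResources(0, 2, srvs);
    DrawFullscreenQuad(ctx);  // placeholder: draws a quad with PS_Accumulate bound

    // Unbind the inputs so C can take B's place for the next frame (C -> B).
    ID3D11ShaderResourceView* nullSRVs[2] = { NULL, NULL };
    ctx->PSSetShaderResources(0, 2, nullSRVs);
}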
Does the accumulation work properly if you use the reference rasterizer? I would try that out first to ensure that the problem is what you think it is.

Also, are you using two devices to run the two GPUs or is everything done in the driver? You might also like to try recording several frames in PIX so that you can look at the contents of the accumulation frame buffer and see what is happening after each frame is rendered. The latest version of PIX is supposed to be able to work with DX11, although I have had trouble getting it to work properly on my machine...
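If it helps, the quickest way to rule the hardware path out is to create the device with the reference driver type. A sketch below; the DXUT samples have their own device-settings path, but the raw call is just:

#include <d3d11.h>

// Create a reference-rasterizer device so the multi-GPU driver path
// is taken out of the equation entirely.
ID3D11Device*        pDevice  = NULL;
ID3D11DeviceContext* pContext = NULL;
D3D_FEATURE_LEVEL    level;

HRESULT hr = D3D11CreateDevice(
    NULL,                       // default adapter
    D3D_DRIVER_TYPE_REFERENCE,  // REF instead of D3D_DRIVER_TYPE_HARDWARE
    NULL, 0,                    // no software rasterizer DLL, no flags
    NULL, 0,                    // default feature levels
    D3D11_SDK_VERSION,
    &pDevice, &level, &pContext);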
I'm using a single device. Also, this happens in fullscreen mode only. When I'm windowed, the results are fine but the GPU usage drops to 50%.

To solve this issue, I implemented an accumulation filter in the HDRFormats10 sample, and it seems to work just fine. Therefore, I tend to think I'm doing something wrong when setting up my Window / DX device.

Some suggested I should set the Crossfire mode to Supertiling or Split-Frame instead of Alternate Frame. Is this something I can select when creating the device? I didn't find anything about it in Catalyst.


[Edited by - OctavianTheFirst on February 26, 2010 3:34:57 PM]
You've more or less answered the question yourself. DirectX has no special functionality for multi-GPU systems. If you want to control how things work together through DirectX, you will have to create more than one device and do the synchronization/sharing of surfaces between the two devices yourself.

IHVs do the best they can to integrate their SLI/Crossfire options with existing applications, but some shader techniques just won't work with the options they provide. Your alternate-frame choice is one of those that is going to cause trouble. Imagine the work required by the driver to figure out what you're doing between frames, how that impacts the result, and then to take the steps to synchronize resources for you... That would be one impressive driver. Still, it would defeat the intent of alternating frames: the GPUs can't work entirely independently once they have to synchronize on a shared resource.

Using the tiling choice should yield correct results.

You won't be able to repro this issue using REF/WARP, or any other single GPU setup.
I've done some more testing about this, and here are the results:

There is no problem with the Window / Device creation; the DX11 SDK samples behave the same way.

If you remove the back-buffer color clear in BasicHLSL11_2008 and replace it with a sleep,

ID3D11RenderTargetView* pRTV = DXUTGetD3D11RenderTargetView();
// the clear is commented out and replaced with a sleep:
// pd3dImmediateContext->ClearRenderTargetView( pRTV, ClearColor );
Sleep(200);
ID3D11DepthStencilView* pDSV = DXUTGetD3D11DepthStencilView();

you will see whether there are actually two textures instead of one. If there are two, you will see each one interleaved with a 0.2 s delay. This happens for any texture you allocate, and using compute shaders to read/write the textures doesn't change a thing.

I'd like to have a function like texture->UpdateToLatestVersion() that would copy the pixels from the GPU with the most recent version to the one holding the oldest version. If done well, it wouldn't block the rendering pipeline:

Frame 2 ______XXXX^ GPU0
Frame 1 __XXXX^___| GPU1
Frame 0 XXXX__|___Copy GPU0
Copy

All it needs to do is to interrupt the other processor and grab a copy of the texture. With DMA it could even do it without interrupting at all.

Also, I think that for StructuredBuffers, Append/Consume buffers, and anything else that has to do with general computing on the GPU, the driver makes sure you're always working on the latest revision. I'll give it a try shortly; this is my last hope :)
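The setup I'm planning to try looks roughly like this (just a sketch; pDevice, width/height and the shader side are placeholders):

// One float4 per pixel, accumulated in place by a compute shader
// instead of ping-ponging render targets.
D3D11_BUFFER_DESC bd = {0};
bd.ByteWidth           = width * height * 4 * sizeof(float);
bd.Usage               = D3D11_USAGE_DEFAULT;
bd.BindFlags           = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
bd.MiscFlags           = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
bd.StructureByteStride = 4 * sizeof(float);

ID3D11Buffer* pAccumBuffer = NULL;
pDevice->CreateBuffer(&bd, NULL, &pAccumBuffer);

D3D11_UNORDERED_ACCESS_VIEW_DESC uav;
uav.Format              = DXGI_FORMAT_UNKNOWN;         // required for structured buffers
uav.ViewDimension       = D3D11_UAV_DIMENSION_BUFFER;
uav.Buffer.FirstElement = 0;
uav.Buffer.NumElements  = width * height;
uav.Buffer.Flags        = 0;

ID3D11UnorderedAccessView* pAccumUAV = NULL;
pDevice->CreateUnorderedAccessView(pAccumBuffer, &uav, &pAccumUAV);

// Compute shader side: RWStructuredBuffer<float4> accum : register(u0);
// each frame a Dispatch() adds the new frame into accum[pixelIndex].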

Question: If I allocate two devices, will I be able to transfer a texture from one GPU to the other without going through main memory?

Question 2: How do I tell ATI/DirectX to "use the tiling choice"?

[Edited by - OctavianTheFirst on February 26, 2010 10:36:12 PM]
I tried the accumulation in fullscreen DX10 and fullscreen DX11 (by modding the HDR ToneMapping demo) and it WORKS with DX10 but doesn't work with DX11.

I understand I need an impressive driver, but how come they figured it out for DX10 and then, whoops, it doesn't work anymore in DX11?

It also looks like a driver bug.
Well, D3D11/DXGI gives you the ability to grab a keyed mutex from a shared resource. The mutex lets you lock a resource to an individual device for reads and writes. There might be some limitation on sharing across multiple physical adapters, but I've never tried it.

You could probably avoid blocking if you created a GPU 'staged' resource. This resource would be the shared surface used to transfer the data back and forth between GPUs. When frame 0 is complete, GPU0 can lock the resource, do a CopyResource() on the GPU, then unlock it. Next, GPU1 can lock the resource, copy out the contents with CopyResource() again, and then unlock it. Depending on the parallel nature of the algorithms and how they overlap, you might not have to pay for any syncing. However, if the second GPU depends on the finished results of the first GPU, you may end up with something that runs as though it were serial execution, unless each GPU can do 99% of its frame execution before it needs results from the other.
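A rough sketch of what I mean below. deviceA/deviceB, contextA/contextB and the two accumulation textures (pAccumOnA / pAccumOnB) are placeholders, and I haven't verified that the keyed-mutex path actually works across two physical adapters:

// Device A creates the shared staging texture, protected by a keyed mutex.
D3D11_TEXTURE2D_DESC td = {0};
td.Width            = width;
td.Height           = height;
td.MipLevels        = 1;
td.ArraySize        = 1;
td.Format           = DXGI_FORMAT_R16G16B16A16_FLOAT;
td.SampleDesc.Count = 1;
td.Usage            = D3D11_USAGE_DEFAULT;
td.BindFlags        = D3D11_BIND_SHADER_RESOURCE;
td.MiscFlags        = D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX;

ID3D11Texture2D* pShared = NULL;
deviceA->CreateTexture2D(&td, NULL, &pShared);

// Pull out the shareable handle and open the same surface on device B.
IDXGIResource* pDXGIRes = NULL;
pShared->QueryInterface(__uuidof(IDXGIResource), (void**)&pDXGIRes);
HANDLE hShared = NULL;
pDXGIRes->GetSharedHandle(&hShared);

ID3D11Texture2D* pSharedOnB = NULL;
deviceB->OpenSharedResource(hShared, __uuidof(ID3D11Texture2D), (void**)&pSharedOnB);

// Producer (device A): acquire key 0, copy the finished frame in, release as key 1.
IDXGIKeyedMutex* pMutexA = NULL;
pShared->QueryInterface(__uuidof(IDXGIKeyedMutex), (void**)&pMutexA);
pMutexA->AcquireSync(0, INFINITE);
contextA->CopyResource(pShared, pAccumOnA);
pMutexA->ReleaseSync(1);

// Consumer (device B): wait for key 1, copy the contents out, hand key 0 back.
IDXGIKeyedMutex* pMutexB = NULL;
pSharedOnB->QueryInterface(__uuidof(IDXGIKeyedMutex), (void**)&pMutexB);
pMutexB->AcquireSync(1, INFINITE);
contextB->CopyResource(pAccumOnB, pSharedOnB);
pMutexB->ReleaseSync(0);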

Several devices running on a single adapter probably don't even have to do a DMA transfer, since they should both have access to the same GPU memory. In an SLI or Crossfire system the driver probably does have to do a DMA transfer if the data needs to be on a different physical device; the driver should handle those details. In the case of a dual-GPU card it might be a pretty fast sync.

Since DirectX doesn't have an API for managing these sorts of scenarios, you'd have to rely on NVIDIA or ATI publishing some sort of shim API to provide access to these features, or perhaps they have a control panel that lets you make the choice.

From what I understand most applications don't/can't leverage multi-GPU technology by default. The drivers often detect what application is running and then will enable specific paths for that application in order to enable the appropriate functionality. Meaning the driver has a list of games that it knows how to use SLI/CrossFire/dualGPU with.

A driver writer might be able to confirm/deny that. Anyway, I'm just not sure what options you have for controlling these features.

I couldn't comment on whether or not the driver writers are treating the resources (textures vs structured buffers) differently, but I can't think of a reason why they would have to do this. It may depend on how important the scenario is for them and so perhaps there has been more effort put into optimizing structured buffers.

Quote:Original post by DieterVW

From what I understand most applications don't/can't leverage multi-GPU technology by default. The drivers often detect what application is running and then will enable specific paths for that application in order to enable the appropriate functionality. Meaning the driver has a list of games that it knows how to use SLI/CrossFire/dualGPU with.

A driver writer might be able to confirm/deny that. Anyway, I'm just not sure what options you have for controlling these features.



That's pretty much how it works for Nvidia. Their control panel has a list of "known" apps with SLI profiles, and it will apply the profile if your app matches one on the list. Otherwise, by default you only get one GPU. They do have an API (NVAPI) that lets you explicitly control which profile your app gets, but you need to be a registered Nvidia developer to get access to the bits that let you control SLI.

I don't know how ATI works, but I'd imagine it's similar.

"unless each GPU can do 99% of its frame execution before it needs results from the other."

That's pretty much the case :)

But I don't agree: if 50% of the frame execution happens before it needs results from the other, that's enough to prevent blocking.

____11111111 Computing frame 1 on GPU1
0000000022222222 Computing frame 0 then frame 2 on GPU0

I'm not sure that both GPUs share the same memory. The way this graphics card was described was more like "Crossfire on the same card", and I've also heard it can effectively use only 1 GB out of the 2 GB available, which would mean each GPU can access 1 GB.

Also, I might have been a little optimistic when I said it all works fine in DX10. It looks that way, but I will run some more tests when I get back (I'm away for the weekend...)

[Edited by - OctavianTheFirst on February 27, 2010 2:45:22 PM]
Solved:

Using CopyResource(dest, src) instead of rendering into the destination as a render target marks dest as dirty, and the driver transfers the pixels from one GPU to the other at 4 GB/s.
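In other words, the only change is in the C -> B step (sketch, names illustrative):

// Before (broken with AFR): updating the accumulation buffer by rendering
// into it as a render target, so each GPU quietly kept its own copy.
// After (works): an explicit copy; the driver sees dest as dirty and
// transfers the pixels between the two GPUs.
pd3dImmediateContext->CopyResource(pAccumTex /* dest: B */, pAccumOutTex /* src: C */);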

