[DX11] Device recreation best practice

4 comments, last by Jason Z 12 years, 4 months ago
Hi all!

I have a 2D visualization app (D3D11 with Direct2D interop) whose device may get reset, lost, or removed, so I have to recreate my device and resources. The app is critical and should run as long as possible, or at least quit with an error message. The question is whether I should keep trying to recreate the device indefinitely (with nothing displayed to the user in the meantime), or stop after 1000 tries, or after a minute. What do other programs do (I don't mean Microsoft MSDN samples, but real shipped/professional applications and games)?

I already do a software fallback to WARP if no HW is available or if recreation failed.

As far as I know, a device reset can occur for the following reasons (some of them may be incorrect, and there are at least a few more that I forgot): the driver hangs and gets recovered (never had this, couldn't reproduce it), driver update/installation (couldn't reproduce it), adapter/device/driver gets removed (never had this), internal errors (had that, but don't ask me what is needed to reproduce it), insufficient memory (especially often with WARP, so not too hard to reproduce), ...

thx,
Vertex
Wait one second before recreating the device, and if it fails wait ten seconds before trying again, and then just one try each minute, or something along those lines.
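A minimal sketch of that schedule, in case it helps. The function name `retryDelay` and the idea of counting failed attempts are my own assumptions for illustration; the concrete numbers are just the "one second, ten seconds, then once a minute" suggestion and can be tuned freely.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: how long to wait before the next recreation attempt.
// failedAttempts = number of attempts that already failed since the device was lost.
std::chrono::seconds retryDelay(int failedAttempts)
{
    assert(failedAttempts >= 0);
    using std::chrono::seconds;
    if (failedAttempts == 0) return seconds(1);   // first attempt: wait one second
    if (failedAttempts == 1) return seconds(10);  // second attempt: wait ten seconds
    return seconds(60);                           // afterwards: one try each minute
}
```

The caller would wait for `retryDelay(n)`, try to create the device, and increment `n` on failure; on success `n` goes back to zero.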

Thx Erik!

So you would never stop trying to recreate the device?
The problem with the app is that it is kind of a user interface to a "control", so it is important to see graphics. A very long time without seeing anything is not user friendly; in other words, the user doesn't know what is happening. On the one hand we should not close our application; on the other hand the user should not sit in front of our app without seeing anything, and transitions etc. that need device recreation should work!

Recreation is a very important part of our app (as described here). You can't totally prevent device losses/resets, so we should handle them as well as we can. Hence I am very interested in how other applications/games do that.
You could display a message with GDI or a dialog box while waiting, saying that the device was lost and is waiting to be recreated, or even a button to attempt recreation right away. Perhaps store the times from creation until the device is removed, and if you find that it always gets removed in less than some number of minutes, stop trying to recreate it until the user asks for it.
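The lifetime bookkeeping could look roughly like this. Everything here (the class name `DeviceLifetimeTracker`, the window size, the threshold) is hypothetical and only meant to illustrate the idea: remember how long the last few devices survived, and pause automatic retries once all of them died too quickly.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>

// Hypothetical tracker: remembers how long each of the most recent devices lived.
class DeviceLifetimeTracker {
public:
    DeviceLifetimeTracker(std::chrono::seconds minHealthyLifetime, std::size_t window)
        : minHealthy_(minHealthyLifetime), window_(window)
    {
        assert(window > 0);
    }

    // Call when a device is removed, passing the time since it was created.
    void recordLifetime(std::chrono::seconds lifetime)
    {
        lifetimes_.push_back(lifetime);
        if (lifetimes_.size() > window_) lifetimes_.pop_front(); // keep only the newest entries
    }

    // True once the window is full and every recorded device died too early.
    bool shouldPauseAutoRetry() const
    {
        if (lifetimes_.size() < window_) return false;
        for (auto t : lifetimes_)
            if (t >= minHealthy_) return false;
        return true;
    }

private:
    std::chrono::seconds minHealthy_;
    std::size_t window_;
    std::deque<std::chrono::seconds> lifetimes_;
};
```

With something like `DeviceLifetimeTracker tracker(std::chrono::minutes(5), 3);` the app would stop retrying on its own after three devices in a row that each lasted less than five minutes.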

Thx again Erik!

That already sounds pretty nice, but we are not allowed to show a message box in the application (unless we close the app): nothing may interrupt user interaction. The graphics are important, but the user should not have to interact with anything for the graphics' sake. The application may simply run for days without anyone being there (in that case the graphics don't need to be drawn, but we cannot ask the user to do anything or block any part of the app).

The idea with the time measurement is good. Thx! Nevertheless, it is hard to choose the time thresholds, and there must be no user interaction or dependency on the user.

Vertex
If I were designing such a system, I would list each of the ways that a device could be 'lost', and then decide on an appropriate measure for each of those cases while considering the constraints you described. According to this page, the WDDM driver model only allows the following reasons for 'losing' a device:

"WDDM now provides a GPU memory manager and scheduler that allow multiple applications access to the GPU simultaneously. Because Direct3D applications no longer require exclusive access to the GPU it is possible to switch focus between applications with little penalty. Under WDDM Direct3D devices are only lost during driver upgrades, physical removal of the device, GPU reset and unexpected errors."

So each of these cases has its own semantic meaning, which means it should be handled specially:

  1. Driver Upgrades: This should only happen when someone is updating your control machine, which means you can assume someone is there. Displaying a non-interactive message window to inform them that the driver is being upgraded should be possible (if you can't already assume that your app will be shut down for driver upgrades).
  2. Physical Removal of the Device: This follows the same logic as #1 - I think you can assume a technician is there and working on the machine if it is removed. This also assumes that the video card doesn't fall out of the machine, but that should be extremely rare...
  3. GPU Reset: I have seen this occur when there is a driver bug, or if a compute shader takes a very long time to execute. In this case, it is a systemic problem and will probably not go away on its own.
  4. Unexpected Errors: They are unexpected, so you don't know what to expect :)

As for #1 and #2, I don't think you need to worry about them. They are extremely unlikely to occur unless someone is doing maintenance on your machine. #3 is likely to occur only in error or logic-error situations, meaning you need to test against any possible erroneous inputs, etc. Still, you won't know this one is coming until it hits you, but it should also only occur under special conditions. #4 is much the same as #3: these are conditions that will probably require a special situation to produce the problem, which I would assume is a non-repeating condition.
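To make the four cases concrete, here is a small sketch that encodes the reasoning above as a lookup. The enum, the struct, and the function name are all hypothetical. On a real D3D11 device you would start from `ID3D11Device::GetDeviceRemovedReason`, which reports codes such as `DXGI_ERROR_DEVICE_REMOVED`, `DXGI_ERROR_DEVICE_HUNG`, `DXGI_ERROR_DEVICE_RESET` and `DXGI_ERROR_DRIVER_INTERNAL_ERROR`; how those map onto the four categories is something I am leaving open here.

```cpp
#include <cassert>

// Hypothetical categories mirroring the four WDDM cases listed above.
enum class LossReason { DriverUpgrade, PhysicalRemoval, GpuReset, Unexpected };

// Hypothetical summary of what each case implies for the recovery logic.
struct LossPolicy {
    bool technicianLikelyPresent; // someone is probably working on the machine
    bool likelyToRecur;           // systemic problem that will not go away on its own
};

LossPolicy policyFor(LossReason reason)
{
    switch (reason) {
    case LossReason::DriverUpgrade:   return {true,  false}; // #1: maintenance in progress
    case LossReason::PhysicalRemoval: return {true,  false}; // #2: same logic as #1
    case LossReason::GpuReset:        return {false, true};  // #3: driver bug or long-running shader
    case LossReason::Unexpected:      return {false, false}; // #4: assumed to be non-repeating
    }
    assert(false && "unknown LossReason");
    return {false, false};
}
```

The recovery code can then decide, for example, to show a non-interactive status window when `technicianLikelyPresent` is set, and to move to a fallback sooner when `likelyToRecur` is set.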

So my advice would be to retry creating the device every 5 seconds for up to 1 or 2 minutes. As soon as the first error occurs, I would open a GDI window with just a message indicating what is happening and showing the current status of the recreation (i.e. retrying in 5, 4, 3, 2, 1...). If it doesn't get corrected within that period of time, you need to assume something really bad has happened, and then switch to your backup with the WARP device. Please keep in mind that a WARP device doesn't give you a 100% guarantee that the device will be created correctly either, i.e. if something happens to the DLL that it resides in, then you are stuck too... If this is a mission-critical type of setup, I would even provide a GDI fallback for very basic, but functional, interface graphics.
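Roughly, the retry-then-fall-back sequence could be sketched like this. The hooks are hypothetical stand-ins so the logic stays independent of the actual API: in a real app `tryCreateHardware` and `tryCreateWarp` would presumably wrap `D3D11CreateDevice` with `D3D_DRIVER_TYPE_HARDWARE` and `D3D_DRIVER_TYPE_WARP`, and `wait` would sleep while updating the countdown message. The 5 second interval and 2 minute budget are just the numbers from the advice above.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

enum class Backend { Hardware, Warp, Gdi };

// Hypothetical hooks; each "try" function returns true if that device was created.
struct RecoveryHooks {
    std::function<bool()> tryCreateHardware;
    std::function<bool()> tryCreateWarp;
    std::function<void(std::chrono::seconds)> wait; // e.g. sleep and update a status window
};

// Retries the hardware device at a fixed interval until the time budget is used up,
// then falls back to WARP, and finally to plain GDI drawing.
Backend recoverDevice(const RecoveryHooks& hooks,
                      std::chrono::seconds interval = std::chrono::seconds(5),
                      std::chrono::seconds budget = std::chrono::seconds(120))
{
    assert(interval.count() > 0);
    for (std::chrono::seconds elapsed(0); elapsed < budget; elapsed += interval) {
        if (hooks.tryCreateHardware()) return Backend::Hardware;
        hooks.wait(interval); // give the driver or technician time before the next try
    }
    if (hooks.tryCreateWarp()) return Backend::Warp;
    return Backend::Gdi; // last resort: very basic but functional interface graphics
}
```

Passing the waiting and creation steps in as callables also makes the policy easy to test without a GPU, which seems worthwhile for a mission-critical setup.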

And one last point: if you are running on a WDDM driver, I was not aware that you can run out of memory, since the GPU memory is virtualized now. You would need to run out of memory on the whole machine before running out of GPU memory... How do you know that you have an out-of-memory error?
