What are your opinions on DX12/Vulkan/Mantle?

Started by
120 comments, last by Ubik 8 years, 10 months ago

I'm pretty torn on what to think about it. On one hand being able to write the implementations for a lot of the resource management and command processing allows for a lot of gains, and for a much better management of the rendering across a lot of hardware. Also all the threading support is going to be fantastic.

But on the other hand, accepting that burden will vastly increase the cost to maintain a platform and the amount of time to fully port. I understand that a lot of it can be ported piece by piece, but it seems like the amount of time necessary to even meet the performance of, say, dx12 is on the order of man weeks.

I feel like to fully support these APIs I need to almost abandon the previous APIs support in my engine since the veil is so much thinner, otherwise I'll just end up adding the same amount of abstraction that DX11 does already, kind of defeating the point.

What are your opinions?

Perception is when one imagination clashes with another
Advertisement
Depending on how you have designed things the time to port should be pretty low and, importantly, the per-API level of code for the new Vulkan and DX12 paths will be very thin as they look very much the same in terms of API space and functionality.

Older APIs might be a problem, however again this depends on your abstraction levels; if you have coupled to D3D11 or OpenGL's style of doing things too closely at the front end then yes, it'll cause you pain - if your abstraction was light enough then you could possibly treat D3D11 and OpenGL as if they are D3D12 and Vulkan from the outside and do the fiddling internally.

At this point however it depends on where you are with things; if you want to release Soon then you'd be better off forgetting the new APIs and getting finished before going back, gutting your rendering backend and doing it again for another project/product.

In fact I'd probably argue 'releasing soon' is the only reason to stick with the current APIs; if you are just learning then by all means continue however I would advocating switching to the newer stuff as soon as you can. That might mean waiting for lib to abstract things a bit of you don't feel like fiddling but the new APIs look much cleaner and much simpler to learn - a few tricky concepts are wrapped in them but they aren't that hard to deal with and, with Vulkan at least, it looks like you'll have solid debugging tools and layers built in to help.

I guess if you are a hobbyist you could keep on with the old stuff; but I'd honestly say that if you are looking to be a gfx programmer in the industry switching sooner rather than later will help you out as you'll be useful in the future when you join up.

I feel like to fully support these APIs I need to almost abandon the previous APIs support in my engine since the veil is so much thinner, otherwise I'll just end up adding the same amount of abstraction that DX11 does already, kind of defeating the point.

That's highly unlikely. One of the reasons the classical APIs are so slow is that they have to do a whole lot of rule validation. When you google an OpenGL function and you see that whole list of things that an argument is allowed to be, what happens when something goes wrong, etc, then the drivers have to actually validate that at runtime (to actually fulfill the language spec requirements correctly, but also to prevent you from crashing the GPU). That is because drivers can't make many assumptions about the context that you execute those API calls in. When you write a game engine with DX12 or Vulkan, that's not true. As the programmer you typically have complete knowledge of the relevant context and can make many assumptions, thus you can skip a whole lot of work that a classical OpenGL driver would have to do.

In addition to that multithreading in DX11 and OpenGL 4.x is still very crappy (although I'm not sure why) and with the new APIs you will be able to actually use multiple cores to do rendering API stuff for more than just 5% gains.

Thinking about it, it's kinda like C++ vs a more managed language like Java or C#. Similar concepts, of course extremely similar execution context, but one gives more control, is a more precise abstraction of the hardware, but enables you to shoot yourself in the foot more (and in return is faster).

In addition to that multithreading in DX11 and OpenGL 4.x is still very crappy (although I'm not sure why)

OpenGL doesnt really provide any mean for multithreading even in 4.4. OpenGL context belongs to a single thread who can issue rendering command. That's what vulkan/DX12 are tackling by the "command buffer" object that can be created on any thread (although they have to be commited to the command queue which seems to belong to a single thread only).

Actually there are way to do kindof multithreading in OpenGL 4 : you can share a context between thread, that's used to load texture asynchronously for instance, but I've heard that this is really inefficient. There is also glBufferStorage + IndirectDraw which allows you to access a buffer of instanced data that can be written like any others buffer, eg concurrently.

But it's not as powerful as what Vulkan or DX12 which allow to issue any command and not just instanced ones.

In addition to that multithreading in DX11 and OpenGL 4.x is still very crappy (although I'm not sure why)

OpenGL doesnt really provide any mean for multithreading even in 4.4. OpenGL context belongs to a single thread who can issue rendering command. That's what vulkan/DX12 are tackling by the "command buffer" object that can be created on any thread (although they have to be commited to the command queue which seems to belong to a single thread only).

Actually there are way to do kindof multithreading in OpenGL 4 : you can share a context between thread, that's used to load texture asynchronously for instance, but I've heard that this is really inefficient. There is also glBufferStorage + IndirectDraw which allows you to access a buffer of instanced data that can be written like any others buffer, eg concurrently.

But it's not as powerful as what Vulkan or DX12 which allow to issue any command and not just instanced ones.

Yes, but I'm more interested in what prevented driver implementors to get proper multithreading support into the APIs in the first place. DX11 has the concept of command lists too and it's kind of working, but practical gains from it are pretty small. I don't know what about the APIs (or the implementations of the drivers) prevents proper multithreading from working in DX11 and GL4.x

Many years ago, I briefly worked at NVIDIA on the DirectX driver team (internship). This is Vista era, when a lot of people were busy with the DX10 transition, the hardware transition, and the OS/driver model transition. My job was to get games that were broken on Vista, dismantle them from the driver level, and figure out why they were broken. While I am not at all an expert on driver matters (and actually sucked at my job, to be honest), I did learn a lot about what games look like from the perspective of a driver and kernel.

The first lesson is: Nearly every game ships broken. We're talking major AAA titles from vendors who are everyday names in the industry. In some cases, we're talking about blatant violations of API rules - one D3D9 game never even called BeginFrame/EndFrame. Some are mistakes or oversights - one shipped bad shaders that heavily impacted performance on NV drivers. These things were day to day occurrences that went into a bug tracker. Then somebody would go in, find out what the game screwed up, and patch the driver to deal with it. There are lots of optional patches already in the driver that are simply toggled on or off as per-game settings, and then hacks that are more specific to games - up to and including total replacement of the shipping shaders with custom versions by the driver team. Ever wondered why nearly every major game release is accompanied by a matching driver release from AMD and/or NVIDIA? There you go.

The second lesson: The driver is gigantic. Think 1-2 million lines of code dealing with the hardware abstraction layers, plus another million per API supported. The backing function for Clear in D3D 9 was close to a thousand lines of just logic dealing with how exactly to respond to the command. It'd then call out to the correct function to actually modify the buffer in question. The level of complexity internally is enormous and winding, and even inside the driver code it can be tricky to work out how exactly you get to the fast-path behaviors. Additionally the APIs don't do a great job of matching the hardware, which means that even in the best cases the driver is covering up for a LOT of things you don't know about. There are many, many shadow operations and shadow copies of things down there.

The third lesson: It's unthreadable. The IHVs sat down starting from maybe circa 2005, and built tons of multithreading into the driver internally. They had some of the best kernel/driver engineers in the world to do it, and literally thousands of full blown real world test cases. They squeezed that system dry, and within the existing drivers and APIs it is impossible to get more than trivial gains out of any application side multithreading. If Futuremark can only get 5% in a trivial test case, the rest of us have no chance.

The fourth lesson: Multi GPU (SLI/CrossfireX) is fucking complicated. You cannot begin to conceive of the number of failure cases that are involved until you see them in person. I suspect that more than half of the total software effort within the IHVs is dedicated strictly to making multi-GPU setups work with existing games. (And I don't even know what the hardware side looks like.) If you've ever tried to independently build an app that uses multi GPU - especially if, god help you, you tried to do it in OpenGL - you may have discovered this insane rabbit hole. There is ONE fast path, and it's the narrowest path of all. Take lessons 1 and 2, and magnify them enormously.

Deep breath.

Ultimately, the new APIs are designed to cure all four of these problems.

* Why are games broken? Because the APIs are complex, and validation varies from decent (D3D 11) to poor (D3D 9) to catastrophic (OpenGL). There are lots of ways to hit slow paths without knowing anything has gone awry, and often the driver writers already know what mistakes you're going to make and are dynamically patching in workarounds for the common cases.

* Maintaining the drivers with the current wide surface area is tricky. Although AMD and NV have the resources to do it, the smaller IHVs (Intel, PowerVR, Qualcomm, etc) simply cannot keep up with the necessary investment. More importantly, explaining to devs the correct way to write their render pipelines has become borderline impossible. There's too many failure cases. it's been understood for quite a few years now that you cannot max out the performance of any given GPU without having someone from NVIDIA or AMD physically grab your game source code, load it on a dev driver, and do a hands-on analysis. These are the vanishingly few people who have actually seen the source to a game, the driver it's running on, and the Windows kernel it's running on, and the full specs for the hardware. Nobody else has that kind of access or engineering ability.

* Threading is just a catastrophe and is being rethought from the ground up. This requires a lot of the abstractions to be stripped away or retooled, because the old ones required too much driver intervention to be properly threadable in the first place.

* Multi-GPU is becoming explicit. For the last ten years, it has been AMD and NV's goal to make multi-GPU setups completely transparent to everybody, and it's become clear that for some subset of developers, this is just making our jobs harder. The driver has to apply imperfect heuristics to guess what the game is doing, and the game in turn has to do peculiar things in order to trigger the right heuristics. Again, for the big games somebody sits down and matches the two manually.

Part of the goal is simply to stop hiding what's actually going on in the software from game programmers. Debugging drivers has never been possible for us, which meant a lot of poking and prodding and experimenting to figure out exactly what it is that is making the render pipeline of a game slow. The IHVs certainly weren't willing to disclose these things publicly either, as they were considered critical to competitive advantage. (Sure they are guys. Sure they are.) So the game is guessing what the driver is doing, the driver is guessing what the game is doing, and the whole mess could be avoided if the drivers just wouldn't work so hard trying to protect us.

So why didn't we do this years ago? Well, there are a lot of politics involved (cough Longs Peak) and some hardware aspects but ultimately what it comes down to is the new models are hard to code for. Microsoft and ARB never wanted to subject us to manually compiling shaders against the correct render states, setting the whole thing invariant, configuring heaps and tables, etc. Segfaulting a GPU isn't a fun experience. You can't trap that in a (user space) debugger. So ... the subtext that a lot of people aren't calling out explicitly is that this round of new APIs has been done in cooperation with the big engines. The Mantle spec is effectively written by Johan Andersson at DICE, and the Khronos Vulkan spec basically pulls Aras P at Unity, Niklas S at Epic, and a couple guys at Valve into the fold.

Three out of those four just made their engines public and free with minimal backend financial obligation.

Now there's nothing wrong with any of that, obviously, and I don't think it's even the big motivating raison d'etre of the new APIs. But there's a very real message that if these APIs are too challenging to work with directly, well the guys who designed the API also happen to run very full featured engines requiring no financial commitments*. So I think that's served to considerably smooth the politics involved in rolling these difficult to work with APIs out to the market, encouraging organizations that would have been otherwise reticent to do so.

[Edit/update] I'm definitely not suggesting that the APIs have been made artificially difficult, by any means - the engineering work is solid in its own right. It's also become clear, since this post was originally written, that there's a commitment to continuing DX11 and OpenGL support for the near future. That also helped the decision to push these new systems out, I believe.

The last piece to the puzzle is that we ran out of new user-facing hardware features many years ago. Ignoring raw speed, what exactly is the user-visible or dev-visible difference between a GTX 480 and a GTX 980? A few limitations have been lifted (notably in compute) but essentially they're the same thing. MS, for all practical purposes, concluded that DX was a mature, stable technology that required only minor work and mostly disbanded the teams involved. Many of the revisions to GL have been little more than API repairs. (A GTX 480 runs full featured OpenGL 4.5, by the way.) So the reason we're seeing new APIs at all stems fundamentally from Andersson hassling the IHVs until AMD woke up, smelled competitive advantage, and started paying attention. That essentially took a three year lag time from when we got hardware to the point that compute could be directly integrated into the core of a render pipeline, which is considered normal today but was bluntly revolutionary at production scale in 2012. It's a lot of small things adding up to a sea change, with key people pushing on the right people for the right things.

Phew. I'm no longer sure what the point of that rant was, but hopefully it's somehow productive that I wrote it. Ultimately the new APIs are the right step, and they're retroactively useful to old hardware which is great. They will be harder to code. How much harder? Well, that remains to be seen. Personally, my take is that MS and ARB always had the wrong idea. Their idea was to produce a nice, pretty looking front end and deal with all the awful stuff quietly in the background. Yeah it's easy to code against, but it was always a bitch and a half to debug or tune. Nobody ever took that side of the equation into account. What has finally been made clear is that it's okay to have difficult to code APIs, if the end result just works. And that's been my experience so far in retooling: it's a pain in the ass, requires widespread revisions to engine code, forces you to revisit a lot of assumptions, and generally requires a lot of infrastructure before anything works. But once it's up and running, there's no surprises. It works smoothly, you're always on the fast path, anything that IS slow is in your OWN code which can be analyzed by common tools. It's worth it.

(*See this post by Unity's Aras P for more thoughts. I have a response comment in there as well.)

SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.

I feel like to fully support these APIs I need to almost abandon the previous APIs support in my engine since the veil is so much thinner, otherwise I'll just end up adding the same amount of abstraction that DX11 does already, kind of defeating the point.

Yes.
But it depends. For example if you were doing AZDO OpenGL, many of the concepts will already be familiar to you.
However, for example, AZDO never dealt with textures as thin as Vulkan or D3D12 do so you'll need to refactor those.
If you weren't following AZDO, then it's highly likely that the way you were using the old APIs is incompatible with the new says.

Actually there are way to do kindof multithreading in OpenGL 4 : (...). There is also glBufferStorage + IndirectDraw which allows you to access a buffer of instanced data that can be written like any others buffer, eg concurrently.
But it's not as powerful as what Vulkan or DX12 which allow to issue any command and not just instanced ones.

Actually DX12 & Vulkan are exactly following the same path glBufferStorage + IndirectDraw did. It just got easier, made thiner, and can now handle other misc aspects from within multiple cores (texture binding, shader compilation, barrier preparation, etc).

The rest was covered by Promit's excellent post.

Actually there are way to do kindof multithreading in OpenGL 4 : (...). There is also glBufferStorage + IndirectDraw which allows you to access a buffer of instanced data that can be written like any others buffer, eg concurrently.
But it's not as powerful as what Vulkan or DX12 which allow to issue any command and not just instanced ones.

Actually DX12 & Vulkan are exactly following the same path glBufferStorage + IndirectDraw did. It just got easier, made thiner, and can now handle other misc aspects from within multiple cores (texture binding, shader compilation, barrier preparation, etc).

There is something I don't really understand in Vulkan/DX12, it's the "descriptor" object. Apparently it acts as a gpu readable data chunk that hold texture pointer/size/layout and sampler info, but I don't understand the descriptor set/pool concept work, this sounds a lot like array of bindless texture handle to me.

There is something I don't really understand in Vulkan/DX12, it's the "descriptor" object. Apparently it acts as a gpu readable data chunk that hold texture pointer/size/layout and sampler info, but I don't understand the descriptor set/pool concept work, this sounds a lot like array of bindless texture handle to me.

Without going into detail; it's because only AMD & NVIDIA cards support bindless textures in their hardware, there's one major Desktop vendor that doesn't support it even though it's DX11 HW. Also take in mind both Vulkan & DX12 want to support mobile hardware as well.
You will have to give the API a table of textures based on frequency of updates: One blob of textures for those that change per material, one blob of textures for those that rarely change (e.g. environment maps), and another blob of textures that don't change (e.g. shadow maps).
It's very analogous to how we have been doing constant buffers with shaders (provide different buffers based on frequency of update).
And you put those blobs into a bigger blob and tell the API "I want to render with this big blob which is a collection of blobs of textures"; so the API can translate this very well to all sorts of hardware (mobile, Intel on desktop, and bindless like AMD's and NVIDIA's).

If all hardware were bindless, this set/pool wouldn't be needed because you could change one texture anywhere with minimal GPU overhead like you do in OpenGL4 with bindless texture extensions.
Nonetheless this descriptor pool set is also useful for non-texture stuff, (e.g. anything that requires binding, like constant buffers). It is quite generic.

So... is the Vulkan API available for us plebeians to peruse anywhere?

This topic is closed to new replies.

Advertisement