Preparing for Mantle


Sounds like a pain in the ass, to me.


Both Mantle and D3D12 should actually be quite a bit easier for most non-trivial renderer designs.

It's also actually fairly similar to how one might try to use D3D11 today in a multi-threaded renderer; some of the trickier/dumber parts are thankfully simplified. Create a command list, create some resources, execute a command list with a set of resources as inputs, done. The rest of the changes are conceptual changes to simplify the resources model (no more different kinds of buffers, simpler texture semantics, etc.), the more explicit threading model (only particularly relevant if you want/need render threading), and the more explicit device model (pick which GPU you use for what on multi-GPU systems).

Sean Middleditch – Game Systems Engineer – Join my team!


Regarding being CPU bound - this depends on whether you're making a graphical tech demo, or a game.
For the former, you might have 16ms of GPU time and 16ms of CPU time per frame dedicated to graphics.
For the latter, you've still got 16ms of GPU time (for now, until everyone else realizes you've ended up as the gatekeeper of their SPU-type jobs!), but maybe only a budget of 2ms of CPU time because all the other departments on the team need CPU time as well! In that situation, it's much easier to overrun your CPU budget...

yet it makes me wonder, are we really that much CPU bound? from my perspective, it takes a really slow CPU to become the bottleneck on the API side. usually, with instancing etc., any modern i3/i5/i7 is fast enough on a single thread to saturate the GPU side.

In my experience it's very easy to be CPU-bound in D3D11 with real-world rendering scenarios: lots of draw calls, and lots of resource bindings. This is true for us even on beefy Intel CPUs. We've had to invest considerable engineering effort into changing our asset pipeline and engine runtime in ways that reduce CPU usage.

I'm implying that you'll end up doing the same for D3D12/Mantle, just not because of the CPU, but because the GPU will have idle bubbles in the pipeline if you start switching states (if you profile on consoles, with low CPU overhead, that's what you'll see). It's still work that has to be done, and a 1GHz sequential processor won't do any magic (not talking about shaders, but about the command processor part!).
We have low-level access to HW on consoles, and while you might think we could now afford to be wasteful with drawcalls etc., we actually spend a lot of SPU cycles batching meshes, removing redundant states and even doing shader preparation that the GPU could handle itself, just to keep that work off the GPU.
it's just moving the bottleneck to another place, not removing it, and at some point you'll hit it again and end up with the same old thinking: the fastest optimization is to not do wasteful work, no matter how fast you'd do it otherwise.
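
(just to make the 'remove redundant states' point concrete, a tiny sketch of the kind of CPU-side filtering I mean - all names here are made up, it's only the idea:)

#include <cstdint>

// tiny illustration only: filter redundant state changes on the CPU so they never
// reach the command processor in the first place. "Backend" is a hypothetical stand-in.
struct RenderStateFilter {
    uint32_t currentShader = ~0u;
    uint32_t currentBlend  = ~0u;

    template <typename Backend>
    void setShader(Backend& gpu, uint32_t shader) {
        if (shader == currentShader) return;   // redundant: skip the context roll on the GPU
        gpu.bindShader(shader);
        currentShader = shader;
    }

    template <typename Backend>
    void setBlend(Backend& gpu, uint32_t blend) {
        if (blend == currentBlend) return;     // same state as the last draw, nothing to do
        gpu.bindBlendState(blend);
        currentBlend = blend;
    }
};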



The OpenGL extensions from NVidia's talk are a lot closer to what I'd hope the direction of 'next-gen APIs' would be. it's as easy to use as OpenGL always was, just extending the critical parts to perform better (I'm talking about http://www.slideshare.net/CassEveritt/approaching-zero-driver-overhead ). it actually makes things nicer with persistently mapped buffers (you don't need to guess and hope how every driver will 'optimize' your calls, and you get all the responsibility and possibilities that come with using persistent buffers). and if multidrawindirect were extended a bit more to support an array of indexed shader objects, you could render the whole solid pass with one drawcall. shadowmaps would possibly end up being one drawcall each, and preparing those batched drawcalls could be done in a multithreaded way if you want.

Really? The future you want is more instancing, wrapped up in a typical OpenGL layer of "you have to do it in this super-special way in order to hit the fast path"??? To me it's completely at odds with what actual software developers want.

I have a feeling you haven't looked into Cass Everitt's talk.
it's not about classical instancing.
it's about creating a list of drawcalls with various resources (vertex buffers, index buffers, textures...) and just submitting all of it in one call, so instead of

for_all_my_drawcall
  gl set states
  gl draw mesh
you write

for_all_my_drawcall
  store_states into array
  store_mesh offsets/count etc. into array

gl draw_everything of array
so, there is no "you have to do it in this super-special way in order to hit the fast path", it's quite the opposite, a very generic way. you don't have to touch the shader or anything to account for some special instancing, and you don't have to worry about resource limits and binding. all you do is create a vector of all drawcalls, just like you'd 'record' it with Mantle/D3D12.

yes, it's more limited right now, but that's why I've said, I'd rather see this extended.
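
(for the curious, roughly what that one-call submit looks like with ARB_multi_draw_indirect - a sketch only, assuming a GL 4.3 context and an extension loader:)

#include <GL/glcorearb.h>   // assumes a GL 4.3 context and an extension loader
#include <vector>

// one entry per draw, all packed into a single array
struct DrawElementsIndirectCommand {
    GLuint count;          // index count of this mesh
    GLuint instanceCount;  // usually 1
    GLuint firstIndex;     // offset into the shared index buffer
    GLuint baseVertex;     // offset into the shared vertex buffer
    GLuint baseInstance;   // usable as a per-draw ID with ARB_shader_draw_parameters
};

void submitEverything(const std::vector<DrawElementsIndirectCommand>& draws,
                      GLuint indirectBuffer)
{
    // upload the whole command array...
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
    glBufferData(GL_DRAW_INDIRECT_BUFFER,
                 draws.size() * sizeof(DrawElementsIndirectCommand),
                 draws.data(), GL_STREAM_DRAW);

    // ...and kick every draw of the pass with one call
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                                nullptr, (GLsizei)draws.size(), 0);
}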


Everybody who works on consoles knows how low-overhead it *should* be to generate command buffers, and so they constantly beg for lower-overhead draw calls, better multithreading, and more access to GPU memory. Instead we get that "zero driver overhead" presentation that's like "lol, too bad, we're never going to change anything; here are some new extensions that only work on Nvidia and may require you to completely rewrite your rendering pipeline to use effectively." Great :-/

I really disagree on that one.
it offers you persistent memory, which you can write to from multiple threads and manage yourself, just like we do on consoles. it lets you create command lists (rather, vectors) in a multithreaded way, as you can do on consoles. and it's not about "we won't change a thing", it's rather "we've already given you a 90% solution that you can get your hands on right now, and the changes required are minimal compared to the rewrite for D3D12/Mantle for 10% more".

no offense intended, but have you really looked into it? I can't think of why it would be a pipeline rewrite for you. it's just a small change in buffer management (which aligns well with what you do if you follow best-practice guides like https://developer.nvidia.com/sites/default/files/akamai/gamedev/files/gdc12/Efficient_Buffer_Management_McDonald.pdf ), and the 2nd part is 'recording' of drawcalls, which is less complex than with D3D12/Mantle (because you don't have to pre-create all states and manage them) and isn't that different from what you do when you sort your drawcalls to minimize state switching (which everyone does even on consoles, where drawcalls should be cheap, yet those switches hit you hard on the GPU).


and in case we don't want to optimize for the current saturation, but rather increase drawcall counts etc., I really wonder when it starts to become suboptimal on the GPU side. if we were able to push 10M draw calls/s, that's roughly 100 cycles/DC on a modern GPU, and those have really deep pipelines, sometimes needing cache flushes, and every DC needs some setup context that has to be fetched. We'll end up with "yes, this could be done, but it would be suboptimal, let's go back to instancing etc. again".
that's no different from what we do now: a few 'big' setups per frame and pushing as many drawcalls with as few state/resource changes as possible, to saturate rather on the shader/rasterization/fillrate side.

Of course you can end up getting GPU-limited, but the problem is that right now we can't even get to that point because there's too much CPU overhead. Software should be able to hit that breaking point where batching is needed for GPU performance, and then developers can decide case-by-case how much it makes sense for them to pursue instancing and things like that. It shouldn't be that you're forced into 100% instancing from the start or you're dead in the water on PC, at least in my humble opinion.

well, maybe I'm just too used to preparing everything in the best way for GPUs; we barely run into CPU limitations due to rendering. most of the time it's the GPU that limits our games. at first it seemed as if consoles had benefits due to low overhead, but then you take some captures and realize you pay for cache and pipeline flushes, and the solution is just the plain old way you'd always optimize pre-D3D12.
I just expect the same for D3D12/Mantle.

Regarding being CPU bound - this depends on whether you're making a graphical tech demo, or a game.
For the former, you might have 16ms of GPU time and 16ms of CPU time per frame dedicated to graphics.
For the latter, you've still got 16ms of GPU time (for now, until everyone else realizes you've ended up as the gatekeeper of their SPU-type jobs!), but maybe only a budget of 2ms of CPU time because all the other departments on the team need CPU time as well! In that situation, it's much easier to overrun your CPU budget...

yet there are very few games that saturate more than 2 cores. most have a render thread, and that one runs independently of the other parts, which implies that, from an architecture point of view, rendering in games nowadays runs no differently than in tech demos, unless your job system really fills up all cores and could benefit from freeing up the rendering thread/core.
if you don't occupy all cores and you don't run a render thread, there is no reason to complain about API limitations.

P.S. I'm about to sign an NDA with AMD to get access to Mantle, so it's obviously being released wider than just DICE now :D

part of the NDA is to not talk about the NDA ;)

Create a command list, create some resources, execute a command list with a set of resources as inputs, done.

so, how do you keep a game (engine) flexible while knowing _all_ the states etc. that you don't want to create at runtime? (assume pipeline creation can take as much time as shader linking in OpenGL, which is ~1s in bad cases.)
there is no driver anymore that does that in a background thread for you, in an as-fast-as-possible way.
assume you have about 1024 shader combinations; add the stencil, rasterizer, blend and rendertarget permutations that might be part of the GPU setup and therefore included in one static state you have to create.
assume that state creation isn't cross-platform, but per driver+GPU, so you cannot really do it offline before you ship the game.
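
(in practice that probably means everyone ends up writing some kind of hashed pipeline cache with lazy creation and hoping the combinations get warmed up early enough - a sketch, all types and names hypothetical:)

#include <cstdint>
#include <unordered_map>

// hypothetical: the full GPU setup of a draw, reduced to a hashable key
struct PipelineDesc {
    uint32_t shaderId, blend, raster, depthStencil, rtFormats;
    bool operator==(const PipelineDesc& o) const {
        return shaderId == o.shaderId && blend == o.blend && raster == o.raster
            && depthStencil == o.depthStencil && rtFormats == o.rtFormats;
    }
};

struct PipelineDescHash {
    size_t operator()(const PipelineDesc& d) const {
        uint64_t h = 1469598103934665603ull;                          // FNV-1a style mixing
        auto mix = [&](uint32_t v) { h = (h ^ v) * 1099511628211ull; };
        mix(d.shaderId); mix(d.blend); mix(d.raster); mix(d.depthStencil); mix(d.rtFormats);
        return (size_t)h;
    }
};

struct Pipeline { /* opaque, driver/GPU specific */ };
Pipeline* createPipeline(const PipelineDesc&) { return new Pipeline(); } // stand-in; the real call may take ~1s

Pipeline* getPipeline(const PipelineDesc& desc) {
    static std::unordered_map<PipelineDesc, Pipeline*, PipelineDescHash> cache;
    auto it = cache.find(desc);
    if (it != cache.end()) return it->second;   // already built: cheap
    Pipeline* p = createPipeline(desc);         // first use at runtime = potential hitch
    cache.emplace(desc, p);
    return p;
}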

The rest of the changes are conceptual changes to simplify the resources model (no more different kinds of buffers, simpler texture semantics, etc.).

there still are. check out the links in the 2nd post. it's split into 2 stages:
1. you allocate a bunch of memory
2. you prepare it for a specific usage case, e.g. as render target or vertexbuffer.

now assume you want to use a texture as render target and then use it as a source in the 2nd drawcall (e.g. some temporary texture you use in post processing). you need to state that to the API.
assume further you take advantage of the new multithreaded command generation, so you can't keep track of the state of an object inside the object itself; you rather need to track states per commandbuffer/thread.
assume further that you don't want to do redundant state conversions, as those might be quite expensive (changing layouts of buffers to make them best suited for texture sampling, for rendering, for vertex access), so you'd need to somehow merge the states of resources you use across consecutive command buffers.
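
(roughly, the per-command-buffer bookkeeping that forces on you looks like this - the types are made up, just to show the idea:)

#include <unordered_map>
#include <vector>

// made-up types, just to illustrate per-command-buffer state tracking
enum class ResourceState { Undefined, RenderTarget, ShaderResource, VertexBuffer };

struct Transition { int resourceId; ResourceState before, after; };

class CommandBufferStateTracker {
public:
    // record the state a resource must be in for the next draw on *this* thread's buffer
    void require(int resourceId, ResourceState wanted) {
        auto it = known.find(resourceId);
        if (it == known.end()) {
            // first use in this command buffer: the real "before" state is only known
            // at submit time, when consecutive command buffers are linked together
            pendingFirstUse.push_back({resourceId, ResourceState::Undefined, wanted});
            known[resourceId] = wanted;
        } else if (it->second != wanted) {
            transitions.push_back({resourceId, it->second, wanted});  // explicit conversion
            it->second = wanted;
        }
        // else: already in the right state, no redundant conversion recorded
    }

    std::vector<Transition> transitions;       // conversions we can emit immediately
    std::vector<Transition> pendingFirstUse;   // patched when buffers are merged at submit
private:
    std::unordered_map<int, ResourceState> known;  // last known state, this buffer only
};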



the more explicit threading model (only particularly relevant if you want/need render threading), and the more explicit device model (pick which GPU you use for what on multi-GPU systems).

you know you have to test and balance all that? CrossFire works across different GPUs. you can have an APU GPU + some mid-range Radeon HD 7700 + a top-notch R9 290X.
and with D3D, there is a generic driver that might execute asymmetrically on those GPUs. it's something you'd need to handle.
I don't say that's impossible, but for the majority of devs it can end up either as a lot of work (testing all kinds of configurations in various parts of your game), or in disappointing some high-end users because their expensive 4x CrossFire is no faster than 3x CrossFire, or even buggy.


a lot of work that drivers did before will end up in the hands of devs, and it's not optional, it's what you'll have to do. you might ship a perfectly fine-running game, and some new GPU might take advantage of something that hasn't been used before and uncover a bug in your one-year-old game that people still play. and AMD/NV won't release a driver fix; you need to release a patch.

I see benefits you've mentioned, but I also see all the drawbacks.
I like low-level programming on consoles, below what Mantle/D3D12 offers, but I'm not sure about the PC side. when there was Glide/S3 Metal/RRedline/... and even GL worked differently everywhere (MiniGL/PowerSGL/..), every developer felt relieved when it ended with D3D. and the RefRast was actually pushed by game devs, to be able to validate that something is a driver bug. now it all seems forgotten, and like a step back.

the Cass Everitt talk really seems like the best balance of both worlds to me (if it would be extended a little bit).

so, how do you keep a game (engine) flexible while knowing _all_ the states etc. that you don't want to create at runtime? (assume pipeline creation can take as much time as shader linking in OpenGL, which is ~1s in bad cases.)

I don't know how Mantle / D3D12 are going to do it, but if we were doing it ourselves close to the metal (no validation costs), then "creating a state object" is the same as "creating an array of N int32s" (free), and configuring one of them is the same as packing M different values into that array of N int32s (very, very cheap).
Modern GPUs are also moving towards being completely stateless, where there is hardly any per-batch pipeline bubbling to worry about any more.
Also, modern GPUs are moving towards having multiple rings of commands being processed - so that if you do create a bubble, it can be filled with work that you've queued up on one of your compute queues anyway :D
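
To illustrate: on hardware like that, a depth-stencil "state object" might be nothing more than a couple of packed register words. (The bit layout below is completely made up; the real layout is vendor/GPU specific.)

#include <cstdint>

// purely illustrative: a "state object" as a couple of packed register words
struct DepthStencilState {
    uint32_t words[2];
};

DepthStencilState makeDepthState(bool depthTest, bool depthWrite, uint32_t compareFunc)
{
    DepthStencilState s = {};
    // "creating" the state object is just packing a few fields into ints - essentially free
    s.words[0] =  (depthTest  ? 1u : 0u)
               | ((depthWrite ? 1u : 0u) << 1)
               | ((compareFunc & 0x7u)   << 2);
    return s;
}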

I really disagree on that one.
it offers you persistent memory, which you can write to from multiple threads and manage yourself, just like we do on consoles. it lets you create command lists (rather, vectors) in a multithreaded way, as you can do on consoles. and it's not about "we won't change a thing", it's rather "we've already given you a 90% solution that you can get your hands on right now, and the changes required are minimal compared to the rewrite for D3D12/Mantle for 10% more".

The persistent-memory part really is a big deal - I'm surprised no one else has highlighted it as the standout feature of that talk. Basically, persistent buffers mean that we've finally got a malloc/free for CPU+GPU addressable RAM! The GPU can read from it at full speed, and the CPU can do fast write-combined writes into it (CPU reads will be god-awfully slow - non-cached). It's not quite feature-complete yet -- AFAIK, we can't interpret this malloc'ed RAM however we like -- we can't tell a shader to fetch it as a texture and another to read from it as an index buffer, and another to stream-out to it, and another to write to it as an MSAA render-target... all features that should be possible in an ideal API.
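
Roughly, the new toy looks like this (GL 4.4 / ARB_buffer_storage; error handling and per-region fencing omitted for brevity):

#include <GL/glcorearb.h>   // assumes GL 4.4 / ARB_buffer_storage and an extension loader

// create one big persistently-mapped buffer: the returned pointer stays valid while
// the GPU reads from the same memory - effectively a malloc for CPU+GPU visible RAM
void* createPersistentBuffer(GLuint* outBuffer, GLsizeiptr size)
{
    const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
    glGenBuffers(1, outBuffer);
    glBindBuffer(GL_ARRAY_BUFFER, *outBuffer);
    glBufferStorage(GL_ARRAY_BUFFER, size, nullptr, flags);     // immutable storage
    return glMapBufferRange(GL_ARRAY_BUFFER, 0, size, flags);   // map once, keep forever
}

// any CPU thread can then write-combine straight into that pointer; you only have to
// fence so you don't overwrite data the GPU is still reading, e.g.:
//   GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
//   ...next frame: glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, timeoutNs);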

The multi-draw-indirect stuff is cool, but it's still not a real command buffer -- it lets you submit a collection of draw calls, but no more than that -- they all have to share the same state.
The future is only getting more and more parallel/multi-core and sooner or later, having a single "GPU thread" isn't going to cut it any more. Plenty of engines are already built around it, but only on PS3/360 (and PS4/Xbone now)...

I think the "we won't change a thing" type comment was aimed at GL as a whole -- GL keeps piling on more features, creating the behemoth of an API that it is, where there are 100 ways to implement each feature, but only a very small set of them (usually including specific vendor extensions) will get you onto the "fast path".
The idea of something like Mantle is that you're only given the fast path, and all the stuff on the "slow path" is left for you to implement/emulate yourself if you want it.
If you've not used GL for a while, it is pretty intimidating to try and get back into it and find where all the fast paths are for each different generation of cards... I'd actually prefer it if they had completely separate APIs for different eras of GPUs, like the Dx9 / Dx11 schism :D

so, there is no "you have to do it in this super-special way in order to hit the fast path", it's quite the opposite, a very generic way

No - they're demonstrating a really cool fast path within GL in that talk -- the second slide even says so, that fast paths exist, but you just need to know what they are... implying that if you use the wrong parts of GL, then you won't be on the fast path. Also, many of the features they're showing off aren't core GL -- they're vendor extensions, and despite the EXT prefix, they're not supported by all vendors yet either! They also point out several vendor-specific caveats in there -- "Intel's fast path involves doing things this way, AMD's fast path requires this other thing", etc...
So you'd still need to have several vendor-specific GL back-ends in your engine :( -- and also back-ends for older generations of cards that are never going to support these extensions, which means they've got their own "fast paths"...

I'm implying that you'll end up doing the same for D3D12/Mantle, just not because of the CPU, but because the GPU will have idle bubbles in the pipeline if you start switching states (if you profile on consoles, with low CPU overhead, that's what you'll see). It's still work that has to be done, and a 1GHz sequential processor won't do any magic (not talking about shaders, but about the command processor part!).
We have low-level access to HW on consoles, and while you might think we could now afford to be wasteful with drawcalls etc., we actually spend a lot of SPU cycles batching meshes, removing redundant states and even doing shader preparation that the GPU could handle itself, just to keep that work off the GPU.

Yeah, pipeline bubbling on GeForce 7-era cards is really horrible. You can actually feed them really high batch counts and the CP will work fine -- as long as you don't switch states with every batch. If the states are compatible, it will actually merge batches internally. If the states aren't compatible and the batches are too small, then yeah, you end up wasting a huge amount of time in pipeline bubbles.
However, modern cards are implemented completely differently now. You probably still don't want batches with unique states that cover < 128 pixels, but the performance characteristics aren't at all comparable to something that's as many generations apart as the GF7!

The problem with the AZDO talk is that, while it is nice and all, practically speaking it only works on one vendor: NVidia. AMD lack the key extensions, and Intel are nowhere near either (still a couple of major versions behind the spec, afaik).

As cool as multidrawindirect is regardless of any protests it IS just an instancing system; your targets, shaders and states are fixed - at best you can reference different source data which at least opens the door to different textures (assuming bindless and drawID support aka NV only currently) but it is still instancing.

Which is fine, instancing is good and it's one part of the puzzle but it's not the be all and end all solution.

Games are not 'throw 1,000,000 objects at the screen once' affairs; multiple passes and multiple states mean hitting the draw path frequently, still with switching in between. Most of these passes are logically independent and can be 'set up' in advance; in a simple forward renderer my final pass might require the shadow map pass to run first, but from a command point of view I can build that shadow pass at the same time as the final pass and just execute them in the correct order. (Which scales with increasingly complex pass setups; we had a forward-rendered game which was performing somewhere in the region of 8 passes before you even got to post-processing.)

OpenGL is, by its nature, forcing you to do this ALL on one thread, a thread which isn't getting any faster any more either. In 2014 you are going to have the same hard limit as you are in 2016 and 2018. Even if you make ever more exotic 'draw' commands which pack state information into the API function for depth, stencil, targets, shaders etc., this will still have to be deserialised, broken up and processed somewhere; it's just that 'somewhere' is now magic driver voodoo which has to do all manner of hazard tracking while trying to keep state straight. You've not made it any faster, you've just made the driver's life 100x harder.

This is where Mantle/D3D12 enter the game and say 'sure, you can do that OpenGL stuff but how about we let you record things on separate threads too?' - So now instead of being serialized on a single thread I can build my command buffers separately (probably still based around multidrawindirect) and just kick them one at a time in the correct order. Suddenly this massive serial block goes away and I can use more of the CPU resources and more importantly this work is now more predictable as the driver is doing less behind the scenes.
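
Something like this, to make the shape of it concrete (D3D12-style names; device/allocator/PSO setup omitted, and the two record* functions are placeholders):

#include <d3d12.h>
#include <thread>

// placeholders: each fills its command list with the draws of that pass
void recordShadowPass(ID3D12GraphicsCommandList* cl);
void recordFinalPass(ID3D12GraphicsCommandList* cl);

void buildAndSubmit(ID3D12CommandQueue* queue,
                    ID3D12GraphicsCommandList* shadowList,
                    ID3D12GraphicsCommandList* finalList)
{
    // the two passes are logically independent, so record them in parallel...
    std::thread t1([&] { recordShadowPass(shadowList); shadowList->Close(); });
    std::thread t2([&] { recordFinalPass(finalList);   finalList->Close();  });
    t1.join();
    t2.join();

    // ...and only the submission order has to respect the dependency (shadow map first)
    ID3D12CommandList* lists[] = { shadowList, finalList };
    queue->ExecuteCommandLists(2, lists);
}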

How is this not better?
I don't see how anyone can argue against this system with a straight face.

And this is before you add in persistent command buffers which can be replayed (because why am I regenerating the command buffer for my post stack every frame anyway?!?) and the reduction in overhead because the driver no longer has to track hazards (and the improvements that brings, because we can see ahead of the driver where the hazards are going to be, instead of the driver late-patching things in the command buffer).
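
That replay idea would look roughly like this in D3D12 terms ('bundles'; names are placeholders, setup omitted):

#include <d3d12.h>

// placeholder: recorded once (e.g. at load) into a bundle - the whole post-processing stack
void recordPostStack(ID3D12GraphicsCommandList* bundle);

void perFrame(ID3D12GraphicsCommandList* frameList,
              ID3D12GraphicsCommandList* postStackBundle)
{
    // ...scene rendering recorded into frameList as usual...

    // replay the pre-recorded post-processing commands instead of re-issuing them every frame
    frameList->ExecuteBundle(postStackBundle);
}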

Hell, the D3D12 early runtime from the //Build conference showed this to be a win right away and that's with early code!

Yes, AZDO has some nice ideas, but to suggest that this is the solution and we should be making a fatter draw function and continuing to push commands down a single thread is, frankly, short-sighted madness and is driving us fast into a brick wall of performance issues.

sorry, no time to reply right now, yet I thought you guys should see what we're talking about:

http://channel9.msdn.com/Events/Build/2014/3-564

:)

