Create a command list, create some resources, execute a command list with a set of resources as inputs, done.
so, how do you keep a game (engine) flexible while still knowing _all_ the states etc. up front, so that you don't have to create them at runtime? (assume pipeline creation can take as much time as shader linking in opengl, which is 1s in bad cases.) there is no driver anymore that does that for you in a background thread, as fast as possible. assume you have about 1024 shader combinations; add the stencil, rasterizer, blend and rendertarget permutations that might be part of the gpu setup and are therefore included in one static state you have to create. assume further that state creation is not cross platform but per driver+gpu, so you cannot really do it offline before you ship the game.
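to make the problem a bit more concrete, here is a minimal sketch of a pipeline cache keyed by the full state combination. every name in it (`PipelineKey`, `PipelineCache`, the field choices) is made up for illustration and is not the Mantle or D3D12 API; the creation callback stands in for whatever slow driver call builds the real object.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical key: everything that ends up baked into one static state.
struct PipelineKey {
    std::uint16_t shaderCombo;         // one of the ~1024 shader combinations
    std::uint8_t  stencilMode;
    std::uint8_t  rasterizerMode;
    std::uint8_t  blendMode;
    std::uint8_t  renderTargetFormat;

    bool operator==(const PipelineKey& o) const {
        return shaderCombo == o.shaderCombo && stencilMode == o.stencilMode &&
               rasterizerMode == o.rasterizerMode && blendMode == o.blendMode &&
               renderTargetFormat == o.renderTargetFormat;
    }
};

struct PipelineKeyHash {
    std::size_t operator()(const PipelineKey& k) const {
        // Pack all fields into one integer so every field affects the hash.
        const std::uint64_t packed =
            (static_cast<std::uint64_t>(k.shaderCombo) << 32) |
            (static_cast<std::uint64_t>(k.stencilMode) << 24) |
            (static_cast<std::uint64_t>(k.rasterizerMode) << 16) |
            (static_cast<std::uint64_t>(k.blendMode) << 8) |
            static_cast<std::uint64_t>(k.renderTargetFormat);
        return std::hash<std::uint64_t>()(packed);
    }
};

struct Pipeline { int handle; };  // stand-in for a driver pipeline object

class PipelineCache {
public:
    using CreateFn = std::function<Pipeline(const PipelineKey&)>;

    explicit PipelineCache(CreateFn create) : create_(std::move(create)) {
        assert(create_ && "a creation callback is required");
    }

    // Returns the cached pipeline, creating it (the slow path) on a miss.
    const Pipeline& get(const PipelineKey& key) {
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            ++misses_;
            it = cache_.emplace(key, create_(key)).first;
        }
        return it->second;
    }

    // Number of slow-path creations so far.
    std::size_t misses() const { return misses_; }

private:
    CreateFn create_;
    std::unordered_map<PipelineKey, Pipeline, PipelineKeyHash> cache_;
    std::size_t misses_ = 0;
};
```

the cache itself is the easy part. every miss during gameplay is a potential hitch, so you'd have to call `get` for every combination you can possibly hit while loading, and that list is exactly what's hard to know when the engine is supposed to stay flexible.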
The rest of the changes are conceptual changes to simplify the resources model (no more different kinds of buffers, simpler texture semantics, etc.).
there still are. check out the links in the 2nd post. it's split into 2 stages: 1. you allocate a bunch of memory, 2. you prepare it for a specific usage case, e.g. as render target or vertexbuffer.
now assume you want to use a texture as render target and use it as source in the 2nd drawcall (e.g. some temporal texture you use in post processing). you need to state that to the API. assume further you take advantage of the new multithreaded command generation, so you can't keep track of the state of an object inside the object, you rather need to track states per commandbuffer/thread. assume further, you don't want to do redundant state conversions, as those might be quite expensive (changing layouts of buffers to make them best suited for texture sampling, for rendering, for vertex access), so you'd need to actually somehow merge the states of resources you use in consecutive command buffers.
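a rough sketch of what that bookkeeping could look like, under my own assumptions: each command list tracks resource states locally while recording, and the submission thread patches in the transitions between consecutive lists. `ResourceState`, `CommandListStateTracker` and `resolveAtSubmit` are hypothetical names, not anything from Mantle or D3D12.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical usage states; Undefined is reserved for "never used so far".
enum class ResourceState { Undefined, RenderTarget, ShaderRead, VertexBuffer };

using ResourceId = std::uint32_t;
using StateMap = std::unordered_map<ResourceId, ResourceState>;

struct Transition {
    ResourceId resource;
    ResourceState before;
    ResourceState after;
};

// One tracker per command list / recording thread, so no shared state is touched.
class CommandListStateTracker {
public:
    // Declare that the next command needs `resource` in state `needed`.
    void require(ResourceId resource, ResourceState needed) {
        assert(needed != ResourceState::Undefined);
        auto it = last_.find(resource);
        if (it == last_.end()) {
            // First use in this list: the incoming state is only known at submit.
            first_.emplace(resource, needed);
            last_.emplace(resource, needed);
        } else if (it->second != needed) {
            local_.push_back({resource, it->second, needed});
            it->second = needed;
        }
        // Same state as before: nothing recorded, no redundant conversion.
    }

    const StateMap& firstUse() const { return first_; }
    const StateMap& lastUse() const { return last_; }
    const std::vector<Transition>& localTransitions() const { return local_; }

private:
    StateMap first_;
    StateMap last_;
    std::vector<Transition> local_;
};

// Runs on the submitting thread, in submission order. `global` holds the state
// each resource was left in by the lists submitted earlier. Returns the
// transitions that must execute before `list`.
inline std::vector<Transition> resolveAtSubmit(StateMap& global,
                                               const CommandListStateTracker& list) {
    std::vector<Transition> patch;
    for (const auto& entry : list.firstUse()) {
        const auto it = global.find(entry.first);
        const ResourceState current =
            (it == global.end()) ? ResourceState::Undefined : it->second;
        if (current != entry.second) {
            patch.push_back({entry.first, current, entry.second});
        }
    }
    for (const auto& entry : list.lastUse()) {
        global[entry.first] = entry.second;
    }
    return patch;
}
```

if one list leaves a texture as render target and the next one samples it, `resolveAtSubmit` returns exactly one transition; if both lists only sample it, it returns none. and this is only the simplest case, it ignores subresources, queues and everything else a real implementation would have to handle.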
the more explicit threading model (only particularly relevant if you want/need render threading), and the more explicit device model (pick which GPU you use for what on multi-GPU systems).
you know you have to test and balance all that? cross fire works across different GPUs. you can have an APU gpu + some mid range Radeon HD 7700 + a top notch R9 290x. and with D3D, there is a generic driver that might execute asymmetrically on those GPUs. it's something you'd need to handle. I don't say that's impossible, but for the majority of devs, it can end up in either a lot of work (testing all kinds of configurations in various parts of your game) or you can disappoint some high end users because their expensive 4x crossfire is no faster than 3x crossfire, or even buggy.
a lot of work that drivers did before will end up in the hands of devs, and it's not optional, it's what you'll have to do. you might ship a perfectly fine running game, and some new GPU might take advantage of something that hasn't been used before and uncover a bug in your 1-year-old game that ppl still play. and AMD/NV won't release a driver fix, you need to release a patch.
I see the benefits you've mentioned, but I also see all the drawbacks. I like low level programming on consoles, below what mantle/D3D12 offers, but I'm not sure about the PC side. when there was Glide/S3 Metal/RRedline/... and even GL was working differently (MiniGL/PowerSGL/..), every developer felt relieved when it ended with D3D. and the RefRas was actually pushed by game devs, to be able to validate that something is a driver bug. now it all seems forgotten and like a step back.
the Cass Everitt talk really seems like the best balance of both worlds to me (if it would be extended a little bit).
Regarding being CPU bound - this depends on whether you're making a graphical tech demo, or a game. For the former, you might have 16ms of GPU time and 16ms of CPU time per frame dedicated to graphics. For the latter, you've still got 16ms of GPU time (for now, until everyone else realizes you've ended up as the gatekeeper of their SPU-type jobs!), but maybe only a budget of 2ms of CPU time because all the other departments on the team need CPU time as well! In that situation, it's much easier to overrun your CPU budget...
yet there are very few games that saturate more than 2 cores. most have a render thread, and that one runs independently of the other parts. from the architecture point of view, that implies rendering in games nowadays runs no differently than in tech demos, unless your job system really fills up all cores and could benefit from freeing up the rendering thread/core. if you don't occupy all cores and you don't run a render thread, there is no reason to complain about API limitations.
P.S. I'm about to sign an NDA with AMD to get access to Mantle, so it's obviously being released wider than just DICE now
yet it makes me wonder, are we really that much cpu bound? from my perspective, it needs a really slow cpu to saturate on the API side. usually, with instancing etc. any modern i3,i5,i7 is fast enough in a single thread to saturate on the GPU side.
In my experience it's very easy to be CPU-bound in D3D11 with real-world rendering scenarios. Lots of draw calls, and lots of resource bindings. This is true for us even on beefy Intel CPU's. We've had to invest considerable amounts of engineering effort into changing our asset pipeline and engine runtime in ways that reduced CPU usage.
I'm implying that you'll end up doing the same for D3D12/Mantle, just not because of the CPU, but because the GPU will have idle bubbles in the pipeline if you start switching states. (if you profile on consoles, with low CPU overhead, that's what you'll see.) It's still work that has to be done, and a 1GHz sequential processor won't do any magic. (not talking bout shaders, but bout the command processor part!) We have low level access to HW on consoles, and while you might think we could now be wasteful with drawcalls etc., we actually spend a lot of SPU cycles on batching meshes, removing redundant states and even shader preparation that the GPU could handle, just to keep that work off the GPU. it's just moving the bottleneck to another place, not removing it, and at some point you'll hit it again and end up with the same old thinking: the fastest optimization is to not do wasteful work, no matter how fast you'd do it otherwise.
The opengl extensions from NVidia's talk are somehow way more what I'd hope for as the direction of 'next gen apis'. it's as easy to use as opengl always was, just extending the critical parts to perform better. (I'm talking bout http://www.slideshare.net/CassEveritt/approaching-zero-driver-overhead ). it actually makes things nicer with persistent mapped buffers (you don't need to guess and hope how every driver will 'optimize' your calls, and you have all the responsibility and possibilities that come with using persistent buffers). and if multidrawindirect would be extended a bit more to support an array of indexed shader objects, you could render the whole solid pass with one drawcall. shadowmaps would possibly end up being one drawcall each, and preparing those batched drawcalls could be done in a multithreaded way if you want.
Really? The future you want is more instancing, wrapped up in a typical OpenGL layer of "you have to do it in this super-special way in order to hit the fast path"??? To me it's completely at ends with what actual software developers want.
I have a feeling you haven't looked into Cass Everitt's talk. it's not about classical instancing. it's about creating a list of drawcalls with various resources (vertexbuffers, indexbuffers, textures...) and submitting all of it in one drawcall. so instead of
gl set states
gl draw mesh
you do
store_states into array
store_mesh offsets/count etc. into array
gl draw_everything of array
so, there is no "you have to do it in this super-special way in order to hit the fast path", it's quite the opposite, a very generic way. you don't have to touch the shader to account for some special instancing. you don't have to worry about resource limits and binding. all you do is create a vector of all drawcalls, just like you'd 'record' it with mantle/D3D12.
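the 'vector of all drawcalls' part could be sketched like this. the record layout is the five 32-bit fields that `glMultiDrawElementsIndirect` reads per draw; `MeshRange` and `recordDraws` are hypothetical helpers of mine, and using `baseInstance` as a per-draw index for shader-side lookups is one possible convention, not something the API requires.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One indirect record as consumed by glMultiDrawElementsIndirect.
struct DrawElementsIndirectCommand {
    std::uint32_t count;          // number of indices
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;     // offset into the shared index buffer, in indices
    std::int32_t  baseVertex;     // added to every index
    std::uint32_t baseInstance;
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "records must be tightly packed");

// Hypothetical description of where a mesh lives inside the shared buffers.
struct MeshRange {
    std::uint32_t firstIndex;
    std::uint32_t indexCount;
    std::int32_t  baseVertex;
};

// 'Records' one draw per mesh into a plain array instead of issuing GL calls.
inline std::vector<DrawElementsIndirectCommand> recordDraws(
        const std::vector<MeshRange>& meshes) {
    std::vector<DrawElementsIndirectCommand> cmds;
    cmds.reserve(meshes.size());
    for (const MeshRange& m : meshes) {
        assert(m.indexCount % 3 == 0);  // this sketch assumes triangle lists
        DrawElementsIndirectCommand c;
        c.count = m.indexCount;
        c.instanceCount = 1;
        c.firstIndex = m.firstIndex;
        c.baseVertex = m.baseVertex;
        // Assumed convention: the draw's own index, for per-draw data lookups.
        c.baseInstance = static_cast<std::uint32_t>(cmds.size());
        cmds.push_back(c);
    }
    return cmds;
}

// After copying `cmds` into a buffer bound to GL_DRAW_INDIRECT_BUFFER:
//   glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr,
//                               static_cast<GLsizei>(cmds.size()), 0);
```

nothing in `recordDraws` touches GL, so several threads can each fill their own slice of the array and only the final submit has to happen on the thread that owns the context.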
yes, it's more limited right now, but that's why I've said, I'd rather see this extended.
Everybody who works on consoles knows how low-overhead it *should* be to generate command buffers, and so they constantly beg for lower-overhead draw calls, better multithreading, and more access to GPU memory. Instead we get that "zero driver overhead" presentation that's like "lol too bad we're never going to change anything, here's some new extensions that only work on Nvidia and may require you to completely rewrite your rendering pipeline to use effectively." Great :-/
I really disagree on that one. it offers you persistent memory that you can write to from multiple threads and manage yourself, just like we do on consoles. it lets you create command lists (rather vectors) in a multithreaded way, as you can do on consoles. and it's not about "we won't change a thing", it's rather "we've already given you a 90% solution that you can get hands on right now and the changes required are minimal compared to the rewrite for D3D12/Mantle for 10% more".
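as a sketch of what 'write multithreaded and manage it yourself' can mean in practice: a lock-free bump allocator over a pointer that stays mapped. I'm assuming the pointer comes from a buffer created with `glBufferStorage` and mapped once via `glMapBufferRange` with `GL_MAP_PERSISTENT_BIT`; `FrameWriter` and `kOutOfSpace` are made-up names, and the sketch takes any byte array so it can be tried without a GL context.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>

// Returned when the mapped range is exhausted.
constexpr std::size_t kOutOfSpace = static_cast<std::size_t>(-1);

// Hands out disjoint, aligned sub-ranges of one mapped range to any number of
// writer threads. The class never calls into GL itself.
class FrameWriter {
public:
    FrameWriter(unsigned char* mapped, std::size_t capacity)
        : base_(mapped), capacity_(capacity), head_(0) {}

    // Copies `size` bytes into the next free aligned slot and returns its byte
    // offset, or kOutOfSpace. `alignment` must be a power of two.
    std::size_t write(const void* data, std::size_t size, std::size_t alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        std::size_t old = head_.load(std::memory_order_relaxed);
        for (;;) {
            const std::size_t start = (old + alignment - 1) & ~(alignment - 1);
            if (start > capacity_ || size > capacity_ - start) {
                return kOutOfSpace;
            }
            // On failure `old` is refreshed with the current head and we retry.
            if (head_.compare_exchange_weak(old, start + size,
                                            std::memory_order_relaxed)) {
                std::memcpy(base_ + start, data, size);
                return start;
            }
        }
    }

    // Call only when no thread is writing and the GPU is done with the range.
    void reset() { head_.store(0, std::memory_order_relaxed); }

private:
    unsigned char* base_;
    std::size_t capacity_;
    std::atomic<std::size_t> head_;
};
```

the 'manage it yourself' part is the `reset`: nothing stops you from overwriting data the GPU is still reading, so you'd guard it with a fence (`glFenceSync` / `glClientWaitSync`) or keep a few of these ranges and rotate them per frame.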
no offense intended, but have you really looked into it? I can't think of why it would be a pipeline rewrite for you, it's just a little change in buffer management (which aligns well with what you do if you follow best practice guides like https://developer.nvidia.com/sites/default/files/akamai/gamedev/files/gdc12/Efficient_Buffer_Management_McDonald.pdf ) and the 2nd part is 'recording' of drawcalls, less complex than with D3D12/Mantle (because you don't have to pre-create all states and manage them), which isn't that different to what you do if you try to sort your drawcalls to minimize state switching (which everyone does even on consoles, where drawcalls should be cheap, yet those hit you hard on GPU).
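for reference, the kind of sorting I mean could be sketched like this. the key layout (shader in the top bits, then texture, then mesh) is just an assumption about which switch costs the most, and `DrawItem` is a hypothetical record, not part of any API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-draw record holding ids of the states the draw needs.
struct DrawItem {
    std::uint32_t shader;   // must fit in 16 bits
    std::uint32_t texture;  // must fit in 16 bits
    std::uint32_t mesh;
};

// Most expensive state change in the highest bits, so that sorting by key
// groups draws by shader first, then texture, then mesh.
inline std::uint64_t sortKey(const DrawItem& d) {
    assert(d.shader < 65536u && d.texture < 65536u);
    return (static_cast<std::uint64_t>(d.shader) << 48) |
           (static_cast<std::uint64_t>(d.texture) << 32) |
           static_cast<std::uint64_t>(d.mesh);
}

inline void sortByState(std::vector<DrawItem>& draws) {
    std::sort(draws.begin(), draws.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return sortKey(a) < sortKey(b);
              });
}

// Counts how often the shader changes between consecutive draws.
inline std::size_t shaderSwitches(const std::vector<DrawItem>& draws) {
    std::size_t switches = 0;
    for (std::size_t i = 1; i < draws.size(); ++i) {
        if (draws[i].shader != draws[i - 1].shader) {
            ++switches;
        }
    }
    return switches;
}
```

four draws that alternate between two shaders switch three times unsorted and once after `sortByState`. once the list is sorted like that, storing it into an array for one batched submit is a small step.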
and in case we don't want to optimize for the current saturation, but rather increase drawcall count etc., I really wonder when it starts to be rather suboptimal on the GPU side. if we were able to push 10M draw calls/s, that's like 100 cycles/DC on a modern gpu, and those have really deep pipelines, sometimes needing cache flushes, and every DC needs some context for the gpu setup that has to be fetched. We'll end up with "yes, this could be done, but would be suboptimal, let's go back to instancing etc. again". that's no different than what we do now: a few 'big' setups per frame, and pushing as many drawcalls with as few state/resource changes as possible, to saturate on the shader/rasterization/fillrate side instead.
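the 100 cycles figure is just the clock divided by the draw rate, assuming the roughly 1GHz command processor I mentioned earlier. as a tiny sketch (function names are mine, and the 60 fps in the note below is an assumption too):

```cpp
#include <cassert>

// Average clock cycles the front end gets per draw call.
inline double cyclesPerDraw(double clockHz, double drawsPerSecond) {
    assert(drawsPerSecond > 0.0);
    return clockHz / drawsPerSecond;
}

// How many draw calls per frame a given draw rate corresponds to.
inline double drawsPerFrame(double drawsPerSecond, double framesPerSecond) {
    assert(framesPerSecond > 0.0);
    return drawsPerSecond / framesPerSecond;
}
```

`cyclesPerDraw(1e9, 10e6)` gives 100, and at 60 fps that rate is about 166,667 draw calls per frame. those 100 cycles have to cover fetching the context for each draw, before any flush is counted.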
Of course you can end up getting GPU limited, but the problem is that right now we can't even get to that point because there's too much CPU overhead. Software should be able to hit that breaking point of where batching is needed for GPU performance, and then developers can decide case-by-case on how much it makes sense for them to pursue instancing and things like that. It shouldn't be that you're forced into 100% instancing from the start otherwise you're dead in the water on PC, at least in my humble opinion.
well, maybe I'm just too used to preparing everything in the best way for GPUs; we barely ran into cpu limitations due to rendering. most of the time it's the GPU that struggles to run our games. at first it seemed as if consoles benefit from the low overhead, but then you take some captures and realize you pay for cache and pipeline flushes, and the solution is just the plain old way you'd always optimize for <D3D12. I just expect the same for D3D12/Mantle.
I think the problem is not the values for rendering, because it's stable when he just moves around. it rather seems like it only jitters when physics is involved, which might be due to very fine time steps and some squaring etc. you might do because of e=m*c^2
switching to doubles is usually not a great solution, adjusting ranges is better, but in case of a plane simulation, where you simulate just a few units and it's a really big space with maybe detailed movement (e.g. rolling slowly on the ground), double might be ok'ish.
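to show the effect I suspect, here is a small sketch. the numbers (a position 100000 units from the origin, a step of 0.001 units) are made up for illustration, and I'm reading 'adjusting ranges' as keeping positions relative to a nearby origin so the values stay small.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Gap between x and the next representable float above it.
inline float spacingAt(float x) {
    assert(std::isfinite(x));
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// One integration step: the new position after moving by `step`.
template <typename T>
T advance(T position, T step) {
    return position + step;
}
```

`spacingAt(100000.0f)` is 0.0078125, so in float `advance(100000.0f, 0.001f)` returns the position unchanged and the step is simply lost; slow movement either stalls or jumps in steps of the spacing. the same call with doubles moves as expected, and so does the float version when the position is stored relative to a local origin (e.g. `advance(0.0f, 0.001f)`).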
and I'm not 100% sure it's really the problem either, so I suggested to try it out, sorry if there was no detailed explanation 'why'. I was rather really trying to remote debug it. maybe it won't help.