Preparing for Mantle


17 replies to this topic

#1 rAm_y_   Members   -  Reputation: 481


Posted 06 April 2014 - 12:24 PM

What are your thoughts here? What should we expect? This could be a good chance to get ahead right at the start. Any resources, ideas, etc.?

 

I really don't know where to start, or how big a change it will be from GL/DX.




#2 MJP   Moderators   -  Reputation: 11770


Posted 06 April 2014 - 12:40 PM

I would check out the recent presentations from GDC about Mantle:

 

http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Mantle-Introducing-a-New-API-for-Graphics-Guennadi-Riguer.ppsx

 

http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Rendering-Battlefield-4-with-Mantle-Johan-Andersson.ppsx

 

http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Mantle-and-Nitrous-Combining-efficient-engine-design-with-a-modern-API-Dan-Baker.ppsx

 

FWIW, D3D12 looks similar to Mantle in many ways.



#3 Promit   Moderators   -  Reputation: 7621


Posted 06 April 2014 - 07:30 PM

Sounds like a pain in the ass, to me.



#4 TheChubu   Crossbones+   -  Reputation: 4766


Posted 06 April 2014 - 07:33 PM

I'm starting to think that AMD will never release an SDK/bindings/whatever for it...

 

Like seriously, what is the point of all these "Mantle is cool" public presentations if only a few (big) companies can get a hold of it? They could just give private presentations to the 10 or so representatives of the companies they're interested in and be done with it...

 

"Oh thanks Frostbite team for explaining to me how to use Mantle properly! Now with all this newfound knowledge I'll go to my favorite IDE and fuck myself!"


Edited by TheChubu, 06 April 2014 - 07:38 PM.

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

 

My journals: dustArtemis ECS framework and Making a Terrain Generator


#5 SeanMiddleditch   Members   -  Reputation: 7174


Posted 06 April 2014 - 08:45 PM

Like seriously, what is the point of all these "Mantle is cool" public presentations if only a few (big) companies can get a hold of it?


Building hype is a large part of advertising any product. Why do you think you hear about games months or even years before they come out?

#6 Krypt0n   Crossbones+   -  Reputation: 2672


Posted 07 April 2014 - 02:01 AM

Let's not get off topic with rants about AMD. The topic is quite interesting, and even if not for Mantle, the same points can be made for D3D12, and we can be quite sure MS will release that to the public at some point.

 

I think there are two main components that make the new APIs different from the previous ones.

1. A lot of caching/pre-creation of states. This can make your life quite difficult if you haven't designed for it. D3D11 already has states, but those are more a bundling of settings to reduce API calls; with the new APIs, it seems like they optimize a lot of the whole GPU setup (kind of similar to shader linking in OpenGL). Previously you could have artist-controlled states, or even states created dynamically by the game, but now you really don't want to create those at runtime.

The issues we had in the past console generation with shader permutations, where you had tons of 'cached' versions depending on flags, each of which doubles the shader count, will now apply to the whole rendering setup.

You can probably set any vertex shader, any pixel shader, and then disable color writes; knowing the whole scope, the Mantle/D3D12 driver should be able to back-track the GPU setup to your vertex shader, know that just positions are needed, and strip out every other redundant bit (which previously those 'mysterious' driver threads might or might not have done).

But this might be quite vendor specific (some might end up with the same GPU setup for two different states, e.g. in one you disable color writes and in the other you set blend to add(Zero, Zero), while another driver might not detect this). I'm not sure how this would be reflected in the API, i.e. whether you'd know two pipelines are the same so you could adjust your sorting to account for it.

Everything at runtime needs to select from the pre-created permutation set to have a stable framerate (a rough sketch of such a cache follows after point 2). I wonder if there will be any guarantees on how long a pipeline creation might take (in OpenGL (ES), shader linking sometimes takes several seconds). That's not only an issue of renderer architecture, but also of initialization time: you don't want to spend minutes caching thousands of states.

2. Multithreading. Previously I think there was no game that used multithreading to speed up the API part, mainly because using multiple interfaces was either not supported or, where there was a way (e.g. D3D11), it was actually slower.

Yet it makes me wonder: are we really that CPU-bound? From my perspective, it takes a really slow CPU to saturate on the API side. Usually, with instancing etc., any modern i3/i5/i7 is fast enough on a single thread to saturate the GPU side.

And in case we don't want to optimize for the current saturation point, but rather increase draw call counts etc., I really wonder when it starts to become suboptimal on the GPU side. If we were able to push 10M draw calls/s, that's about 100 cycles per draw call on a modern GPU, and those have really deep pipelines, sometimes needing cache flushes; every draw call needs some context for the GPU setup that has to be fetched. We'll end up with "yes, this could be done, but it would be suboptimal, let's go back to instancing etc. again".

That's no different from what we do now: a few 'big' setups per frame, and pushing as many draw calls with as few state/resource changes as possible, so that we saturate on the shader/rasterization/fillrate side instead.
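To make point 1 concrete, here's a minimal sketch of what such a pre-created permutation cache could look like. Everything here is hypothetical (PipelineDesc, Pipeline and createPipeline are made-up stand-ins, not Mantle/D3D12 types); the point is only that the expensive creation happens once at load time and the runtime path is lookup-only:

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical packed description of one full GPU setup (shaders + fixed-function state).
struct PipelineDesc {
    uint32_t vertexShaderId, pixelShaderId;
    uint32_t blendState, depthStencil, rasterizer;   // packed fixed-function settings
    bool operator==(const PipelineDesc& o) const {
        return vertexShaderId == o.vertexShaderId && pixelShaderId == o.pixelShaderId &&
               blendState == o.blendState && depthStencil == o.depthStencil &&
               rasterizer == o.rasterizer;
    }
};

struct PipelineDescHash {
    std::size_t operator()(const PipelineDesc& d) const {
        // FNV-1a over the packed words; good enough for a cache key.
        uint64_t h = 1469598103934665603ull;
        const uint32_t words[] = { d.vertexShaderId, d.pixelShaderId,
                                   d.blendState, d.depthStencil, d.rasterizer };
        for (uint32_t w : words) { h ^= w; h *= 1099511628211ull; }
        return (std::size_t)h;
    }
};

struct Pipeline { /* opaque, owned by the driver in a real API */ };

// Stand-in for the expensive driver call (compile/link the whole GPU setup).
Pipeline* createPipeline(const PipelineDesc&) { return new Pipeline(); }

std::unordered_map<PipelineDesc, Pipeline*, PipelineDescHash> g_pipelineCache;

// Load time: walk the material/pass combinations the content actually uses and warm the cache.
void warmCache(const std::vector<PipelineDesc>& usedCombinations) {
    for (const PipelineDesc& d : usedCombinations)
        g_pipelineCache.emplace(d, createPipeline(d));
}

// Runtime: lookup only; falling through to creation here is exactly the hitch you want to avoid.
Pipeline* getPipeline(const PipelineDesc& d) {
    auto it = g_pipelineCache.find(d);
    if (it != g_pipelineCache.end())
        return it->second;
    return g_pipelineCache.emplace(d, createPipeline(d)).first->second;
}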

 

The OpenGL extensions from NVidia's talk are much more in line with what I'd hope for as the direction of 'next gen APIs'. It's as easy to use as OpenGL always was, just extending the critical parts to perform better (I'm talking about http://www.slideshare.net/CassEveritt/approaching-zero-driver-overhead ). It actually makes things nicer with persistent mapped buffers: you don't need to guess and hope how every driver will 'optimize' your calls, and you get all the responsibility and possibilities that come with using persistent buffers. And if multidrawindirect were extended a bit more to support an array of indexed shader objects, you could render the whole solid pass with one draw call. Shadow maps would possibly end up being one draw call each, and preparing those batched draw calls could be done in a multithreaded way if you want.
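For reference, the persistent-mapping part of that talk boils down to something like this (GL 4.4 / ARB_buffer_storage; the buffer sizes are arbitrary, and the glFenceSync/glClientWaitSync wait you need before reusing a region is omitted for brevity):

#include <GL/glew.h>   // any loader that exposes GL 4.4 / ARB_buffer_storage
#include <cstddef>
#include <cstring>

static const GLsizeiptr kRegionSize = 4 * 1024 * 1024;  // per-frame region
static const int        kRegions    = 3;                // triple-buffer CPU writes

static GLuint buf       = 0;
static void*  mappedPtr = nullptr;

void createPersistentBuffer() {
    const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
    glGenBuffers(1, &buf);
    glBindBuffer(GL_ARRAY_BUFFER, buf);
    // Immutable storage: the driver can't orphan or shuffle it behind your back.
    glBufferStorage(GL_ARRAY_BUFFER, kRegionSize * kRegions, nullptr, flags);
    // Map once, keep the pointer for the lifetime of the buffer.
    mappedPtr = glMapBufferRange(GL_ARRAY_BUFFER, 0, kRegionSize * kRegions, flags);
}

// The CPU writes straight into GPU-visible memory: no Map/Unmap per frame,
// no glBufferSubData, no driver-side guessing about how you use the data.
void writeFrameData(unsigned frameIndex, const void* src, std::size_t bytes) {
    char* dst = static_cast<char*>(mappedPtr) + (frameIndex % kRegions) * kRegionSize;
    std::memcpy(dst, src, bytes);
}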

 

It feels like GL just doesn't have the great marketing campaign, but designing the next renderer would, for me, mean going the NV/GL way and mapping it to D3D12/Mantle under the hood.



#7 Mona2000   Members   -  Reputation: 625


Posted 07 April 2014 - 04:02 AM

and if multidrawindirect would be extended a bit more to support an array of indexed shader objects

Isn't that the point of shader subroutines?



#8 Hodgman   Moderators   -  Reputation: 31843


Posted 07 April 2014 - 04:45 AM

If you're really game, you can read all about how to program the raw GCN architecture here:

http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf

http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf

 

Basically, if you were at AMD working on a GL/D3D/Mantle driver, that's the information you'd need to know.

Mantle is basically a super-thin driver compared to the GL/D3D drivers, so the abstraction provided will be halfway between D3D and the info in those PDFs ;)



#9 Krypt0n   Crossbones+   -  Reputation: 2672


Posted 07 April 2014 - 06:22 AM

 

and if multidrawindirect would be extended a bit more to support an array of indexed shader objects

Isn't that the point of shader subroutines?

 

Technically yes, but I wouldn't use it that way in practice.

You might come to that conclusion if you look at it purely from the programming point of view, and technically it would be possible, but a lot of it comes down to the hardware, which is what matters:

1. With just one shader setup, you'd always need to involve all the shader stages that any of your 'subroutines' need (hull, domain, geometry, pixel shader), and that might be wasteful.

2. Hardware allocates resources for the worst case. If you have a simple vertex shader for most geometry but a very complex one, e.g. for skinned characters, the GPU (or rather the driver) would allocate registers, caches etc. for the skinned version, reducing your throughput a lot for all the other cases.

3. GPUs have optimizations for special cases, e.g. running early depth culling in the rasterizer stage if you don't modify the depth output in the shader, but with a unified shader and subroutines, if just one subroutine uses e.g. clip/kill, that optimization would be disabled for all of them.

 

You're right to the extent that it would already work nicely on today's hardware, and maybe we should consider it as a smart optimization in some very local cases where we know exactly what's going on. But I'd like to see a general-purpose solution with no worries about whether it might hurt performance more than it helps. NVidia stated that their 'Volta' GPU should have some built-in ARM cores; maybe then they can process more high-level states (aka shaders :) ).



#10 MJP   Moderators   -  Reputation: 11770


Posted 07 April 2014 - 03:14 PM


Yet it makes me wonder: are we really that CPU-bound? From my perspective, it takes a really slow CPU to saturate on the API side. Usually, with instancing etc., any modern i3/i5/i7 is fast enough on a single thread to saturate the GPU side.
 
In my experience it's very easy to be CPU-bound in D3D11 with real-world rendering scenarios: lots of draw calls, and lots of resource bindings. This is true for us even on beefy Intel CPUs. We've had to invest considerable amounts of engineering effort into changing our asset pipeline and engine runtime in ways that reduced CPU usage. Things aren't helped at all by the fact that we can't multithread D3D calls, and that there's a giant driver thread always using up a core.
 

The OpenGL extensions from NVidia's talk are much more in line with what I'd hope for as the direction of 'next gen APIs'. It's as easy to use as OpenGL always was, just extending the critical parts to perform better (I'm talking about http://www.slideshare.net/CassEveritt/approaching-zero-driver-overhead ). It actually makes things nicer with persistent mapped buffers: you don't need to guess and hope how every driver will 'optimize' your calls, and you get all the responsibility and possibilities that come with using persistent buffers. And if multidrawindirect were extended a bit more to support an array of indexed shader objects, you could render the whole solid pass with one draw call. Shadow maps would possibly end up being one draw call each, and preparing those batched draw calls could be done in a multithreaded way if you want.
 
Really? The future you want is more instancing, wrapped up in a typical OpenGL layer of "you have to do it in this super-special way in order to hit the fast path"??? To me it's completely at odds with what actual software developers want. Everybody who works on consoles knows how low-overhead it *should* be to generate command buffers, and so they constantly beg for lower-overhead draw calls, better multithreading, and more access to GPU memory. Instead we get that "zero driver overhead" presentation that's like "lol too bad we're never going to change anything, here's some new extensions that only work on Nvidia and may require you to completely rewrite your rendering pipeline to use them effectively." Great :-/
 

And in case we don't want to optimize for the current saturation point, but rather increase draw call counts etc., I really wonder when it starts to become suboptimal on the GPU side. If we were able to push 10M draw calls/s, that's about 100 cycles per draw call on a modern GPU, and those have really deep pipelines, sometimes needing cache flushes; every draw call needs some context for the GPU setup that has to be fetched. We'll end up with "yes, this could be done, but it would be suboptimal, let's go back to instancing etc. again".
That's no different from what we do now: a few 'big' setups per frame, and pushing as many draw calls with as few state/resource changes as possible, so that we saturate on the shader/rasterization/fillrate side instead.
 
Of course you can end up GPU-limited, but the problem is that right now we can't even get to that point because there's too much CPU overhead. Software should be able to hit that breaking point where batching is needed for GPU performance, and then developers can decide case-by-case how much it makes sense for them to pursue instancing and things like that. It shouldn't be that you're forced into 100% instancing from the start or you're dead in the water on PC, at least in my humble opinion.

Edited by MJP, 08 April 2014 - 02:53 PM.


#11 SeanMiddleditch   Members   -  Reputation: 7174


Posted 07 April 2014 - 04:04 PM

Sounds like a pain in the ass, to me.


Both Mantle and D3D12 should actually be quite a bit easier for most non-trivial renderer designs.

It's also actually fairly similar to how one might try to use D3D11 today in a multi-threaded renderer; some of the trickier/dumber parts are thankfully simplified. Create a command list, create some resources, execute a command list with a set of resources as inputs, done. The rest of the changes are conceptual changes to simplify the resources model (no more different kinds of buffers, simpler texture semantics, etc.), the more explicit threading model (only particularly relevant if you want/need render threading), and the more explicit device model (pick which GPU you use for what on multi-GPU systems).
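To illustrate the shape of that flow, a tiny hypothetical sketch (Device, CommandList, Queue and every call on them are invented placeholders; this is only the outline described above, not the actual Mantle or D3D12 API):

#include <cstddef>

struct Pipeline {};  struct Buffer {};  struct Texture {};

// Record-only object: nothing reaches the GPU until it is explicitly executed.
struct CommandList {
    void setPipeline(Pipeline*)              {}
    void setVertexBuffer(Buffer*)            {}
    void setTexture(int /*slot*/, Texture*)  {}
    void draw(int /*vertexCount*/)           {}
};

struct Device {
    CommandList* createCommandList()                  { return new CommandList(); }
    Buffer*      createBuffer(std::size_t /*bytes*/)  { return new Buffer(); }
    Texture*     createTexture(int /*w*/, int /*h*/)  { return new Texture(); }
};

struct Queue {
    void execute(CommandList* const* /*lists*/, int /*count*/) {}  // submission is explicit
};

void renderFrame(Device& dev, Queue& queue, Pipeline* pso, Buffer* vb, Texture* tex) {
    CommandList* cl = dev.createCommandList();   // recording can happen on any thread
    cl->setPipeline(pso);
    cl->setVertexBuffer(vb);
    cl->setTexture(0, tex);
    cl->draw(3);
    queue.execute(&cl, 1);                       // "done"
}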

#12 Hodgman   Moderators   -  Reputation: 31843


Posted 07 April 2014 - 05:06 PM

Regarding being CPU bound - this depends on whether you're making a graphical tech demo, or a game.
For the former, you might have 16ms of GPU time and 16ms of CPU time per frame dedicated to graphics.
For the latter, you've still got 16ms of GPU time (for now, until everyone else realizes you've ended up as the gatekeeper of their SPU-type jobs!), but maybe only a budget of 2ms of CPU time because all the other departments on the team need CPU time as well! In that situation, it's much easier to overrun your CPU budget...


Edited by Hodgman, 07 April 2014 - 08:49 PM.


#13 Krypt0n   Crossbones+   -  Reputation: 2672


Posted 07 April 2014 - 06:24 PM

Yet it makes me wonder: are we really that CPU-bound? From my perspective, it takes a really slow CPU to saturate on the API side. Usually, with instancing etc., any modern i3/i5/i7 is fast enough on a single thread to saturate the GPU side.

In my experience it's very easy to be CPU-bound in D3D11 with real-world rendering scenarios: lots of draw calls, and lots of resource bindings. This is true for us even on beefy Intel CPUs. We've had to invest considerable amounts of engineering effort into changing our asset pipeline and engine runtime in ways that reduced CPU usage.

I'm implying that you'll end up doing the same for D3D12/Mantle, just not because of the CPU, but because the GPU will have idle bubbles in its pipeline if you start switching states (if you profile on consoles, with low CPU overhead, that's what you'll see). It's still work that has to be done, and a 1GHz sequential processor won't do any magic (I'm not talking about shaders, but about the command processor part!).
We have low-level access to the hardware on consoles, and while you might think we could be wasteful with draw calls etc., we actually spend a lot of SPU cycles batching meshes, removing redundant states and even doing shader preparation that the GPU could handle, just to avoid that work on the GPU.
It's just moving the bottleneck to another place, not removing it, and at some point you'll hit it again and end up with the same old thinking: the fastest optimization is to not do wasteful work, no matter how fast you could do it otherwise.

 
 

The OpenGL extensions from NVidia's talk are much more in line with what I'd hope for as the direction of 'next gen APIs'. It's as easy to use as OpenGL always was, just extending the critical parts to perform better (I'm talking about http://www.slideshare.net/CassEveritt/approaching-zero-driver-overhead ). It actually makes things nicer with persistent mapped buffers: you don't need to guess and hope how every driver will 'optimize' your calls, and you get all the responsibility and possibilities that come with using persistent buffers. And if multidrawindirect were extended a bit more to support an array of indexed shader objects, you could render the whole solid pass with one draw call. Shadow maps would possibly end up being one draw call each, and preparing those batched draw calls could be done in a multithreaded way if you want.

Really? The future you want is more instancing, wrapped up in a typical OpenGL layer of "you have to do it in this super-special way in order to hit the fast path"??? To me it's completely at odds with what actual software developers want.

I have a feeling you haven't looked into Cass Everitt's talk.
It's not about classical instancing.
It's about building a list of draw calls, with various resources (vertex buffers, index buffers, textures...), and submitting all of it in one call. So instead of




for_all_my_drawcalls
  gl set states
  gl draw mesh

you write

for_all_my_drawcalls
  store states into array
  store mesh offsets/counts etc. into array

gl draw_everything of array
So there is no "you have to do it in this super-special way in order to hit the fast path"; it's quite the opposite, a very generic way. You don't have to touch the shaders to account for some special instancing scheme, and you don't have to worry about resource limits and bindings. All you do is create a vector of all your draw calls, just like you'd 'record' them with Mantle/D3D12.
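In GL terms (4.3+ / ARB_multi_draw_indirect), that "store into an array, then draw everything" pattern looks roughly like this; it assumes all the geometry lives in one shared vertex/index buffer and that state is uniform across the batch:

#include <GL/glew.h>   // needs GL 4.3+ / ARB_multi_draw_indirect
#include <vector>

// Command layout mandated by the extension.
struct DrawElementsIndirectCommand {
    GLuint count;          // index count for this draw
    GLuint instanceCount;  // usually 1
    GLuint firstIndex;     // offset into the shared index buffer
    GLuint baseVertex;     // offset into the shared vertex buffer
    GLuint baseInstance;   // handy place to smuggle a per-draw ID
};

static std::vector<DrawElementsIndirectCommand> cmds;

// "store mesh offsets/counts etc. into array" -- per-thread vectors can be
// concatenated into this one before submission if you record in parallel.
void recordDraw(GLuint indexCount, GLuint firstIndex, GLuint baseVertex, GLuint drawId) {
    cmds.push_back({ indexCount, 1u, firstIndex, baseVertex, drawId });
}

// "gl draw_everything of array": one submission for the whole pass.
void submitAll(GLuint indirectBuffer) {
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
    glBufferData(GL_DRAW_INDIRECT_BUFFER,
                 cmds.size() * sizeof(DrawElementsIndirectCommand),
                 cmds.data(), GL_STREAM_DRAW);
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                                nullptr,                  // commands start at offset 0
                                (GLsizei)cmds.size(), 0); // tightly packed
    cmds.clear();
}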

Yes, it's more limited right now, but that's why I said I'd rather see this extended.

 

Everybody who works on consoles knows how low-overhead it *should* be to generate command buffers, and so they constantly beg for lower-overhead draw calls, better multithreading, and more access to GPU memory. Instead we get that "zero driver overhead" presentation that's like "lol too bad we're never going to change anything, here's some new extensions that only work on Nvidia and may require you to completely rewrite your rendering pipeline to use them effectively." Great :-/

I really disagree on that one.
It offers you persistent memory, where you can write from multiple threads and manage it yourself, just like we do on consoles. It offers you a way to create command lists (rather, vectors) in a multithreaded way, as you can on consoles. And it's not about "we won't change a thing"; it's rather "we've already given you a 90% solution that you can get your hands on right now, and the changes required are minimal compared to the rewrite D3D12/Mantle needs for the remaining 10%".

No offense intended, but have you really looked into it? I can't see why it would be a pipeline rewrite for you. It's just a small change in buffer management (which aligns well with what you do if you follow best-practice guides like https://developer.nvidia.com/sites/default/files/akamai/gamedev/files/gdc12/Efficient_Buffer_Management_McDonald.pdf ), and the second part is 'recording' of draw calls, which is less complex than with D3D12/Mantle (because you don't have to pre-create all states and manage them) and isn't that different from what you do if you sort your draw calls to minimize state switching (which everyone does even on consoles, where draw calls should be cheap, yet those switches hit you hard on the GPU).

 

And in case we don't want to optimize for the current saturation point, but rather increase draw call counts etc., I really wonder when it starts to become suboptimal on the GPU side. If we were able to push 10M draw calls/s, that's about 100 cycles per draw call on a modern GPU, and those have really deep pipelines, sometimes needing cache flushes; every draw call needs some context for the GPU setup that has to be fetched. We'll end up with "yes, this could be done, but it would be suboptimal, let's go back to instancing etc. again".
That's no different from what we do now: a few 'big' setups per frame, and pushing as many draw calls with as few state/resource changes as possible, so that we saturate on the shader/rasterization/fillrate side instead.

Of course you can end up GPU-limited, but the problem is that right now we can't even get to that point because there's too much CPU overhead. Software should be able to hit that breaking point where batching is needed for GPU performance, and then developers can decide case-by-case how much it makes sense for them to pursue instancing and things like that. It shouldn't be that you're forced into 100% instancing from the start or you're dead in the water on PC, at least in my humble opinion.

Well, maybe I'm just too used to preparing everything in the best way for the GPU; we barely ran into CPU limitations due to rendering, and most of the time it's the GPU that limits our games. At first it seemed as if consoles had an advantage due to the low overhead, but then you take some captures and realize you pay for cache and pipeline flushes, and the solution is just the plain old way you'd always optimize pre-D3D12.
I just expect the same for D3D12/Mantle.

#14 Krypt0n   Crossbones+   -  Reputation: 2672


Posted 07 April 2014 - 06:32 PM

Regarding being CPU bound - this depends on whether you're making a graphical tech demo, or a game.
For the former, you might have 16ms of GPU time and 16ms of CPU time per frame dedicated to graphics.
For the latter, you've still got 16ms of GPU time (for now, until everyone else realizes you've ended up as the gatekeeper of their SPU-type jobs!), but maybe only a budget of 2ms of CPU time because all the other departments on the team need CPU time as well! In that situation, it's much easier to overrun your CPU budget...

Yet there are very few games that saturate more than 2 cores. Most have a render thread, and that one runs independently of the other parts. That implies that, from an architecture point of view, rendering in games nowadays runs no differently than in tech demos, unless your job system really fills up all cores and could benefit from freeing up the rendering thread/core.
If you don't occupy all cores and you don't run a render thread, there is no reason to complain about API limitations.

P.S. I'm about to sign an NDA with AMD to get access to Mantle, so it's obviously being released more widely than just DICE now :D

part of the NDA is to not talk about the NDA ;)

#15 Krypt0n   Crossbones+   -  Reputation: 2672


Posted 07 April 2014 - 07:02 PM

Create a command list, create some resources, execute a command list with a set of resources as inputs, done.

So, how do you keep a game (engine) flexible while knowing _all_ the states etc. up front, so that you don't have to create them at runtime? (Assume pipeline creation can take as much time as shader linking in OpenGL, which is about 1s in bad cases.)
There is no driver anymore that does that in a background thread for you, as fast as possible.
Assume you have about 1024 shader combinations; add the stencil, rasterizer, blend and rendertarget permutations that might be part of the GPU setup and therefore included in each static state you have to create.
Assume state creation isn't cross-platform but per driver+GPU, so you can't really do it offline before you ship the game.
 

The rest of the changes are conceptual changes to simplify the resources model (no more different kinds of buffers, simpler texture semantics, etc.).

There still are. Check out the links in the 2nd post: it's split into 2 stages.
1. You allocate a bunch of memory.
2. You prepare it for a specific usage case, e.g. as a render target or a vertex buffer.

Now assume you want to use a texture as a render target and then use it as a source in the next draw call (e.g. some temporal texture you use in post processing): you need to state that to the API.
Assume further that you take advantage of the new multithreaded command generation, so you can't keep track of the state of an object inside the object itself; you rather need to track states per command buffer/thread.
Assume further that you don't want to do redundant state conversions, as those might be quite expensive (changing buffer layouts to make them best suited for texture sampling, for rendering, for vertex access), so you actually need to somehow merge the states of the resources you use across consecutive command buffers.
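A crude sketch of what that per-command-buffer tracking could look like (ResourceState and everything else here are made-up names, not the actual Mantle/D3D12 enums; it just shows why the first use in each command buffer can only be resolved later, once submission order is known):

#include <unordered_map>
#include <vector>

// Hypothetical usage states, in the spirit of the "prepare for a specific usage" step above.
enum class ResourceState { Unknown, RenderTarget, ShaderRead, VertexBuffer, CopyDest };

struct Resource { int id; };

struct Transition { Resource* res; ResourceState before, after; };

// Each command buffer is recorded on its own thread, so it can only track what *it* did.
struct CommandBufferStateTracker {
    std::unordered_map<Resource*, ResourceState> assumedOnEntry; // first use per resource
    std::unordered_map<Resource*, ResourceState> current;        // state at the tail
    std::vector<Transition> recorded;                            // real, potentially costly conversions

    void use(Resource* r, ResourceState needed) {
        auto it = current.find(r);
        if (it == current.end()) {
            assumedOnEntry[r] = needed;   // validated only when submission order is known
            current[r] = needed;
        } else if (it->second != needed) {
            recorded.push_back({ r, it->second, needed });
            it->second = needed;
        }
        // else: redundant request, skip it
    }
};

// At submission time, consecutive command buffers get stitched together: whatever state one
// left a resource in must match (or be patched to match) what the next one assumed on entry.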


 

the more explicit threading model (only particularly relevant if you want/need render threading), and the more explicit device model (pick which GPU you use for what on multi-GPU systems).

You know you have to test and balance all of that? CrossFire works across different GPUs: you can have an APU's GPU + some mid-range Radeon HD 7700 + a top-notch R9 290X.
With D3D there is a generic driver that might execute asymmetrically on those GPUs; now it's something you'd need to handle yourself.
I'm not saying that's impossible, but for the majority of devs it can end up either as a lot of work (testing all kinds of configurations in various parts of your game) or you disappoint some high-end users because their expensive 4x CrossFire is no faster than 3x CrossFire, or is even buggy.


A lot of work that drivers did before will end up in the hands of devs, and it's not optional, it's what you'll have to do. You might ship a perfectly fine running game, and some new GPU might take advantage of something that hasn't been used before and uncover a bug in your 1-year-old game that people still play. And AMD/NV won't release a driver fix; you'll need to release a patch.

I see the benefits you've mentioned, but I also see all the drawbacks.
I like low-level programming on consoles, below what Mantle/D3D12 offers, but I'm not sure about the PC side. When there was Glide/S3 Metal/RRedline/... and even GL worked differently everywhere (MiniGL/PowerSGL/...), every developer felt relieved when it ended with D3D. And the RefRast was actually pushed by game devs, to be able to validate that something is a driver bug. Now it all seems forgotten, and like a step back.

The Cass Everitt talk really seems like the best balance of both worlds to me (if it were extended a little bit).

Edited by Krypt0n, 07 April 2014 - 07:07 PM.


#16 Hodgman   Moderators   -  Reputation: 31843


Posted 07 April 2014 - 08:29 PM

So, how do you keep a game (engine) flexible while knowing _all_ the states etc. up front, so that you don't have to create them at runtime? (Assume pipeline creation can take as much time as shader linking in OpenGL, which is about 1s in bad cases.)

I don't know how Mantle / D3D12 are going to do it, but if we were doing it ourselves close to the metal (no validation costs), then "creating a state object" is the same as "creating an array of N int32s" (free), and configuring one is the same as packing M different values into that array of N int32s (very, very cheap); see the toy sketch after this paragraph.
Modern GPUs are also moving towards being completely stateless, where there is hardly any per-batch pipeline bubbling to worry about any more.
Also, modern GPUs are moving towards having multiple rings of commands being processed, so that if you do create a bubble, it can be filled with work that you've queued up on one of your compute queues anyway :D
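Something in the spirit of this toy, that is (the field layout is invented purely for illustration; real hardware packets differ per vendor and generation):

#include <cstdint>

struct RenderState {
    uint32_t words[4] = {};   // "creating a state object" is just this

    // "Configuring" is a handful of shifts and masks -- no validation, no driver round trip.
    void setDepthTest(bool enable, uint32_t compareFunc) {         // compareFunc in 0..7
        words[0] = (words[0] & ~0xFu) | (enable ? 0x8u : 0u) | (compareFunc & 0x7u);
    }
    void setBlendFactors(uint32_t srcFactor, uint32_t dstFactor) {  // each in 0..15
        words[1] = (words[1] & ~0xFFu) | (srcFactor & 0xFu) | ((dstFactor & 0xFu) << 4);
    }
    void setCullMode(uint32_t mode) {                               // 0=none, 1=front, 2=back
        words[2] = (words[2] & ~0x3u) | (mode & 0x3u);
    }
};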

I really disagree on that one.
It offers you persistent memory, where you can write from multiple threads and manage it yourself, just like we do on consoles. It offers you a way to create command lists (rather, vectors) in a multithreaded way, as you can on consoles. And it's not about "we won't change a thing"; it's rather "we've already given you a 90% solution that you can get your hands on right now, and the changes required are minimal compared to the rewrite D3D12/Mantle needs for the remaining 10%".

The bolded bit (persistent memory that you manage yourself) really is a big deal; I'm surprised no one else has highlighted it as the standout feature of that talk. Basically, persistent buffers mean that we've finally got a malloc/free for CPU+GPU-addressable RAM! The GPU can read from it at full speed, and the CPU can do fast write-combined writes into it (CPU reads will be god-awfully slow: non-cached). It's not quite feature-complete yet -- AFAIK we can't interpret this malloc'ed RAM however we like -- we can't tell one shader to fetch it as a texture, another to read from it as an index buffer, another to stream out to it, and another to write to it as an MSAA render target... all features that should be possible in an ideal API.
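And once you have that pointer, handing out per-frame chunks of it is just a bump allocator. A minimal sketch (the 256-byte alignment and the wholesale per-frame reset are assumptions, and the fence wait before reset is left out):

#include <atomic>
#include <cstddef>
#include <cstdint>

// Sub-allocates from a persistently mapped, GPU-visible region (e.g. the pointer returned by
// glMapBufferRange earlier in the thread). Write-combined memory: write sequentially, never read back.
struct GpuFrameAllocator {
    uint8_t*                 base     = nullptr;
    std::size_t              capacity = 0;
    std::atomic<std::size_t> offset { 0 };     // lock-free, so any thread can allocate

    void* alloc(std::size_t bytes, std::size_t align = 256) {
        std::size_t padded = (bytes + align - 1) & ~(align - 1);
        std::size_t start  = offset.fetch_add(padded);
        return (start + padded <= capacity) ? base + start : nullptr;  // nullptr = out of space
    }

    // Call once the GPU fence for the frame that used this region has been signalled.
    void reset() { offset.store(0); }
};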

The multi-draw-indirect stuff is cool, but it's still not a real command buffer -- it lets you submit a collection of draw calls, but no more than that -- they all have to share the same state.
The future is only getting more and more parallel/multi-core and sooner or later, having a single "GPU thread" isn't going to cut it any more. Plenty of engines are already built around it, but only on PS3/360 (and PS4/Xbone now)...
 
I think the "we won't change a thing" type comment was aimed at GL as a whole -- GL just keeps piling on more features to create the behemoth of an API that it is, where there are 100 ways to implement each feature, but only a very small set of them (usually involving vendor-specific extensions) will get you onto the "fast path".
The idea of something like Mantle is that you're only given the fast path, and all the stuff on the "slow path" is left for you to implement/emulate yourself if you want it.
If you've not used GL for a while, it is pretty intimidating to try to get back into it and find where all the fast paths are for each different generation of cards... I'd actually prefer it if they had completely separate APIs for different eras of GPUs, like the Dx9 / Dx11 schism :D
 

So there is no "you have to do it in this super-special way in order to hit the fast path"; it's quite the opposite, a very generic way

No - they're demonstrating a really cool fast path within GL in that talk -- the second slide even says so, that fast paths exist but you need to know what they are... implying that if you use the wrong parts of GL, you won't be on the fast path. Also, many of the features they're showing off aren't core GL -- they're vendor extensions, and despite the EXT prefix, they're not supported by all vendors yet either! They also point out several vendor-specific caveats in there -- "Intel's fast path involves doing things this way, AMD's fast path requires this other thing", etc...
So you'd still need several vendor-specific GL back-ends in your engine :( -- and also back-ends for older generations of cards that are never going to support these extensions, which means they've got their own "fast paths"...
 

I'm implying that you'll end up doing the same for D3D12/Mantle, just not because of the CPU, but because the GPU will have idle bubbles in its pipeline if you start switching states (if you profile on consoles, with low CPU overhead, that's what you'll see). It's still work that has to be done, and a 1GHz sequential processor won't do any magic (I'm not talking about shaders, but about the command processor part!).
We have low-level access to the hardware on consoles, and while you might think we could be wasteful with draw calls etc., we actually spend a lot of SPU cycles batching meshes, removing redundant states and even doing shader preparation that the GPU could handle, just to avoid that work on the GPU.

Yeah, pipeline bubbles on GeForce 7-era cards are really horrible. You can actually feed that generation really high batch counts and the CP will cope fine -- as long as you don't switch states with every batch. If the states are compatible, it will actually merge batches internally. If the states aren't compatible and the batches are too small, then yeah, you end up wasting a huge amount of time in pipeline bubbles.
However, modern cards are implemented completely differently now. You probably still don't want batches with unique states that cover < 128 pixels, but the performance characteristics aren't at all comparable to something as many generations apart as the GF7!


Edited by Hodgman, 07 April 2014 - 08:32 PM.


#17 phantom   Moderators   -  Reputation: 7565


Posted 08 April 2014 - 03:59 AM

The problem with the AZDO talk is that, while it is nice and all, practically speaking it only works on one vendor: NVidia. AMD lacks the key extensions, and Intel is nowhere close either (still a couple of major versions behind the spec, afaik).

As cool as multidrawindirect is, regardless of any protests it IS just an instancing system: your targets, shaders and states are fixed. At best you can reference different source data, which at least opens the door to different textures (assuming bindless and drawID support, aka NV-only currently), but it is still instancing.

Which is fine; instancing is good and it's one part of the puzzle, but it's not the be-all and end-all solution.

Games are not 'throw 1,000,000 objects at the screen once' affairs; multiple passes and multiple states mean hitting the draw path frequently, still with switching in between. Most of these passes are logically independent and can be 'set up' in advance: in a simple forward renderer my final pass might require the shadow map pass to run first, but from a command point of view I can build that shadow pass at the same time as the final pass and just execute them in the correct order. (Which scales with increasingly complex pass setups; we had a forward-rendered game which was performing somewhere in the region of 8 passes before you even got to post processing.)

OpenGL is, by its nature, forcing you to do ALL of this on one thread, a thread which isn't getting any faster any more either. In 2014 you are going to have the same hard limit as in 2016 and 2018. Even if you make ever more exotic 'draw' commands which pack state information into the API function for depth, stencil, targets, shaders etc., this still has to be deserialised, broken up and processed somewhere; it's just that 'somewhere' is now magic driver voodoo which has to do all manner of hazard tracking and try to keep state straight. You've not made it any faster, you've just made the driver's life 100x harder.

This is where Mantle/D3D12 enter the game and say 'sure, you can do that OpenGL stuff, but how about we let you record things on separate threads too?'. So now, instead of being serialized on a single thread, I can build my command buffers separately (probably still based around multidrawindirect) and just kick them one at a time in the correct order. Suddenly this massive serial block goes away, I can use more of the CPU's resources and, more importantly, this work is now more predictable because the driver is doing less behind the scenes.
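In sketch form, the model is simply "record on N threads, kick in a fixed order" (CommandList here is just a stand-in for whatever opaque object Mantle/D3D12 hand back, and the pass bodies are stubbed out):

#include <thread>
#include <vector>
#include <cstdint>

// Stand-in for the API's opaque command buffer object.
struct CommandList { std::vector<uint32_t> packets; };

CommandList recordShadowPass() { CommandList cl; /* walk shadow casters, emit draws */ return cl; }
CommandList recordMainPass()   { CommandList cl; /* walk visible set, emit draws */   return cl; }
void submitInOrder(const std::vector<const CommandList*>& lists) { (void)lists; /* one cheap kick per list */ }

void renderFrame() {
    CommandList shadow, main_;
    // The two passes are logically independent, so record them concurrently...
    std::thread t0([&] { shadow = recordShadowPass(); });
    std::thread t1([&] { main_  = recordMainPass();  });
    t0.join();
    t1.join();
    // ...but execution order stays explicit and correct: shadows first, then the main pass.
    submitInOrder({ &shadow, &main_ });
}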

How is this not better?
I don't see how anyone can argue against this system with a straight face.

And this is before you add in persistent command buffers which can be replayed (because why am I regenerating the command buffer for my post-processing stack every frame anyway?!?), and the reduction in overhead because the driver no longer has to track hazards (and the improvements that brings, because we can see where the hazards are going to be before the driver can, instead of the driver late-patching things in the command buffer).

Hell, the D3D12 early runtime from the //Build conference showed this to be a win right away and that's with early code!

Yes, AZDO has some nice ideas, but to suggest that this is the solution and that we should keep making a fatter draw function and continue pushing commands down a single thread is, frankly, short-sighted madness that is driving us fast into a brick wall of performance issues.

#18 Krypt0n   Crossbones+   -  Reputation: 2672


Posted 08 April 2014 - 08:17 AM

Sorry, no time to reply right now, but I thought you guys should see what we're talking about:

http://channel9.msdn.com/Events/Build/2014/3-564

:)





