What are your opinions on DX12/Vulkan/Mantle?

Started by
120 comments, last by Ubik 8 years, 10 months ago

So... is the Vulkan API available for us plebeians to peruse anywhere?

Not yet.

" Vulkan initial specifications and implementations are expected later this year "

From the press release (https://www.khronos.org/news/press/khronos-reveals-vulkan-api-for-high-efficiency-graphics-and-compute-on-gpus)



Apparently the Mantle spec documents will be made public very soon, which will serve as a draft/preview of the Vulkan docs that will come later.

I'm extremely happy with what we've heard about Vulkan so far. Supporting it in my engine is going to be extremely easy.

However, supporting it in other engines may be a royal pain.
e.g. If you've got an engine that's based around the D3D9 API, then your D3D11 port is going to be very complex.
However, if your engine is based around the D3D11 API, then your D3D9 port is going to be very simple.

Likewise for this new generation of APIs -- if you're focusing too heavily on current generation thinking, then forward-porting will be painful.

In general, implementing new philosophies using old APIs is easy, but implementing old philosophies on new APIs is hard.

In my engine, I'm already largely using the Vulkan/D3D12 philosophy, so porting to them will be easy.
I also support D3D9-11 / GL2-4 - and the code to implement these "new" ideas on these "old" APIs is actually fairly simple - so I'd be brave enough to say that it is possible to have a very efficient engine design that works equally well on every API - the key is to base it around these modern philosophies though!
Personally, my engine's cross-platform rendering layer is based on a mixture of Mantle and D3D11 ideas.

I've made my API stateless: every "DrawItem" must contain a complete pipeline state (blend/depth/raster/shader programs/etc.) and all resource bindings required by those programs. However, the way these states/bindings are described (in client/user code) is very similar to the D3D11 model.
DrawItems can/should be prepared ahead of time and reused, though you can create them every frame if you want... When creating a DrawItem, you need to specify which "RenderPass" it will be used for, which specifies the render-target format(s), etc.

On older APIs, this lets you create your own compact data structures containing all the data required to make the D3D/GL API calls for that draw-call.
On newer APIs, this lets you actually pre-compile the native GPU commands!

You'll notice that in the Vulkan slides released so far, when you create a command buffer, you're forced to specify which queue you promise to use when submitting it later. Different queues may exist on different GPUs -- e.g. if you've got an NVidia and an Intel GPU present. The requirement to specify a queue ahead of time means that you're actually specifying a particular GPU ahead of time, which means the Vulkan drivers can convert your commands to that GPU's actual native instruction set ahead of time!

In either case, submitting a pre-prepared DrawItem to a context/command-buffer is very simple/efficient.
As a bonus, you sidestep all the bugs involved in state-machine graphics APIs.


That sounds extremely interesting. Could you give a concrete example of what the descriptions in a DrawItem look like? What is the granularity of a DrawItem? Is it a per-mesh kind of thing, or more like a "one draw item for every material type" kind of thing, where you then draw every mesh that uses that material with a single DrawItem?

Can I say something I do not like (DX related)? The "new" feature levels, especially 12.1.

Starting with 10.1, Microsoft introduced the concept of "feature levels", a nice and smart way to collect hundreds of caps-bits and thousands of related permutations into a single, unique decree. With feature levels you can target older hardware with the latest runtime available. Microsoft did not completely remove caps-bits for optional features, but their number dropped dramatically, by something like two orders of magnitude. Even with Direct3D 11.2 the number of caps-bits remained relatively small, although they could have added a new feature level - call it feature level 11.2 - with all the new optional features and tier 1 of tiled resources; never mind, that's not a big deal after all - complaints should be focused on the OS support since D3D 11.1.

Since the new API is focused mostly on the programming model, new caps-bits and tier collections were expected with Direct3D 12, and Microsoft did a good job of dramatically reducing the complexity of the different hardware-capability permutations. The new caps-bits and tiers of DX12 are not a big issue. At GDC15 they also announced two "new" feature levels (~14:00): feature level 12.0 and feature level 12.1. While feature level 12.0 looks reasonable (all GCN 1.1/1.2 and Maxwell 2.0 should support it - dunno about the first generation of Maxwell), feature level 12.1 adds only ROVs (OK) and mandatory support for tier 1 of conservative rasterization (the most useless tier!).

I will not go into explicit details (detailed information should still be under NDA); however, the second feature level looks tailor-made for one particular piece of hardware (guess which!). Moreover, FL 12.1 does not require some really interesting features (a greater conservative rasterization tier, volume tiled resources, and even resource binding tier 3) that you could expect to be mandatory in future hardware. In substance, FL 12.1 really breaks the concept of a feature level in my view, which was a sort of "barrier" that defined new capabilities for upcoming hardware.

"Recursion is the first step towards madness." - "Skegg?ld, Skálm?ld, Skildir ro Klofnir!"
Direct3D 12 quick reference: https://github.com/alessiot89/D3D12QuickRef/

That sounds extremely interesting. Could you give a concrete example of what the descriptions in a DrawItem look like? What is the granularity of a DrawItem? Is it a per-mesh kind of thing, or more like a "one draw item for every material type" kind of thing, where you then draw every mesh that uses that material with a single DrawItem?

My DrawItem corresponds to one glDraw* / Draw* call, plus all the state that needs to be set immediately prior to the draw.
One model will usually have one DrawItem per sub-mesh (where a sub-mesh is a portion of that model that uses a particular material), per pass (where a pass is e.g. drawing to the gbuffer, drawing to a shadow-map, forward rendering, etc.). When drawing a model, it will find all the DrawItems for the current pass and push them into a render list, which can then be sorted.
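A minimal sketch of that per-pass render-list idea (my own illustration with hypothetical names, not the engine's actual code): each entry carries a 64-bit sort key packing the pass into the top bits, then a pipeline-state ID, then depth, so a single sort groups draws by pass and minimizes state changes within each pass.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sort key: pass in the top bits so passes stay grouped,
// then pipeline-state ID (to minimize state changes), then depth.
inline uint64_t MakeSortKey(uint32_t pass, uint32_t stateId, uint32_t depth)
{
    return (uint64_t(pass)             << 48)
         | (uint64_t(stateId & 0xFFFF) << 32)
         |  uint64_t(depth);
}

struct RenderListEntry
{
    uint64_t key;
    const void* drawItem; // points at a pre-built DrawItem
};

// Sort the list once per frame; submission then walks it in key order.
inline void SortRenderList(std::vector<RenderListEntry>& list)
{
    std::sort(list.begin(), list.end(),
              [](const RenderListEntry& a, const RenderListEntry& b)
              { return a.key < b.key; });
}
```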

A DrawItem which contains the full pipeline state, the resource bindings, and the draw-call parameters could look like this in a naive D3D11 implementation:


#include <vector>
#include <tuple>
#include <utility>
#include <d3d11.h>
using std::vector; using std::pair; using std::tuple;
typedef unsigned int uint;

struct DrawItem
{
  //pipeline state:
  ID3D11PixelShader* ps;
  ID3D11VertexShader* vs;
  ID3D11BlendState* blend;
  ID3D11DepthStencilState* depth;
  ID3D11RasterizerState* raster;
  D3D11_RECT* scissor;
  //input assembler state
  D3D11_PRIMITIVE_TOPOLOGY primitive;
  ID3D11InputLayout* inputLayout;
  ID3D11Buffer* indexBuffer;
  vector<tuple<int/*slot*/,ID3D11Buffer*,uint/*stride*/,uint/*offset*/>> vertexBuffers;
  //resource bindings:
  vector<pair<int/*slot*/, ID3D11Buffer*>> cbuffers;
  vector<pair<int/*slot*/, ID3D11SamplerState*>> samplers;
  vector<pair<int/*slot*/, ID3D11ShaderResourceView*>> textures;
  //draw call parameters:
  int numVerts, numInstances, indexBufferOffset, vertexBufferOffset;
};

That structure is extremely unoptimized though. It's a base size of ~116 bytes, plus the memory used by the vectors, which could be ~1KiB!

I'd aim to compress them down to 28-100 bytes in a single contiguous allocation, e.g. by using IDs instead of pointers, by grouping objects together (e.g. referencing a PS+VS program pair instead of referencing each individually), and by using variable-length arrays built into the structure instead of vectors.
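As a rough illustration of that compression (my sketch, not the engine's real layout): 16-bit IDs into engine-side object tables replace the fat COM pointers, and the variable-length binding data trails the header inside the same allocation instead of living in separate vectors.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical compact layout. The pipeline state (shaders/blend/depth/
// raster) is bundled behind a single ID, and cbuffer/sampler/texture
// binding IDs are appended after the header in one contiguous allocation.
struct CompactDrawItem
{
    uint16_t pipelineStateId; // PS+VS+blend+depth+raster bundled into one ID
    uint16_t inputLayoutId;
    uint16_t indexBufferId;
    uint16_t primitive;
    uint32_t numVerts;
    uint32_t numInstances;
    uint32_t indexBufferOffset;
    uint8_t  numCbuffers, numSamplers, numTextures, numVertexStreams;
    // uint16_t bindings[]; // variable-length binding IDs follow the header
};

// Size of the single allocation for a draw item with this many bindings.
inline size_t DrawItemSize(size_t numBindings)
{
    return sizeof(CompactDrawItem) + numBindings * sizeof(uint16_t);
}
```

With a handful of bindings this lands comfortably inside the 28-100 byte range mentioned above, versus ~116 bytes plus vector allocations for the naive version.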

When porting to Mantle/Vulkan/D3D12, that "pipeline state" section all gets replaced with a single "pipeline state object" and the "input assembler" / "resource bindings" sections get replaced by a "descriptor set". Alternatively, these new APIs also allow for a DrawItem to be completely replaced by a very small native command buffer!

There's a million ways to structure a renderer, but this is the design I ended up with, which I personally find very simple to implement on / port to every platform.


Thanks a lot for that description. I must say it sounds very elegant. It's almost like a functional programming approach to draw call submission, along with its disadvantages and advantages.

There is something I don't really understand in Vulkan/DX12: the "descriptor" object. Apparently it acts as a GPU-readable data chunk that holds texture pointer/size/layout and sampler info, but I don't understand how the descriptor set/pool concept works; this sounds a lot like an array of bindless texture handles to me.

Without going into detail: it's because only AMD & NVIDIA cards support bindless textures in their hardware; there's one major desktop vendor that doesn't support it even though it's DX11 HW. Also bear in mind that both Vulkan & DX12 want to support mobile hardware as well.
You will have to give the API a table of textures based on frequency of updates: one blob of textures that change per material, one blob of textures that rarely change (e.g. environment maps), and another blob of textures that don't change (e.g. shadow maps).
It's very analogous to how we have been doing constant buffers with shaders (providing different buffers based on frequency of update).
And you put those blobs into a bigger blob and tell the API "I want to render with this big blob, which is a collection of blobs of textures", so the API can translate this very well to all sorts of hardware (mobile, Intel on desktop, and bindless like AMD's and NVIDIA's).

If all hardware were bindless, this set/pool wouldn't be needed, because you could change any one texture anywhere with minimal GPU overhead, like you do in OpenGL 4 with the bindless texture extensions.
Nonetheless, this descriptor set/pool is also useful for non-texture stuff (e.g. anything that requires binding, like constant buffers). It is quite generic.
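The "blobs grouped by update frequency" idea can be pictured like this (illustrative structs only, not actual API types; all names are mine):

```cpp
#include <cstdint>

// Hypothetical handle to a texture view; stands in for whatever the API uses.
using TextureHandle = uint32_t;

// Bindings grouped by how often they change, mirroring how constant
// buffers are already split up by update frequency.
struct PerFrameTable    { TextureHandle shadowMaps[4];  }; // effectively static
struct PerPassTable     { TextureHandle envMap;         }; // rarely changes
struct PerMaterialTable { TextureHandle albedo, normal; }; // changes per material

// The "big blob": a draw references one table of each frequency, so only
// the per-material table needs rewriting when the material changes.
struct DrawBindings
{
    const PerFrameTable*    perFrame;
    const PerPassTable*     perPass;
    const PerMaterialTable* perMaterial;
};
```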

Thanks.
I think it also makes sparse textures available? At least the tier level required by ARB_sparse_texture (i.e. without the shader function returning residency state).


On DirectX 12, for feature level 11/11.1 GPUs the support of tier 1 of tiled resources (sparse textures) is still optional. In that GPU range, even where the architecture should support tier 1 of tiled resources, there are some GPUs (low/low-mid end, desktop and mobile) that do not support it (e.g. the driver support of tiled resources on AMD HD 7700 Mobile GPUs is still disabled). The same should apply to OGL/Vulkan.

"Recursion is the first step towards madness." - "Skegg?ld, Skálm?ld, Skildir ro Klofnir!"
Direct3D 12 quick reference: https://github.com/alessiot89/D3D12QuickRef/

If all hardware were bindless, this set/pool wouldn't be needed, because you could change any one texture anywhere with minimal GPU overhead, like you do in OpenGL 4 with the bindless texture extensions.
Nonetheless, this descriptor set/pool is also useful for non-texture stuff (e.g. anything that requires binding, like constant buffers). It is quite generic.

They're actually designed specifically to exploit the strengths of modern bindless GPUs, especially AMD GCN, as they're basically copy-pasted from the Mantle specs (which were designed to be cross-vendor, but obviously somewhat biased by having AMD GCN as the min-spec).

There is something I don't really understand in Vulkan/DX12: the "descriptor" object. Apparently it acts as a GPU-readable data chunk that holds texture pointer/size/layout and sampler info, but I don't understand how the descriptor set/pool concept works; this sounds a lot like an array of bindless texture handles to me.

A descriptor is a texture-view, buffer-view, sampler, or a pointer to another descriptor set.
A descriptor set is an array/table/struct of descriptors.
A descriptor pool is basically a large block of memory that acts as a memory allocator for descriptor sets.
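To make the pool idea concrete, here's a toy model of it as a bump allocator over a fixed memory block (purely my illustration; the real Vulkan API is not structured like this):

```cpp
#include <cstddef>
#include <cstdint>

// Toy descriptor pool: a fixed slab of (notionally GPU-visible) memory
// from which descriptor sets are sub-allocated with a bump pointer.
class DescriptorPool
{
public:
    explicit DescriptorPool(size_t capacity) : capacity_(capacity), next_(0) {}

    // Returns the byte offset of the new set, or SIZE_MAX if the pool is full.
    size_t AllocateSet(size_t sizeBytes, size_t alignment = 16)
    {
        size_t offset = (next_ + alignment - 1) & ~(alignment - 1);
        if (offset + sizeBytes > capacity_)
            return SIZE_MAX;
        next_ = offset + sizeBytes;
        return offset;
    }

    void Reset() { next_ = 0; } // recycle the whole pool at once

private:
    size_t capacity_;
    size_t next_;
};
```

The point of the pool is that allocating a set is this cheap: no per-set driver heap allocation, and freeing is done wholesale via Reset.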

So yes, it's very much like bindless handles, but instead of being handles, they're the actual guts of a texture-view, or an actual sampler structure, etc...

Say you've got a HLSL shader with:
Texture2D texture0 : register(t0);
SamplerState samLinear : register(s0);
In D3D11, you'd bind resources to this shader using something like:
ID3D11SamplerState* mySampler = ...;
ID3D11ShaderResourceView* myTexture = ...;
ctx.PSSetSamplers( 0, 1, &mySampler );
ctx.VSSetSamplers( 0, 1, &mySampler );
ctx.PSSetShaderResources( 0, 1, &myTexture );
ctx.VSSetShaderResources( 0, 1, &myTexture );
ctx.Draw(...);//draw something using the bound resources
Let's say that these new APIs give us a nice new bindless way to describe the inputs to the shader. Instead of assigning resources to slots/registers, we'll just put them all into a struct -- that struct is the descriptor set.
Our hypothetical (because I don't know the new/final syntax yet) HLSL shader code might look like:
struct DescriptorSet : register(d0)
{
  Texture2D texture0;
  SamplerState samLinear;
};
In our C/C++ code, we can now "bind resources" to the shader with something like this:
I'm inventing the API here -- Vulkan doesn't look like this; it's just a guess of what it might look like:
struct MyDescriptorSet // this matches our shader's structure, using corresponding Vulkan C types instead of the "HLSL" types above.
{
  VK_IMAGE_VIEW texture0;    //n.b. these types are the actual structures that the GPU is hard-wired to interpret, which means
  VK_SAMPLER_STATE samLinear;//      they'll change from driver-to-driver, so there must be some abstraction here over my example
};                           //      such as using handles or pointers to the actual structures?

descriptorHandle = vkCreateDescriptorSet( sizeof(MyDescriptorSet), descriptorPool );//allocate an instance of the structure in GPU memory

//copy the resource views that you want to 'bind' into the descriptor set.
MyDescriptorSet* descriptorSet = (MyDescriptorSet*)vkMapDescriptorSet(descriptorHandle);
descriptorSet->texture0 = *myTexture; // CPU is writing into GPU memory here, via write-combined uncached pages!
descriptorSet->samLinear = *mySampler;
vkUnmapDescriptorSet(descriptorHandle);

//later when drawing something 
vkCmdBindDescriptorSet(cmdBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, descriptorHandle, 0);
vkCmdDraw(cmdBuffer, ...);//draw something using the bound resources
You can see now that, when drawing an object, only a single API call is required to bind all of its resources.
Also, earlier we were required to double up our API calls if the pixel-shader and the vertex-shader both needed the same resources, but now the descriptor-set is shared among all stages.
If an object always uses the same resources every frame, then you can prepare its descriptor set once, ahead of time, and then do pretty much nothing every frame! All you need to do is call vkCmdBindDescriptorSet and vkCmdDraw.
Even better, those two functions record their commands into a command buffer... so it's possible to record a command buffer for each object ahead of time, and then every frame you only need to call vkQueueSubmit per object to submit its pre-prepared command buffer.

If we want to modify which resources that draw-call uses, we can simply write new descriptors into that descriptor set. The easiest way is by mapping/unmapping the tables and writing with the CPU as above, but in theory you could also use GPU copy or compute jobs to modify them. GPU modification of descriptor sets would only be possible on truly bindless GPUs, so I'm not sure if this feature will actually be exposed by Vulkan/D3D12 -- maybe in an extension later... This would mean that when you want to change which material a draw-item uses, you could use a compute job to update that draw-item's descriptor set! Along with multi-draw-indirect, you could move even more CPU-side work over to the GPU.


Also, it's possible to put pointers to descriptor sets inside descriptor sets!
This is useful where you've got a lot of resource bindings that are shared across a series of draw-calls, so you don't want the CPU to have to re-copy all those bindings for each draw-call.

e.g. set up a shader with a per-object descriptor set, which points to a per-camera descriptor set:
cbuffer CameraData
{
  Matrix4x4 viewProj;
};

struct SharedDescriptorSet
{
  SamplerState samLinear;
  CameraData camera;
};
struct MainDescriptorSet : register(d0)
{
  Texture2D texture0;
  SharedDescriptorSet* shared;
};
The C side would then make an instance of each, and make one link to the other. When drawing, you just have to bind the per-object one:
sharedDescriptorHandle = vkCreateDescriptorSet( sizeof(SharedDescriptorSet), descriptorPool );
obj0DescriptorHandle = vkCreateDescriptorSet( sizeof(MainDescriptorSet ), descriptorPool );

SharedDescriptorSet* descriptorSet = (SharedDescriptorSet*)vkMapDescriptorSet(sharedDescriptorHandle);
descriptorSet->camera = *myCbufferView;
descriptorSet->samLinear = *mySampler;
vkUnmapDescriptorSet(sharedDescriptorHandle);

MainDescriptorSet * descriptorSet = (MainDescriptorSet *)vkMapDescriptorSet(obj0DescriptorHandle);
descriptorSet->texture0 = *myTexture;
descriptorSet->shared = sharedDescriptorHandle;
vkUnmapDescriptorSet(obj0DescriptorHandle);

//bind obj0Descriptor, which is a MainDescriptorSet, which points to sharedDescriptor, which is a SharedDescriptorSet
vkCmdBindDescriptorSet(cmdBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, obj0DescriptorHandle, 0); 
vkCmdDraw(cmdBuffer, ...);//draw something using the bound resources

