Jump to content
  • Advertisement
Sign in to follow this  
  • entries
  • comments
  • views

Adding MSAA and stream output support

Sign in to follow this  


Since my last entry I've added support for more of Direct3D12's features to my framework, which are:

  • Cube and cube array textures. The implementation of these isn't notably different from the other texture types, so I don't plan to delve into these.
  • Mipmaps for all the texture types (including cube and cube array). I'll talk briefly about these in a moment.
  • MSAA render targets.
  • Stream output stage of the graphics pipeline.


Since I added mipmap support after all the texture types, there were 3 main questions for implementing this feature.
For the first question the options are add more classes treating texture that use mipmaps as new texture types or add the number of mipmaps as an argument to the existing texture classes. The existing texture types being represented by different classes makes sense from a validation perspective since they have a different number of dimensions. Whereas mipmaps are just different resolutions of a texture. Since the difference between a texture with multiple mipmaps and a texture with just 1 resolution is so minor, adding more classes is overkill conceptually and practically doesn't have any substantial validation benefit due to the number of mipmaps being variable. This naturally leads to the other option as adding it as an argument when creating the resource. For any place that needs the mipmap index to be validated (e.g. when determining which mipmap in a texture to upload to), the minimum of 0 can be checked by using the appropriate unsigned type (UINT16), and the maximum can be checked with the argument validation that I had mentioned in a previous dev journal entry which in this case is just comparing two UINT16 values.

Determining the subresource index is obvious for 1D and 2D textures with mipmaps (0 is the normal image, and each mipmap level just increments by 1). For the more complex texture types such as a cube array with mipmaps, determining the answer requires checking and understanding the documentation. I found https://msdn.microsoft.com/en-us/library/windows/desktop/ff476906%28v=vs.85%29.aspx and https://msdn.microsoft.com/en-us/library/windows/desktop/dn705766%28v=vs.85%29.aspx useful for understanding the order of subresource indices for the texture arrays. And since the second link covers it so well and has nice graphics to explain it, I won't re-hash it here.

As for TextureUploadBuffer, it is responsible for creating the upload buffer and preparing a command list to copy from the upload buffer to a texture subresource (e.g. a particular side of a cube texture). For reference here's the function declaration of one of the overloads for creating a TextureUploadBuffer from before mipmapping support was added:

static TextureUploadBuffer* CreateD3D12(const GraphicsCore& graphics, const Texture2D& texture);

This guarantees that the upload buffer will be large enough to hold the data for that particular texture, but also allows the buffer to be re-used for other smaller or equal textures. For texture arrays, the created buffer is enough for 1 entry in the array. So if you have a Texture2DArray with 5 entries and want to upload to all of them at once, you would need 5 TextureUploadBuffer instances (the function comments for the array overloads have comments explaining this). So there are two basic options here, add in the mipmap level to create an exact fit resource, or ignore the mipmap level the buffer will be used for and create a resource large enough for the default texture size. The first option would reduce memory usage, increase code complexity, and require the mipmap level between the created buffer and the requested upload level to be validated. The second option fits the pattern established by the texture array types of make it big enough for 1 subresource. So, I went for the second approach which leaves the function signature unchanged for mipmap support. This does waste a bit of memory, but the buffer only needs to be kept around while uploading, so the extraneous memory usage is temporary.

There is another detail about mipmap support I would like to mention which is the shader resource view. In Direct3D there can be multiple views for the same resource, such as using a 2D texture as a both a texture and a render target. Another option for these multiple views is to create a view that has access to only a subset of the mipmap levels. In the framework, the internal function D3D12_Texture::Create (called by the factory methods for all the texture classes) takes care of creating the resource and a shader resource view that has access to all the mipmap levels. At the moment, this full view is the only one available through the framework. However, for future development work for adding partial mipmap shader resource views, I would implement them by adding factory methods to the various texture classes that take the source texture and min/num mipmap levels as arguments where instead of calling D3D12_Texture::Create, it would just AddRef on the resource and create the requested shader resource view, passing both the resource and shader resource view to the instance of the particular texture class to manage. This AddRef and create a new view approach is basically the same approach I took with RenderTarget::CreateD3D12FromTexture where the main differences are the render target is not dealing with mipmap levels and it's creating a render target view instead of another shader resource view.


While there are other resources online that cover MSAA render targets and depth stencils in Direct3D12, I want to sum up the changes needed for them here which I will cover in the remainder of this section. And there is one point worth mentioning before getting into the changes, the swap chain cannot be created with MSAA enabled. A separate render target is needed, and I'll discuss how to go from this separate render target to the back buffer at the end of this section.

The first change is actually optional but recommended of checking the quality level supported by the device. This is to ensure when creating render targets and depth stencils that you don't exceed the supported quality level. Since devices can support different levels of multisampling for different formats, settings about the render target need to be passed in when checking, specifically the format, multisample count per pixel, and whether it is for a tiled resource or not. Since this check should be done before creating a MSAA render target, for the framework these fields are passed to a GraphicsCore instance's CheckSupportedMultisampleLevels function instead of the function taking a RenderTargetMSAA instance and extracting the values. The GraphicsCore instance is available to the application by using the Game base class which has a function named GetGraphics and has been used in all the sample programs. The function's declaration in GraphicsCore is:

virtual UINT CheckSupportedMultisampleLevels(GraphicsDataFormat format, UINT sample_count, bool tiled) const = 0;

And for anyone that would like to see the implementation that calls into Direct3D12:

UINT D3D12_Core::CheckSupportedMultisampleLevels(GraphicsDataFormat format, UINT sample_count, bool tiled) const
  query.Format = (DXGI_FORMAT)format;
  query.SampleCount = sample_count;
  query.NumQualityLevels = 0;
  HRESULT rc = m_device->CheckFeatureSupport(D3D12_FEATURE_MULTISAMPLE_QUALITY_LEVELS, &query, sizeof(query));
  if (FAILED(rc)) {
    throw FrameworkException("Failed to get multisample quality information");
  return query.NumQualityLevels;

Unlike the previous change, the rest of the changes are required for MSAA to work.

The next change is to create a pipeline that can handle MSAA. For the test program, everything going through the graphics pipeline is to go to a MSAA render target, I updated the only pipeline created in it to use MSAA. Through the framework this is adding 2 arguments to each of Pipeline's CreateD3D12 overloads. Here's an example of the update to one of the overloads:

static Pipeline* CreateD3D12(const GraphicsCore& graphics_core, const InputLayout& input_layout, Topology topology, const Shader& vertex_shader, const StreamOutputConfig* stream_output, const Shader& pixel_shader, DepthFuncs depth_func, const RenderTargetViewConfig& rtv_config, const RootSignature& root_sig, UINT ms_count = 1, UINT ms_quality = 0, bool wireframe = false)

Since I expect MSAA to be enabled more often than wireframe rendering, I added the MSAA arguments of ms_count and ms_quality before wireframe so that the wireframe argument can keep using its default value when not needed instead of needing to be expressly set for every pipeline. As for how this affects the Direct3D12 code for creating a pipeline, it is updating 3 fields in D3D12_GRAPHICS_PIPELINE_STATE_DESC. While they are non-adjacent lines in D3D12_Pipeline::CreateDefaultPipelineDesc, the changes to these fields are:

desc.RasterizerState.MultisampleEnable = ms_count > 1;
desc.SampleDesc.Count = ms_count;
desc.SampleDesc.Quality = ms_quality;
// fill in the rest of the structure here

And for anyone that wants to see the full implementation of D3D12_Pipeline::CreateDefaultPipelineDesc:

void D3D12_Pipeline::CreateDefaultPipelineDesc(D3D12_GRAPHICS_PIPELINE_STATE_DESC& desc, const D3D12_InputLayout& layout, const D3D12_RenderTargetViewConfig& rtv, const D3D12_RootSignature& root, D3D12_PRIMITIVE_TOPOLOGY_TYPE topology, UINT ms_count, UINT ms_quality, bool wireframe, const StreamOutputConfig* stream_output)
  if (layout.GetNextIndex() != layout.GetNum())
    throw FrameworkException("Not all input layout entries have been set");
  if (ms_count < 1)
    throw FrameworkException("Invalid multisample count");
  else if (ms_count == 1 && ms_quality != 0)
    throw FrameworkException("Multisampling quality must be 0 when multisampling is disabled");
  desc.pRootSignature = root.GetRootSignature();
  desc.InputLayout.pInputElementDescs = layout.GetLayout();
  desc.InputLayout.NumElements = layout.GetNum();
  desc.RasterizerState.FillMode = wireframe ? D3D12_FILL_MODE_WIREFRAME : D3D12_FILL_MODE_SOLID;
  desc.RasterizerState.CullMode = D3D12_CULL_MODE_BACK;
  desc.RasterizerState.FrontCounterClockwise = false;
  desc.RasterizerState.DepthBias = D3D12_DEFAULT_DEPTH_BIAS;
  desc.RasterizerState.DepthBiasClamp = D3D12_DEFAULT_DEPTH_BIAS_CLAMP;
  desc.RasterizerState.SlopeScaledDepthBias = D3D12_DEFAULT_SLOPE_SCALED_DEPTH_BIAS;
  desc.RasterizerState.DepthClipEnable = true;
  desc.RasterizerState.MultisampleEnable = ms_count > 1;
  desc.RasterizerState.AntialiasedLineEnable = false;
  desc.RasterizerState.ForcedSampleCount = 0;
  desc.RasterizerState.ConservativeRaster = D3D12_CONSERVATIVE_RASTERIZATION_MODE_OFF;
  desc.BlendState = rtv.GetBlendState();
  desc.DepthStencilState.DepthEnable = false;
  desc.DepthStencilState.StencilEnable = false;
  desc.SampleMask = UINT_MAX;
  desc.PrimitiveTopologyType = topology;
  desc.NumRenderTargets = rtv.GetNumRenderTargets();
  memcpy(desc.RTVFormats, rtv.GetFormats(), sizeof(RenderTargetViewFormat) * desc.NumRenderTargets);
  desc.SampleDesc.Count = ms_count;
  desc.SampleDesc.Quality = ms_quality;
  if (stream_output)
    desc.StreamOutput = ((D3D12_StreamOutputConfig*)stream_output)->GetDesc();

You'll notice that function isn't calling ID3D12Device::CreateGraphicsPipelineState, which is due to the various D3D12_Pipeline::Create overloads calling CreateDefaultPipelineDesc. And the Create overloads call ID3D12Device::CreateGraphicsPipelineState after doing changes to D3D12_GRAPHICS_PIPELINE_STATE_DESC for their particular overload.

The next change is creating a MSAA render target and depth stencil. In keeping with the framework's design and goals, these are separate classes (RenderTargetMSAA and DepthStencilMSAA) from their non-MSAA versions. Like the pipeline, creating these requires adding the sample count and quality as arguments to their respective CreateD3D12 functions. These additional arguments get applied to D3D12_RESOURCE_DESC for both the render target and depth stencil:

D3D12_RESOURCE_DESC resource_desc;
resource_desc.SampleDesc.Count = sample_count;
resource_desc.SampleDesc.Quality = quality;
// fill in the rest of the structure here

Also the render target view and depth stencil views need updating of their view dimension (the non-MSAA versions use D3D12_RTV_DIMENSION_TEXTURE2D and D3D12_DSV_DIMENSION_TEXTURE2D respectively):

view_desc.ViewDimension = D3D12_RTV_DIMENSION_TEXTURE2DMS;
// fill in the rest of the structure here
view_desc.ViewDimension = D3D12_DSV_DIMENSION_TEXTURE2DMS;
// fill in the rest of the structure here

After the resources are created they can be used with a corresponding pipeline by passing them to the framework's command list's OMSetRenderTarget overload, or for raw Direct3D12 ID3D12GraphicsCommandList::OMSetRenderTargets. Which makes the final change for adding MSAA support resolving the MSAA render target to a presentable back buffer. When using the framework this is accomplished by calling the CommandList's RenderTargetToResolved function, which makes the end of a Draw function something along the lines of:

m_command_list->RenderTargetToResolved(*m_render_target_msaa, current_render_target);
graphics.ExecuteCommandList(*m_command_list); graphics.Swap();

And the implementation of RenderTargetToResolved is:

void D3D12_CommandList::RenderTargetToResolved(const RenderTargetMSAA& src, const RenderTarget& dst)
  ID3D12Resource* src_resource = ((const D3D12_RenderTargetMSAA&)src).GetResource();
  ID3D12Resource* dst_resource = ((const D3D12_RenderTarget&)dst).GetResource();
  D3D12_RESOURCE_BARRIER barrier[2];
  barrier[0].Flags = D3D12_RESOURCE_BARRIER_FLAG_NONE;
  barrier[0].Transition.pResource = src_resource;
  barrier[0].Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
  barrier[0].Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
  barrier[0].Transition.StateAfter = D3D12_RESOURCE_STATE_RESOLVE_SOURCE;
  barrier[1].Flags = D3D12_RESOURCE_BARRIER_FLAG_NONE;
  barrier[1].Transition.pResource = dst_resource;
  barrier[1].Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
  barrier[1].Transition.StateBefore = D3D12_RESOURCE_STATE_PRESENT;
  barrier[1].Transition.StateAfter = D3D12_RESOURCE_STATE_RESOLVE_DEST;
  m_command_list->ResourceBarrier(2, barrier);
  m_command_list->ResolveSubresource(dst_resource, 0, src_resource, 0, dst_resource->GetDesc().Format);

Stream output

While this is not the last stage in the graphics pipeline, it was the last one to get added to the framework. I've updated the "Direct3D12 framework" gallery with a screenshot of the test program for this stage. It displays 2 quads formed by 2 triangles per quad. While graphically this is not a complex image, there is a fair amount going on in the test program to generate it. The quad on the left is a tessellated 4 point control patch where the 4 vertices are sent by a vertex buffer from the host to graphics card and the results of tessellation are rendered and also stored to a stream output buffer. The quad on the right takes the stream output buffer that was filled by the other quad and uses the vertex and pixel shader stages to render the same quad at a different location. Since Direct3D12 does not have an equivalent of Direct3D11's ID3D11DeviceContext::DrawAuto, the test program also uses the number of vertices the GPU wrote to the stream output buffer to determine how many vertices to draw. This involves reading back a UINT64 that is in a buffer on the GPU. For this simple of a test case the read back isn't strictly necessary, but it allows for testing functionality that should prove useful for more complex or dynamic tessellation.

Since the graphics pipeline for the two quads are so different from each other, they need separate pipelines and even root signatures. The pipeline for the left quad needs all the stages turned on and uses the corresponding overload and sets the nullable stream_output argument to point to an instance of the framework's StreamOutputConfig class to describe what is going to the stream output stage and which output stream should go to the rasterizer stage. The StreamOutputConfig class is a wrapper around D3D12_STREAM_OUTPUT_DESC that adds validation when setting the various fields and computes the stride in bytes for each output stream. The pipeline for the right quad needs the input assembler, vertex shader, rasterizer, pixel shader, and output merge stages enabled. Since the source for setting everything setup for these two pipelines is a bit lengthy rather than copy it here, it is available in the stream_output test program's GameMain::CreateNormalPipeline and GameMain::CreateStreamOutputPipeline functions in the attached source.

The key part linking the two quads is the stream output buffer. It is created in CreateNormalPipeline by calling the framework's StreamOutputBuffer::CreateD3D12 function with the same StreamOutputConfig object that was passed to creating the left quad's pipeline. The heap properties for creating a committed resource for the stream output buffer are normal for a GPU read/write buffer:

D3D12_HEAP_PROPERTIES heap_prop;
heap_prop.Type = D3D12_HEAP_TYPE_DEFAULT;
heap_prop.CPUPageProperty = D3D12_CPU_PAGE_PROPERTY_UNKNOWN;
heap_prop.MemoryPoolPreference = D3D12_MEMORY_POOL_UNKNOWN;
heap_prop.CreationNodeMask = 0;
heap_prop.VisibleNodeMask = 0;

Due to the BufferFilledSizeLocation field in D3D12_STREAM_OUTPUT_BUFFER_VIEW, there is a detail worth mentioning when filling out D3D12_RESOURCE_DESC. In Direct3D11 the number of bytes or vertices written to a stream output buffer was handled internally and application didn't need to think about it. This changed in Direct3D12 where the application needs to provide memory in a GPU writable buffer for this field (https://msdn.microsoft.com/en-us/library/windows/desktop/mt431709%28v=vs.85%29.aspx#fixed_function_rendering). In the framework, I regard this as part of the stream output buffer and want it to be part of its resource, which means increasing the number of bytes specified by the width field of D3D12_RESOURCE_DESC. The question becomes, how many bytes should the resource be increased by to include this field? The MSDN page for D3D12_STREAM_OUTPUT_BUFFER_VIEW (https://msdn.microsoft.com/en-us/library/windows/desktop/dn903817(v=vs.85).aspx) doesn't mention how large this field is, and https://msdn.microsoft.com/en-us/library/windows/desktop/dn903944%28v=vs.85%29.aspx discusses a 32-bit field named "BufferFilledSize" for how many bytes have been written to the buffer along with another field of unspecified size named "BufferFilledSizeOffsetInBytes". However, increasing the buffer by only sizeof(UINT) causes Direct3D12's debug layer to output all sorts of fun messages when trying to use the stream output buffer. Using a UINT64 works and would indicate that "BufferFilledSizeOffsetInBytes" is a 32-bit field as well. However, as will be discussed in the part about determining the number of vertices written to a stream output buffer, this 64-bit quantity seems to be 1 field for the number of bytes written rather than 2 separate fields. So, with it being determined that the additional field is a UINT64, the code for filling in the resource description becomes:

D3D12_RESOURCE_DESC res_desc;
res_desc.Dimension = D3D12_RESOURCE_DIMENSION_BUFFER;
res_desc.Width = num_bytes + sizeof(UINT64); // note: +sizeof(UINT64) due to graphics card counter for how many bytes it has written to the stream output buffer
res_desc.Height = 1;
res_desc.DepthOrArraySize = 1;
res_desc.MipLevels = 1;
res_desc.Format = DXGI_FORMAT_UNKNOWN;
res_desc.SampleDesc.Count = 1;
res_desc.SampleDesc.Quality = 0;
res_desc.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;
res_desc.Flags = D3D12_RESOURCE_FLAG_NONE;

Since it makes sense for a stream output buffer to be initially ready for use with the stream output stage, the initial resource state should reflect that. So the code for creating the resource is:

ID3D12Resource* buffer; HRESULT rc = core.GetDevice()->CreateCommittedResource(&heap_prop, D3D12_HEAP_FLAG_NONE, &res_desc, D3D12_RESOURCE_STATE_STREAM_OUT, NULL, __uuidof(ID3D12Resource), (void**)&buffer); if (FAILED(rc)) { ostringstream out; out << "Failed to create committed resource for stream output buffer. HRESULT = " << rc; throw FrameworkException(out.str()); }

Next the stream output buffer view needs to be created. For D3D12_STREAM_OUTPUT_BUFFER_VIEW, the BufferLocation is the GPU address pointing to the beginning of the stream output buffer, SizeInBytes is the number of bytes in the stream output buffer, and BufferFilledSizeLocation is the GPU address of that additional field I previously talked about for the counter of how many bytes have been written to the stream output buffer. Initially I had put the byte counter field after the stream output buffer. However, so that when copying the byte counter to a host readable buffer (D3D12_StreamOutputBuffer::GetNumVerticesWritten) doesn't have to compute an offset when calling ID3D12GraphicsCommandList::CopyBufferRegion, I moved it to the beginning of the resource and pushed the stream output buffer back by sizeof(UINT64).

so_view.BufferFilledSizeLocation = buffer->GetGPUVirtualAddress();
so_view.BufferLocation = so_view.BufferFilledSizeLocation + sizeof(UINT64);
so_view.SizeInBytes = num_bytes;

Since the stream output buffer can also be used as a vertex buffer for the input assembler stage, a vertex buffer view needs to be created to support that. It's pretty straight forward of having the same start address as the stream output buffer (not including the byte counter field), setting the total size of the buffer, and the vertex stride in bytes which is taken from the StreamOutputConfig for the particular stream index.

vb_view.BufferLocation = so_view.BufferLocation;
vb_view.SizeInBytes = (UINT)num_bytes;
vb_view.StrideInBytes = (UINT)vertex_stride;

For getting the number of vertices written to a stream output buffer available to an application, the framework provides a static GetNumVerticesWritten function in the StreamOutputBuffer class. It's declaration is:

static void GetNumVerticesWrittenD3D12(GraphicsCore& graphics, CommandList& command_list, const std::vector so_buffers, ReadbackBuffer& readback_buffer, std::vector& num_vertices);

You'll notice that rather than take a single stream output buffer, this takes a collection of them. This is due to retrieve the data the byte count field needs to be copied from the resource for the stream output buffer, which is only GPU accessible, to a host readable buffer. That process involves using a command list to perform the copy, then executing the command list needs to wait for fence so that the copying is complete, and finally converting from number of bytes written to number of vertices in the buffer (which is dividing the value from the field by the vertex stride). Doing all of that per stream output buffer gets is basically stalling the application on the CPU by waiting for multiple fences to complete. By performing these operations on a collection, there is only 1 fence causing the CPU to wait. The source for this function is:

void D3D12_StreamOutputBuffer::GetNumVerticesWritten(GraphicsCore& graphics, CommandList& command_list, const vector so_buffers, ReadbackBuffer& readback_buffer, vector& num_vertices)
  D3D12_Core& core = (D3D12_Core&)graphics;
  ID3D12GraphicsCommandList* cmd_list = ((D3D12_CommandList&)command_list).GetCommandList();
  D3D12_ReadbackBuffer& rb_buffer = (D3D12_ReadbackBuffer&)readback_buffer;
  ID3D12Resource* resource = rb_buffer.GetResource();
  num_vertices.reserve(num_vertices.size() + so_buffers.size());
  vector::const_iterator so_it = so_buffers.begin();
  UINT64 dst_offset = 0;
  while (so_it != so_buffers.end())
    D3D12_StreamOutputBuffer* curr_so_buffer = (D3D12_StreamOutputBuffer*)(*so_it);
    cmd_list->CopyBufferRegion(resource, dst_offset, curr_so_buffer->m_buffer, 0, sizeof(UINT64));
    dst_offset += sizeof(UINT64);
  UINT64* cbuffer_value = (UINT64*)rb_buffer.GetHostMemStart();
  so_it = so_buffers.begin();
  while (so_it != so_buffers.end())
    D3D12_StreamOutputBuffer* curr_so_buffer = (D3D12_StreamOutputBuffer*)(*so_it);
    num_vertices.push_back((UINT)((*cbuffer_value) / curr_so_buffer->m_vb_view.StrideInBytes));

One additional thing I would like to mention about the stream output stage is that if multiple output streams are used by a shader, they must be of type PointStream. In the test program, I am sending the tessellated object space coordinate to the stream output buffer, but due to this restriction I am also sending it to the pixel shader (in addition to the projection coordinates). While sending this additional field to the pixel shader isn't harmful, it's also unnecessary since the pixel shader doesn't use it. Having the left quad drawn as points isn't the behavior I wanted for this test program, so using multiple output streams to minimize the data sent to both stream output and pixel shader stages wasn't viable in this case.

  1. How do I want to expose creating textures that use them?
  2. How to determine the subresource index for all the different types of textures?
  3. Since using compile time type checking to avoid incorrect operations is a goal of this project, how does this affect the TextureUploadBuffer interface and D3D12_TextureUploadBuffer implementation?

Sign in to follow this  


Recommended Comments

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
  • Advertisement

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

GameDev.net is your game development community. Create an account for your GameDev Portfolio and participate in the largest developer community in the games industry.

Sign me up!