
# DX12 Need ideas about 'Baking' instances into vertex shader (no GS)



Hey Guys,

My recent DX12 project requires a pass that draws 200k+ cubes (see image below; I need one solid pass and one wireframe pass for debugging). This pass only has a trivial PS, so it is pretty much VS bound.

[attachment=34179:Capture.PNG]

I currently do it with instancing, and I was OK with the performance until I saw this post.

I quickly put together my own version to benchmark it, and the VS version immediately cut the time for that pass in half. However, the implementation needs a gigantic IB, hundreds of MB in my case, which I really don't want.

So I was wondering whether anyone has an idea for how to do this without instancing, without a GS, and without this memory burden?

I know a trivial solution, which is to use a triangle list instead of a triangle strip. But that almost kills vertex reuse in the VS (instead of 14 verts/cube, a triangle list needs 36 verts/cube), and given that this pass is VS intensive, I guess a triangle list is sub-optimal (please correct me if my assumption is wrong).

I also know a triangle-strip solution, which only needs 2 duplicated vertices added at the start and end of each strip to form 2 degenerate triangles that connect the previous and next cube. But that only works for the solid pass; I would get undesired lines in my wireframe pass...

And then I ran out of ideas...

##### Share on other sites

> However, the implementation needs a gigantic IB, hundreds of MB in my case, which I really don't want.

Are you sure about that? For 200k cubes I calculated that you'd need a 14 MB index buffer: 200,000 cubes * 12 tris/cube * 3 vertices/tri * 2 bytes/index.

> I know a trivial solution, which is to use a triangle list instead of a triangle strip. But that almost kills vertex reuse in the VS (instead of 14 verts/cube, a triangle list needs 36 verts/cube), and given that this pass is VS intensive, I guess a triangle list is sub-optimal (please correct me if my assumption is wrong).

If you use an indexed triangle list, vertex reuse happens because of the post-transform vertex cache (assuming good vertex ordering).

##### Share on other sites
> Are you sure about that? For 200k cubes I calculated that you'd need a 14 MB index buffer: 200,000 cubes * 12 tris/cube * 3 vertices/tri * 2 bytes/index.

Sorry, I should give a little more context. The 200k+ cubes are what I expect after culling in my VS. The input cube count is 128^3 (and in my dynamic scene, the worst case will have more than 100^3 cubes that need to be drawn). The number of indices is greater than the 16-bit uint max (the method I used from the post requires each index to be unique, so a 16-bit IB won't work here), so I have to use 32-bit indices, and the total memory is more than 100 MB.
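
To make the two memory estimates in this exchange easy to compare, here is a rough C++ sketch of the arithmetic. The helper name `index_buffer_bytes` is hypothetical, and the figure of 14 indices per cube for the strip case (with no extra strip-restart index) is an assumption, since the linked post's exact layout is not shown here.

```cpp
#include <cstdint>

// Bytes needed for an index buffer: cubes * indices per cube * bytes per index.
constexpr std::uint64_t index_buffer_bytes(std::uint64_t cubes,
                                           std::uint64_t indices_per_cube,
                                           std::uint64_t bytes_per_index) {
    return cubes * indices_per_cube * bytes_per_index;
}

// Estimate from the reply: 200,000 cubes as an indexed triangle list
// (12 triangles * 3 indices each) with 16-bit indices.
constexpr std::uint64_t kListEstimate = index_buffer_bytes(200000, 36, 2);

// Estimate for the full 128^3 input grid drawn as strips with 32-bit indices.
// ASSUMPTION: 14 indices per cube, no strip-restart index.
constexpr std::uint64_t kStripEstimate =
    index_buffer_bytes(128ull * 128ull * 128ull, 14, 4);

static_assert(kListEstimate == 14400000ull, "about 14 MB");
static_assert(kStripEstimate == 117440512ull, "over 100 MB");
```

Under these assumptions both numbers in the thread hold up: roughly 14 MB for the culled set with 16-bit indices, and more than 100 MB once the buffer has to cover every input cube with unique 32-bit indices.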

> If you use an indexed triangle list, vertex reuse happens because of the post-transform vertex cache (assuming good vertex ordering).

But if I use a triangle list, how would the VS know which vertices should be reused? I was thinking that without an IB the GPU has no idea which vertices are shared by multiple triangles, so how could it reuse them from the cache?

Thanks

Edited by Mr_Fox

##### Share on other sites

> Sorry, I should give a little more context. The 200k+ cubes are what I expect after culling in my VS. The input cube count is 128^3 (and in my dynamic scene, the worst case will have more than 100^3 cubes that need to be drawn). The number of indices is greater than the 16-bit uint max, so I have to use 32-bit indices, and the total memory is more than 100 MB.

Break it up into multiple draw calls and use 16-bit indices.
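
As a rough illustration of what breaking the work up could look like, the C++ sketch below counts how many cubes fit in one draw with 16-bit indices and how many draws a given cube count would need. It assumes 8 unique vertices per cube and that no index value is reserved (for example for a strip cut); the names `cubes_per_draw` and `draw_calls` are hypothetical.

```cpp
#include <cstdint>

// Number of distinct vertices a 16-bit index can address.
// ASSUMPTION: no index value is reserved, e.g. for a strip cut.
constexpr std::uint32_t kMaxVerticesPerDraw = 65536;

// ASSUMPTION: every cube contributes 8 unique vertices.
constexpr std::uint32_t kVerticesPerCube = 8;

// Cubes that fit in one draw call under the assumptions above.
constexpr std::uint32_t cubes_per_draw() {
    return kMaxVerticesPerDraw / kVerticesPerCube;
}

// Draw calls needed for a given cube count, rounded up.
constexpr std::uint32_t draw_calls(std::uint32_t cubes) {
    return (cubes + cubes_per_draw() - 1) / cubes_per_draw();
}

static_assert(cubes_per_draw() == 8192, "8192 cubes per 16-bit draw");
```

With these numbers, 200,000 cubes need 25 draws and the full 128^3 grid needs 256. Because the index pattern is the same for every batch, one small index buffer could in principle be reused across draws with a different base vertex each time.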

> But if I use a triangle list, how would the VS know which vertices should be reused? I was thinking that without an IB the GPU has no idea which vertices are shared by multiple triangles, so how could it reuse them from the cache?

I said an indexed triangle list, not a plain triangle list. So yes, without an index buffer there is no reuse.

Are your vertices pre-transformed? Or are you using a matrix per cube?

##### Share on other sites

Breaking it up sounds very promising, and I will give an indexed triangle list a try. Thanks so much.

In my case, I only have an 8-vertex VB, and I read offset information from a buffer (which means each cube needs a unique 'ID' in the VS) and transform each cube accordingly in the VS. So yes, I use a matrix per cube. Any suggestions? Thanks so much.

##### Share on other sites

Judging by your screenshot, your cubes look the same size and adjacent to each other, is that right? If so, can you batch cubes together so they share more vertices? Then use a dynamic index buffer / indexed triangle list for each chunk (i.e. one matrix per chunk).

##### Share on other sites

> Sorry, I should give a little more context. The 200k+ cubes are what I expect after culling in my VS. The input cube count is 128^3 (and in my dynamic scene, the worst case will have more than 100^3 cubes that need to be drawn). The number of indices is greater than the 16-bit uint max (the method I used from the post requires each index to be unique, so a 16-bit IB won't work here), so I have to use 32-bit indices, and the total memory is more than 100 MB.

> In my case, I only have an 8-vertex VB, and I read offset information from a buffer (which means each cube needs a unique 'ID' in the VS) and transform each cube accordingly in the VS. So yes, I use a matrix per cube.

If you are using one matrix per cube then you don't need 32-bit indices. Unless I'm missing something.

BTW, if you use straight instancing you will underutilize the GPU, since AMD/Nvidia GPUs operate on 64/32 vertices at a time.

##### Share on other sites

> If you are using one matrix per cube then you don't need 32-bit indices. Unless I'm missing something.

Thanks for being so helpful. I really appreciate it.

I don't quite get why I wouldn't need 32-bit indices when using one matrix per cube, unless I break my cubes into multiple batches. If each cube needs one matrix, then each cube needs a unique ID (an index into the matrix buffer) to find the right matrix for that cube, and if we don't break the cubes into multiple batches, this won't work with more than 65536 cubes, since a 16-bit IB can't provide more than 65536 unique IDs.

I think the original screenshot I posted is a little misleading. I got it from a VSDG vertex shader capture. Here is the actual screenshot:

[attachment=34182:Capture1.PNG]

So my project is about 3D reconstruction. The model I am reconstructing is inside a TSDF volume, and those cubes are a spatial structure indicating that the model surface lies inside them. They are generated during the model update pass (the block center location is added to an append buffer), and later I need to render those 'active blocks' to get a min/max depth (kind of like a depth prepass, but I need not only the min depth but also the max depth). This is the pass I mentioned in this post, which can benefit a lot from not using instancing.

So now you know that those cubes are axis-aligned and the same size, but you cannot assume more than that...

##### Share on other sites
If all you need to do is render cubes, then you don't need an IB or VB at all.
A 14-vertex tri-strip cube in the vertex shader:

b = 1 << i;
x = (0x287a & b) != 0;
y = (0x02af & b) != 0;
z = (0x31e3 & b) != 0;


Where i is SV_VertexID (add a modulo operation to i to draw multiple cubes in the same draw).

Then set the position, rotation and scale by pulling the transform from a constant buffer or SRV indexed by SV_VertexID / 14.

No index buffer, no vertex buffer, no GS, and no instancing. Only raw ALU and one SRV with each cube's transform.
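
To see what the three masks actually encode, here is a small CPU-side C++ sketch that mirrors the shader arithmetic above. The names `Corner`, `cube_strip_corner` and `cube_index` are hypothetical, and the modulo by 14 is the "modulo operation" the post mentions. It is only meant for checking the bit patterns, not as shader code.

```cpp
#include <cstdint>

// One corner of the unit cube; each coordinate is 0 or 1.
struct Corner { int x, y, z; };

// Decode strip vertex (vertex_id % 14) of a unit cube from the three
// bit masks given in the post: bit i of each mask is that axis' coordinate.
inline Corner cube_strip_corner(std::uint32_t vertex_id) {
    const std::uint32_t b = 1u << (vertex_id % 14u);
    Corner c;
    c.x = (0x287au & b) != 0 ? 1 : 0;
    c.y = (0x02afu & b) != 0 ? 1 : 0;
    c.z = (0x31e3u & b) != 0 ? 1 : 0;
    return c;
}

// Which cube a vertex belongs to; this is the index used to look up
// that cube's transform (SV_VertexID / 14 in the post).
inline std::uint32_t cube_index(std::uint32_t vertex_id) {
    return vertex_id / 14u;
}
```

Decoding ids 0 to 13 gives the corner sequence (0,1,1), (1,1,1), (0,1,0), (1,1,0), (1,0,0), (1,1,1), (1,0,1), (0,1,1), (0,0,1), (0,1,0), (0,0,0), (1,0,0), (0,0,1), (1,0,1), which walks all six faces with two triangles each. Note that in a single non-indexed strip draw, the last vertices of one cube and the first of the next would also form triangles; how consecutive cubes are kept apart is not covered by the post or by this sketch.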

##### Share on other sites

Wow, that's brilliant!! Using bits to do this. Hmm... with a little modification, it seems we could use this method for lots of simple geometry: 16-bit magic numbers for any tri-strip geometry under 16 vertices (more magic numbers may be needed when coordinates are not just 0 or 1). How did you come up with that idea?

Thanks
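
The generalization suggested in this reply can be sketched as a small C++ encoder that packs any strip of 0/1 corner coordinates into one mask per axis. This is an illustration under assumptions, not something from the thread: `Masks`, `encode_strip` and `kCubeStrip` are hypothetical names, and the corner table is simply the sequence that the thread's three masks decode to.

```cpp
#include <cstddef>
#include <cstdint>

// One bit mask per axis; bit i holds that axis' coordinate for strip vertex i.
struct Masks { std::uint32_t x, y, z; };

// Pack a sequence of 0/1 corner coordinates into per-axis bit masks.
// Only the first 32 vertices are encoded, since each mask is 32 bits wide.
inline Masks encode_strip(const int (*corners)[3], std::size_t count) {
    Masks m = {0u, 0u, 0u};
    for (std::size_t i = 0; i < count && i < 32; ++i) {
        if (corners[i][0]) m.x |= 1u << i;
        if (corners[i][1]) m.y |= 1u << i;
        if (corners[i][2]) m.z |= 1u << i;
    }
    return m;
}

// The 14-vertex cube strip that the masks from the thread decode to.
static const int kCubeStrip[14][3] = {
    {0,1,1}, {1,1,1}, {0,1,0}, {1,1,0}, {1,0,0}, {1,1,1}, {1,0,1},
    {0,1,1}, {0,0,1}, {0,1,0}, {0,0,0}, {1,0,0}, {0,0,1}, {1,0,1},
};
```

Calling `encode_strip(kCubeStrip, 14)` reproduces the masks 0x287a, 0x02af and 0x31e3 from the thread. Coordinates that need more than two distinct values per axis would need more than one bit per vertex, which matches the caveat above about needing more magic numbers.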