DX11 Why are we still using index/vertex/instance buffers?

There are a number of options now in each API for storing data on the GPU. Specifically in DX11, we have buffers which can be bound as vertex buffers, index buffers, constant buffers or shader resources (a structured buffer, for example). Constant buffers have the most limitations, and I think that's because they are heavily optimized for non-random access (an optimization we have no control over). Vertex buffers and index buffers, however, offer so little over shader resource buffers that I question their value.

For example, the common way of drawing geometry is to provide a vertex buffer (and maybe an instance buffer) through a specific call to IASetVertexBuffers. We also provide the index buffer with its own specific call, and at this point we also have to provide an input layout. That is significantly more management overhead than if we provided the vertex and index data through shader resources and indexed them with system values (e.g. SV_VertexID) in the shader.

Now, I haven't actually tried doing vertex buffer management this way, but I am looking forward to it if no one points out the faults in my way of thinking.
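To illustrate the idea, here is a minimal sketch of what the shader side could look like (the buffer name, register slots and the float4 position layout are my own assumptions, not anything mandated by the API):

Buffer<float4> vertexBuffer_POS : register(t0); // bound as an SRV with a DXGI_FORMAT_R32G32B32A32_FLOAT view

cbuffer CameraCB : register(b0)
{
    float4x4 g_ViewProjection;
};

float4 main(uint vertexID : SV_VertexID) : SV_Position
{
    float4 position = vertexBuffer_POS[vertexID]; // manual vertex fetch, no input layout needed
    return mul(float4(position.xyz, 1.0f), g_ViewProjection);
}

On the CPU side you would bind the buffer with VSSetShaderResources, set a null input layout, and issue a normal Draw call.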

Thank you! I actually didn't think of using a dedicated index buffer, but I see now that it still has value. What I am also interested in is that this way you can easily handle hard-edge normals and UV discontinuities without duplicating position vertices. I am already using deinterleaved vertex buffers (for more efficient shadow rendering/z-prepass), so implementing that should not be very hard.

Oh, and something to keep in mind: graphics debuggers (at least Nsight) cannot visualize geometry information without an input layout, which is certainly a downside of this approach.

Edited by turanszkij

BTW, since it wasn't explicitly stated: what you do is bind a null vertex buffer. This allows you to generate vertices procedurally, or to fetch them manually using the SV_VertexID and SV_InstanceID system values. It is documented here:

https://www.slideshare.net/DevCentralAMD/vertex-shader-tricks-bill-bilodeau

Starting at page seven.
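For reference, the best-known example of that trick is the full-screen triangle generated purely from SV_VertexID with nothing bound at all; a rough sketch (not the exact code from the slides):

// Draw(3, 0) with a null input layout and no vertex or index buffer bound:
// the triangle is generated entirely from SV_VertexID.
struct VSOut
{
    float4 position : SV_Position;
    float2 uv       : TEXCOORD0;
};

VSOut main(uint vertexID : SV_VertexID)
{
    VSOut output;
    // vertexID 0,1,2 -> uv (0,0), (2,0), (0,2): one oversized triangle covering the screen
    output.uv = float2((vertexID << 1) & 2, vertexID & 2);
    output.position = float4(output.uv * float2(2, -2) + float2(-1, 1), 0, 1);
    return output;
}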
Edited by Infinisearch

At the end of the day, a vertex buffer or index buffer is just another buffer with semantics attached. As pointed out above, I think vertex caching is one of the biggest reasons for still having this distinction; without it, the API would have to be able to flag a generic buffer as being cacheable.

MJP's point 2 is the most important: you need an index buffer in order to benefit from the post-transform vertex cache. Besides that, if you only target recent hardware, you could go SoA (i.e. not interleave your vertex data) and fetch manually - that's what happens on any GCN GPU anyway.

As also mentioned by MJP, NVIDIA hardware works differently (not sure about the latest generation); since all consoles are GCN, we tend to optimise for it...

Ugh, I implemented it in my engine for every scene mesh render pass, and it performs significantly worse on my GTX 1070 than using regular vertex buffers. I was rendering shadows on the Sponza scene in 2 ms for 6 point lights, and the custom vertex fetch moves that up to 11 ms, which is insane. The Z prepass went from 0.2 ms up to 0.4 ms. These passes use deinterleaved position, (sometimes) texcoord, and instance buffers.

The vertex buffers are float4 buffers which I create as shader resources with DXGI_FORMAT_R32G32B32A32_FLOAT views. In the shader I declare them as Buffer<float4>. The instance buffers are structured buffers holding 4x4 float matrices.

I don't understand what could be going on, but it is very fishy; I expected only a very minor performance difference.
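For context, the setup described above boils down to something like this in HLSL (the names and register slots are placeholders of mine, not the actual engine code):

Buffer<float4>             vb_Position  : register(t0); // DXGI_FORMAT_R32G32B32A32_FLOAT view
StructuredBuffer<float4x4> ib_Instances : register(t1); // one world matrix per instance

cbuffer CameraCB : register(b0)
{
    float4x4 g_ViewProjection;
};

float4 main(uint vertexID : SV_VertexID, uint instanceID : SV_InstanceID) : SV_Position
{
    float4   position = vb_Position[vertexID];    // manual per-vertex fetch
    float4x4 world    = ib_Instances[instanceID]; // manual per-instance fetch
    float4   worldPos = mul(float4(position.xyz, 1.0f), world);
    return mul(worldPos, g_ViewProjection);
}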

I haven't implemented this myself, but you could try eliminating the overhead of the automatic type conversion that typed buffers have. i.e. the buffer SRV contains a format field specifying that the data is stored in a particular format, and the HLSL code says that it wants it converted to float4 (DXGI_FORMAT_R32G32B32A32_FLOAT) -- this ability to do general-purpose conversion might have an overhead on NV?

To avoid that, you could try using a ByteAddressBuffer and something like asfloat(buffer.Load4(vertexId*16)), which hard-codes the expectation that the buffer contains data equivalent to DXGI_FORMAT_R32G32B32A32_FLOAT.

Alternatively you could try using a StructuredBuffer<float4>.

I'd be very interested to know if these three types of buffers have any performance differences...
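To make the comparison concrete, here is the same float4 fetch written all three ways (the names and registers are placeholders):

// The same float4 position fetch written three ways.
Buffer<float4>           positionsTyped  : register(t0); // typed: the SRV format (e.g. R32G32B32A32_FLOAT) drives the conversion
ByteAddressBuffer        positionsRaw    : register(t1); // raw: no format, the shader hard-codes the 16-byte stride
StructuredBuffer<float4> positionsStruct : register(t2); // structured: the stride comes from the element type

float4 FetchTyped(uint vertexID)      { return positionsTyped[vertexID]; }
float4 FetchRaw(uint vertexID)        { return asfloat(positionsRaw.Load4(vertexID * 16)); }
float4 FetchStructured(uint vertexID) { return positionsStruct[vertexID]; }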

Thanks guys, I will thoroughly experiment with different methods and GPUs. I will be conducting benchmarks on the Sponza scene with NVIDIA GTX 1070, GTX 960, AMD RX 470 and Snapdragon 808 GPUs. Each of them will be timed with the input layout rendering method, custom fetch with typed buffers, and custom fetch with raw buffers.

I will post results in this topic when I am finished.

A quick spoiler: the AMD RX 470 performs surprisingly well even with typed-buffer custom fetch (scratch that: there is virtually no difference whatsoever), while the GTX 960 suffers greatly, with roughly doubled rendering time.

Edited by turanszkij

I have done the testing for an AMD and an NVIDIA GPU; the Snapdragon 808 will have to wait, as setting up the scene for it will take some more time. I will also post the results for the GTX 1070 later.

Here you go:

 

 

Program: Wicked Engine Editor
API: DX11
 
Test scene: Sponza
 - 3 shadow cascades (2D) - 3 scene render passes
 - 1 spotlight shadow (2D) - 1 scene render pass
 - 4 pointlight shadows (Cubemap) - 4 scene render passes
 - Z prepass - 1 scene render pass
 - Opaque pass - 1 scene render pass
 
Timing method: DX11 timestamp queries
 
Methods:
 - InputLayout: the default hardware vertex buffer path, with CPU-side input layout declarations. The instance buffers are bound as vertex buffers with each render call.
 - CustomFetch (typed buffer): vertex buffers are bound as shader resource views with the DXGI_FORMAT_R32G32B32A32_FLOAT format. Instance buffers are bound as structured buffers holding a 4x4 matrix each.
 - CustomFetch (RAW buffer 1): vertex buffers are bound as shader resource views created with the D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS misc flag. In the shader the buffers are addressed by byte offset from the beginning of the buffer. Instance buffers are bound as structured buffers holding a 4x4 matrix each.
 - CustomFetch (RAW buffer 2): even the instancing information is retrieved from raw buffers instead of structured buffers (see the sketch after the buffer listing below).
 
ShadowPass and ZPrepass: these use at most 3 buffers:
 - position (float4)
 - UV (float4) // only for alpha tested
 - instance buffer
OpaquePass: this uses 6 buffers:
 - position (float4)
 - normal (float4)
 - UV (float4)
 - previous frame position VB (float4)
 - instance buffer (float4x4)
 - previous frame instance buffer (float4x3)
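As an aside, a minimal sketch of how the "RAW buffer 2" style instance matrix fetch could look in HLSL (the buffer name and register are placeholders of mine):

ByteAddressBuffer instanceData : register(t1); // raw instance data, 64 bytes (one float4x4) per instance

float4x4 LoadInstanceMatrix(uint instanceID)
{
    const uint base = instanceID * 64; // 4 rows * 16 bytes
    return float4x4(
        asfloat(instanceData.Load4(base +  0)),
        asfloat(instanceData.Load4(base + 16)),
        asfloat(instanceData.Load4(base + 32)),
        asfloat(instanceData.Load4(base + 48)));
}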
 
 
RESULTS:

GPU               Method                        ShadowPass   ZPrepass   OpaquePass   All GPU
NVIDIA GTX 960    InputLayout                   4.52 ms      0.37 ms    6.12 ms      15.68 ms
NVIDIA GTX 960    CustomFetch (typed buffer)    18.89 ms     1.31 ms    8.68 ms      33.58 ms
NVIDIA GTX 960    CustomFetch (RAW buffer 1)    18.29 ms     1.35 ms    8.62 ms      33.03 ms
NVIDIA GTX 960    CustomFetch (RAW buffer 2)    18.42 ms     1.32 ms    8.61 ms      33.18 ms
AMD RX 470        InputLayout                   7.43 ms      0.29 ms    3.06 ms      14.01 ms
AMD RX 470        CustomFetch (typed buffer)    7.41 ms      0.31 ms    3.12 ms      14.08 ms
AMD RX 470        CustomFetch (RAW buffer 1)    7.50 ms      0.29 ms    3.07 ms      14.09 ms
AMD RX 470        CustomFetch (RAW buffer 2)    7.56 ms      0.28 ms    3.09 ms      14.15 ms
 

I have attached a txt file with easier readability.

 

 

This is quite painful for me, because I wanted to implement some features which require custom fetching, but seeing how slowly it runs on NVIDIA, it seems like wasted effort.

By the way, to implement this quickly, I bound my vertex buffers to texture slots 30 and up - could that matter for performance?

Side note: it seems that this way the CPU time is also higher, because VSSetShaderResources takes longer than IASetVertexBuffers. :(

Edited by turanszkij

Are you using 1 buffer for all attributes or 1 per attribute?
If you're not using 1 for all attributes, you should probably try that.

I am using 1 buffer per attribute (SoA). I recently switched from an AoS layout to an SoA layout, with a complete rewrite of the scene rendering pipeline, to allow more flexible buffer binding and better cache efficiency in depth-only passes, of which there are a lot more than regular passes. This gained me a substantial performance boost on both AMD and NVIDIA, so I am not interested in going back.

I meant 1 buffer with all positions, then all normals, and so on...

So still SoA, but in a single buffer, with each attribute array appended after the previous one, if you will.
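For what it's worth, a rough sketch of what that could look like with a single raw buffer and per-attribute base offsets supplied by the CPU (the buffer name, constant buffer and 16-byte attribute stride are my assumptions):

ByteAddressBuffer meshData : register(t0); // [ all positions | all normals | ... ] in one buffer

cbuffer MeshLayoutCB : register(b1)
{
    uint g_PositionOffset; // byte offset of the position array within meshData
    uint g_NormalOffset;   // byte offset of the normal array within meshData
    uint2 g_padding;
};

void FetchVertex(uint vertexID, out float4 position, out float4 normal)
{
    position = asfloat(meshData.Load4(g_PositionOffset + vertexID * 16));
    normal   = asfloat(meshData.Load4(g_NormalOffset   + vertexID * 16));
}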

Edited by Ingenu

In general, the reason for different types of seemingly similar resources is that at least one major IHV has (potentially legacy) fast-path hardware that differentiates between them. There are a number of buffer types which perform differently on NV GPUs while AMD's GCN GPUs simply don't care. You're seeing hardware design issues leaking through the software abstractions.

Ideally, we would just have buffers read by shaders and nothing else, not even textures. (I mean come on, texture buffers?) GPUs haven't reached that level of generalized functionality yet. MS originally pitched this design when they were sketching out D3D 10 and of course the IHVs explained it wasn't feasible.

Edited by Promit

Quote from Ingenu:

I meant 1 buffer with all positions, then all normals, and so on... So still SoA, but in a single buffer, with each attribute array appended after the previous one, if you will.

Why do you think it would be better? I imagine it would be harder to manage, because now you would even have to provide the lengths of the attribute arrays to know the correct offsets in the shaders.

On 2017/6/1 at 11:59 PM, turanszkij said:

I am using 1 buffer per attribute (SoA). I recently switched from an AoS layout to an SoA layout, with a complete rewrite of the scene rendering pipeline, to allow more flexible buffer binding and better cache efficiency in depth-only passes, of which there are a lot more than regular passes. This gained me a substantial performance boost on both AMD and NVIDIA, so I am not interested in going back.

I did some tests on mobile GPUs; all tests use an AoS-style buffer.

Adreno 418: SSBO is slower (50 fps) than the vertex buffer path (55 fps).

Adreno 512: no performance difference between SSBO and vertex buffer; this might be because the Adreno 512 has a unified memory model.

Mali (T7xx, T8xx, G71): none of those GPUs support SSBO in the vertex shader (although they support OpenGL ES 3.1).

GL_MAX_TEXTURE_BUFFER_SIZE is about 64 KB on Mali, so I can't use a texture buffer for vertex pulling either.



  • Similar Content

    • By AxeGuywithanAxe
      I wanted to see how others are currently handling descriptor heap updates and management.
      I've read a few articles, and there tend to be three major strategies:
      1) You split up descriptor heaps per shader stage (i.e. one for the vertex shader, pixel, hull, etc.)
      2) You have one descriptor heap for an entire pipeline
      3) You split up descriptor heaps per update frequency (i.e. EResourceSet_PerInstance, EResourceSet_PerPass, EResourceSet_PerMaterial, etc.)
      The benefit of the first two approaches is that it makes it easier to port current code, and descriptor management and updating tends to be easier, but they seem to be less efficient.
      The benefit of the third approach seems to be that it is the most efficient, because you only manage and update objects when they change.
    • By evelyn4you
      hi,
      Until now I have used the typical vertex shader approach for skinning, with a constant buffer containing the transform matrices for the bones and a vertex buffer containing the bone indices and bone weights.
      Now I have implemented realtime environment probe cubemapping, so I have to render my scene from many points of view, and skinning takes too long because it is recalculated for every side of the cubemap.
      For info: I am working on Windows 7 and therefore use Shader Model 5.0, not 5.x which has more options - or is there a way to use 5.x on Windows 7?
      My graphics card is a DirectX 12 compatible NVIDIA GTX 960.
      The member turanszkij has posted a compute shader that I can understand (for info: in his engine he uses an optimized version of it):
      https://turanszkij.wordpress.com/2017/09/09/skinning-in-compute-shader/
      Now my questions:
      Is it possible to feed the compute shader with my original vertex buffer, or do I have to copy it into several ByteAddressBuffers as implemented in the following code?
      The same question applies to the constant buffer holding the matrices.
      My more urgent question is: how do I feed my normal pipeline with the result of the compute shader, which is 2 RWByteAddressBuffers containing position and normal?
      For example I could use 2 vertex buffer bindings:
      1. containing only the UV coordinates
      2. containing position and normal
      How do I copy from the RWByteAddressBuffers to the vertex buffer?
       
      (Code from turanszkij)
      Here is my shader implementation for skinning a mesh in a compute shader:
      struct Bone
      {
          float4x4 pose;
      };
      StructuredBuffer<Bone> boneBuffer;

      ByteAddressBuffer vertexBuffer_POS; // T-Pose pos
      ByteAddressBuffer vertexBuffer_NOR; // T-Pose normal
      ByteAddressBuffer vertexBuffer_WEI; // bone weights
      ByteAddressBuffer vertexBuffer_BON; // bone indices

      RWByteAddressBuffer streamoutBuffer_POS; // skinned pos
      RWByteAddressBuffer streamoutBuffer_NOR; // skinned normal
      RWByteAddressBuffer streamoutBuffer_PRE; // previous frame skinned pos

      inline void Skinning(inout float4 pos, inout float4 nor, in float4 inBon, in float4 inWei)
      {
          float4 p = 0, pp = 0;
          float3 n = 0;
          float4x4 m;
          float3x3 m3;
          float weisum = 0;

          // force loop to reduce register pressure
          // though this way we can not interleave TEX - ALU operations
          [loop]
          for (uint i = 0; ((i < 4) && (weisum < 1.0f)); ++i)
          {
              m = boneBuffer[(uint)inBon[i]].pose;
              m3 = (float3x3)m;

              p += mul(float4(pos.xyz, 1), m) * inWei[i];
              n += mul(nor.xyz, m3) * inWei[i];

              weisum += inWei[i];
          }

          bool w = any(inWei);
          pos.xyz = w ? p.xyz : pos.xyz;
          nor.xyz = w ? n : nor.xyz;
      }

      [numthreads(1024, 1, 1)]
      void main(uint3 DTid : SV_DispatchThreadID)
      {
          const uint fetchAddress = DTid.x * 16; // stride is 16 bytes for each vertex buffer now...

          uint4 pos_u = vertexBuffer_POS.Load4(fetchAddress);
          uint4 nor_u = vertexBuffer_NOR.Load4(fetchAddress);
          uint4 wei_u = vertexBuffer_WEI.Load4(fetchAddress);
          uint4 bon_u = vertexBuffer_BON.Load4(fetchAddress);

          float4 pos = asfloat(pos_u);
          float4 nor = asfloat(nor_u);
          float4 wei = asfloat(wei_u);
          float4 bon = asfloat(bon_u);

          Skinning(pos, nor, bon, wei);

          pos_u = asuint(pos);
          nor_u = asuint(nor);

          // copy previous frame's current pos to current frame's previous pos
          streamoutBuffer_PRE.Store4(fetchAddress, streamoutBuffer_POS.Load4(fetchAddress));

          // write out skinned properties:
          streamoutBuffer_POS.Store4(fetchAddress, pos_u);
          streamoutBuffer_NOR.Store4(fetchAddress, nor_u);
      }
    • By mister345
      Hi, can someone please explain why this is giving an assertion EyePosition!=0 exception?
       
      _lightBufferVS->viewMatrix = DirectX::XMMatrixLookAtLH(XMLoadFloat3(&_lightBufferVS->position), XMLoadFloat3(&_lookAt), XMLoadFloat3(&up));
      It looks like DirectX doesn't want the 2nd parameter to be a zero vector in the assertion, but I passed in a zero vector with this exact same code in another program and it ran just fine. Here is the version of the code that worked - note that the XMLoadFloat3(&m_lookAt) parameter value is (0,0,0) at runtime (I debugged it), but it throws no exceptions:
          m_viewMatrix = DirectX::XMMatrixLookAtLH(XMLoadFloat3(&m_position), XMLoadFloat3(&m_lookAt), XMLoadFloat3(&up));
      Here is the repo for the broken code (see LightClass): https://github.com/mister51213/DirectX11Engine/blob/master/DirectX11Engine/LightClass.cpp
      and here is the repo with the alternative version of the code that is working with a value of (0,0,0) for the second parameter.
      https://github.com/mister51213/DX11Port_SoftShadows/blob/master/Engine/lightclass.cpp
    • By mister345
      Hi, can somebody please tell me, in clear and simple steps, how to debug and step through an HLSL shader file?
      I already did Debug > Start Graphics Debugging, then captured some frames from Visual Studio and
      double-clicked on a frame to open it, but I have no idea where to go from there.
       
      I've been searching for hours and there's no information on this, not even on the Microsoft Website!
      They say "open the  Graphics Pixel History window" but there is no such window!
      Then they say, in the "Pipeline Stages choose Start Debugging"  but the Start Debugging option is nowhere to be found in the whole interface.
      Also, how do I even open the hlsl file that I want to set a break point in from inside the Graphics Debugger?
       
      All I want to do is set a break point in a specific hlsl file, step thru it, and see the data, but this is so unbelievably complicated
      and Microsoft's instructions are horrible! Somebody please, please help.
       
       
       

    • By mister345
      I finally ported Rastertek's tutorial #42 on soft shadows and blur shading. This tutorial has a ton of really useful effects, and there's no working version anywhere online.
      Unfortunately it just draws a black screen, and I'm not sure what's causing it. I'm guessing the camera or ortho matrix transforms are wrong, the light directions, or maybe texture resources not being properly initialized. I didn't change any of the variables though; I only upgraded the types and functions (D3DXVECTOR3 to XMFLOAT3) and used DirectXTK for texture loading. If anyone is willing to take a look at what might be causing the black screen, maybe something pops out to you, let me know, thanks.
      https://github.com/mister51213/DX11Port_SoftShadows
       
      Also, for reference, here's tutorial #40 which has normal shadows but no blur, which I also ported, and it works perfectly.
      https://github.com/mister51213/DX11Port_ShadowMapping
       