Why are we still using index/vertex/instance buffers?

18 comments, last by Corvo 6 years, 3 months ago

Thanks guys, I will thoroughly experiment with different methods and GPUs. I will be conducting benchmarks on the Sponza scene with Nvidia GTX 1070, GTX 960, AMD RX 470 and Snapdragon 808 GPUs. All of them will be timed for the input-layout rendering method, custom fetch with typed buffers, and custom fetch with raw buffers.

I will post results in this topic when I am finished.

A quick spoiler: the AMD RX 470 performs surprisingly well even with typed-buffer custom fetch (in fact there is virtually no difference whatsoever), while the GTX 960 suffers greatly, with roughly doubled rendering time.


I have done the testing for an AMD and an NVIDIA GPU; the Snapdragon 808 will have to wait, as setting up the scene for it will take some more time. I will also post the results for the GTX 1070 later.

Here you go:

Program: Wicked Engine Editor
API: DX11
Test scene: Sponza
- 3 shadow cascades (2D) - 3 scene render passes
- 1 spotlight shadow (2D) - 1 scene render pass
- 4 pointlight shadows (Cubemap) - 4 scene render passes
- Z prepass - 1 scene render pass
- Opaque pass - 1 scene render pass
Timing method: DX11 timestamp queries
Methods:
- InputLayout: The default hardware vertex buffer usage with CPU-side input layout declarations. The instance buffers are bound as vertex buffers with each render call.
- CustomFetch (typed buffer): Vertex buffers are bound as shader resource views with the DXGI_FORMAT_R32G32B32A32_FLOAT format. Instance buffers are bound as structured buffers holding a 4x4 matrix each.
- CustomFetch (RAW buffer 1): Vertex buffers are bound as shader resource views on buffers created with the D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS misc flag. In the shader, the buffers are addressed by byte offsets from the beginning of the buffer. Instance buffers are bound as structured buffers holding a 4x4 matrix each.
- CustomFetch (RAW buffer 2): Even the instancing information is retrieved from raw buffers instead of structured buffers.
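For reference, here is a minimal HLSL sketch of what the typed-buffer and raw-buffer fetch styles can look like (register slots, the packed float4 layout, and function names are my assumptions for illustration, not the engine's actual code):

```hlsl
// Typed fetch: each attribute stream viewed as a Buffer<float4> SRV.
Buffer<float4> vbPosition : register(t0);
// Instancing data as a structured buffer holding a 4x4 matrix per instance.
StructuredBuffer<float4x4> instanceBuffer : register(t1);
// Raw fetch: the same stream viewed as a ByteAddressBuffer,
// addressed in byte offsets (16 bytes per float4 element).
ByteAddressBuffer vbPositionRaw : register(t2);

float4 FetchPositionTyped(uint vertexID)
{
    return vbPosition[vertexID];
}

float4 FetchPositionRaw(uint vertexID)
{
    return asfloat(vbPositionRaw.Load4(vertexID * 16));
}

float4 main(uint vertexID : SV_VertexID,
            uint instanceID : SV_InstanceID) : SV_Position
{
    // No input layout: the shader pulls its own vertex data.
    float4 pos = FetchPositionTyped(vertexID); // or FetchPositionRaw(vertexID)
    return mul(pos, instanceBuffer[instanceID]);
}
```

The "RAW buffer 2" variant would additionally replace the StructuredBuffer above with a ByteAddressBuffer and reconstruct the matrix from four Load4 calls.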

ShadowPass and ZPrepass: These use at most 3 buffers:
- position (float4)
- UV (float4) // only for alpha tested
- instance buffer
OpaquePass: This uses 6 buffers:
- position (float4)
- normal (float4)
- UV (float4)
- previous frame position VB (float4)
- instance buffer (float4x4)
- previous frame instance buffer (float4x3)
RESULTS:

GPU             Method                       ShadowPass  ZPrepass  OpaquePass  Total GPU
NVidia GTX 960  InputLayout                   4.52 ms    0.37 ms    6.12 ms    15.68 ms
NVidia GTX 960  CustomFetch (typed buffer)   18.89 ms    1.31 ms    8.68 ms    33.58 ms
NVidia GTX 960  CustomFetch (RAW buffer 1)   18.29 ms    1.35 ms    8.62 ms    33.03 ms
NVidia GTX 960  CustomFetch (RAW buffer 2)   18.42 ms    1.32 ms    8.61 ms    33.18 ms
AMD RX 470      InputLayout                   7.43 ms    0.29 ms    3.06 ms    14.01 ms
AMD RX 470      CustomFetch (typed buffer)    7.41 ms    0.31 ms    3.12 ms    14.08 ms
AMD RX 470      CustomFetch (RAW buffer 1)    7.50 ms    0.29 ms    3.07 ms    14.09 ms
AMD RX 470      CustomFetch (RAW buffer 2)    7.56 ms    0.28 ms    3.09 ms    14.15 ms

I have attached a txt file for easier readability.

This is quite painful for me, because I wanted to implement some features which require custom fetching, but seeing how slowly it runs on Nvidia, it seems like wasted effort.

By the way, to implement this quickly, I bound my vertex buffers to texture slot 30 and upward; could that matter for performance?

Side note: it seems that this way the CPU time is also higher, because VSSetShaderResources takes longer than IASetVertexBuffers. :(

Are you using 1 buffer for all attributes or 1 per attribute?
If you're not using 1 for all attributes, you should probably try that.

-* So many things to do, so little time to spend. *-

I am using 1 buffer per attribute (SoA). I recently switched from an AoS layout to an SoA layout, along with a complete rewrite of the scene rendering pipeline, to allow more flexible buffer binding and better cache efficiency in the depth-only passes, of which there are a lot more than regular passes. This gained me a substantial performance boost on both AMD and Nvidia, so I am not interested in going back.

I meant 1 buffer with all positions, then all normals, and so on...

So still SoA, but in a single buffer, with each attribute array appended after the previous one, if you will.

-* So many things to do, so little time to spend. *-

In general, the reason for different types of seemingly similar resources is that at least one major IHV has (potentially legacy) fast-path hardware that differentiates between them. There are a number of buffer types which perform differently on NV GPUs while AMD's GCN GPUs simply don't care. You're seeing hardware design issues leaking through the software abstractions.

Ideally, we would just have buffers read by shaders and nothing else, not even textures. (I mean come on, texture buffers?) GPUs haven't reached that level of generalized functionality yet. MS originally pitched this design when they were sketching out D3D 10 and of course the IHVs explained it wasn't feasible.

SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.

I meant 1 buffer with all positions, then all normals, and so on...

So still SoA, but in a single buffer, with each attribute array appended after the previous one, if you will.

Why do you think it would be better? I imagine it would be harder to manage, because now you would even have to provide the lengths of the buffers to know the correct offsets in the shaders.
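To make the offset question concrete, here is a hedged HLSL sketch of how the single-buffer SoA variant could be addressed, assuming the per-attribute byte offsets are uploaded in a constant buffer (all names and the float4 packing are hypothetical):

```hlsl
// All attribute arrays appended into one raw buffer:
// [positions][normals][uvs]...
ByteAddressBuffer vertexData : register(t0);

// Hypothetical constant buffer supplying the start of each array,
// e.g. normalOffset = numVertices * 16 for float4 positions.
cbuffer VertexStreamOffsets : register(b0)
{
    uint positionOffset; // usually 0
    uint normalOffset;
    uint uvOffset;
    uint padding;
};

float4 FetchPosition(uint vertexID)
{
    return asfloat(vertexData.Load4(positionOffset + vertexID * 16));
}

float4 FetchNormal(uint vertexID)
{
    return asfloat(vertexData.Load4(normalOffset + vertexID * 16));
}
```

This trades one SRV binding per attribute for a handful of offsets in an already-bound constant buffer, which is the bookkeeping cost being discussed here.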

Why do you think it would be better?
Fewer registers used, or less usage of the DX11 equivalent of a root signature?

-potential energy is easily made kinetic-

On 2017/6/1 at 11:59 PM, turanszkij said:

I am using 1 buffer per attribute (SoA). I recently switched from an AoS layout to an SoA layout, along with a complete rewrite of the scene rendering pipeline, to allow more flexible buffer binding and better cache efficiency in the depth-only passes, of which there are a lot more than regular passes. This gained me a substantial performance boost on both AMD and Nvidia, so I am not interested in going back.

I did some tests on mobile GPUs; all tests use an AoS-style buffer.

Adreno 418: SSBO is slower (50 fps) than a vertex buffer (55 fps).

Adreno 512: no performance difference between SSBO and a vertex buffer; this might be because the Adreno 512 has a unified memory model.

Mali (T7xx, T8xx, G71): none of these GPUs support SSBO in the vertex shader (although they support OpenGL ES 3.1).

GL_MAX_TEXTURE_BUFFER_SIZE is about 64 KB on Mali, so I can't use a texture buffer for vertex pulling either.

