Data oriented design and rendering

3 comments, last by Hodgman 8 years, 8 months ago

I'm trying to apply data oriented design to my engine's main systems, including the renderer. This is easily done for things like particle systems and agent simulation, but I'm having a hard time applying it to mesh rendering.

Before DOD, I had a "Mesh" class with its vertices, normals, texcoords, etc. My renderer also had arrays of vertex and index buffers. This was your typical "Array of Structures" pattern.

Now, I'm trying to convert this to "Structure of Arrays" but I'm hitting a wall when it comes to sending mesh data to the shaders. My vertex shader uses a StructuredBuffer for all the vertices. Here's what it looks like:


struct Vertex
{
    float3 position;
    float3 normal;
    float2 texcoord;
};

StructuredBuffer<Vertex> vertexBuffer : register(t0);

This encourages the "AoS" pattern. Before issuing a draw call, it forces me to upload a buffer of "Vertex" on the CPU, which is the opposite of what I want. Would shader performance be the same if I split the buffer in the shader into 3 StructuredBuffers, one for each vertex attribute? Am I doing all this for nothing in the end?
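
For reference, the split I have in mind would look roughly like this (just a sketch; the VSMain/PerObject/worldViewProj names and the SV_VertexID indexing are placeholders, not my actual shader):

// One StructuredBuffer per attribute instead of one buffer of Vertex structs.
StructuredBuffer<float3> positions : register(t0);
StructuredBuffer<float3> normals   : register(t1);
StructuredBuffer<float2> texcoords : register(t2);

cbuffer PerObject : register(b0)
{
    float4x4 worldViewProj; // placeholder transform
};

struct VSOutput
{
    float4 position : SV_Position;
    float3 normal   : NORMAL;
    float2 texcoord : TEXCOORD0;
};

VSOutput VSMain(uint vertexID : SV_VertexID)
{
    VSOutput o;
    o.position = mul(worldViewProj, float4(positions[vertexID], 1.0f));
    o.normal   = normals[vertexID];
    o.texcoord = texcoords[vertexID];
    return o;
}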


Rendering generally doesn't benefit from having SoA vertex buffer layouts (unless you have different vertex formats/shaders to be used with the same buffers, in which case you might want to separate the optional parts for speed).

Every vertex is fully (all used attributes/elements) loaded into the vertex shader, transformed and possibly saved in the post-transform cache. Many vertices could be processed at the same time inside the GPU, but it wouldn't change the amount/location of loaded data.


This encourages the "AoS" pattern. Before issuing a draw call, it forces me to upload a buffer of "Vertex" on the CPU, which is the opposite of what I want. Would shader performance be the same if I split the buffer in the shader into 3 StructuredBuffers, one for each vertex attribute? Am I doing all this for nothing in the end?

What you're finding out is that the vertex -- whatever that means to your render code -- is the natural atomic unit you're dealing with. You wouldn't split its normal from its position any more than you would split X from Y in a position structure.

DOD isn't about blindly decomposing things to their basest elements, it's about understanding what elements are used strictly (or almost-so) together. You probably could decompose it if you really wanted, but I'm not sure it'd be worthwhile.

throw table_exception("(ノ ゜Д゜)ノ ︵ ┻━┻");

+1 to everything that's been said.
Meshes are dealt with mostly by the GPU, not the CPU. The GPU often prefers data in AoS style, not SoA.

DOD isn't about blindly decomposing things to their basest elements, it's about understanding what elements are used strictly (or almost-so) together.

Cheers to that!

Rendering generally doesn't benefit from having SoA vertex buffer layouts (unless you have different vertex formats/shaders to be used with the same buffers, in which case you might want to separate the optional parts for speed).

I accidentally downvoted this... sorry!

Vertex shader hardware typically has had the capability to perform large fetches from video memory.
e.g. let's say it has a FETCH instruction, which moves 32 aligned bytes of data from VRAM into local cache, and then a MOVE instruction which moves 4 bytes from local cache into a register.
Given your AoS structure, the start of the VS would be 1 FETCH (because your struct is a very desirable size of exactly 32 bytes), and then 8 MOVE instructions.

If you split it out into 3 arrays (of size 12, 12, and 8 bytes), you'll require 3 FETCH instructions (fetching 96 bytes now!) and then 8 MOVEs.
However, it's worse than that! Those 3 FETCH instructions are actually now FETCH_UNALIGNED, because your smaller arrays no longer have a nice alignment! Those FETCH_UNALIGNED instructions might be implemented as 2x FETCH plus extra MOVEs (fetching 192 bytes now!).
Splitting into SoA in this case is obviously very bad! So the principles of DoD would tell you to stick to AoS.

This of course all depends on the hardware... Assuming that the hardware is optimized for contiguous 16-byte loads is probably a good rule of thumb though.
To really tune stuff properly, you'll need some good debuggers that let you know what the hardware is really doing, and/or manuals explaining how the hardware works in theory.

If you want to go deeper though, you can try to find a compact format that packs into a nice round size, e.g. for 16-byte vertices you could use:
position - DXGI_FORMAT_R16G16B16A16_SNORM (with scale/offset factors in your model->world or model->view matrix)
normals - DXGI_FORMAT_R10G10B10A2_UNORM (with a "normal=normal*2-1" in the shader... annoyingly pretty much every GPU can handle SNORM here but D3D doesn't allow it)
texcoords - DXGI_FORMAT_R16G16_FLOAT
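
As a rough sketch of what the shader side of that packing could look like (the struct/semantic names here are placeholders; the input assembler expands each format to floats for you, so only the normal bias needs undoing):

// 16-byte packed vertex as seen by the vertex shader.
struct PackedVSInput
{
    float4 position : POSITION;  // R16G16B16A16_SNORM, already in [-1, 1]
    float4 normal   : NORMAL;    // R10G10B10A2_UNORM, in [0, 1]
    float2 texcoord : TEXCOORD0; // R16G16_FLOAT
};

float3 UnpackNormal(float4 packedNormal)
{
    // Undo the UNORM encoding, since D3D won't accept SNORM for 10:10:10:2 here.
    return packedNormal.xyz * 2.0f - 1.0f;
}

// The SNORM position lands in [-1, 1]; the per-mesh scale/offset baked into the
// model->world (or model->view) matrix restores the original object-space range.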

If your rendering time is bottlenecked by the pre-transform vertex cache (or if you want to save video memory), then this will help a lot.... but in other situations it might not help at all.

DoD is about looking closely at the data layouts and the way that the hardware is transforming them, and then making decisions about how to improve the situation. It's not really possible to apply DoD to something like your vertex data layout without first reading a document on the architecture of the GPUs that you're targeting. If you don't know how the HW is transforming your data, there's no way that you can massage the data layouts to optimise said unknown transforms.

[edit]
SoA can still be useful though. E.g. When rendering opaque shadow maps, your VS only needs position data, and can ignore normals/UVs.
With your AoS layout, the pre-transform vertex cache gets polluted, as you'll continuously be fetching normal/uv data (because of the large fetch size) but not using it.
Another way to put that is that your vertex stride is larger than your (actually read) vertex size.
In that situation, you can put positions into one struct and normals+UVs in a second struct.

That change will ensure the cache isn't polluted when doing shadow rendering, and that things are still mostly ok in the main rendering pass (two structs is better than three)...

If you've got RAM to spare, you can optimize both passes by duplicating the positions - shadow pass uses a position-only stream, and the main pass uses a packed position+normal+uv stream.
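
A rough sketch of what those two streams could look like shader-side (names, registers and the lightViewProj constant are placeholders; which buffer feeds which input is set up in the input layout on the CPU side):

// Shadow pass: reads only the position-only stream.
cbuffer ShadowPass : register(b0)
{
    float4x4 lightViewProj; // placeholder shadow-camera transform
};

struct ShadowVSInput
{
    float3 position : POSITION; // position-only buffer
};

float4 ShadowVSMain(ShadowVSInput input) : SV_Position
{
    return mul(lightViewProj, float4(input.position, 1.0f));
}

// Main pass: reads the fat interleaved buffer (positions duplicated there).
struct MainVSInput
{
    float3 position : POSITION;
    float3 normal   : NORMAL;
    float2 texcoord : TEXCOORD0;
};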

This topic is closed to new replies.
