Rendering generally doesn't benefit from having SoA vertex buffer layouts (unless you have different vertex formats/shaders to be used with the same buffers, in which case you might want to separate the optional parts for speed).
I accidentally downvoted this... sorry!
Vertex shader hardware has typically been able to perform large fetches from video memory.
e.g. let's say it has a FETCH instruction, which moves 32 aligned bytes of data from VRAM into local cache, and then a MOVE instruction which moves 4 bytes from local cache into a register.
Given your AoS structure, the start of the VS would be 1 FETCH (because your struct is a very desirable size of exactly 32 bytes), and then 8 MOVE instructions.
If you split it out into 3 arrays (of sizes 12, 12, and 8 bytes), you'll require 3 FETCH instructions (fetching 96 bytes now!) and then 8 MOVEs.
However, it's worse than that! Those 3 FETCH instructions are actually now FETCH_UNALIGNED, because your smaller arrays no longer have a nice alignment! Those FETCH_UNALIGNED instructions might be implemented as 2x FETCH plus extra MOVEs (fetching 192 bytes now!).
Splitting into SoA in this case is obviously very bad! So the principles of DoD would tell you to stick to AoS.
This of course all depends on the hardware... Assuming that hardware is optimized for contiguous 16-byte loads is probably a good rule of thumb, though.
To really tune stuff properly, you'll need some good debuggers that let you know what the hardware is really doing, and/or manuals explaining how the hardware works in theory.
If you want to go deeper though, you can try to find a nice compact format that packs into a nice round size. e.g. for 16-byte vertices, you could use:

position - DXGI_FORMAT_R16G16B16A16_SNORM (with scale/offset factors in your model->world or model->view matrix)
normals - DXGI_FORMAT_R10G10B10A2_UNORM (with a "normal=normal*2-1" in the shader... annoyingly, pretty much every GPU can handle SNORM here but D3D doesn't allow it)
texcoords - DXGI_FORMAT_R16G16_FLOAT

If your rendering time is bottlenecked by the pre-transform vertex cache (or if you want to save video memory), then this will help a lot... but in other situations it might not help at all.
DoD is about looking closely at the data layouts and the way that the hardware is transforming them, and then making decisions about how to improve the situation. It's not really possible to apply DoD to something like your vertex data layout without first reading a document on the architecture of the GPUs that you're targeting. If you don't know how the HW is transforming your data, there's no way that you can massage the data layouts to optimise said unknown transforms.
[edit]
SoA can still be useful though. E.g. when rendering opaque shadow maps, your VS only needs position data, and can ignore normals/UVs.
With your AoS layout, the pre-transform vertex cache gets polluted, as you'll continuously be fetching normal/uv data (because of the large fetch size) but not using it.
Another way to put that is that your vertex stride is larger than your (actually read) vertex size.
In that situation, you can put positions into one struct and normals+UVs in a second struct.
That change will ensure the cache isn't polluted when doing shadow rendering, and that things are still mostly ok in the main rendering pass (two structs is better than three)...
If you've got RAM to spare, you can optimize both passes by duplicating the positions - the shadow pass uses a position-only stream, and the main pass uses a packed position+normal+uv stream.