Did you make sure that each vertex is a power-of-2 size? If your vertex is an odd size it will cross multiple cache lines - 16, 32 or 64 bytes are typically good sizes.
Aha! I had very odd vertex sizes: 12 bytes for a position, 12 bytes for a normal and 8 bytes for a texture coordinate, or just 12 bytes for a position and 8 bytes for a texture coordinate. So I will pad the positions, texture coordinates and normals out to 16 bytes each, and make the stride between whole vertices 64 bytes, since 48 bytes isn't a power of two. Is this a good strategy?
Actually it isn’t about the size being a power of 2, but about it being a multiple of the cache line size, which is usually 16 or 32 bytes. MSDN quotes 32 explicitly for certain cases of vertex buffers, but in my testing it improves performance in all cases, not just the ones they list. Of course, their documentation is for Direct3D, but cache behavior is a universal concern.
Performance Optimizations
Also, we are not talking about padding each vertex-buffer element. 12-byte normals and positions (etc.) are completely normal.
This padding is between each vertex in the buffer, not each element of each vertex.
Normally once you interleave positions, normals, UVs, tangents, and bitangents, you have 56-byte vertices (12 + 12 + 8 + 12 + 12). These should be padded out to 64.
L. Spiro
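To make the arithmetic above concrete, here is a minimal C++ sketch of the padding idea. The names `Vec2`, `Vec3`, `Vertex`, `padded_stride` and `lines_touched` are hypothetical and not from any particular API. It assumes 4-byte floats, a 32-byte cache line, and a vertex buffer that starts on a cache line boundary.

```cpp
#include <cstddef>

// Hypothetical plain-float vector types; 4-byte floats are assumed.
struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

// Interleaved vertex: the elements stay unpadded (12, 12, 8, 12, 12 bytes),
// and padding is added once, at the end of the whole vertex.
struct Vertex {
    Vec3 position;        // 12 bytes
    Vec3 normal;          // 12 bytes
    Vec2 uv;              //  8 bytes
    Vec3 tangent;         // 12 bytes
    Vec3 bitangent;       // 12 bytes
    unsigned char pad[8]; // 56 -> 64 bytes
};

static_assert(sizeof(Vertex) == 64, "vertex stride should be 64 bytes");

// Round a vertex size up to the next multiple of the cache line size.
constexpr std::size_t padded_stride(std::size_t size, std::size_t line) {
    return (size + line - 1) / line * line;
}

// Number of cache lines touched by `size` bytes starting at `offset`
// (assumes the buffer itself begins on a cache line boundary).
constexpr std::size_t lines_touched(std::size_t offset, std::size_t size,
                                    std::size_t line) {
    return (offset + size - 1) / line - offset / line + 1;
}

// A 56-byte vertex rounds up to a 64-byte stride with 32-byte lines.
static_assert(padded_stride(56, 32) == 64, "56 rounds up to 64");
```

With an unpadded 56-byte stride, the second vertex starts at offset 56 and spans three 32-byte lines, whereas every vertex at a 64-byte stride spans exactly two. Whether that shows up as a measurable difference depends on the hardware and driver, so it is worth profiling both layouts.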