Optimal alignment of vertex data

This topic is 2537 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

Hi,

I just finished overhauling my vertex buffer management. The old system created a separate VBO for every vertex attribute; the new system shares a single VBO among the attributes and interleaves the data.

Often, people suggest aligning interleaved vertex data to a multiple of a certain number of bytes (32 comes up frequently) for optimal performance. I thought this was plausible, because data for SIMD operations on the CPU has to be aligned too. So I ran some tests using a large number of vertices and simple vertex and fragment shaders to compare 32-byte alignment, 16-byte alignment, and tight packing. To my surprise, I was unable to measure any difference in performance. These tests were performed on an AMD HD4870X2 with the latest drivers.
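For illustration, the three layouts being compared could look roughly like the sketch below. The attribute set (position, normal, uv, colour) and the struct names are assumptions made for the example, not the format actually used in the tests. The `sizeof` of each struct is what would be passed as the stride to `glVertexAttribPointer`, and the member offsets as the attribute pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical attribute set (an assumption, not the poster's real format).
// Sizes assume 4-byte floats with 4-byte alignment, as on common platforms.
struct VertexTight {
    float        position[3];  // offset  0, 12 bytes
    float        normal[3];    // offset 12, 12 bytes
    float        uv[2];        // offset 24,  8 bytes
    std::uint8_t colour[4];    // offset 32,  4 bytes
};                             // 36 bytes in total, no padding

struct VertexPad16 {
    float        position[3];
    float        normal[3];
    float        uv[2];
    std::uint8_t colour[4];
    std::uint8_t pad[12];      // 36 -> 48, the next multiple of 16
};

struct VertexPad32 {
    float        position[3];
    float        normal[3];
    float        uv[2];
    std::uint8_t colour[4];
    std::uint8_t pad[28];      // 36 -> 64, the next multiple of 32
};

static_assert(sizeof(VertexTight) == 36, "tight layout should be 36 bytes");
static_assert(sizeof(VertexPad16) == 48, "16-byte padded layout should be 48 bytes");
static_assert(sizeof(VertexPad32) == 64, "32-byte padded layout should be 64 bytes");

// Bytes needed to store `count` vertices with the given stride.
inline std::size_t bufferBytes(std::size_t stride, std::size_t count) {
    assert(stride > 0);
    return stride * count;
}
```

With this particular attribute set, one million vertices take 36,000,000 bytes tightly packed and 64,000,000 bytes when padded to a multiple of 32, which is the memory cost the question below is about.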

My questions are: Should vertex data be aligned for other (older?) models? If so, what is the recommended number of bytes? Is it worth the increase in memory usage?

Thanks in advance.

I don't think actual alignment matters so much tbh, as long as you can cram as much as possible into the post-transform cache. Smaller vertices and good reuse fulfil that and also reduce bandwidth.

The 32-byte figure is not an alignment requirement. It is a size requirement.
For SSE instructions, it is a different matter. Those require data aligned on 16-byte addresses.

So back to GPUs. The 32-byte recommendation comes from an old nVidia document. AMD has not said anything and I don't know about today's nVidia GPUs.
The size of the vertex structure should be a multiple of 32 bytes. If it is below 32, add padding until it reaches 32.
If it is above 32, add padding until it reaches 64.
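The padding rule described above amounts to rounding the vertex size up to the next multiple of 32. A minimal sketch, with hypothetical helper names that are not taken from any library:

```cpp
#include <cassert>
#include <cstddef>

// Round `size` up to the next multiple of `multiple`.
inline std::size_t roundUpToMultiple(std::size_t size, std::size_t multiple) {
    assert(multiple > 0);
    return ((size + multiple - 1) / multiple) * multiple;
}

// Number of padding bytes the rule adds to a vertex of `size` bytes.
inline std::size_t paddingBytes(std::size_t size, std::size_t multiple) {
    return roundUpToMultiple(size, multiple) - size;
}
```

For example, a 24-byte vertex is padded to 32 bytes, a 32-byte vertex stays as it is, and a 48-byte vertex gains 16 bytes of padding to reach 64.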

If you think it wastes too much space and makes no difference on your target GPUs, then don't bother with it.

Alright, thanks to both of you for your answers. We're targeting OpenGL 2.0 capable cards as a minimum, and I think all such cards should be able to handle non-padded data.

Another question:
Quote:

I don't think actual alignment matters so much tbh, as long as you can cram as much as possible into the post-transform cache. Smaller vertices and good reuse fulfil that and also reduce bandwidth.

Wouldn't the size per vertex in the post tnl cache depend on the vertex shader's out variables, not its in variables?

As for the reuse: there's a mesh optimizer in the works, but the main difficulties are that there's no way to query the post tnl cache size, and that a mesh can of course be rendered with different shader programs. So it's impossible to determine the optimal ordering.
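Since the real cache size can't be queried, one common workaround is to score an index ordering against an assumed cache size. The sketch below models the post-transform cache as a plain FIFO, which is an assumption (real hardware may behave differently), and counts how often the vertex shader would have to run. The function names are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Counts vertex shader invocations for an index list, modelling the
// post-transform cache as a FIFO holding `cacheSize` vertex indices.
inline std::size_t countCacheMisses(const std::vector<std::uint32_t>& indices,
                                    std::size_t cacheSize) {
    assert(cacheSize > 0);
    std::deque<std::uint32_t> fifo;
    std::size_t misses = 0;
    for (std::uint32_t index : indices) {
        // A hit leaves the FIFO untouched (no reordering on access).
        if (std::find(fifo.begin(), fifo.end(), index) != fifo.end()) {
            continue;
        }
        ++misses;
        fifo.push_back(index);
        if (fifo.size() > cacheSize) {
            fifo.pop_front();  // evict the oldest entry
        }
    }
    return misses;
}

// Average cache miss ratio: shader invocations per triangle.
// Lower is better; 3.0 means no reuse at all.
inline double averageCacheMissRatio(const std::vector<std::uint32_t>& indices,
                                    std::size_t cacheSize) {
    assert(!indices.empty() && indices.size() % 3 == 0);
    return static_cast<double>(countCacheMisses(indices, cacheSize)) /
           static_cast<double>(indices.size() / 3);
}
```

Running the same index buffer through several assumed cache sizes gives a rough idea of how sensitive an ordering is to the unknown hardware value.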

It's the VS input I'm on about. Obviously the smaller your verts are, the longer they'll stay in the cache regardless of how optimally you read from them.

I think this might trounce the conflicting statement about 32/64 byte alignment these days, but I wouldn't argue the point as I really don't know anymore. Things move fast in GPU land.

In my engine we only use two main formats for 3D work. A 32 byte format with a position, two normals, a colour and two sets of 16bit uv's. The skinned version has the weights tagged on the end but isn't padded to 64. (I think it's about 52 iirc.)

There is a nice function in D3DX for optimising meshes and we just use that. I really don't know if it takes into account the hardware or just does a best job of minimising vertex changes generally tbh, though I suspect the latter.

I think you would only see a measurable difference if you used a strange vertex size like 23 or so. I think I read somewhere that 32 bits is the best value, because the memory registers(??) on the GPU are 32 bits wide, so only one register has to be read to load the vertex data. This is also true for 4-, 8- or 16-bit vertices, but with other vertex sizes the GPU sometimes has to read two registers to get the data for one vertex.

In most cases this shouldn't be the limiting factor, and it doesn't matter whether you pad or not. But as long as the padding isn't too big (like 33 padded to 64), it shouldn't do any harm and might increase performance a little bit.

The vertex size is far more important for dynamic vertex buffers, because the CPU memory controller usually reads memory in aligned chunks of 64 bytes. So if you use 48 bytes per vertex, some vertices will need one access and others two.
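The one-access-versus-two effect can be checked with a bit of arithmetic. The sketch below assumes the buffer starts on a 64-byte boundary and that a "chunk" is a 64-byte line. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Number of aligned `lineSize`-byte chunks touched by vertex `index`,
// assuming the buffer itself starts on a chunk boundary.
inline std::size_t linesTouched(std::size_t index, std::size_t stride,
                                std::size_t lineSize = 64) {
    assert(stride > 0 && lineSize > 0);
    const std::size_t firstByte = index * stride;
    const std::size_t lastByte  = firstByte + stride - 1;
    return lastByte / lineSize - firstByte / lineSize + 1;
}

// How many of the first `vertexCount` vertices span more than one chunk.
inline std::size_t countStraddlingVertices(std::size_t vertexCount,
                                           std::size_t stride,
                                           std::size_t lineSize = 64) {
    std::size_t straddling = 0;
    for (std::size_t i = 0; i < vertexCount; ++i) {
        if (linesTouched(i, stride, lineSize) > 1) {
            ++straddling;
        }
    }
    return straddling;
}
```

Under these assumptions, a 48-byte stride makes two out of every four vertices span two chunks, while 32-byte and 64-byte strides never straddle a boundary.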

If you are using static buffers and the data is on the GPU, size matters less, but it is still a good idea to use 32 or 64 bytes per vertex, since sometimes there isn't enough video memory available and static vertex data may remain in main memory.

Sorry for the late reply, holidays and all that ;)

Quote:

The vertex size is far more important for dynamic vertex buffers, because the CPU memory controller usually reads memory in aligned chunks of 64 bytes. So if you use 48 bytes per vertex, some vertices will need one access and others two.


That's good advice. We mostly use hardware skinning, but there are a few places where we store vertices as dynamic draw (so most probably stored in AGP mem). I'll take a look to see if it makes a difference there.

Most of the models we use have 48 byte vertices, so padding to 64 bytes would be quite wasteful, since it appears to result in only a tiny performance improvement if any at all.

Quote:

It's the VS input I'm on about. Obviously the smaller your verts are, the longer they'll stay in the cache regardless of how optimally you read from them.

I think this might trounce the conflicting statement about 32/64 byte alignment these days, but I wouldn't argue the point as I really don't know anymore. Things move fast in GPU land.

In my engine we only use two main formats for 3D work. A 32 byte format with a position, two normals, a colour and two sets of 16bit uv's. The skinned version has the weights tagged on the end but isn't padded to 64. (I think it's about 52 iirc.)

There is a nice function in D3DX for optimising meshes and we just use that. I really don't know if it takes into account the hardware or just does a best job of minimising vertex changes generally tbh, though I suspect the latter.


While I was doing some research on the subject I came across this paper: http://www.cworldlab.com/board/data/GamePrograming/ATI-DX9_Optimization.pdf
Although it's pretty old, its chapter 'Rendering Primitives' mentions that ID3DXMesh::Optimize has knowledge of the post tnl cache size. Thought you might be interested.

Anyways, the two relevant caches for the VS are the pre tnl cache and the post tnl cache. The pre cache benefits from sequential access, and the post cache benefits from vertex reuse. I'm pretty sure the cost per vertex in the post cache depends on the VS output size. Think about it - the point of the post cache is to avoid running the VS if the calculations have already been done. It's not the input we're interested in in such cases, only the output.

Quote:

A 32 byte format with a position, two normals, a colour and two sets of 16bit uv's.

Wouldn't that be 64 bytes ;)
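The total depends entirely on the component formats, which the quoted post doesn't state. As a rough worked example, the sketch below adds up the format under two assumed encodings: full float normals, and normals packed into 4 bytes each. Both encodings, and all the names, are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>

// Per-attribute sizes in bytes for the quoted format:
// one position, two normals, one colour, two uv sets.
struct AttributeSizes {
    std::size_t position;
    std::size_t normal;   // size of ONE normal
    std::size_t colour;
    std::size_t uvSet;    // size of ONE uv set
};

inline std::size_t vertexSize(const AttributeSizes& a) {
    return a.position + 2 * a.normal + a.colour + 2 * a.uvSet;
}

// Assumption A: float3 position, float3 normals, 4-byte colour,
// two 16-bit components per uv set.
const AttributeSizes kFloatNormals  = {12, 12, 4, 4};

// Assumption B: same, but each normal packed into 4 bytes.
const AttributeSizes kPackedNormals = {12, 4, 4, 4};
```

Under assumption A the vertex comes to 48 bytes, and under assumption B to 32 bytes, so a 32-byte total is plausible if the normals are stored in a compressed form.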

[Edited by - Rene Z on January 2, 2011 11:26:49 AM]
