Vertex buffer(s)

Started by
11 comments, last by Mr_Fox 7 years, 3 months ago

Hi forum!

While reverse-engineering a program, I found that it uses separate vertex buffers in combination with an index buffer:

  • vb0 = Positions (always used)
  • vb1 = Normals (always used)
  • vb2 = UV (used only in high quality)
  • Index buffer

The only explanation I have: under some conditions (e.g. when UVs aren't needed), they simply don't bind one of the buffers to the shader.

So, in low quality, GPU memory usage is lower.
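Assuming float3 positions, float3 normals, and float2 UVs (the actual sizes in the reversed program aren't stated), the memory saving can be sketched like this:

```cpp
#include <cassert>
#include <cstddef>

// Assumed attribute sizes: float3 position (vb0), float3 normal (vb1),
// float2 UV (vb2). These are guesses; the reversed program may differ.
constexpr std::size_t kPosBytes    = 3 * sizeof(float);
constexpr std::size_t kNormalBytes = 3 * sizeof(float);
constexpr std::size_t kUvBytes     = 2 * sizeof(float);

// GPU memory used by one mesh's vertex data; vb2 exists only in high quality.
constexpr std::size_t vertexBytes(std::size_t vertexCount, bool highQuality) {
    return vertexCount *
           (kPosBytes + kNormalBytes + (highQuality ? kUvBytes : 0));
}
```

For a 1300-vertex mesh this gives 31200 bytes in low quality versus 41600 in high quality, i.e. a 25% saving just from never creating vb2.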

Questions:

  1. Is there any benefit to separating positions and normals if both are used in all circumstances?
  2. If all buffers are used in a shader, is there any benefit to using several buffers instead of a single buffer that contains all the data for one vertex?

Thanks in advance!

Is there any benefit to separating positions and normals if both are used in all circumstances?

I am not sure what exactly 'all circumstances' means (even in a depth prepass? if you have one in your engine), but IMO you probably don't need normals during a depth prepass, so having unused data (the normals) pulled into the cache (a cache line read is a chunk of contiguous addresses) during the prepass is not a good idea (lower cache hit rate, more bandwidth pressure...).

With separate vertex buffers, you could use a position-only VB during the depth prepass (cache line utilization is near 100%), and that should be faster than a VB containing both position and normal.
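To put a number on "near 100%": with an interleaved position+normal vertex (24-byte stride; float3 sizes assumed), a position-only pass uses half of every fetched byte, while a separate 12-byte position stream uses all of them. A tiny sketch:

```cpp
#include <cassert>
#include <cstddef>

// Useful fraction of the bytes fetched during a position-only pass:
// 12 bytes of float3 position per vertex out of `stride` bytes read.
// Assumes stride >= 12 and densely packed vertices.
constexpr double utilization(std::size_t stride) {
    return 12.0 / double(stride);
}
```

So the interleaved layout wastes half the bandwidth of every cache line during such a pass.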

It's the typical AoS (array of structures) vs. SoA (structure of arrays) question, and it seems that in most cases SoA is preferred for better cache usage.

Yeah, often the same vertices will be used by at least two different vertex shaders during a frame, e.g. a shadow-mapping shader and a forward shading shader. It's likely that some attributes are only used by one of those shaders, so it's beneficial to split those attributes out; otherwise they'll just waste space in the L2 cache.

Across different generations of GPUs, sometimes SoA has been faster and sometimes AoS. IIRC, the current AMD advice is to always use full SoA and never pack two or more attributes together in an interleaved manner.

SoA is going to give better cache usage in cases where only position is needed, such as depth prepass or shadow drawing.

SoA is going to be worse in cases where both position and normal are needed; in these cases interleaved attributes are preferred.

This is something that it's impossible to give general-case "you should always be doing this" advice about. Everybody's program is different. Maybe your program will run better in the first case, maybe it won't. You're really going to have to write code for both, profile and determine which is faster for your program's requirements. Alternatively if it already runs well enough, and if you're satisfied with it from a code cleanliness/simplicity perspective too, why not leave it be?


Even with the AMD advice, I never measured a realistic case where non-interleaved attributes, with all of them in use, performed better or worse than interleaved; at best the difference is below profiling noise. If only position is used in some passes, of course, keep it separate.

On PS3 I once even allowed myself the luxury of duplicating the position stream: a separate one for the depth prepass and shadows, plus a fully interleaved VB (position included) for normal rendering. The best of both worlds :)

I can't speak for AMD, but here is my two cents on the matter

I almost always go for interleaved attributes. From my personal profiling (I've only ever used Nvidia cards; AMD may, and probably will, give different results), interleaved attributes in a single vertex buffer take noticeably less time for large meshes than multiple buffers, one per attribute as described in the OP's post.

I don't know for sure how the hardware handles it behind the scenes, but if you think about it, a single buffer with interleaved attributes allows the input assembler, for each element (e.g. vertex), to index into the interleaved buffer at the start of that element's data and return that chunk of data to the pipeline.

With multiple buffers, it needs to index into each buffer separately to get the element's attribute from it, then combine them and return them to the pipeline.

This doesn't mean interleaved buffers are better though. You might have an attribute that changes frequently, like vertex positions or colors. Instead of updating the entire interleaved buffer, it would be faster to update only the buffer containing the attribute you need to update.

On top of that, as was already mentioned, some shaders may only need the positions of the vertices, while others need more information like normals. You could use the same buffers with different input layouts; with an interleaved buffer your input layout must match that buffer, while separating attributes into different buffers allows you to build different input layouts using only whatever data your shaders are asking for.
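As a sketch of that idea, here are two layouts over the same three buffers; the `Element` struct is a hypothetical stand-in, not a real API type like `D3D11_INPUT_ELEMENT_DESC`:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical input-layout element; NOT a real API struct.
// `slot` says which bound vertex buffer the attribute is fetched from.
struct Element {
    const char* semantic;
    std::size_t slot;   // vertex buffer slot
    std::size_t offset; // byte offset within that buffer's vertex
    std::size_t size;   // attribute size in bytes
};

// Depth/shadow passes: only vb0 (positions) is bound.
const std::vector<Element> depthLayout = {
    {"POSITION", 0, 0, 12},
};

// Full shading pass: the same buffers, a richer layout.
const std::vector<Element> shadingLayout = {
    {"POSITION", 0, 0, 12},
    {"NORMAL",   1, 0, 12},
    {"TEXCOORD", 2, 0, 8},
};
```

The depth pass binds only slot 0 while the shading pass binds all three, with no vertex data duplicated between them.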

I should also mention that I don't have any solid numbers, since I did this profiling quite a while ago (on DX11, and things have surely changed since then), but I found no consistent performance difference for small objects, while there definitely was one for larger objects with >1 million polygons.

Thank you all for your answers!

My own engine uses an interleaved VB.

The average mesh has about 1300 vertices / 4000 indices.

The program I reversed does not render shadows, so normals are always in use there.

After reading all the answers, my first thought was:

  1. Modify content pipeline, and engine’s mesh cache.
  2. Measure performance.
  3. Choose the best approach, remove unused code.

It seems that separating the buffers makes a lot of sense in my case because of shadows.

But after thinking a while, I came up with more fundamental questions about optimizations:

Suppose I want to ship a game in 4-5 years.

I have no idea right now what the next AMD/NVidia/Intel architectures and their performance guidelines will be at that time.

Measuring timings right now and locking in the choice may leave it outdated (for my GTX 660 that's certain).

And I am not talking about this particular case, but for many other cases.

Similar story: a friend of mine remembered a bug in Visual Studio 5 (a very old version).

Seven years later, in a code review, he asked me to insert that workaround in VS2003 because he remembered that "this should be done this way."

Question 1: What is the best way to keep optimizations made in an engine up to date with current hardware and assumptions?

Maintaining two code paths for every optimization, even with #ifdefs, is not sexy enough :)

Question 2: I plan to buy a 7700K + GTX 1070 and make it my reference platform, assuming that in 4-5 years it will be the average gaming PC.

I will make all optimizations for this platform (and probably put the GTX 660 in the second PCIe slot as the low-end video card).

Is this a good idea?

SoA is going to be worse in cases where both position and normal are needed; in these cases interleaved attributes are preferred.

Why would SoA be worse than AoS when all the data is touched? For example, here is an imaginary VS:



    float4 pos = mul(matrixA, input.pos); // part 1
    ...
    // use pos to compute other things
    ...
    float brightness = dot(input.nor, vLight); // part 2

Let's assume that loading the SoA positions alone takes multiple cache lines. While one warp is executing part 1, SoA will issue far fewer cache line reads than AoS, and the same holds for part 2. So even when all attributes are touched, SoA should perform better than AoS. But as always, I may be missing something important, so please correct me if I've got something wrong.
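The arithmetic behind that assumption can be sketched like this (a 32-thread warp, 64-byte cache lines, 12-byte float3 attributes, and perfect alignment are all assumed):

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kWarp = 32; // assumed threads per warp
constexpr std::size_t kLine = 64; // assumed cache-line size in bytes

// Cache lines a warp touches loading one 12-byte attribute per lane,
// with consecutive lanes `stride` bytes apart; buffer assumed line-aligned.
constexpr std::size_t warpLines(std::size_t stride) {
    return ((kWarp - 1) * stride + 12 + kLine - 1) / kLine;
}
```

With these numbers, reading positions costs 6 cache lines per warp from a 12-byte-stride position stream versus 12 from a 24-byte-stride position+normal interleaved stream, matching the intuition in the post; the counterpoint is that the interleaved layout pays those 12 lines once for both attributes.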

Thanks

I almost always go for interleaved attributes. From my personal profiling (I've only ever used Nvidia cards; AMD may, and probably will, give different results), interleaved attributes in a single vertex buffer take noticeably less time for large meshes than multiple buffers, one per attribute as described in the OP's post.

The OP's example is a bit limited though; for a large, graphics-heavy application, it could also look this way:

- Position
- Normals
- Binormals
- Tangents
- UV0-X
- Bone-IDs
- Bone-Weights

Not every mesh would use all of those, but for those that do, the vertex format is quite large. If you do a z-prepass and multiple shadow passes, then reading only the position can be a huge gain. You wouldn't even need every attribute in its own vertex buffer; you could group them by usage: one for position, one for normal+binormal+tangent, one for UVs, one for Bone-IDs/Bone-Weights. Or you could group them even more tightly if you really only have the choice of binding position with or without everything else.

That should give roughly the best of both worlds; shadow passes especially can add up quite a bit, so being able to save ~75% of the bandwidth should outweigh the potentially slower access times when rendering the full mesh (I can't imagine it's THAT much slower).
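With guessed sizes for the attribute list above (12 bytes each for position, normal, binormal, and tangent, 8 for one UV set, 4 packed bone IDs, 4 float bone weights), the position-only saving can be checked:

```cpp
#include <cassert>
#include <cstddef>

// Assumed per-attribute sizes in bytes for the fat format above; guesses.
constexpr std::size_t kPos = 12, kNormal = 12, kBinormal = 12, kTangent = 12,
                      kUv = 8, kBoneIds = 4 /* 4 x uint8 */,
                      kBoneWeights = 16 /* 4 x float */;

constexpr std::size_t kFullVertex =
    kPos + kNormal + kBinormal + kTangent + kUv + kBoneIds + kBoneWeights;

// Fraction of bandwidth saved by a position-only stream vs. the full vertex.
constexpr double kSaving = 1.0 - double(kPos) / double(kFullVertex);
```

That comes out to a 76-byte vertex and roughly an 84% saving for position-only passes, in the same ballpark as the ~75% figure above.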

Why would SoA be worse than AoS when all the data is touched?

It depends on the hardware... If we go back in time to the Xbox 360, it had two instructions to read data out of a vertex buffer. One was the "large fetch", which IIRC read about 16 bytes from memory but didn't decode it. The other was the "small fetch", which IIRC decoded about 4 bytes from one of those 16-byte areas that you'd already retrieved with a large fetch.

So if you had all your attributes interleaved together into a 32-byte struct, you would only require two large fetches, followed by one small fetch per attribute.
Alternatively, if you had every attribute in its own buffer, you needed one large fetch and one small fetch per attribute.

Modern GPUs don't work this way any more though :wink:

I don't know for sure how the hardware handles it behind the scenes, but if you think about it, a single buffer with interleaved attributes allows the input assembler, for each element (e.g. vertex), to index into the interleaved buffer at the start of that element's data and return that chunk of data to the pipeline.

Most GPUs don't have any actual Input Assembler hardware: the driver takes the IA/IL config from D3D (or the VAO config from GL), converts it into shader code, and prepends that code to your vertex shader as a subroutine call.
For each per-vertex attribute, the vertex shader performs one typed load from an SRV.
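Conceptually, the driver-generated prolog behaves like this host-side sketch (the names and the two-attribute layout are illustrative, not actual driver output):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct Float3 { float x, y, z; };

// What the generated prolog hands to the rest of the vertex shader.
struct Fetched { Float3 pos, normal; };

// One typed load per attribute, each from its own buffer (SRV),
// all addressed by the same vertex index.
Fetched fetchVertex(const Float3* posBuf, const Float3* normBuf,
                    std::size_t index) {
    return { posBuf[index], normBuf[index] };
}

// Tiny demo buffers standing in for the SRVs.
const std::array<Float3, 2> kPosDemo  = {{{0, 0, 0}, {1, 1, 1}}};
const std::array<Float3, 2> kNormDemo = {{{0, 1, 0}, {0, 0, 1}}};
```

Since each attribute is an independent load keyed by the same index, separate streams map naturally onto this model; interleaving mainly changes the addressing, not the number of loads.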

This topic is closed to new replies.
