
Vertex buffer(s)


Hi forum!

 

While reverse-engineering a program, I found that it uses separate vertex buffers in combination with an index buffer:

  • vb0 = Positions (always used)
  • vb1 = Normals (always used)
  • vb2 = UV (used only in high quality)
  • Index buffer

 

The only explanation I have is that under some conditions (e.g. when UVs are not needed), they simply don’t bind one of the buffers to the shader.

So, in low quality, GPU memory usage is lower.

 

Questions:

  1. Is there any benefit to separating positions and normals if they are used in all circumstances?
  2. If all buffers are used in a shader, are there any benefits to using several buffers instead of a single one that contains all the data for one vertex?

 

Thanks in advance!

Edited by Happy SDE

Is there any benefit to separating positions and normals if they are used in all circumstances?

I am not sure what exactly 'all circumstances' means (even in a depth prepass, if your engine has one?), but IMO you probably don't need normals during a depth prepass. Having unused normal data loaded into the cache during that pass (a cache-line read fetches a chunk of contiguous addresses) is not a good idea: lower cache hit rate, more bandwidth pressure, etc.

With separate vertex buffers, you can use a position-only buffer during the depth prepass (cache-line utilization is near 100%), and that should be faster than a buffer containing both position and normal.

 

It's a typical AoS (array of structures) vs. SoA (structure of arrays) question, and in most cases SoA seems to be preferred for better cache usage.

Edited by Mr_Fox


Even with the AMD advice, I never measured a realistic case where non-interleaved attributes, when all of them were used, performed better or worse than interleaved ones; at best the difference was within profiling noise. If only position is used in some passes, of course, keep it separate.

 

On PS3 I once even allowed myself the luxury of duplicating the position stream: a separate position-only buffer for the depth prepass and shadows, and a fully interleaved VB (including position) for normal rendering. The best of both worlds :)


I can't speak for AMD, but here are my two cents on the matter.

 

I almost always go for interleaved attributes. From my personal profiling (I've only ever used Nvidia cards; AMD may, and probably will, give different results), interleaved attributes in a single vertex buffer take noticeably less time for large meshes than multiple buffers, one per attribute, as mentioned in the OP's post.

 

I don't know for sure how the hardware handles it behind the scenes, but if you think about it, a single buffer with interleaved attributes lets the input assembler, for each element (e.g. vertex), index into the interleaved buffer at the start of that element's data and return that whole chunk to the pipeline.

 

With multiple buffers, it needs to index into each buffer separately to get the element's attribute from that buffer, then combine the results and return them to the pipeline.

 

This doesn't mean interleaved buffers are always better, though. You might have an attribute that changes frequently, such as vertex positions or colors. Instead of updating the entire interleaved buffer, it is faster to update only the buffer containing the attribute you need to change.

 

On top of that, as was already mentioned, some shaders may only need the positions of the vertices, while others need more information, like normals. With an interleaved buffer, your input layout must match that buffer, while separating attributes into different buffers lets you build different input layouts, each using only the data that its shaders actually ask for.

 

I should also mention that I don't have any solid numbers, since I did this profiling quite a while ago (which means things have surely changed since then; I was using DX11). But I did find that there was no consistent difference in performance for small objects, while there definitely was for larger objects with >1 million polygons.

Edited by iedoc


Thank you all for your answers!

 

My own engine uses an interleaved VB.

The average mesh has about 1,300 vertices / 4,000 indices.

The program I reversed does not render shadows, so normals are always in use there.

 

After reading all the answers, my first thought was:

  1. Modify the content pipeline and the engine’s mesh cache.
  2. Measure performance.
  3. Choose the best approach, remove unused code.

It seems to make a lot of sense for me to separate the buffers, because of shadows.

But after thinking for a while, I came up with more fundamental questions about optimization:

 

Suppose I want to ship a game in 4-5 years.

I have no idea right now what the next AMD/NVidia/Intel architectures will be, or what their performance guidelines will recommend at that time.

Timings measured now, and choices remembered from them, may well be outdated by then (for my GTX 660, that's for sure).

And I am not talking only about this particular case, but about many other cases as well.

 

A similar story: a friend of mine remembered a bug in Visual Studio 5 (a very old version).

Seven years later, in a code review, he asked me to insert that workaround in VS2003, because he remembered that “it should be done this way”.

 

Question 1: What is the best way to keep optimizations made in an engine up to date with current hardware and assumptions?

Maintaining two code paths for every optimization, even with #ifdefs, is not sexy enough :)

 

Question 2: I plan to buy a 7700K + GTX 1070 and make it my reference platform, on the assumption that in 4-5 years it will be an average gaming PC.

I will make all optimizations for this platform (and will probably put the GTX 660 in the second PCIe slot as a low-end video card).

Is this a good idea?

Edited by Happy SDE


SoA is going to be worse in cases where both position and normal are needed; in these cases interleaved attributes are preferred.

Why would SoA be worse than AoS when all the data is touched? For example, here is an imaginary VS:


    float4 pos = mul(matrixA, input.pos); // part 1
    ...
    // use pos to compute other things
    ...
    float brightness = dot(input.nor, vLight); // part 2

Let's assume that loading positions alone under SoA takes multiple cache lines. While a warp is executing part 1, SoA will issue far fewer cache-line reads than AoS would, and the same holds for part 2. So even when all attributes are touched, SoA should perform better than AoS. But as always, I may be missing something important, so please correct me if I have something wrong.

Thanks


I almost always go for interleaved attributes. From my personal profiling (I've only ever used Nvidia cards; AMD may, and probably will, give different results), interleaved attributes in a single vertex buffer take noticeably less time for large meshes than multiple buffers, one per attribute, as mentioned in the OP's post.

 

The OP's example is a bit limited, though; for a large, graphics-heavy application, it could also look like this:

 

- Position

- Normals

- Binormals

- Tangents

- UV0-X

- Bone-IDs

- Bone-Weights

 

Not every mesh would use all of those, but for those that do, the vertex format is quite large. If you do a z-prepass and multiple shadow passes, then reading only the position can be a huge gain. You wouldn't even need every attribute in its own vertex buffer; you could group them by usage: one for position, one for normal+binormal+tangent, one for UVs, one for bone IDs/bone weights. Or you could group them even more tightly if you really only need the choice of binding position +- everything else.

That should probably be the best of both worlds. Shadow passes in particular can add up quite a bit, so being able to save ~75% of the bandwidth should outweigh the potentially slower access times when rendering the full mesh (I can't imagine it's THAT much slower).
