One Buffer Vs. Multiple Buffers...

9 comments, last by Promit 9 years, 7 months ago

Hello,

Consider this a consultation of opinion question:

Is it better to use one interleaved buffer or multiple buffers for vertex/normal/UV data?

I was reading this question and response here: http://stackoverflow.com/questions/12245687/does-using-one-buffer-for-vertices-uvs-and-normals-in-opengl-perform-better-tha

I'm just trying to see which is better; I have implemented both, but my framerate was high enough in both cases that I didn't notice a big difference.

Thank you for your time.


Is it better to use one interleaved buffer or multiple buffers for vertex/normal/UV data?

Yes to interleaved buffer - the AoS layout is preferable.

Think about what is actually happening internally. You've got a warp of 32 elements being processed. Each one has a vertex input structure, which is going to be a single block of memory containing all of the vertex attributes. What the hardware wants to do is load the entire vertex into the registers of the processing cores. Most likely it's capable of doing this for an entire warp at once. Interleaved attributes allow it to do a single block memory copy to set up all of the vertices to run a warp.

When I was working at NV (2006), it was often the case that non-interleaved streams would be soft interleaved by the driver as part of draw call setup before being rendered. I'm not sure if that's still required on modern hardware or not. But it's best to assume that the underlying hardware is in most cases only able to work with interleaved data.
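To make the two layouts concrete, here is a minimal CPU-side sketch. The `Vertex` format, the `VertexStreams` struct and the `interleave` helper are hypothetical names invented for illustration; the function only shows what repacking separate streams into one interleaved array amounts to, and is not a description of what any driver actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical interleaved (AoS) vertex format, chosen only for illustration.
struct Vertex {
    float position[3];
    float normal[3];
    float uv[2];
};

// With only float members there is no padding, so the stride is 8 floats.
static_assert(sizeof(Vertex) == 8 * sizeof(float), "unexpected padding in Vertex");

// Separate (non-interleaved, SoA) streams: one tightly packed array per attribute.
struct VertexStreams {
    std::vector<float> positions;  // 3 floats per vertex
    std::vector<float> normals;    // 3 floats per vertex
    std::vector<float> uvs;        // 2 floats per vertex
};

// Repack the separate streams into a single interleaved array.
std::vector<Vertex> interleave(const VertexStreams& s) {
    const std::size_t count = s.positions.size() / 3;
    assert(s.positions.size() == count * 3);
    assert(s.normals.size() == count * 3);
    assert(s.uvs.size() == count * 2);

    std::vector<Vertex> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        for (std::size_t c = 0; c < 3; ++c) {
            out[i].position[c] = s.positions[3 * i + c];
            out[i].normal[c] = s.normals[3 * i + c];
        }
        for (std::size_t c = 0; c < 2; ++c) {
            out[i].uv[c] = s.uvs[2 * i + c];
        }
    }
    return out;
}
```

With this layout, the stride you would hand to the graphics API is `sizeof(Vertex)` and the per-attribute offsets are `offsetof(Vertex, position)`, `offsetof(Vertex, normal)` and `offsetof(Vertex, uv)`.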

Current official advice is to de-interleave only when the update frequencies of the buffer are different.
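A rough sketch of what splitting by update frequency could look like, assuming (purely for illustration) that positions and normals are rewritten on the CPU every frame while UVs never change. All type and function names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Attributes assumed to change every frame (hypothetical split).
struct AnimatedAttribs {
    float position[3];
    float normal[3];
};

// Attributes assumed to be uploaded once and never touched again.
struct StaticAttribs {
    float uv[2];
};

struct SplitMesh {
    std::vector<AnimatedAttribs> animated;  // would live in a dynamic buffer
    std::vector<StaticAttribs> fixed;       // would live in a static buffer
};

// Moves every vertex along x and returns how many bytes would have to be
// re-uploaded afterwards: only the animated stream, not the UVs.
std::size_t translateX(SplitMesh& mesh, float dx) {
    assert(mesh.animated.size() == mesh.fixed.size());
    for (AnimatedAttribs& a : mesh.animated) {
        a.position[0] += dx;
    }
    return mesh.animated.size() * sizeof(AnimatedAttribs);
}
```

With these assumed formats, each update re-sends 24 bytes per vertex instead of the 32 bytes a fully interleaved vertex would need. Whether that saving matters in practice is something only profiling can tell.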

This advice may be further distorted by situations that involve a lot of data transfer, meaning dynamic/streaming buffers. See L.Spiro's comments here:

http://lspiroengine.com/?p=96

Personally I've never seen a reason to worry about this particular bandwidth issue but he probably has more 'in the field' experience than I do on PC platforms and may be able to expand on those comments.

SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.
Excellent information, thank you.

Very helpful. Other replies always welcome.

Making separate buffers for attributes can be wise if it lets you exchange X vertex buffer changes of batched interleaved buffers per frame for a single binding of multiple buffers over a group of frames (or over the entire application). Yet, in my eyes, the driver either delivers the vertex attributes in a sad, cache-unfriendly way, or ends up interleaving the separate buffers into one interleaved buffer anyway, which costs about the same as rebinding 6 batched interleaved vertex buffers per frame. Who knows.


Making separate buffers for attributes can be wise if it lets you exchange X vertex buffer changes of batched interleaved buffers per frame for a single binding of multiple buffers over a group of frames

In this situation, it's better to build all the interleaved variations manually before uploading them to the driver, then just pick the most appropriate vertex buffer. I have seen engines that know what shaders are used for what mesh and create the correct custom buffer for that situation.
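A minimal sketch of building such per-shader variants on the CPU before upload. The attribute mask, `SourceMesh` and `buildVariant` are hypothetical names, and the engine-side bookkeeping of which shader needs which attributes is assumed to exist elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical attribute flags describing what a shader consumes.
enum AttributeBits : std::uint32_t {
    kPosition = 1u << 0,
    kNormal = 1u << 1,
    kUv = 1u << 2,
};

// Source data kept as separate streams on the CPU side.
struct SourceMesh {
    std::vector<float> positions;  // 3 floats per vertex
    std::vector<float> normals;    // 3 floats per vertex
    std::vector<float> uvs;        // 2 floats per vertex
};

// Number of floats one vertex occupies for a given attribute mask.
std::size_t floatsPerVertex(std::uint32_t mask) {
    std::size_t n = 0;
    if (mask & kPosition) n += 3;
    if (mask & kNormal) n += 3;
    if (mask & kUv) n += 2;
    return n;
}

// Builds one interleaved buffer containing only the requested attributes.
std::vector<float> buildVariant(const SourceMesh& mesh, std::uint32_t mask) {
    const std::size_t count = mesh.positions.size() / 3;
    assert(mesh.normals.size() == count * 3);
    assert(mesh.uvs.size() == count * 2);

    std::vector<float> out;
    out.reserve(count * floatsPerVertex(mask));
    for (std::size_t i = 0; i < count; ++i) {
        if (mask & kPosition) {
            const float* p = mesh.positions.data() + 3 * i;
            out.insert(out.end(), p, p + 3);
        }
        if (mask & kNormal) {
            const float* n = mesh.normals.data() + 3 * i;
            out.insert(out.end(), n, n + 3);
        }
        if (mask & kUv) {
            const float* t = mesh.uvs.data() + 2 * i;
            out.insert(out.end(), t, t + 2);
        }
    }
    return out;
}
```

A depth-only pass could then request `kPosition` alone, while a textured unlit pass could request `kPosition | kUv`, each getting its own tightly packed buffer.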

Always seemed like a hassle to me though.


This is very interesting, because I've got very different results (GTX 670 and 480).

My use case is a compute shader tree traversal, and the nodes have vec4 data for position, color, direction and integer data packed in uvec4 (tree indices etc.)

At first I used a single shader storage buffer laid out the AoS way. That was too slow, so I tried putting each vector in its own texture, i.e. SoA.

I can't remember exactly, but I think the speedup was 10-30x.

Does this make sense? I assumed SoA is faster because multiple texture units can be used to fetch the data.

The other factor is that I do not need to read all the data for every node I visit, which differs from the fixed vertex pipeline example above.

I do not need to read the node direction if the position is already too far away, etc., and with the AoS method I read the full struct in every case before doing any test.

But I doubt this alone explains the huge speedup.
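The early-out argument can be put into numbers with a small CPU-side model. This is only a byte-counting sketch under stated assumptions: each node is three vec4 fields plus one uvec4 (64 bytes), the rejection test is a hypothetical distance check against the origin, and caching, coalescing and texture hardware are ignored entirely.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec4 = std::array<float, 4>;

// Sizes follow the node description: position, color, direction (vec4 each)
// plus one uvec4 of packed integer data.
constexpr std::size_t kVec4Bytes = 16;
constexpr std::size_t kNodeBytes = 4 * kVec4Bytes;  // full AoS node

struct BytesRead {
    std::size_t aos = 0;  // always fetches the whole node
    std::size_t soa = 0;  // fetches position first, the rest only if needed
};

// Counts how many bytes each layout would touch for a list of visited nodes.
BytesRead countBytesRead(const std::vector<Vec4>& positions, float maxDistance) {
    assert(maxDistance >= 0.0f);
    BytesRead total;
    for (const Vec4& p : positions) {
        const float distSq = p[0] * p[0] + p[1] * p[1] + p[2] * p[2];
        const bool inRange = distSq <= maxDistance * maxDistance;

        total.aos += kNodeBytes;  // whole struct read before any test
        total.soa += kVec4Bytes;  // position only
        if (inRange) {
            total.soa += kNodeBytes - kVec4Bytes;  // remaining three fields
        }
    }
    return total;
}
```

If only one node in four passes the test, the model gives 256 bytes for AoS against 112 for SoA, a bit more than 2x. That supports the suspicion that skipped reads alone would not account for a 10-30x difference.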

Please let me know what you think. I'm new to GPU programming and it's still hard to predict performance.

And there are crazy things happening; for example, sometimes it's faster to reserve shared memory without using it.

Sounds stupid, but it's true, especially for simple shaders. Someone discovered this before and posted about it on the NV forum, but got no official response.

I assume the reason is that reserving shared memory prevents the thread scheduler from doing too much task switching.

I don't know if this happens with other APIs too (CUDA, CL, DX).

Other crazy things are:

It's faster to do a blur on the tree with all 4 children, 4 neighbours and parent, than to do a simple color averaging from children to parent on the same tree. ???

It's >2x faster to do a stackless but very divergent tree traversal with one thread per node than to do a perfectly data / code / runtime coherent parallel traversal using a stack (the stack is too big for shared memory).

I really have the feeling that drivers are not well polished for compute shader performance, but it's a little too much work to port to CUDA to see the difference...

Any thoughts welcome :)

Please let me know what you think. I'm new to GPU programming and it's still hard to predict performance.

Experts can't predict performance either. We're just driven by common guidelines AND THEN WE TRY OUR OPTIONS AND PROFILE.

Computing (in general, i.e. not just GPUs, also CPUs, and RAM, and caches, etc) has become so complex it's virtually impossible to accurately predict what approach is going to be faster (although we can make educated guesses).
For example, I've seen a case where adding extra instructions to a CPU routine in a tight loop caused the loop to execute faster on Haswell chips.

The reason had to do with a store-to-load forwarding stall, where adding an instruction allowed the CPU to avoid a full pipeline stall on every iteration.

It is anecdotal and was a rather synthetic benchmark (not real-world code), but the point is that it is completely unintuitive that adding an instruction would make the code run faster, which is a great example of how modern architectures are so complex that we can't grasp it all.

Stop asking, just try, profile, and share the results.
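For the "try and profile" part, a minimal CPU-side timing helper might look like the sketch below. `averageMilliseconds` is a hypothetical name, and this only measures wall-clock time on the CPU: GPU work is submitted asynchronously, so timing draw calls or dispatches this way needs an explicit synchronization point or the graphics API's own timer queries.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Runs `work` the given number of times and returns the average
// wall-clock time per call in milliseconds.
template <typename Fn>
double averageMilliseconds(Fn&& work, std::size_t iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        work();
    }
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = end - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Run each variant (for example an interleaved and a non-interleaved code path) through the same helper, with enough iterations to smooth out noise, and compare the averages rather than single runs.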

Stop asking, just try, profile, and share the results.


I agree; in this case I did try both and the results for me were initially the same BUT I keep reading everywhere that interleaved is better so for now I will work with interleaved buffers.

And there are crazy things happening; for example, sometimes it's faster to reserve shared memory without using it.

This is actually not that uncommon. The problem is that each SMX has only a limited amount of register space (64K 32-bit registers), which gets divided among however many threads are resident at the same time. So if you are running 1024 threads per SMX, every thread can use up to 64 registers. If you are running the maximum of 2048 threads, every thread only gets to use 32 registers. If more local variables are needed than registers are available, some of them are spilled to memory, similar to how a CPU spills onto the stack. But unlike on the CPU, memory latencies on the GPU are incredibly high, so spilling values that are needed often can increase the runtime considerably.

Now, shared memory is also a restricted resource (64KB on Kepler per SMX), but one that can't be spilled. So, if every block needs less than 2KB, you can get the maximum of 32 resident blocks per SMX. But if you increase the amount of reserved shared memory, let's say to just below 4KB, then you can only have 16 resident blocks. Halving the number of resident blocks also halves the total number of resident threads, so each thread has twice as many registers at its disposal.

So, increasing the amount of reserved shared memory can decrease the number of resident blocks/threads, which increases the number of registers each thread can use, which can reduce register spilling and the costly loads that come with it. I don't know about compute shaders, but for CUDA I believe the profiler can check for this.
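The arithmetic above can be written down as a small estimator. This is a simplified sketch: the default limits in `SmLimits` are just the figures quoted in this post, real limits differ between architectures, and allocation granularity and per-thread register caps are ignored. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Per-multiprocessor limits. Defaults are the numbers used in the post above;
// treat them as assumptions and look up the real values for your hardware.
struct SmLimits {
    std::uint32_t registers = 65536;    // 32-bit registers
    std::uint32_t sharedBytes = 65536;  // shared memory in bytes
    std::uint32_t maxBlocks = 32;       // resident blocks
    std::uint32_t maxThreads = 2048;    // resident threads
};

struct Occupancy {
    std::uint32_t blocks;
    std::uint32_t threads;
    std::uint32_t registersPerThread;
};

// Estimates resident blocks/threads and the register budget per thread.
Occupancy estimate(const SmLimits& sm, std::uint32_t threadsPerBlock,
                   std::uint32_t sharedBytesPerBlock) {
    assert(threadsPerBlock > 0);
    std::uint32_t blocks = std::min(sm.maxBlocks, sm.maxThreads / threadsPerBlock);
    if (sharedBytesPerBlock > 0) {
        // Shared memory cannot be spilled, so it caps the resident block count.
        blocks = std::min(blocks, sm.sharedBytes / sharedBytesPerBlock);
    }
    const std::uint32_t threads = blocks * threadsPerBlock;
    const std::uint32_t regs = threads > 0 ? sm.registers / threads : 0;
    return Occupancy{blocks, threads, regs};
}
```

With 64 threads per block, reserving 2048 bytes per block gives 32 blocks, 2048 threads and 32 registers per thread, while reserving 4000 bytes gives 16 blocks, 1024 threads and 64 registers per thread.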

This helps a lot (I copied this post into my source to read again later).

It also explains why performance is a matter of the number of threads vs. the memory needed, which is what I've come to recognize over the last few months.

There is one thing that i've got very wrong all the time: I thought a single SMX can only run 32 threads, and if you want more they would be spread across multiple SMX units.

I really understand much better now :)

Many thanks!
