# OpenGL One Buffer Vs. Multiple Buffers...

## Recommended Posts

Hello,

Consider this a consultation of opinion question:

Is it better to use one interleaved buffer or multiple separate buffers for vertex/normal/UV data?

Was reading this question and response here: http://stackoverflow.com/questions/12245687/does-using-one-buffer-for-vertices-uvs-and-normals-in-opengl-perform-better-tha

Just seeing what is better; I have implemented both, but my framerate was high enough in both cases that I didn't notice a big difference.

##### Share on other sites

Is it better to use one interleaved buffer or multiple separate buffers for vertex/normal/UV data?

Yes to interleaved buffer - the AoS layout is preferable.

Think about what is actually happening internally. You've got a warp of 32 elements being processed. Each one has a vertex input structure, which is going to be a single block of memory containing all of the vertex attributes. What the hardware wants to do is load the entire vertex into the registers of the processing cores. Most likely it's capable of doing this for an entire warp at once. Interleaved attributes allow it to do a single block memory copy to set up all of the vertices to run a warp.

When I was working at NV (2006), it was often the case that non-interleaved streams would be soft interleaved by the driver as part of draw call setup before being rendered. I'm not sure if that's still required on modern hardware or not. But it's best to assume that the underlying hardware is in most cases only able to work with interleaved data.

Current official advice is to de-interleave only when the update frequencies of the buffer are different.

This advice may be further distorted by situations involving lots of data transfer, i.e. dynamic/streaming buffers. See L. Spiro's comments here:

http://lspiroengine.com/?p=96

Personally I've never seen a reason to worry about this particular bandwidth issue but he probably has more 'in the field' experience than I do on PC platforms and may be able to expand on those comments.

Edited by Promit

##### Share on other sites
Excellent information, thank you.

Very helpful. Other replies always welcome.

##### Share on other sites

Making separate buffers for attributes can be wise when it lets you trade X vertex-buffer changes of batched interleaved buffers per frame for a single multi-buffer binding over a group of frames (or the whole application). Yet, in my eyes, the driver either does a sad, cache-coherency-unfriendly delivery of the vertex attributes, or interleaves the separate buffers into one interleaved buffer in the end, resulting in the same logic as rebinding 6 batched interleaved vertex buffers per frame. Who knows.

##### Share on other sites

Making separate buffers for attributes can be wise when it lets you trade X vertex-buffer changes of batched interleaved buffers per frame for a single multi-buffer binding over a group of frames

In this situation, it's better to build all the interleaved variations manually before uploading them to the driver, then just pick the most appropriate vertex buffer. I have seen engines that know what shaders are used for what mesh and create the correct custom buffer for that situation.

Always seemed like a hassle to me though.

##### Share on other sites

This is very interesting, because I've got very different results (GTX 670 and 480).

My use case is a compute-shader tree traversal; the nodes have vec4 data for position, color, and direction, plus integer data packed in a uvec4 (tree indices etc.).

First I used a single shader storage buffer the AoS way. That was too slow, so I tried putting each vec4 in its own texture, i.e. SoA.

I can't remember exactly, but the speed-up was 10-30 times, I think.

Does this make sense? I assumed SoA is faster because multiple texture units can be used to grab the data.

The other factor is that I do not need to read all the data for every node I visit, which is a difference from the fixed vertex-pipeline example above.

I do not need to read the node direction if the position is already too far away, etc., and with the AoS method I read the full struct in every case before any test.

But I doubt this alone explains the huge speed-up.

Please let me know what you think; I'm new to GPUs and it's still hard to predict performance.

And there are crazy things happening, e.g. sometimes it's faster to reserve shared memory without using it.

Sounds stupid, but it's true, especially for simple shaders. Someone discovered this before and posted on the NV forum, but there was no official response.

I assume the reason is that reserving shared memory prevents the thread scheduler from doing too much task switching.

I don't know if this happens in other APIs too (CUDA, OpenCL, DirectX).

Other crazy things are:

It's faster to do a blur on the tree using all 4 children, 4 neighbours, and the parent than to do a simple color averaging from children to parent on the same tree. ???

It's >2x faster to do a stackless but very divergent tree traversal with one thread per node than to do a perfectly coherent (data/code/runtime) parallel traversal using a stack (the stack is too big for shared memory).

I really have the feeling that drivers are not well polished for compute-shader performance, but it's a little too much work to port to CUDA to see the difference...

Any thoughts welcome :)

##### Share on other sites

Please let me know what you think; I'm new to GPUs and it's still hard to predict performance.

Experts can't predict performance either. We're just driven by common guidelines AND THEN WE TRY OUR OPTIONS AND PROFILE.

Computing in general (not just GPUs; also CPUs, RAM, caches, etc.) has become so complex that it's virtually impossible to accurately predict which approach is going to be faster (although we can make educated guesses).
For example, I've seen a case where adding extra instructions to a tight CPU loop caused the loop to execute faster on Haswell chips.

The reason was a store-to-load forwarding stall: adding an instruction allowed the CPU to avoid a full pipeline stall on every iteration.

It is anecdotal and was a rather synthetic benchmark (not real-world code), but the point is that it is completely unintuitive that adding an instruction would help the code run faster, which is a great example of how modern architectures are so complex we can't grasp it all.

Stop asking, just try, profile, and share the results.

##### Share on other sites

Stop asking, just try, profile, and share the results.

I agree; in this case I did try both and the results were initially the same for me, BUT I keep reading everywhere that interleaved is better, so for now I will work with interleaved buffers.

##### Share on other sites

And there are crazy things happening, f. ex. sometimes it's faster to reserve shared memory without using it.

This is actually not that uncommon. The problem is that the cores have only a limited amount of register space (64 K registers per SMX), which gets divided up among however many threads are running in parallel. So if you are running 1024 threads per SMX, every thread can use up to 64 registers. If you are running the maximum of 2048 threads, every thread only gets 32 registers. If more local variables are needed than registers are available, some registers are spilled onto the stack, similarly to how it's done on the CPU. But contrary to the CPU, GPU memory latencies are incredibly high, so spilling often-needed values onto the stack can increase the runtime.

Now, shared memory is also a restricted resource (64 KB per SMX on Kepler), but one that can't be spilled. So, if every block needs less than 2 KB, you can get the maximum of 32 resident blocks per SMX. But if you increase the amount of reserved shared memory, let's say to just below 4 KB, then you can only have 16 resident blocks. Halving the number of resident blocks also halves the total number of resident threads, so each thread has twice as many registers at its disposal.

So, increasing the amount of reserved shared memory can decrease the number of resident blocks/threads, which increases the number of registers each thread can use, which can reduce register spilling and costly loads from the stack. I don't know about compute shaders, but for CUDA I believe the profiler can check for this.

##### Share on other sites

This helps a lot (i copied this post to my source to read again later).

It also explains why performance is a matter of number of threads vs. needed memory, which is what I've recognized over the last months.

There is one thing I got very wrong all this time: I thought a single SMX could only run 32 threads, and that if you wanted more they would be spread across multiple SMX units.

I really understand much better now :)

Many thanks!

##### Share on other sites

Experts can't predict performance either. We're just driven by common guidelines AND THEN WE TRY OUR OPTIONS AND PROFILE.

But let's be fair - most of us can't afford test labs with multiple generations of GPUs, multiple versions of drivers, and so forth. Never mind the time consumed in testing the different permutations. A different driver (or version) has the potential to randomly upend the whole thing, and it's not like NV or AMD are about to help us out with real application profiles. So most of us are stuck with the handful of hardware/software configurations at hand, making educated guesses about the rest. I'm probably better off than most here - for my part I have access to a 6770, a 7970, a GTX 480, a GTX 670, a Titan, and a couple of MacBooks' worth of mobile GPUs. If this were strictly hobby work, I'd probably be lucky to have even one sample each of NV and AMD available.
