Instancing, and the various ways to supply per-instance data

Started by
3 comments, last by TheChubu 8 years, 4 months ago

From the reading I've been doing, it seems like there are a few different ways to supply per-instance data while using instancing in OpenGL. I've tried a couple of these. Here is a rundown, as I understand it. With each example, I'll use the mvp matrix (modelViewProjection) as the per-instance item. I'm hoping that you can help correct any errors in my understanding.

Array Uniforms w/ gl_InstanceId

Example:


layout(location = 0) in vec4 pos;
uniform mat4 mvp[1024];

...

void main() {
    gl_Position = mvp*pos;
}

With this method, you're just declaring an array of mat4 as a uniform, and you're using gl_InstanceId to index that array. The main advantage of this method is that it's easy, because it's hardly different than the normal way of using uniforms. However, each element in the array is given its own separate uniform location, and uniform locations are in limited supply (as few as 1024).

Vertex Attributes with Divisor=1

OpenGL example:


#define MVP_INDEX 2

...

glBindBuffer(GL_ARRAY_BUFFER, mvpBuffer);
for (int i = 0; i < 4; ++i) {
    GLuint index = MVP_INDEX+i;    
    glEnableVertexAttribArray(index);
    glVertexAttribPointer(index, 4, GL_FLOAT, GL_FALSE, sizeof(GLfloat)*16, (GLvoid*)(sizeof(GLfloat)* i * 4));
    glVertexAttribDivisor(index, 1);
}
glBindBuffer(GL_ARRAY_BUFFER, 0);

GLSL example:


layout(location = 0) in vec4 pos;
layout(location = 2) in mat4 mvp;

...

void main() {
    gl_Position = mvp*pos;
}

With this method, the mvp matrix just looks like a vertex attribute from the GLSL side of things. However, since a divisor of 1 was specified on the OpenGL side, there is only one matrix stored per instance, rather than one per vertex. This allows very clean access to a large number of matrices (as many as a buffer object can hold). You also get all of the advantages that other buffer objects have, such as streaming using orphaning or mapping strategies. However, each matrix uses four vertex attrib locations. There may be as few as 16 total vertex attrib locations available. If you plan on using a shader that requires multiple sets of UV coordinates, blend weights, etc., then you may not have enough vertex attrib locations to use this method.

So, I'm trying to find a method that will allow thousands of instances without using up precious vertex attrib locations. I am hoping that Uniform Buffer Objects or SSBOs will come to the rescue. I haven't yet attempted to use them for this purpose, nor have I found many examples of people online using them for this purpose. Maybe there is a reason for that. smile.png. So here's my current understanding of how it works. I would be much obliged if someone could read it over, and tell me where I'm wrong.

Uniform Buffer Objects

OpenGL example:


GLuint mvpBuffer;
// GenBuffers, BufferData, etc.
glBindBufferBase(GL_UNIFORM_BUFFER, 0, mvpBuffer);
GLuint uniformBlockIndex = glGetUniformBlockIndex(myProgram, "mvpBlock");
glUniformBlockBinding(myProgram, uniformBlockIndex, 0);

GLSL example:


layout(row_major) uniform MVP {
    mat4 mvp;
} mvps[1024];

void main() {
    gl_Position = mvps[gl_InstanceId]*pos;
}

It seems like this could alleviate restrictions with attrib locations. However, you are limited by GL_MAX_UNIFORM_BLOCK_SIZE, which I believe includes each instance in the instance array. This can be as low as 64kB, which in our case would only allow for 1024 instances, which is no better than the first method.

Shader Storage Buffer Objects

This method would be essentially identical to the Uniform Buffer method, except the interface block type is buffer and you can use a lot more memory. You can also write to the SSBO from within the shader, but that is not necessary for this application. On the down side, the Wiki says that this method is slower than Uniform buffers. Again, I haven't tested this myself, so I may be mistaken about how this works.

Advertisement
The Vertex Attribute method is the original, from GL2/D3D9 era hardware, so it's recommended if you want something that works everywhere.

Uniform variables and UBO's are themselves a hint to the driver that every instantiation of the shader (i.e. every pixel/vertex) will read every single byte of the uniform data. That hint is not true for instancing, so I wouldn't immediately choose it.

In between UBOs and SSBOs, there's another option -- storing the data in a texture or a VBO, and fetching from it be converting the instanceID into a texture coordinate.

In D3D10/11, buffers can be bound to shaders as "shader resource views". In GL3, it's pretty awkward -- you have to create a "buffer texture" (or is it a "texture buffer"?), which uses the storage from a VBO but lets you bind it to a texture slot...

In D3D11/GL4, you *can* use SSBO's/UAV's, but these are overkill unless the shader requires read-write access to the buffer.


The Vertex Attribute method is the original, from GL2/D3D9 era hardware, so it's recommended if you want something that works everywhere.

Thanks! In practice, do you find the limit on vertex attrib locations to be a pain? At mimumum, it seems you'll want to pass in the world matrix, which will use four. From there, I suppose the wv and wvp matrices can be calculated in the shader, if needed, but it seems wasteful to calculate per-instance matrices on a per-vertex basis. So, suppose you pass those matrices in as vertex attributes. That would mean that 12 out of 16 vertex attrib locations are used just for transformation matrices. Positions, normals, and texcoords take up three more. That leaves only one more. If you want to do skinned animation, it seems like you'd have to fall back on calculating the additional per-instance matrices in the vertex shader.

For skinning, you need a large array of matrices per instance, so it just won't work with the traditional method -- put them into a buffer (texture-buffer/SRV).

You can make the W matrix per-instance, and the VP matrix uniform. You then just have one extra matrix mul per vertex (pos*W*VP instead of pos*WVP). If you need the world position for shading, then it's not even an extra cost.

It seems like this could alleviate restrictions with attrib locations. However, you are limited by GL_MAX_UNIFORM_BLOCK_SIZE

Yes you are... but not totally.

Your limit is specifically for the amount of memory it can be bound at that particular UBO slot. AFAIK 64kB for everything except AMD these days. The trick is that the buffer itself can be bigger, much bigger.

Taking in account that the expensive part here often is uploading the data (many tiny uploads == bad), you can allocate something like 1Mb, upload all your data there with a couple calls at most (ideally only one), then just call glBindBufferRange in-between drawcalls.

Not sure how AMD handles glBindBufferRange, but nVidia, in their presentations, said that glBindBufferRange its a super cheap call to make, and since GCN seems to have the upper hand on memory management, probably its cheap to do with AMD too. Adding to that, since you already pre-uploaded all (or most) of your data to the big buffer, the state changes in between the draw calls should be minimal.

EDIT: The memory you can bind to an UBO slot can be as low as 16kB actually. 64kB is a common limit, but not the one that the spec defines as minimum. I recall one Intel forum post that asked why in D3D11 you can have 64kB for a constant buffer but in OpenGL you only had 16kB, only then the Intel driver team incremented the limit to 32kB. That was for Windows drivers, Linux Intel drivers apparently tend to have better support.

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

My journals: dustArtemis ECS framework and Making a Terrain Generator

This topic is closed to new replies.

Advertisement