Jump to content

  • Log In with Google      Sign In   
  • Create Account


Member Since 07 Mar 2007
Offline Last Active Feb 25 2016 04:52 PM

Topics I've Started

Instancing, and the various ways to supply per-instance data

29 November 2015 - 12:08 AM

From the reading I've been doing, it seems like there are a few different ways to supply per-instance data while using instancing in OpenGL. I've tried a couple of these. Here is a rundown, as I understand it. With each example, I'll use the mvp matrix (modelViewProjection) as the per-instance item. I'm hoping that you can help correct any errors in my understanding.


Array Uniforms w/ gl_InstanceId



layout(location = 0) in vec4 pos;
uniform mat4 mvp[1024];


void main() {
    gl_Position = mvp*pos;

With this method, you're just declaring an array of mat4 as a uniform, and you're using gl_InstanceId to index that array. The main advantage of this method is that it's easy, because it's hardly different than the normal way of using uniforms. However, each element in the array is given its own separate uniform location, and uniform locations are in limited supply (as few as 1024).


Vertex Attributes with Divisor=1


OpenGL example:

#define MVP_INDEX 2


glBindBuffer(GL_ARRAY_BUFFER, mvpBuffer);
for (int i = 0; i < 4; ++i) {
    GLuint index = MVP_INDEX+i;    
    glVertexAttribPointer(index, 4, GL_FLOAT, GL_FALSE, sizeof(GLfloat)*16, (GLvoid*)(sizeof(GLfloat)* i * 4));
    glVertexAttribDivisor(index, 1);
glBindBuffer(GL_ARRAY_BUFFER, 0);

GLSL example:

layout(location = 0) in vec4 pos;
layout(location = 2) in mat4 mvp;


void main() {
    gl_Position = mvp*pos;

With this method, the mvp matrix just looks like a vertex attribute from the GLSL side of things. However, since a divisor of 1 was specified on the OpenGL side, there is only one matrix stored per instance, rather than one per vertex. This allows very clean access to a large number of matrices (as many as a buffer object can hold). You also get all of the advantages that other buffer objects have, such as streaming using orphaning or mapping strategies. However, each matrix uses four vertex attrib locations. There may be as few as 16 total vertex attrib locations available. If you plan on using a shader that requires multiple sets of UV coordinates, blend weights, etc., then you may not have enough vertex attrib locations to use this method.


So, I'm trying to find a method that will allow thousands of instances without using up precious vertex attrib locations. I am hoping that Uniform Buffer Objects or SSBOs will come to the rescue. I haven't yet attempted to use them for this purpose, nor have I found many examples of people online using them for this purpose. Maybe there is a reason for that. smile.png. So here's my current understanding of how it works. I would be much obliged if someone could read it over, and tell me where I'm wrong.


Uniform Buffer Objects


OpenGL example:

GLuint mvpBuffer;
// GenBuffers, BufferData, etc.
glBindBufferBase(GL_UNIFORM_BUFFER, 0, mvpBuffer);
GLuint uniformBlockIndex = glGetUniformBlockIndex(myProgram, "mvpBlock");
glUniformBlockBinding(myProgram, uniformBlockIndex, 0);

GLSL example:

layout(row_major) uniform MVP {
    mat4 mvp;
} mvps[1024];

void main() {
    gl_Position = mvps[gl_InstanceId]*pos;

It seems like this could alleviate restrictions with attrib locations. However, you are limited by GL_MAX_UNIFORM_BLOCK_SIZE, which I believe includes each instance in the instance array. This can be as low as 64kB, which in our case would only allow for 1024 instances, which is no better than the first method.


Shader Storage Buffer Objects


This method would be essentially identical to the Uniform Buffer method, except the interface block type is buffer and you can use a lot more memory. You can also write to the SSBO from within the shader, but that is not necessary for this application. On the down side, the Wiki says that this method is slower than Uniform buffers. Again, I haven't tested this myself, so I may be mistaken about how this works.





I'm having trouble making sense of these performance numbers (OpenGL)

13 October 2015 - 10:33 PM

Greetings. This is one of those dreaded "shouldn't it be faster?" type questions, but I'm hoping someone can help me, because I am truly baffled.


I'm trying to explore instancing a bit. To that end, I created a demo that has 50,000 randomly-positioned cubes. It's running full-screen, at the native resolution of my monitor. Vsync is forced off through the NVidia control panel. No anti-aliasing. I'm also not doing any frustum culling, but I am doing back-face culling. Here is a screenshot:




The shaders are very simple. All they do is calculate some basic flat shading:

#version 430

layout(location = 0) in vec4 pos;
layout(location = 1) in vec3 norm;

uniform mat4 mv;
uniform mat4 mvp;

out vec3 varNorm;
out vec3 varLightDir;

void main() {
	gl_Position = mvp*pos;
	varNorm = (mv*vec4(norm,0)).xyz;
	varLightDir = (mv*vec4(1.5,2.0,1.0,0)).xyz;
#version 430

in vec3 varNorm;
in vec3 varLightDir;
out vec4 fragColor;

void main() {
	vec3 normal = normalize(varNorm);
	vec3 lightDir = normalize(varLightDir);
	float lambert = dot(normal,lightDir);
	fragColor = vec4(lambert,lambert,lambert,1);

I know I have a little bit of cruft in there (hard-coded light passed as a varying), but the shaders are not very complicated.


I eventually wrote three versions of the program:

  1. One that draws each cube individually with DrawArrays (no indexing)
  2. One that draws each cube individually with DrawElements (indexed, with 24 unique verts instead of 36, no vertex cache optimization)
  3. One that draws all cubes at once with DrawElementsInstanced (same indexing as before)

I noticed zero performance difference between these variations. In order to really test this, I decided to run each version of the program several times each, with a different number of cubes each time: 1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000, 500000, 1000000. I am using QueryPerformanceCounter and QueryPerformanceFrequency to measure the frame times. I store the frame times in memory until the program is closed, at which point I print them out to a csv file. I then opened each csv file in Excel and averaged the frame times. At times, I omitted the first few frames of data from the average, as these were often obvious outliers.


Here are the results.




This is a log-log plot showing that the increase in frame time is linear with respect to the number of cubes drawn, and performance is essentially the same no matter which technique I used. One word of explanation about the "Pan" suffix: I actually ran two versions of each program. In one version, the camera was static. In another version, the camera was panning. The reason I did this is that keeping the camera static allowed me to avoid updating the matrix uniforms each frame. I didn't expect this to cause a big performance increase, except for in the DrawElementsInstanced version, where the static camera allows me to actually skip updating the big buffers that hold all of the matrices. 




This is a linear plot of just the 100,000-1,000,000 cubes range. The log-log plot sometimes exaggerates or downplays differences, so I just wanted to show that the linear plot shows essentially the same thing. In fact, the DrawArraysPan method was fastest, even though I expected it to be the slowest.




This is just a plot of the triangles-per-second I'm getting with each method. As you can see, they are essentially all the same. I understand that triangles-per-second is not a great absolute measure of performance, but since I'm comparing apples-to-apples here, it seems to be a good relative measure.


Speaking of which, I feel like the triangles-per-second numbers are really low. I know that I just said that triangles-per-second are a bad absolute measure of performance, but hear me out. The computer I'm testing this on has an Intel Core i5-4570, 8GB RAM, and a GTX 770. I feel like these numbers are a couple orders of magnitude lower than what I would expect. 


Anyway, I'm trying to find what the bottleneck is, but everything just seems to be linear with respect to the number of models being drawn, regardless of how many unique verts are in that model, and regardless of how many draw calls are involved. 

Gift for someone who is just starting out with programming...

27 October 2014 - 08:37 PM

Hey folks.


A good friend of mine is a middle school math teacher, and he is taking a fairly bold step in quitting his job so that he can go back to school full-time to get a degree in computer science. He is very interested in programming. He doesn't have a lot of programming experience yet, but I think he has the right intelligence and mentality for it and he'll do well. He's been feeling an intense amount of trepidation about this decision (which, incidentally, is too late to reverse), and so I have been trying to think of a gift to give him that might make him feel encouraged. I was thinking of maybe a nice book, like The Pragmatic Programmer, although I don't know how useful that would be to a newbie. When I first started out, I was happy just to get a crappy shell game working and it was a while before I found the advice found in the big lofty books to be very helpful. Another idea I had was an Arduino, with maybe a project book to go along with it, but I am not sure how interested he is in the hardware side of things.


It doesn't necessarily have to be something for him to read or do, it could just be something that is cool and inspiring. I could make something for him that has a lot of programmer appeal, such as a 4-bit adder out of NAND gates and LEDs or something like that.


These are the sorts of ideas I've been having, but nothing yet seems like THE idea. Any ideas you could give me would be hugely appreciated. As far as price range, I was thinking something in the neighborhood of $50.

Debugging a system hang (entire OS freezes)

07 August 2014 - 01:54 PM

Hi. I have an application I'm working on that, perhaps 1 time in 500, hangs the entire system on startup. And by that, I mean that the entire OS freezes including the mouse pointer, and no input is accepted from the keyboard. I can't even imagine what I could be doing from within the confines of my program in order to cause these system-wide effects, but I assume that it's some kernel mode thing. Does anyone have any tips on debugging this sort of issue? This is a C++ OpenGL application running on Windows 7.

Just a couple of Data-Oriented Design questions.

15 June 2014 - 05:31 PM

So I'm just barely getting into this, and I think it's a really neat way to think about things. I'm going to program a simple Asteroids-like game (more of a Lunatic Fringe clone) just to get my feet wet, and I'm pretty excited about it. I just have a couple of things that I need to have clarified about doing a component-system type arrangement with Data-Oriented Design. 


My first question concerns how to organize the arrays of components. My first instinct is that there should just be one array of each type of component. For instance, there would only be one master array of transforms, and any game entity that owns a transform would just add their transform to this array. That way, any systems that need to affect the transforms can just iterate linearly through the whole array. Several drawbacks (none of them insurmountable) became immediately apparent:


  1. Entity deletion becomes more difficult. Not every entity needs every component, so the component arrays will all be of different sizes, and there is going to need to be some piece of code (in the Entity itself, perhaps) that keeps track of which components in which arrays belong to which entity. If it's keeping track via array indices, then these indices will have to be updated every time components are deleted from the arrays (assuming that components are always added to the end, and that deletion involves an actual erase-remove operation rather than just marking certain array elements as 'dead' and recycling them later).
  2. Not every system that works on transforms (say) needs to work on every transform. There may be certain systems that only need to perform operations on Asteroid transforms. This means that each component will have to identify the type of the entity to which it belongs, so that the system in question can check the type before operating on the component. Maybe this isn't such a bad thing. Conditional branches aren't incredibly expensive, and although this does mean you've potentially wasted a prefetch on an object that you don't need, doing nothing to an object is still faster than doing something.

So my next thought is that there should be a different component array for each Entity type that uses the component. In my surfing, I've come across several examples of code that looks like this:

struct Asteroids
    std::vector<Transform> transforms;
    std::vector<CollisionSphere> collisionSpheres;
    // etc...

So, the entity type here is Asteroid, and one instance of the Asteroids class holds a list of all asteroids in the scene. If you have a system that performs operations on asteroid transforms, but perhaps not the transforms of some other entity types, then you can just feed the system asteroids.transforms, and the system can do it's usual linear thing. There are a couple of drawbacks, it seems, to this approach as well:


  1. This breaks up a single big array (per component) into N smaller arrays per component (where N is the number of entity types that use that component), and some of those arrays (like players.transforms) might be very small. Running a system on a series of tiny arrays is probably not much better than jumping around in memory. However, there would still be significant savings, I would imagine, if your world contains several entity types that exist in large numbers (and thus have large component arrays).
  2. It seems that, suddenly, there are going to be a lot of areas of code that are going to need to know about the concrete entity types, which would not had they been programmed in a more traditional OOP style. In traditional OOP, you often just have an array of EntityBase* and you just call a polymorphic Update method on each entity. The code that iterates through this array doesn't know or care what the concrete types are. But suppose we're doing a data-oriented entity/system/component architecture, and suppose we have a system that performs some operation on transforms, but we only want Asteroids and BlackHoles to use it. There is going to need to be some code that calls system.update(asteroids.transforms) and then calls system.update(blackholes.transforms). My intuition (and maybe this is a good intuition, or maybe it's brought on by an overdose of OOP-thinking) is that main, central pieces of glue code should not know about concrete types; the only code that should care about the behavior of concrete types are the concrete types themselves, and the central "glue" code that brings all of these types together should know only about the abstract classes, and be completely decoupled from the concrete types themselves.

Despite these potential drawbacks, I still feel that option #2 (having entity classes like Asteroids that have their own arrays of components, rather than using a master array for each component that all entity types must share) is far preferable. However, I am interested in some other opinions, and to see if perhaps someone has thought of a third way that is even better.


My second question (or perhaps more of an observation) is that it seems like you can't entirely avoid random-access of objects. Take collision detection and response. Suppose you have some CollisionDetectionSystem whose job it is to iterate through all of the entities and find out which ones are colliding. For each collision, it adds a Collision object to an array of collisions. Then comes the collision resolution step. Even if you have just one CollisionResolutionSystem, that system is going to have to read each Collision object, find out which entities were involved in the collision, and then find the appropriate components for these entities in some other list. This find operation is going to necessarily entail jumping hither and yon through the component arrays, modifying them as called-for by the collision response.


Is there a clever solution to that sort of situation, or is it just a performance weakness that is accepted on the basis that it's still faster than the traditional OOP method of always jumping all over memory for everything?