OpenGL API Overhead

Published February 23, 2017 by Anatoliy Gerlits, posted by _Wizard_

Introduction

In modern projects, to produce a nice-looking scene the engine renders thousands of different objects: characters, buildings, landscape, effects and more. There are several ways to render geometry on the screen. In this article, we consider how to do that efficiently, and we measure and compare the cost of rendering API calls. Note, however, that only the CPU load is measured, which may not be 100% accurate but gives an idea of the relative costs. We consider:

the cost of state changes (frame buffers, vertex buffers, shaders, constants, textures)
different types of geometry instancing and their relative performance
several practical recommendations on how to optimize geometry rendering in projects.

I will cover only the OpenGL API and will not describe the details, parameters and variations of each API call; there are reference books and manuals for that purpose. Computer configuration for all tests: Intel Core i5-4460 3.2 GHz, Radeon R9 380.

Measurements:

The right way to measure performance is to capture the overall frame time, even if it includes some extra work such as clearing the back buffer and binding the FBO.
Exact equation: AVG_TEST_TIME = (CURRENT_TIME_AFTER_N_ITERATIONS - TIME_WHEN_WE_STARTED_FIRST_ITERATION) / NUMBER_OF_ITERATIONS
In all calculations, time is in ms.
Each test is run several times and the average time is calculated. NUMBER_OF_ITERATIONS should be fairly large (500-1000); otherwise the results differ too much from launch to launch.
When measuring instancing time we must make sure that the GPU is not the bottleneck, because we want to know only the CPU time and how it scales with the number of instances.
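
The averaging described above can be sketched in a few lines of C++. This is only an illustration of the equation, not the harness from the attached source: the function name and the `run_test` callback (standing in for one full frame of a test) are assumptions.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Average wall-clock time of one iteration, in milliseconds.
// `run_test` is a hypothetical callback that renders one full frame of a test.
double average_test_time_ms(const std::function<void()>& run_test, int num_iterations)
{
    assert(num_iterations > 0);
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();          // TIME_WHEN_WE_STARTED_FIRST_ITERATION
    for (int i = 0; i < num_iterations; ++i)
        run_test();
    const auto end = clock::now();            // CURRENT_TIME_AFTER_N_ITERATIONS

    const std::chrono::duration<double, std::milli> elapsed = end - start;
    return elapsed.count() / num_iterations;  // AVG_TEST_TIME
}
```

A test would then be measured with something like average_test_time_ms(simple_dips_frame, 1000), where simple_dips_frame is whatever function draws one frame of that test.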

States changing

We want to see a rich picture on the screen: many unique objects with a lot of detail. For this purpose the engine takes all objects visible to the camera, sets their parameters (vertex buffers, shaders, material parameters, textures, etc.) and sends them to be rendered. All these actions are performed with special API commands. Let's consider them and run some tests to understand how to organize the rendering process optimally.

Let's measure the cost of different OpenGL calls: dips (draw indexed primitive), and changes of shaders, vertex buffers, textures and shader parameters.

Dips

A dip (draw indexed primitive) is a command to the GPU to render a batch of geometry, most often triangles. Of course, first we need to tell the GPU what geometry we want to show and with what shader, and set some options. But only the dip renders geometry; all other commands just describe the parameters of what we want to show. The price of a dip usually includes all related state changes, not just the one command, so it depends on the number of state changes. First, consider the simplest case: the cost of one thousand simple dips without state changes.

 
void simple_dips() 
{ 
    glBindVertexArray(ws_complex_geometry_vao_id); //what geometry to render 
    simple_geometry_shader.bind(); //with what shader
    //a lot of simple dips 
    for (int i = 0; i < CURRENT_NUM_INSTANCES; i++) 
        glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i+1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES*sizeof(int))); //simple dip 
}

For 1000 dips we get 0.41 ms.

Frame buffer change

An FBO (frame buffer object) is an object that allows rendering an image not to the screen but to another surface, which can later be used as a texture in shaders. The FBO is changed less often than other state, but the change is quite expensive for the CPU.

 
void fbo_change_test() 
{ 
    //clear FBO 
    glViewport(0, 0, window_width, window_height); 
    glClearColor(0.0f / 255.0f, 0.0f / 255.0f, 0.0f / 255.0f, 0.0); 
    
    for (int i = 0; i < NUM_DIFFERENT_FBOS; i++) 
    {
        glBindFramebuffer(GL_FRAMEBUFFER, fbo_buffer[i % NUM_DIFFERENT_FBOS]);
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    }
    
    //prepare dip 
    glBindVertexArray(ws_complex_geometry_vao_id); 
    simple_geometry_shader.bind();
    
    //bind FBO, render one object... repeat N times 
    for (int i = 0; i < NUM_FBO_CHANGES; i++) 
    { 
        glBindFramebuffer(GL_FRAMEBUFFER, fbo_buffer[i % NUM_DIFFERENT_FBOS]); //bind fbo
        glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
    } 
    
    glBindFramebuffer(GL_FRAMEBUFFER, 0); //set rendering to the screen
}

For 200 fbo changes we get 1.97 ms.

One usually needs to change the FBO for post effects and different passes such as reflections, rendering into a cubemap, creating virtual textures, etc. Many things, like virtual textures, can be organized as atlases, so that the FBO is set only once and only the viewport, for example, is changed. Rendering into a cubemap might be replaced with another technique, for example dual paraboloid rendering. The matter, of course, is not only FBO changes but also the number of scene rendering passes, material changes, etc. In general, the fewer state changes the better.
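
To illustrate the atlas idea, here is a minimal sketch of how a viewport rectangle for one atlas tile could be computed, so that a single FBO stays bound while only the viewport changes. The square grid layout and all names are assumptions made for the example.

```cpp
#include <cassert>

// Pixel rectangle in the form glViewport expects: lower-left corner plus size.
struct ViewportRect { int x, y, width, height; };

// Returns the rectangle of tile `tile_index` in an atlas that consists of
// `tiles_per_row` x `tiles_per_row` square tiles of `tile_size` pixels each.
ViewportRect atlas_tile_viewport(int tile_index, int tiles_per_row, int tile_size)
{
    assert(tiles_per_row > 0 && tile_size > 0);
    assert(tile_index >= 0 && tile_index < tiles_per_row * tiles_per_row);
    const int column = tile_index % tiles_per_row;
    const int row    = tile_index / tiles_per_row;
    return { column * tile_size, row * tile_size, tile_size, tile_size };
}
```

With the atlas FBO bound once, each pass would call glViewport(r.x, r.y, r.width, r.height) with its own rectangle before drawing, instead of calling glBindFramebuffer per pass.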

Shader changes

Shaders usually describe one of the scene's materials or effect techniques. The more materials and kinds of surfaces, the more shaders. Several materials may differ only slightly; these should be combined into one shader, with the switch between them implemented as a condition in the shader. The number of materials directly influences the number of dips.

 
void shaders_change_test() 
{ 
    glBindVertexArray(ws_complex_geometry_vao_id);
    
    for (int i = 0; i < CURRENT_NUM_INSTANCES; i++) 
    {
        simple_color_shader[i%NUM_DIFFERENT_SIMPLE_SHADERS].bind(); //bind certain shader
        glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
    }
}

For 1000 shader changes we get 2.90 ms.

Changing the shader here also includes transferring the world-view-projection matrix as a parameter; otherwise we could not render anything. The cost of changing parameters is measured in the next step.

Shader parameters changing

Materials are often made universal, with a lot of options, to produce different kinds of materials. This is an easy way to add variety to the picture and make each character or object unique. These parameters must somehow be transferred to the shader, which is done with the glUniform* API commands.

 
uniforms_changes_test_shader.bind(); 
glBindVertexArray(ws_complex_geometry_vao_id);
for (int i = 0; i < CURRENT_NUM_INSTANCES; i++) 
{
    //set uniforms for this dip 
    for (int j = 0; j < NUM_UNIFORM_CHANGES_PER_DIP; j++) 
        glUniform4fv(ColorShader_uniformLocation[j], 1, &randomColors[(i*NUM_UNIFORM_CHANGES_PER_DIP + j) % MAX_RANDOM_COLORS].x);
    
    glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
}

It is not optimal to set parameters individually for each instance/object. Usually all instance data can be packed into one large buffer and transferred to the GPU with one command. It then only remains to set, for each object, an offset that tells where its data is located.


//copy data to ssbo buffer
glBindBuffer(GL_SHADER_STORAGE_BUFFER, instances_uniforms_ssbo);
float *gpu_data = (float*)glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, CURRENT_NUM_INSTANCES * NUM_UNIFORM_CHANGES_PER_DIP * sizeof(vec4), GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(gpu_data, &all_instances_uniform_data[0], CURRENT_NUM_INSTANCES * NUM_UNIFORM_CHANGES_PER_DIP * sizeof(vec4)); //copy instances data
glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);

//bind to binding point 0 (the shader reads the data from this binding point)
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, instances_uniforms_ssbo);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);

//render 
uniforms_changes_ssbo_shader.bind();
glBindVertexArray(ws_complex_geometry_vao_id);
static int uniformsInstancing_data_varLocation = glGetUniformLocation(uniforms_changes_ssbo_shader.programm_id, "instance_data_location");
for (int i = 0; i < CURRENT_NUM_INSTANCES; i++) 
{
    //tell the shader where the data of this object is located
    glUniform1i(uniformsInstancing_data_varLocation, i*NUM_UNIFORM_CHANGES_PER_DIP);
    glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
}

Uniforms changes test takes 1.27 ms. Same test for shader parameters transferring using SSBO takes 0.80 ms.

Using glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_WRITE_ONLY); causes CPU and GPU synchronization, which should be avoided. One should use glMapBufferRange with the GL_MAP_UNSYNCHRONIZED_BIT flag to prevent synchronization. But then the programmer must guarantee that the data being overwritten is not being used by the GPU at that moment; otherwise we get bugs, because we rewrite data that the GPU is reading. To resolve this problem completely, use triple buffering: while we write data to the current buffer, the GPU uses the other two. In addition, there is a more optimal way of mapping buffers, with the GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT flags.
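
The triple buffering scheme can be sketched as one large buffer split into three regions, with the write region rotating every frame. The class below is only a CPU-side illustration with hypothetical names; it computes offsets and does not call OpenGL.

```cpp
#include <cassert>
#include <cstddef>

// Splits one large buffer into kRegions equal parts and hands out a different
// part every frame, so the CPU does not write the parts the GPU may still read.
class RingBufferRegions
{
public:
    static constexpr std::size_t kRegions = 3;  // triple buffering

    explicit RingBufferRegions(std::size_t region_size_bytes)
        : region_size_(region_size_bytes)
    {
        assert(region_size_bytes > 0);
    }

    // Byte offset of the region to write during frame `frame_index`.
    std::size_t write_offset(std::size_t frame_index) const
    {
        return (frame_index % kRegions) * region_size_;
    }

    // Total size to allocate for the whole buffer.
    std::size_t total_size() const { return kRegions * region_size_; }

private:
    std::size_t region_size_;
};
```

The offset returned by write_offset would be passed to glMapBufferRange (or added to a persistently mapped pointer) and to the binding call for that frame. The rotation alone only makes a conflict unlikely; a common way to actually guarantee that the GPU has finished with a region is a fence per region (glFenceSync and glClientWaitSync), which is left out of this sketch.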

Changing vertex buffers

A scene contains a lot of objects with different geometries. This geometry is usually placed in different vertex buffers. To render another object with different geometry, even with the same material, we need to change the vertex buffer. There are techniques that allow rendering different geometry with the same material efficiently with only one dip: MultiDrawIndirect and dynamic vertex pulling. For these, the geometry should be placed in one buffer.

 
void vbo_change_test() 
{
    simple_geometry_shader.bind();
    
    for (int i = 0; i < CURRENT_NUM_INSTANCES; i++) 
    {
        glBindVertexArray(separate_geometry_vao_id[i % NUM_SIMPLE_VERTEX_BUFFERS]); //change vbo
        glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
    }
}

For 1000 VBO changes we get 0.95 ms.

Textures changes

Textures give surfaces a detailed look. You can get a very large variety in the picture simply by changing textures and blending different textures in the shader. Textures have to be changed frequently, but you can put them in a so-called texture array, bind it only once for many dips, and access the textures through an index in the shader. The same geometry with different textures can be rendered using instancing.

 
void textures_change_test() 
{
    glBindVertexArray(ws_complex_geometry_vao_id);
    int counter = 0;
    
    //switch between tests 
    if (test_type == ARRAY_OF_TEXTURES_TEST) 
    {
        array_of_textures_shader.bind();
        
        for (int i = 0; i < CURRENT_NUM_INSTANCES; i++) 
        {
            //bind textures for this dip 
            for (int j = 0; j < NUM_TEXTURES_IN_COMPLEX_MATERIAL; j++) 
            {
                glActiveTexture(GL_TEXTURE0 + j);
                glBindTexture(GL_TEXTURE_2D, array_of_textures[counter % TEX_ARRAY_SIZE]);
                glBindSampler(j, Sampler_linear);
                counter++;
            }
            
            glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
        }
    }
    else if (test_type == TEXTURES_ARRAY_TEST) 
    { 
        //bind texture array for all dips
        glActiveTexture(GL_TEXTURE0);
        glBindTexture(GL_TEXTURE_2D_ARRAY, texture_array_id);
        glBindSampler(0, Sampler_linear);
        
        //variable to tell shader - what textures uses this dip 
        static int textureArray_usedTex_varLocation = glGetUniformLocation(textureArray_shader.programm_id, "used_textures_i");
        textureArray_shader.bind();
        
        float used_textures_i[6];
        for (int i = 0; i < CURRENT_NUM_INSTANCES; i++) 
        {
            //fill data - what textures uses this dip 
            for (int j = 0; j < 6; j++) 
            {
                used_textures_i[j] = counter % TEX_ARRAY_SIZE;
                counter++; 
            }
            
            glUniform1fv(textureArray_usedTex_varLocation, 6, &used_textures_i[0]); //transfer to shader, tell what textures this material uses 
            glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
        }
    }
}

The simple texture change test (for 1000 dips) takes 3.27 ms. But it makes NUM_TEXTURES_IN_COMPLEX_MATERIAL texture changes per dip; we should take this into account later when calculating the cost of glBindTexture. The same test performed with a texture array takes 0.87 ms.

Comparative estimation of state changes

Below is a table with the execution cost/time of all performed tests.

Table 1. State changes tests time

Test type                      Time (ms)
SIMPLE_DIPS_TEST               0.41
FBO_CHANGE_TEST                1.97
SHADERS_CHANGE_TEST            2.90
UNIFORMS_SIMPLE_CHANGE_TEST    1.27
UNIFORMS_SSBO_CHANGE_TEST      0.80
VBO_CHANGE_TEST                0.95
ARRAY_OF_TEXTURES_TEST         3.27
TEXTURES_ARRAY_TEST            0.87

Using these results we can calculate the cost of each API call. The absolute cost is given per 1000 API calls, in ms. The relative cost is calculated in relation to the simple dip call (glDrawRangeElements).

Table 2. API call cost (per 1k calls)

API call               Absolute cost (ms)   Relative cost (%)
glBindFramebuffer      9.44                 2314%
glUseProgram           2.49                 610%
glBindVertexArray      0.54                 132%
glBindTexture          0.48                 116%
glDrawRangeElements    0.41                 100%
glUniform4fv           0.09                 21%
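
The article does not spell out how the per-call costs are derived from the test times. The numbers are consistent with subtracting the time of the plain dips from the test time and dividing by the number of state-changing calls, so the sketch below assumes that reading. The function names are hypothetical, and the value of 6 textures per dip used in the example is an assumption taken from the used_textures_i[6] array in the texture test.

```cpp
#include <cassert>

// Cost of 1000 calls of one state-changing API function, in ms.
//   test_time_ms     - measured time of the whole test
//   dips_time_1k_ms  - time of 1000 plain dips (SIMPLE_DIPS_TEST)
//   num_dips         - dips issued by the test
//   changes_per_dip  - state-changing calls issued before each dip
double cost_per_1k_calls(double test_time_ms, double dips_time_1k_ms,
                         int num_dips, int changes_per_dip)
{
    assert(num_dips > 0 && changes_per_dip > 0);
    const double dips_only_ms     = dips_time_1k_ms * num_dips / 1000.0;
    const double state_changes_ms = test_time_ms - dips_only_ms;
    return state_changes_ms / (num_dips * changes_per_dip) * 1000.0;
}

// Cost relative to 1000 plain dips, in percent.
double relative_cost_percent(double cost_1k_ms, double dips_time_1k_ms)
{
    assert(dips_time_1k_ms > 0.0);
    return cost_1k_ms / dips_time_1k_ms * 100.0;
}
```

For example, the FBO test (200 changes, 1.97 ms) gives (1.97 - 0.41 * 200 / 1000) / 200 * 1000 = 9.44 ms, the shader test gives 2.90 - 0.41 = 2.49 ms, and the texture test with 6 binds per dip gives (3.27 - 0.41) / 6 = 0.48 ms. Percentages computed from these rounded values differ slightly from the table (about 2302% instead of 2314% for glBindFramebuffer), presumably because the table was built from unrounded timings.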

Of course, one should treat these measurements with caution, as they will change depending on the driver version and hardware.

Instancing

Instancing was invented to quickly render the same geometry with different parameters. Each object has a unique index, which we can use in the shader to fetch the parameters for this object, vary some options, etc. The main advantage of instancing is that it greatly reduces the number of dips.

We can pack the parameters of all instances into a buffer, transfer them to the GPU and make just one dip. Storing data in a buffer is a good optimization in itself: we save on not having to change shader parameters constantly. Also, if the instance data does not change (for example, we know for sure that it is static geometry), we don't need to transfer the data to the GPU every frame, only once at program or level start. In general, for optimal rendering we should first pack all instance data into one buffer and transfer it to the GPU with one command. Then, for each dip (each type of geometry), we just set the offset where the instance data for this dip is located. Using the instance index (gl_InstanceID in OpenGL) we can fetch the data for a particular instance/object.
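
The packing step can be sketched on the CPU side as follows. This is only an illustration: the struct and function names are hypothetical, and the two vectors per instance (position and color) mirror the layout used by the shaders in this article.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec4 { float x, y, z, w; };

// Two data vectors per instance: position and color.
struct InstanceData { Vec4 position; Vec4 color; };

// Where one group (one geometry type, one dip) finds its instances.
struct GroupRange { std::size_t first_instance; std::size_t instance_count; };

// Appends the instances of every group to one contiguous array and records
// the offset of each group. `packed` is what would be uploaded in one call.
std::vector<GroupRange> pack_instances(
    const std::vector<std::vector<InstanceData>>& groups,
    std::vector<InstanceData>& packed)
{
    std::vector<GroupRange> ranges;
    ranges.reserve(groups.size());
    packed.clear();
    std::size_t total = 0;
    for (const auto& group : groups)
    {
        ranges.push_back({ packed.size(), group.size() });
        packed.insert(packed.end(), group.begin(), group.end());
        total += group.size();
    }
    assert(packed.size() == total);
    return ranges;
}
```

For each dip, first_instance would be passed to the shader as an offset, and the shader would read element first_instance + gl_InstanceID of the buffer.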

There are many ways to store data in OpenGL: vertex buffers (VBO), uniform buffers (UBO), texture buffers (TBO), shader storage buffers (SSBO) and textures. Each type has its own features. Let's consider them.

Texture instancing

All data is stored in a texture. To update the texture data efficiently one should use a special structure, the Pixel Buffer Object (PBO), which allows transferring data from CPU to GPU asynchronously. The CPU does not wait until the data is transferred and continues to work.

Creation code:

glGenBuffers(2, textureInstancingPBO);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, textureInstancingPBO[0]);
//GL_STREAM_DRAW_ARB means that we will change data every frame
glBufferData(GL_PIXEL_UNPACK_BUFFER, INSTANCES_DATA_SIZE, 0, GL_STREAM_DRAW_ARB);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, textureInstancingPBO[1]);
glBufferData(GL_PIXEL_UNPACK_BUFFER, INSTANCES_DATA_SIZE, 0, GL_STREAM_DRAW_ARB);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

//create texture where we will store instances data on gpu
glGenTextures(1, &textureInstancingDataTex);
glBindTexture(GL_TEXTURE_2D, textureInstancingDataTex);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_REPEAT);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_R, GL_REPEAT);

//in each line we store NUM_INSTANCES_PER_LINE object's data. 128 in our case
//for each object we store PER_INSTANCE_DATA_VECTORS data-vectors. 2 in our case
//GL_RGBA32F, we have float32 data
//complex_mesh_instances_data source data of instances, if we are not going to update data in the texture
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, NUM_INSTANCES_PER_LINE * PER_INSTANCE_DATA_VECTORS, MAX_INSTANCES / NUM_INSTANCES_PER_LINE, 0, GL_RGBA, GL_FLOAT, &complex_mesh_instances_data[0]);
glBindTexture(GL_TEXTURE_2D, 0);

Texture update:

glBindTexture(GL_TEXTURE_2D, textureInstancingDataTex);
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER, textureInstancingPBO[current_frame_index]);

// copy pixels from PBO to texture object
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, NUM_INSTANCES_PER_LINE * PER_INSTANCE_DATA_VECTORS, MAX_INSTANCES / NUM_INSTANCES_PER_LINE, GL_RGBA, GL_FLOAT, 0);

// bind PBO to update pixel values
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER, textureInstancingPBO[next_frame_index]);

//http://www.songho.ca/opengl/gl_pbo.html
// Note that glMapBufferARB() causes sync issue.
// If GPU is working with this buffer, glMapBufferARB() will wait(stall)
// until GPU to finish its job. To avoid waiting (idle), you can call
// first glBufferDataARB() with NULL pointer before glMapBufferARB().
// If you do that, the previous data in PBO will be discarded and
// glMapBufferARB() returns a new allocated pointer immediately
// even if GPU is still working with the previous data.
glBufferData(GL_PIXEL_UNPACK_BUFFER, INSTANCES_DATA_SIZE, 0, GL_STREAM_DRAW_ARB);
gpu_data = (float*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY_ARB);
if (gpu_data)
{
    memcpy(gpu_data, &complex_mesh_instances_data[0], INSTANCES_DATA_SIZE); // update data
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER); //release pointer to mapping buffer
}

Rendering using texture instancing:

 
//bind texture with instances data 
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, textureInstancingDataTex);
glBindSampler(0, Sampler_nearest);
glBindVertexArray(geometry_vao_id); //what geometry to render 
tex_instancing_shader.bind(); //with what shader

//tell the shader which texture unit holds the instance data
static GLint location = glGetUniformLocation(tex_instancing_shader.programm_id, "s_texture_0");
if (location >= 0) 
    glUniform1i(location, 0);

//render group of objects 
glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, CURRENT_NUM_INSTANCES);

Vertex shader to access the data:

 
#version 150 core
in vec3 s_pos; 
in vec3 s_normal; 
in vec2 s_uv;
uniform mat4 ModelViewProjectionMatrix;

uniform sampler2D s_texture_0;
out vec2 uv;
out vec3 instance_color;

void main()
{
    const vec2 texel_size = vec2(1.0 / 256.0, 1.0 / 16.0);
    const int objects_per_row = 128;
    const vec2 half_texel = vec2(0.5, 0.5);

    //calc texture coordinates - where our instance data is located
    //gl_InstanceID % objects_per_row - index of the object in the line
    //multiply by 2 as each object has 2 vectors of data
    //gl_InstanceID / objects_per_row - in what line our data is located
    //multiplying by texel_size gives us 0..1 uv to sample from the texture from the integer texel id
    vec2 texel_uv = (vec2((gl_InstanceID % objects_per_row) * 2, floor(gl_InstanceID / objects_per_row)) + half_texel) * texel_size;
    vec4 instance_pos = textureLod(s_texture_0, texel_uv, 0);
    instance_color = textureLod(s_texture_0, texel_uv + vec2(texel_size.x, 0.0), 0).xyz;
    uv = s_uv;
    gl_Position = ModelViewProjectionMatrix * vec4(s_pos + instance_pos.xyz, 1.0);
}
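
The texel addressing used by this shader is easy to check on the CPU. The sketch below mirrors the same arithmetic in C++ for a 256 x 16 texture (128 objects per row, 2 texels per object); the constant and function names are chosen for the example only.

```cpp
#include <cassert>

struct Vec2 { float x, y; };

// Layout constants matching the vertex shader: 128 objects per row,
// 2 texels (data vectors) per object, 16 rows.
constexpr int kObjectsPerRow    = 128;
constexpr int kVectorsPerObject = 2;
constexpr int kTextureWidth     = kObjectsPerRow * kVectorsPerObject; // 256
constexpr int kTextureHeight    = 16;

// UV of the center of the first texel that belongs to `instance_id`.
Vec2 instance_texel_uv(int instance_id)
{
    assert(instance_id >= 0 && instance_id < kObjectsPerRow * kTextureHeight);
    const int column = (instance_id % kObjectsPerRow) * kVectorsPerObject;
    const int row    = instance_id / kObjectsPerRow;
    return { (column + 0.5f) / kTextureWidth, (row + 0.5f) / kTextureHeight };
}
```

For instance 129 this gives column 2 and row 1, i.e. uv = (2.5 / 256, 1.5 / 16). The half-texel shift places the sample at the texel center, so nearest filtering returns exactly the stored vector.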

Instancing through vertex buffer

The idea is to keep instance data in a separate vertex buffer and access it in the shader through vertex attributes. The code that creates the buffer and fills it with data is trivial. Our main task is to modify the vertex description for the shader (vertex declaration, vdecl):

//...code of base vertex declaration creation
//special attributes binding
glBindBuffer(GL_ARRAY_BUFFER, all_instances_data_vbo);
//size of per instance data (PER_INSTANCE_DATA_VECTORS = 2 - so we have to create 2 additional attributes to transfer data)
const int per_instance_data_size = sizeof(vec4) * PER_INSTANCE_DATA_VECTORS;

glEnableVertexAttribArray(4);
//4th vertex attribute, has 4 floats, 0 data offset
glVertexAttribPointer((GLuint)4, 4, GL_FLOAT, GL_FALSE, per_instance_data_size, (GLvoid*)(0));
//tell that we will change this attribute per instance, not per vertex
glVertexAttribDivisor(4, 1);

glEnableVertexAttribArray(5);
//5th vertex attribute, has 4 floats, sizeof(vec4) data offset
glVertexAttribPointer((GLuint)5, 4, GL_FLOAT, GL_FALSE, per_instance_data_size, (GLvoid*)(sizeof(vec4)));
//tell that we will change this attribute per instance, not per vertex
glVertexAttribDivisor(5, 1);

Rendering code:

vbo_instancing_shader.bind();
//our vertex buffer with modified vertex declaration (vdecl)
glBindVertexArray(geometry_vao_vbo_instancing_id);
glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, CURRENT_NUM_INSTANCES);

Vertex shader to access data:

#version 150 core
in vec3 s_pos;
in vec3 s_normal;
in vec2 s_uv;
in vec4 s_attribute_3; //some_data;
in vec4 s_attribute_4; //instance pos
in vec4 s_attribute_5; //instance color
uniform mat4 ModelViewProjectionMatrix;
out vec3 instance_color;
void main()
{
    instance_color = s_attribute_5.xyz;
    gl_Position = ModelViewProjectionMatrix * vec4(s_pos + s_attribute_4.xyz, 1.0);
}

Uniform buffer instancing, Texture buffer instancing, Shader Storage buffer instancing

These three methods are very similar to each other; they differ mostly in the buffer type. A uniform buffer (UBO) is small, but in theory it should be faster than the others. A texture buffer (TBO) can be very large: we can store the instance data of the whole scene in it, as well as skeletal transformations. A shader storage buffer (SSBO) has both properties: it is fast and large. Also, we can write data to it. The only drawback is that it is a new extension, and old hardware does not support it.

Uniform buffer

Creation code:

 
    glGenBuffers(1, &dips_uniform_buffer); 
    glBindBuffer(GL_UNIFORM_BUFFER, dips_uniform_buffer); 
    glBufferData(GL_UNIFORM_BUFFER, INSTANCES_DATA_SIZE, &complex_mesh_instances_data[0], GL_STATIC_DRAW);
    
    //uniform_buffer_data 
    glBindBuffer(GL_UNIFORM_BUFFER, 0);
    
    //bind uniform buffer with instances data to shader 
    ubo_instancing_shader.bind(true); 
    GLint instanceData_location3 = glGetUniformLocation(ubo_instancing_shader.programm_id, "instance_data");
    
    //link to shader 
    glUniformBufferEXT(ubo_instancing_shader.programm_id, instanceData_location3, dips_uniform_buffer); //actually binding

Instancing vertex shader with uniform buffer:

 
#version 150 core 
#extension GL_EXT_bindable_uniform : enable 
#extension GL_EXT_gpu_shader4 : enable
in vec3 s_pos; 
in vec3 s_normal; 
in vec2 s_uv;
uniform mat4 ModelViewProjectionMatrix; 
bindable uniform vec4 instance_data[4096]; //our uniform with instances data
out vec3 instance_color;
void main() 
{
    vec4 instance_pos = instance_data[gl_InstanceID*2];
    instance_color = instance_data[gl_InstanceID*2+1].xyz;
    gl_Position = ModelViewProjectionMatrix * vec4(s_pos + instance_pos.xyz, 1.0);
}

Texture Buffer

Creation code:

 
tbo_instancing_shader.bind();
//bind to shader as special texture 
glActiveTexture(GL_TEXTURE0); 
glBindTexture(GL_TEXTURE_BUFFER, dips_texture_buffer_tex); 
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, dips_texture_buffer);
glBindVertexArray(geometry_vao_id);
glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, CURRENT_NUM_INSTANCES);

Vertex shader:

 
#version 150 core 
#extension GL_EXT_bindable_uniform : enable 
#extension GL_EXT_gpu_shader4 : enable
in vec3 s_pos; 
in vec3 s_normal; 
in vec2 s_uv;
uniform mat4 ModelViewProjectionMatrix; 
uniform samplerBuffer s_texture_0; //our TBO texture buffer
out vec3 instance_color;
void main() 
{ 
    //sample data from TBO 
    vec4 instance_pos = texelFetch(s_texture_0, gl_InstanceID*2);
    instance_color = texelFetch(s_texture_0, gl_InstanceID*2+1).xyz; 
    gl_Position = ModelViewProjectionMatrix * vec4(s_pos + instance_pos.xyz, 1.0); 
}

SSBO

Creation code:

 
    glGenBuffers(1, &ssbo); 
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
    glBufferData(GL_SHADER_STORAGE_BUFFER, INSTANCES_DATA_SIZE, &complex_mesh_instances_data[0], GL_STATIC_DRAW); 
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo); 
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0); // unbind

Render:

 
//bind ssbo_instances_data, link to shader at 0 binding point
glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo_instances_data);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo_instances_data);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
ssbo_instancing_shader.bind(); 
glBindVertexArray(geometry_vao_id);
glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, CURRENT_NUM_INSTANCES);
glBindVertexArray(0);

Vertex shader:

 
    #version 430
    #extension GL_ARB_shader_storage_buffer_object : require
    in vec3 s_pos; 
    in vec3 s_normal; 
    in vec2 s_uv;
    uniform mat4 ModelViewProjectionMatrix;
    //ssbo should be bound to binding point 0
    layout(std430, binding = 0) buffer ssboData { vec4 instance_data[4096]; };
    out vec3 instance_color;
    void main() 
    {
        //gl_InstanceID is unique for each instance, so we can fetch per-instance data 
        vec4 instance_pos = instance_data[gl_InstanceID*2]; 
        instance_color = instance_data[gl_InstanceID*2+1].xyz;
        gl_Position = ModelViewProjectionMatrix * vec4(s_pos + instance_pos.xyz, 1.0);
    }

Uniforms instancing

This one is pretty simple. With the special glUniform* commands we can pass several vectors of data to the shader. The maximum number depends on the video card; it can be queried by calling glGetIntegerv with the GL_MAX_VERTEX_UNIFORM_VECTORS parameter. On the R9 380 it returns 4096. The minimum value is 256.

 
uniforms_instancing_shader.bind();
glBindVertexArray(geometry_vao_id);
//location of our array of uniforms in the shader. We will write data to this array 
static int uniformsInstancing_data_varLocation = glGetUniformLocation(uniforms_instancing_shader.programm_id, "instance_data");
//instances data might be written with just one call if there are enough vectors. 
//Here we divide it into groups, because usually there is much more data than available uniforms. 
for (int i = 0; i < UNIFORMS_INSTANCING_NUM_GROUPS; i++) 
{ 
    //write data to uniforms 
    glUniform4fv(uniformsInstancing_data_varLocation, UNIFORMS_INSTANCING_MAX_CONSTANTS_FOR_INSTANCING, &complex_mesh_instances_data[i*UNIFORMS_INSTANCING_MAX_CONSTANTS_FOR_INSTANCING].x);
    glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, UNIFORMS_INSTANCING_OBJECTS_PER_DIP); 
}
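
How many objects fit into one dip, and how many dips are needed, follows from simple integer arithmetic. The sketch below uses hypothetical function names; the idea of reserving a few vectors for other uniforms (such as the 4 vectors of a mat4) is an assumption added for the example.

```cpp
#include <cassert>

// Number of objects that fit into one instanced dip when every object needs
// `vectors_per_object` vec4 uniforms out of a budget of `max_vectors`.
int objects_per_dip(int max_vectors, int vectors_per_object)
{
    assert(max_vectors > 0 && vectors_per_object > 0);
    return max_vectors / vectors_per_object;
}

// Number of dips (groups) needed to draw `num_objects` objects.
int num_groups(int num_objects, int per_dip)
{
    assert(num_objects >= 0 && per_dip > 0);
    return (num_objects + per_dip - 1) / per_dip;  // integer division, rounded up
}
```

With a limit of 4096 vectors, 4 of them reserved for the matrix and 2 vectors per object, one dip can carry 2046 objects, so 5000 objects need 3 dips.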

Multi draw indirect

Let's separately consider a command that issues a huge number of dips in one call. This is a very useful command that allows rendering a group of instances with different geometry, even thousands of different groups, with one command. As input it receives an array that describes the parameters of the dips: the number of indices, the offsets in the vertex buffers and the number of instances per group. The restriction is that all the geometry must be placed in one vertex buffer and rendered with one shader. An additional benefit is that the dip information for the MultiDraw command can be filled in on the GPU side, which is very useful for GPU frustum culling, for example.

 
//fill indirect buffer with dips information. Just simple array 
for (int i = 0; i < CURRENT_NUM_INSTANCES; i++) 
{
    multi_draw_indirect_buffer[i].vertexCount = BOX_NUM_INDICES; 
    multi_draw_indirect_buffer[i].instanceCount = 1; 
    multi_draw_indirect_buffer[i].firstVertex = i*BOX_NUM_INDICES; 
    multi_draw_indirect_buffer[i].baseVertex = 0; 
    multi_draw_indirect_buffer[i].baseInstance = 0; 
}
glBindVertexArray(ws_complex_geometry_vao_id); 
simple_geometry_shader.bind();
glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                            (GLvoid*)&multi_draw_indirect_buffer[0], //our information about dips 
                            CURRENT_NUM_INSTANCES, //number of dips 
                            0); //stride, 0 means the array is tightly packed

The glMultiDrawElementsIndirect command performs several glDrawElementsIndirect commands (each of them an instanced draw) in one call. The command has one unpleasant feature: each such group has an independent gl_InstanceID, i.e. it resets to 0 with every new Draw*. This makes it difficult to access the required per-instance data. The problem is solved by modifying the vertex declaration of each type of object sent to the renderer. You can read about it in the article "Surviving without gl_DrawID". It is worth noting that glMultiDrawElementsIndirect performs a huge number of dips with a single command, so it should not be compared directly with the other types of instancing.
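
For reference, here is a sketch of the command array that glMultiDrawElementsIndirect consumes. The struct layout follows the OpenGL specification (its field names differ from the ones used in the snippet above), while the helper function is hypothetical. Setting baseInstance to the draw index is an assumption made for illustration: combined with a per-instance vertex attribute (divisor 1) it is one way to give every draw its own index when gl_InstanceID restarts from 0.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One element of the indirect buffer: five tightly packed 32-bit values.
struct DrawElementsIndirectCommand
{
    std::uint32_t count;          // number of indices
    std::uint32_t instanceCount;  // number of instances
    std::uint32_t firstIndex;     // offset into the index buffer, in indices
    std::int32_t  baseVertex;     // value added to every index
    std::uint32_t baseInstance;   // first instance for instanced attributes
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20, "must be tightly packed");

// One command per object; all objects share one index and one vertex buffer.
std::vector<DrawElementsIndirectCommand> build_commands(std::uint32_t num_draws,
                                                        std::uint32_t indices_per_draw)
{
    assert(indices_per_draw > 0);
    std::vector<DrawElementsIndirectCommand> commands(num_draws);
    for (std::uint32_t i = 0; i < num_draws; ++i)
    {
        commands[i].count         = indices_per_draw;
        commands[i].instanceCount = 1;
        commands[i].firstIndex    = i * indices_per_draw;
        commands[i].baseVertex    = 0;
        commands[i].baseInstance  = i;  // assumption: used as a per-draw index
    }
    return commands;
}
```

The resulting array would normally be uploaded to a buffer bound to GL_DRAW_INDIRECT_BUFFER, or written directly by a compute shader when the dips are generated on the GPU.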

Performance comparison of different types of instancing

Table 3. Instancing test performance. Number of iterations = 100. Columns: number of instances. Values: CPU time (GPU time), in ms.

Instancing type        50            100           200
UBO_INSTANCING         0.35 (0.10)   0.37 (0.13)   0.36 (0.24)
TBO_INSTANCING         0.72 (0.11)   0.73 (0.13)   0.73 (0.25)
SSBO_INSTANCING        0.37 (0.09)   0.40 (0.13)   0.38 (0.24)
VBO_INSTANCING         0.36 (0.09)   0.37 (0.12)   0.37 (0.24)
TEXTURE_INSTANCING     0.38 (0.10)   0.39 (0.13)   0.39 (0.24)
UNIFORMS_INSTANCING    0.41 (0.13)   0.52 (0.27)   0.74 (0.51)
MULTI_DRAW_INDIRECT    0.63 (0.53)   1.17 (1.01)   2.10 (1.93)

The UBO, VBO, SSBO and TEXTURE instancing types have pretty much the same 'good' timing.

TBO instancing allows storing a huge amount of information, but it is slow in comparison with UBO. If possible, you should use SSBO storage: it is fast, handy and very large.

Texture instancing is also a good alternative to UBO. It is supported by old hardware and can store any amount of information, but it is a little awkward to update.

Transferring data each frame through glUniform* is obviously the slowest instancing method.

In the tests, glMultiDrawElementsIndirect performed 5k, 10k and 20k dips! But that is because the test itself was repeated many times; such a number of dips can also be issued with just one call. The only caveat is that with so many dips, the array with the dip descriptions will be quite large (it is better to generate the descriptions on the GPU).

Recommendations for optimization and conclusions

In this article we analyzed the cost of API calls and measured the performance of different types of instancing. In general, the fewer state switches, the better. Use the newest features of the latest API version: texture arrays, SSBO, Draw Indirect, and buffer mapping with the GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT flags for fast data transfer. Recommendations:

The fewer state changes the better. Group objects by material.
Wrap state changes (textures, buffers, shaders and other states): check whether the state has really changed before making the API call, because the call is much slower than checking a flag or index.
Combine geometry into one buffer.
Use texture arrays.
Store data in large buffers and textures.
Use as few shaders as possible. But an overly complicated/universal shader with many branches will obviously be a problem, especially on older video cards, where branching is expensive.
Use instancing.
Use Draw Indirect where possible and generate the dip information on the GPU side.
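
The first two recommendations can be combined in a small sketch: sort the dips by state, and route every bind through a wrapper that skips redundant calls. All names here are hypothetical, and the `bind` callback stands in for the real API call (glUseProgram, glBindTexture and so on).

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical description of one dip: which shader and texture it needs.
struct DrawItem { unsigned shader_id; unsigned texture_id; };

// Remembers the last bound object and forwards only real changes.
class CachedBinding
{
public:
    explicit CachedBinding(std::function<void(unsigned)> bind) : bind_(std::move(bind))
    {
        assert(bind_ != nullptr);
    }

    void set(unsigned id)
    {
        if (has_value_ && id == current_) return;  // cheap check instead of an API call
        bind_(id);
        current_ = id;
        has_value_ = true;
    }

private:
    std::function<void(unsigned)> bind_;
    unsigned current_ = 0;
    bool has_value_ = false;
};

// Sorts dips so that items with the same shader, then the same texture, are adjacent.
void sort_by_state(std::vector<DrawItem>& items)
{
    std::sort(items.begin(), items.end(), [](const DrawItem& a, const DrawItem& b) {
        if (a.shader_id != b.shader_id) return a.shader_id < b.shader_id;
        return a.texture_id < b.texture_id;
    });
}
```

The shader is used as the primary sort key because glUseProgram is the more expensive of the two calls in Table 2. After sorting, four dips that use two shaders trigger two shader binds instead of up to four.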

Some general advice:

Find the bottlenecks and optimize them first.
You need to know what limits performance, the CPU or the GPU, and optimize that.
Don't do work twice. Reuse the results of different passes and of previous frames (reprojection techniques, sorting, tracing, anything).
Complex calculations can be precalculated.
The best optimization is not to do the work at all.
Use parallel calculations: split the work into parts and run them on parallel threads.

Source code of all examples. [attachment=34883:GL_API_overhead.rar]


Anatoliy Gerlits February 2017


Comments

smr

There is a lot of good information here. Maybe if someone knowledgeable on the subject could assist the author in clarifying some of the grammar it would be a nice addition to the articles section.

February 23, 2017 05:34 PM

jpetrie

In any article that involves benchmarks I think it is critical to provide some explanation of how the benchmark works without having to download and extract and read through code. In this case, how are you measuring the "overhead" of these operations?

Many of the tables used also lack headers or identifying metrics making them difficult to parse.

February 23, 2017 06:11 PM

Matias Goldberg

I was going to say something similar to Josh.

The recommendations in "Recommendations for optimization and conclusions" are all correct except for using GL_MAP_COHERENT_BIT. One should not use it, but rather indicate the dirty regions to the driver with glFlushMappedBufferRange.

However all that precedes is not only confusing, but mixing a lot of stuff. A draw call can have many parts and thus many bottlenecks:

CPU side: The CPU overhead itself, from validating the command and preparing a command to send to the GPU.
GPU side: Wavefront occupation: Rendering a million cubes with 36 vertices each is going to waste a lot of GPU cores doing nothing. If one is not careful, you may be measuring GPU bottlenecks when you think you're measuring draw call cost (see GTruc's).
GPU side: pixel occupation (pixel shader's wavefront occupation): If the triangles being rendered occupy only a few pixels, you are going to waste a lot of GPU cycles on helper pixels.

Same happens with basically everything. When you switch a shader, you may be measuring the cost of the CPU driver, but you may also be measuring the cost of the GPU halting work execution and swapping the shader (which in some cases may be able to avoid the halt, also depends on architecture, the wave occupancy of these shaders, etc).

There's a lot of moving pieces which makes the metric of call cost very useless. It gives you an overall "eyeball estimate", but that's all.

I had an inquiry not long ago where a user thought he was CPU bound (lowering resolution didn't improve framerate, enabling "Minimum Geometry" in NSIGHT didn't improve framerate, lowering quality didn't do anything). Everything pointed to a CPU bound problem. But closer inspection revealed he was actually GPU bound: he was submitting too few vertices per draw. One solution is to draw less (i.e. cull more aggressively, remove some objects). Another solution would be merge instancing; Ubisoft tackled the problem in AC Unity using a compute shader that computes the visibility of patches and then submits patches of vertices.

February 23, 2017 08:43 PM

_Wizard_

smr
>There is a lot of good information here
thanks
>clarifying some of the grammar
just learning english :) hoped moderators will fix some obvious problems)
+ also - there is something strange with text formatting. Don't know what I am doing wrong.

Josh Petrie
>how the benchmark works
yes, there is a benchmark in the downloadable file. I added it a bit after I had written the article.
Run the exe. The results will be written to benchmark_results.txt

I collected some very interesting information with this benchmark in another thread.
Here are some tests on different hardware (observations at the end):

CPU: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
GPU: AMD Radeon (TM) R9 380 Series

Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200
 
---States changing time:
SIMPLE_DIPS_TEST 0.20
FBO_CHANGE_TEST 1.44
SHADERS_CHANGE_TEST 2.56
VBO_CHANGE_TEST 0.76
ARRAY_OF_TEXTURES_TEST 3.05
TEXTURES_ARRAY_TEST 0.70
UNIFORMS_SIMPLE_CHANGE_TEST 1.06
UNIFORMS_SSBO_TEST 0.61

---API call cost:
glBindFramebuffer: 6.99 3424%
glUseProgram: 2.36 1157%
glBindVertexArray: 0.55 271%
glBindTexture: 0.47 232%
glDrawRangeElements: 0.20 100%
glUniform4fv: 0.09 41%

---Instancing time:
UBO_INSTANCING 0.15
TBO_INSTANCING 0.50
SSBO_INSTANCING 0.17
VBO_INSTANCING 0.15
TEXTURE_INSTANCING 0.18
UNIFORMS_INSTANCING 5.73
MULTI_DRAW_INDIRECT_INSTANCING 1.06


CPU: Intel(R) Core(TM) i3 CPU M 380 @ 2.53GHz
GPU: AMD Mobility Radeon HD 5000 Series

Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200

                                 win7            linux mint 17.3 + fglrx 15.2

---States changing time:
SIMPLE_DIPS_TEST                 0.41            0.60
FBO_CHANGE_TEST                  2.90            4.70
SHADERS_CHANGE_TEST              4.48            7.12
VBO_CHANGE_TEST                  1.13            1.57
ARRAY_OF_TEXTURES_TEST           4.15            5.79
TEXTURES_ARRAY_TEST              1.14            1.47
UNIFORMS_SIMPLE_CHANGE_TEST      1.24            2.30
UNIFORMS_SSBO_TEST               0.74            1.16

---API call cost:
glBindFramebuffer:               14.10  3439%    22.90  3841%
glUseProgram:                    4.07   992%     6.53   1094%
glBindVertexArray:               0.72   175%     0.97   162%
glBindTexture:                   0.62   152%     0.87   145%
glDrawRangeElements:             0.41   100%     0.60   100%
glUniform4fv:                    0.08   20%      0.17   28%

---Instancing time:
UBO_INSTANCING                   0.26            0.55
TBO_INSTANCING                   1.17            2.18
SSBO_INSTANCING                  0.31            0.82
VBO_INSTANCING                   0.22            0.61
TEXTURE_INSTANCING               0.28            0.81
UNIFORMS_INSTANCING              16.97           17.22


GPU: GeForce GTX 690/PCIe/SSE2

---States changing time:
SIMPLE_DIPS_TEST 0.13
FBO_CHANGE_TEST 1.71
SHADERS_CHANGE_TEST 2.16
VBO_CHANGE_TEST 0.22
ARRAY_OF_TEXTURES_TEST 1.12
TEXTURES_ARRAY_TEST 0.23
UNIFORMS_SIMPLE_CHANGE_TEST 0.73
UNIFORMS_SSBO_TEST 0.36

---API call cost:
glBindFramebuffer: 8.41 6630%
glUseProgram: 2.03 1604%
glBindVertexArray: 0.09 74%
glBindTexture: 0.17 131%
glDrawRangeElements: 0.13 100%
glUniform4fv: 0.06 47%

---Instancing time:
UBO_INSTANCING 2.45
TBO_INSTANCING 4.67
SSBO_INSTANCING 2.50
VBO_INSTANCING 2.45
TEXTURE_INSTANCING 26.09
UNIFORMS_INSTANCING 6.74

CPU: Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz
GPU: GeForce GTX 970/PCIe/SSE2
Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200

---States changing time:
SIMPLE_DIPS_TEST 0.11
FBO_CHANGE_TEST 1.28
SHADERS_CHANGE_TEST 2.04
VBO_CHANGE_TEST 0.22
ARRAY_OF_TEXTURES_TEST 1.04
TEXTURES_ARRAY_TEST 0.23
UNIFORMS_SIMPLE_CHANGE_TEST 0.75
UNIFORMS_SSBO_TEST 3.02

---API call cost:
glBindFramebuffer: 6.29 5481%
glUseProgram: 1.93 1679%
glBindVertexArray: 0.10 87%
glBindTexture: 0.15 134%
glDrawRangeElements: 0.11 100%
glUniform4fv: 0.06 55%

---Instancing time:
UBO_INSTANCING 2.38
TBO_INSTANCING 4.37
SSBO_INSTANCING 2.41
VBO_INSTANCING 2.38
TEXTURE_INSTANCING 2.40
UNIFORMS_INSTANCING 6.11


CPU: AMD FX(tm)-4350 Quad-Core Processor
GPU: AMD Radeon R7 200 Series
Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200

---States changing time:
SIMPLE_DIPS_TEST 0.27
FBO_CHANGE_TEST 3.74
SHADERS_CHANGE_TEST 4.06
VBO_CHANGE_TEST 1.27
ARRAY_OF_TEXTURES_TEST 4.89
TEXTURES_ARRAY_TEST 1.02
UNIFORMS_SIMPLE_CHANGE_TEST 1.54
UNIFORMS_SSBO_TEST 0.98

---API call cost:
glBindFramebuffer: 18.42 6835%
glUseProgram: 3.79 1407%
glBindVertexArray: 1.00 372%
glBindTexture: 0.77 285%
glDrawRangeElements: 0.27 100%
glUniform4fv: 0.13 47%

---Instancing time:
UBO_INSTANCING 0.68
TBO_INSTANCING 1.07
SSBO_INSTANCING 0.68
VBO_INSTANCING 0.56
TEXTURE_INSTANCING 0.45
UNIFORMS_INSTANCING 8.49


CPU: Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz
GPU: GeForce GTX 760/PCIe/SSE2
Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200

---States changing time:
SIMPLE_DIPS_TEST 0.13
FBO_CHANGE_TEST 1.12
SHADERS_CHANGE_TEST 1.40
VBO_CHANGE_TEST 0.22
ARRAY_OF_TEXTURES_TEST 0.96
TEXTURES_ARRAY_TEST 0.21
UNIFORMS_SIMPLE_CHANGE_TEST 0.72
UNIFORMS_SSBO_TEST 2.70

---API call cost:
glBindFramebuffer: 5.48 4135%
glUseProgram: 1.27 957%
glBindVertexArray: 0.09 68%
glBindTexture: 0.14 103%
glDrawRangeElements: 0.13 100%
glUniform4fv: 0.06 44%

---Instancing time:
UBO_INSTANCING 2.19
TBO_INSTANCING 5.60
SSBO_INSTANCING 2.23
VBO_INSTANCING 2.17
TEXTURE_INSTANCING 2.19
UNIFORMS_INSTANCING 5.93


CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
GPU: GeForce GTX 960/PCIe/SSE2

Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200

---States changing time:
SIMPLE_DIPS_TEST 0.08
FBO_CHANGE_TEST 0.97
SHADERS_CHANGE_TEST 1.07
VBO_CHANGE_TEST 0.15
ARRAY_OF_TEXTURES_TEST 0.74
TEXTURES_ARRAY_TEST 0.17
UNIFORMS_SIMPLE_CHANGE_TEST 0.56
UNIFORMS_SSBO_TEST 0.21

---API call cost:
glBindFramebuffer: 4.77 5823%
glUseProgram: 0.99 1203%
glBindVertexArray: 0.07 83%
glBindTexture: 0.11 133%
glDrawRangeElements: 0.08 100%
glUniform4fv: 0.05 58%

---Instancing time:
UBO_INSTANCING 1.90
TBO_INSTANCING 2.22
SSBO_INSTANCING 1.88
VBO_INSTANCING 1.84
TEXTURE_INSTANCING 1.87
UNIFORMS_INSTANCING 4.99


CPU: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
GPU: Intel(R) HD Graphics 4000
Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200

---States changing time:
SIMPLE_DIPS_TEST 0.18
FBO_CHANGE_TEST 0.56
SHADERS_CHANGE_TEST 2.49
VBO_CHANGE_TEST 0.73
ARRAY_OF_TEXTURES_TEST 2.41
TEXTURES_ARRAY_TEST 0.44
UNIFORMS_SIMPLE_CHANGE_TEST 0.61
UNIFORMS_SSBO_TEST 0.00

---API call cost:
glBindFramebuffer: 2.62 1440%
glUseProgram: 2.30 1267%
glBindVertexArray: 0.55 301%
glBindTexture: 0.37 204%
glDrawRangeElements: 0.18 100%
glUniform4fv: 0.04 23%

---Instancing time:
UBO_INSTANCING 0.05
TBO_INSTANCING 0.24
SSBO_INSTANCING 0.00
VBO_INSTANCING 0.10
TEXTURE_INSTANCING 0.07
UNIFORMS_INSTANCING 25.08


CPU: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
GPU: AMD Radeon R9 200 Series

Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200

---States changing time:
SIMPLE_DIPS_TEST 0.19
FBO_CHANGE_TEST 1.34
SHADERS_CHANGE_TEST 2.45
VBO_CHANGE_TEST 0.74
ARRAY_OF_TEXTURES_TEST 2.98
TEXTURES_ARRAY_TEST 0.71
UNIFORMS_SIMPLE_CHANGE_TEST 1.10
UNIFORMS_SSBO_TEST 0.65

---API call cost:
glBindFramebuffer: 6.49 3352%
glUseProgram: 2.25 1163%
glBindVertexArray: 0.55 283%
glBindTexture: 0.46 240%
glDrawRangeElements: 0.19 100%
glUniform4fv: 0.09 46%

---Instancing time:
UBO_INSTANCING 0.14
TBO_INSTANCING 0.52
SSBO_INSTANCING 0.16
VBO_INSTANCING 0.15
TEXTURE_INSTANCING 0.17
UNIFORMS_INSTANCING 5.57


CPU: Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz
GPU: GeForce GTX 760/PCIe/SSE2

Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200

---States changing time:
SIMPLE_DIPS_TEST 0.12
FBO_CHANGE_TEST 1.11
SHADERS_CHANGE_TEST 1.41
VBO_CHANGE_TEST 0.20
ARRAY_OF_TEXTURES_TEST 0.96
TEXTURES_ARRAY_TEST 0.21
UNIFORMS_SIMPLE_CHANGE_TEST 0.60
UNIFORMS_SSBO_TEST 0.20

---API call cost:
glBindFramebuffer: 5.42 4632%
glUseProgram: 1.29 1105%
glBindVertexArray: 0.09 74%
glBindTexture: 0.14 119%
glDrawRangeElements: 0.12 100%
glUniform4fv: 0.05 41%

---Instancing time:
UBO_INSTANCING 2.18
TBO_INSTANCING 5.39
SSBO_INSTANCING 2.25
VBO_INSTANCING 2.22
TEXTURE_INSTANCING 2.25
UNIFORMS_INSTANCING 6.00
MULTI_DRAW_INDIRECT_INSTANCING 0.57


CPU: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
GPU: GeForce GTX 660/PCIe/SSE2
Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200

---States changing time:
SIMPLE_DIPS_TEST 0.21
FBO_CHANGE_TEST 1.88
SHADERS_CHANGE_TEST 1.95
VBO_CHANGE_TEST 0.30
ARRAY_OF_TEXTURES_TEST 1.03
TEXTURES_ARRAY_TEST 0.30
UNIFORMS_SIMPLE_CHANGE_TEST 0.84
UNIFORMS_SSBO_TEST 0.29

---API call cost:
glBindFramebuffer: 9.17 4284%
glUseProgram: 1.74 811%
glBindVertexArray: 0.09 41%
glBindTexture: 0.14 63%
glDrawRangeElements: 0.21 100%
glUniform4fv: 0.06 29%

---Instancing time:
UBO_INSTANCING 2.17
TBO_INSTANCING 2.83
SSBO_INSTANCING 2.19
VBO_INSTANCING 2.17
TEXTURE_INSTANCING 2.25
UNIFORMS_INSTANCING 5.81
MULTI_DRAW_INDIRECT_INSTANCING 2.59


CPU: AMD FX(tm)-4350 Quad-Core Processor
GPU: AMD Radeon R7 200 Series
Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200

---States changing time:
SIMPLE_DIPS_TEST 0.24
FBO_CHANGE_TEST 2.18
SHADERS_CHANGE_TEST 3.99
VBO_CHANGE_TEST 1.35
ARRAY_OF_TEXTURES_TEST 4.86
TEXTURES_ARRAY_TEST 1.11
UNIFORMS_SIMPLE_CHANGE_TEST 1.59
UNIFORMS_SSBO_TEST 1.04

---API call cost:
glBindFramebuffer: 10.68 4424%
glUseProgram: 3.75 1552%
glBindVertexArray: 1.10 457%
glBindTexture: 0.77 319%
glDrawRangeElements: 0.24 100%
glUniform4fv: 0.13 55%

---Instancing time:
UBO_INSTANCING 0.63
TBO_INSTANCING 0.99
SSBO_INSTANCING 0.77
VBO_INSTANCING 0.81
TEXTURE_INSTANCING 0.61
UNIFORMS_INSTANCING 8.64
MULTI_DRAW_INDIRECT_INSTANCING 1.71


-------------------------------------
observations
GPU: GeForce GTX 690/PCIe/SSE2
glBindVertexArray: 0.09 74%
glBindTexture: 0.17 131%
instancing 2.5 ms+

GPU: GeForce GTX 970/PCIe/SSE2
glBindVertexArray: 0.10 87%
glBindTexture: 0.15 134%
instancing 2.38 ms+

GPU: GeForce GTX 760/PCIe/SSE2
glBindVertexArray: 0.09 68%
glBindTexture: 0.14 103%
instancing 2.2 ms+

GPU: AMD Radeon R7 200 Series
glBindVertexArray: 1.00 372%
glBindTexture: 0.77 285%
Cheap instancing 0.5-0.7 ms

GPU: AMD Mobility Radeon HD 5000 Series
glBindVertexArray: 0.72 175% 0.97 162%
glBindTexture: 0.62 152% 0.87 145%
instancing 0.2-0.3 ms

GPU: AMD Radeon (TM) R9 380 Series
glBindVertexArray: 0.55 268%
glBindTexture: 0.48 232%
instancing 0.15-0.2 ms

On GeForce cards glBindVertexArray and glBindTexture are much cheaper than on Radeons,
but at the same time instancing is 5-10 times more expensive.
Other measurements are roughly the same.

All of the following instancing types have roughly the same speed: UBO, SSBO, VBO, TEXTURE...
TBO is more expensive.

The purpose of this article is to understand the best way to optimize the engine.
Obviously, changing the render target and changing shaders are the most expensive operations.
It is also a good idea to gather textures into texture arrays and use indexing to get the right texture in the shader.
Same for geometry: it is possibly a good idea to merge VBOs into one, at least for a bunch of debris, buildings...
glDraw* itself is pretty cheap: around 0.1-0.25 ms on current hardware for 1k simple dips,
at least if you don't change other states.
glMultiDrawIndirect is awesome.
Hope it helped a bit.

February 23, 2017 08:44 PM

_Wizard_

Matias Goldberg
>A draw call can have many parts and thus many bottlenecks
Yes, and I show that lots of state changes are one of them.
Driver overhead is a big problem.
The new APIs (DX12, Vulkan, Metal, new GL extensions) are all about how to minimize driver overhead.

With all these measurements one can easily calculate the potential win from certain improvements.
Here are some examples:
1. Gathering all textures of different characters into a texture atlas at session start.
~100 characters * ~10 body parts/clothes/decals each (all parts are customizable) = 1k glBindTexture calls!
~0.15-0.5 ms improvement!
I did this in a project and it was very useful.

2. There are shaders with lots of options. Each option produces new, different shaders.
If you have 5-7 such options, that gives 2^5 - 2^7 = 32-128 shaders just for one object.
OK, there is a trade-off between CPU work and shader branching, but anyway,
we still have the problem of shader switches.

3. One tank has ~100 different parts. That is 100 dips. * 32 tanks in the session = holy...
OK, not all of them are visible and LODs have fewer parts, but anyway...
What about rendering through bones in 1 dip per tank? It might save you a lot... up to 1 ms.
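
The arithmetic behind these three examples can be written out as a small sketch. The helper names are hypothetical, and the per-1k-call costs passed in are only the rough ranges quoted in this thread (about 0.15-0.5 ms per 1k glBindTexture, about 0.1-0.25 ms per 1k simple dips), so the results are estimates of CPU time, not measurements.

```cpp
#include <cassert>

// Estimated CPU cost in ms of `calls` API calls,
// given a measured cost per 1000 calls.
constexpr double costMs(int calls, double msPer1k) {
    return calls * msPer1k / 1000.0;
}

// Number of shader variants produced by `options` independent on/off options.
constexpr int shaderVariants(int options) {
    return 1 << options;
}

// Example 1: ~100 characters * ~10 textured parts each.
constexpr int textureBinds = 100 * 10;   // 1000 glBindTexture calls

// Example 3: ~100 parts per tank * 32 tanks, one dip per part.
constexpr int tankDips = 100 * 32;       // 3200 dips
```

With these numbers, costMs(tankDips, 0.25) comes to about 0.8 ms and costMs(tankDips, 0.1) to about 0.32 ms, which matches the "up to 1 ms" estimate if the 3200 dips are reduced to 32.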

February 23, 2017 09:43 PM

jpetrie

I agree with Matias; the conclusions here are (mostly) fine. But I think the method by which they are obtained seems sloppy and poorly presented. In particular, it's critical to point out that all you are measuring is CPU-side work, here. Without that caveat your timings can easily be misconstrued.

February 23, 2017 11:45 PM

_Wizard_

Matias Goldberg
Josh Petrie
>it's critical to point out that all you are measuring is CPU-side work, here
agree, it is a bit incorrect to measure just the CPU, especially for instancing.
But anyway, it gives a rough estimate, and I write both CPU and GPU models to the log.

Maybe some of the presented measurements are incorrect due to several things I observed while collecting them:
1. Early tests were run in debug mode (not debug compile options, just the mode in which GL catches errors and warnings).
GL spams notices to debug.txt, which affects overall performance. So I disabled it.

2. I changed drawIndirect a bit; it was implemented slightly incorrectly.
Its relative performance (compared with the instancing timings) depends hugely on the number of iterations.

3. The right way to get correct performance measurements is to capture the overall frame time,
even if that includes some extra things like clearing the back buffer and FBO binding.
In earlier versions of the benchmark I measured just the test itself, which is not correct.
FPS gives a correct estimate of how much things cost in total.

4. Aggregating too few frames leads to large inaccuracy.
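
Points 3 and 4 amount to timing many whole frames and dividing by the frame count. A minimal sketch of that idea, assuming a hypothetical `renderFrame` callback that does everything a frame does (clear, FBO binding, the test itself, buffer swap); it is not the code from the attached benchmark and it measures CPU-side wall time only:

```cpp
#include <cassert>
#include <chrono>

// Average time per frame in ms:
// (time after N iterations - time when the first iteration started) / N
template <typename Frame>
double averageFrameMs(int iterations, Frame&& renderFrame) {
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        renderFrame();  // hypothetical: the whole frame, not just the test
    }
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count() / iterations;
}
```

Calling it with a few hundred to a thousand iterations smooths out the launch-to-launch variation that a small frame count would show.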

February 24, 2017 08:23 AM

jbadams

Editorial Note: We have made some clean-ups to the formatting and grammar in this article, though we would recommend the author may wish to incorporate small changes based on the feedback from other community members.

March 02, 2017 11:47 AM

MAnd

This is the second pretty promising article this author has published here in a row. Nice. However, it's also the second time that he publishes it focusing on profiling comparisons but... without showing the code used for profiling. I really do think that the author - actually all authors, of all articles here and elsewhere - should be required to show how the performance was measured. Without that, performance comparisons are just words in the wind.

March 03, 2017 07:52 PM

_Wizard_

Updated the article:

- Added Measurements (on the top of the article)

- Timings changed a bit as I moved to whole-frame time calculation.

- Instancing measurements: added gpu timing, several tests to find out how timing scales according to number of instances.

What I get with updated code:


CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz
GPU: AMD Radeon (TM) R9 380 Series

Parameters: CURRENT_NUM_INSTANCES 1000   NUM_FBO_CHANGES 200   INSTANCING_NUM_ITERATIONS 100. Time in ms.

---States changing time:
SIMPLE_DIPS_TEST                 0.41
FBO_CHANGE_TEST                  1.97
SHADERS_CHANGE_TEST              2.90
VBO_CHANGE_TEST                  0.95
ARRAY_OF_TEXTURES_TEST           3.27
TEXTURES_ARRAY_TEST              0.87
UNIFORMS_SIMPLE_CHANGE_TEST      1.27
UNIFORMS_SSBO_TEST               0.80

---API call cost:
glBindFramebuffer:               9.44   2314% 
glUseProgram:                    2.49   610% 
glBindVertexArray:               0.54   132% 
glBindTexture:                   0.48   116% 
glDrawRangeElements:             0.41   100% 
glUniform4fv:                    0.09   21% 

---Instancing time:
cpu time (gpu time)
num instances                       50            100           200
UBO_INSTANCING                   0.35 (0.10)   0.37 (0.13)   0.36 (0.24)
TBO_INSTANCING                   0.72 (0.11)   0.73 (0.13)   0.73 (0.25)
SSBO_INSTANCING                  0.37 (0.09)   0.40 (0.13)   0.38 (0.24)
VBO_INSTANCING                   0.36 (0.09)   0.37 (0.12)   0.37 (0.24)
TEXTURE_INSTANCING               0.38 (0.10)   0.39 (0.13)   0.39 (0.24)
UNIFORMS_INSTANCING              0.41 (0.13)   0.52 (0.27)   0.74 (0.51)
MULTI_DRAW_INDIRECT_INSTANCING   0.63 (0.53)   1.17 (1.01)   2.10 (1.93)

Post your measurements and write your thoughts.

March 04, 2017 10:30 PM


