_Wizard_

  1. OpenGL API Overhead

    Updated the article:
    - Added a Measurements section (at the top of the article).
    - Timings changed a bit, as I moved to whole-frame time calculation.
    - Instancing measurements: added GPU timing and several tests to find out how timing scales with the number of instances.

    What I get with the updated code:

    CPU: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
    GPU: AMD Radeon (TM) R9 380 Series
    Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200 INSTANCING_NUM_ITERATIONS 100
    Time in ms.

    ---States changing time:
    SIMPLE_DIPS_TEST 0.41
    FBO_CHANGE_TEST 1.97
    SHADERS_CHANGE_TEST 2.90
    VBO_CHANGE_TEST 0.95
    ARRAY_OF_TEXTURES_TEST 3.27
    TEXTURES_ARRAY_TEST 0.87
    UNIFORMS_SIMPLE_CHANGE_TEST 1.27
    UNIFORMS_SSBO_TEST 0.80

    ---API call cost:
    glBindFramebuffer: 9.44 2314%
    glUseProgram: 2.49 610%
    glBindVertexArray: 0.54 132%
    glBindTexture: 0.48 116%
    glDrawRangeElements: 0.41 100%
    glUniform4fv: 0.09 21%

    ---Instancing time: cpu time (gpu time)
    num instances                    50            100           200
    UBO_INSTANCING                   0.35 (0.10)   0.37 (0.13)   0.36 (0.24)
    TBO_INSTANCING                   0.72 (0.11)   0.73 (0.13)   0.73 (0.25)
    SSBO_INSTANCING                  0.37 (0.09)   0.40 (0.13)   0.38 (0.24)
    VBO_INSTANCING                   0.36 (0.09)   0.37 (0.12)   0.37 (0.24)
    TEXTURE_INSTANCING               0.38 (0.10)   0.39 (0.13)   0.39 (0.24)
    UNIFORMS_INSTANCING              0.41 (0.13)   0.52 (0.27)   0.74 (0.51)
    MULTI_DRAW_INDIRECT_INSTANCING   0.63 (0.53)   1.17 (1.01)   2.10 (1.93)

    Post your measurements and write your thoughts.
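The "API call cost" figures in the listing above appear consistent with subtracting the cost of the draw calls from each test time and scaling to 1000 calls. The sketch below is only my reading of the numbers, not the benchmark's actual source; `cost_per_1k_calls_ms` and its parameters are hypothetical names.

```cpp
#include <cassert>

// Estimated cost (ms) of 1000 calls of one state-changing API function.
// test_ms               - measured time of the whole test
// dips_1k_ms            - measured time of 1000 plain draw calls (SIMPLE_DIPS_TEST)
// dips_in_test          - how many draw calls the test issued
// state_changes_in_test - how many times the measured API function was called
double cost_per_1k_calls_ms(double test_ms, double dips_1k_ms,
                            int dips_in_test, int state_changes_in_test) {
    assert(dips_in_test >= 0 && state_changes_in_test > 0);
    // Remove the share of the time spent in the draw calls themselves.
    const double draw_ms = dips_1k_ms * dips_in_test / 1000.0;
    return (test_ms - draw_ms) / state_changes_in_test * 1000.0;
}
```

With the R9 380 numbers, `cost_per_1k_calls_ms(1.97, 0.41, 200, 200)` comes out at about 9.44 (glBindFramebuffer) and `cost_per_1k_calls_ms(2.90, 0.41, 1000, 1000)` at about 2.49 (glUseProgram). The glBindTexture and glUniform4fv rows match only if one assumes 6 texture binds and roughly 10 uniform updates per dip; those per-dip counts are assumptions.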
  2. OpenGL API Overhead

    Matias Goldberg, Josh Petrie
    >it's critical to point out that all you are measuring is CPU-side work, here
    Agreed, it is a bit incorrect to measure just the CPU, especially for instancing. But anyway, it gives a rough estimate, and I write both the CPU and GPU models to the log.
    Some of the presented measurements may be incorrect because of several things I observed while collecting them:
    1. Early tests were run in debug mode (not debug compile options, just the mode in which GL catches errors and warnings). GL spams notices into debug.txt, which affects overall performance, so I disabled it.
    2. I changed drawIndirect a bit, as it was implemented slightly incorrectly. Its performance relative to the instancing timings depends hugely on the number of iterations.
    3. The right way to measure performance correctly is to capture the overall frame time, even if it includes some extra things like clearing the back buffer and FBO binding. In earlier versions of the benchmark I measured just the test itself, which is not correct. FPS gives a correct estimate of how much things cost in total.
    4. Aggregating over too few frames leads to large inaccuracy.
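Points 3 and 4 above (time whole frames, average over many of them) can be sketched as follows. This is a minimal illustration, not the benchmark's code: `average_frame_time_ms` and `render_frame` are hypothetical names, and `std::chrono::steady_clock` is my choice of timer.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Average wall-clock time of one whole frame, in milliseconds.
// render_frame should do everything a frame does (clear, bind FBO, test, swap),
// so that the average reflects the total cost, not just the test body.
double average_frame_time_ms(const std::function<void()>& render_frame, int iterations) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();          // time when the first iteration starts
    for (int i = 0; i < iterations; ++i)
        render_frame();
    const auto end = clock::now();            // time after N iterations
    return std::chrono::duration<double, std::milli>(end - start).count() / iterations;
}
```

A call such as `average_frame_time_ms(draw_one_frame, 1000)` would then give one averaged number per test; a small iteration count makes the result noisy from launch to launch.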
  3. OpenGL API Overhead

    Matias Goldberg
    >A draw call can have many parts and thus many bottlenecks
    Yes, and I show that a large number of state changes is one of them. Driver overhead is a big problem, and the new APIs (DX12, Vulkan, Metal, new GL extensions) are all about how to minimize it.
    With all these measurements one can easily calculate the potential win from certain improvements. Some examples:
    1. Gathering all textures of different characters into a texture atlas at session start. ~100 characters * ~10 body parts/clothes/decals each (all parts are customizable) = 1k glBindTexture calls! That is a ~0.15-0.5 ms improvement! I have done this in a project, and it was very useful.
    2. There are shaders with lots of options. Each option produces new DIFFERENT shaders. If you have 5-7 such options, that gives you 2^5 - 2^7 = 32-128 shaders just for one object. OK, there is the trade-off of CPU work vs. shader branching, but anyway, we still have the problem of shader switches.
    3. One tank has ~100 different parts. That is 100 dips. * 32 tanks in the session = holy... OK, not all of them are visible, and LODs have fewer parts, but anyway... What about rendering through bones in 1 dip per tank? It might save you a lot... up to 1 ms.
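The arithmetic behind these estimates is simple enough to put into two helper functions. Both names are hypothetical, and the per-1k-call costs passed in are whatever the benchmark reported for the hardware in question.

```cpp
#include <cassert>

// Estimated CPU time (ms) saved by removing `calls_removed` API calls,
// given a measured cost per 1000 calls of that API function.
double estimated_saving_ms(int calls_removed, double cost_per_1k_calls_ms) {
    assert(calls_removed >= 0 && cost_per_1k_calls_ms >= 0.0);
    return calls_removed / 1000.0 * cost_per_1k_calls_ms;
}

// Number of distinct shader variants produced by `options` independent on/off switches.
unsigned long long shader_variants(unsigned options) {
    assert(options < 64);
    return 1ULL << options;
}
```

For example 1, removing 100 * 10 = 1000 glBindTexture calls at a measured 0.15-0.48 ms per 1k calls gives the quoted range directly. For example 2, `shader_variants(5)` is 32 and `shader_variants(7)` is 128.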
  4. OpenGL API Overhead

    smr
    >There is a lot of good information here
    Thanks.
    >clarifying some of the grammar
    I am just learning English :) I hoped the moderators would fix some of the obvious problems. Also, there is something strange with the text formatting. I don't know what I am doing wrong.

    Josh Petrie
    >how the benchmark works
    Yes, there is a benchmark in the downloadable file. I added it a bit after I had written the article. Run the exe; the result will be written to benchmark_results.txt.
    There is very interesting information that I collected through this benchmark in another thread. Here are some tests on different hardware (observations at the end):

    CPU: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
    GPU: AMD Radeon (TM) R9 380 Series
    Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200
    ---States changing time:
    SIMPLE_DIPS_TEST 0.20
    FBO_CHANGE_TEST 1.44
    SHADERS_CHANGE_TEST 2.56
    VBO_CHANGE_TEST 0.76
    ARRAY_OF_TEXTURES_TEST 3.05
    TEXTURES_ARRAY_TEST 0.70
    UNIFORMS_SIMPLE_CHANGE_TEST 1.06
    UNIFORMS_SSBO_TEST 0.61
    ---API call cost:
    glBindFramebuffer: 6.99 3424%
    glUseProgram: 2.36 1157%
    glBindVertexArray: 0.55 271%
    glBindTexture: 0.47 232%
    glDrawRangeElements: 0.20 100%
    glUniform4fv: 0.09 41%
    ---Instancing time:
    UBO_INSTANCING 0.15
    TBO_INSTANCING 0.50
    SSBO_INSTANCING 0.17
    VBO_INSTANCING 0.15
    TEXTURE_INSTANCING 0.18
    UNIFORMS_INSTANCING 5.73
    MULTI_DRAW_INDIRECT_INSTANCING 1.06

    CPU: Intel(R) Core(TM) i3 CPU M 380 @ 2.53GHz
    GPU: AMD Mobility Radeon HD 5000 Series
    Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200
    Columns: win7 | linux mint 17.3 + fglrx 15.2
    ---States changing time:
    SIMPLE_DIPS_TEST 0.41 | 0.60
    FBO_CHANGE_TEST 2.90 | 4.70
    SHADERS_CHANGE_TEST 4.48 | 7.12
    VBO_CHANGE_TEST 1.13 | 1.57
    ARRAY_OF_TEXTURES_TEST 4.15 | 5.79
    TEXTURES_ARRAY_TEST 1.14 | 1.47
    UNIFORMS_SIMPLE_CHANGE_TEST 1.24 | 2.30
    UNIFORMS_SSBO_TEST 0.74 | 1.16
    ---API call cost:
    glBindFramebuffer: 14.10 3439% | 22.90 3841%
    glUseProgram: 4.07 992% | 6.53 1094%
    glBindVertexArray: 0.72 175% | 0.97 162%
    glBindTexture: 0.62 152% | 0.87 145%
    glDrawRangeElements: 0.41 100% | 0.60 100%
    glUniform4fv: 0.08 20% | 0.17 28%
    ---Instancing time:
    UBO_INSTANCING 0.26 | 0.55
    TBO_INSTANCING 1.17 | 2.18
    SSBO_INSTANCING 0.31 | 0.82
    VBO_INSTANCING 0.22 | 0.61
    TEXTURE_INSTANCING 0.28 | 0.81
    UNIFORMS_INSTANCING 16.97 | 17.22

    (The CPU/GPU header of the next block is missing from the post; the observations at the end list the same glBindVertexArray and glBindTexture values under GeForce GTX 690/PCIe/SSE2.)
    ---States changing time:
    SIMPLE_DIPS_TEST 0.13
    FBO_CHANGE_TEST 1.71
    SHADERS_CHANGE_TEST 2.16
    VBO_CHANGE_TEST 0.22
    ARRAY_OF_TEXTURES_TEST 1.12
    TEXTURES_ARRAY_TEST 0.23
    UNIFORMS_SIMPLE_CHANGE_TEST 0.73
    UNIFORMS_SSBO_TEST 0.36
    ---API call cost:
    glBindFramebuffer: 8.41 6630%
    glUseProgram: 2.03 1604%
    glBindVertexArray: 0.09 74%
    glBindTexture: 0.17 131%
    glDrawRangeElements: 0.13 100%
    glUniform4fv: 0.06 47%
    ---Instancing time:
    UBO_INSTANCING 2.45
    TBO_INSTANCING 4.67
    SSBO_INSTANCING 2.50
    VBO_INSTANCING 2.45
    TEXTURE_INSTANCING 26.09
    UNIFORMS_INSTANCING 6.74

    CPU: Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz
    GPU: GeForce GTX 970/PCIe/SSE2
    Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200
    ---States changing time:
    SIMPLE_DIPS_TEST 0.11
    FBO_CHANGE_TEST 1.28
    SHADERS_CHANGE_TEST 2.04
    VBO_CHANGE_TEST 0.22
    ARRAY_OF_TEXTURES_TEST 1.04
    TEXTURES_ARRAY_TEST 0.23
    UNIFORMS_SIMPLE_CHANGE_TEST 0.75
    UNIFORMS_SSBO_TEST 3.02
    ---API call cost:
    glBindFramebuffer: 6.29 5481%
    glUseProgram: 1.93 1679%
    glBindVertexArray: 0.10 87%
    glBindTexture: 0.15 134%
    glDrawRangeElements: 0.11 100%
    glUniform4fv: 0.06 55%
    ---Instancing time:
    UBO_INSTANCING 2.38
    TBO_INSTANCING 4.37
    SSBO_INSTANCING 2.41
    VBO_INSTANCING 2.38
    TEXTURE_INSTANCING 2.40
    UNIFORMS_INSTANCING 6.11

    CPU: AMD FX(tm)-4350 Quad-Core Processor
    GPU: AMD Radeon R7 200 Series
    Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200
    ---States changing time:
    SIMPLE_DIPS_TEST 0.27
    FBO_CHANGE_TEST 3.74
    SHADERS_CHANGE_TEST 4.06
    VBO_CHANGE_TEST 1.27
    ARRAY_OF_TEXTURES_TEST 4.89
    TEXTURES_ARRAY_TEST 1.02
    UNIFORMS_SIMPLE_CHANGE_TEST 1.54
    UNIFORMS_SSBO_TEST 0.98
    ---API call cost:
    glBindFramebuffer: 18.42 6835%
    glUseProgram: 3.79 1407%
    glBindVertexArray: 1.00 372%
    glBindTexture: 0.77 285%
    glDrawRangeElements: 0.27 100%
    glUniform4fv: 0.13 47%
    ---Instancing time:
    UBO_INSTANCING 0.68
    TBO_INSTANCING 1.07
    SSBO_INSTANCING 0.68
    VBO_INSTANCING 0.56
    TEXTURE_INSTANCING 0.45
    UNIFORMS_INSTANCING 8.49

    CPU: Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz
    GPU: GeForce GTX 760/PCIe/SSE2
    Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200
    ---States changing time:
    SIMPLE_DIPS_TEST 0.13
    FBO_CHANGE_TEST 1.12
    SHADERS_CHANGE_TEST 1.40
    VBO_CHANGE_TEST 0.22
    ARRAY_OF_TEXTURES_TEST 0.96
    TEXTURES_ARRAY_TEST 0.21
    UNIFORMS_SIMPLE_CHANGE_TEST 0.72
    UNIFORMS_SSBO_TEST 2.70
    ---API call cost:
    glBindFramebuffer: 5.48 4135%
    glUseProgram: 1.27 957%
    glBindVertexArray: 0.09 68%
    glBindTexture: 0.14 103%
    glDrawRangeElements: 0.13 100%
    glUniform4fv: 0.06 44%
    ---Instancing time:
    UBO_INSTANCING 2.19
    TBO_INSTANCING 5.60
    SSBO_INSTANCING 2.23
    VBO_INSTANCING 2.17
    TEXTURE_INSTANCING 2.19
    UNIFORMS_INSTANCING 5.93

    CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
    GPU: GeForce GTX 960/PCIe/SSE2
    Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200
    ---States changing time:
    SIMPLE_DIPS_TEST 0.08
    FBO_CHANGE_TEST 0.97
    SHADERS_CHANGE_TEST 1.07
    VBO_CHANGE_TEST 0.15
    ARRAY_OF_TEXTURES_TEST 0.74
    TEXTURES_ARRAY_TEST 0.17
    UNIFORMS_SIMPLE_CHANGE_TEST 0.56
    UNIFORMS_SSBO_TEST 0.21
    ---API call cost:
    glBindFramebuffer: 4.77 5823%
    glUseProgram: 0.99 1203%
    glBindVertexArray: 0.07 83%
    glBindTexture: 0.11 133%
    glDrawRangeElements: 0.08 100%
    glUniform4fv: 0.05 58%
    ---Instancing time:
    UBO_INSTANCING 1.90
    TBO_INSTANCING 2.22
    SSBO_INSTANCING 1.88
    VBO_INSTANCING 1.84
    TEXTURE_INSTANCING 1.87
    UNIFORMS_INSTANCING 4.99

    CPU: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
    GPU: Intel(R) HD Graphics 4000
    Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200
    ---States changing time:
    SIMPLE_DIPS_TEST 0.18
    FBO_CHANGE_TEST 0.56
    SHADERS_CHANGE_TEST 2.49
    VBO_CHANGE_TEST 0.73
    ARRAY_OF_TEXTURES_TEST 2.41
    TEXTURES_ARRAY_TEST 0.44
    UNIFORMS_SIMPLE_CHANGE_TEST 0.61
    UNIFORMS_SSBO_TEST 0.00
    ---API call cost:
    glBindFramebuffer: 2.62 1440%
    glUseProgram: 2.30 1267%
    glBindVertexArray: 0.55 301%
    glBindTexture: 0.37 204%
    glDrawRangeElements: 0.18 100%
    glUniform4fv: 0.04 23%
    ---Instancing time:
    UBO_INSTANCING 0.05
    TBO_INSTANCING 0.24
    SSBO_INSTANCING 0.00
    VBO_INSTANCING 0.10
    TEXTURE_INSTANCING 0.07
    UNIFORMS_INSTANCING 25.08

    CPU: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
    GPU: AMD Radeon R9 200 Series
    Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200
    ---States changing time:
    SIMPLE_DIPS_TEST 0.19
    FBO_CHANGE_TEST 1.34
    SHADERS_CHANGE_TEST 2.45
    VBO_CHANGE_TEST 0.74
    ARRAY_OF_TEXTURES_TEST 2.98
    TEXTURES_ARRAY_TEST 0.71
    UNIFORMS_SIMPLE_CHANGE_TEST 1.10
    UNIFORMS_SSBO_TEST 0.65
    ---API call cost:
    glBindFramebuffer: 6.49 3352%
    glUseProgram: 2.25 1163%
    glBindVertexArray: 0.55 283%
    glBindTexture: 0.46 240%
    glDrawRangeElements: 0.19 100%
    glUniform4fv: 0.09 46%
    ---Instancing time:
    UBO_INSTANCING 0.14
    TBO_INSTANCING 0.52
    SSBO_INSTANCING 0.16
    VBO_INSTANCING 0.15
    TEXTURE_INSTANCING 0.17
    UNIFORMS_INSTANCING 5.57

    CPU: Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz
    GPU: GeForce GTX 760/PCIe/SSE2
    Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200
    ---States changing time:
    SIMPLE_DIPS_TEST 0.12
    FBO_CHANGE_TEST 1.11
    SHADERS_CHANGE_TEST 1.41
    VBO_CHANGE_TEST 0.20
    ARRAY_OF_TEXTURES_TEST 0.96
    TEXTURES_ARRAY_TEST 0.21
    UNIFORMS_SIMPLE_CHANGE_TEST 0.60
    UNIFORMS_SSBO_TEST 0.20
    ---API call cost:
    glBindFramebuffer: 5.42 4632%
    glUseProgram: 1.29 1105%
    glBindVertexArray: 0.09 74%
    glBindTexture: 0.14 119%
    glDrawRangeElements: 0.12 100%
    glUniform4fv: 0.05 41%
    ---Instancing time:
    UBO_INSTANCING 2.18
    TBO_INSTANCING 5.39
    SSBO_INSTANCING 2.25
    VBO_INSTANCING 2.22
    TEXTURE_INSTANCING 2.25
    UNIFORMS_INSTANCING 6.00
    MULTI_DRAW_INDIRECT_INSTANCING 0.57

    CPU: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
    GPU: GeForce GTX 660/PCIe/SSE2
    Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200
    ---States changing time:
    SIMPLE_DIPS_TEST 0.21
    FBO_CHANGE_TEST 1.88
    SHADERS_CHANGE_TEST 1.95
    VBO_CHANGE_TEST 0.30
    ARRAY_OF_TEXTURES_TEST 1.03
    TEXTURES_ARRAY_TEST 0.30
    UNIFORMS_SIMPLE_CHANGE_TEST 0.84
    UNIFORMS_SSBO_TEST 0.29
    ---API call cost:
    glBindFramebuffer: 9.17 4284%
    glUseProgram: 1.74 811%
    glBindVertexArray: 0.09 41%
    glBindTexture: 0.14 63%
    glDrawRangeElements: 0.21 100%
    glUniform4fv: 0.06 29%
    ---Instancing time:
    UBO_INSTANCING 2.17
    TBO_INSTANCING 2.83
    SSBO_INSTANCING 2.19
    VBO_INSTANCING 2.17
    TEXTURE_INSTANCING 2.25
    UNIFORMS_INSTANCING 5.81
    MULTI_DRAW_INDIRECT_INSTANCING 2.59

    CPU: AMD FX(tm)-4350 Quad-Core Processor
    GPU: AMD Radeon R7 200 Series
    Parameters: CURRENT_NUM_INSTANCES 1000 NUM_FBO_CHANGES 200
    ---States changing time:
    SIMPLE_DIPS_TEST 0.24
    FBO_CHANGE_TEST 2.18
    SHADERS_CHANGE_TEST 3.99
    VBO_CHANGE_TEST 1.35
    ARRAY_OF_TEXTURES_TEST 4.86
    TEXTURES_ARRAY_TEST 1.11
    UNIFORMS_SIMPLE_CHANGE_TEST 1.59
    UNIFORMS_SSBO_TEST 1.04
    ---API call cost:
    glBindFramebuffer: 10.68 4424%
    glUseProgram: 3.75 1552%
    glBindVertexArray: 1.10 457%
    glBindTexture: 0.77 319%
    glDrawRangeElements: 0.24 100%
    glUniform4fv: 0.13 55%
    ---Instancing time:
    UBO_INSTANCING 0.63
    TBO_INSTANCING 0.99
    SSBO_INSTANCING 0.77
    VBO_INSTANCING 0.81
    TEXTURE_INSTANCING 0.61
    UNIFORMS_INSTANCING 8.64
    MULTI_DRAW_INDIRECT_INSTANCING 1.71

    -------------------------------------
    Observations

    GPU: GeForce GTX 690/PCIe/SSE2
    glBindVertexArray: 0.09 74%
    glBindTexture: 0.17 131%
    instancing 2.5 ms+

    GPU: GeForce GTX 970/PCIe/SSE2
    glBindVertexArray: 0.10 87%
    glBindTexture: 0.15 134%
    instancing 2.38 ms+

    GPU: GeForce GTX 760/PCIe/SSE2
    glBindVertexArray: 0.09 68%
    glBindTexture: 0.14 103%
    instancing 2.2 ms+

    GPU: AMD Radeon R7 200 Series
    glBindVertexArray: 1.00 372%
    glBindTexture: 0.77 285%
    Cheap instancing 0.5-0.7 ms

    GPU: AMD Mobility Radeon HD 5000 Series
    glBindVertexArray: 0.72 175% | 0.97 162%
    glBindTexture: 0.62 152% | 0.87 145%
    instancing 0.2-0.3 ms

    GPU: AMD Radeon (TM) R9 380 Series
    glBindVertexArray: 0.55 268%
    glBindTexture: 0.48 232%
    instancing 0.15-0.2 ms

    - On GeForce cards glBindVertexArray and glBindTexture are much cheaper than on Radeons, but at the same time instancing is 5-10 times more expensive.
    - Other measurements are about the same.
    - All of the following instancing types have about the same speed: UBO, SSBO, VBO, TEXTURE... TBO is more expensive.

    The purpose of this article is to understand the best way to optimize the engine. Obviously, changing the render target and changing shaders are the most expensive operations. It is also a good idea to gather textures into texture arrays and use indexing to get the right texture in the shader. The same goes for geometry: it is possibly a good idea to unite VBOs into one, at least for bunches of debris, buildings... glDraw* itself is pretty cheap: around 0.1-0.25 ms on current hardware for 1k simple dips, at least if you don't change other states. glMultiDrawIndirect is awesome.
    Hope it helped a bit.
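The suggestion to unite VBOs and draw with glMultiDrawIndirect comes down to recording, for every mesh, where its indices and vertices start inside the shared buffers. A rough sketch of that bookkeeping follows. The command struct mirrors the field layout the OpenGL specification defines for indirect indexed draws; `MeshRange` and `build_indirect_commands` are hypothetical names, and no GL calls are made here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Field layout of one indirect indexed draw command (as in the OpenGL specification).
struct DrawElementsIndirectCommand {
    std::uint32_t count;          // number of indices of this mesh
    std::uint32_t instanceCount;  // number of instances to draw
    std::uint32_t firstIndex;     // offset (in indices) into the shared index buffer
    std::int32_t  baseVertex;     // value added to every index of this mesh
    std::uint32_t baseInstance;   // first instance id for this mesh
};

// Hypothetical description of one mesh that was appended to the shared buffers.
struct MeshRange {
    std::uint32_t vertex_count;
    std::uint32_t index_count;
};

// Builds one command per mesh, assuming the meshes were appended to the shared
// vertex/index buffers in the same order as they appear in `meshes`.
std::vector<DrawElementsIndirectCommand> build_indirect_commands(
        const std::vector<MeshRange>& meshes, std::uint32_t instances_per_mesh) {
    assert(instances_per_mesh > 0);
    std::vector<DrawElementsIndirectCommand> commands;
    commands.reserve(meshes.size());
    std::uint32_t first_index = 0;
    std::int32_t base_vertex = 0;
    std::uint32_t base_instance = 0;
    for (const MeshRange& mesh : meshes) {
        commands.push_back({mesh.index_count, instances_per_mesh,
                            first_index, base_vertex, base_instance});
        first_index += mesh.index_count;
        base_vertex += static_cast<std::int32_t>(mesh.vertex_count);
        base_instance += instances_per_mesh;
    }
    return commands;
}
```

The resulting array would be uploaded to a buffer bound to GL_DRAW_INDIRECT_BUFFER and submitted with a single glMultiDrawElementsIndirect call instead of one bind plus one draw per mesh.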
  5. OpenGL API Overhead

    Introduction

    In modern projects, to produce a nice-looking scene the engine renders thousands of different objects: characters, buildings, landscape, effects and more. Of course, there are several ways to render geometry on the screen. In this article, we consider how to do that effectively, and measure and compare the cost of rendering API calls. Note, however, that only the CPU load is measured, which may not be 100% accurate but can give an idea of the relative costs.

    We consider the cost of API calls:
    - state changes (frame buffers, vertex buffers, shaders, constants, textures)
    - different types of geometry instancing, comparing their performance
    - several practical examples of how one should optimize geometry rendering in projects.

    I will cover only the OpenGL API. I will not describe the details, parameters and variations of each API call; there are reference books and manuals for this purpose.

    Computer configuration for all tests: Intel Core i5-4460 3.2GHz, Radeon R9 380.

    Measurements

    The right way to measure performance correctly is to capture the overall frame time, even if it includes some extra things like clearing the back buffer and FBO binding. Exact equation:

    AVG_TEST_TIME = (CURRENT_TIME_AFTER_N_ITERATIONS - TIME_WHEN_WE_STARTED_FIRST_ITERATION) / NUMBER_OF_ITERATIONS

    In all calculations, time is in ms. Perform each test several times and calculate the average time. NUMBER_OF_ITERATIONS should be pretty large: 500-1000. Otherwise the difference from launch to launch is too large. When measuring instancing time we must be sure that the GPU is not the bottleneck, as we want to know just the CPU time and how it scales with the number of instances.

    States changing

    We want to see a 'rich' picture on the screen: a lot of unique objects with a lot of details. For this purpose the engine takes all objects visible to the camera, sets their parameters (vertex buffers, shaders, material parameters, textures, etc.) and sends them to render. All these actions are performed with special API commands.
    Let's consider them and run some tests to understand how to organize the rendering process optimally. Let's measure the cost of different OpenGL calls: dip (draw indexed primitive), change of shaders, vertex buffers, textures, shader parameters.

    Dips

    A dip (draw indexed primitive) is a command to the GPU to render a bunch of geometry, most often triangles. Of course, first we need to specify what geometry we want to show and with what shader, and set some options. But it is the dip that renders geometry; all other commands just describe the parameters of what we want to show. The price of a dip usually includes all related state changes, not just the one command. Of course, it all depends on the number of state changes.

    First, consider the simplest case: the cost of one thousand simple dips, without state changes.

        void simple_dips()
        {
            glBindVertexArray(ws_complex_geometry_vao_id); //what geometry to render
            simple_geometry_shader.bind(); //with what shader

            //a lot of simple dips
            for (int i = 0; i < CURRENT_NUM_INSTANCES; i++)
                glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i * BOX_NUM_INDICES * sizeof(int))); //simple dip
        }

    For 1k dips we get 0.41 ms.

    Frame buffer change

    An FBO (frame buffer object) is an object that allows rendering an image not to the screen but to another surface, which can later be used as a texture in shaders. The FBO changes less often than other elements, but at the same time the change is quite expensive for the CPU.

        void fbo_change_test()
        {
            //clear FBO
            glViewport(0, 0, window_width, window_height);
            glClearColor(0.0f / 255.0f, 0.0f / 255.0f, 0.0f / 255.0f, 0.0);
            for (int i = 0; i < NUM_DIFFERENT_FBOS; i++)
            {
                glBindFramebuffer(GL_FRAMEBUFFER, fbo_buffer[i % NUM_DIFFERENT_FBOS]);
                glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
            }

            //prepare dip
            glBindVertexArray(ws_complex_geometry_vao_id);
            simple_geometry_shader.bind();

            //bind FBO, render one object... repeat N times
            for (int i = 0; i < NUM_FBO_CHANGES; i++)
            {
                glBindFramebuffer(GL_FRAMEBUFFER, fbo_buffer[i % NUM_DIFFERENT_FBOS]); //bind fbo
                glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i * BOX_NUM_INDICES * sizeof(int))); //simple dip
            }

            glBindFramebuffer(GL_FRAMEBUFFER, 0); //set rendering to the screen
        }

    For 200 FBO changes we get 1.97 ms.

    One usually needs to change the FBO for post effects and different passes such as reflections, rendering into a cubemap, creating virtual textures, etc. Many things like virtual textures can be organized as atlases, so that the FBO is set only once and only, for example, the viewport changes. Rendering into a cubemap might be replaced by another technique, for example dual paraboloid rendering. The point, of course, is not only FBO changes but also the number of scene rendering passes, material changes, etc. In general, the fewer state changes the better.

    Shader changes

    Shaders usually describe one of the scene's materials or effect techniques. The more materials and kinds of surfaces, the more shaders. Several materials might vary only slightly; these should be combined into one, with the switch between them made a condition in the shader. The number of materials directly influences the number of dips.

        void shaders_change_test()
        {
            glBindVertexArray(ws_complex_geometry_vao_id);

            for (int i = 0; i < CURRENT_NUM_INSTANCES; i++)
            {
                simple_color_shader[i % NUM_DIFFERENT_SIMPLE_SHADERS].bind(); //bind certain shader
                glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i * BOX_NUM_INDICES * sizeof(int))); //simple dip
            }
        }

    For 1000 shader changes we get 2.90 ms. Changing the shader here also includes transferring the world-view-proj matrix as a parameter; otherwise we could not render anything. We measure the cost of changing parameters in the next step.
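The atlas idea from the frame buffer section (bind one FBO once and change only the viewport per tile) can be sketched as a small helper that computes the viewport rectangle of each tile. The names `Viewport` and `atlas_tile_viewport` are hypothetical, and square tiles laid out row by row are an assumption.

```cpp
#include <cassert>

struct Viewport { int x, y, width, height; };

// Viewport rectangle of tile `index` in an atlas of square tiles laid out row by row.
Viewport atlas_tile_viewport(int index, int tiles_per_row, int tile_size) {
    assert(index >= 0 && tiles_per_row > 0 && tile_size > 0);
    return { (index % tiles_per_row) * tile_size,   // column -> x offset in pixels
             (index / tiles_per_row) * tile_size,   // row    -> y offset in pixels
             tile_size, tile_size };
}
```

In a render loop the FBO holding the atlas would be bound once, and each pass would only call glViewport with the returned rectangle before drawing, avoiding a glBindFramebuffer per pass.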
    Shader parameters changing

    Materials are often made universal, with a lot of options, to get different kinds of materials. This is an easy way to get a varied picture and to make each character/object unique. We need to transfer these parameters to the shader somehow. This can be done with the glUniform* API commands.

        uniforms_changes_test_shader.bind();
        glBindVertexArray(ws_complex_geometry_vao_id);

        for (int i = 0; i < CURRENT_NUM_INSTANCES; i++)
        {
            //set uniforms for this dip
            for (int j = 0; j < NUM_UNIFORM_CHANGES_PER_DIP; j++)
                glUniform4fv(ColorShader_uniformLocation[j], 1, &randomColors[(i * NUM_UNIFORM_CHANGES_PER_DIP + j) % MAX_RANDOM_COLORS].x);

            glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i * BOX_NUM_INDICES * sizeof(int))); //simple dip
        }

    It is not optimal to set parameters individually for each instance/object. Usually all instance data can be packed into one large buffer and transferred to the GPU with one command. All that remains is to set, for each object, an offset that says where its data is placed.
        //copy data to ssbo buffer
        glBindBuffer(GL_SHADER_STORAGE_BUFFER, instances_uniforms_ssbo);
        float *gpu_data = (float*)glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, CURRENT_NUM_INSTANCES * NUM_UNIFORM_CHANGES_PER_DIP * sizeof(vec4), GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
        memcpy(gpu_data, &all_instances_uniform_data[0], CURRENT_NUM_INSTANCES * NUM_UNIFORM_CHANGES_PER_DIP * sizeof(vec4)); //copy instances data
        glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);

        //bind for shader to binding point 0 (shader will read data from this binding point)
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, instances_uniforms_ssbo);
        glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);

        //render
        uniforms_changes_ssbo_shader.bind();
        glBindVertexArray(ws_complex_geometry_vao_id);

        static int uniformsInstancing_data_varLocation = glGetUniformLocation(uniforms_changes_ssbo_shader.programm_id, "instance_data_location");
        for (int i = 0; i < CURRENT_NUM_INSTANCES; i++)
        {
            //set parameter to shader - where object data is located
            glUniform1i(uniformsInstancing_data_varLocation, i * NUM_UNIFORM_CHANGES_PER_DIP);
            glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i * BOX_NUM_INDICES * sizeof(int))); //simple dip
        }

    The uniforms changes test takes 1.27 ms. The same test with shader parameters transferred through an SSBO takes 0.80 ms.

    Using glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_WRITE_ONLY); causes CPU and GPU synchronization, which should be avoided. One should use glMapBufferRange with the flag GL_MAP_UNSYNCHRONIZED_BIT to prevent synchronization. But the programmer must then guarantee that the data being overwritten is not in use by the GPU right now. Otherwise we get bugs, as we are rewriting data that the GPU is currently reading. To completely resolve this problem, use triple buffering: while we write data to the current buffer, the GPU uses the other two. There is also a more optimal buffer mapping method, with the flags GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT.
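The triple buffering mentioned above is, at its core, a round-robin choice of which buffer the CPU writes in a given frame. A minimal sketch of just that index rotation follows; `BufferRing` is a hypothetical helper, not part of the benchmark.

```cpp
#include <cassert>

// Round-robin selection of the buffer the CPU may write during a given frame.
// With three buffers, the two buffers not selected are left to the GPU.
class BufferRing {
public:
    explicit BufferRing(int buffer_count) : count_(buffer_count) {
        assert(buffer_count > 0);
    }

    // Index of the buffer to map and fill for frame number `frame`.
    int write_index(unsigned long long frame) const {
        return static_cast<int>(frame % static_cast<unsigned long long>(count_));
    }

private:
    int count_;
};
```

Each frame the code would map only the buffer at `ring.write_index(frame)` with GL_MAP_UNSYNCHRONIZED_BIT. The rotation alone does not prove the GPU is finished with that buffer; a strict implementation would typically also place a sync object per buffer (glFenceSync) and check it (glClientWaitSync) before writing.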
    Changing vertex buffers

    There are a lot of objects with different geometry in the scene. This geometry is usually placed in different vertex buffers. To render another object with different geometry, even with the same material, we need to change the vertex buffer. There are techniques that allow different geometry with the same material to be rendered effectively with only one dip: MultiDrawIndirect, dynamic vertex pulling. Such geometry should be placed in one buffer.

        void vbo_change_test()
        {
            simple_geometry_shader.bind();

            for (int i = 0; i < CURRENT_NUM_INSTANCES; i++)
            {
                glBindVertexArray(separate_geometry_vao_id[i % NUM_SIMPLE_VERTEX_BUFFERS]); //change vbo
                glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i * BOX_NUM_INDICES * sizeof(int))); //simple dip
            }
        }

    For 1000 VBO changes we get 0.95 ms.

    Textures changes

    Textures give surfaces a detailed look. You can get a very large variety in the picture simply by changing textures and blending different textures in the shader. Textures have to be changed frequently, but you can put them into a so-called texture array, bind it only once for lots of dips, and access the textures through an index in the shader. The same geometry with different textures can be rendered using instancing.
        void textures_change_test()
        {
            glBindVertexArray(ws_complex_geometry_vao_id);
            int counter = 0;

            //switch between tests
            if (test_type == ARRAY_OF_TEXTURES_TEST)
            {
                array_of_textures_shader.bind();

                for (int i = 0; i < CURRENT_NUM_INSTANCES; i++)
                {
                    //bind textures for this dip
                    for (int j = 0; j < NUM_TEXTURES_IN_COMPLEX_MATERIAL; j++)
                    {
                        glActiveTexture(GL_TEXTURE0 + j);
                        glBindTexture(GL_TEXTURE_2D, array_of_textures[counter % TEX_ARRAY_SIZE]);
                        glBindSampler(j, Sampler_linear);
                        counter++;
                    }

                    glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i * BOX_NUM_INDICES * sizeof(int))); //simple dip
                }
            }
            else if (test_type == TEXTURES_ARRAY_TEST)
            {
                //bind texture array for all dips
                glActiveTexture(GL_TEXTURE0);
                glBindTexture(GL_TEXTURE_2D_ARRAY, texture_array_id);
                glBindSampler(0, Sampler_linear);

                //variable to tell shader - what textures this dip uses
                static int textureArray_usedTex_varLocation = glGetUniformLocation(textureArray_shader.programm_id, "used_textures_i");
                textureArray_shader.bind();

                float used_textures_i[6];
                for (int i = 0; i < CURRENT_NUM_INSTANCES; i++)
                {
                    //fill data - what textures this dip uses
                    for (int j = 0; j < 6; j++)
                    {
                        used_textures_i[j] = counter % TEX_ARRAY_SIZE;
                        counter++;
                    }

                    glUniform1fv(textureArray_usedTex_varLocation, 6, &used_textures_i[0]); //transfer to shader, tell what textures this material uses
                    glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i * BOX_NUM_INDICES * sizeof(int))); //simple dip
                }
            }
        }

    The simple texture changes test (for 1000 dips) takes 3.27 ms. But we make NUM_TEXTURES_IN_COMPLEX_MATERIAL texture changes per dip; we should take this into account later, when calculating the glBindTexture cost. Performing the same test with a texture array, we get 0.87 ms.

    Comparative estimation of state changes

    Below is a table with the execution time of all performed tests.

    Table 1. State changes tests time

    Test type                      Time, ms
    SIMPLE_DIPS_TEST               0.41
    FBO_CHANGE_TEST                1.97
    SHADERS_CHANGE_TEST            2.90
    UNIFORMS_SIMPLE_CHANGE_TEST    1.27
    UNIFORMS_SSBO_CHANGE_TEST      0.80
    VBO_CHANGE_TEST                0.95
    ARRAY_OF_TEXTURES_TEST         3.27
    TEXTURES_ARRAY_TEST            0.87

    Using these results we are able to calculate the API call cost. The absolute cost is given per 1000 API calls. The relative cost is calculated in relation to the simple dip call (glDrawRangeElements).

    Table 2. API call cost (per 1k calls)

    API call               Absolute cost, ms   Relative cost, %
    glBindFramebuffer      9.44                2314
    glUseProgram           2.49                610
    glBindVertexArray      0.54                132
    glBindTexture          0.48                116
    glDrawRangeElements    0.41                100
    glUniform4fv           0.09                21

    Of course, one should be very cautious with these measurements, as they will change depending on the driver version and the hardware.

    Instancing

    Instancing was invented to quickly render the same geometry with different parameters. Each object has a unique index, by which we can fetch the parameters for this object in the shader, vary some options, etc. The main advantage of instancing is that we can greatly reduce the number of dips. We can pack all instance parameters into a buffer, transfer them to the GPU and make just one dip. Storing data in a buffer is a good optimization in itself: we save time because there is no need to constantly change the shader parameters. Also, if the instance data does not change (for example, we know for certain that it is static geometry), we don't need to transfer the data to the GPU every frame, just once at program/level start.

    In general, for optimal rendering we should first pack all instance data into one buffer and transfer it to the GPU with one command. For each dip (each type of geometry), just set the offset at which to find the instance data for this dip. Using the instance index (gl_InstanceID in OpenGL) we are able to sample the data for this particular instance/object.

    There are a lot of ways to store data in OpenGL: vertex buffer (VBO), uniform buffer (UBO), texture buffer (TBO), shader storage buffer (SSBO), textures.
There are various features for each buffer type. Consider that. Texture instancing All data stored in the texture. To effectively change data in texture one should use special structures - Pixel Buffer Object (PBO) which allow transferring data asynchronously from CPU to GPU. CPU does not wait until the data will be transferred and continues to work. Creation code: glGenBuffers(2, textureInstancingPBO); glBindBuffer(GL_PIXEL_UNPACK_BUFFER, textureInstancingPBO[0]); //GL_STREAM_DRAW_ARB means that we will change data every frame glBufferData(GL_PIXEL_UNPACK_BUFFER, INSTANCES_DATA_SIZE, 0, GL_STREAM_DRAW_ARB); glBindBuffer(GL_PIXEL_UNPACK_BUFFER, textureInstancingPBO[1]); glBufferData(GL_PIXEL_UNPACK_BUFFER, INSTANCES_DATA_SIZE, 0, GL_STREAM_DRAW_ARB); glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0); //create texture where we will store instances data on gpu glGenTextures(1, textureInstancingDataTex); glBindTexture(GL_TEXTURE_2D, textureInstancingDataTex); glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST); glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST); glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT); glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_REPEAT); glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_R, GL_REPEAT); //in each line we store NUM_INSTANCES_PER_LINE object's data. 128 in our case //for each object we store PER_INSTANCE_DATA_VECTORS data-vectors. 
2 in our case //GL_RGBA32F, we have float32 data //complex_mesh_instances_data source data of instances, if we are not going to update data in the texture glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, NUM_INSTANCES_PER_LINE * PER_INSTANCE_DATA_VECTORS, MAX_INSTANCES / NUM_INSTANCES_PER_LINE, 0, GL_RGBA, GL_FLOAT, &complex_mesh_instances_data[0]); glBindTexture(GL_TEXTURE_2D, 0); Texture update: glBindTexture(GL_TEXTURE_2D, textureInstancingDataTex); glBindBufferARB(GL_PIXEL_UNPACK_BUFFER, textureInstancingPBO[current_frame_index]); // copy pixels from PBO to texture object glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, NUM_INSTANCES_PER_LINE * PER_INSTANCE_DATA_VECTORS, MAX_INSTANCES / NUM_INSTANCES_PER_LINE, GL_RGBA, GL_FLOAT, 0); // bind PBO to update pixel values glBindBufferARB(GL_PIXEL_UNPACK_BUFFER, textureInstancingPBO[next_frame_index]); //http://www.songho.ca/opengl/gl_pbo.html // Note that glMapBufferARB() causes sync issue. // If GPU is working with this buffer, glMapBufferARB() will wait(stall) // until GPU to finish its job. To avoid waiting (idle), you can call // first glBufferDataARB() with NULL pointer before glMapBufferARB(). // If you do that, the previous data in PBO will be discarded and // glMapBufferARB() returns a new allocated pointer immediately // even if GPU is still working with the previous data. 
glBufferData(GL_PIXEL_UNPACK_BUFFER, INSTANCES_DATA_SIZE, 0, GL_STREAM_DRAW_ARB); gpu_data = (float*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY_ARB); if (gpu_data) { memcpy(gpu_data, complex_mesh_instances_data[0], INSTANCES_DATA_SIZE); // update data glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER); //release pointer to mapping buffer } Rendering using texture instancing: //bind texture with instances data glActiveTexture(GL_TEXTURE0); glBindTexture(GL_TEXTURE_2D, textureInstancingDataTex); glBindSampler(0, Sampler_nearest); glBindVertexArray(geometry_vao_id); //what geometry to render tex_instancing_shader.bind(); //with what shader //tell shader texture with data located, what name it has static GLint location = glGetUniformLocation(tex_instancing_shader.programm_id, s_texture_0); if (location >= 0) glUniform1i(location, 0); //render group of objects glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, CURRENT_NUM_INSTANCES); Vertex shader to access the data: #version 150 core in vec3 s_pos; in vec3 s_normal; in vec2 s_uv; uniform mat4 ModelViewProjectionMatrix; uniform sampler2D s_texture_0; out vec2 uv; out vec3 instance_color; void main() { const vec2 texel_size = vec2(1.0 / 256.0, 1.0 / 16.0); const int objects_per_row = 128; const vec2 half_texel = vec2(0.5, 0.5); //calc texture coordinates - where our instance data located //gl_InstanceID % objects_per_row - index of object in the line //multiple by 2 as each object has 2 vectors of data //gl_InstanceID / objects_per_row - in what line our data located //multiple by texel_size gieves us 0..1 uv to sample from texture from interer texel id vec2 texel_uv = (vec2((gl_InstanceID % objects_per_row) * 2, floor(gl_InstanceID / objects_per_row)) + half_texel) * texel_size; vec4 instance_pos = textureLod(s_texture_0, texel_uv, 0); instance_color = textureLod(s_texture_0, texel_uv + vec2(texel_size.x, 0.0), 0).xyz; uv = s_uv; gl_Position = ModelViewProjectionMatrix * vec4(s_pos + 
        instance_pos.xyz, 1.0);
}

Instancing through vertex buffer

The idea is to keep instance data in a separate vertex buffer and access it in the shader through vertex attributes. The code that creates the buffer with the data is trivial. Our main task is to extend the vertex description for the shader (vertex declaration, vdecl):

//...code of base vertex declaration creation
//special attributes binding
glBindBuffer(GL_ARRAY_BUFFER, all_instances_data_vbo);
//size of per instance data (PER_INSTANCE_DATA_VECTORS = 2 - so we have to create 2 additional attributes to transfer data)
const int per_instance_data_size = sizeof(vec4) * PER_INSTANCE_DATA_VECTORS;
glEnableVertexAttribArray(4);
//4th vertex attribute, has 4 floats, 0 data offset
glVertexAttribPointer((GLuint)4, 4, GL_FLOAT, GL_FALSE, per_instance_data_size, (GLvoid*)(0));
//tell that we will change this attribute per instance, not per vertex
glVertexAttribDivisor(4, 1);
glEnableVertexAttribArray(5);
//5th vertex attribute, has 4 floats, sizeof(vec4) data offset
glVertexAttribPointer((GLuint)5, 4, GL_FLOAT, GL_FALSE, per_instance_data_size, (GLvoid*)(sizeof(vec4)));
//tell that we will change this attribute per instance, not per vertex
glVertexAttribDivisor(5, 1);

Rendering code:

vbo_instancing_shader.bind();
//our vertex buffer with modified vertex declaration (vdecl)
glBindVertexArray(geometry_vao_vbo_instancing_id);
glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, CURRENT_NUM_INSTANCES);

Vertex shader to access data:

#version 150 core
in vec3 s_pos;
in vec3 s_normal;
in vec2 s_uv;
in vec4 s_attribute_3; //some_data;
in vec4 s_attribute_4; //instance pos
in vec4 s_attribute_5; //instance color
uniform mat4 ModelViewProjectionMatrix;
out vec3 instance_color;
void main()
{
    instance_color = s_attribute_5.xyz;
    gl_Position = ModelViewProjectionMatrix * vec4(s_pos + s_attribute_4.xyz, 1.0);
}

Uniform buffer instancing, Texture buffer instancing, Shader Storage buffer instancing

These three methods
are very similar to each other. They differ mostly by buffer type. A uniform buffer (UBO) is small, but in theory it should be faster than the others. A texture buffer (TBO) can be very large: we are able to store the instance data of the whole scene in it, as well as skeletal transformations. A shader storage buffer (SSBO) has both properties: it is fast and large. We can also write data to it. The only drawback is that it is a new extension, and old hardware does not support it.

Uniform buffer

Creation code:

glGenBuffers(1, &dips_uniform_buffer);
glBindBuffer(GL_UNIFORM_BUFFER, dips_uniform_buffer);
glBufferData(GL_UNIFORM_BUFFER, INSTANCES_DATA_SIZE, &complex_mesh_instances_data[0], GL_STATIC_DRAW); //uniform_buffer_data
glBindBuffer(GL_UNIFORM_BUFFER, 0);
//bind uniform buffer with instances data to shader
ubo_instancing_shader.bind(true);
GLint instanceData_location3 = glGetUniformLocation(ubo_instancing_shader.programm_id, "instance_data"); //link to shader
glUniformBufferEXT(ubo_instancing_shader.programm_id, instanceData_location3, dips_uniform_buffer); //actually binding

Instancing vertex shader with uniform buffer:

#version 150 core
#extension GL_EXT_bindable_uniform : enable
#extension GL_EXT_gpu_shader4 : enable
in vec3 s_pos;
in vec3 s_normal;
in vec2 s_uv;
uniform mat4 ModelViewProjectionMatrix;
bindable uniform vec4 instance_data[4096]; //our uniform with instances data
out vec3 instance_color;
void main()
{
    vec4 instance_pos = instance_data[gl_InstanceID*2];
    instance_color = instance_data[gl_InstanceID*2+1].xyz;
    gl_Position = ModelViewProjectionMatrix * vec4(s_pos + instance_pos.xyz, 1.0);
}

Texture Buffer

Creation code:

tbo_instancing_shader.bind();
//bind to shader as special texture
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_BUFFER, dips_texture_buffer_tex);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, dips_texture_buffer);
glBindVertexArray(geometry_vao_id);
glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL,
    CURRENT_NUM_INSTANCES);

Vertex shader:

#version 150 core
#extension GL_EXT_bindable_uniform : enable
#extension GL_EXT_gpu_shader4 : enable
in vec3 s_pos;
in vec3 s_normal;
in vec2 s_uv;
uniform mat4 ModelViewProjectionMatrix;
uniform samplerBuffer s_texture_0; //our TBO texture buffer
out vec3 instance_color;
void main()
{
    //sample data from TBO
    vec4 instance_pos = texelFetch(s_texture_0, gl_InstanceID*2);
    instance_color = texelFetch(s_texture_0, gl_InstanceID*2+1).xyz;
    gl_Position = ModelViewProjectionMatrix * vec4(s_pos + instance_pos.xyz, 1.0);
}

SSBO

Creation code:

glGenBuffers(1, &ssbo);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
glBufferData(GL_SHADER_STORAGE_BUFFER, INSTANCES_DATA_SIZE, &complex_mesh_instances_data[0], GL_STATIC_DRAW);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0); // unbind

Render:

//bind ssbo_instances_data, link to shader at 0 binding point
glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo_instances_data);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo_instances_data);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
ssbo_instancing_shader.bind();
glBindVertexArray(geometry_vao_id);
glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, CURRENT_NUM_INSTANCES);
glBindVertexArray(0);

Vertex shader:

#version 430
#extension GL_ARB_shader_storage_buffer_object : require
in vec3 s_pos;
in vec3 s_normal;
in vec2 s_uv;
uniform mat4 ModelViewProjectionMatrix;
//ssbo should be bound to binding point 0
layout(std430, binding = 0) buffer ssboData
{
    vec4 instance_data[4096];
};
out vec3 instance_color;
void main()
{
    //gl_InstanceID is unique for each instance, so we are able to fetch per instance data
    vec4 instance_pos = instance_data[gl_InstanceID*2];
    instance_color = instance_data[gl_InstanceID*2+1].xyz;
    gl_Position = ModelViewProjectionMatrix * vec4(s_pos + instance_pos.xyz, 1.0);
}

Uniforms instancing

Pretty simple.
With the glUniform* commands we can pass several vectors of data to the shader. The maximum amount depends on the video card; query it by calling glGetIntegerv with the GL_MAX_VERTEX_UNIFORM_VECTORS parameter. On the R9 380 it returns 4096. The minimum value is 256.

uniforms_instancing_shader.bind();
glBindVertexArray(geometry_vao_id);
//variable - where in the shader our array of uniforms is located. We will write data to this array
static int uniformsInstancing_data_varLocation = glGetUniformLocation(uniforms_instancing_shader.programm_id, "instance_data");
//instances data might be written with just one call if there are enough vectors.
//Just for clarity, divide into groups, because usually there is much more data than available uniforms.
for (int i = 0; i < UNIFORMS_INSTANCING_NUM_GROUPS; i++)
{
    //write data to uniforms
    glUniform4fv(uniformsInstancing_data_varLocation, UNIFORMS_INSTANCING_MAX_CONSTANTS_FOR_INSTANCING, &complex_mesh_instances_data[i*UNIFORMS_INSTANCING_MAX_CONSTANTS_FOR_INSTANCING].x);
    glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, UNIFORMS_INSTANCING_OBJECTS_PER_DIP);
}

Multi draw indirect

Let us separately consider a command that issues a huge number of dips in one call. It is very useful: it allows rendering a group of instances with different geometry, even thousands of different groups, with one command. As input it receives an array that describes the parameters of the dips: the number of indices, the offsets in the vertex buffers and the number of instances per group. The restriction is that all the geometry must be placed in one vertex buffer and rendered with one shader. An additional plus is that we can fill in the dip information for the MultiDraw command on the GPU side, which is very useful for GPU frustum culling, for example.

//fill indirect buffer with dips information.
//just a simple array
for (int i = 0; i < CURRENT_NUM_INSTANCES; i++)
{
    multi_draw_indirect_buffer[i].vertexCount = BOX_NUM_INDICES;
    multi_draw_indirect_buffer[i].instanceCount = 1;
    multi_draw_indirect_buffer[i].firstVertex = i*BOX_NUM_INDICES;
    multi_draw_indirect_buffer[i].baseVertex = 0;
    multi_draw_indirect_buffer[i].baseInstance = 0;
}
glBindVertexArray(ws_complex_geometry_vao_id);
simple_geometry_shader.bind();
glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
    (GLvoid*)&multi_draw_indirect_buffer[0], //our information about dips
    CURRENT_NUM_INSTANCES, //number of dips
    0);

The glMultiDrawElementsIndirect command performs several glDrawElementsIndirect draws in one call. There is an unpleasant feature in its behavior: each such group has an independent gl_InstanceID, i.e. it drops to 0 with each new Draw*. This makes it difficult to access the required per instance data. The problem is solved by modifying the vertex declaration of each type of object sent to the renderer. You can read about it in the article Surviving without gl_DrawID. It is worth noting that glMultiDrawElementsIndirect performs a huge number of dips with a single command, so it should not be compared directly with the other types of instancing.

Performance comparison of different types of instancing

Table 3. Instancing tests performance. Number of iterations = 100. Columns: number of instances. Values: cpu time (gpu time).

Instancing type        50             100            200
UBO_INSTANCING         0.35 (0.10)    0.37 (0.13)    0.36 (0.24)
TBO_INSTANCING         0.72 (0.11)    0.73 (0.13)    0.73 (0.25)
SSBO_INSTANCING        0.37 (0.09)    0.40 (0.13)    0.38 (0.24)
VBO_INSTANCING         0.36 (0.09)    0.37 (0.12)    0.37 (0.24)
TEXTURE_INSTANCING     0.38 (0.10)    0.39 (0.13)    0.39 (0.24)
UNIFORMS_INSTANCING    0.41 (0.13)    0.52 (0.27)    0.74 (0.51)
MULTI_DRAW_INDIRECT    0.63 (0.53)    1.17 (1.01)    2.10 (1.93)

UBO, VBO, SSBO and TEXTURE instancing have pretty much the same 'good' timing. TBO instancing allows storing a huge amount of information, but it is slow in comparison with UBO.
If possible, you should use SSBO storage. It is fast, handy and has a huge size. Texture instancing is also a good alternative to UBO: it is supported by old hardware and you can store any amount of information, but it is a little uncomfortable to update. Transferring data each frame through glUniform* is obviously the slowest instancing method. In the tests glMultiDrawElementsIndirect performed 5k, 10k and 20k dips, because the test itself was repeated in a loop. That many dips could be issued with just one call. The only thing is that with so many dips the array with the dip descriptions will be pretty large (it is better to generate the descriptions on the GPU).

Recommendations for optimization and conclusions

In this paper we analyzed the cost of API calls and measured the performance of different types of instancing. In general, the fewer state switches, the better. Use the newest features of the latest version of the API: texture arrays, SSBO, Draw Indirect, and buffer mapping with the GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT flags for fast data transfer.

Recommendations:
- The fewer state changes, the better. Group objects by material.
- You may wrap state changes (textures, buffers, shaders and other states). Check whether the state really changed before making the API call, because the call is much slower than checking a flag or an index.
- Unite geometry in one buffer.
- Use texture arrays.
- Store data in large buffers and textures.
- Use as few shaders as possible. But an overly complicated/universal shader with many branches will obviously be a problem, especially on older video cards, where branching is expensive.
- Use instancing.
- Use Draw Indirect if possible and generate the information about dips on the GPU side.

Some general advice:
- Find the bottlenecks and optimize them first. You need to know what limits performance, CPU or GPU, and optimize that.
- Don't do work twice. Reuse the results of different passes, reuse results of previous frames (reprojection techniques, sorting, tracing, anything).
- Difficult calculations might be precalculated.
- The best optimization is not to do the work.
- Use parallel calculations: split work into parts and do them on parallel threads.

Source code of all examples: GL_API_overhead.rar (attachment)

Links:
- Beyond Porting
- OpenGL documentation
- Instancing in OpenGL
- OpenGL Pixel Buffer Object (PBO)
- Buffer Object
- Drawing with OpenGL
- Shader Storage Buffer Object
- Shader Storage Buffers Objects
- hardware caps, stats (for GL_MAX_VERTEX_UNIFORM_COMPONENTS)
- Array Texture
- Textures Array example
- MultiDrawIndirect example
- The Road to One Million Draws
- MultiDrawIndirect, Surviving without gl_DrawID
- Humus demo about Instancing

Anatoliy Gerlits
February 2017
  6. Frustum Culling

    Zaoshi Kaba
    >Have you considered trying AVX? On my i7-4770K 4.0 GHz I get roughly 500,000,000 AABB culls / second on a single thread, so it's roughly 0.2 ms per 100k.
    Nice results. No, I have not tested AVX yet.

    corysama
    Looks good, I will try it. Thanks.
  7. Frustum Culling

    MAnd
    >This is a great post in many aspects
    Thanks.

    >I am not talking about the written English
    Well, I am just learning it )

    >rushing over details
    Any aspect of the described optimizations is a whole universe. I just thought it would be too boring for everyone to read the basics again. For those who are not familiar with the concepts of SSE, multithreading and geometry shaders, the code will be too hard to understand anyway. The article is about optimizations, speeding up a naive implementation, comparison, and how good the GPU is.

    I was a bit surprised, but the CPU is really fast with all optimizations, as fast as the GPU. Honestly, the GPU is still a dark horse. Some additional tests are needed: geometry shaders vs compute, atomics overhead, indirect with lots of 'empty' dips, OBB culling, BVH/trees on the GPU. CPU culling also requires additional research: I need to understand how suitable queries are for occlusion culling. Tests need to be conservative due to the nature of queries. Such an approach was used in Doom 2016. It might be very interesting. So, even with such 'not bad' results for the CPU, I am still not sure how good it is in comparison with the GPU. Additional research is required.

    >in many cases even a basic implementation of an Octree or KD-tree outperform even SSE+multithreaded
    I work on WarThunder. We tested hierarchical structures for culling. Surprisingly, simple brute force culling of a linear array of objects outperforms all hierarchies. It was not me who tested this, but I believe the results:
    - SSE 'doesn't like' branching :)
    - branching is pretty slow
    - the next calculations depend on the previous ones
    - probably the most important: processor cache utilization (your data is not local, you jump from one place to another all the time, unpredictably, and fetch different data each time)
  8. Frustum Culling

    Introduction

    Frustum culling is the process of discarding objects that are not visible on the screen. As we don't see them, we don't need to spend computer resources on preparing them for rendering or on the rendering itself.

    In this paper I will cover the following topics:
    - culling of Bounding Spheres, Axis-Aligned Bounding Boxes (AABB) and Oriented Bounding Boxes (OBB)
    - culling of a huge number of objects
    - using SSE
    - multithreaded culling
    - GPU culling
    - comparison of the approaches: efficiency and working speed

    What I will not cover:
    - using hierarchical structures, trees. We can unite objects into groups according to their world positions and first check the visibility of a whole group.
    - optimizations for a single object, like testing the last 'successful' culling plane first
    - visibility tests that take the scene depth buffer into account. An object can be inside the frustum but completely blocked by another object closer to the viewer, so we might discard it from rendering as well.
    - software culling. We may test the blocking of objects by one another on the CPU side.
    - culling for shadows

    Simple culling

    We have an area of visibility, which is defined by the frustum of the viewing pyramid. Objects that are not inside this area should be discarded from the rendering process. The frustum is usually defined by 6 planes. We have objects in the scene. Each object might be approximated with simple geometry such as a sphere or a box, so that all of the object's geometry lies inside this primitive. The visibility test of such simple geometry is very fast. Our aim is to find out whether the object is visible in the frustum. Consider the visibility tests for spheres and boxes. There are different kinds of boxes: aligned to the world axes (AABB) and aligned to the object's local axes (OBB). An OBB clearly approximates the object better, but its visibility test is more expensive than the AABB one.

    Sphere-frustum

    Algorithm: for the object's center we find the distance to each frustum plane. If the point is behind any plane by more than the sphere radius, the sphere is not in the frustum, and the plane found this way is a separating plane.
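The sphere test described above can be sketched in plain C++ as follows. This is a minimal illustration, not the article's listing: the `Vec3` and `Plane` types, the function names and the cube-shaped example frustum are assumptions made for the example, and the planes are assumed to have normalized normals pointing into the frustum.

```cpp
#include <array>

// Hypothetical minimal types for illustration only.
struct Vec3 { float x, y, z; };
struct Plane { float a, b, c, d; }; // a*x + b*y + c*z + d = 0, normal points inside

// Returns true if the sphere is inside the frustum or intersects it.
inline bool sphere_in_frustum(const Vec3& center, float radius,
                              const std::array<Plane, 6>& planes)
{
    for (const Plane& p : planes)
    {
        // signed distance from the sphere center to the plane
        const float dist = p.a * center.x + p.b * center.y + p.c * center.z + p.d;
        // center is behind the plane by more than the radius: separating plane found
        if (dist < -radius)
            return false;
    }
    return true;
}

// Example "frustum": the cube [-1, 1]^3 described by six inward-facing planes.
inline std::array<Plane, 6> unit_cube_planes()
{
    return {{ { 1.0f, 0.0f, 0.0f, 1.0f }, { -1.0f, 0.0f, 0.0f, 1.0f },
              { 0.0f, 1.0f, 0.0f, 1.0f }, { 0.0f, -1.0f, 0.0f, 1.0f },
              { 0.0f, 0.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, -1.0f, 1.0f } }};
}
```

The early return is the scalar counterpart of accumulating a result flag: as soon as one separating plane is found, the remaining planes need not be tested.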
__forceinline bool SphereInFrustum(vec3 &pos, float &radius, vec4 *frustum_planes)
{
    bool res = true;
    //test all 6 frustum planes
    for (int i = 0; i < 6; i++)
    {
        //calculate distance from sphere center to plane.
        //if distance larger than sphere radius - sphere is outside frustum
        if (frustum_planes[i].x * pos.x + frustum_planes[i].y * pos.y + frustum_planes[i].z * pos.z + frustum_planes[i].w
        ...
        outside_positive_plane =
            obb_points[0][i] > obb_points[0].w &&
            obb_points[1][i] > obb_points[1].w &&
            obb_points[2][i] > obb_points[2].w &&
            obb_points[3][i] > obb_points[3].w &&
            obb_points[4][i] > obb_points[4].w &&
            obb_points[5][i] > obb_points[5].w &&
            obb_points[6][i] > obb_points[6].w &&
            obb_points[7][i] > obb_points[7].w;
        outside_negative_plane =
            obb_points[0][i] < -obb_points[0].w &&
            obb_points[1][i] < -obb_points[1].w &&
            obb_points[2][i] < -obb_points[2].w &&
            obb_points[3][i] < -obb_points[3].w &&
            obb_points[4][i] < -obb_points[4].w &&
            obb_points[5][i] < -obb_points[5].w &&
            obb_points[6][i] < -obb_points[6].w &&
            obb_points[7][i] < -obb_points[7].w;
        outside = outside || outside_positive_plane || outside_negative_plane;
        //if (outside_positive_plane || outside_negative_plane)
            //return false;
    }
    return !outside;
    //return true;
}

Table 1. Culling results of 100k objects. Intel Core I5. Single thread. Time in ms.

Simple Culling    Sphere    AABB    OBB
Just culling      0.92      1.42    9.14
Whole frame       1.94      2.5     10.3

The results are obvious: the harder the calculations, the slower they run. The OBB test is much slower than the Sphere or AABB tests, but we get more precise culling with OBB. Maybe the optimal solution is to split objects into groups and use an appropriate primitive for each group depending on its distance to the camera: OBB for the closest groups, AABB for the middle ones and Spheres for the rest. It should also be noticed that the whole frame time is larger than the culling time alone, by 1 ms on average. This is because transferring the data about visible objects to the GPU, a couple of dips and the API commands all have a cost. But these are necessary actions.
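For reference, an AABB-frustum test of the kind measured in Table 1 is commonly written by picking, for every plane, the box corner that lies farthest along the plane normal. The sketch below is an assumption about how such a test may look, not the article's own listing; the `Vec3`, `Plane` and `Aabb` types and the function name are hypothetical.

```cpp
// Hypothetical minimal types for illustration only.
struct Vec3 { float x, y, z; };
struct Plane { float a, b, c, d; }; // a*x + b*y + c*z + d = 0, normal points inside
struct Aabb { Vec3 min, max; };

// Returns true if the box is inside the frustum or intersects it.
inline bool aabb_in_frustum(const Aabb& box, const Plane* planes, int num_planes = 6)
{
    for (int i = 0; i < num_planes; i++)
    {
        const Plane& p = planes[i];
        // choose the corner with the largest signed distance to this plane
        const float x = p.a >= 0.0f ? box.max.x : box.min.x;
        const float y = p.b >= 0.0f ? box.max.y : box.min.y;
        const float z = p.c >= 0.0f ? box.max.z : box.min.z;
        // if even that corner is behind the plane, the whole box is outside
        if (p.a * x + p.b * y + p.c * z + p.d < 0.0f)
            return false;
    }
    return true;
}
```

Selecting the corner by the sign of the normal gives the same result as taking the maximum of the per-axis products for the min and max points, which maps well to SIMD code.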
SSE

SSE (Streaming SIMD Extensions): with one instruction we perform calculations on a group of operands. The SSE architecture includes eight 128-bit registers and a set of instructions to operate on them. Theoretically we might speed up code execution 4 times, as we operate on 4 operands simultaneously. Of course, in practice the performance win will be smaller because of SSE drawbacks:
- not all algorithms can be easily rewritten for SSE
- data should be packed into registers according to SSE requirements before calculations
- SSE has some restrictions on horizontal operations like dot products
- there are no conditions. One uses so-called static branching: we execute both parts of a condition and take just the result we are interested in.
- loading data into registers and storing results back into memory has a cost
- don't forget about SSE data striding

The algorithms of SSE Sphere-frustum and SSE AABB-frustum culling are almost identical to the simple implementation, except that we perform the calculations on 4 objects simultaneously.
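The points above (packed data, no conditions, 4 objects per step) can be illustrated with a small self-contained sketch. It assumes the sphere data is already stored as separate x, y, z and radius arrays, so no transpose is needed; the `Plane` type and the function name `cull_spheres4` are hypothetical and are not part of the article's source.

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical plane type: a*x + b*y + c*z + d = 0, normal points inside.
struct Plane { float a, b, c, d; };

// x, y, z, r each point to 4 floats (one value per sphere).
// Returns a 4-bit mask: bit k is set when sphere k is outside the frustum.
inline int cull_spheres4(const float* x, const float* y, const float* z,
                         const float* r, const Plane* planes)
{
    const __m128 px = _mm_loadu_ps(x);
    const __m128 py = _mm_loadu_ps(y);
    const __m128 pz = _mm_loadu_ps(z);
    const __m128 neg_r = _mm_sub_ps(_mm_setzero_ps(), _mm_loadu_ps(r));

    __m128 outside = _mm_setzero_ps();
    for (int j = 0; j < 6; j++)
    {
        // distance to plane j for all 4 spheres at once
        const __m128 dist = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(px, _mm_set1_ps(planes[j].a)),
                       _mm_mul_ps(py, _mm_set1_ps(planes[j].b))),
            _mm_add_ps(_mm_mul_ps(pz, _mm_set1_ps(planes[j].c)),
                       _mm_set1_ps(planes[j].d)));
        // no branch: the comparison yields an all-ones lane where dist < -r,
        // and the per-plane results are merged with a bitwise OR
        outside = _mm_or_ps(outside, _mm_cmplt_ps(dist, neg_r));
    }
    // collect the sign bit of every lane into an integer mask
    return _mm_movemask_ps(outside);
}
```

Returning a bit mask through `_mm_movemask_ps` is one possible way to hand the result back to scalar code; writing per-object integers, as a batch culling routine may prefer, is another.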
void sse_culling_spheres(BSphere *sphere_data, int num_objects, int *culling_res, vec4 *frustum_planes)
{
    float *sphere_data_ptr = reinterpret_cast<float*>(&sphere_data[0]);
    int *culling_res_sse = &culling_res[0];

    //to optimize calculations we gather xyzw elements in separate vectors
    __m128 zero_v = _mm_setzero_ps();
    __m128 frustum_planes_x[6];
    __m128 frustum_planes_y[6];
    __m128 frustum_planes_z[6];
    __m128 frustum_planes_d[6];
    int i, j;
    for (i = 0; i < 6; i++)
    {
        frustum_planes_x[i] = _mm_set1_ps(frustum_planes[i].x);
        frustum_planes_y[i] = _mm_set1_ps(frustum_planes[i].y);
        frustum_planes_z[i] = _mm_set1_ps(frustum_planes[i].z);
        frustum_planes_d[i] = _mm_set1_ps(frustum_planes[i].w);
    }

    //we process 4 objects per step
    for (i = 0; i < num_objects; i += 4)
    {
        //load bounding sphere data
        __m128 spheres_pos_x = _mm_load_ps(sphere_data_ptr);
        __m128 spheres_pos_y = _mm_load_ps(sphere_data_ptr + 4);
        __m128 spheres_pos_z = _mm_load_ps(sphere_data_ptr + 8);
        __m128 spheres_radius = _mm_load_ps(sphere_data_ptr + 12);
        sphere_data_ptr += 16;

        //but for our calculations we need to transpose the data, to collect x, y, z and w coordinates in separate vectors
        _MM_TRANSPOSE4_PS(spheres_pos_x, spheres_pos_y, spheres_pos_z, spheres_radius);
        __m128 spheres_neg_radius = _mm_sub_ps(zero_v, spheres_radius); // negate all elements

        __m128 intersection_res = _mm_setzero_ps();
        for (j = 0; j < 6; j++) //plane index
        {
            //1. calc distance to plane dot(sphere_pos.xyz, plane.xyz) + plane.w
            //2. if distance < -sphere radius, then the sphere is outside the frustum
            __m128 dot_x = _mm_mul_ps(spheres_pos_x, frustum_planes_x[j]);
            __m128 dot_y = _mm_mul_ps(spheres_pos_y, frustum_planes_y[j]);
            __m128 dot_z = _mm_mul_ps(spheres_pos_z, frustum_planes_z[j]);
            __m128 sum_xy = _mm_add_ps(dot_x, dot_y);
            __m128 sum_zw = _mm_add_ps(dot_z, frustum_planes_d[j]);
            __m128 distance_to_plane = _mm_add_ps(sum_xy, sum_zw);
            __m128 plane_res = _mm_cmple_ps(distance_to_plane, spheres_neg_radius); //dist < -sphere_r ?
            intersection_res = _mm_or_ps(intersection_res, plane_res); //if yes - sphere behind the plane & outside frustum
        }

        //store result
        __m128i intersection_res_i = _mm_cvtps_epi32(intersection_res);
        _mm_store_si128((__m128i *)&culling_res_sse[i], intersection_res_i);
    }
}

void sse_culling_aabb(AABB *aabb_data, int num_objects, int *culling_res, vec4 *frustum_planes)
{
    float *aabb_data_ptr = reinterpret_cast<float*>(&aabb_data[0]);
    int *culling_res_sse = &culling_res[0];

    //to optimize calculations we gather xyzw elements in separate vectors
    __m128 zero_v = _mm_setzero_ps();
    __m128 frustum_planes_x[6];
    __m128 frustum_planes_y[6];
    __m128 frustum_planes_z[6];
    __m128 frustum_planes_d[6];
    int i, j;
    for (i = 0; i < 6; i++)
    {
        frustum_planes_x[i] = _mm_set1_ps(frustum_planes[i].x);
        frustum_planes_y[i] = _mm_set1_ps(frustum_planes[i].y);
        frustum_planes_z[i] = _mm_set1_ps(frustum_planes[i].z);
        frustum_planes_d[i] = _mm_set1_ps(frustum_planes[i].w);
    }
    __m128 zero = _mm_setzero_ps();

    //we process 4 objects per step
    for (i = 0; i < num_objects; i += 4)
    {
        //load objects data
        //load aabb min
        __m128 aabb_min_x = _mm_load_ps(aabb_data_ptr);
        __m128 aabb_min_y = _mm_load_ps(aabb_data_ptr + 8);
        __m128 aabb_min_z = _mm_load_ps(aabb_data_ptr + 16);
        __m128 aabb_min_w = _mm_load_ps(aabb_data_ptr + 24);
        //load aabb max
        __m128 aabb_max_x = _mm_load_ps(aabb_data_ptr + 4);
        __m128 aabb_max_y = _mm_load_ps(aabb_data_ptr + 12);
        __m128 aabb_max_z = _mm_load_ps(aabb_data_ptr + 20);
        __m128 aabb_max_w = _mm_load_ps(aabb_data_ptr + 28);
        aabb_data_ptr += 32;

        //for now we have points in vectors aabb_min_x..w, but for calculations we need the xxxx yyyy zzzz vectors representation - just transpose data
        _MM_TRANSPOSE4_PS(aabb_min_x, aabb_min_y, aabb_min_z, aabb_min_w);
        _MM_TRANSPOSE4_PS(aabb_max_x, aabb_max_y, aabb_max_z, aabb_max_w);

        __m128 intersection_res = _mm_setzero_ps();
        for (j = 0; j < 6; j++) //plane index
        {
            //this code is similar to what we do in simple culling
            //pick the closest point to the plane and check if it is behind the plane.
            //if yes - object outside frustum
            //dot product, separate for each coordinate, for min & max aabb points
            __m128 aabbMin_frustumPlane_x = _mm_mul_ps(aabb_min_x, frustum_planes_x[j]);
            __m128 aabbMin_frustumPlane_y = _mm_mul_ps(aabb_min_y, frustum_planes_y[j]);
            __m128 aabbMin_frustumPlane_z = _mm_mul_ps(aabb_min_z, frustum_planes_z[j]);
            __m128 aabbMax_frustumPlane_x = _mm_mul_ps(aabb_max_x, frustum_planes_x[j]);
            __m128 aabbMax_frustumPlane_y = _mm_mul_ps(aabb_max_y, frustum_planes_y[j]);
            __m128 aabbMax_frustumPlane_z = _mm_mul_ps(aabb_max_z, frustum_planes_z[j]);

            //we have 8 box points, but we need to pick the closest point to the plane. Just take max
            __m128 res_x = _mm_max_ps(aabbMin_frustumPlane_x, aabbMax_frustumPlane_x);
            __m128 res_y = _mm_max_ps(aabbMin_frustumPlane_y, aabbMax_frustumPlane_y);
            __m128 res_z = _mm_max_ps(aabbMin_frustumPlane_z, aabbMax_frustumPlane_z);

            //dist to plane = dot(aabb_point.xyz, plane.xyz) + plane.w
            __m128 sum_xy = _mm_add_ps(res_x, res_y);
            __m128 sum_zw = _mm_add_ps(res_z, frustum_planes_d[j]);
            __m128 distance_to_plane = _mm_add_ps(sum_xy, sum_zw);
            __m128 plane_res = _mm_cmple_ps(distance_to_plane, zero); //dist from closest point to plane < 0 ?
            intersection_res = _mm_or_ps(intersection_res, plane_res); //if yes - aabb behind the plane & outside frustum
        }

        //store result
        __m128i intersection_res_i = _mm_cvtps_epi32(intersection_res);
        _mm_store_si128((__m128i *)&culling_res_sse[i], intersection_res_i);
    }
}

OBB culling is a bit harder. We perform calculations on one object at a time, but process the three axes (x, y, z) simultaneously. It is not optimal, but it reflects the basic idea of the algorithm. Besides, vector math (matrix multiplications and point transformations) performs faster with SSE.
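The clip-space idea behind the OBB test can be shown with scalar code first: transform the 8 box corners by the combined matrix and check whether all of them lie beyond the same clip plane. The following is a minimal sketch under stated assumptions: the `Vec4` and `Mat4` types (row-major storage, column vectors), the helper `transform` and the function name are hypothetical, and the OpenGL clip convention -w <= x, y, z <= w is assumed.

```cpp
// Hypothetical minimal types for illustration only.
struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; }; // row-major storage, result = M * v

inline Vec4 transform(const Mat4& M, const Vec4& v)
{
    return {
        M.m[0][0] * v.x + M.m[0][1] * v.y + M.m[0][2] * v.z + M.m[0][3] * v.w,
        M.m[1][0] * v.x + M.m[1][1] * v.y + M.m[1][2] * v.z + M.m[1][3] * v.w,
        M.m[2][0] * v.x + M.m[2][1] * v.y + M.m[2][2] * v.z + M.m[2][3] * v.w,
        M.m[3][0] * v.x + M.m[3][1] * v.y + M.m[3][2] * v.z + M.m[3][3] * v.w };
}

// box_min/box_max are given in the object's local space,
// clip_from_local = projection * view * object matrix.
inline bool obb_in_frustum(const float box_min[3], const float box_max[3],
                           const Mat4& clip_from_local)
{
    bool all_above[3] = { true, true, true }; // every corner has coord >  w
    bool all_below[3] = { true, true, true }; // every corner has coord < -w
    for (int corner = 0; corner < 8; corner++)
    {
        const Vec4 local = { (corner & 1) ? box_max[0] : box_min[0],
                             (corner & 2) ? box_max[1] : box_min[1],
                             (corner & 4) ? box_max[2] : box_min[2],
                             1.0f };
        const Vec4 c = transform(clip_from_local, local);
        const float coord[3] = { c.x, c.y, c.z };
        for (int axis = 0; axis < 3; axis++)
        {
            all_above[axis] = all_above[axis] && coord[axis] > c.w;
            all_below[axis] = all_below[axis] && coord[axis] < -c.w;
        }
    }
    // the box is outside if all 8 corners are beyond the same clip plane
    for (int axis = 0; axis < 3; axis++)
        if (all_above[axis] || all_below[axis])
            return false;
    return true;
}
```

The test is conservative: a box that lies outside near a frustum corner may still be reported as visible, which only costs some extra rendering and never hides a visible object.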
void sse_culling_obb(int firs_processing_object, int num_objects, int *culling_res, mat4 &cam_modelview_proj_mat)
{
    mat4_sse sse_camera_mat(cam_modelview_proj_mat);
    mat4_sse sse_clip_space_mat;

    //box points in local space
    __m128 obb_points_sse[8];
    obb_points_sse[0] = _mm_set_ps(1.f, box_min[2], box_max[1], box_min[0]);
    obb_points_sse[1] = _mm_set_ps(1.f, box_max[2], box_max[1], box_min[0]);
    obb_points_sse[2] = _mm_set_ps(1.f, box_max[2], box_max[1], box_max[0]);
    obb_points_sse[3] = _mm_set_ps(1.f, box_min[2], box_max[1], box_max[0]);
    obb_points_sse[4] = _mm_set_ps(1.f, box_min[2], box_min[1], box_max[0]);
    obb_points_sse[5] = _mm_set_ps(1.f, box_max[2], box_min[1], box_max[0]);
    obb_points_sse[6] = _mm_set_ps(1.f, box_max[2], box_min[1], box_min[0]);
    obb_points_sse[7] = _mm_set_ps(1.f, box_min[2], box_min[1], box_min[0]);

    ALIGN_SSE int obj_culling_res[4];
    __m128 zero_v = _mm_setzero_ps();
    int i, j;

    //process one object per step
    for (i = firs_processing_object; i < firs_processing_object+num_objects; i++)
    {
        //clip space matrix = camera_view_proj * obj_mat
        sse_mat4_mul(sse_clip_space_mat, sse_camera_mat, sse_obj_mat);

        //initially assume that planes are separating
        //if any axis is separating - we get 0 in certain outside_* place
        //NOTE: in _mm_set1_ps() should be negative value, because _mm_movemask_ps (while storing result)
        //cares about 'most significant bits' (it is sign of float value)
        __m128 outside_positive_plane = _mm_set1_ps(-1.f);
        __m128 outside_negative_plane = _mm_set1_ps(-1.f);

        //for all 8 box points
        for (j = 0; j < 8; j++)
        {
            //transform point to clip space
            __m128 obb_transformed_point = sse_mat4_mul_vec4(sse_clip_space_mat, obb_points_sse[j]);

            //gather w & -w
            __m128 wwww = _mm_shuffle_ps(obb_transformed_point, obb_transformed_point, _MM_SHUFFLE(3, 3, 3, 3)); //get w
            __m128 wwww_neg = _mm_sub_ps(zero_v, wwww); // negate all elements

            //box_point.xyz > box_point.w || box_point.xyz < -box_point.w ?
            //similar to point normalization: point.xyz /= point.w; And compare: point.xyz > 1 && point.xyz < -1
            __m128 outside_pos_plane = _mm_cmpge_ps(obb_transformed_point, wwww);
            __m128 outside_neg_plane = _mm_cmple_ps(obb_transformed_point, wwww_neg);

            //if at least 1 of 8 points in front of the plane - we get 0 in outside_* flag
            outside_positive_plane = _mm_and_ps(outside_positive_plane, outside_pos_plane);
            outside_negative_plane = _mm_and_ps(outside_negative_plane, outside_neg_plane);
        }

        //all 8 points xyz < -1 or > 1 ?
        __m128 outside = _mm_or_ps(outside_positive_plane, outside_negative_plane);

        //store result, if any of 3 axes is separating (i.e. outside != 0) - object outside frustum
        //so, object inside frustum only if outside == 0 (there are no separating axes)
        culling_res[i] = _mm_movemask_ps(outside) & 0x7; //& 0x7 mask, because we are interested only in 3 axes
    }
}

Table 2. SSE culling results of 100k objects. Intel Core I5. Single thread. SSE. Time in ms.

SSE Culling     Sphere    AABB    OBB
Just culling    0.26      0.46    3.48
Whole frame     1.2       1.43    4.6

The SSE implementation is on average 3 times faster than the simple C++ one.

Multithreading

Nowadays processors have several cores, and calculations might be performed simultaneously on all of them. The architecture of new games should be planned with multithreading in mind, i.e. work should be split into independent parts/tasks that are solved simultaneously, loading all the processor cores evenly. The design should be flexible. Too many small tasks lead to overhead on synchronization and on switching between tasks. Too few big tasks lead to uneven loading of cores. A balance is needed. In current games there might be from several hundred to a thousand tasks per frame. In our case of frustum culling each object is independent from the rest. That is why we can easily split the work into equal groups and cull them simultaneously on different cores of the processor. After starting the jobs we need to wait for the threads to do their work and then gather the results.
Of course, we should not ask for the results right after execution starts.

Worker::Worker() : first_processing_oject(0), num_processing_ojects(0)
{
    //create 2 events: 1. to signal that we have a job 2. to signal that we finished the job
    has_jobs_event = CreateEvent(NULL, false, false, NULL);
    jobs_finished_event = CreateEvent(NULL, false, true, NULL);
}

void Worker::doJob()
{
    //make our part of work
    cull_objects(first_processing_oject, num_processing_ojects);
}

unsigned __stdcall thread_func(void* arguments)
{
    printf("In thread...\n");
    Worker *worker = static_cast<Worker*>(arguments);
    //each worker has an endless loop until we signal it to quit (stop_work flag)
    while (true)
    {
        //wait for starting jobs
        //if we have no job - just wait (has_jobs_event event). We do not waste cpu work. Events are designed for this.
        WaitForSingleObject(worker->has_jobs_event, INFINITE);
        //if we have signal to break - exit endless loop
        if (worker->stop_work)
            break;
        //do job
        worker->doJob();
        //signal that we finished the job
        SetEvent(worker->jobs_finished_event);
    }
    _endthreadex(0);
    return 0;
}

void create_threads()
{
    //create the threads
    //split the work into parts between threads
    int worker_num_processing_ojects = MAX_SCENE_OBJECTS / num_workers;
    int first_processing_oject = 0;
    int i;
    for (i = 0; i < num_workers; i++)
    {
        //create threads
        workers[i].thread_handle = (HANDLE)_beginthreadex(NULL, 0, &thread_func, &workers[i], CREATE_SUSPENDED, &workers[i].thread_id);
        thread_handles[i] = workers[i].thread_handle;
        //set threads parameters
        workers[i].first_processing_oject = first_processing_oject;
        workers[i].num_processing_ojects = worker_num_processing_ojects;
        first_processing_oject += worker_num_processing_ojects;
    }
    //run workers to do their jobs
    for (int i = 0; i < num_workers; i++)
        ResumeThread(workers[i].thread_handle);
}

void process_multithreading_culling()
{
    //signal workers that they have the job
    for (int i = 0; i < num_workers; i++)
        SetEvent(workers[i].has_jobs_event);
}

void wate_multithreading_culling_done()
{
    //wait threads to do their jobs
    HANDLE wait_events[num_workers];
    for (int i = 0; i < num_workers; i++)
        wait_events[i] = workers[i].jobs_finished_event;
    WaitForMultipleObjects(num_workers, &wait_events[0], true, INFINITE);
}

Table 3. Culling results of 100k objects. Intel Core I5 (4 cores). Time in ms. In brackets: speedup relative to the simple C++ implementation.

Method                       Sphere         AABB           OBB
Simple c++                   0.92 (1)       1.42 (1)       9.14 (1)
SSE                          0.26 (3.54)    0.46 (3.08)    3.48 (2.62)
Simple c++, Multithreaded    0.25 (3.68)    0.4 (3.55)     2.5 (3.65)
SSE, Multithreaded           0.1 (9.2)      0.18 (7.89)    1 (9.14)

The multithreaded version is on average 3.6 times faster than the single threaded one. Using SSE gives us a 3 times speedup relative to the simple C++ implementation. Using both SSE and multithreading gives us an 8.7 times speedup on average, i.e. we speed up our calculations by almost 9 times, depending on the culling primitive type used.

GPU culling

The GPU is designed to perform the same operation on a huge amount of data. It has a lot more parallel threads (thousands) than a CPU (2-8 in most desktop cases). But culling on the GPU is not always convenient:
- it assumes a special graphics engine architecture
- there is an unpleasant moment: the dip is issued on the CPU side. For this we need to know the number of primitives generated by the GPU (objects visible in the frustum, in our case). That is why we need to ask the GPU for feedback. There are special commands for this purpose. The problem is that if we want to get the result in the same frame as the culling and rendering, we get a GPU stall, because we need to wait for the result. This is bad for performance. If we read the result from the previous frame, we get bugs. The full solution to this problem is to use DrawIndirect commands and prepare the information about the dip on the GPU side. This is available since DirectX11 and OpenGL 4.

The implementation of GPU culling consists of the following steps: Pack all instance data in a vertex buffer. Assume that one vertex is one object for culling. The number of attributes per vertex is equal to the amount of data per object. Enable transform feedback.
Send the prepared vertex buffer to render. All results are redirected to another vertex buffer with the data of visible instances. In the vertex shader check the visibility of the object. In the geometry shader discard the object / kill the vertex if the instance is not visible in the frustum. Thus, we form a buffer with the data of just the visible instances. But now we need to get the number of visible objects from the GPU to make the dip on the CPU side. In this case we read the transform feedback query from the previous frame (just for code simplicity).

void do_gpu_culling()
{
    culling_shader.bind();
    int cur_frame = frame_index % 2;
    int prev_frame = (frame_index + 1) % 2;

    //enable transform feedback & query
    glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, dips_texture_buffer);
    glBeginQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN, num_visible_instances_query[cur_frame]);
    glBeginTransformFeedback(GL_POINTS);

    //render cloud of points which we interpret as objects data
    glBindVertexArray(all_instances_data_vao);
    glDrawArrays(GL_POINTS, 0, MAX_SCENE_OBJECTS);
    glBindVertexArray(0);

    //disable all
    glEndTransformFeedback();
    glEndQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN);
    glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, 0);
    glDisable(GL_RASTERIZER_DISCARD);

    //get feedback from prev frame
    num_visible_instances = 0;
    glGetQueryObjectiv(num_visible_instances_query[prev_frame], GL_QUERY_RESULT, &num_visible_instances);

    //next frame
    frame_index++;
}

Vertex shader from 3rd step:

#version 330 core
in vec4 s_attribute_0;
in vec4 s_attribute_1;
out vec4 instance_data1;
out vec4 instance_data2;
out int visible;
uniform mat4 ModelViewProjectionMatrix;
uniform vec4 frustum_planes[6];
int InstanceCloudReduction()
{
    //sphere - frustum test
    bool inside = true;
    for (int i = 0; i < 6; i++)
    {
        if (dot(frustum_planes[i].xyz, s_attribute_0.xyz) + frustum_planes[i].w