CDProp

I'm having trouble making sense of these performance numbers (OpenGL)


Greetings. This is one of those dreaded "shouldn't it be faster?" type questions, but I'm hoping someone can help me, because I am truly baffled.

 

I'm trying to explore instancing a bit. To that end, I created a demo that has 50,000 randomly-positioned cubes. It's running full-screen, at the native resolution of my monitor. Vsync is forced off through the NVidia control panel. No anti-aliasing. I'm also not doing any frustum culling, but I am doing back-face culling. Here is a screenshot:

 

[Screenshot: the 50,000 randomly-positioned cubes test scene]

 

The shaders are very simple. All they do is calculate some basic flat shading:

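// Vertex shader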
#version 430

layout(location = 0) in vec4 pos;
layout(location = 1) in vec3 norm;

uniform mat4 mv;
uniform mat4 mvp;

out vec3 varNorm;
out vec3 varLightDir;

void main() {
	gl_Position = mvp*pos;
	varNorm = (mv*vec4(norm,0)).xyz;
	varLightDir = (mv*vec4(1.5,2.0,1.0,0)).xyz;
}
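
// Fragment shader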
#version 430

in vec3 varNorm;
in vec3 varLightDir;
out vec4 fragColor;

void main() {
	vec3 normal = normalize(varNorm);
	vec3 lightDir = normalize(varLightDir);
	float lambert = dot(normal,lightDir);
	fragColor = vec4(lambert,lambert,lambert,1);
}

I know I have a little bit of cruft in there (hard-coded light passed as a varying), but the shaders are not very complicated.

 

I eventually wrote three versions of the program (a rough code sketch of the three draw paths follows the list):

  1. One that draws each cube individually with DrawArrays (no indexing)
  2. One that draws each cube individually with DrawElements (indexed, with 24 unique verts instead of 36, no vertex cache optimization)
  3. One that draws all cubes at once with DrawElementsInstanced (same indexing as before)
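
Roughly, the three draw paths boil down to the following (a simplified sketch from memory, not my exact code; cubeCount, setMatrixUniforms and the VAO names are placeholders):

// Variant 1: one non-indexed draw per cube (36 verts each)
glBindVertexArray(cubeVAO);
for (int i = 0; i < cubeCount; ++i) {
	setMatrixUniforms(i);                        // upload this cube's mv/mvp
	glDrawArrays(GL_TRIANGLES, 0, 36);
}

// Variant 2: one indexed draw per cube (24 unique verts, 36 indices)
glBindVertexArray(indexedCubeVAO);
for (int i = 0; i < cubeCount; ++i) {
	setMatrixUniforms(i);
	glDrawElements(GL_TRIANGLES, 36, GL_UNSIGNED_INT, 0);
}

// Variant 3: every cube in a single instanced draw
glBindVertexArray(instancedCubeVAO);
glDrawElementsInstanced(GL_TRIANGLES, 36, GL_UNSIGNED_INT, 0, cubeCount);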

I noticed zero performance difference between these variations. To really test this, I ran each version of the program several times, with a different number of cubes each run: 1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000, 500000, and 1000000. I am using QueryPerformanceCounter and QueryPerformanceFrequency to measure the frame times. I store the frame times in memory until the program is closed, at which point I write them out to a CSV file. I then opened each CSV file in Excel and averaged the frame times. In some cases I omitted the first few frames of data from the average, as these were obvious outliers.
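
The timing loop itself is essentially this (a simplified sketch; quitRequested, drawScene and hdc stand in for my actual window/render code):

#include <windows.h>
#include <vector>

std::vector<double> frameTimes;
LARGE_INTEGER freq, prev, now;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&prev);

while (!quitRequested) {
	drawScene();                      // one of the three draw paths above
	SwapBuffers(hdc);                 // vsync forced off in the driver

	QueryPerformanceCounter(&now);
	// convert ticks to milliseconds for this frame
	frameTimes.push_back(1000.0 * double(now.QuadPart - prev.QuadPart) / double(freq.QuadPart));
	prev = now;
}
// frameTimes gets dumped to the .csv on exit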

 

Here are the results.

 

[Figure: log-log plot of average frame time vs. number of cubes for each draw method]

 

This is a log-log plot showing that the increase in frame time is linear with respect to the number of cubes drawn, and that performance is essentially the same no matter which technique I used. One word of explanation about the "Pan" suffix: I actually ran two versions of each program. In one version, the camera was static; in the other, the camera was panning. The reason for this is that keeping the camera static allowed me to avoid updating the matrix uniforms each frame. I didn't expect this to cause a big performance increase, except in the DrawElementsInstanced version, where the static camera allows me to skip updating the big buffers that hold all of the matrices.

 

[Figure: linear plot of average frame time vs. number of cubes, 100,000 to 1,000,000 range]

 

This is a linear plot of just the 100,000 to 1,000,000 cube range. The log-log plot sometimes exaggerates or downplays differences, so I just wanted to show that the linear plot tells essentially the same story. In fact, the DrawArraysPan method was fastest, even though I expected it to be the slowest.

 

[Figure: triangles per second vs. number of cubes for each draw method]

 

This is just a plot of the triangles-per-second I'm getting with each method. As you can see, they are essentially all the same. I understand that triangles-per-second is not a great absolute measure of performance, but since I'm comparing apples-to-apples here, it seems to be a good relative measure.

 

Speaking of which, I feel like the triangles-per-second numbers are really low. I know that I just said that triangles-per-second are a bad absolute measure of performance, but hear me out. The computer I'm testing this on has an Intel Core i5-4570, 8GB RAM, and a GTX 770. I feel like these numbers are a couple orders of magnitude lower than what I would expect. 

 

Anyway, I'm trying to find what the bottleneck is, but everything just seems to be linear with respect to the number of models being drawn, regardless of how many unique verts are in that model, and regardless of how many draw calls are involved. 


One more bit of explanation:

 

  • When I was drawing 50,000 cubes using DrawArrays, I was getting about 48fps.
  • I thought that, by indexing the cube geometry (thus reducing the number of unique verts per cube from 36 to 24), I would see about a 1/3rd reduction in frame time. I did not optimize the vert order for the vertex cache. However, I would be surprised if the cache is smaller than 36 verts (just positions and normals). Anyway, I did not see any performance increase, so I thought, "Maybe I'm CPU bound."
  • So, I next implemented instancing with DrawElementsInstanced, which allowed me to draw all 50,000 cubes with one draw call. This almost surely eliminated the CPU overhead. However, there was no change in performance. So, I felt that ruled out the CPU as the bottleneck also.
  • At this point, I actually tried reducing the fragment shader to one that does no calculation; it just outputs the color white. Still no change in performance.
  • So, if I'm not vertex bound, and I'm not CPU bound, and I'm not fill-rate limited, then what can it be? I wondered if maybe it was something about sending 50,000 mvp and mv matrices (each) over the bus (a sketch of one common way to feed those per-instance matrices follows this list). So, that's when I started running it with different numbers of models (1000, 2000, 5000, etc.), with each variation above (except for the white-only variation) to see if there is a point where the bottleneck presents itself.
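
For reference, here is a minimal sketch of one common way to feed per-instance matrices to the vertex shader via instanced attributes (attribute locations 2 through 5, instanceMatrixVBO and mvpMatrices are illustrative names, not necessarily what my program does; the shader would then read the matrix from an attribute instead of a uniform):

// A mat4 attribute occupies four consecutive vec4 slots (locations 2..5 here).
glBindVertexArray(instancedCubeVAO);
glBindBuffer(GL_ARRAY_BUFFER, instanceMatrixVBO);    // one tightly packed mat4 per cube
for (int i = 0; i < 4; ++i) {
	glEnableVertexAttribArray(2 + i);
	glVertexAttribPointer(2 + i, 4, GL_FLOAT, GL_FALSE,
	                      16 * sizeof(float), (void*)(sizeof(float) * 4 * i));
	glVertexAttribDivisor(2 + i, 1);                 // advance once per instance, not per vertex
}

// Per frame (panning camera): re-upload all matrices, then draw everything at once.
glBufferSubData(GL_ARRAY_BUFFER, 0, cubeCount * 16 * sizeof(float), mvpMatrices);
glDrawElementsInstanced(GL_TRIANGLES, 36, GL_UNSIGNED_INT, 0, cubeCount);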

I don't feel that the bottleneck has presented itself, but I don't know where else to look. I could post my C++ code, if that'd help, but it's really pretty straightforward. One-file sort of deal.


So, you're trying to measure the CPU-side impact of different API usage patterns -- first things first, make sure you can exclude the GPU's performance from the picture.

  • Add a CPU timer (QueryPerformanceCounter) around the SwapBuffers function -- if the GPU is the bottleneck, the CPU will usually stall in this function. If the time recorded here starts increasing or displays a large amount of variance, then GPU-side performance is probably polluting your experiment.
  • Add a GPU timer (ARB_timer_query) for the start/end of each frame, and make sure to only read back the query results (timestamps) 2 or 3 frames after submitting the queries. Use the timestamps to compute GPU-side time-per-frame. If these values are similar to or higher than your QueryPerformanceCounter-derived time-per-frame values, then GPU-side performance is definitely polluting your experiment (a rough sketch of both timers follows this list).
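
Something along these lines, for example (an untested sketch; it uses a GL_TIME_ELAPSED query rather than start/end timestamps, but the idea is the same, and the ring size plus names like drawScene/hdc are arbitrary):

// Assumes a GL 3.3+ context (ARB_timer_query is core there).
const int QUERY_LAG = 3;                    // read results a few frames after submission
GLuint gpuQueries[QUERY_LAG];
glGenQueries(QUERY_LAG, gpuQueries);
int frameIndex = 0;

// --- each frame ---
int writeIdx = frameIndex % QUERY_LAG;
int readIdx  = (frameIndex + 1) % QUERY_LAG;     // oldest query in the ring

glBeginQuery(GL_TIME_ELAPSED, gpuQueries[writeIdx]);
drawScene();
glEndQuery(GL_TIME_ELAPSED);

LARGE_INTEGER before, after, freq;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&before);
SwapBuffers(hdc);                           // if the GPU is the bottleneck, the CPU stalls here
QueryPerformanceCounter(&after);
double swapMs = 1000.0 * double(after.QuadPart - before.QuadPart) / double(freq.QuadPart);

if (frameIndex >= QUERY_LAG - 1) {          // the oldest query should have a result by now
	GLuint64 gpuNs = 0;
	glGetQueryObjectui64v(gpuQueries[readIdx], GL_QUERY_RESULT, &gpuNs);
	double gpuMs = double(gpuNs) / 1.0e6;
	// log swapMs and gpuMs alongside your total frame time
}
++frameIndex;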

I thought that, by indexing the cube geometry (thus reducing the number of unique verts per cube from 36 to 24), I would see about a 1/3rd reduction in frame time.

Only if you were GPU vertex-processing bound.

Anyway, I did not see any performance increase, so I thought, "Maybe I'm CPU bound."

Instancing is a CPU-side optimization, so you should make sure that you are CPU bound in order to test its effectiveness!

So, I next implemented instancing with DrawElementsInstanced, which allowed me to draw all 50,000 cubes with one draw call. This almost surely eliminated the CPU overhead. However, there was no change in performance. So, I felt that ruled out the CPU as the bottleneck also.
So, if I'm not vertex bound, and I'm not CPU bound, and I'm not fill-rate limited, then what can it be?

Maybe you were CPU bound, and now you're GPU bound. Maybe the CPU-side and GPU-side time-per-frame values are just very similar? Start by getting your hands on both values!

Also, what kind of frame-time range were you dealing with here? Values that are too small (e.g. smaller than a typical frame) aren't great for benchmarking, because the OS and drivers may well be optimized to slow down programs that are running unreasonably fast. For example, displaying 1000 frames per second may just be seen as a waste of power by the OS/driver.

Speaking of which, I feel like the triangles-per-second numbers are really low.

You need to have more triangles per batch to get that value up -- instancing with low-poly meshes doesn't really fix the "small batch" problem. Change your cube to a high-poly model and triangles-per-second will almost certainly increase (and your vertex-processing-related optimizations will suddenly make a big impact on frametime).

Edited by Hodgman


Thanks to both of you for reading this and helping me out. 

 


Add a CPU timer (QueryPerformanceCounter) around the SwapBuffers function -- if the GPU is the bottleneck, the CPU will usually stall in this function. If the time recorded here starts increasing or displays a large amount of variance, then GPU-side performance is probably polluting your experiment.
Add a GPU timer (ARB_timer_query) for the start/end of each frame, and make sure to only read back the query results (timestamps) 2 or 3 frames after submitting the queries. Use the timestamps to compute GPU-side time-per-frame. If these values are similar to or higher than your QueryPerformanceCounter-derived time-per-frame values, then GPU-side performance is definitely polluting your experiment

 

..

 

Start by getting your hands on both values!

 

Great advice, thanks. Any info I can get on what's really going on will be a big help. 

 


You need to have more triangles per batch to get that value up -- instancing with low-poly meshes doesn't really fix the "small batch" problem.

 

I didn't realize this. I thought that, by getting everything into one VAO and drawing it all with one draw call (no state changes in between), I had effectively solved the batching issues. Do you know why the GPU is still "seeing" these thousands of cubes as separate batches instead of one?

 


http://www.g-truc.net/post-0662.html
http://www.g-truc.net/post-0666.html

 

I have a few questions about these articles. I can believe what they're saying, but some things need clarification:

 

1. Concerning the small triangles, it looks to me like there is a linear relationship between the frame times and the number of triangles drawn. The author is graphing the polygon size vs. the frame time. The polygon size is cut in half with each step, which means the number of polygons is increased by four. So, the graph looks quadratic, which is what we'd expect if there was a linear relationship between the number of triangles and the frame time. If I were to look at this graph (and admittedly, I'm just learning to analyze this stuff properly), I would think that the system becomes vertex-bound somewhere between 8x8 and 4x4, where there are 388,800 vertices on the screen. Before that, there is some other bottleneck, ensuring that changes in vertex count don't matter. How is the author controlling for this possibility?

 

2. If you look further down, the author shows a graph suggesting that the performance cliff is exponential, but that's hard to see. The vertical axis is log10, and the horizontal axis is log2 with respect to vertex count. I really suspect that the relationship is actually linear with respect to vertex count.

 

3. Concerning the triangles per draw call, it looks like the author says that it's not the number of draw calls per se that makes small batching slow, but rather all of the validation that happens for each draw call due to the state changes that happen in between the draw calls. This was my understanding as well. However, it doesn't look like he's making any state changes in between draw calls, so I'm not sure how his experiment demonstrates the point he's trying to make. In any case, am I to conclude that my DrawArrays implementation is no worse than my DrawElementsInstanced implementation because I wasn't making any state changes (other than uniforms) in between calls to DrawArrays?

 

4. It also looks like, although he is varying the number of triangles drawn per draw call (and thus varying the number of draw calls needed to draw the entire buffer), he is still submitting only one instance per draw call. Again, this supports the idea that performance is worse if you make more draw calls. However, I am still confused as to why performance problems persist if everything is drawn with one DrawElementsInstanced call.



I didn't realize this. I thought that, by getting everything into one VAO and drawing it all with one draw call (no state changes in between), I had effectively solved the batching issues. Do you know why the GPU is still "seeing" these thousands of cubes as separate batches instead of one?
If you perform "pseudo-instancing", where you duplicate the one cube mesh 10000 times into a very large VBO, then it will be a single batch, and will render very efficiently.
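
i.e. something roughly like this (a sketch under assumed names -- Vertex, cubeVerts, cubePositions and bigVBO aren't from your code, and it assumes glm for the math types):

// Build one big VBO with every cube pre-transformed into world space,
// so the whole scene becomes a single non-instanced batch.
struct Vertex { glm::vec3 pos; glm::vec3 norm; };

std::vector<Vertex> batched;
batched.reserve(cubeCount * 36);
for (int i = 0; i < cubeCount; ++i) {
	for (int v = 0; v < 36; ++v) {
		Vertex vert = cubeVerts[v];          // unit cube, 36 verts
		vert.pos += cubePositions[i];        // bake in the per-cube translation
		batched.push_back(vert);
	}
}
glBindBuffer(GL_ARRAY_BUFFER, bigVBO);
glBufferData(GL_ARRAY_BUFFER, batched.size() * sizeof(Vertex),
             batched.data(), GL_STATIC_DRAW);

// Draw time: the vertex shader only needs the camera's view-projection now.
glDrawArrays(GL_TRIANGLES, 0, (GLsizei)batched.size());

The downside is memory and upload cost, but for a static scene it can be a perfectly good trade.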

 

Perhaps it's been solved on the latest GPUs, but for a long time, it's been a rule of thumb that instancing does not perform well for low-poly meshes. I'm not sure why... Either there's still overhead that has to be performed for each instance, or perhaps different instances cannot be grouped into the same wavefront/thread-group on the GPU? e.g. AMD's processors can operate on 64 pixels/vertices at a time -- if this is true, within one processor, 8 threads would be busy running the vertex shaders for one cube instance, while 56 threads sit idle.


Concerning the small triangles, it looks to me like there is a linear relationship between the frame times and the number of triangles drawn. The author is graphing the polygon size vs. the frame time. The polygon size is cut in half with each step, which means the number of polygons is increased by four. So, the graph looks quadratic, which is what we'd expect if there was a linear relationship between the number of triangles and the frame time.
The graph is flat (no change in frame-time) until the quad size reaches 16x16 pixels -- he goes from a single 1920*1080px tile to 32*32px tiles (1 tile to 2040 tiles) with no increase in frame time. It's only once the tiles reach 8*8 pixels that the graph shoots upwards suddenly.

As above, this is likely because AMD GPU cores use 64-wide SIMD instructions to shade 64 pixels at a time.

 
Also note in his graph that tiles of size 32px * 8px take a different amount of time to render than tiles of size 8px * 32px! That's partly because of cache and memory layout reasons, but also partly because every GPU rasterizes triangles in a different pattern, often somewhat hierarchically. Some triangle shapes will better match that pattern than others.
 

Furthermore, almost every GPU (going back 10 years or more up until today!) does not rasterize individual pixels. GPUs rasterize "pixel quads", which are 2*2px areas of the screen. If a triangle cuts through part of a 2*2 area -- e.g. it only covers 1 pixel -- then the whole pixel quad is still shaded, but some of the pixels are discarded. That's one reason why the 1*1 pixel tiles are incredibly slow.

It's also a reason why LOD'ing models is important! On one game I worked on, we weren't going to bother with LODs, as vertex shading wasn't much of a bottleneck for us... However, profiling showed that distant meshes were taking waaay too long to draw -- these meshes were mostly made up of sub-pixel triangles, where most triangles covered zero pixels, and a few lucky ones covered one pixel. After implementing LOD'ing, the vertex shading time of course improved, but the pixel shading time also improved by ~200 to 300% due to the reduction in small triangles (a.k.a. a massive improvement in pixel-quad efficiency).


Concerning the triangles per draw call, it looks like the author says that it's not the number of draw calls per se that makes small batching slow, but rather all of the validation that happens for each draw call due to the state changes that happen in between the draw calls.
Validation is a CPU bottleneck -- he says that batching is usually done to help out the CPU here, but goes on to say:

In this post, we are looking at the GPU draw call performance ... To make sure that we are not CPU bound, I ..... In these tests, we are GPU bound somewhere.


Alright, so I wasn't able to start on this until late this evening, but I do have some results to share. The following graph shows the time vs. frame number for 50,000 cubes rendered using DrawElementsInstanced (no camera panning):

 

[Figure: time vs. frame number, 50,000 cubes, DrawElementsInstanced]

So, it seems that the GPU is the bottleneck in this case. Almost the entire frame time is spent waiting for SwapBuffers to return. I tried the same experiment with 5,000 cubes and got the same results (albeit with smaller frame times). That is, gpuTime and swapBuffersTime were very close to the total frame time.

 

I then tried running the same experiments with DrawElements (not instanced), and I got a very different plot. This time, the frame time and GPU time were still about equal, but the SwapBuffers time was way lower:

 

[Figure: time vs. frame number, 50,000 cubes, DrawElements (not instanced)]

This looks to me like the GPU is still taking the same amount of time to draw the cubes as in the instanced case, but since the CPU is spending so much more time submitting draw calls, there is much less time left over to wait for the buffer swap. Does that sound right?

 

I also tried using an object that is more complex than a cube -- just a quick mesh I made in Blender that has 804 unique verts. Once again, there is no performance difference between the DrawArrays, DrawElements, and DrawElementsInstanced cases. However, the good news is that the triangles-per-second increased by more than 2X with the more complex model, just as you predicted.

 

So, it appears that my test cases are not great -- they take long enough to draw on the GPU that there is plenty of time on the CPU side to submit all of the draw calls individually.

 

However, the vertex processing stage does not seem to be the culprit, since there is no difference in GPU time between the indexed and non-indexed cases. Next, I'll experiment more with fragment processing and reducing the number of single- and sub-pixel triangles in the scene.


