OpenGL I'm having trouble making sense of these performance numbers (OpenGL)


Greetings. This is one of those dreaded "shouldn't it be faster?" type questions, but I'm hoping someone can help me, because I am truly baffled.

 

I'm trying to explore instancing a bit. To that end, I created a demo that has 50,000 randomly-positioned cubes. It's running full-screen, at the native resolution of my monitor. Vsync is forced off through the NVidia control panel. No anti-aliasing. I'm also not doing any frustum culling, but I am doing back-face culling. Here is a screenshot:

 

[Screenshot of the 50,000-cube scene (pMU3Yxa.png)]

 

The shaders are very simple. All they do is calculate some basic flat shading:

// Vertex shader
#version 430

layout(location = 0) in vec4 pos;
layout(location = 1) in vec3 norm;

uniform mat4 mv;
uniform mat4 mvp;

out vec3 varNorm;
out vec3 varLightDir;

void main() {
	gl_Position = mvp*pos;
	varNorm = (mv*vec4(norm,0)).xyz;
	varLightDir = (mv*vec4(1.5,2.0,1.0,0)).xyz;
}
// Fragment shader
#version 430

in vec3 varNorm;
in vec3 varLightDir;
out vec4 fragColor;

void main() {
	vec3 normal = normalize(varNorm);
	vec3 lightDir = normalize(varLightDir);
	float lambert = dot(normal,lightDir);
	fragColor = vec4(lambert,lambert,lambert,1);
}

I know I have a little bit of cruft in there (hard-coded light passed as a varying), but the shaders are not very complicated.

 

I eventually wrote three versions of the program:

  1. One that draws each cube individually with DrawArrays (no indexing)
  2. One that draws each cube individually with DrawElements (indexed, with 24 unique verts instead of 36, no vertex cache optimization)
  3. One that draws all cubes at once with DrawElementsInstanced (same indexing as before)
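For concreteness, the vertex workload behind these three variants can be sketched with simple arithmetic (illustrative numbers only; a post-transform vertex cache can reduce the indexed invocation counts further):

```cpp
#include <cstdint>

// Vertices fed to the vertex stage per cube in each variant:
//  - DrawArrays: 6 faces * 2 triangles * 3 verts = 36, nothing shared.
//  - DrawElements: 24 unique verts (4 per face; faces cannot share verts,
//    since flat shading gives each face its own normal).
//  - DrawElementsInstanced: same 24 unique verts, but one draw call total.
constexpr std::uint64_t kVertsPerCubeArrays  = 36;
constexpr std::uint64_t kVertsPerCubeIndexed = 24;

// Upper bound on vertex-shader invocations for a given cube count; a
// perfect post-transform cache would hit exactly these numbers on the
// indexed paths.
std::uint64_t vertsProcessed(std::uint64_t cubes, bool indexed) {
    return cubes * (indexed ? kVertsPerCubeIndexed : kVertsPerCubeArrays);
}
// e.g. 50,000 cubes: 1,800,000 verts non-indexed vs 1,200,000 indexed.
```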

I noticed zero performance difference between these variations. To test this properly, I decided to run each version of the program several times, with a different number of cubes each time: 1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000, 500000, 1000000. I am using QueryPerformanceCounter and QueryPerformanceFrequency to measure the frame times. I store the frame times in memory until the program is closed, at which point I print them out to a CSV file. I then opened each CSV file in Excel and averaged the frame times. At times, I omitted the first few frames of data from the average, as these were obvious outliers.
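A minimal sketch of that measurement harness, using std::chrono::steady_clock as a portable stand-in for QueryPerformanceCounter/QueryPerformanceFrequency (the class name and structure are my own, not from the post):

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <vector>

// Records per-frame CPU times in memory and dumps them to CSV on
// shutdown, as described above.
class FrameTimer {
public:
    void beginFrame() { start_ = Clock::now(); }
    void endFrame() {
        frameTimesMs_.push_back(
            std::chrono::duration<double, std::milli>(Clock::now() - start_).count());
    }
    // One frame time per row, easy to average in Excel.
    void writeCsv(const char* path) const {
        std::ofstream out(path);
        out << "frame,ms\n";
        for (std::size_t i = 0; i < frameTimesMs_.size(); ++i)
            out << i << ',' << frameTimesMs_[i] << '\n';
    }
    const std::vector<double>& samples() const { return frameTimesMs_; }
private:
    using Clock = std::chrono::steady_clock;
    Clock::time_point start_;
    std::vector<double> frameTimesMs_;
};
```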

 

Here are the results.

 

[Log-log plot of frame time vs. cube count for each technique (QBxOPRV.png)]

 

This is a log-log plot showing that the increase in frame time is linear with respect to the number of cubes drawn, and performance is essentially the same no matter which technique I used. One word of explanation about the "Pan" suffix: I actually ran two versions of each program. In one version, the camera was static. In the other, the camera was panning. The reason I did this is that keeping the camera static allowed me to avoid updating the matrix uniforms each frame. I didn't expect this to cause a big performance increase, except in the DrawElementsInstanced version, where the static camera allows me to skip updating the big buffers that hold all of the matrices.

 

[Linear plot of frame times over the 100,000-1,000,000 cube range (fV2dqry.png)]

 

This is a linear plot of just the 100,000-1,000,000 cubes range. The log-log plot sometimes exaggerates or downplays differences, so I just wanted to show that the linear plot shows essentially the same thing. In fact, the DrawArraysPan method was fastest, even though I expected it to be the slowest.

 

[Plot of triangles per second for each method (IoqK1Kf.png)]

 

This is just a plot of the triangles-per-second I'm getting with each method. As you can see, they are essentially all the same. I understand that triangles-per-second is not a great absolute measure of performance, but since I'm comparing apples-to-apples here, it seems to be a good relative measure.

 

Speaking of which, I feel like the triangles-per-second numbers are really low. I know that I just said that triangles-per-second are a bad absolute measure of performance, but hear me out. The computer I'm testing this on has an Intel Core i5-4570, 8GB RAM, and a GTX 770. I feel like these numbers are a couple orders of magnitude lower than what I would expect. 

 

Anyway, I'm trying to find what the bottleneck is, but everything just seems to be linear with respect to the number of models being drawn, regardless of how many unique verts are in that model, and regardless of how many draw calls are involved. 


One more bit of explanation:

 

  • When I was drawing 50,000 cubes using DrawArrays, I was getting about 48fps.
  • I thought that, by indexing the cube geometry (thus reducing the number of unique verts per cube from 36 to 24), I would see about a 1/3rd reduction in frame time. I did not optimize the vert order for the vertex cache. However, I would be surprised if the cache is smaller than 36 verts (just positions and normals). Anyway, I did not see any performance increase, so I thought, "Maybe I'm CPU bound."
  • So, I next implemented instancing with DrawElementsInstanced, which allowed me to draw all 50,000 cubes with one draw call. This almost surely eliminated the CPU overhead. However, there was no change in performance. So, I felt that ruled out the CPU as the bottleneck also.
  • At this point, I actually tried reducing the fragment shader to one that does no calculation; it just outputs the color white. Still no change in performance.
  • So, if I'm not vertex bound, and I'm not CPU bound, and I'm not fill-rate limited, then what can it be? I wondered if maybe it was something about sending 50,000 mvp and mv matrices (each) over the bus. So, that's when I started running it with different numbers of models (1000, 2000, 5000, etc.), with each variation above (except for the white-only variation) to see if there is a point where the bottleneck presents itself.

I don't feel that the bottleneck has presented itself, but I don't know where else to look. I could post my C++ code, if that'd help, but it's really pretty straightforward. One-file sort of deal.


So, you're trying to measure the CPU-side impact of different API usage patterns -- first things first, make sure you can exclude the GPU's performance from the picture.

  • Add a CPU timer (QueryPerformanceCounter) around the SwapBuffers function -- if the GPU is the bottleneck, the CPU will usually stall in this function. If the time recorded here starts increasing or displays a large amount of variance, then GPU-side performance is probably polluting your experiment.
  • Add a GPU timer (ARB_timer_query) for the start/end of each frame, and make sure to only read back the query results (timestamps) 2 or 3 frames after submitting the queries. Use the timestamps to compute GPU-side time-per-frame. If these values are similar to or higher than your QueryPerformanceCounter-derived time-per-frame values, then GPU-side performance is definitely polluting your experiment.
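The delayed-readback pattern in the second bullet can be sketched as a small ring of query slots. Only the ring bookkeeping is concrete here; the actual GL calls need a live context and are shown as comments (a sketch, not Hodgman's code):

```cpp
#include <cstdint>

constexpr int kQueryLatency = 3; // read results 3 frames after issuing them

// Rotates through kQueryLatency timestamp-query pairs so that
// glGetQueryObject is only ever called on queries whose results are
// already available, avoiding a CPU/GPU sync.
struct TimerRing {
    std::uint64_t frame = 0;
    // GLuint beginQuery[kQueryLatency], endQuery[kQueryLatency]; // glGenQueries

    int writeSlot() const { return static_cast<int>(frame % kQueryLatency); }
    // The slot written kQueryLatency - 1 frames ago is now safe to read.
    int readSlot() const { return static_cast<int>((frame + 1) % kQueryLatency); }
    bool resultsReady() const { return frame >= kQueryLatency - 1; }

    void onFrame() {
        // glQueryCounter(beginQuery[writeSlot()], GL_TIMESTAMP); // frame start
        // ... draw the scene ...
        // glQueryCounter(endQuery[writeSlot()], GL_TIMESTAMP);   // frame end
        // if (resultsReady())
        //     gpuFrameNs = endQuery[readSlot()] - beginQuery[readSlot()];
        ++frame;
    }
};
```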

I thought that, by indexing the cube geometry (thus reducing the number of unique verts per cube from 36 to 24), I would see about a 1/3rd reduction in frame time.

Only if you were GPU vertex-processing bound.

Anyway, I did not see any performance increase, so I thought, "Maybe I'm CPU bound."

Instancing is a CPU-side optimization, so you should make sure that you are CPU bound in order to test its effectiveness!

So, I next implemented instancing with DrawElementsInstanced, which allowed me to draw all 50,000 cubes with one draw call. This almost surely eliminated the CPU overhead. However, there was no change in performance. So, I felt that ruled out the CPU as the bottleneck also.
So, if I'm not vertex bound, and I'm not CPU bound, and I'm not fill-rate limited, then what can it be?

Maybe you were CPU bound, and now you're GPU bound. Maybe the CPU-side and GPU-side time-per-frame values are just very similar? Start by getting your hands on both values!

Also, what kind of frame-time range were you dealing with here? Values that are too small (e.g. smaller than a typical frame) aren't great for benchmarking because the OS and drivers may well be optimized to slow down programs that are running unreasonably fast. e.g. displaying 1000 frames per second may just be seen as a waste of power by the OS/driver.

Speaking of which, I feel like the triangles-per-second numbers are really low.

You need to have more triangles per batch to get that value up -- instancing with low-poly meshes doesn't really fix the "small batch" problem. Change your cube to a high-poly model and triangles-per-second will almost certainly increase (and your vertex-processing-related optimizations will suddenly make a big impact on frametime).

Edited by Hodgman


Thanks to both of you for reading this and helping me out. 

 


Add a CPU timer (QueryPerformanceCounter) around the SwapBuffers function -- if the GPU is the bottleneck, the CPU will usually stall in this function. If the time recorded here starts increasing or displays a large amount of variance, then GPU-side performance is probably polluting your experiment.
Add a GPU timer (ARB_timer_query) for the start/end of each frame, and make sure to only read back the query results (timestamps) 2 or 3 frames after submitting the queries. Use the timestamps to compute GPU-side time-per-frame. If these values are similar to or higher than your QueryPerformanceCounter-derived time-per-frame values, then GPU-side performance is definitely polluting your experiment

 

..

 

Start by getting your hands on both values!

 

Great advice, thanks. Any info I can get on what's really going on will be a big help. 

 


You need to have more triangles per batch to get that value up -- instancing with low-poly meshes doesn't really fix the "small batch" problem.

 

I didn't realize this. I thought that, by getting everything into one VAO and drawing it all with one draw call (no state changes in between), I had effectively solved the batching issues. Do you know why the GPU is still "seeing" these thousands of cubes as separate batches instead of one?

 


http://www.g-truc.net/post-0662.html
http://www.g-truc.net/post-0666.html

 

I have a few questions about these articles. I can believe what they're saying, but some things need clarification:

 

1. Concerning the small triangles, it looks to me like there is a linear relationship between the frame times and the number of triangles drawn. The author is graphing the polygon size vs. the frame time. The polygon size is cut in half with each step, which means the number of polygons is increased by four. So, the graph looks quadratic, which is what we'd expect if there was a linear relationship between the number of triangles and the frame time. If I were to look at this graph (and admittedly, I'm just learning to analyze this stuff properly), I would think that the system becomes vertex-bound somewhere between 8x8 and 4x4, where there are 388,800 vertices on the screen. Before that, there is some other bottleneck, ensuring that changes in vertex count don't matter. How is the author controlling for this possibility?
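The tile arithmetic behind this reading can be checked directly (1920x1080 framebuffer assumed, as in the article):

```cpp
// Halving the tile edge quadruples the tile count, so if frame time
// were linear in primitive count, a plot against tile size would look
// quadratic -- exactly the shape described above.
long tileCount(long w, long h, long tile) { return (w / tile) * (h / tile); }
// e.g. tileCount(1920, 1080, 8) == 32400; at 4x4 it is four times that.
```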

 

2. If you look further down, the author shows a graph that the performance cliff is exponential, but that's hard to see. The vertical axis is log10, and the horizontal axis is log2 with respect to vertex count. I really suspect that the relationship is actually linear with respect to vertex count.

 

3. Concerning the triangles per draw call, it looks like the author says that it's not the number of draw calls per se that makes small batching slow, but rather all of the validation that happens for each draw call due to the state changes that happen in between the draw calls. This was my understanding as well. However, it doesn't look like he's making any state changes in between draw calls, so I'm not sure how his experiment demonstrates the point he's trying to make. In any case, am I to conclude that my DrawArrays implementation is no better than my DrawElementsInstanced implementation because I wasn't making any state changes (other than uniforms) in between calls to DrawArrays? 

 

4. It also looks like, although he is varying the number of triangles drawn per draw call (and thus varying the number of draw calls needed to draw the entire buffer), he is still submitting only one instance per draw call. Again, this supports the idea that performance is worse if you make more draw calls. However, I am still confused as to why performance problems persist if everything is drawn with one DrawElementsInstanced call.



I didn't realize this. I thought that, by getting everything into one VAO and drawing it all with one draw call (no state changes in between), I had effectively solved the batching issues. Do you know why the GPU is still "seeing" these thousands of cubes as separate batches instead of one?
If you perform "pseudo-instancing" where you duplicate the one cube mesh 10000 times into a very large VBO, then it will be a single batch, and will render very efficiently.

 

Perhaps it's been solved on the latest GPUs, but for a long time, it's been a rule of thumb that instancing does not perform well for low-poly meshes. I'm not sure why... Either there's still overhead that has to be performed for each instance, or perhaps different instances cannot be grouped into the same wavefront/thread-group on the GPU? e.g. AMD's processors can operate on 64 pixels/vertices at a time -- if this is true, within one processor, 8 threads would be busy running the vertex shaders for one cube instance, while 56 threads sit idle.
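The "pseudo-instancing" idea mentioned above can be sketched as follows: bake every instance's transform into one big vertex buffer up front, so the whole scene is a single glDrawArrays batch with no per-instance state (positions only for brevity; the function name is my own):

```cpp
#include <cstddef>
#include <vector>

// Duplicates the mesh once per instance, offsetting each copy, to build
// one large VBO that can be uploaded with glBufferData and drawn with a
// single glDrawArrays call.
std::vector<float> buildBatchedVbo(
        const std::vector<float>& meshPositions,    // xyz triples
        const std::vector<float>& instanceOffsets)  // xyz triples
{
    std::vector<float> vbo;
    vbo.reserve(meshPositions.size() * (instanceOffsets.size() / 3));
    for (std::size_t i = 0; i + 2 < instanceOffsets.size(); i += 3)
        for (std::size_t v = 0; v + 2 < meshPositions.size(); v += 3) {
            vbo.push_back(meshPositions[v + 0] + instanceOffsets[i + 0]);
            vbo.push_back(meshPositions[v + 1] + instanceOffsets[i + 1]);
            vbo.push_back(meshPositions[v + 2] + instanceOffsets[i + 2]);
        }
    return vbo;
}
```

The trade-off is memory and upload cost: the buffer is N times larger than the mesh, which is why this only makes sense for tiny meshes like cubes.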


Concerning the small triangles, it looks to me like there is a linear relationship between the frame times and the number of triangles drawn. The author is graphing the polygon size vs. the frame time. The polygon size is cut in half with each step, which means the number of polygons is increased by four. So, the graph looks quadratic, which is what we'd expect if there was a linear relationship between the number of triangles and the frame time.
The graph is flat (no change in frame-time) until the quad size reaches 16x16 pixels -- he goes from a single 1920*1080px tile to 32*32px tiles (1 tile to 2040 tiles) with no increase in frame time. It's only once the tiles reach 8*8 pixels that the graph shoots upwards suddenly.

As above, this is likely because AMD GPU cores use 64-wide SIMD instructions to shade 64 pixels at a time.

 
Also note in his graph that tiles of size 32px * 8px take a different amount of time to render than tiles of size 8px * 32px! That's partly because of cache and memory layout reasons, but also partly because every GPU rasterizes triangles in a different pattern, often somewhat hierarchically. Some triangle shapes will better match that pattern than others.
 

Furthermore, almost every GPU (going back 10 years or more up until today!) does not rasterize individual pixels. GPUs rasterize "pixel quads", which are 2*2px areas of the screen. If a triangle cuts through part of a 2*2 area -- e.g. it only covers 1 pixel -- then the whole pixel quad is still shaded, but some of the pixels are discarded. That's one reason why the 1*1 pixel tiles are incredibly slow.
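The cost of pixel-quad shading reduces to a simple ratio (illustrative arithmetic, not from the post):

```cpp
// The rasterizer shades whole 2x2 quads, so a triangle covering only one
// pixel still pays for 4 fragment-shader invocations.
// Efficiency = pixels actually covered / pixels shaded.
double quadEfficiency(int coveredPixels, int touchedQuads) {
    return static_cast<double>(coveredPixels) / (4.0 * touchedQuads);
}
// A sub-pixel triangle touching one quad: 25% efficiency, 75% wasted work.
```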

It's also a reason why LOD'ing models is important! On one game I worked on, we weren't going to bother with LODs, as vertex shading wasn't much of a bottleneck for us... However, profiling showed that distant meshes were taking waaay too long to draw -- these meshes were mostly made up of sub-pixel triangles, where most triangles covered zero pixels, and a few lucky ones covered one pixel. After implementing LOD'ing, the vertex shading time of course improved, but the pixel shading time also improved by ~200 to 300% due to the reduction in small triangles (a.k.a. a massive improvement in pixel-quad efficiency).


Concerning the triangles per draw call, it looks like the author says that it's not the number of draw calls per se that makes small batching slow, but rather all of the validation that happens for each draw call due to the state changes that happen in between the draw calls.
Validation is a CPU bottleneck -- he says that batching is usually done to help out the CPU here, but goes on to say:

In this post, we are looking at the GPU draw call performance ... To make sure that we are not CPU bound, I ..... In these tests, we are GPU bound somewhere.


Alright, so I wasn't able to start on this until late this evening, but I do have some results to share. The following graph shows the time vs. frame number for 50,000 cubes rendered using DrawElementsInstanced (no camera panning):

 

[Plot of total frame time, SwapBuffers time, and GPU time per frame for the instanced run (62mC7JW.png)]

So, it seems that the GPU is the bottleneck in this case. Almost the entire frame time is spent waiting for SwapBuffers to return. I tried this same experiment with 5,000 cubes and got the same results (albeit with smaller frame times). That is, gpuTime and swapBuffersTime were very close to the total frame time.

 

I then tried running the same experiments with DrawElements (not instanced), and I got a very different plot. This time, the frame times and GPU time were still about equal, but the SwapBuffers time was much lower:

 

[Same plot for the non-instanced DrawElements run (lTtY30j.png)]

This looks to me like the GPU is still taking the same amount of time to draw the cubes as in the instanced case, but since the CPU is spending so much more time submitting draw calls, there is much less time left over to wait for the buffer swap. Does that sound right?

 

I also tried using an object that is more complex than a cube -- just a quick mesh I made in Blender that has 804 unique verts. Once again, there is no performance difference between the DrawArrays, DrawElements, and DrawElementsInstanced cases. However, the good news is that the triangles-per-second increased by more than 2X with the more complex model, just as you predicted.

 

So, it appears that my test cases are not great -- they take long enough to draw on the GPU that there is plenty of time on the CPU side to submit all of the draw calls individually.

 

However, the vertex processing stage does not seem to be the culprit, since there is no difference in GPU time between the indexed and non-indexed cases. Next, I'll experiment more with fragment processing and reducing the number of single- and sub-pixel triangles in the scene.
