Performance Profiling and Optimization

posted in We stumble at noonday as in the dark.

Published December 02, 2005

My graphics engine has reached a stage of development where it is worth locating its current bottlenecks and considering optimizations. Here is an abbreviated account of what I did to explore these bottlenecks and search for possible optimizations.

The first thing I wanted to do was identify any OpenGL errors that had crept into my code unnoticed, since these errors can be a bottleneck in certain situations. I defined the following error-checking macro and wrapped all of the engine's OpenGL calls in it, e.g. CHECK_OPENGL_ERROR(glPopMatrix());

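/* Runs an OpenGL call, then drains and reports every pending error.
   The do/while(0) wrapper keeps the macro safe inside unbraced if/else. */
#define CHECK_OPENGL_ERROR( cmd ) \
    do { \
        cmd; \
        GLenum error; \
        while ( (error = glGetError()) != GL_NO_ERROR ) { \
            printf( "[%s:%d], '%s' failed with error %s\n", \
                    __FILE__, __LINE__, #cmd, gluErrorString(error) ); \
        } \
    } while (0)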

After running the engine with the gl calls wrapped, stdout posted the following:

[C:\src\engine\main.cpp:164], 'glPopMatrix()', failed with error stack underflow

The problem was that I was popping the matrix stack at the beginning of my render function and pushing at the end of it. This was fine for every pass through the function except the first, where the call to glPopMatrix popped an empty stack. Adding a glPushMatrix call to the setup routine solved this problem. No other errors were detected.

Satisfied that no OpenGL errors could be degrading my performance, I was ready to identify the bottlenecks in the rendering engine. I was prepared to identify five potential bottlenecks: a framebuffer limitation, a texture limitation, a transform limitation, a transfer limitation, and a CPU limitation.

For this particular exercise, I tested on a laptop. The laptop was a Dell Latitude D810 with a Pentium 2.13GHz and 1GB of RAM. The video card was an ATI Mobility Radeon X600.

First, I rendered a simple mesh of Venus twice, without instancing, to a 1920x1200 framebuffer to get a baseline framerate.

At the 1920x1200 resolution, about 40,000 vertices and 87,000 faces were rendered at 37.2 frames per second. I dropped the framebuffer size to 1280x800 and rendered identical geometry.

The framerate did not change. This suggested that either a geometry bottleneck or an application (CPU) bottleneck was occurring at this level of scene complexity.

Before determining whether this was CPU (application) limited or geometry (transfer) limited, I wanted to find out where the fill-limitation (framebuffer) threshold was crossed.

Instead of rendering Venus, I created a coarse helix in Maya, which allowed a higher degree of mesh granularity.

At 1920x1200, the engine rendered 48,000 vertices and 49,000 faces; the framerate was 78.4.

Rendering the same scene at 1280x800 revealed a framerate of 78.4. The same geometric-limitation or CPU-limitation existed at this level.

I removed a column of helix and rendered again.

Now a framerate discrepancy was revealed. Rendering 42,000 faces at 1920x1200 gave a framerate of 85.8, while rendering to a 1280x800 framebuffer gave a framerate of 90.2. On this laptop, the engine's fill-limitation threshold was crossed at about 45,000 faces.
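The resolution test used above can be written down as a small check. This is only a sketch: the helper name isFillLimited and its 2% noise tolerance are assumptions made for illustration, not part of the engine.

```cpp
#include <cassert>

// Hypothetical helper (not part of the engine): returns true when shrinking
// the framebuffer raised the framerate by more than `tolerance`, a relative
// threshold chosen arbitrarily here to absorb measurement noise.
bool isFillLimited(double fpsLargeBuffer, double fpsSmallBuffer,
                   double tolerance = 0.02)
{
    assert(fpsLargeBuffer > 0.0 && fpsSmallBuffer > 0.0);
    // Relative framerate gain from rendering fewer pixels.
    double relativeGain = (fpsSmallBuffer - fpsLargeBuffer) / fpsLargeBuffer;
    return relativeGain > tolerance;
}
```

With the measurements above, isFillLimited(78.4, 78.4) is false for the 49,000-face helix, while isFillLimited(85.8, 90.2) is true for the 42,000-face helix, where the smaller framebuffer gained about 5%.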

Now I needed to go back and determine if the original bottleneck was due to my data structures (CPU-limitation) or if the graphics hardware was pushing all the geometry that was possible at the given framerate.

To accomplish this, I needed to be able to push the same quantity of data from the engine to the OpenGL pipeline without any geometry being rendered. I used the C preprocessor to substitute glVertex calls with glNormal calls. I created the following header file:

/* Redirect every glVertex call to glNormal so the same amount of data
   is submitted but no geometry is produced. */
#define glVertex2d(x, y)         glNormal3d(x, y, 0)
#define glVertex2f(x, y)         glNormal3f(x, y, 0)
#define glVertex2i(x, y)         glNormal3i(x, y, 0)
#define glVertex2s(x, y)         glNormal3s(x, y, 0)
#define glVertex3d(x, y, z)      glNormal3d(x, y, z)
#define glVertex3f(x, y, z)      glNormal3f(x, y, z)
#define glVertex3i(x, y, z)      glNormal3i(x, y, z)
#define glVertex3s(x, y, z)      glNormal3s(x, y, z)
#define glVertex4d(x, y, z, w)   glNormal3d(x, y, z)
#define glVertex4f(x, y, z, w)   glNormal3f(x, y, z)
#define glVertex4i(x, y, z, w)   glNormal3i(x, y, z)
#define glVertex4s(x, y, z, w)   glNormal3s(x, y, z)
/* Two-component vectors have no z, so pass 0 as the third argument. */
#define glVertex2dv(v)           glNormal3d((v)[0], (v)[1], 0)
#define glVertex2fv(v)           glNormal3f((v)[0], (v)[1], 0)
#define glVertex2iv(v)           glNormal3i((v)[0], (v)[1], 0)
#define glVertex2sv(v)           glNormal3s((v)[0], (v)[1], 0)
#define glVertex3dv(v)           glNormal3dv(v)
#define glVertex3fv(v)           glNormal3fv(v)
#define glVertex3iv(v)           glNormal3iv(v)
#define glVertex3sv(v)           glNormal3sv(v)
#define glVertex4dv(v)           glNormal3dv(v)
#define glVertex4fv(v)           glNormal3fv(v)
#define glVertex4iv(v)           glNormal3iv(v)
#define glVertex4sv(v)           glNormal3sv(v)

I then included this header file and rendered the mesh of Venus exactly as I had done in the previous step. This was useful because the same quantity of data was being pushed to the OpenGL pipeline, but no geometry was being rendered.

When rendering to the 1920x1200 framebuffer, a framerate of 147.7 was noted, roughly four times the 37.2 baseline. Similarly, using the 1280x800 framebuffer, 333 frames per second was observed, which I recorded as a 425% increase in performance.

This confirmed that when rendering at a scene complexity of about 45,000 faces on my laptop, the engine moves from a fill-limitation to a geometry limitation; and a CPU limitation is never observed.

To test for texturing limitations, I rendered a Poser model with 128,000 texture coordinates using a 3000x3000 resolution .png skin texture. I then resampled the .png to 128x128 and rendered again. The framerate stayed consistent regardless of texture size, thus revealing no texture limitations.

To test for a transform limitation, I disabled all visibility culling and compared the framerate while standing still against the framerate while moving through the scene. The framerate stayed consistent, thus revealing that no transform limitation existed.

Another area of interest was the object selection feedback algorithms. There was room for optimization here. For a baseline reading, I rendered a hellskull mesh to a 1920x1200 framebuffer.

The baseline revealed that rendering 14,000 faces to a 1920x1200 framebuffer allowed the engine to maintain a framerate of 79.8.

When placing the mouse cursor over the skull and selecting it, the following result was observed:

While rendering the correct selection feedback silhouette, the framerate dropped to 38.4 - a 52% drop in performance.

To achieve the silhouette effect, I was doing the following:
1. Setting the stencil buffer to create the stencil
2. Rendering the entire mesh into the stencil buffer
3. Setting the stencil buffer to use the stencil
4. Setting line width
5. Rendering the entire object in wireframe mode

This worked well visually. The line width stayed consistent regardless of camera placement. But there were two things here that could be changed that might improve performance.

First, there was no reason to render the full object (texture and normal calls included) to create the stencil. So I changed the code to create a display list that was as simple as possible for generating the stencil. This revealed another problem with my use of the glStencilOp state function, which I corrected.

Second, there was no need to render the entire wireframe when only the silhouette was needed. So I decided to render only the edges shared by a front-facing polygon and a back-facing polygon.

This was not trivial, as it turned out. First, I had to go back into my original OBJ parsing code to create the edge list. I also had to move my camera class into the world class so I could have convenient access to the eye vector. Once this was accomplished, I had to decide what was a front facing polygon and what was a back facing polygon independent of viewpoint. I implemented this algorithm as a world class method as follows:

// Renders a silhouette around the selected object (if any) in the scene.
void myWorld::renderSelectedOutline(void)
{
    JRB tempedges;
    JRB myedges;
    obj_edge *thisedge;
    JRB tempfaces;
    obj_face *thisface;
    float eyeVector[3];
    float newPoints[3];
    float normVector[3];
    float eyeDot[2] = { 0.0f, 0.0f };
    GLfloat *tmat;
    int i;

    // If there is no currently selected object, there is nothing to outline.
    if (selectedWorldObject == NULL)
        return;

    myedges = selectedWorldObject->localobj->getEdgesJRB();
    tmat = selectedWorldObject->getMatrix();
    // Shorthand for the object's vertex array (x, y, z per vertex).
    const GLfloat *verts = selectedWorldObject->localobj->obj_vertex_array;

    glPushMatrix();
    glDisable(GL_LIGHTING);
    glDisable(GL_TEXTURE_2D);
    //glDisable(GL_DEPTH_TEST);

    glPushMatrix();
        // Prepare the stencil: write 1 wherever the object is drawn.
        glClearStencil(0x0);
        glClear(GL_STENCIL_BUFFER_BIT);
        glEnable(GL_STENCIL_TEST);
        glStencilFunc(GL_ALWAYS, 0x1, 0x1);
        glStencilOp(GL_REPLACE, GL_REPLACE, GL_REPLACE);
        glMultMatrixf(selectedWorldObject->transMatrix);
        glCallList(selectedWorldObject->stencillist);
        // Use the stencil: from here on, draw only where it was not set.
        glStencilFunc(GL_NOTEQUAL, 0x1, 0x1);
    glPopMatrix();

    glLineWidth(3);
    glMultMatrixf(selectedWorldObject->transMatrix);

    jrb_traverse(tempedges, myedges)
    {
        thisedge = (obj_edge *)tempedges->val.v;

        // Edges that do not connect exactly two faces are always rendered.
        bool drawEdge = true;

        // This optimization is only concerned with edges that connect exactly two faces.
        if (thisedge->face_count == 2)
        {
            // The reference point used to calculate the eye vector is the edge's
            // first vertex, multiplied by this object's transformation matrix.
            const GLfloat *p = &verts[thisedge->edges[0] * 3];
            newPoints[0] = tmat[0] * p[0] + tmat[4] * p[1] + tmat[8]  * p[2] + tmat[12];
            newPoints[1] = tmat[1] * p[0] + tmat[5] * p[1] + tmat[9]  * p[2] + tmat[13];
            newPoints[2] = tmat[2] * p[0] + tmat[6] * p[1] + tmat[10] * p[2] + tmat[14];

            // The eye vector is created.
            eyeVector[0] = -mycamera->cameraPosition[0] - newPoints[0];
            eyeVector[1] = -mycamera->cameraPosition[1] - newPoints[1];
            eyeVector[2] = -mycamera->cameraPosition[2] - newPoints[2];

            i = 0;
            jrb_traverse(tempfaces, thisedge->face_list)
            {
                thisface = (obj_face *)tempfaces->val.v;

                // The surface normal of the face is multiplied by the rotation part
                // of the matrix. Translations do not affect the normal.
                const float *n = thisface->surface_normal;
                normVector[0] = tmat[0] * n[0] + tmat[4] * n[1] + tmat[8]  * n[2];
                normVector[1] = tmat[1] * n[0] + tmat[5] * n[1] + tmat[9]  * n[2];
                normVector[2] = tmat[2] * n[0] + tmat[6] * n[1] + tmat[10] * n[2];

                // Dot product for this face, using the transformed normal so that
                // both vectors are in the same space.
                eyeDot[i] = eyeVector[0] * normVector[0] +
                            eyeVector[1] * normVector[1] +
                            eyeVector[2] * normVector[2];
                i++;
            } // end of jrb face traverse inside edge class

            // Test for an edge that has both a front-facing and a back-facing polygon.
            drawEdge = ((eyeDot[0] < 0) && (eyeDot[1] > 0)) ||
                       ((eyeDot[0] > 0) && (eyeDot[1] < 0));
        } // end of face_count conditional

        if (drawEdge)
        {
            glBegin(GL_LINES);
                glVertex3fv(&verts[thisedge->edges[0] * 3]);
                glVertex3fv(&verts[thisedge->edges[1] * 3]);
            glEnd();
        }
    }

    glEnable(GL_LIGHTING);
    glEnable(GL_TEXTURE_2D);
    glEnable(GL_DEPTH_TEST);
    glDisable(GL_STENCIL_TEST);
    glPopMatrix();
}
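The per-edge test at the heart of that method can be isolated from OpenGL and the engine's data structures. The following is a minimal sketch of that test only, under stated assumptions: Vec3, Edge, transformPoint, transformDirection and isSilhouetteEdge are hypothetical names invented for illustration, the matrix is a column-major 4x4 as OpenGL stores it, the eye position is given directly in world space, and the matrix holds only rotation and translation so that normals can be transformed by its upper-left 3x3.

```cpp
#include <array>
#include <cassert>

using Vec3 = std::array<float, 3>;

// Transforms a point by a column-major 4x4 matrix (w is assumed to be 1).
Vec3 transformPoint(const float m[16], const Vec3& p)
{
    return { m[0] * p[0] + m[4] * p[1] + m[8]  * p[2] + m[12],
             m[1] * p[0] + m[5] * p[1] + m[9]  * p[2] + m[13],
             m[2] * p[0] + m[6] * p[1] + m[10] * p[2] + m[14] };
}

// Transforms a direction by the upper-left 3x3 only, ignoring translation.
Vec3 transformDirection(const float m[16], const Vec3& d)
{
    return { m[0] * d[0] + m[4] * d[1] + m[8]  * d[2],
             m[1] * d[0] + m[5] * d[1] + m[9]  * d[2],
             m[2] * d[0] + m[6] * d[1] + m[10] * d[2] };
}

float dot(const Vec3& a, const Vec3& b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Hypothetical edge record: one endpoint plus the normals of its two faces.
struct Edge {
    Vec3 point;
    Vec3 normalA;
    Vec3 normalB;
};

// An edge is on the silhouette when one of its faces points toward the eye
// and the other points away, i.e. the two dot products have opposite signs.
bool isSilhouetteEdge(const float model[16], const Vec3& eye, const Edge& e)
{
    Vec3 p = transformPoint(model, e.point);
    Vec3 eyeToPoint = { p[0] - eye[0], p[1] - eye[1], p[2] - eye[2] };
    float a = dot(eyeToPoint, transformDirection(model, e.normalA));
    float b = dot(eyeToPoint, transformDirection(model, e.normalB));
    return (a < 0.0f && b > 0.0f) || (a > 0.0f && b < 0.0f);
}
```

The sign convention of the eye vector does not matter for this test: negating it flips both dot products together, so an edge with opposite signs stays a silhouette edge either way.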

Now I could achieve the same effect with the following results:

Now the engine enjoyed a framerate of 43.9 when object selection was active. With this optimization, the framerate dropped 45% due to object selection instead of 52%, reducing the cost by 7 percentage points.
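For reference, the percentage figures quoted here follow from a simple relative-drop calculation against the 79.8 baseline. The helper below is only an illustrative sketch; percentDrop is a hypothetical name, not something in the engine.

```cpp
#include <cassert>

// Hypothetical helper: percentage by which the framerate fell from a baseline.
double percentDrop(double baselineFps, double measuredFps)
{
    assert(baselineFps > 0.0);
    return (baselineFps - measuredFps) / baselineFps * 100.0;
}

// percentDrop(79.8, 38.4) is about 51.9 (full wireframe pass)
// percentDrop(79.8, 43.9) is about 45.0 (silhouette edges only)
```

The difference between the two results is just under 7 percentage points.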

It is important to note that in this particular scene, the object itself is the entire scene. This is an upper bound that would never be realistically presented to the engine in a production state. More commonly, an object that is selectable will be a low-polygon object that is only a small part of the entire scene.

Having coded the edge detection algorithm for the mesh, I decided it might be interesting to bypass the stencil buffer completely. I achieved the following results:

I was actually pleased with these results. Only a 15% drop in performance is observed when bypassing the stencil, and the results are visually interesting. I included this code in the engine so that either option can be used to show the user which objects are selected in the world: either the stenciled silhouette or the complete edge detection highlights. Here is another example of the edge detection highlights option:

Related resources:

Shreiner, David; Performance OpenGL: Platform Independent Techniques Siggraph 2001 Course #3; April 27, 2001

Hart, Evan; OpenGL Performance Tuning; ATI; Game Developers Conference 2005

West, Mick; Practical Hash IDs - Using 32-bit CRC hash as a unique identifier for game resources; Game Developer Magazine; December 2005


Comments

johnhattan

Boobies

December 02, 2005 02:12 PM

jdaniel

I have been playing Perfect Dark Zero on the Xbox 360 recently. I noticed that they use the exact same selection feedback methods in that game. It is most noticeable in the jungle levels, where the secondary feature of one of the guns is an infrared mode that silhouettes the enemies in red and friendlies in green.

December 06, 2005 12:09 AM


jdaniel

Author
