Boosting performance (again)


I'm wading through the posts (since search is disabled), and haven't found anything that helps so far... I'm getting about 150K tris/second, with textures, directional lighting, and fog. From what I keep reading, this is awfully slow for my hardware (2.0 GHz/1 GB RAM/ATI Radeon VE at work, 1.8 GHz/512 MB DDR/GeForce 3 at home). I'm running at 800x600x32x24 windowed, or 1024x768x32x24 fullscreen.

I'm using glDrawElements for my meshes, and I did try putting them in compiled lists (actually slowed it down). Texture binding is optimized so that it's only done twice per loop (that's two bindings for 1000 meshes). The meshes themselves are collections of vertices and normals stored in vectors. The 150K is with backface culling disabled...it jumps to an amazing 160K with it enabled. In fact, by pulling textures entirely it's still only 150K.

I have a bare-bones loop that calls the Draw function:
void Draw( void )
{
	//int    width, height;						// Window dimensions

	glClear( GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT );

	glLoadIdentity();
	gluLookAt( mCamZoom(0), mCamZoom(0)*1.25, mCamZoom(0),
		0.0f, 0.0f, 0.0f,
		0.0f, 1.0f, 0.0f );
	glRotatef( mRotX(0), 0, 1, 0 );
	glLightfv( GL_LIGHT0, GL_POSITION, g_LightPosition );

	glEnable( GL_ALPHA_TEST );
	glDisable( GL_BLEND );
	glEnable( GL_DEPTH_TEST );
	glDepthMask( GL_TRUE );
	glDisable( GL_CULL_FACE );

	float highlight[4] = {0.0f, 1.0f, 0.0f, 1.0f};
	float plain[4] = {1.0f, 1.0f, 1.0f, 1.0f};
	int indCount=-1;

	bool GoneWhite=true;

	for (int q=0; q<Meshes.RootMesh.size(); q++) {
		indCount+=1;
		if (q==2) {
			glMaterialfv( GL_FRONT_AND_BACK, GL_DIFFUSE, highlight );
			GoneWhite=false;
		} else {
			if (GoneWhite==false) {
				glMaterialfv( GL_FRONT_AND_BACK, GL_DIFFUSE, plain );
				GoneWhite=true;
			}
		}
		Meshes.DrawMesh(q);
	}
}
Leaving the backface culling enabled provides the aforementioned minor performance boost. Removing the rest makes no difference. And now the DrawMesh function:
  int glMESH::DrawMesh(int MESHID) {
	if (MESHID>-1 && MESHID<RootMesh.size()) {
		// set texture

		glPushMatrix();
		int k=0;
		int subtot=0;
		static int LastTex=0;

		glTranslatef (RootMesh[MESHID].x,RootMesh[MESHID].y,RootMesh[MESHID].z);
		if (RootMesh[MESHID].xr!=0) {glRotatef (RootMesh[MESHID].xr,1,0,0);}	// rotate on x-axis

		if (RootMesh[MESHID].yr!=0) {glRotatef (RootMesh[MESHID].yr,0,1,0);}	// rotate on y-axis

		if (RootMesh[MESHID].zr!=0) {glRotatef (RootMesh[MESHID].zr,0,0,1);}	// rotate on z-axis


		for (int ii=0; ii<RootMesh[MESHID].submeshcount+ii; ii++) {
			subtot=0;
			for (k=0;k<ii;k++) {
				subtot=subtot+RootMesh[MESHID].vertexcounts[k];
			}
			if (RootMesh[MESHID].texture[ii]!=LastTex) {
				LastTex=RootMesh[MESHID].texture[ii];
				glBindTexture (GL_TEXTURE_2D, LastTex);
			}

			glVertexPointer( 3, GL_FLOAT, 0, &RootMesh[MESHID].vertices[subtot*3] );
			glNormalPointer( GL_FLOAT, 0, &RootMesh[MESHID].normals[subtot*3]);
			glTexCoordPointer( 2, GL_FLOAT, 0, &RootMesh[MESHID].texcoords[subtot*2]);
			glDrawElements( GL_TRIANGLES, RootMesh[MESHID].vertexcounts[ii], GL_UNSIGNED_INT, &RootMesh[MESHID].indices[subtot]  );
		}
		glPopMatrix();
		return 0;
	} else {
		return 1; //invalid id

	}
}
   
Commenting out the rotations/translations has no effect. That's the meat of it. What stupid things have I done wrong? [edited by - Brother Erryn on September 10, 2002 1:45:53 PM] [edited by - Yann L on September 11, 2002 8:45:40 PM]

quote:

for (int ii=0; ii<RootMesh[MESHID].submeshcount+ii; ii++) {


I guess this is a typo.

quote:

subtot=0;
for (k=0;k<ii;k++) {
subtot=subtot+RootMesh[MESHID].vertexcounts[k];
}


Ouch. Don't do that in the inner loop of your mesh renderer. It re-sums every earlier vertex count for each submesh, so the work grows quadratically and will kill performance if your RootMesh[MESHID].submeshcount is large.

Do this instead:

    
subtot=0;
for (int ii=0; ii<RootMesh[MESHID].submeshcount; ii++) {
	if (RootMesh[MESHID].texture[ii]!=LastTex) {
		LastTex=RootMesh[MESHID].texture[ii];
		glBindTexture (GL_TEXTURE_2D, LastTex);
	}

	glVertexPointer( 3, GL_FLOAT, 0, &RootMesh[MESHID].vertices[subtot*3] );
	glNormalPointer( GL_FLOAT, 0, &RootMesh[MESHID].normals[subtot*3]);
	glTexCoordPointer( 2, GL_FLOAT, 0, &RootMesh[MESHID].texcoords[subtot*2]);
	glDrawElements( GL_TRIANGLES, RootMesh[MESHID].vertexcounts[ii], GL_UNSIGNED_INT, &RootMesh[MESHID].indices[subtot] );

	subtot += RootMesh[MESHID].vertexcounts[ii];
}


/ Yann


[edited by - Yann L on September 11, 2002 4:58:50 PM]
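
The running offset in the fix above could also be computed once at load time instead of on every frame. What follows is only a minimal sketch, assuming the per-submesh counts are held in a std::vector<int> like vertexcounts; SubmeshOffsets is a hypothetical helper name, not something from the posted code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the posted code): given the per-submesh
// counts, return the starting offset of each submesh. offsets[i] is the sum
// of counts[0..i-1], i.e. the value the original inner loop re-summed for
// every submesh on every frame.
std::vector<int> SubmeshOffsets(const std::vector<int>& counts)
{
    std::vector<int> offsets(counts.size());
    int running = 0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        assert(counts[i] >= 0);   // a negative count would be a loader bug
        offsets[i] = running;     // where submesh i starts
        running += counts[i];     // advance past submesh i
    }
    return offsets;
}
```

The draw loop would then read offsets[ii] where it currently uses subtot, at the cost of one extra vector per mesh.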

Your mesh loop is near optimal for standard (non-VAR/VAO) OpenGL. Your bottleneck must be somewhere else.

A quick optimization would be to drop your submesh vertex counter 'subtot', and add the offset directly onto the indices by a preprocess. That way, you could move the gl*Pointer() calls out of the loop. But don't expect too much performance gain from that.

You said that you have 1000 meshes per frame, so you are calling glDrawElements() 1000 times per frame. How many faces do you display per frame, on average? If it's less than 300k per frame, then you are outside the recommended polycount range for 1000 vertex arrays. You should push a minimum of 300 faces per glDrawElements() call, otherwise you'll lose performance due to the call overhead.

Try this: comment out all gl*Pointer() calls and the glDrawElements() call. Don't change anything else, time your application and post the results.

[Edit: I just added source tags to your original post, the code tags don't break correctly. Creates annoying horizontal scrollbars in the browser window]

/ Yann

[edited by - Yann L on September 11, 2002 8:54:15 PM]
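
The "add the offset directly onto the indices" preprocess might look roughly like this. It is a sketch under assumptions: it takes the layout implied by the posted DrawMesh, where submesh s owns counts[s] consecutive index entries and those entries are relative to the submesh's first vertex. RebaseIndices is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical preprocess, run once after loading. Assumes, as in the posted
// DrawMesh, that submesh s owns counts[s] consecutive entries of `indices`
// and that those entries are relative to the submesh's first vertex.
// Afterwards every index is relative to the start of the whole mesh.
void RebaseIndices(std::vector<unsigned int>& indices,
                   const std::vector<int>& counts)
{
    std::size_t start = 0;   // first index (and first vertex) of the current submesh
    for (std::size_t s = 0; s < counts.size(); ++s) {
        const std::size_t count = static_cast<std::size_t>(counts[s]);
        assert(start + count <= indices.size());
        for (std::size_t i = 0; i < count; ++i) {
            indices[start + i] += static_cast<unsigned int>(start);
        }
        start += count;
    }
}
```

With the indices rebased, the three gl*Pointer() calls can be issued once per mesh with the address of element 0 of each array, leaving only the texture bind and glDrawElements() inside the submesh loop.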

First of all, I've done a little tweaking (changing lighting and fog, cleaning up glMaterialfv calls, etc.) and that boosted it to 260K tri/sec on my GeForce (about half that for the ATI).

quote:
A quick optimization would be to drop your submesh vertex counter 'subtot', and add the offset directly onto the indices by a preprocess. That way, you could move the gl*Pointer() calls out of the loop. But don't expect too much performance gain from that.


It's done this way because each mesh has a separate texture. In my "test" version though, each mesh only has one submesh anyway.

quote:
You said that you have 1000 meshes per frame, so you are calling glDrawElements() 1000 times per frame. How many faces do you display per frame, on average? If it's less than 300k per frame, then you are outside the recommended polycount range for 1000 vertex arrays. You should push a minimum of 300 faces per glDrawElements() call, otherwise you'll lose performance due to the call overhead.


Almost all of the meshes are two triangles (six vertices). Then there are five cubes, and a sphere, for a total of about 2000 faces. But you mention that I should have at least 300 faces per call, when most of them are only two faces. I was under the impression that single GL_TRIANGLES with glBegin/glEnd was much less efficient than glDrawElements, and trying to draw those that way didn't boost performance. I suspect the solution is something painfully obvious that I'm overlooking.

quote:
Try this: comment out all gl*Pointer() calls and the glDrawElements() call. Don't change anything else, time your application and post the results.


The application went from 130 FPS to 260 FPS (the equivalent of 520K tri/sec, I suppose).

Thanks for your help, and for the source tags. I'd forgotten which tag it was.
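
The figures in this post are just frame rate times faces per frame, and the batch size is faces divided by draw calls. A small sketch of that arithmetic, using the poster's "about 2000 faces" and 1000 meshes as inputs; the function names are made up for illustration.

```cpp
#include <cassert>

// Triangles per second from frame rate and triangles drawn per frame.
long long TrisPerSecond(int framesPerSecond, int trisPerFrame)
{
    assert(framesPerSecond >= 0 && trisPerFrame >= 0);
    return static_cast<long long>(framesPerSecond) * trisPerFrame;
}

// Average number of faces submitted per draw call.
double FacesPerCall(int facesPerFrame, int drawCallsPerFrame)
{
    assert(drawCallsPerFrame > 0);
    return static_cast<double>(facesPerFrame) / drawCallsPerFrame;
}
```

With about 2000 faces per frame, 130 FPS works out to 260K tri/sec and 260 FPS to 520K tri/sec. Spread over 1000 calls, that is an average of 2 faces per call, far below the suggested minimum of 300.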

quote:

First of all, I've done a little tweaking (changing lighting and fog, cleaning up glMaterialfv calls, etc.) and that boosted it to 260K tri/sec on my GeForce (about half that for the ATI).


Still too low. On your config, you should at least get 2-3 million.

quote:

It's done this way because each mesh has a separate texture. In my "test" version though, each mesh only has one submesh anyway.


Doesn't matter, you can still put all vertices in a single array. As long as the index arrays are separate, you can change the renderstate in between glDrawElements() calls.

quote:

Almost all of the meshes are two triangles (six vertices). Then there are five cubes, and a sphere, for a total of about 2000 faces. But you mention that I should have at least 300 faces per call, when most of them are only two faces. I was under the impression that single GL_TRIANGLES with glBegin/glEnd was much less efficient than glDrawElements, and trying to draw those that way didn't boost performance
...
The application went from 130 FPS to 260 FPS (the equivalent of 520K tri/sec, I suppose).


OK, you obviously have two different problems there.

1) Yes, glDrawElements() is faster than individual glVertex3f(), etc, calls. But only on larger chunks of geometry.

Immediate-mode commands (glBegin/glEnd) send their data directly as-is to the GPU, and it gets processed on the fly. For a small number of faces, that's OK. But with large amounts, you'll get all the call overhead, you can't share vertices and the geometric data can't be cached in VRAM.

glVertexPointer() type commands work differently. Each time you call a gl*Pointer() or glDrawElements() command, the GPU has to do internal setup work: reposition the memory pointers, reset the vertex cache, fill in the streaming pipe, init the DMA engine to the new address, etc. All that takes time. But once the GPU gives the run command, it will crunch through your data at extremely high speeds.

Now consider what happens if you only send two triangles per call: the GPU will do all the costly setup, and then speed through 2 triangles. Not really worth it. That's why a minimum of 300 faces is recommended. VAs become really efficient from around 500-600 faces per call. Calling them with less than, say, 50 faces will slow you down tremendously.

2) Your performance doesn't go up that much if you comment out the rendering code. That simply means that you are not limited by the performance of the render hardware. You have a big bottleneck somewhere else in your code. I would suggest using a profiler.

/ Yann
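
To illustrate the setup-versus-throughput argument in point 1, here is a toy cost model. It is not a measurement of any real GPU: the two constants are made-up units chosen only to show the shape of the curve, and BatchEfficiency is a hypothetical name.

```cpp
#include <cassert>

// Toy cost model, NOT a measurement: every draw call pays a fixed setup
// cost, then each face costs a small amount. Both constants are assumed,
// arbitrary units.
const double kSetupCostPerCall = 100.0;  // assumed
const double kCostPerFace      = 1.0;    // assumed

// Fraction of a call's total cost spent on faces rather than on setup.
double BatchEfficiency(int facesPerCall)
{
    assert(facesPerCall > 0);
    const double work = kCostPerFace * facesPerCall;
    return work / (kSetupCostPerCall + work);
}
```

Under these assumed constants a two-face call spends almost all of its cost on setup, while a 300-face call spends most of it on geometry. The real crossover depends on the card and driver.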

quote:
Still too low. On your config, you should at least get 2-3 million.

Funny you should mention this...instead of loading all those two-triangle meshes, I had it load spheres with 150 triangles each. It slowed the FPS, but the result was about 2.5 million tris/sec. Much better, so the problem is obviously the overhead of so many gl calls.

quote:
Doesn't matter, you can still put all vertices in a single array. As long as the index arrays are separate, you can change the renderstate in between glDrawElements() calls.

All the vertices for all the submeshes of a single mesh are in a single vector. Miscommunication, perhaps? I just change the start point and change states for the different submeshes.

quote:
Now consider what happens if you only send two triangles per call: the GPU will do all the costly setup, and then speed through 2 triangles. Not really worth it. That's why a minimum of 300 faces is recommended. VAs become really efficient from around 500-600 faces per call. Calling them with less than, say, 50 faces will slow you down tremendously.

Thank you for the clear explanation, I think I've wrapped my brain around it now. But moving the "small" meshes into glBegin/glEnd did not change the performance enough to measure...what other option do I have?

quote:
2) Your performance doesn't go up that much if you comment out the rendering code. That simply means that you are not limited by the performance of the render hardware. You have a big bottleneck somewhere else in your code. I would suggest using a profiler.

Here are the significant (greater than 0.0) profile lines:

Func time      %  Func+child time      %  Hit count  Function
14214.721   94.5        15038.136  100.0          1  _WinMain@16 (glengine.obj)
  346.531    2.3          580.656    3.9       2194  _glfwSwapBuffers (window.obj)
  203.894    1.4          203.894    1.4        118  __glfwWindowCallback@16 (window.obj)
   81.142    0.5           81.142    0.5        905  glMESH::LoadMesh(char *,float,float,float,float,float,float) (glatom_mesh.obj)
   61.896    0.4           64.174    0.4       2194  Draw(void) (glengine.obj)
   38.938    0.3           40.019    0.3          1  _glfwOpenWindow (window.obj)
   33.204    0.2           33.208    0.2          1  __glfwInitTimer (time.obj)
   31.096    0.2           31.107    0.2          2  _glfwCloseWindow (window.obj)
   10.757    0.1           12.399    0.1          1  _glfwSetWindowPos (window.obj)
    3.868    0.0          234.124    1.6       2194  _glfwPollEvents (window.obj)

Thanks!

[edited by - Brother Erryn on September 12, 2002 12:30:27 PM]

quote:

All the vertices for all the submeshes of a single mesh are in a single vector. Miscommunication, perhaps? I just change the start point and change states for the different submeshes.


Yes, but you can avoid that (the calls to gl*Pointer). Just set the pointers to the beginning of the mesh vertex data, outside of your submesh loop. Then all you need to do inside the loop is to set textures and call glDrawElements(). Obviously, you'll need to pre-add the appropriate offsets to the index arrays for that to work.

quote:

Thank you for the clear explanation, I think I've wrapped my brain around it now. But moving the "small" meshes into glBegin/glEnd did not change the performance enough to measure...what other option do I have?


Cluster geometry. Modern 3D cards don't like small meshes. To get optimum performance, you'll have to group related submeshes together (related in terms of texture and renderstate), and issue them as a single glDrawElements call.

quote:

Here are the significant (greater than 0.0) profile lines:
[snip]


Is that with or without the gl* calls in the inner loop ?

/ Yann
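
One way to read "cluster geometry" is to merge the index lists of all submeshes that share a texture, so each texture needs a single draw call. The sketch below rests on stated assumptions: the submeshes already share one vertex array (indices are absolute) and one transform, for example static tiles baked into world space. SubmeshRef and ClusterByTexture are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical description of one small submesh: its texture id and its
// indices into a vertex array shared by everything being clustered.
struct SubmeshRef {
    unsigned int texture;
    std::vector<unsigned int> indices;
};

// Group indices by texture so each texture needs a single draw call.
// Assumes the submeshes share one vertex array and one transform.
std::map<unsigned int, std::vector<unsigned int> >
ClusterByTexture(const std::vector<SubmeshRef>& submeshes)
{
    std::map<unsigned int, std::vector<unsigned int> > batches;
    for (std::size_t s = 0; s < submeshes.size(); ++s) {
        const SubmeshRef& sm = submeshes[s];
        assert(sm.indices.size() % 3 == 0);  // GL_TRIANGLES: three indices per face
        std::vector<unsigned int>& batch = batches[sm.texture];
        batch.insert(batch.end(), sm.indices.begin(), sm.indices.end());
    }
    return batches;
}
```

At draw time each map entry would become one glBindTexture() followed by one glDrawElements() over the merged index list. Meshes that move independently would need their vertices transformed on the CPU first, or be left out of the cluster.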

quote:
Yes, but you can avoid that (the calls to gl*Pointer). Just set the pointers to the beginning of the mesh vertex data, outside of your submesh loop. Then all you need to do inside the loop is to set textures and call glDrawElements(). Obviously, you'll need to pre-add the appropriate offsets to the index arrays for that to work.

Ah, I understand now. Fortunately the way I've been creating and loading my meshes lends itself to this method very well. I've made this change, and it has boosted performance slightly.

quote:
Cluster geometry. Modern 3D cards don't like small meshes. To get optimum performance, you'll have to group related submeshes together (related in terms of texture and renderstate), and issue them as a single glDrawElements call.

I was afraid you'd say something like that. I may follow the NWN approach, where each "tile" is fairly large, rather than having multiple smaller/simpler ones.

quote:
Is that with or without the gl* calls in the inner loop ?

With.
