Performance problems at ~1,000,000 triangles. Am I doing something wrong?

Started by
28 comments, last by SimonForsman 10 years, 3 months ago

Hello,

I've hit a snag with rendering triangles. At approximately 1,000,000 triangles, the frame rate drops to 30-40 fps. By the time I hit 5,000,000 triangles, I'm down to 10 fps. My CPU is an i5-3570K @ 3.4 GHz and my video card is a ZOTAC GeForce GTX 660 with 2 GB of RAM. Should I experience an fps drop at one million triangles? Or am I doing something wrong?

My lighting model is simple Phong shading with only diffuse lighting. I haven't added ambient or specular terms yet. There is only one point light source.

I've also enabled the following settings:


glEnable(GL_DEPTH_TEST);
glFrontFace(GL_CW);
glEnable(GL_CULL_FACE);     // Cull back facing polygons
glCullFace(GL_BACK);

Thank you for your help.

-Nick

Or am I doing something wrong?

How are we supposed to know that without seeing your code? Maybe you are calling glDrawElements/glDrawArrays after every triangle, maybe you have old drivers, maybe you are doing many OpenGL state changes. Many things could be wrong, but you are not helping yourself. Show us code, or give us more info.

“There are thousands and thousands of people out there leading lives of quiet, screaming desperation, where they work long, hard hours at jobs they hate to enable them to buy things they don't need to impress people they don't like.” - Nigel Marsh

At approximately 1,000,000 triangles, the frame rate drops to 30-40 fps.

Drops to 30-40 from what?
FPS is only meaningful as a relative measurement, not an absolute measurement.

e.g.

* saying "it decreased from 60fps by 20fps, down to 40fps" -- this means that the frame-time increased from 16.67ms per frame to 25ms per frame.
* whereas, "it decreased to 40fps" just means that it is now 25ms per frame, with no data on what it was before. The actual change in performance is unknown here.

* likewise, "it decreased by 20fps" is a completely meaningless statement, because there's no absolute starting point (if it dropped from 60->40 that's an increase of 8.3ms/frame, but if it dropped from 260->240 then it's only an increase of 0.32ms/frame).

Should I experience an fps drop at one million triangles?

As above, we can't know what you mean here without more information. What was the FPS at 999999 triangles? Was there really a sudden huge performance change when adding one more triangle? If so, then yes, something is very strange.

At approximately 1,000,000 triangles, the frame rate drops to 30-40 fps. By the time I hit 5,000,000 triangles, I'm down to 10 fps.

40fps is 25ms per frame. 10fps is 100ms per frame.
1M tris in 25ms is 40M triangles per second.
5M tris in 100ms is 50M triangles per second.
So... your triangle throughput - one measure of performance - has actually increased here!
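
As a quick sanity check of that arithmetic, here is a tiny standalone C++ sketch (just the numbers from this thread, not the OP's code):

#include <cstdio>

int main()
{
    // Frame time in milliseconds is 1000 / fps.
    double msAt40fps = 1000.0 / 40.0;  // 25 ms
    double msAt10fps = 1000.0 / 10.0;  // 100 ms

    // Throughput is triangles per frame divided by seconds per frame.
    double trisPerSec1M = 1000000.0 / (msAt40fps / 1000.0); // ~40 million tris/s
    double trisPerSec5M = 5000000.0 / (msAt10fps / 1000.0); // ~50 million tris/s

    std::printf("%.0f ms, %.0f ms, %.0f tris/s, %.0f tris/s\n",
                msAt40fps, msAt10fps, trisPerSec1M, trisPerSec5M);
    return 0;
}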

If you think your rendering is too slow, it might help if you tell us how you are drawing things, how big your triangles are, etc. (1 million big triangles take longer to render than 1 million small ones, and 1 million triangles rendered with a single draw call is a lot faster than 1 million triangles rendered with many draw calls.)

[size="1"]I don't suffer from insanity, I'm enjoying every minute of it.
The voices in my head may not be real, but they have some good ideas!

The data points I've been able to get are 786,432 triangles at 60 fps, 983,040 triangles at 55 fps, and 1,228,800 triangles at 40 fps.

I'm rendering individual cubes, each with 12 triangles. The 12 triangles are in a single vertex buffer.

I'm fairly certain I'm using OpenGL 3.2.

Here's my drawing method. Chunk::CHUNK_SIZE is 16.


	glViewport(0, 0, _windowWidth, _windowHeight); // Set the viewport size to fill the window
	glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT); // Clear required buffers
	_modelMatrix = glm::mat4(); // Reset the model matrix to identity
	_lightPosition = glm::vec4(-10.0, 30.0, 16.0, 1.0);
	_shader->bind(); // Bind our shader
	int projectionMatrixLocation = glGetUniformLocation(_shader->id(), "projectionMatrix"); // Get the location of our projection matrix in the shader  
	int viewMatrixLocation = glGetUniformLocation(_shader->id(), "viewMatrix"); // Get the location of our view matrix in the shader  
	ChunkList chunks = _chunkManager.GetVisibleChunks();
	_viewMatrix = _camera->GetViewMatrix();
	glUniformMatrix4fv(projectionMatrixLocation, 1, GL_FALSE, &_projectionMatrix[0][0]); // Send our projection matrix to the shader  
	glUniformMatrix4fv(viewMatrixLocation, 1, GL_FALSE, &_viewMatrix[0][0]); // Send our view matrix to the shader  
	for (ChunkList::iterator it = chunks.begin(); it != chunks.end(); ++it)
	{
		Chunk *chunk = *it;
		glm::mat4 world = glm::mat4();
		world = glm::translate(world, glm::vec3(chunk->GetX() * Chunk::CHUNK_SIZE, chunk->GetY() * Chunk::CHUNK_SIZE, chunk->GetZ() * Chunk::CHUNK_SIZE));
		for (float i = 0; i < Chunk::CHUNK_SIZE; ++i)
		{
			
			for (float j = 0; j < Chunk::CHUNK_SIZE; ++j)
			{
				for (float k = 0; k < Chunk::CHUNK_SIZE; ++k)
				{
					Block &block = *(chunk->GetBlock(i, j, k));
					if (!block.IsActive())
					{
						continue;
					}
					//_modelMatrix = world;
					_modelMatrix = glm::translate(world, glm::vec3(i, j, k));
					int modelMatrixLocation = glGetUniformLocation(_shader->id(), "modelMatrix"); // Get the location of our model matrix in the shader
					int pointLightPosition = glGetUniformLocation(_shader->id(), "pointLightPosition"); // Get the location of the point light position in the shader
					int material = glGetUniformLocation(_shader->id(), "color"); // Get the location of the block color in the shader
					
					glUniformMatrix4fv(modelMatrixLocation, 1, GL_FALSE, &_modelMatrix[0][0]); // Send our model matrix to the shader  
					glm::vec4 colorArray;
					switch (block.GetBlockType())
					{
					case BlockType::Dirt:
						colorArray.r = 0.55;
						colorArray.g = 0.27;
						colorArray.b = 0.074;
						colorArray.a = 1.0;
						break;
					case BlockType::Grass:
						colorArray.r = 0.0;
						colorArray.g = 1.0;
						colorArray.b = 0.0;
						colorArray.a = 1.0;
						break;
					case BlockType::Bedrock:
						colorArray.r = 0.5;
						colorArray.g = 0.5;
						colorArray.b = 0.5;
						colorArray.a = 1.0;
						break;
					}
					glUniform4fv(pointLightPosition, 1, &_lightPosition[0]);
					glUniform4fv(material, 1, &colorArray[0]);
					glBindVertexArray(_vaoId[0]); // Bind our Vertex Array Object
					glDrawArrays(GL_TRIANGLES, 0, 36); // Draw the cube (36 vertices = 12 triangles)
				}
			}
		}
	}
	glBindVertexArray(0);

	_shader->unbind(); // Unbind our shader

Here's my initialization code:


	_hwnd = hwnd; // Set the HWND for our window

	_hdc = GetDC(hwnd); // Get the device context for our window  

	PIXELFORMATDESCRIPTOR pfd; // Create a new PIXELFORMATDESCRIPTOR (PFD)  
	memset(&pfd, 0, sizeof(PIXELFORMATDESCRIPTOR)); // Clear our  PFD  
	pfd.nSize = sizeof(PIXELFORMATDESCRIPTOR); // Set the size of the PFD to the size of the class  
	pfd.dwFlags = PFD_DOUBLEBUFFER | PFD_SUPPORT_OPENGL | PFD_DRAW_TO_WINDOW; // Enable double buffering, opengl support and drawing to a window  
	pfd.iPixelType = PFD_TYPE_RGBA; // Set our application to use RGBA pixels  
	pfd.cColorBits = 32; // Give us 32 bits of color information (the higher, the more colors)  
	pfd.cDepthBits = 32; // Give us 32 bits of depth information (the higher, the more depth levels)  
	pfd.iLayerType = PFD_MAIN_PLANE; // Set the layer of the PFD

	int nPixelFormat = ChoosePixelFormat(_hdc, &pfd); // Check if our PFD is valid and get a pixel format back  
	if (nPixelFormat == 0) // If it fails  
		return false;

	bool bResult = SetPixelFormat(_hdc, nPixelFormat, &pfd); // Try and set the pixel format based on our PFD  
	if (!bResult) // If it fails  
		return false;

	HGLRC tempOpenGLContext = wglCreateContext(_hdc); // Create an OpenGL 2.1 context for our device context  
	wglMakeCurrent(_hdc, tempOpenGLContext); // Make the OpenGL 2.1 context current and active  

	GLenum error = glewInit(); // Enable GLEW  
	if (error != GLEW_OK) // If GLEW fails  
		return false;

	int attributes[] = {
		WGL_CONTEXT_MAJOR_VERSION_ARB, 3, // Set the MAJOR version of OpenGL to 3  
		WGL_CONTEXT_MINOR_VERSION_ARB, 2, // Set the MINOR version of OpenGL to 2  
		WGL_CONTEXT_FLAGS_ARB, WGL_CONTEXT_FORWARD_COMPATIBLE_BIT_ARB, // Set our OpenGL context to be forward compatible  
		0
	};

	if (wglewIsSupported("WGL_ARB_create_context") == 1) { // If the OpenGL 3.x context creation extension is available  
		_hrc = wglCreateContextAttribsARB(_hdc, NULL, attributes); // Create an OpenGL 3.x context based on the given attributes  
		wglMakeCurrent(NULL, NULL); // Remove the temporary context from being active  
		wglDeleteContext(tempOpenGLContext); // Delete the temporary OpenGL 2.1 context  
		wglMakeCurrent(_hdc, _hrc); // Make our OpenGL 3.2 context current  
	}
	else {
		_hrc = tempOpenGLContext; // If we didn't have support for OpenGL 3.x and up, use the OpenGL 2.1 context  
	}

The projection matrix is set to:


glm::perspective(60.0f, (float)_windowWidth / (float)_windowHeight, 0.1f, 100.f);

First of all, get rid of the glGetUniformLocation calls during drawing. Get those uniform locations once after the shader is linked and store them somewhere. Calling glGet* functions is usually an expensive operation in OpenGL.

Also, drawing only 36 vertices per draw call is a very small number. You should be drawing much more per call to get high performance with a large number of triangles. Look into instancing.

Another thing to look into is uniform buffer objects: uploading all uniforms at once can be much better than calling individual glUniform* functions for each uniform.
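
For example, a minimal sketch of caching the locations once after the shader is linked (the struct and function names here are illustrative, not part of the OP's code):

struct UniformLocations
{
    GLint projectionMatrix;
    GLint viewMatrix;
    GLint modelMatrix;
    GLint pointLightPosition;
    GLint color;
};

UniformLocations CacheUniformLocations(GLuint program)
{
    UniformLocations loc;
    loc.projectionMatrix   = glGetUniformLocation(program, "projectionMatrix");
    loc.viewMatrix         = glGetUniformLocation(program, "viewMatrix");
    loc.modelMatrix        = glGetUniformLocation(program, "modelMatrix");
    loc.pointLightPosition = glGetUniformLocation(program, "pointLightPosition");
    loc.color              = glGetUniformLocation(program, "color");
    return loc;
}

// In the draw loop only the cheap glUniform* calls remain, e.g.:
// glUniformMatrix4fv(loc.modelMatrix, 1, GL_FALSE, &_modelMatrix[0][0]);
// glUniform4fv(loc.color, 1, &colorArray[0]);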

Moving the glGetUniformLocation calls and the constant glUniform calls out of the drawing loop gave me a boost from 10 fps to 13-15 fps at 5M triangles, so that's a start. I'll look into instancing.
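
For reference, a rough sketch of what instanced drawing of the cubes could look like. Note that glVertexAttribDivisor needs OpenGL 3.3 or the ARB_instanced_arrays extension, and the struct, buffer name, and attribute locations below are placeholders, not the OP's actual code; the vertex shader would also need per-instance offset/color inputs and would add the offset to the vertex position instead of using a per-cube modelMatrix uniform.

#include <cstddef> // offsetof
#include <vector>

struct InstanceData
{
    float offset[3]; // world-space position of this cube
    float color[4];  // block color
};

std::vector<InstanceData> instances;
// ... fill `instances` with one entry per active block ...

GLuint instanceVbo = 0;
glGenBuffers(1, &instanceVbo);
glBindVertexArray(_vaoId[0]); // cube VAO with the per-vertex attributes already set up
glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
glBufferData(GL_ARRAY_BUFFER, instances.size() * sizeof(InstanceData),
             instances.data(), GL_STREAM_DRAW);

// Per-instance attributes: a divisor of 1 makes them advance once per
// instance instead of once per vertex (locations 2 and 3 are placeholders).
glEnableVertexAttribArray(2);
glVertexAttribPointer(2, 3, GL_FLOAT, GL_FALSE, sizeof(InstanceData),
                      (const void*)offsetof(InstanceData, offset));
glVertexAttribDivisor(2, 1);

glEnableVertexAttribArray(3);
glVertexAttribPointer(3, 4, GL_FLOAT, GL_FALSE, sizeof(InstanceData),
                      (const void*)offsetof(InstanceData, color));
glVertexAttribDivisor(3, 1);

// One draw call for all visible cubes instead of one draw call per cube.
glDrawArraysInstanced(GL_TRIANGLES, 0, 36, (GLsizei)instances.size());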

Also note that 1 million primitives is a big amount. Some days ago I was asking how many triangles per second today's GPUs can process; no one wanted to give me an answer there, so now maybe I can treat your result of 40M triangles/s as a possible answer (?)

Also note that 1 million primitives is a big amount. Some days ago I was asking how many triangles per second today's GPUs can process; no one wanted to give me an answer there, so now maybe I can treat your result of 40M triangles/s as a possible answer (?)

1 million triangles is almost nothing for a modern GPU (you could render 1 million triangles per frame on an old GeForce 7800 GTX at 60 fps without any problems). The poor performance in the OP's case is because he only renders 12 triangles per draw call: over 80,000 draw calls per frame is quite a bit, and on top of the draw calls there is quite a lot of unnecessary work being done before each one is issued. The GetBlock method is pretty insane too. If the memory layout is sane, each chunk holds an array of blocks (if it is an array of pointers to individually allocated blocks, there will most likely be tons of cache misses). He could get a pointer to the first block in the block array outside the outer loop and iterate over that from start to finish, accessing each element in the order they are located in RAM to maximize the odds that the data is cached when it's needed, rather than repeatedly calling GetBlock with different indices.

Personally, I would also skip the methods on the blocks and just use public member variables (so a struct rather than a class), as there doesn't seem to be any need to enforce invariants on it, nor are there any behaviours; it's just data. (A rough sketch of this layout follows below.)
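
A rough sketch of the kind of layout and loop being suggested (a flat, contiguous array of plain-data blocks per chunk; the names are illustrative and only BlockType is taken from the OP's code):

#include <vector>

struct Block // plain data, no getters/setters
{
    bool      active;
    BlockType type;
};

struct Chunk
{
    static const int CHUNK_SIZE = 16;
    // One contiguous array instead of pointers to individually allocated blocks.
    Block blocks[CHUNK_SIZE * CHUNK_SIZE * CHUNK_SIZE];
};

// Walk straight through memory once instead of calling GetBlock(i, j, k) repeatedly.
void GatherActiveBlocks(const Chunk& chunk, std::vector<const Block*>& out)
{
    const Block* block = chunk.blocks;
    const Block* end   = block + Chunk::CHUNK_SIZE * Chunk::CHUNK_SIZE * Chunk::CHUNK_SIZE;
    for (; block != end; ++block)
    {
        if (block->active)
            out.push_back(block);
    }
}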

[size="1"]I don't suffer from insanity, I'm enjoying every minute of it.
The voices in my head may not be real, but they have some good ideas!


Well, maybe (thanks for the interesting answer), but you say 1M per frame at 60 fps with no problems (which is close to this 40M/s result), so maybe 3M per frame would then be hard to reach.

(I can say why I think a million is a lot: a blit of a million pixels can take roughly 0.2 ms, and 0.2 ms is quite a noticeable time even for something as simple as writing pixels; that is the basic RAM bandwidth limit of the hardware, I think. Transforming a triangle must be X times slower than just writing a raw integer to RAM, where X I don't know, but probably around 10, 100, or 1000. Assume one triangle costs about as much as 100 integer writes: if you have 5G pixels/s, then you get about 50M triangles/s, which is roughly the result we are talking about. The calculations are not very precise, but I think the estimate is about right; it is less a calculation than a kind of reasoning about orders of magnitude.)

This is maybe a strange calculation because it ignores parallel processing and just multiplies everything as if it were serial, but maybe the non-parallel parts of the work are the bottleneck. If it were all parallel, then blitting 1M pixels should take a nanosecond, as if it were one pixel, and rasterizing 1M triangles should take about as long as rasterizing one :-/

At least I think it shouldn't be faster than the cost of copying 1M triangles from RAM to the GPU (plus the cost of filling the screen buffer in RAM?). How many bytes does one triangle take, 36 bytes for the 3D coordinates or more? Copying 36 MB of RAM should take noticeable time by itself, I think. (I don't know much about GPU stuff; these are only my thoughts.)

