# Performance problems at ~1,000,000 triangles. Am I doing something wrong?

## Recommended Posts

nickwinters    183

Hello,

I've hit a snag with rendering triangles. At approximately 1,000,000 triangles, the frame rate drops to 30-40 fps. By the time I hit 5,000,000 triangles, I'm down to 10 fps. My CPU is an i5-3570K @ 3.4 GHz and my video card is a ZOTAC GeForce GTX 660 w/ 2GB of RAM. Should I experience an fps drop at one million triangles? Or am I doing something wrong?

My lighting model is a simple phong shading with only diffuse lighting.  I haven't added ambient or specular lighting yet.  There is only one point light source.

I've also enabled the following settings:

glEnable(GL_DEPTH_TEST);
glFrontFace(GL_CW);
glEnable(GL_CULL_FACE);     // Cull back facing polygons
glCullFace(GL_BACK);

-Nick

##### Share on other sites
Pufixas    1167
> Or am I doing something wrong?

How are we supposed to know that without seeing your code? Maybe you are calling glDrawElements/glDrawArrays after every triangle, maybe you have old drivers, maybe you are doing many OpenGL state changes. Many things could be wrong, but you are not helping yourself. Show us code, or give us more info.

##### Share on other sites
SimonForsman    7642

If you think your rendering is too slow, it might help if you tell us how you are drawing things, how big your triangles are, etc. (1 million big triangles takes longer to render than 1 million small ones; 1 million triangles rendered with a single draw call is a lot faster than 1 million triangles rendered with multiple draw calls.)

##### Share on other sites
nickwinters    183

The data points I've been able to get: 786,432 triangles render at 60 fps, 983,040 at 55 fps, and 1,228,800 at 40 fps.

I'm rendering individual cubes, each made of 12 triangles. The 12 triangles are in a single vertex buffer.

I'm fairly certain I'm using OpenGL 3.2.

Here's my drawing method.  Chunk::CHUNK_SIZE is 16.

        glViewport(0, 0, _windowWidth, _windowHeight); // Set the viewport size to fill the window
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT); // Clear the required buffers
        _modelMatrix = glm::mat4(); // Start from an identity model matrix
        _lightPosition = glm::vec4(-10.0, 30.0, 16.0, 1.0);
        int projectionMatrixLocation = glGetUniformLocation(_shader->id(), "projectionMatrix"); // Location of the projection matrix in the shader
        int viewMatrixLocation = glGetUniformLocation(_shader->id(), "viewMatrix"); // Location of the view matrix in the shader
        ChunkList chunks = _chunkManager.GetVisibleChunks();
        _viewMatrix = _camera->GetViewMatrix();
        glUniformMatrix4fv(projectionMatrixLocation, 1, GL_FALSE, &_projectionMatrix[0][0]); // Send the projection matrix to the shader
        glUniformMatrix4fv(viewMatrixLocation, 1, GL_FALSE, &_viewMatrix[0][0]); // Send the view matrix to the shader
        for (ChunkList::iterator it = chunks.begin(); it != chunks.end(); ++it)
        {
            Chunk *chunk = *it;
            glm::mat4 world = glm::mat4();
            world = glm::translate(world, glm::vec3(chunk->GetX() * Chunk::CHUNK_SIZE, chunk->GetY() * Chunk::CHUNK_SIZE, chunk->GetZ() * Chunk::CHUNK_SIZE));
            for (float i = 0; i < Chunk::CHUNK_SIZE; ++i)
            {
                for (float j = 0; j < Chunk::CHUNK_SIZE; ++j)
                {
                    for (float k = 0; k < Chunk::CHUNK_SIZE; ++k)
                    {
                        Block &block = *(chunk->GetBlock(i, j, k));
                        if (!block.IsActive())
                        {
                            continue;
                        }
                        _modelMatrix = glm::translate(world, glm::vec3(i, j, k));
                        int modelMatrixLocation = glGetUniformLocation(_shader->id(), "modelMatrix"); // Location of the model matrix in the shader
                        int pointLightPosition = glGetUniformLocation(_shader->id(), "pointLightPosition"); // Location of the light position in the shader
                        int material = glGetUniformLocation(_shader->id(), "color"); // Location of the color in the shader

                        glUniformMatrix4fv(modelMatrixLocation, 1, GL_FALSE, &_modelMatrix[0][0]); // Send the model matrix to the shader
                        glm::vec4 colorArray;
                        switch (block.GetBlockType())
                        {
                        case BlockType::Dirt:
                            colorArray = glm::vec4(0.55, 0.27, 0.074, 1.0);
                            break;
                        case BlockType::Grass:
                            colorArray = glm::vec4(0.0, 1.0, 0.0, 1.0);
                            break;
                        case BlockType::Bedrock:
                            colorArray = glm::vec4(0.5, 0.5, 0.5, 1.0);
                            break;
                        }
                        glUniform4fv(pointLightPosition, 1, &_lightPosition[0]);
                        glUniform4fv(material, 1, &colorArray[0]);
                        glBindVertexArray(_vaoId[0]); // Bind our Vertex Array Object
                        glDrawArrays(GL_TRIANGLES, 0, 36); // Draw one cube (12 triangles, 36 vertices)
                    }
                }
            }
        }
        glBindVertexArray(0);

        _shader->unbind(); // Unbind our shader

Here's my initialization code:

        _hwnd = hwnd; // Set the HWND for our window
        _hdc = GetDC(hwnd); // Get the device context for our window

        PIXELFORMATDESCRIPTOR pfd; // Create a new PIXELFORMATDESCRIPTOR (PFD)
        memset(&pfd, 0, sizeof(PIXELFORMATDESCRIPTOR)); // Clear our PFD
        pfd.nSize = sizeof(PIXELFORMATDESCRIPTOR); // Set the size of the PFD to the size of the struct
        pfd.dwFlags = PFD_DOUBLEBUFFER | PFD_SUPPORT_OPENGL | PFD_DRAW_TO_WINDOW; // Enable double buffering, OpenGL support and drawing to a window
        pfd.iPixelType = PFD_TYPE_RGBA; // Use RGBA pixels
        pfd.cColorBits = 32; // 32 bits of color information
        pfd.cDepthBits = 32; // 32 bits of depth information
        pfd.iLayerType = PFD_MAIN_PLANE; // Set the layer of the PFD

        int nPixelFormat = ChoosePixelFormat(_hdc, &pfd); // Check if our PFD is valid and get a pixel format back
        if (nPixelFormat == 0) // If it fails
            return false;

        bool bResult = SetPixelFormat(_hdc, nPixelFormat, &pfd); // Try to set the pixel format based on our PFD
        if (!bResult) // If it fails
            return false;

        HGLRC tempOpenGLContext = wglCreateContext(_hdc); // Create an OpenGL 2.1 context for our device context
        wglMakeCurrent(_hdc, tempOpenGLContext); // Make the OpenGL 2.1 context current and active

        GLenum error = glewInit(); // Initialize GLEW
        if (error != GLEW_OK) // If GLEW fails
            return false;

        int attributes[] = {
            WGL_CONTEXT_MAJOR_VERSION_ARB, 3, // Request OpenGL major version 3
            WGL_CONTEXT_MINOR_VERSION_ARB, 2, // Request OpenGL minor version 2
            WGL_CONTEXT_FLAGS_ARB, WGL_CONTEXT_FORWARD_COMPATIBLE_BIT_ARB, // Make the context forward compatible
            0
        };

        if (wglewIsSupported("WGL_ARB_create_context") == 1) { // If the OpenGL 3.x context creation extension is available
            _hrc = wglCreateContextAttribsARB(_hdc, NULL, attributes); // Create an OpenGL 3.x context based on the given attributes
            wglMakeCurrent(NULL, NULL); // Remove the temporary context from being active
            wglDeleteContext(tempOpenGLContext); // Delete the temporary OpenGL 2.1 context
            wglMakeCurrent(_hdc, _hrc); // Make our OpenGL 3.2 context current
        }
        else {
            _hrc = tempOpenGLContext; // If we don't have OpenGL 3.x support, fall back to the OpenGL 2.1 context
        }


The projection matrix is set to:

glm::perspective(60.0f, (float)_windowWidth / (float)_windowHeight, 0.1f, 100.f);

##### Share on other sites
bubu LV    1436

First of all, get rid of the glGetUniformLocation calls during drawing. Get those uniform locations once, after the shader is linked, and store them somewhere. glGet* calls are usually expensive operations in OpenGL.

Also, drawing only 36 vertices per draw call is a very small number. You should be drawing much more per call to get high performance with a large number of triangles. Look into instancing.

Another thing to look into is the uniform buffer object: uploading all uniforms at once can be much better than calling an individual glUniform* function per uniform.
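For illustration, caching the locations once could look like this. This is a minimal sketch: the stub function stands in for glGetUniformLocation so the example runs without a GL context, and the struct and all names are hypothetical, not from the thread.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for glGetUniformLocation so the sketch runs
// without a GL context; the real call is a driver round-trip with the
// same "expensive lookup by name" cost profile.
static int GetUniformLocationSlow(const std::string& name) {
    static const std::unordered_map<std::string, int> table = {
        {"projectionMatrix", 0}, {"viewMatrix", 1}, {"modelMatrix", 2},
        {"pointLightPosition", 3}, {"color", 4}};
    auto it = table.find(name);
    return it == table.end() ? -1 : it->second;
}

// Look every location up once, right after the program links, and keep
// the results in a plain struct; the draw loop then reuses the ints.
struct ShaderUniforms {
    int projectionMatrix, viewMatrix, modelMatrix, pointLightPosition, color;
};

ShaderUniforms CacheUniforms() {
    ShaderUniforms u;
    u.projectionMatrix   = GetUniformLocationSlow("projectionMatrix");
    u.viewMatrix         = GetUniformLocationSlow("viewMatrix");
    u.modelMatrix        = GetUniformLocationSlow("modelMatrix");
    u.pointLightPosition = GetUniformLocationSlow("pointLightPosition");
    u.color              = GetUniformLocationSlow("color");
    return u;
}
```

In the real code, the draw loop then calls e.g. `glUniformMatrix4fv(u.modelMatrix, 1, GL_FALSE, ...)` with the cached integer instead of looking the name up every iteration.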

Edited by Martins Mozeiko

##### Share on other sites
nickwinters    183

Moving the glGetUniformLocation calls and the constant glUniforms out of the drawing loop gave me a boost from 10 fps to 13-15 fps at 5M triangles, so that's a start. I'll look into instancing.

Edited by nickwinters

##### Share on other sites
fir    460

Also note that 1 million primitives is a big amount. Some days ago I asked how many triangles per second today's GPUs can process; nobody wanted to give me an answer. Maybe I can now treat your result of ~40M triangles/sec as a possible answer to that (?)

##### Share on other sites
SimonForsman    7642

> Also note that 1 million primitives is a big amount. Some days ago I asked how many triangles per second today's GPUs can process; nobody wanted to give me an answer. Maybe I can now treat your result of ~40M triangles/sec as a possible answer to that (?)

1 million triangles is almost nothing for a modern GPU; you could render 1 million triangles per frame on an old GeForce 7800 GTX at 60 fps without any problems. The poor performance in the OP's case is because he only renders 12 triangles per draw call (over 80,000 draw calls per frame is quite a bit), and on top of the draw calls there is also quite a lot of unnecessary work being done before each draw call is issued. The GetBlock method is pretty insane: if the memory layout is sane, then each chunk holds an array of blocks (if it is an array of pointers to individually allocated blocks there will most likely be tons of cache misses), and he could get a pointer to the first block in the array outside the outer loop and iterate over that from start to finish, accessing each element in the order they are located in RAM to maximize the odds that the data is cached when needed, rather than repeatedly calling GetBlock with different indices.

Personally I would also skip the methods on the blocks and just use public member variables (so a struct rather than a class), as there doesn't seem to be any need to enforce any invariants on it, nor are there any behaviours; it's just data.
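The flat-array layout described above could be sketched like this. The names are hypothetical, the Block fields are reduced to the essentials, and only CHUNK_SIZE = 16 comes from the thread:

```cpp
#include <vector>

// One contiguous array of plain Block structs per chunk, indexed
// x + 16*(y + 16*z), and walked front to back so accesses follow the
// order blocks actually sit in RAM (cache friendly, no virtual calls).
constexpr int CHUNK_SIZE = 16;

struct Block {              // a struct: just data, no behaviours
    bool active = false;
    unsigned char type = 0;
};

inline int BlockIndex(int x, int y, int z) {
    return x + CHUNK_SIZE * (y + CHUNK_SIZE * z);
}

int CountActive(const std::vector<Block>& blocks) {
    int n = 0;
    for (const Block& b : blocks)   // one linear pass over the chunk
        n += b.active ? 1 : 0;
    return n;
}
```

The draw/upload loop would walk the vector the same linear way instead of calling GetBlock(i, j, k) 4,096 times per chunk.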

Edited by SimonForsman

##### Share on other sites
fir    460

> Also note that 1 million primitives is a big amount. [...]

> 1 million triangles is almost nothing for a modern GPU (you could render 1 million triangles per frame on an old GeForce 7800 GTX at 60 fps without any problems); the poor performance in the OP's case is because he only renders 12 triangles per draw call. [...]
>
> Personally I would also skip the methods on the blocks and just use public member variables (so a struct rather than a class). [...]

Well, maybe (thanks for the interesting answer), but you say 1M at 60 fps with no problems (which is close to this 40M result); maybe 3M would then be hard to get.

I can say why I think a million is a lot: a blit of a million PIXELS can take about 0.2 ms (something around that), and 0.2 ms is quite a noticeable time even for such a simple thing as a pixel - that's why I say a million is a lot. This is the basic hardware RAM speed limit, I think. Transforming a triangle must be X times slower than just writing a raw integer to RAM, where X is, I do not know, maybe about 10, 100, 1000? Assume one triangle costs 100 ints: if you've got 5G pixels/s, then you will have about 50M triangles.

(The calculations are not quite precise - more a kind of reasoning about estimates. It is maybe a strange calculation because it is not about parallel processing but just multiplies everything as if it were serial; but maybe the non-parallel part of doing the work is the bottleneck. If it were all parallel, then blitting 1M pixels should take 1 nanosecond, as if it were one pixel, and rasterizing 1M triangles should take about the same as rasterizing one :-/)

At least I think it shouldn't be faster than the cost of copying 1M triangles RAM -> GPU (+ the cost of filling the screen buffer in RAM?). How many bytes does one triangle take - 36 bytes for 3D coordinates, or yet more? Copying 36 MB of RAM should take noticeable time by itself, I think. (I do not know too much about GPU stuff; these are only my thoughts.)

Edited by fir

##### Share on other sites
richardurich    1352

Some people seem to want raw numbers and math, so I'll give some. Note that taking these numbers at face value would be very foolish. They are ignoring extremely important things like overhead and are such terrible oversimplifications that I can guarantee they are wrong. Anyways, hopefully the numbers will still give you a concept of the right ballpark for the speeds. Just always test actual code instead of relying on theoretical numbers.

An AMD R9 290x can push around 4 billion triangles a second, and has about 300 GB/s of bandwidth. PS4 pushes 1.6 billion triangles/sec, 176 GB/s. Xbox 360 (last gen) did 500 million triangles/sec.

For fill rate, the numbers for R9 290x are 176 billion texels/sec and 44 billion pixels/sec. So 1 million pixels would take ~0.025ms on R9 290x.

Copying 36 MB internally in a R9 290x would take about 1/10,000th of a second or 0.1 ms. Copying across a 16-lane PCIe v3 bus would take about 20 times as long, or about 2 ms. That's why you want to minimize the size of update data. Writing data from CPU to RAM (with dual-channel RAM) is roughly twice as fast as a 16-lane PCIe v3 bus for reference.
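The copy-time arithmetic above can be sketched as follows. The 300 GB/s figure is the VRAM bandwidth quoted in this post; the ~15.75 GB/s for a 16-lane PCIe v3 bus is my assumed usable rate, which lands near the ~2 ms stated above:

```cpp
// Rough sanity check of the copy times quoted above: time = bytes / bandwidth.
struct CopyTimes { double vramMs, pcieMs; };

CopyTimes EstimateCopyTimes(double bytes) {
    const double vramBytesPerSec = 300.0e9;  // quoted R9 290x VRAM bandwidth
    const double pcieBytesPerSec = 15.75e9;  // assumed 16-lane PCIe v3 rate
    return { bytes / vramBytesPerSec * 1000.0,   // internal copy, in ms
             bytes / pcieBytesPerSec * 1000.0 }; // copy across the bus, in ms
}
```

For 36 MB this gives roughly 0.12 ms internally and about 2.3 ms over the bus, consistent with the ~0.1 ms and ~2 ms figures above.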

Transforming a triangle is sort of slow since you're sending data across the bus, but transforming every triangle in a million polygon model by a constant amount takes the exact same amount of time as transforming a single triangle if you're using the model matrix for the transform (multiply by model, view, projection). You probably knew there was a really good reason we all use model, view, projection matrices, so there it is.

##### Share on other sites
Hodgman    51223

> Moving the glGetUniformLocation calls and the constant glUniforms out of the drawing loop gave me a boost from 10 fps to 13-15 fps at 5M triangles, so that's a start. I'll look into instancing.

The important thing to note here is that you reduced the amount of work that the CPU is doing each frame, and you got a substantial performance improvement.

That means that the GPU (which is actually processing triangles) is not your problem, and the number of triangles being drawn is completely irrelevant.
Currently, your CPU is taking more time to prepare the drawing commands than the GPU is taking to actually execute those commands!

You need to profile your CPU-side code and see why it's a bottleneck.
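A minimal way to start is to time the CPU-side submission code directly. This is a generic sketch using std::chrono; nothing in it is specific to the OP's code:

```cpp
#include <chrono>

// Time the draw-submission code on the CPU. If this number alone
// approaches the frame budget (16.7 ms at 60 fps), the CPU, not the
// GPU, is the bottleneck.
template <typename SubmitFn>
double TimeSubmitMs(SubmitFn submit) {
    auto t0 = std::chrono::steady_clock::now();
    submit();  // build matrices, set uniforms, issue draw calls, etc.
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Once you know the CPU submit time per frame, a real profiler can tell you where inside it the time goes (GetBlock, matrix building, GL calls, ...).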

##### Share on other sites
slicer4ever    6760

> Should I experience an fps drop at one million triangles?

> As above, we can't know what you mean here without more information. What was the FPS at 999,999 triangles? Was there really a sudden huge performance change when adding one more triangle? If so, then yes, something is very strange.

Is it that strange? What if that one extra triangle takes just a fraction of a fraction of a second longer, so that you now miss v-sync and have to wait until the next one? Instead of 60 fps, you're now running at 30 fps, even though your frame time might only be 17 ms.

@OP: it sounds like your bottleneck is the overhead of issuing the draw calls and doing state changes. With your current setup, I'd definitely recommend learning how to do instancing, and using uniform buffers.
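As a sketch of the uniform-buffer idea mentioned above: pack the per-frame constants from this thread into one block and upload them in a single call. The struct name and the std140 layout reasoning here are mine, not from the thread:

```cpp
#include <cstddef>

// CPU-side mirror of a std140 uniform block holding the per-frame
// constants used in this thread. std140 aligns vec4/mat4 members to
// 16 bytes, which a plain struct of 16-byte-multiple members already
// satisfies, so the whole thing can be uploaded with one
// glBufferData(GL_UNIFORM_BUFFER, sizeof(FrameConstants), &fc, GL_DYNAMIC_DRAW)
// instead of three separate glUniform* calls.
struct FrameConstants {
    float projectionMatrix[16];  // mat4, std140 offset 0
    float viewMatrix[16];        // mat4, std140 offset 64
    float pointLightPosition[4]; // vec4, std140 offset 128
};

static_assert(offsetof(FrameConstants, viewMatrix) == 64,
              "matches std140 layout");
static_assert(offsetof(FrameConstants, pointLightPosition) == 128,
              "matches std140 layout");
```

On the GLSL side this corresponds to a `layout(std140) uniform` block with the same three members, bound once with glBindBufferBase.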

##### Share on other sites
nickwinters    183

I'm now attempting instancing, but am doing something wrong.  Nothing is being rendered.

I changed modelMatrix and color from uniforms to regular vertex inputs.

#version 420 core

uniform mat4 projectionMatrix;
uniform mat4 viewMatrix;
uniform vec4 pointLightPosition;

in vec4 in_Position;
in vec4 in_Normal;
in vec2 in_uv;
in mat4 modelMatrix;
in vec4 color;

out vec4 pass_Color;
smooth out vec3 normal;
smooth out vec3 lightVector;

void main(void)
{
    gl_Position = projectionMatrix * viewMatrix * modelMatrix * in_Position;
    vec4 normalWorld = transpose(modelMatrix) * in_Normal;
    normal = normalize(vec3(normalWorld));
    lightVector = normalize(vec3(pointLightPosition) - vec3(modelMatrix * in_Position));
    pass_Color = color;
}


Here's how I initialize my VBOs. The two new ones are at the bottom.

    glGenVertexArrays(1, &_vaoId); // Create our Vertex Array Object
    glBindVertexArray(_vaoId); // Bind our Vertex Array Object so we can use it

    glGenBuffers(1, &_vboId); // Generate our Vertex Buffer Object
    glBindBuffer(GL_ARRAY_BUFFER, _vboId); // Bind our Vertex Buffer Object
    int vertexSize = sizeof(Vertex);
    glBufferData(GL_ARRAY_BUFFER, 36 * vertexSize, vertices, GL_STATIC_DRAW); // Upload the cube vertices as STATIC_DRAW
    glVertexAttribPointer(positionId, 4, GL_FLOAT, GL_FALSE, vertexSize, 0); // Set up the position attribute pointer
    glVertexAttribPointer(normalId, 4, GL_FLOAT, GL_FALSE, vertexSize, (void*)16); // Set up the normal attribute pointer
    glEnableVertexAttribArray(positionId); // Enable the position attribute array
    glEnableVertexAttribArray(normalId); // Enable the normal attribute array

    glBindVertexArray(0); // Disable our Vertex Array Object

    delete[] vertices;
    glGenBuffers(1, &_colorVbo);
    glBindBuffer(GL_ARRAY_BUFFER, _colorVbo);
    glEnableVertexAttribArray(colorId);
    glVertexAttribPointer(colorId, 4, GL_FLOAT, GL_FALSE, sizeof(vec4), 0);
    glVertexAttribDivisor(colorId, 1); // is it instanced?

    glGenBuffers(1, &_modelVbo);
    glBindBuffer(GL_ARRAY_BUFFER, _modelVbo);
    for (int c = 0; c < 4; ++c)
    {
        glEnableVertexAttribArray(modelMatrixId + c);
        glVertexAttribPointer(modelMatrixId + c, 4, GL_FLOAT, GL_FALSE, sizeof(mat4), (void*)(c * sizeof(vec4)));
        glVertexAttribDivisor(modelMatrixId + c, 1); // is it instanced?
    }


In my draw method, I fill the two vectors, upload them into the VBOs, and then attempt to draw my triangles.

    glViewport(0, 0, _windowWidth, _windowHeight); // Set the viewport size to fill the window
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT); // Clear the required buffers
    _modelMatrix = glm::mat4(); // Start from an identity model matrix
    _lightPosition = glm::vec4(-10.0, 30.0, 16.0, 1.0);
    int projectionMatrixLocation = glGetUniformLocation(_shader->id(), "projectionMatrix"); // Location of the projection matrix in the shader
    int viewMatrixLocation = glGetUniformLocation(_shader->id(), "viewMatrix"); // Location of the view matrix in the shader
    int modelMatrixLocation = glGetUniformLocation(_shader->id(), "modelMatrix"); // Location of the model matrix in the shader
    int pointLightPosition = glGetUniformLocation(_shader->id(), "pointLightPosition"); // Location of the light position in the shader
    int material = glGetUniformLocation(_shader->id(), "color"); // Location of the color in the shader
    ChunkList chunks = _chunkManager.GetVisibleChunks();
    _viewMatrix = _camera->GetViewMatrix();

    glUniformMatrix4fv(projectionMatrixLocation, 1, GL_FALSE, &_projectionMatrix[0][0]); // Send the projection matrix to the shader
    glUniformMatrix4fv(viewMatrixLocation, 1, GL_FALSE, &_viewMatrix[0][0]); // Send the view matrix to the shader
    glUniform4fv(pointLightPosition, 1, &_lightPosition[0]);
    vector<mat4> modelMatrices;
    vector<vec4> colors;
    for (ChunkList::iterator it = chunks.begin(); it != chunks.end(); ++it)
    {
        Chunk *chunk = *it;
        glm::mat4 world = glm::mat4();
        world = glm::translate(world, glm::vec3(chunk->GetX() * Chunk::CHUNK_SIZE, chunk->GetY() * Chunk::CHUNK_SIZE, chunk->GetZ() * Chunk::CHUNK_SIZE));
        for (float i = 0; i < Chunk::CHUNK_SIZE; ++i)
        {
            for (float j = 0; j < Chunk::CHUNK_SIZE; ++j)
            {
                for (float k = 0; k < Chunk::CHUNK_SIZE; ++k)
                {
                    Block &block = *(chunk->GetBlock(i, j, k));
                    if (!block.IsActive())
                    {
                        continue;
                    }

                    glm::vec4 colorArray;
                    switch (block.GetBlockType())
                    {
                    case BlockType::Dirt:
                        colorArray = glm::vec4(0.55, 0.27, 0.074, 1.0);
                        break;
                    case BlockType::Grass:
                        colorArray = glm::vec4(0.0, 1.0, 0.0, 1.0);
                        break;
                    case BlockType::Bedrock:
                        colorArray = glm::vec4(0.5, 0.5, 0.5, 1.0);
                        break;
                    }

                    _modelMatrix = glm::translate(world, glm::vec3(i, j, k));
                    modelMatrices.push_back(_modelMatrix);
                    colors.push_back(colorArray);
                }
            }
        }
    }
    glBindVertexArray(_vaoId); // Bind our Vertex Array Object

    glBindBuffer(GL_ARRAY_BUFFER, _modelVbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(mat4) * modelMatrices.size(), &modelMatrices[0], GL_DYNAMIC_DRAW);

    glBindBuffer(GL_ARRAY_BUFFER, _colorVbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(vec4) * colors.size(), &colors[0], GL_DYNAMIC_DRAW);
    glDrawArraysInstanced(GL_TRIANGLES, 0, 36, colors.size());

    glBindVertexArray(0);

    SwapBuffers(_hdc); // Swap buffers so we can see our rendering


Did I miss any steps?  Is the order of something incorrect?

Thank you.

Edit: Changed my initialization of the matrix vertex buffer to use 4 vectors.

Edited by nickwinters

##### Share on other sites
fir    460

> Some people seem to want raw numbers and math, so I'll give some. [...]
>
> An AMD R9 290x can push around 4 billion triangles a second, and has about 300 GB/s of bandwidth. PS4 pushes 1.6 billion triangles/sec, 176 GB/s. Xbox 360 (last gen) did 500 million triangles/sec. [...]

I still think these are theoretical values, and I would be more interested in practical test results. You say 4,000M triangles a second while the asker reports about 40M a second; I suspect 40M is much closer to reality. For example, do you think 400M is realistic? Also, the fill-rate and RAM-copy values may be somewhat higher than the real ones (for example, on the old Core 2 Duo with a GT 610 I have at home, a 1 MB memset takes 0.25 ms).

Well, I tested once more with newer code:

void OnIdle()
{
    static char tab[36 * 1000 * 1000];
    memset(tab, 77, 36 * 1000 * 1000);

    double ns = TakeTimeNs(); // gives the delta time between calls

    printf("\n delta %f ns", ns);
    printf(" fps %f ", 1.0 / (0.000000001 * ns));
}



This by itself gives 80 fps (a 12 ms delta) for just the memset. (With the memset commented out, it gives 0.05 ms for the printf and timer overhead.) That works out to about 0.33 ms per 1 MB of memset; when I memset 1 GB I get 3 fps, which is 330 ms, i.e. again 0.33 ms per MB (the thing I thought).

I have an old machine, but I am not sure it is much better on a top new one.

Edited by fir

##### Share on other sites
richardurich    1352

Real-world numbers are nearly as useless as theoretical numbers. My game has my shaders, my animations, my models, etc. Your game has yours. Given that warning, I'd recommend looking at benchmarks if you want real-world numbers since they often give triangle data and such, and reviewers give the fps data and such for tons of different system setups. As an example, 3DMark's Cloud Gate benchmark processes over 3 million vertices per frame, rasterizes 1.1 million triangles to shadow maps and the screen. A GeForce 660 like the OP is using hit ~170 fps on that back in Feb 2013.

If you have a R9 290x, you should be able to hit 400 million triangles (6.7 million @ 60 fps) even with fairly complex scenes. Heck, actual XBox 360 games were running scenes with 4 million even a few years ago. 40 million is definitely low for relatively modern hardware, although obviously what you're rendering and how you're rendering makes all the difference.

It is worth noting that if you are targeting integrated graphics (a massively huge chunk of the PC population), that 40 million triangles per second is a lot closer to reality. Heavily optimizing on Ivy Bridge got about 100 million triangles per second. Most code won't be that optimized. But now we have instancing which should be huge for integrated graphics, and Intel has actually improved things a fair bit. I have no idea what integrated graphics should handle in DirectX 11 (instancing) or using more recent hardware, and I hope not to have to start working with that again for at least a year.

Is your memset example generating SSE code? What speed is your RAM? Is it dual-channel? For single-channel PC2 3200 memory, that would obviously be a really good result. Without knowing anything about your hardware or even your compiler, it's hard to say much. Again, you can turn to benchmarks to see what actual code is actually getting in terms of bandwidth. For memory benchmarks, they usually even tell you the way they're doing the memory transfers.

##### Share on other sites
fir    460

> Real-world numbers are nearly as useless as theoretical numbers. [...] As an example, 3DMark's Cloud Gate benchmark processes over 3 million vertices per frame, rasterizes 1.1 million triangles to shadow maps and the screen. A GeForce 660 like the OP is using hit ~170 fps on that back in Feb 2013. [...]
>
> Is your memset example generating SSE code? What speed is your RAM? Is it dual-channel? [...]

Well, thanks for the interesting info again. This would be a good answer to my previous question that nobody wanted to answer (how many triangles...), and I can treat this 40M result as a base that one can optimize up from.

I am not sure about this 170 × 1.1M and the 400M: can that be raw power, or is it because of some frustum culling, maybe?

As for memset, I'm using MinGW (I've got an old Core 2 Duo @ 2.3 GHz machine, a GT 610 card, 32-bit XP). Years back I tested a 1 MB memset on an old P4 2.4 machine and it was 3-4x worse, about 1.2 ms. As for the RAM, the SIW program reports this data:

| Device locator | Manufacturer | Capacity | Memory type | Speed | Data width | Form factor |
|---|---|---|---|---|---|---|
| XMM1 | JTAG Technologies | 1024 MB | Synchronous DDR2 | 800 MHz | 64 bits | DIMM |
| XMM2 | Micron Technology | 1024 MB | Synchronous DDR2 | 800 MHz | 64 bits | DIMM |
| XMM3 | Micron Technology | 1024 MB | Synchronous DDR2 | 800 MHz | 64 bits | DIMM |

(The XMM3 device set reports as JEDEC ID 2c 00 00 00 00 00 00.)

I've got 3 GB total - is this dual-channel?

I think this basic fact (0.3 ms memset per MB) is the most limiting thing for these machines. Would DDR3 be noticeably faster than 0.3 ms here? Is the VRAM in GPUs (especially the pixel buffer of the screen contents) much faster?

(When I run a simple cube-rendering test on my card, I get an insane reported speed of 300,000 fps (*) (3 microseconds per frame, 400x300 window), but I am not sure it really blits 300 thousand screens per second; that would be 400x300x4x300x1000, i.e. 144 GB of bytes blitted per second, when my CPU's ability is, as I said, a stable 3 GB per second, no less, no more.)

(*) I take the info from the same function call as with the memset; it is called in each display() call placed in the idle loop, and it draws a cube with a vertex-array glDrawElements call.

Edited by fir

##### Share on other sites
richardurich    1352

> I've got 3 GB total - is this dual-channel? I think this basic fact (0.3 ms memset per MB) is the most limiting thing for these machines. Would DDR3 be noticeably faster than 0.3 ms here? Is the VRAM in GPUs (especially the pixel buffer of the screen contents) much faster?

I think Core 2 supports dual-channel if you have equal memory in both channels, which you do not. You have 3x1GB, so it will be running in single-channel instead. If you remove 1 GB of RAM and make sure to use the right memory banks, your memory bandwidth may very well double.

VRAM in a GPU is an order of magnitude faster than system RAM.

It is not at all safe to say system memory bandwidth is normally the limiting factor for performance. It sometimes is. Sometimes CPU processing capacity is. Sometimes GPU memory bandwidth is. Sometimes PCIe bus bandwidth is. Sometimes GPU processing capacity is.

This is getting way too off-topic though. We're in OpenGL forum, but talking about generic system concepts. Feel free to copy/paste anything I've written here if you want to start a new thread in a more general forum. I'm sure something I've said is wrong or misleading, so getting a wider audience will let others correct anywhere I've screwed up.

> I'm now attempting instancing, but am doing something wrong. Nothing is being rendered.
> ...
>
> Did I miss any steps? Is the order of something incorrect?

For the OP, have you done an instanced drawing tutorial? I'd recommend that as a first step. I don't really bother trying to help people with code snippets since it requires way too much effort and is usually something in the code not provided anyways. Give me something I can just copy/paste into Visual Studio and run and then it's a simple debugging problem.

Sorry for derailing your thread though, I didn't expect a back-and-forth from just providing numbers.

Edited by richardurich

##### Share on other sites
nickwinters    183
I've followed a couple, but haven't written one myself. I'll try a stripped down version of my project tonight and if it doesn't work, post it here. Thank you!

##### Share on other sites
fir    460

I got 3 GB total - is this dual channel?

I think this basic fact (0.3 ms/MB memset) is the most limiting thing for these machines - would DDR3 be noticeably faster than 0.3 ms here? Is the VRAM in the GPU (especially the pixel buffer holding the screen contents) much faster?

I think Core 2 supports dual-channel if you have equal memory in both channels, which you do not. You have 3x1GB, so it will be running in single-channel instead. If you remove 1 GB of RAM and make sure to use the right memory banks, your memory bandwidth may very well double.

VRAM in a GPU is an order of magnitude faster than system RAM.

It is not at all safe to say system memory bandwidth is normally the limiting factor for performance. It sometimes is. Sometimes CPU processing capacity is. Sometimes GPU memory bandwidth is. Sometimes PCIe bus bandwidth is. Sometimes GPU processing capacity is.

This is getting way too off-topic though. We're in OpenGL forum, but talking about generic system concepts. Feel free to copy/paste anything I've written here if you want to start a new thread in a more general forum. I'm sure something I've said is wrong or misleading, so getting a wider audience will let others correct anywhere I've screwed up.

I do not feel misinformed; it was more of a good picture of how things look here.

Indeed, I do not know which of the five things you mention (three bandwidths and two processing capacities) is the most limiting in general. I have mostly been measuring only my own CPU code, and there I noticed that this 0.3 ms/MB is probably the most crucial factor, as I said. I only bought the GPU recently :/ (I never had more than an onboard GPU before :\ I know I am 15 years late ;/), but now that I have it, maybe I will be able to run some OpenCL or something like that and do a raw test of the other two bandwidths and one processing capacity available :\ That can be measured with some snippets, right? Is OpenCL the thing I should be interested in?

(I forgot to mention I am using MinGW and have a very limited modem connection - I cannot take big downloads, so I am not sure I would be able to test that.)

##### Share on other sites

I saw the original few posts but nothing further. Try batching the cubes if they never change; you shouldn't need to set the dirt alpha values etc. Put them all in one giant 3D model and render that. You can also divide it up into, say, a 4x4 grid, where you have 16 smaller parts of a bigger model and only draw the ones you can see.

##### Share on other sites
laztrezort    1058

For the OP:

I assume you are attempting something like shown in this tutorial: http://en.wikibooks.org/wiki/OpenGL_Programming/Glescraft_1

If so, go through that and the following pages and see how they handle rendering.  Notice that they build an entire chunk (16x16x16 or whatever) of cubes all at once and put it in a VBO, and draw each chunk with one draw call.  You can optimize even more (and reduce z-fighting and artifacts when you get to doing lighting) by not building faces that are adjacent to opaque blocks.  Only rebuild a chunk when something in it has changed.

Doing what is shown in that tutorial, you can render a very large number of cubes without any problem.  Unless I'm misunderstanding something, I don't think instancing is going to help at this point.
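A minimal sketch of that hidden-face test might look like the following (the names and chunk layout are my own; a real mesher would append the two triangles of each visible face to the chunk's vertex array instead of just counting them):

```cpp
#include <array>

constexpr int N = 16;  // chunk edge length, matching Chunk::CHUNK_SIZE

// Solid/empty flags for one 16x16x16 chunk of cubes.
using Chunk = std::array<std::array<std::array<bool, N>, N>, N>;

bool solidAt(const Chunk& c, int x, int y, int z)
{
    if (x < 0 || y < 0 || z < 0 || x >= N || y >= N || z >= N)
        return false;              // outside the chunk counts as empty
    return c[x][y][z];
}

// Emit a face only when it borders an empty cell, so interior
// geometry never reaches the GPU. Here we just count the faces.
int countVisibleFaces(const Chunk& c)
{
    static const int dirs[6][3] = {
        {1,0,0}, {-1,0,0}, {0,1,0}, {0,-1,0}, {0,0,1}, {0,0,-1}};
    int faces = 0;
    for (int x = 0; x < N; ++x)
        for (int y = 0; y < N; ++y)
            for (int z = 0; z < N; ++z) {
                if (!c[x][y][z]) continue;
                for (auto& d : dirs)
                    if (!solidAt(c, x + d[0], y + d[1], z + d[2]))
                        ++faces;   // exposed to air: keep this face
            }
    return faces;
}
```

For a completely solid 16^3 chunk this keeps only the 6 x 16 x 16 = 1,536 outer faces out of 24,576 total, which is why the technique pays off so well.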

##### Share on other sites
nickwinters    183

@Laztrezort:  Thanks for the page, that's an interesting find.  I hadn't considered giving each chunk its own VBO.  I'll go through the tutorials to see how it works.  I believe I can only set one texture per VBO.  That shouldn't be a problem, since I can merge the possible textures into one PNG.  However, a single VBO means all the cubes in a chunk would share the same material.  I'll have to think about the ramifications of that.

I had planned not to render cubes that cannot be seen, but I thought I should ask, since a couple million triangles shouldn't be a problem for today's graphics cards.

@dpadam450: is there a particular batching technique you recommend I look into?

##### Share on other sites
laztrezort    1058

Merging textures into one (or as few as possible) larger textures (usually called a "texture atlas") is a standard technique, precisely so that the number of draw calls can be reduced.  You just need to give each vertex an appropriate texture coordinate.  The same thing can be done with other material data - the more you can reduce the number of calls, the better performance you will see.  This is probably the most important optimization you can make at this point.
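As a sketch of the texture-coordinate side, assuming a square atlas of equally sized tiles (the helper name is mine):

```cpp
// UV rectangle of one tile inside the atlas, in [0, 1] texture space.
struct UVRect { float u0, v0, u1, v1; };

// Map a tile index (row-major) in a tilesPerSide x tilesPerSide atlas
// to the texture coordinates of its corners.
UVRect atlasRect(int tileIndex, int tilesPerSide)
{
    float step = 1.0f / float(tilesPerSide);
    int tx = tileIndex % tilesPerSide;   // column in the atlas
    int ty = tileIndex / tilesPerSide;   // row in the atlas
    return { tx * step, ty * step, (tx + 1) * step, (ty + 1) * step };
}
```

One caveat worth knowing: with GL_LINEAR filtering or mipmaps, samples near tile edges can bleed into neighboring tiles, so atlases usually inset the UVs by half a texel or pad each tile with a border.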

I believe dpadam450 was referring to the same thing with "batching."  This is basically just putting as many vertices in a VBO as you can, to reduce the amount of state changes and draw calls.  In fact, a good amount of rendering optimization is based around sorting data intelligently into batches.

A million triangles is nothing for modern hardware - as long as you are not making thousands of separate draw calls.  You will find that the difference in rendering speed between batching and not batching is huge.

##### Share on other sites
nickwinters    183

I just moved each chunk into its own vertex buffer.  It's running at 60 fps in debug mode, and I haven't even begun to remove blocks that cannot be seen.  This means I'll have to add textures earlier than I thought.