Rendering a one million triangle semi-static mesh

12 comments, last by Aks9 12 years, 11 months ago
Hey Gamedev.net forum,

I'm creating some research related code wherein I am rendering a mesh as a proxy to run some GLSL ray-tracing code on volume data. This mesh used to be a simple cube, but I've grown greedy and want to eliminate superfluous rays which will never hit anything in my volume data. As a first attempt I've created a tight-fitting mesh around my data where I used quads to create a mesh encapsulating all my voxels. This creates a mesh consisting of approximately half a million quads...

For some reason I was convinced that my setup (2x SLI GeForce 280M cards) would easily push about 1 million triangles without too much hassle. Not so. My question, therefore: is it possible for me to optimize the rendering without too much mesh preprocessing?

Right now, I throw all the quads into a display list and render that to generate the start and end points for my rays. I've searched around and found various suggestions... Yann L wrote the following in an older post (in reply to a related question):

* Don't use display lists. Use VBOs with batch sizes of around 50k to 60k, using unsigned short indices (in index arrays).
* Don't use triangle strips, but indexed triangle lists instead.
* Don't call glFlush

Would this significantly increase performance when rendering a million triangles?

I am also aware that my quads currently have several vertices that overlap without "proper" sharing. But I keep worrying about whether anything I do will give a significant enough performance increase. The simple box runs at full speed of course (vsynced to 59 fps), but the million-quad mesh runs at about 12 fps. Not quite the performance I was hoping for.

So again, does anyone have experience with the kind of performance possible with minimal mesh preprocessing? That is, avoiding spatial partitioning.

Regards,
Gazoo

P.S. I should perhaps mention that I will be modifying the mesh quite a bit, hence the reluctance to do too much preprocessing; the interaction should run fairly close to real time...
When you say 'quads', do you mean you're drawing with GL_QUADS? That's not supported in hardware and is guaranteed to be very slow.

Using VBOs and indexed triangle lists, I can render 1 million polys at home at 30 fps on a GTX 260 (with lots of other stuff going on too, like deferred lighting).
Checked my current app and it happens to render a little over 1 million quads (i.e. 4 million vertices) at about 110 fps on a GTX 260, though the shaders aren't overly complex. It's far from the most optimized way to submit data.

- No indices (quad vertices aren't generally shared anyway)
- Small batch sizes (hundreds of smallish VBOs, about 1000 binds/draw calls)

I have yet to run into hardware that has trouble with GL_QUADS. Sure, quads might not be accelerated in hardware, but somewhere along the way they will be split into triangles anyway (apparently by simply repeating the third vertex). So for this scenario quads turned out faster, simply by saving 33% of vertex data compared to a triangle list. Vertex data is compressed to 6 bytes (8 after padding) and worst-case scenarios might have more than 65k vertices. I don't really feel like adding complexity to split VBOs, iterate over lists, barely remove any redundant vertex data, and add another 2-4 bytes per vertex for index data (saving memory is currently a bigger priority).

Though seriously, why ask? Just forget about your project for a day and write a simple test where you push a million quads with a display list, then use a VBO, then compare with triangle lists and an indexed list. If any of that makes a big difference, modify your project.
f@dz - http://festini.device-zero.de
hi,

On my GF 8600 GT I can render 1.5 million polys at 26-30 FPS using VBOs and VAOs, though this is only one object: one VAO and 3 VBOs (vertex, normal, tex coord). I render the geometry using glDrawArrays(), since the vertices are in order and the video card just has to iterate through them. And I'm not using any optimizations like triangle strips, occlusion queries or other stuff... The only thing I use is back-face culling. However, there's still much room to improve the rendering speed.

There's a good article on improving rendering performance:
http://rastergrid.com/blog/2010/10/gpu-based-dynamic-geometry-lod/
Also, on NVIDIA cards packed vertices (e.g. normalized bytes) render faster than float vertices.
Benchmark: http://axelynx.googl...ked_normals.zip

On my notebook (9600M GT, P8400 2x2.26 GHz) this scene renders at the following fps (with everything in frame):
FFP (interleaved arrays) - 218
GL 3.3 forward context:
float - 185
packed - 255

(code: http://code.google.c...Surface.cpp#402)

On ATI cards I don't see a difference between packed and unpacked vertex attribs.

Also, with packed vertices the draw calls (DIPs, i.e. glDraw...) take more time than with unpacked ones.
Sorry for my bad English.
axelynx: http://likosoft.com
If you are changing your mesh at run time, using a display list is not a good choice. The reason is obvious: you cannot modify a display list! You have to destroy and recreate the DL, and that takes some time. VBOs are much better for dynamic data.
For better utilization of the vertex post-processing cache, you should use indexed drawing with appropriate (cache-friendly) indices. If you just modify the positions of the vertices, but not the number and shape of the primitives, you can prepare the indices just once and reuse them many times. If your program is not vertex-shader bound, though, you will see no improvement from optimizing for the vertex cache.

Use buffers that are as large as possible. That will reduce driver "management" time and the number of function calls. Using shorts instead of ints for indices is not important nowadays.

I'm creating some research related code wherein I am rendering a mesh as a proxy to run some GLSL ray-tracing code on volume data. This mesh used to be a simple cube, but I've grown greedy and want to eliminate superfluous rays which will never hit anything in my volume data. As a first attempt I've created a tight-fitting mesh around my data where I used quads to create a mesh encapsulating all my voxels. This creates a mesh consisting of approximately half a million quads...


Are you sure that the problem is the number of triangles, and not the rays that miss, which probably cause a number of useless loop iterations in your shaders?

Yann L wrote the following in an older post (to a related question):

[color="#1C2837"]* Don't use display lists. Use VBOs with batch sizes of around 50k to 60k, using unsigned short indices (in index arrays).

[color="#1C2837"]* Don't use triangle strips, but indexed triangle lists instead.

[color="#1C2837"]* Don't call glFlush


We can debate the previous statements; I don't know in which context Yann said that. DLs are still faster than VBOs on NV hardware, but only for static geometry. Whenever I use triangle strips they are indexed, which enables efficient rendering with minimal index buffers. Generally, it is hard to generate optimized indices for triangle strips, which is why indexed triangle lists are usually proposed as the best solution: they have much larger index buffers, but are easier to create. glFlush should be used whenever you need a synchronization point. Even sync objects require it, because you'll be blocked if you don't force execution of the command list. But it is certainly suboptimal to flush the command buffer more frequently than you need.



A larger index buffer size is not a big deal. The advantage of TRIANGLE_LIST is that you would probably need a single draw call (or a few), while with TRIANGLE_STRIP you need an enormous number of draw calls. The only way to reduce draw calls with TRIANGLE_STRIP is to use primitive restart or some null triangles. I would like to see a benchmark on these.

glFlush seems to be something that people use randomly; they don't know what it is for. Why not let the driver deal with it automatically? Once I used it after every X draw calls and performance got worse. I'm pretty sure you should NEVER use it. Let the driver flush things on its own.
Sig: http://glhlib.sourceforge.net
an open source GLU replacement library. Much more modern than GLU.
float matrix[16], inverse_matrix[16];
glhLoadIdentityf2(matrix);
glhTranslatef2(matrix, 0.0, 0.0, 5.0);
glhRotateAboutXf2(matrix, angleInRadians);
glhScalef2(matrix, 1.0, 1.0, -1.0);
glhQuickInvertMatrixf2(matrix, inverse_matrix);
glUniformMatrix4fv(uniformLocation1, 1, GL_FALSE, matrix);
glUniformMatrix4fv(uniformLocation2, 1, GL_FALSE, inverse_matrix);

A larger index buffer size is not a big deal. The advantage of TRIANGLE_LIST is that you would probably need a single draw call (or a few), while with TRIANGLE_STRIP you need an enormous number of draw calls. The only way to reduce draw calls with TRIANGLE_STRIP is to use primitive restart or some null triangles. I would like to see a benchmark on these.

The usage of degenerate triangles or primitive restart is almost a necessity. Personally, I think primitive restart is more efficient than degenerate triangles, but this has to be proved by more extensive experiments. What I can say for sure is that glDrawElements() with primitive restart is significantly more efficient than the equivalent glMultiDrawElements() in terms of CPU time (not GPU time).


glFlush seems to be something that people use randomly; they don't know what it is for. Why not let the driver deal with it automatically? Once I used it after every X draw calls and performance got worse. I'm pretty sure you should NEVER use it. Let the driver flush things on its own.

I don't understand why you used glFlush() after every X commands. I thought I was clear: glFlush() should be used for synchronization purposes only. Several years ago I used glFlush() only before SwapBuffers(), in order to ensure the command buffer was flushed, but then I realized that SwapBuffers() implicitly calls glFlush(). But NEVER is too long a time.

The usage of degenerate triangles or primitive restart is almost a necessity. Personally, I think primitive restart is more efficient than degenerate triangles, but this has to be proved by more extensive experiments. What I can say for sure is that glDrawElements() with primitive restart is significantly more efficient than the equivalent glMultiDrawElements() in terms of CPU time (not GPU time).


I don't understand why you used glFlush() after every X commands. I thought I was clear: glFlush() should be used for synchronization purposes only. Several years ago I used glFlush() only before SwapBuffers(), in order to ensure the command buffer was flushed, but then I realized that SwapBuffers() implicitly calls glFlush(). But NEVER is too long a time.


glMultiDrawElements() should not have been approved. Primitive restart went into core in one of the GL versions (I think 3.1; check the spec if you want to be sure), and GL ends up with a more confusing API than ever, with a ton of ways to achieve the same thing. It is a shame.

What do you want to synchronize? Why would you want to synchronize? Why can't GL synchronize automatically?
Well of course, you don't need to call glFlush() before calling SwapBuffers. That has turned into superstition from all the unofficial tutorials out there.
It is sort of like the case of creating a texture and calling glTexEnv, which has also turned into a popular myth thanks to all the tutorials out there.
What do you want to synchronize? Why would you want to synchronize? Why can't GL synchronize automatically?


Something went wrong with my previous post; a significant amount of text is missing. Generally, I have problems with posting on GameDev using my favorite web browser, Opera. :(


I'll paraphrase the missing part. Well, imagine the situation in which you have to synchronize two threads using OpenGL through two contexts in the same sharing group (remember our previous correspondence). In the first thread you need to wait for something to be finished in the other, so you are executing glWaitSync()/glClientWaitSync(). In the other thread, when the task is finished, you have to signal that with glFenceSync(). glFenceSync() is just a command that is stored in a command queue/buffer. If the buffer is not full after adding the new command, it will not be automatically flushed. So the first thread might potentially wait forever. That's why you should execute glFlush() right after glFenceSync().

This topic is closed to new replies.
