Rendering a one million triangle semi-static mesh


Something went wrong with my previous post. A significant amount of text is missing. Generally, I have problems posting on GameDev with my favorite Web browser, Opera. :(

I'll paraphrase the missing part. Well, imagine a situation in which you have to synchronize two threads using OpenGL through two contexts in the same share group (remember our previous correspondence). In the first thread you need to wait for something to finish in the other, so you execute glWaitSync()/glClientWaitSync(). In the other thread, when the task is finished, you signal that with glFenceSync(). glFenceSync() is just a command that is stored in a command queue/buffer. If the buffer is not full after adding the new command, it will not be automatically flushed, so the first thread might potentially wait forever. That's why you should execute glFlush() right after glFenceSync().
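A minimal sketch of that pattern, assuming GL 3.2 contexts in the same share group. Handing the GLsync object between threads needs CPU-side synchronization of its own, which is omitted here; do_gl_work() is a placeholder, not a real call.

// Thread B (producer context): do the work, fence it, then flush.
GLsync fence;
do_gl_work();
fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush(); // push the fence out of the command buffer, or the waiter may block forever

// Thread A (consumer context): wait on the fence before using the result.
glWaitSync(fence, 0, GL_TIMEOUT_IGNORED); // server-side wait: stalls the GPU, not the CPU
// ...or a client-side wait with a one-second timeout:
GLenum result = glClientWaitSync(fence, 0, 1000000000);
if (result == GL_TIMEOUT_EXPIRED) { /* handle the timeout */ }
glDeleteSync(fence);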


I've never used glFenceSync() and glWaitSync()/glClientWaitSync(); they seem to have been introduced in GL 3.2.
Anyway, why doesn't glFenceSync() sync it?
I've never used glFenceSync() and glWaitSync()/glClientWaitSync(); they seem to have been introduced in GL 3.2.


It seems you are stuck in the pre-OpenGL 3 era. ;) Just kidding!

Anyway, why doesn't glFenceSync() sync it?


Should I forward your request to the ARB? :)


Well, glClientWaitSync() can flush the command buffer by using GL_SYNC_FLUSH_COMMANDS_BIT, but that is only useful if glFenceSync() was issued in the same context. Why it is implemented there, and not on glFenceSync(), don't ask me.
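For completeness, a single-context sketch of that flag, assuming the fence was created in the same context that waits on it:

GLsync s = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
// GL_SYNC_FLUSH_COMMANDS_BIT makes the wait flush this context's own queue,
// so no explicit glFlush() is needed in this case.
GLenum r = glClientWaitSync(s, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000);
glDeleteSync(s);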

DLs are still faster on NV hardware than VBOs, but only for static geometry.

To quote your own reply to V-man above: "It seems you are stuck in the pre-OpenGL 3 era. ;) Just kidding!"

Display lists are legacy. It is irrelevant whether they are faster (which is highly debatable); they are deprecated and should not be used in non-legacy code. While both NV and AMD have promised to continue supporting them in the compatibility profile, there is absolutely no guarantee that they will be optimized as much as technically possible in future drivers. VBOs, however, are guaranteed to remain the optimized path for submitting geometry.

Oh and while I can't go into details for legal reasons, at least one major GPU manufacturer applies a heuristic during DL compilation which will transform larger geometry chunks from a DL into an internal VBO... So much for DLs being faster. Usually the difference comes from the usage of non-optimal data formats and alignment in VBOs.
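For reference, a minimal sketch of the static-VBO path being advocated here, assuming a GL 3.x context with a VAO already bound, and verts/vertex_count defined elsewhere:

GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(verts), verts, GL_STATIC_DRAW); // static geometry hint
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void *)0);
glEnableVertexAttribArray(0);
glDrawArrays(GL_TRIANGLES, 0, vertex_count);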


Whenever I use triangle strips they are indexed; that enables efficient rendering with minimal index buffers. Generally, it is hard to generate optimized indices for triangle strips, which is why indexed triangle lists are usually proposed as the best solution. They have much larger index buffers, but are easier to create.

It is obviously implied that indexed tri-lists are preprocessed for maximal vertex cache efficiency. Comparing indexed strips (which require pre-processing) to "easy", raw, random and cache-thrashing tri-list data is naive. As always, the real world is a bit more complicated. Performance comparisons between unindexed tri-strips, indexed tri-strips and indexed tri-lists depend tremendously on the mesh topology, vertex shader complexity, the depth and strategy of the vertex caches (i.e., on the GPU), pipeline bottlenecks and many more factors. Given common closed models with typically connected topology (i.e., NOT terrains or similar regular quad-grids), running on PC consumer hardware with deep caches, properly preprocessed tri-lists often allow for much better cache usage. Embedded devices, on the other hand, tend to prefer unindexed tri-strips.
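To make the index-buffer size difference concrete, a toy example with a single quad (two triangles over four shared vertices); the winding shown is illustrative only:

GLushort strip_indices[4] = { 0, 1, 2, 3 };       // indexed strip: 4 indices
GLushort list_indices[6]  = { 0, 1, 2, 2, 1, 3 }; // indexed list: 6 indices
// With the corresponding index buffer bound:
glDrawElements(GL_TRIANGLE_STRIP, 4, GL_UNSIGNED_SHORT, 0);
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, 0);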


Why it is implemented there, and not on glFenceSync(), don't ask me.

Because implicit flushing (or calling glFlush) directly after glFenceSync is inefficient if the issuing thread has any other work to do after setting the fence. In that case, it is better to continue issuing commands into the FIFO after glFenceSync (possibly with a glFlush at some later time) in order to avoid stalling.
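In code, the alternative being described looks roughly like this (issue_more_work() is a placeholder):

GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
issue_more_work(); // keep feeding the FIFO instead of flushing immediately
glFlush();         // one later flush covers the fence and everything after it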


It seems you are stuck in the pre-OpenGL 3 era. ;) Just kidding!

There are very few scenarios where these synchronisation primitives actually increase performance, while there are tons of scenarios where they will make your performance worse. Top of the line next-gen 3D engines can be written without ever using any of these primitives.
Oh and while I can't go into details for legal reasons, at least one major GPU manufacturer applies a heuristic during DL compilation which will transform larger geometry chunks from a DL into an internal VBO... So much for DLs being faster. Usually the difference comes from the usage of non-optimal data formats and alignment in VBOs.


I'm not using DLs any more, but when I tried to benchmark their performance for a debate on this forum a year ago, I got higher performance on NV hardware/drivers with DLs than with VBOs. The DLs were used just as containers; they held only data, not transformations or states. You are probably talking about the AMD/ATI implementation (I really don't see a reason not to name the manufacturer).

It is obviously implied that indexed tri-lists are preprocessed for maximal vertex cache efficiency. Comparing indexed strips (which require pre-processing) to "easy", raw, random and cache-thrashing tri-list data is naive. As always, the real world is a bit more complicated. Performance comparisons between unindexed tri-strips, indexed tri-strips and indexed tri-lists depend tremendously on the mesh topology, vertex shader complexity, the depth and strategy of the vertex caches (i.e., on the GPU), pipeline bottlenecks and many more factors. Given common closed models with typically connected topology (i.e., NOT terrains or similar regular quad-grids), running on PC consumer hardware with deep caches, properly preprocessed tri-lists often allow for much better cache usage. Embedded devices, on the other hand, tend to prefer unindexed tri-strips.


Agree!


Because implicit flushing (or calling glFlush) directly after glFenceSync is inefficient if the issuing thread has any other work to do after setting the fence. In that case, it is better to continue issuing commands into the FIFO after glFenceSync (possibly with a glFlush at some later time) in order to avoid stalling.


That is quite clear, but V-man asked why glFenceSync() doesn't trigger flushing. In some cases flushing is not useful, but sometimes it is. So, with an additional flag on glFenceSync() we could trigger flushing explicitly. That could be very convenient in the multi-context case.



There are very few scenarios where these synchronisation primitives actually increase performance, while there are tons of scenarios where they will make your performance worse. Top of the line next-gen 3D engines can be written without ever using any of these primitives.

But we need a "weapon" to fight with when such scenarios (where multi-threading is effective) arise. :)

Almost all driver implementations serialize access to the graphics hardware nowadays, but I hope that will change in the foreseeable future. Then synchronization will be crucial for reliable execution as well as for performance. The other important issue is synchronization with other APIs, like CUDA or OpenCL. I also hope that kind of synchronization will become lightweight, because currently it is not: drivers have to execute all pending commands before handing control over to the other API.
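As an illustration of that heavyweight hand-over, the CUDA runtime's GL interop makes the sync point explicit at map/unmap time. A sketch, assuming vbo is an existing GL buffer object and my_kernel a hypothetical CUDA kernel:

#include <cuda_gl_interop.h>

cudaGraphicsResource_t res;
cudaGraphicsGLRegisterBuffer(&res, vbo, cudaGraphicsRegisterFlagsNone);
cudaGraphicsMapResources(1, &res, 0); // forces pending GL work on the buffer to complete
void *dev_ptr; size_t size;
cudaGraphicsResourceGetMappedPointer(&dev_ptr, &size, res);
my_kernel<<<grid, block>>>(dev_ptr);    // hypothetical kernel and launch config
cudaGraphicsUnmapResources(1, &res, 0); // only after this may GL touch the buffer again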

