Rendering a one million triangle semi-static mesh


Something went wrong with my previous post. A significant amount of text is missing. Generally, I have problems posting on GameDev with my favorite Web browser, Opera. :(

I'll paraphrase the missing part. Well, imagine a situation in which you have to synchronize two threads using OpenGL through two contexts in the same share group (remember our previous correspondence). In the first thread you need to wait for something to finish in the other, so you execute glWaitSync()/glClientWaitSync(). In the other thread, when the task is finished, you signal that with glFenceSync(). glFenceSync() is just a command that is stored in a command queue/buffer. If the buffer is not full after adding the new command, it will not be automatically flushed, so the first thread might potentially wait forever. That's why you should execute glFlush() right after glFenceSync().
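A minimal sketch of that pattern, assuming GL 3.2 contexts in the same share group. Handing the GLsync object between threads needs CPU-side synchronization of its own, which is omitted here; do_gl_work() is a placeholder, not a real call.

// Thread B (producer context): do the work, fence it, then flush.
GLsync fence;
do_gl_work();
fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush(); // push the fence out of the command buffer, or the waiter may block forever

// Thread A (consumer context): wait on the fence before using the result.
glWaitSync(fence, 0, GL_TIMEOUT_IGNORED); // server-side wait: stalls the GPU, not the CPU
// ...or a client-side wait with a one-second timeout:
GLenum result = glClientWaitSync(fence, 0, 1000000000);
if (result == GL_TIMEOUT_EXPIRED) { /* handle the timeout */ }
glDeleteSync(fence);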


I've never used glFenceSync() and glWaitSync()/glClientWaitSync(); they seem to have been introduced in GL 3.2.
Anyway, why doesn't glFenceSync() sync it?
I've never used glFenceSync() and glWaitSync()/glClientWaitSync(); they seem to have been introduced in GL 3.2.


It seems you are stuck in the pre-OpenGL 3 era. ;) Just kidding!

Anyway, why doesn't glFenceSync() sync it?


Should I forward your request to the ARB? :)


Well, glClientWaitSync() can flush the command buffer by using GL_SYNC_FLUSH_COMMANDS_BIT, but that is only useful if glFenceSync() was issued in the same context. Why it is implemented there, and not on glFenceSync(), don't ask me.
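For completeness, a single-context sketch of that flag, assuming the fence was created in the same context that waits on it:

GLsync s = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
// GL_SYNC_FLUSH_COMMANDS_BIT makes the wait flush this context's own queue,
// so no explicit glFlush() is needed in this case.
GLenum r = glClientWaitSync(s, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000);
glDeleteSync(s);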

DLs are still faster on NV hardware than VBOs, but only for static geometry.

To quote your own reply to V-man above: "It seems you are stuck in the pre-OpenGL 3 era. ;) Just kidding!"

Display lists are legacy. It is irrelevant whether they are faster (which is highly debatable); they are deprecated and should not be used in non-legacy code. While both NV and AMD have promised to continue supporting them in the compatibility profile, there is absolutely no guarantee that they will be optimized as much as technically possible in future drivers. VBOs, however, are guaranteed to remain the optimized path for submitting geometry.

Oh and while I can't go into details for legal reasons, at least one major GPU manufacturer applies a heuristic during DL compilation which will transform larger geometry chunks from a DL into an internal VBO... So much for DLs being faster. Usually the difference comes from the usage of non-optimal data formats and alignment in VBOs.
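For reference, a minimal sketch of the static-VBO path being advocated here, assuming a GL 3.x context with a VAO already bound, and verts/vertex_count defined elsewhere:

GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(verts), verts, GL_STATIC_DRAW); // static geometry hint
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void *)0);
glEnableVertexAttribArray(0);
glDrawArrays(GL_TRIANGLES, 0, vertex_count);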


Whenever I use triangle strips they are indexed; that enables efficient rendering with minimal index buffers. Generally, it is hard to generate optimized indices for triangle strips, which is why indexed triangle lists are usually proposed as the best solution. They have much larger index buffers, but are easier to create.

It is obviously implied that indexed tri-lists are preprocessed for maximal vertex cache efficiency. Comparing indexed strips (which require pre-processing) to "easy", raw, random and cache-thrashing tri-list data is naive. As always, the real world is a bit more complicated. Performance comparisons between unindexed tri-strips, indexed tri-strips and indexed tri-lists depend tremendously on the mesh topology, vertex shader complexity, the depth and strategy of the vertex caches (i.e., on the GPU), pipeline bottlenecks and many more factors. Given common closed models with typically connected topology (i.e., NOT terrains or similar regular quad-grids), running on PC consumer hardware with deep caches, properly preprocessed tri-lists often allow for much better cache usage. Embedded devices, on the other hand, tend to prefer unindexed tri-strips.
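To make the index-buffer size difference concrete, a toy example with a single quad (two triangles over four shared vertices); the winding shown is illustrative only:

GLushort strip_indices[4] = { 0, 1, 2, 3 };       // indexed strip: 4 indices
GLushort list_indices[6]  = { 0, 1, 2, 2, 1, 3 }; // indexed list: 6 indices
// With the corresponding index buffer bound:
glDrawElements(GL_TRIANGLE_STRIP, 4, GL_UNSIGNED_SHORT, 0);
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, 0);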


Why it is implemented there, and not on glFenceSync(), don't ask me.

Because implicit flushing (or calling glFlush) directly after glFenceSync is inefficient if the issuing thread has any other work to do after setting the fence. In that case, it is better to continue issuing commands into the FIFO after glFenceSync (possibly with a glFlush at some later time) in order to avoid stalling.
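In code, the alternative being described looks roughly like this (issue_more_work() is a placeholder):

GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
issue_more_work(); // keep feeding the FIFO instead of flushing immediately
glFlush();         // one later flush covers the fence and everything after it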


It seems you are stuck in the pre-OpenGL 3 era. ;) Just kidding!

There are very few scenarios where these synchronisation primitives actually increase performance, while there are tons of scenarios where they will make your performance worse. Top of the line next-gen 3D engines can be written without ever using any of these primitives.
Oh and while I can't go into details for legal reasons, at least one major GPU manufacturer applies a heuristic during DL compilation which will transform larger geometry chunks from a DL into an internal VBO... So much for DLs being faster. Usually the difference comes from the usage of non-optimal data formats and alignment in VBOs.


I'm not using DLs any more, but when I tried to benchmark their performance for a debate on this forum a year ago, I got higher performance on NV hardware/drivers with DLs than with VBOs. The DLs were used just as containers; they held only data, not transformations or states. You are probably talking about the AMD/ATI implementation (I really don't see a reason not to name the manufacturer).

It is obviously implied that indexed tri-lists are preprocessed for maximal vertex cache efficiency. Comparing indexed strips (which require pre-processing) to "easy", raw, random and cache-thrashing tri-list data is naive. As always, the real world is a bit more complicated. Performance comparisons between unindexed tri-strips, indexed tri-strips and indexed tri-lists depend tremendously on the mesh topology, vertex shader complexity, the depth and strategy of the vertex caches (i.e., on the GPU), pipeline bottlenecks and many more factors. Given common closed models with typically connected topology (i.e., NOT terrains or similar regular quad-grids), running on PC consumer hardware with deep caches, properly preprocessed tri-lists often allow for much better cache usage. Embedded devices, on the other hand, tend to prefer unindexed tri-strips.


Agree!


Because implicit flushing (or calling glFlush) directly after glFenceSync is inefficient if the issuing thread has any other work to do after setting the fence. In that case, it is better to continue issuing commands into the FIFO after glFenceSync (possibly with a glFlush at some later time) in order to avoid stalling.


That is quite clear, but V-man asked why glFenceSync() doesn't trigger flushing. In some cases flushing is not useful, but sometimes it is. So, with an additional flag on glFenceSync() we could trigger flushing explicitly. That could be very convenient in the multi-context case.



There are very few scenarios where these synchronisation primitives actually increase performance, while there are tons of scenarios where they will make your performance worse. Top of the line next-gen 3D engines can be written without ever using any of these primitives.

But we need a "weapon" to fight with when such scenarios (where multi-threading is effective) arise. :)

Almost all driver implementations serialize access to the graphics hardware nowadays, but I hope that will change in the foreseeable future. Then synchronization will be crucial for reliable execution as well as for performance. The other important issue is synchronization with other APIs, like CUDA or OpenCL. I also hope that kind of synchronization will become lightweight, because currently it is not: drivers have to execute all pending commands before handing control over to the other API.
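As an illustration of that heavyweight hand-over, the CUDA runtime's GL interop makes the sync point explicit at map/unmap time. A sketch, assuming vbo is an existing GL buffer object and my_kernel a hypothetical CUDA kernel:

#include <cuda_gl_interop.h>

cudaGraphicsResource_t res;
cudaGraphicsGLRegisterBuffer(&res, vbo, cudaGraphicsRegisterFlagsNone);
cudaGraphicsMapResources(1, &res, 0); // forces pending GL work on the buffer to complete
void *dev_ptr; size_t size;
cudaGraphicsResourceGetMappedPointer(&dev_ptr, &size, res);
my_kernel<<<grid, block>>>(dev_ptr);    // hypothetical kernel and launch config
cudaGraphicsUnmapResources(1, &res, 0); // only after this may GL touch the buffer again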

