Rendering a one million triangle semi-static mesh

Hey Gamedev.net forum,

I'm creating some research-related code wherein I render a mesh as a proxy to run some GLSL ray-tracing code on volume data. This mesh used to be a simple cube, but I've grown greedy and want to eliminate superfluous rays which will never hit anything in my volume data. As a first attempt, I've created a tight-fitting mesh around my data, using quads to encapsulate all my voxels. This creates a mesh consisting of approximately half a million quads...

For some reason I was convinced that my setup (2x SLi 280m Geforce cards) would easily push about 1 million triangles without too much hassle. Not so. My question, therefore: is it possible for me to optimize the rendering without too much mesh preprocessing...?

Right now, I throw all the quads in a display list and render that to generate the start and end points for my rays. I've searched around and found various suggestions... Yann L wrote the following in an older post (to a related question):

* Don't use display lists. Use VBOs with batch sizes of around 50k to 60k, using unsigned short indices (in index arrays).
* Don't use triangle strips, but [i]indexed[/i] triangle lists instead.
* Don't call glFlush.

Would this significantly increase performance when rendering a million triangles?
I am also aware that my quads currently have several vertices that overlap without "proper" sharing. But I keep wondering whether anything I do will give a significant enough performance increase. The simple box runs at full speed of course (vsynced to 59 fps), but the million-quad mesh runs at about 12 fps. Not quite the performance I was hoping for.

So again, does anyone have experience with the kind of performance possible with minimal mesh preprocessing? That is, avoiding spatial partitioning.

Regards,
Gazoo

Ps. I should perhaps mention that I will be modifying the mesh quite a bit, hence the reluctance to do too much pre-processing; the interaction should run fairly real-time...

When you say 'quads', do you mean you're drawing with GL_QUADS? That's not supported in hardware and is guaranteed to be very slow.

Using VBOs and indexed tri-lists, I can render 1 million polys at home at 30fps on a GTX260 (with lots of other stuff going on too, like deferred lighting).
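If the quads are kept as four consecutive vertices each, converting them to an indexed triangle list (for glDrawElements with GL_TRIANGLES) is mechanical. A minimal sketch; the function name is my own:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a triangle-list index buffer for `quadCount` consecutive quads.
// Quad (v0, v1, v2, v3) becomes triangles (v0, v1, v2) and (v0, v2, v3).
std::vector<uint32_t> quadsToTriangleIndices(std::size_t quadCount)
{
    std::vector<uint32_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q) {
        const uint32_t base = static_cast<uint32_t>(q * 4);
        indices.insert(indices.end(),
                       { base, base + 1, base + 2,    // first half of the quad
                         base, base + 2, base + 3 }); // second half
    }
    return indices;
}
```

The resulting buffer goes into a GL_ELEMENT_ARRAY_BUFFER; the vertex data itself is unchanged.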

Checked my current app and it happens to render a little over 1 million quads (i.e. 4 million vertices) at about 110fps on a GTX260, though the shaders aren't overly complex. It's far from the most optimized way to submit data:

-No indices (quad vertices aren't generally shared anyway)
-Small batch sizes (hundreds of smallish VBOs, about 1000 binds/draw calls)

I have yet to run into hardware that has trouble with GL_QUADS. Sure, quads might not be accelerated in hardware, but somewhere along the way they will be split into triangles anyway (apparently by simply repeating the third vertex). So for this scenario quads turned out faster, simply by saving 33% of vertex data over a triangle list. Vertex data is compressed to 6 bytes (8 after padding), and worst-case scenarios might have more than 65k vertices. I don't really feel like adding complexity to split VBOs, iterate over lists, barely remove any redundant vertex data, and add another 2-4 bytes per vertex of index data (saving memory is currently a bigger priority).

Though seriously, why ask? Just forget about your project for a day and write a simple test where you push a million quads with a display list, then use a VBO, then compare with triangle lists and an indexed list. If any of that makes a big difference, modify your project.

Hi,

On my GF 8600 GT I can render 1.5 million polys at 26-30 FPS using VBOs and VAOs, though this is only one object: one VAO and 3 VBOs (vertex, normal, tex coord). I render the geometry using glDrawArrays(), since the vertices are in order and the video card just has to iterate through them. And I'm not using any optimizations yet, like triangle strips, occlusion queries or other stuff... The only thing I use is back-face culling. However, there's much room to improve the rendering speed.

There's a good article on improving rendering performance:
[url="http://rastergrid.com/blog/2010/10/gpu-based-dynamic-geometry-lod/"]http://rastergrid.com/blog/2010/10/gpu-based-dynamic-geometry-lod/[/url]

Also, on NVIDIA cards, packed vertices (e.g. normalized bytes) render faster than float vertices.
Here's a benchmark: [url="http://axelynx.googlecode.com/files/test_packed_normals.zip"]http://axelynx.googl...ked_normals.zip[/url]

On my notebook (9600M GT, P8400 2x2.26GHz) this scene (with everything in frame) renders at:
FFP - 218 fps (interleaved arrays)
GL 3.3 forward context:
float - 185 fps
packed - 255 fps

(code: [url="http://code.google.com/p/axelynx/source/browse/trunk/axelynx/axelynx/axelynx/source/CSurface.cpp#402"]http://code.google.c...Surface.cpp#402[/url])

On ATI cards I don't see a difference between packed and unpacked vertex attributes.

Also, with packed vertices, draw calls (glDraw...) take more time than with unpacked ones.
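For reference, "packed (normalized byte)" means storing each component as a signed byte that the GPU re-expands to [-1, 1]. A minimal CPU-side sketch of the conversion (helper names are mine):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// CPU-side view of what a normalized signed-byte attribute stores, i.e.
// what glVertexAttribPointer(..., GL_BYTE, GL_TRUE, ...) expects.
int8_t packSNorm8(float v)
{
    v = std::max(-1.0f, std::min(1.0f, v)); // clamp to [-1, 1]
    return static_cast<int8_t>(std::lround(v * 127.0f));
}

// How the GPU expands the byte back to float: c / 127, clamped at -1.
float unpackSNorm8(int8_t c)
{
    return std::max(-1.0f, static_cast<float>(c) / 127.0f);
}
```

Packing a normal this way shrinks it from 12 bytes to 3 (4 with padding), which is where the bandwidth saving comes from.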

If you are changing your mesh at run time, using a display list is not a good choice. The reason is obvious: [b]you cannot modify a display list![/b] You have to destroy and recreate the DL, and that takes time. VBOs are much better for dynamic data.
For better utilization of the post-transform vertex cache, you should use indexed drawing with appropriate (cache-friendly) indices. If you just modify the positions of the vertices, but not the number or shape of the primitives, you can prepare the indices once and reuse them many times. If your program is not vertex-shader bound, you will see no improvement from optimizing for the vertex cache.

Use as large buffers as possible. That reduces driver "management" time and the number of function calls. Using shorts instead of ints for indices is not important nowadays.
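Since the original post mentions overlapping vertices without "proper" sharing, the usual fix is to weld duplicates while building the index buffer. A minimal sketch (names are mine), assuming shared corners have bitwise-identical coordinates, which voxel-derived geometry gives you for free:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

struct Vertex { float x, y, z; };

// Weld duplicate vertices into a shared, indexed representation. Assumes
// exact float equality is sufficient (no epsilon comparison needed).
void weldVertices(const std::vector<Vertex>& in,
                  std::vector<Vertex>& outVerts,
                  std::vector<uint32_t>& outIndices)
{
    std::map<std::tuple<float, float, float>, uint32_t> seen;
    outVerts.clear();
    outIndices.clear();
    for (const Vertex& v : in) {
        const auto key = std::make_tuple(v.x, v.y, v.z);
        auto it = seen.find(key);
        if (it == seen.end()) {
            it = seen.emplace(key, static_cast<uint32_t>(outVerts.size())).first;
            outVerts.push_back(v);
        }
        outIndices.push_back(it->second); // reuse the existing vertex
    }
}
```

For half a million quads this is a one-off O(n log n) pass, so it fits the "minimal preprocessing" constraint even with a mesh that changes at interactive rates.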

[quote name='Gazoo101' timestamp='1306291013' post='4815413']I'm creating some research related code wherein I am rendering a mesh as a proxy to run some GLSL ray-tracing code on volume data. This mesh used to be a simple cube, but I've grown greedy and want to eliminate superfluous rays which will never hit anything in my volume data. As a first attempt I've created a tight-fitting mesh around my data where I used quads to create a mesh encapsulating all my voxels. This creates a mesh consisting of approximately half a million quads...[/quote]

Are you sure the problem is the number of triangles, and not the misses of your rays, which probably create a number of useless loops in your shaders?

[quote name='Gazoo101' timestamp='1306291013' post='4815413'] Yann L wrote the following in an older post (to a related question):

* Don't use display lists. Use VBOs with batch sizes of around 50k to 60k, using unsigned short indices (in index arrays).
* Don't use triangle strips, but [i]indexed[/i] triangle lists instead.
* Don't call glFlush[/quote]

We can debate those statements. I don't know in which context Yann said that. DLs are still faster than VBOs on NV hardware, but only for static geometry. Whenever I use triangle strips, they are indexed; that enables efficient rendering with minimal index buffers. Generally, it is hard to generate optimized indices for triangle strips, which is why indexed triangle lists are usually proposed as the best solution: they have much larger index buffers, but are easier to create. glFlush should be used whenever you need synchronization. Even sync objects require it, because you will be blocked if you don't force execution of the command list. But it is certainly suboptimal to flush the command buffer more frequently than you need to.

[quote name='Aks9' timestamp='1306347116' post='4815699']
We can debate those statements. I don't know in which context Yann said that. DLs are still faster than VBOs on NV hardware, but only for static geometry. Whenever I use triangle strips, they are indexed; that enables efficient rendering with minimal index buffers. Generally, it is hard to generate optimized indices for triangle strips, which is why indexed triangle lists are usually proposed as the best solution: they have much larger index buffers, but are easier to create. glFlush should be used whenever you need synchronization. Even sync objects require it, because you will be blocked if you don't force execution of the command list. But it is certainly suboptimal to flush the command buffer more frequently than you need to.
[/quote]

A larger index buffer is not a big deal. The advantage of TRIANGLE_LIST is that you would probably need a single draw call (or a few), while with TRIANGLE_STRIP you need an enormous number of draw calls. The only way to reduce draw calls with TRIANGLE_STRIPS is to use primitive restart or some null triangles. I would like to see a benchmark on these.
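To make the comparison concrete, here is a minimal sketch (function name mine) of merging many strips into one index buffer with primitive restart; the GL calls in the comment show how it would then be drawn:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

const uint16_t RESTART_INDEX = 0xFFFF; // conventional for unsigned short indices

// Concatenate several triangle strips into one index buffer, separated by
// the restart index, so everything renders with a single draw call:
//
//   glEnable(GL_PRIMITIVE_RESTART);          // core since GL 3.1
//   glPrimitiveRestartIndex(RESTART_INDEX);
//   glDrawElements(GL_TRIANGLE_STRIP, count, GL_UNSIGNED_SHORT, 0);
std::vector<uint16_t> joinStrips(const std::vector<std::vector<uint16_t>>& strips)
{
    std::vector<uint16_t> joined;
    for (std::size_t i = 0; i < strips.size(); ++i) {
        if (i > 0)
            joined.push_back(RESTART_INDEX); // cut the strip here
        joined.insert(joined.end(), strips[i].begin(), strips[i].end());
    }
    return joined;
}
```

The degenerate-triangle alternative would instead repeat the last index of one strip and the first of the next, costing extra (zero-area) triangles but working on hardware without restart support.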

glFlush seems to be something that people use randomly, without knowing what it is for. Why not let the driver deal with it automatically? Once I used it after every X draw calls and performance got worse. I'm pretty sure you should NEVER use it. Let the driver flush things on its own.

[quote name='V-man' timestamp='1306358168' post='4815774']A larger index buffer size is not a big deal. The advantage of TRIANGLE_LIST is that you would probably need a single draw call (or a few), while with TRIANGLE_STRIP you need an enormous number of draw calls. The only way to reduce draw calls with TRIANGLE_STRIPS is to use primitive restart or some null triangles. I would like to see a benchmark on these.[/quote]

The usage of degenerate triangles or primitive restart is almost a necessity. Personally, I think primitive restart is more efficient than degenerate triangles, but this has to be proved by more extensive experiments. What I can say for sure is that glDrawElements() with primitive restart is significantly more efficient than the equivalent glMultiDrawElements() considering CPU time (not GPU time).

[quote name='V-man' timestamp='1306358168' post='4815774']glFlush seems to be something that people use randomly. They don't know what it is for. Why don't you let the driver automatically deal with it. Once I used it after every X amount of draw calls and performance went worse. I'm pretty sure you should NEVER use it. Let the driver flush things on its own.[/quote]

I don't understand why you used glFlush() after every X commands. I thought I was clear: glFlush() should be used for synchronization purposes only. Several years ago, I used glFlush() only before SwapBuffers(), in order to ensure the command buffer was flushed, but then I realized that SwapBuffers() implicitly calls glFlush(). But NEVER is too long a time.

[quote name='Aks9' timestamp='1306433752' post='4816118']
The usage of degenerate triangles or primitive restart is almost a necessity. Personally, I think primitive restart is more efficient than degenerate triangles, but this has to be proved by more extensive experiments. What I can say for sure is that glDrawElements() with primitive restart is significantly more efficient than the equivalent glMultiDrawElements() considering CPU time (not GPU time).

I don't understand why you used glFlush() after every X commands. I thought I was clear: glFlush() should be used for synchronization purposes only. Several years ago, I used glFlush() only before SwapBuffers(), in order to ensure the command buffer was flushed, but then I realized that SwapBuffers() implicitly calls glFlush(). But NEVER is too long a time.
[/quote]

glMultiDrawElements() should not have been approved. Primitive restart went into core in one of the GL versions (I think 3.2; check the spec if you want to be sure), and GL ends up with a more confusing API than ever, with a ton of ways to achieve the same thing. It is a shame.

What do you want to synchronize? Why would you want to synchronize? Why can't GL synchronize automatically?
Well, of course you don't need to call glFlush() before calling SwapBuffers. That has turned into superstition from all the unofficial tutorials out there.
It is sort of like the case of creating a texture and calling glTexEnv, which has also turned into a popular myth thanks to all the tutorials out there.

[quote name='V-man' timestamp='1306440051' post='4816153']What do you want to synchronize? Why would you want to synchronize? Why can't GL synchronize automatically?[/quote]

Something went wrong with my previous post; a significant amount of text is missing. Generally, I have problems posting on GameDev with my favorite web browser, Opera. :(


I'll paraphrase the missing part. Imagine a situation in which you have to synchronize two threads using OpenGL through two contexts in the same share group (remember our previous correspondence). In the first thread you need to wait for something to finish in the other, so you execute glWaitSync()/glClientWaitSync(). In the other thread, when the task is finished, you signal that with glFenceSync(). glFenceSync() is just a command that is stored in a command queue/buffer. If the buffer is not full after adding the new command, it will not be automatically flushed, so the first thread might potentially wait forever. That's why you should execute glFlush() right after glFenceSync().
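The pattern being described, as a sketch (not runnable on its own; it assumes a GL 3.2+ context per thread, both in the same share group):

```
// Thread A (producer context): after issuing the work, fence it and flush,
// so the fence actually reaches the GPU instead of sitting in the buffer.
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();

// Thread B (consumer context, same share group): wait for the fence.
glWaitSync(fence, 0, GL_TIMEOUT_IGNORED);          // GPU-side wait, or:
// glClientWaitSync(fence, 0, timeoutNanoseconds); // CPU-side wait
glDeleteSync(fence);
```

Without the glFlush in thread A, the fence may never be submitted and thread B's wait can block indefinitely, which is exactly the hazard described above.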

[quote name='Aks9' timestamp='1306443163' post='4816174']
Something went wrong with my previous post; a significant amount of text is missing. Generally, I have problems posting on GameDev with my favorite web browser, Opera. :(

I'll paraphrase the missing part. Imagine a situation in which you have to synchronize two threads using OpenGL through two contexts in the same share group (remember our previous correspondence). In the first thread you need to wait for something to finish in the other, so you execute glWaitSync()/glClientWaitSync(). In the other thread, when the task is finished, you signal that with glFenceSync(). glFenceSync() is just a command that is stored in a command queue/buffer. If the buffer is not full after adding the new command, it will not be automatically flushed, so the first thread might potentially wait forever. That's why you should execute glFlush() right after glFenceSync().
[/quote]

I've never used glFenceSync() and glWaitSync()/glClientWaitSync(); these seem to have been introduced in GL 3.2.
Anyway, why doesn't glFenceSync() sync it?

[quote name='V-man' timestamp='1306452744' post='4816235'] I've never used glFenceSync() and glWaitSync()/glClientWaitSync() and these seem to be introduced in GL 3.2[/quote]

It seems you are stuck in pre-OpenGL3 era. ;) Just kidding!

[quote name='V-man' timestamp='1306452744' post='4816235'] Anyway, why doesn't glFenceSync() sync it?[/quote]

Should I forward your request to ARB? :)


Well, glClientWaitSync() can flush the command buffer via the SYNC_FLUSH_COMMANDS_BIT flag, but that is only useful if glFenceSync() was issued in the same context. Why it is implemented that way, and not on glFenceSync(), don't ask me.

[quote name='Aks9' timestamp='1306347116' post='4815699']
DLs are still faster on NV hardware than VBOs, but only for static geometry.
[/quote]
To quote your own reply to V-man above: [i]"It seems you are stuck in pre-OpenGL3 era. ;) Just kidding!"[/i]

Display lists are legacy. It is irrelevant whether they are faster (which is highly debatable); they are deprecated and should not be used in non-legacy code. While both NV and AMD have promised to continue their support in the compatibility profile, there is absolutely no guarantee that they will be optimized as much as technically possible in future drivers. VBOs, however, are guaranteed to remain the optimal way to submit geometry.

Oh and while I can't go into details for legal reasons, at least one major GPU manufacturer applies a heuristic during DL compilation which will transform larger geometry chunks from a DL into an internal VBO... So much for DLs being faster. Usually the difference comes from the usage of non-optimal data formats and alignment in VBOs.

[quote name='Aks9' timestamp='1306347116' post='4815699']
Whenever I use triangle strips they are indexed. It enables efficient rendering with minimal index buffers. Generally, it is hard to generate optimized indices for triangle strips, that's why indexed triangle lists are usually proposed as the best solution. They have much larger index buffers, but are easier for creation.
[/quote]
It is obviously implied that indexed tri-lists are preprocessed for maximal vertex cache efficiency. Comparing indexed strips (which require pre-processing) to "easy", raw, random and cache-thrashing tri-list data is naive. As always, the real world is a bit more complicated. Performance comparisons between unindexed tri-strips, indexed tri-strips and indexed tri-lists depend tremendously on the mesh topology, vertex shader complexity, the depth and strategy of the vertex caches (i.e. on the GPU), pipeline bottlenecks and many more factors. Given common closed models with typically connected topology (i.e. NOT terrains or similar regular quad-grids), running on PC consumer hardware with deep caches, properly preprocessed tri-lists often allow for much better cache usage. Embedded devices, on the other hand, tend to prefer [b]un[/b]indexed tri-strips.

[quote name='Aks9' timestamp='1306347116' post='4815699']
Why it is implemented so, and not on glFenceSync() don't ask me.
[/quote]
Because implementing implicit flushing (or calling glFlush) directly after glFenceSync is inefficient if the issuing thread has any other work to do after setting the fence. In this case, it is better to continue issuing commands (possibly with a glFlush at some later time) into the fifo after glFenceSync in order to avoid stalling.

[quote name='Aks9' timestamp='1306347116' post='4815699']
It seems you are stuck in pre-OpenGL3 era. ;) Just kidding!
[/quote]
There are very few scenarios where these synchronisation primitives actually increase performance, while there are tons of scenarios where they will make your performance worse. Top of the line next-gen 3D engines can be written without ever using any of these primitives.

[quote name='Yann L' timestamp='1306532051' post='4816571'] Oh and while I can't go into details for legal reasons, at least one major GPU manufacturer applies a heuristic during DL compilation which will transform larger geometry chunks from a DL into an internal VBO... So much for DLs being faster. Usually the difference comes from the usage of non-optimal data formats and alignment in VBOs.[/quote]

I'm not using DLs any more, but when I tried to benchmark their performance for some forum polemics a year ago, I got higher performance on NV hardware/drivers with DLs than with VBOs. The DLs were used just as containers: they held only data, not transformations or states. You are probably talking about the AMD/ATI implementation (I really don't see a reason not to name the manufacturer).

[quote name='Yann L' timestamp='1306532051' post='4816571']It is obviously implied that indexed tri-lists are preprocessed for maximal vertex cache efficiency. Comparing indexed strips (which require pre-processing) to "easy", raw, random and cache-thrashing tri-list data is naive. As always, the real-world is a bit more complicated. Performance statistics comparing unindexed tri-strips, indexed tri-strips and indexed tri-lists tremendously depend on the mesh topology, vertex shader complexity, deepness and strategy of the vertex caches (ie. on the GPU), pipeline bottlenecks and many more factors. Given common closed models with typically connected topology (ie. NOT terrains or similar regular quad-grids), running on PC consumer hardware with deep caches, properly preprocessed tri-lists often allow for a much better cache usage. Embedded devices, on the other hand, tend to prefer [b]un[/b]indexed tri-strips.[/quote]

Agreed!


[quote name='Yann L' timestamp='1306532051' post='4816571']Because implementing implicit flushing (or calling glFlush) directly after glFenceSync is inefficient if the issuing thread has any other work to do after setting the fence. In this case, it is better to continue issuing commands (possibly with a glFlush at some later time) into the fifo after glFenceSync in order to avoid stalling.[/quote]

That is quite clear, but V-man asked why glFenceSync() doesn't trigger flushing. In some cases that is not useful, but sometimes it is. So, with an additional flag on glFenceSync(), we could trigger flushing. That could be very convenient in the multi-context case.



[quote name='Yann L' timestamp='1306532051' post='4816571']There are very few scenarios where these synchronisation primitives actually increase performance, while there are tons of scenarios where they will make your performance worse. Top of the line next-gen 3D engines can be written without ever using any of these primitives.[/quote]
But we need a "weapon" to fight with when such scenarios (where multi-threading is effective) arise. :)

Almost all driver implementations serialize access to the graphics hardware nowadays, but I hope that will change in the foreseeable future. Then synchronization will be crucial for reliable software execution as well as for performance. The other important issue is synchronization with other APIs, like CUDA or OpenCL. I also hope this kind of synchronization will become lightweight, because currently it is not: drivers have to execute all pending commands before switching to the other API.
