hellraiser

OpenGL Performance issues rendering triangles vs tri-strips


Hello all,

I've just converted a class that generated a triangle-strip mesh of a skydome to generating it with triangles. I did this as I've read a couple of articles that state that rendering triangles is slightly faster than strips because the GPU is able to take advantage of its fast vertex cache. Also, numerous posts here on GameDev by many gurus state just the same.

However, after converting the class I tested it with the old tri-strip (a) and triangle (b) skydome meshes hoping I would get an increase in FPS (if only small.)

Results:

a) Tri-strip
================================================
Viewport: (0, 0, 1024, 768)
Run time: 36470ms , ~36s
Total frames: 29984
Highest frame rate: 865
Lowest frame rate: 759
Average frame rate: 832
b) Triangle mesh
================================================
Viewport: (0, 0, 1024, 768)
Run time: 90693ms , ~90s
Total frames: 67752
Highest frame rate: 780
Lowest frame rate: 692
Average frame rate: 752
The two test programs are release builds, were run full-screen at 1024x768 and render about 20000 triangles, though the skydome mesh alone consists of only 5180 triangles (5184 tri-strip elements in a; 15540 vertices in b).

I let test b) run for longer because I couldn't believe the (significant) drop in FPS and was hoping for some miracle to happen...

My graphics card is an ATI Mobility Radeon x700 (128MB, PCIe). What could be the reason for the drop in FPS?

<edit>
The new skydome generating algorithm is in essence the same as it was before, when it generated a tri-strip mesh. The only differences now are that the vertex buffer is larger so as to accommodate all the triangle vertices of the skydome ((vertices-2)*3), and that each vertex is stored at every 3rd position in the vertex buffer after the first 3 elements (vertexbuffer[n*3]=triangleVertice, n>3, 1 being the lowest index). Then I iterate through the vertex buffer to finalize the triangles by using OpenGL's rules when rendering triangle_strips {odd=(n,n+1,n+2);even=(n+1,n,n+2)}.

All this to say that the algorithm isn't suffering from some lack of floating-point precision, because two vertices for every triangle in the mesh are shared between adjacent triangles. Therefore, the GPU's vertex cache should be kicking in and I shouldn't be seeing a decrease in FPS.
</edit>

[Edited by - hellraiser on September 18, 2007 6:32:32 PM]
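For readers following along, here is a minimal sketch of the expansion described in the edit above. The Vertex struct and the name expandStrip are hypothetical (the original class isn't shown), and the code counts triangles from 0, so the post's "odd/even" rule (which counts from 1) appears flipped.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vertex layout; the poster's actual format is not shown.
struct Vertex {
    float x, y, z;
};

// Expand a triangle strip into an unindexed triangle list.
// Triangle i (0-based) uses strip vertices i, i+1, i+2; every second
// triangle swaps its first two vertices so the winding stays consistent.
std::vector<Vertex> expandStrip(const std::vector<Vertex>& strip) {
    std::vector<Vertex> list;
    if (strip.size() < 3) {
        return list;  // not enough vertices for a single triangle
    }
    list.reserve((strip.size() - 2) * 3);
    for (std::size_t i = 0; i + 2 < strip.size(); ++i) {
        if (i % 2 == 0) {
            list.push_back(strip[i]);
            list.push_back(strip[i + 1]);
        } else {
            list.push_back(strip[i + 1]);
            list.push_back(strip[i]);
        }
        list.push_back(strip[i + 2]);
    }
    // The list holds (vertices - 2) * 3 entries, as stated in the post.
    assert(list.size() == (strip.size() - 2) * 3);
    return list;
}
```

Note that every shared vertex ends up stored as a separate copy, which is why 5180 triangles turn into 15540 vertices.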

That's an insignificant difference in time.

865 fps = 1.15ms per frame

780 fps = 1.28ms per frame


that's a difference of about 0.13ms (i.e. roughly one ten-thousandth of a second).

i.e. there is effectively no difference in framerate.
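
The conversion used here is just 1000 / fps. A small sketch, with hypothetical function names, in case anyone wants to run the same comparison on their own numbers:

```cpp
#include <cassert>

// Time spent on one frame, in milliseconds, for a given frame rate.
double frameTimeMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Extra time per frame when going from the faster rate to the slower one.
double frameTimeDeltaMs(double fasterFps, double slowerFps) {
    return frameTimeMs(slowerFps) - frameTimeMs(fasterFps);
}
```

Comparing the average rates from the first post (832 vs 752) instead of the peak rates gives a gap of roughly the same size, about 0.13ms per frame.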

-me

Quote:
Original post by Palidine
That's an insignificant difference in time.

865 fps = 1.15ms per frame

780 fps = 1.28ms per frame


that's a difference of about 0.13ms (i.e. roughly one ten-thousandth of a second).

i.e. there is effectively no difference in framerate.

-me


Very much true but what's troubling me is the decrease in FPS in the first place. What happens when my scenes grow in complexity and likewise my graphics engine does too? Should I now maybe be thinking about changing algorithmic strategies and focus more on generating tri-strip meshes rather than triangles? Is this an isolated issue related to my graphics card alone?

I mean, I've got so many questions right now and no answers that it's making me doubt everything I've done so far in my modest graphics engine.

Thanks for your reply. :-)


PS: I've edited my original post and added some more info at the bottom.

You're prematurely optimizing. There's no point in optimizing like this because it isn't a bottleneck in your application. It currently makes absolutely no difference to the performance of your application whether you use strips or meshes. Therefore it doesn't matter (for now) what you choose.

Only when your application grows, and you detect through profiling that you need to revisit this decision is it a good time to optimize this part of your game.

-me

Quote:
Original post by hellraiser
<edit>
The new skydome generating algorithm is in essence the same as it was before, when it generated a tri-strip mesh. The only differences now are that the vertex buffer is larger so as to accommodate all the triangle vertices of the skydome ((vertices-2)*3), and that each vertex is stored at every 3rd position in the vertex buffer after the first 3 elements (vertexbuffer[n*3]=triangleVertice, n>3, 1 being the lowest index). Then I iterate through the vertex buffer to finalize the triangles by using OpenGL's rules when rendering triangle_strips {odd=(n,n+1,n+2);even=(n+1,n,n+2)}.

All this to say that the algorithm isn't suffering from some lack of floating-point precision because two vertices for every triangle in the mesh are shared between adjacent triangles. Therefore, the GPU's vertex cache should be kicking in and I shouldn't be seeing a decrease in FPS.
</edit>


I'm slightly confused by this statement: "The only differences now are that the vertex buffer is larger so as to accommodate all the triangle vertices of the skydome ((vertices-2)*3)."

The vertex buffer shouldn't need to be any bigger; an indexed triangle list covering the same mesh uses the same amount of vertex data as a triangle strip does. The only difference is that it uses more index data.

So, for two triangles sharing an edge, both methods would have 4 vertices defined; however, the triangle strip would have an index buffer of [0,1,2,3] and the triangle list would have an index buffer of [0,1,2,0,2,3].

The fact you don't mention an index buffer in any of your posts makes me doubt you are even using one; you should.
Simply setting positional information the same isn't enough to make use of the post-T&L cache; at data lookup time, without an index, the GPU has no way of knowing that the data at position 4 is the same as the data at position 0. What the index list does is allow the GPU to say 'I know this data is the same, therefore I can use this stored result'.

I suspect you are rendering with glDrawArrays(), which is the slowest of the vertex array functions (well, of the ones which don't pick the data one vertex at a time, anyway). You should be using glDrawElements() or glDrawRangeElements(); these are MUCH faster due to the use of the index buffer. (I don't have the results to hand right now, but I'm pretty sure that in a vertex-shader-heavy scene I was seeing a ~10x improvement between glDrawArrays and glDrawElements for the data submission.)

In short;
- You need to use indices
- You don't need to generate more data
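
To make the "indices instead of more data" point concrete, here is a rough sketch that leaves the strip's vertices untouched and only generates triangle-list indices for them. The function name is hypothetical, and it assumes the vertices are stored in strip order; for 4 vertices it yields [0,1,2,2,1,3], which differs from the quad example above only because that example assumes a different vertex ordering.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Generate triangle-list indices for `vertexCount` vertices laid out in
// triangle-strip order. The vertex data itself is left untouched.
std::vector<std::uint32_t> stripToListIndices(std::uint32_t vertexCount) {
    std::vector<std::uint32_t> indices;
    if (vertexCount < 3) {
        return indices;  // no complete triangle
    }
    const std::size_t expected = static_cast<std::size_t>(vertexCount - 2) * 3;
    indices.reserve(expected);
    for (std::uint32_t i = 0; i + 2 < vertexCount; ++i) {
        // Swap the first two indices on every second triangle to keep
        // the same winding the strip would have produced.
        if (i % 2 == 0) {
            indices.push_back(i);
            indices.push_back(i + 1);
        } else {
            indices.push_back(i + 1);
            indices.push_back(i);
        }
        indices.push_back(i + 2);
    }
    assert(indices.size() == expected);
    return indices;
}
```

The resulting array is the kind of data you would hand to glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_INT, ...) while the vertex buffer stays the size it was for the strip. If the mesh has fewer than 65536 vertices, 16-bit indices (GL_UNSIGNED_SHORT) would halve the index data.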

Quote:
I'm slightly confused by this statement: "The only differences now are that the vertex buffer is larger so as to accommodate all the triangle vertices of the skydome ((vertices-2)*3)."

The vertex buffer shouldn't need to be any bigger; an indexed triangle list covering the same mesh uses the same amount of vertex data as a triangle strip does. The only difference is that it uses more index data.

So, for two triangles sharing an edge, both methods would have 4 vertices defined; however, the triangle strip would have an index buffer of [0,1,2,3] and the triangle list would have an index buffer of [0,1,2,0,2,3].

The fact you don't mention an index buffer in any of your posts makes me doubt you are even using one; you should.


You are absolutely right; I'm not! :) That's why I've expanded the vertex buffer to store 3 vertices/triangle. I can now see what an idiot I was.

Quote:
Simply setting positional information the same isn't enough to make use of the post-T&L cache; at data look up time, without an index, the GPU has no way of knowing that the data at position 4 is the same as the data at position 0. What the index list does is allow the GPU to say 'I know this data is the same, therefore I can use this stored result'.

So that's how the vertex cache works...

In all honesty I always thought using index lists was an unnecessary waste of bandwidth, but then again I never quite understood the benefits from using them in the first place.

Quote:
I suspect you are rendering with glDrawArrays() [...]

Again, right on the money!

Quote:
[...] which is the slowest of the vertex array functions (well, the ones which don't pick the data one vertex at a time anyways), you should be using glDrawElements() or glDrawRangeElements(), these are MUCH faster due to the use of the index buffer (I don't have the results to hand right now, but I'm pretty sure in a vertex shader heavy scene I was seeing ~10x improvement between glDrawArrays and glDrawElements for the data submission).

In short;
- You need to use indices
- You don't need to generate more data

Thank you ever so much for the eye opener. There's not much I can say but to slap myself on the wrist...

You have no idea how helpful your post was to me, Phantom! Thanks again!

[Edited by - hellraiser on September 18, 2007 8:31:27 PM]

Quote:
Original post by hellraiser
I've just converted a class that generated a triangle-strip mesh of a skydome to generating it with triangles. I did this as I've read a couple of articles that state that rendering triangles is slightly faster than strips because the GPU is able to take advantage of its fast vertex cache. Also, numerous posts here on GameDev by many gurus state just the same.
Who said that, and where? In the last months/years, I haven't heard anyone claim there's a meaningful speed gain.

The vertex cache is independent of the primitive type. What's probably happening is that your tri-strips are so long they thrash your vcache.
Generating vcache-aware strips for arbitrary geometry isn't trivial.

This is just to make clear that this tri-over-strip stuff is a myth.
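
As a rough illustration of the thrashing point, the sketch below counts vertex transforms for a grid drawn row by row, once with short rows and once with rows much longer than the cache. Everything here is an assumption made for the example: the LRU replacement policy, the 16-entry cache size used in the usage note, and all function names. Real hardware may behave differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Count cache misses (vertex transforms) for an index stream, assuming a
// simple LRU post-transform cache with `cacheSize` entries.
std::size_t countTransforms(const std::vector<std::uint32_t>& indices,
                            std::size_t cacheSize) {
    assert(cacheSize > 0);
    std::vector<std::uint32_t> cache;  // front = most recently used
    std::size_t misses = 0;
    for (std::uint32_t idx : indices) {
        auto it = std::find(cache.begin(), cache.end(), idx);
        if (it != cache.end()) {
            cache.erase(it);  // hit: will be re-inserted at the front
        } else {
            ++misses;
            if (cache.size() == cacheSize) {
                cache.pop_back();  // evict the least recently used entry
            }
        }
        cache.insert(cache.begin(), idx);
    }
    return misses;
}

// Triangle-list indices for a grid of quads, emitted one row at a time.
std::vector<std::uint32_t> gridIndices(std::uint32_t quadsWide,
                                       std::uint32_t quadsHigh) {
    std::vector<std::uint32_t> out;
    const std::uint32_t stride = quadsWide + 1;  // vertices per row
    for (std::uint32_t r = 0; r < quadsHigh; ++r) {
        for (std::uint32_t c = 0; c < quadsWide; ++c) {
            const std::uint32_t v00 = r * stride + c;
            const std::uint32_t v01 = v00 + 1;
            const std::uint32_t v10 = v00 + stride;
            const std::uint32_t v11 = v10 + 1;
            out.insert(out.end(), {v00, v10, v01, v01, v10, v11});
        }
    }
    return out;
}

// Average number of vertex transforms per triangle for such a grid.
double transformsPerTriangle(std::uint32_t quadsWide, std::uint32_t quadsHigh,
                             std::size_t cacheSize) {
    const std::vector<std::uint32_t> idx = gridIndices(quadsWide, quadsHigh);
    return static_cast<double>(countTransforms(idx, cacheSize)) /
           static_cast<double>(idx.size() / 3);
}
```

With a 16-entry cache, a grid 4 quads wide comes out well below one transform per triangle, while a grid 64 quads wide needs slightly more than one: by the time the next row comes back to the shared vertices they have already been evicted. The same thing would happen with long rings of a skydome.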

Quote:
Original post by Palidine
You're prematurely optimizing.
Reasons: many algorithms don't scale linearly, and graphics systems usually underperform at lower-than-average complexity.

About that vertex cache...
I've just done some tests on line strips vs lines. I expected drawing line strips would be faster since I was submitting half the number of vertices compared to drawing GL_LINES, but in fact drawing 1 million lines took almost exactly the same time using GL_LINE_STRIP as GL_LINES.
So, my question I guess is, how does the vertex cache work? Why does it not help me in drawing ridiculously long LINE_STRIPS?

Erik Sintorn

When it comes to drawing lines there is one very important point you need to remember: GPUs are truly rubbish at doing so.

This may come as a surprise; however, if you consider it, it's not that surprising: GPUs are optimised for the most common case, which is drawing triangles. While I can't recall the specifics off the top of my head, drawing lines really causes issues, but as most games don't require them it really isn't a problem in the grand scheme of things.

Now, the vertex cache, this comes in two flavours;
1) pre-transform
2) post-transform

Flavour 1 just helps with data transfer, for the most part you'll never have to care about it.

Flavour 2 is the one we are interested in. While it might be a little more complex than this in reality, for practical purposes you can think of it as a kind of map (an array of key-value pairs).

Once a vertex has been transformed by a vertex shader, its output data is stored in this array and the 'key' is set to the index. When the graphics card next goes to pull a piece of data for processing, it will take the index of the vertex it's about to deal with and first check if it's in the cache. If it is, then it reuses that data; if not, then it fetches the data and performs the transform.

This array has a certain number of entries it can hold, so once it's full and it has another vertex to add it has to remove an entry; this is most probably done with a 'least-recently-used' scheme; so the vertex data which was accessed/stored longest ago is replaced with the new data.

So, if we take a VERY simple GPU which processes one vertex at a time and can cache 3 entries and apply it to my earlier example;
Quote:

So, for two triangles sharing an edge, both methods would have 4 vertices defined; however, the triangle strip would have an index buffer of [0,1,2,3] and the triangle list would have an index buffer of [0,1,2,0,2,3].


It would go something like this (for the triangle list);
Quote:

Vertex 0 - not in cache; transform and store
Vertex 1 - not in cache; transform and store
Vertex 2 - not in cache; transform and store
Vertex 0 - in cache; reuse data
Vertex 2 - in cache; reuse data
Vertex 3 - not in cache; transform, cache full, remove entry 1, store

Note how the last vertex hits a full cache and dumps vertex 1 from it.
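
The walk-through above can be reproduced with a small simulation. This is only a sketch of the simplified model described in this post (one vertex at a time, least-recently-used replacement); the struct and function names are made up for the example, and it is not a description of any real GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Result of running an index stream through the simulated cache.
struct CacheStats {
    std::size_t hits = 0;
    std::size_t misses = 0;                // each miss is one vertex transform
    std::vector<std::uint32_t> evicted;    // indices dropped from the cache, in order
};

// Simulate a post-transform cache keyed by vertex index, assuming LRU
// replacement and `capacity` entries.
CacheStats simulateLruCache(const std::vector<std::uint32_t>& indices,
                            std::size_t capacity) {
    assert(capacity > 0);
    CacheStats stats;
    std::vector<std::uint32_t> cache;  // back = most recently used
    for (std::uint32_t idx : indices) {
        auto it = std::find(cache.begin(), cache.end(), idx);
        if (it != cache.end()) {
            ++stats.hits;      // in cache: reuse the stored result
            cache.erase(it);   // refreshed below as most recently used
        } else {
            ++stats.misses;    // not in cache: transform and store
            if (cache.size() == capacity) {
                stats.evicted.push_back(cache.front());
                cache.erase(cache.begin());
            }
        }
        cache.push_back(idx);
    }
    return stats;
}
```

Calling simulateLruCache({0, 1, 2, 0, 2, 3}, 3) gives 4 misses, 2 hits and a single eviction of vertex 1, matching the listing above.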

Now, in the hardware reality GPUs generally have more vertices in flight for processing than 1 at a time (my X1900XT for example can process 8 at once) and have larger caches (16 to 32 entries; this might well depend on the number of outputs from a vertex shader), but the general principle holds.

This is why you want to try to arrange your indices so you rehit as many vertices as possible to make use of the cache and reduce vertex processing overhead.

Of course, this does somewhat assume your bottleneck is in the vertex shader stage; if your pixel shader is doing so much heavy work that it dwarfs the vertex shader time then you'll want to focus your efforts there.
