
sjelkjd

Trying to reach 380M triangles/s


Basically, I'm trying to max out my video card. Supposedly a Radeon 9800 Pro can do 380M triangles/s, so I wrote a small demo to try to reach that. The only object I draw is a sphere, tessellated by polar angles. Right now I have it at 40x40, which means each sphere has 3200 triangles, and I draw 200 randomly placed spheres in my scene. I've been trying out several methods, with the following results:

glDrawElements: I created one big vertex array and then created indices into it, drawing plain GL_TRIANGLES. This gets me 20M triangles/s.

glDrawArrays: I expanded the indices of the first method into separate vertex arrays, so there is more geometry to send, and it shows: 3.2M triangles/s.

Display lists: I tried putting both the DrawElements and the DrawArrays calls in a display list. Both get approximately the same rate, 35M triangles/s.

VBOs using DrawArrays: Same as the original DrawArrays, but using static VBOs to transfer my sphere geometry over once. 36.5M triangles/s.

VBOs using DrawElements: By far the fastest. I put the indices into a VBO as well, and I get 108M triangles/s.

VAO using DrawElements: I can't put the indices in a VAO (or I don't know how), so it's slower than it could be: 48M tris/s.

I have lighting enabled with one light, using the standard OpenGL pipeline. It's not fill limited, because I use a very small window. So how the heck am I supposed to get up to 380M? Do I need larger batches per DrawElements call (currently 3200 triangles)? Do I need to drop the per-object transforms? Or is 380M/s a big lie? If so, what's the best anyone here has gotten?
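For reference, the VBO + DrawElements path looks roughly like this. This is a minimal sketch rather than my exact code: verts, indices, numVerts, and numIndices stand in for the sphere data, and the interleaved position+normal layout is just an assumption.

/* One-time setup: static VBOs for vertices (xyz + normal, interleaved)
   and indices (GL 1.5 / ARB_vertex_buffer_object). */
GLuint vbo, ibo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, numVerts * 6 * sizeof(GLfloat), verts, GL_STATIC_DRAW);

glGenBuffers(1, &ibo);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, numIndices * sizeof(GLushort), indices, GL_STATIC_DRAW);

/* Per frame: bind and draw. The pointer arguments are byte offsets
   into the bound buffers, not client-memory pointers. */
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_NORMAL_ARRAY);
glVertexPointer(3, GL_FLOAT, 6 * sizeof(GLfloat), (const GLvoid*)0);
glNormalPointer(GL_FLOAT, 6 * sizeof(GLfloat), (const GLvoid*)(3 * sizeof(GLfloat)));
glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, (const GLvoid*)0);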

Triangle strips would help improve your speed considerably, since they cut back on geometry processing a lot and don't have to pass around as much memory.
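As a rough sketch, one strip per latitude band of the polar sphere from the original post. This assumes a ring-major vertex layout (vertex (i, j) at index i*(SLICES+1)+j, with SLICES and STACKS as constants, e.g. 40), which may not match your code:

/* Sketch: draw each latitude band as one GL_TRIANGLE_STRIP, alternating
   between ring i and ring i+1. A band of SLICES quads then costs
   2*(SLICES+1) indices instead of 6*SLICES for plain triangles. */
GLushort strip[2 * (SLICES + 1)];
for (int i = 0; i < STACKS; ++i) {
    int n = 0;
    for (int j = 0; j <= SLICES; ++j) {
        strip[n++] = (GLushort)( i      * (SLICES + 1) + j);
        strip[n++] = (GLushort)((i + 1) * (SLICES + 1) + j);
    }
    glDrawElements(GL_TRIANGLE_STRIP, n, GL_UNSIGNED_SHORT, strip);
}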


The theoretical limit is also measured on non-lit, untextured, flat-shaded polygons, and being a theoretical limit, it can pretty much never be achieved. They base it on the clock speed of the geometry processor, and it doesn't take into account the delays of the AGP bus, bus transfer rates, driver overhead, etc. In a normal scene, you will most likely be fillrate limited anyway, so I wouldn't worry about pushing more polygons... if you have over 100 million polygons in a single scene (after occlusion culling, etc.) you either have one hell of a complex scene, or need to rethink your engine's culling design.

quote:
Original post by Ready4Dis
Triangle strips would help improve your speed considerably, since they cut back on geometry processing a lot and don't have to pass around as much memory.


Would it? Memory transfer shouldn't be an issue with VBOs, since I use static buffers (they should be created once on the card and never updated). As for geometry processing, with strips you are assured that two of the three vertices per triangle are in the vertex cache. That should already be the case with plain triangles, though, since I generate them sequentially.

quote:

The theoretical limit is also measured on non-lit, untextured, flat-shaded polygons, and being a theoretical limit, it can pretty much never be achieved.


OK, but even with lighting off, no texturing, and flat shading, I only get 125M tris/s.

quote:
In a normal scene, you will most likely be fillrate limited anyway, so I wouldn't worry about pushing more polygons... if you have over 100 million polygons in a single scene (after occlusion culling, etc.) you either have one hell of a complex scene, or need to rethink your engine's culling design.

Yeah, true =) I really wish I could get 380 though, just for the fun of it.

quote:
Original post by Joe-Bob
Oh yeah, and if you could reach that number, it would run at 1 FPS exactly.


Why is that? You could have, for instance, a 1M-triangle scene that ran at 380 fps.

Also interesting: display lists max out at about 38M tris/s, and are less dependent on the number of spheres and triangles per sphere than the other methods; that is, the triangle rate doesn't deviate from 38M/s.
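For what it's worth, the display list variant is just the indexed draw wrapped in a list, roughly like this (a sketch; indices and numIndices are placeholders for my sphere data):

/* One-time: compile the indexed draw into a display list. */
GLuint list = glGenLists(1);
glNewList(list, GL_COMPILE);
glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, indices);
glEndList();

/* Per frame, per sphere: replay it. */
glCallList(list);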

quote:
Original post by sjelkjd

Why is that? You could have, for instance, a 1M-triangle scene that ran at 380 fps.


There's a huge overhead to clearing and swapping buffers, plus the other stuff you and the driver have to do per frame.

quote:
Original post by Prosper/LOADED
Make sure the rendered triangles are *really* tiny, otherwise you'll become fillrate limited. Also run your program in fullscreen with a low resolution (640x480x16).



320x240x16 should be better.
If the Radeon 9800 supports the 320x240x16 resolution, of course (my GeForce 4 does). I don't understand why modern video cards such as the GeForce 4 Ti support such tiny resolutions...

"C lets you shoot yourself in the foot rather easily. C++ allows you to reuse the bullet!"

quote:
Original post by ShlomiSteinberg
quote:
Original post by Prosper/LOADED
Make sure the rendered triangles are *really* tiny, otherwise you'll become fillrate limited. Also run your program in fullscreen with a low resolution (640x480x16).



320x240x16 should be better.
If the Radeon 9800 supports the 320x240x16 resolution, of course (my GeForce 4 does). I don't understand why modern video cards such as the GeForce 4 Ti support such tiny resolutions...

"C lets you shoot yourself in the foot rather easily. C++ allows you to reuse the bullet!"


Backwards compatibility, I guess. And beyond that, if you already support multiple resolutions, it's trivial to support another one, so why not? It takes little to no effort to add resolutions; might as well support anything that you think may ever be used or has ever been used: 320x200, 320x240, 400x300, 512x384, etc...

quote:
Original post by sjelkjd
Would it? Memory transfer shouldn't be an issue with VBOs, since I use static buffers (they should be created once on the card and never updated). As for geometry processing, with strips you are assured that two of the three vertices per triangle are in the vertex cache. That should already be the case with plain triangles, though, since I generate them sequentially.



Well, maybe, maybe not, but it sure wouldn't hurt to try. Also, it has to send the index buffer to the video card each frame, right? So the index buffer would be smaller using a triangle strip, and therefore less AGP transfer. The overhead associated with making function calls into the driver, etc., is also responsible for never letting you achieve the maximum throughput. It's much easier to become fillrate limited than geometry limited; I've become fillrate limited in my game at 640x480x16 with only 30k triangles filling the scene... once I hit 60k, I'll still be fillrate limited and not geometry limited.

Use pretransformed vertices and bypass that entire stage altogether. I'm not sure how GL handles them; maybe an identity modelview and projection matrix, some glEnable bit, or a vertex shader that simply moves the input vertex to the output register.

You'll have to take care to make sure that your triangles are all already in clip space.
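A sketch of the identity-matrix route; whether the driver actually short-circuits the transform when it sees identity matrices is up to the implementation, so treat this as an experiment rather than a guaranteed win:

/* Sketch: load identity matrices so vertices pass through essentially
   untransformed. The vertex data must then already be in clip space
   (x, y, z in [-1, 1] after the w divide); the draw itself is unchanged. */
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, (const GLvoid*)0);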

------------
- outRider -

quote:
Original post by ShlomiSteinberg
If the Radeon 9800 supports the 320x240x16 resolution, of course (my GeForce 4 does). I don't understand why modern video cards such as the GeForce 4 Ti support such tiny resolutions...



Hmm, you raise an interesting point. This is all happening in a window, not fullscreen. Is that going to significantly cut down my frame rate? I'd test, but I'm using GLUT and not Win32 for setup.

If you've ever played around with windowed mode, you'll know that a graphics card can actually support "any" resolution. I'm not entirely sure whether the same is possible in fullscreen (I'm too lazy to check), but that would explain the notably small resolutions. Also note that most graphics card manufacturers are trying to reach 'realism'; consider that we get a sense of realism from TVs despite their low resolution (text, however, is one problem with them)!

Also note that you could always try profiling your application: what calls are you making on the CPU? Some of those calls come from the graphics API, as explained before. I'm guessing there is an OpenGL equivalent of DX's "store vertex buffer in graphics card memory" (adapter memory), so you could try that as well.

< krysole | krysollix >
sleep... caffeine for the weak minded!

quote:
Original post by Krysole
If you've ever played around with windowed mode, you'll know that a graphics card can actually support "any" resolution. I'm not entirely sure whether the same is possible in fullscreen (I'm too lazy to check), but that would explain the notably small resolutions. Also note that most graphics card manufacturers are trying to reach 'realism'; consider that we get a sense of realism from TVs despite their low resolution (text, however, is one problem with them)!




Of course in windowed mode you can make the window size whatever you want. I'm talking about fullscreen, where resolutions like 123x321 aren't supported, although you could create a window sized 123x321.

"C lets you shoot yourself in the foot rather easily. C++ allows you to reuse the bullet!"

quote:

Hmm, you raise an interesting point. This is all happening in a window, not fullscreen. Is that going to significantly cut down my frame rate? I'd test, but I'm using GLUT and not Win32 for setup.



Yes, it does. Correct me if I'm wrong, but I think it has to do with the fact that in fullscreen you have exclusive access to video memory, whereas in windowed mode you don't. (Windows still has to draw the rest of the screen in case you move the window, etc.)

Here's a small test I did (GF4 Ti4800 SE, ~7k triangles):
640x480 windowed : ~930 fps
640x480 fullscreen: ~1250 fps

The difference actually increases with the number of triangles.

Well, first of all you need to disable lighting; that theoretical figure doesn't involve doing lighting computations. Then you need to make sure you're getting lots of vertex cache hits, i.e. you need good locality of reference in your index buffer. Triangle strips are one way to achieve this, but unless you render something simple like a grid it might be hard to get strips of any decent length, and then you'll get CPU bound from all the draw calls. Just sending triangles in approximate strip order is usually good enough.

glCullFace(GL_FRONT_AND_BACK) is a good way to ensure you're not fill limited without mucking about with resolution. You will also need to get the fps down to achieve good triangle rates (per-frame overhead becomes significant at high fps). Also, your triangles need to be small (a couple of pixels) to avoid rasterization becoming a bottleneck.
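Roughly, that benchmark state would look like this (note that glCullFace only has an effect once GL_CULL_FACE is enabled):

/* Isolate vertex throughput: no lighting, no texturing, flat shading,
   and cull both faces so every triangle is transformed but none are
   rasterized, taking fillrate out of the measurement entirely. */
glDisable(GL_LIGHTING);
glDisable(GL_TEXTURE_2D);
glShadeModel(GL_FLAT);
glEnable(GL_CULL_FACE);
glCullFace(GL_FRONT_AND_BACK);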

I converted my code to run fullscreen with Win32. No change in framerate.

It's not fillrate limited, because I can drop the window size down to almost nothing and the framerate doesn't change.

My framerate is about 60 fps. Getting more tris/s at a lower frame rate does not interest me; interactivity is important here.

I'll see if I can stripify the spheres, although the vertex cache should be doing quite well as it is: running down the side of a sphere, only one vertex per triangle is uncached.

Guest Anonymous Poster
In that case you'll never reach the limit, because every frame you have to clear the buffers, and doing that 60 times per second takes a lot of time away from the GPU. Admittedly it's a lot less than it used to be, but it is still a substantial amount of time. If you care about pumping out as many polys per second as possible, go for it, but don't expect it to be interactive as well :D
