Archived

This topic is now archived and is closed to further replies.

polygon-heavy performance optimization


Recommended Posts

Hi. I am relatively new to (i.e., naive about) D3D, and as such am having some performance issues with my code. The benchmarking program 3DMark 2001's "high polygon test - 1 light" reports that my hardware is capable of rendering 30M triangles/second. For those who may not know, this benchmark transforms and renders a 1,000,000+ polygon model of a carousel in real time.

I wrote an app myself to benchmark how polygon-heavy a scene I could transform and render at 1024x768 above the refresh rate (about 60fps on my hardware). This app reports approximately 19M triangles/second, and I am trying to figure out what accounts for the performance difference between my app and 3DMark.

To clarify: my app renders just a box and lets the user "walk through" the box with the arrow keys (handled by DirectInput, which takes up very little performance overhead). The box is 10x10 units in length and height. Each of the six surfaces (4 walls, 1 ceiling, 1 floor) is subdivided into a grid of 150x150 vertices, and each surface is represented by its own unique vertex and index buffer. This geometry configuration amounts to 6*(2*150^2) = 270,000 triangles total. I get about 70 fps while walking through the inside of this box: 18.9M triangles/second. Note that I am not display-refresh limited in full-screen mode (I am using D3DPRESENT_INTERVAL_IMMEDIATE).

Specifically, I am trying to figure out whether my code is suboptimal (I am currently learning Direct3D, so I am worried I am doing things incorrectly), or whether the nature of my scene is simply slower for the hardware (an NVIDIA GeForce4) to handle than the scene 3DMark uses. Please let me know what you think might account for the performance difference.

My app uses the following settings:
- DirectX 8.1
- full-screen mode, 1024x768
- device created with D3DDEVTYPE_HAL and D3DCREATE_HARDWARE_VERTEXPROCESSING
- vertex and index buffers created with D3DUSAGE_WRITEONLY and D3DPOOL_DEFAULT (index buffers are 16-bit)
- rendering with indexed triangle lists
- using ID3DXMesh::OptimizeInplace() with D3DXMESHOPT_VERTEXCACHE to optimize the index and vertex buffers of each of the box's surfaces (walls, floor, ceiling). NOTE: the performance gain from this was not noticeable; I am guessing this is because the vertex/index buffers were already ordered in a relatively cache-friendly way by the for-loops that created them. Also note that I confirmed the buffers were successfully altered by the OptimizeInplace() call.
- my FVF uses only position x, y, z and diffuse color
- no lights
- z-buffer turned on
- polygons are one-sided (backface culling turned on)

Thank you for taking the time to read this. Please note that I am not interested in benchmarking the maximum triangle throughput my hardware is capable of, so please don't tell me to reduce my window size to 1x1 pixels to see how much performance increases. I am interested in the same sort of test that 3DMark 2001's high polygon test does: transforming and rendering a polygon-heavy scene at 1024x768.

Also note that I have run this in windowed mode at 1024x768, and performance decreases by about 10fps, to 60fps. As mentioned earlier, I am not display-refresh limited in full-screen mode.

I am wondering if the difference in performance has to do with how the geometry is being passed to the card. As noted, I am using indexed triangle lists; is it possible that the 3DMark test uses indexed triangle strips for its meshes? I also wonder whether the slower performance has to do with clipping or fill rate, but even when I pulled the camera back so that no clipping was necessary and less fill was needed, the frame rate maxed out at about 100fps. Go figure.

Any help understanding how to achieve 30M triangles/second with a scene like the one I am testing is much appreciated. Thanks again for your time and effort, Dan

what about the Clear() statement..

Are you clearing the Z-Buffer only, or Z and Target?

Honestly, I am imagining many factors that would cause fluctuations...
1) the amount of polys that are being processed but not drawn because of z order (I know a cube, being a convex model, has zero such situations, since faces are either visible or backface-culled)...
2) Did you build your app in release mode before you tested it?
3) Roughly how big is the 3DMark model on the screen with respect to your model (on the screen).
4) Are you binding many textures? How big is the texture? If using just one texture, did you bind it just once at the beginning of your app, or are you binding it every frame? How does 3DMark do it?
5) FPS rounding errors/Timer may not be measuring correctly..
6)...

any more?

It's a good question, but may have many answers...

www.cppnow.com

[edited by - superdeveloper on June 3, 2003 1:17:03 PM]

Thanks for your response, and sorry for the confusion; it is difficult to cover everything without uploading all the code. I will do my best to answer your questions here (note that I won't be able to answer them all completely or correctly, because I am not that knowledgeable yet):

quote:
Original post by superdeveloper
what about the Clear() statement..

Are you clearing the Z-Buffer only, or Z and Target?

-A: not sure, need to look into this.

1) the amount of polys that are being processed but not drawn because of z order (I know a cube, being a convex model, has zero such situations, since faces are either visible or backface-culled)...

-A: a z-buffer is not necessary for a model of the inside of a box, but I want to use the z-buffer anyway because eventually I will need it for more complicated models. I am intending for this code to be the foundation of a 3D engine that can handle complex architectural-interior models, so the z-buffer is on. (Not sure if that answers the question correctly.)

2) Did you build your app in release mode before you tested it?

-A: not sure what release mode is. (Stab in the dark: I am running the retail runtime, if that has anything to do with this.)

3) Roughly how big is the 3DMark model on the screen with respect to your model (on the screen).

-A: the 3DMark model fills up most of the screen with a rotating carousel and a camera sort of rotating around it; most of the time at least part of the carousel model is being clipped. My own model is being clipped at every edge of the display, but as mentioned, avoiding clipping by pulling the camera outside the box doesn't seem to help much.

4) Are you binding many textures? How big is the texture? If using just one texture, did you bind it just once at the beginning of your app, or are you binding it every frame? How does 3DMark do it?

-A: not using textures (yet). I want to get this code as fast as possible before implementing more complexity.

5) FPS rounding errors/Timer may not be measuring correctly..

-A: possibly, but it is the same timer code that comes with the MS SDK samples.


in response to:

what about the Clear() statement..

Are you clearing the Z-Buffer only, or Z and Target?

---

I am clearing like this:

m_pD3DDevice->Clear(0, NULL, D3DCLEAR_TARGET|D3DCLEAR_ZBUFFER, D3DCOLOR_XRGB(0, 0, 255), 1.0f, 0);

I am not sure what the difference between "Z and target" is.

Z = the depth buffer, which holds the distance of each pixel to the camera. When a pixel is farther away than the stored z-buffer value, it's not drawn.

Target is the actual RGBA color surface where the polygon pixels are drawn.

Both surfaces (Z and RGBA) span the display size (i.e., 1024x768), and each typically contains 1024x768x4 bytes (~3 MB) of data.

I ask because 3DMark may only be clearing one of these surfaces, namely the Z. Perhaps they are drawing a z-disabled environment cube... actually, ignore my comment; env cubes are in NO WAY faster than just clearing a fixed-size buffer...

Although, just for fun, clear with only D3DCLEAR_ZBUFFER rather than D3DCLEAR_ZBUFFER|D3DCLEAR_TARGET, and observe the difference in FPS. Being new to D3D, you may also notice some very interesting artifacts.



hello,

You have mentioned that your cube is made of 6 parts... are you rendering those parts with separate render calls? How many DrawPrimitive calls are you making every frame? If you're making 6 calls, one for each side of the cube, that's probably where the rest of the power went. Try putting the whole cube into one VB; that might help.

I am rendering it with six DrawIndexedPrimitive calls. The reason for this is that I figure the best way to render a fat amount of polygons is to max out each vertex buffer to save overhead cost. Right now each face is 45,000 triangles. As I understand it, vertex buffers on most cards can't index more than about 64,000 vertices (the 16-bit index limit), so my reasoning is that I will need to render about 45-64k of data with each vertex buffer. So I think it is reasonable to keep it separated the way it is.

The point is to compare my test to 3DMark's test. If they have 1 million faces in their scene, they MUST be using WAY MORE than six vertex buffers to render it all. So if they can do it and get 30M tri/sec on my machine, I should be able to do it with my code too. For testing purposes I will keep using six separate vertex buffers, even though I could probably cut back a couple if I overlapped the VB data to create 64k buffers instead of 45k ones. Hope that makes sense.


quote:
Original post by FuzzyBunny
hello,

You have mentioned that your cube is made of 6 parts... are you rendering those parts with separate render calls? How many DrawPrimitive calls are you making every frame? If you're making 6 calls, one for each side of the cube, that's probably where the rest of the power went. Try putting the whole cube into one VB; that might help.




Hmm, 6 calls seems reasonable, since they are rather large...

Disable Z-Buffer all together in your test, just to observe any fluctuations in performance:

...
deviced3d->SetRenderState(D3DRS_ZENABLE, D3DZB_FALSE);
deviced3d->SetRenderState(D3DRS_ZWRITEENABLE, false);
...

(add them just before your drawindexed calls...)

And run your test. Although in 99.95% of apps this approach is unacceptable, as depth testing is mandatory, it would be interesting to see what your results are.

Please post your findings


I think the other benchmark uses OpenGL, so they don't have VBs. Also, are you using the Microsoft framework to set up your application skeleton? Make sure that hardware vertex processing is on, then try using tri strips instead of lists.

The main difference between the other benchmark and yours is probably the use of VBs... since OGL is much different in that area, they might have less trouble maxing out the triangle data. Using VBs, however, has a different structure and a limit of 64,000 verts... I'm sure if you had one vertex buffer only, it would be as fast as the OGL version...

quote:
Original post by FuzzyBunny
I think the other benchmark uses OpenGL, so they don't have VBs. Also, are you using the Microsoft framework to set up your application skeleton? Make sure that hardware vertex processing is on, then try using tri strips instead of lists.

The main difference between the other benchmark and yours is probably the use of VBs... since OGL is much different in that area, they might have less trouble maxing out the triangle data. Using VBs, however, has a different structure and a limit of 64,000 verts... I'm sure if you had one vertex buffer only, it would be as fast as the OGL version...




OpenGL DOES have vertex buffers (mostly referred to as vertex arrays), and VARs (vertex array range). 3DMark (all versions) uses D3D, so that's not correct either. Possibly, you are drawing too many vertices at the same time, while they have a lot of objects that are repeated more than once, which are possibly only passed across the AGP bus once instead of multiple times. Also, calling Clear and copying back buffer -> front 70 times per second, instead of (you get what, 5 fps?) 5 times per second, is a lot of overhead, and could account for the loss in performance.

quote:
Possibly, you are drawing too many vertices at the same time, while they have a lot of objects that are repeated more than once, which are possibly only passed across the AGP bus once instead of multiple times.


With static write-only VBs, the vertices should be in video memory, not crossing the AGP bus at all once they've been filled.

quote:
Also, calling Clear and copying back buffer -> front 70 times per second, instead of (you get what, 5 fps?) 5 times per second, is a lot of overhead, and could account for the loss in performance.


This raises a good question for the original poster: what swap effect are you using? D3DSWAPEFFECT_COPY will incur penalties blitting the back buffer to the front; _FLIP and _DISCARD don't require any copying. I would test with D3DSWAPEFFECT_DISCARD.

Also, I don't understand why you are unwilling to reduce the resolution. It will tell you whether or not you are fill-rate bound, which is information. You're almost certainly not, but the point is that if you want to find out where your bottleneck is, you have to do these things, or else you're just grasping at straws.

quote:
Possibly, you are drawing too many vertices at the same time, while they have a lot of objects that are repeated more than once, which are possibly only passed across the AGP bus once instead of multiple times.


I don't understand this. What do you mean by "objects that are repeated more than once", and what difference does it make whether they are transferred over the AGP bus once or multiple times? Please fill me in, since I am quite new to all this stuff. Thanks.

quote:
With static write-only VBs, the vertices should be in video memory, not crossing the AGP bus at all once they've been filled.


This makes sense to me, in that I understand the value of not having to wait for data to transfer. So how do I ensure that this is what is happening? In my original post I outlined many of the flags I used, which may help answer this question.


quote:
Also, calling Clear and copying back buffer -> front 70 times per second, instead of (you get what, 5 fps?) 5 times per second, is a lot of overhead, and could account for the loss in performance.


What are you saying takes 5fps? By "5 times per second is a lot of overhead also", do you mean 5 DrawIndexedPrimitive calls, or are you referring to something else? I am not sure what you are referring to, sorry.

quote:
This raises a good question for the original poster: what swap effect are you using? D3DSWAPEFFECT_COPY will incur penalties blitting the back buffer to the front; _FLIP and _DISCARD don't require any copying. I would test with D3DSWAPEFFECT_DISCARD.


I am using D3DSWAPEFFECT_DISCARD; have been the whole time.

quote:
Also, I don't understand why you are unwilling to reduce the resolution. It will tell you whether or not you are fill-rate bound, which is information. You're almost certainly not, but the point is that if you want to find out where your bottleneck is, you have to do these things, or else you're just grasping at straws.


I am unwilling to reduce the resolution because I am trying to match the performance exhibited by the 3DMark high polygon benchmark. If they can do it on my computer, I should be able to too.

Incidentally, I have noticed that the 3DMark high polygon test does not run at refresh rate. I do not know how much slower it runs; maybe 50fps, maybe 30fps, maybe 20fps. So I tried sending more vertices per vertex buffer to see if my app would handle a bigger load, but as it turns out, with any amount greater than 22,500 vertices (that is, 150x150) per buffer, performance goes UNDER the 18.9M triangles/second I get with 22,500 vertices per buffer. So I guess that is some sort of optimum "sweet spot" for my hardware/software configuration (GeForce4, Windows 98).

Thanks for the input everyone has offered so far.

Dan

quote:
so how do i ensure that this is what is happening?


You can't *ensure* it, strictly speaking, all you can do is describe accurately how you intend to use the data. If you're not going to be modifying the data then you don't use D3DUSAGE_DYNAMIC. If you're not going to read from it (which you should never do if you can absolutely, positively, and for the love of Pete avoid doing), then you specify D3DUSAGE_WRITEONLY. Etc. It's up to the driver to decide where the VB belongs, based on your indications. If you specify dynamic then it's likely to put it into AGP memory for fast writing by the CPU. If you don't specify D3DUSAGE_WRITEONLY then it may put it into system memory for fast reading by the CPU. D3DPOOL_DEFAULT + D3DUSAGE_WRITEONLY typically results in an allocation in video memory on DX7+ hardware.
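As an illustration of the flag combination described above, here is a minimal D3D8 sketch (the `CreateStaticVB` helper name and the `Vertex` layout are hypothetical, not from the thread, though the layout matches the position-plus-diffuse FVF the original poster described; compiling this requires the DirectX 8 SDK headers):

```cpp
#include <d3d8.h>

// Vertex layout matching the FVF from the original post: position + diffuse.
struct Vertex { float x, y, z; DWORD diffuse; };
#define D3DFVF_CUSTOM (D3DFVF_XYZ | D3DFVF_DIFFUSE)

// Create a static, write-only vertex buffer that the driver is free to place
// in video memory.
IDirect3DVertexBuffer8* CreateStaticVB(IDirect3DDevice8* device, UINT vertexCount)
{
    IDirect3DVertexBuffer8* vb = NULL;
    // No D3DUSAGE_DYNAMIC: the data is filled once and never modified.
    // D3DUSAGE_WRITEONLY: we promise never to read the buffer back.
    // D3DPOOL_DEFAULT: let the driver choose the location; combined with
    // WRITEONLY, this typically means video memory on DX7+ hardware.
    HRESULT hr = device->CreateVertexBuffer(vertexCount * sizeof(Vertex),
                                            D3DUSAGE_WRITEONLY,
                                            D3DFVF_CUSTOM,
                                            D3DPOOL_DEFAULT,
                                            &vb);
    return SUCCEEDED(hr) ? vb : NULL;
}
```

The buffer would then be filled once via Lock()/Unlock() at load time and only read by the GPU afterwards.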

quote:
i am unwilling to reduce the resolution because i am trying to match the performance exhibited by the 3dmark benchmark high polygon test. if they can do it on my computer, i should be able to too.


"Doctor, doctor, Joe can run a four-minute mile and I can't!"

"Well, let me do some tests to see if you're limited by your muscle endurance, or your lung capacity, or..."

"No, I just want to know what the difference is! If he can do it, I can!"

Something in the pipeline is your bottleneck. The first order of business is to determine what it is. If you optimize for any other stage in the pipeline, you won't see any benefit. I don't understand the resistance to performing a simple test to eliminate possible candidates, especially when it would take two minutes.

[edited by - Donavon Keithley on June 4, 2003 2:56:44 AM]

So I tested using a lower resolution, and the triangles/second count went up dramatically. Presumably, then, we are not dealing with a triangle-count limitation but rather a fill-rate limitation. So how do I mitigate fill rate?

How dramatic was the speedup? Going from 1024x768 to what?

You should probably see a small speedup from the reduced frame buffer size simply because it clears faster. But it sounds like, from your description, each pixel gets drawn to only once per frame, right? So you really shouldn't be fill bound.

Which swap effect are you using? Don't use COPY; it'll compound the frame buffer effect, blitting from the back buffer to the front every frame.

Next, I would compare the two resolutions with all frame buffer clearing turned off (render target and z-buffer) -- and of course z-buffering turned off -- just to make sure that it's not all frame buffer clearing that you're seeing. (It's not a pure experiment, because it also eliminates z-buffer bandwidth. Properly, you would keep clearing on and turn off z-buffering, then compare with clearing turned off.)

If there's still a significant difference, then it sounds like you really are fill bound, but why? I would turn on wireframe and look for any anomalies. Then I would look for any overdraw that isn't supposed to be there, using additive alpha blending (src blend = SRCALPHA and dest blend = ONE) with, say, 20% alpha.

That should give you some idea of the kinds of experiments you can do.
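The overdraw experiment suggested above could be set up with something like this (a hypothetical D3D8 sketch; `EnableOverdrawView` is an invented helper name, and `device` is assumed to be a valid `IDirect3DDevice8*`):

```cpp
#include <d3d8.h>

// Additive blending at low alpha, so every extra time a pixel is drawn it
// gets visibly brighter; overdraw hotspots stand out on screen.
void EnableOverdrawView(IDirect3DDevice8* device)
{
    device->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
    device->SetRenderState(D3DRS_SRCBLEND,  D3DBLEND_SRCALPHA);
    device->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_ONE);
    // With ~20% alpha in the vertices' diffuse color, a pixel drawn once
    // appears dim; each additional overdraw adds roughly another 20%.
}
```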

"my hardware is capable of rendering 30M triangles/second" is rather ambiguous. Does each triangle take up 10 pixels, or cover half the entire screen? Triangle strips or triangle lists? Rendered front to back / back to front / random?...
make sure you are doing the same for a "fair" competition...
