rubicondev

DirectX 9: General Optimisation Tips ?

Let's forget for a moment about demos and ideal-world applications that a tech guy from nVidia would rustle up to show off; I'm talking about actual real-world engines. How badly do drawprim counts count against everything else, for example? My batches are fairly small, but they have to be - there's not much point putting a trillion faces into a telegraph pole, yet all those poles need rendering separately so they can be positioned.

I've sort of reached a point where I feel that I'm missing something major, as my engine feels kinda slow. It does all lighting per-pixel, but using ubershaders so it's all single pass. In a scene with around 250 draw calls (and no lighting at the moment) I'm getting a framerate from my 1950 that feels more in line with a PlayStation. I've commented out chunks of stuff and I never seem to get much of a speed-up.

I guess I'm looking for general optimising advice. It's no good suggesting playing with PIX - I'm at a stage that's way before that level of scrutiny. I'm considering an overhaul of the pipeline at a gross scale, but I'm not quite sure which way to go because I don't really know what's wrong with what I have now. It's driving me crazy. Is there a decent *current* paper on general performance guidelines for real-world apps?

From my own (recent) experience: shader switches are the devil. Try to switch only when you absolutely have to. I got a 300% increase in FPS just from reducing shader switches. Also, I noticed that while a lot of people tell you to sort by shader, grouping by shader is really all you need. The difference is that grouping can be done in O(n), while a comparison sort is at best an O(n log n) operation.
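
To make the grouping-versus-sorting point concrete, here is a minimal sketch of an O(n) grouping pass. It assumes each batch carries an integer shader id; `DrawItem`, `groupByShader` and `countShaderSwitches` are hypothetical names for illustration, not part of anyone's engine or of Direct3D.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical draw record; the field names are illustrative only.
struct DrawItem {
    std::uint32_t shaderId;  // which shader this batch uses
    std::uint32_t meshId;    // what to draw
};

// Reorders items so that all batches sharing a shader are adjacent.
// Two linear passes with a hash map, so O(n) on average, versus
// O(n log n) for a comparison sort. Groups appear in order of first
// occurrence, and the order inside each group is preserved.
std::vector<DrawItem> groupByShader(const std::vector<DrawItem>& items) {
    // Pass 1: count items per shader and remember first-seen order.
    std::unordered_map<std::uint32_t, std::size_t> slot;  // shaderId -> group index
    std::vector<std::size_t> counts;
    for (const DrawItem& it : items) {
        auto found = slot.find(it.shaderId);
        if (found == slot.end()) {
            slot.emplace(it.shaderId, counts.size());
            counts.push_back(1);
        } else {
            ++counts[found->second];
        }
    }
    // Prefix sum: starting offset of each group in the output.
    std::vector<std::size_t> offset(counts.size(), 0);
    for (std::size_t i = 1; i < counts.size(); ++i)
        offset[i] = offset[i - 1] + counts[i - 1];
    // Pass 2: scatter each item into its group's next free position.
    std::vector<DrawItem> out(items.size());
    for (const DrawItem& it : items)
        out[offset[slot[it.shaderId]]++] = it;
    return out;
}

// Counts how many shader binds a naive renderer would issue for this order
// (one for the first item, plus one for every change of shader id).
std::size_t countShaderSwitches(const std::vector<DrawItem>& items) {
    std::size_t switches = 0;
    for (std::size_t i = 0; i < items.size(); ++i)
        if (i == 0 || items[i].shaderId != items[i - 1].shaderId)
            ++switches;
    return switches;
}
```

Grouping gives no particular order between the groups themselves, which is fine if all you care about is the number of switches; if you also want front-to-back ordering or similar, a real sort key may still be needed.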

I'm already doing that (the sorting part anyway). In fact I'm doing quite a lot of things like that, which is why I don't get why my engine feels so slow. If I turn off the rendering completely then it rips, so I know the problem's not in the rest of my game.

Keep the tips coming though, handy thread already. :)

In a non-pipelined single processor architecture, optimizing the wrong place would gain you little extra performance. The beauty of programming pipelined multiple processor architectures is that optimizing the wrong place would gain ZERO extra performance! A chain is only as strong as its weakest link. A pipeline is as fast as its slowest stage.

If the part you're currently optimizing is not where the bottleneck is, all your efforts will be in vain. If your application is not CPU-bound, for instance, reducing the number of batches per frame won't help at all.

So, the key to the wonderland is:
Quote:

FIRST locate the bottleneck, THEN try to optimize it. This won't totally remove the bottleneck. It simply moves the bottleneck to another part of your application and boosts your performance to some extent. Locate the new bottleneck, then optimize again. Do this as many times as you need until you reach the sweet spot. You may then exploit the idle time in the other stages and make those stages more complex for free!

PROFILE and OPTIMIZE
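
For the CPU side of "locate the bottleneck", a coarse timer around each stage of the frame is often enough to see which one dominates. The following is a minimal sketch, not a replacement for a real profiler; `FrameProfiler` and its members are hypothetical names. It measures CPU wall-clock time only, so time spent waiting on the GPU shows up inside whichever call happens to block.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Accumulates wall-clock milliseconds per named section of a frame.
class FrameProfiler {
public:
    using Clock = std::chrono::steady_clock;

    // RAII helper: times from construction to destruction.
    class Scope {
    public:
        Scope(FrameProfiler& profiler, std::string name)
            : profiler_(profiler), name_(std::move(name)), start_(Clock::now()) {}
        ~Scope() {
            std::chrono::duration<double, std::milli> ms = Clock::now() - start_;
            profiler_.add(name_, ms.count());
        }
        Scope(const Scope&) = delete;
        Scope& operator=(const Scope&) = delete;

    private:
        FrameProfiler& profiler_;
        std::string name_;
        Clock::time_point start_;
    };

    // Adds a sample directly (also used by Scope).
    void add(const std::string& name, double ms) { totals_[name] += ms; }

    // Total time recorded for a section, or 0 if it was never timed.
    double total(const std::string& name) const {
        auto it = totals_.find(name);
        return it == totals_.end() ? 0.0 : it->second;
    }

    // Name of the section with the largest total; empty if nothing recorded.
    std::string slowest() const {
        std::string best;
        double bestMs = -1.0;
        for (const auto& entry : totals_) {
            if (entry.second > bestMs) {
                bestMs = entry.second;
                best = entry.first;
            }
        }
        return best;
    }

private:
    std::map<std::string, double> totals_;
};
```

Usage would look like `{ FrameProfiler::Scope s(profiler, "render"); /* draw the scene */ }` around each stage, then checking `profiler.slowest()` every few hundred frames.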

I'm all for general profiling, trouble is I don't know how to do it properly anymore.

My feelings about VTune are expressed quite concisely elsewhere, but I don't know what other products are available that actually work. I used Metrowerks CATS a long time ago but it seems to be discontinued. Dev no longer has its simple one built in.

PIX on 360 is okay, but the bottlenecks on that beast are in totally different places so I've gone as far as I can with that. The PC version of PIX is grim by comparison and doesn't seem to show much about system-wide clashes and bottlenecks.

I'm currently downloading the beta of ATI's perfhud equivalent. Hopefully that'll give me some insight, but my download seems to be running at a byte an hour so who knows....

I do strongly suspect I'm CPU bound. I'm using a lot of it for my game (it has a simple fluid dynamics system in it), and turning off rendering completely makes it fly. I have about 250 smallish batches to draw 80K polys, all of which are single pass. I wouldn't expect this to be bound by anything tbh - have we gone backwards ?

I'm certainly no newb at this stuff, but I do think I must have a schoolboy error somewhere. I just can't find it!

Quote:
Original post by RubiconMobile
How badly do drawprim counts count against everything else for example ? My batches are fairly small, but they have to be - there's not much point putting a trillion faces into a telegraph pole, yet all those poles need rendering separately so they can be positioned.


Total War doesn't draw each soldier with a single call to DIP (DrawIndexedPrimitive).

see here

NVPerfHUD is a great profiling tool to look into, although you already seem to be aware of it. You also need an NVIDIA card to use it to its full extent. I suggest you first test your application with that (or its ATI equivalent) to validate your guess and make sure you really are CPU-bound. If you are, go for algorithmic optimizations first rather than coding hacks, since they yield much bigger improvements unless you're making some BIG coding mistake. Go for coding optimizations next.

The documents that come with NVPerfHUD are also very interesting. They would help a lot if you feel kind of lost. There was one guide in particular on optimization how-tos, whose name I've unfortunately forgotten, but I'm sure you'll be able to find it. The "NVIDIA GPU Programming Guide" is also a good read. You can find it on NVIDIA's developer website.

I think I'm currently downloading the entire nVidia website now, thanks :)

Instancing is at the top of my engine wishlist, but it won't help my current projects as they don't have lots of repeats - the telegraph poles were just a classic example of batching problems. (The lead Xbox programmer for Spartan is a colleague of mine, btw.)

If your scene consists of only 250 batches/draw calls, you're definitely not CPU-bound by the draw calls themselves. To my knowledge the average scene in a modern game is much larger than this. My personal experience is that a current single-core system can handle 2000 batches with shader/texture switches at 30 fps and still have spare CPU cycles. So to repeat previous advice: profile.

Some more hints, just to refresh some memories of my worst mistakes:

* Pay attention to accesses on dynamic resources. You most probably have some of these, for example dynamic vertex buffers for particle systems or simple font/HUD rendering. Be sure to actually create the buffer with the D3DUSAGE_DYNAMIC flag and to lock it with D3DLOCK_DISCARD. If you don't, you lose all CPU/GPU parallelism. Even if one of the two is heavily under-utilized, restoring that parallelism can still gain you some frames per second.

* Filter all state changes. If you set a render state to a value it already has from a previous call, you waste a lot of CPU cycles. To my knowledge, DirectX does no filtering of useless state changes; it just warns about them at a high debug output level. It's very simple to put a thin layer between DX and your engine that caches all state changes and compares them to the previous values before issuing a DX call. This simple change doubled our framerate a long time ago.

* The same can be applied to textures, shaders and just about everything else, but you are most probably already doing that.

* Batch shader constant updates. This won't gain you as much, but it helps as well. Instead of multiple calls to SetShaderValue() or whatever it's called, fetch the register indices for all the textures / parameters you need in advance and store them with the shader. When drawing, you can then build the whole shader parameter set in a CPU-side float buffer and commit it all in one call. It helped us a bit, but the gain is smaller than from the other tips.

* As already said before: group batches by shader.
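
The state-filtering and constant-batching tips above could be sketched roughly as follows. This is only an illustration under stated assumptions: `RenderStateCache` and `ConstantStaging` are hypothetical names, and the `apply` callback stands in for the real device call (in a D3D9 engine that would typically wrap something like `IDirect3DDevice9::SetRenderState`, and the staged floats would be uploaded with a single `SetVertexShaderConstantF` or `SetPixelShaderConstantF` call).

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Thin layer that forwards a state change only when the value differs
// from the last one sent. `apply` is a stand-in for the real device call.
class RenderStateCache {
public:
    using ApplyFn = std::function<void(std::uint32_t state, std::uint32_t value)>;

    explicit RenderStateCache(ApplyFn apply) : apply_(std::move(apply)) {}

    // Returns true if the change was forwarded, false if it was redundant.
    bool set(std::uint32_t state, std::uint32_t value) {
        auto it = current_.find(state);
        if (it != current_.end() && it->second == value)
            return false;  // same value as last time: skip the device call
        current_[state] = value;
        apply_(state, value);
        return true;
    }

    // Forget everything, e.g. when the device state may have changed
    // behind the cache's back (assumption: the caller knows when that is).
    void invalidate() { current_.clear(); }

private:
    ApplyFn apply_;
    std::unordered_map<std::uint32_t, std::uint32_t> current_;
};

// Stages float4 shader constants in CPU memory so the whole range can be
// uploaded with one call instead of one call per parameter.
class ConstantStaging {
public:
    explicit ConstantStaging(std::size_t registerCount)
        : data_(registerCount * 4, 0.0f) {}

    // Writes one float4 register; `reg` is an index looked up ahead of time.
    // Returns false if the register is out of range.
    bool set(std::size_t reg, float x, float y, float z, float w) {
        if (reg >= registerCount())
            return false;
        float* p = &data_[reg * 4];
        p[0] = x; p[1] = y; p[2] = z; p[3] = w;
        return true;
    }

    std::size_t registerCount() const { return data_.size() / 4; }

    // Pointer to the contiguous floats to hand to the single upload call.
    const float* data() const { return data_.data(); }

private:
    std::vector<float> data_;
};
```

The same caching pattern can be reused for textures, shaders and stream sources by keying on the stage or slot index instead of a render state id.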

Good luck. If you succeed in improving your performance, please let us know what exactly helped you and to what amount. Future discussions might profit from such real-world data.

Bye, Thomas

[edit] fixed some typos.

I'm doing all that redundant state stuff, but your suggestion about dynamic data is a good one - I'm going to look again at that stuff.

I'll certainly post back later if I find anything dramatic worth passing on.
