# [DX9] IDirect3DDevice9::Present() way too long


## Recommended Posts

Hey, as a school project I'm currently developing a rendering framework with some features to enable gameplay programming. Everything was going great until today. A few hours ago I started testing the rendering performance on large scenes, and as it turns out, with only 200 planes to render (sharing a vertex/index buffer) my frame rate barely reaches 4. I profiled the render cycle, and it looks as if calling Present(0, 0, 0, 0) on my device pointer takes about 200 ms. It might be interesting to know that the framework renders the scene on a separate thread from the main thread and that the device is created with the D3DCREATE_MULTITHREADED flag (which doesn't seem to make any difference).
// Behaviour flags
devBehaviourFlags = D3DCREATE_PUREDEVICE | D3DCREATE_MULTITHREADED | D3DCREATE_HARDWARE_VERTEXPROCESSING;

// presentParams
m_ePP.BackBufferWidth = 1280;
m_ePP.BackBufferHeight = 768;
m_ePP.BackBufferFormat = bFullscreen ? D3DFMT_X8R8G8B8 : m_eDisplayMode.Format;
m_ePP.BackBufferCount = 1;
m_ePP.MultiSampleType = D3DMULTISAMPLE_NONE;
m_ePP.MultiSampleQuality = 0;
m_ePP.hDeviceWindow = KiCodil::GetRenderWindow()->GetWindow();
m_ePP.Windowed = !bFullscreen;
m_ePP.EnableAutoDepthStencil = true;
m_ePP.AutoDepthStencilFormat = D3DFMT_D24X8;
m_ePP.Flags = 0;
m_ePP.FullScreen_RefreshRateInHz = D3DPRESENT_RATE_DEFAULT;
m_ePP.PresentationInterval = D3DPRESENT_INTERVAL_IMMEDIATE;

// before rendering
HR(m_pDevice->Clear(0, NULL, D3DCLEAR_TARGET | D3DCLEAR_ZBUFFER, D3DCOLOR_RGBA(255, 255, 255, 0), 1.0f, 0));
HR(m_pDevice->BeginScene());

// after rendering
HR(m_pDevice->EndScene());
HR(m_pDevice->Present(NULL, NULL, NULL, NULL)); // <= this devil takes about 200 ms each frame

So, does anybody have any ideas on how to speed up Present(0, 0, 0, 0)?

##### Share on other sites
1: Is this debug or release?

2: How are you profiling? If with your own code, can you describe it? If with another program, which one?

3: I'm assuming this is in windowed mode? If so, try fullscreen. Do you get the same results?

4: Have you tried cranking up your poly count to see if it actually increases the time it takes?

##### Share on other sites
Quote:
 Original post by freeworld: 1: Is this debug or release?

This happens in debug mode as well as in release mode.

Quote:
 Original post by freeworld: 2: How are you profiling? If with your own code, can you describe it? If with another program, which one?

I'm using a small piece of code that I call before and after the Present(...) call:
void Start() {
    QueryPerformanceCounter(reinterpret_cast<LARGE_INTEGER*>(&m_nMarker));
}
void Stop() {
    __int64 nTmp;
    QueryPerformanceCounter(reinterpret_cast<LARGE_INTEGER*>(&nTmp));
    m_nSecondsElapsed += static_cast<double>(nTmp - m_nMarker) * m_nSecondsPerTick;
    ++m_nLoopCount;
}
void Update() {
    m_nTiming = static_cast<float>(m_nSecondsElapsed / m_nLoopCount);
    m_nLoopCount = 0;
    m_nSecondsElapsed = 0;
}

Quote:
 Original post by freeworld: 3: I'm assuming this is in windowed mode? If so, try fullscreen. Do you get the same results?

Switching from windowed to fullscreen seems to help a little bit, but not nearly enough.

Quote:
 Original post by freeworld: 4: Have you tried cranking up your poly count to see if it actually increases the time it takes?

Adding more meshes increases the time, removing meshes decreases it

##### Share on other sites
How long is a total frame taking? What's your FPS? I'll reiterate: you should dramatically increase the amount you are drawing. You'll find it hard to see differences in speed if you're not doing much in the first place.

##### Share on other sites
I use a different timer for the total FPS than I use for the profiler-timing.
(the FPS-counter I use comes from the book: 3D Game-Programming, a shader approach)

When I have a scene with 1 Geometric object (44 vertices) and 200 planes (consisting of 4 vertices, 2 triangles) my framerate gives me 4.
Just to be sure, I had a look at the FPS monitoring in Fraps, which said about the same thing.

##### Share on other sites
You're only getting 4 frames per second with just 400 primitives?

Have you tried using a profiler such as "CodeAnalyst"?

##### Share on other sites
I'm not exactly sure that I'm using CodeAnalyst the right way, but here is my shot:

95.8% of the samples were taken in the symbol 'QueryOglResource', which is part of nvd3dum.dll (some DLL from NVIDIA).

I also tested the debug-build on the desktop of a friend, ... same result

##### Share on other sites
1: Open CodeAnalyst and choose express profile from the wizard menu that pops up.

2: In the launch option browse and find your exe you want to profile.

3: make sure the working directory option is the same directory your exe is in.

4: When your app starts up wait till the green progress bar in the lower right finishes, then close your app.

5: in the window that pops up, find your app and double click on it.

6: This should be a huge list of all the functions that get called from your app, and a percentage of the total time they take up.

7: Look at the ones that take up the most time, first looking at your function and not third party functions like std::_vector... and so forth.

8: Find the chunks of code that take up the most time, and feel free to post one or two.

[Edited by - freeworld on December 3, 2009 7:12:34 PM]

##### Share on other sites
It's setting the world transform that is causing your problem. I ran into this a few months back in my game engine. Unfortunately, the only way around the problem is to reduce the calls to the set transform command.

Options...
Switch to instancing, this will go a long way towards reducing the need to call the set transform.
If you are doing particles, switch to DX sprites; they are batched by DX and are much faster than any particle system you can most likely write, unless you are doing GPU sprite handling...

Using instancing (ex. asteroids) I can render approx 300k meshes per second on a dual core AMD 4200+ and an ATI 1650.
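For simple quads there is another way to cut down on world-transform changes, shown here as a minimal sketch: pre-transform the corners on the CPU and collect them in one batch, so that the whole set can be drawn with a single draw call and an identity world matrix. This is CPU-side batching, not the hardware instancing recommended above, and every name in it (`Vec3`, `Mat4`, `Batch`, `appendQuad`, `buildBatch`) is hypothetical rather than taken from the framework in this thread.

```cpp
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

// Row-major 4x4 matrix using the row-vector convention (v' = v * M),
// with the translation stored in the fourth row.
struct Mat4 { float m[4][4]; };

// Transform a point (w = 1) by a world matrix.
inline Vec3 transformPoint(const Vec3& v, const Mat4& w) {
    return {
        v.x * w.m[0][0] + v.y * w.m[1][0] + v.z * w.m[2][0] + w.m[3][0],
        v.x * w.m[0][1] + v.y * w.m[1][1] + v.z * w.m[2][1] + w.m[3][1],
        v.x * w.m[0][2] + v.y * w.m[1][2] + v.z * w.m[2][2] + w.m[3][2],
    };
}

struct Batch {
    std::vector<Vec3> vertices;          // world-space positions
    std::vector<std::uint16_t> indices;  // two triangles (six indices) per quad
};

// Append one unit quad, pre-transformed into world space, to the batch.
inline void appendQuad(Batch& batch, const Mat4& world) {
    static const Vec3 kCorners[4] = {
        {-0.5f, -0.5f, 0.0f}, {-0.5f, 0.5f, 0.0f},
        { 0.5f,  0.5f, 0.0f}, { 0.5f, -0.5f, 0.0f},
    };
    const std::uint16_t base = static_cast<std::uint16_t>(batch.vertices.size());
    for (const Vec3& c : kCorners) {
        batch.vertices.push_back(transformPoint(c, world));
    }
    const std::uint16_t quad[6] = {0, 1, 2, 0, 2, 3};
    for (std::uint16_t i : quad) {
        batch.indices.push_back(static_cast<std::uint16_t>(base + i));
    }
}

// Build one batch from many world matrices.
inline Batch buildBatch(const std::vector<Mat4>& worlds) {
    Batch batch;
    for (const Mat4& w : worlds) {
        appendQuad(batch, w);
    }
    return batch;
}
```

For 200 planes this yields 800 vertices and 1200 indices, which comfortably fits 16-bit indices. The trade-off is that the vertex data has to be rebuilt and re-uploaded whenever a plane moves.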

##### Share on other sites
Spending 200 ms in Present() suggests strongly to me that you're GPU bound, that is, you're giving the video card too much work to do. You're not using the REF device, are you?

Given that you're clearly not geometry bound, I'd suspect either fillrate or the pixel shader. How much overdraw is there?

Anyway, the first thing I'd suggest is getting hold of PerfHUD. The documentation for that will guide you through tracking down your bottleneck.

##### Share on other sites
@Adam - No offense intended, but I'm afraid the profiler will not show the root of the problem. You are correct in one respect: it is due to an overload. When a world transform takes place, it modifies multiple structures in DX each and every time. MS warns not to do this too often (search MSDN for proof; that's where I found my original answer). Once it reaches a certain number of changes per Present() interval, it becomes overloaded with changes. After testing over a month's time (and tons of hair pulling), I found it is between 65 and 85 world transforms when the delay starts spiking badly. On my single core AMD 3000+ it would pause up to 3 seconds with just 100 world transforms. AND it only happens when Present() is used, nowhere else. I wrote my own profiler class to test each and every last DX command using the processor clock throughout the entire game loop. Below the limit the normal tick count was around 40-50; past that it would sometimes peak at 2 million ticks (processor time ticks).

BTW, this is a per-frame problem and MS knows about it. Once the frame is presented, the transform count is pretty much considered reset. Instancing may seem hard at first, but in essence it is very similar to rendering a single instance of a mesh, just much, much faster, and ;) it only requires 1 world transform. In this case, would you rather set it once or 400 times?

The only downside to instancing is the amount of data that can be rendered per pass (yes, you will have to use a shader, unfortunately). This will limit the size of the mesh vertices/indices to 65k per pass, but you can render more complex meshes simply by reducing the instances drawn per pass. Look at the 'Instancing' demo provided in the SDK to see how to do this (plus it comes with a shader to build on). You can get pretty good speed with just 64 instances per pass.

This problem is with WORLD transforms only; none of the other transforms cause it. Since world transforms are what put your objects in place, you will have to find workarounds to this problem, as I stated previously.

##### Share on other sites
Did you read http://msdn.microsoft.com/en-us/library/ee415127%28VS.85%29.aspx before doing that profiling?

Note the list right at the bottom - SetTransform() should average about 3500 clock cycles. You really shouldn't be getting three second pauses caused by just calling that 100 times - it should take significantly less than one millisecond of CPU time.
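As a back-of-the-envelope check of that estimate, the sketch below converts the quoted cycle count into milliseconds. The 3500 cycles per call is the figure quoted above; the 1.8 GHz clock speed is an assumption chosen purely for illustration, not a measurement from any machine in this thread.

```cpp
// Convert a CPU cycle count into milliseconds at a given clock frequency.
constexpr double cyclesToMilliseconds(double cycles, double clockHz) {
    return cycles / clockHz * 1000.0;
}

// Estimated CPU cost of 100 SetTransform() calls at ~3500 cycles each.
// ASSUMPTION: a 1.8 GHz CPU; substitute the clock speed of the real machine.
constexpr double kEstimatedMs = cyclesToMilliseconds(3500.0 * 100.0, 1.8e9);

// Under that assumption the estimate stays well below one millisecond.
static_assert(kEstimatedMs < 1.0, "100 calls should cost far less than 1 ms");
```

Under that assumption the estimate comes out at roughly 0.19 ms, which suggests the CPU cost of the calls alone cannot account for a pause of several seconds.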

##### Share on other sites
Hehehe, we are speaking of 2 different kinds of clock cycles. The numbers from my profiler do not have anything to do with actual processor clock cycles. What you posted is processor cycles; my profiler uses the performance counter, and its numbers have no real comparison to that. The numbers I have posted are the number of counter ticks needed to process a specific command. I measured both total subroutine time and total command execution time, including regular CPU commands, starting and stopping each timer before and after each command. The numbers only work within that realm, counting the ticks each command takes.

Look at the first post, he says Present() takes 200 ms, this means nothing to the profiler, it just returns how many ticks it took to get through the routine or command. MS commonly says in many areas of MSDN that this doesn't matter, but it does. If a command is normally running at 40-50 ticks then jumps to over 2 million of them, then it is a problem. It didn't matter to me that the GPU isn't processing the data fast enough, it matters that it locks up my system with over 2 million ticks at a time when it should only take 40-50 ticks which results in a huge pause in frame rate. I only wanted to know where it was happening. It was Present() causing the pause, nothing else.
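Performance-counter ticks and milliseconds can be related by dividing by the counter frequency, which on Windows comes from `QueryPerformanceFrequency()`. The sketch below shows that conversion. The frequency constant is an assumed value used purely for illustration; the real value has to be read on the machine that produced the tick counts.

```cpp
#include <cstdint>

// Convert QueryPerformanceCounter ticks to milliseconds.
// ticksPerSecond is the value QueryPerformanceFrequency() reports on the
// machine that produced the tick counts.
inline double ticksToMilliseconds(std::int64_t ticks, std::int64_t ticksPerSecond) {
    return static_cast<double>(ticks) * 1000.0 / static_cast<double>(ticksPerSecond);
}

// ASSUMPTION: a counter frequency of 3,579,545 Hz, used only as an example.
const std::int64_t kAssumedTicksPerSecond = 3579545;
```

With that assumed frequency, 2 million ticks is about 559 ms and 50 ticks is about 0.014 ms, so a tick count and a figure such as the 200 ms from the first post express the same measurement in different units.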

What DX does in its code was of no concern to me; I wanted to know how it affected my CPU usage, since I was optimizing my code for the highest possible speed and a steady frame rate. I did a perf check using the DX program you posted, but it was no help whatsoever since it wasn't picking up the WORLD transform problem. I got a hint from the MSDN documentation and did a few tests by reducing the 100 model test by 5 models at a time. It ended up varying between 65 and 85 models rendered that caused the problem (long since solved), so I chose to remove DX commands one at a time. That's one of the reasons it took so long to determine where the problem was. Once I commented out the WORLD transform, I could render 1000 meshes at approx 50 ticks per Present(), which is quite normal. When I added it back in, the spiking reappeared. One thing I really have to watch out for is the amount of time a scene takes to render, since I require a minimum of 30 fps if at all possible, even on slower single core systems. I have 4 computers and the results were the same on all machines: WORLD transforms cause the delay in Present().

BTW, MS never really admits this is the cause; they hint at it (MSDN can be vague at times), so I had to hunt it down. As it stands now, my entire game loop runs in less than 8k ticks per loop. I can provide the profiler class if you want to use it, just let me know. The only problem is that it uses my file IO class to save the log, which I won't release (I've been building it over the past 12 years and I'm unwilling to put it in the public domain since it contains my encryption routines). This means you would have to build your own logger and saving code...

Anyways, I'm just going by what he posted where the delay was happening, after 10 years of writing DX apps, this is the only thing I have found that causes the delay in Present().

##### Share on other sites
All right then ... it seems I've finally determined the exact cause :D

As my first profiling indicated, some nvidia-dll needed a lot of time, and when Adam said something about fillrate and overdraw, I started thinking.

Something I didn't mention, as I was really tired while typing my previous posts, was that the planes I was so eager to render all had an alpha channel which I used for alpha testing.
Nothing wrong with that, except that I'm using a deferred renderer, and the way I implemented alpha testing was by overwriting the depth value (something I did when I was debugging earlier on the same evening).

So, after surfing a bit on the internet I found an article about Tabula Rasa (GPU Gems 3), and it mentioned a mysterious 'clip' command. Well, the mystery was quickly solved.
My frame rate still drops when I'm rendering 200 alpha-tested planes, but a lot less than before.
So I'm wondering now ... what exactly does Present(0, 0, 0, 0) do? (I always thought it simply filled the backbuffer with the frontbuffer, which I thought would take linear time.)

P.S:
I'm just not sure if the clip command is the way to go on this one? (The final game will have a lot of parallax-placed alpha-tested background sprites.)

I know this question should go into another topic, so maybe I'll make one later on today :-)

##### Share on other sites
You might find using D3DRS_ALPHATESTENABLE (with the appropriate D3DRS_ALPHAREF and D3DRS_ALPHAFUNC) works faster than clip(), although it's not as flexible.

You can also get big savings by making your geometry not just simple quads - you can completely avoid drawing things that are transparent that way. For example a circular transparent texture on a quad will be slower than one on an octagon.
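To get a feel for the size of that saving, here is a small sketch that compares how much of a polygon's area lies outside a circle it encloses. It assumes the visible part of the texture is a circle that touches every edge of the polygon, which is an idealization; the function names are hypothetical.

```cpp
#include <cmath>

// Area of a regular n-gon circumscribed about a circle of the given radius,
// i.e. the smallest regular n-gon that fully contains the circle.
inline double circumscribedPolygonArea(int sides, double radius) {
    const double pi = 3.14159265358979323846;
    return sides * radius * radius * std::tan(pi / sides);
}

// Fraction of the polygon's area that lies outside the circle. For a circular
// sprite these are fully transparent pixels that still get rasterized.
inline double wastedFraction(int sides, double radius) {
    const double pi = 3.14159265358979323846;
    const double polygon = circumscribedPolygonArea(sides, radius);
    return (polygon - pi * radius * radius) / polygon;
}
```

Under that idealization, `wastedFraction(4, 1.0)` is about 0.21 and `wastedFraction(8, 1.0)` is about 0.05, so roughly a fifth of the quad's pixels are fully transparent, compared with about a twentieth for the octagon.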

What Present() does is:

1. Wait for the GPU to finish processing all the commands you've sent it (if the GPU is what's limiting the frame rate).
2. Wait for the vertical blank (if you enabled it in the present parameters).
3. Swap the front and back buffers.

The first two steps can take some time, the third one is almost instant.
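The first step explains why a timer wrapped around Present() can end up reporting the GPU's workload. The sketch below is a toy timeline model of that effect, not Direct3D code: it follows the simplified description above, whereas real drivers may buffer more than one frame. All names and per-draw costs are hypothetical.

```cpp
#include <algorithm>

// Toy timeline model of one frame. It only mimics a CPU that queues draw
// calls and a GPU that consumes them asynchronously.
struct FrameTiming {
    double cpuSubmitMs;    // CPU time spent issuing draw calls
    double presentWaitMs;  // time the CPU blocks in the present step
    double frameMs;        // total frame time
};

inline FrameTiming simulateFrame(int drawCalls, double cpuMsPerDraw, double gpuMsPerDraw) {
    double cpuClock = 0.0;  // time at which the CPU has queued the current draw
    double gpuClock = 0.0;  // time at which the GPU has finished the current draw
    for (int i = 0; i < drawCalls; ++i) {
        cpuClock += cpuMsPerDraw;
        // The GPU starts a draw once it is queued and the previous one is done.
        gpuClock = std::max(gpuClock, cpuClock) + gpuMsPerDraw;
    }
    // The present step cannot return before the GPU has drained the queue.
    const double wait = std::max(0.0, gpuClock - cpuClock);
    return {cpuClock, wait, cpuClock + wait};
}
```

For example, `simulateFrame(200, 0.01, 1.0)` models 200 draws that each cost an assumed 0.01 ms to submit and 1 ms to execute: the CPU spends about 2 ms submitting and then waits about 198 ms in the present step, even though presenting itself does almost no work.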

##### Share on other sites
All right, thanks for explaining.
It seems alpha testing does indeed work better. I didn't know if it would work on MRTs, but all's fine.

Thanks again