Render, Update, then flip the pancake?

Started by
25 comments, last by Etnu 19 years, 9 months ago
A while ago I remember reading a thread about keeping the CPU and GPU working in parallel. One remark interested me, but I didn't have time to look into it at the time. Someone said that you should push all the rendering onto the GPU, then do your CPU work, and then flip the buffers. This is because when you flip the buffers, the flip has to wait for rendering to finish. So any CPU work you do during that wait is essentially free, while doing it anywhere else just makes the frame take longer to display, because the flip sits and waits while the card draws anyway. Now that I think about it, it seems very practical.

Each frame:
- Render the scene
- Update the scene
     - Update all the objects
     - Re-organize scenegraph for optimization (early z fail and all that)
- Flip buffers
Am I on the right track? edit: darn tags
I was under the impression the best order was simply:

Each frame:
- Update the scene
     - Update all the objects
     - Re-organize scenegraph for optimization (early z fail and all that)
- Render the scene
- Flip buffers

I don't think flipping the buffers will cause a huge stall and break parallelism (I assume you're just doing the Present() call here). The GPU will just buffer up what it's working on and carry on regardless. What breaks parallelism is when you force a readback or render target switch; then it has to finish rendering before it can give you any data.

Edit: If you do a glFinish() call you might stall, I'm not too familiar with OpenGL anymore... I've moved to D3D in the past year ;)

-Mezz
If you render before, then you do your calculations in sync for the second half of the frame. If you render after, then you do the updates in sync at the beginning of the next frame. Same difference.
It's left up to the graphics card driver to decide, but all drivers I know of queue the flip call if they can. What "if they can" means is that if your other code is so fast that there are already one or two frames buffered, the driver has no choice but to wait until one of them clears out before continuing. Note that this is separate from the issue of double and triple buffering; it relates specifically to calls that have been batched in preparation for sending to the GPU.

What you should do, basically, is not worry about it. Call flip() right after you finish rendering. If it blocks, it's because you don't have anything to worry about WRT framerate. If you want to be double-sure that your game runs smoothly even with the interaction of non-graphics code, use triple buffering (it really does help).
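For reference, this is roughly how you'd ask for triple buffering when setting up a D3D9 device. Just a sketch -- the resolution and formats are placeholders for whatever you normally use; the key line is BackBufferCount = 2 (two back buffers plus the front buffer):

D3DPRESENT_PARAMETERS d3dpp = {0};
d3dpp.Windowed               = FALSE;
d3dpp.SwapEffect             = D3DSWAPEFFECT_DISCARD;
d3dpp.BackBufferCount        = 2;                 // 2 back buffers + front buffer = triple buffering
d3dpp.BackBufferWidth        = 1024;              // placeholder resolution
d3dpp.BackBufferHeight       = 768;
d3dpp.BackBufferFormat       = D3DFMT_X8R8G8B8;
d3dpp.EnableAutoDepthStencil = TRUE;
d3dpp.AutoDepthStencilFormat = D3DFMT_D24S8;
// ...then pass &d3dpp to IDirect3D9::CreateDevice as usual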
Quote:Original post by Mezz
What breaks parallelism is when you force a readback or render target switch; then it has to finish rendering before it can give you any data.
-Mezz


That's a good point. So in OpenGL, when I switch contexts to a new pbuffer, the first one has to finish? That's definitely something to watch out for. Thanks.
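Just so I'm clear on what to watch out for, here's roughly the pattern I had in mind each frame (only a sketch -- the handles and RenderShadowPass() are placeholders, and it assumes the pbuffer and its context were already created via WGL_ARB_pbuffer):

wglMakeCurrent(hPbufferDC, hPbufferRC);   // switching here may force the driver to finish work queued on the previous context
RenderShadowPass();                       // placeholder for whatever gets drawn into the pbuffer
wglMakeCurrent(hWindowDC, hWindowRC);     // back to the window's context for the rest of the frame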

Quote:
If you render before, then you do your calculations in sync for the second half of the frame. If you render after, then you do the updates in sync at the beginning of the next frame. Same difference.

No, what I was describing is that SwapBuffers() waits until all rendering is finished, then swaps. If you did Update, Render, Swap, you would push everything to the graphics card, wait for it to render, then calculate everything while the graphics card does nothing, then wait on rendering again, and so on. Basically, it's because you are swapping right after rendering.

But it looks like I don't have to worry about it anyway.

Quote:
It's left up to the graphics card driver to decide, but all drivers I know of queue the flip call if they can


Alright, that makes me feel a little better. Should I still worry about switching contexts? Each frame I might have to render to a couple pixel buffers, which means switching contexts.

Not quite sure if I need to deal with triple buffering. Might look into that later. Thanks though.
Here's what you have to do.

*** Initial Setup ***
Compute all physics, AI, etc. states for the first frame

*** Game loop ***
1. Render the scene
2. Compute all physics, AI, etc. states for the *next* frame
3. Flip (blocks until all triangles in the rendering queue are rendered)

From the D3D documentation (this has little to do with the API; it's the same for OpenGL): "To enable maximal parallelism between the CPU and the graphics accelerator, it is advantageous to call IDirect3DDevice9::EndScene as far ahead of calling present as possible." In this context "present" is the same thing as "flip". So, you render the scene between the BeginScene() and EndScene() calls, do all CPU processing for the *next* frame, and then call Present/SwapBuffers/Flip/whatever.
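A bare-bones sketch of that loop in D3D9 terms (UpdateGameState() and DrawQueuedObjects() are made-up helpers, and error checking is omitted):

#include <d3d9.h>

void UpdateGameState();                         // hypothetical: physics, AI, visibility for one frame
void DrawQueuedObjects(IDirect3DDevice9* dev);  // hypothetical: just pushes the prepared draw calls

void RunGameLoop(IDirect3DDevice9* device, volatile bool& running)
{
    UpdateGameState();                                     // states for the first frame

    while (running)
    {
        device->Clear(0, NULL, D3DCLEAR_TARGET | D3DCLEAR_ZBUFFER, D3DCOLOR_XRGB(0, 0, 0), 1.0f, 0);
        device->BeginScene();
        DrawQueuedObjects(device);                         // nothing CPU-heavy in here
        device->EndScene();                                // GPU keeps chewing on the queued commands...

        UpdateGameState();                                 // ...while the CPU computes the *next* frame

        device->Present(NULL, NULL, NULL, NULL);           // only blocks if the GPU is still behind
    }
}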
It doesn't really matter.

Most drivers work by caching up a certain number of commands. Some commands may force a queue flush, and of course the card can force a flush as well. A driver may flush 5-10 times (or more) per frame. It might only flush once. It all depends on the driver and the commands you're issuing to it.

Every flush usually requires swapping between kernel and user mode.

It's very, very hard to be 100% sure when your video card is or is not waiting for commands. You can guess, but you'll probably be wrong.

The best solution is to create a separate thread for rendering, distinct from the one that handles everything else. That's not exactly easy, though. You could try simply rendering as you do your updates, but that's likely to create ugly code.
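A very rough idea of what I mean, using plain Win32 threads (sketch only -- RenderScene() is a placeholder, shared scene data needs real synchronization, and the D3D device would have to be created with D3DCREATE_MULTITHREADED or else touched only from the render thread):

#include <windows.h>

void RenderScene();                        // placeholder: issues all draw calls and presents

volatile bool g_running = true;
HANDLE g_frameReady = NULL;                // signalled when the game update has produced a new frame

DWORD WINAPI RenderThreadProc(LPVOID)
{
    while (g_running)
    {
        WaitForSingleObject(g_frameReady, INFINITE);   // sleep until there's something new to draw
        RenderScene();
    }
    return 0;
}

// In the main thread:
//   g_frameReady = CreateEvent(NULL, FALSE, FALSE, NULL);      // auto-reset event
//   CreateThread(NULL, 0, RenderThreadProc, NULL, 0, NULL);
//   then each frame: UpdateGameState(); SetEvent(g_frameReady);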

You might gain some SLIGHT advantage by doing your physics updates between EndScene() and Present(), but not necessarily. If EndScene() only had 1 command queued up, you just wasted a bunch of clock cycles for no good reason.


Quote:Original post by Etnu
It doesn't really matter.

Not only does it matter, it's critical to the engine's performance. Simply rearranging your game loop can improve your performance dramatically and allow you to push more polies than you could ever hope for.
Quote:Original post by Etnu
Most drivers work by caching up a certain number of commands. Some commands may force a queue flush, and of course the card can force a flush as well.

In general, it's very well defined which commands may cause a queue to flush, regardless of the driver.
Quote:Original post by Etnu
Every flush usually requires swapping between kernel and user mode.

That's irrelevant. Many Win32 API functions require swapping between kernel and user mode. This isn't the performance bottleneck.
Quote:Original post by Etnu
It's very, very hard to be 100% sure when your video card is or is not waiting for commands.

The setup I described above pretty much guarantees your video card is not waiting for commands (if you manage your locks properly, of course).
Quote:Original post by Etnu
The best solution is to create a seperate thread to handle rendering than you do for everything else.

It is not the best solution. A separate rendering thread is unnecessary.

EDIT: You'll only benefit from a separate rendering thread if your GPU finishes its work much faster than your CPU. You can then increase your framerate by rendering the same (unchanged) frame more than once. However, if this is the case your engine is not properly balanced. It is CPU limited and you could safely increase the load on the GPU. If it is not the case, a rendering thread will give you no benefit. In both situations adding another thread is the wrong way to go.
Quote:Original post by Etnu
You might gain some SLIGHT advantage by doing your physics updates between EndScene() and Present(), but not necessarily.

The bottom line is that there should be no CPU intensive code between BeginScene() and EndScene(). By the time you call BeginScene() all visibility information, AI and physics information for the frame should be calculated. Once BeginScene() is called, you simply iterate through the queue of objects you need to render and push them to the GPU. This way you guarantee the GPU isn't waiting on anything. Once EndScene() is called, the GPU is busy rendering while you calculate CPU intensive information for the next frame.
Quote:Original post by Etnu
If EndScene() only had 1 command queued up, you just wasted a bunch of clock cycles for no good reason.

If you have no CPU intensive calculations between BeginScene() and EndScene(), unless you render only a few polygons you'll never have only one command queued up.
Quote:
Not only does it matter, it's critical to the engine's performance. Simply rearranging your game loop can improve your performance dramatically and allow you to push more polies than you could ever hope for.


Sometimes. There are always other things to consider.

Quote:
In general, it's very well defined which commands may cause a queue to flush, regardless of the driver.


Any command can force the command queue to flush once it's full. The number of commands the queue can hold is dictated by the runtime. Some commands are guaranteed to flush it (like creating a vertex buffer), but many more (such as locking a buffer) make no guarantees.


Quote:
That's irrelevant. Many Win32 API functions require swapping between kernel and user mode. This isn't the performance bottleneck.


Microsoft's own documentation specifically points to the kernel/user mode transition as one of the most expensive things you can do. A typical switch to or from kernel mode costs about 5,000 clock cycles, so every round trip costs about 10,000 clocks.

Quote:
The setup I described above pretty much guarantees your video card is not waiting for commands (if you manage your locks properly, of course).


The setup you described guarantees nothing, unless you wrote the driver yourself and know exactly when the optimal time to do the work will be.

Quote:
EDIT: You'll only benefit from a separate rendering thread if your GPU finishes its work much faster than your CPU. You can then increase your framerate by rendering the same (unchanged) frame more than once. However, if this is the case your engine is not properly balanced. It is CPU limited and you could safely increase the load on the GPU. If it is not the case, a rendering thread will give you no benefit. In both situations adding another thread is the wrong way to go.


You can most certainly gain from a rendering thread in a separate loop, as it's the only way to be 100% sure that the thread is not wasting clock cycles. Yes, it's harder to program for multithreading. It's not impossible, though.

Quote:
The bottom line is that there should be no CPU intensive code between BeginScene() and EndScene(). By the time you call BeginScene() all visibility information, AI and physics information for the frame should be calculated. Once BeginScene() is called, you simply iterate through the queue of objects you need to render and push them to the GPU. This way you guarantee the GPU isn't waiting on anything. Once EndScene() is called, the GPU is busy rendering while you calculate CPU intensive information for the next frame.



Like I said, EndScene() isn't necessarily guaranteed to be doing anything, except for flushing the queue. How much work is accomplished here depends entirely on how full the command queue is at that point in time.

Quote:
If you have no CPU intensive calculations between BeginScene() and EndScene(), unless you render only a few polygons you'll never have only one command queued up.


Incorrect. If the command queue gets flushed because of the last call to DrawPrimitive() (which can happen), EndScene() does absolutely nothing (literally; if the command queue is empty, it returns immediately).

You can test this yourself with a profiler if you don't believe me. Why do you think sometimes commands like SetRenderState() and DrawPrimitive() take thousands of clocks? They're supposed to return immediately! That's correct, they are -- but not if the command queue is full and needs to be flushed.
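If you want to see it for yourself, just wrap a draw call with QueryPerformanceCounter (sketch only -- 'device' and the primitive count are placeholders, and <windows.h>/<d3d9.h> are assumed to be included):

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);

QueryPerformanceCounter(&t0);
device->DrawPrimitive(D3DPT_TRIANGLELIST, 0, triangleCount);
QueryPerformanceCounter(&t1);

double usec = (double)(t1.QuadPart - t0.QuadPart) * 1000000.0 / (double)freq.QuadPart;
// Most calls come back almost instantly; the ones that happen to trigger a queue flush cost far more.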

Read the SDK documentation if you don't believe me on this one. It outlines, better than I could possibly explain, that there is no way to be sure when the card is busy and when it's not.


Hmm...

I'm not really understanding the arguments being made here.
CoffeeMug's loop looked good, and still does. It maintains maximum parallelism without multithreading (which doesn't really gain you much in this case, unless you've got bad parallelism in the first place).

I think it's irrelevant which commands cause the driver's command buffer to flush within a BeginScene/EndScene pair; since you had to issue the command anyway, you just have to eat the performance hit, and there isn't any way around that.

It's not whether your command buffer on the user-mode side is full or not; it's the one that the driver and card are working on that matters.

Think of it like this:

BeginScene.

Some API calls.
Queue Flushes.
Card starts work.
Some more API calls.
Queue Flushes.
Card carries on working.

EndScene.

So your EndScene has done nothing, but so what? The card is still working on the two batches of commands you did give it during the rendering. If you're on your first frame and you call Present() now, the card has to stall and finish the work it's doing before flipping the buffers. If you do some CPU work like logic/physics/AI here, i.e. your game update, then you give the card a chance to actually finish what it's working on before calling Present(). The docs back this up, and IHVs do as well.

-Mezz

This topic is closed to new replies.
