# Render, Update, then flip the pancake?

## Recommended Posts

A while ago I think I remember reading a thread about keeping the CPU and GPU working in parallel. I was interested in a remark, but didn't have time to look into it. Someone said that you should push all the rendering onto the GPU, then do your CPU work, and then flip the buffers. This is because when you flip the buffers, the call has to wait for rendering to finish. Flipping immediately after rendering means you stall while the frame draws anyway, so doing the CPU work between the render and the flip hides that wait. Now that I think about it, it seems very practical.
Each frame:
- Render the scene
- Update the scene
- Update all the objects
- Re-organize scenegraph for optimization (early z fail and all that)
- Flip buffers

Am I on the right track? edit: darn tags

##### Share on other sites
I was under the impression the best order was simply:

Each frame:

- Update the scene

- Update all the objects

- Re-organize scenegraph for optimization (early z fail and all that)

- Render the scene

- Flip buffers

I don't think flipping the buffers will cause a huge stall and break parallelism (I assume you're just doing the Present() call here). The GPU will just buffer up what it's working on and carry on regardless. What breaks parallelism is when you force a readback or render target switch, then it has to finish rendering before it can give you any data.
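That distinction can be shown with a toy model (entirely made up for illustration, not a real driver API): a present call just joins the buffered command stream and returns, while a readback has to drain everything queued before it can hand back data.

```cpp
#include <cassert>
#include <queue>
#include <string>

// Toy model of a driver-side command queue. Present() merely enqueues a
// flip marker and returns; a readback forces the whole queue to drain
// (the "GPU" must finish) before any data can come back.
class FakeDriverQueue {
public:
    int drains = 0;                              // forced flushes so far

    void draw(const std::string& cmd) { pending.push(cmd); }

    void present() { pending.push("FLIP"); }     // buffered, no stall

    int readback() {                             // must finish queued work first
        while (!pending.empty()) pending.pop();  // "GPU" drains the queue
        ++drains;
        return 42;                               // pretend pixel data
    }

    std::size_t queued() const { return pending.size(); }

private:
    std::queue<std::string> pending;
};
```

Submitting a couple of draws and a present leaves three commands buffered with no drain; the first readback empties the queue, which is the stall Mezz is describing.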

Edit: If you do a glFinish() call you might stall, I'm not too familiar with OpenGL anymore... I've moved to D3D in the past year ;)

-Mezz

##### Share on other sites
If you render before, then you do your calculations in-synch for the second half of the frame. If you render after, then you do the updates in-synch at the beginning of the next frame. Same difference.

##### Share on other sites
It's left up to the graphics card driver to decide, but all drivers I know of queue the flip call if they can. What "if they can" means, is that if your other code is so fast that there's already one or two frames buffered, it has no choice but to wait until one of them clears out before continuing. Note that this is separate from the issue of double and triple buffering, and relates specifically to calls that have been batched in preparation for sending to the GPU.

What you should do, basically, is not worry about it. Call flip() right after you finish rendering. If it blocks, it's because you don't have anything to worry about WRT framerate. If you want to be double-sure that your game runs smoothly even with the interaction of non-graphics code, use triple buffering (it really does help).
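A sketch of why the extra back buffer helps, under a simplifying assumption (made up for illustration): the display never retires a frame during the measurement, and the CPU may queue up to one fewer completed frame than there are buffers before a flip would block.

```cpp
#include <cassert>

// Toy swap-chain model: with N buffers, the CPU can queue up to N-1
// completed frames ahead of the one on screen; one more flip blocks.
struct FakeSwapChain {
    int buffers;          // 2 = double buffering, 3 = triple buffering
    int queuedFrames = 0;

    // Returns true if the flip was absorbed without blocking.
    bool tryFlip() {
        if (queuedFrames < buffers - 1) { ++queuedFrames; return true; }
        return false;     // would have to wait for vsync to retire a frame
    }
};

// How many frames can the CPU run ahead before a flip blocks?
int runAhead(int buffers) {
    FakeSwapChain chain{buffers};
    int frames = 0;
    while (chain.tryFlip()) ++frames;
    return frames;
}
```

With double buffering the CPU gets one frame of slack; with triple buffering it gets two, which is exactly the cushion that smooths out a spike in non-graphics code.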

##### Share on other sites
Quote:
 Original post by Mezz
 What breaks parallelism is when you force a readback or render target switch, then it has to finish rendering before it can give you any data.

That's a good point. So in OpenGL, when I switch contexts to a new pbuffer, the first one has to finish? That's definitely something to watch out for. Thanks.

Quote:
 If you render before, then you do your calculations in-synch for the second half of the frame. If you render after, then you do the updates in-synch at the beginning of the next frame. Same difference.

No, what I was describing was that the SwapBuffers() call waits until all rendering is finished, then swaps. If you did Update, Render, Swap, then you would push everything to the graphics card, wait for it to render, then calculate everything while the graphics card does nothing, then wait to render again, and so on. Basically, it's because you are swapping right after rendering.

But it looks like I don't have to worry about it anyway.

Quote:
 It's left up to the graphics card driver to decide, but all drivers I know of queue the flip call if they can

Alright, that makes me feel a little better. Should I still worry about switching contexts? Each frame I might have to render to a couple pixel buffers, which means switching contexts.

Not quite sure if I need to deal with triple buffering. Might look into that later. Thanks though.

##### Share on other sites
Here's what you have to do.

*** Initial Setup ***
Compute all physics, AI, etc. states for the first frame

*** Game loop ***
1. Render the scene
2. Compute all physics, AI, etc. states for the *next* frame
3. Flip (blocks until all triangles in rendering queue are rendered)

From the D3D documentation (this has little to do with the API; it's the same for OpenGL): "To enable maximal parallelism between the CPU and the graphics accelerator, it is advantageous to call IDirect3DDevice9::EndScene as far ahead of calling present as possible." In this context "present" is the same thing as "flip". So, one must render the scene between the BeginScene() and EndScene() calls, do all CPU processing for the *next* frame, and then call Present/SwapBuffers/Flip/whatever.
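The payoff of that ordering can be sketched by standing in for the GPU with a worker thread (a simulation only: the sleep durations are arbitrary assumptions, and a real GPU runs in parallel without any extra CPU thread):

```cpp
#include <cassert>
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;
using std::chrono::milliseconds;

// Stand-ins for GPU rendering and CPU game logic; the 20 ms sleeps are
// an assumption chosen only to make the overlap measurable.
void renderFrame() { std::this_thread::sleep_for(milliseconds(20)); }
void updateFrame() { std::this_thread::sleep_for(milliseconds(20)); }

// Naive ordering: update, then render, with no overlap at all.
long long runSerialMs(int frames) {
    auto t0 = Clock::now();
    for (int i = 0; i < frames; ++i) { updateFrame(); renderFrame(); }
    return std::chrono::duration_cast<milliseconds>(Clock::now() - t0).count();
}

// CoffeeMug's ordering: kick off rendering of frame N, compute frame
// N+1 on the CPU while the "GPU" works, then flip (here, join).
long long runOverlappedMs(int frames) {
    auto t0 = Clock::now();
    updateFrame();                        // initial setup: state for frame 0
    for (int i = 0; i < frames; ++i) {
        std::thread gpu(renderFrame);     // 1. render the scene
        updateFrame();                    // 2. compute the *next* frame
        gpu.join();                       // 3. flip: wait for rendering
    }
    return std::chrono::duration_cast<milliseconds>(Clock::now() - t0).count();
}
```

With equal CPU and "GPU" cost per frame, the overlapped loop should take roughly half the wall time of the serial one, which is the whole point of computing the next frame between EndScene and Present.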

##### Share on other sites
It doesn't really matter.

Most drivers work by caching up a certain number of commands. Some commands may force a queue flush, and of course the card can force a flush as well. A driver may flush 5-10 times (or more) per frame. It might only flush once. It all depends on the driver and the commands you're issuing to it.

Every flush usually requires swapping between kernel and user mode.

It's very, very hard to be 100% sure when your video card is or is not waiting for commands. You can guess, but you'll probably be wrong.

The best solution is to create a separate thread for rendering, apart from everything else. That's not exactly easy, though. You could try simply rendering as you do your updates, but that's likely to create ugly code.

You might gain some SLIGHT advantage by doing your physics updates between EndScene() and Present(), but not necessarily. If EndScene() only had 1 command queued up, you just wasted a bunch of clock cycles for no good reason.

##### Share on other sites
Quote:
 Original post by Etnu
 It doesn't really matter.

Not only does it matter, it's critical to the engine's performance. Simply rearranging your game loop can improve your performance dramatically and allow you to push more polys than you could ever hope for.
Quote:
 Original post by Etnu
 Most drivers work by caching up a certain number of commands. Some commands may force a queue flush, and of course the card can force a flush as well.

In general, it's very well defined which commands may cause a queue to flush, regardless of the driver.
Quote:
 Original post by Etnu
 Every flush usually requires swapping between kernel and user mode.

That's irrelevant. Many Win32 API functions require swapping between kernel and user mode. This isn't the performance bottleneck.
Quote:
 Original post by Etnu
 It's very, very hard to be 100% sure when your video card is or is not waiting for commands.

The setup I described above pretty much guarantees your video card is not waiting for commands (if you manage your locks properly, of course).
Quote:
 Original post by Etnu
 The best solution is to create a separate thread for rendering, apart from everything else.

It is not the best solution. A separate rendering thread is unnecessary.

EDIT: You'll only benefit from a separate rendering thread if your GPU finishes its work much faster than your CPU. You can then increase your framerate by rendering the same (unchanged) frame more than once. However, if this is the case your engine is not properly balanced. It is CPU limited and you could safely increase the load on the GPU. If it is not the case, a rendering thread will give you no benefit. In both situations adding another thread is the wrong way to go.
Quote:
 Original post by Etnu
 You might gain some SLIGHT advantage by doing your physics updates between EndScene() and Present(), but not necessarily.

The bottom line is that there should be no CPU-intensive code between BeginScene() and EndScene(). By the time you call BeginScene(), all visibility, AI and physics information for the frame should be calculated. Once BeginScene() is called, you simply iterate through the queue of objects you need to render and push them to the GPU. This way you guarantee the GPU isn't waiting on anything. Once EndScene() is called, the GPU is busy rendering while you calculate CPU-intensive information for the next frame.
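A minimal sketch of such a pre-built render queue (the `material` field is a hypothetical stand-in for whatever render state a SetTexture/SetRenderState call would change): the CPU-heavy work, sorting by state, happens before submission, so the submit loop itself is a dumb iteration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One queued draw call. 'material' stands in for render state
// (texture, shader, etc.); 'mesh' identifies what to draw.
struct DrawItem { int material; int mesh; };

// All CPU-heavy work (here, sorting by state) happens *before* the
// BeginScene/EndScene pair.
void buildQueue(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return a.material < b.material;
              });
}

// Tight submission loop; returns how many state changes it would issue.
int submit(const std::vector<DrawItem>& items) {
    int stateChanges = 0, current = -1;
    for (const DrawItem& it : items) {
        if (it.material != current) { ++stateChanges; current = it.material; }
        // DrawPrimitive(it.mesh) would go here
    }
    return stateChanges;
}
```

Sorting the queue ahead of time also groups identical states together, so the GPU gets a steady stream of draws with fewer expensive state switches in between.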
Quote:
 Original post by Etnu
 If EndScene() only had 1 command queued up, you just wasted a bunch of clock cycles for no good reason.

If you have no CPU intensive calculations between BeginScene() and EndScene(), unless you render only a few polygons you'll never have only one command queued up.

##### Share on other sites
Quote:
 Not only does it matter, it's critical to the engine's performance. Simply rearranging your game loop can improve your performance dramatically and allow you to push more polys than you could ever hope for.

Sometimes. There are always other things to consider.

Quote:
 In general, it's very well defined which commands may cause a queue to flush, regardless of the driver.

Any command can force the command queue to flush, once it's full. The number of commands the queue can hold is dictated by the runtime. Some commands are guaranteed to (like creating a vertex buffer), but many more (such as Locking a buffer) have no guarantees.

Quote:
 That's irrelevant. Many Win32 API functions require swapping between kernel and user mode. This isn't the performance bottleneck.

Microsoft's own documentation specifically points to the swap as being one of the most intensive tasks that can be done. A typical swap to or from kernel mode costs about 5,000 clock cycles. That means every round trip into the driver and back costs about 10,000 clocks.

Quote:
 The setup I described above pretty much guarantees your video card is not waiting for commands (if you manage your locks properly, of course).

The setup you described guarantees nothing, unless you wrote the driver yourself, and write your code to be exactly sure of when the optimal time to do work will be.

Quote:
 EDIT: You'll only benefit from a separate rendering thread if your GPU finishes its work much faster than your CPU. You can then increase your framerate by rendering the same (unchanged) frame more than once. However, if this is the case your engine is not properly balanced. It is CPU limited and you could safely increase the load on the GPU. If it is not the case, a rendering thread will give you no benefit. In both situations adding another thread is the wrong way to go.

You can most certainly gain from a rendering thread in a separate loop, as it's the only way to be 100% sure that the thread is not wasting clock cycles. Yes, it's harder to program for multithreading. It's not impossible, though.

Quote:
 The bottom line is that there should be no CPU-intensive code between BeginScene() and EndScene(). By the time you call BeginScene(), all visibility, AI and physics information for the frame should be calculated. Once BeginScene() is called, you simply iterate through the queue of objects you need to render and push them to the GPU. This way you guarantee the GPU isn't waiting on anything. Once EndScene() is called, the GPU is busy rendering while you calculate CPU-intensive information for the next frame.

Like I said, EndScene() isn't necessarily guaranteed to be doing anything, except for flushing the queue. How much work is accomplished here depends entirely on how full the command queue is at that point in time.

Quote:
 If you have no CPU intensive calculations between BeginScene() and EndScene(), unless you render only a few polygons you'll never have only one command queued up.

Incorrect. If the command queue gets flushed because of the last call to DrawPrimitive() (which it can), EndScene() does absolutely nothing (literally; if the command queue is empty, it returns immediately).

You can test this yourself with a profiler if you don't believe me. Why do you think sometimes commands like SetRenderState() and DrawPrimitive() take thousands of clocks? They're supposed to return immediately! That's correct, they are -- but not if the command queue is full and needs to be flushed.

Read the SDK documentation if you don't believe me on this one. It outlines, far better than I could explain, that there is no way to be sure when the card is busy and when it's not.

##### Share on other sites
Hmm...

I'm not really understanding the arguments being made here.
CoffeeMug's loop looked good, and still does. It maintains maximum parallelism without multithreading (which doesn't really gain you much in this case, unless you've got bad parallelism in the first place).

I think it's irrelevant which commands cause the driver's command buffer to flush within a BeginScene/EndScene: since you had to issue the command anyway, you just have to eat the performance hit; there isn't any way around that.

It's not whether your command buffer on the user mode side is full or not, it's the one that the driver & card are working on that matters.

Think of it like this:

BeginScene.

Some API calls.
Queue Flushes.
Card starts work.
Some more API calls.
Queue Flushes.
Card carries on working.

EndScene.

So your EndScene has done nothing, but so what - the card is still working on the two queues you did give it during the rendering. If you're on your first frame and you call Present() now, the card has to stall and finish the work it's doing before flipping the buffers. If you do some CPU work like logic/physics/AI here, i.e. your game update, then you give the card a chance to actually finish what it's working on before calling Present(). The docs back this up, and IHVs do as well.

-Mezz

##### Share on other sites
Quote:
 Original post by Etnu
 Any command can force the command queue to flush, once it's full.

Yeah, but that shouldn't normally happen. If it does, you're not using the API correctly. Normally, if you're overflowing the queue you should increase the batch sizes and decrease the number of commands.
Quote:
 Original post by Etnu
 Microsoft's own documentation specifically points to the swap as being one of the most intensive tasks that can be done.

Quote:
 Original post by Etnu
 The setup you described guarantees nothing, unless you wrote the driver yourself, and write your code to be exactly sure of when the optimal time to do work will be.

Not according to this presentation. There's a lot more information scattered on NVidia and ATI sites. I'll try to find more links tonight or tomorrow.
Quote:
 Original post by Etnu
 You can most certainly gain from a rendering thread in a separate loop, as it's the only way to be 100% sure that the thread is not wasting clock cycles.

Sorry, I still don't see the benefit. The thread isn't waiting for which clock cycles: CPU or GPU? If your GPU pipeline stalls, it doesn't really matter whether you use a separate thread to issue commands: you have to wait until the pipeline drains before you can continue filling it up. Can you clarify the benefit of a separate thread?
Quote:
 Original post by Etnu
 Read the SDK documentation if you don't believe me on this one. It outlines, far better than I could explain, that there is no way to be sure when the card is busy and when it's not.

There is no way to guarantee the GPU isn't waiting for the CPU. However, "guarantee" is a really strong word. You can be reasonably sure the GPU isn't idle if you spend enough time profiling and ironing out the bottlenecks.

[Edited by - CoffeeMug on August 4, 2004 8:22:17 AM]

##### Share on other sites
Quote:
 Original post by CoffeeMug
 Yeah, but that shouldn't normally happen. If it does, you're not using the API correctly. Normally, if you're overflowing the queue you should increase the batch sizes and decrease the number of commands.

That's not necessarily true; it's quite possible on modern hardware for the command queue to flush 2-3 times in a single iteration of the render loop, even with optimal batching. Of course, that does depend on your data.

Quote:

2k4 SDK Docs -> DirectX Graphics -> Advanced Topics -> Accurately Profiling Direct3D API Calls.

Quote:
 Sorry, I still don't see the benefit. The thread isn't waiting for which clock cycles? CPU or GPU? If your GPU pipeline stalls, it doesn't really matter if you use a separate thread to issue commands: you have to wait until the pipeline is rendered until you can continue filling it up. Can you clarify the benefit of a separate thread?

Simple; you never have to worry about what the GPU is doing, and can have as high a resolution as you'd like within your physics / input / sound code, completely independent of your rendering loop. The fact of the matter is that there will always be lost cycles when dealing with D3D calls, because, again, you may encounter things like flushes happening in the middle of the loop. A profiler will quickly show you that this happens.

Quote:
 There is no way to guarantee the GPU isn't waiting for the CPU. However, "guarantee" is a really strong word. You can be reasonably sure the GPU isn't idle if you spend enough time profiling and ironing out the bottlenecks.

Yes, if you spend a lot of time piddling around with your timing and writing code to handle it, you can be "reasonably" sure. Alternatively, you can spend 2 hours and be 100% sure it's not waiting.

##### Share on other sites
Yes, modern hardware can flush the command queue multiple times a frame. This isn't a bad thing, because it allows the GPU to get a start actually doing some work. However, while the user-kernel mode swap is around 5000 cycles, is it really a performance bottleneck?

I read the documentation Etnu pointed at and found this:

Quote:
 DrawPrimitive = kernel-transition + driver work + user-transition + runtime work
 DrawPrimitive = 5000 + 935,000 + 2750 + 5000 + 900
 DrawPrimitive = 947,950

So, while it's true that around 10,000 cycles are spent on those transitions, that's only about 1% of the total work being done.
I don't see how this could ever be a problem unless your transitions outweighed the work being done by the driver by some considerable amount, but in that case you're doing something drastically wrong anyway.
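Checking that arithmetic with the quoted figures (treating the two ~5,000-cycle mode transitions as pure overhead against the call's total cost):

```cpp
#include <cassert>

// Fraction of a call's total cycle cost spent on kernel/user-mode
// transitions, as a percentage. Figures come from the quoted SDK docs.
double transitionOverheadPercent(long long transitionCycles,
                                 long long totalCycles) {
    return 100.0 * static_cast<double>(transitionCycles)
                 / static_cast<double>(totalCycles);
}
```

Two 5,000-cycle transitions against the quoted ~947,950-cycle total comes out to roughly 1%, which is the proportion claimed above.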

Is this why you're proposing multithreading, for cases where this flush occurs and you can regain performance by going to the other thread and doing some work there?

-Mezz

##### Share on other sites
Quote:
Original post by Mezz
Yes, modern hardware can flush the command queue multiple times a frame. This isn't a bad thing, because it allows the GPU to get a start actually doing some work. However, while the user-kernel mode swap is around 5000 cycles, is it really a performance bottleneck?

I read the documentation Etnu pointed at and found this:

Quote:
 DrawPrimitive = kernel-transition + driver work + user-transition + runtime work
 DrawPrimitive = 5000 + 935,000 + 2750 + 5000 + 900
 DrawPrimitive = 947,950

So, while it's true that around 10,000 cycles are spent on those transitions, that's only about 1% of the total work being done.
I don't see how this could ever be a problem unless your transitions outweighed the work being done by the driver by some considerable amount, but in that case you're doing something drastically wrong anyway.

Is this why you're proposing multithreading, for cases where this flush occurs and you can regain performance by going to the other thread and doing some work there?

-Mezz

Right, but you're waiting for the driver to process all those commands. A separate thread could easily be executing (especially on multiprocessor / hyperthreaded systems).

##### Share on other sites
Quote:
 Original post by Etnu
 Right, but you're waiting for the driver to process all those commands. A separate thread could easily be executing (especially on multiprocessor / hyperthreaded systems).

But even if the call to the driver is on a different thread, the driver is eating up CPU cycles, so you're waiting no matter which thread the call is made on. The only place this will help is on a multiple-CPU system (and hyperthreaded CPUs, although to a much lesser extent).

EDIT: Also, the point in that paper about the time taken by the mode transition is with regard to accurate CPU profiling and *not* parallelisation with the GPU.

[Edited by - joanusdmentia on August 5, 2004 5:43:28 AM]

##### Share on other sites
Quote:
 Original post by joanusdmentia
Quote:
 Original post by Etnu
 Right, but you're waiting for the driver to process all those commands. A separate thread could easily be executing (especially on multiprocessor / hyperthreaded systems).

 But even if the call to the driver is on a different thread, the driver is eating up CPU cycles, so you're waiting no matter which thread the call is made on. The only place this will help is on a multiple-CPU system (and hyperthreaded CPUs, although to a much lesser extent).

No, you've still got normal OS multitasking going on, which means that, while the worker threads may perform somewhat slower when a heavy load is on the main thread, they'll still get SOME timeslice, meaning that updates will continue to happen even though the thread is waiting for a call to return.

Quote:
 EDIT: Also, the point in that paper about the time taken by the mode transition is with regard to accurate CPU profiling and *not* parallelisation with the GPU.

Yes, but the cost of calls and how D3D works is still the same; they were talking about accurately measuring the amount of time a specific call takes, but they "accidentally" pointed out where code execution slows down.

##### Share on other sites
I'm still not sure I follow you Etnu - regardless of what thread the work is done in, the work still has to be done, so overall will it not just take the same amount of time?

I can only think of one example in which allowing another thread working time would be beneficial, and that's data streaming (music/large worlds).

Unless you want to be doing 'update' type work (AI/physics etc.) while in the middle of rendering? I'm not sure that's a good idea (unless absolutely everything is separated, with no data shared whatsoever).

-Mezz

##### Share on other sites
Quote:
 Original post by Etnu
 No, you've still got normal OS multitasking going on, which means that, while the worker threads may perform somewhat slower when a heavy load is on the main thread, they'll still get SOME timeslice, meaning that updates will continue to happen even though the thread is waiting for a call to return.

I don't see how this would increase parallelism with the GPU, though. When the driver flushes, that is CPU work transferring the commands to the GPU; the CPU isn't sitting idly waiting for the GPU (or am I wrong on this?). So the only thing you'd be achieving is artificially increasing your FPS while slightly reducing the number of times the scene is updated per second. Something along the lines of this:
        update          render
          |                |
          |          render frame N
        update          -------
        frame N+1          |
          |          render frame N
         \|/              \|/

Sure, your render thread isn't waiting for the next frame to be updated, but you're just redrawing the same frame again.

##### Share on other sites
Calculations() first, then Render():

Average render time in ms:
9.249394
Average calculations time in ms:
0.203390

Render() first, then Calculations():

Average render time in ms:
9.175182
Average calculations time in ms:
0.014599

Take note of the average calculations time.

This isn't just a one-off, I tried the test several times under the same conditions. The interesting thing is that Render() always has the flip at the end of it, so this may or may not be relevant to this topic. Can anyone explain why my calculations are 10x faster if I Render() first?

##### Share on other sites
Quote:
 Original post by red_sodium
 This isn't just a one-off, I tried the test several times under the same conditions. The interesting thing is that Render() always has the flip at the end of it, so this may or may not be relevant to this topic. Can anyone explain why my calculations are 10x faster if I Render() first?

You're doing something wrong when timing. :)
If your calculations are doing exactly the same work (and are presumably entirely on CPU) then they should take exactly the same time (give or take variance due to task switching, etc) no matter where you do your rendering, and even if there is *no* rendering. What we are talking about is the CPU time spent in the rendering code.

EDIT: Stupid me, that calculation timing is in milliseconds, not seconds :)
Ignore the difference, it's bugger all. Try doing some real calculations and then do your test again.
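For reference, a minimal timing harness of the sort being discussed, using std::chrono (a sketch under stated assumptions, not red_sodium's actual code): it averages the wall time of a section over a number of frames, in milliseconds.

```cpp
#include <cassert>
#include <chrono>

// Average wall-clock time of 'section' over 'frames' runs, in ms.
// steady_clock is used because it is monotonic (unaffected by clock
// adjustments), which matters for frame timing.
template <typename Fn>
double averageMs(Fn&& section, int frames) {
    using Clock = std::chrono::steady_clock;
    auto t0 = Clock::now();
    for (int i = 0; i < frames; ++i) section();
    std::chrono::duration<double, std::milli> total = Clock::now() - t0;
    return total.count() / frames;
}
```

Timing over many frames and dividing, rather than timing a single run, averages out the task-switching noise mentioned above; at millisecond scale a single sample can easily be off by the variance joanusdmentia describes.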

##### Share on other sites
Wow, >= 9/10ths of my Render() time goes to flipping!!!

Also, if I flip at the end of the loop instead of in Render(), my Render() time stays the same but my Calculations() time goes up again.

Where do you presume I'm timing wrong, then? You are right, the difference is bugger all, but I want to know why it changed.

##### Share on other sites
Can you post some pseudo code showing how you do timing?
BTW, if your GPU time is much larger than CPU time, you aren't balanced. You could safely increase the accuracy of the physics engine, improve the AI, etc. until both timings are approximately the same.

Etnu: sorry, I still fail to see your logic. I guess the only way is to try it. If I have some extra time I'll give it a try but I'm pretty sure your reasoning is faulty here.

##### Share on other sites
Hi everybody.
Present should stall the CPU if you're synchronised to the screen refresh rate, so I don't see why the update of the game logic would be parallelised if it's done after the swap buffers.
I'm developing a game for PlayStation 2 & Xbox, and I can say that parallelism is one of the most important things for decent performance.
Microsoft has made a tool called PIX which shows how the parallelism works, where it breaks, etc...
If you ever have the luck to try it, you will learn a lot of things. It will be available soon for PC.
Don't forget to look at the papers from ATI & NVidia; they are very important, and I've never seen them advise multithreading. It won't help you get better parallelism, and I don't see why it would be advantageous to draw the same frame twice.
Ron

##### Share on other sites
Quote:
 Original post by CoffeeMug
 I guess the only way is to try it. If I have some extra time I'll give it a try but I'm pretty sure your reasoning is faulty here.

Yeah, that's really the only way to answer these kind of questions. I won't be able to test it for a little while, as I'm still designing/coding some of the major parts of my engine. But I wanted to hear others' advice on the matter. I look forward to running some tests.

##### Share on other sites
Quote:
 Original post by Anonymous Poster
 Don't forget to look at the papers from ATI & NVidia; they are very important, and I've never seen them advise multithreading. It won't help you get better parallelism, and I don't see why it would be advantageous to draw the same frame twice.

Yeah, I fail to see the benefit multithreading would give as well.
Quote:
 Original post by okonomiyaki
 I look forward to running some tests.

Want to place some bets on whether multithreading will improve your performance? Fifty bucks says it won't [smile]
