okonomiyaki

Render, Update, then flip the pancake?


A while ago I think I remember reading a thread about keeping the CPU and GPU working in parallel. I was interested in a remark, but didn't have time to look into it. Someone said that you should push all the rendering onto the GPU, then work the CPU, and then flip the buffers. This is because when you flip the buffers, the flip has to wait for rendering to finish. Therefore, any optimization that tries to take advantage of that wait might even make a frame take longer to display, because the CPU just sits there while the GPU draws anyway. Now that I think about it, it seems very practical.
Each frame:
- Render the scene
- Update the scene
     - Update all the objects
     - Re-organize scenegraph for optimization (early z fail and all that)
- Flip buffers
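
In rough code it would look something like this (a Win32/OpenGL-flavoured sketch; gameRunning, dt, hDC and the function names are just placeholders of mine):

while (gameRunning)
{
    RenderScene();       // queue up all the draw calls for this frame
    UpdateScene(dt);     // CPU work (objects, scenegraph) while the GPU draws
    SwapBuffers(hDC);    // flip last - this is the call that may have to wait
}
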
Am I on the right track?

I was under the impression the best order was simply:

Each frame:
- Update the scene
     - Update all the objects
     - Re-organize scenegraph for optimization (early z fail and all that)
- Render the scene
- Flip buffers

I don't think flipping the buffers will cause a huge stall and break parallelism (I assume you're just doing the Present() call here). The GPU will just buffer up what it's working on and carry on regardless. What breaks parallelism is when you force a readback or a render target switch; then it has to finish rendering before it can give you any data.
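
To give a concrete example of the readback case (a D3D9 sketch; sysmemSurface is assumed to be a D3DPOOL_SYSTEMMEM surface created earlier with CreateOffscreenPlainSurface):

IDirect3DSurface9* rt = NULL;
device->GetRenderTarget(0, &rt);                 // the surface we've been drawing to
device->GetRenderTargetData(rt, sysmemSurface);  // GPU must finish everything queued for rt - the CPU stalls here
D3DLOCKED_RECT lr;
sysmemSurface->LockRect(&lr, NULL, D3DLOCK_READONLY);
// ... read lr.pBits on the CPU ...
sysmemSurface->UnlockRect();
rt->Release();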

Edit: If you do a glFinish() call you might stall; I'm not too familiar with OpenGL anymore, as I've moved to D3D in the past year ;)

-Mezz

If you render before, then you do your calculations in-synch for the second half of the frame. If you render after, then you do the updates in-synch at the beginning of the next frame. Same difference.

It's left up to the graphics card driver to decide, but all drivers I know of queue the flip call if they can. What "if they can" means is that if your other code is so fast that there's already one or two frames buffered, it has no choice but to wait until one of them clears out before continuing. Note that this is separate from the issue of double and triple buffering, and relates specifically to calls that have been batched in preparation for sending to the GPU.

What you should do, basically, is not worry about it. Call flip() right after you finish rendering. If it blocks, it's because you don't have anything to worry about WRT framerate. If you want to be double-sure that your game runs smoothly even with the interaction of non-graphics code, use triple buffering (it really does help).
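
In D3D9, for instance, triple buffering is just a matter of asking for a second back buffer in the present parameters (a sketch, assuming the other fields are whatever your app already uses; under OpenGL on Windows it's usually a driver control panel setting instead):

D3DPRESENT_PARAMETERS pp = {0};
pp.Windowed             = TRUE;
pp.hDeviceWindow        = hWnd;                     // your window handle
pp.SwapEffect           = D3DSWAPEFFECT_DISCARD;
pp.BackBufferFormat     = D3DFMT_UNKNOWN;           // match the current display mode
pp.BackBufferCount      = 2;                        // two back buffers + front buffer = triple buffering
pp.PresentationInterval = D3DPRESENT_INTERVAL_ONE;  // vsync; the extra buffer lets Present() queue instead of stalling right away
// pass &pp to IDirect3D9::CreateDevice (or to Reset)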

Quote:
Original post by Mezz
What breaks parallelism is when you force a readback or a render target switch; then it has to finish rendering before it can give you any data.
-Mezz


That's a good point. So in OpenGL, when I switch contexts to a new pbuffer, the first one has to finish? That's definitely something to watch out for. Thanks.

Quote:

If you render before, then you do your calculations in-synch for the second half of the frame. If you render after, then you do the updates in-synch at the beginning of the next frame. Same difference.

No, what I was describing is that the SwapBuffers() call waits until all rendering is finished, then swaps. If you did Update, Render, Swap, then you would push everything to the graphics card, wait for it to render, and then calculate everything while the graphics card does nothing, then wait to render again, etc. Basically, it's because you are swapping right after rendering.

But it looks like I don't have to worry about it anyway.

Quote:

It's left up to the graphics card driver to decide, but all drivers I know of queue the flip call if they can


Alright, that makes me feel a little better. Should I still worry about switching contexts? Each frame I might have to render to a couple pixel buffers, which means switching contexts.
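
To be concrete, per frame I'd be doing roughly this (a sketch; the pbuffer and its DC/RC are assumed to have been created at startup via WGL_ARB_pbuffer, and the function names are placeholders):

wglMakeCurrent(pbufferDC, pbufferRC);   // switch to the pbuffer's context
RenderToTexturePass();                  // e.g. shadow map or reflection
wglMakeCurrent(windowDC, windowRC);     // switch back to the main window's context
RenderScene();                          // main pass uses the pbuffer result
SwapBuffers(windowDC);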

Not quite sure if I need to deal with triple buffering. Might look into that later. Thanks though.

Here's what you have to do.

*** Initial Setup ***
Compute all physics, AI, etc. states for the first frame

*** Game loop ***
1. Render the scene
2. Compute all physics, AI, etc. states for the *next* frame
3. Flip (blocks until all triangles in rendering queue are rendered)

From the D3D documentation (this has little to do with the API; it's the same for OpenGL): "To enable maximal parallelism between the CPU and the graphics accelerator, it is advantageous to call IDirect3DDevice9::EndScene as far ahead of calling present as possible." In this context "present" is the same thing as "flip". So, one must render the scene between the BeginScene() and EndScene() calls, do all CPU processing for the *next* frame, and then call Present/SwapBuffers/Flip/whatever.
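
In code, a frame would look roughly like this (a minimal D3D9 sketch; device setup, error checking and the UpdateWorld/DrawWorld functions are placeholders):

UpdateWorld(state);                              // initial setup: CPU state for the first frame

while (running)
{
    device->Clear(0, NULL, D3DCLEAR_TARGET | D3DCLEAR_ZBUFFER,
                  D3DCOLOR_XRGB(0, 0, 0), 1.0f, 0);
    device->BeginScene();
    DrawWorld(device, state);                    // 1. nothing but draw calls in here
    device->EndScene();                          //    the GPU now has a full frame queued up

    UpdateWorld(state);                          // 2. physics/AI/etc. for the *next* frame, in parallel with the GPU
    device->Present(NULL, NULL, NULL, NULL);     // 3. flip; only blocks if the GPU is still behind
}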

It doesn't really matter.

Most drivers work by caching up a certain number of commands. Some commands may force a queue flush, and of course the card can force a flush as well. A driver may flush 5-10 times (or more) per frame. It might only flush once. It all depends on the driver and the commands you're issuing to it.

Every flush usually requires swapping between kernel and user mode.

It's very, very hard to be 100% sure when your video card is or is not waiting for commands. You can guess, but you'll probably be wrong.

The best solution is to create a separate thread to handle rendering, apart from the one that does everything else. That's not exactly easy, though. You could try simply rendering when you do your updates, but that's likely to create ugly code.
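
Something along these lines (a bare-bones Win32 sketch; GameState, UpdateGame and RenderGame are placeholder names, and shutdown/error handling is omitted):

#include <windows.h>

struct GameState { /* positions, camera, ... */ };

GameState        g_shared;            // latest state published by the update thread
CRITICAL_SECTION g_lock;
volatile LONG    g_running = 1;

void UpdateGame(GameState& s);        // physics/AI/input - pure CPU (placeholder)
void RenderGame(const GameState& s);  // all D3D calls live in here (placeholder)

DWORD WINAPI UpdateThreadProc(LPVOID)
{
    GameState local;
    while (g_running)
    {
        UpdateGame(local);                 // never blocked by the GPU or the driver
        EnterCriticalSection(&g_lock);
        g_shared = local;                  // publish a snapshot for the renderer
        LeaveCriticalSection(&g_lock);
    }
    return 0;
}

DWORD WINAPI RenderThreadProc(LPVOID)
{
    while (g_running)
    {
        GameState snapshot;
        EnterCriticalSection(&g_lock);
        snapshot = g_shared;               // grab the newest state
        LeaveCriticalSection(&g_lock);
        RenderGame(snapshot);              // if Present() blocks in here, only this thread waits
    }
    return 0;
}

// at startup: InitializeCriticalSection(&g_lock);
//             CreateThread(NULL, 0, UpdateThreadProc, NULL, 0, NULL);
//             CreateThread(NULL, 0, RenderThreadProc, NULL, 0, NULL);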

You might gain some SLIGHT advantage by doing your physics updates between EndScene() and Present(), but not necessarily. If EndScene() only had 1 command queued up, you just wasted a bunch of clock cycles for no good reason.

Quote:
Original post by Etnu
It doesn't really matter.

Not only does it matter, it's critical to the engine's performance. Simply rearranging your game loop can improve your performance dramatically and allow you to push more polies than you could ever hope for.
Quote:
Original post by Etnu
Most drivers work by caching up a certain number of commands. Some commands may force a queue flush, and of course the card can force a flush as well.

In general, it's very well defined which commands may cause a queue to flush, regardless of the driver.
Quote:
Original post by Etnu
Every flush usually requires swapping between kernel and user mode.

That's irrelevant. Many Win32 API functions require swapping between kernel and user mode. This isn't the performance bottleneck.
Quote:
Original post by Etnu
It's very, very hard to be 100% sure when your video card is or is not waiting for commands.

The setup I described above pretty much guarantees your video card is not waiting for commands (if you manage your locks properly, of course).
Quote:
Original post by Etnu
The best solution is to create a separate thread to handle rendering, apart from the one that does everything else.

It is not the best solution. A separate rendering thread is unnecessary.

EDIT: You'll only benefit from a separate rendering thread if your GPU finishes its work much faster than your CPU. You can then increase your framerate by rendering the same (unchanged) frame more than once. However, if this is the case your engine is not properly balanced. It is CPU limited and you could safely increase the load on the GPU. If it is not the case, a rendering thread will give you no benefit. In both situations adding another thread is the wrong way to go.
Quote:
Original post by Etnu
You might gain some SLIGHT advantage by doing your physics updates between EndScene() and Present(), but not necessarily.

The bottom line is that there should be no CPU-intensive code between BeginScene() and EndScene(). By the time you call BeginScene(), all visibility, AI and physics information for the frame should be calculated. Once BeginScene() is called, you simply iterate through the queue of objects you need to render and push them to the GPU. This way you guarantee the GPU isn't waiting on anything. Once EndScene() is called, the GPU is busy rendering while you calculate CPU-intensive information for the next frame.
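
To be concrete, the submission part boils down to something like this (a D3D9 sketch; RenderItem and the queue are placeholders for whatever your visibility/sorting pass produced before BeginScene(), and FVF/vertex declaration setup is omitted):

struct RenderItem {
    IDirect3DTexture9*      tex;
    IDirect3DVertexBuffer9* vb;
    IDirect3DIndexBuffer9*  ib;
    UINT                    stride, numVerts, numTris;
};

void SubmitQueue(IDirect3DDevice9* device, const std::vector<RenderItem>& queue)
{
    device->BeginScene();
    for (size_t i = 0; i < queue.size(); ++i)    // nothing in here but pushing data to the GPU
    {
        const RenderItem& r = queue[i];
        device->SetTexture(0, r.tex);
        device->SetStreamSource(0, r.vb, 0, r.stride);
        device->SetIndices(r.ib);
        device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0,
                                     r.numVerts, 0, r.numTris);
    }
    device->EndScene();
    // the GPU is now busy; go compute the next frame's queue, then Present()
}
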
Quote:
Original post by Etnu
If EndScene() only had 1 command queued up, you just wasted a bunch of clock cycles for no good reason.

If you have no CPU intensive calculations between BeginScene() and EndScene(), unless you render only a few polygons you'll never have only one command queued up.

Quote:

Not only does it matter, it's critical to the engine's performance. Simply rearranging your game loop can improve your performance dramatically and allow you to push more polies than you could ever hope for.


Sometimes. There are always other things to consider.

Quote:

In general, it's very well defined which commands may cause a queue to flush, regardless of the driver.


Any command can force the command queue to flush, once it's full. The number of commands the queue can hold is dictated by the runtime. Some commands are guaranteed to (like creating a vertex buffer), but many more (such as Locking a buffer) have no guarantees.


Quote:

That's irrelevant. Many Win32 API functions require swapping between kernel and user mode. This isn't the performance bottleneck.


Microsoft's own documentation specifically points to the swap as being one of the most intensive tasks that can be done. A typical swap to or from kernel mode costs about 5000 clock cycles, so every round trip costs about 10,000 clocks.

Quote:

The setup I described above pretty much guarantees your video card is not waiting for commands (if you manage your locks properly, of course).


The setup you described guarantees nothing, unless you wrote the driver yourself and wrote your code to know exactly when the optimal time to do work will be.

Quote:

EDIT: You'll only benefit from a separate rendering thread if your GPU finishes its work much faster than your CPU. You can then increase your framerate by rendering the same (unchanged) frame more than once. However, if this is the case your engine is not properly balanced. It is CPU limited and you could safely increase the load on the GPU. If it is not the case, a rendering thread will give you no benefit. In both situations adding another thread is the wrong way to go.


You can most certainly gain from a rendering thread in a separate loop, as it's the only way to be 100% sure that the thread is not wasting clock cycles. Yes, it's harder to program for multithreading. It's not impossible, though.

Quote:

The bottom line is that there should be no CPU-intensive code between BeginScene() and EndScene(). By the time you call BeginScene(), all visibility, AI and physics information for the frame should be calculated. Once BeginScene() is called, you simply iterate through the queue of objects you need to render and push them to the GPU. This way you guarantee the GPU isn't waiting on anything. Once EndScene() is called, the GPU is busy rendering while you calculate CPU-intensive information for the next frame.



Like I said, EndScene() isn't necessarily guaranteed to be doing anything, except for flushing the queue. How much work is accomplished here depends entirely on how full the command queue is at that point in time.

Quote:

If you have no CPU intensive calculations between BeginScene() and EndScene(), unless you render only a few polygons you'll never have only one command queued up.


Incorrect. If the command queue gets flushed because of the last call to DrawPrimitive() (which it can), EndScene() does absolutely nothing (literally; if the command queue is empty, it returns immediately).

You can test this yourself with a profiler if you don't believe me. Why do you think sometimes commands like SetRenderState() and DrawPrimitive() take thousands of clocks? They're supposed to return immediately! That's correct, they are -- but not if the command queue is full and needs to be flushed.
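
The crudest way to see it (a sketch; device and numTris are whatever your scene already has):

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);

QueryPerformanceCounter(&t0);
device->DrawPrimitive(D3DPT_TRIANGLELIST, 0, numTris);   // "returns immediately"... usually
QueryPerformanceCounter(&t1);

double us = (t1.QuadPart - t0.QuadPart) * 1000000.0 / freq.QuadPart;
// most calls: a few microseconds; the one that triggers a command buffer
// flush shows up as a huge spike instead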

Read the SDK documentation if you don't believe me on this one. It outlines, better than I could possibly explain, that there is no way to be sure when the card is busy and when it's not.

Hmm...

I'm not really understanding the arguments being made here.
CoffeeMug's loop looked good, and still does. It maintains maximum parallelism without multithreading (which doesn't really gain you much in this case, unless you've got bad parallelism in the first place).

I think it's irrelevant which commands cause the driver's command buffer to flush within a BeginScene/EndScene; since you had to issue the command anyway, you just have to eat the performance hit, and there isn't any way around that.

It's not whether your command buffer on the user-mode side is full or not that matters; it's the one that the driver & card are working on.

Think of it like this:

BeginScene.

Some API calls.
Queue Flushes.
Card starts work.
Some more API calls.
Queue Flushes.
Card carries on working.

EndScene.

So your EndScene has done nothing, but so what - the card is still working on the two queues you did give it during the rendering. If you're on your first frame and you call Present() now, the card has to stall and finish the work it's doing before flipping the buffers. If you do some CPU work like logic/physics/AI here, i.e. your game update, then you give the card a chance to actually finish what it's working on before calling Present(). The docs back this up, and IHVs do as well.

-Mezz

Quote:
Original post by Etnu
Any command can force the command queue to flush, once it's full.

Yeah, but that shouldn't normally happen. If it does, you're not using the API correctly. Normally, if you're overflowing the queue you should increase the batch sizes and decrease the number of commands.
Quote:
Original post by Etnu
Microsoft's own documentation specifically points to the swap as being one of the most intensive tasks that can be done.

Can you give a link? This is the first time I've heard of this being a problem.
Quote:
Original post by Etnu
The setup you described guarantees nothing, unless you wrote the driver yourself and wrote your code to know exactly when the optimal time to do work will be.

Not according to this presentation. There's a lot more information scattered on NVidia and ATI sites. I'll try to find more links tonight or tomorrow.
Quote:
Original post by Etnu
You can most certainly gain from a rendering thread in a separate loop, as it's the only way to be 100% sure that the thread is not wasting clock cycles.

Sorry, I still don't see the benefit. The thread isn't wasting which clock cycles: CPU or GPU? If your GPU pipeline stalls, it doesn't really matter if you use a separate thread to issue commands: you have to wait until the pipeline drains before you can continue filling it up. Can you clarify the benefit of a separate thread?
Quote:
Original post by Etnu
Read the SDK documentation if you don't believe me on this one. It outlines, better than I could possibly explain, that there is no way to be sure when the card is busy and when it's not.

There is no way to guarantee the GPU isn't waiting for the CPU. However, "guarantee" is a really strong word. You can be reasonably sure the GPU isn't sitting idle if you spend enough time profiling and ironing out the bottlenecks.


Quote:
Original post by CoffeeMug
Yeah, but that shouldn't normally happen. If it does, you're not using the API correctly. Normally, if you're overflowing the queue you should increase the batch sizes and decrease the number of commands.


That's not necessarily true; it's quite possible on modern hardware for the command queue to flush 2-3 times in a single iteration of the render loop, even with optimal batching. Of course, that does depend on your data.

Quote:

Can you give a link? This is the first time I hear about this being a problem.


2k4 SDK Docs -> DirectX Graphics -> Advanced Topics -> Accurately Profiling Direct3D API Calls.

Quote:
Sorry, I still don't see the benefit. The thread isn't waiting for which clock cycles? CPU or GPU? If your GPU pipeline stalls, it doesn't really matter if you use a separate thread to issue commands: you have to wait until the pipeline is rendered until you can continue filling it up. Can you clarify the benefit of a separate thread?


Simple: you never have to worry about what the GPU is doing, and you can run your physics / input / sound code at as high an update rate as you'd like, completely independent of your rendering loop. The fact of the matter is that there will always be lost cycles when dealing with D3D calls, because, again, you may encounter things like flushes happening in the middle of the loop. A profiler will quickly show you that this happens.

Quote:

There is no way to guarantee the GPU isn't waiting for the CPU. However, "guarantee" is a really strong word. You can be reasonably sure the GPU isn't sitting idle if you spend enough time profiling and ironing out the bottlenecks.


Yes, if you spend a lot of time piddling around with your timing and writing code to handle it, you can be "reasonably" sure. Alternatively, you can spend 2 hours and be 100% sure it's not waiting.

Yes, modern hardware can flush the command queue multiple times a frame. This isn't a bad thing, because it allows the GPU to get a start actually doing some work. However, while the user-kernel mode swap is around 5000 cycles, is it really a performance bottleneck?

I read the documentation Etnu pointed at and found this:

Quote:

DrawPrimitive = kernel-transition + driver work + user-transition + runtime work
DrawPrimitive = 5000 + 935,000 + 2750 + 5000 + 900
DrawPrimitive = 947,950


So, while it's true that around 10,000 cycles are spent on those transitions, that's only about 1% of the total work being done.
I don't see how this could ever be a problem unless your transitions outweighed the work being done by the driver by some considerable amount, but in that case you're doing something drastically wrong anyway.

Is this why you're proposing multithreading, for cases where this flush occurs and you can regain performance by going to the other thread and doing some work there?

-Mezz

Quote:
Original post by Mezz
Yes, modern hardware can flush the command queue multiple times a frame. This isn't a bad thing, because it allows the GPU to get a start actually doing some work. However, while the user-kernel mode swap is around 5000 cycles, is it really a performance bottleneck?

I read the documentation Etnu pointed at and found this:

Quote:

DrawPrimitive = kernel-transition + driver work + user-transition + runtime work
DrawPrimitive = 5000 + 935,000 + 2750 + 5000 + 900
DrawPrimitive = 947,950


So, while it's true that around 10,000 cycles are spent on those transitions, that's only about 1% of the total work being done.
I don't see how this could ever be a problem unless your transitions outweighed the work being done by the driver by some considerable amount, but in that case you're doing something drastically wrong anyway.

Is this why you're proposing multithreading, for cases where this flush occurs and you can regain performance by going to the other thread and doing some work there?

-Mezz


Right, but you're waiting for the driver to process all those commands. A separate thread could easily be executing (especially on multiprocessor / hyperthreaded systems).

Quote:
Original post by Etnu
Right, but you're waiting for the driver to process all those commands. A separate thread could easily be executing (especially on multiprocessor / hyperthreaded systems).

But even if the call to the driver is on a different thread, the driver is eating up CPU cycles, so you're waiting no matter which thread the call is made on. The only place this will help is on a multiple-CPU system (and on hyperthreaded CPUs, although to a much lesser extent).

EDIT: Also, the point in that paper about the time taken by the mode transition is with regards to accurate CPU profiling and *not* parallelisation with the GPU.


Quote:
Original post by joanusdmentia
Quote:
Original post by Etnu
Right, but you're waiting for the driver to process all those commands. A separate thread could easily be executing (especially on multiprocessor / hyperthreaded systems).

But even if the call to the driver is on a different thread, the driver is eating up CPU cycles, so you're waiting no matter which thread the call is made on. The only place this will help is on a multiple-CPU system (and on hyperthreaded CPUs, although to a much lesser extent).

No, you've still got normal OS multitasking going on, which means that, while the worker threads may perform somewhat slower when a heavy load is on the main thread, they'll still get SOME timeslice, meaning that updates will continue to happen even though the thread is waiting for a call to return.

Quote:


EDIT: Also, the point in that paper about the time taken by the mode transition is with regards to accurate CPU profiling and *not* parallelisation with the GPU.


Yes, but the cost of calls and how D3D works is still the same; they were talking about accurately measuring the amount of time a specific call takes, but they "accidentally" pointed out where code execution slows down.

I'm still not sure I follow you, Etnu - regardless of what thread the work is done in, the work still has to be done, so overall won't it just take the same amount of time?

I can only think of one example in which allowing another thread working time would be beneficial, and that's data streaming (music/large worlds).

Unless you want to be doing 'update' type work (AI/physics etc.) while in the middle of rendering? Which I'm not sure is a good idea (unless absolutely everything is separated, with no data shared whatsoever).

-Mezz

Quote:
Original post by Etnu
No, you've still got normal OS multitasking going on, which means that, while the worker threads may perform somewhat slower when a heavy load is on the main thread, they'll still get SOME timeslice, meaning that updates will continue to happen even though the thread is waiting for a call to return.

I don't see how this would increase parallelism with the GPU, though. When the driver flushes, that's CPU work transferring the commands to the GPU; the CPU isn't sitting idly waiting for the GPU (or am I wrong on this?). So the only thing you'd be achieving is artificially increasing your FPS while slightly reducing the number of times the scene is updated per second. Something along the lines of this:

            |            |
            |            |  render frame N
  update    |           \|/
  frame     |        ---------
   N+1      |            |
            |            |  render frame N
           \|/          \|/

Sure, your render thread isn't waiting for the next frame to be updated, but you're just redrawing the same frame again.

Calculations() first, then Render():

Average render time in ms:
9.249394
Average calculations time in ms:
0.203390


Render() first, then Calculations():

Average render time in ms:
9.175182
Average calculations time in ms:
0.014599


Take note of the average calculations time.

This isn't just a one-off; I tried the test several times under the same conditions. The interesting thing is that Render() always has the flip at the end of it, so this may or may not be relevant to this topic. Can anyone explain why my calculations are 10x faster if I Render() first?

Quote:
Original post by red_sodium
This isn't just a one-off; I tried the test several times under the same conditions. The interesting thing is that Render() always has the flip at the end of it, so this may or may not be relevant to this topic. Can anyone explain why my calculations are 10x faster if I Render() first?

You're doing something wrong when timing. :)
If your calculations are doing exactly the same work (and are presumably entirely on CPU) then they should take exactly the same time (give or take variance due to task switching, etc) no matter where you do your rendering, and even if there is *no* rendering. What we are talking about is the CPU time spent in the rendering code.

EDIT: Stupid me, that calculation timing is in milliseconds, not seconds :)
Ignore the difference, it's bugger all. Try doing some real calculations and then do your test again.

Wow, >= 9/10ths of my Render() time goes to flipping!!!

Also, if I flip at the end of the loop instead of in Render(), my Render() time stays the same but my Calculations() time goes up again.

Where do you presume I'm going wrong, then? You are right, the difference is bugger all, but I want to know why it changed.

Can you post some pseudo code showing how you do timing?
BTW, if your GPU time is much larger than CPU time, you aren't balanced. You could safely increase the accuracy of the physics engine, improve the AI, etc. until both timings are approximately the same.

Etnu: sorry, I still fail to see your logic. I guess the only way is to try it. If I have some extra time I'll give it a try but I'm pretty sure your reasoning is faulty here.

Guest Anonymous Poster
Hi everybody.
Present should stall the CPU if you're synchronised to the screen refresh rate, so I don't see why the update of the game logic would be parallelized if it's done after the buffer swap.
I'm developing a game for PlayStation 2 & Xbox, and I can say that parallelism is one of the most important things for decent performance.
Microsoft has made a tool called PIX which shows how the parallelism is going, where it breaks, etc...
If you one day have the luck to try it, you will learn a lot of things. It will be available soon for PC.
Don't forget to look at the papers from ATI & NVidia; they are very important, and I've never seen them advise multi-threading. It won't help you get better parallelism, and I also don't see why it's advantageous to draw the same frame twice.
Ron

Quote:
Original post by CoffeeMug
I guess the only way is to try it. If I have some extra time I'll give it a try but I'm pretty sure your reasoning is faulty here.


Yeah, that's really the only way to answer these kinds of questions. I won't be able to test it for a little while, as I'm still designing/coding some of the major parts of my engine. But I wanted to hear others' advice on the matter. I look forward to running some tests.

Quote:
Original post by Anonymous Poster
Don't forget to look at the papers from ATI & NVidia; they are very important, and I've never seen them advise multi-threading. It won't help you get better parallelism, and I also don't see why it's advantageous to draw the same frame twice.

Yeah, I fail to see the benefit multithreading would give as well.
Quote:
Original post by okonomiyaki
I look forward to running some tests.

Want to place some bets on whether multithreading will improve your performance? Fifty bucks says it won't [smile]

