BrechtDebruyne

performance problem with my renderer

19 posts in this topic

A current-gen engine with Direct3D10+ should be able to handle at least 10k raw draw calls on a modern computer.
By raw draw calls I mean the pure number of draw calls, without any optimisations such as instancing or culling.
EDIT: and I mean draw calls with a simple effect and vertex buffer, like a textured rectangle.

My engine can only handle about 1k draw calls while sustaining a smooth framerate. I have been profiling the hell out of my engine with Intel VTune, AMD CodeAnalyst, PIX, and the default VS profiler, but I just can't seem to find the problem!

Another peculiar thing is that an ID3D10EffectPass::Apply() call seems to take longer than most Draw() calls.
After some tests, ID3D10EffectPass::Apply() doesn't seem to do what MSDN says ("Set the state contained in a pass to the device.").

If I Apply() before I commit my shader variables, my variables won't be updated. This implies that when a technique only contains 1 pass, we cannot Apply() per material but are forced to do this per mesh.

If anyone made a pretty performance-concerned rendering engine for PC with Direct3D10+, can you please check how many raw draw calls it can handle, and if the CPU spends more time doing Apply() than Draw()?

Can you apply per material instead of per mesh, when a technique only contains one pass? (not according to my tests, while a lot of people say that this would be an optimization I could make)

And does anyone have any idea why my engine would only do 1k draw calls? My algorithms are all tested for computational complexity etc, so it's probably not that!

Thanks x1000! Edited by Xcrypt
0

[quote name='Xcrypt' timestamp='1337268496' post='4940947']
A current-gen engine with Direct3D10+ should be able to handle at least 10k+ raw draw calls on a modern computer
[/quote]
There was an NVIDIA(?) presentation a few years back that talked about the number of draw calls per second. Pure draw calls are CPU limited, and they gave a formula depending on the GHz of a single core. The limits were more or less 1k-1.5k for a 2.5GHz core. Considering that the clock speed of single cores hasn't increased terribly in the last 5 years, I would suggest that 1k is more realistic than 10k.
1

Draw call overhead in D3D10+ is much lower than in previous versions, and can be considered more-or-less on a par with OpenGL, but it's still not free. However, a quick and dirty check shows that I can sustain ~12500 draw calls per-frame at ~250fps, which in turn shows that your performance woes are most likely coming from elsewhere.

I have no idea what's going on with EffectState::Apply - I personally don't use the effects framework at this level - but I'm guessing this is the most probable candidate. Solutions might include not using the effects framework (which is far easier than you may think at first) or moving your state handling from the framework to your program's code.
2

[quote name='mhagain' timestamp='1337276641' post='4940981']
I have no idea what's going on with EffectState::Apply - I personally don't use the effects framework at this level - but I'm guessing this is the most probable candidate. Solutions might include not using the effects framework (which is far easier than you may think at first) or moving your state handling from the framework to your program's code.
[/quote]

Not using the effects framework? Then how do you handle texturing/lighting? In fact how do you render anything at all?
And moving state handling from the framework to my program, how would I do that?
Also, what Direct3D version are you using? Edited by Xcrypt
0

[quote name='Xcrypt' timestamp='1337281524' post='4941009']
[quote name='mhagain' timestamp='1337276641' post='4940981']
I have no idea what's going on with EffectState::Apply - I personally don't use the effects framework at this level - but I'm guessing this is the most probable candidate. Solutions might include not using the effects framework (which is far easier than you may think at first) or moving your state handling from the framework to your program's code.
[/quote]

Not using the effects framework? Then how do you handle texturing/lighting? In fact how do you render anything at all?
And moving state handling from the framework to my program, how would I do that?
Also, what Direct3D version are you using?
[/quote]

You don't need the FX framework to do any of that.
You can set textures, render states and constants, and trigger draws on the device yourself in d3d10 / deviceContext in d3d11.

So if you read in the texture ids / state ids / constants from your material files and build up your own renderable blocks with the desired d3d resources, you can then manage it all yourself. This allows you to batch in possibly more efficient ways, remove redundant API calls, etc., which you might not be able to do through the FX framework (the last time I used the FX framework was 2007 or so, so my memory is a bit fuzzy).
In d3d11 you can also make use of multiple cores by building up your draw lists on different threads using the deferredDeviceContexts, which can help reduce the CPU load quite a bit (especially now the driver support for it seems to be pretty good, at least from NVs side).

As you mentioned in your original post, instancing can also give pretty good speedups.
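
As a rough illustration of the "build your own renderable blocks" idea above, here is a minimal sketch that sorts draw items by state so that consecutive draws share shaders and textures. The `DrawItem` struct, its fields, and both helper functions are hypothetical names made up for this example; no real D3D call is involved, and the draw count itself is unchanged.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical renderable record: the ids refer to resources loaded elsewhere.
struct DrawItem {
    int shaderId;
    int textureId;
    int meshId;
};

// Orders items so that draws sharing a shader, then a texture, are adjacent.
inline void SortByState(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return std::tie(a.shaderId, a.textureId) <
                         std::tie(b.shaderId, b.textureId);
              });
}

// Counts the shader and texture switches needed when walking the list in order.
inline std::size_t CountStateChanges(const std::vector<DrawItem>& items) {
    std::size_t changes = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i == 0 || items[i].shaderId != items[i - 1].shaderId) ++changes;
        if (i == 0 || items[i].textureId != items[i - 1].textureId) ++changes;
    }
    return changes;
}
```

With four items that alternate between two shaders, the unsorted list needs five state changes and the sorted one needs three, while still issuing four draws.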
2

[quote name='mhagain' timestamp='1337294053' post='4941048']
[code]SamplerState sampler3 : register(s3);
Texture2D tex0 : register(t0);
Texture2D tex1 : register(t1);[/code]

[code]Context->PSSetSamplers (3, ...);
Context->PSSetShaderResources (0, .....);
Context->PSSetShaderResources (1, .....);[/code]

This is D3D11 but this kind of thing worked even back in D3D9 HLSL. Just specify explicit registers and set resources to those registers - the effects framework is partially just a wrapper around all of this, but you definitely don't need that wrapper.

The main motivations for doing it this way are so that I can mix and match different vertex/geometry/pixel shaders without having to specify new passes in a .FX file, so that I can dynamically switch certain states in and out in program code, because I'm a mite uneasy with the way the framework handles constant buffers (may be unwarranted but it just feels wrong to me), and so that I can avoid other overheads associated with using the framework.

This way does need a little bit more work, but like I said, it's not [i]that[/i] much, and the added flexibility and performance potential more than justifies it.
[/quote]


I could understand that this approach would update shader variables without having to call Apply().
This may give a certain (and certainly worthwhile) performance boost in techniques that require only one pass.

However, I don't see how you would avoid using passes with it?
Also, maybe Direct3D11 has something to do with you getting such a high #draw calls per frame? (I'm using Direct3D10)
I don't believe that this approach would get me a 15x draw call performance boost (needed in order to get your #drawsperframe @ 250fps). I'm guessing max 3x.

Also, since someone else replied that instancing might give a performance boost, indeed it would. And multithreading too!
But please note that this is not a thread about the generic performance for a renderer: just [u]a focus on draw call performance, not on actually lowering the #draw calls per frame.[/u]

Thanks btw, really helpful information. Edited by Xcrypt
0

[quote name='Xcrypt' timestamp='1337297083' post='4941059']
I could understand that this approach would update shader variables without having to call Apply().
This may give a certain(certainly worth it) performance boost in techniques that require only one pass.

However, I don't see how you would avoid using passes with it?
Also, maybe Direct3D11 has something to do with you getting such a high #draw calls per frame? (I'm using Direct3D10)
I don't believe that this approach would get me a 15x draw call performance boost (needed in order to get your #drawsperframe @ 250fps). I'm guessing max 3x.

Also, since someone else replied that instancing might give a performance boost, indeed it would. And multithreading too!
But please note that this is not a thread about the generic performance for a renderer: just a focus on draw call performance, not on actually lowering the #draw calls per frame.

Thanks
[/quote]
It's actually useless for updating shader variables - you use constant buffers for that.

It's important to realise that the whole concept of techniques and passes is just an artefact of the effects framework. Remember that the effects framework is not in any way an API that talks directly to the hardware or driver - it's just a wrapper around the [i]real[/i] D3D API. Everything in the effects framework is implemented using the real API, and you can study the source code for it (available in "[color=#0000cd]C:\Program Files (x86)\Microsoft DirectX SDK (June 2010)\Samples\C++\Effects11[/color]" if you have a reasonably up-to-date SDK installed) if you need to confirm that. Techniques and passes don't actually exist in HLSL - they're just concepts that are confined to effects, but are actually implemented using the real API.

So, in the case of updating shader variables, you can look at the code for [color=#0000cd]CheckAndUpdateCB_FX[/color] and see what it does. It keeps a backing store for the entire buffer in system memory, sets a dirty flag when a variable needs updating, and then when you call Apply, it updates the [i]entire[/i] buffer and clears the dirty flag. All just using standard D3D calls like those I gave examples of above.
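
To make the backing-store-plus-dirty-flag behaviour described above concrete, here is a minimal sketch. It is not the Effects11 source: `FakeGpuBuffer` and `ShadowedConstantBuffer` are hypothetical names, and the "upload" is just a copy plus a counter so the example runs without a device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a GPU constant buffer; it counts uploads instead
// of calling the real D3D API.
struct FakeGpuBuffer {
    std::vector<unsigned char> data;
    int uploads = 0;
    std::size_t bytesUploaded = 0;
};

// System-memory shadow copy with a dirty flag, in the spirit of the behaviour
// described above.
class ShadowedConstantBuffer {
public:
    explicit ShadowedConstantBuffer(std::size_t size) : shadow_(size, 0) {}

    // Writes into the shadow copy only; nothing reaches the "GPU" yet.
    void Set(std::size_t offset, const void* src, std::size_t size) {
        assert(offset + size <= shadow_.size());
        std::memcpy(shadow_.data() + offset, src, size);
        dirty_ = true;
    }

    // Uploads the entire buffer if anything changed, then clears the flag.
    void Apply(FakeGpuBuffer& gpu) {
        if (!dirty_) return;
        gpu.data = shadow_;
        ++gpu.uploads;
        gpu.bytesUploaded += shadow_.size();
        dirty_ = false;
    }

private:
    std::vector<unsigned char> shadow_;
    bool dirty_ = false;
};
```

Under this model, a variable set after Apply() stays in system memory until the next Apply(), which would be consistent with the behaviour reported in the original post, and changing a single float still costs a full-buffer upload.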
1

What do you class as part of a 'draw call'? How much work do you include?

Because chances are if you want performance you are going to have to ditch the FX framework and start dealing with constants and other elements of a draw call yourself, properly batching/constraining updates.

Also how are you timing things?
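
One possible way to answer the timing question is to measure only the CPU-side submission loop. The sketch below assumes a hypothetical callable `submit` that stands in for "set per-object state, then Draw()"; it says nothing about GPU time, since draw calls are queued and executed asynchronously.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Average CPU time, in microseconds, spent per call of `submit` over `count`
// calls. `submit` is a hypothetical callable taking the call index.
template <typename Submit>
double AverageMicrosecondsPerCall(Submit&& submit, std::size_t count) {
    assert(count > 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < count; ++i) {
        submit(i);
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(count);
}
```

Timing the same loop once with only the variable updates and once with only the draws would show which part dominates, which is harder to read off a frame rate.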
0

[quote name='mhagain' timestamp='1337276641' post='4940981']
However, a quick and dirty check shows that I can sustain ~12500 draw calls per-frame at ~250fps, which in turn shows that your performance woes are most likely coming from elsewhere.
[/quote]

Wow. What CPU is that running on? I just tried ~10000 draw calls and it runs at only ~12 fps, and that is in an optimal setting of only a small constant buffer update (map/unmap of a D3DXMATRIX) and the actual draw call. The best I can do is 3000 draw calls at ~37 fps.

Are you sure it is not using instancing or multithreading?

P.S: My experiment was performed on a laptop with an i7 at 2.80 GHz (Turbo). Edited by TiagoCosta
0

[quote name='TiagoCosta' timestamp='1337330953' post='4941137']
[quote name='mhagain' timestamp='1337276641' post='4940981']
However, a quick and dirty check shows that I can sustain ~12500 draw calls per-frame at ~250fps, which in turn shows that your performance woes are most likely coming from elsewhere.
[/quote]

Wow. What CPU is that running on? I just tried ~10000 draw calls and it runs at only ~12 fps, and that is in an optimal setting of only a small constant buffer update (map/unmap of a D3DXMATRIX) and the actual draw call. The best I can do is 3000 draw calls at ~37 fps.

Are you sure it is not using instancing or multithreading?

P.S: My experiment was performed on a laptop with an i7 at 2.80 GHz (Turbo).
[/quote]
It's also a laptop with an i7; I'm not doing cbuffer updates for each individual call (they're scattered throughout though) but I am doing texture changes. The shaders and textures are quite simple, so the measurement is a good reflection of draw calls and without too much other work being done to skew the figures. I basically just took a nice batched up renderer (~200 calls when batching) and unbatched it, converting a DrawIndexed (...) call to multiple Draw (...) calls. Definitely no instancing or multithreading.
0

[quote name='phantom' timestamp='1337298817' post='4941062']
What do you class as part of a 'draw call'? How much work do you include?

Because chances are if you want performance you are going to have to ditch the FX framework and start dealing with constants and other elements of a draw call yourself, properly batching/constraining updates.

Also how are you timing things?
[/quote]
What do I consider as part of a draw call? Basically, setting effect variables for the object if needed, and then calling ID3D10Device::Draw().
I'm afraid you will need to be a bit more specific about what you mean by how I am timing things.

[quote name='TiagoCosta' timestamp='1337330953' post='4941137']
I just tried ~10000 draw calls and it runs at only ~12 fps, and that is in an optimal setting of only a small constant buffer update (map/unmap of a D3DXMATRIX) and the actual draw call. The best I can do is 3000 draw calls at ~37 fps.

Are you sure it is not using instancing or multithreading?

P.S: My experiment was performed on a laptop with an i7 at 2.80 GHz (Turbo).
[/quote]
Please include the Direct3D version under which your engine is running too!


[quote name='mhagain' timestamp='1337332100' post='4941138']
It's also a laptop with an i7; I'm not doing cbuffer updates for each individual call (they're scattered throughout though) but I am doing texture changes. The shaders and textures are quite simple, so the measurement is a good reflection of draw calls and without too much other work being done to skew the figures. I basically just took a nice batched up renderer (~200 calls when batching) and unbatched it, converting a DrawIndexed (...) call to multiple Draw (...) calls. Definitely no instancing or multithreading.
[/quote]
What do you mean by batched up? Edited by Xcrypt
0

[quote name='Xcrypt' timestamp='1337339328' post='4941153']
I just tried ~10000 draw calls and it runs at only ~12 fps, and that is in an optimal setting of only a small constant buffer update (map/unmap of a D3DXMATRIX) and the actual draw call. The best I can do is 3000 draw calls at ~37 fps.
[/quote]

d3d11.1 should help with the constant setting part actually. It allows you to build up a massive constant buffer with all your scene consts, and then give d3d a window into it for a specific draw, so you only need to push data to the GPU once.
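
A sketch of the offset arithmetic such a "window" needs, assuming (as far as I recall from the D3D11.1 documentation for the `*SetConstantBuffers1` calls) that offsets and counts are expressed in 16-byte constants and must be multiples of 16 constants, i.e. 256 bytes. The helper names are hypothetical and no API call is made here.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kBytesPerConstant = 16;   // one float4
constexpr std::size_t kConstantsPerStep = 16;   // assumed granularity: 256 bytes

// Size of one per-draw window, in constants, rounded up to a whole number of
// 256-byte steps.
constexpr std::size_t WindowSizeInConstants(std::size_t payloadBytes) {
    return ((payloadBytes + kBytesPerConstant * kConstantsPerStep - 1) /
            (kBytesPerConstant * kConstantsPerStep)) * kConstantsPerStep;
}

// First constant of the window for draw number `drawIndex` when every draw
// uses a payload of `payloadBytes` packed back to back in one big buffer.
constexpr std::size_t FirstConstantForDraw(std::size_t drawIndex,
                                           std::size_t payloadBytes) {
    return drawIndex * WindowSizeInConstants(payloadBytes);
}
```

For a 64-byte matrix per draw, each window would occupy 16 constants, so draw 3 would start at constant 48; note that under this assumption the padding makes the big buffer four times larger than the raw matrix data.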

This seems unrelated to your case if you were not setting textures / render state blocks, etc. per draw, but I got noticeable speedups by filtering out all redundant API calls with a simple state cache. Maybe something the OP should look into?
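
For reference, a simple state cache of the kind mentioned above could look roughly like this. `CountingDevice` is a hypothetical stand-in that only counts calls; a real renderer would forward to something like PSSetShaderResources instead, and would cache more than textures.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical device stand-in that only counts calls.
struct CountingDevice {
    int textureBinds = 0;
    void BindTexture(std::size_t /*slot*/, int /*textureId*/) { ++textureBinds; }
};

// Remembers what is bound in each slot and skips binds that change nothing.
class TextureStateCache {
public:
    static constexpr std::size_t kSlots = 8;   // assumed slot count
    static constexpr int kUnbound = -1;

    TextureStateCache() { bound_.fill(kUnbound); }

    // Returns true if the call was actually forwarded to the device.
    bool Bind(CountingDevice& device, std::size_t slot, int textureId) {
        assert(slot < kSlots);
        if (bound_[slot] == textureId) return false;  // redundant, filtered out
        device.BindTexture(slot, textureId);
        bound_[slot] = textureId;
        return true;
    }

private:
    std::array<int, kSlots> bound_;
};
```

The cache is only valid as long as every bind goes through it; any code that talks to the device directly would leave the cached state stale.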
0

[quote name='ATEFred' timestamp='1337361103' post='4941220']
[quote name='Xcrypt' timestamp='1337339328' post='4941153']
I just tried ~10000 draw calls and it runs at only ~12 fps, and that is in an optimal setting of only a small constant buffer update (map/unmap of a D3DXMATRIX) and the actual draw call. The best I can do is 3000 draw calls at ~37 fps.
[/quote]

d3d11.1 should help with the constant setting part actually. It allows you to build up a massive constant buffer with all your scene consts, and then give d3d a window into it for a specific draw, so you only need to push data to the GPU once.

This seems unrelated to your case if you were not setting textures / render state blocks, etc. per draw, but I got noticeable speedups by filtering out all redundant API calls with a simple state cache. Maybe something the OP should look into?
[/quote]

1) I didn't say that! Wrong quote :P
2) I am already avoiding redundant state settings by comparing the currently active state with the target state for everything. And yes, this is def. worth it.
0

[quote name='Xcrypt' timestamp='1337368197' post='4941251']
[quote name='ATEFred' timestamp='1337361103' post='4941220']
[quote name='Xcrypt' timestamp='1337339328' post='4941153']
I just tried ~10000 draw calls and it runs at only ~12 fps, and that is in an optimal setting of only a small constant buffer update (map/unmap of a D3DXMATRIX) and the actual draw call. The best I can do is 3000 draw calls at ~37 fps.
[/quote]

d3d11.1 should help with the constant setting part actually. It allows you to build up a massive constant buffer with all your scene consts, and then give d3d a window into it for a specific draw, so you only need to push data to the GPU once.

This seems unrelated to your case if you were not setting textures / render state blocks, etc. per draw, but I got noticeable speedups by filtering out all redundant API calls with a simple state cache. Maybe something the OP should look into?
[/quote]

1) I didn't say that! Wrong quote :P
2) I am already avoiding redundant state settings by comparing the currently active state with the target state for everything. And yes, this is def. worth it.
[/quote]

Argh, my bad. I used selection quoting inside a quote and must have got confused ;).
0

Draw call overhead seems to vary a lot from one graphics card to another.
My GUI library is not batching some things yet, so currently, in a test where I display many buttons, I get two draw calls per button. I just tried to render about two hundred of them (with OpenGL 4.2). I avoid issuing redundant state changes using a simple state cache, too.

I have a framerate limiter at 60 fps. On my GeForce GTX 580 it takes about 30% CPU (as seen in top, as I'm working in Linux).

I just tried it again with my old Radeon HD 5830 and there it takes about 50% CPU. I remember that this type of test resulted in even higher CPU usage before on the Radeon, but since that was a while ago and I just reinstalled the ATI drivers for this test, I probably now have more recent ones in which they may have improved things.

I couldn't compare FPS obtained by just letting it run without the framerate limiter, as it sadly turns out that my text renderer (and therefore my fps counter) is not working on the radeon for some reason. I'm not sure measuring FPS is a good way to compare the overhead of draw calls anyway, unless you render only 1 pixel polygons or something. Edited by Zlodo
0

[quote name='Waterlimon' timestamp='1337341150' post='4941155']
You ARE testing using a release version of the program?
[/quote]

Thanks for the reminder...
In release mode I can have ~20000 draw calls at ~100 fps... (DirectX 11)

So I think the OP should definitely consider writing a custom effects framework and optimizing it according to how his engine works, to reduce redundant state changes / CB updates, etc...
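
To put the figures quoted in this thread on a common scale, here is a small worked calculation of the CPU time per draw call. It makes the simplifying assumption that submission is the only cost in the frame, so the results are upper bounds on the per-draw cost rather than measurements.

```cpp
#include <cassert>

// CPU time available per draw call, in microseconds, if draw submission were
// the only cost in the frame (a simplifying assumption).
constexpr double MicrosecondsPerDraw(double drawsPerFrame, double framesPerSecond) {
    return 1.0e6 / (drawsPerFrame * framesPerSecond);
}
```

With the numbers reported above: 10000 draws at 12 fps is about 8.3 µs per draw, 3000 at 37 fps is about 9.0 µs, 20000 at 100 fps in release mode is 0.5 µs, and 12500 at 250 fps is 0.32 µs. The debug and release figures differ by a factor of roughly 17 on the same machine.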
0

At this stage it seems clear that it's your matrix updates, and not the number of draw calls, that are your primary bottleneck. If all that you're updating is a matrix, and if that matrix lives in the same cbuffer as other shader constants, then you really should consider splitting it out to a separate cbuffer on its own. That will enable D3D to update it more efficiently and transfer less data to the GPU for each such update.

It's really difficult to say much more without a better description of what exactly you're doing, and without seeing some code. Without those, everything is just guesswork.
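
As a sketch of the cbuffer split suggested above, the layouts below separate the per-object matrix from constants that change once per frame. All struct and field names are hypothetical, and the byte counts assume the whole buffer is rewritten on every update, as the effects framework was described as doing earlier in the thread.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layouts; the field names are illustrative only.
struct Matrix4x4 { float m[16]; };   // 64 bytes

// Everything in one cbuffer: every matrix change rewrites all of it.
struct CombinedConstants {
    Matrix4x4 world;
    Matrix4x4 viewProjection;
    float     lightDirection[4];
    float     lightColor[4];
    float     fogParams[4];
};

// Split by update frequency: only PerObjectConstants changes per draw.
struct PerFrameConstants {
    Matrix4x4 viewProjection;
    float     lightDirection[4];
    float     lightColor[4];
    float     fogParams[4];
};
struct PerObjectConstants {
    Matrix4x4 world;
};

// D3D10/11 constant buffer sizes must be multiples of 16 bytes.
static_assert(sizeof(CombinedConstants) % 16 == 0, "cbuffer size must be 16-byte aligned");
static_assert(sizeof(PerFrameConstants) % 16 == 0, "cbuffer size must be 16-byte aligned");
static_assert(sizeof(PerObjectConstants) % 16 == 0, "cbuffer size must be 16-byte aligned");

// Bytes written per frame when each draw rewrites a buffer of the given size.
constexpr std::size_t BytesPerFrame(std::size_t bufferSize, std::size_t draws) {
    return bufferSize * draws;
}
```

For 10000 draws, the combined layout rewrites 1,760,000 bytes per frame, while the split layout rewrites 640,000 bytes for the matrices plus 112 bytes once for the per-frame data.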
0
