performance problem with my renderer


19 replies to this topic

#1 Xcrypt   Members   -  Reputation: 154

Posted 17 May 2012 - 09:28 AM

A current-gen engine with Direct3D10+ should be able to handle at least 10k raw draw calls per frame on a modern computer.
By raw draw calls I mean the pure number of draw calls, without any optimisations such as instancing or culling.
EDIT: and I mean draw calls with a simple effect and vertex buffer, like a textured rectangle.

My engine can only handle about 1k draw calls while sustaining a smooth framerate. I have been profiling the hell out of my engine with Intel VTune, AMD CodeAnalyst, PIX, and the default VS profiler, but I just can't seem to find the problem!

Another peculiar thing is that ID3D10EffectPass::Apply() seems to take longer than most Draw() calls.
After some tests, ID3D10EffectPass::Apply() doesn't seem to do just what MSDN says ("Set the state contained in a pass to the device.").

If I Apply() before I commit my shader variables, my variables won't be updated. This implies that even when a technique contains only one pass, we cannot Apply() per material but are forced to do it per mesh.

If anyone made a pretty performance-concerned rendering engine for PC with Direct3D10+, can you please check how many raw draw calls it can handle, and if the CPU spends more time doing Apply() than Draw()?

Can you apply per material instead of per mesh, when a technique only contains one pass? (not according to my tests, while a lot of people say that this would be an optimization I could make)

And does anyone have any idea why my engine would only do 1k draw calls? My algorithms are all tested for computational complexity etc, so it's probably not that!

Thanks x1000!

Edited by Xcrypt, 17 May 2012 - 09:38 AM.

#2 Ashaman73   Crossbones+   -  Reputation: 7991

Posted 17 May 2012 - 09:55 AM

A current-gen engine with Direct3D10+ should be able to handle at least 10k+ raw draw calls on a modern computer

There was an NVIDIA(?) presentation a few years back which talked about the number of draw calls per second. Pure draw calls are CPU limited, and they gave a formula depending on the GHz of a single core. The limits were more or less 1k-1.5k for a 2.5GHz core. Considering that the GHz of single cores hasn't increased terribly in the last 5 years, I would suggest that 1k is more realistic than 10k.

#3 mhagain   Crossbones+   -  Reputation: 8277

Posted 17 May 2012 - 11:44 AM

Draw call overhead in D3D10+ is much lower than in previous versions, and can be considered more-or-less on a par with OpenGL, but it's still not free. However, a quick and dirty check shows that I can sustain ~12500 draw calls per-frame at ~250fps, which in turn shows that your performance woes are most likely coming from elsewhere.
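
As a rough illustration of the arithmetic behind figures like these (a back-of-the-envelope sketch, not code from any engine in this thread), the helpers below turn a draws-per-frame count and a frame rate into an average time budget per draw call. The function names are made up for the example.

```cpp
#include <cassert>
#include <cmath>

// Total draw calls submitted per second at a given frame rate.
double drawsPerSecond(double drawsPerFrame, double framesPerSecond) {
    assert(drawsPerFrame > 0.0 && framesPerSecond > 0.0);
    return drawsPerFrame * framesPerSecond;
}

// Average wall-clock budget available per draw call, in microseconds.
double microsecondsPerDraw(double drawsPerFrame, double framesPerSecond) {
    return 1.0e6 / drawsPerSecond(drawsPerFrame, framesPerSecond);
}
```

With ~12500 draws at ~250 fps this gives about 3.1 million draws per second, or roughly 0.32 microseconds per draw. If "smooth" in the first post is assumed to mean 60 fps (an assumption, the post does not say), 1000 draws per frame works out to roughly 16.7 microseconds per draw, about fifty times more.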

I have no idea what's going on with EffectState::Apply - I personally don't use the effects framework at this level - but I'm guessing this is the most probable candidate. Solutions might include not using the effects framework (which is far easier than you may think at first) or moving your state handling from the framework to your program's code.

It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.


#4 Xcrypt   Members   -  Reputation: 154

Posted 17 May 2012 - 01:05 PM

I have no idea what's going on with EffectState::Apply - I personally don't use the effects framework at this level - but I'm guessing this is the most probable candidate. Solutions might include not using the effects framework (which is far easier than you may think at first) or moving your state handling from the framework to your program's code.


Not using the effects framework? Then how do you handle texturing/lighting? In fact how do you render anything at all?
And moving state handling from the framework to my program, how would I do that?
Also, what Direct3D version are you using?

Edited by Xcrypt, 17 May 2012 - 01:09 PM.


#5 ATEFred   Members   -  Reputation: 1126

Posted 17 May 2012 - 01:52 PM


I have no idea what's going on with EffectState::Apply - I personally don't use the effects framework at this level - but I'm guessing this is the most probable candidate. Solutions might include not using the effects framework (which is far easier than you may think at first) or moving your state handling from the framework to your program's code.


Not using the effects framework? Then how do you handle texturing/lighting? In fact how do you render anything at all?
And moving state handling from the framework to my program, how would I do that?
Also, what Direct3D version are you using?


You don't need the FX framework to do any of that.
You can set textures, render states and constants, and trigger draws on the device yourself in d3d10 / deviceContext in d3d11.

So if you read in the texture ids / state ids / constants from your material files and build up your own renderable blocks with the desired d3d resources, you can then manage it all yourself. This allows you to batch in possibly more efficient ways, remove redundant API calls, etc., which you might not be able to do through the FX framework (the last time I used the FX framework was 2007 or so, so my memory is a bit fuzzy).
In d3d11 you can also make use of multiple cores by building up your draw lists on different threads using deferred device contexts, which can help reduce the CPU load quite a bit (especially now that the driver support for it seems to be pretty good, at least from NV's side).
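
One way to picture the "renderable blocks" idea is the sketch below. It is only an illustration under assumptions: the DrawItem struct and its integer ids are hypothetical stand-ins for whatever a material file would provide, and no D3D calls are made. It sorts the draw list so that items sharing a shader and texture end up adjacent, then counts how many binds a simple submit loop would need.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical renderable block; the ids would come from material files.
struct DrawItem {
    std::uint32_t shaderId;
    std::uint32_t textureId;
    std::uint32_t meshId;
};

// Order draws so that items sharing a shader, then a texture, are adjacent.
void sortByState(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return std::tie(a.shaderId, a.textureId, a.meshId) <
                         std::tie(b.shaderId, b.textureId, b.meshId);
              });
}

// Count the shader/texture binds a submit loop would issue if it only
// rebinds when an id differs from the previous item's id.
std::size_t countStateChanges(const std::vector<DrawItem>& items) {
    std::size_t changes = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i == 0 || items[i].shaderId != items[i - 1].shaderId) ++changes;
        if (i == 0 || items[i].textureId != items[i - 1].textureId) ++changes;
    }
    return changes;
}
```

For four items that alternate between two materials, the unsorted list needs 8 binds and the sorted list needs 4. The number of draw calls stays the same; only the state traffic between them shrinks.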

As you mentioned in your original post, instancing can also give pretty good speedups.

#6 mhagain   Crossbones+   -  Reputation: 8277

Posted 17 May 2012 - 04:34 PM

Not using the effects framework? Then how do you handle texturing/lighting? In fact how do you render anything at all?
And moving state handling from the framework to my program, how would I do that?
Also, what Direct3D version are you using?


SamplerState sampler3 : register(s3);
Texture2D tex0 : register(t0);
Texture2D tex1 : register(t1);

Context->PSSetSamplers (3, ...);
Context->PSSetShaderResources (0, .....);
Context->PSSetShaderResources (1, .....);

This is D3D11 but this kind of thing worked even back in D3D9 HLSL. Just specify explicit registers and set resources to those registers - the effects framework is partially just a wrapper around all of this, but you definitely don't need that wrapper.

The main motivations for doing it this way are so that I can mix and match different vertex/geometry/pixel shaders without having to specify new passes in a .FX file, so that I can dynamically switch certain states in and out in program code, because I'm a mite uneasy with the way the framework handles constant buffers (may be unwarranted but it just feels wrong to me), and so that I can avoid other overheads associated with using the framework.

This way does need a little bit more work, but like I said, it's not that much, and the added flexibility and performance potential more than justifies it.


#7 Xcrypt   Members   -  Reputation: 154

Posted 17 May 2012 - 05:24 PM

SamplerState sampler3 : register(s3);
Texture2D tex0 : register(t0);
Texture2D tex1 : register(t1);

Context->PSSetSamplers (3, ...);
Context->PSSetShaderResources (0, .....);
Context->PSSetShaderResources (1, .....);

This is D3D11 but this kind of thing worked even back in D3D9 HLSL. Just specify explicit registers and set resources to those registers - the effects framework is partially just a wrapper around all of this, but you definitely don't need that wrapper.

The main motivations for doing it this way are so that I can mix and match different vertex/geometry/pixel shaders without having to specify new passes in a .FX file, so that I can dynamically switch certain states in and out in program code, because I'm a mite uneasy with the way the framework handles constant buffers (may be unwarranted but it just feels wrong to me), and so that I can avoid other overheads associated with using the framework.

This way does need a little bit more work, but like I said, it's not that much, and the added flexibility and performance potential more than justifies it.



I can understand that this approach would update shader variables without having to call Apply().
This may give a certain (certainly worthwhile) performance boost in techniques that require only one pass.

However, I don't see how you would avoid using passes with it?
Also, maybe Direct3D11 has something to do with you getting such a high #draw calls per frame? (I'm using Direct3D10)
I don't believe that this approach would get me a 15x draw call performance boost (needed in order to get your #drawsperframe @ 250fps). I'm guessing max 3x.

Also, since someone else replied that instancing might give a performance boost, indeed it would. And multithreading too!
But please note that this is not a thread about the generic performance for a renderer: just a focus on draw call performance, not on actually lowering the #draw calls per frame.

Thanks btw, really helpful information.

Edited by Xcrypt, 17 May 2012 - 05:39 PM.


#8 mhagain   Crossbones+   -  Reputation: 8277

Posted 17 May 2012 - 05:52 PM

I could understand that this approach would update shader variables without having to call Apply().
This may give a certain(certainly worth it) performance boost in techniques that require only one pass.

However, I don't see how you would avoid using passes with it?
Also, maybe Direct3D11 has something to do with you getting such a high #draw calls per frame? (I'm using Direct3D10)
I don't believe that this approach would get me a 15x draw call performance boost (needed in order to get your #drawsperframe @ 250fps). I'm guessing max 3x.

Also, since someone else replied that instancing might give a performance boost, indeed it would. And multithreading too!
But please note that this is not a thread about the generic performance for a renderer: just a focus on draw call performance, not on actually lowering the #draw calls per frame.

Thanks

It's actually useless for updating shader variables - you use constant buffers for that.

It's important to realise that the whole concept of techniques and passes is just an artefact of the effects framework. Remember that the effects framework is not in any way an API that talks directly to the hardware or driver - it's just a wrapper around the real D3D API. Everything in the effects framework is implemented using the real API, and you can study the source code for it (available in "C:\Program Files (x86)\Microsoft DirectX SDK (June 2010)\Samples\C++\Effects11" if you have a reasonably up-to-date SDK installed) if you need to confirm that. Techniques and passes don't actually exist in HLSL - they're just concepts that are confined to effects, but are actually implemented using the real API.

So, in the case of updating shader variables, you can look at the code for CheckAndUpdateCB_FX and see what it does. It keeps a backing store for the entire buffer in system memory, sets a dirty flag when a variable needs updating, and then when you call Apply, it updates the entire buffer and clears the dirty flag. All just using standard D3D calls like those I gave examples of above.
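
The backing-store-plus-dirty-flag pattern described above can be sketched roughly as follows. This is not the Effects11 source, just a minimal stand-in written under assumptions: the class and member names are invented, and the "GPU copy" is an ordinary vector where a real renderer would call something like UpdateSubresource or Map/Unmap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for an effect-style constant buffer. 'gpuCopy_' plays the role
// of the real D3D buffer; only apply() ever writes to it.
class ShadowedConstantBuffer {
public:
    explicit ShadowedConstantBuffer(std::size_t sizeInBytes)
        : backingStore_(sizeInBytes, 0), gpuCopy_(sizeInBytes, 0) {}

    // Write a variable into system memory only, and mark the buffer dirty.
    void setVariable(std::size_t offset, const void* data, std::size_t size) {
        assert(offset + size <= backingStore_.size());
        std::memcpy(backingStore_.data() + offset, data, size);
        dirty_ = true;
    }

    // Upload the whole buffer, but only if something changed since last time.
    void apply() {
        if (!dirty_) return;
        gpuCopy_ = backingStore_;
        ++uploadCount_;
        dirty_ = false;
    }

    // Read back what the "device" currently sees at a given byte offset.
    float readGpuFloat(std::size_t offset) const {
        assert(offset + sizeof(float) <= gpuCopy_.size());
        float value;
        std::memcpy(&value, gpuCopy_.data() + offset, sizeof(float));
        return value;
    }

    int uploadCount() const { return uploadCount_; }

private:
    std::vector<unsigned char> backingStore_;
    std::vector<unsigned char> gpuCopy_;
    bool dirty_ = false;
    int uploadCount_ = 0;
};
```

In this model a value written with setVariable() is invisible to the device until the next apply(), which would be consistent with the behaviour reported in the first post: calling Apply() before setting the variables leaves the old values in place for that draw.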


#9 phantom   Moderators   -  Reputation: 7563

Posted 17 May 2012 - 05:53 PM

What do you class as part of a 'draw call'? How much work do you include?

Because chances are if you want performance you are going to have to ditch the FX framework and start dealing with constants and other elements of a draw call yourself, properly batching/constraining updates.

Also how are you timing things?

#10 TiagoCosta   Crossbones+   -  Reputation: 2455

Posted 18 May 2012 - 02:49 AM

However, a quick and dirty check shows that I can sustain ~12500 draw calls per-frame at ~250fps, which in turn shows that your performance woes are most likely coming from elsewhere.


Wow. What CPU is that running on? I just tried ~10000 draw calls and it runs at only ~12 fps, and that is in an optimal setting: only a small constant buffer update (map/unmap of a D3DXMATRIX) and the actual draw call. The best I can do is 3000 draw calls at ~37 fps.

Are you sure it is not using instancing or multithreading?

P.S: My experiment was performed on a laptop with an i7 at 2.80 GHz (Turbo).

Edited by TiagoCosta, 18 May 2012 - 02:50 AM.


#11 mhagain   Crossbones+   -  Reputation: 8277

Posted 18 May 2012 - 03:08 AM


However, a quick and dirty check shows that I can sustain ~12500 draw calls per-frame at ~250fps, which in turn shows that your performance woes are most likely coming from elsewhere.


Wow. What CPU is that running on? I just tried ~10000 draw calls and it runs at only ~12 fps and in a optimal setting of only a small constant buffer update (map/unmap of a D3DXMATRIX) and the actual drawcall. The best I can do is 3000 drawcalls at ~37 fps.

Are you sure it is not using instancing or multithreading?

P.S: My experiment was performed on a laptop with an i7 at 2.80 GHz (Turbo).

It's also a laptop with an i7; I'm not doing cbuffer updates for each individual call (they're scattered throughout though) but I am doing texture changes. The shaders and textures are quite simple, so the measurement is a good reflection of draw calls and without too much other work being done to skew the figures. I basically just took a nice batched up renderer (~200 calls when batching) and unbatched it, converting a DrawIndexed (...) call to multiple Draw (...) calls. Definitely no instancing or multithreading.


#12 Xcrypt   Members   -  Reputation: 154

Posted 18 May 2012 - 05:08 AM

What do you class as part of a 'draw call'? How much work do you include?

Because chances are if you want performance you are going to have to ditch the FX framework and start dealing with constants and other elements of a draw call yourself, properly batching/constraining updates.

Also how are you timing things?

What do I consider as part of a draw call? Well, basically setting effect variables for the object if needed, and then calling ID3D10Device::Draw().
I'm afraid you'll need to be a bit more specific about what you mean by how I am timing things.

I just tried ~10000 draw calls and it runs at only ~12 fps and in a optimal setting of only a small constant buffer update (map/unmap of a D3DXMATRIX) and the actual drawcall. The best I can do is 3000 drawcalls at ~37 fps.

Are you sure it is not using instancing or multithreading?

P.S: My experiment was performed on a laptop with an i7 at 2.80 GHz (Turbo).

Please include the Direct3D version under which your engine is running too!


It's also a laptop with an i7; I'm not doing cbuffer updates for each individual call (they're scattered throughout though) but I am doing texture changes. The shaders and textures are quite simple, so the measurement is a good reflection of draw calls and without too much other work being done to skew the figures. I basically just took a nice batched up renderer (~200 calls when batching) and unbatched it, converting a DrawIndexed (...) call to multiple Draw (...) calls. Definitely no instancing or multithreading.

What do you mean by batched up?

Edited by Xcrypt, 18 May 2012 - 05:10 AM.


#13 Waterlimon   Crossbones+   -  Reputation: 2638

Posted 18 May 2012 - 05:39 AM

You ARE testing using a release version of the program?

o3o


#14 Xcrypt   Members   -  Reputation: 154

Posted 18 May 2012 - 06:33 AM

Yes of course

#15 ATEFred   Members   -  Reputation: 1126

Posted 18 May 2012 - 11:11 AM

I just tried ~10000 draw calls and it runs at only ~12 fps and in a optimal setting of only a small constant buffer update (map/unmap of a D3DXMATRIX) and the actual drawcall. The best I can do is 3000 drawcalls at ~37 fps.


d3d11.1 should help with the constant setting part actually. It allows you to build up a massive constant buffer with all your scene consts, and then give d3d a window into it for a specific draw, so you only need to push data to the GPU once.
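
A CPU-side sketch of the "one big buffer, one window per draw" idea might look like this. It is written under assumptions: the class and struct names are invented, and the 16-constant (256-byte) window granularity is how I recall the D3D11.1 documentation describing the offset and count arguments, so check the docs before relying on it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// One shader constant is 16 bytes (a float4). The 16-constant step for
// window offsets and sizes is an assumption taken from the D3D11.1 docs.
constexpr std::size_t kBytesPerConstant = 16;
constexpr std::size_t kConstantsPerStep = 16;
constexpr std::size_t kStepBytes = kBytesPerConstant * kConstantsPerStep;

// Where one draw's constants live inside the big buffer, in constants.
struct ConstantWindow {
    unsigned firstConstant;
    unsigned numConstants;
};

// CPU-side staging for one large per-frame constant buffer.
class SceneConstants {
public:
    // Append one draw's data, padded up to the window granularity.
    ConstantWindow append(const void* data, std::size_t sizeInBytes) {
        assert(data != nullptr && sizeInBytes > 0);
        const std::size_t padded =
            (sizeInBytes + kStepBytes - 1) / kStepBytes * kStepBytes;
        const std::size_t offset = bytes_.size();
        bytes_.resize(offset + padded, 0);
        std::memcpy(bytes_.data() + offset, data, sizeInBytes);
        return { static_cast<unsigned>(offset / kBytesPerConstant),
                 static_cast<unsigned>(padded / kBytesPerConstant) };
    }

    // Total bytes that would be uploaded once per frame.
    std::size_t sizeInBytes() const { return bytes_.size(); }

private:
    std::vector<unsigned char> bytes_;
};
```

After the whole staging buffer has been uploaded once, each draw would pass its window as the first-constant and constant-count arguments of a call such as ID3D11DeviceContext1::VSSetConstantBuffers1, instead of mapping a buffer per draw.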

This seems unrelated to your case if you were not setting textures / render state blocks, etc. per draw, but I got noticeable speedups by filtering out all redundant API calls with a simple state cache. Maybe something the OP should look into?
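
A "simple state cache" of the kind mentioned here could be sketched as below. The details are assumptions made for illustration: resources are plain integer ids, only texture slots are tracked, and the counter stands in for the API call that a real cache would forward to the device.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Minimal redundant-bind filter for texture slots. 'apiCalls_' counts the
// binds that would actually reach the graphics API.
class TextureStateCache {
public:
    static constexpr std::size_t kSlotCount = 8;  // arbitrary for the sketch

    // Returns true if the bind was forwarded, false if it was filtered out.
    bool bindTexture(std::size_t slot, std::uint32_t textureId) {
        assert(slot < kSlotCount);
        if (valid_[slot] && bound_[slot] == textureId) return false;
        bound_[slot] = textureId;
        valid_[slot] = true;
        ++apiCalls_;  // a real cache would call PSSetShaderResources here
        return true;
    }

    // Forget everything, e.g. after code outside the cache touched the device.
    void invalidate() { valid_.fill(false); }

    int apiCalls() const { return apiCalls_; }

private:
    std::array<std::uint32_t, kSlotCount> bound_{};
    std::array<bool, kSlotCount> valid_{};
    int apiCalls_ = 0;
};
```

The same pattern extends to shaders, samplers, and render state blocks. The invalidate() hook matters whenever something else (a third-party library, for instance) changes device state behind the cache's back.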

#16 Xcrypt   Members   -  Reputation: 154

Posted 18 May 2012 - 01:09 PM


I just tried ~10000 draw calls and it runs at only ~12 fps and in a optimal setting of only a small constant buffer update (map/unmap of a D3DXMATRIX) and the actual drawcall. The best I can do is 3000 drawcalls at ~37 fps.


d3d11.1 should help with the constant setting part actually. It allows you to build up a massive constant buffer with all your scene consts, and then give d3d a window into it for a specific draw, so you only need to push data to the GPU once.

This seems unrelated to your case if you were not setting textures / render state blocks, etc. per draw, but I got noticeable speedups by filtering out all redundant API calls with a simple state cache. Maybe something the OP should look into?


1) I didn't say that! Wrong quote :P
2) I am already avoiding redundant state changes by comparing the currently active state with the target state for everything. And yes, this is def. worth it.

#17 ATEFred   Members   -  Reputation: 1126

Posted 18 May 2012 - 01:32 PM



I just tried ~10000 draw calls and it runs at only ~12 fps and in a optimal setting of only a small constant buffer update (map/unmap of a D3DXMATRIX) and the actual drawcall. The best I can do is 3000 drawcalls at ~37 fps.


d3d11.1 should help with the constant setting part actually. It allows you to build up a massive constant buffer with all your scene consts, and then give d3d a window into it for a specific draw, so you only need to push data to the GPU once.

This seems unrelated to your case if you were not setting textures / render state blocks, etc. per draw, but I got noticeable speedups by filtering out all redundant API calls with a simple state cache. Maybe something the OP should look into?


1) I didn't say that! Wrong quote :P
2) I am not doing any redundant state settings through comparing the current active state for everything with the target state. And yes, this is def. worth it.


Argh, my bad. I used selection quote while selecting within a quote, and must have got confused ;).

#18 Zlodo   Members   -  Reputation: 246

Posted 18 May 2012 - 05:22 PM

Draw call overhead seems to vary a lot from one graphics card to another.
My GUI library is not batching some things yet, so currently, in a test where I display many buttons, I get two draw calls per button. I just tried to render about two hundred of them (with OpenGL 4.2). I avoid issuing redundant state changes using a simple state cache, too.

I have a framerate limiter at 60 fps. On my GeForce GTX 580 it takes about 30% CPU (as seen in top, as I'm working in Linux).

I just tried it again with my old Radeon HD 5830 and there it takes about 50% CPU. I remember that this type of test resulted in even higher CPU usage before on the Radeon, but that was a while ago, and since I just reinstalled the ATI drivers for this test I probably have more recent ones now, where they may have improved things.

I couldn't compare FPS obtained by just letting it run without the framerate limiter, as it sadly turns out that my text renderer (and therefore my fps counter) is not working on the radeon for some reason. I'm not sure measuring FPS is a good way to compare the overhead of draw calls anyway, unless you render only 1 pixel polygons or something.
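
If the goal is to compare CPU overhead rather than frame rate, one option is to time the submission loop directly. The sketch below assumes a hypothetical submitDraw callback that issues a single draw; it measures only the CPU time spent inside the loop, and a driver may defer some of its work to a later call such as the buffer swap, so treat the result as a lower bound.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Measure CPU time spent submitting 'drawCount' draws, in microseconds.
// 'submitDraw' is a placeholder for whatever issues one draw call.
double timeSubmissionMicroseconds(int drawCount,
                                  const std::function<void(int)>& submitDraw) {
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < drawCount; ++i) {
        submitDraw(i);
    }
    const Clock::time_point end = Clock::now();
    return std::chrono::duration<double, std::micro>(end - start).count();
}
```

Dividing the result by drawCount gives an average cost per call that can be compared across cards and drivers without the fill rate of the scene getting in the way.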

Edited by Zlodo, 18 May 2012 - 05:23 PM.


#19 TiagoCosta   Crossbones+   -  Reputation: 2455

Posted 18 May 2012 - 06:32 PM

You ARE testing using a release version of the program?


Thanks for the reminder...
In release mode I can have ~20000 draw calls at ~100 fps... (DirectX 11)

So I think the OP should definitely consider writing a custom effects framework and optimizing it according to how his engine works, to reduce redundant state changes / CB updates, etc.

#20 mhagain   Crossbones+   -  Reputation: 8277

Posted 19 May 2012 - 12:19 PM

At this stage it seems clear that it's your matrix updates, and not the number of draw calls, that are your primary bottleneck. If all that you're updating is a matrix, and if that matrix lives in the same cbuffer as other shader constants, then you really should consider splitting it out to a separate cbuffer on its own. That will enable D3D to update it more efficiently and transfer less data to the GPU for each such update.
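
To make the size argument concrete, here is a small sketch with a made-up constant layout (the struct names and the non-matrix members are assumptions, since no shader code was posted). It compares how many bytes get rewritten per frame when the per-draw matrix shares a cbuffer with other constants versus when it has a cbuffer of its own.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layout: everything in one cbuffer.
struct CombinedConstants {
    float world[16];         // changes per draw
    float viewProj[16];      // changes per frame
    float lightDir[4];       // changes per frame
    float materialColor[4];  // changes per material
};

// Split layout: the per-draw matrix gets a cbuffer of its own.
struct PerObjectConstants {
    float world[16];
};

// Bytes rewritten per frame when the buffer is updated once per draw call.
constexpr std::size_t bytesPerFrame(std::size_t bufferSize, std::size_t draws) {
    return bufferSize * draws;
}

// These assume a 4-byte float and no padding in the structs.
static_assert(sizeof(CombinedConstants) == 160, "unexpected padding");
static_assert(sizeof(PerObjectConstants) == 64, "unexpected padding");
```

For 10000 draws per frame, this made-up layout rewrites 1.6 MB of constant data when everything shares one cbuffer, against 0.64 MB when only the matrix is updated per draw.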

It's really difficult to say much more without a better description of what exactly you're doing, and without seeing some code. Without those, everything is just guesswork.
