Shigeo.K

A certain situation where DX12 is slower than DX11


Hi,

I can't keep quiet about this, and I find it very strange that this issue hasn't come to light.

Until yesterday, my understanding was that DX12 is faster than DX11.

My own performance tests had indicated so, too.

The test was simple: drawing a single triangle per draw call, with a vast number of draw calls per frame (16,000 triangles = 16,000 draw calls).
Each triangle is drawn at a different position.
The results are as below.
(Core i9-9900KF, GeForce RTX 2080, 64 GB system memory)

Single thread 16,000 draw calls per frame. 
DX11 569 fps    DX12 955 fps    DX12(using bundle) 1341 fps

Multithreaded (16 threads, each thread issuing 1,000 draw calls), 16,000 draw calls per frame.
DX11 1056 fps    DX12 1440 fps    DX12(using bundle) 1456 fps 

DX12 ran at 1.36 to 2.36 times the DX11 frame rate. It can be said it's WAY faster. I'm so happy.

 

BUT, yesterday I did a different type of performance test, also very simple: drawing 1 mesh per frame, where the mesh consists of a vast number of triangles.

Please look at the result below.

Drawing only 1 mesh per frame. (The mesh has 44,458 vertices and 88,512 triangles)
DX11 9,921 fps    DX12 5,487 fps

Drawing only 1 mesh per frame. (The mesh has 77,402 vertices and 354,048 triangles)
DX11 6,451 fps    DX12 4,425 fps

Drawing only 1 mesh per frame. (The mesh has 708,896 vertices and 1,416,192 triangles)
DX11 2,238 fps    DX12 2,001 fps

DX12 runs at only 55% to 89% of the DX11 frame rate (55%!). I'm not happy.

I believe somebody must have seen this kind of result before, because the code is so simple.

Of course, my coding might be bad and might be causing this result.

But with such a simple test, I don't believe I did anything that wrong.
The vertex buffer and index buffer are both in HEAP_DEFAULT; only the constant buffer is in HEAP_UPLOAD, since it changes every frame.
Barriers and fences are at a minimum: just 1 fence per frame.
I don't think I could have messed that up.

 

My question is: is it common knowledge that DX12 is slower than DX11 in such a scenario (drawing 1 very big mesh)?

I hesitate to post my code here because it is fully commented in JAPANESE, and I don't want to bother this forum.

If Japanese-commented code is fine, I would happily post it here.

I'm looking forward to your replies, hoping my own code is the cause.

Thanks.

Gnollrunner:
I'm not an expert on this, but I don't think it's such a huge secret. I remember reading that DX11 was faster under certain conditions. One *claim* was that DX11 drivers have been around for much longer and are therefore more highly optimized. If you're getting frame rates in the thousands, though, I'm not sure I'd worry about it. I'm guessing this is some early rev of your code with simple shaders, right? If it gets to the point where it's actually an issue for users, I'd worry about it more.

All8Up:
These sorts of numbers are almost always caused by a general misunderstanding of the intentional differences between Dx11 and Dx12.  There is a typical case where these sorts of things show up in simple (and generally invalid as folks have mentioned) performance measurements such as this.  (Please note this is a generalization and the specific case may not be valid for each driver, it was just a common issue when I was testing things.)  The common issue is that Dx11 validates each state change and will ignore duplicate attempts to change state.  Given such a trivial test case, this means Dx11 is going to be ignoring pretty much everything except perhaps a transform change every frame.  Dx12 will set all the states even if they cause no actual changes, so yes it is slower in such a case.

This is *not* a flaw in the API; it is actually the purpose of the API.  In order for Dx11 to avoid sending duplicate states to the card, it has to keep track of what is currently set on the card and compare against that every time the API is called.  This is generally trivial overhead which most folks can ignore, but in real-world cases it adds up and is a significant reason that draw call throughput of Dx12 *can be* so much higher.  I want to emphasize that it *can be* higher, because if you implement your own caching layer you will most likely be back to Dx11 levels or lower, since they put a lot of work into optimizing this.  The basic point is that the onus is moved to the caller to verify the calls are useful: if you know without question that the states are actually changing, you can make the calls with no additional API overhead spent verifying something you already know to be true.  This is one of the many ways that Dx12 allows (but does not guarantee) greatly reduced CPU overhead and draw call costs.

So, all said and done, most likely you are seeing the 'dumb' API difference where it is following what you told it to without verification.  In a trivial test case such as this, it will show crazy numbers like you see.  In a real world test case where those states 'actually' need to be changed, you would see a minor speed up due to removed overhead. 

Shigeo.K:
Thank you for your very helpful replies, folks.

19 hours ago, Gnollrunner said:

I'm not an expert on this but I don't think it's such a huge secret. I remember reading that DX11 was faster under certain conditions. One *claim* was that DX11 drivers have been around for much longer and therefore are more highly optimized. 

I see, that makes sense.
 

19 hours ago, Gnollrunner said:

If you're getting frame rates in the thousands, I'm not sure I'd worry about it however.

Frame rates in the thousands, with only 1 mesh being drawn.
As a CG engineer and a game programmer, I have never seen a scene with only 1 mesh in it. (OK, I'm wrong, I do know of viewer apps that show a single model in the scene. But still, that's not the regular case.)
Thousands of fps appear only in this simple test; the number gets much lower in an actual scene.
I wasn't trying, or intending, to produce such a high frame rate. The test simply reports it, nothing else.
The magnitude of the fps doesn't matter (or does it?); whether the result is high or low, fast or slow, is all that matters.

 

19 hours ago, SoldierOfLight said:

I'm assuming that in D3D11, you're using SWAP_EFFECT_DISCARD, because everybody does, and in D3D12 it's SWAP_EFFECT_FLIP_DISCARD, because you have to, is that right?

Very much so.

19 hours ago, SoldierOfLight said:

Also try rendering the same mesh several times before calling Present to minimize the differences that Present will have, aiming for a realistic FPS (somewhere below 500). You'll probably find the measurements much closer.

Wow.
I drew the mesh 100 times before Present, as you said, and the result was as below.
DX11 306 fps    DX12 912 fps
DX12 is much faster, thank god!
Are you a psychic?
Very precise, pinpoint advice. Amazing.

Thanks to your advice, I now realize that the cause is Present.

Now a new question arises.

19 hours ago, SoldierOfLight said:

Try making them match.

How?
Do you mean there is a way to somehow equalize the DX11 and DX12 presenting speed?
Or do you just mean I should keep in mind that Present differs between DX11 and DX12?
I think that when we compare DX11 and DX12, we are of course comparing different technologies.
So it is natural that there are differences in functionality such as Present.
They should be different, and we shouldn't have to tweak anything. To go further, we must not equalize them.

Needless to say, I did everything to put DX11 and DX12 under the same conditions:
same mesh, same screen size, same shaders (as far as possible), same rendering, and code kept simple to be fair.

 

19 hours ago, All8Up said:

These sorts of numbers are almost always caused by a general misunderstanding of the intentional differences between Dx11 and Dx12. 

Thank you for your reply, but I don't understand half of what you are saying.
Like I said above, what is wrong with a result containing a high number?
It is as rigid a result as the others.

I have the latest information.
I rewrote my code to put the constant buffer in HEAP_DEFAULT.
The frame rates then got slightly better, about 10% up.

I suspect my fencing is not sufficient.

I will keep posting the latest progress on this issue.

Shigeo.K:
First of all, please let me make a correction.

I said the fps got about 10% faster when I moved the constant buffer from HEAP_UPLOAD to HEAP_DEFAULT.
That's wrong. It is actually the opposite: it got about 10% slower.

Well, back to the main topic.
Being fast only in such a scenario is almost useless.
As All8Up mentioned, DX11 can show high frame rates by ignoring redundant work, but it can't keep up in a real scenario the way DX12 does.
By "real" I mean a general/typical number of draw calls for games or graphics software.
 
The facts I got are as below.
1. As of now (just as of now, on my PC, in my test, at this moment), DX11 is faster than DX12 in a scenario with few (below 15) draw calls of massive polygon counts.
2. DX12 is way faster in a scenario with a massive number of draw calls of few polygons each.
3. DX12 is also faster in a scenario with a regular number (above 15) of draw calls of massive polygon counts.
4. As GPU drivers and Windows are updated, DX12 might become faster in that scenario (few draw calls with massive polygon counts) too.

So I conclude that DX12 is still the way to go if I want my game to run as fast as possible.

Thanks guys.

wintertime:
fps is a useless, flawed metric that only clueless gamers/journalists use, because it is a reciprocal and does not add up linearly. Percentages of fps are even more useless. Use ms (milliseconds), because they are an absolute metric on a single machine, independent of unrelated stuff happening in the frame! Additionally, several frames of work might be queued up for the GPU to work on in parallel, with no direct feedback to the CPU, which means you may be measuring only the queueing if you don't use a GPU profiler.

1. is not a fact, because you used a flawed microbenchmark. Probably DX12 has a larger once-per-frame setup cost on your computer, which amortizes over any normal amount of real work done in the frame.

2. As it should be; this was one reason for making the newer APIs.

3. See above.

4. Don't get your hopes up too much. The newer APIs are made to let the application programmer do the tweaking, without stuff happening behind their back in the driver; with the old API, the guys at NV or AMD do some nonstandard black magic in the driver, and only for AAA games.

Shigeo.K:
26 minutes ago, wintertime said:

independent of unrelated stuff happening in a frame!

Doesn't this affect ms too?

I think the result would be the same no matter what metric I use: fast in fps means fast in ms, slow in fps means slow in ms.

In such a simple test, fps is a reasonable and sufficient metric, I think.
 

26 minutes ago, wintertime said:

1. Is not a fact

1. is very much a fact. Should I send you a captured image?

 

26 minutes ago, wintertime said:

4. Don't get your hopes up too much, newer APIs are made to let the app-programmer do the tweaking without behind their back stuff happening in the driver; old API the guys at NV or AMD do some nonstandard black magic in the driver for AAA games only.

This is an interesting story. Thanks for sharing the information. I bet you have a source for this.

You are saying 1. is not a fact. So do you have some sample program where DX12 runs faster in that scenario (1 draw call with millions of polygons)?

wintertime:
7 minutes ago, Shigeo.K said:

1 is a very much fact. Should I send u a capture-image?

One unusual test is just a single measurement and does not imply that the whole of a complex system (DX in this case) is fast or slow.

 

Btw., with ms you can better measure the time individual systems take, since usually you would want to measure real-world performance in a real application, not a microbenchmark. You could, for example, try to measure without the per-frame setup, but as said above, this would also be flawed if you measure CPU time rather than GPU timers.
