Blasp

DX11 Fastest way to draw Quads?


I am working on a DX11 2D engine. What is the fastest way to draw lots of quads? Currently I create an immutable vertex buffer with the quad geometry during initialization, and then I draw the quads using instancing. However, there are multiple ways to do this:

- triangle list (6 vertices in the immutable vertex buffer) or triangle strip (4 vertices in the immutable vertex buffer)
- using indices or not using indices

Which should I choose? It seems to me that triangle strips and indices have more to do with memory / bandwidth optimization for large or dynamic models. So couldn't they actually hurt raw rendering performance of simple quad instances?
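For reference, the three static layouts being compared can be sketched on the CPU side as follows. This is only an illustration: the `Corner` struct, the array names, and the helper functions are hypothetical, and the corner order is one possible choice (which side counts as front-facing depends on the rasterizer state).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 2D vertex: one corner of a unit quad in [0,1] x [0,1].
struct Corner { float x, y; };

// Triangle strip: 4 vertices, no index buffer.
const std::array<Corner, 4> kStripQuad = {{
    {0.f, 0.f}, {1.f, 0.f}, {0.f, 1.f}, {1.f, 1.f}
}};

// Non-indexed triangle list: 6 vertices, two corners stored twice.
const std::array<Corner, 6> kListQuad = {{
    {0.f, 0.f}, {1.f, 0.f}, {0.f, 1.f},
    {0.f, 1.f}, {1.f, 0.f}, {1.f, 1.f}
}};

// Indexed triangle list: the same 4 corners as the strip plus 6 16-bit indices.
const std::array<std::uint16_t, 6> kQuadIndices = {{0, 1, 2, 2, 1, 3}};

// Buffer sizes in bytes for each layout.
inline std::size_t stripBytes() { return kStripQuad.size() * sizeof(Corner); }
inline std::size_t listBytes()  { return kListQuad.size() * sizeof(Corner); }
inline std::size_t indexedBytes() {
    return kStripQuad.size() * sizeof(Corner) +
           kQuadIndices.size() * sizeof(std::uint16_t);
}

// Checks that expanding the indices yields exactly the non-indexed list.
inline bool indexedMatchesList() {
    for (std::size_t i = 0; i < kQuadIndices.size(); ++i) {
        assert(kQuadIndices[i] < kStripQuad.size());
        const Corner& a = kStripQuad[kQuadIndices[i]];
        const Corner& b = kListQuad[i];
        if (a.x != b.x || a.y != b.y) return false;
    }
    return true;
}
```

For a single quad the size differences are a handful of bytes, which fits the point that strips and indices mainly pay off for larger or dynamic meshes.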


Also consider that for a 2D engine, performance of drawing quads may not be that important.  So long as you get some reasonable batching going, you're probably more likely to be bottlenecked on fillrate and ROP than on vertices or draw calls.

Profiling would normally be a good idea, but in this case I don't know whether performance will be an issue at all.
You could use a geometry shader that takes a single vertex and outputs a quad (basically point sprites, as mentioned above).


Another approach is not to bind a vertex buffer at all and to use an instanced triangle strip. You then draw n instances of four non-existent verts and use SV_VertexID to position them. However, maybe someone could shed light on whether the overhead of using instancing still applies in this case? Otherwise, the approach described in that linked presentation (no vertex buffer, but drawing an indexed triangle list) would still be better, if a bit more complex.
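As a rough illustration of the SV_VertexID idea, here is a CPU-side C++ emulation of what the vertex shader might compute for a 4-vertex strip. The real code would live in HLSL; `Float2`, `QuadInstance`, and both function names are hypothetical, and the per-instance origin/size layout is an assumption rather than anything prescribed by the thread.

```cpp
#include <cassert>
#include <cstdint>

struct Float2 { float x, y; };

// Maps a strip vertex ID (0..3) to a unit-quad corner:
// 0 -> (0,0), 1 -> (1,0), 2 -> (0,1), 3 -> (1,1).
inline Float2 cornerFromVertexId(std::uint32_t vertexId) {
    assert(vertexId < 4);
    return { static_cast<float>(vertexId & 1u),
             static_cast<float>(vertexId >> 1u) };
}

// Hypothetical per-instance data, e.g. read from a buffer via SV_InstanceID.
struct QuadInstance {
    Float2 origin;  // position of corner (0,0)
    Float2 size;    // width and height
};

// Places one corner of the quad; this is the part a vertex shader would do.
inline Float2 quadVertexPosition(const QuadInstance& q, std::uint32_t vertexId) {
    const Float2 c = cornerFromVertexId(vertexId);
    return { q.origin.x + c.x * q.size.x,
             q.origin.y + c.y * q.size.y };
}
```

The low bit of the ID selects the x corner and the next bit the y corner, so no vertex data has to be fetched for the quad shape itself.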

Edited by Oetker


It's covered in Matias' link, but for reinforcement - don't use instancing for small meshes. There's a decent performance penalty when the instanced mesh is less than about 500 verts.

 

Is there a rationale for the performance penalty encountered for small mesh vs drawIndexed method by the way?

As far as I know, modern (desktop?) GPUs no longer have a fixed-function vertex attribute fetch and use a general buffer read under the hood.
So the only difference I see between Bilodeau's method and the DrawInstanced method, from a hardware point of view, is that the instanced call provides an extra SV_InstanceID input.

Since it looks like it's faster to compute an instance ID and a vertex ID from a single value using modulo and divide operations than to use the hardware-provided SV_InstanceID, is there any reason to use instancing at all? The only reason I can see is that "coupled" vertex ID / instance ID values may require 32-bit indices instead of 16-bit ones.
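The divide/modulo scheme can be sketched in C++ like this. It is a CPU-side illustration under assumptions: the function and struct names are hypothetical, the 0,1,2,2,1,3 index pattern is one possible choice, and the 16-bit limit assumes every 16-bit value is usable as an index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds the index buffer for quadCount quads drawn as ONE indexed triangle
// list with no instancing. Quad q owns vertex IDs 4q .. 4q+3.
inline std::vector<std::uint32_t> buildQuadIndices(std::uint32_t quadCount) {
    static const std::uint32_t pattern[6] = {0, 1, 2, 2, 1, 3};
    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(quadCount) * 6);
    for (std::uint32_t q = 0; q < quadCount; ++q) {
        for (std::uint32_t i = 0; i < 6; ++i) {
            indices.push_back(q * 4u + pattern[i]);
        }
    }
    return indices;
}

struct QuadVertexId {
    std::uint32_t quad;    // plays the role of the instance ID
    std::uint32_t corner;  // 0..3 within the quad
};

// What the vertex shader would derive from the single vertex ID.
// For a divisor of 4 this is equivalent to (id >> 2) and (id & 3).
inline QuadVertexId decodeVertexId(std::uint32_t vertexId) {
    return { vertexId / 4u, vertexId % 4u };
}

// Upper bound on quads per draw if vertex IDs must fit in 16 bits,
// assuming all 65536 values can be used as indices.
inline std::uint32_t maxQuadsWith16BitIndices() {
    return 65536u / 4u;
}
```

Beyond that bound the coupled IDs no longer fit in 16 bits, which is the 32-bit index cost mentioned above.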


Is there a rationale for the performance penalty encountered for small mesh vs drawIndexed method by the way?

GCN's wavefront size is 64.
That means GCN works on 64 vertices at a time.
If you make two DrawPrimitive calls of 32 vertices each, GCN will process them using 2 wavefronts, wasting half of its processing power.

It's actually a bit more complex, as GCN has compute units (CUs), and each CU has 4 SIMD units. Each SIMD unit can execute between 1 and 10 wavefronts. There are also some fixed-function parts, such as the rasterizer, which may have some overhead when small meshes are involved.

Long story short, it's all about load balancing, and small meshes leave a lot of idle space; hence the sweet spot is around 128-256 vertices for NVIDIA, and around 500-600 vertices for AMD (based on benchmarks).
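The wavefront arithmetic from the start of this post can be written out as a deliberately simplified model. It assumes each draw call's vertices are packed into wavefronts on their own and ignores CUs, SIMD scheduling, and fixed-function overhead, so it is a sketch of the reasoning, not a performance predictor. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Number of wavefronts one draw call needs under the simplified model:
// vertices are rounded up to whole wavefronts of waveSize lanes.
inline std::uint32_t wavefrontsForDraw(std::uint32_t vertexCount,
                                       std::uint32_t waveSize = 64) {
    assert(waveSize > 0);
    return (vertexCount + waveSize - 1) / waveSize;
}

// Fraction of lanes doing useful work when issuing drawCalls separate
// draws of vertexCount vertices each (1.0 means no idle lanes).
inline double laneUtilization(std::uint32_t vertexCount,
                              std::uint32_t drawCalls,
                              std::uint32_t waveSize = 64) {
    const std::uint32_t waves = wavefrontsForDraw(vertexCount, waveSize) * drawCalls;
    if (waves == 0) return 0.0;
    const double usedLanes  = static_cast<double>(vertexCount) * drawCalls;
    const double totalLanes = static_cast<double>(waves) * waveSize;
    return usedLanes / totalLanes;
}
```

With these definitions, `laneUtilization(32, 2)` gives 0.5, matching the "wasting half" example, while a single 64-vertex draw gives 1.0.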


Wouldn't that apply to non-instanced draw calls as well?

Yes indeed. Note that I said two DrawPrimitive calls, not two instances in one instanced DrawPrimitive call.
Old hardware (GeForce 6000 & 7000) had an overhead when using instancing with small instance counts that was not present with non-instanced draw calls. But that hardware worked quite differently.

Edited by Matias Goldberg
