Poor performance drawing quads

Started by ValMan
5 comments, last by MJP 13 years, 5 months ago
Hello,

This is a revisit of http://www.gamedev.net/community/forums/topic.asp?topic_id=576359.

I implemented my own mechanism for rendering quads, lines, points and point sprites using a material system as a replacement for D3DXSprite. The performance I am getting is far below what I was getting using D3DXSprite previously, so I am trying to figure out what I can do to improve it.


What I tried:

- Analysis: probably not GPU bound. Resolution, texture size, shader complexity, the actual number of vertices per renderable, and blend states make little difference to frame time.

- Profiled the graphics side using PIX. Found that D3DXSprite uses a static index buffer and a circular locking scheme for its dynamic vertex buffer. Implemented both (see the sketch after this list): performance improved, but not enough.

- Profiled the CPU side using AMD CodeAnalyst. Most of the time is spent generating vertices and copying data into the system-memory and video-memory buffers, which is how the scheme works. For font rendering, most of the time is spent in std::map lookups of character info by character code.

- Added "restore pass" functionality for effect system to avoid saving/restoring state of entire pipeline. Performance didn't improve.

- Added caching of certain parameters such as WorldViewProjection matrix to avoid setting more than once per frame. Performance didn't improve.

- Added caching of current effect technique to avoid calling SetTechnique when not needed. Performance didn't improve.

- Added a concatenation feature to reduce the number of renderables: if the last renderable has the same primitive type/material/transform/z-order, the vertices of the renderable being added are appended to it. Performance doubled, but it is still not enough.

- Made the vertex structures contain only primitive data types (without explicit constructors), because the AMD profiler showed a lot of time being spent initializing them. No performance improvement.
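For reference, this is roughly how the circular locking scheme I copied from D3DXSprite works. It's a minimal sketch assuming a dynamic D3D9 vertex buffer; the names g_pVB, g_bufferSize and g_writePos are placeholders, not my actual code:

```cpp
#include <d3d9.h>
#include <cstring>

extern IDirect3DVertexBuffer9* g_pVB; // created with D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY
extern UINT g_bufferSize;             // total buffer size in bytes
extern UINT g_writePos;               // current write offset in bytes

// Copies vertex data into the ring buffer and returns the byte offset it was
// written at (used later as the stream offset for the draw call).
UINT AppendVertices(const void* data, UINT sizeBytes)
{
    DWORD flags = D3DLOCK_NOOVERWRITE;   // append without stalling the GPU
    if (g_writePos + sizeBytes > g_bufferSize)
    {
        flags = D3DLOCK_DISCARD;         // buffer full: orphan it and restart
        g_writePos = 0;
    }

    void* dest = NULL;
    if (FAILED(g_pVB->Lock(g_writePos, sizeBytes, &dest, flags)))
        return 0;

    std::memcpy(dest, data, sizeBytes);
    g_pVB->Unlock();

    const UINT offset = g_writePos;
    g_writePos += sizeBytes;
    return offset;
}
```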

Statistics:

Release build, pure device, release D3D runtime, running 800x600 fullscreen at 60 Hz with no v-sync.

6 renderables
216 triangles
6 batches
24 state changes
48 filtered state changes
>> 0.74 ms per frame, 1221 fps.

8 renderables
1476 triangles
8 batches
32 state changes
64 filtered state changes
>> 1.69 ms per frame, 539 fps.

That was on a previous-generation computer: 1.64 GHz CPU, 2 GB RAM, NVIDIA Quadro NVS 285.

On my previous-generation multimedia laptop, the first case runs at 240 fps and the second case at 54 fps.
The last binary compiled with D3DXSprite ran at 240 fps in both the first and the second case on the same laptop.

On the laptop, the new path doesn't run fast enough to hold 60 Hz with v-sync on, which is below my quality standards.


Thanks, and I'd appreciate any suggestions for improvement.
Val
It seems like you should use instancing to draw 2D sprites. I read your previous topic and noticed that you also hold all of your vertices in one large vertex buffer. A vertex buffer doesn't need to be used that way for 2D work.

For a quad, create a static buffer:

(-1, 1, 0)
(1, 1, 0)
(-1, -1, 0)
(1, -1, 0)

Then use this buffer each time you need a quad drawn; you only need to change the matrix you send with the draw to control the size. If you want to instance it, just put the matrix info into a separate vertex stream and send that along with the quad above.
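As a rough illustration, creating that static quad buffer could look like this (sketched with D3D9 to match your setup; the FVF and pool choice are just assumptions):

```cpp
#include <d3d9.h>
#include <cstring>

struct QuadVertex { float x, y, z; };
const DWORD QUAD_FVF = D3DFVF_XYZ;

IDirect3DVertexBuffer9* CreateUnitQuad(IDirect3DDevice9* device)
{
    // Four corners of the unit quad above, ordered for a triangle strip.
    const QuadVertex verts[4] =
    {
        { -1.0f,  1.0f, 0.0f },
        {  1.0f,  1.0f, 0.0f },
        { -1.0f, -1.0f, 0.0f },
        {  1.0f, -1.0f, 0.0f },
    };

    IDirect3DVertexBuffer9* vb = NULL;
    if (FAILED(device->CreateVertexBuffer(sizeof(verts), 0, QUAD_FVF,
                                          D3DPOOL_MANAGED, &vb, NULL)))
        return NULL;

    void* dest = NULL;
    vb->Lock(0, sizeof(verts), &dest, 0);
    std::memcpy(dest, verts, sizeof(verts));
    vb->Unlock();
    return vb;
}
```

Each quad is then drawn by setting a per-quad world matrix and calling DrawPrimitive(D3DPT_TRIANGLESTRIP, 0, 2) against this same buffer.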


I have implemented my own 2D sprite system for my user interface. I got about a 400% increase in speed over ID3DX10Sprite, but a lot of that had to do with the fact that I instance all of my UI calls. So my entire UI only executes as many draw calls as there are open windows (since each window needs to be drawn in order, that is).
Wisdom is knowing when to shut up, so try it.
--Game Development http://nolimitsdesigns.com: Reliable UDP library, Threading library, Math Library, UI Library. Take a look, its all free.
I am using D3D9; I believe instancing is a D3D10-only or D3D9.0c-only feature?

The major part of the slow-down is in rendering text. Somehow I doubt that sending 500 or so draw calls is going to be faster than sending 2 draw calls with 250 quads each. In fact without instancing, I'm pretty sure it wouldn't be faster.
Instancing is part of shader model 3.0. It works pretty much exactly the same way it does in D3D10 (separate vertex stream with instance data). You can also do instancing with shader constants using shader model 2.0 (the instancing sample in the SDK shows you how).

Anyway, what instancing buys you is the ability to offload computations to the GPU. For instance, if your interface for a sprite is a texture, a transform matrix, and a source rectangle, you can do all of the calculations needed to build the screen-space vertex positions in the vertex shader rather than pre-transforming them on the CPU. In fact you don't even need instancing to do this, but then you have to do one draw call per sprite, which sucks.
I looked at the Instancing sample in the SDK. I have one question: if I store the positions, colors and tex coords of my quads in shader constants as arrays, then what should the verts themselves contain? Should I have each vert contain just the diffuse color as the only component, so I have some kind of data to send? It seems like a huge waste, since most characters will be the same color and all verts in one character are guaranteed to be the same color.
For instancing you wouldn't put the data into shader constants; you would put the data into a separate vertex buffer. Also, you have to juggle one thing against another. If you have 200 characters on your screen at once (on average), let's figure out the data requirements. For each character of text you need this:

pos = float3
scale = float2 (x and y scaling)
color = int(rgba)

200*12*4*2 = 19,200 bytes of information. But you decrease your draw calls from 200 to one. I don't know if you know this, but for each draw call there is a fixed overhead paid in CPU calculations. So if you are doing 200 draw calls on just text each frame, then you are using up around 20-30% of your frame time on text drawing alone.

On my computer, my limit is around 700 or so draw calls if I want to stay above 60 fps. The number of vertices DOES NOT MATTER. I stress that, because it's not a GPU slowdown, it's a CPU one. So 19,200 bytes is worth the huge speedup.
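To make the layout concrete, here is a rough D3D9 sketch of that per-character instance stream using stream-frequency instancing (SM 3.0 hardware). The struct, buffer names and register layout are illustrative assumptions, not working code from my UI:

```cpp
#include <d3d9.h>

// One entry per character; this stream advances once per instance.
struct SpriteInstance
{
    float    x, y, z;        // pos   = float3
    float    scaleX, scaleY; // scale = float2 (x and y scaling)
    D3DCOLOR color;          // color = int (RGBA)
};

// Stream 0: the shared 4-vertex unit quad. Stream 1: per-instance data.
const D3DVERTEXELEMENT9 g_decl[] =
{
    { 0, 0,  D3DDECLTYPE_FLOAT3,   D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0 },
    { 1, 0,  D3DDECLTYPE_FLOAT3,   D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0 },
    { 1, 12, D3DDECLTYPE_FLOAT2,   D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 1 },
    { 1, 20, D3DDECLTYPE_D3DCOLOR, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_COLOR,    0 },
    D3DDECL_END()
};

// Draws all characters in a single call. A vertex declaration created from
// g_decl is assumed to be set on the device already.
void DrawTextInstanced(IDirect3DDevice9* device,
                       IDirect3DVertexBuffer9* quadVB,
                       IDirect3DIndexBuffer9* quadIB,
                       IDirect3DVertexBuffer9* instanceVB,
                       UINT numChars)
{
    // Stream 0 repeats the quad once per instance; stream 1 steps per instance.
    device->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | numChars);
    device->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1u);

    device->SetStreamSource(0, quadVB, 0, 3 * sizeof(float));
    device->SetStreamSource(1, instanceVB, 0, sizeof(SpriteInstance));
    device->SetIndices(quadIB);

    // One draw call for all characters: 4 verts / 2 triangles per quad.
    device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, 4, 0, 2);

    // Reset stream frequencies afterwards.
    device->SetStreamSourceFreq(0, 1);
    device->SetStreamSourceFreq(1, 1);
}
```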

[Edited by - smasherprog on November 12, 2010 1:52:56 PM]
Wisdom is knowing when to shut up, so try it.
--Game Development http://nolimitsdesigns.com: Reliable UDP library, Threading library, Math Library, UI Library. Take a look, its all free.
Quote: Original post by ValMan
I looked at the Instancing sample in the SDK. I have one question: if I store the positions, colors and tex coords of my quads in shader constants as arrays, then what should the verts themselves contain? Should I have each vert contain just the diffuse color as the only component, so I have some kind of data to send? It seems like a huge waste, since most characters will be the same color and all verts in one character are guaranteed to be the same color.


You still need to map each of the 4 verts to a corner of your sprite. If you don't do that then all of your verts end up with the same screen space position, and you're not going to see anything. :P

When I wrote my own sprite renderer for D3D11, my vertex buffer had 4 verts with positions mapped to (0, 0, 0), (1, 0, 0), (1, 1, 0), and (0, 1, 0). That way, in the vertex shader I could figure out the unscaled width and height of the sprite in screen space by passing in the size of the texture being used and using that to scale the vertices. After that you can transform the vertex by your transform matrix to apply additional scaling, translation, and/or rotation.

Also keep in mind that if you go the "shader constant instancing" route, you also need a value in each vertex that identifies which instance it belongs to so that you can index into your array of constants.
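As a rough sketch of what the CPU side of that shader-constant route can look like in D3D9 (the batch size, register layout and names are assumptions for illustration, not the SDK sample's exact code):

```cpp
#include <d3d9.h>
#include <d3dx9.h>

const UINT BATCH_SIZE = 50; // limited by available vertex shader constants

// Each vertex carries its quad corner plus the index of the sprite it belongs
// to, so the vertex shader can fetch that sprite's data from the constant array.
struct BatchVertex
{
    float cornerX, cornerY; // (0,0), (1,0), (1,1) or (0,1)
    float instanceIndex;    // 0 .. BATCH_SIZE-1, same for all 4 verts of a quad
};

// Per-sprite data uploaded as shader constants for each batch:
// one float4 for position/scale and one float4 for the source rectangle.
void UploadSpriteConstants(IDirect3DDevice9* device,
                           const D3DXVECTOR4* posScale,   // BATCH_SIZE entries
                           const D3DXVECTOR4* sourceRect, // BATCH_SIZE entries
                           UINT count)
{
    // Registers c0..c49 hold posScale, c50..c99 hold sourceRect (assumed layout).
    device->SetVertexShaderConstantF(0,  (const float*)posScale,   count);
    device->SetVertexShaderConstantF(50, (const float*)sourceRect, count);
}
```

The vertex shader then uses instanceIndex to pick the right float4s out of the constant arrays, so a whole batch of sprites still goes out in a single draw call.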
