Most efficient way to render thousands of point sprites?

Started by
5 comments, last by JohnBolton 18 years, 9 months ago
Hey all, I've been reading Frank Luna's "Intro to 3D Game Programming" book and have just implemented his example particle engine. In it, he says that the recommended way to render point sprites is to fill a dynamic vertex buffer incrementally in batches. That is, lock & fill one section, unlock & render it, lock & fill the next section, unlock & render it, and so on. This way, the GPU can begin rendering one section of the vertex buffer while the CPU continues filling the next section to be rendered (as opposed to filling one huge vertex buffer at once and then making a single draw call). This makes sense to me, as it lets the GPU and CPU work in parallel.

However, in my tests, the incremental fill / render method is actually slower than one massive fill / render. I set my particle engine to generate 100,000 point sprites (that stay alive forever), and I get about a 1 to 2 millisecond faster frame time if I simply dump all of the sprites into one massive vertex buffer and make one DrawPrimitive() call, rather than loading / rendering them in smaller batches.

I have tried numerous settings for the batch size of the incremental buffer. The largest I tried was a batch size of 16368 with a total buffer size of 65472. Dropping the batch size drastically, to 512 sprites with a total buffer size of 2048, only made the frame time about 2 milliseconds slower. Still, making one huge buffer of 100,000 sprites and loading / rendering once was faster in all cases.

Here is how I'm creating and locking my vertex buffers:

// Creating the buffer:
hr = pd3dDevice->CreateVertexBuffer( m_VBSize * sizeof( Particle ),
                                     D3DUSAGE_DYNAMIC | D3DUSAGE_POINTS | D3DUSAGE_WRITEONLY,
                                     Particle::FVF, D3DPOOL_DEFAULT, &m_VB, 0 );

// When batching:
m_VB->Lock( m_VBOffset * sizeof( Particle ), m_VBBatchSize * sizeof( Particle ),
            (void**)&v, m_VBOffset ? D3DLOCK_NOOVERWRITE : D3DLOCK_DISCARD );
// If I change that to simply D3DLOCK_DISCARD, I notice no real increase in speed.

// When not batching, I just Lock from 0 to NumParticles with D3DLOCK_DISCARD as the flag
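
For reference, the incremental scheme described above can be sketched without the Direct3D headers. This is a minimal sketch, not Luna's code: `LockMode`, `BatchPlan`, and `planBatches` are hypothetical names, and the enum merely stands in for D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD. It only plans which range each Lock() would cover and which flag it would use, following the same rule as the snippet above (discard when the offset is 0, no-overwrite otherwise).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for D3DLOCK_NOOVERWRITE / D3DLOCK_DISCARD,
// so the sketch compiles without the Direct3D headers.
enum class LockMode { NoOverwrite, Discard };

struct BatchPlan {
    std::size_t offset;  // first particle slot to lock (in particles, not bytes)
    std::size_t count;   // particles written and drawn in this batch
    LockMode mode;       // flag the Lock() call would use
};

// Plans the lock/draw sequence for one frame.
// bufferSize and batchSize are in particles; a batch must fit in the buffer.
std::vector<BatchPlan> planBatches(std::size_t numParticles,
                                   std::size_t bufferSize,
                                   std::size_t batchSize) {
    assert(batchSize > 0 && batchSize <= bufferSize);
    std::vector<BatchPlan> plan;
    std::size_t offset = 0;
    std::size_t remaining = numParticles;
    while (remaining > 0) {
        const std::size_t count = remaining < batchSize ? remaining : batchSize;
        // Wrap to the start when the next batch would run past the end.
        if (offset + batchSize > bufferSize) offset = 0;
        // At offset 0 the buffer is being reused, so request a discard.
        const LockMode mode =
            (offset == 0) ? LockMode::Discard : LockMode::NoOverwrite;
        plan.push_back({offset, count, mode});
        offset += batchSize;
        remaining -= count;
    }
    return plan;
}
```

With a buffer of 2048 slots and a batch size of 512, the planned offsets cycle through 0, 512, 1024, 1536 and then wrap back to 0 with a discard.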



So, what's the deal here? Is it or is it not faster to split the point sprite renders into batches within the vertex buffer? What's the best way to do it?
Heh, I just asked this very same question 8 hours ago. According to Coder, the driver handles this "batch before continuing" for you. It would appear that the lock, fill, unlock, render, repeat method has gone the way of the dodo.

My Parellel Processing Thread
So, is the best way, without a doubt, to just create one massive dynamic vertex buffer, fill it in one go, and then render? The book was written with Direct3D 9 in mind, so I'd expect it to be at least *somewhat* correct...

I'm also only setting the stream source once per frame either way, if that makes any difference.
Quote:Original post by kosmon_x
So, is the best way, without a doubt, to just create one massive dynamic vertex buffer, fill it in one go, and then render? The book was written with Direct3D 9 in mind, so I'd expect it to be at least *somewhat* correct...

I'm also only setting the stream source once per frame either way, if that makes any difference.


I think it will be faster if you use multithreading. Otherwise, your thread will wait for the draw call to return before moving on to filling the next part of the buffer. I don't know, I'm just throwing a guess out.
Charles Reed, CEO of CJWR Software LLC
If you lock a buffer with the discard flag, you are saying you don't need its current contents anymore. The driver is then able to give you a new buffer that isn't waiting on the hardware.
What kind of speeds are you getting with each method?
Even if I change the batch-rendering vertex buffer to D3DLOCK_DISCARD, loading all 100,000 sprites into a buffer at once is still faster.

This is with 100,000 alpha-blended sprites on screen, in debug mode:

Batch rendering with a batch size of 16k and a buffer size of 64k:
91.5 - 92.5 ms frame time
Batch rendering with a batch size of 512 and a buffer size of 2048:
93.0 - 94.0 ms frame time
No batch rendering - loading all 100,000 sprites at once:
90.5 - 91.0 ms frame time

Those are very close to the actual numbers I got (as best as I can remember them). Again, changing the batch flag to D3DLOCK_DISCARD did not make it run any faster than not batching...

The same relationship between the batching and non-batching in terms of speed held for any number of sprites from 5k up to 100k.

It's not a huge difference in speed, but it is a difference nonetheless. From these tests, it would appear that batch rendering is definitely not worth it. Has anyone else had similar experiences, or am I doing this wrong, and batch rendering really is more efficient?

Drawing 100000 point sprites in a single call is faster because:
  • draw calls are very expensive
  • the size of your buffer is very small -- only 400k.
Maybe if your buffer was a lot bigger (perhaps 16MB), it might be faster to split it up into several draws.
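
To put a number on the first point, the count of draw calls per frame follows from a ceiling division. A small sketch, where `drawCallsNeeded` is a hypothetical helper and the inputs are the sprite count and batch sizes quoted earlier in the thread:

```cpp
#include <cassert>
#include <cstddef>

// Number of Lock / Unlock / DrawPrimitive rounds needed to submit numSprites
// when at most batchSize sprites go into each batch (rounds up).
std::size_t drawCallsNeeded(std::size_t numSprites, std::size_t batchSize) {
    assert(batchSize > 0);
    return (numSprites + batchSize - 1) / batchSize;
}
```

For 100,000 sprites this gives 196 draws at a batch size of 512 and 7 at 16368, versus 1 for the single large buffer.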

Anyway, you did the right thing. You tried it and profiled it. Others would have blindly "optimized" their code without profiling and not even known that they made it slower.

Also, the 1 ms difference is negligible. The bottleneck is elsewhere -- perhaps in your computations or the GPU's vertex or pixel processing.
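
As a rough check on how small that difference is, the slowdown can be expressed relative to the unbatched frame time. This is only a sketch: `slowdownPercent` is a hypothetical helper, and the inputs are approximate midpoints of the frame-time ranges reported above (an assumption, since only ranges were given, and they were measured in debug mode).

```cpp
#include <cassert>

// Relative slowdown of measuredMs versus baselineMs, as a percentage.
double slowdownPercent(double baselineMs, double measuredMs) {
    assert(baselineMs > 0.0);
    return (measuredMs - baselineMs) / baselineMs * 100.0;
}
```

Using 90.75 ms for the unbatched case, 92.0 ms for 16k batches, and 93.5 ms for batches of 512, this works out to roughly 1.4% and 3% respectively.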
John Bolton, Locomotive Games (THQ). Current Project: Destroy All Humans (Wii). IN STORES NOW!

