Most efficient way to render thousands of point sprites?
Hey all,
I've been reading Frank Luna's "Intro to 3D Game Programming" book and have just implemented his example particle engine. In it, he says that the recommended way to render point sprites is to fill a dynamic vertex buffer incrementally in batches. That is, lock & fill one section, unlock & render it, lock & fill the next section, unlock & render it, etc. In this way, the GPU can begin rendering one section of the vertex buffer while the CPU continues filling in the next section to be rendered (as opposed to filling one huge vertex buffer at once and then making a single draw call). This makes sense to me, as it allows the GPU and CPU to work together.
However, in my tests, the incremental fill / render method is actually slower than just doing one massive fill / render. I set my particle engine to generate 100,000 point sprites (that stay alive forever), and I get about a 1 to 2 millisecond faster frame time if I simply dump all of the sprites into one massive vertex buffer and perform one DrawPrimitive() call, rather than loading / rendering them in smaller batches.
I have tried numerous different settings for the batch size of the incremental buffer. The largest I tried was a batch size of 16368 with a total buffer size of 65472. Reducing the batch size drastically, to 512 sprites with a total buffer size of 2048, only made the frame time about 2 milliseconds slower. Still, simply making one huge 100,000-sprite buffer and loading / rendering it once was faster in all cases.
Here is how I'm creating and locking my vertex buffers:
hr = pd3dDevice->CreateVertexBuffer( m_VBSize * sizeof(Particle), D3DUSAGE_DYNAMIC | D3DUSAGE_POINTS | D3DUSAGE_WRITEONLY, Particle::FVF, D3DPOOL_DEFAULT, &m_VB, 0 );
// When batching:
m_VB->Lock( m_VBOffset * sizeof( Particle ), m_VBBatchSize * sizeof( Particle ), (void**)&v, m_VBOffset ? D3DLOCK_NOOVERWRITE : D3DLOCK_DISCARD );
// If I change that to simply D3DLOCK_DISCARD, I notice no real increase in speed.
// When not batching, I just Lock from 0 to NumParticles with D3DLOCK_DISCARD as the flag
So, what's the deal here? Is it or is it not faster to split the point sprite renders into batches within the vertex buffer? What's the best way to do it?
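For readers following along, the lock / fill / unlock / render cycle described in the question can be sketched without any Direct3D calls. This is only a model under stated assumptions: LockMode, BatchStep and PlanBatches are hypothetical names, the flag choice mirrors the "m_VBOffset ? D3DLOCK_NOOVERWRITE : D3DLOCK_DISCARD" expression in the Lock call above, and the wrap-around rule (restart at offset 0 when the next batch would not fit) is an assumption, since the original post does not show how m_VBOffset is advanced.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the two lock flags discussed in the thread.
enum class LockMode { Discard, NoOverwrite };

// One lock / fill / unlock / draw step.
struct BatchStep {
    std::size_t offset;  // first particle slot locked in the vertex buffer
    std::size_t count;   // particles written and drawn in this step
    LockMode mode;       // flag the lock would use
};

// Plans the steps needed to push numParticles through a buffer of bufferSize
// slots, batchSize slots at a time. Offsets and sizes are in particles, not bytes.
std::vector<BatchStep> PlanBatches(std::size_t numParticles,
                                   std::size_t bufferSize,
                                   std::size_t batchSize)
{
    assert(batchSize > 0 && batchSize <= bufferSize);

    std::vector<BatchStep> steps;
    std::size_t offset = 0;
    std::size_t remaining = numParticles;

    while (remaining > 0) {
        // Assumed wrap rule: start over at the front when the next batch
        // would run past the end of the buffer.
        if (offset + batchSize > bufferSize) {
            offset = 0;
        }

        const std::size_t count = std::min(batchSize, remaining);

        // Same choice as in the post: discard at offset 0, otherwise append.
        const LockMode mode =
            (offset == 0) ? LockMode::Discard : LockMode::NoOverwrite;

        steps.push_back({offset, count, mode});

        offset += batchSize;
        remaining -= count;
    }
    return steps;
}
```

With the smaller settings from the question, PlanBatches(100000, 2048, 512) cycles through offsets 0, 512, 1024 and 1536, so every fourth step would use the discard flag and the three in between would use the no-overwrite flag.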
heh - I just asked this very same question 8 hours ago. According to Coder, the driver will handle this "batch before continuing" for you. It would appear that the method of locking, filling, unlocking, render, repeat has gone the way of the dodo.
My Parellel Processing Thread
So, is the best way without-a-doubt to just create one massive dynamic vertex buffer, fill it in one go, and then render? The book was written with Direct3D 9 in mind, so I'd expect it to be at least *somewhat* correct...
I'm also only setting the stream source once per frame either way, if that makes any difference.
Quote:Original post by kosmon_x
So, is the best way without-a-doubt to just create one massive dynamic vertex buffer, fill it in one go, and then render? The book was written with Direct3D 9 in mind, so I'd expect it to be at least *somewhat* correct...
I'm also only setting the stream source once per frame either way, if that makes any difference.
I think it will be faster if you use multithreading. Otherwise, your thread will wait for the draw call to return before moving on to filling the next part of the buffer. I don't know, just throwing a guess out.
If you lock a buffer with the discard flag, you are saying you don't need it anymore. The driver is then able to give you a new buffer that isn't waiting on hardware.
What kind of speeds are you getting with each method?
Even if I change the batch-rendering vertex buffer to D3DLOCK_DISCARD, loading all 100,000 sprites into a buffer at once is still faster.
This is with 100,000 alpha-blended sprites on screen, in debug mode:
Batch rendering with a batch size of 16k and a buffer size of 64k:
91.5 - 92.5 ms frame time
Batch rendering with a batch size of 512 and a buffer size of 2048:
93.0 - 94.0 ms frame time
No batch rendering - loading all 100,000 sprites at once:
90.5 - 91.0 ms frame time
Those are very close to the actual numbers I got (as best as I can remember them). Again, changing the batch flag to D3DLOCK_DISCARD did not make it run any faster than not batching...
The same relationship between the batching and non-batching in terms of speed held for any number of sprites from 5k up to 100k.
It's not a huge difference in speed, but it is a difference nonetheless. From these tests, it would appear that batch rendering is definitely not worth it. Has anyone else had similar experiences, or am I doing this wrong and is batch rendering really more efficient?
Drawing 100000 point sprites in a single call is faster because:
- draw calls are very expensive
- the size of your buffer is very small -- only 400k.
Anyway, you did the right thing. You tried it and profiled it. Others would have blindly "optimized" their code without profiling and not even known that they made it slower.
Also, the 1 ms difference is negligible. The bottleneck is elsewhere -- perhaps in your computations or the GPU's vertex or pixel processing.
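To put a number on the draw-call point, here is a small sketch that counts how many lock / draw steps each configuration from the thread needs. DrawCallsNeeded is a hypothetical helper, not part of Direct3D, and it assumes one draw call per batch with a partial final batch.

```cpp
#include <cstddef>

// Number of lock / fill / unlock / draw steps needed to push numParticles
// through the pipeline batchSize particles at a time (ceiling division).
// Returns 0 for a zero batch size rather than dividing by zero.
constexpr std::size_t DrawCallsNeeded(std::size_t numParticles,
                                      std::size_t batchSize)
{
    return batchSize == 0 ? 0 : (numParticles + batchSize - 1) / batchSize;
}
```

For 100,000 sprites this gives 196 calls at a batch size of 512, 7 calls at 16368, and 1 call when everything is loaded at once, which lines up with the ordering of the frame times reported above.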