maxest

Vertex buffer static/managed faster in locking than dynamic/default


The problem is as the subject states. I wanted to use dynamic vertex buffers in the default memory pool to update geometry that needs to be modified on the CPU. To my surprise, with 100k triangles, static buffers in the managed memory pool are three times faster to update (once per frame) than dynamic buffers in the default memory pool. Of course, I use the D3DLOCK_DISCARD and D3DLOCK_NOSYSLOCK flags with the dynamic buffers, and it doesn't help. What's the problem, then?

Question 0: do you mean D3DLOCK_NOOVERWRITE?
Question 1: if you update 1 time per frame, you need to size your buffer about 3 times as big as what you use in one frame <s>(because typically the command buffer can lag as much as 3 frames behind)</s> (irrelevant)

[Edited by - janta on August 19, 2009 8:11:09 AM]

Why on earth do you need to transfer that much data? Assuming you have 300,000 vertices (3 per triangle, triangle list), at 20 bytes per vertex (XYZ, UV), that's just short of 6MB to transfer per frame - I'd be inclined to say that your test is verging on invalid if you're going to be throwing that much data around.
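For reference, a rough check of that figure, under the same triangle-list and 20-bytes-per-vertex assumptions:

// Rough per-frame upload size for the numbers above (assumed layout: XYZ + UV floats).
const unsigned trianglesPerFrame = 100000;                // 100k triangles, from the original post
const unsigned verticesPerFrame  = trianglesPerFrame * 3; // triangle list, no indexing
const unsigned bytesPerVertex    = 5 * sizeof(float);     // x, y, z, u, v -> 20 bytes
const unsigned bytesPerFrame     = verticesPerFrame * bytesPerVertex;
// 300,000 * 20 = 6,000,000 bytes, i.e. roughly 5.7 MB uploaded every frame.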

Also, you generally don't want to use D3DLOCK_NOSYSLOCK unless you're debugging and you have the lock held for a long time (Which is something you want to avoid as much as possible).

Can we see some code?

How are you updating? It may be that the memory for the two types has different caching and write combining behaviour. That'd have an effect if you're not writing the memory in one block. Try doing a single memcpy to write the data into the buffer, and see if this affects performance.
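For what it's worth, a minimal sketch of that single-memcpy approach for a dynamic buffer, assuming a DISCARD lock; the function and parameter names (UpdateDynamicVB, stagingData, dataSize) are made up for illustration and not taken from the original code:

#include <d3d9.h>
#include <cstring>

// Sketch: build the new vertex data in a system-memory staging array first,
// then copy it into the locked buffer with one contiguous, front-to-back memcpy.
void UpdateDynamicVB(IDirect3DVertexBuffer9* vb, const void* stagingData, UINT dataSize)
{
    void* dst = NULL;
    if (SUCCEEDED(vb->Lock(0, dataSize, &dst, D3DLOCK_DISCARD)))
    {
        memcpy(dst, stagingData, dataSize); // single sequential write into the buffer
        vb->Unlock();
    }
}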

Quote:

Question 0: do you mean D3DLOCK_NOOVERWRITE?

Doesn't change anything (in performance).

Quote:

Why on earth do you need to transfer that much data? Assuming you have 300,000 vertices (3 per triangle, triangle list), at 20 bytes per vertex (XYZ, UV), that's just short of 6MB to transfer per frame - I'd be inclined to say that your test is verging on invalid if you're going to be throwing that much data around.

I just wanted to compare performance, so I assumed that significantly abusing the PCI bus would give me the answer.

Quote:

Can we see some code?

Sure. But first I should mention that I've also tried the suggested variant of copying ("Try doing a single memcpy to write the data into the buffer, and see if this affects performance").

So I've got two schemes:


prepare temp data
lock buffer
memcpy from temp data to actual data
unlock


and the second:


lock buffer
assign data within the lock
unlock buffer


So in the second scheme I actually "compute" the new data on the fly while the buffer is locked. This would seem to be faster, but it is not.

Use of the second scheme:
static/managed - 12 fps
dynamic/default - 4 fps

Use of the first scheme:
both variants - 9-10 fps

Code for creating vertex buffer:

if (!d)
    CRenderer::D3DDevice->CreateVertexBuffer(size, D3DUSAGE_WRITEONLY, 0, D3DPOOL_MANAGED, &id, NULL);
else
    CRenderer::D3DDevice->CreateVertexBuffer(size, D3DUSAGE_WRITEONLY | D3DUSAGE_DYNAMIC, 0, D3DPOOL_DEFAULT, &id, NULL);

// d indicates whether buffer is dynamic

Locking code:

if (!d)
    id->Lock(0, 0, (void**)&data, 0);
else
    id->Lock(0, 0, (void**)&data, D3DLOCK_DISCARD | D3DLOCK_NOOVERWRITE);


The only buffer I indicate as dynamic is the one I'm locking every frame.

Quote:
Original post by janta
Question 0: do you mean D3DLOCK_NOOVERWRITE?
Question 1: if you update 1 time per frame, you need to size your buffer about 3 times as big as what you use in one frame (because typically the command buffer can lag as much as 3 frames behind)

This won't help the OP, but this is not quite accurate, so I figured I'd jump in anyway.

When you lock a dynamic VB with DISCARD, it won't (necessarily) stall if the VB is in use. Behind the scenes, a second VB becomes tied to the pointer you've got, allowing the GPU to render from the filled buffer while you're writing data to this new buffer. It's referred to as 'vertex buffer renaming', and I think a common limit for drivers was to make up to 8 rename buffers, after which you WILL stall.

If you're doing many locks per frame, then yes, you want a larger buffer, and tend to use NOOVERWRITE most often. For this case, with a single update, it won't matter which lock method is used. I've seen it suggested that at the beginning of a new frame you should lock with DISCARD, which would actually mean the OP is locking the correct way. I've never understood why the DISCARD is suggested in such a case, and I've seen no ill effects from locking with NOOVERWRITE and continuing on from where my VB was last written to.

But anyway, the whole VB renaming thing is why I replied. It's not really documented that such a feature exists, unless you look at nVidia or ATI whitepapers from a few years back.

Quote:
Original post by maxest
id->Lock(0, 0, (void**)&data, D3DLOCK_DISCARD | D3DLOCK_NOOVERWRITE);
That's not valid - You can only specify one of those flags, not both.

Quote:
Original post by maxest
Quote:

Question 0: do you mean D3DLOCK_NOOVERWRITE?

Doesn't change anything (in performance).

It might not change anything in performance, but it's still a serious issue, because those flags don't do nearly the same thing.
Now, after taking a look at your code, I can see you're not using D3DLOCK_DISCARD and D3DLOCK_NOOVERWRITE as they were intended.

Here's a method (a rough code sketch follows the notes below).

0 - create a buffer of size K (D3DUSAGE_WRITEONLY | D3DUSAGE_DYNAMIC and D3DPOOL_DEFAULT, that's correct)
1 - keep track of the last write position (initially, 0, obviously), called P.

Each frame:
2 - You want to write N bytes to the buffer.
3 - Check that you have enough space left, i.e.: K - P > N ?
4 - Yes?
4.1 - Lock the buffer at offset P with the flag NOOVERWRITE
5 - No?
5.1 - Reset P to 0, then lock the buffer at offset 0 with the flag DISCARD
6 - Write your data
7 - Unlock
8 - P += N

Notes:

- if you want to append data to your buffer several times per frame, you need to issue the render calls immediately every time you do so. Otherwise, if you hit the end of your buffer and there are no rendering calls associated with it yet, D3D will really discard it (whereas it preserves the contents until all draw calls that reference them have been processed)

- this technique is more efficient if you don't hit the end of your buffer every frame. This is why you should create your buffer a few times as big as the amount of data you add each frame
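To make the steps above concrete, here is a minimal sketch in D3D9-style C++. The buffer, K and P are plain globals here, and all the names (g_dynamicVB, g_bufferSize, g_writePos, AppendVertexData) are made up for illustration, not taken from anyone's actual code:

#include <d3d9.h>
#include <cstring>

IDirect3DVertexBuffer9* g_dynamicVB = NULL; // created with D3DUSAGE_WRITEONLY | D3DUSAGE_DYNAMIC in D3DPOOL_DEFAULT
UINT g_bufferSize = 0;                      // K: total size of the buffer, in bytes
UINT g_writePos   = 0;                      // P: byte offset of the next free position

// Appends numBytes of vertex data and returns the byte offset it was written at,
// so the caller can draw from that offset.
UINT AppendVertexData(const void* src, UINT numBytes)
{
    DWORD lockFlags;
    if (g_bufferSize - g_writePos >= numBytes)
    {
        lockFlags = D3DLOCK_NOOVERWRITE; // enough room left: append behind data the GPU may still be using
    }
    else
    {
        lockFlags = D3DLOCK_DISCARD;     // buffer full: ask the driver for a fresh buffer and start over
        g_writePos = 0;
    }

    void* dst = NULL;
    if (SUCCEEDED(g_dynamicVB->Lock(g_writePos, numBytes, &dst, lockFlags)))
    {
        memcpy(dst, src, numBytes);
        g_dynamicVB->Unlock();
    }

    const UINT offsetWritten = g_writePos;
    g_writePos += numBytes;
    return offsetWritten;
}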

[Edited by - janta on August 19, 2009 8:09:09 AM]

Quote:
Original post by Namethatnobodyelsetook
When you lock a dynamic VB with DISCARD, it won't (necessarily) stall if the VB is in use. Behind the scenes a second VB becomes tied to the pointer you've got, allowing the GPU to render from the filled buffer while you're writing data to this new buffer. It's referred to a 'vertex buffer renaming', and I think a common limit for drivers was to make up to 8 rename buffers, after which you WILL stall.

Correct, oversizing the buffer helps, as "vertex buffer renaming" will not happen every frame. Granted, it has nothing to do with command queue lag; I was mistaken there. (how do I strike some text?)

[Edited by - janta on August 19, 2009 8:13:08 AM]

One possible reason for some of that performance difference is that your per vertex / pixel rendering workload is light, and the GPU is vertex fetch bound. Try it with the model filling more of the screen and / or some more complex lighting / shaders.

The reason for the 4 FPS when not using memcpy is almost certainly that write-combined memory needs a bit of extra care when writing to it to get decent performance. You should write the data in order of increasing memory addresses, leaving no gaps. Reading from write-combined memory is also very slow.

Doing it that way tends to be safe on most CPUs. For details of one implementation of write combining see http://download.intel.com/design/PentiumII/applnots/24442201.pdf

Also note that write combining can sometimes be disabled in the BIOS settings, but I don't think that will be an issue in this case as memcpy() is quick.

Example code:


// The slow way to do it. Don't do this!
struct Vertex { float x, y, z, u, v; } *pVB = ....; // pVB points at the locked buffer

for (int i = 0; i < NumberOfVertices; i++)
{
    pVB[i].u = 1.0f; // Out of order
    pVB[i].v = 1.0f;
    pVB[i].x = 42.0f;
    // Failed to write y
    pVB[i].z += 7.0f; // Read - modify - write.
}





// The fast way to do it. Destination pointer is declared volatile to
// stop the compiler reordering the writes and undoing that optimization
volatile struct Vertex { float x, y, z, u, v; } *pVB = ....; // pVB points at the locked buffer

for (int i = 0; i < NumberOfVertices; i++)
{
    pVB[i].x = 42.0f;
    pVB[i].y = 0.0f;
    pVB[i].z = OldZ + 7.0f; // Store the old value elsewhere if you need it
    pVB[i].u = 1.0f;
    pVB[i].v = 1.0f;
}

I've now put a heavy load on the vertex shader, 280 instructions for every single vertex (and I remind you that I lock/unlock 300k of them), and the performance hasn't changed one bit. It remains the same with both lock schemes.

Quote:
Original post by Adam_42
...

Wouldn't the compiler be able to do that reordering itself when it's so obviously possible?
