Vertex buffer static/managed faster in locking than dynamic/default

The problem is as stated in the subject. I wanted to use dynamic vertex buffers in the default memory pool to update geometry that needs to be updated on the CPU. To my surprise, with 100k triangles, static buffers in the managed memory pool are 3 times faster to update (once per frame) than dynamic buffers in the default memory pool. Of course, I use the D3DLOCK_DISCARD and D3DLOCK_NOSYSLOCK flags with the dynamic buffers, and this doesn't help. What's the problem then?

Share on other sites
Question 0: do you mean D3DLOCK_NOOVERWRITE?
Question 1: if you update 1 time per frame, you need to size your buffer about 3 times as big as what you use in one frame <s>(because typically the command buffer can lag as much as 3 frames behind)</s> (irrelevant)

[Edited by - janta on August 19, 2009 8:11:09 AM]

Share on other sites
Why on earth do you need to transfer that much data? Assuming you have 300,000 vertices (3 per triangle, triangle list), at 20 bytes per vertex (XYZ, UV), that's just short of 6MB to transfer per frame - I'd be inclined to say that your test is verging on invalid if you're going to be throwing that much data around.
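To make that estimate concrete, here is a small sketch of the arithmetic. The `Vertex` layout (XYZ plus UV as five floats) matches the assumption in this post; the constant names are made up for illustration.

```cpp
#include <cstddef>

// Assumed layout: position (x, y, z) plus one texture coordinate (u, v).
struct Vertex { float x, y, z, u, v; };
static_assert(sizeof(Vertex) == 20, "sketch assumes 4-byte floats and no padding");

constexpr std::size_t kTriangles      = 100000;          // from the original post
constexpr std::size_t kVertices       = kTriangles * 3;  // non-indexed triangle list
constexpr std::size_t kBytesPerFrame  = kVertices * sizeof(Vertex);

// 6,000,000 bytes, which is about 5.7 MiB, hence "just short of 6MB".
constexpr double kMiBPerFrame = kBytesPerFrame / (1024.0 * 1024.0);
```

With an indexed mesh the vertex count, and so the upload size, would be lower.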

Also, you generally don't want to use D3DLOCK_NOSYSLOCK unless you're debugging and you have the lock held for a long time (Which is something you want to avoid as much as possible).

Can we see some code?

Share on other sites
How are you updating? It may be that the memory for the two types has different caching and write combining behaviour. That'd have an effect if you're not writing the memory in one block. Try doing a single memcpy to write the data into the buffer, and see if this affects performance.
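Something along these lines is what I mean by a single memcpy. This is only a sketch: `buildVertices` and `uploadInOneBlock` are hypothetical helper names, the `Vertex` layout is assumed, and `locked` stands in for the pointer you get back from `Lock` so the code compiles without the D3D headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

struct Vertex { float x, y, z, u, v; };  // assumed layout

// Compute this frame's vertices in ordinary (cached) system memory.
std::vector<Vertex> buildVertices(std::size_t count) {
    std::vector<Vertex> staging(count);
    for (std::size_t i = 0; i < count; ++i) {
        const float f = static_cast<float>(i);
        staging[i] = Vertex{f, 0.0f, 0.0f, 0.0f, 0.0f};  // placeholder data
    }
    return staging;
}

// Copy the staging data into the locked buffer in one sequential block.
// `locked` stands in for the pointer returned by the vertex buffer's Lock call.
// Returns the number of bytes written.
std::size_t uploadInOneBlock(void* locked, const std::vector<Vertex>& staging) {
    const std::size_t bytes = staging.size() * sizeof(Vertex);
    if (bytes != 0) {
        assert(locked != nullptr);
        std::memcpy(locked, staging.data(), bytes);
    }
    return bytes;
}
```

The point is that the locked memory is only ever written, front to back, and never read.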

Share on other sites
Quote:
 Question 0: do you mean D3DLOCK_NOOVERWRITE?

Doesn't change anything (in performance).

Quote:
 Why on earth do you need to transfer that much data? Assuming you have 300,000 vertices (3 per triangle, triangle list), at 20 bytes per vertex (XYZ, UV), that's just short of 6MB to transfer per frame - I'd be inclined to say that your test is verging on invalid if you're going to be throwing that much data around.

I just wanted to compare performance so I assumed that significant abuse of the PCI-bus would give me the answer.

Quote:
 Can we see some code?

Sure. But first I need to say I've made a second variant of copying - "Try doing a single memcpy to write the data into the buffer, and see if this affects performance.".

So I've got two schemes:

- prepare temp data
- lock buffer
- memcpy from temp data to actual data
- unlock

and the second:

- lock buffer
- assign data within the lock
- unlock buffer

So in the second scheme I actually "compute" the new data on the fly while the buffer is locked. This looks like it should be faster, but it is not.

Use of the second scheme:
static/managed - 12 fps
dynamic/default - 4 fps

Use of the first scheme:
both variants - 9-10 fps

Code for creating vertex buffer:
    if (!d)
        CRenderer::D3DDevice->CreateVertexBuffer(size, D3DUSAGE_WRITEONLY, 0, D3DPOOL_MANAGED, &id, NULL);
    else
        CRenderer::D3DDevice->CreateVertexBuffer(size, D3DUSAGE_WRITEONLY | D3DUSAGE_DYNAMIC, 0, D3DPOOL_DEFAULT, &id, NULL);

// d indicates whether buffer is dynamic

Locking code:
    if (!d)
        id->Lock(0, 0, (void**)&data, 0);
    else
        id->Lock(0, 0, (void**)&data, D3DLOCK_DISCARD | D3DLOCK_NOOVERWRITE);

The only buffer I indicate as dynamic is the one I'm locking every frame.

Share on other sites
Quote:
 Original post by janta
 Question 0: do you mean D3DLOCK_NOOVERWRITE?
 Question 1: if you update 1 time per frame, you need to size your buffer about 3 times as big as what you use in one frame (because typically the command buffer can lag as much as 3 frames behind)

This won't help the OP, but this is not quite accurate, so I figured I'd jump in anyway.

When you lock a dynamic VB with DISCARD, it won't (necessarily) stall if the VB is in use. Behind the scenes a second VB becomes tied to the pointer you've got, allowing the GPU to render from the filled buffer while you're writing data to this new buffer. It's referred to as 'vertex buffer renaming', and I think a common limit for drivers was to make up to 8 rename buffers, after which you WILL stall.

If you're doing many locks per frame, then yes, you want a larger buffer, and tend to use NOOVERWRITE most often. For this case, with a single update, it won't matter which lock method is used. I've seen it suggested that at the beginning of a new frame you should lock with DISCARD, which would actually mean the OP is locking the correct way. I've never understood why the DISCARD is suggested in such a case, and I've seen no ill effects from locking with NOOVERWRITE and continuing on from where my VB was last written to.

But anyway, the whole VB renaming thing is why I replied. It's not really documented that such a feature exists, unless you look at nVidia or ATI whitepapers from a few years back.

Share on other sites
Quote:
 Original post by maxest
 id->Lock(0, 0, (void**)&data, D3DLOCK_DISCARD | D3DLOCK_NOOVERWRITE);
That's not valid - You can only specify one of those flags, not both.

Share on other sites
Quote:
Original post by maxest
Quote:
 Question 0: do you mean D3DLOCK_NOOVERWRITE?

Doesn't change anything (in performance).

Might not change anything in performance but it's still a serious issue because those flags aren't even nearly doing the same thing.
Now, after taking a look at your code, you're not using D3DLOCK_DISCARD and D3DLOCK_NOOVERWRITE as they were intended.

Here's a method.

0 - create a buffer of size K (D3DUSAGE_WRITEONLY | D3DUSAGE_DYNAMIC and D3DPOOL_DEFAULT, that's correct)
1 - keep track of the last write position (initially, 0, obviously), called P.

Each frame:
2 - You want to write N bytes to the buffer.
3 - Check that you have enough space left, i.e.: K - P >= N ?
4 - Yes?
4.1 - Lock the buffer at position P with the flag NOOVERWRITE
5 - No?
5.1 - Set P = 0 and lock the buffer at position 0 with the flag DISCARD
6 - Write your data
7 - Unlock
8 - P += N

Notes:

- if you want to append data to your buffer several times per frame, you need to issue render calls immediately every time you do so. Otherwise, if you hit the end of your buffer and there are no rendering calls associated with it, then D3D will really discard it (whereas if draw calls reference it, D3D preserves the old contents until all of them have been processed)

- this technique is more efficient if you don't hit the end of your buffer every frame. This is why you should create your buffer a few times as big as the amount of data you add every frame
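A minimal sketch of the bookkeeping in the method above, assuming the lock offset is the current write position P and that P is reset to 0 whenever the buffer is discarded. `DynamicBufferCursor`, `LockRequest` and `LockMode` are made-up names; `LockMode` only stands in for D3DLOCK_NOOVERWRITE / D3DLOCK_DISCARD so the sketch compiles without d3d9.h.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for the two lock flags; exactly one is chosen per lock.
enum class LockMode { NoOverwrite, Discard };

struct LockRequest {
    std::size_t offset;  // byte offset to lock at
    std::size_t size;    // number of bytes to lock
    LockMode    mode;    // which flag to pass to the lock call
};

class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacity)
        : capacity_(capacity), writePos_(0) {}

    // Decide where and how to lock for n bytes, then advance the cursor.
    LockRequest reserve(std::size_t n) {
        assert(n <= capacity_);  // a single write must fit in the buffer
        LockMode mode = LockMode::NoOverwrite;
        if (capacity_ - writePos_ < n) {
            // Not enough room left: start over from the beginning and
            // ask for a fresh buffer instead of overwriting data in use.
            writePos_ = 0;
            mode = LockMode::Discard;
        }
        const LockRequest request{writePos_, n, mode};
        writePos_ += n;
        return request;
    }

private:
    std::size_t capacity_;  // K in the steps above
    std::size_t writePos_;  // P in the steps above
};
```

In real code the returned offset, size and mode would be passed on to the vertex buffer's Lock call, and the draw call would start at the same offset.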

[Edited by - janta on August 19, 2009 8:09:09 AM]

Share on other sites
Quote:
 Original post by Namethatnobodyelsetook
 When you lock a dynamic VB with DISCARD, it won't (necessarily) stall if the VB is in use. Behind the scenes a second VB becomes tied to the pointer you've got, allowing the GPU to render from the filled buffer while you're writing data to this new buffer. It's referred to a 'vertex buffer renaming', and I think a common limit for drivers was to make up to 8 rename buffers, after which you WILL stall.

Correct, oversizing the buffer helps because then "vertex buffer renaming" will not happen every frame. Granted, it has nothing to do with the command queue lag; I was mistaken there. (how do I strike some text?)

[Edited by - janta on August 19, 2009 8:13:08 AM]

Share on other sites
One possible reason for some of that performance difference is that your per vertex / pixel rendering workload is light, and the GPU is vertex fetch bound. Try it with the model filling more of the screen and / or some more complex lighting / shaders.

The reason for the 4FPS when not using memcpy is almost certainly because write combined memory needs a bit of extra care when writing to it to get decent performance. You should write the data in order of increasing memory address, leaving no gaps. Reading from write combined memory is also very slow.

Doing it that way tends to be safe on most CPUs. For details of one implementation of write combining see http://download.intel.com/design/PentiumII/applnots/24442201.pdf

Also note that write combining can sometimes be disabled in the BIOS settings, but I don't think that will be an issue in this case as memcpy() is quick.

Example code:

// The slow way to do it. Don't do this!
struct Vertex { float x, y, z, u, v; } *pVB = ....;
for (int i = 0; i < NumberOfVertices; i++)
{
  pVB[i].u = 1.0f; // Out of order
  pVB[i].v = 1.0f;
  pVB[i].x = 42.0f;
  // Failed to write y
  pVB[i].z += 7.0f; // Read - modify - write.
}

// The fast way to do it. Destination pointer is declared volatile to
// stop the compiler reordering the writes and undoing that optimization
volatile struct Vertex { float x, y, z, u, v; } *pVB = ....;
for (int i = 0; i < NumberOfVertices; i++)
{
  pVB[i].x = 42.0f;
  pVB[i].y = 0.0f;
  pVB[i].z = OldZ[i] + 7.0f; // Store the old value elsewhere if you need it
  pVB[i].u = 1.0f;
  pVB[i].v = 1.0f;
}
