mind in a box

About dynamic vertex pulling


Hi everyone!

 

I'm looking forward to implementing the technique described in this Article. It looks fine by itself, but I have some questions about the implementation details.

 

First of all, the general idea seems to be that you have one big vertex buffer and one big index buffer to work with. You put every mesh you want to render in there and store the offsets and index counts in another data structure, which goes into another buffer together with the instance data.

Then all you need to do is issue a call to something like DrawInstanced with the maximum number of indices any mesh in the buffer has, and walk the instance-data buffer to fetch the actual vertex data from the buffers.

If a mesh uses fewer indices than we passed to the draw call, the article says to pad with degenerate triangles and keep an eye on the vertex counts.
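The bookkeeping described above could look roughly like this on the CPU side. This is only a sketch, assuming a single vertex format and 32-bit indices; `GeometryPool`, `MeshRange` and `Vertex` are names made up for illustration and are not taken from the article.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical vertex layout, chosen only for the example.
struct Vertex { float position[3]; float normal[3]; float uv[2]; };

// Where one mesh lives inside the shared buffers.
struct MeshRange {
    uint32_t firstVertex;  // offset into the shared vertex buffer, in vertices
    uint32_t firstIndex;   // offset into the shared index buffer, in indices
    uint32_t indexCount;   // number of indices this mesh really uses
};

struct GeometryPool {
    std::vector<Vertex>    vertices;  // CPU copy of the big vertex buffer
    std::vector<uint32_t>  indices;   // CPU copy of the big index buffer
    std::vector<MeshRange> ranges;    // one entry per packed mesh
    uint32_t maxIndexCount = 0;       // index count to pass to the single draw

    // Appends a mesh and returns its id (its position in `ranges`).
    // Indices stay mesh-relative; the shader would add firstVertex.
    uint32_t add(const std::vector<Vertex>& v, const std::vector<uint32_t>& i) {
        assert(!v.empty() && !i.empty());
        MeshRange r{ static_cast<uint32_t>(vertices.size()),
                     static_cast<uint32_t>(indices.size()),
                     static_cast<uint32_t>(i.size()) };
        vertices.insert(vertices.end(), v.begin(), v.end());
        indices.insert(indices.end(), i.begin(), i.end());
        if (r.indexCount > maxIndexCount) maxIndexCount = r.indexCount;
        ranges.push_back(r);
        return static_cast<uint32_t>(ranges.size() - 1);
    }
};
```

The `ranges` array is what would be uploaded next to the instance data, and `maxIndexCount` is the count every instance ends up being drawn with.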

 

Now, the article gives us a scenario about rendering a forest, with different types of trees and LOD-levels.

  • #1: Why even bother with LODs when we draw everything with the same vertex/index count anyway?
  • Idea: Use multiple instance buffers covering different ranges of vertex/index counts and issue more draw calls, instead of wasting time drawing padding vertices for simple LOD levels.
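One way the bucketing idea could be sketched: round each mesh's index count up to the next power of two and issue one instanced draw per bucket. The power-of-two rule and all names here (`Instance`, `bucketByIndexCount`, `wastedIndices`) are assumptions for illustration, not something the article prescribes.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Rounds n up to the next power of two (1 <= n <= 2^31).
inline uint32_t nextPow2(uint32_t n) {
    assert(n >= 1 && n <= (1u << 31));
    uint32_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Hypothetical per-instance record: which mesh, and how many indices it uses.
struct Instance { uint32_t meshId; uint32_t indexCount; };

// Groups instances so each bucket can be drawn with one instanced call whose
// per-instance index count is the bucket key.
inline std::map<uint32_t, std::vector<Instance>>
bucketByIndexCount(const std::vector<Instance>& instances) {
    std::map<uint32_t, std::vector<Instance>> buckets;
    for (const Instance& inst : instances)
        buckets[nextPow2(inst.indexCount)].push_back(inst);
    return buckets;
}

// Indices the GPU processes that only produce degenerate triangles.
inline uint64_t wastedIndices(const std::map<uint32_t, std::vector<Instance>>& buckets) {
    uint64_t waste = 0;
    for (const auto& bucket : buckets)
        for (const Instance& inst : bucket.second)
            waste += bucket.first - inst.indexCount;
    return waste;
}
```

With power-of-two buckets an instance never processes more than twice the indices it needs, at the cost of one extra draw call per bucket.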

The next problem is updating the instance buffer. Since we of course want frustum culling or moving objects when drawing a huge forest, we would need to do that every frame. The article suggests keeping a CPU copy of the buffer data and, if anything changes, simply copying everything over again.

  • #2: Wouldn't that have a huge impact on performance if we have to copy thousands of matrices to the GPU every frame? Also, I'm pretty sure you would hit a GPU sync point when doing this the naive way.
  • Idea: I haven't looked too deeply into them yet, but couldn't you update a single portion of the buffer with a compute shader, or just do the full frustum culling on the GPU? If not, are the Map modes other than WRITE_DISCARD, where the existing data stays intact, worth a shot for updating only single objects? Or do I just throw this onto another thread, use double buffering to help with sync points, and forget about it?

 

The last question is about textures. I assume the textures in the article are all the same size, which makes it easy to put them all into a texture array, as the author does.

  • #3: But I don't know much about the textures I have to work with, other than that their dimensions are all powers of two. I'm using D3D11 at the moment, so texture arrays are as far as I can go. The next problem is that my textures can be dynamically streamed in and out.
  • Idea: Make texture arrays for the different sizes and estimate how many slots we need for each size. For example, pre-allocate a texture array with 100 slots of size 1024² and, if we ever exceed that boundary or a texture gets streamed out, allocate more/less and copy the old one over. Slow, but it would work. Then use the shader's registers for the different arrays to get access to them.
  • The other thing I could do is to allow this kind of rendering technique only for static level-geometry and to try to keep the textures for them in memory the whole time.
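The slot bookkeeping for the per-size texture arrays could be sketched like this. It only tracks which slices are in use; creating, growing and copying the actual D3D11 texture arrays is left out, and the class name, the doubling growth policy and the initial capacity of 4 are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical CPU-side bookkeeping: one logical texture array per
// power-of-two size, with slices handed out and returned as textures stream.
class TextureSlotPool {
public:
    struct Slot { uint32_t size; uint32_t slice; };

    // Reserve a slice in the array that holds size x size textures.
    Slot acquire(uint32_t size) {
        assert(size != 0 && (size & (size - 1)) == 0);  // must be a power of two
        Array& a = arrays_[size];
        if (!a.freeSlices.empty()) {
            uint32_t s = a.freeSlices.back();
            a.freeSlices.pop_back();
            return {size, s};
        }
        if (a.used == a.capacity) {
            // Here the real array would be re-created and the old slices copied.
            a.capacity = a.capacity ? a.capacity * 2 : 4;
        }
        return {size, a.used++};
    }

    // Return a slice when its texture is streamed out.
    void release(Slot s) { arrays_[s.size].freeSlices.push_back(s.slice); }

    // Number of slices currently allocated for the given size.
    uint32_t capacity(uint32_t size) const {
        auto it = arrays_.find(size);
        return it == arrays_.end() ? 0 : it->second.capacity;
    }

private:
    struct Array {
        uint32_t capacity = 0;             // slices allocated in the array
        uint32_t used = 0;                 // high-water mark of handed-out slices
        std::vector<uint32_t> freeSlices;  // released slices that can be reused
    };
    std::map<uint32_t, Array> arrays_;
};
```

Reusing released slices before growing keeps the expensive reallocate-and-copy path rare when textures stream in and out at a steady rate.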

Does anyone have better solutions or ideas for these problems, or any other useful input about this technique?

 

Thanks in advance!


Thanks for the reply!

 

 

TBH the article looks like a horribly complicated version of what can be achieved with one huge vertex buffer and one huge index buffer combined with StartInstanceLocation and StartIndexLocation from DrawIndexedInstanced and StartVertexLocation from DrawInstanced.

 

Not quite. The point of the technique is to reduce draw calls further than instancing can. It lets you render lots of different geometry with different textures using a single DrawInstanced call.

Basically it IS DrawInstanced with these two parameters, but without any of the overhead coming from the draw calls.

 

 

It is true that LOD loses some effectiveness. However there's more to it than just the vertex shader processing power.
One triangle covering 1024 pixels is much faster than 1024 triangles covering 1 pixel each. That's because pixels are processed in at least 2x2 blocks (aka "the small triangle problem"). Triangles

 

Right, I totally forgot about that!

 

 

You are right.
The author is using 3 matrices per instance (world, view, and projection) which is far from ideal. I suppose he did it for simplicity of the article.

 

Sure, but even copying 15000 world-matrices to a buffer would take a lot of time. I think a better approach would be to use this only for static geometry and work with indices to a pre-filled buffer with all of the instance-information.
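To put rough numbers on that idea: 15000 full 4x4 float matrices are 15000 * 64 = 960000 bytes per upload, while 15000 32-bit indices into a pre-filled buffer are 60000 bytes. A small sketch of the index-list approach, with hypothetical names and the culling result passed in as a plain visibility mask:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Float4x4 { float m[16]; };  // 64-byte world matrix

// Hypothetical culling output: indices of the visible instances.
// The matrices themselves would be uploaded once and not touched again.
inline std::vector<uint32_t> buildVisibleList(const std::vector<bool>& visible) {
    std::vector<uint32_t> ids;
    for (uint32_t i = 0; i < visible.size(); ++i)
        if (visible[i]) ids.push_back(i);
    return ids;
}

// Bytes that must be sent per frame for each strategy.
inline std::size_t bytesIfUploadingMatrices(std::size_t n) { return n * sizeof(Float4x4); }
inline std::size_t bytesIfUploadingIndices(std::size_t n)  { return n * sizeof(uint32_t); }
```

The vertex shader would then read the index for its instance and use it to look up the matrix in the static buffer.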

 

Thanks for the matrix optimization-tip as well, I didn't know that!



The overhead of an actual draw call is extremely low. The biggest issue performance-wise is when you need to swap vertex/index buffers between the calls, which can be avoided by having one giant buffer and using the Start* variables.

 

I am done collecting all the buffers into one single big buffer and it works pretty well, the packing at least. However, I am not quite sure how I would implement using different world matrices without actually switching at least one constant buffer, since I can't use the instance ID to figure out what I am currently rendering.

 

My idea would be to simply use DrawInstanced, passing only a single (maybe more) instance to render and setting the start instance to the index at which my object's instance data sits in a big structured buffer bound to the vertex shader. That way I can access the instance data using SV_InstanceID.

 

Would that be an appropriate solution or do you maybe have a better idea? The engine I'm currently working on unfortunately isn't far enough for me to test this now.

 


Yes. If you can avoid it, then better. You can also optimize for specific uses (e.g. if you only need position & orientation but no scale, send a float4 for the position and a float4 with a quaternion; if you only need XZ position and Y is controlled globally or already baked, only send a float2. You can also use half formats to halve the bandwidth if precision isn't an issue).
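As a sketch of the position-plus-quaternion variant: two float4s are 32 bytes per instance instead of 64 for a full matrix. The struct and function names are hypothetical, and `transformPoint` only mirrors on the CPU what a vertex shader would do in place of the world-matrix multiply.

```cpp
#include <cassert>
#include <cstdint>

struct Float4 { float x, y, z, w; };

// 32 bytes per instance instead of 64 for a full 4x4 matrix.
struct CompactInstance {
    Float4 position;     // xyz = world position, w unused
    Float4 orientation;  // unit quaternion (x, y, z, w)
};
static_assert(sizeof(CompactInstance) == 32, "expected two tightly packed float4s");

// Rotates v by the unit quaternion, then translates by the position.
inline Float4 transformPoint(const CompactInstance& inst, Float4 v) {
    const Float4& q = inst.orientation;
    // t = 2 * cross(q.xyz, v)
    float tx = 2.0f * (q.y * v.z - q.z * v.y);
    float ty = 2.0f * (q.z * v.x - q.x * v.z);
    float tz = 2.0f * (q.x * v.y - q.y * v.x);
    // rotated = v + q.w * t + cross(q.xyz, t)
    Float4 r;
    r.x = v.x + q.w * tx + (q.y * tz - q.z * ty) + inst.position.x;
    r.y = v.y + q.w * ty + (q.z * tx - q.x * tz) + inst.position.y;
    r.z = v.z + q.w * tz + (q.x * ty - q.y * tx) + inst.position.z;
    r.w = 1.0f;
    return r;
}
```

This only holds for unit quaternions and no scale, which matches the case described in the quote.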

 

Never really thought about this. I actually don't need scale for most of the objects I'm working with, so that is going to be a really nice optimization!

 

Thanks for your answers!



However, I am not quite sure how I would implement using different world-matrices without actually switching at least one constant buffer, since I can't use the instance-ID to figure out what I am currently rendering.

If you're talking about merge-instancing isn't it just vertexid/size=instanceid?
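Spelled out, the vertexid/size relation looks like the following sketch. It assumes every mesh is padded to the same vertex count, which is the precondition for merge instancing; the names are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

struct MergedVertex {
    uint32_t instanceId;   // which instance this vertex belongs to
    uint32_t localVertex;  // vertex index inside that instance's mesh
};

// Assumes every mesh is padded to exactly vertsPerMesh vertices.
inline MergedVertex splitVertexId(uint32_t vertexId, uint32_t vertsPerMesh) {
    assert(vertsPerMesh > 0);
    return { vertexId / vertsPerMesh, vertexId % vertsPerMesh };
}
```

For example, with 300 vertices per mesh, vertex 650 belongs to instance 2 and is that mesh's vertex 50.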


I am done collecting all the buffers into one single big buffer and it works pretty well, the packing at least. However, I am not quite sure how I would implement using different world matrices without actually switching at least one constant buffer, since I can't use the instance ID to figure out what I am currently rendering.

To select the mesh to render via DrawPrimitive, use StartIndexLocation/StartVertexLocation.
For example if you've got mesh A of 1000 vertices (32 bytes per vertex) and right afterwards mesh B of 500 vertices (24 bytes per vertex); then you need to set StartVertexLocation to 1334.

How did I arrive at 1334?
Mesh A needed 32000 bytes (1000 * 32 bytes per vertex)
32000 / 24 bytes per vertex = 1333.333. Rounded up to 1334.
This means Mesh B needs to start at byte offset 32016 (1334 * 24). You waste 16 bytes (32000 through 32016) as padding. To minimize the waste you could consecutively load all meshes that have the same vertex size.
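The rounding above can be written as a small helper. This is a sketch with hypothetical names; it assumes the shared vertex buffer is bound at byte offset 0 with the stride of the mesh being drawn.

```cpp
#include <cassert>
#include <cstdint>

struct VertexPlacement {
    uint32_t baseVertex;    // value for StartVertexLocation / BaseVertexLocation
    uint32_t byteOffset;    // where the mesh's first vertex must be written
    uint32_t paddingBytes;  // bytes skipped to reach that offset
};

// bytesUsed: bytes already occupied at the front of the shared vertex buffer.
// stride:    vertex size in bytes of the mesh being appended.
inline VertexPlacement placeMesh(uint32_t bytesUsed, uint32_t stride) {
    assert(stride > 0);
    uint32_t baseVertex = (bytesUsed + stride - 1) / stride;  // round up
    uint32_t byteOffset = baseVertex * stride;
    return { baseVertex, byteOffset, byteOffset - bytesUsed };
}
```

Calling `placeMesh(32000, 24)` reproduces the numbers from the example: base vertex 1334, byte offset 32016, 16 bytes of padding. With a 32-byte stride the result is base vertex 1000 and no padding.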


To identify the instance:
Create a vertex buffer filled with uints in increasing order. You only need one; you can then reuse it for all the draws. In other words:
//At initialization time
uint32_t *vertexBuffer= ...;
for( int i=0; i<4096; ++i )
    vertexBuffer[i] = i;
Note: the 4096 is arbitrary.

And bind that vertex buffer as instance data. We'll call this the "DRAWID". Then when you pass StartInstanceLocation = 500, the drawID will contain 500 for the first instance, 501 for the 2nd instance, etc (SV_InstanceID is zero-based, thus we need this trick to get the actual value in the shader)

Now that you've got the instance ID, just load myWorldMatrices[drawID];
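The difference between SV_InstanceID and the DRAWID stream can be mimicked on the CPU. This is only a model of the fetch behaviour described above, not D3D11 code, and the function and struct names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Fills the per-instance "DRAWID" stream: element i holds the value i.
inline std::vector<uint32_t> makeDrawIdBuffer(uint32_t count) {
    std::vector<uint32_t> ids(count);
    for (uint32_t i = 0; i < count; ++i) ids[i] = i;
    return ids;
}

// What the vertex shader would see for the n-th instance of one draw.
struct InstanceInputs { uint32_t svInstanceId; uint32_t drawId; };

inline InstanceInputs fetchInstance(const std::vector<uint32_t>& drawIdBuffer,
                                    uint32_t startInstanceLocation, uint32_t n) {
    assert(startInstanceLocation + n < drawIdBuffer.size());
    // SV_InstanceID counts from zero for every draw; the per-instance stream
    // is read at startInstanceLocation + n.
    return { n, drawIdBuffer[startInstanceLocation + n] };
}
```

With StartInstanceLocation = 500, the first instance sees SV_InstanceID 0 but DRAWID 500, so DRAWID is the value to index myWorldMatrices with.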



If you're talking about merge-instancing isn't it just vertexid/size=instanceid?

 

My meshes aren't using the same vertex-counts as of yet and I would like to get around this if possible.

 


To select the mesh to render via DrawPrimitive, use StartIndexLocation/StartVertexLocation.
For example if you've got mesh A of 1000 vertices (32 bytes per vertex) and right afterwards mesh B of 500 vertices (24 bytes per vertex); then you need to set StartVertexLocation to 1334.

 

Is this really true? Don't you have to specify the index value of the first vertex? Wouldn't you just need to set that value to 0 for the first draw call and to 1000 for the second? I'm pretty sure that's how it works, at least for the indices.

You are probably talking about the offsets you can set while binding the buffers to the IA?

 



To identify the instance:
Create a vertex buffer filled with uints in increasing order. You only need one; you can then reuse it for all the draws. In other words:

//At initialization time
uint32_t *vertexBuffer= ...;
for( int i=0; i<4096; ++i )
    vertexBuffer[i] = i;

Note: the 4096 is arbitrary.

And bind that vertex buffer as instance data. We'll call this the "DRAWID". Then when you pass StartInstanceLocation = 500, the drawID will contain 500 for the first instance, 501 for the 2nd instance, etc (SV_InstanceID is zero-based, thus we need this trick to get the actual value in the shader)

Now that you've got the instance ID, just load myWorldMatrices[drawID];

 

Won't SV_InstanceID be filled with the value I passed in the drawcall, regardless of a second vertexbuffer being bound? I guess I will have to test this, but it would save me the overhead of reading the same value as SV_InstanceID out of a buffer.


My meshes aren't using the same vertex-counts as of yet and I would like to get around this if possible.

Yeah, sorry, I had only read a fourth of your article beforehand; I assumed it was merge instancing and quit reading. I just perused the whole thing, and if I understand it correctly, it does the degenerate triangles of merge instancing manually by making them get clipped. I still need to read it in depth to fully grasp what he's doing. But the constants seem to be in another structured buffer; at least that's what I got from the section entitled "Managing per-object constants". He does use DrawInstanced in his article, though, so you should have an instance ID available to you.

 

edit - sorry I didn't realize you had moved on to a different technique.

Edited by Infinisearch


Is this really true? Don't you have to specify the index value of the first vertex? Wouldn't you just need to set that value to 0 for the first draw call and to 1000 for the second? I'm pretty sure that's how it works, at least for the indices.
You are probably talking about the offsets you can set while binding the buffers to the IA?

Oh I was thinking in DrawInstanced terms (non-indexed version) for simplicity.
 
Let's talk in DrawIndexedInstanced terms so we're on the same page:
Mesh A: 1000 vertices (vertex format = 32 bytes per vertex), 300 indices.
Mesh B: 500 vertices (vertex format = 24 bytes per vertex), 200 indices.
 
To Render Mesh A:
StartIndexLocation = 0;
BaseVertexLocation = 0;
 
To Render Mesh B:
StartIndexLocation = 300;  //Assumes index data of Mesh B starts right after the data of Mesh A.
BaseVertexLocation = 1334; //It would be 1000 if mesh B had a vertex format of 32 bytes per vertex (or if Mesh A also had a format of 24 bytes).
 
If you don't understand why it's 1334 instead of 1000, think of it this way: if you set BaseVertexLocation to 1000, the GPU will read the data at byte offset 24000 (because 1000 * 24 = 24000)... but that's Mesh A!!! (Mesh A occupies the range [0; 32000).)
Of course if both mesh A & B have the same vertex format, then you set BaseVertexLocation to 1000.
 

Won't SV_InstanceID be filled with the value I passed in the drawcall, regardless of a second vertexbuffer being bound?

Unfortunately no, SV_InstanceID is 0-based irrespective of the value of StartInstanceLocation. Fortunately the overhead from using the 2nd vertex buffer is extremely low (in CPU terms, you only need to bind it once; in GPU terms, the entire buffer fits in the cache).

Edited by Matias Goldberg

