About dynamic vertex pulling


Hi everyone!

I'm looking to implement the technique described in this article, which looks fine by itself, but I have some questions about the implementation details.

First of all, the general idea seems to be that you have one big vertex buffer and one big index buffer to work with. You then put every mesh you want rendered in there and store the offsets and index counts in another data structure, which goes together with the instance data into yet another buffer.

Then all you need to do is issue a call to something like DrawInstanced with the maximum index count of any mesh in the buffer, and walk the instance-data buffer in the shader to pull the actual vertex data from the buffers.

If a mesh uses fewer indices than we told the draw call, the article says one should just produce degenerate triangles and keep an eye on the vertex counts.
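As far as I understand it, the per-instance data would look roughly like this (an untested sketch; all names are mine, not from the article):

#include <cstdint>

// Hypothetical per-instance record as I understand the article: one entry
// per rendered instance, stored in a buffer the vertex shader reads.
struct InstanceData
{
    uint32_t vertexOffset; // where this mesh's vertices start in the big buffer
    uint32_t indexOffset;  // where its indices start in the big index buffer
    uint32_t indexCount;   // real index count; the remainder become degenerates
    uint32_t textureIndex; // slice in the texture array
    float    world[16];    // per-instance world matrix
};

// One draw for everything: maxIndexCount is the largest indexCount of any
// mesh in the buffer, instanceCount the number of InstanceData entries:
// context->DrawInstanced( maxIndexCount, instanceCount, 0, 0 );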

Now, the article gives us a scenario of rendering a forest with different types of trees and LOD levels.

  • #1: Why even bother with LODs when we draw everything with the same vertex/index count anyway?
  • Idea: Use multiple instance buffers covering different ranges of vertex/index counts and issue more draw calls, instead of wasting time drawing overhead vertices on simple LOD levels (see the sketch below).
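To make that idea concrete, here is a rough sketch of the bucketing I have in mind (the bucket limits are made up):

#include <cstdint>
#include <vector>

// Hypothetical bucketing by index count: each bucket gets its own instance
// buffer and its own DrawInstanced call, so a low-poly LOD doesn't pay for
// the index count of a full-detail tree.
struct InstanceBucket
{
    uint32_t maxIndexCount;            // draw size used for this bucket
    std::vector<uint32_t> instanceIds; // which instances fall into this range
};

InstanceBucket buckets[3] = { { 64 }, { 1024 }, { 16384 } }; // made-up limits

// Per frame: upload each bucket's instance data, then issue one call per
// bucket instead of one call total:
// context->DrawInstanced( bucket.maxIndexCount, bucket.instanceIds.size(), 0, 0 );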

The next problem is updating the instance buffer. Since we of course want frustum culling or moving objects if we are drawing a huge forest, we would need to do that every frame. The article suggests keeping a CPU copy of the data in the buffer, and if something changes, just copying everything over again.

  • #2: Wouldn't that have a huge impact on performance if we have to copy thousands of matrices to the GPU every frame? Also, I'm pretty sure you would hit a GPU sync point when doing this the naive way.
  • Idea: I haven't looked too deep into them yet, but couldn't you update a single portion of the buffer using a compute shader, or just do the full frustum culling on the GPU? If not, are the map modes other than WRITE_DISCARD, where the existing data is preserved, worth a shot for updating only single objects? Or do I just throw this onto another thread, use double buffering to help with sync points (see the sketch below), and forget about it?
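For the double-buffering idea, this is roughly what I have in mind (an untested sketch, assuming two D3D11_USAGE_DYNAMIC buffers):

#include <cstring>
#include <d3d11.h>

// Sketch of the double-buffering idea: alternate between two dynamic
// buffers so Map() never has to wait on a buffer the GPU is still reading.
void UpdateInstances( ID3D11DeviceContext *ctx, ID3D11Buffer *buffers[2],
                      const void *instanceData, size_t bytes, unsigned frame )
{
    ID3D11Buffer *target = buffers[frame & 1]; // alternate each frame

    D3D11_MAPPED_SUBRESOURCE mapped;
    if( SUCCEEDED( ctx->Map( target, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped ) ) )
    {
        memcpy( mapped.pData, instanceData, bytes );
        ctx->Unmap( target, 0 );
    }
    // afterwards, bind `target` as this frame's instance buffer
}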

The last question is regarding textures. I assume that in the article the textures are all the same size, which makes it easy to put them all into a TextureArray, which is at least what the author does.

  • #3: But I don't know much about the textures I have to work with, other than that they are all power-of-two sized. I'm using D3D11 at the moment, so TextureArrays are as far as I can get. The next problem is that my textures can be dynamically streamed in and out.
  • Idea: Make texture arrays of different sizes and estimate how many slots we would need for each size. For example, pre-allocate a TextureArray with 100 slots of size 1024², and if we ever break that boundary or a texture gets cached out, allocate more/fewer slots and copy the old array over. Slow, but it would work. Then use separate shader registers for the different arrays to access them (see the sketch after this list).
  • The other thing I could do is allow this kind of rendering technique only for static level geometry and try to keep its textures in memory the whole time.
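The size-bucket idea as a sketch (the slot count is a guess; names are mine):

#include <d3d11.h>

// Sketch: one texture array per power-of-two size, each bound to its own
// shader register. An instance then stores (bucket, slice) to find its texture.
struct TextureBucket
{
    UINT size;                     // e.g. 1024 for 1024x1024 textures
    UINT capacity;                 // pre-allocated slice count (a guess, e.g. 100)
    UINT used;                     // slices currently occupied
    ID3D11Texture2D *array;        // Texture2DArray with `capacity` slices
    ID3D11ShaderResourceView *srv; // bound to its own register (t0/t1/t2/...)
};
// On overflow: create a bigger array, CopySubresourceRegion the old slices
// over, release the old array. Slow, but it only happens on reallocation.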

Does anyone have better solutions or ideas for these problems, or any other useful input about this technique?

Thanks in advance!

TBH the article looks like a horribly complicated version of what can be achieved with one huge vertex buffer and one huge index buffer combined with StartInstanceLocation and StartIndexLocation from DrawIndexedInstanced and StartVertexLocation from DrawInstanced.

It is, however, a very useful exercise for learning the flexibility of modern GPUs when it comes to rendering methods, which can be useful for very specific rendering paths. But for what the article's author proposes to fix, using StartVertexLocation/StartIndexLocation and a huge vertex buffer solves the same problem with none of the caveats (no fixed vertex count, no need for unique textures per instance, all topologies are supported, no indirection overhead, etc.).
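A minimal sketch of what I mean, assuming all meshes have already been packed into the same giant buffers:

#include <d3d11.h>

// Hypothetical bookkeeping per mesh: just offsets into the shared buffers.
struct MeshRange
{
    UINT indexCount, startIndex;
    INT  baseVertex;
    UINT instanceCount, startInstance;
};

// The giant VB/IB stay bound the whole time; per mesh, only the Start*
// parameters of the draw change, so there are no state swaps in between.
void DrawMesh( ID3D11DeviceContext *ctx, const MeshRange &m )
{
    ctx->DrawIndexedInstanced( m.indexCount, m.instanceCount,
                               m.startIndex, m.baseVertex, m.startInstance );
}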


#1: Why even bother with LODs when we draw everything with the same vertex/index count anyway?

It is true that LOD loses some effectiveness. However, there's more to it than just vertex shader processing power.
One triangle covering 1024 pixels is much faster than 1024 triangles covering one pixel each. That's because pixels are processed in at least 2x2 blocks (aka "the small triangle problem"). Triangles that are smaller than a pixel hurt performance a lot.
The details are explained in Emil Persson's article.
Also, degenerate triangles don't need to go through the rasterizer stage.

#2: Wouldn't that have a huge impact on performance if we have to copy thousands of matrices to the GPU every frame? Also, I'm pretty sure you would hit a GPU sync point when doing this the naive way.

You are right.
The author is using 3 matrices per instance (world, view, and projection), which is far from ideal. I suppose he did it for the simplicity of the article.
Normally you would store the view and projection matrices in a separate constant buffer that is only updated once per camera, and another buffer that holds only the world matrices. Storage can be further optimized by sending a 4x3 matrix (three float4) since world matrices don't need the last row.
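In HLSL, that layout would look something like this (a sketch; the names are mine):

// Camera data lives in a constant buffer updated once per camera; world
// matrices are streamed separately as 3x4 (three float4 rows, 48 bytes each).
cbuffer PerCamera : register(b0)
{
    float4x4 viewProj;
};

StructuredBuffer<float3x4> worldMatrices : register(t0);

float4 TransformVertex( float3 localPos, uint instanceId )
{
    float3 worldPos = mul( worldMatrices[instanceId], float4( localPos, 1.0f ) );
    return mul( viewProj, float4( worldPos, 1.0f ) );
}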

Thanks for the reply!

TBH the article looks like a horribly complicated version of what can be achieved with one huge vertex buffer and one huge index buffer combined with StartInstanceLocation and StartIndexLocation from DrawIndexedInstanced and StartVertexLocation from DrawInstanced.

Not quite; the point of the technique is to minimize draw calls further than instancing can go. It enables you to render lots of different geometry with different textures using only a single DrawInstanced call.

Basically it IS DrawInstanced with these two parameters, but without any of the overhead coming from the draw calls.

It is true that LOD loses some effectiveness. However, there's more to it than just vertex shader processing power.
One triangle covering 1024 pixels is much faster than 1024 triangles covering one pixel each. That's because pixels are processed in at least 2x2 blocks (aka "the small triangle problem").

Right, I totally forgot about that!

You are right.
The author is using 3 matrices per instance (world, view, and projection), which is far from ideal. I suppose he did it for the simplicity of the article.

Sure, but even copying 15000 world matrices to a buffer would take a lot of time. I think a better approach would be to use this only for static geometry and work with indices into a pre-filled buffer holding all of the instance information.

Thanks for the matrix optimization tip as well, I didn't know that!

Not quite; the point of the technique is to minimize draw calls further than instancing can go. It enables you to render lots of different geometry with different textures using only a single DrawInstanced call.
Basically it IS DrawInstanced with these two parameters, but without any of the overhead coming from the draw calls.

The overhead of an actual draw call is extremely low. The biggest issue performance-wise is when you need to swap vertex/index buffers between the calls, which can be avoided by having one giant buffer and using the Start* variables.
To render all the combinations of trees he lists, a total of 405 draw calls are needed. That is spare change if you don't need to swap buffers, shaders, or textures (a modern machine can easily handle 50k-100k draw calls per frame at 60 fps if you don't involve state changes). The GPU is definitely going to be the bottleneck, so incurring GPU overhead to reduce API overhead that is already low is not a smart idea.

Sure, but even copying 15000 world matrices to a buffer would take a lot of time. I think a better approach would be to use this only for static geometry and work with indices into a pre-filled buffer holding all of the instance information.

Yes, if you can avoid it, so much the better. You can also optimize for specific uses (e.g. if you only need position & orientation but no scale, send a float4 for the position and a float4 with a quaternion; if you only need the XZ position and Y is controlled globally or already baked, only send a float2. You can also use half formats to halve the bandwidth if precision isn't an issue).
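As a sketch, such a compacted layout could look like this (field names made up):

// 32 bytes per instance instead of the 48 of a full 4x3 matrix,
// assuming no scale is needed.
struct CompactInstance
{
    float position[4];    // xyz position; w is padding or can be repurposed
    float orientation[4]; // unit quaternion
};
// The vertex shader then rotates the vertex by the quaternion and adds the
// position instead of doing a full matrix multiply.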

Doing the math is easy, actually. A float4x3 is 48 bytes, so 15000 world matrices are 0.687 MB per frame. At 60 FPS, that's 41.19 MB/s.
A modern PCIe 3.0 x16 slot has a bandwidth of 15.75 GB/s. Even if we assume 50% is lost to overhead, that still gives you plenty of room.
You have to account, though, for the fact that this is not the only data you will be sending to the GPU, and if you make 6 passes (e.g. 5 shadow maps, 1 for final rendering) then that's 41.19 x 6 = 247.14 MB/s.
Also, if you target iGPUs (i.e. Intel cards), common DDR3 bandwidths are around 17-25 GB/s, but that bandwidth is shared with the entire system, including your own engine and the GPU's own rendering (e.g. texture fetches), so you'll be quite limited there and anything you can save is worth it.
Beware that on iGPUs you'll spend 247.14 MB/s writing the data to RAM, then another 247.14 MB/s reading that data back from RAM. If you're lucky, a big chunk of it will remain in the L3 cache, so iGPUs won't suffer that dramatically (after all, a single frame needs 4.12 MB, which fits in most L3 caches). But you still have to be wary.


The overhead of an actual draw call is extremely low. The biggest issue performance-wise is when you need to swap vertex/index buffers between the calls, which can be avoided by having one giant buffer and using the Start* variables.

I am done collecting all the buffers into one single big buffer, and it works pretty well, the packing at least. However, I am not quite sure how I would use different world matrices without actually switching at least one constant buffer, since I can't use the instance ID to figure out what I am currently rendering.

My idea would be to simply use DrawInstanced, passing only a single (maybe more) instance to render and setting the start instance to the index at which my object's instance data sits in a big structured buffer bound to the vertex shader. That way I can access the instance data using SV_InstanceID.
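In code, what I have in mind would look roughly like this (untested sketch):

#include <d3d11.h>

// Untested sketch of my idea: draw a single instance per object, using
// StartInstanceLocation as the index of this object's entry in a big
// structured buffer of instance data bound once to the vertex shader.
// Whether SV_InstanceID actually picks up that offset is what I'd have to test.
void DrawObject( ID3D11DeviceContext *ctx, UINT vertexCount,
                 UINT startVertex, UINT instanceSlot )
{
    ctx->DrawInstanced( vertexCount, 1, startVertex, instanceSlot );
}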

Would that be an appropriate solution, or do you maybe have a better idea? The engine I'm currently working on unfortunately isn't far enough along for me to test this now.


Yes, if you can avoid it, so much the better. You can also optimize for specific uses (e.g. if you only need position & orientation but no scale, send a float4 for the position and a float4 with a quaternion; if you only need the XZ position and Y is controlled globally or already baked, only send a float2. You can also use half formats to halve the bandwidth if precision isn't an issue).

Never really thought about this. I actually don't need scale for most of the objects I'm working with, so that is going to be a really nice optimization!

Thanks for your answers!


However, I am not quite sure how I would use different world matrices without actually switching at least one constant buffer, since I can't use the instance ID to figure out what I am currently rendering.

If you're talking about merge-instancing, isn't it just vertexid / size = instanceid?

-potential energy is easily made kinetic-

I am done collecting all the buffers into one single big buffer, and it works pretty well, the packing at least. However, I am not quite sure how I would use different world matrices without actually switching at least one constant buffer, since I can't use the instance ID to figure out what I am currently rendering.

To select the mesh to render within the draw call, use StartIndexLocation/StartVertexLocation.
For example, if you've got mesh A with 1000 vertices (32 bytes per vertex) and right after it mesh B with 500 vertices (24 bytes per vertex), then you need to set StartVertexLocation to 1334.

How did I arrive at 1334?
Mesh A needs 32000 bytes (1000 vertices * 32 bytes per vertex).
32000 / 24 bytes per vertex = 1333.33, rounded up to 1334.
This means Mesh B starts at byte offset 32016 (1334 * 24). You waste 16 bytes (32000 through 32016) as padding. To minimize the waste, you could load all meshes that share the same vertex size consecutively.
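In code, the rounding is something like this (a sketch):

#include <cstdint>

// Round the previous mesh's end (in bytes) up to the next multiple of the
// new mesh's vertex size, expressed as a first-vertex index.
uint32_t firstVertex( uint32_t prevEndBytes, uint32_t newStride )
{
    return ( prevEndBytes + newStride - 1 ) / newStride;
}
// firstVertex( 1000 * 32, 24 ) == 1334; byte offset = 1334 * 24 = 32016.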


To identify the instance:
Create a vertex buffer filled with uints in increasing order. You only need one; you can then reuse it for all the draws. In other words:
//At initialization time, fill the buffer with 0, 1, 2, ...
uint32_t *vertexBuffer = ...;
for( int i = 0; i < 4096; ++i )
    vertexBuffer[i] = i;
Note: the 4096 is arbitrary.

And bind that vertex buffer as instance data. We'll call this the "DRAWID". Then when you pass StartInstanceLocation = 500, the drawID will contain 500 for the first instance, 501 for the 2nd instance, etc. (SV_InstanceID is zero-based, thus we need this trick to get the actual value in the shader).

Now that you've got the instance ID, just load myWorldMatrices[drawID];
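The binding side of that trick would look something like this (a sketch; "DRAWID" and input slot 1 are arbitrary choices):

#include <d3d11.h>

// Input-layout element for the counting buffer: a per-instance uint that
// steps forward once per instance, giving the shader the real draw ID.
const D3D11_INPUT_ELEMENT_DESC drawIdElement =
{
    "DRAWID", 0, DXGI_FORMAT_R32_UINT,
    1,                             // input slot holding the counting buffer
    0,                             // byte offset within that buffer
    D3D11_INPUT_PER_INSTANCE_DATA,
    1                              // advance once per instance
};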


If you're talking about merge-instancing, isn't it just vertexid / size = instanceid?

My meshes don't use the same vertex counts as of yet, and I would like to get around that if possible.


To select the mesh to render within the draw call, use StartIndexLocation/StartVertexLocation.
For example, if you've got mesh A with 1000 vertices (32 bytes per vertex) and right after it mesh B with 500 vertices (24 bytes per vertex), then you need to set StartVertexLocation to 1334.

Is this really true? Don't you have to specify the index value of the first vertex? Wouldn't you just need to set that value to 0 for the first draw call and to 1000 for the second? I'm pretty sure that's how it works, at least for the indices.

You are probably talking about the offsets you can set while binding the buffers to the IA?



To identify the instance:
Create a vertex buffer filled with uints in increasing order. You only need one; you can then reuse it for all the draws. In other words:

//At initialization time, fill the buffer with 0, 1, 2, ...
uint32_t *vertexBuffer = ...;
for( int i = 0; i < 4096; ++i )
    vertexBuffer[i] = i;

Note: the 4096 is arbitrary.

And bind that vertex buffer as instance data. We'll call this the "DRAWID". Then when you pass StartInstanceLocation = 500, the drawID will contain 500 for the first instance, 501 for the 2nd instance, etc. (SV_InstanceID is zero-based, thus we need this trick to get the actual value in the shader).

Now that you've got the instance ID, just load myWorldMatrices[drawID];

Won't SV_InstanceID be filled with the value I passed in the draw call, regardless of a second vertex buffer being bound? I guess I will have to test this, but it would save me the overhead of reading the same value as SV_InstanceID out of a buffer.


My meshes don't use the same vertex counts as of yet, and I would like to get around that if possible.

Yeah, sorry, I had only read a fourth of your article beforehand; I had just assumed it was merge-instancing and quit reading. I just perused the whole thing, and if I understand it correctly, it does the degenerate triangles of merge-instancing manually by making them get clipped. I still need to read it in depth to fully grasp what he's doing. But the constants seem to be in another structured buffer, at least that's what I got from the section entitled "Managing per-object constants". He does use DrawInstanced in his article, though, so you should have an instance ID available to you.

edit - sorry, I didn't realize you had moved on to a different technique.

-potential energy is easily made kinetic-

Is this really true? Don't you have to specify the index value of the first vertex? Wouldn't you just need to set that value to 0 for the first draw call and to 1000 for the second? I'm pretty sure that's how it works, at least for the indices.
You are probably talking about the offsets you can set while binding the buffers to the IA?

Oh I was thinking in DrawInstanced terms (non-indexed version) for simplicity.

Let's talk in DrawIndexedInstanced terms so we're on the same page:
Mesh A: 1000 vertices (vertex format = 32 bytes per vertex), 300 indices.
Mesh B: 500 vertices (vertex format = 24 bytes per vertex), 200 indices.

To Render Mesh A:
StartIndexLocation = 0;
BaseVertexLocation = 0;

To Render Mesh B:
StartIndexLocation = 300; //Assumes index data of Mesh B starts right after the data of Mesh A.
BaseVertexLocation = 1334; //It would be 1000 if mesh B had a vertex format of 32 bytes per vertex (or if Mesh A also had a format of 24 bytes).

If you don't understand why it's 1334 instead of 1000, think of it this way: if you set BaseVertexLocation to 1000, the GPU will read the data at byte offset 24000 (because 1000 * 24 = 24000)... but that's Mesh A!!! (Mesh A lives in the range [0; 32000).)
Of course if both mesh A & B have the same vertex format, then you set BaseVertexLocation to 1000.
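Spelled out as actual calls (one instance each, for the sake of the example; ctx is the ID3D11DeviceContext):

// IndexCountPerInstance, InstanceCount, StartIndexLocation,
// BaseVertexLocation, StartInstanceLocation:
ctx->DrawIndexedInstanced( 300, 1, 0,   0,    0 ); // Mesh A
ctx->DrawIndexedInstanced( 200, 1, 300, 1334, 0 ); // Mesh B (24-byte stride)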

Won't SV_InstanceID be filled with the value I passed in the draw call, regardless of a second vertex buffer being bound?

Unfortunately no, SV_InstanceID is 0-based irrespective of the value of StartInstanceLocation. Fortunately, the overhead from using the 2nd vertex buffer is extremely low (in CPU terms, you only need to bind it once; in GPU terms, the entire buffer fits in the cache).

