Need ideas about 'Baking' instances into vertex shader (no GS)

Started by
17 comments, last by Mr_Fox 7 years, 3 months ago

Hey Guys,

My recent project in DX12 requires a pass that draws 200k+ cubes (see the image below; I need one solid pass and one wireframe pass for debugging). The pass has only a trivial PS, so it is pretty much VS bound.

[attachment=34179:Capture.PNG]

The way I do it is with instancing, and I was OK with the performance until I saw this post.

Then I quickly crafted my own version to benchmark it, and the VS version immediately cut the time for that pass in half. However, the implementation needs a gigantic IB, in my case hundreds of MB, which I really don't want.

So I was wondering whether anyone has an idea how to do this without instancing, without a GS, and without this memory burden?

I know a trivial solution: use a triangle list instead of a triangle strip. But that almost kills vertex reuse in the VS (a triangle list processes 36 verts/cube instead of 14), and since this pass is VS intensive, I guess a triangle list is sub-optimal (please correct me if my assumption is wrong).

I also know a triangle strip solution, which only needs 2 duplicated vertices at the start and end of each strip to form 2 degenerate triangles connecting the previous and next cube. But that only works for the solid pass; my wireframe pass would show undesired lines...

Then I ran out of ideas...

Thanks in advance.

However, the implementation needs a gigantic IB, in my case hundreds of MB, which I really don't want.

Are you sure about that? For 200k cubes I calculated that you'd need a 14MB index buffer: 200,000 cubes * 12 tris/cube * 3 vertices/tri * 2 bytes/index.
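The estimate above can be checked with a few lines of arithmetic. This is only a sketch of the calculation in the post; the function name `index_buffer_bytes` is made up for illustration.

```python
def index_buffer_bytes(cubes, indices_per_cube, bytes_per_index):
    """Size in bytes of an index buffer that stores every cube explicitly."""
    return cubes * indices_per_cube * bytes_per_index

# Indexed triangle list: 12 triangles * 3 indices per cube, 16-bit indices.
list_16bit = index_buffer_bytes(200_000, 12 * 3, 2)
print(f"{list_16bit / 1e6:.1f} MB")
```

So the figure is 14.4 MB (decimal megabytes) for 200k cubes with 16-bit indices.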

I know a trivial solution: use a triangle list instead of a triangle strip. But that almost kills vertex reuse in the VS (a triangle list processes 36 verts/cube instead of 14), and since this pass is VS intensive, I guess a triangle list is sub-optimal (please correct me if my assumption is wrong).

If you use an indexed triangle list, vertex reuse happens because of the post-transform vertex cache (assuming good vertex ordering).
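To illustrate the reuse argument, here is a rough CPU-side sketch that counts how many vertex shader invocations a simple FIFO post-transform cache would need for one cube. The cache model, its size of 16 entries, and the particular index list are assumptions for illustration only; real hardware behaves differently in detail, and triangle winding is not considered here.

```python
from collections import deque

# Hypothetical index list for one cube: corner id = x + 2*y + 4*z,
# 6 faces * 2 triangles * 3 indices = 36 indices over 8 shared corners.
CUBE_INDICES = [
    0, 1, 2,  2, 1, 3,   # z = 0 face
    4, 6, 5,  5, 6, 7,   # z = 1 face
    0, 4, 1,  1, 4, 5,   # y = 0 face
    2, 3, 6,  6, 3, 7,   # y = 1 face
    0, 2, 4,  4, 2, 6,   # x = 0 face
    1, 5, 3,  3, 5, 7,   # x = 1 face
]

def count_vs_invocations(indices, cache_size=16):
    """Count cache misses, i.e. vertices that must be shaded, for a FIFO cache."""
    cache = deque(maxlen=cache_size)
    invocations = 0
    for idx in indices:
        if idx not in cache:
            invocations += 1
            cache.append(idx)
    return invocations

indexed = count_vs_invocations(CUBE_INDICES)
# A non-indexed list behaves as if every vertex had its own unique index.
non_indexed = count_vs_invocations(list(range(36)))
print(indexed, non_indexed)
```

Under this simplified model the indexed list shades each of the 8 corners once, while the non-indexed list shades all 36 vertices.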

-potential energy is easily made kinetic-

Are you sure about that? For 200k cubes I calculated that you'd need a 14MB index buffer: 200,000 cubes * 12 tris/cube * 3 vertices/tri * 2 bytes/index.

Sorry, I should have given a bit more context. 200k+ cubes is the expected count after culling in my VS. The input cube count is 128^3 (in my dynamic scene, the worst case has more than 100^3 cubes to draw), and the number of indices exceeds the 16-bit uint max (the method I took from the post requires each index to be unique, so a 16-bit IB won't work here). So I have to use 32-bit indices, and the total memory is more than 100MB.
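A quick sketch of where the "more than 100MB" figure could come from. It assumes 14 unique strip indices per cube with no restart indices, which is a guess about the method from the linked post, not something stated in the thread.

```python
INDICES_PER_CUBE = 14      # assumption: one 14-vertex strip per cube
MAX_UINT16 = 0xFFFF

input_cubes = 128 ** 3
unique_indices = input_cubes * INDICES_PER_CUBE

# Every index must be unique, so the largest index value is what decides the width.
needs_32bit = (unique_indices - 1) > MAX_UINT16
size_bytes = unique_indices * 4   # 32-bit indices

print(input_cubes, unique_indices, needs_32bit, f"{size_bytes / 1e6:.1f} MB")
```

With these assumptions the buffer holds about 29.4 million indices and takes roughly 117 MB.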

If you use an indexed triangle list, vertex reuse happens because of the post-transform vertex cache (assuming good vertex ordering).

But if I use a triangle list, how would the VS know which vertices should be reused? I was thinking that without an IB, the GPU has no idea which vertices are shared by multiple triangles, so how could it reuse them from the cache?

Thanks

Sorry, I should have given a bit more context. 200k+ cubes is the expected count after culling in my VS. The input cube count is 128^3 (in my dynamic scene, the worst case has more than 100^3 cubes to draw), and the number of indices exceeds the 16-bit uint max. So I have to use 32-bit indices, and the total memory is more than 100MB.

Break it up into multiple draw calls and use 16-bit indices.
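One way to picture the "multiple draw calls" suggestion is the batching plan below. It is a sketch under assumptions: each cube consumes 8 unique index values (one per corner), every draw reuses the same 16-bit index buffer, and a hypothetical per-draw constant `first_cube` tells the shader where the batch starts. None of these names come from the thread.

```python
def plan_batches(total_cubes, verts_per_cube=8, index_bits=16):
    """Split cubes into draws so each draw's index values fit in index_bits."""
    max_values = 1 << index_bits                  # 65536 distinct 16-bit values
    cubes_per_batch = max_values // verts_per_cube
    batches = []
    first_cube = 0
    while first_cube < total_cubes:
        count = min(cubes_per_batch, total_cubes - first_cube)
        batches.append((first_cube, count))       # (first_cube, cube_count)
        first_cube += count
    return batches

batches = plan_batches(200_000)
print(len(batches), batches[0], batches[-1])
```

With these numbers a full batch is 8192 cubes, so a shared triangle-list index buffer would be 8192 * 36 * 2 bytes, a bit under 600 KB, instead of one unique index per vertex across the whole scene.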

But if I use a triangle list, how would the VS know which vertices should be reused? I was thinking that without an IB, the GPU has no idea which vertices are shared by multiple triangles, so how could it reuse them from the cache?

I said an indexed triangle list, not a plain triangle list. So yes, without an index buffer there is no reuse.

Are your vertices pretransformed? Or are you using a matrix per cube?

Breaking it up sounds very promising, and I will give an indexed triangle list a try. Thanks so much.

In my case, I only have an 8-vertex VB, and I read offset information from a buffer (which means each cube needs a unique 'ID' in the VS) and transform each cube accordingly in the VS. So yes, I use a matrix per cube. Any suggestions? Thanks so much.

By the looks of your screenshot, your cubes are all the same size and adjacent to each other, is that right? If so, can you batch cubes together so that they share more common vertices? Then use a dynamic index buffer / indexed triangle list for each chunk (i.e. one matrix per chunk).

Sorry, I should have given a bit more context. 200k+ cubes is the expected count after culling in my VS. The input cube count is 128^3 (in my dynamic scene, the worst case has more than 100^3 cubes to draw), and the number of indices exceeds the 16-bit uint max (the method I took from the post requires each index to be unique, so a 16-bit IB won't work here). So I have to use 32-bit indices, and the total memory is more than 100MB.

In my case, I only have an 8-vertex VB, and I read offset information from a buffer (which means each cube needs a unique 'ID' in the VS) and transform each cube accordingly in the VS. So yes, I use a matrix per cube.

If you are using one matrix per cube, then you don't need 32-bit indices, unless I'm missing something.

BTW, if you use straight instancing you will underutilize the GPU, since AMD/Nvidia GPUs operate on 64/32 vertices at a time.
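The underutilization remark can be put into rough numbers. The sketch below assumes that vertices from different instances are never packed into the same wave, which is the premise of the comment above and may not hold on every GPU or driver; `lane_utilization` is an illustrative helper, not a real API.

```python
def lane_utilization(verts_per_instance, wave_size):
    """Fraction of wave lanes doing useful work if each instance starts a new wave."""
    waves = -(-verts_per_instance // wave_size)   # ceiling division
    return verts_per_instance / (waves * wave_size)

for wave_size in (64, 32):
    print(wave_size, lane_utilization(14, wave_size), lane_utilization(8, wave_size))
```

Under that assumption a 14-vertex cube instance would fill about 22% of a 64-wide wave and about 44% of a 32-wide one, which is consistent with the speedup reported from moving away from instancing.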

If you are using one matrix per cube, then you don't need 32-bit indices, unless I'm missing something.

Thanks for being so helpful. I really appreciate it.

I don't quite get why I wouldn't need 32-bit indices when using one matrix per cube, unless I break my cubes into multiple batches. If each cube needs one matrix, then each cube needs a unique ID (an index into the matrix buffer) to find the right matrix for that cube. If we don't break the cubes into multiple batches, this won't work with more than 65536 cubes, since a 16-bit IB can't provide more than 65536 unique IDs.

I think the original screenshot I posted is a little misleading. I got it from the VSDG vertex shader capture. Here is the actual screenshot:

[attachment=34182:Capture1.PNG]

So my project is about 3D reconstruction. The model I am reconstructing lives inside a TSDF volume, and those cubes form a spatial structure indicating that the model surface lies inside them. They are generated during the model update pass (each block's center location is added to an append buffer). Later I need to render those 'active blocks' to get a min/max depth (kind of like a depth prepass, but I need not only the min depth but also the max depth). This is the pass I mentioned in this post, the one that benefits a lot from not using instancing.

So now you know that those cubes are axis-aligned and all the same size, but you cannot assume more than that...

If all you need to do is render cubes, then you don't need an IB or VB at all:

A 14-vertex tristrip cube in the vertex shader:

b = 1 << i;
x = (0x287a & b) != 0;
y = (0x02af & b) != 0;
z = (0x31e3 & b) != 0;

Where i is SV_VertexID (apply a modulo, i.e. SV_VertexID % 14, to draw multiple cubes in the same draw).

Then set the position, rotation and scale by pulling the transform from a constant buffer or SRV indexed by SV_VertexID / 14.

No index buffer, no vertex buffer, no GS, and no instancing. Only raw ALU and one SRV with each cube's transform.
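The bit trick can be checked on the CPU. The sketch below decodes the three magic numbers exactly as in the snippet above and builds the 12 strip triangles; the Python function names are illustrative and stand in for what the vertex shader would compute from SV_VertexID.

```python
MASK_X, MASK_Y, MASK_Z = 0x287A, 0x02AF, 0x31E3
VERTS_PER_CUBE = 14

def strip_corner(vertex_id):
    """Unit-cube corner (x, y, z) for a given vertex id, as the VS would derive it."""
    b = 1 << (vertex_id % VERTS_PER_CUBE)
    return (int((MASK_X & b) != 0),
            int((MASK_Y & b) != 0),
            int((MASK_Z & b) != 0))

def cube_index(vertex_id):
    """Which cube's transform to fetch, i.e. SV_VertexID / 14."""
    return vertex_id // VERTS_PER_CUBE

strip = [strip_corner(i) for i in range(VERTS_PER_CUBE)]
# A triangle strip forms one triangle from every 3 consecutive vertices.
triangles = [tuple(strip[k:k + 3]) for k in range(VERTS_PER_CUBE - 2)]

for i, corner in enumerate(strip):
    print(i, corner)
```

The 14 vertices touch all 8 corners and the 12 triangles cover each of the 6 faces twice. One caveat to keep in mind: in a single non-indexed strip draw, the last vertices of one cube and the first vertices of the next still form connecting triangles, so this sketch only validates the per-cube decoding.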

Wow, that's brilliant! Using bits to do this. Hmm... with a little modification, it seems we could use this method for lots of simple geometry: 16-bit magic numbers for any tristrip geometry of up to 16 vertices (more magic numbers may be needed when the coordinates are not just 0 or 1). How did you come up with that idea?
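As a sketch of that generalization, the helper below packs any strip of up to 16 vertices with 0/1 coordinates into three 16-bit masks. The function name `encode_strip` and the `CUBE_STRIP` list are illustrative; the list is simply the corner sequence obtained by decoding the magic numbers posted above.

```python
def encode_strip(corners):
    """Pack a strip of (x, y, z) corners with 0/1 coordinates into three bit masks."""
    if len(corners) > 16:
        raise ValueError("a 16-bit mask can describe at most 16 strip vertices")
    masks = [0, 0, 0]
    for i, corner in enumerate(corners):
        for axis, bit in enumerate(corner):
            if bit not in (0, 1):
                raise ValueError("coordinates must be 0 or 1")
            masks[axis] |= bit << i   # bit i of each mask holds vertex i's coordinate
    return tuple(masks)

# Corner sequence of the 14-vertex cube strip from the post above.
CUBE_STRIP = [
    (0, 1, 1), (1, 1, 1), (0, 1, 0), (1, 1, 0), (1, 0, 0), (1, 1, 1), (1, 0, 1),
    (0, 1, 1), (0, 0, 1), (0, 1, 0), (0, 0, 0), (1, 0, 0), (0, 0, 1), (1, 0, 1),
]

mask_x, mask_y, mask_z = encode_strip(CUBE_STRIP)
print(hex(mask_x), hex(mask_y), hex(mask_z))
```

Encoding the cube strip reproduces 0x287a, 0x02af and 0x31e3, so the same helper could be used to derive masks for other small strips.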

Thanks
