Face Instancing: Dividing Draw Calls?

Started by
17 comments, last by Drakken255 11 years, 8 months ago
Hello all,


Project Info:

So at the moment, I am using a single instance buffer containing information for roughly 32,000 square faces. There is a high possibility that once every second or slightly less, faces may need to be removed, or others added. Each face is part of a block, which is part of a chunk. (Yes... Minecraft clone... I'm bored, okay?) My math says that 3 chunks make about 32,000 faces that are exposed and must be drawn, and Minecraft draws about 50 chunks at a time, so:
32,000 faces / 3 chunks = ~10,667 faces per chunk.
A single instance buffer covering all 50 chunks would hold 10,667 * 50 = 533,350 faces to draw.
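
The same estimate as a small Python sketch, in case anyone wants to plug in their own measurements. The function name `estimate_faces` is made up for illustration; the inputs are the figures from the post. Carrying the division through without rounding first gives a slightly smaller total than rounding to 10,667 per chunk.

```python
def estimate_faces(faces_measured, chunks_measured, chunks_drawn):
    """Scale a measured face count up to the number of chunks drawn."""
    faces_per_chunk = faces_measured / chunks_measured
    total_faces = faces_per_chunk * chunks_drawn
    return faces_per_chunk, total_faces


per_chunk, total = estimate_faces(32_000, 3, 50)
# Faces per chunk, total faces, and the equivalent number of 6-faced cubes.
print(round(per_chunk), round(total), round(total / 6))
# prints: 10667 533333 88889
```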

This, as it is, is too much for XNA to handle at a decent frame rate. I am using a testing scenario that renders cubes, so 533,350 faces / 6 = ~88,892 cubes, which brings the frame rate from a flat 60 down to the 30s. If we skip that problem and assume it is fine:

What if the player destroys one block? One surface instance must be deleted and 5 more must take its place. That means recalculating every single face needing to be drawn, AND re-initializing an entire instance buffer containing an ungodly amount of instances, all in one frame.
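
For reference, the face change caused by destroying one block can be worked out from its six neighbours alone. The sketch below is only an illustration under assumptions that are not in the post: blocks are stored as a set of `(x, y, z)` tuples, and a face is identified by `(block position, direction)`. The names `NEIGHBOURS` and `faces_after_removal` are hypothetical.

```python
# Six axis-aligned neighbour offsets, one per cube face.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def faces_after_removal(solid, pos):
    """Return (removed, added) face sets for destroying the block at pos.

    solid is a set of (x, y, z) positions of solid blocks.
    A face is the pair (block position, outward direction).
    """
    removed, added = set(), set()
    x, y, z = pos
    for dx, dy, dz in NEIGHBOURS:
        neighbour = (x + dx, y + dy, z + dz)
        if neighbour in solid:
            # The neighbour's face pointing back at the hole becomes exposed.
            added.add((neighbour, (-dx, -dy, -dz)))
        else:
            # This face of the destroyed block was exposed and now disappears.
            removed.add((pos, (dx, dy, dz)))
    return removed, added


# A 3x3x2 slab; destroy the block in the middle of the top layer.
solid = {(x, y, z) for x in range(3) for y in range(2) for z in range(3)}
removed, added = faces_after_removal(solid, (1, 1, 1))
print(len(removed), len(added))  # prints: 1 5
```

This reproduces the "one deleted, five added" case from the post without touching any other face in the world.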

This obviously is not the way to do it, because the engine takes about 6 seconds as it is to load with only 3 chunks.

Question:

Would it hurt the CPU on the average draw call to split one large batch into roughly 50 smaller ones? Would my computer fall over dead if I made each chunk draw its own set of instances, containing about 10,000 faces each?
Yes it will; the fewer draw calls the better in most situations.

You may be going about the problem in the wrong way. The graphics card is made to have huge amounts of vertices and indices deleted and loaded every frame.


Here's how I got around this same problem with a very good frame rate.

For every chunk, calculate which cube faces are visible, then store the vertices and indices for the faces in memory.

Create 1 very large static vertex buffer and fill it with all vertices from all the chunks.

Create 1 very large dynamic index buffer.

Every frame, use frustum culling to find which chunks are visible to the camera, fill the index buffer with the indices from the visible chunks, and draw.


If a cube is destroyed, you will need to reload the visible faces for that chunk and refill the whole vertex buffer. Don't worry; like I said, graphics cards are built to do this!
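
A rough sketch of the per-frame index fill described above, written in Python rather than XNA code. Everything named here is hypothetical: `chunk_indices` assumes each visible face is a quad with 4 consecutive vertices in the shared vertex buffer, and `is_visible` stands in for whatever frustum test is used.

```python
def chunk_indices(base_vertex, quad_count):
    """Indices for quad_count quads stored consecutively from base_vertex.

    Each quad uses 4 vertices and 6 indices (two triangles).
    """
    out = []
    for q in range(quad_count):
        v = base_vertex + 4 * q
        out += [v, v + 1, v + 2, v, v + 2, v + 3]
    return out


def build_frame_indices(chunks, is_visible):
    """Concatenate the precomputed indices of every visible chunk."""
    indices = []
    for chunk in chunks:
        if is_visible(chunk):
            indices.extend(chunk["indices"])
    return indices


# Three chunks with 2, 1 and 3 visible quads, packed into one vertex buffer.
chunks, base = [], 0
for name, quads in (("a", 2), ("b", 1), ("c", 3)):
    chunks.append({"name": name, "indices": chunk_indices(base, quads)})
    base += 4 * quads

# Pretend the frustum test rejected chunk "b".
frame = build_frame_indices(chunks, lambda c: c["name"] != "b")
print(len(frame), frame[12])  # prints: 30 12
```

The vertex data never moves here; only the list of indices changes from frame to frame.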
Thing is, my instancing tester lags out at 25,000 cubes, equal to 150,000 faces. And that's without modifying the buffer every frame. This means for every block that is destroyed, the buffer, containing almost quadruple that, will have to be reloaded. This would create a noticeable spike every time a block is added or destroyed. My computer may be old, but it can still keep up with Minecraft. So I know that these performance problems I am hitting will hit all but the best computers. Perhaps the best solution is to have each chunk store its own instance data and update it on its own, but when it comes to drawing, frustum cull like you said, and THEN fill the final dynamic buffer with the visible data. Of course, with this method, the buffer needs to be updated any time the camera changes, rather than once per block modification. Who knows, reloading the buffer with less data may actually be better.
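
For the culling step, a common approach is to test each chunk's axis-aligned bounding box against the frustum planes. The sketch below is a generic version of that test, not XNA's own API; it assumes each plane is given as `(nx, ny, nz, d)` with the normal pointing into the visible volume, so a point is inside when `nx*x + ny*y + nz*z + d >= 0`. The test is conservative: a box near a frustum corner may be reported visible even though it is just outside.

```python
def aabb_outside_plane(box_min, box_max, plane):
    """True if the whole box lies on the negative side of the plane."""
    nx, ny, nz, d = plane
    # Pick the box corner that lies furthest along the plane normal.
    px = box_max[0] if nx >= 0 else box_min[0]
    py = box_max[1] if ny >= 0 else box_min[1]
    pz = box_max[2] if nz >= 0 else box_min[2]
    return nx * px + ny * py + nz * pz + d < 0


def aabb_in_frustum(box_min, box_max, planes):
    """Conservative visibility test: reject only if fully behind some plane."""
    return not any(aabb_outside_plane(box_min, box_max, p) for p in planes)


# A box-shaped stand-in for a frustum: 0 <= x, y, z <= 10.
planes = [
    (1, 0, 0, 0), (-1, 0, 0, 10),
    (0, 1, 0, 0), (0, -1, 0, 10),
    (0, 0, 1, 0), (0, 0, -1, 10),
]
print(aabb_in_frustum((2, 2, 2), (4, 4, 4), planes))    # prints: True
print(aabb_in_frustum((11, 0, 0), (12, 1, 1), planes))  # prints: False
print(aabb_in_frustum((9, 0, 0), (12, 1, 1), planes))   # prints: True
```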
bullfrog is right, fewer draw calls are better, but as you can see, maintaining that amount of data is quite hard if you brute-force it and rely on the raw power of your hardware.
First: XNA can handle millions of triangles, even on older systems; it is only limited by the overhead your code produces.
I don't think Minecraft updates whole chunks. It breaks the chunk into smaller parts and manages them with some sort of quadtree or whatever. Then you can use the tree nodes for frustum culling, which is faster than doing it for every cube. I would also think about the management of the buffers. What's the difference between different blocks? It's not the block itself; all blocks have the same size and 8 vertices. What I would try is to build one cube model and, for each block type (stone, wood, etc.), one instance buffer including the transform parameter, texture information, etc. So you end up with one vertex buffer with 8 vertices, an index buffer containing indices for 6 faces, and the instance buffer. Maybe you have more draw calls this way, but you can drastically reduce them with culling, and you could try merging the data from the tree parts together so you end up with only one instance buffer per block type.
The merging will give you some lag spikes, which can be eliminated by using separate update threads and double buffering. The update thread will only update the affected chunk part(s) and merge the data into a second buffer. Until the thread is finished you draw the "old" data, and when it's finished you switch the buffer and draw the updated one.
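
The double-buffering idea can be sketched roughly like this. It is a language-neutral illustration in Python, not XNA code; the class name `DoubleBufferedInstances` and the `build` callback are invented for the example, and a real version would upload to a second GPU buffer instead of swapping Python lists.

```python
import threading


class DoubleBufferedInstances:
    """Keep drawing the front list while a worker rebuilds the back one."""

    def __init__(self, initial):
        self._front = list(initial)  # what the draw loop reads
        self._lock = threading.Lock()

    def front(self):
        with self._lock:
            return self._front

    def rebuild_async(self, build):
        """Run build() on a worker thread, then swap its result in."""
        def worker():
            back = build()          # the slow merge happens off the draw thread
            with self._lock:
                self._front = back  # swap: the next draw sees the new data

        thread = threading.Thread(target=worker)
        thread.start()
        return thread


buffers = DoubleBufferedInstances(["old face data"])
worker = buffers.rebuild_async(lambda: ["new face a", "new face b"])
drawn_meanwhile = buffers.front()  # old or new, depending on timing
worker.join()
print(buffers.front())  # prints: ['new face a', 'new face b']
```

The draw loop never waits on the rebuild; it only ever takes the lock for the length of a reference swap.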
The `Index Buffer` will need updating every frame based on what the camera can see.

The `Vertex Buffer` will only need updating every time a block is destroyed. This may only happen once every 1.5 seconds, based on the rate you can destroy blocks.

Notice the index buffer is made up of 4-byte integers. If you had 150,000 faces in your vertex buffer, you would need 900,000 indices (6 per face) to draw every face.

Add frustum culling, which will take it down to ~33% (based on what the camera can see), and 297,000 indices are now required to draw all the faces that the camera can see.

297,000 indices * 4 bytes = 1,188,000 bytes, or about 1.13 MB

That amount of data should have no problem being sent down to the graphics card every frame.
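
The arithmetic above, as a small Python helper for trying other face counts or culling ratios. The function name and its defaults (6 indices per face, 4 bytes per index) are just the assumptions stated in the post.

```python
def index_upload_bytes(faces, visible_fraction, indices_per_face=6, bytes_per_index=4):
    """Return (total indices, visible indices, bytes uploaded per frame)."""
    total = faces * indices_per_face
    visible = round(total * visible_fraction)
    return total, visible, visible * bytes_per_index


total, visible, nbytes = index_upload_bytes(150_000, 0.33)
print(total, visible, nbytes, round(nbytes / 2**20, 2))
# prints: 900000 297000 1188000 1.13
```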
Oh, no no no no... I'm going straight-up instancing. One quad. Sorry to be harsh, but I just can't imagine the process it would take to request the actual faces from each block and chunk, and then calculate the proper index positions. And I think the problem is less in sending the data and more in calculating what is sent. I tried to use a full cube for instancing, and found it didn't give me the freedom to customize the texture for each side. So it has to be faces. I will, however, take your idea and split each chunk into levels, like 0-63, 64-127, etc. This will give each chunk 8 instance arrays, to be combined based on whether the sub-chunk is within viewing range. This should lower the cost of updating an array, since each sub-chunk only contains a maximum of 98,304 as opposed to a full chunk's 786,432. All in all, the real dilemma is how to reduce the time it takes to calculate the array of instances. Obviously, the less there is in each array to merge, the less data in total. Later, I will try to get sub-chunk instancing to work.
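
A quick sketch of the sub-chunk split. The chunk dimensions are an assumption: the post does not state them, but a 16 x 16 footprint, 512 levels, and 6 faces per block reproduce both maximums quoted above. All names are hypothetical.

```python
# Assumed dimensions (not stated in the post); they reproduce its numbers.
CHUNK_W, CHUNK_D, CHUNK_H = 16, 16, 512
SLICE_H = 64  # levels 0-63, 64-127, ...


def sub_chunk_index(y):
    """Which vertical slice a block at height y belongs to."""
    return y // SLICE_H


def max_faces(width, depth, height):
    """Upper bound on faces: every block exposing all 6 sides."""
    return width * depth * height * 6


def merge_visible(slices, visible):
    """Concatenate the instance lists of the slices that are in view."""
    merged = []
    for i, instances in enumerate(slices):
        if i in visible:
            merged.extend(instances)
    return merged


print(CHUNK_H // SLICE_H)                      # prints: 8
print(max_faces(CHUNK_W, CHUNK_D, SLICE_H))    # prints: 98304
print(max_faces(CHUNK_W, CHUNK_D, CHUNK_H))    # prints: 786432
print(sub_chunk_index(63), sub_chunk_index(64))  # prints: 0 1
```

When a block changes, only the slice returned by `sub_chunk_index` needs its instance list rebuilt before the merge.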
Hmm, OK, I don't know what you do with the textures. My first thought was to put all of the textures for one cube in one big texture, so you can add the texture coordinates to the vertices and don't need to calculate that stuff.
The problem with texturing lies in the fact that each side may need to be different. Sure, I could put in the proper texture coordinates for one block type, but if I'm using the same 8 vertices across 10-100 different types of blocks, I need to be able to instance the coordinates. It's as simple as having each sub-class (block type) hold constants pointing to the right texture atlas coordinates. Each time that block's faces need to be drawn, I just use a switch based on which face it is, and pass the correct atlas coordinates for each visible face.
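
A minimal sketch of the per-face atlas lookup, using a table instead of a switch. The block names, face names, and tile coordinates are all made up for illustration; only the idea (each block type holds constants, and the face selects one of them) comes from the post.

```python
# Hypothetical atlas tile coordinates (column, row) per block type.
ATLAS = {
    "grass": {"top": (0, 0), "bottom": (2, 0), "side": (3, 0)},
    "stone": {"all": (1, 0)},
}


def atlas_coord(block_type, face):
    """Pick the atlas tile for one face of one block type.

    face is one of: top, bottom, north, south, east, west.
    """
    entry = ATLAS[block_type]
    if face in entry:
        return entry[face]
    if face not in ("top", "bottom") and "side" in entry:
        return entry["side"]
    return entry["all"]


print(atlas_coord("grass", "top"))    # prints: (0, 0)
print(atlas_coord("grass", "north"))  # prints: (3, 0)
print(atlas_coord("stone", "west"))   # prints: (1, 0)
```

The returned coordinate is what would go into the per-face instance data.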
OK. I just implemented the per-chunk instancing pattern without frustum culling (couldn't get the intersection check right), and found I can render 16 chunks at 25-30 FPS. I checked the numbers: with the current visible face counter, there are 42,249 faces being drawn. Each face has an instance with the following data: the transform, a base texture, an overlay texture coord, a "break" overlay texture coord, and a color. I'll explain the necessity of the overlays and color in a bit. The size of the instance has been calculated as 4 bytes (32 bits) per float * 26 floats = 104 bytes. So 42,249 instances * 104 bytes per instance = 4,393,896 bytes. Through conversion, this totals out to 4.19 MB of data sent per frame. Is this an acceptable number? Also, I am combining the instance buffer by using List.AddRange all the way until it's ready for the GPU, where I use the ToArray method. Is this faster than manually appending arrays?

To answer the imminent question: I need an overlay texture coord because some blocks in Minecraft rely on overlays and coloring to smooth the land's look. That is also why I have the tint color: to be able to change the grass color. I need the break overlay coord to allow for (duh) breaking graphics to render within the shader, where it's easiest (and likely quickest) to modify individual pixels.
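
The per-instance size and per-frame total above can be double-checked with Python's `struct` module. The split of the 26 floats is a guess, since the post only gives the total: a 4x4 transform (16), three 2-float texture coordinates (6), and a 4-float color (4).

```python
import struct

# Assumed layout: 16-float transform, three 2-float texture coords,
# 4-float color. Only the total of 26 floats comes from the post.
INSTANCE_FORMAT = "<16f2f2f2f4f"

instance_size = struct.calcsize(INSTANCE_FORMAT)
frame_bytes = 42_249 * instance_size

print(instance_size, frame_bytes, round(frame_bytes / 2**20, 2))
# prints: 104 4393896 4.19
```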
I have a faint memory that Minecraft would split each chunk into a vertical bar of 16^3-sized chunks for rendering...

Just draw each chunk separately (and depth sort them, maybe that will let you draw moar interesting pixels...)

o3o
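
A small sketch of the depth sort suggested above: ordering chunks nearest-first, which generally lets the depth test reject hidden pixels of farther chunks. Representing each chunk by its centre point and the function name `sort_front_to_back` are assumptions for the example.

```python
def sort_front_to_back(chunk_centres, camera):
    """Order chunk centres by squared distance from the camera, nearest first."""
    def dist_sq(centre):
        return sum((a - b) ** 2 for a, b in zip(centre, camera))

    return sorted(chunk_centres, key=dist_sq)


order = sort_front_to_back([(32, 0, 0), (0, 0, 0), (16, 0, 0)], camera=(1, 0, 0))
print(order)  # prints: [(0, 0, 0), (16, 0, 0), (32, 0, 0)]
```

Squared distance is enough for ordering, so no square root is needed.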

