Face Instancing: Dividing Draw Calls?


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

18 replies to this topic

#1 Drakken255   Members   -  Reputation: 173

Posted 07 August 2012 - 08:56 AM

Hello all,


Project Info:

So at the moment, I am using a single instance buffer containing information for roughly 32,000 square faces. There is a high possibility that once every second or slightly less, faces may need to be removed or others added. Each face is part of a block, which is part of a chunk. (Yes... Minecraft clone... I'm bored, okay?) My math says that 3 chunks make about 32,000 faces that are exposed and must be drawn, and Minecraft draws about 50 chunks at a time, so:
32,000 faces / 3 chunks = ~10,667 faces per chunk.
A single instance buffer would hold 10,667 * 50 = 533,350 faces to draw.

This, as it is, is too much for XNA to handle at a decent frame rate. I am using a testing scenario that renders cubes, so 533,350 faces / 6 = ~88,892 cubes, which brings the frame rate from a flat 60 to the 30s. If we skip that problem and assume it is fine:

What if the player destroys one block? One surface instance must be deleted and 5 more must take its place. That means recalculating every single face needing to be drawn, AND re-initializing an entire instance buffer containing an ungodly amount of instances, all in one frame.

This obviously is not the way to do it, because the engine takes about 6 seconds as it is to load with only 3 chunks.

Question:

Would it hurt the CPU on the average draw call to split one large batch into roughly 50 smaller ones? Would my computer fall over dead if I made each chunk draw its own set of instances, containing about 10,000 faces each?

Edited by Drakken255, 07 August 2012 - 09:25 AM.


#2 bullfrog   Members   -  Reputation: 481

Posted 08 August 2012 - 03:47 AM

Yes, it will. The fewer draw calls the better in most situations.

You may be going around the problem in the wrong way. The graphics card is made to have huge amounts of vertices and indices deleted and loaded every frame.


Here's how I got around this same problem with a very good frame rate.

For every chunk, calculate which cube faces are visible, then store the vertices and indices for the faces in memory.

Create 1 very large static vertex buffer and fill it with all vertices from all the chunks.

Create 1 very large dynamic index buffer.

Every frame, use frustum culling to find which chunks are visible to the camera, fill the index buffer with the indices from the visible chunks, and draw.


If a cube is destroyed, you will need to reload the visible faces for that chunk and refill the whole vertex buffer. Don't worry: like I said, graphics cards are built to do this!
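
A minimal sketch of the per-frame gather step described above, in plain C# without any XNA types. `ChunkMesh`, `IsVisible` and `IndexGatherer` are hypothetical names for illustration, and it assumes each chunk's indices already point into the one shared vertex buffer. The filled part of `scratch` is what would then be handed to the dynamic index buffer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical per-chunk record: its indices refer to the shared vertex buffer.
public sealed class ChunkMesh
{
    public int[] Indices = new int[0];
    public bool IsVisible; // assumed to be set by a frustum test each frame
}

public static class IndexGatherer
{
    // Copies the indices of every visible chunk into 'scratch' and returns
    // how many indices were written. 'scratch' is reused between frames so
    // no array is allocated per frame; it must be large enough for all chunks.
    public static int Gather(IList<ChunkMesh> chunks, int[] scratch)
    {
        int written = 0;
        for (int c = 0; c < chunks.Count; c++)
        {
            ChunkMesh chunk = chunks[c];
            if (!chunk.IsVisible)
            {
                continue; // culled chunks contribute nothing this frame
            }
            Array.Copy(chunk.Indices, 0, scratch, written, chunk.Indices.Length);
            written += chunk.Indices.Length;
        }
        return written;
    }
}
```

Only the first `written` entries of `scratch` are valid after the call, so the draw would use that count rather than the array length.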

#3 Drakken255   Members   -  Reputation: 173

Posted 08 August 2012 - 05:24 AM

Thing is, my instancing tester lags out at 25,000 cubes, equal to 150,000 faces. And that's without modifying the buffer every frame. This means that for every block that is destroyed, the buffer, containing almost quadruple that, will have to be reloaded. This would create a noticeable spike every time a block is added or destroyed. My computer may be old, but it can still keep up with Minecraft. So I know that these performance problems I am hitting will hit all but the best computers. Perhaps the best solution is to have each chunk store its own instance data and update it on its own, but when it comes to drawing, frustum cull like you said, and THEN fill the final dynamic buffer with the visible data. Of course, with this method the buffer needs to be updated any time the camera changes, rather than once per block modification. Who knows, reloading the buffer with less data may actually be better.

#4 quiSHADgho   Members   -  Reputation: 325

Posted 08 August 2012 - 04:03 PM

bullfrog is right: fewer draw calls are better. But as you can see, maintaining that amount of data is quite hard if you brute-force it and rely on the raw power of your hardware.
First: XNA can handle millions of triangles, even on older systems; it's only limited by the overhead your code produces.
I don't think Minecraft updates whole chunks. It breaks the chunk into smaller parts and manages them with some sort of quadtree or similar. Then you can use the tree nodes for frustum culling, which is faster than doing it for every cube. I would also think about the management of the buffers. What's the difference between different blocks? It's not the block itself: all blocks have the same size and 8 vertices. What I would try is to build one cube model and, for each block type (stone, wood, etc.), one instance buffer including the transform parameter, texture information, and so on. So you end up with one vertex buffer with 8 vertices, an index buffer containing indices for 6 faces, and the instance buffer. Maybe you have more draw calls this way, but you can drastically reduce them with culling, and you could try merging the data from the tree parts together so you end up with only one instance buffer per block type.
The merging will give you some lag spikes, which can be eliminated by using separate update threads and double buffering. The update thread will only update the affected chunk part(s) and merge the data into a second buffer. Until the thread is finished you draw the "old" data, and when it's finished you switch the buffers and draw the updated one.
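
The double-buffering idea above could be sketched roughly like this. `DoubleBuffer<T>` is a hypothetical helper, not an XNA type, and the lock is just one simple way to make the swap safe between two threads.

```csharp
// Hypothetical helper: the render thread reads Front while the update
// thread fills Back, then Swap() exchanges the two references.
public sealed class DoubleBuffer<T>
{
    private readonly object gate = new object();
    private T front;
    private T back;

    public DoubleBuffer(T front, T back)
    {
        this.front = front;
        this.back = back;
    }

    // Render thread: the buffer that currently holds complete data.
    public T Front
    {
        get { lock (gate) { return front; } }
    }

    // Update thread: the buffer that may be overwritten.
    public T Back
    {
        get { lock (gate) { return back; } }
    }

    // Update thread: call only after the back buffer is completely filled.
    public void Swap()
    {
        lock (gate)
        {
            T previousFront = front;
            front = back;
            back = previousFront;
        }
    }
}
```

One caveat with this sketch: right after a swap the renderer may still be drawing from the buffer that has just become the back buffer, so the update thread should wait for the current frame to finish before writing into it again.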

#5 bullfrog   Members   -  Reputation: 481

Posted 08 August 2012 - 06:38 PM

The `Index Buffer` will need updating every frame based on what the camera can see.

The `Vertex Buffer` will only need updating every time a block is destroyed. This may only happen once every 1.5 seconds, based on the rate at which you can destroy blocks.

Notice the index buffer is made up of 4 byte integers. If you had 150,000 faces in your vertex buffer, you will need 900,000 indices to draw every face.

Add frustum culling, which will take it down to ~33% (based on what the camera can see), and 297,000 indices are now required to draw all the faces that the camera can see.

297,000 indices * 4 bytes = 1.13MB

That amount of data should have no problem being sent down to the graphics card every frame.
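
For anyone who wants to plug in their own numbers, here is the same arithmetic as a small helper. `IndexBudget` is a made-up name, and it assumes 6 indices per quad face and 32-bit indices, as in the figures above.

```csharp
public static class IndexBudget
{
    // Bytes uploaded per frame for the visible part of the index buffer.
    // facesTotal: faces stored in the vertex buffer.
    // visiblePercent: share left after frustum culling (33 in the post above).
    public static long VisibleIndexBytes(long facesTotal, int visiblePercent)
    {
        const int IndicesPerFace = 6; // two triangles per quad
        const int BytesPerIndex = 4;  // 32-bit indices
        long visibleIndices = facesTotal * IndicesPerFace * visiblePercent / 100;
        return visibleIndices * BytesPerIndex;
    }
}
```

`IndexBudget.VisibleIndexBytes(150000, 33)` gives 1,188,000 bytes, which is the 1.13 MB quoted above when divided by 1024 * 1024.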

#6 Drakken255   Members   -  Reputation: 173

Posted 08 August 2012 - 06:47 PM

Oh, no no no no... I'm going straight up instancing. One quad. Sorry to be harsh, but I just can't imagine the process it would take to request the actual faces from each block and chunk, and then properly calculate the proper index positions. And I think the problem is less in sending the data, and more in calculating what is sent. I tried to use a full cube for instancing, and found it didn't give me the freedom to customize the texture for each side. So it has to be faces. I will, however, take your idea and split each chunk into levels, like 0-63, 64-127, etc. This will give each chunk 8 instance arrays, to be combined based on whether the sub-chunk is within viewing range. This should lower the cost of updating an array, since each sub-chunk only contains a maximum of 98,304 as opposed to a full chunk's 786,432. All in all, the real dilemma is on how to reduce the time it takes to calculate the array of instances. Obviously, the less in each array to merge, the less data total. Later, I will try to get sub-chunk instancing to work.

#7 quiSHADgho   Members   -  Reputation: 325

Posted 09 August 2012 - 01:13 AM

Hmm ok I don't know what you do with the textures. My first thought was to put all of the textures for one cube in one big texture so you can add the texturecoordinates to the vertices and don't need to calculate that stuff.

Edited by quiSHADgho, 09 August 2012 - 01:13 AM.


#8 Drakken255   Members   -  Reputation: 173

Posted 09 August 2012 - 01:34 AM

The problem with texturing lies in the fact that each side may need to be different. Sure, I could put in the proper texture coordinates for one block type, but if I'm using the same 8 vertices across 10-100 different types of blocks, I need to be able to instance the coordinates. It's as simple as having each sub-class (block type) hold constants pointing to the right texture atlas coordinates. And each time that block's faces need to be drawn, I just put up a switch based on which face it is, and pass the correct atlas coordinates for each visible face.

#9 Drakken255   Members   -  Reputation: 173

Posted 09 August 2012 - 10:23 AM

Ok. I just implemented the per-chunk instancing pattern without frustum culling (couldn't get the intersection check right), and found I can render 16 chunks at 25-30 FPS. I checked the numbers: according to the current visible face counter, there are 42,249 faces being drawn. Each face has an instance with the following data: the transform, a base texture coord, an overlay texture coord, a "break" overlay texture coord, and a color. I'll explain the necessity of the overlays and color in a bit. The size of the instance has been calculated as 4 bytes (32 bits) per float * 26 floats = 104 bytes. So 42,249 instances * 104 bytes per instance = 4,393,896 bytes. Through conversion, this totals out to 4.19 MB of data sent per frame. Is this an acceptable number? Also, I am combining the instance buffer by using List.AddRange all the way until it's ready for the GPU, where I use the ToArray method. Is this faster than manually appending arrays?

To answer the imminent question: I need an overlay texture coord because some blocks in minecraft rely on overlays and coloring to smooth the land's look. That is also why I have the tint color: to be able to change the grass color. I need the break overlay coord to allow for (duh) breaking graphics to render within the shader, where it's easiest (and likely quickest) to modify individual pixels.

Edited by Drakken255, 09 August 2012 - 10:35 AM.


#10 Waterlimon   Crossbones+   -  Reputation: 2601

Posted 09 August 2012 - 11:21 AM

I have a faint memory that minecraft would split each chunk into a vertical bar of 16^3 sized chunks for rendering...

Just draw each chunk separately (and depth sort them, maybe that will let you draw moar interesting pixels...)

o3o


#11 Drakken255   Members   -  Reputation: 173

Posted 11 August 2012 - 04:59 AM

I just don't get it... I've created a fairly efficient way of rendering thousands of faces, and even without updating the instance buffer, the framerate is still incredibly erratic! What really bugs me is that Java is a slower language than C#, yet Java can render more faster... Is it possible that Java using OpenGL is faster than XNA is at rendering?

EDIT: Rendering is not the problem. Building and appending all the lists in the "is-face-visible" calculation is what is taking up the FPS. Let me try running one loop to form an initial count, then a second loop to add them all in the array. See if converting to arrays makes it faster.

Edited by Drakken255, 11 August 2012 - 05:38 AM.


#12 kalle_h   Members   -  Reputation: 1476

Posted 11 August 2012 - 09:16 AM

I just don't get it... I've created a fairly efficient way of rendering thousands of faces, and even without updating the instance buffer, the framerate is still incredibly erratic! What really bugs me is that Java is a slower language than C#, yet Java can render more faster... Is it possible that Java using OpenGL is faster than XNA is at rendering?


Java and C# are about equally fast. Some benchmarks give Java a slight edge and some give it to C#. If you have better data, you can point me to it. Minecraft uses LWJGL, which is a straight binding to OpenGL. There is no overhead other than a small JNI cost. XNA, on the other hand, has a lot more stuff in between the GPU and your code.

But ultimately this has nothing to do with language. It's all about data structures.
Only render what you have to, with a minimal amount of data. One example is that you waste a lot of data using floats as vertex colors. Unsigned bytes are enough. This saves 3*4 bytes per vertex.
Also, if you don't have to support many textures, you can replace the UV coordinates with an unsigned char that you use as an index. Then use that index with a uniform vec2 array to get the right texture coordinates. This will work for 256 unique texture coords; if that is not enough there is always unsigned short, but you have to remember that uniform buffer size is limited to some GPU-dependent value.
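
A rough CPU-side sketch of the lookup table behind that index idea, assuming a square texture atlas laid out as a grid of equally sized tiles (the post does not specify the layout, and `AtlasTable` is a hypothetical name). The instance would then carry a single byte per texture, and the shader would read the matching pair from this table, uploaded as a uniform array.

```csharp
public static class AtlasTable
{
    // Builds a flat (u, v) table holding the top-left corner of every tile.
    // Tile i sits at column i % tilesPerRow and row i / tilesPerRow.
    public static float[] Build(int tilesPerRow)
    {
        int tileCount = tilesPerRow * tilesPerRow;
        float[] uv = new float[tileCount * 2];
        float step = 1f / tilesPerRow; // tile size in texture space
        for (int i = 0; i < tileCount; i++)
        {
            uv[i * 2] = (i % tilesPerRow) * step;     // u
            uv[i * 2 + 1] = (i / tilesPerRow) * step; // v
        }
        return uv;
    }
}
```

With `Build(16)` the table has 256 entries, which is exactly what a single unsigned byte index can address.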

Edited by kalle_h, 11 August 2012 - 09:16 AM.


#13 Drakken255   Members   -  Reputation: 173

Posted 11 August 2012 - 10:01 AM

Holy crap thanks, kalle-h! I'll have to look at what you're saying tomorrow, though. It's bedtime in Korea. Also, I might need slight help implementing these data changes you propose. Also, I wrote a better instance list/array combiner + frustum culling and distance culling (which is variable), and got up to 8 x 8 = 64 chunks with a decent framerate! The next wall I ran into was taking it up to 16 x 16 = 256 chunks and hit the OutOfMemoryException occasionally while flying amongst the landscape. Again, thanks for the brilliant idea. I'll be back for the help tomorrow.

#14 phil_t   Crossbones+   -  Reputation: 3949

Posted 11 August 2012 - 01:27 PM

It sounds like your instance vertex format is something like this?

transform 16*4 -> 64 bytes
texcoord 2*4 -> 8 bytes
texcoord 2*4 -> 8 bytes
texcoord 2*4 -> 8 bytes
color 4*4 -> 16 bytes

As kalle_h mentioned, you should be able to reduce the color to 4 bytes. You can also use lower precision values for the texcoords. Using HalfVector2 for the texcoords will cut their size in half (or use the index method kalle_h mentioned). These are simple changes to the vertex format; you don't need to change the shader.

For the transform, it sounds like you're passing a whole matrix? You actually only need to pass some of the matrix elements, and you can "reconstruct" the matrix in the shader. Certainly you could cut this down to 12 floats. If you only need translation, then you could cut it down to 3 floats. If you also need a uniform scale, that's only 1 more float. Rotation? Probably 4 more.

So, conservatively, you would have:
transform 12*4 -> 48 bytes
texcoord 2*2 -> 4 bytes
texcoord 2*2 -> 4 bytes
texcoord 2*2 -> 4 bytes
color 4*1 -> 4 bytes
TOTAL: 64 bytes

More aggressively, say you only need translation for your transform:
transform 3*4 -> 12 bytes
texcoord 2*2 -> 4 bytes
texcoord 2*2 -> 4 bytes
texcoord 2*2 -> 4 bytes
color 4*1 -> 4 bytes
TOTAL: 28 bytes
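
As a sanity check on the 28-byte figure, here is what that aggressive layout might look like as a plain C# struct. This is only a sketch: the field names are made up, and the `ushort` pairs stand in for 16-bit half-float texcoords (in XNA, HalfVector2 would presumably take their place).

```csharp
using System.Runtime.InteropServices;

// Hypothetical packed instance layout, translation-only transform.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct PackedInstance
{
    public float X, Y, Z;             // translation: 3 * 4 = 12 bytes
    public ushort BaseU, BaseV;       // base texcoord as half-float bits: 4 bytes
    public ushort OverlayU, OverlayV; // overlay texcoord: 4 bytes
    public ushort BreakU, BreakV;     // "break" overlay texcoord: 4 bytes
    public byte R, G, B, A;           // tint color: 4 bytes
}

public static class InstanceBudget
{
    // Bytes needed for one frame's worth of packed instances.
    public static long FrameBytes(int instanceCount)
    {
        return (long)instanceCount * Marshal.SizeOf(typeof(PackedInstance));
    }
}
```

For the 42,249 faces mentioned earlier in the thread, `InstanceBudget.FrameBytes(42249)` comes to 1,182,972 bytes, about 1.13 MB instead of 4.19 MB.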

Edited by phil_t, 11 August 2012 - 01:29 PM.


#15 Drakken255   Members   -  Reputation: 173

Posted 12 August 2012 - 04:55 AM

Ouch! Initializing chunks as they become near really hurts the framerate! There's just too much for the CPU to do! I can't keep all the chunks loaded in memory, and it takes too long to load them even with threading. I think it is safe to say that despite the optimizations I make, XNA Minecraft will never see the light of day... I did come up with an interesting idea though, which barely requires 6 x 6 chunks for effective gameplay... TankCraft! Imagine Pocket Tanks in a 3D Minecraft-esque world, with Minecraft textures, sounds, and maybe even some Minecraft-like weapons at your disposal...

#16 kalle_h   Members   -  Reputation: 1476

Posted 12 August 2012 - 12:38 PM

Ouch! Initializing chunks as they become near really hurts the framerate! There's just too much for the CPU to do! I can't keep all the chunks loaded in memory, and it takes too long to load them even with threading. I think it is safe to say that despite the optimizations I make, XNA Minecraft will never see the light of day... I did come up with an interesting idea though, which barely requires 6 x 6 chunks for effective gameplay... TankCraft! Imagine Pocket Tanks in a 3D Minecraft-esque world, with Minecraft textures, sounds, and maybe even some Minecraft-like weapons at your disposal...

Don't blame the technology. XNA is well suited for Minecraft clones and for much more demanding projects. A quick Google search gave me this link, http://techcraft.codeplex.com/, which shows really high-quality Minecraft-style rendering technology. Just spend more time learning how to get it running. In the end you might learn some generally useful things about algorithms and bandwidth optimizations.

#17 Drakken255   Members   -  Reputation: 173

Posted 12 August 2012 - 06:00 PM

By the Gods... How did they manage such lighting!?!? The main problem I am running into now is gathering all the instances into one array for drawing. It is taking too much time just to get it all sorted out. Here's my current array builder:

public void Update(GameTime gameTime, BoundingFrustum viewFrustum)
{
	// First pass: count the instances of all visible chunks.
	int count = 0;
	for (int i = 0; i < world.LandscapeSizeX; i++)
	{
		for (int j = 0; j < world.LandscapeSizeY; j++)
		{
			if (world.Landscape[i, j] != null)
			{
				if (viewFrustum.Intersects(world.Landscape[i, j].Bounds) && world.Landscape[i, j].IsDrawing)
				{
					count += world.Landscape[i, j].Instances.Count;
				}
			}
		}
	}
	// Second pass: allocate the exact size and copy each visible chunk in.
	Instances = new InstanceInfo[count];
	count = 0;
	for (int i = 0; i < world.LandscapeSizeX; i++)
	{
		for (int j = 0; j < world.LandscapeSizeY; j++)
		{
			if (world.Landscape[i, j] != null)
			{
				if (viewFrustum.Intersects(world.Landscape[i, j].Bounds) && world.Landscape[i, j].IsDrawing)
				{
					// Note: ToArray() allocates a temporary copy before CopyTo runs.
					world.Landscape[i, j].Instances.ToArray().CopyTo(Instances, count);
					count += world.Landscape[i, j].Instances.Count;
				}
			}
		}
	}
}

You may ask why I perform two of the same loop. I do this because I found it is cheaper to add to arrays than it is to lists, so I need an initial count to have an array of the correct size ready.

EDIT: New question: Why is it that when I start a new thread to reinitialize the chunks coming into range, the main thread is slowed? Admittedly there are likely about 20 chunks at the most going through this in any given frame when I am moving around. Should I make a load queue for the thread to work on instead of starting individual threads?

Edited by Drakken255, 12 August 2012 - 06:23 PM.


#18 phil_t   Crossbones+   -  Reputation: 3949

Posted 12 August 2012 - 07:54 PM

Starting new threads is expensive! You definitely should not be doing that every frame. Just keep a dedicated thread around for initializing chunks.

For the two loops... do you have a good idea of what a typical maximum "count" would be? If so, just use a List<InstanceInfo> that has its capacity preset. That will avoid re-allocations as you're adding items, and should make it almost as fast as an array. In the case where you go "over" the count, the re-allocation will happen and you'll take a perf hit, but it's transparent to you, and if you choose a good maximum it should happen rarely.

You could also keep this List as a member variable and just .Clear it each time in your Update method.
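
Something along these lines, as a sketch only: `InstanceCollector<T>` is a hypothetical wrapper, with `T` standing in for InstanceInfo and `expectedMax` being whatever typical maximum you settle on.

```csharp
using System.Collections.Generic;

public sealed class InstanceCollector<T>
{
    // Kept as a member so its backing storage survives between frames.
    private readonly List<T> visible;

    public InstanceCollector(int expectedMax)
    {
        visible = new List<T>(expectedMax); // capacity preset once
    }

    // Call once per Update with the instance lists of the visible chunks.
    public List<T> Collect(IEnumerable<List<T>> visibleChunks)
    {
        visible.Clear(); // resets Count but keeps the allocated capacity
        foreach (List<T> chunk in visibleChunks)
        {
            visible.AddRange(chunk); // re-allocates only if capacity is exceeded
        }
        return visible;
    }
}
```

This also removes the need for the separate counting loop, since the list tracks its own count.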

#19 Drakken255   Members   -  Reputation: 173

Posted 13 August 2012 - 04:49 AM

Starting new threads is expensive! You definitely should not be doing that every frame.


Lol thanks for the tip. Given my landscape is an [,] array of chunks, how do I pass the new coords pointing to a chunk to an existing and maybe running thread? Or would it be best to compile a list of chunks needing an init, and let the loader thread check it? Regardless of which is best, I will still need a little help getting threads to work right.

EDIT: Ok. So I set up a Queue and a worker thread that loads chunks, and it works beautifully! Barely ANY frame drops! Now to decrease the instance data size... kalle_h, phil_t, either one of you have Skype?
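
For reference, a queue plus a single dedicated worker thread can be sketched like this. It is not the exact code from my project: `ChunkLoader` is an illustrative name, each job is just an `Action` that initializes one chunk, and it assumes `BlockingCollection<T>` from .NET 4 is available on the target platform.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class ChunkLoader : IDisposable
{
    private readonly BlockingCollection<Action> jobs = new BlockingCollection<Action>();
    private readonly Thread worker;

    public ChunkLoader()
    {
        // One long-lived thread instead of a new thread per chunk.
        worker = new Thread(Run);
        worker.IsBackground = true;
        worker.Name = "ChunkLoader";
        worker.Start();
    }

    // Called from the game thread when a chunk comes into range.
    public void Enqueue(Action loadChunk)
    {
        jobs.Add(loadChunk);
    }

    private void Run()
    {
        // Blocks while the queue is empty; ends after CompleteAdding().
        foreach (Action job in jobs.GetConsumingEnumerable())
        {
            job();
        }
    }

    public void Dispose()
    {
        jobs.CompleteAdding(); // let the worker drain the queue and exit
        worker.Join();
        jobs.Dispose();
    }
}
```

Jobs run one at a time in the order they were queued, so the main thread only pays for the `Enqueue` call.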

Edited by Drakken255, 14 August 2012 - 08:25 AM.




