Face Instancing: Dividing Draw Calls?

Started by
17 comments, last by Drakken255 11 years, 8 months ago
I just don't get it... I've created a fairly efficient way of rendering thousands of faces, and even without updating the instance buffer, the framerate is still incredibly erratic! What really bugs me is that Java is a slower language than C#, yet Java can render more, and faster... Is it possible that Java using OpenGL is faster at rendering than XNA?

EDIT: Rendering is not the problem. Building and appending all the lists in the "is-face-visible" calculation is what is eating the FPS. Let me try running one loop to form an initial count, then a second loop to add them all into the array. I'll see if converting to arrays makes it faster.

Java and C# are about equally fast. Some benchmarks give Java a slight edge and some give it to C#. If you have better data, you can point me to it. Minecraft uses LWJGL, which is a straight binding to OpenGL, so there is no overhead other than a small JNI cost. XNA, on the other hand, has a lot more sitting between your code and the GPU.

But ultimately this has nothing to do with language. It's all about data structures.
Only render what you have to, with the minimal amount of data. For example, you waste a lot of data using floats for vertex colors. An unsigned byte per channel is enough. This saves 3*4 bytes per vertex.
Also, if you don't have to support many textures, you can replace the UV coordinates with an unsigned char that you use as an index. Then use that index into a uniform vec2 array to get the right texture coordinates. This works for 256 unique texture coords; if that is not enough there is always unsigned short, but remember that uniform buffer size is limited to some GPU-dependent value.
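As a rough illustration of the two suggestions above, here is a minimal sketch in plain C# (no XNA types). The names `VertexPacking`, `PackColor` and `TileOrigin` are hypothetical, and the square texture atlas is an assumption made only for the example. The CPU-side lookup mirrors what the shader would do with a uniform vec2 array. Note that XNA's own `Color` type already stores a packed 32-bit value, so in practice it may be enough to use that instead of a float vector.

```csharp
using System;

// Hypothetical helpers for illustration only; not part of XNA.
public static class VertexPacking
{
    // Converts one [0, 1] float channel to a byte, clamping out-of-range input.
    public static byte ToByte(float channel)
    {
        float clamped = Math.Max(0f, Math.Min(1f, channel));
        return (byte)(clamped * 255f + 0.5f);
    }

    // Packs RGBA into one 32-bit value (R in the low byte):
    // 4 bytes per color instead of 16 bytes for four floats.
    public static uint PackColor(float r, float g, float b, float a)
    {
        return (uint)ToByte(r)
             | ((uint)ToByte(g) << 8)
             | ((uint)ToByte(b) << 16)
             | ((uint)ToByte(a) << 24);
    }

    // Maps a one-byte tile index to the UV origin of that tile,
    // assuming a square atlas with tilesPerRow tiles on each side.
    public static void TileOrigin(byte index, int tilesPerRow, out float u, out float v)
    {
        float tileSize = 1f / tilesPerRow;
        u = (index % tilesPerRow) * tileSize;
        v = (index / tilesPerRow) * tileSize;
    }
}
```

With a 16 x 16 atlas, a single byte covers all 256 tiles, which is where the 256 limit mentioned above comes from.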
Holy crap thanks, kalle_h! I'll have to look at what you're saying tomorrow, though. It's bedtime in Korea. Also, I might need a little help implementing the data changes you propose. Also, I wrote a better instance list/array combiner plus frustum culling and distance culling (which is variable), and got up to 8 x 8 = 64 chunks with a decent framerate! The next wall I ran into was taking it up to 16 x 16 = 256 chunks, where I occasionally hit an OutOfMemoryException while flying around the landscape. Again, thanks for the brilliant idea. I'll be back for the help tomorrow.
It sounds like your instance vertex format is something like this?

transform 16*4 -> 64 bytes
texcoord 2*4 -> 8 bytes
texcoord 2*4 -> 8 bytes
texcoord 2*4 -> 8 bytes
color 4*4 -> 16 bytes

As kalle_h mentioned, you should be able to reduce the color to 4 bytes. You can also use lower precision values for the texcoords. Using HalfVector2 for the texcoords will cut their size in half (or use the index method kalle_h mentioned). These are simple changes to the vertex format; you don't need to change the shader.

For the transform, it sounds like you're passing a whole matrix? You actually only need to pass some of the matrix elements, and you can "reconstruct" the matrix in the shader. Certainly you could cut this down to 12 floats. If you only need translation, then you could cut it down to 3 floats. If you also need a uniform scale, that's only 1 more float. Rotation? Probably 4 more.

So, conservatively, you'd have:
transform 12*4 -> 48 bytes
texcoord 2*2 -> 4 bytes
texcoord 2*2 -> 4 bytes
texcoord 2*2 -> 4 bytes
color 4*1 -> 4 bytes
TOTAL: 64 bytes

More aggressively, say you only need translation for your transform:
transform 3*4 -> 12 bytes
texcoord 2*2 -> 4 bytes
texcoord 2*2 -> 4 bytes
texcoord 2*2 -> 4 bytes
color 4*1 -> 4 bytes
TOTAL: 28 bytes
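To double-check the arithmetic in the layouts above, here is a small sketch in plain C#. The struct name `CompactInstance` and its field names are hypothetical, and `ushort` pairs stand in for the 16-bit halves that HalfVector2 would store. It only measures sizes and does not declare an XNA vertex format.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical translation-only instance layout, for size checking only.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct CompactInstance
{
    public float X, Y, Z;   // translation: 3 * 4 = 12 bytes
    public ushort U0, V0;   // texcoord as two 16-bit values: 4 bytes
    public ushort U1, V1;   // 4 bytes
    public ushort U2, V2;   // 4 bytes
    public uint Color;      // packed RGBA: 4 bytes
}

public static class InstanceLayout
{
    // Size of the first layout listed above: a 4x4 float matrix,
    // three float2 texcoords and a float4 color.
    public const int OriginalSize = 16 * 4 + 3 * (2 * 4) + 4 * 4;

    // Size of the compact layout as the runtime lays it out.
    public static int CompactSize()
    {
        return Marshal.SizeOf(typeof(CompactInstance));
    }
}
```

Here `InstanceLayout.OriginalSize` is 104 and `InstanceLayout.CompactSize()` returns 28, so the translation-only layout needs a little over a quarter of the instance data per face.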
Ouch! Initializing chunks as they become near really hurts the framerate! There's just too much for the CPU to do! I can't keep all the chunks loaded in memory, and it takes too long to load them even with threading. I think it is safe to say that despite the optimizations I make, XNA Minecraft will never see the light of day... I did come up with an interesting idea though, which barely requires 6 x 6 chunks for effective gameplay... TankCraft! Imagine Pocket Tanks in a 3D Minecraft-esque world, with Minecraft textures, sounds, and maybe even some Minecraft-like weapons at your disposal...

Don't blame the technology. XNA is well suited for Minecraft clones and a lot more. A quick Google search gave me this link, http://techcraft.codeplex.com/, which shows really high-quality Minecraft-style rendering technology. Just spend more time learning how to get it running. In the end you might learn something generally useful about algorithms and bandwidth optimizations.
By the Gods... How did they manage such lighting!?!? The main problem I am running into now is gathering all the instances into one array for drawing. It is taking too much time just to get it all sorted out. Here's my current array builder:


public void Update(GameTime gameTime, BoundingFrustum viewFrustum)
{
    // First pass: count the instances in every visible chunk
    // so the array can be allocated once at the right size.
    int count = 0;
    for (int i = 0; i < world.LandscapeSizeX; i++)
    {
        for (int j = 0; j < world.LandscapeSizeY; j++)
        {
            if (world.Landscape[i, j] != null)
            {
                if (viewFrustum.Intersects(world.Landscape[i, j].Bounds) && world.Landscape[i, j].IsDrawing)
                {
                    count += world.Landscape[i, j].Instances.Count;
                }
            }
        }
    }

    // Second pass: copy each visible chunk's instances into the combined array.
    Instances = new InstanceInfo[count];
    count = 0;
    for (int i = 0; i < world.LandscapeSizeX; i++)
    {
        for (int j = 0; j < world.LandscapeSizeY; j++)
        {
            if (world.Landscape[i, j] != null)
            {
                if (viewFrustum.Intersects(world.Landscape[i, j].Bounds) && world.Landscape[i, j].IsDrawing)
                {
                    // Note: ToArray() allocates a temporary copy on every call. If the chunk's
                    // Instances is a List<T>, its CopyTo(array, index) would avoid that.
                    world.Landscape[i, j].Instances.ToArray().CopyTo(Instances, count);
                    count += world.Landscape[i, j].Instances.Count;
                }
            }
        }
    }
}


You may ask why I perform two of the same loop. I do this because I found it is cheaper to add to arrays than it is to lists, so I need an initial count to have an array of the correct size ready.

EDIT: New question: Why is it that when I start a new thread to reinitialize the chunks coming into range, the main thread is slowed? Admittedly there are likely about 20 chunks at the most going through this in any given frame when I am moving around. Should I make a load queue for the thread to work on instead of starting individual threads?
Starting new threads is expensive! You definitely should not be doing that every frame. Just keep a dedicated thread around for initializing chunks.

For the two loops... do you have a good idea of what a typical maximum "count" would be? If so, just use a List<InstanceInfo> that has its capacity preset. That will avoid re-allocations as you're adding items, and should make it almost as fast as an array. In the case where you go "over" the count, the re-allocation will happen and you'll take a perf hit, but it's transparent to you, and if you choose a good maximum it should happen rarely.

You could also keep this List as a member variable and just .Clear it each time in your Update method.
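The preset-capacity suggestion above might look roughly like this in plain C#. `InstanceCollector` and `Collect` are hypothetical names, and the `InstanceInfo` struct here is only a placeholder for the real one in the thread. The sketch assumes the caller has already done the frustum and distance culling and passes in only the visible chunks' lists.

```csharp
using System.Collections.Generic;

// Placeholder for the thread's InstanceInfo; the real fields are not shown there.
public struct InstanceInfo
{
    public float X, Y, Z;
}

// Hypothetical collector that reuses one list across frames.
public class InstanceCollector
{
    private readonly List<InstanceInfo> visible;

    // expectedMax is a guess at the typical maximum instance count per frame.
    public InstanceCollector(int expectedMax)
    {
        visible = new List<InstanceInfo>(expectedMax);
    }

    public int Capacity
    {
        get { return visible.Capacity; }
    }

    // Gathers the instances of all visible chunks into the reused list.
    public List<InstanceInfo> Collect(IEnumerable<List<InstanceInfo>> visibleChunks)
    {
        visible.Clear(); // resets Count but keeps the allocated capacity
        foreach (List<InstanceInfo> chunk in visibleChunks)
        {
            visible.AddRange(chunk); // grows the list only if capacity is exceeded
        }
        return visible;
    }
}
```

Because `Clear` keeps the backing storage, a frame that stays under the expected maximum causes no new allocation for the combined list, which also reduces garbage collector pressure compared with creating a fresh array every update.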

Lol thanks for the tip. Given my landscape is an [,] array of chunks, how do I pass the new coords pointing to a chunk to an existing and maybe running thread? Or would it be best to compile a list of chunks needing an init, and let the loader thread check it? Regardless of which is best, I will still need a little help getting threads to work right.

EDIT: Ok. So I set up a Queue and a worker thread that loads chunks, and it works beautifully! Barely ANY frame drops! Now to decrease the instance data size... kalle_h, phil_t, either one of you have Skype?
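The thread does not show the actual queue code, so the following is only a minimal sketch of one way a queue with a single dedicated worker could be set up, assuming desktop .NET 4 or later (where `BlockingCollection` is available). `ChunkLoader` and the `initChunk` callback are hypothetical names; the callback stands for whatever rebuilds the chunk at the given landscape coordinates.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical loader: one long-lived thread drains a queue of chunk coordinates.
public sealed class ChunkLoader : IDisposable
{
    private readonly BlockingCollection<Tuple<int, int>> pending =
        new BlockingCollection<Tuple<int, int>>();
    private readonly Thread worker;
    private readonly Action<int, int> initChunk;

    public ChunkLoader(Action<int, int> initChunk)
    {
        this.initChunk = initChunk;
        worker = new Thread(Run) { IsBackground = true, Name = "ChunkLoader" };
        worker.Start(); // started once, not once per chunk or per frame
    }

    // Called from the game thread when a chunk comes into range.
    public void Enqueue(int x, int y)
    {
        pending.Add(Tuple.Create(x, y));
    }

    private void Run()
    {
        // Blocks while the queue is empty; ends after CompleteAdding is called
        // and the remaining items have been processed.
        foreach (Tuple<int, int> coords in pending.GetConsumingEnumerable())
        {
            initChunk(coords.Item1, coords.Item2);
        }
    }

    public void Dispose()
    {
        pending.CompleteAdding();
        worker.Join();
        pending.Dispose();
    }
}
```

The game thread only pays for an `Enqueue` call per chunk, and the worker sleeps when there is nothing to load. Anything the callback touches that the render loop also reads (such as a chunk's instance list) would still need its own synchronization.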

This topic is closed to new replies.
