Sprite batching performance

Started by Totologic February 24, 2015 04:42 AM

39 comments, last by Totologic 9 years, 1 month ago

Totologic

225

Author

February 24, 2015 04:42 AM

Hi.

I am working on a sprite batcher in OpenGL. I use OpenGL 3.1 and GLSL #version 110.

My batcher has the following features:

- it works with 1 single atlas texture

- no persistence, meaning you push your whole list of sprites every frame

- each sprite can have its own UV on the atlas and its own 3x3 transformation matrix

- a camera with its own 4x4 matrix

The "no persistence" feature is a choice I made to keep the API very simple. There is no need to "add" or "remove" sprites; you just ask to render what you need each frame, independently.
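For readers unfamiliar with the pattern, here is a minimal sketch of what a "no persistence" batch can look like on the CPU side. It is not the actual Toto2D code: `SpriteBatch`, `Vertex` and the method names are hypothetical, and the per-sprite matrix is left out to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One vertex of a sprite quad: position plus atlas UV (hypothetical layout).
struct Vertex {
    float x, y, u, v;
};

class SpriteBatch {
public:
    // Called at the start of every frame. clear() keeps the allocated
    // capacity, so steady-state frames do not reallocate.
    void reset() { vertices_.clear(); }

    // Appends one axis-aligned sprite as two triangles (6 vertices).
    void addSprite(float x, float y, float w, float h,
                   float u0, float v0, float u1, float v1) {
        const Vertex tl{x, y, u0, v0};
        const Vertex tr{x + w, y, u1, v0};
        const Vertex bl{x, y + h, u0, v1};
        const Vertex br{x + w, y + h, u1, v1};
        const Vertex quad[6] = {tl, bl, tr, tr, bl, br};
        vertices_.insert(vertices_.end(), quad, quad + 6);
    }

    std::size_t vertexCount() const { return vertices_.size(); }
    const Vertex* data() const { return vertices_.data(); }

private:
    std::vector<Vertex> vertices_;
};
```

In a real renderer the array returned by `data()` would be uploaded once per frame (for example with glBufferData) and drawn with a single glDrawArrays call, which is where the batching gain comes from.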

Today the batcher can reach 800 sprites at a stable 60FPS in an 800x600 window, running on my laptop:

ASUS i7-267M CPU @ 1.80GHz 1.80GHz

RAM 4 GB

GPU Intel(R) HD Graphics Family with 116 ext.

Is this good performance?

Note: I plan to add a new method in the batcher in order to render more efficiently large tiled backgrounds.

My blog about games, AI...

http://totologic.blogspot.com/

JoshuaWaring

1,359

February 24, 2015 06:02 AM

Could I please have an executable or the source code to look at? :)

Totologic

225

Author

February 24, 2015 07:31 AM

Not yet.

I just want to know whether anyone has achieved more.

21st Century Moose

13,459

February 24, 2015 07:58 AM

It's hard to say based on the information you give.

A big factor here will be the size of your sprites, and whether or not any of them overlap.

If you're drawing 800 quite small sprites with little or no overlap (e.g. small sparks in a particle system), that's quite poor performance, even for an Intel. I'm guessing it's an HD 3000, so while it's not a good Intel, it's not a really bad one either, and you should be expecting close to 1000 fps with that kind of scene.

On the other hand if your sprites are large and with lots of overlap, in other words you're getting a lot of blending and hitting the framebuffer a lot, that's not too shabby at all.

The theoretical worst case is 800 full-screen blended quads, and I don't think an Intel can do that at any kind of reasonable framerate.

If you can't provide the code then at least you should provide a screenshot, so that we can see what kind of scene you're getting this performance with and make a judgement based on that. It will of course be a rough judgement (there are so many other places that your performance could drop off or could be optimal).

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

Totologic

225

Author

February 24, 2015 08:11 AM

OK I understand.

I am preparing something.

Totologic

225

Author

February 24, 2015 09:42 AM

My atlas is 1024x1024:
[sharedmedia=gallery:images:6159]

A capture of the rendering in an 800x400 window at 60FPS in Debug (100FPS in Release):
[sharedmedia=gallery:images:6158]

The full pink color in the atlas is considered as transparent by the fragment shader.

Here is my main:


Toto2D toto2d; // my batcher
GLFWwindow* window = toto2d.init(800, 400, "atlas.tga"); // create the window

glm::mat3 matTranslate1 = glm::mat3(1.0f);
glm::mat3 matScale = glm::mat3(1.0f);
glm::mat3 matRotation = glm::mat3(1.0f);
glm::mat3 matTranslate2 = glm::mat3(1.0f);
glm::mat3 matTransf = glm::mat3(1.0f);

int i;
int j;
float t = 0.0f;
float k;

while (!glfwWindowShouldClose(window))
{
	toto2d.reset(); // empty batch

	k = 0.0f;
	for (i=0 ; i<40 ; i++)
	{
		for (j=0 ; j<20 ; j++)
		{
			matTranslate1[2][0] = -100.0f;
			matTranslate1[2][1] = -100.0f;
			matScale[0][0] = 1.0f+sin(t+k)*0.5f;
			matScale[1][1] = 1.0f+sin(t+k)*0.5f;
			matRotation[0][0] = cos(t);
			matRotation[1][0] = -sin(t);
			matRotation[0][1] = sin(t);
			matRotation[1][1] = cos(t);
			matTranslate2[2][0] = i*30.0f;
			matTranslate2[2][1] = j*30.0f;

			matTransf = matTranslate2 * matRotation * matScale * matTranslate1; // costly line: three mat3 products per sprite

			// add a sprite to render list,
			// addSpriteMatrix(int textureLeft, int textureTop, int textureWidth, int textureHeight, glm::mat3x3 &transform);
			toto2d.addSpriteMatrix(200*(i%5), 200*(j%5), 200, 200, matTransf);

			k += 0.005f;
		}
	}

	t += 0.01f;

	toto2d.setCameraLookAt(600.0f, 300.0f, 0.25f, 0.7f);
	toto2d.render(); // render batch

	glfwPollEvents();
}

Totologic

225

Author

February 24, 2015 09:46 AM

Doing this 800 times each frame is expensive:

matTransf = matTranslate2 * matRotation * matScale * matTranslate1;

I reach 100FPS (Debug) when bypassing that (and 120FPS in Release).

I am using GLM. Is there any way to concatenate matrices more efficiently?

21st Century Moose

13,459

February 24, 2015 03:57 PM

You can optimize some of this by precalculating a lot of that stuff. sin(t) and cos(t) only need to be calculated once rather than 1600 times each, for example. For matTranslate1 and matTranslate2 you should be able to bake them into the sprite coords rather than having them as separate matrices. That will help some.
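As a rough sketch of the baking idea, assuming a uniform scale and the same translate, scale, rotate, translate order as the loop above, the three matrix products collapse into a handful of multiplications per sprite. `Mat3`, `composeSprite` and `composeByMultiply` are hypothetical names, not GLM API; the struct only imitates glm::mat3's column-major m[column][row] indexing so that the example stands alone.

```cpp
#include <cassert>
#include <cmath>

// Minimal stand-in for glm::mat3: column-major, indexed m[column][row].
struct Mat3 {
    float m[3][3];
};

// General 3x3 product r = a * b, used only as the reference path.
Mat3 multiply(const Mat3& a, const Mat3& b) {
    Mat3 r{};  // zero-initialized accumulator
    for (int col = 0; col < 3; ++col)
        for (int row = 0; row < 3; ++row)
            for (int k = 0; k < 3; ++k)
                r.m[col][row] += a.m[k][row] * b.m[col][k];
    return r;
}

// Slow path: translate(pos) * rotate * scale * translate(pivot),
// built from four matrices and three full products.
Mat3 composeByMultiply(float c, float s, float scale,
                       float pivotX, float pivotY, float posX, float posY) {
    const Mat3 t1{{{1.0f, 0.0f, 0.0f}, {0.0f, 1.0f, 0.0f}, {pivotX, pivotY, 1.0f}}};
    const Mat3 sc{{{scale, 0.0f, 0.0f}, {0.0f, scale, 0.0f}, {0.0f, 0.0f, 1.0f}}};
    const Mat3 rot{{{c, s, 0.0f}, {-s, c, 0.0f}, {0.0f, 0.0f, 1.0f}}};
    const Mat3 t2{{{1.0f, 0.0f, 0.0f}, {0.0f, 1.0f, 0.0f}, {posX, posY, 1.0f}}};
    return multiply(multiply(multiply(t2, rot), sc), t1);
}

// Fast path: the same matrix written out in closed form.
// c and s are cos(t) and sin(t), computed once per frame by the caller.
Mat3 composeSprite(float c, float s, float scale,
                   float pivotX, float pivotY, float posX, float posY) {
    const float a = c * scale;  // cos * scale
    const float b = s * scale;  // sin * scale
    return Mat3{{{a, b, 0.0f},
                 {-b, a, 0.0f},
                 {a * pivotX - b * pivotY + posX,
                  b * pivotX + a * pivotY + posY, 1.0f}}};
}
```

With this shape, the inner loop would call something like `composeSprite(cosT, sinT, scale, -100.0f, -100.0f, i * 30.0f, j * 30.0f)` and never touch the four intermediate matrices.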

See also http://www.ies-math.com/math/java/trig/kahote/kahote.html for precalculating sin(t+k).
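The sin(t+k) hint presumably refers to the angle-addition identities, sin(a+b) = sin a cos b + cos a sin b and cos(a+b) = cos a cos b - sin a sin b. A minimal sketch, assuming k advances by a fixed step per sprite as in the loop above: take sin and cos of t and of the step once, then step from one sprite to the next with four multiplications. `sineRamp` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Returns sin(t + n * step) for n = 0 .. count-1.
// Only four trig calls are made in total, regardless of count.
std::vector<float> sineRamp(float t, float step, int count) {
    std::vector<float> out;
    out.reserve(count > 0 ? static_cast<std::size_t>(count) : 0);

    float s = std::sin(t);            // sin of the current angle
    float c = std::cos(t);            // cos of the current angle
    const float sd = std::sin(step);  // sin of the fixed increment
    const float cd = std::cos(step);  // cos of the fixed increment

    for (int n = 0; n < count; ++n) {
        out.push_back(s);
        // Advance the angle by `step` using the addition identities.
        const float sNext = s * cd + c * sd;
        const float cNext = c * cd - s * sd;
        s = sNext;
        c = cNext;
    }
    return out;
}
```

Rounding error accumulates slowly along the recurrence, so restarting from a fresh sin(t) and cos(t) each frame, as this function does, keeps the drift small for a batch of a few hundred sprites.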

swiftcoder

18,997

February 24, 2015 05:55 PM

Totologic wrote: "The full pink color in the atlas is considered as transparent by the fragment shader."

Are you making it transparent by enabling alpha test and writing out an alpha of zero, or by using 'discard' in the shader for transparent pixels?

The former is liable to perform orders of magnitude better than the latter.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

21st Century Moose

13,459

February 24, 2015 07:16 PM

Where I'm coming from is that if there's an observable difference of this magnitude between debug and release builds, and assuming that nothing else (physics, AI, sound, etc) is going on, then the OP is most likely CPU-bound. All other things being equal, code running on the GPU shouldn't demonstrate this kind of difference between debug and release builds, so the first line of attack should be to equalize things a bit better, and having done that, then look to optimize the GPU side of things.
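One way to check the CPU-bound theory is to time the sprite submission loop on its own, separately from the render call. The sketch below assumes nothing about the batcher; `elapsedMs` is a hypothetical helper built on std::chrono.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` once and returns the wall-clock time it took, in milliseconds.
template <typename Fn>
double elapsedMs(Fn&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping the two nested `for` loops from the main in a lambda and passing it to `elapsedMs` gives the per-frame CPU cost of building the batch. Timing the render call this way is less reliable, because OpenGL commands are queued and may finish on the GPU after the call returns.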

This topic is closed to new replies.
