Sign in to follow this  

Skeletal animation is hella slow

This topic is 765 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

For skeletal animation, I wrote a system where each scene node may have a Bone object attached to it.

My render() function looks like this:

void render()
{
	animationController->updateAnimation(animData, manager.sceneGraph()->rootSceneNode());
	renderer.renderFrame(cam, renderer.shaderProgramCache()->getObject("skeletal animation program"));
}

The updateAnimation() function traverses the scene graph (starting from the root node) with a delta_time variable and a std::map of node names and their transformation time keys. (in code: std::unordered_map<std::string, NodeAnimationData*>)

The update function retrieves some values and then calls then next function, which does all the magic:

void AnimationController::_updateSceneNode(SceneNode *sceneNode, double time, std::unordered_map<std::string, NodeAnimationData*> *nodeAnimations)
{
	if(!sceneNode->animationRunning())
		return;

	if(nodeAnimations->empty())
		return;

	const NodeAnimationData *nodeAnim;
	try {
		nodeAnim = nodeAnimations->at( sceneNode->name() );
	}
	catch(const std::out_of_range&) {
		nodeAnim = nullptr;
	}
	Bone *bone = sceneNode->bone();

	if(nodeAnim && bone)
	{
		double factor;

		//rotation
		const QuatKey *currQuatKey = nullptr, *nextQuatKey = nullptr;
		_getAnimationKeys<QuatKey>(&nodeAnim->quatKeys, time, &currQuatKey, &nextQuatKey);

		factor = (time - currQuatKey->time) / (nextQuatKey->time - currQuatKey->time);
		factor = glm::clamp(factor, 0.0, 1.0);

		glm::quat rotation_quat = glm::slerp(currQuatKey->quat, nextQuatKey->quat, (float)factor);
		glm::mat4 rotation_mat = glm::mat4_cast(rotation_quat);

		//position
		const VectorKey *currPosKey = nullptr, *nextPosKey = nullptr;
		_getAnimationKeys<VectorKey>(&nodeAnim->positionKeys, time, &currPosKey, &nextPosKey);

		factor = (time - currPosKey->time) / (nextPosKey->time - currPosKey->time);
		factor = glm::clamp(factor, 0.0, 1.0);

		glm::vec3 position_vec = currPosKey->vec*(1.0f - (float)factor) + nextPosKey->vec * (float)factor;
		glm::mat4 position_mat = glm::translate(position_vec);

		//set matrices
		bone->setNodeMatrix( (sceneNode->parent()->bone() ? sceneNode->parent()->bone()->nodeMatrix() : glm::mat4(1))
										 * (position_mat * rotation_mat) );
		bone->setFinalMatrix( bone->nodeMatrix() * bone->offsetMatrix() );

		//upload finalMatrix to shader
		std::string str = "Bones[";
		char buf[4] = {0};
		_itoa_s(bone->index(), buf, 10);
		str += buf;
		str += "]";

		_shaderProgram->setUniform(str.c_str(), bone->finalMatrix());

		//pull out that animation node (for speed)
		nodeAnimations->erase(sceneNode->name());
	}

	//recursively call for children
	for(unsigned int i = 0; i < sceneNode->getNumChildren(); i++)
		_updateSceneNode(sceneNode->getChildNode(i), time, nodeAnimations);
}

The function basically inspects every node in the scene and checks if there's a match in the NodeAnimationData map. (IOW if there's an animation for that node's bone). If there is, it calculates all the needed values and uploads the final bone matrix to the Bones[] array in the glsl shader.

I've tried to add some optimizations but this still works slowly with more than two animations running.

 

Can anyone point out anything I could do better?

Edited by Pilpel

Share this post


Link to post
Share on other sites

On the same path as Hodgman, never use string in update but store index.

Your animation update should store an array of final transform matrix and then the renderer will use it to send to the shader bone matrix array.

Storing a linear array of index can also avoid your recursion surely.

Share this post


Link to post
Share on other sites

Seconding all what Hodgman listed, except that I would rephrase his "perhaps you need to iterate through your animation data as the outer loop first ..." as "you should iterate through your animation data as the outer loop first ...". Any sub-system should manage their data themselves. A skeleton animation sub-system, for example, should hold a list of all skeletons that need to be updated. A skeleton gets registered / unregistered with the sub-system when it is instantiated into / removed from the scene. This unburdens the sub-system from iterating the scene tree and touching all those unneeded nodes.

Share this post


Link to post
Share on other sites

How do I store a scene graph in a contiguous array? The whole point of the scene graph is having that parent-child relation.

The only way I can think of is having std::vector<SceneNode*> in SceneGraph, and store its pointer in every SceneNode. Did you mean something else?

 

Don't use std::map.

Don't use std::unordered_map.
Don't use std::string.

Why? Do you mean I shouldn't generally use them, or just in this case?

This is what my AnimationData class looks like:

class AnimationData
{
public:
	AnimationData(const aiAnimation *animation);
	~AnimationData();
	
	const char* name() const { return _name.c_str(); }
	double duration() const { return _duration; }
	double ticksPerSecond() const { return _ticksPerSecond; }
	const std::unordered_map<std::string, NodeAnimationData*>* nodeAnimations() const { return &_nodeAnimations; }

private:
	std::string _name; //animation name ("walk" etc.)
	double _duration;
	double _ticksPerSecond;

	std::unordered_map<std::string, NodeAnimationData*> _nodeAnimations; //node name, NodeAnimationData
};

Is there anything wrong in storing NodeAnimationData* in a map?

 

Perhaps you need to iterate through your animation data as the outer loop first, applying it to your scene nodes in local coordinates. 

You mean iterate through the animation data map, find the corresponding SceneNode for every element, and set its local (animation) matrix?

 

Then afterwarda, iterate through the scene nodes and perform the parent->child coordinate space transforms to get the world-space data.

This part has to be done recursively right?

 

Upload all bones to the shader in one operation (at the end, after they've all been updated). Set uniforms by numbers/locations, not by strings/names.

This would just avoid the getUniformLocation() call in every setUniform() function, and instead look in a table (perhaps a map<string,index>?) for the matching index.

Is this really my problem?

Edited by Pilpel

Share this post


Link to post
Share on other sites
Why?

Because you do this...

//upload finalMatrix to shader
std::string str = "Bones[";
char buf[4] = {0};
_itoa_s(bone->index(), buf, 10);
str += buf;
str += "]";
_shaderProgram->setUniform(str.c_str(), bone->finalMatrix());

.. for every bone. That's killing you.

 

Not only are you doing hundreds/thousands (depending on how many models you have and how many bones per model) of memory allocations and reallocations with copies, you also have thousands of individual calls setting one uniform each. (Also, I am very much inclined to believe that the harmless call _shaderProgram->setUniform(...) does a lookup in a std::map or similar... it does, doesn't it?)

 

What do you think e.g. the two harmless statements str += buf; and str += "]"; result in? They require (at least) two calls to strlen, quite possibly (unless the string implementation over-allocates sufficiently) two reallocations and copies, and creation/destruction of two temporaries.

That's not a lot per se, but it's unnecessary, and it's something that you do many (hundred, thousand) times in a rather time-critical part of your program.

 

Setting lots of individual uniforms one by one not only means adding a lot of GL/validation overhead, but also means having to do one DMA transfer per uniform, unless the driver is able to batch them together. Transfers have a high latency and thus are very expensive. Luckily, drivers do this kind of optimization and are rather good at it, too. But how can drivers do that?

When it uploads one uniform, the driver looks in the command stream whether there are any other uniform changes already queued (before the next draw command only!). If it finds any, it sends them all together in one transfer. Strictly, that's a violation of the API, but it's one that you don't know about since it produces no externally visible difference.

 

Now, if you spend some significant time between setting one uniform and another (because you do a lot of needless string manipulations) then chances are that the driver will always only find one uniform in the queue (or very few), and must choose between doing extra transfers, or doing a bulk transfer with all your uniforms at a later time (presumably just before the next draw command), both of which are less favorable.

Edited by samoth

Share this post


Link to post
Share on other sites
Is this really my problem?

May be, may be not. How much time is consumed here and there can be determined only if you make a runtime analysis. However, the points shown do all contribute to your runtime problem, some more than others.

 

Why? Do you mean I shouldn't generally use them, or just in this case?

You should not use them in dependence on the invocation frequency and available time. You want to write a realtime application and need to expect several hundreds(+) invocations. Although the std containers are not necessarily bad, they are made for general use and, more or less, desktop applications.

 

You mean iterate through the animation data map, find the corresponding SceneNode for every element, and set its local (animation) matrix?

More or less, as long as "find" does not mean a search. As I've written in a post above, let the animation sub-system manage its own data. If you need an access back to the scene node, then let the sub-system have a list of pointers to all animated scene nodes. Iterating the list and accessing the belonging scene node is then a fast operation.

 

This part has to be done recursively right?

Well, not really. If you iterate the tree top-down, then a parent's world transform is already calculated when you visit its children, so that they can rely on its up-to-date state. No recursive calculation necessary. If you follow the mentioned DoD approach for the parent/child relations as well, then order the matrices so that parent world transforms are hit before those of the associated children.

Edited by haegarr

Share this post


Link to post
Share on other sites

Then what do you guys suggest? How can I do all that's needed without std containers? To be more specific:

@samoth

 

 

 

(Also, I am very much inclined to believe that the harmless call _shaderProgram->setUniform(...) does a lookup in a std::map or similar... it does, doesn't it?)

Nope. Do you think I should implement such thing? Here's the current (inefficient, I suppose..) implementation:

GLint ShaderProgram::getUniformLocation(const GLchar *name) const
{
	if(name == NULL) {
		Log::instance() << "uniform name is null :" << name << "\n";
		return -1;
	}

	GLint location = glGetUniformLocation(_id, name);
	if(location == -1)
		Log::instance() << "uniform doesn't exist: " << name << "\n";

	return location;
}


void ShaderProgram::setUniform(const GLchar *name, const glm::mat4 &m, GLboolean transpose=GL_FALSE)
{
	glUniformMatrix4fv(getUniformLocation(name), 1, transpose, &m[0][0]);
} 

 If I will implement such thing, in order to completely avoid "playing with strings" I'm gonna need something like ::setUniformArray(const char *arrayName, unsigned int index, mat4x4 value). and then call will look like setUniformArray("Bones", index, bone->finalMatrix());. No?

But then again I will implement this with a std::map and that's a function that's gonna be called many times per frame, which is bad according to what haegarr says. I'm confused!

 

 

 

More or less, as long as "find" does not mean a search. As I've written in a post above, let the animation sub-system manage its own data. If you need an access back to the scene node, then let the sub-system have a list of pointers to all animated scene nodes. Iterating the list and accessing the belonging scene node is then a fast operation.

I'm sorry but I still don't get this. When you say "animation sub-system" do you mean my AnimationController class?

I implemented it in a way that's every AnimationData and SceneNode* pair have a matching Timer object for their animation.

Can you explain further?

Share this post


Link to post
Share on other sites

Nope. Do you think I should implement such thing? Here's the current (inefficient, I suppose..) implementation:
 

GLint ShaderProgram::getUniformLocation(const GLchar *name) const
{

...

GLint location = glGetUniformLocation(_id, name);
if(location == -1)
Log::instance() << "uniform doesn't exist: " << name << "\n";

return location;
}

This is your lookup in "something like a std::map", done by the GL. No matter how the implementation does the c-string to location translation (presumably with a hash map, but you cannot know for sure), this is a heavyweight thing. Although, admittedly, the function is const qualified, so there is a chance that the compiler might eliminate some calls (if you're lucky... which is better than nothing).
 

Should I implement that?

No. You should not implement such a thing yourself either, you should not at all translate from strings to integers in that way for every bone. No matter what algorithm you choose (linear search, binary tree, hash map, ...) they all suck. Even if someone tells you "but hash maps are O(1)". They may be, but that's irrelevant.

 

You need to consider what happens when you call such a function. Let's assume that the GL implementation uses a hash map. You pass in pointer to a 10-character string. The GL now iterates over every character of that string, calculating a hash. Then it looks into the bucket of its hash map and finds an entry. Now it must compare the string stored at that entry with your string (iterating over it again) to be sure that it's really a hit, not a collision. Now it retrieves the integer and passes it back to you. Add overhead of a function call to that.

 

Now, compare that to

a) reading an integer from a memory location and using that instead

    or

b) not fiddling with single uniforms at all, but throwing a whole block of memory (containing a thousand of them) at the driver in one go

Edited by samoth

Share this post


Link to post
Share on other sites


I'm sorry but I still don't get this. When you say "animation sub-system" do you mean my AnimationController class?
I implemented it in a way that's every AnimationData and SceneNode* pair have a matching Timer object for their animation.
Can you explain further?

With "animation sub-system" I do not name a specific one but a group of collaborating objects, and one of them is suitable to fulfill the discussed task. A class named AnimationController usually does not what I mean, but maybe yours do so.

 

For further explanation, let me begin with the game loop. The game loop is commonly organized as a repeated execution of a defined sequence of tasks. Game time advancing, input collecting, player character control, animation playback, physics simulation, … and finally rendering are typical tasks. From the point of view of the game loop, these things are high level tasks that, for the sake of separation, are usually associated with various sub-systems like input, player control, animation, physics, rendering, and so on. So the game loop has a list of such tasks, and it calls each one by an update(time, delta) or so. This is also true for the animation sub-system; it may look like

    animation->update(time, delta);

meaning "update all animations in the scene". So the routine iterates the running animations and updates each one (and here I would expect an AnimationController, one per animation). Now, this iteration should not go through the scene tree and look out for animated nodes. Instead it should go through an internal list. This list holds nothing but each and every animated node. Iterating it means that every found node is known to be animated. No need to determine this property, no need to skip "inanimate" nodes. Further, the animation sub-system has the opportunity to order the nodes as is most suitable for, so not being dependent on the order in the scene tree.

Share this post


Link to post
Share on other sites

First of all thanks for answering my questions. A few things:

 

1. The thing that hurt me the most was the exception throwing. I read about this and they say that c++ exceptions are indeed terribly slow.

2. After taking out the exception throwing part, I ran a profiler and done some debugging.

I know that the things that got me stuck are:

- setUniform().

- This _getAnimationKeys function template, that runs on the timed keyframes for every node animation to get the correct keyframes. I can see why this ruined my performance over time, as more time passes, the harder it gets to find the correct (late) keyframe indices. A good workaround would be to store that last key index I used?

- The recursive _updateSceneNode() call for each scene node child.

Now this tells me I use the approach haegarr mentioned, which is to save in a linear array every animated scene node I add to the scene, and then iterate over that array..?

Where would I store the time stamps (I use a Timer class object) for every scene node animation?

 

@samoth

For the setUniform() issue, I didn't quite get you.

Can you explain further on how I'm supposed to implement "reading an integer from a memory location and using that instead" without an std::map?

Share this post


Link to post
Share on other sites

Okay so I changed my system a lot. A few more questions:

 

1. My FPS actually raises a lot when I run my (glut) application in fullscreen and not in window mode like I used to. Any idea why?

 

2. My only test model has 12500 faces and 57 bones. Is that too much for a realtime application? Is there a place where I can download other animated models (preferable from real games)?

 

3. Using the test model I described above, my FPS starts to drop from 60 (in fullscreen) when it draws around 13~ animated models in every frame. Is that okay?

My hardware is pretty fine. CPU: Intel Core i5-2500 , 3.3GHz. GPU: NVIDIA GeForce GTX 570.

 

4. After using a profiler, it seems like most of the work goes on matrix multiplications and familiar things. I'm pretty sure that I can't avoid this stuff.

Here are the important pieces of my code with comments. Please tell if you think I'm doing something wrong.

http://pastebin.com/Jiu50nas

Note that I don't use searching, strings, maps, unordered maps in my render routine. I use them only when I setup all the data needed for the render routine so it can work without them.

 

Edit:

Wow I just realized I was building my project in "Debug" configuration the whole time.

I changed it to "Release" and the fps goes below 60 when rendering above 330~ animated models. My questions are still relevant though.

Edited by Pilpel

Share this post


Link to post
Share on other sites

I would prefer this:

 

At constant intervals iterate through all animations and do the required logic to determine which keyframes you are on etc.

Then iterate through each bone and iterate through the small list of only currently playing animations and accumulate the transforms for that bone (if any) and store it with the bone.

Then go through all bones and recursively caculate the parents' matrix and then that bone's matrix using some flag to prevent redudant calculation.

Upload the results per bone to your shader.

 

I suppose you could find a way to even do that in a shader, but I tend to just use C++ to get-er done.

Share this post


Link to post
Share on other sites

Everything you said is pretty much what I'm doing, however "Then go through all bones and recursively caculate the parents' matrix and then that bone's matrix using some flag to prevent redudant calculation."

Please take a look at the _updateFinalMatrices() function I wrote. It does exactly that. Can you tell something that might speed it up more?

Share this post


Link to post
Share on other sites

Do you do the final recursive step in matrices?

Try doing it in quaternions instead in that case and only do the matrix-calculation at the end when setting the uniform. You could also try sending quaternions to the shader and calculate the matrices there, which may be almost free (or may not). Depends on where the GPU bottle-neck is.

Edited by Erik Rufelt

Share this post


Link to post
Share on other sites

Do you do the final recursive step in matrices?

Try doing it in quaternions instead in that case and only do the matrix-calculation at the end when setting the uniform. You could also try sending quaternions to the shader and calculate the matrices there, which may be almost free (or may not). Depends on where the GPU bottle-neck is.

 

You can't really slerp matricies. I said that I calculated the matricies recursively (so that parents move children). The matricies are computed by the rotation quaternions and translations.

Typically most game ready 3d meshes don't have tons of joints/bones so you are really only redoing maybe up to 20x2 matricies once per frame per mesh.

When I compute the rotation and translation I do it only at intervals and not every frame. I compute start and end rotation+translations for every joint/bone and then use the current time to interop between them. For me it seemed that traversing the running animations, handling masking and blending etc was a bit much to do every frame. Heck, I even put that in another thread to itself.

Share this post


Link to post
Share on other sites
You can't really slerp matricies.

You should use NLerp which gives a lot of boost of performance on an animation system, it's used in all big animation system.

Edited by Alundra

Share this post


Link to post
Share on other sites

 

Don't use std::map.
Don't use std::unordered_map.
Don't use std::string.

Why? Do you mean I shouldn't generally use them, or just in this case?

Then what do you guys suggest? How can I do all that's needed without std containers?

 


Others have touched on this already, but just to clarify:

std::strings are fine in most cases. std::unordered_maps are fine in most cases, and fast. In performance-critical loops getting called literally millions of times, they aren't fine or fast enough.
 
In tight loops. because std::unordered_map stores its memory in non-contiguous locations, the CPU has to repeatedly reach for memory that's not in the cache, which causes repeated (but minuscule in isolation) delays. std::vector doesn't have that problem. std::vector stores its memory contiguously.
This means you have to do extra work (both in code, and in runtime speed) to lay out the memory in a more speed-efficient way, but its more than made up for by the data-access costs you'd be avoiding.
 
std::string is fine at high-levels, but in tight-loops, it's too costly. If you are comparing two int-types, it's really fast. But when you are comparing two strings, you are saying:

for(every char in the string)
   if(thisString.char == otherString.char)

Each 'char' is a separate int, so if your string is 10 characters long, in theory you're doing 10 int<->int comparisons instead of one int<->int comparison.
In practice, it's more optimized than this. You can compare 4 or 8 bytes at once by treating them as a single 32 bit or 64bit integer, and it's easier to tell when strings don't match (sizes aren't equal, or the first few letters are wrong, etc...), so you can early-exit more easily).

Still, you are making an entire function-call (again accessing different locations in memory) by calling std::string == std::string, with at least several int<->int comparisons and branches. This is why people often use std::strings for human-readable conveniences, but either logically map IDs to the strings, and pass around the IDs as the primary keys, or else hash the string as soon as possible (a one-time cost) and use the hash as the key.

(Unrelated note: by default, you always want to use std::unordered_map over regular std::map unless you need your elements to remain ordered. You pay extra costs for std::map for a benefit that often is not needed)

Share this post


Link to post
Share on other sites
This is why people often use std::strings for human-readable conveniences, but either logically map IDs to the strings, and pass around the IDs as the primary keys, or else hash the string as soon as possible (a one-time cost) and use the hash as the key.

Yes, it's important to avoid all find of string in a function which is called each frame in animation or other categories.

The good thing to do is to setup that one time when it's needed.

Edited by Alundra

Share this post


Link to post
Share on other sites

This topic is 765 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this