Skeletal animation is hella slow

Started by
16 comments, last by Alundra 8 years, 5 months ago

For skeletal animation, I wrote a system where each scene node may have a Bone object attached to it.

My render() function looks like this:


void render()
{
	animationController->updateAnimation(animData, manager.sceneGraph()->rootSceneNode());
	renderer.renderFrame(cam, renderer.shaderProgramCache()->getObject("skeletal animation program"));
}

The updateAnimation() function traverses the scene graph (starting from the root node) with a delta_time variable and a std::map of node names and their transformation time keys. (in code: std::unordered_map<std::string, NodeAnimationData*>)

The update function retrieves some values and then calls then next function, which does all the magic:


void AnimationController::_updateSceneNode(SceneNode *sceneNode, double time, std::unordered_map<std::string, NodeAnimationData*> *nodeAnimations)
{
	if(!sceneNode->animationRunning())
		return;

	if(nodeAnimations->empty())
		return;

	const NodeAnimationData *nodeAnim;
	try {
		nodeAnim = nodeAnimations->at( sceneNode->name() );
	}
	catch(const std::out_of_range&) {
		nodeAnim = nullptr;
	}
	Bone *bone = sceneNode->bone();

	if(nodeAnim && bone)
	{
		double factor;

		//rotation
		const QuatKey *currQuatKey = nullptr, *nextQuatKey = nullptr;
		_getAnimationKeys<QuatKey>(&nodeAnim->quatKeys, time, &currQuatKey, &nextQuatKey);

		factor = (time - currQuatKey->time) / (nextQuatKey->time - currQuatKey->time);
		factor = glm::clamp(factor, 0.0, 1.0);

		glm::quat rotation_quat = glm::slerp(currQuatKey->quat, nextQuatKey->quat, (float)factor);
		glm::mat4 rotation_mat = glm::mat4_cast(rotation_quat);

		//position
		const VectorKey *currPosKey = nullptr, *nextPosKey = nullptr;
		_getAnimationKeys<VectorKey>(&nodeAnim->positionKeys, time, &currPosKey, &nextPosKey);

		factor = (time - currPosKey->time) / (nextPosKey->time - currPosKey->time);
		factor = glm::clamp(factor, 0.0, 1.0);

		glm::vec3 position_vec = currPosKey->vec*(1.0f - (float)factor) + nextPosKey->vec * (float)factor;
		glm::mat4 position_mat = glm::translate(position_vec);

		//set matrices
		bone->setNodeMatrix( (sceneNode->parent()->bone() ? sceneNode->parent()->bone()->nodeMatrix() : glm::mat4(1))
										 * (position_mat * rotation_mat) );
		bone->setFinalMatrix( bone->nodeMatrix() * bone->offsetMatrix() );

		//upload finalMatrix to shader
		std::string str = "Bones[";
		char buf[4] = {0};
		_itoa_s(bone->index(), buf, 10);
		str += buf;
		str += "]";

		_shaderProgram->setUniform(str.c_str(), bone->finalMatrix());

		//pull out that animation node (for speed)
		nodeAnimations->erase(sceneNode->name());
	}

	//recursively call for children
	for(unsigned int i = 0; i < sceneNode->getNumChildren(); i++)
		_updateSceneNode(sceneNode->getChildNode(i), time, nodeAnimations);
}

The function basically inspects every node in the scene and checks if there's a match in the NodeAnimationData map. (IOW if there's an animation for that node's bone). If there is, it calculates all the needed values and uploads the final bone matrix to the Bones[] array in the glsl shader.

I've tried to add some optimizations but this still works slowly with more than two animations running.

Can anyone point out anything I could do better?

Advertisement
Make sure the scene graph is stored in a contiguous array, and that you traverse it by lineraly iterating through the array elements.
Don't use std::map.
Don't use std::unordered_map.
Don't use std::string.
Don't use exception throwing for flow-control.

Perhaps you need to iterate through your animation data as the outer loop first, applying it to your scene nodes in local coordinates.
Then afterwarda, iterate through the scene nodes and perform the parent->child coordinate space transforms to get the world-space data.

Upload all bones to the shader in one operation (at the end, after they've all been updated). Set uniforms by numbers/locations, not by strings/names.

Write the update routines in such a way that you're certain of the memory ranges that will be read/written and what tge side-effects are (e.g. Input->Process->Output) -- so then you can ask a helper thread to update half the nodes for you.

Look up "Pitfalls of OOP" by Albrecht, and anything on Data oriented Design (DoD).

On the same path as Hodgman, never use string in update but store index.

Your animation update should store an array of final transform matrix and then the renderer will use it to send to the shader bone matrix array.

Storing a linear array of index can also avoid your recursion surely.

Seconding all what Hodgman listed, except that I would rephrase his "perhaps you need to iterate through your animation data as the outer loop first ..." as "you should iterate through your animation data as the outer loop first ...". Any sub-system should manage their data themselves. A skeleton animation sub-system, for example, should hold a list of all skeletons that need to be updated. A skeleton gets registered / unregistered with the sub-system when it is instantiated into / removed from the scene. This unburdens the sub-system from iterating the scene tree and touching all those unneeded nodes.

How do I store a scene graph in a contiguous array? The whole point of the scene graph is having that parent-child relation.

The only way I can think of is having std::vector<SceneNode*> in SceneGraph, and store its pointer in every SceneNode. Did you mean something else?

Don't use std::map.

Don't use std::unordered_map.
Don't use std::string.

Why? Do you mean I shouldn't generally use them, or just in this case?

This is what my AnimationData class looks like:


class AnimationData
{
public:
	AnimationData(const aiAnimation *animation);
	~AnimationData();
	
	const char* name() const { return _name.c_str(); }
	double duration() const { return _duration; }
	double ticksPerSecond() const { return _ticksPerSecond; }
	const std::unordered_map<std::string, NodeAnimationData*>* nodeAnimations() const { return &_nodeAnimations; }

private:
	std::string _name; //animation name ("walk" etc.)
	double _duration;
	double _ticksPerSecond;

	std::unordered_map<std::string, NodeAnimationData*> _nodeAnimations; //node name, NodeAnimationData
};

Is there anything wrong in storing NodeAnimationData* in a map?

Perhaps you need to iterate through your animation data as the outer loop first, applying it to your scene nodes in local coordinates.

You mean iterate through the animation data map, find the corresponding SceneNode for every element, and set its local (animation) matrix?

Then afterwarda, iterate through the scene nodes and perform the parent->child coordinate space transforms to get the world-space data.

This part has to be done recursively right?

Upload all bones to the shader in one operation (at the end, after they've all been updated). Set uniforms by numbers/locations, not by strings/names.

This would just avoid the getUniformLocation() call in every setUniform() function, and instead look in a table (perhaps a map<string,index>?) for the matching index.

Is this really my problem?

Why?

Because you do this...


//upload finalMatrix to shader
std::string str = "Bones[";
char buf[4] = {0};
_itoa_s(bone->index(), buf, 10);
str += buf;
str += "]";
_shaderProgram->setUniform(str.c_str(), bone->finalMatrix());

.. for every bone. That's killing you.

Not only are you doing hundreds/thousands (depending on how many models you have and how many bones per model) of memory allocations and reallocations with copies, you also have thousands of individual calls setting one uniform each. (Also, I am very much inclined to believe that the harmless call _shaderProgram->setUniform(...) does a lookup in a std::map or similar... it does, doesn't it?)

What do you think e.g. the two harmless statements str += buf; and str += "]"; result in? They require (at least) two calls to strlen, quite possibly (unless the string implementation over-allocates sufficiently) two reallocations and copies, and creation/destruction of two temporaries.

That's not a lot per se, but it's unnecessary, and it's something that you do many (hundred, thousand) times in a rather time-critical part of your program.

Setting lots of individual uniforms one by one not only means adding a lot of GL/validation overhead, but also means having to do one DMA transfer per uniform, unless the driver is able to batch them together. Transfers have a high latency and thus are very expensive. Luckily, drivers do this kind of optimization and are rather good at it, too. But how can drivers do that?

When it uploads one uniform, the driver looks in the command stream whether there are any other uniform changes already queued (before the next draw command only!). If it finds any, it sends them all together in one transfer. Strictly, that's a violation of the API, but it's one that you don't know about since it produces no externally visible difference.

Now, if you spend some significant time between setting one uniform and another (because you do a lot of needless string manipulations) then chances are that the driver will always only find one uniform in the queue (or very few), and must choose between doing extra transfers, or doing a bulk transfer with all your uniforms at a later time (presumably just before the next draw command), both of which are less favorable.

Is this really my problem?

May be, may be not. How much time is consumed here and there can be determined only if you make a runtime analysis. However, the points shown do all contribute to your runtime problem, some more than others.

Why? Do you mean I shouldn't generally use them, or just in this case?

You should not use them in dependence on the invocation frequency and available time. You want to write a realtime application and need to expect several hundreds(+) invocations. Although the std containers are not necessarily bad, they are made for general use and, more or less, desktop applications.

You mean iterate through the animation data map, find the corresponding SceneNode for every element, and set its local (animation) matrix?

More or less, as long as "find" does not mean a search. As I've written in a post above, let the animation sub-system manage its own data. If you need an access back to the scene node, then let the sub-system have a list of pointers to all animated scene nodes. Iterating the list and accessing the belonging scene node is then a fast operation.

This part has to be done recursively right?

Well, not really. If you iterate the tree top-down, then a parent's world transform is already calculated when you visit its children, so that they can rely on its up-to-date state. No recursive calculation necessary. If you follow the mentioned DoD approach for the parent/child relations as well, then order the matrices so that parent world transforms are hit before those of the associated children.

Then what do you guys suggest? How can I do all that's needed without std containers? To be more specific:

@samoth

(Also, I am very much inclined to believe that the harmless call _shaderProgram->setUniform(...) does a lookup in a std::map or similar... it does, doesn't it?)

Nope. Do you think I should implement such thing? Here's the current (inefficient, I suppose..) implementation:


GLint ShaderProgram::getUniformLocation(const GLchar *name) const
{
	if(name == NULL) {
		Log::instance() << "uniform name is null :" << name << "\n";
		return -1;
	}

	GLint location = glGetUniformLocation(_id, name);
	if(location == -1)
		Log::instance() << "uniform doesn't exist: " << name << "\n";

	return location;
}


void ShaderProgram::setUniform(const GLchar *name, const glm::mat4 &m, GLboolean transpose=GL_FALSE)
{
	glUniformMatrix4fv(getUniformLocation(name), 1, transpose, &m[0][0]);
} 

If I will implement such thing, in order to completely avoid "playing with strings" I'm gonna need something like ::setUniformArray(const char *arrayName, unsigned int index, mat4x4 value). and then call will look like setUniformArray("Bones", index, bone->finalMatrix());. No?

But then again I will implement this with a std::map and that's a function that's gonna be called many times per frame, which is bad according to what haegarr says. I'm confused!

More or less, as long as "find" does not mean a search. As I've written in a post above, let the animation sub-system manage its own data. If you need an access back to the scene node, then let the sub-system have a list of pointers to all animated scene nodes. Iterating the list and accessing the belonging scene node is then a fast operation.

I'm sorry but I still don't get this. When you say "animation sub-system" do you mean my AnimationController class?

I implemented it in a way that's every AnimationData and SceneNode* pair have a matching Timer object for their animation.

Can you explain further?

Rather than perhaps, maybe, should and might, and assumptions about data types, have you ran a profiler against your code yet?

I am sure a lot of the points given here already are correct but without profiling you won't know for sure where your bottlenecks are, and there may be other things you haven't considered outside of skeletal animation that are causing your performance issues.

Please let us know what the profiler says and we can impart much more accurate advice...

Nope. Do you think I should implement such thing? Here's the current (inefficient, I suppose..) implementation:

GLint ShaderProgram::getUniformLocation(const GLchar *name) const
{

...

GLint location = glGetUniformLocation(_id, name);
if(location == -1)
Log::instance() << "uniform doesn't exist: " << name << "\n";

return location;
}

This is your lookup in "something like a std::map", done by the GL. No matter how the implementation does the c-string to location translation (presumably with a hash map, but you cannot know for sure), this is a heavyweight thing. Although, admittedly, the function is const qualified, so there is a chance that the compiler might eliminate some calls (if you're lucky... which is better than nothing).

Should I implement that?

No. You should not implement such a thing yourself either, you should not at all translate from strings to integers in that way for every bone. No matter what algorithm you choose (linear search, binary tree, hash map, ...) they all suck. Even if someone tells you "but hash maps are O(1)". They may be, but that's irrelevant.

You need to consider what happens when you call such a function. Let's assume that the GL implementation uses a hash map. You pass in pointer to a 10-character string. The GL now iterates over every character of that string, calculating a hash. Then it looks into the bucket of its hash map and finds an entry. Now it must compare the string stored at that entry with your string (iterating over it again) to be sure that it's really a hit, not a collision. Now it retrieves the integer and passes it back to you. Add overhead of a function call to that.

Now, compare that to

a) reading an integer from a memory location and using that instead

or

b) not fiddling with single uniforms at all, but throwing a whole block of memory (containing a thousand of them) at the driver in one go

This topic is closed to new replies.

Advertisement