Headkaze

[iPhone] Optimizing a long loop

For a game I'm writing for the iPhone I have hit a particular problem. I am rendering md2 models, and each frame the models are animated by interpolating their vertices between the current and next frame. This is how the md2 format stores animation: as a series of frames with each vertex position defined. In the game there are currently 4 animating models onscreen. Since each model has 918 vertices, multiply that by 4 and you have 3,672 iterations per frame to update the models' animation. When I profile the game in Shark this takes up a whopping 15 - 20% of frame time. Drawing only one model takes about 6%. Here is the offending code:
NSInteger vertexCount = _header.triangleCount * 3;

for( NSInteger i=0; i<vertexCount; i++)
{
	_vertices[i].position.x = currentVertices[i].position.x + alpha * (nextVertices[i].position.x - currentVertices[i].position.x);
	_vertices[i].position.y = currentVertices[i].position.y + alpha * (nextVertices[i].position.y - currentVertices[i].position.y);
	_vertices[i].position.z = currentVertices[i].position.z + alpha * (nextVertices[i].position.z - currentVertices[i].position.z);
}
Pretty basic, loop through each vertex and interpolate between current and next frame. I would really like to optimize this, and I don't see how it would be possible by doing anything different in the loop. So I would like to hear suggestions (such as updating a model per frame / frame skipping etc.). What would you do?

I don't know much about the iPhone, but I read the newer ones at least support vertex shaders, which is excellent for things like this.
If not, perhaps you can pre-calculate alpha * (next - current) at every key-frame change, at least if your update rate is constant. You could also choose to only update the model animation at specific intervals, for example half your framerate (and/or force constant update rate for the models).
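
One way to read the pre-calculation idea is to split the work in two: compute (next - current) once per key-frame change, then do only a multiply-add per component each frame. The following is a minimal sketch under that reading; the Vec3 layout and both function names are hypothetical and not from the original code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vertex layout; the original post only shows a .position member. */
typedef struct { float x, y, z; } Vec3;

/* Run once whenever the key-frame pair changes: delta = next - current. */
void compute_deltas(Vec3 *delta, const Vec3 *current, const Vec3 *next, size_t count)
{
    assert(delta && current && next);
    for (size_t i = 0; i < count; ++i) {
        delta[i].x = next[i].x - current[i].x;
        delta[i].y = next[i].y - current[i].y;
        delta[i].z = next[i].z - current[i].z;
    }
}

/* Run every rendered frame: out = current + alpha * delta. */
void apply_deltas(Vec3 *out, const Vec3 *current, const Vec3 *delta, float alpha, size_t count)
{
    assert(out && current && delta);
    for (size_t i = 0; i < count; ++i) {
        out[i].x = current[i].x + alpha * delta[i].x;
        out[i].y = current[i].y + alpha * delta[i].y;
        out[i].z = current[i].z + alpha * delta[i].z;
    }
}
```

This trades one extra vertex buffer per model for removing the subtraction from the per-frame loop; whether that is a net win on a given device would need profiling.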

You could try looking at this presentation ( http://gamesfromwithin.com/360idev-cranking-up-floating-point-performance-to-11 ), although it might cover some stuff you already know.
Also either arranging your vertices to be more cache efficient/aware or taking advantage of __builtin_prefetch might speed it up a bit.
There's another article from that site about data-oriented design which covers this ( http://gamesfromwithin.com/data-oriented-design ), just take the anti-OO opinion with a grain of salt.
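
As a rough illustration of the cache-aware layout plus __builtin_prefetch suggestion, here is a sketch that stores positions as one flat float array (x, y, z, x, y, z, ...) and hints the next chunk of both source frames while the current one is processed. The function name, the flat layout, and the prefetch distance of 16 floats are all assumptions; the right distance depends on the device's cache line size and should be measured.

```c
#include <assert.h>
#include <stddef.h>

/* __builtin_prefetch is a GCC/Clang builtin; fall back to a no-op elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

enum { PREFETCH_AHEAD = 16 };  /* in floats; a guess, not a tuned value */

/* Interpolates n floats: out = cur + alpha * (nxt - cur). */
void lerp_flat(float *out, const float *cur, const float *nxt, float alpha, size_t n)
{
    assert(out && cur && nxt);
    for (size_t i = 0; i < n; ++i) {
        /* Once per chunk, hint the following chunk of both source frames. */
        if ((i % PREFETCH_AHEAD) == 0 && i + PREFETCH_AHEAD < n) {
            PREFETCH_READ(&cur[i + PREFETCH_AHEAD]);
            PREFETCH_READ(&nxt[i + PREFETCH_AHEAD]);
        }
        out[i] = cur[i] + alpha * (nxt[i] - cur[i]);
    }
}
```

A prefetch is only a hint, so the result is identical with or without it; only the timing can change.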

On the 3GS (this won't help if you are targeting older iPhones as well) you can make use of the NEON instructions for this. If my assembler is any good, this operation would end up as two (single-cycle?) NEON instructions per iteration: one vsub and one multiply-accumulate (vmla), instead of the 9 it takes now, not counting loop instructions.
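
For reference, the subtract plus multiply-accumulate pair can be written with NEON intrinsics rather than hand-written assembly. This is a sketch that assumes positions are stored as a flat float array and processes four floats at a time; the function name and layout are hypothetical. It falls back to plain C when NEON is not available, so the same source also builds for older devices.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* out = cur + alpha * (nxt - cur) over n floats. */
void lerp_neon(float *out, const float *cur, const float *nxt, float alpha, size_t n)
{
    assert(out && cur && nxt);
    size_t i = 0;
#if HAVE_NEON
    for (; i + 4 <= n; i += 4) {
        float32x4_t c = vld1q_f32(cur + i);
        float32x4_t d = vsubq_f32(vld1q_f32(nxt + i), c);  /* next - current */
        vst1q_f32(out + i, vmlaq_n_f32(c, d, alpha));      /* c + d * alpha  */
    }
#endif
    /* Scalar tail, and the whole loop when NEON is unavailable. */
    for (; i < n; ++i)
        out[i] = cur[i] + alpha * (nxt[i] - cur[i]);
}
```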

Simple stuff you can do: unroll the loop for multiple iterations per step, use raw pointers instead of array indexing to access the data, organize your data for better caching, or, if you're up to it, use SIMD instructions (http://code.google.com/p/vfpmathlibrary/), etc.

Good Luck!

-ddn

Thanks for the feedback guys. I have coded in ARM asm before so I think I could optimize it using the VFP. Thanks for that excellent link. I went through all the slides, although I'm a bit disappointed I can't see the video of Noel doing the talk live. Is there a video somewhere? I'm also having a slight bottleneck processing particles, so that is another thing I need to optimize. I'll keep you posted on how this is resolved.

BTW I was quite shocked that the 3G runs its VFP at half the speed, so it does put me off using the VFP. And obviously we can't use NEON because we are targeting older phones as well.

Quote:
Original post by Zahlman
Moving to Consoles, PDAs and Cell Phones in case you can get any more advice there. :)


Well I didn't actually think there was much opportunity to optimize the loop itself so I thought it was more of a general game programming question on how to deal with big loops. I'm not sure moving it will help either, that part of the forum seems quite dead. Perhaps there are more iPhone related forums on the web. Thanks anyway I know you're just trying to help!

Quote:
Original post by Headkaze
NSInteger vertexCount = _header.triangleCount * 3;
The best solution would be to cut down the number of vertices. Is there any chance you can switch to a format (possibly custom) which supports indexed vertices? You could likely cut down the number of vertices by 50%.
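
To get a feel for how many vertices indexing would save, a load-time pass can collapse duplicate positions and emit an index list. This is a minimal sketch that compares positions only; the Vec3 type and build_index are hypothetical names, and a real vertex would also need matching normals and texture coordinates before it could be merged. For an animated model the same index list would have to be valid for every frame.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { float x, y, z; } Vec3;  /* hypothetical position-only vertex */

/* Collapses bitwise-identical positions.
 * unique must have room for count entries, indices for exactly count.
 * Returns the number of unique vertices written.
 * Uses a linear search (O(n^2)), which is acceptable for roughly a
 * thousand vertices processed once at load time. */
size_t build_index(const Vec3 *verts, size_t count, Vec3 *unique, unsigned short *indices)
{
    assert(verts && unique && indices && count <= 65536);
    size_t unique_count = 0;
    for (size_t i = 0; i < count; ++i) {
        size_t j = 0;
        while (j < unique_count && memcmp(&unique[j], &verts[i], sizeof(Vec3)) != 0)
            ++j;
        if (j == unique_count)
            unique[unique_count++] = verts[i];  /* first time this position is seen */
        indices[i] = (unsigned short)j;
    }
    return unique_count;
}
```

With indexed data, the per-frame interpolation loop only has to run over the unique vertices, while the index list stays constant.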

Okay, problem solved. This was easier than I thought it would be and really staring me in the face: remove the interpolation. Since the models are quite small and low poly you don't really notice the difference.

So re-factored the code for this, and now instead of taking 20% I've dropped it down to 0%.

The mod is really to just assign the _vertices pointer to the current frame of vertices instead of copying them over in a loop. Since removing interpolation, no realtime calculations are necessary and I can just use the raw data of each frame.
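
The pointer swap described above might look roughly like this. It is a sketch only: the Model struct, its field names, and model_set_frame are hypothetical, and it assumes all frames are stored back to back in one contiguous array.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;

/* Hypothetical model record; frames holds frameCount * vertexCount vertices. */
typedef struct {
    const Vec3 *frames;
    size_t frameCount;
    size_t vertexCount;
    const Vec3 *vertices;  /* what the renderer draws this frame */
} Model;

/* Selects a frame by repointing vertices; no per-vertex work is done. */
void model_set_frame(Model *m, size_t frame)
{
    assert(m && m->frames && m->frameCount > 0);
    m->vertices = m->frames + (frame % m->frameCount) * m->vertexCount;
}
```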

Another optimization I've been considering is changing from GL_TRIANGLES to GL_TRIANGLE_STRIP. Is it relatively easy to convert the data over to this format?

Quote:
Original post by Headkaze
Another optimization I've been considering is changing from GL_TRIANGLES to GL_TRIANGLE_STRIP. Is it relatively easy to convert the data over to this format?
Art tools can do it, but it is probably not recommended.

First off, you would be much better off with indexed triangles rather than drawing individual triangles. That way you only transmit the points once and reuse them. If you are transmitting raw triangles every frame you are probably sending 3X the necessary data.

I'm not sure about the iPhone's various GPU editions, but most modern cards perform better with "triangle soup" rather than triangle strips. The parallel processing in the card means that each sub-processor can be more efficient if you let it do its own optimization rather than telling it a specific order to draw the strips.

Quote:
I'm not sure moving it will help either, that part of the forum seems quite dead.
It is very active but low volume.

Many people watch it. Few professionals post questions to it because of the console makers' NDA terms. They will post to the official private groups instead.

When questions are asked, however, there are relatively more industry professionals lurking on the forum. The answers are generally more useful and targeted directly to the hardware's need.

I know you've "solved it" but you could optimise this simple loop by removing all the aliasing. I would code the loop something like this:

while(vertexCount--)
{
fp32 fCx = pCurrent->x;
fp32 fCy = pCurrent->y;
fp32 fCz = pCurrent->z;

fp32 fNx = pNext->x;
fp32 fNy = pNext->y;
fp32 fNz = pNext->z;

pDest->x = fCx + (fNx - fCx) * fAlpha;
pDest->y = fCy + (fNy - fCy) * fAlpha;
pDest->z = fCz + (fNz - fCz) * fAlpha;

++pCurrent;
++pNext;
++pDest;
}
I hope this helps?
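
Another way to express the no-aliasing guarantee is the C99 restrict qualifier, which promises the compiler that the destination does not overlap the source frames, so it is free to keep loaded values in registers. The sketch below assumes a plain Vec3 struct and a hypothetical function name; it is only valid if the destination buffer really is separate from both frames.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;  /* hypothetical position type */

/* dest must not overlap current or next; restrict makes that a promise. */
void lerp_vertices(Vec3 *restrict dest,
                   const Vec3 *restrict current,
                   const Vec3 *restrict next,
                   float alpha, size_t count)
{
    assert(dest && current && next);
    assert(dest != current && dest != next);
    for (size_t i = 0; i < count; ++i) {
        /* Load both source vertices into locals before any store. */
        const Vec3 c = current[i];
        const Vec3 n = next[i];
        dest[i].x = c.x + (n.x - c.x) * alpha;
        dest[i].y = c.y + (n.y - c.y) * alpha;
        dest[i].z = c.z + (n.z - c.z) * alpha;
    }
}
```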

Quote:
Original post by Rompa
I know you've "solved it" but you could optimise this simple loop by removing all the aliasing. I would code the loop something like this:

while(vertexCount--)
{
fp32 fCx = pCurrent->x;
fp32 fCy = pCurrent->y;
fp32 fCz = pCurrent->z;

fp32 fNx = pNext->x;
fp32 fNy = pNext->y;
fp32 fNz = pNext->z;

pDest->x = fCx + (fNx - fCx) * fAlpha;
pDest->y = fCy + (fNy - fCy) * fAlpha;
pDest->z = fCz + (fNz - fCz) * fAlpha;

++pCurrent;
++pNext;
++pDest;
}
I hope this helps?





If the compiler isn't up to snuff, then ordering them better for register use might be a tiny bit better (set compiler options for speed over code size). An SSE-style SIMD solution would probably be a lot better, but I'm not sure if the ARM used has anything like that.



fp32 fCx = pCurrent->x;
pDest->x = fCx + (pNext->x - fCx) * fAlpha;

fp32 fCy = pCurrent->y;
pDest->y = fCy + (pNext->y - fCy) * fAlpha;

fp32 fCz = pCurrent->z;
pDest->z = fCz + (pNext->z - fCz) * fAlpha;



Another possibility, if the actions are repetitive (as in a walk cycle), is to cache each frame in the repeated sequence so that it only has to be calculated once.
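
A sketch of that caching idea: bake the in-between frames of a looping animation once, then pick a baked frame at runtime instead of interpolating. The function name, the flat float layout, and the fixed number of steps per key frame are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bakes a looping animation into cache.
 * keys:  keyCount frames of floatsPerFrame floats each.
 * cache: room for keyCount * steps frames of floatsPerFrame floats.
 * Baked frame (k * steps + s) is key k blended s/steps of the way
 * toward key k+1, with the last key wrapping around to the first. */
void bake_animation(float *cache, const float *keys,
                    size_t keyCount, size_t floatsPerFrame, size_t steps)
{
    assert(cache && keys && keyCount > 0 && steps > 0);
    for (size_t k = 0; k < keyCount; ++k) {
        const float *cur = keys + k * floatsPerFrame;
        const float *nxt = keys + ((k + 1) % keyCount) * floatsPerFrame;
        for (size_t s = 0; s < steps; ++s) {
            const float alpha = (float)s / (float)steps;
            float *out = cache + (k * steps + s) * floatsPerFrame;
            for (size_t i = 0; i < floatsPerFrame; ++i)
                out[i] = cur[i] + alpha * (nxt[i] - cur[i]);
        }
    }
}
```

The cost is memory: the cache is steps times the size of the key-frame data, which matters on a memory-constrained device.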


Another solution (if the objects are often at a distance) is to use a simpler model (LOD, level of detail) when they are far enough away that the reduced detail won't matter.
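
Picking the level of detail can be as simple as comparing the camera distance against a short ascending list of thresholds. This is a sketch with hypothetical names; the thresholds themselves would be chosen by eye for each model.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the LOD level to draw, where 0 is the most detailed.
 * maxDistance holds levelCount - 1 ascending thresholds:
 * level i is used while distance <= maxDistance[i],
 * and the last level is used beyond the final threshold. */
size_t select_lod(float distance, const float *maxDistance, size_t levelCount)
{
    assert(maxDistance && levelCount > 0);
    size_t level = 0;
    while (level + 1 < levelCount && distance > maxDistance[level])
        ++level;
    return level;
}
```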
