Speed of Euler, Verlet and RK4 with loads of objects



Hi guys, here's a little question I'm sure some of you will be able to answer in under 5 secs... I'm going to implement a game that resembles Gish physics (you know, Jacobsen-like Verlets with relaxation etc.), and right now I'm prototyping some physics code. First, I solved FPS-independent movement using the technique presented at http://www.gaffer.org/game-physics/fix-your-timestep/ and everything works fine using the Euler, RK4 and Verlet integration schemes. Objects move on the screen smoothly, without any visible jerking etc. So far so good. However, problems arise when I throw in a "few" more objects - the framerate drops very badly at even moderate numbers. Obviously, I'm testing the stuff in release mode; here are some sample values:
Euler:      350 objects at 30 fps
Verlet:     330 objects at 30 fps
RK4:         70 objects at 30 fps, drops below 5 fps with ~100 objects

It runs at ~500 fps with 1 object for all integration schemes (yup, no vsync). Here's the code I use for integration:
// ------------- RK4
{
    previous = current;

    Derivative2d a = Evaluate(current, t);
    Derivative2d b = Evaluate(current, t, dt*0.5f, a);
    Derivative2d c = Evaluate(current, t, dt*0.5f, b);
    Derivative2d d = Evaluate(current, t, dt, c);

    const float dxdt = 1.0f/6.0f * (a.d_pos.x + 2.0f*(b.d_pos.x + c.d_pos.x) + d.d_pos.x);
    const float dydt = 1.0f/6.0f * (a.d_pos.y + 2.0f*(b.d_pos.y + c.d_pos.y) + d.d_pos.y);

    const float dvxdt = 1.0f/6.0f * (a.d_vel.x + 2.0f*(b.d_vel.x + c.d_vel.x) + d.d_vel.x);
    const float dvydt = 1.0f/6.0f * (a.d_vel.y + 2.0f*(b.d_vel.y + c.d_vel.y) + d.d_vel.y);

    current.pos.x = current.pos.x + dxdt*dt;
    current.pos.y = current.pos.y + dydt*dt;

    current.vel.x = current.vel.x + dvxdt*dt;
    current.vel.y = current.vel.y + dvydt*dt;
}

// ------------- Euler
{
    previous = current;

    Derivative2d a = Evaluate(current, t);

    current.pos.x = current.pos.x + a.d_pos.x*dt;
    current.pos.y = current.pos.y + a.d_pos.y*dt;

    current.vel.x = current.vel.x + a.d_vel.x*dt;
    current.vel.y = current.vel.y + a.d_vel.y*dt;
}

// ------------- Verlet
{
    Vector2 tmp = state.pos_current;
    state.pos_current = (state.pos_current * 2 - state.pos_previous) + state.acceleration * (dt * dt);
    state.pos_previous = tmp;
    state.acceleration.Set(0, 0);
}

// --- for the curious ones

Derivative2d TEntity::Evaluate( const State2d & _initial, float t )
{
    Derivative2d output;
    output.d_pos.x = _initial.vel.x;
    output.d_pos.y = _initial.vel.y;

    Acceleration(_initial, t, output); // currently does nothing
    return output;
}


Here's the main loop:
float new_time = timer->GetTime();
float delta_time = new_time - current_time;
current_time = new_time;

accumulator += delta_time;

while (accumulator >= dt) // dt = 0.01
{
    for (EntitiesItor itor = entities.begin(); itor != entities.end(); ++itor)
        (*itor)->Integrate(t, dt);

    t += dt;
    accumulator -= dt;
}

const float alpha = accumulator / dt;

for (EntitiesItor itor = entities.begin(); itor != entities.end(); ++itor)
{
    (*itor)->Update();
    (*itor)->Render(renderer, (*itor)->Interpolate(alpha));
}


Does anyone have any ideas why this runs so slowly? I don't believe that today's powerful computers, which run things like Unreal 3 or Crysis at >30 fps, could have problems moving a few dozen objects, with no collision detection / collision resolution / any other activity taking place. So either there's something inherently bad in my code, or that's the hard truth I'll have to accept. I tried removing the rendering of the objects, and it raised the fps by only a few percent, so evidently I'm CPU-bound, not GPU-bound. Also, I have an AMD 3500+ with plenty of DDR2 RAM and a Radeon X1600, which is not that bad, so IMHO a slow computer config is not the problem.

Share on other sites
The RK4 method seems to be performing quite well, considering how much arithmetic is involved. A rough calculation shows that in your case it is only 5x slower than Euler (5 == 350/70).

What you could do is run each integration method repeatedly for n seconds, tallying up how many iterations are performed. I'd guess the number of Euler iterations will also be around 5x the number RK4 manages.

Have you tried RK2?

When numerically integrating, you should also consider whether Euler is sufficient. Unless you're simulating non-linear velocities and accelerations, it usually is. Ex: Quake's characters either stand, run or walk; they do not have a continuous range of velocities, so Euler integration is sufficient.

Also, the numerical imprecision of the floating-point format you are using may end up causing more instability than the actual integration method. Ex: when modelling the orbit of Mercury, trajectory errors would be as much the fault of imprecision as of naive integration.

Share on other sites
Since the OP mentioned Gish, I'd say that Euler may not be the best choice; however, "newton-stoermer-verlet" might be a good one: as fast as Euler but much better behaved:

v += a*dt
x += v*dt

I'd also like to know why the numbers are so low -- for Euler that's only 350*4*30 = 42k adds + 42k multiplies per second... surely it should be orders of magnitude faster than that, no?

Possibly your method of timing is incorrect?

Share on other sites
The speeds of SGDI (or NSV) and a sixth-order symplectic integrator are about... the sixth-order integrator is seven times slower than SGDI (NSV, symplectic Euler, or whatever you are calling that simple algorithm).

So you might as well throw RK4 out of the window, it's crap anyway, and use a sixth-order integrator.

BTW, what's your timestep, how much copying are you doing, and where is that O(n^2) algorithm? 70 -> 100 objects at 30 -> 5 fps is an increase in object count by half followed by a decrease in speed by a factor of 6. That looks more like an O(n^3) algorithm, even accounting for increased memory allocation.

Share on other sites
RK4 does take about 5 times longer, but you should be able to run with a dt that is 10 to 100 times larger and still get acceptable accuracy; that's really the point of using higher-order integration techniques.

Share on other sites
Quote:
 Original post by Raghar
 The speeds of SGDI (or NSV) and a sixth-order symplectic integrator are about... the sixth-order integrator is seven times slower than SGDI (NSV, symplectic Euler, or whatever you are calling that simple algorithm). So you might as well throw RK4 out of the window, it's crap anyway, and use a sixth-order integrator. BTW, what's your timestep, how much copying are you doing, and where is that O(n^2) algorithm? 70 -> 100 objects at 30 -> 5 fps is an increase in object count by half followed by a decrease in speed by a factor of 6. That looks more like an O(n^3) algorithm, even accounting for increased memory allocation.

I'm just curious, but why do you say RK4 is crap? I'm not really familiar with numerical methods as they apply specifically to games, but I've never heard of using a 6th-order method for, well, anything really.

Share on other sites
I want to see the timing methods used by the OP. I'm betting they're at best 1 ms resolution, and if you're going to be picky, that's just not enough.

Share on other sites
just a note: I would change the order from this
  while (accumulator >= dt) // dt = 0.01
  {
      for (EntitiesItor itor = entities.begin(); itor != entities.end(); ++itor)
          (*itor)->Integrate(t, dt);

      t += dt;
      accumulator -= dt;
  }

to this
  for (EntitiesItor itor = entities.begin(); itor != entities.end(); ++itor)
  {
      for (float time_left = accumulator; time_left >= dt; time_left -= dt)
      {
          (*itor)->Integrate(t, dt);
      }
  }
  while (accumulator >= dt) // dt = 0.01
  {
      t += dt;
      accumulator -= dt;
  }

This won't make a difference for a small number of objects, or when you are only doing 1 step, but it should help the rest of the time. Have you tried profiling your code to see where the time is actually going?

(why do my increment operators (plus equals) not show up in the preview window?)

Share on other sites
Quote:
 Original post by lonesock
 just a note: I would change the order from this *** Source Snippet Removed *** to this *** Source Snippet Removed *** This won't make a difference for a small number of objects, or when you are only doing 1 step, but it should help the rest of the time. Have you tried profiling your code to see where the time is actually going? (why do my increment operators (plus equals) not show up in the preview window?)

Except that then you're passing the wrong time to the integrator.

Share on other sites
Quote:
 Original post by jjd
 Except that then you're passing the wrong time to the integrator.

Good catch! I was just hacking the order around and didn't pay attention... my bad.

Share on other sites
Wow guys, thx for the answers!

taby: I heard about RK2 (midpoint) but haven't tested it yet.

As for the accuracy of the simulation: since I'm going to be simulating something more complicated than an FPS shooter (that is, a more advanced version of Gish), I do care about it. But since levels are going to be huge with lots of objects, I'm also very careful about speed. So far it looks like RK4 is out because of speed, and Euler is out because of accuracy. That leaves me with some variation of Verlet...

Numerical imprecision: I won't be doing astronomical calculations :-) first speed, then accuracy. Besides, I can't do much about the imprecision of floats (which is quite good anyway, AFAIK from GPG #4).

raigan: I already did some research and the page you mentioned was one of the first I found :-) good stuff, I'll consider it. Do you have any other links to pages that show how to use NSV in games in practice?

And I'm also worried about why the fps is so low; that's why I started this thread :-)

Mike2343: you're right, I'm using SDL, which gives at best 1 ms resolution. Some code:

 // populated by SDL_GetTicks
 void Update(float _frame_dt)
 {
     if (paused)
         return;

     ++frames; ++total_frames;

     last_frame_index = this_frame_index;
     this_frame_index += _frame_dt * scale;

     delta_time = (this_frame_index - last_frame_index) / 1000.0f;

     /*
        Start frames at 0, and store the time. Each frame, increment the frame
        counter and check whether the current time is the stored time + 1000.
        When it is, set the frame count as the current FPS, then reset it to 0
        and put the current time in the stored time.
        FPS will update every second, showing the average FPS over the last second.
     */

     if (this_frame_index >= (marker + 1000))
     {
         fps = frames;
         frames = 0;
         marker = static_cast<int>(this_frame_index);
     }
 }

Are you sure that 1 ms is not enough? You see, I don't take the time before and after the loop and subtract them (which would obviously require nanosecond resolution); I'm only counting frames, and for that task it should be sufficient.

My_Mind_Is_Going: yup, I know that RK4 is much more stable than Euler, but I want to be able to throw at least a few dozen objects into a level (not to mention CPU particles), a number at which RK4 starts to perform at the amazing speed of seconds per frame...

Raghar:
Quote:
 BTW what's your timestep, how much copying are you doing, and where is that O(n^2) algorithm? 70 - 100 : 30 - 5 it's an increase of amount of object by half followed by decrease of speed by 6. It looks more like O(n^3) algorithm, even if I'd account for increased memory allocation.

You hit the nail right on the head! I really don't know what's causing it. I scanned through all the code and can't find anything that looks like O(n^2), let alone O(n^3). No copying done. Objects are allocated on the heap and virtual methods are used for Update and Render (but that can't be that slow!)

Here's the zip with all the src code (4 short files, 2 containing the actual physics code) and the compiled application. Maybe somebody could confirm whether I'm right or wrong? (++karma :-)) You can also see for yourself whether the numbers I gave are accurate.

Timestep = 0.01

jjd: I would like to know what a person with experience in the physics field thinks about this: is what's happening in my case normal...?

lonesock: thx for the tip, but jjd already corrected you :-) As for the ++ in the preview window: they don't show up, and that's normal(?) here on gamedev.

Share on other sites
Quote:
Original post by My_Mind_Is_Going
Quote:
 Original post by Raghar
 The speeds of SGDI (or NSV) and a sixth-order symplectic integrator are about... the sixth-order integrator is seven times slower than SGDI (NSV, symplectic Euler, or whatever you are calling that simple algorithm). So you might as well throw RK4 out of the window, it's crap anyway, and use a sixth-order integrator. BTW, what's your timestep, how much copying are you doing, and where is that O(n^2) algorithm? 70 -> 100 objects at 30 -> 5 fps is an increase in object count by half followed by a decrease in speed by a factor of 6. That looks more like an O(n^3) algorithm, even accounting for increased memory allocation.

I'm just curious, but why do you say RK4 is crap? I'm not really familiar with numerical methods as they apply specifically to games, but I've never heard of using a 6th-order method for, well, anything really.

Because when I compared it with a symplectic sixth-order integrator on an n-body problem, it blew to pieces incredibly quickly. (Basically it deviated from the optimal solution at exponential speed.) An eighth-order integrator would not fare much better, because the precision required for it to take full effect (long term) exceeds the dynamic range of 64-bit FP numbers, while a sixth-order integrator can still work with 64-bit FP numbers nicely.

Now, if you look at more common applications, these are either well solved by SGDI (NSV), or a sixth-order symplectic integrator produces smoother results than RK4 at the cost of a small increase in complexity. Basically, if you are willing to accept the disadvantages of a fourth-order integrator, you might just as well accept a symplectic sixth-order integrator; they are quite close in speed.

Share on other sites
I hope whatever you are doing works out well with Verlet. I have heard from other people as well that Verlet is useful for realtime integration of non-linear functions (ex: Newtonian gravitation, tire rotations per second).

I will try it out. Thank you for the code.

RK2 looks something like this:
// Where gravitation == acceleration.
inline void proceed_rk2(const double &dt)
{
    vector_3 k1_velocity = mercury.velocity;
    vector_3 k1_acceleration = gravitation(mercury.position, k1_velocity);

    vector_3 k2_velocity = mercury.velocity + k1_acceleration*dt*0.5;
    vector_3 k2_acceleration = gravitation(mercury.position + k1_velocity*dt*0.5, k2_velocity);

    mercury.velocity += k2_acceleration*dt;
    mercury.position += k2_velocity*dt;
}

[Edited by - taby on June 22, 2007 6:19:06 PM]

Share on other sites
I see now what Gish is.

I think that the author perhaps modelled three spheres of increasing blurriness (ex: metaballs), then formed a mesh around them using some isosurfacing algorithm like Marching Cubes, and shaded it with an algorithm that accounts for specularity (ex: a 2-tone toon shader).

Share on other sites
Quote:
Original post by Raghar
Quote:
Original post by My_Mind_Is_Going
Quote:
 Original post by Raghar
 The speeds of SGDI (or NSV) and a sixth-order symplectic integrator are about... the sixth-order integrator is seven times slower than SGDI (NSV, symplectic Euler, or whatever you are calling that simple algorithm). So you might as well throw RK4 out of the window, it's crap anyway, and use a sixth-order integrator.

I'm just curious, but why do you say RK4 is crap? I'm not really familiar with numerical methods as they apply specifically to games, but I've never heard of using a 6th-order method for, well, anything really.

Because when I compared it with a symplectic sixth-order integrator on an n-body problem, it blew to pieces incredibly quickly. (Basically it deviated from the optimal solution at exponential speed.) An eighth-order integrator would not fare much better, because the precision required for it to take full effect (long term) exceeds the dynamic range of 64-bit FP numbers, while a sixth-order integrator can still work with 64-bit FP numbers nicely.

It really, really isn't as simple as saying the higher-order integrator is the thing that made your simulation stable, or that higher order will always improve the stability or quality of results. Accuracy is not always associated with improved stability. It's great that this worked for you and that you are pleased with the results, but it probably wasn't the only solution. A low-order fully implicit integrator might also have been more stable, or an explicit upwind integrator. Every integrator has certain stability characteristics that determine the growth or decay of error based on inputs, and those characteristics change depending on the underlying governing equations being integrated. The goal, of course, is to choose an integrator that, for your particular governing equations, does not result in unbounded growth of error.

Share on other sites
Quite an interesting discussion so far; however, no one has directly addressed my question yet. Some folks who are quite advanced in physics have written here but gave no hints about what they think of the subject. This gives me the feeling that, for them, it's so obvious that integrating this number of objects is slow that they find it pointless to state directly.

So, let me rephrase it a little.
If you think that moving a few hundred objects on the screen, without any collision tests whatsoever and with a minimal amount of other computation, using the simplest integration schemes, should bring today's computers to their knees, there's no need to write here. Otherwise, please tell me that I am doing something wrong, since I feel a little stupid right now ("all those people know something that I don't"). You don't even have to tell me WHAT I'm doing wrong; a simple "you should be able to move 10x as many objects at 10x the FPS" will do, so I'll know I have to dig deeper for the truth.

Quote:
 Original post by taby
 I see now what Gish is. I think that the author perhaps modelled three spheres of increasing blurriness (ex: metaballs), then formed a mesh around them using some isosurfacing algorithm like Marching Cubes, and shaded it with an algorithm that accounts for specularity (ex: a 2-tone toon shader).

I recall reading that the jelly body was done using particles connected by springs, integrated with Verlet etc., just as presented here: http://www.gdmag.com/src/jun06.zip

Your approach might also work (heck, it might even work better), but IMHO it sounds like overkill :-)

Share on other sites
If you aren't doing any collision handling, and it is only simple time-step integration of independent bodies that is affecting your frame rate (as appears to be the case from your description and sample code), then I would guess you are doing something wrong. You should be able to handle a much larger number of objects with no noticeable effect on frame rate. I would estimate you should be able to run some thousands of objects, as long as all you are doing is time integration with perhaps a constant gravitational force or some other independent and cheaply computed force per object.

So... I have a suspicion that part of the problem may not be physics at all. It may in fact be a bottleneck on the rendering side. My assumption here is that you're using a traditional graphics API (OpenGL or Direct3D), perhaps within a game engine that you either wrote yourself or that is open source. Performance of these traditional APIs is heavily a function of geometry batching. How many draw calls do you have? If you essentially have one or more independent meshes *per* object, i.e., # of graphics objects with an independent mesh == # of physics objects, then that absolutely can cause your rendering to bottleneck at some point. And the # of objects doesn't have to be very large (a few hundred) before batching issues slow things down. It can be quite difficult to do batch management for modern graphics cards. But I may be telling you something you already know, or not providing enough detail. At least perhaps it is a clue.

The batching issue is real, but it doesn't necessarily explain why RK4 is so much slower than Euler in terms of measured fps; the rendering work shouldn't differ between RK4 and Euler. So, let's see if anything else might obviously be wrong. Looking at your code, you have a lot of lines like "Derivative2d a = Evaluate...". Here you are creating an object on the stack and initializing it from the return value of another function call. So, several function calls and a stack allocation per Derivative2d object just to construct the temporaries; and upon exiting the scope, the destructor for each object is called and the item is popped off the stack. In reality, for each of those objects you have several function calls (constructor, destructor, copy, the Evaluate function itself) plus a stack push and pop, just to deal with temporary objects. For RK4 you do all of this 4 times, so 16 function calls just for the derivatives, plus 8 stack alloc/push and free/pop operations. Additionally, you allocate your derivative values dxdt, etc. on the stack every time step. All of these object constructions/destructions, stack operations and function calls take time, and they alone can hurt performance. Clean it up: keep a container that is allocated once, holding all the derivative objects, temp values, etc., and reuse the preallocated objects or member variables every time step until you are ready to stop the simulation altogether. You may very well find that this is one of the root problems with the simple integration/no collision case.

To test whether the integration is the real cause vs. batching, try running the sim with nothing being rendered for the physics objects. Then, once you've solved the pure physics side, go back and work on the rendering side.

Please let me know if any of this actually produces a beneficial performance gain. It's a first step, and ultimately there can be memory cache hit/miss issues that creep in... I'd really like to hear how this works out!

Share on other sites
By the way, let me be the first to apologize for going off on a tangent. When talking about something of interest, it's easy to go with the flow. But it didn't help you.

Share on other sites
Yeah, you should be able to have ~10000 objects at once even with RK4. On my MacBook Pro 2.16 GHz I got about 5 fps for something like that. I disagree with grhodes_at_work's suggestions of function call/stack allocation optimizations. These are just micro-optimizations that aren't going to cause the huge decrease in performance that you're seeing. Look for O(n^2) operations, like another poster said.

Another thing: are you compiling in release mode? That would definitely speed things up.

Share on other sites
Please post the whole source. Point out areas that are suspect.

Share on other sites
Quote:
 Original post by Aressera
 I disagree with grhodes_at_work's suggestions of function call/stack allocation optimizations. These are just micro-optimizations that aren't going to cause the huge decrease in performance that you're seeing.

True, actually.

Share on other sites
Quote:
Original post by grhodes_at_work
Quote:
 Original post by Aressera
 I disagree with grhodes_at_work's suggestions of function call/stack allocation optimizations. These are just micro-optimizations that aren't going to cause the huge decrease in performance that you're seeing.

True, actually.

I agree. The static keyword is a programmer's good friend. The heuristics behind an optimizing compiler are fallible.

OMG!!!!!!

After some failed experiments (and a lot of swearing) I started playing with dt... to my total amusement, the fps started rising... here's the snapshot:
     // reaction: wow!
     dt = 60.0f / 1000.0f;   // 1260 objects at 30 fps
     // oh yeah
     dt = 200.0f / 1000.0f;  // 2400 at 30 fps
     // baby, this started working
     dt = 300.0f / 1000.0f;  // 30 fps: 2700 Euler, 2000 RK4
     // at this point I started sweating
     dt = 500.0f / 1000.0f;  // 3500 at 30 fps
     dt = 1.0f;              // 5000 at 30 fps (Euler)
     // shock
     dt = 5.0f;              // 15000 at 30 fps (Euler; 150 fps without rendering)
     // started praying
     dt = 15.0f;             // 20000 at 30 fps (250 fps without rendering, at fullscreen)

And there seems to be no end to it. In fact, I simply got bored of changing dt, pressing + to add objects and waiting for the program to exit (in the last scenario I had to wait a few minutes for all the objects to be released). However, I know it's not that easy, and that for the speed I'm sacrificing something else... probably stability, right?

Do you have any idea what the proper values of dt are, ones that give a balance of speed and stability?

(Btw, I wonder why Gaffer put such a small dt in his article originally and didn't mention how it affects framerate.)

grhodes_at_work: thanks for the in-depth explanation! You're right that I'm using OGL and submitting a huge number of very small batches (for every object, 1 rectangle rendered using glBegin... lol), but that's temporary and will change. Disabling rendering made performance jump from 30 to 250 fps, so there's a lot to do here.

Share on other sites
Welly well, your framerate problem has nothing to do with your integration code.

I took a look at your main loop and saw that you use a fixed timestep with an accumulator. That's all well and good, but the problem you are having is that if, for one frame, the rendering takes longer than your fixed timestep, multiple simulation steps will occur, further reducing your framerate. This vicious cycle degenerates until eventually you are performing way too many simulation steps for every frame you render.

The reason increasing your dt increases the perceived framerate is that you only update the physics once dt time has passed. By increasing dt you reduce the probability of the above problem occurring, and that's why you seem to get better framerates.

basically, that gaffer on games article is complete BS and that method that he uses for a fixed timestep does not work reliably.

Try doing a test without the fixed timestep (i.e. just pass your delta time straight to your integrator). I expect that this will fix your problems and you'll see about 5000 objects handled at a decent framerate.

Share on other sites
Wait a minute.
An integration step of 1/100 means it's done 100 times per second, i.e. every 10 ms. It could be done even less often. Now, if you used some arbitrary units rather than seconds and did your calculation in 1/100 of those arbitrary units, you might actually be doing a million operations per second per object. (In that case the exponential behaviour might simply be caused by data leaving the L1 and L2 caches.)

Even 14 updates per second should show noticeable jumps (without interpolation).

Is the resolution of your timer really 1 ms, or is it a high-precision one?