n-body physics wrong after separating interleaved xy coordinates

I've been experimenting with my own n-body simulation for some time, and I recently discovered how to optimize it for efficient multithreading and vectorization with the Intel compiler.
It behaved exactly the same after I made it multithreaded, and it scaled very well on my ancient i7 3820 (4.3GHz).
Then I changed the interleaved xy coordinates to separate arrays for x and y to eliminate the strided loads and improve AVX scaling, copying the coordinates into an interleaved array for OpenTK to render as points.
Now the physics is all wrong: the points form clumps that interact with each other, but the clumps are unusually dense and accelerate faster than they decelerate, causing them to randomly fly off into the distance. After several seconds I get a NaN when two points somehow end up occupying exactly the same x and y float coordinates.

This is the C++ DLL:


#include "PPC.h"
#include <thread>
static const float G = 0.0000001F;
const int count = 4096;
__declspec(align(64)) float pointsx[count];
__declspec(align(64)) float pointsy[count];
void SetData(float* x, float* y){
	memcpy(pointsx, x, count * sizeof(float));
	memcpy(pointsy, y, count * sizeof(float));
}
void Compute(float* points, float* velx, float* vely, long pcount, float aspect, float zoom) {
#pragma omp parallel for
	for (auto i = 0; i < count; ++i) {
		auto forcex = 0.0F;
		auto forcey = 0.0F;
		for (auto j = 0; j < count; ++j) {
			if(j == i)continue;
			const auto distx = pointsx[i] - pointsx[j];
			const auto disty = pointsy[i] - pointsy[j];
			//if(px != px) continue;	//most efficient way to avoid a NaN failure
			const auto force = G / (distx * distx + disty * disty);
			forcex += distx * force;
			forcey += disty * force;
		}
		pointsx[i] += velx[i] -= forcex;
		pointsy[i] += vely[i] -= forcey;
		if (zoom != 1) {
			points[i * 2] = pointsx[i] * zoom / aspect;
			points[i * 2 + 1] = pointsy[i] * zoom;
		} else {
			points[i * 2] = pointsx[i] / aspect;
			points[i * 2 + 1] = pointsy[i];
		}
		/*points[i * 2] = pointsx[i];
		points[i * 2 + 1] = pointsy[i];*/
	}
}
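A note on the NaN: it appears when two points end up at identical coordinates, so distx * distx + disty * disty is zero and G / 0 produces inf, which becomes NaN as soon as it is multiplied by the zero distance. A common guard (not in the code above; EPS is just a placeholder value here, and the sketch uses the same G and array layout as the DLL) is a small softening term in the denominator:


// Sketch only: Plummer-style softening so coincident or near-coincident points
// can never produce an infinite force. EPS is a made-up value, tune to taste.
static const float EPS = 1e-6F;	// softening length squared

void AccumulateForce(const int i, const float* px, const float* py, const int n,
                     float& forcex, float& forcey) {
	forcex = 0.0F;
	forcey = 0.0F;
	for (auto j = 0; j < n; ++j) {
		if (j == i) continue;
		const auto distx = px[i] - px[j];
		const auto disty = py[i] - py[j];
		// EPS keeps the denominator non-zero even when the points coincide
		const auto force = G / (distx * distx + disty * disty + EPS);
		forcex += distx * force;
		forcey += disty * force;
	}
}
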

This is the relevant part of the C# OpenTK GameWindow:


private void PhysicsLoop(){
	while(true){
		if(stop){
			for(var i = 0; i < pcount; ++i) { velx[i] = vely[i] = 0F; }
		}
		if(reset){
			reset = false;
			var r = new Random();
			for(var i = 0; i < Startcount; ++i){
				do{
					pointsx[i] = (float)(r.NextDouble()*2.0F - 1.0F);
					pointsy[i] = (float)(r.NextDouble()*2.0F - 1.0F);
				} while(pointsx[i]*pointsx[i] + pointsy[i]*pointsy[i] > 1.0F);
				velx[i] = vely[i] = 0.0F;
			}
			NativeMethods.SetData(pointsx, pointsy);
			pcount = Startcount;
			buffersize = (IntPtr)(pcount*8);
		}
		are.WaitOne();
		NativeMethods.Compute(points0, velx, vely, pcount, aspect, zoom);
		// swap the front and back point buffers so the render thread uploads the one just written
		var pointstemp = points0;
		points0 = points1;
		points1 = pointstemp;
		are1.Set();
	}
}
protected override void OnRenderFrame(FrameEventArgs e){
	GL.Clear(ClearBufferMask.ColorBufferBit);
	GL.EnableVertexAttribArray(0);
	GL.BindBuffer(BufferTarget.ArrayBuffer, vbo);
	mre1.Wait();
	are1.WaitOne();
	GL.BufferData(BufferTarget.ArrayBuffer, buffersize, points1, BufferUsageHint.StaticDraw);
	are.Set();
	GL.VertexAttribPointer(0, 2, VertexAttribPointerType.Float, false, 0, 0);
	GL.DrawArrays(PrimitiveType.Points, 0, pcount);
	GL.DisableVertexAttribArray(0);
	SwapBuffers();
}

These are the array declarations:


private const int Startcount = 4096;
private readonly float[] pointsx = new float[Startcount];
private readonly float[] pointsy = new float[Startcount];
private float[] points0 = new float[Startcount*2];
private float[] points1 = new float[Startcount*2];
private readonly float[] velx = new float[Startcount];
private readonly float[] vely = new float[Startcount];

 

Edit 0: It seems that adding 3 zeros to G increases the accuracy of the simulation, but I'm at a loss as to why it's different without interleaved coordinates.
Edit 1: I somehow achieved an 8.3x performance increase with AVX over scalar with the new code above!

I don't have a solution to your problem, but I see a few opportunities for improving your code.

 


                pointsx[i] += velx[i] -= fx;
                pointsy[i] += vely[i] -= fy;

That's hard to read. Try to make life easy for the reader.

You don't need Math.sqrt() to check if a distance is greater than 1: Just check if the distance squared is greater than 1.

You shouldn't use goto where a more descriptive control-flow mechanism exists. In this case it's just a do-while loop.
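For example (just a C++ sketch of what I mean, not your code), the retry becomes a do-while and the test stays in squared distances:


#include <cstdlib>

// Keep drawing candidates until one lands inside the unit disc.
// Comparing the squared distance against 1 avoids the sqrt entirely.
void RandomPointInUnitDisc(float& x, float& y) {
	do {
		x = static_cast<float>(rand()) / RAND_MAX * 2.0F - 1.0F;
		y = static_cast<float>(rand()) / RAND_MAX * 2.0F - 1.0F;
	} while (x * x + y * y > 1.0F);	// squared distance test, no sqrt
}
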

You are mixing the computation of the updated positions and velocities with the details of the representation (aspect, zoom). That is poor taste. A function that is called Compute should not be in the business of doing any part of the rendering, and it should not know anything about how the positions will be mapped to a screen.
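To illustrate the separation (the names here are mine, not from your code), Compute would only advance the simulation, and a separate pass would map world coordinates into the interleaved buffer the renderer consumes:


// Hypothetical helper: all aspect/zoom handling lives here, outside the physics.
void CopyForRender(const float* px, const float* py, float* points,
                   const int n, const float aspect, const float zoom) {
	for (auto i = 0; i < n; ++i) {
		points[i * 2] = px[i] * zoom / aspect;
		points[i * 2 + 1] = py[i] * zoom;
	}
}
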

 

Oh, maybe I found it: You are updating positions and velocities inside the parallelized loop while other threads might still be reading them. There's no guarantee whether the other threads will read a position before or after its update, or even whether they'll read a valid position at all (maybe they'll get the old x but the new y). You need to store the new positions and velocities in separate arrays and do the update once the parallelized loop is over.
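Something along these lines (just a sketch, not your code, with G as in your DLL): the loop only reads the "old" arrays and only writes the "new" ones, so no iteration can observe a half-updated point:


// Read positions from oldx/oldy, write results to newx/newy; the caller swaps
// (or copies back) the arrays once the whole parallel loop has finished.
void Step(const float* oldx, const float* oldy, float* newx, float* newy,
          float* velx, float* vely, const int n) {
#pragma omp parallel for
	for (auto i = 0; i < n; ++i) {
		auto forcex = 0.0F;
		auto forcey = 0.0F;
		for (auto j = 0; j < n; ++j) {
			if (j == i) continue;
			const auto distx = oldx[i] - oldx[j];
			const auto disty = oldy[i] - oldy[j];
			const auto force = G / (distx * distx + disty * disty);
			forcex += distx * force;
			forcey += disty * force;
		}
		velx[i] -= forcex;	// each thread touches only its own i, so this is safe
		vely[i] -= forcey;
		newx[i] = oldx[i] + velx[i];
		newy[i] = oldy[i] + vely[i];
	}
}
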

 

 

 

1 hour ago, alvaro said:

I don't have a solution to your problem, but I see a few opportunities for improving your code.

I should have mentioned that it does the same thing without multithreading, and I did the same thing in the multithreaded version that works properly. I see what you mean though, and thanks for the tips.

I will update the OP with the changes I have made.

2 minutes ago, CommanderLake said:

I should have mentioned that it does the same thing without multithreading, and I did the same thing in the multithreaded version that works properly. I see what you mean though, and thanks for the tips.

I will update the OP with the changes I have made.

Updating the OP makes it difficult to follow the conversation, so please don't do that. You can just post updated code in another message.

Even in the single-threaded scenario, it's probably wrong to update the particle positions as you go.

Anyway, I suggest you start with 2 particles, then 3 and see if you can follow the calculations by hand. If you have the working version of the code somewhere else, you can make both versions spit out details of what they are doing so you can track where they diverge. Debugging is an important skill to develop, and this is a good opportunity to do so.
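For instance, a throwaway two-particle harness like this (the starting positions are made up) lets you compare the interleaved and separated versions term by term:


#include <cstdio>

// One force/velocity update for exactly two particles, printing every intermediate
// value so the numbers can be checked by hand against the other version.
int main() {
	const float G = 0.0000001F;
	float px[2] = { -0.5F, 0.5F };
	float py[2] = { 0.0F, 0.0F };
	float vx[2] = { 0.0F, 0.0F };
	float vy[2] = { 0.0F, 0.0F };
	for (int i = 0; i < 2; ++i) {
		const int j = 1 - i;
		const float distx = px[i] - px[j];
		const float disty = py[i] - py[j];
		const float force = G / (distx * distx + disty * disty);
		vx[i] -= distx * force;
		vy[i] -= disty * force;
		printf("i=%d dist=(%g, %g) force=%g vel=(%g, %g)\n",
		       i, distx, disty, force, vx[i], vy[i]);
	}
	return 0;
}
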

Good luck.

Oh well, I've changed it now.

You are right: I was using separate source and destination buffers up to a point, and I didn't realise what I was doing when I changed it!

It's fixed and it's faster:


#include "PPC.h"
#include <thread>
static const float G = 0.0000001F;
const int count = 4096;
__declspec(align(64)) float px0[count];
__declspec(align(64)) float py0[count];
__declspec(align(64)) float px1[count];
__declspec(align(64)) float py1[count];
float* pointsx0 = px0;
float* pointsy0 = py0;
float* pointsx1 = px1;
float* pointsy1 = py1;
void SetData(float* x, float* y){
	memcpy(px0, x, count * sizeof(float));
	memcpy(py0, y, count * sizeof(float));
	memcpy(px1, x, count * sizeof(float));
	memcpy(py1, y, count * sizeof(float));
}
void Compute(float* points, float* velx, float* vely, long pcount, float aspect, float zoom) {
#pragma omp parallel for
	for (auto i = 0; i < count; ++i) {
		auto forcex = 0.0F;
		auto forcey = 0.0F;
		for (auto j = 0; j < count; ++j) {
			if(j == i)continue;
			const auto distx = pointsx0[i] - pointsx0[j];
			const auto disty = pointsy0[i] - pointsy0[j];
			const auto force = G / (distx * distx + disty * disty);
			forcex += distx * force;
			forcey += disty * force;
		}
		pointsx1[i] = pointsx0[i] + (velx[i] -= forcex);
		pointsy1[i] = pointsy0[i] + (vely[i] -= forcey);
		if (zoom != 1) {
			points[i * 2] = pointsx1[i] * zoom / aspect;
			points[i * 2 + 1] = pointsy1[i] * zoom;
		} else {
			points[i * 2] = pointsx1[i] / aspect;
			points[i * 2 + 1] = pointsy1[i];
		}
	}
	// swap the read and write position buffers so the next call reads this frame's results
	const auto pxt = pointsx0;
	const auto pyt = pointsy0;
	pointsx0 = pointsx1;
	pointsy0 = pointsy1;
	pointsx1 = pxt;
	pointsy1 = pyt;
}

The code is now SO fast I can do 32768 points at 58 FPS on a server with two 14-core Xeon E5-2695 v3s, and it's absolutely mesmerising!

There is an algorithm called the fast multipole method which can do this much, much faster (it runs in O(n*log(n)) at worst, and perhaps even O(n)), but it's tremendously complicated. At some point I understood most of it, but at the time I was only interested in a 1D version of this problem. Maybe I'll give the 3D case a try someday, since it could make for a really cool demo with maybe 1 million particles.

 

I get the same framerate with 65536 points in an old NVIDIA CUDA demo on an overclocked, water-cooled GTX 1080, so the Xeons are doing about 25% as many FLOPS as the 1080, which is impressive.

I've seen the FMM mentioned a lot. It sounds like a nightmare to implement without a code sample. How accurate is it?

12 hours ago, CommanderLake said:

I've seen the FMM mentioned a lot. It sounds like a nightmare to implement without a code sample. How accurate is it?

 

It can be made as accurate as needed. If I remember correctly, there is a factor of O(log(1/epsilon)) in computation time if you require errors of order epsilon.

I'd love to add a third dimension and try it on my Vive. How complex would that be using OpenVR, just drawing simple points?

This topic is closed to new replies.
