• entries
455
639
• views
422788

# More Particles

331 views

Particles.

Today was a bank holiday.
So I decided that as I'm not going to be about the rest of the weekend it would be a good time to do "bad things" [grin]

As I've mentioned before now I've been (on and off) working on a particle system type thing, but more 'off' than 'on' of late. Today I decided to crack on and get some stuff done on it.

I already had a base of some code which I had copied from the previous test bed app, so basic functions outlined however it was lacking substance. I decided to start with a structure called ParticleInformation which is a key structure in that it owns the pointers to all the data the particle system will run.

	struct ParticleInfomation 	{		float * x;		float * y;		float * scale;		float * momentumX;		float * momentumY;		float * velocityX;		float * velocityY;		float * age;		float * colourR;		float * colourG;		float * colourB;		float * colourA;		float * rotation;	}

This is how it started life today; a collection of pointers for each component. At this point you might be wondering why I'd do that and not just have something which looked more like this;

	struct ParticleInfomation 	{		vector2 position;		vector2 momentum;		vector2 velocity;		vector4 colour;		float scale;		float rotation;		float age;	}

The reason is simple; for particle processing this structure is about as bad as you can get CPU and cache wise. On a Core i7 920 each cache line is 64bytes in size with each core having 32Kb of L1 cache and 256KByte of L2 cache (which is shared per thread). That structure above, assuming it is packed properly, would take up 52bytpes on its own which would totally blow your L1 cache and you'd soon fill up L2 as well given that you can't even fit 5 of them in there; and given that memory latency is a problem the cache misses are going to add up and stall you nicely (a round trip to ram could take 600 odd clock cycles, which is costly). So processing one single particle is going to start doing very bad things to your cache pretty quickly and rules out the use of SSE as you can't get decent alignment and you've have to swizzle the values about to process them.

The structure I'm using however is setup in such a way that each component has its own block of memory which you can align on 16bytes meaning we are nicely setup for SSE right away. When it comes to cache usage things are a bit nicer as well; the algorithm in question for updating the particles can work on separate sets of data at a time. So for updating the ages we can just blast though the age data in one chunk and process 4 ages at once. So 16 bytes can be pulled in, worked on and then flushed back and with the aggressive pre-fetching on modern CPUs this should pipeline nicely and we can do more work per cycle.

Of course, the structure started life like that, then I got into the issues of life times and copying around as well as data creation and release and it soon became this;
	struct ParticleInfomation 	{		float * x;		float * y;		float * scale;		float * momentumX;		float * momentumY;		float * velocityX;		float * velocityY;		float * age;		float * colourR;		float * colourG;		float * colourB;		float * colourA;		float * rotation;	private:		volatile long * refCount;		void IncRefCount();		void DecRefCount();	public:		ParticleInfomation();		ParticleInfomation(const ParticleInfomation &rhs) ;		ParticleInfomation(ParticleInfomation &&rhs) ;		~ParticleInfomation();		void ReserveMemory(int size);	};

The particle information now maintains the data itself (ok, the pointers are still public but no one besides the particle system ever sees them and direct access is needed so that's fine), ensuring copying and moving (ParticleInfomation(ParticleInfomation &&rhs)) happen sanely.

Most functions work as expected so I'll not bother with them, I will touch briefly on my initial move support and detor.

ParticleInfomation::ParticleInfomation(ParticleInfomation &&rhs)		: x(rhs.x), y(rhs.y), scale(rhs.scale), momentumX(rhs.momentumX), momentumY(rhs.momentumY),  velocityX(rhs.velocityX), velocityY(rhs.velocityY), age(rhs.age), 		colourR(rhs.colourR), colourG(rhs.colourG), colourB(rhs.colourB), colourA(rhs.colourA), rotation(rhs.rotation), refCount(rhs.refCount)	{		rhs.x = NULL;		rhs.y = NULL; 		rhs.scale = NULL;		rhs.momentumX = NULL;		rhs.momentumY = NULL;		rhs.velocityX = NULL;		rhs.velocityY = NULL;		rhs.age = NULL;		rhs.colourR = NULL;		rhs.colourG = NULL;		rhs.colourB = NULL;		rhs.colourA = NULL;		rhs.rotation = NULL;		rhs.refCount = NULL;	}

This is the 'move' constructor; it transfers ownership from one ParticleInfomation instance to another. The main thing to note is that while it is setup as a normal copy-ctor with the initialiser list the right hand side's values are then nulled out, including the pointer to the refCount which is a pointer to an aligned chunk of memory which can be shared between copied instances. (Ok, so I've kind of recreated a smart pointer here, just a limited subset thereof).

When the class is copied or constructed it increments a reference count and when its destroyed it decrements one via these two functions;
void ParticleInfomation::IncRefCount(){	_InterlockedIncrement(refCount);}void ParticleInfomation::DecRefCount(){	_InterlockedDecrement(refCount);}

However, as you can see from the above code if ownership is transferred via a move operation the refCount variable has been nulled so the above is likely to explode; so we have to be a little bit careful in the dtor;
	ParticleInfomation::~ParticleInfomation()	{		if(refCount != NULL)		{			DecRefCount();			if(*refCount == 0)			{				_aligned_free(x);				_aligned_free(y);				_aligned_free(scale);				_aligned_free(momentumX);				_aligned_free(momentumY);				_aligned_free(velocityX);				_aligned_free(velocityY);				_aligned_free(age);				_aligned_free(colourR);				_aligned_free(colourG);				_aligned_free(colourB);				_aligned_free(colourA);				_aligned_free(rotation);				_aligned_free((void*)(refCount));			}		}	}

It probably isn't' completely 100% safe as yet, but it should do for now [grin]

With that out of the road I decided to move on to the first pass of the update logic.
I decided to split this over 3 functions;

- PreUpdate() - this would handle any new 'trigger' calls and setup blocks of particles and then figure out how many update 'blocks' we want to dispatch
- Update() - which is designed to be called from multiple threads to work on blocks of particles
- PostUpdate() - which will mainly remove any 'dead' particles and any other work required

At the time of writing PreUpdate is only partly done as I'm as yet to decide on how I want to structure the work requests for the update loop. It'll be some interface but I need to write that first [grin]

It does however have some basic code for emitting based on trigger requests;
	void ParticleEmitter::PreUpdate(float deltaTime)	{		// Services any required trigger calls		while(!queuedTriggers.empty())		{			const EmitterPosition & position = queuedTriggers.front();			if(details.maxParticleCount > usedParticles) 			{				usedParticles += EmitParticles(position);			}			queuedTriggers.pop();		}		// then work out block sizes and schedule those blocks for updates		//TODO: figure out what kind of task interface this is going to use	}

Pretty simple; for each request fire off some particles if we have the space. This is only a first pass because in the end I'd like two classes of particle systems;
- one shot, which is basically what this is
- constantly spawning, which needs slightly different PreUpdate logic however I figure it best to get the one shot working first and then refactor

The EmitParticles function does the main chunk of the work; again this is a first pass system as currently each particle spawns at a single point. Ideally this will be customisable via an affecter later on.
	int ParticleEmitter::EmitParticles(const EmitterPosition &position)	{		int realReleaseAmount = (usedParticles + details.releaseAmount > details.maxParticleCount ) ? details.maxParticleCount - usedParticles : details.releaseAmount;				SetParticleData(realReleaseAmount, 0.0f, particleData.momentumX + usedParticles);		SetParticleData(realReleaseAmount, 0.0f, particleData.momentumY + usedParticles);		SetParticleData(realReleaseAmount, position.x, particleData.x + usedParticles);		SetParticleData(realReleaseAmount, position.y, particleData.y + usedParticles);		SetParticleData(realReleaseAmount, details.maxLife, particleData.age + usedParticles);		SetParticleData(realReleaseAmount, details.scale, particleData.scale + usedParticles);		SetParticleData(realReleaseAmount, details.colour.r, particleData.colourR + usedParticles);		SetParticleData(realReleaseAmount, details.colour.g, particleData.colourG + usedParticles);		SetParticleData(realReleaseAmount, details.colour.b, particleData.colourB + usedParticles);		SetParticleData(realReleaseAmount, details.colour.a, particleData.colourA + usedParticles);		GenerateForces(realReleaseAmount, particleData.velocityX + usedParticles, particleData.velocityY + usedParticles);		return realReleaseAmount;	}	void ParticleEmitter::SetParticleData(int amount, float value, float * location)	{		std::fill_n(location, amount, value);	}	void ParticleEmitter::GenerateForces(int amount, float *locationX, float *locationY)	{		std::generate_n(locationX, amount, [this]() { return sin((float)this->rndNumGen()) * this->details.releaseSpeed; });		std::generate_n(locationY, amount, [this]() { return cos((float)this->rndNumGen()) * this->details.releaseSpeed; });	}

My love of lambda functions and the C++ Std. Lib can be seen on display here, mainly in the 'GenerateForces' function where a lambda is used to generate some velocities for release.

The Update function follows next, and it looks slightly simple....
	void ParticleEmitter::Update(float deltaTime, int blockStart, int blockEnd)	{		// Update a chunk of particles here		int len = blockEnd - blockStart;		int remain = len % 8;		int offset = blockEnd - remain;				// Deal with age subtraction first		UpdateAges(deltaTime, offset, remain);		// Next momentum += velocity; pos += momentum * deltaTime;		UpdatePositionsAndMomentums(len, deltaTime);		if(!details.colourModFuncs.empty())		{			UpdateColours(len, deltaTime);		}		if(!details.rotationModFuncs.empty())		{			UpdateRotations(len, deltaTime);		}	}

As mentioned this is designed to work on blocks of particles, which are always multiples of 4 but could be multiples of 8; remain is only used in the UpdateAges function as it is simple enough that I don't find myself running into SSE register issues on x86. If I do x64 versions of the other functions they will probably end up working in much the same way.

UpdateAges itself is quite simple;
	void ParticleEmitter::UpdateAges( float deltaTime, int offset, int remain )	{		__m128 time = _mm_load1_ps(&deltaTime);		for(int i = 0; i < offset; i+= 8)		{			__m128 ages = _mm_load_ps(particleData.age + i);			__m128 updatedAges = _mm_sub_ps(ages, time);			__m128 ages2 = _mm_load_ps(particleData.age + i + 4);			__m128 updatedAges2 = _mm_sub_ps(ages, time);			_mm_stream_ps(particleData.age + i,updatedAges);			_mm_stream_ps(particleData.age + i + 4,updatedAges2);		}		if(remain)		{			__m128 ages = _mm_load_ps(particleData.age + offset);			__m128 updatedAges = _mm_sub_ps(ages, time);			_mm_stream_ps(particleData.age + offset,updatedAges);		}	}

For each age in the particle system we simply read it in, subtract the deltaTime and stream it back out again. It might be worth modifying this function so that each read is half the range away to let us use the cache a bit better as currently it would consume a whole cache line on each loop.

UpdatePositionsAndMomentums is by far the most complex of these functions;
void ParticleEmitter::UpdatePositionsAndMomentums( int len, float deltaTime )	{		for(int i = 0; i < len; i+= 4)		{				__m128 posX = _mm_load_ps(particleData.x + i);			__m128 posY = _mm_load_ps(particleData.y + i);			__m128 velX = _mm_load_ps(particleData.velocityX + i);			__m128 velY = _mm_load_ps(particleData.velocityY + i);			if(!details.positionModFuncs.empty())			{				ParticlePosition position(posX, posY);				std::for_each(details.positionModFuncs.begin(), details.positionModFuncs.end(), [&](particlePositionModifierFunc &func)				{					particleForce force = func(position, this->position, deltaTime);					__m128 forceX = _mm_loadu_ps(force.x);					velX = _mm_add_ps(velX, forceX);					__m128 forceY = _mm_loadu_ps(force.y);					velY = _mm_add_ps(velY, forceY);				});			}			__m128 momentumsX = _mm_load_ps(particleData.momentumX + i);			__m128 momentumsY = _mm_load_ps(particleData.momentumY + i);			momentumsX = _mm_add_ps(momentumsX, velX);			momentumsY = _mm_add_ps(momentumsY, velY);			velX = _mm_setzero_ps();			velY = _mm_setzero_ps();			// store Velocity and momentum and reload position which should still live in cache			__m128 time = _mm_load1_ps(&deltaTime);			_mm_stream_ps(particleData.velocityX + i,velX);			_mm_stream_ps(particleData.velocityY + i,velY);			_mm_stream_ps(particleData.momentumX + i,momentumsX);			_mm_stream_ps(particleData.momentumY + i,momentumsY);			momentumsX = _mm_mul_ps(momentumsX, time);			momentumsY = _mm_mul_ps(momentumsY, time);			posX = _mm_add_ps(momentumsX, posX);			posY = _mm_add_ps(momentumsY, posY);			_mm_stream_ps(particleData.x + i, posX);			_mm_stream_ps(particleData.y + i, posY);		}	}

In theory we have enough registers here but it does somewhat assume the compiler can be smart about it.
The process is simple enough;
- read in positions and velocities
- give affecters a chance to adjust the velocities in some manner (more lambda trickery there)
- apply those velocities to the momentum of the particle
- update the position based on time

It only looks complex because we are doing 4 particles at a time.
I'm considering doing some benchmarking on this with regards to the 'std::for_each' placement. There is a chance that moving it out of the loop and having it just update the velocities for each particle before doing the rest of the loop might be better for it but without testing this method will do for now.

The UpdateColours and UpdateRotations functions are also pretty simple;
	void ParticleEmitter::UpdateColours( int len, float deltaTime )	{		for(int i = 0; i < len; i+= 4)		{			__m128 age = _mm_load_ps(particleData.age + i);			__m128 red = _mm_load_ps(particleData.colourR + i);			__m128 green = _mm_load_ps(particleData.colourG + i);			__m128 blue = _mm_load_ps(particleData.colourB + i);			__m128 alpha = _mm_load_ps(particleData.colourA + i);			std::for_each(details.colourModFuncs.begin(), details.colourModFuncs.end(), [&](particleColourModifierFunc &func)			{				Age ages(age);				ParticleColours colours(red, green, blue, alpha);				func(colours, ages, deltaTime);			});			_mm_stream_ps(particleData.colourR + i, red);			_mm_stream_ps(particleData.colourG + i, green);			_mm_stream_ps(particleData.colourB + i, blue);			_mm_stream_ps(particleData.colourA + i, alpha);		}	}	void ParticleEmitter::UpdateRotations( int len, float deltaTime )	{		for(int i = 0; i < len; i+= 4)		{			__m128 rotation = _mm_load_ps(particleData.rotation + i);			std::for_each(details.rotationModFuncs.begin(), details.rotationModFuncs.end(), [&rotation, &deltaTime](particleRotationModiferFunc &func)			{				particleRotations rot = func(deltaTime);				__m128 accumRotations = _mm_loadu_ps(rot.rotation);				rotation = _mm_add_ps(rotation, accumRotations);			});			_mm_stream_ps(particleData.rotation + i, rotation);		}	}

Simple read in, modify and write out affairs with no shocks to be had.

Which brings us to the final function of the day; PostUpdate.
This little gem is needed to retire dead particles. In my original test bed the method used as naive to say the least.
	int idx = 0;	while(idx < usedParticles)	{		if(particle_age[idx] >= particleLife)		{			particle_age[idx] = particle_age[usedParticles - 1];			particle_x[idx] = particle_x[usedParticles - 1];			particle_y[idx] = particle_y[usedParticles - 1];			particle_scale[idx] = particle_scale[usedParticles - 1];			particle_momentum_x[idx] = particle_momentum_x[usedParticles - 1];			particle_momentum_y[idx] = particle_momentum_y[usedParticles - 1];			particle_velocity_x[idx] = particle_velocity_x[usedParticles - 1];			particle_velocity_y[idx] = particle_velocity_y[usedParticles - 1];			usedParticles--;		}		else			idx++;	}

While simple enough and employing the old 'copy back' method of particle swapping it simply isn't very fast or friendly.

We move forward though the particles and, when we find a dead one, a particle from the back is copied into its place, usedParticles is decreased and that particle is tested. The problem is we end up wasting time and resources.

Firstly, we know we can copy in blocks of 4 because we enforce a policy of 'at least 4 particles are alive' to keep the SSE sane.
Secondly, this means that blocks of particles are going to retire together. At least 4 and maybe more if the particle system is a 'one shot' setup.
Thirdly, we are copying whole particle data chunks about and pulling data into cache lines when we might just be discarding it again anyway. That's a lot of memory waste right away.
Finally, when accessing data it is done in cache lines, which means on an i7 you'll pull in 64bytes every time you access new data outside the cache. So when you access data at the back you are pulling in that data and more which won't get used. Also, modern CPUs try to guess at what you are accessing next so might well try to read in the next cache line as well, just in case, which as you are going backwards you aren't going to touch.

With this in mind I came up with a new plan;
- First find the first dead particle using blocks of 4; we are only accessing the age data at this point, forward to nice and cache friendly
-- If we don't find any then bail out because everyone is still alive
-- If we do find a dead particle then remember that block and go looking for the end of it (still only age data at this point)
--- If we hit the end of the active particles before we find an alive particle then everything from the start to end is dead so we reduce the used count and bail
--- If we find an end then we mark it and walk though that data trying to find either another dead particle or the end of the active particles
---- As we move we copy blocks of 4 of age data back over the old dead particles. These are in cache anyway so this doesn't cost us (assuming std::move uses none cache polluting 'stream' under the hood. If it doesn't then it'll need replacing as it'll be hurting the cache again)
--- Once we find the end point we mark it, figure out the length and then block copy the rest of the particle data down over the old data
--- Then we update the 'dead block' start so it is at the end of the data copied down, set index to the next 'dead block' we found and loop back to step 2

While CPU wise it looks more complicate we are touching a lot less data frequently and the CPU will be happier doing block copies than it would be with the initial method.
The code however... is a bit more complex...
	int idx = 0;	// Search for first dead block of particles	while(idx < usedParticles && particleData.age[idx] > 0.0f)	{			idx +=4;				}	// - if not found then bail out	if(idx >= usedParticles)	{		return;	}	int deadBlockStart = idx;	while(idx < usedParticles)	{		// Now search for end of block		while(idx < usedParticles && particleData.age[idx] <= 0.0f)		{			idx+= 4;		}		// if not found then everything from start to end was dead so reduce count and bail		if(idx >= usedParticles)		{			usedParticles = (usedParticles - deadBlockStart);			break;		}		// tag where we go to for the 'alive' block		int aliveBlockStart = idx;		int movedParticleDest = deadBlockStart;		// now search for the end of this block of 'alive' particles		// moving blocks of 'ages' down as we go		while(idx < usedParticles && particleData.age[idx] > 0.0f)		{			float * start = particleData.age + idx;			float * end = start + 4;			float * dest = particleData.age + movedParticleDest;			std::move(start, end, dest);			idx+=4;			movedParticleDest+=4;		}		// copy 		int aliveBlockLength = movedParticleDest - aliveBlockStart;		MoveParticleBlockWithoutAge(aliveBlockStart, aliveBlockLength, deadBlockStart);		int oldUsedParticles = usedParticles;		usedParticles -= (aliveBlockStart - deadBlockStart);		// if we make it to the end then copy everything downwards		if(idx >= oldUsedParticles)		{			break;		}		// Set idx to last block we looked at which could contain 'dead' particles		idx = movedParticleDest;		// and move the 'start' up the amount we have just copied so we are now		// passed the end of the data we have just copied down.		deadBlockStart += aliveBlockLength;	}

The main wins will come when the whole particle system is dead or at least large chunks of it towards the end.

Having completed that I decided to call it a day; it needs testing and I still need to work on the rendering system (probably a Pre/Render/Post system as well) and generally get it hooked up to my multithreaded test bed to make sure the one shot system works before refactoring into a one shot and continuous system setup.

Still, a good days work was had a I feel, until the next update *salutes*

unsigned int ParticleInfomation::IncRefCount()
{
_InterlockedIncrement(refCount);
}

unsigned int ParticleInfomation::DecRefCount()
{
_InterlockedDecrement(refCount);
}

ParticleInfomation::~ParticleInfomation()
{
if(refCount && DecRefCount() == 0)
{
//delete here
}
}

In your code, there could have been a decrement on another thread in between leading to possibly multiple deconstructions. Also goes for any similar code in your constructor.

Ah, good catch, I knew there was something wrong with it [grin]

I'll look into upgrading the code when I'm back at my home machine in a few days, cheers [smile]

## Create an account

Register a new account