# Particle transformation too slow?

Hi,

After trying to use instancing for my particle system (it's not really working right now, but that's another topic), I noticed a great improvement in the framerate: from about 3 FPS with 100,000 particles before to around 50 FPS now (in release mode). I know measuring in FPS isn't the best way, so let's say a frame now takes around 20 ms to compute, while before it took 333 ms. Quite an improvement! I knew there was still more to gain, because my code for the world transformation was kind of messed up. While trying to clean up, I noticed I had deleted the call that transforms the particles. I put it back in, and guess what: my framerate dropped to 13 FPS. I did some profiling and found that, as I expected, a lot of time was spent setting the transformation for each particle. This was the code before the optimizations:

    void CObjects::SetWorldTransformation(void)
    {
        D3DXMATRIX WorldMatrix, RotationMatrixX, RotationMatrixY, RotationMatrixZ,
                   TranslationMatrix, ScaleMatrix, ViewMatrix, ProjectionMatrix;
        float RotationX, RotationY, RotationZ = 0.0f;

        if(m_Rotation.x == 0.0f) m_Rotation.x = 360.0f;
        RotationX = DEGTORAD(m_Rotation.x);
        if(m_Rotation.y == 0.0f) m_Rotation.y = 360.0f;
        RotationY = DEGTORAD(m_Rotation.y);
        if(m_Rotation.z == 0.0f) m_Rotation.z = 360.0f;
        RotationZ = DEGTORAD(m_Rotation.z);

        D3DXMatrixRotationX(&RotationMatrixX, RotationX);
        D3DXMatrixRotationY(&RotationMatrixY, RotationY);
        D3DXMatrixRotationZ(&RotationMatrixZ, RotationZ);
        D3DXMatrixScaling(&ScaleMatrix, m_Scale.x, m_Scale.y, m_Scale.z);
        D3DXMatrixTranslation(&TranslationMatrix, 0.0f, 0.0f, 0.0f);

        D3DXMatrixMultiply(&WorldMatrix, &ScaleMatrix, &TranslationMatrix);
        D3DXMatrixMultiply(&WorldMatrix, &WorldMatrix, &RotationMatrixZ);
        D3DXMatrixMultiply(&WorldMatrix, &WorldMatrix, &RotationMatrixX);
        D3DXMatrixMultiply(&WorldMatrix, &WorldMatrix, &RotationMatrixY);

        D3DXMatrixTranslation(&TranslationMatrix, m_Position.x, m_Position.y, m_Position.z);
        D3DXMatrixMultiply(&WorldMatrix, &WorldMatrix, &TranslationMatrix);

        m_lpDevice->SetTransform(D3DTS_WORLD, &WorldMatrix);
        m_lpDevice->GetTransform(D3DTS_VIEW, &ViewMatrix);
        m_lpDevice->GetTransform(D3DTS_PROJECTION, &ProjectionMatrix);

        WorldMatrix = WorldMatrix * ViewMatrix * ProjectionMatrix;
        m_World = WorldMatrix;
    }

I know I did a lot of unnecessary work, like recomputing translation, rotation, etc. every time, even if the parameters hadn't changed. Then I used SetTransform, which should cost some performance, and later called GetTransform. I now avoid that by storing all the matrix data as members of the CObjects class (from which my particles inherit). So now the code looks like this:

    void CObjects::SetWorldTransformation(void)
    {
        m_WorldMatrix = m_ScaleMatrix;
        m_WorldMatrix *= m_RotationMatrixZ;
        m_WorldMatrix *= m_RotationMatrixX;
        m_WorldMatrix *= m_RotationMatrixY;
        m_WorldMatrix *= m_TranslationMatrix;

        m_lpDevice->GetTransform(D3DTS_VIEW, &m_ViewMatrix);
        m_lpDevice->GetTransform(D3DTS_PROJECTION, &m_ProjectionMatrix);

        m_World = m_WorldMatrix;
        m_ViewMatrix *= m_ProjectionMatrix;
        m_World *= m_ViewMatrix;
    }

I removed a lot of unnecessary calculations, not only in that code but in the emitter too (like calculations I accidentally performed per particle even though they could have been done once per frame). However, this is what I got afterwards: 15 FPS. That's not bad considering how low the framerate is, but it's not what I expected. I profiled again, and although the time taken by matrix calculations dropped significantly (they were taking up to 25% of CPU time before, now it's only about 5% at most), updating the particles still takes a lot of time. So now the obvious performance killer lies here:

    void CParticle::Update(void)
    {
        m_Velocity.y -= m_Gravity;
        m_Time += 1;
        Move(m_Velocity.x + cos((float)m_Time/4)/4*m_Float.x,
             m_Velocity.y,
             m_Velocity.z + cos((float)m_Time/4)/4*m_Float.y);
        SetWorldTransformation();
    }

    void CObjects::Move(float X, float Y, float Z)
    {
        m_Position.x += X;
        m_Position.y += Y;
        m_Position.z += Z;
        D3DXMatrixTranslation(&m_TranslationMatrix, m_Position.x, m_Position.y, m_Position.z);
    }

When I don't call Update() on any particle, the framerate goes up to 37. Excluding only SetWorldTransformation() raises it to 22-23, and excluding only Move() raises it to 17. Excluding both somehow only raises the framerate to 30; everything is fine only if I also exclude the velocity and time calculations.

So, does anyone know what I can do? Is it a mistake on my side, or is it something I have to live with? 100,000 particles is not a small number, but I'm not even using a vector or map, just a plain array, which should be faster as far as I know. Note: my particle system doesn't delete particles after a certain amount of time and replace them; it "resets" old particles instead. So there is no performance cost from inserting or deleting elements, and the resetting takes no measurable time (I tried excluding it and it doesn't gain me a single frame). Is there any way I can further optimize my transformations, or is there maybe a bug deeper in my system? Looping through all of the particles takes some time (if I exclude the complete loop I get about 120 FPS), but drawing doesn't cost much, so this problem is clearly a limitation of my CPU. Any ideas, please?

##### Share on other sites
First of all, eliminate your divisions. Instead of dividing by 4, multiply by 0.25. Your compiler may be optimizing this anyway, however, I'm not sure.

Secondly, you're performing cos((float)m_Time/4) multiple times. Trigonometric functions are expensive. Do it only once.

Thirdly, depending on your target platform, look into using SIMD (SSE2) operations for your matrix and vector transformations.
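The first two suggestions could look roughly like the sketch below. The `Particle` struct and both function names are hypothetical stand-ins for the poster's classes, reduced to the fields that matter here; the first function mirrors the arithmetic of the original `Update()`, and the second evaluates the cosine once and multiplies by 0.25 instead of dividing by 4.

```cpp
#include <cmath>

// Hypothetical stand-in for the poster's particle; only the fields used here.
struct Particle {
    float posX = 0.0f, posY = 0.0f, posZ = 0.0f;
    float velX = 0.0f, velY = 0.0f, velZ = 0.0f;
    float floatX = 0.0f, floatY = 0.0f;  // m_Float.x / m_Float.y in the original
    float gravity = 0.0f;
    int   time = 0;
};

// Same arithmetic as the original Update(): cos is evaluated twice per particle.
inline void UpdateNaive(Particle& p) {
    p.velY -= p.gravity;
    p.time += 1;
    p.posX += p.velX + std::cos((float)p.time / 4) / 4 * p.floatX;
    p.posY += p.velY;
    p.posZ += p.velZ + std::cos((float)p.time / 4) / 4 * p.floatY;
}

// Cosine hoisted into a local and divisions replaced by multiplications.
inline void UpdateHoisted(Particle& p) {
    p.velY -= p.gravity;
    p.time += 1;
    const float wobble = std::cos(p.time * 0.25f) * 0.25f;  // computed once
    p.posX += p.velX + wobble * p.floatX;
    p.posY += p.velY;
    p.posZ += p.velZ + wobble * p.floatY;
}
```

Both versions should produce the same positions up to floating-point rounding; whether the second is measurably faster depends on what the compiler already does with the first.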

##### Share on other sites
> First of all, eliminate your divisions. Instead of dividing by 4, multiply by 0.25. Your compiler may be optimizing this anyway, however, I'm not sure.
>
> Secondly, you're performing cos((float)m_Time/4) multiple times. Trigonometric functions are expensive. Do it only once.

Thanks, I changed the code so cos is only evaluated once, and it gave me a little boost of 1-2 FPS. At an average of 15 FPS, that's not bad.

> Thirdly, depending on your target platform, look into using SIMD (SSE2) operations for your matrix and vector transformations.

Thanks, I'll try SIMD out. But what about SSE3+? Should I stick to SSE2 like you suggested, or are there huge benefits in using newer SSE versions?

##### Share on other sites
Honestly, I don't know much about the details of SIMD implementations; I just know what they are and what they do. If you have SSE3+ available to you, I guess I don't see why you wouldn't use it. Maybe someone more knowledgeable can answer.

BTW, I don't know what your platform is, but if it's Windows you have the XNA Math library available to you. That's set up to use SIMD operations.

##### Share on other sites
Frankly, if you want a fast particle system then you are Doing It Wrong.

A class per particle is bad for many reasons, including:
• lack of ability to do SIMD transforms on multiple particles at once
• the cost of jumping into the 'update' function (made worse if you've made it virtual) to update one particle
• loading too much data for that update, most of it useless, thus thrashing the cache

That 'move' function looks very suspect... almost like you have one matrix per particle?!? If so... dear god... instead you'd want an 'emitter' which has a base position, with everything moving relative to that, and that is done via the GPU (frankly, if you have one matrix per particle I dread to see your drawing code...).

If you are after performance then you'll want to go and read up on Data-Oriented Design (aka DOD), for which particle systems are a perfect fit.

And, to give you an idea, on a 2.6 GHz i7, using 8 threads and SIMD updates, I managed to transform 1 million 2D particles in <4 ms using DOD methods and code constructs, which is ~1 ms for 250,000 particles (at which point I ran into problems uploading the data to the GPU to draw).
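As a rough illustration of the data-oriented layout described here, the sketch below keeps one contiguous array per attribute instead of one object per particle. `ParticleStore` and `UpdateAll` are hypothetical names, it is scalar and single-threaded, and it is not the code behind the timings quoted above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays store: one contiguous array per attribute,
// so an update pass only touches the data it actually needs.
struct ParticleStore {
    std::vector<float> x, y, velX, velY;

    explicit ParticleStore(std::size_t count)
        : x(count, 0.0f), y(count, 0.0f), velX(count, 0.0f), velY(count, 0.0f) {}

    std::size_t size() const { return x.size(); }
};

// Updates all particles in tight loops with no per-particle function call.
inline void UpdateAll(ParticleStore& p, float gravity, float dt) {
    const std::size_t n = p.size();
    for (std::size_t i = 0; i < n; ++i) p.velY[i] -= gravity * dt;   // apply gravity
    for (std::size_t i = 0; i < n; ++i) p.x[i] += p.velX[i] * dt;    // integrate x
    for (std::size_t i = 0; i < n; ++i) p.y[i] += p.velY[i] * dt;    // integrate y
}
```

Each loop walks memory linearly, which is the access pattern that caches and compiler auto-vectorisation tend to handle well.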

##### Share on other sites
The D3DX stuff already uses SIMD internally if it's available on the PC.

Assuming that you're just billboarding particles you don't need to eval a full transform for each. At the most basic level, all that you need as per-instance data is a 3-float position and a colour. Then you can extract the forward, up and right (or left as appropriate) vectors from your view matrix, set them once to your shader for all particles, and perform billboarding in your vertex shader. Faaaaaast.

Note that with particles fillrate is also going to be a major bottleneck, so even the lowest overhead elsewhere won't do much to ameliorate that.
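The billboarding math could be sketched on the CPU side as below; in practice the corner computation would live in the vertex shader. `Vec3`, `Mat4` and the function names are hypothetical, and the matrix is assumed to follow the D3DXMATRIX convention (row-major, row vectors), where the camera's right and up axes are the first and second columns of the view matrix.

```cpp
// Hypothetical minimal types; Mat4 mirrors D3DXMATRIX's row-major layout.
struct Vec3 { float x, y, z; };
struct Mat4 { float m[4][4]; };

// With row vectors, the camera's world-space axes are the columns of the view matrix.
inline Vec3 CameraRight(const Mat4& view) {
    return { view.m[0][0], view.m[1][0], view.m[2][0] };
}
inline Vec3 CameraUp(const Mat4& view) {
    return { view.m[0][1], view.m[1][1], view.m[2][1] };
}

// World-space position of one quad corner. cornerX and cornerY are -1 or +1.
// This is the per-vertex work a billboarding vertex shader would do.
inline Vec3 BillboardCorner(const Vec3& center, const Vec3& right, const Vec3& up,
                            float cornerX, float cornerY, float halfSize) {
    return {
        center.x + (right.x * cornerX + up.x * cornerY) * halfSize,
        center.y + (right.y * cornerX + up.y * cornerY) * halfSize,
        center.z + (right.z * cornerX + up.z * cornerY) * halfSize
    };
}
```

The right and up vectors are extracted once per frame and shared by every particle, so the per-instance data stays at a position and a colour.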

##### Share on other sites

> The D3DX stuff already uses SIMD internally if it's available on the PC.

While that is true, there are two levels of SIMD:
1. Using a maths function which uses SIMD instructions internally to speed things up (D3DX and XNA Math for example), which would otherwise have taken longer via the FPU
2. Laying out your data in such a way as to process multiple pieces of data at once in each operation

The first is easy to do, and with things like XNA Math around on the PC it should be something people consider doing all the time. However, while faster, it doesn't take advantage of the streaming nature as well as the second one can in certain cases, and a particle system is a good example of this.

Particle system updates are nicely separable, so what happens to the 'x' of a particle's position is independent of what happens to the 'y'. Using this knowledge, if we make two buffers, one which holds all the 'x' components and one which holds all the 'y', then using SSE we can do a single cache-friendly pass over the 'x' data and a single cache-friendly pass over the 'y' data and update four particles per instruction.

So, your update loop becomes something like:

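[source]
#include <xmmintrin.h> // SSE intrinsics (__m128, _mm_load_ps, ...)

struct ParticleInfomation
{
    float * __restrict x;
    float * __restrict y;
    float * __restrict scale;
    float * __restrict momentumX;
    float * __restrict momentumY;
    float * __restrict velocityX;
    float * __restrict velocityY;
    float * __restrict age;
    float * __restrict maxAge;
    float * __restrict colourR;
    float * __restrict colourG;
    float * __restrict colourB;
    float * __restrict colourA;
    float * __restrict rotation;
    // some functions for dealing with instances of this class
};

// i is the particle index, which advances by 4; there is always a multiple
// of 4 particles alive in the world if any are alive.
// 'time' is an __m128 holding the frame's time step in all four lanes
// (defined outside this snippet).

__m128 posX = _mm_load_ps(particleData.x + i);
__m128 posY = _mm_load_ps(particleData.y + i);
__m128 velX = _mm_load_ps(particleData.velocityX + i);
__m128 velY = _mm_load_ps(particleData.velocityY + i);

__m128 momentumsX = _mm_load_ps(particleData.momentumX + i);
__m128 momentumsY = _mm_load_ps(particleData.momentumY + i);

// ASSUMPTION: the listing as posted loads the velocities but never uses them;
// folding them into the momentum before clearing them is the presumed intent.
momentumsX = _mm_add_ps(momentumsX, velX);
momentumsY = _mm_add_ps(momentumsY, velY);

// The velocity has been consumed, so reset it for the next frame.
velX = _mm_setzero_ps();
velY = _mm_setzero_ps();

// Streaming stores bypass the cache for data not needed again this pass.
_mm_stream_ps(particleData.velocityX + i, velX);
_mm_stream_ps(particleData.velocityY + i, velY);
_mm_stream_ps(particleData.momentumX + i, momentumsX);
_mm_stream_ps(particleData.momentumY + i, momentumsY);

// Scale the momentum by the time step to get this frame's displacement.
momentumsX = _mm_mul_ps(momentumsX, time);
momentumsY = _mm_mul_ps(momentumsY, time);

// ASSUMPTION: the listing as posted stores the positions unchanged;
// adding the displacement here is the presumed intent.
posX = _mm_add_ps(posX, momentumsX);
posY = _mm_add_ps(posY, momentumsY);

_mm_store_ps(particleData.x + i, posX);
_mm_store_ps(particleData.y + i, posY);
[/source]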

Which processes both X and Y in groups of four particles at once.

Edit:

Oh, and I forgot to mention what makes this good;

- Firstly, the source memory is cache-line aligned, so that as soon as you touch the first block you read in a whole cache line of 'x' and 'y' values, which with an i7 and its 64-byte line size gets you 16 values, or four loop iterations over the data, at a time
- Secondly, because the CPU will prefetch data based on expected access patterns, chances are the next 8 values will also be in cache somewhere by the time you need them

I will say this much however, the routine above isn't completely finished and optimised and there is a chance that doing all the updates then doing all the position calcs would be faster and more cache friendly but I've not had a chance to look at it yet (nor get myself a decent profiler to find out for certain).
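To make the alignment point concrete, here is a small self-contained sketch of an aligned structure-of-arrays pass. It is a simplified position-plus-velocity integration rather than the routine above, the helper names are hypothetical, and it assumes an x86 target with SSE. With 4-byte floats, a 64-byte cache line holds 64 / 4 = 16 values, which is four iterations of this loop.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE: __m128, _mm_load_ps, _mm_store_ps, _mm_malloc

// Allocates 'count' floats on a 64-byte boundary (one cache line on the i7
// mentioned above). Release the memory with FreeAligned.
inline float* AllocAligned(std::size_t count) {
    return static_cast<float*>(_mm_malloc(count * sizeof(float), 64));
}
inline void FreeAligned(float* p) { _mm_free(p); }

// Advances positions by velocity * dt, four particles per iteration.
// Assumes 'count' is a multiple of 4 and both pointers are 16-byte aligned,
// which _mm_load_ps and _mm_store_ps require.
inline void Integrate(float* pos, const float* vel, std::size_t count, float dt) {
    assert(count % 4 == 0);
    const __m128 step = _mm_set1_ps(dt);           // dt broadcast to all four lanes
    for (std::size_t i = 0; i < count; i += 4) {
        __m128 p = _mm_load_ps(pos + i);           // four positions
        const __m128 v = _mm_load_ps(vel + i);     // four velocities
        p = _mm_add_ps(p, _mm_mul_ps(v, step));    // p += v * dt
        _mm_store_ps(pos + i, p);
    }
}
```

Calling `Integrate` once for the 'x' buffer and once for the 'y' buffer gives the two linear passes described earlier.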

##### Share on other sites
I personally let the GPU handle all particle transformations.

The post is a little ill-formatted (it was posted in the old forum format), but early last year I posted about another particle type where I included GPU transformation, scaling and rotation of particles (http://www.gamedev.net/topic/562581-billboarded-sprite-transform/page__p__4606417#entry4606417 for reference).

The big downside is the fat vertex data, since I had to include position, scale and rotation (around the Y axis). But it meant I could do one draw call per blend type for any number of particles and let the GPU transform, rotate and scale the particles.

##### Share on other sites

> The D3DX stuff already uses SIMD internally if it's available on the PC.
>
> Assuming that you're just billboarding particles you don't need to eval a full transform for each. At the most basic level, all that you need as per-instance data is a 3-float position and a colour. Then you can extract the forward, up and right (or left as appropriate) vectors from your view matrix, set them once to your shader for all particles, and perform billboarding in your vertex shader. Faaaaaast.
>
> Note that with particles fillrate is also going to be a major bottleneck, so even the lowest overhead elsewhere won't do much to ameliorate that.

XNAMath is what you want to use for SIMD. D3DXMath uses SIMD instructions but has to do a load/store conversion, while XNAMath explicitly uses optimally-aligned hardware register data types.
