SIMD SoA (structure of arrays) into VBO

Started by
7 comments, last by Yann L 15 years, 10 months ago
For my particle system I have recently switched over to SoA format to use SIMD improvements. Originally I used a vector filled with vertices for the buffer.

class XY
{
public:
	float x1;	float y1;
	float x2;	float y2;
	float x3;	float y3;
	float x4;	float y4;
};


This worked fine and dandy when I was using just vectors to hold all the data of my particles. But when I switched over, I needed to change that to SIMD SoA format so I didn't hurt the calculation speed.

class XY_SSE 
{
public:
	float *x1;	float *y1;
	float *x2;	float *y2;
	float *x3;	float *y3;
	float *x4;	float *y4;
public:
	XY_SSE()
	{
		x1  = (float*)_mm_malloc(MAX_PARTICLES * sizeof(float), BOUNDARY_ALIGNMENT);
		x2  = (float*)_mm_malloc(MAX_PARTICLES * sizeof(float), BOUNDARY_ALIGNMENT);
		x3  = (float*)_mm_malloc(MAX_PARTICLES * sizeof(float), BOUNDARY_ALIGNMENT);
		x4  = (float*)_mm_malloc(MAX_PARTICLES * sizeof(float), BOUNDARY_ALIGNMENT);
		y1  = (float*)_mm_malloc(MAX_PARTICLES * sizeof(float), BOUNDARY_ALIGNMENT);
		y2  = (float*)_mm_malloc(MAX_PARTICLES * sizeof(float), BOUNDARY_ALIGNMENT);
		y3  = (float*)_mm_malloc(MAX_PARTICLES * sizeof(float), BOUNDARY_ALIGNMENT);
		y4  = (float*)_mm_malloc(MAX_PARTICLES * sizeof(float), BOUNDARY_ALIGNMENT);
	}

	~XY_SSE()
	{
		_mm_free(x1);
		_mm_free(x2);
		_mm_free(x3);
		_mm_free(x4);
		_mm_free(y1);
		_mm_free(y2);
		_mm_free(y3);
		_mm_free(y4);
	}
};


Now the only problem is giving the coordinate data to the buffer. For some reason, I cannot get it to work.

		glEnable(GL_BLEND);
		glColor4f(1.0, 1.0, 1.0, cprf_Blend_I);
		glBindTexture(GL_TEXTURE_2D, Type.TexNum->imageDataNear);

		glEnableClientState( GL_VERTEX_ARRAY );					
		glEnableClientState( GL_TEXTURE_COORD_ARRAY );			

		glGenBuffersARB( 1, &m_nVBOVertices );						
		glBindBufferARB( GL_ARRAY_BUFFER_ARB, m_nVBOVertices );			
		// This was the original way of doing things; VBOVertices is just a std::vector<XY>.
		//glBufferDataARB( GL_ARRAY_BUFFER_ARB, pSize, &VBOVertices[0], GL_STREAM_DRAW_ARB );
		glBufferDataARB( GL_ARRAY_BUFFER_ARB, pSize, &vSSE, GL_STREAM_DRAW_ARB );

		glBindBufferARB( GL_ARRAY_BUFFER_ARB, m_nVBOVertices );
		glVertexPointer( 2, GL_FLOAT, 0, 0 );		
		glBindBufferARB( GL_ARRAY_BUFFER_ARB, m_nVBOTexCoords );
		glTexCoordPointer( 2, GL_FLOAT, 0, 0 );		

		glDrawArrays( GL_QUADS, 0, pSize4);

		glDeleteBuffersARB(1, &m_nVBOVertices);

		glDisableClientState( GL_VERTEX_ARRAY );				
		glDisableClientState( GL_TEXTURE_COORD_ARRAY );				

		glDisable(GL_BLEND);


vSSE is an XY_SSE SoA. I have tried everything from just vSSE, &vSSE, &*vSSE, &vSSE->x[0], and &vSSE[0], and none of them work. Some give me very weird results: oddly shaped squares that extend way past the edge of the screen. Any advice on how to get this to work properly would be much appreciated.

Jake

[Edited by - jake_Ghost on June 29, 2008 11:21:47 AM]
What is vSSE? How do you define it?

BufferDataARB expects a pointer to a contiguous block of elements. You're adding an extra layer of indirection, which means your data is not laid out that way.

Any reason for the 8 different blocks? Can you not just use one big one? If you need finer grained access, you can still keep pointers to offset positions within the block.
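
The "one big block with offset pointers" idea could be sketched roughly as follows. This is only an illustration under assumptions: the names `QuadCoords` and `kAlignment` are hypothetical, the 16-byte alignment is assumed to match the poster's BOUNDARY_ALIGNMENT, and C++17's `std::aligned_alloc` stands in for `_mm_malloc`.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical names: QuadCoords, kAlignment. One aligned block holds all
// eight coordinate streams; x[] and y[] are just offsets into that block.
struct QuadCoords
{
    static constexpr std::size_t kAlignment = 16; // assumed SSE alignment

    float *block = nullptr; // owns the single allocation
    float *x[4] = {};       // x[0]..x[3] play the role of x1..x4
    float *y[4] = {};       // y[0]..y[3] play the role of y1..y4
    std::size_t count = 0;  // particles per stream, rounded up

    explicit QuadCoords(std::size_t particles)
        // Round up to a multiple of 4 so every stream starts 16-byte aligned.
        : count((particles + 3) / 4 * 4)
    {
        block = static_cast<float *>(
            std::aligned_alloc(kAlignment, 8 * count * sizeof(float)));
        if (block == nullptr && count != 0)
            throw std::bad_alloc();
        for (int i = 0; i < 4; ++i)
        {
            x[i] = block + (2 * i) * count;     // streams 0, 2, 4, 6
            y[i] = block + (2 * i + 1) * count; // streams 1, 3, 5, 7
        }
    }

    ~QuadCoords() { std::free(block); }

    // Non-copyable, so the block is never freed twice.
    QuadCoords(const QuadCoords &) = delete;
    QuadCoords &operator=(const QuadCoords &) = delete;
};
```

With this layout there is one allocation and one free, and the SIMD code can still read and write each stream through `x[i]` and `y[i]`. Note that the block is still SoA, so on its own it does not give glVertexPointer the interleaved x/y pairs it expects.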
XY_SSE *vSSE;
vSSE = new XY_SSE();
MAX_PARTICLES 60000


For SIMD to work at its maximum speed, the data needs to be in SoA format. To fully utilize those calculations, my vertex array has to be in SoA format too, I think.

Here is a little bit of what is going on:

_mm_store_ps(&vSSE->x1, _mm_load_ps(&pSSE->x));
_mm_store_ps(&vSSE->x2, _mm_load_ps(&pSSE->x));
_mm_store_ps(&vSSE->x3, _mm_load_ps(&pSSE->xw));
_mm_store_ps(&vSSE->x4, _mm_load_ps(&pSSE->xw));
_mm_store_ps(&vSSE->y1, _mm_load_ps(&pSSE->y));
_mm_store_ps(&vSSE->y2, _mm_load_ps(&pSSE->yh));
_mm_store_ps(&vSSE->y3, _mm_load_ps(&pSSE->y));
_mm_store_ps(&vSSE->y4, _mm_load_ps(&pSSE->yh));


This code puts the new x and y values into their appropriate variables for the vertex array.
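
For readers unfamiliar with the pattern, here is a minimal sketch of how an SSE loop over SoA arrays might process four particles per iteration. It is not the poster's actual code: the function name `computeRightEdges` is hypothetical, and it assumes that `xw` holds `x + w`, that every pointer is 16-byte aligned, and that the particle count is a multiple of 4. It needs an x86 compiler that provides `<xmmintrin.h>`.

```cpp
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics (x86 only)

// Computes xw[i] = x[i] + w[i] for four particles per iteration.
// Assumptions: all pointers are 16-byte aligned, count is a multiple of 4.
void computeRightEdges(const float *x, const float *w, float *xw,
                       std::size_t count)
{
    for (std::size_t i = 0; i < count; i += 4)
    {
        const __m128 left  = _mm_load_ps(x + i); // x[i..i+3]
        const __m128 width = _mm_load_ps(w + i); // w[i..i+3]
        _mm_store_ps(xw + i, _mm_add_ps(left, width));
    }
}
```

The point of the SoA layout is visible here: each aligned load pulls in four consecutive values of the same component, so no shuffling is needed before the arithmetic.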
It's not good for performance to create and delete the VBO every frame.
OK, bear in mind I've never used SIMD or intrinsics, so I'm going to avoid any discussion of what you're doing from that angle. That said, I'm not convinced the way you're separating the data will help - though that obviously depends on the access pattern of the code that modifies it. Are you really sure that you'll find it more useful to have more x elements in cache than half as many x elements and their corresponding y element? This seems unlikely to me if your SIMD code does any vector math. Sure, it may be faster if you're simply filling each array, but is that really your usual usage case? If so, wouldn't a single block and a memcpy be faster?

As I said though I know little of such things, so I apologise if the above is nonsense. Back to VBO:

Simply put, BufferDataARB doesn't know about your pointers - it doesn't know their types, or whether they point to anything. Think of BufferDataARB being the same as memcpy - it takes a void* and copies a bunch of bytes. It knows nothing about what those bytes represent.

given:

struct something
{
	int* a;
	int* b;
};

memcpy (and BufferDataARB) will copy the pointers, but will not follow them or copy what they point to.
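
A small sketch of that behaviour, using plain memcpy since no GL context is needed to see it. The names `PointerPair` and `copyIsShallow` are hypothetical and chosen only for this illustration.

```cpp
#include <cstring>

// Hypothetical stand-in for the struct above.
struct PointerPair
{
    int *a;
    int *b;
};

// Copies the struct byte for byte, as memcpy (and a buffer upload) would.
// Returns true when the copy holds the same addresses as the source, i.e.
// the pointers themselves were copied, not the integers they point to.
bool copyIsShallow()
{
    int first = 1;
    int second = 2;
    PointerPair source{&first, &second};
    PointerPair copy{nullptr, nullptr};

    // Only sizeof(PointerPair) bytes move: two addresses, no int values.
    std::memcpy(&copy, &source, sizeof(PointerPair));

    return copy.a == &first && copy.b == &second;
}
```

Uploading `&vSSE` to a VBO does the same thing: the GPU receives a handful of CPU addresses, which it then interprets as float coordinates.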

You need to think about how your allocations are laid out in memory, and how that affects how you're going to access your data with OpenGL. For example, glVertexPointer is used with either a pointer (vertex arrays) or an offset into a VBO. It assumes the data it points to is contiguous and in groups of 2, 3, or 4 (xy, xyz, or xyzw). Your layout cannot provide this.
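
One way to bridge the two layouts, sketched here under assumptions, is to repack the SoA streams into a contiguous interleaved array on the CPU just before the upload. The helper name `interleaveQuads` is hypothetical, and the corner order simply follows x1/y1 through x4/y4 from the original post.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: packs eight SoA streams into
// x1 y1 x2 y2 x3 y3 x4 y4 per particle, i.e. contiguous groups of 2 floats.
std::vector<float> interleaveQuads(const float *const x[4],
                                   const float *const y[4],
                                   std::size_t particles)
{
    std::vector<float> out;
    out.reserve(particles * 8); // 4 corners * 2 floats per particle
    for (std::size_t p = 0; p < particles; ++p)
    {
        for (int corner = 0; corner < 4; ++corner)
        {
            out.push_back(x[corner][p]);
            out.push_back(y[corner][p]);
        }
    }
    return out;
}
```

The result could then be handed to glBufferDataARB with `out.size() * sizeof(float)` as the size and `out.data()` as the pointer. The extra copy every frame is a real cost, so whether it leaves any net gain from the SIMD update is something to measure.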

That said, you could use custom vertex attributes instead of glVertexPointer and reassemble the streams inside a vertex shader, but it's probably more work than it's worth.

Hope that helps
Quote:
That said, you could use custom vertex attributes instead of glVertexPointer and reassemble the streams inside a vertex shader, but it's probably more work than it's worth.

That's pretty much the only way to make a GPU swallow such a messed up vertex layout. And you're right, it's very far from optimal.

Jake, the layout you chose is highly incompatible with the way GPUs work, even if you fix the indirection bug. You have to set up several separate data streams and recomposite the vectors in the vertex shaders. The (probably rather small) performance gains you get from SIMD are going to be completely lost by the huge overhead of making the GPU accept this.

If you want really good performance on particle systems, I would suggest dropping SIMD completely. Move all particle calculations over to the GPU instead, using either OpenGL or CUDA.
v-man, thanks lol. That is now fixed :P.

mrbastard, thanks for explaining more clearly how a vertex array works. Now I see why it's not working at all and why I can get some really weird values out of it lol. For my SIMD calculations I don't have all the xs and ys, it's just:

float *x, *y, *w, *h, *xw, *yh;
float *Vel, *rVel;
float *Angle;


As for the VBOs, I think I may be a bit confused. ATM I've just been putting in every vertex for a quad, so x1 through x4 are each corner's x value. Is that the wrong way to do it?

In my loop, when it comes time to update the particle's vertex array values, it's like:

for (every particle)
{
    calculate forces and whatnot
    then set all the coords, basically looking like this
    x1y1       x2y2
    x3y3       x4y4
}
Quote: Original post by Yann L
Quote:
That said, you could use custom vertex attributes instead of glVertexPointer and reassemble the streams inside a vertex shader, but it's probably more work than it's worth.

That's pretty much the only way to make a GPU swallow such a messed up vertex layout. And you're right, it's very far from optimal.

Jake, the layout you chose is highly incompatible with the way GPUs work, even if you fix the indirection bug. You have to set up several separate data streams and recomposite the vectors in the vertex shaders. The (probably rather small) performance gains you get from SIMD are going to be completely lost by the huge overhead of making the GPU accept this.

If you want really good performance on particle systems, I would suggest dropping SIMD completely. Move all particle calculations over to the GPU instead, using either OpenGL or CUDA.


So basically that means I'm stuck with the good ol std::vector of coords then eh?

I don't really have the time or the patience to change everything over to the GPU lol. It took me long enough to get SIMD to work properly!

But thanks guys for clearing that up. I didn't know if it was possible to even use the SIMD SoA array in a VA but I know now it's not. Just looking to squeeze in every bit of performance I can with what limited knowledge I have.

Quote: Original post by jake_Ghost
So basically that means I'm stuck with the good ol std::vector of coords then eh?

If you don't want to move your particle calculations to the GPU, then yes, a standard array of vertex coordinates is the best choice. And don't use SIMD, it isn't worth the trouble. Just use a layout that is convenient to you and the GPU. And if one day you really decide that it's too slow, then go fully GPU.

Depending on what you want to achieve, you could also look into point sprites.

Quote: Original post by jake_Ghost
But thanks guys for clearing that up. I didn't know if it was possible to even use the SIMD SoA array in a VA but I know now it's not.

Well, it theoretically is, but it's neither practical nor efficient.

Quote: Original post by jake_Ghost
Just looking to squeeze in every bit of performance I can with what limited knowledge I have.

The problem is that going SIMD is very often the wrong way. Also, keep in mind that you have to retransfer all the geometry data to the GPU every single frame! Goodbye performance. You have a massively parallel processor right at your fingertips: the GPU. Especially with today's flexible shaders, you can get particle systems running entirely on the GPU that are orders of magnitude faster than anything you could ever dream of doing with SSE.

Anyway, whatever you do, remember to profile your results!
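
As a rough starting point for that kind of measurement, a timing helper along these lines could wrap the particle update (or the update plus upload). The name `averageMilliseconds` is hypothetical. It only measures CPU wall-clock time, so it will not capture GPU work that completes asynchronously.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` `iterations` times and returns the
// average wall-clock time per run in milliseconds.
template <typename Work>
double averageMilliseconds(Work &&work, int iterations)
{
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (int i = 0; i < iterations; ++i)
        work();
    const std::chrono::duration<double, std::milli> elapsed =
        Clock::now() - start;
    return elapsed.count() / iterations;
}
```

Timing the scalar std::vector version and the SSE version with the same particle count, over many iterations, gives a first indication of whether the SoA conversion is paying for itself.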

This topic is closed to new replies.
