# How to align data for SSE2? What's the most commonly adopted method?

## Recommended Posts

For using SSE and SSE2 in transformations and other similar operations, we need to align data properly. I read a lot of papers, books, and tutorials. Some of them suggest (or maybe it's Intel's suggestion) using a structure of arrays like:

```cpp
struct tagVERTEXLIST
{
    float x[noofvertex];  // or float* x;
    float y[noofvertex];  // or float* y;
    float z[noofvertex];  // or float* z;
    float nx[noofvertex]; // or float* nx;
    float ny[noofvertex]; // or float* ny;
    float nz[noofvertex]; // or float* nz;
    float u0[noofvertex]; // or float* u0;
    float v0[noofvertex]; // or float* v0;
    float u1[noofvertex]; // or float* u1;
    float v1[noofvertex]; // or float* v1;
} VertexList;
```

This is good when performing a lot of transforms or calculations with SSE/SSE2, and it can be fed to a vertex shader with different stream addressing, which is quite flexible as well. Mostly, though, when a vertex shader is used, there are few cases where we need to perform a lot of per-vertex calculations on the CPU; e.g. for soft skinning, we just need to perform the transformation on the bone matrices. And DirectX performs better with interleaved vertex buffers. So, I would like to know: is the traditional structure-per-vertex or the newer structure-of-arrays approach more common? Or does this really depend on the situation, and if so, what are the judging criteria?

##### Share on other sites
Quote:
 Original post by pcwlai: For using SSE and SSE2 in transformations and other similar operations, we need to align data properly. I read a lot of papers, books, and tutorials. Some of them suggest (or maybe it's Intel's suggestion) using a structure of arrays like: struct tagVERTEXLIST { [*struct insides removed*] } VertexList;

OMG!1!1!11 Why would someone do that?

Seriously, what was the original Intel's suggestion? And how was it explained?

Quote:
 Original post by pcwlai: This is good when performing a lot of transforms or calculations with SSE/SSE2. And this can be fed to a vertex shader with different stream addressing. Quite flexible as well.

I can see how something like this:
```cpp
typedef struct tagVertexList
{
    struct { float x, y, z; } positions[noofvertex]; // or pointer
    struct { float x, y, z; } normals[noofvertex];   // or pointer
    //...
} VertexList;
```
could be easily used by GL, or (with only a little more tweaking) by D3D. But decoupling the x, y, z of a position? How would you transfer the data to the shader (without some dirty hacks, like transferring the position as three 1D texture coordinates)?

And, I'm doing it the traditional way: 4-float vertex.
It wastes some amount of space in system RAM, but I think it's worth it.
As for GPU transfers, I usually use the 4th coordinate of a vertex as a color, or a 1D texture coordinate, if I need it (and I usually do). So: on the CPU, I waste the 4th coordinate; on the GPU, it's overwritten by something relevant, if possible.

Cheers.
~def

##### Share on other sites
It's not particularly well-suited to GPU input as you've got a limited number of streams to work with - I think you're looking at 8 at the most - and it's pretty inefficient to have an input assembler channel push through only one value (as opposed to an XYZW tuple).

The structure you've shown is an instance of a technique known as "structure of arrays," and for SSE operation, you are correct that it is the best way to store your data. Be careful to ensure that your arrays each start on 16-byte boundaries (if you have a number of vertices that is not divisible by 4 then you need to add padding); that way you can use 'movaps' instead of 'movups' for a slight speed boost.

As far as packing the data for output... you can convert 'xxxx','yyyy','zzzz','wwww' in four XMM registers into 'xyzw','xyzw','xyzw','xyzw' format by doing a matrix transpose, which IIRC is 8 shufps instructions. It's probably necessary to write code to convert from your structure-of-arrays format to an array-of-structures (which is what the GPU wants for optimal operation); this matrix transpose may help you a lot.

##### Share on other sites
Thanks a lot for your information, superpig!

There's always a trade-off between different approaches. It would be great if SSE4 will support swizzling. I am currently using the traditional structure-per-vertex method. It works great with vertex shaders.

Though, I need to perform some calculations that rely on the CPU, and those SSE2 extensions seem like a great performance boost for me. Afterward, I will need to pass the results on to the vertex shader for further processing.

Now, I will choose the structure of arrays for my storage and choose a code path between the software vertex shader and the hardware vertex shader. Though, I will try to benchmark next week whether shuffling the output from CPU to GPU works better.

##### Share on other sites
You have to understand the advantages of SoA in order to know when to use it. I would never, ever submit SoA to the GPU! SoA comes from the fact that it is really messy to do "horizontal" operations in SSE. Horizontal means that x, y, or z components interact with each other, as in a dot or cross product. And at the same time, we're typically doing the same operation on multiple pieces of sequential data.

For instance, say you have 2 vectors stored x,y,z in 2 registers. To take their dot product your instructions might look like:

mulps
shufps
movhlps

with lots of dependency stalling. But if you knew you were going to do 4 dot products and registers were loaded SoA-style like xxxx yyyy zzzz you could compute all 4 dots like

mulps
mulps
mulps

still some stalling, but you've done 4x as much work! With cross products, the advantage is even better.

AoS necessarily wastes a lot of calculations because the result of w is rarely used. SoA also hides the latencies of 'ps' instructions better... but to really take advantage of SoA you need to think about your algorithm beforehand. It can be very hard, especially with only 8 registers, and 4 xyz vectors take up 3 of them!

If you wanted to do lots of transposes to make the GPU happy, that might be okay. I wrote it once, but it wasn't 8 shuffles... it was something like 4 movaps, 2 movhlps, 2 movlhps, 2 movhps, and 2 movlps. You could shave 2 instructions off the 4x4 transpose if you're only doing 4x3.

##### Share on other sites
Thanks for your kind reminder, ajas95.

Because of a previous engine design flaw, I need to perform a lot of morphing and transformation on the vertex arrays to animate them on the CPU. So, I am trying to optimize them with SSE2 and then pass them on to the GPU for rendering.

MS's DirectX SDK documentation also suggests using an array of structures. So, I was very confused at first.

I think I really need to implement both methods and do a thorough benchmark.

Thanks for all the kind help.

##### Share on other sites
Okay, well, software skinning is a good enough way to show the difference between SoA and AoS. If you write this AoS, you want to store the skeleton output as column-major matrices... The keyframes can be quats with translation, but the output should be column-major, aligned to 16 bytes, like:
```cpp
__declspec(align(16)) struct vec4
{
    float x, y, z, pad;
};

__declspec(align(16)) struct Bone
{
    vec4 x_axis;
    vec4 y_axis;
    vec4 z_axis;
    vec4 trans;
};
```

and your skinning routine is something like:

```asm
// eax is bone0 and ecx is bone1 if we're doing a 2-bone blend.
// esi is the blend factor of bone0, and edi is the blend of bone1.
// edx is the base-pose vertex we're blending.
movss   xmm0, [edx]vec4.x
movss   xmm1, [edx]vec4.y
movss   xmm2, [edx]vec4.z
shufps  xmm0, xmm0, 0   // start the x-axis dot product
shufps  xmm1, xmm1, 0   // start the y-axis dot product
shufps  xmm2, xmm2, 0   // start the z-axis dot product
// do the same thing again, rather than a movaps xmm3, xmm0, because
// that xmm0 shufps is not done yet.
movss   xmm3, [edx]vec4.x
movss   xmm4, [edx]vec4.y
movss   xmm5, [edx]vec4.z
mulps   xmm0, [eax]bone.x_axis // starting the matrix multiply.
mulps   xmm1, [eax]bone.y_axis
mulps   xmm2, [eax]bone.z_axis
shufps  xmm3, xmm3, 0   // 2nd transform x-axis dot
shufps  xmm4, xmm4, 0   // 2nd transform y-axis dot
shufps  xmm5, xmm5, 0   // 2nd transform z-axis dot
addps   xmm0, xmm1
addps   xmm2, [eax]bone.trans  // add in translation
movss   xmm6, [esi]     // first blend weight
movss   xmm7, [edi]     // second blend weight
addps   xmm2, xmm0      // seems stally, but this will get re-ordered.
                        // finishes the matrix multiply.
mulps   xmm3, [ecx]bone.x_axis // starting the second matrix multiply.
mulps   xmm4, [ecx]bone.y_axis
mulps   xmm5, [ecx]bone.z_axis
shufps  xmm6, xmm6, 0   // shuffle the scale factor
shufps  xmm7, xmm7, 0
addps   xmm3, xmm4      // adding x to y, these ought to be done by now
addps   xmm5, [ecx]bone.trans // add translation to second bone multiply.
mulps   xmm6, xmm2      // scaling the matrix multiply by the blend factor.
                        // xmm2 stalled above, but it should be ready here.
addps   xmm5, xmm3      // finish the 2nd matrix multiply.
// since this will certainly be in a loop, this is an excellent time to start
// loading the next verts, or else if you do 4-weight blending, you can reload
// this same vertex and multiply by the next 2 matrices.  Note that xmm0-4 are
// all free now, so there is plenty of space.  The last 2 instructions will
// stall but movss can get reordered to fill the pipeline.
mulps   xmm7, xmm5      // scale 2nd matrix multiply.
                        // xmm5 now free to finish the next vert loading.
// do some next-iteration shufps here to wait for xmm7 result to be ready.
addps   xmm6, xmm7      // final vertex position.
```

That is off the top of my head, but it looks alright. You'd need to add your own code to advance the vertex, matrix and blend weight pointers, but it should give you an idea of AoS. The big thing to notice is that the W component is never used! So we always waste 25% of our computations.

To do the same in SoA is a little trickier. We put more requirements on our data, like all our verts in a group of 4 being weighted to the same bones. In fact, this is okay, and in tools it's pretty easy to make these groupings. BUT we get to use 100% of our calculations.

```cpp
__declspec(align(16)) struct comp_group
{
    float a, b, c, d;  // 4 components of separate vectors stored in a row.
};

struct bone
{
    float x0, y0, z0, tx;  // row-major.  save some space.
    float x1, y1, z1, ty;  // x, y, z are ordinal axes.
    float x2, y2, z2, tz;
};

struct vert_group
{
    comp_group x;
    comp_group y;
    comp_group z;
};
```

```asm
// same register identifiers as above.  If you think about doing scalar
// operations this is even easier than the AoS version.  Thinking scalarly, it's
// x0*x + y0*y + z0*z + tx.  To do that operation in SoA, you need to load
// in the x0 part scalarly and shufps across all 4 components.  Note that
// the blend factors on the matrices need to be comp_groups now and not actual
// scalars.
movss   xmm0, [eax]bone.x0 // load bones into registers, since those need to be shufps'ed now.
movss   xmm1, [eax]bone.y0
movss   xmm2, [eax]bone.z0
movss   xmm3, [eax]bone.tx
shufps  xmm0, xmm0, 0      // shoofle.
shufps  xmm1, xmm1, 0
shufps  xmm2, xmm2, 0
shufps  xmm3, xmm3, 0
movss   xmm4, [eax]bone.x1 // now do multipliers for y-comp in vertex
movss   xmm5, [eax]bone.y1
movss   xmm6, [eax]bone.z1
movss   xmm7, [eax]bone.ty
shufps  xmm4, xmm4, 0      // shoofle.
shufps  xmm5, xmm5, 0
shufps  xmm6, xmm6, 0
shufps  xmm7, xmm7, 0
mulps   xmm0, [edx]vert_group.x  // x0*x
mulps   xmm1, [edx]vert_group.y  // y0*y
mulps   xmm2, [edx]vert_group.z  // z0*z
addps   xmm3, xmm0         // x0*x + tx, the proc will re-order the stall here.
addps   xmm2, xmm1         // y0*y + z0*z
mulps   xmm4, [edx]vert_group.x  // x1*x
mulps   xmm5, [edx]vert_group.y  // y1*y
mulps   xmm6, [edx]vert_group.z  // z1*z
addps   xmm3, xmm2         // x-component of result vector finished (not blended)
addps   xmm7, xmm4         // x1*x + ty
addps   xmm6, xmm5         // y1*y + z1*z
mulps   xmm3, [esi]        // blend x
// got some free registers, start loading in z-component multipliers.
movss   xmm0, [eax]bone.x2
movss   xmm1, [eax]bone.y2
movss   xmm2, [eax]bone.z2
movss   xmm4, [eax]bone.tz // need to use xmm4 instead of xmm3 because it's still in use.
addps   xmm7, xmm6         // finished unblended y components.
shufps  xmm0, xmm0, 0
shufps  xmm1, xmm1, 0
shufps  xmm2, xmm2, 0
shufps  xmm4, xmm4, 0
mulps   xmm0, [edx]vert_group.x  // x2*x
mulps   xmm1, [edx]vert_group.y  // y2*y
mulps   xmm2, [edx]vert_group.z  // z2*z
mulps   xmm7, [esi]        // blend y-components.
addps   xmm4, xmm0
addps   xmm2, xmm1
// good time to start loading in next batch of data. xmm0, 1, 5, 6 all free here.
addps   xmm4, xmm2         // xmm2 free.
mulps   xmm4, [esi]        // blend z-components.
// done with one matrix blend.  outputs are x-comp in xmm3, y-comp in xmm7, z-comp in xmm4.
```

So, that SoA version only does a single bone transform, BUT it does it for 4 verts. If you stored your final animated skeleton such that each component was repeated 4 times, you could eliminate all those shufps, and it would probably go at least 25% faster. So the SoA version is faster in any case, but less flexible and bulkier. You see how having only 8 registers is very cumbersome to work around... but none of the calculations are wasted.

Anyway, good luck. SSE is a moronic instruction set, but it is the only way to make fp math go fast on x86.

##### Share on other sites
Thanks for your information and codes again, ajas95.

It really adds a lot to my experience. I will try to check out all the possibilities.
