pcwlai

How to align data for SSE2? What's the most common adapted method?


To use SSE and SSE2 in transformations and other similar operations, we need to align data properly. I have read a lot of papers, books and tutorials. Some of them suggest (possibly following Intel's recommendation) using a structure of arrays like:

struct tagVERTEXLIST
{
    float x[noofvertex];  // or float* x;
    float y[noofvertex];  // or float* y;
    float z[noofvertex];  // or float* z;
    float nx[noofvertex]; // or float* nx;
    float ny[noofvertex]; // or float* ny;
    float nz[noofvertex]; // or float* nz;
    float u0[noofvertex]; // or float* u0;
    float v0[noofvertex]; // or float* v0;
    float u1[noofvertex]; // or float* u1;
    float v1[noofvertex]; // or float* v1;
} VertexList;

This is good when performing a lot of transforms or calculations with SSE/SSE2. And it can be fed to a vertex shader with different stream addressing. Quite flexible as well.

Though, mostly, when a vertex shader is used, there are few cases where we need to perform a lot of calculations on a per-vertex basis with the CPU. E.g. for soft skinning, we just need to perform the transformation on the bone matrices. And DirectX performs better with interleaved vertex buffers.

So, I would like to know: is the traditional per-vertex structure or the new structure of arrays more common? Or does this really depend on the situation, and if so, what are the judging criteria?

Quote:
Original post by pcwlai
To use SSE and SSE2 in transformations and other similar operations, we need to align data properly. I have read a lot of papers, books and tutorials. Some of them suggest (possibly following Intel's recommendation) using a structure of arrays like:

struct tagVERTEXLIST
{
[*struct insides removed*]

} VertexList;


OMG!1!1!11 Why would someone do that?

Seriously, what was Intel's original suggestion? And how was it explained?

Quote:
Original post by pcwlai
This is good when performing a lot of transforms or calculations with SSE/SSE2. And it can be fed to a vertex shader with different stream addressing. Quite flexible as well.


I can see how something like this:

typedef struct tagVertexList
{
    struct {
        float x,y,z;
    }
    positions[noofvertex]; // or pointer

    struct {
        float x,y,z;
    }
    normals[noofvertex]; // or pointer

    //...
}
VertexList;

could be easily used by GL, or (with only a little more tweaking) by D3D. But decoupling the x, y, z of a position? How would you transfer the data to the shader (without some dirty hacks, like transferring the position as three 1D texture coordinates)?


And, I'm doing it the traditional way: 4-float vertex.
It wastes some amount of space in system RAM, but I think it's worth it.
As for GPU transfers, I usually use the 4th coordinate of a vertex as a color, or a 1D texture coordinate, if I need it (and I usually do). So: on the CPU, I waste the 4th coordinate; on the GPU, it gets overwritten by something relevant, if possible.

Cheers.
~def

It's not particularly well-suited to GPU input as you've got a limited number of streams to work with - I think you're looking at 8 at the most - and it's pretty inefficient to have an input assembler channel push through only one value (as opposed to an XYZW tuple).

The structure you've shown is an instance of a technique known as "structure of arrays," and for SSE operation, you are correct that it is the best way to store your data. Be careful to ensure that your arrays each start on 16-byte boundaries (if you have a number of vertices that is not divisible by 4 then you need to add padding); that way you can use 'movaps' instead of 'movups' for a slight speed boost.
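
The padding and alignment rule above could be sketched in C roughly as follows. This is only an illustration under stated assumptions: `padded_count` and `alloc_component` are made-up helper names, and the code assumes a C11 runtime that provides `aligned_alloc` (with MSVC, `_aligned_malloc` and `_aligned_free` would be the analogue).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Round a vertex count up to the next multiple of 4, so each array can be
   processed in whole 4-float SSE registers. (Hypothetical helper.) */
static size_t padded_count(size_t n)
{
    return (n + 3u) & ~(size_t)3u;
}

/* Allocate one zero-filled component array (x, y, z, ...) whose start is
   16-byte aligned. Returns NULL on failure; release with free().
   (Hypothetical helper, assumes C11 aligned_alloc.) */
static float *alloc_component(size_t n)
{
    size_t padded = padded_count(n);
    if (padded == 0)
        padded = 4; /* avoid a zero-sized request */

    /* padded * sizeof(float) is a multiple of 16, as aligned_alloc requires. */
    float *p = aligned_alloc(16, padded * sizeof(float));
    if (p == NULL)
        return NULL;

    assert(((uintptr_t)p & 15u) == 0);    /* safe for movaps / _mm_load_ps */
    memset(p, 0, padded * sizeof(float)); /* padding lanes hold 0.0f */
    return p;
}
```

Each member of the structure of arrays (x, y, z, nx, ...) would get its own call, e.g. `float *x = alloc_component(noofvertex);`. Zeroing the padding keeps the unused lanes from producing NaNs or denormals when the last group of four is processed.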

As far as packing the data for output... you can convert 'xxxx','yyyy','zzzz','wwww' in four XMM registers into 'xyzw','xyzw','xyzw','xyzw' format by doing a matrix transpose, which IIRC is 8 shufps instructions. You will probably need to write code to convert from your structure-of-arrays format to an array-of-structures (which is what the GPU wants for optimal operation); this matrix transpose may help you a lot.
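
As a rough sketch of that conversion, the C version below uses the `_MM_TRANSPOSE4_PS` macro from `<xmmintrin.h>` rather than hand-written shuffles; which instructions the macro expands to depends on the compiler. The function name `soa_to_aos4` and the assumption that all five pointers are 16-byte aligned are illustrative, not something from the posts above.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Convert one group of 4 vertices from SoA (xxxx, yyyy, zzzz, wwww) to
   AoS (xyzw, xyzw, xyzw, xyzw). All pointers are assumed 16-byte aligned;
   out receives 16 floats. (Hypothetical helper.) */
static void soa_to_aos4(const float *x, const float *y,
                        const float *z, const float *w, float *out)
{
    assert((((uintptr_t)x | (uintptr_t)y | (uintptr_t)z |
             (uintptr_t)w | (uintptr_t)out) & 15u) == 0);

    __m128 r0 = _mm_load_ps(x); /* x0 x1 x2 x3 */
    __m128 r1 = _mm_load_ps(y); /* y0 y1 y2 y3 */
    __m128 r2 = _mm_load_ps(z); /* z0 z1 z2 z3 */
    __m128 r3 = _mm_load_ps(w); /* w0 w1 w2 w3 */

    _MM_TRANSPOSE4_PS(r0, r1, r2, r3); /* in-place 4x4 transpose */

    _mm_store_ps(out + 0,  r0); /* x0 y0 z0 w0 */
    _mm_store_ps(out + 4,  r1); /* x1 y1 z1 w1 */
    _mm_store_ps(out + 8,  r2); /* x2 y2 z2 w2 */
    _mm_store_ps(out + 12, r3); /* x3 y3 z3 w3 */
}
```

Calling this once per group of four vertices, after the CPU-side math is finished, yields an interleaved buffer that can be copied into a vertex buffer.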

Thanks a lot for your information, superpig!

There's always a trade-off between different approaches. It would be great if SSE4 supported swizzling. I am currently using the traditional structure-per-vertex method. It works great with vertex shaders.

However, I need to perform some calculations that rely on the CPU, and the SSE2 extensions seem to be a great performance boost for me. Afterwards, I will need to pass the results on to the vertex shader for further processing.

For now, I will choose the structure of arrays for my storage and choose a code path between the software vertex shader and the hardware vertex shader. I will also try to benchmark next week whether shuffling the output from CPU to GPU works better.

You have to understand the advantages of SoA in order to know when to use it. I would never ever submit SoA to the GPU, ever! SoA comes from the fact that it is really messy to do "horizontal" operations in SSE. Horizontal means that the x, y, or z components are interacting with each other, like in a dot or cross product. And at the same time, we're typically doing the same operation on multiple pieces of sequential data.

For instance, say you have 2 vectors stored x,y,z in 2 registers. To take their dot product your instructions might look like:

mulps
shufps
movhlps
addss
addss

with lots of dependency stalling. But if you knew you were going to do 4 dot products and registers were loaded SoA-style like xxxx yyyy zzzz you could compute all 4 dots like

mulps
mulps
mulps
addps
addps

still some stalling, but you've done 4x as much work! With cross products, the advantage is even better.
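
The same five-instruction pattern (three multiplies, two adds) can be written with intrinsics. This is a minimal sketch, not the poster's code: `dot4_soa` and `dot4_soa_mem` are made-up names, and the memory layout (three consecutive, 16-byte-aligned groups of four floats: xxxx yyyy zzzz) is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Four dot products at once: lane i of the result is a[i] . b[i].
   Compiles to 3 mulps + 2 addps. (Hypothetical helper.) */
static __m128 dot4_soa(__m128 ax, __m128 ay, __m128 az,
                       __m128 bx, __m128 by, __m128 bz)
{
    __m128 xx = _mm_mul_ps(ax, bx);
    __m128 yy = _mm_mul_ps(ay, by);
    __m128 zz = _mm_mul_ps(az, bz);
    return _mm_add_ps(_mm_add_ps(xx, yy), zz);
}

/* a and b each point at 12 floats laid out xxxx yyyy zzzz; out gets 4 floats.
   All three pointers are assumed 16-byte aligned. (Hypothetical helper.) */
static void dot4_soa_mem(const float *a, const float *b, float *out)
{
    assert((((uintptr_t)a | (uintptr_t)b | (uintptr_t)out) & 15u) == 0);

    __m128 d = dot4_soa(_mm_load_ps(a), _mm_load_ps(a + 4), _mm_load_ps(a + 8),
                        _mm_load_ps(b), _mm_load_ps(b + 4), _mm_load_ps(b + 8));
    _mm_store_ps(out, d);
}
```

No shuffle is needed anywhere, because every lane only ever combines with the same lane of another register.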

AoS necessarily wastes a lot of calculations because the result of w is rarely used. SoA also hides the latencies of 'ps' instructions better... but to really take advantage of SoA you need to think about your algorithm beforehand. It can be very hard, especially with only 8 registers, and 4 xyz vectors take up 3 of them!

If you wanted to do lots of transposes to make the GPU happy, that might be okay. I wrote it once, but it wasn't 8 shuffles... it was like 4 movaps, 2 movhlps, 2 movlhps, 2 movhps and 2 movlps or something. You could shave off 2 instructions from the 4x4 transpose if you're only doing 4x3.

Thanks for your kind reminder, ajas95.

Because of a previous engine design flaw, I need to perform a lot of morphing and transformation on the vertex arrays to animate them on the CPU. So, I am trying to optimize them with SSE2 and then pass them on to the GPU for rendering.

The MS DirectX SDK documentation also suggests using an array of structures. So, I was very confused at first.

I think I really need to implement both methods and do some thorough benchmarking.

Thanks for all the kind help.

Okay, well software skinning is a good enough way to show the difference between SoA and AoS. If you write this AoS, you want to store the skeleton output as a column-major matrix... The keyframes can be quats with translation, but the output should be column-major, aligned to 16 bytes, like:

__declspec(align(16))
struct vec4
{
    float x, y, z, pad;
};

__declspec(align(16))
struct bone // referenced as bone.x_axis etc. in the routine below
{
    vec4 x_axis;
    vec4 y_axis;
    vec4 z_axis;
    vec4 trans;
};

and your skinning routine is something like:

// eax is bone0 and ecx is bone1 if we're doing a 2-bone blend.
// esi is the blend factor of bone0, and edi is blend of bone1
// edx is the base-pose vertex we're blending.

movss xmm0, [edx]vec4.x
movss xmm1, [edx]vec4.y
movss xmm2, [edx]vec4.z

shufps xmm0, xmm0, 0 // start the x-axis dot product
shufps xmm1, xmm1, 0 // start the y-axis dot product
shufps xmm2, xmm2, 0 // start the z-axis dot product

// do the same thing again, rather than a movaps xmm3, xmm0 because
// that xmm0 shufps is not done yet.

movss xmm3, [edx]vec4.x
movss xmm4, [edx]vec4.y
movss xmm5, [edx]vec4.z

mulps xmm0, [eax]bone.x_axis // starting the matrix multiply.
mulps xmm1, [eax]bone.y_axis
mulps xmm2, [eax]bone.z_axis

shufps xmm3, xmm3, 0 // 2nd transform x-axis dot
shufps xmm4, xmm4, 0 // 2nd transform y-axis dot
shufps xmm5, xmm5, 0 // 2nd transform z-axis dot

addps xmm0, xmm1
addps xmm2, [eax]bone.trans // add in translation
movss xmm6, [esi] // first blend weight
movss xmm7, [edi] // second blend weight

addps xmm2, xmm0 // seems stally, but this will get re-ordered.
// finishes the matrix multiply.

mulps xmm3, [ecx]bone.x_axis // starting the second matrix multiply.
mulps xmm4, [ecx]bone.y_axis
mulps xmm5, [ecx]bone.z_axis

shufps xmm6, xmm6, 0 // shuffle the scale factor
shufps xmm7, xmm7, 0

addps xmm3, xmm4 // adding x to y, these ought to be done by now
addps xmm5, [ecx]bone.trans // add translation to second bone multiply.

mulps xmm6, xmm2 // scaling the matrix multiply by the blend factor.
// xmm2 stalled above, but it should be ready here.

addps xmm5, xmm3 // finish the 2nd matrix multiply.

// since this will certainly be in a loop, this is an excellent time to start
// loading the next verts, or else if you do 4-weight blending, you can reload
// this same vertex and multiply by the next 2 matrices. Note that xmm0-4 are
// all free now, so there is plenty of space. The last 2 instructions will
// stall but movss can get reordered to fill the pipeline.

mulps xmm7, xmm5 // scale 2nd matrix multiply.
// xmm5 now free to finish the next vert loading.

// do some next-iteration shufps here to wait for xmm7 result to be ready.

addps xmm6, xmm7 // final vertex position.



That is off the top of my head, but it looks alright. You'd need to add your own code to advance the vertex, matrix and blend weight pointers, but it should give you an idea of AoS. The big thing to notice is that the W component is never used! So we always waste 25% of our computations.
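
For readers who prefer intrinsics to inline assembly, the same AoS two-bone blend might look roughly like this in C. It is a sketch under assumptions, not a translation the poster supplied: `bone_cols`, `transform_point` and `skin2_aos` are made-up names, the bone columns are stored directly as `__m128`, and instruction scheduling is left to the compiler.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Column-major bone, one __m128 per column (hypothetical stand-in for the
   16-byte-aligned struct above; __m128 members are aligned automatically). */
typedef struct {
    __m128 x_axis, y_axis, z_axis, trans;
} bone_cols;

/* v points at x, y, z of one base-pose vertex (a 4th pad float is ignored).
   Result lanes: x', y', z', and an unused 4th lane. */
static __m128 transform_point(const bone_cols *b, const float *v)
{
    __m128 x = _mm_set1_ps(v[0]); /* broadcast, like movss + shufps ..., 0 */
    __m128 y = _mm_set1_ps(v[1]);
    __m128 z = _mm_set1_ps(v[2]);

    __m128 xy = _mm_add_ps(_mm_mul_ps(x, b->x_axis), _mm_mul_ps(y, b->y_axis));
    __m128 zt = _mm_add_ps(_mm_mul_ps(z, b->z_axis), b->trans);
    return _mm_add_ps(xy, zt);
}

/* Blend the vertex between two bones with weights w0 and w1. */
static __m128 skin2_aos(const bone_cols *b0, const bone_cols *b1,
                        float w0, float w1, const float *v)
{
    assert(b0 != 0 && b1 != 0 && v != 0);
    __m128 p0 = _mm_mul_ps(_mm_set1_ps(w0), transform_point(b0, v));
    __m128 p1 = _mm_mul_ps(_mm_set1_ps(w1), transform_point(b1, v));
    return _mm_add_ps(p0, p1);
}
```

As in the assembly, every multiply and add also processes a fourth lane whose result is thrown away.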

To do the same in SoA is a little trickier. We put more requirements on our data, like requiring that all the verts in a group of 4 are weighted to the same bones. In fact, this is okay, and in tools it's pretty easy to make these groupings. BUT we get to use 100% of our calculations.


__declspec(align(16))
struct comp_group
{
float a, b, c, d; // 4 components of separate vectors stored in a row.
};

struct bone
{
float x0, y0, z0, tx; // row-major. save some space.
float x1, y1, z1, ty; // x,y,z are ordinal axes.
float x2, y2, z2, tz;
};

struct vert_group
{
comp_group x;
comp_group y;
comp_group z;
};

// same register identifiers as above. If you think about doing scalar
// operations this is even easier than the AoS version. Thinking scalarly, it's
// x0*x + y0*y + z0*z + tx. To do that operation in SoA, you need to load
// in the x0 part scalarly and shufps across all 4 components. Note that
// the blend factors on the matrices need to be comp_groups now and not actual
// scalars

movss xmm0, [eax]bone.x0 // load bones into registers, since those need to be shufps'ed now.
movss xmm1, [eax]bone.y0
movss xmm2, [eax]bone.z0
movss xmm3, [eax]bone.tx

shufps xmm0, xmm0, 0 // shoofle.
shufps xmm1, xmm1, 0
shufps xmm2, xmm2, 0
shufps xmm3, xmm3, 0

movss xmm4, [eax]bone.x1 // now do multipliers for y-comp in vertex
movss xmm5, [eax]bone.y1
movss xmm6, [eax]bone.z1
movss xmm7, [eax]bone.ty

shufps xmm4, xmm4, 0 // shoofle.
shufps xmm5, xmm5, 0
shufps xmm6, xmm6, 0
shufps xmm7, xmm7, 0

mulps xmm0, [edx]vert_group.x // x0*x
mulps xmm1, [edx]vert_group.y // y0*y
mulps xmm2, [edx]vert_group.z // z0*z

addps xmm3, xmm0 // x0*x + tx, the proc will re-order the stall here.
addps xmm2, xmm1 // y0*y + z0*z

mulps xmm4, [edx]vert_group.x // x1*x
mulps xmm5, [edx]vert_group.y // y1*y
mulps xmm6, [edx]vert_group.z // z1*z

addps xmm3, xmm2 // x-component of result vector finished (not blended)
addps xmm7, xmm4 // x1*x + ty
addps xmm6, xmm5 // y1*y + z1*z

mulps xmm3, [esi] // blend x

// got some free registers, start loading in z-component multipliers.
movss xmm0, [eax]bone.x2
movss xmm1, [eax]bone.y2
movss xmm2, [eax]bone.z2
movss xmm4, [eax]bone.tz // need to use xmm4 instead of xmm3 because it's still in use.

addps xmm7, xmm6 // finished unblended y components.

shufps xmm0, xmm0, 0
shufps xmm1, xmm1, 0
shufps xmm2, xmm2, 0
shufps xmm4, xmm4, 0

mulps xmm0, [edx]vert_group.x // x2*x
mulps xmm1, [edx]vert_group.y // y2*y
mulps xmm2, [edx]vert_group.z // z2*z

mulps xmm7, [esi] // blend y-components.

addps xmm4, xmm0
addps xmm2, xmm1

// good time to start loading in next batch of data. xmm0, 1, 5, 6 all free here.

addps xmm4, xmm2 // xmm2 free.
mulps xmm4, [esi] // blend z-components.

// done with one matrix blend. outputs are x-comp in xmm3, y-comp in xmm7, z-comp in xmm4.



So, that SoA version only does a single bone transform, BUT it does it for 4 verts. If you stored your final animated skeleton such that each component was the same value repeated 4 times, you could eliminate all those shufps and it would probably go at least 25% faster. So the SoA version is faster in any case, but less flexible and bulkier. You can see how having only 8 registers is very cumbersome to work around... but none of the calculations are wasted.
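
The SoA single-bone transform can also be sketched with intrinsics, which makes the per-component formula (x0*x + y0*y + z0*z + tx, then the blend) easier to see. The names `bone_rows`, `vert_group4`, `row4` and `skin4_soa` are made up for this illustration, C11 `_Alignas(16)` stands in for `__declspec(align(16))`, and the blend weights are assumed to be 4 floats on a 16-byte boundary.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>

typedef struct {            /* row-major 3x4 bone, as in the post */
    float x0, y0, z0, tx;
    float x1, y1, z1, ty;
    float x2, y2, z2, tz;
} bone_rows;

typedef struct {            /* 4 vertices, one component per row */
    _Alignas(16) float x[4];
    _Alignas(16) float y[4];
    _Alignas(16) float z[4];
} vert_group4;

/* One output component for 4 vertices: (a*x + b*y + c*z + t) * weight. */
static __m128 row4(float a, float b, float c, float t,
                   __m128 x, __m128 y, __m128 z, __m128 w)
{
    __m128 ax = _mm_mul_ps(_mm_set1_ps(a), x);
    __m128 by = _mm_mul_ps(_mm_set1_ps(b), y);
    __m128 cz = _mm_mul_ps(_mm_set1_ps(c), z);
    __m128 sum = _mm_add_ps(_mm_add_ps(ax, _mm_set1_ps(t)), _mm_add_ps(by, cz));
    return _mm_mul_ps(sum, w);
}

/* Transform 4 vertices by one bone and scale by 4 per-vertex weights. */
static void skin4_soa(const bone_rows *b, const vert_group4 *in,
                      const float *weights, vert_group4 *out)
{
    assert(((uintptr_t)weights & 15u) == 0);

    __m128 x = _mm_load_ps(in->x);
    __m128 y = _mm_load_ps(in->y);
    __m128 z = _mm_load_ps(in->z);
    __m128 w = _mm_load_ps(weights);

    _mm_store_ps(out->x, row4(b->x0, b->y0, b->z0, b->tx, x, y, z, w));
    _mm_store_ps(out->y, row4(b->x1, b->y1, b->z1, b->ty, x, y, z, w));
    _mm_store_ps(out->z, row4(b->x2, b->y2, b->z2, b->tz, x, y, z, w));
}
```

For a multi-bone blend, the routine would be called once per bone and the outputs summed. Storing each bone value already repeated four times, as suggested above, would replace every `_mm_set1_ps` with a plain aligned load.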

Anyway, good luck. SSE is a moronic instruction set, but it is the only way to make fp math go fast on x86.

Adam
