# SSE2 optimisation


## Recommended Posts

I've never used the SSE intrinsics in MSVC before, so I'm looking for advice on whether my particular algorithm can be made faster. It involves determining if a plane intersects a convex polygon in 3D space. While the algorithm runs a very large number of times, I don't think I can parallelise across separate operations, as subsequent stages of the process modify the data set. Plus I'm using MSVC's STL so I don't have 16 byte alignment on my Vec3s.

The plane to convex polygon algorithm is as follows:

- Iterate over the polygon's points, calculating the point-to-plane distance of each
- The plane intersects the polygon if either:
  a) distances p and p+1 have opposite signs and both have magnitude greater than epsilon, or
  b) distance p is within epsilon of zero, but p-1 and p+1 have magnitude greater than epsilon and opposite signs
- An edge of the polygon lies on the plane if:
  a) distances p and p+1 are both within epsilon of zero

First I was thinking of using intrinsics to evaluate dv = (planePt - v).dot(planeN) each iteration, but maybe it would be better to calculate the operation for 4 points in parallel. But I don't know if the above process can be adapted to SSE, or if it's even worth trying.

Could anyone with experience using SSE give me some advice here? Is this an avenue worth pursuing at all?
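For reference, the classification above can be written as a plain scalar pass first (a sketch only; the names `planeIntersectsPolygon` and `planeDist` are illustrative, not from any library, and the per-edge sign-pair rules collapse to a "vertices strictly on both sides" test for a convex polygon):

```cpp
#include <vector>

struct Vec3 { float x, y, z; };

static float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Signed distance of v from the plane through planePt with normal planeN.
static float planeDist(const Vec3& planePt, const Vec3& planeN, const Vec3& v)
{
    Vec3 d = { v.x - planePt.x, v.y - planePt.y, v.z - planePt.z };
    return dot(planeN, d);
}

// For a convex polygon, the plane cuts the polygon iff some vertex lies
// more than epsilon on each side of the plane.
bool planeIntersectsPolygon(const Vec3& planePt, const Vec3& planeN,
                            const std::vector<Vec3>& poly, float epsilon)
{
    bool hasPos = false, hasNeg = false;
    for (const Vec3& v : poly) {
        float d = planeDist(planePt, planeN, v);
        if (d > epsilon)       hasPos = true;
        else if (d < -epsilon) hasNeg = true;
    }
    return hasPos && hasNeg;
}
```

This is the loop a SIMD version would need to reproduce: the distance computation vectorises naturally, while the sign bookkeeping can be done with masks or in a scalar second pass.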

##### Share on other sites
Another thing to think about -- do you only ever need to test one plane against one polygon? Or do you have many planes, and many polygons that you need to test?

If so, you can use SSE to do something like testing a batch of 4 different planes against 1 polygon at once.

Ideally, you'd have an input array of planes, an input array of polygons, and an output array of results. If that's the structure of the problem, then you can usually find a way to do parts of it in parallel (SIMD and/or multithreading).
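That batching idea could be sketched like this (assumed layout and names, not from the original post: four planes stored structure-of-arrays, one plane per SSE lane, tested against one point at a time):

```cpp
#include <xmmintrin.h>

// Four planes stored SoA: normal components and d terms, one plane per lane.
// Plane equation per lane: dot(n, p) + d.
struct Planes4 {
    __m128 nx, ny, nz, d;
};

// Signed distances of one point to all four planes in a single pass.
static __m128 pointPlaneDist4(const Planes4& pl, float px, float py, float pz)
{
    __m128 x = _mm_set1_ps(px);
    __m128 y = _mm_set1_ps(py);
    __m128 z = _mm_set1_ps(pz);
    __m128 r = _mm_mul_ps(pl.nx, x);
    r = _mm_add_ps(r, _mm_mul_ps(pl.ny, y));
    r = _mm_add_ps(r, _mm_mul_ps(pl.nz, z));
    return _mm_add_ps(r, pl.d);             // one distance per lane
}
```

The same shape works for four points against one plane; which direction to batch depends on which arrays are long in practice.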

##### Share on other sites
This is an algorithm that is relatively easy to implement using SSE2. Consider looking at the XNA math h/inl files in the DirectX distribution to see how standard operations (such as dot product) are implemented. Regarding STL and 16-byte alignment, you will need to write custom allocators to get the alignment you need. Such allocators can call _mm_malloc. You also need 'new' operators to get proper alignment when allocating arrays of 16-byte aligned quantities.

##### Share on other sites
I've been giving it some thought and I think my best bet is to vectorise the loop of each plane/face intersection. I should be able to calculate the distance each vertex is to the plane, set values to zero if they're within epsilon, and store them all in an array. Then I'll do a second pass (without SSE) to detect the conditions of intersection.

How does one go about vectorising this operation using up to SSE2?

```cpp
if (abs(dist) < epsilon) dist = 0;
```

This is an algorithm that is relatively easy to implement using SSE2. Consider looking at the XNA math h/inl files in the DirectX distribution to see how standard operations (such as dot product) are implemented. Regarding STL and 16-byte alignment, you will need to write custom allocators to get the alignment you need. Such allocators can call _mm_malloc. You also need 'new' operators to get proper alignment when allocating arrays of 16-byte aligned quantities.

IIRC custom allocators don't solve the problems Visual Studio's STL has with trying to force 16 byte alignment. You have to use a different implementation entirely. Although I have upgraded to VS2010. Does anyone have information on whether they removed that limitation?

##### Share on other sites
> How does one go about vectorising this operation using up to SSE2?
> `if(abs(dist) < epsilon) dist = 0;`

I'm no SIMD expert, but a starting idea might be:

To get the absolute value:
1. Load 4 distance values into `a`.
2. Load `0.0f` into the four slots of `b`.
3. Compare `a` and `b` to generate a `mask` indicating which distances are below zero.
4. Load `1.0f` into the four slots of `c`.
5. Load `-1.0f` into the four slots of `d`.
6. Use the `mask` to select between `c` and `d`, and set `a = a * selected`.

To set any "below epsilon" values to zero:
1. Load `epsilon` into the four slots of `b`.
2. Compare `a` and `b` to generate a `mask` indicating which distances are at or above epsilon.
3. Set `a = a & mask`.

AFAIK, there are SSE intrinsics for all of the above operations.
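In intrinsic form (floats, SSE1/2 only; a sketch with an illustrative function name), the absolute value can skip the ±1 multiply entirely by masking off the sign bit:

```cpp
#include <xmmintrin.h>

// Zero every lane of dist whose magnitude is below epsilon.
static __m128 clampSmallToZero(__m128 dist, float epsilon)
{
    const __m128 signMask = _mm_set1_ps(-0.0f);          // 0x80000000 in each lane
    __m128 absDist = _mm_andnot_ps(signMask, dist);      // clear the sign bit -> |dist|
    __m128 keep = _mm_cmpge_ps(absDist, _mm_set1_ps(epsilon)); // all-ones where |dist| >= eps
    return _mm_and_ps(dist, keep);                       // zero the "small" lanes
}
```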

##### Share on other sites

> How does one go about vectorising this operation using up to SSE2?
> `if(abs(dist) < epsilon) dist = 0;`

The question is: do you really need this? I'm not entirely sure what the intent is. In a physics simulation, I'd assume this is for stability, so that, for example, something you drop on the floor won't slide around forever (in which case the body should be removed from the simulation anyway, though). For plane intersection I'm not sure what good it really does.

If something is at a representable (i.e. non-zero) distance from a clip/frustum plane, you would probably want to know that distance, so you correctly count objects as "in", to be on the conservative side. Otherwise you could get visible artefacts (probably small ones, but still). The same goes for collisions. You'd probably rather be conservative than have objects collide "through" and "into" each other (though the difference likely won't be visible).

Otherwise, if your concern is the possible performance impact of denormals, you could get away with enabling the denormals-are-zero and flush-to-zero flags, so the hardware will treat denormals as zero with no extra code at all.
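Setting those flags is one macro each — FTZ comes with xmmintrin.h, while the DAZ macro lives in pmmintrin.h; both poke the per-thread MXCSR register, so they must be set on every thread that needs them (a sketch; the function name is illustrative):

```cpp
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

// Treat denormal results (FTZ) and denormal inputs (DAZ) as zero on this thread.
static void enableFlushDenormals()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```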

##### Share on other sites

> How does one go about vectorising this operation using up to SSE2?
> `if(abs(dist) < epsilon) dist = 0;`
>
> The question is: do you really need this? [...] Otherwise, if your concern is the possible performance impact of denormals, you could get away with enabling the denormals-are-zero and flush-to-zero flags, so the hardware will treat denormals as zero with no extra code at all.

Some of my algorithms aren't stable without using epsilons. For example, when a convex face is bisected by a plane, the resulting pieces should also be convex. But this is not guaranteed when calculations are limited by floating point precision.
Without epsilons, certain face splits will produce degenerate polygons which are too small to be visible, but inhibit other parts of the software. There are other examples. I think in certain situations using epsilons is unavoidable when working with finite precision values.

##### Share on other sites

> How does one go about vectorising this operation using up to SSE2?
> `if(abs(dist) < epsilon) dist = 0;`

Depends whether you are talking about floats, integers, or doubles. Since you keep mentioning SSE2, I assume you mean double? Pretty simple really: just use bitwise selection (as is the case with any comparison done in SSE). Float and int work in a similar way; just make sure you choose the right mask value for abs.

```cpp
const uint64 mask = 0x7FFFFFFFFFFFFFFF;
const __m128d dAbsMask = _mm_set1_pd(*(const double*)((const void*)&mask));
__m128d dist    = whateverTheValueIs();
__m128d epsilon = whateverTheEpsilonIs();
__m128d distAbs = _mm_and_pd(dist, dAbsMask);      // clear the sign bit -> |dist|
__m128d cmp     = _mm_cmpge_pd(distAbs, epsilon);  // all-ones where |dist| >= epsilon
dist = _mm_and_pd(dist, cmp);                      // zero the below-epsilon lanes
```

> IIRC custom allocators don't solve the problems Visual Studio's STL has with trying to force 16 byte alignment. You have to use a different implementation entirely. Although I have upgraded to VS2010. Does anyone have information on whether they removed that limitation?

The problem is due to std::vector<>::resize(). Change the second arg to a reference instead of a value, and then with allocators it's possible to use std::vector. There are however a number of reasons why that sucks royally as a solution - the only one I care about frankly, is that when working in a team, it's a bit of a pita to get everyone to hand modify the default VC++ headers. It's far far simpler to avoid mixing STL and SSE, and just leave it at that....

> I've never used the SSE intrinsics in MSVC before, so I'm looking for advice on whether my particular algorithm can be made faster. It involves determining if a plane intersects a convex polygon in 3D space. While the algorithm runs a very large number of times, I don't think I can parallelise across separate operations, as subsequent stages of the process modify the data set. Plus I'm using MSVC's STL so I don't have 16 byte alignment on my Vec3s.

Chances are you'll get next to no speed up. Trying to take an existing bit of code and vectorise it will usually end in tears: it'll either run slower, or show a small speed up that wasn't worth the effort in the grander scheme of things.

The only way to get SSE to give you an instant 4x speed up is to simply start doing 4 things at a time, i.e.

```cpp
struct Vec3SOA
{
    __m128 x;
    __m128 y;
    __m128 z;
};

void vec3Add(Vec3SOA& out, const Vec3SOA& a, const Vec3SOA& b)
{
    out.x = _mm_add_ps(a.x, b.x);
    out.y = _mm_add_ps(a.y, b.y);
    out.z = _mm_add_ps(a.z, b.z);
}
```

Using SSE on a singular object (e.g. a single __m128 to represent a single vector3) leads to an excessive amount of bit masks and general fudgery that quite often doesn't give the performance increases you initially hope for, I'm afraid.

##### Share on other sites

> I've never used the SSE intrinsics in MSVC before, so I'm looking for advice on whether my particular algorithm can be made faster. It involves determining if a plane intersects a convex polygon in 3D space. [...] Could anyone with experience using SSE give me some advice here? Is this an avenue worth pursuing at all?

Well the instruction set is fairly complete and rather large so learning it takes time. However there are holes (no 64 bit integer multiplies or any integer division) and quirks (some overzealous type casting when expanding out to __m128i and __m128d types).

There are several main things that get you speed using SSE, being able to do 2 or 4 simultaneous ops using the same algorithm is only one of them.

1) Provided you don't flood the pipeline with divisions, you can do quite a few scalar ops and they will run in parallel in the pipeline very efficiently. This means that if you are working with vec3 data, you should just go ahead and write 3 scalar loads and do scalar adds and scalar multiplies until you can coax some useful meta result into multiple registers. Doing the shuffles takes more cycles and adds more dependencies on the result of the final shuffle before any math can operate on the combined form. It's possible to do the load/shuffle thing to make a vec3 fit in one register, but then you are at risk of any given operation generating a NaN or INF in the 4th channel, which will seriously hurt the performance of the algorithm until the specials are masked out or removed.

2) The real secret weapon in SSE code is eliminating branches. I have a port of some Cephes trig math library functions that outperform Microsoft's SSE implementations by 2-3x because mine have the branches completely removed. This included replacing a binary search with a linear search over 16 elements in a table. There is a lot more code, but the CPU literally screams through it because of how it is designed. You simply write both code paths into different temporary local variables, and select the right result 'after the fact'. The selector algorithm is (A & M) | (B & ~M), where M is the comparison mask, but in vector math form; in SSE4 there is an actual intrinsic (BLENDVPS) that does it in fewer clocks. However, sometimes you need a real bona-fide branch, and that leads to:
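The select idiom, written out with SSE1/2 intrinsics (a sketch; `select_ps` is an illustrative name, and the max example just demonstrates the shape of compute-both-then-select):

```cpp
#include <xmmintrin.h>

// Per-lane select: where mask is all-ones take a, where it is zero take b.
// SSE4.1's _mm_blendv_ps does this in one instruction; this is the SSE1/2 form.
static __m128 select_ps(__m128 mask, __m128 a, __m128 b)
{
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

// Example: branch-free max(x, y) built from a compare plus a select.
static __m128 max_via_select(__m128 x, __m128 y)
{
    __m128 m = _mm_cmpgt_ps(x, y); // all-ones where x > y
    return select_ps(m, x, y);
}
```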

3) You CAN do conditional branching on fully vectorized code. In MSVC/Intel compiler terms, you call a function like _mm_cmpge_ps (vectorized float >=), followed by _mm_movemask_ps, which packs the top bit of each 32-bit channel into a 4-bit integer. Theoretically this means you need 30 test functions (which I call ANY and ALL, each having 15 permutations of which channels match the set you need). In practice you don't need them all; the tests boil down to 'do any of the channels pass the test' or 'do all of the channels pass the test'. For example, a physics ray cast that tests 4 polygons at once and can early-out on any hit can test with 'any4' (which is ((_mm_movemask_ps(v) & 15) != 0)) to bail out. Or, in a test of a point vs a convex hull that evaluates 4 plane tests at a time, where all 4 must pass for the point to be inside, you would test for 'all4' (which is ((_mm_movemask_ps(v) & 15) == 15)) to continue testing.
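Those two tests as helpers (the names any4/all4 follow the post's terminology):

```cpp
#include <xmmintrin.h>

// True if the compare-result mask passes in at least one lane.
static bool any4(__m128 cmp) { return (_mm_movemask_ps(cmp) & 15) != 0; }

// True if the compare-result mask passes in all four lanes.
static bool all4(__m128 cmp) { return (_mm_movemask_ps(cmp) & 15) == 15; }
```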

4) You can convert SSE-based scalar floats and doubles to integers and move them directly into one of the main CPU general purpose registers. The intrinsics that do this are _mm_cvttss_si32 and _mm_cvttsd_si32 (the packed forms, _mm_cvttps_epi32 and _mm_cvttpd_epi32, leave their results in an SSE register instead), and they are very handy if you need to do an indexed table lookup on a value that came from SSE data. Table lookups are another nice way to remove branches from code. This is also the most efficient way to get a value from the SSE registers to the general purpose ones, but there are restrictions (the source must be a float or double, the output must be a 32-bit integer, and the conversion truncates).
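A sketch of that register move, wrapped for readability (the wrapper names are illustrative; the intrinsics truncate toward zero):

```cpp
#include <xmmintrin.h>
#include <emmintrin.h>

// Truncate the low float/double lane straight into a general purpose
// register, e.g. to index a lookup table without bouncing through memory.
static int floatLaneToInt(__m128 v)   { return _mm_cvttss_si32(v); }
static int doubleLaneToInt(__m128d v) { return _mm_cvttsd_si32(v); }
```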

5) Basically you need good wrappers on the intrinsics (if you ever plan on porting to linux or say a PowerPC with an entirely different instruction set), and they all need to be __forceinline level stuff. Once you have that you can write a C++ class library around that. There are a bunch out there but writing your own teaches you the whole instruction set so you can understand how it all works together. Be aware that un-optimized SSE intrinsics on the MSVC compiler are hideously inefficient (each instruction is basically load/work/write and totally full of load hit stores). This is primarily so you can inspect the variables in a debugger but the performance is horrid until optimizations are on.

Now the MSVC compiler emits scalar float or double SSE code automatically for nearly everything nowadays, so some of what I've written the compiler can do for you. But it has to follow the rules a lot of the time, so things like eliminating branches cannot be done by the compiler. Loop unrolling? The compiler can do that. Removing branches? That requires a human.

* If you poke around the instruction set you will see that the float instructions have reciprocal and sqrt estimators. Doubles do not have estimators and also take a lot longer. The latest Intel hardware is rapidly making the estimators close to useless, as the full precision functions are being made more and more efficient, but if you feel compelled to use them, it is worth noting you should do a Newton-Raphson iteration on them to improve their accuracy rather than using them directly. Using the raw estimates can cause some real pain down the line.

```cpp
vfloat vfRcp(const vfloat& v)
{
    vfloat e = _mm_rcp_ps(v);   // _mm_rcp_ps is a 12 bit estimate
    // Newton iteration of 1/x, approximately 20 bits of precision
    vfloat ee   = vfMul(e, e);
    vfloat vee  = vfMul(v, ee);
    vfloat e2   = vfAdd(e, e);
    vfloat rslt = vfSub(e2, vee);
    return rslt;
}

vfloat vfRcpSqrt(const vfloat& v)
{
    vfloat e = _mm_rsqrt_ps(v); // _mm_rsqrt_ps is a 12 bit estimate
    // Newton iteration of 1/x^2, approximately 20 bits of precision
    vfloat ee    = vfMul(e, e);
    vfloat vee   = vfMul(v, ee);
    vfloat halfe = vfMul(e, GvfGlobal_Half);             // GvfGlobal_Half is {0.5, 0.5, 0.5, 0.5}
    vfloat threeminus_vee = vfSub(GvfGlobal_Three, vee); // GvfGlobal_Three is {3.0, 3.0, 3.0, 3.0}
    vfloat rslt  = vfMul(halfe, threeminus_vee);
    return rslt;
}
```

Data conversion cheat sheet I made when figuring things out:

(int/float/double are POD data types stored in main memory)
(sint/sfloat/sdouble are scalars in SSE registers)
(vint/vfloat/vdouble are vectors in SSE registers)

| src \ dst | int | float | double | sint | sfloat | sdouble | vint | vfloat | vdouble |
|---|---|---|---|---|---|---|---|---|
| int | = | (float) | (double) | _mm_load_ss | _mm_cvtsi32_ss | _mm_cvtsi32_sd | _mm_load_ss | _mm_cvtsi32_ss | _mm_cvtsi32_sd |
| float | round | = | (double) | _mm_cvtps_epi32 | _mm_load_ss | _mm_cvtss_sd | _mm_cvtps_epi32 | _mm_load_ss | _mm_cvtss_sd |
| double | round | (float) | = | _mm_cvtpd_epi32 | _mm_cvtsd_ss | _mm_load_sd | _mm_cvtpd_epi32 | _mm_cvtsd_ss | _mm_load_sd |
| sint | _mm_store_ss | _mm_cvtepi32_ps | _mm_cvtepi32_pd | = | _mm_cvtepi32_ps | _mm_cvtepi32_pd | _mm_shuffle_ps | _mm_cvtepi32_ps | _mm_cvtepi32_pd |
| sfloat | _mm_cvtss_si32 | _mm_store_ss | _mm_cvtss_sd | _mm_cvtps_epi32 | = | _mm_cvtss_sd | _mm_cvtps_epi32 | _mm_shuffle_ps | _mm_cvtss_sd |
| sdouble | _mm_cvtsd_si32 | _mm_cvtsd_ss | _mm_store_sd | _mm_cvtpd_epi32 | _mm_cvtsd_ss | = | _mm_cvtpd_epi32 | _mm_cvtsd_ss | _mm_shuffle_pd |
| vint | _mm_store_ss | _mm_cvtepi32_ps | _mm_cvtepi32_pd | _mm_shuffle_ps | _mm_cvtepi32_ps | _mm_cvtepi32_pd | _mm_load(u)_ps | _mm_cvtepi32_ps | _mm_cvtepi32_pd |
| vfloat | _mm_cvtss_si32 | _mm_store_ss | _mm_cvtss_sd | _mm_cvtps_epi32 | _mm_shuffle_ps | _mm_cvtss_sd | _mm_cvtps_epi32 | _mm_load(u)_ps | _mm_cvtps_pd |
| vdouble | _mm_cvtsd_si32 | _mm_cvtsd_ss | _mm_store_sd | _mm_cvtpd_epi32 | _mm_cvtsd_ss | _mm_shuffle_pd | _mm_cvtpd_epi32 | _mm_cvtpd_ps | _mm_load(u)_pd |

##### Share on other sites
Wow, there's a lot of information here. Thanks guys.

I'm going to try using the XNA library as suggested and see if it gives a performance improvement. Something just occurred to me, literally right as I was typing this up. Rather than trying to vectorise this particular operation, I should implement early outs using axis aligned bounding boxes instead. I can store them in SoA format, 16 byte aligned, and make use of intrinsics without having to touch the much more complicated intersection code.
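That early-out could look something like this (a sketch under assumed names and layout: four AABBs in SoA form, one box per lane, tested against one query box via the separating-axis rule):

```cpp
#include <xmmintrin.h>

// Four AABBs stored SoA, one box per lane.
struct AABB4 {
    __m128 minx, miny, minz;
    __m128 maxx, maxy, maxz;
};

// Bitmask (bits 0-3) of which of the four boxes overlap the query box.
static int overlapMask4(const AABB4& b,
                        float qminx, float qminy, float qminz,
                        float qmaxx, float qmaxy, float qmaxz)
{
    // A lane is separated if it lies entirely below or above the query on any axis.
    __m128 sep = _mm_or_ps(_mm_cmplt_ps(b.maxx, _mm_set1_ps(qminx)),
                           _mm_cmpgt_ps(b.minx, _mm_set1_ps(qmaxx)));
    sep = _mm_or_ps(sep, _mm_or_ps(_mm_cmplt_ps(b.maxy, _mm_set1_ps(qminy)),
                                   _mm_cmpgt_ps(b.miny, _mm_set1_ps(qmaxy))));
    sep = _mm_or_ps(sep, _mm_or_ps(_mm_cmplt_ps(b.maxz, _mm_set1_ps(qminz)),
                                   _mm_cmpgt_ps(b.minz, _mm_set1_ps(qmaxz))));
    return ~_mm_movemask_ps(sep) & 15; // lanes with no separating axis overlap
}
```

A zero return means all four boxes miss, so the expensive plane/polygon code can be skipped for the whole batch.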

The info on eliminating branching is interesting. I haven't had any luck with these optimisations in the past. It seems that correctly predicted branches cost virtually nothing on the i7, and emitting extra code to remove a branch normally results in a performance hit. I wonder if it's similar on the iPhone platform.
