# Redesigning the SSE

## Recommended Posts

I'm not sure if this topic is appropriate, but I'm trying to understand and learn SSE (both the integer and floating-point arithmetic parts), and I think discussing it could help me learn.

When I read about the SSE operations, some seem somewhat chaotic and strange to me. For example, as far as I know there is no instruction for a horizontal multiply that would just compute a1*a2*a3*a4, and some others put their results in strangely crossed-over positions :/

Would you redesign the SSE operations?

##### Share on other sites

> for example, as far as I know there is no command for a horizontal multiply that would just multiply a1*a2*a3*a4

Sometimes you just need to rearrange the data. The benefits of SIMD ops do not come from a1*a2*a3*a4 but rather from:

v1(a1,b1,c1,d1);
v2(a2,b2,c2,d2);
v3(a3,b3,c3,d3);
v4(a4,b4,c4,d4);

v5 = vec4mul( v1, vec4mul( v2, vec4mul( v3, v4 ) ) );
// v5( a1*a2*a3*a4, b1*b2*b3*b4, c1*c2*c3*c4, d1*d2*d3*d4 )
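With real intrinsics, the pseudocode above might look roughly like the sketch below. It uses only SSE1 (`<xmmintrin.h>`); the function name `product4` and its array parameters are made up for illustration and are not part of any library.

```c
#include <xmmintrin.h>  /* SSE1: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps */

/* Hypothetical helper. Each input holds one factor for four independent
   products: v1 = (a1, b1, c1, d1), v2 = (a2, b2, c2, d2), and so on.
   out receives (a1*a2*a3*a4, b1*b2*b3*b4, c1*c2*c3*c4, d1*d2*d3*d4). */
static void product4(const float v1[4], const float v2[4],
                     const float v3[4], const float v4[4], float out[4])
{
    /* Unaligned loads, so any float[4] works without alignment attributes. */
    __m128 x1 = _mm_loadu_ps(v1);
    __m128 x2 = _mm_loadu_ps(v2);
    __m128 x3 = _mm_loadu_ps(v3);
    __m128 x4 = _mm_loadu_ps(v4);

    /* Three vertical multiplies produce four complete products at once. */
    __m128 p = _mm_mul_ps(x1, _mm_mul_ps(x2, _mm_mul_ps(x3, x4)));

    _mm_storeu_ps(out, p);
}
```

Three multiply instructions yield four finished products here, which is where the speedup comes from; no lane ever needs to talk to its neighbour.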

Edited by zfvesoljc

##### Share on other sites

The reason that horizontal operations are hard for SIMD designs, as I understand it, is that they're either fairly high-latency, or you have to throw a lot of transistors at them to make them go fast. Combine that with their relatively limited utility, and it's no wonder you don't get things like horizontal multiplies. They did add horizontal add/subtract in SSE3 (haddps/hsubps), but multiplier circuits are considerably larger, and you need three of them stacked 2-deep (so at least twice the latency, although there are probably faster methods if they spent even more transistors on it). Depending on how things are wired, supporting horizontal ops at all could complicate how the register file is designed too.
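The same 2-deep structure shows up if you build a horizontal multiply by hand. Here is a minimal sketch using only SSE1 intrinsics; `horizontal_product` is a hypothetical name, and this is one possible shuffle sequence rather than the canonical one.

```c
#include <xmmintrin.h>  /* SSE1: _mm_shuffle_ps, _mm_movehl_ps, _mm_mul_ps, _mm_mul_ss */

/* Hypothetical helper: returns v0*v1*v2*v3 for v = (v0, v1, v2, v3). */
static float horizontal_product(__m128 v)
{
    /* Level 1: swap neighbours inside each pair to get (v1, v0, v3, v2),
       then multiply, giving (v0*v1, v0*v1, v2*v3, v2*v3). */
    __m128 swapped = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 pairs = _mm_mul_ps(v, swapped);

    /* Level 2: copy the upper half down so lane 0 holds v2*v3,
       then multiply lane 0 only. */
    __m128 upper = _mm_movehl_ps(pairs, pairs);
    __m128 total = _mm_mul_ss(pairs, upper);

    return _mm_cvtss_f32(total);  /* extract lane 0 */
}
```

Each multiply has to wait for the shuffle before it, so the whole thing is one long dependency chain that produces a single float, which is a poor use of a 4-wide unit.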

In general, though, nearly all the problems you might want a horizontal add/multiply for can be transposed (commonly from array-of-structures to structure-of-arrays). SSE is for very specialized coding. It asks you to bend the problem to its way of working, and offers great performance in return. But it's not suited for every problem, either.
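As a rough illustration of that transposition, the sketch below takes four array-of-structures elements, transposes them in registers with the `_MM_TRANSPOSE4_PS` macro from `<xmmintrin.h>`, and then does the "horizontal" product of each element with ordinary vertical multiplies. The `Quad` struct and `quad_products` function are assumptions made up for this example.

```c
#include <xmmintrin.h>  /* SSE1: _MM_TRANSPOSE4_PS, _mm_loadu_ps, _mm_mul_ps */

/* Hypothetical AoS element: four values we want multiplied together. */
typedef struct { float a, b, c, d; } Quad;

/* Computes a*b*c*d for four Quads at once; out[i] is the product for q[i]. */
static void quad_products(const Quad q[4], float out[4])
{
    /* One register per structure: r0 = (q0.a, q0.b, q0.c, q0.d), etc. */
    __m128 r0 = _mm_loadu_ps((const float *)&q[0]);
    __m128 r1 = _mm_loadu_ps((const float *)&q[1]);
    __m128 r2 = _mm_loadu_ps((const float *)&q[2]);
    __m128 r3 = _mm_loadu_ps((const float *)&q[3]);

    /* AoS -> SoA: afterwards r0 holds all four a's, r1 all b's, and so on. */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    /* Vertical multiplies now compute all four products in parallel. */
    __m128 p = _mm_mul_ps(_mm_mul_ps(r0, r1), _mm_mul_ps(r2, r3));

    _mm_storeu_ps(out, p);
}
```

If the data can be stored as structure-of-arrays in the first place, the transpose disappears entirely and only the loads and multiplies remain.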

If you want to see what's generally considered to be a nicer (the nicest, some argue) vector instruction set, take a look at AltiVec, as found in PowerPC processors dating back to the G4 and more recently in the Xbox 360 and PS3. I keep a G4 Mac mini around just so that I have an AltiVec machine to play with.

##### Share on other sites

SSE/AVX/FMA/... isn't a big issue. Just get started with the basics.

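// _mm_dp_ps needs SSE4.1 (<smmintrin.h>). _mm_setr_ps puts x in lane 0, so the
// mask's high nibble (7 = lanes 0-2, F = all four) selects x, y, z or x, y, z, w.
const float dot_product_3D = _mm_cvtss_f32(_mm_dp_ps(_mm_setr_ps(v1.x, v1.y, v1.z, v1.w),
    _mm_setr_ps(v2.x, v2.y, v2.z, v2.w), 0x71));
const float dot_product_4D = _mm_cvtss_f32(_mm_dp_ps(_mm_setr_ps(v1.x, v1.y, v1.z, v1.w),
    _mm_setr_ps(v2.x, v2.y, v2.z, v2.w), 0xF1));
_mm_set_ss(offset)); // _mm_fmadd_ss requires FMA3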

should do what you mean.

At first, try out simple things, then move on to more advanced stuff:

• Compute with SSE values (mul = multiply, add = addition, sub = subtraction, div = divide, rcp = reciprocal, sqrt = square root, fmadd = fused multiply-add), both as scalar (ss) and as 4-wide packed (ps) operations.
• Compare SSE values.
• Validate the values (NaN, not NaN).
• Shuffle vectors.
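A small sketch of the compare, validate and shuffle steps from the list, using only SSE1 intrinsics. The three function names are hypothetical helpers invented for this example.

```c
#include <xmmintrin.h>  /* SSE1: _mm_cmplt_ps, _mm_cmpunord_ps, _mm_movemask_ps, _mm_shuffle_ps */

/* Compare: returns a 4-bit mask where bit i is set when a[i] < b[i]. */
static int less_than_mask(__m128 a, __m128 b)
{
    /* The compare yields all-ones or all-zeros per lane;
       movemask collects each lane's sign bit into an int. */
    return _mm_movemask_ps(_mm_cmplt_ps(a, b));
}

/* Validate: returns a 4-bit mask where bit i is set when v[i] is NaN.
   NaN compares "unordered" with everything, including itself. */
static int nan_mask(__m128 v)
{
    return _mm_movemask_ps(_mm_cmpunord_ps(v, v));
}

/* Shuffle: reverses the lane order, (v0, v1, v2, v3) -> (v3, v2, v1, v0). */
static __m128 reverse_lanes(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}
```

Note that `_MM_SHUFFLE` lists its arguments from the highest lane down to lane 0, the same order `_mm_set_ps` uses, which is an easy thing to trip over at first.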

And that's all. The rest is some knowledge about cycles and timing, plus tricks for doing specific things (like a matrix inverse).

Hope it helped a lot. Today, a lot of debuggers help too. Also, this documentation helps a lot.
