# SSE, 3D Vectors, and the like

This topic is 5137 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

I'm working with SSE1, trying to optimise a math library, and there are some questions I can't answer. If you folks could help me out I'd be most appreciative [smile]

1) Are there SSE equivalents to the x87 FSIN/FCOS/FSINCOS instructions, and if so, are they much faster when using scalars?

2) Is there a general listing of operation costs (i.e. how many cycles each takes to execute) around? Is it something that differs from chip to chip?

3) How do I un/pack a three-float vector (12 bytes) into a four-float register (xmm) efficiently?

Cheers in advance...

##### Share on other sites
Quote:
 Original post by superpig: I'm working with SSE1, trying to optimise a math library, and there are some questions I can't answer. If you folks could help me out I'd be most appreciative [smile] 1) Are there SSE equivalents to the x87 FSIN/FCOS/FSINCOS instructions, and if so, are they much faster when using scalars?

No.
Quote:
 2) Is there a general listing of operation expenses (i.e. how many cycles each takes to execute) around? Is it something that differs from chip to chip?

It differs from chip to chip. Not only that, but there are other performance penalties that can nail you (such as 64K aliasing on some P4 chips, where accesses to addresses that are equal modulo 64 KB conflict with each other).
Quote:
 3) How do I un/pack a three-float vector (12 bytes) into a four-float register (xmm) efficiently?

Pad it with one float; barring that, move 2 floats and then 1 (using MOVLPS | MOVHPS, etc.).
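
A minimal sketch of the "move 2 then 1" approach using the SSE1 intrinsics from `<xmmintrin.h>`. The function names `load_vec3` and `store_vec3` are made up for illustration, and the choice of MOVLPS plus MOVSS (rather than MOVHPS) is just one way to do it:

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Load x, y, z from an unpadded float[3]; lane 3 (w) comes out as 0.
   Hypothetical helper name, not from the thread. */
static __m128 load_vec3(const float *p)
{
    __m128 xy = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)p); /* MOVLPS:  x y 0 0 */
    __m128 z  = _mm_load_ss(p + 2);                               /* MOVSS:   z 0 0 0 */
    return _mm_movelh_ps(xy, z);                                  /* MOVLHPS: x y z 0 */
}

/* Store lanes 0..2 to an unpadded float[3] without touching p[3]. */
static void store_vec3(float *p, __m128 v)
{
    _mm_storel_pi((__m64 *)p, v);             /* MOVLPS: writes x and y */
    _mm_store_ss(p + 2, _mm_movehl_ps(v, v)); /* MOVHLPS moves z to lane 0, MOVSS writes it */
}
```

Neither helper reads or writes past the third float, so they are safe on tightly packed arrays, at the cost of a few extra instructions per vector.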

##### Share on other sites
1. There's no equivalent, but you can do better even without it.

2. Probably not; it will depend on the CPU architecture. The vendors may provide such a list.

3. For vectors I use movups, so one float in the register is undefined after the move.
If you are aiming for speed and you don't mind using more memory, you can pad the vector with one float and align it on a 16-byte boundary. Then you can use movaps, which should be faster than movups.
Then it will look like:

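struct Vector3 // hypothetical name; the declaration was missing from the post
{
float x;
float y;
float z;
float pad; // unused fourth float so the struct fills 16 bytes
};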
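
A minimal sketch of the padded, 16-byte-aligned layout together with aligned loads and stores. It assumes a C11 compiler for `_Alignas`; the type name `Vec4` and the helper `vec4_add` are hypothetical:

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Three floats plus one float of padding, aligned to 16 bytes.
   Aligning the first member aligns the whole struct (C11 _Alignas). */
typedef struct {
    _Alignas(16) float x;
    float y;
    float z;
    float pad;  /* unused, keeps sizeof(Vec4) == 16 */
} Vec4;

/* out = a + b, using aligned MOVAPS loads and stores. */
static void vec4_add(Vec4 *out, const Vec4 *a, const Vec4 *b)
{
    __m128 va = _mm_load_ps(&a->x);            /* MOVAPS load */
    __m128 vb = _mm_load_ps(&b->x);            /* MOVAPS load */
    _mm_store_ps(&out->x, _mm_add_ps(va, vb)); /* ADDPS, then MOVAPS store */
}
```

`_mm_load_ps` and `_mm_store_ps` fault on addresses that are not 16-byte aligned, so heap allocations of `Vec4` need an aligned allocator as well.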

And did you write a cross product in SSE? I had problems with that, and I used a lot of swizzling to calculate it.
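
For reference, one common way to write the cross product with SSE1 shuffles is sketched below. This is not the poster's code; the name `cross3` is made up, and the inputs are assumed to hold x, y, z in lanes 0..2:

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* cross(a, b) = a.yzx * b.zxy - a.zxy * b.yzx
   Lane 3 of the result is a.w*b.w - a.w*b.w, i.e. 0 for finite inputs. */
static __m128 cross3(__m128 a, __m128 b)
{
    __m128 a_yzx = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1)); /* y z x w */
    __m128 a_zxy = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 1, 0, 2)); /* z x y w */
    __m128 b_yzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1)); /* y z x w */
    __m128 b_zxy = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 1, 0, 2)); /* z x y w */
    return _mm_sub_ps(_mm_mul_ps(a_yzx, b_zxy),
                      _mm_mul_ps(a_zxy, b_yzx));
}
```

This version uses four shuffles for readability; variants with three shuffles exist, at the cost of being harder to follow.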

##### Share on other sites
Quote:
 Original post by b2b3: 3. For vectors I use movups so one float in register was undefined after move.
Ah, right... I was thinking about that but worrying about access violation faults.

Of course I guess I'd still have to unpack in two steps.
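
A rough sketch of the movups load discussed here, assuming the 16 bytes starting at the vector are readable (for example, the vector sits in an array with at least one float of slack after the last element). The name `load_vec3_movups` is hypothetical, and zeroing the fourth lane is optional:

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Reads p[0..3] with one unaligned load, then clears lane 3.
   ASSUMPTION: p[3] is readable memory; its value is discarded. */
static __m128 load_vec3_movups(const float *p)
{
    __m128 v  = _mm_loadu_ps(p);                      /* MOVUPS:   x y z ? */
    __m128 zh = _mm_unpackhi_ps(v, _mm_setzero_ps()); /* UNPCKHPS: z 0 ? 0 */
    return _mm_movelh_ps(v, zh);                      /* MOVLHPS:  x y z 0 */
}
```

If the last vector in a buffer ends exactly at the end of a mapped page, the extra 4 bytes read by MOVUPS can fault, which is the access violation concern raised above. Storing back still needs two steps (two floats, then one) to avoid overwriting the next element.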

The CPU architecture in question, btw, is based on a Pentium III. Problem is I don't know quite how similar it is... but it gives me a place to start looking.

Cheers, both of you.

##### Share on other sites
Look at the Intel Architecture Optimization Reference Manual for instruction latency and throughput information for various processors. If this CPU that's 'based on a Pentium III' is in a console, then you should be able to get most of this information from the console manufacturer (assuming you're a registered developer); they're generally pretty helpful.

Intel have a lot of helpful performance information on their site, but it's not very well organised and it can be difficult to find what you're looking for. Spending some time hunting around there is probably worthwhile, though.

##### Share on other sites
Quote:
 Original post by superpig: I'm working with SSE1, trying to optimise a math library, and there are some questions I can't answer. If you folks could help me out I'd be most appreciative [smile] 1) Are there SSE equivalents to the x87 FSIN/FCOS/FSINCOS instructions, and if so, are they much faster when using scalars? 2) Is there a general listing of operation expenses (i.e. how many cycles each takes to execute) around? Is it something that differs from chip to chip? 3) How do I un/pack a three-float vector (12 bytes) into a four-float register (xmm) efficiently? Cheers in advance...

1) No. I suppose you need it for Euler angles or quaternion SLERP, but it can be done very well with Taylor series. I also have some precise ideas on how to redefine the whole of math.h on 3DNow! and SSE, for instance a very efficient implementation of degree-5 polynomials, highly parallel and scheduled. I also envision using tricks based on the quick rsqrt or rcp instructions (playing around with the Taylor coefficients). For instance, I found a page on the web that describes a fast acos based on rsqrt, very astute and elegant (search Google for it).
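
To illustrate the Taylor series idea, here is a minimal sketch that evaluates the degree-5 polynomial sin x ≈ x - x³/6 + x⁵/120 on four floats at once. It is not the VSIMD implementation; the name `sin4_taylor5` is made up, and the code assumes the inputs have already been range-reduced to a small interval such as [-π/4, π/4], where the truncation error stays below roughly 4e-5:

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Four sines at once via a degree-5 Taylor polynomial in Horner form:
   sin(x) ~= x * (1 + x^2 * (-1/6 + x^2 * (1/120)))
   ASSUMPTION: |x| is small (no range reduction is done here). */
static __m128 sin4_taylor5(__m128 x)
{
    const __m128 c3  = _mm_set1_ps(-1.0f / 6.0f);
    const __m128 c5  = _mm_set1_ps(1.0f / 120.0f);
    const __m128 one = _mm_set1_ps(1.0f);

    __m128 x2 = _mm_mul_ps(x, x);                  /* x^2 */
    __m128 p  = _mm_add_ps(c3, _mm_mul_ps(c5, x2)); /* -1/6 + x^2/120 */
    p = _mm_add_ps(one, _mm_mul_ps(p, x2));        /* 1 + x^2 * (...) */
    return _mm_mul_ps(p, x);                       /* x * (...) */
}
```

A production version would add range reduction and probably minimax coefficients instead of the raw Taylor ones, which give better accuracy for the same degree.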

2) Yes, the Intel or AMD docs ;) From chip to chip... yes, but it's usually not so vital. Every machine behaves more or less the same way in practice, and clock counts will be quite close.

3) Frankly, consider replacing your old x,y,z structs with x,y,z,w. It costs 33% more memory, but it is really worth the price paid. Otherwise, consider SoA (structure of arrays) conversions, or more particularly hybrid layouts: x0,x1,x2,x3, y0,y1,y2, etc.... Plain x,y,z structs can be handled, but you pay a considerable overhead, or else you'll need to complicate and unroll the code of your loops a lot.
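
A small sketch of what the hybrid layout could look like: four vectors are stored together as x0..x3, y0..y3, z0..z3, so one SSE instruction works on the same component of four vectors. The type `Vec3x4` and the function `dot3x4` are hypothetical names, and `_Alignas` assumes a C11 compiler:

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Four 3D vectors in hybrid SoA form: x0..x3, y0..y3, z0..z3. */
typedef struct {
    _Alignas(16) float x[4];
    _Alignas(16) float y[4];
    _Alignas(16) float z[4];
} Vec3x4;

/* Four dot products at once: lane i holds dot(a_i, b_i).
   No shuffles and no wasted lanes are needed in this layout. */
static __m128 dot3x4(const Vec3x4 *a, const Vec3x4 *b)
{
    __m128 r = _mm_mul_ps(_mm_load_ps(a->x), _mm_load_ps(b->x));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_load_ps(a->y), _mm_load_ps(b->y)));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_load_ps(a->z), _mm_load_ps(b->z)));
    return r;
}
```

Compared with padded x,y,z,w structs, this wastes no memory and keeps all four lanes busy, but code that touches a single vector at a time becomes more awkward.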

Lastly: wouldn't you rather see if you can contribute to my Virtual SIMD? I am very close to launching it "semi-publicly", possibly on SourceForge soon. It's open source, but I had to keep it underground as a long one-man preparatory job, because it was far too complex, dynamic and experimental to fit well with team work. I also had to learn a lot of details and find a lot of ideas. Now I am far closer to the right, stable project architecture, which is why I am considering opening it to some contributors. Still, the official version 1.0 will not come before next year.

Briefly presented, VSIMD extends the instruction sets a lot and makes them standard and cross-platform:

Virtual SIMD layer:
r = sin_4f(s); // Works on a Pentium, K6, P3, G4, etc.

Then the library will have higher layers based on this VSIMD :

Linear algebra layer :
r = s^t + Quat*Point;

Geometric layer :
d = dist(Sphere, OBBox);

##### Share on other sites
cpuid also has information about latency and throughput, and it may give a somewhat better overview than the Intel or AMD docs.
