superpig

SSE, 3D Vectors, and the like


I'm working with SSE1, trying to optimise a math library, and there are some questions I can't answer. If you folks could help me out I'd be most appreciative [smile]

1) Are there SSE equivalents to the x87 FSIN/FCOS/FSINCOS instructions, and if so, are they much faster when using scalars?

2) Is there a general listing of operation expenses (i.e. how many cycles each takes to execute) around? Is it something that differs from chip to chip?

3) How do I un/pack a three-float vector (12 bytes) into a four-float register (xmm) efficiently?

Cheers in advance...

Quote:
Original post by superpig
I'm working with SSE1, trying to optimise a math library, and there are some questions I can't answer. If you folks could help me out I'd be most appreciative [smile]

1) Are there SSE equivalents to the x87 FSIN/FCOS/FSINCOS instructions, and if so, are they much faster when using scalars?

No.
Quote:

2) Is there a general listing of operation expenses (i.e. how many cycles each takes to execute) around? Is it something that differs from chip to chip?

It differs from chip to chip. Beyond raw cycle counts, there are other performance penalties that can nail you, such as 64 KB aliasing on some P4 chips, where accesses to addresses that are equal modulo 64 KB can conflict.
Quote:

3) How do I un/pack a three-float vector (12 bytes) into a four-float register (xmm) efficiently?

Pad it with one float. Barring that, move two floats and then one (using MOVLPS/MOVHPS, etc.).
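As a rough sketch of the "move two, then one" approach, assuming C with the SSE1 intrinsics from `<xmmintrin.h>`: the helper names `load_vec3` and `store_vec3` are made up for illustration, and zeroing the fourth lane is a choice made here, not something the post specifies.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Load x, y, z from an unpadded float[3]; the fourth lane is set to 0.
   Reads exactly 12 bytes, so it never touches memory past the vector. */
static __m128 load_vec3(const float *p)
{
    __m128 xy = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)p); /* MOVLPS:  x y 0 0 */
    __m128 z  = _mm_load_ss(p + 2);                               /* MOVSS:   z 0 0 0 */
    return _mm_movelh_ps(xy, z);                                  /* MOVLHPS: x y z 0 */
}

/* Store the low three lanes to an unpadded float[3]; writes exactly 12 bytes. */
static void store_vec3(float *p, __m128 v)
{
    _mm_storel_pi((__m64 *)p, v);              /* MOVLPS: writes x y */
    _mm_store_ss(p + 2, _mm_movehl_ps(v, v));  /* MOVHLPS moves z to lane 0, MOVSS writes it */
}
```

Both directions take two memory operations plus one register shuffle, which is the overhead that padding to four floats avoids.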

1. There's no equivalent, but you can do better even without it.

2. Probably not; it will depend on the CPU architecture. The vendors may publish such a list.

3. For vectors I use movups, which leaves the fourth float in the register undefined after the load.
If you are aiming for speed and you don't mind using more memory, you can pad the vector with one float and align it on a 16-byte boundary. Then you can use movaps, which should be faster than movups.
It would then look like:

__declspec(align(16)) struct VectorPadded   // MSVC syntax: forces 16-byte alignment
{
    float x;
    float y;
    float z;
    float pad;   // unused; rounds the struct up to 16 bytes
};
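A minimal sketch of using the padded layout with aligned moves, assuming C with `<xmmintrin.h>`. The `ALIGN16` macro and the `vec_add` function are illustrative names, not part of any library; the macro is only there so the same declaration also compiles outside MSVC.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical portability macro: 16-byte alignment on MSVC and on GCC/Clang. */
#if defined(_MSC_VER)
#define ALIGN16 __declspec(align(16))
#else
#define ALIGN16 __attribute__((aligned(16)))
#endif

typedef struct ALIGN16 VectorPadded
{
    float x, y, z;
    float pad;  /* unused fourth lane */
} VectorPadded;

/* out = a + b, using aligned loads and an aligned store (MOVAPS).
   All three pointers must be 16-byte aligned, which the type guarantees
   for ordinary variables and arrays of VectorPadded. */
static void vec_add(VectorPadded *out, const VectorPadded *a, const VectorPadded *b)
{
    __m128 va = _mm_load_ps(&a->x);
    __m128 vb = _mm_load_ps(&b->x);
    _mm_store_ps(&out->x, _mm_add_ps(va, vb));
}
```

Memory that comes from a plain `malloc` is not guaranteed to be 16-byte aligned on every platform, so heap-allocated vectors would need an aligned allocator for `_mm_load_ps` to be safe.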

Have you written a cross product in SSE? I had problems with that and ended up using a lot of swizzling to calculate it.
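For reference, one common way to write the cross product with three shuffles, sketched in C with SSE1 intrinsics. This is not necessarily what the poster used; `cross3` is an illustrative name, and both inputs are assumed to hold x, y, z in lanes 0 to 2.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Cross product of two vectors stored as (x, y, z, w) in lanes 0..3.
   Lane 3 of the result is a.w*b.w - a.w*b.w, i.e. 0 for finite inputs. */
static __m128 cross3(__m128 a, __m128 b)
{
    /* Rotate the first three lanes: (x, y, z, w) -> (y, z, x, w). */
    __m128 a_yzx = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1));
    __m128 b_yzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1));

    /* c = a * b_yzx - a_yzx * b gives (cross.z, cross.x, cross.y, 0). */
    __m128 c = _mm_sub_ps(_mm_mul_ps(a, b_yzx), _mm_mul_ps(a_yzx, b));

    /* One more rotation puts the components back in x, y, z order. */
    return _mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 0, 2, 1));
}
```

The trick is to compute the result in rotated order so that only three shuffles are needed instead of four.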

Quote:
Original post by b2b3
3. For vectors I use movups, which leaves the fourth float in the register undefined after the load.

Ah, right... I was thinking about that but worrying about access violation faults.

Of course I guess I'd still have to unpack in two steps.

The CPU architecture in question, btw, is based on a Pentium III. Problem is I don't know quite how similar it is... but it gives me a place to start looking.

Cheers, both of you.

Look at the Intel Architecture Optimization Reference Manual for instruction latency and throughput information for various processors. If this CPU that's 'based on a Pentium III' is in a console then you should be able to get most of this information from the console manufacturer (assuming you're a registered developer) - they're generally pretty helpful.

Intel have a lot of helpful performance information on their site, but it's not very well organised and it can be difficult to find what you're looking for. Spending some time hunting around there is probably time well spent though.

Quote:
Original post by superpig
I'm working with SSE1, trying to optimise a math library, and there are some questions I can't answer. If you folks could help me out I'd be most appreciative [smile]

1) Are there SSE equivalents to the x87 FSIN/FCOS/FSINCOS instructions, and if so, are they much faster when using scalars?

2) Is there a general listing of operation expenses (i.e. how many cycles each takes to execute) around? Is it something that differs from chip to chip?

3) How do I un/pack a three-float vector (12 bytes) into a four-float register (xmm) efficiently?

Cheers in advance...


1) No. I suppose you need it for Euler angles or quaternion SLERP, but it can be done very well with a Taylor series. I also have some concrete ideas on how to reimplement the whole of math.h on 3DNow! and SSE, for instance a very efficient implementation of degree-5 polynomials, highly parallel and well scheduled. I am also considering tricks based on the fast rsqrt or rcp instructions (adjusting the Taylor coefficients to suit). For instance, I found a web page that describes a fast acos based on rsqrt, very astute and elegant (search for it on Google).
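To make the Taylor-series idea concrete, here is a minimal sketch in C with SSE1 intrinsics that evaluates the degree-5 polynomial x - x^3/6 + x^5/120 for four floats at once. It is not the poster's implementation: the name `sin4_taylor5` is invented here, there is no range reduction, and the result is only a reasonable approximation for small inputs (the first dropped term is x^7/5040, about 4e-5 at pi/4).

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Approximates sin(x) in all four lanes with the degree-5 Taylor polynomial,
   written in Horner form: x * (1 + x^2 * (-1/6 + x^2 * 1/120)).
   Assumes |x| is small (roughly |x| <= pi/4); no range reduction is done. */
static __m128 sin4_taylor5(__m128 x)
{
    const __m128 one = _mm_set1_ps(1.0f);
    const __m128 c3  = _mm_set1_ps(-1.0f / 6.0f);
    const __m128 c5  = _mm_set1_ps(1.0f / 120.0f);

    __m128 x2 = _mm_mul_ps(x, x);
    __m128 p  = _mm_add_ps(c3, _mm_mul_ps(c5, x2));  /* -1/6 + x^2/120 */
    p = _mm_add_ps(one, _mm_mul_ps(p, x2));          /* 1 + x^2 * (...) */
    return _mm_mul_ps(p, x);                         /* x * (...)       */
}
```

A production version would first reduce the argument into a small interval and would probably use minimax coefficients instead of the raw Taylor ones, which is presumably the kind of coefficient tuning the post alludes to.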

2) Yes, the Intel or AMD docs ;) From chip to chip... yes, but it's usually not so vital. In practice every machine behaves more or less the same way, and clock counts will be quite close.

3) Frankly, consider replacing your old x,y,z structs with x,y,z,w. It costs 33% more memory but is really worth the price paid. Otherwise consider SoA conversions, or more particularly hybrid layouts: x0,x1,x2,x3, y0,y1,y2, etc. Plain x,y,z structs can be handled, but you pay a considerable overhead, or else you'll need to complicate and unroll the code of your loops a lot.
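The hybrid layout can be sketched as follows, assuming C with SSE1 intrinsics. The `Vec3x4` type and `dot3x4` function are hypothetical names chosen for this example; each block holds four 3D vectors with their components grouped together.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical hybrid SoA block: four 3D vectors stored as
   x0,x1,x2,x3, y0,y1,y2,y3, z0,z1,z2,z3 (48 bytes, no padding lane). */
typedef struct Vec3x4
{
    float x[4];
    float y[4];
    float z[4];
} Vec3x4;

/* Computes four dot products at once: out[i] = a_i . b_i.
   Every lane does useful work and no shuffles are needed.
   Unaligned loads are used so the sketch works for any placement;
   with 16-byte aligned blocks they could become aligned loads. */
static void dot3x4(float out[4], const Vec3x4 *a, const Vec3x4 *b)
{
    __m128 d = _mm_mul_ps(_mm_loadu_ps(a->x), _mm_loadu_ps(b->x));
    d = _mm_add_ps(d, _mm_mul_ps(_mm_loadu_ps(a->y), _mm_loadu_ps(b->y)));
    d = _mm_add_ps(d, _mm_mul_ps(_mm_loadu_ps(a->z), _mm_loadu_ps(b->z)));
    _mm_storeu_ps(out, d);
}
```

Compared with x,y,z,w structs, this layout wastes no memory and avoids horizontal adds, at the price of processing vectors in groups of four.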

Lastly, wouldn't you rather see if you can contribute to my Virtual SIMD? I am very close to launching it "semi-publicly", possibly on SourceForge soon. It's open source, but I had to keep it underground as a long one-man preparatory job, because it was far too complex, dynamic and experimental to fit well with team work. I also had to learn a lot of details and find a lot of ideas. Now I am far closer to the right, stable project architecture, which is why I am considering opening it to some contributors. Still, the official version 1.0 will not come before next year.

Briefly presented: VSIMD extends the instruction sets a lot and makes them standard and cross-platform.

Virtual SIMD layer:
r = sin_4f(s); // Works on a Pentium, K6, P3, G4, etc.

Then the library will have higher layers based on this VSIMD.

Linear algebra layer:
r = s^t + Quat*Point;

Geometric layer:
d = dist(Sphere, OBBox);
