SSE, 3D Vectors, and the like

5 comments, last by quasar3d 19 years, 6 months ago
I'm working with SSE1, trying to optimise a math library, and there are some questions I can't answer. If you folks could help me out I'd be most appreciative [smile] 1) Are there SSE equivalents to the x87 FSIN/FCOS/FSINCOS instructions, and if so, are they much faster when using scalars? 2) Is there a general listing of operation expenses (i.e. how many cycles each instruction takes to execute) around? Is it something that differs from chip to chip? 3) How do I un/pack a three-float vector (12 bytes) into a four-float register (xmm) efficiently? Cheers in advance...

Richard "Superpig" Fine - saving pigs from untimely fates - Microsoft DirectX MVP 2006/2007/2008/2009
"Shaders are not meant to do everything. Of course you can try to use it for everything, but it's like playing football using cabbage." - MickeyMouse

Quote:Original post by superpig
I'm working with SSE1, trying to optimise a math library, and there are some questions I can't answer. If you folks could help me out I'd be most appreciative [smile]

1) Are there SSE equivalents to the x87 FSIN/FCOS/FSINCOS
instructions, and if so, are they much faster when using scalars?

No.
Quote:
2) Is there a general listing of operation expenses (i.e. how many cycles each takes to execute) around? Is it something that differs from chip to chip?

It differs from chip to chip. Not only that, but there are other performance penalties that can nail you (such as 64K address aliasing on some P4 chips).
Quote:
3) How do I un/pack a three-float vector (6 bytes) into a four-float register (xmm) efficiently?

Pad it with one float; barring that, move 2 then 1 (using MOVLPS / MOVHPS, etc.)
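As a sketch of the "move 2 then 1" approach with SSE1 intrinsics (the helper name load_vec3 is mine, not from the thread; this assumes a compiler that provides _mm_loadl_pi):

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Load an unpadded x,y,z triple into an XMM register in two steps:
   MOVLPS for the low x,y pair, MOVSS for z, then MOVLHPS to combine.
   The fourth lane ends up as zero. */
static __m128 load_vec3(const float *v)
{
    __m128 xy = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)v); /* movlps: x,y into low half */
    __m128 z  = _mm_load_ss(v + 2);                               /* movss: z into lane 0 */
    return _mm_movelh_ps(xy, z);                                  /* movlhps: pack as x,y,z,0 */
}
```

Only the two-float load touches memory beyond a single element, so this never reads past the end of the struct, which avoids the access-violation worry mentioned later in the thread.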

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

1. There's no equivalent, but you can do better even without it.

2. Probably not; it will depend on the CPU architecture. The vendors may provide such a list.

3. For vectors I use movups, so one float in the register is undefined after the move.
If you are aiming for speed and you don't mind using more memory, you can pad the vector with one float and align it on a 16-byte boundary. Then you can use movaps, which should be faster than movups.
It will look like:

__declspec(align(16)) struct VectorPadded
{
    float x;
    float y;
    float z;
    float pad;
};
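With that padded layout, a whole vector loads in one aligned movaps. A minimal sketch using intrinsics (GCC-style alignment syntax shown here, since the post above uses MSVC's __declspec; the helper name load_padded is mine):

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Padded, 16-byte-aligned vector: one aligned load replaces the
   unaligned movups or the two-step movlps/movss sequence. */
typedef struct {
    float x, y, z, pad;
} __attribute__((aligned(16))) VectorPadded;

static __m128 load_padded(const VectorPadded *v)
{
    return _mm_load_ps(&v->x);  /* movaps: requires 16-byte alignment */
}
```

The pad member costs 4 bytes per vector but keeps every element on an aligned boundary when the vectors are stored in an array.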

And did you write a cross product in SSE? I had problems with that and used a lot of swizzling to calculate it.
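For reference, the usual SSE cross product does take a fair amount of swizzling: two shuffles on the inputs and one on the result. A hedged sketch (the function name cross3 is mine; lane 3 of the inputs is ignored):

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Cross product of two 3D vectors held as (x,y,z,_) in XMM registers.
   Uses the identity a x b = (a * b.yzx - a.yzx * b).yzx, which needs
   only three shuffles instead of the naive four. */
static __m128 cross3(__m128 a, __m128 b)
{
    __m128 a_yzx = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1)); /* (ay,az,ax,aw) */
    __m128 b_yzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1)); /* (by,bz,bx,bw) */
    __m128 c = _mm_sub_ps(_mm_mul_ps(a, b_yzx), _mm_mul_ps(a_yzx, b));
    return _mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 0, 2, 1));         /* rotate back to x,y,z order */
}
```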
Quote:Original post by b2b3
3. For vectors I use movups, so one float in the register is undefined after the move.
Ah, right... I was thinking about that but worrying about access violation faults.

Of course I guess I'd still have to unpack in two steps.

The CPU architecture in question, btw, is based on a Pentium III. Problem is I don't know quite how similar it is... but it gives me a place to start looking.

Cheers, both of you.

Richard "Superpig" Fine - saving pigs from untimely fates - Microsoft DirectX MVP 2006/2007/2008/2009
"Shaders are not meant to do everything. Of course you can try to use it for everything, but it's like playing football using cabbage." - MickeyMouse

Look at the Intel Architecture Optimization Reference Manual for instruction latency and throughput information for various processors. If this CPU that's 'based on a Pentium III' is in a console then you should be able to get most of this information from the console manufacturer (assuming you're a registered developer) - they're generally pretty helpful.

Intel have a lot of helpful performance information on their site, but it's not very well organised and it can be difficult to find what you're looking for. Spending some time hunting around there is probably time well spent, though.

Game Programming Blog: www.mattnewport.com/blog

Quote:Original post by superpig
I'm working with SSE1, trying to optimise a math library, and there are some questions I can't answer. If you folks could help me out I'd be most appreciative [smile]

1) Are there SSE equivalents to the x87 FSIN/FCOS/FSINCOS instructions, and if so, are they much faster when using scalars?

2) Is there a general listing of operation expenses (i.e. how many cycles each takes to execute) around? Is it something that differs from chip to chip?

3) How do I un/pack a three-float vector (12 bytes) into a four-float register (xmm) efficiently?

Cheers in advance...


1) No. I suppose you need it for Euler angles or quaternion SLERP. But it can be done very well with a Taylor series. I also have some precise ideas on how to redefine the whole of math.h on 3DNow and SSE - for instance, a very efficient implementation of degree-5 polynomials, highly parallel and well scheduled. I also envision tricks based on the quick rsqrt or rcp instructions (playing with the Taylor coefficients). For instance, I found a page on the web describing a fast acos based on rsqrt - very astute and elegant. (Google it.)
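As a sketch of the Taylor-series idea: a degree-5 polynomial sine over four packed floats, evaluated with Horner's scheme. This assumes inputs are already range-reduced to small |x| (roughly below pi/4); a production routine would range-reduce first and use tuned minimax coefficients rather than plain Taylor terms. The function name sin4_taylor is illustrative.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* sin(x) ~= x * (1 + x^2 * (-1/6 + x^2 * (1/120)))
   Four sines at once; accurate to ~1e-6 for small |x|. */
static __m128 sin4_taylor(__m128 x)
{
    __m128 x2 = _mm_mul_ps(x, x);
    __m128 p  = _mm_set1_ps(1.0f / 120.0f);                        /* x^4 coefficient */
    p = _mm_add_ps(_mm_mul_ps(p, x2), _mm_set1_ps(-1.0f / 6.0f));  /* Horner step */
    p = _mm_add_ps(_mm_mul_ps(p, x2), _mm_set1_ps(1.0f));
    return _mm_mul_ps(x, p);
}
```

The same Horner structure extends to any odd polynomial, which is what makes the "efficient degree-5 polynomials" approach mentioned above attractive for a whole family of transcendental functions.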

2) Yes - the Intel or AMD docs ;) From chip to chip... yes, but it's not usually so vital. Every machine behaves more or less the same way in practice; clock counts will be quite close.

3) Frankly, consider replacing your old x,y,z structs with x,y,z,w. A 33% waste of memory, but really worth the price paid. Otherwise consider SoA conversions - or, more particularly, hybrid layouts: x0,x1,x2,x3, y0,y1,y2,y3, etc. Plain x,y,z structs can be handled, but you pay a considerable overhead, or else you'll need to complicate and unroll your loops a lot.
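A minimal sketch of that hybrid SoA layout, assuming GCC-style alignment syntax (the names Vec3x4 and dot4 are illustrative, not from the thread):

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Four 3D vectors stored component-wise: x0..x3, y0..y3, z0..z3.
   Each component array loads straight into a register with no swizzling. */
typedef struct {
    float x[4], y[4], z[4];
} __attribute__((aligned(16))) Vec3x4;

/* Four dot products at once - just multiplies and adds in SoA form. */
static __m128 dot4(const Vec3x4 *a, const Vec3x4 *b)
{
    __m128 d = _mm_mul_ps(_mm_load_ps(a->x), _mm_load_ps(b->x));
    d = _mm_add_ps(d, _mm_mul_ps(_mm_load_ps(a->y), _mm_load_ps(b->y)));
    d = _mm_add_ps(d, _mm_mul_ps(_mm_load_ps(a->z), _mm_load_ps(b->z)));
    return d;
}
```

Compare this with the AoS cross product earlier in the thread: in SoA form the shuffles disappear entirely, which is where the throughput advantage comes from.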

Last: wouldn't you rather see if you can contribute to my Virtual SIMD? I am very close to launching it "semi-publically" - possibly on SourceForge soon. It's open source, but I had to keep it underground (a long one-man preparatory job) because it was far too complex, dynamic and experimental to fit well with team work. I also had to learn a lot of details and find a lot of ideas. Now I am far closer to the right and stable project architecture; that's why I am considering opening it to some contributors. Still, no official version 1.0 before next year.

Briefly: VSIMD extends the instruction sets a lot and makes them standard and cross-platform:

Virtual SIMD layer :
r =sin_4f(s); // Works on a Pentium, K6, P3, G4, etc...

Then the library will have higher layers based on this VSIMD :

Linear algebra layer :
r = s^t + Quat*Point;

Geometric layer :
d = dist(Sphere, OBBox);
"Coding math tricks in asm is more fun than Java"
cpuid also has information about latency and throughput, and it may give a better overview than the Intel or AMD docs.

