# SSE2 vector "optimizations"

## Recommended Posts

I'm doing research on photorealistic rendering and, in an attempt to squeeze out as much performance as possible, I am looking into creating an SSE2-optimized 3D vector class. I expected SSE2 to speed things up, if only slightly, but it is actually about 4/3 slower. There are obviously more function calls (to the intrinsics), but I assumed that the optimizations would more than make up for this. Does anyone have any idea why this might be? I'm using MSVS.NET 2003. I'll post code if that will help.

##### Share on other sites
It's the Visual Studio .NET compiler that slows things down. I suggest trying the same code with Visual Studio 6.0; I hope that will speed things up. If that doesn't work, just email me. I have some SSE and MMX optimized routines (the SSE ones are still untested). No info on SSE2, though.

##### Share on other sites
Check the asm that's being generated. It's possible that you're loading and storing the xmm registers on each call to your SSE vector class. Perhaps you can post the asm generated by the following program?
Vector v1(1,2,3);
Vector v2(4,5,6);
Vector v3 = v1 + v2;
v1 = v1 + v3;
v2 = v2 + v3;

Ideally, it should be something like
If it looks like this:
store v2, blah
store v1, blah
store v2, blah

then you're in trouble

##### Share on other sites

    ; 60 : return Vector( x + rkVector.x, y + rkVector.y, z + rkVector.z );

    mov eax, DWORD PTR _rkVector$[esp]
    movss xmm0, DWORD PTR [eax+12]
    movss xmm1, DWORD PTR [eax+8]
    movss xmm2, DWORD PTR [eax+4]
    mov eax, DWORD PTR ___$ReturnUdt$[esp]
    addss xmm0, DWORD PTR [ecx+12]
    addss xmm1, DWORD PTR [ecx+8]
    addss xmm2, DWORD PTR [ecx+4]
    mov DWORD PTR $T21584[esp+4], 0
    mov DWORD PTR [eax], OFFSET FLAT:??_7Vector@@6B@
    movss DWORD PTR [eax+4], xmm2
    movss DWORD PTR [eax+8], xmm1
    movss DWORD PTR [eax+12], xmm0

SSE2 "optimized":

    ; 54 : Vector kVector;
    ; 55 : __m128 vec;
    ; 56 :
    ; 57 : vec = _mm_add_ps( xyz, rkVector.xyz );

    movaps xmm0, XMMWORD PTR [ecx+16]
    mov eax, DWORD PTR _rkVector$[ebp]
    movaps xmm1, XMMWORD PTR [eax+16]

    ; 58 : kVector.xyz = vec;
    ; 59 : return kVector;

    mov eax, DWORD PTR ___$ReturnUdt$[ebp]
    addps xmm0, xmm1
    movaps XMMWORD PTR [eax+16], xmm0
    movss xmm0, DWORD PTR _kVector$[esp+96]
    movss DWORD PTR [eax+32], xmm0
    movss xmm0, DWORD PTR _kVector$[esp+100]
    movss DWORD PTR [eax+36], xmm0
    movss xmm0, DWORD PTR _kVector$[esp+104]
    mov DWORD PTR $T21612[esp+64], 0
    mov DWORD PTR [eax], OFFSET FLAT:??_7Vector@@6B@
    movss DWORD PTR [eax+40], xmm0

[edited by - kddak on April 12, 2004 12:07:02 PM]

[edited by - kddak on April 12, 2004 12:07:38 PM]

bump?

##### Share on other sites
It looks like it's doing way too much data shuffling. Can we see the code for your vector assignment operator and copy constructor? We might be able to suggest a way to fix it.

##### Share on other sites
    #ifndef VECTOR_H
    #define VECTOR_H

    #include <emmintrin.h>
    //#include <math.h>

    class Vector
    {
    public:
        Vector( float fx = 0, float fy = 0, float fz = 0 ) { xyz = _mm_set_ps( fx, fy, fz, 0 ); }
    //  Vector( float fx = 0, float fy = 0, float fz = 0 ) : x( fx ), y( fy ), z( fz ) { }
        Vector( const Vector &rkVector ) : xyz( rkVector.xyz ) { }
    //  Vector( const Vector &rkVector ) : x( rkVector.x ), y( rkVector.y ), z( rkVector.z ) { }
        virtual ~Vector() { }

        __inline float Length();
    //  Vector Normalize();
    //  Vector Cross( const Vector &rkVector );
        float Dot( const Vector &rkVector );

    //  __inline Vector operator *( float fMul ) { Vector kVec; kVec.x = x * fMul; kVec.y = y * fMul; kVec.z = z * fMul; return kVec; }
        __inline float operator *( const Vector &rkVector ) { return Dot( rkVector ); }
    //  __inline Vector operator %( const Vector &rkVector ) { return Cross( rkVector ); }
        __inline Vector operator +( const Vector &rkVector );
        __inline Vector operator -( const Vector &rkVector );
    //  __inline float operator []( unsigned int uiIndex ) { unsigned uiValue = uiIndex % 3; if( uiValue == 0 ) return x; if( uiValue == 1 ) return y; return z; }

        __m128 xyz;
    //  float x, y, z;
    };

    __inline float Vector::Dot( const Vector &rkVector )
    {
        __m128 vec1 = _mm_mul_ps( xyz, rkVector.xyz ), vec2, vec3;
        float fRet;
        vec2 = _mm_shuffle_ps( vec1, vec1, _MM_SHUFFLE( 1, 2, 3, 0 ) );
        vec3 = _mm_shuffle_ps( vec1, vec1, _MM_SHUFFLE( 2, 3, 0, 1 ) );
        _mm_store_ss( &fRet, _mm_add_ps( vec3, _mm_add_ps( vec1, vec2 ) ) );
        return fRet;
    //  return ( x * rkVector.x ) + ( y * rkVector.y ) + ( z * rkVector.z );
    }

    __inline float Vector::Length()
    {
        __m128 vec1 = _mm_mul_ps( xyz, xyz ), vec2, vec3;
        float fRet;
        vec2 = _mm_shuffle_ps( vec1, vec1, _MM_SHUFFLE( 1, 2, 3, 0 ) );
        vec3 = _mm_shuffle_ps( vec1, vec1, _MM_SHUFFLE( 2, 3, 0, 1 ) );
        _mm_store_ss( &fRet, _mm_sqrt_ss( _mm_add_ps( vec3, _mm_add_ps( vec1, vec2 ) ) ) );
        return fRet;
    //  return sqrtf( ( x * x ) + ( y * y ) + ( z * z ) );
    }

    __inline Vector Vector::operator +( const Vector &rkVector )
    {
        Vector kVector;
        __m128 vec;
        vec = _mm_add_ps( xyz, rkVector.xyz );
        kVector.xyz = vec;
        return kVector;
    //  return Vector( x + rkVector.x, y + rkVector.y, z + rkVector.z );
    }

    __inline Vector Vector::operator -( const Vector &rkVector )
    {
        Vector kVector;
        __m128 vec;
        vec = _mm_sub_ps( xyz, rkVector.xyz );
        kVector.xyz = vec;
        return kVector;
    //  return Vector( x - rkVector.x, y - rkVector.y, z - rkVector.z );
    }

    #endif
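
For comparison, here is a minimal sketch of a horizontal sum that could back a dot product for vectors built with `_mm_set_ps( fx, fy, fz, 0 )`. The names `HorizontalSum` and `Dot3` are made up for illustration. It may be worth checking the shuffle masks in the posted `Dot` against the float version: as far as I can tell, lane 0 there ends up as `2 * vec1[0] + vec1[1]`, and lane 0 is the padding lane.

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Sums all four lanes of v and returns the total as a float.
inline float HorizontalSum(__m128 v)
{
    // Swap neighbouring lanes: (v1, v0, v3, v2).
    __m128 swapped = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    // Pair sums: (v0+v1, v0+v1, v2+v3, v2+v3).
    __m128 pairs = _mm_add_ps(v, swapped);
    // Bring the upper pair sum down to lane 0.
    __m128 upper = _mm_movehl_ps(pairs, pairs);
    // Lane 0 now holds (v0+v1) + (v2+v3).
    return _mm_cvtss_f32(_mm_add_ss(pairs, upper));
}

// Dot product of two vectors whose unused lane is zero,
// e.g. both built with _mm_set_ps(x, y, z, 0).
inline float Dot3(__m128 a, __m128 b)
{
    return HorizontalSum(_mm_mul_ps(a, b));
}
```

With (1, 2, 3) and (4, 5, 6) this gives 1*4 + 2*5 + 3*6 = 32, the same as the commented-out float expression.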

##### Share on other sites
You forgot the alignment directive in your class:
__declspec(align(16))

Once that's done it should be MUCH faster.
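
A small sketch of what the directive buys you, assuming a modern compiler: `alignas(16)` is used here as the standard C++11 stand-in for `__declspec(align(16))`, and `AlignedVec3`, `IsAligned16` and `Add` are illustrative names only.

```cpp
#include <xmmintrin.h>
#include <cstdint>

// alignas(16) is the standard spelling; on MSVC .NET 2003 the
// equivalent would be __declspec(align(16)).
struct alignas(16) AlignedVec3
{
    float x, y, z, pad; // pad fills the fourth SSE lane
};

// True when p can be used with the aligned load/store instructions (movaps).
inline bool IsAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

inline AlignedVec3 Add(const AlignedVec3& a, const AlignedVec3& b)
{
    AlignedVec3 r;
    // _mm_load_ps and _mm_store_ps require 16-byte aligned addresses.
    _mm_store_ps(&r.x, _mm_add_ps(_mm_load_ps(&a.x), _mm_load_ps(&b.x)));
    return r;
}
```

Because the type itself carries the alignment, locals on the stack and elements of arrays of `AlignedVec3` are aligned too, which is what lets the aligned load and store be used safely.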

##### Share on other sites
I have MSVS.NET set to auto-align structures to 16 bytes, isn't that sufficient?

##### Share on other sites
Further "research" has shown that the SSE2 vector class is about 2-3x faster with the actual calculations, but loading the floating point values into the __m128 in the constructor takes significantly longer to the point where the entire process takes quite a bit longer than just using straight float operations. Is there a better (read "faster") way to load the floats into the __m128?
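
One possible answer, as a sketch rather than a measured result: building a register from separate scalars with `_mm_set_ps` usually costs several instructions, while four floats that already sit contiguously in memory can be loaded in one. The helper names are hypothetical, and note that the lane order here puts x in lane 0, unlike the posted constructor.

```cpp
#include <xmmintrin.h>

// Builds the register from separate scalars; the compiler typically
// needs several moves and unpacks/shuffles to do this.
inline __m128 FromScalars(float x, float y, float z)
{
    return _mm_set_ps(0.0f, z, y, x); // lane 0 = x, lane 1 = y, lane 2 = z
}

// Loads four contiguous floats at once; p must be 16-byte aligned.
inline __m128 FromAligned(const float* p)
{
    return _mm_load_ps(p);
}

// Same as above without the alignment requirement.
inline __m128 FromUnaligned(const float* p)
{
    return _mm_loadu_ps(p);
}
```

The cheapest option of all is not to convert: if the data lives in `__m128` (or aligned float[4]) form from the start, the constructor cost mostly disappears.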

##### Share on other sites
Try disabling exceptions. They are on by default, and .NET doesn't like returning classes by value with exceptions on.

Also try something like this for your add function (you will need a constructor that takes an __m128; it should take it by value):

__inline Vector Vector::operator +( const Vector &rkVector )
{
}

And you are testing this in an optimised build, aren't you? Not a debug one?
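
The function body did not survive in the post above, so the following is only a guess at the shape being suggested: a constructor that takes the `__m128` by value, and operators that return a temporary built directly from the intrinsic result. `Vec3` is a hypothetical stand-in for the posted `Vector`; dropping the virtual destructor is my own assumption, made because the vtable pointer store shows up in the posted asm.

```cpp
#include <xmmintrin.h>

class Vec3
{
public:
    Vec3(float fx = 0, float fy = 0, float fz = 0)
        : xyz(_mm_set_ps(fx, fy, fz, 0)) {}

    // Takes the register by value so no temporary object is needed.
    explicit Vec3(__m128 v) : xyz(v) {}

    // Returning a freshly constructed temporary gives the compiler the
    // best chance to build the result in place instead of copying it.
    Vec3 operator+(const Vec3& rhs) const { return Vec3(_mm_add_ps(xyz, rhs.xyz)); }
    Vec3 operator-(const Vec3& rhs) const { return Vec3(_mm_sub_ps(xyz, rhs.xyz)); }

    __m128 xyz; // lane 3 = x, lane 2 = y, lane 1 = z, lane 0 = 0
};
```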

##### Share on other sites
No, it's not only about optimisation settings. You really have to use __declspec(align(16)); the movss stuff comes from this misalignment. General structure alignment does not apply to local variables pushed on the stack. It only ensures the stride between structures in arrays and the alignment of members inside your structures. This declspec, which comes from the Intel libs, adds preambles that align the stack (esp) in function scopes. Also beware that the standard new won't align for you. You must redefine the new and delete operators for these classes and use the _aligned_malloc and _aligned_free routines.
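
A minimal sketch of the new/delete part, under a couple of assumptions: `_mm_malloc` and `_mm_free` are used as the portable counterparts of `_aligned_malloc` and `_aligned_free`, and `HeapVec` and `IsAligned16` are names invented for the example. Array forms (`new[]` and `delete[]`) are left out.

```cpp
#include <xmmintrin.h> // also provides _mm_malloc / _mm_free
#include <cstddef>
#include <cstdint>
#include <new>

struct HeapVec
{
    __m128 xyz;

    // Class-specific allocation so heap objects are 16-byte aligned.
    static void* operator new(std::size_t size)
    {
        void* p = _mm_malloc(size, 16);
        if (!p) throw std::bad_alloc();
        return p;
    }

    static void operator delete(void* p)
    {
        _mm_free(p);
    }
};

// True when p is suitable for aligned SSE loads and stores.
inline bool IsAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}
```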

The asm code should simply be :

(movaps ; write)

Load and stores will be removed by the optimizer if you use temp variables. The compiler must keep temp vars in registers.

I have solved equivalent problems, and much more, because I am writing a high-performance lib. I'll soon release the docs on a dedicated web site. I'll be looking for contributors for this vast open source project. My current results approach handwritten asm quality while only standard C/C++ code needs to be written.

There are many subtleties, involving for instance cast operators, in obtaining quasi-perfect asm output. Another thing I do is that I don't use floats anymore except with explicit casts. I use a scalar type instead that maps to float, __m64 or __m128 depending on the preprocessing settings of my library. Float conversions would cost too much and kill most of the advantages of using SIMD. That's why I did it.

Here is what it looks like:

scalar d; // a __m128 for a SSE compatible target
float4 U,V,W;
d = U*V;
W = d*(U^V);

For instance W=U+V in an unrolled loop runs in 2 cycles for 2 reads, 1 add, 1 write. That is quasi-optimal code.
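
To make the "scalar stays in a register" idea concrete, here is a rough sketch for the SSE case only. It is not the VSIMD API, which the thread does not show; `Scalar`, `Float4`, `Dot` and `ToFloat` are hypothetical names. The point is that the dot product result is broadcast to all four lanes, so it can be multiplied back into a vector without a round trip through a float.

```cpp
#include <xmmintrin.h>

struct Scalar { __m128 v; }; // the same value replicated in all four lanes
struct Float4 { __m128 v; };

// Dot product whose result never leaves the SSE register file.
inline Scalar Dot(Float4 a, Float4 b)
{
    __m128 m = _mm_mul_ps(a.v, b.v);
    // Add neighbouring lanes, then add the two halves: every lane ends
    // up holding m0 + m1 + m2 + m3.
    __m128 s = _mm_add_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1)));
    s = _mm_add_ps(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 0, 3, 2)));
    return Scalar{ s };
}

// Scaling a vector by a Scalar is a single mulps.
inline Float4 operator*(Scalar d, Float4 a)
{
    return Float4{ _mm_mul_ps(d.v, a.v) };
}

// Conversion to float is explicit, used only when really needed.
inline float ToFloat(Scalar d)
{
    return _mm_cvtss_f32(d.v);
}
```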

[edited by - Charles B on April 15, 2004 11:10:21 AM]

##### Share on other sites
It is advisable not to use intrinsics, and not to use an SSE 'optimized' vector class. Instead, locate the real bottleneck and focus on that using inline assembly.

The advantages are that you waste much less time, because there are only one or two real bottlenecks in any application. Because you can then optimize the whole bottleneck and not one vector operation at a time, you can keep a lot of data in registers, instead of constantly writing it to __m128 memory and reading it back for the next operation. Furthermore, you can parallelize operations, so you can eliminate many shuffle operations.

##### Share on other sites
Yes, unless someone makes the effort to solve all the issues once and for all. That's what I did, and now I'll ask for contributions from the community (lib soon online) and the major hardware manufacturers. Math code is everywhere in a 3D application. A bottleneck rarely consumes 95% of the time. Even when it does, optimizing it makes the rest of the code far more prominent, leaving room for other optimizations. So having a factor 4 (SSE) speed-up everywhere is really worth it.

You'll change your mind when I publish the papers based on my next-gen terrain renderer, built on VSIMD, my portable C/C++ abstraction of SIMD assembly or C intrinsics. Using both the CPU (SIMD) and the GPU to the max is what's required for next-gen unlimited graphics and physics.

##### Share on other sites
USE D3DX!! DON'T REINVENT THE WHEEL [except if you're charles].

AMD and Intel already optimize these; they have full-timers who know their microarchitecture very well doing this stuff.

E.g. Did you know that a movlps and movhps is faster than a movups?! [on Opterons anyway].
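
In intrinsics terms, the two alternatives look roughly like this (a sketch; the function names are made up, and which one is faster depends on the CPU, so measure before relying on it):

```cpp
#include <xmmintrin.h>

// Unaligned 128-bit load as a single instruction (movups).
inline __m128 LoadUnaligned(const float* p)
{
    return _mm_loadu_ps(p);
}

// The same load split into two 64-bit halves (movlps + movhps).
inline __m128 LoadHalves(const float* p)
{
    __m128 v = _mm_setzero_ps();
    v = _mm_loadl_pi(v, reinterpret_cast<const __m64*>(p));     // lanes 0-1
    v = _mm_loadh_pi(v, reinterpret_cast<const __m64*>(p + 2)); // lanes 2-3
    return v;
}
```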

Use intrinsics, MS is getting rid of inline asm =[. But this should be a good thing, as the compiler can do the register allocation and scheduling. Right now the MS intrinsics compiler isn't that good; it sucks IMO. But it'll get better.

Also, for this type of stuff you should try to move to SSE3 or use 3DNow.

pfacc comes in handy or haddps in sse3.

I believe that Charles B is up to something good, and that library is worth working on. But if you don't have a good idea of what you're doing, just go ahead and use D3DX. It's fast. Focus on developing your other skills, ray tracing code, etc. Don't focus on the little details. Use a library instead. To write a vector library, it could take months!! If you want to write one correctly that is.

##### Share on other sites
Maybe you can get good performance by writing a plain C++ implementation then using Intel's compiler to generate code which uses CPU-specific optimisations?

If you are an academic or individual I think you can generally get free-use licences.

Just a thought. Certainly save you learning a bunch of dodgy asm.

Mark

##### Share on other sites
kddak, it would be cool if you could break up the two longest lines of your code sample post. It sucks; one can't read the messages without a 21" screen.

@ngill

> USE D3DX!! DON'T REINVENT THE WHEEL except if you're charles.
> AMD and Intel already optimize these; they have full-timers who know their microarchitecture very well doing this stuff.

There are three things :
a) the class/function libs
b) the intrinsic libs
c) and the compilers/vectorizers.

a) Only small subsets of functions are available, mostly basic linear algebra such as matrix, quaternion and vertex transforms. That's only a very limited help for most complex 3D applications. Everyone will benefit from my C/C++ inline classes and functions. For instance:

xAABox Box(pVerts, nVerts);
//... somewhere further, maybe clipping

if( plane[ i ].dist(Box) < epsilon ) // around 5 clock cycles
{
// I doubt anyone will try to reinvent the wheel here
// I provide the best possible code for you
// in a compatible multiplatform way
}

b) It's what kddak is doing. So he still has to reinvent the wheel to make it work efficiently in C++

c) Vectorizers won't do SoA for you. They are a plus, but letting one write standard float x,y,z code certainly won't lead to great output code, whatever the strength of the compiler. The intelligence of the user, and his responsibility to shape the data in convenient ways, has to be involved. No compiler can make decisions at this intermediate level of code. Those who bow to the lazy "God MS VC++ does it all for me" forget this essential parameter.
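
To illustrate what SoA (structure of arrays) means here, a minimal sketch with hypothetical names `Vec3x4` and `Dot4`: four vectors are stored component-wise, so four dot products come out of three multiplies and two adds, with no shuffles at all.

```cpp
#include <xmmintrin.h>

// Structure of arrays: lane i of x, y and z together form vector i.
struct Vec3x4
{
    __m128 x, y, z;
};

// Four dot products at once, one result per lane.
inline __m128 Dot4(const Vec3x4& a, const Vec3x4& b)
{
    __m128 r = _mm_mul_ps(a.x, b.x);
    r = _mm_add_ps(r, _mm_mul_ps(a.y, b.y));
    r = _mm_add_ps(r, _mm_mul_ps(a.z, b.z));
    return r;
}
```

Compare this with the x,y,z-per-register layout, where every dot product needs a horizontal sum and one lane is wasted on padding.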

> E.g. Did you know that a movlps and movhps is faster than a movups?! [on Opterons anyway].

Yes, but even a K6 can do great things. Quake3 did not use 3DNow, which is sad. This should be the responsibility of the compiler here anyway.

> Use intrinsics, MS is getting rid of inline asm =[. But this should be a good thing as the compiler can register allocate/schedule.

Right, that's the real motivation for intrinsics, according to the MS/Intel pdf. Compared to frozen inline asm, it enables the larger-scale optimizations you mentioned, which are crucial. Oddly, asm is only really useful for big functions or functions with loops, such as array processing. My lib will include dedicated SoA and AoS classes. That's exactly why I saw an opportunity for a lib based on inlines and macros that would exploit intrinsics with a long-term view. Also, vectorizers can't hurt; my VSIMD lib does not compete with them, it can work alongside them.

> Right now the MS intrinsics compiler isn't that good, it sucks IMO. But, it'll get better.

Well, surprisingly, not that much. There is only a 10% difference between my C++ test routines and my handwritten, unrolled, scheduled reference asm routines. I spent entire weeks achieving this. Truly, the code is not pretty, but with modern CPUs and their instruction reordering capabilities even the dirty VC6 does quite well according to rdtsc benchmarks. That's why I can now assert that my VSIMD is really worth the challenge. Write one piece of C++ code and you get several REALLY optimized output .exes (586, 3DNow, SSE, AltiVec). Add the many math tricks and subtleties included in the lib's code base and you'll easily get huge speed-up factors.

> Also, for this type of stuff you should try to move to SSE3 or use 3DNow.
> pfacc comes in handy or haddps in sse3.

Right, this instruction makes 3DNow competitive with SSE for the widely used dot product or plane equations. However, in my lib, SSE or AltiVec in practice still really brings something despite the shuffle stuff. Anyway, really speedy stuff should rely on explicit SoA and parallelism. Then SSE is a Ferrari.

> I believe that Charles B is up to something good, and that library is worth working on.

> But if you don't have a good idea of what you're doing, just go ahead and use D3DX. It's fast. Focus on developing your other skills, ray tracing code, etc. Don't focus on the little details. Use a library instead.

That's actually why I took the bet and did it. In our game project, we plan many original technologies, such as an infinite(ly precise) terrain and a great particle renderer. So it was worth spending a lot of time on math stuff. Math is everywhere in these two pieces of 3D software. So now I hope we'll have contributors, so that this lib becomes solid enough to be shared and developed by many, possibly with the support of big names.

> To write a vector library, it could take months!!

Actually it took 6 months at nearly 50% of my full time to achieve what I have now. At the start I was really overoptimistic. But now when I see the results I am really satisfied, because I can see how fast I'll be able to develop complex, highly optimized code on the first pass.

> If you want to write one correctly that is

Yes, otherwise there are a bunch of libs that mostly exploit the 586 FPU and not much more on higher processors. Understanding the approach required to use SIMD to the max is not very straightforward or intuitive, especially given the lack of docs and the poor and unpredictable job done by the various C++ compilers.

[edited by - Charles B on April 16, 2004 10:27:04 AM]

[edited by - Charles B on April 16, 2004 10:31:58 AM]