# SSE in Assembly - wrong output


## Recommended Posts

Hi, I put together a small app to try using SSE in assembly, but I am getting the wrong output.
These are the .cpp file and the .asm file:

    #define WIN32_LEAN_AND_MEAN
    #include <iostream>
    using namespace std;

    extern "C" int *myFunc(float *a, float *b, float *result);

    int main(int argc, char *argv[])
    {
        float inA[] = {1.0f, 2.0f, 3.0f, 4.0f};
        float inB[] = {1.0f, 2.0f, 3.0f, 4.0f};
        float ret[4];

        myFunc(inA, inB, ret);

        cout << ret[0] << endl;
        cout << ret[1] << endl;
        cout << ret[2] << endl;
        cout << ret[3] << endl;

        system("pause");

        return 0;
    }

    .586P
    .XMM
    .MODEL FLAT, C
    .STACK
    .DATA
    .CODE

    myFunc PROC a:DWORD, b:DWORD, result:DWORD
        movups xmm0, [a]
        movups xmm1, [b]
        addps xmm0, xmm1
        movups [result], xmm0
        ret
    myFunc ENDP
    END

I was expecting to get the result

Quote:

    2
    4
    6
    8

but instead I got

Quote:

    -1.07374e+008
    -1.07374e+008
    -1.07374e+008
    -1.07374e+008

Can anyone help me out?

##### Share on other sites
Hey there,

One thing I did notice is that you did not align your data. The aligned SSE loads and stores (movaps, movdqa) only accept 16-byte aligned data.

So I think you need to do a __declspec(align(16)) (if the underscores did not get posted then there are two underscores before declspec).

I've not seen the movups command before. I would first load the address of the data into a normal asm register (eax for example) and then do movdqa xmm0, [eax].

Hope this helps

##### Share on other sites
You should probably try using the compiler intrinsics.
That would let you write all the code inline with your C++ code, without needing the extra assembler file at all.

##### Share on other sites
Hmmmm, I think that a, being a label, is actually the address of the pointer to the floating point data (&a, not a, in C notation). So I think you are currently loading the pointer itself and 96 more garbage bits into xmm0, and you actually need to dereference twice. Something like this:

    mov rcx, [a]
    movups xmm0, [rcx]

Note that in 64-bit code a and b would already be in the CPU's registers (rcx and rdx under the Microsoft x64 calling convention), whereas in a 32-bit cdecl build they are passed on the stack. Compiler SSE intrinsics are much more convenient and are the same at least across GCC, Microsoft and Intel compilers, but some people say that they produce suboptimal code in some cases. I hope I helped...

[Edited by - D_Tr on July 20, 2010 4:42:47 PM]

##### Share on other sites
Kazuo5000, you are right. Someone on another forum also told me that it was because you can't directly dereference a label.

this is allowed:
    mov edi, a
    movups xmm0, [edi]

but this is not:
    movups xmm0, [a]

I would like to know why I have to do an extra copy for nothing though =/

##### Share on other sites
CPPNick, I think it's because you can't directly access the SSE registers like you can with normal CPU registers, but I am not entirely sure on this. You could always try loading in the address of [a] instead using the lea command.

Personally, I don't have a great deal of knowledge in SSE but have a little bit of experience in it.

Again I hope this helps

##### Share on other sites
Right, posts cleaned up.

2) but don't jump down someone else's throat when they do that however.

##### Share on other sites
Is it just me or are these proc declarations a bit confusing? They provide a level of abstraction, which should not be there when you are a beginner. The "pure" way to write a function in assembly is to write the function code under a label and jump to it using the call instruction, accessing the data according to the calling convention you are planning to use (in the OP's case the C calling convention). Doing things the "pure" way will help you understand assembly language better. A great assembly tutorial using nasm (in the form of a short book) can be found here

##### Share on other sites
Just use the intrinsics.

    #include <xmmintrin.h>

    int main()
    {
        __declspec(align(16)) float inA[] = {1.0f, 2.0f, 3.0f, 4.0f};
        __declspec(align(16)) float inB[] = {1.0f, 2.0f, 3.0f, 4.0f};

        __m128 i1 = _mm_load_ps(inA);	// if you don't align it, you have to use _mm_loadu_ps
        __m128 i2 = _mm_load_ps(inB);
        i1 = _mm_add_ps(i1, i2);

        return 0;
    }

##### Share on other sites
Thanks, phantom.

D_Tr: I would rather use MASM... just preference I guess. The way it looks makes more sense to me. I'm past the point of having a hard time understanding assembly, I just need to learn the details and nuances now. I will definitely take a look at the book you recommended... at first glance I am relieved that it's only 195 pages... not nearly as intimidating as "The Art of Assembly" which is over 1400... scary

clashie: I have tried that, and it did seem to help. I have a question though. In an isolated case like your example, you have to move a total of 48 bytes of data to get the data in and back out. Does it really work out to be faster?
I am doing what I am doing for two reasons: one is just to learn some more assembly, and the second is to try and avoid as much overhead as possible and streamline the procedure. I made the inner embedded loop of a function in assembly using the same method as my original post, and it turned out to be slower. Once I examined the disassembly, I found that VC++ was adding another 15 lines of junk before each call to my assembly procedure to protect the contents of the registers I was overwriting. I figured that if I made the whole procedure assembly instead of only the inner embedded loop, I could avoid that.
