SSE in Assembly - wrong output

Started by
10 comments, last by RobTheBloke 13 years, 9 months ago
Hi, I put together a small app to try using SSE in assembly, but I am getting the wrong output.
These are the .cpp file and the .asm file:

#define WIN32_LEAN_AND_MEAN
#include <iostream>
using namespace std;

extern "C" int *myFunc(float *a, float *b, float *result);

int main(int argc, char *argv[])
{
	float inA[] = {1.0f, 2.0f, 3.0f, 4.0f};
	float inB[] = {1.0f, 2.0f, 3.0f, 4.0f};
	float ret[4];

	myFunc(inA, inB, ret);

	cout << ret[0] << endl;
	cout << ret[1] << endl;
	cout << ret[2] << endl;
	cout << ret[3] << endl;

	system("pause");

	return 0;
}

.586P
.XMM
.MODEL FLAT, C
.STACK
.DATA
.CODE

myFunc PROC a:DWORD, b:DWORD, result:DWORD
	movups xmm0, [a]
	movups xmm1, [b]
	addps xmm0, xmm1
	movups [result], xmm0
	ret
myFunc ENDP
END

I was expecting to get the result
2
4
6
8

but instead got
-1.07374e+008
-1.07374e+008
-1.07374e+008
-1.07374e+008

Can anyone help me out?
Hey there,

One thing I did notice is that you did not align your data. The aligned SSE loads and stores can only accept 16-byte aligned data.

So I think you need to do a __declspec(align(16)) (if the underscores did not get posted, there are two underscores before declspec).

I've not seen the movups command before. I would first load the address into a normal asm register (eax for example) and then do movdqa xmm0, [eax].
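For anyone wondering how the alignment rule plays out in practice, here is a minimal C++ sketch using the SSE intrinsics. The helper names (is_aligned_16, load4, sum4) are made up for illustration and are not from the original code. As far as I know, movups is the unaligned form of the load and _mm_loadu_ps is its intrinsic counterpart, while _mm_load_ps (movaps) is the one that needs a 16-byte boundary.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical helper: true when p sits on a 16-byte boundary.
inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: loads 4 floats, using the aligned load only when safe.
inline __m128 load4(const float* p) {
    assert(p != nullptr);
    return is_aligned_16(p) ? _mm_load_ps(p)    // aligned load, needs 16-byte alignment
                            : _mm_loadu_ps(p);  // unaligned load, works at any address
}

// Hypothetical helper: adds the four lanes together so a load can be checked.
inline float sum4(__m128 v) {
    alignas(16) float tmp[4];
    _mm_store_ps(tmp, v);  // tmp is aligned, so the aligned store is fine
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```

Under these assumptions, an unaligned load is not wrong by itself, so missing alignment alone would not be expected to produce garbage values.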

Hope this helps
You should probably try using the compiler intrinsics.
That would let you write all the code inline with your C++ code, without needing the extra assembler file at all.
Hmmmm, I think that a, being a label, refers to the location where the pointer to the floating point data is stored (&a, not a, in C notation). So I think you are currently loading the pointer itself and 96 more garbage bits into xmm0, and you actually need to dereference twice. Something like this:
mov ecx, [a]
movups xmm0, [ecx]
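The same distinction can be shown in C++. This is only a sketch with made-up helper names (read_pointee, read_pointer_bytes), meant to contrast reading the floats that a points to with reading the bytes of the pointer variable itself:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: copies 4 floats from the memory that `a` points to.
// This is the "dereference twice" case: fetch the pointer, then read through it.
inline void read_pointee(const float* a, float out[4]) {
    assert(a != nullptr);
    std::memcpy(out, a, 4 * sizeof(float));
}

// Hypothetical helper: copies the bytes of the pointer variable itself.
// This is roughly where a load from [a] starts reading: the stored address,
// not the floats behind it.
inline void read_pointer_bytes(const float* a, unsigned char out[sizeof(const float*)]) {
    std::memcpy(out, &a, sizeof a);
}
```

The first helper sees 1, 2, 3, 4; the second sees an address, which looks like garbage when interpreted as floats.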

On x64, a and b should already be in the CPU's registers according to the calling convention; with the 32-bit C calling convention that your code uses, they are passed on the stack. Compiler SSE intrinsics are much more convenient and are the same at least across GCC, Microsoft and Intel compilers, but some people say that they produce suboptimal code in some cases. I hope I helped...

[Edited by - D_Tr on July 20, 2010 4:42:47 PM]
Kazuo5000, you are right. Someone on another forum also told me that it was because you can't directly dereference a label.

this is allowed:
mov edi, a
movups xmm0, [edi]

but this is not:
mov xmm0, [a]

I would like to know why I have to do an extra copy for nothing though =/
CPPNick, I think it's because you can't directly access the SSE registers like you can with normal CPU registers, but I am not entirely sure on this. You could always try loading in the address of [a] instead using the lea command.

Personally, I don't have a great deal of knowledge in SSE but have a little bit of experience in it.

Again I hope this helps
Right, posts cleaned up.

1) If you have a question which isn't directly related to the thread, please post a new thread about it.

2) However, don't jump down someone else's throat when they do that.
Is it just me or are these proc declarations a bit confusing? They provide a level of abstraction, which should not be there when you are a beginner. The "pure" way to write a function in assembly is to write the function code under a label and jump to it using the call instruction, accessing the data according to the calling convention you are planning to use (in the OP's case the C calling convention). Doing things the "pure" way will help you understand assembly language better. A great assembly tutorial using nasm (in the form of a short book) can be found here
Just use the intrinsics.

#include <xmmintrin.h>

int main()
{
	__declspec(align(16)) float inA[] = {1.0f, 2.0f, 3.0f, 4.0f};
	__declspec(align(16)) float inB[] = {1.0f, 2.0f, 3.0f, 4.0f};

	__m128 i1 = _mm_load_ps(inA);	//if you don't align it, you have to use _mm_loadu_ps
	__m128 i2 = _mm_load_ps(inB);

	i1 = _mm_add_ps(i1, i2);

	return 0;
}
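To get the sum back out into a float array, as myFunc was meant to do, a store is needed as well. A minimal sketch, assuming a hypothetical add4 function with the same three pointer arguments as myFunc, and using the unaligned load/store so the caller's arrays need no special alignment:

```cpp
#include <cassert>
#include <xmmintrin.h>

// Hypothetical stand-in for myFunc: result[i] = a[i] + b[i] for i = 0..3.
inline void add4(const float* a, const float* b, float* result) {
    assert(a != nullptr && b != nullptr && result != nullptr);
    __m128 va = _mm_loadu_ps(a);                // unaligned load of a[0..3]
    __m128 vb = _mm_loadu_ps(b);                // unaligned load of b[0..3]
    _mm_storeu_ps(result, _mm_add_ps(va, vb));  // add the lanes, store 4 floats
}
```

Called with the two {1, 2, 3, 4} arrays from the first post, this should leave 2, 4, 6, 8 in result.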
Thanks, phantom.

D_Tr: I would rather use MASM... just preference, I guess. The way it looks makes more sense to me. I'm past the point of having a hard time understanding assembly; I just need to learn the details and nuances now. I will definitely take a look at the book you recommended... at first glance I am relieved that it's only 195 pages... not nearly as intimidating as "The Art of Assembly", which is over 1400... scary.

clashie: I have tried that, and it did seem to help. I have a question though. In an isolated case like your example, you have to move a total of 48 bytes of data to get the data in and back out. Does it really work out to be faster?
I am doing what I am doing for two reasons; One is just to learn some more assembly, and the second is to try and avoid as much overhead as possible and streamline the procedure. I made the inner embedded loop of a function in assembly using the same method as my original post, and it turned out to be slower. Once I examined the disassembly, I found that VC++ was adding another 15 lines of junk before each call to my assembly procedure to protect the contents of the registers I was overwriting. I figured that if I made the whole procedure assembly instead of only the inner embedded loop, I could avoid that.
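One way to keep the per-call overhead from dominating is to put the whole loop inside a single function, so each call handles many elements instead of four. Below is a rough sketch with intrinsics; add_arrays and its parameters are hypothetical names, and this is not a measurement, so whether it actually ends up faster would have to be checked in the disassembly and with a profiler.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
inline void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
    std::size_t i = 0;
    // Main loop: four floats per iteration with unaligned loads and stores.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    // Scalar tail for the remaining 0 to 3 elements.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

With the loop inside the function, any register save/restore around the call is paid once per array rather than once per four floats.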

This topic is closed to new replies.
