
# SSE in Assembly - wrong output



Hi, I put together a small app to try using SSE in assembly, but I am getting the wrong output.
These are the .cpp file and the .asm file:

    #define WIN32_LEAN_AND_MEAN
    #include <iostream>
    using namespace std;

    extern "C" int *myFunc(float *a, float *b, float *result);

    int main(int argc, char *argv[])
    {
        float inA[] = {1.0f, 2.0f, 3.0f, 4.0f};
        float inB[] = {1.0f, 2.0f, 3.0f, 4.0f};
        float ret[4];

        myFunc(inA, inB, ret);

        cout << ret[0] << endl;
        cout << ret[1] << endl;
        cout << ret[2] << endl;
        cout << ret[3] << endl;

        system("pause");

        return 0;
    }

    .586P
    .XMM
    .MODEL FLAT, C
    .STACK
    .DATA
    .CODE
    myFunc PROC a:DWORD, b:DWORD, result:DWORD
        movups xmm0, [a]
        movups xmm1, [b]
        addps xmm0, xmm1
        movups [result], xmm0
        ret
    myFunc ENDP
    END

I was expecting to get the result

    2
    4
    6
    8

but instead got

    -1.07374e+008
    -1.07374e+008
    -1.07374e+008
    -1.07374e+008

can anyone help me out?

#### Share this post

##### Share on other sites
Hey there,

One thing I did notice is that you did not align your data. The aligned SSE load and store instructions only accept 16-byte aligned data.

So I think you need to do a __declspec(align(16)) (if the underscores did not get posted then there are two underscores before declspec).

I've not seen the movups command before. I would first load the pointer into a normal register (eax for example) and then do movdqa xmm0, [eax].
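To make the alignment point concrete, here is a minimal sketch using the standard C++11 `alignas(16)` in place of `__declspec(align(16))`. The helper names `isAligned16`, `loadFour` and `sumLanes` are made up for illustration. It assumes an x86 compiler that ships `<xmmintrin.h>`; `_mm_load_ps` is the aligned load and `_mm_loadu_ps` is the unaligned one.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical helper: true when p sits on a 16-byte boundary.
inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: loads four floats, picking the aligned or
// unaligned intrinsic depending on the address.
inline __m128 loadFour(const float* p) {
    assert(p != nullptr);
    if (isAligned16(p)) {
        return _mm_load_ps(p);   // aligned load, requires a 16-byte boundary
    }
    return _mm_loadu_ps(p);      // unaligned load, works at any address
}

// Hypothetical helper: sums the four lanes so a result is easy to check.
inline float sumLanes(__m128 v) {
    alignas(16) float out[4];
    _mm_store_ps(out, v);        // out is 16-byte aligned, so the aligned store is safe
    return out[0] + out[1] + out[2] + out[3];
}
```

With `alignas(16) float buf[8]`, `buf` passes the alignment check while `buf + 1` does not, so the second address has to go through the unaligned load.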

Hope this helps

#### Share this post

##### Share on other sites
You should probably try using the compiler intrinsics.
That would let you write all the code inline with your C++ code, without needing the extra assembler file at all.

#### Share this post

##### Share on other sites
Hmmmm, I think that a, being a label, actually refers to the memory location that holds the pointer to the floating point data (&a, not a, in C notation). So I think you are currently loading the pointer itself plus 96 more garbage bits into xmm0, and you actually need to dereference twice. Something like this:

    mov ecx, [a]        ; ecx = the pointer stored in parameter a (rcx in 64-bit code)
    movups xmm0, [ecx]  ; load the four floats it points to

In 64-bit code, a and b would already be in the CPU's registers according to the calling convention, but with the 32-bit C convention used here they are passed on the stack. Compiler SSE intrinsics are much more convenient and are the same at least across GCC, Microsoft and Intel compilers, but some people say that they produce suboptimal code in some cases. I hope I helped...
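The double dereference can be mimicked in plain C++ as a rough analogy. The `Params` struct below is a hypothetical stand-in for the parameter slots on the stack, not how MASM actually lays out a frame, and `memcpy` stands in for the 16-byte `movups` load.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for the parameter slots as they sit in memory.
// The fourth member only guarantees that 16 bytes are readable on 32-bit targets.
struct Params {
    const float* a;
    const float* b;
    float* result;
    const void* unused;
};

// Roughly what `movups xmm0, [a]` does: copy 16 bytes starting at the
// parameter slot itself, so pointer bits end up where floats were expected.
inline void loadFromSlot(const Params& p, float out[4]) {
    static_assert(sizeof(Params) >= 4 * sizeof(float), "need 16 readable bytes");
    std::memcpy(out, &p, 4 * sizeof(float));
}

// The fix: fetch the pointer first, then load through it.
inline void loadThroughPointer(const Params& p, float out[4]) {
    assert(p.a != nullptr);
    const float* addr = p.a;                     // like: mov edi, a
    std::memcpy(out, addr, 4 * sizeof(float));   // like: movups xmm0, [edi]
}
```

Only `loadThroughPointer` returns the array contents; `loadFromSlot` returns whatever bytes the pointers themselves happen to consist of, which is why the original output looked like garbage.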

[Edited by - D_Tr on July 20, 2010 4:42:47 PM]

#### Share this post

##### Share on other sites
Kazuo5000, you are right. Someone on another forum also told me that it was because you can't directly dereference a label.

this works:

    mov edi, a
    movups xmm0, [edi]

but this does not:

    movups xmm0, [a]

I would like to know why I have to do an extra copy for nothing though =/

#### Share this post

##### Share on other sites
CPPNick, I think it's because you can't directly access the SSE registers like you can with normal CPU registers, but I am not entirely sure on this. You could always try loading in the address of [a] instead using the lea command.

Personally, I don't have a great deal of knowledge in SSE but have a little bit of experience in it.

Again I hope this helps

#### Share this post

##### Share on other sites
Right, posts cleaned up.

1) If you have a question which isn't directly related to the thread, please post a new thread about it.

2) But don't jump down someone else's throat when they do that, either.

#### Share this post

##### Share on other sites
Is it just me or are these proc declarations a bit confusing? They provide a level of abstraction, which should not be there when you are a beginner. The "pure" way to write a function in assembly is to write the function code under a label and jump to it using the call instruction, accessing the data according to the calling convention you are planning to use (in the OP's case the C calling convention). Doing things the "pure" way will help you understand assembly language better. A great assembly tutorial using nasm (in the form of a short book) can be found here

#### Share this post

##### Share on other sites
Just use the intrinsics.

    #include <xmmintrin.h>

    int main()
    {
        __declspec(align(16)) float inA[] = {1.0f, 2.0f, 3.0f, 4.0f};
        __declspec(align(16)) float inB[] = {1.0f, 2.0f, 3.0f, 4.0f};

        __m128 i1 = _mm_load_ps(inA);	// if you don't align it, you have to use _mm_loadu_ps
        __m128 i2 = _mm_load_ps(inB);

        i1 = _mm_add_ps(i1, i2);

        return 0;
    }
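For comparison with the original `myFunc`, a sketch of the same three-pointer interface written with intrinsics might look like the following. The name `addFour` is made up here, and the unaligned `_mm_loadu_ps` / `_mm_storeu_ps` variants are chosen so the caller's arrays need no alignment attribute.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Hypothetical intrinsics version of myFunc: adds four floats from a and b
// and writes the four sums to result.
inline void addFour(const float* a, const float* b, float* result) {
    assert(a && b && result);
    __m128 va = _mm_loadu_ps(a);                 // unaligned load of a[0..3]
    __m128 vb = _mm_loadu_ps(b);                 // unaligned load of b[0..3]
    _mm_storeu_ps(result, _mm_add_ps(va, vb));   // store the four sums
}
```

Called with the `inA`, `inB` and `ret` arrays from the first post, this fills `ret` with 2, 4, 6 and 8.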

#### Share this post

##### Share on other sites
Thanks, phantom.

D_Tr: I would rather use MASM..just preference I guess. The way it looks makes more sense to me. I'm past the point of having a hard time understanding assembly, I just need to learn the details and nuances now. I will definitely take a look at the book you recommended...at first glance I am relieved that it's only 195 pages....not nearly as intimidating as "The Art of Assembly" which is over 1400...scary

clashie: I have tried that, and it did seem to help. I have a question though. In an isolated case like your example, you have to move a total of 48 bytes of data to get the data in and back out. Does it really work out to be faster?
I am doing what I am doing for two reasons; One is just to learn some more assembly, and the second is to try and avoid as much overhead as possible and streamline the procedure. I made the inner embedded loop of a function in assembly using the same method as my original post, and it turned out to be slower. Once I examined the disassembly, I found that VC++ was adding another 15 lines of junk before each call to my assembly procedure to protect the contents of the registers I was overwriting. I figured that if I made the whole procedure assembly instead of only the inner embedded loop, I could avoid that.

#### Share this post

##### Share on other sites
The problem is that your case is somewhat artificial in nature.

Generally, when using SSE/SIMD you'll want to be running through large chunks of data which are already nicely formatted for the SSE instructions to handle.

So, the source data would just be a chunk of a larger chunk of (aligned) source data already in memory.

As for the assembly 'extra copy', well, that is somewhat unavoidable.

If we assume the function call uses the stack for all parameters then a 'pure' assembly (without the PROC helper) would have to pull the parameters from the stack anyway and place them into a register so they could be treated as an address to get the data from.

The thing is, as mentioned above, generally with SSE you'll be blasting large chunks of data, so you'll perform that load once and then just increment the register(s) to get to the new source data on each run over the loop.

A 'fast call' calling convention, which will pass some parameters via registers, might allow you to keep the address in a register and avoid an 'extra copy', but that's about the only way around it.
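The "load the address once, then step through the data" pattern could be sketched with intrinsics like this. The function name `addArrays` and the scalar tail loop are illustrative choices, and unaligned loads are used so nothing is assumed about how the caller allocated the buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>

// Hypothetical helper: adds two float arrays of length n, four elements
// per iteration, with a scalar loop for any leftover elements.
inline void addArrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(a && b && out);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];   // scalar tail when n is not a multiple of 4
    }
}
```

The pointer parameters are read once at function entry; inside the loop only the index advances, so the per-call cost of fetching the arguments is spread over the whole array.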

#### Share this post

##### Share on other sites
Quote:
 Original post by CPPNick
 clashie: I have tried that, and it did seem to help.

You've tried intrinsics but they don't help? Wut?

Quote:
 Original post by CPPNick
 I have a question though. In an isolated case like your example, you have to move a total of 48 bytes of data to get the data in and back out. Does it really work out to be faster?

Dude, the intrinsics boil down to 3 movaps and an addps. i.e. Exactly the same amount of data is moved in that method as in your asm. Well, no. That's not true. Since you are using the un-aligned version, your code would be moving up to 96 bytes around....

Quote:
 Original post by CPPNick
 I am doing what I am doing for two reasons; One is just to learn some more assembly,

Don't bother. Use intrinsics.

Quote:
 Original post by CPPNick
 and the second is to try and avoid as much overhead as possible and streamline the procedure.

Then use intrinsics. It's what they are there for.

Quote:
 Original post by CPPNick
 I made the inner embedded loop of a function in assembly using the same method as my original post, and it turned out to be slower.

Yeah, don't use assembler. Use intrinsics because no matter how good you *think* you are, the compiler will outperform you 99.99% of the time.

Quote:
 Original post by CPPNick
 Once I examined the disassembly, I found that VC++ was adding another 15 lines of junk before each call to my assembly procedure to protect the contents of the registers I was overwriting. I figured that if I made the whole procedure assembly instead of only the inner embedded loop, I could avoid that.

And here's the thing, if you had used intrinsics, the compiler would understand you are writing SSE code and it will help you to optimise it. Using asm these days is not a good idea - you will only end up fighting with the compiler.

#### Share this post

##### Share on other sites
