 SIMD SSE intrinsics, code won't work
Hello! I am trying to learn how to do parallel processing. I am basing this on an example that I saw here:
http://www.codeproject.com/KB/recipes/sseintro.aspx
the one titled "SSETest Demo Project"

I tried to make my own, and don't understand why it isn't working. Here is the code:

// testconsole.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <time.h>
#include <iostream>
#include "windows.h"
#include "math.h"
#include <conio.h>
#include "test.h"
#include <xmmintrin.h>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
	float *f1 = new float[4];
	float *f2 = new float[4];
	float *result = new float[4];

	for(int i = 0; i < 4; i++)
	{
		f1[i] = (float)i;
		f2[i] = (float)i;
	}
	__m128 *m1 = (__m128*)f1;
	__m128 *m2 = (__m128*)f2;
	__m128 *res = (__m128*)result;
//the next line produces the error: "Unhandled exception at 0x0125149c in 
//testconsole.exe: 0xC0000005: Access violation reading location 0xffffffff."
	*res = _mm_add_ps(*m1, *m2); //access violation reading
	for(int i = 0; i < 4; i++)
	{
		cout << result[i] << endl;
	}
	getch();
	return 0;
}




The strangest part is that, even though it produces the error at that line, I do NOT get the error if I comment out the second for loop and the getch(). It simply runs and exits with no error.

can anyone help me out?
Thanks =)


The addresses of f1, f2 and result must be 16-byte aligned, i.e. the lowest 4 bits of each address must be zero.
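A quick way to see whether a given pointer meets that requirement is to test its low 4 bits. The sketch below is only an illustration: is_aligned16 and checked_load are hypothetical helper names, not part of any library, and the assert only fires in debug builds.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical helper: true if p sits on a 16-byte boundary
// (the lowest 4 bits of the address are zero).
inline bool is_aligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 0xF) == 0;
}

// Hypothetical helper: an aligned load that fails with a readable
// assertion in debug builds instead of a bare access violation.
inline __m128 checked_load(const float* p)
{
    assert(is_aligned16(p) && "pointer must be 16-byte aligned");
    return _mm_load_ps(p);
}
```

Calling is_aligned16(f1) right after the `new float[4]` in the original program would show whether that particular allocation happened to land on a 16-byte boundary.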




Actually, I tried compiling it as the release build and running it outside the IDE, and also running it from the debug folder, and it worked, even though it wouldn't work inside the IDE. Very strange. Anyway, thanks for the tip. I will look up the alignment thing next. From what I remember, the alignment isn't mandatory, but is for preventing cache stalls when the cache isn't all the way filled, and aligning the data means padding the end to make it a power of two... right?

thanks again


Quote:
From what I remember, the alignment isn't mandatory, but is for preventing cache stalls when the cache isn't all the way filled, and aligning the data means padding the end to make it a power of two... right?


the __m128 type _is_, by definition, 16-byte aligned. using C-style casts as you did is an unsafe way to get things done, because the compiler won't warn you when you're doing something wrong. once you force an arbitrary pointer into an __m128 pointer, the compiler will assume the __m128* points to 16-byte aligned memory.

in the code you posted, the compiler will silently emit two load instructions to load f1 and f2 into two xmm registers. as you're passing __m128 pointers, it will assume the addresses are 16-byte aligned and emit 'movaps' instructions instead of 'movups' instructions (which are the ones you want, as the pointers returned by 'new' are not guaranteed to be 16-byte aligned at all...)

basically, you've got two instructions at your disposal to load your data:

__m128 m1 = _mm_load_ps(f1);  // aligned load: expects a 16-byte aligned pointer
__m128 m2 = _mm_loadu_ps(f2); // unaligned load: doesn't care about alignment
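As a minimal sketch of the unaligned route, the addition from the original post can be written with _mm_loadu_ps and _mm_storeu_ps only, with no __m128* casts. The function name add4_unaligned is made up for this illustration.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Hypothetical helper: out[i] = a[i] + b[i] for i in 0..3.
// The pointers may have any alignment, because only the
// unaligned load/store intrinsics are used.
inline void add4_unaligned(const float* a, const float* b, float* out)
{
    assert(a && b && out);
    __m128 va = _mm_loadu_ps(a);              // compiles to movups
    __m128 vb = _mm_loadu_ps(b);              // compiles to movups
    _mm_storeu_ps(out, _mm_add_ps(va, vb));   // unaligned store
}
```

With the buffers from the original post this would be called as add4_unaligned(f1, f2, result), and it works whether or not 'new' happened to return aligned memory.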


the following line:

__m128 result = _mm_add_ps(*m1, *m2);


is, behind the scenes, equivalent to:

__m128 xmm0 = _mm_load_ps((const float*)m1);
__m128 xmm1 = _mm_load_ps((const float*)m2);
__m128 result = _mm_add_ps(xmm0, xmm1);


and worse yet, you also _write_ to a potentially unaligned memory location through an __m128*, so the compiler emits yet another movaps instruction for the store.

*res = _mm_add_ps(*m1, *m2);


is equivalent to:

__m128 xmm0 = _mm_load_ps((const float*)m1);
__m128 xmm1 = _mm_load_ps((const float*)m2);
__m128 xmm2 = _mm_add_ps(xmm0, xmm1);
_mm_store_ps((float*)res, xmm2);


and the randomness of the crashes you're experiencing probably comes from the fact that although 'new' isn't guaranteed to return 16-byte aligned memory addresses, sometimes it may do so... if you're (un)lucky.

so you can either:
- explicitly use unaligned loads/stores, and trash those __m128* casts.
- use 16-byte aligned memory. (look into _aligned_malloc, use compiler alignment directives such as __declspec(align(16)), or do it yourself by over-allocating a buffer and rounding the pointer up to the next 16-byte boundary)

(actually, I would recommend you use aligned memory with explicit aligned loads/writes and no __m128* casts at all)
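The recommended combination (aligned memory, explicit aligned loads/stores, no __m128* casts) might look roughly like the sketch below. It uses _mm_malloc/_mm_free rather than _aligned_malloc so the same code builds on MSVC, GCC and Clang; that swap, and the helper name add4_aligned, are choices made for this illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics; also pulls in _mm_malloc/_mm_free on MSVC, GCC and Clang

// Hypothetical helper: adds two 4-float vectors through 16-byte aligned
// heap buffers. Returns false if any allocation fails.
inline bool add4_aligned(const float* in1, const float* in2, float* out)
{
    float* f1  = static_cast<float*>(_mm_malloc(4 * sizeof(float), 16));
    float* f2  = static_cast<float*>(_mm_malloc(4 * sizeof(float), 16));
    float* res = static_cast<float*>(_mm_malloc(4 * sizeof(float), 16));
    const bool ok = f1 && f2 && res;
    if (ok)
    {
        for (int i = 0; i < 4; ++i) { f1[i] = in1[i]; f2[i] = in2[i]; }

        // The allocator guarantees this; the assert just documents it.
        assert((reinterpret_cast<std::uintptr_t>(f1) & 0xF) == 0);

        __m128 a = _mm_load_ps(f1);            // aligned load (movaps)
        __m128 b = _mm_load_ps(f2);            // aligned load (movaps)
        _mm_store_ps(res, _mm_add_ps(a, b));   // aligned store (movaps)

        for (int i = 0; i < 4; ++i) { out[i] = res[i]; }
    }
    _mm_free(f1);   // memory from _mm_malloc must go back through _mm_free
    _mm_free(f2);
    _mm_free(res);
    return ok;
}
```

The point of the design is that alignment is established once, at allocation time, and every load and store names its alignment assumption explicitly, so nothing depends on what a cast made the compiler believe.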

[Edited by - momotte on March 3, 2009 3:06:05 AM]


Okay! I should have listened to eq in the first place... I tried the alignment after failing to make the method work in my real project, and came to the conclusion that it must have been a fluke that the first one worked at all.

Thanks for the elaboration though =) I decided to use 16-byte aligned allocation where applicable. This new knowledge is going to have me re-writing my entire engine yet again..**SIGH****

It's looking ok though. I have a simple cube room with one cube in the middle,
total: 48 verts, 24 faces, 1 free light source (vertex shading), textured walls. I have it running full screen at 1280x800 and it looks to be doing around 10-15 frames per second.
-edit: on a P4 dual core 1.6GHz laptop, ATI 200 built-in video card.
I'm sure I can get it going faster once I completely implement the SIMD.

peace =D


Quote:
Original post by CPPNick
Thanks for the elaboration though =) I decided to use 16-byte aligned allocation where applicable. This new knowledge is going to have me re-writing my entire engine yet again..**SIGH****


Why? you just need to overload the new/delete operators....
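One way to read that suggestion is class-specific operator new/delete that route through an aligned allocator. The sketch below assumes _mm_malloc/_mm_free as that allocator, and Vec4 is a hypothetical type invented for the example; only the single-object forms are shown, and array forms (new[]/delete[]) would need the same treatment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <xmmintrin.h>

// Hypothetical SSE-backed type whose heap instances are always
// 16-byte aligned, whatever the global 'new' would have returned.
struct Vec4
{
    __m128 v;

    static void* operator new(std::size_t size)
    {
        void* p = _mm_malloc(size, 16);
        if (!p) throw std::bad_alloc();
        return p;
    }

    static void operator delete(void* p)
    {
        _mm_free(p);   // must pair with _mm_malloc
    }
};
```

Existing code that writes `new Vec4` and `delete p` then picks up the aligned allocator without changes at the call sites, which is why a full engine rewrite should not be necessary for the allocation side.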

Quote:
Original post by CPPNick
I'm sure I can get it going faster once I completely implement the SIMD.


Don't count on it. SIMD can be faster, but quite often you'll find things going a little slower (unless you re-organise your code to be SSE friendly).


Quote:
Original post by RobTheBloke
unless you re-organize your code to be SSE friendly

I did plan on it. I am clear now on how to declare a single array and _align_ it for use with SIMD, but I am also going to need a structure of arrays to work with for the vertices. It should look something like this:

I will load into this, so I can transform 4 vertices at a time:
struct VERTEX
{
float x[4];
float y[4];
float z[4];
float u[4];
float v[4];
};
//then move them to a vertex buffer like this for clipping:
struct SIMPLE_VERTEX
{
float x, y, z, u, v;
};
SIMPLE_VERTEX vertices[9];

How would I make sure that all of this is aligned? Since sizeof(VERTEX) > 16, do I just use the next multiple of 16 in its place and expect the compiler to add padding for me?

edit:
would this be correct?
__declspec(align(128))struct VERTEX
{
float x[4];
float y[4];
float z[4];
float u[4];
float v[4];
};

Or will "__declspec(align(80)) struct VERTEX" work? That's a multiple of 16 but not a power of two...?
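On the alignment question itself: alignment values have to be powers of two, so align(80) is rejected, and 16 is all that movaps needs; 128 is legal but more than required. The alignment applies to the start of the struct, and since sizeof(VERTEX) is 80 (a multiple of 16), every member array and every element of a VERTEX array also lands on a 16-byte boundary with no extra padding. The sketch below uses the standard C++11 alignas(16) spelling in place of __declspec(align(16)) so it is not tied to MSVC; translate4 is a hypothetical function added only to show the aligned loads in use.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Structure of arrays holding 4 vertices, as in the post above.
struct alignas(16) VERTEX
{
    float x[4];
    float y[4];
    float z[4];
    float u[4];
    float v[4];
};

static_assert(alignof(VERTEX) == 16, "struct starts on a 16-byte boundary");
static_assert(sizeof(VERTEX) % 16 == 0, "array elements stay 16-byte aligned");

// Hypothetical example: translate all 4 vertices at once.
// Each member array starts at a multiple of 16 bytes from the struct
// start, so the aligned load/store intrinsics are safe here.
inline void translate4(VERTEX& vtx, float dx, float dy, float dz)
{
    _mm_store_ps(vtx.x, _mm_add_ps(_mm_load_ps(vtx.x), _mm_set1_ps(dx)));
    _mm_store_ps(vtx.y, _mm_add_ps(_mm_load_ps(vtx.y), _mm_set1_ps(dy)));
    _mm_store_ps(vtx.z, _mm_add_ps(_mm_load_ps(vtx.z), _mm_set1_ps(dz)));
}
```

Note that the alignment directive covers stack, static and member storage; a VERTEX created with plain 'new' still depends on what the allocator returns, so heap instances need an aligned allocation as discussed earlier in the thread.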

