
# SSE2 Integer operations on Vectors

## 11 posts in this topic

I've been playing around with SSE to get a better understanding of it, using SSE2 intrinsics for integer operations. So far I've written a very common but simple code example:

Vector2* ar = CacheAlignedAlloc<Vector2>(SIZE); //16-byte aligned array; SIZE is a large value

//Some values are set for ar here....

__m128i sse;
__m128i sse2 = _mm_set_epi32(0, 5, 0, 5); //adds 5 to each x, 0 to each y
__m128i result;

for(int i = 0; i < SIZE; i += 2)
{
    sse = _mm_load_si128((__m128i*)&ar[i]); //load two Vector2 at once
    result = _mm_add_epi32(sse, sse2);
    _mm_store_si128((__m128i*)&ar[i], result);
}


Vector2 is a very simple struct:

struct Vector2
{
int x, y;
};


The way things work now, I can load at most two Vector2 into a 128-bit register. This is fine when I perform an operation on both components of each Vector2 at the same time. However, if you look at the code above, the y-value of each Vector2 only gets added with zero, so it remains unchanged; 64 bits of the 128-bit register are essentially doing nothing. Is there a way to instead load four x-values from four Vector2s, perform operations, and then store the results back?

Edited by Suen

##### Share on other sites

Sounds like you're looking for SoA to/from AoS conversion (SoA = Structure of Arrays, AoS = Array of Structures).

Note that the conversion has a cost, so whether it's worth it depends on how many operations you're going to do between the conversions.

See _MM_TRANSPOSE4_PS's code as an example of how to efficiently convert from AoS to SoA and back (it's designed for 4x4 float matrices, though).


##### Share on other sites

Sounds like you're looking for SoA to/from AoS conversion (SoA = Structure of Arrays, AoS = Array of Structures).

Note that the conversion has a cost, so whether it's worth it depends on how many operations you're going to do between the conversions.

See _MM_TRANSPOSE4_PS's code as an example of how to efficiently convert from AoS to SoA and back (it's designed for 4x4 float matrices, though).

Vector2* ar = CacheAlignedAlloc<Vector2>(SIZE);

//Some values are set for ar here....

__m128i sse2 = _mm_set1_epi32(5);

for(int i = 0; i < SIZE; i += 8)
{
    //load eight Vector2 into four registers (the macro needs __m128 lvalues)
    __m128 v0v1 = _mm_castsi128_ps(_mm_load_si128((__m128i*)&ar[i]));
    __m128 v2v3 = _mm_castsi128_ps(_mm_load_si128((__m128i*)&ar[i+2]));
    __m128 v4v5 = _mm_castsi128_ps(_mm_load_si128((__m128i*)&ar[i+4]));
    __m128 v6v7 = _mm_castsi128_ps(_mm_load_si128((__m128i*)&ar[i+6]));

    _MM_TRANSPOSE4_PS(v0v1, v2v3, v4v5, v6v7); //AoS -> SoA: rows 0 and 2 now hold the x values

    v0v1 = _mm_castsi128_ps(_mm_add_epi32(_mm_castps_si128(v0v1), sse2));
    v4v5 = _mm_castsi128_ps(_mm_add_epi32(_mm_castps_si128(v4v5), sse2));

    _MM_TRANSPOSE4_PS(v0v1, v2v3, v4v5, v6v7); //SoA -> back to AoS

    _mm_store_si128((__m128i*)&ar[i],   _mm_castps_si128(v0v1));
    _mm_store_si128((__m128i*)&ar[i+2], _mm_castps_si128(v2v3));
    _mm_store_si128((__m128i*)&ar[i+4], _mm_castps_si128(v4v5));
    _mm_store_si128((__m128i*)&ar[i+6], _mm_castps_si128(v6v7));
}


I ended up with this; I haven't checked its performance yet, but it does look like a bad solution :/


##### Share on other sites
If you know you will only operate on X values, rearrange your data:

struct Data {

    int * xValues;
    int * yValues;

    const unsigned numValues;

    explicit Data (unsigned valueCount)
    : numValues(valueCount)
    {
        //AlignedAllocate / Free = hypothetical 16-byte aligned allocator
        xValues = AlignedAllocate(sizeof(int) * valueCount);
        yValues = AlignedAllocate(sizeof(int) * valueCount);
    }

    ~Data () {
        Free(xValues);
        Free(yValues);
    }
};

Data d(SIZE);
// TODO - initialize data elements

__m128i changes = _mm_set_epi32(5, 6, 7, 8);

for (unsigned i = 0; i < d.numValues; i += 4) {
    __m128i values = _mm_load_si128(reinterpret_cast<const __m128i *>(&d.xValues[i]));
    values = _mm_add_epi32(values, changes);
    _mm_store_si128(reinterpret_cast<__m128i *>(&d.xValues[i]), values);
}

This will operate on 4 values at a time, with fewer cache misses than the AoS layout, and probably ideal throughput given sufficient compiler optimizations.
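A compilable sketch of the same idea, using a bare aligned array instead of the hypothetical `AlignedAllocate` helper so it stands alone (the function name and the multiple-of-4 assumption are mine):

```cpp
#include <emmintrin.h>  // SSE2
#include <cassert>

// With the SoA layout, all x values are contiguous, so one aligned load
// grabs four of them and a single _mm_add_epi32 updates four "vectors'"
// x components at once. 'xs' must be 16-byte aligned, 'n' a multiple of 4.
void add_to_x(int* xs, unsigned n, int delta)
{
    const __m128i d = _mm_set1_epi32(delta);
    for (unsigned i = 0; i < n; i += 4)
    {
        const __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(&xs[i]));
        _mm_store_si128(reinterpret_cast<__m128i*>(&xs[i]), _mm_add_epi32(v, d));
    }
}
```

The y array is never touched, so it never enters the cache during this pass; that is where the cache-miss win comes from.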

##### Share on other sites

If you know you will only operate on X values, rearrange your data:

struct Data {

    int * xValues;
    int * yValues;

    const unsigned numValues;

    explicit Data (unsigned valueCount)
    : numValues(valueCount)
    {
        //AlignedAllocate / Free = hypothetical 16-byte aligned allocator
        xValues = AlignedAllocate(sizeof(int) * valueCount);
        yValues = AlignedAllocate(sizeof(int) * valueCount);
    }

    ~Data () {
        Free(xValues);
        Free(yValues);
    }
};

Data d(SIZE);
// TODO - initialize data elements

__m128i changes = _mm_set_epi32(5, 6, 7, 8);

for (unsigned i = 0; i < d.numValues; i += 4) {
    __m128i values = _mm_load_si128(reinterpret_cast<const __m128i *>(&d.xValues[i]));
    values = _mm_add_epi32(values, changes);
    _mm_store_si128(reinterpret_cast<__m128i *>(&d.xValues[i]), values);
}

This will operate on 4 values at a time, with fewer cache misses than the AoS layout, and probably ideal throughput given sufficient compiler optimizations.

Yep, I was just thinking of doing this and am going in that direction now. I will be operating on the y-values too, but much less in comparison. Might as well change my design from AoS to SoA while it's still possible, as it fits better with the way SIMD works.


##### Share on other sites

I'm not really sure there's any value to this exercise. As soon as you get to mul_epi32(), you're going to shrug your shoulders and give up (otherwise you're going to produce an abomination in code). Take it as a hint that you may be approaching this wrong.... A SIMD-optimised, integer Vec2/Vec3 SoA implementation is of no use to anyone. In practice, you're more likely to use the integer ops when converting a 16-bit colour to floating-point RGBA, or calculating offsets into floating-point data arrays, etc. Generally speaking, bit shifts, bitwise operators, extracts, and addition/subtraction tend to be the most useful of the integer instructions (combined with a few cvtps_epi32/cvtepi32_ps ops here and there). Multiplication is there if you need it, but chances are you probably don't!


##### Share on other sites

I'm not really sure there's any value to this exercise. As soon as you get to mul_epi32(), you're going to shrug your shoulders and give up (otherwise you're going to produce an abomination in code). Take it as a hint that you may be approaching this wrong.... A SIMD-optimised, integer Vec2/Vec3 SoA implementation is of no use to anyone. In practice, you're more likely to use the integer ops when converting a 16-bit colour to floating-point RGBA, or calculating offsets into floating-point data arrays, etc. Generally speaking, bit shifts, bitwise operators, extracts, and addition/subtraction tend to be the most useful of the integer instructions (combined with a few cvtps_epi32/cvtepi32_ps ops here and there). Multiplication is there if you need it, but chances are you probably don't!

I'm more than likely not going to use something like this in a serious project, as there are already great VecX libraries out there that get most of the job done (and I wouldn't be surprised if many of them are SIMD-optimized; even if not, the compiler might do it for you). It's just me playing around mostly, but I have to admit using Vec2 and Vec3 with SIMD does give me a headache, which might explain why it's not all that useful :P


##### Share on other sites
Beware: if you're just multiplying (one instruction), you'll be bandwidth and latency limited.
SIMD probably won't make a huge improvement over normal C++ code.

You'll start seeing a lot more improvement between SIMD and non-SIMD code when you do more math operations per element (and, for example, the cost of converting between AoS and SoA becomes worthwhile in case you need it).

Your case looks like a perfect fit for prefetching as described in Chapter 7 of the Intel 64 and IA-32 Architectures Optimization Reference Manual.
I suggest you take a look at it for efficient cache usage, especially if you're planning on iterating over lots of contiguous vectors.

##### Share on other sites

I have to admit using Vec2 and Vec3 with SIMD does give a headache, might explain why it's not all that useful

Vec2 and Vec3 SoA structures are exactly how it should be used (and it's actually easier than AoS methods). My point was about your usage of integers for a Vec2. Pretty much every single operation beyond add/sub (dot, cross, transform by matrix, matrix multiply, etc.) needs multiplication, and that, with integers, is not nice. You could use fixed point, but why bother when you have floats? Even converting the integers to floats, performing the ops, and then converting back is likely to be quicker than dealing with integer multiplication (and all the inevitable shifts / shuffles / ORs that will be needed to re-merge the vectors).
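To make the multiplication point concrete: SSE2 has no packed 32-bit multiply that keeps the low 32 bits of each lane (`_mm_mullo_epi32` only arrives with SSE4.1; SSE2's `_mm_mul_epu32` multiplies just the even lanes into 64-bit results). A sketch of the convert-to-float route described above (the function name is mine, and it's exact only while every product fits in a float's 24-bit mantissa, roughly |a*b| < 16 million):

```cpp
#include <emmintrin.h>  // SSE2
#include <cassert>

// Multiply four packed 32-bit ints lane-wise by converting through float.
// Valid only for products that a float can represent exactly.
__m128i mul_via_float(__m128i a, __m128i b)
{
    const __m128 fa = _mm_cvtepi32_ps(a);
    const __m128 fb = _mm_cvtepi32_ps(b);
    return _mm_cvttps_epi32(_mm_mul_ps(fa, fb)); // truncate back toward zero
}
```

Whether this beats a shuffle-heavy `_mm_mul_epu32` sequence depends on the surrounding code; both should be measured.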

Your case looks like a perfect fit for the prefetch as described in Chapter 7 of Intel 64 and IA-32 Architectures Optimization Reference Manual.

It isn't. The loop is memory-bandwidth bound, so the hardware prefetcher will give the best performance in this case. As always, use a profiler instead of *'something you read'* - especially when it comes to streaming and prefetching!

##### Share on other sites

Your case looks like a perfect fit for the prefetch as described in Chapter 7 of Intel 64 and IA-32 Architectures Optimization Reference Manual.

It isn't. The loop is memory-bandwidth bound, so the hardware prefetcher will give the best performance in this case. As always, use a profiler instead of 'something you read' - especially when it comes to streaming and prefetching!

If you read Chapter 7, a common suggestion is that he could maximize throughput and hide latency by unrolling his loop so that he operates on, e.g., 4 SIMD vectors per iteration (4x4 = 16 ints at a time) and uses prefetch on every loop iteration to fetch the next 16 ints (or maybe more). He's probably not bandwidth bound either, because with just one math instruction per loop he may not be fully utilizing all the pipelines.

Of course, whether that's actually faster always needs to be profiled; that's always good advice. Maybe he would still need more math operations to make this change really worth it, or maybe not. And he's a newbie just learning about SSE2; it's a good exercise to experiment with all the not-so-obvious alternatives.

In other words (pseudo code):

for(int i = 0; i < SIZE; i += 8)
{
    prefetch( &ar[i + 8] ); //Needed?
    _mm_store_si128((__m128i*)&ar[i+0], result0);
    _mm_store_si128((__m128i*)&ar[i+2], result1);
    _mm_store_si128((__m128i*)&ar[i+4], result2);
    _mm_store_si128((__m128i*)&ar[i+6], result3);
}

Of course, he will still need to handle the last iteration in case SIZE is not a multiple of 8.

The performance results may vary wildly per architecture (e.g. Core 2 vs. Core i7 vs. Atom).
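Fleshing the pseudocode out into something compilable (the function name and the prefetch distance of 16 vectors are my guesses; as stressed throughout the thread, whether the `_mm_prefetch` earns its keep can only come from a profiler):

```cpp
#include <emmintrin.h>   // SSE2 integer ops
#include <xmmintrin.h>   // _mm_prefetch
#include <cassert>

struct Vector2 { int x, y; };

// Unrolled loop: four 128-bit loads/adds/stores (eight Vector2) per
// iteration. 'ar' must be 16-byte aligned and SIZE a multiple of 8.
void add_unrolled(Vector2* ar, int SIZE)
{
    const __m128i add = _mm_set_epi32(0, 5, 0, 5); // +5 to x lanes, +0 to y lanes

    for (int i = 0; i < SIZE; i += 8)
    {
        // Hint at a later cache line. prefetch never faults, so running past
        // the end of the array is harmless -- but it may well be slower than
        // simply letting the hardware prefetcher do its job.
        _mm_prefetch(reinterpret_cast<const char*>(&ar[i + 16]), _MM_HINT_T0);

        __m128i r0 = _mm_add_epi32(_mm_load_si128(reinterpret_cast<__m128i*>(&ar[i + 0])), add);
        __m128i r1 = _mm_add_epi32(_mm_load_si128(reinterpret_cast<__m128i*>(&ar[i + 2])), add);
        __m128i r2 = _mm_add_epi32(_mm_load_si128(reinterpret_cast<__m128i*>(&ar[i + 4])), add);
        __m128i r3 = _mm_add_epi32(_mm_load_si128(reinterpret_cast<__m128i*>(&ar[i + 6])), add);

        _mm_store_si128(reinterpret_cast<__m128i*>(&ar[i + 0]), r0);
        _mm_store_si128(reinterpret_cast<__m128i*>(&ar[i + 2]), r1);
        _mm_store_si128(reinterpret_cast<__m128i*>(&ar[i + 4]), r2);
        _mm_store_si128(reinterpret_cast<__m128i*>(&ar[i + 6]), r3);
    }
}
```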


##### Share on other sites

I've read Chapter 7, but more importantly, I've actually profiled the benefits in a number of commercial products, on a wide variety of Intel CPUs (all the way from Atom up to Xeon). Loop unrolling is a given. What you haven't realised, though, is that your pseudo-code above will actually run faster if you remove the prefetch instruction! If you'd profiled this use case, you would know this. The hardware prefetcher found in modern Intel chips is not stupid (unlike the Pentium 4, which needed its hand held). It will pretty quickly grok that you're accessing an array linearly, and will take steps to optimise memory access for you. All the prefetch op does in this case is add an extra SIZE/8 instructions to execute, all of which tell the prefetcher what it already knows. Added to that, you're assuming this array must be in memory - but have you considered that it might already be loaded into the cache? Inserting prefetch instructions without any idea of whether you actually need them is pointless in the extreme. There are cases where prefetching is useful, but this is not one of them.

That document has been around since the days of the Pentium 4, and it periodically has new chapters added when new CPUs are released. It lists a set of techniques that may be useful across Intel CPUs (both young and old), but each CPU has its own quirks and characteristics. That document is not a shopping list of optimisations that you must apply anywhere and everywhere. It is a set of guidelines that, with the use of a profiler, can help you improve the performance of your code.

Putting complete faith in ancient texts, following them to the letter, and failing to test whether any of the claims are valid [has a name](http://en.wikipedia.org/wiki/Cargo_cult_programming).

// how big is SIZE? 0x10? 0x100? 0x1000000000? This matters!
for(int i = 0; i < SIZE; i += 8)
{
    prefetch( &ar[i + 8] ); //Needed?  .... don't ask me, ask a profiler
    _mm_store_si128((__m128i*)&ar[i+0], result0); //< place in memory (and cache)
    _mm_store_si128((__m128i*)&ar[i+2], result1); //< place in memory (and cache)
    _mm_store_si128((__m128i*)&ar[i+4], result2); //< place in memory (and cache)
    _mm_store_si128((__m128i*)&ar[i+6], result3); //< place in memory (and cache)
}


##### Share on other sites

Putting complete faith in ancient texts, following them to the letter, and failing to test whether any of the claims are valid

That comment hurts - not because it hurts my ego, but because I never implied it was a sure win or that I was the holder of all truth. I suggested insightful links to the OP about what's going on deeper than the apparent C/C++, and then, given your input, we elaborated further.
I did realize that removing the prefetch could improve performance (again, this **is** mentioned in Chapter 7), and that's why I added a "Needed?" comment. I won't be profiling theoretical code, because I'm not going to use that code. The idea was that the OP should do that himself and test, test, test. The situation could change if memory access patterns suddenly change (because the theoretical code goes into practice and looks different enough), so he would have to profile again.
I'm very open to criticism (and by the way, I thank you for elaborating the point further; at first I only pointed to Ch. 7 for a read), but I won't tolerate putting words in my mouth (or in my fingers) that I didn't say.

If we were discussing a practical in-depth optimization of X library, then our conversation would be completely different.
