• Create Account

## SSE2 Integer operations on Vectors

Old topic!

Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

11 replies to this topic

### #1Suen  Members

160
Like
0Likes
Like

Posted 11 March 2013 - 10:53 AM

I've been playing around with SSE to get a better understanding of it, using SSE2 intrinsics for integer operations. Currently I've done a very common yet simple code example:

Vector2* ar = CacheAlignedAlloc<Vector2>(SIZE); //16 or multiple of 16-byte aligned array, SIZE is a largue value

//Some values are set for ar here....

__m128i sse;
__m128i sse2 = _mm_set_epi32(0,5,0,5);
__m128i result;

for(int i=0; i<SIZE; i=i+2)
{
result = _mm_add_epi32(sse, sse2);
_mm_store_si128((__m128i*)&ar[i], result);
}


Vector2 is a very simple struct:

struct Vector2
{
int x, y;
};


The way things work now is that I can at most load a maximum of two Vector2 in a 128-bit register. This is fine if I were to perform an operation on both values of each Vector2 at the same time. However if you look at the code above the y-value of each Vector2 only gets added with zero so it remains unchanged, thus 64 bits of the 128-bit register are essentially doing nothing. Is there a way to load four x-values from four Vectors2 instead, perform operations and then store the result back again?

Edited by Suen, 11 March 2013 - 10:56 AM.

### #2Matias Goldberg  Members

9068
Like
1Likes
Like

Posted 11 March 2013 - 11:21 AM

Sounds like you're looking for SoA to/from AoS conversion (SoA = Structure of Array, AoS = Array of structures)

Note the conversion has a cost, so it depends on all the operations you're going to do to see whether it is worth it.

See _MM_TRANSPOSE4_PS's code as an example of how to efficiently convert from AoS to SoA and back (designed to work for 4x4 matrices though)

### #3Suen  Members

160
Like
0Likes
Like

Posted 11 March 2013 - 04:00 PM

Sounds like you're looking for SoA to/from AoS conversion (SoA = Structure of Array, AoS = Array of structures)

Note the conversion has a cost, so it depends on all the operations you're going to do to see whether it is worth it.

See _MM_TRANSPOSE4_PS's code as an example of how to efficiently convert from AoS to SoA and back (designed to work for 4x4 matrices though)

Vector2* ar = CacheAlignedAlloc<Vector2>(SIZE);

//Some values are set for ar here....

__m128i v0v1;
__m128i v2v3;
__m128i v4v5;
__m128i v6v7;
__m128i sse2 = _mm_set1_epi32(5);

for(int i=0; i<32; i=i+8)
{
_MM_TRANSPOSE4_PS(_mm_castsi128_ps(v0v1), _mm_castsi128_ps(v2v3), _mm_castsi128_ps(v4v5), _mm_castsi128_ps(v6v7));
v0v1 = _mm_add_epi32(v0v1, sse2);
v3v4 = _mm_add_epi32(v3v4, sse2);
v5v6 = _mm_add_epi32(v5v6, sse2);
v7v8 = _mm_add_epi32(v7v8, sse2);
_MM_TRANSPOSE4_PS(_mm_castsi128_ps(v0v1), _mm_castsi128_ps(v2v3), _mm_castsi128_ps(v4v5), _mm_castsi128_ps(v6v7));
_mm_store_si128((__m128i*)&positions[i], v0v1);
_mm_store_si128((__m128i*)&positions[i+2], v2v3);
_mm_store_si128((__m128i*)&positions[i+4], v4v5);
_mm_store_si128((__m128i*)&positions[i+6], v6v7);
}


Ended up with this, haven't checked it's performance yet but it does look like a bad solution :/

### #4ApochPiQ  Moderators

21408
Like
2Likes
Like

Posted 11 March 2013 - 04:23 PM

If you know you will only operate on X values, rearrange your data:
struct Data {

int * xValues;
int * yValues;

const unsigned numValues;

explicit Data (unsigned valueCount)
: numValues(valueCount)
{
xValues = AlignedAllocate(sizeof(int) * valueCount);
yValues = AlignedAllocate(sizeof(int) * valueCount);
}

~Data () {
Free(xValues);
Free(yValues);
}
};

Data d;
// TODO - initialize data elements

__m128i changes = _mm_set_epi32(5, 6, 7, 8);

for (unsigned i = 0; i < d.numValues; i += 4) {
__m128i values = _mm_load_si128(reinterpret_cast<const __m128i *>(&d.xValues[i]));
__m128i result = _mm_add_epi32(values, changes);
_mm_store_si128(reinterpret_cast<__m128i *>(&d.xValues[i]));
}
This will operate on 4 values at a time, with fewer cache misses than any other option, and probably ideal throughput given sufficient compiler optimizations.
Wielder of the Sacred Wands

### #5Suen  Members

160
Like
0Likes
Like

Posted 11 March 2013 - 04:45 PM

If you know you will only operate on X values, rearrange your data:

struct Data {

int * xValues;
int * yValues;

const unsigned numValues;

explicit Data (unsigned valueCount)
: numValues(valueCount)
{
xValues = AlignedAllocate(sizeof(int) * valueCount);
yValues = AlignedAllocate(sizeof(int) * valueCount);
}

~Data () {
Free(xValues);
Free(yValues);
}
};

Data d;
// TODO - initialize data elements

__m128i changes = _mm_set_epi32(5, 6, 7, 8);

for (unsigned i = 0; i < d.numValues; i += 4) {
__m128i values = _mm_load_si128(reinterpret_cast<const __m128i *>(&d.xValues[i]));
__m128i result = _mm_add_epi32(values, changes);
_mm_store_si128(reinterpret_cast<__m128i *>(&d.xValues[i]));
}
This will operate on 4 values at a time, with fewer cache misses than any other option, and probably ideal throughput given sufficient compiler optimizations.

Yep was simply thinking of doing this, going in that direction now. I will be operating on the Y-values but much less in comparison. Might as well change my design from AOS to SOA while it's still possible as it fits better with the way SIMD works.

### #6RobTheBloke  Members

2536
Like
0Likes
Like

Posted 11 March 2013 - 05:16 PM

I'm not really sure there's any value to this excercise. As soon as you get to mul_epi32(), you're going to shrug your shoulders, and then give up (otherwise you're going to produce an abomination in code). Take it as a hint you may be approaching this this wrong....  A SIMD-optimised, integer Vec2/Vec3 SOA implementation is of no use to anyone. In practice, you're more likely to use the integer ops when converting a 16bit colour to floating point RGBA, or calculating offsets into floating point data arrays, etc. Generally speaking, bitshifts, bitwise operators, extract, and addition/subtraction tend to be most useful of the integer instructions (combined with a few cvtps_epi32/cvtepi32_ps ops here and there). Multiplication is there if you need it, but chances are, you probably don't!

### #7Suen  Members

160
Like
0Likes
Like

Posted 11 March 2013 - 05:59 PM

I'm not really sure there's any value to this excercise. As soon as you get to mul_epi32(), you're going to shrug your shoulders, and then give up (otherwise you're going to produce an abomination in code). Take it as a hint you may be approaching this this wrong....  A SIMD-optimised, integer Vec2/Vec3 SOA implementation is of no use to anyone. In practice, you're more likely to use the integer ops when converting a 16bit colour to floating point RGBA, or calculating offsets into floating point data arrays, etc. Generally speaking, bitshifts, bitwise operators, extract, and addition/subtraction tend to be most useful of the integer instructions (combined with a few cvtps_epi32/cvtepi32_ps ops here and there). Multiplication is there if you need it, but chances are, you probably don't!

I'm more than likely not going to use something like in a serious project as there already great VecX-libraries out there that get most of the job done (and I wouldn't be surprised if many of them are SIMD-optimized and even if not the compiler might do it for you). It's just me playing around mostly but I have to admit using Vec2 and Vec3 with SIMD does give a headache, might explain why it's not all that useful

### #8Matias Goldberg  Members

9068
Like
0Likes
Like

Posted 11 March 2013 - 06:24 PM

Beware, if you're just multiplicating (one instruction), you'll be bandwidth & latency limited.
SIMD probably won't make a huge improvement over normal C++ code.

You'll start seeing a lot more improvement between SIMD & non-SIMD code when you do more math operations (and for example, the cost from converting from AoS/SoA in case you need it becomes worthwhile).

Your case looks like a perfect fit for the prefetch as described in Chapter 7 of Intel 64 and IA-32 Architectures Optimization Reference Manual.
I suggest you take a look at it for efficient cache usage. Specially if you're planning on iterating on lots of contiguous vectors.

### #9RobTheBloke  Members

2536
Like
0Likes
Like

Posted 11 March 2013 - 08:12 PM

I have to admit using Vec2 and Vec3 with SIMD does give a headache, might explain why it's not all that useful

Vec2 and Vec3 SOA structures are exactly how it should be used (and it's actually easier than AOS methods). My point was about your usage of integers for a Vec2. In pretty much every single operation beyond add/sub (dot, cross, transform by matrix, matrix mult, etc) you'll need mutliplication, and that, with integers, is not nice. You could use fixed point, but why bother when you have floats? Even converting the integers to floats, performing the ops, and then converting back, is likely to be quicker than dealing with integer multiplication (and all the inevitable shifts / shuffles / or's that will be needed to re-merge the vectors).

Your case looks like a perfect fit for the prefetch as described in Chapter 7 of Intel 64 and IA-32 Architectures Optimization Reference Manual.

It isn't. The loop is memory bandwith bound, so the hardware prefetcher will give the best performance in this case. As always, use a profiler instead of 'something you read' - espeically when it comes to streaming & prefetching!

### #10Matias Goldberg  Members

9068
Like
0Likes
Like

Posted 13 March 2013 - 06:03 PM

Your case looks like a perfect fit for the prefetch as described in Chapter 7 of Intel 64 and IA-32 Architectures Optimization Reference Manual.

It isn't. The loop is memory bandwith bound, so the hardware prefetcher will give the best performance in this case. As always, use a profiler instead of 'something you read' - espeically when it comes to streaming & prefetching!

If you read Chapter 7, a common suggestion is that he could maximize throughput and latency by unrolling his loop so that he multiplies (i.e.) 4 SIMD vectors per iteration (4x4 = 16 floats/ints at a time) and use prefetch on every loop iteration to fetch the next 16 ints (or may be bigger). He's probably not bandwidth bound either because he may not be fully utilizing all the pipelines whith just one math instruction per loop.

Of course, whether that's actually faster needs always to be profiled. It's always a good advice. May be he would still need more math operations to make this change really worth it, or may be not. And he's a newbie just learning about SSE2, it's a good excercise to experiment with all not-so-obvious alternatives.

In other words (pseudo code):

for(int i=0; i<SIZE; i=i+8)
{
prefetch( ar + 8 ); //Needed?
result0 = _mm_add_epi32(sse00, sse2);
result1 = _mm_add_epi32(sse01, sse2);
result2 = _mm_add_epi32(sse02, sse2);
result3 = _mm_add_epi32(sse03, sse2);
_mm_store_si128((__m128i*)&ar[i+0], result0);
_mm_store_si128((__m128i*)&ar[i+2], result1);
_mm_store_si128((__m128i*)&ar[i+4], result2);
_mm_store_si128((__m128i*)&ar[i+6], result3);
}

Of course he still will need to handle the last iteration in case SIZE is not multiple of 8.

The performance results may vary wildly per architecture (i.e. Core 2 vs Core i7 vs Atom)

### #11RobTheBloke  Members

2536
Like
1Likes
Like

Posted 14 March 2013 - 06:39 PM

I've read chapter 7, but more importantly, I've actually profiled the benefits in a number of commerial products, on a wide variety of Intel CPU's (all the way from ATOM up to XEON). Loop unrolling is a given. What you haven't realised though, is that your psuedo-code above, will actually run faster if you remove the prefetch instruction! If you'd have profiled this use case, you would know this. The hardware prefetcher found in modern Intel chips is not stupid (unlike the pentium 4, which needed it's hand holding). It will pretty quickly grok that you're accessing an array linearly, and will take steps to optimise memory access for you. All the prefetch op in this case does, is just add an extra SIZE/8 instructions to execute, all of which will tell the prefetcher what it already knows. Added to that, you're making the assumption that this array must be in memory, but have you actually considered that it might already be loaded into the cache? Inserting prefetch instructions without any idea of whether you actually need them or not, is pointless in the extreme. There are cases where prefetching is useful, but this is not one of them.

That document has been around since the days of the pentium 4, and periodically it has new chapters added to it when new CPU's are released. It lists a set of techniques that may be useful across Intel CPU's (both young and old), but each CPU has it's own quirks and characteristics. That document is not a shopping list of optimisations that you must apply anywhere, and everywhere. It is a set of guidelines that, with the use of a profiler, can help you to improve the performance of your code.

Putting complete faith in ancient texts, following them to the letter, and failing to test whether any of the claims are valid has a name.

// how big is SIZE? 0x10? 0x100? 0x1000000000? This matters!
for(int i=0; i < SIZE; i=i+8)
{
prefetch( ar + 8 ); //Needed?  .... don't ask me, ask a profiler
sse00 = _mm_load_si128((__m128i*)&ar[i+0]);  //< read from memory (or cache)
sse01 = _mm_load_si128((__m128i*)&ar[i+2]);  //< read from memory (or cache)
sse02 = _mm_load_si128((__m128i*)&ar[i+4]);  //< read from memory (or cache)
sse03 = _mm_load_si128((__m128i*)&ar[i+6]);  //< read from memory (or cache)
result0 = _mm_add_epi32(sse00, sse2);
result1 = _mm_add_epi32(sse01, sse2);
result2 = _mm_add_epi32(sse02, sse2);
result3 = _mm_add_epi32(sse03, sse2);
_mm_store_si128((__m128i*)&ar[i+0], result0); //< place in memory (and cache)
_mm_store_si128((__m128i*)&ar[i+2], result1); //< place in memory (and cache)
_mm_store_si128((__m128i*)&ar[i+4], result2); //< place in memory (and cache)
_mm_store_si128((__m128i*)&ar[i+6], result3); //< place in memory (and cache)
}


### #12Matias Goldberg  Members

9068
Like
0Likes
Like

Posted 15 March 2013 - 07:51 PM

Putting complete faith in ancient texts, following them to the letter, and failing to test whether any of the claims are valid

That comment hurts, not because it hurts my ego; but because I've never implied it was a sure win or that I was beholder of all truth. I suggested insightful links to the OP about what's going on deeper than the apparent C/C++; and then given your input we've elaborated further.
I did realize that removing prefetch could improve performance (again, this <b>is</b> mentioned in chapter 7), and that's why I added a "Needed?" comment. I won't be profiling theoretical code that because I'm not going to use that code. The idea was that the OP should do that himself and test test test. The situation could change if memory access patterns suddenly change (because the theoretical code goes to practice and looks different enough), so he would have to profile again.
I'm very open to critics (and btw. I thank you for elaborating the point further, at first I only pointed to Ch. 7 for a read), but I won't tolerate putting words in my mouth (or in my fingers) that I didn't say.

If we were discussing a practical in-depth optimization of X library, then our conversation would be completely different.