SSE2 Integer operations on Vectors

Suen Lagash · 2013-03-16T01:51:30

I've been playing around with SSE to get a better understanding of it, using SSE2 intrinsics for integer operations. So far I've written a very common yet simple code example:

Vector2* ar = CacheAlignedAlloc<Vector2>(SIZE); // array aligned to 16 bytes (or a multiple of 16); SIZE is a large value

// Some values are set for ar here....

__m128i sse;
__m128i sse2 = _mm_set_epi32(0,5,0,5);
__m128i result;

for(int i=0; i<SIZE; i=i+2)
{
    sse = _mm_load_si128((__m128i*)&ar[i]);
    result = _mm_add_epi32(sse, sse2);
    _mm_store_si128((__m128i*)&ar[i], result);
}

Vector2 is a very simple struct:

struct Vector2
{
    int x, y;
};

The way things work now, I can load at most two Vector2 into a 128-bit register. This is fine if I want to perform an operation on both values of each Vector2 at the same time. However, if you look at the code above, the y-value of each Vector2 only gets zero added to it, so it remains unchanged; 64 bits of the 128-bit register are essentially doing nothing. Is there a way to load four x-values from four Vector2s instead, perform operations on them, and then store the result back again?
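One way to get four x-values into a single register is to load two registers (four Vector2), deinterleave them with shuffles, operate on the x lanes only, and interleave them again before storing. The following is a minimal sketch of that idea, not a tuned implementation. The function name `add_to_x` is hypothetical, and it assumes that `ar` is 16-byte aligned and that `count` is a multiple of 4. The integer registers are cast to float registers only so that `_mm_shuffle_ps` can pick lanes from two sources; the casts do not convert any values.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h> // SSE2 intrinsics (also provides the SSE shuffles)

struct Vector2 { int x, y; };

// Hypothetical helper: adds `delta` to the x component of every Vector2,
// four vectors per iteration. Assumes `ar` is 16-byte aligned and
// `count` is a multiple of 4.
inline void add_to_x(Vector2* ar, std::size_t count, int delta)
{
    assert(count % 4 == 0);
    const __m128i d = _mm_set1_epi32(delta);

    for (std::size_t i = 0; i < count; i += 4)
    {
        // Lowest lane first: a = x0 y0 x1 y1, b = x2 y2 x3 y3
        __m128i a = _mm_load_si128(reinterpret_cast<const __m128i*>(&ar[i]));
        __m128i b = _mm_load_si128(reinterpret_cast<const __m128i*>(&ar[i + 2]));

        // Deinterleave: bit-casts only, no int/float conversion takes place.
        __m128 af = _mm_castsi128_ps(a);
        __m128 bf = _mm_castsi128_ps(b);
        __m128i xs = _mm_castps_si128(_mm_shuffle_ps(af, bf, _MM_SHUFFLE(2, 0, 2, 0))); // x0 x1 x2 x3
        __m128i ys = _mm_castps_si128(_mm_shuffle_ps(af, bf, _MM_SHUFFLE(3, 1, 3, 1))); // y0 y1 y2 y3

        // All four lanes now do useful work.
        xs = _mm_add_epi32(xs, d);

        // Interleave again and store back in the original layout.
        _mm_store_si128(reinterpret_cast<__m128i*>(&ar[i]),     _mm_unpacklo_epi32(xs, ys)); // x0 y0 x1 y1
        _mm_store_si128(reinterpret_cast<__m128i*>(&ar[i + 2]), _mm_unpackhi_epi32(xs, ys)); // x2 y2 x3 y3
    }
}
```

For a single add, the four shuffles probably cost more than the wasted lanes they recover, so this is only likely to pay off when several operations are applied to the x-values between the load and the store. Storing the x and y values in two separate arrays avoids the shuffles entirely. Either way, it is worth measuring against the original loop.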

Started by Suen March 11, 2013 04:53 PM

10 comments, last by Matias Goldberg 11 years, 1 month ago

RobTheBloke

2,553

March 15, 2013 12:39 AM

I've read chapter 7, but more importantly, I've actually profiled the benefits in a number of commercial products, on a wide variety of Intel CPUs (all the way from Atom up to Xeon). Loop unrolling is a given. What you haven't realised, though, is that your pseudo-code above will actually run faster if you remove the prefetch instruction! If you had profiled this use case, you would know this. The hardware prefetcher found in modern Intel chips is not stupid (unlike the Pentium 4's, which needed its hand held). It will pretty quickly grok that you're accessing an array linearly, and will take steps to optimise memory access for you. All the prefetch op does in this case is add an extra SIZE/8 instructions to execute, all of which tell the prefetcher what it already knows. Added to that, you're making the assumption that this array must be in memory, but have you actually considered that it might already be loaded into the cache? Inserting prefetch instructions without any idea of whether you actually need them or not is pointless in the extreme. There are cases where prefetching is useful, but this is not one of them.

That document has been around since the days of the Pentium 4, and periodically it has new chapters added to it when new CPUs are released. It lists a set of techniques that may be useful across Intel CPUs (both young and old), but each CPU has its own quirks and characteristics. That document is not a shopping list of optimisations that you must apply anywhere and everywhere. It is a set of guidelines that, with the use of a profiler, can help you to improve the performance of your code.

Putting complete faith in ancient texts, following them to the letter, and failing to test whether any of the claims are valid has a name.

// how big is SIZE? 0x10? 0x100? 0x1000000000? This matters!
for(int i=0; i < SIZE; i=i+8)
{
    _mm_prefetch((const char*)&ar[i+8], _MM_HINT_T0); //Needed?  .... don't ask me, ask a profiler
    sse00 = _mm_load_si128((__m128i*)&ar[i+0]);  //< read from memory (or cache)
    sse01 = _mm_load_si128((__m128i*)&ar[i+2]);  //< read from memory (or cache)
    sse02 = _mm_load_si128((__m128i*)&ar[i+4]);  //< read from memory (or cache)
    sse03 = _mm_load_si128((__m128i*)&ar[i+6]);  //< read from memory (or cache)
    result0 = _mm_add_epi32(sse00, sse2);
    result1 = _mm_add_epi32(sse01, sse2);
    result2 = _mm_add_epi32(sse02, sse2);
    result3 = _mm_add_epi32(sse03, sse2);
    _mm_store_si128((__m128i*)&ar[i+0], result0); //< place in memory (and cache)
    _mm_store_si128((__m128i*)&ar[i+2], result1); //< place in memory (and cache)
    _mm_store_si128((__m128i*)&ar[i+4], result2); //< place in memory (and cache)
    _mm_store_si128((__m128i*)&ar[i+6], result3); //< place in memory (and cache)
}
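Whether the prefetch helps can only be settled by measuring. A minimal sketch of such a measurement follows, assuming a 64-byte cache line, a 16-byte aligned array, and an element count that is a multiple of 8. The names `add_loop` and `time_ms` are hypothetical helpers, not part of any library, and a wall-clock average like this is only a rough stand-in for a real profiler.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <emmintrin.h> // SSE2 intrinsics (also provides _mm_prefetch)

struct Vector2 { int x, y; };

// Hypothetical helper: adds (5, 0) to every Vector2, eight vectors (64 bytes)
// per iteration. When UsePrefetch is true, a hint for the next 64 bytes is
// issued each iteration. Assumes `ar` is 16-byte aligned and `count` is a
// multiple of 8.
template <bool UsePrefetch>
void add_loop(Vector2* ar, std::size_t count)
{
    assert(count % 8 == 0);
    const __m128i delta = _mm_set_epi32(0, 5, 0, 5); // lanes: x+5, y+0, x+5, y+0

    for (std::size_t i = 0; i < count; i += 8)
    {
        if (UsePrefetch)
        {
            // A prefetch is only a hint and does not fault, so the address
            // one past the end on the final iteration is harmless.
            _mm_prefetch(reinterpret_cast<const char*>(&ar[i + 8]), _MM_HINT_T0);
        }

        __m128i v0 = _mm_load_si128(reinterpret_cast<const __m128i*>(&ar[i + 0]));
        __m128i v1 = _mm_load_si128(reinterpret_cast<const __m128i*>(&ar[i + 2]));
        __m128i v2 = _mm_load_si128(reinterpret_cast<const __m128i*>(&ar[i + 4]));
        __m128i v3 = _mm_load_si128(reinterpret_cast<const __m128i*>(&ar[i + 6]));

        _mm_store_si128(reinterpret_cast<__m128i*>(&ar[i + 0]), _mm_add_epi32(v0, delta));
        _mm_store_si128(reinterpret_cast<__m128i*>(&ar[i + 2]), _mm_add_epi32(v1, delta));
        _mm_store_si128(reinterpret_cast<__m128i*>(&ar[i + 4]), _mm_add_epi32(v2, delta));
        _mm_store_si128(reinterpret_cast<__m128i*>(&ar[i + 6]), _mm_add_epi32(v3, delta));
    }
}

// Hypothetical helper: average wall-clock time of one call to `f`, in milliseconds.
template <typename F>
double time_ms(F&& f, int repeats)
{
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
    {
        f();
    }
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / repeats;
}
```

Calling `time_ms([&] { add_loop<true>(ar, SIZE); }, repeats)` and the same with `add_loop<false>` gives two numbers to compare. The comparison is worth repeating for a SIZE that fits in cache and for one that does not, since the answer may differ between the two, and on different CPUs.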

Matias Goldberg

9,637

March 16, 2013 01:51 AM

Putting complete faith in ancient texts, following them to the letter, and failing to test whether any of the claims are valid

That comment hurts, not because it bruises my ego, but because I never implied it was a sure win or that I was the holder of all truth. I suggested insightful links to the OP about what's going on at a deeper level than the apparent C/C++, and then, given your input, we elaborated further.
I did realize that removing the prefetch could improve performance (again, this is mentioned in chapter 7), and that's why I added a "Needed?" comment. I won't be profiling theoretical code, because I'm not going to use that code. The idea was that the OP should do that himself and test, test, test. The situation could change if the memory access patterns suddenly change (because the theoretical code goes into practice and looks different enough), so he would have to profile again.
I'm very open to criticism (and by the way, thank you for elaborating the point further; at first I only pointed to Ch. 7 for a read), but I won't tolerate having words put in my mouth (or in my fingers) that I didn't say.

If we were discussing a practical in-depth optimization of X library, then our conversation would be completely different.

Twitter: @matiasgoldberg

Distant Souls · Alliance AirWar · My Free Royalty-Free Music Library
