SSE2 Integer operations on Vectors

10 comments, last by Matias Goldberg 11 years, 1 month ago

I've read chapter 7, but more importantly, I've actually profiled the benefits in a number of commercial products, on a wide variety of Intel CPUs (all the way from Atom up to Xeon). Loop unrolling is a given. What you haven't realised, though, is that your pseudo-code above will actually run faster if you remove the prefetch instruction! If you had profiled this use case, you would know this. The hardware prefetcher found in modern Intel chips is not stupid (unlike the one in the Pentium 4, which needed its hand held). It will pretty quickly grok that you're accessing an array linearly, and will take steps to optimise memory access for you. All the prefetch op does in this case is add an extra SIZE/8 instructions to execute, all of which tell the prefetcher what it already knows. Added to that, you're assuming that this array must be in memory, but have you actually considered that it might already be loaded into the cache? Inserting prefetch instructions without any idea of whether you actually need them is pointless in the extreme. There are cases where prefetching is useful, but this is not one of them.

That document has been around since the days of the Pentium 4, and periodically it has new chapters added to it when new CPUs are released. It lists a set of techniques that may be useful across Intel CPUs (both young and old), but each CPU has its own quirks and characteristics. That document is not a shopping list of optimisations that you must apply anywhere and everywhere. It is a set of guidelines that, with the use of a profiler, can help you to improve the performance of your code.

Putting complete faith in ancient texts, following them to the letter, and failing to test whether any of the claims are valid has a name.

// how big is SIZE? 0x10? 0x100? 0x1000000000? This matters!
for(int i=0; i < SIZE; i=i+8)
{
    prefetch( ar + i + 8 ); //Needed?  .... don't ask me, ask a profiler
    sse00 = _mm_load_si128((__m128i*)&ar[i+0]);  //< read from memory (or cache)
    sse01 = _mm_load_si128((__m128i*)&ar[i+2]);  //< read from memory (or cache)
    sse02 = _mm_load_si128((__m128i*)&ar[i+4]);  //< read from memory (or cache)
    sse03 = _mm_load_si128((__m128i*)&ar[i+6]);  //< read from memory (or cache)
    result0 = _mm_add_epi32(sse00, sse2);
    result1 = _mm_add_epi32(sse01, sse2);
    result2 = _mm_add_epi32(sse02, sse2);
    result3 = _mm_add_epi32(sse03, sse2);
    _mm_store_si128((__m128i*)&ar[i+0], result0); //< place in memory (and cache)
    _mm_store_si128((__m128i*)&ar[i+2], result1); //< place in memory (and cache)
    _mm_store_si128((__m128i*)&ar[i+4], result2); //< place in memory (and cache)
    _mm_store_si128((__m128i*)&ar[i+6], result3); //< place in memory (and cache)
}
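
For anyone who wants to try this rather than argue about it, here is a minimal compilable sketch of the two variants being debated. It is not the code from either poster. The thread never states the element type of ar, so this sketch assumes 32-bit ints, which means each 128-bit register covers 4 elements and one iteration covers 16 elements (one 64-byte cache line), unlike the offsets in the pseudo-code above. The function names add_scalar_unrolled and add_scalar_unrolled_prefetch are made up for illustration, and the prefetch distance of one cache line is only a guess.

```cpp
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_add_epi32, _mm_store_si128
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#include <cassert>
#include <cstddef>
#include <cstdint>

// Adds `value` to every element, four 128-bit registers per iteration.
// Assumptions (not from the thread): int32 elements, `ar` is 16-byte
// aligned, and `count` is a multiple of 16.
inline void add_scalar_unrolled(std::int32_t* ar, std::size_t count, std::int32_t value)
{
    assert(reinterpret_cast<std::uintptr_t>(ar) % 16 == 0);
    assert(count % 16 == 0);
    const __m128i addend = _mm_set1_epi32(value);
    for (std::size_t i = 0; i < count; i += 16)
    {
        __m128i v0 = _mm_load_si128(reinterpret_cast<const __m128i*>(ar + i + 0));
        __m128i v1 = _mm_load_si128(reinterpret_cast<const __m128i*>(ar + i + 4));
        __m128i v2 = _mm_load_si128(reinterpret_cast<const __m128i*>(ar + i + 8));
        __m128i v3 = _mm_load_si128(reinterpret_cast<const __m128i*>(ar + i + 12));
        _mm_store_si128(reinterpret_cast<__m128i*>(ar + i + 0),  _mm_add_epi32(v0, addend));
        _mm_store_si128(reinterpret_cast<__m128i*>(ar + i + 4),  _mm_add_epi32(v1, addend));
        _mm_store_si128(reinterpret_cast<__m128i*>(ar + i + 8),  _mm_add_epi32(v2, addend));
        _mm_store_si128(reinterpret_cast<__m128i*>(ar + i + 12), _mm_add_epi32(v3, addend));
    }
}

// Same loop, plus a software prefetch of the next cache line.
// Whether this helps or hurts is exactly what a profiler has to answer.
inline void add_scalar_unrolled_prefetch(std::int32_t* ar, std::size_t count, std::int32_t value)
{
    assert(reinterpret_cast<std::uintptr_t>(ar) % 16 == 0);
    assert(count % 16 == 0);
    const __m128i addend = _mm_set1_epi32(value);
    for (std::size_t i = 0; i < count; i += 16)
    {
        // Prefetch is only a hint; on the last iteration the address is
        // one past the end of the array, which is never dereferenced.
        _mm_prefetch(reinterpret_cast<const char*>(ar + i + 16), _MM_HINT_T0);
        __m128i v0 = _mm_load_si128(reinterpret_cast<const __m128i*>(ar + i + 0));
        __m128i v1 = _mm_load_si128(reinterpret_cast<const __m128i*>(ar + i + 4));
        __m128i v2 = _mm_load_si128(reinterpret_cast<const __m128i*>(ar + i + 8));
        __m128i v3 = _mm_load_si128(reinterpret_cast<const __m128i*>(ar + i + 12));
        _mm_store_si128(reinterpret_cast<__m128i*>(ar + i + 0),  _mm_add_epi32(v0, addend));
        _mm_store_si128(reinterpret_cast<__m128i*>(ar + i + 4),  _mm_add_epi32(v1, addend));
        _mm_store_si128(reinterpret_cast<__m128i*>(ar + i + 8),  _mm_add_epi32(v2, addend));
        _mm_store_si128(reinterpret_cast<__m128i*>(ar + i + 12), _mm_add_epi32(v3, addend));
    }
}
```

Both functions produce identical results, so the only thing left to compare is the timing, for several values of SIZE (small enough to sit in cache, and large enough not to).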

Putting complete faith in ancient texts, following them to the letter, and failing to test whether any of the claims are valid

That comment hurts, not because it bruises my ego, but because I never implied it was a sure win or that I was the holder of all truth. I suggested insightful links to the OP about what's going on beneath the apparent C/C++, and then, given your input, we elaborated further.
I did realize that removing the prefetch could improve performance (again, this is mentioned in chapter 7), and that's why I added a "Needed?" comment. I won't be profiling theoretical code because I'm not going to use that code. The idea was that the OP should do that himself and test, test, test. The situation could change if the memory access patterns change (because the theoretical code goes into practice and looks different enough), so he would have to profile again.
I'm very open to criticism (and by the way, thank you for elaborating the point further; at first I only pointed to Ch. 7 for a read), but I won't tolerate having words put in my mouth (or in my fingers) that I didn't say.

If we were discussing a practical in-depth optimization of X library, then our conversation would be completely different.
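
As a rough illustration of the "test, test, test" step, a timing helper could look something like the sketch below. The name best_of_ns is hypothetical and not from any library; it simply runs a callable several times and keeps the fastest wall-clock time, which filters out some noise from interrupts and cold caches. It is no replacement for a real profiler, and the result still depends heavily on the array size and on whether the data is already cached.

```cpp
#include <chrono>
#include <cstdint>
#include <limits>

// Runs `fn` `repeats` times and returns the fastest run in nanoseconds.
// With repeats <= 0 nothing is run and the maximum int64 value is returned.
template <typename Fn>
std::int64_t best_of_ns(Fn&& fn, int repeats)
{
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < repeats; ++r)
    {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        const std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        if (ns < best)
            best = ns;
    }
    return best;
}
```

The loop under test would be wrapped in a lambda and timed once with the prefetch and once without, for each array size of interest, and the measurement repeated whenever the access pattern of the real code changes.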

