I've read chapter 7, but more importantly, I've actually profiled the benefits in a number of commercial products, on a wide variety of Intel CPUs (all the way from Atom up to Xeon). Loop unrolling is a given. What you haven't realised, though, is that your pseudo-code above will actually run faster if you remove the prefetch instruction! If you had profiled this use case, you would know this. The hardware prefetcher found in modern Intel chips is not stupid (unlike the one in the Pentium 4, which needed its hand held). It will pretty quickly grok that you're accessing an array linearly, and will take steps to optimise memory access for you. All the prefetch op does in this case is add an extra SIZE/8 instructions to execute, all of which tell the prefetcher what it already knows. Added to that, you're assuming that this array must be in memory, but have you actually considered that it might already be loaded into the cache? Inserting prefetch instructions without any idea of whether you actually need them is pointless in the extreme. There are cases where prefetching is useful, but this is not one of them.
That document has been around since the days of the Pentium 4, and periodically has new chapters added to it when new CPUs are released. It lists a set of techniques that may be useful across Intel CPUs (both young and old), but each CPU has its own quirks and characteristics. That document is not a shopping list of optimisations that you must apply anywhere and everywhere. It is a set of guidelines that, with the use of a profiler, can help you to improve the performance of your code.
Putting complete faith in ancient texts, following them to the letter, and failing to test whether any of the claims are valid has a name.
// how big is SIZE? 0x10? 0x100? 0x1000000000? This matters!
for(int i=0; i < SIZE; i=i+8)
{
prefetch( ar + i + 8 ); //Needed? .... don't ask me, ask a profiler
sse00 = _mm_load_si128((__m128i*)&ar[i+0]); //< read from memory (or cache)
sse01 = _mm_load_si128((__m128i*)&ar[i+2]); //< read from memory (or cache)
sse02 = _mm_load_si128((__m128i*)&ar[i+4]); //< read from memory (or cache)
sse03 = _mm_load_si128((__m128i*)&ar[i+6]); //< read from memory (or cache)
result0 = _mm_add_epi32(sse00, sse2);
result1 = _mm_add_epi32(sse01, sse2);
result2 = _mm_add_epi32(sse02, sse2);
result3 = _mm_add_epi32(sse03, sse2);
_mm_store_si128((__m128i*)&ar[i+0], result0); //< place in memory (and cache)
_mm_store_si128((__m128i*)&ar[i+2], result1); //< place in memory (and cache)
_mm_store_si128((__m128i*)&ar[i+4], result2); //< place in memory (and cache)
_mm_store_si128((__m128i*)&ar[i+6], result3); //< place in memory (and cache)
}
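
If you want to ask the profiler rather than take my word for it, something along these lines is a reasonable starting point. It is only a rough sketch under my own assumptions: the names add_plain, add_prefetch and time_ms are made up for illustration, the elements are taken to be 32-bit ints (so 16 of them fill a 64-byte cache line), the array is assumed to be 16-byte aligned with a length that is a multiple of 16, and clock() is a coarse timer that a real profiler would easily beat.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in _mm_prefetch */
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Assumption: 64-byte cache lines, i.e. 16 x int32_t per line. */
#define INTS_PER_LINE 16

/* Adds k to every element, no software prefetch.
   Requires: ar is 16-byte aligned, n is a multiple of INTS_PER_LINE. */
static void add_plain(int32_t *ar, size_t n, int32_t k)
{
    const __m128i vk = _mm_set1_epi32(k);
    assert(((uintptr_t)ar % 16) == 0 && (n % INTS_PER_LINE) == 0);
    for (size_t i = 0; i < n; i += INTS_PER_LINE) {
        for (size_t j = 0; j < INTS_PER_LINE; j += 4) {
            __m128i v = _mm_load_si128((__m128i *)&ar[i + j]);
            _mm_store_si128((__m128i *)&ar[i + j], _mm_add_epi32(v, vk));
        }
    }
}

/* Same loop, but hints the next cache line on every iteration.
   On the last iteration the hint points one past the end of the array;
   x86 prefetch hints do not fault on such addresses. */
static void add_prefetch(int32_t *ar, size_t n, int32_t k)
{
    const __m128i vk = _mm_set1_epi32(k);
    assert(((uintptr_t)ar % 16) == 0 && (n % INTS_PER_LINE) == 0);
    for (size_t i = 0; i < n; i += INTS_PER_LINE) {
        _mm_prefetch((const char *)&ar[i + INTS_PER_LINE], _MM_HINT_T0);
        for (size_t j = 0; j < INTS_PER_LINE; j += 4) {
            __m128i v = _mm_load_si128((__m128i *)&ar[i + j]);
            _mm_store_si128((__m128i *)&ar[i + j], _mm_add_epi32(v, vk));
        }
    }
}

/* Runs fn over the array reps times and returns the CPU time in ms. */
static double time_ms(void (*fn)(int32_t *, size_t, int32_t),
                      int32_t *ar, size_t n, int reps)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r)
        fn(ar, n, 1);
    return 1000.0 * (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Call time_ms(add_plain, ...) and time_ms(add_prefetch, ...) on the same buffer for a range of sizes, from something that fits in L1 up to something far larger than the last-level cache, and compare the numbers on the CPU you actually ship on. The answer is allowed to differ between sizes and between chips, which is the whole point.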