Is SIMD worth it?

Hi, I've been experimenting with some SIMD lately, like MMX and SSE/SSE2, and I'm a bit disappointed by the results. I'm doing simple stuff like filling two arrays with random numbers, then adding them element-wise into a third array, using C++, MMX and SSE (I'm using inline assembly in the last two functions, not the intrinsic functions).

ex:

const int NumElements = 10000;
const int NumLoops = 1000;

int a[1000];
int b[1000];
int c[1000];

void CPPTest(){
    for(int i = 0; i < NumLoops; i++){
        for(int j = 0; j < NumElements; j++){
            c[j] = a[j] + b[j];
        }
    }
}



I don't have the code with me at the moment, but that's basically what I do; the other two functions are the same except that the inner loop is replaced with MMX or SSE assembly code.

Sure, the debug build with no optimization is about 10-12 times faster with SSE, and MMX shows some improvement as well, but in release mode the MMX version is about 10% slower, and the SSE version is only slightly better, maybe 5%. I have to say I was expecting better results. I also noticed that if I use smaller buffers I get better results; with bigger ones the results even out. I suspect the cache is responsible for this.

So that's why I'm asking: is it still worth using those instructions when the compiler is so good at optimizing the code?
Short answer: yes. You're simply not testing the right thing.

Long answer: microbenchmarks are useless for this kind of test. You need to know what the actual benefits would be in real usage situations. Cache pressure, pipelining, etc. etc. can have massive impacts on the performance of a real piece of code. At the end of the day, profile. Don't guess. Find something that you can prove is slow - via timing - and then see if SIMD benefits it.

As a side note, compilers can emit pretty good SSE instructions for most code nowadays. If you're building a 64-bit binary you're already using SSE whether you know it or not. Also, writing your own hand-rolled assembly is never a good idea for performance-intensive operations: it inhibits certain compiler optimizations in the vicinity of your code. Use intrinsics instead.
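To illustrate, here is a minimal sketch of what an intrinsics version of that inner loop could look like. It is not drop-in code: it assumes SSE2, a count that is a multiple of 4, and arrays that are 16-byte aligned (e.g. with alignas(16)).

#include <emmintrin.h>   // SSE2 intrinsics

// Sketch: the same element-wise add, four 32-bit ints at a time.
// Assumes count is a multiple of 4 and the pointers are 16-byte aligned.
void AddSSE2(const int* a, const int* b, int* c, int count)
{
    for (int j = 0; j < count; j += 4)
    {
        __m128i va = _mm_load_si128(reinterpret_cast<const __m128i*>(&a[j]));
        __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(&b[j]));
        _mm_store_si128(reinterpret_cast<__m128i*>(&c[j]), _mm_add_epi32(va, vb));
    }
}

Because the compiler sees ordinary function calls rather than an opaque __asm block, it can still schedule, inline and unroll this together with the surrounding code.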

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

In your example, the bottleneck of the program is almost certainly memory bandwidth -- reading/writing your arrays -- so optimizing the ALU-cost of the algorithm should be expected to have little impact.

As with all optimizations, you should profile first.

But putting that aside, one of the main issues with SIMD optimization is that you need to lay out your data and computations in a way that lets you benefit from SIMD. That is a task the compiler can't do for you. If your data is laid out like in the example above, then adding intrinsics is a no-brainer, and I would assume the compiler has already done it to a large degree. In a real-world example (and I think this is what ApochPiQ was getting at), those values would be spread out among random structs somewhere on the heap, and there is nothing the compiler can do. This is where you will see massive speed improvements, not only because of SIMD but also because you are forced to reorganize your data in a way that is more compiler- and CPU/cache-friendly.
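A contrived sketch of that point (the type names are invented for illustration): in the first layout the values the loop needs are scattered across heap-allocated objects, so neither you nor the compiler can load four of them with one SSE instruction; in the second they sit in one contiguous array, like in the benchmark above, and vectorization is straightforward.

#include <vector>

// Values spread across random structs on the heap: there is no way to load
// four consecutive 'value's into one SSE register.
struct GameObject {
    int value;
    // ... plenty of unrelated members ...
};
std::vector<GameObject*> objects;   // each object allocated separately

// The same data reorganized: all the values packed into one contiguous,
// cache- and SIMD-friendly array.
std::vector<int> values;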

The code has a buffer overrun. NumElements should be 1000.

For smaller array sizes (i.e. totals under a few MB), the arrays are likely to reside in the cache, and you should notice a performance boost with SSE. For array sizes that exceed the cache size, it's likely that chunks will have to be evicted and re-read (as Hodgman has said). In this case, you can improve performance a little more by using _mm_stream_si128 to write the result to memory without placing a copy in the cache (which leaves more room for the input data, which should help performance a little bit). Really though, your approach needs to be tuned towards the hardware a little better. At the moment you are basically benchmarking how fast your main memory is, which probably isn't that useful as a metric.
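As a sketch (the function name is made up; it assumes 16-byte aligned arrays and a count that is a multiple of 4):

#include <emmintrin.h>   // SSE2: _mm_add_epi32, _mm_stream_si128

// Same add loop, but the result is written with a non-temporal store, so the
// output array does not evict the input arrays from the cache.
void AddStreamed(const int* a, const int* b, int* c, int count)
{
    for (int j = 0; j < count; j += 4)
    {
        __m128i va = _mm_load_si128(reinterpret_cast<const __m128i*>(&a[j]));
        __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(&b[j]));
        _mm_stream_si128(reinterpret_cast<__m128i*>(&c[j]), _mm_add_epi32(va, vb));
    }
    _mm_sfence();   // make the streamed stores visible before c is read elsewhere
}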

Memory is slow. The less you read/write, the better the performance will be. Once you have read some memory, try to do as much work on that data as possible BEFORE you write it back out again (i.e. one loop that does lots of work is better than lots of loops that do very little). By doing more work for each SIMD value you read, you will hopefully be able to mask the latency of the memory, and you should get some pretty decent performance from SIMD.
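A contrived sketch of the idea (function names invented): both versions do the same arithmetic, but the second reads and writes each element once instead of three times, so the ALU work gets a chance to hide the memory latency.

// Three passes over the data: each one is memory-bound on its own.
void ThreePasses(float* v, int count, float scale, float bias)
{
    for (int i = 0; i < count; i++) v[i] *= scale;
    for (int i = 0; i < count; i++) v[i] += bias;
    for (int i = 0; i < count; i++) v[i] = v[i] * v[i];
}

// One pass: the same work per element, but each value is touched once.
void OnePass(float* v, int count, float scale, float bias)
{
    for (int i = 0; i < count; i++)
    {
        float x = v[i] * scale + bias;
        v[i] = x * x;
    }
}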

SIMD is very much about knowing your data, its transforms, and how the CPU is going to treat it in terms of memory I/O.

Your problem with this test is, as mentioned, that it is memory-bandwidth bound; you are doing very little ALU work, and while the CPU will be prefetching ahead, in this case you don't have enough ALU work to cover the stalls to main memory.

With your C++ code, by working on one value at a time and simply stepping through the data, some of the latency to main memory will be covered anyway.

Assuming your SIMD routines work on 4 values at a time, you are doing the ALU work 4 times faster, but you just run into the potential memory stalls that much sooner.

Throw some more register-heavy ALU work in there and you'll notice a speed-up.

This is where thinking about the data and the transforms comes into it as well. For example, if you have a simple 2D particle system running, you can break up your data in such a way that you do all the 'x' components first, then 'y', then update any velocity for x, then y, and so on, with each section of data in its own nicely aligned chunk of memory. That means you can stream through it, taking advantage of I-cache and D-cache coherency and prefetching, and (on x64) working within the limited set of registers you have to hand.
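A hypothetical sketch of that kind of layout and update (names invented; assumes 16-byte aligned arrays and a count that is a multiple of 4):

#include <xmmintrin.h>   // SSE: __m128, _mm_load_ps, _mm_add_ps, ...

// Struct-of-arrays particle data: each component streams through memory contiguously.
struct Particles2D
{
    alignas(16) float x[4096];
    alignas(16) float y[4096];
    alignas(16) float vx[4096];
    alignas(16) float vy[4096];
};

// Update all x positions, then all y positions, four at a time.
void Integrate(Particles2D& p, int count, float dt)
{
    __m128 vdt = _mm_set1_ps(dt);
    for (int i = 0; i < count; i += 4)
    {
        __m128 x  = _mm_load_ps(&p.x[i]);
        __m128 vx = _mm_load_ps(&p.vx[i]);
        _mm_store_ps(&p.x[i], _mm_add_ps(x, _mm_mul_ps(vx, vdt)));
    }
    for (int i = 0; i < count; i += 4)
    {
        __m128 y  = _mm_load_ps(&p.y[i]);
        __m128 vy = _mm_load_ps(&p.vy[i]);
        _mm_store_ps(&p.y[i], _mm_add_ps(y, _mm_mul_ps(vy, vdt)));
    }
}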

In short: done correctly and with enough ALU work, SIMD is most definitely worth it once you know what you are doing :)
