ex:
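const int NumElements = 10000;
const int NumLoops = 1000;

// Buffers sized to match the inner loop bound
int a[NumElements];
int b[NumElements];
int c[NumElements];

void CPPTest(){
    // Repeat the same element-wise add NumLoops times
    for(int i = 0; i < NumLoops; i++){
        for(int j = 0; j < NumElements; j++){
            c[j] = a[j] + b[j];
        }
    }
}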
I don't have the code with me at the moment, but that is basically what I do. I then do the same in two other functions, one using MMX and one using SSE, replacing the inner loop with assembly code.
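For reference, here is a rough sketch of what the SSE variant could look like. It is written with SSE2 intrinsics rather than the inline assembly I actually used, so the function names SSETest and CheckSSEResult and the choice of unaligned loads are assumptions for illustration, not my original code. Note that adding 32-bit ints in 128-bit registers needs SSE2, not just the original SSE.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics (128-bit integer operations)

const int NumElements = 10000;
const int NumLoops = 1000;

int a[NumElements];
int b[NumElements];
int c[NumElements];

// Hypothetical SSE2 counterpart of CPPTest(): adds four ints per iteration.
void SSETest() {
    for (int i = 0; i < NumLoops; i++) {
        int j = 0;
        // Vector part: unaligned loads/stores, so no alignment assumption on a, b, c.
        for (; j + 4 <= NumElements; j += 4) {
            __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&a[j]));
            __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&b[j]));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(&c[j]), _mm_add_epi32(va, vb));
        }
        // Scalar tail for any elements left over when NumElements is not a multiple of 4.
        for (; j < NumElements; j++) {
            c[j] = a[j] + b[j];
        }
    }
}

// Hypothetical sanity check: the vector result must match the plain scalar sum.
void CheckSSEResult() {
    for (int j = 0; j < NumElements; j++) {
        assert(c[j] == a[j] + b[j]);
    }
}
```

Running CheckSSEResult() after SSETest() is a cheap way to make sure the hand-written version computes the same thing as the C++ loop before comparing timings.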
Sure, in the debug build with no optimization the SSE version is about 10-12 times faster, and the MMX version shows some improvement as well. In release mode, however, the MMX version is about 10% slower and the SSE version is only slightly better, maybe 5%. I have to say I was expecting better results. I also noticed that with smaller buffers I get better results, while with bigger ones the results even out. I suspect the cache is doing this.
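To check the cache theory, something like the sketch below could time the plain C++ loop at several buffer sizes and print the time next to the number of bytes touched per pass. TimeAdd and WorkingSetBytes are names made up for illustration and are not part of my test code. It also assumes n is greater than zero, and an optimizing compiler may still simplify the repeated passes, so the numbers should be treated with care.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: bytes touched per pass (three int buffers of n elements).
std::size_t WorkingSetBytes(std::size_t n) {
    return 3 * n * sizeof(int);
}

// Hypothetical helper: times `loops` passes of c[j] = a[j] + b[j] over
// buffers of n ints (n > 0) and returns the elapsed time in seconds.
double TimeAdd(std::size_t n, int loops) {
    std::vector<int> a(n, 1), b(n, 2), c(n, 0);

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < loops; i++) {
        for (std::size_t j = 0; j < n; j++) {
            c[j] = a[j] + b[j];
        }
    }
    auto stop = std::chrono::steady_clock::now();

    // Read one result through a volatile so the stores are not discarded entirely.
    volatile int sink = c[n / 2];
    (void)sink;

    return std::chrono::duration<double>(stop - start).count();
}
```

Calling TimeAdd with, say, 1000, 10000 and 1000000 elements and dividing each time by n * loops gives a per-element cost. If that cost rises as WorkingSetBytes(n) grows, the loop is probably limited by memory rather than by the add instruction itself.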
So that's why I'm asking: is it still worth using those instructions when the compiler is so good at optimizing the code?