My take is that the basic idea is to pack four operations together and issue them in one go:
a single instruction that works on 2 or 4 inputs simultaneously.
It is also compatible with multi-threading.
All single instructions are atomic with regard to input/output reads and writes.
The question is how much performance gain we usually get after optimizing a program with SSE, especially for a math library.
This depends on many factors, especially on how many SIMD instructions you can string together without loading and flushing (storing) the data in between, and on the instruction set (which I assume to be x86 here, since you mention SSE). Some SIMD instructions perform 2 operations (multiply-add, for example) or can otherwise algorithmically replace multiple single-data instructions, resulting in even greater performance.
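As a rough sketch of what "stringing instructions together" looks like, here is a loop that multiplies and adds four floats per iteration and only writes the result back once. The function name mul_add_sse is made up for this example, and since plain SSE has no fused multiply-add for floats, it uses a separate multiply and add. Unaligned loads are used so the sketch makes no assumption about how the arrays were allocated.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics (x86 only) */

/* Hypothetical helper: out[i] = a[i] * b[i] + c[i] for i in [0, n). */
void mul_add_sse(const float *a, const float *b, const float *c,
                 float *out, size_t n)
{
    size_t i = 0;

    /* Main loop: 4 floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);

        /* The product stays in a register; it is never written to memory. */
        __m128 r = _mm_add_ps(_mm_mul_ps(va, vb), vc);

        _mm_storeu_ps(out + i, r);
    }

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; ++i)
        out[i] = a[i] * b[i] + c[i];
}
```

The scalar tail is there so the function works for any n, not just multiples of 4.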
I'm not expecting a fourfold boost; however, I do expect twice the performance. Is it possible?
Very generally speaking, 3 times is reasonable.
My simple experiment with SSE shows that it cuts the cost by more than 40% compared with the non-SSE solution, which is close to twice the previous performance.
Is it normal?
It largely depends on the math you are testing. Short functions with only 1 or 2 SIMD instructions and only a single load/flush pair will likely perform about the same as without SIMD. The more operations that are needed on the data between loading it and flushing it, and that can be replaced with SIMD instructions, the more speed-up you get.
It is normal for a variety of math functions to fall along this spectrum, from 30% faster to 70% faster. Also, you haven’t been clear on exactly what drops by 40% or how you measured it. Based on your report, I see no reason for alarm yet.
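To illustrate the favorable end of that spectrum, here is a minimal sketch of a function that does six arithmetic instructions between one load and one flush: a cubic polynomial evaluated with Horner's rule on four floats at a time. The name poly3_sse and the coefficient layout (c[0] is the constant term) are assumptions for the example, not something from your code.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics (x86 only) */

/* Hypothetical helper:
 * y[i] = ((c[3]*x[i] + c[2])*x[i] + c[1])*x[i] + c[0] for i in [0, n). */
void poly3_sse(const float *x, float *y, size_t n, const float c[4])
{
    /* Broadcast each coefficient into all four lanes once, outside the loop. */
    const __m128 c0 = _mm_set1_ps(c[0]);
    const __m128 c1 = _mm_set1_ps(c[1]);
    const __m128 c2 = _mm_set1_ps(c[2]);
    const __m128 c3 = _mm_set1_ps(c[3]);
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);               /* single load  */
        __m128 r  = _mm_add_ps(_mm_mul_ps(c3, vx), c2);
        r = _mm_add_ps(_mm_mul_ps(r, vx), c1);
        r = _mm_add_ps(_mm_mul_ps(r, vx), c0);
        _mm_storeu_ps(y + i, r);                       /* single flush */
    }

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; ++i)
        y[i] = ((c[3] * x[i] + c[2]) * x[i] + c[1]) * x[i] + c[0];
}
```

Compare that with a function that only adds two vectors: there the loads and the flush dominate, and the single SIMD add in between has little chance to pay for itself.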
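On the measurement point, it helps to be explicit about whether a percentage refers to time saved or to speed gained, since the two are not the same number. A small sketch of the conversion, assuming the 40% means the run time dropped by 40% (the function name is invented for the example):

```c
/* Hypothetical helper: converts a fractional reduction in run time
 * (0.4 means "takes 40% less time") into a speed-up factor.
 * Only meaningful for 0 <= fraction_saved < 1. */
double speedup_from_time_saved(double fraction_saved)
{
    return 1.0 / (1.0 - fraction_saved);
}
```

Under that assumption, a 40% drop in time works out to roughly 1.67 times the original speed, and it takes a 50% drop to reach exactly twice the performance.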
L. Spiro