Currently I'm not using optimization
Ah? If you want to measure performance you need to compile with optimizations, at least /O2. Otherwise the compiler keeps every variable in memory, so nearly every instruction goes through a load or a store.
The best tool for profiling is Intel's VTune; they have a 30-day free trial. Visual Studio comes with a profiler you can use as well. If you want to profile yourself, you can use the timestamp counter (the rdtsc instruction; Visual Studio has an intrinsic for it, __rdtsc).
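A minimal sketch of do-it-yourself timing with the timestamp counter. The workload (sum_array) and the helper name (min_ticks) are made up for illustration; only __rdtsc is the real intrinsic. rdtsc is not a serializing instruction, so very short regions will be noisy. Taking the minimum over several runs helps a bit.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>      // __rdtsc on MSVC
#else
#include <x86intrin.h>   // __rdtsc on GCC/Clang
#endif

// Hypothetical workload: sums an array of floats.
static float sum_array(const float* data, std::size_t count) {
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        total += data[i];
    }
    return total;
}

// Runs the workload `runs` times and returns the smallest TSC delta seen.
// The minimum filters out most interrupt and cache warm-up noise.
// The last computed sum is written to *result so the work is not discarded.
static std::uint64_t min_ticks(const float* data, std::size_t count,
                               int runs, float* result) {
    std::uint64_t best = UINT64_MAX;
    for (int r = 0; r < runs; ++r) {
        const std::uint64_t start = __rdtsc();
        *result = sum_array(data, count);
        const std::uint64_t stop = __rdtsc();
        if (stop - start < best) {
            best = stop - start;
        }
    }
    return best;
}
```

Call it with something like min_ticks(data, count, 100, &sum) and compare the numbers between builds. The value is in TSC ticks, not in seconds.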
__forceinline is very good for performance; I use it almost everywhere in performance-critical code.
The problem is probably the shuffles. Not sure what CPU you are using, but shuffle throughput is limited. You have 3 shuffles and 3 math instructions, which is not a good ratio. The SSE math instructions also have a data dependency on the shuffles, so they cannot start until the shuffles finish, which will stall the CPU pipeline.
Also, your inputs and outputs are structs. Even though you pass them by reference, both the shuffles and the return values will go through memory (probably; you need to look at the generated assembly to be sure).
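To illustrate the difference, here is a rough sketch of the same operation written both ways. Vec4, FORCE_INLINE and both function names are hypothetical, not taken from your code. Whether the struct version really round-trips through memory depends on the compiler and on inlining, so check the assembly.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

#if defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#else
#define FORCE_INLINE inline __attribute__((always_inline))
#endif

// Hypothetical struct-based vector, stored in ordinary memory.
struct Vec4 {
    float v[4];
};

// Struct version: explicit loads on the way in, explicit store on the way out.
FORCE_INLINE void add_scaled_struct(const Vec4& a, const Vec4& b,
                                    float s, Vec4& out) {
    const __m128 va = _mm_loadu_ps(a.v);
    const __m128 vb = _mm_loadu_ps(b.v);
    _mm_storeu_ps(out.v, _mm_add_ps(va, _mm_mul_ps(vb, _mm_set1_ps(s))));
}

// Register version: takes and returns __m128 by value, so a chain of calls
// can stay in XMM registers once the function is inlined.
FORCE_INLINE __m128 add_scaled_reg(__m128 a, __m128 b, __m128 s) {
    return _mm_add_ps(a, _mm_mul_ps(b, s));
}
```

With the register version you load once at the start of a long computation, chain the calls, and store once at the end.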
To maximize SSE performance you need to design your whole code around it. Have long sequences of SSE math, and reduce the number of memory accesses and shuffles.
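One common way to get there is a structure-of-arrays layout, where each register holds the same component of four different vectors. The sketch below assumes the operation is a 3D cross product (a guess for the sake of example; I have not seen your exact code), and Vec3x4 and cross4 are made-up names. It needs no shuffles at all: six multiplies and three subtracts produce four results.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical SoA block: four 3D vectors, one component per register.
// x holds x0..x3, y holds y0..y3, z holds z0..z3.
struct Vec3x4 {
    __m128 x;
    __m128 y;
    __m128 z;
};

// Computes four cross products a[i] x b[i] at once.
// Assumes the call is inlined so the components stay in registers.
static inline Vec3x4 cross4(const Vec3x4& a, const Vec3x4& b) {
    Vec3x4 r;
    r.x = _mm_sub_ps(_mm_mul_ps(a.y, b.z), _mm_mul_ps(a.z, b.y));
    r.y = _mm_sub_ps(_mm_mul_ps(a.z, b.x), _mm_mul_ps(a.x, b.z));
    r.z = _mm_sub_ps(_mm_mul_ps(a.x, b.y), _mm_mul_ps(a.y, b.x));
    return r;
}
```

The catch is that your data has to be stored this way to begin with. Converting back and forth on every call brings the shuffles right back.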
BTW, you can tell CL to use scalar SSE for floating-point instructions by using /arch:SSE. Check this.