The _mm_* intrinsics and the __m128 type are there as the minimal tools to expose the SSE instruction set to C / C++. They aren't meant to be a vector-maths library. They're tools for building a vector-maths library, or anything else that you want to run on top of SSE.
Traditionally on x86, you would have a vec4 type, which is 4 floats in an __m128, and also a float_in_vec type, which is 1 float in an __m128. You'd do this because mixing FPU and SSE code was extremely slow, so you'd want to avoid normal floats and use float_in_vecs instead.
When adding two float_in_vecs together, you're adding two __m128 variables, but the correct intrinsic is _mm_add_ss. When adding two vec4s together, you're adding two __m128 variables, but the correct intrinsic is _mm_add_ps. When adding a vec4 and a float_in_vec together, you're adding two __m128 variables, but the correct sequence is _mm_shuffle_ps (to broadcast the single float across all four lanes) followed by _mm_add_ps.
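As a rough sketch of those three cases in C++: the vec4 and float_in_vec names come from the description above, but the struct layouts and operator overloads here are my own assumptions for illustration, not any particular library's API. It assumes the float_in_vec keeps its value in lane 0.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_add_ss, _mm_add_ps, _mm_shuffle_ps

// Hypothetical wrapper types. Both hold the same raw __m128,
// but they give it different meanings.
struct vec4         { __m128 v; };  // all four lanes are meaningful
struct float_in_vec { __m128 v; };  // only lane 0 is meaningful (assumption)

// float_in_vec + float_in_vec: add lane 0 only.
inline float_in_vec operator+(float_in_vec a, float_in_vec b) {
    return { _mm_add_ss(a.v, b.v) };
}

// vec4 + vec4: add all four lanes.
inline vec4 operator+(vec4 a, vec4 b) {
    return { _mm_add_ps(a.v, b.v) };
}

// vec4 + float_in_vec: broadcast lane 0 of the scalar to every lane,
// then add all four lanes.
inline vec4 operator+(vec4 a, float_in_vec s) {
    __m128 splat = _mm_shuffle_ps(s.v, s.v, _MM_SHUFFLE(0, 0, 0, 0));
    return { _mm_add_ps(a.v, splat) };
}
```

With something like this, a + b picks the right instruction from the wrapper types, which is information the bare __m128 doesn't carry.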
Therefore, declaring that _mm_add_ps is the one true way to add two __m128 variables is wrong, because __m128 is not a float4 type. It's the building block of a float4 and of other types.