Unless you have a lot of experience and invest a lot of time, GLM is as good as or better, performance-wise, than anything you can write yourself. It's likely better than anything I could write in finite time, anyway.
That, and it just works, and it "looks" the same as GLSL, which is a big plus.
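To illustrate the "looks like GLSL" point, the same expression you'd write in a shader compiles essentially unchanged on the CPU side (a minimal sketch; the function name is mine, not GLM's):

    #include <glm/glm.hpp>

    // Identical to the GLSL expression, modulo the glm:: prefix.
    glm::vec3 faceNormal(const glm::vec3& a, const glm::vec3& b, const glm::vec3& c) {
        return glm::normalize(glm::cross(b - a, c - a));
    }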
As for SSE implementations, GLM has them, at least in some places, but they don't make much of a difference anyway. SSE is good for chewing through long streams of SoA (structure-of-arrays) data, but it is pretty useless for doing a few dozen dot products, for multiplying 3-4 matrices, or for the data you normally have (because AoS, array-of-structures, is the natural layout, not SoA). You have to work really hard to make SSE truly useful outside of a contrived, artificial example.
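To make the AoS/SoA point concrete, here's the kind of code where SSE actually pays off (a sketch under my own assumptions: hypothetical function, components stored in separate x/y/z arrays, n a multiple of 4):

    #include <immintrin.h>

    // SoA layout: one SSE multiply handles the same component of
    // four different vectors, so every lane does useful work and
    // no horizontal (cross-lane) operations are needed.
    void dot3_soa(const float* ax, const float* ay, const float* az,
                  const float* bx, const float* by, const float* bz,
                  float* out, int n)  // n assumed to be a multiple of 4
    {
        for (int i = 0; i < n; i += 4) {
            __m128 x = _mm_mul_ps(_mm_loadu_ps(ax + i), _mm_loadu_ps(bx + i));
            __m128 y = _mm_mul_ps(_mm_loadu_ps(ay + i), _mm_loadu_ps(by + i));
            __m128 z = _mm_mul_ps(_mm_loadu_ps(az + i), _mm_loadu_ps(bz + i));
            _mm_storeu_ps(out + i, _mm_add_ps(_mm_add_ps(x, y), z));
        }
    }

With the usual AoS layout (an array of vec3 structs) you'd spend most of your instructions shuffling components into place instead of multiplying, which is exactly why the speedup evaporates.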
Special applications like audio/video codecs are of course an exception, but this is not surprising, it's what SSE was made for after all.
If your SSE-optimized dot product saves 1-2 cycles compared to a plain C implementation (compiled with optimizations turned on), you can consider yourself a happy man. For perspective, a single mispredicted branch costs 7-8 times as much.
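Here's why the saving is so small, side by side (a sketch; the SSE version needs SSE4.1, e.g. compile with -msse4.1):

    #include <immintrin.h>

    // Plain C: four multiplies and three adds; an optimizing
    // compiler schedules these perfectly well on its own.
    float dot_scalar(const float a[4], const float b[4]) {
        return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
    }

    // SSE4.1: the vertical multiply is cheap, but the horizontal
    // reduction inside DPPS is slow, so the net win over the
    // scalar version is a cycle or two at best.
    float dot_sse(const float a[4], const float b[4]) {
        __m128 va = _mm_loadu_ps(a);
        __m128 vb = _mm_loadu_ps(b);
        return _mm_cvtss_f32(_mm_dp_ps(va, vb, 0xF1));
    }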
If your quaternion/vector multiply saves 5-6 clock cycles, you're lucky. So even if you calculate a thousand of them per frame (for skeletal animation, or whatever), that's 5,000 clocks saved. Big deal. If that's an issue, don't you ever dare make a call to a D3D function, or even touch the disk.
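To put 5,000 cycles in perspective (assuming a 3 GHz CPU, a figure picked purely for illustration): 5,000 / 3×10⁹ per second ≈ 1.7 microseconds, which is about 0.01% of a 16.7 ms frame at 60 fps.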