When a light wave (polarised or not) hits a surface, it's reflected and refracted. Specular simulates the reflected parts, and Diffuse simulates the refracted parts.
Light that's polarised in different ways is still partially reflected and partially refracted (so: requires both diffuse and specular formulas).
I wouldn't really call it "refracted"; it's more a notion of incoherent scattering. In other words, "specular" means "not scattered", whereas refraction *is* a form of specular (coherent) transmission. This distinction isn't particularly important in everyday computer graphics, but there are some physical effects that apply to specularly reflected light and not to scattered light (and vice versa), so I think the coherence between reflected rays is really the distinctive feature we are trying to quantify with specular/diffuse.
Anyway, on topic: yes, the point is that the total energy reflected off a surface patch is less than (or equal to) the total energy falling on it. We don't care about the "energies" of particular light rays, which could be incredibly large. If you shine a laser in your eye - don't do this, by the way - it's going to be super bright, with an intensity exceeding hundreds of watts per steradian, yet that laser has a finite amount of power (e.g. half a watt) being converted into light. Now widen the beam... and it doesn't look as bright anymore. So what we're really interested in is how much power is radiated from a given surface *in every direction*, not just a single one, which involves summing up (or integrating) over the sphere or hemisphere of directions.
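To make the "integrate over the hemisphere" part concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration rather than anything from the posts above: the diffuse-plus-normalized-Phong BRDF, the names `phong_brdf` and `hemispherical_reflectance`, and the midpoint-rule integration. It sums BRDF × cos θ over all outgoing directions for one incoming direction; if the result exceeds 1, the material reflects more power than it receives.

```python
import math

def reflect(wi):
    # Mirror wi about the surface normal (0, 0, 1); wi points away from the surface.
    return (-wi[0], -wi[1], wi[2])

def phong_brdf(wi, wo, kd, ks, n):
    # Illustrative model (an assumption): Lambertian term plus a normalized Phong lobe.
    r = reflect(wi)
    cos_lobe = max(0.0, r[0] * wo[0] + r[1] * wo[1] + r[2] * wo[2])
    return kd / math.pi + ks * (n + 2) / (2.0 * math.pi) * cos_lobe ** n

def hemispherical_reflectance(brdf, wi, n_theta=256, n_phi=256):
    # Midpoint-rule estimate of the integral of brdf * cos(theta) over the
    # outgoing hemisphere, i.e. the fraction of incoming power that is reflected.
    d_theta = (math.pi / 2.0) / n_theta
    d_phi = (2.0 * math.pi) / n_phi
    total = 0.0
    for i in range(n_theta):
        theta = (i + 0.5) * d_theta
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        for j in range(n_phi):
            phi = (j + 0.5) * d_phi
            wo = (sin_t * math.cos(phi), sin_t * math.sin(phi), cos_t)
            # sin_t * d_theta * d_phi is the solid angle of this cell.
            total += brdf(wi, wo) * cos_t * sin_t * d_theta * d_phi
    return total

wi = (0.0, 0.0, 1.0)  # light arriving along the normal

ok = hemispherical_reflectance(lambda a, b: phong_brdf(a, b, 0.6, 0.3, 32), wi)
bad = hemispherical_reflectance(lambda a, b: phong_brdf(a, b, 0.8, 0.5, 32), wi)
print("kd=0.6, ks=0.3:", ok, "conserving" if ok <= 1.0 else "NOT conserving")
print("kd=0.8, ks=0.5:", bad, "conserving" if bad <= 1.0 else "NOT conserving")
```

For this particular model at normal incidence the integral works out to roughly kd + ks, so the first material passes and the second does not. Note that the BRDF value itself can be far greater than 1 in the lobe direction (like the laser), and that's fine; only the integral has to stay at or below 1.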
I'm a bit curious about this. Since real-life materials are perfectly capable of absorbing light and then re-emitting it as energy other than visible light, what does it really mean to have a material that doesn't conserve energy? I guess materials that absorb light in especially strange, angle-sensitive ways are probably rare, but it seems plausible that some arrangement of microfacets could potentially be described by materials which are obviously "wrong."
That's because your typical BRDF doesn't handle those sorts of effects. Basically, most computer graphics renderers assume that the flow of light in a scene has entered a steady state, that is, it is constant and unchanging as long as the geometry remains the same, and they usually treat each wavelength (or colour channel) independently. That leaves no obvious way to simulate fluorescence and other wavelength-shifting or time-dependent effects (it can be done, though, especially when ray tracing). In this sense the notion of "energy conservation" dictated by those renderers can be somewhat limited physically.