I'm no expert, but considering the speed of sound (ca. 300 m/s) and the size of a head (ca. 0.3 m), the difference between "sound comes from far left" to "sound comes from far right", which is pretty much the most extreme possible, is 0.5 ms. The ear is able to pick that up without any trouble (and obviously, it's able to pick up much smaller differences, too -- we are able to hear a lot more detailled than just "left" and "right").
In that light, 10ms seems like... huge. I'm not convinced something that coarse can fly.
Of course we're talking about overall latency (on all channels) but the brain has to somehow integrate that with the visuals, too. And seeing how it's apparently doing that quite delicately at ultra-high resolution, I think it may not work out.