X3DAudio needs to be initialized so that its outputs will match the speaker configuration you are using, and so that its distance units match the scale of your application's world. And yes, in general you do need to release COM interfaces for objects.
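A minimal sketch of that initialization (Windows-only, XAudio2 2.8+, not compiled here; `pMasteringVoice` is assumed to have been created from your IXAudio2 instance already):

```cpp
#include <xaudio2.h>
#include <x3daudio.h>

void Init3DAudio(IXAudio2MasteringVoice* pMasteringVoice, X3DAUDIO_HANDLE x3dInstance)
{
    // Ask the mastering voice for the actual speaker configuration,
    // rather than assuming stereo.
    DWORD channelMask = 0;
    pMasteringVoice->GetChannelMask(&channelMask);

    // X3DAUDIO_SPEED_OF_SOUND is in meters per second; scale it to your
    // world units (e.g. multiply by 100 if your world is in centimeters).
    X3DAudioInitialize(channelMask, X3DAUDIO_SPEED_OF_SOUND, x3dInstance);
}
```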
For true-to-life stereo you would need two listeners, set as far apart as the character's ears are. This is the technique that was used for the voices in Pixar's Monsters, Inc. (http://video.sina.com.cn/v/b/44064572-1604540395.html)
But that is overkill, IMHO. What is the delay heard between one ear and the other? Sound travels at 340.29 m/s at sea level, and your head is roughly 0.2 meters across, so we are looking at a delay of about 0.6 ms between ears. I read somewhere long ago that humans cannot, in general, discern delays shorter than about 9 ms. Much better cues for positional audio are the Doppler effect and the filtering effect caused by the shape of the ears. All of these, however, can be calculated for you by setting flags in your call to X3DAudioCalculate. The delay calculation only works with stereo speaker setups, however, as we humans are binaural beasts after all.
Not having my codebase with me at the moment, I would assume that your level matrix array can be freed right away. According to M$:
IXAudio2Voice::GetOutputMatrix always returns the levels most recently set by IXAudio2Voice::SetOutputMatrix. However, they may not actually be in effect yet: they only take effect the next time the audio engine runs after the IXAudio2Voice::SetOutputMatrix call (or after the corresponding IXAudio2::CommitChanges call, if IXAudio2Voice::SetOutputMatrix was called with a deferred operation ID).
So once SetOutputMatrix has been called, the voice has saved a copy of the levels internally, even if they have not been applied by the audio engine yet.
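In other words, something like this should be safe (Windows-only sketch, not compiled here; the hard-left pan and the 1-in/2-out channel counts are just illustrative):

```cpp
#include <xaudio2.h>

void PanHardLeft(IXAudio2SourceVoice* pSourceVoice)
{
    // One source channel feeding a stereo destination: full left, silent right.
    float levels[2] = { 1.0f, 0.0f };

    // nullptr targets the voice's first (default) output.
    pSourceVoice->SetOutputMatrix(nullptr, 1, 2, levels);

    // 'levels' can go out of scope now: the voice keeps its own copy,
    // even though the new matrix only takes effect on the next engine pass.
}
```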