I've implemented both approaches. I was surprised to find that the "intersection method" is far easier to get working robustly, because it is entirely based on handling overlap. In the other approach, overlap is a failure case you must prevent, which becomes tricky when handling contact, i.e., zero-distance intersection.
Preventing "pass-through" (tunneling) with the intersection method is usually handled by updating an object in substeps of the total time step, with the substep count chosen from its velocity and bounding volume size. This impacts performance only for small, fast objects (in which case you can probably get away with raycasting instead).
Of course implementing your own collision/dynamics rarely makes sense these days. I'm assuming you have a good reason for not using Bullet or some other SDK.
In XAudio2 I am calling SubmitSourceBuffer() and Start() on multiple IXAudio2SourceVoice* instances at the same time. I am hearing some crackling noise, and I think it is because the mastering voice (IXAudio2MasteringVoice) mixes those voices by just adding them together, without normalizing the final output.
Is there any function in the API that configures how voices are mixed?
Yes, they are just summed. If you play two or more sounds with a full range of amplitudes at full volume, you will hit exactly the problem you are encountering: it's essentially 16-bit sample values overflowing/underflowing, and the wrap-around is audible as crackling.