I'm just wondering what challenges I'll face. It seems easy enough to me, but maybe I'm forgetting something. Is it just a case of encoding the captured data, sending that over the network as fast as possible, and having clients decode it and play it?
It's not quite that simple.
You'll want to send your data over UDP (not TCP). TCP is fine for sending files, etc., but is not well suited to real-time communications.
With UDP, you need to deal with a couple of main issues:
1) When you send data over the internet via UDP, you're not guaranteed that the receiving client will get it in the same order that you sent it. I.e., if you send 5 little packets of audio data, A, B, C, D, E, they might be received in the order B, C, E, A, D. For that reason, internet voice systems put little sequence numbers on their data packets. The receiving client saves up a few packets before playing them, which gives it a chance to put them into the right order. FYI, that is usually called a "jitter buffer."
2) You have to account for a packet never arriving at all. In the example above, packet "C" may never arrive, ever. So your software has to be clever about what to do in that situation. Typically the client will wait a certain amount of time for that 'lost' packet and then decide it will never arrive. Then it will usually just play one of the packets twice (e.g. A B B D E) to make up for the lost packet.
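To make the sequence numbers from point 1 concrete, here is a rough sketch of how a sender might tag each UDP datagram. The 2-byte header layout and the names `pack_packet`, `unpack_packet` and `send_frames` are made up for illustration; real systems usually use RTP, which also carries a 16-bit sequence number along with other fields.

```python
import socket
import struct

# Hypothetical header: one unsigned 16-bit sequence number, network byte order.
HEADER = struct.Struct("!H")


def pack_packet(seq, audio):
    """Prefix an audio frame with its sequence number (wraps at 65536)."""
    return HEADER.pack(seq & 0xFFFF) + audio


def unpack_packet(datagram):
    """Split a received datagram into (sequence number, audio frame)."""
    (seq,) = HEADER.unpack_from(datagram)
    return seq, datagram[HEADER.size:]


def send_frames(frames, addr):
    """Send each frame as its own UDP datagram; addr is a (host, port) tuple."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for seq, frame in enumerate(frames):
            sock.sendto(pack_packet(seq, frame), addr)
    finally:
        sock.close()
```

The receiver would call `unpack_packet` on whatever `recvfrom` hands it and use the sequence number to reorder packets and spot gaps.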
You face these challenges even if you don't compress the audio data at all. Dealing with these 2 issues well is one of the main ways that good VoIP systems sound better than bad ones.
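The receiving side described in points 1 and 2 could be sketched roughly like this. It is a simplified illustration, not how any particular VoIP stack does it: the class name `JitterBuffer` and the `depth` parameter are assumptions, sequence-number wraparound is ignored, and "waiting a certain amount of time" is modelled by the playout clock calling `pop()` once per frame interval. If a packet is not there when its turn comes, it is treated as lost and the previous packet is played again.

```python
class JitterBuffer:
    def __init__(self, depth=3):
        self.depth = depth        # packets to collect before playout starts
        self.pending = {}         # sequence number -> audio payload
        self.next_seq = None      # next sequence number due for playout
        self.last_payload = None  # most recently played payload
        self.started = False

    def push(self, seq, payload):
        """Store a packet from the network; drop it if its slot has passed."""
        if self.started and seq < self.next_seq:
            return
        self.pending[seq] = payload

    def pop(self):
        """Return the next payload to play, or None while still filling up."""
        if not self.started:
            if len(self.pending) < self.depth:
                return None
            self.started = True
            self.next_seq = min(self.pending)
        payload = self.pending.pop(self.next_seq, None)
        if payload is None:
            # Packet is lost (or late): repeat the previous one.
            payload = self.last_payload
        self.last_payload = payload
        self.next_seq += 1
        return payload


# Packets A..E have sequence numbers 0..4; "C" (2) never arrives.
jb = JitterBuffer(depth=3)
for seq, payload in [(1, "B"), (4, "E"), (0, "A"), (3, "D")]:
    jb.push(seq, payload)
print([jb.pop() for _ in range(5)])  # ['A', 'B', 'B', 'D', 'E']
```

A bigger `depth` gives late packets more time to show up before their slot is played, at the cost of extra delay.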
You'll have to manage the "robustness vs. latency" issue. That is, the bigger your buffers, the less likely your sound will break up or crackle, but the longer your latency (the delay between person A speaking and person B hearing it) will be. After accounting for network latency, you really want to shoot for end-to-end latency of less than 200ms (including the latency introduced by the jitter buffers). Much more than that, and conversation gets difficult/annoying.
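As a back-of-the-envelope check on that trade-off, here is a small sketch that adds up a latency budget. All the numbers are assumptions for illustration: 20ms audio frames, a 60ms one-way network delay, and a simplified model where delay is one frame of capture plus network time plus the jitter buffer depth. The function name `end_to_end_ms` is made up.

```python
TARGET_MS = 200  # rough upper limit for comfortable conversation


def end_to_end_ms(network_ms, depth_packets, frame_ms=20):
    """Estimate mouth-to-ear delay under the simplified model above."""
    capture_ms = frame_ms                       # a full frame must be recorded first
    jitter_buffer_ms = depth_packets * frame_ms
    return capture_ms + network_ms + jitter_buffer_ms


for depth in (3, 8):
    total = end_to_end_ms(network_ms=60, depth_packets=depth)
    verdict = "ok" if total < TARGET_MS else "too slow"
    print(f"depth={depth}: {total}ms ({verdict})")
# depth=3: 140ms (ok)
# depth=8: 240ms (too slow)
```

A real budget would also include codec and sound-card buffering delay, which this sketch leaves out.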
Opus is a good choice. It has "wideband" (high-fidelity) modes for speech.