Sound (Voice) over Network

14 comments, last by hplus0603 11 years, 3 months ago
This is mostly to do with sound programming, but I just want some reassurance before going deep into this.

I want to add in-game voice chat to my game. My library of choice for sound capture and playback is OpenAL. For encoding/decoding I'll be using Opus.

I'm just wondering what challenges I'll face. It seems easy enough to me, but maybe I'm forgetting something. Is it just a case of encoding the captured data, sending that over the network as fast as possible, and having clients decode and play it?

As internet speeds get faster, won't there come a point where encoding sound data is unneeded and the raw sound data could be sent across without a problem?

Has anyone done this before? Could you give a quick rundown on what is required to get this done?

I'm also wondering if I'll need to thread any part of this. I don't like using threads, so if it can be avoided for now, that's good.
As internet speeds get faster, won't there come a point where encoding sound data is unneeded and the raw sound data could be sent across without a problem?
As far as this goes, I seriously doubt it will ever be an option. Bandwidth translates directly into money. While most home users will usually have some kind of rate-limited flat rate (like 16 Mbit/s DSL or 50 Mbit/s optical fiber), servers are practically always billed for traffic (the same is true for many wireless/phone plans).

You usually have "some amount" of prepaid traffic included, and as you exceed this quota it either becomes very expensive all of a sudden, or you are throttled or cut off. No such thing as "unlimited" exists, although it is often advertised. If you take "unlimited" literally, what usually happens is that you're cut off the net without warning under some bogus excuse ("it looks like your server is under a DoS attack"), or even without an excuse, and your contract is terminated under some pretext.

It's not surprising either -- hosting companies have to live, too. The "unlimited" bargain is based on the assumption that it sounds attractive to new customers, and that nobody uses more than a few Mbit/s at most anyway.
No such thing as "unlimited" is technically possible anyway, if you look at what typical datacenters are like.

You have somewhere from 10,000 to 50,000 servers with 1 Gbit/s network cards going into switches that rate-limit them to 100 Mbit/s (unless you pay extra), housed in one or two large buildings, with an uplink (usually split over half a dozen carriers) whose total bandwidth ranges anywhere from 50 to 200 Gbit/s. Let's assume 10,000 servers and a 100 Gbit/s uplink: that's 10 Mbit/s per server. Consequently, there can be no such thing as "unlimited", because if only 10% of the customers took that offer seriously and literally, there would not be enough bandwidth left for anyone else.

Uncompressed audio consumes 10-20 times as much bandwidth as compressed audio (or more, depending on quality settings), so one could say (in a very simplified way) that it costs 10-20 times as much money. Or, seen from the other side, you can serve 20 times as many customers (= 20x revenue) at the same base cost.

As for what challenges you'll face with Opus, I can't tell (it's the first time I've heard of it; sounds promising). OpenAL as such is pretty straightforward to work with, both for input and output. So as long as Opus "kind of works" (in a manner similar to, say, Speex), I'd be very optimistic.
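For reference, the capture side of OpenAL really is only a handful of calls. A minimal sketch (error checking omitted; the 16 kHz mono format and 20 ms chunk size are just example choices):

#include <AL/al.h>
#include <AL/alc.h>

/* Minimal capture sketch: open the default microphone at 16 kHz mono,
   then poll for 20 ms chunks to hand to the encoder. */
void capture_voice(volatile int *running)
{
    /* The last argument sizes OpenAL's internal ring buffer, in samples. */
    ALCdevice *mic = alcCaptureOpenDevice(NULL, 16000, AL_FORMAT_MONO16, 16000);
    alcCaptureStart(mic);

    while (*running) {   /* in a real game, poll this from your main loop */
        ALCint available = 0;
        alcGetIntegerv(mic, ALC_CAPTURE_SAMPLES, 1, &available);
        if (available >= 320) {        /* 320 samples == 20 ms at 16 kHz */
            ALshort pcm[320];
            alcCaptureSamples(mic, pcm, 320);
            /* ...encode and send pcm here... */
        }
    }

    alcCaptureStop(mic);
    alcCaptureCloseDevice(mic);
}

Note that polling like this from the game loop also answers the threading question: nothing here forces you onto a separate thread.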
I'm just wondering what challenges I'll face. It seems easy enough to me but maybe I'm forgetting something. Is it just a case of encoding the captured data, sending that over network as fast as possible and having clients decode it and play?

It's not quite that simple :)

You need to deal with a couple of main issues:

You'll want to send your data over UDP (not TCP). TCP is fine for sending files and the like, but it is not well suited to real-time communication.

1) When you send data over the internet via UDP, you're not guaranteed that the receiving client will get it in the same order you sent it. I.e., if you send 5 little packets of audio data, A, B, C, D, E, they might be received in the order B, C, E, A, D. For that reason, internet voice systems put little sequence numbers on their data packets. When receiving data, the client saves up a few packets before playing them; this gives it a chance to put them into the right order. FYI, that is usually called a "jitter buffer."

2) You have to account for a packet never arriving at all. In the example above, packet "C" may never arrive--ever. So your software has to be clever about what to do in that situation. Typically the client will wait a certain amount of time for that 'lost' packet and then decide it will never arrive. Then it will usually just play one of the packets twice (e.g. A B B D E) to make up for the lost one.

You face these challenges even if you don't compress the audio data at all. Dealing with these two issues is among the main things separating bad VOIP systems from good ones. A bare-bones sketch of that bookkeeping follows below.
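To make it concrete, here's the smallest version of that bookkeeping I can think of (a sketch only, assuming fixed-size packets, an 8-bit wrapping sequence number, and the naive "replay the last packet" concealment described above):

#include <stdint.h>
#include <string.h>

#define SLOTS        8      /* how many packets we hold for reordering */
#define PACKET_BYTES 640    /* e.g. 20 ms of 16 kHz, 16-bit mono */

typedef struct {
    uint8_t data[PACKET_BYTES];
    uint8_t seq;
    int     filled;
} Slot;

static Slot    ring[SLOTS];
static uint8_t next_seq;    /* sequence number we expect to play next */

/* Store an incoming packet by sequence number, even if it's out of order. */
void jitter_put(uint8_t seq, const uint8_t *payload)
{
    Slot *s = &ring[seq % SLOTS];
    memcpy(s->data, payload, PACKET_BYTES);
    s->seq    = seq;
    s->filled = 1;
}

/* Pull the next packet for playback. If it never arrived, replay the
   previous one: the A B B D E trick from above. */
void jitter_get(uint8_t *out)
{
    static uint8_t last[PACKET_BYTES];
    Slot *s = &ring[next_seq % SLOTS];
    if (s->filled && s->seq == next_seq) {
        memcpy(out,  s->data, PACKET_BYTES);
        memcpy(last, s->data, PACKET_BYTES);
        s->filled = 0;
    } else {
        memcpy(out, last, PACKET_BYTES);    /* packet lost: repeat */
    }
    next_seq++;
}

The caller decides how many packets to accumulate before the first jitter_get(), which is exactly the buffering tradeoff discussed next.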

You'll also have to manage the "robustness vs. latency" tradeoff. That is, the bigger your buffers, the less likely your sound will break up or crackle, but the longer your latency (the delay between the moment person A speaks and the moment person B hears it) will be. All told, you really want to shoot for an end-to-end latency of under 200 ms, network latency and jitter buffers included. Much more than that, and conversation gets difficult and annoying.

Opus is a good choice. It has "wide band" (high-fidelity) modes for speech.

Brian Schmidt

Executive Director, GameSoundCon:

GameSoundCon 2016: September 27-28, Los Angeles, CA

Founder, Brian Schmidt Studios, LLC

Music Composition & Sound Design

Audio Technology Consultant

Thanks for all the info. I've been messing around with the Opus encode and decode examples, and I was wondering something: what options do I set to lower the quality, resulting in less data in the packets?

Do players send their own voice data at the highest quality, with the server then sending a lesser-quality version based on each client's bandwidth setting?

I guess what I'm trying to ask is: what factors go into making a lower-quality sound? From what I can see, only the -loss and -bandwidth options would result in a lower-quality sound. Packet loss is obvious, but what decides what the bandwidth option should be when encoding? Would the server even bother to do any encoding/decoding of sound, or would it just pass it on to the other clients without touching it?

I think I'm just overthinking this. Maybe just using wideband all the time for voice communication would be good enough. I shouldn't really have to adjust anything for players who are lagging.
I think I'm just overthinking this.

I'd say you are correct :)...

While it's quite possible to create a sophisticated system that looks at network QoS (Quality of Service) and does run-time analysis with server-side encoding to give each player the best quality their connection allows, in practice it's way easier to just decide what quality you want to deliver and always use that setting.

So encode on the sender's system at whatever quality you decide you want, send it up to the server, and let the server broadcast it out to the people at the other end of the 'phone' (or just go peer-to-peer and bypass the server if you don't need large numbers of people to hear).

I'm not familiar with the details/options of Opus. But in general, there are two things that decide the audio fidelity: the sampling rate of the sound itself and the compression ratio. The sampling rate should be at least 16 kHz for "high quality" speech (a regular telephone is 8 kHz), and I'd say that anything over 24 kHz for in-game chat is overkill.

Note that the sampling rate will generally be about twice the 'bandwidth', so the -bandwidth option might be controlling the sampling rate. As for -loss, that probably directly or indirectly controls the compression ratio. The less compression, the more natural the voice will sound; at extreme compression ratios, you often get that warbly/watery/robotic sound to speech.
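Skimming the libopus docs just now (so double-check me on the details), those knobs appear to map onto encoder settings roughly like this; all the numbers are just example choices:

#include <opus/opus.h>

/* Sketch: configure an encoder for small voice packets.
   16 kHz mono, VOIP tuning -- "wideband" speech as discussed above. */
OpusEncoder *make_voice_encoder(void)
{
    int err = 0;
    OpusEncoder *enc = opus_encoder_create(16000, 1, OPUS_APPLICATION_VOIP, &err);

    /* The main quality-vs-size knob: target bits per second. */
    opus_encoder_ctl(enc, OPUS_SET_BITRATE(16000));

    /* Cap the audio bandwidth (what -bandwidth appears to drive). */
    opus_encoder_ctl(enc, OPUS_SET_MAX_BANDWIDTH(OPUS_BANDWIDTH_WIDEBAND));

    /* Tell the codec how lossy the network is (what -loss appears to drive);
       it then spends part of the bitrate on redundancy instead of fidelity. */
    opus_encoder_ctl(enc, OPUS_SET_PACKET_LOSS_PERC(10));
    opus_encoder_ctl(enc, OPUS_SET_INBAND_FEC(1));

    return enc;
}

Each opus_encode() call then takes one 20 ms chunk (320 samples at 16 kHz) and returns only as many bytes as that packet actually needed.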

Hope that helps!

Brian

Brian Schmidt

Executive Director, GameSoundCon:

GameSoundCon 2016: September 27-28, Los Angeles, CA

Founder, Brian Schmidt Studios, LLC

Music Composition & Sound Design

Audio Technology Consultant

Thanks. I think I know enough to go ahead and start on this. Should have good in-game voice chat by the end of it!

A high-quality encoder will pretty much always improve the experience compared to "raw" data. The reason is that bandwidth is *never* unlimited -- there are always going to be more conversations, more movie streams, more large file downloads, and more fridges calling the grocery to tell them you're out of milk.

The two things you need to solve are:

1) Topology -- peer to peer, or server-bounced?

2) Network and sound card differences.

To do peer-to-peer, you're going to need a NAT introducer on your server. There are links in the FAQ about this, but it's not entirely simple to set up. Going through a central server is much easier, and more robust for end users, but it may add some latency, and it will certainly add bandwidth usage on the server.
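The server-bounced topology is at least simple: stripped of session handling and security, the server is just a forwarding loop. A POSIX-flavored sketch (the peer list is assumed to be maintained elsewhere):

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Forward every incoming datagram to all known peers except its sender.
   Sketch only -- a real server needs session tracking, auth, and rate limits. */
void relay_loop(int sock, struct sockaddr_in *peers, int npeers)
{
    char buf[1500];                  /* one datagram, <= typical MTU */
    for (;;) {
        struct sockaddr_in from;
        socklen_t fromlen = sizeof from;
        ssize_t n = recvfrom(sock, buf, sizeof buf, 0,
                             (struct sockaddr *)&from, &fromlen);
        if (n <= 0)
            continue;
        for (int i = 0; i < npeers; i++) {
            if (peers[i].sin_addr.s_addr == from.sin_addr.s_addr &&
                peers[i].sin_port == from.sin_port)
                continue;            /* don't echo back to the sender */
            sendto(sock, buf, (size_t)n, 0,
                   (struct sockaddr *)&peers[i], sizeof peers[i]);
        }
    }
}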

Secondly, different networks have different amounts of jitter, and different sound cards play out at slightly different sampling frequencies! Even if your sound card says it's doing a 48 kHz sample rate, that may actually be 47,950 Hz, or 48,050 Hz, or, due to the mad drive to the lowest price, even as bad as 47,500 or 48,500 Hz. The reason this matters is that the sender may be producing samples at 48,500 samples per second (or some fraction thereof) while the receiver plays out at 47,500 samples per second, so for each second continually played, the receiver falls 1,000 samples behind. After 48 seconds of reception, the receiver will be a whole second behind in playback!

The fix for both of these is to keep a window of allowed play-out latency. When a packet with data comes in, put it in the queue, but don't necessarily play it out immediately. Decide on some lowest amount of data you need to be able to play out (say, 80 ms) and some highest amount of data you will accept (say, 200 ms), and then start playing when you get to the lowest amount. If each packet is 30 ms of data, you will start playing out when you have three packets. Then, as you play out, if you ever find that there's not enough data, you stop playing out until you have enough data (80 ms) again. Similarly, if you get additional data so that you have more than 200 ms of data, drop all the older, unplayed data until you're down to about (200+80)/2 == 140 milliseconds of data.
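That window logic is just a pair of counters. A sketch using the numbers above, tracking milliseconds of buffered-but-unplayed audio (the thresholds are the example values, not magic constants):

/* Play-out window: start once 80 ms is buffered, clamp back to ~140 ms
   whenever we drift past 200 ms. */
#define LOW_WATER_MS   80
#define HIGH_WATER_MS 200

static int buffered_ms = 0;   /* audio queued but not yet played */
static int playing     = 0;

void on_packet_received(int packet_ms)      /* e.g. 30 ms per packet */
{
    buffered_ms += packet_ms;
    if (!playing && buffered_ms >= LOW_WATER_MS)
        playing = 1;                        /* enough cushion: start */
    if (buffered_ms > HIGH_WATER_MS) {
        /* Too far behind real time: discard the oldest queued audio
           (not shown) until we're back around the midpoint. */
        buffered_ms = (HIGH_WATER_MS + LOW_WATER_MS) / 2;   /* == 140 */
    }
}

void on_audio_played(int chunk_ms)          /* called as audio is consumed */
{
    if (!playing)
        return;
    buffered_ms -= chunk_ms;
    if (buffered_ms <= 0) {                 /* starved: pause until 80 ms again */
        buffered_ms = 0;
        playing = 0;
    }
}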

And, in a good sound implementation, you don't actually start or stop the playback stream; you keep the playback stream going, but "stopping playing" means you generate zeros for the play-out. That part is more related to sound management practices than networking, though :-)
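With OpenAL streaming buffers, "keep the stream going" means recycling processed buffers and refilling them, with zeros when the jitter buffer is starved. A sketch (jitter_fetch() is a stand-in for your own buffer code, not a real API):

#include <string.h>
#include <AL/al.h>

/* Hypothetical hook into your own jitter buffer: fills pcm and returns 1
   if voice data was available, 0 if not. */
extern int jitter_fetch(short *pcm, int samples);

/* Call every frame: recycle played-out buffers, refilling each one with
   voice data or, when starved, with silence so the stream never stops. */
void pump_playback(ALuint source)
{
    ALint done = 0;
    alGetSourcei(source, AL_BUFFERS_PROCESSED, &done);
    while (done-- > 0) {
        short pcm[320];                     /* 20 ms at 16 kHz */
        ALuint buf;
        alSourceUnqueueBuffers(source, 1, &buf);
        if (!jitter_fetch(pcm, 320))
            memset(pcm, 0, sizeof pcm);     /* "stopped" == play silence */
        alBufferData(buf, AL_FORMAT_MONO16, pcm, sizeof pcm, 16000);
        alSourceQueueBuffers(source, 1, &buf);
    }

    /* If the source ever ran dry and stopped on its own, restart it. */
    ALint state = 0;
    alGetSourcei(source, AL_SOURCE_STATE, &state);
    if (state != AL_PLAYING)
        alSourcePlay(source);
}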
enum Bool { True, False, FileNotFound };
Rather than writing your own, I strongly suggest just using an off-the-shelf VoIP library. The better ones will integrate nicely with games, operate with very low bandwidth requirements, automatically run on their own port, play nicely with UPnP, and otherwise just do the right thing.

VoipDevKit is one such solution. Mumble is pretty popular, as is Ventrilo.

If you really want to have your own codec that works with OpenAL, Speex is one library that has been integrated many times. There is lots of documentation online about how to do it.

I think this is kind of something I want to do myself, or just not have it at all. I've written the game from scratch so far; I might as well have just used Unity or something otherwise.

Despite all the problems, I think this is something I can pull off. I may use Speex over Opus if there are better examples and docs. I think Opus is made by the same people, though.

I would just send UDP packets of 1024 samples, probably at 11 kHz, roughly 10 packets per second (and thus roughly 10,240 samples per second), with an incrementing ID prefixed; a single byte should work (along with whatever other info you'd like to send, maybe to identify the sender, etc.), which would be used to deal with out-of-order packets. I'd buffer a few of the sample packets before actually sending them to the audio hardware; two or three should suffice. You will experience momentary gaps in the audio when UDP packets are dropped, but this drastically simplifies things as far as the networking code is concerned.
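The packet layout would be something like this (a sketch; the sender byte is just an example of the extra identifying info mentioned above):

#include <stdint.h>

/* One voice datagram: a one-byte wrapping sequence number plus payload. */
#pragma pack(push, 1)
typedef struct {
    uint8_t seq;               /* increments every packet, wraps at 255 */
    uint8_t sender;            /* which player is talking */
    uint8_t samples[1024];     /* 1024 8-bit samples at ~11 kHz */
} VoicePacket;
#pragma pack(pop)

/* With a wrapping 8-bit counter, "newer than" has to be a modular compare. */
static int seq_newer(uint8_t a, uint8_t b)
{
    return (uint8_t)(a - b) < 128;    /* true if a is ahead of b, mod 256 */
}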

Compression entails the use of a Fourier transform, which could allow you to cull a bunch of unused frequencies and effectively compress the audio signal down to the relevant frequencies, reconstructing the audio from a sort of list of frequency coefficients. Fixed-tree Huffman compression on outgoing packets would be faster, simpler, and overall better suited, IMO (read up on Quake 3's networking code). You could probably also get away with even less fidelity: 5.5 kHz, either 5 packets a second (at 1024 samples) or 10 packets at 512 bytes (assuming 8-bit samples).

This topic is closed to new replies.
