
Sound (Voice) over Network

Old topic!

Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

16 replies to this topic

#1 laylay  Members

Posted 31 December 2012 - 09:59 AM

This is mostly to do with sound programming, but I just want some reassurance before going deep into this.

I want to add in-game voice chat to my game. My library of choice is OpenAL for sound capture and playback. For encoding/decoding I'll be using Opus.

I'm just wondering what challenges I'll face. It seems easy enough to me, but maybe I'm forgetting something. Is it just a case of encoding the captured data, sending it over the network as fast as possible, and having clients decode and play it?

As internet speeds get faster, won't there come a point where encoding sound data would be unneeded and the raw sound data could be sent across without a problem?

Has anyone done this before? Could you give a quick rundown on what is required to get this done?

I'm also wondering whether I'll need to thread any part of this. I don't like using threads, so if it can be avoided for now, that's good.

Edited by laylay, 31 December 2012 - 10:02 AM.

#2 samoth  Members

Posted 31 December 2012 - 10:47 AM

As internet speeds get faster, won't there come a point where encoding sound data would be unneeded and the raw sound data could be sent across without a problem?
As far as this goes, I seriously doubt it will ever be an option. Bandwidth translates directly into money. While most home users usually have some kind of rate-limited flat rate (like 16 Mbit/s DSL or 50 Mbit/s optical fiber), servers are practically always billed for traffic (the same is true for many wireless/phone plans).

You usually have "some amount" of prepaid traffic included, and as you exceed this quota it either becomes very expensive all of a sudden, or you are throttled or cut off. No such thing as "unlimited" exists, although it is often advertised. If you take "unlimited" literally, what usually happens is that you're cut off the net without warning under some bogus excuse ("it looks like your server is under a DoS attack"), or even without one, and your contract is terminated under some pretext.

It's not surprising either: hosting companies have to live, too. The "unlimited" bargain is based on the assumption that it sounds attractive to new customers and that nobody uses more than a few Mbit/s at most anyway.
No such thing as "unlimited" is possible from a technical point of view anyway, if you look at what "typical" datacenters look like.

You have somewhere from 10,000 to 50,000 servers with 1 Gbit/s network cards going into switches that rate-limit them to 100 Mbit/s (unless you pay extra) in one or two large buildings, and an uplink (usually split over half a dozen carriers) with a total bandwidth anywhere from 50 to 200 Gbit/s. Assume 10,000 servers and a 100 Gbit/s uplink: that's 10 Mbit/s per server. Consequently, there can be no such thing as "unlimited", because if only 10% of the customers took the offer seriously and literally, there would not be enough bandwidth left for anyone else.

Uncompressed audio consumes 10-20 times as much bandwidth as compressed audio (or more, depending on quality settings), so one could say (in a very simplified way) that it costs 10-20 times as much money. Or, from the opposite point of view, you can serve 20 times as many customers (= 20x revenue) at the same base cost.

As for what challenges you'll face with Opus, I can't tell (first time I've heard of it; sounds promising). OpenAL as such is pretty straightforward to work with, both for input and output. So, as long as Opus "kind of works" (in a manner similar to, say, Speex), I'd be very optimistic.

Edited by samoth, 31 December 2012 - 10:59 AM.

#3 bschmidt1962  Members

Posted 31 December 2012 - 11:29 AM

I'm just wondering what challenges I'll face. It seems easy enough to me, but maybe I'm forgetting something. Is it just a case of encoding the captured data, sending it over the network as fast as possible, and having clients decode and play it?

It's not quite that simple.

You need to deal with a couple main issues:

You'll want to send your data over UDP (not TCP). TCP is fine for sending files, etc., but is not well suited to real-time communications.

1) When you send data over the internet via UDP, you're not guaranteed that the receiving client will get it in the same order you sent it. I.e., if you send 5 little packets of audio data, A, B, C, D, E, they might be received in the order B, C, E, A, D. For that reason, internet voice systems put little sequence numbers on their data packets. When receiving data, the client saves up a few packets before playing them; this gives it a chance to put them into the right order. FYI, that is usually called a "jitter buffer."

2) You have to account for a packet never arriving at all. In the example above, packet "C" may never arrive, ever. So your software has to be clever about what to do in that situation. Typically the client will wait a certain amount of time for that 'lost' packet and then decide it will never arrive. Then it will usually just play one of the packets twice (e.g. A B B D E) to make up for the lost one.
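A minimal sketch of both ideas (the packet format, class names, and buffering policy here are my own illustrative assumptions, not from this post): the receiver keys incoming packets by an 8-bit sequence number, and conceals a lost packet by repeating the previous one, giving the A B B D E behavior described above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// One network packet of audio: a sequence number plus raw samples.
struct Packet {
    uint8_t seq;
    std::vector<int16_t> samples;
};

// Tiny jitter-buffer sketch: holds packets keyed by sequence number so
// out-of-order arrivals are automatically re-sorted by the map.
class JitterBuffer {
public:
    void receive(const Packet& p) { buffered_[p.seq] = p.samples; }

    // Pop the packet for `seq`; if it never arrived, repeat the last
    // packet we played (the simplest form of loss concealment).
    std::vector<int16_t> pop(uint8_t seq) {
        auto it = buffered_.find(seq);
        if (it != buffered_.end()) {
            last_ = it->second;
            buffered_.erase(it);
        }
        return last_;  // unchanged (repeated) when `seq` was lost
    }

private:
    std::map<uint8_t, std::vector<int16_t>> buffered_;
    std::vector<int16_t> last_;  // empty until the first packet arrives
};
```

A real implementation would also bound the map's size and handle sequence rollover, but this shows the re-ordering and concealment mechanics.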

You face these challenges even if you don't compress the audio data at all. Dealing with these two issues is among the main ways that bad VOIP systems sound worse than good VOIP systems.

You'll have to manage the "robustness vs. latency" tradeoff. That is, the bigger your buffers, the less likely your sound will break up or crackle, but the longer your latency (the delay from when person A speaks to when person B hears) will be. You really want to shoot for end-to-end latency of less than 200 ms, including network latency and the latency introduced by the jitter buffers. Much more than that, and conversation gets difficult/annoying.
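As a back-of-the-envelope check on that 200 ms budget (every number below is an assumed, typical value, not something from this thread):

```cpp
// Rough end-to-end latency budget for voice chat; tune for your stack.
constexpr int frame_ms    = 20;  // audio in one encoded packet (assumed)
constexpr int jitter_pkts = 3;   // packets held in the jitter buffer
constexpr int network_ms  = 60;  // assumed one-way network transit
constexpr int device_ms   = 30;  // assumed capture + playback hardware latency

constexpr int total_ms =
    frame_ms + jitter_pkts * frame_ms + network_ms + device_ms;

static_assert(total_ms < 200, "inside the conversational budget");
```

With these numbers the total is 170 ms, so there is only modest headroom; a bigger jitter buffer or a slower link eats it quickly.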

Opus is a good choice. It has "wideband" (high-fidelity) modes for speech.

Brian Schmidt

Brian Schmidt Studios

Edited by bschmidt1962, 31 December 2012 - 11:33 AM.

Brian Schmidt

Executive Director, GameSoundCon:

GameSoundCon 2016:September  27-28, Los Angeles, CA

Founder, Brian Schmidt Studios, LLC

Music Composition & Sound Design

Audio Technology Consultant

#4 laylay  Members

Posted 31 December 2012 - 11:48 AM

Thanks for all the info. I've been messing around with the Opus encode and decode examples, and I was wondering: what options do I set to lower the quality, resulting in less data in the packets?

Do players send their own voice data at the highest quality, with the server then sending a lesser-quality version based on each client's bandwidth?

I guess what I'm trying to ask is: what factors go into making a lower-quality sound? From what I see, only the -loss and -bandwidth options would result in lower quality. Packet loss is obvious, but what decides what the bandwidth option should be when encoding? Would the server even bother to do any encoding/decoding of sound, or would it just pass it on to other clients without touching it?

I think I'm just overthinking this. Maybe just using wideband all the time for voice communication would be good enough. I shouldn't really have to adjust anything for players who are lagging.

Edited by laylay, 31 December 2012 - 12:19 PM.

#5 bschmidt1962  Members

Posted 31 December 2012 - 01:12 PM

I think I'm just overthinking this.

I'd say you are correct ...

While it is quite possible to create a sophisticated system that looks at network QoS (Quality of Service) and does run-time analysis, with server-side encoding, to give each player the best quality their own network supports, in practice it's way easier to just decide on the quality you want to deliver and always use that setting.

So encode on the sender's system at whatever quality you decide you want, send it up to the server, and let the server broadcast it out to the people at the other end of the 'phone' (or just go peer-to-peer and bypass the server if you don't need large numbers of people to hear).

I'm not familiar with the details/options of Opus. But in general, there are two things that decide the audio fidelity: the sampling rate of the sound itself and the compression ratio. The sampling rate should be at least 16 kHz for "high quality" speech (a regular telephone is 8 kHz), and I'd say that anything over 24 kHz for in-game chat is overkill.

Note that "sampling rate" will generally be about twice the 'bandwidth', so the -bandwidth option might be controlling the sampling rate. As for -loss, that probably directly or indirectly controls the compression ratio. The less compression, the more natural the voice will sound; at extreme compression ratios you often get that warbly/watery/robotic sound to speech.

Hope that helps!

Brian


#6 laylay  Members

Posted 31 December 2012 - 02:05 PM

Thanks. I think I know enough to go ahead and start on this. Should have good in-game voice chat by the end of it!

#7 hplus0603  Moderators

Posted 31 December 2012 - 02:07 PM

A high-quality encoder will pretty much always improve the experience compared to "raw" data. The reason is that bandwidth is *never* unlimited -- there's always going to be more conversations, more movie streams, more large file downloads, and more fridges calling the grocery to tell them you're out of milk.

The two things you need to solve are:

1) Topology -- peer to peer, or server-bounced?

2) Network and sound card differences.

To do peer-to-peer, you're going to need a NAT introducer in your server. There are links in the FAQ about this, but it's not entirely simple to set up. Going through a central server is much easier, and more robust for end users, but may add some latency, and will certainly add bandwidth usage to the server.

Secondly, different networks have different amounts of jitter, and different sound cards play out at slightly different sampling frequencies! Even if your sound card says it's doing a 48 kHz sample rate, that may actually be 47,950 Hz, or 48,050, or, due to the mad drive to the lowest price, even as bad as 47,500 or 48,500 Hz. The reason this matters is that the sender may be producing samples at 48,500 samples per second (or some fraction thereof) while the receiver plays out at 47,500 samples per second; for each second continually played, the receiver falls 1,000 samples behind. After 48 seconds of reception, the receiver will be a whole second behind in playback!

The fix for both of these is to keep a window of allowed play-out latency. When a packet with data comes in, put it in the queue, but don't necessarily immediately play it out. Decide on some lowest amount of data you need to be able to play out (say, 80 ms) and some highest amount of data you will accept (say, 200 ms), and then start playing when you get to the lowest amount. If each packet is 30 ms of data, you will start playing out when you have three packets. Then, as you play out, if you ever find that there's not enough data, you stop playing out until you have enough data (80 ms) again. Similarly, if you get additional data so you have more than 200 ms of data, drop all the older, unplayed data until you're down at about (200+80)/2 == 140 milliseconds of data.
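One way to sketch that window logic, using the 30/80/200 ms numbers from the paragraph above (the class, an assumed 48 kHz sample rate, and the one-packet-per-pull granularity are my own simplifications):

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Play-out window sketch: start at the 80 ms low-water mark, drop old
// data past the 200 ms high-water mark, emit silence on underrun.
class PlayoutBuffer {
public:
    static constexpr int kPacketMs = 30;   // audio per packet
    static constexpr int kLowMs    = 80;   // start playing at this much data
    static constexpr int kHighMs   = 200;  // never hold more than this
    static constexpr int kRateKhz  = 48;   // assumed sample rate, kHz

    void push(std::vector<int16_t> pkt) {
        queue_.push_back(std::move(pkt));
        // Over the high-water mark: drop oldest until ~(200+80)/2 = 140 ms.
        if (buffered_ms() > kHighMs)
            while (buffered_ms() > (kHighMs + kLowMs) / 2)
                queue_.pop_front();
    }

    // Called whenever the sound card wants the next packet of samples.
    std::vector<int16_t> pull() {
        if (!playing_ && buffered_ms() >= kLowMs)
            playing_ = true;   // low-water mark reached: start playing
        if (!playing_ || queue_.empty()) {
            playing_ = false;  // underrun: refill before playing again
            return std::vector<int16_t>(kPacketMs * kRateKhz, 0);  // silence
        }
        std::vector<int16_t> pkt = std::move(queue_.front());
        queue_.pop_front();
        return pkt;
    }

    int buffered_ms() const { return int(queue_.size()) * kPacketMs; }

private:
    std::deque<std::vector<int16_t>> queue_;
    bool playing_ = false;
};
```

Note that `pull()` always returns a full packet's worth of samples, silence included, which matches the advice below about never stopping the playback stream.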

And, in a good sound implementation, you don't actually start or stop the playback stream; you keep the playback stream going, but "stopping playing" means you generate zeros for the play-out. That part is more related to sound management practices than networking, though :-)
enum Bool { True, False, FileNotFound };

#8 frob  Moderators

Posted 31 December 2012 - 04:48 PM

Rather than writing your own, I strongly suggest using an off-the-shelf VoIP library. The better ones integrate nicely with games, operate with very low bandwidth requirements, automatically run on their own port, play nicely with UPnP, and otherwise just do the right thing.

VoipDevKit is one such solution. Mumble is pretty popular, as is Ventrilo.

If you really want to have your own codec that works with OpenAL, Speex is one library that has been integrated many times. There is lots of documentation online about how to do it.

Check out my book, Game Development with Unity, aimed at beginners who want to build fun games fast.

Also check out my personal website at bryanwagstaff.com, where I occasionally write about assorted stuff.

#9 laylay  Members

Posted 31 December 2012 - 05:13 PM

I think this is kinda something I want to do myself or just not have at all. I've written the game from scratch so far; I may as well have just used Unity or something otherwise.

Despite all the problems, I think this is something I can pull off. I may use Speex over Opus if there are better examples and docs. I think Opus is made by the same people, though.

#10 deftware  Prime Members

Posted 31 December 2012 - 06:40 PM

I would just send UDP packets of 1024 samples, probably at 11 kHz, so roughly 10 packets (about 10,000 samples) per second, prefixed with an incrementing ID; a single byte would work (along with whatever other info you'd like to send, e.g. to identify the sender). That ID is what you use to deal with out-of-order packets. I'd buffer a few of the sample packets before actually sending them to the audio hardware; two or three should suffice. You will experience momentary gaps in audio using UDP when packets are dropped, but it drastically simplifies the networking code.

One approach to compression uses a Fourier transform, which could allow you to cull a bunch of unused frequencies and effectively compress the audio signal down to the relevant ones, reconstructing the audio from a list of frequency coefficients. Fixed-tree Huffman compression on outgoing packets would be faster, simpler, and overall better suited IMO (read up on Quake 3's networking code). You could probably also get away with even less fidelity: 5.5 kHz, either 5 packets a second (at 1024 samples) or 10 packets at 512 bytes (assuming 8-bit samples).

#11 deftware  Prime Members

Posted 31 December 2012 - 06:46 PM

Oh yea, and for outputting the audio, SDL has a function that lets you write the PCM data directly out to the hardware to be played. I used this functionality in a game engine project and built a scripted audio synth engine on top of it.

Look at the Command_S_XXXXX() functions, specifically Command_S_End(), which makes the SDL Mix_QuickLoad_RAW call itself.

http://revolude.svn.sourceforge.net/viewvc/revolude/source/audio.cpp?revision=62&view=markup

#12 laylay  Members

Posted 31 December 2012 - 07:27 PM

Well, I've got OpenAL set up, so I can just play it through that without a problem.

I think tomorrow I'm going to at least get the encoded audio sent to the server and across to clients, having clients decode what they get sent. From there I'll have a much better idea of what needs fixing. I'll be missing packets and have packets in the wrong order, but I should at least be able to hear something through my speakers.

I'll be testing on localhost anyway, so it will probably be audible.

#13 hplus0603  Moderators

Posted 01 January 2013 - 02:26 PM

FWIW: VDK and Ventrilo are commercial products. Mumble you can run yourself for free.

Btw: not using compression is a poor choice IMO. If you get just 4:1 data compression, you can go from 11 kHz/8-bit data to 22 kHz/16-bit data at the same data rate, which is a HUGE improvement in quality.
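The arithmetic behind that 4:1 claim (assuming mono raw PCM at the usual 11025/22050 Hz rates):

```cpp
// Raw mono PCM bit rates: doubling both sample rate and sample depth
// costs exactly 4x, so a 4:1 codec delivers the better stream at the
// original raw network cost.
constexpr int raw_low_bps  = 11025 * 8;   // 11 kHz, 8-bit:   88,200 bit/s
constexpr int raw_high_bps = 22050 * 16;  // 22 kHz, 16-bit: 352,800 bit/s

static_assert(raw_high_bps == 4 * raw_low_bps, "exactly 4x the raw rate");
```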

#14 laylay  Members

Posted 05 January 2013 - 09:52 AM

So I got my encoded voice sending across UDP, and it seems to play smoothly (on localhost). Now I need to know what to do to have it play nicely when connected to other servers.

So from what you've all told me, there are two main things I need to solve: packet reordering and packet loss.

Let's start with packet reordering. I imagine I'd have to wait for a certain number of packets to come in before I can put them in order and play them. How many packets should I wait for before I sort them, though? 5, 6, 7, 10? If I wait too long, there's going to be a lot of latency.

Any other problems I'll have to solve? I want this to be on par with something like the voice chat in Source Engine games.

Edited by laylay, 05 January 2013 - 09:52 AM.

#15 hplus0603  Moderators

Posted 05 January 2013 - 11:14 AM

A simple algorithm that works OK is this:

1) Prefix each packet with a sequence number. A single byte is enough. If you detect a packet that is re-ordered, or a duplicate of a previously received packet, drop it. You can make this detection simple:
char delta = (char)(receivedpacketseq - lastreceivedpacketseq);
if (delta <= 0) { drop packet; }
else { receive packet; lastreceivedpacketseq = receivedpacketseq; }


This uses the magic of signed/unsigned two's-complement math to do the right thing. Just increment the sequence number by one for each packet you send, and let it roll over to 0 after it reaches 255.
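A self-contained illustration of that rollover behavior (the function name and identifiers are mine, not from the post):

```cpp
#include <cassert>
#include <cstdint>

// Difference between two 8-bit sequence numbers, interpreted as signed:
// positive means "newer", zero or negative means duplicate or old, and
// the 255 -> 0 rollover is handled for free by two's-complement wraparound.
inline int seq_delta(uint8_t received, uint8_t last) {
    return int8_t(uint8_t(received - last));
}
```

The cast chain is the whole trick: the subtraction wraps modulo 256, and reinterpreting the result as a signed byte maps "up to 127 ahead" to positive and "up to 128 behind" to negative.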

2) Keep a queue of received data. Let's count this queue in "received packets." Each time you update sound (typically, each time through your main loop), run this algorithm:
bool playing = false;
queue<packet> receivedqueue;

void update() {
    if (playing) {
        if (soundcard needs data) {
            if (receivedqueue.empty()) {
                playing = false;   // underrun: stop until refilled
                fill with zeros;
            } else {
                // queue has grown too long: drop the two oldest packets
                if (receivedqueue.size() > 3) {
                    receivedqueue.erase_the_two_first_elements();
                }
                fill from queue;
            }
        }
    } else {
        fill with zeros;
        if (receivedqueue.size() >= 2) {
            playing = true;        // enough buffered: start playing
        }
    }
}

void packet_received(packet p) {
    receivedqueue.push_back(p);
}


If you use threading, add locking as needed.

This is the simplest, most robust algorithm that I know about, and uses a nice interaction between UDP delivery semantics, network behavior, sound card behavior, and general sound playback to deliver robust, reliable sound that compensates for some amount of network jitter, and adapts to network changes over time.

If the jitter is more than two packets' worth of data, you need a better network :-) You could detect this and raise the values "2" and "3" in the algorithm above, although this will lead to more play-out latency (necessarily, to compensate for the jitter).

#16 laylay  Members

Posted 05 January 2013 - 04:29 PM

Thanks, I'll give it a try. You say to drop the packet if it's not in the correct order; why not just swap them around?

#17 hplus0603  Moderators

Posted 06 January 2013 - 12:09 AM

drop the packet if it's not in the correct order, why not just swap them around?

Because you likely already started playing out zeros where the late packet would have been, and re-ordered packets are very uncommon anyway. If they happen, it's typically on connections so bad that you're unlikely to get a good connection anyway (microwave links during rainstorms, etc.).

However, if you can safely pay attention to the out-of-order packet, it's of course fine to do so, assuming the code doesn't become too complex or introduce other problems.
