HELP! Problems with Multi-User Voicechat through JAVA

3 comments, last by hplus0603 11 years, 1 month ago

Hi guys, I need some advice from anyone familiar with voice-chat applications / java.

A couple of months ago we hired developers to build us a voice-chat.

Among the functional requirements were

1. it had to be browser-based, integrated on a website, without making the user download an external application (except java)

2. it had to handle more than 3 users at a time, even 5 or 10+

3. It had to be a square box, integrated within a website, where each user was assigned a simple avatar (a square image)

4. If avatars moved towards each other, the audio level would increase, as they moved away the audio would decrease (proximity effect)

Unfortunately, the developers came across 2 major problems.

1. There was this annoying echo during the chat. Sometimes nasty high-pitched feedback whenever the offending user came close. This ruined the entire experience.

2. It didn't work for everybody. Some people had clean audio, others sounded like an old distorted walkie-talkie.

The developers built in a volume test that users could run before entering the chat, to make sure the levels were right.

They even tried some sort of echo cancellation or voice activation, which would stop the audio streams whenever the user wasn't talking.

But none of these solutions worked effectively.

In any case, months later the developers simply gave up. They said they tried everything, but 'since it's browser-based, the necessary calculations to perform echo cancellation cannot be achieved'. As for why it didn't work with certain users, we never had an explanation.

We are to receive a RAR archive with the scripts, and that's it, end of story...

Again, what we wanted was a voice-chat, between 3-5 people, with skype-quality audio, without the need to wear headsets.

My questions are:

1. Was JAVA a bad idea from the start?

2. What alternatives/approaches do you suggest for a webbased voice chat?

3. Should we let go of the 'no external download' requirement? (if a plugin solves the issue, heck why not)

Thanks a million for your tips!

VPME


"since it's browser-based, the necessary calculations to perform echo cancellation cannot be achieved"

That's not necessarily true -- if you run Java within a browser, you get performance similar to that of a desktop application.
However, there are multiple gotchas that a developer not well versed in audio and networking may run into.

1) Different network connections have different quality. You will need to detect the typical amount of jitter, and add de-jitter compensation on the receiving side. This needs to be adaptive, and per-remote-user-per-listener.
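To make the "adaptive, per-remote-user-per-listener" part concrete, here is a minimal sketch of a jitter estimate using the smoothing from RFC 3550 (jitter moves 1/16 of the way toward each new deviation). The class name, the 3x multiplier and the 20..200 ms clamp are my own illustrative assumptions, not something from the original project. Each listener would keep one instance per remote sender.

```java
/** Running interarrival-jitter estimate for ONE remote sender, as seen by ONE listener. */
final class JitterEstimator {
    private boolean hasPrevious = false;
    private double previousTransitMs;
    private double jitterMs = 0.0;

    /** Feed one packet: the sender's timestamp and the local arrival time, both in ms. */
    void onPacket(double sentMs, double arrivedMs) {
        // Transit time includes the unknown clock offset between the two machines,
        // but that offset cancels out when two transit times are subtracted.
        double transit = arrivedMs - sentMs;
        if (hasPrevious) {
            double deviation = Math.abs(transit - previousTransitMs);
            jitterMs += (deviation - jitterMs) / 16.0; // RFC 3550 style smoothing
        }
        previousTransitMs = transit;
        hasPrevious = true;
    }

    double jitterMs() {
        return jitterMs;
    }

    /** Assumed policy: buffer a few multiples of the jitter, clamped to 20..200 ms. */
    double targetBufferMs() {
        return Math.max(20.0, Math.min(200.0, 3.0 * jitterMs));
    }
}
```

The playout side would then grow or shrink its buffer toward targetBufferMs() for that sender, preferably during silence so the adjustment is not audible.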

2) A user who is also streaming video, or using bittorrent, or otherwise utilizing the connection will often get very high jitter or even packet loss. This will cause bad quality. Not much you can do about that, but you should detect this case and indicate it within the UI. For example, you can show a red icon next to any user with more than, say, 3% packet loss in the last minute or two. Note that the user with the poor connection will see everyone else as having packet loss, because there is loss between this user and everyone else.
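The red-icon idea could be sketched roughly like this: count the packets received from a sender inside a time window and compare that with the span of sequence numbers seen. The class and method names are hypothetical, and the sketch assumes sequence numbers that increase by one per packet with no wrap-around and no duplicates.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Windowed packet-loss estimate for one remote sender, based on sequence-number gaps. */
final class LossMonitor {
    private final long windowMs;
    // Each entry is {arrivalMs, sequenceNumber} for a packet that actually arrived.
    private final Deque<long[]> received = new ArrayDeque<>();

    LossMonitor(long windowMs) {
        this.windowMs = windowMs;
    }

    void onPacket(long arrivalMs, long sequenceNumber) {
        received.addLast(new long[] {arrivalMs, sequenceNumber});
        // Drop packets that have fallen out of the time window.
        while (!received.isEmpty() && received.peekFirst()[0] < arrivalMs - windowMs) {
            received.removeFirst();
        }
    }

    /** Fraction of packets missing between the lowest and highest sequence number in the window. */
    double lossFraction() {
        if (received.isEmpty()) {
            return 0.0;
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long[] p : received) {
            min = Math.min(min, p[1]);
            max = Math.max(max, p[1]);
        }
        long expected = max - min + 1;
        return 1.0 - (double) received.size() / expected;
    }

    /** The 3% threshold is the example figure from the post above. */
    boolean shouldShowWarning() {
        return lossFraction() > 0.03;
    }
}
```

With a 60 to 120 second window this gives the "last minute or two" behaviour. Jitter could be folded into the same indicator if you want a single quality icon per user.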

3) Echo cancellation needs to be done with some knowledge of the hardware involved to work well. Ideally, you can convolve the signal with the inverse of your microphone impulse response after capture and also pre-convolve it with the inverse of your speaker response before playing out. Additionally, take into account the scheduling jitter on the machine for audio playback. The good news is that, since Windows Vista, Windows will do echo cancellation for you. (You may need to use a known microphone and/or configure it to turn it on.) Additionally, for headset-less voice input, you really need a microphone array. Many laptops have at least two microphones in the bezel; some have four. For really high quality conferencing, I would suggest at least a four-microphone array.

This is an application that requires a significant amount of knowledge across a few different disciplines; your typical contracting/outsourcing house is unlikely to actually have all of the needed skills in one place.
enum Bool { True, False, FileNotFound };

There are actually quite a few Java-based SIP clients (user agents) out there that you can use with open source server-side conferencing such as Freeswitch, Asterisk and many others. Of the issues hplus mentioned, #2 is the most difficult one to manage (IMHO), though because of CPU contention rather than the network. Many VoIP providers are still trying to figure it out for desktop-based soft phones, although network congestion is absolutely a valid concern.

I sympathize with the contracted developers because there is a measure of truth to their statement, although on a broader scale: your application, especially in a browser, will not be able to schedule itself on the client computer well enough to provide any real quality assurances. Soft phone providers have this same issue. Most VoIP protocols expect a packet every 20 ms (depending on the codec and sample size), which is 50 packets per second.
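For anyone who wants the packet arithmetic spelled out, here is a tiny sketch. The 8000 Hz sample rate and 1 byte per sample used in the comments are just illustrative assumptions on my part (typical narrowband telephony figures), not numbers from the project.

```java
/** Back-of-the-envelope packet arithmetic for a fixed frame duration. */
final class PacketMath {
    /** Packets sent per second when one packet carries frameMs of audio. */
    static int packetsPerSecond(int frameMs) {
        return 1000 / frameMs;
    }

    /** Audio samples carried by one packet. */
    static int samplesPerPacket(int sampleRateHz, int frameMs) {
        return sampleRateHz * frameMs / 1000;
    }

    /** Uncompressed payload size of one packet in bytes. */
    static int payloadBytes(int sampleRateHz, int frameMs, int bytesPerSample) {
        return samplesPerPacket(sampleRateHz, frameMs) * bytesPerSample;
    }

    public static void main(String[] args) {
        System.out.println(packetsPerSecond(20));         // 50
        System.out.println(samplesPerPacket(8000, 20));   // 160
        System.out.println(payloadBytes(8000, 20, 1));    // 160
    }
}
```

The point is that the client has to wake up and deliver a frame on time 50 times every second. Missing even a few of those deadlines because the browser or OS scheduled something else is audible.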

Consider a Skype conversation where the audio continues flawlessly but the video occasionally pauses and jumps ahead. This is usually more acceptable than the opposite, where there is video but the audio sporadically skips or has static in it.

That all being said, the conference should be mixed server side, so the number of people in the conversation shouldn't be an issue for the client side. There are free conferencing applications out there, so this shouldn't be an issue. A good example of something that takes advantage of Java, Flash and Red5 along with Freeswitch/Asterisk to do this is Big Blue Button. I will stress, though, that the general nature of conferencing means a single user can mess up the conference for everyone. Consider someone putting a massive conference on hold (happens at work all the time)...
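To illustrate what "mixed server side" means, here is a rough sketch of a "mix-minus" step: every participant receives the sum of everyone else's audio but not their own. It assumes 16-bit PCM frames of equal length that are already decoded and time-aligned. The class name is made up for illustration, and this is not how Freeswitch or Asterisk implement it internally.

```java
/** Mixes one frame per participant so that each listener hears everyone except themselves. */
final class ConferenceMixer {
    /**
     * frames[i] is participant i's 16-bit PCM frame; all frames must have the same length.
     * Returns out[i], the frame to send back to participant i.
     */
    static short[][] mixMinus(short[][] frames) {
        int n = frames.length;
        int len = (n == 0) ? 0 : frames[0].length;

        // Sum everything once, in int so the intermediate total cannot overflow 16 bits.
        int[] total = new int[len];
        for (short[] frame : frames) {
            for (int k = 0; k < len; k++) {
                total[k] += frame[k];
            }
        }

        short[][] out = new short[n][len];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < len; k++) {
                int v = total[k] - frames[i][k]; // remove the listener's own voice
                // Hard-clip to the 16-bit range; a real mixer would likely use a limiter.
                out[i][k] = (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, v));
            }
        }
        return out;
    }
}
```

Because each client only uploads one stream and downloads one stream, client load stays flat whether there are 3 or 10 people in the room. A proximity effect like the one in the original requirements could be added by scaling each frame with a per-listener gain before summing, at the cost of one mix per listener.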

Evillive2

Thank you very much for the feedback.

This is all pretty technical but I do understand what you're saying.

So basically, the project is feasible, but very difficult and requires that the developers be skilled in various disciplines.

We were pretty much happy with the result except for the fact that the echo kept ruining all the conversations.

And then, echo + distortion from 1 or more users would just make the entire experience very poor.

Then we came to a crossroads... do we find a new developer to finish what we started? (since the others gave up).

Or do we start over, from scratch, with a different technology... (as far as I know, Google Hangout doesn't use Java, neither does Facebook chat, so why should we?)

But I myself am more of a web developer than a programmer...

So I really appreciate your input!

Echo cancellation requires the client to be well behaved. It ideally also requires high-quality hardware on the client. If just one client is not well-behaved, or has poor hardware, then that client connection may end up with poor echo cancellation performance, and it will ruin the conference for everybody.

Either require all clients to use good hardware and Windows Vista or later, with built-in echo cancellation, or require all clients to use headsets. There's a reason headsets are so ubiquitous...

If you want to keep going on the speaker-based echo cancellation thread, you could start by playing a quick frequency sweep, to calibrate the relationship between the microphone and the speaker. Run a reverse convolution (deconvolution) between the signal you play out and the signal you record; this will give you the impulse response of the system. Now, in turn, convolve everything you play out with that impulse response, and subtract that from the recorded signal you get from the microphone, and you have a "perfect" echo cancellation system.
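The convolve-and-subtract step can be sketched as follows. This assumes the impulse response has already been measured, that the played and recorded buffers are sample-aligned, and that everything is in floating point; the names are hypothetical. It uses plain time-domain convolution for clarity, where a real implementation would more likely use an FFT-based or adaptive filter.

```java
/** Subtracts the predicted speaker echo from the microphone signal. */
final class EchoSubtractor {
    /** Predicted echo: echo[n] = sum over k of impulse[k] * played[n - k]. */
    static double[] predictEcho(double[] played, double[] impulse) {
        double[] echo = new double[played.length];
        for (int n = 0; n < played.length; n++) {
            for (int k = 0; k < impulse.length && k <= n; k++) {
                echo[n] += impulse[k] * played[n - k];
            }
        }
        return echo;
    }

    /** Returns the recorded signal with the predicted echo removed. */
    static double[] cancel(double[] recorded, double[] played, double[] impulse) {
        double[] echo = predictEcho(played, impulse);
        double[] cleaned = new double[recorded.length];
        for (int n = 0; n < recorded.length; n++) {
            double e = (n < echo.length) ? echo[n] : 0.0;
            cleaned[n] = recorded[n] - e;
        }
        return cleaned;
    }
}
```

If the measured impulse response matches the room exactly, what remains is just the local talker. In practice it never matches exactly, which is where the caveats come in.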

Some caveats:
- As the room changes (even just the user moving, or re-positioning a laptop, or whatever,) the impulse response will change.
- Some audio APIs don't actually give you access to the raw bits of recording and playback with the time stamps needed to do this convolution.
- Depending on hardware parameters (buffering) you may need a significant amount of buffering in the application to support a "long" impulse response. Typically, a large part of that impulse response will be empty, so if you can "cut out" the impulse response for the long empty part, and only convolve the part with actual echo signal, that will save processing power.
- You may find that you want to leave the higher frequencies out of the cancellation and let them pass through untouched, because a reverse convolution that goes off-kilter is more disruptive than some low-volume, high-frequency-only echo. Exactly what the right trade-off is is up to you.
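The "cut out the long empty part" caveat could look something like this: store the leading silence of the impulse response as a plain delay, and only convolve with the taps that carry echo energy. The threshold and the class name are assumptions for illustration. Taps below the threshold are treated as zero, so the result is an approximation.

```java
import java.util.Arrays;

/** Impulse response stored as a delay plus the taps that follow the leading silence. */
final class TrimmedImpulse {
    final int delay;      // number of leading near-zero taps that were cut out
    final double[] taps;  // remaining taps, starting at the first significant one

    TrimmedImpulse(double[] impulse, double threshold) {
        int first = 0;
        while (first < impulse.length && Math.abs(impulse[first]) < threshold) {
            first++;
        }
        this.delay = first;
        this.taps = Arrays.copyOfRange(impulse, first, impulse.length);
    }

    /** Predicted echo for the played signal, skipping the work for the empty leading part. */
    double[] predictEcho(double[] played) {
        double[] echo = new double[played.length];
        for (int n = delay; n < played.length; n++) {
            for (int k = 0; k < taps.length && k <= n - delay; k++) {
                echo[n] += taps[k] * played[n - delay - k];
            }
        }
        return echo;
    }
}
```

The inner loop now runs over taps.length entries instead of the full impulse length, which matters when buffering adds many milliseconds of pure delay in front of the actual echo.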
