Aharon Satt

Member
  • Content Count

    6

Community Reputation

0 Neutral

About Aharon Satt

  • Rank
    Newbie

Personal Information

  • Interests
    Audio
    Programming

  1. Hello, We would like to offer our beta-level service for free for about half a year from now, until Aug 10, 2018. We hope this makes it attractive for game developers to try the tool and use it. Our purpose is to gain feedback to improve the tool. We encourage you to read the help to learn about the new capabilities and explore them. Any audio content generated through this tool (https://ivva-tts.sl.haifa.il.ibm.com/) until Aug 10, 2018, can be used for free, forever, including for commercial purposes. I will post more details about the technology behind the scenes over the next weeks. Aharon
  2. Hello JoeJ, thank you for your feedback. Specifically, on "Some language that describes intonations, feelings and stuff, e.g. '<frightened 5> Help! The monster! <whispering 2> Hide under the desk!'": we are in fact working on a technology that does exactly this, and I will find the proper context to describe it in another post. On "We need laughter and screams, not just speech": yes, absolutely; the (smaller) challenge here is to adapt a basic laughter sample to the specific voice you are using to synthesize the speech. Many thanks, Aharon
  3. Thank you Frob! In addition, please consider that TTS technology has improved significantly in recent years, thanks to new machine learning / deep learning techniques. So yes, it isn't fancy, but it does a much better job than it did until recently. If you like, you can listen to a research prototype as an example, and there are additional examples: https://ivva-tts.sl.haifa.il.ibm.com. I am curious to hear your opinion of it. Thank you, Aharon
  4. Hi Alvaro, thanks for the fast response! IMO your description outlines a good approach, and it sounds feasible to me. It has the potential to generate high-quality voices with fine-grained attributes like gender, age, accent etc. In comparison, we have worked on an approach that also uses a parametric model of the speaker, but with "simple" heuristic controls that enable generation of a large variety of voices, including "non-human" ones, with lower dependency on pre-recorded voices to establish the baseline (the embedding). Here is a link to the scientific publication: http://www.isca-speech.org/archive/Interspeech_2017/pdfs/1202.PDF Two different approaches, each with its own merits.
  5. [To the moderator: I selected this forum as TTS technology falls within AI; I found no better alternative. I hope it is acceptable.]

     In this post I would like to share opinions about Text To Speech (TTS) technology in the context of gaming. I am not a gaming expert; rather, my expertise is in signal and speech processing. I work at the IBM Research division. I hope this post will trigger a discussion and opinion sharing. My purpose here is to discuss “quality”. I plan to discuss technology trends in follow-on posts.

     The potential benefit of “good quality” TTS technology for game developers is clear. Yet TTS is still considered to deliver “insufficient quality”. What is “quality” anyway, in our context?

     1. The basic quality of modern TTS is good, in general. State-of-the-art machine learning algorithms enable good prediction of the prosody (“intonation”, duration, loudness, emphasis and more) from the text, and the synthesized speech achieves good scores in subjective quality tests. It is no longer the “robotic” sound it used to be. It sounds natural and “clean”. For applications such as announcements or commercial question answering, modern TTS provides a good alternative.
     2. Natural speech, however, needs correct emphasis of different words across the sentences. Due to the ambiguity of natural language, the algorithms (and humans as well) cannot always determine the “correct” emphasis from the text of isolated sentences or utterances, without full knowledge of the entire context. This can limit the quality achievable by modern TTS technology.
     3. When we consider gaming applications, additional needs arise. For example, using a formal style to generate the speech for an action scene would sound weird: where are the emotions? The emotional content in human speech is essential for conveying messages. This is certainly an important aspect of “quality”.
     4. Yet another aspect that relates to “quality”, at least in the broader sense, is the variety of voices. Most modern TTS technologies are based on pre-recorded human voices (recordings of voice talents uttering a large collection of sentences). As recording, and processing the recorded speech, are expensive and time consuming, the variety of voices in typical TTS products is limited: often a few, and less commonly several tens of different voices per “major” language. Moreover, gaming often requires non-human voices, such as “cartoonish” ones, to best support the different characters. To summarize, repeatedly using the same voices across games and characters amounts to a less than optimal experience, or in other words, to lower “quality”.

     I hope this provides some initial insights from the perspective of a speech technology researcher. I hope to get feedback from the gaming experts. Am I right? What have I missed? What would allow game developers to start benefiting from TTS technology? All the best, Aharon.