• 04/16/03 10:21 AM

    Using Text-To-Speech as a Game Programming Tool

    Engines and Middleware

    Myopic Rhino
    The purpose of this article is to introduce the use of Microsoft's Speech API (SAPI) 5.1 as an effective tool in game development. You need to have the SAPI 5.1 SDK installed, and the library/header paths set in VC++. An understanding of C++, object-oriented programming, and Win32 programming is recommended. SAPI 5.1 is a COM-based API, so an understanding of COM would be an asset, but it is not essential.

    Note: This article will be using Visual C++ 6.0, on the Win32 platform. The source code for this article is available at the bottom of this page.


    The first thing I should tell you is my reason for writing this article. When I sobered up this morning - and inevitably started coding - I was quite annoyed, because I had to keep typing in the message box function, and lots of string formatting garbage. The repetition - without production - was really getting to me. So then, I thought to myself: "Hey, wouldn't it be cool if I could make the computer talk to me instead?" Later on, I realized that it would also save me a lot of time, since all of the string formatting could be done transparently, and I would no longer have to hit 'Enter' whenever a message appeared. Or did I think of it yesterday? It's all a real blur...

    Anyway, all joking aside, most games are Win32 applications, and if you've ever programmed in Win32, the first thing you notice is that you don't have the luxury of console input and output. This, to say the least, is not going to help... ever. Now, you could just stick with the tried and true message boxes, but, as I've already said, that can get tedious. The way I see it, this situation is analogous to the difference between printf() and cout. They are both fine for displaying intrinsic types, but cout allows you to output all sorts of data, such as the members of a class (thanks to operator overloading) with ease. Using printf() is also fairly easy, but it is not nearly as versatile. This is similar to the message box versus Text-To-Speech problem, because message boxes can only output text, and you have to do string formatting to output other data types. Now, as the name suggests, a Text-To-Speech engine also has this restriction, but if we put the engine in a class, we can deal with converting all data types using overloaded operators - behind the scenes. The end result is a single class object that is very flexible and easy to use, and that can be employed anywhere in lieu of a message box.

    I suppose you could encapsulate a message box in a similar fashion, but, on principle, I refuse to waste a Saturday morning on something so trivial - let alone write my first article about it - when there is something much more useful available. Besides, I think making my computer talk is much more entertaining.

    Accompanying this article are two samples. The first is a simple "Hello World!!!" application, and the other is a complete class that is meant to behave in a similar manner to the cout object. For the article itself, I will be focusing on the "Hello World!!!" application, since it is the simpler of the two. The basic concepts are the same, so my intention is that you will go through the article, then the first sample, and then understand the class (second sample) without difficulty.
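To make the cout-like idea concrete, here is a minimal sketch of how such a class could be shaped. It is not the class from the second sample: the names SpeechStream, SpeakFn, Flush, RecordText and g_LastSpoken are hypothetical, and the sink below only records the text so that the sketch stays independent of SAPI. In a real build, the sink is where the text would be handed to the voice interface.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical sketch; none of these names come from the article's samples.
// A sink is any function that receives the finished text.
typedef void (*SpeakFn) ( const std::wstring& Text );

class SpeechStream
{
public:
    explicit SpeechStream ( SpeakFn Sink ) : m_Sink ( Sink ) {}

    // One template covers every type that std::wostringstream can format,
    // including user types that overload operator<< for std::wostream.
    template <typename T>
    SpeechStream& operator<< ( const T& Value )
    {
        m_Buffer << Value;
        return *this;
    }

    // Hand the accumulated text to the sink, then clear the buffer.
    void Flush ()
    {
        if ( m_Sink != NULL ) m_Sink ( m_Buffer.str () );
        m_Buffer.str ( std::wstring () );
    }

private:
    SpeakFn             m_Sink;
    std::wostringstream m_Buffer;
};

// Stand-in sink (an assumption for this sketch): it stores the text
// instead of speaking it. A SAPI-backed sink would speak the text here.
std::wstring g_LastSpoken;

void RecordText ( const std::wstring& Text )
{
    g_LastSpoken = Text;
}
```

With this in place, `SpeechStream Out ( RecordText ); Out << L"Health: " << 42; Out.Flush ();` hands the text "Health: 42" to the sink, with no manual string formatting at the call site.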

    I will tell you now that this is a very easy subject to pick up, and I'm quite surprised that nobody has written about it yet. Microsoft has put a lot of time and effort into developing this technology, and it would be foolish for us to not at least consider using it. There are many possible applications of Text-To-Speech engines in modern game development, but using it as a debugging tool is just the first one that I thought was worth writing about. A "Hello World!!!" application, such as the one described in this article, could be made as small as ten executable lines. All of the code for the output class is no more than 500 lines (due to white space, commenting, etc.), and I have documented the code to make it as clear as possible. There are also many tutorials and whitepapers in the SDK documentation (some of which are even shorter than this article).

    I'm not going to try to explain every possible use of SAPI to you, but I do hope to spark your interest in it. Maybe I'm off in Never Never Land, but I just think this is neat.


    Now, let's get to it. The first step, in any application, is to link the needed libraries, and include the header files. As it happens, SAPI doesn't have any libraries that you need to link manually, but there is one header file, "sapi.h", which must be included. The second step is to initialize COM, and the voice interface. This is shown below:


    // The voice interface pointer
    ISpVoice* Voice = NULL;

    // Initialize COM
    CoInitialize ( NULL );

    // Create the voice instance
    CoCreateInstance ( CLSID_SpVoice, NULL, CLSCTX_ALL, IID_ISpVoice, (void**)&Voice );
    The first two lines are pretty straightforward. They initialize the interface pointer and COM (the parameter is reserved, and must be NULL). The third line has a few parameters, which are explained below:

    CLSID_SpVoice    The class identifier; for us, a speech voice class
    NULL             Used with COM aggregation; we don't use it
    CLSCTX_ALL       Class context; we use all class contexts
    IID_ISpVoice     The identifier of the interface we want from the object
    &Voice           Address of the pointer that receives the new interface

    Initialization is just that simple. The next task is making the voice interface speak to us.
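The snippet above ignores the HRESULT values that both COM calls return. As a hedged sketch, the same two calls can be wrapped in a helper that reports failure; the name InitVoice is hypothetical and not part of the samples. The #ifdef guard is only there so that non-Windows compilers skip the code.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <sapi.h>

// Hypothetical helper: returns true only if COM was initialized and the
// voice instance was created. On failure, *Voice is left as NULL.
bool InitVoice ( ISpVoice** Voice )
{
    *Voice = NULL;

    // Initialize COM and check the result
    HRESULT Result = CoInitialize ( NULL );
    if ( FAILED ( Result ) ) return false;

    // Create the voice instance
    Result = CoCreateInstance ( CLSID_SpVoice, NULL, CLSCTX_ALL, IID_ISpVoice, (void**)Voice );
    if ( FAILED ( Result ) )
    {
        // Undo the successful CoInitialize before giving up
        *Voice = NULL;
        CoUninitialize ();
        return false;
    }

    return true;
}
#endif
```

A caller would write `ISpVoice* Voice = NULL; if ( !InitVoice ( &Voice ) ) { /* fall back to a message box */ }` and carry on as before when the call succeeds.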


    Now that the voice interface is initialized, we can use it to speak. The following code accomplishes this:

    // Our text to be spoken
    const WCHAR* TextBuffer = L"Hello World";

    // Use our voice interface to speak the contents of the buffer
    Voice -> Speak ( TextBuffer, SPF_DEFAULT, NULL );
    You'll notice that our string is in wide characters. To convert an ASCII (or other multibyte) string to wide characters, use the "MultiByteToWideChar" function in the Win32 API; "WideCharToMultiByte" goes the other way. For more information, please refer to the MSDN Library.
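As a side note on that conversion, here is a minimal sketch of a helper built on MultiByteToWideChar. The name ToWide is hypothetical, and CP_ACP (the system's ANSI code page) is an assumption about how the narrow text is encoded. Off Windows, the sketch falls back to widening each byte, which is only correct for plain ASCII.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper (not part of the article's samples): converts a
// narrow string to a wide string suitable for passing to Speak.
std::wstring ToWide ( const std::string& Text )
{
    if ( Text.empty () ) return std::wstring ();

#ifdef _WIN32
    // First call: ask how many wide characters the result needs
    int Length = MultiByteToWideChar ( CP_ACP, 0, Text.c_str (), (int)Text.size (), NULL, 0 );
    if ( Length <= 0 ) return std::wstring ();

    // Second call: convert into a buffer of exactly that size
    std::wstring Wide ( Length, L'\0' );
    MultiByteToWideChar ( CP_ACP, 0, Text.c_str (), (int)Text.size (), &Wide[0], Length );
    return Wide;
#else
    // Fallback for other platforms: valid for plain ASCII text only
    return std::wstring ( Text.begin (), Text.end () );
#endif
}
```

The result can then be spoken with `Voice -> Speak ( ToWide ( "Hello World" ).c_str (), SPF_DEFAULT, NULL );`.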

    At this point, the program will suspend and you should hear the contents of the buffer being spoken by the computer's default voice. Once the voice has finished speaking, the program continues normally. The parameters for the "Speak" member function are as follows:

    TextBuffer     The text that is to be spoken
    SPF_DEFAULT    Rendering flags; we don't want anything fancy, just the default
    NULL           Optional pointer that receives the current input stream number; we don't need it

    Shutting Down

    The last thing to do is shut down the application. The following lines accomplish this:

    // Safely release the voice interface
    if ( Voice != NULL ) { Voice -> Release (); Voice = NULL; }

    // Shut down COM
    CoUninitialize ();
    The first line determines if the interface is in use, and if so, it is released. The pointer is then set to NULL, just to be safe. The second line shuts down COM.


    Now, as I said at the beginning, this is a "Hello World!!!" program, and consists of about ten lines of code, the bulk of which we have just looked at. You should be able to go through the samples and, almost immediately, use the same approach in the debugging code of your games. When you build the samples, you will notice that the speech quality is rather low; more specifically, words sometimes sound distorted or choppy. I presume this is just the nature of SAPI, and that it will improve in future versions. However, this does not, in any way, stop us from using TTS for debugging and testing.

    Last Thoughts...

    This is only the first application of TTS in game development. In the future, you may wish to implement this technology in the actual game. A few ideas I'm turning over include speaking into a DirectSound or OpenAL buffer, then rendering it in 3D space. Alternatively, one could use SAPI to speak into a wave file, and then simply use the file like any regular sound effect. But that's for the future...

    Anyway, I hope that I've been able to teach you something useful. This is the first article that I have written, and I would appreciate any feedback you may wish to offer. Thanks.
