Gary Garrison

BUS 620

VoiceXML
(Voice eXtensible Markup Language)
Source: (http://www.voicexmlreview.org/voicexml)

Introduction
Voice Extensible Markup Language (VoiceXML) is a technology that enables human-computer interaction with the Internet through voice-recognition applications, using a voice browser and/or a telephone. It accepts speech recognition and/or touch-tone keypad input, and produces pre-recorded audio and Text-to-Speech (TTS) synthesis as output. It is based on the World Wide Web Consortium's (W3C's) Extensible Markup Language (XML), creating a common language standard from which application developers, platform vendors, and tool providers can all benefit through code portability and reuse. In addition, VoiceXML supports many sophisticated voice recognition and text-to-speech technologies as well as more traditional tone and grunt-based interfaces.

Advantages
One of the many advantages of VoiceXML is that its applications build on existing XML and web authoring tools, whereas traditional telephony applications have been rather laborious to create. In addition, VoiceXML applications are written in a declarative manner: the programmer tells the system “what to do” rather than, as in the more traditional procedural manner, “how to do it.” This change in programming style aids rapid application development and links easily with existing web-based applications and Common Gateway Interface (CGI) back ends. Because VoiceXML is an XML language that uses the same web infrastructure, tools, and Web servers, a programmer can create VoiceXML scripts from existing XML data sources, including speech recognition applications. One of the major benefits of programming with VoiceXML is that any telephone can access VoiceXML applications through a browser running on a telephony server, instead of relying on the current method of access, a traditional PC with a Web browser.

Whereas HTML is commonly used for creating graphical Web applications, VoiceXML can be used for voice-enabled Web applications. Like HTML, VoiceXML is a well-defined set of tags that determine how specified data will be presented. VoiceXML accomplishes this task by organizing data into two types of dialogs, called menus and forms. Menus give options about what to do next and expect a response. Forms ask a particular question or provide particular information and may or may not expect a response. The typical VoiceXML voice browser of today runs on a specialized voice gateway node that is connected both to the public switched telephone network and to the Internet. These voice gateways extend the power of the web to phones, allowing VoiceXML pages to be called from a web application in much the same way as HTML pages.
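The menu and form dialogs described above can be illustrated with a short sketch. It uses Python's standard xml.etree.ElementTree module to build a small VoiceXML-style document; the dialog ids, prompts, and the build_vxml function name are assumptions made for illustration, and the speech grammars a real field would need are left out.

```python
import xml.etree.ElementTree as ET


def build_vxml() -> str:
    """Build a small VoiceXML-style document with one menu and two forms."""
    vxml = ET.Element("vxml", version="2.0")

    # A menu gives options about what to do next and expects a response.
    menu = ET.SubElement(vxml, "menu", id="main")
    ET.SubElement(menu, "prompt").text = "Say quotes or balance."
    # Each choice points to another dialog in the same document.
    ET.SubElement(menu, "choice", next="#quotes").text = "quotes"
    ET.SubElement(menu, "choice", next="#balance").text = "balance"

    # A form can ask a particular question and collect the answer in a field.
    ask = ET.SubElement(vxml, "form", id="quotes")
    field = ET.SubElement(ask, "field", name="symbol")
    ET.SubElement(field, "prompt").text = "Which stock symbol?"

    # A form can also just provide information without expecting a response.
    info = ET.SubElement(vxml, "form", id="balance")
    block = ET.SubElement(info, "block")
    ET.SubElement(block, "prompt").text = "Balances are not available in this demo."

    return ET.tostring(vxml, encoding="unicode")


document = build_vxml()
print(document)
```

A voice browser would read the menu prompt aloud, wait for the caller to say one of the choices, and then move to the form that the choice names.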

How VoiceXML Works
The user calls the VoiceXML platform by phone, which serves as the audio interface. The platform runs the VoiceXML interpreter, the speech recognizer, and the speech synthesizer engine. The voice prompter requests information from the user, who says or keys in the information associated with the URL that the voice prompter requested. The requested pages are submitted through CGI, the same interface used by the code that generates VoiceXML dynamically. The VoiceXML interpreter interprets the pages and presents the output as audio, and the user provides further input by speaking or pressing touch-tone keys.
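Dynamic generation of VoiceXML through a CGI-style back end might look roughly like the following sketch. The QUOTES table, the symbol parameter, and the handle_request function are hypothetical stand-ins for a real data source and script; only the general pattern (read the request, look up data, write a header and a VoiceXML page) is meant to carry over.

```python
from urllib.parse import parse_qs
from xml.sax.saxutils import escape

# Hypothetical back-end data; a real application would query a database.
QUOTES = {"ACME": "12.50", "GLOBEX": "7.25"}


def handle_request(query_string: str) -> str:
    """Return a CGI-style response (header plus VoiceXML body) for one request."""
    params = parse_qs(query_string)
    symbol = params.get("symbol", [""])[0].upper()
    price = QUOTES.get(symbol)

    if price is None:
        message = f"Sorry, no quote was found for {symbol or 'that symbol'}."
    else:
        message = f"{symbol} is trading at {price} dollars."

    body = (
        '<?xml version="1.0"?>\n'
        '<vxml version="2.0">\n'
        "  <form>\n"
        # Escape the message so that it cannot break the XML markup.
        f"    <block><prompt>{escape(message)}</prompt></block>\n"
        "  </form>\n"
        "</vxml>\n"
    )
    # A CGI script writes its headers, then a blank line, then the document.
    return "Content-Type: application/voicexml+xml\n\n" + body


print(handle_request("symbol=acme"))
```

The interpreter on the voice gateway would fetch this response like any other web page and speak the prompt to the caller.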

Corporate Value
For web-based businesses that currently rely on consumers with personal computers, integrating a voice portal can expand their consumer base by taking advantage of the 1.5 billion phones in existence and the growing number of mobile phone users. Instead of browsing a website for data on a personal computer, users will be able to access the same data through a phone, since the voice browser takes the place of the web browser. VoiceXML is especially valuable to businesses engaged in e-commerce because it was specifically created to work with Web applications. This will allow companies to maintain their current communication systems with minimal change and expense to their current system’s architecture.

Benefits
There are many areas where voice services utilizing VoiceXML will be used. Some examples include automated banking, checking stock quotes or purchasing stocks, checking the status of bids at electronic auction sites, bill payment authorization, delivery scheduling, renting videos, ordering office supplies, and purchasing concert or game tickets. Because standard Web security features apply to the voice web, intranet applications can also be written in VoiceXML for inventory control, ordering supplies, and providing human resource services.

Speech Synthesis
Source: (http://www.commweb.com/article/COM20001003S0023)

Introduction
Speech synthesis, or Text-to-Speech (TTS), comprises the hardware, firmware, and software by which computers convert text into spoken numbers, words, and sentences. Synthesized speech requires ample processing power to iterate through the thousands of lines of rules and tables, formalized by linguists and coded by programmers, before uttering a phrase. A Text-to-Speech engine creates the natural sounds of speech (i.e., emphasis, volume, pauses, rate, and pitch) by taking text and turning it into an audible voice.

Two Approaches to Creating Sound
To make computer-generated speech sound natural, all of the attributes associated with human speech must be considered and programmed to fit the particular situation. This became possible once standards were created for specifying parameters, so that TTS engines could go beyond the obvious recognition of punctuation and sentence structure. The two basic approaches to speech synthesis are formant synthesis and the concatenative approach. Formant synthesis is the method of breaking down text and generating individual speech sounds electronically from scratch, mimicking the resonances of the human vocal tract. This method has the advantage of being adaptable to many different languages, and it occupies less memory than digitized speech fragments. The concatenative approach synthesizes speech by recording human speech and slicing it into stored fragments, which are then concatenated as needed.
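The concatenative approach can be reduced to a very small sketch: stored fragments are looked up and joined in order. The FRAGMENTS table and its sample values are invented for illustration (real systems store recorded audio and smooth the joins between fragments), and formant synthesis is not shown here.

```python
# Hypothetical recorded fragments: each unit maps to a short list of audio samples.
FRAGMENTS = {
    "h": [0, 3, 5],
    "e": [7, 9, 7],
    "l": [4, 2],
    "o": [6, 8, 6, 3],
}


def concatenate(units):
    """Join the stored fragments for the given units into one sample sequence."""
    samples = []
    for unit in units:
        if unit not in FRAGMENTS:
            raise KeyError(f"no recorded fragment for unit {unit!r}")
        samples.extend(FRAGMENTS[unit])
    return samples


speech = concatenate(["h", "e", "l", "o"])
print(len(speech), "samples")
```

The memory cost mentioned above is visible even here: every unit the system can speak needs its own stored fragment, whereas a formant synthesizer would compute the sound from parameters.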

Whether speech sounds are produced by formant synthesis or by concatenation, the TTS algorithm must consult strict rules before it can correctly read a sentence or pronounce a word. On a much more complex level, it must be able to parse sentences in order to know, for example, which syllables to stress in two words that look identical. It does this by consulting exception dictionaries, where such words and their phonetic parameters are stored, prior to pronunciation.

TTS Process
The Text-to-Speech process starts with a text file that may be the output of an application such as a word processor or a database. The next step is generally "text normalization." The text normalization component of a TTS system prepares text for phonetic translation by interpreting and converting abbreviations and acronyms and by deciphering numbers in context. It breaks down all symbolic material into letters, even though at this point the words still retain the idiosyncrasies of English (or another language's) spelling.
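A minimal sketch of text normalization, assuming a tiny hand-made abbreviation table and numbers below one thousand, is shown here. The table entries and function names are illustrative only; a real normalizer would also use context (for example, to decide whether "St." means "Street" or "Saint"), which this sketch does not attempt.

```python
import re

# Hypothetical abbreviation table; real systems use far larger, context-aware ones.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "No.": "Number"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")


def normalize(text: str) -> str:
    """Expand known abbreviations, then spell out numbers of up to three digits."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    return re.sub(r"\b\d{1,3}\b", lambda m: number_to_words(int(m.group())), text)


print(normalize("Dr. Lee lives at 42 Elm St."))
```

After this step the text contains only letters and ordinary words, which is the form the later phonetic stages expect.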

Phonetic Rules
The last pass before the phonetic code is delivered to the voice generator does some fine-tuning and smoothing: it modifies the duration of phonemes, adjusts pitch commands, and improves the transitions between phonemes. Its output is a detailed phonetic description of an utterance, to be converted into numeric targets for the voice generator.

Speech Recognition
Source: (http://www.voiceio.com/good.htm)

Introduction
Speech recognition is a computer application that converts an acoustic signal, captured by a microphone or a telephone, into a set of words. The technology serves two main purposes. First, it allows people to control a computer by speaking commands into a microphone connected to it. The user can tell the computer to open a document, save changes, delete a paragraph, or move the cursor, all without touching a key. Second, the user can dictate text using speech recognition in conjunction with a standard word processing program. When users speak into the microphone, their words appear on the computer screen in a word processing format, ready for revision and editing. Ultimately, the recognized words can be the final result for applications such as command and control, data entry, and document preparation.

How Speech Recognition Works
Speech recognition software offers discrete speech and continuous speech technology. Discrete speech recognition requires the user to speak one word at a time, whereas the more advanced technology, continuous speech recognition, allows the user to dictate by speaking in a normal conversational manner. As the user speaks, the software puts one or more words on the screen by matching the sound input with the information it has in the user's voice file.

Both kinds of speech recognition store frequently used words and related information in the computer's memory (RAM) for immediate use in guessing a word or string of words. This store is called the active dictionary. When new vocabulary is added, it enters the active dictionary. For less common vocabulary, speech recognition products keep a large back-up dictionary on the hard drive, so it is relatively rare for a user to say a word that is entirely unknown to the software.
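The two-tier lookup described above can be sketched as follows. The Vocabulary class and its method names are hypothetical, Python sets stand in for the in-memory and on-disk dictionaries, and promoting a back-up word into the active dictionary after it is used is an assumption of this sketch rather than something the source states.

```python
class Vocabulary:
    """Toy model of an active dictionary backed by a larger back-up dictionary."""

    def __init__(self, backup_words):
        self.active = set()  # stands in for the dictionary held in RAM
        self.backup = {w.lower() for w in backup_words}  # stands in for the one on disk

    def add_word(self, word):
        """New vocabulary enters the active dictionary directly."""
        self.active.add(word.lower())

    def lookup(self, word):
        """Report where a word was found: 'active', 'backup', or 'unknown'."""
        word = word.lower()
        if word in self.active:
            return "active"
        if word in self.backup:
            # Assumption: a word found on disk is promoted for faster reuse.
            self.active.add(word)
            return "backup"
        return "unknown"


vocab = Vocabulary(["telephone", "synthesis"])
vocab.add_word("VoiceXML")
print(vocab.lookup("voicexml"), vocab.lookup("telephone"), vocab.lookup("zyzzyva"))
```

Checking the small in-memory set first keeps common words fast, and the larger back-up set is consulted only for the rarer cases.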

As the user "trains" and speaks to the system, the software creates a user-specific voice file that contains a great deal of information about the user's voice qualities, pronunciations, and patterns of word usage. The last of these is available only with continuous speech technology. Both types of speech recognition software also capture the user's preferred vocabulary. The voice file in discrete speech recognition software is built primarily on the user's pronunciation of individual words, whereas the voice file in continuous speech recognition also contains information about the user's grammar and word usage. The software uses this acoustic and linguistic information to make its best guess at each word or phrase as it is dictated.

Challenges with Speech Recognition
The first challenge in speech recognition is to identify the difference between speech and noise. Unlike people, computers cannot filter out noise and still carry on a conversation, so they need help separating speech sounds from other sounds. A second challenge is to recognize speech from more than one speaker. Speech-recognition software can't easily adjust to the unique characteristics of every voice. Therefore, the software works best when it has a chance to adjust to each new speaker through training. A third challenge is programming a computer to distinguish between two or more phrases that sound alike. Speech-recognition programs don't understand what words mean, so they can't use common sense the way people do. Instead, they keep track of how frequently words occur by themselves and in the context of other words. This information helps the computer choose the most likely word or phrase from among several possibilities. Finally, computers won't understand mumbled speech or missing words. They only understand what was actually spoken and don't know enough to fill in the gaps by guessing what was meant.
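The frequency-based choice among sound-alike words can be sketched with simple counts. The training text below is invented for illustration, and the choose function is a hypothetical helper: it prefers the candidate seen most often after the previous word and falls back on how often each candidate occurs by itself.

```python
from collections import Counter

# Hypothetical training text; a real system would learn from far more data.
TRAINING = (
    "i want to go to the store . i have two dogs . "
    "i want two tickets . it is too late to go"
)

words = TRAINING.split()
unigrams = Counter(words)                 # how often each word occurs by itself
bigrams = Counter(zip(words, words[1:]))  # how often each word follows another


def choose(previous_word, candidates):
    """Pick the candidate seen most often after previous_word.

    Ties (including candidates never seen in that context) are broken by
    each candidate's overall frequency.
    """
    return max(candidates, key=lambda w: (bigrams[(previous_word, w)], unigrams[w]))


sound_alikes = ["to", "too", "two"]
print(choose("have", sound_alikes))
print(choose("is", sound_alikes))
print(choose("go", sound_alikes))
```

The program has no idea what "two" means; it only knows that "two" followed "have" in the text it was given, which is the kind of statistical evidence described above.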

Why use speech technology?
Speech technology can save companies thousands of dollars: the costs of operating a speech recognition system are far lower than those of a comparable human-powered call center. In most cases, the operating costs of a speech recognition system are about 10% of those of a comparable human-powered call center and one-third to one-half the cost of touch-tone or web systems. Secondly, there is no operator training required and there are no turnover issues to deal with. Finally, speech technology improves the customer experience because the customer never gets a busy signal and is never put on hold.

Helpful Links

VoiceXML
Speech Objects and VoiceXML. White Paper

VoiceXML Forum. 101 pages of everything you need to know about VoiceXML

The VoiceXML Approach. The history, advantages, applications and future of VoiceXML

VXML Code. Great Website for example code and VXML descriptions

What is VoiceXML? A brief article on VoiceXML with nice architectural pictorials

Speech Synthesis
Speech Synthesis Markup Language Specification. Great website that includes terminology and design concepts, text structure elements, sample code and demos

Speech Synthesis. An introduction to text-to-speech synthesis

Text to Speech Synthesis. History of Speech Synthesis Research at Bell Labs

Text to Speech Synthesis. AT&T Demo

Speech Recognition
Microsoft Research: Speech Technology. Dr. Who Project, Dr. Who Engine, MiPad and other technologies related to speech

Who Should Use Speech Recognition Technology? A vendor Web site providing recognition technology uses and testimonials

Speech Recognition Links. A plethora of links dealing with Speech Recognition

Speech Recognition Demo. Etrade demo