VoiceXML (Voice eXtensible Markup Language)
Source: (http://www.voicexmlreview.org/voicexml)
Introduction
Voice Extensible Markup Language (VoiceXML) is a technology that enables
human-computer interaction with the Internet through voice-recognition
applications, using a voice browser and/or a telephone. It uses speech
recognition and/or the touch-tone keypad for input, and pre-recorded audio and
Text-to-Speech (TTS) synthesis for output. It is based on the World Wide Web
Consortium's (W3C's) Extensible Markup Language (XML), thereby creating a common
language standard for application developers, platform vendors, and tool
providers, all of whom can benefit from code portability and reuse. In addition,
VoiceXML supports many sophisticated voice recognition and text-to-speech
technologies as well as more traditional tone and grunt-based interfaces.
Advantages
One of the many advantages of VoiceXML is that its applications build on
existing XML and web authoring tools, whereas traditional telephony applications
have been rather laborious to create. In addition, VoiceXML applications are
written in a declarative manner, in that the programmer tells the system “what
to do” rather than, as in the more traditional procedural manner, “how to do
it.” This change in programming style aids rapid application development and
links easily with existing web-based applications and Common Gateway Interface
(CGI) back ends. Because VoiceXML is an XML language that uses similar web
infrastructure, tools and Web servers, a programmer can create VoiceXML scripts
from existing XML data sources, including speech recognition applications. One
of the major benefits of programming with VoiceXML is that any telephone can
access VoiceXML applications through a voice browser running on a telephony
server, instead of relying on the current method of access, a traditional PC
with a Web browser.
Whereas HTML is commonly used for creating graphical Web applications, VoiceXML can be used for voice-enabled Web applications. Like HTML, VoiceXML is a well-defined set of tags that determine how specified data will be handled. VoiceXML accomplishes this task by organizing data into two types of dialogs, called menus and forms. Menus give options about what to do next and expect a response. Forms ask a particular question or provide particular information and may or may not expect a response. The typical VoiceXML voice browser of today runs on a specialized voice gateway node that is connected both to the public switched telephone network and to the Internet. These voice gateways extend the power of the Web to phones, allowing VoiceXML pages to be requested from a Web application in the same way as HTML pages.
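The two dialog types can be made concrete with a small sketch. The Python below builds a minimal VoiceXML-style page holding one menu and two forms, using only the standard library. The prompts, the dialog ids and the function name are invented for illustration, and the XML namespace declaration that a complete VoiceXML document would carry is left out to keep the example short.

```python
import xml.etree.ElementTree as ET


def build_menu_and_forms() -> str:
    """Return a small VoiceXML-style page with one menu and two forms."""
    vxml = ET.Element("vxml", version="2.0")

    # A menu gives options about what to do next and expects a response.
    menu = ET.SubElement(vxml, "menu", id="main")
    ET.SubElement(menu, "prompt").text = "Say weather or news."
    ET.SubElement(menu, "choice", next="#weather").text = "weather"
    ET.SubElement(menu, "choice", next="#news").text = "news"

    # A form that asks a particular question and expects a response.
    weather = ET.SubElement(vxml, "form", id="weather")
    field = ET.SubElement(weather, "field", name="city")
    ET.SubElement(field, "prompt").text = "Which city?"

    # A form that only provides information and expects no response.
    news = ET.SubElement(vxml, "form", id="news")
    block = ET.SubElement(news, "block")
    ET.SubElement(block, "prompt").text = "Here are the headlines."

    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(build_menu_and_forms())
```

Running the script prints the page as a single XML string, which is the kind of document a voice browser would fetch and interpret.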
How VoiceXML Works
The user uses a phone as the audio interface to call the VoiceXML platform,
which runs the VoiceXML interpreter, the speech recognizer and the speech
synthesizer engine. The voice prompter requests information from the user, and
the user says or keys in that information, which is then associated with the URL
being requested. The request is submitted through the CGI interface to back-end
code that generates VoiceXML pages dynamically. The VoiceXML interpreter
interprets the returned pages and presents the output as audio, and the user
provides further input by speaking or pressing touch-tone keys.
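The dynamic-generation step can be sketched as follows. This is not a real CGI script, only the page-building function such a script might call once it has read the caller's input from the request. The `QUOTES` table, the company names and the function name are hypothetical stand-ins for a real back end.

```python
from xml.sax.saxutils import escape

# Hypothetical data source standing in for a database or web back end.
QUOTES = {"acme": "42.50", "globex": "17.25"}


def render_quote_page(symbol: str) -> str:
    """Build the VoiceXML page returned to the voice browser for one request."""
    price = QUOTES.get(symbol.lower())
    if price is None:
        message = f"Sorry, I have no quote for {symbol}."
    else:
        message = f"{symbol} is trading at {price} dollars."
    # Escape the text so that caller-supplied input cannot break the markup.
    return (
        '<?xml version="1.0"?>\n'
        '<vxml version="2.0">\n'
        "  <form>\n"
        f"    <block><prompt>{escape(message)}</prompt></block>\n"
        "  </form>\n"
        "</vxml>\n"
    )


if __name__ == "__main__":
    print(render_quote_page("acme"))
```

The interpreter would then read the returned prompt aloud through the speech synthesizer.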
Corporate Value
For web-based businesses that
currently rely on consumers with personal computers, integrating a voice portal
will easily expand their consumer base by taking advantage of the 1.5 billion
phones in existence and the growing rate of mobile phone users. Instead of users
browsing a website for data through their personal computer, they will be able
to access the same data through a phone, since the web browser now becomes the
voice browser. VoiceXML is especially valuable to businesses utilizing
e-commerce because VoiceXML was specifically created to work with Web
applications. This will allow companies to maintain their current communication
systems with minimal change and expense to their current system’s architecture.
Benefits
There are many areas where voice services utilizing VoiceXML will be used. Some
examples include automated banking, checking stock quotes or purchasing stocks,
checking the status of bids at electronic auction sites, bill payment
authorization, delivery scheduling, renting videos, ordering office supplies,
and purchasing concert or game tickets. Because standard Web security features
apply to the voice web, intranet applications can also be written in VoiceXML
for inventory control, ordering supplies and providing human resource services.
Speech Synthesis
Source:
(http://www.commweb.com/article/COM20001003S0023)
Introduction
Speech synthesis, or Text-to-Speech (TTS), comprises the hardware, firmware and
software by which computers convert text into spoken numbers, words and
sentences. Synthesized speech requires ample processing power to iterate through
the thousands of lines of rules and tables, formalized by linguists and coded by
programmers, before uttering a phrase. A Text-to-Speech engine takes text and
turns it into an audible voice, recreating the natural qualities of speech
(i.e. emphasis, volume, pauses, rate and pitch).
Two Approaches to Creating Sound
To make computer-generated speech sound natural, all of the attributes
associated with human speech must be considered and programmed to fit the
particular situation. This became possible once standards were created for
specifying these parameters, so that TTS engines could go beyond the obvious
recognition of punctuation and sentence structure. The two basic approaches to
speech synthesis are formant synthesis and the concatenative approach. Formant
synthesis breaks down text and generates individual speech sounds electronically
from scratch, mimicking the resonances of the human vocal tract. This method has
the advantage of being adaptable to many different languages, and it occupies
less memory than digitized speech fragments. The concatenative approach
synthesizes speech by recording human speech and slicing it into stored
fragments, which are then concatenated as needed.
Whether built from formant or concatenated speech sounds, TTS systems rely on strict rules that the TTS algorithm must consult before it can correctly read a sentence or pronounce a word. On a much more complex level, the system must be able to parse sentences in order to know, for example, which syllable to stress in two words that look identical. It does so by consulting exception dictionaries, where such words and their phonetic parameters are stored.
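A toy version of the exception-dictionary lookup might look like the sketch below. Real systems parse the whole sentence. Here a single hypothetical rule ("to" before a word marks it as a verb) stands in for the parser, and the dictionary entries and stress notation are invented for illustration.

```python
# Hypothetical exception dictionary: the same spelling maps to different
# stress patterns depending on the part of speech.
EXCEPTIONS = {
    ("record", "noun"): "REC-ord",
    ("record", "verb"): "re-CORD",
    ("present", "noun"): "PRES-ent",
    ("present", "verb"): "pre-SENT",
}


def guess_part_of_speech(previous_word: str) -> str:
    """Toy stand-in for sentence parsing: 'to' signals a verb."""
    return "verb" if previous_word == "to" else "noun"


def pronounce(sentence: str) -> list[str]:
    """Replace words found in the exception dictionary with stressed forms."""
    words = sentence.lower().split()
    spoken = []
    for index, word in enumerate(words):
        previous = words[index - 1] if index > 0 else ""
        key = (word, guess_part_of_speech(previous))
        # Words without an entry pass through unchanged.
        spoken.append(EXCEPTIONS.get(key, word))
    return spoken


if __name__ == "__main__":
    print(pronounce("I want to record a record"))
```

The two occurrences of "record" come out with different stress because the word before each one differs.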
TTS Process
The Text-to-Speech process starts with a text file that may be the output of an
application such as a word processor or a database. The next step is generally
"text normalization." The text normalization component of all TTS systems
prepares text for phonetic translation by interpreting and converting
abbreviations and acronyms and by deciphering numbers in context. It breaks down
all symbolic material into letters, even though at this point the resulting
words still retain the idiosyncrasies of English (or another language's)
spelling.
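Text normalization can be illustrated with a deliberately small sketch. It expands a handful of abbreviations and spells out whole numbers from 0 to 999. The abbreviation table and function names are assumptions, and the sketch ignores context entirely (a real normalizer would have to decide, for instance, whether "St." means "street" or "saint").

```python
import re

# Hypothetical abbreviation table; a real one would be far larger.
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "st.": "street"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer between 0 and 999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    hundreds, rest = divmod(n, 100)
    tail = "" if rest == 0 else " " + number_to_words(rest)
    return ONES[hundreds] + " hundred" + tail


def normalize(text: str) -> str:
    """Turn abbreviations and short numbers into plain lowercase words."""
    words = []
    for token in text.split():
        lower = token.lower()
        if lower in ABBREVIATIONS:
            words.append(ABBREVIATIONS[lower])
        elif re.fullmatch(r"\d{1,3}", token):
            words.append(number_to_words(int(token)))
        else:
            words.append(lower)
    return " ".join(words)


if __name__ == "__main__":
    print(normalize("Dr. Smith lives at 221 Baker St."))
```

The output is still ordinary spelling, as the text notes. Converting it to phonemes is a later stage.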
Phonetic Rules
The last pass before the phonetic code is delivered to the voice generator does
some fine-tuning and smoothing: it modifies the duration of phonemes, adjusts
pitch commands, and improves the transitions between phonemes. Its output is a
detailed phonetic description of an utterance, to be converted into numeric
targets for the voice generator.
Speech Recognition
Source: (http://www.voiceio.com/good.htm)
Introduction
Speech recognition is a computer application that converts an acoustic signal,
captured by a microphone or a telephone, into a set of words. This technology
allows people to control a computer by speaking commands into a microphone that
is connected to it. First, the user can tell the computer to execute commands
such as opening a document, saving changes, deleting a paragraph, and moving the
cursor, all without touching a key. Second, the user can dictate text using
speech recognition in conjunction with a standard word processing program. When
users speak into the microphone, their words appear on the computer screen in a
word processing format, ready for revision and editing. Ultimately, the
recognized words can be the final results for applications such as command and
control, data entry, and document preparation.
How Speech Recognition Works
Speech recognition software offers discrete speech and continuous speech
technology. Discrete speech recognition requires the user to speak one word at a
time, whereas the more advanced technology, continuous speech recognition,
allows the user to dictate by speaking in a normal conversational manner. As the
user speaks, the software puts one or more words on the screen by matching the
sound input with the information it has in the user's voice file.
Both kinds of speech recognition store frequently used words and related information in the computer's memory (RAM) for immediate use in guessing a word or string of words. This store is called the active dictionary. When new vocabulary is added, it enters the active dictionary. For less common vocabulary, speech recognition products keep a large back-up dictionary on the hard drive, so it is relatively rare for a user to say a word that is entirely unknown to the software.
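The two-tier arrangement can be sketched as a simple data structure. The class and method names below are hypothetical, and an in-memory set stands in for the back-up dictionary that a real product would keep on disk.

```python
class Vocabulary:
    """Two-tier word store: a small active set plus a larger back-up set."""

    def __init__(self, active, backup):
        self.active = set(active)  # frequently used words, held in memory
        self.backup = set(backup)  # stands in for the dictionary on disk

    def add_word(self, word: str) -> None:
        """New vocabulary goes straight into the active dictionary."""
        self.active.add(word)

    def lookup(self, word: str) -> str:
        """Report which tier knows the word, checking the active tier first."""
        if word in self.active:
            return "active"
        if word in self.backup:
            return "backup"
        return "unknown"


if __name__ == "__main__":
    vocab = Vocabulary(["open", "save"], ["sesquipedalian"])
    vocab.add_word("voicexml")
    for word in ["open", "sesquipedalian", "voicexml", "xyzzy"]:
        print(word, vocab.lookup(word))
```

Checking the active tier first mirrors the idea that common words should be found without touching the slower back-up store.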
As the user "trains" and speaks to the system, the software creates a user-specific voice file that contains a great deal of information about the user's voice qualities, pronunciations and patterns of word usage; the last of these is available only with continuous speech technology. Both types of speech recognition software also capture the user's preferred vocabulary. The voice file in discrete speech recognition software is built primarily on the user's pronunciation of individual words, whereas the voice file in continuous speech recognition also contains information about the user's grammar and word usage. The software uses this acoustic and linguistic information to make its best guess at each word or phrase as it is dictated.
Challenges with Speech Recognition
The first challenge in speech recognition is to tell speech apart from noise.
Computers cannot filter out noise and still carry on a conversation the way
people do, so they need help separating speech sounds from other sounds. A
second challenge is to recognize speech from more than one speaker.
Speech-recognition software can't easily adjust to the unique characteristics of
every voice. Therefore, the software works best when it has been trained on each
new speaker and so has had a chance to adjust to that speaker's voice.
A third challenge is programming a computer to distinguish between two or more
phrases that sound alike. Speech-recognition programs don't understand what words
mean, so they can't use common sense the way people do. Instead, they keep track
of how frequently words occur by themselves and in the context of other words. This information helps the computer choose the most likely word or phrase from
among several possibilities. Finally, computers won't
understand mumbled speech or missing words. They only understand what was
actually spoken and don't know enough to fill in the gaps by guessing what was
meant.
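The frequency-based choice between sound-alike words can be sketched with simple word and word-pair counts. The training sentences and function names are invented for illustration, and a real recognizer would combine such counts with acoustic scores rather than use them alone.

```python
from collections import Counter


def train(sentences):
    """Count how often words occur alone and after a given previous word."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def choose(previous_word, candidates, unigrams, bigrams):
    """Pick the candidate seen most often after previous_word.

    Ties on the in-context count fall back to standalone frequency.
    """
    return max(
        candidates,
        key=lambda word: (bigrams[(previous_word, word)], unigrams[word]),
    )


if __name__ == "__main__":
    corpus = [
        "please write soon",
        "write to me",
        "turn right here",
        "the right answer",
    ]
    unigrams, bigrams = train(corpus)
    print(choose("please", ["right", "write"], unigrams, bigrams))
    print(choose("turn", ["right", "write"], unigrams, bigrams))
```

The program never learns what "right" or "write" means. It only prefers whichever spelling it has seen more often in the same position.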
Why use speech technology?
Speech technology can save companies thousands of dollars compared with the
cost of operating a similar human-powered call center. In most cases, the
operating costs of a speech recognition system are about 10% of those of a
comparable human-powered call center and one-third to one-half the cost of
touch-tone or web systems. Second, no operator training is required and there
are no turnover issues to deal with. Finally, speech technology improves the
customer experience because the customer never gets a busy signal and is never
put on hold.
Helpful Links
VoiceXML
Speech Objects and VoiceXML. White Paper
VoiceXML Forum. 101 pages of everything you need to know about VoiceXML
The VoiceXML Approach. The history, advantages, applications and future of VoiceXML
VXML Code. Great Website for example code and VXML descriptions
What is VoiceXML? A brief article on VoiceXML with nice architectural pictorials
Speech Synthesis
Speech Synthesis Markup Language Specification. Great website that includes terminology and design concepts, text structure elements, sample code and demos
Speech Synthesis. An introduction into text to speech synthesis
Text to Speech Synthesis. History of Speech Synthesis Research at Bell Labs
Text to Speech Synthesis. AT&T Demo
Speech Recognition
Microsoft Research: Speech Technology. Dr. Who Project, Dr. Who Engine, MiPad and other technologies related to speech
Who Should Use Speech Recognition Technology? A vendor Web site providing recognition technology uses and testimonials
Speech Recognition Links. A plethora of links dealing with Speech Recognition
Speech Recognition Demo. Etrade demo