Introduction to VoiceXML 2.0
Rob Marchand, Director of Product Management, VoiceGenie Technologies Inc.
Introduction to VoiceXML
• Audience
  o Managers and programmers with little experience with VoiceXML
• Attendees will learn
  o The basic principles of VoiceXML
  o Just enough syntax to design and code simple speech applications requiring voice menus and voice forms
VoiceXML in the Marketplace
• VoiceXML 2.0 is now ratified as a Recommendation (i.e., an official standard) by the W3C
• Hundreds of millions of VoiceXML calls are answered every day
VoiceXML is the standard for building speech-enabled applications
W3C and VoiceXML Forum
• W3C manages the technical evolution and development of the VoiceXML language
• VoiceXML Forum focuses on providing best practices, certification testing, resources and tools
Together the W3C and VoiceXML Forum accelerate the adoption of VoiceXML-based speech applications
Outline
• Motivation for VoiceXML
• W3C Speech Interface Framework Languages
• Dialog: VoiceXML 2.0
• Speech Synthesis: SSML
• Grammars: SRGS
• Semantic Interpretation: SI
• Call Control
Motivation for Speech Applications
• Users access Web sites from any telephone, anywhere, any time.
• Speaking and listening are the natural usage modes for phones.
Speech-enabled Applications Are Possible Now
• Increased computing power at less expense
  o Due to improved chip design and manufacturing techniques
• Improved speech recognition
  o Due to refinements to basic speech recognition algorithms
• Improved dialog design using voice
  o Minimizes the number of words and phrases that the speech recognizer must process at any point during the dialog
Strength of VoiceXML Applications
• Traditional system-directed dialogs for novice users
• Mixed initiative dialogs for experienced users
• Novice users smoothly become experienced users at their own pace
Limitations of VoiceXML Applications
• No special analysis of speech input
  o Not suitable for training speech skills: reading, ESL, singing, etc.
• VUI conversational bandwidth is slower than GUI conversational bandwidth
  o Using a VUI is like drinking from Lake Superior with a straw
Exercise 1
• Name or describe a speech application you could use at work.
• Name or describe a speech application you or a family member could use at home.
XML
• XML = eXtensible Markup Language
• Elements are surrounded by tags
    <prompt>Welcome to the voice system</prompt>
• Elements may be nested
    <prompt> Welcome to Ajax Travel <break/> we have the cheapest fares </prompt>
• Elements may have attributes
    <choice next="#boat">
    <grammar type="application/srgs+xml" version="1.0" root="by_boat" src="boat.grxml"/>
• Because "<", ">", and "&" have special meanings, write
  o "&lt;" in place of "<"
  o "&gt;" in place of ">"
  o "&amp;" in place of "&"
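The escaping rule matters most in attribute values that hold ECMAScript tests, where "<" and "&&" are common. The fragment below is a minimal sketch of that situation; the variable names `tries` and `confirmed` and the prompt wording are assumptions made for illustration and do not come from the slides.

```xml
<form>
  <!-- Hypothetical counters, declared only for this illustration -->
  <var name="tries" expr="1"/>
  <var name="confirmed" expr="false"/>
  <block>
    <!-- The ECMAScript test is: tries < 3 && !confirmed -->
    <if cond="tries &lt; 3 &amp;&amp; !confirmed">
      <prompt> Smith &amp; Jones Travel has fares under 100 dollars. </prompt>
    </if>
  </block>
</form>
```

Writing the raw characters inside `cond` would make the document ill-formed, so the XML parser would reject it before any dialog runs.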
[Figure: architecture diagram. Labels: Web Server, Database Server (DB), HTML, Scripts, Documents, Multimedia Files, Web Browser; VoiceXML, Scripts, Grammars, Audio Files, Voice Browser on a Speech Server/Gateway with ASR (capture voice and DTMF) and TTS (replay audio)]
W3C Speech Interface Framework
[Figure: framework components: VoiceXML 2.0, Speech Synthesis, Grammar, Semantic Interpretation, Call Control, Other]
Status of W3C Speech Interface Languages
[Figure: chart placing VoiceXML 2.0, Synthesis, Grammar, Semantic Interpretation, Call Control, VoiceXML 2.1, V3, and PLS along the W3C stages Requirements, Working Draft, Last Call, Candidate Recommendation, Proposed Recommendation, Recommendation]
VoiceXML 2.0 Fragment
One fragment can combine four languages: the dialog language (VoiceXML 2.0), the Speech Synthesis Markup Language (SSML), the Speech Recognition Grammar Specification (SRGS), and Semantic Interpretation (SI).
<?xml version="1.0"?>
<vxml version="2.0">                                   <!-- Dialog language (VoiceXML 2.0) -->
  <form>
    …
    <field>
      <prompt>                                          <!-- SSML -->
        Which account <break/>
        <emphasis> savings </emphasis> or <emphasis> checking </emphasis>
      </prompt>
      <grammar type="application/srgs+xml" root="account_type" mode="voice">   <!-- SRGS -->
        <rule id="account_type">
          <one-of>
            <item> savings </item>
            <item> checking </item>
            <item> CD </item>
            <item> certificate of deposit <tag>$ = "CD";</tag> </item>          <!-- SI -->
          </one-of>
        </rule>
      </grammar>
    </field>
    …
  </form>
  …
</vxml>
VoiceXML 2.0 features
• Menus, forms, sub-dialogs
  o <menu>, <form>, <subdialog>
• Inputs
  o Speech recognition <grammar>
  o Recording <record>
  o Keypad <grammar mode="dtmf">
• Output
  o Audio files <audio>
  o Text-to-speech <prompt>
• Variables
  o <var> <script> <assign>
• Events
  o <nomatch>, <noinput>, <help>, <catch>, <throw>
• Transition and submission
  o <goto>, <submit>
• Telephony
  o Connection control: <transfer>, <disconnect>
  o Telephony information
• Platform
  o Objects
• Performance
  o Fetch
A Typical Voice Menu
<menu>
  <prompt>
    <audio src="http://www.ajax.com/three_blind_mice.wav"/>
    Do you want to listen, next, prior, buy, or exit?
  </prompt>
  <choice next="http://www.ajax.com/listen.vxml"> listen </choice>
  <choice next="http://www.ajax.com/next.vxml"> next </choice>
  <choice next="http://www.ajax.com/prior.vxml"> prior </choice>
  <choice next="http://www.ajax.com/buy.vxml"> buy </choice>
  <choice next="http://www.ajax.com/exit.vxml"> exit </choice>
</menu>
Exercise 2: Write a menu that asks the user a "yes/no" question to confirm that the user wants to buy the audio "three blind mice".
Answer to Exercise 2
A "yes/no" menu
<menu>
  <prompt> Do you want to buy three blind mice now? </prompt>
  <choice next="http://www.ajax.com/yes.vxml"> yes </choice>
  <choice next="http://www.ajax.com/no.vxml"> no </choice>
</menu>
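A menu on its own is only a fragment. The sketch below shows one way the yes/no menu might sit inside a complete VoiceXML 2.0 document. The menu id `confirm_purchase`, the keypad support, and the retry wording are assumptions added for illustration; only the two `<choice>` targets come from the slide.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- dtmf="true" lets the caller press 1 or 2 instead of speaking -->
  <menu id="confirm_purchase" dtmf="true">
    <prompt>
      Do you want to buy three blind mice now?
      <!-- enumerate reads the available choices back to the caller -->
      <enumerate/>
    </prompt>
    <choice next="http://www.ajax.com/yes.vxml"> yes </choice>
    <choice next="http://www.ajax.com/no.vxml"> no </choice>
    <!-- Shorthand event handlers, scoped to this menu -->
    <noinput> Please say yes or no. </noinput>
    <nomatch> I did not understand. Please say yes or no. </nomatch>
  </menu>
</vxml>
```

Each `<choice>` contributes its text to the menu's grammar, so no separate `<grammar>` element is needed for a simple menu like this.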
Typical Form Fill-In
<form>
  <block> <prompt> Welcome to the electronic payment system. </prompt> </block>
  <field name="card_number">
    <prompt> Please enter your credit card number. </prompt>
    <grammar src="http://www.ajax.com/credit_card_number.grxml"/>
  </field>
  <field name="date">
    <prompt> Please enter your expiration date. </prompt>
    <grammar src="http://www.ajax.com/credit_card_date.grxml"/>
  </field>
</form>
A prompt that belongs to the form as a whole goes inside a <block>, because <prompt> is not allowed as a direct child of <form>.
Exercise 3: Write a form that solicits the month, day, and year for the user's birth date.
Answer to Exercise 3
<form>
  <block> <prompt> When were you born? </prompt> </block>
  <field name="month">
    <prompt> What month? </prompt>
    <grammar src="http://www.ajax.com/month.grxml"/>
  </field>
  <field name="day">
    <prompt> What day of the month? </prompt>
    <grammar src="http://www.ajax.com/day.grxml"/>
  </field>
  <field name="year">
    <prompt> What year? </prompt>
    <grammar src="http://www.ajax.com/year.grxml"/>
  </field>
</form>
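Once every field holds a value, a form usually does something with the results. The sketch below extends the birth-date form with a form-level `<filled>` element that confirms the values and sends them to a server. The form id and the submit URL `http://www.ajax.com/birthdate` are hypothetical, and what `<value>` speaks depends on what each grammar returns.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="birth_date">
    <block> When were you born? </block>
    <field name="month">
      <prompt> What month? </prompt>
      <grammar src="http://www.ajax.com/month.grxml" type="application/srgs+xml"/>
    </field>
    <field name="day">
      <prompt> What day of the month? </prompt>
      <grammar src="http://www.ajax.com/day.grxml" type="application/srgs+xml"/>
    </field>
    <field name="year">
      <prompt> What year? </prompt>
      <grammar src="http://www.ajax.com/year.grxml" type="application/srgs+xml"/>
    </field>
    <!-- mode="all": runs only after all three named fields are filled -->
    <filled namelist="month day year" mode="all">
      <prompt>
        You said <value expr="month"/> <value expr="day"/>, <value expr="year"/>.
      </prompt>
      <!-- Hypothetical server endpoint that receives the three values -->
      <submit next="http://www.ajax.com/birthdate" namelist="month day year"/>
    </filled>
  </form>
</vxml>
```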
Event Handlers
• Deal with exceptional or error conditions
• Control mechanism for dialog turn retries
  o <catch event="noinput"> … </catch>
  o <catch event="nomatch"> … </catch>
  o <catch event="help"> … </catch>
• Shorthand notation available
  o <noinput> … </noinput>, etc.
• Scoped according to where they occur
  o <form>, <field>, etc.
Adding Event Handlers
<form>
  <block> <prompt> When were you born? </prompt> </block>
  <field name="month">
    <catch event="noinput"> … </catch>
    <catch event="nomatch"> … </catch>
    <prompt> What month? </prompt>
    <grammar src="http://www.ajax.com/month.grxml"/>
  </field>
  …
</form>
Default Event Handlers
<catch event="nomatch">
  <prompt> I did not understand, please try again. </prompt>
</catch>
<catch event="help">
  <prompt> Sorry, no help is available. </prompt>
</catch>
<catch event="noinput">
  <prompt> I did not hear anything, please speak again. </prompt>
</catch>
Exercise 4
Write event handlers for the month field
<catch event="nomatch">
  <prompt> _____________ </prompt>
</catch>
<catch event="help">
  <prompt> __________ </prompt>
</catch>
<catch event="noinput">
  <prompt> __________________ </prompt>
</catch>
Answer to Exercise 4
Write event handlers for the month field
<catch event="nomatch">
  <prompt> Which month, for example, January, February, or March? </prompt>
</catch>
<catch event="help">
  <prompt> In what month were you born? </prompt>
</catch>
<catch event="noinput">
  <prompt> Say the name of the month you were born in. </prompt>
</catch>
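Event handlers can also vary with the number of failures, which keeps the first retry short and makes later ones more explicit. The fragment below is a sketch of that idea using the `count` attribute on the shorthand handlers; the escalation wording and the decision to end the session on the third failure are assumptions for illustration.

```xml
<field name="month">
  <prompt> What month? </prompt>
  <grammar src="http://www.ajax.com/month.grxml" type="application/srgs+xml"/>
  <!-- First recognition failure: brief reminder -->
  <nomatch count="1"> Which month? </nomatch>
  <!-- Second failure: give examples -->
  <nomatch count="2">
    Please say a month, for example, January, February, or March.
  </nomatch>
  <!-- Third failure: give up politely (an illustrative policy) -->
  <nomatch count="3">
    <prompt> Sorry, I am having trouble understanding you. Goodbye. </prompt>
    <exit/>
  </nomatch>
  <noinput> Say the name of the month you were born in. </noinput>
  <help> In what month were you born? </help>
</field>
```

The interpreter picks the handler with the highest `count` that does not exceed the number of times the event has occurred, so the `count="2"` handler is used on the second failure only.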
Speech Synthesis ML
Processing stages: Structure Analysis, Text Normalization, Text-to-Phoneme Conversion, Prosody Analysis, Waveform Production
• Structure Analysis
  o Markup support: p, s
  o Non-markup behavior: infer structure by automated text analysis
Before and after Structure Analysis
• Before structure analysis
    Dr. Smith lives at 214 Elm Dr. He weighs 214 lb. He plays bass guitar. He also likes to fish; last week he caught a 19 lb. bass.
• After structure analysis
    <p>
      <s> Dr. Smith lives at 214 Elm Dr. </s>
      <s> He weighs 214 lb. </s>
      <s> He plays bass guitar. </s>
      <s> He also likes to fish; last week he caught a 19 lb. bass. </s>
    </p>
Speech Synthesis ML
• Text Normalization
  o Markup support: say-as for dates, times, etc.; sub for aliasing
  o Non-markup behavior: automatically identify and convert constructs
After Text Normalization
<p>
  <s> <sub alias="doctor">Dr.</sub> Smith lives at 214 Elm <sub alias="drive">Dr.</sub> </s>
  <s> He weighs 214 <sub alias="pounds">lb.</sub> </s>
  <s> He plays bass guitar. </s>
  <s> He also likes to fish; last week he caught a 19 <sub alias="pound">lb.</sub> bass. </s>
</p>
After Text Normalization (with say-as)
<p>
  <s> <sub alias="doctor">Dr.</sub> Smith lives at <say-as interpret-as="address">214</say-as> Elm <sub alias="drive">Dr.</sub> </s>
  <s> He weighs <say-as interpret-as="number">214</say-as> <sub alias="pounds">lb.</sub> </s>
  <s> He plays bass guitar. </s>
  <s> He also likes to fish; last week he caught a <say-as interpret-as="number">19</say-as> <sub alias="pound">lb.</sub> bass. </s>
</p>
Speech Synthesis ML
• Text-to-Phoneme Conversion
  o Markup support: phoneme, say-as
  o Non-markup behavior: look up in pronunciation dictionary
After text-to-phoneme conversion
<p>
  <s> <sub alias="doctor">Dr.</sub> Smith lives at <say-as interpret-as="address">214</say-as> Elm <sub alias="drive">Dr.</sub> </s>
  <s> He weighs <say-as interpret-as="number">214</say-as> <sub alias="pounds">lb.</sub> </s>
  <s> He plays <phoneme alphabet="ipa" ph="beɪs">bass</phoneme> guitar. </s>
  <s> He also likes to fish; last week he caught a <say-as interpret-as="number">19</say-as> <sub alias="pound">lb.</sub> <phoneme alphabet="ipa" ph="bæs">bass</phoneme>. </s>
</p>
Speech Synthesis ML
• Prosody Analysis
  o Markup support: emphasis, break, prosody
  o Non-markup behavior: automatically generate prosody through analysis of document structure and sentence syntax
Prosody Analysis (Initial text)
<prompt>
  Environmental control menu. Do you want to adjust the lighting or temperature?
</prompt>
Prosody Analysis (Add pause at phrase boundaries)
<prompt>
  Environmental control menu <break strength="medium"/>
  Do you want to adjust the lighting or temperature?
</prompt>
Prosody Analysis (De-emphasize familiar words)
<prompt>
  Environmental control menu <break strength="medium"/>
  <emphasis level="reduced"> Do you want to adjust </emphasis>
  the lighting or temperature?
</prompt>
Prosody Analysis (Pause to let the listener catch up)
<prompt>
  Environmental control menu <break/>
  <emphasis level="reduced"> do you want to adjust </emphasis>
  the lighting <break/> or temperature?
</prompt>
Prosody Analysis (Add emphasis to focus listener's attention)
<prompt>
  Environmental control menu <break/>
  <emphasis level="reduced"> do you want to adjust the </emphasis>
  <emphasis level="strong"> lighting </emphasis> <break/>
  or <emphasis level="strong"> temperature? </emphasis>
</prompt>
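The slides list `prosody` among the prosody-analysis markup but only demonstrate `break` and `emphasis`. The sketch below shows how the same prompt might use `<prosody>` and a timed `<break>` instead. The specific rate, volume, and pause values are assumptions chosen for illustration; how they sound depends on the synthesizer.

```xml
<prompt>
  Environmental control menu
  <!-- An explicit half-second pause instead of a strength-based one -->
  <break time="500ms"/>
  do you want to adjust the
  <!-- Slow down and raise the volume on the two key words -->
  <prosody rate="slow" volume="loud"> lighting </prosody>
  <break strength="weak"/>
  or
  <prosody rate="slow" volume="loud"> temperature? </prosody>
</prompt>
```

`<emphasis>` leaves the rendering to the synthesizer, while `<prosody>` asks for particular changes, so the first is usually the more portable choice.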
Speech Synthesis ML
• Waveform Production
  o Markup support: voice, audio (audio icons, branding, advertising)
Waveform Production
<prompt>
  <audio src="http://www.example.com/adjust.wav">
    Environmental control menu. Do you want to adjust the lighting or temperature?
  </audio>
</prompt>
Exercise 5 (insert SSML commands)
<prompt>
  Welcome to Ajax Bank do you want to withdraw or deposit funds?
</prompt>
Answer to Exercise 5
<prompt>
  Welcome to Ajax Bank <break/>
  <emphasis level="reduced"> do you want to </emphasis>
  <emphasis level="strong"> withdraw </emphasis> <break/>
  or <emphasis level="strong"> deposit </emphasis> funds?
</prompt>
Grammars
• Describe what the user may say at a point in the dialog
• Enable the speech recognition engine to work faster and more accurately
• Consist of one or more "rules"
Example Grammar
<grammar type="application/srgs+xml" root="zero_to_ten" mode="voice">
  <rule id="zero_to_ten">
    <one-of>
      <item> zero </item>
      <item> <ruleref uri="#single_digit"/> </item>
      <item> ten </item>
    </one-of>
  </rule>
  <rule id="single_digit">
    <one-of>
      <item> one </item> <item> two </item> <item> three </item>
      <item> four </item> <item> five </item> <item> six </item>
      <item> seven </item> <item> eight </item> <item> nine </item>
    </one-of>
  </rule>
</grammar>
• type="application/srgs+xml": this is the XML form of grammars
• root="zero_to_ten": the grammar processor should start with the "zero_to_ten" rule
• mode="voice": this is a grammar used by the speech recognizer. (There may also be grammars for DTMF recognizers.)
• The "single_digit" rule describes single digits; the "zero_to_ten" rule describes digits zero through ten
• <one-of> describes alternatives; each alternative is an <item>
• <ruleref> references another rule
Exercise 6: Write a grammar that recognizes the numbers zero to nineteen.
Answer to Exercise 6
Write a grammar for zero to nineteen
<grammar type="application/srgs+xml" root="zero_to_19" mode="voice">
  <rule id="zero_to_19">
    <one-of>
      <item> zero </item>
      <item> <ruleref uri="#single_digit"/> </item>
      <item> ten </item> <item> eleven </item> <item> twelve </item>
      <item> thirteen </item> <item> fourteen </item> <item> fifteen </item>
      <item> sixteen </item> <item> seventeen </item> <item> eighteen </item>
      <item> nineteen </item>
    </one-of>
  </rule>
  <rule id="single_digit">
    <one-of>
      <item> one </item> <item> two </item> <item> three </item>
      <item> four </item> <item> five </item> <item> six </item>
      <item> seven </item> <item> eight </item> <item> nine </item>
    </one-of>
  </rule>
</grammar>
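The grammars so far use mode="voice". As a companion, here is a minimal sketch of a keypad grammar in the same inline style; the rule name `key_digit` is an assumption chosen for illustration. In a DTMF grammar the tokens are the keys themselves rather than spoken words.

```xml
<grammar type="application/srgs+xml" version="1.0" root="key_digit" mode="dtmf">
  <!-- Each item is one telephone key press -->
  <rule id="key_digit" scope="public">
    <one-of>
      <item> 0 </item> <item> 1 </item> <item> 2 </item>
      <item> 3 </item> <item> 4 </item> <item> 5 </item>
      <item> 6 </item> <item> 7 </item> <item> 8 </item>
      <item> 9 </item>
    </one-of>
  </rule>
</grammar>
```

A field can carry both a voice grammar and a DTMF grammar, so callers may either speak or press a key.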
More Grammar Elements
• Repeat and optional
    <rule id="goodness" scope="public">
      <item repeat="0-3"> very </item> good
    </rule>
• Sequence
    <rule id="twenty_thru_twentynine">
      twenty <ruleref uri="#single_digit"/>
    </rule>
• Garbage
    <rule id="James_Lewis">
      <item> James <ruleref special="GARBAGE"/> Lewis </item>
    </rule>
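The `repeat` attribute accepts an exact count as well as a range. The grammar below is a sketch of a three-digit telephone extension that combines optional words with an exact repeat; the rule names and the phrase itself are assumptions made for illustration.

```xml
<grammar type="application/srgs+xml" version="1.0" root="extension" mode="voice">
  <rule id="extension" scope="public">
    <!-- repeat="0-1" makes the word optional -->
    <item repeat="0-1"> extension </item>
    <!-- repeat="3" requires exactly three digits in sequence -->
    <item repeat="3"> <ruleref uri="#digit"/> </item>
    <item repeat="0-1"> please </item>
  </rule>
  <rule id="digit">
    <one-of>
      <item> zero </item> <item> one </item> <item> two </item>
      <item> three </item> <item> four </item> <item> five </item>
      <item> six </item> <item> seven </item> <item> eight </item>
      <item> nine </item>
    </one-of>
  </rule>
</grammar>
```

This accepts "extension four one seven", "four one seven please", and plain "four one seven", but not a two-digit or four-digit string.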
Exercise 7
• Write a grammar that recognizes the numbers zero to thirty-nine
Answer to Exercise 7
Write a grammar for zero to thirty-nine
<grammar type="application/srgs+xml" root="zero_to_39" mode="voice">
  <rule id="zero_to_39">
    <one-of>
      <item> zero </item>
      <item> <ruleref uri="#single_digit"/> </item>
      <item> <ruleref uri="#teens"/> </item>
      <item> twenty </item>
      <item> twenty <ruleref uri="#single_digit"/> </item>
      <item> thirty </item>
      <item> thirty <ruleref uri="#single_digit"/> </item>
    </one-of>
  </rule>
  <rule id="single_digit">
    <one-of>
      <item> one </item> <item> two </item> <item> three </item>
      <item> four </item> <item> five </item> <item> six </item>
      <item> seven </item> <item> eight </item> <item> nine </item>
    </one-of>
  </rule>
  <rule id="teens">
    <one-of>
      <item> ten </item> <item> eleven </item> <item> twelve </item>
      <item> thirteen </item> <item> fourteen </item> <item> fifteen </item>
      <item> sixteen </item> <item> seventeen </item> <item> eighteen </item>
      <item> nineteen </item>
    </one-of>
  </rule>
</grammar>
Reusing existing grammars
<grammar type="application/srgs+xml" root="size" src="http://www.example.com/size.grxml"/>
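An existing grammar can also be reused from inside another grammar, by pointing a `<ruleref>` at it. The sketch below assumes that `size.grxml` exposes a public rule named `size`; the surrounding `order` rule and its wording are hypothetical.

```xml
<grammar type="application/srgs+xml" version="1.0" root="order" mode="voice">
  <rule id="order" scope="public">
    I would like a
    <!-- The fragment after # names the rule to use in the external grammar -->
    <ruleref uri="http://www.example.com/size.grxml#size"/>
    pizza
  </rule>
</grammar>
```

Only rules declared with scope="public" in the external grammar can be referenced this way.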
Semantic Interpretation
• To create smart voice user interfaces, we need to extract the semantic information from speech utterances
• Example:
  o Utterance: "I want to fly from Dublin to Paris"
  o Semantic Interpretation: { origin: "Dublin", destination: "Paris" }
Semantic Interpretation
[Figure: processing flow for the utterance "fourteen"]
• The ASR matches the utterance against a grammar with Semantic Interpretation scripts and produces the text "fourteen"
• The matching grammar item carries the script:
    <item> fourteen <tag>$.quantity="14";</tag> </item>
• The Semantic Interpretation Processor turns the text into an ECMAScript object: { quantity: "14" }
• The VoiceXML Interpreter receives the object (quantity = "14") and passes it to the application with <submit>
Semantic Interpretation
• Semantic Interpretation defines the content of <tag>s in SRGS grammars
• Two kinds of syntax for <tag> contents:
  o Semantic Literals (literal values)
  o Semantic Scripts (ECMAScript)
Semantic Interpretation
• Semantic Literals example:
    <rule id="drink">
      <one-of>
        <item> coca cola <tag> coke </tag> </item>
        <item> black fizzy stuff <tag> coke </tag> </item>
        <item> coke </item>      (default assignment: no <tag> is needed)
      </one-of>
    </rule>
Semantic Interpretation
• Semantic Scripts employ ECMAScript
• Advantages:
  o Richer structure (objects)
  o Ability to perform computations
Semantic Interpretation
• Example grammar rule with script syntax (utterance: "Large white"):
    <rule id="action">
      <one-of>
        <item> small <tag> $.size = "small"; </tag> </item>
        <item> medium <tag> $.size = "medium"; </tag> </item>
        <item> large <tag> $.size = "large"; </tag> </item>
      </one-of>
      <one-of>
        <item> green <tag> $.color = "green"; </tag> </item>
        <item> blue <tag> $.color = "blue"; </tag> </item>
        <item> white <tag> $.color = "white"; </tag> </item>
      </one-of>
    </rule>
• ECMAScript structure:
    action: { size: "large", color: "white" }
Semantic Interpretation
• Example grammar rule with script syntax (utterance: "What is 1 + 2 + 3?"):
    <rule id="calculator">
      What is
      <ruleref uri="#digit"/> <tag> $.total = $digit; </tag>
      <item repeat="1-">
        plus <ruleref uri="#digit"/> <tag> $.total = $.total + $digit; </tag>
      </item>
    </rule>
• ECMAScript structure:
    calculator: { total: 6 }
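The calculator rule refers to a `digit` rule that the slides do not show. The sketch below is one way that rule might look, written in the same draft `$` notation the slides use and limited to three digits for brevity; it is an assumption, not the author's grammar.

```xml
<rule id="digit">
  <one-of>
    <!-- Numeric values, not strings, so that + adds instead of concatenating -->
    <item> one <tag> $ = 1; </tag> </item>
    <item> two <tag> $ = 2; </tag> </item>
    <item> three <tag> $ = 3; </tag> </item>
  </one-of>
</rule>
```

With this rule, "What is one plus two plus three" sets total to 1, then 3, then 6, which matches the structure on the slide. Had the tags assigned the strings "1", "2", and "3", ECMAScript's + would have produced "123". Later versions of the Semantic Interpretation specification spell these variables differently (for example `out` and `rules.digit`), so check which syntax your platform expects.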
Exercise 8
Fill in the contents of <tag> (utterance: "From savings to checking")
• Grammar rule:
    <rule id="transfer">
      from
      <one-of>
        <item> savings <tag>____________</tag> </item>
        <item> checking <tag>____________</tag> </item>
      </one-of>
      to
      <one-of>
        <item> savings <tag>____________</tag> </item>
        <item> checking <tag>____________</tag> </item>
      </one-of>
    </rule>
• ECMAScript structure:
    transfer: { source_account: "savings", target_account: "checking" }
Answer to Exercise 8 (utterance: "From savings to checking")
• Grammar rule:
    <rule id="transfer">
      from
      <one-of>
        <item> savings <tag> $.source_account = "savings"; </tag> </item>
        <item> checking <tag> $.source_account = "checking"; </tag> </item>
      </one-of>
      to
      <one-of>
        <item> savings <tag> $.target_account = "savings"; </tag> </item>
        <item> checking <tag> $.target_account = "checking"; </tag> </item>
      </one-of>
    </rule>
• ECMAScript structure:
    transfer: { source_account: "savings", target_account: "checking" }
CCXML
• Provides call control support for VoiceXML and other dialog languages
• Separate interpreter from VoiceXML
  o Lives on its own thread
  o Handles asynchronous events
• May be used to create standalone applications
• Replaces <transfer> and <disconnect> currently in VoiceXML 2.0 (or provides the underlying support for them)
CCXML
[Figures: a VoiceXML-only configuration, followed by a VoiceXML + CCXML configuration]
CCXML
• Features
  o Multi-party conferencing (human and machine)
  o Sophisticated multi-call handling and control
  o Support for async external messages and events
  o More sophisticated call control than VoiceXML
  o Call control protocol independence
• Goal to support very high density and performance
CCXML
<var name="state" expr="'initial'"/>
<eventprocessor statevariable="state">
  <transition state="initial" event="connection.connected">
    <join id1="conf_id" id2="conn_id"/>
    <assign name="state" expr="'joining'"/>
  </transition>
  <transition state="joining" event="conference.joined">
    <dialogstart conferenceid="conf_id" src="'newcaller.vxml'"/>
    <assign name="state" expr="'active'"/>
  </transition>
</eventprocessor>
newcaller.vxml:
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <form>
    <block> A new participant has entered the conference. </block>
  </form>
</vxml>
Exercise 9
Announce when a caller leaves
<var name="state" expr="'initial'"/>
<eventprocessor statevariable="state">
  <transition state="initial" event="connection.connected">
    <join id1="conf_id" id2="conn_id"/>
    <assign name="state" expr="'joining'"/>
  </transition>
  <transition state="joining" event="conference.joined">
    <dialogstart conferenceid="conf_id" src="'newcaller.vxml'"/>
    <assign name="state" expr="'active'"/>
  </transition>
</eventprocessor>
Answer to Exercise 9
<var name="state" expr="'initial'"/>
<eventprocessor statevariable="state">
  <transition state="initial" event="connection.connected">
    <join id1="conf_id" id2="conn_id"/>
    <assign name="state" expr="'joining'"/>
  </transition>
  <transition state="joining" event="conference.joined">
    <dialogstart conferenceid="conf_id" src="'newcaller.vxml'"/>
    <assign name="state" expr="'active'"/>
  </transition>
  <transition state="active" event="connection.disconnected">
    <dialogstart conferenceid="conf_id" src="'callerleft.vxml'"/>
    <assign name="state" expr="'inactive'"/>
  </transition>
</eventprocessor>
callerleft.vxml:
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <form>
    <block> A participant has left the conference. </block>
  </form>
</vxml>
Example Applications with CCXML-VoiceXML
• Alerts
  o Stock value changes, order is available, flight is delayed, road closure, school closure
• Conference
  o Add additional person to the conference
  o Whisper
  o Eject
• Find me
  o Try alternative telephone numbers
• Instant messaging
  o Notify me when John calls in to access his e-mail
• Control home applications
  o Turn on/off coffee pot, oven, air conditioner, lights, arm/disarm the security system
• Call Center/Customer Care Applications
VoiceXML 2.1
• VoiceXML's success and popularity resulted in many implementations early in the standardization process
• Additional, innovative features were conceived after VoiceXML 2.0 content was agreed
• Goals of VoiceXML 2.1:
  o Ensure portability by specifying a set of commonly implemented extensions
  o Backwards-compatible with VoiceXML 2.0
  o Follow a "fast track" to standardization
VoiceXML 2.1
• Standardized extensions:
  o Locate barge-in occurrences within prompts
  o Interact directly with XML-based infrastructure
  o Access recognition utterances for analysis
  o Increase performance by reducing server round-trips
  o Extended call transfer types
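Two of these extensions can be illustrated in a few lines. The sketch below uses the VoiceXML 2.1 `<data>` element to read XML from a server without leaving the current document, and the `consultation` transfer type. The URL, the `<last>` element name in the fetched XML, the variable names, and the telephone number are all assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="quote">
    <!-- Fetch XML in place: no page transition, so one fewer round-trip -->
    <data name="stock" src="http://www.example.com/quote.xml"/>
    <block>
      <prompt>
        The last price was
        <!-- The fetched document is exposed as a read-only DOM object -->
        <value expr="stock.documentElement.getElementsByTagName('last').item(0).firstChild.data"/>
      </prompt>
    </block>
    <!-- The consultation type is one of the extended transfer types -->
    <transfer name="agent" dest="tel:+15555550100" type="consultation"/>
  </form>
</vxml>
```

In VoiceXML 2.0 the same lookup would normally require a `<submit>` to a server page that generates a new VoiceXML document containing the price.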
Summary
• W3C Speech Interface Framework
  o Dialog: VoiceXML
  o Grammar: SRGS
  o Synthesis: SSML
  o Semantic Interpretation: SI
  o Call Control: CCXML
• Can work together or separately
• See http://www.w3.org/voice/ for details
Resources
Industry Organizations
• World Wide Web Consortium
  o http://www.w3.org
• W3C Voice Browser Working Group
  o http://www.w3.org/voice/
• W3C Multi-Modal Working Group
  o http://www.w3.org/2002/mmi/
• VoiceXML Forum
  o http://www.voicexml.org
• SALT Forum
  o http://www.saltforum.org
• Speech Technology Magazine
  o http://www.amcommexpos.com/
Books
• James A. Larson, VoiceXML: An Introduction to Developing Speech Applications, 2002, Upper Saddle River, NJ: Prentice Hall.
• Eve Astrid Andersson et al., Early Adopter VoiceXML, 2001, Birmingham, UK: Wrox.
• Bruce Balentine & David P. Morgan, How to Build a Speech Recognition Application: A Style Guide for Telephony Dialogues, 1999, San Ramon, CA: Enterprise Integration Group.
• Rick Beasley et al., Voice Application Development with VoiceXML, 2002, Indianapolis: Sams.
• Bob Edgar, The VoiceXML Handbook, 2001, New York: CMP.
• Susan Weinschenk & Dean T. Barker, Designing Effective Speech Interfaces, 2000, New York: John Wiley & Sons.
• Chetan Sharma & Jeff Kunins, VoiceXML: Strategies and Techniques for Effective Voice Application Development with VoiceXML 2.0, 2002, New York: John Wiley.
• Michael H. Cohen, James P. Giangola, & Jennifer Balogh, Voice User Interface Design, 2004, Addison Wesley.
Tutorials and Articles
• VoiceXML Forum
  o http://www.voicexmlforum.org/
• VoiceXML Review
  o http://www.voicexmlreview.org/
• World of VoiceXML
  o http://www.kenrehor.com/voicexml/
Online Voice SDKs
  Name                              URL
  BeVocal Cafe                      http://cafe.bevocal.com
  HeyAnita FreeSpeech               http://www.heyanita.com
  Tellme Studio                     http://studio.tellme.com
  VoiceGenie Developer Workshop     http://developer.voicegenie.com
  Voxeo Community                   http://www.voxeo.com
  Voxpilot voxbuilder               http://www.voxbuilder.com
Downloadable Voice Interpreters
  Name                              URL
  IBM WebSphere Voice Server SDK    http://www.ibm.com/software/voice
Public VoiceXML Interpreters
  Interpreter: OpenVXI - VoiceXML Interpreter
    Source: Carnegie Mellon University, Department of Computer Science, Speech Group
    URL: http://www.speech.cs.cmu.edu/openvxi/index.html
  Interpreter: PublicVoiceXML - VoiceXML platform
    Source: Public Voice Lab, Vienna, Austria
    URL: http://www.publicvoicexml.org/
Introduction to VoiceXML
• Questions?