Voice Recognition Lawrence Pan Syen Hassan Jamme Tan

  • Slides: 21

Overview
  • History of voice recognition
  • Why voice recognition?
  • Technology behind voice recognition
      ◦ Five major steps
  • Common applications
  • Current leaders
      ◦ Demonstrations
      ◦ Product evaluation
  • Implementation of our own voice recognition system
      ◦ Grade retrieval system for EE 3414
  • Future challenges

History of Voice Recognition
  • Radio Rex (house trained dog), 1922
  • U.S. Department of Defense, 1940’s
      ◦ Speech Understanding Research (SUR) program
          ▪ Carnegie Mellon University & MIT
      ◦ Automatic interception & translation of Russian radio transmissions (FAILURE)
          ▪ Original message: “the spirit is willing but the flesh is weak”
          ▪ Translated message: “the vodka is strong but the meat is disgusting.”

History Cont’d
  • First major achievements
      ◦ Bell Laboratories, 1952
          ▪ Successful recognition of numbers 0 to 9, spoken over telephone
      ◦ MIT, 1959
          ▪ Successful recognition of vowels with 93% accuracy
      ◦ Carnegie Mellon University, 1970’s
          ▪ HARPY system: capable of recognizing complete sentences

History Cont’d
  • Obstacles
      ◦ Computing power: over 50 computers needed for HARPY system to perform
      ◦ Ability to recognize speech from any person
          ▪ Taking into account different accents, speech tones, etc.
      ◦ Ability to recognize continuous speech
          ▪ so…we…do…not…have…to…speak…like…this!
  • Commercialization of voice recognition systems

History Cont’d
  [Figure: computation required and computation available in processors over time]
  [Figure: accuracy and task complexity progress over time]

Why Voice Recognition?
  • Convenience
      ◦ Natural user interface: human speech
      ◦ Improved services for the disabled
      ◦ Wider range of users
  • Future possibilities and improvements
      ◦ Internet use over phones through voice portals
      ◦ Advanced applications implementing voice control in all areas

Technology behind Voice Recognition
  • Five major steps used by speech recognizer

Five major steps in voice recognition
  • Capture and Digitization
      ◦ System interacts with the telephony device to capture voice input at 8000 samples/sec
  • Spectral Representation
      ◦ Voice samples converted to graphical representation
  • Segmentation
      ◦ Speech signals are broken down into segmented parts
      ◦ Improves accuracy
      ◦ Reduces computation: impossible to process entire signal in real time
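The first three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any product named in these slides: the synthetic 440 Hz tone stands in for captured telephone audio, and the 25 ms frame length, 10 ms hop, and Hamming window are common choices that the slides do not specify. Only the 8000 samples/sec rate comes from the slide.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples/sec, the telephony rate quoted on the slide


def synth_tone(freq_hz, duration_s, rate=SAMPLE_RATE):
    """Stand-in for captured audio: a pure sine tone (assumption)."""
    n = round(duration_s * rate)
    return [math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n)]


def segment(samples, frame_len=200, hop=80):
    """Segmentation: overlapping frames, 25 ms long and 10 ms apart at 8 kHz."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Spectral representation: naive O(N^2) DFT magnitudes for bins 0..N/2."""
    n = len(frame)
    # A Hamming window tapers the frame edges to reduce spectral leakage.
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def dominant_frequency(frame, rate=SAMPLE_RATE):
    """Frequency (Hz) of the strongest bin in one frame."""
    spectrum = magnitude_spectrum(frame)
    peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
    return peak_bin * rate / len(frame)


signal = synth_tone(440.0, 0.1)   # capture (simulated)
frames = segment(signal)          # segmentation
peaks = [dominant_frequency(f) for f in frames]  # one spectrum per frame
```

Stacking the per-frame spectra side by side gives the spectrogram-style graphical representation the slide refers to. Processing short frames instead of the whole signal is what keeps the computation bounded.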

Graphical Representations

Acoustic Model
  • Phonemes – smallest phonetic unit in a language
      ◦ Distinguish one word from another
          ▪ E.g. b in boy and t in toy
  • Allophone – different pronunciations of a phoneme/letter
      ◦ E.g. t in tab, t in stab, tt in stutter
  • Database (Lexicon) of all words known to the system for a language
      ◦ Should contain several recordings for certain words
          ▪ E.g. “the” can be pronounced “duh” or “dee”
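A lexicon with several pronunciations per word can be sketched as a dictionary. The words and the two pronunciations of "the" come from the slide. The phoneme symbols (ARPAbet-style) and the function names are assumptions made for this illustration, and a real system stores acoustic recordings or models, not just symbol strings.

```python
from collections import defaultdict

# Word -> list of pronunciations, each a tuple of phoneme symbols.
# Symbols are ARPAbet-style labels chosen for this sketch (assumption).
LEXICON = {
    "the": [("DH", "AH"), ("DH", "IY")],  # "duh" and "dee"
    "boy": [("B", "OY")],
    "toy": [("T", "OY")],
}


def build_index(lexicon):
    """Invert the lexicon: pronunciation -> set of words spelled that way."""
    index = defaultdict(set)
    for word, pronunciations in lexicon.items():
        for phonemes in pronunciations:
            index[tuple(phonemes)].add(word)
    return index


INDEX = build_index(LEXICON)


def words_for(phonemes):
    """Return the known words matching a phoneme sequence, sorted."""
    return sorted(INDEX.get(tuple(phonemes), ()))
```

Both pronunciations of "the" resolve to the same word, while "boy" and "toy" differ in a single phoneme, which is the distinction the slide describes.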

Acoustic Model Cont’d
  • Trellis
      ◦ Data structure made up of all possible combinations of allophones
  • Training of acoustic models
      ◦ For single-user systems
          ▪ Text is read by user and recognized by system
      ◦ For multi-user systems
          ▪ Utterances spoken by many users compiled into a database, then inputted into a recognizer
          ▪ Weights are put on certain allophones
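One way to picture a trellis and its weights is a list of columns, one per time step, each holding the candidate allophones and a score. The sketch below finds the highest-scoring path with a Viterbi-style dynamic program. The slides do not name a search algorithm, so Viterbi is an assumption here, and the allophone labels, scores, and penalty value are invented purely for illustration.

```python
def best_path(trellis, transition, penalty=-1.0):
    """Viterbi-style search for the highest-scoring path through a trellis.

    trellis:    list of dicts, one per time step, allophone -> score
    transition: dict (previous, current) -> score for that hand-off
    penalty:    score used for any pair missing from `transition`
    """
    scores = dict(trellis[0])
    back_pointers = []
    for column in trellis[1:]:
        new_scores, pointers = {}, {}
        for cur, emit in column.items():
            # Best predecessor for `cur`, judged by path score plus transition.
            prev = max(scores,
                       key=lambda p: scores[p] + transition.get((p, cur), penalty))
            new_scores[cur] = (scores[prev]
                               + transition.get((prev, cur), penalty)
                               + emit)
            pointers[cur] = prev
        scores = new_scores
        back_pointers.append(pointers)
    # Trace back from the best final allophone.
    last = max(scores, key=scores.get)
    path = [last]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[last]


# Hypothetical scores for three time steps of the word "tab".
trellis = [
    {"t_aspirated": 2.0, "d": 1.0},
    {"ae": 3.0, "eh": 2.5},
    {"b": 2.0, "p": 1.5},
]
transition = {
    ("t_aspirated", "ae"): 1.0,
    ("d", "eh"): 0.5,
    ("ae", "b"): 1.0,
    ("eh", "p"): 2.0,
}
path, score = best_path(trellis, transition)
```

Training, in this picture, amounts to adjusting the scores so that paths matching what users actually said come out on top.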

Language Model
  • Languages have structures (i.e. grammar)
      ◦ Two similar-sounding words can be difficult to tell apart
      ◦ Can be distinguished using context
          ▪ E.g. “ours” and “hours” can be told apart if the previous word is “two”
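The "ours" versus "hours" example can be sketched with a bigram model, which scores a word by how often it follows the previous word. The slides do not say which kind of language model is used, so the bigram choice is an assumption, and the counts below are made up for illustration rather than measured from any corpus.

```python
# Hypothetical counts of (previous word, word) pairs; not real corpus data.
BIGRAM_COUNTS = {
    ("two", "hours"): 40,
    ("is", "ours"): 12,
    ("is", "hours"): 1,
}


def bigram_probability(previous, word, candidates):
    """P(word | previous) over the candidate set, with add-one smoothing."""
    total = sum(BIGRAM_COUNTS.get((previous, w), 0) for w in candidates)
    count = BIGRAM_COUNTS.get((previous, word), 0)
    return (count + 1) / (total + len(candidates))


def choose(previous, candidates):
    """Pick the candidate most likely to follow `previous`."""
    return max(candidates,
               key=lambda w: bigram_probability(previous, w, candidates))
```

With these counts, `choose("two", ["ours", "hours"])` prefers "hours" and `choose("is", ["ours", "hours"])` prefers "ours", even though the acoustic evidence for the two words is nearly identical.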

Common Applications
  • Call Center Automation
      ◦ Widely used in all industries (consumer interface)
          ▪ Airline companies: booking flights, general info, etc.
          ▪ Banking companies: “pay by phone”, account balances, etc.
          ▪ Delivery services (FedEx): tracking orders, etc.
          ▪ All general customer service systems
  • Computer integration of voice recognition
      ◦ Personal computers
          ▪ Speech to text dictation
          ▪ Accessibility purposes: voice control of computers

Common Applications cont’d
  • Integrated into automobiles:
      ◦ Visteon Voice Technology™ used in Infiniti Q45
      ◦ Controls:
          ▪ Climate
          ▪ CD player
          ▪ Navigation system

Competing Standards
  • VoiceXML (extensible markup language)
      ◦ Partners: AT&T, IBM, Motorola, Lucent Tech.
      ◦ Used in implementation of most voice portals
      ◦ Shifting target toward web developers
  • SALT (Speech Application Language Tags)
      ◦ Partners: Microsoft, Intel, Cisco, SpeechWorks
      ◦ Targeted toward web developers

Current Leaders
  • Dragon Systems:
      ◦ NaturallySpeaking: PC-based user-side programs for automated speech recognition (ASR)
      ◦ Automotive, telephony, mobile, games, embedded chips
  • SpeechWorks: connects users to industry voice portals
      ◦ AOLbyPhone, FedEx, E*Trade, etc.
  • BeVocal: provides voice portals for Bell South, etc.
  • TellMe: provides voice portals for AT&T, Merrill Lynch, etc.
  • Philips Speech Recognition
      ◦ Services automotive, mobile device, and consumer electronic industries
  • IBM ViaVoice, MS Agent

Demonstrations
  • SpeechWorks™ product line
      ◦ United Airlines' toll free flight information line (demo)
      ◦ BankWorks Automated Bill Payment (demo)
      ◦ FedEx Rate Finder (demo)
      ◦ E*Trade Stock (demo)
      ◦ AOLbyPhone service (demo)
  • BeVocal solutions

Magical Merlin’s Grade Retrieval System
  • Designed in Visual Basic using Microsoft’s MSAgent
  • Click on my belly for a short demonstration
  • Recognized voice commands, by menu:
      ◦ First Exam, First Test, First Midterm
      ◦ Second Exam, Second Test, Second Midterm
      ◦ Quiz Grades, Grade on Quizzes
      ◦ Homework Grades, Grade on Homework
      ◦ Project Grade, Grade on Project
      ◦ Final Grade, Grade for course
      ◦ Main Menu: Main menu, Main, Class
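The command list above is essentially a small grammar: several spoken phrases map to one menu action. The original system was written in Visual Basic with MSAgent, and its code is not shown in the slides, so the Python sketch below only illustrates the mapping idea. The phrases are taken from the slide. The action names and the `resolve` function are hypothetical.

```python
# Action name (hypothetical label) -> phrases listed on the slide.
COMMANDS = {
    "first_exam": ["First Exam", "First Test", "First Midterm"],
    "second_exam": ["Second Exam", "Second Test", "Second Midterm"],
    "quiz_grades": ["Quiz Grades", "Grade on Quizzes"],
    "homework_grades": ["Homework Grades", "Grade on Homework"],
    "project_grade": ["Project Grade", "Grade on Project"],
    "final_grade": ["Final Grade", "Grade for course"],
    "main_menu": ["Main menu", "Main", "Class"],
}


def normalize(phrase):
    """Lower-case and collapse whitespace so matching ignores casing and spacing."""
    return " ".join(phrase.lower().split())


# Flatten to a phrase -> action lookup table.
PHRASE_TO_ACTION = {
    normalize(phrase): action
    for action, phrases in COMMANDS.items()
    for phrase in phrases
}


def resolve(recognized_text):
    """Return the action for a recognized phrase, or None if it is not a command."""
    return PHRASE_TO_ACTION.get(normalize(recognized_text))
```

Restricting the recognizer to a short list of phrases like this is what makes a small system practical: the recognizer only has to choose among a handful of alternatives instead of open dictation.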

Future Challenges
  • Speech technology
  • VoiceXML vs. SALT
  • Voice enabling web content
  • Real time access to source data
      ◦ Stock market, traffic, sports, etc.
  • Clear connection needed for effective use of voice portals
  • Security issues involved
  • Advertising based revenue
