Automatic Speech Recognition and Audio Indexing
E. M. Bakker, LIACS Media Lab

Topics
• Live TV Captioning
• Audio-based Surveillance Systems
• Speaker Recognition
• Music/Speech Segmentation

Live TV Captioning
Transcribe and display the spoken parts of television programs.
• BBC: a pioneer, it started live captioning in 2001 and has been captioning all of its broadcast programs since May 2008
• Czech Television: live captions for Parliament broadcasts since November 2008

Live TV Captioning
• Essential for people with hearing impairment
• Helps non-native viewers with their language
• Improves first-language literacy
Studies show:
• In the UK, 80% of the viewers of TV with captions have no hearing impairment []
• In India, watching TV with captions improved literacy skills [3]

Live TV Captioning
Manual transcription of pre-recorded programs:
• A human captioner listens to and transcribes all speech in a program
• The transcription is edited, positioned and aligned
• 16 hours of work for 1 hour of program

Live TV Captioning of Pre-recorded Programs using Speech Recognition
• A human dictates the speech audio of the program
• The transcription is produced by speech recognition
• The transcription is manually edited if necessary
• Significant reduction of effort

Live Captioning of TV Programs
• The challenge is to produce captions with a minimum delay
• The delay should be less than a few seconds
• Live captioning is essential for sports events, parliamentary broadcasts, live news, etc.
• Highly skilled stenographers (200 words/minute) are needed
• They have to take a break every 15 minutes
• It is difficult to find enough people for large events (Olympic Games, World Championship Soccer, …)

Live TV Captioning
• BBC started live captioning using speech recognition in April 2001 []
• Since May 2008, 100% of its broadcast programs are captioned
Typical scheme:
• A captioner listens to the live program
• Re-speaks, rephrases and simplifies the spoken speech
• Enters punctuation and speaker-change information
• The speech recognition engine is adapted to the voice of the captioner

Live TV Captioning
Czech Television, in cooperation with the University of West Bohemia in Pilsen, started live captioning of Parliament meetings in November 2008.
Pilot study:
• The speech recognition is done on the original speech signal, which is sent 5 seconds ahead
• The transcription is sent back immediately
• The system is trained on 100 hours of Czech Parliament broadcasts with their stenographic records
Near future:
• Re-speakers will be used for sports programs and other live events
• Note: Slavic languages have the difficulty of a high degree of inflection and large numbers of prefixes and suffixes => a much larger vocabulary

Live TV Captioning Application: Automatic Translation
• YouTube offers translations of the captions in 41 languages
• Translation modeling
• Robust handling of transcription errors made by the speech recognition engine
• More difficult for languages that are further apart: English – Japanese, Chinese
• Although it contains errors, it may still be beneficial: humans can interpret it

References
[1] F. Jurcicek, Speech Recognition for Live TV, IEEE Signal Processing Society – SLTC Newsletter, April 2009
[2] M. J. Evans, Speech Recognition in Assisted and Live Subtitling for Television, BBC R&D White Paper WHP 065, July 2003
[3] B. Kothari, A. Pandey, and A. R. Chudgar, Reading Out of the "Idiot Box": Same-Language Subtitling on Television in India, Information Technologies and International Development, Vol. 2, Issue 1, September 2004
[4] Ofcom, Television Access Services – Summary, March 2006
[5] R. Griffiths, Giving Voice to Subtitling: Ruth Griffiths, Director of Access Services at BBC Broadcast, July 2005
[6] BBC, Press Releases – BBC Vision Celebrates 100% Subtitling, May 2008
[7] YouTube Blog, Auto Translate Now Available for Videos with Captions, November 2008

Fear-Type Emotion Recognition [1]
• Fear-type emotion recognition for future audio-based surveillance systems
• Fear-type emotions are expressed during abnormal, possibly life-threatening situations => public safety (the SERKET project, addressing multi-sensory data)
• Specially developed corpus: SAFE

Fear-Type Emotion Recognition
The HUMAINE network of excellence evaluated emotional databases and noted the lack of corpora that contain strong emotions with a good level of realism.
• (Juslin and Laukka, 2003): of 104 studies, 87% of the experiments were conducted on acted data
• For strong emotions this percentage is almost 100%

Fear-Type Emotion Recognition
Acted databases
• Stereotypical
• Realism depends on acting skills
• Depends on the context given to the actor
• Recently (Bänziger and Pirker, 2006), (Enos and Hirschberg, 2006): more realistic emotions captured by applying acting techniques that elicit more genuine emotions
Induced realistic emotions
• E-Wiz database (Aubergé et al. 2003), SAL (Douglas-Cowie et al. 2003)
• Fear-type emotions: inducing them may be medically dangerous and/or unethical, therefore such data is not available
Real-life databases
• Contain strong emotions
• Very typical: emergency call centers and therapy sessions
• Restricted scope of concepts

Annotation of Emotional Content
Scherer et al. (1980): push/pull opposite effects in emotional speech
• Physiological excitations push the voice in one direction
• Conscious, culturally driven effects pull it in another direction
Representing emotions in abstract dimensions
• Various dimensions have been proposed
• Whissell's (1989) activation/evaluation space is used most frequently because of the large range of emotional variation it covers
Emotional description categories
• Basic emotions, Ortony and Turner (1990)
• Primary emotions, Damasio (1994)
• Ekman and Friesen (1975), the big six (fear, anger, joy, disgust, sadness, surprise)
• And fuller, richer lists of 'basic' emotions
Challenges
• Real-life emotions are rarely basic emotions; they are much more often a complex blend
• This makes annotating real-life corpora a difficult task
• Diversity of data
• Not living up to industrial expectations
• EARL (Emotion Annotation and Representation Language) by the W3C

Acoustic Features
Physiologically related and salient for fear characterization.
High-level features (including voice quality):
• Pitch
• Intensity
• Speech rate
• Voice quality: creaky, breathy, tense
Low-level features, initially used for speech processing but also for emotion recognition:
• Spectral features
• Cepstral features
(A minimal feature-extraction sketch follows below.)
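The following sketch shows how such features could be extracted in practice. It is not taken from [1]; the librosa-based pipeline, the file name utterance.wav, and the frame parameters are illustrative assumptions.

```python
# Hypothetical feature-extraction sketch using librosa (not the authors' code).
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)            # assumed input file

# High-level / prosodic features
f0, voiced_flag, _ = librosa.pyin(y, fmin=50, fmax=500, sr=sr)   # pitch contour in Hz
intensity = librosa.feature.rms(y=y)[0]                          # frame energy as an intensity proxy

# Low-level spectral and cepstral features
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)               # cepstral features

# Summarize each contour with simple statistics, a common choice in emotion recognition
features = np.hstack([
    np.nanmean(f0), np.nanstd(f0),        # pitch statistics (NaN in unvoiced frames)
    intensity.mean(), intensity.std(),
    centroid.mean(), centroid.std(),
    mfcc.mean(axis=1), mfcc.std(axis=1),
])
print(features.shape)
```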

Classification Algorithms
Used in emotion recognition studies:
• Support Vector Machines
• Gaussian Mixture Models
• Hidden Markov Models
• K-Nearest Neighbors
• …
Evaluation of the methods is difficult because of:
• The diversity of data and context
• The different numbers and types of emotional classes
• The training and test conditions: speaker dependent or independent, etc.
• Acoustic feature extraction that depends on prior knowledge of the linguistic content, speaker identity, normalization by speaker, phone, etc.
(A small classifier sketch is given below.)
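As an illustration of one of the classifiers listed above, here is a minimal, hypothetical SVM training sketch; the feature and label files are assumed to come from a front-end such as the one sketched earlier and are not part of the original study.

```python
# Hypothetical sketch: SVM-based emotion classification on precomputed acoustic features.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X = np.load("features.npy")   # (n_utterances, n_features), assumed precomputed
y = np.load("labels.npy")     # e.g. 0 = neutral, 1 = fear-type emotion (assumed labels)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

# Plain cross-validation shown only for illustration; a speaker-independent
# evaluation would require grouping the folds by speaker.
scores = cross_val_score(clf, X, y, cv=5)
print("mean accuracy:", scores.mean())
```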

Audio Surveillance
• Audio clues work even if there are no visual clues: gunshots, human shouts, events out of the camera shot
• Important emotional category: fear
Constraints
• High diversity in the number and type of speakers
• Noisy environments: stadium, bank, airport, subway, station, etc.
• Speaker independent, with a high number of unknown speakers
• Text independent (i.e., no speech recognition, so the quality of the recordings is of less importance)

Approach
• A new emotional database with a large scope of threat concepts, following the previously listed constraints
• A task-dependent annotation strategy
• Extraction of relevant acoustic features
• Classification robust to noise and to the variability of the data
• Performance evaluation

SAFE: Situation Analysis in a Fictional and Emotional Corpus
• 7 hours of recordings from fictional movies
• 400 audiovisual sequences (8 seconds – 5 minutes)
• Dynamics: normal and abnormal situations
• Variability: a high variety of emotional manifestations
• A large number of unknown speakers and unknown situations in noisy environments

SAFE Situation Analysis in a Fictional and Emotional Corpus

Annotation
• Speaker track: gender and position (aggressor, victim, other)
• Threat track: degree of threat (no threat, potential, latent, immediate, past threat) plus intensity
• Speech track: verbal and non-verbal categories (shouts, breathing, etc.), contents, audio environment (music/noise), quality of speech

Fear-Type Emotion Recognition: Emotional Categories and Subcategories
• Fear: stress, terror, anxiety, worry, anguish, panic, distress, mixed subcategories
• Other negative emotions: anger, sadness, disgust, suffering, deception, contempt, shame, despair, cruelty, mixed subcategories
• Neutral – positive emotions: joy, relief, determination, pride, hope, gratitude, surprise, mixed subcategories

Fear-Type Emotion Recognition

Features
• Pitch-related features (most useful for the fear vs. neutral classifier)
• Voice quality: jitter and shimmer
• Voiced classifier: spectral centroid
• Unvoiced classifier: spectral features, Bark band energy
(See the voiced/unvoiced centroid sketch below.)
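As a small illustration of the voiced/unvoiced split above, the following hedged sketch computes the spectral centroid separately for voiced and unvoiced frames; the librosa calls, the voicing decision taken from pyin, and the file name are assumptions rather than the authors' setup.

```python
# Hypothetical sketch: spectral centroid statistics for voiced vs. unvoiced frames.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)    # assumed input file
hop = 512

f0, voiced_flag, _ = librosa.pyin(y, fmin=50, fmax=500, sr=sr, hop_length=hop)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop)[0]

# pyin and spectral_centroid use the same hop length, so the frames roughly line up
n = min(len(voiced_flag), len(centroid))
voiced_centroid = centroid[:n][voiced_flag[:n]]
unvoiced_centroid = centroid[:n][~voiced_flag[:n]]

print("mean centroid (voiced):  ", voiced_centroid.mean())
print("mean centroid (unvoiced):", unvoiced_centroid.mean())
```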

Results

References
[1] C. Clavel, I. Vasilescu, L. Devillers, G. Richard, T. Ehrette, Fear-Type Emotion Recognition for Future Audio-Based Surveillance Systems, Speech Communication, pp. 487–503, Vol. 50, 2008

Special Classes
• A. Drahota, A. Costall, V. Reddy, The Vocal Communication of Different Kinds of Smiles, Speech Communication, pp. 278–287, Vol. 50, 2008
• H. S. Cheang, M. D. Pell, The Sound of Sarcasm, Speech Communication, pp. 366–381, Vol. 50, 2008

Speaker Recognition/Verification
Text-independent speaker verification systems (Reynolds et al. 2000)
• Short-term spectra of speech signals are used to train speaker-dependent Gaussian Mixture Models (GMMs)
• A GMM-based background model is used to represent the distribution of impostors' speech
• Verification is based on the likelihood-ratio hypothesis test, where
  • the client GMM models the distribution under the null hypothesis
  • the background GMM models the distribution under the alternative hypothesis
(A minimal GMM-UBM sketch is given below.)
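The sketch below illustrates the GMM-UBM likelihood-ratio test described above. It is only a toy version: sklearn's GaussianMixture stands in for the models of Reynolds et al. (2000), the feature files are assumed to contain precomputed MFCC frames, and the decision threshold is arbitrary.

```python
# Hypothetical GMM-UBM verification sketch (not the Reynolds et al. implementation).
import numpy as np
from sklearn.mixture import GaussianMixture

# Each array is (n_frames, n_mfcc), assumed to be extracted elsewhere.
ubm_frames = np.load("background_frames.npy")    # pooled non-target (impostor) speakers
client_frames = np.load("client_frames.npy")     # enrollment data of the claimed speaker
test_frames = np.load("test_frames.npy")         # utterance to verify

ubm = GaussianMixture(n_components=64, covariance_type="diag").fit(ubm_frames)
client = GaussianMixture(n_components=64, covariance_type="diag").fit(client_frames)

# Average per-frame log-likelihood ratio between the client model (null hypothesis:
# the claimed speaker produced the utterance) and the background model.
llr = client.score(test_frames) - ubm.score(test_frames)
accept = llr > 0.0   # in practice the threshold is tuned on a development set
print("log-likelihood ratio:", llr, "accept:", accept)
```

In real systems the client model is usually obtained by adapting the background model rather than by training it from scratch, which is exactly the role of the adaptation techniques discussed on the following slides.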

Speaker Recognition/Verification
Training the background model is relatively straightforward using large speech corpora of non-target speakers.
But: speaker verification based on high-level speaker features requires a large amount of speech for enrollment.
Therefore: adaptation techniques are necessary.

Speaker Recognition/Verification
Text dependent & very short utterances:
• Phoneme-dependent HMMs
Text independent & moderate enrollment data (adaptation techniques):
• Maximum a Posteriori (MAP)
• Maximum Likelihood Linear Regression (MLLR)
• Low-level speaker models and speaker clustering
• Linear combination of reference models in an eigenvoice (EV) space
• EV adaptation => EVMLLR
• Non-linear adaptation by introducing kernel PCA
• Kernel eigenspace MLLR outperforms the other adaptation methods when the amount of enrollment data is extremely limited
(A sketch of classical MAP mean adaptation follows below.)
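To make the MAP option more concrete, here is a hedged sketch of classical relevance-MAP adaptation of the GMM means only (in the style of Reynolds et al. 2000); the relevance factor r = 16 is a conventional but assumed choice, and sklearn's GaussianMixture again stands in for the real models.

```python
# Hypothetical sketch of relevance-MAP adaptation of GMM means.
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm: GaussianMixture, frames: np.ndarray, r: float = 16.0) -> np.ndarray:
    """Return speaker-adapted means for enrollment frames of shape (n_frames, dim)."""
    gamma = ubm.predict_proba(frames)                 # responsibilities, (n_frames, n_components)
    n_k = gamma.sum(axis=0)                           # soft frame counts per component
    ex_k = (gamma.T @ frames) / np.maximum(n_k, 1e-10)[:, None]   # data mean per component
    alpha = (n_k / (n_k + r))[:, None]                # adaptation coefficient per component
    return alpha * ex_k + (1.0 - alpha) * ubm.means_  # interpolate data mean with UBM mean

# Usage sketch (enrollment_frames assumed precomputed):
#   import copy
#   client = copy.deepcopy(ubm)
#   client.means_ = map_adapt_means(ubm, enrollment_frames)
```

Components that see many enrollment frames move toward the data (alpha close to 1), while rarely observed components stay close to the background model, which is what makes MAP usable with moderate enrollment data.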

Speaker Recognition/Verification
Features from high level to low level (Shriberg, 2007):
• Pronunciations (place of birth, education, socioeconomic status, etc.)
• Idiolect (education, socioeconomic status, etc.)
• Prosodic (personality type, parental influence, etc.)
• Acoustic (physical structure of the vocal organs)

Speaker Recognition/Verification
Pronunciations:
• Multilingual phone streams obtained by language-dependent phone ASR (N-grams, Bin-tree)
• Multilingual phone cross-streams by language-dependent phone ASR (N-gram, CPM)
• Articulatory features by MLP and phone ASR (AFCPM)
Idiolect (education, socioeconomic status, etc.):
• Word streams by word ASR (N-gram, SVM)
Prosodic (personality type, parental influence, etc.):
• F0 and energy distribution by energy estimator (GMM)
• Pitch contour by F0 estimator and word ASR (DTW)
• F0 & energy contours & duration dynamics by F0 & energy estimator & phone ASR (N-gram)
• Prosodic statistics from F0 & duration by F0 and energy estimator & word ASR (KNN)
Acoustic (physical structure of the vocal organs):
• MFCCs & their time derivatives (GMM)
(A phone n-gram feature sketch follows below.)
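The phone-stream features above are typically reduced to n-gram statistics per conversation side. The following is a small, hypothetical illustration of that step; the phone sequence is an invented example, not real ASR output.

```python
# Hypothetical sketch: phone n-gram relative frequencies as high-level speaker features.
from collections import Counter
from itertools import islice

phones = ["sil", "h", "eh", "l", "ow", "sil", "w", "er", "l", "d", "sil"]  # assumed phone ASR output

def ngram_frequencies(seq, n=2):
    """Relative frequencies of phone n-grams in one utterance or conversation side."""
    grams = list(zip(*(islice(seq, i, None) for i in range(n))))
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

# An N-gram or SVM back-end would model these sparse frequency vectors per speaker,
# e.g. by comparing them against the corresponding background frequencies.
print(ngram_frequencies(phones, n=2))
```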

Training of Unadapted Phoneme Dependent AFCPM Speaker Models

Adaptation Method A
• Classical MAP, adapted from phoneme-dependent background models.
• This is based on the classical MAP used in (Leung et al., 2006).

Adaptation Method B
• Phoneme-independent adaptation (PIA).
• Adapted from phoneme-dependent speaker models and phoneme-independent speaker models.

Adaptation Method C
• Scaled phoneme-independent adaptation (SPI).
• Adapted from phoneme-independent speaker models with a phoneme-dependent scaling factor that depends on both the phoneme-dependent and phoneme-independent background models.

Adaptation Method D
• Mixed phoneme-dependent and scaled phoneme-independent adaptation (MSPI).
• Adapted from phoneme-dependent background models and phoneme-independent speaker models with a phoneme-dependent scaling factor that depends on both the phoneme-dependent and phoneme-independent background models.
• This method is a combination of Methods A and C.

Adaptation Method E
• Mixed phoneme-independent and scaled phoneme-dependent adaptation (MSPD).
• Adapted from phoneme-independent speaker models and phoneme-dependent background models with a speaker-dependent scaling factor that depends on both the phoneme-independent speaker model and background models.
• This method is a combination of Methods B and C.

Adaptation Methods for AFCPMs

Method A

Method A: Principles and Relationships

Method B - E

Method B: Principles and Relationships

Method C: Principles and Relationships

Method D: Principles and Relationships

Method E: Principles and Relationships

Used Databases

Results NIST 00

Results NIST 02

Results NIST 00

Results NIST 02

References
[1] L. Mary, B. Yegnanarayana, Extraction and Representation of Prosodic Features for Language and Speaker Recognition, Speech Communication, pp. 782–796, Vol. 50, 2008
[2] S.-X. Zhang, M.-W. Mak, A New Adaptation Approach to High-Level Speaker-Model Creation in Speaker Verification, Speech Communication, pp. 534–550, Vol. 51, 2009

Detecting Speech and Music based on Spectral Tracking [1]

References
[1] T. Taniguchi, M. Tohyama, K. Shirai, Detection of Speech and Music based on Spectral Tracking, Speech Communication, pp. 547–563, Vol. 50, 2008