
Intra-Class Variability Modeling for Speech Processing
Dr. Hagai Aronowitz, IBM Haifa Research Lab
Presentation is available online at: http://aronowitzh.googlepages.com/
H. Aronowitz (IBM), June 08

Speech Classification
Proposed framework
Given labeled training segments from class + and class -, classify unlabeled test segments.
Classification framework
1. Represent speech segments in segment-space
2. Learn a classifier in segment-space
• SVMs
• NNs
• Bayesian classifiers
• …

Outline
1. Introduction to GMM based classification
2. Mapping speech segments into segment space
3. Intra-class variability modeling
4. Speaker diarization
5. Summary

Text-Independent Speaker Recognition
GMM-Based Algorithm [Reynolds 1995]
Assuming frame independence, estimate Pr(y_t|S):
1. Train a universal background model (UBM) GMM using EM
2. For every target speaker S: train a GMM G_S by applying MAP adaptation
Figure: GMM based speaker recognition. UBM means μ1, μ2, μ3 and adapted models Q1 (speaker #1) and Q2 (speaker #2) in the R^26 MFCC feature space.
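The two training steps and the frame-independent scoring can be sketched as follows. This is a minimal illustration, not the system from the talk: it assumes diagonal covariances, mean-only MAP adaptation, and a relevance factor of 16, none of which are stated on the slide, and all function names are hypothetical.

```python
import numpy as np

def _log_joint(Y, weights, means, variances):
    """log(w_g * N(y_t; mu_g, diag(var_g))) for every frame t and Gaussian g."""
    diff = Y[:, None, :] - means[None, :, :]                        # (T, G, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    return np.log(weights) + log_norm - 0.5 * np.sum(diff**2 / variances, axis=2)

def gmm_frame_loglik(Y, weights, means, variances):
    """Per-frame log Pr(y_t | GMM), computed with a stable log-sum-exp."""
    lj = _log_joint(Y, weights, means, variances)
    m = lj.max(axis=1)
    return m + np.log(np.exp(lj - m[:, None]).sum(axis=1))

def map_adapt_means(X, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of the UBM to session X (relevance is an assumption)."""
    lj = _log_joint(X, weights, means, variances)
    post = np.exp(lj - lj.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)                         # responsibilities
    n = post.sum(axis=0)                                            # soft counts
    data_means = (post.T @ X) / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + relevance))[:, None]
    return alpha * data_means + (1.0 - alpha) * means

def llr_score(Y, weights, ubm_means, spk_means, variances):
    """Frame independence: the session score is the mean of per-frame log ratios."""
    return float(np.mean(gmm_frame_loglik(Y, weights, spk_means, variances)
                         - gmm_frame_loglik(Y, weights, ubm_means, variances)))
```

Note that `llr_score` visits every test frame, which is the linear-in-length cost discussed in the analysis of this algorithm.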

GMM Based Algorithm - Analysis
1. Invalid frame independence assumption: factors such as channel, emotion, lexical variability, and speaker aging cause frame dependency
2. GMM scoring is inefficient: linear in the length of the audio
3. GMM scoring does not support indexing


Mapping Speech Segments into Segment Space
GMM scoring approximation 1/4
Definitions
• X: training session for target speaker
• Y: test session
• Q: GMM trained for X
• P: GMM trained for Y
Goal
Compute Pr(Y|Q) using GMMs P and Q only
Motivation
1. Efficient speaker recognition and indexing
2. More accurate modeling

Mapping Speech Segments into Segment Space
GMM scoring approximation 2/4
Negative cross entropy (1)
Approximating the cross entropy between two GMMs
1. Matching based lower bound [Aronowitz 2004]
2. Unscented-transform based approximation [Goldberger & Aronowitz 2005]
3. Other options in [Hershey 2007]
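Equation (1) did not survive in this copy of the slides. The standard link between GMM scoring and negative cross entropy, written here as an assumption about what the slide showed, follows from the law of large numbers when the T frames of Y are well described by the GMM P:

```latex
\frac{1}{T}\log \Pr(Y \mid Q)
  = \frac{1}{T}\sum_{t=1}^{T} \log Q(y_t)
  \;\approx\; \int P(y)\,\log Q(y)\,dy
  = -H(P, Q)
```

The integral has no closed form for two GMMs, which is why the approximations listed above are needed.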

Mapping Speech Segments into Segment Space
GMM scoring approximation 3/4
Matching based approximation (2)
Assuming weights and covariance matrices are speaker independent (+ some approximations): (3)
Mapping T is induced: (4)
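Equations (2) to (4) are missing from this copy, so the sketch below is a reconstruction under stated assumptions rather than the slide's formulas. It assumes P and Q share weights and diagonal covariances, matches Gaussian g of P to Gaussian g of Q, and uses the closed-form cross entropy between two Gaussians. Under those assumptions the bound equals a constant minus half the squared distance between weight- and variance-normalized stacked means, which is one common definition of a supervector mapping. The function names are hypothetical.

```python
import numpy as np

def supervector(means, weights, variances):
    """Stack sqrt(w_g) * mu_g / sigma_g over all Gaussians into one vector."""
    return (np.sqrt(weights)[:, None] * means / np.sqrt(variances)).ravel()

def approx_score(means_p, means_q, weights, variances):
    """Speaker-dependent part of the bound: -0.5 * ||T(P) - T(Q)||^2."""
    diff = supervector(means_p, weights, variances) - supervector(means_q, weights, variances)
    return -0.5 * float(diff @ diff)

def matched_cross_entropy(means_p, means_q, weights, variances):
    """Matching based bound on the integral of P log Q, constants included."""
    d = means_p.shape[1]
    const = np.sum(weights * (np.log(weights)
                              - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
                              - 0.5 * d))
    quad = -0.5 * np.sum(weights * np.sum((means_p - means_q) ** 2 / variances, axis=1))
    return float(const + quad)
```

Because the constant does not depend on the speaker, ranking target models by `approx_score` gives the same order as ranking by `matched_cross_entropy`, and the cost no longer depends on the length of the test audio.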

Mapping Speech Segments into Segment Space
GMM scoring approximation 4/4
Results
Figure and table taken from: H. Aronowitz, D. Burshtein, “Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)”, in IEEE Trans. on Audio, Speech & Language Processing, September 2007.

Other Mapping Techniques
1. Anchor modeling projection [Sturim 2001]
• efficient but inaccurate
2. MLLR transforms [Stolcke 2005]
• accurate but inefficient
3. Kernel-PCA-based mapping [Aronowitz 2007c]
• Given a set of objects and a kernel function (a dot product between each pair of objects), finds a mapping of the objects into R^n which preserves the kernel function
• accurate & efficient


Intra-Class Variability Modeling [Aronowitz 2005b]
Introduction
The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
• channel, noise
• language
• stress, emotion, aging
The frame independence assumption does not hold in these cases! (1)
Instead, we can use a more relaxed assumption: (2)
which leads to: (3)
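The three equations are missing from this copy. One way to write the relaxed assumption, consistent with the generative model used in the talk (a speaker is a PDF over GMM space, a session GMM G is drawn from it, and frames are generated independently from G), is sketched below. Treat it as an assumed reconstruction, not the slide's exact notation.

```latex
% (1) classic assumption: frames independent given the speaker S
\Pr(Y \mid S) = \prod_{t} \Pr(y_t \mid S)

% (2) relaxed assumption: frames independent given the session GMM G
\Pr(Y \mid G) = \prod_{t} \Pr(y_t \mid G)

% (3) consequence: marginalize over the session GMM
\Pr(Y \mid S) = \int \Pr(Y \mid G)\,\Pr(G \mid S)\,dG
```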

Old vs. New Generative Models
Old model
• Speaker: a GMM
• Frame sequence: generated independently
New model
• Speaker: a PDF over GMM space
• Session GMM: a GMM
• Frame sequence: generated independently

Session-GMM Space
Figure: session-GMM space with clusters for speaker #1, speaker #2, and speaker #3; the GMMs for session A and session B of speaker #1 are marked.

Modeling in Session-GMM Space 1/2
Recall mapping T induced by the GMM approximation analysis:
• The image of a session GMM under T is called a supervector
• A speaker is modeled by a multivariate normal distribution in supervector space: (3)
• A typical dimension of the covariance matrix is 50,000 × 50,000
• The covariance matrix is estimated robustly using PCA + regularization: it is assumed to be a low rank matrix with an additional non-zero (noise) diagonal
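A low rank plus diagonal covariance can be estimated and inverted without ever forming the full matrix. The sketch below is one possible realization, assuming an isotropic noise diagonal (as in probabilistic PCA) and delta supervectors that are already session supervectors minus their speaker mean; the talk does not specify these details, and the function names are hypothetical.

```python
import numpy as np

def fit_lowrank_plus_diag(deltas, rank, noise_floor=1e-6):
    """Fit Sigma = U diag(lam) U^T + noise * I from (N, D) delta supervectors."""
    n, d = deltas.shape
    # SVD of the data gives the PCA directions without building a D x D matrix.
    _, s, vt = np.linalg.svd(deltas, full_matrices=False)
    eigvals = s**2 / n
    # Average the discarded variance over the remaining dimensions (assumption).
    noise = max(float(eigvals[rank:].sum()) / max(d - rank, 1), noise_floor)
    U = vt[:rank].T                                   # (D, rank), orthonormal columns
    lam = np.maximum(eigvals[:rank] - noise, 0.0)     # variance above the noise level
    return U, lam, noise

def mahalanobis_sq(x, U, lam, noise):
    """x^T Sigma^{-1} x using the closed-form inverse for orthonormal U."""
    proj = U.T @ x
    return float((x @ x - np.sum(lam / (lam + noise) * proj**2)) / noise)
```

With D around 50,000 and a rank of a few hundred, `mahalanobis_sq` costs one thin matrix-vector product instead of a solve against a 50,000 × 50,000 matrix.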

Modeling in Session-GMM Space 2/2
Estimating covariance matrix
Figure: labeled session points for speaker #1 and speaker #3, shown in supervector space and in delta supervector space.

Experimental Setup
Datasets
• The covariance matrix is estimated from the NIST-2006-SRE corpus
• Evaluation is done on the NIST-2004-SRE corpus
System description
• ETSI MFCC (13-cep + 13-delta-cep)
• Energy based voice activity detector
• Feature warping
• 2048 Gaussians
• Target models are adapted from GI-UBM
• ZT-norm score normalization

Results
38% reduction in EER

Other Modeling Techniques
• NAP+SVMs [Campbell 2006]
• Factor Analysis [Kenny 2005]
• Kernel-PCA [Aronowitz 2007c]
Kernel-PCA based algorithm
• Model each supervector as s + u, with s ∈ S (common speaker subspace) and u ∈ U (speaker unique subspace)
• S is spanned by a set of development supervectors (700 speakers)
• U is the orthogonal complement of S in supervector space
• Intra-speaker variability is modeled separately in S and in U
• U was found to be more discriminative than S
• EER was reduced by 44% compared to baseline GMM

Kernel-PCA Based Modeling
Figure: sessions x and y in session space are mapped by the kernel-induced mapping f to f(x) and f(y) in feature space; K-PCA over the anchor sessions gives Tx and Ty in the common speaker subspace (R^n) and ux and uy in the speaker unique subspace.


Trainable Speaker Diarization [Aronowitz 2007d]
Goals
• Detect speaker changes: “speaker segmentation”
• Cluster speaker segments: “speaker clustering”
Motivation for new method
Current algorithms do not exploit available training data (besides tuning thresholds, etc.)!
Method
Explicitly model inter-segment intra-speaker variability from labeled training data, and use it to define the metric used by change-detection / clustering algorithms.

Speaker recognition on pairs of 3 s segments
Dev data
• BNAD 05 (5 hr): Arabic, broadcast news
Eval data
• BNAT 05: Arabic, broadcast news (207 target models, 6756 test segments)
System                                                             EER (%)
Anchor modeling (baseline)                                         15.1
Anchor modeling - Kernel based scoring                             10.8
Kernel-PCA projection (CSS)                                         8.8
Kernel-PCA projection (CSS) + inter-segment variability modeling    7.4

Speaker Diarization System & Experiments
Speaker change detection
• 2 adjacent sliding windows (3 s each)
• Speaker verification scoring + normalization
Speaker clustering
• Speaker verification scoring + normalization
• Bottom-up clustering
Speaker Error Rate (SER) on BNAT 05
• Anchor modeling (baseline): 12.9%
• Kernel-PCA based method: 7.9%
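The sliding-window change detector can be sketched as below. Two loud assumptions: the Euclidean distance between window means is only a stand-in for the speaker verification scoring and normalization used in the talk, and a 3 s window is taken to be 300 frames (a 10 ms frame shift, which the slide does not state). Function names are hypothetical.

```python
import numpy as np

def change_scores(features, win):
    """Score every candidate boundary b by comparing the two adjacent windows
    features[b-win:b] and features[b:b+win]; edges without full windows get -inf."""
    t, d = features.shape
    csum = np.vstack([np.zeros((1, d)), np.cumsum(features, axis=0)])
    scores = np.full(t, -np.inf)
    for b in range(win, t - win + 1):
        left = (csum[b] - csum[b - win]) / win
        right = (csum[b + win] - csum[b]) / win
        scores[b] = np.linalg.norm(left - right)      # stand-in for a verification score
    return scores

def detect_changes(scores, threshold, min_gap):
    """Greedy peak picking: keep the strongest boundaries above the threshold
    that are at least min_gap frames apart."""
    picked = []
    for b in np.argsort(scores)[::-1]:
        if scores[b] < threshold:
            break
        if all(abs(int(b) - p) >= min_gap for p in picked):
            picked.append(int(b))
    return sorted(picked)
```

The resulting segments would then be passed to bottom-up clustering, with the same pairwise score deciding which two clusters to merge next.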


Summary 1/2
• A method for mapping speech segments into a GMM supervector space was described
• Intra-speaker inter-session variability is modeled in GMM supervector space
Speaker recognition
• EER was reduced by 38% on the NIST-2004 SRE
• A corresponding kernel-PCA based approach reduces EER by 44%
Speaker diarization
• SER for speaker diarization was reduced by 39%
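The percentages above are relative reductions. As a quick check, the 39% diarization figure matches the SER values reported on BNAT 05 (12.9% for the anchor modeling baseline, 7.9% for the kernel-PCA based method); the helper name is illustrative.

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - improved) / baseline

# SER on BNAT 05: 12.9% (baseline) vs 7.9% (kernel-PCA based method)
print(f"{relative_reduction(12.9, 7.9):.1f}%")  # prints 38.8%, reported as 39%
```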

Summary 2/2
Algorithms based on the proposed framework
• Speaker recognition [Aronowitz 2005b; Aronowitz 2007c]
• Speaker diarization (“who spoke when”) [Aronowitz 2007d]
• VAD (voice activity detection) [Aronowitz 2007a]
• Language identification [Noor & Aronowitz 2006]
• Gender identification [Bocklet 2008]
• Age detection [Bocklet 2008]
• Channel/bandwidth classification [Aronowitz 2007d]

Bibliography 1/2
[1] D. A. Reynolds et al., “Speaker identification and verification using Gaussian mixture speaker models”, Speech Communications, 17, 91-108.
[2] … models”, in Proc. ICASSP, 2001.

Bibliography 2/2
[9] A. Stolcke et al., “MLLR Transforms as Features in Speaker Recognition”, in Proc. Interspeech, 2005.
[10] E. Noor, H. Aronowitz, “Efficient Language Identification using Anchor Models and Support Vector Machines”, in Proc. ISCA Odyssey Workshop, 2006.
[11] W. M. Campbell et al., “SVM Based Speaker Verification Using a GMM Supervector Kernel and NAP Variability Compensation”, in Proc. ICASSP, 2006.
[12] H. Aronowitz, “Segmental modeling for audio segmentation”, in Proc. ICASSP, 2007.
[13] J. R. Hershey and P. A. Olsen, “Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models”, in Proc. ICASSP, 2007.
[14] H. Aronowitz, D. Burshtein, “Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)”, in IEEE Trans. on Audio, Speech & Language Processing, September 2007.
[15] H. Aronowitz, “Speaker Recognition using Kernel-PCA and Intersession Variability Modeling”, in Proc. Interspeech, 2007.
[16] H. Aronowitz, “Trainable Speaker Diarization”, in Proc. Interspeech, 2007.
[17] T. Bocklet et al., “Age and Gender Recognition for Telephone Applications Based on GMM Supervectors and Support Vector Machines”, in Proc. ICASSP, 2008.

Thanks!
Presentation is available online at: http://aronowitzh.googlepages.com/

Backup slides

Kernel-PCA Based Mapping 2/5
Figure: the kernel trick. Sessions x and y in session space are mapped by f() to f(x) and f(y) in the dot-product feature space; anchor sessions are marked.
Goals:
- Map sessions into feature space
- Model in feature space

Kernel-PCA Based Mapping 3/5
Given
- kernel K
- n anchor sessions
Find an orthonormal basis for the subspace spanned by the anchor sessions in feature space
Method
1) Compute eigenvectors of the centralized kernel-matrix k_{i,j} = K(A_i, A_j).
2) Normalize eigenvectors by square-roots of corresponding eigenvalues → {v_i}
3) … for … is the requested basis
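The method above follows standard kernel PCA, which can be sketched as follows. The basis formula in step 3 is missing from this copy, so the code uses the textbook construction (basis vector i is the combination of centered anchor images weighted by v_i) as an assumption. Function and key names are hypothetical.

```python
import numpy as np

def fit_kernel_pca(K, tol=1e-10):
    """K: (n, n) kernel matrix over the anchor sessions, K[i, j] = K(A_i, A_j)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H                                    # step 1: centralized kernel matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)
    keep = eigvals > tol * eigvals.max()              # drop numerically null directions
    V = eigvecs[:, keep] / np.sqrt(eigvals[keep])     # step 2: normalize by sqrt(eigenvalue)
    # Column i of V holds the weights of the centered anchor images in basis vector i.
    return {"V": V, "col_mean": K.mean(axis=0), "tot_mean": float(K.mean())}

def project(model, k_x):
    """Map a session x to T(x), given k_x[j] = K(x, A_j) for every anchor A_j."""
    k_c = k_x - k_x.mean() - model["col_mean"] + model["tot_mean"]
    return model["V"].T @ k_c
```

As a usage check with a linear kernel, where the feature map is the identity, `project` reproduces centered dot products: for anchors stored as rows of `A`, `project(model, A @ x) @ project(model, A @ y)` equals `(x - m) @ (y - m)` with `m = A.mean(axis=0)`, provided the anchors span the space.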

Kernel-PCA Based Mapping 4/5
Figure labels: common speaker subspace, speaker unique subspace
Given sessions x, y, f(x) and f(y) may be uniquely represented as:
T is a mapping x→R^n with the property:
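The two formulas on this slide are missing from this copy. The relations below follow from projecting f(x) orthogonally onto the common speaker subspace with orthonormal basis {e_i}; they are offered as a reconstruction under that assumption (kernel centering is omitted for brevity), not as the slide's exact notation.

```latex
% unique decomposition into common-subspace and speaker-unique parts
f(x) = \sum_{i=1}^{n} T(x)_i \, e_i + u_x , \qquad u_x \perp e_i \ \text{for all } i

% the kernel splits across the two subspaces
K(x, y) = \langle f(x), f(y) \rangle
        = \langle T(x), T(y) \rangle + \langle u_x, u_y \rangle

% so distances in the speaker unique subspace need only kernel evaluations
\lVert u_x - u_y \rVert^2 = K(x,x) + K(y,y) - 2K(x,y) - \lVert T(x) - T(y) \rVert^2
```

The last line is what makes modeling in the speaker unique subspace practical: u_x is never computed explicitly.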

Kernel-PCA Based Mapping 5/5
Figure: sessions x and y in session space are mapped to f(x) and f(y) in feature space; K-PCA over the anchor sessions gives Tx and Ty in the common speaker subspace (R^n) and ux and uy in the speaker unique subspace.

Modeling in Segment-GMM Supervector Space
Figure: a frame sequence divided into segment #1, segment #2, …, segment #n, shown alongside segment-GMM supervector space with speech, silence, and music labels.

Segmental Modeling for Audio Segmentation
Goal
• Segment audio accurately and robustly into speech / silence / music segments.
Novel idea
• Acoustic modeling is usually done on a frame-basis.
• Segmentation/classification is usually done on a segment-basis (using smoothing).
Why not explicitly model whole segments?
Note: speaker, noise, music-context, channel (etc.) are constant during a segment.

Speech / Silence Segmentation: Results 1/2
EVAL 06
System            EER     FA @ FR=0.5%   FR @ FA=1%
GMM baseline      2.9%    7.9%           29.6%
Segmental         1.7%    5.1%           2.7%
Error reduction   41%     35%            91%
Slide annotation: FA=24.2% @ FR=0.25%

Speech / Silence Segmentation: Results 2/2
EVAL 06
System            EER     FA @ FR=0.5%   FR @ FA=1%
GMM baseline      1.43%   3.4%           3.2%
Segmental         1.27%   2.0%           1.9%
Error reduction   11%     41%
Slide annotation: FA=69% @ FR=0.25%

LID in Session Space
Figure: session space with English, French, and Arabic regions; training sessions and a test session are marked.

LID in Session Space - Algorithm
1. Front end: shifted delta cepstrum (SDC).
2. Represent every train/test session by a GMM super-vector.
3. Train a linear SVM to classify GMM super-vectors.
4. Results
• EER=4.1% on the NIST-03 Eval (30 sec sessions).

Anchor Modeling Projection
Given: anchor models λ_1, …, λ_n and session X = x_1, …, x_F
Projection: … = average normalized log-likelihood
• Speaker indexing [Sturim et al., 2001]
• Intersession variability modeling in projected space [Collet et al., 2005]
• Speaker clustering [Reynolds et al., 2004]
• Speaker segmentation [Collet et al., 2006]
• Language identification [Noor and Aronowitz, 2006]
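The projection formula is missing from this copy. A minimal sketch of an anchor projection is given below, assuming diagonal-covariance GMM anchors and assuming that "normalized" means subtracting the average log-likelihood under a background model; both are assumptions, and the function names are hypothetical.

```python
import numpy as np

def avg_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of session X under a diagonal GMM."""
    diff = X[:, None, :] - means[None, :, :]                        # (F, G, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    lj = np.log(weights) + log_norm - 0.5 * np.sum(diff**2 / variances, axis=2)
    m = lj.max(axis=1)
    return float(np.mean(m + np.log(np.exp(lj - m[:, None]).sum(axis=1))))

def anchor_projection(X, anchors, ubm):
    """Map session X to R^n: one background-normalized average log-likelihood
    per anchor model. Each model is a (weights, means, variances) tuple."""
    base = avg_loglik(X, *ubm)
    return np.array([avg_loglik(X, *anchor) - base for anchor in anchors])
```

Sessions of any length map to a fixed n-dimensional vector, which is what makes indexing, clustering, and SVM classification in the projected space possible.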

Intra-Class Variability Modeling
Introduction
The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
• Noise
• Channel
• Language
• Changing speaker characteristics: stress, emotion, aging
The frame independence assumption does not hold in these cases! (1)
Instead, we get: (2)