
Introduction to Speaker Diarization
Date: 2007/08/16
Speaker: Shih-Sian Cheng, Institute of Information Science, Academia Sinica, Taiwan

Outline
• Speaker diarization
    • Problem formulation
    • A prototypical speaker diarization system
• Speaker segmentation
    • Problem formulation
    • Speaker segmentation using a fixed-size analysis window
    • Speaker segmentation using a variable-size analysis window
        • Bottom-up segmentation using BIC
        • Top-down segmentation using BIC
• Speaker clustering
    • Problem formulation
    • Hierarchical agglomerative clustering
    • Optimization-oriented approaches
• Two leading speaker diarization systems
    • LIMSI’s system
    • Cambridge’s system

Speaker diarization (Problem formulation)
• Problem formulation: the “who spoke when” task on a continuous audio stream (NIST RT-03 Spring Eval.)
[Figure: an audio stream processed by speaker segmentation and then speaker clustering, with the segments labeled Speaker 1, Speaker 2, and Speaker 3]

Speaker diarization (Problem formulation)
    • Performance measure of the speaker diarization task (C. Barras et al., 2006; NIST RT-03 Spring Eval.)
        • Find the mapping between reference speakers and hypothesis speakers that maximizes their total overlap in time. In the slide's example, S1 -> A and S3 -> B.
    • Applications
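The mapping step can be illustrated with a small sketch. It assumes each speaker is given as a list of (start, end) segments in seconds and searches all one-to-one mappings by brute force, which is only practical for a handful of speakers. The function names and data layout are illustrative and are not taken from the NIST scoring tools.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def best_mapping(reference, hypothesis):
    """Map reference speakers to hypothesis speakers with maximal total overlap.

    reference / hypothesis: dict of speaker label -> list of (start, end).
    Returns (mapping, total_overlap); unmapped reference speakers map to None.
    """
    ref_ids, hyp_ids = sorted(reference), sorted(hypothesis)
    # Total time overlap for every reference/hypothesis speaker pair.
    table = {(r, h): sum(overlap(a, b)
                         for a in reference[r] for b in hypothesis[h])
             for r in ref_ids for h in hyp_ids}
    # Pad with None so that any reference speaker may stay unmapped.
    slots = hyp_ids + [None] * len(ref_ids)
    best, best_score = {}, -1.0
    for perm in permutations(slots, len(ref_ids)):
        score = sum(table[(r, h)] for r, h in zip(ref_ids, perm)
                    if h is not None)
        if score > best_score:
            best, best_score = dict(zip(ref_ids, perm)), score
    return best, best_score

if __name__ == "__main__":
    ref = {"S1": [(0, 10)], "S2": [(10, 14)], "S3": [(14, 30)]}
    hyp = {"A": [(0, 12)], "B": [(12, 30)]}
    print(best_mapping(ref, hyp))
```

With three reference speakers and two hypothesis speakers, one reference speaker necessarily stays unmapped, which mirrors the S1 -> A, S3 -> B situation on the slide.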

Speaker diarization (Problem formulation)
    • Example: automatic transcription for a broadcast news show
[Figure: transcription example; labels: “By speaker recognition”, “Speaker adaptation + speech recognition”]

Speaker diarization (A prototypical system)
• The prototypical speaker diarization system (S. E. Tranter & D. A. Reynolds, 2006)
    • A first stage filters out non-speech data
    • Speaker segmentation (usually over-segmentation)
    • Speaker clustering
    • Change boundary refinement

Speaker segmentation (Problem formulation)
• Problem formulation: detect the speaker change boundaries
    • Performance measure
        • Error types: miss detection and false alarm
        • Performance metrics: ROC curve; F-score, F = 2PR / (P + R), where P is the precision rate and R is the recall rate
[Figure: target changes compared with hypothesized changes, with one miss detection and one false alarm marked]
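As a rough sketch of how these metrics could be computed, the code below matches each hypothesized change to at most one target change within a tolerance and derives precision, recall, and F-score. The 0.5-second tolerance and the greedy matching rule are assumptions made for illustration; the slides do not specify either.

```python
def change_detection_scores(targets, hypotheses, tolerance=0.5):
    """Precision, recall and F-score for hypothesized change points.

    A hypothesis counts as a hit if it lies within `tolerance` seconds of a
    target change that has not been matched yet (greedy one-to-one matching).
    Unmatched targets are miss detections; unmatched hypotheses are false alarms.
    """
    unmatched = sorted(targets)
    hits = 0
    for h in sorted(hypotheses):
        near = [t for t in unmatched if abs(t - h) <= tolerance]
        if near:
            unmatched.remove(min(near, key=lambda t: abs(t - h)))
            hits += 1
    precision = hits / len(hypotheses) if hypotheses else 0.0
    recall = hits / len(targets) if targets else 0.0
    f_score = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_score

if __name__ == "__main__":
    print(change_detection_scores([5.0, 12.0, 20.0], [5.2, 9.0, 12.4, 30.0]))
```

In the usage example, two of the four hypothesized changes are hits, so two are false alarms and one target change (at 20.0) is missed.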

Speaker segmentation (Fixed-size analysis window approach)
• Speaker segmentation using a fixed-size analysis window (Siegler et al., 1997)
[Figure: sliding windows over the data stream -> distance computation -> distance curve]
    • Distance measure of two segments
        • Kullback-Leibler (KL) distance (Siegler et al., 1997)
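The sliding-window idea can be sketched as follows. This is a simplified illustration, not the cited system: it assumes each window is modeled by a single diagonal-covariance Gaussian and uses the symmetric form KL(x||y) + KL(y||x), for which the log terms cancel. The window and step sizes are arbitrary.

```python
def gaussian_fit(frames):
    """Per-dimension mean and ML variance (diagonal-covariance assumption)."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(d)]
    var = [max(sum((f[k] - mean[k]) ** 2 for f in frames) / n, 1e-6)
           for k in range(d)]
    return mean, var

def symmetric_kl(x, y):
    """KL(x||y) + KL(y||x) between Gaussians fitted to two windows."""
    (mx, vx), (my, vy) = gaussian_fit(x), gaussian_fit(y)
    return sum(0.5 * (vx[k] / vy[k] + vy[k] / vx[k] - 2.0)
               + 0.5 * (mx[k] - my[k]) ** 2 * (1.0 / vx[k] + 1.0 / vy[k])
               for k in range(len(mx)))

def distance_curve(frames, window=100, step=10):
    """Distance between the two adjacent windows around each candidate point."""
    return [(t, symmetric_kl(frames[t - window:t], frames[t:t + window]))
            for t in range(window, len(frames) - window + 1, step)]

if __name__ == "__main__":
    # Synthetic 1-D stream whose statistics change at frame 300.
    stream = ([[(i % 10) / 10.0] for i in range(300)]
              + [[4 + (i % 10) / 10.0] for i in range(300)])
    curve = distance_curve(stream)
    print(max(curve, key=lambda point: point[1]))
```

Peaks of the distance curve are the candidate change points; a real system would add peak picking and a threshold on top of this.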

Speaker segmentation (Fixed-size analysis window approach)
        • SVM training error (王駿發 et al., 2005): an SVM is trained to separate the frames of segment X from those of segment Y. The more X and Y overlap, the larger the training error; the larger the distance between them, the less similar the two segments.
[Figure: two scatter plots of X and Y, one overlapping and one well separated]

Speaker segmentation (Fixed-size analysis window approach)
        • ΔBIC (S. Chen et al., 1998; P. Delacourt et al., 2000)
Bayesian information criterion (BIC) for model selection:
    • Data set: X = \{x_1, \dots, x_N\}
    • Candidate models: M_1, \dots, M_K
    • Model selection by BIC: choose the model with the largest
          \mathrm{BIC}(M_i) = \log L(X, M_i) - \lambda \cdot \frac{1}{2} \#(M_i) \log N
      where L(X, M_i) is the maximum likelihood of X for model M_i and \#(M_i) is the number of parameters of M_i. λ = 1 in the BIC theory, but it is usually tuned to trade off between the two error types.

Speaker segmentation (Fixed-size analysis window approach)
Use BIC as an inter-segment distance.
Given two audio segments represented by feature vectors X and Y, the two segments can be judged to be under the same or different acoustic conditions via the following hypothesis test:
    H_0: X and Y are generated by a single model
    H_1: X and Y are generated by two separate models
    \Delta\mathrm{BIC} = \mathrm{BIC}(H_1) - \mathrm{BIC}(H_0)
X and Y are judged to be from the same acoustic condition if ΔBIC < 0, and from different acoustic conditions if ΔBIC ≥ 0.
[Figure: example with Seg X and Seg Y from different acoustic conditions (ΔBIC ≥ 0) and from the same acoustic condition (ΔBIC < 0)]
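A minimal sketch of ΔBIC as an inter-segment distance is given below. It assumes each hypothesis uses Gaussian models with diagonal covariance, so H_1 has 2d more parameters than H_0 for d-dimensional features; with full covariances the extra parameter count would be d + d(d+1)/2. The helper names are illustrative.

```python
import math

def log_var_sum(frames):
    """Sum of log ML variances, i.e. log-determinant of a diagonal covariance."""
    n = len(frames)
    total = 0.0
    for column in zip(*frames):          # iterate over feature dimensions
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        total += math.log(max(var, 1e-6))  # floor avoids log(0)
    return total

def delta_bic(x, y, lam=1.0):
    """BIC(H1: two Gaussians) - BIC(H0: one Gaussian) for segments x and y.

    Positive values favour "different acoustic conditions".
    """
    z = x + y
    d = len(z[0])
    # Log-likelihood gain of modelling x and y separately.
    gain = 0.5 * (len(z) * log_var_sum(z)
                  - len(x) * log_var_sum(x)
                  - len(y) * log_var_sum(y))
    # Penalty for the 2d extra parameters (d means + d variances).
    penalty = 0.5 * lam * (2 * d) * math.log(len(z))
    return gain - penalty

if __name__ == "__main__":
    seg_x = [[(i % 10) / 10.0] for i in range(200)]
    seg_y = [[5 + (i % 10) / 10.0] for i in range(200)]
    print(delta_bic(seg_x, list(seg_x)), delta_bic(seg_x, seg_y))
```

Raising `lam` makes the test more reluctant to declare a change, which is the tuning knob the slides mention for trading off miss detections against false alarms.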

Speaker segmentation (Variable-size analysis window approach)
• Speaker segmentation using a variable-size analysis window
    • Bottom-up detection using BIC (S. Chen and P. Gopalakrishnan, 1998; M. Cettolo et al., 2005)
        • The bottom-up detection process on an audio stream
[Figure: audio stream divided into Seg 1 to Seg 4; each change point is found by one-change-point detection]

Speaker segmentation (Variable-size analysis window approach)
        • One-change-point detection using BIC: at each feature vector, split the feature vectors of the analysis window into a left part X and a right part Y and calculate ΔBIC between them.
[Figure: ΔBIC curve over the feature vectors of the window]
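The scan described on this slide could look like the sketch below. It assumes diagonal-covariance Gaussians and takes the split with the largest positive ΔBIC as the change point; the `margin` parameter, which keeps both parts long enough to estimate a variance, is an assumption of this example.

```python
import math

def log_var_sum(frames):
    """Sum of log ML variances (diagonal-covariance log-determinant)."""
    n = len(frames)
    total = 0.0
    for column in zip(*frames):
        mean = sum(column) / n
        total += math.log(max(sum((v - mean) ** 2 for v in column) / n, 1e-6))
    return total

def delta_bic(x, y, lam=1.0):
    """BIC(two Gaussians) - BIC(one Gaussian); positive favours a change."""
    z = x + y
    gain = 0.5 * (len(z) * log_var_sum(z)
                  - len(x) * log_var_sum(x) - len(y) * log_var_sum(y))
    return gain - 0.5 * lam * 2 * len(z[0]) * math.log(len(z))

def find_change_point(frames, margin=20, lam=1.0):
    """Index of the split with the largest ΔBIC, or None if no ΔBIC > 0."""
    best_t, best_val = None, 0.0
    for t in range(margin, len(frames) - margin + 1):
        val = delta_bic(frames[:t], frames[t:], lam)
        if val > best_val:
            best_t, best_val = t, val
    return best_t

if __name__ == "__main__":
    window = ([[(i % 10) / 10.0] for i in range(150)]
              + [[5 + (i % 10) / 10.0] for i in range(100)])
    print(find_change_point(window))
```

In a bottom-up system, a `None` result would typically lead to enlarging the analysis window, while a detected change would restart the search from that point.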

Speaker segmentation (Variable-size analysis window approach)
    • Top-down detection using BIC (C. H. Wu and C. H. Hsieh, 2006; M. Cettolo et al., 2005)
        • The top-down detection process for an audio stream
[Figure: multiple-change detection applied to the whole audio stream, yielding Seg 1, Seg 2, Seg 3, and Seg 4]

Speaker segmentation (Variable-size analysis window approach)
        • Multiple-change detection using BIC
Assumption: different segments arise from different Gaussian processes.
[Figure: audio stream X containing Seg 1 to Seg 4, with candidate segmentation hypotheses H_0, H_1, H_2, and H_3]
Intuitively, pr(X|H_0) < pr(X|H_1) < pr(X|H_2) < pr(X|H_3), but once the BIC penalty is applied, BIC(X|H_2) > BIC(X|H_3) > BIC(X|H_1) > BIC(X|H_0).
Multiple-change detection: search for the hypothesis H that has the largest BIC value in the solution space.
    • Exhaustive search

Speaker segmentation (Variable-size analysis window approach)
    • Top-down, hierarchical search (C. H. Wu and C. H. Hsieh, 2006): a sub-optimal search
[Figure: audio stream X split over pass 1 and pass 2 into Seg 1 to Seg 4, after which the search terminates]
    • Dynamic programming (M. Cettolo et al., 2005): an optimal search
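One way to picture the top-down, hierarchical search is recursive binary splitting: find the best single change point in a segment, split there if ΔBIC > 0, and repeat on both halves until no split is accepted. The sketch below follows that generic idea under a diagonal-covariance Gaussian assumption; it is not a reproduction of the cited MDL-based algorithm.

```python
import math

def log_var_sum(frames):
    """Sum of log ML variances (diagonal-covariance log-determinant)."""
    n = len(frames)
    total = 0.0
    for column in zip(*frames):
        mean = sum(column) / n
        total += math.log(max(sum((v - mean) ** 2 for v in column) / n, 1e-6))
    return total

def delta_bic(x, y, lam=1.0):
    """BIC(two Gaussians) - BIC(one Gaussian); positive favours a change."""
    z = x + y
    gain = 0.5 * (len(z) * log_var_sum(z)
                  - len(x) * log_var_sum(x) - len(y) * log_var_sum(y))
    return gain - 0.5 * lam * 2 * len(z[0]) * math.log(len(z))

def best_split(frames, margin, lam):
    """Split index with the largest positive ΔBIC, or None."""
    best_t, best_val = None, 0.0
    for t in range(margin, len(frames) - margin + 1):
        val = delta_bic(frames[:t], frames[t:], lam)
        if val > best_val:
            best_t, best_val = t, val
    return best_t

def top_down_segment(frames, offset=0, margin=20, lam=1.0):
    """Recursively split while the best single split has ΔBIC > 0.

    Returns the sorted list of change-point indices in the original stream.
    """
    t = best_split(frames, margin, lam)
    if t is None:                       # terminate: no split is supported
        return []
    left = top_down_segment(frames[:t], offset, margin, lam)
    right = top_down_segment(frames[t:], offset + t, margin, lam)
    return left + [offset + t] + right

if __name__ == "__main__":
    def synth(level, n):
        return [[level + (i % 10) / 10.0] for i in range(n)]
    print(top_down_segment(synth(0, 100) + synth(5, 80) + synth(-5, 120)))
```

Because each pass commits to its best single split before looking further, the result is not guaranteed to match the globally best segmentation, which is why the slide calls this search sub-optimal.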

Speaker clustering (Problem formulation)
• Problem formulation
    • Given N speech utterances from P unknown speakers, partition these utterances into M clusters such that M = P and each cluster consists exclusively of utterances from only one speaker.

Speaker clustering (Problem formulation)
    • Cluster purity: the probability that, if we pick an utterance from a cluster twice at random, with replacement, both selected utterances are from the same speaker:
          \rho_m = \sum_{p=1}^{P} \frac{n_{mp}^2}{n_{m*}^2}
      Purity increases as the number of clusters increases.
        • P: total number of speakers involved
        • M: total number of clusters
        • \rho_m: purity of the m-th cluster
        • n_{m*}: number of utterances in the m-th cluster
        • n_{*p}: number of utterances from the p-th speaker
        • n_{mp}: number of utterances in the m-th cluster that are from the p-th speaker
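The purity definition translates directly into counting. The sketch below computes the purity of each cluster from parallel lists of cluster and speaker labels; the overall figure it returns is a size-weighted average of the cluster purities, which is an assumption of this example since the slide text does not spell out how the average is formed.

```python
from collections import Counter

def cluster_purities(cluster_ids, speaker_ids):
    """Return ({cluster: purity}, size-weighted average purity).

    cluster_ids[i] and speaker_ids[i] are the cluster and the true speaker
    of utterance i.
    """
    sizes = Counter(cluster_ids)                     # n_m*
    joint = Counter(zip(cluster_ids, speaker_ids))   # n_mp
    purity = {m: 0.0 for m in sizes}
    for (m, _speaker), n_mp in joint.items():
        # Probability of drawing this speaker twice, with replacement.
        purity[m] += (n_mp / sizes[m]) ** 2
    average = sum(sizes[m] * purity[m] for m in sizes) / len(cluster_ids)
    return purity, average

if __name__ == "__main__":
    clusters = [0, 0, 0, 0, 1, 1]
    speakers = ["a", "a", "a", "b", "b", "b"]
    print(cluster_purities(clusters, speakers))
```

Putting every utterance in its own cluster gives a purity of 1 for every cluster, which illustrates why purity alone cannot be used to choose the number of clusters.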

Speaker clustering (Problem formulation)
    • Rand index
      Two error types:
        • Type I: the number of utterance pairs (with replacement) in the same cluster but from different speakers
        • Type II: the number of utterance pairs (with replacement) from the same speaker but in different clusters

      speaker \ cluster    1       2       ...    M       Sum
      1                    n_11    n_21    ...    n_M1    n_*1
      2                    n_12    n_22    ...    n_M2    n_*2
      ...                  ...     ...     ...    ...     ...
      P                    n_1P    n_2P    ...    n_MP    n_*P
      Sum                  n_1*    n_2*    ...    n_M*    N

      Type I error = (number of utterance pairs in the same cluster) - (number of utterance pairs from the same speaker that are in the same cluster):
          \sum_{m=1}^{M} n_{m*}^2 - \sum_{m=1}^{M} \sum_{p=1}^{P} n_{mp}^2
      Type II error = (number of utterance pairs from the same speaker) - (number of utterance pairs from the same speaker that are in the same cluster):
          \sum_{p=1}^{P} n_{*p}^2 - \sum_{m=1}^{M} \sum_{p=1}^{P} n_{mp}^2
      The Rand index is the sum of the two error counts:
          R = \sum_{m=1}^{M} n_{m*}^2 + \sum_{p=1}^{P} n_{*p}^2 - 2 \sum_{m=1}^{M} \sum_{p=1}^{P} n_{mp}^2
      It reaches its minimum only when M = P.
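The two error counts and the Rand index can be computed from the same contingency counts. The sketch below assumes parallel lists of cluster and speaker labels and counts ordered pairs with replacement, as in the definitions above; the function name is illustrative.

```python
from collections import Counter

def rand_index_errors(cluster_ids, speaker_ids):
    """Return (type_1, type_2, rand_index) for a clustering.

    Pairs are ordered and drawn with replacement, so a group of n
    utterances contributes n * n pairs.
    """
    same_cluster = sum(n * n for n in Counter(cluster_ids).values())
    same_speaker = sum(n * n for n in Counter(speaker_ids).values())
    same_both = sum(n * n
                    for n in Counter(zip(cluster_ids, speaker_ids)).values())
    type_1 = same_cluster - same_both   # same cluster, different speakers
    type_2 = same_speaker - same_both   # same speaker, different clusters
    return type_1, type_2, type_1 + type_2

if __name__ == "__main__":
    clusters = [0, 0, 0, 0, 1, 1]
    speakers = ["a", "a", "a", "b", "b", "b"]
    print(rand_index_errors(clusters, speakers))
```

Unlike purity, over-clustering is penalized here through the Type II count, so the index is zero only for a perfect partition.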

Speaker clustering (Hierarchical agglomerative clustering)
• Hierarchical agglomerative clustering (S. Chen and P. Gopalakrishnan, 1998; Barras et al., 2006)
    • Distance of two clusters: ΔBIC
    • Stopping criteria:
        • Local BIC
        • Global BIC
[Figure: utterances X_1, X_2, ..., X_N merged step by step into larger clusters]
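A compact sketch of agglomerative clustering with ΔBIC as the inter-cluster distance follows. It assumes each cluster is modeled by one diagonal-covariance Gaussian and uses a local stopping rule: merging stops when even the closest pair has ΔBIC ≥ 0. The global BIC criterion mentioned on the slide is not implemented here.

```python
import math

def log_var_sum(frames):
    """Sum of log ML variances (diagonal-covariance log-determinant)."""
    n = len(frames)
    total = 0.0
    for column in zip(*frames):
        mean = sum(column) / n
        total += math.log(max(sum((v - mean) ** 2 for v in column) / n, 1e-6))
    return total

def delta_bic(x, y, lam=1.0):
    """BIC(two Gaussians) - BIC(one Gaussian); negative favours merging."""
    z = x + y
    gain = 0.5 * (len(z) * log_var_sum(z)
                  - len(x) * log_var_sum(x) - len(y) * log_var_sum(y))
    return gain - 0.5 * lam * 2 * len(z[0]) * math.log(len(z))

def agglomerative_cluster(segments, lam=1.0):
    """Cluster segments (lists of frames); returns lists of segment indices."""
    clusters = [([i], list(seg)) for i, seg in enumerate(segments)]
    while len(clusters) > 1:
        # Distance between every pair of current clusters.
        pairs = [(delta_bic(clusters[a][1], clusters[b][1], lam), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        score, a, b = min(pairs)
        if score >= 0:          # local stopping rule: no pair should merge
            break
        merged = (clusters[a][0] + clusters[b][0],
                  clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(sorted(ids) for ids, _frames in clusters)

if __name__ == "__main__":
    def synth(level, n=50):
        return [[level + (i % 10) / 10.0] for i in range(n)]
    print(agglomerative_cluster([synth(0), synth(5), synth(0), synth(5)]))
```

Recomputing all pairwise distances after each merge keeps the sketch short; practical systems cache distances that are unaffected by the merge.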

Speaker clustering (Optimization-oriented approaches)
• Optimization-oriented approaches
    • Maximum purity clustering (W. H. Tsai et al., IEEE Trans. ASLP, 2007)
        • For a given number of clusters and a set of cluster indices H = [h_1, h_2, ..., h_N] for N utterances X_1, X_2, ..., X_N, the average cluster purity is computed from H and from \delta(o_i, o_j), where o_i is the true speaker index of utterance X_i (1 ≤ o_i ≤ P) and \delta(o_i, o_j) indicates whether X_i and X_j come from the same speaker.
        • \delta(o_i, o_j), the ground truth, is unknown and needs to be estimated. It is approximated using the following quantities:
            • S(X_i, X_j): similarity between utterances X_i and X_j
            • R[S(X_i, X_j)]: rank of the inter-utterance similarity S(X_i, X_j) among S(X_i, X_1), S(X_i, X_2), ..., S(X_i, X_N) in descending order
            • The utterance most similar to X_i is the one whose similarity to X_i has rank 2, since rank 1 is taken by X_i itself.
[Figure: example of the m-th cluster with n_m = 4]

Speaker clustering (Optimization-oriented approaches)
        • With the purity estimated in this way, use a genetic algorithm to find the set of cluster indices H* that maximizes the estimated purity.
        • Use BIC to determine the number of clusters.
    • Minimum Rand index clustering (W. H. Tsai and H. M. Wang, Proc. ICASSP, 2007): the grouping of utterances and the determination of the number of groups are both performed within the optimization process.
        • \delta(o_i, o_j), the ground truth, is unknown and needs to be estimated.

Speaker clustering (Optimization-oriented approaches)
        • \delta(o_i, o_j) is approximated by \hat{\delta}(o_i, o_j), a normalized inter-utterance similarity based on the generalized likelihood ratio, where S_max is the maximum among the similarities S(X_i, X_j), i ≠ j.
        • The estimated Rand index for a clustering H^{(M)} with M clusters is
              \hat{R}(H^{(M)}) = \sum_{i=1}^{N} \sum_{j=1}^{N} \delta(h_i^{(M)}, h_j^{(M)}) + W - 2 \sum_{i=1}^{N} \sum_{j=1}^{N} \delta(h_i^{(M)}, h_j^{(M)}) \, \hat{\delta}(o_i, o_j)
        • Use a genetic algorithm to find the H* that minimizes \hat{R}.

Two leading systems
• LIMSI’s system (Barras et al., 2006)
[Figure: system flow diagram; the stages are annotated as follows]
    • Initial speech detection: remove only long regions without speech, such as silence, music, and noise, using GMMs
    • Fixed-size sliding window segmentation
    • Boundary refinement
    • Clustering that uses ΔBIC to measure the inter-cluster similarity
    • Boundary refinement: align the change boundaries to silence portions
    • Clustering that uses the cross-likelihood ratio to measure the inter-cluster similarity, where M_i is a MAP-adapted GMM
    • Final filtering of short-duration silence segments that were not removed in the initial speech detection step

Two leading systems
• Cambridge’s system (Sinha et al., 2005)
    • SD: speech detection
    • CPD: change point detection
    • IAC: iterative agglomerative clustering
    • Speaker identification (SID) clustering: MAP adaptation (mean-only) is applied to each cluster from the appropriate gender/bandwidth UBM, and clusters are compared using the cross likelihood ratio (CLR) between any two given clusters.

Reference
• C. Barras, X. Zhu, S. Meignier, and J.-L. Gauvain, “Multistage Speaker Diarization of Broadcast News,” IEEE Transactions on Audio, Speech, and Language Processing, Special Issue on Rich Transcription, vol. 14, no. 5, pp. 1505-1512, 2006.
• NIST 2003 Spring, http://www.nist.gov/speech/tests/rt/rt2003/spring/
• R. Sinha, S. E. Tranter, M. J. F. Gales, and P. C. Woodland, “The Cambridge University March 2005 Speaker Diarization System,” INTERSPEECH 2005.
• S. E. Tranter and D. A. Reynolds, “An Overview of Automatic Speaker Diarisation Systems,” IEEE Transactions on Audio, Speech, and Language Processing, Special Issue on Rich Transcription, 2006.
• S. Chen and P. Gopalakrishnan, “Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion,” in Proc. DARPA Broadcast News Transcription and Understanding Workshop, 1998.
• C. H. Wu and C. H. Hsieh, “Multiple Change-Point Audio Segmentation and Classification Using an MDL-based Gaussian Model,” IEEE Transactions on Audio, Speech, and Language Processing, 2006.
• M. Cettolo, M. Vescovi, and R. Rizzi, “Evaluation of BIC-based Algorithms for Audio Segmentation,” Computer Speech and Language, 2005.
• M. Siegler, U. Jain, B. Raj, and R. Stern, “Automatic Segmentation, Classification and Clustering of Broadcast News Audio,” in Proc. DARPA Speech Recognition Workshop, 1997.
• P. Delacourt and C. J. Wellekens, “DISTBIC: A Speaker-based Segmentation for Audio Data Indexing,” Speech Communication, vol. 32, pp. 111-126, 2000.
• 王駿發, 林博川, 王家慶, 宋豪靜, “A Novel Speaker Change Detection Algorithm Based on Support Vector Machines” (in Chinese), in Proc. ROCLING 2005.

Reference
• W.-H. Tsai, S.-S. Cheng, and H.-M. Wang, “Automatic Speaker Clustering Using a Voice Characteristic Reference Space and Maximum Purity Estimation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1461-1474, May 2007.
• W.-H. Tsai and H.-M. Wang, “Speaker Clustering Based on Minimum Rand Index,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP 2007), April 2007.

Thank You