Digital Processing of Speech Signals: Static Analysis (4) – EM Algorithm and Gaussian Mixture Model
Dong Wang, CSLT, Tsinghua Univ., 2017.04

Copyright Note: This presentation is partly adapted from the following materials:
– Professor Taiwen Yu’s “EM Algorithm”
– Professor Andrew W. Moore’s “Clustering with Gaussian Mixtures”
– Presentation of Haiguang Li, University of Vermont, 2011

Contents
1. Introduction
2. Gaussian mixture model
3. EM: main body
4. EM algorithm running on GMM

Model Speech signals with Gaussians
• Speech signals can be modeled by Gaussian distributions, but a single Gaussian is probably not enough.
• Solution: mixtures of Gaussians.

Gaussian mixture model
For a single Gaussian, it is simple to estimate the mean and covariance. For a mixture of Gaussians, how do we estimate the parameters?

Starting from Gaussian Sampling

Maximum Likelihood
Given the samples x, the likelihood is a function of μ and σ². We want to maximize it.
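For concreteness, a minimal statement of that likelihood, assuming N i.i.d. scalar samples x₁, …, x_N drawn from a single Gaussian:

\[
L(\mu,\sigma^2) \;=\; \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_n-\mu)^2}{2\sigma^2}\right)
\]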

Log-Likelihood Function
Maximize this instead, by setting ∂ℓ/∂μ = 0 and ∂ℓ/∂σ² = 0.
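Written out, the log-likelihood being maximized is

\[
\ell(\mu,\sigma^2) \;=\; \ln L(\mu,\sigma^2) \;=\; -\frac{N}{2}\ln\!\left(2\pi\sigma^2\right) \;-\; \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n-\mu)^2 .
\]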

Max. the Log-Likelihood Function
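Setting the two derivatives to zero gives the familiar closed-form maximum-likelihood estimates (stated here for the scalar case):

\[
\hat{\mu} \;=\; \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad
\hat{\sigma}^2 \;=\; \frac{1}{N}\sum_{n=1}^{N}\left(x_n-\hat{\mu}\right)^2 .
\]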

Gaussian Mixture Model (GMM)
• Now extend to a cluster of Gaussians, i.e., the Gaussian mixture model (GMM).
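In symbols, with mixture weights φ_k (the same φ used in the estimation procedure below):

\[
p(x) \;=\; \sum_{k=1}^{K} \varphi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad
\sum_{k=1}^{K}\varphi_k = 1, \quad \varphi_k \ge 0 .
\]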

Gaussian mixture model
• The major difficulty of GMM is that we do not know which Gaussian component a training sample belongs to. In other words, we are missing a variable z that labels each sample with its component.
• K-means approach
  – Assign a sample to its ‘closest’ Gaussian.
• Soft assignment
  – Assign a sample to each Gaussian with an associated weight.

Iterative GMM estimation
1. Initialize the model parameters {φi, μi, Σi}.
2. Compute the soft assignment (weight) of each training sample to each component.
3. Optimize the model parameters with the soft assignments.
4. Repeat (2)-(3) until some criterion is reached, e.g., a maximum number of iterations or a small change in the parameters.
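As an illustration of this loop, a minimal NumPy/SciPy sketch for a full-covariance GMM; the function name em_gmm, the initialization choices, and the small 1e-6 diagonal term are my own assumptions, not taken from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    """Illustrative EM for a full-covariance GMM on data X of shape (N, D)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # 1. Initialize: random data points as means, shared data covariance, uniform weights.
    means = X[rng.choice(N, K, replace=False)].copy()
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    weights = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # 2. E-step: soft assignment of every sample to every component.
        resp = np.stack([weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
                         for k in range(K)], axis=1)            # shape (N, K)
        ll = np.sum(np.log(resp.sum(axis=1)))                    # data log-likelihood
        resp /= resp.sum(axis=1, keepdims=True)
        # 3. M-step: re-estimate parameters from the soft assignments.
        Nk = resp.sum(axis=0)                                    # effective counts
        weights = Nk / N
        means = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        # 4. Stop on a small change in log-likelihood (or after n_iter iterations).
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return weights, means, covs
```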

But, we are not sure…
• It is really a naïve idea:
  – Does the above process converge?
  – Does it find the optimum?
  – Can it deal with more complex models, and how?
• We want a more mathematical explanation.

A generative representation
• Graphical model: for each of the N samples, a hidden component label z is drawn according to the mixture weights φ, and the observation x is drawn from the Gaussian (μ, Σ) selected by z. Only x is observed; z is hidden.
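In symbols, the generative process depicted by this graphical model (standard GMM notation):

\[
z_n \sim \mathrm{Categorical}(\varphi), \qquad
x_n \mid z_n = k \;\sim\; \mathcal{N}(\mu_k, \Sigma_k), \qquad n = 1,\dots,N .
\]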

EM…
• We seek a general solution for models with hidden variables {zi}, given observations X.
• Iterative estimation with two phases:
  – Estimate the posterior probabilities of the hidden variables, {p(zi|x)}, from the visible variables (training data) and the current model parameters.
  – Softly combine the observed and hidden variables according to the probabilities obtained in the first step to form the complete data, then maximize the model parameters on the complete data.

EM in brief
• The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin.
• They pointed out that the method had been "proposed many times in special circumstances" by earlier authors.
• EM is typically used to compute maximum likelihood estimates given incomplete samples.
• The EM algorithm estimates the parameters of a model iteratively.
  – Starting from some initial guess, each iteration consists of
    • an E step (Expectation step)
    • an M step (Maximization step)

EM Clustering Algorithm

EM Clustering Algorithm (2)

EM for GMM
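Spelled out, the two steps for a GMM are the standard updates below; γ_{nk} denotes the soft assignment (responsibility) of sample n to component k:

E-step:
\[
\gamma_{nk} \;=\; \frac{\varphi_k\,\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K}\varphi_j\,\mathcal{N}(x_n \mid \mu_j, \Sigma_j)}
\]

M-step:
\[
N_k = \sum_{n=1}^{N}\gamma_{nk}, \qquad
\varphi_k = \frac{N_k}{N}, \qquad
\mu_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}\,x_n, \qquad
\Sigma_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}\,(x_n-\mu_k)(x_n-\mu_k)^{\mathsf{T}} .
\]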

What’s K-means?
• It is a simplified GMM, with hard assignment.
• The assignment considers only distance, not covariance, i.e., it assumes a shared covariance.
• It is a special case of EM:
  – E: assign samples to clusters.
  – M: estimate the parameters of the new clusters (just the means).

M-step of K-means
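For reference, the standard K-means updates in the same notation (the M-step is the second equation):

\[
c_n \;=\; \arg\min_{k}\,\lVert x_n - \mu_k \rVert^2 \quad\text{(E-step: hard assignment)}, \qquad
\mu_k \;=\; \frac{1}{\lvert\{n : c_n = k\}\rvert}\sum_{n:\,c_n=k} x_n \quad\text{(M-step: new mean)} .
\]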

More mathematics
This is the EXPECTATION, and we want to maximize it!
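The expectation in question, written in textbook EM notation with current parameters θ^(t):

\[
Q\!\left(\theta \mid \theta^{(t)}\right) \;=\; \mathbb{E}_{Z \mid X,\, \theta^{(t)}}\!\left[\ln p(X, Z \mid \theta)\right]
\;=\; \sum_{Z} p\!\left(Z \mid X, \theta^{(t)}\right)\ln p(X, Z \mid \theta),
\qquad
\theta^{(t+1)} \;=\; \arg\max_{\theta}\, Q\!\left(\theta \mid \theta^{(t)}\right).
\]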

Maximize expectation

Applications
• Filling in missing data in samples
• Discovering the values of latent variables
• Estimating the parameters of HMMs
• Estimating the parameters of finite mixtures
• Unsupervised learning of clusters
• …

GMM in speech processing
• Speech processing: noise removal, voice activity detection, emotion detection, …
• Speech recognition: the former state of the art for acoustic modeling

GMM in speech processing
• Speaker recognition: the former state of the art
  – GMM-UBM framework
• Voice conversion
• Speech synthesis: model feature generation

Demos • Matlab demo

Question #1
• What are the main advantages of parametric methods?
  – You can easily change the model to adapt to different data distributions.
  – Knowledge representation is very compact. Once the model is selected, it is represented by a fixed number of parameters, and this number does not grow with the amount of training data.

Question #2
• What are the EM algorithm initialization methods?
  – Random guess.
  – Initialization by k-means: after a few iterations of k-means, use the resulting parameters to initialize EM.
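A possible scikit-learn sketch of the k-means option; the function kmeans_init_gmm, the choice of five k-means iterations, and the use of full covariances are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def kmeans_init_gmm(X, K, seed=0):
    """Run a few k-means iterations, then hand the result to EM as its starting point."""
    km = KMeans(n_clusters=K, n_init=1, max_iter=5, random_state=seed).fit(X)
    counts = np.bincount(km.labels_, minlength=K)
    gmm = GaussianMixture(
        n_components=K,
        means_init=km.cluster_centers_,   # k-means centroids as initial means
        weights_init=counts / len(X),     # relative cluster sizes as initial weights
        covariance_type="full",
        random_state=seed,
    )
    return gmm.fit(X)
```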

Question #3
• What are the differences between EM and K-means (or VQ)?
  – K-means is a simplified EM.
  – K-means makes a hard decision, while EM makes a soft decision, when updating the parameters of the model.

Question #4
• How to train a healthy GMM?
  – Choose an appropriate number of components, perhaps using a development set or some regularization, such as the Bayesian information criterion (BIC).
  – Choose an appropriate structure: diagonal or full covariance?
  – Be careful to floor the covariance!
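One hedged way to combine these three points with scikit-learn; the function select_gmm_by_bic, the diagonal-covariance choice, and the 1e-4 floor are illustrative assumptions rather than recommendations from the slides:

```python
from sklearn.mixture import GaussianMixture

def select_gmm_by_bic(X, max_components=16, seed=0):
    """Fit GMMs with 1..max_components components and keep the lowest-BIC model."""
    best_gmm, best_bic = None, float("inf")
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(
            n_components=k,
            covariance_type="diag",   # diagonal covariance: cheaper, often sufficient
            reg_covar=1e-4,           # added to each variance, acting as a floor
            random_state=seed,
        ).fit(X)
        bic = gmm.bic(X)
        if bic < best_bic:
            best_gmm, best_bic = gmm, bic
    return best_gmm
```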

Question #5
• What is the state of the art?
  – Discriminative training
  – Bayesian approaches
  – Non-parametric Bayesian approaches, such as the Dirichlet process GMM (DPGMM)

Question #6
• How are GMMs used in speech processing?
  – UBM-GMM in speaker recognition
  – HMM-GMM in speech recognition
  – HMM-GMM in text-to-speech
  – GMM in voice conversion

References
1. Dempster, A. P.; Laird, N. M.; Rubin, D. B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B (Methodological) 39(1): 1–38. JSTOR 2984875. MR 0501537.
2. Sundberg, Rolf (1974). "Maximum likelihood theory for incomplete data from an exponential family". Scandinavian Journal of Statistics 1(2): 49–58. JSTOR 4615553. MR 381110.
3. Sundberg, Rolf (1971). Maximum likelihood theory and applications for distributions generated when observing a function of an exponential family variable. Dissertation, Institute for Mathematical Statistics, Stockholm University.
4. Sundberg, Rolf (1976). "An iterative method for solution of the likelihood equations for incomplete data from exponential families". Communications in Statistics – Simulation and Computation 5(1): 55–64. doi:10.1080/03610917608812007. MR 443190.
5. See the acknowledgement by Dempster, Laird and Rubin on pages 3, 5 and 11.
6. Kulldorff, G. (1961). Contributions to the theory of estimation from grouped and partially grouped samples. Almqvist & Wiksell.
7. Martin-Löf, Anders (1963). "Utvärdering av livslängder i subnanosekundsområdet" ("Evaluation of sub-nanosecond lifetimes"). ("Sundberg formula")
8. Martin-Löf, Per (1966). Statistics from the point of view of statistical mechanics. Lecture notes, Mathematical Institute, Aarhus University. ("Sundberg formula", credited to Anders Martin-Löf)
9. Martin-Löf, Per (1970). Statistiska Modeller (Statistical Models): Anteckningar från seminarier läsåret 1969–1970 (Notes from seminars in the academic year 1969–1970), with the assistance of Rolf Sundberg. Stockholm University. ("Sundberg formula")
10. Martin-Löf, P. (1974). "The notion of redundancy and its use as a quantitative measure of the deviation between a statistical hypothesis and a set of observational data". With a discussion by F. Abildgård, A. P. Dempster, D. Basu, D. R. Cox, A. W. F. Edwards, D. A. Sprott, G. A. Barnard, O. Barndorff-Nielsen, J. D. Kalbfleisch and G. Rasch, and a reply by the author. Proceedings of the Conference on Foundational Questions in Statistical Inference (Aarhus, 1973), pp. 1–42. Memoirs No. 1, Dept. of Theoretical Statistics, Institute of Mathematics, University of Aarhus, 1974.
11. Martin-Löf, Per (1974). "The notion of redundancy and its use as a quantitative measure of the discrepancy between a statistical hypothesis and a set of observational data". Scandinavian Journal of Statistics 1(1): 3–18.
12. Wu, C. F. Jeff (1983). "On the Convergence Properties of the EM Algorithm". Annals of Statistics 11(1): 95–103. doi:10.1214/aos/1176346060. JSTOR 2240463. MR 684867.
13. Neal, Radford; Hinton, Geoffrey (1999). "A view of the EM algorithm that justifies incremental, sparse, and other variants". In Michael I. Jordan (ed.), Learning in Graphical Models. Cambridge, MA: MIT Press, pp. 355–368. ISBN 0262600323.
14. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2001). "8.5 The EM algorithm". The Elements of Statistical Learning. New York: Springer, pp. 236–243. ISBN 0-387-95284-5.
15. Jamshidian, Mortaza; Jennrich, Robert I. (1997). "Acceleration of the EM Algorithm by Using Quasi-Newton Methods". Journal of the Royal Statistical Society, Series B (Statistical Methodology) 59(2): 569–587. doi:10.1111/1467-9868.00083. MR 1452026.
16. Meng, Xiao-Li; Rubin, Donald B. (1993). "Maximum likelihood estimation via the ECM algorithm: A general framework". Biometrika 80(2): 267–278. doi:10.1093/biomet/80.2.267. MR 1243503.
17. Hunter, D. R.; Lange, K. (2004). "A Tutorial on MM Algorithms". The American Statistician 58: 30–37.