Improved Deep Speaker Feature Learning for Text-Dependent Speaker Recognition
Lantian Li, CSLT / RIIT, Tsinghua University, lilt13@mails.tsinghua.edu.cn
Co-work with Dong Wang, Yiye Lin and Zhiyong Zhang
Dec. 3, 2016
Outline • Introduction • Improved Deep Feature Learning • Experiments • Conclusions
Rich information contained in the speech signal
• Speaker Recognition: Who spoke?
• Speech Recognition: What was spoken?
• Language Recognition: What language was spoken?
• Accent Recognition: Where is he/she from?
• Emotion Recognition: Positive? Negative? Happy? Sad?
• Gender Recognition: Male or Female?
Speaker recognition history
[Timeline figure, 1930-2010: features evolved from the spectrogram to LPCC, PLP and MFCC; models evolved from template matching and DTW to VQ, HMM, GMM-UBM, GMM-SVM, JFA/i-vector, and deep speaker features, moving from small, clean speech data to big data and practical deep learning.]
Introduction
• Speaker recognition systems
  - Human-crafted acoustic features (e.g. MFCC)
  - Statistical models (e.g. GMM-UBM (Reynolds 2000), JFA/i-vector (Kenny 2007))
• Discriminative models
  - SVM for GMM-UBM (Campbell 2006)
  - PLDA for i-vector (Ioffe 2006)
Introduction
• Deep feature learning: the d-vector (Ehsan 2014)
• Drawbacks of the d-vector on text-dependent tasks
  - Simple input feature: no phone content information
  - Simple average scoring: ignores the temporal constraint
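The baseline d-vector pipeline criticized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `dnn_forward` stands in for the trained DNN that maps a stacked-frame input to its last hidden-layer activations, and cosine similarity is the standard d-vector trial score.

```python
import numpy as np

def extract_dvector(frame_features, dnn_forward):
    """Extract a d-vector by simple average pooling of frame-level
    deep features. `dnn_forward` is assumed to return the last
    hidden-layer activations for one stacked-frame input."""
    frame_vectors = np.stack([dnn_forward(f) for f in frame_features])
    return frame_vectors.mean(axis=0)  # averaging discards temporal order

def cosine_score(enroll_dv, test_dv):
    """Cosine similarity between enrollment and test d-vectors."""
    return float(np.dot(enroll_dv, test_dv) /
                 (np.linalg.norm(enroll_dv) * np.linalg.norm(test_dv)))
```

Because the mean is invariant to frame order, two phrases with the same phones in a different order can score identically, which is exactly the temporal-constraint problem the improved scoring strategies address.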
Outline • Introduction • Improved Deep Feature Learning • Experiments • Conclusions
Improved Deep Feature Learning • Phonetic-aware training
Improved Deep Feature Learning • Segment pooling and dynamic time warping (DTW) (Berndt 1994)
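The DTW scoring idea can be illustrated with the textbook dynamic-programming recurrence (Berndt 1994). This is a minimal sketch using Euclidean frame distances, not the paper's exact scoring function; in the system, the sequences would be frame-level deep features from the enrollment and test utterances.

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between two feature sequences
    x (length n) and y (length m), each a list of d-dim vectors.
    Unlike average pooling, the alignment respects temporal order."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
```

Segment pooling sits between the two extremes: averaging within n segments keeps coarse temporal structure at lower cost than full DTW alignment.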
Outline • Introduction • Improved Deep Feature Learning • Experiments • Conclusions
Experiments
• Database
  - 100 speakers, 10 short phrases; each phrase has 150 utterances per speaker.
  - Dev. set: 80 speakers and 12,000 utterances, used to train the DNN model / UBM / T matrix / LDA / PLDA.
  - Eva. set: 20 speakers, 2,100 target trials and 42,750 non-target trials for each phrase.
• Experimental setup
  - i-vector system: 39-dim MFCCs, 128-component UBM, 200-dim i-vectors.
  - d-vector system: 40-dim Fbanks, 10 left and 10 right context frames, 200 dims in each hidden layer.
Experiments
• Baseline
• Observations
  - The i-vector outperforms the d-vector.
  - LDA/PLDA suits the i-vector but has little effect on the d-vector.
  - The d-vector is a 'discriminative' vector.
Experiments
• Phone-dependent learning
• Descriptions
  - A DNN model was trained for ASR on a 6,000-hour Chinese database.
  - The phone set consists of 66 initials and finals in Chinese.
  - The 'DNN+PT' system yields a marginal but consistent performance improvement.
Experiments
• Segment pooling and DTW
• Illustrations
  - Segment pooling ('DNN+PT+seg-n') generally outperforms 'DNN+PT'.
  - 'DNN+PT+DTW' offers a clear performance improvement over segment pooling.
Experiments
• System combination
• Descriptions
  - Combine the best i-vector system (PLDA) and the best d-vector system (DNN+PT+DTW) at the score level:
    s = alpha * s_i-vector + (1 - alpha) * s_d-vector,
    where alpha is the interpolation factor.
  - The combination leads to the best performance.
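Score-level fusion amounts to a one-line linear interpolation. In this sketch the default alpha of 0.5 is an assumption; in practice the interpolation factor would be tuned on the development set.

```python
def fuse_scores(s_ivector, s_dvector, alpha=0.5):
    """Score-level linear interpolation of the i-vector and d-vector
    subsystems. alpha (the interpolation factor) weights the
    i-vector score; 1 - alpha weights the d-vector score."""
    return alpha * s_ivector + (1.0 - alpha) * s_dvector
```

Note that the two subsystem scores should be on comparable scales (e.g. both normalized) before interpolation, otherwise one system dominates regardless of alpha.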
Outline • Introduction • Improved Deep Feature Learning • Experiments • Conclusions
Conclusions
• A phone-dependent DNN structure.
• Two scoring strategies
  - Segment pooling
  - Dynamic time warping
• System combination
References
• D. Reynolds, T. Quatieri, and R. Dunn (Reynolds 2000), "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1, pp. 19-41, 2000.
• P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel (Kenny 2007), "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 1435-1447, 2007.
• P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Speaker and session variability in GMM-based speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 1448-1460, 2007.
• P. Kenny, V. Gupta, T. Stafylakis, P. Ouellet, and J. Alam (Kenny 2014), "Deep neural networks for extracting Baum-Welch statistics for speaker recognition," Odyssey, 2014.
• W. Campbell, D. Sturim, and D. Reynolds (Campbell 2006), "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, vol. 13, no. 5, pp. 308-311, 2006.
• S. Ioffe (Ioffe 2006), "Probabilistic linear discriminant analysis," Computer Vision - ECCV 2006, Springer Berlin Heidelberg, pp. 531-542, 2006.
• T. Kinnunen and H. Li (Kinnunen 2010), "An overview of text-independent speaker recognition: From features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12-40, 2010.
• V. Ehsan, L. Xin, M. Erik, L. M. Ignacio, and G.-D. Javier (Ehsan 2014), "Deep neural networks for small footprint text-dependent speaker verification," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 28, no. 4, pp. 357-366, 2014.
• D. Berndt and J. Clifford (Berndt 1994), "Using dynamic time warping to find patterns in time series," KDD Workshop, vol. 10, no. 16, pp. 359-370, 1994.
Deep learning in speaker recognition
• Multi-task Recurrent Model for Speech and Speaker Recognition
  Lantian Li+, Zhiyuan Tang+, Dong Wang, "Multi-task Recurrent Model for Speech and Speaker Recognition", arXiv:1603.09643, 2016.
Deep learning in speaker recognition
Deep learning in speaker recognition End-to-end architecture for speaker recognition
Thank you
Homepage: lilt.cslt.org
Dec. 3, 2016