Speaker Recognition J S Roger Jang jangmirlab org

Speaker Recognition 語者辨識 J. -S. Roger Jang (張智星) jang@mirlab. org http: //mirlab. org/jang MIR Lab, CSIE Dept. National Taiwan University 2021/2/21

Outline 計畫主題與目標 Speaker recognition � Speaker Identification � Speaker verification Current work � SV based on digit recordings � SV based on SRE-2010 預定達成之目標預計交付之項目 2/15

Tasks of Speaker Recognition Two tasks in speaker recognition (SR) � Speaker identification (SI, 語者識別 � Speaker verification (SV, 語者驗證 ) ) Input types � Text-dependent input Higher accuracy The user needs to say pre-recorded spoken passwords. � Text-related input The user needs to say texts prompted by the system. � Text-independent input The user can say anything. Lower accuracy General steps in SR � Training, enrollment, evaluation 3/15

Application Scenarios of SI & SV SI � Criminal ID � Phone fraud ID � Person ID via digital assistants at home SV � 顧客來電確認 Speaker diarization Text-independent speaker verification � 銀行支付 Text-related speaker verification � 大樓出入門禁 Text-dependent speaker verification 4/15

Performance Indices of SV Various performance indices of SV � Accuracy or EER (equal error rate) � Confusion matrix Specificity (= TN/(FP+TN)) > 99. 9% DET curves False acceptance rate (FAR) < 0. 1% Sensitivity (= TP/(TP+FN)) > 95% False rejection rate (FRR) < 5% Prediction P N TN FP P FN TP N Groundtruth N Prediction P N > 99. 9% < 0. 1% P < 5% > 95% Typical values! 7/15

Performance Indices of SI Mostly used performance indice for SV � Accuracy If we use SI for SV Better SI implies better SV � Given accuracy of SI = 99% SV accuracy: Specificity = [100 -1/(N-1)]% if the imposter is in database Sensitivity = 99% 8/15

計畫主題與目標 Topic: Speaker verification (SV) � Using various methods � Under various app. scenarios Test conditions � Text-dependent（每一句包含 4至 8個中文字）. � Each speaker has a different set of 10 speech passwords with 3 recordings each. � Single recording device for each speaker. � Normal office environment. Goals � Specificity (= TN/(FP+TN)) > 99. 9% � Sensitivity (= TP/(TP+FN)) > 95% 9/15

Methods/Features for Speaker Recognition Classification methods � Neural networks (NN) DNN (deep NN) CNN (convolution NN) RNN (recurrent NN) � K-means clustering � Fuzzy c-means clustering LSTM (long short-term memory) � Others GMM (Gaussian mixture models) SVM (Support vector machine) p. LDA … Clustering methods �… Features � MFCC � i-Vector �… 10/15

Past Work: SI with Input of Spoken Passwords Thesis � 劉玉情，文本相關之語者識別及其不佳輸入之濾除機制，清大碩士論文，2009 Corpus Result 12/15

Past Work: Text-independent SI (2/3) Result of TIMIT 14/15

Past Work: Text-independent SI (3/3) Results of NTIMIT Results of CTIMIT 15/15

Flowchart of i-vectors Basic flowchart of using i-vectors � Train UBM, TV space and PLDA model using MFCC from training data � Extract enrollment i-vectors and test i-vectors � Score with PLDA model 16/15

Current Work SV based on digit recordings SV based on SRE-10 � Gender-specific classifiers � End-to-end methods 17/15

Thank you for listening! Questions & comments welcome! 20/15