PUMS 2003-2004 seminar, 14.10.2004, Turku

PUMS 2003-2004 seminar, 14.10.2004, Turku
Speaker Recognition
Pasi Fränti, Juhani Saastamoinen, Evgeny Karpov, Ville Hautamäki, Tomi Kinnunen, Ismo Kärkkäinen
University of Joensuu, Department of Computer Science
P.O. Box 111, FIN-80101 Joensuu, Tel. +358 13 251 7959, fax +358 13 251 7955, www.cs.joensuu.fi

Research Group, PUMS project
• Pasi Fränti: Professor
• Ismo Kärkkäinen: Clustering algorithms
• Juhani Saastamoinen: Project manager
• Evgeny Karpov: Project researcher
• Tomi Kinnunen: Researcher
• Ville Hautamäki: Project researcher

PUMS & JoY
• Speaker Recognition
• PUMS season 2003-2004:
  – Identification, no verification
  – Port it to a mobile phone
  – Feature fusion
  – Real-time
• http://cs.joensuu.fi/pages/pums

Application Scenarios
• Speaker Recognition = Speaker Verification + Speaker Identification
  – Identification: Whose voice is this?
  – Verification: Is this Bob’s voice? (voice + claim), e.g. "Imposter!"

Identification System
• Speech audio -> Signal Processing -> Feature Vectors
• Speaker Modelling: add trained speaker profiles to the Speaker Profile Database
• Recognition: use all profiles in recognition; min. MSE within DB over input speech
• Decision
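
The "min. MSE within DB" recognition step can be illustrated with a minimal sketch. It assumes that each speaker profile is a small codebook of code vectors and that the score is the average squared distance from each input feature vector to its nearest code vector. The speaker names, the 2-D toy vectors and the function names are hypothetical and are not taken from the PUMS software.

```python
def mean_squared_error(vectors, codebook):
    """Average squared distance from each feature vector to its nearest code vector."""
    total = 0.0
    for v in vectors:
        total += min(sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook)
    return total / len(vectors)


def identify(vectors, database):
    """Return the name of the speaker whose codebook gives the smallest distortion."""
    return min(database, key=lambda name: mean_squared_error(vectors, database[name]))


# Hypothetical toy database: speaker name -> codebook of 2-D code vectors.
database = {
    "alice": [(0.0, 0.0), (1.0, 1.0)],
    "bob": [(5.0, 5.0), (6.0, 4.0)],
}
test_vectors = [(5.2, 4.9), (5.9, 4.1), (5.1, 5.3)]
print(identify(test_vectors, database))
```

Every profile in the database is scored here, which corresponds to the "use all profiles in recognition" step of the slide.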

Results 2003-2004
• Applications: SpeakerProfiler, sprofiler, Winsprofiler, Epocsprofiler
• Application labels in the diagram: TCL/TK (HY), console UI, Windows console UI, Series 60
• ProfMatch: common speaker recognition app. interface
• Fusion, Real-time
• srlib, DB, Speech features (HY)

Planned Results
• Large scale database
• Applications: access control, teleconference, mobile phone login?
• Results 2003-2004: sprofiler, Winsprofiler, Epocsprofiler, SpeakerProfiler
• ProfMatch: common speaker recognition app. interface
• Fusion, Verification, Real-time, Segmentation, VAD
• srlib, DB, Speech features (HY)

System in Mobile Phone
• Port to Symbian OS with Series 60 UI platform

Symbian Phones
• Series 60 phone features:
  – 16 MB ROM
  – 8 MB RAM
  – 176 x 208 display
  – 32-bit ARM processor
  – No floating-point unit!!!
• Series 60, UIQ, Series 80

FFTGEN
• Multiplication results must fit in 32 bits: truncate multiplication inputs
• FFTGEN: truncate to 16/16 bits (“16/16 FFT”)
  – FFT layer input: 16-bit integer
  – FFT twiddle factor: 16-bit integer
  – 32-bit multiplication result: 16 used bits, 16 crop-off bits
  – FFT layer output (part of it): 16-bit integer
• Crop-off for next layer: 16 bits!
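
A minimal sketch of a 16/16 multiplication follows. It is an illustration under stated assumptions, not FFTGEN's actual code: the Q15 scaling of the twiddle factor (multiplying by 2^15 and clamping) and the helper names `to_q15` and `mul_16_16` are hypothetical choices. It only shows that a product of two 16-bit integers fits in 32 bits and that 16 bits are cropped off before the next layer.

```python
import math

INT32_MAX = 2**31 - 1


def to_q15(value):
    """Quantize a real number in [-1, 1] to a signed 16-bit integer (assumed Q15 scaling)."""
    return max(-32768, min(32767, round(value * 32768)))


def mul_16_16(sample, twiddle):
    """Multiply two signed 16-bit integers and crop off the low 16 bits of the product."""
    product = sample * twiddle            # magnitude at most 2**30, so it fits in 32 bits
    assert abs(product) <= INT32_MAX
    return product >> 16                  # keep the 16 high bits, drop the 16 low bits


sample = 12345                                  # a 16-bit signal value
twiddle = to_q15(math.cos(2 * math.pi / 8))     # cos(pi/4) as a 16-bit integer
result = mul_16_16(sample, twiddle)
# Cropping 16 bits of a Q15 product halves the scale, hence the division by 2.
exact = sample * math.cos(2 * math.pi / 8) / 2
print(result, exact)
```

Only 16 bits of the signal survive each layer in this scheme, which is the loss the next slide sets out to reduce.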

Proposed Information Preserving “22/10 FFT”
• Approximate DFT operator F with G
• Increase ||F-G||, preserve more signal information
  – Minimize maximum relative error in scaled sine values with respect to scale; 980 good for FFT sizes up to 1024
  – Truncate multiplication inputs to 22/10 bits (signal/op)
  – FFT layer input: 32-bit integer, 22 bits used
  – FFT twiddle factor: 16-bit integer, 10 bits used
  – 32-bit multiplication result: 22 used bits, 10 crop-off bits
  – FFT layer output (part of it)
• Crop-off for next layer: 10 bits
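
The 22/10 split can be sketched in the same spirit. This is a hedged illustration, not the project's implementation: it assumes the twiddle factor is stored as round(980 * value), using the scale 980 quoted on the slide, and that the signal value has a magnitude below 2^21 so that the signed product stays within 32 bits. The names `quantize_twiddle` and `mul_22_10` are hypothetical.

```python
import math

TWIDDLE_SCALE = 980   # scale quoted on the slide for FFT sizes up to 1024


def quantize_twiddle(value):
    """Store a sine or cosine value in [-1, 1] as a small integer (magnitude below 2**10)."""
    return round(value * TWIDDLE_SCALE)


def mul_22_10(sample, twiddle):
    """Multiply a 22-bit signal value by a 10-bit twiddle factor, then crop off 10 bits."""
    assert abs(sample) < 2**21 and abs(twiddle) < 2**10
    product = sample * twiddle            # magnitude below 2**31, fits in a signed 32-bit word
    return product >> 10                  # crop-off for the next layer: 10 bits


sample = 1_000_000                                     # fits in 22 bits including sign
twiddle = quantize_twiddle(math.cos(2 * math.pi / 8))  # round(980 * cos(pi/4))
result = mul_22_10(sample, twiddle)
# The result is scaled by 980/1024 relative to the exact product.
exact = sample * math.cos(2 * math.pi / 8) * TWIDDLE_SCALE / 1024
relative_error = abs(result - exact) / exact
print(result, relative_error)
```

The twiddle factor is coarser than in the 16/16 case, but the signal keeps 22 bits instead of 16 between layers.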

Scale of Error in Proposed FFT
Log10 of relative error in FFT elements

                      FFTGEN (16/16)   22/10 FFT
  average                 -0.775         -2.118
  standard deviation       0.797          0.590

Mobile Phone Results
TIMIT, 100 speakers

FLOAT FFTGEN FIXED MIXED 2
recog. rate (%), std. dev. (%): 100.0 N/A 9.7 1.6 95.8 1.2 100.0 98.0 N/A 0.6

  implementation, signal    recog. rate (%)   std. dev. (%)
  FLOAT, Symbian audio           83.2             4.38
  FLOAT, PC audio               100.0             N/A
  FIXED, Symbian audio           76.0             2.83
  FIXED, PC audio               100.0             N/A

Improving Accuracy by Information Fusion
• Feature vector is split into feature sets, each with its own classifier:
  – Feature set 1 (e.g. 5 MFCCs) -> Classifier 1 -> score 1
  – Feature set 2 (e.g. F0 + -F0) -> Classifier 2 -> score 2
  – Feature set 3 (e.g. formants F1, F2, F3) -> Classifier 3 -> score 3
• Score combiner -> Decision
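
The score combiner can be illustrated with a minimal sketch of score-level fusion. The slides do not state the combination rule, so everything specific here is an assumption for illustration: each classifier is taken to output a distance per speaker (smaller is better), the distances are normalized to sum to one, and they are combined with a weighted sum. All names and numbers are hypothetical.

```python
def normalize(scores):
    """Scale one classifier's distances so they sum to 1 (an assumed normalization)."""
    total = sum(scores.values())
    return {speaker: s / total for speaker, s in scores.items()}


def fuse_scores(score_sets, weights=None):
    """Weighted sum of normalized per-classifier distances for each speaker."""
    weights = weights or [1.0] * len(score_sets)
    fused = {}
    for weight, scores in zip(weights, score_sets):
        for speaker, s in normalize(scores).items():
            fused[speaker] = fused.get(speaker, 0.0) + weight * s
    return fused


def decide(fused):
    """Pick the speaker with the smallest fused distance."""
    return min(fused, key=fused.get)


# Hypothetical distances from three classifiers, one per feature set.
mfcc_scores = {"alice": 2.0, "bob": 1.5, "carol": 4.0}
f0_scores = {"alice": 0.9, "bob": 1.0, "carol": 3.0}
formant_scores = {"alice": 5.0, "bob": 2.0, "carol": 6.0}

fused = fuse_scores([mfcc_scores, f0_scores, formant_scores])
print(decide(fused))
```

In this toy data the F0 classifier alone would pick "alice", while the fused decision follows the other two classifiers.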

Information Fusion Results
Legend: fusion successful / fusion unsuccessful

  Feature set         BASELINE:         Feature-level   Score-level   Decision-level
  combination         best individual   fusion          fusion        fusion
  MFCC + MFCC              16.8            15.8            14.6           N/A
  LPCC + LPCC              16.0            19.8            14.7           N/A
  ARCSIN + ARCSIN          17.1            18.2            16.8           N/A
  FMT + FMT                19.4            29.9            52.0           N/A
  All feature sets         16.0            21.2            15.2           12.6

Real-Time Speaker Identification
• Processing chain:
  – Speech input stream
  – Fill buffer with new data
  – Frame blocking (all frames)
  – Silence detection (non-silent frames)
  – Feature extraction (feature vectors)
  – Pre-quantization (reduced set of vectors)
  – Matching against the list of candidate speakers
  – Database pruning (active speakers, pruned speakers)
  – Decision? No: continue with new data. Yes: END
• Speaker database: Speaker 1 model ... Speaker N model
• Speed up NN search: vantage-point tree (VPT) indexing of the code vectors
• Reducing # vectors (pre-quantization):
  1. Averaging
  2. Random sampling
  3. Decimation
  4. Clustering (LBG)
• Reduce # speakers (database pruning):
  1. Static pruning
  2. Hierarchical pruning
  3. Adaptive pruning
  4. Confidence-based pruning
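
Three of the listed ways of reducing the number of vectors can be sketched in a few lines. This is a hedged illustration with hypothetical function names and toy 2-D frames; the slides give no parameter values, and clustering (LBG) is left out to keep the sketch short.

```python
import random


def decimate(vectors, step):
    """Decimation: keep every step-th feature vector."""
    return vectors[::step]


def random_sample(vectors, count, seed=0):
    """Random sampling: keep `count` vectors chosen without replacement."""
    return random.Random(seed).sample(vectors, count)


def average_blocks(vectors, block):
    """Averaging: replace each run of `block` consecutive vectors by its mean vector."""
    reduced = []
    for start in range(0, len(vectors), block):
        chunk = vectors[start:start + block]
        reduced.append(tuple(sum(column) / len(chunk) for column in zip(*chunk)))
    return reduced


# Hypothetical toy input: eight 2-D feature vectors.
frames = [(float(i), float(2 * i)) for i in range(8)]
print(decimate(frames, 4))
print(average_blocks(frames, 4))
print(len(random_sample(frames, 3)))
```

Each function returns fewer vectors than it receives, so the matching step that follows has less work per speaker.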

Results: Baseline System (TIMIT)
(Average length of test utterance = 8.9 s)
• 4 x realtime
• Real-time requirement satisfied

Results: Pre-Quantization (TIMIT)
(Codebook size = 64)
• 9 x realtime
• Averaging performs worst, clustering best
• About 2:1 speed-up over full search (no pre-quantization) without degradation in accuracy

Results: Pruning Variants (TIMIT)
(Codebook size = 64)
• 11 x realtime
• Recommended method: adaptive pruning (AP)
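
The idea of pruning speakers during matching can be sketched as follows. The slides do not give the pruning rule, so the threshold used here (mean plus eta times the standard deviation of the accumulated distances, re-evaluated every few vectors) is an assumption for illustration only, as are the function names, parameters and toy data.

```python
import statistics


def nearest_sq_dist(vector, codebook):
    """Squared distance from a feature vector to its nearest code vector."""
    return min(sum((a - b) ** 2 for a, b in zip(vector, c)) for c in codebook)


def identify_with_pruning(vectors, database, interval=2, eta=0.0):
    """Accumulate distances per speaker and drop unlikely speakers as data arrives."""
    active = {name: 0.0 for name in database}
    for count, vector in enumerate(vectors, start=1):
        for name in active:
            active[name] += nearest_sq_dist(vector, database[name])
        if count % interval == 0 and len(active) > 1:
            scores = list(active.values())
            # Assumed rule: the threshold adapts to the current score distribution.
            threshold = statistics.mean(scores) + eta * statistics.pstdev(scores)
            survivors = {n: s for n, s in active.items() if s <= threshold}
            if survivors:
                active = survivors
        if len(active) == 1:
            break                          # decision reached early
    return min(active, key=active.get)


# Hypothetical toy database: speaker name -> codebook of 2-D code vectors.
database = {
    "alice": [(0.0, 0.0), (1.0, 1.0)],
    "bob": [(5.0, 5.0), (6.0, 4.0)],
    "carol": [(10.0, 10.0), (9.0, 11.0)],
}
test_vectors = [(5.2, 4.9), (5.9, 4.1), (5.1, 5.3), (6.1, 4.2)]
print(identify_with_pruning(test_vectors, database))
```

With this toy data two of the three speakers are dropped after the first two vectors, so the remaining vectors are never matched.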

Results: PQ, Pruning and PQP (TIMIT)
(Codebook size = 64)
• 33 x realtime
• Recommended method: combination of pre-quantization and pruning (PQP)

Results: VQ vs. GMM (TIMIT)
(Average length of test utterance = 8.9 s)
• VQ
  – 13:1 speed-up without degradation
  – Best time: 0.27 s = 33 x realtime @ error rate 0.32 %
  – Smallest error: 0.00 % @ 0.31 s = 28 x realtime
• GMM
  – 9:1 to 10:1 speed-up without degradation
  – Best time: 0.18 s = 49 x realtime @ error rate 0.16 %
  – Smallest error: 0.16 % @ 0.18 s = 49 x realtime

Results: VQ vs. GMM (NIST-1999)
(Average length of test utterance = 30.4 s)
• VQ
  – 13:1 to 16:1 speed-up with minor degradation
  – Best time: 0.48 s = 63 x realtime @ error rate 19.22 %
  – Smallest error: 17.34 % @ 11.4 s = 3 x realtime
• GMM
  – 23:1 to 34:1 speed-up with minor degradation
  – Best time: 0.82 s = 37 x realtime @ error rate 19.36 %
  – Smallest error: 16.90 % @ 37.9 s = 0.8 x realtime
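
The "x realtime" figures appear to be the average utterance length divided by the identification time. That reading is an assumption, since the slides do not define the term; the short check below applies it to some of the quoted times, and the function name is hypothetical.

```python
def realtime_factor(utterance_seconds, processing_seconds):
    """How many times faster than the audio duration the identification runs."""
    return utterance_seconds / processing_seconds


timit_vq = realtime_factor(8.9, 0.27)        # TIMIT, VQ best time
timit_gmm = realtime_factor(8.9, 0.18)       # TIMIT, GMM best time
nist_vq = realtime_factor(30.4, 0.48)        # NIST-1999, VQ best time
nist_gmm_slow = realtime_factor(30.4, 37.9)  # NIST-1999, GMM smallest-error setting
print(round(timit_vq), round(timit_gmm), round(nist_vq), round(nist_gmm_slow, 1))
```

A factor below 1, as in the last case, means the identification takes longer than the utterance itself.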