Formant Measurement Errors From Real Speech
Philip Harrison
J P French Associates & University of York
IAFPA 20th Annual Conference
24th – 28th July 2011 – Vienna

Outline
• Motivation & background
• Formant measurement errors from synthetic speech
• Formant measurement errors from real speech
  – VTR database
  – Praat tracker
  – CAbS tracker
  – Published results – MSR & WaveSurfer
• Discussion

Motivation & Background
• All measurements are subject to ‘error’
• An estimate of the error should accompany all measurements
• Increasing use of formant measurements in forensic casework – no errors quoted
• Significant problem – can’t obtain ‘ground truth’ values from real speech to determine error of

Errors from Synthetic Speech
• Idealised synthetic male speaker
  – 2,858 monophthongs over F1, F2 vowel space
  – Specified F1 to F5 centre frequency & bandwidth
  – Pulse train glottal source, range of F0s (70 – 190 Hz)
• Measured formants at different LPC orders (6 to 20) in Praat
  – Burg (LPC) analysis, not a tracker
• Calculation of error: F_error = F_measured - F_specified
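As a rough illustration of how a vowel with specified formant centre frequencies and bandwidths can be generated from a pulse train, the sketch below passes an impulse train through a cascade of Klatt-style second-order resonators. It is not the synthesiser used in the study: the function names, the 10 kHz sample rate, the 0.2 s duration and the example formant values are all assumptions chosen for illustration.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs):
    """Coefficients of a Klatt-style second-order digital resonator.

    The difference equation is y[n] = a*x[n] + b*y[n-1] + c*y[n-2],
    normalised so that the gain at 0 Hz is 1 (a + b + c == 1).
    """
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c

def resonate(signal, freq_hz, bw_hz, fs):
    """Filter a list of samples through one resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def pulse_train(f0_hz, fs, n_samples):
    """Unit impulses spaced one (rounded) pitch period apart."""
    period = round(fs / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def synth_vowel(formants, f0_hz=100.0, fs=10000, dur_s=0.2):
    """Cascade one resonator per (centre frequency, bandwidth) pair."""
    signal = pulse_train(f0_hz, fs, int(fs * dur_s))
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, fs)
    return signal

# Hypothetical F1 to F5 specification (Hz); not values from the study.
vowel = synth_vowel([(500, 60), (1500, 90), (2500, 120), (3500, 150), (4500, 200)])
```

Because the formant values are specified up front, any measurement made on `vowel` can be compared against a known target, which is the property the synthetic-speech experiments rely on.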

Error Summary Results – Absolute Error @ F0 = 100 Hz

         F1                        F2                        F3
         Abs (SD) Hz  % (SD)       Abs (SD) Hz  % (SD)       Abs (SD) Hz  % (SD)
LPC 7    9.6 (10.4)   2.1 (2.3)    31.6 (19.4)  2.1 (1.2)    99.8 (36.9)  4.2 (1.4)
LPC 8    8.3 (8.8)    1.8 (1.9)    18.3 (11.6)  1.3 (0.8)    41.1 (17.9)  1.7 (0.7)
LPC 9    8.8 (8.7)    2.0 (2.0)    8.0 (9.2)    0.6 (0.7)    10.7 (8.3)   0.5 (0.3)
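A summary of this kind could be computed along the following lines. This is a minimal sketch under stated assumptions: the slides do not say how the SD was obtained, so the sample standard deviation of the absolute errors is assumed here, the percentage error is assumed to be relative to the specified value, and `error_summary` and its dictionary keys are hypothetical names.

```python
from statistics import mean, stdev

def error_summary(measured_hz, specified_hz):
    """Mean and SD of absolute error, in Hz and as a percentage.

    F_error = F_measured - F_specified for each vowel; the summary is
    taken over the absolute values (assumed, see lead-in).
    """
    errors = [m - s for m, s in zip(measured_hz, specified_hz)]
    abs_err = [abs(e) for e in errors]
    pct_err = [100.0 * a / s for a, s in zip(abs_err, specified_hz)]
    return {
        "abs_mean_hz": mean(abs_err),
        "abs_sd_hz": stdev(abs_err),
        "pct_mean": mean(pct_err),
        "pct_sd": stdev(pct_err),
    }

# Made-up F1 measurements against a specified 500 Hz target.
summary = error_summary([510, 490, 505], [500, 500, 500])
```

Taking absolute values before averaging stops positive and negative errors from cancelling, so the figure reflects the typical size of the error rather than its bias.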

Multiple Synthetic Speakers
• Variation both within and between real speakers in many speech production parameters
  – e.g. F0 range, F1-F2 vowel space, formant bandwidths
• Single synthetic speaker unlikely to be representative or capture variation
• Consider multiple synthetic speakers:
  – Alternative specified F3 values
  – 8

Multiple Synthetic Speakers – Summary Results
• Alternative F3
  – Negligible influence on F1, F2 errors
  – Changes in F3 error surface – influenced by F3 surface
  – F3 error dependent on location within F1, F2 space
  – Constant F3 speakers – high F1 & F2 -> larger F3 errors
• Glottal source signal
  – Impact on error surfaces & performance – across all formants
  – Some better, some worse than baseline

Real Speech
• How do these results translate to real speech?
• Can’t directly test real speech – reason for using synthetic speech initially
• Compare overall performance of real and synthetic speech…

VTR Database
• Database of hand-corrected vocal tract resonance values (Deng et al. 2006)
  – balanced subset of TIMIT corpus
  – good quality digital recordings
• 516 sentences
  – 186 speakers (113 male, 73 female)
  – 61,000 vowel frames, 6,600 vowel tokens
• Similar method to synthetic speakers but frame by frame measurements and token means across monophthongs & diphthongs
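The difference between frame-by-frame errors and token-mean errors can be sketched as follows. The slides do not spell out how the token figures were derived, so this example assumes a token error is the difference between the mean measured value and the mean reference value over that token's frames. The record layout (`token`, `measured`, `reference`) and the numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def frame_abs_errors(frames):
    """Absolute error of every frame against its reference value (Hz)."""
    return [abs(f["measured"] - f["reference"]) for f in frames]

def token_abs_errors(frames):
    """Absolute error per token, computed on token means (assumed method)."""
    by_token = defaultdict(list)
    for f in frames:
        by_token[f["token"]].append(f)
    errors = {}
    for token, rows in by_token.items():
        measured_mean = mean(r["measured"] for r in rows)
        reference_mean = mean(r["reference"] for r in rows)
        errors[token] = abs(measured_mean - reference_mean)
    return errors

# Made-up frames: token "a" has two frames, token "b" has one.
frames = [
    {"token": "a", "measured": 520, "reference": 500},
    {"token": "a", "measured": 480, "reference": 500},
    {"token": "b", "measured": 1510, "reference": 1500},
]
per_frame = frame_abs_errors(frames)
per_token = token_abs_errors(frames)
```

In this toy case the two frames of token "a" err in opposite directions, so the frame errors are 20 Hz each while the token error is 0 Hz. Under the stated assumption, averaging within a token can cancel frame-level scatter in this way.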

VTR Results

               F1                          F2                          F3
               LPC  Abs (SD) Hz  % (SD)    LPC  Abs (SD) Hz  % (SD)    LPC  Abs (SD) Hz  % (SD)
Frame          15   86 (124)     17 (27)   11   201 (301)    13 (19)   10   217 (337)    9 (14)
Frame ♂        17   78 (108)     17 (25)   11   177 (259)    12 (18)   -    202 (306)    9 (13)
Frame ♀        14   94 (141)     17 (26)   10   225 (331)    13 (20)   -    228 (347)    9 (14)
Token          15   66 (89)      14 (18)   11   161 (220)    10 (13)   11   179 (249)    7 (9)

Comparison with Synthetic Speech

         F1                            F2                            F3
         LPC  Abs (SD) Hz  % (SD)      LPC  Abs (SD) Hz  % (SD)      LPC  Abs (SD) Hz  % (SD)
Synth    8    8.3 (8.8)    1.8 (1.9)   9    8.0 (9.2)    0.6 (0.7)   9    10.7 (8.3)   0.5 (0.3)
Real     17   63 (84)      14 (19)     11   151 (205)    10 (14)     11   168 (235)    7 (10)

• Both speakers = male, monophthong token average
• Best performance of all real results shown

Can Results be Improved?
• Real speech results not as good as synthetic speech
• But measurements so far made without any ‘intelligence’ in selection of values
• Praat standard formant measurement tool is not a tracker
• Formant trackers attempt to select most likely values based on criteria
  – bandwidth, centre frequency, frame transitions

Trackers Tested
• Trackers
  – Praat tracker – Viterbi algorithm, considers centre frequency, bandwidth and frame transitions
  – CAbS tracker (Clermont et al. 2007) – cepstral compatibility between original signal and candidate formants, plus continuity constraints
• ‘Default’ settings used
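The general idea of a Viterbi-based tracker, choosing one candidate per frame so that the total of local and transition costs is smallest, can be sketched for a single formant as below. This is not Praat's implementation: the cost functions, the division by 1000 to keep terms on a similar scale, the weights and the `ref_hz` parameter are assumptions made for illustration only.

```python
def viterbi_track(candidates, ref_hz, w_freq=1.0, w_bw=1.0, w_trans=1.0):
    """Pick one (freq_hz, bw_hz) candidate per frame with minimum total cost.

    candidates: list of frames, each a non-empty list of (freq_hz, bw_hz).
    Local cost penalises distance from ref_hz and wide bandwidths;
    transition cost penalises frequency jumps between adjacent frames.
    """
    def local(cand):
        freq, bw = cand
        return w_freq * abs(freq - ref_hz) / 1000.0 + w_bw * bw / 1000.0

    def trans(prev, cand):
        return w_trans * abs(cand[0] - prev[0]) / 1000.0

    cost = [local(c) for c in candidates[0]]
    back = []  # back[t-1][i] = best predecessor index for candidate i at frame t
    for t in range(1, len(candidates)):
        prev_frame = candidates[t - 1]
        new_cost, ptr = [], []
        for cand in candidates[t]:
            totals = [cost[j] + trans(prev_frame[j], cand)
                      for j in range(len(prev_frame))]
            best_j = min(range(len(totals)), key=totals.__getitem__)
            new_cost.append(totals[best_j] + local(cand))
            ptr.append(best_j)
        cost = new_cost
        back.append(ptr)

    # Trace the cheapest path back from the final frame.
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [candidates[t][i][0] for t, i in enumerate(path)]

# Made-up candidates for three frames of an F1 track near 500 Hz.
frames = [
    [(500, 50), (1500, 80)],
    [(520, 60), (1400, 400)],
    [(900, 300), (510, 55)],
]
track = viterbi_track(frames, ref_hz=500)
```

With these made-up candidates the selected track is 500, 520, 510 Hz: the narrow-bandwidth candidates close to the reference frequency also give the smoothest frame-to-frame path.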

Praat Tracker Results

               F1                          F2                          F3
               LPC  Abs (SD) Hz  % (SD)    LPC  Abs (SD) Hz  % (SD)    LPC  Abs (SD) Hz  % (SD)
Frame          15   86 (124)     17 (27)   11   201 (301)    13 (19)   10   217 (337)    9 (14)
Tracker Frame  15   55 (75)      11 (17)   11   94 (163)     7 (15)    14   179 (359)    8 (19)
Token          15   66 (89)      14 (18)   11   161 (220)    10 (13)   11   179 (249)    7 (9)
Tracker Token  15   46 (61)      10 (14)   -    81 (141)     6 (12)    14   162 (332)    8 (18)

CAbS Tracker Results

               F1                          F2                          F3
               LPC  Abs (SD) Hz  % (SD)    LPC  Abs (SD) Hz  % (SD)    LPC  Abs (SD) Hz  % (SD)
Frame          15   86 (124)     17 (27)   11   201 (301)    13 (19)   10   217 (337)    9 (14)
Tracker Frame  14   69 (139)     15 (34)   13   122 (239)    8 (18)    12   413 (544)    18 (25)
Token          15   66 (89)      14 (18)   11   161 (220)    10 (13)   11   179 (249)    7 (9)
Tracker Token  14   62 (132)     14 (33)   -    115 (218)    8 (16)    11   414 (512)    18 (23)

Tracker Comparison – Frame Data

               F1                          F2                          F3
               LPC  Abs (SD) Hz  % (SD)    LPC  Abs (SD) Hz  % (SD)    LPC  Abs (SD) Hz  % (SD)
Praat LPC      15   86 (124)     17 (27)   11   201 (301)    13 (19)   10   217 (337)    9 (14)
Praat Tracker  15   55 (75)      11 (17)   11   94 (163)     7 (15)    14   179 (359)    8 (19)
CAbS Tracker   14   69 (139)     15 (34)   13   122 (239)    8 (18)    12   413 (544)    18 (25)
WaveSurfer     -    70           -         -    94           -         -    154          -
MSR            -    64           -         -    105          -         -    125          -

Discussion
• Even with a tracker real speech results not as good as synthetic performance
• But VTR database not perfect
• Does allow comparison of trackers – no obvious ‘winner’
• Even though best performance at different LPC orders across F1, F2 & F3, results similar enough to use same LPC order for all formants

Further Questions…
• What is the variation across speakers and vowel categories? Is it significant?
• What is the maximum achievable performance?
• Is 10% error a realistic estimate?
  – Possibly test more diverse synthetic speech
• Is 10% error acceptable?
• What impact does this have on LRs and other numerical analyses (LTFAs)?
• Are trackers accurate enough to be used unattended on large corpora? How much manual intervention is necessary?

Questions?
Thanks to Frantz Clermont, Peter French & Paul Foulkes

References:
Clermont, F., Harrison, P. & French, P. (2007) ‘Formant-pattern estimation guided by cepstral compatibility’. Proceedings of IAFPA 2007 Annual Conference, Plymouth, UK.
Deng, L., Cui, X., Pruvenok, R., Huang, L., Momen, S., Chen, Y. and Alwan, A. (2006) ‘A database of vocal tract resonance trajectories for research in speech processing’. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, May 2006.