Large Scale Evaluation of Corpus-based Synthesizers: The Blizzard Challenge 2005
Christina Bennett
Language Technologies Institute, Carnegie Mellon University
Student Research Seminar, September 23, 2005
What is corpus-based speech synthesis?
[Diagram: a transcript corpus plus recorded speech from a voice talent are used to build a speech synthesizer; given new text, it produces new speech]
Motivation: Need for Speech Synthesis Evaluation
- Determine the effectiveness of our "improvements"
- Allow closer comparison of various corpus-based techniques
- Learn about users' preferences
- Healthy competition promotes progress and brings attention to the field
Motivation: Blizzard Challenge Goals
- Compare methods across systems
- Remove the effects of different data by providing, and requiring the use of, the same data
- Establish a standard for repeatable evaluations in the field
- [My goal:] Bring the need for improved speech synthesis evaluation to the forefront of the community (positioning CMU as a leader in this regard)
Blizzard Challenge: Overview
- Released first voices and solicited participation in 2004
- Additional voices and test sentences released Jan. 2005
- 1-2 weeks allowed to build voices & synthesize sentences
  - 1000 samples from each system (50 sentences x 5 tests x 4 voices)
Evaluation Methods
- Mean Opinion Score (MOS): evaluate a sample on a numerical scale
- Modified Rhyme Test (MRT): intelligibility test with the tested word inside a carrier phrase
- Semantically Unpredictable Sentences (SUS): intelligibility test that prevents listeners from using world knowledge to predict words
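The two kinds of tests reduce to two kinds of numbers. A minimal sketch (not the official Blizzard scoring code; the function names are mine) of how they are typically computed: MOS as the mean of 1-5 listener ratings, and the intelligibility ("type what you hear") tests as a word error rate (WER) between the prompt and the listener's transcript:

```python
def mos(ratings):
    """Mean Opinion Score: average of 1-5 listener ratings."""
    return sum(ratings) / len(ratings)

def wer(reference, hypothesis):
    """Word error rate (%): word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(mos([4, 5, 4, 3]))                          # 4.0
print(wer("now we will say bat again",
          "now we will say pat again"))           # one error in six words
```

Lower WER is better; this is why the "type-in" numbers in the result tables below rank in the opposite direction from MOS.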
Challenge setup: Tests
- 5 tests from 5 genres
  - 3 MOS tests (1 to 5 scale): news, prose, conversation
  - 2 "type what you hear" tests:
    - MRT - "Now we will say ___ again"
    - SUS - 'det-adj-noun-verb-det-adj-noun'
- 50 sentences collected from each system, 20 selected for use in testing
Challenge setup: Systems
- 6 systems (random IDs A-F):
  - CMU
  - Delaware
  - Edinburgh (UK)
  - IBM
  - MIT
  - Nitech (Japan)
- Plus 1: "Team Recording Booth" (ID X)
  - Natural examples from the 4 voice talents
Challenge setup: Voices
- CMU ARCTIC databases
- American English; 2 male, 2 female
  - 2 from the initial release: bdl (m), slt (f)
  - 2 new DBs released for the quick build: rms (m), clb (f)
Challenge setup: Listeners
- Three listener groups:
  - S - speech synthesis experts (50): 10 requested from each participating site
  - V - volunteers (60; 97 registered*): anyone online
  - U - native US English speaking undergraduates (58; 67 registered*): solicited and paid for participation
*as of 4/14/05
Challenge setup: Interface
- Entirely online:
  - http://www.speech.cs.cmu.edu/blizzard/register-R.html
  - http://www.speech.cs.cmu.edu/blizzard/login.html
- Register/login with email address
- Keeps track of progress through tests
- Can stop and return to tests later
- Feedback questionnaire at end of tests
Results: Fortunately, Team X is the clear "winner"

(Systems ranked per listener group by MOS, 1-5 scale, higher is better, and by type-in word error rate, %, lower is better.)

        Listener type S         Listener type V         Listener type U
Rank    MOS         type-in     MOS         type-in     MOS         type-in
1       X - 4.76    X - 8.5     X - 4.41    X - 10.3    X - 4.58    X - 7.3
2       D - 3.19    D - 14.7    D - 3.02    D - 17.1    D - 3.06    D - 16.3
3       E - 3.11    B - 15.0    E - 2.83    A - 19.7    E - 2.83    A - 19.3
4       C - 2.91    A - 17.4    B - 2.66    B - 20.3    B - 2.67    B - 19.6
5       B - 2.88    E - 20.6    C - 2.48    E - 25.0    C - 2.42    E - 21.7
6       F - 2.15    C - 22.5    F - 2.07    C - 25.6    A - 2.00    C - 22.8
7       A - 2.07    F - 32.7    A - 1.98    F - 41.8    F - 1.98    F - 35.2
Results: Team D consistently outperforms the other systems
[same results table as the previous slide]

Results: Speech experts show an "optimistic" bias (higher MOS scores)
[same results table as the previous slide]

Results: Speech experts are, in fact, better listeners (lower error rates)
[same results table as the previous slide]
Voice results: Listener preference
- slt is most liked, followed by rms, in every listener group:
  - Type S: slt - 50% of votes cast; rms - 28.26%
  - Type V: slt - 43.48% of votes cast; rms - 36.96%
  - Type U: slt - 47.27% of votes cast; rms - 34.55%
- But preference does not necessarily match test performance...
Voice results: Test performance - female voices: slt

(Voices ranked per listener group by MOS and by type-in WER (%), over all synthesis systems and over the natural examples.)

Listener type S
  all sys-MOS     natural-MOS     all sys-type-in   natural-type-in
  rms - 3.233     bdl - 4.827     rms - 10.5        rms - 3.2
  clb - 3.154     rms - 4.809     clb - 16.0        clb - 9.3
  slt - 2.994     slt - 4.738     slt - 20.8        bdl - 9.4
  bdl - 2.941     clb - 4.690     bdl - 22.7        slt - 11.3

Listener type V
  all sys-MOS     natural-MOS     all sys-type-in   natural-type-in
  clb - 2.946     rms - 4.568     rms - 14.0        rms - 3.8
  rms - 2.894     clb - 4.404     clb - 17.1        bdl - 12.0
  slt - 2.884     bdl - 4.382     slt - 25.2        slt - 12.0
  bdl - 2.635     slt - 4.296     bdl - 29.3        clb - 13.1

Listener type U
  all sys-MOS     natural-MOS     all sys-type-in   natural-type-in
  clb - 2.987     slt - 4.611     clb - 11.9        slt - 5.9
  slt - 2.930     clb - 4.587     slt - 17.5        clb - 5.9
  rms - 2.873     rms - 4.584     rms - 17.6        rms - 8.8
  bdl - 2.678     bdl - 4.551     bdl - 28.7        bdl - 9.1
Voice results: Test performance - female voices: clb
[same table as the previous slide]

Voice results: Test performance - male voices: rms
[same table as the previous slide]

Voice results: Test performance - male voices: bdl
[same table as the previous slide]
Voice results: Natural examples

        Listener type S          Listener type V          Listener type U
Rank    MOS          type-in     MOS          type-in     MOS          type-in
1       bdl - 4.827  rms - 3.2   rms - 4.568  rms - 3.8   slt - 4.611  slt - 5.9
2       rms - 4.809  clb - 9.3   clb - 4.404  bdl - 12.0  clb - 4.587  clb - 5.9
3       slt - 4.738  bdl - 9.4   bdl - 4.382  slt - 12.0  rms - 4.584  rms - 8.8
4       clb - 4.690  slt - 11.3  slt - 4.296  clb - 13.1  bdl - 4.551  bdl - 9.1

What makes natural rms different?
Voice results: By system
- Only system B was consistent across listener types (slt best MOS, rms best WER)
- Most others showed per-group trends (with the exceptions of B above and F*):
  - S: rms always best WER, often best MOS
  - V: slt usually best MOS, clb usually best WER
  - U: clb usually best MOS and always best WER
- Again, people clearly don't prefer the voices they most easily understand
Lessons learned: Listeners
- Reasons to exclude listener data: incomplete tests, failure to follow directions, inability to respond (type-in), unusable responses
- Type-in tests are very hard to process automatically: homophones, misspellings/typos, dialectal differences, "smart" listeners
- Group differences: V most variable, U most controlled, S least problematic but not representative
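A sketch of the kind of normalization the type-in responses need before scoring: case and punctuation cleanup plus a lookup table that maps homophones, frequent misspellings, and dialectal spellings onto one canonical form. The table here is hypothetical and tiny; in practice such tables are hand-built from the actual responses, and many cases still need human judgment:

```python
import re

# Hypothetical canonicalization table (illustration only).
CANONICAL = {
    "theyre": "their",     # homophone of "their" once the apostrophe is gone
    "there": "their",      # homophone
    "recieve": "receive",  # common misspelling
    "grey": "gray",        # dialectal spelling difference
}

def normalize(response):
    """Lowercase, strip punctuation, drop apostrophes, canonicalize words."""
    words = re.findall(r"[a-z']+", response.lower())
    return [CANONICAL.get(w.replace("'", ""), w.replace("'", "")) for w in words]

print(normalize("They're GREY, I recieve!"))
# -> ['their', 'gray', 'i', 'receive']
```

Only after both the reference sentence and the listener's response pass through the same normalization is a word error rate between them meaningful; otherwise spelling variation inflates the error counts.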
Lessons learned: Test design
- Feedback on the tests:
  - MOS: give examples to calibrate the scale (ordering schema); use multiple scales (for lay people?)
  - Type-in: warn about SUS; SUS are hard to remember; words too unusual/hard to spell
- User test setup is uncontrollable
- Pros & cons to having natural examples in the mix:
  - Analyzing user response (+), differences in delivery style (-), availability of voice talent (?)
Goals Revisited
- One methodology clearly outshone the rest
- All systems used the same data, allowing for an actual comparison of systems
- A standard for repeatable evaluations in the field was established
- [My goal:] Brought attention to the need for better speech synthesis evaluation (while positioning CMU as the experts)
For the Future
- (Bi-)annual Blizzard Challenge
  - Introduced at an Interspeech 2005 special session
- Improve the design of tests for easier post-evaluation analysis
- Encourage more sites to submit their systems!
- More data resources (problematic for the commercial entities)
- Expand the types of systems accepted (and therefore the test types), e.g. voice conversion