OGI - CSLU Speech Toolkit**
Automatic natural-language recognition within everyone's reach

Piero Cosi*

* Istituto di Fonetica e Dialettologia (C.N.R.), Via G. Anghinoni, 10, 35121 Padova (Italy). Phone: +39 49 8274418. Fax: +39 49 8274416. E-mail: cosi@csrf.pd.cnr.it
** Center for Spoken Language Understanding at the Oregon Graduate Institute of Science and Technology, 20000 NW Walker Road, Beaverton, Oregon 97006. Phone: (503) 690-1121. E-mail: cole@cse.ogi.edu

GFS-IX - Le IX Giornate di Studio del Gruppo di Fonetica Sperimentale, Università Ca' Foscari, Venezia, 17-19 December 1998

Copyright, 1998 © IFD-CNR

Outline
* Introduction
* OGI - CSLU Speech Toolkit
* Recognition (HMM / NN)
* Application examples:
  • Italian/English Connected Digit Recognition
* Universal Language Systems
  • Video (OGI)
* Connected Digit Recognition
  • Italian/English Demo

Introduction
* The CSLU Speech Toolkit was developed to support research and development in speech technology for a wide range of purposes and users.
* Among its aims:
  • to let people with no expertise in speech technology design and build ASR systems for real applications, in several languages, using dedicated graphical tools

OGI CSLU Speech Toolkit
RAD

OGI CSLU Speech Toolkit
* Speech Recognition: NN & HMM
* Speech Synthesis: Festival
* Facial Animation: UCSC - PSL Baldi
* Authoring Tools: RAD
* Visualization Tools
* Programming Environment
* Documentation & Tutorials

Speech Recognition
* ANN
* HMM
* Vocabulary-independent ASR
* Vocabulary-specific ASR (digits)
* Tutorials and tools for training new ASR systems (ANN/HMM)

Speech Synthesis
* Festival TTS (Centre for Speech Technology Research, University of Edinburgh)
  • British/American English
  • Mexican Spanish, German
* Development system
  • text normalization
  • conversion into a sequence of phonetic segments (duration, f0, ...)
  • diphone or concatenative synthesis
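The three synthesis stages listed above (text normalization, conversion to phonetic segments, diphone selection) can be illustrated with a toy sketch. This is not Festival's API: the digit table, the tiny lexicon, the phone symbols and all function names are hypothetical, and prosody (duration, f0) is left out.

```python
# Toy text-to-diphone pipeline. All tables and names are hypothetical
# illustrations, not part of Festival or the CSLU Toolkit.
DIGIT_WORDS = {"1": "one", "4": "four", "5": "five"}
LEXICON = {
    "one": ["w", "^", "n"],
    "four": ["f", "ou", "r"],
    "five": ["f", "aI", "v"],
}

def normalize(text):
    """Step 1: text normalization (digits are expanded into words)."""
    return [DIGIT_WORDS.get(token, token.lower()) for token in text.split()]

def to_phones(words):
    """Step 2: map words to phonetic segments, padded with pauses."""
    phones = ["pau"]
    for word in words:
        phones.extend(LEXICON[word])
    phones.append("pau")
    return phones

def to_diphones(phones):
    """Step 3: name the diphone unit spanning each pair of adjacent phones."""
    return [f"{left}-{right}" for left, right in zip(phones, phones[1:])]

units = to_diphones(to_phones(normalize("1 5")))
print(units)
```

A real system would attach a duration and an f0 contour to each segment before concatenating the recorded units.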

Facial Animation
* Baldi 3D Talking Head (University of California, Santa Cruz; D. Massaro, M. Cohen)

Authoring Tools
* RAD - Rapid Application Developer
  • Barge-in (Full-Duplex)
  • Open Microphone
  • Touch-tone Input
  • Wizard-of-Oz mode
  • Multilingual
  • Natural Speech and Lip Syncing, including recording tools and automatic generation and synchronization of facial animation with the recorded speech output
  • New Dialogue Objects

Visualization Tools
* Wave Editor

Programming Environment
[Diagram labels: OGI CSLU Speech Toolkit; Bin; Cslu-C; Cslu-sh; Cslu-RAD; libraries (lib1, lib2, lib3); packages (pkg1, pkg2, pkg3)]

Speech Recognition

Context Dependent Modeling

Neural Network

Viterbi Search
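The Viterbi search named on these slides finds the single most likely state sequence for an observation sequence. The slides show it only as figures, so what follows is a minimal textbook sketch in the log domain, not the toolkit's implementation. The 3-state discrete model and all of its numbers are hypothetical.

```python
import math

def viterbi(A, B, pi, observations):
    """Return (best_state_path, log_probability) for a discrete HMM."""
    def log(x):
        return math.log(x) if x > 0 else float("-inf")

    n = len(pi)
    # Initialization: best log score of being in each state at time 0.
    delta = [log(pi[j]) + log(B[j][observations[0]]) for j in range(n)]
    backpointers = []
    for o in observations[1:]:
        new_delta, pointers = [], []
        for j in range(n):
            # Best predecessor state for state j at this frame.
            best_i = max(range(n), key=lambda i: delta[i] + log(A[i][j]))
            new_delta.append(delta[best_i] + log(A[best_i][j]) + log(B[j][o]))
            pointers.append(best_i)
        delta = new_delta
        backpointers.append(pointers)
    # Termination and backtrace from the best final state.
    state = max(range(n), key=lambda j: delta[j])
    best_log_prob = delta[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_log_prob

# Hypothetical left-to-right model: A[i][j] transitions, B[j][k] emissions.
A = [[0.6, 0.3, 0.1], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
B = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.1, 0.2, 0.7]]
pi = [1.0, 0.0, 0.0]

path, log_prob = viterbi(A, B, pi, [0, 0, 1, 2, 2])
print(path, math.exp(log_prob))
```

Working in log probabilities avoids numerical underflow on long utterances, and the backpointer table is what allows the best path to be recovered after the last frame.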

Word Spotting

Connected Digit Recognition
[Figure labels: /p/ /t/ /f/ /aI/ /v/ pau]

HMM
* Hidden Markov Model
[Diagram: model λ_i with states q_1, q_2, q_3, self-transitions a_11, a_22, a_33, forward transitions a_12, a_23, skip transition a_13, and observations o_1, o_2, o_3]
Observation sequence: O = (o_1 o_2 … o_3)
Model: λ_i = (A_i, B_i, π_i)
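To make the model triple (A, B, π) concrete, the sketch below scores an observation sequence with the standard forward algorithm on a 3-state left-to-right topology like the one in the diagram (self-loops, forward transitions and one skip transition). The probabilities and the 3-symbol discrete alphabet are invented for illustration and do not come from the slides.

```python
def forward_likelihood(A, B, pi, observations):
    """Compute P(O | model) for a discrete HMM with the forward algorithm."""
    n = len(pi)
    # alpha[j]: probability of the observations so far, ending in state j.
    alpha = [pi[j] * B[j][observations[0]] for j in range(n)]
    for o in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ]
    return sum(alpha)

# Hypothetical parameters (each row sums to 1).
A = [
    [0.6, 0.3, 0.1],  # a11, a12, a13 (a13 is the skip transition)
    [0.0, 0.7, 0.3],  # a22, a23
    [0.0, 0.0, 1.0],  # a33
]
B = [
    [0.7, 0.2, 0.1],  # B[j][k] = P(symbol k | state j)
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
]
pi = [1.0, 0.0, 0.0]  # the model always starts in the first state

O = [0, 1, 2]
likelihood = forward_likelihood(A, B, pi, O)
print(likelihood)  # approximately 0.07028
```

In a recognizer, one such model is kept per unit, and the forward (or Viterbi) score indicates which model best explains the observations.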

HMM Decoding
[Figure: phoneme models chained into word models (one, four, five, pau) with word ends, and the resulting lattice of word hypotheses (one, two, three, four, five, oh, sp, sil) along a time axis t [ms]]

HMM-NN (hybrid)
[Figure labels: HMM; context-dependent categories f<aI>v, $lab<aI, n<aI>n, <aI>$lab; neural network with output nodes, hidden nodes, input nodes]
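The slide gives only a diagram of the hybrid architecture. A widely used way to couple the two parts, assumed here rather than taken from the slides, is to divide each network output (a posterior probability of a category given the frame) by that category's prior, giving a scaled likelihood the HMM search can use as an emission score. The category count, posteriors and priors below are hypothetical.

```python
import math

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert per-frame NN posteriors P(q | o) into log(P(q | o) / P(q)).

    The floor keeps the logarithm finite when a value is zero.
    """
    return [
        [math.log(max(p, floor)) - math.log(max(prior, floor))
         for p, prior in zip(frame, priors)]
        for frame in posteriors
    ]

# Hypothetical outputs for 2 frames over 3 categories, plus category priors
# (for example, relative frequencies in the training data).
posteriors = [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2]]
priors = [0.5, 0.25, 0.25]

scores = scaled_log_likelihoods(posteriors, priors)
best_per_frame = [max(range(len(frame)), key=frame.__getitem__) for frame in scores]
print(scores, best_per_frame)
```

These per-frame scores would take the place of the emission log probabilities in a Viterbi search, while the transition structure stays with the HMM.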

Connected Digit Recognition
corpus: OGI numbers (English)
  • connected digit sequences
  • 150 talkers (speaker independent)
  • telephone channel
Training:
  a) 7000 sequences (phonetically transcribed)
  b) 9000 sequences (word-level transcribed)
Development:
Test: 3000 sequences

             w        s
  • HMM:     96.4%    86.1%
  • NN:      97.6%    91.0%
  • HMM/NN:  97.9%    91.8%   (hybrid)
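The two result columns are labeled only "w" and "s". Assuming they stand for word-level and sentence-level accuracy (the slide does not spell this out), the sketch below shows how such figures are commonly computed. Word accuracy counts substitutions, deletions and insertions through an edit distance, while sentence accuracy requires the whole digit string to be correct. The reference and hypothesis pairs are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]

def accuracies(pairs):
    """Return (word accuracy %, sentence accuracy %) over (ref, hyp) pairs."""
    total_words = sum(len(ref) for ref, _ in pairs)
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    correct_sentences = sum(ref == hyp for ref, hyp in pairs)
    word_acc = 100.0 * (total_words - errors) / total_words
    sent_acc = 100.0 * correct_sentences / len(pairs)
    return word_acc, sent_acc

# Hypothetical recognition output for four digit strings.
pairs = [
    ("one four five".split(), "one four five".split()),
    ("oh two oh".split(), "oh two".split()),       # one deletion
    ("three one".split(), "three nine".split()),   # one substitution
    ("five five".split(), "five five".split()),
]
word_acc, sent_acc = accuracies(pairs)
print(word_acc, sent_acc)  # 80.0 50.0
```

Under that reading, the gap between the two columns is expected: a single digit error lowers word accuracy only slightly but makes the entire sequence count as wrong.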

Connected Digit Recognition
corpus: IRST Spk numbers (ELRA) (Italian)
  • connected digit sequences
  • 40 talkers (speaker pooled)
  • clean channel
Training:
  a) 1533 sequences (phonetically transcribed)
  b) 5110 sequences (word-level transcribed)
Development:
Test: 1720 sequences

             w         s
  • NN:      99.71%    99.48%

Universal Language Systems
OGI Speech Toolkit video

RAD Demo
Italian / English Connected Digit Recognition

Conclusions
* The CSLU Speech Toolkit is an effective integrated environment for development and research in speech technology

Uses
* Research
  • Speech/Speaker Recognition
  • Natural Language Processing
  • Speech Synthesis
* System development
* Teaching speech technology
* Language teaching

Availability
* Windows 95/NT, Unix
* Free download (for non-commercial use): http://www.cse.ogi.edu/cslu
* Fluent Speech Technology (for commercial use): http://www.fluent-speech.com

Future Developments
* Italian version
