Keyword Spotting: Dynamic Time Warping
Ali Akbar Jabini, Alexandre Mercier-Dalphond
Spring 2006
Introduction
• Speech recognition:
  – A computer can interpret speech
  – Needs an input device to digitize sounds
    • Microphone
  – People can speak faster than they can type
  – Commercial systems available since the 1990s
    • People prefer physical interaction (keyboard/mouse, on/off switch)
    • Low accuracy for large vocabularies with noise (50%)
Introduction
• Speech recognition is increasingly used with smaller vocabulary banks
  – Credit card systems
  – Simple switching commands
  – Directory assistance
• Cheap to implement
• High accuracy
  – Systems can verify their interpretation
• Idea: speech recognition for household appliances
Outline
• Area of investigation
• Concrete task/goal
• Schematic
• Feature extraction
• DTW
• Training
• Evaluation metrics
• Conclusion
Area of Investigation
• Keyword spotting:
  – Subfield of speech recognition
  – Grammar constrained
• Keyword spotting in isolated word recognition
  – Keyword utterances
  – Keywords separated by silence
  – Main technique is dynamic time warping (DTW)
Concrete task/Goal
• Goal: develop a robust, speaker-independent keyword spotting scheme to operate household appliances
• Concrete tasks
  – Digitize the sound inputs
  – Implement the scheme in MATLAB
  – Train the model with the grammar
  – Analyze the performance of our scheme
Schematic
[Block diagram with the blocks: Microphone, A/D, Feature extraction, DTW, Grammar, Output]
Feature extraction
• Pre-emphasis
  – Flattens the spectrum of the signal
• Blocking into frames
  – Frame length sets the length of the Fourier transform
• Windowing
  – Sample window (maybe Hamming)
• Mel-frequency cepstral coefficients (MFCCs)
  – More reliable than LPC coefficients
  – These will be the input to the DTW algorithm
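The first three steps above (pre-emphasis, blocking into frames, windowing) can be sketched as follows. This is a minimal illustration in plain Python rather than the MATLAB implementation the slides plan. The function names, the pre-emphasis coefficient 0.97, the frame length of 256 samples, and the hop of 128 samples are all assumptions chosen for the example, not values from the slides. The MFCC step itself (Fourier transform, mel filterbank, cepstrum) is not shown.

```python
import math

def pre_emphasize(signal, alpha=0.97):
    """First-order filter y[n] = x[n] - alpha * x[n-1]; boosts high frequencies
    to flatten the spectrum. alpha=0.97 is an assumed, commonly used value."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def block_into_frames(signal, frame_len, hop):
    """Split the signal into overlapping frames; a trailing partial frame is dropped."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def hamming(n_points):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if n_points == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def windowed_frames(signal, frame_len=256, hop=128, alpha=0.97):
    """Pre-emphasize, frame, and window a signal (frame_len and hop are assumed values)."""
    window = hamming(frame_len)
    frames = block_into_frames(pre_emphasize(signal, alpha), frame_len, hop)
    return [[s * w for s, w in zip(frame, window)] for frame in frames]

# Usage: a synthetic tone stands in for microphone samples.
tone = [math.sin(2 * math.pi * 0.05 * n) for n in range(1024)]
frames = windowed_frames(tone)
```

Each windowed frame would then go through the Fourier transform whose length matches the frame length, which is why the two are tied together on the slide.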
DTW
• Idea: find the smallest distance between an input and the training bank
  – Computed on cepstrum features
• Dynamic programming: the time axis is not linear, to account for differences in how utterances are timed
  – t0 -> t0 + 5
  – t1 -> t1 - 2
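A minimal sketch of the dynamic-programming recurrence behind DTW is given below, in plain Python. It assumes a Euclidean local distance between feature vectors and the basic three-way step pattern (match, insertion, deletion) with no slope constraint or path-length normalization; the slides do not say which variant the authors intend, so those choices are assumptions.

```python
import math

def euclidean(a, b):
    """Local distance between two feature vectors (assumed choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b, dist=euclidean):
    """Cumulative cost of the cheapest monotonic alignment of two feature sequences."""
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_a advances alone
                                 cost[i][j - 1],      # seq_b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Usage: the same "shape" spoken slowly aligns at zero cost; a reversed one does not.
fast = [[0.0], [1.0], [2.0]]
slow = [[0.0], [0.0], [1.0], [1.0], [2.0]]
reversed_fast = [[2.0], [1.0], [0.0]]
same_word_cost = dtw_distance(fast, slow)
other_word_cost = dtw_distance(fast, reversed_fast)
```

Because frames may be repeated along either axis, a stretched utterance can match its template exactly, which is the non-linear time axis the slide refers to.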
Training
• Need to create our own grammar
  – On: Onnn, Honnn, opeeenn
  – Off: Hooofff, Hoff, offfff, close
  – As many potential utterances as possible
• Use this data with DTW
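One way the training utterances could be used with DTW is as a bank of templates per keyword, with a new utterance assigned to the keyword of its closest template. The sketch below assumes that nearest-template rule; the slides do not spell out the decision rule. To stay short it uses scalar features with absolute difference as the local cost, and the numeric sequences are made-up stand-ins for MFCC tracks, not recorded data.

```python
def dtw(a, b):
    """DTW cost between two sequences of scalar features, keeping only two rows
    of the cost table. Local cost is the absolute difference (assumed choice)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf]
        for j, y in enumerate(b, start=1):
            curr.append(abs(x - y) + min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return prev[-1]

def classify(utterance, templates):
    """Return (keyword, distance) for the closest template in the bank."""
    best_label, best_dist = None, float("inf")
    for label, examples in templates.items():
        for example in examples:
            d = dtw(utterance, example)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label, best_dist

# Hypothetical template bank: several utterance variants per keyword.
templates = {
    "on":  [[1, 3, 3, 1], [1, 3, 3, 3, 3, 1]],
    "off": [[1, 3, 5, 5], [1, 3, 3, 5, 5, 5, 5]],
}

# Usage: a drawn-out variant still lands on the right keyword.
label, dist = classify([1, 1, 3, 3, 3, 1], templates)
```

Adding more variants per keyword, as the slide suggests, only grows the inner loop; the decision rule stays the same.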
Evaluation metrics
• Accuracy, measured under:
  – High noise
  – Low noise
  – Independent speaker
  – Training data speaker
  – We would like to obtain 80% or more
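Accuracy per test condition could be tallied as sketched below. The record layout (condition, expected keyword, predicted keyword) and the helper name are assumptions for illustration, and the sample records are placeholders, not measured results.

```python
from collections import defaultdict

TARGET_ACCURACY = 0.80  # the "80% or more" goal from the slide

def accuracy_by_condition(results):
    """results: iterable of (condition, expected_keyword, predicted_keyword)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, expected, predicted in results:
        total[condition] += 1
        correct[condition] += int(expected == predicted)
    return {c: correct[c] / total[c] for c in total}

# Placeholder records only; real ones would come from running the recognizer.
results = [
    ("low noise", "on", "on"),
    ("low noise", "off", "off"),
    ("low noise", "on", "on"),
    ("low noise", "off", "off"),
    ("high noise", "on", "on"),
    ("high noise", "off", "on"),
]
accuracy = accuracy_by_condition(results)
meets_target = {c: a >= TARGET_ACCURACY for c, a in accuracy.items()}
```

The same tally applies to the speaker conditions by using "independent speaker" and "training data speaker" as condition labels.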
Conclusion
• Early stage
  – No code implemented yet
  – Many challenges ahead
  – Our methodology may change slightly
• There is a large potential market for such a technique, which could influence everyday life