Keyword Spotting
Dynamic Time Warping
Ali Akbar Jabini, Alexandre Mercier-Dalphond
Spring 2006

Introduction
• Speech recognition:
  – A computer can interpret speech
  – Needs an input device to digitize sounds
    • Microphone
  – People can speak faster than they can type
  – Commercial systems available since the 1990s
    • People prefer physical interaction (keyboard/mouse, on/off switch)
    • Low accuracy (50%) for large vocabularies with noise

Introduction
• Speech recognition is increasingly used for smaller vocabulary banks
  – Credit card systems
  – Simple switching commands
  – Directory assistance
• Cheap to implement
• High accuracy
  – Systems can verify their interpretation
• Idea: speech recognition for household appliances

OUTLINE
• Area of investigation
• Concrete task/Goal
• Schematic
• Feature extraction
• DTW
• Training
• Evaluation metrics
• Conclusion

Area of Investigation
• Keyword spotting:
  – Subfield of speech recognition
  – Grammar-constrained
• Keyword spotting in isolated word recognition
  – Keyword utterances
  – Keywords separated by silence
  – Main technique is DTW

Concrete task/Goal
• Goal: develop a robust, speaker-independent keyword spotting scheme to operate household appliances
• Concrete tasks
  – Digitize the sound inputs
  – Implementation in MATLAB
  – Train the model with the grammar
  – Analyze the performance of our scheme

Schematic
[Block diagram: Microphone, A/D, Feature extraction, DTW, Grammar, Output]

Feature extraction
• Pre-emphasis
  – Flattens the spectrum of the signal
• Blocking into frames
  – Length of the Fourier transform
• Windowing
  – Sample window (maybe Hamming)
• Mel-frequency cepstral coefficients (MFCCs)
  – More reliable than LPC coefficients
  – These will be the input to the DTW algorithm
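The first three steps above can be sketched in a few lines. This is only an illustration in Python, not the planned MATLAB implementation: the function names, the pre-emphasis coefficient 0.97, the frame length of 256 samples and the hop of 128 samples are all assumptions chosen for the example, and the MFCC step itself is left out.

```python
import math

def pre_emphasize(signal, alpha=0.97):
    """First-order filter y[n] = x[n] - alpha * x[n-1]; alpha is an assumed value."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def block_into_frames(signal, frame_len, hop):
    """Split the signal into frames of frame_len samples, starting every hop samples."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def hamming(frame_len):
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1)); needs frame_len > 1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
            for n in range(frame_len)]

def windowed_frames(signal, frame_len=256, hop=128, alpha=0.97):
    """Pre-emphasize, frame, and window a signal (frame_len and hop are assumptions)."""
    window = hamming(frame_len)
    frames = block_into_frames(pre_emphasize(signal, alpha), frame_len, hop)
    return [[s * w for s, w in zip(frame, window)] for frame in frames]
```

Each windowed frame would then go through a Fourier transform of the same length before the cepstral coefficients are computed.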

DTW
• Idea: find the smallest distance between an input and the training bank
  – Cepstrum features
• Dynamic programming: the time axis is not linear, to account for timing differences between utterances
  – t0 -> t0 + 5
  – t1 -> t1 - 2
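A minimal sketch of the dynamic programming idea, in Python for illustration only. It assumes each utterance is a list of feature vectors (one per frame), uses a Euclidean frame distance, and allows the three standard step directions with no slope constraint or path normalization. The names `euclidean` and `dtw_distance` are hypothetical.

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b, dist=euclidean):
    """Accumulated cost of the cheapest warping path aligning seq_a with seq_b."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_a advances alone
                                 cost[i][j - 1],      # seq_b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# The same pattern spoken more slowly still matches at zero cost.
fast = [[0.0], [1.0], [2.0]]
slow = [[0.0], [0.0], [1.0], [1.0], [2.0]]
print(dtw_distance(fast, slow))  # 0.0
```

Because a frame may be matched against several frames of the other sequence, stretched or compressed utterances of the same word stay close, which is the non-linear time axis described above.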

Training
• Need to create our own grammar
  – On: Onnn, Honnn, opeeenn
  – Off: Hooofff, Hoff, offfff, close
  – As many potential utterances as possible
• Use this data with DTW
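One way to use such a bank is nearest-template matching: store several recorded utterances per keyword and pick the keyword whose closest template is nearest to the input. The Python sketch below is an assumption about how this could be organized, not the project's design. `nearest_keyword` and `toy_distance` are hypothetical names, the templates are made-up numbers, and `toy_distance` is only a stand-in for a real DTW distance so that the example stays short.

```python
def nearest_keyword(features, bank, distance):
    """Return (keyword, distance) for the closest template in the bank.

    bank maps each keyword to a list of templates (one per training utterance);
    distance is any function comparing two feature sequences, e.g. a DTW distance.
    """
    best_keyword, best_distance = None, float("inf")
    for keyword, templates in bank.items():
        for template in templates:
            d = distance(features, template)
            if d < best_distance:
                best_keyword, best_distance = keyword, d
    return best_keyword, best_distance

def toy_distance(a, b):
    """Placeholder for DTW: pointwise difference plus a length penalty."""
    return sum(abs(x - y) for x, y in zip(a, b)) + abs(len(a) - len(b))

# Made-up one-dimensional "features"; real templates would be MFCC sequences.
bank = {
    "on": [[1.0, 2.0, 2.0], [1.0, 2.0]],
    "off": [[5.0, 4.0, 4.0], [5.0, 4.0]],
}
keyword, score = nearest_keyword([1.0, 2.1], bank, toy_distance)
print(keyword)  # on
```

Adding more utterances per keyword only lengthens the template lists, which is why collecting as many variants as possible helps.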

Evaluation metrics
• Accuracy
  – High noise
  – Low noise
  – Independent speaker
  – Training data speaker
  – Would like to obtain 80% or more
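Accuracy under each condition could be tallied as below. This is a sketch in Python: the function name, the trial format and the condition labels are assumptions, and the trials listed are placeholders rather than measured results.

```python
from collections import defaultdict

TARGET_ACCURACY = 0.80  # the 80% goal stated above

def accuracy_by_condition(trials):
    """trials: iterable of (condition, expected_keyword, predicted_keyword)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, expected, predicted in trials:
        total[condition] += 1
        if expected == predicted:
            correct[condition] += 1
    return {condition: correct[condition] / total[condition] for condition in total}

# Placeholder trials, only to show the expected input shape.
example_trials = [
    ("low noise", "on", "on"),
    ("low noise", "off", "off"),
    ("high noise", "on", "off"),
    ("high noise", "off", "off"),
]
for condition, acc in accuracy_by_condition(example_trials).items():
    status = "meets target" if acc >= TARGET_ACCURACY else "below target"
    print(f"{condition}: {acc:.0%} ({status})")
```

The same function covers the speaker conditions if each trial is tagged "independent speaker" or "training data speaker" instead of a noise level.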

Conclusion
• Early stage
  – No code implemented yet
  – Many challenges ahead
  – Our methodology may change slightly
• There is a large potential market for such a technique -> influence on everyday life