Endpoint Detection
Jyh-Shing Roger Jang (張智星)
http://mirlab.org/jang
MIR Lab, CSIE Dept., National Taiwan Univ., Taiwan

Intro to Endpoint Detection
- Endpoint detection (EPD)
  - Goal: Determine the start and end of voice activity
  - Also known as voice activity detection (VAD)
- Importance
  - Acts as a preprocessing step for speech-based applications
  - Should require as little computing power as possible (cell phones too!)
- Two recording modes for speech-based applications (Quiz!)
  - Push to talk -> offline EPD
    - Example: voice command
  - Continuously listening -> online EPD
    - Example: dialog systems such as Siri

Types of Features for EPD
- Time-domain
  - Volume only
  - Volume and ZCR (zero crossing rate)
  - Volume and HOD (high-order difference)
  - …
- Frequency-domain
  - Variance of spectrum
  - Entropy of spectrum
  - Spectrum
  - MFCC
  - …
- Some features belong to both!

Typical Approaches to EPD
- Thresholding
  - Simple thresholding
    - Compute a feature (e.g., volume) from each frame
    - Select a threshold v_th to identify frames of voice activity
  - Combined thresholding
    - Use two features (e.g., volume and ZCR) to make the decision
- Static classification
  - Extract features
  - Perform binary classification
    - Negative: silence or noise
    - Positive: voice activity
- Sequence alignment
  - Use hidden Markov models (HMM) for sequence alignment
- You need to use these approaches in the EPD program competition.

Performance Evaluation for EPD (1/2)
- Two types of errors, typical for all binary classification (Quiz!)
  - False negative (aka false rejection): a positive is classified as negative
  - False positive (aka false acceptance): a negative is classified as positive
- Confusion matrix/table

Performance Evaluation for EPD (2/2)
- Typical methods (Quiz!)
  - Start & end position accuracy
  - Frame-based accuracy

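The two error types and the frame-based accuracy can be illustrated with a small Python sketch. It assumes each frame carries a 0/1 label (1 = voice activity); the function names `confusion_counts` and `frame_accuracy` and the toy labels are made up for illustration and are not part of the original slides.

```python
def confusion_counts(truth, predicted):
    """Count TP, FN, FP, TN for two equal-length lists of 0/1 frame labels."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("label sequences must be non-empty and of equal length")
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for t, p in zip(truth, predicted):
        if t and p:
            counts["tp"] += 1
        elif t:
            counts["fn"] += 1  # false rejection: voice frame labeled as silence
        elif p:
            counts["fp"] += 1  # false acceptance: silence/noise labeled as voice
        else:
            counts["tn"] += 1
    return counts


def frame_accuracy(truth, predicted):
    """Fraction of frames whose predicted label matches the ground truth."""
    c = confusion_counts(truth, predicted)
    return (c["tp"] + c["tn"]) / len(truth)


# Toy labels (assumed data): one false acceptance and one false rejection.
truth = [0, 0, 1, 1, 1, 1, 0, 0]
predicted = [0, 1, 1, 1, 1, 0, 0, 0]
print(confusion_counts(truth, predicted))
print(frame_accuracy(truth, predicted))
```

Start & end position accuracy would instead compare the detected boundary frames with the labeled ones, typically within some tolerance.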
EPD by Volume Thresholding
- The simplest method for EPD
  - Volume is the sum of the absolute values of the samples in a frame.
- Four intuitive ways to select the threshold v_th:
  - v_th = v_max * α
  - v_th = v_median * β
  - v_th = v_min * γ
  - v_th = v_1 * δ (v_1 is the volume of the first frame)

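A minimal Python sketch of volume thresholding, under stated assumptions: frames are cut with a fixed hop, the threshold uses the v_max rule, and α = 0.5 is an arbitrary illustrative value. The helper names `frame_volumes` and `simple_threshold_epd` and the toy signal are hypothetical; this is not the deck's MATLAB code.

```python
def frame_volumes(signal, frame_size, hop_size):
    """Volume of each frame: sum of absolute sample values."""
    volumes = []
    for start in range(0, len(signal) - frame_size + 1, hop_size):
        frame = signal[start:start + frame_size]
        volumes.append(sum(abs(x) for x in frame))
    return volumes


def simple_threshold_epd(volumes, v_th):
    """Return (first, last) index of frames with volume above v_th, or None."""
    active = [i for i, v in enumerate(volumes) if v > v_th]
    return (active[0], active[-1]) if active else None


# Toy signal (assumed data): quiet edges, louder middle.
signal = [0, 0, 1, -1, 4, -5, 6, -4, 1, 0, 0, 0]
volumes = frame_volumes(signal, frame_size=4, hop_size=2)
v_th = max(volumes) * 0.5  # v_th = v_max * alpha, with alpha = 0.5 assumed
print(volumes, simple_threshold_epd(volumes, v_th))
```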
How Do They Fail?
- Unfortunately…
  - All the thresholds fail one way or another.
  - Under what situations do they fail?
    - v_th = v_max * α: plosive sounds
    - v_th = v_median * β: silence too long
    - v_th = v_min * γ: total-zero frame
    - v_th = v_1 * δ: unstable frame
- We need a better strategy…

A Better Strategy for Threshold Finding
- A presumably better way to select v_th
  - v_lower = 3rd percentile of volumes
  - v_upper = 97th percentile of volumes
  - v_th = (v_upper - v_lower) * k + v_lower
- Why do we need to use percentiles?
  - To deal with plosive sounds
  - To deal with total-zero frames
- Does it fail? Yes, still, in certain situations…

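The percentile rule can be sketched in Python as follows. The linear-interpolation `percentile` helper, the function name `percentile_threshold`, and the default k = 0.1 are assumptions made for illustration; the slides do not specify how the percentile is computed or what value k takes.

```python
def percentile(values, pct):
    """Linearly interpolated percentile (pct in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def percentile_threshold(volumes, k=0.1):
    """v_th = (v_upper - v_lower) * k + v_lower, using the 3rd/97th percentiles."""
    v_lower = percentile(volumes, 3)
    v_upper = percentile(volumes, 97)
    return (v_upper - v_lower) * k + v_lower


# Toy volumes (assumed data): one total-zero frame and one plosive-like spike.
volumes = list(range(100)) + [10000]
print(percentile_threshold(volumes))  # barely affected by the spike
print(max(volumes) * 0.1)             # the v_max rule is dominated by it
```

The single spike and the zero-volume frame fall outside the 3rd to 97th percentile range, which is why they have little influence on the resulting threshold.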
Example: EPD by Volume
- epdByVol01.m

How to Enhance EPD by Volume?
- Major problems of EPD by volume
  - Threshold is hard to determine -> corpus-based fine-tuning
  - Unvoiced parts are likely to be ignored -> we need a feature that enhances the unvoiced parts -> this can be achieved by ZCR or HOD

ZCR for Unvoiced Sound Detection
- ZCR: zero crossing rate (Quiz!)
  - Number of zero crossings in a frame
  - ZCR_voiced < ZCR_silence < ZCR_unvoiced
  - Quiz: If frame = [-1 2 -2 3 5 2 -2 1], what is its ZCR?
- Example: epdShowZcr01.m

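A minimal Python sketch of counting zero crossings in a frame. It assumes a crossing is a strict sign change between consecutive samples, so samples that are exactly zero do not count; other conventions exist. The function name `zero_crossing_count` is hypothetical.

```python
def zero_crossing_count(frame):
    """Count strict sign changes between consecutive samples of a frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)


# The frame from the quiz above.
frame = [-1, 2, -2, 3, 5, 2, -2, 1]
print(zero_crossing_count(frame))
```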
EPD by Volume and ZCR
1. Determine the initial endpoints by the upper volume threshold τ_u
2. Expand the initial endpoints based on the lower volume threshold τ_l
3. Further expand the endpoints based on the ZCR threshold τ_zc

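The three steps can be sketched in Python as below. This is an illustration under assumptions, not the deck's MATLAB implementation: it expects per-frame volume and ZCR lists of equal length, handles a single utterance only, and uses made-up threshold values. The function name `epd_by_volume_and_zcr` is hypothetical.

```python
def epd_by_volume_and_zcr(volumes, zcrs, tau_u, tau_l, tau_zc):
    """Return (start, end) frame indices of voice activity, or None."""
    # Step 1: initial endpoints from frames above the upper volume threshold.
    above = [i for i, v in enumerate(volumes) if v > tau_u]
    if not above:
        return None
    start, end = above[0], above[-1]

    # Step 2: expand outward while volume stays above the lower threshold.
    while start > 0 and volumes[start - 1] > tau_l:
        start -= 1
    while end < len(volumes) - 1 and volumes[end + 1] > tau_l:
        end += 1

    # Step 3: expand further while ZCR stays above the ZCR threshold,
    # which picks up low-volume unvoiced sounds at the boundaries.
    while start > 0 and zcrs[start - 1] > tau_zc:
        start -= 1
    while end < len(zcrs) - 1 and zcrs[end + 1] > tau_zc:
        end += 1
    return start, end


# Toy per-frame features (assumed data).
volumes = [1, 1, 2, 4, 9, 12, 10, 5, 2, 1, 1]
zcrs = [3, 3, 9, 4, 2, 2, 2, 3, 8, 9, 3]
print(epd_by_volume_and_zcr(volumes, zcrs, tau_u=8, tau_l=3, tau_zc=7))
```

With these toy values, step 1 yields frames 4 to 6, step 2 widens that to 3 to 7, and step 3 widens it again to 2 to 9 because of the high-ZCR frames on both sides.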
Example: EPD by Volume and ZCR
- epdByVolZcr01.m

EPD by Volume and HOD
- Another feature to enhance unvoiced sounds:
  - High-order difference (HOD)
    - Order-1 HOD = sum(abs(diff(s)))
    - Order-2 HOD = sum(abs(diff(diff(s))))
    - Order-3 HOD = sum(abs(diff(diff(diff(s)))))
    - …
  - Quiz: If frame = [-1 2 -2 3 -3 2 -2 1], what is its order-n HOD when n is 1, 2, and 3?

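A small Python sketch of the order-n HOD, mirroring the MATLAB expression sum(abs(diff(...))) with diff applied n times. The helper names `diff` and `high_order_difference` are chosen here for illustration.

```python
def diff(seq):
    """First-order difference, like MATLAB's diff on a vector."""
    return [b - a for a, b in zip(seq, seq[1:])]


def high_order_difference(frame, order):
    """Sum of absolute values after applying diff `order` times."""
    d = list(frame)
    for _ in range(order):
        d = diff(d)
    return sum(abs(x) for x in d)


# The frame from the quiz above.
frame = [-1, 2, -2, 3, -3, 2, -2, 1]
for n in (1, 2, 3):
    print(n, high_order_difference(frame, n))
```

Each differencing step acts as a crude high-pass filter, which is why rapidly alternating (unvoiced-like) frames score higher as the order grows.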
Example: Plots of Volume and HOD
- highOrderDiff01.m

Example: EPD by Vol. and HOD
- epdByVolHod01.m

Hard Example: EPD by Vol. and HOD
- A hard example: epdByVolHod02.m

Spectrogram
- Goal
  - Describe the energy distribution in each frame along time
- MATLAB command
  - [S, F, T] = spectrogram(signal, frameSize, overlap, fftSize, fs);
- Facts
  - Real signals for FFT -> complex-conjugate spectrum -> take the first fftSize/2+1 points (frameSize/2+1 when fftSize equals frameSize) when we consider magnitude only
  - Use zero padding to have a larger fftSize -> finer frequency resolution

EPD by Spectrum
- epdShowSpec01.m
- epdShowSpec02.m

How to Aggregate Spectrum?
- How can we aggregate the spectrum into a single feature that is larger (or smaller) when the spectral energy distribution is diversified?
  - Entropy function
  - Geometric mean over arithmetic mean

Entropy Function (1/2) (Quiz!)
- Entropy function
- Property

Entropy Function (2/2) (Quiz!)
- Proof by taking derivative

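Assuming the function meant here is the standard Shannon entropy of a discrete distribution (the notation below is chosen for illustration), the definition, its maximum property, and a derivative-based proof can be written as:

```latex
% Definition, for p_i >= 0 and \sum_i p_i = 1 (with 0 \ln 0 := 0):
H(p_1, \dots, p_n) = -\sum_{i=1}^{n} p_i \ln p_i

% Property: the entropy is bounded, and maximal for the uniform distribution:
0 \le H(p_1, \dots, p_n) \le \ln n,
\qquad H = \ln n \iff p_1 = \dots = p_n = \tfrac{1}{n}

% Proof for n = 2, writing p_1 = p and p_2 = 1 - p:
H(p) = -p \ln p - (1 - p) \ln (1 - p)
\frac{dH}{dp} = \ln \frac{1 - p}{p} = 0 \;\Rightarrow\; p = \tfrac{1}{2},
\qquad \frac{d^2 H}{dp^2} = -\frac{1}{p} - \frac{1}{1 - p} < 0

% General n, with a Lagrange multiplier for the constraint \sum_i p_i = 1:
\frac{\partial}{\partial p_i}
  \Bigl[ -\sum_{j} p_j \ln p_j + \lambda \Bigl( \sum_{j} p_j - 1 \Bigr) \Bigr]
  = -\ln p_i - 1 + \lambda = 0
\;\Rightarrow\; p_i = e^{\lambda - 1} = \tfrac{1}{n}
```

The stationarity condition gives the same value for every p_i, so the maximum is reached when the distribution is flat, which is the behavior wanted from a feature that measures how diversified the spectrum is.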
Plots of Entropy Function
- n = 2
- n = 3
- entropyPlot.m

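Spectral Entropy
- PDF: treat the spectrum of a frame as a probability density function
- Normalization: scale the spectrum so that its values sum to 1
- Spectral entropy: the entropy of the normalized spectrum
- Reference: Jia-lin Shen, Jeih-weih Hung, Lin-shan Lee, "Robust entropy-based endpoint detection for speech recognition in noisy environments", International Conference on Spoken Language Processing, Sydney, 1998
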
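A minimal Python sketch of spectral entropy for one frame. Several choices here are assumptions rather than details from the slides or the cited paper: the magnitude spectrum (not power) is normalized, a naive O(N^2) DFT is used to stay dependency-free, the natural logarithm is used, and a total-zero frame is assigned entropy 0. The function names are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the first N/2+1 DFT bins of a real-valued frame."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_entropy(frame):
    """Entropy of the magnitude spectrum normalized to sum to 1."""
    spectrum = magnitude_spectrum(frame)
    total = sum(spectrum)
    if total == 0:
        return 0.0  # total-zero frame: defined as 0 here (assumption)
    pdf = [s / total for s in spectrum]
    return -sum(p * math.log(p) for p in pdf if p > 0)


# A pure tone concentrates energy in one bin; an impulse spreads it evenly.
tone = [math.cos(2 * math.pi * 4 * t / 32) for t in range(32)]
impulse = [1.0] + [0.0] * 31
print(spectral_entropy(tone), spectral_entropy(impulse))
```

The tone gives an entropy close to 0, while the impulse, whose spectrum is flat over 17 bins, gives a value close to ln 17.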
Geometric/Arithmetic Means
- Arithmetic & geometric means (Quiz!)
- Property
- Proof…

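A Python sketch of the geometric-mean-over-arithmetic-mean ratio of a spectrum. It relies on the standard AM-GM inequality (the arithmetic mean of non-negative values is at least their geometric mean, with equality only when all values are equal), so the ratio lies in (0, 1] and approaches 1 for a flat spectrum. The function name `gm_over_am` and the small floor `eps` that guards against log(0) are assumptions.

```python
import math


def gm_over_am(spectrum, eps=1e-12):
    """Ratio of geometric mean to arithmetic mean of non-negative values."""
    values = [max(s, eps) for s in spectrum]  # floor avoids log(0)
    geometric = math.exp(sum(math.log(v) for v in values) / len(values))
    arithmetic = sum(values) / len(values)
    return geometric / arithmetic


# Toy spectra (assumed data): flat versus strongly peaked.
flat = [2.0, 2.0, 2.0, 2.0]
peaked = [8.0, 0.001, 0.001, 0.001]
print(gm_over_am(flat), gm_over_am(peaked))
```

Computing the geometric mean through the mean of logarithms avoids overflow or underflow when many spectral values are multiplied together.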
Classification Based EPD
- Classify each frame as silence or not
  - Features of a frame
    - Magnitude/power spectrum
    - Others: ZCR, HOD, entropy, gm/am, …
  - Static classifiers to separate silence (S) from unvoiced/voiced sound (UV)
    - KNNC, NBC, SVM, NN, …
    - Use Machine Learning Toolbox!
  - Sequence aligner to find the boundaries S -> UV and UV -> S
    - HMM, CRF, …