Speech Signal Representations I Seminar Speech Recognition 2002

  • Slides: 35
Speech Signal Representations I Seminar Speech Recognition 2002 F. R. Verhage

Decomposition of the speech signal (x[n]) as a source (e[n]) passed through a linear time-varying filter (h[n]).
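The decomposition above can be sketched for a single short frame, where the filter is treated as fixed, as a plain convolution x[n] = e[n] * h[n]. The impulse-train source, the three-tap impulse response and the helper name `convolve` are made-up illustrations, not values from the slides.

```python
def convolve(e, h):
    """Linear convolution: x[n] = sum_k e[k] * h[n - k]."""
    x = [0.0] * (len(e) + len(h) - 1)
    for i, e_i in enumerate(e):
        for j, h_j in enumerate(h):
            x[i + j] += e_i * h_j
    return x

# Hypothetical source: impulse train with a period of 4 samples.
e = [1.0 if n % 4 == 0 else 0.0 for n in range(12)]
# Hypothetical filter: short decaying impulse response (fixed within the frame).
h = [1.0, 0.5, 0.25]

x = convolve(e, h)
print(x)
```

Each source impulse triggers one copy of the impulse response, which is the picture behind treating voiced speech as a periodic excitation shaped by the vocal tract.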

Estimation of the filter, inspired by:
§ Speech production models
  – Linear Predictive Coding (LPC)
  – Cepstral analysis
§ Speech perception models (part II)
  – Mel-frequency cepstrum
  – Perceptual Linear Prediction (PLP)
Speech recognizers estimate filter characteristics and ignore the source.

Short-Time Fourier Analysis
§ Spectrogram
  – Representation of a signal highlighting several of its properties based on short-time Fourier analysis
  – Two-dimensional: time horizontal and frequency vertical
  – Third ‘dimension’: gray or color level indicating energy

Short-Time Fourier Analysis
§ Spectrogram
  – Narrow band
    § Long windows (> 20 ms) → narrow bandwidth
    § Lower time resolution, better frequency resolution
  – Wide band
    § Short windows (< 10 ms) → wide bandwidth
    § Good time resolution, lower frequency resolution
  – Pitch synchronous
    § Requires knowledge of local pitch period
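As a rough illustration of the trade-off above, the sketch below converts a window duration into a length in samples and into the reciprocal of the window duration. That reciprocal is only a first-order proxy for frequency resolution, since the actual main-lobe width also depends on the window shape. The 16 kHz sampling rate, the 25 ms and 5 ms durations and the name `window_stats` are assumptions chosen for illustration.

```python
def window_stats(duration_ms, fs):
    """Return (window length in samples, 1/duration in Hz)."""
    n_samples = round(duration_ms * fs / 1000)
    return n_samples, fs / n_samples

fs = 16000  # assumed sampling rate in Hz

narrow = window_stats(25, fs)  # long window: narrow-band spectrogram
wide = window_stats(5, fs)     # short window: wide-band spectrogram
print(narrow, wide)
```

Under these assumptions the long window gives a spacing of about 40 Hz, fine enough to separate the harmonics of a 100 Hz voice, while the short window gives about 200 Hz and smears them together.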

Short-Time Fourier Analysis
§ Window analysis
  – Series of short segments, analysis frames
  – Short enough so that the signal is stationary
  – Usually constant, 20-30 ms
  – Overlaps possible
  – Different types of window functions (wm[n]):
    § Rectangular (equal to no window function)
    § Hamming
    § Hanning
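The framing procedure listed above can be sketched with the standard library only: cut the signal into overlapping frames, multiply each frame by a Hamming window, and take the magnitude of its DFT. The 8 kHz sampling rate, the 20 ms frame with a 10 ms hop, the 500 Hz test tone and all function names are assumptions for illustration. A practical implementation would use an FFT routine instead of the direct DFT written out here.

```python
import cmath
import math

def hamming(n_len):
    """Symmetric Hamming window of n_len samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_len - 1))
            for n in range(n_len)]

def split_frames(x, frame_len, hop):
    """Cut x into (possibly overlapping) frames of constant length."""
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]

def magnitude_spectrum(frame, window):
    """Magnitude of the DFT of the windowed frame, bins 0..N/2."""
    n_len = len(frame)
    xw = [a * w for a, w in zip(frame, window)]
    return [abs(sum(xw[n] * cmath.exp(-2j * math.pi * k * n / n_len)
                    for n in range(n_len)))
            for k in range(n_len // 2 + 1)]

fs = 8000        # assumed sampling rate in Hz
frame_len = 160  # 20 ms
hop = 80         # 10 ms, i.e. 50 % overlap

# Hypothetical test signal: a 500 Hz tone, 60 ms long.
x = [math.sin(2 * math.pi * 500 * n / fs) for n in range(480)]

w = hamming(frame_len)
spectrogram = [magnitude_spectrum(f, w) for f in split_frames(x, frame_len, hop)]
peak_bins = [max(range(len(s)), key=s.__getitem__) for s in spectrogram]
print(len(spectrogram), peak_bins)
```

Each row of `spectrogram` is one column of the displayed image. With a bin spacing of fs / frame_len = 50 Hz, the tone falls on bin 10 in every frame.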

Short-Time Fourier Analysis
§ Window analysis
  – Window size N must be long enough relative to the pitch period M
    § Rectangular: N ≥ M
    § Hamming, Hanning: N ≥ 2M
  – Pitch period not known in advance →
  – Prepare for the lowest pitch (50 Hz, i.e. the longest pitch period, 20 ms) →
  – At least 20 ms for rectangular or 40 ms for Hamming/Hanning
  – But longer windows give a more averaged spectrum instead of distinct spectra →
  – Rectangular window has better time resolution
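The 20 ms and 40 ms figures follow directly from the 50 Hz lower pitch bound: one pitch period lasts 1/50 s, and the Hamming/Hanning rule asks for two periods. A minimal sketch of that arithmetic, with the helper name `min_window_ms` made up for illustration:

```python
def min_window_ms(lowest_pitch_hz, periods_needed):
    """Window duration (ms) spanning the given number of the longest pitch periods."""
    return periods_needed * 1000.0 / lowest_pitch_hz

rect_ms = min_window_ms(50, 1)     # rectangular: N >= M
hamming_ms = min_window_ms(50, 2)  # Hamming/Hanning: N >= 2M
print(rect_ms, hamming_ms)
```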

Short-Time Fourier Analysis
§ Window analysis
  – Frequency response not completely zero outside main lobe → spectral leakage
  – Second lobe of a Hamming window is approx. 43 dB below main lobe → less spectral leakage
  – Hamming, Hanning, triangular windows offer less spectral leakage →
  – Rectangular windows are rarely used despite their better time resolution
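The side-lobe levels mentioned above can be checked numerically. The sketch below evaluates the magnitude response of a window on a dense frequency grid, walks down the main lobe to its first minimum, and reports the highest level found beyond it. The window length of 64, the grid size and the function names are assumptions for illustration, and the measured values depend slightly on those choices.

```python
import cmath
import math

def response_db(window, n_freqs=1024):
    """Magnitude response in dB relative to the DC peak, for 0 <= omega < pi."""
    peak = abs(sum(window))
    out = []
    for i in range(n_freqs):
        omega = math.pi * i / n_freqs
        mag = abs(sum(w * cmath.exp(-1j * omega * n) for n, w in enumerate(window)))
        out.append(20 * math.log10(max(mag / peak, 1e-12)))  # clamp exact nulls
    return out

def peak_sidelobe_db(window):
    """Highest level outside the main lobe, in dB below the main-lobe peak."""
    resp = response_db(window)
    i = 1
    while i < len(resp) - 1 and resp[i] < resp[i - 1]:  # descend the main lobe
        i += 1
    return max(resp[i:])

N = 64  # assumed window length in samples
rect = [1.0] * N
hamming = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

rect_db = peak_sidelobe_db(rect)
hamming_db = peak_sidelobe_db(hamming)
print(round(rect_db, 1), round(hamming_db, 1))
```

The rectangular window should come out near -13 dB and the Hamming window in the low -40s, consistent with the "approx. 43 dB" figure on the slide.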

Short-Time Fourier Analysis
Short-time spectrum of male voice speech
a) Time signal /ah/, local pitch 110 Hz
b) 30 ms rectangular window
c) 15 ms rectangular window
d) 30 ms Hamming window
e) 15 ms Hamming window

Short-Time Fourier Analysis
Short-time spectrum of female voice speech
a) Time signal /aa/, local pitch 200 Hz
b) 30 ms rectangular window
c) 15 ms rectangular window
d) 30 ms Hamming window
e) 15 ms Hamming window

Short-Time Fourier Analysis
Short-time spectrum of unvoiced speech
a) Time signal
b) 30 ms rectangular window
c) 15 ms rectangular window
d) 30 ms Hamming window
e) 15 ms Hamming window

Linear Predictive Coding
§ LPC a.k.a. auto-regressive (AR) modeling
§ All-pole filter is a good approximation of speech, with p as the order of the LPC analysis:
  $H(z) = \frac{X(z)}{E(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}$
§ Predicts current sample as linear combination of past p samples:
  $\tilde{x}[n] = \sum_{k=1}^{p} a_k x[n-k]$

Linear Predictive Coding
§ To estimate predictor coefficients (a_k), use short-term analysis technique
§ Per segment, minimize the total squared prediction error
§ Take the derivative with respect to each a_k, equate it to 0; this gives a set of p linear equations: the Yule-Walker equations

Linear Predictive Coding
§ Solution of the Yule-Walker equations:
  – Any standard matrix inversion package
  – Due to the special form of the matrix, efficient solutions:
    § Covariance method using the Cholesky decomposition
    § Autocorrelation method using windows, results in equations with Toeplitz matrices, solved by the Durbin recursion algorithm
    § Lattice method, equivalent to Levinson-Durbin recursion, often used in fixed-point implementations because lack of precision doesn’t result in unstable filters
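A minimal sketch of the autocorrelation method with the Levinson-Durbin recursion is given below. It assumes the predictor convention x~[n] = sum_k a[k] * x[n-k]; other texts flip the sign of the coefficients. The function names and the test signal are made up for illustration: the signal is the impulse response of the one-pole filter 1 / (1 - 0.5 z^-1), so a second-order analysis should recover a_1 close to 0.5 and a_2 close to 0.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum_n x[n] * x[n + lag] over one (already windowed) segment."""
    n_len = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n_len - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations sum_k a[k] r[|i-k|] = r[i].

    Returns (predictor coefficients a_1..a_p,
             reflection coefficients k_1..k_p,
             final prediction error energy).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    ks = []
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        ks.append(k)
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], ks, err

# Hypothetical segment: impulse response of 1 / (1 - 0.5 z^-1).
x = [0.5 ** n for n in range(50)]
r = autocorrelation(x, 2)
a, ks, err = levinson_durbin(r, 2)
print(a, ks, err)
```

The recursion costs on the order of p^2 operations instead of the p^3 of a general matrix solve, and the intermediate values `ks` are the reflection coefficients used by the lattice form.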

Linear Predictive Coding
§ Spectral analysis via LPC
  – All-pole (IIR) filter
  – Peaks at the roots of the denominator (the poles)
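To illustrate how the LPC spectrum peaks near the roots of the denominator, the sketch below evaluates the magnitude of 1 / A(e^{jw}) on a frequency grid, assuming the convention A(z) = 1 - sum_k a_k z^{-k}. The pole radius 0.95, the pole angle pi/4, the grid size and the function name are made-up values for illustration.

```python
import cmath
import math

def lpc_spectrum_db(a, gain=1.0, n_freqs=256):
    """20*log10 |gain / A(e^{jw})| for w = pi * i / n_freqs, i = 0..n_freqs-1."""
    out = []
    for i in range(n_freqs):
        omega = math.pi * i / n_freqs
        denom = 1.0 - sum(a_k * cmath.exp(-1j * omega * (k + 1))
                          for k, a_k in enumerate(a))
        out.append(20 * math.log10(gain / abs(denom)))
    return out

# Hypothetical second-order model: complex pole pair at radius r, angle theta.
r, theta = 0.95, math.pi / 4
a = [2 * r * math.cos(theta), -r * r]  # A(z) = 1 - a1 z^-1 - a2 z^-2

spec = lpc_spectrum_db(a)
peak_index = max(range(len(spec)), key=spec.__getitem__)
print(peak_index, math.pi * peak_index / len(spec))
```

The peak lands at or next to grid index 64, which corresponds to the pole angle pi/4. Moving the pole closer to the unit circle would make the peak sharper.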

Linear Predictive Coding
§ Prediction error
  – Should be (approximately) the excitation
  – Unvoiced speech, expect white noise; OK
  – Voiced speech, expect impulse train; not OK
    § All-pole assumption not altogether valid
    § Real speech not perfectly periodic
    § Pitch synchronous analysis gives better results
  – LPC order
    § Larger p gives lower prediction errors
    § Too large a p results in fitting the individual harmonics → separation between filter and source will not be so good

Linear Predictive Coding
§ Prediction error
  – Inverse LPC filter gives residual signal
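The inverse-filter idea can be sketched as a round trip: synthesize a signal by driving an all-pole filter with a known excitation, then apply the inverse filter A(z) and check that the residual matches that excitation. This assumes the convention x[n] = e[n] + sum_k a[k] * x[n-k]. The coefficients, the impulse-train excitation and the function names are made-up illustrations. With real speech the coefficients are estimates, so the residual only approximates the excitation.

```python
def synthesize(e, a):
    """All-pole filter 1/A(z): x[n] = e[n] + sum_k a[k] * x[n-1-k]."""
    x = []
    for n, e_n in enumerate(e):
        pred = sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        x.append(e_n + pred)
    return x

def residual(x, a):
    """Inverse filter A(z): e[n] = x[n] - sum_k a[k] * x[n-1-k]."""
    return [x[n] - sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
            for n in range(len(x))]

# Hypothetical stable second-order filter (pole radius sqrt(0.8)).
a = [1.3, -0.8]
# Hypothetical voiced excitation: impulse train with a period of 8 samples.
e = [1.0 if n % 8 == 0 else 0.0 for n in range(40)]

x = synthesize(e, a)
e_hat = residual(x, a)
max_diff = max(abs(u - v) for u, v in zip(e, e_hat))
print(max_diff)
```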

Linear Predictive Coding
§ Alternatives for the predictor coefficients
  – Line Spectral Frequencies
    § Local sensitivity
    § Efficiency
  – Reflection Coefficients
    § Guaranteed stable → useful for coefficients interpolated over time
  – Log-area ratios
    § Flat spectral sensitivity
  – Roots of the polynomial
    § Represent resonance frequencies and bandwidths
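The stability remark about reflection coefficients can be illustrated with the step-up recursion, which converts reflection coefficients back into predictor coefficients. The sketch assumes the convention x~[n] = sum_k a[k] * x[n-k] with the last coefficient of each order equal to the reflection coefficient; sign conventions differ between texts. The coefficient values and function names are made up. Because a filter is stable whenever every |k_i| < 1, interpolating two stable sets coefficient by coefficient stays inside that range.

```python
def reflection_to_predictor(ks):
    """Step-up recursion: reflection coefficients k_1..k_p -> a_1..a_p."""
    a = []
    for k in ks:
        m = len(a)  # order of the predictor built so far
        a = [a[j] - k * a[m - 1 - j] for j in range(m)] + [k]
    return a

def interpolate(ks_a, ks_b, t):
    """Linear interpolation between two reflection-coefficient sets, 0 <= t <= 1."""
    return [(1.0 - t) * u + t * v for u, v in zip(ks_a, ks_b)]

# Hypothetical reflection coefficients for two neighbouring frames.
ks_frame1 = [0.5, -0.5]
ks_frame2 = [0.9, -0.7]

a_frame1 = reflection_to_predictor(ks_frame1)
ks_mid = interpolate(ks_frame1, ks_frame2, 0.5)
a_mid = reflection_to_predictor(ks_mid)
print(a_frame1, ks_mid, a_mid)
```

Interpolating the predictor coefficients directly carries no such guarantee, which is one reason the reflection form is preferred when parameters are smoothed over time.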

Cepstral Processing
– A homomorphic transformation converts a convolution into a sum:
  $x[n] = e[n] * h[n] \;\Rightarrow\; \hat{x}[n] = \hat{e}[n] + \hat{h}[n]$
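The statement above can be checked numerically with the real cepstrum, taken here as the inverse DFT of the log-magnitude DFT. This is a simpler stand-in for the general homomorphic transform, which also carries phase. The sketch uses circular convolution on a 16-point grid so that the DFT product rule holds exactly. The two short sequences are made-up examples chosen so that neither spectrum has a zero, and all function names are hypothetical.

```python
import cmath
import math

def dft(x):
    n_len = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_len) for n in range(n_len))
            for k in range(n_len)]

def real_cepstrum(x):
    """Inverse DFT of log|DFT(x)|; requires a spectrum without zeros."""
    n_len = len(x)
    log_mag = [math.log(abs(v)) for v in dft(x)]
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * n / n_len)
                for k in range(n_len)).real / n_len
            for n in range(n_len)]

def circular_convolve(e, h):
    n_len = len(e)
    return [sum(e[k] * h[(n - k) % n_len] for k in range(n_len)) for n in range(n_len)]

N = 16
# Hypothetical source: an impulse followed by a weaker one 4 samples later.
e = [0.0] * N
e[0], e[4] = 1.0, 0.6
# Hypothetical filter: short decaying impulse response, zero-padded to N.
h = [1.0, 0.5, 0.25, 0.125] + [0.0] * (N - 4)

x = circular_convolve(e, h)
c_x, c_e, c_h = real_cepstrum(x), real_cepstrum(e), real_cepstrum(h)
max_diff = max(abs(cx - (ce + ch)) for cx, ce, ch in zip(c_x, c_e, c_h))
print(max_diff)
```

Because log|E H| = log|E| + log|H|, the cepstrum of the convolved signal equals the sum of the two cepstra up to rounding error.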
