Voice DSP Processing II Yaakov J. Stein Chief Scientist RAD Data Communications Stein Voice. DSP 2. 1

Voice DSP
Part 1: Speech biology and what we can learn from it
Part 2: Speech DSP (AGC, VAD, features, echo cancellation)
Part 3: Speech compression techniques
Part 4: Speech recognition

Voice DSP - Part 2
Simplest processing
– Gain
– AGC
– VAD
More complex processing
– pitch tracking
– U/V decision
– computing LPC
– other features
Echo cancellation
– Sources of echo
– Echo suppression
– Echo cancellation
– Adaptive noise cancellation
– The LMS algorithm
– Other adaptive algorithms
– The standard LEC

Voice DSP - Part 2a: Simplest voice DSP

Gain (volume) control
In analog processing (electronics), gain requires an amplifier.
Great care must be taken to ensure linearity!
In digital processing (DSP), gain requires only multiplication: $y = G x$
Need enough bits!

Automatic Gain Control (AGC)
Can we set the gain automatically? Yes, based on the signal's energy!
$E = \int x^2(t)\,dt = \sum_n x_n^2$
All we have to do is apply gain until we attain the desired energy.
Assume we want the energy to be $Y$. Then
$y = \sqrt{Y/E}\; x = G x$
has exactly this energy.
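A minimal sketch of this fixed-gain AGC, assuming the gain is the square root of the ratio between the desired energy Y and the measured energy E. The function name `fixed_gain_agc` and the handling of silent input are illustrative choices, not part of the slides.

```python
import math

def fixed_gain_agc(x, target_energy):
    """Scale x so that its total energy sum(x_n^2) equals target_energy."""
    energy = sum(v * v for v in x)            # E = sum of x_n^2
    if energy == 0.0:
        return 1.0, list(x)                   # silent input: leave it unchanged
    gain = math.sqrt(target_energy / energy)  # G = sqrt(Y / E)
    return gain, [gain * v for v in x]

# Input energy is 1 + 4 + 4 = 9, so reaching energy 36 needs a gain of 2.
gain, y = fixed_gain_agc([1.0, -2.0, 2.0], target_energy=36.0)
```

The whole signal must be available before the gain can be computed, which is why this form only suits stationary input.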

AGC - cont.
What if the input isn't stationary (gets stronger and weaker over time)?
The energy is defined over all times ($-\infty < t < \infty$), so it can't help!
So we define the "energy in a window" $E(t)$ and continuously vary the gain $G(t)$.
This is Adaptive Gain Control.
We don't want the gain to jump from window to window, so we smooth the instantaneous gain with an IIR filter:
$G(t) \leftarrow a\,G(t) + (1-a)\sqrt{Y/E(t)}$
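The windowed, smoothed gain can be sketched as follows. This is an illustration under stated assumptions: the window energy is measured per sample (so a short final window is comparable to the others), the gain is held during silent windows, and `max_gain` is a hypothetical safety cap that the slides do not mention.

```python
import math

def adaptive_agc(x, target_energy, window=160, a=0.9, max_gain=100.0):
    """Windowed AGC with first-order IIR smoothing of the gain.

    target_energy is the desired mean energy per sample (assumption).
    Returns (output samples, gain used in each window).
    """
    gain = 1.0
    gains, out = [], []
    for start in range(0, len(x), window):
        frame = x[start:start + window]
        energy = sum(v * v for v in frame) / len(frame)   # E(t), per sample
        if energy > 0.0:
            instantaneous = min(math.sqrt(target_energy / energy), max_gain)
        else:
            instantaneous = gain                          # hold gain in silence
        gain = a * gain + (1.0 - a) * instantaneous       # G <- a G + (1-a) sqrt(Y/E)
        gains.append(gain)
        out.extend(gain * v for v in frame)
    return out, gains

# A quiet constant-amplitude input: the gain creeps toward 10 instead of jumping.
out, gains = adaptive_agc([0.1] * 1600, target_energy=1.0)
```

With a = 0.9 the gain closes one tenth of the remaining gap per window; a smaller a reacts faster but follows short-term fluctuations more closely.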

AGC - cont.
The a coefficient determines how fast $G(t)$ can change.
In more complex implementations we may separately control integration time, attack time, and release time.
What is involved in the computation of $G(t)$?
– Squaring of input value
– Accumulation
– Square root (or Pythagorean sum)
– Inversion (division)
Square root and inversion are hard for a DSP processor, but algorithmic improvements are possible (and often needed).

Simple VAD
Sometimes it is useful to know whether someone is talking (or not):
– Save bandwidth
– Suppress echo
– Segment utterances
We might be able to get away with "energy VOX".
Normally need a Noise Riding Threshold / Signal Riding Threshold.
However, there are problems with energy VOX, since it doesn't differentiate between speech and noise.
What we really want is a speech-specific activity detector: a Voice Activity Detector (VAD).

Simple VAD - cont.
VADs operate by recognizing that speech is different from noise:
– Speech is low-pass while noise is white
– Speech is mostly voiced and so has pitch in a given range
– Average noise amplitude is relatively constant
A simple VAD may use:
– zero crossings
– zero crossing "derivative"
– spectral tilt filter
– energy contours
– combinations of the above
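One possible combination of these cues, sketched under loud assumptions: frames are classified as active when their energy exceeds a noise-riding threshold and their zero-crossing rate is low (low-pass content). The thresholds `energy_factor`, `max_zcr`, and `noise_adapt`, and the rule that the first frame is noise only, are hypothetical choices for illustration.

```python
import math
import random

def frame_features(frame):
    """Return (mean energy, zero-crossing rate) of one frame."""
    energy = sum(v * v for v in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return energy, crossings / (len(frame) - 1)

def simple_vad(frames, energy_factor=4.0, max_zcr=0.25, noise_adapt=0.95):
    noise = None
    decisions = []
    for frame in frames:
        energy, zcr = frame_features(frame)
        if noise is None:
            noise = energy  # assumption: the first frame contains noise only
        active = energy > energy_factor * noise and zcr < max_zcr
        if not active:
            # noise-riding threshold: track the noise floor while inactive
            noise = noise_adapt * noise + (1.0 - noise_adapt) * energy
        decisions.append(active)
    return decisions

# Three frames of weak white noise, then two frames of a strong 200 Hz tone
# (8 kHz sampling, 20 ms frames) standing in for voiced speech.
random.seed(0)
noise_frames = [[random.uniform(-0.01, 0.01) for _ in range(160)] for _ in range(3)]
tone_frames = [[0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
               for _ in range(2)]
decisions = simple_vad(noise_frames + tone_frames)
```

White noise crosses zero on roughly half of all sample pairs, while the low-frequency tone crosses only a few times per frame, so the two cues reinforce each other here.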

Other "simple" processes
Simple = not significantly dependent on details of speech signal
– Speed change of recorded signal
– Speed change with pitch compensation
– Pitch change with speed compensation
– Sample rate conversion
– Tone generation
– Tone detection
– Dual tone generation
– Dual tone detection (need high reliability)

Voice DSP - Part 2b: Complex voice DSP

Correlation
One major difference between simple and complex processing is the computation of correlations (related to the LPC model).
Correlation is a measure of similarity.
Shouldn't we use squared difference to measure similarity?
$D^2 = \langle (x(t) - y(t))^2 \rangle$
No, since squared difference is sensitive to
– gain
– time shifts

Correlation - cont.
$D^2 = \langle (x(t) - y(t))^2 \rangle = \langle x^2 \rangle + \langle y^2 \rangle - 2 \langle x(t)\,y(t) \rangle$
So when $D^2$ is minimal, $C(0) = \langle x(t)\,y(t) \rangle$ is maximal, and arbitrary gains don't change this.
To take time shifts into account, compute
$C(\tau) = \langle x(t)\,y(t+\tau) \rangle$
and look for the lag $\tau$ at which it is maximal!
We can even find out how much a signal resembles itself.
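A short sketch of the lag search, assuming finite sequences and averaging only over the samples where both signals overlap. The helper names `cross_correlation` and `best_lag` are illustrative.

```python
import random

def cross_correlation(x, y, lag):
    """Average of x[t] * y[t + lag] over the overlapping samples."""
    pairs = [(x[t], y[t + lag]) for t in range(len(x)) if 0 <= t + lag < len(y)]
    return sum(a * b for a, b in pairs) / len(pairs)

def best_lag(x, y, max_lag):
    """Lag in [-max_lag, max_lag] at which the cross-correlation is largest."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: cross_correlation(x, y, lag))

# y is x delayed by 5 samples and attenuated to 30 percent.
random.seed(1)
x = [random.uniform(-1.0, 1.0) for _ in range(200)]
y = [0.0] * 5 + [0.3 * v for v in x[:-5]]
delay = best_lag(x, y, max_lag=10)
delay_after_gain = best_lag(x, [10.0 * v for v in y], max_lag=10)
```

Rescaling y changes the height of the correlation peak but not its position, which is the gain insensitivity claimed above.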

Autocorrelation
Crosscorrelation: $C_{xy}(\tau) = \langle x(t)\,y(t+\tau) \rangle$
Autocorrelation: $C_x(\tau) = \langle x(t)\,x(t+\tau) \rangle$
$C_x(0)$ is the energy!
Autocorrelation helps find hidden periodicities! Much stronger than looking in the time representation.
Wiener-Khintchine: the autocorrelation $C(\tau)$ and the power spectrum $S(f)$ are an FT pair.
So autocorrelation contains the same information as the power spectrum … and can itself be computed by FFT.

Pitch tracking
How can we measure (and track) the pitch?
We can look for it in the spectrum
– but it may be very weak
– may not even be there (filtered out)
– need high resolution spectral estimation
Correlation based methods
The pitch periodicity should be seen in the autocorrelation!
Sometimes computationally simpler is the Average Magnitude Difference Function (AMDF)
$\langle\, |x(t) - x(t+\tau)| \,\rangle$
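The magnitude difference function needs no multiplications, and the pitch period shows up as a minimum rather than a maximum. A minimal sketch, assuming an 8 kHz sampling rate and a decaying pulse train as a stand-in for voiced speech (both are illustrative assumptions):

```python
def amdf(x, lag):
    """Average of |x[t] - x[t + lag]| over the overlapping samples."""
    n = len(x) - lag
    return sum(abs(x[t] - x[t + lag]) for t in range(n)) / n

def amdf_pitch_lag(x, min_lag, max_lag):
    """Lag with the smallest AMDF value inside the allowed pitch range."""
    return min(range(min_lag, max_lag + 1), key=lambda lag: amdf(x, lag))

# Synthetic "voiced" signal: a decaying pulse every 80 samples (100 Hz at 8 kHz).
signal = [0.8 ** (t % 80) for t in range(400)]
# Lags 20..147 correspond to roughly 400 Hz down to 54 Hz at 8 kHz.
lag = amdf_pitch_lag(signal, min_lag=20, max_lag=147)
pitch_hz = 8000 / lag
```

The search range deliberately stops short of twice the true period, since lag 160 would give an equally deep minimum.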

Pitch tracking - cont.
Sondhi's algorithm for autocorrelation-based pitch tracking:
– obtain window of speech
– determine if the segment is voiced (see U/V decision below)
– low-pass filter and center-clip to reduce formant induced correlations
– compute autocorrelation lags corresponding to valid pitch intervals
  • find lag with maximum correlation, OR
  • find lag with maximal accumulated correlation in all multiples
Post processing
Pitch trackers rarely make small errors (usually double pitch),
so correct outliers based on neighboring values.
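A simplified sketch in the spirit of the steps above, not Sondhi's exact procedure: it center-clips the window and picks the lag with the largest autocorrelation. The U/V decision, the low-pass filter, and the post-processing are omitted, and the clipping level of 30 percent of the peak is an assumed value.

```python
def center_clip(x, fraction=0.3):
    """Zero samples below the threshold and shrink the rest toward zero."""
    threshold = fraction * max(abs(v) for v in x)
    clipped = []
    for v in x:
        if v > threshold:
            clipped.append(v - threshold)
        elif v < -threshold:
            clipped.append(v + threshold)
        else:
            clipped.append(0.0)
    return clipped

def autocorrelation(x, lag):
    """Sum of x[t] * x[t + lag] over the overlapping samples."""
    return sum(x[t] * x[t + lag] for t in range(len(x) - lag))

def pitch_lag(frame, min_lag, max_lag):
    """Lag with maximum autocorrelation of the center-clipped frame."""
    clipped = center_clip(frame)
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: autocorrelation(clipped, lag))

# Same synthetic pulse train: one decaying pulse every 80 samples.
frame = [0.8 ** (t % 80) for t in range(400)]
lag = pitch_lag(frame, min_lag=20, max_lag=147)
```

Center clipping keeps only the strongest part of each pitch pulse, so the correlation at lags unrelated to the pitch period drops to zero in this example.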

Other pitch trackers
Miller's data-reduction & Gold and Rabiner's parallel processing methods
– Zero-crossings, energy, extrema of waveform
Noll's cepstrum based pitch tracker
– Since the pitch and formant contributions are separated in the cepstral domain
– Most accurate for clean speech, but not robust in noise
Methods based on LPC error signal
– LPC technique breaks down at pitch pulse onset
– Find periodicity of error by autocorrelation
Inverse filtering method
– Remove formant filtering by low-order LPC analysis
– Find periodicity of excitation by autocorrelation
Sondhi-like methods are the best for noisy speech.

U/V decision
Between VAD and pitch tracking
– Simplest U/V decision is based on energy and zero crossings
– More complex methods are combined with pitch tracking
– Methods based on pattern recognition
Is voicing well defined?
– Degree of voicing (buzz)
– Voicing per frequency band (interference)
– Degree of voicing per frequency band

LPC coefficients
How do we find the vocal tract filter coefficients?
System identification problem: unknown input → filter → known output
– All-pole (AR) filter
– Connection to prediction
$s_n = G\,e_n + \sum_m a_m s_{n-m}$
Can find $G$ from energy (so let's ignore it).

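LPC coefficients
For simplicity let's assume three a coefficients:
$s_n = e_n + a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3}$
Need three equations!
$s_n = e_n + a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3}$
$s_{n+1} = e_{n+1} + a_1 s_n + a_2 s_{n-1} + a_3 s_{n-2}$
$s_{n+2} = e_{n+2} + a_1 s_{n+1} + a_2 s_n + a_3 s_{n-1}$
In matrix form
$\begin{pmatrix} s_n \\ s_{n+1} \\ s_{n+2} \end{pmatrix} = \begin{pmatrix} e_n \\ e_{n+1} \\ e_{n+2} \end{pmatrix} + \begin{pmatrix} s_{n-1} & s_{n-2} & s_{n-3} \\ s_n & s_{n-1} & s_{n-2} \\ s_{n+1} & s_n & s_{n-1} \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix}$
that is, $\mathbf{s} = \mathbf{e} + S\,\mathbf{a}$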

LPC coefficients - cont.
$\mathbf{s} = \mathbf{e} + S\,\mathbf{a}$, so by simple algebra
$\mathbf{a} = S^{-1}(\mathbf{s} - \mathbf{e})$
and we have reduced the problem to matrix inversion.
$S$ is a Toeplitz matrix, so the inversion is easy (Levinson-Durbin algorithm).
Unfortunately, noise makes this attempt break down!
Move to the next time and the answer will be different.
Need to somehow average the answers.
The proper averaging is done before solving the equations (autocorrelation vs. autocovariance).

LPC coefficients - cont.
Can't just average over time - all equations would be the same!
Let's take the input to be zero:
$s_n = \sum_m a_m s_{n-m}$
Multiply by $s_{n-q}$ and sum over $n$:
$\sum_n s_n s_{n-q} = \sum_m a_m \sum_n s_{n-m} s_{n-q}$
We recognize the autocorrelations:
$C_s(q) = \sum_m C_s(|m-q|)\, a_m$
These are the Yule-Walker equations.
– autocorrelation method: $s_n$ outside the window are zero (Toeplitz)
– autocovariance method: use all needed $s_n$ (no window)
Also - pre-emphasis!
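The Yule-Walker equations of the autocorrelation method can be solved recursively. The sketch below is a textbook form of the Levinson-Durbin recursion for the prediction convention s_n ≈ Σ a_m s_{n-m}; the function names are illustrative, and windowing and pre-emphasis are left out.

```python
def autocorrelations(s, order):
    """C_s(q) for q = 0..order, treating samples outside the window as zero."""
    return [sum(s[n] * s[n - q] for n in range(q, len(s)))
            for q in range(order + 1)]

def levinson_durbin(c):
    """Solve C(q) = sum_m C(|m - q|) a_m for q = 1..p, given c = [C(0)..C(p)].

    Returns [a_1, ..., a_p].
    """
    order = len(c) - 1
    a = []
    error = c[0]                      # prediction error energy so far
    for i in range(1, order + 1):
        # reflection coefficient of stage i
        k = (c[i] - sum(a[m] * c[i - 1 - m] for m in range(i - 1))) / error
        # update the lower-order coefficients, then append the new one
        a = [a[m] - k * a[i - 2 - m] for m in range(i - 1)] + [k]
        error *= 1.0 - k * k
    return a

# Small check with hand-picked autocorrelation values C(0), C(1), C(2).
c = [1.0, 0.5, 0.2]
a = levinson_durbin(c)
residuals = [c[q] - sum(c[abs(m - q)] * a[m - 1] for m in range(1, len(c)))
             for q in range(1, len(c))]
```

The intermediate values k are the reflection coefficients, so the recursion yields one of the alternative feature sets as a by-product.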

Alternative features
The a coefficients aren't the only set of features:
– Reflection coefficients (cylinder model)
– Log-area coefficients (cylinder model)
– Pole locations
– LPC cepstrum coefficients
– Line Spectral Pair frequencies
All theoretically contain the same information (algebraic transformations).
– Euclidean distance in LPC cepstrum space ~ Itakura-Saito measure, so these are popular in speech recognition
– LPC (a) coefficients don't quantize or interpolate well, so these aren't good for speech compression
– LSP frequencies are best for compression

LSP coefficients
– a coefficients are not statistically equally weighted
– pole positions are better (geometric)
– but radius is sensitive near the unit circle
Is there an all-angle representation?
Theorem 1: Every real polynomial with all roots on the unit circle is palindromic (e.g. $1 + 2t + t^2$) or antipalindromic (e.g. $1 + t - t^2 - t^3$).
Theorem 2: Every polynomial can be written as the sum of palindromic and antipalindromic polynomials.
Consequence: Every polynomial can be represented by roots on the unit circle, that is, by angles.
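Theorem 2 is constructive. A minimal sketch, assuming the usual LSP construction in which the coefficient list is padded by one degree before being added to and subtracted from its own reversal; the function name and the example coefficients are illustrative, and root finding (which gives the actual LSP angles) is not shown.

```python
def palindromic_split(coeffs):
    """Split a polynomial into palindromic and antipalindromic parts.

    coeffs lists the coefficients in order of increasing power. The list is
    padded by one degree, then combined with its reversal (assumed LSP-style
    construction). The padded polynomial equals (p + q) / 2.
    """
    padded = list(coeffs) + [0.0]
    mirrored = padded[::-1]
    p = [x + y for x, y in zip(padded, mirrored)]   # palindromic part
    q = [x - y for x, y in zip(padded, mirrored)]   # antipalindromic part
    return p, q

p, q = palindromic_split([1.0, -0.9, 0.2])
rebuilt = [(x + y) / 2.0 for x, y in zip(p, q)]
```

Here p reads the same in both directions and q changes sign when reversed, while their average reproduces the original coefficients followed by the padding zero.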

Voice DSP - Part 2c: Echo cancellation

Acoustic echo

Line echo
[Figure labels: Telephone 1, hybrid, Telephone 2]

Echo suppressor
[Figure labels: 4w, switch, comp, inv, switch, 4w]
In practice need more: VOX, over-ride, reset, etc.

Why not echo suppression?
Echo suppression makes conversation half duplex
– Waste of full-duplex infrastructure
– Conversation unnatural
– Hard to break in
– Dead sounding line
It would be better to cancel the echo: subtract the echo signal, allowing the desired signal through,
but that requires DSP.
[Figure labels: far end, near end]

Echo cancellation?
Unfortunately, it's not so easy.
The outgoing signal is delayed, attenuated, and distorted.
Two echo canceller architectures:
– MODEM TYPE [Figure labels: far end, near end, echo path, clean]
– LINE ECHO CANCELLER (LEC) [Figure labels: far end, near end, clean, echo path]

LEC architecture
[Figure labels: hybrid, NLP, Y, doubletalk detector, filter H, adapt, far end, near end, A/D, X, D/A]

Adaptive algorithms
How do we
– find the echo cancelling filter?
– keep it correct even if the echo path parameters change?
Need an algorithm that continually changes the filter parameters.
All adaptive algorithms are based on the same ideas (lack of correlation between desired signal and interference).
Let's start with a simpler case - adaptive noise cancellation.

Noise cancellation
[Figure labels: x, n, h, hn, e, en, y]

Noise cancellation - cont.
Assume that the noise is distorted only by an unknown gain $h$.
We correct by transmitting $e\,n$, so that the audience hears
$y = x + h\,n - e\,n = x + (h-e)\,n$
The energy of this signal is
$E_y = \langle y^2 \rangle = \langle x^2 \rangle + (h-e)^2 \langle n^2 \rangle + 2(h-e) \langle x\,n \rangle$
Assume that $C_{xn} = \langle x\,n \rangle = 0$.
We need only set $e$ to minimize $E_y$! (turn the knob until minimal)
Even if the distortion is a complete filter $h$, we set the ANC filter $e$ to minimize $E_y$.
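The minimization can be carried out explicitly for the single-gain case. With the cross term removed by the assumption that x and n are uncorrelated, and assuming the noise energy is nonzero, setting the derivative with respect to e to zero gives:

```latex
\begin{align*}
E_y(e) &= \langle x^2 \rangle + (h-e)^2 \langle n^2 \rangle \\
\frac{dE_y}{de} &= -2\,(h-e)\,\langle n^2 \rangle = 0
  \quad\Longrightarrow\quad e = h \\
E_y(h) &= \langle x^2 \rangle
\end{align*}
```

So the energy minimum is reached exactly when the correction gain matches the unknown distortion, and what remains is the energy of the desired signal alone.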

The LMS algorithm
Gradient descent on energy:
the correction to $H$ is proportional to the error $\delta$ times the input $X$
$H \leftarrow H + \lambda\,\delta\,X$
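A minimal sketch of this update applied to echo cancellation, under simplifying assumptions: the echo path is a short FIR filter, there is no near-end speech or noise, and the step size is fixed. The names `lms_echo_canceller`, `echo_path`, and the parameter values are illustrative only.

```python
import random

def lms_echo_canceller(far, near, taps, step):
    """Adapt an FIR filter H so that H applied to far-end X predicts the echo."""
    h = [0.0] * taps       # adaptive filter H
    buf = [0.0] * taps     # most recent far-end samples X, newest first
    errors = []
    for x, d in zip(far, near):
        buf = [x] + buf[:-1]
        estimate = sum(hk * xk for hk, xk in zip(h, buf))   # predicted echo
        err = d - estimate                                  # error (delta)
        # H <- H + lambda * delta * X
        h = [hk + step * err * xk for hk, xk in zip(h, buf)]
        errors.append(err)
    return h, errors

# Simulated echo: the far-end signal passed through a 3-tap echo path.
random.seed(0)
echo_path = [0.5, -0.3, 0.1]
far = [random.uniform(-1.0, 1.0) for _ in range(5000)]
near = [sum(echo_path[k] * far[i - k] for k in range(3) if i - k >= 0)
        for i in range(len(far))]
h, errors = lms_echo_canceller(far, near, taps=3, step=0.05)
```

A larger step converges faster but becomes unstable beyond a limit set by the input power and filter length; practical cancellers often normalize the step by the input energy.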

Nonlinear processing
Because of finite numeric precision, the (linear) LEC filtering cannot completely remove the echo.
The standard LEC adds center clipping to remove residual echo.
The clipping threshold needs to be properly set by adaptation.

Doubletalk detection
Adaptation of $H$ should take place only when the far end speaks.
So we freeze adaptation when there is no far-end signal or during double-talk, that is, whenever the near end speaks.
The Geigel algorithm compares the absolute value of the near-end speech to half the maximum absolute value in the $X$ buffer.
If near-end exceeds far-end, we can assume only the near end is speaking.
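The Geigel comparison is a one-line test. A minimal sketch, assuming the far-end buffer holds the recent samples of X that the echo could stem from; the function name and the example values are illustrative, and the hangover time that practical detectors add is left out.

```python
def geigel_doubletalk(near_sample, far_buffer, threshold=0.5):
    """True when the near-end sample is too large to be echo alone."""
    far_peak = max(abs(x) for x in far_buffer)
    return abs(near_sample) > threshold * far_peak

far_buffer = [0.1, -0.8, 0.3]                        # recent far-end samples X
echo_only = geigel_doubletalk(0.2, far_buffer)       # 0.2 <= 0.4: keep adapting
near_talking = geigel_doubletalk(0.5, far_buffer)    # 0.5 > 0.4: freeze adaptation
```

The factor of one half reflects the assumption that the hybrid attenuates the echo by at least 6 dB.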