Speech Processing: Speech Coding
Veton Këpuska

Speech Coding
• Definition:
  ◦ Speech coding is a process that leads to the representation of analog waveforms with sequences of binary digits.
• Even though the availability of high-bandwidth communication channels has increased, speech coding for bit reduction has retained its importance. Reduced bit-rate transmission is required for:
  ◦ Cellular networks
  ◦ Voice over IP
• Coded speech:
  ◦ Is less sensitive than analog signals to transmission noise
  ◦ Is easier to:
    ▪ Protect against (bit) errors
    ▪ Encrypt
    ▪ Multiplex, and
    ▪ Packetize
• A typical scenario is depicted in the next slide (Figure 12.1).
10/7/2020 Veton Këpuska

[Figure 12.1: Digital Telephone Communication System]

Categorization of Speech Coders
• Waveform coders:
  ◦ Used to quantize speech samples directly; operate at high bit rates in the range of 16-64 kbps (bps: bits per second).
• Hybrid coders:
  ◦ Are partly waveform coders and partly speech model-based coders; operate in the mid bit-rate range of 2.4-16 kbps.
• Vocoders:
  ◦ Largely model-based; operate in a low bit-rate range of 1.2-4.8 kbps.
  ◦ Tend to be of lower quality than waveform and hybrid coders.

Quality Measurements
• Quality of coding is viewed as the closeness of the processed speech to the original speech or to some other desired speech waveform:
  ◦ Naturalness
  ◦ Degree of background artifacts
  ◦ Intelligibility
  ◦ Speaker identifiability
  ◦ Etc.

Quality Measurements
• Subjective measurement:
  ◦ The Diagnostic Rhyme Test (DRT) measures intelligibility.
  ◦ The Diagnostic Acceptability Measure and the Mean Opinion Score (MOS) test provide a more complete quality judgment.
• Objective measurement:
  ◦ Segmental Signal-to-Noise Ratio (SNR): average SNR over short-time segments.
  ◦ Articulation Index: relies on an average SNR across frequency bands.

Quality Measurements
• A more complete list and definition of subjective and objective measures can be found in:
  ◦ J. R. Deller, J. G. Proakis, and J. H. L. Hansen, "Discrete-Time Processing of Speech Signals", Macmillan Publishing Co., New York, NY, 1993.
  ◦ S. R. Quackenbush, T. P. Barnwell, and M. A. Clements, "Objective Measures of Speech Quality", Prentice Hall, Englewood Cliffs, NJ, 1988.

Statistical Models

Statistical Models
• The speech waveform is viewed as a random process.
• Various estimates are important from this statistical perspective:
  ◦ Probability density
  ◦ Mean, variance, and autocorrelation
• One approach to estimating the probability density function (pdf) of x[n] is through a histogram:
  ◦ Count the number of occurrences of the value of each speech sample in different ranges, for many speech samples over a long time duration.
  ◦ Normalize the area of the resulting curve to unity.
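The histogram procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the slides' figures: the function name `histogram_pdf`, the bin range, and the synthetic Laplacian-like test data (standing in for real speech samples) are all assumptions.

```python
import random

def histogram_pdf(samples, num_bins, lo, hi):
    """Estimate a pdf on [lo, hi): count samples per bin, then normalize
    so that the area under the histogram equals 1."""
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for s in samples:
        if lo <= s < hi:
            # min() guards against floating-point rounding at the upper edge
            counts[min(int((s - lo) / width), num_bins - 1)] += 1
    total = sum(counts)
    # density = (fraction of samples in the bin) / (bin width)
    return [c / (total * width) for c in counts], width

random.seed(0)
# Assumed stand-in for speech: the difference of two exponential variates
# is Laplacian distributed.
x = [random.expovariate(1.0) - random.expovariate(1.0) for _ in range(50000)]
pdf, width = histogram_pdf(x, num_bins=40, lo=-4.0, hi=4.0)
area = sum(p * width for p in pdf)
```

With real speech, `x` would simply be replaced by a long record of samples; the normalization step is unchanged.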

Statistical Models
• The histogram of speech (Davenport, Paez & Glisson) was shown to approximate a gamma density:
  $p_x(x) = \left( \frac{\sqrt{3}}{8\pi\sigma_x |x|} \right)^{1/2} e^{-\frac{\sqrt{3}|x|}{2\sigma_x}}$
  where $\sigma_x$ is the standard deviation of the pdf.
• A simpler approximation is given by the Laplacian pdf of the form:
  $p_x(x) = \frac{1}{\sqrt{2}\sigma_x} e^{-\frac{\sqrt{2}|x|}{\sigma_x}}$

[Figure: PDF of Speech]

[Figure: PDF of Modeled Speech]

PDF of Speech
• Knowing the pdf of speech is essential in order to optimize the quantization process of analog (continuous-valued) samples.

Quantization of Speech & Audio Signals
• Scalar Quantization
• Vector Quantization

Time Quantization (Sampling) of Analog Signals
[Figure: x(t) → Analog Low-pass Filter → Sample and Hold → Analog-to-Digital Converter → DSP]
Analog-to-Digital Conversion. a) Continuous signal x(t). b) Sampled signal with sampling period T satisfying the Nyquist rate as specified by the Sampling Theorem. c) Digital sequence x[n] obtained after sampling and quantization.

Example
• Assume that the input continuous-time signal is a pure periodic signal represented by the following expression:
  $x(t) = A\cos(\Omega_0 t + \phi) = A\cos(2\pi f_0 t + \phi)$
  where $A$ is the amplitude of the signal, $\Omega_0$ is the angular frequency in radians per second (rad/sec), $\phi$ is the phase in radians, and $f_0$ is the frequency in cycles per second, measured in Hertz (Hz).
• Assuming that the continuous-time signal x(t) is sampled every T seconds, or alternatively with the sampling rate $f_s = 1/T$, the discrete-time signal x[n] obtained by setting $t = nT$ will be:
  $x[n] = A\cos(\Omega_0 nT + \phi) = A\cos(2\pi f_0 nT + \phi)$

Example (cont.)
• An alternative representation of x[n]:
  $x[n] = A\cos\left(2\pi \frac{f_0}{f_s} n + \phi\right) = A\cos(2\pi F_0 n + \phi) = A\cos(\omega_0 n + \phi)$
  reveals additional properties of the discrete-time signal.
• $F_0 = f_0/f_s$ defines the normalized frequency, and the digital frequency $\omega_0$ is defined in terms of the normalized frequency:
  $\omega_0 = 2\pi F_0 = \Omega_0 T$
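A short Python sketch of the sampling relation and the normalized and digital frequencies follows. It assumes a cosine form for the sinusoid; the function names and the numerical values (a 1 kHz tone sampled at 8 kHz) are illustrative assumptions, not values from the slides.

```python
import math

def sample_sinusoid(A, f0, phi, fs, num_samples):
    """Samples x[n] = A*cos(2*pi*f0*n*T + phi), with T = 1/fs."""
    T = 1.0 / fs
    return [A * math.cos(2 * math.pi * f0 * n * T + phi)
            for n in range(num_samples)]

def digital_frequency(f0, fs):
    """Digital frequency (rad/sample) from the normalized frequency F0 = f0/fs."""
    F0 = f0 / fs
    return 2 * math.pi * F0

x = sample_sinusoid(A=1.0, f0=1000.0, phi=0.0, fs=8000.0, num_samples=16)
w0 = digital_frequency(1000.0, 8000.0)  # F0 = 1/8, so w0 = pi/4 rad/sample
```

Because F0 = 1/8 here, the sampled sequence repeats every 8 samples, which is one of the "additional properties" the normalized form makes visible.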

Reconstruction of Digital Signals
[Figure: DSP → y[n] → Digital-to-Analog Converter → ya(nT) → Analog Low-pass Filter → y(t)]
Digital-to-Analog Conversion. a) Processed digital signal y[n]. b) Continuous signal representation ya(nT). c) Low-pass filtered continuous signal y(t).

Scalar Quantization

Conceptual Representation of ADC
[Figure: x(t) → C/D → Quantizer → Coder]
• This conceptual abstraction allows us to assume that the sequence $x[n]$ is obtained with infinite precision. Those values of $x[n]$ are scalar quantized to a set of finite-precision amplitudes, denoted here by $\hat{x}[n]$. Furthermore, quantization allows this finite-precision set of amplitudes to be represented by a corresponding set of (bit) patterns or symbols, $c[n]$.
• Without loss of generality, it can be assumed that input signals cover a finite range of values defined by the minimal and maximal values $x_{min}$ and $x_{max}$, respectively. This assumption in turn implies that the set of symbols representing $\hat{x}[n]$ is finite.
• The process of mapping a finite set of values to a finite set of symbols is known as encoding; it is performed by the coder, as in the figure above. Thus one can view quantization and coding as a mapping of an infinite-precision value of $x[n]$ to a finite-precision representation picked from a finite set of symbols.

Scalar Quantization
• Quantization, therefore, is a mapping of a value $x[n]$, $x_{min} \le x \le x_{max}$, to $\hat{x}[n]$. The quantizer operator, denoted by $Q(x)$, is defined by:
  $\hat{x}[n] = Q(x[n]) = \hat{x}_i$, for $x_{i-1} < x[n] \le x_i$
  where $\hat{x}_i$ denotes one of $L$ possible quantization levels, $1 \le i \le L$, and $x_i$ represents one of $L+1$ decision levels, $0 \le i \le L$.
• The above expression is interpreted as follows: if $x_{i-1} < x[n] \le x_i$, then $x[n]$ is quantized to the quantization level $\hat{x}_i$, and $\hat{x}[n]$ is considered the quantized sample of $x[n]$. Clearly, from the limited range of input values and the finite number of symbols, it follows that quantization is characterized by its quantization step size $\Delta_i$, defined by the difference of two consecutive decision levels:
  $\Delta_i = x_i - x_{i-1}$

Scalar Quantization
• Assume that a sequence x[n] was obtained from a speech waveform that has been lowpass-filtered and sampled at a suitable rate with infinite amplitude precision.
  ◦ The x[n] samples are quantized to a finite set of amplitudes denoted by $\hat{x}[n]$.
  ◦ Associated with the quantizer is a quantization step size $\Delta$.
  ◦ Quantization allows the amplitudes to be represented by a finite set of bit patterns (symbols).
  ◦ Encoding:
    ▪ Mapping of $\hat{x}[n]$ to a finite set of symbols.
    ▪ This mapping yields a sequence of codewords denoted by c[n] (Figure 12.3a in the next slide).
  ◦ Decoding: the inverse process, whereby the transmitted sequence of codewords c'[n] is transformed back to a sequence of quantized samples (Figure 12.3b in the next slide).

[Figure 12.3: Scalar Quantization. a) Coder. b) Decoder.]

Fundamentals
• Assume a signal amplitude is quantized into L levels.
• The quantizer operator is denoted by Q(x); thus:
  $\hat{x}[n] = Q(x[n]) = \hat{x}_i$, for $x_{i-1} < x[n] \le x_i$
• Here $\hat{x}_i$ denotes the L possible reconstruction levels (quantization levels), $1 \le i \le L$, and $x_i$ denotes the L+1 possible decision levels, $0 \le i \le L$.
• If $x_{i-1} < x[n] \le x_i$, then x[n] is quantized to the reconstruction level $\hat{x}_i$, and $\hat{x}[n]$ is the quantized sample of x[n].

Fundamentals
• Scalar quantization example:
  ◦ Assume there are L = 4 reconstruction levels.
  ◦ The amplitude of the input signal x[n] falls in the range [0, 1].
  ◦ Decision levels and reconstruction levels are equally spaced:
    ▪ Decision levels are [0, 1/4, 1/2, 3/4, 1]
    ▪ Reconstruction levels are assumed to be [1/8, 3/8, 5/8, 7/8]
  ◦ See Figure 12.4 in the next slide.
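The 2-bit example can be written out as a small Python sketch. It uses the five decision levels from the slide and the four cell midpoints 1/8, 3/8, 5/8, 7/8 as reconstruction levels. The assignment of codewords in natural binary order and the handling of the boundary value x = 0 are assumptions, since Figure 12.4 is not reproduced here.

```python
import bisect

DECISION = [0.0, 0.25, 0.5, 0.75, 1.0]   # x_0 .. x_4 (L + 1 = 5 decision levels)
RECON = [0.125, 0.375, 0.625, 0.875]     # midpoints of the L = 4 cells
CODEWORDS = ["00", "01", "10", "11"]     # assumed natural binary ordering

def quantize(x):
    """Return (cell index, codeword, reconstruction level) for x in [0, 1]."""
    if not DECISION[0] <= x <= DECISION[-1]:
        raise ValueError("input outside the quantizer range")
    # Cell i (0-based) covers x_i < x <= x_{i+1}; x = 0 is assigned to cell 0.
    i = max(bisect.bisect_left(DECISION, x) - 1, 0)
    return i, CODEWORDS[i], RECON[i]

index, codeword, level = quantize(0.3)   # falls in the cell (1/4, 1/2]
```

Every input in [0, 1] is reproduced to within half a step, here 1/8.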

[Figure 12.4: Example of Uniform 2-bit Quantizer]

Example
• Assume there are $L = 2^4 = 16$ reconstruction levels, that input values fall within the range $[x_{min} = -1, x_{max} = 1]$, and that each value in this range is equally likely. Decision levels and reconstruction levels are equally spaced: $\Delta_i = \Delta = (x_{max} - x_{min})/L = 1/8$.
• Decision levels: $x_i = x_{min} + i\Delta$, $i = 0, \ldots, L$
• Reconstruction levels: $\hat{x}_i = x_{min} + (i - 1/2)\Delta$, $i = 1, \ldots, L$

[Figure: 16-Level (4-bit) Quantization Example]

Uniform Quantizer
• A uniform quantizer is one whose decision and reconstruction levels are uniformly spaced. Specifically:
  $x_i - x_{i-1} = \Delta$, $1 \le i \le L$
  $\hat{x}_i = (x_i + x_{i-1})/2$, $1 \le i \le L$
• $\Delta$ is the step size, equal to the spacing between two consecutive decision levels, which is the same as the spacing between two consecutive reconstruction levels.
• Each reconstruction level is assigned a symbol: the codeword. Binary numbers are typically used to represent the quantized samples (as in Figure 12.4).

Uniform Quantizer
• Codebook: the collection of codewords.
• In general, with a B-bit binary codebook there are $2^B$ different quantization (or reconstruction) levels.
• Bit rate is defined as the number of bits B per sample multiplied by the sample rate $f_s$: $I = B f_s$
• The decoder inverts the coder operation, taking the codeword back to a quantized amplitude value (e.g., the codeword 01 back to its reconstruction level).
• Often the goal of speech coding/decoding is to keep the bit rate as low as possible while maintaining a required level of quality.
• Because the sampling rate is fixed for most applications, this goal implies that the bit rate be reduced by decreasing the number of bits per sample.

Uniform Quantizer
• Designing a uniform scalar quantizer requires knowledge of the maximum value of the sequence.
• Typically the range of the speech signal is expressed in terms of the standard deviation of the signal.
  ◦ Specifically, it is often assumed that $-4\sigma_x \le x[n] \le 4\sigma_x$, where $\sigma_x$ is the signal's standard deviation.
  ◦ Under the assumption that speech samples obey a Laplacian pdf, approximately 0.35% of speech samples fall outside the range $-4\sigma_x \le x[n] \le 4\sigma_x$.
  ◦ Assume a B-bit binary codebook ⇒ $2^B$ quantization levels.
  ◦ Maximum signal value: $x_{max} = 4\sigma_x$.
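The 0.35% figure can be checked directly. The short derivation below assumes the standard zero-mean Laplacian density with standard deviation $\sigma_x$; it is a verification sketch rather than part of the original slides.

```latex
p_x(x) = \frac{1}{\sqrt{2}\,\sigma_x}\, e^{-\sqrt{2}\,|x|/\sigma_x}
\quad\Longrightarrow\quad
P\big(|x| > 4\sigma_x\big)
  = 2\int_{4\sigma_x}^{\infty} \frac{1}{\sqrt{2}\,\sigma_x}\, e^{-\sqrt{2}\,x/\sigma_x}\, dx
  = e^{-4\sqrt{2}} \approx 0.0035
```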

Uniform Quantizer
• For the uniform quantization step size $\Delta$ we get:
  $2x_{max} = \Delta\, 2^B \Rightarrow \Delta = \frac{2x_{max}}{2^B}$
• The quantization step size relates directly to the notion of quantization noise.

Quantization Noise
• Two classes of quantization noise:
  ◦ Granular distortion
  ◦ Overload distortion
• Granular distortion:
  ◦ $\hat{x}[n] = x[n] + e[n]$, where x[n] is the unquantized signal and e[n] is the quantization noise.
  ◦ For a given step size $\Delta$, the magnitude of the quantization noise e[n] can be no greater than $\Delta/2$, that is: $-\Delta/2 \le e[n] \le \Delta/2$
  ◦ Figure 12.5 depicts this property, where: $e[n] = \hat{x}[n] - x[n]$

[Figure 12.5: Quantization Noise]

Example
• The periodic sine-wave signal is quantized with 3-bit and 8-bit quantizers.
• The MATLAB fix function is used to simulate quantization. The following figure depicts the result of the analysis.

$L = 2^3 = 8$ and $L = 2^8 = 256$ Level Quantization
• Plot a) represents the sequence x[n] with infinite precision, b) represents the quantized version $\hat{x}[n]$ for L = 8, c) represents the quantization error e[n] for B = 3 bits (L = 8 quantization levels), and d) is the quantization error for B = 8 bits (L = 256 quantization levels).
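The experiment can be approximated in Python. How the slides applied MATLAB's fix is not shown, so this is a sketch under assumptions: `quantize_trunc` mimics fix (rounding toward zero), `quantize_round` is the usual round-to-nearest quantizer, the test signal is a sine of amplitude 0.99 with $x_{max} = 1$, and no clipping is modeled. Truncation gives an error magnitude below $\Delta$, whereas rounding gives the $\Delta/2$ bound stated for granular noise.

```python
import math

def quantize_trunc(x, B, xmax=1.0):
    """Uniform quantization by truncation toward zero (like MATLAB fix)."""
    delta = 2 * xmax / 2 ** B
    return delta * math.trunc(x / delta)

def quantize_round(x, B, xmax=1.0):
    """Uniform quantization by rounding to the nearest multiple of delta."""
    delta = 2 * xmax / 2 ** B
    return delta * math.floor(x / delta + 0.5)

N = 200
x = [0.99 * math.sin(2 * math.pi * n / N) for n in range(N)]

max_err = {}
for B in (3, 8):
    delta = 2.0 / 2 ** B
    e_trunc = max(abs(quantize_trunc(v, B) - v) for v in x)
    e_round = max(abs(quantize_round(v, B) - v) for v in x)
    max_err[B] = (delta, e_trunc, e_round)  # (step size, max |e| trunc, max |e| round)
```

Going from B = 3 to B = 8 divides the step size by 32, and the maximum error shrinks accordingly.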

Quantization Noise
• Overload distortion:
  ◦ Maximum-value constant:
    ▪ $x_{max} = 4\sigma_x$ ($-4\sigma_x \le x[n] \le 4\sigma_x$)
    ▪ For a Laplacian pdf, 0.35% of the speech samples fall outside the range of the quantizer.
  ◦ Clipped samples incur a quantization error in excess of $\pm\Delta/2$.
  ◦ Due to the small number of clipped samples, it is common to neglect the infrequent large errors in theoretical calculations.

Quantization Noise
• Statistical model of quantization noise:
  ◦ This is the desired approach to analyzing the quantization error in numerous applications.
  ◦ The quantization error is considered an ergodic white-noise random process.
  ◦ The autocorrelation function of such a process is expressed as:
    $r_e[m] = E(e[n]\,e[n+m]) = \sigma_e^2$ for $m = 0$, and $0$ for $m \ne 0$

Quantization Error
• The previous expression states that the process is uncorrelated.
• Furthermore, it is also assumed that the quantization noise and the input signal are uncorrelated, i.e.:
  ◦ $E(x[n]\,e[n+m]) = 0$ for all $m$.
• The final assumption is that the pdf of the quantization noise is uniform over the quantization interval:
  $p_e(e) = 1/\Delta$ for $-\Delta/2 \le e \le \Delta/2$, and $0$ otherwise

Quantization Error
• The stated assumptions are not always valid.
  ◦ Consider a slowly varying (e.g., linearly varying) signal: e[n] then also changes linearly and is signal dependent (see the figure in the next slide).
  ◦ Correlated quantization noise can be annoying.
• When the quantization step is small, the assumptions that the noise is uncorrelated with itself and with the signal are roughly valid, provided the signal fluctuates rapidly among all quantization levels. The quantization error then approaches a white-noise process with an impulsive autocorrelation and a flat spectrum.
• One can force e[n] to be white noise and uncorrelated with x[n] by adding white noise to x[n] prior to quantization.

Example of Quantization Error due to Correlation
• Example of a slowly varying signal that causes the quantization error to be correlated.
  a) Plot represents the sequence x[n] with infinite precision,
  b) represents the quantized version $\hat{x}[n]$,
  c) represents the quantization error e[n] for B = 3 bits (L = 8 quantization levels), and
  d) is the quantization error for B = 8 bits (L = 256 quantization levels).
• Note the reduction in correlation level with an increase in the number of quantization levels, which implies a decrease of the step size $\Delta$.

Quantization Error
• The process of adding white noise is known as dithering.
• This decorrelation technique was shown to be useful in improving the perceptual quality of the quantization noise, not only for speech but also for image signals.
• Signal-to-Noise Ratio (SNR):
  ◦ A measure that quantifies the severity of the quantization noise.
  ◦ Relates the strength of the signal to the strength of the quantization noise.
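A minimal Python sketch of the dithering idea follows. The choices here are assumptions for illustration: the dither is uniform over one quantization step, the quantizer rounds to the nearest level, and the input is held constant to represent an extremely slowly varying signal. Without dither the error is the same on every sample (fully signal dependent); with dither the errors vary randomly and average out.

```python
import math
import random

def quantize_round(x, delta):
    """Round x to the nearest multiple of the step size delta."""
    return delta * math.floor(x / delta + 0.5)

random.seed(1)
delta = 0.25
x_const = 0.3 * delta   # constant input lying between two quantization levels
N = 20000

# No dither: every sample quantizes to the same level, so the error is constant.
plain = [quantize_round(x_const, delta) for _ in range(N)]

# Dither: add white noise, uniform over one step (assumed), before quantizing.
dithered = [quantize_round(x_const + random.uniform(-delta / 2, delta / 2), delta)
            for _ in range(N)]

mean_plain = sum(plain) / N        # stays at the nearest level, not at x_const
mean_dithered = sum(dithered) / N  # close to x_const
```

The price of dithering is a somewhat larger noise power, traded for noise that no longer tracks the signal.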

Quantization Error
• SNR is defined as:
  $SNR = \frac{\sigma_x^2}{\sigma_e^2} = \frac{E(x^2[n])}{E(e^2[n])}$
• Given the assumptions of
  ◦ Quantizer range: $2x_{max}$,
  ◦ Quantization interval: $\Delta = 2x_{max}/2^B$, for a B-bit quantizer, and
  ◦ Uniform pdf,
  it can be shown that:
  $\sigma_e^2 = \frac{\Delta^2}{12} = \frac{(2x_{max}/2^B)^2}{12} = \frac{x_{max}^2}{3 \cdot 2^{2B}}$

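Quantization Error
• Thus SNR can be expressed as:
  $SNR = \frac{\sigma_x^2}{\sigma_e^2} = \frac{3 \cdot 2^{2B}}{(x_{max}/\sigma_x)^2}$
• Or in decibels (dB) as:
  $SNR(dB) = 10\log_{10}\frac{\sigma_x^2}{\sigma_e^2} = 10\log_{10}3 + 20B\log_{10}2 - 20\log_{10}\frac{x_{max}}{\sigma_x} \approx 6B + 4.77 - 20\log_{10}\frac{x_{max}}{\sigma_x}$
• Because $x_{max} = 4\sigma_x$, SNR(dB) ≈ 6B − 7.2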

Quantization Error
• The presented quantization scheme is called pulse code modulation (PCM).
• B bits per sample are transmitted as a codeword.
• Advantages of this scheme:
  ◦ It is instantaneous (no coding delay).
  ◦ It is independent of the signal content (voice, music, etc.).
• Disadvantages:
  ◦ It requires a minimum of 11 bits per sample to achieve "toll quality" (equivalent to typical telephone quality).
  ◦ For a 10,000 Hz sampling rate, the required bit rate is: I = (11 bits/sample) × (10,000 samples/sec) = 110,000 bps = 110 kbps
  ◦ For a CD-quality signal with a sample rate of 20,000 Hz and 16 bits/sample, SNR(dB) = 96 − 7.2 = 88.8 dB and the bit rate is 320 kbps.
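The arithmetic on this slide is easy to reproduce. The sketch below assumes the uniform-noise model (noise variance $\Delta^2/12$ with $\Delta = 2x_{max}/2^B$), for which the unrounded SNR is $10\log_{10}3 + 20B\log_{10}2 - 20\log_{10}(x_{max}/\sigma_x)$; the function names are illustrative.

```python
import math

def bit_rate(bits_per_sample, fs):
    """Bit rate in bits per second: bits per sample times sampling rate."""
    return bits_per_sample * fs

def snr_db(B, xmax_over_sigma=4.0):
    """SNR in dB of a B-bit uniform quantizer under the uniform-noise model."""
    return (10 * math.log10(3)
            + 20 * B * math.log10(2)
            - 20 * math.log10(xmax_over_sigma))

toll_rate = bit_rate(11, 10_000)   # 11 bits/sample at 10 kHz
cd_rate = bit_rate(16, 20_000)     # 16 bits/sample at 20 kHz
cd_snr = snr_db(16)                # compare with the rounded rule 6B - 7.2
```

The rounded rule 6B − 7.2 differs from the unrounded expression by a few tenths of a dB, because 20 log10(2) is about 6.02 rather than exactly 6.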

Nonuniform Quantization
• Uniform quantization may not be optimal (the SNR may not be as large as possible for a given number of decision and reconstruction levels).
• Consider, for example, a speech signal for which x[n] is much more likely to be in one particular region than in others (low values occurring much more often than high values).
• This implies that decision and reconstruction levels are not being utilized effectively with uniform intervals over $\pm x_{max}$.
• A nonuniform quantizer that is optimal (in a least-squared-error sense) for a particular pdf is referred to as the Max quantizer.
• An example of a nonuniform quantizer is given in the figure in the next slide.

[Figure: Nonuniform Quantization]

Nonuniform Quantization
• Max quantizer
  ◦ Problem definition: for a random variable x with a known pdf, find the set of M quantizer levels that minimizes the quantization error.
  ◦ That is, find the decision and reconstruction levels $x_i$ and $\hat{x}_i$, respectively, that minimize the mean-squared error (MSE) distortion measure:
    $D = E[(x - \hat{x})^2]$
  ◦ $E$ denotes the expected value and $\hat{x}$ is the quantized version of x.
  ◦ It turns out that the optimal decision level $x_k$ is given by:
    $x_k = \frac{\hat{x}_{k+1} + \hat{x}_k}{2}$, $1 \le k \le M-1$

Nonuniform Quantization
• Max quantizer (cont.)
  ◦ The optimal reconstruction level $\hat{x}_k$ is the centroid of $p_x(x)$ over the interval $x_{k-1} \le x \le x_k$:
    $\hat{x}_k = \int_{x_{k-1}}^{x_k} \left[ \frac{p_x(x)}{\int_{x_{k-1}}^{x_k} p_x(s)\,ds} \right] x\,dx = \int_{x_{k-1}}^{x_k} \tilde{p}_x(x)\,x\,dx$
  ◦ It is interpreted as the mean value of x over the interval $x_{k-1} \le x \le x_k$ for the normalized pdf $\tilde{p}_x(x)$.
  ◦ Solving the last two equations for $x_k$ and $\hat{x}_k$ is a nonlinear problem in these two variables.
    ▪ An iterative solution is used, which requires obtaining the pdf (this can be difficult).
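The iterative solution can be sketched in Python by alternating the two optimality conditions: decision levels at the midpoints between reconstruction levels, and reconstruction levels at the centroids of their cells. This is a sample-based variant, in which centroids are sample means rather than integrals over a known pdf; that substitution, the Laplacian-like synthetic data, the starting levels, and the function names are all assumptions made for illustration.

```python
import random

def mse(samples, levels):
    """Mean-squared error when each sample maps to its nearest level."""
    return sum(min((s - r) ** 2 for r in levels) for s in samples) / len(samples)

def lloyd_max(samples, levels, iterations=30):
    """Alternate the midpoint (decision) and centroid (reconstruction) updates."""
    levels = sorted(levels)
    for _ in range(iterations):
        # Decision levels: midpoints between adjacent reconstruction levels.
        bounds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        cells = [[] for _ in levels]
        for s in samples:
            k = sum(1 for b in bounds if s > b)   # index of the cell containing s
            cells[k].append(s)
        # Reconstruction levels: centroid (sample mean) of each cell;
        # an empty cell keeps its previous level.
        levels = [sum(c) / len(c) if c else r for c, r in zip(cells, levels)]
    return levels

random.seed(2)
# Assumed stand-in for speech: Laplacian samples with variance 2.
x = [random.expovariate(1.0) - random.expovariate(1.0) for _ in range(5000)]
sigma = 2 ** 0.5
uniform = [-3 * sigma, -sigma, sigma, 3 * sigma]   # 2-bit uniform levels over +/- 4 sigma
optimized = lloyd_max(x, uniform)
```

Each iteration can only lower (or keep) the MSE, so the optimized levels are never worse on the training samples than the uniform starting point.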

[Figure: Nonuniform Quantization]

Companding: A fixed non-uniform quantizer

Companding
• An alternative to the nonuniform quantizer is companding.
• It is based on the fact that a uniform quantizer is optimal for a uniform pdf.
  ◦ Thus, if a nonlinearity is applied to the waveform x[n] to form a new sequence g[n] whose pdf is uniform, then
  ◦ a uniform quantizer can be applied to g[n] to obtain $\hat{g}[n]$, as depicted in Figure 12.10 in the next slide.

[Figure 12.10: Companding]

Companding
• A number of nonlinear approximations to the nonlinear transformation that achieves uniform density are used in practice; they do not require pdf measurement.
• Specifically: A-law and μ-law companding.
• μ-law coding is given by:
  $T(x[n]) = x_{max} \frac{\log\left(1 + \mu \frac{|x[n]|}{x_{max}}\right)}{\log(1 + \mu)} \mathrm{sign}(x[n])$
• The CCITT international standard coder at 64 kbps is an example application of μ-law coding.
  ◦ μ-law transformation followed by 7-bit uniform quantization gives toll-quality speech.
  ◦ Equivalent quality with straight uniform quantization is achieved with 11 bits.
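A minimal Python sketch of μ-law compression and its inverse follows, using the commonly cited form $T(x) = x_{max} \log(1 + \mu|x|/x_{max}) / \log(1 + \mu) \cdot \mathrm{sign}(x)$. The value μ = 255 and the function names are assumptions; the slide does not state the value of μ.

```python
import math

def mu_law_compress(x, xmax=1.0, mu=255.0):
    """Compress x with the mu-law characteristic; the output stays in [-xmax, xmax]."""
    y = xmax * math.log(1 + mu * abs(x) / xmax) / math.log(1 + mu)
    return math.copysign(y, x)

def mu_law_expand(y, xmax=1.0, mu=255.0):
    """Inverse of mu_law_compress."""
    x = (xmax / mu) * ((1 + mu) ** (abs(y) / xmax) - 1)
    return math.copysign(x, y)

small = mu_law_compress(0.01)                      # low amplitudes are boosted
restored = mu_law_expand(mu_law_compress(0.01))    # round trip recovers the input
```

Because low amplitudes are expanded before uniform quantization, the effective step size is finer where speech samples are most likely to fall.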

Adaptive Coding

Adaptive Coding
• Nonuniform quantizers are optimal for a long-term pdf of the speech signal.
• However, considering that speech is a highly time-varying signal, one has to question whether a single pdf derived from a long-time speech waveform is a reasonable assumption.
• Changes in the speech waveform:
  ◦ Temporal and spectral variations due to transitions from unvoiced to voiced speech,
  ◦ Rapid volume changes.
• Approach:
  ◦ Estimate a short-time pdf derived over 20-40 msec intervals.
  ◦ Short-time pdf estimates are more accurately described by a Gaussian pdf, regardless of the speech class.

Adaptive Coding
• A pdf derived from a short-time speech segment more accurately represents the speech nonstationarity.
• One approach is to assume a pdf of a specific shape, in particular a Gaussian with unknown variance $\sigma^2$.
  ◦ Measure the local variance, then adapt a nonuniform quantizer to the resulting local pdf.
  ◦ This approach is referred to as adaptive quantization.
• For a Gaussian we have:
  $p_x(x) = \frac{1}{\sqrt{2\pi\sigma_x^2}} e^{-\frac{x^2}{2\sigma_x^2}}$

Adaptive Coding
• Measure the variance $\sigma_x^2$ of a sequence x[n] and use the resulting pdf to design an optimal Max quantizer.
• Note that a change in the variance simply scales the time signal:
  ◦ If $E(x^2[n]) = \sigma_x^2$, then $E[(\beta x[n])^2] = \beta^2 \sigma_x^2$
• Two equivalent options follow:
  1. Design only one nonuniform quantizer with unity variance, and scale its decision and reconstruction levels according to a particular variance.
  2. Fix the quantizer and apply a time-varying gain to the signal according to the estimated variance (scale the signal to match the quantizer).

[Figure: Adaptive Coding]

Adaptive Coding
• There are two possible approaches for estimation of a time-varying variance $\sigma^2[n]$:
  ◦ Feed-forward method (shown in Figure 12.11), where the variance (or gain) estimate is obtained from the input.
  ◦ Feedback method, where the estimate is obtained from the quantizer output.
    ▪ Advantage: no need to transmit extra side information (quantized variance).
    ▪ Disadvantage: additional sensitivity to transmission errors in codewords.
• Adaptive quantizers can achieve higher SNR than μ-law companding. However, μ-law companding is generally preferred for high-rate waveform coding because of its lower background noise when the transmission channel is idle.
• Adaptive quantization is useful in a variety of other coding schemes.
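The feed-forward idea can be sketched in Python as follows. Several simplifications are assumptions made for brevity: the quantizer is uniform (mid-rise) rather than a Max quantizer, the variance is estimated per non-overlapping block as the mean square of a zero-mean signal, and the side information (the per-block variance) is neither quantized nor transmitted.

```python
import math

def adaptive_quantize(x, B=4, block=16, loading=4.0):
    """Feed-forward adaptive uniform quantizer: per block, estimate sigma from
    the input and set the step size so the range spans +/- loading * sigma."""
    out = []
    half = 2 ** (B - 1)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        sigma = math.sqrt(sum(v * v for v in seg) / len(seg))
        if sigma == 0.0:
            out.extend(0.0 for _ in seg)   # silent block
            continue
        delta = 2 * loading * sigma / 2 ** B
        for v in seg:
            # Mid-rise cell index in [-half, half - 1]; clipping models overload.
            k = min(max(math.floor(v / delta), -half), half - 1)
            out.append((k + 0.5) * delta)
    return out

# Quiet passage followed by a loud passage (synthetic test signal).
x = [math.sin(0.3 * n) * (0.01 if n < 32 else 1.0) for n in range(64)]
y = adaptive_quantize(x)
y_scaled = adaptive_quantize([4 * v for v in x])
quiet_err = max(abs(a - b) for a, b in zip(x[:32], y[:32]))
```

Scaling the input by a gain scales the output by the same gain, which is the property that lets a single unit-variance quantizer serve all signal levels; the quiet passage is quantized with a proportionally smaller step size.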

Differential and Residual Quantization
• The presented methods are examples of instantaneous quantization.
• Those approaches do not take advantage of the fact that speech is a highly correlated signal:
  ◦ Short-time (10-15 samples), as well as
  ◦ Long-time (over a pitch period).
• In this section, methods that exploit short-time correlation will be investigated.

Differential and Residual Quantization
• Short-time correlation:
  ◦ Neighboring samples are "self-similar", that is, they do not change too rapidly from one to the next.
  ◦ The difference of adjacent samples should have a lower variance than the signal itself.
  ◦ This difference, thus, would make more effective use of the quantization levels:
    ▪ Higher SNR for a fixed number of quantization levels.
  ◦ Predicting the next sample from previous ones (finding the best prediction coefficients to yield a minimum mean-squared prediction error; same methodology as in Linear Prediction Coefficients, LPC). Two approaches:
    1. Have a fixed prediction filter to reflect the average local correlation of the signal.
    2. Allow the predictor to adapt short-time to the signal's local correlation.
       ▪ Requires transmission of quantized prediction coefficients as well as the prediction error.

Differential and Residual Quantization
• An illustration of a particular error encoding scheme is presented in Figure 12.12 of the next slide.
• In this scheme the following sequences are required:
  ◦ $\tilde{x}[n]$: prediction of the input sample x[n]. This is the output of the predictor P(z), whose input is a quantized version of the input signal x[n], i.e., $\hat{x}[n]$.
  ◦ $r[n]$: prediction error signal (residual).
  ◦ $\hat{r}[n]$: quantized prediction error signal.
• This approach is sometimes referred to as residual coding.

[Figure 12.12: Differential and Residual Quantization]

Differential and Residual Quantization
• The quantizer in the previous scheme can be of any type:
  ◦ Fixed
  ◦ Adaptive
  ◦ Uniform
  ◦ Nonuniform
• Whatever the case, the parameters of the quantizer are determined so as to match the variance of r[n].
• Differential quantization can also be applied to:
  ◦ The speech signal
  ◦ Parameters that represent speech:
    ▪ LPC: linear prediction coefficients
    ▪ Cepstral coefficients obtained from homomorphic filtering
    ▪ Sinewave parameters, etc.

Differential and Residual Quantization
• Consider the quantization error of the quantized residual:
  $\hat{r}[n] = Q(r[n]) = r[n] + e_r[n]$
• From Figure 12.12 we express the quantized input $\hat{x}[n]$ as:
  $\hat{x}[n] = \tilde{x}[n] + \hat{r}[n] = \tilde{x}[n] + r[n] + e_r[n] = x[n] + e_r[n]$

Differential and Residual Quantization
• Quantized signal samples differ from the input only by the quantization error $e_r[n]$.
• Since $e_r[n]$ is the quantization error of the residual:
  ⇒ If the prediction of the signal is accurate, then the variance of r[n] will be smaller than the variance of x[n].
  ⇒ A quantizer with a given number of levels can be adjusted to give a smaller quantization error than would be possible when quantizing the signal directly.
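The property that the reconstructed samples differ from the input only by the residual's quantization error can be demonstrated with a small Python sketch of differential coding. The first-order fixed predictor with coefficient a = 0.9, the unbounded uniform quantizer (no overload), and the function names are assumptions chosen for illustration.

```python
import math

def dpcm_encode(x, delta, a=0.9):
    """Predict from the previous reconstructed sample, then quantize the
    prediction residual. Returns (quantized residuals, reconstructed samples)."""
    x_hat_prev = 0.0
    residual_q, x_hat = [], []
    for v in x:
        pred = a * x_hat_prev            # prediction from past quantized output
        r = v - pred                     # prediction residual
        r_q = delta * round(r / delta)   # uniform quantization of the residual
        x_hat_prev = pred + r_q          # reconstruction: prediction + quantized residual
        residual_q.append(r_q)
        x_hat.append(x_hat_prev)
    return residual_q, x_hat

def dpcm_decode(residual_q, a=0.9):
    """The decoder repeats the same prediction loop using only the residuals."""
    x_hat_prev, out = 0.0, []
    for r_q in residual_q:
        x_hat_prev = a * x_hat_prev + r_q
        out.append(x_hat_prev)
    return out

delta = 0.05
x = [math.sin(0.1 * n) for n in range(200)]
residual_q, x_hat = dpcm_encode(x, delta)
decoded = dpcm_decode(residual_q)
max_err = max(abs(a - b) for a, b in zip(x, x_hat))
```

Because the predictor runs on the quantized output in both coder and decoder, the two stay in step, and the reconstruction error never exceeds half a step of the residual quantizer.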

Differential and Residual Quantization
• The differential coder of Figure 12.12 is referred to as:
  ◦ Differential PCM (DPCM) when used with
    ▪ a fixed predictor and fixed quantization.
  ◦ Adaptive Differential PCM (ADPCM) when used with
    ▪ Adaptive prediction (i.e., adapting the predictor to local correlation)
    ▪ Adaptive quantization (i.e., adapting the quantizer to the local variance of r[n])
• ADPCM yields the greatest gains in SNR for a fixed bit rate.
  ◦ The international coding standard CCITT G.721, with toll-quality speech at 32 kbps (8000 samples/sec × 4 bits/sample), has been designed based on ADPCM techniques.
• To achieve higher quality at lower rates it is required to:
  ◦ Rely on speech model-based techniques, and
  ◦ Exploit long-time prediction, as well as
  ◦ Short-time prediction.

Differential and Residual Quantization
• An important variation of the differential quantization scheme of the preceding slides:
  ◦ Prediction has so far assumed an all-pole model (autoregressive model).
  ◦ In this model the signal value is predicted from its past samples:
    ▪ Any error in a codeword (due, for example, to bit errors over a degraded channel) propagates over a considerable time during decoding.
    ▪ Such error propagation is severe when the signal values represent speech model parameters computed frame-by-frame (as opposed to sample-by-sample).
• An alternative approach is to use a finite-order moving-average predictor derived from the residual.
• One common use of the moving-average predictor is illustrated in Figure 12.13 in the next slide.

[Figure 12.13: Differential and Residual Quantization with a moving-average predictor]

Differential and Residual Quantization
• Coder stage of the system in Figure 12.13:
  ◦ The residual is the difference between the true value and the value predicted from the moving average of K quantized residuals, $\sum_{k=1}^{K} p[k]\,\hat{r}[n-k]$.
  ◦ $p[k]$: coefficients of P(z).
• Decoder stage:
  ◦ The predicted value is given by the same moving average of the K most recent quantized residuals.
  ◦ Error propagation is thus limited to only K samples (or K analysis frames for the case of model parameters).

Vector Quantization

Vector Quantization (VQ)
• Scalar quantization techniques were the topic of the previous sections.
• A generalization of scalar quantization, referred to as vector quantization, is investigated in this section.
• In vector quantization a block of scalars is coded as a vector rather than individually.
• An optimal quantization strategy can be derived based on a mean-squared error distortion metric, as with scalar quantization.

Vector Quantization (VQ)
• Motivation
  • Assume the vocal tract transfer function is characterized by only two resonances, thus requiring four reflection coefficients.
  • Furthermore, suppose that the vocal tract can take on only one of four possible shapes. This implies that there exist only four possible sets of the four reflection coefficients, as illustrated in Figure 12.14 on the next slide.
  • Scalar quantization considers each of the reflection coefficients individually:
    • Each coefficient can take on 4 different values ⇒ 2 bits are required to encode each coefficient.
    • For 4 reflection coefficients, 4 × 2 = 8 bits per analysis frame are required to code the vocal tract transfer function.
  • Vector quantization exploits the fact that there are only four possible positions of the vocal tract, corresponding to only four possible vectors of reflection coefficients:
    • The scalar values of each vector are highly correlated.
    • Thus only 2 bits are required to encode the 4 reflection coefficients.
    • Note: if the scalars were independent of each other, treating them together as a vector would have no advantage over treating them individually.
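The bit counts in this motivation follow directly from counting alternatives. The short sketch below redoes the arithmetic; the helper name `bits_needed` is an assumption introduced for the example.

```python
import math


def bits_needed(num_alternatives):
    """Smallest number of bits that can index num_alternatives distinct values."""
    return math.ceil(math.log2(num_alternatives))


num_coefficients = 4          # four reflection coefficients per frame
values_per_coefficient = 4    # each coefficient takes one of 4 values
num_vocal_tract_shapes = 4    # only 4 coefficient vectors actually occur

# Scalar quantization: every coefficient is indexed on its own.
scalar_bits = num_coefficients * bits_needed(values_per_coefficient)

# Vector quantization: one index selects the whole coefficient vector.
vector_bits = bits_needed(num_vocal_tract_shapes)

print(scalar_bits, vector_bits)
```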

Vector Quantization (VQ)
[Figure 12.14]

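Vector Quantization (VQ)
• Consider a vector of N continuous scalars: $\mathbf{x} = [x_1, x_2, \ldots, x_N]^T$
• With VQ, the vector x is mapped into another N-dimensional vector x̂: $\hat{\mathbf{x}} = [\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_N]^T$
• The vector x̂ is chosen from M possible reconstruction (quantization) levels: $\hat{\mathbf{x}} = VQ[\mathbf{x}] = \mathbf{r}_i$ for $\mathbf{x} \in C_i$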

Vector Quantization (VQ) T T 10/7/2020 Veton Këpuska 77

Vector Quantization (VQ)
  • VQ: vector quantization operator
  • ri: the M possible reconstruction levels, 1 ≤ i ≤ M
  • Ci: ith “cell” or cell boundary
• If x is in the cell Ci, then x is mapped to ri.
  • ri: codeword
  • {ri}: set of all codewords; the codebook.

Vector Quantization (VQ)
• Properties of VQ:
  P1: In vector quantization a cell can have an arbitrary size and shape. In scalar quantization a “cell” (the region between two decision levels) can have an arbitrary size, but its shape is fixed.
  P2: As in scalar quantization, the distortion measure D(x, x̂) is a measure of dissimilarity or error between x and x̂.

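VQ Distortion Measure
• Vector quantization noise is represented by the vector e, the difference between x̂ and x.
• The distortion is the average of the sum of squares of the scalar components: $D = E[\mathbf{e}^T \mathbf{e}]$
• For the multi-dimensional pdf px(x): $D = \sum_{i=1}^{M} \int_{\mathbf{x} \in C_i} (\mathbf{r}_i - \mathbf{x})^T (\mathbf{r}_i - \mathbf{x})\, p_{\mathbf{x}}(\mathbf{x})\, d\mathbf{x}$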

VQ Distortion Measure
• Goal: minimize the distortion D.
• Two conditions formulated by Lim:
  C1: A vector x must be quantized to the reconstruction level ri that gives the smallest distortion between x and ri.
  C2: Each reconstruction level ri must be the centroid of the corresponding decision region (cell Ci).
  • Condition C1 implies that, given the reconstruction levels, we can quantize without explicit need for the cell boundaries.
    • To quantize a given vector, the reconstruction level that minimizes its distortion is found.
    • This process requires a large search, which is an active area of research.
  • Condition C2 specifies how to obtain a reconstruction level from the selected cell.

VQ Distortion Measure
• The two stated conditions provide the basis for an iterative solution for obtaining the VQ codebook:
  • Start with an initial estimate of ri.
  • Apply the first condition, by which all the vectors from a set that get quantized to ri can be determined.
  • Apply the second condition to obtain a new estimate of the reconstruction levels (i.e., the centroid of each cell).
• The problem with this approach is that it requires estimation of the joint pdf of all x in order to compute the distortion measure and the multi-dimensional centroid.
• Solution: the k-means algorithm (Lloyd for 1-D and Forgy for multi-D).

k-Means Algorithm
1. Compute the ensemble average D as an average over the L training vectors: $D = \frac{1}{L} \sum_{k=1}^{L} (\hat{\mathbf{x}}_k - \mathbf{x}_k)^T (\hat{\mathbf{x}}_k - \mathbf{x}_k)$, where xk are the training vectors and x̂k are the quantized vectors.
2. Pick an initial guess at the reconstruction levels {ri}.
3. For each xk select the closest ri. The set of all xk nearest to ri forms a cluster (see Figure 12.16), hence “clustering algorithm”.
4. Compute the mean of the xk in each cluster, which gives the new ri’s. Calculate D.
5. Stop when the change in D over two consecutive iterations is insignificant.
• This algorithm converges to a local minimum of D.
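The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names, the stopping tolerance `tol`, the iteration cap, the choice to keep an old codeword when its cluster is empty, and the toy training set are all introduced for the example.

```python
def squared_distance(u, v):
    """Sum of squared component differences between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest(x, codebook):
    """Index of the reconstruction level closest to x."""
    return min(range(len(codebook)), key=lambda i: squared_distance(x, codebook[i]))


def k_means(training, initial_codebook, tol=1e-9, max_iter=100):
    codebook = [list(r) for r in initial_codebook]        # step 2: initial guess
    previous_d = float("inf")
    d = previous_d
    for _ in range(max_iter):
        # Step 3: assign every training vector to its closest codeword.
        labels = [nearest(x, codebook) for x in training]
        # Step 4: replace each codeword by the mean of its cluster.
        for i in range(len(codebook)):
            members = [x for x, label in zip(training, labels) if label == i]
            if members:                                   # empty cluster: keep old codeword
                codebook[i] = [sum(c) / len(members) for c in zip(*members)]
        # Average distortion D over the training set with the updated codebook.
        d = sum(squared_distance(x, codebook[nearest(x, codebook)])
                for x in training) / len(training)
        # Step 5: stop when D no longer decreases noticeably.
        if previous_d - d <= tol:
            break
        previous_d = d
    return codebook, d


training = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
codebook, distortion = k_means(training, [(0, 0), (10, 10)])
print(codebook, distortion)
```

Because the result depends on the initial guess, a different starting codebook can end in a different local minimum of D.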

k-Means Algorithm 10/7/2020 Veton Këpuska 84

Neural Networks Based Clustering Algorithms
• Kohonen’s SOFM (self-organizing feature map)
  • Topological ordering of the SOFM
  • Offers potential for further reduction in bit rate.

Use of VQ in Speech Transmission
• Obtain the VQ codebook from the training vectors; all transmitters and receivers must have identical copies of the VQ codebook.
• The analysis procedure generates a vector xi.
• The transmitter sends the index of the centroid ri of the closest cluster for the given vector xi. This step involves a search.
• The receiving end decodes the information by accessing the codeword at the received index and performing the synthesis operation.
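A toy version of this transmit/receive exchange is sketched below. The codebook entries, the analysis vector, and the function names `encode` and `decode` are made up for illustration; only the index crosses the channel, and the synthesis step at the receiver is not shown.

```python
import math


def encode(x, codebook):
    """Transmitter: exhaustive search for the index of the closest codeword."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, codebook[i])))


def decode(index, codebook):
    """Receiver: look up the codeword in its identical copy of the codebook."""
    return codebook[index]


# Hypothetical 4-entry codebook shared by transmitter and receiver.
CODEBOOK = [
    (0.9, -0.6, 0.3, -0.1),
    (0.7, 0.2, -0.4, 0.1),
    (-0.5, 0.8, 0.1, -0.3),
    (-0.8, -0.3, 0.5, 0.2),
]

x = (0.65, 0.25, -0.35, 0.05)        # analysis vector for one frame
index = encode(x, CODEBOOK)          # only this index is transmitted
bits_per_frame = math.ceil(math.log2(len(CODEBOOK)))
x_hat = decode(index, CODEBOOK)
print(index, bits_per_frame, x_hat)
```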

Model-Based Coding
• The purpose of model-based speech coding is to increase the bit efficiency to achieve either:
  • Higher quality for the same bit rate, or
  • Lower bit rate for the same quality.
• Chronological perspective of model-based coding, starting with:
  • All-pole speech representation used for coding:
    • Scalar quantization
    • Vector quantization
  • Mixed Excitation Linear Prediction (MELP) coder:
    • Removes deficiencies in the binary source representation.
  • Code-Excited Linear Prediction (CELP) coder:
    • Does not require explicit multi-band decision and source characterization as MELP does.

Basic Linear Prediction Coder (LPC)
• Recall the basic speech production model of the form $H(z) = \frac{A}{1 - P(z)}$, where the predictor polynomial is given as $P(z) = \sum_{k=1}^{p} a_k z^{-k}$.
• Suppose:
  • Linear prediction analysis is performed at 100 frames/s
  • 13 parameters are used:
    • 10 all-pole spectrum parameters
    • Pitch
    • Voicing decision
    • Gain
  • Resulting in 1300 parameters/s.
• Compared to a telephone-quality signal:
  • 4000 Hz bandwidth
  • 8000 samples/s (8 bits per sample)
  • 1300 parameters/s < 8000 samples/s

Basic Linear Prediction Coder (LPC)
• Instead of the prediction coefficients ai use:
  • Corresponding poles bi
  • Partial correlation coefficients ki (PARCOR)
  • Reflection coefficients ri, or
  • Another equivalent representation.
• Behavior of the prediction coefficients is difficult to characterize:
  • Large dynamic range (large variance)
  • Quantization errors can lead to an unstable system function at synthesis (poles may move outside the unit circle).
• Alternative equivalent representations:
  • Have a limited dynamic range
  • Can easily be constrained to give stability, because |bi| < 1 and |ki| < 1.

Basic Linear Prediction Coder (LPC)
• Many ways to code linear prediction parameters:
  • Ideally, optimal quantization uses the Max quantizer based on known or estimated pdf’s of each parameter.
• Example of 7200 bps coding:
  1. Voiced/unvoiced decision: 1 bit (on or off)
  2. Pitch (if voiced): 6 bits (uniform)
  3. Gain: 5 bits (nonuniform)
  4. Each pole bi: 10 bits (nonuniform)
     • 5 bits for bandwidth
     • 5 bits for center frequency
     • Total of 6 poles
  • 100 frames/s
  • 1 + 6 + 5 + 6 × 10 = 72 bits per frame
• Quality is limited by the simple impulse/noise excitation model.

Basic Linear Prediction Coder (LPC)
• Improvements are possible by replacing the poles with PARCOR coefficients.
  • Higher-order PARCOR coefficients have pdf’s closer to Gaussian, centered around zero ⇒ nonuniform quantization.
  • Companding is effective with PARCOR:
    • Transformed pdf’s are close to uniform.
    • The original PARCOR coefficients do not have good spectral sensitivity (the change in spectrum with a change in spectral parameters, which it is desired to minimize).
    • An empirical finding is that a more desirable transformation in this sense is the logarithm of the vocal tract area function ratio: $g_i = \log\frac{A_{i+1}}{A_i} = \log\frac{1 - k_i}{1 + k_i}$
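The log area ratio transformation can be tried numerically. The sketch below assumes the common form g = log((1 − k)/(1 + k)); sign conventions differ between texts, and the function names and the quantizer step size are hypothetical. It maps a PARCOR coefficient near the unit-magnitude limit to g, quantizes g uniformly, and maps back.

```python
import math


def parcor_to_log_area_ratio(k):
    """g = log((1 - k) / (1 + k)); requires |k| < 1 (assumed sign convention)."""
    return math.log((1.0 - k) / (1.0 + k))


def log_area_ratio_to_parcor(g):
    """Inverse mapping k = -tanh(g / 2); mathematically |k| < 1 for any finite g."""
    return -math.tanh(g / 2.0)


def quantize_uniform(value, step):
    """Uniform quantizer applied in the transformed (g) domain."""
    return step * round(value / step)


k = 0.95                                   # PARCOR coefficient close to the stability limit
g = parcor_to_log_area_ratio(k)
g_hat = quantize_uniform(g, step=0.25)     # hypothetical step size
k_hat = log_area_ratio_to_parcor(g_hat)
print(g, g_hat, k_hat)
```

Quantizing in the g domain and mapping back keeps the decoded coefficient inside the stable range, which is one reason this kind of transformation is convenient.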

Basic Linear Prediction Coder (LPC)
• Parameters gi:
  • Have a pdf close to uniform
  • Have smaller spectral sensitivity than PARCOR:
    • The all-pole spectrum changes less with a change in gi than with a change in ki.
    • Note that the spectrum changes less with a change in ki than with a change in pole positions.
• Typically these parameters can be coded at 5-6 bits each (a significant improvement over 10 bits):
  • 100 frames/s
  • Predictor of order 6 (6 poles)
    • (1 + 6 + 5 + 6 × 6) × 100 bps = 4800 bps
    • Same quality as 7200 bps by coding pole positions for telephone-bandwidth speech.
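The two bit rates quoted here differ only in the number of bits spent on each spectral parameter. A small sketch of the arithmetic follows; the function and parameter names are assumptions, and the 1, 6, and 5 side-information bits are taken from the formula above.

```python
def lpc_bit_rate(bits_per_spectral_parameter, order=6, frames_per_second=100,
                 voicing_bits=1, pitch_bits=6, gain_bits=5):
    """Bits per second for a frame made of side information plus `order` spectral parameters."""
    bits_per_frame = (voicing_bits + pitch_bits + gain_bits
                      + order * bits_per_spectral_parameter)
    return bits_per_frame * frames_per_second


pole_rate = lpc_bit_rate(10)   # 10 bits per pole
lar_rate = lpc_bit_rate(6)     # 6 bits per parameter g_i
print(pole_rate, lar_rate)
```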

Basic Linear Prediction Coder (LPC)
• A government standard for secure communications at 2.4 kbps used this basic LPC scheme at 50 frames per second for about a decade.
• Demand for higher quality standards opened up research on two primary problems with speech coders based on all-pole linear prediction analysis:
  1. Inadequacy of the basic source/filter speech production model.
  2. Restrictions of one-dimensional scalar quantization techniques, which do not account for possible parameter correlation.

A VQ LPC Coder
• VQ-based LPC PARCOR coder (figure label: K-means algorithm).

A VQ LPC Coder
1. Use the VQ LPC coder to achieve the same quality of speech at a lower bit rate:
  • 10-bit codebook (1024 codewords)
  • 800 bps with VQ versus 2400 bps with scalar quantization
  • 44.4 frames/s
  • 440 bits to code the PARCOR coefficients per second
  • 8 bits per frame for:
    • Pitch
    • Gain
    • Voicing
  • 1 bit for frame synchronization per second.
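The 800 bps figure can be checked from the per-frame allocation. The variable names below are assumptions; the exact product of 44.4 frames/s and 10 bits is 444, which the slide quotes in rounded form as 440.

```python
frames_per_second = 44.4
codebook_index_bits = 10       # one index into the 1024-entry codebook per frame
side_info_bits = 8             # pitch, gain, and voicing per frame
sync_bits_per_second = 1       # frame synchronization

num_codewords = 2 ** codebook_index_bits
spectrum_bps = frames_per_second * codebook_index_bits
side_info_bps = frames_per_second * side_info_bits
total_bps = spectrum_bps + side_info_bps + sync_bits_per_second
print(num_codewords, spectrum_bps, side_info_bps, total_bps)
```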

A VQ LPC Coder
• Maintain the 2400 bps bit rate with a higher quality of speech coding (early 1980s):
  • 22-bit codebook: 2^22 ≈ 4,200,000 codewords.
• Problems:
  1. Intractable solution due to:
    • Computational requirements (large VQ search)
    • Memory (large codebook size)
  2. VQ-based spectrum characterized by a “wobble” due to the LPC-based spectrum being quantized:
    • The spectral representation near a cell boundary “wobbles” to and from neighboring cells
    • Insufficient number of codebooks.
• Emphasis shifted from improved VQ of the spectrum and better excitation models, ultimately returning to VQ on the excitation.

Mixed Excitation LPC (MELP)
• Multi-band voicing decision (introduced as a concept in Section 12.5.2; not covered in slides)
• Addresses shortcomings of conventional linear prediction analysis/synthesis:
  • Realistic excitation signal
  • Time-varying vocal tract formant bandwidths
  • Production principles of the “anomalous” voice.

Mixed Excitation LPC (MELP)
• Model:
  • Different mixtures of impulses and noise are generated in different frequency bands (4-10 bands).
  • The impulse train and noise in the MELP model are each passed through time-varying spectral shaping filters and are added together to form a full-band signal.
• MELP unique components:
  1. An auditory-based approach to multi-band voicing estimation for the mixed impulse/noise excitation.
  2. Aperiodic impulses due to pitch jitter, the creaky voice, and the diplophonic voice.
  3. Time-varying resonance bandwidth within a pitch period, accounting for nonlinear source/system interaction and introducing the truncation effects.
  4. More accurate shape of the glottal flow velocity source.

Mixed Excitation LPC (MELP)
• A 2.4 kbps coder has been implemented based on the MELP model and has been selected as a government standard for secure telephone communications.
• The original version of MELP uses:
  • 34 bits for scalar quantization of the LPC coefficients (specifically the line spectral frequencies, LSFs)
  • 8 bits for gain
  • 7 bits for pitch and overall voicing
    • Uses an autocorrelation technique on the lowpass-filtered LPC residual.
  • 5 bits for multi-band voicing
  • 1 bit for the jittery state (aperiodic) flag
  • 54 bits per 22.5 ms frame ⇒ 2.4 kbps.
• In the actual 2.4 kbps standard, greater efficiency is achieved with vector quantization of the LSF coefficients.

Mixed Excitation LPC (MELP)
• Line Spectral Frequencies (LSFs)
  • More efficient parameter set for coding the all-pole model of linear prediction.
  • The LSFs for a pth order all-pole model are defined as follows:
    • Two polynomials of order p+1 are created from the pth order inverse filter A(z) according to: $P(z) = A(z) + z^{-(p+1)} A(z^{-1})$ and $Q(z) = A(z) - z^{-(p+1)} A(z^{-1})$
• LSFs can be coded efficiently, and stability of the resulting synthesis filter can be guaranteed when they are quantized.
• Better quantization and interpolation properties than the corresponding PARCOR coefficients.
• The disadvantage is that solving for the roots of P(z) and Q(z) can be more computationally intensive than computing the PARCOR coefficients.
• The polynomial A(z) is easily recovered from the LSFs (Exercise 12.18).
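The construction of the two order p+1 polynomials can be illustrated on coefficient lists. This sketch assumes the standard definitions P(z) = A(z) + z^-(p+1) A(z^-1) and Q(z) = A(z) − z^-(p+1) A(z^-1); the function name and the example filter are hypothetical. It stops short of root finding, so it does not compute the LSFs themselves, but it shows the symmetry of P and Q and that A(z) is their average.

```python
def lsf_polynomials(a):
    """Build P(z) and Q(z) from the coefficients of A(z).

    `a` lists the coefficients of A(z) in increasing powers of z^-1,
    starting with the leading 1.
    """
    forward = list(a) + [0.0]              # A(z), padded to order p + 1
    backward = [0.0] + list(a)[::-1]       # z^-(p+1) * A(z^-1)
    p_poly = [f + b for f, b in zip(forward, backward)]
    q_poly = [f - b for f, b in zip(forward, backward)]
    return p_poly, q_poly


a = [1.0, -0.9, 0.2]                       # hypothetical 2nd-order inverse filter
p_poly, q_poly = lsf_polynomials(a)

# A(z) is recovered as the average of P(z) and Q(z).
recovered = [(pc + qc) / 2.0 for pc, qc in zip(p_poly, q_poly)][:len(a)]
print(p_poly, q_poly, recovered)
```

P(z) comes out with symmetric coefficients and Q(z) with antisymmetric ones, which is what allows their roots to be searched for on the unit circle.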

Code-Excited Linear Prediction (CELP)
• Concept:
  • Core ideas of CELP:
    • Utilization of long-term as well as short-term linear prediction models for speech synthesis ⇒ avoids the strict voiced/unvoiced classification of the LPC coder.
    • Incorporation of an excitation codebook which is searched during encoding to locate the best excitation sequence.
  • “Code-excited” LP comes from the excitation codebook that contains the “code” to “excite” the synthesis filters.
• On each frame a codeword is chosen from a codebook of residuals so as to minimize the mean-squared error between the synthesized and original speech waveforms.
• The length of a codeword sequence is determined by the analysis frame length.
  • For a 10 ms frame interval split into 2 inner frames of 5 ms each, a codeword sequence is 40 samples in duration at an 8000 Hz sampling rate.
  • The residual and long-term predictor are estimated with twice the time resolution (a 5 ms frame) of the short-term predictor (10 ms frame):
    • The excitation is more nonstationary than the vocal tract.
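The codebook search described above can be sketched as a small analysis-by-synthesis loop. This is a deliberately reduced illustration: it uses only a short-term all-pole synthesis filter with a made-up coefficient, a three-entry random codebook, and a per-codeword optimal gain, and it omits the long-term predictor and any perceptual weighting. All names are assumptions introduced for the example.

```python
import random


def synthesize(excitation, a):
    """All-pole synthesis s[n] = e[n] + sum_k a[k] * s[n-1-k], zero initial state."""
    s = []
    for n, e in enumerate(excitation):
        s.append(e + sum(a[k] * s[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0))
    return s


def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the codeword whose synthesis best matches target."""
    best = None
    for index, codeword in enumerate(codebook):
        y = synthesize(codeword, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / energy   # least-squares gain
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


sampling_rate_hz = 8000
subframe_ms = 5
codeword_length = sampling_rate_hz * subframe_ms // 1000   # 40 samples

rng = random.Random(0)
codebook = [[rng.gauss(0.0, 1.0) for _ in range(codeword_length)] for _ in range(3)]
a = [0.5]                                                   # hypothetical short-term predictor

# Build a target that is exactly codeword 1 scaled by 2 and passed through the filter.
target = synthesize([2.0 * e for e in codebook[1]], a)
index, gain, error = search_codebook(target, codebook, a)
print(codeword_length, index, gain, error)
```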

Code-Excited Linear Prediction (CELP)
• Two approaches to formation of the codebook:
  • Deterministic
  • Stochastic
• Deterministic codebook: formed by applying the k-means clustering algorithm to a large set of residual training vectors.
  • Channel mismatch
• Stochastic codebook
  • The histogram of the residual from the long-term predictor follows roughly a Gaussian pdf.
  • A valid assumption with the exception of plosives and voiced/unvoiced transitions.
  • Cumulative distributions are nearly identical to those for white Gaussian random variables.
  • An alternative codebook is therefore constructed of white Gaussian random variables with unit variance.
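A stochastic codebook of this kind is simple to generate. The sketch below is an illustration only: the codebook size of 1024 entries, the codeword length of 40 samples, the seed, and the function name are assumptions, not values taken from any standard.

```python
import random


def stochastic_codebook(num_codewords, length, seed=0):
    """Codebook of white Gaussian sequences with zero mean and unit variance."""
    rng = random.Random(seed)          # fixed seed so coder and decoder can share it
    return [[rng.gauss(0.0, 1.0) for _ in range(length)]
            for _ in range(num_codewords)]


codebook = stochastic_codebook(num_codewords=1024, length=40)

# Check that the pooled samples have variance close to one.
samples = [v for codeword in codebook for v in codeword]
mean = sum(samples) / len(samples)
variance = sum((v - mean) ** 2 for v in samples) / len(samples)
print(len(codebook), len(codebook[0]), variance)
```

Because the entries are generated rather than trained, both ends can rebuild the same codebook from a shared seed instead of storing trained vectors.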

CELP Coders
• Variety of government and international standard coders:
  • The 1990s government standard for secure communications at 4.8 kbps at 4000 Hz bandwidth (Fed-Std 1016) uses a CELP coder:
    • Three bit rates:
      • 9.6 kbps (multi-pulse)
      • 4.8 kbps (CELP)
      • 2.4 kbps (LPC)
    • Short-time predictor: 30 ms frame interval, coded with 34 bits per frame.
    • 10th order vocal tract spectrum from prediction coefficients transformed to LSFs, coded with nonuniform quantization.
    • Short-term and long-term predictors are estimated in open-loop form.
    • Residual codewords are determined in closed-loop form.
  • Current international standards use CELP-based coding:
    • G.729
    • G.723.1