CHAPTER 10 Widrow-Hoff Learning Ming-Feng Yeh 1

Objectives
Widrow-Hoff learning is an approximate steepest descent algorithm in which the performance index is the mean square error. It is widely used today in many signal processing applications. It is a precursor to the backpropagation algorithm for multilayer networks.

ADALINE Network
The ADALINE (Adaptive Linear Neuron) network and its learning rule, the LMS (Least Mean Square) algorithm, were proposed by Bernard Widrow and Marcian Hoff in 1960. Both the ADALINE network and the perceptron suffer from the same inherent limitation: they can only solve linearly separable problems. The LMS algorithm minimizes the mean square error (MSE), and therefore tries to move the decision boundaries as far from the training patterns as possible.

ADALINE Network
[Figure: ADALINE network with input p (R×1), weight matrix W (S×R), bias b (S×1), net input n = Wp + b (S×1), and output a = purelin(Wp + b) (S×1). The structure matches the single-layer perceptron except for the linear transfer function.]

Single ADALINE
Setting n = 0 gives Wp + b = 0, which specifies a decision boundary. The ADALINE can be used to classify objects into two categories if they are linearly separable.
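
As a minimal sketch of how the boundary Wp + b = 0 splits the input space, the following assumes a hypothetical two-input ADALINE. The weights `w`, the bias `b`, and the helper names are illustrative choices, not values from the slides.

```python
def adaline_output(w, b, p):
    """Linear ADALINE output a = purelin(w^T p + b) = w^T p + b."""
    return sum(wi * pi for wi, pi in zip(w, p)) + b


def classify(w, b, p):
    """Return 1 if p lies on the non-negative side of the boundary, else 2."""
    return 1 if adaline_output(w, b, p) >= 0 else 2


# Hypothetical weights and bias: the boundary is the line p1 + p2 - 1 = 0.
w = [1.0, 1.0]
b = -1.0

category_a = classify(w, b, [2.0, 2.0])          # n = 3, positive side
category_b = classify(w, b, [0.0, 0.0])          # n = -1, negative side
on_boundary = adaline_output(w, b, [0.5, 0.5])   # n = 0, exactly on the boundary
```

The weight vector is orthogonal to the boundary and points toward the region where the output is positive.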

Mean Square Error
The LMS algorithm is an example of supervised training. It adjusts the weights and biases of the ADALINE in order to minimize the mean square error, where the error is the difference between the target output ($t_q$) and the network output ($a_q$) for the input $\mathbf{p}_q$.
MSE: $F(\mathbf{x}) = E[e^2] = E[(t - a)^2]$, where $\mathbf{x}$ collects the weights and bias and $E[\cdot]$ denotes the expected value.

Performance Optimization
Develop algorithms to optimize a performance index F(x), where the word “optimize” means to find the value of x that minimizes F(x). The optimization algorithms are iterative:
$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k$ or $\Delta\mathbf{x}_k = \mathbf{x}_{k+1} - \mathbf{x}_k = \alpha_k \mathbf{p}_k$
$\mathbf{p}_k$: a search direction
$\alpha_k$: positive learning rate, which determines the length of the step
$\mathbf{x}_0$: initial guess

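Taylor Series Expansion
Taylor series: $F(x) = F(x^*) + \frac{d}{dx}F(x)\big|_{x=x^*}(x - x^*) + \frac{1}{2}\frac{d^2}{dx^2}F(x)\big|_{x=x^*}(x - x^*)^2 + \cdots$
Vector case: $F(\mathbf{x}) = F(\mathbf{x}^*) + \nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}^*}(\mathbf{x} - \mathbf{x}^*) + \frac{1}{2}(\mathbf{x} - \mathbf{x}^*)^T \nabla^2 F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}^*}(\mathbf{x} - \mathbf{x}^*) + \cdots$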

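Gradient & Hessian
Gradient: $\nabla F(\mathbf{x}) = \left[\frac{\partial F(\mathbf{x})}{\partial x_1}\;\; \frac{\partial F(\mathbf{x})}{\partial x_2}\;\; \cdots\;\; \frac{\partial F(\mathbf{x})}{\partial x_n}\right]^T$
Hessian: $\nabla^2 F(\mathbf{x})$ is the $n \times n$ matrix whose $(i, j)$ element is $\frac{\partial^2 F(\mathbf{x})}{\partial x_i\,\partial x_j}$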

Directional Derivative
The ith element of the gradient, $\partial F(\mathbf{x})/\partial x_i$, is the first derivative of the performance index F along the $x_i$ axis. Let p be a vector in the direction along which we wish to know the derivative.
Directional derivative: $\frac{\mathbf{p}^T \nabla F(\mathbf{x})}{\|\mathbf{p}\|}$
Example: find the derivative of F(x) at a given point $\mathbf{x}^*$ in a given direction $\mathbf{p}$.
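
The directional derivative can be sketched numerically as follows. The function F(x) = x1^2 + 2*x2^2, the point, and the directions are assumptions chosen for illustration; they are not recovered from the slide.

```python
import math


def gradient(x):
    """Gradient of the assumed example F(x) = x1^2 + 2*x2^2."""
    return [2.0 * x[0], 4.0 * x[1]]


def directional_derivative(grad, p):
    """Return p^T grad / ||p||, the slope of F along the direction p."""
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    return sum(pi * gi for pi, gi in zip(p, grad)) / norm_p


x_star = [0.5, 0.5]                  # assumed evaluation point
g = gradient(x_star)                 # [1.0, 2.0]

along_axis = directional_derivative(g, [1.0, 0.0])   # equals dF/dx1
tangent = directional_derivative(g, [2.0, -1.0])     # direction orthogonal to g
steepest = directional_derivative(g, g)              # equals ||g||
```

The slope is zero along any direction orthogonal to the gradient and is largest, equal to the gradient norm, along the gradient itself.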

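Steepest Descent
Goal: the function F(x) should decrease at each iteration, i.e., $F(\mathbf{x}_{k+1}) < F(\mathbf{x}_k)$.
Central idea: first-order Taylor series expansion, $F(\mathbf{x}_{k+1}) = F(\mathbf{x}_k + \Delta\mathbf{x}_k) \approx F(\mathbf{x}_k) + \mathbf{g}_k^T \Delta\mathbf{x}_k$, where $\mathbf{g}_k = \nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_k}$ and $\Delta\mathbf{x}_k = \alpha_k \mathbf{p}_k$.
Any vector $\mathbf{p}_k$ that satisfies $\mathbf{g}_k^T \mathbf{p}_k < 0$ is called a descent direction. A vector that points in the steepest descent direction is $\mathbf{p}_k = -\mathbf{g}_k$.
Steepest descent: $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \mathbf{g}_k$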
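
A minimal sketch of the steepest descent iteration x_{k+1} = x_k - alpha * g_k with a constant learning rate. The quadratic F(x) = x1^2 + 2*x2^2, the starting point, and alpha = 0.1 are assumptions made for illustration only.

```python
def F(x):
    """Assumed example performance index F(x) = x1^2 + 2*x2^2."""
    return x[0] ** 2 + 2.0 * x[1] ** 2


def grad_F(x):
    """Gradient of the assumed F."""
    return [2.0 * x[0], 4.0 * x[1]]


def steepest_descent(x0, alpha, steps):
    """Run x_{k+1} = x_k - alpha * g_k and record F(x_k) at every iteration."""
    x = list(x0)
    history = [F(x)]
    for _ in range(steps):
        g = grad_F(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        history.append(F(x))
    return x, history


x_final, history = steepest_descent([0.5, 0.5], alpha=0.1, steps=50)
```

With this small learning rate the recorded values of F decrease at every iteration, which is the goal stated for a descent method.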

Approximation-Based Formulation
Given input/output training data: $\{\mathbf{p}_1, t_1\}, \{\mathbf{p}_2, t_2\}, \ldots, \{\mathbf{p}_Q, t_Q\}$. The objective of network training is to find the optimal weights that minimize the least-squares error between the target value and the actual response.
Model (network) function: $a_q = f(\mathbf{x}, \mathbf{p}_q)$
Least-squares-error function: $E(\mathbf{x}) = \frac{1}{2}\sum_{q=1}^{Q}(t_q - a_q)^2$ (the factor 1/2 is a convention that simplifies the gradient)
The weight vector x can be trained by minimizing the error function along the gradient-descent direction: $\Delta\mathbf{x} = -\alpha\,\nabla E(\mathbf{x})$

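Delta Learning Rule
ADALINE: $a_q = \mathbf{w}^T\mathbf{p}_q + b$
Least-squares-error criterion: minimize $E = \frac{1}{2}\sum_{q=1}^{Q}(t_q - a_q)^2$
Gradient: $\frac{\partial E}{\partial \mathbf{w}} = -\sum_{q=1}^{Q}(t_q - a_q)\,\mathbf{p}_q$, $\frac{\partial E}{\partial b} = -\sum_{q=1}^{Q}(t_q - a_q)$
Delta learning rule: $\Delta\mathbf{w} = \alpha\sum_{q=1}^{Q}(t_q - a_q)\,\mathbf{p}_q$, $\Delta b = \alpha\sum_{q=1}^{Q}(t_q - a_q)$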

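Mean Square Error
With $\mathbf{x} = [\mathbf{w}^T\;\; b]^T$ and $\mathbf{z} = [\mathbf{p}^T\;\; 1]^T$, the network output is $a = \mathbf{x}^T\mathbf{z}$ and
$F(\mathbf{x}) = E[(t - \mathbf{x}^T\mathbf{z})^2] = E[t^2] - 2\mathbf{x}^T E[t\mathbf{z}] + \mathbf{x}^T E[\mathbf{z}\mathbf{z}^T]\,\mathbf{x} = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$,
where $c = E[t^2]$, $\mathbf{h} = E[t\mathbf{z}]$ (cross-correlation vector), and $\mathbf{R} = E[\mathbf{z}\mathbf{z}^T]$ (input correlation matrix).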

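Mean Square Error
If the correlation matrix R is positive definite, there will be a unique stationary point $\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$, where h is the cross-correlation between the input and the target, and this point will be a strong minimum.
Strong minimum: the point $\mathbf{x}^*$ is a strong minimum of F(x) if a scalar $\delta > 0$ exists, such that $F(\mathbf{x}^*) < F(\mathbf{x}^* + \Delta\mathbf{x})$ for all $\Delta\mathbf{x}$ such that $\delta > \|\Delta\mathbf{x}\| > 0$.
Global minimum: the point $\mathbf{x}^*$ is a unique global minimum of F(x) if $F(\mathbf{x}^*) < F(\mathbf{x}^* + \Delta\mathbf{x})$ for all $\Delta\mathbf{x} \neq \mathbf{0}$.
Weak minimum: the point $\mathbf{x}^*$ is a weak minimum of F(x) if it is not a strong minimum, and a scalar $\delta > 0$ exists, such that $F(\mathbf{x}^*) \le F(\mathbf{x}^* + \Delta\mathbf{x})$ for all $\Delta\mathbf{x}$ such that $\delta > \|\Delta\mathbf{x}\| > 0$.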
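
The stationary point x* = R^{-1} h can be checked on a small case. The sketch below assumes a hypothetical two-pattern training set for an ADALINE with no bias (so z = p) and equally likely patterns; the data and helper names are illustrative, not taken from the slides.

```python
def correlation_stats(inputs, targets):
    """Return c = E[t^2], h = E[t z], R = E[z z^T] for equally likely patterns."""
    q = len(inputs)
    n = len(inputs[0])
    c = sum(t * t for t in targets) / q
    h = [sum(t * z[i] for z, t in zip(inputs, targets)) / q for i in range(n)]
    R = [[sum(z[i] * z[j] for z in inputs) / q for j in range(n)] for i in range(n)]
    return c, h, R


def solve_2x2(R, h):
    """Solve R x = h for a 2x2 matrix R by Cramer's rule."""
    det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
    if det == 0:
        raise ValueError("R is singular, so the stationary point is not unique")
    return [(R[1][1] * h[0] - R[0][1] * h[1]) / det,
            (R[0][0] * h[1] - R[1][0] * h[0]) / det]


def mse(x, c, h, R):
    """Evaluate F(x) = c - 2 x^T h + x^T R x."""
    xh = sum(xi * hi for xi, hi in zip(x, h))
    xRx = sum(x[i] * R[i][j] * x[j] for i in range(2) for j in range(2))
    return c - 2.0 * xh + xRx


# Hypothetical training set (assumption): two inputs, no bias.
inputs = [[1.0, 2.0], [2.0, 1.0]]
targets = [1.0, -1.0]

c, h, R = correlation_stats(inputs, targets)
x_star = solve_2x2(R, h)        # stationary point R^{-1} h
f_min = mse(x_star, c, h, R)    # MSE at the stationary point
```

For this data R = [[2.5, 2], [2, 2.5]] is positive definite, the stationary point is x* = [-1, 1], and the MSE there is zero because both patterns can be matched exactly.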

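LMS Algorithm
The goal of the LMS algorithm is to locate the minimum point of the mean square error. It is an approximate steepest descent algorithm in which the gradient is estimated at each iteration.
Estimate the mean square error F(x) by the squared error at iteration k: $\hat{F}(\mathbf{x}) = (t(k) - a(k))^2 = e^2(k)$
Estimated gradient: $\hat{\nabla}F(\mathbf{x}) = \nabla e^2(k)$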

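LMS Algorithm
$[\nabla e^2(k)]_j = \frac{\partial e^2(k)}{\partial w_{1,j}} = 2e(k)\frac{\partial e(k)}{\partial w_{1,j}}$ for $j = 1, 2, \ldots, R$, and $[\nabla e^2(k)]_{R+1} = \frac{\partial e^2(k)}{\partial b} = 2e(k)\frac{\partial e(k)}{\partial b}$.
Since $e(k) = t(k) - \left(\sum_{i=1}^{R} w_{1,i}\,p_i(k) + b\right)$, we have $\frac{\partial e(k)}{\partial w_{1,j}} = -p_j(k)$ and $\frac{\partial e(k)}{\partial b} = -1$.
Hence $\hat{\nabla}F(\mathbf{x}) = \nabla e^2(k) = -2e(k)\,\mathbf{z}(k)$, where $\mathbf{z}(k) = [\mathbf{p}^T(k)\;\; 1]^T$.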

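LMS Algorithm
The steepest descent algorithm with constant learning rate $\alpha$ is $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\,\nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_k}$. Substituting the estimated gradient $-2e(k)\,\mathbf{z}(k)$ gives $\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha\,e(k)\,\mathbf{z}(k)$.
Matrix notation of LMS algorithm: $\mathbf{W}(k+1) = \mathbf{W}(k) + 2\alpha\,\mathbf{e}(k)\,\mathbf{p}^T(k)$, $\mathbf{b}(k+1) = \mathbf{b}(k) + 2\alpha\,\mathbf{e}(k)$
The LMS algorithm is also referred to as the delta rule or the Widrow-Hoff learning algorithm.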
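
A minimal sketch of the LMS update w(k+1) = w(k) + 2*alpha*e(k)*p(k) for a single-neuron ADALINE without bias. The two training patterns, the targets, and alpha = 0.2 are illustrative assumptions chosen for this sketch, and `lms_train` is a hypothetical helper name.

```python
def lms_train(patterns, targets, alpha, epochs):
    """Train a single-neuron, no-bias ADALINE with the LMS rule from zero weights."""
    w = [0.0] * len(patterns[0])
    for _ in range(epochs):
        for p, t in zip(patterns, targets):
            a = sum(wi * pi for wi, pi in zip(w, p))    # a(k) = w^T p(k)
            e = t - a                                   # e(k) = t(k) - a(k)
            # LMS update: w(k+1) = w(k) + 2 * alpha * e(k) * p(k)
            w = [wi + 2.0 * alpha * e * pi for wi, pi in zip(w, p)]
    return w


# Illustrative data (assumption): two three-element patterns with targets -1 and +1.
patterns = [[1.0, -1.0, -1.0], [1.0, 1.0, -1.0]]
targets = [-1.0, 1.0]

w_first = lms_train(patterns[:1], targets[:1], alpha=0.2, epochs=1)   # after one update
w_final = lms_train(patterns, targets, alpha=0.2, epochs=100)         # after many passes
```

The weights are updated after every single pattern, using only the current error, which is what distinguishes LMS from a batch gradient step. For this data the weights settle near [0, 1, 0], which reproduces both targets exactly.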

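Quadratic Functions
General form of quadratic function: $F(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} + \mathbf{d}^T\mathbf{x} + c$ (A: Hessian matrix)
If the eigenvalues of the Hessian matrix are all positive, then the quadratic function will have one unique global minimum.
ADALINE network mean square error: $F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$, so $\mathbf{d} = -2\mathbf{h}$ and $\mathbf{A} = 2\mathbf{R}$.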

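Stable Learning Rates
Suppose that the performance index is a quadratic function: $F(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} + \mathbf{d}^T\mathbf{x} + c$, with gradient $\nabla F(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{d}$.
Steepest descent algorithm with constant learning rate: $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha(\mathbf{A}\mathbf{x}_k + \mathbf{d}) = [\mathbf{I} - \alpha\mathbf{A}]\,\mathbf{x}_k - \alpha\mathbf{d}$
This is a linear dynamic system, which will be stable if the eigenvalues of the matrix $[\mathbf{I} - \alpha\mathbf{A}]$ are less than one in magnitude.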

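Stable Learning Rates
Let $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ and $\{\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n\}$ be the eigenvalues and eigenvectors of the Hessian matrix A. Then $[\mathbf{I} - \alpha\mathbf{A}]\,\mathbf{z}_i = \mathbf{z}_i - \alpha\lambda_i\mathbf{z}_i = (1 - \alpha\lambda_i)\,\mathbf{z}_i$.
The condition for the stability of the steepest descent algorithm is then $|1 - \alpha\lambda_i| < 1$.
Assume that the quadratic function has a strong minimum point; then its eigenvalues must be positive numbers. Hence $\alpha < 2/\lambda_i$.
This must be true for all eigenvalues: $\alpha < 2/\lambda_{\max}$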
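
The bound alpha < 2/lambda_max can be illustrated numerically. The sketch below assumes a hypothetical 2x2 Hessian chosen so that its eigenvalues are 4 and 8, and a quadratic with d = 0 and c = 0; the matrix, starting point, and helper names are assumptions.

```python
import math


def eig_sym_2x2(A):
    """Eigenvalues of a symmetric 2x2 matrix, smallest first."""
    a, b, d = A[0][0], A[0][1], A[1][1]
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return mean - radius, mean + radius


def final_norm(A, x0, alpha, steps):
    """Run x_{k+1} = x_k - alpha * A x_k on F(x) = 0.5 x^T A x and return ||x||."""
    x = list(x0)
    for _ in range(steps):
        g = [A[0][0] * x[0] + A[0][1] * x[1],
             A[1][0] * x[0] + A[1][1] * x[1]]
        x = [x[0] - alpha * g[0], x[1] - alpha * g[1]]
    return math.hypot(x[0], x[1])


# Hypothetical Hessian (assumption) with eigenvalues 4 and 8.
A = [[6.0, -2.0], [-2.0, 6.0]]
lam_min, lam_max = eig_sym_2x2(A)
alpha_max = 2.0 / lam_max                                    # stability bound

stable_norm = final_norm(A, [1.0, 0.0], alpha=0.24, steps=200)    # just below the bound
unstable_norm = final_norm(A, [1.0, 0.0], alpha=0.26, steps=200)  # just above the bound
```

Just below the bound the iterate shrinks toward the minimum at the origin; just above it the component along the eigenvector of lambda_max is multiplied by a factor of magnitude greater than one at every step and grows without limit.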

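Analysis of Convergence
In the LMS algorithm $\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha\,e(k)\,\mathbf{z}(k)$, $\mathbf{x}_k$ is a function only of z(k-1), z(k-2), …, z(0). Assume that successive input vectors are statistically independent; then $\mathbf{x}_k$ is independent of z(k). Taking expectations with $e(k) = t(k) - \mathbf{x}_k^T\mathbf{z}(k)$ gives $E[\mathbf{x}_{k+1}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_k] + 2\alpha\mathbf{h}$.
The expected value of the weight vector will converge to $\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$. This is the minimum MSE solution.
The condition on stability is $0 < \alpha < 1/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of R.
The steady state solution is $E[\mathbf{x}_{ss}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_{ss}] + 2\alpha\mathbf{h}$, or $E[\mathbf{x}_{ss}] = \mathbf{R}^{-1}\mathbf{h} = \mathbf{x}^*$.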

Orange/Apple Example
In practical applications it might NOT be practical to calculate R, so the learning rate could be selected by trial and error.

Orange/Apple Example
Start, arbitrarily, with all the weights set to zero, and then apply inputs $\mathbf{p}_1$, $\mathbf{p}_2$, etc., in that order, calculating the new weights after each input is presented.

Orange/Apple Example
This decision boundary falls halfway between the two reference patterns. The perceptron rule did NOT produce such a boundary: it stops as soon as the patterns are correctly classified, even though some patterns may be close to the boundaries. The LMS algorithm instead minimizes the mean square error.

Solved Problem P10.2
Categories I and II: since they are linearly separable, we can design an ADALINE network to make such a distinction, as shown in the figure.
Categories III and IV: they are NOT linearly separable, so an ADALINE network CANNOT distinguish between them.

Solved Problem P10.3
These patterns occur with equal probability, and they are used to train an ADALINE network with no bias. What does the MSE performance surface look like?

Solved Problem P10.3
The Hessian matrix of F(x), 2R, has both eigenvalues at 2, so the contours of the performance surface will be circular. The center of the contours is the minimum point $\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$.
[Figure: contour plot of the MSE performance surface]

Solved Problem P10.4
Train the network using the LMS algorithm, with the initial guess set to zero and a learning rate α = 0.25.

Tapped Delay Line
At the output of the tapped delay line we have an R-dimensional vector, consisting of the input signal at the current time and at delays of 1 to R−1 time steps.
[Figure: tapped delay line built from a chain of delay blocks D]
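
A tapped delay line can be sketched with a fixed-length queue. The class name `TappedDelayLine` and the sample values are hypothetical, and the sketch assumes the delays hold zeros before the first sample arrives.

```python
from collections import deque


class TappedDelayLine:
    """Holds the current input and the R-1 previous inputs (zeros initially)."""

    def __init__(self, R):
        self.taps = deque([0.0] * R, maxlen=R)

    def step(self, y):
        """Shift in the new sample y(k) and return [y(k), y(k-1), ..., y(k-R+1)]."""
        self.taps.appendleft(y)   # the oldest sample drops off the right end
        return list(self.taps)


tdl = TappedDelayLine(3)
outputs = [tdl.step(y) for y in [1.0, 2.0, 3.0, 4.0]]
```

Each call returns the R-dimensional vector that would be presented to the network at that time step.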

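Adaptive Filter
[Figure: adaptive filter, a tapped delay line (delay blocks D) feeding an ADALINE]
Filter output: $a(k) = \mathrm{purelin}(\mathbf{W}\mathbf{p} + b) = \sum_{i=1}^{R} w_{1,i}\,y(k - i + 1) + b$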

Solved Problem P10.1
Just prior to k = 0 (k < 0): three zeros have entered the filter, i.e., y(−3) = y(−2) = y(−1) = 0, so the output just prior to k = 0 is zero. The output is then computed at k = 0.

Solved Problem P10.1
The output is computed in the same way at k = 1, k = 2, k = 3, and k = 4.

Solved Problem P10.1
The effect of y(0) lasts from k = 0 through k = 2, so it will have an influence for three time intervals. This corresponds to the length of the impulse response of this filter.
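
The three-interval influence of a single input sample can be sketched by feeding a unit impulse through a three-tap filter. The weights below are hypothetical and are not the values used in the problem; `fir_response` is an illustrative helper.

```python
def fir_response(weights, signal, bias=0.0):
    """Output a(k) = sum_i w_i * y(k - i + 1) + b of an ADALINE fed by a tapped delay line."""
    taps = [0.0] * len(weights)          # zeros have entered the filter before k = 0
    outputs = []
    for y in signal:
        taps = [y] + taps[:-1]           # shift the new sample into the delay line
        outputs.append(sum(w * t for w, t in zip(weights, taps)) + bias)
    return outputs


weights = [0.5, -1.0, 2.0]               # hypothetical filter weights (assumption)
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]      # y(0) = 1, zero afterwards

response = fir_response(weights, impulse)
nonzero_steps = sum(1 for a in response if a != 0.0)
```

With zero bias, the impulse response simply reads out the weights one per time step, so a filter with three taps responds to a single sample for exactly three time intervals.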

Solved Problem P10.6
Application of ADALINE: adaptive predictor. The purpose of this filter is to predict the next value of the input signal from the two previous values. Suppose that the input signal is a stationary random process with a specified autocorrelation function.
[Figure: adaptive predictor network]

Solved Problem P10.6
i. Sketch the contour plot of the performance index (MSE).

Solved Problem P10.6
Performance index (MSE): $F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$, with optimal weights $\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$ and Hessian matrix $\mathbf{A} = 2\mathbf{R}$.
Eigenvalues of the Hessian: $\lambda_1 = 4$, $\lambda_2 = 8$.
The contours of F(x) will be elliptical, with the long axis of each ellipse along the first eigenvector, since the first eigenvalue has the smallest magnitude. The ellipses will be centered at $\mathbf{x}^*$.
[Figure: contour plot of F(x)]

Solved Problem P10.6
ii. The maximum stable value of the learning rate for the LMS algorithm: $\alpha < 2/\lambda_{\max} = 2/8 = 0.25$, where $\lambda_{\max} = 8$ is the largest eigenvalue of the Hessian.
iii. The LMS algorithm is approximate steepest descent, so the trajectory for small learning rates will move perpendicular to the contour lines.
[Figure: LMS trajectory on the contour plot]

Applications
Noise cancellation system to remove 60-Hz noise from EEG signal (Fig. 10.6)
Echo cancellation system in long distance telephone lines (Fig. 10)
Filtering engine noise from pilot’s voice signal (Fig. P10.8)