Introduction to Radial Basis Function Networks

Contents
• Overview
• The Models of Function Approximator
• The Radial Basis Function Networks
• RBFN’s for Function Approximation
• The Projection Matrix
• Learning the Kernels
• Bias-Variance Dilemma
• The Effective Number of Parameters
• Model Selection

RBF
• Linear models have been studied in statistics for about 200 years, and the theory is applicable to RBF networks, which are just one particular type of linear model.
• However, the fashion for neural networks, which started in the mid-1980s, has given rise to new names for concepts already familiar to statisticians.

Typical Applications of NN
• Pattern Classification
• Function Approximation
• Time-Series Forecasting

Function Approximation
An unknown function f is approximated by a learned approximator f̂.

Introduction to Radial Basis Function Networks The Model of Function Approximator

Linear Models
A linearly weighted sum of fixed basis functions: the weights are adjustable, the basis functions are fixed.

Linear Models
Output unit y: the linearly weighted output with weights w_1, w_2, …, w_m.
Hidden units φ_1, φ_2, …, φ_m: decomposition, feature extraction, transformation.
Inputs x = (x_1, x_2, …, x_n): feature vectors.

Linear Models
Same network topology as above. Can you say something about the bases φ_1, …, φ_m?

Example Linear Models
• Polynomial
• Fourier Series
Are they orthogonal bases?

Single-Layer Perceptrons as Universal Approximators
A network with hidden sigmoidal units, output weights w_1, …, w_m, and inputs x = (x_1, …, x_n). With a sufficient number of sigmoidal units, it can be a universal approximator.

Radial Basis Function Networks as Universal Approximators
The same topology, but with radial-basis-function hidden units. With a sufficient number of radial-basis-function units, it can also be a universal approximator.

Non-Linear Models
A weighted sum of basis functions that are themselves adjusted by the learning process.

Introduction to Radial Basis Function Networks The Radial Basis Function Networks

Radial Basis Functions
Three parameters for a radial function φ_i(x) = φ(‖x − x_i‖):
• Center x_i
• Distance measure r = ‖x − x_i‖
• Shape φ

Typical Radial Functions
• Gaussian
• Hardy-Multiquadratic (1971)
• Inverse Multiquadratic
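
To make the three kernel shapes concrete, here is a minimal NumPy sketch; the width sigma and offset c are illustrative parameters, and the forms used are the usual textbook definitions (the original slide equations are not recoverable).

import numpy as np

def gaussian(r, sigma=1.0):
    """Gaussian radial function: exp(-r^2 / (2*sigma^2))."""
    return np.exp(-r**2 / (2.0 * sigma**2))

def multiquadric(r, c=1.0):
    """Hardy multiquadric: sqrt(r^2 + c^2)."""
    return np.sqrt(r**2 + c**2)

def inverse_multiquadric(r, c=1.0):
    """Inverse multiquadric: 1 / sqrt(r^2 + c^2)."""
    return 1.0 / np.sqrt(r**2 + c**2)

# Example: evaluate each kernel on distances r = 0, 0.5, ..., 3.
r = np.linspace(0.0, 3.0, 7)
print(gaussian(r, sigma=0.5))
print(multiquadric(r, c=1.0))
print(inverse_multiquadric(r, c=2.0))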

Gaussian Basis Function (σ = 0.5, 1.0, 1.5)

Inverse Multiquadratic (c = 1, 2, 3, 4, 5)

Most General RBF
The basis {φ_i : i = 1, 2, …} is "nearly" orthogonal.

Properties of RBF’s
• On-center, off-surround.
• Analogies with localized receptive fields found in several biological structures, e.g.,
  – visual cortex
  – ganglion cells

The Topology of RBF as a Function Approximator
Output units y_1, …, y_m: interpolation.
Hidden units: projection.
Inputs x_1, x_2, …, x_n: feature vectors.

The Topology of RBF as a Pattern Classifier
Output units y_1, …, y_m: classes.
Hidden units: subclasses.
Inputs x_1, x_2, …, x_n: feature vectors.

Introduction to Radial Basis Function Networks RBFN’s for Function Approximation

Radial Basis Function Networks
• Radial basis function (RBF) networks are feedforward networks trained using a supervised training algorithm. The activation function is selected from a class of functions called basis functions.
• They usually train much faster than BP networks, and they are less susceptible to problems with non-stationary inputs.

Radial Basis Function Networks
• Popularized by Broomhead and Lowe (1988) and Moody and Darken (1989), RBF networks have proven to be a useful neural network architecture.
• The major difference between RBF and BP networks is the behavior of the single hidden layer.
• Rather than using the sigmoidal or S-shaped activation function as in BP, the hidden units in RBF networks use a Gaussian or some other basis kernel function.

The Idea
The unknown function to approximate (y vs. x) and the training data.

The Idea
The same plot, with basis functions (kernels) placed along the x-axis.

The Idea
The function learned as a weighted sum of the basis functions (kernels).

The Idea
A non-training sample shown against the learned function and the basis functions (kernels).

The Idea
The non-training sample against the learned function alone.

Radial Basis Function Networks as Universal Approximators
Training set: pairs (x^(k), y^(k)), k = 1, …, p.
Goal: f(x^(k)) ≈ y^(k) for all k.
Network output: a weighted sum of the kernels, with weights w_1, …, w_m and inputs x = (x_1, …, x_n).

Learn the Optimal Weight Vector
Given the training set and the goal f(x^(k)) ≈ y^(k) for all k, find the weight vector w = (w_1, …, w_m)^T that best achieves it.

Regularization
Training set and goal as before (f(x^(k)) ≈ y^(k) for all k), but now the cost adds a weight penalty. If regularization is unneeded, set the regularization parameters to 0.
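
The cost equation itself did not survive extraction; the following is a hedged reconstruction in the notation used elsewhere in the deck (p training pairs (x^(k), y^(k)), m basis functions φ_j, weights w_j, regularization parameters λ_j):

\[
C(\mathbf{w}) = \sum_{k=1}^{p}\Bigl(y^{(k)} - \sum_{j=1}^{m} w_j\,\phi_j(\mathbf{x}^{(k)})\Bigr)^{2}
+ \sum_{j=1}^{m}\lambda_j w_j^{2},
\qquad \lambda_j = 0 \text{ for all } j \text{ if regularization is unneeded.}
\]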

Learn the Optimal Weight Vector
• Minimize the regularized cost with respect to the weight vector w = (w_1, …, w_m)^T.
• Define the target vector y, the design matrix Φ (the basis functions evaluated at the training inputs), and the variance matrix A⁻¹.
• Setting the gradient of the cost to zero gives the optimal weight vector (see the reconstruction below).
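
The equations on these slides were lost in extraction; the following is a hedged reconstruction of the standard result for this regularized least-squares problem, in the notation introduced above:

\[
\Phi =
\begin{pmatrix}
\phi_1(\mathbf{x}^{(1)}) & \cdots & \phi_m(\mathbf{x}^{(1)})\\
\vdots & & \vdots\\
\phi_1(\mathbf{x}^{(p)}) & \cdots & \phi_m(\mathbf{x}^{(p)})
\end{pmatrix}
\ \text{(design matrix)},
\qquad
A^{-1} = \bigl(\Phi^{\top}\Phi + \Lambda\bigr)^{-1}\ \text{(variance matrix)},
\]
\[
\hat{\mathbf{w}} = A^{-1}\Phi^{\top}\mathbf{y},
\qquad
\Lambda = \operatorname{diag}(\lambda_1,\dots,\lambda_m),
\qquad
\mathbf{y} = \bigl(y^{(1)},\dots,y^{(p)}\bigr)^{\top}.
\]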

Introduction to Radial Basis Function Networks The Projection Matrix

The Empirical-Error Vector
The empirical-error vector collects the differences between the targets of the unknown function and the network outputs on the training set.

Sum-Squared-Error
If λ = 0, the RBFN’s learning algorithm minimizes the SSE (equivalently the MSE), i.e., the squared norm of the empirical-error vector.

The Projection Matrix
The empirical-error vector can be written as a projection matrix applied to the target vector.
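
A minimal NumPy sketch of these quantities, assuming the reconstruction above (Gaussian kernels with a shared width and a diagonal regularizer Λ = λI); the function and variable names are illustrative, not from the original slides.

import numpy as np

def design_matrix(X, centers, sigma=1.0):
    """Phi[k, j] = exp(-||x_k - c_j||^2 / (2*sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma**2))

def rbf_solution(Phi, y, lam=0.0):
    """Optimal weights, variance matrix, and projection matrix."""
    p, m = Phi.shape
    A = Phi.T @ Phi + lam * np.eye(m)          # A = Phi^T Phi + Lambda
    A_inv = np.linalg.inv(A)                   # "variance matrix"
    w = A_inv @ Phi.T @ y                      # optimal weight vector
    P = np.eye(p) - Phi @ A_inv @ Phi.T        # projection matrix
    return w, A_inv, P

# Toy example: p = 50 noisy samples of sin(x), m = 10 kernels.
rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 50)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
centers = np.linspace(0, 2 * np.pi, 10)[:, None]

Phi = design_matrix(X, centers, sigma=0.7)
w, A_inv, P = rbf_solution(Phi, y, lam=1e-3)
error_vector = P @ y                           # empirical-error vector e = P y
print("SSE =", float(error_vector @ error_vector))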

Introduction to Radial Basis Function Networks Learning the Kernels

RBFN’s as Universal Approximators
A network with outputs y_1, …, y_l, weights w_ij, kernels φ_1, …, φ_m, and inputs x_1, …, x_n, trained on a training set.

What to Learn?
• Weights w_ij’s
• Centers μ_j’s of the φ_j’s
• Widths σ_j’s of the φ_j’s
• Number of φ_j’s (model selection)

One-Stage Learning

One-Stage Learning
The simultaneous update of all three sets of parameters may be suitable for non-stationary environments or online settings.

Two-Stage Training
Step 1 determines:
• Centers μ_j’s of the φ_j’s
• Widths σ_j’s of the φ_j’s
• Number of φ_j’s
Step 2 determines the weights w_ij’s, e.g., using batch learning (see the sketch below).
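
A hedged sketch of the two stages, assuming k-means for Step 1 and the regularized least-squares solution above for Step 2; the shared-width heuristic (average distance between centers) is one common choice, not necessarily the one used in the original slides.

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means for Step 1: choose k centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def train_rbf_two_stage(X, y, k=10, lam=1e-3):
    # Step 1: centers (unsupervised) and a simple shared width.
    centers = kmeans(X, k)
    dists = np.sqrt(((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    sigma = dists[dists > 0].mean()            # heuristic width (assumption)
    # Step 2: weights by regularized least squares (batch learning).
    Phi = np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
                 / (2.0 * sigma**2))
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T @ y)
    return centers, sigma, w

# Usage: fit noisy sin data.
rng = np.random.default_rng(1)
X = np.linspace(0, 2 * np.pi, 80)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(80)
centers, sigma, w = train_rbf_two_stage(X, y, k=8)
print(centers.ravel(), sigma, w)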

Train the Kernels

Unsupervised Training
The kernel centers (marked +) are chosen without using the target outputs.

Methods
• Subset Selection
  – Random Subset Selection
  – Forward Selection
  – Backward Elimination
• Clustering Algorithms
  – KMEANS
  – LVQ
• Mixture Models
  – GMM

Subset Selection

Random Subset Selection
• Randomly choose a subset of points from the training set as centers.
• Sensitive to the initially chosen points.
• Use adaptive techniques to tune:
  – Centers
  – Widths
  – Number of points

Clustering Algorithms
Partition the data points into K clusters.

Clustering Algorithms
Is such a partition satisfactory?

Clustering Algorithms
How about this one?

Clustering Algorithms
Four cluster centers (marked +), labeled 1 to 4.

Introduction to Radial Basis Function Networks Bias-Variance Dilemma

Questions
How should the user choose the kernel?
• The problem is similar to that of selecting features for other learning algorithms.
  – A poor choice makes learning very difficult.
  – A good choice lets even poor learners succeed.
• The requirement from the user is thus critical.
  – Can this requirement be lessened?
  – Is a more automatic selection of features possible?

Goal Revisited
• Ultimate goal (generalization): minimize the prediction error.
• Goal of our learning procedure: minimize the empirical error.

Badness of Fit
• Underfitting
  – A model (e.g., a network) that is not sufficiently complex can fail to fully detect the signal in a complicated data set, leading to underfitting.
  – Produces excessive bias in the outputs.
• Overfitting
  – A model (e.g., a network) that is too complex may fit the noise, not just the signal, leading to overfitting.
  – Produces excessive variance in the outputs.

Underfitting/Overfitting Avoidance
• Model selection
• Jittering
• Early stopping
• Weight decay
  – Regularization
  – Ridge regression
• Bayesian learning
• Combining networks

Best Way to Avoid Overfitting
• Use lots of training data, e.g.,
  – 30 times as many training cases as there are weights in the network;
  – for noise-free data, 5 times as many training cases as weights may be sufficient.
• Don’t arbitrarily reduce the number of weights for fear of underfitting.

Badness of Fit
Example fits: an underfit model and an overfit model.

Bias-Variance Dilemma
Underfit: large bias, small variance. Overfit: large variance. (However, it's not really a dilemma.)

Bias-Variance Dilemma
More on overfitting:
• Overfitting easily leads to predictions that are far beyond the range of the training data.
• It produces wild predictions in multilayer perceptrons, even with noise-free data.
(Underfit: large bias, small variance. Overfit: small bias, large variance.)

Bias-Variance Dilemma
It's not really a dilemma: bias decreases and variance increases as the model moves from more underfit to more overfit.

Bias-Variance Dilemma
The set of functions the model can represent (depending, e.g., on the number of hidden nodes used) is drawn around the true model. Training sets 1, 2, and 3 each yield a different solution; noise perturbs each solution, and the bias is the gap between the true model and the solutions. The mean of the bias = ? The variance of the bias = ?

Bias-Variance Dilemma
The same picture, now also showing the variance: the spread of the solutions around their mean. The mean of the bias = ? The variance of the bias = ?

Model Selection
Reduce the effective number of parameters, e.g., reduce the number of hidden nodes. (Same picture: noise, bias, variance, sets of functions, the true model.)

Bias-Variance Dilemma
Goal: minimize the expected prediction error, given the noise, the bias, the set of representable functions (depending, e.g., on the number of hidden nodes used), and the true model.

Bias-Variance Dilemma
Goal: minimize the expected squared prediction error. Expanding it, the cross term equals 0 and the noise term is a constant.

Bias-Variance Dilemma
Goal: the expected squared prediction error decomposes into noise (which cannot be minimized), bias², and variance. Minimize both bias² and variance.
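
Since the slide's equations were lost in extraction, here is the standard decomposition the slide refers to, written for a target y = f(x) + ε with noise variance σ² and an estimator f̂ trained on a random training set:

\[
\mathbb{E}\bigl[(y - \hat{f}(\mathbf{x}))^{2}\bigr]
= \underbrace{\sigma^{2}}_{\text{noise (cannot be minimized)}}
+ \underbrace{\bigl(f(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})]\bigr)^{2}}_{\text{bias}^{2}}
+ \underbrace{\mathbb{E}\bigl[(\hat{f}(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})])^{2}\bigr]}_{\text{variance}}.
\]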

Model Complexity vs. Bias-Variance
As model complexity (capacity) increases, bias² decreases while variance increases; the noise level stays constant.

Bias-Variance Dilemma
Goal: choose the model complexity (capacity) at which the sum of noise, bias², and variance is smallest.

Example (Polynomial Fits)

Example (Polynomial Fits)

Example (Polynomial Fits)
Fits of degree 1, 5, 10, and 15.
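
A small sketch reproducing the spirit of this example with NumPy's polyfit; the data, the degrees, and the split into training and held-out points are illustrative assumptions added here to show under- vs. overfitting numerically.

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

# Hold out every third point as a "non-training" sample.
test = np.arange(x.size) % 3 == 0
x_tr, y_tr, x_te, y_te = x[~test], y[~test], x[test], y[test]

for degree in (1, 5, 10, 15):
    coeffs = np.polyfit(x_tr, y_tr, degree)
    rmse_tr = np.sqrt(np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2))
    rmse_te = np.sqrt(np.mean((np.polyval(coeffs, x_te) - y_te) ** 2))
    print(f"degree {degree:2d}: train RMSE {rmse_tr:.3f}, test RMSE {rmse_te:.3f}")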

Introduction to Radial Basis Function Networks The Effective Number of Parameters

Variance Estimation
The true mean and variance of the noise are, in general, not available and must be estimated from data.

Variance Estimation
Estimate the mean from the sample; estimating the variance around the estimated mean loses 1 degree of freedom.
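
A hedged reconstruction of the lost formulas, for targets y^(1), …, y^(p): the sample mean and the sample variance, where estimating the mean first costs one degree of freedom.

\[
\hat{\mu} = \frac{1}{p}\sum_{k=1}^{p} y^{(k)},
\qquad
\hat{\sigma}^{2} = \frac{1}{p-1}\sum_{k=1}^{p}\bigl(y^{(k)} - \hat{\mu}\bigr)^{2}.
\]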

Simple Linear Regression

Simple Linear Regression
Minimize the sum of squared residuals.

Mean Squared Error (MSE)
Minimizing over the two regression parameters (slope and intercept) loses 2 degrees of freedom.

Variance Estimation
Fitting a model with m parameters loses m degrees of freedom (m: number of parameters of the model).
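
In the same spirit, a hedged reconstruction of the variance estimate for a model with m fitted parameters:

\[
\hat{\sigma}^{2} = \frac{\mathrm{SSE}}{p - m}
= \frac{1}{p-m}\sum_{k=1}^{p}\bigl(y^{(k)} - \hat{f}(\mathbf{x}^{(k)})\bigr)^{2}.
\]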

The Number of Parameters
For an RBFN with weights w_1, …, w_m and inputs x = (x_1, …, x_n), the number of degrees of freedom is m.

The Effective Number of Parameters (γ)
For the regularized RBFN, the effective number of parameters is defined through the projection matrix.

The Effective Number of Parameters (γ)
Facts: γ can be expressed through the trace of the projection matrix (proof via the properties of the projection matrix; see the identities below).
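
The original facts and proof are not recoverable from the extraction; the following identities are the standard ones for this setup, stated as a hedged reconstruction (γ is the effective number of parameters, P the projection matrix, A = Φᵀ Φ + Λ):

\[
\gamma = p - \operatorname{trace}(P)
= \operatorname{trace}\bigl(\Phi A^{-1}\Phi^{\top}\bigr)
= m - \operatorname{trace}\bigl(\Lambda A^{-1}\bigr),
\qquad
P = I_{p} - \Phi A^{-1}\Phi^{\top}.
\]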

Regularization
The cost is the SSE plus a penalty on models with large weights; the penalty determines the effective number of parameters γ.

Regularization
Without the penalty (λ_j = 0), there are m degrees of freedom to minimize the SSE (cost); the effective number of parameters γ = m.

Regularization
With the penalty (λ_j > 0), the freedom to minimize the SSE is reduced; the effective number of parameters γ < m.

Variance Estimation
Fitting the regularized model loses γ degrees of freedom (the effective number of parameters).

Variance Estimation
The noise variance is estimated using the effective number of parameters γ.
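
A hedged reconstruction of the resulting estimate, combining the SSE written via the projection matrix with the γ degrees of freedom lost:

\[
\hat{\sigma}^{2} = \frac{\mathrm{SSE}}{p - \gamma}
= \frac{\mathbf{y}^{\top}P^{2}\mathbf{y}}{p - \gamma},
\qquad p - \gamma = \operatorname{trace}(P).
\]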

Introduction to Radial Basis Function Networks Model Selection

Model Selection
• Goal
  – Choose the fittest model
  – Least prediction error
• Main tools (estimate model fitness)
  – Cross validation
  – Projection matrix
  – Criteria
• Methods
  – Weight decay (ridge regression)
  – Pruning and growing RBFN’s

Empirical Error vs. Model Fitness
• Ultimate goal (generalization): minimize the prediction error.
• Goal of our learning procedure: minimize the empirical error (MSE), hoping that this also minimizes the prediction error.

Estimating Prediction Error
• When you have plenty of data, use independent test sets.
  – E.g., use the same training set to train different models, and choose the best model by comparing on the test set.
• When data is scarce, use:
  – Cross-Validation
  – Bootstrap

Cross Validation
• The simplest and most widely used method for estimating prediction error.
• Partition the original set in several different ways and compute an average score over the different partitions, e.g.,
  – K-fold cross-validation
  – Leave-one-out cross-validation
  – Generalized cross-validation

K-Fold CV
• Split the set D of available input-output patterns into k mutually exclusive subsets D_1, D_2, …, D_k.
• Train and test the learning algorithm k times; each time it is trained on D \ D_i and tested on D_i.

K-Fold CV Available Data

K-Fold CV
The available data D_1, D_2, D_3, …, D_k is split k ways; in each split one D_i serves as the test set and the remaining subsets form the training set, yielding an estimate of σ².
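
A minimal k-fold CV sketch for estimating σ² (the prediction-error variance); the trivial mean-predictor in the usage example only demonstrates the interface, and any model (e.g., the two-stage RBF trainer sketched earlier) can be plugged in.

import numpy as np

def k_fold_cv(X, y, train_fn, predict_fn, k=5, seed=0):
    """Return the average test MSE over k folds as an estimate of sigma^2."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    mse = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])         # train on D \ D_i
        y_hat = predict_fn(model, X[test])           # test on D_i
        mse.append(np.mean((y[test] - y_hat) ** 2))
    return float(np.mean(mse))

# Example with a trivial model (predict the training mean).
cv_estimate = k_fold_cv(
    np.linspace(0, 1, 40)[:, None],
    np.sin(np.linspace(0, 1, 40)),
    train_fn=lambda X, y: y.mean(),
    predict_fn=lambda model, X: np.full(len(X), model),
    k=5,
)
print("k-fold CV estimate of sigma^2:", cv_estimate)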

Leave-One-Out CV
A special case of k-fold CV.
• Split the p available input-output patterns into a training set of size p − 1 and a test set of size 1.
• Average the squared error on the left-out pattern over the p possible ways of partitioning.

Error Variance Predicted by LOO
Notation: the available input-output patterns, the training sets of LOO, and the function learned using D_i as the training set. The LOO estimate of the prediction-error variance is the average of the squared errors on the left-out elements.

Error Variance Predicted by LOO
Given a model, each LOO function is the one with least empirical error for its training set D_i. The LOO estimate of the prediction-error variance serves as an index of the model’s fitness; we want to find the model that also minimizes this.

Error Variance Predicted by LOO
How do we compute this estimate? Are there efficient ways?

Error Variance Predicted by LOO
The squared error for each left-out element can be computed without retraining, using the projection matrix.
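
An efficient form exists for this linear model: the LOO prediction errors can be read off the projection matrix without retraining. Stated as a hedged reconstruction (P is the projection matrix, P_kk its k-th diagonal entry):

\[
\hat{\sigma}^{2}_{\mathrm{LOO}}
= \frac{1}{p}\sum_{k=1}^{p}\left(\frac{(P\mathbf{y})_{k}}{P_{kk}}\right)^{2}
= \frac{1}{p}\,\mathbf{y}^{\top}P\,\bigl(\operatorname{diag}(P)\bigr)^{-2}P\,\mathbf{y}.
\]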

Generalized Cross-Validation

More Criteria Based on CV
• GCV (generalized cross-validation)
• UEV (unbiased estimate of variance)
• Akaike’s information criterion
• FPE (final prediction error)
• BIC (Bayesian information criterion)
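
The formula slides that followed did not survive extraction; the commonly used forms of these criteria for this model, written with SSE = yᵀP²y and the effective number of parameters γ, are reproduced here as a hedged reconstruction:

\[
\hat{\sigma}^{2}_{\mathrm{UEV}} = \frac{\mathbf{y}^{\top}P^{2}\mathbf{y}}{p-\gamma},
\qquad
\hat{\sigma}^{2}_{\mathrm{FPE}} = \frac{p+\gamma}{p\,(p-\gamma)}\,\mathbf{y}^{\top}P^{2}\mathbf{y},
\]
\[
\hat{\sigma}^{2}_{\mathrm{GCV}} = \frac{p\,\mathbf{y}^{\top}P^{2}\mathbf{y}}{\bigl(\operatorname{trace}(P)\bigr)^{2}},
\qquad
\hat{\sigma}^{2}_{\mathrm{BIC}} = \frac{p + (\ln p - 1)\,\gamma}{p\,(p-\gamma)}\,\mathbf{y}^{\top}P^{2}\mathbf{y}.
\]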

Regularization: Standard Ridge Regression
The cost is the SSE plus a penalty on models with large weights; standard ridge regression uses a single regularization parameter shared by all weights.

Solution Review
The solution (weight vector, variance matrix, and projection matrix) is used to compute the model selection criteria.

Example
Width of RBF: r = 0.5.

Example
Width of RBF: r = 0.5. How do we determine the optimal regularization parameter effectively?

Optimizing the Regularization Parameter Re-Estimation Formula

Local Ridge Regression Re-Estimation Formula

Example — Width of RBF

Example — Width of RBF

Example — Width of RBF
There are two local minima. Using the above re-estimation formula, the iteration gets stuck at the nearest local minimum; that is, the solution depends on the initial setting.

Example — Width of RBF
There are two local minima.

Example — Width of RBF
There are two local minima.

Example — Width of RBF
RMSE: root mean squared error. In a real case, it is not available.

Example — Width of RBF
RMSE: root mean squared error. In a real case, it is not available.

Local Ridge Regression
Standard ridge regression uses a single regularization parameter λ; local ridge regression uses a separate parameter λ_j for each basis function.

Local Ridge Regression
In local ridge regression, λ_j → ∞ implies that φ_j(·) can be removed.

The Solutions
Linear regression, standard ridge regression, and local ridge regression each have closed-form solutions for the weights; these are used to compute the model selection criteria.

Optimizing the Regularization Parameters
Incremental operation: P is the current projection matrix; P_j is the projection matrix obtained by removing φ_j(·).

Optimizing the Regularization Parameters
Solve for the optimal λ_j, subject to λ_j ≥ 0, with the other regularization parameters held fixed.

Optimizing the Regularization Parameters
Solve for the optimal λ_j subject to λ_j ≥ 0; if the optimum is λ_j → ∞, remove φ_j(·).

The Algorithm
• Initialize the λ_j’s.
  – E.g., by performing standard ridge regression.
• Repeat the following until GCV converges (a sketch follows below):
  – Randomly select j and compute the optimal λ_j.
  – Perform local ridge regression.
  – If GCV is reduced and λ_j → ∞, remove φ_j(·).
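
A hedged sketch of this loop, under the assumptions made earlier (Gaussian kernels and the design matrix Phi). The per-parameter update simply minimizes GCV over λ_j by a coarse grid search rather than by the slide's closed-form re-estimation formula, which is not recoverable from the extraction; a fixed number of random sweeps stands in for "until GCV converges", and basis functions whose best λ_j is effectively infinite are removed.

import numpy as np

def gcv(Phi, y, lams):
    """GCV score for (local) ridge regression with per-kernel penalties lams."""
    p = Phi.shape[0]
    A_inv = np.linalg.inv(Phi.T @ Phi + np.diag(lams))
    P = np.eye(p) - Phi @ A_inv @ Phi.T
    return p * float(y @ P @ P @ y) / np.trace(P) ** 2

def local_ridge(Phi, y, n_sweeps=50, seed=0):
    rng = np.random.default_rng(seed)
    lam_grid = np.logspace(-8, 8, 33)      # coarse 1-D search grid (assumption)
    m = Phi.shape[1]
    lams = np.ones(m)                      # simple initialization of the lambda_j's
    keep = np.ones(m, dtype=bool)          # kernels still in the model
    best = gcv(Phi, y, lams)
    for _ in range(n_sweeps):
        j = int(rng.integers(m))
        if not keep[j]:
            continue
        for lam in lam_grid:               # search lambda_j, others held fixed
            trial = lams.copy()
            trial[j] = lam
            score = gcv(Phi[:, keep], y, trial[keep])
            if score < best:
                best, lams = score, trial
        if lams[j] >= lam_grid[-1]:        # lambda_j effectively infinite:
            keep[j] = False                # remove phi_j from the model
            best = gcv(Phi[:, keep], y, lams[keep])
    return lams, keep, best

# Usage (with Phi and y from the earlier sketch):
# lams, keep, score = local_ridge(Phi, y)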