Ch 1. Introduction: Pattern Recognition and Machine Learning

Ch 1. Introduction. Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Updated 2007-03-27 by J.-H. Eom (2nd round revision). Originally summarized by K.-I. Kim. Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/

Contents
- 1.1 Example: Polynomial Curve Fitting
- 1.2 Probability Theory
  - 1.2.1 Probability densities
  - 1.2.2 Expectations and covariances
  - 1.2.3 Bayesian probabilities
  - 1.2.4 The Gaussian distribution
  - 1.2.5 Curve fitting re-visited
  - 1.2.6 Bayesian curve fitting
- 1.3 Model Selection

Pattern Recognition
- Training set: $\mathbf{x}_1, \ldots, \mathbf{x}_N$
- Target vector: $\mathbf{t}$
- Training (learning) phase
  - Determine the function $y(\mathbf{x})$ from the training data
- Generalization: performance on new examples (a test set)
- Preprocessing
  - Feature extraction

Supervised, Unsupervised and Reinforcement Learning
- Supervised learning: with a target vector
  - Classification
  - Regression
- Unsupervised learning: without a target vector
  - Clustering
  - Density estimation
  - Visualization
- Reinforcement learning: maximize a reward
  - Trade-off between exploration and exploitation

1.1 Example: Polynomial Curve Fitting
- N observations: $\mathbf{x} = (x_1, \ldots, x_N)^T$ with targets $\mathbf{t} = (t_1, \ldots, t_N)^T$
- Fit the data with a polynomial function: $y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j$
- By minimizing an error function, the sum of squares of the errors: $E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2$ (a minimal fitting sketch follows below)
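
A minimal numpy sketch of this procedure, assuming synthetic data in the spirit of the chapter's sin(2πx) running example (the sample size, noise level, and order M below are illustrative choices, not values from the slides):

```python
import numpy as np

# Synthetic data: targets are sin(2*pi*x) plus Gaussian noise.
rng = np.random.default_rng(0)
N = 10
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

def fit_polynomial(x, t, M):
    """Minimize E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2 for
    y(x, w) = sum_j w_j x^j via linear least squares."""
    Phi = np.vander(x, M + 1, increasing=True)  # design matrix, Phi[n, j] = x_n**j
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

w_star = fit_polynomial(x, t, M=3)
```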

Model Selection & Over-fitting (1/2)
- Choosing the order M of the polynomial
- [Figure: polynomial fits of varying order M to the same data set]

Model Selection & Over-fitting (2/2)
- RMS (root-mean-square) error: $E_{RMS} = \sqrt{2E(\mathbf{w}^*)/N}$
- Too large an order M → over-fitting
- The more data, the better the generalization
- Over-fitting is a general property of maximum likelihood

Regularization
- Controls the over-fitting phenomenon
  - Adds a penalty term to the error function: $\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$ (see the sketch below)
  - Known as shrinkage in statistics
  - The quadratic-penalty case is ridge regression
  - Called weight decay in neural networks
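
A hedged sketch of the penalized fit, reusing the polynomial setup above; the closed form follows from setting the gradient of the regularized error to zero:

```python
import numpy as np

def fit_polynomial_ridge(x, t, M, lam):
    """Minimize 1/2 * sum_n (y(x_n, w) - t_n)^2 + lam/2 * ||w||^2.
    The gradient vanishes where (Phi^T Phi + lam * I) w = Phi^T t."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)
```

Larger λ shrinks the coefficients toward zero (hence "shrinkage"), which tames the wild oscillations of a high-order fit at the cost of some bias.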

1.2 Probability Theory
- Running example: a red box and a blue box, each containing some apples and oranges; a box is chosen at random, then a fruit is drawn from it
- "What is the overall probability that the selection procedure will pick an apple?"
- "Given that we have chosen an orange, what is the probability that the box we chose was the blue one?"

Rules of Probability (1/2)
- Joint probability: $p(X, Y)$
- Marginal probability: $p(X) = \sum_Y p(X, Y)$
- Conditional probability: $p(Y|X) = p(X, Y) / p(X)$

Rules of Probability (2/2)
- Sum rule: $p(X) = \sum_Y p(X, Y)$
- Product rule: $p(X, Y) = p(Y|X)\,p(X)$
- Bayes' theorem: $p(Y|X) = \frac{p(X|Y)\,p(Y)}{p(X)}$
  - Posterior: $p(Y|X)$; likelihood: $p(X|Y)$; prior: $p(Y)$; normalizing constant: $p(X) = \sum_Y p(X|Y)\,p(Y)$ (a numeric check on the box example follows below)
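
A small numeric check of the two quoted questions. The numbers are the book's illustrative ones (red box: 2 apples and 6 oranges; blue box: 3 apples and 1 orange; the red box is picked with probability 0.4):

```python
# Priors p(B) and likelihoods p(F | B) for the box-and-fruit example.
p_box = {"red": 0.4, "blue": 0.6}
p_fruit_given_box = {
    "red":  {"apple": 0.25, "orange": 0.75},
    "blue": {"apple": 0.75, "orange": 0.25},
}

# Sum rule: p(F = apple) = sum_B p(F = apple | B) p(B)
p_apple = sum(p_fruit_given_box[b]["apple"] * p_box[b] for b in p_box)

# Bayes' theorem: p(B = blue | F = orange)
p_orange = sum(p_fruit_given_box[b]["orange"] * p_box[b] for b in p_box)
p_blue_given_orange = p_fruit_given_box["blue"]["orange"] * p_box["blue"] / p_orange

print(p_apple)              # 0.55
print(p_blue_given_orange)  # 0.333...
```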

[Figure: N = 60 data points drawn from a joint distribution over two variables, with histogram estimates of the joint, marginal, and conditional distributions]

1.2.1 Probability densities
- Probabilities with respect to continuous variables
  - Probability density over x: $p(x) \geq 0$ with $p(x \in (a, b)) = \int_a^b p(x)\,dx$ and $\int_{-\infty}^{\infty} p(x)\,dx = 1$
- Cumulative distribution function: $P(z) = \int_{-\infty}^{z} p(x)\,dx$
- Sum rule: $p(x) = \int p(x, y)\,dy$
- Product rule: $p(x, y) = p(y|x)\,p(x)$

1.2.2 Expectations and Covariances
- Expectation of f(x): the average value of f(x) under a probability distribution p(x)
  - $E[f] = \sum_x p(x) f(x)$ (discrete) or $E[f] = \int p(x) f(x)\,dx$ (continuous)
  - Conditional expectation: $E_x[f|y] = \sum_x p(x|y) f(x)$
- Variance: a measure of how much variability there is in f around its mean, $var[f] = E[(f(x) - E[f(x)])^2] = E[f(x)^2] - E[f(x)]^2$
- Covariance: the extent to which x and y vary together, $cov[x, y] = E_{x,y}[xy] - E[x]E[y]$ (a Monte Carlo check follows below)
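
A quick Monte Carlo check of these identities; the choice of f and the linear relation between x and y are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(size=100_000)  # correlated with x by construction

f = np.exp  # any function f(x)

E_f = f(x).mean()                        # E[f] as a sample average over p(x)
var_f = (f(x) ** 2).mean() - E_f ** 2    # var[f] = E[f^2] - E[f]^2
cov_xy = (x * y).mean() - x.mean() * y.mean()  # cov[x, y], close to 2 here
```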

1.2.3 Bayesian Probabilities – Frequentist vs. Bayesian
- Bayes' theorem: $p(\mathbf{w}|\mathcal{D}) = \frac{p(\mathcal{D}|\mathbf{w})\,p(\mathbf{w})}{p(\mathcal{D})}$
- Likelihood: $p(\mathcal{D}|\mathbf{w})$
- Frequentist view
  - w is considered a fixed parameter determined by an 'estimator'
    - Maximum likelihood: the error function is the negative log likelihood
    - Error bars: obtained from the distribution of possible data sets, e.g. by the bootstrap (a sketch follows below)
- Bayesian view
  - There is a single data set $\mathcal{D}$ (the one that is actually observed)
  - The uncertainty in the parameters is expressed as a probability distribution over w
  - Advantage: the inclusion of prior knowledge arises naturally
    - Leads to less extreme conclusions by incorporating the prior
    - A non-informative prior can be used when little is known
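
A minimal bootstrap sketch for frequentist error bars, assuming the estimator of interest is the sample mean (any estimator works):

```python
import numpy as np

def bootstrap_error_bar(data, estimator, B=1000, rng=None):
    """Resample the data set with replacement B times and use the spread
    of the re-computed estimates as an error bar for the estimator."""
    rng = rng or np.random.default_rng(0)
    n = len(data)
    estimates = np.array(
        [estimator(data[rng.integers(0, n, size=n)]) for _ in range(B)]
    )
    return estimates.std()

data = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=50)
print(bootstrap_error_bar(data, np.mean))  # close to 2 / sqrt(50)
```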

1.2.3 Bayesian Probabilities – Expansion of Bayesian Application
- The full Bayesian procedure long had limited application
  - even though it has its origins in the 18th century
  - it requires marginalizing over the whole of parameter space
- Markov chain Monte Carlo (MCMC) sampling methods
  - computationally intensive
  - used for small-scale problems (a minimal sampler sketch follows below)
- Highly efficient deterministic approximation schemes
  - e.g. variational Bayes, expectation propagation
  - alternatives to sampling methods
  - have allowed Bayesian techniques to be used in large-scale problems
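
For concreteness, a minimal random-walk Metropolis sampler (a simple MCMC method; the target density and step size below are illustrative assumptions, not something specified in the slides):

```python
import numpy as np

def metropolis(log_p, x0, n_samples, step=0.5, rng=None):
    """Random-walk Metropolis: draws samples whose long-run distribution is
    proportional to exp(log_p), so marginalization integrals can be
    approximated by averages over the samples."""
    rng = rng or np.random.default_rng(0)
    x, lp, samples = x0, log_p(x0), []
    for _ in range(n_samples):
        prop = x + rng.normal(scale=step)
        lp_prop = log_p(prop)
        if np.log(rng.random()) < lp_prop - lp:  # accept with prob min(1, ratio)
            x, lp = prop, lp_prop
        samples.append(x)
    return np.array(samples)

# e.g. sample from a standard Gaussian "posterior"
s = metropolis(lambda x: -0.5 * x**2, x0=0.0, n_samples=10_000)
```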

1.2.4 Gaussian Distribution
- Gaussian distribution for a single real-valued variable x: $\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}$
  - $\mu$: mean; $\sigma^2$: variance; $\sigma$: standard deviation; $\beta = 1/\sigma^2$: precision
- D-dimensional multivariate Gaussian distribution, for a D-dimensional vector $\mathbf{x}$ of continuous variables: $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right\}$
  - $\boldsymbol{\mu}$: mean vector; $\boldsymbol{\Sigma}$: covariance matrix; $|\boldsymbol{\Sigma}|$: determinant
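
The two densities transcribe directly into numpy; a sketch of the formulas above written out (not a library API):

```python
import numpy as np

def gaussian(x, mu, sigma2):
    """Univariate N(x | mu, sigma^2)."""
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

def multivariate_gaussian(x, mu, Sigma):
    """D-dimensional N(x | mu, Sigma) using the determinant and inverse of Sigma."""
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm
```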

1.2.4 Gaussian Distribution – Example (1/2)
- Estimating the unknown parameters from data points that are i.i.d.: $p(\mathbf{x}|\mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n|\mu, \sigma^2)$
- Maximizing the (log) likelihood with respect to $\mu$ gives the sample mean: $\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n$
- Evaluating subsequently, maximizing with respect to $\sigma^2$ gives the sample variance: $\sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2$

1.2.4 Gaussian Distribution – Example (2/2)
- Bias phenomenon
  - A limitation of the maximum likelihood approach
  - Related to over-fitting
- Taking expectations over data sets of size N (a simulation follows below),
  - we obtain the correct mean: $E[\mu_{ML}] = \mu$
  - but we underestimate the variance: $E[\sigma^2_{ML}] = \frac{N-1}{N}\sigma^2$
- [Figure: true mean vs. sample means across data sets of size N]
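
A simulation of the bias, assuming a standard Gaussian and the extreme case N = 2, where the underestimation factor (N-1)/N = 1/2 is easiest to see:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true, N = 0.0, 1.0, 2

mu_ml, var_ml = [], []
for _ in range(100_000):                          # many independent data sets
    x = rng.normal(mu_true, sigma_true, size=N)
    mu_ml.append(x.mean())                        # mu_ML
    var_ml.append(((x - x.mean()) ** 2).mean())   # sigma^2_ML (the biased estimate)

print(np.mean(mu_ml))   # ~ 0.0:  E[mu_ML] = mu
print(np.mean(var_ml))  # ~ 0.5:  E[sigma^2_ML] = (N - 1)/N * sigma^2
```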

1.2.5 Curve Fitting Re-visited (1/2)
- Goal in the curve fitting problem
  - Prediction for the target variable t given some new input variable x
- Assume a Gaussian distribution for t: $p(t|x, \mathbf{w}, \beta) = \mathcal{N}(t|y(x, \mathbf{w}), \beta^{-1})$
- Determine the unknown w and $\beta$ by maximum likelihood using training data $\{\mathbf{x}, \mathbf{t}\}$
- Likelihood, for data drawn i.i.d. from this distribution: $p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n|y(x_n, \mathbf{w}), \beta^{-1})$
- In log form: $\ln p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi)$

1.2.5 Curve Fitting Re-visited (2/2)
- Maximizing the likelihood = minimizing the sum-of-squares error function (take the negative log likelihood)
- Determining the precision with ML: $\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}_{ML}) - t_n\}^2$
- Predictive distribution
  - Predictions for new values of x (using $\mathbf{w}_{ML}$, $\beta_{ML}$): $p(t|x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}(t|y(x, \mathbf{w}_{ML}), \beta_{ML}^{-1})$ (a sketch follows below)
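
Continuing the earlier polynomial sketch, a hedged implementation of the ML fit with precision (the data-generating choices remain illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

Phi = np.vander(x, M + 1, increasing=True)
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # maximizes the likelihood in w

# 1/beta_ML = (1/N) * sum_n (y(x_n, w_ML) - t_n)^2
residuals = Phi @ w_ml - t
beta_ml = 1.0 / np.mean(residuals ** 2)

# Predictive distribution at a new x: N(t | y(x, w_ML), 1/beta_ML)
x_new = 0.5
mean = (np.vander([x_new], M + 1, increasing=True) @ w_ml)[0]
std = np.sqrt(1.0 / beta_ml)
```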

Maximum A Posteriori (MAP)
- Introduce a prior (a Gaussian distribution over w): $p(\mathbf{w}|\alpha) = \mathcal{N}(\mathbf{w}|\mathbf{0}, \alpha^{-1}\mathbf{I})$
  - $\alpha$: a hyper-parameter
- Determine w as the most probable value of w given the data, i.e. by maximizing the posterior $p(\mathbf{w}|\mathbf{x}, \mathbf{t}) \propto p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta)\,p(\mathbf{w}|\alpha)$
- Taking the negative logarithm and combining the previous terms, the maximum of the posterior is given by the minimum of $\frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\alpha}{2} \mathbf{w}^T \mathbf{w}$
- Maximizing the posterior distribution = minimizing the regularized sum-of-squares error function (1.4), with regularization parameter $\lambda = \alpha/\beta$ (so the earlier ridge sketch computes the MAP solution)

1.2.6 Bayesian Curve Fitting
- Marginalization over w: $p(t|x, \mathbf{x}, \mathbf{t}) = \int p(t|x, \mathbf{w})\,p(\mathbf{w}|\mathbf{x}, \mathbf{t})\,d\mathbf{w}$
- The predictive distribution is again Gaussian, $p(t|x, \mathbf{x}, \mathbf{t}) = \mathcal{N}(t|m(x), s^2(x))$, with
  - mean $m(x) = \beta\,\phi(x)^T \mathbf{S} \sum_{n=1}^{N} \phi(x_n) t_n$
  - variance $s^2(x) = \beta^{-1} + \phi(x)^T \mathbf{S}\,\phi(x)$
  - where $\mathbf{S}^{-1} = \alpha\mathbf{I} + \beta \sum_{n=1}^{N} \phi(x_n)\phi(x_n)^T$ and $\phi(x) = (x^0, \ldots, x^M)^T$ (a sketch follows below)
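
A sketch of these formulas in numpy, assuming the polynomial basis and illustrative values of the hyper-parameters alpha and beta:

```python
import numpy as np

def bayesian_predict(x_new, x, t, M, alpha, beta):
    """Predictive mean m(x) and variance s^2(x) for the polynomial basis
    phi(x) = (1, x, ..., x^M)^T, following the formulas above."""
    Phi = np.vander(x, M + 1, increasing=True)           # rows are phi(x_n)^T
    S_inv = alpha * np.eye(M + 1) + beta * Phi.T @ Phi   # S^{-1}
    phi_new = np.vander([x_new], M + 1, increasing=True)[0]
    S_phi = np.linalg.solve(S_inv, phi_new)              # S phi(x)
    m = beta * S_phi @ (Phi.T @ t)                       # m(x)
    s2 = 1.0 / beta + phi_new @ S_phi                    # s^2(x)
    return m, s2

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)
m, s2 = bayesian_predict(0.5, x, t, M=3, alpha=5e-3, beta=11.1)
```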

1.3 Model Selection
- Proper model complexity → good generalization and the best model
- Measuring the generalization performance
  - If data are plentiful, divide them into training, validation, and test sets
  - Otherwise, cross-validate (a sketch follows below)
    - Leave-one-out technique in the extreme case
    - Drawbacks
      - Expensive computation (the number of training runs grows with the number of folds)
      - A model with multiple complexity parameters may require a number of training runs exponential in the number of such parameters
  - Alternative measures of performance
    - e.g. Akaike information criterion (AIC), Bayesian information criterion (BIC)
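
A minimal leave-one-out cross-validation sketch for choosing the polynomial order M (the data and the candidate range of M are illustrative):

```python
import numpy as np

def loo_cv_error(x, t, M):
    """Leave-one-out CV: fit on N-1 points, test on the held-out point,
    and average the squared prediction errors."""
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        Phi = np.vander(x[mask], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi, t[mask], rcond=None)
        pred = np.vander(x[i:i + 1], M + 1, increasing=True)[0] @ w
        errs.append((pred - t[i]) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=12)
best_M = min(range(9), key=lambda M: loo_cv_error(x, t, M))
```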