Ch 1 Introduction Pattern Recognition and Machine Learning

Example: Polynomial Curve Fitting
- N observations
- Minimizing error function
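The equations on this slide did not survive extraction. As a sketch, assuming the slide follows the textbook's notation (polynomial order $M$, weights $\mathbf{w}$, inputs $x_n$, targets $t_n$), the model and the sum-of-squares error function would read:

```latex
\begin{align}
y(x, \mathbf{w}) &= \sum_{j=0}^{M} w_j x^j \\
E(\mathbf{w}) &= \frac{1}{2} \sum_{n=1}^{N} \bigl\{ y(x_n, \mathbf{w}) - t_n \bigr\}^2
\end{align}
```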

Model Selection & Over-fitting (1/2)

Regularization
- Shrinkage
- Ridge regression
- Weight decay
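A minimal sketch of ridge regression for the polynomial model, assuming the regularized error 0.5 * sum((y - t)^2) + 0.5 * lam * ||w||^2. The data (noisy samples of sin(2*pi*x)), the seed, the helper name `ridge_fit` and the two values of `lam` are illustrative assumptions, not taken from the slide. For simplicity the penalty here also covers the constant coefficient.

```python
import numpy as np

def ridge_fit(x, t, M, lam):
    """Order-M polynomial weights minimizing
    0.5 * sum((y - t)**2) + 0.5 * lam * ||w||**2 (closed form)."""
    Phi = np.vander(x, M + 1, increasing=True)   # columns x**0 .. x**M
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

# Hypothetical data set: 10 noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# A larger lam shrinks the weights more strongly ("weight decay").
w_small = ridge_fit(x, t, M=9, lam=1e-8)
w_large = ridge_fit(x, t, M=9, lam=1e-2)
norm_small = float(np.linalg.norm(w_small))
norm_large = float(np.linalg.norm(w_large))
print(f"||w|| with lam=1e-8: {norm_small:.2f}")
print(f"||w|| with lam=1e-2: {norm_large:.2f}")
```

The weight norm is the quantity the penalty controls, so comparing the two norms shows the shrinkage effect directly.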

Rules of Probability (1/2)
- Joint probability
- Marginal probability
- Conditional probability

Rules of Probability (2/2)
- Sum rule
- Product rule
- Bayes' theorem
  - Likelihood
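The three rules named above, written out for two random variables $X$ and $Y$ (the symbols are chosen here for illustration; the slide's own equations were lost in extraction):

```latex
\begin{align}
p(X) &= \sum_{Y} p(X, Y) && \text{(sum rule)} \\
p(X, Y) &= p(Y \mid X)\, p(X) && \text{(product rule)} \\
p(Y \mid X) &= \frac{p(X \mid Y)\, p(Y)}{p(X)} && \text{(Bayes' theorem)}
\end{align}
```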

Probability densities

Expectations and Covariances
- Expectation
- Variance
- Covariance
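The slide's definitions were lost in extraction. The standard definitions for a discrete variable, given here as a sketch in the textbook's style of notation, are:

```latex
\begin{align}
\mathbb{E}[f] &= \sum_{x} p(x)\, f(x) \\
\operatorname{var}[f] &= \mathbb{E}\bigl[(f(x) - \mathbb{E}[f(x)])^2\bigr]
  = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2 \\
\operatorname{cov}[x, y] &= \mathbb{E}_{x,y}\bigl[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\bigr]
  = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y]
\end{align}
```

For continuous variables the sum over $x$ is replaced by an integral against the density $p(x)$.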

Gaussian distribution
- D-dimensional multivariate Gaussian distribution
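For reference, the standard univariate and $D$-dimensional Gaussian densities (mean $\mu$ or $\boldsymbol{\mu}$, variance $\sigma^2$, covariance matrix $\boldsymbol{\Sigma}$), reconstructed here because the slide's equations were lost:

```latex
\begin{align}
\mathcal{N}(x \mid \mu, \sigma^2) &=
  \frac{1}{(2\pi\sigma^2)^{1/2}}
  \exp\left\{ -\frac{1}{2\sigma^2}(x - \mu)^2 \right\} \\
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) &=
  \frac{1}{(2\pi)^{D/2}\, |\boldsymbol{\Sigma}|^{1/2}}
  \exp\left\{ -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}}
  \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}
\end{align}
```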

Gaussian distribution - Example (2/2)
- Bias phenomenon
  - Limitation of the maximum likelihood approach

Curve Fitting Re-visited (2/2)
- Maximizing likelihood = minimizing the sum-of-squares error function

Maximum Posterior (MAP)
- Add prior probability
  - α: hyperparameter
  - Minimum of [equation not recovered] equals [text truncated]
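The slide's equations are missing. A sketch of the usual MAP argument, assuming a zero-mean isotropic Gaussian prior with precision $\alpha$ and Gaussian noise with precision $\beta$ (both symbols follow the textbook and are assumptions here):

```latex
\begin{align}
p(\mathbf{w} \mid \alpha) &= \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}) \\
p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) &\propto
  p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha) \\
-\ln p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) &=
  \frac{\beta}{2} \sum_{n=1}^{N} \bigl\{ y(x_n, \mathbf{w}) - t_n \bigr\}^2
  + \frac{\alpha}{2}\, \mathbf{w}^{\mathsf{T}} \mathbf{w} + \text{const}
\end{align}
```

Under these assumptions, maximizing the posterior is the same as minimizing a regularized sum-of-squares error with regularization coefficient $\lambda = \alpha / \beta$.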

Bayesian Curve Fitting
- Marginalization
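Marginalization here means integrating the prediction over the posterior distribution of the weights rather than using a single point estimate. As a sketch in the textbook's notation (training inputs $\mathbf{x}$, training targets $\mathbf{t}$, new input $x$; dependence on the hyperparameters is left implicit):

```latex
\begin{equation}
p(t \mid x, \mathbf{x}, \mathbf{t}) =
  \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, \mathrm{d}\mathbf{w}
\end{equation}
```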

Ch 1. Introduction
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Summarized by K. I. Kim
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/

Contents
- 1.1 Example: Polynomial Curve Fitting
- 1.2 Probability Theory
  - 1.2.1 Probability densities
  - 1.2.2 Expectations and covariance
  - 1.2.3 Bayesian probabilities
  - 1.2.4 The Gaussian distribution
  - 1.2.5 Curve fitting re-visited
  - 1.2.6 Bayesian curve fitting
- 1.3 Model Selection
(C) 2006, SNU Biointelligence Lab, http://bi.snu.ac.kr/

Pattern Recognition
- Training set: {x_1, ..., x_N}
- Target vector: t
- Training (learning) phase
  - Determine the function y(x)
- Generalization
  - Test set
- Preprocessing
  - Feature selection

Supervised, Unsupervised and Reinforcement Learning
- Supervised learning: with target vector
  - Classification
  - Regression
- Unsupervised learning: without target vector
  - Clustering
  - Density estimation
  - Visualization
- Reinforcement learning: maximize a reward
  - Trade-off between exploration & exploitation

Model Selection & Over-fitting (2/2)
- RMS (root-mean-square) error
  - Too large → over-fitting
- The more data, the better the generalization
- Over-fitting is a general property of maximum likelihood
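A minimal sketch of how the RMS error exposes over-fitting, assuming polynomial models fitted by least squares to noisy samples of sin(2*pi*x). The data generator, noise level, seed, the chosen orders and all function names are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def design_matrix(x, M):
    """Columns are x**0, x**1, ..., x**M."""
    return np.vander(x, M + 1, increasing=True)

def fit_polynomial(x, t, M):
    """Least-squares weights, i.e. the minimizer of the sum-of-squares error."""
    w, *_ = np.linalg.lstsq(design_matrix(x, M), t, rcond=None)
    return w

def rms_error(x, t, w):
    """Root-mean-square residual of the fitted polynomial on (x, t)."""
    y = design_matrix(x, len(w) - 1) @ w
    return float(np.sqrt(np.mean((y - t) ** 2)))

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice

def make_data(n):
    """Hypothetical data: n noisy samples of sin(2*pi*x) on [0, 1]."""
    x = np.linspace(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

train_rms, test_rms = {}, {}
for M in (1, 3, 9):
    w = fit_polynomial(x_train, t_train, M)
    train_rms[M] = rms_error(x_train, t_train, w)
    test_rms[M] = rms_error(x_test, t_test, w)
    print(f"M={M}: train RMS={train_rms[M]:.3f}, test RMS={test_rms[M]:.3f}")
```

With 10 training points, the order-9 polynomial can pass through every point, so its training error drops to essentially zero. The test error is the quantity that indicates whether the fit generalizes.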

Probability Theory
- “What is the overall probability that the selection procedure will pick an apple?”
- “Given that we have chosen an orange, what is the probability that the box we chose was the blue one?”
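Both questions can be answered with the sum rule and Bayes' theorem. The sketch below assumes the numbers of the textbook's two-box example, which the slide does not state: the red box is picked with probability 4/10 and holds 2 apples and 6 oranges, and the blue box is picked with probability 6/10 and holds 3 apples and 1 orange. The function names are hypothetical.

```python
from fractions import Fraction

# Assumed numbers from the textbook's two-box example (not on the slide).
p_box = {"red": Fraction(4, 10), "blue": Fraction(6, 10)}
contents = {
    "red": {"apple": 2, "orange": 6},
    "blue": {"apple": 3, "orange": 1},
}

def p_fruit_given_box(fruit, box):
    """Conditional probability p(fruit | box) from the box contents."""
    counts = contents[box]
    return Fraction(counts[fruit], sum(counts.values()))

def p_fruit(fruit):
    """Sum rule: marginalize the joint p(fruit, box) over the boxes."""
    return sum(p_fruit_given_box(fruit, b) * p_box[b] for b in p_box)

def p_box_given_fruit(box, fruit):
    """Bayes' theorem: p(box | fruit) = p(fruit | box) p(box) / p(fruit)."""
    return p_fruit_given_box(fruit, box) * p_box[box] / p_fruit(fruit)

p_apple = p_fruit("apple")
p_blue_given_orange = p_box_given_fruit("blue", "orange")
print(p_apple, p_blue_given_orange)  # prints: 11/20 1/3
```

Exact fractions are used so that the results can be compared with a hand calculation.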

Bayesian Probabilities - Frequentist vs. Bayesian
- Likelihood: p(D|w)
- Frequentist
  - w: a fixed parameter determined by an 'estimator'
    - Maximum likelihood: error function = -ln p(D|w)
    - Error bars: obtained from the distribution of possible data sets
      - Bootstrap
- Bayesian
  - A single data set
  - A probability distribution over w: the uncertainty in the parameters
  - Prior knowledge
    - Noninformative prior

Bayesian Probabilities - Expansion of Bayesian Application
- Limited application of the full Bayesian procedure
  - From the 18th century
  - Marginalize over the whole of parameter space
- Markov chain Monte Carlo
  - Small-scale problems
- Highly efficient deterministic approximation schemes
  - e.g. variational Bayes, expectation propagation
  - Large-scale problems

Gaussian distribution - Example (1/2)
Getting unknown parameters
- Data points are i.i.d.
- Maximizing the likelihood with respect to the mean
  - Sample mean
- Maximizing the likelihood with respect to the variance
  - Sample variance
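A short sketch of the two maximum likelihood estimates and of their bias, using only the standard library. The function names, the seed and the simulation settings (data sets of size 5 drawn from a unit-variance Gaussian) are illustrative assumptions. The ML variance divides by N, so its average over many data sets should come out near (N - 1) / N times the true variance.

```python
import random

def gaussian_ml(data):
    """Maximum likelihood estimates for a univariate Gaussian:
    the sample mean and the sample variance (dividing by N)."""
    n = len(data)
    mu_ml = sum(data) / n
    var_ml = sum((x - mu_ml) ** 2 for x in data) / n
    return mu_ml, var_ml

def unbiased_variance(data):
    """Bias-corrected variance estimate (dividing by N - 1)."""
    n = len(data)
    _, var_ml = gaussian_ml(data)
    return var_ml * n / (n - 1)

# Small deterministic check.
mu, var = gaussian_ml([1.0, 2.0, 3.0, 4.0])
print(mu, var)  # prints: 2.5 1.25

# Simulation: average the ML variance over many small data sets.
random.seed(0)
n, trials = 5, 20000
avg_var_ml = 0.0
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    avg_var_ml += gaussian_ml(sample)[1] / trials
print(f"average ML variance: {avg_var_ml:.3f} (true variance is 1.0)")
```

The systematic underestimate in the simulation is an instance of the bias that comes with the maximum likelihood approach.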

Curve Fitting Re-visited (1/2)
- Goal in the curve fitting problem
  - Prediction for the target variable t given some new input variable x
- Determine the unknown w and β by maximum likelihood
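The likelihood on this slide was lost in extraction. A sketch of the derivation, assuming Gaussian noise with precision $\beta$ around the polynomial $y(x, \mathbf{w})$ as in the textbook:

```latex
\begin{align}
p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) &=
  \prod_{n=1}^{N} \mathcal{N}\bigl(t_n \mid y(x_n, \mathbf{w}), \beta^{-1}\bigr) \\
\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) &=
  -\frac{\beta}{2} \sum_{n=1}^{N} \bigl\{ y(x_n, \mathbf{w}) - t_n \bigr\}^2
  + \frac{N}{2} \ln \beta - \frac{N}{2} \ln (2\pi) \\
\frac{1}{\beta_{\mathrm{ML}}} &=
  \frac{1}{N} \sum_{n=1}^{N} \bigl\{ y(x_n, \mathbf{w}_{\mathrm{ML}}) - t_n \bigr\}^2
\end{align}
```

Only the first term of the log likelihood depends on $\mathbf{w}$, so maximizing the likelihood with respect to $\mathbf{w}$ is the same as minimizing the sum-of-squares error.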

Model Selection
Proper model complexity → good generalization & best model
- Measuring the generalization performance
  - If data are plentiful, divide into training, validation & test sets
  - Otherwise, cross-validate
    - Leave-one-out technique
    - Drawbacks
      - Expensive computation
      - Using separate data → multiple complexity parameters
  - New measures of performance
    - e.g. Akaike information criterion (AIC), Bayesian information criterion (BIC)
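A minimal sketch of leave-one-out cross-validation used to choose a polynomial order. The data set (20 noisy samples of sin(2*pi*x)), the seed, the candidate orders and the function name `loo_cv_rms` are assumptions made for illustration. Each candidate needs one fit per held-out point, which is the computational cost listed above as a drawback.

```python
import numpy as np

def loo_cv_rms(x, t, M):
    """Leave-one-out RMS prediction error of an order-M polynomial fit."""
    sq_errors = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i                      # hold out point i
        Phi = np.vander(x[keep], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi, t[keep], rcond=None)  # fit on the rest
        pred = np.vander(x[i:i + 1], M + 1, increasing=True) @ w
        sq_errors.append(float((pred[0] - t[i]) ** 2))
    return float(np.sqrt(np.mean(sq_errors)))

# Hypothetical data set.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Score each candidate order and keep the one with the lowest held-out error.
scores = {M: loo_cv_rms(x, t, M) for M in range(0, 9)}
best_M = min(scores, key=scores.get)
for M, s in scores.items():
    print(f"M={M}: leave-one-out RMS={s:.3f}")
print("selected order:", best_M)
```

The same loop generalizes to S-fold cross-validation by holding out a group of points per iteration instead of a single point.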