Parameter Estimation: Bayesian Estimation
Chapter 3 (Duda et al.) – Sections 3.3-3.7
CS 479/679 Pattern Recognition
Dr. George Bebis
Bayesian Estimation (BE)
• θ is assumed to be a random variable with a prior density p(θ).
• Using the training data D (through the likelihood p(D/θ)), we turn this prior into a posterior density p(θ/D).
• Ideally, the data will sharpen the posterior p(θ/D), that is, reduce our uncertainty about the parameters.
Role of Training Examples in Classification
• Bayes’ rule allows us to compute the posterior probabilities P(ωi/x):
P(ωi/x) = p(x/ωi) P(ωi) / p(x)
• Consider the role of the training examples D by introducing them in the computation of the posterior probabilities:
P(ωi/x, D) = p(x/ωi, D) P(ωi/D) / p(x/D)
Role of Training Examples (cont’d)
• Marginalizing p(x/D) over the classes, and assuming that the samples in Di carry information only about class ωi (i.e., each term uses only the samples from its own class):
P(ωi/x, D) = p(x/ωi, Di) P(ωi/Di) / Σj p(x/ωj, Dj) P(ωj/Dj)
Role of Training Examples (cont’d)
• The training examples are important in determining both the class-conditional densities and the prior probabilities.
• For simplicity, replace P(ωi/Di) with P(ωi):
P(ωi/x, D) = p(x/ωi, Di) P(ωi) / Σj p(x/ωj, Dj) P(ωj)
Role of Training Examples (cont’d)
• Need to estimate p(x/ωi, Di) for every class ωi.
• If the samples in Dj give no information about θi (for j ≠ i), we need to solve c independent problems of the form: “Given D, estimate p(x/D)” — one per class, as in the sketch below.
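A minimal sketch of how the simplified rule above could be used for classification, assuming each class-conditional density p(x/ωi, Di) has already been estimated from its own samples Di. All distributions, parameters, and priors below are illustrative assumptions, not values from the slides:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical per-class density estimates p(x/wi, Di), e.g., Gaussians
# fit independently to each class's samples Di (parameters are made up).
class_densities = [norm(loc=0.0, scale=1.0),   # p(x/w1, D1)
                   norm(loc=3.0, scale=1.5)]   # p(x/w2, D2)
priors = np.array([0.6, 0.4])                  # P(w1), P(w2)

def posteriors(x):
    """P(wi/x, D) = p(x/wi, Di) P(wi) / sum_j p(x/wj, Dj) P(wj)."""
    likelihoods = np.array([d.pdf(x) for d in class_densities])
    joint = likelihoods * priors
    return joint / joint.sum()

print(posteriors(1.0))  # posterior probability for each class
```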
BE Approach
• Estimate p(x/D) by marginalizing over θ:
p(x/D) = ∫ p(x, θ/D) dθ = ∫ p(x/θ, D) p(θ/D) dθ
• Since p(x/θ, D) = p(x/θ) (once θ is known, x no longer depends on D), we have the BE solution:
p(x/D) = ∫ p(x/θ) p(θ/D) dθ
where p(x/θ) is the assumed model (e.g., Gaussian).
BE vs ML/MAP
• ML/MAP makes a point estimate: p(x/D) ≈ p(x/θ̂)
• BE estimates a distribution: p(x/D) = ∫ p(x/θ) p(θ/D) dθ
• Note: the BE solution might not be of the exact parametric form assumed.
Interpretation of BE Solution
• If we are less certain about the exact value of θ, consider a weighted average of p(x/θ) over the possible values of θ:
p(x/D) = ∫ p(x/θ) p(θ/D) dθ
• The training data D exert their influence on p(x/D) through p(θ/D).
Relationship to ML Solution
• If p(D/θ) peaks sharply at θ = θ̂ (i.e., the ML solution), then p(θ/D) will, in general, peak sharply at θ = θ̂ too (assuming the prior p(θ) is broad and smooth).
• In that case p(θ/D) ≈ δ(θ − θ̂), so p(x/D) ≈ p(x/θ̂).
• Therefore, ML can be viewed as a special case of BE!
BE Main Steps
(1) Compute the posterior p(θ/D):
p(θ/D) = p(D/θ) p(θ) / ∫ p(D/θ) p(θ) dθ,  where p(D/θ) = Πk p(xk/θ)
(2) Compute the predictive density p(x/D):
p(x/D) = ∫ p(x/θ) p(θ/D) dθ
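A minimal numerical sketch of the two steps for a one-dimensional parameter, assuming a Gaussian model p(x/θ) = N(θ, 1) with unknown mean θ, and evaluating all densities on a grid. The data values and the prior are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

D = np.array([1.2, 0.7, 1.9, 1.1])           # illustrative training samples
theta = np.linspace(-5, 5, 2001)             # grid over the parameter
dtheta = theta[1] - theta[0]

# Step 1: p(theta/D) ∝ p(D/theta) p(theta), normalized on the grid.
prior = norm.pdf(theta, loc=0.0, scale=2.0)  # assumed broad prior p(theta)
log_like = norm.logpdf(D[:, None], loc=theta[None, :], scale=1.0).sum(axis=0)
post = np.exp(log_like) * prior
post /= post.sum() * dtheta

# Step 2: p(x/D) = ∫ p(x/theta) p(theta/D) dtheta (numeric integration).
def predictive(x):
    return np.sum(norm.pdf(x, loc=theta, scale=1.0) * post) * dtheta

print(predictive(1.0))
```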
Case 1: Univariate Gaussian, Unknown μ
Assume p(x/μ) ~ N(μ, σ²) and p(μ) ~ N(μ0, σ0²) (known σ², μ0, σ0²)
D = {x1, x2, …, xn} (independently drawn)
(Step 1) Compute p(μ/D):
p(μ/D) ∝ p(D/μ) p(μ) = Πk p(xk/μ) · p(μ)
Case 1: Univariate Gaussian, Unknown μ (cont’d)
• It can be shown that p(μ/D) is again Gaussian:
p(μ/D) ~ N(μn, σn²)
where
μn = (nσ0² / (nσ0² + σ²)) μ̂n + (σ² / (nσ0² + σ²)) μ0,  with μ̂n = (1/n) Σk xk (the sample mean)
σn² = σ0² σ² / (nσ0² + σ²)
• So, p(μ/D) peaks at μn.
Case 1: Univariate Gaussian, Unknown μ (cont’d)
• μn is a weighted average of the sample mean μ̂n and the prior mean μ0 (i.e., it lies between them).
• σn² → 0 as n → ∞, and μn → μ̂n (the ML estimate) as n → ∞.
• The uncertainty of our estimate gets smaller as n increases! (See the sketch below.)
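A short sketch of these closed-form posterior updates. The prior parameters and the data-generating mean below are illustrative assumptions:

```python
import numpy as np

def gaussian_mean_posterior(D, sigma2, mu0, sigma0_2):
    """Posterior p(mu/D) ~ N(mu_n, sigma_n^2) for known variance sigma2."""
    n = len(D)
    mu_hat = np.mean(D)                        # ML estimate (sample mean)
    mu_n = (n * sigma0_2 * mu_hat + sigma2 * mu0) / (n * sigma0_2 + sigma2)
    sigma_n2 = (sigma0_2 * sigma2) / (n * sigma0_2 + sigma2)
    return mu_n, sigma_n2

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.0, size=20)    # illustrative samples
mu_n, sigma_n2 = gaussian_mean_posterior(D, sigma2=1.0, mu0=0.0, sigma0_2=4.0)
print(mu_n, sigma_n2)   # mu_n lies between mu0 and the sample mean
```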
Example: Bayesian Learning — Case 1 (cont’d)
(figure: the posterior p(μ/D) becomes sharper and more peaked around μn as n grows)
Case 1: Univariate Gaussian, Unknown μ (cont’d)
(Step 2) Compute p(x/D):
p(x/D) = ∫ p(x/μ) p(μ/D) dμ ~ N(μn, σ² + σn²)
(the factors independent of μ come out of the integral)
• Note: the assumed form is p(x/μ) ~ N(μ, σ²), but p(x/D) ~ N(μn, σ² + σn²) — the predictive variance adds the remaining parameter uncertainty σn² to the model variance σ².
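Continuing the sketch above, the predictive density is just the assumed model widened by the posterior variance. The numeric values of mu_n and sigma_n2 below stand in for the output of the previous sketch and are illustrative:

```python
import numpy as np
from scipy.stats import norm

# Illustrative values standing in for the previous sketch's output.
mu_n, sigma_n2, sigma2 = 1.85, 0.05, 1.0

# Predictive density p(x/D) ~ N(mu_n, sigma^2 + sigma_n^2): the model
# variance sigma^2 is inflated by the parameter uncertainty sigma_n^2.
predictive = norm(loc=mu_n, scale=np.sqrt(sigma2 + sigma_n2))
print(predictive.pdf(2.0))   # density of a new point under p(x/D)
```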
Case 2: Multivariate Gaussian, Unknown μ
Assume p(x/μ) ~ N(μ, Σ) and p(μ) ~ N(μ0, Σ0) (known Σ, μ0, Σ0)
D = {x1, x2, …, xn} (independently drawn)
(Step 1) Compute p(μ/D):
p(μ/D) ∝ Πk p(xk/μ) · p(μ)
Case 2: Multivariate Gaussian, Unknown μ (cont’d)
• It can be shown that p(μ/D) has the following form:
p(μ/D) ~ N(μn, Σn)
where:
μn = Σ0 (Σ0 + (1/n)Σ)⁻¹ μ̂n + (1/n)Σ (Σ0 + (1/n)Σ)⁻¹ μ0,  with μ̂n = (1/n) Σk xk
Σn = Σ0 (Σ0 + (1/n)Σ)⁻¹ (1/n)Σ
Case 2: Multivariate Gaussian, Unknown μ (cont’d)
(Step 2) Compute p(x/D):
p(x/D) = ∫ p(x/μ) p(μ/D) dμ ~ N(μn, Σ + Σn)
• Note: the assumed form is p(x/μ) ~ N(μ, Σ); however, p(x/D) ~ N(μn, Σ + Σn).
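A compact NumPy sketch of the multivariate update; the covariances, prior mean, and generated samples are all illustrative assumptions:

```python
import numpy as np

def mvn_mean_posterior(D, Sigma, mu0, Sigma0):
    """Posterior p(mu/D) ~ N(mu_n, Sigma_n) for known covariance Sigma."""
    n = len(D)
    mu_hat = D.mean(axis=0)                    # sample mean
    A = np.linalg.inv(Sigma0 + Sigma / n)      # (Sigma0 + (1/n)Sigma)^-1
    mu_n = Sigma0 @ A @ mu_hat + (Sigma / n) @ A @ mu0
    Sigma_n = Sigma0 @ A @ (Sigma / n)
    return mu_n, Sigma_n

rng = np.random.default_rng(1)
Sigma = np.eye(2)                              # known model covariance
D = rng.multivariate_normal([1.0, -1.0], Sigma, size=30)
mu_n, Sigma_n = mvn_mean_posterior(D, Sigma, mu0=np.zeros(2), Sigma0=4 * np.eye(2))
# Predictive density: p(x/D) ~ N(mu_n, Sigma + Sigma_n)
print(mu_n, Sigma + Sigma_n, sep="\n")
```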
Recursive Bayes Learning
• Compute p(θ/Dⁿ) incrementally, where Dⁿ = {x1, x2, …, xn-1, xn} = Dⁿ⁻¹ ∪ {xn}:
p(Dⁿ/θ) = p(xn/θ) p(Dⁿ⁻¹/θ),  since the samples are drawn independently.
Recursive Bayes Learning (cont’d)
• Substituting into Bayes’ rule and marginalizing over θ in the denominator:
p(θ/Dⁿ) = p(xn/θ) p(θ/Dⁿ⁻¹) / ∫ p(xn/θ) p(θ/Dⁿ⁻¹) dθ,  for n = 1, 2, …
• The recursion starts from the prior: p(θ/D⁰) = p(θ) (n = 0). See the sketch below.
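A minimal sketch of this recursion for the univariate Gaussian case (Case 1), updating the posterior one sample at a time; after all n samples it matches the batch formulas for μn and σn² given earlier. The data and prior values are illustrative:

```python
import numpy as np

def recursive_update(mu_prev, s2_prev, x, sigma2):
    """One step of p(mu/D^n) ∝ p(x_n/mu) p(mu/D^(n-1)) for Gaussians:
    combines the current posterior N(mu_prev, s2_prev) with the
    likelihood of the new sample, N(x, sigma2), by adding precisions."""
    s2_new = 1.0 / (1.0 / s2_prev + 1.0 / sigma2)
    mu_new = s2_new * (mu_prev / s2_prev + x / sigma2)
    return mu_new, s2_new

sigma2 = 1.0
mu, s2 = 0.0, 4.0                  # start from the prior p(mu/D^0) = p(mu)
for x in [1.2, 0.7, 1.9, 1.1]:     # illustrative samples, one at a time
    mu, s2 = recursive_update(mu, s2, x, sigma2)
print(mu, s2)                      # equals the batch (mu_n, sigma_n^2)
```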
Example
• Assume p(x/θ) ~ U(0, θ) (uniform on [0, θ]) with a flat prior p(θ) ~ U(0, 10).
Example (cont’d)
• After observing x4 = 8, the posterior becomes p(θ/D⁴) ∝ 1/θ⁴ for 8 ≤ θ ≤ 10, and 0 for θ < 8 (since U(0, θ) cannot generate a sample larger than θ).
• In general: p(θ/Dⁿ) ∝ 1/θⁿ for max_k xk ≤ θ ≤ 10.
Example (cont’d)
• Starting from p(θ) = p(θ/D⁰) and iterating, p(θ/D⁴) peaks at θ = 8 = max_k xk.
• ML estimate: θ̂ = 8, giving p(x/θ̂) ~ U(0, 8).
• Bayesian estimate: p(x/D) = ∫ p(x/θ) p(θ/D⁴) dθ, which averages over all feasible θ ∈ [8, 10] and therefore keeps some density beyond x = 8. See the numerical sketch below.
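A grid-based sketch of this example. Only x4 = 8 appears on the slide; the full sample set D = {4, 7, 2, 8} follows the textbook version of this example and should be treated as an assumption:

```python
import numpy as np

D = [4.0, 7.0, 2.0, 8.0]               # assumed textbook samples (x4 = 8)
theta = np.linspace(0.01, 10, 5000)    # grid over [0, 10] (flat prior)
dtheta = theta[1] - theta[0]

post = np.ones_like(theta)             # p(theta/D^0) = p(theta), uniform
for x in D:                            # recursive update with U(0, theta)
    post *= np.where(theta >= x, 1.0 / theta, 0.0)
    post /= post.sum() * dtheta
print(theta[np.argmax(post)])          # peaks at max(D) = 8 (ML estimate)

# The Bayesian predictive p(x/D) keeps some density for 8 < x < 10:
def predictive(x):
    return np.sum(np.where(theta >= x, 1.0 / theta, 0.0) * post) * dtheta

print(predictive(9.0))                 # > 0, unlike p(x/theta_ML) = U(0, 8)
```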
ML vs Bayesian Estimation
• Amount of training data
– The two methods are equivalent given an infinite amount of training data (and a prior that does not exclude the true solution).
– For small training sets, they give different results in most cases.
• Computational complexity
– ML uses differential calculus or gradient search to maximize the likelihood.
– Bayesian estimation requires complex multidimensional integration techniques.
ML vs Bayesian Estimation (cont’d)
• Solution interpretation
– ML solutions are easier to interpret (i.e., they must be of the assumed parametric form).
– A Bayesian estimation solution might not be of the parametric form assumed.
Computational Complexity: ML Estimation
(dimensionality: d, # training data: n, # classes: c)
• Learning complexity (off-line):
– sample mean μ̂: O(dn)
– sample covariance Σ̂: O(d²n)
• Pre-compute certain terms to save time during classification, e.g., Σ̂⁻¹ (O(d³)), the constant (d/2) ln 2π (O(1)), the discriminant coefficients involving Σ̂⁻¹ and μ̂ (O(d²)), and the priors from sample counts (O(n)).
• The above computations must be repeated c times (once for each class).
Computational Complexity: ML Estimation (cont’d)
(dimensionality: d, # training data: n, # classes: c)
• Classification complexity (on-line):
– evaluate the quadratic discriminant (x − μ̂)ᵀ Σ̂⁻¹ (x − μ̂) using the pre-computed Σ̂⁻¹: O(d²)
– add the pre-computed constant terms: O(1)
• These computations must be repeated c times (once per class), and we take the max over the c discriminant values; see the sketch below.
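A sketch of this precompute/classify split for Gaussian class models, using the standard Gaussian log-posterior discriminant. The two-class setup and all parameter values are hypothetical:

```python
import numpy as np

def precompute(mu, Sigma, prior):
    """Off-line: invert Sigma once (O(d^3)) and cache the constants."""
    Sigma_inv = np.linalg.inv(Sigma)
    # The shared -(d/2) ln 2pi term is dropped: it is identical for all
    # classes and does not affect the argmax.
    const = -0.5 * np.linalg.slogdet(Sigma)[1] + np.log(prior)
    return mu, Sigma_inv, const

def discriminant(x, params):
    """On-line: quadratic form O(d^2) plus cached O(1) constants."""
    mu, Sigma_inv, const = params
    diff = x - mu
    return -0.5 * diff @ Sigma_inv @ diff + const

# Hypothetical per-class parameters (would come from ML estimation).
classes = [precompute(np.array([0.0, 0.0]), np.eye(2), 0.6),
           precompute(np.array([3.0, 3.0]), 2 * np.eye(2), 0.4)]

x = np.array([2.5, 2.0])
scores = [discriminant(x, p) for p in classes]   # repeat for each class
print(int(np.argmax(scores)))                    # take the max
```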
Computational Complexity: Bayesian Estimation
• Learning complexity: higher than ML.
• Classification complexity: same as ML.
Summary: Main Sources of Classification Errors
• Bayes error – the error due to overlapping densities p(x/ωi).
• Model error – the error due to choosing an incorrect model.
• Estimation error – the error due to incorrectly estimated parameters.
Next Quiz
• When: Tuesday, March 10th
• What: ML & Bayesian Estimation