Pattern Recognition and Machine Learning
Chapter 2: Probability Distributions (Part 2) + Graphs
Affiliation: Kyoto University
Name: Kevin Chien, Dr. Oba Shigeyuki, Dr. Ishii Shin
Date: Nov 04, 2011
Terminologies: for understanding distributions
Terminologies
• Schur complement: relates the blocks of a partitioned matrix to the blocks of its inverse.
• Completing the square: converting a quadratic of the form ax^2 + bx + c to a(...)^2 + const, either to match quadratic components against a standard Gaussian and read off unknowns, or to solve the quadratic (see the worked form below).
• Robbins-Monro algorithm: iterative root finding for an unobserved regression function M(x) expressed as a mean, i.e. E[N(x)] = M(x).
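A worked statement of completing the square (standard algebra, shown for reference):

$$
ax^2 + bx + c \;=\; a\left(x + \frac{b}{2a}\right)^2 + \left(c - \frac{b^2}{4a}\right), \qquad a \neq 0
$$

Matching such a quadratic exponent against $-\tfrac{1}{2\sigma^2}(x-\mu)^2$ is how the mean and variance of a Gaussian are read off in the derivations that follow.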
Terminologies (cont.)
• Robbins-Monro convergence conditions [Stochastic approximation, Wikipedia, 2011]: the step sizes a_N satisfy lim a_N = 0, Σ a_N = ∞, and Σ a_N² < ∞.
• Trace Tr(W): the sum of the diagonal elements of W.
• Degrees of freedom: the dimension of a subspace; here it refers to a hyperparameter.
Distributions: Gaussian distributions and their motivation
Conditional Gaussian Distribution
• Assume a joint Gaussian over x partitioned into components x_a and x_b; the goal is p(x_a | x_b).
• Derivation of the conditional mean and variance, noting the Schur complement (see the formula below).
• Linear Gaussian model: observations are a weighted sum of underlying latent variables. The conditional mean is linear w.r.t. the conditioning variable x_b, and the conditional variance is independent of x_b.
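The standard partitioned-Gaussian result the derivation arrives at; the conditional covariance is exactly the Schur complement of $\Sigma_{bb}$:

$$
p(x_a \mid x_b) = \mathcal{N}\!\left(x_a \mid \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b),\; \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\right)
$$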
Marginal Gaussian Distribution
• The goal is again to identify the mean and variance by 'completing the square'.
• Solve the marginalization integral over x_b, noting the Schur complement, and compare components (see below).
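The marginal simply picks out the corresponding blocks of the mean and covariance:

$$
p(x_a) = \int p(x_a, x_b)\, dx_b = \mathcal{N}(x_a \mid \mu_a, \Sigma_{aa})
$$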
Bayesian relationship with Gaussian distr. (quick view)
• Consider a multivariate Gaussian over the joint variable (x, y).
• By the Bayesian (product) rule p(x, y) = p(x) p(y|x), the conditional Gaussian must have an exponent equal to the difference between the exponents of p(x, y) and p(x).
Bayesian relationship with Gaussian distr.
• Start from a marginal p(x), which can be seen as the prior, and a conditional p(y|x), which can be seen as the likelihood.
• From these, the mean and variance of the joint Gaussian p(x, y) follow.
• The mean and variance of p(x|y), which can be seen as the posterior, are given below.
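These are the standard linear-Gaussian results (prior and likelihood in, marginal and posterior out):

$$
\begin{aligned}
p(x) &= \mathcal{N}(x \mid \mu, \Lambda^{-1}), \qquad p(y \mid x) = \mathcal{N}(y \mid Ax + b, L^{-1}) \\
p(y) &= \mathcal{N}(y \mid A\mu + b,\; L^{-1} + A\Lambda^{-1}A^{\mathsf{T}}) \\
p(x \mid y) &= \mathcal{N}\big(x \mid \Sigma\{A^{\mathsf{T}}L(y - b) + \Lambda\mu\},\; \Sigma\big), \qquad \Sigma = (\Lambda + A^{\mathsf{T}}LA)^{-1}
\end{aligned}
$$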
Bayesian relationship with Gaussian distr., sequential est.
• Estimate the mean from (N−1)+1 observations: the maximum likelihood mean after N points can be written as an update of the estimate after N−1 points.
• The Robbins-Monro algorithm has the same form, so the maximum likelihood mean can be solved for as the root of a regression function (see the sketch below).
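A minimal sketch of the sequential mean update, which is the Robbins-Monro iteration with step size a_N = 1/N; numpy is assumed and the data stream is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=10_000)  # toy stream with true mean 3.0

mu = 0.0  # initial estimate
for n, x in enumerate(data, start=1):
    a_n = 1.0 / n             # step sizes satisfy the Robbins-Monro conditions
    mu = mu + a_n * (x - mu)  # mu_N = mu_{N-1} + (1/N)(x_N - mu_{N-1})

print(mu)  # converges toward 3.0
```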
Bayesian relationship with univariate Gaussian distr.
• The conjugate prior for the precision (inverse variance) of a univariate Gaussian is the gamma distribution.
• When both mean and precision are unknown, the conjugate prior is the Gaussian-gamma distribution.
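The gamma prior over the precision λ, and its posterior update for N observations when the mean μ is known:

$$
\mathrm{Gam}(\lambda \mid a, b) = \frac{1}{\Gamma(a)}\, b^{a} \lambda^{a-1} e^{-b\lambda},
\qquad
a_N = a_0 + \frac{N}{2}, \quad b_N = b_0 + \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^2
$$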
Bayesian relationship with multivariate Gaussian distr.
• The conjugate prior for the precision (inverse covariance) matrix of a multivariate Gaussian is the Wishart distribution.
• When both mean and precision are unknown, the conjugate prior is the Gaussian-Wishart distribution.
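The Wishart density over a D x D precision matrix Λ, with scale matrix W, ν degrees of freedom, and normalization constant B(W, ν):

$$
\mathcal{W}(\Lambda \mid W, \nu) = B(W, \nu)\, |\Lambda|^{(\nu - D - 1)/2} \exp\!\left(-\tfrac{1}{2}\,\mathrm{Tr}(W^{-1}\Lambda)\right)
$$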
Distributions: variations on the Gaussian distribution
Student's t-distr
• Used in analysis of variance to test whether an effect is real and statistically significant, via the t-distribution with n−1 degrees of freedom.
• If the X_i are normal random variables, the t-distribution has a lower peak and a longer tail than the Gaussian (it allows more outliers, hence is more robust).
• Obtained by summing an infinite number of univariate Gaussians with the same mean but different precisions (see below).
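The 'infinite sum of Gaussians' made precise: a scale mixture over the precision, which integrates to the t-density:

$$
\mathrm{St}(x \mid \mu, \lambda, \nu)
= \int_0^\infty \mathcal{N}\!\big(x \mid \mu, (\eta\lambda)^{-1}\big)\, \mathrm{Gam}\!\big(\eta \mid \tfrac{\nu}{2}, \tfrac{\nu}{2}\big)\, d\eta
= \frac{\Gamma(\frac{\nu+1}{2})}{\Gamma(\frac{\nu}{2})} \left(\frac{\lambda}{\pi\nu}\right)^{1/2} \left[1 + \frac{\lambda(x-\mu)^2}{\nu}\right]^{-(\nu+1)/2}
$$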
Student's t-distr (cont.)
• The multivariate Gaussian has a corresponding multivariate t-distribution, in which the squared Mahalanobis distance Δ² = (x − μ)ᵀΛ(x − μ) replaces the scalar quadratic term.
• Mean E[x] = μ (for ν > 1); covariance cov[x] = ν/(ν − 2) Λ⁻¹ (for ν > 2).
Gaussian with periodic variables
• To avoid the mean depending on the choice of origin, use polar coordinates and solve for the angle θ.
• The von Mises distribution is the circular special case of the von Mises-Fisher distribution on the N-dimensional sphere; it arises as the stationary distribution of a drift process on the circle.
Gaussian with periodic variables (cont.)
• Conditioning a Gaussian in Cartesian coordinates onto the unit circle and rewriting in polar coordinates yields the von Mises distribution, with mean direction θ₀ and precision (concentration) parameter m (see below).
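The resulting von Mises density, with I₀ the zeroth-order modified Bessel function of the first kind:

$$
p(\theta \mid \theta_0, m) = \frac{1}{2\pi I_0(m)} \exp\{m \cos(\theta - \theta_0)\}
$$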
Gaussian with periodic variables: mean and variance
• Maximizing the log likelihood, and noting the trigonometric identities for sums of sines and cosines, gives the maximum likelihood mean direction θ₀ and concentration (precision) m (see below).
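The maximum likelihood solutions, with A(m) a ratio of modified Bessel functions:

$$
\theta_0^{\mathrm{ML}} = \tan^{-1}\!\left\{\frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n}\right\},
\qquad
A(m^{\mathrm{ML}}) = \frac{1}{N}\sum_{n=1}^{N} \cos\big(\theta_n - \theta_0^{\mathrm{ML}}\big),
\quad A(m) = \frac{I_1(m)}{I_0(m)}
$$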
Mixture of Gaussians
• In part 1 we already saw one limitation of the Gaussian: it is unimodal.
– Solution: a linear combination (superposition) of Gaussians.
• The mixing coefficients sum to 1.
• The posterior probability that a point came from a given component is known as its 'responsibility'.
– The log likelihood contains a log of a sum, so there is no closed-form maximum likelihood solution (see below).
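The mixture density, its log likelihood, and the responsibilities:

$$
p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k} \pi_k = 1
$$
$$
\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln\left\{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right\},
\qquad
\gamma_k(x) = \frac{\pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_j \pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)}
$$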
Exponential family
• Natural (canonical) form, normalized by the function g(η).
• 1) Bernoulli: rewriting into natural form yields the logistic sigmoid link.
• 2) Multinomial: rewriting into natural form yields the softmax link.
(Both rewrites are worked out below.)
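The natural form, and the two rewrites referred to above:

$$
p(x \mid \eta) = h(x)\, g(\eta) \exp\{\eta^{\mathsf{T}} u(x)\}
$$

Bernoulli: $p(x \mid \mu) = \mu^x (1-\mu)^{1-x} = (1-\mu)\exp\{x \ln\tfrac{\mu}{1-\mu}\}$, so $\eta = \ln\tfrac{\mu}{1-\mu}$ and $\mu = \sigma(\eta)$, the logistic sigmoid.

Multinomial: $p(x \mid \mu) = \prod_k \mu_k^{x_k} = \exp\{\sum_k x_k \ln \mu_k\}$, so $\eta_k = \ln \mu_k$; enforcing $\sum_k \mu_k = 1$ gives the softmax $\mu_k = e^{\eta_k} / \sum_j e^{\eta_j}$.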
Exponential family (cont.)
• 3) Univariate Gaussian: rewriting into natural form identifies the natural parameters.
• Solving for the natural parameter: the gradient of the log normalizer gives the expected sufficient statistics, and maximum likelihood equates this to their sample average (see below).
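For the univariate Gaussian, $u(x) = (x, x^2)^{\mathsf{T}}$ and $\eta = (\mu/\sigma^2,\, -1/2\sigma^2)^{\mathsf{T}}$. The general moment and maximum likelihood conditions are:

$$
-\nabla \ln g(\eta) = \mathbb{E}[u(x)],
\qquad
-\nabla \ln g(\eta_{\mathrm{ML}}) = \frac{1}{N}\sum_{n=1}^{N} u(x_n)
$$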
Parameters of Distributions: and some interesting methodologies
Uninformative priors
• 'Subjective Bayesian': avoid incorrect assumptions by using an uninformative prior (e.g. a uniform distribution).
– Improper prior: the prior need not integrate to 1 for the posterior to be proper, as per the Bayes equation.
• 1) Location parameter: choose the prior for translation invariance.
• 2) Scale parameter: choose the prior for scale invariance.
(Both cases are written out below.)
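The two invariances pin down the standard uninformative choices:

$$
p(\mu) = \text{const} \;\; \text{(translation invariance)},
\qquad
p(\sigma) \propto \frac{1}{\sigma} \;\; \text{(scale invariance, i.e. uniform over } \ln\sigma\text{)}
$$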
Nonparametric methods
• Instead of assuming a form for the distribution, use nonparametric methods.
• 1) Histogram with constant bin width:
– Good for sequential data.
– Problems: discontinuities at bin edges, and the number of bins grows exponentially with dimensionality.
• 2) Kernel estimators: a sum of Parzen windows.
– With K of the N observations falling in a region R of volume V, the density estimate becomes p(x) ≈ K/(NV).
Nonparametric method: kernel estimators
• 2) Kernel estimators: fix V, determine K.
– A kernel function k(u) counts the points falling within R.
– h > 0 is a fixed bandwidth parameter that controls smoothing.
– This is the Parzen estimator; a smooth kernel k(u) can be chosen (e.g. Gaussian), as in the sketch below.
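A minimal sketch of a Gaussian-kernel Parzen estimator in one dimension, assuming numpy; the bandwidth and toy data are illustrative:

```python
import numpy as np

def parzen_gaussian(x_query, data, h):
    """Parzen estimate p(x) = (1/(N*h)) * sum_n k((x - x_n)/h), Gaussian kernel."""
    u = (x_query[:, None] - data[None, :]) / h      # (Q, N) scaled offsets
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel values
    return k.mean(axis=1) / h                       # average over data, rescale by h

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=500)       # toy sample from a standard normal
xs = np.linspace(-4.0, 4.0, 9)
print(parzen_gaussian(xs, data, h=0.3))     # rough estimate of the N(0,1) density
```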
Nonparametric method: nearest neighbor
• 3) Nearest neighbor: this time fix K and let the data determine the volume V. The class priors follow from the counts, p(C_k) = N_k/N.
– As with the kernel estimator, the training set is stored as the knowledge base.
– 'k' is the number of neighbors; larger 'k' gives a smoother, less complex boundary with fewer regions.
– To classify N points with N_k points in class C_k, apply Bayes' theorem and maximize the posterior p(C_k|x).
Nonparametric method: nearest neighbor (cont.)
• 3) Nearest neighbor: assign a new point to class C_k by majority vote among its k nearest neighbors (see the sketch below).
– For k = 1 and n → ∞, the error is bounded by twice the Bayes error rate [k-nearest neighbor algorithm, Wikipedia, 2011].
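A minimal sketch of k-nearest-neighbor classification by majority vote, assuming numpy; the two toy clusters are illustrative:

```python
import numpy as np

def knn_classify(x_query, X_train, y_train, k=3):
    """Assign x_query the majority label among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of the k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                   # majority vote

rng = np.random.default_rng(2)
X0 = rng.normal([0.0, 0.0], 0.5, size=(20, 2))   # class 0 cluster
X1 = rng.normal([2.0, 2.0], 0.5, size=(20, 2))   # class 1 cluster
X = np.vstack([X0, X1])
y = np.array([0] * 20 + [1] * 20)
print(knn_classify(np.array([1.8, 1.9]), X, y, k=5))  # expected: 1
```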
Basic Graph Concepts: from Ch. 2 of David Barber's book
Directed and undirected graphs
• A graph G consists of vertices and edges; edges may be directed or undirected.
• In a directed graph, if A→B but not B→A, then A is an ancestor (here, a parent) of B, and B is a child of A.
• Directed Acyclic Graph (DAG): a directed graph with no cycles (no path revisits a vertex).
• Connected undirected graph: there is a path between every pair of vertices.
• Clique: a fully connected subset of vertices in an undirected graph.
Representations of Graphs
• Singly connected graph (tree): only one path from A to B.
• Spanning tree of an undirected graph: a singly connected subgraph covering all vertices.
• Numerical graph representations:
– Edge list: e.g. a list of vertex pairs.
– Adjacency matrix A: for N vertices, an N x N matrix with A_ij = 1 if there is an edge from i to j. For an undirected graph this matrix is symmetric.
Representations of Graphs (cont.)
• Directed graph: if the vertices are labeled in ancestral order (parents before children), the adjacency matrix is strictly upper triangular, provided there are no edges from a vertex to itself.
• An undirected graph with K maximal cliques has an N x K clique matrix, where each column c_k indicates which vertices form clique k.
– Example: 2 cliques, on vertices {1, 2, 3} and {2, 3, 4} (see the sketch below).
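A small numerical illustration of the adjacency and clique matrices for the 2-clique example, assuming numpy (vertices are 0-indexed in the code):

```python
import numpy as np

# 4-vertex undirected graph with maximal cliques {1, 2, 3} and {2, 3, 4}
# (0-indexed here, so the cliques are {0, 1, 2} and {1, 2, 3})
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]])   # symmetric: undirected graph

# N x K clique matrix: column k marks the vertices in clique k
Z = np.array([[1, 0],
              [1, 1],
              [1, 1],
              [0, 1]])

# Z @ Z.T is nonzero off the diagonal exactly where A is:
# the clique matrix carries the full adjacency structure
print((Z @ Z.T > 0).astype(int))
```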
Incidence Matrix
• The incidence matrix Z_inc has one row per vertex and one column per edge, with Z_inc(i, e) = 1 if vertex i belongs to edge e; the maximal clique matrix Z likewise has one column per clique.
• Property: Z_inc Z_inc^T recovers the adjacency structure (see below).
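This follows directly from the definition, since entry (i, j) counts the edges containing both vertices i and j:

$$
Z_{\mathrm{inc}} Z_{\mathrm{inc}}^{\mathsf{T}} = A + D,
$$

where A is the adjacency matrix and D is the diagonal matrix of vertex degrees.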
Additional Information
• Graphs and equations excerpted from [Pattern Recognition and Machine Learning, Bishop C. M.], pages 84-127.
• Graphs and equations excerpted from [Bayesian Reasoning and Machine Learning, David Barber], pages 19-23.
• Slides uploaded to the Google group. Please cite when using.