Classification: Bayesian Classifiers
Jeff Howbert, Introduction to Machine Learning, Winter 2014
Bayesian classification
- A probabilistic framework for solving classification problems.
  - Used where class assignment is not deterministic, i.e. a particular set of attribute values will sometimes be associated with one class, sometimes with another.
  - Requires estimation of the posterior probability p( Ci | x ) for each class Ci, given a set of attribute values x.
  - Then use decision theory to make predictions for a new sample x.
Bayesian classification
- Conditional probability:
  p( C | A ) = p( C, A ) / p( A )
  p( A | C ) = p( C, A ) / p( C )
- Bayes theorem:
  p( C | A ) = p( A | C ) p( C ) / p( A )
  where p( C | A ) is the posterior probability, p( A | C ) the likelihood, p( C ) the prior probability, and p( A ) the evidence.
Example of Bayes theorem
- Given:
  - A doctor knows that meningitis causes stiff neck 50% of the time.
  - Prior probability of any patient having meningitis is 1/50,000.
  - Prior probability of any patient having stiff neck is 1/20.
- If a patient has stiff neck, what is the probability he/she has meningitis?
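Plugging the given numbers into Bayes theorem answers the question directly:

p( M | S ) = p( S | M ) p( M ) / p( S ) = ( 0.5 × 1/50,000 ) / ( 1/20 ) = 0.0002

So even after observing a stiff neck, the probability of meningitis is only 0.02%.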
Bayesian classifiers
- Treat each attribute and class label as random variables.
- Given a sample x with attributes ( x1, x2, …, xn ):
  - Goal is to predict class C.
  - Specifically, we want to find the value of Ci that maximizes p( Ci | x1, x2, …, xn ).
- Can we estimate p( Ci | x1, x2, …, xn ) directly from data?
Bayesian classifiers
Approach:
- Compute the posterior probability p( Ci | x1, x2, …, xn ) for each value of Ci using Bayes theorem:
  p( Ci | x1, x2, …, xn ) = p( x1, x2, …, xn | Ci ) p( Ci ) / p( x1, x2, …, xn )
- Choose the value of Ci that maximizes p( Ci | x1, x2, …, xn ).
  - Equivalent to choosing the value of Ci that maximizes p( x1, x2, …, xn | Ci ) p( Ci ).
  - (We can ignore the denominator – why?)
- Easy to estimate priors p( Ci ) from data. (How?)
- The real challenge: how to estimate p( x1, x2, …, xn | Ci )?
Bayesian classifiers
- How to estimate p( x1, x2, …, xn | Ci )?
- In the general case, where the attributes xj have dependencies, this requires estimating the full joint distribution p( x1, x2, …, xn ) for each class Ci.
- There is almost never enough data to confidently make such estimates.
Naïve Bayes classifier
- Assume independence among the attributes xj when the class is given:
  p( x1, x2, …, xn | Ci ) = p( x1 | Ci ) p( x2 | Ci ) … p( xn | Ci )
- Usually straightforward and practical to estimate p( xj | Ci ) for all xj and Ci.
- A new sample is classified to Ci if p( Ci ) ∏j p( xj | Ci ) is maximal.
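A minimal sketch of this decision rule in Python, assuming the class priors and per-attribute conditionals have already been estimated and are passed in as dictionaries (the names predict, priors, and conditionals are illustrative, not from the slides):

def predict(x, priors, conditionals):
    """x: dict of attribute -> value for the new sample.
    priors: dict of class -> p(Ci).
    conditionals: dict of (attribute, value, class) -> p(xj | Ci)."""
    best_class, best_score = None, -1.0
    for ci, prior in priors.items():
        # Score each class by p(Ci) * prod_j p(xj | Ci).
        score = prior
        for attr, value in x.items():
            # Combinations never seen in training get probability 0 here;
            # the smoothing slide later in the deck addresses this.
            score *= conditionals.get((attr, value, ci), 0.0)
        if score > best_score:
            best_class, best_score = ci, score
    return best_class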
How to estimate p( xj | Ci ) from data?
- Class priors: p( Ci ) = Ni / N
  - e.g. p( No ) = 7/10, p( Yes ) = 3/10
- For discrete attributes: p( xj | Ci ) = | xji | / Ni,
  where | xji | is the number of instances in class Ci having attribute value xj.
  - Examples:
    p( Status = Married | No ) = 4/7
    p( Refund = Yes | Yes ) = 0
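Both estimates reduce to simple counting. Below is a minimal sketch, assuming each training record is a dict that stores the class label under the key named by class_attr (the function names are illustrative, not from the slides):

from collections import Counter

def estimate_priors(records, class_attr):
    # p(Ci) = Ni / N, by counting class labels
    counts = Counter(r[class_attr] for r in records)
    n = len(records)
    return {ci: ni / n for ci, ni in counts.items()}

def estimate_discrete_conditional(records, attr, class_attr):
    # p(xj | Ci) = |xji| / Ni for one discrete attribute
    class_counts = Counter(r[class_attr] for r in records)
    pair_counts = Counter((r[attr], r[class_attr]) for r in records)
    return {(value, ci): count / class_counts[ci]
            for (value, ci), count in pair_counts.items()}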
How to estimate p( xj | Ci ) from data?
- For continuous attributes:
  - Discretize the range into bins
    - replace with an ordinal attribute
  - Two-way split: ( xj < v ) or ( xj > v )
    - replace with a binary attribute
  - Probability density estimation:
    - assume the attribute follows some standard parametric probability distribution (usually a Gaussian)
    - use the data to estimate the parameters of the distribution (e.g. mean and variance)
    - once the distribution is known, use it to estimate the conditional probability p( xj | Ci )
How to estimate p( xj | Ci ) from data?
- Gaussian distribution:
  p( xj | Ci ) = 1 / sqrt( 2π σji² ) × exp( −( xj − μji )² / ( 2 σji² ) )
  - one Gaussian (mean μji, variance σji²) for each ( xj, Ci ) pair
- For ( Income | Class = No ):
  - sample mean = 110
  - sample variance = 2975
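A quick sketch of the density calculation using the sample statistics above (the function name is illustrative); evaluating it at Income = 120 gives roughly 0.0072, the value used on the next slide:

import math

def gaussian_density(x, mean, variance):
    # Class-conditional density p(xj | Ci) under a Gaussian with the
    # given sample mean and sample variance.
    return (1.0 / math.sqrt(2 * math.pi * variance)
            * math.exp(-((x - mean) ** 2) / (2 * variance)))

# (Income | Class = No): sample mean = 110, sample variance = 2975
print(gaussian_density(120, mean=110, variance=2975))   # ≈ 0.0072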
Example of using naïve Bayes classifier
Given a test record: x = ( Refund = No, Marital Status = Married, Taxable Income = 120K )

Conditional probabilities estimated from the training data:
  p( Refund = Yes | No ) = 3/7
  p( Refund = No | No ) = 4/7
  p( Refund = Yes | Yes ) = 0/3
  p( Refund = No | Yes ) = 3/3
  p( Marital Status = Single | No ) = 2/7
  p( Marital Status = Divorced | No ) = 1/7
  p( Marital Status = Married | No ) = 4/7
  p( Marital Status = Single | Yes ) = 2/3
  p( Marital Status = Divorced | Yes ) = 1/3
  p( Marital Status = Married | Yes ) = 0/3

For Taxable Income:
  If Class = No: sample mean = 110, sample variance = 2975
  If Class = Yes: sample mean = 90, sample variance = 25

- p( x | Class = No ) = p( Refund = No | Class = No ) × p( Married | Class = No ) × p( Income = 120K | Class = No )
  = 4/7 × 4/7 × 0.0072 = 0.0024
- p( x | Class = Yes ) = p( Refund = No | Class = Yes ) × p( Married | Class = Yes ) × p( Income = 120K | Class = Yes )
  = 1 × 0 × 1.2×10⁻⁹ = 0

Since p( x | No ) p( No ) > p( x | Yes ) p( Yes ), it follows that p( No | x ) > p( Yes | x ), so Class = No.
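The arithmetic above is easy to verify in a few lines of Python, reusing the Gaussian density from the previous slide for the Income terms (a sketch, not part of the original slides):

import math

def gaussian_density(x, mean, variance):
    return (1.0 / math.sqrt(2 * math.pi * variance)
            * math.exp(-((x - mean) ** 2) / (2 * variance)))

# x = (Refund = No, Married, Income = 120K)
p_x_no  = (4/7) * (4/7) * gaussian_density(120, 110, 2975)   # ≈ 0.0024
p_x_yes = (3/3) * (0/3) * gaussian_density(120, 90, 25)       # = 0

# Compare p(x | Ci) p(Ci) using the priors p(No) = 7/10, p(Yes) = 3/10.
print("Class =", "No" if p_x_no * 0.7 > p_x_yes * 0.3 else "Yes")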
Naïve Bayes classifier
- Problem: if one of the conditional probabilities is zero, then the entire expression becomes zero.
- This is a significant practical problem, especially when training samples are limited.
- Ways to improve probability estimation:
  - Original: p( xj | Ci ) = | xji | / Ni
  - Laplace estimate: p( xj | Ci ) = ( | xji | + 1 ) / ( Ni + c )
  - m-estimate: p( xj | Ci ) = ( | xji | + m·p ) / ( Ni + m )
  where c is the number of classes, p is a prior probability, and m is a parameter.
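A minimal sketch of the two smoothed estimates in Python, using the notation from the slide (the function names are illustrative, not from the slides):

def laplace_estimate(count_xj_in_ci, n_i, num_classes):
    # Laplace estimate: (|xji| + 1) / (Ni + c)
    return (count_xj_in_ci + 1) / (n_i + num_classes)

def m_estimate(count_xj_in_ci, n_i, prior, m):
    # m-estimate: (|xji| + m * p) / (Ni + m)
    return (count_xj_in_ci + m * prior) / (n_i + m)

# With smoothing, p(Refund = Yes | Yes) is no longer exactly 0 even though
# its training count is 0 out of 3 (here with c = 2 classes):
print(laplace_estimate(0, 3, num_classes=2))   # 1/5 = 0.2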
Example of Naïve Bayes classifier
- X: attributes
- M: class = mammal
- N: class = non-mammal
- Since p( X | M ) p( M ) > p( X | N ) p( N ), predict mammal.
Summary of naïve Bayes
- Robust to isolated noise samples.
- Handles missing values by ignoring the sample during probability estimate calculations.
- Robust to irrelevant attributes.
- NOT robust to redundant attributes.
  - The independence assumption does not hold in this case.
  - Use other techniques such as Bayesian Belief Networks (BBN).