Naïve Bayes Classifier
Adapted from slides by Ke Chen (University of Manchester) and Yangqiu Song (MSRA)
Generative vs. Discriminative Classifiers
Training classifiers involves estimating f: X → Y, or P(Y|X)
• Discriminative classifiers
  1. Assume some functional form for P(Y|X)
  2. Estimate the parameters of P(Y|X) directly from training data
• Generative classifiers (also called “informative” by Rubinstein & Hastie)
  1. Assume some functional form for P(X|Y), P(Y)
  2. Estimate the parameters of P(X|Y), P(Y) directly from training data
  3. Use Bayes rule to calculate P(Y|X = xi)
Bayes Formula
P(Y|X) = P(X|Y) P(Y) / P(X), i.e., posterior = likelihood × prior / evidence
Generative Model
[Figure: a generative model of the input features — Color, Size, Texture, Weight, …]
Discriminative Model
• Logistic Regression
[Figure: a discriminative model mapping the input features — Color, Size, Texture, Weight, … — directly to the class label]
Comparison
• Generative models
  – Assume some functional form for P(X|Y), P(Y)
  – Estimate the parameters of P(X|Y), P(Y) directly from training data
  – Use Bayes rule to calculate P(Y|X = x)
• Discriminative models
  – Directly assume some functional form for P(Y|X)
  – Estimate the parameters of P(Y|X) directly from training data
Probability Basics
• Prior, conditional and joint probability for random variables
  – Prior probability: P(X)
  – Conditional probability: P(X1|X2), P(X2|X1)
  – Joint probability: X = (X1, X2), P(X) = P(X1, X2)
  – Relationship: P(X1, X2) = P(X2|X1) P(X1) = P(X1|X2) P(X2)
  – Independence: P(X2|X1) = P(X2), P(X1|X2) = P(X1), P(X1, X2) = P(X1) P(X2)
• Bayesian Rule: P(C|X) = P(X|C) P(C) / P(X)   (Posterior = Likelihood × Prior / Evidence)
Probability Basics
• Quiz: We have two six-sided dice. When they are rolled, the following events may occur: (A) die 1 lands on side “3”, (B) die 2 lands on side “1”, and (C) the two dice sum to eight. What are the probabilities of these events, and of their conjunctions and conditionals? (see the sketch below)
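As a sanity check, a brute-force enumeration of all 36 outcomes recovers the quiz probabilities; this is a minimal sketch, with the event definitions taken directly from the quiz above:

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # all 36 (die1, die2) rolls

A = [o for o in outcomes if o[0] == 3]    # die 1 shows 3
B = [o for o in outcomes if o[1] == 1]    # die 2 shows 1
C = [o for o in outcomes if sum(o) == 8]  # the two dice sum to 8

p = lambda ev: len(ev) / len(outcomes)
print(p(A), p(B), p(C))  # 1/6, 1/6, 5/36

# Conditional probability: P(A|C) = P(A and C) / P(C)
AC = [o for o in outcomes if o in A and o in C]
print(len(AC) / len(C))  # 1/5: given a sum of 8, die 1 = 3 in one of five ways
```

Note that A and B are independent (P(A|B) = P(A)), while A and C are not (P(A|C) = 1/5 ≠ 1/6).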
Probabilistic Classification
• Establishing a probabilistic model for classification
  – Discriminative model: model P(C|X) directly, where C = c1, …, cL and X = (X1, …, Xn)
[Figure: a discriminative probabilistic classifier takes x = (x1, …, xn) as input and outputs P(c1|x), …, P(cL|x)]
Probabilistic Classification
• Establishing a probabilistic model for classification (cont.)
  – Generative model: model P(X|C) separately for each class, where C = c1, …, cL and X = (X1, …, Xn)
[Figure: one generative probabilistic model per class — for Class 1, for Class 2, …, for Class L — each scoring P(x|ci)]
Probabilistic Classification
• MAP classification rule
  – MAP: Maximum A Posteriori
  – Assign x to c* if P(C = c*|X = x) > P(C = c|X = x) for all c ≠ c*
• Generative classification with the MAP rule
  – Apply Bayes rule to convert likelihoods into posterior probabilities: P(C = c|X = x) = P(X = x|C = c) P(C = c) / P(X = x) ∝ P(X = x|C = c) P(C = c)
  – Then apply the MAP rule
Naïve Bayes
• Bayes classification: P(C|X) ∝ P(X|C) P(C) = P(X1, …, Xn|C) P(C)
  – Difficulty: learning the joint probability P(X1, …, Xn|C)
• Naïve Bayes classification
  – Assumption: all input attributes are conditionally independent given the class, so P(X1, …, Xn|C) = P(X1|C) P(X2|C) ⋯ P(Xn|C)
  – MAP classification rule: assign x' = (x1, …, xn) to c* if [P(x1|c*) ⋯ P(xn|c*)] P(c*) > [P(x1|c) ⋯ P(xn|c)] P(c) for all c ≠ c*
Naïve Bayes
• Naïve Bayes Algorithm (for discrete input attributes)
  – Learning Phase: Given a training set S, estimate P̂(C = ci) and P̂(Xj = xjk|C = ci) from the frequencies in S, for every class ci and every attribute value xjk. Output: one prior and the conditional probability tables.
  – Test Phase: Given an unknown instance X' = (a1', …, an'), look up the tables and assign the label c* to X' if [P̂(a1'|c*) ⋯ P̂(an'|c*)] P̂(c*) > [P̂(a1'|c) ⋯ P̂(an'|c)] P̂(c) for all c ≠ c* (see the sketch below)
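The algorithm above amounts to counting and multiplying. A minimal sketch of the discrete learning and test phases; function and variable names are illustrative, not from the original slides:

```python
from collections import Counter, defaultdict

def train_nb(examples, labels):
    """Learning phase: estimate P(C=c) and P(Xj=v | C=c) by counting."""
    class_count = Counter(labels)
    n = len(labels)
    # cond[(j, v, c)] = number of class-c examples whose attribute j has value v
    cond = defaultdict(int)
    for x, c in zip(examples, labels):
        for j, v in enumerate(x):
            cond[(j, v, c)] += 1
    p_prior = {c: class_count[c] / n for c in class_count}
    p_cond = {k: cnt / class_count[k[2]] for k, cnt in cond.items()}
    return p_prior, p_cond

def classify_nb(x, p_prior, p_cond):
    """Test phase: MAP rule — pick the class maximizing P(c) * prod_j P(xj|c)."""
    best_c, best_score = None, -1.0
    for c, pc in p_prior.items():
        score = pc
        for j, v in enumerate(x):
            score *= p_cond.get((j, v, c), 0.0)  # unseen value -> 0 (see smoothing later)
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```

On the Play-Tennis data of the next slides, this counting reproduces the conditional probability tables shown in the Learning Phase.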
Example
• Example: Play Tennis — the standard 14-day dataset (Mitchell, Machine Learning):

Day  Outlook   Temperature  Humidity  Wind    Play
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
Example
• Learning Phase

Outlook      P(·|Play=Yes)  P(·|Play=No)
Sunny        2/9            3/5
Overcast     4/9            0/5
Rain         3/9            2/5

Temperature  P(·|Play=Yes)  P(·|Play=No)
Hot          2/9            2/5
Mild         4/9            2/5
Cool         3/9            1/5

Humidity     P(·|Play=Yes)  P(·|Play=No)
High         3/9            4/5
Normal       6/9            1/5

Wind         P(·|Play=Yes)  P(·|Play=No)
Strong       3/9            3/5
Weak         6/9            2/5

P(Play=Yes) = 9/14    P(Play=No) = 5/14
Example
• Test Phase
  – Given a new instance x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
  – Look up the tables:
    P(Outlook=Sunny|Play=Yes) = 2/9       P(Outlook=Sunny|Play=No) = 3/5
    P(Temperature=Cool|Play=Yes) = 3/9    P(Temperature=Cool|Play=No) = 1/5
    P(Humidity=High|Play=Yes) = 3/9       P(Humidity=High|Play=No) = 4/5
    P(Wind=Strong|Play=Yes) = 3/9         P(Wind=Strong|Play=No) = 3/5
    P(Play=Yes) = 9/14                    P(Play=No) = 5/14
  – MAP rule:
    P(Yes|x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
    P(No|x')  ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
  – Since P(Yes|x') < P(No|x'), we label x' as “No”.
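The hand computation above is easy to verify numerically; a quick check of the two unnormalized posteriors, with the values taken from the Learning Phase tables:

```python
from fractions import Fraction as F

# Unnormalized posteriors for x' = (Sunny, Cool, High, Strong)
yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)
no  = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

print(float(yes))  # 0.00529... (~0.0053)
print(float(no))   # 0.02057... (~0.0206)
print("No" if no > yes else "Yes")  # -> "No"
```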
Relevant Issues
• Violation of the independence assumption
  – For many real-world tasks, P(X1, …, Xn|C) ≠ P(X1|C) ⋯ P(Xn|C)
  – Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero conditional probability problem
  – If no training example of class ci contains the attribute value Xj = ajk, then P̂(Xj = ajk|C = ci) = 0
  – In this circumstance, the whole product P̂(x1|ci) ⋯ P̂(ajk|ci) ⋯ P̂(xn|ci) = 0 during test, no matter how likely the other attribute values are
  – As a remedy, estimate conditional probabilities with a smoothing correction (m-estimate): P̂(Xj = ajk|C = ci) = (nc + m p) / (n + m), where n is the number of training examples with C = ci, nc the number with C = ci and Xj = ajk, p a prior estimate (e.g., uniform, p = 1/t for t possible attribute values), and m a weight (equivalent sample size)
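In code, the m-estimate is a one-line change to the counting estimator. A sketch, where the uniform prior and the default `m` are illustrative assumptions matching the formula above:

```python
def smoothed_cond_prob(n_c, n, t, m=1.0):
    """m-estimate of P(Xj = a | C = c).

    n_c: training examples with class c AND attribute value a
    n:   training examples with class c
    t:   number of possible values of attribute Xj (uniform prior p = 1/t)
    m:   equivalent sample size; m = t with p = 1/t gives Laplace add-one smoothing
    """
    p = 1.0 / t
    return (n_c + m * p) / (n + m)

# Example: Outlook=Overcast never occurs with Play=No (0 of 5; Outlook has t = 3 values)
print(smoothed_cond_prob(0, 5, 3, m=3))  # 0.125 instead of 0
```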
Relevant Issues
• Continuous-valued input attributes
  – Uncountably many values per attribute, so frequency tables no longer apply
  – The conditional probability is instead modeled with the normal distribution: P̂(Xj|C = ci) = (1 / (√(2π) σji)) exp(−(Xj − μji)² / (2σji²)), where μji and σji are the mean and standard deviation of attribute Xj over the training examples of class ci
  – Learning Phase: for X = (X1, …, Xn) and C = c1, …, cL, estimate the n × L pairs (μji, σji). Output: n × L normal distributions and P(C = ci), i = 1, …, L
  – Test Phase: given an unknown instance X' = (a1', …, an')
    • Calculate the conditional probabilities with all the normal distributions
    • Apply the MAP rule to make a decision
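A sketch of the two pieces this slide needs — fitting one Gaussian per (attribute, class) pair and evaluating its density at test time; names are illustrative:

```python
import math

def fit_gaussian(values):
    """Learning phase for one (attribute, class) pair: estimate mu and sigma."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var)

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x — the class-conditional P(Xj = x | C = c)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
```

At test time the densities simply replace the table lookups in the discrete MAP rule.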
Conclusions
• Naïve Bayes is based on the independence assumption
  – Training is very easy and fast: it only requires considering each attribute in each class separately
  – Testing is straightforward: just look up tables or calculate conditional probabilities with normal distributions
• A popular generative model
  – Performance is competitive with most state-of-the-art classifiers, even when the independence assumption is violated
  – Many successful applications, e.g., spam mail filtering
  – A good candidate as a base learner in ensemble learning
  – Apart from classification, naïve Bayes can do more…
Extra Slides
Naïve Bayes (1)
• Revisit: P(Y = y|X = x) = P(X = x, Y = y) / P(X = x)
• Which is equal to P(Y = y|X = x) = P(X = x|Y = y) P(Y = y) / Σ_y' P(X = x|Y = y') P(Y = y')
• Naïve Bayes assumes conditional independence: P(X1, …, Xn|Y) = Π_i P(Xi|Y)
• Then the inference of the posterior is P(Y = y|X = x) ∝ P(Y = y) Π_i P(Xi = xi|Y = y)
Naïve Bayes (2)
• Training: observations are multinomial; supervised, with label information
  – Maximum Likelihood Estimation (MLE): P̂(Xi = x|Y = y) = count(Xi = x, Y = y) / count(Y = y), and P̂(Y = y) = count(Y = y) / N
  – Maximum a Posteriori (MAP): put a Dirichlet prior on the multinomial parameters, which adds pseudo-counts to the MLE counts (Laplace smoothing is the uniform special case)
• Classification: y* = argmax_y P̂(Y = y) Π_i P̂(Xi = xi|Y = y)
Naïve Bayes (3)
• What if we have continuous Xi?
• Generative training: for each class y and attribute Xi, fit a Gaussian by estimating the class-conditional mean μiy and variance σiy² from the training examples labeled y
• Prediction: y* = argmax_y P̂(Y = y) Π_i N(xi; μiy, σiy²)
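In practice this whole train/predict loop is a few lines with an off-the-shelf implementation; a minimal sketch using scikit-learn's GaussianNB, where the toy data is made up for illustration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy continuous features (e.g., size and weight) with binary labels
X = np.array([[3.1, 140.0], [3.4, 155.0], [2.2, 90.0], [2.0, 85.0]])
y = np.array([1, 1, 0, 0])

clf = GaussianNB().fit(X, y)             # fits per-class means and variances
print(clf.predict([[3.0, 150.0]]))       # -> [1]
print(clf.predict_proba([[2.1, 88.0]]))  # class posteriors P(y|x)
```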
Naïve Bayes (4)
• Problems
  – Features may overlap
  – Features may not be independent
    • e.g., the size and weight of a tiger are correlated
  – We use a joint distribution estimate (P(X|Y), P(Y)) to solve a conditional problem (P(Y|X = x))
• Can we train discriminatively? (see the sketch below)
  – Logistic regression
  – Regularization
  – Gradient ascent
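As a contrast with the generative training above, a minimal sketch of discriminatively training logistic regression by gradient ascent on the L2-regularized conditional log-likelihood; function names and hyperparameters are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, lam=0.01, steps=1000):
    """Gradient ascent on the regularized conditional log-likelihood of P(y|x; w).

    X: (n_examples, n_features) float array; y: 0/1 labels of length n_examples.
    """
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)               # predicted P(y=1|x) for every example
        grad = X.T @ (y - p) - lam * w   # log-likelihood gradient minus L2 penalty
        w += lr * grad                   # ascend, since we maximize the likelihood
    return w

# Usage: predict P(y=1|x) with sigmoid(x @ w) and threshold at 0.5
```

Unlike naïve Bayes, this estimates P(Y|X) directly and never models P(X|Y), which is exactly the generative/discriminative distinction drawn at the start of these slides.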