Data Classification and Segmentation: Bayesian Methods III

Book: Section 7.4; Domingos' paper (online)
Instructor: Qiang Yang, Hong Kong University of Science and Technology (qyang@cs.ust.hk)
Thanks: Dan Weld, Eibe Frank
First, a Review of Naïve Bayes
Naïve Bayes Is Surprisingly Good. Why?
Independence Test

- Is Naïve Bayes' good performance due to independence of the attributes?
- Model the dependence of attributes Am and An with a measure D(Am, An | C):
  - D(…) is zero when the attributes are completely independent
  - D(…) is large if they are dependent
How to Measure Independence?

- H(A|C): once the class C is given, how much information about A remains?
  - In other words, how random is A once C is given?
- Randomness is measured by entropy: -p*log(p), summed over all possible values.
- Thus, to measure the dependence of A on C, we look at each value Ci of C, such that under Ci, we:
  - find the entropy of A
  - then average over all possible Ci, weighted by Pr(Ci)
Measuring Dependency: Windy

- Example: C = Play attribute, A = Windy attribute. C1 = yes, C2 = no; Pr(C = yes) = 9/14.
- When C = C1 (Play = yes):
  - A = True: 3 counts; A = False: 6 counts
  - Pr(A=True | C=C1) = 3/9, Pr(A=False | C=C1) = 6/9
- When C = C2 (Play = no):
  - A = True: 3 counts; A = False: 2 counts
  - Pr(A=True | C=C2) = 3/5, Pr(A=False | C=C2) = 2/5

H(A|C) = (9/14)*[-(3/9)log(3/9) - (6/9)log(6/9)] + (5/14)*[-(3/5)log(3/5) - (2/5)log(2/5)]

Counts from the weather data:

Windy | Play=yes | Play=no
TRUE  |    3     |    3
FALSE |    6     |    2
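The computation on this slide can be sketched in code. A minimal sketch (Python assumed; helper names are my own, not from the slides) that computes H(Windy | Play) from the counts above:

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def conditional_entropy(counts_by_class):
    """H(A|C): entropy of A within each class, weighted by Pr(C)."""
    grand_total = sum(sum(counts) for counts in counts_by_class.values())
    return sum(
        (sum(counts) / grand_total) * entropy(counts)
        for counts in counts_by_class.values()
    )

# Windy counts from the slide:
#   Play=yes -> {True: 3, False: 6}, Play=no -> {True: 3, False: 2}
h_windy_given_play = conditional_entropy({"yes": [3, 6], "no": [3, 2]})
print(round(h_windy_given_play, 4))  # about 0.937 bits
```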
Properties of H(A|C)

- If A is completely dependent on C, what is H(A|C)?
  - In the extreme case, whenever C = yes, A = true; whenever C = no, A = false.
  - Pr(A=true | C=yes) = 1 and Pr(A=false | C=yes) = 0
  - Then H(A|C) = 0: knowing C leaves no randomness in A.
- The other extreme: if A is completely independent of C, what is H(A|C)?
  - The conditional probabilities Pr(A|C) lie strictly between 0 and 1, and H(A|C) = H(A): knowing C tells us nothing about A.
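These two extremes can be checked numerically. A small sketch (Python assumed; the helper name and the toy counts are my own):

```python
from math import log2

def cond_entropy(counts_by_class):
    """H(A|C) from raw counts of A's values within each class of C."""
    total = sum(sum(c) for c in counts_by_class.values())
    h = 0.0
    for counts in counts_by_class.values():
        class_total = sum(counts)
        for c in counts:
            if c > 0:
                h -= (class_total / total) * (c / class_total) * log2(c / class_total)
    return h

# Complete dependence: C=yes always gives A=true, C=no always gives A=false.
print(cond_entropy({"yes": [7, 0], "no": [0, 7]}))   # 0.0 -- no randomness left in A

# Complete independence: A has the same 50/50 distribution under both classes.
print(cond_entropy({"yes": [4, 4], "no": [4, 4]}))   # 1.0 -- equals H(A)
```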
Extending to Two Attributes

- For A1 and A2, we simply consider the Cartesian product of their values.
- This gives a single new attribute A1A2.
- Then we measure the dependency of A1A2 on the class C.
- Example: consider A1 = Humidity and A2 = Windy, Class = Play
  - Humidity = {high, normal}, Windy = {True, False}
  - A1A2 = Humidity.Windy = {high.True, high.False, normal.True, normal.False}
Humidity.Windy | Play
high.FALSE     | no
high.TRUE      | no
high.FALSE     | yes
normal.TRUE    | no
normal.TRUE    | yes
high.FALSE     | no
normal.FALSE   | yes
normal.TRUE    | yes
high.TRUE      | yes
normal.FALSE   | yes
high.TRUE      | no
Putting Them Together

- When the two attributes Am and An are independent given C:
  - H(Am.An | C) = H(Am | C) + H(An | C), so D(…) = 0
- When they are dependent:
  - D(…) is a large value
- Thus D(Am, An | C) = H(Am | C) + H(An | C) - H(Am.An | C) is a measure of the dependency between the attributes, given the class.
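As a sanity check, D can be computed directly from data. A sketch (Python assumed; function names and the two synthetic datasets are my own) computing D(A1, A2 | C) = H(A1|C) + H(A2|C) - H(A1.A2|C):

```python
from math import log2
from collections import Counter

def cond_entropy(pairs):
    """H(X|C) from a list of (x, c) pairs."""
    n = len(pairs)
    class_totals = Counter(c for _, c in pairs)
    joint = Counter(pairs)
    # H(X|C) = -sum over (x,c) of Pr(x,c) * log2 Pr(x|c)
    return -sum(
        (count / n) * log2(count / class_totals[c])
        for (x, c), count in joint.items()
    )

def dependency(rows):
    """D(A1, A2 | C) for rows of (a1, a2, c): zero iff A1, A2 independent given C."""
    h1 = cond_entropy([(a1, c) for a1, a2, c in rows])
    h2 = cond_entropy([(a2, c) for a1, a2, c in rows])
    h12 = cond_entropy([((a1, a2), c) for a1, a2, c in rows])
    return h1 + h2 - h12

# Given C, A2 copies A1 -> completely dependent: D = H(A1|C) = 1 bit here.
dependent = [(a, a, "c") for a in (0, 1) for _ in range(4)]
print(dependency(dependent))    # 1.0

# Given C, every (A1, A2) combination is equally likely -> independent: D = 0.
independent = [(a1, a2, "c") for a1 in (0, 1) for a2 in (0, 1)]
print(dependency(independent))  # 0.0
```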
Most domains are not independent.
Why Does Naïve Bayes Perform So Well? (Section 4 of the paper)

- Assume three attributes A, B, C and two classes: + and - (say, Play = yes means +).
- Assume A and B are the same: completely dependent, so B = A.
- Assume Pr(+) = Pr(-) = 0.5.
- Assume A and C are independent given the class:
  - Thus Pr(A, C | +) = Pr(A | +) * Pr(C | +)
- Optimal decision: if Pr(+)*Pr(A, B, C | +) > Pr(-)*Pr(A, B, C | -), then answer = +; else answer = -.
- Since B = A: Pr(A, B, C | +) = Pr(A, A, C | +) = Pr(A | +)*Pr(C | +); likewise for -.
- Thus the optimal rule is: answer + iff
  - Pr(A | +)*Pr(C | +) > Pr(A | -)*Pr(C | -)
Analysis

- If we use the Naïve Bayes method:
  - If Pr(+)*Pr(A|+)*Pr(B|+)*Pr(C|+) > Pr(-)*Pr(A|-)*Pr(B|-)*Pr(C|-), then answer = +; else answer = -.
- Since B = A and Pr(+) = Pr(-), this reduces to:
  - Pr(A|+)^2 * Pr(C|+) > Pr(A|-)^2 * Pr(C|-)
Simplifying the Optimal Formula

- Let Pr(+|A) = p and Pr(+|C) = q.
- By Bayes' rule: Pr(A|+) = Pr(+|A)*Pr(A)/Pr(+) = p*Pr(A)/Pr(+)
  - Likewise Pr(A|-) = (1-p)*Pr(A)/Pr(-)
- Pr(C|+) = Pr(+|C)*Pr(C)/Pr(+) = q*Pr(C)/Pr(+)
  - Likewise Pr(C|-) = (1-q)*Pr(C)/Pr(-)
- Thus the optimal rule "if Pr(+)*Pr(A|+)*Pr(C|+) > Pr(-)*Pr(A|-)*Pr(C|-), then answer = +; else answer = -" becomes:
  - p*q > (1-p)*(1-q)   (Eq 1)
Simplifying the NB Formula

- The Naïve Bayes rule Pr(A|+)^2 * Pr(C|+) > Pr(A|-)^2 * Pr(C|-) becomes, in the same notation:
  - p^2 * q > (1-p)^2 * (1-q)   (Eq 2)
- Thus, to understand why Naïve Bayes performs so well, we ask:
  - When does the optimal decision agree (or differ) with the Naïve Bayes decision?
  - That is, where do formulas (Eq 1) and (Eq 2) agree or disagree?
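The agreement between the two rules can be checked numerically. A sketch (Python assumed; grid size is an arbitrary choice) that sweeps p and q over the unit square and compares the decisions from (Eq 1) and (Eq 2):

```python
def optimal_decision(p, q):
    """(Eq 1): answer '+' iff p*q > (1-p)*(1-q)."""
    return p * q > (1 - p) * (1 - q)

def naive_bayes_decision(p, q):
    """(Eq 2): answer '+' iff p^2 * q > (1-p)^2 * (1-q)."""
    return p * p * q > (1 - p) ** 2 * (1 - q)

# Compare the two decisions over an interior grid of (p, q) values.
steps = 200
agree = 0
total = 0
for i in range(1, steps):
    for j in range(1, steps):
        p, q = i / steps, j / steps
        total += 1
        if optimal_decision(p, q) == naive_bayes_decision(p, q):
            agree += 1

print(f"agreement over the (p, q) grid: {agree / total:.1%}")
```

The two rules agree on roughly nine tenths of the square, disagreeing only in two narrow symmetric regions.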
[Figure: the (p, q) unit square showing the optimal and Naïve Bayes decision boundaries; the two rules agree everywhere except in two narrow "disagree" regions on either side of the optimal boundary.]
Conclusion

- In most cases, Naïve Bayes makes the same decision as the optimal classifier.
  - That is, its error rate is close to the minimal (optimal) error rate.
- This has been confirmed in many practical applications.
Applications of the Bayesian Method

- Gene analysis
  - Nir Friedman, Iftach Nachman, Dana Pe'er, Institute of Computer Science, Hebrew University
- Text and email analysis
  - Spam email filters
  - News classification for personal news delivery on the Web
- User profiles (Microsoft)
- Credit analysis in the financial industry
  - Analyze the probability of payment for a loan
Gene Interaction Analysis

- DNA
  - DNA is a double-stranded molecule
  - Hereditary information is encoded in it
  - The two strands pair by complementation rules
- Gene
  - A gene is a segment of DNA
  - It contains the information required to make a protein
Gene Interaction Result

- Example of interaction between proteins for gene SVS1.
- The width of each edge corresponds to the conditional probability.
Spam Killer

- Bayesian methods are used to weed out spam emails.
Constructing Your Training Data

- Each email is one record: M.
- Emails are classified by the user into:
  - Spam: the + class
  - Non-spam: the - class
- An email M is classified as spam if:
  - Pr(+|M) > Pr(-|M)
- Features:
  - Words, with values {1, 0} or {frequency}
  - Phrases
  - Attachment: {yes, no}
- How accurate? TP rate > 90%.
  - We want the FP rate to be as low as possible.
  - Those are the emails that are non-spam but are classified as spam.
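The setup above can be sketched as a tiny Naive Bayes spam filter. This is a minimal illustration (Python assumed; the toy emails, word-count features, and Laplace smoothing are my own choices, not from the slides):

```python
from math import log
from collections import Counter

def train(emails):
    """emails: list of (text, label) with label '+' (spam) or '-' (non-spam)."""
    word_counts = {"+": Counter(), "-": Counter()}
    class_counts = Counter()
    for text, label in emails:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts["+"]) | set(word_counts["-"])
    return word_counts, class_counts, vocab

def classify(text, word_counts, class_counts, vocab):
    """Return '+' if Pr(+|M) > Pr(-|M), using log-probabilities."""
    n = sum(class_counts.values())
    scores = {}
    for label in ("+", "-"):
        total_words = sum(word_counts[label].values())
        score = log(class_counts[label] / n)  # log prior
        for word in text.split():
            # Laplace smoothing so unseen words don't zero out the product.
            score += log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return "+" if scores["+"] > scores["-"] else "-"

training = [
    ("win money now", "+"), ("free money offer", "+"),
    ("meeting schedule today", "-"), ("project meeting notes", "-"),
]
model = train(training)
print(classify("free money win", *model))         # +
print(classify("project meeting today", *model))  # -
```

A real filter would use many more features (phrases, attachment flags) and far more training data, as the slide suggests.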
Naïve Bayes in Oracle9i
http://otn.oracle.com/products/oracle9i/htdocs/o9idm_faq.html

Q: What is the target market?
A: Oracle9i Data Mining is best suited for companies that have lots of data, are committed to the Oracle platform, and want to automate and operationalize their extraction of business intelligence. The initial end user is a Java application developer, although the end user of the application enhanced by data mining could be a customer service rep, marketing manager, customer, business manager, or just about any other imaginable user.

Q: What algorithms does Oracle9i Data Mining support?
A: Oracle9i Data Mining provides programmatic access to two data mining algorithms embedded in Oracle9i Database through a Java-based API. Data mining algorithms are machine-learning techniques for analyzing data for specific categories of problems. Different algorithms are good at different types of analysis. Oracle9i Data Mining provides two algorithms: Naive Bayes for classifications and predictions, and Association Rules for finding patterns of co-occurring events. Together, they cover a broad range of business problems.

Naive Bayes: Oracle9i Data Mining's Naive Bayes algorithm can predict binary or multi-class outcomes. In binary problems, each record either will or will not exhibit the modeled behavior. For example, a model could be built to predict whether a customer will churn or remain loyal. Naive Bayes can also make predictions for multi-class problems where there are several possible outcomes. For example, a model could be built to predict which class of service will be preferred by each prospect.

- Binary model example: Q: Is this customer likely to become a high-profit customer? A: Yes, with 85% probability.
- Multi-class model example: Q: Which one of five customer segments is this customer most likely to fit into: Grow, Stable, Defect, Decline, or Insignificant? A: Stable, with 55% probability.