Text categorization Feature selection: chi-square (χ²) test 1


























- Slides: 26
Joint Probability Distribution
The joint probability distribution for a set of random variables X1…Xn gives the probability of every combination of values: P(X1, …, Xn)

              Cold    ¬Cold
  Sneeze      0.08    0.01
  ¬Sneeze     0.01    0.90

The probability of all possible cases can be calculated by summing the appropriate subset of values from the joint distribution. All conditional probabilities can therefore also be calculated, e.g. P(Cold | ¬Sneeze). BUT it's often very hard to obtain all the probabilities for a joint distribution. Slides adapted from Mary Ellen Califf 2
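The slide's point, that any conditional probability can be read off the joint table by summing and renormalizing, can be sketched directly from the four numbers above (a minimal sketch; the helper names are mine, not from the slides):

```python
# Joint distribution P(Sneeze, Cold) from the slide's table.
joint = {
    ("sneeze", "cold"): 0.08,
    ("sneeze", "not_cold"): 0.01,
    ("not_sneeze", "cold"): 0.01,
    ("not_sneeze", "not_cold"): 0.90,
}

def marginal(value, axis):
    """Marginal probability: sum the joint over the other variable
    (axis 0 = sneeze variable, axis 1 = cold variable)."""
    return sum(p for combo, p in joint.items() if combo[axis] == value)

def conditional(cold_value, sneeze_value):
    """P(Cold = cold_value | Sneeze = sneeze_value), by dividing the
    joint entry by the marginal of the conditioning event."""
    return joint[(sneeze_value, cold_value)] / marginal(sneeze_value, 0)

p = conditional("cold", "not_sneeze")
print(f"P(Cold | not Sneeze) = {p:.4f}")  # 0.01 / (0.01 + 0.90) ≈ 0.0110
```

With only two binary variables this is trivial; the slide's caveat is that the table grows exponentially with the number of variables, which is why obtaining the full joint is usually infeasible.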
Bayes Independence Example
Imagine there are diagnoses ALLERGY, COLD, and WELL and symptoms SNEEZE, COUGH, and FEVER. Can these be correct numbers?

  d        P(d)    P(sneeze|d)   P(cough|d)   P(fever|d)
  Well     0.9     0.1           0.1          0.01
  Cold     0.05    0.9           0.8          0.7
  Allergy  0.05    0.9           0.7          0.4

Slides adapted from Mary Ellen Califf 3
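Under the (naive Bayes) assumption that the symptoms are conditionally independent given the diagnosis, the table above is all we need to score a patient, since P(d | evidence) ∝ P(d) · Π P(eᵢ | d). A sketch using the table's values (the evidence combination chosen here, sneeze and cough but no fever, is just an illustration):

```python
# Diagnosis scoring under conditional independence of symptoms given d:
# P(d | e) ∝ P(d) * product over symptoms of P(e_i | d).
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
p_sneeze = {"well": 0.1, "cold": 0.9, "allergy": 0.9}
p_cough = {"well": 0.1, "cold": 0.8, "allergy": 0.7}
p_fever = {"well": 0.01, "cold": 0.7, "allergy": 0.4}

def posterior(sneeze, cough, fever):
    """Normalized posterior over diagnoses given observed symptoms."""
    scores = {}
    for d in priors:
        s = priors[d]
        s *= p_sneeze[d] if sneeze else 1 - p_sneeze[d]
        s *= p_cough[d] if cough else 1 - p_cough[d]
        s *= p_fever[d] if fever else 1 - p_fever[d]
        scores[d] = s
    z = sum(scores.values())  # normalizing constant
    return {d: s / z for d, s in scores.items()}

post = posterior(sneeze=True, cough=True, fever=False)
print(post)  # allergy comes out most probable for this evidence
```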
KL divergence (relative entropy)
A basis for comparing two probability distributions 4
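The slide does not show the formula; the standard definition is D(P‖Q) = Σx P(x) log(P(x)/Q(x)). A minimal sketch over discrete distributions given as lists:

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(P || Q) = sum_x P(x) * log(P(x) / Q(x)).

    Asymmetric (D(P||Q) != D(Q||P) in general) and zero iff P == Q.
    By convention 0 * log(0/q) = 0; D is infinite when q(x) = 0 < p(x).
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue  # 0 * log(0/q) contributes nothing
        if qx == 0:
            return math.inf  # P puts mass where Q puts none
        total += px * math.log(px / qx)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))  # two different values
```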
Text Categorization Applications
- Web pages organized into category hierarchies
- Journal articles indexed by subject categories (e.g., the Library of Congress, MEDLINE, etc.)
- Responses to Census Bureau occupation questions
- Patents archived using the International Patent Classification
- Patient records coded using international insurance categories
- E-mail message filtering (e.g., spam)
- News events tracked and filtered by topic
Slide adapted from Paul Bennett 5
Yahoo News Categories 6
Topic categorization: classify the document into semantic topics
The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan defeated Belarus's Max Mirnyi and Vladimir Voltchkov to give the Americans an insurmountable 3-0 lead in the best-of-five semi-final tie.
One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as the plodding approach of Hurricane Jeanne prompted evacuation orders for hundreds of thousands of Floridians and high wind warnings that stretched 350 miles from the swamp towns south of Miami to the historic city of St. Augustine. 7
The Reuters collection: a gold standard
A collection of 21,578 newswire documents. For research purposes: a standard text collection used to compare systems and algorithms. 135 valid topic categories. 8
Reuters: top topics in the Reuters collection 9
Reuters Document Example
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE>2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TEXT>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE>CHICAGO, March 2 - </DATELINE>
<BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter</BODY></TEXT></REUTERS> 10
Classification vs. Clustering
Classification assumes labeled data: we know how many classes there are, and we have examples for each class (labeled data). Classification is supervised.
In clustering we don't have labeled data; we just assume there is a natural division in the data, and we may not know how many divisions (clusters) there are. Clustering is unsupervised. 11
Categories (Labels, Classes)
Labeling data poses two problems:
- Deciding the possible classes (which ones, how many): domain- and application-dependent
- Labeling the text: difficult, time-consuming, with inconsistency between annotators 12
Reuters Example, revisited
Why not topic = policy?
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE>2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TEXT>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE>CHICAGO, March 2 - </DATELINE>
<BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter</BODY></TEXT></REUTERS> 13
Binary vs. multi-way classification
Binary classification: two classes
Multi-way classification: more than two classes
Sometimes it can be convenient to treat a multi-way problem as a set of binary ones: one class versus all the others, for each class 14
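The one-versus-all decomposition above can be sketched as a relabeling step (a minimal sketch; the tiny document list and labels are illustrative, and any binary learner would then be trained on each relabeled set):

```python
# One-vs-rest decomposition: turn one multi-way labeling into one binary
# problem per class, each asking "is this document in class c or not?".
labels = ["sport", "weather", "politics"]
docs = [
    ("The U.S. swept into the Davis Cup final ...", "sport"),
    ("Hurricane Jeanne prompted evacuation orders ...", "weather"),
]

def one_vs_rest(docs, target):
    """Relabel every document as target (True) vs. not-target (False)."""
    return [(text, label == target) for text, label in docs]

binary_problems = {c: one_vs_rest(docs, c) for c in labels}
```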
Features
>>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> label = "sport"
>>> labeled_text = LabeledText(text, label)
Here the classification takes as input the whole string. What's the problem with that? What are the features that could be useful for this example? 15
Feature terminology
Feature: an aspect of the text that is relevant to the task
Some typical features:
- Words present in the text
- Frequency of words
- Capitalization
- Are there named entities (NEs)?
- WordNet
- Others? 16
Feature terminology
Feature: an aspect of the text that is relevant to the task
Feature value: the realization of the feature in the text
- Words present in the text
- Frequency of a word
- Are there dates? Yes/no
- Are there PERSONs? Yes/no
- Are there ORGANIZATIONs? Yes/no
- WordNet: holonyms (China is part of Asia), synonyms (China, People's Republic of China, mainland China) 17
Feature Types
Boolean (or Binary): features that generate boolean (binary) values. Boolean features are the simplest and the most common type of feature.
f1(text) = 1 if text contains "elections", 0 otherwise
f2(text) = 1 if text contains a PERSON, 0 otherwise 18
Feature Types
Integer: features that generate integer values. Integer features can be used to give classifiers access to more precise information about the text.
f1(text) = number of times "elections" occurs
f2(text) = number of times a PERSON occurs 19
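The boolean and integer features on the last two slides can be sketched as plain functions of the text (a minimal sketch: real PERSON detection would come from an NER system, so the tiny name list here is a hypothetical stand-in):

```python
# Boolean and integer feature extractors over raw text.
# PERSONS is a hypothetical stand-in for a named-entity recognizer.
PERSONS = {"michael schumacher", "bob bryan", "mike bryan"}

def f_contains_elections(text):
    """Boolean feature: 1 if the text contains 'elections', else 0."""
    return 1 if "elections" in text.lower() else 0

def f_elections_count(text):
    """Integer feature: number of times the token 'elections' occurs."""
    return text.lower().split().count("elections")

def f_contains_person(text):
    """Boolean feature: 1 if the text mentions a known PERSON, else 0."""
    lowered = text.lower()
    return 1 if any(name in lowered for name in PERSONS) else 0

doc = "Michael Schumacher took on the Shanghai circuit in qualifying."
features = (f_contains_elections(doc), f_elections_count(doc), f_contains_person(doc))
print(features)  # (0, 0, 1) for this document
```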
χ² statistic (CHI)
The χ² statistic (pronounced "kai square") is a commonly used method of comparing proportions. It measures the lack of independence between a term and a category. 20
χ² statistic (CHI)
Is "jaguar" a good predictor for the "auto" class?

                 Term = jaguar   Term ≠ jaguar
  Class = auto         2              500
  Class ≠ auto         3             9500

We want to compare: the observed distribution above, and the null hypothesis that jaguar and auto are independent. 21
χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?
If independent: Pr(j, a) = Pr(j) · Pr(a), so there would be N · Pr(j, a), i.e. N · Pr(j) · Pr(a), co-occurrences of "jaguar" and "auto".
Pr(j) = (2 + 3)/N; Pr(a) = (2 + 500)/N; N = 2 + 3 + 500 + 9500 = 10005
N · (5/N) · (502/N) = 2510/N = 2510/10005 ≈ 0.25

                 Term = jaguar   Term ≠ jaguar
  Class = auto         2              500
  Class ≠ auto         3             9500
22
χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?

                 Term = jaguar   Term ≠ jaguar
  Class = auto      2 (0.25)          500
  Class ≠ auto      3                9500

observed: fo; expected: fe (in parentheses) 23
χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?

                 Term = jaguar   Term ≠ jaguar
  Class = auto      2 (0.25)       500 (502)
  Class ≠ auto      3 (4.75)      9500 (9498)

observed: fo; expected: fe (in parentheses) 24
χ² statistic (CHI)
χ² sums (fo − fe)²/fe over all table entries:
χ² = (2 − 0.25)²/0.25 + (500 − 502)²/502 + (3 − 4.75)²/4.75 + (9500 − 9498)²/9498 ≈ 12.9
The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the critical value for .999 confidence).

                 Term = jaguar   Term ≠ jaguar
  Class = auto      2 (0.25)       500 (502)
  Class ≠ auto      3 (4.75)      9500 (9498)

observed: fo; expected: fe (in parentheses) 25
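The computation above can be sketched for an arbitrary 2x2 table. Note that with exact (unrounded) expected counts the statistic comes out ≈ 12.85; the slide's 12.9 results from first rounding fe to 0.25, 502, 4.75, and 9498. Either way it exceeds the 10.83 critical value:

```python
# Chi-square for the jaguar/auto 2x2 table. Expected counts follow from
# the independence assumption: fe = row_total * col_total / N.
observed = [[2, 500],   # class = auto:  [term = jaguar, term != jaguar]
            [3, 9500]]  # class != auto

def chi_square(obs):
    """Pearson chi-square: sum of (fo - fe)^2 / fe over all cells."""
    n = sum(sum(row) for row in obs)
    row_totals = [sum(row) for row in obs]
    col_totals = [sum(col) for col in zip(*obs)]
    chi2 = 0.0
    for i, row in enumerate(obs):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n
            chi2 += (fo - fe) ** 2 / fe
    return chi2

chi2 = chi_square(observed)
print(f"chi^2 = {chi2:.2f}")  # ≈ 12.85 > 10.83, so reject independence at .999
```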
χ² statistic (CHI)
There is a simpler formula for χ² on a 2×2 table:
A = #(t, c)   B = #(t, ¬c)   C = #(¬t, c)   D = #(¬t, ¬c)   N = A + B + C + D
χ²(t, c) = N · (A·D − C·B)² / ((A + C) · (B + D) · (A + B) · (C + D)) 26
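The closed form can be checked against the jaguar/auto counts; it gives the same value as summing (fo − fe)²/fe cell by cell (a minimal sketch using the counts from the earlier slides):

```python
# Closed-form chi-square for a 2x2 contingency table (the "simpler
# formula" above), applied to the jaguar/auto example.
def chi_square_2x2(a, b, c, d):
    """A = #(t,c), B = #(t,~c), C = #(~t,c), D = #(~t,~c)."""
    n = a + b + c + d
    return n * (a * d - c * b) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

# t = jaguar, c = auto: A=2 (jaguar & auto), B=3 (jaguar & not auto),
# C=500 (not jaguar & auto), D=9500 (neither).
chi2 = chi_square_2x2(2, 3, 500, 9500)
print(f"chi^2 = {chi2:.2f}")  # ≈ 12.85, matching the cell-by-cell sum
```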