
Information Organization: Classification

Classification: Type
- Single-Label vs. Multi-Label
  - Single-Label: non-overlapping categories
  - Multi-Label: overlapping categories
- Document-Pivoted vs. Category-Pivoted
  - Document-Pivoted: document → categories (e.g. email filtering)
  - Category-Pivoted: category → documents (e.g. a newly created category)
- Binary vs. Non-Binary
  - Binary: yes|no (i.e. hard classification)
  - Non-Binary: multi-level (e.g. yes|no|maybe), scored/ranked
- Automatic vs. Manual
  - Automatic: no human intervention
  - Manual: no machine intervention
  - (Interactive/Semi-automatic in between)
- Machine-Learning (ML) vs. Knowledge-Engineering (KE)
  - ML: automatic classifier "learned" from training data
  - KE: rule-based (e.g. if a or b or c, then category C)
Search Engine 2

Classification: Binary vs. Multi-Class
- Binary Classification
  - Task of classifying an item of a given set into one of two groups (based on its properties)
  - Examples:
    - (color) Put a tennis ball into the Color or No-Color bin
    - (spam filtering) Decide if an email is spam or not
    - (medical test) Determine if a patient has a certain disease or not
    - (quality control test) Decide if a product should be sold or discarded
    - (IR test) Determine if a document should be in the search results or not
- Multi-Class Classification
  - Task of classifying an item of a given set into one of multiple groups (based on its properties)
  - Examples:
    - (color) Put a tennis ball into the Green, Orange, or White ball bin
    - Decide if an email is advertisement, newsletter, phishing, hack, or personal
    - Classify a document into Yahoo! categories
    - (optical character recognition) Classify a scanned character into a digit (0..9)

Classification: Multi-Class
- One vs. All
  - M binary classifiers (BC) for M-class classification
  - each BC trained to separate its own class from the rest
    - e.g., (0 vs. 1..9), (1 vs. 0, 2..9), …, (9 vs. 0..8)
  - winner-take-all: the class with the highest BC score wins
  - Characteristics:
    - several binary classifiers can assign an item to their classes
    - asymmetric training (many more negative than positive examples)
- Pairwise Classification
  - M(M-1)/2 binary classifiers for M-class classification
  - a classifier trained for each possible pair of classes
    - e.g., (0 vs 1), (0 vs 2), …, (0 vs 9), (1 vs 2), …, (1 vs 9), …, (8 vs 9)
  - Voting: the class with the highest number of classifier votes wins
  - DAG: the pairwise classifiers arranged in a directed acyclic graph
  - Characteristics:
    - need to train many binary classifiers
    - symmetric training (smaller problem space)
- Multi-Class Objective Function
  - 1 M-class classifier
  - the classifier is trained to output an ordering of classes; the first class is the winner
  - Characteristics: solves the problem directly
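The classifier counts for the strategies above can be checked directly. A minimal sketch, using digit recognition (M = 10) as the running example; the function names are illustrative, not from any library:

```python
# Sketch: how many binary classifiers each multi-class strategy requires.

def one_vs_all_count(M):
    """One binary classifier per class (class vs. rest)."""
    return M

def pairwise_count(M):
    """One binary classifier per unordered pair of classes: M(M-1)/2."""
    return M * (M - 1) // 2

M = 10  # digits 0..9
print(one_vs_all_count(M))  # 10: (0 vs 1..9), ..., (9 vs 0..8)
print(pairwise_count(M))    # 45: (0 vs 1), (0 vs 2), ..., (8 vs 9)
```

The asymmetry noted above follows from the counts: each one-vs-all classifier sees all M-1 other classes as negatives, while each pairwise classifier trains on only two classes.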

Classification: Procedure
1. Select/Prepare training data
   - set of categories and classified items (positive and negative samples)
   - split data into training and validation/test sets
2. Build the classifier
   1. Feature Representation: transform documents into a set of features (e.g., terms); a classifier is a feature vector consisting of the most "important" terms for the class
   2. Dimensionality Reduction: select the best set of features to improve accuracy and prevent overfitting
   3. Train/Learn the classifier on the training set
   4. Optimize the classifier on the test set: test/evaluate the classifier to tune parameters (e.g. thresholds, feature count/weights)
   5. Retrain the classifier on the whole data
3. Apply the classifier to new data

Classifier (ML) Algorithms
- probability of class membership: e.g. Bayes method
- similarity to class feature vector: e.g. Rocchio method, k-NN method
- other: Support Vector Machine, Decision Tree, Regression model, etc.
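The procedure above can be sketched end to end on toy data. The documents, labels, and helper names below are hypothetical, and the "classifier" is deliberately crude (a per-class term profile) just to show the split/represent/train/apply steps:

```python
# Minimal sketch of steps 1-3: prepare data, build a classifier, apply it.

docs = [("cheap pills buy now", "spam"), ("meeting at noon", "ham"),
        ("buy cheap meds now", "spam"), ("lunch meeting today", "ham")]

# 1. split into training and held-out test sets
train, test = docs[:3], docs[3:]

# 2a. feature representation: a document becomes its set of terms
def features(text):
    return set(text.split())

# 2c. "train": collect the terms seen per class into a class profile
profile = {}
for text, label in train:
    profile.setdefault(label, set()).update(features(text))

# 3. apply: assign the class whose profile overlaps the document most
def classify(text):
    return max(profile, key=lambda c: len(profile[c] & features(text)))

print(classify("cheap pills"))     # overlaps the spam profile
print(classify("meeting today"))   # overlaps the ham profile
```

A real pipeline would add the dimensionality-reduction and parameter-tuning steps (2b, 2d) between representation and final retraining.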

Dimension Reduction: Why
- Initial features: size, color, shape, pattern
- Group 1 (BG, SM): size
- Group 2 (SQ, CR): shape
- Group 3 (BL, RD): color
- Group 4?
[Figure: example objects partitioned into Groups 1-4 and Classes 1-3 by the candidate features]

Classification: Dimension Reduction
- Feature Reduction
  - stopping, stemming & lemmatization
- Feature Selection: select a subset of the original feature set
  - Document Frequency (df)
  - Information Gain (IG)
    - measures the usefulness (gain in information) of a feature in predicting a class
    - i.e. Kullback-Leibler divergence
    - best performance w/ a small feature set
  - Mutual Information (MI)
    - measures the dependency between a feature and a class
    - sensitive to small counts
  - Chi Square (χ2)
    - measures the lack of independence between a feature and a class
    - a way of measuring the degree to which two patterns (expected & observed) differ:
      χ2 = Σ (observed freq − expected freq)² / expected freq
    - computed from a 2×2 contingency table:

                    Class (C)
      Feature (t)    Y   N
          Y          A   B
          N          C   D

  - Term Strength (TS)
    - estimates term importance based on how commonly a term is likely to appear in related documents
    - TS(t) = P(t ∈ x | t ∈ y), where x & y are a pair of related documents (e.g., from the training set)
- Feature Extraction: extract a set of features by combination or transformation of the original feature set
  - Term Clustering
  - Latent Semantic Indexing (SVD/PCA)
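As an illustration of one of these selection measures, here is a minimal sketch of Mutual Information estimated from the same A/B/C/D contingency counts used for chi-square. The counts themselves are hypothetical:

```python
import math

# Sketch: pointwise mutual information of a feature t with a class C,
# estimated from 2x2 contingency counts:
#   A = docs with t in C,    B = docs with t not in C,
#   C_cnt = docs w/o t in C, D = docs w/o t not in C.

def mutual_information(A, B, C_cnt, D):
    N = A + B + C_cnt + D
    # MI(t, C) = log P(t, C) / (P(t) * P(C)), which reduces to
    # log (A * N) / ((A + B) * (A + C_cnt)) under count estimates
    return math.log((A * N) / ((A + B) * (A + C_cnt)))

print(mutual_information(20, 30, 20, 30))            # 0.0 (feature independent of class)
print(round(mutual_information(40, 10, 10, 40), 2))  # 0.47 (positive association)
```

The sensitivity to small counts noted above shows up in the formula: a rare term (small A+B) can get a large MI score even from very little evidence.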

Bayes Classifier
- Pick the most probable class c, given the evidence d:
  c = argmax_Cj [ P(Cj | d) ]
  - Cj = class/category j
  - d = document (t1, t2, …, tn)
  - P(Cj | d) = probability that document d belongs to category j
              = P(Cj) P(d | Cj) / P(d)  ∝  P(Cj) P(d | Cj)
  - P(Cj) = probability that a randomly picked document belongs to category j
  - P(d | Cj) = probability that category j contains document d
- Naïve Bayes assumption: features are independent given the class, so
  P(d | Cj) = Πk P(tk | Cj)
  - P(Cj) = number of documents belonging to Cj / total number of documents
  - P(tk | Cj) = probability of a term (i.e. feature) k occurring in category j
               = number of documents in Cj with term k / number of documents in Cj
- Example: [figure omitted]
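The document-count estimates above translate almost directly into code. A minimal Naïve Bayes sketch on hypothetical toy data, using add-one smoothing (an addition not on the slide, needed so unseen terms do not zero out the product) and log-probabilities to avoid underflow:

```python
import math

# Sketch: Naive Bayes with P(Cj) and P(tk|Cj) estimated from document counts.

train = [("cheap pills now", "spam"), ("cheap meds now", "spam"),
         ("meeting at noon", "ham"), ("project meeting now", "ham")]

classes = {label for _, label in train}
n_docs = len(train)

def prior(c):                  # P(Cj) = docs in Cj / total docs
    return sum(1 for _, l in train if l == c) / n_docs

def p_term(t, c):              # P(tk|Cj) = docs in Cj containing tk / docs in Cj
    in_class = [set(d.split()) for d, l in train if l == c]
    # add-one smoothing (hypothetical refinement, not on the slide)
    return (sum(t in d for d in in_class) + 1) / (len(in_class) + 2)

def classify(doc):
    # argmax over classes of log P(Cj) + sum_k log P(tk|Cj)
    return max(classes, key=lambda c: math.log(prior(c)) +
               sum(math.log(p_term(t, c)) for t in doc.split()))

print(classify("cheap pills"))      # spam
print(classify("project meeting"))  # ham
```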

Rocchio Classifier
- Build the class vectors using Rocchio's Relevance Feedback formula
  - no initial query: the class prototype vector = average vector of the class
  - R = positive examples, S = negative examples
- Compute document-class similarity
  - the class vector created from the training data is used to classify new documents
- Rank classes by similarity
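A minimal sketch of the prototype-vector idea, using only positive examples (the simplified "average vector of the class" case above, without the negative-example term of the full Rocchio formula). Documents and class names are hypothetical:

```python
import math
from collections import Counter

# Sketch: each class = centroid of its training documents' term vectors;
# a new document gets the class with the highest cosine similarity.

train = {"sports": ["goal match team", "team match win"],
         "tech":   ["code software bug", "software release code"]}

def vec(text):
    return Counter(text.split())

def centroid(docs):
    total = Counter()
    for d in docs:
        total += vec(d)
    return {t: f / len(docs) for t, f in total.items()}

def cosine(u, v):
    dot = sum(u.get(t, 0) * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

prototypes = {c: centroid(docs) for c, docs in train.items()}

def classify(text):
    # rank classes by document-prototype similarity; return the top one
    return max(prototypes, key=lambda c: cosine(vec(text), prototypes[c]))

print(classify("software bug report"))  # tech
```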

K-Nearest Neighbor Classifier
- Similar to the Rocchio classifier
  - All instances correspond to points in an n-dimensional Euclidean space (e.g. vector space)
  - Instead of using static class centroid vectors, uses the document's k nearest neighbors in the training set to compute doc-class similarity
    - find the k nearest neighbors of the document in each category of the training set
    - rank categories by doc-kNN similarity
[Figure: 1-NN vs. 3-NN examples]
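The 1-NN vs. 3-NN contrast can be shown with a few 2-D points. A minimal sketch with hypothetical data, using majority vote among the k nearest training points as the doc-class similarity:

```python
import math
from collections import Counter

# Sketch: k-NN classification by majority vote over Euclidean neighbors.

train = [((1.0, 1.0), "A"), ((1.5, 1.2), "A"), ((0.8, 0.9), "A"),
         ((5.0, 5.0), "B"), ((5.5, 4.8), "B"), ((4.9, 5.3), "B")]

def knn(point, k):
    # sort training points by distance to the query, keep the k nearest
    nearest = sorted(train, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn((1.2, 1.1), 3))  # A (all 3 nearest neighbors are class A)
print(knn((5.1, 5.0), 1))  # B (1-NN: the single nearest point decides)
```

With k = 1 a single noisy training point can flip the decision; larger k smooths the decision boundary, which is the point of the 1-NN vs. 3-NN figure.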

Other Classifiers
- Support Vector Machine
  - Assumes linear separability
    - In 2 dimensions, can separate by a line (ax + by = c)
    - In higher dimensions, separate by a hyperplane
  - Which hyperplane to choose? Maximize the margin
  - Support vectors: the decision function is fully specified by a subset of the training samples, the support vectors
- Decision Tree
  - nodes = terms, branches = probabilities, leaves = categories
- Rule-based
  - decision rules, e.g., if X or Y or Z, then C1
- Regression
  - fitting of training data to a real-valued function
  - e.g., Linear Least Square Fit

Classification: Problems
- Noisy Data
  - training data often contains noise
  - false positives and false negatives (e.g. medical tests)
- Inconsistent Classification
  - static, ordered structure
  - categorization: does tomato belong in the fruit or the vegetable category?
  - category labels: retrieval vs. search vs. IR
  - indexing inconsistency/error
- Resource Intensive
- Solutions?
  - Faceted classification
  - Fusion of IR & IO

Supplemental Material (Optional)
For curious-minded and advanced learners

Information Theory
- Information Content
  - Information conveyed by a message: I(p) = −log2(p) ( = log2(n) for n equally likely messages)
  - The higher the probability of a message (i.e., the more predictable the message), the less its information content
- Entropy
  - Expected amount of information conveyed by a message (i.e., expected information content)
  - Can be regarded as a measure of uncertainty
  - Information conveyed by the distribution P = (p1, p2, …, pn) is the entropy E of P:
    E(P) = −(p1·log2(p1) + p2·log2(p2) + … + pn·log2(pn))
  - Examples:
    - P = (0.5, 0.5): E(P) = −2·(1/2)·log2(1/2) = −1·(−1) = 1
      - fair coin toss: maximum entropy (uncertainty)
    - P = (0.67, 0.33): E(P) = −0.67·log2(0.67) − 0.33·log2(0.33) = −0.67·(−0.58) − 0.33·(−1.60) ≈ 0.92
      - unfair coin toss (p1 = heads): 2 of 3 tosses come up heads (less uncertainty)
    - P = (1, 0): E(P) = −1·log2(1) − 0·log2(0) = −1·(0) − 0 = 0
      - fixed coin toss: always heads (maximum certainty, no new information, zero entropy)
- Information Gain: the expected reduction in entropy
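The three coin-toss examples above can be reproduced with a few lines of code, a direct transcription of E(P):

```python
import math

# Sketch: entropy of a discrete distribution, matching E(P) above.

def entropy(probs):
    # 0 * log2(0) is taken as 0 by convention, hence the p > 0 filter
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))              # 1.0 (fair coin: maximum uncertainty)
print(entropy([1.0, 0.0]))              # 0.0 (fixed coin: no uncertainty)
print(round(entropy([0.67, 0.33]), 2))  # 0.91 (the slide's 0.92 rounds the logs first)
```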

Information Gain: Example
[Table: the 14-day "Play Tennis" training set (days D1-D14) with attributes Outlook {Sunny, Overcast, Rain}, Temperature {Hot, Mild, Cool}, Humidity {High, Normal}, Wind {Weak, Strong}, and target Play Tennis {Yes, No}]

Values(Wind) = {Weak, Strong}, S = the set of Play Tennis training examples
E(S) = −9/14·log2(9/14) − 5/14·log2(5/14) = 0.940
E(S_Weak) = −6/8·log2(6/8) − 2/8·log2(2/8) = 0.811
E(S_Strong) = −3/6·log2(3/6) − 3/6·log2(3/6) = 1.0
IG(S, Wind) = E(S) − (8/14)·E(S_Weak) − (6/14)·E(S_Strong)
            = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048
IG(S, Outlook) = 0.246
IG(S, Humidity) = 0.151
IG(S, Temperature) = 0.029
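The IG(S, Wind) computation above can be verified numerically from the class counts alone:

```python
import math

# Sketch: reproducing IG(S, Wind) from the worked example above.

def entropy(pos, neg):
    total = pos + neg
    return -sum(p * math.log2(p) for p in (pos / total, neg / total) if p > 0)

E_S = entropy(9, 5)        # 9 Yes / 5 No overall
E_weak = entropy(6, 2)     # Wind = Weak subset: 6 Yes / 2 No
E_strong = entropy(3, 3)   # Wind = Strong subset: 3 Yes / 3 No

# expected reduction in entropy after splitting on Wind
IG_wind = E_S - (8 / 14) * E_weak - (6 / 14) * E_strong
print(round(E_S, 3), round(IG_wind, 3))  # 0.94 0.048
```

Since IG(S, Outlook) = 0.246 is the largest of the four gains, Outlook would be chosen as the root test of the decision tree.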

Chi Square
- Chi Square (χ2) measures the lack of independence between a feature and a class
  - a way of measuring the degree to which two patterns (expected & observed) differ:
    χ2 = Σ (observed freq − expected freq)² / expected freq
- Null Hypothesis: feature & class are independent
- Expected Frequencies (Cᶜ = not in class, tᶜ = term absent):
  E(A) = N·p(t)·p(C)  = N·(A+B)/N·(A+C)/N = (A+B)(A+C)/N
  E(B) = N·p(t)·p(Cᶜ) = N·(A+B)/N·(B+D)/N = (A+B)(B+D)/N
  E(C) = N·p(tᶜ)·p(C)  = N·(C+D)/N·(A+C)/N = (C+D)(A+C)/N
  E(D) = N·p(tᶜ)·p(Cᶜ) = N·(C+D)/N·(B+D)/N = (C+D)(B+D)/N

                 Class (C)
  Feature (t)    Y     N     Total
      Y          A     B     A+B
      N          C     D     C+D
    Total       A+C   B+D    N
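The statistic can be computed cell by cell exactly as defined above. A minimal sketch with hypothetical counts:

```python
# Sketch: chi-square from the 2x2 table above, using E(A)..E(D).

def chi_square(A, B, C, D):
    N = A + B + C + D
    expected = {"A": (A + B) * (A + C) / N, "B": (A + B) * (B + D) / N,
                "C": (C + D) * (A + C) / N, "D": (C + D) * (B + D) / N}
    observed = {"A": A, "B": B, "C": C, "D": D}
    # sum of (O - E)^2 / E over the four cells
    return sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)

# Feature independent of the class: observed == expected in every cell
print(chi_square(20, 30, 20, 30))  # 0.0
# Feature perfectly associated with the class: chi-square == N
print(chi_square(40, 0, 0, 60))    # 100.0
```

A large value rejects the null hypothesis of independence, which is why high-χ2 features are good candidates for feature selection.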


Decision Tree: Example
[Table: four patients (Mary, Fred, Julie, Elvis) described by features Cough, Fever, Weight, and Pain, classified as flu, appendicitis, flu, and heart disease respectively]
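A learned decision tree is ultimately a nested sequence of feature tests ending in class leaves. A hand-built sketch in that spirit, over the kind of symptom features in the table above; the split order and the "healthy" leaf are hypothetical, chosen only to illustrate the structure, not a tree induced from the slide's data:

```python
# Sketch: a decision tree as nested feature tests; each return is a leaf.

def diagnose(cough, fever, pain):
    if pain == "chest":
        return "heart disease"   # leaf
    if pain == "abdomen":
        return "appendicitis"    # leaf
    if fever or cough:
        return "flu"             # leaf
    return "healthy"             # leaf (hypothetical, not in the slide's table)

print(diagnose(cough=False, fever=True, pain="throat"))  # flu
print(diagnose(cough=True, fever=True, pain="chest"))    # heart disease
```

Algorithms such as ID3 choose each test automatically by picking the attribute with the highest information gain, as in the Play Tennis example earlier.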