Introduction to NLP Techniques: Classification & Clustering
Classification
Overview
• Classification (categorization, supervised learning)
• Framework
  • Set of labeled items whose class is known (the training documents)
  • Training procedure -> classifier
  • Set of testing documents whose class is known
    • Compare the classifier's predictions to the true labels
    • Measure accuracy
  • Use the classifier on documents whose class is not known
• (A minimal end-to-end sketch of this framework follows.)
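The sketch below walks through the framework above: train on labeled documents, measure accuracy on held-out labeled documents, then classify unseen text. scikit-learn and a Naïve Bayes model are assumed choices for illustration, not prescribed by the lecture, and the toy data is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train_docs   = ["win a free prize", "cheap loans now", "meeting agenda attached", "lunch on friday?"]
train_labels = ["spam", "spam", "work", "work"]
test_docs    = ["free prize inside", "agenda for the meeting"]
test_labels  = ["spam", "work"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)        # training documents -> vectors
clf = MultinomialNB().fit(X_train, train_labels)      # training procedure -> classifier

X_test = vectorizer.transform(test_docs)
print(accuracy_score(test_labels, clf.predict(X_test)))   # compare predictions to truth

print(clf.predict(vectorizer.transform(["free lunch meeting"])))  # document of unknown class
```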
Many Approaches
• Probabilistic
  • Naïve Bayes
• Decision Trees
• Information Theory
  • Maximum Likelihood
• Machine Learning
  • Neural Networks
Many Approaches (cont'd)
• Vector Space
  • Rocchio
  • K Nearest Neighbors
  • Support Vector Machines
Vector Space Methods
• Rocchio
  • Compare the document to a centroid for each class, calculated from that class's training vectors
• K Nearest Neighbors (see the sketch below)
  • Compare the document to the training vectors for every class
  • Rank order by similarity
  • Count how many training vectors from each class appear among the top k (generally 5 or 7, or more if there are many classes)
  • Instead of counting, can sum the similarity weights of the top-k training vectors belonging to each class
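A minimal k-nearest-neighbors sketch of the counting and weighted variants described above. It assumes documents are already tf*idf vectors; the names `train_vecs` and `train_labels` are hypothetical, not from the lecture.

```python
import numpy as np

def knn_classify(doc_vec, train_vecs, train_labels, k=5, weighted=False):
    """Return the predicted class for one document vector."""
    # Cosine similarity between the document and every training vector.
    norms = np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(doc_vec)
    sims = train_vecs @ doc_vec / np.where(norms == 0, 1e-12, norms)
    top_k = np.argsort(sims)[::-1][:k]            # indices of the k most similar training vectors
    scores = {}
    for idx in top_k:
        label = train_labels[idx]
        # Either count neighbors per class, or sum their similarity weights.
        scores[label] = scores.get(label, 0.0) + (sims[idx] if weighted else 1.0)
    return max(scores, key=scores.get)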
Support Vector Machines
• Support Vector Machines (SVM)
• In their basic form, only distinguish between two classes
• First, do the "math" to separate the training vectors of the two classes
  • May involve increasing the number of dimensions by adding features that are linear (or non-linear) combinations of existing features
• Keep only the training vectors near the boundary between the classes in the new space; these are the "support vectors"
• Classify a document by comparing its vector to these support vectors
Support Vector Machines (cont'd)
• To use SVMs for more than 2 classes, must either:
• Create N classifiers (one per class)
  • Ci vs. all others, for each class i
  • Issue: unbalanced training data
  • Classify a new document with each of the N classifiers and pick the class with the highest similarity
• Or create N² classifiers (one per pair of classes)
  • Ci vs. Cj for all i and j
  • Classify the new document with each classifier and average the results for each class to find the best
• (A library-based sketch of both schemes follows.)
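A short sketch of both multi-class schemes using scikit-learn's wrappers around a linear SVM. The library, the toy documents, and the labels are illustrative assumptions; the lecture describes the schemes, not this code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

docs   = ["cheap loans now", "meeting agenda attached", "win a free prize", "lunch on friday?"]
labels = ["spam", "work", "spam", "social"]

X = TfidfVectorizer().fit_transform(docs)

one_vs_rest = OneVsRestClassifier(LinearSVC()).fit(X, labels)  # N binary classifiers, Ci vs. the rest
one_vs_one  = OneVsOneClassifier(LinearSVC()).fit(X, labels)   # one binary classifier per pair of classes

print(one_vs_rest.predict(X[:1]), one_vs_one.predict(X[:1]))
```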
Dimensionality Reduction
• Vocabulary Pruning (my term) -- see the sketch below
  • Select words based on frequency thresholds: T1 < freq < T2
  • Select top features from each document
    • Calculate tf * idf for each word in each document -> Vd
    • Sort the words in Vd and keep the top ones
• Linear Algebra
  • Latent Semantic Indexing (LSI) / Latent Semantic Analysis (LSA)
  • Latent Dirichlet Allocation (LDA) / "topic modelling"
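A small sketch of the two vocabulary-pruning ideas above. The thresholds, the toy corpus, and names like `V_d` are illustrative assumptions.

```python
from collections import Counter
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]    # tokenized toy corpus

# 1. Frequency-threshold pruning: keep words with T1 < corpus frequency < T2.
T1, T2 = 1, 50
corpus_freq = Counter(w for doc in docs for w in doc)
vocab = {w for w, f in corpus_freq.items() if T1 < f < T2}

# 2. Per-document tf*idf pruning: keep each document's top-n weighted words (V_d).
df = Counter(w for doc in docs for w in set(doc))               # document frequency
N = len(docs)
top_n = 2
for doc in docs:
    tf = Counter(doc)
    weights = {w: tf[w] * math.log(N / df[w]) for w in tf}      # tf * idf per word
    V_d = sorted(weights, key=weights.get, reverse=True)[:top_n]
```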
Dimensionality Reduction (cont'd)
• Vocabulary Pruning (my term)
• Linear Algebra
  • Latent Semantic Indexing (LSI) / Latent Semantic Analysis (LSA)
  • Latent Dirichlet Allocation (LDA) / "topic modelling"
• Conceptual Modeling
LSI/LSA
• LSI learns latent topics by performing a matrix decomposition on the term/document matrix
  • i.e., create the term-document matrix
  • Do Principal Component Analysis (PCA), also called Singular Value Decomposition (SVD)
  • PCA/SVD are general linear algebra techniques
  • Essentially, they calculate eigenvectors in order to reduce the number of dimensions
• LSI/LSA refers to this approach applied specifically to the term/document matrix
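A minimal LSI/LSA sketch: a truncated SVD of a tf*idf term-document matrix. scikit-learn and the toy documents are assumptions for illustration; the lecture does not name a library.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["cats chase mice", "dogs chase cats", "stocks fell sharply", "markets and stocks rose"]

X = TfidfVectorizer().fit_transform(docs)      # documents x terms
lsi = TruncatedSVD(n_components=2)             # keep 2 latent "topics"
doc_topics = lsi.fit_transform(X)              # documents x topics (reduced space)
print(doc_topics)
```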
PLSI/PLSA
• Probabilistic variant of LSI (2000)
• Based on a mixture decomposition derived from a latent class model
• Uses word -> hidden-topic probabilities for dimensionality reduction
LDA
• Latent Dirichlet Allocation
• LDA is a generative probabilistic model
• Assumes a Dirichlet prior over the latent topics
• LSI is much faster to train, but LDA has higher accuracy
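A minimal LDA topic-modelling sketch. The library, parameters, and toy documents are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats chase mice", "dogs chase cats", "stocks fell sharply", "markets and stocks rose"]

counts = CountVectorizer().fit_transform(docs)        # LDA works on raw term counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)                # per-document topic mixtures
print(doc_topics)
```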
Comparisons
• High level of abstraction
  • Humans: 94% accurate
  • LDA: 84%
  • LSA: 84%
• Low level of abstraction
  • Humans: 76% accurate
  • LDA: 64%
  • LSA: 67%
Clustering
One Example
• Ontology construction (Gauch et al.)
  • Only 3 levels in the previous ACM Computer Classification System (ACS)
  • Train on, and classify, all one million CiteSeer documents into those classes
  • About 1,000 categories, so about 1,000 documents per category on average; too many for browsing
  • Use clustering on the documents in "large" leaf nodes to identify 610 clusters
  • Add those as subclasses, growing the ontology/hierarchy to a 4th level
  • Repeat for a 5th level on large categories
  • Result: categories with ~30 documents each
Related Problem: Labeling
• The resulting clusters have no associated semantics
• How to label them for browsing?
• Many methods, e.g.:
  • Difference: word frequency in this category minus the average word frequency in sibling categories
  • tf*icf weight, where icf_i = log(Num_categories / Num_categories(word_i)) (see the sketch below)
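An illustrative tf*icf sketch for picking label words for a category. The toy categories and function names are hypothetical, not from the lecture.

```python
import math
from collections import Counter

categories = {                                    # category -> bag of words
    "databases": ["query", "index", "query", "sql"],
    "networks":  ["packet", "router", "query"],
    "graphics":  ["render", "shader", "mesh"],
}

num_categories = len(categories)
# Number of categories containing each word (for icf_i = log(Num_categories / Num_categories(word_i))).
cat_freq = Counter(w for words in categories.values() for w in set(words))

def top_labels(category, n=2):
    tf = Counter(categories[category])
    weights = {w: tf[w] * math.log(num_categories / cat_freq[w]) for w in tf}
    return sorted(weights, key=weights.get, reverse=True)[:n]

print(top_labels("databases"))                    # e.g. ['index', 'sql'] or similar
```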
Types of Clustering Algorithms
• Hard clustering vs. soft clustering
  • Hard (partitioning): items may be in only 1 cluster
  • Soft: items may be in >= 1 cluster, with differing degrees of association
• Flat vs. hierarchical
  • Flat: items are in sets; displayed using dots inside circles
  • Hierarchical: items and clusters are related to each other pairwise; displayed using a dendrogram
Hierarchical Clustering
• Agglomerative vs. divisive
• Agglomerative: bottom up
  • Start with individual items
  • Join the closest pair of items (or clusters) at each step
• Divisive: top down
  • Start with one cluster of all items
  • Split the largest / most diverse cluster into two at each step
Hierarchical Clustering (cont'd)
• Must define a way to measure the distance between items/clusters
• Distance between clusters can be the distance between their:
  • Closest members: single link (can produce long, skinny clusters)
  • Farthest members: complete link (produces many small, tight clusters)
  • "Middle":
    • Centroid (must be recalculated every time a new item is added to a cluster)
    • Medoid (the member of the cluster that minimizes the distance to all the others)
When to Stop Clustering?
• Agglomerative clustering creates a complete tree; how to decide on the clusters?
  • Know how many clusters you want, i.e., M
    • Cut the last M links in the dendrogram
  • Know the maximum size of a cluster
    • Stop when all clusters are <= that size
  • Know the minimum cluster coherence
    • Stop when all clusters are >= that coherence
Agglomerative Hierarchical Clustering Algorithm (Simple Version)
• Calculate the N x N distance (similarity) matrix between all N items in the data set
• At each step:
  • Find the smallest value (closest pair)
  • Combine the pair into a cluster
  • Replace those 2 rows/columns with a single new row/column for the resulting cluster
  • Must recalculate the distance between the new cluster and the remaining items
Agglomerative Hierarchical Clustering Algorithm (Simple Version, cont'd)
• The new distance is:
  • The maximum of the 2 input items' distances (complete link)
  • The minimum of the 2 input items' distances (single link)
  • The average of the 2 input items' distances (centroid-based clustering)
• Complexity
  • O(N^3) time (N steps over an N^2 matrix)
  • O(N^2) space (the matrix)
• More sophisticated algorithms exist, e.g. single link can be computed from the minimum spanning tree in O(N^2)
• (A library-based sketch of the different linkages, and of cutting the tree, follows.)
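A minimal sketch of agglomerative clustering with the three linkage criteria above, plus cutting the resulting tree into M clusters as described on the earlier slide. SciPy and the toy 2-D points are assumptions; the lecture describes the algorithm, not this library.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]])   # toy "document" vectors

Z_single   = linkage(points, method="single")     # closest-member distance
Z_complete = linkage(points, method="complete")   # farthest-member distance
Z_average  = linkage(points, method="average")    # average distance

labels = fcluster(Z_complete, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
print(labels)   # e.g. [1 1 1 2 2]
```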
Non-hierarchical (Flat) Clustering
• Often start with an initial partition based on randomly selected seeds
  • One seed per cluster; treat it as the centroid
• Multiple passes
  • Pass one: compare items to the seeds and put each item in the cluster with the closest seed
Non-hierarchical (Flat) Clustering Algorithm
• Subsequent passes
  • Move items to other clusters to improve overall cluster "goodness"
• Need a goodness metric: a combination of
  • The distribution of cluster sizes
  • Cluster coherence
Non-hierarchical (Flat) Clustering Issues
• How many seeds to start with?
  • Need a priori knowledge of the domain (e.g., the number of senses)
  • Or run with multiple values of n, evaluate the resulting clusterings, and pick the best
• Complexity: O(N)
  • N items * the number of passes = O(N)
Evaluating Clustering
• How good are the resulting clusters? (See the sketch below.)
• Cluster items from known classes and measure accuracy
  • How often do items from the same class end up in the same cluster?
• Coherence
  • 1 / (average distance of all items in the cluster from the centroid)
  • Equivalently, the average similarity of all items in the cluster to the centroid
• Cluster separation
  • How far apart are the centroids? The closest members of 2 clusters?
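An illustrative sketch of two of these measures: same-class purity (accuracy against known classes) and centroid coherence. NumPy and the function names are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def purity(cluster_ids, true_classes):
    """Fraction of items whose cluster's majority class matches their true class."""
    total = 0
    for c in set(cluster_ids):
        members = [t for cid, t in zip(cluster_ids, true_classes) if cid == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(true_classes)

def coherence(vectors):
    """1 / average distance of the cluster's items from its centroid."""
    centroid = vectors.mean(axis=0)
    avg_dist = np.mean(np.linalg.norm(vectors - centroid, axis=1))
    return 1.0 / avg_dist if avg_dist > 0 else float("inf")
```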
K-Means Clustering
• The best known and most widely used flat clustering algorithm
• Tutorial: http://www.cs.cmu.edu/~awm/tutorials/kmeans11.pdf
• (A minimal usage sketch follows.)
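A minimal k-means sketch on tf*idf document vectors. scikit-learn and the toy documents are assumptions; the tutorial linked above covers the algorithm itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["cats chase mice", "dogs chase cats", "stocks fell sharply", "markets and stocks rose"]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # 2 seeds -> 2 flat clusters
print(km.labels_)   # e.g. [0 0 1 1]
```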