Text Mining 1
What is Text Mining?

• There are many examples of text-based documents (all in 'electronic' format): e-mails, corporate Web pages, customer surveys, résumés, medical records, DNA sequences, technical papers, incident reports, news stories and more.
• There is not enough time or patience to read them all.
  – Can we extract the most vital kernels of information?
• We wish to find a way to gain knowledge (in summarised form) from all that text, without reading or examining it fully first!
  – Some of it (e.g. DNA sequences) is hard to comprehend anyway!

What is Text Mining?

• Traditional data mining uses 'structured data' (an n x p matrix).
• The analysis of 'free-form text', also referred to as 'unstructured data', is different:
  – successful categorisation of such data can be a difficult and time-consuming task.
• Often we can combine free-form text and structured data to derive valuable, actionable information (e.g. as in typical surveys); this is 'semi-structured' data.

"Search" versus "Discover"

                           Structured Data   Unstructured Data (Text)
Search (goal-oriented)     Data Retrieval    Information Retrieval
Discover (opportunistic)   Data Mining       Text Mining

© 2002, AvaQuest Inc.

Data Retrieval

• Find records within a structured database.

Database Type: Structured
Search Mode: Goal-driven
Atomic entity: Data record
Example Information Need: "Find a Japanese restaurant in Boston that serves vegetarian food."
Example Query: SELECT * FROM restaurants WHERE city = 'boston' AND type = 'japanese' AND has_veg = true

© 2002, AvaQuest Inc.

Information Retrieval

• Find relevant information in an unstructured information source (usually text).

Database Type: Unstructured
Search Mode: Goal-driven
Atomic entity: Document
Example Information Need: "Find a Japanese restaurant in Boston that serves vegetarian food."
Example Query: "Japanese restaurant Boston", or Boston -> Restaurants -> Japanese

© 2002, AvaQuest Inc.

Data Mining

• Discover new knowledge through analysis of data.

Database Type: Structured
Search Mode: Opportunistic
Atomic entity: Numbers and dimensions
Example Information Need: "Show the trend over time in the number of visits to Japanese restaurants in Boston."
Example Query: SELECT date, SUM(visits) FROM restaurants WHERE city = 'boston' AND type = 'japanese' GROUP BY date ORDER BY date

© 2002, AvaQuest Inc.

Text Mining

• Discover new knowledge through analysis of text.

Database Type: Unstructured
Search Mode: Opportunistic
Atomic entity: Language feature or concept
Example Information Need: "Find the types of food poisoning most often associated with Japanese restaurants."
Example Query: Rank diseases found associated with "Japanese restaurants"

© 2002, AvaQuest Inc.

Motivation for Text Mining

• Approximately 90% of the world's data is held in unstructured formats (source: Oracle Corporation).
• Information-intensive business processes demand that we move beyond simple document retrieval to "knowledge" discovery.

[Pie chart: 10% structured numerical or coded information; 90% unstructured or semi-structured information]

© 2002, AvaQuest Inc.

Text Mining: Examples

• Text mining is an exercise to gain knowledge from stores of language text.
• Example text sources:
  – Web pages
  – Medical records
  – Customer surveys
  – Email filtering (spam)
  – DNA sequences
  – Incident reports
  – Drug interaction reports
  – News stories (e.g. to predict stock movement)

What is Text Mining?

• Data examples:
  – Web pages
  – Customer surveys:

Customer  Age  Sex  Tenure    Comments                                    Outcome
123       24   M    12 years  Incorrect charges on bill; customer angry   Y
243       26   F    1 month   Inquiry about charges to India              N
346       54   M    3 years   Question about charges on bill              N

Text Mining

• Typically falls into one of two categories:
  – Analysis of text: I have a bunch of text I am interested in; tell me something about it.
    • E.g. sentiment analysis, "buzz" searches
  – Retrieval: there is a large corpus of text documents, and I want the one closest to a specified query.
    • E.g. web search, library catalogues, legal and medical precedent studies

Text Mining: Analysis

• Which words are most present?
• Which words are most surprising?
• Which words help define the document?
• What are the interesting text phrases?

Text Mining: Retrieval

• Find the k objects in the corpus of documents which are most similar to my query.
• Can be viewed as "interactive" data mining: the query is not specified a priori.
• Main problems of text retrieval:
  – What does "similar" mean?
  – How do I know if I have the right documents?
  – How can I incorporate user feedback?

IR Architecture

[Figure: diagram of a typical information retrieval system architecture]

Text Retrieval: Challenges

• Calculating similarity is not obvious: what is the distance between two sentences or queries?
• Evaluating retrieval is hard: what is the "right" answer? (There is no ground truth.)
• Users can query things you have not seen before, e.g. misspelled, foreign, or new terms.
• The goal (score function) is different from classification/regression: we are not trying to model all of the data, just to get the best results for a given user.
• Words can hide semantic content:
  – Synonymy: a keyword T does not appear anywhere in the document, even though the document is closely related to T (e.g. T = "data mining").
  – Polysemy: the same keyword may mean different things in different contexts (e.g. "mining").

Basic Measures for Text Retrieval

• Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e. "correct" responses).
• Recall: the percentage of documents that are relevant to the query and were in fact retrieved.

Precision vs. Recall

                          Truth: Relevant   Truth: Not Relevant
Algorithm: Relevant       TP                FP
Algorithm: Not Relevant   FN                TN

• We've been here before!
  – Precision = TP / (TP + FP)
  – Recall = TP / (TP + FN)
  – Trade-off:
    • If the algorithm is 'picky': precision high, recall low.
    • If the algorithm is 'relaxed': precision low, recall high.
  – BUT: recall is often hard, if not impossible, to calculate.
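
A minimal sketch (Python, with invented counts) of how the two measures come out of the confusion table above:

```python
# Toy sketch: precision and recall from invented confusion-table counts.
tp, fp, fn = 40, 10, 20      # true positives, false positives, false negatives

precision = tp / (tp + fp)   # fraction of retrieved documents that are relevant
recall = tp / (tp + fn)      # fraction of relevant documents that were retrieved

print(f"precision = {precision:.2f}, recall = {recall:.2f}")   # 0.80, 0.67
```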

Precision-Recall Curves

• If we have a labelled training set, we can calculate recall.
• For any given number of returned documents, we can plot a point for precision vs. recall (similar to thresholds in ROC curves).
• Different retrieval algorithms can have very different curves, making it hard to tell which is "best".

Term-Document Matrix

• The most common form of representation in text mining is the term-document matrix.
  – Term: typically a single word, but could be a word phrase like "data mining".
  – Document: a generic term meaning a collection of text to be retrieved.
  – Can be large: vocabularies are often 50k terms or larger, and documents can number in the billions (the Web).
  – Entries can be binary, or use counts.

Term-Document Matrix

Example: 10 documents, 6 terms.

      Database  SQL  Index  Regression  Likelihood  Linear
D1        24     21      9           0           0       3
D2        32     10      5           0           3       0
D3        12     16      5           0           0       0
D4         6      7      2           0           0       0
D5        43     31     20           0           3       0
D6         2      0      0          18           7       6
D7         0      0      1          32          12       0
D8         3      0      0          22           4       4
D9         1      0      0          34          27      25
D10        6      0      0          17           4      23

• Each document is now just a vector of terms, sometimes boolean.
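
A matrix like this can be built in a few lines; the sketch below assumes scikit-learn is installed, and the two toy documents are invented:

```python
# Sketch: building a small term-document matrix with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "database SQL index database",
    "regression likelihood linear regression",
]
vec = CountVectorizer()             # binary=True would give a 0/1 matrix instead of counts
X = vec.fit_transform(docs)         # sparse (num docs) x (num terms) matrix
print(vec.get_feature_names_out())  # ['database' 'index' 'likelihood' 'linear' 'regression' 'sql']
print(X.toarray())                  # [[2 1 0 0 0 1], [0 0 1 1 2 0]]
```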

Term-Document Matrix

• We have lost all semantic content.
• Be careful constructing your term list!
  – Not all words are created equal!
  – Words that are the same should be treated the same!
    • Stop words
    • Stemming

Stop Words

• Many of the most frequently used words in English are worthless in retrieval and text mining; these words are called stop words.
  – the, of, and, to, …
  – There are typically about 400 to 500 such words.
  – For a given application, an additional domain-specific stop word list may be constructed.
• Why do we need to remove stop words?
  – To reduce indexing (or data) file size:
    • stop words account for 20-30% of total word counts.
  – To improve efficiency:
    • stop words are not useful for searching or text mining;
    • stop words always produce a large number of hits.
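
A minimal sketch of stop word filtering; the six-word stop list here is a toy stand-in for a real 400-500 word list:

```python
# Toy stop word filter (real systems use much larger lists, e.g. NLTK's).
STOP_WORDS = {"the", "of", "and", "to", "a", "in"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the analysis of free form text".split()))
# ['analysis', 'free', 'form', 'text']
```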

Stemming

• Techniques used to find the root/stem of a word, e.g.:

  users, used, using          -> stem: use
  engineering, engineered     -> stem: engineer

• Usefulness:
  – Improves the effectiveness of retrieval and text mining by matching similar words.
  – Reduces indexing size: combining words with the same root may reduce indexing size by as much as 40-50%.

Basic Stemming Methods

• Remove endings:
  – If a word ends with a consonant other than s, followed by an s, then delete the s.
  – If a word ends in es, drop the s.
  – If a word ends in ing, delete the ing unless the remaining word consists of only one letter or is 'th'.
  – If a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
  – …
• Transform words:
  – If a word ends with "ies" but not "eies" or "aies", then replace "ies" with "y".
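
Rules like these are bundled into standard stemmers such as Porter's. A sketch using NLTK (assumed installed); the example words are the classic ones from Porter's paper:

```python
# Sketch: stemming with NLTK's Porter stemmer.
# Note that stems are not always dictionary words.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["connect", "connected", "connecting", "connection", "connections"]:
    print(w, "->", stemmer.stem(w))   # all five map to the stem 'connect'
```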

Feature Selection

• The performance of text classification algorithms can be optimised by selecting only a subset of the discriminative terms, even after stemming and stop word removal.
• Greedy search:
  – Start from the full set and delete one term at a time.
  – Find the least important variable (e.g. using the Gini index for a classification problem).
• Often performance does not degrade even with order-of-magnitude reductions.
  – In one case, only 140 out of 20,000 terms were needed for classification!
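
As a sketch of term selection, the snippet below uses a chi-squared filter rather than the greedy backward search described above; the data shapes are invented toys, and scikit-learn is assumed:

```python
# Sketch: keeping only the most discriminative terms with a chi-squared filter.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

X = np.random.randint(0, 5, size=(100, 2000))   # 100 docs x 2000 term counts (toy)
y = np.random.randint(0, 2, size=100)           # binary class labels (toy)

selector = SelectKBest(chi2, k=140)             # keep 140 of 2000 terms
X_small = selector.fit_transform(X, y)
print(X_small.shape)                            # (100, 140)
```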

Distances in TD Matrices

• Given a term-document matrix representation, we can now define distances between documents (or terms!).
• Elements of the matrix can be 0/1 or term frequencies (sometimes normalised).
• We can use Euclidean or cosine distance.
• Cosine similarity is the cosine of the angle between the two vectors:
      d_c(x, y) = (x . y) / (||x|| ||y||)
• Not intuitive, but it has been proven to work well.
• If two docs are the same, d_c = 1; if they have nothing in common, d_c = 0.

• We can calculate cosine and Euclidean distances for this matrix.
• What would you want the distances to look like?

(Same 10 x 6 term-document matrix as before: D1-D5 concentrate on Database, SQL, and Index; D6-D10 on Regression, Likelihood, and Linear.)
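
A sketch of both measures on this matrix (numpy assumed); the database documents should come out close to each other and far from the regression documents:

```python
# Sketch: cosine and Euclidean comparisons on the example matrix
# (rows D1..D10, columns Database..Linear).
import numpy as np

TD = np.array([
    [24, 21,  9,  0,  0,  3],   # D1
    [32, 10,  5,  0,  3,  0],   # D2
    [12, 16,  5,  0,  0,  0],   # D3
    [ 6,  7,  2,  0,  0,  0],   # D4
    [43, 31, 20,  0,  3,  0],   # D5
    [ 2,  0,  0, 18,  7,  6],   # D6
    [ 0,  0,  1, 32, 12,  0],   # D7
    [ 3,  0,  0, 22,  4,  4],   # D8
    [ 1,  0,  0, 34, 27, 25],   # D9
    [ 6,  0,  0, 17,  4, 23],   # D10
])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(TD[0], TD[1]))           # ~0.90: D1 and D2 share the database topic
print(cosine(TD[0], TD[8]))           # ~0.06: D1 and D9 have almost nothing in common
print(np.linalg.norm(TD[0] - TD[1]))  # Euclidean distance, sensitive to document length
```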

Document Distance

• Pairwise distances between documents.
• [Figure: image plots of cosine distance, Euclidean distance, and scaled Euclidean distance, produced with the R function 'image'.]

Information Retrieval Models

• An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined.
• Main models:
  – Boolean model
  – Vector space model
  – Statistical language model
  – etc.

Boolean Model

• Each document or query is treated as a "bag" of words or terms. Word sequence is not considered.
• Given a collection of documents D, let V = {t_1, t_2, ..., t_|V|} be the set of distinctive words/terms in the collection. V is called the vocabulary.
• A weight w_ij > 0 is associated with each term t_i of a document d_j ∈ D. For a term that does not appear in document d_j, w_ij = 0.

Boolean Model (contd)

• Query terms are combined logically using the Boolean operators AND, OR, and NOT.
  – E.g. ((data AND mining) AND (NOT text))
• Retrieval:
  – Given a Boolean query, the system retrieves every document that makes the query logically true.
  – This is called exact match.
• The retrieval results are usually quite poor because term frequency is not considered.
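
A toy sketch of exact-match Boolean retrieval over bag-of-words sets (documents and query invented):

```python
# Sketch: Boolean retrieval as set membership tests.
docs = {
    "d1": {"data", "mining", "sql"},
    "d2": {"data", "mining", "text"},
    "d3": {"regression", "likelihood"},
}

# Query: (data AND mining) AND (NOT text)
hits = [d for d, terms in docs.items()
        if "data" in terms and "mining" in terms and "text" not in terms]
print(hits)   # ['d1'] -- d2 is excluded by NOT text
```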

Vector Space Model

• Documents are again treated as a "bag" of words or terms.
• Each document is represented as a vector.
• However, the term weights are no longer 0 or 1. Each term weight is computed based on some variation of the TF or TF-IDF scheme.
• Term Frequency (TF) scheme: the weight of a term t_i in document d_j is the number of times that t_i appears in d_j, denoted by f_ij. Normalisation may also be applied.

TF-IDF Term Weighting Scheme

• The best-known weighting scheme:
  – TF: still term frequency.
  – IDF: inverse document frequency,
        idf_i = log(N / df_i)
    where N is the total number of documents and df_i is the number of documents in which t_i appears.
• The final TF-IDF term weight is:
        w_ij = tf_ij × idf_i
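
A sketch of these formulas on a toy count matrix (numpy assumed; the natural log is one common convention):

```python
# Sketch: TF-IDF weights, w_ij = tf_ij * log(N / df_i), on invented counts.
import numpy as np

tf = np.array([[2, 1, 0],
               [0, 3, 1],
               [1, 0, 4]])        # tf[j, i]: count of term i in document j
N = tf.shape[0]                   # total number of documents
df = (tf > 0).sum(axis=0)         # number of documents containing each term
idf = np.log(N / df)              # rare terms get larger weights
w = tf * idf                      # TF-IDF weight matrix
print(np.round(w, 2))
```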

Retrieval in the Vector Space Model

• A query q is represented in the same way (or slightly differently).
• Relevance of d_i to q: compare the similarity of query q and document d_i.
• Cosine similarity (the cosine of the angle between the two vectors):
        cos(q, d_i) = (q . d_i) / (||q|| ||d_i||)
• Cosine is also commonly used in text clustering.

An Example

• A document space is defined by three terms (the vocabulary):
  – hardware, software, users
• A set of documents is defined as:
  – A1 = (1, 0, 0)   A2 = (0, 1, 0)   A3 = (0, 0, 1)
  – A4 = (1, 1, 0)   A5 = (1, 0, 1)   A6 = (0, 1, 1)
  – A7 = (1, 1, 1)   A8 = (1, 0, 1)   A9 = (0, 1, 1)
• If the query is "hardware and software", what documents should be retrieved?

An Example (cont.)

• In Boolean query matching:
  – documents A4 and A7 will be retrieved ("AND");
  – documents A1, A2, A4, A5, A6, A7, A8, A9 will be retrieved ("OR").
• In similarity matching (cosine), with q = (1, 1, 0):
  – S(q, A1) = 0.71   S(q, A2) = 0.71   S(q, A3) = 0
  – S(q, A4) = 1      S(q, A5) = 0.5    S(q, A6) = 0.5
  – S(q, A7) = 0.82   S(q, A8) = 0.5    S(q, A9) = 0.5
  – Document retrieved set (with ranking) = {A4, A7, A1, A2, A5, A6, A8, A9}
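
The cosine scores above can be reproduced directly (numpy assumed):

```python
# Sketch: reproducing the cosine scores from the worked example.
import numpy as np

A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [1, 1, 0], [1, 0, 1], [0, 1, 1],
              [1, 1, 1], [1, 0, 1], [0, 1, 1]])   # A1..A9
q = np.array([1, 1, 0])                           # "hardware and software"

scores = A @ q / (np.linalg.norm(A, axis=1) * np.linalg.norm(q))
for i, s in enumerate(scores, 1):
    print(f"S(q, A{i}) = {s:.2f}")                # A4 = 1.00, A7 = 0.82, A1 = A2 = 0.71, ...
```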

Okapi Relevance Method

• Another way to assess the degree of relevance is to compute a relevance score for each document directly against the query.
• The Okapi method (e.g. BM25) and its variations are popular techniques in this setting.

Relevance Feedback

• Relevance feedback is one of the techniques for improving retrieval effectiveness. The steps:
  – The user first identifies some relevant (D_r) and irrelevant (D_ir) documents in the initial list of retrieved documents.
  – The system expands the query q by extracting additional terms from the sample relevant and irrelevant documents to produce an expanded query q_e.
  – A second round of retrieval is performed.
• Rocchio method (α, β and γ are parameters):
        q_e = α·q + (β / |D_r|) Σ_{d ∈ D_r} d  −  (γ / |D_ir|) Σ_{d ∈ D_ir} d
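
A sketch of the Rocchio update (numpy assumed; the α, β, γ values below are just common defaults, not prescribed by the slide):

```python
# Sketch: Rocchio query expansion from relevance feedback.
import numpy as np

def rocchio(q, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Expand query vector q using mean relevant / irrelevant document vectors."""
    q_e = alpha * q
    if len(relevant):
        q_e = q_e + beta * np.mean(relevant, axis=0)
    if len(irrelevant):
        q_e = q_e - gamma * np.mean(irrelevant, axis=0)
    return np.clip(q_e, 0, None)   # negative term weights are often dropped

q = np.array([1.0, 1.0, 0.0, 0.0])
Dr = np.array([[1.0, 0.5, 1.0, 0.0]])    # user-marked relevant documents
Dir = np.array([[0.0, 0.0, 0.0, 1.0]])   # user-marked irrelevant documents
print(rocchio(q, Dr, Dir))               # the third term now has weight > 0
```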

Rocchio Text Classifier

• In fact, a variation of the Rocchio method above, called the Rocchio classification method, can also be used to improve retrieval effectiveness; so can other machine learning methods. Why?
• A Rocchio classifier is constructed by producing a prototype vector c_i for each class i (relevant or irrelevant in this case); one common form is:
        c_i = (α / |D_i|) Σ_{d ∈ D_i} d / ||d||  −  (β / |D − D_i|) Σ_{d ∈ D − D_i} d / ||d||
• In classification, cosine similarity is used.

Queries

• A query is a representation of the user's information needs.
  – Normally a list of words.
• Once we have a TD matrix, queries can be represented as a vector in the same space.
  – "Database Index" = (1, 0, 1, 0, 0, 0)
• A query can also be a simple question in natural language.
• Calculate the cosine distance between the query and the TF-IDF version of the TD matrix.
• This returns a ranked vector of documents.

Latent Semantic Indexing

• Criticism: queries can be posed in many ways but still mean the same thing:
  – data mining and knowledge discovery
  – car and automobile
  – beet and beetroot
• Semantically these are the same, and documents containing either term are relevant.
• Synonym lists or thesauri are a solution, but messy and difficult to maintain.
• Latent Semantic Indexing (LSI) tries to extract the hidden semantic structure in the documents.
• Search what I meant, not what I said!

LSI

• Approximate the T-dimensional term space using the principal components calculated from the TD matrix.
• The first k PC directions provide the best set of k orthogonal basis vectors; these explain the most variance in the data.
  – The data are reduced to an N x k matrix, without much loss of information.
• Each "direction" is a linear combination of the input terms, and the directions define a clustering of "topics" in the data.

LSI

• LSI is typically done using the Singular Value Decomposition (SVD) to find the principal components:
        TD = U S V^T
  where TD is the term weighting by document (10 x 6 in our example), V holds the new orthogonal basis vectors for the data (the PC directions), and S is the diagonal matrix of singular values.
• For our example: S = (77.4, 69.5, 22.9, 13.5, 12.1, 4.8)
• Fraction of the variance explained by PC 1 & 2:
        (77.4² + 69.5²) / (77.4² + 69.5² + 22.9² + 13.5² + 12.1² + 4.8²) = 92.5%
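
The 92.5% figure follows directly from the printed singular values, since squared singular values measure the variance captured by each direction (numpy assumed):

```python
# Sketch: fraction of variance explained by the first two PC directions,
# using the singular values printed on the slide.
import numpy as np

S = np.array([77.4, 69.5, 22.9, 13.5, 12.1, 4.8])
frac = (S[:2] ** 2).sum() / (S ** 2).sum()
print(round(frac, 3))   # 0.925 -> first two PCs explain 92.5% of the variance

# In general the decomposition itself is one call:
# U, S, Vt = np.linalg.svd(TD, full_matrices=False)
```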

LSI

• The top 2 PCs make new pseudo-terms with which to define documents.
• We can also look at the first two principal components:
  – PC1: (0.74, 0.49, 0.27, 0.28, 0.19) -> emphasises the first two terms
  – PC2: (-0.28, -0.24, -0.12, 0.74, 0.37, 0.31) -> separates the two clusters
• [Figure: documents plotted on the first two PCs.] Note how distance from the origin shows the number of terms, and the angle (from the origin) shows similarity as well.

LSI

• Here we show the same plot, but with two new documents: one containing the term "SQL" 50 times, another containing the term "Databases" 50 times.
• Even though they have no terms in common, they are close together in LSI space.

Textual Analysis

• Once we have the data in a nice matrix representation (TD, TF-IDF-weighted TD, or LSI), we can throw the data mining toolbox at it:
  – Classification of documents
    • if we have training data for the classes
  – Clustering of documents
    • unsupervised

Automatic Document Classification

• Motivation:
  – Automatic classification of the tremendous number of online text documents (Web pages, e-mails, etc.)
  – Customer comments: requests for information, complaints, inquiries
• A classification problem:
  – Training set: human experts generate a training data set.
  – Classification: the computer system discovers the classification rules.
  – Application: the discovered rules can be applied to classify new/unknown documents.
• Techniques:
  – Linear/logistic regression, naïve Bayes
  – Trees are not so good here, due to the massive dimensionality and few interactions.

Naïve Bayes Classifier for Text

• A naïve Bayes classifier is a conditional independence model.
  – Also called "multivariate Bernoulli".
  – It assumes conditional independence between terms given the class:
        p(x | c_k) = Π_j p(x_j | c_k)
  – Note that we model each term x_j as a discrete random variable.
• In other words, the probability that a bunch of words comes from a given class equals the product of the individual probabilities of those words.

Multinomial Classifier for Text

• The multinomial classification model assumes that the data are generated by a p-sided die (a multinomial model):
        p(x | c_k) = (N_x! / Π_j x_j!) Π_j θ_jk^{x_j}
  where:
  – N_x = number of terms (total count) in document x
  – x_j = number of times term j occurs in the document
  – θ_jk = probability of term j under class k
  – c_k = class k
• Based on the training data, each class has its own multinomial probability distribution across all words.
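
A sketch of both naïve Bayes variants in scikit-learn (assumed installed); BernoulliNB binarises the counts into presence/absence, MultinomialNB uses the "p-sided die" count model, and the four training documents are invented:

```python
# Sketch: multivariate Bernoulli vs. multinomial naive Bayes on toy text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap now", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

X = CountVectorizer().fit_transform(docs)   # term-document count matrix
for model in (BernoulliNB(), MultinomialNB()):
    model.fit(X, labels)
    print(type(model).__name__, model.predict(X))
```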

Document Clustering

• We can also do clustering, i.e. unsupervised learning of documents.
• Automatically group related documents based on their content.
• Requires no training sets or predetermined taxonomies.
• Major steps:
  – Preprocessing
    • Remove stop words, stem, feature extraction, lexical analysis, …
  – Hierarchical clustering
    • Compute similarities, apply clustering algorithms, …
  – Slicing
    • Fan-out controls; flatten the tree to the desired number of levels.

Document Clustering

• To cluster:
  – We can use LSI.
  – Another model: Latent Dirichlet Allocation (LDA).
  – LDA is a generative probabilistic model of a corpus. Documents are represented as random mixtures over latent topics, where a topic is characterised by a distribution over words.
• LDA:
  – Three concepts: words, topics, and documents.
  – Documents are a collection of words and have a probability distribution over topics.
  – Topics have a probability distribution over words.
  – It is a fully Bayesian model.

LDA

• Assume the data were generated by a generative process:
  – θ is a document's topic distribution: each document draws a mixture over topics.
  – z is a topic assignment, drawn for each word from the document's topic distribution.
  – w is a word, the only real observable (N = number of words in all documents).
  – α = the parameter of the per-document topic distributions.
• The LDA equations are then specified in a fully Bayesian model:
        θ ~ Dirichlet(α)
        z_n | θ ~ Multinomial(θ)
        w_n | z_n, β ~ Multinomial(β_{z_n})
  giving the joint distribution
        p(θ, z, w | α, β) = p(θ | α) Π_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)
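
A sketch of fitting LDA with scikit-learn (assumed installed); the corpus and the choice of two topics are toys, and real use needs far more data:

```python
# Sketch: LDA topic modelling on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["database sql index query", "sql database table index",
        "regression likelihood linear model", "linear regression fit likelihood"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)   # per-document topic distributions (rows sum to 1)
print(theta.round(2))
print(lda.components_.shape)   # (2 topics, vocabulary size): topic-word weights
```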

• This model can be solved via advanced computational techniques; see Blei et al., 2003.

LDA Output

• The result can be an often-useful classification of documents into topics, and a distribution of each topic across words.

Another Look at LDA

• Model: topics made up of words are used to generate documents. [Figure]

Another Look at LDA

• Reality: only documents are observed; we must infer the topics. [Figure]

Text Mining: Helpful Data

• WordNet

[Figure: example WordNet entries. Courtesy: Luca Lanzi]

Text Mining - Other Topics

• Part-of-speech tagging
  – Assign grammatical tags to words (verb, noun, etc.).
  – Helps in understanding documents; typically uses Hidden Markov Models.
• Named entity classification
  – A classification task: can we automatically detect proper nouns and tag them?
  – "Mr. Jones" is a person; "Madison" is a town.
  – Helps with disambiguation, e.g. "spears".
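
A sketch of part-of-speech tagging with NLTK (assumes nltk plus its 'punkt' and 'averaged_perceptron_tagger' data are installed); the sentence reuses the slide's examples, and the commented output is typical rather than guaranteed:

```python
# Sketch: POS tagging the slide's example entities with NLTK.
import nltk

tokens = nltk.word_tokenize("Mr. Jones visited Madison last week")
print(nltk.pos_tag(tokens))
# Typical output: [('Mr.', 'NNP'), ('Jones', 'NNP'), ('visited', 'VBD'),
#                  ('Madison', 'NNP'), ('last', 'JJ'), ('week', 'NN')]
```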

Text Mining - Other Topics

• Sentiment analysis
  – Automatically determine the tone of a text: positive, negative or neutral.
  – Typically uses collections of good and bad words.
  – "While the traditional media is slowly starting to take John McCain's straight-talking image with increasingly large grains of salt, his base isn't quite ready to give up on their favorite son. Jonathan Alter's bizarre defense of McCain after he was caught telling an outright lie perfectly captures that reluctance[.]"
  – Often fit using naïve Bayes.
• There are sentiment word lists out there:
  – See http://neuro.imm.dtu.dk/wiki/Text_sentiment_analysis

Text Mining - Other Topics

• Summarising text: word clouds
  – Take text as input, find the most interesting terms, and display them graphically.
  – Blogs do this.
  – See Wordle.net.

Modest Mouse Lyrics

[Figure: word cloud generated from Modest Mouse lyrics]

References

• Text classification
  – Excellent, quite detailed lecture by William Cohen; uses kNN, TF-IDF, neural nets, and other models:
    http://videolectures.net/mlas06_cohen_tc/
• LDA and topic models
  – Seminal paper: Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. (January 2003). Lafferty, John (ed.). "Latent Dirichlet Allocation". Journal of Machine Learning Research 3.
  – Tutorial on text topic modelling from David Blei: http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
• General text mining topics
  – Many text mining tutorials are available at LingPipe (code available, written in Java):
    http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html
• Sentiment analysis
  – Bing Liu tutorial: http://www.cs.uic.edu/~liub/FBS/Sentiment-Analysis-tutorial-AAAI2011.pdf
  – Searching for Bing Liu will find many more resources on this topic.
  – Sentiment analysis in Twitter (a Twitter text mining tutorial):
    http://danzambonini.com/self-improving-bayesian-sentiment-analysis-for-twitter/