Introduction to Text Mining, ChengXiang (“Cheng”) Zhai

  • Slides: 29
Introduction to Text Mining
ChengXiang (“Cheng”) Zhai
Department of Computer Science, Graduate School of Library & Information Science, Statistics, and Institute for Genomic Biology
University of Illinois, Urbana-Champaign

Outline
- Overview of Text Mining
- IR-Style Text Mining Techniques
- NLP-Style Text Mining Techniques
- ML-Style Text Mining Techniques

Two Definitions of “Mining”
• Goal-oriented (effectiveness driven, NLP, AI)
  – Any process that generates useful results that are non-obvious is called “mining”
  – Keywords: “useful” + “non-obvious”
  – Data isn’t necessarily massive
• Method-oriented (efficiency driven, DB, IR)
  – Any process that involves extracting information from massive data is called “mining”
  – Keywords: “massive” + “pattern”
  – Patterns aren’t necessarily useful

What is Text Mining?
• Data Mining View: Explore patterns in textual data
  – Find latent topics
  – Find topical trends
  – Find outliers and other hidden patterns
• Natural Language Processing View: Make inferences based on partial understanding of natural language text
  – Information extraction
  – Question answering

Applications of Text Mining
• Direct applications
  – Discovery-driven (Bioinformatics, Business Intelligence, etc.): We have specific questions; how can we exploit data mining to answer the questions?
  – Data-driven (WWW, literature, email, customer reviews, etc.): We have a lot of data; what can we do with it?
• Indirect applications
  – Assist information access (e.g., discover latent topics to better summarize search results)
  – Assist information organization (e.g., discover hidden structures)

Text Mining Methods
• Data Mining Style: View text as high dimensional data
  – Frequent pattern finding
  – Association analysis
  – Outlier detection
• Information Retrieval Style: Fine granularity topical analysis
  – Topic extraction
  – Exploit term weighting and text similarity measures
  – Question answering
• Natural Language Processing Style: Information Extraction
  – Entity extraction
  – Relation extraction
  – Sentiment analysis
• Machine Learning Style: Unsupervised or semi-supervised learning
  – Generative models
  – Dimension reduction
  – Classification & prediction

IR-Style Techniques for Text Mining

Some “Basic” IR Techniques
• Stemming
• Stop words
• Weighting of terms (e.g., TF-IDF)
• Vector/Unigram representation of text
• Text similarity (e.g., cosine, KL-div)
• Relevance/pseudo feedback (e.g., Rocchio)
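Two of the techniques listed above, TF-IDF term weighting and cosine similarity, can be sketched in a few lines of Python. This is a minimal illustration rather than the formulation used in the lecture: the whitespace tokenizer, the smoothed IDF formula, and the three sample documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (no stemming or stop-word removal)."""
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Smoothed IDF (an assumption; many variants exist).
    idf = {t: math.log((n_docs + 1) / (df[t] + 1)) + 1.0 for t in df}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "text mining finds patterns in text",
    "mining patterns from massive text data",
    "the football team won the game",
]
vecs = tf_idf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # documents sharing several terms
sim_unrelated = cosine(vecs[0], vecs[2])  # documents sharing no terms
```

The first two documents share terms and so score higher than the first and third, which share none.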

Generality of Basic Techniques
[Figure: raw text is processed (stemming & stop words) into tokenized text, then term-weighted into a document-term matrix with rows d1 … dm, columns t1 … tn, and entries wij. Labels in the diagram: term similarity, doc similarity, sentence selection, vector centroid, CLUSTERING, CATEGORIZATION, SUMMARIZATION, META-DATA/ANNOTATION.]

Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
• Text Summarization

Information Filtering
• Stable & long term interest, dynamic info source
• System must make a delivery decision immediately as a document “arrives”
• Two Methods: Content-based vs. Collaborative
[Figure: a user profile (“my interest: …”) feeding a Filtering System]
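The content-based variant of this delivery decision can be sketched as a score against a stored interest profile plus a threshold. The scoring function, the profile weights, the threshold value, and the sample documents below are hypothetical choices for illustration; the slides do not prescribe any of them.

```python
from collections import Counter


def score(profile, text):
    """Average profile weight per token: a simple content-based match score."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(profile.get(t, 0.0) * c for t, c in counts.items()) / len(tokens)


def filter_stream(profile, stream, threshold):
    """Decide on each document as it arrives; yield it if it clears the threshold."""
    for doc in stream:
        if score(profile, doc) >= threshold:
            yield doc


# Hypothetical long-term interest profile (term -> weight).
profile = {"text": 1.0, "mining": 1.0, "retrieval": 0.5}

incoming = [
    "new text mining toolkit released",
    "stock markets close higher",
    "survey of retrieval and text mining methods",
]
delivered = list(filter_stream(profile, incoming, threshold=0.3))
```

Because `filter_stream` is a generator, each document is accepted or rejected without waiting for later documents, which mirrors the "decide on arrival" constraint.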

Examples of Information Filtering
• News filtering
• Email filtering
• Recommending Systems
• Literature alert
• And many others

Sample Applications
• Information Filtering
⇒ Text Categorization
• Document/Term Clustering
• Text Summarization

Text Categorization
• Pre-given categories and labeled document examples (Categories may form hierarchy)
• Classify new documents
• A standard supervised learning problem
[Figure: a Categorization System trained on labeled examples assigns new documents to categories such as Sports, Business, Education, …, Science]
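As one illustration of categorization as supervised learning, the sketch below trains a multinomial Naive Bayes classifier with add-one smoothing on a handful of labeled examples. Naive Bayes is a common choice picked here for brevity; the slides do not name a specific classifier, and the training sentences are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Collect per-class document counts, per-class term counts, and the vocabulary."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab


def classify_nb(model, text):
    """Return the class with the highest log posterior for the given text."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_logp = None, float("-inf")
    for label in class_docs:
        logp = math.log(class_docs[label] / total_docs)  # class prior
        denom = sum(term_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # ignore words never seen in training
                logp += math.log((term_counts[label][tok] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Invented training examples for two of the categories shown on the slide.
training = [
    ("the team won the match", "Sports"),
    ("the striker scored a late goal", "Sports"),
    ("shares fell as the market closed", "Business"),
    ("the company reported quarterly profit", "Business"),
]
model = train_nb(training)
predicted = classify_nb(model, "the market rewarded the company profit")
```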

Examples of Text Categorization
• News article classification
• Meta-data annotation
• Automatic Email sorting
• Web page classification

Sample Applications
• Information Filtering
• Text Categorization
⇒ Document/Term Clustering
• Text Summarization

The Clustering Problem
• Discover “natural structure”
• Group similar objects together
• Objects can be documents, terms, or passages
• Example

Similarity-induced Structure

Examples of Doc/Term Clustering
• Clustering of retrieval results
• Clustering of documents in the whole collection
• Term clustering to define “concept” or “theme”
• Automatic construction of hyperlinks
• In general, very useful for text mining
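Grouping similar documents can be sketched with a single-link rule: a document joins (and merges) every existing cluster that contains at least one sufficiently similar member. The slides do not name a clustering algorithm, so the single-link rule, the Jaccard similarity on word sets, the threshold, and the sample documents are all assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def threshold_clusters(docs, min_sim):
    """Single-link clustering: link documents whose similarity is >= min_sim."""
    token_sets = [set(d.lower().split()) for d in docs]
    clusters = []  # each cluster is a sorted list of document indices
    for i, tokens in enumerate(token_sets):
        # Existing clusters with at least one member similar enough to doc i.
        linked = [
            c for c in clusters
            if any(jaccard(tokens, token_sets[j]) >= min_sim for j in c)
        ]
        merged = [i]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(sorted(merged))
    return sorted(clusters)


docs = [
    "text mining and text retrieval",
    "football match results",
    "text retrieval and mining methods",
    "football match highlights",
]
clusters = threshold_clusters(docs, min_sim=0.4)
```

With these four documents the two text-mining documents end up together and the two football documents end up together.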

Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
⇒ Text Summarization

“Retrieval-based” Summarization
• Observation: term vector summary?
• Basic approach
  – Rank “sentences”, and select top N as a summary
• Methods for ranking sentences
  – Based on term weights
  – Based on position of sentences
  – Based on the similarity of sentence and document vector
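The third ranking method, similarity between each sentence and the document vector, can be sketched as follows. Raw term counts are used as weights, and the sentence splitter and sample text are simplifications assumed for the example rather than anything specified in the slides.

```python
import math
import re
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two term-count vectors (Counter objects)."""
    dot = sum(c * v[t] for t, c in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def summarize(text, n):
    """Rank sentences by similarity to the whole document and keep the top n."""
    # Naive sentence splitting on end punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    vectors = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    doc_vector = Counter()
    for v in vectors:
        doc_vector.update(v)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(vectors[i], doc_vector),
        reverse=True,
    )
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(ranked[:n])]


text = (
    "Text mining finds patterns in text. "
    "The weather was nice. "
    "Text mining uses text similarity."
)
summary = summarize(text, n=2)
```

The off-topic middle sentence scores lowest against the document vector and is dropped.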

Examples of Summarization
• News summary
• Summarize retrieval results
  – Single doc summary
  – Multi-doc summary
• Summarize a cluster of documents (automatic label creation for clusters)

NLP-Style Text Mining Techniques
Most of the following slides are from William Cohen’s IE tutorial

What is “Information Extraction”
As a family of techniques: Information Extraction = segmentation + classification + association + clustering
Example text:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
Resulting table:
NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..

Landscape of IE Tasks: Complexity
E.g. word patterns:
• Closed set: U.S. states
  – He was born in Alabama…
  – The big Wyoming sky…
• Regular set: U.S. phone numbers
  – Phone: (413) 545-1323
  – The CALD main office can be reached at 412-268-1299
• Complex pattern: U.S. postal addresses
  – University of Arkansas P.O. Box 140 Hope, AR 71802
  – Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210
• Ambiguous patterns, needing context and many sources of evidence: Person names
  – …was among the six houses sold by Hope Feldman that year.
  – Pawel Opalinski, Software Engineer at WhizBang Labs.
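The two simplest cases above can be sketched directly: a closed set is a membership test against a fixed list, and a regular set can be matched with a regular expression. The pattern below is an illustrative assumption that covers the two phone formats shown in the examples, not a complete grammar for U.S. phone numbers, and the state list is deliberately truncated to the two states that appear in the examples.

```python
import re

# Closed set: membership in a fixed list. Only the two states named in the
# examples are included; a real lexicon would list all of them.
STATES = {"Alabama", "Wyoming"}


def find_states(text):
    """Return words in the text that are members of the closed set."""
    return [w for w in re.findall(r"[A-Za-z]+", text) if w in STATES]


# Regular set: "(413) 545-1323" or "412-268-1299" style numbers.
PHONE = re.compile(r"(?:\(\d{3}\)\s*|\d{3}[-.\s])\d{3}[-.\s]\d{4}")


def find_phones(text):
    """Return all substrings that match the phone-number pattern."""
    return PHONE.findall(text)


states = find_states("He was born in Alabama. The big Wyoming sky.")
phones = find_phones(
    "Phone: (413) 545-1323. The CALD main office can be reached at 412-268-1299"
)
```

Neither approach carries over to the ambiguous case: "Hope" is a town in one example and a first name in another, which is why person names need context and multiple sources of evidence.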

Landscape of IE Techniques
[Figure: six techniques, each illustrated on the sentence “Abraham Lincoln was born in Kentucky.”]
• Lexicons: is the candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
• Classify Pre-segmented Candidates: a classifier decides which class each candidate belongs to
• Sliding Window: a classifier decides which class each window belongs to; try alternate window sizes
• Boundary Models: classifiers mark BEGIN and END positions
• Finite State Machines: most likely state sequence?
• Context Free Grammars: most likely parse?
Any of these models can be used to capture words, formatting or both.
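The sliding-window idea can be sketched by enumerating every window of up to a maximum size and asking a classifier for each one. In the sketch below a dictionary lookup stands in for the classifier; the lexicon entries, label names, and maximum window size are hypothetical, and a real system would use a trained model over the window and its context.

```python
def sliding_windows(tokens, max_size):
    """Yield (start, end, text) for every window of 1..max_size tokens."""
    for size in range(1, max_size + 1):
        for start in range(len(tokens) - size + 1):
            yield start, start + size, " ".join(tokens[start:start + size])


def extract(sentence, lexicon, max_size=3):
    """Classify each window; keep those the stand-in classifier labels."""
    tokens = sentence.rstrip(".").split()
    found = []
    for start, end, span in sliding_windows(tokens, max_size):
        label = lexicon.get(span)  # stand-in for "which class?"
        if label is not None:
            found.append((span, label, start, end))
    # Report entities in the order they occur in the sentence.
    return sorted(found, key=lambda item: item[2])


# Hypothetical lexicon used as the classifier for this example.
LEXICON = {"Abraham Lincoln": "PERSON", "Kentucky": "LOCATION"}

entities = extract("Abraham Lincoln was born in Kentucky.", LEXICON)
```

Trying several window sizes is what lets the two-token name and the one-token place both be found in the same pass.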

Statistical Learning Style Techniques for Text Mining

Many Techniques are Available
• Supervised learning
  – Classification
  – Regression
• Unsupervised learning
  – Topic models
  – Dimension reduction
• Most relevant methods
  – Generative models
  – Matrix decomposition

Topics for Discussion
• Social Science research questions:
  – Mining bias: selection bias, framing bias
• Text Mining techniques
  – Sentiment analysis
  – Topic discovery and evolution graph
  – Joint text-image analysis