Chapter 6 Clustering Textual Data


Chapter 6.1 Introduction to Text Mining

Objectives
  • Define text mining.
  • State the problems in using textual data for segmentation.
  • Describe how the text mining process works.
  • State the SAS Text Miner text processing features.

What Text Mining Is
Text mining is a process that employs a set of algorithms to convert unstructured text into structured data objects and then applies quantitative methods to analyze those objects. In short, text mining derives a quantitative representation of documents.

What Text Mining Is Not
Text mining is not
  • a text summarization tool (that is, it does not produce an executive summary of a report)
  • an information extraction methodology (that is, it does not provide search or query functions)
  • a natural language processor (that is, it does not address conversation between human and machine).

Text Mining Applications
  • Analyzing documents in a corpus (collection of documents) for the purpose of
    – Grouping documents into predetermined categories
      § Customer complaints, comments, and suggestions
      § Routing e-mails to specific persons
      § Routing news items to specific persons
    – Integrating text data with quantitative data to enrich predictive modeling
      § Predicting customer satisfaction using standard survey items plus comments (text)
      § Predicting true cost of service based on parts, labor, plus call center logs (text)

Text Mining Deficiencies
Text mining algorithms often perform poorly in distinguishing negations, for example:
  • Mr. X was involved in a motor vehicle accident.
  • Mr. X was NOT involved in a motor vehicle accident.
Text mining algorithms do not work well with large documents.
  • Performance is slow.
  • Increased term occurrence across documents decreases separation of documents.
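The negation problem can be seen in a few lines of Python. This is a minimal sketch, not how SAS Text Miner works internally: the tokenizer, the function name bag_of_words, and the STOP_WORDS set are all assumptions made for illustration. Once "not" is treated as a stop word, the two example sentences reduce to the same set of terms.

```python
import re

# Hypothetical stop list for illustration only; not the SAS Text Miner default.
STOP_WORDS = {"was", "not", "in", "a", "the"}

def bag_of_words(text, stop_words=frozenset()):
    """Return the sorted distinct lowercase terms in text, minus stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted(set(tokens) - set(stop_words))

s1 = "Mr. X was involved in a motor vehicle accident."
s2 = "Mr. X was NOT involved in a motor vehicle accident."

# Without a stop list the two term sets differ only by the single term "not".
print(bag_of_words(s1))
print(bag_of_words(s2))

# With the stop list applied, the negation disappears entirely.
print(bag_of_words(s1, STOP_WORDS) == bag_of_words(s2, STOP_WORDS))
```

Even when "not" is kept, it is just one more term among many, so two documents with opposite meanings remain close together in term space.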

Text Mining Data
A text mining data set contains records (rows) and fields (columns). The primary field for text mining is the document field. SAS Text Miner software supports a document field that contains text representing a single document. However, tools supplied with SAS Text Miner accommodate two raw data formats:
  • Textual data with one document per record (one file)
  • Documents stored as individual files in a well-defined file system hierarchy (multiple files, one document per file).

Generic Text Mining Process (How does it work?)

The SAS Text Mining Process in Detail
1. Preprocess document files to create a SAS data set.
    – TMFILTER macro
    – SAS language features
2. Parse the document field.
    – PARSE property in Text Miner
      § Stemming
      § Part-of-speech tagging
      § Entities
      § Stop/start lists
      § Synonym lists
      § And so forth

The SAS Text Mining Process (continued)
3. Derive the term-by-document frequency matrix.
    – The Text Miner Transform property
    – Frequency weights
    – Term weights
4. Transform the term-by-document frequency matrix.
    – The Text Miner Transform property
    – Singular Value Decomposition (SVD)
    – Roll Up Terms
5. Perform the analysis (cluster property).
    – Exploration
    – Clustering/unsupervised learning
    – Predictive modeling

Unprocessed Text in a News Group Posting
After preprocessing, text might look like: ...

SAS Text Miner Text Processing Features
  • Text parsing
  • Removal of stop words
  • Part-of-speech tagging
  • Stems and synonym handling
  • Entities

Stop Words
Stop words are words that have little or no value in identifying a document or in comparing documents. Standard stop lists contain stop words that are
  • Articles (the, a, this)
  • Conjunctions (and, but, or)
  • Prepositions (of, from, by).
Custom stop lists identify low-information words, such as the word “computer” in a collection of articles about computers.
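As a rough illustration of combining a standard and a custom stop list, the following Python sketch filters a tokenized sentence. The two sets contain only the words named above, and the function name remove_stop_words and the sample sentence are assumptions, not SAS Text Miner features.

```python
# Stop lists built from the examples above; real stop lists are much longer.
STANDARD_STOP = {"the", "a", "this", "and", "but", "or", "of", "from", "by"}
CUSTOM_STOP = {"computer"}  # low-information term for a corpus about computers

def remove_stop_words(tokens, stop_list):
    """Keep only the tokens whose lowercase form is not in stop_list."""
    return [t for t in tokens if t.lower() not in stop_list]

# Hypothetical sentence, tokenized on whitespace for simplicity.
tokens = "the price of this computer dropped but the memory did not".split()
kept = remove_stop_words(tokens, STANDARD_STOP | CUSTOM_STOP)
print(kept)
```

The custom list is simply unioned with the standard one, so domain-specific terms can be added without editing the standard list.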

After Removal of Stop Words
With stop words removed, the text might look like: ...

Stop Lists and Start Lists
  • With a stop list, the words on the list are removed.
  • With a start list, only the words on the list are kept.
  • SAS Text Miner accommodates either approach.

Stop List versus Start List
  • Use a start list when
    – documents are dominated by technical jargon
    – domain expertise is available.
  • Use a stop list when
    – documents are loosely related: news, business reports, Internet searches
    – domain expertise is not available.
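The contrast between the two approaches can be sketched as a single filter that takes either list. This is an illustrative Python sketch; the function filter_terms, the sample note, and both word lists are hypothetical and do not come from SAS Text Miner.

```python
def filter_terms(tokens, stop_list=None, start_list=None):
    """Apply a start list (keep only listed terms) or a stop list (drop listed terms).

    If both are given, the start list takes precedence; this is an
    assumption of this sketch, not documented product behavior.
    """
    if start_list is not None:
        return [t for t in tokens if t in start_list]
    if stop_list is not None:
        return [t for t in tokens if t not in stop_list]
    return list(tokens)

# Hypothetical jargon-heavy note, tokenized on whitespace.
tokens = "patient reported acute dyspnea after the procedure".split()

# A start list needs domain expertise: someone must know which terms matter.
with_start = filter_terms(tokens, start_list={"acute", "dyspnea"})

# A stop list needs no domain knowledge: it only removes generic words.
with_stop = filter_terms(tokens, stop_list={"the", "after"})

print(with_start)
print(with_stop)
```

The start list yields a much smaller vocabulary, which is why it suits jargon-dominated collections where an expert can enumerate the informative terms.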

Tagging Parts of Speech
Part-of-speech tagging determines whether a word is a common noun, verb, adjective, proper noun, adverb, and so forth. It also disambiguates the part of speech when the same word is used in different contexts, for example:
  • I wish that my bank had more ATM machines.
  • You can bank on either Cowboys or Broncos winning the Super Bowl next year.
  • Settlers living on the west bank of the river were forced to relocate.

Stemming Examples
  • BIG: BIG, BIGGER, BIGGEST
  • REACH: REACH, REACHES, REACHED, REACHING
  • WORK: WORK, WORKS, WORKED, WORKING
  • CHILD: CHILD, CHILDREN
  • KNIFE: KNIFE, KNIVES
  • PERRO: PERRO, PERRA (Spanish, male and female dog)

Stemming and Synonyms in SAS Text Miner
  • SAS Text Miner performs stemming to derive stem synonyms, for example, run/ran/runs/running, and combines these with defined synonyms, for example, run/sprint.
  • The default synonym data set for SAS Text Miner, sashelp.engsynms, is primarily for illustration.
    – You can always modify and create an individual synonym list.
  • Synonyms might split based on part of speech, for example, teach/train=verb, locomotive/train=noun.
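One way to picture how stems and synonyms combine is a two-step lookup keyed on the term and its part of speech. The Python sketch below is an assumption-laden illustration: the STEMS and SYNONYMS tables are hypothetical (they are not the contents of sashelp.engsynms), and the choice of "run" and "train" as the parent terms is made up for the example.

```python
# Hypothetical stem table: inflected form -> stem.
STEMS = {"ran": "run", "runs": "run", "running": "run",
         "children": "child", "knives": "knife"}

# Hypothetical synonym table keyed by (term, part of speech) -> parent term.
SYNONYMS = {("sprint", "verb"): "run",
            ("teach", "verb"): "train",
            ("locomotive", "noun"): "train"}

def canonical(term, pos):
    """Map a term to a (parent term, part of speech) pair: stem first, then synonym."""
    stem = STEMS.get(term.lower(), term.lower())
    parent = SYNONYMS.get((stem, pos), stem)
    return (parent, pos)

print(canonical("Running", "verb"))
print(canonical("sprint", "verb"))
print(canonical("teach", "verb"))
print(canonical("locomotive", "noun"))
```

Because the part of speech stays in the key, the verb "train" and the noun "train" remain separate terms even though they share a spelling.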

Spell Checking
  • SAS Text Miner does not perform spell checking.
  • SAS Text Miner treats misspelled words as acceptable terms.
  • If you believe that misspellings add noise, spell checking and correction should be performed as preprocessing tasks.
    – The trade-off is between leaving misspellings in, which makes the term-document matrix bigger and noisier, and correcting them, which removes potential information (such as an author fingerprint).

Syntax and Semantic Issues
“I made her duck” might mean:
  • I cooked waterfowl for her.
  • I cooked waterfowl belonging to her.
  • I caused her to quickly lower her head or body.
  • I waved my magic wand and turned her into a waterfowl.
Resolving ambiguities requires context (surrounding words and sentences). The written word loses the discriminating features of spoken language, such as volume and inflection.

Term-Document Frequency Matrices

                  Documents
Term   ID    D1      D2      …    Dn
T1     1     D1,1    D1,2    …    D1,n
T2     2     D2,1    D2,2    …    D2,n
…      …     …       …       …    …

Di,j = count of the number of times term i occurs in document j
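A term-document frequency matrix of this shape can be built with nothing more than word counts. The Python sketch below is illustrative only: it splits on whitespace with no parsing, stemming, or stop list, and the function name and the three short documents are made up for the example.

```python
from collections import Counter

def term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    terms = sorted(set().union(*counts))  # one row per distinct term
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

# Hypothetical three-document corpus.
docs = ["data mining text mining", "web mining", "text data"]
terms, matrix = term_document_matrix(docs)

for term, row in zip(terms, matrix):
    print(f"{term:8s} {row}")
```

Rows correspond to terms and columns to documents, matching the layout above; even in this tiny corpus a third of the cells are zero.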

Term-Document Frequency Matrices
Pitfalls
  • Sparse cells (too many zeroes)
  • Weak discriminatory power
  • Too large
Solution
  • Weighted term-document frequency
  • Singular value decomposition (SVD)
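To show what weighting does to raw counts, the Python sketch below applies one common scheme: a log frequency weight, log2(f + 1), multiplied by an inverse-document-frequency term weight, log2(n / df) + 1. This particular combination is an assumption chosen for illustration and is not necessarily the weighting that SAS Text Miner applies by default; the function name and the input matrix are hypothetical.

```python
import math

def weight_matrix(matrix):
    """Weight a term-by-document count matrix (rows = terms, columns = documents)."""
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        df = sum(1 for f in row if f > 0)               # documents containing the term
        idf = math.log2(n_docs / df) + 1 if df else 0.0  # rarer terms get larger weights
        weighted.append([math.log2(f + 1) * idf for f in row])
    return weighted

# Hypothetical counts: term 1 occurs once in every document,
# term 2 occurs twice but only in the first document.
counts = [[1, 1, 1, 1],
          [2, 0, 0, 0]]
weighted = weight_matrix(counts)
for row in weighted:
    print([round(w, 3) for w in row])
```

The term that appears in every document keeps a small weight, while the term concentrated in one document is boosted, which addresses the weak discriminatory power of raw counts. Weighting alone does not reduce the size or sparsity of the matrix; that is the role of the SVD.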

Chapter 6.2 Applications of Text Mining in Segmentation and Predictive Modeling

Objectives
  • State the business contexts of text mining applications.
  • Use SAS Text Miner to
    – cluster textual data
    – use text-based clusters as inputs in a predictive model.

Demonstrations of Text Mining
Cluster analysis of book titles (amazon.com data)
  • Business goal: to find similar groups of books based on their titles.
Predictive modeling using data generated from Text Miner (insurance company example)
  • Explore whether textual data improves predictive modeling.

The Amazon Book Titles Data
  • Amazon provides recommendations to users as they visit the site and make selections.
  • In the Amazon book section, users are given the option of viewing recommended titles for every title that they select, and results are displayed nine at a time.
  • Data were collected from a customer search for titles related to Web mining, text mining, and general data mining.
  • The goal is to find clusters (segments) in these titles.
  • Data set: Amznbooks

Clustering Textual Data
This demonstration illustrates how to cluster textual data.

Predictive Modeling with SAS Text Miner
[Diagram: numeric inputs X1, X2, …, Xk pass through cleaning, screening, derivation, transformation, and imputation; text inputs T1, T2, …, Tj pass through pre-processing, parsing, and transformation; both sets of inputs feed the model, which produces a score.]

The Insurance Data Set
  • The INSSUBRO data set contains 2,946 records and nine variables.
  • The data set represents a modified version of an insurance data set created for the Special Investigative Unit (SIU) of an insurance company.
  • The data have been changed to hide any confidential or proprietary information.

The Insurance Data Set (continued)
  • The text field of interest is called AdjustorNotes and represents free-format text entered by a workers’ compensation insurance claims adjustor.
  • Other fields contain special codes reflecting the facts of the case as well as claimant attributes.
  • The target variable is SubroFlag, which is coded 1 if the claim was successfully subrogated, or 0 if it was not successfully subrogated or if the SIU determined not to pursue subrogation.
    – Subrogation occurs when a third party might be responsible for an accident and an insurance carrier requires that the third party pay for some or all of the losses incurred. Subrogation is often referred to as “recovery” in that losses are recovered from a responsible party.

Claim Attributes
  • AgeGroup: Integers 1 to 5 represent five age groups going from youngest (1) to oldest (5), and 9 represents an invalid or missing birth date. Age is calculated as of the date of the injury.
  • BodyGroup: Integers 1 to 6 represent a coding of six predetermined body groups, and 9 is used when the part of the body affected by the injury is invalid or missing.
  • NatrGroup: Integers 1 to 6 represent a predetermined grouping of the nature of the accident, and 9 is used if the nature of the accident is invalid or missing.
  • OccuGroup: Three-digit integer grouping of occupation, and 999 if the occupation is invalid or unknown.
  • BACKflag: Coded as 1 if the injury involved the back, and 0 otherwise.
  • VEHflag: Coded as 1 if the injury involved a motor vehicle, and 0 otherwise.
  • ClaimNo: Claim number

Example Adjustor Notes (Text Field)
  • SWUNG BRIEFCASE INTO MINIVAN AND FELT A PULL IN LEFT SHOULDER
  • CLAIMANT STATES THAT HER LEFT ARM AND SHOULDER HURT FROM THE REPETITIOUS WORK OF LOADING, UNLOADING, AND SANDING PARTS
  • THE EMPLOYEE CLAIMS STRESS

Data Preparation
  • Clean and screen adjustor notes.
    – Adjustor notes were modified to correct misspellings.
    – Adjustor notes were modified to translate abbreviations, for example, LT=LEFT or LT=LIGHT, and to expand acronyms.
    – The data for this example were screened for confidentiality reasons, and any identifying information, such as dates, times, and locations, was removed.
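The LT=LEFT or LT=LIGHT example shows why abbreviation expansion cannot be a blind search-and-replace. The Python sketch below expands only unambiguous abbreviations and flags the rest for manual review. It is a hypothetical preprocessing step, not the procedure actually used on the INSSUBRO data; the function name, the RT entry, and the sample note are invented for illustration.

```python
# Hypothetical abbreviation table: each key maps to its possible expansions.
ABBREVIATIONS = {
    "LT": ("LEFT", "LIGHT"),  # ambiguous, as in the example above
    "RT": ("RIGHT",),         # assumed unambiguous for this sketch
}

def expand_abbreviations(note, table):
    """Expand unambiguous abbreviations; return the new text and tokens needing review."""
    expanded, needs_review = [], []
    for token in note.split():
        options = table.get(token.upper())
        if options is None:
            expanded.append(token)          # not an abbreviation
        elif len(options) == 1:
            expanded.append(options[0])     # safe to expand automatically
        else:
            expanded.append(token)          # leave as-is and flag it
            needs_review.append(token)
    return " ".join(expanded), needs_review

text, review = expand_abbreviations("PAIN IN LT SHOULDER AND RT ARM", ABBREVIATIONS)
print(text)
print(review)
```

Flagged tokens would still need a person, or a context-based rule, to choose the correct expansion.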

Use Clusters of Textual Data in Predictive Models
This demonstration illustrates how to create clusters of textual data for use in predictive modeling.

Plan of Analysis
We will run three predictive models and compare them.
  • Logistic regression with numeric variables only (no text mining data)
  • Logistic regression with numeric variables plus cluster membership from the text mining analysis
  • Logistic regression with numeric variables plus SVD dimensions from the text mining analysis
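The three models differ only in which inputs each record contributes. The Python sketch below assembles the three input rows for a single claim. It is a sketch under stated assumptions: the helper names, the number of clusters, and every value shown are hypothetical, and it does not fit any model or reproduce what the SAS nodes do.

```python
def one_hot(cluster_id, n_clusters):
    """Encode a cluster number in 1..n_clusters as a list of 0/1 indicators."""
    return [1 if cluster_id == k else 0 for k in range(1, n_clusters + 1)]

def build_inputs(numeric, cluster_id=None, n_clusters=0, svd=None):
    """Concatenate numeric inputs with optional text-derived inputs."""
    row = list(numeric)
    if cluster_id is not None:
        row += one_hot(cluster_id, n_clusters)
    if svd is not None:
        row += list(svd)
    return row

# Hypothetical claim: BACKflag=1, VEHflag=0, text cluster 2 of 3, two SVD scores.
numeric = [1, 0]
model_1 = build_inputs(numeric)                              # numeric only
model_2 = build_inputs(numeric, cluster_id=2, n_clusters=3)  # numeric + cluster membership
model_3 = build_inputs(numeric, svd=[0.12, -0.40])           # numeric + SVD dimensions

print(model_1)
print(model_2)
print(model_3)
```

Cluster membership enters as a set of indicator columns, whereas the SVD dimensions enter as continuous inputs. A regression procedure would typically treat one cluster indicator as the reference level.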

Compare Predictive Models
This demonstration compares predictive models built with numerical data only versus predictive models built with both numerical and textual data.