Text Mining is a Variant of DM
[Figure: Text Mining]
Copyright 2003-4, SPSS Inc.

Semantics • The meaning of it all • Approaches to meaning – Semantic networks

Unstructured Data and Text Mining D. Silver

Unstructured Data
• Definition: Information that either does not have a predefined data model or is not organized in a predefined manner
• Imprecise for several reasons:
– Structure of data may be easily implied, but not explicit
– Data may have explicit structure, but not for the task at hand
– Data may have some underlying structure that is not understood

80% of Data is Unstructured
• Much of it is text based:
– Business data:
  • Call center transcripts
  • Other CRM
– Email
– Open-ended survey responses
– Web pages
– Newsgroups
– Organizational documents
– Regulatory information
Copyright 2003-4, SPSS Inc.

Growth of Unstructured Data

Unstructured Information Management Architecture (UIMA)
• Architecture for the development, discovery, composition, and deployment of analytics on unstructured data
• Provides a common framework for processing unstructured data to extract meaning and create structured data and information
• IBM’s Watson uses UIMA for real-time content analytics

References:
• Text to attributes, p. 328-329
• Text mining, Section 9.5
• Web mining and beyond, Sections 9.6-9.8
• String conversion, p. 439
• http://en.wikipedia.org/wiki/Unstructured_data
• http://en.wikipedia.org/wiki/UIMA
• http://bigdataintegration.blogspot.ca/2012/02/unstructured-data-is-myth.html

Text Mining
• Text is:
– Unstructured, amorphous and challenging to parse
– Most common form of information exchange
– Motivation to extract information is compelling
• Text Mining differs from Data Mining
– Most authors strive to clearly inform the reader
– But humans do not have time to read/interpret everything
– TM focuses on extracting information ready for rapid machine or human consumption

Text Mining
• Two broad approaches:
– Natural Language Processing (Comp. Linguistics)
  • Extracts concepts based on semantics
  • Relies heavily on language morphology, syntax, and semantics
– Information Retrieval
  • Exploits the bag-of-words approach
  • Term weighting and text similarity measures
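
The Information Retrieval approach can be illustrated with a minimal sketch. It assumes raw term counts as the bag-of-words representation and cosine similarity as the text similarity measure; the slides do not fix either choice, and the function names and sample sentences are illustrative only.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on non-letters, and count each token."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative sentences (not from the slides).
doc1 = bag_of_words("The bank will raise interest rates")
doc2 = bag_of_words("Interest rates at the bank will rise")
doc3 = bag_of_words("The team won the hockey game")

# Word order is ignored: doc1 and doc2 share most terms, doc3 shares only "the".
print(cosine_similarity(doc1, doc2) > cosine_similarity(doc1, doc3))
```

Because only term counts are kept, the two banking sentences score as similar even though their word order differs, which is the trade-off the bag-of-words approach accepts.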

NLP Approach
[Figure: diagram of the NLP approach. Labels: Data Collection (Surveys, Web, Channel, Customer Data, Operational Systems); Text; NLP; Information Extraction; Concepts; Attributes; Concept Maps; Attitudes; Text Clustering; Categorization; Trending; Prediction; Outcomes; Actions (Attract, Grow, Retain, Fraud); Expert UI; Business UI; Business User]
Copyright 2003-4, SPSS Inc.

NLP Relies on the Building Blocks of Language
• Morphology
• Syntax
• Semantics
• Objective is to go from syntactic phrase:
– "Using a tool like Text Mining is a great idea for any organization that is interested in maintaining information on competitive intelligence."
• To semantic concept:
– Competitive Intelligence
Copyright 2003-4, SPSS Inc.

Morphology
• Understanding words
– Stems
– Affixes
  • Prefix
  • Suffix
– Inflectional elements
§ Reduces complexity of analysis
§ Reduces complexity of representation
§ Supports text mining
[Figure: a noun split into prefix, noun stem, and suffix: in - dispute - able]
Copyright 2003-4, SPSS Inc.

Syntax
• The Bank of Canada will curb inflation with higher interest rates
[Figure: parse tree of the sentence. Sentence = noun phrase (adjective "The", noun "Bank of Canada") + verb phrase (aux "will", verb "curb", noun "inflation") + prepositional phrase ("with", adjective "higher", noun phrase "interest rates")]
Copyright 2003-4, SPSS Inc.

Problems with NLP
• Limitations of Natural Language Processing
– Correctly identifying the role of noun phrases
– Representing abstract concepts
– Classifying synonyms
– Representing the number of concepts
• Limitations of technology
– Language-specific designs are required
– Classification speed
– Classifying hybrid words and sentences
Copyright 2003-4, SPSS Inc.

IR Approach
• Statistics applied to syntax yields pretty good results for:
– Information Filtering
– Text Categorization
– Document/Term Clustering
– Text Summarization

Generality of Basic Techniques
[Figure: processing pipeline. Raw text, then stemming & stop words, then tokenized text, then term weighting, producing the term-document weight matrix below. Term similarity and doc similarity support CLUSTERING; sentence selection supports SUMMARIZATION; vector centroids support CATEGORIZATION and META-DATA/ANNOTATION.]

      t1    t2   …   tn
d1    w11   w12  …   w1n
d2    w21   w22  …   w2n
…     …     …    …   …
dm    wm1   wm2  …   wmn
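
A minimal sketch of how the weight matrix might be built from raw text. The slide does not say which weighting scheme produces the w values, so this assumes TF-IDF (term count times log(m / document frequency)); the stop-word list, function names, and sample documents are illustrative, and stemming is omitted for brevity.

```python
import math
import re

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}  # tiny illustrative list


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def term_document_matrix(docs):
    """Return (terms, matrix); matrix[i][j] is the assumed TF-IDF weight of term j in doc i."""
    tokenized = [tokenize(d) for d in docs]
    terms = sorted({t for tokens in tokenized for t in tokens})
    m = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for tokens in tokenized if t in tokens) for t in terms}
    matrix = []
    for tokens in tokenized:
        matrix.append([tokens.count(t) * math.log(m / df[t]) for t in terms])
    return terms, matrix


# Illustrative documents (not from the slides).
docs = [
    "The price of oil is rising",
    "Oil and gas exports",
    "The hockey season is starting",
]
terms, W = term_document_matrix(docs)
print(terms)
for name, row in zip(("d1", "d2", "d3"), W):
    print(name, [round(w, 2) for w in row])
```

Each row of W is a document vector and each column a term vector, so the same matrix can feed document similarity, term similarity, and centroid computations.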

Stemming
• General:
– http://en.wikipedia.org/wiki/Stemming
– http://www.comp.lancs.ac.uk/computing/research/stemming/general/
– http://snowball.tartarus.org/texts/introduction.html *READ*
• Julie B. Lovins (1968)
– http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
– http://snowball.tartarus.org/algorithms/lovins/stemmer.html
• Martin Porter (1979)
– http://www.comp.lancs.ac.uk/computing/research/stemming/general/porter.htm
• Snowball (~2000)
– Framework for writing stemming algorithms
– Language and compiler for stemming algorithms
– http://snowball.tartarus.org
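
To make the idea of stemming concrete, here is a toy suffix stripper. It is not the Lovins, Porter, or Snowball algorithm; the suffix list, the minimum stem length, and the function name are assumptions chosen only to show how related word forms can be mapped to one stem.

```python
# Toy suffix list, ordered so that longer suffixes are tried before shorter ones.
SUFFIXES = ["ations", "ation", "ions", "ings", "ion", "ing", "ers", "er", "ed", "es", "s"]


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word  # no suffix applies, or the stem would be too short


for w in ["connected", "connecting", "connections", "connects", "is"]:
    print(w, "->", toy_stem(w))
```

Real stemmers add conditions on the remaining stem and recoding rules, which is why the references above are worth reading before relying on anything this simple.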

Information Filtering
• Stable & long-term interest, dynamic info source
• System must make a delivery decision immediately as a document “arrives”
• Two Methods: Content-based vs. Collaborative
[Figure: a Filtering System that applies "my interest" to arriving documents]
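
A minimal sketch of the content-based method, assuming the long-term interest is stored as a term-count vector and each arriving document is delivered when its cosine similarity to that profile reaches a threshold. The class name, the threshold value, and the sample stream are hypothetical; collaborative filtering is not shown.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(w * b[t] for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class ContentFilter:
    """Holds a stable interest profile and decides on each document as it arrives."""

    def __init__(self, interest_text, threshold=0.3):
        self.profile = vectorize(interest_text)
        self.threshold = threshold  # hypothetical cut-off; would be tuned in practice

    def deliver(self, document):
        """Immediate yes/no delivery decision for one arriving document."""
        return cosine(self.profile, vectorize(document)) >= self.threshold


f = ContentFilter("interest rates inflation bank")
stream = [
    "Bank of Canada raises interest rates to curb inflation",
    "Local team wins hockey tournament",
]
delivered = [doc for doc in stream if f.deliver(doc)]
print(delivered)
```

The decision for each document depends only on the profile and that document, which is what lets the system decide immediately instead of waiting to rank a whole collection.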

Examples of Information Filtering
• News filtering
• Email filtering
• Recommender systems
• Literature alert
• And many others

Sample Applications
• Information Filtering
• Text Categorization
⇒ Document/Term Clustering
• Text Summarization

The Clustering Problem
• Discover “natural structure”
• Group similar objects together
• Objects can be documents, terms, or passages
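
One simple way to group similar documents is single-pass clustering with a similarity threshold, sketched below. This is only one of many clustering techniques and is not necessarily the one the slides have in mind; the threshold, function names, and sample documents are assumptions.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(w * b[t] for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass_cluster(docs, threshold=0.3):
    """Assign each document to the most similar existing cluster, or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [document indices]}
    for i, doc in enumerate(docs):
        vec = vectorize(doc)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            # Summed counts act as an unnormalized centroid; cosine ignores the scale.
            best["centroid"].update(vec)
            best["members"].append(i)
    return [c["members"] for c in clusters]


# Illustrative documents (not from the slides).
docs = [
    "interest rates rise at the bank",
    "hockey team wins the game",
    "bank raises interest rates again",
    "the hockey game went to overtime",
]
print(single_pass_cluster(docs))
```

The result depends on document order and on the threshold, so this sketch is best read as an illustration of similarity-driven grouping and not as a recommended algorithm.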

Similarity-induced Structure

Examples of Doc/Term Clustering
• Clustering of retrieval results
• Clustering of documents in the whole collection
• Term clustering to define “concept” or “theme”
• Automatic construction of hyperlinks
• In general, very useful for text mining

Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
⇒ Text Summarization

“Retrieval-based” Summarization
• Observation: term vector summary?
• Basic approach
– Rank “sentences”, and select top N as a summary
• Methods for ranking sentences
– Based on term weights
– Based on position of sentences
– Based on the similarity of sentence and document vector
– NOTE: Similarity can be measured by inner product of vectors of term frequencies
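
The similarity-based ranking method can be sketched as follows: each sentence is scored by the inner product of its term-frequency vector with the document's term-frequency vector, and the top N sentences are kept. The sentence splitter, function names, and sample text are assumptions made for illustration.

```python
import re
from collections import Counter


def term_vector(text):
    """Term-frequency vector of a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def inner_product(a, b):
    """Inner product of two sparse term-frequency vectors."""
    return sum(count * b[term] for term, count in a.items())


def summarize(document, n=1):
    """Return the n sentences whose term vectors best match the whole document."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    doc_vec = term_vector(document)
    ranked = sorted(
        sentences,
        key=lambda s: inner_product(term_vector(s), doc_vec),
        reverse=True,
    )
    top = set(ranked[:n])
    return [s for s in sentences if s in top]  # keep the original sentence order


# Illustrative text (not from the slides).
text = "The bank raised interest rates. Higher rates slow inflation. The weather was mild."
print(summarize(text, n=1))
```

An unnormalized inner product favors long sentences and frequent words such as "the", so in practice this ranking is usually combined with stop-word removal, term weighting, or the position-based method listed above.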

Examples of Summarization
• News summary
• Summarize retrieval results
– Single doc summary
– Multi-doc summary
• Summarize a cluster of documents (automatic label creation for clusters)

Sample Applications
• Information Filtering
⇒ Text Categorization
• Document/Term Clustering
• Text Summarization

Text Categorization
• Pre-given categories and labeled document examples (categories may form a hierarchy)
• Classify new documents
• A standard supervised learning problem
[Figure: a Categorization System trained on labeled examples (Sports, Business, Education, …) assigns new documents to categories such as Sports, Business, Education, and Science]
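
As a minimal sketch of categorization as supervised learning, the code below builds one term-vector centroid per category from labeled examples and assigns a new document to the category with the most similar centroid. The nearest-centroid choice, the function names, and the tiny training set are assumptions; any supervised learner could take its place.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(w * b[t] for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(labeled_docs):
    """Sum the term vectors of the labeled examples in each category."""
    centroids = {}
    for text, label in labeled_docs:
        # A summed vector points the same way as the mean, so cosine is unaffected.
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids


def classify(text, centroids):
    """Return the category whose centroid is most similar to the document."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Illustrative labeled examples (not from the slides).
training = [
    ("the team won the hockey game", "Sports"),
    ("the striker scored a late goal", "Sports"),
    ("the bank raised interest rates", "Business"),
    ("shares fell as profits dropped", "Business"),
]
centroids = train_centroids(training)
print(classify("interest rates and profits", centroids))
print(classify("hockey goal", centroids))
```

With more categories or a hierarchy, the same pattern applies: the labeled examples define each category's vector, and new documents are compared against them.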

Examples of Text Categorization
• News article classification
• Meta-data annotation
• Automatic Email sorting
• Web page classification

References
• http://paginas.fe.up.pt/~ec/files_0405/slides/07%20TextMining.pdf
• http://disi.unitn.it/~bernardi/Courses/CL/Slides/ir.pdf
• Multinomial Distribution
– http://onlinestatbook.com/2/probability/multinomial.html
– http://onlinestatbook.com/2/probability/binomial.html

WEKA Tutorials
• https://moodle.umons.ac.be/pluginfile.php/43703/mod_resource/content/2/WekaTutorial.pdf
• http://www.unal.edu.co/diracad/einternacional/Weka.pdf