Unstructured Data and Text Mining D. Silver
- Slides: 41
Unstructured Data
• Definition: Information that either does not have a predefined data model or is not organized in a predefined manner
• The definition is imprecise for several reasons:
  – The structure of the data may be easily implied, but not explicit
  – The data may have explicit structure, but not for the task at hand
  – The data may have some underlying structure that is not understood
80% of Data is Unstructured
• Much of it is text based:
  – Business data:
    • Call center transcripts
    • Other CRM
  – Email
  – Open-ended survey responses
  – Web pages
  – Newsgroups
  – Organizational documents
  – Regulatory information
Copyright 2003-4, SPSS Inc.
Growth of Unstructured Data
Unstructured Information Management Architecture (UIMA)
• Architecture for the development, discovery, composition, and deployment of analytics on unstructured data
• Provides a common framework for processing unstructured data to extract meaning and create structured data and information
• IBM’s Watson uses UIMA for real-time content analytics
References:
• Text to attributes, p. 328-329
• Text mining, Section 9.5
• Web mining and beyond, Section 9.6-9.8
• String conversion, p. 439
• http://en.wikipedia.org/wiki/Unstructured_data
• http://en.wikipedia.org/wiki/UIMA
• http://bigdataintegration.blogspot.ca/2012/02/unstructured-data-is-myth.html
Text Mining
• Text is:
  – Unstructured, amorphous, and challenging to parse
  – The most common form of information exchange
  – Motivation to extract information is compelling
• Text Mining differs from Data Mining:
  – Most authors strive to clearly inform the reader
  – But humans do not have time to read and interpret everything
  – TM focuses on extracting information ready for rapid machine or human consumption
Text Mining
• Two broad approaches:
  – Natural Language Processing (Computational Linguistics)
    • Extracts concepts based on semantics
    • Relies heavily on language morphology, syntax, and semantics
  – Information Retrieval
    • Exploits the bag-of-words approach
    • Uses term weighting and text similarity measures
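Sketch (not part of the original slides): a minimal illustration of the Information Retrieval view, where each text becomes a bag of words and two texts are compared with a similarity measure. The function names and the choice of cosine similarity are assumptions made for illustration; the slides do not name a specific measure at this point.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, keep runs of letters, and count each term."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = bag_of_words("The bank will curb inflation")
doc2 = bag_of_words("Inflation will rise, the bank said")
doc3 = bag_of_words("Open-ended survey responses")

# Word order is discarded: only shared terms and their counts matter.
print(cosine_similarity(doc1, doc2))  # shares four terms with doc1
print(cosine_similarity(doc1, doc3))  # shares no terms with doc1
```

Because word order is ignored, "the bank will curb inflation" and "inflation will curb the bank" get identical vectors, which is the main trade-off of this approach compared with NLP.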
Text Mining is a Variant of DM
NLP Approach
[Diagram labels: Concept Maps, Attitudes, Attract, Text Clustering, Grow, Surveys, Concepts, Web, Channel, Attributes, Customer Data, Trending, Outcomes, Information Extraction, Operational Systems, Data Collection, Retain, Expert UI, Prediction, Business UI, Text, Actions, NLP, Categorization, Fraud, Business User]
NLP Relies on the Building Blocks of Language
• Morphology
• Syntax
• Semantics
• Objective is to go from a syntactic phrase:
  – Using a tool like Text Mining is a great idea for any organization that is interested in maintaining information on competitive intelligence.
• To a semantic concept:
  – Competitive Intelligence
Morphology
• Understanding words
  – Stems
  – Affixes
    • Prefix
    • Suffix
  – Inflectional elements
§ Reduces complexity of analysis
§ Reduces complexity of representation
§ Supports text mining
[Diagram: Noun = Prefix (in) - Noun Stem (dispute) - Suffix (able)]
Syntax
• The Bank of Canada will curb inflation with higher interest rates
[Parse tree: Sentence = Noun phrase (Adjective: The; Noun: Bank of Canada), Verb phrase (Aux: will; Verb: curb; Noun: inflation), Prepositional phrase (with; Noun phrase (Adjective: higher; Noun: interest rates))]
Semantics
• The meaning of it all
• Approaches to meaning
  – Semantic networks
  – Deductive logic
  – Rule-based systems
• Useful for classification of documents
Problems with NLP
• Limitations of Natural Language Processing
  – Correctly identifying the role of noun phrases
  – Representing abstract concepts
  – Classifying synonyms
  – Representing the number of concepts
• Limitations of technology
  – Language-specific designs are required
  – Classification speed
  – Classifying hybrid words and sentences
IR Approach
• Statistics applied to syntax yields pretty good results for:
  – Information Filtering
  – Text Categorization
  – Document/Term Clustering
  – Text Summarization
Generality of Basic Techniques
[Diagram: Raw text → Tokenized text → Stemming & Stop words → Term Weighting → document-term weight matrix]
        t1    t2    …    tn
  d1    w11   w12   …    w1n
  d2    w21   w22   …    w2n
  …     …     …     …    …
  dm    wm1   wm2   …    wmn
[Diagram, continued: Term similarity and Doc similarity → CLUSTERING; Sentence selection → SUMMARIZATION; META-DATA/ANNOTATION; Vector centroid → CATEGORIZATION]
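Sketch (not part of the original slides): one way the pipeline above could produce the weight matrix, where entry w_ij is the weight of term t_j in document d_i. The stop-word list, the sample documents, and the choice of tf-idf as the weighting scheme are all assumptions for illustration; the slide only says "Term Weighting". Stemming is left out here to keep the example short.

```python
import math
import re

# Illustrative stop-word list only; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}


def tokenize(text):
    """Raw text -> lowercase tokens with stop words removed."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return (terms, matrix) with matrix[i][j] = tf(d_i, t_j) * log(m / df(t_j))."""
    tokenized = [tokenize(d) for d in docs]
    terms = sorted({t for toks in tokenized for t in toks})
    m = len(docs)
    # Document frequency: in how many documents each term appears.
    df = {t: sum(1 for toks in tokenized if t in toks) for t in terms}
    matrix = [[toks.count(t) * math.log(m / df[t]) for t in terms] for toks in tokenized]
    return terms, matrix


docs = [
    "The call center transcripts mention billing",
    "Email and call transcripts",
    "Survey responses mention billing",
]
terms, W = tfidf_matrix(docs)

print(terms)
for row in W:
    print([round(w, 2) for w in row])
```

Rows of W can be compared to measure document similarity, and columns can be compared to measure term similarity, which is why the same matrix supports clustering, summarization, and categorization.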
Stemming
• General:
  – http://en.wikipedia.org/wiki/Stemming
  – http://www.comp.lancs.ac.uk/computing/research/stemming/general/
  – http://snowball.tartarus.org/texts/introduction.html *READ*
• Julie B. Lovins (1968)
  – http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
  – http://snowball.tartarus.org/algorithms/lovins/stemmer.html
• Martin Porter (1979)
  – http://www.comp.lancs.ac.uk/computing/research/stemming/general/porter.htm
• Snowball (~2000)
  – Framework for writing stemming algorithms
  – Language and compiler for stemming algorithms
  – http://snowball.tartarus.org
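Sketch (not part of the original slides): a deliberately crude suffix-stripping stemmer to show the basic idea of conflating word variants to one stem. This is not the Lovins, Porter, or Snowball algorithm; the suffix list and the minimum stem length are hypothetical choices, and the real algorithms apply many more rules and conditions.

```python
# Hypothetical suffix list, ordered longest first so "ions" is tried before "s".
SUFFIXES = ["ions", "ion", "ing", "ed", "s"]


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, provided at least min_stem letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


variants = ["connect", "connected", "connecting", "connection", "connections"]
print({w: toy_stem(w) for w in variants})
```

All five variants reduce to the same stem, so they would share one column in a term-weight matrix. A stemmer this simple also makes mistakes (for example, it turns "news" into "new"), which is part of what the more careful algorithms referenced above address.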
Information Filtering
• Stable, long-term interest; dynamic information source
• System must make a delivery decision immediately as a document “arrives”
• Two methods: content-based vs. collaborative
[Diagram labels: my interest: …; Filtering System]
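Sketch (not part of the original slides): a minimal content-based filter that holds a fixed interest profile and makes an immediate deliver/skip decision for each arriving document. The class name, the overlap score, and the 0.5 threshold are assumptions for illustration; collaborative filtering, the other method named above, is not shown.

```python
import re


def terms(text):
    """Set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


class ContentFilter:
    """Hypothetical filter: deliver a document if it covers enough of the profile."""

    def __init__(self, interest, threshold=0.5):
        self.profile = terms(interest)  # stable, long-term interest
        self.threshold = threshold

    def score(self, document):
        """Fraction of profile terms that appear in the document."""
        if not self.profile:
            return 0.0
        return len(self.profile & terms(document)) / len(self.profile)

    def deliver(self, document):
        """Decide immediately, using only the profile and this one document."""
        return self.score(document) >= self.threshold


news_filter = ContentFilter("interest rates inflation")
stream = [
    "Bank of Canada raises interest rates",
    "Local team wins the final",
    "Inflation slows as rates hold",
]
delivered = [doc for doc in stream if news_filter.deliver(doc)]
print(delivered)
```

The decision for each document depends only on the profile and that document, so the filter never needs to see the rest of the stream.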
Examples of Information Filtering
• News filtering
• Email filtering
• Recommender systems
• Literature alerts
• And many others
Sample Applications
• Information Filtering
• Text Categorization
⇒ Document/Term Clustering
• Text Summarization
The Clustering Problem
• Discover “natural structure”
• Group similar objects together
• Objects can be documents, terms, or passages
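Sketch (not part of the original slides): grouping similar documents with a single-pass (leader) clustering loop. The algorithm, the Jaccard similarity on term sets, and the 0.25 threshold are assumptions chosen because they are short and deterministic; the slides do not prescribe a particular clustering method.

```python
import re


def terms(text):
    """Set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Shared terms divided by all terms (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_pass_cluster(docs, threshold=0.25):
    """Assign each document to the most similar cluster leader, or start a new cluster.

    Returns a list of clusters, each a list of document indices whose first
    element is the leader of that cluster.
    """
    clusters = []
    for i, doc in enumerate(docs):
        vec = terms(doc)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(vec, terms(docs[cluster[0]]))
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


docs = [
    "interest rates and inflation",
    "email filtering rules",
    "inflation and interest rates rise",
    "spam email filtering",
]
print(single_pass_cluster(docs))
```

No category labels are given in advance: the groups emerge only from pairwise similarity, which is what distinguishes clustering from categorization.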
Similarity-induced Structure
Examples of Doc/Term Clustering
• Clustering of retrieval results
• Clustering of documents in the whole collection
• Term clustering to define “concept” or “theme”
• Automatic construction of hyperlinks
• In general, very useful for text mining
Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
⇒ Text Summarization
“Retrieval-based” Summarization
• Observation: term vector summary?
• Basic approach
  – Rank “sentences”, and select the top N as a summary
• Methods for ranking sentences
  – Based on term weights
  – Based on position of sentences
  – Based on the similarity of sentence and document vector
  – NOTE: Similarity can be measured by the inner product of vectors of term frequencies
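Sketch (not part of the original slides): ranking sentences by the inner product of each sentence's term-frequency vector with the whole document's term-frequency vector, then keeping the top N. The function names, the regex-based sentence splitter, and the sample text are assumptions for illustration.

```python
import re
from collections import Counter


def term_freq(text):
    """Term-frequency vector of the text as a Counter."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def summarize(document, n=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    doc_vec = term_freq(document)

    def score(sentence):
        # Inner product of the sentence vector and the document vector.
        return sum(count * doc_vec[t] for t, count in term_freq(sentence).items())

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]


text = "Text mining extracts information from text. It is useful. Data mining is different."
print(summarize(text, n=1))
```

An unnormalized inner product favors long sentences, since every extra term can only add to the score. Dividing by sentence length or using cosine similarity are common adjustments.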
Examples of Summarization
• News summary
• Summarize retrieval results
  – Single-doc summary
  – Multi-doc summary
• Summarize a cluster of documents (automatic label creation for clusters)
Sample Applications
• Information Filtering
⇒ Text Categorization
• Document/Term Clustering
• Text Summarization
Text Categorization
• Pre-given categories and labeled document examples (categories may form a hierarchy)
• Classify new documents
• A standard supervised learning problem
[Diagram: labeled examples and new documents pass through a Categorization System; category labels shown: Sports, Business, Education, Science, …]
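Sketch (not part of the original slides): text categorization as supervised learning, using a small multinomial naive Bayes classifier with add-one smoothing. The slides do not say which classifier is intended, so this choice, the class name, and the tiny training set are assumptions for illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word tokens of the text."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Hypothetical multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)      # documents per category
        self.term_counts = defaultdict(Counter)  # term counts per category
        for doc, label in zip(documents, labels):
            self.term_counts[label].update(tokenize(doc))
        self.vocabulary = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, document):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_terms = sum(self.term_counts[label].values())
            for term in tokenize(document):
                if term in self.vocabulary:  # terms never seen in training are skipped
                    count = self.term_counts[label][term]
                    score += math.log((count + 1) / (total_terms + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_docs = [
    "the team won the match",
    "the coach praised the players",
    "shares rose as profits grew",
    "the company reported higher profits",
]
train_labels = ["Sports", "Sports", "Business", "Business"]
clf = NaiveBayesTextClassifier().fit(train_docs, train_labels)

print(clf.predict("profits grew at the company"))
print(clf.predict("the team played the match"))
```

The categories and labeled examples are supplied up front, and the classifier only assigns new documents to one of those categories. A hierarchy of categories, as mentioned above, is not handled by this flat sketch.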
Examples of Text Categorization
• News article classification
• Meta-data annotation
• Automatic email sorting
• Web page classification
References
• http://paginas.fe.up.pt/~ec/files_0405/slides/07%20TextMining.pdf
• http://disi.unitn.it/~bernardi/Courses/CL/Slides/ir.pdf
• Multinomial Distribution
  – http://onlinestatbook.com/2/probability/multinomial.html
  – http://onlinestatbook.com/2/probability/binomial.html
WEKA Tutorials
• https://moodle.umons.ac.be/pluginfile.php/43703/mod_resource/content/2/WekaTutorial.pdf
• http://www.unal.edu.co/diracad/einternacional/Weka.pdf