Central University of Las Villas, Cuba
Artificial Intelligence Lab, Computer Science Department
Text Mining
Prof. Leticia Arco García
leticiaarco@gmail.com

Background: Central University “Marta Abreu” of Las Villas

Motivation: unstructured data
“We are drowning in information, but starving for knowledge” (John Naisbitt)
Advances in Knowledge Discovery and Data Mining. AAAI Press and MIT Press, Menlo Park and Cambridge, MA, USA, 1996
[Figure: kinds of unstructured data (audio, image, video, text, documents) and data volumes (petabytes, zettabytes, 2.5 quintillion)]

Contents
§ Origin and definitions
§ Techniques
§ Natural language processing
§ Textual representation approaches
§ Tools and resources
§ Applications in Business Informatics
§ Our results
§ Challenges: SemEval 2017?

Origin
§ The challenge of exploiting the large unstructured proportion of enterprise information [1] (Luhn, 1958)
§ Manual text mining approaches [2] (Schulman et al., 1989)
§ Knowledge Discovery in Texts (KDT) [3] (Feldman and Dagan, 1995)
[1] Luhn, H. P. (1958). A business intelligence system. IBM Journal, October, pp. 314-319.
[2] Schulman, P., Castellon, C., Seligman, M. (1989). Assessing explanatory style: the content analysis of verbatim explanations and the attributional style questionnaire. Behav. Res. Ther., Vol. 27, No. 5, pp. 505-512.
[3] Feldman, R., Dagan, I. (1995). Knowledge Discovery in Textual Databases (KDT). In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), pp. 112-117.

Text Mining definitions (1/2)
§ Text mining can be broadly defined as a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools
§ Text mining is the study and practice of extracting information from text using the principles of computational linguistics
§ Text mining as exploratory data analysis is a method of (building and) using software systems to support researchers in deriving new and relevant information (knowledge) from large text collections

Text Mining definitions (2/2)
§ Text mining is the establishment of previously unknown and unsuspected relations between features in a (textual) database
§ Text mining is a knowledge creation tool, because it offers powerful possibilities for creating knowledge and relevance out of the massive amounts of unstructured information available on the Internet and corporate intranets
§ Text mining is defined as the automatic discovery of hidden patterns, traits, or unknown information and knowledge from textual data

Text Mining vs Data Mining
                            Search (goal-oriented)    Discover (opportunistic)
Structured data             Data retrieval            Data mining
Unstructured data (text)    Information retrieval     Text mining

Multidisciplinary field
§ Natural language processing
§ Computational linguistics
§ Machine learning
§ Visualization
§ Database systems
§ Data mining
§ Statistics

Techniques
§ Information retrieval
§ Textual analysis
§ Text clustering
§ Generation of term association
§ Topic detection
§ Text categorization and classification
§ Text summarization

Techniques: Information retrieval
1. Indexing process
2. Information retrieval model
3. Queries
Crawlers: Nutch, Scrapy, …
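The three information retrieval steps (indexing, retrieval model, queries) can be illustrated with a minimal sketch. It assumes whitespace tokenization and a Boolean AND retrieval model; the function names `build_index` and `search` and the sample documents are hypothetical, and real systems such as Lucene use far richer models.

```python
from collections import defaultdict

def build_index(docs):
    """Indexing: map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Boolean AND model: return ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical toy collection
docs = {
    1: "text mining discovers patterns in text",
    2: "data mining works on structured data",
    3: "information retrieval answers queries",
}
index = build_index(docs)
hits = search(index, "mining text")
```

An inverted index keeps query time proportional to the number of matching postings rather than to the size of the collection.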

Techniques: Textual analysis
• Dictionaries
• Taggers
• Ontologies
• Parsers

Techniques: Text clustering
• Different clustering classifications
• Distances and similarities
• Clustering validity measures
• Cluster labeling

Techniques: Generation of term association
• Mining frequent patterns
• Discovering associations
• Discovering correlations
• Generating association rules from frequent itemsets
• Generating cause-effect rules
• Extracting decision rules
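Generating association rules from frequent itemsets can be sketched as follows. This is a minimal illustration restricted to term pairs, assuming each document is reduced to a set of terms; the names `frequent_pairs` and `rules`, the thresholds and the sample data are hypothetical.

```python
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return {pair: support} for term pairs whose support reaches min_support."""
    n = len(transactions)
    counts = {}
    for terms in transactions:
        for pair in combinations(sorted(terms), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

def rules(transactions, min_support, min_confidence):
    """Derive rules x -> y from frequent pairs; confidence = support(x, y) / support(x)."""
    n = len(transactions)
    single = {}
    for terms in transactions:
        for t in terms:
            single[t] = single.get(t, 0) + 1
    out = []
    for (a, b), support in frequent_pairs(transactions, min_support).items():
        for x, y in ((a, b), (b, a)):
            confidence = support * n / single[x]
            if confidence >= min_confidence:
                out.append((x, y, support, confidence))
    return out

# Hypothetical documents, each reduced to its set of terms
docs = [
    {"invoice", "approval", "manager"},
    {"invoice", "approval"},
    {"invoice", "payment"},
    {"approval", "manager"},
]
found = rules(docs, min_support=0.5, min_confidence=0.9)
```

Here the only rule that survives both thresholds is manager -> approval, because every document mentioning "manager" also mentions "approval".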

Techniques: Topic detection
• Supervised models
• Unsupervised models

Techniques: Text categorization and classification
• Support vector machines
• Decision trees
• Neural networks
• Naïve Bayes
• Deep learning

Techniques: Text summarization
• Single-document
• Multi-document
• Extracts
• Abstracts
• Domain specific
• Domain independent

Natural language processing levels
§ Phonology: sound of words
§ Morphology: nature of words
§ Lexical: interpretation of an individual word
§ Syntactic: grammatical structure of the sentence
§ Semantic: look at the whole sentence to discover the meaning
§ Discourse: connections between sentences are made
  • Anaphora resolution
  • Text structure recognition
§ Pragmatic: try to reveal an extra meaning of the text

Different linguistic approaches
1. Graphemic level: analysis on a sub-word level, commonly concerning letters
2. Lexical level: analysis concerning individual words
3. Syntactic level: analysis concerning the structure of sentences
4. Semantic level: analysis related to the meaning of words and phrases
5. Pragmatic level: analysis related to meaning with regard to language-dependent and language-independent (e.g. application-specific) context
Graphemic and lexical approaches operate solely on plain statistical facts about text:
• They cannot completely capture the meaning of documents.
• There is only a weak relationship between term occurrences and document content.
Syntactic, semantic and pragmatic approaches try to capture more semantic content by exploiting an increasing amount of contextual information:
• Structure of sentences
• Paragraphs
• Documents

Natural language processing tasks
§ Syntax
§ Semantics
§ Discourse
§ Speech

Syntax
§ Morphological segmentation: separate words into individual morphemes and identify the class of the morphemes
§ Normalization (lemmatization, stemming, …): reduce inflectional forms and sometimes derivationally related forms of a word to a common base form
§ Part of speech tagging
§ Parsing
§ Sentence breaking (sentence boundary disambiguation)
  • “I will see you tomorrow at 5 p.m. Saturday night we will go to the restaurant.”
  • “I will come back at 5 p.m. Saturday.”
§ Word segmentation: separate a chunk of continuous text into separate words
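Sentence breaking can be illustrated with a minimal rule-based sketch. It assumes a sentence ends at ".", "!" or "?" followed by whitespace and an uppercase letter, unless the token is in a small abbreviation list; `split_sentences` and `ABBREVIATIONS` are hypothetical names, and the list is illustrative only.

```python
import re

ABBREVIATIONS = {"p.m.", "a.m.", "e.g.", "i.e.", "prof."}  # illustrative, not exhaustive

def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split after ., ! or ? followed by whitespace and an uppercase letter,
    unless the token ending at that position is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        # Last whitespace-delimited token before the candidate boundary
        token = text[start:match.start() + 1].split()[-1].lower()
        if token in abbreviations:
            continue
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

with_list = split_sentences("I will come back at 5 p.m. Saturday.")
without_list = split_sentences("I will come back at 5 p.m. Saturday.", abbreviations=set())
```

With the abbreviation list, "5 p.m. Saturday" stays in one sentence. The same rule would also keep a text together when "p.m." really does end a sentence, which is why this task is treated as a disambiguation problem rather than a fixed rule.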

Semantics
§ Named entity recognition (NER): determine which items in the text map to proper names (e.g. person, location, organization)
§ Natural language generation: convert information from computer databases or semantic intents into readable human language (e.g. from BPMN to natural description)
§ Natural language understanding
§ Question answering: given a human-language question, determine its answer
§ Recognizing textual entailment: given two text fragments, determine if one being true entails the other
§ Relationship extraction: given a chunk of text, identify the relationships among named entities (e.g. antecedents and consequents in a decision rule)
§ Sentiment analysis
§ Topic segmentation and recognition: given a chunk of text, separate it into segments each of which is devoted to a topic, and identify the topic of the segment (e.g. textual segments which contribute to a decision)
§ Word sense disambiguation: select the meaning which makes the most sense in context (e.g. run)

Discourse
§ Automatic summarization
§ Coreference resolution (anaphora resolution)
§ Discourse analysis: identify the discourse structure of connected text, i.e. the nature of the discourse relationships between sentences (e.g. useful for identifying the BPMN flow)
Example: “For a value of more than 5,000, senior management approval is required (A8). If this is granted, the invoice may be finally approved”.

Textual representation models
§ Based on vector space model
  • Vector Space Model (VSM)
  • Latent Semantic Analysis (LSA)
§ Based on graphs
  • Nodes: textual units
  • Edges: relations between textual units
§ Probabilistic models
  • Probabilistic Latent Semantic Analysis (PLSA)
  • Latent Dirichlet Allocation (LDA)
§ Word2vec (word embeddings)
  • Input: a large corpus of text
  • Output: a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space

Graph-based models

Vector Space Model
              Term 1   Term 2   …   Term m
Document 1    w11      w12      …   w1m
Document 2    w21      w22      …   w2m
…             …        …        …   …
Document n    wn1      wn2      …   wnm
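The document-by-term matrix can be built in a few lines. The sketch below assumes raw term counts as the weights and whitespace tokenization, and compares documents with cosine similarity; `term_document_matrix` and `cosine` are hypothetical helper names.

```python
import math

def term_document_matrix(docs):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    column = {t: j for j, t in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, tokens in enumerate(tokenized):
        for t in tokens:
            matrix[i][column[t]] += 1
    return vocab, matrix

def cosine(u, v):
    """Cosine of the angle between two document vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy documents
docs = ["text mining text", "data mining", "text analysis"]
vocab, m = term_document_matrix(docs)
```

Each row of `m` is one document vector, so any two documents can be compared with `cosine(m[i], m[j])` regardless of their length.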

Text Representation
• Recognize different file formats
• Avoid capitalization constraints and punctuation marks
• Delete non-alphanumeric characters
• Stop word elimination
• Term Frequency Thresholding and Zipf’s law
• Document Frequency Thresholding
• Entropy
• Mutual information
• Stemming
• Using thesaurus
• Latent Semantic Analysis
[Matrix: documents Doc 1 … Doc n (rows) by terms T1 … Tm (columns), with entries fij]
• Local component
• Global component
• Normalization component
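One common way to combine a local, a global and a normalization component is TF-IDF with unit-length vectors. The sketch below assumes that choice (raw term frequency, logarithmic inverse document frequency, L2 normalization) and a small illustrative stop word list; the slide does not say which variants are used, and `tfidf` and `STOP_WORDS` are hypothetical names.

```python
import math

STOP_WORDS = {"the", "is", "of", "a", "to", "and"}  # illustrative, not exhaustive

def tfidf(docs, stop_words=STOP_WORDS):
    """Return one {term: weight} dictionary per document."""
    # Lowercase, split on whitespace and drop stop words
    tokenized = [[t for t in d.lower().split() if t not in stop_words] for d in docs]
    n = len(tokenized)
    df = {}
    for tokens in tokenized:
        for t in set(tokens):
            df[t] = df.get(t, 0) + 1
    vectors = []
    for tokens in tokenized:
        weights = {}
        for t in tokens:
            weights[t] = weights.get(t, 0) + 1      # local component: term frequency
        for t in weights:
            weights[t] *= math.log(n / df[t])       # global component: inverse document frequency
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm:                                    # normalization component: unit length
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors

# Hypothetical toy documents
docs = ["the text is mined", "the data is mined", "text of the web"]
vectors = tfidf(docs)
```

In the second document "data" outweighs "mined", because "data" occurs in only one document of the collection while "mined" occurs in two.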

Tools
§ Apache Lucene
§ NLTK
§ Tika
§ SemanticVectors
§ Stanford CoreNLP
§ S-Space
§ Apache OpenNLP
§ LingPipe
§ FrameNet
§ Weka
§ UIMA
§ TextMiner
§ GATE
§ R
§ SOLR
§ RapidMiner
§ Portia
§ Knime

Resources
§ WordNet (RitaWordNet and JWNL)
§ EuroWordNet
§ SUMO (Suggested Upper Merged Ontology)
§ WordNet Domain
§ WordNet Affect
§ TreeTagger
§ SentiWordNet
§ General Inquirer

WordNet: synsets
COMPUTER
§ synset#1 {computer, computing machine, computing device, data processor, electronic computer, information processing system} (a machine for performing calculations automatically)
§ synset#2 {calculator, reckoner, figurer, estimator, computer} (an expert at calculation (or at operating calculating machines))

WordNet: Relation between terms and synsets
EAT
§ [IndexWord: [Lemma: eat] [POS: verb]]: take in solid food; “She was eating a banana”; “What did you eat for dinner last night?”
§ [IndexWord: [Lemma: eat] [POS: verb]]: eat a meal; take a meal; “We did not eat until 10 P.M. because there were so many phone calls”; “I didn’t eat yet, so I gladly accept your invitation”
§ [IndexWord: [Lemma: eat] [POS: verb]]: take in food; used of animals only; “This dog doesn’t eat certain kinds of meat”; “What do whales eat?”
§ [IndexWord: [Lemma: eat] [POS: verb]]: worry or cause anxiety in a persistent way; “What’s eating you?”
§ [IndexWord: [Lemma: eat] [POS: verb]]: use up (resources or materials); “this car consumes a lot of gas”; “We exhausted our savings”; “They run through 20 bottles of wine a week”
§ [IndexWord: [Lemma: eat] [POS: verb]]: cause to deteriorate due to the action of water, air, or an acid; “The acid corroded the metal”; “The steady dripping of water rusted the metal stopper in the sink”

WordNet: Relation between synsets
§ All POS
  • Synonymy
  • Antonymy
§ Nouns only
  • Hypernymy
  • Hyponymy
  • Meronymy
§ Verbs only
  • Troponymy
  • Entailment

WordNet: synsets related to COMPUTER synset#1
Hyponymy shows the relationship between a generic term (hypernym) and a specific instance of it (hyponym)
§ Hypernymy
  • {machine} (any mechanical or electrical device that transmits or modifies energy to perform or assist in the performance of human tasks)
§ Hyponymy
  • {analog computer, analogue computer} (a computer that represents information by variable quantities (e.g., positions or voltages))
§ Meronymy (part of, or a member of something)
  • {busbar, bus} (an electrical conductor that makes a common connection between several circuits; "the busbar in this computer can transmit data either way between any two components of the system")

WordNet Domain

TreeTagger
Input: The TreeTagger is easy to use.
Word          POS     Lemma
The           DT      the
TreeTagger    NP      TreeTagger
is            VBZ     be
easy          JJ      easy
to            TO      to
use           VB      use
.             SENT    .

Data Transforming into Business Intelligence
[Diagram, from Input to Output. Labels: data outside, data within enterprise, search engines, information retrieval, information extraction, information transformation, data cleaning, text mining, web mining, data mining, Data Warehouse, real-time analysis, predictions and forecast, supply chain management, Business Intelligence tools]

Scenarios where text mining can build BPM capabilities
Data sources:
• Customer reviews
• Documents
• Process descriptions
• Cloud services
• Process-related content
• Communication logs
Questions:
• How can an organization improve system functionality and user experience?
• What organizational values are incorporated in an organization?
• How can an organization innovate its process?
• What organizational values support innovation in a business area?
• Are we still operating according to the right strategy?
• How can new information systems help to support users’ needs?
• What are emerging topics in process and in business area that are relevant to an organization?

A Text Mining approach for integrating business process models and governing documents

Facilitating business process discovery using email analysis
§ Hypothesis: data sources that represent the communications facilitate the identification of the business processes
§ Idea: identify email message threads
§ Outcome: process fragment enactment models that can help process engineers
  • Validate their findings about the business processes
  • Better understand the vague and unclear parts of the processes

Predictive process monitoring framework that combines text mining with sequence classification techniques
Structured data often comes in conjunction with unstructured (textual) data such as emails or comments.
Call
{revenue: 34555; debt sum: 500}
{Please send a warning. 1234567: “Gave extension of 5 days and issued a warning about sending it to encashment. An encashment warning letter sent on the 06/10, 11:10 deadline.”}

Automated generation of business process models from natural language input

Semantics-based event log aggregation for process mining and analytics
Event log pre-processing techniques are needed that leverage semantic information for better alignment with the purpose of semi-automatically building, extending, and applying process ontologies.

Other applications
§ Generate natural language text from business process models
§ Support process model validation based on text generation
§ Use unstructured data from interviews and questionnaires for improving business processes
§ Detect inconsistencies between process models and textual descriptions
§ Detect non-uniformly specified process element names on the same process decomposition level
§ Detect naming convention violations, ambiguity and incomplete elements in process models

General schema for clustering, labeling and evaluating textual corpora

Document clustering based on Differential Betweenness
• Consider the structure and relationships between data
• Represent objects and their relationships in a graph
• Exploit the topology
[Figure: document-term matrix (Doc 1 … Doc n by T1 … Tm, entries fij) transformed into a document-document similarity matrix with entries Sim(i, j)]
Betweenness is an indicator of:
• who the most influential people in the network are
• who controls the flow of information between most others

Clustering based on Differential Betweenness
1. Obtain the similarity graph
2. Calculate the weighted differential betweenness matrix
3. Estimate the edges to be eliminated
4. Determine the cluster kernels by means of the extraction of the connected components
5. Classify remaining nodes
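The overall shape of this pipeline can be sketched in code, with one important substitution. The slides do not define the weighted differential betweenness, so the sketch below replaces the betweenness-based edge elimination with a plain similarity threshold. Only the graph, connected-component and leftover-assignment steps mirror the list above; `connected_components`, `cluster` and the threshold are hypothetical.

```python
def connected_components(nodes, edges):
    """Return the connected components (as sets) of an undirected graph."""
    adjacency = {v: set() for v in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for v in nodes:
        if v in seen:
            continue
        stack, component = [v], set()
        while stack:
            u = stack.pop()
            if u in component:
                continue
            component.add(u)
            stack.extend(adjacency[u] - component)
        seen |= component
        components.append(component)
    return components

def cluster(sim, threshold):
    """sim: symmetric matrix (list of lists) of pairwise document similarities."""
    n = len(sim)
    nodes = range(n)
    # ASSUMPTION: stand-in for betweenness-based elimination, drop weak edges instead
    kept = [(i, j) for i in nodes for j in range(i + 1, n) if sim[i][j] >= threshold]
    components = connected_components(nodes, kept)
    kernels = [c for c in components if len(c) > 1]   # cluster kernels
    if not kernels:
        return components
    leftovers = [next(iter(c)) for c in components if len(c) == 1]
    # Classify remaining nodes: attach each to the kernel with the highest mean similarity
    for v in leftovers:
        best = max(kernels, key=lambda k: sum(sim[v][u] for u in k) / len(k))
        best.add(v)
    return kernels

# Hypothetical similarity matrix for six documents
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.2, 0.3],
    [0.9, 1.0, 0.7, 0.1, 0.1, 0.2],
    [0.8, 0.7, 1.0, 0.2, 0.1, 0.2],
    [0.1, 0.1, 0.2, 1.0, 0.9, 0.5],
    [0.2, 0.1, 0.1, 0.9, 1.0, 0.4],
    [0.3, 0.2, 0.2, 0.5, 0.4, 1.0],
]
clusters = cluster(sim, threshold=0.6)
```

Document 5 has no edge above the threshold, so it is classified afterwards and joins the kernel {3, 4}, to which it is most similar on average.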

GARLucene: System for Managing Scientific Articles Retrieved Using Lucene

Which are more similar? (Document 1, Document 2, Document 3)
…
<Abstract> Term 1 Term 2 </Abstract>
<Keywords> Term 1 Term 2 </Keywords>
<Introduction> Term 2 Term 1 </Introduction>
…
<References> Term 2 Term 1 </References>

Methodology for clustering considering content and structure
• Desktop application: LucXML
• Web system: Scientific Solr

Schema for topic segmentation and detection
Input: textual corpora
1. Pre-process (output: tokens)
2. Identify textual units (output: textual units)
3. Represent textual units (vectors, graphs, probabilistic distribution)
4. Segment
5. Represent segments (vectors, graphs, probabilistic distribution)
6. Cluster segments (output: segment clusters, i.e. topics)
7. Label segment clusters (output: topics and corresponding labels)
Framework: OpinionTopicDetection
Desktop application: OpinionTD

RST disambiguation
New unsupervised semantic disambiguation algorithm based on clustering and Rough Set Theory
1. Eliminate stop words
2. Lemmatize terms
3. Find the senses of each term in WordNet
4. Cluster the senses of terms
5. Calculate the lower and upper approximation of each cluster of terms
6. Calculate the Rough F-measure of each cluster
7. Identify the best cluster considering number of terms, Rough F-measure value and WordNet index
8. Calculate rough membership measures for assigning senses to clusters
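The lower and upper approximations used in step 5 come from classical Rough Set Theory and can be sketched generically. The slides do not state how the indiscernibility relation over senses is built, nor how the Rough F-measure is defined, so the sketch below assumes an arbitrary partition of the universe into equivalence classes. It reports the classical rough accuracy |lower| / |upper| as a stand-in quality measure, and all names are hypothetical.

```python
def approximations(equivalence_classes, target):
    """Lower and upper approximation of `target` with respect to a partition."""
    target = set(target)
    lower, upper = set(), set()
    for block in equivalence_classes:
        block = set(block)
        if block <= target:
            lower |= block      # block lies entirely inside the target set
        if block & target:
            upper |= block      # block overlaps the target set
    return lower, upper

def accuracy(lower, upper):
    """Classical rough accuracy |lower| / |upper|; 1.0 means the set is crisp.
    ASSUMPTION: stand-in measure, not the Rough F-measure named on the slide."""
    return len(lower) / len(upper) if upper else 1.0

# Hypothetical partition of five objects and a target cluster
classes = [{"a", "b"}, {"c"}, {"d", "e"}]
lower, upper = approximations(classes, {"a", "b", "d"})
quality = accuracy(lower, upper)
```

Objects in the lower approximation certainly belong to the cluster, while those only in the upper approximation possibly belong to it.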

SemEval 2017
§ Semantic comparison for words and texts
  • Task 1: Semantic Textual Similarity
  • Task 2: Multilingual and Cross-lingual Semantic Word Similarity
  • Task 3: Community Question Answering
§ Parsing semantic structures
  • Task 9: Abstract Meaning Representation Parsing and Generation
  • Task 10: Extracting Keyphrases and Relations from Scientific Publications
  • Task 11: End-User Development using Natural Language

Thanks!
Questions, ideas, suggestions, comments, …
Prof. Leticia Arco García
leticiaarco@gmail.com