Introduction to Natural Language Processing and Text Mining
- Slides: 26
Introduction to Natural Language Processing and Text Mining, and the basic building blocks. Sudeshna Sarkar, Professor, Computer Science & Engineering Department, Indian Institute of Technology Kharagpur 1
What is speech and language processing? Computational Linguistics: deals with the modeling of natural language from a computational perspective. Natural Language Processing: processing information contained in natural language text / speech; getting computers to perform useful tasks involving human languages – enabling human-machine communication – improving human-human communication – doing stuff with language objects. Can machines understand human language? What does one mean by ‘understand’? Understanding is the ultimate goal. However, one doesn’t need to fully understand to be useful. 2
Natural Language Processing What is it? We’re going to study what goes into getting computers to perform useful and interesting tasks involving human languages. We will be secondarily concerned with the insights that such computational work gives us into human processing of language. 3
Importance of studying NLP. Language is a hallmark of human intelligence. Text is the largest repository of human knowledge and is growing quickly: emails, news articles, web pages, scientific articles, insurance claims, customer complaint letters, transcripts of phone calls, technical documents, government documents, patent portfolios, court decisions, contracts, … Are we reading any faster than before? How do we keep up? 4
Goals of NLP. Scientific goal: identify the computational machinery needed for an agent to exhibit various forms of linguistic behaviour. Engineering goal: design, implement, and test systems that process natural languages for practical applications. 5
Computer Speech and Language Processing. Goals can be very ambitious: true text understanding, good quality translation. Or goals can be practical: web search engines, question answering, machine translation services on the Web, speech synthesis, voice recognition, conversational agents, summarization. Natural language technology is not yet perfected, but it is still good enough for several useful applications. 6
Text Mining. Text mining is the process of deriving high quality information from text. It usually involves – structuring the input text – deriving patterns within the structured data – evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, generation of taxonomies, sentiment analysis, document summarization, and entity relation modeling. 7
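The three steps on this slide (structure the input, derive patterns, interpret the output) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the mini-corpus, the stopword list, and the function names `structure`, `document_frequency`, and `shared_terms` are all hypothetical and not part of the original slides, and real text mining systems use far richer representations.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; the documents are made up for illustration.
DOCS = [
    "The camera on the telescope was upgraded.",
    "Astronauts upgraded the telescope camera and the power unit.",
    "Interest rates were left unchanged.",
]

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "on", "was", "and", "were", "a", "of"}


def structure(text):
    """Step 1: turn raw text into a bag of lowercase content words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def document_frequency(bags):
    """Step 2: derive a simple pattern, the number of documents containing each term."""
    df = Counter()
    for bag in bags:
        df.update(bag.keys())
    return df


def shared_terms(df, min_docs=2):
    """Step 3: keep terms that recur across documents, for a human to interpret."""
    return sorted(term for term, n in df.items() if n >= min_docs)


bags = [structure(d) for d in DOCS]
df = document_frequency(bags)
print(shared_terms(df))  # ['camera', 'telescope', 'upgraded']
```

The recurring terms hint that the first two documents share a topic, which is the kind of signal that text clustering or categorization would build on.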
Big Applications. These kinds of applications require a tremendous amount of knowledge of language. Consider the following interaction with HAL, the computer from 2001: A Space Odyssey. Dave: Open the pod bay doors, Hal. HAL: I’m sorry Dave, I’m afraid I can’t do that. 8
What’s needed? Speech recognition and synthesis. Knowledge of the English words involved: what they mean; how they combine (bay vs. pod bay); how groups of words clump; what the clumps mean. Dialog: it is polite to respond, even if you’re planning to kill someone; it is polite to pretend to want to be cooperative (I’m afraid, I can’t…). 9
Real Example: What is the Fed’s current position on interest rates? What or who is the “Fed”? What does it mean for it to have a position? How does “current” modify that? 10
Caveat: NLP has an AI aspect to it. We’re often dealing with ill-defined problems. We don’t often come up with perfect solutions/algorithms. We can’t let either of those facts get in our way. 11
Preparation: basic algorithm and data structure analysis; ability to program; some exposure to logic; exposure to basic concepts in probability; interest in language. 12
Commercial World. Lots of exciting stuff going on… Some samples: machine translation, question answering, buzz analysis. 13
Google/Arabic 14
Web Q/A 15
Summarization Current web-based Q/A is limited to returning simple fact-like (factoid) answers (names, dates, places, etc). Multi-document summarization can be used to address more complex kinds of questions. Circa 2002: What’s going on with the Hubble? 16
NewsBlaster Example: The U.S. orbiter Columbia has touched down at the Kennedy Space Center after an 11-day mission to upgrade the Hubble observatory. The astronauts on Columbia gave the space telescope new solar wings, a better central power unit and the most advanced optical camera. The astronauts added an experimental refrigeration system that will revive a disabled infrared camera. "Unbelievable that we got everything we set out to do accomplished," shuttle commander Scott Altman said. Hubble is scheduled for one more servicing mission in 2004. 17
Weblog Analytics. Text mining of weblogs, discussion forums, user groups, and other forms of user-generated media. Applications: product marketing information, political opinion tracking, social network analysis, buzz analysis (what’s hot, what topics are people talking about right now). 18
Google/Arabic Translation 19
Forms of Natural Language. The input/output of an NLP system can be: written text (newspaper articles, letters, manuals, prose, …) or speech (read speech such as radio, TV, and dictations; conversational speech; commands; …). To process written text, we need lexical, syntactic, and semantic knowledge about the language, discourse information, and real world knowledge. To process spoken language, we additionally need speech recognition and speech synthesis. 20
Components of NLP. Natural Language Understanding: mapping the given input in the natural language into a useful representation. Different levels of analysis are required: morphological analysis, syntactic analysis, semantic analysis, discourse analysis, … Natural Language Generation: producing output in the natural language from some internal representation. Different levels of synthesis are required: – deep planning (what to say), – syntactic generation. Which is harder? 21
Natural language understanding: uncovering the mappings between the linear sequence of words (or phonemes) and the meaning that it encodes, and representing this meaning in a useful (usually symbolic) representation. By definition, it is heavily dependent on the target task: words and structures mean different things in different contexts, and the required target representation is different for different tasks. Why is NLU hard? The mapping between words, their linguistic structure and the meaning that they encode is extremely complex and difficult to model and decompose. Natural language is very ambiguous. The goal of understanding is itself task dependent and very complex. 22
Why is NL understanding hard? Natural language is extremely rich in form and structure, and very ambiguous. How to represent meaning? Which structures map to which meaning structures? Ambiguity: one input can mean many different things. Lexical (word level) ambiguity -- different meanings of words. Syntactic ambiguity -- different ways to parse the sentence. Interpreting partial information -- how to interpret pronouns. Contextual information -- context of the sentence may affect the meaning of that sentence. Many inputs can mean the same thing. Interaction among components of the input. Noisy input (e.g. speech). 23
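To make "one input can mean many different things" concrete, the sketch below counts how many word-sense combinations a short sentence has when only lexical ambiguity is considered. The sense inventory `SENSES` and the function `readings` are hypothetical, invented for this illustration; real systems draw senses from large lexical resources and must also handle syntactic and contextual ambiguity, which this toy ignores.

```python
from math import prod

# Hypothetical toy sense inventory; real systems use much larger lexical resources.
SENSES = {
    "bank": ["financial institution", "river edge"],
    "bay": ["body of water", "compartment"],
    "position": ["physical location", "stance or opinion", "job"],
}


def readings(sentence):
    """Count word-sense combinations, treating unlisted words as unambiguous."""
    words = sentence.lower().split()
    return prod(len(SENSES.get(w, [None])) for w in words)


print(readings("the bank took a position"))  # 2 * 3 = 6
```

Because the counts multiply, even a handful of ambiguous words gives a sentence many candidate readings, which is one reason disambiguation needs context.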
Linguistics Levels of Analysis. Phonology: sounds / letters / pronunciation. Concerns how words are related to the sounds that realize them. Morphology: the structure of words and the laws concerning the formation of new words from pieces (morphs). Syntax: how these sequences are structured, e.g., structures of sentences and the ways individual words are connected within them. Semantics: concerns what words mean and how these meanings combine in sentences to form sentence meaning. The study of context-independent meaning. 24
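As a rough illustration of the morphology level, the sketch below splits a word into a stem and a suffix by stripping the longest matching ending. It is a deliberately naive assumption-laden toy: the suffix list `SUFFIXES` and the function `split_morphs` are hypothetical, and a real morphological analyzer needs a lexicon and spelling rules.

```python
# Hypothetical suffix list; a real morphological analyzer needs a lexicon and spelling rules.
SUFFIXES = ["ings", "ing", "ed", "es", "s"]


def split_morphs(word, min_stem=3):
    """Split a word into (stem, suffix) by stripping the longest listed suffix."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        stem = word[: len(word) - len(suffix)]
        # Require a minimum stem length so short words like "is" are left whole.
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem, suffix
    return word, ""


for w in ["doors", "planning", "opened", "bay"]:
    print(w, split_morphs(w))
```

The result for "planning" is the stem "plann" rather than "plan", which shows why suffix stripping alone is not enough: consonant doubling and similar spelling changes have to be modeled explicitly.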
Linguistics Levels of Analysis. Pragmatics: concerns how sentences are used in different situations and how use affects the interpretation of the sentence. Discourse: concerns how the immediately preceding sentences affect the interpretation of the next sentence, for example, interpreting pronouns and interpreting the temporal aspects of the information. World Knowledge: includes general knowledge about the world, and what each language user must know about the other’s beliefs and goals. 25
Knowledge needed. Speech recognition and synthesis: dictionaries (how words are pronounced), phonetics (how to recognize/produce each sound of the language). Natural language understanding: knowledge of the natural language words involved – what they mean – how they combine; knowledge of syntactic structure; dialog and pragmatic knowledge. 26