Natural Language Processing for IR, ChengXiang Zhai

龙星计划课程: 信息检索 (Dragon Star Program Course: Information Retrieval)
Natural Language Processing for IR
ChengXiang Zhai (翟成祥)
Department of Computer Science; Graduate School of Library & Information Science; Institute for Genomic Biology; Department of Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]
Dragon Star Lecture at Beijing University, June 21-30, 2008

Elements of Text Info Management Technologies
(layered diagram, bottom to top:)
• Text → Natural Language Content Analysis
• Information Organization: Categorization, Extraction, Clustering
• Information Access (Search, Filtering) → Retrieval Applications: Visualization, Summarization
• Knowledge Acquisition (Mining) → Mining Applications

Overview of NLP for IR
• Most work has focused on improving the indexing unit:
– Syntactic phrases
– Word sense disambiguation
– Anaphora resolution
• Dependency parsing
• Sentiment retrieval

Improving the Indexing Unit
The following slides are taken/adapted from Jimmy Lin's presentation:
http://umiacs.umd.edu/~resnik/ling647_sp2006/slides/jimmy_ir_lecture.ppt

The Central Problem in IR
• Information seeker: Concepts → Query Terms
• Author: Concepts → Document Terms
• Do these represent the same concepts?

Why is IR hard?
• IR is hard because natural language is so rich (among other reasons)
• What are the issues?
– Tokenization
– Morphological variation
– Synonymy
– Polysemy
– Paraphrase
– Ambiguity
– Anaphora

Possible Solutions
• Vary the unit of indexing
– Strings and segments
– Tokens and words
– Phrases and entities
– Senses and concepts
• Manipulate queries and results
– Term expansion
– Post-processing of results

Tokenization
• What's a word?
– First try: words are separated by spaces
"The cat on the mat." → the, cat, on, the, mat
– What about clitics? "I'm not saying that I don't want John's input on this."
• What about languages without spaces?
天主教教宗若望保祿二世因感冒再度住進醫院。
→ 天主教 教宗 若望保祿二世 因 感冒 再度 住進 醫院 。
("Pope John Paul II of the Catholic Church was hospitalized again because of a cold.")

Word-Level Issues
• Morphological variation = different forms of the same concept
– Inflectional morphology: same part of speech (break, broken; sing, sang, sung; etc.)
– Derivational morphology: different parts of speech (destroy, destruction; invent, invention, reinvention; etc.)
• Synonymy = different words, same meaning: {dog, canine, doggy, puppy, etc.} → concept of dog
• Polysemy = same word, different meanings (e.g., "root")

Paraphrase
• Language provides different ways of saying the same thing
Who killed Abraham Lincoln?
(1) John Wilkes Booth killed Abraham Lincoln.
(2) John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln's life.
When did Wilt Chamberlain score 100 points?
(1) Wilt Chamberlain scored 100 points on March 2, 1962 against the New York Knicks.
(2) On December 8, 1961, Wilt Chamberlain scored 78 points in a triple-overtime game. It was a new NBA record, but Warriors coach Frank McGuire didn't expect it to last long, saying, "He'll get 100 points someday." McGuire's prediction came true just a few months later in a game against the New York Knicks on March 2.

Ambiguity
• What exactly do you mean?
– I saw the man on the hill with the telescope. (Who has the telescope?)
– Time flies like an arrow. (Say what?)
– Visiting relatives can be annoying. (Who's visiting?)
• Why don't we have problems (most of the time)?

Ambiguity in Action
• Different documents with the same keywords may have different meanings…
What do frogs eat? (keywords: frogs, eat)
(1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.
(2) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.
(3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.
What is the largest volcano in the Solar System? (keywords: largest, volcano, solar, system)
(1) Mars boasts many extreme geographic features; for example, Olympus Mons is the largest volcano in the solar system.
(2) The Galileo probe's mission to Jupiter, the largest planet in the Solar System, included amazing photographs of the volcanoes on Io, one of its four most famous moons.
(3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.

Anaphora
• Language provides different ways of referring to the same entity
Who killed Abraham Lincoln?
(1) John Wilkes Booth killed Abraham Lincoln.
(2) John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln's life.
When did Wilt Chamberlain score 100 points?
(1) Wilt Chamberlain scored 100 points on March 2, 1962 against the New York Knicks.
(2) On December 8, 1961, Wilt Chamberlain scored 78 points in a triple-overtime game. It was a new NBA record, but Warriors coach Frank McGuire didn't expect it to last long, saying, "He'll get 100 points someday." McGuire's prediction came true just a few months later in a game against the New York Knicks on March 2.

More Anaphora
• Terminology
– Anaphor = an expression that refers to another
– Anaphora = the phenomenon
• Other types of referring expressions:
– Fujitsu and NEC said they were still investigating, and that knowledge of more such bids could emerge… Other major Japanese computer companies contacted yesterday said they have never made such bids.
– The hotel recently went through a $200 million restoration… original artworks include an impressive collection of Greek statues in the lobby.
• Anaphora resolution can be hard!
– The city council denied the demonstrators a permit because… they feared violence. / they advocated violence.

What can we do?
• Here are some of the problems:
– Tokenization
– Morphological variation, synonymy, polysemy
– Paraphrase, ambiguity
– Anaphora
• General approaches:
– Vary the unit of indexing
– Manipulate queries and results

What do we index?
• In information retrieval, we are after the concepts represented in the documents
• … but we can only index strings
• So what's the best unit of indexing?

The Tokenization Problem
• In many languages, words are not separated by spaces…
• Tokenization = separating a string into "words"
• Simple greedy approach (sketched below):
– Start with a list of every possible term (e.g., from a dictionary)
– Look for the longest word in the unsegmented string
– Take the longest matching term as the next word and repeat
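
A minimal Python sketch of the greedy longest-match approach described above; the toy dictionary (drawn from the earlier Chinese example) and the max_len cap are illustrative assumptions, not part of the lecture.

```python
def greedy_segment(text, dictionary, max_len=8):
    """Greedy longest-match segmentation: repeatedly take the longest
    dictionary term that prefixes the remaining string."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, backing off one character at a time
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])  # unknown single chars pass through as-is
                i = j
                break
    return words

toy_dict = {"天主教", "教宗", "若望保祿二世", "因", "感冒", "再度", "住進", "醫院"}
print(greedy_segment("天主教教宗若望保祿二世因感冒再度住進醫院", toy_dict))
```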

Probabilistic Segmentation
• For an input string c1 c2 c3 … cn
• Try all possible partitions, e.g., (c1)(c2 c3 … cn), (c1 c2)(c3 … cn), (c1 c2 c3)(c4 … cn), …
• Choose the highest-probability partition, e.g., compute P(c1 c2 c3) using a language model (see the sketch below)
• Challenges: search, probability estimation
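
A sketch of this idea under a toy unigram word model (word_prob and the unknown-character penalty are invented for illustration): score every partition by its log-probability and keep the best, with memoization to keep the search tractable.

```python
from functools import lru_cache
import math

# Toy unigram "language model": word -> probability (assumed values)
word_prob = {"天主教": 1e-4, "教宗": 1e-4, "天主": 1e-5, "教": 1e-3, "宗": 1e-4}
UNK = 1e-9  # penalty probability for unknown single characters

def best_segmentation(text):
    """Choose the partition maximizing the sum of log word probabilities."""
    @lru_cache(maxsize=None)
    def best(i):
        if i == len(text):
            return 0.0, []
        candidates = []
        for j in range(i + 1, len(text) + 1):
            w = text[i:j]
            p = word_prob.get(w, UNK if len(w) == 1 else 0.0)
            if p > 0:
                score, rest = best(j)
                candidates.append((math.log(p) + score, [w] + rest))
        return max(candidates)
    return best(0)[1]

print(best_segmentation("天主教教宗"))  # -> ['天主教', '教宗']
```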

Indexing N-Grams
• Consider a Chinese document: c1 c2 c3 … cn
• Don't segment (you could be wrong!)
• Instead, treat every character bigram as a term: c1c2, c2c3, c3c4, …, cn-1cn (sketched below)
• Break up queries the same way
• Works at least as well as trying to segment correctly!
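
A tiny sketch of the bigram-indexing idea; the function name is ours.

```python
def char_bigrams(text):
    """Index every overlapping character bigram as a term."""
    chars = [c for c in text if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

print(char_bigrams("天主教教宗"))  # -> ['天主', '主教', '教教', '教宗']
```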

Morphological Variation
• Handling morphology: related concepts have different forms
– Inflectional morphology: same part of speech (dogs = dog + PLURAL; broke = break + PAST)
– Derivational morphology: different parts of speech (destruction = destroy + -ion; researcher = research + -er)
• Different morphological processes:
– Prefixing
– Suffixing
– Infixing
– Reduplication

Stemming
• Dealing with morphological variation: index stems instead of words
– Stem: a word equivalence class that preserves the central concept
• How much to stem? (see the stemmer run below)
– organization → organize → organ?
– resubmission → resubmit / submission → submit?
– reconstructionism → ?
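
For a concrete look at these borderline cases, here is a quick run of an off-the-shelf Porter stemmer via NLTK; the printed stems depend on Porter's rules rather than on any claim in the slides.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["organization", "organize", "organ",
             "resubmission", "submission", "reconstructionism"]:
    # Print each word alongside the stem Porter's rules produce
    print(word, "->", stemmer.stem(word))
```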

Does Stemming Work?
• Generally, yes! (in English)
– Helps more for longer queries
– Lots of work done in this area:
Donna Harman. (1991) How Effective is Suffixing? Journal of the American Society for Information Science, 42(1): 7-15.
Robert Krovetz. (1993) Viewing Morphology as an Inference Process. Proceedings of SIGIR 1993.
David A. Hull. (1996) Stemming Algorithms: A Case Study for Detailed Evaluation. Journal of the American Society for Information Science, 47(1): 70-84.
And others…

Stemming in Other Languages
• Arabic makes frequent use of infixes: from the root ktb come
maktab (office), kitaab (book), kutub (books), kataba (he wrote), naktubu (we write), etc.
• What's the most effective stemming strategy in Arabic? An open research question…

Words = wrong indexing unit!
• Synonymy = different words, same meaning: {dog, canine, doggy, puppy, etc.} → concept of dog
• Polysemy = same word, different meanings
– Bank: financial institution or side of a river?
– Crane: bird or construction equipment?
• It'd be nice if we could index concepts!
– Word sense: a coherent cluster in semantic space
– Indexing word senses achieves the effect of conceptual indexing

Indexing Word Senses
• How does indexing word senses solve the synonymy/polysemy problem?
– {dog, canine, doggy, puppy, etc.} → concept 112986
– "I deposited my check in the bank." → bank = concept 76529
– "I saw the sailboat from the bank." → bank = concept 53107
• Okay, so where do we get the word senses?
– WordNet
– Automatically find "clusters" of words that describe the same concepts
– Other methods have also been tried…

Word Sense Disambiguation
• Given a word in context, automatically determine its sense (concept)
– This is the Word Sense Disambiguation (WSD) problem
• Context is the key:
– For each ambiguous word, note the surrounding words: bank + {river, sailboat, water, etc.} → side of a river; bank + {check, money, account, etc.} → financial institution
– Learn a classifier from a collection of examples
– Use the classifier to determine the senses of words in the documents (see the sketch below)
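
A minimal sketch of this classify-by-context recipe using scikit-learn; the four training contexts and the two sense labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Each example: the words surrounding one occurrence of "bank"
contexts = [
    "deposited my check in the account money",
    "savings account interest money loan",
    "sailboat drifted toward the river water",
    "fishing on the muddy river water shore",
]
senses = ["financial", "financial", "river", "river"]

# Bag-of-context-words features + a simple naive Bayes sense classifier
wsd = make_pipeline(CountVectorizer(), MultinomialNB())
wsd.fit(contexts, senses)
print(wsd.predict(["opened a money account"]))        # -> ['financial']
print(wsd.predict(["walked along the river shore"]))  # -> ['river']
```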

Does it work?
• Nope!
Ellen M. Voorhees. (1993) Using WordNet to Disambiguate Word Senses for Text Retrieval. Proceedings of SIGIR 1993.
Mark Sanderson. (1994) Word-Sense Disambiguation and Information Retrieval. Proceedings of SIGIR 1994.
And others…
• Examples of limited success…
Hinrich Schütze and Jan O. Pedersen. (1995) Information Retrieval Based on Word Senses. Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval.
Rada Mihalcea and Dan Moldovan. (2000) Semantic Indexing Using WordNet Senses. Proceedings of the ACL 2000 Workshop on Recent Advances in NLP and IR.

Why Disambiguation Hurts
• Bag-of-words techniques already disambiguate
– Context for each term is established in the query
• WSD is hard!
– Many words are highly polysemous, e.g., "interest"
– Granularity of senses is often domain/application specific
• WSD tries to improve precision
– But incorrect sense assignments hurt recall
– Slight gains in precision do not offset large drops in recall

An Alternate Approach
• Indexing word senses "freezes" concepts at index time
• What if we expanded query terms at query time instead?
dog AND cat → (dog OR canine) AND (cat OR feline)
• Two approaches (see the sketch below):
– Manual thesaurus, e.g., WordNet, UMLS, etc.
– Automatically derived thesaurus, e.g., from co-occurrence statistics
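
A hedged sketch of manual-thesaurus expansion using WordNet through NLTK (assumes the wordnet corpus has been downloaded with nltk.download('wordnet')); expand_term and max_syns are our own names, not from the lecture.

```python
from nltk.corpus import wordnet as wn

def expand_term(term, max_syns=3):
    """Collect a few WordNet synonyms for a term."""
    syns = {lemma.name().replace("_", " ")
            for synset in wn.synsets(term)
            for lemma in synset.lemmas()}
    syns.discard(term)
    return [term] + sorted(syns)[:max_syns]

query = ["dog", "cat"]
expanded = " AND ".join("(" + " OR ".join(expand_term(t)) + ")" for t in query)
print(expanded)  # e.g., (dog OR Canis familiaris OR ...) AND (cat OR ...)
```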

Does it work?
• Yes… if done "carefully"
• The user should be involved in the process
– Otherwise, a poor choice of terms can hurt performance

Handling Anaphora
• Anaphora resolution: finding what the anaphor refers to (i.e., the antecedent)
"John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln's life." → He = John Wilkes Booth
• Most common example: pronominal anaphora resolution
– The simplest method works pretty well: find the previous noun phrase matching in gender and number (see the sketch below)
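
A toy rendering of that "previous noun phrase matching in gender and number" heuristic; the mini pronoun table and the pre-extracted entity list (which a real system would get from a parser) are assumptions made for illustration.

```python
PRONOUNS = {"he": ("masc", "sg"), "she": ("fem", "sg"),
            "it": ("neut", "sg"), "they": (None, "pl")}

def resolve(pronoun, preceding_entities):
    """Scan preceding entities right-to-left for a gender/number match."""
    gender, number = PRONOUNS[pronoun.lower()]
    for entity, (g, n) in reversed(preceding_entities):
        if (gender in (None, g)) and number == n:
            return entity
    return None

entities = [("John Wilkes Booth", ("masc", "sg")),
            ("a bullet", ("neut", "sg"))]
print(resolve("He", entities))  # -> 'John Wilkes Booth'
```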

Expanding Anaphors
• When indexing, replace anaphors with their antecedents
• Does it work?
– Somewhat…
– … but it can be computationally expensive
– … it helps more if you want to retrieve sub-document segments

Beyond Word-Level Indexing
• Words are the wrong unit to index…
• Many multi-word combinations identify entities
– Persons: George W. Bush, Dr. Jones
– Organizations: Red Cross, United Way
– Corporations: Hewlett Packard, Kraft Foods
– Locations: Easter Island, New York City
• Entities often have finer-grained structure:
– Professor Stephen W. Hawking = title + first name + middle initial + last name
– Cambridge, Massachusetts = city + state

Indexing Named Entities
• Why would we want to index named entities?
• Index named entities as special tokens:
"In reality, at the time of Edison's 1879 patent, the light bulb had been in existence for some five decades…" (Edison → PERSON, 1879 → DATE)
• And treat special tokens like query terms:
– Who patented the light bulb? → patent, light, bulb, PERSON
– When was the light bulb patented? → patent, light, bulb, DATE
• Works pretty well for question answering, but not evaluated for regular topic retrieval (?)
John Prager, Eric Brown, and Anni Coden. (2000) Question-Answering by Predictive Annotation. Proceedings of SIGIR 2000.

Indexing Phrases
• Motivation: "bank terminology" vs. "terminology bank"
• Two types of phrases
– Those that make sense, e.g., "school bus", "hot dog"
– Those that don't, e.g., bigrams in Chinese
• Treat multi-word tokens as index terms
• Three sources of evidence:
– Dictionary lookup
– Linguistic analysis
– Statistical analysis (e.g., co-occurrence)

Known Phrases
• Compile a term list that includes phrases
– Technical terminology can be very helpful
• Index any phrase that occurs in the list
• Most effective in a limited domain
– Otherwise it is hard to capture the most useful phrases

Syntactic Phrases
• Parsing = automatically assigning structure to a sentence, e.g.:
[Sentence [Noun Phrase Det Adj Adj Noun: "The quick brown fox"] [Verb "jumped"] [Prepositional Phrase Prep "over" [Noun Phrase Det Adj Adj Noun: "the lazy black dog"]]]
• "Walk" the tree and extract phrases
– Index all noun phrases
– Index subjects and verbs
– Index verbs and objects
– etc.

Syntactic Variations
• What does linguistic analysis buy?
– Coordinations: "lung and breast cancer" → lung cancer, breast cancer
– Substitutions: "inflammatory sinonasal disease" → inflammatory disease, sinonasal disease
– Permutations: "addition of calcium" → calcium addition

Statistical Analysis
• Automatically discover phrases based on co-occurrence probabilities: is P("kick the bucket") = P("kick") P("the") P("bucket")?
• If the terms are not independent, they may form a phrase
• Use this method to automatically learn a phrase dictionary (see the sketch below)
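
A minimal sketch of that independence test, scored as pointwise mutual information, log P(xy) / (P(x) P(y)); the toy corpus and the minimum-count filter are invented for illustration.

```python
import math
from collections import Counter

corpus = ("the school bus stopped . the hot dog stand sold a hot dog . "
          "the dog ran to the school . the bus was late .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n = len(corpus)

def pmi(w1, w2):
    """Pointwise mutual information of a bigram vs. its independent parts."""
    p_xy = bigrams[(w1, w2)] / (n - 1)
    p_x, p_y = unigrams[w1] / n, unigrams[w2] / n
    return math.log(p_xy / (p_x * p_y))

# Keep only bigrams seen at least twice, ranked by PMI
phrases = sorted(((pmi(a, b), a, b) for (a, b) in bigrams
                  if bigrams[(a, b)] >= 2), reverse=True)
for score, a, b in phrases[:3]:
    print(f"{a} {b}: PMI = {score:.2f}")
```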

Does Phrasal Indexing Work?
• Yes…
• But the gains are so small they're not worth the cost
• Primary drawback: too slow!

Sample Phrase-Indexing Results [Zhai 97]
• Using a single type of phrase to supplement single words helps
• But using all phrases to supplement single words doesn't help as much
(results table on the original slide)

What about ambiguity?
• Different documents with the same keywords may have different meanings… (the same frogs/volcano examples from the "Ambiguity in Action" slide apply here)

Indexing Relations
• Instead of terms, index syntactic relations between entities in the text
"Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders."
<frogs subject-of eat>, <insects object-of eat>, <animals object-of eat>, <adult modifies frogs>, <small modifies animals>
"Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds."
<alligators subject-of eat>, <kinds object-of animals>, <small modifies animals>
• From the relations, it is clear who's eating whom!

Are syntactic relations enough?
• Consider this example:
"John broke the window." / "The window broke."
<John subject-of break>, <window subject-of break>
"John" and "window" are both subjects… but John is the person doing the breaking (the "agent"), and the window is the thing being broken (the "theme")
• Syntax sometimes isn't enough… we need semantics (or meaning)!
• Semantics, for example, allows us to relate the following two fragments:
"The barbarians destroyed the city…" / "The destruction of the city by the barbarians…"
event: destroy; agent: barbarians; theme: city

Semantic Roles
• Semantic roles are invariant with respect to syntactic expression:
"Mary loaded the truck with hay." / "Hay was loaded onto the truck by Mary."
event: load; agent: Mary; material: hay; destination: truck
• The idea:
– Identify semantic roles
– Index "frame structures" with filled slots
– Retrieve answers based on semantic-level matching

Does it work?
• No, not really…
• Why not?
– Syntactic and semantic analysis is difficult: errors offset whatever gain is obtained
– As with WSD, these techniques are precision enhancers… recall usually takes a dive
– It's slow!

Alternative Approach
• Sophisticated linguistic analysis is slow!
– Unnecessary processing can be avoided by query-time analysis
• Two-stage retrieval
– Use standard document retrieval techniques to fetch a candidate set of documents
– Use passage retrieval techniques to choose a few promising passages (e.g., paragraphs)
– Apply sophisticated linguistic techniques to pinpoint the answer
• Passage retrieval
– Find "good" passages within documents
– Key idea: locate areas where lots of query terms appear close together

Summary
• IR is hard because language is rich and complex (among other reasons)
• Two general approaches to the problem
– Attempt to find the best unit of indexing
– Try to fix things at query time
• It is hard to predict a priori which techniques will work
– Questions must be answered experimentally
• Words are really the wrong thing to index
– But there isn't really a better alternative…

What You Should Know
• Lots of effort has been made to improve indexing units
• Most results, however, are not promising

Dependence Language Model for Information Retrieval
Jianfeng Gao, Jian-Yun Nie, Guangyuan Wu, and Guihong Cao. Dependence Language Model for Information Retrieval. SIGIR 2004.

References
• Ciprian Chelba, David Engle, et al. Structure and Performance of a Dependency Language Model. Eurospeech 1997.
• Daniel D. K. Sleator and Davy Temperley. Parsing English with a Link Grammar. Technical Report CMU-CS-91-196, 1991.

Why do we use the independence assumption?
• The independence assumption is one of the assumptions widely adopted in probabilistic retrieval theory.
• Why?
– It makes retrieval models simpler.
– It makes retrieval operations tractable.
• The shortcoming of the independence assumption:
– It does not hold in textual data.

Latest ideas on the dependence assumption
• Bigram
– Some language modeling approaches try to incorporate word dependence by using bigrams.
– Shortcomings:
• Word dependencies exist not only between adjacent words but also at longer distances.
• Some adjacent words are not actually connected.
– The bigram language model has shown only marginally better effectiveness than the unigram model.
• Bi-term
– The bi-term language model is similar to the bigram model, except that the constraint on term order is relaxed.
– "information retrieval" and "retrieval of information" will be assigned the same probability of generating the query.

Structure and Performance of a Dependency Language Model

Introduction
• This paper presents a maximum entropy language model that incorporates both syntax and semantics via a dependency grammar.
• Dependency grammar: expresses the relations between words by a directed graph, which can incorporate the predictive power of words that lie outside of bigram or trigram range.

Introduction
• Why do we use N-grams?
– Modeling each word's full history directly would require storing a number of independent parameters exponential in the history length; an N-gram keeps only the previous N-1 words.
• The drawback of N-grams:
– An N-gram blindly discards relevant words that lie N or more positions in the past.

Structure of the model
(equations on the original slide)

Structure of the model
• Develop an expression for the joint probability P(W, K), where K is the set of linkages in the sentence.
• Then P(W) = Σ_K P(W, K).
• Assume that the sum is dominated by a single term: P(W) ≈ max_K P(W, K).

A dependency language model for IR
• For a query Q = q1 … qm that we want to rank:
– Previous work assumes independence between query terms: P(Q|D) = Π_i P(q_i|D)
– New work: assume that the term dependencies in a query form a linkage L

A dependency language model for IR
• Assume that the sum over all possible linkages L is dominated by a single term.
• Assume that each term depends on exactly one related query term generated previously.

A dependency language model for IR
(derivation equations on the original slide)

A dependency language model for IR
• Assume that the generation of a single term is independent of L.
• By this assumption, we would have arrived at the same result by starting from any term, so L can be represented as an undirected graph.

A dependency language model for IR
• Taking the log gives the final ranking formula (a reconstruction follows below).
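
The equations on this and the preceding slides were images and did not survive extraction. As a hedged reconstruction from Gao et al. (SIGIR 2004), the log form of the ranking formula is, up to the paper's exact notation:

$$\log P(Q \mid D) \;\approx\; \log P(L \mid D) \;+\; \sum_{i} \log P(q_i \mid D) \;+\; \sum_{(i,j) \in L} \log \frac{P(q_i, q_j \mid D)}{P(q_i \mid D)\, P(q_j \mid D)}$$

that is, a unigram language-model score plus a mutual-information term for every pair of query terms linked in L.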

Parameter Estimation
• Estimating P(l | q_i, q_j):
– Assume that the linkages are independent.
– Then count the relative frequency of a link l between q_i and q_j given that they appear in the same sentence: the number of training-data sentences in which q_i and q_j have a link, turned into a score (the link frequency of q_i and q_j).

Parameter Estimation
• Assumption: (equation on the original slide)

Parameter Estimation
• Estimating P(q | D):
– The document language model is smoothed with a Dirichlet prior (a constant discount toward the collection model; see the formula below).
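
The smoothing formula itself was an image on the slide; the standard Dirichlet-prior estimate it refers to is

$$P(q \mid D) = \frac{c(q; D) + \mu\, P(q \mid C)}{|D| + \mu}$$

where c(q; D) is the count of q in D, |D| is the document length, P(q | C) is the collection language model, and μ is the Dirichlet prior parameter.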

Parameter Estimation
• Estimating P(q_i, q_j | D): (equation on the original slide)

Experimental Setting
• Documents were stemmed and stop words were removed.
• Queries are TREC topics 202 to 250, on TREC disks 2 and 3.

The flow of the experiment
(flow diagram, roughly:)
• Training data: count link frequencies for weight computation
• Query: find the linkage of the query; pick the best L by max_L P(L|Q)
• Document: count frequencies to get P(L|D) and the term probabilities
• Combine the scores to rank documents

Results – BM & UG
• BM: binary independence retrieval
• UG: unigram language model approach
• UG achieves performance similar to, or worse than, that of BM.

Results – DM
• DM: dependency model
• The improvement of DM over UG is statistically significant.

Results – BG
• BG: bigram language model
• BG is slightly worse than DM on five out of six TREC collections, but substantially outperforms UG on all collections.

Results – BT1 & BT2
• BT: bi-term language model
(results table on the original slide)

Conclusion
• This paper introduces the linkage of a query as a hidden variable.
• Each term is generated in turn, depending on other related terms, according to the linkage.
– This approach covers several language modeling approaches as special cases.
• The experiments in this paper show substantial improvements over the unigram, bigram, and classical probabilistic retrieval models.

Opinion Retrieval from Blogs
Wei Zhang (1), Clement Yu (1), Weiyi Meng (2)
[email protected], [email protected], [email protected]
(1) Department of Computer Science, University of Illinois at Chicago
(2) Department of Computer Science, Binghamton University
CIKM 2007

Outline
• Overview of opinion retrieval
• Topic retrieval
• Opinion identification
• Ranking documents by opinion similarity
• Experimental results

Overview of Opinion Retrieval
• Opinion retrieval: given a query, find documents that have subjective opinions about the query
– Query: "book"
– Relevant: "This is a very good book."
– Irrelevant: "This book has 123 pages."

Overview of Opinion Retrieval
• Introduced at the TREC 2006 Blog Track
– 14 groups, 57 submitted runs in TREC 2006
– 20 groups, 104 runs in TREC 2007 (ongoing)
• Key problems
– Opinion features
– Query-related opinions
– Ranking the retrieved documents

Our Algorithm
Document set + Query → Retrieved documents → Opinionative documents → Query-related opinionative documents

Topic Retrieval
• Retrieve query-relevant documents
– No opinion involved
• Features
– Phrase recognition
– Query expansion
– Two document-query similarities

Topic Retrieval – Phrase Recognition
• Captures semantic relationships among the words
• Used for phrase-similarity calculation
• Four types:
– Proper noun: "University of Lisbon"
– Dictionary phrase: "computer science"
– Simple phrase: "white car"
– Complex phrase: "small white car"

Topic Retrieval – Query Expansion
• Find synonyms
– "wto" → "world trade organization" (treated with the same importance)
• Add additional terms
– "wto" → negotiate, agreements, tariffs, …

Topic Retrieval – Similarity
• Sim(Query, Doc) = <Sim_P, Sim_T>
• Phrase similarity
– Based on having or not having a phrase
– Sim_P = sum of idf(P_i) over matched phrases
• Term similarity
– Sum of the Okapi scores of all the query terms
• Document ranking (see the sketch below)
– D1 is ranked higher than D2 if (Sim_P1 > Sim_P2) OR (Sim_P1 = Sim_P2 AND Sim_T1 > Sim_T2)
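
The two-level ranking rule maps directly onto lexicographic tuple sorting; a small sketch with invented scores:

```python
# Compare documents by phrase similarity first, breaking ties with
# term (Okapi) similarity; sorting on the (Sim_P, Sim_T) tuple does this.
docs = [
    {"id": "D1", "sim_p": 7.2, "sim_t": 11.0},
    {"id": "D2", "sim_p": 7.2, "sim_t": 14.5},
    {"id": "D3", "sim_p": 9.1, "sim_t": 3.2},
]
ranked = sorted(docs, key=lambda d: (d["sim_p"], d["sim_t"]), reverse=True)
print([d["id"] for d in ranked])  # -> ['D3', 'D2', 'D1']
```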

Opinion Identification
(pipeline diagram:)
• Subjective training data + objective training data → feature selection → SVM classifier
• Retrieved documents (from topic retrieval) → SVM classifier → opinionative documents (to opinion ranking)

Opinion Identification – Training Data
• Subjective training data
– Review web sites
– Documents having opinionative phrases
• Objective training data
– Dictionary entries
– Documents not having opinionative phrases

Opinion Identification – Feature Selection
• Select the words that express opinions
• Pearson's chi-square test
– Tests the independence between the subjectivity label and words via a contingency table
– Counts the number of sentences
• Unigrams and bigrams (see the sketch below)
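
A hedged sketch of chi-square feature selection over unigrams and bigrams using scikit-learn (the paper's own counting is sentence-based; here chi2 scores features against the subjectivity label); the four labeled sentences are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

sentences = ["this is a very good book", "what an awful terrible film",
             "this book has 123 pages", "the film was released in 1999"]
labels = [1, 1, 0, 0]  # 1 = subjective, 0 = objective

vec = CountVectorizer(ngram_range=(1, 2))  # unigram and bigram features
X = vec.fit_transform(sentences)
scores, _ = chi2(X, labels)  # chi-square statistic per feature

# Show the features most associated with the subjectivity label
top = sorted(zip(scores, vec.get_feature_names_out()), reverse=True)[:5]
for score, feat in top:
    print(f"{feat}: {score:.2f}")
```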

Opinion Identification – Classifier
• Train a support vector machine (SVM) classifier (see the sketch below):
Subjective sentences + objective sentences → feature-vector representation → training → SVM classifier
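
A minimal sketch of the training step, with scikit-learn's LinearSVC standing in for the paper's SVM; the tiny training sets are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

subjective = ["this is a very good book", "what an awful terrible film"]
objective = ["this book has 123 pages", "the film was released in 1999"]

# Feature-vector representation (unigrams + bigrams) feeding an SVM
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(subjective + objective, ["subjective"] * 2 + ["objective"] * 2)

# Apply sentence by sentence, as on the next slide
for s in ["an amazing read", "the book weighs 300 grams"]:
    print(s, "->", clf.predict([s])[0])
```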

Opinion Identification – Classifier
• Apply the SVM classifier sentence by sentence:
Document → Sentence 1, Sentence 2, …, Sentence n → SVM classifier → Label 1: objective, Label 2: subjective, …, Label n: objective

Opinion Similarity – Query-Related Opinions
• Find the query-related opinions
(diagram: an opinionative sentence counts as query-related when it falls within a text window around a query occurrence in the document)

Opinion Similarity – Similarity 1
• Assumption 1: higher topic relevance → higher rank
• OSim_ir = Sim(Query, Doc)

Opinion Similarity – Similarity 2
• Assumption 2: more query-related opinions → higher rank
• OSim_stcc: total number of opinionative sentences
• OSim_stcs: total score of opinionative sentences

Opinion Similarity – Similarity 3
• A linear combination of Similarity 1 and Similarity 2:
– a * OSim_ir + (1 - a) * OSim_stcc
– b * OSim_ir + (1 - b) * OSim_stcs

Opinion Similarity – Experimental Results
• TREC 2006 Blog Track data
– 50 queries, 3.2 million blog documents
• UIC at the TREC 2006 Blog Track
– Title-only queries: scored first
– 28%-32% higher than the best TREC 2006 scores
• Lessons learned
– More training data helps
– So does the combined similarity function

Conclusions
• Designed and implemented an opinion retrieval system: IR + text classification for opinion retrieval
• The best known retrieval effectiveness on the TREC 2006 blog data
• Extending to polarity classification: positive/negative/mixed
• Planning to improve feature selection

What You Should Know
• Language models offer more opportunities for studying the weighting of phrase indexing (including the weights on phrases).
• Opinion finding is a hot area now. A major challenge is how to combine a sentiment score with a topic score.

Future Research Directions in NLP for IR
• It's time to revisit many old issues:
– We have better NLP techniques now
– We have better probabilistic retrieval models now
– Possible topics: phrase indexing, word sense disambiguation, TF-effect of NLP, …
• Language models seem to offer better opportunities for adopting NLP results (e.g., dependency LMs)
• Sentiment retrieval
• NLP for helping difficult topics

Roadmap
• This lecture: NLP for IR (understanding documents)
• Next lecture: Topic models for text mining