CSA 3180 Natural Language Processing Information Extraction 2

CSA 3180: Natural Language Processing
Information Extraction 2
• Named Entities
• Question Answering
• Anaphora Resolution
• Co-Reference
December 2005, CSA 3180: Information Extraction II

Introduction
• Slides partially based on a talk by Lucian Vlad Lita
• Sheffield GATE Multilingual Extraction slides based on Diana Maynard’s talks
• Anaphora resolution slides based on Dan Cristea’s slides, with additional input from Gabriela-Eugenia Dima, Oana Postolache and Georgiana Puşcaşu

References
• Fastus System Documentation
• Robert Gaizauskas, “IE Perspective on Text Mining”
• Daniel Bikel, “Nymble: A High Performance Learning Name Finder”
• Helena Ahonen-Myka’s notes on FSTs
• Javelin system documentation
• MUC 7 Overview & Results

Named Entities
• Person Name: Colin Powell, Frodo
• Location Name: Middle East, Aiur
• Organization: UN, DARPA
• Domain Specific vs. Open Domain

Anaphora Resolution
[Diagram labels: unprocessed text; annotation tool; AR gold standard; AR engine; AR annotated text; comparison & evaluation; fine-tuning]

Anaphora Resolution
• Text:
  – Nature of discourse
  – Anaphoric phenomena
• Anaphora Resolution Engines:
  – Models
  – General AR Frameworks
  – Knowledge Sources

Anaphora Resolution
Anaphora represents the relation between a “proform” (called an “anaphor”) and another term (called an “antecedent”), when the interpretation of the anaphor is in a certain way determined by the interpretation of the antecedent.
Barbara Lust, Introduction to Studies in the Acquisition of Anaphora, D. Reidel, 1986

Anaphora Example
It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.
Orwell, 1984
[Slide callouts: antecedent, anaphor (each marked twice in the passage)]

Anaphora
• pronouns (personal, demonstrative, ...)
  – full pronouns
  – clitics (RO: dă-mi-l, IT: dammelo)
• nouns
  – definite
  – indefinite
• adjectives, numerals (generally associated with an ellipsis)
  – In this the play is expressionist_1 in its approach to theme.
  – But it is also so_1 in its use of unfamiliar devices...

Referential Expressions
• mark the noun phrases
• for each NP ask a question about it
• keep as REs those NPs that can be naturally referenced in the question
Example: The policeman got in the car in a hurry in order to catch the run-away thief.

Referential Expressions
a. John was going down the street looking for Bill’s house.
b. He found it at the first corner.

Referential Expressions
a. John was going down the street looking for Bill’s house.
b. He met him at the first corner.

Referential Expressions
The empty anaphor
Gianni diede una mela a Michele. Più tardi, gli diede un’arancia. [Not & Zancanaro, 1996]
John gave an apple to Michelle. Later on, ∅ gave her an orange.

Textual Ellipsis
The functional (bridge) anaphora
The state of the accumulator is indicated to the user. 30 minutes before the complete uncharge, the computer signals for 5 seconds. [Strube & Hahn, 1996]

Events, States, Descriptions
He left without eating_1. Because of this_1, he was starving in the evening.
But, he adds, Priestley is more interested in Johnson living than in Johnson dead_1. In this_1 the play is expressionist in its approach to theme. [Halliday & Hasan, 1976]

Definite/Indefinite NPs
Once upon a time, there was a king and a queen. And the king one day went hunting.
Apollo took out his bow...
Take the elevator to the 4th floor.

Anaphora Resolution
• State of the art in Anaphora Resolution:
  – Identity: 65-80%
  – Other: much less…

What is so difficult?
Nothing – everything is so simple!
John_1 has just arrived. He_1 seems tired.
The girl_1 leaves the trash on the table and wants to go away. The boy_2 tries to hold her_1 by the arm_3,1; she_1 escapes and runs; he_2 calls her_1 back.
Caragiale, At the Mansion

What is so difficult?
Nothing indeed, but imagine letting the machine go wrong...
There’s a pile of inflammable trash next to your car. You’ll have to get rid of it.
If the baby does not thrive on the raw milk, boil it.
[Hobbs, 1997]

What is so difficult?
Semantic restrictions
Jeff_1 helped Dick_2 wash the car. He_1 washed the windows as Dick_2 waxed the car. He_1 soaped a pane.
Jeff_1 helped Dick_2 wash the car. He_1 washed the windows as Dick_2 waxed the car. He_2 buffed the hood.
[Walker, Joshi & Prince, 1997]

What is so difficult?
Semantic correlates
An elephant_1 hit the car with the trunk. The animal_1 had to be taken away not to produce other damages.
* An animal_1 hit the car with the trunk. The elephant_1 had to be taken away not to produce other damages.

What is so difficult?
Long distance recovery (pronominalization)
1. His re-entry into Hollywood came with the movie “Brainstorm”,
2. but its completion and release has been delayed by the death of co-star Natalie Wood.
3. He plays Hugh Hefner of Playboy magazine in Bob Fosse’s “Star 80.”
4. It’s about Dorothy Stratton, the Playboy Playmate who was killed by her husband.
5. He also stars in the movie “Class.”
Los Angeles Times, July 18, 1983, cited in [Fox, 1986]

What is so difficult?
Gender mismatches
Mr. Chairman..., what is her position upon this issue? (political correctness!!)
Number mismatches
The government discussed... They...

What is so difficult?
Distributed antecedents
John_1 invited Mary_2 to the cinema. After the movie ended they_3={1,2} went to a restaurant.

What is so difficult?
Empty/non-empty anaphors
John gave an apple to Michelle. Later on, ∅ gave her an orange.
John gave an apple to Michelle. Later on, he gave her an orange.
John gave an apple to Michelle. Later on, this one asks him for an orange.

Semantics are Essential
Police... They
Teacher... She/He
A car... The automobile
A Mercedes... The car
A lamp... The bulb

Semantics are not all
• Pronouns - poor semantic features
  he    [+animate, +male, +singular]
  she   [+animate, +female, +singular]
  it    [+inanimate, +singular]
  they  [+plural]
• Gender in Romance languages
  Ro. maşină = ea (feminine)
  Ro. automobil = el (masculine)
• Anaphora resolution by concord rules
  Un camion a heurté une voiture. Celle-ci a été complètement détruite.
  (A truck hit a car. It was completely destroyed.)
  Celle-ci / une voiture: gender match! Celle-ci / un camion: gender mismatch!
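The concord rule in the French example can be sketched as a simple agreement check. This is only an illustration: the `Mention` class, its fields and the function names are hypothetical, and the gender and number values are entered by hand rather than produced by a real morphological analyser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A referential expression with hand-assigned morphological features."""
    text: str
    gender: str  # grammatical gender: "masc" or "fem"
    number: str  # "sg" or "pl"


def agrees(anaphor: Mention, candidate: Mention) -> bool:
    """Concord rule: grammatical gender and number must both match."""
    return anaphor.gender == candidate.gender and anaphor.number == candidate.number


def concord_filter(anaphor: Mention, candidates: list) -> list:
    """Keep only the candidates that agree with the anaphor."""
    return [c for c in candidates if agrees(anaphor, c)]


camion = Mention("un camion", "masc", "sg")
voiture = Mention("une voiture", "fem", "sg")
celle_ci = Mention("Celle-ci", "fem", "sg")

survivors = concord_filter(celle_ci, [camion, voiture])
print([m.text for m in survivors])
```

In English the pronoun "it" would leave both "truck" and "car" as candidates, which is why grammatical gender helps here.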

Anaphora Resolution
[Charniak, 1972]
In order to do AR, one has to be able to do everything else. Once everything else is done, AR comes for free.

Anaphora Resolution
Most current anaphora resolution systems implement a pipeline architecture with three modules:
• Collect: determines the List of Potential Antecedents (LPA).
• Filter: eliminates from the LPA the referees that are incompatible with the referential expression under scrutiny.
• Preference: determines the most likely antecedent on the basis of an ordering policy.
[Diagram: Referential expressions → Collect → Filter → Preference, with candidate lists a1, a2, a3, … an passed between the stages]
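A minimal sketch of the Collect / Filter / Preference pipeline follows. The mention dictionaries, the three-sentence window, the agreement test and the "most recent wins" ordering policy are all assumptions chosen for illustration; real systems plug in much richer components at each stage.

```python
def collect(mentions, anaphor_index, window=3):
    """Collect: mentions that precede the anaphor within `window` sentences."""
    anaphor = mentions[anaphor_index]
    return [m for m in mentions[:anaphor_index]
            if anaphor["sent"] - m["sent"] <= window]


def filter_candidates(anaphor, lpa):
    """Filter: drop candidates whose gender or number clashes with the anaphor."""
    return [c for c in lpa
            if all(c.get(k) == anaphor.get(k) for k in ("gender", "number"))]


def prefer(lpa):
    """Preference: one simple ordering policy, the most recent candidate wins."""
    return max(lpa, key=lambda c: c["pos"], default=None)


def resolve(mentions, anaphor_index):
    """Run the three stages in order and return the chosen antecedent (or None)."""
    anaphor = mentions[anaphor_index]
    lpa = collect(mentions, anaphor_index)
    return prefer(filter_candidates(anaphor, lpa))


# "John invited Mary to the cinema. He ..."
mentions = [
    {"text": "John", "sent": 0, "pos": 0, "gender": "masc", "number": "sg"},
    {"text": "Mary", "sent": 0, "pos": 1, "gender": "fem", "number": "sg"},
    {"text": "He", "sent": 1, "pos": 2, "gender": "masc", "number": "sg"},
]
antecedent = resolve(mentions, 2)
print(antecedent["text"])
```

Recency alone would pick "Mary", so the example shows why filtering has to run before the preference step.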

Anaphora Resolution Models
• [Hobbs, 1976] (pronominal anaphora)
Naïve algorithm:
- requires a surface parse tree
- navigates the syntactic tree of the anaphor’s sentence and those of the preceding sentences in order of recency, each tree in a left-to-right, breadth-first manner
A semantic approach:
- requires a semantic representation of the sentences (logical expressions)
- a collection of semantic operations (inferences)
- the type of pronoun is important
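The tree traversal used by the naïve algorithm can be sketched as below. This is a deliberate simplification, not Hobbs's full procedure: it only covers the search through preceding sentences (most recent first, each tree left-to-right and breadth-first) and leaves out the syntactic constraints applied inside the anaphor's own sentence. The nested-tuple tree format and the function names are assumptions.

```python
from collections import deque


def np_nodes_bfs(tree):
    """Yield NP subtrees of one parse tree in left-to-right, breadth-first order."""
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        if isinstance(node, str):  # a leaf (word): nothing to expand
            continue
        label, *children = node
        if label == "NP":
            yield node
        queue.extend(children)


def words(node):
    """Flatten a subtree into its list of words."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in words(child)]


def propose_antecedents(previous_trees):
    """Propose NPs from preceding sentences, most recent sentence first."""
    for tree in reversed(previous_trees):
        for np in np_nodes_bfs(tree):
            yield " ".join(words(np))


s1 = ("S",
      ("NP", "John"),
      ("VP", "invited", ("NP", "Mary"),
       ("PP", "to", ("NP", "the", "cinema"))))
print(list(propose_antecedents([s1])))
```

Breadth-first order proposes the subject NP before NPs nested deeper in the verb phrase, which is what gives the traversal its bias towards subjects.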

Anaphora Resolution Models
• [Lappin & Leass, 1994] (pronominal anaphora)
- syntactic structures
- intrasentential syntactic filtering
- morphological filter (person, number, gender)
- detection of pleonastic pronouns
- salience parameters (grammatical role, parallelism of grammatical roles, frequency of mention, proximity, sentence recency)
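Salience-based preference of this kind can be sketched as a weighted sum that decays with distance. The factor names and numeric weights below are illustrative assumptions, not values taken from these slides, and halving the score per intervening sentence is likewise just one plausible decay policy.

```python
# Illustrative weights (assumed for this sketch, not quoted from the slides).
WEIGHTS = {
    "sentence_recency": 100,
    "subject": 80,
    "existential": 70,
    "accusative": 50,
    "indirect_object": 40,
    "head_noun": 80,
    "non_adverbial": 50,
}


def salience(factors, sentences_back):
    """Sum the weights of the factors that apply, halved per sentence of distance."""
    base = sum(WEIGHTS[f] for f in factors)
    return base / (2 ** sentences_back)


candidates = {
    # subject of the previous sentence
    "John": salience({"sentence_recency", "subject", "head_noun", "non_adverbial"}, 1),
    # direct object of the previous sentence
    "Bill": salience({"sentence_recency", "accusative", "head_noun", "non_adverbial"}, 1),
}
best = max(candidates, key=candidates.get)
print(best, candidates)
```

With these assumed weights a subject outranks a direct object at the same distance, which reflects the grammatical-role preference listed among the salience parameters.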

Anaphora Resolution Models
• [Sidner, 1981], [Grosz & Sidner, 1986]
- focus/attention based
- give more salience to those semantic entities that are in focus
- define where to look for an antecedent in the semantic structure of the preceding text (a stack in G&S’s model)

AR Models: Centering
• [Grosz, Joshi, Weinstein, 1983, 1995]
• [Brennan, Friedman and Pollard, 1987]
• Cf(u) = <e1, e2, ..., ek> - an ordered list (forward-looking centers)
• Cb(u) = ei (backward-looking center)
• Cp(u) = e1 (preferred center)

                 Cb(u) = Cb(u-1)    Cb(u) ≠ Cb(u-1)
Cb(u) = Cp(u)    CONTINUING         SMOOTH SHIFT
Cb(u) ≠ Cp(u)    RETAINING          ABRUPT SHIFT

• CON > RET > SSH > ASH
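The transition table and the ranking CON > RET > SSH > ASH can be sketched as two small functions. The function names and the way competing readings are passed in are assumptions; the case where Cb(u-1) is undefined is not handled.

```python
def transition(cb_prev, cb, cp):
    """Classify the centering transition from Cb(u-1), Cb(u) and Cp(u)."""
    same_cb = cb == cb_prev
    if cb == cp:
        return "CONTINUING" if same_cb else "SMOOTH SHIFT"
    return "RETAINING" if same_cb else "ABRUPT SHIFT"


# Preference order: CON > RET > SSH > ASH
RANK = ["CONTINUING", "RETAINING", "SMOOTH SHIFT", "ABRUPT SHIFT"]


def best_reading(cb_prev, readings):
    """Pick the reading whose transition ranks highest.

    `readings` maps a label to a (Cb(u), Cp(u)) pair.
    """
    return min(readings,
               key=lambda r: RANK.index(transition(cb_prev, *readings[r])))


# "Carl thinks he's studying for his exams." leaves Cb = Jeff.
# "I think he went to the Cape with Linda.": Cp is "I" under both readings.
readings = {"he=Jeff": ("Jeff", "I"), "he=Carl": ("Carl", "I")}
for label, (cb, cp) in readings.items():
    print(label, transition("Jeff", cb, cp))
print(best_reading("Jeff", readings))
```

On the Jeff/Carl example from this lecture, the sketch classifies the Jeff reading as RETAINING and the Carl reading as ABRUPT SHIFT, so the Jeff reading is preferred.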

AR Models: Centering
a. I haven’t seen Jeff for several days.
   Cf = (I=[I], [Jeff])
   Cb = [I]
b. Carl thinks he’s studying for his exams.
   Cf = ([Carl], he=[Jeff], [Jeff’s exams])
   Cb = [Jeff]
c. I think he? went to the Cape with Linda.
[Grosz, Joshi & Weinstein, 1983]

AR Models: Centering
b. Carl thinks he’s studying for his exams.
   Cf = ([Carl], he=[Jeff], [Jeff’s exams])
   Cb = [Jeff]
c. I think he? went to the Cape with Linda.
   Reading he=[Jeff]: Cf = (I=[I], he=[Jeff], [the Cape], [Linda]), Cb = [Jeff]: RETAINING
   Reading he=[Carl]: Cf = (I=[I], he=[Carl], [the Cape], [Linda]), Cb = [Carl]: ABRUPT SHIFT
   Preferred reading: he = Jeff, since RETAINING ranks above ABRUPT SHIFT.

Anaphora Resolution Models
• [Mitkov, 1998]
- knowledge-poor approach
- POS tagger, noun phrase rules
- 2 previous sentences
- definiteness, givenness, lexical reiteration, section heading preference, distance, terms of the field, etc.

General Framework
Build a framework that can easily accommodate any of the existing AR models, fine-tune them, and experiment with them to improve performance (learning), eventually obtaining a better model.

General Framework
[Diagram: text is fed to the AR-engine, which can be configured with AR-model 1, AR-model 2 or AR-model 3]
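One way to picture an engine that accommodates interchangeable AR models is a small plug-in interface. Everything here (the `ARModel` protocol, the two toy models, the `AREngine` class) is hypothetical and only illustrates the separation between engine and model, not any actual framework's design.

```python
from typing import Optional, Protocol, Sequence


class ARModel(Protocol):
    """Anything that can pick an antecedent from an ordered candidate list."""

    def resolve(self, anaphor: str, candidates: Sequence[str]) -> Optional[str]:
        ...


class MostRecent:
    """Toy model: prefer the candidate mentioned last."""

    def resolve(self, anaphor, candidates):
        return candidates[-1] if candidates else None


class FirstMentioned:
    """Toy model: prefer the candidate mentioned first."""

    def resolve(self, anaphor, candidates):
        return candidates[0] if candidates else None


class AREngine:
    """The engine stays fixed; the model is swapped in at construction time."""

    def __init__(self, model: ARModel):
        self.model = model

    def run(self, anaphor: str, candidates: Sequence[str]) -> Optional[str]:
        return self.model.resolve(anaphor, candidates)


for model in (MostRecent(), FirstMentioned()):
    print(type(model).__name__, AREngine(model).run("he", ["John", "Bill"]))
```

Running the same input through several models side by side is what makes comparison and fine-tuning practical.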

Co-References
• Halliday and Hasan: a semantic relation, not a textual one
[Diagram: on the text layer, expressions a and b stand in a co-referential anaphoric relation; on the semantic layer, a evokes center_a and b evokes center_a]

Time and Discourse
• Discourse has a dynamic nature
[Diagram: three time axes (real time, discourse time, story time) with events 1 and 2 marked on each; the story-time axis carries tick labels 800, 920, 1000, 1030]

Resolution Moment
Police officer David Cheshire went to Dillard's home. Putting his ear next to Dillard's head, Cheshire heard the music also. [Tanaka, 1999]
[Diagram: the referential expressions in text order: Cheshire, Dillard, his, Dillard, Cheshire]

Resolution Delay
• Sanford and Garrod (1989)
  – initiation point
  – completion point
• Information is kept in a temporary location of memory

Cataphora – What is there?
• The element referred to is anticipated by the referring element
• Theories
  – scepticism
  – syntactic reality
From the corner of the divan of Persian saddle-bags on which he was lying, smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey-sweet and honey-coloured blossoms of a laburnum…
Oscar Wilde, The Picture of Dorian Gray

No right reference needed in discourse processing
• Introduction of an empty discourse entity
• Addition of new features as discourse unfolds
• Pronoun anticipation in Romanian
I taught Gabriel to read. = Ro. L-am învățat pe Gabriel să citească.

Unique directionality in interpretation
[Diagram: in both anaphora (John … he) and cataphora (he … John) the pronoun projects the features gender = masc, number = sg, sem = person; in the cataphoric case name = ? stays open until John supplies name = John]

Automatic Interpretation
• necessity for an intermediate level
[Diagram: the text layer holds the referential expressions a and b; RE a projects the feature structure fs_a onto the restriction layer; fs_a evokes center_a on the semantic layer]

Three Layer Approach to AR
1. John sold his bicycle
2. although Bill would have wanted it.
[Diagram: text layer: “his bicycle” and “it”. Restrictions layer: “his bicycle” projects {no = sg, sem = bicycle, det = yes}; “it” projects {no = sg, sem = ¬human}. Semantic layer: both evoke the same entity {no = sg, sem = bicycle, det = yes}]
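The check that lets "it" evoke the entity already introduced by "his bicycle" can be sketched as a compatibility test between projected restrictions and entity features. The dictionaries, the `sem_not` key standing in for sem = ¬human, and the toy `ISA` hierarchy are assumptions made for this illustration.

```python
# Toy conceptual hierarchy (hypothetical): each class lists what it "is a".
ISA = {
    "bicycle": {"bicycle", "artifact"},
    "person": {"person", "human"},
}


def satisfies(entity, restrictions):
    """True if a discourse entity meets every restriction projected by an RE."""
    for attr, wanted in restrictions.items():
        if attr == "sem_not":
            # Negative restriction: the entity must not belong to this class.
            if wanted in ISA.get(entity.get("sem"), set()):
                return False
        elif entity.get(attr) != wanted:
            return False
    return True


# Entities on the semantic layer after "John sold his bicycle / although Bill ..."
entities = {
    "John": {"no": "sg", "sem": "person"},
    "his bicycle": {"no": "sg", "sem": "bicycle", "det": "yes"},
    "Bill": {"no": "sg", "sem": "person"},
}

# Restrictions projected by the pronoun "it": no = sg, sem = ¬human
it_restrictions = {"no": "sg", "sem_not": "human"}

matches = [name for name, fs in entities.items() if satisfies(fs, it_restrictions)]
print(matches)
```

Keeping restrictions separate from entities means the pronoun only has to state what it knows (singular, non-human), leaving the rest to whichever entity it ends up evoking.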

Delayed Interpretation
Police officer David Cheshire went to Dillard's home. Putting his ear next to Dillard's head, Cheshire heard the music also.
[Diagram: timeline t0 to t3 across the three layers. Text layer: Cheshire, Dillard, his, Dillard. Restriction layer: fs_Cheshire, fs_Dillard, and for “his” an unresolved set candidates = { , }. Semantic layer: the entities Cheshire and Dillard]

Delayed Interpretation
From the corner of the divan of Persian saddle-bags on which he was lying, smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey-sweet and honey-coloured blossoms of a laburnum…
[Diagram: timeline t0, t1, t2 across the three layers. Text layer: he, his, Lord Henry Wotton. Restriction layer: each pronoun projects gender = masc, number = sing, sem = person; Lord Henry Wotton adds name = Lord Henry Wotton. Semantic layer: evoking initiates at the pronoun, with name = ? left open, and completes when the name is supplied]

The case of Cataphora
1. Although Bill would have wanted it,
2. John sold his bicycle to somebody else.
[Diagram: text layer: “it” and “his bicycle”. Restrictions layer: “it” projects {no = sg, sem = ¬human}; “his bicycle” projects {no = sg, sem = bicycle, det = yes}. Semantic layer: “it” first evokes an entity {no = sg, sem = ¬human}, which “his bicycle” then evokes again and completes with sem = bicycle, det = yes]

AR Models
• a set of primary attributes
• a set of knowledge sources
• a set of evocation heuristics or rules
• a set of rules that configure the domain of referential accessibility

AR Models
[Diagram: referential expressions RE_a, RE_b, RE_c, RE_d, …, RE_x on the text layer; knowledge sources fill the primary attributes (attr_x) on the projection layer; heuristics/rules link them to discourse entities DE_1, …, DE_j, …, DE_m on the semantic layer, within the domain of referential accessibility]

Set of Primary Attributes
a. morphological
- number
- lexical gender
- person

Set of Primary Attributes
b. syntactical
- full syntactic description of REs as constituents of a syntactic tree [Lappin and Leass, 1994]; CT-based approaches [Grosz, Joshi and Weinstein, 1995], [Brennan, Friedman and Pollard, 1987]; syntactic-domain-based approaches [Chomsky, 1981], [Reinhart, 1981], [Gordon and Hendricks, 1998], [Kennedy and Boguraev, 1996]
- quality of being adjunct, embedded or complement of a preposition [Kennedy and Boguraev, 1996]
- inclusion or not in an existential construction [Kennedy and Boguraev, 1996]
- syntactic patterns in which the RE is involved; syntactic parallelism [Kennedy and Boguraev, 1996], [Mitkov, 1997]

Set of Primary Attributes
c. semantic
- position of the head of the RE in a conceptual hierarchy (animacy, sex (or natural gender), concreteness); WordNet-based models [Poesio, Vieira and Teufel, 1997]
- inclusion in a synonymy class
- semantic roles, from which selectional restrictions, inferential links, pragmatic limitations, semantic parallelism and object preference can be verified

Set of Primary Attributes
d. positional
- offset of the first token of the RE in the text [Kennedy and Boguraev, 1996]
- inclusion in an utterance, sentence or clause, considered as a discourse unit [Hobbs, 1987], [Azzam, Humphreys and Gaizauskas, 1998], [Cristea et al., 2000]

Set of Primary Attributes
e. surface realisation (type)
The domain of this feature contains: zero-pronoun, clitic pronoun, full pronoun, reflexive pronoun, possessive pronoun, demonstrative pronoun, reciprocal pronoun, expletive “it”, bare noun (undetermined NP), indefinite determined NP, proper noun (name) [Gordon and Hendricks, 1998], [Cristea et al., 2000]

Set of Primary Attributes
f. other
- inclusion or not of the RE in a specific lexical field (“domain concept”) [Mitkov, 1997]
- frequency of the term in the text [Mitkov, 1997]
- occurrence of the term in a heading [Mitkov, 1997]

Knowledge Sources
• Type of process: incremental
• A knowledge source: a (virtual) processor able to supply values for attributes on the restriction layer
• Minimum set: POS-tagger + shallow parser
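The idea of knowledge sources as processors that incrementally fill in attributes can be sketched as a chain of functions, each contributing whatever values it knows. The two toy sources and the tiny pronoun lexicon below are stand-ins invented for this example; they are not a real POS tagger or shallow parser.

```python
# Hypothetical mini-lexicon used by the toy morphological source.
PRONOUNS = {
    "he": {"gender": "masc", "number": "sg"},
    "she": {"gender": "fem", "number": "sg"},
    "they": {"number": "pl"},
}


def morphology_source(re_text):
    """Toy knowledge source: morphological attributes for known pronouns."""
    return dict(PRONOUNS.get(re_text.lower(), {}))


def type_source(re_text):
    """Toy knowledge source: surface realisation type of the RE."""
    is_pronoun = re_text.lower() in PRONOUNS
    return {"type": "full pronoun" if is_pronoun else "noun phrase"}


def project(re_text, sources):
    """Build the feature structure incrementally, one knowledge source at a time."""
    fs = {}
    for source in sources:
        fs.update(source(re_text))
    return fs


print(project("He", [morphology_source, type_source]))
print(project("the king", [morphology_source, type_source]))
```

A source with nothing to say about an expression simply contributes no attributes, so richer sources can be added later without changing the loop.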