CSA 3180 Natural Language Processing Information Extraction 1

CSA 3180: Natural Language Processing
Information Extraction 1 – Introduction
• Information Extraction
• Named Entities
• IE Systems
• MUC
• Finite State Machines
• Pattern Recognition
December 2005 CSA 3180: Information Extraction I

Introduction
• Slides based on Lectures by Marti Hearst (2004)
• GATE Information Extraction
  • http://www.gate.ac.uk/ie/
• Sheffield Web Intelligence Technologies
  • http://nlp.shef.ac.uk/wig/

Classification at different granularities
• Text Categorization:
  – Classify an entire document
• Information Extraction (IE):
  – Identify and classify small units within documents
• Named Entity Extraction (NE):
  – A subset of IE
  – Identify and classify proper names
    • People, locations, organizations

[Figure labels: Martin Baker, a person; Genomics job; Employers job posting form]

Aggregator Websites [figure]

foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1

Aggregator Websites
• Read in many web pages from different sites
• Extract information into a database
  • Screen Scraping
• Can then return data matching particular queries
• Data mining can extract meaningful insight that might not have been obvious

Data Mining [figure]

IE from Research Papers [figure]

IE from Commercial Websites [figure]

What is Information Extraction?
As a task: Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

Slots to fill: NAME, TITLE, ORGANIZATION

What is Information Extraction?
As a task: Filling slots in a database from sub-segments of text.
IE applied to the news text above fills the slots:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

What is Information Extraction?
As a family of techniques: Information Extraction = segmentation + classification + association
aka “named entity extraction”

Segments found in the same news text, in order of appearance: Microsoft Corporation; CEO; Bill Gates; Microsoft; Gates; Microsoft; Bill Veghte; Microsoft; VP; Richard Stallman; founder; Free Software Foundation

IE in Context
[Pipeline diagram: Create ontology; Spider; Filter by relevance; IE (Segment, Classify, Associate, Cluster); Load DB; Document collection; Database; Label training data; Train extraction models; Query, Search; Data mine]

IE in Context: Formatting
• Text paragraphs without formatting:
  Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
• Grammatical sentences and some formatting & links [figure]

IE in Context: Formatting
• Non-grammatical snippets, rich formatting & links [figure]
• Tables [figure]

IE in Context: Coverage
• Web site specific (formatting): Amazon.com Book Pages
• Genre specific (layout): Resumes
• Wide, non-specific (language): University Names

IE in Context: Complexity
• Closed set: U.S. states
  – He was born in Alabama…
  – The big Wyoming sky…
• Regular set: U.S. phone numbers
  – Phone: (413) 545-1323
  – The CALD main office can be reached at 412-268-1299
• Complex pattern: U.S. postal addresses
  – University of Arkansas, P.O. Box 140, Hope, AR 71802
  – Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210
• Ambiguous patterns, needing context and many sources of evidence: Person names
  – …was among the six houses sold by Hope Feldman that year.
  – Pawel Opalinski, Software Engineer at WhizBang Labs.
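
The "regular set" case can be illustrated with a single regular expression. The following is a minimal sketch, assuming only the two phone-number layouts that appear in the examples; the pattern and the helper name `find_phone_numbers` are illustrative, not part of the original slides, and real phone-number extraction needs many more variants.

```python
import re

# Assumed formats only: "(413) 545-1323" and "412-268-1299".
US_PHONE = re.compile(r"(?:\(\d{3}\) ?|\d{3}-)\d{3}-\d{4}")


def find_phone_numbers(text):
    """Return every substring of text that matches the assumed phone pattern."""
    return US_PHONE.findall(text)


print(find_phone_numbers("Phone: (413) 545-1323"))
print(find_phone_numbers("The CALD main office can be reached at 412-268-1299"))
```

The closed-set case (U.S. states) would use a lookup list instead, while addresses and person names are the cases where a single pattern like this stops being sufficient.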

IE in Context: Single Field/Record
Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.
• Single entity (“named entity” extraction)
  – Person: Jack Welch
  – Person: Jeffrey Immelt
  – Location: Connecticut
• Binary relationship
  – Relation: Person-Title; Person: Jack Welch; Title: CEO
  – Relation: Company-Location; Company: General Electric; Location: Connecticut
• N-ary record
  – Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt

State of the Art
• Named entity recognition from newswire text
  – Person, Location, Organization, …
  – F1 in high 80’s or low- to mid-90’s
• Binary relation extraction
  – Contained-in(Location1, Location2), Member-of(Person1, Organization1)
  – F1 in 60’s or 70’s or 80’s
• Web site structure recognition
  – Extremely accurate performance obtainable
  – Human effort (~10 min?) required on each site

IE Generations
• Hand-Built Systems – Knowledge Engineering [1980s– ]
  – Rules written by hand
  – Require experts who understand both the systems and the domain
  – Iterative guess-test-tweak-repeat cycle
• Automatic, Trainable Rule-Extraction Systems [1990s– ]
  – Rules discovered automatically using predefined templates, using automated rule learners
  – Require huge, labeled corpora (effort is just moved!)
• Statistical Models [1997– ]
  – Use machine learning to learn which features indicate boundaries and types of entities.
  – Learning usually supervised; may be partially unsupervised

IE Techniques
Example sentence for every technique: Abraham Lincoln was born in Kentucky.
• Lexicons: is the candidate a member of a list? (Alabama, Alaska, …, Wisconsin, Wyoming)
• Classify Pre-segmented Candidates: a classifier decides which class
• Sliding Window: a classifier decides which class; try alternate window sizes
• Boundary Models: classifiers mark BEGIN and END
• Finite State Machines: most likely state sequence?
• Context Free Grammars: most likely parse? (NNP, V, P, NP, PP, VP, S)

Trainable IE Systems
Pros
• Annotating text is simpler & faster than writing rules.
• Domain independent
• Domain experts don’t need to be linguists or programmers.
• Learning algorithms ensure full coverage of examples.
Cons
• Hand-crafted systems perform better, especially at hard tasks (but this is changing).
• Training data might be expensive to acquire
• May need huge amount of training data
• Hand-writing rules isn’t that hard!!

MUC: Genesis of IE
• DARPA funded significant efforts in IE in the early to mid 1990’s.
• Message Understanding Conference (MUC) was an annual event/competition where results were presented.
• Focused on extracting information from news articles:
  – Terrorist events
  – Industrial joint ventures
  – Company management changes
• Information extraction of particular interest to the intelligence community (CIA, NSA). (Note: early ’90s)

MUC
• Named entity
  • Person, Organization, Location
• Co-reference
  • Clinton ↔ President Bill Clinton
• Template element
  • Perpetrator, Target
• Template relation
  • Incident
• Multilingual

MUC Typical Text
Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production of 20,000 iron and “metal wood” clubs a month

MUC Templates
• Relationship
  • tie-up
• Entities:
  • Bridgestone Sports Co, a local concern, a Japanese trading house
• Joint venture company
  • Bridgestone Sports Taiwan Co
• Activity
  • ACTIVITY 1
• Amount
  • NT$20,000,000 (the text says "20 million new Taiwan dollars")

MUC Templates
• ACTIVITY 1
  – Activity
    • Production
  – Company
    • Bridgestone Sports Taiwan Co
  – Product
    • Iron and “metal wood” clubs
  – Start Date
    • January 1990

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.

TIE-UP-1
  Relationship: TIE-UP
  Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
  Joint Venture Company: “Bridgestone Sports Taiwan Co.”
  Activity: ACTIVITY-1
  Amount: NT$20,000,000

ACTIVITY-1
  Activity: PRODUCTION
  Company: “Bridgestone Sports Taiwan Co.”
  Product: “iron and ‘metal wood’ clubs”
  Start Date: DURING: January 1990

Example from FASTUS (1993)

Evaluating IE Accuracy
• Always evaluate performance on independent, manually annotated test data not used during system development.
• Measure for each test document:
  – Total number of correct extractions in the solution template: N
  – Total number of slot/value pairs extracted by the system: E
  – Number of extracted slot/value pairs that are correct (i.e., in the solution template): C
• Compute average value of metrics adapted from IR:
  – Recall = C/N
  – Precision = C/E
  – F-Measure = Harmonic mean of recall and precision
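
The three metrics can be computed directly from N, E and C. The following is a minimal sketch for a single test document; the function name `score_extractions` and the example slot/value pairs are hypothetical, and the harmonic mean is written out as 2PR/(P+R).

```python
def score_extractions(gold, predicted):
    """Return (recall, precision, f_measure) for one test document.

    gold      -- set of (slot, value) pairs in the solution template (size N)
    predicted -- set of (slot, value) pairs extracted by the system (size E)
    """
    n = len(gold)
    e = len(predicted)
    c = len(gold & predicted)  # C: extracted pairs that are also in the solution
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    # Harmonic mean of recall and precision.
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


# Hypothetical solution template and system output, for illustration only.
gold = {("Person", "Jack Welch"), ("Title", "CEO"),
        ("Company", "General Electric"), ("Location", "Connecticut")}
predicted = {("Person", "Jack Welch"), ("Title", "CEO"),
             ("Company", "General Motors")}

recall, precision, f_measure = score_extractions(gold, predicted)
print(recall, precision, f_measure)
```

Here C = 2, N = 4 and E = 3, so recall is 2/4 and precision is 2/3. Averaging over a test set would repeat this per document and take the mean of each metric.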

Results for 1997
• NE – named entity recognition
• CO – coreference resolution
• TE – template element construction
• TR – template relation construction
• ST – scenario template production

Finite State Transducers for IE
• Basic method for extracting relevant information
• IE systems generally use a collection of specialized FSTs
  • Company Name detection
  • Person Name detection
  • Relationship detection

Equivalent Representations
• Regular expressions, finite automata, and regular languages: each can describe the others.
• Theorem: For every regular expression, there is a deterministic finite-state automaton that defines the same language, and vice versa.

Finite State Automata Graphs
• A state
• The start state
• An accepting state
• A transition (labeled with an input symbol, e.g. a)

Finite Automata
• An FA is similar to a compiler in that:
  – A compiler recognizes legal programs in some (source) language.
  – A finite-state machine recognizes legal strings in some language.
• Example: Programming Language Identifiers
  – sequences of one or more letters or digits, starting with a letter:
[Diagram: S --letter--> A; A --letter | digit--> A; A is the accepting state]

Finite Automata
• Transition: s1 --a--> s2
• Is read: in state s1 on input “a” go to state s2
• If end of input
  – If in accepting state => accept
  – Otherwise => reject
• If no transition possible (got stuck) => reject
• FSA = Finite State Automaton

Finite State Automata Language
• The language defined by an FSA is the set of strings accepted by the FSA.
  – In the language of the FSM shown below: x, tmp2, XyZzy, position27
  – Not in the language of the FSM shown below: 123, a?, 13apples
[Diagram: S --letter--> A; A --letter | digit--> A; A is the accepting state]
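
The identifier automaton can be simulated in a few lines. The following is a sketch, assuming "letter" means an ASCII letter and "digit" an ASCII digit (the slides do not say); the function name `accepts_identifier` is illustrative.

```python
import string


def accepts_identifier(text):
    """Simulate the two-state machine: S --letter--> A, A --letter|digit--> A."""
    state = "S"
    for ch in text:
        if state == "S" and ch in string.ascii_letters:
            state = "A"
        elif state == "A" and (ch in string.ascii_letters or ch in string.digits):
            state = "A"
        else:
            return False  # no transition possible: the machine is stuck
    return state == "A"   # accept only if input ends in the accepting state


for candidate in ["x", "tmp2", "XyZzy", "position27", "123", "a?", "13apples"]:
    print(candidate, accepts_identifier(candidate))
```

The first four strings are accepted and the last three rejected, matching the membership examples listed for this machine.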

Example: Integer Literals
• FSA that accepts integer literals with an optional + or - sign:
• Note the two different edges from S to A
• (+|-)?[0-9]+
[Diagram: states S, A, B; edges labeled +, -, and digit]
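
The same language can be checked with Python's `re` module. This is a small sketch; note that in Python syntax a literal `+` has to be escaped or placed in a character class, so the slide's `(+|-)?[0-9]+` is written here as `[+-]?[0-9]+`. The helper name `is_integer_literal` is illustrative.

```python
import re

# Optional sign followed by one or more digits.
INTEGER_LITERAL = re.compile(r"[+-]?[0-9]+")


def is_integer_literal(text):
    """True if the whole string is an optionally signed integer literal."""
    return INTEGER_LITERAL.fullmatch(text) is not None


for candidate in ["42", "+7", "-300", "+", "4-2", "3.14"]:
    print(candidate, is_integer_literal(candidate))
```

`fullmatch` is used rather than `search` because an automaton accepts or rejects the entire input, not a substring of it.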

Finite State Automata Example
• FSA that accepts three-letter English words that begin with p and end with d or t.
• Here I use the convenient notation of making the state name match the input that has to be on the edge leading to that state.
[Diagram: start, then p, then one of a, i, o, u, then one of t, d]

Formal Definition
• A finite automaton is a 5-tuple $(\Sigma, Q, \delta, q, F)$ where:
  – An input alphabet $\Sigma$
  – A set of states $Q$
  – A start state $q$
  – A set of accepting states $F \subseteq Q$
  – $\delta$ is the state transition function: $Q \times \Sigma \to Q$ (i.e., it encodes transitions state --input--> state)

FSA Implementation
A table-driven approach:
• table:
  – one row for each state in the machine, and
  – one column for each possible character.
• Table[j][k]
  – which state to go to from state j on character k,
  – an empty entry corresponds to the machine getting stuck.
• Note: when you use the re package in Python, it compiles your regexes into an internal matching program (a backtracking matcher, so not strictly a table-driven FSM)
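
A table-driven recognizer might look like the sketch below. Two assumptions are made for brevity: columns are character classes rather than individual characters, and the table encodes one possible deterministic machine for signed integer literals, which is not necessarily the machine drawn on the slide. All names (`TABLE`, `char_class`, `run_fsa`) are illustrative.

```python
def char_class(ch):
    """Map a character to a table column (one column per class, a simplification)."""
    if ch in "+-":
        return "sign"
    if ch in "0123456789":
        return "digit"
    return "other"


# TABLE[state][column] -> next state; a missing entry means the machine is stuck.
TABLE = {
    "S": {"sign": "A", "digit": "B"},
    "A": {"digit": "B"},
    "B": {"digit": "B"},
}
START = "S"
ACCEPTING = {"B"}


def run_fsa(text):
    """Run the table-driven automaton and report whether text is accepted."""
    state = START
    for ch in text:
        state = TABLE[state].get(char_class(ch))
        if state is None:
            return False  # empty table entry: stuck, so reject
    return state in ACCEPTING


for candidate in ["42", "+7", "-300", "+", "4-2", "x1"]:
    print(candidate, run_fsa(candidate))
```

Swapping in a different table changes the language without touching `run_fsa`, which is the main attraction of the table-driven design.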

Terminology
• FSA: Finite State Automaton
• FSM: Finite State Machine
  – FSA, FSM used interchangeably
• FST: Finite State Transducer
  – The same as FSA, except each transition produces output
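
To make the FSA/FST distinction concrete, the sketch below reuses the identifier automaton and attaches an output symbol to every transition. The output alphabet ("L" for letter, "D" for digit) and all names are invented for illustration.

```python
import string

# (state, input class) -> (next state, output symbol)
TRANSITIONS = {
    ("S", "letter"): ("A", "L"),
    ("A", "letter"): ("A", "L"),
    ("A", "digit"): ("A", "D"),
}
ACCEPTING = {"A"}


def input_class(ch):
    """Classify a character as letter, digit, or other (ASCII only, an assumption)."""
    if ch in string.ascii_letters:
        return "letter"
    if ch in string.digits:
        return "digit"
    return "other"


def transduce(text):
    """Return the emitted output string if text is accepted, otherwise None."""
    state, output = "S", []
    for ch in text:
        step = TRANSITIONS.get((state, input_class(ch)))
        if step is None:
            return None  # stuck: no transition, so reject
        state, symbol = step
        output.append(symbol)
    return "".join(output) if state in ACCEPTING else None


print(transduce("tmp2"))
print(transduce("13apples"))
```

An acceptor would only answer yes or no for "tmp2"; the transducer additionally produces one output symbol per transition taken.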

FSTs for IE
• FSTs are often compiled from regular expressions
• Probabilistic (weighted) FSTs
• FSTs mean different things to different IE approaches:
  – Based on lexical items (words)
  – Based on statistical language models
  – Based on deep syntactic/semantic analysis
• Several FSTs or a more complex FST can be used to find one type of information (e.g. company names)

FSTs for IE
Frodo Baggins works for Hobbit Factory, Inc.
Text Analyzer:
  Frodo – ProperName
  Baggins – ProperName
  works – Verb
  for – Prep
  Hobbit – UnknownCap
  Factory – NounCap
  Inc – CompAbbr

FSTs for IE
Frodo Baggins works for Hobbit Factory, Inc.
A regular expression for finding company names: “some capitalized words, maybe a comma, then a company abbreviation indicator”
CompanyName = (ProperName | SomeCap)+ Comma? CompAbbr
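
One way to run such a tag-level pattern with an ordinary regex engine is to encode each tag as a single character. The sketch below assumes that UnknownCap and NounCap both count as SomeCap and that the analyzer emits a Comma token; the one-letter codes and the function name `find_company_names` are hypothetical.

```python
import re

# Hypothetical one-letter codes for the analyzer's tags.
TAG_CODES = {
    "ProperName": "P",
    "UnknownCap": "C",  # assumed to count as SomeCap
    "NounCap": "C",     # assumed to count as SomeCap
    "Comma": ",",
    "CompAbbr": "A",
}

# (ProperName | SomeCap)+ Comma? CompAbbr, over tag codes instead of characters.
COMPANY_NAME = re.compile(r"[PC]+,?A")


def find_company_names(tagged_tokens):
    """Return the token sequences whose tags match the CompanyName pattern."""
    codes = "".join(TAG_CODES.get(tag, "w") for _, tag in tagged_tokens)
    names = []
    for match in COMPANY_NAME.finditer(codes):
        # One code per token, so match offsets index the token list directly.
        names.append([word for word, _ in tagged_tokens[match.start():match.end()]])
    return names


tagged = [
    ("Frodo", "ProperName"), ("Baggins", "ProperName"),
    ("works", "Verb"), ("for", "Prep"),
    ("Hobbit", "UnknownCap"), ("Factory", "NounCap"),
    (",", "Comma"), ("Inc", "CompAbbr"),
]
print(find_company_names(tagged))
```

"Frodo Baggins" is not reported because no company abbreviation follows it; only the span ending in "Inc" satisfies the pattern.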

FSTs for IE
Frodo Baggins works for Hobbit Factory, Inc.
ProperName works for CompanyName.

FSTs for IE
Frodo Baggins works for Hobbit Factory, Inc.
Company Name Detection FSA
[Diagram: states 1 to 4; edge labels: word, (CAP | PN), comma, ε, CAB]
CAP = SomeCap, CAB = CompAbbr, PN = ProperName, ε = empty string

FSTs for IE
Frodo Baggins works for Hobbit Factory, Inc.
Company Name Detection FST
[Diagram: states 1 to 4; edge labels: word, (CAP | PN), comma, ε, CAB with output CN]
CAP = SomeCap, CAB = CompAbbr, PN = ProperName, ε = empty string, CN = CompanyName

FSMs for Pattern Recognition
• Use regexes at a higher level of description for pattern recognition
• Determining which person holds what office in what organization
  – [person], [office] of [org]
    • Vuk Draskovic, leader of the Serbian Renewal Movement
  – [org] (named, appointed, etc.) [person] P [office]
    • NATO appointed Wesley Clark as Commander in Chief
• Determining where an organization is located
  – [org] in [loc]
    • NATO headquarters in Brussels
  – [org] [loc] (division, branch, headquarters, etc.)
    • KFOR Kosovo headquarters
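
The first pattern, [person], [office] of [org], can be sketched as a regex with named groups. In a real system the [person] and [org] slots would be filled by earlier named-entity recognizers; here, purely as a stand-in assumption, a run of capitalized words plays that role. The names `CAP_SEQ` and `match_person_office_org` are illustrative.

```python
import re

# Stand-in for a named-entity recognizer: one or more capitalized words.
CAP_SEQ = r"(?:[A-Z][a-z]+(?: [A-Z][a-z]+)*)"

PERSON_OFFICE_ORG = re.compile(
    rf"(?P<person>{CAP_SEQ}), "
    rf"(?P<office>[a-z]+(?: [a-z]+)*) of (?:the )?"
    rf"(?P<org>{CAP_SEQ})"
)


def match_person_office_org(text):
    """Return (person, office, org) for the first match, or None if there is none."""
    match = PERSON_OFFICE_ORG.search(text)
    if match is None:
        return None
    return match.group("person"), match.group("office"), match.group("org")


print(match_person_office_org(
    "Vuk Draskovic, leader of the Serbian Renewal Movement"))
```

The capitalization heuristic would misfire on sentence-initial words and on all-caps names such as NATO, which is one reason such patterns are normally applied to entity annotations rather than raw text.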

FASTUS
• Successful early IE system (1993)
• Built on Finite State Automata (FSA) transductions

1. Complex Words: Recognition of multi-words and proper names
   (e.g. set up; new Taiwan dollars)
2. Basic Phrases: Simple noun groups, verb groups and particles
   (e.g. a Japanese trading house; had set up)
3. Complex Phrases: Complex noun groups and verb groups
   (e.g. production of 20,000 iron and metal wood clubs)
4. Domain Events: Patterns for events of interest to the application. Basic templates are to be built.
   (e.g. [company] [set up] [Joint-Venture] with [company])
5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.