Information Extraction
Kuang-hua Chen
khchen@ccms.ntu.edu.tw
Language & Information Processing System Lab. (LIPS)
Department of Library and Information Science
National Taiwan University

Outline
• Introduction
• Information extraction
• Metadata
• Text processing techniques
• Message Understanding Conference
• Future research
Language & Information Processing System, LIS, NTU 1998/10/22

Information Services
• Keyword searching
• Information retrieval (document retrieval)
• Information filtering
• Information extraction
• Information summarization
• Information understanding

Information Extraction?
• A task that draws out information from documents based on predefined templates.
• A predefined template is a collection of attribute-value pairs.
• Templates play the role of metadata formats, but in a different guise.
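
A template as a collection of attribute-value pairs can be sketched in a few lines. This is only an illustration: the class name, the slot names, and the sample values are hypothetical, chosen to resemble a management-change news item.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ManagementChangeTemplate:
    """Hypothetical template: each attribute is a slot to be filled."""
    person: Optional[str] = None
    organization: Optional[str] = None
    position: Optional[str] = None
    change: Optional[str] = None  # e.g. "appointed" or "resigned"


def filled_slots(template):
    """Return only the attribute-value pairs that were actually extracted."""
    return {name: value for name, value in asdict(template).items()
            if value is not None}


# Invented sample values; the "change" slot is left unfilled on purpose.
example = ManagementChangeTemplate(
    person="Jane Doe", organization="Acme Corp.", position="CEO")
print(filled_slots(example))
```

Slots left as `None` model information that the system could not find in the document.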

Specificity of an IE Task
• Because each IE task is specific, the kind of information to extract is domain-dependent.
• For example
  – MUC-5: the target documents are news articles about joint ventures and microelectronics
  – MUC-6: the target documents are news articles about management changes

Templates
• User-defined templates
  – Dynamically customized based on the user’s information need
  – Studied in information extraction research
• Authority-controlled templates
  – Statically specified by some authority
  – Studied in metadata research

Metadata
• Metadata is data about data
• Metadata is used to describe other information based on some rules or policies
• Examples
  – Person: ID card, driver’s license
  – Book: MARC

Examples of Metadata
• GILS
  – Government Information Locator Service
• FGDC
  – Federal Geographic Data Committee Standard
• CIMI
  – Consortium for the Computer Interchange of Museum Information

Functions of Metadata
• Location
• Discovery
• Documentation
• Evaluation
• Selection

What Information?
• Person
• Event
• Time
• Place
• Object
• Relationship

MARC
• To make it convenient for readers to find books in libraries, each book is cataloged in Machine-Readable Cataloging (MARC) format based on the Anglo-American Cataloguing Rules, 2nd edition (AACR2).
• Take the book “The Electronic Library” by Kenneth E. Dowlin as an example.

001     83021957 //r91
005     19911024125216.4
008     831004s1984 nyua b 00110 eng cam a
010     83021957 //r91
020     0918212758 (pbk.) : |c$24.95
040     DLC|cDLC|dDLC
050 00  Z678.9|b.D68 1984
082 00  025/.04|219
091     Z/678.9/D68/1984///1410222 AL/1415924 CL/1453410 CL/1733896 CF
095     TUL|bAL|bCL|bCF TUL|dZ678.9|eD68|y1984|t095|bAL|c1410222
...
099     TUL|d|e|y|f|t 091|b|c|x|z
100 10  Dowlin, Kenneth E
245 14  The electronic library : |bthe promise and the process / |cKenneth E. Dowlin
260 0   New York, N.Y. : |bNeal-Schuman Publishers, |cc1984
300     xi, 199 p. : |bill. ; |c23 cm
440 0   Applications in information management and technology series
504     Includes bibliographical references and index
650 0   Libraries|xAutomation
650 0   Information technology
910     8'93 D#139 MCL
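
Each field value in the record packs several subfields behind a `|` delimiter followed by a one-letter code. A minimal sketch of splitting such a display string follows. The function name `split_subfields` is hypothetical, and treating the unmarked leading chunk as subfield `a` is an assumption about this display format, not a rule taken from the slides.

```python
def split_subfields(value, delimiter="|"):
    """Split a displayed MARC field value into (code, text) pairs.

    Assumption: the text before the first delimiter is subfield 'a';
    every later chunk starts with its one-letter subfield code.
    """
    chunks = value.split(delimiter)
    subfields = [("a", chunks[0].strip())]
    for chunk in chunks[1:]:
        if chunk:  # skip empty chunks produced by doubled delimiters
            subfields.append((chunk[0], chunk[1:].strip()))
    return subfields


# Field 245 (title statement) of the example record.
title_field = split_subfields(
    "The electronic library : |bthe promise and the process / "
    "|cKenneth E. Dowlin")
for code, text in title_field:
    print(code, text)
```

Real MARC records are exchanged in a binary format with their own delimiters; this sketch only handles the human-readable display shown on the slide.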

Dublin Core
• A simple metadata format
• Designed for networked information
• Contains 15 elements

Elements of Dublin Core
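
For reference, the 15 elements of the Dublin Core element set can be written down as a small record structure. This is a minimal sketch: `make_record` is a hypothetical helper, and the values filled in from the MARC example are an illustrative assumption, not an official MARC-to-Dublin-Core crosswalk.

```python
DUBLIN_CORE_ELEMENTS = (
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
)


def make_record(**values):
    """Build a record with all 15 elements; unfilled elements stay None."""
    unknown = set(values) - set(DUBLIN_CORE_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return {element: values.get(element) for element in DUBLIN_CORE_ELEMENTS}


# Illustrative values taken from the MARC example record.
record = make_record(
    Title="The electronic library: the promise and the process",
    Creator="Dowlin, Kenneth E.",
    Subject="Libraries -- Automation",
    Publisher="Neal-Schuman Publishers",
    Date="1984",
    Identifier="ISBN 0918212758",
    Language="eng",
)
print(sum(value is not None for value in record.values()), "of",
      len(record), "elements filled")
```

Compared with the MARC record, the flat element list shows why Dublin Core counts as a simple format.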

Automaticity
• Automatic or semiautomatic procedures are needed to “catalog” existing homepages and other untagged documents without heavy human effort.
• Research on information extraction sheds light on how to solve these problems.

Complexity and Automaticity of Metadata Format
[Figure; axes: complexity, automaticity]

Components of IE Systems
• Tokenization module
• Stemming module
• Word segmentation module
• Lexical analysis module
• Syntactic analysis module
• Domain knowledge module

Techniques for Text Processing
• Research in natural language processing (NLP) has produced many high-performance analysis systems.
• The tokenization module reaches about 98% accuracy [Palmer and Hearst, 1994].
  – The difficulty lies in deciding whether a period is a full stop or part of an abbreviation.
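
The period problem can be made concrete with a deliberately simple heuristic. This is a sketch under stated assumptions: it uses a small hand-made abbreviation list (hypothetical) and is not the method of the cited work.

```python
# Hypothetical abbreviation list; a real system would need a far larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "inc.", "n.y.", "e.g.", "i.e."}


def split_sentences(text):
    """Split on whitespace and end a sentence at any token that ends
    with a period, unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a final period
        sentences.append(" ".join(current))
    return sentences


for sentence in split_sentences(
        "Mr. Smith joined Acme Inc. in May. He became chairman."):
    print(sentence)
```

The heuristic fails when an abbreviation also ends a sentence, which is exactly the ambiguity the slide describes.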

Techniques for Text Processing (continued)
• The stemming module also performs well enough.
  – Porter algorithm [Porter, 1980]
  – Two-level morphology [Koskenniemi, 1983]
• The lexical analysis module is the part of NLP research that has improved most in recent years.
  – Probabilistic tagger [Church, 1988]
  – Rule-based tagger [Brill, 1992]
  – Hybrid tagger [Voutilainen, 1993]
  – Finite-state tagger [Kempe, 1997]
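
To give a feel for suffix-stripping stemmers, the sketch below implements only the first plural-handling step (step 1a) of the Porter algorithm. The full algorithm has several further steps and conditions that are omitted here, so this is not a complete stemmer.

```python
def porter_step_1a(word):
    """Apply Porter step 1a: the longest matching suffix rule wins.

    SSES -> SS, IES -> I, SS -> SS, S -> (removed)
    """
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


for word in ["caresses", "ponies", "caress", "cats"]:
    print(word, "->", porter_step_1a(word))
```

The output for "ponies" is "poni", not a dictionary word: a stemmer only needs to map related word forms to the same string.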

Word Segmentation
• Chinese word segmentation
  – 將黃大目的確實行動作了解釋 (adapted from an example by Professor 張俊盛)
  – 將 黃大目 的 確實 行動 作 了 解釋
• Segmentation approaches
  – CKIP, SINICA
  – BDC
  – NLP, NTHU
  – NLPL, NTU
• Take proper nouns into consideration
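
Why this sentence is hard can be shown with a greedy dictionary matcher. The sketch below uses forward maximum matching, a common baseline that is swapped in here for illustration; it is not claimed to be the method of any system listed on the slide, and the toy lexicon is an assumption.

```python
def forward_max_match(text, lexicon):
    """Greedy segmentation: at each position take the longest lexicon
    word that matches; fall back to a single character."""
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


SENTENCE = "將黃大目的確實行動作了解釋"
# Toy lexicon (assumption): every overlapping two-character word in the sentence.
BASE_LEXICON = {"目的", "的確", "確實", "實行", "行動", "動作", "了解", "解釋"}

print(" ".join(forward_max_match(SENTENCE, BASE_LEXICON)))
# Adding the proper noun changes every later word boundary.
print(" ".join(forward_max_match(SENTENCE, BASE_LEXICON | {"黃大目"})))
```

With this lexicon the two runs disagree with each other, and neither reproduces the intended segmentation 將 黃大目 的 確實 行動 作 了 解釋. Knowing the proper noun helps, but resolving the remaining overlaps needs more than greedy matching.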

Syntactic Analysis
• The most challenging work
• From the viewpoint of NLP, a correct and complete parse tree is very important
• For applications like IR and IE, processing time is the most critical factor
• Balancing speed against correctness is important
• Partial parsing

Partial Parsing
• Fidditch [Hindle, 1983]
• Chunker
  – Rule-based chunker [Abney, 1991]
  – Probabilistic chunker [Chen and Chen, 1993]
• Transformation-based parser [Brill, 1993]
• Probabilistic binary parser [Chen, 1998]
• Finite-state parser
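
The idea of partial parsing, finding useful pieces without building a full tree, can be sketched with a single noun-phrase pattern over part-of-speech tags. This is an illustrative toy, not any of the cited systems. The pattern (optional determiner, any adjectives, one or more nouns) and the Penn Treebank-style tags are assumptions.

```python
def chunk_noun_phrases(tagged):
    """Scan (word, tag) pairs left to right and collect spans matching
    DT? JJ* NN+ as noun-phrase chunks."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                      # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":         # any adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):  # one or more nouns
            k += 1
        if k > j:  # at least one noun was found: emit the chunk
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks


# Hand-tagged sample sentence (the tags are supplied, not computed).
tagged_sentence = [
    ("The", "DT"), ("new", "JJ"), ("president", "NN"), ("joined", "VBD"),
    ("the", "DT"), ("joint", "JJ"), ("venture", "NN"),
    ("in", "IN"), ("May", "NNP"),
]
print(chunk_noun_phrases(tagged_sentence))
```

A single linear pass like this is fast, which is the trade-off against a complete parse that matters for IR and IE.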

Message Understanding Conference
• A gathering of researchers in natural language processing
• Conference participants must develop NLP systems that perform a variety of information extraction tasks
• Each system's performance is evaluated by comparing its output with the output of human linguists
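
Comparing system output with a human-prepared answer key is usually summarized as precision and recall. The sketch below is a simplified version that assumes exact matching of (slot, value) pairs; the official MUC scoring was more detailed (for example, it allowed partial matches), and the sample data are invented.

```python
def score(system, key):
    """Return (precision, recall, F1) for two sets of (slot, value) pairs."""
    correct = len(system & key)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(key) if key else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented answer key (human) and system output for one document.
answer_key = {
    ("person", "Jane Doe"), ("organization", "Acme Corp."),
    ("position", "CEO"), ("change", "appointed"),
}
system_output = {
    ("person", "Jane Doe"), ("organization", "Acme"), ("position", "CEO"),
}

precision, recall, f1 = score(system_output, answer_key)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Here two of the three system fills are correct (precision 2/3) and two of the four key fills are found (recall 1/2).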

MUC Tasks
• MUC-1 (1987) and MUC-2 (1989)
  – naval operations
• MUC-3 (1991) and MUC-4 (1992)
  – terrorist activity
• MUC-5 (1993)
  – joint ventures and microelectronics
• MUC-6 (1995)
  – management changes

MUC-6 Tasks
• Named Entity (NE) requires only that the system under evaluation identify each bit of pertinent information in isolation from all others.
  – person names
  – company names
  – organization names
  – location
  – dates, times, currency
• Coreference (CO) requires connecting all references to "identical" entities.
• Template Element (TE) requires grouping entity attributes together into entity "objects."
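
The most regular Named Entity types, dates and currency amounts, can be sketched with pattern matching. The two patterns below are assumptions that cover only one date style (YYYY/MM/DD) and dollar amounts. Names of persons and organizations need lexicons and context, and are not attempted here.

```python
import re

# Hypothetical patterns; real NE systems cover many more formats.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}/\d{1,2}/\d{1,2}\b"),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
}


def tag_entities(text):
    """Return (label, matched text) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]


print(tag_entities("The book cost $24.95 on 1998/10/22."))
```

Each entity is recognized in isolation, which is all the NE task asks for. Linking mentions and filling templates belong to the CO and TE tasks.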

Results of MUC-6

MUC-7 Tasks (1998)
• Named Entity (NE)
• Coreference (CO)
• Template Element (TE)
• Template Relationship (TR) requires identifying relationships between template elements.
• Scenario Template (ST) requires identifying instances of a task-specific event and identifying event attributes, including entities that fill some role in the event; the overall information content is captured via interlinked "objects."

Future Research
• Dynamic templates gradually shift to static metadata through user studies
• High-performance, fast parsing algorithms
• Discourse analysis
• Summarization as information extraction
• Multimedia and intermedia considerations
• Multimodal and intermodal considerations