GATE, a General Architecture for Text Engineering

http://gate.ac.uk/
http://nlp.shef.ac.uk/
Hamish Cunningham
Department of Computer Science, University of Sheffield
ENST, Paris, 20/1/2003
Natural Language Engineering in Sheffield:
• One of the largest Human Language Technology groups in the EU
• 50 staff in Language and Speech Processing; 25 in Information Retrieval, including 6 professors
• A focus on scientific method in AI (participation in all the leading quantitative evaluation programmes in the US)
• A focus on engineering high-quality open-source software for applications and demonstrators

GATE is…
• An architecture: a macro-level organisational picture for language engineering (LE) software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
• Free software (LGPL). Mature, robust software (in development since 1995). Download at http://gate.ac.uk/download
Comes with…
• Some free components... and wrappers for other people's components
• Tools for: evaluation; visualisation/editing; persistence; IR; IE; dialogue; ontologies; etc.

Applications; languages
GATE has been used for a variety of applications, including:
• MUMIS: automatic creation of semantic indexes for multimedia programme material
• MUSE: a multi-genre IE system
• EMILLE: a 70 million word corpus of Indic languages
• Metadata for Medline (at Merck)
• Creation of metadata for Semantic Web Services; documentation using NLG
• HSE: summarisation of health and safety information from company reports
• OldBaileyIE: NE recognition on 17th century Old Bailey Court reports
• AKT: language technology in knowledge management
• AMITIES: call centre automation
• Digital libraries / e-philology for ancient languages researchers
• Various Medical Informatics and database technology projects
• IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and French (Arabic, Chinese and Russian next year)

Some users…
At the time of writing, a representative fraction of GATE users includes:
• Longman Pearson publishing, UK;
• BT Exact Technologies, UK;
• Merck KGaA, Germany;
• Canon Europe, UK;
• Knight Ridder (the second biggest US news publisher);
• BBN Technologies, US;
• Sirma AI Ltd., Bulgaria;
• Resco AB, Sweden/Finland/Germany;
• Glaxo Smith Kline Plc: drug-based navigation of Medline abstracts;
• Master Foods NV: extraction of commodities events from news;
• the American National Corpus project, US;
• Imperial College, London, the University of Manchester, Queen Mary College, UMIST, the University of Karlsruhe, Vassar College, ISI / the University of Southern California and a large number of other UK, US and EU universities;
• the Perseus Digital Library project, Tufts University, US.

Architectural principles
• Non-prescriptive, theory neutral (strength and weakness)
• Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of tools like Protégé, Jena and Weka)
• (Almost) everything is a component, and component sets are user-extendable
Component-based development
• An OO way of chunking software: Java Beans
• GATE components: CREOLE (Collection of REusable Objects for Language Engineering) = modified Java Beans
• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.

GATE Language Resources
GATE LRs are documents, ontologies, corpora, lexicons, …
Documents / corpora:
• GATE documents are loaded from local files or the web...
• Diverse document formats: text, HTML, XML, email, RTF, SGML.
Processing Resources
Algorithmic components known as PRs: beans with execute methods.
• All PRs can handle Unicode data by default.
• Clear distinction between code and data (simple repurposing).
• 20-30 freebies with GATE
• e.g. Named entity recognition; WordNet; Protégé; Ontology; OntoGazetteer; DAML+OIL export; Information Retrieval based on Lucene
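
The "bean with an execute method" idea can be sketched in plain Java. This is a minimal illustration only: the names ExecutableSketch and WhitespaceTokeniserSketch are hypothetical and are not GATE classes. The point it shows is that data goes in and out through bean properties, while the algorithm lives in execute().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a processing resource: anything with an execute method. */
interface ExecutableSketch {
    void execute();
}

/** A bean whose input (text) and output (tokens) are properties, kept apart from the code. */
final class WhitespaceTokeniserSketch implements ExecutableSketch {
    private String text = "";
    private final List<String> tokens = new ArrayList<>();

    public void setText(String text) {
        this.text = text;
    }

    public List<String> getTokens() {
        return tokens;
    }

    /** Splits the current text on runs of whitespace and stores the resulting tokens. */
    @Override
    public void execute() {
        tokens.clear();
        for (String candidate : text.trim().split("\\s+")) {
            if (!candidate.isEmpty()) {
                tokens.add(candidate);
            }
        }
    }
}
```

Because the text is a property rather than a constructor argument, the same instance can be pointed at new data and executed again, which is one reading of "simple repurposing".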

Visual Resources

Displaying Coreference Information

Displaying Syntactic Information

Lexicon Support – WordNet example

A Language Analysis Example
[Diagram. Labels: HTML docs; XML docs; RTF docs; GATE Format Handlers; document content; document metadata; document format data; linguistic data; ANNIE; POS tagger; named entity; coreference; event extraction; Custom application 1; relational database (Oracle/PostgreSQL); file storage]

Building IE Components in GATE (1)
The ANNIE system: a reusable and easily extendable set of components

Building IE Components in GATE (2)
JAPE: a Java Annotation Patterns Engine
• Light, robust regular-expression-based processing
• Cascaded finite state transduction
• Low-overhead development of new components
    Rule: Company1
    Priority: 25
    (
      ( {Token.orthography == upperInitial} )+
      {Lookup.kind == companyDesignator}
    ):companyMatch
    -->
    :companyMatch.NamedEntity = { kind = company, rule = "Company1" }
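
To make the company rule above concrete, here is a rough plain-Java sketch of the same pattern: one or more upper-initial tokens followed by a company designator. It is not JAPE and uses no GATE classes. CompanyRuleSketch and its Token type are hypothetical, and the sketch assumes, for simplicity, that the designator lookup is stored on the token itself rather than as a separate annotation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plain-Java rendering of the company pattern; not GATE or JAPE code. */
final class CompanyRuleSketch {

    /** A token carrying the two properties the rule tests; lookupKind may be null. */
    static final class Token {
        final String text;
        final String orthography;
        final String lookupKind;

        Token(String text, String orthography, String lookupKind) {
            this.text = text;
            this.orthography = orthography;
            this.lookupKind = lookupKind;
        }
    }

    private static boolean isDesignator(Token t) {
        return "companyDesignator".equals(t.lookupKind);
    }

    /** Returns the text of every span matching (upperInitial token)+ followed by a designator. */
    static List<String> findCompanies(List<Token> tokens) {
        List<String> matches = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int j = i;
            // Consume the run of upper-initial tokens that are not themselves designators.
            while (j < tokens.size()
                    && "upperInitial".equals(tokens.get(j).orthography)
                    && !isDesignator(tokens.get(j))) {
                j++;
            }
            boolean hasName = j > i;
            boolean hasDesignator = j < tokens.size() && isDesignator(tokens.get(j));
            if (hasName && hasDesignator) {
                StringBuilder span = new StringBuilder();
                for (int k = i; k <= j; k++) {
                    if (k > i) {
                        span.append(' ');
                    }
                    span.append(tokens.get(k).text);
                }
                matches.add(span.toString());
                i = j + 1; // resume scanning after the matched span
            } else {
                i = Math.max(j, i + 1); // skip the failed run, always moving forward
            }
        }
        return matches;
    }
}
```

A real finite state transducer would explore alternative matches and apply rule priorities; this sketch takes a single greedy left-to-right pass, which is enough to show what the pattern selects.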

Performance Evaluation
• At document level: annotation diff
• At corpus level: corpus benchmark tool, tracking the system's performance over time
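
Document-level comparison of system output against a gold standard is usually summarised as precision, recall and F-measure. The sketch below is an assumption-laden illustration, not the GATE annotation diff tool: AnnotationDiffSketch is a hypothetical name, annotations are encoded as "start:end:type" strings, and only exact matches count as correct (no credit for partial overlaps).

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical strict-match scorer; not the GATE annotation diff implementation. */
final class AnnotationDiffSketch {

    /** Encodes an annotation as "start:end:type" so that sets can be compared directly. */
    static String span(int start, int end, String type) {
        return start + ":" + end + ":" + type;
    }

    /** Number of response annotations that exactly match a key (gold) annotation. */
    static int correct(Set<String> key, Set<String> response) {
        Set<String> both = new HashSet<>(key);
        both.retainAll(response);
        return both.size();
    }

    /** Fraction of the system's annotations that are correct. */
    static double precision(Set<String> key, Set<String> response) {
        return response.isEmpty() ? 0.0 : (double) correct(key, response) / response.size();
    }

    /** Fraction of the gold annotations that the system found. */
    static double recall(Set<String> key, Set<String> response) {
        return key.isEmpty() ? 0.0 : (double) correct(key, response) / key.size();
    }

    /** Balanced harmonic mean of precision and recall. */
    static double fMeasure(Set<String> key, Set<String> response) {
        double p = precision(key, response);
        double r = recall(key, response);
        return (p + r == 0.0) ? 0.0 : 2.0 * p * r / (p + r);
    }
}
```

With three gold annotations and four system annotations of which two match exactly, precision is 2/4, recall is 2/3 and the F-measure is 4/7.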

Regression Testing – Corpus Benchmark Tool
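
Tracking performance over time amounts to comparing the current scores against a stored baseline and flagging drops. The following is a hypothetical sketch of that idea, not the corpus benchmark tool itself: the class name, the per-document score maps and the tolerance parameter are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical regression check over per-document scores; not the GATE tool. */
final class BenchmarkRegressionSketch {

    /**
     * Returns, in alphabetical order, the documents whose current score is missing
     * or has fallen below the baseline score by more than the given tolerance.
     */
    static List<String> regressions(Map<String, Double> baseline,
                                    Map<String, Double> current,
                                    double tolerance) {
        List<String> worse = new ArrayList<>();
        // TreeMap gives a deterministic, sorted iteration order for the report.
        for (Map.Entry<String, Double> entry : new TreeMap<>(baseline).entrySet()) {
            Double now = current.get(entry.getKey());
            if (now == null || now < entry.getValue() - tolerance) {
                worse.add(entry.getKey());
            }
        }
        return worse;
    }
}
```

The scores could be the per-document F-measures from the previous run of the system; the tolerance keeps small fluctuations from being reported as regressions.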

The Semantic Web and GATE
GATE is being used for the development of (semi-)automatic methods for:
• linking web pages to ontologies using Information Extraction;
• learning and evolving ontologies via IE and lexical semantic network traversal.

Populating Ontologies with IE

Protégé and Ontology Management

Information Retrieval Support
Based on the Lucene IR engine

Editing Multilingual Data
GATE Unicode Kit (GUK)
Java provides no special support for text input (this may change)
• Support for defining additional Input Methods (IMs)
• Currently 30 IMs for 17 languages
• Pluggable in other applications
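
One way to picture a user-defined input method is as a table from key sequences to Unicode output, applied with longest-match-first lookup. The sketch below illustrates that general idea only; it does not reflect how GUK defines or loads its input methods, and InputMethodSketch, define and type are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical key-sequence-to-Unicode mapper; not the GUK implementation. */
final class InputMethodSketch {
    private final Map<String, String> keyMap = new HashMap<>();
    private int longestKey = 0;

    /** Registers a key sequence (e.g. "th") and the text it should produce. */
    void define(String keys, String output) {
        keyMap.put(keys, output);
        longestKey = Math.max(longestKey, keys.length());
    }

    /** Converts typed keystrokes, preferring the longest defined sequence at each position. */
    String type(String keystrokes) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < keystrokes.length()) {
            boolean matched = false;
            int maxLen = Math.min(longestKey, keystrokes.length() - i);
            for (int len = maxLen; len >= 1 && !matched; len--) {
                String produced = keyMap.get(keystrokes.substring(i, i + len));
                if (produced != null) {
                    out.append(produced);
                    i += len;
                    matched = true;
                }
            }
            if (!matched) {
                out.append(keystrokes.charAt(i)); // unmapped keys pass through unchanged
                i++;
            }
        }
        return out.toString();
    }
}
```

For example, after define("t", "\u03C4"), define("th", "\u03B8") and define("a", "\u03B1"), typing "tha" yields theta followed by alpha, because "th" is tried before "t".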

Processing Multilingual Data
All the visualisation and editing tools for multilingual LRs use enhanced Java facilities.

Dialogue Systems
• GATE is being used in the AMITIES project for automating call centres
• Creation of dialogue processing server components to run in the Galaxy Communicator architecture
• Easy adaptation of the portable IE components to work on noisy ASR output
• Robustness and speed of GATE components are vital for real-time dialogue systems

The MUMIS project
• Multimedia Indexing and Searching Environment
• Composite index of a multimedia programme from multiple sources in different languages
• ASR, video processing, information extraction (Dutch, English, German), merging, user interface
• University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA
• Yorick Wilks, Hamish Cunningham, Horacio Saggion, Kalina Bontcheva, Diana Maynard, Oana Hamza, Cristian Ursu

The Whole Picture
[Diagram. Labels: Ontology & Lexicon; formal text sources (DE, NL, EN); IE; annotations; merging; final annotations; video & audio signal; speech signals; ASR; transcriptions; query; user interface; results; multimedia database]

User Interface

Play

Conclusion
GATE: an infrastructure that lowers the overhead of creating and embedding robust NLP components
Further information: http://gate.ac.uk/
• Online demos, tutorials and documentation
• Software downloads
• Talks and papers