SProUT: Shallow Processing with Unification and Typed Feature Structures
  Jakub Piskorski
  Language Technology Lab, DFKI GmbH
  Warszawa, 10.01.2003

Shallow Text Processing
  [Diagram; labels recovered from the slide, grouping approximate]
  Input: text documents
  Core: shallow text processing components
  Extracted units: tokens, word stems, phrases, clause structure, named entities, domain-specific patterns
  Tasks: building ontologies; concept indices, more accurate queries; term association extraction; text mining; template generation; automatic database construction; information extraction; Q/A systems; semi-structured data; document indexing/retrieval; fine-grained concept matching; text classification
  Application areas: multi-agents, e-commerce, executive information systems, workflow management, data warehousing

Finite-State based approaches
  SPPC – pure finite-state based STP, small number of basic predicates
  SMES – predicates inspect arbitrary properties of the input tokens/fragments
  FASTUS – uses CPSL (Common Pattern Specification Language)
  GATE – uses JAPE (Java Annotation Patterns Engine)

Motivation for SProUT
  One System for Multilingual and Domain Adaptive Shallow Text Processing
  Trade-off between efficiency and expressiveness
  Modularity
  Flexible integration of different processing modules
  Portability
  Industrial standards

SProUT is a joint work by:
  Markus Becker, Witold Drożdżyński, Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer, Feiyu Xu

SProUT Architecture
  [Diagram; labels recovered from the slide]
  Input data, linguistic processing resources, lexical resources, stream of text items
  XTDL grammar, regular compiler, extended optimized finite-state network
  XTDL interpreter, structured output data
  Finite-state machine toolkit, JTFS
  Grammar development environment, online processing

Core Components – FSM Toolkit
  Finite-state Machine Toolkit for building, combining, and optimizing finite-state devices
  Finite-state machine models: FSA, WFSA, FST, WFST
  Arbitrary real-valued semirings
  Some new crucial STP-relevant operations (e.g., incremental construction of minimal deterministic FSAs)
  Functionality similar to AT&T tools

Core Components – Regular Compiler
  Definition and configuration via XML
  Unicode compatible
  Extendible set of circa 20 operations
  Scanner definitions vs. general regular expressions
  Biasing optimization process
  Various ways of handling ambiguities
  Direct database connection for flexible pattern-based transformation of linguistic resources into optimized FS representation
  Regular expressions over TFSs (SProUT) with restrictions

Core Components – Typed Feature Structure Package
  Java implementation of TFSs
  Efficient unification operations
  Dynamic extension of the type hierarchy
  Other operations: subsumption checking, deep copying, path selection, feature iteration, and various printers
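To make the unification operation mentioned above concrete, here is a minimal sketch under strong simplifying assumptions: feature structures are untyped, an atom is a plain String, a complex structure is a Map from feature names to values, and there are no coreferences or type hierarchy. The class name UnifySketch and the feature values are hypothetical illustrations and are not the API of the TFS package described on the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified, untyped sketch (an assumption, not the SProUT TFS package):
// a feature structure is either an atom (String) or a Map<String, Object>.
class UnifySketch {

    /** Returns the unification of a and b, or null if the two clash. */
    @SuppressWarnings("unchecked")
    static Object unify(Object a, Object b) {
        // An atom only unifies with an equal atom.
        if (a instanceof String || b instanceof String) {
            return a.equals(b) ? a : null;
        }
        Map<String, Object> left = (Map<String, Object>) a;
        Map<String, Object> right = (Map<String, Object>) b;
        // Start from a copy of the left structure; the inputs are not modified.
        Map<String, Object> result = new LinkedHashMap<>(left);
        for (Map.Entry<String, Object> e : right.entrySet()) {
            Object existing = result.get(e.getKey());
            if (existing == null) {
                // Feature only present on the right: just add it.
                result.put(e.getKey(), e.getValue());
            } else {
                // Feature present on both sides: values must unify recursively.
                Object merged = unify(existing, e.getValue());
                if (merged == null) {
                    return null;
                }
                result.put(e.getKey(), merged);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Map<String, Object> token = Map.of(
            "POS", "Adjective",
            "INFL", Map.of("CASE", "dat"));
        Map<String, Object> pattern = Map.of(
            "INFL", Map.of("NUMBER", "sg"));
        System.out.println(unify(token, pattern));
        System.out.println(unify(token, Map.of("POS", "Noun"))); // clash
    }
}
```

A real typed implementation would additionally consult the type hierarchy to compute the greatest lower bound of two types and would handle shared (coreferent) substructures, both of which this sketch leaves out.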

XTDL Formalism
  Combines typed feature structures (TFS) and regular expressions, including coreferences and functional application
  XTDL grammar rules – production part on LHS, and output description on RHS
  TDL used for establishment of a type hierarchy of linguistic entities:
    morph := sign & [POS atom, STEM atom, INFL infl]
  [Type hierarchy diagram; node labels: *top*, atom, *avm*, *rule*, sign, token, morph, infl, index-avm, tense, present, lang, de, en, tokentype, separator, url]

XTDL Formalism
  A few standard regular operators:
    concatenation          (juxtaposition)
    disjunction            |
    Kleene plus            +
    Kleene star            *
    optionality            ?
    n-fold repetition      {n}
    m-n span repetition    {m,n}
  Unidirectional coreference under Kleene star (and restricted iteration):
    [POS Det, ...] ([POS Adj, ..., RELN %LIST])* [POS Noun, ...] -> [..., RELN %LIST]

XTDL Formalism
  loc-pp :> morph & [POS Prep & #preposition,
                     INFL [CASE #1, NUMBER #2, GENDER #3]]
            morph & [POS Determiner,
                     INFL [CASE #1, NUMBER #2, GENDER #3]] ?
            morph & [POS Adjective,
                     INFL [CASE #1, NUMBER #2, GENDER #3]] *
            gazetteer & [TYPE general-location, SURFACE #location]
         -> [CAT location-pp, PREP #preposition, LOCATION #location].

XTDL Interpreter
  1. Matching of regular patterns using unifiability (LHS)
  2. LHS pattern instance creation
  3. Unification of the rule instance and matched input
  Longest match strategy
  Ambiguities allowed
  Interpreter generates TFSs as output (cascaded architecture)
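The longest match strategy named above can be illustrated with a small sketch. It assumes a heavily reduced setting: each input token is represented only by a tag string, a pattern element is a tag with a repetition range, and plain tag equality stands in for the unifiability test of step 1. The names LongestMatchSketch and Element are hypothetical and do not come from SProUT.

```java
import java.util.List;

// Sketch only: tag equality replaces the unifiability check on TFSs.
class LongestMatchSketch {

    /** One pattern element: a required tag and a repetition range [min, max]. */
    static final class Element {
        final String tag;
        final int min;
        final int max;

        Element(String tag, int min, int max) {
            this.tag = tag;
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Returns the largest end index (exclusive) of a match of pattern[idx..]
     * that starts at position start, or -1 if there is no match.
     */
    static int longestMatch(List<String> tags, int start, List<Element> pattern, int idx) {
        if (idx == pattern.size()) {
            return start; // whole pattern consumed
        }
        Element e = pattern.get(idx);
        // Count how many consecutive tokens this element could consume.
        int avail = 0;
        while (avail < e.max && start + avail < tags.size()
                && tags.get(start + avail).equals(e.tag)) {
            avail++;
        }
        // Try every allowed repetition count and keep the longest overall match.
        int best = -1;
        for (int n = avail; n >= e.min; n--) {
            best = Math.max(best, longestMatch(tags, start + n, pattern, idx + 1));
        }
        return best;
    }

    public static void main(String[] args) {
        // Shape of the loc-pp rule: Prep Det? Adj* Location
        List<Element> locPp = List.of(
            new Element("Prep", 1, 1),
            new Element("Det", 0, 1),
            new Element("Adj", 0, Integer.MAX_VALUE),
            new Element("Loc", 1, 1));
        // Tags standing in for "im sonnigen Rom" plus one following token.
        List<String> tags = List.of("Prep", "Adj", "Loc", "Verb");
        System.out.println(longestMatch(tags, 0, locPp, 0)); // prints 3
    }
}
```

The sketch returns only the end of the longest match; since the slide states that ambiguities are allowed, an interpreter in that spirit would have to keep all longest alternatives rather than a single index.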

XTDL Interpreter
  Matched input sequence “im sonnigen Rom” (in sunny Rome)

XTDL Interpreter
  Rule with an instantiated pattern on the LHS

XTDL Formalism
  Unified result

Linguistic Processing Resources
  Tokenization with fine-grained token classification
  Gazetteer (static named-entity lexica)
  Morphology
    Full-form lexica obtained from ‘compactified’ MMORPH:
      English   200,000 entries
      German    830,000 entries
      French    225,000 entries
      Spanish   570,000 entries
      Italian   330,000 entries
    + Shallow Compound Recognition
    Asian languages: Chinese – Shanxi, Japanese – Chasen
    Other: Czech – HMM-based Part-of-Speech Tagging + Morphology

System Description Language
  Construction of a concrete system instance via definition of a regular expression of module specifications
  All linguistic modules must implement a specific Java interface
  Automatic compilation of system description into a single Java class

System Description Language
  (M1 M2)(input)
    M1.clearState();
    M1.setInput(input);
    M1.setOutput(M1.computeOutput(M1.getInput()));
    M2.clearState();
    M2.setInput(mediateSeq(M1, M2));
    M2.setOutput(M2.computeOutput(M2.getInput()));
    return M2.getOutput();
  (M*)(input)
    M.clearState();
    M.setInput(input);
    M.setOutput(mediateFix(M));
    return M.getOutput();
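The two compilation schemes above can be tried out with a small self-contained sketch. The method names follow the slide, but everything else is an assumption made for illustration: modules operate on plain strings, ProcessingModule is a hypothetical stand-in for the module interface, mediateSeq simply forwards the output of the first module, and mediateFix re-applies the module until its output stops changing.

```java
import java.util.function.UnaryOperator;

// Sketch only: the real module interface and mediator semantics are not
// given in the slides; the versions below are assumptions.
class PipelineSketch {

    /** Hypothetical module with the accessor names used on the slide. */
    static final class ProcessingModule {
        private final UnaryOperator<String> fn;
        private String input;
        private String output;

        ProcessingModule(UnaryOperator<String> fn) { this.fn = fn; }

        void clearState() { input = null; output = null; }
        void setInput(String in) { input = in; }
        String getInput() { return input; }
        String computeOutput(String in) { return fn.apply(in); }
        void setOutput(String out) { output = out; }
        String getOutput() { return output; }
    }

    /** Assumed mediator for sequence: hand the output of m1 to m2 unchanged. */
    static String mediateSeq(ProcessingModule m1, ProcessingModule m2) {
        return m1.getOutput();
    }

    /** Assumed mediator for iteration: apply m until a fixpoint is reached. */
    static String mediateFix(ProcessingModule m) {
        String current = m.getInput();
        while (true) {
            String next = m.computeOutput(current);
            if (next.equals(current)) {
                return current;
            }
            current = next;
        }
    }

    /** Mirrors the code shown for (M1 M2)(input). */
    static String runSeq(ProcessingModule m1, ProcessingModule m2, String input) {
        m1.clearState();
        m1.setInput(input);
        m1.setOutput(m1.computeOutput(m1.getInput()));
        m2.clearState();
        m2.setInput(mediateSeq(m1, m2));
        m2.setOutput(m2.computeOutput(m2.getInput()));
        return m2.getOutput();
    }

    /** Mirrors the code shown for (M*)(input). */
    static String runStar(ProcessingModule m, String input) {
        m.clearState();
        m.setInput(input);
        m.setOutput(mediateFix(m));
        return m.getOutput();
    }

    public static void main(String[] args) {
        ProcessingModule trim = new ProcessingModule(String::trim);
        ProcessingModule lower = new ProcessingModule(String::toLowerCase);
        System.out.println(runSeq(trim, lower, "  Im Sonnigen Rom ")); // prints "im sonnigen rom"

        // Each pass halves runs of spaces; the fixpoint has single spaces only.
        ProcessingModule squeeze = new ProcessingModule(s -> s.replace("  ", " "));
        System.out.println(runStar(squeeze, "a    b")); // prints "a b"
    }
}
```

Note that the fixpoint loop terminates only if the module eventually reproduces its own input, so a production system would likely need an iteration bound.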

Future Work
  Optimization of grammar interpretation
  Various search strategies
  Additional linguistic processing resources
  Real data testing: large grammars and real-world texts