SProUT: Shallow Processing with Unification and Typed Feature Structures
Jakub Piskorski, Language Technology Lab, DFKI GmbH
Warszawa, 10.01.2003

Shallow Text Processing
[Diagram: TEXT DOCUMENTS feed the Shallow Text Processing Components. Labels recovered from the figure:]
- Extracted structures: Tokens, Word Stems, Phrases, Clause structure, Named Entities, Domain-specific patterns, Semi-structured data
- Tasks: Information Extraction, Text Mining, Text Classification, Document Indexing/Retrieval, Q/A Systems, Template generation, Term association extraction, Automatic Database Construction, Building ontologies, Fine-grained concept matching, Concept indices and more accurate queries
- Application areas: MULTI-AGENTS, E-COMMERCE, EXECUTIVE INFORMATION SYSTEMS, WORKFLOW MANAGEMENT, DATA WAREHOUSING

Finite-State Based Approaches
- SPPC: purely finite-state based STP with a small number of basic predicates
- SMES: predicates inspect arbitrary properties of the input tokens/fragments
- FASTUS: uses CPSL (Common Pattern Specification Language)
- GATE: uses JAPE (Java Annotation Patterns Engine)

Motivation for SProUT
- One system for multilingual and domain-adaptive shallow text processing
- Trade-off between efficiency and expressiveness
- Modularity
- Flexible integration of different processing modules
- Portability
- Industrial standards

SProUT is joint work by: Markus Becker, Witold Drożdżyński, Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer, Feiyu Xu

SProUT Architecture
[Architecture diagram. Labels recovered from the figure: INPUT DATA, STREAM OF TEXT ITEMS, LINGUISTIC PROCESSING RESOURCES, LEXICAL RESOURCES, JTFS, XTDL GRAMMAR, REGULAR COMPILER, EXTENDED OPTIMIZED FINITE-STATE NETWORK, XTDL INTERPRETER, FINITE-STATE MACHINE TOOLKIT, GRAMMAR DEVELOPMENT ENVIRONMENT, ONLINE PROCESSING, STRUCTURED OUTPUT DATA]

Core Components – FSM Toolkit
- Finite-state machine toolkit for building, combining, and optimizing finite-state devices
- Finite-state machine models: FSA, WFSA, FST, WFST
- Arbitrary real-valued semirings
- Some new, crucial STP-relevant operations (e.g., incremental construction of minimal deterministic FSAs)
- Functionality similar to the AT&T tools

Core Components – Regular Compiler
- Definition and configuration via XML
- Unicode compatible
- Extendible set of circa 20 operations
- Scanner definitions vs. general regular expressions
- Biasing of the optimization process
- Various ways of handling ambiguities
- Direct database connection for flexible pattern-based transformation of linguistic resources into an optimized FS representation
- Regular expressions over TFSs (SProUT) with restrictions

Core Components – Typed Feature Structure Package
- Java implementation of TFSs
- Efficient unification operations
- Dynamic extension of the type hierarchy
- Other operations: subsumption checking, deep copying, path selection, feature iteration, and various printers
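
To make the unification and subsumption operations listed above concrete, here is a minimal Java sketch. It is not the JTFS API: the class `FsSketch` and its methods are hypothetical, feature structures are modelled as nested maps with string atoms, and the type hierarchy and coreferences are deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, NOT the SProUT/JTFS API.
// A feature structure is either a String atom or a Map from feature names
// to feature structures. No type hierarchy, no coreferences.
final class FsSketch {

    // Returns the unification of a and b, or null if they are incompatible.
    static Object unify(Object a, Object b) {
        if (a instanceof Map && b instanceof Map) {
            Map<String, Object> result = new LinkedHashMap<>(asMap(a));
            for (Map.Entry<String, Object> e : asMap(b).entrySet()) {
                if (result.containsKey(e.getKey())) {
                    // Shared feature: values must unify recursively.
                    Object u = unify(result.get(e.getKey()), e.getValue());
                    if (u == null) {
                        return null;
                    }
                    result.put(e.getKey(), u);
                } else {
                    // Feature only in b: copy it over.
                    result.put(e.getKey(), e.getValue());
                }
            }
            return result;
        }
        // Assumption: atoms unify only when they are equal.
        if (a instanceof String && a.equals(b)) {
            return a;
        }
        return null;
    }

    // True if 'general' carries no information that 'specific' lacks.
    static boolean subsumes(Object general, Object specific) {
        if (general instanceof Map && specific instanceof Map) {
            Map<String, Object> s = asMap(specific);
            for (Map.Entry<String, Object> e : asMap(general).entrySet()) {
                if (!s.containsKey(e.getKey())
                        || !subsumes(e.getValue(), s.get(e.getKey()))) {
                    return false;
                }
            }
            return true;
        }
        return general instanceof String && general.equals(specific);
    }

    @SuppressWarnings("unchecked")
    private static Map<String, Object> asMap(Object o) {
        return (Map<String, Object>) o;
    }

    public static void main(String[] args) {
        Object adj = Map.of("POS", "Adj", "INFL", Map.of("CASE", "dat"));
        Object sg = Map.of("INFL", Map.of("NUMBER", "sg"));
        System.out.println(unify(adj, sg));                   // merged structure
        System.out.println(unify(adj, Map.of("POS", "Noun"))); // null: POS clash
    }
}
```

A real TFS package would replace the equality test on atoms with a greatest-lower-bound lookup in the type hierarchy and would track structure sharing for coreferences.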

XTDL Formalism
- Combines typed feature structures (TFS) and regular expressions, including coreferences and functional application
- XTDL grammar rules: production part on the LHS, output description on the RHS
- TDL is used to establish a type hierarchy of linguistic entities:
    morph := sign & [POS atom, STEM atom, INFL infl]
[Figure: type hierarchy rooted at *top*, containing the types atom, *avm*, tense, present, sign, token, *rule*, infl, morph, index-avm, lang, de, en, tokentype, separator, url]

XTDL Formalism
- A set of standard regular operators:
    concatenation
    disjunction            |
    Kleene star            *
    Kleene plus            +
    optionality            ?
    n-fold repetition      {n}
    m-n span repetition    {m,n}
- Unidirectional coreference under Kleene star (and restricted iteration):
    [POS Det, ...] ([POS Adj, ..., RELN %LIST])* [POS Noun, ...] -> [..., RELN %LIST]

XTDL Formalism
    loc-pp :> morph & [POS Prep & #preposition,
                       INFL [CASE #1, NUMBER #2, GENDER #3]]
              (morph & [POS Determiner,
                        INFL [CASE #1, NUMBER #2, GENDER #3]]) ?
              (morph & [POS Adjective,
                        INFL [CASE #1, NUMBER #2, GENDER #3]]) *
              gazetteer & [TYPE general-location, SURFACE #location]
           -> [CAT location-pp,
               PREP #preposition,
               LOCATION #location].

XTDL Interpreter
1. Matching of regular patterns using unifiability (LHS)
2. LHS pattern instance creation
3. Unification of the rule instance and the matched input
- Longest-match strategy
- Ambiguities allowed
- Interpreter generates TFSs as output (cascaded architecture)
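
The longest-match strategy with ambiguities allowed can be sketched as follows. This is an illustration only, not the XTDL interpreter: the class `LongestMatchSketch` is hypothetical, tokens are reduced to plain tag strings, and `java.util.regex` over space-joined tags stands in for unifiability-based matching of TFSs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch, NOT the XTDL interpreter.
final class LongestMatchSketch {

    // A rule match over the token span [start, end).
    static final class Match {
        final String rule;
        final int start;
        final int end;

        Match(String rule, int start, int end) {
            this.rule = rule;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return rule + "[" + start + "," + end + ")";
        }
    }

    // Longest end index such that the pattern matches tags[start, end);
    // returns start if the rule does not match at this position.
    static int longestEnd(List<String> tags, int start, Pattern p) {
        for (int end = tags.size(); end > start; end--) {
            String span = String.join(" ", tags.subList(start, end));
            if (p.matcher(span).matches()) {
                return end;
            }
        }
        return start;
    }

    // Scans left to right; at each position keeps every rule that reaches
    // the longest span (ambiguities allowed), then continues after it.
    static List<Match> scan(List<String> tags, Map<String, Pattern> rules) {
        List<Match> out = new ArrayList<>();
        int i = 0;
        while (i < tags.size()) {
            int bestEnd = i;
            List<String> winners = new ArrayList<>();
            for (Map.Entry<String, Pattern> r : rules.entrySet()) {
                int end = longestEnd(tags, i, r.getValue());
                if (end > bestEnd) {
                    bestEnd = end;
                    winners.clear();
                    winners.add(r.getKey());
                } else if (end == bestEnd && end > i) {
                    winners.add(r.getKey());
                }
            }
            for (String rule : winners) {
                out.add(new Match(rule, i, bestEnd));
            }
            i = (bestEnd > i) ? bestEnd : i + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("prep", Pattern.compile("Prep"));
        rules.put("loc-pp", Pattern.compile("Prep( Det)?( Adj)* Loc"));
        // Tags standing in for "im sonnigen Rom".
        System.out.println(scan(List.of("Prep", "Adj", "Loc"), rules));
    }
}
```

Trying spans from longest to shortest is quadratic per position; it is used here only because it makes the longest-match guarantee obvious, whereas the slides describe matching over an optimized finite-state network.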

XTDL Interpreter
[Figure: matched input sequence "im sonnigen Rom" (in sunny Rome)]

XTDL Interpreter
[Figure: rule with an instantiated pattern on the LHS]

XTDL Formalism
[Figure: unified result]

Linguistic Processing Resources
- Tokenization with fine-grained token classification
- Gazetteer (static named-entity lexica)
- Morphology: full-form lexica obtained from 'compactified' MMORPH
    English    200,000 entries
    German     830,000 entries
    French     225,000 entries
    Spanish    570,000 entries
    Italian    330,000 entries
  + Shallow Compound Recognition
- Asian languages: Chinese – Shanxi, Japanese – ChaSen
- Other: Czech – HMM-based Part-of-Speech Tagging + Morphology

System Description Language
- Construction of a concrete system instance by defining a regular expression over module specifications
- All linguistic modules must implement a specific Java interface
- Automatic compilation of the system description into a single Java class

System Description Language

(M1 M2)(input)
    M1.clearState();
    M1.setInput(input);
    M1.setOutput(M1.computeOutput(M1.getInput()));
    M2.clearState();
    M2.setInput(mediateSeq(M1, M2));
    M2.setOutput(M2.computeOutput(M2.getInput()));
    return M2.getOutput();

(M*)(input)
    M.clearState();
    M.setInput(input);
    M.setOutput(mediateFix(M));
    return M.getOutput();
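
The two expansions above can be wrapped into a small self-contained Java sketch. The method names `clearState`, `setInput`, `getInput`, `computeOutput`, `setOutput`, `getOutput`, `mediateSeq`, and `mediateFix` come from the slide; everything else is an assumption made for illustration: the interface name `SdlModule`, the helper class `FnModule`, and in particular the mediator semantics (`mediateSeq` simply passes on the output of the first module, `mediateFix` re-applies the module until its output no longer changes).

```java
import java.util.function.UnaryOperator;

// Hypothetical sketch; only the method names are taken from the slide.
final class SdlSketch {

    // Assumed shape of the interface every module has to implement.
    interface SdlModule<T> {
        void clearState();
        void setInput(T input);
        T getInput();
        T computeOutput(T input);
        void setOutput(T output);
        T getOutput();
    }

    // Minimal module that wraps a pure function.
    static final class FnModule<T> implements SdlModule<T> {
        private final UnaryOperator<T> fn;
        private T input;
        private T output;

        FnModule(UnaryOperator<T> fn) { this.fn = fn; }

        public void clearState() { input = null; output = null; }
        public void setInput(T input) { this.input = input; }
        public T getInput() { return input; }
        public T computeOutput(T in) { return fn.apply(in); }
        public void setOutput(T output) { this.output = output; }
        public T getOutput() { return output; }
    }

    // Assumption: sequence mediation hands M1's output to M2 unchanged.
    static <T> T mediateSeq(SdlModule<T> m1, SdlModule<T> m2) {
        return m1.getOutput();
    }

    // Assumption: fixpoint mediation iterates until the output is stable.
    // Does not terminate if the module never reaches a fixpoint.
    static <T> T mediateFix(SdlModule<T> m) {
        T current = m.getInput();
        while (true) {
            T next = m.computeOutput(current);
            if (next.equals(current)) {
                return next;
            }
            current = next;
        }
    }

    // (M1 M2)(input), following the expansion on the slide.
    static <T> T runSeq(SdlModule<T> m1, SdlModule<T> m2, T input) {
        m1.clearState();
        m1.setInput(input);
        m1.setOutput(m1.computeOutput(m1.getInput()));
        m2.clearState();
        m2.setInput(mediateSeq(m1, m2));
        m2.setOutput(m2.computeOutput(m2.getInput()));
        return m2.getOutput();
    }

    // (M*)(input), following the expansion on the slide.
    static <T> T runStar(SdlModule<T> m, T input) {
        m.clearState();
        m.setInput(input);
        m.setOutput(mediateFix(m));
        return m.getOutput();
    }

    public static void main(String[] args) {
        SdlModule<String> trim = new FnModule<>(s -> s.trim());
        SdlModule<String> lower = new FnModule<>(s -> s.toLowerCase());
        System.out.println(runSeq(trim, lower, "  Rom "));

        SdlModule<String> squeeze = new FnModule<>(s -> s.replace("  ", " "));
        System.out.println(runStar(squeeze, "im    sonnigen   Rom"));
    }
}
```

Keeping the mediators as separate methods mirrors the slide: the module code stays unaware of how it is combined, and only the generated glue code changes when the system description changes.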

Future Work
- Optimization of grammar interpretation
- Various search strategies
- Additional linguistic processing resources
- Real data testing: large grammars and real-world texts