Fine-Grained Geographical Relation Extraction from Wikipedia
André Blessing and Hinrich Schütze
University of Stuttgart, Institute for Natural Language Processing (IMS)
Overview
• motivation
  • why are fine-grained relations important?
• self-annotation
  • automatic annotation using structured data
  • use this annotation to train a classifier
• extraction framework
• evaluation and conclusion
Geographical data provider
• GeoNames
  • gazetteer
  • names, type, coordinates
  • 8 million entries
  • 2.6 million populated places
  • community-based
  • Creative Commons Attribution 3.0 License
  • free to share
GeoNames [figure not preserved]
GeoNames: hierarchical types

| Name | German name + sample | Description |
|---|---|---|
| ADM1 | Bundesland (Rheinland-Pfalz) | Primary administrative division of a country (e.g. a state in the United States) |
| ADM2 | Regierungsbezirk | Subdivision of a first-order administrative division |
| ADM3 | Landkreis (Bad Kreuznach) | County, a subdivision of a second-order administrative division |
| ADM4 | Gemeinde (Gebroth) | Municipality, a subdivision of a third-order administrative division |
| PPL (populated place) | Stadt-, Ortsteil (Stuttgart Bad Cannstatt) | Suburb, a subdivision of a fourth-order administrative division |
GeoNames: missing hierarchical relations [figure not preserved]
Task definition
• relation definition
  • R1_2: ADM3-ADM4, Landkreis (county) and Gemeinde (municipality)
  • R0_1: ADM4-PPL, Gemeinde (municipality) and Ortsteil (suburb)
• task
  • classify all possible binary relations between named entities (NEs) in one sentence
Example: binary relations between all NEs
• Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland).
• Gebroth is a municipality in the county Bad Kreuznach in Rheinland-Pfalz (Germany).
• binary relations between NEs
  • (Gebroth, Bad Kreuznach), element of R1_2
  • (Gebroth, Rheinland-Pfalz)
  • (Gebroth, Deutschland)
  • (Bad Kreuznach, Rheinland-Pfalz)
  • (Bad Kreuznach, Deutschland)
  • (Rheinland-Pfalz, Deutschland)
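The enumeration of candidate pairs on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `candidate_pairs` is hypothetical, and the named entities are assumed to be already recognised and listed in sentence order.

```python
from itertools import combinations


def candidate_pairs(entities):
    """Return every unordered pair of named entities, preserving sentence order."""
    return list(combinations(entities, 2))


# Named entities of the example sentence, in order of appearance.
entities = ["Gebroth", "Bad Kreuznach", "Rheinland-Pfalz", "Deutschland"]
pairs = candidate_pairs(entities)

for left, right in pairs:
    print(f"({left}, {right})")
```

A sentence with n named entities yields n(n-1)/2 candidates, so the four entities here give the six pairs listed above; the classifier then has to decide for each pair whether it belongs to a relation such as R1_2.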
Requirements for extraction system
• fast to develop
  • requested relation types can change
• avoid expensive manual annotation
• fine-grained relation types
  • e.g. a simple part-of relation is not sufficient
• trained system needs no structured data
  • several input sources (Wikipedia, blogs, Twitter, news)
• German data
Wikipedia as resource
• structured data
  • templates (e.g. infoboxes), links, categories, tables, lists
• unstructured data
  • written text
• high quality
  • many users
  • WikiBots
• structured data can be used to annotate unstructured data → self-annotation
Self-annotation: example
• article: Gebroth
• structured data: Landkreis (county) = Bad Kreuznach
• unstructured data: Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland).
• resulting annotation: R1_2(Gebroth, Bad Kreuznach)
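The self-annotation idea in this example can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the function `self_annotate`, its parameters, and the dictionary form of the infobox are hypothetical, and plain substring matching stands in for the more careful matching the approach needs.

```python
def self_annotate(article_title, infobox, sentences,
                  field="Landkreis", relation="R1_2"):
    """Label sentences that mention both the article entity and the infobox value.

    Plain substring matching is used here, which is a deliberate simplification.
    """
    value = infobox.get(field)
    if value is None:
        return []  # incomplete infobox: nothing to annotate
    annotations = []
    for sentence in sentences:
        if article_title in sentence and value in sentence:
            annotations.append((relation, article_title, value, sentence))
    return annotations


# Hypothetical, simplified view of the structured data for the article "Gebroth".
infobox = {"Landkreis": "Bad Kreuznach"}
sentences = [
    "Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach "
    "in Rheinland-Pfalz (Deutschland).",
]

annotations = self_annotate("Gebroth", infobox, sentences)
for relation, first, second, _ in annotations:
    print(f"{relation}({first}, {second})")
```

The labelled sentences can then serve as training data, so that the trained classifier itself no longer needs structured data at prediction time.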
Self-annotation: challenges
• infoboxes are not always filled in completely, correctly, or coherently
• matching with unstructured data
  • pattern matching is not sufficient
  • orthographic variance
  • morphology
  • multi-word expressions
  • matching needs some manual adjustment
• only one relation per article
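One way to soften the orthographic-variance problem is to normalise both the structured value and the text before comparing them. The sketch below is an assumption about how such a step could look, not the matching procedure used by the authors; the helper names `normalize` and `mentions` are hypothetical, and morphology (e.g. inflected place names) is not handled at all.

```python
import re

# Common German transliterations; str.casefold() already maps "ß" to "ss".
_UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def normalize(name):
    """Map a name to a canonical form: case-folded, umlauts expanded,
    hyphens and runs of whitespace collapsed to a single space."""
    name = name.casefold().translate(_UMLAUTS)
    return re.sub(r"[-\s]+", " ", name).strip()


def mentions(value, sentence):
    """Return True if the normalised value occurs in the normalised sentence."""
    return normalize(value) in normalize(sentence)


print(mentions("Rheinland-Pfalz", "Gebroth liegt in Rheinland Pfalz."))
```

Normalisation of this kind trades precision for recall, which is one reason the slide notes that matching still needs some manual adjustment.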
Extraction framework
• UIMA (Unstructured Information Management Architecture)
  • pipeline architecture
  • easy exchange of components
  • fast development
• extended components
  • CollectionReader for Wikipedia
  • linguistic annotation
  • supervised classifier
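The pipeline idea can be illustrated with a small Python sketch. UIMA itself is a Java framework, and nothing below uses or imitates its API; the component names and the dictionary that plays the role of the shared document are hypothetical and only show why components are easy to exchange.

```python
def tokenize(doc):
    """Annotator: add whitespace tokens to the shared document."""
    doc["tokens"] = doc["text"].split()
    return doc


def count_tokens(doc):
    """Annotator: add a token count, relying on the tokenizer's output."""
    doc["n_tokens"] = len(doc["tokens"])
    return doc


def run_pipeline(doc, components):
    """Pass the document through each component in order."""
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline({"text": "Gebroth ist eine Ortsgemeinde"},
                   [tokenize, count_tokens])
print(doc["tokens"], doc["n_tokens"])
```

Because every component reads from and writes to the same document structure, replacing one step (for example a different tokenizer) only means changing one entry in the component list.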
Extraction pipeline
[Figure: UIMA pipeline. Labels shown: German Wikipedia, JWPL, structured data, unstructured text, CollectionReader, SelfAnnotation, FSParAnnotator, FSParEngine, ClearTK, MaxEntClassifier, text, Consumer, CollectionReader, GeoNames]
Linguistic processing
• FSPar engine (Schiehlen 2003)
  • tokenizer
  • PoS tagger (based on TreeTagger)
  • chunker
  • partial dependency parser

| Token | PoS | Lemma |
|---|---|---|
| Gebroth | NE | Gebroth |
| ist | VAFIN | sein.A |
| eine | ART | ein |
| Ortsgemeinde | NN | Orts#@gemeinde |
| im | APPRART | in |
Supervised classification
• extended ClearTK annotator
• feature sets
  • F0: NE distance (baseline)
  • F1: window-based (PoS, lemma, size=2)
  • F2: chunks (parent chunks of NEs)
  • F3: dependency parse (paths between NEs)
• MaxEnt classifier
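To make the first two feature sets concrete, the sketch below builds F0 (token distance) and F1 (a window of PoS tags and lemmas around each entity) for one candidate pair. It is an illustration under assumptions, not the ClearTK feature extractors used in the system: the function name, the feature naming scheme, the treatment of "Bad Kreuznach" as a single token, and the tags and lemmas for tokens not shown on the slides are all hypothetical.

```python
def extract_features(tokens, first, second, window=2):
    """Build a feature dict for the entity pair at token indices first < second.

    tokens is a list of (word, pos, lemma) triples.
    """
    features = {"distance": second - first}  # F0: distance between the NEs
    # F1: PoS and lemma of up to `window` tokens on each side of each NE.
    for name, index in (("e1", first), ("e2", second)):
        for offset in range(-window, window + 1):
            position = index + offset
            if offset == 0 or not 0 <= position < len(tokens):
                continue  # skip the entity itself and positions outside the sentence
            _, pos, lemma = tokens[position]
            features[f"{name}_pos_{offset:+d}"] = pos
            features[f"{name}_lemma_{offset:+d}"] = lemma
    return features


# Simplified analysis of the start of the example sentence (assumed values).
tokens = [
    ("Gebroth", "NE", "Gebroth"),
    ("ist", "VAFIN", "sein"),
    ("eine", "ART", "ein"),
    ("Ortsgemeinde", "NN", "Ortsgemeinde"),
    ("im", "APPRART", "in"),
    ("Landkreis", "NN", "Landkreis"),
    ("Bad Kreuznach", "NE", "Bad Kreuznach"),
]

features = extract_features(tokens, first=0, second=6)
for key in sorted(features):
    print(key, "=", features[key])
```

A feature dictionary of this shape is what a maximum-entropy (multinomial logistic regression) classifier would consume after one-hot encoding of the string-valued features.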
Evaluation
• 9000 articles about German municipalities and suburbs
  • 5300 articles for training
  • 1800 articles for development
  • 1800 articles for final evaluation
• the R1_2 relation is also available from the Federal Statistical Office of Germany
  • used to evaluate the self-annotation
  • 99.9% correct (1 error in 1304 sentences)
Results

| Feature set | Linguistic effort | Description |
|---|---|---|
| F0 | none | distance + NE position |
| F1 | PoS tagging | window-based (size=2, PoS, lemma) |
| F2 | chunk parse | parent chunk |
| F3 | dependency parse | dependency paths between NEs |

| Classifier | Features | Precision | Recall | FP | FN |
|---|---|---|---|---|---|
| 1 | F0 | 79.0% | 55.7% | 279 | 833 |
| 2 | F0+F1 | 92.4% | 89.3% | 138 | 202 |
| 3 | F0+F2 | 90.2% | 89.5% | 182 | 198 |
| 4 | F0+F3 | 97.7% | 97.4% | 43 | 48 |
| 5 | F0..F3 | 98.8% | 97.8% | 23 | 41 |
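Precision and recall can be combined into a single balanced F-score (their harmonic mean) to compare the five classifiers at a glance. The sketch below derives these values from the reported precision and recall; the F-scores are not part of the slides, and "F-score" here is unrelated to the feature-set names F0 to F3.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (features, precision, recall) as reported in the results table.
results = [
    ("F0", 0.790, 0.557),
    ("F0+F1", 0.924, 0.893),
    ("F0+F2", 0.902, 0.895),
    ("F0+F3", 0.977, 0.974),
    ("F0..F3", 0.988, 0.978),
]

for features, precision, recall in results:
    print(f"{features:8s} F-score = {f_score(precision, recall):.3f}")
```

The harmonic mean penalises the large gap between precision and recall of the distance-only baseline, while the classifiers that use dependency paths stay close to their individual precision and recall values.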
Conclusion
• text is an important resource for context-aware systems
• self-annotation
  • automatic annotation using structured data
• Wikipedia is a valuable resource
  • structured and unstructured data
  • containing fine-grained relations
• UIMA-based implementation
• fine-grained geographical relation extraction is possible
Questions?
www.nexus.uni-stuttgart.de