CLARIN-PL: Research User-driven Language Technology Infrastructure

Maciej Piasecki
Wrocław University of Technology, G4.19 Research Group
maciej.piasecki@pwr.wroc.pl
2014-11-27

Basic Notions
Humanistyka Cyfrowa, Warszawa, 2014-11-27

§ Language Technology (LT)
  § language resources and tools
  § robust in terms of quality and coverage
  § multipurpose
  § component-based
§ Language Technology Infrastructure
  § a software framework (architecture or platform)
  § for combining language tools with language resources into processing chains (or pipelines)
  § the defined processing chains are then applied to language data sources
  § interoperability, also with external systems
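The idea of a processing chain can be illustrated with a minimal sketch. It is not the CLARIN-PL implementation: the `Document` class, the two toy tools and `make_chain` are hypothetical names introduced only for illustration, and the regular expressions are crude stand-ins for real language tools.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """A text plus the annotation layers added by successive tools (hypothetical)."""
    text: str
    annotations: Dict[str, list] = field(default_factory=dict)


Tool = Callable[[Document], Document]


def split_sentences(doc: Document) -> Document:
    # Toy sentence splitter: break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", doc.text.strip())
    doc.annotations["sentences"] = [p for p in parts if p]
    return doc


def tokenize(doc: Document) -> Document:
    # Toy tokenizer: runs of word characters, or single punctuation marks.
    # It depends on the "sentences" layer produced by the previous tool.
    doc.annotations["tokens"] = [
        re.findall(r"\w+|[^\w\s]", sentence)
        for sentence in doc.annotations["sentences"]
    ]
    return doc


def make_chain(*tools: Tool) -> Tool:
    """Combine tools into one chain that applies them in the given order."""
    def run(doc: Document) -> Document:
        for tool in tools:
            doc = tool(doc)
        return doc
    return run


chain = make_chain(split_sentences, tokenize)
result = chain(Document("Ala ma kota. Kot ma Alę!"))
print(result.annotations["tokens"])
```

Each tool reads the layers it needs and adds its own, so a chain is just an ordered list of tools; a real infrastructure would additionally fix a common exchange format between components.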

LT in Humanities and Social Sciences: Barriers

§ Physical: language tools and resources are not accessible on the Internet
§ Informational: descriptions are not available or there is no means of searching
§ Technological: lack of commonly accepted standards for LT, lack of a common platform, variety of technological solutions, insufficient users' computers
§ Related to knowledge: the use of LT requires programming skills or knowledge of natural language engineering
§ Legal: licences for language resources and tools (LRTs) limit their applications

CLARIN Support for Humanities & Social Sciences

§ CLARIN is an ERIC-type consortium of
  § 11 countries (Austria, Bulgaria, Czech Republic, Denmark, Estonia, Germany, Lithuania, The Netherlands, Poland, Portugal, Sweden) and the Dutch Language Union
  § 1 observer: Norway
§ Focus area:
  § Supporting research in Humanities and Social Sciences
  § Users: researchers, PhD students, students and scientific institutions
§ CLARIN Mission
  § To significantly lower the barriers to the use of Language Technology in Humanities & Social Sciences (H&SS)
  § To facilitate or enable research methods based on automated analysis of text and speech resources

CLARIN Offer

§ Integration of different LT components into one interoperable system
  § Common, flexible meta-data standard (CMDI)
  § Central searching for resources (Virtual Language Observatory)
  § One sign-on and one login into the distributed infrastructure
  § Decreased Physical and Informational Barriers
§ Common standards: promoting, co-ordinating, harmonising
  § Web Services for Language Tools and Resources
  § Decreased Technological Barrier
§ Installation-free, access via Web Applications
  § Decreased Knowledge Barrier
§ Common licences and promotion of open access
  § Decreased Legal Barrier

CLARIN: Portal

CLARIN: Virtual Language Observatory

CLARIN: Federated Content Search (Searching Corpora)

LTI Development Paradigms

§ Bottom-up
  § a collected offer approach
  § based on linking together the already existing Language Resources and Tools
  § focused on accessibility, technical interoperability and processing chains
§ Top-down
  § following the user-centred design paradigm
  § research applications for H&SS are the starting point
§ Bi-directional
  § linking of Language Resources and Tools
  § combined with the development of research applications

Bi-directional LTI Development

§ Idea
  § development of the necessary elements
    § a distributed network infrastructure
    § basic LT processing chain
  § combined with a user-centred approach to the development of research applications
§ Top-down part
  § close co-operation with key users from the H&SS domain
  § a metaphor of an Agile-like lightweight software design method with emphasis on prototyping
  § amendments to the shape of the technical basis: LRTs, standards
  § inspirations, identification of further user needs, next iterations

CLARIN-PL: the Consortium

§ Polish scientific consortium
  § Wrocław University of Technology, G4.19 Research Group
  § Institute of Computer Science, Polish Academy of Sciences
  § Polish-Japanese Institute of Information Technology, Chair of Multimedia
  § University of Łódź, PELCRA group at the Chair of English Language and Applied Linguistics
  § Institute of Slavic Studies, Polish Academy of Sciences
  § Wrocław University
§ Goal: implementation of the Polish part of the CLARIN ERIC LTI
§ Follows the bi-directional approach to LTI development

CLARIN-PL: Mission

§ Starting point
  § Several publicly available language resources and tools for Polish,
  § but many were still lacking
  § Deeper technological barrier: restricted applications
§ CLARIN-PL Pillars:
  § CLARIN-PL Language Technology Centre (www.clarin-pl.eu)
    § the Polish node of the CLARIN distributed infrastructure
  § Complete set of the basic Language Resources & Tools for Polish
  § Research applications for H&SS
    § first set for key users and selected H&SS sub-domains

CLARIN-PL Language Technology Centre

§ Located at Wrocław University of Technology
§ Based on a modified DSpace system from LINDAT (Czech CLARIN)
§ One sign-on, one login (a member of the Pioneer.id Federation)
§ Advanced repository system for language resources
  § Persistent Identifiers for resources and tools
  § Rich CMDI meta-data: CLARIN-wide visibility in the central search
  § Interface for Federated Content Search
  § depositing service for researchers from H&SS
  § application for the Data Seal of Approval
§ Adherence to all CLARIN specifications about standards and protocols
§ Web Services for LRTs:
  § the basic processing chain for Polish
  § prototype system for flexible composition of natural language processing chains
  § support for developers: SOAP & REST interfaces
§ Web Applications for LRTs
§ Knowledge Sharing: expertise and support for the users

CLARIN-PL: Language Resources

1. Polish Morphological Dictionary
2. Polish Speech Corpora
3. Annotated Polish Corpora
4. Bilingual Corpora
5. Polish Historical Corpus
6. Semantic lexicon
   § Wordnet for Polish (plWordNet 3.0)
   § formal description of lexical meanings
7. Dictionary of Multiword Expressions
8. Bilingual semantic lexicon
9. Lexicon of Proper Names
10. Syntactic-semantic Valency Dictionary
11. Robust syntactic-semantic grammar

CLARIN-PL: Language Resources

§ Starting point: a set of large resources
  § a huge National Corpus of Polish (1 billion tokens)
  § plWordNet 2.1: a very large wordnet for Polish
  § Korpus Politechniki Wrocławskiej: an open Polish corpus with rich annotation
§ Expanded resources
  § plWordNet 3.0: a huge semantic lexicon of Polish
    § a comprehensive description of the Polish lexico-semantic system (~200 000 lemmas, ~280 000 senses)
    § fully mapped to the English Princeton WordNet
    § described formally by mapping to an ontology
  § Dictionary of multiword expressions described syntactically
  § NELexicon 2.0: a huge lexicon of Polish Proper Names (2.5 million)

CLARIN-PL: Language Resources for Polish

§ Expanded resources
  § Conversational corpus (following PELCRA and NKJP)
  § A large semantic valency lexicon for Polish predicative lexical units
§ Newly built resources
  § Transcribed training-testing Polish speech corpus
  § Bilingual corpora:
    § Polish-English, Polish-Bulgarian-Russian, Polish-Lithuanian
  § Polish historical corpus (for the years 1945-1954)
  § Corpora annotated for: meta-data, anaphora, time expressions, spatial expressions, semantic relations and situations

plWordNet 2.2 in CLARIN-PL
http://plwordnet.pwr.edu.pl

CLARIN-PL: Language Tools for Polish

§ Systems for searching corpora, especially Polish corpora
  § Spokes for conversational and bilingual corpora
  § Poliqarp 2.0 for richly annotated corpora
  § Historical corpora [New]
§ Text mining (information extraction)
  § Recognition and classification of Proper Names
  § Recognition of anaphoric links
  § Recognition and classification of time expressions and spatial expressions [New]
  § Situation recognition [New]
§ Extraction of multiword expressions (collocations)
§ A generic set of morpho-syntactic tools for Polish that can be adapted to a domain specified by the user [New]

CLARIN-PL: Language Tools for Polish

§ Word Sense Disambiguation based on plWordNet
§ Shallow semantic parser [New]
§ Deep syntactic-semantic parser [New]
§ Tools for the extraction of semantic-pragmatic information from documents and collections of documents, e.g.
  § keywords [New],
  § semantic relations between text fragments
  § and text summaries

Basic Language Tools for Polish

1. Segmentation into tokens and sentences
2. Morphological analysis
3. Morphological guessing of unknown words (both without context and context sensitive)
4. Morpho-syntactic tagging
5. Word Sense Disambiguation
6. Chunker and shallow syntactic parser
7. Named Entity Recognition and disambiguation
8. Co-reference and anaphora resolution
9. Temporal expression recognition
10. Semantic relation recognition
11. Event recognition
12. Shallow semantic parser
13. Deep syntactic parser with disambiguated output: dependency and constituent
14. Deep semantic parser

CLARIN-PL: Processing Chain for Polish

CLARIN-PL: Recognition and classification of Proper Names

Bi-directional - Top-down Part: First Applications

§ Approaching users
  § already active, interested, working on large textual and speech resources, …
  § covering a maximal variety of research areas, e.g. linguistics, literary studies, psychology, political studies and sociology
  § matching the available language tools for Polish
  § the first set of several prototype applications illustrating the possibilities and facilitating identification of the needs
§ First applications
  § Spokes: searching corpora of conversational data
  § A system for collecting Polish text corpora from the Web
  § An open textometric and stylometric system focused on Polish
  § Semantic text classification for sociology
  § Literary Map

Spokes (University of Łódź)
http://spokes.clarin-pl.eu

System for Collecting Polish Text Corpora from the Web

§ Requests from the users revealed gaps in the available technology
  § existing corpus building systems were too sensitive to text encoding errors found on the Web
  § and were not designed for informal corpora like blogs
§ A system for collecting Polish text corpora from the Web had to be constructed:
  § based on tools from Masaryk University in Brno
  § detects texts that include a large number of errors (by morphological analysis)
  § supports semi-automated extraction of texts from blogs, forum posts, etc.
  § integrated with tools for processing
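The error detection step can be sketched as follows. This is an assumption-laden illustration, not the actual system: a small set of known word forms (`KNOWN_FORMS`) stands in for a morphological analyser, and the `max_unknown` threshold and both function names are hypothetical.

```python
import re

# Hypothetical stand-in for a morphological analyser: forms it would recognise.
KNOWN_FORMS = {"ala", "ma", "kota", "i", "psa"}


def unknown_ratio(text: str, known_forms: set) -> float:
    """Share of alphabetic tokens that the (stand-in) analyser does not recognise."""
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    if not tokens:
        return 1.0  # treat empty texts as unusable
    unknown = sum(1 for t in tokens if t not in known_forms)
    return unknown / len(tokens)


def keep_text(text: str, known_forms: set, max_unknown: float = 0.3) -> bool:
    """Keep a text only if the share of unrecognised tokens is small enough."""
    return unknown_ratio(text, known_forms) <= max_unknown


clean = "Ala ma kota i psa"
broken = "Ala ma kĂłta i psĂ"  # simulated encoding damage

print(keep_text(clean, KNOWN_FORMS))   # no unknown tokens
print(keep_text(broken, KNOWN_FORMS))  # 2 of 5 tokens unknown
```

Whether flagged texts are dropped, repaired or only marked is a design decision the slide does not state; the sketch simply drops them.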

Open Textometric and Stylometric System

§ System designed for the characteristic features of Polish
  § such as rich inflection and weakly constrained word order
§ Based on several existing components, including Stylo (Eder & Rybicki)
§ Enables the use of features defined on any level of the linguistic structure:
  § from the level of word forms
  § up to the level of semantic-pragmatic structures
§ Available as a Web Application and a Web Service
§ Stylometric techniques appear to be applicable to many tasks in H&SS
  § sociology (features characteristic of different subgroups), political studies (similarities and differences between political parties), literary studies …
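At the lowest level mentioned above, word forms, a stylometric comparison can be sketched as below. This is a generic illustration and not the method used by Stylo or by the CLARIN-PL system: texts are described by the relative frequencies of the most frequent word forms and compared with a plain Manhattan distance. All function names are hypothetical.

```python
import re
from collections import Counter
from typing import List


def word_forms(text: str) -> List[str]:
    """Lower-cased alphabetic word forms of a text."""
    return [w.lower() for w in re.findall(r"[^\W\d_]+", text)]


def most_frequent(texts: List[str], n: int) -> List[str]:
    """The n most frequent word forms across all texts (the feature set)."""
    counts = Counter()
    for text in texts:
        counts.update(word_forms(text))
    return [word for word, _ in counts.most_common(n)]


def relative_freqs(text: str, vocabulary: List[str]) -> List[float]:
    """Relative frequency of each vocabulary item in one text."""
    tokens = word_forms(text)
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[word] / total for word in vocabulary]


def manhattan(a: List[float], b: List[float]) -> float:
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


text_a = "a to i to a to"
text_b = "i to i a i to"
vocabulary = most_frequent([text_a, text_b], 2)
distance = manhattan(relative_freqs(text_a, vocabulary),
                     relative_freqs(text_b, vocabulary))
print(vocabulary, distance)
```

For a richly inflected language, the same scheme could use lemmas or grammatical classes instead of word forms, which is the kind of feature choice the slide refers to.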

Semantic Text Classification for Sociology

§ Users: Collegium Civitas, Warsaw
§ Goal
  § Support for large-scale analysis of the source materials
  § Automatically annotate documents and text fragments with pre-defined semantic categories
  § Definition of categories by examples
  § Automated semantic grouping of documents and text fragments
§ Support for
  § Corpus building
  § Manual annotation of the learning sub-corpus
  § Automated annotation process
  § Statistical analysis of the results
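Defining categories by examples can be sketched with a very simple classifier. This is only an illustration under stated assumptions, not the method of the CLARIN-PL application: each category is represented by the summed word counts of its example texts, and a new text receives the category with the highest cosine similarity. The sample categories and texts are invented.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def bag(text: str) -> Counter:
    """Bag of lower-cased word forms."""
    return Counter(w.lower() for w in re.findall(r"[^\W\d_]+", text))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(examples: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Build one word-count profile per category from its example texts."""
    profiles = {}
    for category, texts in examples.items():
        profile = Counter()
        for text in texts:
            profile.update(bag(text))
        profiles[category] = profile
    return profiles


def classify(text: str, profiles: Dict[str, Counter]) -> str:
    """Return the category whose profile is most similar to the text."""
    words = bag(text)
    return max(profiles, key=lambda category: cosine(words, profiles[category]))


# Invented examples standing in for a manually annotated learning sub-corpus.
profiles = train({
    "work": ["salary and job contract", "new job at the office"],
    "family": ["my mother and father", "children at home with mother"],
})
print(classify("she lost her job and salary", profiles))
print(classify("mother stayed home", profiles))
```

A production system would work on lemmatised or semantically annotated text rather than raw word forms, and would report confidence so that uncertain fragments can go back to manual annotation.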

GeTClasS: Generalised Text Classification for Sociology

Literary Map

§ Users: Digital Humanities Centre of The Institute of Literary Research (Polish Academy of Sciences)
§ Goal
  § Support for using maps in literary criticism
  § Tool for the identification of all geographical names in a literary text (or a corpus) and mapping them onto a geographical map
§ Tasks
  1. Identification and semantic classification of the referring language expressions
  2. Disambiguation of the referents
  3. Mapping the referents onto a map (geo-location)
  4. Recognition of the semantic relations and statistical analysis
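Tasks 1 and 3 can be sketched with a toy gazetteer lookup. Everything here is an illustrative assumption rather than the Literary Map implementation: `FORM_TO_LEMMA` is a hand-written stand-in for morphological analysis and name recognition, `GAZETTEER` holds approximate coordinates for two cities, and disambiguation of referents (task 2) is skipped entirely.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER = {
    "Warszawa": (52.23, 21.01),
    "Kraków": (50.06, 19.94),
}

# Hand-written stand-in for lemmatisation of inflected Polish place names.
FORM_TO_LEMMA = {
    "Warszawa": "Warszawa", "Warszawy": "Warszawa", "Warszawie": "Warszawa",
    "Kraków": "Kraków", "Krakowa": "Kraków", "Krakowie": "Kraków",
}


def place_mentions(text: str) -> Counter:
    """Count mentions of gazetteer places, normalising inflected forms."""
    counts = Counter()
    for token in re.findall(r"[^\W\d_]+", text):
        lemma = FORM_TO_LEMMA.get(token)
        if lemma in GAZETTEER:
            counts[lemma] += 1
    return counts


def map_points(text: str) -> List[Dict]:
    """Turn place mentions into points that a map layer could display."""
    return [
        {"name": name, "lat": GAZETTEER[name][0], "lon": GAZETTEER[name][1],
         "mentions": count}
        for name, count in sorted(place_mentions(text).items())
    ]


sample = "Pojechał z Warszawy do Krakowa, a potem wrócił do Warszawy."
for point in map_points(sample):
    print(point)
```

The mention counts per place are the simplest form of the statistical analysis named in task 4; ambiguous names (several places sharing one name) are exactly where the disambiguation step would be needed.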


Conclusions

§ Application of LT to research in Humanities & Social Sciences seems to be much more challenging than in commercial systems!
§ LT for Polish has reached a stage at which valuable support can be provided for research applications
§ The bi-directional approach combines
  § development of the basic, universal set of language tools and resources
  § with inspirations from the research applications

CLARIN-PL

Thank you very much for your attention!
www.clarin-pl.eu
Supported by the Polish Ministry of Science and Higher Education [CLARIN-PL]

Bi-directional: bottom-up part
PALC 2014, Łódź, 2014-11-22

§ LRTs and LRT chains can be useful …
  § if the required tools and resources exist,
  § and they are robust!
§ What is the minimal set of LRTs?
§ What kind of LRTs can be called robust?
  § automated applications in H&SS seem to require high quality of language tools and mostly large coverage of resources
§ BLARK: The Basic Language Resource Kit
  § "the minimal set of language resources that is necessary to do any precompetitive research and education at all" (Krauwer, 2003)
  § and also basic processing chains
  § possible reference point to compare LRTs for different languages