Corpus Technology at the IDS: Corpora and Corpus Analysis Methods
- Slides: 66
Cyril Belica, Bratislava, 27/01/03

Affiliation
- Institut für Deutsche Sprache, Mannheim (Eichinger)
  - departments
    - Grammatik (grammar)
    - Pragmatik (pragmatics)
    - Lexik (lexis; Haß-Zumkehr)
  - Arbeitsgruppe für Korpustechnologie (corpus technology working group; Belica)

Arbeitsgruppe für Korpustechnologie
- staff
  - 2.5 IT experts
  - 1.5 linguists
  - 2 undergraduates
- research framework
  - corpus-based lexicography
  - strictly no NLP

Credos
- more data is better data (R. Mercer)
- minimal assumption (J. Sinclair)
- no unsupervised language interpretation by the machine
- language independence

Credos
- race for the ultimate highest authority: expert competence vs. corpus evidence
- race for the ultimate least authority: prescriptive user vs. language-aware tools
- corpus representativeness = (corpus size + diversity of external evidence + corpus documentation) × corpus composition tools

Projects
- Corpus Technology
- COSMAS
- Wissen über Wörter (Knowledge about Words)
- DEREKO – Deutsches Referenzkorpus (German Reference Corpus)

Projects: Corpus Technology
- Work Packages
  - Corpus Acquisition
  - Corpus Analysis Methods

Project: Corpus Technology: Work Package: Corpus Acquisition
- German only
- since the mid-sixties
- three copyright status levels
- composition (see Credos)
- annotations
- text model
- access
- users

Project: Corpus Technology: Work Package: Corpus Acquisition: Goals
- find
- analyze
- get copyright
- buy
- convert
- document
- archive

Project: Corpus Technology: Work Package: Corpus Acquisition: Corpus Size
- the world's largest collection of German corpora

Project: Corpus Technology: Work Package: Corpus Acquisition: Text Model
- TEI/CES-based with proprietary extensions
  - structure
  - bibliography
  - primary text
- structure
  - three-level hierarchy
    - corpus
    - document
    - text

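The three-level hierarchy (corpus, document, text) can be pictured as nested containers. The following is only an illustrative sketch: the class names, field names, identifiers, and the idea that a document groups the texts of one source are assumptions made for this example, not the actual IDS text model or its TEI/CES encoding.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Text:
    """Lowest level: one primary text plus its bibliographic record."""
    text_id: str
    primary_text: str  # kept as close to the original as possible
    bibliography: Dict[str, str] = field(default_factory=dict)


@dataclass
class Document:
    """Middle level: groups related texts (grouping criterion is assumed)."""
    doc_id: str
    texts: List[Text] = field(default_factory=list)


@dataclass
class Corpus:
    """Top level: groups documents."""
    corpus_id: str
    documents: List[Document] = field(default_factory=list)

    def iter_texts(self) -> Iterator[Text]:
        """Walk all texts of all documents in order."""
        for document in self.documents:
            yield from document.texts


# Hypothetical identifiers and content, for illustration only.
demo = Corpus("DEMO", [
    Document("DEMO/JAN", [
        Text("DEMO/JAN.00001", "Der Eintritt ist frei.", {"date": "2003-01-27"}),
        Text("DEMO/JAN.00002", "Freie Fahrt für freie Bürger.", {"date": "2003-01-27"}),
    ]),
])
text_ids = [t.text_id for t in demo.iter_texts()]
```
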
Project: Corpus Technology: Work Package: Corpus Acquisition: Text Model
- bibliography
  - exhaustive documentation
  - pagination & page numbering
  - text dating vs. language dating
- primary text model
  - original, AFAP (no typo correction etc.)
  - minimal assumption
  - exceptions: tokenizer, sentence boundaries
  - layout if linguistically relevant

Project: Corpus Technology: Work Package: Corpus Acquisition: Corpus Availability
- via the COSMAS toolbox
- free online access
  - http://corpora.ids-mannheim.de/cosmas
- sign-up required
- no commercial use by default

Project: Corpus Technology: Work Package: Corpus Acquisition: Users
- linguistics
  - IDS
  - worldwide
- psychology, neurology, cognitive science, medicine, information technology, language teaching, military and intelligence, media, etc.

Project: COSMAS
- life cycle
- tokenizer
- indexing methods
- query language
- virtual corpora
- C II extensions
- performance issues
- further features

Project: COSMAS: Life Cycle
[Timeline figure, 1991 to 2003. First track: design, implementation, beta release, LAN production release, WAN production release, web production release. Second track: design, implementation, beta release, LAN release.]

Project: COSMAS: Tokenizer
- ambiguous
- tunable by user options
- word frequencies change according to option settings

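To see why word frequencies depend on tokenizer settings, consider the minimal sketch below. The two options (`split_hyphens`, `fold_case`) and the sample sentence are invented for illustration; the slides do not say which options the COSMAS tokenizer actually offers.

```python
import re
from collections import Counter


def tokenize(text, split_hyphens=False, fold_case=False):
    """Hypothetical tokenizer with two user options (both are assumptions)."""
    if fold_case:
        text = text.lower()
    letters = r"[^\W\d_]+"  # one or more Unicode letters
    # Either treat hyphenated compounds as one token or split them up.
    pattern = letters if split_hyphens else letters + r"(?:-" + letters + r")*"
    return re.findall(pattern, text)


sample = "Die E-Mail kam an. Die Mail-Adresse war frei, die Adresse stimmt."

default_counts = Counter(tokenize(sample))
split_counts = Counter(tokenize(sample, split_hyphens=True))
folded_counts = Counter(tokenize(sample, fold_case=True))

# "Mail" does not exist as a token by default, but occurs twice once
# hyphenated compounds are split; case folding merges "Die" and "die".
print(default_counts["Mail"], split_counts["Mail"], folded_counts["die"])
```

The same text therefore yields different frequency lists per setting, which is one reason frequencies cannot simply be precomputed once.
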
Project: COSMAS: Indexing Methods
- multiple
- adaptive
- text annotations

Project: COSMAS: Query Language
- C-II
- case, lemma, wildcards, reference, word/lemma lists, list/query import, logical operators, inclusive/exclusive proximity, min/max spans, word/sentence/paragraph/text metrics, annotations etc.

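As an illustration of what a proximity operator with min/max spans on a word metric does, here is a small sketch. It is plain Python, not C-II query syntax; the function name and its parameters are hypothetical.

```python
def proximity_hits(tokens, left, right, min_dist=1, max_dist=3):
    """Return (i, j) index pairs where `left` is followed by `right`
    at a word distance between min_dist and max_dist (inclusive)."""
    right_positions = [j for j, tok in enumerate(tokens) if tok == right]
    hits = []
    for i, tok in enumerate(tokens):
        if tok == left:
            hits.extend((i, j) for j in right_positions
                        if min_dist <= j - i <= max_dist)
    return hits


tokens = "er wurde gestern auf freien Fuß gesetzt".split()

# "gesetzt" is the 2nd word to the right of "freien".
within_two = proximity_hits(tokens, "freien", "gesetzt", min_dist=1, max_dist=2)
adjacent_only = proximity_hits(tokens, "freien", "gesetzt", min_dist=1, max_dist=1)
```

A sentence or paragraph metric would work the same way, with distances counted in sentence or paragraph indices instead of word positions.
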
Project: COSMAS: Virtual Corpora
- corpus composition is a use-related issue rather than an acquisition-related issue

Project: COSMAS: Virtual Corpora
- corpus composition tool
  - by explicit external evidence reference
  - by explicit internal evidence reference
  - by target external evidence distribution

Project: COSMAS: Virtual Corpora
- types & properties
  - random
  - predefined
  - user-defined
  - loadable
  - overlapping
  - evolving

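A virtual corpus can be thought of as a selection over the archive that is defined at query time rather than at acquisition time. The sketch below assumes that "external evidence" means metadata such as year or genre and "internal evidence" means properties of the text itself; the record fields, sample texts, and function name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    text_id: str
    year: int    # external evidence (metadata)
    genre: str   # external evidence (metadata)
    body: str    # source of internal evidence


ARCHIVE = [
    TextRecord("t1", 1995, "newspaper", "Der Eintritt ist frei."),
    TextRecord("t2", 1999, "fiction", "Sie ließ ihrer Phantasie freien Lauf."),
    TextRecord("t3", 2001, "newspaper", "Freie Wahlen wurden zugesagt."),
    TextRecord("t4", 2002, "newspaper", "Das Wetter bleibt mild."),
]


def virtual_corpus(archive, external=None, internal=None):
    """Select texts by a metadata predicate and/or a content predicate."""
    return [t for t in archive
            if (external is None or external(t))
            and (internal is None or internal(t))]


# Composition by external evidence: newspaper texts from 2000 onwards.
news_since_2000 = virtual_corpus(
    ARCHIVE, external=lambda t: t.genre == "newspaper" and t.year >= 2000)

# Composition by internal evidence: texts containing the string "frei".
mentions_frei = virtual_corpus(
    ARCHIVE, internal=lambda t: "frei" in t.body.lower())
```

Because the two selections are independent views on the same archive, they may overlap, and re-running them after new texts arrive gives an evolving corpus.
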
Project: COSMAS: C II Extensions
- full SGML support
- concurrent annotation levels
- handling of discontiguous text spans
- multiple corpora active in parallel
- support for contrastive studies
- sound, multimedia
- hub to an external SGML/XML viewer (in general: multimedia player)

Project: COSMAS: C II Extensions
[Figure labels: GUI, C II help]

Project: COSMAS: Performance Issues
- search speed
  - result cache / bootstrapping
  - proximity logic / bucketing
  - computing frequencies on the fly due to
    - virtual corpora
    - ambiguous tokenizer
    - superset corpora

Project: COSMAS: Performance Issues
- fast filtering/sorting of KWICs
- parallelism during full-text access
  - software-side / threading
  - hardware-side / jukeboxes

Project: COSMAS: Management
- user management
- corpus access rights

Project: COSMAS: Further Features
- lemmatizer as access provider
- hit set randomizing
- diachronic plots
- restricted full-text export
- API for batch processing
- Web connectivity

Project: Wissen über Wörter: WiW: Goals and Target Audience
- long-term lexicographical initiative
- lexical issues of standard German
  - documentation
  - explanation for the general public
  - lexicological description for experts
- 250,000 to 300,000 headwords
- hypertext-based information system

Project: Wissen über Wörter: Components
- Lexicographic Authoring
- Linguistic Framework
- Software Support

Project: Wissen über Wörter: The Structure of an Entry
- 3 lemma types
  - word element
  - word
  - MWU (multi-word unit)
- co-occurrence information
- polysemic information
- sense #1 to sense #n, each with
  - orthography and pronunciation
  - meaning and usage
  - grammar
  - language reflection
  - diachronic and topical

Project: Wissen über Wörter: Traditional Corpus-Based Approach
- testing hypotheses, e.g. concerning the
  - existence of corpus evidence
  - corpus frequency of a phenomenon
  - dating of the first corpus evidence
  - genre, topic, style, variety, etc. typical for a phenomenon
  - etc.
- documenting competence-based claims

Project: Wissen über Wörter: WiW Corpus-Based Approach
[Diagram contrasting the traditional workflow with the WiW workflow. Step labels: look up, let discover structures, inspect, try an interpretation.]
Discover structures? How?

Project: Corpus Technology: Work Package: Corpus Analysis Methods
- fast full-text retrieval algorithms
- collocation analysis and clustering
- lemmatizer
- annotation wizard
- collocation database
- collocation explorer
- topic spotting

Project: Corpus Technology: Work Package: Corpus Analysis Methods: Collocation Analysis and Clustering
- focal point: detecting and structuring lexical cohesion
- discovering hierarchical structures within a set of search hits based on a collocational patterning of the search terms
- independent observed and expected sets: virtual corpus again

Search query (Suchanfrage)

Results overview (Ergebnisübersicht)

Standard KWIC (Normaler Kwic)

Analysis parameters (Analyseparameter)

Significance list (Signifikanzliste)

Significance list, fine-grained (Signifikanzliste, fein)
- collocator: Plätze; typical position in context: from the 1st word to the left to the 1st word to the right
- collocator: gesetzt; typical position in context: 2nd word to the right

Significance list, "Stücken"

KWIC, "Stücken"

Project: Wissen über Wörter: Structuring the Corpus Evidence
- the CA (collocation analysis) is in fact just another way of sorting the corpus evidence, other than by date, bibliography, etc.
- but: rather than sorting by some text-external criterion, the CA orders the evidence by text-internal patterning
- benefits
  - improving mass data introspection by
    - focusing the user's view on typical uses
    - weighting the degree of typicality
    - collecting and masking off singularities

Project: Wissen über Wörter: CA: Lexicographer's View
interpretation of collocational patterns by further generalizations with respect to
- word class
- POS
- sense
- topic
- domain
- MWU
- etc.

Project: Wissen über Wörter: Collocational Patterns of &frei: word class (examples)
- noun: Eintritt, Verkauf, Platz, Universität, Meinungsäußerung, Wahlen, Fahrt, Beruf, Journalist
- verb (mostly participle): gesetzt, erfunden, übersetzt, gewählt, geworden, bewegen, wählen, herumlaufen, leben, entscheiden, bestimmen

Project: Wissen über Wörter: Collocational Patterns of &frei: sense (examples)
- "free of charge": Eintritt, Zugang, Zutritt
- "not occupied": Platz, Zimmer, Stelle, Betten, Parkplatz
- "not limited": Verkauf, Zugang, Fahrt, Training, Wahlen, Arztwahl, Aussicht, Personenverkehr
- "creative": Technik, Training, Improvisation, entfalten, Stil

Project: Wissen über Wörter: Collocational Patterns of &frei: topic and domain (examples)
- "human rights": Meinungsäußerung, Wahlen, Bürger, Presse, Entfaltung, Religionsausübung, Wort, Berichterstattung
- "economy": Marktwirtschaft, Wettbewerb, Warenverkehr, Aktionäre, Wirtschaftsverband, Handel, Kapitalverkehr, Strommarkt, Unternehmertum
- "occupation": Journalist, Autor, Schriftsteller, Publizist

Project: Wissen über Wörter: Multi-Word Units based on &frei
- Fuß: [auf freien Fuß setzen]
- Lauf: [freien Lauf lassen]
- Himmel: [unter freiem Himmel]
- Meinungsäußerung: [Recht auf freie Meinungsäußerung]
- Fahrt, Bürger: [freie Fahrt für freie Bürger]
- Träger: [Freie Träger]
- Wildbahn: [in freier Wildbahn]
- Weg: [Weg frei machen]
- Hand: [freie Hand lassen]
- Fall: [im freien Fall]
- Motto: [frei nach dem Motto: „…“]
- Erfunden: [frei erfunden]

Project: Wissen über Wörter: Multi-Word Units based on &frei
- Bewegen: [frei bewegen]
- Wirtschaft: [freie Wirtschaft]
- Frank: [frank und frei]
- Bahn: [Bahn frei! Freie Bahn]
- Geleit: [freies Geleit]
- Radikale: [freie Radikale]
- Logis: [freie Kost und Logis]
- Betten: [freie Betten]
- Manege: [Manege frei!]
- Kräfte: [freies Spiel der Kräfte]
- Kopf: [frei im Kopf sein]
- Minute: [jede freie Minute]
- etc.

Project: Wissen über Wörter: Collocators of MWUs
- verb collocators of "freies Geleit": zusichern, gewähren, zusagen, fordern, bekommen, verlangen, erhalten, garantieren, inhaftieren, sichern, zurückkehren, anbieten, stellen, versprechen, stationieren, etc.
- noun collocators of "freien Lauf lassen": Phantasie, Kreativität, Gefühl, Unmut, Emotion, Frust, Wut, Zerstörungswut, Aggression, Tränen, Gedanken, Freude, Enttäuschung, Bewegungsdrang, Zorn, Begeisterung, Spieltrieb, Assoziation, Temperament, Empörung, Hass, etc.

Project: Corpus Technology: Work Package: Corpus Analysis Methods: Collocation Analysis and Clustering
- tunable span
  - auto focus
  - optionally within sentence boundaries
- significance and grain parameters
- optionally lemmatized
- ambiguous clustering
- option to ignore function words
- several statistics (log-likelihood ratio etc.)

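The log-likelihood ratio mentioned among the statistics can be sketched in its textbook form for a 2×2 contingency table. The code below is a minimal illustration under stated assumptions, not the COSMAS implementation: the window construction, the function names, and the toy token list are invented, and clustering, lemmatization, and auto focus are left out.

```python
import math
from collections import Counter


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 = 2 * sum(O * ln(O / E)) over the cells of a 2x2 table."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # cells with zero count contribute nothing
                g2 += o * math.log(o * total / (rows[i] * cols[j]))
    return 2.0 * g2


def rank_collocates(tokens, node, span=2):
    """Rank words found within `span` tokens of `node` by G^2.
    Observed set: tokens inside the windows; expected set: all other tokens."""
    node_positions = [i for i, tok in enumerate(tokens) if tok == node]
    window = set()
    for i in node_positions:
        window.update(range(max(0, i - span), min(len(tokens), i + span + 1)))
    window.difference_update(node_positions)
    inside = Counter(tokens[j] for j in window)
    outside = Counter(tok for j, tok in enumerate(tokens)
                      if j not in window and tok != node)
    n_in, n_out = sum(inside.values()), sum(outside.values())
    ranked = []
    for word, k11 in inside.items():
        k21 = outside[word]
        expected = n_in * (k11 + k21) / (n_in + n_out)
        if k11 > expected:  # keep only words over-represented near the node
            score = log_likelihood_ratio(k11, n_in - k11, k21, n_out - k21)
            ranked.append((word, score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


tokens = ("der eintritt frei und der eintritt frei aber "
          "der platz besetzt und der saal voll").split()
ranked = rank_collocates(tokens, "frei", span=1)
```

Changing `span` changes which tokens count as observed, so the ranking depends directly on the tunable span.
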
Project: Corpus Technology: Work Package: Corpus Analysis Methods: Collocation Analysis and Clustering
- lexicological challenge
  - map lexical cohesion into linguistic categories
  - poor adequacy of traditional description
  - role of lexical cohesion in lexicographical relevance

Project: Corpus Technology: Work Package: Corpus Analysis Methods: Collocation Analysis and Clustering
Cyril Belica: "Statistische Kollokationsanalyse und Clustering". Corpus analysis module at http://corpora.ids-mannheim.de/cosmas © IDS Mannheim 1995–2002
- available online since 1995
- plugged into COSMAS I and COSMAS II
- on-the-fly analysis of any (dynamic) COSMAS corpus
- detecting not only binary word relations but also complex phrasal and/or contextual patterns, MWUs and their hierarchies

Project: Corpus Technology: Work Package: Corpus Analysis Methods: Lemmatizer
- corpus-based
- analytical rather than generative
- ambiguous
- German inflectional, compositional & derivational morphology
- fast

Project: Corpus Technology: Work Package: Corpus Analysis Methods: Annotation Wizard
- menu-driven composition of feature structures

Project: Corpus Technology: Work Package: Corpus Analysis Methods: CCDB - Collocation Database
- default corpus
- baseline word list
- six parameter sets
- batch run

Project: Corpus Technology: Work Package: Corpus Analysis Methods: Collocation Explorer
- lexical cohesion visualized
- hyperbolic projection
- lexicographical interpretation & editing
- interface to COSMAS & CCDB
- issue: lexicology

Project: Corpus Technology: Work Package: Corpus Analysis Methods: Topic Spotting
- goal
  - document clustering for
    - virtual corpus composition
    - frequency distribution analysis for
      - sense disambiguation