Corpus Technology at the IDS
Corpora and Corpus Analysis Methods

Cyril Belica
Bratislava, 27/01/03

Affiliation
o Institut für Deutsche Sprache, Mannheim (Eichinger)
  - departments
    - Grammatik
    - Pragmatik
    - Lexik (Haß-Zumkehr)
  - Arbeitsgruppe für Korpustechnologie (Belica)

Arbeitsgruppe für Korpustechnologie
o staff
  - 2.5 IT experts
  - 1.5 linguists
  - 2 undergraduates
o research framework
  - corpus-based lexicography
  - strictly no NLP

Credos
o more data is better data (R. Mercer)
o minimal assumption (J. Sinclair)
o no unsupervised language interpretation by the machine
o language independence

Credos
o race for the ultimate highest authority: expert competence vs. corpus evidence
o race for the ultimate least authority: prescriptive user vs. language-aware tools
o corpus representativeness = (corpus size + diversity of external evidence + corpus documentation) × corpus composition tools

Projects
o Corpus Technology
o COSMAS
o Wissen über Wörter
o DEREKO – Deutsches Referenzkorpus

Projects: Corpus Technology
o Work Packages
  - Corpus Acquisition
  - Corpus Analysis Methods

Project: Corpus Technology: Work Package: Corpus Acquisition
o German only
o since mid-sixties
o three copyright status levels
o composition (see Credos)
o annotations
o text model
o access
o users

Project: Corpus Technology: Work Package: Corpus Acquisition
Goals
o find
o analyze
o get copyright
o buy
o convert
o document
o archive

Project: Corpus Technology: Work Package: Corpus Acquisition
Corpus Size
o the world's largest collection of German corpora

Project: Corpus Technology: Work Package: Corpus Acquisition
Text Model
o TEI/CES-based with proprietary extensions
  - structure
  - bibliography
  - primary text
o structure
  - three-level hierarchy
    - corpus
    - document
    - text
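
The three-level hierarchy can be pictured as nested containers. The following is only a minimal sketch: the class names, field names and identifiers are hypothetical and are not taken from the IDS text model or from TEI/CES.

```python
from dataclasses import dataclass, field


@dataclass
class Text:
    """Lowest level: one primary text plus its bibliographic record."""
    text_id: str
    bibliography: dict[str, str]
    primary_text: str


@dataclass
class Document:
    """Middle level: groups texts."""
    doc_id: str
    texts: list[Text] = field(default_factory=list)


@dataclass
class Corpus:
    """Top level: groups documents."""
    corpus_id: str
    documents: list[Document] = field(default_factory=list)

    def iter_texts(self):
        """Yield every text, walking corpus -> document -> text."""
        for document in self.documents:
            yield from document.texts


# Hypothetical sample data, for illustration only.
corpus = Corpus("demo", [
    Document("demo.d1", [
        Text("demo.d1.t1", {"title": "First text", "year": "1999"}, "Ein Beispiel."),
        Text("demo.d1.t2", {"title": "Second text", "year": "2001"}, "Noch ein Beispiel."),
    ]),
    Document("demo.d2", [
        Text("demo.d2.t1", {"title": "Third text", "year": "2002"}, "Ein drittes Beispiel."),
    ]),
])
text_ids = [t.text_id for t in corpus.iter_texts()]
```

Keeping the bibliography on the text level, as assumed here, is one possible design; the slides do not say at which level bibliographic data is stored.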

Project: Corpus Technology: Work Package: Corpus Acquisition
Text Model
o bibliography
  - exhaustive documentation
  - pagination & page numbering
  - text dating vs. language dating
o primary text model
  - original, as far as possible (no typo correction etc.)
  - minimal assumption
    - exceptions: tokenizer, sentence boundaries
  - layout if linguistically relevant

Project: Corpus Technology: Work Package: Corpus Acquisition
Corpus Availability
o via the COSMAS toolbox
o free online access
  - http://corpora.ids-mannheim.de/cosmas
o sign-up required
o no commercial use by default

Project: Corpus Technology: Work Package: Corpus Acquisition
Users
o linguistics
  - IDS
  - worldwide
o psychology, neurology, cognitive science, medicine, information technology, language teaching, military and intelligence, media, etc.

Project: COSMAS
o life cycle
o tokenizer
o indexing methods
o query language
o virtual corpora
o C II extensions
o performance issues
o further features

Project: COSMAS
Life Cycle
[Timeline chart, 1991 to 2003. First track: design, implementation, beta release, LAN production release, WAN production release, web production release. Second track: design, implementation, beta release, LAN release.]

Project: COSMAS
Tokenizer
o ambiguous
o tunable by user options
o word frequencies change according to option settings
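
Why word frequencies depend on tokenizer options can be illustrated with a toy example. This is a sketch only: the options `split_hyphens` and `fold_case` are hypothetical and are not the actual COSMAS tokenizer settings.

```python
import re
from collections import Counter


def tokenize(text, split_hyphens=False, fold_case=False):
    """Toy tokenizer with two hypothetical user options."""
    if fold_case:
        text = text.lower()
    # A token is a run of letters; hyphenated compounds stay whole
    # unless split_hyphens is set.
    letters = r"[^\W\d_]+"
    pattern = letters if split_hyphens else letters + r"(?:-" + letters + r")*"
    return re.findall(pattern, text)


sample = "Die E-Mail kam an. Die Mail war kurz."
default_freq = Counter(tokenize(sample))
split_freq = Counter(tokenize(sample, split_hyphens=True))
folded_freq = Counter(tokenize(sample, fold_case=True))
```

With the default settings "Mail" is counted once and "E-Mail" once; with `split_hyphens=True` "Mail" is counted twice and "E-Mail" disappears as a token. The same text yields different frequency lists, so frequencies cannot simply be precomputed once per corpus.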

Project: COSMAS
Indexing Methods
o multiple
o adaptive
o text annotations

Project: COSMAS
Query Language
o C-II: case, lemma, wildcards, reference, word/lemma lists, list/query import, logical operators, inclusive/exclusive proximity, min/max spans, word/sentence/paragraph/text metrics, annotations etc.
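
Proximity search with min/max spans can be sketched in a few lines. The function below does not use C-II query syntax; its name, its parameters and the reading of "inclusive/exclusive" (presence vs. absence of the second word within the span) are assumptions made for illustration, and distances are measured in words only.

```python
def proximity_hits(tokens, node, other, min_span=1, max_span=5, exclusive=False):
    """Return positions of `node`, filtered by whether `other` occurs nearby.

    Distance is counted in words on either side of the node.
    exclusive=False keeps nodes with `other` at a distance in
    [min_span, max_span]; exclusive=True keeps nodes without it.
    """
    other_positions = [i for i, tok in enumerate(tokens) if tok == other]
    hits = []
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        near = any(min_span <= abs(i - j) <= max_span for j in other_positions)
        if near != exclusive:
            hits.append(i)
    return hits


tokens = "der Eintritt ist frei und der Platz daneben ist noch frei".split()
inclusive = proximity_hits(tokens, "frei", "Platz", max_span=3)
exclusive = proximity_hits(tokens, "frei", "Platz", max_span=3, exclusive=True)
```

Here the first "frei" (position 3) lies three words from "Platz" and is an inclusive hit, while the second (position 10) lies four words away and is returned only by the exclusive variant.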

Project: COSMAS
Virtual Corpora
o corpus composition is a use-related issue rather than an acquisition-related issue

Project: COSMAS
Virtual Corpora
o corpus composition tool
  - by explicit external evidence reference
  - by explicit internal evidence reference
  - by target external evidence distribution
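
The three composition modes can be read as three kinds of selection over an archive of texts. The sketch below assumes that "external evidence" means metadata such as year or genre and "internal evidence" means properties of the text itself; all names and the sample archive are hypothetical and do not reflect the COSMAS implementation.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    text_id: str
    year: int
    genre: str
    body: str


def compose_by_external(archive, predicate):
    """Select texts whose metadata satisfies an explicit condition."""
    return [t for t in archive if predicate(t)]


def compose_by_internal(archive, word):
    """Select texts whose content contains a given word form."""
    return [t for t in archive if word in t.body.split()]


def compose_by_distribution(archive, key, target_shares, size, seed=0):
    """Sample texts so that `key` values approximate the target shares."""
    rng = random.Random(seed)
    chosen = []
    for value, share in target_shares.items():
        pool = [t for t in archive if key(t) == value]
        wanted = min(len(pool), round(size * share))
        chosen.extend(rng.sample(pool, wanted))
    return chosen


# Hypothetical archive, for illustration only.
ARCHIVE = [
    TextRecord("t1", 1995, "press", "freie Wahlen"),
    TextRecord("t2", 1998, "press", "freier Eintritt"),
    TextRecord("t3", 2001, "press", "freie Presse"),
    TextRecord("t4", 2002, "press", "neue Regeln"),
    TextRecord("t5", 1999, "fiction", "freie Fahrt"),
    TextRecord("t6", 2000, "fiction", "ein langer Weg"),
]

by_external = compose_by_external(ARCHIVE, lambda t: t.genre == "press" and t.year >= 1998)
by_internal = compose_by_internal(ARCHIVE, "freie")
balanced = compose_by_distribution(
    ARCHIVE, lambda t: t.genre, {"press": 0.5, "fiction": 0.5}, size=4
)
genre_counts = Counter(t.genre for t in balanced)
```

Each result is just a list of references into the same archive, which is what makes such corpora "virtual": they can overlap and be recomposed without copying any text.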

Project: COSMAS
Virtual Corpora
o types & properties
  - random
  - predefined
  - user-defined
  - loadable
  - overlapping
  - evolving

Project: COSMAS
C II Extensions
o full SGML support
o concurrent annotation levels
o handling of discontiguous text spans
o multiple corpora active in parallel
o support for contrastive studies
o sound, multimedia
o hub to an external SGML/XML viewer (in general: multimedia player)

Project: COSMAS: C II Extensions
o GUI
o C II help

Project: COSMAS
Performance Issues
o search speed
  - result cache / bootstrapping
  - proximity logic / bucketing
  - computing frequencies on the fly due to
    - virtual corpora
    - ambiguous tokenizer
    - superset corpora

Project: COSMAS
Performance Issues
o fast filtering/sorting of KWICs
o parallelism during full-text access
  - software-side / threading
  - hardware-side / jukeboxes

Project: COSMAS
Management
o user management
o corpus access rights

Project: COSMAS
Further Features
o lemmatizer as access provider
o hit set randomizing
o diachronic plots
o restricted full-text export
o API for batch processing
o Web connectivity

Project: Wissen über Wörter
WiW: Goals and Target Audience
o long-term lexicographical initiative
o lexical issues of standard German
  - documentation
  - explanation for the general public
  - lexicological description for experts
o 250,000 to 300,000 headwords
o hypertext-based information system

Project: Wissen über Wörter
Components
o Lexicographic Authoring
o Linguistic Framework
o Software Support

Project: Wissen über Wörter
The Structure of an Entry
o 3 lemma types
  - word element
  - word
  - MWU
o co-occurrence information
o polysemic information
  - sense #1 to sense #n, each with
    - orthography and pronunciation
    - meaning and usage
    - grammar
    - language reflection
    - diachronic and topical

Project: Wissen über Wörter
Traditional Corpus-Based Approach
o testing hypotheses, e.g. concerning the
  - existence of corpus evidence
  - corpus frequency of a phenomenon
  - dating of the first corpus evidence
  - genre, topic, style, variety, etc. typical for a phenomenon
  - etc.
o documenting competence-based claims

Project: Wissen über Wörter
WiW Corpus-Based Approach
[Diagram contrasting two workflows, labelled "WiW" and "traditionally". Steps shown: look up; let discover structures; inspect; try an interpretation.]
Discover structures? How?

Project: Corpus Technology: Work Package: Corpus Analysis Methods
o fast full-text retrieval algorithms
o collocation analysis and clustering
o lemmatizer
o annotation wizard
o collocation database
o collocation explorer
o topic spotting

Project: Corpus Technology: Work Package: Corpus Analysis Methods
Collocation Analysis and Clustering
o focal point: detecting and structuring lexical cohesion
o discovering hierarchical structures within a set of search hits, based on the collocational patterning of the search terms
o independent observed and expected sets: virtual corpus again

Search query (Suchanfrage)

Results overview (Ergebnisübersicht)

Standard KWIC (Normaler Kwic)

Analysis parameters (Analyseparameter)

Significance list (Signifikanzliste)

Significance list, fine-grained (Signifikanzliste, fein)
o collocator: Plätze
  - typical position in context: from the 1st word to the left to the 1st word to the right
o collocator: gesetzt
  - typical position in context: 2nd word to the right
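
A "typical position in context" can be derived by counting the offsets at which a collocator occurs relative to the search word. The sketch below is an assumption about how such a profile might be computed; the function name and the sample sentence are hypothetical and it matches exact word forms only.

```python
from collections import Counter


def position_profile(tokens, node, collocate, span=5):
    """Count signed offsets (negative = left, positive = right) at which
    `collocate` occurs within `span` words of each occurrence of `node`."""
    profile = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - span)
        hi = min(len(tokens), i + span + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] == collocate:
                profile[j - i] += 1
    return profile


tokens = "er wurde auf freien Fuß gesetzt und sie wurde auf freien Fuß gesetzt".split()
profile = position_profile(tokens, "freien", "gesetzt", span=3)
```

In this sample, "gesetzt" occurs both times at offset +2, which corresponds to a typical position of "2nd word to the right".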

Significance list, "Stücken"

KWIC, "Stücken"

Project: Wissen über Wörter
Structuring the Corpus Evidence
o collocation analysis (CA) is in fact just another way of sorting the corpus evidence, other than by date, bibliography, etc.
o but: rather than sorting by some text-external criterion, CA orders by text-internal patterning
o benefits
  - improving mass data introspection by
    - focusing the user's view on typical uses
    - weighting the degree of typicality
    - collecting and masking off singularities

Project: Wissen über Wörter
CA: Lexicographer's View
interpretation of collocational patterns by further generalizations with respect to
o word class (POS)
o sense
o topic, domain
o MWU
o etc.

Project: Wissen über Wörter
Collocational Patterns of &frei
word class (examples)
o noun: Eintritt, Verkauf, Platz, Universität, Meinungsäußerung, Wahlen, Fahrt, Beruf, Journalist
o verb (mostly participle): gesetzt, erfunden, übersetzt, gewählt, geworden, bewegen, wählen, herumlaufen, leben, entscheiden, bestimmen

Project: Wissen über Wörter
Collocational Patterns of &frei
sense (examples)
o "free of charge": Eintritt, Zugang, Zutritt
o "not occupied": Platz, Zimmer, Stelle, Betten, Parkplatz
o "not limited": Verkauf, Zugang, Fahrt, Training, Wahlen, Arztwahl, Aussicht, Personenverkehr
o "creative": Technik, Training, Improvisation, entfalten, Stil

Project: Wissen über Wörter
Collocational Patterns of &frei
topic and domain (examples)
o "human rights": Meinungsäußerung, Wahlen, Bürger, Presse, Entfaltung, Religionsausübung, Wort, Berichterstattung
o "economy": Marktwirtschaft, Wettbewerb, Warenverkehr, Aktionäre, Wirtschaftsverband, Handel, Kapitalverkehr, Strommarkt, Unternehmertum
o "occupation": Journalist, Autor, Schriftsteller, Publizist

Project: Wissen über Wörter
Multi-Word Units based on &frei

Collocator          MWU
Fuß                 [auf freien Fuß setzen]
Lauf                [freien Lauf lassen]
Himmel              [unter freiem Himmel]
Meinungsäußerung    [Recht auf freie Meinungsäußerung]
Fahrt, Bürger       [freie Fahrt für freie Bürger]
Träger              [Freie Träger]
Wildbahn            [in freier Wildbahn]
Weg                 [Weg frei machen]
Hand                [freie Hand lassen]
Fall                [im freien Fall]
Motto               [frei nach dem Motto: „…“]
Erfunden            [frei erfunden]

Project: Wissen über Wörter
Multi-Word Units based on &frei

Collocator          MWU
Bewegen             [frei bewegen]
Wirtschaft          [freie Wirtschaft]
Frank               [frank und frei]
Bahn                [Bahn frei! Freie Bahn]
Geleit              [freies Geleit]
Radikale            [freie Radikale]
Logis               [freie Kost und Logis]
Betten              [freie Betten]
Manege              [Manege frei!]
Kräfte              [freies Spiel der Kräfte]
Kopf                [frei im Kopf sein]
Minute              [jede freie Minute]
etc.

Project: Wissen über Wörter
Collocators of MWUs
o verb collocators of [freies Geleit]: zusichern, gewähren, zusagen, fordern, bekommen, verlangen, erhalten, garantieren, inhaftieren, sichern, zurückkehren, anbieten, stellen, versprechen, stationieren, etc.
o noun collocators of [freien Lauf lassen]: Phantasie, Kreativität, Gefühl, Unmut, Emotion, Frust, Wut, Zerstörungswut, Aggression, Tränen, Gedanken, Freude, Enttäuschung, Bewegungsdrang, Zorn, Begeisterung, Spieltrieb, Assoziation, Temperament, Empörung, Hass, etc.

Project: Corpus Technology: Work Package: Corpus Analysis Methods
Collocation Analysis and Clustering
o tunable span
  - auto focus
  - optionally within sentence boundaries
o significance and grain parameters
o optionally lemmatized
o ambiguous clustering
o option to ignore function words
o several statistics (log-likelihood ratio etc.)
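
As an illustration of one of the statistics named above, the log-likelihood ratio for a word pair can be computed from a 2x2 table of observed counts against the counts expected under independence. This is the standard textbook form of the statistic; how the IDS module builds its observed and expected sets is not stated in the slides, and the table layout and function name below are assumptions.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G2 statistic for a 2x2 contingency table.

    k11: collocate tokens inside the node's context windows
    k12: other tokens inside the windows
    k21: collocate tokens elsewhere in the corpus
    k22: other tokens elsewhere in the corpus
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for r in range(2):
        for c in range(2):
            o = observed[r][c]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[r] * cols[c] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Collocate exactly as frequent inside the windows as elsewhere: no association.
independent = log_likelihood_ratio(10, 90, 100, 900)
# Collocate strongly over-represented inside the windows.
associated = log_likelihood_ratio(50, 50, 100, 900)
```

A value near zero means the observed counts match the expectation; larger values indicate a stronger deviation, which is what a significance threshold would then be applied to.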

Project: Corpus Technology: Work Package: Corpus Analysis Methods
Collocation Analysis and Clustering
o lexicological challenge
  - map lexical cohesion into linguistic categories
  - poor adequacy of traditional description
  - role of lexical cohesion in lexicographical relevance

Project: Corpus Technology: Work Package: Corpus Analysis Methods
Collocation Analysis and Clustering
Cyril Belica: "Statistische Kollokationsanalyse und Clustering". Corpus analysis module at http://corpora.ids-mannheim.de/cosmas © IDS Mannheim 1995–2002
o available online since 1995
o plugged into COSMAS I and COSMAS II
o on-the-fly analysis of any (dynamic) COSMAS corpus
o detecting not only binary word relations but also complex phrasal and/or contextual patterns, MWUs and their hierarchies

Project: Corpus Technology: Work Package: Corpus Analysis Methods
Lemmatizer
o corpus-based
o analytical rather than generative
o ambiguous
o German inflectional, compositional & derivational morphology
o fast

Project: Corpus Technology: Work Package: Corpus Analysis Methods
AnnotationWizzard
o menu-driven composition of feature structures

Project: Corpus Technology: Work Package: Corpus Analysis Methods
CCDB - Collocation Database
o default corpus
o baseline word list
o six parameter sets
o batch run

Project: Corpus Technology: Work Package: Corpus Analysis Methods
Collocation Explorer
o lexical cohesion visualized
o hyperbolic projection
o lexicographical interpretation & editing
o interface to COSMAS & CCDB
o issue: lexicology

Project: Corpus Technology: Work Package: Corpus Analysis Methods
Topic Spotting
o goal
  - document clustering for
    - virtual corpus composition
    - frequency distribution analysis for
      - sense disambiguation
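
Document clustering for virtual corpus composition could look roughly like the following. The slides do not name a clustering method, so this sketch swaps in a deliberately simple one: bag-of-words vectors, cosine similarity and a greedy single pass. The function names, the threshold and the sample documents are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering.

    Each document joins the first cluster whose seed (first member) is
    similar enough; otherwise it starts a new cluster.
    Returns lists of document indices.
    """
    vectors = [Counter(d.lower().split()) for d in docs]
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


docs = [
    "Wahlen Presse Bürger Wahlen",
    "Markt Handel Wettbewerb Markt",
    "Presse Bürger Meinungsäußerung",
    "Handel Wettbewerb Kapitalverkehr",
]
topics = cluster(docs)
```

Each resulting index list could then serve as the membership of a topic-specific virtual corpus, in which the frequency distribution of a word can be inspected separately.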