Research methods in corpus linguistics
Xiaofei Lu

Overview
- What is a corpus?
- Types of corpora
- Corpus design
- Where to obtain corpora
- Corpus annotation
- Corpus analysis
- Note on research project design
- Exercises and demos in between
- Future courses on corpus linguistics

What is a corpus?
- Leech (1992):
  - an unexciting phenomenon, a helluva lot of text, stored on a computer
- Francis (1982):
  - a collection of texts assumed to be representative of a given language, dialect, or other subset of a language to be used for linguistic analysis
- Sinclair (1991):
  - a collection of naturally-occurring language text, chosen to characterise a state or a variety of language

Types of corpora
- General-purpose monolingual corpora
  - The British National Corpus
- Specialized corpora
  - Lancaster Corpus of Academic Written English
- Learner corpora
  - International Corpus of Learner English
- Parallel & comparable corpora
  - The JRC-Acquis Multilingual Parallel Corpus
  - The English-Chinese Parallel Concordancer
- Corpora and varieties
  - International Corpus of English
- Synchronic and diachronic corpora

Corpus design
- Purpose
- Comparability
- Type
- Content: mode, interaction, domain, medium
- Structure: proportions
- Size
- Sampling?
- Design of the BNC

Where to obtain corpora
- Linguistic Data Consortium
- Bookmarks for corpus-based linguists
- Ask on the corpora list
- Compile your own corpora
  - Design your corpus
  - Getting permission
  - File format, metadata, and data markup
  - Text capture
    - Scanning, typing, electronic files, web crawlers, e.g., WebSPHINX
    - Transcription tools, e.g., Transcriber
  - A Guide to Good Practice

Corpus annotation
- Why annotate
- Levels of corpus annotation
- Difficulties for corpus annotation
- Tools for corpus annotation

Why annotate
- For linguistic research
  - Allow more effective corpus searches
- For natural language processing
  - Spelling and grammar checking
  - Text summarization
  - Machine translation
  - Question answering

Levels of corpus annotation
- Sentence segmentation
- Word segmentation/tokenization
- Part-of-speech (POS) tagging
- Chunking/shallow parsing
- Syntactic parsing
- Semantic annotation
- Pragmatic annotation
- Parallel corpora: sentence alignment
- Learner corpora: error annotation

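The two lowest annotation levels, sentence segmentation and tokenization, can be sketched with regular expressions. This is a deliberately naive illustration and not the method of any particular tool: the function names `split_sentences` and `tokenize` and the sample text are assumptions made for this example, and abbreviations such as "e.g." would be split incorrectly.

```python
import re

def split_sentences(text):
    """Naive splitter: break after ., ! or ? when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Keep words (with internal hyphens/apostrophes) apart from punctuation."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

# Invented sample text, for illustration only.
text = "I saw a pig with binoculars. It didn't see me!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

Higher levels such as POS tagging and parsing build on this output, so segmentation errors propagate upward.
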
Difficulties for corpus annotation
- Ambiguity
  - I saw a pig with binoculars.
  - Problems for tagging, parsing, & WSD
- Unknown words
  - Identification
  - POS tagging
  - Semantic annotation

Tools for corpus annotation
- Bookmarks for corpus-based linguists
- Corpora and Corpus Annotation Tools on the WWW
- POS tagger demonstration
  - Sentence segmentation
  - POS tagging
  - Extracting NPs of the form DT NN NN
- Dexter: Tools for analyzing language data

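Extracting NPs of the form DT NN NN amounts to sliding a three-token window over POS-tagged text. The sketch below assumes the text has already been tagged with Penn Treebank-style tags; the tagged sentence is hand-made for illustration, and `extract_dt_nn_nn` is a hypothetical helper, not part of any tagger.

```python
def extract_dt_nn_nn(tagged):
    """Return word triples whose tags are exactly DT, NN, NN in sequence."""
    pattern = ("DT", "NN", "NN")
    matches = []
    for i in range(len(tagged) - len(pattern) + 1):
        window = tagged[i:i + len(pattern)]
        if tuple(tag for _, tag in window) == pattern:
            matches.append(" ".join(word for word, _ in window))
    return matches

# Hand-tagged example (an assumption, not real tagger output).
tagged_sentence = [
    ("The", "DT"), ("corpus", "NN"), ("design", "NN"), ("affects", "VBZ"),
    ("every", "DT"), ("frequency", "NN"), ("list", "NN"), (".", "."),
]
print(extract_dt_nn_nn(tagged_sentence))
```
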
Corpus analysis
- Levels of corpus analysis
- Tools for corpus analysis
- Interpreting corpus data

Levels of corpus analysis
- Word frequency lists
- Concordances
  - Collocation (lexical patterning)
  - Colligation (syntactic patterning)
- Keyword lists

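A concordance lists every occurrence of a target word with its surrounding context (keyword in context, KWIC), which is what makes collocational patterns visible. The following is a minimal sketch under simple assumptions: text is lowercased, tokens are word characters only, and the function name `kwic`, the `width` parameter, and the sample text are invented for this example.

```python
import re

def kwic(text, target, width=3):
    """Return (left context, keyword, right context) for each hit of target."""
    tokens = re.findall(r"\w+(?:[-']\w+)*", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Invented sample text, for illustration only.
sample = "Strong tea is common. He drinks strong coffee, but she prefers strong tea."
for left, keyword, right in kwic(sample, "strong", width=2):
    print(f"{left:>20}  [{keyword}]  {right}")
```

Sorting the lines by the first word to the right of the keyword would group recurring collocates together.
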
Tools for corpus analysis
- Bookmarks for corpus-based linguists
- Recommendations:
  - WordSmith Tools (not free)
  - AntConc (free)
  - TextStat (free)
- Unix tools
- Write your own scripts

Exercise (part 1)
- Download and install AntConc
- Download some text for processing
  - Project Gutenberg
- Generate a word frequency list for your mini-corpus

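For those who would rather write a script than use AntConc, a word frequency list takes only a few lines. This sketch assumes a crude tokenizer (lowercase ASCII letters with an optional apostrophe part); `frequency_list` and the one-line mini-corpus are invented for illustration.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens).most_common()

# Stand-in for a downloaded text; replace with the contents of your own file.
mini_corpus = "The cat sat on the mat. The dog sat too."
for word, count in frequency_list(mini_corpus)[:3]:
    print(word, count)
```

To process a downloaded file, read it into a string first, for example with `open(path, encoding="utf-8").read()`, where `path` is wherever you saved the text.
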
Interpreting corpus data
- Are frequency differences statistically significant?
  - w appears x times in an n-word corpus, and y times in an m-word corpus
  - Chi-square test (doesn’t work well for small numbers)
  - Fisher’s Exact Test (doesn’t work for a cross table larger than 2×2)

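Both tests operate on a 2×2 table built from x, n, y, and m: occurrences of w versus all other tokens, in each corpus. The sketch below is one possible implementation using only the standard library. It assumes Pearson's chi-square without continuity correction and a two-sided Fisher test that sums the probabilities of all tables no more likely than the observed one; the function names and the example counts are invented, and degenerate tables (for example, a word absent from both corpora) are not handled.

```python
from math import comb, erfc, sqrt

def contingency(x, n, y, m):
    """2x2 table: occurrences of w vs. all other tokens in each corpus."""
    return x, n - x, y, m - y

def chi_square_2x2(x, n, y, m):
    """Pearson chi-square (no continuity correction) and its p-value, 1 df."""
    a, b, c, d = contingency(x, n, y, m)
    total = a + b + c + d
    stat = total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(stat / 2))  # upper tail of chi-square with 1 df
    return stat, p_value

def fisher_exact_2x2(x, n, y, m):
    """Two-sided Fisher's exact test via the hypergeometric distribution."""
    a, b, c, d = contingency(x, n, y, m)
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Probability that the top-left cell equals k, margins held fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= observed * (1 + 1e-7))
    return min(1.0, total)

# Invented counts: w occurs 30 times in 10,000 words and 10 times in 10,000 words.
stat, p_chi = chi_square_2x2(30, 10_000, 10, 10_000)
p_fisher = fisher_exact_2x2(30, 10_000, 10, 10_000)
print(f"chi-square = {stat:.2f}, p = {p_chi:.4f}; Fisher p = {p_fisher:.4f}")
```

The Fisher function loops over every possible value of the top-left cell, so it is practical for exercise-sized counts but slow for very frequent words in large corpora.
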
Exercise (part 2)
- Compare your word frequency list with that of the BNC
- Anything interesting?
- Run the chi-square test and Fisher’s Exact Test on some interesting words

Interpreting corpus data (cont.)
- Collocational analysis: How strongly are X and Y associated?
  - Mutual information
    - Measures difference between observed and expected frequencies of (X, Y)
    - Higher MI, stronger association
    - Doesn’t work well for low frequencies
  - T-test
    - Measures confidence with which to claim strong association between X and Y
    - Higher t-score, higher association
- Online calculations

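Both measures compare the observed co-occurrence frequency of (X, Y) with the frequency expected if X and Y were independent. The sketch below uses one common formulation, which may differ from what a given online calculator does: expected = f(X)·f(Y)/N, MI = log2(observed/expected), and t = (observed − expected)/√observed. It ignores the size of the collocation window, and all names and counts are invented for illustration.

```python
from math import log2, sqrt

def expected_cooccurrence(f_x, f_y, n_tokens):
    """Co-occurrences expected if X and Y were distributed independently."""
    return f_x * f_y / n_tokens

def mutual_information(f_xy, f_x, f_y, n_tokens):
    """log2 of observed over expected co-occurrence frequency."""
    return log2(f_xy / expected_cooccurrence(f_x, f_y, n_tokens))

def t_score(f_xy, f_x, f_y, n_tokens):
    """Observed minus expected, scaled by the square root of observed."""
    return (f_xy - expected_cooccurrence(f_x, f_y, n_tokens)) / sqrt(f_xy)

# Invented counts: X occurs 100 times, Y 200 times, together 20 times,
# in a corpus of 1,000,000 tokens.
print(mutual_information(20, 100, 200, 1_000_000))
print(t_score(20, 100, 200, 1_000_000))

# A pair seen only once, each word occurring once in the same corpus.
print(mutual_information(1, 1, 1, 1_000_000))
print(t_score(1, 1, 1, 1_000_000))
```

The second pair illustrates the low-frequency problem: a single co-occurrence of two words that each occur once yields an MI of about 19.9, far above the first pair, while its t-score stays near 1.
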
Exercise (part 3)
- Generate a concordance for a target word
- Find a word that co-occurs frequently with the target word
- Test if the word is strongly associated with the target word

Note on research project design
- Purpose of project
- Corpus compilation and annotation
- Corpus analysis
  - Bottom-up: from observations of recurring patterns to hypotheses and generalizations
  - Top-down: start with given categories and search for evidence of use and variance
- Caution on generalizability

Future courses on corpus linguistics
- Spring 2007
  - APLING 597 E: Introduction to Corpus Linguistics
  - Hands-on course on principles and tools for corpus compilation, annotation, processing, and analysis
- Spring 2008
  - APLING 597: Seminar on Corpus Linguistics
  - Advanced seminar on using corpora for serious research projects