What is a corpus? What is corpus linguistics?
What is a corpus? • A book? • An article? • An archive?
Definition CORPUS: (1) A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse. (2) In linguistics and lexicography, a body of texts, utterances, or other specimens considered more or less representative of a language, and usually stored as an electronic database. Currently, computer corpora may store many millions of running words, whose features can be analyzed by means of tagging (the addition of identifying and classifying tags to words and other formations) and the use of concordancing programs. (McArthur, Tom (ed.). 1992. The Oxford Companion to the English Language. Oxford & New York: Oxford University Press.)
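The definition above mentions concordancing programs. As a rough illustration only, a keyword-in-context (KWIC) concordancer might be sketched as follows. The function names `tokenize` and `kwic`, the crude regex tokenizer, and the toy sentence are assumptions made for this example, not part of any particular concordancing tool.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"\w+", text.lower())

def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every occurrence of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Toy text standing in for a real corpus (hypothetical data).
sample = "The cat sat on the mat. The dog chased the cat around the garden."
tokens = tokenize(sample)

# Print each hit with the keyword aligned in a column, as concordancers usually do.
for left, kw, right in kwic(tokens, "cat"):
    print(f"{left:>20} [{kw}] {right}")
```

A real concordancer would normally keep punctuation and sentence boundaries and would work on a tagged corpus; this sketch only shows the core idea of lining up every hit with its surrounding words.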
Definition corpus, plural corpora: A collection of linguistic data, either compiled as written texts or as a transcription of recorded speech. The main purpose of a corpus is to verify a hypothesis about language - for example, to determine how the usage of a particular sound, word, or syntactic construction varies. Corpus linguistics deals with the principles and practice of using corpora in language study. A computer corpus is a large body of machine-readable texts. (Crystal, David. 1992. An Encyclopedic Dictionary of Language and Languages. Oxford: Blackwell.)
Definition • A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language. (Crystal, David. 1991. A Dictionary of Linguistics and Phonetics. Oxford: Blackwell.) • A collection of naturally occurring language text, chosen to characterize a state or variety of a language. (Sinclair, John. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.)
A corpus is not just any kind of text… • It is a sample/collection that is representative with regard to the research hypothesis • It has a defined size and content • It is electronically stored, because it is easier to obtain information on frequencies, grammatical patterns and collocations by computer than manually • The cost of a new analysis is lower than with manual counting • It is freely available (so that research results can be contrasted, compared and repeated)
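To make the point about obtaining frequencies and collocations by computer concrete, here is a minimal sketch using only the Python standard library. The helper names and the toy text are assumptions for illustration, and counting adjacent word pairs is only the simplest possible stand-in for collocation analysis.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"\w+", text.lower())

def frequency_list(tokens):
    """Count how often each word form occurs."""
    return Counter(tokens)

def bigram_counts(tokens):
    """Count adjacent word pairs; note that pairs may span sentence boundaries here."""
    return Counter(zip(tokens, tokens[1:]))

# Toy text standing in for a real corpus (hypothetical data).
toy_corpus = "She likes strong tea. He drinks strong tea too. Nobody says powerful tea."
toks = tokenize(toy_corpus)

freq = frequency_list(toks)
bigrams = bigram_counts(toks)

print(freq.most_common(3))
print(bigrams[("strong", "tea")], bigrams[("powerful", "tea")])
```

On a real corpus, raw pair counts are usually replaced by an association measure, but even this simple count is tedious to do by hand.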
What is corpus linguistics? Corpus linguistics is the study of language as expressed in corpora (samples) of "real world" text. Corpus linguistics proposes that reliable language analysis is more feasible with corpora collected in the field, in their natural context ("realia"), and with minimal experimental interference. • Corpus linguistics is a methodology for obtaining and analysing language data, either quantitatively or qualitatively. • It can be applied in almost any area of language studies. • The object of study is authentic, naturally occurring language use. • Corpus linguistics is not a separate branch of linguistics (like e.g. sociolinguistics) or a theory of language.
IS CORPUS LINGUISTICS A BRANCH OF LINGUISTICS? The answer to this question is both yes and no. Corpus linguistics is not a branch of linguistics in the same sense as syntax, semantics, sociolinguistics and so on. All of these disciplines concentrate on describing/explaining some aspect of language use. Corpus linguistics in contrast is a methodology rather than an aspect of language requiring explanation or description. A corpus-based approach can be taken to many aspects of linguistic enquiry. Syntax, semantics and pragmatics are just three examples of areas of linguistic enquiry that have used a corpus-based approach (see Chapter 4). Corpus linguistics is a methodology that may be used in almost any area of linguistics, but it does not truly delimit an area of linguistics itself.
Why should I use corpora? • Objective verification of results • Corpora show how people really use the language. They do not provide imaginary, idealised examples • Quantitative data show what occurs frequently and what occurs rarely in the language • Thanks to information technology we can conduct fast, complex studies and process more material than by hand.
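When quantitative data from corpora of different sizes are compared, raw counts are commonly normalised to a shared base such as occurrences per million words. The following is a minimal sketch of that arithmetic; the function name and all the numbers are made up for illustration.

```python
def per_million(raw_count, corpus_size):
    """Normalise a raw frequency to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size

# Hypothetical counts of the same word in two corpora of different sizes.
small_corpus = per_million(50, 2_000_000)     # 50 hits in 2 million words
large_corpus = per_million(900, 100_000_000)  # 900 hits in 100 million words

print(small_corpus, large_corpus)
```

In this made-up case the word has far more raw hits in the larger corpus, yet it is relatively more frequent in the smaller one, which is why normalisation matters before any comparison.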
Why should I use corpora and corpus linguistics? • What kinds of questions can they answer? • What kinds of questions can they not answer?
What kinds of questions can CL answer? • How much, how many, how often, what…? • How many words does one need to participate in an everyday conversation? • What are the most characteristic words for discourse on asylum seekers? • In which idiomatic expressions do the Polish words „kot“ ("cat") and „pies“ ("dog") appear together?
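Two of the questions above can be approximated with very little code. The sketch below is an assumption-laden illustration, not an established procedure: `types_needed_for_coverage` counts how many of the most frequent word forms are needed to cover a chosen share of all tokens, and `cooccurring_sentences` finds sentences that contain two given word forms. The data are toy examples, and only exact word forms are matched, so inflected forms such as "kotem" would be missed without lemmatisation.

```python
import re
from collections import Counter

def types_needed_for_coverage(tokens, target=0.8):
    """Number of most frequent word forms needed to cover `target` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return rank
    return len(counts)

def cooccurring_sentences(sentences, word_a, word_b):
    """Return the sentences that contain both word forms (exact, case-insensitive match)."""
    result = []
    for sentence in sentences:
        words = set(re.findall(r"\w+", sentence.lower()))
        if word_a in words and word_b in words:
            result.append(sentence)
    return result

# Toy token list: 10 tokens, 4 distinct word forms (hypothetical data).
toy_tokens = ["the"] * 5 + ["a"] * 3 + ["cat", "dog"]
print(types_needed_for_coverage(toy_tokens, target=0.75))

# Toy Polish sentences (hypothetical data, uninflected forms only).
toy_sentences = ["Pies i kot śpią razem", "Kot pije mleko", "Pies szczeka"]
print(cooccurring_sentences(toy_sentences, "kot", "pies"))
```

Run on a corpus of everyday conversation, the first function gives a rough answer to the vocabulary-size question, although a serious study would count lemmas or word families rather than raw word forms.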
What kinds of questions can CL not answer? • Why…? • CL cannot explain the reasons for a particular language use. • It cannot provide negative evidence – it is not enough that something does not appear in a corpus. (Or can it?)
Where is CL popular nowadays? • Speech analysis – speech synthesis • Lexicography – how many senses a word has • Grammar/syntax – grammatical patterns • Semantics – semantic networks • Pragmatics – difference between a student’s and a professor’s e-mail • Sociolinguistics – political discourse • Stylistics – author identification • Language acquisition – what the most common mistakes of students are • Historical linguistics – how the use of prepositions changed over a century • Dialectology – what kinds of vocabulary differences there are • Psycholinguistics – how frequent different types of speech error are in everyday language • Language engineering – automatic POS tagging
Chomsky's criticism of Corpus Linguistics Chomsky's criticism was based on his rejection of the corpus as a source of evidence in linguistic enquiry. He changed the object of linguistic enquiry from abstract descriptions of language to theories which reflected a psychological reality, cognitively plausible models of language. In doing so he apparently invalidated the corpus as a source of evidence in linguistic enquiry. Chomsky suggested that the corpus could never be a useful tool for the linguist, as the linguist must seek to model language competence rather than performance. (Chomsky 1988) Competence is best described as our tacit, internalised knowledge of a language. Performance, on the other hand, is external evidence of language competence and its usage on particular occasions when, crucially, factors other than our linguistic competence may affect its form. Chomsky argued that it was competence rather than performance that the linguist was trying to model.
Corpus Linguistics Approaches There are two approaches in corpus linguistics: the corpus-driven approach and the corpus-based approach. According to Biber (2009: 12-17): 1. Corpus-based research assumes the validity of linguistic forms and structures derived from linguistic theory. The primary goal of research is to analyse the systematic patterns of variation and use for those pre-defined linguistic features. 2. Corpus-driven research is more inductive, so that the linguistic constructs themselves emerge from analysis of a corpus.
Corpus Linguistics Approaches (cont’d) According to Tognini-Bonelli (2001: 84-5): 1. Corpus-based studies typically use corpus data in order to explore a theory or hypothesis, aiming to validate it, refute it or refine it. The definition of corpus linguistics as a method underpins this approach. 2. Corpus-driven linguistics rejects the characterisation of corpus linguistics as a method and claims instead that the corpus itself should be the sole source of our hypotheses about language. It is thus claimed that the corpus itself embodies a theory of language.
Corpus Design 1. Authenticity 2. Sampling 3. Size 4. Balance and representativeness 5. Legal issues
Data Collection Regimes • Monitor corpora: the corpus grows over time and consists of various materials (e.g. COCA, BoE) • Balanced corpora: they try to represent a particular type of language over a certain period of time (e.g. LOB, BCCWJ) • Opportunistic corpora: because of technical limitations, corpus collection sometimes has to rely on existing data or on data that can be accessed easily.
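One simple way to think about balance is to draw the same number of texts from each category of a sampling frame. The sketch below assumes a hypothetical dictionary that maps category names to lists of text identifiers; it is not a description of how LOB, BCCWJ, or any other corpus was actually compiled.

```python
import random

def balanced_sample(texts_by_category, per_category, seed=0):
    """Draw up to `per_category` texts at random from every category."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    sample = {}
    for category, texts in texts_by_category.items():
        k = min(per_category, len(texts))  # small categories contribute all they have
        sample[category] = rng.sample(texts, k)
    return sample

# Hypothetical sampling frame: text identifiers grouped by category.
frame = {
    "press": [f"press_{i}" for i in range(50)],
    "fiction": [f"fiction_{i}" for i in range(20)],
    "academic": [f"academic_{i}" for i in range(5)],
}

chosen = balanced_sample(frame, per_category=10)
for category, texts in chosen.items():
    print(category, len(texts))
```

Equal quotas are only one possible design choice; a corpus builder might instead weight categories by how much each type of text is assumed to be produced or read.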