Corpus Linguistics
Lecture 1
Albert Gatt

Contact details
o My email: albert.gatt@um.edu.mt
o Drop me a line with queries etc., and to arrange meetings.

Course web page
o Course web page: http://staff.um.edu.mt/albert.gatt/home/teaching/corpusLing.html
  - Details of tutorials, lectures etc. will always be on the web page.
o Readings for the lecture
o Downloadable lecture notes (available after the lecture)

Suggested text
o T. McEnery and A. Wilson (2001). Corpus Linguistics. Edinburgh University Press.
o NB: Over the course of these lectures, other readings will also be proposed and made available, usually online.

Lectures and assessment
o Structure of lectures:
  - all lectures will take place in the lab
  - usually, about half the lecture (1 hr) will be devoted to practical work
o Course assessment: assignment
  - Final essay (ca. 1500-2000 words)
  - Essay topics will involve research on corpora!

Questions… ?

What is corpus linguistics?
o A new theory of language?
  - No. In principle, any theory of language is compatible with corpus-based research.
o A separate branch of linguistics (in addition to syntax, semantics…)?
  - No. Most aspects of language can be studied using a corpus (in principle).
o A methodology to study language in all its aspects?
  - Yes! The most important principle is that aspects of language are studied empirically by analysing natural data using a corpus.
  - A corpus is an electronic, machine-readable collection of texts that represent “real life” language use.

Goals of this lecture
o To define the terms:
  - corpus linguistics
  - corpus
o To give an overview of the history of corpus linguistics
o To contrast the corpus-based approach with other methodologies used in the study of language

An initial example
o Suppose you’re a linguist interested in the syntax of verb phrases.
  - Some verbs are transitive, some intransitive:
    - I ate the meat pie (transitive)
    - I swam (intransitive)
o What about:
  - quiver
  - quake
  Most traditional grammars characterise these as intransitive.
o Are these really intransitive?

One possible methodology…
o The standard method relies on the linguist’s intuition:
  - I never use quiver/quake with a direct object.
  - I am a native speaker of this language.
  - All native speakers have a common mental grammar or competence (Chomsky).
  - Therefore, my mental grammar is the same as everyone else’s.
  - Therefore, my intuition accurately reflects English speakers’ competence.
  - Therefore, quiver/quake are intransitive.
o NB: The above is a gross simplification! E.g. linguists often rely on judgements elicited from other native speakers.

Another possible methodology…
o This one relies on data:
  - I may never use quiver/quake with a direct object, but…
  - …other people might.
  - Therefore, I’ll get my hands on a large sample of written and/or spoken English and check.

Quiver/quake: the corpus linguist’s answer
o A study by Atkins and Levin (1995) found that quiver and quake do occur in transitive constructions:
  - the insect quivered its wings
  - it quaked his bowels (with fear)
o The study used a corpus of 50 million words to find examples of the verbs.
o With sufficient data, you can find examples that your own intuition won’t give you…

Example II: lexical semantics
o Quasi-synonymous lexical items exhibit subtle differences in context.
  - strong
  - powerful
o A fine-grained theory of lexical semantics would benefit from data about these contextual cues to meaning.

Example II continued
o Some differences between strong and powerful (source: British National Corpus):
  - strong: wind, feeling, accent, flavour
  - powerful: tool, weapon, punch, engine
o The differences are subtle, but examining the collocates of each word helps.
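
The collocate comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sentences are invented toy data (not drawn from the BNC), the function name right_collocates is hypothetical, and "collocate" is simplified here to mean the word immediately to the right of the node word.

```python
from collections import Counter


def right_collocates(sentences, node):
    """Count the words that immediately follow `node` in tokenised sentences."""
    counts = Counter()
    for tokens in sentences:
        # Pair every token with its right-hand neighbour.
        for current, following in zip(tokens, tokens[1:]):
            if current == node:
                counts[following] += 1
    return counts


# Toy data, invented for illustration only (not real corpus material).
toy_sentences = [
    ["a", "strong", "wind", "blew"],
    ["she", "has", "a", "strong", "accent"],
    ["a", "powerful", "engine", "roared"],
    ["a", "strong", "wind", "again"],
    ["it", "is", "a", "powerful", "tool"],
]

strong = right_collocates(toy_sentences, "strong")
powerful = right_collocates(toy_sentences, "powerful")
print("strong:", strong.most_common())
print("powerful:", powerful.most_common())
```

On a real corpus one would normally look at a wider window around the node word and use a statistical association measure rather than raw counts.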

Some preliminary definitions
o The second approach is typical of the corpus-based methodology:
  - Corpus: A large, machine-readable collection of texts.
    - Often, in addition to the texts themselves, a corpus is annotated with relevant linguistic information.
  - Corpus-based methodology: An approach to natural language analysis that relies on generalisations made from data.

Example (British National Corpus)
o British National Corpus (BNC):
  - 100 million words of English
    - 90% written, 10% spoken
  - Designed to be representative and balanced.
  - Texts from different genres (literature, news, academic writing…)
  - Annotated: Every single word is accompanied by part-of-speech information.

Example (continued)
o A sentence in the BNC:
  - Explosives found on Hampstead Heath.
    <s>
    <w NN2>Explosives
    <w VVD>found
    <w PRP>on
    <w NP0>Hampstead
    <w NP0>Heath
    <PUN>.

Example (continued)
  new sentence       <s>
  plural noun        <w NN2>Explosives
  past tense verb    <w VVD>found
  preposition        <w PRP>on
  proper noun        <w NP0>Hampstead
  proper noun        <w NP0>Heath
  punctuation        <PUN>.
Explosives found on Hampstead Heath
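
A minimal sketch of how such markup might be read by a program, assuming the simplified format shown on the slide: each word is preceded by <w TAG>, punctuation by <PUN>, and tags are written without internal spaces (NN2, NP0). The regular expression and the function name parse_tagged are illustrative assumptions; the actual BNC distribution uses a richer markup scheme.

```python
import re

# One token = an optional "w " marker, an upper-case tag, then the word itself.
# The lower-case sentence marker <s> deliberately does not match.
TOKEN = re.compile(r"<(?:w\s+)?([A-Z0-9]+)>([^<\s]+)")


def parse_tagged(line):
    """Return a list of (word, tag) pairs from a slide-style tagged sentence."""
    return [(word, tag) for tag, word in TOKEN.findall(line)]


line = ("<s> <w NN2>Explosives <w VVD>found <w PRP>on "
        "<w NP0>Hampstead <w NP0>Heath <PUN>.")

for word, tag in parse_tagged(line):
    print(f"{tag:4} {word}")
```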

Important to note
o This is not “raw” text.
  - Annotation means we can search for particular patterns.
  - E.g. for the quiver/quake study: “find all occurrences of quiver which are verbs, followed by a determiner and a noun”
o The collection is very large.
  - Only in very large collections are we likely to find rare occurrences.
o Corpus search is done by computer. You can’t trawl through 100 million words manually!
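
The quiver/quake query can be sketched as a search over (word, tag) pairs. This is a hedged illustration, not the procedure used in the original study: the tag names are assumed to follow the BNC's part-of-speech scheme (VV... for lexical verbs, AT0/DT0/DPS for determiners, NN... for common nouns), the list of verb forms stands in for proper lemmatisation, and the function name and toy data are hypothetical.

```python
# Assumed inflected forms of the verb; a real corpus tool would use lemmas.
QUIVER_FORMS = {"quiver", "quivers", "quivered", "quivering"}
# Assumed determiner tags: article, general determiner, possessive determiner.
DETERMINER_TAGS = {"AT0", "DT0", "DPS"}


def find_transitive_candidates(tagged, forms):
    """Find verb + determiner + noun sequences where the verb is in `forms`."""
    hits = []
    for i in range(len(tagged) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tagged[i:i + 3]
        if (w1.lower() in forms and t1.startswith("VV")
                and t2 in DETERMINER_TAGS and t3.startswith("NN")):
            hits.append((w1, w2, w3))
    return hits


# Toy tagged text, invented for illustration.
tagged = [
    ("the", "AT0"), ("insect", "NN1"), ("quivered", "VVD"),
    ("its", "DPS"), ("wings", "NN2"), (".", "PUN"),
    ("a", "AT0"), ("quiver", "NN1"), ("of", "PRF"), ("arrows", "NN2"),
]

print(find_transitive_candidates(tagged, QUIVER_FORMS))
```

The noun use of quiver in "a quiver of arrows" is skipped because its tag is not a verb tag, which is exactly what annotation buys us. The hits are still only candidates: a verb followed by a determiner and a noun is not always a direct object, so the results need checking by hand.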

The practical objections…
o But we’re linguists, not computer scientists! Do I have to write programs?
  - No, there are literally dozens of available tools to search in a corpus.
o Are all corpora good for all purposes?
  - No. Some are “general-purpose”, like the BNC. Others are designed to address specific issues.

The theoretical objections…
o What guarantee do we have that the texts in our corpus are “good data”, quality texts, written by people we can trust?
o How do I know that what I find isn’t just a small, exceptional case? E.g. quiver in a transitive construction could really be a one-off!
o Just because there are a few examples of something doesn’t mean that all native speakers use a certain construction!
o Do we throw intuition out of the window?

Part 2 A brief history of corpus linguistics

Language and the cognitive revolution
o Before the 1950s, the linguist’s task was:
  - to collect data about a language;
  - to make generalisations from the data (e.g. “In Maltese, the verb always agrees in number and gender with the subject NP”)
  - The basic idea: language is “out there”, the sum total of things people say and write.
o After the 1950s:
  - the so-called “cognitive revolution”
  - language treated as a mental phenomenon
  - no longer about collecting data, but explaining what mental capabilities speakers have

The 19th & early 20th Century
o Many early studies relied on corpora.
o Language acquisition research was based on collections of child data.
o Anthropologists collected samples of unknown languages.
o Comparative linguists used large samples from different languages.
o A lot of work done on frequencies:
  - frequency of words…
  - frequency of grammatical patterns…
  - frequency of different spellings…
o All of this was interrupted around 1955.

Chomsky and the cognitive turn
o Chomsky (1957) was primarily responsible for the new, cognitive view of language.
o He distinguished (1965):
  - Descriptive adequacy: describing language, making generalisations such as “X occurs more often than Y”
  - Explanatory adequacy: explaining why some things are found in a language, but not others, by appealing to speakers’ competence, their mental grammar
o He made several criticisms of corpus-based approaches.

Criticisms of corpora (I)
o Competence vs. performance:
  - To explain language, we need to focus on the competence of an idealised speaker-hearer.
    - Competence = internalised, tacit knowledge of language
  - Performance (the language we speak/write) is not a good mirror of our knowledge:
    - it depends on situations
    - it can be degraded
    - it can be influenced by other cognitive factors beyond linguistic knowledge

Criticisms of corpora (II)
o Early work using corpora assumed that:
  - the number of sentences of a language is finite (so we can get to know everything about language if the sample is large enough)
o But actually, it is impossible to count the number of sentences in a language.
  - Syntactic rules make the possibilities literally infinite:
      the man in the house (NP -> NP + PP)
      the man in the house on the beach (PP -> PREP + NP)
      the man in the house on the beach by the lake
      …
o So what use is a corpus? We’re never going to have an infinite corpus.
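
The recursion argument can be made concrete with a small sketch. It applies the slide's rule NP -> NP + PP (with PP -> PREP + NP) a chosen number of times; the function name noun_phrase and the cycling through a fixed list of modifiers are assumptions made purely for illustration.

```python
def noun_phrase(base, modifiers, depth):
    """Apply NP -> NP + PP `depth` times, where each PP is PREP + NP."""
    if depth == 0:
        return base
    # Cycle through the modifiers so that any depth is possible.
    prep, np = modifiers[(depth - 1) % len(modifiers)]
    return f"{noun_phrase(base, modifiers, depth - 1)} {prep} {np}"


modifiers = [("in", "the house"), ("on", "the beach"), ("by", "the lake")]

for depth in range(5):
    print(noun_phrase("the man", modifiers, depth))
```

Since depth has no upper bound, the rule licenses ever longer noun phrases, so no finite list of sentences can be complete.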

Criticisms of corpora (III)
o A corpus is always skewed, i.e. biased in favour of certain things.
  - Certain obvious things are simply never said. E.g. we probably won’t find “a dog is a dog” in our corpus.
o A corpus is always partial: we will only find things in a corpus if they are frequent enough.
  - A corpus is necessarily only a sample.
  - Rare things are likely to be omitted from a sample.
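
A rough back-of-the-envelope illustration of why rare items go missing from a sample. It assumes, unrealistically, that every word position is an independent draw and that the item occurs at a fixed rate per million words; the function name prob_absent and the rate used are hypothetical.

```python
def prob_absent(rate_per_million, corpus_size):
    """Probability that an item never occurs in `corpus_size` words,
    assuming independent draws at a fixed per-word rate (a simplification)."""
    p = rate_per_million / 1_000_000
    return (1 - p) ** corpus_size


# An item assumed to occur about once every 10 million words.
for size in (1_000_000, 10_000_000, 100_000_000):
    print(f"{size:>11,} words: P(absent) = {prob_absent(0.1, size):.4f}")
```

Under these assumptions the item is very likely to be missing from a one-million-word sample and very unlikely to be missing from a hundred-million-word one.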

Criticisms of corpora (IV)
o Why use a corpus if we already know things by introspection?
o How can a corpus tell us what is ungrammatical?
  - Corpora won’t contain “disallowed” structures, because these are by definition not part of the language.
  - So a corpus contains exclusively positive evidence: you only get the “allowed” things.
  - But if X is not in the corpus, this doesn’t mean it’s not allowed.
  - It might just be rare, and your corpus isn’t big enough. (Skewness)

Refutations
o Corpora can be better than introspective evidence because:
  - They are public; other people can verify and replicate your results (the essence of scientific method).
  - Some kinds of data are simply not available to introspection. E.g. people aren’t good at estimating the frequency of words or structures.
  - Skewness can itself be informative: if X occurs more frequently than Y in a corpus, that in itself is an interesting fact.

Refutations (II)
o By the way, nobody’s saying “throw introspection out the window”…
  - There is no reason not to combine the corpus-based and the introspection-based method.
o Many other objections can be overcome by using large enough corpora.
  - Pre-1950, most corpus work was done manually, so it was error-prone.
  - Machine-readable corpora mean we have a great new tool to analyse language very efficiently!

Corpora in the late 20th Century
o Corpus linguistics enjoyed a revival with the advent of the digital personal computer.
  - Kucera and Francis: the Brown Corpus, one of the first
  - Svartvik: the London-Lund Corpus, which built on Brown
o These were rapidly followed by others… Today, corpora are firmly back on the linguistic landscape.

Summary
o Introduced the notion of corpus and corpus-based research
o Gave a quick overview of the history of this methodology
o Looked at some possible objections to corpus-based methods, and some possible counter-arguments

Next lecture
o We look more closely at some important properties of a corpus:
  - Machine-readability
  - Balance
  - Representativeness
  - …