LIN 3098 Corpus Linguistics Lecture 5 Albert Gatt

In this lecture…
o Corpora and the Lexicon
  n uses of corpora in lexicography
o Counting words
  n lemmatisation and other issues
  n types versus tokens
  n word frequency distributions in corpora

Part 1 Corpora and lexicography

Why corpora are useful
o Lexicographic work has long relied on contextual cues to identify meanings.
  n e.g. Samuel Johnson used examples from literature to exemplify uses of a word.
o Corpora make this procedure much easier
  n not only to provide examples but:
  n to actually identify meanings of a word given its context
  n definitions of word meanings should therefore be more precise, if based on large amounts of data

Specific applications
o Grammatical alternations of words
  n E.g. verb diathesis alternations:
    o Atkins and Levin (1995) found that verbs such as quiver and quake have both intransitive and transitive uses. (see Lecture 1)
  n E.g. uses of prepositions such as on, with…
o Regional variations in word use
  n relying on corpora which include gender/region/dialect/date information

Specific applications - II
o Identification of occurrences of a specific homograph, e.g. house (Verb)
  n examination of the contexts in which it occurs
  n relies on POS tagging
o Keeping track of changes in a language through a monitor corpus
o Identifying how common a word is, through frequency counts.
  n many dictionaries include such information now
  n this shall be our starting point

Part 2 Counting words in corpora: types versus tokens

Running example
o Throughout this lecture, reference is made to data from a corpus of Maltese texts:
  n ca. 51,000 words
  n all from Maltese-language newspapers
  n various topics and article types

How to count words: types versus tokens
o token = any word in the corpus
  n (also counting words that occur more than once)
o type = all the individual, different words in the corpus
  n (grouping occurrences of a word together as representatives of a single type)
o Example:
  n I spoke to the chap who spoke to the child
  n 10 tokens
  n 7 types (I, spoke, to, the, chap, who, child)
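The example sentence can be checked with a few lines of Python. This is a minimal sketch that assumes tokens are simply whitespace-separated strings; the function name `count_types_and_tokens` is illustrative and not part of the lecture.

```python
from collections import Counter

def count_types_and_tokens(text):
    """Return (number of tokens, number of types) for whitespace-separated text."""
    tokens = text.split()          # naive tokenisation: split on whitespace
    type_counts = Counter(tokens)  # one entry per distinct word form
    return len(tokens), len(type_counts)

n_tokens, n_types = count_types_and_tokens(
    "I spoke to the chap who spoke to the child"
)
print(n_tokens, n_types)  # prints: 10 7
```

Real corpora need more careful tokenisation than `split()`, which is the subject of the "difficult decisions" slides.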

More on types and tokens
o The number of tokens in the corpus is an estimate of overall corpus size
  n Maltese corpus: 51,000 tokens
o The number of types is an estimate of vocabulary size
  n gives an idea of the lexical richness of the corpus
  n Maltese corpus: 8193 types

Type/token ratio
o A (rough!) way of measuring the amount of variation in the vocabulary in the corpus.
o Roughly, can be interpreted as the “rate at which new types are introduced, as a function of number of tokens”

Difficult decisions - I
o Do we distinguish upper- and lowercase words?
  n is New in New York the same as new in new car?
  n but what of New in New cars are expensive? (sentence-initial caps)
  n in practice, it’s not straightforward to distinguish the two accurately, but it can be done

Difficult decisions - II
o What about morphological variants?
  n man – men: one type or two?
  n go – went: one type or two?
o If we map all morphological (inflectional) variants to a single type, our counts will be cleaner (lemmatisation).
  n depends on availability of automated methods to do this
o Maltese also presents problems with variants of the definite article (ir-, is-, ix- etc)
  n ir-raġel (DEF-man): one token or two?

Difficult decisions - III
o Do numbers count?
  n e.g. is 1,500 a word?
  n may artificially inflate frequency counts
  n one approach is to treat all numbers as tokens of a single type “NUMBER” or “###”
o Punctuation
  n can compromise frequency counts
  n computer will treat “woman!” as different from “woman”
  n needs to be stripped
  n problematic for languages that rely on non-alphabetic symbols: Maltese ‘l (“to”) vs l- (“the”)
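Some of these decisions can be made concrete in a small normalisation step applied to each token before counting. The sketch below is only one possible set of choices, all of them assumptions: it lowercases everything (so New York and new car are conflated), strips a fixed set of surrounding punctuation marks while keeping hyphens and apostrophes (so Maltese l- survives), and maps digit strings to a single NUMBER type. The function name `normalise` is hypothetical.

```python
import re

def normalise(token):
    """Lowercase a token, strip surrounding punctuation, map numbers to NUMBER."""
    # Hyphens and apostrophes are deliberately NOT stripped (assumption),
    # because forms such as Maltese "l-" depend on them.
    stripped = token.strip(".,;:!?\"()")
    # Digit strings such as 1500, 1,500 or 3.14 become a single type.
    if re.fullmatch(r"\d+([.,]\d+)*", stripped):
        return "NUMBER"
    return stripped.lower()

print(normalise("woman!"))  # prints: woman
print(normalise("1,500"))   # prints: NUMBER
print(normalise("l-"))      # prints: l-
```

Lemmatisation (man/men, go/went) is not attempted here, since it needs a dedicated tool or lexicon.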

Part 2 Representing word frequencies

Raw frequency lists (data from Maltese)
o A simple list, pairing each word with its frequency

word                        frequency
aħħar (“last”)              97
jkun (“be.IMPERF.3SG”)      96
ukoll (“also”)              93
bħala (“as”)                91
dak (“that.SGM”)            86
tat- (“of.DEF”)             86

Frequency ranks
o Word counts can get very big.
  n most frequent word in the Maltese corpus occurs 2195 times (and the corpus is small)
o Raw frequency lists can be hard to process.
o Useful to represent words in terms of rank:
  n count the words
  n sort by frequency (most frequent first)
  n assign a rank to the words:
    o rank 1 = most frequent
    o rank 2 = next most frequent
    o …
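The count, sort and rank steps can be sketched as follows. This is an illustration rather than the procedure used to build the Maltese lists; the function name `rank_frequency` is an assumption, and tied words simply receive consecutive ranks in the order they were first seen.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return a list of (rank, word, frequency), most frequent first."""
    counts = Counter(tokens)  # step 1: count the words
    # step 2: most_common() sorts by descending frequency;
    # step 3: enumerate assigns rank 1 to the most frequent word.
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

for row in rank_frequency("a b a c a b".split()):
    print(row)
```

Other tie-handling schemes (for example, giving tied words the same rank) are equally possible and change the shape of the bottom of the list.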

Rank-frequency list example (data from Maltese)

rank    frequency
1       2195
2       2080
3       1277
4       1264

o rank: rank of the type, according to frequency
o frequency: number of times the type occurs

Frequency spectrum (data from Maltese)
o A representation that shows, for each frequency value, the number of different types that occur with that frequency.

frequency   types
1           4382
2           1253
3           661
4           356
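A frequency spectrum is a "count of counts": first count how often each word occurs, then count how many words share each frequency. A minimal sketch, with the function name `frequency_spectrum` chosen for illustration:

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each frequency value to the number of types with that frequency."""
    word_counts = Counter(tokens)             # word -> frequency
    spectrum = Counter(word_counts.values())  # frequency -> number of types
    return dict(sorted(spectrum.items()))

tokens = "I spoke to the chap who spoke to the child".split()
print(frequency_spectrum(tokens))  # prints: {1: 4, 2: 3}
```

In the example sentence, four types occur once (I, chap, who, child) and three occur twice (spoke, to, the).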

Normalised frequency counts
o A raw frequency for a word isn’t necessarily informative.
  n E.g. it is difficult to compare the frequency of a word in corpora of different sizes.
o We often take a “normalised” count.
  n typical to divide the frequency by the corpus size (no. of tokens) and multiply by some constant, such as 10,000 or 1,000,000
  n this gives “frequency of word per 10,000 words” or “per million words” rather than a raw count.
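As a worked illustration, a normalised count scales the raw frequency to a common base so that corpora of different sizes can be compared. The function below is a sketch; its name and the default base of one million are assumptions.

```python
def normalised_frequency(raw_count, corpus_size, per=1_000_000):
    """Scale a raw count to occurrences per `per` tokens."""
    return raw_count * per / corpus_size

# A word occurring 97 times in a corpus of roughly 51,000 tokens:
print(normalised_frequency(97, 51_000))               # per million words
print(normalised_frequency(97, 51_000, per=10_000))   # per 10,000 words
```

For a corpus as small as 51,000 tokens, a per-million figure is an extrapolation, so a smaller base such as 10,000 may be easier to interpret.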

Type/token ratio revisited
o (no. of types)/(no. of tokens)
o Another way of estimating “vocabulary richness” of a corpus, instead of just looking at vocabulary size.
o E.g. if a corpus consists of 1000 words, and there are 400 types, then the TTR is 40%

Type/token ratio
o Ratio varies enormously depending on corpus size!
o If the corpus is 1000 words, it’s easy to see a TTR of, say, 40%.
o With 4 million words, it’s more likely to be in the region of 2%.
o Reasons:
  n vocab size grows with corpus size but
  n large corpora will contain a lot of tokens that occur many times

Standardised type/token ratio
o One way to account for TTR variations due to corpus size is to compute an average TTR for chunks of a constant size. Example:
  n compute the TTR for every 1000 words of running text
  n then, take an average over all the 1000-word chunks
o This is the approach used, for example, in WordSmith.
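The chunk-and-average idea can be sketched as below. This illustrates the general technique only and makes no claim about how any particular tool implements it; in particular, dropping a trailing chunk shorter than `chunk_size` is an assumption, and the function names are hypothetical.

```python
def ttr(tokens):
    """Type/token ratio: number of distinct tokens divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

def standardised_ttr(tokens, chunk_size=1000):
    """Average the TTR over consecutive, non-overlapping chunks of fixed size."""
    # Only full chunks are used; a shorter remainder at the end is ignored.
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]
    if not chunks:
        raise ValueError("text is shorter than one chunk")
    return sum(ttr(chunk) for chunk in chunks) / len(chunks)

sample = ["a", "b", "a", "a", "c", "d", "e"]
print(ttr(sample))                  # plain TTR over the whole sample
print(standardised_ttr(sample, 2))  # mean TTR over chunks of 2 tokens
```

Because every chunk has the same length, the averaged value no longer shrinks just because the corpus is larger.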

Part 3 Frequency distributions, or “few giants, many midgets”

Non-linguistic case study
o Suppose we are interested in measuring people’s height.
  n population = adult, male/female, European
  n sample: N people from the relevant population
  n measure height of each person in the sample
o Results:
  n person 1: 1.6 m
  n person 2: 1.5 m
  n …

Measures of central tendency
o Given the height of individuals in our sample, we can calculate some summary statistics:
  n mean (“average”): sum of all heights in sample, divided by N
  n mode: most frequent value
  n median: the middle value
o What are your expectations?

The data (example)

person   height (cm)
1        135
2        159
3        160
4        160
5        180

o Mean: 158.8 cm
  n This is the expected value in the long run.
  n If our sample is good, we would expect that most people would have a height at or around the mean.
o Mode: 160 cm
o Median: 160 cm
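The three summary statistics for this sample can be reproduced with Python's standard `statistics` module. The variable names are illustrative.

```python
import statistics

heights_cm = [135, 159, 160, 160, 180]

mean_height = statistics.mean(heights_cm)      # sum of values / N = 794 / 5
median_height = statistics.median(heights_cm)  # middle value of the sorted sample
mode_height = statistics.mode(heights_cm)      # most frequent value

print(mean_height, median_height, mode_height)  # prints: 158.8 160 160
```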

Plotting height/frequency
Observations:
1. Extreme values are less frequent.
2. Most people fall at or near the mean.
3. Mode is approximately the same as the mean.
4. Bell-shaped curve (“normal” distribution)

Plotting height/frequency
• This shape characterises the Normal Distribution.
• A “bell curve”
• Quite typical for a lot of data sampled from humans (but not all data)

What about language?
o Typical observations about word frequencies in corpora:
  1. there are a few words with extremely high frequency
  2. there are many more words with extremely low frequency
  3. the mean is not a good indicator: most words will have an actual value that is very far above or below the mean

A closer look at the Maltese data
o Out of 51,000 tokens:
  n 8016 tokens belong to just the 5 most frequent types (the types at ranks 1–5)
  n ca. 15% of our corpus size is made up of only 5 different words!
o Out of 8193 types:
  n 4382 are hapax legomena, occurring only once (bottom ranks)
  n 1253 occur only twice
  n …
o In this data, the mean won’t tell us very much.
  n it hides huge variations!
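Both observations (the share of the corpus covered by the top few types, and the number of hapax legomena) are easy to compute for any token list. The sketch below uses the short example sentence from earlier in the lecture rather than the Maltese corpus; the function names are assumptions.

```python
from collections import Counter

def top_k_coverage(tokens, k=5):
    """Proportion of all tokens that belong to the k most frequent types."""
    counts = Counter(tokens)
    top_total = sum(freq for _, freq in counts.most_common(k))
    return top_total / len(tokens)

def hapax_legomena(tokens):
    """Return the types that occur exactly once."""
    counts = Counter(tokens)
    return [word for word, freq in counts.items() if freq == 1]

tokens = "I spoke to the chap who spoke to the child".split()
print(top_k_coverage(tokens, k=2))  # prints: 0.4
print(hapax_legomena(tokens))       # prints: ['I', 'chap', 'who', 'child']
```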

Ranks and frequencies (Maltese)

rank    frequency
1       2195
2       2080
3       1277
…
2298    1
2299    1
…

o Among top ranks, frequency drops very dramatically
o Among bottom ranks, frequency drops very gradually

General observations
o In corpora:
  n there are always a few very high-frequency words, and many low-frequency words
  n among the top ranks, frequency differences are big
  n among bottom ranks, frequency differences are very small

So what are the high-frequency words?
o Top-ranked words in the Maltese data:
  n li (“that”), l- (DEF), il- (DEF), u (“and”), ta’ (“of”), tal- (“of the”)
o Bottom-ranked words:
  n żona (“zone”) f = 1
  n yankee f = 1
  n żwieten (“Zejtun residents”) f = 1
  n xortih (“luck.POSS-3SGM”) f = 1
  n widnejhom (“ear.POSS-3PL”) f = 1

Zipf’s law
o George K. Zipf (1902–1950) established a mathematical model for describing frequency data: Frequency decreases with rank. More precisely, frequency is inversely proportional to rank.
o We can plot this in a chart:
  n Y-axis = frequency
  n X-axis = rank
  n each dot on the chart represents the lexical item (type) at a given rank
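"Inversely proportional to rank" can be written as f(r) = C / r for some constant C. The sketch below takes C to be the observed frequency at rank 1, which is a simplifying assumption made here for illustration (in practice the constant, and often an exponent on r, would be fitted to the data). It compares the idealised prediction with the four Maltese rank-frequency values quoted in the lecture.

```python
def zipf_predicted(top_frequency, rank):
    """Idealised Zipfian frequency at a rank: f(r) = C / r, with C = f(1)."""
    return top_frequency / rank

# Observed rank -> frequency pairs from the Maltese rank-frequency list.
observed = {1: 2195, 2: 2080, 3: 1277, 4: 1264}

for rank, freq in observed.items():
    predicted = zipf_predicted(observed[1], rank)
    print(f"rank {rank}: observed {freq}, idealised {predicted:.1f}")
```

The observed values do not follow the idealised curve exactly, which is expected: the law describes the overall shape of the distribution rather than each individual rank.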

How Zipf’s law pans out (Maltese data)
[Figure: frequency plotted against rank. Annotations: a few high-frequency, low-rank words; hundreds of low-frequency, high-rank words.]

Zipf’s law cross-linguistically
o Empirical work has shown that the Zipfian distribution is observable:
  n independent of the language
  n irrespective of corpus size (for reasonably large corpora)
o The bigger your corpus:
  n the bigger your vocabulary size (no. types)
  n the more words of frequency 1 (hapax legomena)
o Why?

Some reasons
o If words were completely random, every word would be equally likely.
  n Our plot would be completely flat: all words at all ranks would have the same frequency.
o Language is absolutely non-random:
  n occurrence of words governed by:
    o syntax
    o author/speaker intentions
    o …
o Some words are the basic “skeleton” for our sentences. They are the most frequent.

Implications
o Traditional measures of central tendency (mean etc) not very useful.
o No two corpora can be directly compared if they are of different size:
  n vocab size increases with corpus size
  n most of the vocab made up of hapax legomena
  n most of the corpus size (no. tokens) made up of a few, very frequent types, typically function words.

Summary
o We’ve introduced some of the uses of corpora for lexicography.
o Focused today on word frequencies, especially Zipf’s law
  n looked at some of the implications
o Next up:
  n collocations and why they’re useful

References
o Baroni, M. (2007). Distributions in text. In A. Lüdeling and M. Kytö (eds.), Corpus linguistics: An international handbook. Berlin: Mouton de Gruyter.