Corpora and Statistical Methods Part 2 Albert Gatt
Preliminaries: Hypothesis testing and the binomial distribution
Permutations
• Suppose we have the 5 words {the, dog, ate, a, bone}
• How many permutations (possible orderings) are there of these words?
  • the dog ate a bone
  • dog the ate a bone
  • …
• In general there are n! orderings of n words; e.g. there are 5! = 120 ways of permuting 5 words.
Binomial coefficient
• Slight variation: how many different choices of three words are there out of these 5?
• This is known as an “n choose k” problem, in our case “5 choose 3”:
  C(n, k) = n! / (k!(n − k)!)
• For our problem, this gives us C(5, 3) = 10 ways of choosing three items out of 5.
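Both counts can be checked directly with the Python standard library; this is a minimal sketch of the two slides above:

```python
from math import comb, factorial

# Permutations: the number of orderings of 5 distinct words
words = ["the", "dog", "ate", "a", "bone"]
n_orderings = factorial(len(words))   # 5! = 120

# Binomial coefficient: "5 choose 3" ways of picking 3 of the 5 words
n_choices = comb(5, 3)                # 5!/(3! * 2!) = 10

print(n_orderings, n_choices)  # 120 10
```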
Bernoulli trials
• A Bernoulli (or binomial) trial is like a coin flip. Features:
  1. There are two possible outcomes (not necessarily with the same likelihood), e.g. success/failure or 1/0.
  2. If the situation is repeated, then the likelihoods of the two outcomes are stable.
Sampling with/without replacement
• Suppose we’re interested in the probability of pulling a function word out of a corpus of 100 words.
  • we pull out words one by one without putting them back
• Is this a Bernoulli trial?
  • we have a notion of success/failure: w is either a function word (“success”) or not (“failure”)
  • but our chances aren’t the same across trials: they diminish, since we sample without replacement
Cutting corners
• If the sample (e.g. the corpus) is large enough, then we can assume a Bernoulli situation even if we sample without replacement.
• Suppose our corpus has 52 million words
• Success = pulling out a function word
• Suppose there are 13 million function words
  • First trial: p(success) = 13,000,000/52,000,000 = .25
  • Second trial (after one success): p(success) = 12,999,999/51,999,999 ≈ .25
• On very large samples, the chances remain relatively stable even without replacement.
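A quick sketch of why the corner-cutting is safe at this scale, using the corpus figures above:

```python
total_words = 52_000_000
function_words = 13_000_000

# First trial: probability of drawing a function word
p_first = function_words / total_words

# Second trial, after removing one function word (no replacement)
p_second = (function_words - 1) / (total_words - 1)

# Both are ~0.25: the change across trials is negligible
print(p_first, p_second)
```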
Binomial probabilities – I
• Let π represent the probability of success on a Bernoulli trial (e.g. our simple word game on a large corpus).
• Then p(failure) = 1 − π
• Problem: what are the chances of achieving success 3 times out of 5 trials?
• Assumption: each trial is independent of every other.
  • (Is this assumption reasonable?)
Binomial probabilities – II
• How many ways are there of getting success three times out of 5?
• Several: SSSFF, SFSFS, SFSSF, …
• To count the number of possible ways of getting k outcomes from n trials, we use the binomial coefficient:
  C(n, k) = n! / (k!(n − k)!)
Binomial probabilities – III
• “5 choose 3” gives 10.
• Given independence, each of these sequences is equally likely.
• What’s the probability of a sequence?
  • it’s an AND problem (multiplication rule)
  • P(SSSFF) = πππ(1 − π)(1 − π) = π³(1 − π)²
  • P(SFSFS) = π(1 − π)π(1 − π)π = π³(1 − π)²
  • (they all come out the same)
Binomial probabilities – IV
• The binomial distribution states that, given n Bernoulli trials with probability π of success on each trial, the probability of getting exactly k successes is:
  P(X = k) = C(n, k) πᵏ (1 − π)ⁿ⁻ᵏ
  • C(n, k): the number of different ways of getting k successes
  • πᵏ: the probability of k successes
  • (1 − π)ⁿ⁻ᵏ: the probability of the n − k failures
Expected value and variance
• Expected value of X over n trials: E(X) = nπ
• Variance of X over n trials: Var(X) = nπ(1 − π)
• where π is our probability of success
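The last three slides fit together in a few lines; this sketch computes the binomial probability, mean and variance directly from the definitions:

```python
from math import comb

def binom_pmf(k: int, n: int, pi: float) -> float:
    """P(exactly k successes in n Bernoulli trials with success prob. pi)."""
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

def binom_mean_var(n: int, pi: float):
    """Expected value n*pi and variance n*pi*(1-pi) over n trials."""
    return n * pi, n * pi * (1 - pi)

# 3 successes out of 5 trials with a fair coin: 10 * 0.5^5 = 0.3125
print(binom_pmf(3, 5, 0.5))
print(binom_mean_var(5, 0.5))  # (2.5, 1.25)
```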
Using the t-test for collocation discovery
The logic of hypothesis testing
• The typical scenario in hypothesis testing compares two hypotheses:
  1. The research hypothesis
  2. A null hypothesis
• The idea is to set up our experiment (study, etc.) in such a way that if we show the null hypothesis to be false, then we can affirm our research hypothesis with a certain degree of confidence.
H0 for collocation studies
• There is no real association between w1 and w2, i.e. occurrence of <w1, w2> is no more likely than chance.
• More formally:
  • H0: P(w1 & w2) = P(w1)P(w2)
  • i.e. w1 and w2 are independent
Some more on hypothesis testing
• Our research hypothesis (H1): <w1, w2> are strong collocates
  • P(w1 & w2) > P(w1)P(w2)
• A null hypothesis H0
  • P(w1 & w2) = P(w1)P(w2)
• How do we know whether our results are sufficient to affirm H1?
  • i.e. how big is our risk of wrongly rejecting H0?
The notion of significance
• We generally fix a “level of confidence” in advance.
• In many disciplines, we’re happy with being 95% confident that the result we obtain is correct.
  • So we have a 5% chance of error.
• Therefore, we state our results at p = 0.05:
  • “The probability of wrongly rejecting H0 is 5% (0.05)”
Tests for significance
• Many of the tests we use involve:
  1. having a prior notion of what the mean/variance of a population is, according to H0
  2. computing the mean/variance on our sample of the population
  3. checking whether the sample mean/variance differs from the value predicted by H0, at 95% confidence.
The t-test: strategy
• obtain the mean (x̄) and variance (s²) for a sample
• H0: the sample is drawn from a population with mean μ and variance σ²
• estimate the t value: this compares the sample mean/variance to the expected (population) mean/variance under H0
• check whether any difference found is significant enough to reject H0
Computing t
• calculate the difference between the sample mean and the expected population mean, and scale the difference by the variance:
  t = (x̄ − μ) / √(s²/N)
• Assumption: the population is normally distributed.
• If t is big enough, we reject H0. Whether t is big enough, given our sample size N, is simply looked up in a table.
• Tables tell us what the level of significance is (p-value, or likelihood of making a Type I error: wrongly rejecting H0).
Example: new companies
• We think of our corpus as a series of bigrams, and each sample we take is an indicator variable (Bernoulli trial):
  • value = 1 if a bigram is new companies
  • value = 0 otherwise
• Compute P(new) and P(companies) using standard MLE.
• H0: P(new companies) = P(new)P(companies)
Example continued
• We have computed the likelihood of our bigram of interest under H0.
  • Since this is a Bernoulli trial, this is also our expected mean.
• We then compute the actual sample probability of <w1, w2> (new companies).
• Compute t and check significance.
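The whole recipe can be sketched in a few lines (the function name and the toy counts below are illustrative, not from M&S): under H0 the expected mean μ is P(w1)P(w2), the sample mean x̄ is the bigram’s MLE probability, and for Bernoulli data with small x̄ the variance s² is approximately x̄ itself:

```python
import math

def bigram_t(c_w1: int, c_w2: int, c_bigram: int, n: int) -> float:
    """One-sample t for a bigram over a corpus of n bigrams.

    mu comes from H0 (independence); the Bernoulli sample variance
    is approximated by the sample mean (s^2 ~ x for small x).
    """
    mu = (c_w1 / n) * (c_w2 / n)   # expected probability under H0
    x = c_bigram / n               # observed (MLE) probability
    return (x - mu) / math.sqrt(x / n)

# Toy counts: each word occurs 100 times, the bigram 10 times,
# in a corpus of 10,000 bigrams
print(bigram_t(100, 100, 10, 10_000))  # ~2.85
```

With t ≈ 2.85 and a large N, this toy bigram would pass the 0.05 significance threshold (critical value 1.645 for a one-tailed test).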
Uses of the t-test
• Often used to rank candidate collocations, rather than to compute significance.
• Stop-word lists must be used, otherwise nearly all bigrams come out significant.
  • e.g. M&S report that 824 out of 831 bigrams pass the significance test.
• Reason:
  • language is just not random
  • its regularities mean that, if the corpus is large enough, all bigrams will occur together regularly and often enough to be significant.
  • Kilgarriff (2005): any null hypothesis will be rejected on a large enough corpus.
Extending the t-test to compare samples
• Variation on the original problem: what co-occurrence relations best distinguish between two words, w1 and w1′, that are near-synonyms?
  • e.g. strong vs. powerful
• Strategy:
  • find all bigrams <w1, w2> and <w1′, w2>
    • e.g. strong tea, strong support
  • check, for each w2, whether it occurs significantly more often with w1 than with w1′.
• NB: this is a two-sample t-test.
Two-sample t-test: details
• H0: for any w2, the probabilities of <w1, w2> and <w1′, w2> are the same,
  • i.e. μ (the expected difference) = 0
• Strategy:
  • extract samples of <w1, w2> and <w1′, w2>
  • assume they are independent
  • compute the mean and SD for each sample
  • compute t
  • check for significance: is the magnitude of the difference large enough?
• Formula:
  t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
Simplifying under binomial assumptions
• For a Bernoulli variable with small success probability, the variance approaches the mean, i.e.:
  s1² = x̄1(1 − x̄1) ≈ x̄1 (and similarly for the other sample)
• Therefore, for two samples of equal size n:
  t ≈ (x̄1 − x̄2) / √((x̄1 + x̄2)/n)
  • or, directly in terms of the bigram counts: t ≈ (C1 − C2) / √(C1 + C2)
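Under those simplifications the two-sample t reduces to a function of the raw counts alone; this is one way to write the sketch:

```python
import math

def two_sample_t(c1: int, c2: int) -> float:
    """Approximate two-sample t from bigram counts c1 = C(<w1,w2>)
    and c2 = C(<w1',w2>), using s^2 ~ x-bar for Bernoulli data
    and equal-sized samples."""
    return (c1 - c2) / math.sqrt(c1 + c2)

# e.g. a word seen 9 times with w1 but only once with w1'
print(two_sample_t(9, 1))  # ~2.53
```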
Concrete example: strong vs. powerful (M&S, p. 167); NY Times
[Table from M&S: words occurring significantly more often with powerful than with strong, and words occurring significantly more often with strong than with powerful.]
Criticisms of the t-test
• Assumes that the probabilities are normally distributed. This is probably not the case in linguistic data, where probabilities tend to be very large or very small.
• Alternative: the chi-squared test (χ²)
  • compare differences between expected and observed frequencies (e.g. of bigrams)
The chi-square test
Example
• Imagine we’re interested in whether poor performance is a good collocation.
• H0: the frequency of poor performance is no different from the expected frequency if each word occurs independently.
• Find the frequencies of bigrams containing poor, performance, and poor performance.
  • compare actual to expected frequencies
  • check whether the value is high enough to reject H0
Example continued
OBSERVED FREQUENCIES

                     w1 = poor            w1 ≠ poor
w2 = performance     15                   1,230
                     (poor performance)   (e.g. bad performance)
w2 ≠ performance     3,580                12,000
                     (e.g. poor people)   (all other bigrams)

• Expected frequencies need to be computed for each cell, from the row and column totals.
  • e.g. the expected value for cell (1,1), poor performance:
    E₁₁ = (row₁ total × col₁ total) / N = (15 + 1,230)(15 + 3,580) / N
Computing the value
• The chi-squared value is the sum of the differences between observed and expected frequencies, scaled by the expected frequencies:
  χ² = Σᵢⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ
• The value is once again looked up in a table to check whether the degree of confidence (p-value) is acceptable.
  • If so, we conclude that the dependency between w1 and w2 is significant.
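A sketch of the χ² computation for a 2×2 contingency table like the one above (expected cell values come from the row and column totals):

```python
def chi_square_2x2(table) -> float:
    """Chi-squared statistic for a 2x2 table of observed counts,
    given as [[o11, o12], [o21, o22]]."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# Observed frequencies from the poor performance example
print(chi_square_2x2([[15, 1_230], [3_580, 12_000]]))
```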
More applications of this statistic
• Kilgarriff and Rose (1998) use chi-square as a measure of corpus similarity:
  • draw up an n (row) × 2 (column) table
    • columns correspond to corpora
    • rows correspond to individual types
  • compare the difference in counts between corpora
  • H0: the corpora are drawn from the same underlying linguistic population (e.g. register or variety)
  • corpora will be highly similar if the ratio of counts for each word is roughly constant.
• This uses lexical variation to compute corpus similarity.
Limitations of the t-test and chi-square
• Not easily interpretable:
  • a large chi-square or t value suggests a large difference
  • but it makes more sense as a comparative measure than in absolute terms
• The t-test is problematic because of the normality assumption.
• Chi-square doesn’t work very well for small frequencies (by convention, we don’t calculate it if the expected value for any of the cells is less than 5)
  • but n-grams will often be infrequent!
Likelihood ratios for collocation discovery
Rationale
• A likelihood ratio is the ratio of two probabilities: it indicates how much more likely one hypothesis is than another.
• Notation:
  • c1 = C(w1)
  • c2 = C(w2)
  • c12 = C(<w1, w2>)
• Hypotheses:
  • H0 (independence): P(w2|w1) = p = P(w2|¬w1)
  • H1 (dependence): P(w2|w1) = p1, P(w2|¬w1) = p2, with p1 ≠ p2
Computing the likelihood ratio
• Under H0, both conditional probabilities equal the overall rate of w2:
  P(w2|w1) = P(w2|¬w1) = p = c2/N
• Under H1:
  P(w2|w1) = p1 = c12/c1,  P(w2|¬w1) = p2 = (c2 − c12)/(N − c1)
• In each case, the probability that c12 of the c1 bigrams beginning with w1 are <w1, w2> is the binomial b(c12; c1, p), and the probability that c2 − c12 of the remaining N − c1 bigrams are <¬w1, w2> is b(c2 − c12; N − c1, p) (with p1 and p2 in place of p under H1).
Computing the likelihood ratio
• The likelihood that a hypothesis H is correct is L(H):
  • L(H0) = b(c12; c1, p) · b(c2 − c12; N − c1, p)
  • L(H1) = b(c12; c1, p1) · b(c2 − c12; N − c1, p2)
Computing the likelihood ratio
• We usually compute the log of the ratio:
  log λ = log ( L(H0) / L(H1) )
• Usually expressed as −2 log λ, because for very large samples this quantity is approximately χ²-distributed.
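Putting the last three slides together, −2 log λ can be computed directly from the counts; the binomial coefficients cancel in the ratio, so only the pᵏ(1 − p)ⁿ⁻ᵏ terms are needed. A sketch (assuming the counts keep p1 and p2 strictly between 0 and 1, so the logs are defined):

```python
import math

def log_l(k: int, n: int, p: float) -> float:
    """log of p^k * (1-p)^(n-k); the comb(n, k) factor cancels
    between the numerator and denominator of the ratio."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

def neg2_log_lambda(c1: int, c2: int, c12: int, n: int) -> float:
    """-2 log lambda for the bigram <w1, w2>."""
    p = c2 / n                   # H0: same rate of w2 everywhere
    p1 = c12 / c1                # H1: rate of w2 after w1
    p2 = (c2 - c12) / (n - c1)   # H1: rate of w2 elsewhere
    log_lam = (log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
               - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2))
    return -2 * log_lam

# A bigram seen 50 times where independence predicts only 1
print(neg2_log_lambda(100, 100, 50, 10_000))
```

When the observed count exactly matches the independence prediction (e.g. c12 = c1·c2/N), the ratio is 1 and −2 log λ is 0.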
Interpreting the ratio
• Suppose that the likelihood ratio for some bigram <w1, w2> is x. This says: if we make the hypothesis that w2 is somehow dependent on w1, then we expect it to occur x times more often than its actual base rate of occurrence would predict.
• This ratio is also better suited to sparse data:
  • we can use −2 log λ as an approximate chi-square value even when expected frequencies are small.
Concrete example: bigrams involving powerful (M&S, p. 174)
[Table from M&S; source: NY Times corpus (N = 14.3 million).]
• Note: sparse data can still have a high log-likelihood value! Interpreting −2 log λ as chi-squared allows us to reject H0 even for small samples (e.g. powerful cudgels).
Relative frequency ratios
• An extension of the same logic as the likelihood ratio, used to compare collocations across corpora.
• Let <w1, w2> be our bigram of interest, and let C1 and C2 be two corpora:
  • p1 = P(<w1, w2>) in C1
  • p2 = P(<w1, w2>) in C2
• r = p1/p2 gives an indication of the relative likelihood of <w1, w2> in C1 and C2.
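The ratio is just two MLE probabilities divided; a minimal sketch (n1 and n2 are the bigram totals of each corpus, and the counts below are illustrative):

```python
def relative_frequency_ratio(c_bigram1: int, n1: int,
                             c_bigram2: int, n2: int) -> float:
    """r = p1/p2: relative likelihood of a bigram in corpus C1 vs C2."""
    return (c_bigram1 / n1) / (c_bigram2 / n2)

# With equal-sized corpora, 2 occurrences vs. 44 gives r ~ 0.045
print(relative_frequency_ratio(2, 1_000_000, 44, 1_000_000))
```

Note that when the corpora differ in size, the raw counts alone are not enough: the normalisation by n1 and n2 matters.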
Example application
• Manning and Schütze (p. 176) compare:
  • C1: NY Times texts from 1990
  • C2: NY Times texts from 1989
• The bigram <East, Berliners> occurs 44 times in C2 but only 2 times in C1, so r ≈ 0.03.
• The big difference is due to the 1989 papers dealing more with the fall of the Berlin Wall.
Summary
• We’ve now considered two forms of hypothesis testing:
  • the t-test
  • chi-square
• Also, log-likelihood ratios as measures of relative probability under different hypotheses.
• Next, we begin to look at the problem of lexical acquisition.
References
• Church, K. and Hanks, P. (1990). Word association norms, mutual information and lexicography. Computational Linguistics 16(1).
• Kilgarriff, A. (2005). Language is never, ever, ever, random. Corpus Linguistics and Linguistic Theory 1(2): 263.
• Lapata, M., McDonald, S. and Keller, F. (1999). Determinants of adjective–noun plausibility. Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL-99).