Information Theory Metrics
Giancarlo Schrementi


Expected Value • Example: Die Roll: (1/6)*1 + (1/6)*2 + (1/6)*3 + (1/6)*4 + (1/6)*5 + (1/6)*6 = 3.5 • The equation gives you the value that you can expect an event to take.
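
The expected value equation itself did not survive the slide extraction; presumably it is the standard definition, shown here for reference:

```latex
% Expected value of a discrete random variable X:
% a sum over all possible events, each weighted by its probability.
\[
E[X] = \sum_{x} p(x)\, x
\]
% Die roll example from the slide:
\[
E[X] = \tfrac{1}{6}(1+2+3+4+5+6) = 3.5
\]
```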

A Template Equation • Expected Value forms a template for many of the equations in Information Theory. • Notice it has three parts: a summation over all possible events, the probability of an event, and the value of that event. • It can be thought of as a weighted average.
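
A minimal sketch of the template in code; the function name and the example distribution are illustrative, not from the slides:

```python
# Expected value as a weighted average:
# sum over all events of (probability of event) * (value of event).
def expected_value(distribution):
    """distribution: iterable of (probability, value) pairs."""
    return sum(p * v for p, v in distribution)

# Fair die roll: each face has probability 1/6.
die = [(1/6, face) for face in range(1, 7)]
print(expected_value(die))  # 3.5
```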

Entropy as an EV • Notice it has the same three parts. • The value here is the negative log base 2 of the probability (equivalently, the lg of one over the probability), which can be seen as the amount of information that an event transmits. • Since it’s a logarithm of an inverse probability, less likely occurrences are more informative than highly likely occurrences. • So Entropy can be thought of as the expected amount of information that we will receive.
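
The entropy equation the slide refers to is not in the extracted text; it is presumably the standard Shannon entropy, which fits the three-part template:

```latex
% Entropy: an expected value whose "value" term is the information
% content -log2 p(x) of each event.
\[
H(X) = \sum_{x} p(x)\,\log_2 \frac{1}{p(x)} = -\sum_{x} p(x)\,\log_2 p(x)
\]
```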

Mutual Information • Our three friends show up again. • The value this time is the joint probability divided by the product of the two marginal probabilities. • This equation is definitely an expected value equation, but what does that value tell us?
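
Again the equation itself was lost in extraction; the standard form, matching the slide’s description, is:

```latex
% Mutual information: an expected value of the log-ratio between the
% joint probability and the product of the marginals.
\[
I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log_2 \frac{p(x,y)}{p(x)\,p(y)}
\]
```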

MI’s Value • The bottom tells us the probability of the two events occurring if they are independent. • The top is the actual probability of the two events occurring, which is going to be different if they are dependent in some way. • Dividing the two tells us how much more likely one is than the other. • Thus the value tells us how much more likely the joint event is than it would be if the two were independent, giving us a measure of dependence. • The lg scales the value into bits of information that one event tells us about the other event.
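
A small worked example with made-up numbers (not from the slides) to illustrate the ratio:

```latex
% Hypothetical probabilities: p(x,y) = 0.2 and p(x) = p(y) = 0.25,
% so p(x)p(y) = 0.0625.
\[
\frac{p(x,y)}{p(x)\,p(y)} = \frac{0.2}{0.0625} = 3.2,
\qquad \log_2 3.2 \approx 1.68 \text{ bits}
\]
% The joint event is 3.2 times more likely than independence would
% predict, i.e. one event carries about 1.68 bits about the other.
```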

Mutual Information as an EV • Mutual Information can be looked at as the expected number of bits that one event tells us about another event in the distribution. This can be thought of as how many fewer bits are needed to encode the next event because of your knowledge of the prior event. • An MI of zero means that every event is independent of every other event. • It is also symmetric in X and Y and is always non-negative.
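
The "fewer bits to encode the next event" reading corresponds to a standard identity (a well-known fact about MI, not shown on the slide):

```latex
% MI as a reduction in coding cost, plus the symmetry and
% non-negativity properties mentioned on the slide.
\[
I(X;Y) = H(Y) - H(Y \mid X) = H(X) - H(X \mid Y),
\qquad I(X;Y) = I(Y;X) \ge 0,
\]
% with equality to zero exactly when X and Y are independent.
```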

Kullback-Leibler Divergence • Once again it looks like an EV equation. • P and Q are two distributions and P(x) is the probability of event x in P. • The division in the value part tells us how much more/less likely an event is in distribution P than it would be in Q. • The lg scales this to be in bits, but what does that tell us?
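
The equation is missing from the extracted text; presumably it is the usual definition:

```latex
% Kullback-Leibler divergence of P from Q, in bits.
\[
D_{\mathrm{KL}}(P \parallel Q) = \sum_{x} P(x)\,\log_2 \frac{P(x)}{Q(x)}
\]
```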

What does KL tell us? • The value part tells us the difference in informational content between event x in distribution P and event x in distribution Q. • So KL tells us the expected difference in informational content of events in the two distributions. • This can be thought of as the difference in the number of bits needed to encode event x in the two distributions.
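
The "informational content difference" reading follows from rewriting the value part (standard algebra, not shown on the slide):

```latex
% The log-ratio is the difference between the code lengths, in bits,
% that the two distributions assign to the same event x:
% the first term is the bits for x under Q, the second under P.
\[
\log_2 \frac{P(x)}{Q(x)} = \log_2 \frac{1}{Q(x)} - \log_2 \frac{1}{P(x)}
\]
```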

Other KL Observations • As it is written above it is not symmetric and does not obey the triangle inequality, but it is non-negative. • If P and Q are the same distribution then the equation results in zero. • Kullback and Leibler actually define it as the sum of the equation above plus its counterpart where P and Q are switched. It then becomes symmetric.
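
Rendering that symmetrized form in notation (the name J here is conventional, not from the slide):

```latex
% Symmetrized divergence: the sum of the two directed divergences.
\[
J(P,Q) = D_{\mathrm{KL}}(P \parallel Q) + D_{\mathrm{KL}}(Q \parallel P)
\]
```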

Comparing MI and KL • The two equations are noticeably similar. • In fact you can express MI as the KL divergence between p(i, j) and p(i)p(j). • This tells us that MI is computing the divergence between the true joint distribution and the distribution where each event is completely independent, essentially computing KL with respect to a reference point. • KL divergence gives us a measure of how different two distributions are; MI gives us a measure of how different a distribution is from what it would be if its events were independent.
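
In the slide’s i, j notation, the relationship is:

```latex
% Mutual information as a KL divergence from the product of marginals.
\[
I = D_{\mathrm{KL}}\!\left( p(i,j) \,\middle\|\, p(i)\,p(j) \right)
  = \sum_{i,j} p(i,j)\,\log_2 \frac{p(i,j)}{p(i)\,p(j)}
\]
```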

Sample Application: Chunk Parser • Chunk parsers attempt to divide a sentence into its logical/grammatical units. In this case it is done recursively, by splitting the sentence into two chunks, then splitting those two chunks into two chunks, and so on. • MI and KL are both measures that can tell us how statistically related two words are to each other. • If two words are highly related to each other, there is not likely to be a chunk boundary between them; likewise, if they are largely unrelated, then between them would be a good location to suppose a chunk boundary (see the sketch below).
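
A minimal sketch of the recursive two-way splitting idea, assuming a generic association score between adjacent words (such as the point-wise scores on the next slide). The `chunk` function and `score` parameter are my own illustration, not the implementation from the cited papers:

```python
# Recursively split a word list at the position where adjacent words are
# least associated, i.e. the most plausible chunk boundary.
# `score(left_word, right_word)` is assumed to return an association
# value such as point-wise MI; higher means more strongly related.
def chunk(words, score, min_len=1):
    """Recursively split `words` into nested two-way chunks."""
    if len(words) < 2 * min_len or len(words) < 2:
        return words
    # Candidate boundary i splits words[:i] from words[i:].
    candidates = range(min_len, len(words) - min_len + 1)
    split = min(candidates, key=lambda i: score(words[i - 1], words[i]))
    return [chunk(words[:split], score, min_len),
            chunk(words[split:], score, min_len)]
```

Each call picks the weakest link as the boundary and then recurses into both halves, producing the nested binary chunks the slide describes.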

MI and KL in this Task • Here we are only concerned with the values at a point, not the expected value over all points, so the summations and the weight are missing, giving us a ‘point-wise’ calculation. • These calculations are across a bigram <xy> in a text, where x and y are two words. • p(<xy>) is the probability of a bigram occurring. • Note in the KL equation what P and Q are: Q is p(y), which can be thought of as the prior, and P is p(y|x), which can be thought of as the posterior. This gives you the information difference that knowledge of x brings to the probability of the right word y.
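
A sketch of the two point-wise scores computed from raw bigram counts, following the slide’s description; variable names are mine, and the cited papers may differ in the exact estimation details:

```python
import math
from collections import Counter

def pointwise_scores(words):
    """Return (pmi, pkl) scoring functions for bigrams <x y> observed in `words`."""
    bigrams = list(zip(words, words[1:]))
    uni, bi = Counter(words), Counter(bigrams)
    n_uni, n_bi = len(words), len(bigrams)

    def p(w):                      # marginal probability of a word
        return uni[w] / n_uni

    def p_xy(x, y):                # bigram probability p(<xy>)
        return bi[(x, y)] / n_bi

    def pmi(x, y):                 # lg( p(<xy>) / (p(x) p(y)) )
        return math.log2(p_xy(x, y) / (p(x) * p(y)))

    def pkl(x, y):                 # lg( p(y|x) / p(y) ): posterior vs. prior
        p_y_given_x = bi[(x, y)] / uni[x]
        return math.log2(p_y_given_x / p(y))

    # Note: both scores are only defined for bigrams actually observed
    # in the text; an unseen bigram would give log of zero.
    return pmi, pkl
```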

Results of the Comparison • First, note that these equations can result in negative numbers if the top probability is lower than the bottom one. Also, MI is no longer symmetrical because p(<xy>) is different from p(<yx>). • This asymmetry is useful for language analysis because most languages are direction dependent, and so we want our metrics to be sensitive to that. • They both provide an accurate dependence measure, but the KL measure provides more exaggerated values the further it gets from zero. This is because p(y|x) can get much higher than p(<xy>) will ever be. • This makes KL more useful for this context in that relationships are more striking and thus easier to distinguish from noise.

One Final Variant • These two equations are variants of mutual information designed to tell you how much a word tells you about the words that could occur to its left or to its right. • Notice that it’s an expected value over all the bigrams in which the word occurs on the right or the left, weighted by the conditional probability of that bigram given the word in question. • This can be used to give you an estimate of the handedness of a word, or how much the word restricts what can occur to the left or the right of it.
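
The two equations are not in the extracted text. Based on the description (an expected value over the bigrams containing the word, weighted by the conditional probability of the bigram given that word), they presumably have roughly this shape; this is a speculative reconstruction, not the slide’s exact notation:

```latex
% Rightward variant: what word w tells us about words v to its right.
\[
\mathrm{MI_R}(w) = \sum_{v} p(v \mid w)\,
    \log_2 \frac{p(\langle w\,v \rangle)}{p(w)\,p(v)}
\]
% Leftward variant, over bigrams <v w> where v occurs to the left of w.
\[
\mathrm{MI_L}(w) = \sum_{v} p(v \mid w)\,
    \log_2 \frac{p(\langle v\,w \rangle)}{p(v)\,p(w)}
\]
```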

References • S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics 22(1): 79-86, March 1951. • Damir Ćavar, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues, Giancarlo Schrementi. "On Statistical Parameter Setting." Proceedings of the First Workshop on Psycho-computational Models of Human Language Acquisition (COLING-2004). Geneva, Switzerland. August 28-29, 2004. • Damir Ćavar, Paul Rodrigues, Giancarlo Schrementi. "Syntactic Parsing Using Mutual Information and Relative Entropy." Proceedings of the Midwest Computational Linguistics Colloquium (MCLC). Bloomington, IN, USA. July 2004. • http://en.wikipedia.org/wiki/Mutual_information • http://en.wikipedia.org/wiki/Kullback-Leibler_divergence