Grammatical Noriegas: interaction in corpora and treebanks
ICAME 30, Lancaster, 27-31 May 2009
Sean Wallis, Survey of English Usage, University College London
s.wallis@ucl.ac.uk

Outline
• The probability of Noriega
• What can a parsed corpus tell us?
• Individual choices
• Repeating choices
• Potential sources of interaction
• Case interaction
• LITEs
• What use is interaction evidence?

The probability of Noriega (Church 2000)
• Ken Church looked at word frequency in corpus data
  – Method
    • Find the probability of a word occurring overall, pr(w)
    • Divide each text into two halves: T1, T2
    • Q: What is the probability of the word in T2 if it has already been found in T1, pr(w in T2 | w in T1)?
  – Result
    • ‘Content words’ like Noriega leap in probability if seen before: pr(w in T2 | w in T1) >> pr(w in T2)
    • Pronouns, determiners, etc.: no change
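Church's method can be sketched in a few lines. The following is only an illustration, assuming each text is a list of tokens split at its midpoint; the function name `adaptation` and the four toy texts are invented for the example and are not Church's data.

```python
def adaptation(texts, word):
    """Return (pr(w in T2), pr(w in T2 | w in T1)) over a list of tokenised texts."""
    in_t1 = 0    # texts whose first half contains the word
    in_t2 = 0    # texts whose second half contains the word
    in_both = 0  # texts where both halves contain the word
    for tokens in texts:
        mid = len(tokens) // 2
        t1, t2 = set(tokens[:mid]), set(tokens[mid:])
        in_t1 += word in t1
        in_t2 += word in t2
        in_both += (word in t1) and (word in t2)
    prior = in_t2 / len(texts)
    adapted = in_both / in_t1 if in_t1 else float("nan")
    return prior, adapted


# Invented toy texts (hypothetical data, for illustration only).
texts = [
    "the general noriega fled and the troops sought noriega".split(),
    "the ship left the harbour before the storm".split(),
    "the market rose then the market fell".split(),
    "the court heard that noriega left the country".split(),
]

content_word = adaptation(texts, "noriega")
function_word = adaptation(texts, "the")
```

In this toy sample the content word's probability rises once it has been seen in T1, while the determiner is unchanged, which is the pattern the slide describes.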

What can a parsed corpus tell us?
• Parsed corpora contain (lots of) trees
  – Use Fuzzy Tree Fragment (FTF) queries to get data
  – Using ICECUP
[Figure: an FTF and a matching case in a tree]

What can a parsed corpus tell us?
• Three kinds of evidence may be obtained from a parsed corpus
  – Frequency evidence of a particular known rule, structure or linguistic event
  – Coverage evidence of new rules, etc.
  – Interaction evidence of the relationship between rules, structures and events
• Evidence is necessarily framed within a particular grammatical scheme
  – So… (an obvious question) how might we evaluate this grammar?

Individual choices (Nelson, Wallis & Aarts 2002)
• What factors affect a lexical / grammatical choice?
  – experiment: does IV → DV?
    • Independent Variable (IV) = sociolinguistic or grammatical
    • Dependent Variable (DV) = grammatical alternation
  – carry out a χ² test
  – e.g. does the type of preceding NP head affect the choice between relative and non-finite postmodification?
    people who live in Hawaii vs. those living in Hawaii

                     DV
    IV (NP head)   rel.     nonfin.   Total
    N              6,790    6,193     12,983
    PRON             771      446      1,217
    Total          7,561    6,639     14,200

  – a significant but small interaction
  – for more complex experiments repeat with multiple variables (ICECUP IV)
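The test on the table above can be reproduced with a short sketch. It assumes a plain Pearson χ² without continuity correction and the standard 0.05 critical value for one degree of freedom (3.841); the function and variable names are illustrative, not part of ICECUP.

```python
def chi_square(table):
    """Pearson chi-square for an r x c contingency table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Rows: NP head type (N, PRON); columns: relative, non-finite postmodification.
table = [[6790, 6193],
         [771, 446]]

chi2 = chi_square(table)
CRITICAL_1DF_05 = 3.841           # chi-square critical value, df = 1, alpha = 0.05
significant = chi2 > CRITICAL_1DF_05

# Proportion of relative clauses for each head type, to show the effect size.
share_relative = [row[0] / sum(row) for row in table]
```

The χ² value comes out well above the critical value, yet the proportion of relative clauses differs by only about eleven percentage points between noun and pronoun heads, which fits the slide's "significant but small" reading.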

Repeating choices (Wallis, submitted)
• Construction often involves repetition
  – e.g. repeated decisions to add an attributive AJP to specify a NP head: the ship + tall, white = the tall white ship
• Sequential probability analysis
  – calculate probability of adding each AJP
  – probability falls
    • second < first
    • third < second
    • fourth < second
  – choices interact
  – a feedback loop
[Figure: probability of adding each successive AJP]
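Sequential probability analysis can be sketched as follows, assuming the input is the number of NPs containing at least k attributive AJPs for k = 0, 1, 2, ... The counts below are invented for illustration and are not the ICE-GB figures.

```python
def sequential_probabilities(at_least):
    """Probability of adding the k-th item, given that k-1 items are already present.

    at_least[k] is the number of cases with at least k items, so
    p(k) = at_least[k] / at_least[k - 1].
    """
    return [at_least[k] / at_least[k - 1] for k in range(1, len(at_least))]


# Hypothetical counts: NPs with at least 0, 1, 2, 3, 4 attributive AJPs.
at_least = [200000, 38000, 2900, 150, 6]

probabilities = sequential_probabilities(at_least)
falls_each_step = all(b < a for a, b in zip(probabilities, probabilities[1:]))
```

If the choices were independent, each probability would be roughly the same; a consistent fall is the kind of evidence the slide takes to show that the choices interact.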

Repeating choices - more examples
• Adjectives before a noun
  – similar to AJPs before a noun NP head
• AVPs before a verb
  – no interaction
• NP postmodification, embedded vs. multiple
  – both interact
  – the probability of postmodification of the same head falls faster than that for embedding

Potential sources of interaction
• shared context
  – topic or ‘content words’ (Noriega)
• idiomatic conventions
  – semantic ordering of attributive adjectives (tall white ship)
• logical semantic constraints
  – exclusion of incompatible adjectives (?tall short ship)
• communicative constraints
  – brevity on repetition (just say ship next time)
• psycholinguistic processing constraints
  – attention and memory of speakers

Case interaction (new research)
• Individual choice experiments
  – measure interaction between variables
  – statistics assume that cases are independent
• We know AJPs in an NP interact
  – what if we study AJPs?
• Cases from the same text may also interact

Case interaction (new research)
• Cases should be independent
  – what can we do?
    • ignore problem
    • discount ‘obvious’ duplicate cases
    • randomly subsample
    • take only one case per text
    • score each case by the degree to which it interacts with others from the same text
• We need a model of case interaction
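One of the simpler options listed above, taking only one case per text, might look like the sketch below. It assumes each case is a dictionary carrying a text identifier; the field names and the sample cases are invented for illustration.

```python
import random


def one_case_per_text(cases, seed=0):
    """Keep a single randomly chosen case from each source text."""
    rng = random.Random(seed)  # seeded so the subsample is reproducible
    by_text = {}
    for case in cases:
        by_text.setdefault(case["text"], []).append(case)
    return [rng.choice(group) for group in by_text.values()]


# Hypothetical cases: the text identifiers and choices are made up.
cases = [
    {"text": "S1A-001", "choice": "rel"},
    {"text": "S1A-001", "choice": "nonfin"},
    {"text": "S1A-002", "choice": "rel"},
    {"text": "W2B-010", "choice": "nonfin"},
    {"text": "W2B-010", "choice": "nonfin"},
]

sample = one_case_per_text(cases)
```

This guarantees that no two sampled cases share a text, but it discards most of the data, which is one motivation for scoring cases by their degree of interaction instead.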

Case interaction (new research)
• An a posteriori model of case interaction
  – classify grammatical relationships between A and B
  – measure interaction strength dp(A, B) between A and B in each relationship
  – compute marginal probability for each case A from dependent probabilities dp(A, B), dp(A, C)...

Classify grammatical relationships
• Order
  – word order, dominance (parent-child vs. child-parent), etc.
• Topology
  – basic relationship: word, sibling, dominance etc.
• Grammar
  – subclassify topology by grammar
  – e.g. distinguishing co-ordination from other clauses
• Distance
  – steps along an axis and how steps are measured
  – e.g. whether to include all intermediate elements

Measure interaction strength
• Previous experiments involved single events
  – Bayesian probability differences (‘swing’)
    • Noriega ‘content words’: pr(a | b) – pr(a)
    • Repeating choices: pr(a2 | a1) – pr(a1 | a0)
• Interaction between two groups of (alternate) events
  – Difference in probabilities of choice
  – Bayesian dependence dp. B
    • sum relative probability difference
  – Cramér’s φc
    • based on chi-square (χ²)
    • not affected by direction
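Two of the measures named above can be sketched directly. The swing is a simple difference of probabilities; Cramér's φc is computed here with the standard formula sqrt(χ² / (N × (k − 1))), where k is the smaller of the number of rows and columns. The function names are illustrative, and the Bayesian dependence measure is not attempted because the slide does not define it.

```python
import math


def swing(p_after, p_before):
    """Difference between two probabilities, e.g. pr(a | b) - pr(a)."""
    return p_after - p_before


def cramers_phi(table):
    """Cramér's phi for an r x c contingency table, based on Pearson chi-square."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip empty rows or columns
                chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))


# The NP head / postmodification table from the individual choices slide.
table = [[6790, 6193], [771, 446]]
transposed = [list(col) for col in zip(*table)]

phi = cramers_phi(table)
phi_transposed = cramers_phi(transposed)
```

Transposing the table swaps which variable is treated as the predictor, and the score stays the same, which is the sense in which φc is not affected by direction.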

Compute marginal probability
• Find the probability that A is dependent on other cases
  – Suppose two other cases B and C exist with dependent probabilities dp(A, B), dp(A, C), and B and C also interact with φc(B, C)
  – if φc(B, C) = 1 then dp(A) = maximum dp
  – if φc(B, C) = 0 then dp(A) = area
  – interpolate for other values of φc
• Then compute marginal probability
  – ip(A) = 1 − dp(A) + dp(A) / (2 + φc(B, C))
• Extend to more than three cases!
[Figure: diagram with regions labelled dependent and independent]
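The three-case calculation can be sketched as below, under two loudly stated assumptions that the slide does not spell out: "area" is read as the probabilistic union dp(A, B) + dp(A, C) − dp(A, B) × dp(A, C), and the interpolation between the two end points is taken to be linear in φc. The marginal formula is read as ip(A) = 1 − dp(A) + dp(A) / (2 + φc(B, C)), which gives each of three fully dependent cases a weight of one third.

```python
def dependent_probability(dp_ab, dp_ac, phi_bc):
    """Probability that case A depends on B or C, given how strongly B and C interact."""
    maximum = max(dp_ab, dp_ac)            # phi = 1: B and C act as a single source
    area = dp_ab + dp_ac - dp_ab * dp_ac   # ASSUMPTION: 'area' read as a probabilistic union
    # ASSUMPTION: linear interpolation between the two end points.
    return (1 - phi_bc) * area + phi_bc * maximum


def independent_probability(dp_ab, dp_ac, phi_bc):
    """Marginal weight ip(A) = 1 - dp(A) + dp(A) / (2 + phi(B, C))."""
    dp_a = dependent_probability(dp_ab, dp_ac, phi_bc)
    return 1 - dp_a + dp_a / (2 + phi_bc)


fully_dependent = independent_probability(1.0, 1.0, 1.0)  # three cases behaving as one
fully_independent = independent_probability(0.0, 0.0, 0.5)
partial = independent_probability(0.5, 0.2, 0.0)
```

A case that does not depend on its neighbours keeps a weight of 1, while a case that is wholly determined by two mutually dependent neighbours is down-weighted to a third.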

LITEs (new research)
• Case interaction models
  – classify grammatical relationships
  – measure interaction strength between two choices
• A legitimate experimental method?
  – cf. transmission experiments in physics
• Linguistic interaction transmission experiments?
[Figure: emitter, medium, receiver]

LITEs (new research)
• A LITE investigates the interaction between two choices in a defined relationship
  – emitter/receiver
  – medium
  – up+down distance d via a clause C
    • non-finite vs. relative clauses
    • co-ordinated clauses; other clauses
  – Plot φc over d
    • skip intermediate co-ordination nodes
  – Result
    • co-ordination exhibits >1.5× interaction for this choice
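Plotting φc over d amounts to building one 2 × 2 table of emitter choice against receiver choice for each distance. The sketch below assumes the data arrive as (distance, emitter choice, receiver choice) tuples; that representation, the names, and the counts are all hypothetical. For a 2 × 2 table, φ can be computed with the shortcut |ad − bc| / sqrt(product of the four marginal totals).

```python
import math
from collections import defaultdict

CHOICES = ("rel", "nonfin")  # relative vs. non-finite postmodifying clause


def phi_2x2(a, b, c, d):
    """Cramér's phi for the 2 x 2 table [[a, b], [c, d]]."""
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return abs(a * d - b * c) / math.sqrt(denominator) if denominator else 0.0


def phi_by_distance(pairs):
    """Map each distance d to phi between emitter and receiver choices at that distance."""
    tables = defaultdict(lambda: [[0, 0], [0, 0]])
    for distance, emitter, receiver in pairs:
        tables[distance][CHOICES.index(emitter)][CHOICES.index(receiver)] += 1
    return {d: phi_2x2(t[0][0], t[0][1], t[1][0], t[1][1])
            for d, t in sorted(tables.items())}


# Invented pairs: agreement between the two choices weakens as distance grows.
pairs = (
    [(1, "rel", "rel")] * 8 + [(1, "rel", "nonfin")] * 2
    + [(1, "nonfin", "rel")] * 2 + [(1, "nonfin", "nonfin")] * 8
    + [(2, "rel", "rel")] * 6 + [(2, "rel", "nonfin")] * 4
    + [(2, "nonfin", "rel")] * 4 + [(2, "nonfin", "nonfin")] * 6
    + [(3, "rel", "rel")] * 5 + [(3, "rel", "nonfin")] * 5
    + [(3, "nonfin", "rel")] * 5 + [(3, "nonfin", "nonfin")] * 5
)

curve = phi_by_distance(pairs)
```

Running the same calculation separately for pairs linked through co-ordinated clauses and through other clauses would give the two curves that the slide compares.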

What use is interaction evidence?
• New methods for evaluating interaction along grammatical axes
  – General purpose, robust, structural
  – Based on grammar in corpus
  – Classifying grammatical relationships allows us to experiment with the corpus grammar
• Methods have philosophical implications
  – Grammar structure framing linguistic choices
  – Linguistics as an evaluable observational science
    • Signature (trace) of language production decisions
  – A unification of theoretical and corpus linguistics?

What use is interaction evidence?
• Corpus linguistics
  – Optimising existing grammar
    • e.g. co-ordination, compound nouns
• Theoretical linguistics
  – Comparing different grammars, same language
  – Comparing different languages or periods
• Psycholinguistics
  – Search for evidence of language production constraints in spontaneous speech corpora
    • speech and language therapy
    • language acquisition and development

More information
• Useful links
  – Survey of English Usage
    • www.ucl.ac.uk/english-usage
  – Fuzzy Tree Fragments
    • www.ucl.ac.uk/english-usage/resources/ftfs
  – Individual choice experiments with FTFs
    • www.ucl.ac.uk/english-usage/resources/ftfs/experiment.htm
  – To obtain ICE-GB (or DCPSE)
    • www.ucl.ac.uk/english-usage/resources/sales.htm
• References
  Church, K. 2000. Empirical Estimates of Adaptation: The chance of Two Noriegas is closer to p/2 than p². Proceedings of Coling-2000. 180-186.
  Nelson, G., Wallis, S. A. & Aarts, B. 2002. Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins.
  Wallis, S. A. (submitted). Capturing linguistic interaction in a grammar: a method for empirically evaluating the grammar of a parsed corpus. Language. Available from www.ucl.ac.uk/english-usage/staff/sean/resources/analysing-grammatical-interaction.pdf