
Workshop: Corpus (1) What might a corpus of spoken data tell us about language?
OLINCO 2014, Olomouc, Czech Republic, June 7
Sean Wallis, Survey of English Usage, University College London
s.wallis@ucl.ac.uk

Outline
• What can a corpus tell us?
• The 3 A cycle
• What can a parsed corpus tell us?
• ICE-GB and DCPSE
• Diachronic changes – modal shall/will over time
• Intra-structural priming – NP premodification
• The value of interaction evidence

What can a corpus tell us?
• Three kinds of evidence may be obtained from a corpus
– Frequency (distribution) evidence of a particular known linguistic event
– Coverage (discovery) evidence of new events
– Interaction evidence of the relationship between events
• But if these ‘events’ are lexical, this evidence can only really tell us about lexis
– So corpus linguistics has always involved annotation

The 3 A cycle
• Plain text corpora – evidence of lexical phenomena
• Need to annotate
– add knowledge of frameworks
– classify and relate phenomena
– general annotation scheme (not focused on particular research goals)
• Corpus research = the ‘3A’ cycle
– Annotation, Abstraction, Analysis
– data transformation (“operationalisation”)
[Diagram: Text –(Annotation)→ Corpus –(Abstraction)→ Dataset –(Analysis)→ Hypotheses]

Annotation Abstraction
• Abstraction
– selects data from the annotated corpus
– maps it to a regular dataset for statistical analysis
– bi-directional (“concretisation”): allows us to interpret statistically significant results
• Even ‘lexical’ questions need annotation:
– 1st person declarative modal verb shall/will – abstraction relies on annotation
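The abstraction step for the shall/will case can be sketched in code. This is a minimal illustration assuming a toy POS-tagged representation; the tag names, the first-person heuristic and the sentences are invented, not ICE-GB's actual annotation scheme or ICECUP's query machinery.

```python
# Hypothetical abstraction step: map annotated tokens to a frequency dataset.
# Tags ("MODAL", "PRON", "VERB") and the corpus below are illustrative only.

def extract_modal_choice(tagged_sentences):
    """Count 1st person declarative shall vs. will in POS-tagged sentences."""
    counts = {"shall": 0, "will": 0}
    for sentence in tagged_sentences:
        for i, (word, tag) in enumerate(sentence):
            if tag != "MODAL" or word.lower() not in counts:
                continue
            # Crude first-person check: modal immediately preceded by I or we.
            if i > 0 and sentence[i - 1][0].lower() in ("i", "we"):
                counts[word.lower()] += 1
    return counts

corpus = [
    [("I", "PRON"), ("shall", "MODAL"), ("go", "VERB")],
    [("we", "PRON"), ("will", "MODAL"), ("see", "VERB")],
    [("it", "PRON"), ("will", "MODAL"), ("rain", "VERB")],  # not 1st person
]
print(extract_modal_choice(corpus))  # → {'shall': 1, 'will': 1}
```

The point of the sketch is that even this ‘lexical’ count leans on annotation: without the POS tags and the subject, the first-person declarative cases cannot be isolated.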

What can a parsed corpus tell us?
• Three kinds of evidence may be obtained from a parsed corpus
– Frequency evidence of a particular known rule, structure or linguistic event
– Coverage evidence of new rules, etc.
– Interaction evidence of the relationship between rules, structures and events
• BUT evidence is necessarily framed within a particular grammatical scheme
– So… (an obvious question) how might we evaluate this grammar?

What can a parsed corpus tell us?
• Parsed corpora contain (lots of) trees
– Use Fuzzy Tree Fragment (FTF) queries to get data
– [Figures: an FTF, and a matching case in a tree]
– Using ICECUP (Nelson et al., 2002)
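To make the querying idea concrete, here is a toy fragment matcher over parse trees. It is a deliberate simplification of Fuzzy Tree Fragments, assuming a bare (label, children) tree representation; real FTFs in ICECUP support much richer node, word and edge conditions.

```python
# Toy fragment matcher: a node is (label, children). A fragment matches a tree
# node if labels agree and each fragment child matches some tree child, in
# order (tree children may be skipped). Simplified FTF for illustration only.

def matches(fragment, tree):
    flabel, fchildren = fragment
    tlabel, tchildren = tree
    if flabel != tlabel:
        return False
    ti = 0
    for fchild in fchildren:
        while ti < len(tchildren) and not matches(fchild, tchildren[ti]):
            ti += 1
        if ti == len(tchildren):
            return False
        ti += 1
    return True

def find_all(fragment, tree):
    """Return every subtree of `tree` that the fragment matches."""
    hits = [tree] if matches(fragment, tree) else []
    for child in tree[1]:
        hits += find_all(fragment, child)
    return hits

# Fragment: an NP with an attributive AJP before the head noun (labels invented).
np = ("NP", [("AJP", [("ADJ", [])]), ("N", [])])
tree = ("CL", [("NP", [("AJP", [("ADJ", [])]), ("N", [])]), ("VP", [])])
print(len(find_all(np, tree)))  # → 1
```

Running the fragment over every tree in a corpus and counting hits is the abstraction step that turns parsed text into a frequency dataset.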

What can a parsed corpus tell us?
• Trees as a handle on data
– make useful distinctions
– retrieve cases reliably
– not necessary to “agree” with the framework used, provided distinctions are meaningful
• Trees as a trace of the language production process
– interaction between decisions leaves a probabilistic effect on overall performance
– not simple to distinguish between sources: depends on the framework, but may also validate it

Why spoken corpora?
• Speech predates writing
– historically
– child development
– literacy growth and spread
– internal speech during writing
• Scale
– professional authors recommend 1,000 words/day
– 1 hour of speech ≈ 8,000 words (DCPSE)
• Spontaneity
– production process lost: many written sources edited
• Dialogue
– interaction between speakers

ICE-GB and DCPSE
• British Component of the International Corpus of English (1990-92)
– 1 million words (nominal)
– 60% spoken, 40% written
– speech component is orthographically transcribed
– fully parsed: marked up, POS-tagged, parsed, hand-corrected
• Diachronic Corpus of Present-day Spoken English
– 800,000 words (nominal)
– orthographically transcribed and fully parsed
– created from subsamples of LLC and ICE-GB
– matching numbers of texts in text categories
– not sampled over equal duration: LLC (1958-1977), ICE-GB (1990-1992)

Modal shall vs. will over time
• Plotting modal shall/will over time (DCPSE)
[Figure: p(shall | {shall, will}) by year, 1955-1995, plotted with error bars]
• Small amounts of data / year
• Confidence intervals identify the degree of certainty in our results
• Highly skewed p in some cases – p = 0 or 1 (circled)
• We can now estimate an approximate downwards curve (Aarts et al., 2013)
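The error bars here are Wilson score intervals, which behave sensibly even at the skewed extremes p = 0 or 1: they stay inside [0, 1] and never collapse to zero width. A minimal sketch of the standard formula, with invented counts:

```python
# Wilson score interval for an observed proportion p = successes / n.
# Counts are invented for illustration; DCPSE's per-year figures differ.
from math import sqrt

def wilson_interval(successes, n, z=1.96):  # z = 1.96 for a 95% interval
    p = successes / n
    centre = (p + z * z / (2 * n)) / (1 + z * z / n)
    spread = (z / (1 + z * z / n)) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - spread, centre + spread

# E.g. 4 cases of shall out of 5 shall/will in one year's data:
print(wilson_interval(4, 5))
# Skewed case p = 0 still yields a non-trivial upper bound:
print(wilson_interval(0, 5))
```

With only 5 cases the interval is very wide, which is exactly why single-year points in the plot carry large error bars.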

Intra-structural priming
• Priming effects within a structure
– Study repeating an additive step in structures
• Consider
– a phrase or clause that may (in principle) be extended ad infinitum, e.g. an NP with a noun head N
– a single additive step applied to this structure, e.g. add an attributive AJP before the head
– Q. What is the effect of repeatedly applying this operation to the structure?
[Diagram: ship → tall ship → tall very green ship → tall very green old ship]

NP premodification
• Sequential probability analysis
– calculate the probability of adding each AJP
– error bars: Wilson intervals
– probability falls: second < first, third < second
– decisions interact
– every AJP added makes it harder to add another
[Figure: probability of adding the nth AJP (n = 0-5), falling, y-axis 0.00-0.20]
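The sequential analysis can be sketched as follows: given how many NPs carry exactly k attributive AJPs, the probability of adding a k-th AJP is the share of NPs with at least k - 1 AJPs that go on to take another. The counts below are invented for illustration; DCPSE's real figures differ.

```python
# Sequential probability of adding another attributive AJP.
# counts_by_k[k] = number of NPs observed with exactly k premodifying AJPs.

def sequential_probabilities(counts_by_k):
    """p(k) = NPs with >= k AJPs / NPs with >= k-1 AJPs, for k = 1, 2, ..."""
    probs = []
    at_least_prev = sum(counts_by_k)          # NPs with >= 0 AJPs (all of them)
    for k in range(1, len(counts_by_k)):
        at_least_k = sum(counts_by_k[k:])     # NPs that took a k-th AJP
        probs.append(at_least_k / at_least_prev)
        at_least_prev = at_least_k
    return probs

counts = [10000, 1500, 150, 10]   # NPs with 0, 1, 2, 3 AJPs (hypothetical)
print([round(p, 3) for p in sequential_probabilities(counts)])
```

With these toy counts the probabilities fall at each step, which is the declining pattern the slide describes: each added AJP makes the next one less likely.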

NP premodification: explanations?
• Feedback loop: for each successive AJP, it is more difficult to add a further AJP
• Possible explanations include:
– logical and semantic constraints: tend to say the tall green ship; do not tend to say tall short ship or green tall ship
– communicative economy: once the speaker has said tall green ship, they tend to say only ship
– memory/processing constraints: unlikely, as this is a small structure, as are AJPs

NP premod’n: speech vs. writing
• Spoken vs. written subcorpora
– same overall pattern
– spoken data tends to have fewer attributive AJPs
• Support for the communicative economy or memory/processing hypotheses?
• Significance tests
– paired 2 x 1 Wilson tests (Wallis 2011)
– first and second observed spoken probabilities are significantly smaller than the written probability
[Figure: probability of adding the nth AJP (n = 0-5), written curve above spoken]
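A simple way to compare a spoken and a written probability at the same position is a Newcombe-Wilson difference interval, a standard two-proportion comparison built from Wilson intervals. Note this is a stand-in for, not a reproduction of, the paired 2 x 1 tests of Wallis (2011), and all counts below are hypothetical.

```python
# Newcombe-Wilson interval for the difference p1 - p2 between two observed
# proportions; the difference is significant if the interval excludes zero.
from math import sqrt

def wilson(successes, n, z=1.96):
    p = successes / n
    centre = (p + z * z / (2 * n)) / (1 + z * z / n)
    spread = (z / (1 + z * z / n)) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - spread, centre + spread

def newcombe_diff(s1, n1, s2, n2, z=1.96):
    """95% interval for p1 - p2 (Newcombe's method, Wilson-based)."""
    p1, p2 = s1 / n1, s2 / n2
    l1, u1 = wilson(s1, n1, z)
    l2, u2 = wilson(s2, n2, z)
    d = p1 - p2
    return (d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2),
            d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2))

# Hypothetical counts: first-AJP rates in written vs. spoken NPs.
lo, hi = newcombe_diff(400, 2000, 250, 2000)
print(lo, hi, "significant" if lo > 0 or hi < 0 else "n.s.")
```

Here the whole interval lies above zero, so under these invented counts the written rate would be significantly higher than the spoken one, mirroring the pattern the slide reports.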

Potential sources of interaction
• shared context – topic or ‘content words’ (Noriega)
• idiomatic conventions – semantic ordering of attributive adjectives (tall green ship)
• logical-semantic constraints – exclusion of incompatible adjectives (?tall short ship)
• communicative constraints – brevity on repetition (just say ship next time)
• psycholinguistic processing constraints – attention and memory of speakers

What use is interaction evidence?
• Corpus linguistics
– optimising existing grammar, e.g. co-ordination, compound nouns
• Theoretical linguistics
– comparing different grammars, same language
– comparing different languages or periods
• Psycholinguistics
– search for evidence of language production constraints in spontaneous speech corpora: speech and language therapy, language acquisition and development

What can a parsed corpus tell us?
• Trees as a handle on data
– make useful distinctions
– retrieve cases reliably
– not necessary to “agree” with the framework used, provided distinctions are meaningful
• Trees as a trace of the language production process
– interaction between decisions leaves a probabilistic effect on overall performance
– not simple to distinguish between sources: results are enabled by the framework, but may also validate it

The importance of annotation
• Key element of a ‘3A cycle’
– Annotation, Abstraction, Analysis
• Richer annotation
– more effective abstraction
– deeper research questions?
• Multiple layers of annotation
– new research questions
– studying interaction between layers
• Algorithmic vs. human annotation

More information
• References
– Aarts, B., Close, J. and Wallis, S.A. (2013) Choices over time: methodological issues in current change. In Aarts, Close, Leech and Wallis (eds) The Verb Phrase in English. Cambridge: Cambridge University Press.
– Nelson, G., Wallis, S.A. and Aarts, B. (2002) Exploring Natural Language. Amsterdam: John Benjamins.
– Wallis, S.A. (2011) Comparing χ2 tests for separability. London: Survey of English Usage.
• Useful links
– Survey of English Usage: www.ucl.ac.uk/english-usage
– Fuzzy Tree Fragments: www.ucl.ac.uk/english-usage/resources/ftfs
– Statistics and methodology research blog: http://corplingstats.wordpress.com