English Corpus Linguistics
Introducing the Diachronic Corpus of Present-Day Spoken English (DCPSE)
Sean Wallis, UCL

Barber (1964): changes in English grammar
a. A tendency to regularize irregular morphology (e.g. dreamt → dreamed);
b. A revival of the “mandative” subjunctive, probably inspired by formal US usage (we demand that she take part in the meeting);
c. Elimination of shall as a future marker in the first person;
d. Development of new, auxiliary-like uses of certain lexical verbs (e.g. get, want – cf., e.g., The way you look, you wanna / want to see a doctor soon);
e. Extension of the progressive to new constructions, e.g. modal, present perfect and past perfect passive progressive (the road would not be being built / has not been being built / had not been being built before the general elections);
f. Increase in the number and types of multi-word verbs (phrasal verbs, have/take/give a ride, etc.);
g. Placement of frequency adverbs before auxiliary verbs (even if no emphasis is intended – I never have said so);
h. Do-support for have (have you any money? and no, I haven’t any money → do you have / have you got any money? and no, I don’t have any money / haven’t got any money)…

The Diachronic Corpus of Present-day Spoken English (DCPSE)
– Orthographically transcribed spoken BrE
– Fully parsed
  • every ‘sentence’ has a tree diagram
  • searchable with ICECUP and FTFs
– 400,000+ words each from
  • London-Lund Corpus (aka the ‘Survey Corpus’)
  • ICE-GB
– Balanced by text category
– Not evenly distributed by year
  • LLC: samples from 1958-1977
  • ICE-GB: 1990-1992

Tree diagrams
[Figure: a tree diagram for the sentence We’re getting there.]

Barber on shall and will
• [T]he distinctions formerly made between shall and will are being lost, and will is coming increasingly to be used instead of shall. One reason for this is that in speech we very often say neither [will] nor [shall], but just [’ll]: I’ll see you to-morrow, we’ll meet you at the station, John’ll get it for you. We cannot use this weak form in all positions (not at the end of a phrase, for example), but we use it very often; and, whatever its historical origin may have been (probably from will), we now use it indiscriminately as a weak form for either shall or will; and very often the speaker could not tell you which he had intended. There is thus often a doubt in a speaker’s mind whether will or shall is the appropriate form; and, in this doubt, it is will that is spreading at the expense of shall, presumably because will is used more frequently than shall anyway, and so is likely to be the winner in a levelling process. So people nowadays commonly say or write I will be there, we will all die one day, and so on, when they intend to express simple futurity and not volition. (Barber 1964: 134)

Denison on shall and will
• During the latter part of our period [1776-present day] ... in the first person shall has increasingly been replaced by will even where there is no element of volition in the meaning. (Denison 1998: 167)

The use of shall and will in written British and American English from the 1960s and 1990s

BrE        LOB     FLOB      LL    diff %
will     2,798    2,723     1.2     -2.7%
shall      355      200    44.3    -43.7%

AmE      Brown    Frown      LL    diff %
will     2,702    2,402    17.3    -11.1%
shall      267      150    33.1    -43.8%

From: Mair and Leech (2006: 327)
• Figures are frequencies normalised per million words
• Log-likelihood (LL) is calculated against the number of words
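The diff % column can be recomputed directly from the frequencies in the table. The sketch below does that and also illustrates one common two-corpus log-likelihood calculation. It is an illustration only: the function names are hypothetical, and the corpus sizes are an assumption (exactly one million words each), so the LL values it prints should be close to, but not necessarily identical with, the published ones.

```python
import math

# Per-million-word frequencies (1960s corpus, 1990s corpus),
# taken from the Mair and Leech (2006: 327) table.
FREQS = {
    ("BrE", "will"): (2798, 2723),
    ("BrE", "shall"): (355, 200),
    ("AmE", "will"): (2702, 2402),
    ("AmE", "shall"): (267, 150),
}


def diff_percent(f1, f2):
    """Percentage change from the earlier frequency f1 to the later f2."""
    return 100.0 * (f2 - f1) / f1


def log_likelihood(f1, f2, n1=1_000_000, n2=1_000_000):
    """Two-corpus log-likelihood of f1 vs. f2 against corpus size.

    ASSUMPTION: n1 and n2 (words per corpus) default to one million;
    the real corpus sizes are not given in the slide.
    """
    e1 = n1 * (f1 + f2) / (n1 + n2)  # expected frequency, corpus 1
    e2 = n2 * (f1 + f2) / (n1 + n2)  # expected frequency, corpus 2
    return 2.0 * (f1 * math.log(f1 / e1) + f2 * math.log(f2 / e2))


for (variety, word), (f1, f2) in FREQS.items():
    print(variety, word,
          round(diff_percent(f1, f2), 1),
          round(log_likelihood(f1, f2), 1))
```

Note that both measures are computed against the number of words, not against the number of opportunities to choose between shall and will.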

Mair and Leech’s data
• Simply counts tagged lexical tokens
  – Will = auxiliary verb, includes ’ll
  – Shall = auxiliary verb
  – Includes negative forms
• Does not distinguish by grammatical position or context
  – Does not ask whether the choice is available, e.g. limit to first person use
  – Does not consider subclasses separately
    • Negative cases: will not/won’t vs. shall not/shan’t?
    • Do interrogative cases behave differently?
• Is written data only
• Can we do better than this?

An FTF for first person declarative shall
• This FTF is limited to first person cases
  – The FTF requires that the NP is realised by the pronoun I or we.
• Interrogative cases have a different structure
• We can subtract negative (shall not) cases to exclude them.

Shall vs. will
• Does the proportion of cases of shall out of {shall, will} change over time?

          shall   will   Total   χ²(shall)   χ²(will)   Summary
LLC         110     78     188        1.32       1.45   d% = -30.24% ±20.84%
ICE-GB       40     58      98        2.53       2.79   φ = 0.17
TOTAL       150    136     286        3.85       4.24   χ² = 8.09

• χ² for first person subject; shall vs. will
  – d% = percentage difference (30% fall in shall between LLC and ICE-GB)
  – φ = an estimate of the size of the overall effect (a bit like d%)
  – χ² = 2 x 2 chi-square test: is this change statistically significant?
  – χ²(shall) = 2 x 1 goodness of fit test: does shall behave differently to average?
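The summary figures for this table can be checked with a few lines of code. The following is a minimal sketch, assuming the standard Pearson chi-square and taking φ as the square root of χ²/N; the function and variable names are illustrative and not part of ICECUP.

```python
import math

# First person declarative counts: shall vs. will in each subcorpus.
counts = {
    "LLC": {"shall": 110, "will": 78},
    "ICE-GB": {"shall": 40, "will": 58},
}


def chi_square(table):
    """Pearson chi-square for a table given as {row: {column: count}}."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    row_tot = {r: sum(table[r].values()) for r in rows}
    col_tot = {c: sum(table[r][c] for r in rows) for c in cols}
    n = sum(row_tot.values())
    chi2 = 0.0
    for r in rows:
        for c in cols:
            expected = row_tot[r] * col_tot[c] / n
            chi2 += (table[r][c] - expected) ** 2 / expected
    return chi2, n


def p_shall(row):
    """Proportion of shall out of {shall, will} in one subcorpus."""
    return row["shall"] / (row["shall"] + row["will"])


chi2, n = chi_square(counts)
phi = math.sqrt(chi2 / n)  # effect size for a 2 x 2 table
p1, p2 = p_shall(counts["LLC"]), p_shall(counts["ICE-GB"])
d_percent = 100.0 * (p2 - p1) / p1  # percentage difference in p(shall)

print(round(chi2, 2), round(phi, 2), round(d_percent, 2))  # 8.09 0.17 -30.24
```

With one degree of freedom the critical value of χ² at the 0.05 level is 3.841, so a value of 8.09 indicates a significant change.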

Shall vs. will/’ll
• Does the proportion of cases of shall out of {shall, will, ’ll} change over time?

          shall   will    ’ll   Total   χ²(shall)   χ²(will)   χ²(’ll)
LLC         104     69    371     544        9.98       0.13      2.33
ICE-GB       36     52    365     453       11.98       0.16      2.80
TOTAL       140    121    736     997       21.96       0.30      5.13

• χ² for first person subject; shall vs. will vs. ’ll
  – χ²(shall) = 2 x 1 goodness of fit test: does shall behave differently to average?
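The per-column figures in this table can be reproduced as 2 x 1 goodness of fit tests, comparing each form's distribution across the two subcorpora with the distribution of all cases. This is a sketch under that reading of the table; the names are illustrative.

```python
# First person declarative counts with 'll included in the baseline.
counts = {
    "LLC": {"shall": 104, "will": 69, "'ll": 371},
    "ICE-GB": {"shall": 36, "will": 52, "'ll": 365},
}

CRITICAL_1DF_05 = 3.841  # chi-square critical value, 1 d.f., 0.05 level


def goodness_of_fit(table, column):
    """2 x 1 goodness of fit chi-square for one column.

    Expected values share out the column total in proportion
    to the row totals (i.e. the average behaviour of all forms).
    """
    row_tot = {r: sum(cells.values()) for r, cells in table.items()}
    n = sum(row_tot.values())
    col_tot = sum(cells[column] for cells in table.values())
    chi2 = 0.0
    for r, cells in table.items():
        expected = col_tot * row_tot[r] / n
        chi2 += (cells[column] - expected) ** 2 / expected
    return chi2


for form in ("shall", "will", "'ll"):
    chi2 = goodness_of_fit(counts, form)
    print(form, round(chi2, 2), "significant:", chi2 > CRITICAL_1DF_05)
```

On these counts shall and ’ll differ significantly from the average distribution, while will does not.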

Focusing on choice
• We focused on the choice of shall vs. will
  – Mair and Leech simply said that total cases of shall fell
  – But this might have happened for other reasons
    • For example there may have been more opportunities to use shall in the LLC data
• Examining choice is a more precise way of conducting experiments than counting frequencies
  – It allows us to consider what variables (time, genre, other choices) affect the probability of shall being chosen
• Probability is a simple fraction from 0 to 1.
  – p(shall) = F(shall) / (F(shall) + F(will) + …)
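As a small illustration of how the choice of baseline changes p(shall), the sketch below applies the fraction above to the first person counts from the two tables in this deck. The function name is illustrative.

```python
def probability(freqs, target="shall"):
    """p(target) = F(target) / sum of F over all alternatives in the baseline."""
    return freqs[target] / sum(freqs.values())


# Baseline {shall, will}
llc_two = {"shall": 110, "will": 78}
ice_two = {"shall": 40, "will": 58}

# Baseline {shall, will, 'll}
llc_three = {"shall": 104, "will": 69, "'ll": 371}
ice_three = {"shall": 36, "will": 52, "'ll": 365}

for label, freqs in [("LLC, shall/will", llc_two),
                     ("ICE-GB, shall/will", ice_two),
                     ("LLC, shall/will/'ll", llc_three),
                     ("ICE-GB, shall/will/'ll", ice_three)]:
    print(label, round(probability(freqs), 3))
```

Adding ’ll to the baseline lowers p(shall) considerably in both subcorpora, which is why the baseline has to be stated explicitly.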

Probability of shall vs. will over time

Probability of shall vs. will/’ll over time

Confidence intervals
• Probability p(shall):
  – 0 = no cases are of type shall
  – 1 = all cases are of type shall
• Our sample is a tiny subset of possible sentences from the same period
  – So we cannot say a particular observation is certain
  – Instead we try to estimate our confidence in an observation using error bars or confidence intervals
• The more data we have supporting an observation p, the smaller the confidence interval around it
• We set a confidence level, typically of 95%
  – we are 95% sure that the true value is within the interval
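One way to compute such an interval is the Wilson score interval. The slide does not say which interval was used for the error bars, so the choice of Wilson here is an assumption, and the function name is illustrative. The sketch shows the interval for p(shall) in each subcorpus, using the shall vs. will counts from earlier in the deck.

```python
import math


def wilson_interval(p, n, z=1.959964):
    """Wilson score interval for an observed proportion p out of n cases.

    z = 1.959964 corresponds to a 95% confidence level.
    """
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# shall out of {shall, will}, first person declarative
for label, shall, total in [("LLC", 110, 188), ("ICE-GB", 40, 98)]:
    p = shall / total
    lower, upper = wilson_interval(p, total)
    print(label, round(p, 3), round(lower, 3), round(upper, 3))
```

Calling the function with the same p and a larger n gives a narrower interval, which is the point made in the third bullet above.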

Modal meaning
• Remember Barber and Denison. Not all cases of shall or will mean the same thing
  – Root (futurity):
    • I’ve got some at home so I shall take it home. [DI-A18 #30]
    • I will answer you in a minute. [DI-B30 #293]
  – Epistemic (volition):
    • So I shall have roughly from the twenty-ninth of June to the eighth of July on which I can spend the whole of that time on those two papers. [DL-B01 #62]
    • It’s certainly my long term hope that I will have some kind of companion... [DI-B53 #0257]
• We should examine these choices separately
  – Unfortunately this means classifying cases manually

Modal meaning: statistics

                Root       %   Epistemic       %   Unclear      %   Total
shall  LLC        33   30.84          72   67.29         2   1.87     107
       ICE-GB     22   59.46          14   37.84         1   2.70      37
will   LLC        44   55.70          28   35.44         7   8.86      79
       ICE-GB     37   66.07          14   25.00         5   8.93      56
Total            136                 128                15            279

sig = significant result (marked for shall and for the Total row)

• Root shall / will is stable: results are not significant
• Epistemic shall / will falls (d% = -30% ±27%)
  – The fall in shall is not explained by the sharp fall in Epistemic modals overall, from 100 (72+28) to 28 (14+14)
  – This is evidence that the shift in use in the twentieth century is concentrated within Epistemic meanings, from shall to will.
  – Barber and Denison: earlier shift was in Root (future) meaning.
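The two claims about Root and Epistemic cases can be checked by running a separate 2 x 2 chi-square on the shall and will counts for each meaning. This is a sketch assuming the standard Pearson test without continuity correction; the names are illustrative, and the Unclear cases are left out.

```python
# (shall, will) counts per subcorpus, split by modal meaning.
MEANINGS = {
    "Root": {"LLC": (33, 44), "ICE-GB": (22, 37)},
    "Epistemic": {"LLC": (72, 28), "ICE-GB": (14, 14)},
}

CRITICAL_1DF_05 = 3.841  # chi-square critical value, 1 d.f., 0.05 level


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def d_percent(early, late):
    """Percentage difference in p(shall) from the early to the late sample."""
    p1 = early[0] / sum(early)
    p2 = late[0] / sum(late)
    return 100.0 * (p2 - p1) / p1


def analyse(meaning):
    """Return (chi-square, d%) for shall vs. will within one modal meaning."""
    early = MEANINGS[meaning]["LLC"]
    late = MEANINGS[meaning]["ICE-GB"]
    return chi_square_2x2(*early, *late), d_percent(early, late)


for meaning in MEANINGS:
    chi2, d = analyse(meaning)
    print(meaning, round(chi2, 2), round(d, 1),
          "significant:", chi2 > CRITICAL_1DF_05)
```

On these counts the Root result falls below the critical value and the Epistemic result exceeds it, with p(shall) among Epistemic cases falling by roughly 30%.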

Modal meaning: statistics

                Root       %   Epistemic       %   Unclear      %   Total
shall  LLC        33   30.84          72   67.29         2   1.87     107
       ICE-GB     22   59.46          14   37.84         1   2.70      37
will   LLC        44   55.70          28   35.44         7   8.86      79
       ICE-GB     37   66.07          14   25.00         5   8.93      56
Total            136                 128                15            279

sig = significant result (marked for shall and for the Total row)

• Shall is losing its particular Epistemic meaning as a result
  – In the LLC data two thirds (67%) of shall uses were Epistemic.
  – This fell to 37% (just over one third) in ICE-GB.

Conclusions
• DCPSE is
  – orthographically transcribed spoken English
    • mostly spontaneous
  – fully parsed and checked by linguists, uses phrase structure grammar based on Quirk et al.
  – searchable with ICECUP and FTFs
• Even lexical studies benefit from parsing
  – allows us to focus on when a choice occurs
• You can use DCPSE to carry out many different experiments on real English
  – we looked at change over (recent) time
  – we might also look at how decisions interact

Conclusions
• Designing a Corpus Linguistic experiment means thinking carefully about your hypothesis and then attempting to test it against the corpus
  – We examined the shift from shall to will
  – We limited it to first person, declarative, positive cases
  – Changing baselines (including ’ll) may lead to different conclusions
• Many corpus studies only consider word baselines (or pmw)
• But it is often better to consider proportions of types of clause or phrase, or list specific alternative choices
  – Alternation (choice) studies aim to hold meaning constant so the speaker/writer is free to choose between the alternatives:
    • We focused further by subdividing data by modal meaning

Suggested further reading
• On shall vs. will and the progressive:
  – Aarts, B., Close, J. and Wallis, S.A. (forthcoming) Choices over time: methodological issues in investigating current change. In: B. Aarts et al. The changing Verb Phrase, Cambridge: CUP.
    • www.ucl.ac.uk/english-usage/projects/verb-phrase/book/aartsclosewallis.pdf
  – Barber, C. (1964) Linguistic Change in Present-Day English. Edinburgh and London: Oliver and Boyd.
  – Denison, D. (1998) Syntax. In: S. Romaine (ed.) The Cambridge History of the English Language. IV: 1776-1997. Cambridge: Cambridge University Press. 92-329.
  – Mair, C. and Leech, G. (2006) Current changes in English syntax. In: B. Aarts and A. McMahon (eds.) The Handbook of English Linguistics. Malden MA: Blackwell Publishers. 318-342.
• On statistical tests, confidence intervals and other methods:
  – Wallis, S.A. (2010) z-squared: the origin and use of χ². Survey of English Usage, UCL.
    • www.ucl.ac.uk/english-usage/statspapers/z-squared.pdf