Building a corpus to investigate the presentation of speech, thought and writing in Spoken British English
Dan McIntyre, John Heywood, Tony McEnery, Elena Semino and Mick Short
Department of Linguistics and Modern English Language, Lancaster University, UK
6th April 2003, PALC 2003
Aims of the project
- To investigate the forms and functions of speech, thought and writing presentation in spoken data.
- To compare the presentation of ST&WP in a corpus of spoken data with the findings from an equivalent corpus of written texts.
- To further test the model of speech and thought presentation outlined in Leech and Short (1981).
What is speech, thought and writing presentation?
- Prototypically, the presentation in a posterior discourse of what was said, thought or written in a (supposed) anterior discourse.

Speaker’s words
  Direct speech: [DS] ‘Shut up, you silly old fool,’ [RS] she said.
  Indirect speech: [RS] She told him [IS] that he should shut up.
  Representation of a speech act: [RSA] She commanded him.
Reporter’s words
Selecting the corpus data
- 120 transcripts, approximately 260,000 words.
- Texts taken from the British National Corpus (BNC) and the Centre for North West Regional Studies (CNWRS) oral history archives at Lancaster University.
- CNWRS interview tapes digitised to be time-aligned with text by Softsound Ltd, Cambridge, UK.
- BNC sound files identified where possible.
The ST&WP categories

Main categories
Speech   Thought   Writing   Example
FDS      FDT       FDW       ‘Shut up you silly old fool!’
RS/DS    RT/DT     RW/DW     ‘Shut up you silly old fool!’, she said.
FIS      FIT       FIW       He should shut up, the silly old fool!
RS/IS    RT/IT     RW/IW     She said that he should shut up.
RSA      RTA       RWA       She commanded him.
RV       RI        RN        She shouted at him.
RU                           We used to call them idiots in those days.
A                            She looked at him.
ST&WP category features

Code   Feature
P      toPic
#      problematic tag
1+     repetitions
e      embedded
g      grammatical negative
a      marked absence of ST&WP
h      hypothetical
i      inferred
q      quotation
r      iterative
v      interrogative
p      imperative
u      unfinished
1+     level of embedding
ST&WP category features (examples)
- She said he wasn’t a silly old fool.
- If you behave foolishly, people will say that you’re a silly old fool.
- ‘Stop being a silly old fool!’ she said.
Annotating the corpus for ST&WP
- We use the element <sptag> and mark the ST&WP category within the attribute cat.
- Tags designed for concordancing using WordSmith Tools.
- 15 fields to mark ST&WP categories.
- x used as a placeholder for empty positions.

<sptag cat="xxxxxxxx">
  element: sptag
  attribute: cat
  attribute value: "xxxxxxxx" (fields 1 - 15)

Example attribute values:
<sptag cat="FIW">
<sptag cat="xDSxxxh">
<sptag cat="xRSAxxghxxxxp">

Each character of the cat value fills one field:
<sptag cat="xDS"> = <sptag one="x" two="D" three="S">
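The positional reading of the cat attribute can be illustrated with a minimal Python sketch. It assumes one character per field and that values shorter than 15 characters are padded on the right with the placeholder x; the slides do not state the padding rule, and the names `expand_cat` and `field1` to `field15` are hypothetical, chosen only for illustration.

```python
FIELD_COUNT = 15   # the slides state that a tag has 15 fields
PLACEHOLDER = "x"  # x marks an empty position

def expand_cat(cat, width=FIELD_COUNT):
    """Split a cat value into one character per field.

    Assumption: short values are right-padded with the placeholder.
    """
    if len(cat) > width:
        raise ValueError(f"cat value {cat!r} has more than {width} fields")
    padded = cat.ljust(width, PLACEHOLDER)
    # Hypothetical field names: field1 ... field15
    return {f"field{i}": ch for i, ch in enumerate(padded, start=1)}

fields = expand_cat("xDS")
print(fields["field1"], fields["field2"], fields["field3"])  # x D S
```

Keeping every value at a fixed width means that a feature always sits in the same character position, which is presumably what makes the tags convenient to search with a concordancer.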
A sample extract from a marked-up file

<sptag cat="A">Then they went to Hereford and there were Quakers there and </sptag>
<sptag cat="xRIxxxxxxi">he had a hard time of it</sptag>
<sptag cat="xRIxxxxxxi">they didn't like Catholics</sptag>
<sptag cat="A">and I can remember <note desc="S implied">they sent me</note> I was a manageress in the laundry here and <note desc="S implied">they sent me to Kendal</note> when we opened a laundry at Kendal and I was staying at a lodging in Kendal and the man was th they were Quakers and </sptag>
<sptag cat="xRSxx2">I said to the young lady, I said</sptag>
<sptag cat="xDS"> Would you mind if you made my dinner on Friday it doesn't matter if it's only bread and butter, but no meat, because we don't eat meat on a Friday or no bacon just bread anything plain it doesn't matter what it is but no meat</sptag>
<sptag cat="xRS">and the old man says</sptag>
<sptag cat="xDS">I'm sorry for thee</sptag>
<sptag cat="xRT">and I thought </sptag>
<sptag cat="xDT">oh he was a </sptag>
<sptag cat="A">but Quaker. Anyway</sptag>
<sptag cat="xRS">she says</sptag>
<sptag cat="xDS">shut up you silly old fool</sptag>
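As a rough illustration of how markup of this kind could be queried outside a concordancer, the Python sketch below pulls out each tagged stretch with its category and counts the categories. It is not the project's own tooling. It assumes that sptag elements are not nested inside one another and that category values are written without internal spaces; `SAMPLE` is an abridged version of the extract, and `tagged_spans` is a hypothetical helper name.

```python
import re
from collections import Counter

# Abridged from the sample extract; category values written without spaces.
SAMPLE = (
    '<sptag cat="A">and I can remember <note desc="S implied">they sent me</note> '
    'I was a manageress in the laundry here</sptag>'
    '<sptag cat="xRS">and the old man says</sptag>'
    '<sptag cat="xDS">I\'m sorry for thee</sptag>'
    '<sptag cat="xRS">she says</sptag>'
    '<sptag cat="xDS">shut up you silly old fool</sptag>'
)

# Assumption: sptag elements never contain other sptag elements.
SPTAG = re.compile(r'<sptag cat="([^"]*)">(.*?)</sptag>', re.DOTALL)
NOTE_TAG = re.compile(r'</?note[^>]*>')

def tagged_spans(markup):
    """Return (category, text) pairs, with <note> tags removed from the text."""
    return [
        (cat, NOTE_TAG.sub("", text).strip())
        for cat, text in SPTAG.findall(markup)
    ]

spans = tagged_spans(SAMPLE)
counts = Counter(cat for cat, _ in spans)

for cat, text in spans:
    print(f"{cat:5} {text}")
print(counts["xDS"], counts["xRS"], counts["A"])  # 2 2 1
```

A regular expression is enough here because the sample is flat; a corpus with nested elements would call for a proper XML parser.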
Preliminary results: comparative numbers and percentages of speech tags in the Spoken and Written Corpora, in relation to total number of discourse presentation tags

Category            Written Corpus    Spoken Corpus
FDS                 927 (10.79%)      199 (2.26%)
DS                  2047 (23.83%)     1975 (22.48%)
Total (FDS + DS)    2974 (34.62%)     2174 (24.75%)
FIS                 157 (1.82%)       83 (0.94%)
IS                  1114 (12.97%)     590 (6.71%)
(N)RSA              1398 (16.27%)     1349 (15.35%)
N/RV                391 (4.55%)       858 (9.76%)

Spoken Corpus: Total tags = 34,927; A = 21,467; RU = 255; RS = 2,774; Ambiguities = 1,149; ST&WP tags = 8,783
Written Corpus: Total tags = 16,533; N = 3,601; Ambiguities = 885; ST&WP tags = 8,588
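The percentages in these results appear to be each count divided by the corpus's ST&WP tag total (8,588 for the Written Corpus, 8,783 for the Spoken Corpus), truncated rather than rounded to two decimal places. This is inferred from the figures themselves, not stated on the slides. A minimal Python sketch of that calculation, with the hypothetical helper name `truncated_percent`:

```python
WRITTEN_STWP_TAGS = 8588  # ST&WP tags, Written Corpus
SPOKEN_STWP_TAGS = 8783   # ST&WP tags, Spoken Corpus

def truncated_percent(count, total):
    """Percentage of count in total, truncated (not rounded) to 2 decimals.

    Integer arithmetic avoids floating-point surprises near .xx5 values.
    """
    hundredths = count * 10000 // total
    return f"{hundredths // 100}.{hundredths % 100:02d}"

print(truncated_percent(927, WRITTEN_STWP_TAGS))   # 10.79 (FDS, written)
print(truncated_percent(199, SPOKEN_STWP_TAGS))    # 2.26  (FDS, spoken)
print(truncated_percent(1398, WRITTEN_STWP_TAGS))  # 16.27 ((N)RSA, written)
```

Ordinary rounding would give 2.27 and 16.28 for the last two figures, which is why truncation is assumed here.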
Preliminary results: comparative numbers and percentages of thought tags in the Spoken and Written Corpora, in relation to total number of discourse presentation tags

Category            Written Corpus    Spoken Corpus
FDT                 69 (0.80%)        8 (0.09%)
DT                  38 (0.44%)        162 (1.84%)
Total (FDT + DT)    107 (1.24%)       170 (1.93%)
FIT                 275 (3.20%)       11 (0.12%)
IT                  201 (2.34%)       855 (9.73%)
(N)RTA              114 (1.32%)       363 (4.13%)
N/RI                1355 (15.77%)     1530 (17.42%)

Spoken Corpus: Total tags = 34,927; A = 21,467; RU = 255; RT = 1,109; Ambiguities = 488; ST&WP tags = 8,783
Written Corpus: Total tags = 16,533; N = 3,601; Ambiguities = 885; ST&WP tags = 8,588
Preliminary results: comparative numbers and percentages of writing tags in the Spoken and Written Corpora, in relation to total number of discourse presentation tags

Category            Written Corpus    Spoken Corpus
FDW                 32 (0.37%)        140 (1.59%)
DW                  109 (1.26%)       71 (0.80%)
Total (FDW + DW)    141 (1.64%)       211 (2.40%)
FIW                 32 (0.37%)        23 (0.26%)
IW                  74 (0.86%)        37 (0.42%)
(N)RWA              215 (2.50%)       295 (3.35%)
NW/RN               41 (0.47%)        234 (2.66%)

Spoken Corpus: Total tags = 34,927; A = 21,467; RU = 255; RW = 145; Ambiguities = 295; ST&WP tags = 8,783
Written Corpus: Total tags = 16,533; N = 3,601; Ambiguities = 885; ST&WP tags = 8,588
Where next?
- Further refinement of ST&WP annotation.
- ST&WP and prosodic discontinuities (e.g. voice quality).
- Combination of quantitative and qualitative analyses.
- Comparison of findings from the two corpora.