Unbalanced Multigenre Corpora
Svetlana Strinyuk, NRU HSE
Criteria for Text Selection
Criteria for determining the structure of a corpus should be small in number, clearly separate from each other, and efficient as a group in delineating a corpus that is representative of the language or variety under examination (Sinclair 2005).
External vs. internal criteria (Biber 1993: 243), i.e. situational vs. linguistic perspectives:
• External criteria are defined situationally, irrespective of the distribution of linguistic features
• Internal criteria are defined linguistically, taking into account the distribution of such features
Criteria for Text Selection
A combination of external and internal criteria was implemented.
• External factors: extralinguistic factors determining the communicative situation (academic discourse), field of study, purpose of writing
• Internal factors: both corpora are monolingual, and the same set of linguistic features is searched for in both
Corpora Size
• Corpora should be balanced in terms of size
• For contrastive analysis of two corpora of different sizes, normalized frequencies are used
• Each raw frequency is converted into a value per million words, or per thousand words; this is called normalizing the frequency scores
• Frequency per million words = (frequency ÷ number of words in the text) × 1,000,000
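The normalization described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `normalized_frequency`, the `base` parameter and the sample counts are assumptions introduced here, not figures from the study.

```python
def normalized_frequency(raw_count: int, corpus_size: int, base: int = 1_000_000) -> float:
    """Return how often an item occurs per `base` words of running text."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be a positive number of words")
    # Multiply before dividing so that exact ratios stay exact.
    return raw_count * base / corpus_size


# Hypothetical counts for one item in two corpora of different sizes.
learner_per_million = normalized_frequency(150, 300_000)
expert_per_million = normalized_frequency(400, 1_200_000)

# The same item expressed per thousand words instead.
learner_per_thousand = normalized_frequency(150, 300_000, base=1_000)

print(learner_per_million, round(expert_per_million, 1), learner_per_thousand)
```

With these made-up counts the item is less frequent in raw terms in the smaller corpus (150 vs. 400) but more frequent once normalized, which is why raw counts from corpora of different sizes should not be compared directly.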
Small Corpora: Representativeness
A corpus can be said to be representative if the findings from that corpus are generalisable to the language as a whole or to a particular aspect of it. Obviously, it is not possible to collect an entire language in order to test the representativeness of a corpus.
Instead, representativeness can be assessed through ‘saturation’ (also known as ‘closure’; see McEnery et al. 2006: 15–16). Saturation (at the lexical level) can be tested for by taking a corpus and dividing it into sections of equal size in terms of number of words. If another section of the same size is then added, the number of new items in the new section should be approximately the same as in the other sections.
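A rough sketch of the sectioning step of a lexical saturation check is given below. It assumes the corpus is already tokenized into a list of word forms; the function name `new_items_per_section` and the toy sentence are hypothetical, and how to interpret the resulting counts is left to the researcher.

```python
def new_items_per_section(tokens: list[str], n_sections: int) -> list[int]:
    """Split tokens into equal-sized sections and count, for each section,
    the word types that did not occur in any earlier section."""
    if n_sections <= 0:
        raise ValueError("n_sections must be positive")
    size = len(tokens) // n_sections  # leftover tokens at the end are ignored
    seen: set[str] = set()
    counts: list[int] = []
    for i in range(n_sections):
        section = tokens[i * size:(i + 1) * size]
        new_types = set(section) - seen
        counts.append(len(new_types))
        seen |= new_types
    return counts


# Toy example: 12 tokens split into 3 sections of 4 tokens each.
tokens = "the cat sat on the mat the dog sat on the rug".split()
print(new_items_per_section(tokens, 3))  # [4, 2, 1]
```

On a real corpus the counts for successive sections would be compared with one another when a further section of the same size is added.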
Corpora Size: Small Research Corpora
A small corpus is convenient:
• when investigating high-frequency items, so that the researcher is not overwhelmed by too much data from a big corpus
• because some corpus software sets limits on the number of concordance output lines; when it reaches these limits it simply stops searching the corpus
Two Corpora Used in Research
• Learner corpus: papers by NRU HSE fourth-year students (educational programmes: Political Science, History, Law, Business Informatics, Software Engineering, Management, Economics)
• Competent writers corpus: articles published in English in the most prominent journals in each field: Political Science, History, Law, Business Informatics, Software Engineering, Management, Economics. These are used as model writing, their quality being vouched for by the reputation of the editor, journal and publishing house.
References
• Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
• Kennedy, G. D. 1998. An Introduction to Corpus Linguistics. Harlow: Longman.
• McEnery, T. and Wilson, A. 2001. Corpus Linguistics (2nd ed.). Edinburgh: Edinburgh University Press.
• McEnery, T., Xiao, R. and Tono, Y. 2006. Corpus-based Language Studies: An Advanced Resource Book. London: Routledge.
• Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
• Sinclair, John McH. 2001. Preface. In M. Ghadessy, A. Henry and R. L. Roseberry (eds.), Small Corpus Studies and ELT, vii–xv. Amsterdam: John Benjamins.
• Sinclair, John McH. 2005a. Corpus and text: Basic principles. In M. Wynne (ed.), Developing Linguistic Corpora: A Guide to Good Practice, 1–21. Oxford: Arts and Humanities Data Service.
• Sinclair, John McH. 2005b. Appendix: How to build a corpus. In M. Wynne (ed.), Developing Linguistic Corpora: A Guide to Good Practice, 98–103. Oxford: Arts and Humanities Data Service.
References
• Shaalan, K., Hassanien, A. E. and Tolba, F. (eds.). Intelligent Natural Language Processing: Trends and Applications.
• Meyer, C. 2002. English Corpus Linguistics. Cambridge: Cambridge University Press.
• Biber, D., Conrad, S. and Reppen, R. 1998. Corpus Linguistics. Cambridge: Cambridge University Press.