- Slides: 25
Searching corpora
Review • Word lists – Form: grammatical words are most frequent – The number of word types grows as frequency falls (that is, as rank increases). For example, a corpus will contain more word types occurring 4 times than word types occurring 5 times.
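A minimal sketch of the two ideas on this slide: a ranked word list and a count of how many types occur exactly n times. The sample sentence and the simple regex tokeniser are assumptions made for illustration, not part of the original slides.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and return alphabetic word tokens (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())

# Toy text, invented for illustration only.
sample = "the cat sat on the mat and the dog sat on the log"
word_freq = Counter(tokenize(sample))

# Word list: types sorted by descending frequency (rank 1 = most frequent).
word_list = word_freq.most_common()

# Frequency spectrum: how many types occur exactly n times.
spectrum = Counter(word_freq.values())

print(word_list[:3])
for n in sorted(spectrum):
    print(f"{spectrum[n]} type(s) occur {n} time(s)")
```

Even in this tiny sample the grammatical word "the" tops the list, and the types that occur once outnumber those that occur twice or more.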
Uses of wordlists • Determine frequency bands for English – Most frequent 1000 words – Academic “band” etc. • Such frequency bands can be used in course planning and to judge the difficulty of a text
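One way a frequency band might be used to judge text difficulty is to measure what share of a text's tokens fall inside the band. The sketch below assumes this coverage measure; `toy_band` is a hypothetical stand-in for a real "most frequent 1000 words" list, and both example sentences are invented.

```python
import re

def tokenize(text):
    """Lowercase the text and return alphabetic word tokens (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())

def band_coverage(text, band):
    """Return the share of tokens in `text` that belong to the word set `band`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in band) / len(tokens)

# Hypothetical stand-in for a "most frequent 1000 words" band.
toy_band = {"the", "a", "of", "is", "in", "and", "to", "cat", "house"}

easy = "The cat is in the house"
hard = "The feline inhabits the domicile"

print(band_coverage(easy, toy_band))
print(band_coverage(hard, toy_band))
```

A text with lower coverage by the high-frequency band contains more words a learner is less likely to know.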
Uses of wordlists • Determine the core vocabulary for a particular topic, such as academic English or business English • Produce a wordlist for a business English corpus and use a stoplist (remove grammatical words) • Do a corpus comparison
Corpus comparison table
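A rough sketch of the stoplist and comparison steps: build a word list without grammatical words, then rank the target corpus's words by how much more frequent they are, relatively, than in a reference corpus. The two one-sentence "corpora", the `STOPLIST` contents and the simple difference-of-proportions score are all assumptions for illustration; published comparisons usually rely on a keyness statistic such as log-likelihood.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and return alphabetic word tokens (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())

def word_list(text, stoplist=frozenset()):
    """Count tokens, skipping any that appear in the stoplist."""
    return Counter(t for t in tokenize(text) if t not in stoplist)

def compare(target, reference, stoplist=frozenset()):
    """Rank target words by how far their relative frequency exceeds the reference's."""
    t_counts = word_list(target, stoplist)
    r_counts = word_list(reference, stoplist)
    t_total, r_total = sum(t_counts.values()), sum(r_counts.values())
    rows = []
    for word, count in t_counts.items():
        t_rel = count / t_total
        r_rel = r_counts[word] / r_total if r_total else 0.0
        rows.append((word, t_rel - r_rel))
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Hypothetical stoplist and toy corpora, invented for illustration.
STOPLIST = frozenset({"the", "a", "of", "and", "to", "in", "is", "on"})
business = "the market and the profit of the company and the market share"
general = "the dog and the cat sat in the garden of the house"

for word, score in compare(business, general, STOPLIST):
    print(f"{word:10} {score:+.2f}")
```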
How to search: Using concordance software • Corpora are large and computer-aided analysis is necessary • Concordance software is the most common type of text analysis software • Concordance software – produces frequency lists from a corpus – allows searches for words or phrases
Searching a corpus
Info easily obtained from corpora • word lists and word frequency – (What are the most common words in a corpus?) • word combinations (collocations) such as high risk, low maintenance, deep passion • grammatical constructions (How common is the passive? How often does the passive occur with a by-phrase, as in He was seen by the doctor?) • What kinds of texts/genres are associated with particular collocations/constructions?
Basic procedure of corpus analysis • very simple: – searching for words and phrases – sorting the results – obtaining frequency information – performing the analysis (Sinclair reading)
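The search-and-sort steps can be sketched as a small keyword-in-context (KWIC) routine. This is an illustrative approximation of what concordance software does, not the behaviour of any particular tool; the sample text, the context width and the function names are assumptions.

```python
import re

def tokenize(text):
    """Lowercase the text and return alphabetic word tokens (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, node, width=3):
    """Return (left context, node, right context) for each hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines

def sort_right(lines):
    """Sort concordance lines alphabetically by the words to the right of the node."""
    return sorted(lines, key=lambda line: line[2])

# Toy text, invented for illustration only.
text = "He went to the post office. She runs the head office. They will take office soon."
tokens = tokenize(text)
hits = concordance(tokens, "office")

print(f"{len(hits)} hits")
for left, node, right in sort_right(hits):
    print(f"{' '.join(left):>20}  [{node}]  {' '.join(right)}")
```

Sorting on the left context instead (for example with `key=lambda line: line[0][::-1]`) would group the lines by the word immediately before the node.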
Results - sorted
Results - surrounding words
Collocates and collocations? • The collocates of a word are the words that frequently co-occur with it • If office is the node word, then post, head and take are examples of collocates • head office is a collocation
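A minimal sketch of finding collocates by counting the words that appear within a fixed span of the node word. The toy text, the span of one word on each side and the small stoplist are assumptions; raw counts are used here for simplicity, whereas corpus tools often rank collocates with an association measure such as mutual information or a t-score.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and return alphabetic word tokens (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())

def collocates(tokens, node, span=1, stoplist=frozenset()):
    """Count words occurring within `span` tokens to the left or right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        counts.update(w for w in window if w not in stoplist)
    return counts

# Toy text, invented for illustration only.
text = ("He went to the post office. She runs the head office. "
        "The head office called. They will take office soon.")
result = collocates(tokenize(text), "office", span=1, stoplist=frozenset({"the", "she"}))

print(result.most_common())
```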
Why are collocations a big deal? • There are several answers – knowing a language involves knowing lots and lots of collocations – it is unclear how to deal with collocations within grammar – collocates provide clues to the meaning or connotations of a word, e.g., husbands and wives
husband wife
Corpus data: collocates of high
Corpus data • evidence of extensive multi-word units • various names: collocations, fixed expressions, chunks, pre-fabricated units (prefabs), lexical bundles
Corpus data • thing: sort of thing, kind of thing, the thing to do, the thing is • change: change in attitude, change of heart, change in policy, change over time, time for a change, pace of change, rate of change, subject to change
Issues in interpreting concordance lines (Hunston) • Nature of search term – word, lemma, phrase • Some unwanted concordance lines may be deleted • Sorting (in different ways) brings out patterns in the data • Look first at the words surrounding the search term • In some cases, a larger co-text must be examined
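The choice of search term (word, lemma or phrase) changes which lines are retrieved. The sketch below illustrates the difference with regular expressions; `LEMMA_FORMS` is a hypothetical hand-written table standing in for a real lemmatiser, and the sample sentence is invented.

```python
import re

# Hypothetical, hand-listed inflected forms; a real tool would use a lemmatiser.
LEMMA_FORMS = {"change": ["change", "changes", "changed", "changing"]}

def search_pattern(term, kind="word"):
    """Build a case-insensitive regex for a word, a lemma, or a phrase search."""
    if kind == "lemma":
        forms = LEMMA_FORMS.get(term, [term])
        body = "|".join(re.escape(f) for f in forms)
    elif kind == "phrase":
        body = r"\s+".join(re.escape(w) for w in term.split())
    else:
        body = re.escape(term)
    return re.compile(rf"\b(?:{body})\b", re.IGNORECASE)

# Toy text, invented for illustration only.
text = "A change of heart changed the policy. Changes are subject to change."

word_hits = search_pattern("change").findall(text)
lemma_hits = search_pattern("change", kind="lemma").findall(text)
phrase_hits = search_pattern("change of heart", kind="phrase").findall(text)

print(len(word_hits), len(lemma_hits), len(phrase_hits))
```

The word search misses "changed" and "Changes", which the lemma search retrieves; the phrase search narrows the results to a single multi-word unit.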
Issues in interpreting concordance lines (Hunston) • Typically, a large number of lines are retrieved. One technique is to look closely at a few concordance lines and try to draw some generalisations. Then look at another set of concordance lines to see if your generalisation holds • What is typical? What is central (prototypical)?
Techniques in interpreting concordance lines • Examine the frequent collocates – what are the larger (formal) patterns? • Examine or apply part-of-speech categories. Do patterns emerge if POS data is taken into account? • Check for semantic categories: colour terms, hedges, …
Corpus view of language • Researchers who work with corpora tend to come to similar views about the nature of language • More attention is paid to the words (lexis and phraseology) • Led to lexical approaches to language teaching (Willis, Lewis, McCarthy)
Words and grammar • Language is traditionally divided into the lexicon (words) and grammar (a set of rules) • But where, for example, does the phrase “sort of thing” or “the thing to do” fit within this view of language?
View of language • corpus view – language/grammar is a vast network of lexical/grammatical relations – collocations (high standards, high on drugs) – verb – object co-selection (lose – job, find – employment, made – redundant) – constructions - passive (or BE + adjective/participle)
Representing a corpus-based grammar • Quite difficult • Use schemas to represent words, collocations and constructions – [post office] – [N N] – [the thing to V] – [SUBJ Vmanner POSS way PATH]
Schemas • Related to schema theory in reading • Schemas have a form and a meaning and they are linked to form a network. • [change of heart] -- meaning • Abstract schema [N of N] -- abstract meaning
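One possible way to picture such a network is as a set of form-meaning pairs, each linked to the more abstract schemas it instantiates. The `Schema` class below is a hypothetical illustration, not a representation proposed in the slides, and the meaning glosses are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Schema:
    """A form-meaning pairing linked to the more abstract schemas it instantiates."""
    form: str
    meaning: str
    parents: list = field(default_factory=list)

    def ancestors(self):
        """Return the forms of all more abstract schemas reachable from this one."""
        found = []
        for parent in self.parents:
            found.append(parent.form)
            found.extend(parent.ancestors())
        return found

# Placeholder glosses, invented for illustration only.
n_of_n = Schema("[N of N]", "abstract relation between two nominals")
change_of_heart = Schema("[change of heart]", "a reversal of opinion or feeling", [n_of_n])

print(change_of_heart.form, "->", change_of_heart.ancestors())
```

The specific expression carries its own meaning while inheriting a link to the abstract pattern, so words, collocations and constructions can sit in the same structure.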