
Searching corpora

Review • Word lists – Form: grammatical words are the most frequent – The number of word types at a given frequency grows as rank increases (that is, as frequency falls). For example, a corpus will typically contain more word types that occur 4 times than types that occur 5 times.
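
The two review points can be checked on any text. The sketch below is a minimal illustration, not the output of any particular concordance package: the tokenizer is deliberately crude, and the sample sentence and the function names (tokenize, word_list, frequency_spectrum) are assumptions made up for the example.

```python
from collections import Counter
import re


def tokenize(text):
    # Lowercase the text and keep alphabetic runs (with an optional
    # internal apostrophe). Real tools use more careful tokenizers.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_list(tokens):
    # Word list: (type, frequency) pairs, most frequent first.
    return Counter(tokens).most_common()


def frequency_spectrum(tokens):
    # Map each frequency to the number of types occurring exactly that often.
    return Counter(Counter(tokens).values())


sample = "the cat sat on the mat and the dog sat on the rug"
tokens = tokenize(sample)

print(word_list(tokens)[:3])
print(sorted(frequency_spectrum(tokens).items()))
```

Even in this tiny sample the grammatical word "the" tops the list, and more types occur once than twice.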

Uses of wordlists • Determine frequency bands for English – Most frequent 1000 words – Academic “band” etc. • Such frequency bands can be used in course planning and to judge the difficulty of a text

Uses of wordlists • Determine the core vocabulary for a particular topic, such as academic English or business English • Produce a wordlist for a business English corpus and use a stoplist (remove grammatical words) • Do a corpus comparison
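
A stoplist and a simple corpus comparison might be sketched as follows. Everything here is an assumption for illustration: the stoplist is a tiny hypothetical one, the two "corpora" are single invented sentences, and frequencies are normalised per thousand tokens so that corpora of different sizes can be compared.

```python
from collections import Counter

# Hypothetical mini stoplist; a real one would be much longer.
STOPLIST = {"the", "a", "of", "to", "and", "in", "is", "we"}


def content_word_list(tokens, stoplist=STOPLIST):
    # Count only the tokens that are not on the stoplist.
    return Counter(t for t in tokens if t not in stoplist)


def per_thousand(count, total):
    # Normalised frequency: occurrences per 1000 tokens.
    return 1000 * count / total


def compare(tokens_a, tokens_b, word):
    # Normalised frequency of one word in each of two corpora.
    return (
        per_thousand(tokens_a.count(word), len(tokens_a)),
        per_thousand(tokens_b.count(word), len(tokens_b)),
    )


# Invented toy corpora, already lowercased and split on whitespace.
business = "we review the contract and the invoice and sign the contract".split()
general = "we walk to the park and sit in the sun".split()

print(content_word_list(business).most_common(2))
print(compare(business, general, "contract"))
```

Words that are much more frequent in the specialised corpus than in the general one are candidates for the core vocabulary of that topic.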

Corpus comparison table

How to search: Using concordance software • Corpora are large and computer-aided analysis is necessary • Concordance software is the most common type of text analysis software • Concordance software – produces frequency lists from a corpus – allows searches for words or phrases
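
What a concordancer does when it searches for a word can be sketched in a few lines. This is a simplified key-word-in-context (KWIC) display under stated assumptions, not the implementation of any real package: the function names, the three-word context width and the sample sentence are all invented for the example.

```python
def concordance(tokens, node, width=3):
    # Return (left context, node, right context) for every hit of the node word.
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines


def format_line(left, node, right, col=30):
    # Right-align the left context so that the node words line up in a column.
    return f"{' '.join(left):>{col}}  {node}  {' '.join(right)}"


sample = ("She went to the post office before the head office opened "
          "and later took office as mayor")
tokens = sample.split()
lines = concordance(tokens, "office")

for left, node, right in lines:
    print(format_line(left, node, right))
```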

Searching a corpus

Info easily obtained from corpora • word lists and word frequency – (What are the most common words in a corpus?) • word combinations (collocations) such as high risk, low maintenance, deep passion • grammatical constructions (How common is the passive? How often does the passive occur with a by-phrase, as in He was seen by the doctor?) • What kinds of texts/genres are associated with particular collocations/constructions?

Basic procedure of corpus analysis • very simple: – searching for words and phrases – sorting the results – obtaining frequency information – performing the analysis (Sinclair reading)
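
The first three steps (search, sort, count) can be sketched as below. It is an illustration only: the helper names and the sample sentence are assumptions, and the final analysis step is left to the analyst.

```python
from collections import Counter


def concordance(tokens, node, width=3):
    # Search: collect (left context, node, right context) for each hit.
    return [
        (tokens[max(0, i - width):i], tok, tokens[i + 1:i + 1 + width])
        for i, tok in enumerate(tokens)
        if tok.lower() == node
    ]


def sort_by_left(lines):
    # Sort on the word immediately left of the node, then the next one out.
    return sorted(lines, key=lambda line: [w.lower() for w in reversed(line[0])])


def sort_by_right(lines):
    # Sort on the word immediately right of the node, then the next one out.
    return sorted(lines, key=lambda line: [w.lower() for w in line[2]])


sample = ("She went to the post office before the head office opened "
          "and later took office as mayor")
tokens = sample.split()
lines = concordance(tokens, "office")

# Frequency information: how often each word occurs directly left of the node.
left_neighbours = Counter(line[0][-1].lower() for line in lines if line[0])

for left, node, right in sort_by_left(lines):
    print(" ".join(left), f"[{node}]", " ".join(right))
print(left_neighbours.most_common())
```

Sorting to the left groups lines such as head office and post office together; sorting to the right would instead group the lines by what follows the node.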

Results - sorted

Results - surrounding words

Collocates and collocations? • The collocates of a word are the words that frequently co-occur with it • If office is the node word, then post, head and take are examples of collocates • head office is a collocation
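
Collocates can be counted by looking at a fixed window of words around each occurrence of the node. The sketch below uses raw co-occurrence counts only; the window size (span), the function name and the sample sentence are assumptions for the example. Raw counts tend to favour grammatical words, which is why concordance tools often also offer association measures such as mutual information or t-score.

```python
from collections import Counter


def collocates(tokens, node, span=2):
    # Count every word that occurs within `span` words of the node.
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


# Invented sample text, already lowercased.
tokens = "the post office and the head office are near the post office".split()

print(collocates(tokens, "office", span=1).most_common())
```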

Why are collocations a big deal? • There are several answers – knowing a language involves knowing lots and lots of collocations – it is unclear how to deal with collocations within grammar – collocates provide clues to the meaning or connotations of a word, e.g., husbands and wives

husband wife

Corpus data: collocates of high

Corpus data • evidence of extensive multi-word units • various names: collocations, fixed expressions, chunks, pre-fabricated units (prefabs), lexical bundles

Corpus data • thing: sort of thing, kind of thing, the thing to do, the thing is • change: change in attitude, change of heart, change in policy, change over time, time for a change, pace of change, rate of change, subject to change

Issues in interpreting concordance lines (Hunston) • Nature of search term – word, lemma, phrase • Some unwanted concordance lines may be deleted • Sorting (in different ways) brings out patterns in the data • Look first at the words surrounding the search term • In some cases, a larger co-text must be examined
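
The difference between searching for a word, a lemma and a phrase can be made concrete with regular expressions. This is a sketch under stated assumptions: the list of forms for the lemma CHANGE is supplied by hand (a real tool would use a lemmatised corpus or a lemma list), and the function names and the sample text are invented.

```python
import re


def word_pattern(word):
    # Exact word form only: "change" does not match "changed".
    return re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)


def lemma_pattern(forms):
    # Any of the listed inflected forms of a lemma.
    alternatives = "|".join(re.escape(f) for f in forms)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def phrase_pattern(phrase):
    # A fixed sequence of words separated by whitespace.
    words = [re.escape(w) for w in phrase.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)


# Hand-supplied forms of the lemma CHANGE (an assumption for this example).
CHANGE_FORMS = ["change", "changes", "changed", "changing"]

text = ("Prices change often. A change of heart changed everything, "
        "and things keep changing.")

print(word_pattern("change").findall(text))
print(lemma_pattern(CHANGE_FORMS).findall(text))
print(phrase_pattern("change of heart").findall(text))
```

The lemma search retrieves more lines than the word search, so the choice of search term directly shapes which patterns become visible.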

Issues in interpreting concordance lines (Hunston) • Typically, a large number of lines are retrieved. One technique is to look closely at a few concordance lines and try to draw some generalisations. Then look at another set of concordance lines to see if your generalisation holds • What is typical? What is central (prototypical)?

Techniques in interpreting concordance lines • Examine the frequent collocates – what are the larger (formal) patterns? • Examine or apply part-of-speech categories. Do patterns emerge if POS data is taken into account? • Check for semantic categories: colour terms, hedges, …

Corpus view of language • Researchers who work with corpora tend to come to similar views about the nature of language • More attention is paid to the words (lexis and phraseology) • Led to lexical approaches to language teaching (Willis, Lewis, McCarthy)

Words and grammar • Language is traditionally divided into the lexicon (words) and grammar (a set of rules) • But where, for example, does the phrase “sort of thing” or “the thing to do” fit within this view of language?

View of language • corpus view – language/grammar is a vast network of lexical/grammatical relations – collocations (high standards, high on drugs) – verb–object co-selection (lose – job, find – employment, made – redundant) – constructions: passive (or BE + adjective/participle)

Representing a corpus-based grammar • Quite difficult • Use schemas to represent words, collocations and constructions – [post office] – [N N] – [the thing to V] – [SUBJ Vmanner POSS way PATH]
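
One way to make the bracket notation concrete is to treat a schema as a sequence of slots and match it against part-of-speech tagged text. The sketch below is a toy illustration, not the representation proposed in the slides: the tagged sentence is hand-made, the tag labels are simplified, and the convention that uppercase slots match tags while lowercase slots match literal words is an assumption introduced for the example.

```python
# Hand-tagged toy data: (word, tag) pairs with simplified, hypothetical tags.
tagged = [
    ("the", "DET"), ("post", "N"), ("office", "N"), ("is", "V"),
    ("the", "DET"), ("thing", "N"), ("to", "TO"), ("visit", "V"),
]


def match_schema(tagged, schema):
    # An uppercase slot matches a POS tag; a lowercase slot matches a word.
    hits = []
    n = len(schema)
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if all(
            (slot == tag) if slot.isupper() else (slot == word.lower())
            for slot, (word, tag) in zip(schema, window)
        ):
            hits.append(" ".join(word for word, _ in window))
    return hits


print(match_schema(tagged, ["post", "office"]))          # fully specific: [post office]
print(match_schema(tagged, ["N", "N"]))                  # fully abstract: [N N]
print(match_schema(tagged, ["the", "thing", "to", "V"])) # mixed: [the thing to V]
```

The same matcher handles specific, abstract and partly filled schemas, which mirrors the idea that words, collocations and constructions can share one kind of representation.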

Schemas • Related to schema theory in reading • Schemas have a form and a meaning and they are linked to form a network. • [change of heart] -- meaning • Abstract schema [N of N] -- abstract meaning