
Using Corpora and how to build them
Adam Kilgarriff
Lexical Computing Ltd

• Corpora show us the facts of the language
Madrid 2010 Kilgarriff: Web Corpora

What is a corpus?
• A corpus is a collection of texts
  – when viewed as an object of language research

Which texts?
• Written
• Spoken

Written
• Books
  – Fiction
  – Non-fiction
  – Textbooks
• Newspapers
• Letters, unpublished
• Web pages
• Academic journals
• Student essays
• …

Spoken
Must be transcribed, for text corpora
• Conversation
  – Who? Region, class, age-group, situation…
• Lectures
• TV and radio
• Film transcripts
• Meetings, seminars
• …

Which texts?
• Different purposes, different text types
• Making dictionaries:
  – Cover the whole language
  – Some of everything

How much?
• Most words are rare
• Zipf’s Law
• To get enough data for most words, we need very big corpora

Zipf’s Law
Word (pos)       r         f          r × f
the (det)        1         6187267    6187267
to (prep)        10        917579     9175790
as (adv)         100       91583      9158300
playing (vb)     1000      9738       9738000
paint (vb)       2000      4539       9078000
amateur (adj)    10,000    741        7410000
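The r × f column can be recomputed directly from the rank and frequency columns. The sketch below is a minimal check, assuming only the six rows on this slide; the function name `rank_times_freq` is made up for illustration.

```python
# Rank/frequency rows copied from the slide: (word, rank r, frequency f).
rows = [
    ("the", 1, 6187267),
    ("to", 10, 917579),
    ("as", 100, 91583),
    ("playing", 1000, 9738),
    ("paint", 2000, 4539),
    ("amateur", 10000, 741),
]


def rank_times_freq(table):
    """Return (word, r * f) pairs; Zipf's Law predicts roughly equal products."""
    return [(word, r * f) for word, r, f in table]


products = rank_times_freq(rows)
for word, product in products:
    print(f"{word:10s} {product:>10,d}")

# Ranks span four orders of magnitude, yet the products stay
# within a small factor of each other.
values = [p for _, p in products]
spread = max(values) / min(values)
print(f"max/min ratio of r*f: {spread:.2f}")
```

The point of the comparison: frequency falls by a factor of several thousand between "the" and "amateur", while r × f changes by less than a factor of two.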

Zipf’s Law
• the: 6%
• 100 most frequent: 45%
• 7500 most frequent: 90%
• all others: rare
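The percentages on this slide are coverage figures: the share of all running words accounted for by the N most frequent word types. A minimal sketch of that calculation follows, assuming a frequency list held in a `Counter`; the helper `coverage` and the toy sentence are invented for illustration and say nothing about real English proportions.

```python
from collections import Counter


def coverage(counts, n):
    """Share of all tokens accounted for by the n most frequent word types."""
    total = sum(counts.values())
    top = sum(freq for _, freq in counts.most_common(n))
    return top / total


# Toy text, invented for illustration; a real check needs a large corpus.
tokens = "the cat sat on the mat and the dog sat on the cat".split()
counts = Counter(tokens)

for n in (1, 4, len(counts)):
    print(f"top {n}: {coverage(counts, n):.0%}")
```

Run over a large corpus frequency list, the same function would give the kind of figures quoted on the slide.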

Zipf’s Law

Leading English Corpora: Size of Corpora (in words)
[Chart: corpus size in words (10^6 to 10^9) by decade (1960s to 2000s); corpora shown: Brown/LOB, COBUILD, BNC, OEC]

Good news
• The web

You can’t help noticing
• Replaceable or replacable?
  – http://googlefight.com

• Very very large
  – 2006 estimates for duplicate-free, linguistic, Google-indexed web
    • German: 44 billion words
    • Italian: 25 billion words
    • English: 1,000 billion to 10,000 billion words
• Most languages
• Most language types
• Up-to-date
• Free
• Instant access

What is a corpus?
• A corpus is a collection of texts
  – when viewed as an object of language research

Is the web a corpus?
Yes

but it’s not representative

Sublanguage
• Language = core + sublanguages
• Options for corpus construction
  – none
  – some
  – all
• None
  – impoverished view of language
• Some: BNC
  – cake recipes and gastro-uterine disease
  – not car repair manuals or astronomy or …
• All: until recently, not viable

Representativeness
• The web is not representative
• but nor is anything else
• Text type variation
  – under-researched, lacking in theory
    • Atkins, Clear and Ostler 1993 on design brief for BNC; Biber 1988; Kilgarriff 2001
• Text type is an issue across NLP
  – Web: issue is acute because, as against BNC or WSJ, we simply don’t know what is there

What is out there?
• What text types are there on the web?
  – some are new: chatroom
  – proportions
    • is it overwhelmed by porn? How much?
• Hard question

Comparing frequency lists
• Web1T
  – Present from Google
  – All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
    • that’s 1,000,000,000,000
• Compare with BNC
  – 100 words with highest Web1T:BNC ratio
  – 100 words with lowest ratio
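The comparison on this slide can be sketched as a ratio of normalised frequencies. The code below is one possible way to do it, not necessarily the method used for the talk: counts are scaled to per-million figures, and a smoothing constant (an assumption of this sketch) is added so that words missing from one list do not cause division by zero. All counts and corpus sizes are invented toy values.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size


def rank_by_ratio(counts_a, size_a, counts_b, size_b, smoothing=1.0):
    """Sort words by (freq in A) / (freq in B), highest ratio first.

    `smoothing` is added to both normalised frequencies so that a word
    missing from one list does not cause division by zero. The constant
    is an assumption of this sketch, not something the slides specify.
    """
    words = set(counts_a) | set(counts_b)
    scored = []
    for w in words:
        fa = per_million(counts_a.get(w, 0), size_a) + smoothing
        fb = per_million(counts_b.get(w, 0), size_b) + smoothing
        scored.append((w, fa / fb))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented toy counts and corpus sizes, for illustration only.
web_counts = {"browser": 900, "she": 2000, "the": 60000}
ref_counts = {"browser": 10, "she": 400, "the": 6000}
ranked = rank_by_ratio(web_counts, 1_000_000, ref_counts, 100_000)

for word, ratio in ranked:
    print(f"{word:10s} {ratio:.2f}")
```

The top of the sorted list corresponds to the "highest ratio" words and the bottom to the "lowest ratio" words; normalising first matters because the two corpora differ hugely in size.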

Web-high (155 terms)
• 61 web and computing
  – config browser spyware url www forum
• 38 porn
• 22 US English (incl. Spanish influence: los)
• 18 business/products common on web
  – poker viagra lingerie ringtone dvd casino rental collectible tiffany
  – NB: BNC is old
• 4 legal
  – trademarks pursuant accordance herein

Web-low
• Exclude British English, transcription/tokenisation anomalies
  – herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

Observations
• Pronouns and past tense verbs
  – Fiction
• Masc vs fem
• Yesterday
  – Probably daily newspapers
• Constancy of ratios:
  – He/himself
  – She/herself

• The web
  – a social, cultural, political phenomenon
  – new, little understood
  – a legitimate object of science
  – mostly language
    • we are well placed
  – a lot of people will be interested
• Let’s
  – study the web
  – source of language data
  – apply our tools for web use (dictionaries, MT)
  – use the web as infrastructure

Using Search Engines
No setup costs
Start querying today
Methods
• Hit counts
• ‘Snippets’
  – Metasearch engines, WebCorp
• Find pages and download

Googleology
• Google hit counts for language modelling
  – Example: (Keller & Lapata 2003)
  – 36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista
• Very interesting work
• Great interest in query syntax

The Trouble with Google
• not enough instances
  – max 1000
• not enough queries
  – max 1000 per day with API
• not enough context
  – 10-word snippet around search term
• sort order
  – search term in titles and headings
• untrustworthy hit counts
• limited search options
• linguistically dumb, e.g. not lemmatised
  – aime/aimer/aimes/aimons/aimez/aiment …
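To illustrate the last point: a lemmatised resource counts all inflected forms under one headword, whereas a plain search engine treats each form as a separate string. The sketch below assumes a hand-made lookup table covering only the French forms listed on this slide; a real system would use a morphological analyser or lemmatiser, and the example sentence is invented.

```python
from collections import Counter

# Hand-made form-to-lemma table covering only the forms listed on the slide.
# A real system would use a morphological analyser or lemmatiser instead.
FORM_TO_LEMMA = {
    form: "aimer"
    for form in ("aime", "aimer", "aimes", "aimons", "aimez", "aiment")
}


def lemma_counts(tokens):
    """Count tokens by lemma, falling back to the lowercased form itself."""
    return Counter(FORM_TO_LEMMA.get(t.lower(), t.lower()) for t in tokens)


# Invented example sentence.
tokens = "Nous aimons le café et ils aiment le thé".split()
counts = lemma_counts(tokens)
print(counts["aimer"])  # both inflected forms are counted under one lemma
```

Without such a mapping, a researcher has to issue one query per form and add up the results by hand.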

• Appeal
  – Zero-cost entry, just start googling
• Reality
  – High-quality work: high-cost methodology


Also:
• No replicability
• Methods, stats not published
• At mercy of commercial corporation
Googleology is bad science
So…

Basic steps
• Gather pages
  – Google hits
  – Select and gather whole sites
  – General crawl
• Filter
• De-duplicate
• Linguistic processing
• Load into corpus tool
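As an illustration of the de-duplication step, the sketch below removes exact duplicates by hashing whitespace-normalised, lowercased text. This is only the simplest case and an assumption about how one might start: near-duplicates (pages sharing most but not all of their text) need more elaborate methods that are not shown here. The function names and sample pages are made up.

```python
import hashlib
import re


def normalise(text):
    """Collapse whitespace and lowercase so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(pages):
    """Keep the first copy of each page, comparing hashes of normalised text."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha1(normalise(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


# Sample pages invented for illustration; the second differs only in spacing.
pages = [
    "Corpora show us the facts of the language.",
    "Corpora   show us the facts of the language.\n",
    "Most words are rare.",
]
unique_pages = deduplicate(pages)
print(len(unique_pages))
```

Storing hashes rather than full texts keeps memory use modest when the page collection is large.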

Oxford English Corpus
• Whole domains chosen and harvested
  – control over text type
• 2 billion words (Mar 08)

Oxford English Corpus

DeWaC, ItWaC, UKWaC
• 1.5 B words each
• Marco Baroni, Adriano Ferraresi
• Seeds:
  – mid-frequency words from ‘core vocab’ lists and corpora
• Google on seed words, then crawl

Filtering
• Non-text (sound, image etc.) files
• Boilerplate (within file)
  – Copyright notices, navigation bars
  – “high markup” heuristic
• Not “text in sentences”
  – Look for function words
  – Lists? Sports results? Crossword puzzles?
• Spam, pornography
  – Tough
• De-duplication (also tough)
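Two of the heuristics named on this slide, looking for function words and measuring how much of a file is markup, can be sketched as simple ratios. Everything specific below is an assumption made for illustration: the ten-word function-word list, both thresholds, and the regular-expression treatment of tags are placeholders, not the filters used for any actual corpus.

```python
import re

# Tiny illustrative list; a real filter would use a fuller function-word list.
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "is", "that", "it", "for"}


def function_word_ratio(text):
    """Fraction of tokens that are function words (0.0 for empty text)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)


def markup_ratio(html):
    """Fraction of characters that sit inside <...> tags."""
    if not html:
        return 0.0
    tag_chars = sum(len(m) for m in re.findall(r"<[^>]*>", html))
    return tag_chars / len(html)


def looks_like_text(html, min_function=0.2, max_markup=0.5):
    """Both thresholds are arbitrary placeholders, to be tuned on real data."""
    text = re.sub(r"<[^>]*>", " ", html)
    return (function_word_ratio(text) >= min_function
            and markup_ratio(html) <= max_markup)


# Invented samples: a sentence of prose and a row of sports results.
prose = "<p>It is the view of the author that most words are rare.</p>"
results = "<td>Arsenal 2</td><td>Chelsea 1</td><td>Leeds 0</td>"
print(looks_like_text(prose), looks_like_text(results))
```

Running prose is dense in function words and light on tags; lists and results tables tend to be the opposite, which is what the two ratios pick up.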

Small, specialised corpora
• Terminologists
• Translators needing target-language domain-specific vocab
• Specialist dictionaries
  – Don’t exist
  – Expensive/inaccessible
  – Out of date

BootCaT (Bootstrapping Corpora and Terms)
• Put in seed terms
• Google/Yahoo search
• Retrieve Google/Yahoo hits
  – Remove duplicates, boilerplate
• Small instant corpora
• Baroni and Bernardini, LREC 2004
• Web version
  – WebBootCaT
  – At Sketch Engine site
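The step from seed terms to search queries can be sketched as follows. Combining seeds into small tuples is an assumption of this sketch rather than something stated on the slide, and `seed_queries`, its parameters, and the seed terms are all hypothetical. No search engine is called; the output is just the list of query strings one would submit.

```python
import itertools
import random


def seed_queries(seeds, tuple_size=3, max_queries=10, rng_seed=0):
    """Build search queries by combining seed terms into small tuples.

    tuple_size, max_queries and rng_seed are illustrative parameters.
    Multi-word seeds are quoted so a search engine treats them as phrases.
    """
    def quoted(term):
        return f'"{term}"' if " " in term else term

    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    random.Random(rng_seed).shuffle(combos)
    return [" ".join(quoted(t) for t in combo) for combo in combos[:max_queries]]


# Hypothetical seed terms for a corpus-linguistics domain.
seeds = ["corpus", "lemma", "concordance", "collocation", "word sketch"]
queries = seed_queries(seeds)
for q in queries:
    print(q)
```

Querying with several seeds at once tends to pull in pages that are about the domain rather than pages that merely mention one of the terms.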

Task
• Choose area of specialist interest
  – English or Spanish
• Select at least 5 seed terms
  – Specialist: good
• Build corpus
  – At least 100,000 words
  – Iterate if necessary
• Find at least six words/phrases/meanings you did not know before
• Write up