Linguistic evidence within and across languages, word frequency lists and language learning
Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass
Universities of Leeds and Sussex

Linguistic evidence within and across languages, word frequency lists and language learning
Or: Word lists are useful, but are they (could they be) scientific?

KELLY
• EU lifelong learning project
• Goal: wordcards
  – 9 languages, 36 pairs
  – Word in one language on one side, its translation on the other
  – For language learning
• Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
• Partners (including Leeds) in 6 countries
  – Leeds does Arabic, Chinese, Russian

Method
• Prepare monolingual lists
• Translate
  – Each into 8 target languages
  – Professional translation services
• Integrate, finalise
• Produce cards
• Goal for each set: 9000 pairs at 6 levels

(Monolingual) Word Lists
• Define a syllabus
• Which words get used in
  – Learning-to-read books (native-speaker, NS, children)
  – Language learner textbooks (non-native speakers, NNS)
  – Dictionaries
  – Language testing
• NS: educational psychologists
• NNS: proficiency levels

Should be corpus-based
• Most aren't
  – Corpora are quite new
• Easy to do better
• People will use them
  – Maybe also governments

How
• Take your corpus
• Count
• Voilà
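
The "take your corpus, count" step can be sketched in a few lines of Python. This is a minimal illustration, not the KELLY pipeline: the function name `frequency_list` and the definition of a word as a lower-cased run of letters or apostrophes are assumptions made for the example, and the complications on the following slides are exactly what this naive version ignores.

```python
import re
from collections import Counter


def frequency_list(text, top_n=10):
    """Return the top_n (word, count) pairs, most frequent first."""
    # Assumption: a "word" is a lower-cased run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)


sample = "The cat sat on the mat. The dog sat too."
print(frequency_list(sample, top_n=2))  # [('the', 3), ('sat', 2)]
```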

Complications
• What is a word
• Words and lemmas
• Grammatical classes
• Numbers, names...
• Multiwords
• Homonymy
All are slightly different issues for each language

What is a word; delimiters
• Found between spaces
  – Not for Chinese: segmentation
• English: co-operate, widely-held, farmer's, can't
• Compounding, separable verbs: Norwegian, Swedish
• Clitics, al, ...: Arabic, Italian
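
To see why the delimiter choice matters, the sketch below tokenises the English examples from this slide in two ways. Both function names are hypothetical and neither rule is claimed to be the one used in KELLY; the point is only that the same text yields different "words", and therefore different counts, depending on the policy.

```python
import re


def whitespace_tokens(text):
    """Tokens are whatever is found between spaces."""
    return text.split()


def letter_run_tokens(text):
    """Tokens are runs of letters, so hyphens and apostrophes split words."""
    return re.findall(r"[A-Za-z]+", text)


examples = "co-operate widely-held farmer's can't"
print(whitespace_tokens(examples))
print(letter_run_tokens(examples))
```

With the first rule the four items stay whole; with the second they break into eight pieces such as `co` and `operate`.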

Words and lemmas
• Word form (in text): invading
• Lemma (dictionary headword): invade
  – Forms: invades, invaded, invading
• Lemmatisation
  – Chinese: none; English: simple
  – Middling: Swedish, Norwegian, Italian, Greek
  – Tough: Russian, Polish, Arabic
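
A small sketch of counting lemmas rather than word forms, using the invade example from the slide. The lookup table `LEMMAS` is a hypothetical toy; a real list would rely on a lemmatiser for the language in question, which is where the "middling" and "tough" cases come in.

```python
from collections import Counter

# Hypothetical toy lemma table; real lemmatisation needs a language-specific tool.
LEMMAS = {
    "invade": "invade",
    "invades": "invade",
    "invaded": "invade",
    "invading": "invade",
}


def lemma_counts(tokens):
    """Count lemmas; forms missing from the table count as their own lemma."""
    return Counter(LEMMAS.get(t.lower(), t.lower()) for t in tokens)


print(lemma_counts(["invading", "Invaded", "invades", "army"]))
```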

Grammatical classes
• brush (verb) and brush (noun)
  – Same item or different?
• Proposal: lempos
• Recommendation: different
  – With trepidation
  – Chinese: weak sense of noun, verb
• Required: (short) list of word classes for each language
  – Same for all unless good reason
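
Treating brush (verb) and brush (noun) as different items amounts to counting on a combined lemma-plus-word-class key. The sketch below assumes the input is already lemmatised and tagged, and assumes a `lemma-tag` string format with single-letter tags; both are illustrative choices, not a specification of the project's format.

```python
from collections import Counter


def lempos(lemma, pos):
    """Build a combined key such as 'brush-n' (assumed format)."""
    return f"{lemma}-{pos}"


def lempos_counts(tagged_tokens):
    """Count (lemma, pos) pairs as separate list items."""
    return Counter(lempos(lemma, pos) for lemma, pos in tagged_tokens)


tagged = [("brush", "n"), ("brush", "v"), ("brush", "v"), ("tooth", "n")]
print(lempos_counts(tagged))
```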

Marginal cases
• Numbers: twelve, seventeenth, fifties
• Closed sets: days of week, months
• Countries: capitals, nationalities, currencies, adjectives, languages
• Regional/dialects, political groups, religions: easter, christmas, islam, republican
• Consistency before frequency: policies needed

Multiwords
• "according to"
  – Linguistically a word, but...
• Multiword frequency list: top item is "of the"
  – Can't use frequencies (alone) to select multiwords
• Base list
  – Recommendation: no multiwords
  – But see below

Homonymy
• bank (river) and bank (money)
• Word sense disambiguation
  – We can't do it (with decent accuracy)
  – We can't give frequencies for senses
• Lists of words, not meanings
  – Sometimes disconcerting
  – See also below

Corpora
• A fairly arbitrary sample of a language
• To limit arbitrariness of the word list
  – Make it big and diverse
• WACKY corpora
  – From the web
  – Can do for any language
  – Web language: less formal, not mainly 'reporting' or fiction (cf. news, BNC)
  – Good for language learners

Comparing corpora
• Corpora: new
  – We are all beginners
• Best way to get a sense of a corpus
  – Compare with another
  – Keywords of each vs. other
• Case studies
• Sketch Engine functions

Comparing frequency lists
• Web 1T
  – Present from Google
  – All 1-, 2-, 3-, 4-, 5-grams with f > 40 in one trillion (10^12) words of English
  – That's 1,000,000,000,000
• Compare with BNC
  – Take top 50,000 items of each
  – 105 Web 1T words not in BNC top 50k
  – 50 words with highest Web 1T:BNC ratio
  – 50 words with lowest ratio
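
The comparison can be sketched as follows. This is an illustration under stated assumptions: the function names and the toy counts are invented, frequencies are normalised per million words so that corpora of different sizes can be compared (the slide does not say how the ratio was computed), and the step of first cutting each list to its top 50,000 items is left out for brevity.

```python
def per_million(counts):
    """Convert raw counts to frequencies per million tokens."""
    total = sum(counts.values())
    return {word: n * 1_000_000 / total for word, n in counts.items()}


def compare_lists(counts_a, counts_b, n=50):
    """Return (words only in A, n highest A:B ratios, n lowest A:B ratios)."""
    freq_a, freq_b = per_million(counts_a), per_million(counts_b)
    only_in_a = sorted(w for w in freq_a if w not in freq_b)
    shared = [w for w in freq_a if w in freq_b]
    ranked = sorted(shared, key=lambda w: freq_a[w] / freq_b[w], reverse=True)
    return only_in_a, ranked[:n], ranked[-n:]


# Invented toy counts, for illustration only.
web = {"the": 600, "browser": 200, "she": 100, "spyware": 100}
bnc = {"the": 600, "browser": 10, "she": 390}
print(compare_lists(web, bnc, n=1))
```

On the toy data, `spyware` is the web-only word, `browser` has the highest web:BNC ratio and `she` the lowest.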

Web-high (155 terms)
• 61 web and computing
  – config browser spyware url www forum
• 38 porn
• 22 US English (incl. Spanish influence: los)
• 18 business/products common on web
  – poker viagra lingerie ringtone dvd casino rental collectible tiffany
  – NB: BNC is old
• 4 legal
  – trademarks pursuant accordance herein

Web-low
• Exclude British English, transcription/tokenisation anomalies
  – herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

Observations
• Pronouns and past tense verbs
  – Fiction
• Masc vs fem
• Yesterday
  – Probably daily newspapers
• Constancy of ratios:
  – He/himself
  – She/herself

Corpus Factory
• Many languages
• General corpus, 100m+ words
• Fast
• High quality
• Comparable across languages

Gather seed words
• Wikipedia (Wiki) corpora
  – Many domains
  – Free
  – 265 languages covered, more to come
• Extract text from Wikipedia
  – Wikipedia2Text
• Tokenise the text
  – Morphology of the language is important
  – Can use existing word tokeniser tools

Web Corpus Statistics

Evaluation
• For each of the languages, two corpora available: Web and Wiki
  – Dutch: also a carefully designed lexicographic corpus
• Hypothesis: Wiki corpora are ‘informational’
  – Informational > typical written
  – Interactional > typical spoken

Evaluation
• 1st and 2nd person pronouns are strong indicators of interactional language
  – English: I me my mine yours we us our
• For each language
  – Ratio: web:wiki
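
A minimal sketch of the web:wiki pronoun ratio for English. The pronoun set is copied from the slide as it stands; the function names and the use of relative frequency (pronoun hits divided by corpus size) are assumptions, since the slide gives only the ratio. The sketch also assumes both corpora are non-empty and that the wiki corpus contains at least one listed pronoun.

```python
# Pronoun list as given on the slide.
PRONOUNS = {"i", "me", "my", "mine", "yours", "we", "us", "our"}


def pronoun_rate(tokens):
    """Share of tokens that are 1st/2nd person pronouns."""
    hits = sum(1 for t in tokens if t.lower() in PRONOUNS)
    return hits / len(tokens)


def web_wiki_ratio(web_tokens, wiki_tokens):
    """Ratio above 1 suggests the web corpus is the more interactional one."""
    return pronoun_rate(web_tokens) / pronoun_rate(wiki_tokens)


web = "I think we love our town".split()
wiki = "The town is our home and is old".split()
print(web_wiki_ratio(web, wiki))  # 4.0
```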

Results

Stages
• Sort out corpora, tagging
• Automatically generate M1 lists
  – names, numbers, countries...
  – keywords vis-à-vis other corpora
• Review, prepare M2 lists
• Translate

Review: how?
• Points system
  – 2 points for each of 6 levels
  – 12 points for most frequent words
  – Deduct points for words in overrepresented areas
• Add in words from other corpora
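
One possible reading of the points system, sketched in Python. Everything beyond "2 points per level, 12 for the most frequent words, deduct for overrepresented areas" is an assumption made for illustration: the band size of 1500 (9000 words spread evenly over 6 levels), the 2-point penalty, and the function names are all hypothetical.

```python
def points_for_rank(rank, band_size=1500, levels=6, step=2):
    """Map a 1-based frequency rank to points: 12 for the top band, down to 2."""
    # Assumption: equal-sized bands; ranks beyond the last band stay in it.
    band = min((rank - 1) // band_size, levels - 1)
    return (levels - band) * step


def adjusted_points(rank, overrepresented=False, penalty=2):
    """Deduct an assumed penalty for words from overrepresented areas."""
    points = points_for_rank(rank)
    return max(points - penalty, 0) if overrepresented else points


print(points_for_rank(1), points_for_rank(1501), points_for_rank(9000))
print(adjusted_points(1, overrepresented=True))
```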

Translation database
• On the web
• All translations entered into it
• Queries like
  – All Swedish words used as translations more than six times
  – All 1:1:1:1... 'simple cases'
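
The first query can be illustrated over an in-memory stand-in for the database. The record layout (source language, source word, target language, target word), the language codes and the function name are assumptions for the sketch; the actual KELLY database schema is not described on the slide.

```python
from collections import Counter


def frequent_targets(records, target_lang, more_than=6):
    """Words in target_lang used as a translation more than `more_than` times."""
    uses = Counter(
        tgt_word
        for _src_lang, _src_word, tgt_lang, tgt_word in records
        if tgt_lang == target_lang
    )
    return sorted(word for word, n in uses.items() if n > more_than)


# Invented records: seven source words all translated as Swedish "bra".
records = [("en", f"word{i}", "sv", "bra") for i in range(7)]
records += [("en", "dog", "sv", "hund"), ("it", "cane", "sv", "hund")]
print(frequent_targets(records, "sv"))  # ['bra']
```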

Translations
• Usually, of texts
  – Words in context
• Kelly: no context
  – Usual principles don't apply
• Instructions to translators

Using the database
• Find words not in M2 lists that need adding
• Multiwords
  – English "look for"
  – Probably the translation of a high-frequency word in several of the 8 other languages
  – So: add it to the English list
• Homonyms: could be similar
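
The back-translation idea can be sketched as a search for items that several other languages translate into, but which are missing from the language's own list. The record layout, the threshold of three source languages (standing in for "several") and the function name are assumptions for illustration.

```python
def back_translation_candidates(records, own_lang, own_list, min_langs=3):
    """Items in own_lang, absent from own_list, translated into from min_langs+ languages."""
    source_langs = {}
    for src_lang, _src_word, tgt_lang, tgt_word in records:
        if tgt_lang == own_lang and tgt_word not in own_list:
            source_langs.setdefault(tgt_word, set()).add(src_lang)
    return sorted(w for w, langs in source_langs.items() if len(langs) >= min_langs)


# Invented records, for illustration only.
records = [
    ("sv", "leta", "en", "look for"),
    ("it", "cercare", "en", "look for"),
    ("pl", "szukac", "en", "look for"),
    ("sv", "hund", "en", "dog"),
    ("it", "cane", "en", "pooch"),
]
print(back_translation_candidates(records, "en", own_list={"dog", "look"}))
```

Here `look for` is proposed for the English list, while `pooch` is not, because only one language points to it.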

Monolingual master lists (M3)
• Based on a WAC corpus
• Input from other same-language corpora
• And from translations from 8 languages
  – Useful words which might not be high-frequency
  – Added words/multiwords must be above a lower frequency threshold
• Target: 9000
• Important contribution

Numbers
• Target: 9000 per list
• M2 lists
  – Estimate: 5000-6000 needed
• We add 3000-4000 multiwords and other 'back-translations'

From M3 lists to T2 lists

Current status
• M1 lists prepared
• Lists checked, compared with other lists
  – Corpus-based and other
• M2 lists prepared
• Translation underway

Big problems
• Multiwords (as anticipated)
• Homonymy (as anticipated)
• orange, banana, alphabet, elbow, Hello
  – Worse than anticipated
  – Lists from spoken corpora, learner corpora, needed
• Relation between
  – Competence for communicating
  – The corpora at our disposal

Word lists are useful, but
• ... are they scientific?
  – A tiny bit, occasionally
• ... could they be scientific?
  – Yes (article of faith)
  – By the end of KELLY, we'll have a clearer idea how

http://forbetterenglish.com
