N-gram Tokenization for Indian Language Text Retrieval
Paul McNamee
paul.mcnamee@jhu.edu
13 December 2008

Talk Outline
- Introduction
- Monolingual Experiments from CLEF 2000-2007
  - Words
  - Stemmed words (Snowball)
  - Character n-grams (n=4, 5)
  - N-gram stems
  - Automatically segmented words (Morfessor algorithm)
  - Skipgrams (n-grams with skips)
- Why are n-grams effective?
- Bilingual Experiments (CLEF)
- FIRE Results
- Summary

Morphological Processes
- Inflection: box, boxes (plural); actor (male), actress (female)
- Conjugation: write, written, writing; swim, swam, swum
- Derivation: sleep, sleepy; play (verb), player (noun), playful (adjective)
- Word Formation
  - Compounding: news + paper = newspaper; air + port = airport
  - Clipping: professor -> prof; facsimile -> fax
  - Acronyms: GOI = Government of India

Why Do We Normalize Text?
- It seems desirable to group related words together for query/document processing
- Why?
  - To make lexicographers happy?
  - To improve system performance?
- If performance is the goal, then it ought not to matter whether the indexing terms look like morphemes or not

Rule-Based Stemming: Snowball
- Applicable to alphabetic languages
- An approximation to lemmatization
- Identify a root morpheme by chopping off prefixes and suffixes
- Used for Dutch, English, Finnish, French, German, Italian, Spanish, and Swedish
- Snowball rulesets also exist for Hungarian and Portuguese
- No Indian language support
- Most stemmers are rule-based (ε = empty string)
  - -ing => ε    juggling => juggl
  - -es => ε     juggles => juggl
  - -le => -l    juggle => juggl
- The Snowball project provides high-quality, rule-based stemmers for many European languages: http://snowball.tartarus.org/

N-gram Tokenization
- Represent text as overlapping substrings
- A fixed length n of 4 or 5 is effective in alphabetic languages
- For text of length m, there are m-n+1 n-grams
- [Figure: the word _swimmers_ divided into overlapping n-grams]
- Advantages: simple, addresses morphology, surrogate for short phrases, robust against spelling and diacritical errors, language-independent
- Disadvantages: conflation (e.g., simmer, slimmer, glimmer, immerse); n-grams incur both speed and disk usage penalties
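The tokenization described above can be sketched in a few lines. This is an illustrative sketch, not the HAIRCUT implementation: the function name `char_ngrams` and the choice to pad each word with one underscore on each side are assumptions, the latter chosen to mirror the underscore-delimited terms shown elsewhere in the talk (e.g., `_auth ... ored_` for "authored").

```python
def char_ngrams(word, n=4, pad="_"):
    """Return the overlapping character n-grams of a single word.

    The word is wrapped in one pad character per side (an assumption,
    mirroring the underscores used in the slides' examples).
    """
    padded = f"{pad}{word}{pad}"
    if len(padded) < n:
        return [padded]  # too short to split: keep it as one term
    # A string of length m yields m - n + 1 n-grams.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("swimmers", n=4))
print(char_ngrams("authored", n=5))
```

Running whole sentences through the same sliding window, rather than word by word, would also produce the word-spanning n-grams that act as a surrogate for short phrases.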

Single N-gram Stemming
- Traditional (rule-based) stemming attempts to remove the morphologically variable portion of words
  - Negative effects from over- and under-conflation
- 4-gram counts for two example words:

  Hungarian       Bulgarian
  _hun (20547)    _bul (10222)
  hung (4329)     bulg (963)
  unga (1773)     ulga (1955)
  ngar (1194)     lgar (1480)

  Shared by both words: gari (2477), aria (11036), rian (18485), ian_ (49777)

- Short n-grams covering affixes occur frequently; those around the morpheme tend to occur less often. This motivates the following approach:
  (1) For each word choose the least frequently occurring character 4-gram (using a 4-gram index)
  (2) Benefits of n-grams with run-time efficiency of stemming
- Continues work in Mayfield and McNamee, ‘Single N-gram Stemming’, SIGIR 2003
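Step (1) can be illustrated with the counts from this slide. The sketch below is hedged: `ngram_stem` and `COUNTS` are hypothetical names, the underscore padding is assumed, and treating unseen n-grams as having count 0 is a simplification rather than something the talk specifies.

```python
# 4-gram counts copied from the Hungarian/Bulgarian example on the slide.
COUNTS = {
    "_hun": 20547, "hung": 4329, "unga": 1773, "ngar": 1194,
    "_bul": 10222, "bulg": 963, "ulga": 1955, "lgar": 1480,
    "gari": 2477, "aria": 11036, "rian": 18485, "ian_": 49777,
}


def ngram_stem(word, counts, n=4, pad="_"):
    """Pick the least frequent n-gram of `word` as its single indexing term."""
    padded = f"{pad}{word}{pad}"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    if not grams:
        return padded  # word too short to contain an n-gram
    # Unseen n-grams count as 0 here (an assumption); ties go to the leftmost.
    return min(grams, key=lambda g: counts.get(g, 0))


print(ngram_stem("hungarian", COUNTS))
print(ngram_stem("bulgarian", COUNTS))
```

Because each word maps to exactly one term, the index stays about as small as a stemmed index, which is the run-time benefit named in step (2).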

Statistical Segmentation
- Morfessor Algorithm
  - Given a dictionary list, learns to split words into segments
  - A form of statistical stemming based on Minimum Description Length (MDL)
  - > 70% of world languages have concatenative morphology
  - Creutz & Lagus, ACL-2002
  - http://www.cis.hut.fi/projects/morpho
- Examples
  - affect+ion+ate
  - author+ized
  - juggle+r+s
  - sea+gull+s
- 2007 Morphology Challenge
  - Successful on an IR task
  - Multiple segments per word are generated
- See McNamee, Nicholas, & Mayfield, ‘Don’t Have a Stemmer? Be un+concern+ed’, SIGIR 2008

Character Skipgrams
- Character n-grams: robust matching technique
- Skipgrams: super robust matching
  - Some letters are omitted (essentially a wildcard match)
  - sw*m matches swim / swam / swum
  - f**t matches foot / feet
- Skip bi-grams for fuzzy matching
  - Pirkola et al. (2002): learning cross-lingual translation mappings in related languages
  - Mustafa (2004): monolingual Arabic retrieval
- Example: 4,2 skipgrams for Hopkins
  - 4 letters, 2 skips
  - hkin, hpkn, hoin, hokn, hopn
  - oins, okis, opns, opis, opks
  - Note: more skipgrams than plain n-grams
- Slight gains in Czech, Hungarian, Persian
- Application to OCR’d docs?
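One way to enumerate such skipgrams is sketched below. The exact generation scheme is not spelled out on the slide, so the rule used here is an assumption: slide a window of n + skips characters over the word, always keep the window's first and last letter, and keep every ordered choice of n - 2 interior letters. For "hopkins" this yields all ten terms listed above plus two more (hpin and okns), so the scheme used in the experiments may differ in detail. The function name `skipgrams` is hypothetical.

```python
from itertools import combinations


def skipgrams(word, n=4, skips=2):
    """Enumerate character skipgrams that keep n letters and drop `skips`.

    Assumed scheme: within each window of n + skips characters, keep the
    first and last letter and any n - 2 of the interior letters, in order.
    """
    width = n + skips
    found = set()
    for start in range(len(word) - width + 1):
        window = word[start:start + width]
        first, interior, last = window[0], window[1:-1], window[-1]
        for kept in combinations(interior, n - 2):
            found.add(first + "".join(kept) + last)
    return sorted(found)


print(skipgrams("hopkins"))
```

The same word has only four plain 4-grams without padding (hopk, opki, pkin, kins), which illustrates the note that skipgrams outnumber plain n-grams.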

Generating Indexing Terms

Word             Snowball    Morfessor          5-grams
authored                     author+ed          _auth, autho, uthor, thore, hored, ored_
authorized                   author+ized        _auth, autho, uthor, thori, horiz, orize, rized, ized_
authorship       authorship  author+ship        _auth, autho, uthor, thors, horsh, orshi, rship, ship_
reauthorization  reauthor    re+author+ization  _reau, reaut, eauth, autho, uthor, thori, horiz, oriza, rizat, izati, zatio, ation, tion_
afoot                        a+foot             _afoo, afoot, foot_
footballs                    football+s         _foot, footb, ootba, otbal, tball, balls, alls_
footloose        footloos    foot+loose         _foot, footl, ootlo, otloo, tloos, loose, oose_
footprint                    foot+print         _foot, footp, ootpr, otpri, tprin, print, rint_
feet                                            _feet, feet_

Blank cells could not be recovered from the slide.

JHU/APL HAIRCUT System
- The Hopkins Automatic Information Retriever for Combing Unstructured Text (HAIRCUT)
  - Uses state-of-the-art statistical language model
    - Ponte & Croft, ‘A Language Modeling Approach to Information Retrieval’, SIGIR-98
    - Miller, Leek, and Schwartz, ‘A Hidden Markov Model Information Retrieval System’, SIGIR-99
    - Typically set λ to 0.5
  - Language-neutral
  - Supports large dictionaries
  - Used at TREC (10x), CLEF (9x), NTCIR (2x)
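For context on λ: in the linearly interpolated retrieval model of Miller, Leek, and Schwartz, a query is scored by mixing a document model with a collection-wide model. The form below is a sketch under that assumption; the symbols Q (query), D (document), and C (collection) are chosen here for illustration, and the slide itself gives only the value of λ.

```latex
P(Q \mid D) \;=\; \prod_{q \in Q} \Bigl[ \lambda \, P(q \mid D) + (1 - \lambda) \, P(q \mid C) \Bigr]
```

With λ = 0.5, document and collection statistics receive equal weight. The terms q can be words, stems, or n-grams, which is what makes the model language-neutral.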

CLEF Ad Hoc Test Sets (2000 – 2007)

Language          #docs   size
Bulgarian (BG)      69k    213 MB
Czech (CS)          82k    178 MB
Dutch (NL)         190k    540 MB
English (EN)       170k    580 MB
Finnish (FI)        55k    137 MB
French (FR)        178k    470 MB
German (DE)        295k    660 MB
Hungarian (HU)      50k    105 MB
Italian (IT)       157k    363 MB
Portuguese (PT)    107k    340 MB
Russian (RU)        17k     68 MB
Spanish (ES)       453k   1086 MB
Swedish (SV)       143k    352 MB

The slide also has per-year columns (00 through 07) whose cell positions are not recoverable. Values in extracted order: 34 37 33 50 50 56 47 42 54 42 30 45 45 49 50 52 49 49 50 56 49 50 50 156 50 47 49 50 367 333 192 48 50 148 181 46 49 49 51 28 50 120 50 34 49 34 50 50 146 62 50 57 156 49 53 102

Tokenization Alternatives
- Stemming
  - Effective in Romance languages
  - Not always available
- N-grams
  - Language-neutral
  - Large gains in complex languages
- Other techniques
  - Statistical stemming beats words
    - Segmentation
    - Single n-gram stems
  - No run-time penalty

Monolingual Tokenization

                words   stems   morf    4-stem  4-grams 5-grams
BG Bulgarian    0.2164  -       0.2703  0.2822  0.3105  0.2820
CS Czech        0.2270  -       0.3215  0.2567  0.3294  0.3223
DE German       0.3303  0.3695  0.3994  0.3464  0.4098  0.4201
EN English      0.4060  0.4373  0.4018  0.4176  0.3990  0.4152
ES Spanish      0.4396  0.4846  0.4451  0.4485  0.4597  0.4609
FI Finnish      0.3406  0.4296  0.4018  0.3995  0.4989  0.5078
FR French       0.3638  0.4019  0.3680  0.3882  0.3844  0.3930
HU Hungarian    0.1976  -       0.2921  0.2836  0.3746  0.3624
IT Italian      0.3749  0.4178  0.3474  0.3741  0.3738  0.3997
NL Dutch        0.3813  0.4003  0.4053  0.3836  0.4219  0.4243
PT Portuguese   0.3162  -       0.3287  0.3418  0.3358  0.3524
RU Russian      0.2671  -       0.3307  0.2875  0.3406  0.3330
SV Swedish      0.3387  0.3756  0.3738  0.3638  0.4236  0.4271

Average         0.3230  -       0.3605  0.3518  0.3894  0.3923
% change                -       11.6%   8.9%    20.5%   21.4%
Average-8       0.3719  0.4146  0.3928  0.3902  0.4214  0.4310
% change                11.5%   5.6%    4.9%    13.3%   15.9%

A dash marks a cell with no value on the slide. Average-8 covers the eight languages with stemmed runs.
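The "% change" rows are consistent with relative change against the words baseline in the same row group. This reading is inferred from the numbers rather than stated on the slide. As a worked example using the Average-8 values for 5-grams:

```latex
\Delta_{\text{5-grams}}
  = \frac{\mathrm{MAP}_{\text{5-grams}} - \mathrm{MAP}_{\text{words}}}{\mathrm{MAP}_{\text{words}}}
  = \frac{0.4310 - 0.3719}{0.3719}
  \approx 0.159 = 15.9\%
```

Small discrepancies in other cells would be expected if the slide's percentages were computed from unrounded averages.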

IR & Language Family
- 5-gram Gains
  - Tied to morphological complexity
  - Small improvements in Romance family
- Estimating Complexity
  - Mean word length: Spearman rho = 0.77
  - Information-theoretic approach: Spearman rho = 0.67 (Kettunen et al., Juola)
[Figure: relative improvement (0% to 90%) vs. mean token length (4.00 to 7.50), one point per language; recoverable labels include BG, FI, CS, SV, DE, RU]
[Figure: relative improvement (0% to 90%) vs. Kettunen estimate (1.03 to 1.18), one point per language; recoverable labels include HU, FI, CS, SV, NL, DE]

Why are N-grams Effective?
- (1) Spelling
  - N-grams localize single letter spelling errors
  - In news about 1 in 2000 words is misspelled
- (2) Phrasal Clues
  - Word spanning n-grams hint at phrases
  - Only slight differences observed

(3) Because of Morphological Variation?
- N-grams might gain their power by controlling for morphological variation
  - N-grams focused on root morphemes tend to match across inflected forms
- Juola (1998) and Kettunen (2006) did experiments ‘removing’ morphology from language
  - Such as replacing each surface form with a 6-digit number
- I compared words and 5-grams under normal and permuted letter conditions
  - golfer: legfro
  - golfed: dofegl
  - golfing: ligfron

Source of N-gram Power
[Figure: relative change in MAP (-25% to 100%) per language (HU, FI, CS, BG, DE, SV, RU, NL, ES, PT, FR, IT, EN) for two series: "5-grams - ordinary words" and "5-grams - permuted words"]
- Idea: remove morphology from a language
  - Letter order of words was randomly permuted
  - golfer -> legfro, team -> eamt
  - golfing, golfer, golfed no longer share a morpheme
- 4 conditions: {words, 5-grams} x {normal, shuffled}
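The permutation step can be sketched as follows. The slides do not say how the shuffling was implemented, so everything here is an assumption: the name `permute_letters`, and in particular the choice to seed the generator from the word itself so that every occurrence of a surface form maps to the same scrambled form.

```python
import random


def permute_letters(word, seed=0):
    """Return `word` with its letters shuffled, identically on every call.

    Seeding from the word (an assumption) keeps exact word matching intact
    while destroying the substrings that related forms used to share.
    """
    rng = random.Random(f"{seed}:{word}")
    letters = list(word)
    rng.shuffle(letters)
    return "".join(letters)


for w in ["golfer", "golfed", "golfing"]:
    print(w, "->", permute_letters(w))
```

Under this setup word-based retrieval is unaffected by the shuffling, so any drop in 5-gram effectiveness on permuted text can be attributed to the loss of shared morphemes.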

Corpus-Based Translation
- Given aligned parallel texts and a particular term to translate
  - Find set of documents (sentences) in the source language containing the term
  - Examine corresponding foreign documents
  - Extract ‘good’ candidate(s)
  - Goodness can be based on term similarity measures (Dice, MI, IBM Model 1, etc.)
- Example aligned text:
  The price of oil increased yesterday. The economy reacted sharply …
  El precio del petróleo aumentó ayer. La economía reaccionó agudamente …
- The Rosetta Stone was discovered in 1799 by Napoleonic forces in Egypt. British physicist Thomas Young determined that cartouches were names of royalty. In 1821 Jean François Champollion began deciphering hieroglyphics using parallel data in Demotic and Greek.
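The procedure above can be sketched with the Dice coefficient as the goodness measure. This is a toy illustration, not the system used in the experiments: `dice_translations` and `PAIRS` are hypothetical names, tokenization is plain whitespace splitting, and only the first sentence pair is adapted from the slide (the others are invented for the example).

```python
from collections import Counter


def dice_translations(term, pairs, top=3):
    """Rank target-language tokens by Dice association with a source term.

    `pairs` is a list of (source_sentence, target_sentence) strings.
    Dice(s, t) = 2 * |pairs with both| / (|pairs with s| + |pairs with t|).
    """
    src_hits = 0
    tgt_df = Counter()  # pairs whose target side contains each token
    both = Counter()    # same, restricted to pairs whose source has `term`
    for src, tgt in pairs:
        tgt_tokens = set(tgt.lower().split())
        tgt_df.update(tgt_tokens)
        if term in src.lower().split():
            src_hits += 1
            both.update(tgt_tokens)
    scores = {t: 2 * c / (src_hits + tgt_df[t]) for t, c in both.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Toy aligned corpus: first pair adapted from the slide, the rest invented.
PAIRS = [
    ("the price of oil increased yesterday", "el precio del petróleo aumentó ayer"),
    ("the price of bread", "el precio del pan"),
    ("oil exports fell", "las exportaciones de petróleo cayeron"),
    ("the economy reacted sharply", "la economía reaccionó agudamente"),
]

print(dice_translations("oil", PAIRS, top=1))
```

The same counting works unchanged if the tokens on each side are character n-grams instead of words, which is the basis for translating n-grams directly.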

N-gram Translations
- Character n-grams can be statistically translated, just like words
- N-grams (such as n=4, 5) are smaller than words
  - May capture affixes and morphological roots
    - ‘work’ (from working) maps to ‘abaj’ (as in trabajaba)
    - ‘yrup’ (from syrup) maps to ‘rabe’ (as in jarabe)
  - Suitable with proper nouns
    - ‘therl’ (from Netherlands) to ‘ses b’ (as in Países Bajos)
- Example:

           German         Italian   French   Dutch
  word     milch          latte     lait     melk
  stem     milch          latt      lait     melk
  4-grams  milc ilch      latt      lait     melk
  5-grams  _milch ilch_   _latte    _lait_   _melk_

Parallel Sources

Corpus      Size  Genre                 CLEF Languages
Bible       785k  Religious             CZ, DE, EN, ES, FI, FR, IT, NL, PT, RU, SV
JRC/Acquis  32M   EU Law                BG, CZ, DE, EN, ES, FI, FR, HU, IT, NL, PT, RU, SV
Europarl    33M   Parliamentary Debate  DE, EN, ES, FI, FR, IT, NL, PT, SV
OJEU        84M   Governmental Affairs  DE, EN, ES, FI, FR, IT, NL, PT, SV

Sample text:
- Bible: Therefore was the name of it called Babel; because Jehovah did there confound the language of all the earth: and from thence did Jehovah scatter them abroad upon the face of all the earth.
- Acquis: (24) In order to contribute to the conservation of octopus and in particular to protect the juveniles, it is necessary to establish, in 2006, a minimum size of octopus from the maritime waters under the sovereignty or jurisdiction of third countries and situated in the CECAF region pending the adoption of a regulation amending Regulation (EC) No 850/98.
- Europarl: Mr President, the tsunami tragedy should be no less significant to the world’s leaders and to Europe than 11 September.
- OJEU: 11. Trafficking in women for sexual exploitation. A4-0372/97. Resolution on the Communication from the Commission to the Council and the European Parliament on trafficking in women for the purpose of sexual exploitation (COM(96)0567 - C4-0638/96). The European Parliament,

Effectiveness & Corpus Size
[Figure: four panels (DE, FI, ES, FR), each plotting words and 5-grams; y-axis from 0.05 to 0.45, x-axis showing the share of the corpus used (0%, 1%, 2%, 5%, 10%, 20%, 40%, 60%, 80%, 100%)]
English queries translated using Europarl; corpus sub-sampled from 1% to 100%.

Effectiveness by size (2)
[Figure: four panels (IT, PT, NL, SV), each plotting words and 5-grams; y-axis from 0.05 to 0.45, x-axis showing the share of the corpus used (0%, 1%, 2%, 5%, 10%, 20%, 40%, 60%, 80%, 100%)]

FIRE Index Characteristics

                  BN Bengali  EN English  HI Hindi  MR Marathi
#docs             123,040     125,516     95,213    99,359
# uniq words      34,985      247,592     19,403    47,940
# uniq 5-grams    1,321,876   839,103     741,915   1,580,775
text size (gzip)  151 MB      122 MB      110 MB    104 MB

- Vocabulary size in Indian languages (ILs) seems abnormally small
- Possibly a bug in my pre-processing or tokenization, perhaps related to Unicode (e.g., continuation or modification characters)

Neuchâtel

                  BN Bengali  HI Hindi  MR Marathi
#docs             123,047     95,215    99,357
# uniq words      249,215     127,658   511,550

Tokenization for FIRE 2008

            BN Bengali  EN English  HI Hindi  MR Marathi  Average
words       0.1231      0.5495      0.0672    0.1735      0.2283
4-grams     0.3280      0.5241      0.2820    0.3740      0.3834
5-grams     0.3582      0.5415      0.3487    0.3675      0.4040
sk41        0.3352      0.5264      0.2746    0.3478      0.3710
Top @ FIRE  0.4719      0.5572      0.3487    0.4483      0.4565

- Difficult to interpret results with anomalous vocabulary
  - Need failure analysis
  - Performance using words in ILs seems quite depressed
- Hindi 5-gram run had good relative performance
  - Difference vs. 4-grams much larger than typically seen

Relative Gains w/ Relevance Feedback

          BN Bengali  EN English  HI Hindi  MR Marathi
words     +29.4%      +22.0%      -4.0%     +19.5%
4-grams   +36.3%      +21.1%      +18.6%    +23.9%
5-grams   +46.4%      +21.4%      +20.3%    +27.4%
sk41      +41.3%      +19.6%      +4.0%     +21.0%

- Query expansion using top 10 documents
  - 50 terms (words), 150 terms (4/5-grams), 400 terms (sk41)
- Fairly effective: 20-40% gains

In Conclusion
- Compared several forms of representing text
  - In European languages n-grams obtain 20% gain over words
  - Rule-based stemming good in Romance languages
  - Morfessor segments, n-gram stems better than words, not as good as Snowball stemmer
- N-gram gains
  - Greatest in morphologically richer languages
  - Lost when morphology ‘removed’ from language
- FIRE
  - N-grams and RF also effective in ILs
  - Must resolve vocabulary issue
  - Difficulty finding parallel text, but would like to investigate bilingual retrieval