Vocabulary size and term distribution: tokenization, text normalization and stemming (Lecture 2)
Overview
- Getting started: tokenization, stemming, compounds, end of sentence
- Terms, tokens, types
- Collection vocabulary: vocabulary size, term distribution, stop words
- Vector representation of text and term weighting
Tokenization
- Friends, Romans, Countrymen, lend me your ears;
  Friends | Romans | Countrymen | lend | me | your | ears
- Token: an instance of a sequence of characters that are grouped together as a useful semantic unit for processing
- Type: the class of all tokens containing the same character sequence
- Term: a (normalized) type that is included in the system dictionary
- The cat slept peacefully in the living room. It’s a very old cat.
- Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
- How should we handle special cases involving apostrophes, hyphens, etc.?
- C++, C#, URLs, emails, phone numbers, dates
- San Francisco, Los Angeles
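A minimal sketch of the token/type distinction, using the cat example above (the regex rule is an illustrative assumption, not a production tokenizer):

```python
import re
from collections import Counter

def tokenize(text):
    # Toy rule: a token is a run of letters/digits, optionally with an
    # internal apostrophe ("it's"). Real tokenizers handle far more cases.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text.lower())

tokens = tokenize("The cat slept peacefully in the living room. It's a very old cat.")
types = Counter(tokens)  # one entry per distinct character sequence

print(len(tokens), "tokens")   # 13 -- every occurrence counts
print(len(types), "types")     # 11 -- "the" and "cat" each occur twice
```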
- Issues of tokenization are language specific
  - Requires the language to be known
- Language identification based on classifiers that use short character subsequences as features is highly effective
  - Most languages have distinctive signature patterns
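A hedged sketch of that idea: character trigram frequency profiles compared by cosine similarity. The two training snippets are toy stand-ins for real per-language corpora.

```python
import math
from collections import Counter

def char_trigrams(text):
    text = " " + text.lower() + " "   # pad so word boundaries show up
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(p, q):
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Toy "signature patterns" -- real systems train on much more text.
profiles = {
    "en": char_trigrams("the cat slept peacefully in the living room"),
    "de": char_trigrams("die katze schlief friedlich im wohnzimmer"),
}

def identify(text):
    probe = char_trigrams(text)
    return max(profiles, key=lambda lang: cosine(probe, profiles[lang]))

print(identify("the boys' stories are amusing"))  # expected: en
print(identify("das ist eine sehr alte katze"))   # expected: de
```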
Tokenization is very important for information retrieval
- Splitting tokens on spaces can cause bad retrieval results
  - A search for York University returns pages containing New York University
- German: compound nouns
  - Retrieval systems for German greatly benefit from the use of a compound-splitter module (see the sketch below)
  - It checks whether a word can be subdivided into words that appear in the vocabulary
- East Asian languages (Chinese, Japanese, Korean, Thai)
  - Text is written without any spaces between words
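The sketch below shows the vocabulary-check idea behind a compound splitter (and, with a suitable vocabulary, a segmenter for unspaced text). The German example vocabulary is an illustrative assumption.

```python
def split_compound(word, vocab, min_len=3):
    """Return vocabulary words that concatenate to `word`, or None.
    Depth-first search, trying longer prefixes first."""
    if word in vocab:
        return [word]
    for i in range(len(word) - min_len, min_len - 1, -1):
        head, tail = word[:i], word[i:]
        if head in vocab:
            rest = split_compound(tail, vocab, min_len)
            if rest:
                return [head] + rest
    return None

vocab = {"donau", "dampf", "schiff", "fahrt"}   # toy German vocabulary
print(split_compound("donaudampfschifffahrt", vocab))
# ['donau', 'dampf', 'schiff', 'fahrt']
```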
Stop words
- Very common words that have no discriminatory power
Building a stop word list
- Sort terms by collection frequency and take the most frequent (see the sketch below)
  - In a collection about insurance practices, "insurance" would be a stop word
- Why do we need stop lists?
  - Smaller indices for information retrieval
  - Better approximation of importance for summarization, etc.
- Their use is problematic in phrase searches
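A minimal sketch of the frequency-based recipe (the two documents are placeholders for a real collection):

```python
from collections import Counter

def build_stop_list(documents, k):
    cf = Counter()                      # collection frequency: total occurrences
    for doc in documents:
        cf.update(doc.lower().split())  # naive tokenization for brevity
    return [term for term, _ in cf.most_common(k)]

docs = [
    "the cat slept peacefully in the living room",
    "it is a very old cat and the cat sleeps a lot",
]
print(build_stop_list(docs, k=5))
```

Note that the domain term "cat" ranks at the top of this toy list, which is exactly the "insurance" caveat above: the most frequent terms are not always safe to discard.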
- Trend in IR systems over time: from large stop lists (200-300 terms), to very small stop lists (7-12 terms), to no stop list whatsoever
- The 30 most common words account for about 30% of the tokens in written text
- Good compression techniques for indices make keeping stop words cheap
- Term weighting leads to very common words having little impact on document representation
Normalization
- Token normalization: canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
  - U.S.A. vs. USA
  - anti-discriminatory vs. antidiscriminatory
  - car vs. automobile?
Normalization is sensitive to the query

  Query term | Terms that should match
  Windows    | Windows
  windows    | Windows, windows, window
  window     | window, windows
Capitalization/case folding
- Good for:
  - Allowing instances of "Automobile" at the beginning of a sentence to match a query for "automobile"
  - Helping a search engine when most users type "ferrari" even though they are interested in a Ferrari car
- Bad for:
  - Proper names vs. common nouns: General Motors, Associated Press, Black
- Heuristic solution: lowercase only words at the beginning of a sentence (sketched below); true casing via machine learning
- In IR, lowercasing everything is most practical because of the way users issue their queries
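The sentence-initial heuristic from the slide, as a hedged sketch (the sentence splitter here is deliberately naive):

```python
import re

def heuristic_case_fold(text):
    # Lowercase only the first word of each sentence; mid-sentence capitals
    # (General Motors, Ferrari) are kept as likely proper names.
    sentences = re.split(r"(?<=[.!?])\s+", text)   # naive sentence split
    return " ".join(s[0].lower() + s[1:] if s else s for s in sentences)

print(heuristic_case_fold("Automobile sales fell. General Motors disagreed."))
# -> "automobile sales fell. General Motors disagreed."
```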
Other languages
- About 60% of web pages are in English, yet:
  - Less than one third of Internet users speak English
  - Less than 10% of the world's population primarily speaks English
- Only about one third of blog posts are in English
Stemming and lemmatization
- organize, organizes, organizing
- democracy, democratic, democratization
- am, are, is → be
- car, cars, car's, cars' → car
- Stemming
  - A crude heuristic process that chops off the ends of words
  - democratic → democrat
- Lemmatization
  - Uses a vocabulary and morphological analysis; returns the base form of a word (the lemma)
  - democratic → democracy
  - sang → sing
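A quick contrast using NLTK's implementations (this assumes nltk is installed and the WordNet data has been fetched via nltk.download('wordnet')):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("ponies"))                 # 'poni'  -- not a real word
print(stemmer.stem("replacement"))            # 'replac'
print(lemmatizer.lemmatize("ponies"))         # 'pony'  -- noun by default
print(lemmatizer.lemmatize("sang", pos="v"))  # 'sing'
print(lemmatizer.lemmatize("are", pos="v"))   # 'be'
```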
Porter stemmer
- Most common algorithm for stemming English
- 5 phases of word reduction, applied sequentially
- Sample rules:
  - SSES → SS: caresses → caress
  - IES → I: ponies → poni
  - SS → SS: caress → caress
  - S → (drop): cats → cat
  - EMENT → (drop), if what remains is long enough: replacement → replac, but cement → cement
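The sample rules above, written out as a toy sketch. The real algorithm applies many more rules over five phases and conditions suffix removal on a "measure" of the remaining stem; the length check below is a simplified stand-in for that condition.

```python
def toy_porter_step(word):
    # Within a rule group, the longest matching suffix wins.
    if word.endswith("sses"):
        return word[:-2]    # SSES -> SS    caresses -> caress
    if word.endswith("ies"):
        return word[:-2]    # IES  -> I     ponies -> poni
    if word.endswith("ss"):
        return word         # SS   -> SS    caress -> caress
    if word.endswith("s"):
        return word[:-1]    # S    -> drop  cats -> cat
    if word.endswith("ement") and len(word) > len("ement") + 1:
        return word[:-5]    # EMENT -> drop replacement -> replac
    return word             # cement stays: what remains would be too short

for w in ["caresses", "ponies", "caress", "cats", "replacement", "cement"]:
    print(w, "->", toy_porter_step(w))
```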
Vocabulary size
- Dictionaries: 600,000+ words
- But they do not include names of people, locations, products, etc.
Heaps' law: estimating the number of terms

  M = k * T^b

where M is the vocabulary size (number of terms), T is the number of tokens, and typically 30 < k < 100 with b ≈ 0.5. Taking logarithms gives log M = log k + b log T: a linear relation between vocabulary size and number of tokens in log-log space.
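Because the law is linear in log-log space, k and b can be estimated with a simple linear regression on log-transformed observations. The (T, M) pairs below are made-up numbers for illustration.

```python
import math

# (tokens seen, distinct terms seen) -- toy observations
obs = [(10_000, 3_000), (100_000, 9_500), (1_000_000, 30_000), (10_000_000, 95_000)]

# M = k * T^b  =>  log M = log k + b * log T: fit a straight line.
xs = [math.log(t) for t, _ in obs]
ys = [math.log(m) for _, m in obs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
k = math.exp(my - b * mx)

print(f"b = {b:.2f}, k = {k:.1f}")              # here: b near 0.5, k near 30
print(f"predicted M at T = 10^8: {k * 1e8 ** b:,.0f}")
```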
Zipf's law: modeling the distribution of terms
- The collection frequency cf_i of the i-th most common term is proportional to 1/i: cf_i ∝ 1/i
- If the most frequent term occurs cf_1 times, then the second most frequent term has cf_1/2 occurrences, the third most frequent cf_1/3, etc.
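A short worked example of both claims. The numbers (cf_1 and the truncation rank N) are illustrative assumptions; note how a pure Zipf model also roughly reproduces the earlier "30 most common words account for about 30% of tokens" figure.

```python
# Zipf's law: cf_i ~ cf_1 / i
cf1 = 60_000
for i in range(1, 6):
    print(f"rank {i}: expected frequency ~ {cf1 / i:,.0f}")

# Fraction of all tokens covered by the top 30 terms, for a Zipf
# distribution truncated at rank N (toy vocabulary size).
N = 100_000
def harmonic(n):
    return sum(1 / j for j in range(1, n + 1))
print(f"top 30 terms cover ~ {harmonic(30) / harmonic(N):.0%} of tokens")
```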
Problems with normalization
- A change in the stop word list can dramatically alter term weightings
- A document may contain an outlier term