Modern Information Retrieval
Chapter 7: Text Operations
Ricardo Baeza-Yates, Berthier Ribeiro-Neto

Document Preprocessing
• Lexical analysis of the text
• Elimination of stopwords
• Stemming
• Selection of index terms
• Construction of term categorization structures

Lexical Analysis of the Text
• Word separators:
  ◦ space
  ◦ digits
  ◦ hyphens
  ◦ punctuation marks
  ◦ the case of the letters
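
A minimal Python sketch of a lexical analyzer applying these separator rules; the exact separator set and the decision to lowercase everything are illustrative assumptions, not the chapter's specification:

```python
import re

def tokenize(text):
    # Treat whitespace, digits, hyphens, and punctuation marks as word separators,
    # and normalize the case of the letters by lowercasing each surviving token.
    tokens = re.split(r"[\s\d\-.,;:!?()\"']+", text.lower())
    return [t for t in tokens if t]

print(tokenize("State-of-the-art IR systems, built in 1999."))
# ['state', 'of', 'the', 'art', 'ir', 'systems', 'built', 'in']
```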

Elimination of Stopwords
• A list of stopwords:
  ◦ words that are too frequent among the documents
  ◦ articles, prepositions, conjunctions, etc.
• Can reduce the size of the indexing structure considerably
• Problem:
  ◦ Search for “to be or not to be”?
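
A small Python sketch of stopword elimination; the stopword set here is an illustrative fragment rather than a real system's full list:

```python
STOPWORDS = {"a", "an", "the", "of", "to", "be", "or", "not", "and", "in"}

def remove_stopwords(tokens):
    # Drop every token that appears in the stopword list.
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))
# [] -- the query vanishes entirely, illustrating the problem noted above
```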

Stemming
• Examples:
  ◦ connect, connected, connecting, connections
  ◦ effectiveness --> effective --> effect
  ◦ picnicking --> picnic
  ◦ king --> k
• Removing strategies:
  ◦ affix removal: intuitive, simple
  ◦ table lookup
  ◦ successor variety
  ◦ n-gram
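
A brief sketch of affix-removal stemming using NLTK's Porter stemmer; the choice of stemmer is illustrative, since the slide does not prescribe a specific algorithm:

```python
from nltk.stem import PorterStemmer  # requires: pip install nltk

stemmer = PorterStemmer()
for word in ["connect", "connected", "connecting", "connections",
             "effectiveness", "picnicking", "king"]:
    # Print each surface form next to the stem the algorithm produces.
    print(word, "->", stemmer.stem(word))
```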

Index Term Selection
• Motivation:
  ◦ A sentence is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.
  ◦ Most of the semantics is carried by the nouns.
• Identification of noun groups:
  ◦ A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold
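
A rough Python sketch of noun-group identification under the distance rule above; the part-of-speech tags are assumed to be supplied by an earlier step, and the threshold value is illustrative:

```python
def noun_groups(tagged_tokens, threshold=3):
    # tagged_tokens: list of (word, pos) pairs, with pos == "NOUN" marking nouns.
    # Nouns whose positions differ by at most `threshold` fall into the same group.
    groups, current, last_pos = [], [], None
    for i, (word, pos) in enumerate(tagged_tokens):
        if pos != "NOUN":
            continue
        if last_pos is not None and i - last_pos > threshold:
            groups.append(current)
            current = []
        current.append(word)
        last_pos = i
    if current:
        groups.append(current)
    return groups

tagged = [("the", "DET"), ("computer", "NOUN"), ("science", "NOUN"),
          ("department", "NOUN"), ("is", "VERB"), ("large", "ADJ")]
print(noun_groups(tagged))  # [['computer', 'science', 'department']]
```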

Thesauri
• Peter Roget, 1988
• Example:
  cowardly adj. Ignobly lacking in courage: cowardly turncoats.
  Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang).
• A controlled vocabulary for indexing and searching

The Purpose of a Thesaurus
• To provide a standard vocabulary for indexing and searching
• To assist users with locating terms for proper query formulation
• To provide classified hierarchies that allow the broadening and narrowing of the current query request

Thesaurus Term Relationships
• BT: broader term
• NT: narrower term
• RT: non-hierarchical, but related term
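
A toy sketch of how BT/NT/RT links might be stored and used to broaden or narrow a query term; the vocabulary and the relationships are invented purely for illustration:

```python
# Hypothetical thesaurus entries: term -> {relationship: [related terms]}
THESAURUS = {
    "dog": {"BT": ["mammal"], "NT": ["poodle", "terrier"], "RT": ["kennel"]},
    "mammal": {"BT": ["animal"], "NT": ["dog", "cat"], "RT": []},
}

def broaden(term):
    # Follow BT (broader term) links to generalize a query term.
    return THESAURUS.get(term, {}).get("BT", [])

def narrow(term):
    # Follow NT (narrower term) links to specialize a query term.
    return THESAURUS.get(term, {}).get("NT", [])

print(broaden("dog"))  # ['mammal']
print(narrow("dog"))   # ['poodle', 'terrier']
```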

Term Selection
(Automatic Text Processing, G. Salton, Chapter 9, Addison-Wesley, 1989)

Automatic Indexing
• Indexing:
  ◦ assign identifiers (index terms) to text documents
• Identifiers:
  ◦ single-term vs. term phrase
  ◦ controlled vs. uncontrolled vocabularies (e.g., instruction manuals, terminological schedules, …)
  ◦ objective vs. nonobjective text identifiers (cataloging rules define, e.g., author names, publisher names, dates of publication, …)

Two Issues
• Issue 1: indexing exhaustivity
  ◦ exhaustive: assign a large number of terms
  ◦ nonexhaustive: assign fewer terms
• Issue 2: term specificity
  ◦ broad terms (generic) cannot distinguish relevant from nonrelevant documents
  ◦ narrow terms (specific) retrieve relatively fewer documents, but most of them are relevant

Parameters of Retrieval Effectiveness
• Recall = (number of relevant documents retrieved) / (total number of relevant documents)
• Precision = (number of relevant documents retrieved) / (total number of documents retrieved)
• Goal: high recall and high precision

[Figure: the document collection divided into Relevant Items and Nonrelevant Items, with the Retrieved Part overlapping both; the four resulting regions are labeled a, b, c, and d.]
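
A minimal Python sketch of recall and precision computed from the retrieved set and the relevant set; representing them as sets of document ids is an illustrative choice:

```python
def recall_precision(retrieved, relevant):
    # retrieved, relevant: sets of document ids.
    hits = retrieved & relevant                 # relevant documents that were retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

print(recall_precision({"d1", "d2", "d3"}, {"d2", "d3", "d4", "d5"}))
# (0.5, 0.6666666666666666)
```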

A Joint Measure
• F-score:
  F = ((β^2 + 1) · P · R) / (β^2 · P + R)
• β is a parameter that encodes the relative importance of recall and precision:
  ◦ β = 1: equal weight
  ◦ β < 1: precision is more important
  ◦ β > 1: recall is more important
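
A small Python sketch of the F-score defined above:

```python
def f_score(precision, recall, beta=1.0):
    # F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R)
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_score(0.5, 0.6))             # beta = 1: recall and precision weighted equally
print(f_score(0.5, 0.6, beta=2.0))   # beta > 1: recall counts more heavily
```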

Choices of Recall and Precision
• Both recall and precision vary from 0 to 1.
• Particular choices of indexing and search policies have produced performance ranging from 0.8 precision with 0.2 recall to 0.1 precision with 0.8 recall.
• In many circumstances, recall and precision values between 0.5 and 0.6 are satisfactory for the average user.

Term-Frequency Considerations
• Function words
  ◦ for example, "and", "or", "of", "but", …
  ◦ the frequencies of these words are high in all texts
• Content words
  ◦ words that actually relate to document content
  ◦ have varying frequencies across the different texts of a collection
  ◦ their frequencies indicate term importance for content

A Frequency-Based Indexing Method
• Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words.
• Compute the term frequency tfij for all remaining terms Tj in each document Di, specifying the number of occurrences of Tj in Di.
• Choose a threshold frequency T, and assign to each document Di all terms Tj for which tfij > T.
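
A compact Python sketch of the three steps above; the stop list is a small illustrative fragment and the threshold value is arbitrary:

```python
from collections import Counter

STOP_LIST = {"and", "or", "of", "but", "the", "a", "to", "in"}  # illustrative fragment

def frequency_index(documents, threshold=1):
    # documents: {doc_id: list of tokens}; returns {doc_id: set of assigned index terms}.
    index = {}
    for doc_id, tokens in documents.items():
        tf = Counter(t for t in tokens if t not in STOP_LIST)   # tf_ij for remaining terms
        # Assign every term whose frequency exceeds the threshold T.
        index[doc_id] = {term for term, freq in tf.items() if freq > threshold}
    return index

docs = {"d1": ["text", "operations", "and", "text", "indexing"],
        "d2": ["indexing", "of", "documents"]}
print(frequency_index(docs))  # {'d1': {'text'}, 'd2': set()}
```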

Inverse Document Frequency
• Inverse document frequency (IDF) for term Tj:
  idfj = log(N / dfj)
  where dfj (the document frequency of term Tj) is the number of documents in which Tj occurs and N is the total number of documents.
• Terms with high idfj help fulfil both the recall and the precision goals:
  ◦ they occur frequently in individual documents but rarely in the remainder of the collection

TF×IDF
• Weight wij of a term Tj in a document Di:
  wij = tfij × idfj = tfij × log(N / dfj)
• Procedure:
  ◦ eliminate common function words
  ◦ compute the value of wij for each term Tj in each document Di
  ◦ assign to the documents of a collection all terms with sufficiently high (tf × idf) factors
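
A short Python sketch of the tf × idf weighting; it uses the natural logarithm, and the toy collection is invented for illustration:

```python
import math
from collections import Counter

def tfidf_weights(documents):
    # documents: {doc_id: list of tokens}; returns {doc_id: {term: w_ij}}.
    N = len(documents)
    df = Counter()
    for tokens in documents.values():
        df.update(set(tokens))                          # document frequency df_j
    weights = {}
    for doc_id, tokens in documents.items():
        tf = Counter(tokens)                            # term frequency tf_ij
        weights[doc_id] = {t: tf[t] * math.log(N / df[t]) for t in tf}
    return weights

docs = {"d1": ["text", "operations", "text"],
        "d2": ["text", "indexing"],
        "d3": ["query", "indexing"]}
print(tfidf_weights(docs)["d1"])  # 'text' is down-weighted because it occurs in two documents
```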

Term-Discrimination Value
• Useful index terms:
  ◦ distinguish the documents of a collection from each other
• Document space:
  ◦ two documents are assigned very similar term sets when the corresponding points in the document configuration appear close together
  ◦ when a high-frequency term without discrimination is assigned, it increases the document space density

A Virtual Document Space
[Figure: the document space in its original state, after assignment of a good discriminator, and after assignment of a poor discriminator.]

Good Term Assignment
• When a term is assigned to the documents of a collection, the few objects to which the term is assigned will be distinguished from the rest of the collection.
• This should increase the average distance between the objects in the collection and hence produce a document space less dense than before.

Poor Term Assignment
• A high-frequency term is assigned that does not discriminate between the objects of a collection; its assignment renders the documents more similar.
• This is reflected in an increase in document space density.

Term Discrimination Value
• Definition:
  dvj = Q - Qj
  where Q and Qj are the space densities before and after the assignment of term Tj
• dvj > 0: Tj is a good term; dvj < 0: Tj is a poor term

Variations of Term-Discrimination Value with Document Frequency

  Document frequency (out of N)   Discrimination value
  low frequency                   dvj = 0
  medium frequency                dvj > 0
  high frequency                  dvj < 0

tfij × dvj
• wij = tfij × dvj
• Compared with idfj:
  ◦ idfj decreases steadily with increasing document frequency
  ◦ dvj increases from zero to a positive value as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger

Document Centroid
• Issue: efficiency; computing the space density from N(N-1) pairwise similarities is expensive
• Document centroid C = (c1, c2, c3, ..., ct), the average of the document vectors:
  cj = (1/N) Σi wij
  where wij is the weight of the j-th term in document i
• Space density: the average similarity between the centroid and each document,
  Q = (1/N) Σi sim(C, Di)
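
A rough Python sketch of the centroid-based space density and of dvj = Q - Qj; cosine similarity and the toy binary weights are illustrative choices, not prescribed by the slides:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def space_density(docs):
    # docs: list of term-weight vectors; density = average similarity to the centroid.
    n = len(docs)
    centroid = [sum(d[j] for d in docs) / n for j in range(len(docs[0]))]
    return sum(cosine(centroid, d) for d in docs) / n

# Toy collection over 3 terms; the third column is the candidate term Tj,
# assigned here to every document (a high-frequency, non-discriminating term).
docs_without_term = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
docs_with_term    = [[1, 0, 1], [0, 1, 1], [1, 1, 1]]

Q  = space_density(docs_without_term)   # density before assigning Tj
Qj = space_density(docs_with_term)      # density after assigning Tj
print("dv_j =", Q - Qj)                 # negative: Tj is a poor discriminator
```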

Probabilistic Term Weighting
• Goal:
  ◦ make explicit distinctions between occurrences of terms in the relevant and nonrelevant documents of a collection
• Definition:
  ◦ given a user query q, and the ideal answer set of relevant documents
  ◦ from decision theory, the best ranking function for a document D is
    g(D) = log [ Pr(D|rel) · Pr(rel) / ( Pr(D|nonrel) · Pr(nonrel) ) ]

Probabilistic Term Weighting (cont.)
• Pr(rel), Pr(nonrel): a document's a priori probabilities of relevance and nonrelevance
• Pr(D|rel), Pr(D|nonrel): occurrence probabilities of document D in the relevant and nonrelevant document sets

Assumptions
• Terms occur independently in documents:
  Pr(D|rel) = Πi Pr(di|rel)
  Pr(D|nonrel) = Πi Pr(di|nonrel)

Derivation Process
[Formula slide: substituting the independence assumption into g(D), taking logarithms, and dropping constants that do not depend on the document reduces the ranking function to a sum of term weights over the terms present in D.]

For a Specific Document D
• Given a document D = (d1, d2, …, dt)
• Assume di is either 0 (the term is absent) or 1 (present):
  Pr(di = 1 | rel) = pi      Pr(di = 0 | rel) = 1 - pi
  Pr(di = 1 | nonrel) = qi   Pr(di = 0 | nonrel) = 1 - qi

Term Relevance Weight
• Under the binary independence assumptions, the weight contributed by term Tj is
  trj = log [ pj (1 - qj) / ( qj (1 - pj) ) ]

Issue
• How to compute pj and qj?
  pj = rj / R
  qj = (dfj - rj) / (N - R)
  ◦ rj: the number of relevant documents that contain term Tj
  ◦ R: the total number of relevant documents
  ◦ N: the total number of documents
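
A minimal Python sketch of the term-relevance weight computed from these estimates; the trj formula is the log-odds weight reconstructed above, and no smoothing is applied, so boundary values of rj would need special handling:

```python
import math

def term_relevance_weight(r_j, R, df_j, N):
    # p_j = r_j / R : probability of the term occurring in a relevant document
    # q_j = (df_j - r_j) / (N - R) : probability of it occurring in a nonrelevant document
    p = r_j / R
    q = (df_j - r_j) / (N - R)
    # tr_j = log( p_j (1 - q_j) / ( q_j (1 - p_j) ) )
    return math.log(p * (1 - q) / (q * (1 - p)))

# 8 of the 10 known relevant documents contain the term; it occurs in 20 of 1000 documents.
print(term_relevance_weight(r_j=8, R=10, df_j=20, N=1000))
```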

Estimation of Term Relevance
• The occurrence probability of a term in the nonrelevant documents, qj, is approximated by the occurrence probability of the term in the entire document collection:
  qj = dfj / N
• The occurrence probabilities of the terms in the small number of relevant documents are set equal, using a constant value pj = 0.5 for all j.

Comparison
• With pj = 0.5 and qj = dfj / N:
  trj = log [ pj (1 - qj) / ( qj (1 - pj) ) ] = log [ (N - dfj) / dfj ]
• When N is sufficiently large, N - dfj ≈ N, so
  trj ≈ log (N / dfj) = idfj

Estimation of Term Relevance (cont.)
• Estimate the number of relevant documents rj in the collection that contain term Tj as a function of the known document frequency dfj of the term Tj, then use
  pj = rj / R
  qj = (dfj - rj) / (N - R)
  ◦ R: an estimate of the total number of relevant documents in the collection

Summary
• Inverse document frequency idfj
  ◦ wij = tfij × idfj (TF×IDF)
• Term discrimination value dvj
  ◦ wij = tfij × dvj
• Probabilistic term weighting trj
  ◦ wij = tfij × trj
• All three are based on global properties of terms in a document collection