Modern Information Retrieval Chapter 2 Modeling Probabilistic model

Modern Information Retrieval Chapter 2 Modeling

Probabilistic model n the appearance or absent of an index term in a document is interpreted either as evidence that the document is relevant or that it is irrelevant to a query w establish a weight for each term

n a collection of N documents w R of which are relevant n Rt of which contain term t w ft of which contain t w these values can be obtained from a training set with relevance judgments

n computing probabilities w Pr[relevant t]=Rt ft w Pr[irrelevant t]=(ft-Rt) ft w Pr[relevant t ]=(R-Rt)/(N-ft) w Pr[irrelevant t ]=(N-ft-(R-Rt))/(N-ft)

computing weight Wt for t Wt= Pr[relevant t] Pr[irrelevant t ] Pr[irrelevant t] Pr[relevant t ] = Rt/ft (N-ft-(R-Rt))/(N-ft) (ft-Rt)/ft (R-Rt)/(N-ft) = Rt/(R-Rt) (ft-Rt)/(N-ft-(R-Rt)) n

w Wt>1 indicates that the appearance of t supports the document is relevant w Wt<1 indicates that the appearance of t suggests the document is irrelevant w N=20, R=13, Rt=11, ft=12 Wt=33 w N=20, R=13, Rt=4, ft=7 Wt=0. 59 w Wt=1 indicates that t is neutral

n n w negative weight indicates that the document is predicted to be irrelevant w zero weight indicates that the document is neutral

Comparison n the Boolean model is the weakest model w no partial matching n the vector model and probabilistic model are comparative while the vector model is more popular w term frequency is not considered in the probabilistic model