Dragon Star Program Course: Information Retrieval
Modern Retrieval Frameworks
ChengXiang Zhai (翟成祥)
Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai
2008 © ChengXiang Zhai, Dragon Star Lecture at Beijing University, June 21-30, 2008

Outline
• Risk Minimization Framework [Lafferty & Zhai 01, Zhai & Lafferty 06]
• Axiomatic Retrieval Framework [Fang et al. 04, Fang & Zhai 05, Fang & Zhai 06]

Risk Minimization Framework

Risk Minimization: Motivation
• Long-standing IR challenges
  – Improve IR theory
    • Develop theoretically sound and empirically effective models
    • Go beyond the limited traditional notion of relevance (independent, topical relevance)
  – Improve IR practice
    • Optimize retrieval parameters automatically
• Statistical language models (SLMs) are very promising tools …
  – How can we systematically exploit SLMs in IR?
  – Can SLMs offer anything hard/impossible to achieve in traditional IR?

Long-Standing IR Challenges
• Limitations of traditional IR models
  – Strong assumptions on “relevance”
    • Independent relevance
    • Topical relevance
  – Can we go beyond this traditional notion of relevance?
• Difficulty in IR practice
  – Ad hoc parameter tuning
  – Can’t go beyond “retrieval” to support information access in general

More Than “Relevance”
[Figure labels: Relevance Ranking; Desired Ranking; Redundancy; Readability]

Retrieval Parameters
• Retrieval parameters are needed to
  – model different user preferences
  – customize a retrieval model according to different queries and documents
• So far, parameters have been set through empirical experimentation
• Can we set parameters automatically?

Systematic Applications of Language Models to IR
• Many different variants of language models have been developed, but are there many more models to be studied?
• Can we establish a road map for exploring language models in IR?

Two Main Ideas of the Risk Minimization Framework
• Retrieval as a decision process
• Systematic language modeling

Idea 1: Retrieval as Decision-Making (a more general notion of relevance)
Given a query,
– Which documents should be selected? (D)
– How should these docs be presented to the user? (π)
Choose: (D, π)
[Diagram: Query → ? → one of: unordered subset; ranked list (1 2 3 4 …); clustering]

Idea 2: Systematic Language Modeling
[Diagram labels: QUERY MODELING (Query, Query Language Model); DOC MODELING (Documents, Document Language Models); USER MODELING (User, Loss Function); Retrieval Decision: ?]

Generative Model of Document & Query [Lafferty & Zhai 01b]
[Diagram labels: User U; Query q; Source S; Document d; R; annotations: partially observed, observed, inferred]

Applying Bayesian Decision Theory [Lafferty & Zhai 01b, Zhai 02, Zhai & Lafferty 06]
[Diagram labels: query q, user U, doc set C, source S; annotations: hidden, observed; choices (D1, π1), (D2, π2), …, (Dn, πn), each with a loss L; RISK MINIMIZATION; Bayes risk for choice (D, π)]
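The decision rule on this slide can be illustrated with a minimal sketch, assuming a finite set of candidate choices and a discrete posterior over the hidden variables. The function names, the toy posterior, and the 0/1 loss are all hypothetical and are not taken from the lecture.

```python
def bayes_risk(action, thetas, weights, loss):
    """Expected loss of one action under a discrete posterior over theta."""
    return sum(w * loss(action, t) for t, w in zip(thetas, weights))


def min_risk_action(actions, thetas, weights, loss):
    """Return the action with the smallest Bayes risk."""
    return min(actions, key=lambda a: bayes_risk(a, thetas, weights, loss))


# Toy setting (assumed): theta names the document the user really wants,
# and an action is the single document we decide to show.
thetas = ["d1", "d2"]
weights = [0.3, 0.7]  # hypothetical posterior over theta


def zero_one(action, theta):
    """Loss is 0 when we show the wanted document, 1 otherwise."""
    return 0.0 if action == theta else 1.0


best = min_risk_action(["d1", "d2"], thetas, weights, zero_one)
print(best)
```

In the lecture's setting the action space is far richer (subsets plus presentation strategies) and the posterior is over language-model parameters, so the sum becomes an integral; the sketch only shows the shape of the argmin.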

Benefits of the Framework
• Systematic exploration of retrieval models (covering almost all the existing retrieval models as special cases)
• Derive general retrieval principles (risk ranking principle)
• Automatic parameter setting
• Go beyond independent relevance (subtopic retrieval)

Special Cases of Risk Minimization
• Set-based models (choose D): Boolean model
• Ranking models (choose π)
  – Independent loss
    • Loss types: relevance-based loss, distance-based loss
    • Models: probabilistic relevance model, Generative Relevance Theory, vector-space model, two-stage LM, KL-divergence model
  – Dependent loss
    • Loss types: MMR loss, MDR loss
    • Models: subtopic retrieval model

Case 1: Two-stage Language Models
[Diagram labels: U, q, S, d; Loss function; Risk ranking formula]
• Stage 1: compute [formula] (Dirichlet prior smoothing)
• Stage 2: compute [formula] (mixture model)
• Together: two-stage smoothing
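The slide's formulas did not survive extraction. As an illustration, here is a sketch of two-stage smoothing in the form it is usually written: Dirichlet prior smoothing of the document model, followed by interpolation with a query background model. The parameter names `mu` and `lam`, the function name, and the toy numbers are assumptions made for this example.

```python
from collections import Counter


def two_stage_prob(word, doc_counts, doc_len, p_coll, p_user, mu=1000.0, lam=0.5):
    """p(w|d) = (1 - lam) * (c(w,d) + mu * p(w|C)) / (|d| + mu) + lam * p(w|U).

    Stage 1 is the Dirichlet-smoothed document model; stage 2 mixes it
    with a background model p(w|U) for query noise.
    """
    dirichlet = (doc_counts.get(word, 0) + mu * p_coll) / (doc_len + mu)
    return (1.0 - lam) * dirichlet + lam * p_user


doc = Counter("risk minimization framework for retrieval risk".split())
p = two_stage_prob("risk", doc, sum(doc.values()),
                   p_coll=0.01, p_user=0.01, mu=10.0, lam=0.5)
print(round(p, 6))
```

Setting `lam=0` recovers plain Dirichlet prior smoothing, which makes the roles of the two stages easy to separate in experiments.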

Case 2: KL-divergence Retrieval Models
[Diagram labels: U, q, S, d; Loss function; Risk ranking formula]
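As a hedged illustration of KL-divergence ranking: negative KL divergence between a query model and a document model is rank-equivalent to the cross-entropy term computed below, because the query-entropy term is the same for every document. The Dirichlet-smoothed document model and all names are assumptions for this sketch.

```python
import math


def dirichlet_model(doc_counts, doc_len, coll_model, mu):
    """Return a function w -> p(w | theta_D) with Dirichlet prior smoothing."""
    return lambda w: (doc_counts.get(w, 0) + mu * coll_model[w]) / (doc_len + mu)


def kl_score(query_model, doc_model):
    """sum_w p(w|theta_Q) * log p(w|theta_D); higher means a closer match."""
    return sum(pq * math.log(doc_model(w)) for w, pq in query_model.items() if pq > 0)


coll = {"a": 0.5, "b": 0.5}          # toy collection model (assumed)
query_model = {"a": 1.0}             # toy query model (assumed)
d1 = dirichlet_model({"a": 3, "b": 1}, 4, coll, mu=2.0)
d2 = dirichlet_model({"a": 1, "b": 3}, 4, coll, mu=2.0)
s1, s2 = kl_score(query_model, d1), kl_score(query_model, d2)
print(s1 > s2)
```

With a query model that puts all its mass on observed query words in proportion to their counts, this reduces to query-likelihood ranking, which is one reason the KL form is convenient for adding feedback.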

Case 3: Aspect Generative Model of Document & Query
[Diagram labels: User U; Query q; Source S; Document d; θ = (θ1, …, θk)]
• PLSI: [formula]
• LDA: [formula]

Optimal Ranking for Independent Loss
• Decision space = {rankings}
• Sequential browsing
• Independent loss
• Independent risk = independent scoring
• “Risk ranking principle” [Zhai 02]

Automatic Parameter Tuning
• Retrieval parameters are needed to
  – model different user preferences
  – customize a retrieval model to specific queries and documents
• Retrieval parameters in traditional models
  – EXTERNAL to the model, hard to interpret
  – Parameters are introduced heuristically to implement “intuition”
  – No principles to quantify them; they must be set empirically through many experiments
  – Still no guarantee for new queries/documents
• Language models make it possible to estimate parameters…

The Way to Automatic Tuning...
• Parameters must be PART of the model!
  – Query modeling (explain differences between queries)
  – Document modeling (explain differences between documents)
• De-couple the influence of a query on parameter setting from that of documents
  – To achieve stable setting of parameters
  – To pre-compute query-independent parameters

Parameter Setting in Risk Minimization
[Diagram labels: Query → Query Language Model (query model parameters: Estimate); Documents → Document Language Models (doc model parameters: Estimate); User → Loss Function (user model parameters: Set)]

Generative Relevance Hypothesis [Lavrenko 04]
• Generative Relevance Hypothesis:
  – For a given information need, queries expressing that need and documents relevant to that need can be viewed as independent random samples from the same underlying generative model
• A special case of risk minimization when document models and query models are in the same space
• Implications for retrieval models: “the same underlying generative model” makes it possible to
  – Match queries and documents even if they are in different languages or media
  – Estimate/improve a relevant document model based on example queries or vice versa

Risk minimization can easily go beyond independent relevance…

Aspect Retrieval
Query: What are the applications of robotics in the world today? Find as many DIFFERENT applications as possible.
Example aspects:
  A1: spot-welding robotics
  A2: controlling inventory
  A3: pipe-laying robots
  A4: talking robot
  A5: robots for loading & unloading memory tapes
  A6: robot [telephone] operators
  A7: robot cranes
  ……
[Table: binary aspect judgments, one row per document d1, d2, d3, …, dk and one column per aspect A1, A2, A3, …, Ak]
Must go beyond independent relevance!

Evaluation Measures
• Aspect Coverage (AC): measures per-doc coverage
  – #distinct-aspects / #docs
  – Equivalent to the “set cover” problem, NP-hard
• Aspect Uniqueness (AU): measures redundancy
  – #distinct-aspects / #aspects
  – Equivalent to the “volume cover” problem, NP-hard
• Examples (accumulated counts down the ranking)
        #doc   #asp   #uniq-asp   AC             AU
  d1    1      2      2           2/1 = 2.0      2/2 = 1.0
  d2    2      5      4           4/2 = 2.0      4/5 = 0.8
  d3    3      8      5           5/3 = 1.67     5/8 = 0.625
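The two measures can be computed at every rank with a short sketch. The aspect sets below are hypothetical: they were chosen only so that the accumulated counts match the slide's example (2, 5, 8 aspects and 2, 4, 5 distinct aspects); the slide's actual judgment vectors are not recoverable.

```python
def aspect_measures(ranked_aspect_sets):
    """Return (AC, AU) at each rank for a ranked list of per-document aspect sets.

    AC = #distinct aspects seen so far / #documents seen so far
    AU = #distinct aspects seen so far / #aspect occurrences seen so far
    """
    rows = []
    seen = set()
    total = 0
    for rank, aspects in enumerate(ranked_aspect_sets, start=1):
        seen |= aspects
        total += len(aspects)
        au = len(seen) / total if total else 0.0
        rows.append((len(seen) / rank, au))
    return rows


# Hypothetical aspect sets for d1, d2, d3 (aspect ids are arbitrary).
ranking = [{4, 7}, {2, 4, 5}, {1, 5, 7}]
rows = aspect_measures(ranking)
for ac, au in rows:
    print(round(ac, 2), round(au, 3))
```

Reporting both numbers per rank, as the slide does, shows the trade-off directly: a ranking can keep AC high while AU falls if later documents mostly repeat aspects that are already covered.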

Dependent Relevance Ranking
• In general, the computation of the optimal ranking is NP-hard
• A general greedy algorithm
  – Pick the first document according to INDEPENDENT relevance
  – Given that we have picked k documents, evaluate the CONDITIONAL relevance of each candidate document
  – Choose the document that has the highest conditional relevance value
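The greedy procedure can be sketched as follows. The scoring callbacks `rel` and `cond_rel` are placeholders the caller must supply, and the toy conditional relevance (count of not-yet-covered aspects) is an assumption for illustration, not the loss function used in the lecture.

```python
def greedy_rank(candidates, rel, cond_rel, n):
    """Greedy dependent ranking.

    rel(d) scores a document on its own; cond_rel(d, ranked) scores it
    given the documents already picked.
    """
    remaining = list(candidates)
    if not remaining or n <= 0:
        return []
    first = max(remaining, key=rel)          # step 1: independent relevance
    ranked = [first]
    remaining.remove(first)
    while remaining and len(ranked) < n:     # steps 2-3: conditional relevance
        best = max(remaining, key=lambda d: cond_rel(d, ranked))
        ranked.append(best)
        remaining.remove(best)
    return ranked


# Toy data (assumed): document -> (aspects covered, independent relevance).
docs = {"d1": ({1, 2}, 0.9), "d2": ({1, 2}, 0.8), "d3": ({3}, 0.5)}


def rel(d):
    return docs[d][1]


def new_aspects(d, ranked):
    covered = set().union(*(docs[r][0] for r in ranked))
    return len(docs[d][0] - covered)


order = greedy_rank(docs, rel, new_aspects, n=3)
print(order)
```

Each step costs one pass over the remaining candidates, so ranking n documents out of m takes on the order of n·m conditional-relevance evaluations, which is what makes the greedy approximation practical when the exact problem is NP-hard.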

Loss Function L(θk+1 | θ1 … θk)
Known: d1 … dk; to be chosen: dk+1
• Maximal Marginal Relevance (MMR)
  – Relevance: Rel(θk+1)
  – Novelty/Redundancy: Nov(θk+1 | θ1 … θk)
  – The best dk+1 is novel & relevant
• Maximal Diverse Relevance (MDR)
  – Aspect coverage distribution: p(a|θi)
  – The best dk+1 is complementary in coverage

Maximal Marginal Relevance (MMR) Models
• Maximizing aspect coverage indirectly through redundancy elimination
• Conditional relevance = novel + relevant
• Elements
  – Redundancy/novelty measure
  – Combination of novelty and relevance

A Mixture Model for Redundancy
[Diagram labels: Ref. document → P(w|Old), weight λ = ?; Collection → P(w|Background), weight 1 − λ; Maximum Likelihood; Expectation-Maximization]
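One way to read this diagram is that a new document's words are generated from a two-component mixture, λ·P(w|Old) + (1 − λ)·P(w|Background), and that λ is fitted by Expectation-Maximization. The sketch below implements EM for that single mixing weight under this reading; the function name, the toy distributions, and the interpretation of a large λ as "more redundant" are assumptions.

```python
def estimate_lambda(doc_counts, p_old, p_bg, iters=50, lam=0.5):
    """Fit the weight of the 'old' model in lam*p_old + (1-lam)*p_bg by EM."""
    total = sum(doc_counts.values())
    if total == 0:
        return lam
    for _ in range(iters):
        expected_old = 0.0
        for w, c in doc_counts.items():
            old = lam * p_old.get(w, 0.0)
            bg = (1.0 - lam) * p_bg.get(w, 0.0)
            if old + bg > 0.0:
                # E-step: probability that an occurrence of w came from 'old'
                expected_old += c * old / (old + bg)
        # M-step: new weight is the expected fraction of 'old' occurrences
        lam = expected_old / total
    return lam


# Toy models (assumed): the reference document is about robot welding.
p_old = {"robot": 0.5, "welding": 0.5}
p_bg = {"robot": 0.1, "welding": 0.1, "the": 0.8}

lam_redundant = estimate_lambda({"robot": 5, "welding": 5}, p_old, p_bg)
lam_novel = estimate_lambda({"the": 10}, p_old, p_bg)
print(lam_redundant > lam_novel)
```

Under this reading, a novelty score can be taken as 1 − λ; whether the lecture uses exactly that mapping is not recoverable from the slide.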

Cost-based Combination of Relevance and Novelty
[Formula with two labeled parts: relevance score; novelty score]

Maximal Diverse Relevance (MDR) Models
• Maximizing aspect coverage directly through aspect modeling
• Conditional relevance = complementary coverage
• Elements
  – Aspect loss function
  – Generative aspect model

Aspect Generative Model of Document & Query
[Diagram labels: User U; Query q; Source S; Document d; θ = (θ1, …, θk)]
• PLSI: [formula]
• LDA: [formula]

Aspect Loss Function
[Diagram labels: U, q, S, d; formula not preserved]

Aspect Loss Function: Illustration
[Figure labels: desired coverage p(a|θQ); “already covered” p(a|θ1) … p(a|θk−1); new candidate p(a|θk); combined coverage; candidate types: perfect, redundant, non-relevant]

Risk Minimization: Summary
• Risk minimization is a general probabilistic retrieval framework
  – Retrieval as a decision problem (= risk minimization)
  – Separate/flexible language models for queries and docs
• Advantages
  – A unified framework for existing models
  – Automatic parameter tuning due to LMs
  – Allows for modeling complex retrieval tasks
• Lots of potential for exploring LMs…
• For more information, see [Zhai 02]

Future Research Directions
• Modeling latent structures of documents
  – Introduce source structures (which naturally suggest structure-based smoothing methods)
• Modeling multiple queries and clickthroughs of the same user
  – Let the observation include multiple queries and clickthroughs
• Collaborative search
  – Introduce latent interest variables to tie similar users together
• Modeling interactive search

What You Should Know
• The key idea of risk minimization is framing the retrieval problem as a decision problem over the entire collection
• The framework covers many specific models as special cases
• It facilitates systematic exploration of language models
• It allows us to model problems that require dependent relevance modeling

Axiomatic Retrieval Framework
Most of the following slides are from Hui Fang’s presentation

Traditional Way of Modeling the Relevance
[Diagram labels: Query → QRep; Document → DRep; Relevance?; test collection]
• Vector Space Models: Rel ≈ Sim(DRep, QRep) [Salton et al. 75, Salton et al. 83, Salton et al. 89, Singhal 96]
• Probabilistic Models: Rel ≈ P(R=1|DRep, QRep) [Fuhr et al. 92, Lafferty et al. 03, Ponte et al. 98, Robertson et al. 76, Turtle et al. 91, Rijsbergen et al. 77]
Problems:
• No way to predict the performance and identify the weaknesses
• Sophisticated parameter tuning

No Way to Predict the Performance

Sophisticated Parameter Tuning
“k1, b and k3 are parameters which depend on the nature of the queries and possibly on the database; k1 and b default to 1.2 and 0.75 respectively, but smaller values of b are sometimes advantageous; in long queries k3 is often set to 7 or 1000.” [Robertson et al. 1999]
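To show where k1, b and k3 enter, here is a sketch of the Okapi BM25 scoring function in one commonly written form. The function signature, the dictionary-based statistics, and the toy numbers are assumptions for illustration; the defaults follow the values quoted above.

```python
import math


def okapi_score(query_counts, doc_counts, doc_len, avdl, n_docs, df,
                k1=1.2, b=0.75, k3=1000.0):
    """Okapi BM25: sum over matched query terms of IDF * TF part * QTF part."""
    score = 0.0
    for w, qtf in query_counts.items():
        tf = doc_counts.get(w, 0)
        if tf == 0:
            continue
        idf = math.log((n_docs - df[w] + 0.5) / (df[w] + 0.5))
        # b controls how strongly document length (relative to avdl) damps TF
        tf_part = ((k1 + 1.0) * tf) / (k1 * ((1.0 - b) + b * doc_len / avdl) + tf)
        # k3 controls saturation of the query term frequency
        qtf_part = ((k3 + 1.0) * qtf) / (k3 + qtf)
        score += idf * tf_part * qtf_part
    return score


score = okapi_score({"ir": 1}, {"ir": 1}, doc_len=100, avdl=100.0,
                    n_docs=100, df={"ir": 10})
print(round(score, 4))
```

For an average-length document with a single occurrence of a single query term, both the TF and QTF parts equal 1, so the score is just the IDF factor; that makes this case a handy sanity check when changing parameters.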

High Parameter Sensitivity

Hui Fang’s Thesis Work [Fang 07]
Propose a novel axiomatic framework, where relevance is directly modeled with term-based constraints
– Predict the performance of a function analytically [Fang et al., SIGIR 04]
– Derive more robust and effective retrieval functions [Fang & Zhai, SIGIR 05; Fang & Zhai, SIGIR 06]
– Diagnose weaknesses and strengths of retrieval functions [Fang & Zhai, under review]

Traditional Way of Modeling the Relevance
[Diagram labels: Query → QRep; Document → DRep; Relevance?; test collection]
• Vector Space Models: Rel ≈ Sim(DRep, QRep)
• Probabilistic Models: Rel ≈ P(R=1|DRep, QRep)

Axiomatic Approach to Relevance Modeling
[Diagram labels: Query → QRep; Document → DRep; Relevance?; Vector Space Models: Rel ≈ Sim(DRep, QRep); Probabilistic Models: Rel ≈ P(R=1|DRep, QRep); Constraint 1, Constraint 2, …, Constraint m → Rel(Q, D); test collection; Collection (constraint 1), Collection (constraint 2), …, Collection (constraint m)]
(1) Predict performance ← We are here
(2) Develop more robust functions
(3) Diagnose weaknesses

Part 1: Define retrieval constraints [Fang et al., SIGIR 2004]

A Formal Study of Information Retrieval Heuristics (ACM SIGIR 2004)
Hui Fang, Tao Tao and ChengXiang Zhai
Department of Computer Science, University of Illinois, Urbana-Champaign, USA

Empirical Observations in IR
• Retrieval heuristics are necessary for good retrieval performance.
  – E.g. TF-IDF weighting, document length normalization
• Similar formulas may have different performances.
• Performance is sensitive to parameter setting.

Empirical Observations in IR (Cont.)
• Pivoted Normalization Method: [formula; surviving fragment: 1+ln(c(w, d))]
• Dirichlet Prior Method: [formula]
• Okapi Method: [formula]
[Annotations on the formulas: term frequency; alternative TF transformation; inverse document frequency; document length normalization; parameter sensitivity]
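Since the formulas on this slide were lost, the sketch below gives the pivoted normalization and Dirichlet prior scoring functions in the forms in which they are commonly written in the axiomatic-analysis literature. Treat the exact forms, the function names, and the toy statistics as assumptions rather than a transcription of the slide.

```python
import math


def pivoted_score(query_counts, doc_counts, doc_len, avdl, n_docs, df, s=0.2):
    """Pivoted normalization: double-log TF, pivoted length norm, ln((N+1)/df) IDF."""
    score = 0.0
    for w, qtf in query_counts.items():
        tf = doc_counts.get(w, 0)
        if tf > 0:
            tf_part = (1.0 + math.log(1.0 + math.log(tf))) / ((1.0 - s) + s * doc_len / avdl)
            score += tf_part * qtf * math.log((n_docs + 1.0) / df[w])
    return score


def dirichlet_score(query_counts, doc_counts, doc_len, p_coll, mu=2000.0):
    """Query likelihood with Dirichlet prior smoothing, in rank-equivalent form."""
    q_len = sum(query_counts.values())
    score = q_len * math.log(mu / (doc_len + mu))   # length normalization term
    for w, qtf in query_counts.items():
        score += qtf * math.log(1.0 + doc_counts.get(w, 0) / (mu * p_coll[w]))
    return score


piv = pivoted_score({"ir": 1}, {"ir": 1}, doc_len=100, avdl=100.0,
                    n_docs=100, df={"ir": 10})
dirich = dirichlet_score({"ir": 1}, {}, doc_len=100, p_coll={"ir": 0.01})
print(round(piv, 4), round(dirich, 4))
```

Writing the functions side by side makes the shared ingredients on the slide visible: each has a sublinear TF component, some form of IDF (explicit in pivoted, implicit through p(w|C) in Dirichlet), and a length normalization controlled by one parameter (s or mu).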

Research Questions
• How can we formally characterize these necessary retrieval heuristics?
• Can we predict the empirical behavior of a method without experimentation?

Outline
• Formalized heuristic retrieval constraints
• Analytical evaluation of the current retrieval formulas
• Benefits of constraint analysis
  – Better understanding of parameter optimization
  – Explanation of performance difference
  – Improvement of existing retrieval formulas

Term Frequency Constraints (TFC1)
TF weighting heuristic I: Give a higher score to a document with more occurrences of a query term.
• TFC1: Let q be a query with only one term w. If [formula] and [formula], then [formula].
[Illustration: query q with term w; documents d1 and d2]
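The heuristic stated in words can be checked mechanically. The sketch below assumes the usual reading of TFC1: for a one-term query and two documents of equal length, the document with more occurrences of the term must score strictly higher. Under that reading the score reduces to a TF weight, so it is enough to test that the weight is strictly increasing. The function names and the range tested are hypothetical.

```python
import math


def satisfies_tfc1(tf_weight, max_tf=50):
    """True if tf_weight strictly increases with the term count for 1..max_tf."""
    return all(tf_weight(c + 1) > tf_weight(c) for c in range(1, max_tf))


def log_tf(c):
    """Double-log TF transformation, as used in pivoted normalization."""
    return 1.0 + math.log(1.0 + math.log(c))


def binary_tf(c):
    """Presence/absence weighting: ignores how often the term occurs."""
    return 1.0 if c > 0 else 0.0


print(satisfies_tfc1(log_tf), satisfies_tfc1(binary_tf))
```

A numeric check over a finite range is weaker than the analytical argument the lecture has in mind, but it is a quick way to catch a candidate function that obviously violates the constraint.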

Term Frequency Constraints (TFC2)
TF weighting heuristic II: Favor a document with more distinct query terms.
• TFC2: Let q be a query and w1, w2 be two query terms. Assume [formula] and [formula]. If [formula] and [formula], then [formula].
[Illustration: query q with terms w1, w2; documents d1 and d2]

Term Discrimination Constraint (TDC)
IDF weighting heuristic: Penalize the words popular in the collection; give higher weights to discriminative terms.
Query: SVM Tutorial
Assume IDF(SVM) > IDF(Tutorial)
Doc 1: … Tutorial SVM Tutorial …
Doc 2: … SVM SVM Tutorial …

Term Discrimination Constraint (Cont.)
• TDC: Let q be a query and w1, w2 be two query terms. Assume [formula] and [formula]. If [formula] and [formula] for all other words w, then [formula].
[Illustration: query q with terms w1, w2; documents d1 and d2]

Length Normalization Constraints (LNCs)
Document length normalization heuristic: Penalize long documents (LNC1); avoid over-penalizing long documents (LNC2).
• LNC1: Let q be a query. If for some word [formula] but for other words [formula], then [formula].
• LNC2: Let q be a query. If [formula] and [formula], then [formula].
[Illustrations: query q; documents d1 and d2]

TF-LENGTH Constraint (TF-LNC)
TF-LN heuristic: Regularize the interaction of TF and document length.
• TF-LNC: Let q be a query with only one term w. If [formula] and [formula], then [formula].
[Illustration: query q with term w; documents d1 and d2]

Analytical Evaluation
Columns: Retrieval Formula | TFCs | TDC | LNC1 | LNC2 | TF-LNC
Rows as extracted (some cells are missing, so values cannot be aligned to columns):
• Pivoted Norm.: Yes, Conditional
• Dirichlet Prior: Yes, Conditional, Yes
• Okapi (original): Conditional, Conditional
• Okapi (modified): Yes, Conditional, Yes, Yes

Benefits of Constraint Analysis
• Provide an approximate bound for the parameters
  – A constraint may be satisfied only if the parameter is within a particular interval.
• Compare different formulas analytically without experimentation
  – When a formula does not satisfy a constraint, it often indicates non-optimality of the formula.
• Suggest how to improve the current retrieval models
  – Violation of constraints may pinpoint where a formula needs to be improved.

Benefits 1: Bounding Parameters
• Pivoted Normalization Method: LNC2 ⇒ s < 0.4
[Plot: parameter sensitivity of s; average precision vs. s; labels: optimal s (for average precision), 0.4]

Benefits 2: Analytical Comparison
• Okapi Method: part of the formula is negative when df(w) is large, which violates many constraints
[Plot: average precision vs. s or b for Okapi and Pivoted; panels: keyword query, verbose query]
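The sign problem is easy to reproduce with the IDF factor of Okapi as it is commonly written, ln((N − df + 0.5)/(df + 0.5)), which drops below zero once a term occurs in more than half of the documents. For comparison, the sketch also shows the IDF form used in pivoted normalization, ln((N + 1)/df), which stays positive. Both function names and the collection size are assumptions for this example.

```python
import math


def okapi_idf(n_docs, df):
    """IDF factor as commonly written for Okapi BM25."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))


def pivoted_idf(n_docs, df):
    """IDF factor as commonly written for pivoted normalization."""
    return math.log((n_docs + 1.0) / df)


n_docs = 1000  # hypothetical collection size
for df in (100, 500, 600):
    print(df, round(okapi_idf(n_docs, df), 3), round(pivoted_idf(n_docs, df), 3))
```

A negative IDF means that adding more occurrences of such a term lowers the score, which is why this matters most for verbose queries: they are more likely to contain common words.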

Benefits 3: Improving Retrieval Formulas
• Modified Okapi Method: make Okapi satisfy more constraints; expected to help verbose queries
[Plot: average precision vs. s or b for Modified Okapi, Okapi, and Pivoted; panels: keyword query, verbose query]

Conclusions and Future Work
• Conclusions
  – Retrieval heuristics can be captured through formally defined constraints.
  – It is possible to evaluate a retrieval formula analytically through constraint analysis.
• Future Work
  – Explore additional necessary heuristics
  – Apply these constraints to many other retrieval methods
  – Develop new retrieval formulas through constraint analysis

Axiomatic Approach to Relevance Modeling
[Diagram labels: Query → QRep; Document → DRep; Relevance?; Vector Space Models: Rel ≈ Sim(DRep, QRep); Probabilistic Models: Rel ≈ P(R=1|DRep, QRep); Constraint 1, Constraint 2, …, Constraint m → Rel(Q, D); test collection; Collection (constraint 1), Collection (constraint 2), …, Collection (constraint m)]
(1) Predict performance
(2) Develop more robust functions ← We are here
(3) Diagnose weaknesses

Part 2: Derive new retrieval functions [Fang & Zhai, SIGIR 05; Fang & Zhai, SIGIR 06]

Basic Idea of Axiomatic Approach
[Diagram labels: function space; retrieval constraints C1, C2, C3; function sets S1, S2, S3; “Our target”]

Function Space
Define the function space inductively (example: Q: cat big; D: dog big)
• Primitive weighting function (f): S(Q, D) = S( , ) = f( , )
• Query growth function (h): S(Q, D) = S( , ) = S( , ) + h( , , )
• Document growth function (g): S(Q, D) = S( , ) = S( , ) + g( , , )

Derivation of New Retrieval Functions
[Diagram: existing function S → decompose → generalize → constrain (C1, C2, C3) → assemble → new function S’]

Representative Derived Function
[Formula with labeled parts: QTF; IDF; TF; length normalization]

The derived function is less sensitive to the parameter setting
[Plot labels: better; Axiomatic Model]

Adding Semantic Term Matching
Query: dog
Document 1: “A book is a collection of paper with text, pictures, usually bound together along one edge within covers. A book is also a literary work or a main division of such a work. A book produced in electronic format is known as an e-book.”
Document 2: “Training puppies is not always easy: it requires work. Puppies should be touched and held from birth, although only briefly and occasionally until their eyes and ears open. Otherwise the puppy may become vicious.”

General Approach to Semantic Term Matching
• Select semantically similar terms
• Expand the original query with the selected terms
  Example weights: dog 1; puppy 0.5; doggy 0.5; hound 0.5; bone 0.1
• Key challenge: how to weight the selected terms?
• The proposed axiomatic approach provides guidance on how to weight terms appropriately.
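The select-and-expand steps can be sketched as follows. The similarity values, the scaling factor `beta`, and the rule of keeping original terms at weight 1 are all hypothetical; they were picked so the output reproduces the example weights on the slide, and they do not reproduce the constraint-based weighting that the axiomatic approach derives.

```python
def expand_query(query_terms, similar_terms, beta=0.5):
    """Return term -> weight: original terms get 1.0, expansions get beta * similarity."""
    weights = {t: 1.0 for t in query_terms}
    for term in query_terms:
        for other, sim in similar_terms.get(term, {}).items():
            weights[other] = max(weights.get(other, 0.0), beta * sim)
    return weights


def weighted_match(weights, doc_tokens):
    """Sum of weight * count over document tokens (a deliberately simple score)."""
    return sum(weights.get(tok, 0.0) for tok in doc_tokens)


# Hypothetical similarity scores for the query term "dog".
similar = {"dog": {"puppy": 1.0, "doggy": 1.0, "hound": 1.0, "bone": 0.2}}
expanded = expand_query(["dog"], similar)
score = weighted_match(expanded, "the puppy found a bone".split())
print(expanded, score)
```

With plain term matching, a document about puppies scores zero for the query "dog"; the expanded query gives it partial credit, and the size of that credit is exactly the weighting question the slide raises.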

Effectiveness of Semantic Term Matching
                                     ROBUST04           ROBUST05
                                     MAP      P@20      MAP      P@20
Syntactic term matching (baseline)   0.248    0.352     0.192    0.379
Semantic term matching               0.302    0.399     0.292    0.502
                                     (21.8%)            (51.0%)

Axiomatic Approach to Relevance Modeling
[Diagram labels: Query → QRep; Document → DRep; Relevance?; Vector Space Models: Rel ≈ Sim(DRep, QRep); Probabilistic Models: Rel ≈ P(R=1|DRep, QRep); Constraint 1, Constraint 2, …, Constraint m → Rel(Q, D); test collection; Collection (constraint 1), Collection (constraint 2), …, Collection (constraint m)]
(1) Predict performance
(2) Develop more robust functions
(3) Diagnose weaknesses ← We are here

Part 3: Diagnostic evaluation for IR models [Fang & Zhai, under review]

Existing evaluation provides little explanation for the performance differences

[Figure: test collection (Query: dog; Doc 1, Doc 2, …, Doc n) and a Retrieval Function yield MAP=0.25]

             trec8    wt2g     fr88-89
Pivoted      0.244    0.288    0.218
Dirichlet    0.257    0.302    0.202

How to diagnose weaknesses and strengths of retrieval functions?

Relevance-Preserving Perturbations
• Perturb term statistics in documents and keep relevance status
• Document scaling perturbation: cD(d, d, K)
  – concatenate every document with itself K times
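A minimal sketch of the document scaling perturbation, assuming documents are token lists keyed by a document ID. The names `scale_document` and `scale_collection` are hypothetical, and the code reads "K times" as "the result holds K copies of the document"; the slide does not settle whether K or K+1 copies is meant.

```python
from collections import Counter

def scale_document(tokens, k):
    """Return the document concatenated with itself so that it holds k copies.

    Assumption: k = 1 leaves the document unchanged.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return list(tokens) * k

def scale_collection(docs, k):
    """Apply the same scaling to every document; relevance labels are untouched."""
    return {doc_id: scale_document(tokens, k) for doc_id, tokens in docs.items()}

docs = {
    "d1": ["dog", "barks", "dog"],
    "d2": ["cat", "sleeps"],
}
scaled = scale_collection(docs, 3)
dog_count = Counter(scaled["d1"])["dog"]  # every term count grows by the factor k
```

Because every term count and every document length grow by the same factor, relative term proportions inside each document stay the same, which is why the relevance status can reasonably be kept.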

Summary of Perturbations
• Relevance addition
• Noise addition
• Internal term growth
• Document scaling
• Relevant document concatenation
• Non-relevant document concatenation
• Noise deletion
• Document addition
• Document deletion

Length Scaling Test
1. Identify the aspect to be diagnosed
   – test whether a retrieval function over-penalizes long documents
2. Choose appropriate relevance-preserving perturbations
   – cD(d, d, K)
3. Perform the test and interpret the results
   – Dirichlet over-penalizes long documents!
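The three steps can be sketched as a small test harness. Everything below is an assumption made for illustration: the helper names are hypothetical, the retrieval function is a plain Dirichlet-smoothed query likelihood with an arbitrary `mu`, scaling is read as "K copies of each document", and the toy collection is far too small to reproduce the finding reported on the slide. It only shows the mechanics of scaling every document and tracking average precision as K grows.

```python
import math
from collections import Counter

def dirichlet_score(query, doc, coll_prob, mu):
    """Query likelihood with Dirichlet prior smoothing (log scale)."""
    tf = Counter(doc)
    return sum(
        math.log((tf[w] + mu * coll_prob[w]) / (len(doc) + mu))
        for w in query
        if w in coll_prob  # skip query terms that never occur in the collection
    )

def average_precision(ranking, relevant):
    """Average precision of one ranked list of document IDs."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def length_scaling_test(docs, query, relevant, ks, mu=10.0):
    """Scale every document by each k and record average precision."""
    results = {}
    for k in ks:
        scaled = {d: list(toks) * k for d, toks in docs.items()}
        counts = Counter(t for toks in scaled.values() for t in toks)
        total = sum(counts.values())
        coll_prob = {w: c / total for w, c in counts.items()}
        ranking = sorted(
            scaled,
            key=lambda d: dirichlet_score(query, scaled[d], coll_prob, mu),
            reverse=True,
        )
        results[k] = average_precision(ranking, relevant)
    return results

# Toy data (hypothetical): one relevant and one non-relevant document.
docs = {"d1": ["dog", "dog", "cat"], "d2": ["cat", "cat", "bird"]}
results = length_scaling_test(docs, ["dog"], {"d1"}, ks=[1, 2, 4])
```

On a real test collection one would plot the recorded values against K for each retrieval function; a curve that drops as K grows suggests the function penalizes long documents too heavily.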

Summary of Diagnostic Tests
• Length variation sensitivity tests
  – Length variance reduction test
  – Length variance amplification test
  – Length scaling test
• Term noise resistance tests
  – Term noise addition test
• TF-LN balance tests
  – Single query term growth
  – Majority query term growth
  – All query term growth

Identifying the weaknesses makes it possible to improve the performance

          trec8                     wt2g                      fr88-89
          MAP     #RRel   P@k       MAP     #RRel   P@k       MAP     #RRel   P@k
Dir.      0.257   2838    0.397     0.302   1875    0.372     0.207   741     0.185
M.D.      0.262   2874    0.415     0.321   1930    0.395     0.224   811     0.191
Piv.      0.244   2826    0.402     0.288   1924    0.369     0.223   822     0.206
M.P.      0.256   2848    0.411     0.316   1940    0.392     0.230   867     0.202

Axiomatic Framework: Summary
• A new way of examining and developing retrieval models
• Facilitates analytical study of retrieval models
• Applicable to the development of all kinds of ranking functions
• Limitations:
  – Constraints can be subjective
  – Not constructive (thus must rely on other techniques to reduce the search space)

Future Research Directions
• Understand the limit of bag-of-words representation
  – Develop more constraints
    • Consider other variables (e.g., # unique words in collection, # unique words in a doc, # frequency variance, …)
    • A new constraint can potentially lead to an improved model
  – Impossibility theorem for retrieval
    • Show the performance limit
    • Establish an ultimate robust optimal retrieval model
• Apply the axiomatic framework to pseudo feedback
• Apply the axiomatic methodology to other (non-retrieval) tasks

What You Should Know
• How retrieval heuristics can be formalized as mathematical constraints on functions
• Benefits of constraint analysis
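As a reminder of what such a formalization looks like, here is one constraint in the style of the term-frequency constraints of [Fang et al. 04]. The notation is an assumption chosen for this example: f(d, q) is the retrieval score, c(w, d) is the count of term w in document d, and |d| is the document length.

```latex
% Heuristic: more occurrences of a query term should give a higher score.
% Let q = \{w\} be a single-term query and let d_1, d_2 be two documents
% of equal length.
\[
  |d_1| = |d_2| \;\text{ and }\; c(w, d_1) > c(w, d_2)
  \quad\Longrightarrow\quad
  f(d_1, q) > f(d_2, q)
\]
```

A retrieval function can then be checked analytically against the implication, rather than only empirically on a test collection.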

Roadmap
• This lecture: formal retrieval frameworks
• Next lecture: personalized search