Federated Search of Text Search Engines in Uncooperative Environments


Federated Search of Text Search Engines in Uncooperative Environments
Luo Si
Language Technology Institute, School of Computer Science, Carnegie Mellon University
Thesis Committee:
Jamie Callan (Carnegie Mellon University, Chair)
Jaime Carbonell (Carnegie Mellon University)
Yiming Yang (Carnegie Mellon University)
Luis Gravano (Columbia University)


Outline:
Ø Introduction: Introduction to federated search
Ø Research Problems: the state-of-the-art and preliminary research
Ø Future Research: Dissertation research and expected contribution



Introduction
Visible Web vs. Hidden Web
• Visible Web: Information can be copied (crawled) and accessed by conventional search engines like Google or AltaVista
• Hidden Web: Information can NOT be copied and accessed by conventional engines
- No arbitrary crawl of the data (e.g., ACM library)
- Updated too frequently to be crawled (e.g., buy.com)
Conventional engines can NOT index the Hidden Web (promptly). The Hidden Web is contained in (hidden) information sources that provide text search engines to access the hidden information.


Introduction
The Hidden Web is:
- Larger than the Visible Web (2-50 times, Sherman 2001)
- Valuable: created by professionals
- Searched by federated search
Environments:
• Small companies: Probably cooperative information sources
• Big companies (organizations): Probably uncooperative information sources
• Web: Uncooperative information sources


Introduction
Components of a Federated Search System
[Diagram: N search engines (Engine 1 ... Engine N) connected to three components: (1) Resource Representation, (2) Resource Selection, (3) Results Merging]


Introduction
Solutions of Federated Search
• Browsing model: Organize sources into a hierarchy; navigate manually
(From: invisible-web.net)


Introduction
Solutions of Federated Search
• Information source recommendation: Recommend information sources for users' text queries
- Useful when users want to browse the selected sources
- Contains the resource representation and resource selection components
• Federated document retrieval: Search selected sources and merge individual ranked lists
- Most complete solution
- Contains all of resource representation, resource selection and results merging


Introduction
Modeling Federated Search
• Application in the real world
- FedStats project: Web site to connect dozens of government agencies with uncooperative search engines
• Previously used a centralized solution (ad-hoc retrieval), but suffered badly from missing new information and broken links
• Requires a federated search solution: a prototype federated search solution for FedStats is under way at Carnegie Mellon University
- Good candidate for evaluation of federated search algorithms
- But not enough relevance judgments, not enough control... hence the need for thorough simulation


Introduction
Modeling Federated Search
• TREC data
- Large text corpus, thorough queries and relevance judgments
• Simulation with TREC news/government data
- Professional, well-organized contents
- Often divided into O(100) information sources
- Simulates environments of large companies or the domain-specific Hidden Web
- Most commonly used, many baselines (Lu et al., 1996)(Callan, 2000)...
- Normal or moderately skewed size testbeds: Trec123 or Trec4_Kmeans
- Skewed: Representative (large source with the same relevant doc density), Relevant (large source with higher relevant doc density), Nonrelevant (large source with lower relevant doc density)


Introduction
Modeling Federated Search
• Simulating multiple types of search engines
- INQUERY: Bayesian inference network with Okapi term formula, doc score range [0.4, 1.0]
- Language Model: generation probabilities of the query given docs, doc score range [-60, -30] (log of the probabilities)
- Vector Space Model: SMART "lnc.ltc" weighting, doc score range [0.0, 1.0]
• Federated search metrics
- Information source size estimation: error rate in source size estimation
- Information source recommendation: High-Recall, select the information sources with the most relevant docs
- Federated doc retrieval: High-Precision at top ranked docs
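The three engine types above return document scores on very different scales, which is exactly what makes merging their results non-trivial. A minimal illustration with hypothetical scores (min-max rescaling here is only a strawman, not a method proposed in this thesis):

```python
# Hypothetical raw scores for the same query from three engine types.
# Ranges match the slide: INQUERY [0.4, 1.0], LM log-probs [-60, -30],
# vector space [0.0, 1.0]. Sorting the raw scores together is meaningless.
inquery_scores = [0.82, 0.55, 0.47]
lm_scores = [-34.2, -41.8, -55.0]
vsm_scores = [0.91, 0.40, 0.12]

def min_max(scores):
    """Strawman per-list rescaling to [0, 1]; loses cross-source calibration."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# After rescaling, every list's best doc gets 1.0 even if one source is far
# more relevant overall: the problem the SSL merging work (later) addresses.
print(min_max(inquery_scores), min_max(lm_scores), min_max(vsm_scores))
```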


Outline:
Ø Introduction
Ø Research Problems: the state-of-the-art and preliminary research
- Resource Representation
- Resource Selection
- Results Merging
- A Unified Framework
Ø Future Research



Research Problems (Resource Representation)
Previous Research on Resource Representation
• Resource descriptions of words and their occurrences
- STARTS protocol (Gravano et al., 1997): cooperative protocol
- Query-Based Sampling (Callan et al., 1999):
§ Send random queries and analyze the returned docs
§ Good for uncooperative environments
• Centralized sample database: collect docs from Query-Based Sampling (QBS)
- For query expansion (Ogilvie & Callan, 2001), not very successful
- Successfully utilized for other problems, throughout this proposal


Research Problems (Resource Representation)
Previous Research on Resource Representation
• Information source size estimation: important for resource selection, and provides users useful information
- Capture-Recapture Model (Liu and Yu, 1999)
Uses two sets of independent queries and analyzes the overlap of the returned doc ids
But requires a large number of interactions with the information sources
New Information Source Size Estimation Algorithm
- Sample-Resample Model (Si and Callan, 2003)
Assumption: the search engine indicates the number of docs matching a one-term query
Strategy: Estimate the df of a term in the sampled docs
Get the total df by sending the term back to the source as a resample query
Scale the number of sampled docs to estimate the source size
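A minimal sketch of the sample-resample estimate (function and variable names are mine; real use would average over several resample terms): if a term occurs in df_sample of the n_sampled downloaded docs and the engine reports df_total matching docs, then df_sample / n_sampled is roughly df_total / N, so N is roughly df_total * n_sampled / df_sample.

```python
def sample_resample_estimate(df_pairs, n_sampled):
    """Estimate source size from (df_in_sample, df_reported) pairs.

    Assumes each sampled doc is (roughly) a uniform draw from the source,
    so df_in_sample / n_sampled approximates df_reported / source_size.
    Averaging over several resample terms smooths out noise.
    """
    estimates = [
        reported * n_sampled / in_sample
        for in_sample, reported in df_pairs
        if in_sample > 0
    ]
    return sum(estimates) / len(estimates)

# Hypothetical numbers: 300 sampled docs; "economy" occurs in 6 of them and
# the engine reports 4,100 matching docs, and so on for two more terms.
print(sample_resample_estimate([(6, 4100), (9, 6300), (3, 1900)], 300))
```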


Research Problems (Resource Representation)
Experiment Methodology
Methods are allowed the same number of transactions with a source.
Two scenarios to compare the Capture-Recapture & Sample-Resample methods:
- Combined with other components: methods can utilize data from Query-Based Sampling (QBS)
- Component-level study: methods cannot utilize data from Query-Based Sampling
[Table flattened in extraction: per-method transaction budgets, in queries and downloaded documents, for Capture-Recapture (Scenario 1), Capture-Recapture (Scenario 2), and Sample-Resample; the recoverable figures are 80 queries, 300 downloaded documents, and totals of 85 and 385 transactions. Data may be acquired by QBS (80 sample queries acquire 300 docs).]


Research Problems (Resource Representation)
Experiments
To conduct a component-level study:
- Capture-Recapture: about 385 queries (transactions)
- Sample-Resample: 80 queries and 300 downloaded docs (sample) + 5 queries (resample) = 385 transactions
Measure: absolute error ratio (AER) = |Estimated Source Size - Actual Source Size| / Actual Source Size
Testbeds: Trec123, and Trec123-10Col (collapse every 10th source of Trec123)

                    Trec123 (Avg AER, lower is better)   Trec123-10Col (Avg AER, lower is better)
Capture-Recapture              0.729                                 0.943
Sample-Resample                0.232                                 0.299


Outline:
Ø Introduction
Ø Research Problems: the state-of-the-art and preliminary research
- Resource Representation
- Resource Selection
- Results Merging
- A Unified Framework
Ø Future Research


Research Problems (Resource Selection)
Goal of Resource Selection for Information Source Recommendation
High-Recall: Select the (few) information sources that have the most relevant documents
Previous Research on Resource Selection
• Resource selection algorithms that need training data
- Decision-Theoretic Framework (DTF) (Nottelmann & Fuhr, 1999, 2003): DTF incurs large human judgment costs
- Lightweight probes (Hawking & Thistlewaite, 1999): acquire training data in an online manner, at large communication costs


Research Problems (Resource Selection)
Previous Research on Resource Selection
• "Big document" resource selection approach: treat information sources as big documents, rank them by similarity to the user query
- Cue Validity Variance (CVV) (Yuwono & Lee, 1997)
- CORI (Bayesian inference network) (Callan, 1995)
- KL-divergence (Xu & Croft, 1999): calculate the KL divergence between the distributions of the information sources and the user query
CORI and KL are the state-of-the-art (French et al., 1999)(Craswell et al., 2000)
But the "big document" approach loses doc boundaries and does not optimize the goal of High-Recall


Research Problems (Resource Selection)
Previous Research on Resource Selection
• Methods that turn away from the "big document" resource selection approach
- bGlOSS (Gravano et al., 1994) and vGlOSS (Gravano et al., 1999)
Turn away from the "big document" approach by considering the goodness of each doc in the sources
But use strong assumptions to calculate doc goodness
Thought: Resource selection algorithms for information source recommendation need to optimize High-Recall, i.e., including the most relevant docs
Our strategy: estimate the percentage of relevant docs among sources and rank the sources accordingly
RElevant Doc Distribution Estimation (ReDDE) resource selection


Research Problems (Resource Selection)
Relevant Doc Distribution Estimation (ReDDE) Algorithm
Rel_i ≈ Source Scale Factor × (number of sampled docs from source i that rank above a threshold on the Centralized Complete DB)
- Source Scale Factor = Estimated Source Size / Number of Sampled Docs
- "Everything at the top is (equally) relevant": docs above the threshold count equally, docs below count zero
Problem: To estimate the doc ranking on the Centralized Complete DB
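A compact sketch of the ReDDE scoring step under the assumptions above (names and the top_fraction default are illustrative; the estimated centralized ranks are assumed to be computed already by scaling centralized-sample-DB ranks with each source's scale factor):

```python
def redde_scores(sources, centralized_rank, top_fraction=0.003):
    """Score each source by its estimated count of relevant documents.

    sources: dict source_id -> {"est_size": float, "sampled_docs": list}
    centralized_rank: dict doc_id -> estimated rank on the centralized
        complete DB (assumed precomputed from the centralized sample DB).
    top_fraction: docs ranked above top_fraction * total estimated size
        are treated as (equally) relevant; the value here is arbitrary.
    """
    total_size = sum(s["est_size"] for s in sources.values())
    threshold = top_fraction * total_size
    scores = {}
    for sid, s in sources.items():
        scale = s["est_size"] / len(s["sampled_docs"])  # source scale factor
        hits = sum(1 for d in s["sampled_docs"] if centralized_rank[d] < threshold)
        scores[sid] = scale * hits  # estimated number of relevant docs
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```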


Research Problems (Resource Selection)
ReDDE Algorithm (Cont)
In resource representation:
• Build representations by QBS, collapse sampled docs into the centralized sample DB
In resource selection:
• Construct the ranking on the CCDB (Centralized Complete DB) with the ranking on the CSDB (Centralized Sample DB)
[Diagram: Engine 1 ... Engine N feed Resource Representation (Centralized Sample DB); Resource Selection turns the CSDB ranking into a CCDB ranking with a relevance threshold]


Research Problems (Resource Selection)
Experiments: on testbeds with uniform or moderately skewed source sizes
[Charts comparing the evaluated ranking against the desired ranking]


Research Problems (Resource Selection)
Experiments: on testbeds with skewed source sizes
[Charts comparing the evaluated ranking against the desired ranking]


Outline:
Ø Introduction
Ø Research Problems: the state-of-the-art and preliminary research
- Resource Representation
- Resource Selection
- Results Merging
- A Unified Framework
Ø Future Research


Research Problems (Results Merging)
Goal of Results Merging
Make different result lists comparable and merge them into a single list
Difficulties:
- Information sources may use different retrieval algorithms
- Information sources have different corpus statistics
Previous Research on Results Merging
• The most accurate methods directly calculate comparable scores
- Use the same retrieval algorithm and the same corpus statistics (Viles & French, 1997)(Xu and Callan, 1998); need source cooperation
- Download retrieved docs and recalculate scores (Kirsch, 1997); large communication and computation costs


Research Problems (Results Merging)
Previous Research on Results Merging
• Methods that approximate comparable scores
- Round Robin (Voorhees et al., 1997): only uses source rank and doc rank information; fast but less effective
- CORI merging formula (Callan et al., 1995): linear combination of doc scores and source scores
§ Uses a linear transformation, a hint for other methods
§ Works in uncooperative environments; effective but needs improvement
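For concreteness, a sketch of the CORI-style merge as it is usually described in the distributed IR literature (Callan, 2000): both doc and source scores are min-max normalized, then combined linearly. The 0.4 constant is the commonly cited value, not something stated on this slide:

```python
def cori_merge_score(doc_score, doc_min, doc_max, src_score, src_min, src_max):
    """CORI-style merged score: normalize, then linearly combine.

    The weighting D'' = (D' + 0.4 * D' * S') / 1.4 is the constant usually
    cited for CORI merging; treat it as illustrative here.
    """
    d = (doc_score - doc_min) / (doc_max - doc_min)   # normalized doc score
    s = (src_score - src_min) / (src_max - src_min)   # normalized source score
    return (d + 0.4 * d * s) / 1.4
```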


Research Problems (Results Merging)
Thought: Previous algorithms either try to calculate or to mimic the effect of the centralized scores. Can we estimate the centralized scores effectively and efficiently?
• Semi-Supervised Learning (SSL) Merging (Si & Callan, 2002, 2003)
- Some docs exist in both the centralized sample DB and the retrieved docs
From the centralized sample DB and the individual ranked lists when long ranked lists are available
Download a minimum number of docs when only short ranked lists are available
- A linear transformation maps source-specific doc scores to source-independent scores on the centralized sample DB
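A minimal sketch of the SSL idea (names are mine; the published algorithm also handles rank-only lists and downloads extra docs when the overlap is too small): fit one linear model per source on the overlap docs, then map every returned score into the centralized score space and sort.

```python
def fit_linear_map(pairs):
    """Least-squares fit of centralized_score ~ a * source_score + b,
    trained on docs appearing both in a source's ranked list and in the
    centralized sample DB (the "overlap" docs)."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def ssl_merge(ranked_lists, overlap_scores):
    """ranked_lists: source_id -> list of (doc_id, source_score).
    overlap_scores: source_id -> list of (source_score, centralized_score)
    pairs for the overlap docs. Returns one list sorted by estimated
    centralized score."""
    merged = []
    for sid, docs in ranked_lists.items():
        a, b = fit_linear_map(overlap_scores[sid])
        merged += [(doc, a * s + b) for doc, s in docs]
    return sorted(merged, key=lambda t: t[1], reverse=True)
```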


Research Problems (Results Merging)
SSL Results Merging (cont)
In resource representation:
• Build representations by QBS, collapse sampled docs into the centralized sample DB
In resource selection:
• Rank sources, calculate centralized scores for docs in the centralized sample DB
In results merging:
• Find overlap docs, build linear models, estimate centralized scores for all docs
[Diagram: Engine 1 ... Engine N feed Resource Representation (Centralized Sample DB); Resource Selection produces a CSDB ranking; overlap docs train the linear models that produce the Final Results]


Research Problems (Results Merging)
Experiments: Trec123 and Trec4-kmeans testbeds
- 3 sources selected and 10 sources selected
- 50 docs retrieved from each source
- SSL downloads the minimum number of docs needed for training
[Charts of merging results]


Outline:
Ø Introduction
Ø Research Problems: the state-of-the-art and preliminary research
- Resource Representation
- Resource Selection
- Results Merging
- A Unified Framework
Ø Future Research


Research Problems (Unified Utility Framework)
Goal of the Unified Utility Maximization Framework
Integrate and adjust the individual components of federated search to get globally desired results for different applications, rather than simply combining individually effective components together
High-Recall vs. High-Precision
- High-Recall: Select sources that contain as many relevant docs as possible (for information source recommendation)
- High-Precision: Select sources that return many relevant docs at the top of their ranked lists (for federated document retrieval)
They are correlated but NOT identical; previous research does NOT distinguish them


Research Problems (Unified Utility Framework)
UUM Framework: estimate probabilities of relevance of docs
In resource representation:
• Build representations and the centralized sample DB (CSDB)
• Build a logistic model on the CSDB that maps a centralized doc score to a prob of relevance
In resource selection:
• Use piecewise interpolation over doc rank to get centralized doc scores for all docs
• Calculate R_ij, the prob of relevance for the jth doc from the ith source, for all docs in all available sources
[Diagram: Engine 1 ... Engine N feed the Centralized Sample DB; one plot maps Centralized Doc Score to Prob of Relevance (logistic model), another maps Doc Rank to Centralized Doc Score (interpolation), yielding the CSDB ranking]
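A sketch of the two estimation steps named above, under stated assumptions: the logistic parameters a and b are assumed to have been fit offline on training queries with relevance judgments, sample_ranks are assumed sorted ascending, and all names are mine.

```python
import bisect
import math

def logistic_prob(score, a, b):
    """Logistic model trained on the CSDB: maps a centralized doc score
    to an estimated probability of relevance."""
    return 1.0 / (1.0 + math.exp(-(a + b * score)))

def interpolated_scores(sample_ranks, sample_scores, n_docs):
    """Piecewise-linear interpolation of centralized scores over doc rank.

    sample_ranks / sample_scores: ranks (scaled up by the source scale
    factor) and centralized scores of this source's sampled docs.
    Returns an estimated centralized score for every rank 1..n_docs.
    """
    out = []
    for r in range(1, n_docs + 1):
        i = bisect.bisect_left(sample_ranks, r)
        if i == 0:
            out.append(sample_scores[0])           # before first anchor
        elif i == len(sample_ranks):
            out.append(sample_scores[-1])          # past last anchor
        else:
            r0, r1 = sample_ranks[i - 1], sample_ranks[i]
            s0, s1 = sample_scores[i - 1], sample_scores[i]
            out.append(s0 + (s1 - s0) * (r - r0) / (r1 - r0))
    return out
```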


Research Problems (Unified Utility Framework)
Unified Utility Maximization Framework (UUM)
• Basic Framework
Let d = (d_1, ..., d_N) indicate the number of docs to retrieve from each source, and let R be the estimated probs of relevance for all docs; p(R | ρ) is the prob of R given all available resource descriptions ρ and centralized retrieval scores; U(d, R) is the utility gained by making selection d when R is correct.
Desired solution: d* = argmax_d ∫ U(d, R) p(R | ρ) dR
MAP approximation: d* ≈ argmax_d U(d, R*), where R* = argmax_R p(R | ρ)


Research Problems (Unified Utility Framework)
Unified Utility Maximization Framework (UUM)
• Resource selection for information source recommendation
High-Recall goal: Select sources that contain as many relevant docs as possible
Utility: the number of relevant docs in the selected sources, with the number of sources to select fixed
Solution: Rank sources by the number of relevant docs they are estimated to contain
Called UUM/HR: the Unified Utility Maximization Framework for High-Recall


Research Problems (Unified Utility Framework)
Unified Utility Maximization Framework (UUM)
• Resource selection for federated document retrieval
High-Precision goal: Select sources that return many relevant docs in the top part of their ranked lists
Utility: the number of relevant docs in the top part of each selected source, with the number of sources fixed and a fixed number of docs retrieved from each
Solution: Rank sources by the estimated number of relevant docs in their top part
Called UUM/HP-FL: the Unified Utility Maximization Framework for High-Precision with Fixed Length
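Both UUM/HR and UUM/HP-FL reduce to ranking sources by a sum of estimated relevance probabilities; they differ only in how deep the sum goes. A minimal sketch under that reading (the probabilities come from the interpolation step sketched earlier; names are mine):

```python
def rank_sources(probs_by_source, depth=None):
    """Rank sources by expected number of relevant docs.

    probs_by_source: source_id -> list of estimated P(rel), ordered by
    the source's doc ranking. depth=None sums over all docs (UUM/HR);
    depth=k sums over the top k docs only (UUM/HP-FL, fixed length k).
    """
    expected = {
        sid: sum(p[:depth] if depth else p)
        for sid, p in probs_by_source.items()
    }
    return sorted(expected, key=expected.get, reverse=True)

# UUM/HR: recommend the sources with the most relevant docs overall.
# hr_ranking = rank_sources(probs, depth=None)
# UUM/HP-FL: retrieve a fixed 50 docs from each selected source.
# hp_fl_ranking = rank_sources(probs, depth=50)
```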


Research Problems (Unified Utility Framework)
Unified Utility Maximization Framework (UUM)
• Resource selection for federated document retrieval
A variant that retrieves a variable number of docs from each selected source, choosing the number of documents to select per source under a total retrieval budget
Solution: No simple closed form; solved by dynamic programming
Called UUM/HP-VL: the Unified Utility Maximization Framework for High-Precision with Variable Length
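One way such a variable-length allocation can be solved with dynamic programming is a knapsack-style recurrence over sources and remaining retrieval budget. The thesis's exact constraints (bounds on per-source list lengths, a cap on the number of selected sources) are not on this slide, so the sketch below is illustrative only:

```python
def allocate_docs(probs_by_source, budget):
    """Choose how many docs to take from each source to maximize the
    expected number of relevant docs retrieved, under a total budget.

    probs_by_source: list of lists; probs_by_source[i][j] is the
    estimated P(rel) of the (j+1)th doc of source i.
    Knapsack-style DP: best[b] = best expected utility using budget b.
    """
    # gains[i][k] = expected relevant docs in the top k of source i
    gains = []
    for probs in probs_by_source:
        g, acc = [0.0], 0.0
        for p in probs:
            acc += p
            g.append(acc)
        gains.append(g)

    best = [0.0] * (budget + 1)
    choice = [[0] * (budget + 1) for _ in probs_by_source]
    for i, g in enumerate(gains):
        new_best = best[:]  # default: take 0 docs from source i
        for b in range(budget + 1):
            for k in range(1, min(len(g) - 1, b) + 1):
                val = best[b - k] + g[k]
                if val > new_best[b]:
                    new_best[b], choice[i][b] = val, k
        best = new_best
    return best[budget]  # backtracking via `choice` recovers allocations
```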


Research Problems (Unified Utility Framework)
Experiments: resource selection for information source recommendation
[Charts of source recommendation results]


Research Problems (Unified Utility Framework)
Experiments: resource selection for information source recommendation (cont.)
[Charts of source recommendation results]


Research Problems (Unified Utility Framework)
Experiments: resource selection for federated document retrieval
- Trec123 Representative testbed
- 3 sources selected and 10 sources selected
- Results merged with SSL
[Charts of document retrieval results]


Outline:
Ø Introduction: Introduction to federated search
Ø Research Problems: the state-of-the-art and preliminary research
Ø Future Research: Dissertation research and expected contribution


Future Research (Dissertation Research)
Purpose:
- More experiments to study the effectiveness of federated search algorithms
- Extend the proposed federated search algorithms to better simulate operational environments
Topics:
• Information Source Size Estimation
• Resource Selection
• Results Merging
• Unified Utility Maximization Framework


Future Research (Dissertation Research)
Information Source Size Estimation
• More experiments to study the Sample-Resample algorithm
- Effect of more resample queries: will that improve estimation accuracy?
- Resample query characteristics: which is better, low df or high df?
- Sample-Resample estimation of larger sources (e.g., 300,000 docs)
• Sample-Resample without available document frequency information
- Basic Sample-Resample needs document frequency information from sources, which may not be available in operational environments
- Estimate document frequency from the overlap of the sampled docs and the results retrieved from the source


Future Research (Dissertation Research)
Resource Selection for Information Source Recommendation
- High-Recall measures the total number of relevant docs contained in the information sources, but users may only care about the top-ranked docs in every source
- A High-Precision variant for source recommendation: the UUM/HP algorithms are candidate solutions
- Source retrieval effectiveness may need to be considered; discussed later together with the new research on the unified utility maximization framework


Future Research (Dissertation Research)
Results Merging
• Semi-Supervised Learning (SSL) with only rank information
- The basic SSL algorithm transforms source-specific doc scores into source-independent scores, but maybe only the doc ranking is available (e.g., most search engines in FedStats do NOT return doc scores)
- Extend SSL by generating pseudo doc scores from doc rank information (see the sketch below)
• Study the difference between the SSL algorithm and a desired merging algorithm
- Compare the results merging effectiveness of the SSL algorithm against an algorithm that merges with the actual centralized doc scores
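The slide does not say how the pseudo scores would be generated; one simple, assumption-laden possibility is a linear decay over ranks, after which the usual SSL regression applies unchanged:

```python
def pseudo_scores(ranked_doc_ids, c=0.5):
    """Assign pseudo scores to a rank-only result list.

    A simple linear decay: the top doc gets 1.0, the last gets c.
    This is one plausible stand-in, not the thesis's chosen scheme;
    the SSL linear model is then fit on these pseudo scores exactly
    as it would be on real engine scores.
    """
    n = len(ranked_doc_ids)
    return [
        (doc, 1.0 - (1.0 - c) * i / max(n - 1, 1))
        for i, doc in enumerate(ranked_doc_ids)
    ]
```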


Future Research (Dissertation Research)
Unified Utility Maximization Framework
• Weighted High-Precision criterion
- The current High-Precision criterion assigns equal weights to top-ranked docs, but different top-ranked docs make different contributions (e.g., the 1st vs. the 500th), and users pay different amounts of attention to the docs
- New weighted High-Precision goal in the UUM Framework
Partial results of trec_eval (Precision):
At   5 docs: 0.3640
At  10 docs: 0.3360
At  15 docs: 0.3253
At  20 docs: 0.3140
At  30 docs: 0.2780
At 100 docs: 0.1666
At 200 docs: 0.0833
At 500 docs: 0.0333
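A weighted criterion would replace the uniform sum over top-ranked docs with a rank-discounted one. A tiny sketch (the reciprocal-rank decay is my placeholder, not a weighting chosen in the proposal):

```python
def weighted_hp_utility(probs, weight=lambda rank: 1.0 / (1 + rank)):
    """Expected weighted utility of one source's ranked list.

    probs: estimated P(rel) for each doc, in rank order (0-based).
    weight: rank-discount function; the reciprocal-rank decay here is
    only a placeholder for whatever weighting the framework adopts.
    """
    return sum(weight(r) * p for r, p in enumerate(probs))
```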


Future Research (Dissertation Research)
Unified Utility Maximization Framework
• Incorporate the impact of source retrieval effectiveness into the framework
- Current solutions do not consider source retrieval effectiveness, but bad search engines may not return any relevant docs even when there are a lot. This matters in operational environments (e.g., the PubMed system uses less effective unranked retrieval)
- Incorporate source retrieval effectiveness into the UUM Framework
Idea: Measure source retrieval effectiveness by its agreement with centralized retrieval results; one possibility is a Noise_Model measure of effectiveness


Future Research (Expected Contribution)
Expected Contribution
• Propose more theoretically solid and effective solutions to the full range of federated search problems
- Sample-Resample source size estimation vs. Capture-Recapture
- RElevant Doc Distribution Estimation (ReDDE) resource selection vs. the "big document" approach
- Semi-Supervised Learning (SSL) results merging vs. the CORI formula


Future Research (Expected Contribution)
Expected Contribution
• Propose the Unified Utility Maximization Framework to integrate the separate solutions
- This is the first probabilistic framework to integrate the different components together
- It allows a better opportunity to utilize the available information (e.g., the information in the centralized sample database)
- It enables us to configure the individual components globally for the desired overall results, rather than simply combining them together


Future Research (Expected Contribution)
Expected Contribution
• Federated search has been a hot research area in the last decade
- Most previous research is tied to the "big document" approach
The new research advances the state-of-the-art:
- More theoretically solid foundation
- More empirically effective
- Better models of real-world applications
A bridge from cool research to a practical tool


Future Research (Schedule)
July 2004 - Aug. 2004: Analyze and develop a federated search testbed with TREC Web data
Sep. 2004 - Dec. 2004:
- Experiments to study the behavior of the Sample-Resample algorithm
- New Sample-Resample source size estimation algorithm without available document frequency information
Jan. 2005 - Apr. 2005:
- SSL algorithm without returned document scores
- Influence of component accuracy on the overall results of the federated search task


Future Research (Schedule)
May 2005 - Aug. 2005:
- Utility maximization framework with the weighted high-precision goal
- Utility maximization framework with consideration of source retrieval effectiveness
Sep. 2005 - Dec. 2005: Analyze the results, summarize, and write up the thesis