Faceted Searching and Browsing Over Large Collections
Wisam Dakka, Columbia University
Search Beyond Navigational Queries
- As data grows, user needs become more complex: users move from pure navigation to discovery
  - [digital video camera], [energy-efficient cars]
- Challenges for major search engines
  - Discovery or research queries
  - Several dimensions of relevance in the results (prices, stores, reviews, locations, and recent news) but no structure
  - Limited user activity
- Google Views: faceted search that adds structure for discovery queries
xRank: Pushing Structure for Special Queries
Search · Learn · Explore · Relate · Scan · Track
[Digital Video Camera] on Yahoo!
Large Collections and Lengthy Results
- Most users examine only the first or second page of query results
- Relevant results appear not only on the first page, but also on subsequent pages
Weaknesses of “Plain” Search
- Search is often unsatisfactory
  - Poor ranking
  - Large number of relevant items
  - Broad-scope queries
- Search is sometimes insufficient
  - Why do we go to a movie rental store or a bookstore?
  - Not effective for curious users or users with little knowledge of the collection
Alternatives for Search: The Topic Facet
Our contribution: summarization-aware, topic-faceted searching and browsing of news articles
Alternatives for Search: The Time Facet
Our contribution: a general strategy to naturally incorporate time into the retrieval task
Alternatives for Search: Multiple Facets
Our contribution: automatically building faceted hierarchies
Agenda: Alternatives Alongside Search
- Searching and browsing with the topic facet
- Searching and browsing with the time facet: [Barack Obama], [Google IPO]
- Searching and browsing with multiple facets
  - Extracting useful facets
  - Automatically constructing faceted hierarchies
- Conclusion and future work
Part 2: The Time Facet
Time-Faceted Searching and Browsing
Example queries: [Barack Obama], [Google IPO]
Time in News Archives
- Topic-relevance ranking may not be sufficiently powerful: consider the query [Madrid bombing]
- [Madrid bombing prefer: 03/11/2004−04/30/2004]
- Searchers often do not know the exact time or date a given event occurred
What to Do When Relevant Time Periods Are Unknown?
- Identify relevant time periods using the query terms
  - Restrict query results to these time periods
  - Diversify the top-10 results
- Alternatively, redefine the relevance of a document as a combination of topic relevance and time relevance
- Improve query reformulations using the relevant time periods
General Time-Sensitive Queries
Examples: [Mad Cow], [Hurricane Florida], [Abu Ghraib], [American beheading], [Barack Obama], [Google IPO]
- Time-sensitive results
  - Prioritize relevant documents from relevant time periods
  - Rank those documents first
- Temporal relevance p(t|q)
  - The likelihood that a day t is relevant to the query q, estimated using the distribution of relevant documents in the archive
Temporal Relevance p(t|q)
[Figure: distribution of the number of relevant documents per day]
- Given [Madrid bombing], what is the probability that today is relevant vs. 04/13/2004?
- p(t|q) is the probability that we see relevant documents at time t
- Simple to compute if the relevant documents are known (formalized below)
- Use estimation when the relevant documents are unknown
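One way to write the known-relevant case (a reconstruction from the definition above; the slide's own formula did not survive extraction):

```latex
% Temporal relevance of day t for query q, given the set R of
% documents relevant to q, with time(d) the publication day of d:
% the fraction of relevant documents published on day t.
p(t \mid q) \;=\; \frac{\bigl|\{\, d \in R : \mathrm{time}(d) = t \,\}\bigr|}{|R|}
```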
Estimating Techniques for p(t|q)
- SUM: compute the value as a normalized weighted sum of the relevance scores of the documents published on 04/13/2004 [Diaz and Jones] (sketched below)
- BINNING: compute the value as F(bin(04/13/2004))
  - Choose a distribution function F
  - Arrange days in bins and order the bins based on their priority
  - Let bin(04/13/2004) be the priority value of the bin containing 04/13/2004
- WORD: compute the value using the frequency of the query words on 04/13/2004
  - Keep track of the word frequency for each day in a special index
  - Use the top-k matching documents; smoothing is applied
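A minimal sketch of the SUM estimator, assuming the input is the publication day and retrieval score of each top-k document (the exact weighting of Diaz and Jones is not reproduced here):

```python
from collections import defaultdict

def sum_estimator(top_docs):
    """SUM: estimate p(t|q) as a normalized sum of the relevance
    scores of the documents published on each day.

    top_docs: list of (pub_day, relevance_score) pairs for the
    top-k documents retrieved for the query.
    """
    day_mass = defaultdict(float)
    for day, score in top_docs:
        day_mass[day] += score                # accumulate score mass per day
    total = sum(day_mass.values())
    # Normalize so the estimates over all days sum to 1.
    return {day: mass / total for day, mass in day_mass.items()}
```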
Binning for Estimating p(t|q)
- Select a distribution function F
- Arrange days in bins and order the bins based on their priority
  - Priority schemes: daily frequency, past frequency, moving window, accumulated mean, bump shapes
- Let bin(t) be the priority value of the bin containing time t
- Return F(bin(t)) (see the sketch below)
[Figure: days assigned to ordered bins 1..k, with F decreasing over bin priority]
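A sketch of BINNING under one concrete pair of choices, both assumptions for illustration: daily frequency as the priority scheme and a geometrically decaying F:

```python
def binning_estimator(day_counts, num_bins=10, decay=0.5):
    """BINNING: estimate p(t|q) as F(bin(t)).

    day_counts: dict day -> number of matching documents on that day.
    Days are grouped into num_bins bins by daily frequency; bins with
    higher frequency get higher priority, and F assigns geometrically
    decaying mass to lower-priority bins.
    """
    # Order days by daily frequency (one possible priority scheme).
    ordered = sorted(day_counts, key=day_counts.get, reverse=True)
    bin_size = max(1, len(ordered) // num_bins)

    # bin(t): priority value of the bin containing day t (0 = highest).
    bin_of = {day: i // bin_size for i, day in enumerate(ordered)}

    # F: geometrically decaying distribution over bin priorities.
    raw = {day: decay ** bin_of[day] for day in ordered}
    total = sum(raw.values())
    return {day: v / total for day, v in raw.items()}
```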
Answering Queries: Background
q = [Madrid bombing]; d = a document in the collection; R = the documents relevant to [Madrid bombing]
- To answer q, score each d based on the content of d and q
- LM: rank d based on the likelihood of generating q from d (see the formula below)
- BM25: rank d based on the odds of d being observed in R
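For reference, the textbook query-likelihood form of the LM ranking mentioned above (a standard formulation, not specific to these slides):

```latex
% Query likelihood: rank d by the probability that the language
% model \theta_d estimated from d generates the query terms.
P(q \mid d) \;=\; \prod_{w \in q} P(w \mid \theta_d)
```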
Answering Time-Sensitive Queries
- Related work: answering recency queries
  - [Barack Obama speech] or [Myanmar cyclone]
  - “Boost” the topic relevance scores of the most recent documents, to promote recent articles
    - Modify the document prior in language models
  - Does not work for other time-sensitive queries
- Goal: a general framework for all time-sensitive queries
  - A document has two components: content and time
  - Combine traditional relevance (content) with temporal relevance
LM for Time-Sensitive Queries
[Figure: a document d has two components, content and time; both are matched against the query q]
- Implemented as part of Indri
- Developed an analogous integration with BM25 (also implemented as part of Lemur)
BM25 for Time-Sensitive Queries
[Figure: a document d has two components, content and time; both are matched against the query q]
- We showed two ways to approximate the temporal factor (a simplified combination is sketched below)
- Implemented as part of Lemur
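The slides integrate a content component and a time component inside LM and BM25; the exact factorization is not recoverable from this deck, so the sketch below uses a simple log-linear mixture with a hypothetical weight `lam`:

```python
import math

def time_sensitive_score(content_score, p_t_given_q, lam=0.5, eps=1e-9):
    """Combine topic relevance and temporal relevance for one document.

    content_score : log-scale LM or BM25 score of the document content.
    p_t_given_q   : estimated temporal relevance of the document's
                    publication day (e.g., from SUM, BINNING, or WORD).
    lam           : mixture weight (hypothetical; would be tuned on
                    held-out queries).
    """
    return (1 - lam) * content_score + lam * math.log(p_t_given_q + eps)
```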
Evaluating LM and BM25
- Data collections and queries
  - TREC news archive
    - Portion of TREC volumes 4 and 5, 1991-94
    - Three sets of time-sensitive queries with relevance judgments
  - Newsblaster archive
    - Six years of news crawled daily from multiple sources
    - Amazon Mechanical Turk relevance judgments for 76 queries
- LM and BM25 with temporal relevance
  - SUM, BINNING, and WORD
- TREC evaluation metrics
  - P@k and MAP
Performance Over Newsblaster
- BUMP- and SUM-based techniques significantly improve precision at the top recall cutoff levels
  - The precision of our techniques drops at the higher recall cutoff levels
Contributions
- Identify the “most important” time period(s) for queries without user input
- Estimate temporal relevance using different techniques
- Combine temporal relevance and topic relevance for all time-sensitive queries using several state-of-the-art retrieval models
- Evaluate our proposed methods extensively to investigate the implications of adding time to the retrieval task
Part 3: Searching and Browsing with Multiple Facets*
A. Extracting Useful Facets
B. Automatic Construction of Hierarchies
* Work published in CIKM 2005, SIGIR 2006 Workshop, ICDE 2007 Demo, and ICDE 2008
Facets for Searching and Browsing
A facet is a “clearly defined, mutually exclusive, and collectively exhaustive aspect, property, or characteristic of a class or specific subject” [S. R. Ranganathan]
- Useful facets for large collections: Location, People, Time, Topic, Actor, Animal
- Example collections: Flickr, The New York Times, YouTube, Corbis
Beyond Topic and Time Facets
- Objective
  - Automatically generate a faceted interface over a large collection (e.g., The New York Times or YouTube)
- Challenges
  - We do not know what facets appear in the collection
  - We need to build the hierarchy for each facet
  - We need to associate items with facets (e.g., what terms describe the facet in a picture: dog -> animal)
- Approaches
  - Supervised and unsupervised extraction of facet terms
  - Hierarchy construction algorithm for each facet
Extraction of Facet Terms
Goal: for each new item in the collection, extract descriptive terms and a set of useful facets
General idea:
1. Identify important terms within each item
   - e.g., Corbis and YouTube user-provided tags: dog, cat, orange, fish, tail, cute
2. Derive context for each important term from external resources
   - e.g., Wikipedia, WordNet, …
   - e.g., cat -> feline, carnivore, mammal, animal, living being, object, entity
3. Associate terms with facets
   - Supervised: group terms under predefined facets, as in Corbis
   - Unsupervised: cluster terms (the unsupervised path is sketched below)
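A sketch of the unsupervised path of this pipeline. The context lookup is injected as a black-box callable (standing in for the WordNet or Wikipedia step), and k-means over tf-idf vectors of the expanded contexts is an assumed clustering choice, not necessarily the one used in the thesis:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def expand_term(term, get_context):
    """Step 2: derive context for an important term.
    get_context: callable term -> list of broader/related terms
    (e.g., a WordNet hypernym or Wikipedia category lookup)."""
    return [term] + get_context(term)

def cluster_terms(term_contexts, num_facets):
    """Step 3, unsupervised: cluster terms by their expanded context,
    so terms sharing broad context (e.g., 'dog' and 'cat' both
    expanding to 'animal') land in the same candidate facet.

    term_contexts: dict term -> list of context terms from expand_term.
    """
    terms = list(term_contexts)
    docs = [" ".join(term_contexts[t]) for t in terms]
    vectors = TfidfVectorizer().fit_transform(docs)
    labels = KMeans(n_clusters=num_facets, n_init=10).fit_predict(vectors)
    facets = {}
    for term, label in zip(terms, labels):
        facets.setdefault(int(label), []).append(term)
    return facets
```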
Supervised Extraction: Results Using SVM and Ripper
- SVM
  - Baseline: 10% (F1*), slightly above random classification
  - Adding hypernyms and associated keywords raises F1 to 71%
- Ripper: investigates whether rule-based assignments are sufficient
  - High-level WordNet hypernyms: 55% (F1), significantly worse than SVM
  - Some classes (facets) work well with simple, rule-based assignment of terms to facets
    - Generic Animals (93.3%)
    - Action Process Activity (35.9%)
[Chart: SVM with hypernyms and associated keywords]
* F1 = harmonic mean of precision and recall
Identifying Important Terms for News
- Named entities, using the LingPipe named entity recognizer
  - Output: named entities (e.g., Elizabeth II)
- Wikipedia terms, using Wikipedia titles, redirects, and anchor text
  - Output: Wikipedia-listed entities
- Yahoo terms, using the Yahoo term extractor
  - Output: significant words or phrases
Extracting Context for News
- Document terms are too specific for facet hierarchies
- Solution: expand terms by querying external resources
  - Wikipedia
  - WordNet
Comparative Term Frequency Analysis
[Diagram: original text DB vs. expanded text DB]
- Context expansion introduces many noisy terms
- However, facet terms are infrequent in the original collection, yet frequent in the expanded one
  - Frequency-based shifting
  - Rank-based shifting
  - Log-likelihood statistic (sketched below)
- Use the identified terms to build facet hierarchies
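The slide names the statistic but not its form; one standard instantiation is Dunning's log-likelihood ratio comparing a term's counts in the original and expanded collections:

```python
import math

def log_likelihood(count_orig, total_orig, count_exp, total_exp):
    """Dunning's G^2 statistic for one term: how surprisingly its
    frequency differs between the original and expanded collections.
    Terms that are rare originally but frequent after expansion score
    high and are good facet-term candidates."""
    def part(observed, expected):
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    combined_rate = (count_orig + count_exp) / (total_orig + total_exp)
    exp_orig = total_orig * combined_rate   # expected count, original DB
    exp_exp = total_exp * combined_rate     # expected count, expanded DB
    return 2.0 * (part(count_orig, exp_orig) + part(count_exp, exp_exp))
```

Since G^2 is symmetric, in practice one would keep only the terms whose relative frequency is higher in the expanded collection.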
Recall and Precision
- Data sets: a single day of Newsblaster (24 sources, SNB); a month and a single day of the NYT
- Recall:
  - 5 users per story
  - Keep terms listed by >2 users
  - Measure overlap with the generated hierarchy terms
- Precision:
  - Is a hierarchy term useful? Is it correctly placed?
  - A term is precise if >4 users say yes
[Charts: recall and precision results]
Efficient Hierarchy Construction
- After identifying facets, we need to navigate within each facet
  - Subsumption algorithm (Sanderson and Croft, SIGIR 1999; see the sketch below)
- Improved version of the subsumption algorithm
  - For the best parameter values, three times faster than the original subsumption algorithm
  - Good integration with relational databases
  - Extensive experiments
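For reference, the original subsumption rule of Sanderson and Croft with its classic 0.8 threshold, and the naive all-pairs construction it implies; the thesis's improved algorithm is not shown here:

```python
def subsumes(doc_sets, x, y, threshold=0.8):
    """Subsumption test (Sanderson and Croft, SIGIR 1999):
    x subsumes y (x is a parent of y) if documents containing y
    almost always contain x, but not vice versa.

    doc_sets: dict term -> set of ids of documents containing the term.
    """
    dx, dy = doc_sets[x], doc_sets[y]
    both = len(dx & dy)
    p_x_given_y = both / len(dy)   # P(x|y)
    p_y_given_x = both / len(dx)   # P(y|x)
    return p_x_given_y >= threshold and p_y_given_x < p_x_given_y

def build_hierarchy(doc_sets):
    """Naive O(n^2) baseline: attach each term to any subsuming term.
    (The improved version in the thesis avoids this all-pairs scan.)"""
    edges = []
    for x in doc_sets:
        for y in doc_sets:
            if x != y and subsumes(doc_sets, x, y):
                edges.append((x, y))   # x is a candidate parent of y
    return edges
```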
Ranking Methods: Maximize Coverage
- Ranking categories is important and difficult
  - Important: users have limited cognitive ability to understand the presented information
  - Difficult: lack of explicit user goals while browsing
- Frequency-based ranking (baseline)
  - Users first see the categories with the greatest wealth of information
- Set-cover ranking (see the greedy sketch below)
  - Maximizes the cardinality covered by the top-k ranked categories
- Merit-based ranking
  - Ranks higher the categories that enable users to access their contents with the smallest average cost
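A sketch of set-cover ranking via the standard greedy heuristic; the slides do not spell out the algorithm, so the greedy choice is an assumption:

```python
def set_cover_ranking(categories, k):
    """Greedy set-cover ranking: repeatedly pick the category that
    covers the most not-yet-covered items, so the top-k categories
    together cover as many collection items as possible.

    categories: dict category_name -> set of item ids it contains.
    """
    covered, ranking = set(), []
    remaining = dict(categories)
    while remaining and len(ranking) < k:
        # Category with the largest marginal coverage.
        best = max(remaining, key=lambda c: len(remaining[c] - covered))
        if not remaining[best] - covered:
            break                       # nothing new left to cover
        ranking.append(best)
        covered |= remaining.pop(best)
    return ranking
```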
Evaluation Results
- The generation algorithm runs three times faster than the original subsumption algorithm
- Merit-based ranking performs well and offers fast access to the contents of the collection
- Merit-based rankings are efficient to implement on top of relational database systems, while set-cover rankings typically take longer to compute
Task-Based User Study Over News Articles
- Five users, task: “locate news items of interest”
  - Search interface augmented with our facet hierarchies
  - Repeated 5 times (different topics)
- Initially, users did keyword search, then used the facet hierarchies
  - e.g., “War in Iraq” followed by refinements
- Later, users used the facet hierarchies directly, with keywords only afterwards
- Results:
  - Keyword search was gradually reduced by up to 50%
  - Time required to complete each task dropped by 25% (compared to search only)
  - Satisfaction remained statistically steady
Summary of Contributions
- Supervised extraction of facets for collections like Corbis
- Unsupervised discovery of useful facet terms for news
  - Identifying important terms in a document using Wikipedia
  - Deriving important context, useful for facet navigation, using multiple external resources
  - Evaluating the quality and usefulness of the generated facets through extensive user studies with the Amazon Mechanical Turk service
- Efficient hierarchy construction algorithm
  - Ranking alternatives
  - Extensive evaluation
  - Human evaluation to examine the usefulness and effectiveness of hierarchies for free-text collections
Conclusions
- Developed efficient summarization-aware search for Newsblaster
- Integrated time into state-of-the-art retrieval models
  - Time-sensitive queries
  - Temporal relevance
- Developed extraction techniques for useful facets
  - News collections
  - Corbis
- Created an efficient hierarchy construction algorithm with ranking alternatives
- Performed extensive evaluations
Future Work
- Complex user needs
  - Detecting discovery queries
  - Introducing structure and facets into Web search results for such queries
- Using structured data for QA
  - Manually or automatically extracted
  - Using informative and authoritative sources
- Integrating smart views and hierarchies for data representation
- Enhancing snippet generation
- Temporal summaries
- Searching for less tech-savvy users
  - The elderly or newcomers
Part 1: The Topic Facet*
Summarization-Aware Search and Browsing
* Work published in JCDL 2007
Topical Hierarchy of News Events With Machine Summaries
What Makes Search Effective in Newsblaster?
- Informative snippets: summaries highlight the essence of the news to help users navigate
- Browsing ability: users should be able to navigate articles in a format similar to browsing Newsblaster
- Speed: users should not have to wait 12 hours for query results; they should not even wait 12 minutes!
- Quality: users should get relevant results
Summarization-Aware Search and Browsing
- Offline summarization
  - Summaries are query-independent
  - Irrelevant documents and relevant documents might be mixed
- Online summarization
  - Sensitive to summary quality and coverage/coherence
  - Unacceptably high running time
- Hybrid alternative
  - Some offline clusters might be relevant (no summarization needed)
  - Some documents in irrelevant clusters might be relevant
A Hybrid Search Alternative: Reusing Offline Summaries and Clusters When Possible
1. Select an initial set of offline clusters
2. Identify relevant offline clusters using a supervised machine-learning classifier (more details soon)
3. Build online clusters using the relevant documents from irrelevant clusters
4. Rank offline and online clusters
5. Generate summaries for the online clusters among the top-k clusters
6. Return the top-k clusters and their summaries (the full pipeline is sketched below)
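A high-level sketch of this pipeline. Cluster objects are assumed to expose `.docs` and `.summary` (None for new online clusters), and the five component functions are hypothetical stand-ins for the pieces described on the surrounding slides, injected by the caller:

```python
def hybrid_search(query, offline_clusters, k,
                  classify_relevant, matches, cluster_online,
                  rank_clusters, summarize):
    """Hybrid search sketch: reuse offline clusters and summaries when
    possible, and summarize online only the new clusters that reach
    the top-k."""
    # Identify relevant offline clusters with the supervised classifier.
    relevant, irrelevant = classify_relevant(query, offline_clusters)

    # Re-cluster online the documents that match the query but sit
    # inside irrelevant offline clusters.
    stray_docs = [d for c in irrelevant for d in c.docs if matches(query, d)]
    online_clusters = cluster_online(stray_docs)

    # Rank all clusters, offline and online, and keep the top-k.
    ranked = rank_clusters(query, relevant + online_clusters)[:k]

    # Summarize only the online clusters; offline clusters reuse
    # their precomputed summaries.
    for cluster in ranked:
        if cluster.summary is None:
            cluster.summary = summarize(cluster.docs)

    return ranked
```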
Identifying Relevant Offline Clusters
- Classification task: given a query and a set of clusters, identify the clusters that are relevant to the query
- Cluster-level features:
  - (Aggregate) Okapi similarity of the cluster documents and the query
  - (Aggregate) Okapi similarity of the cluster document titles and the query
  - Okapi similarity of the cluster summary and the query
  - “Recall”: fraction of the overall matching documents that are in the cluster
  - “Precision”: fraction of the cluster documents that match the query
  - …
- Query-level features:
  - Number of “matching” documents in the collection
  - Number of “retrieved” clusters
  - Average size of the retrieved clusters
  - (Aggregate) Okapi similarity of the query and the summaries of the retrieved clusters
  - …
- Further details are omitted from this talk
Step 4: Ranking All Clusters (New and Old)
- Not specific to hybrid search, but an essential part of it
- Only the top few clusters are returned to users
- We need to summarize online only the new clusters among the top clusters for the query
- Alternative ranking strategies (the first two are sketched below):
  - By the average Okapi score of the matching documents in the cluster
  - By the maximum Okapi score of the matching documents in the cluster
  - By the distance of the document with the highest Okapi score to the cluster “centroid”
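A sketch of the first two strategies, assuming the Okapi BM25 scores have already been computed per matching document:

```python
def rank_clusters_by_okapi(clusters, okapi_scores, strategy="avg"):
    """Rank clusters by an aggregate of the Okapi scores of their
    matching documents (two of the strategies listed above).

    clusters     : dict cluster_id -> list of matching document ids.
    okapi_scores : dict document id -> Okapi score for the query.
    """
    def key(cid):
        scores = [okapi_scores[d] for d in clusters[cid]]
        return max(scores) if strategy == "max" else sum(scores) / len(scores)

    return sorted(clusters, key=key, reverse=True)
```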
Evaluation Questions
- Result quality: how accurate are the documents and summaries?
  - Document P@k and summary P@k
- Usefulness: how helpful are summaries for leading readers to relevant documents?
  - NDCG (normalized discounted cumulative gain)
- Efficiency: how efficient are our techniques?
  - Response time
- Evaluation settings
  - Data set: several days of Newsblaster
  - Labeling: Amazon Mechanical Turk, a service for distributing small tasks to a large number of users, paying a few cents per micro-task
Quality of Documents and Summaries in Results
[Charts: document P@20 and summary P@k]
- HybridOkapi: at least as good as the state-of-the-art flat-list search
- Careful use of offline clusters does not damage overall accuracy
- HybridOkapi and OnOkapi on average returned more relevant summaries than OffDocOkapi
Usefulness of Summaries in Results
- Can MTurk annotators use the summaries to predict the perfect ranking?
  - Top-3 summaries of each technique shown to 5 annotators
  - NDCG used to measure the quality of the ranking; NDCG = 1 means a perfect ranking (see the sketch below)
- HybridOkapi and OnOkapi summaries substantially outperform OffDocOkapi summaries
  - OffDocOkapi summaries are computed in a query-independent fashion
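For reference, one common formulation of NDCG (some variants use a 2^rel − 1 gain instead of the raw grade):

```python
import math

def ndcg(relevances, ideal=None):
    """Normalized discounted cumulative gain for one ranked list.

    relevances: relevance grades in the order the system ranked them.
    ideal     : grades in the best possible order (defaults to sorted).
    NDCG = 1 means the ranking matches the ideal ordering.
    """
    def dcg(grades):
        return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

    ideal = sorted(relevances, reverse=True) if ideal is None else ideal
    return dcg(relevances) / dcg(ideal) if dcg(ideal) > 0 else 0.0
```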
Efficiency of Producing Search Results
- Offline summaries are attractive when response time is important
- Online summaries take too much time (>200 seconds)
- HybridOkapi: result quality better than offline, almost as good as online, but in significantly less time
- Careful use of offline clusters does not damage overall result accuracy and substantially reduces the cost of summarization at query-execution time
Contributions
- Definition of the search strategy
  - Defining a rich feature set for cluster classification
  - Defining cluster-ranking strategies
- Evaluation
  - Collecting user relevance judgments for clusters and documents
  - Validating the effectiveness of the cluster classifier
  - Validating the efficiency and accuracy of the summarization-aware strategies
  - Validating summary-level result quality