Mining Query Subtopics from Search Log Data Hu
Mining Query Subtopics from Search Log Data Hu et al. , SIGIR’ 12 Presented by Baichuan Li
Outline • • • Problem Two observations Approach Results Applications Conclusion
Problem • Intention != Query • Many queries are ambiguous or multifaceted
Problem
Problem • Xbox – Homepage? – Online game? – Where to buy? • Manchester – Manchester news? – Manchester tourism? – Manchester weather? – Manchester united?
Solution • Identifying the major subtopics of a query from the search log data – Personalized search – Query suggestion – Search result presentation • Clustering • Re-ranking • Diversification
Observations from The Logs • One subtopic per search (OSS) – If a user clicks multiple URLs after submitting a query, then the clicked URLs tend to represent the same subtopic
Observations from The Logs • Subtopic clarification by additional keyword (SCAK) – Users often additional keywords (in most cases, one additional keyword) to a query to expand the query in order to clarify its subtopic – The URLs clicked after searching both with the original and the expanded queries tend to represent the same subtopic – The key word tends to be indicative of the subtopic – E. g. • harry shum microsoft
One Subtopic per Search • Use it as the rule for subtopic identification – 10, 000 groups of multiclicks of individual queries – The rule is accurate when all the URLs within the multi-clicks are about the same sense or face
Accuracy V. S. Frequency
Accuracy v. s. Click Position
Distribution • The queries with higher frequencies in search log data are more likely to have multi-clicks
Subtopic Clarification by Additional Keyword • The keywords ‘microsoft’ and ‘jr’ can be used to represent the two groups (subtopics) respectively
Query Types • • ‘Q’: the query is a single phrase ‘Q + W’: ‘Q’ + a keyword ‘W + Q’: a keyword + ‘Q’ ‘Others’
Subtopic Overlap and URL Overlap • Randomly selected 500 pairs of queries with the forms ‘Q’ and ‘Q + W’ • If subtopics of an expanded query are contained in subtopics of the original query, then there is ‘subtopic overlap’ • Check whether two queries share identical clicked URLs (‘URL overlap’)
Distribution • The more popular (frequent) a query is, the more likely the rule is applicable
Mining Subtopics
Preprocessing
Clustering • Similarity function – The similarity function between URLs ui and uj : – S 1: based on the OSS phenomenon where and denote the vectors of multi-clicks of ui and uj respectively – S 2: based on the SCAK phenomenon – S 3: based on string similarities
SCAK Similarity (S 2) where and denote the vectors of keywords associated with ui and uj respectively
Data • TREC Data: TREC search result diversification track 3 in 2009 • Data. Set. A and Data. Set. B: queries and URLs randomly sampled from the logs of the commercial search engine
Results
Application: Search Result Reranking • The user is first asked which subtopic she is interested in, with the subtopics shown at the top of the results page • When the user selects a subtopic, the URLs belonging to the subtopic will be moved to the top (re-ranked) • The relative order between URLs inside and outside of the subtopic will be kept
Example
Evaluation • Data – Collect search log data of 20, 000 randomly selected searches – Each query has at least two subtopics mined by the method • Results – The average position of last clicked URLs is 3. 41 – Assume the cost for the user to check the subtopics and click one of them is 1. 0 – The average position of last clicked URLs belonging to the same subtopics is 1. 80
Conclusion • Study the problem of query subtopic mining • Discovered two phenomena of user search behavior – one subtopic per search – Subtopic clarification by additional keyword • Design a novel similarity function • Applications on search result reranking (and search result clustering) • Problem – Can only be employed when there is enough search log data
- Slides: 27