Combining Keyword Search and Forms for Ad Hoc
Combining Keyword Search and Forms for Ad Hoc Querying of Databases Eric Chu, Akanksha Baid, Xiaoyong Chai, An. Hai Doan, Jeffrey Naughton University of Wisconsin-Madison
More and more untrained users querying DBMSs n n n E-commerce applications Structured wikipedia (e. g. , DBpedia) Increasing demand for richer queries q n Attr = value, range, sorting, aggregation, etc. Writing SQL queries doesn’t work q Need to know SQL and the schema SIGMOD 2009 2
Keyword Search over Databases n Input: Keywords Output: Ranked list of joined-tuples containing the keywords Has made much progress in recent years n Disadvantage: limited query expressiveness n n q q q Field selection Range queries Aggregation SIGMOD 2009 3
Augmenting Keyword Search n n Augment keyword search with new constructs Field selection not so bad q q n n “Attr = value” But users need to know field names Now consider adding range queries, aggregation… Keyword search becomes a new language for a subset of SQL SIGMOD 2009 4
Building Query with Forms Is “Finding publications by UW-Madison researchers who Simple n are originally from Greece” SIGMOD 2009 5
Making Forms Support Many Queries n Arbitrarily customizable q n Example: QBE q n Users can select tables, columns, values… Filling in values in skeleton tables Ultimately it’s close to asking them to generate SQL again SIGMOD 2009 6
So, we need more-specific forms n Forms that are closer to the user intent n But we would need many forms to support many queries n How do we get the correct form to a user? SIGMOD 2009 7
Our Approach Combining keyword search and query forms: n Offline: q n Generate and index (potentially many) forms At query time: 1. User submits keyword query 2. System returns relevant forms 3. User selects desired form to finish query SIGMOD 2009 8
Challenges n Form generation q q n How specific/general should forms be? How to systematically generate forms and good form descriptions? Keyword search over forms q q q What makes forms different from documents? What issues arise in retrieval and ranking? Do users find it useful? SIGMOD 2009 9
Challenges n Form generation q q n How specific/general should forms be? How to systematically generate forms and good form descriptions? Keyword search over forms q q q What makes forms different from documents? What issues arise in retrieval and ranking? Do users find it useful? SIGMOD 2009 10
Forms VS Documents n Forms have parameters n Query keywords could be q q Terms on a form Parameter values n Not part of the query until users specify them SIGMOD 2009 11
“Naïve” Keyword Search n Query: “author Madison” q q n Naïve-AND: return forms with ALL keywords q n No results Naïve-OR: Return forms with ANY keywords q n Author => a term on a form Madison => a data value “Madison” is ignored Put data values on forms q High storage and maintenance costs SIGMOD 2009 12
Solution: Query Rewrite n n If query Q contains data value d and d is in relation R, rewrite Q to consider R “author Madison” q n Madison is in tables conference, publication, … Alternatives q q q DI-OR DI-AND DIJ SIGMOD 2009 13
DI-OR: Query rewrite with OR semantics n DI-OR q q n Example q q n n Create Q’ = Q + R Then search forms with Q’ using OR-semantics Q: “author Madison” Q’: “author Madison conference publication” Handles terms that refer to data values Results often too inclusive SIGMOD 2009 14
DI-AND: Query rewrite with AND semantics n Example q q q n Q: “Eric Madison” “Eric” => author “Madison” => conference, publication Enumerate new queries using AND semantics: q q “author AND conference” “author AND publication” SIGMOD 2009 15
“Dead” Forms n n Some returned forms give empty results with respect to the keywords Example: a table referenced by many q q q n Person (id, name, …) Tutorial(rid, pid, cid) Conference. Talk(rid, pid, cid) Serve. Conf(rid, pid, …) Write. Pub(rid, pid, …) “Eric” => forms for all 5 tables q But Eric has only written a paper… SIGMOD 2009 16
DIJ: Filtering “Dead” Forms n Example q n On forms having Person table q q n “Eric” => Table = Person, PID = P 1 Check if other tables referencing Person have tuples with PID = 1 Write. Pub(W 7, P 1, …) Return forms for Person and Write. Pub tables SIGMOD 2009 17
Ranking n Using only Lucene’s TF-IDF function is not good enough n Many similar forms q q q Similar form summaries For a given query, similar forms often have same ranking scores When query is not very specific, the best form may be hidden in a bunch of logically similar forms SIGMOD 2009 18
Returning a flat list of forms is unclear The query “dewitt” returns a list of 210 forms SIGMOD 2009 19
Presenting groups of forms in the results n n 1 st-level group: consecutive forms having the same score and based on the same relation. In each first-level group, group forms by the types of queries they support q n Select-From-Where, Aggregation, Union/Intersect Display 2 nd level groups of forms in fixed order q Forms in the same 1 st level group have the same ranking scores. SIGMOD 2009 20
Returning a flat list of forms is unclear Instead of showing a flat list of 210 forms SIGMOD 2009 21
Returning groups of forms n n Shows 23 groups of logically similar forms Users can drill down a “right” group to find the “right” form SIGMOD 2009 22
User Study n Data Set: DBLife q n n n 7 CS grad students 6 information needs Alternatives q n 5 entity tables, 9 relationship tables, 196 forms Naive-OR, Naïve-AND, DI-OR, DI-AND, DIJ Observing q q q # forms returned Rank of “right” form Time SIGMOD 2009 23
Information Needs 1. Find all people who have given a tutorial at VLDB. 2. Find topics of areas related to Jeff Naughton. 3. Find people who have served as the SIGMOD PC chair. 4. Find the first author of all papers cited more than 5 times. (Range query) 5. Find the number of people who have co-authored a paper with David Dewitt. (Count query) 6. Find people who have published with David De. Witt or Jeff Naughton (Union query) SIGMOD 2009 24
Queries of a User n n n Q 1: Q 2: Q 3: Q 4: Q 5: Q 6: “tutorial vldb” “jeff naughton research area” “sigmod chair” (data terms only) “paper citation” “david dewitt coauthor” “dewitt naughton” (data terms only) SIGMOD 2009 25
Comparing # Forms Returned Number of Forms Returned Naive-OR Naive. And DI-OR DI-AND DIJ Q 1 14 0 168 42 42 Q 2 28 0 182 28 28 Q 3 0 0 142 28 28 Q 4 28 28 142 28 28 Q 5 14 0 196 14 14 Q 6 0 0 196 182 168 SIGMOD 2009 26
DI-AND VS DIJ on # Forms Returned n n DIJ eliminates dead forms # dead forms depends on the specific schema and query Average Number of Forms Returned DI-AND DIJ T 1 T 2 T 3 T 4 T 5 T 6 44 44 48 46 38 38 28 28 129 116 64 56 SIGMOD 2009 27
Flat VS Group Ranks n Highest, median, and lowest based on 7 users T 1 T 2 T 3 T 4 T 5 T 6 H 1 1 4 1 Flat Rank M L 1 1 1 69 1 1 15 15 21 21 12 12 #F 44 46 38 28 116 56 SIGMOD 2009 H 1 1 1 Group Rank #G M L 1 1 3. 14 1 7 3. 7 1 1 2. 7 2 2 4 11. 57 1 6 4 28
Breakdown of End-to-End Time n Time by 7 users on 6 information needs Find the Pose query right (sec) form (sec) 7. 0 12. 3 T 1 7. 5 23. 9 T 2 7. 5 18. 0 T 3 79. 7 T 4 12. 0 46. 9 T 5 19. 0 64. 0 T 6 14. 0 Fill out Total Standard the average Median Deviation form time (sec) 5. 3 14. 8 25. 6 15. 2 7. 7 15. 2 24. 6 46. 1 51. 1 106. 9 73. 6 93. 2 SIGMOD 2009 13. 1 48. 0 31. 4 56. 6 29. 9 47. 8 23. 0 26. 0 36. 0 123. 0 80. 0 78. 0 29
Conclusion n Help untrained users pose wide variety of structured queries q n n Generating forms for wide variety of queries Keyword search of forms q n n Keyword search => forms Query rewrite to handle parameter values Presenting forms in groups Many issues should be further explored SIGMOD 2009 30
- Slides: 30