Keyword Searching and Browsing in Databases using BANKS
Charuta Nakhe
Joint work with: Arvind Hulgeri, Gaurav Bhalotia, Soumen Chakrabarti, S. Sudarshan
I.I.T. Bombay
6/6/2021

Motivation
- Keyword search of documents on the Web has been enormously successful
  - Simple and intuitive, no need to learn any query language
- Database querying using keywords is desirable
  - SQL is not appropriate for casual users
  - Form interfaces are cumbersome:
    - Require a separate form for each type of query, which is confusing for casual users of Web information systems
    - Not suitable for ad hoc queries

Motivation
- Many Web documents are dynamically generated from databases
  - E.g. catalog data
- Keyword querying of generated Web documents
  - May miss answers that need to combine information on different pages
  - Suffers from duplication overheads

Examples of Keyword Queries
- On a railway reservation database
  - "mumbai bangalore"
- On a university database
  - "database course"
- On an e-store database
  - "camcorder panasonic"
- On a book store database
  - "sudarshan databases"

Differences from IR/Web Search
- Related data split across multiple tuples due to normalization
  - E.g. Paper(paper-id, title, journal), Author(author-id, name), Writes(author-id, paper-id, position), Cites(citing-paper-id, cited-paper-id)
- Different keywords may match tuples from different relations
  - What joins are to be computed can only be decided on the fly

Connectivity
- Tuples may be connected by
  - Foreign key and object references
  - Inclusion dependencies and join conditions
  - Implicit links (shared words), etc.
- Would like to find sets of (closely) connected tuples that match all given keywords

Basic Model
- Database: modeled as a graph
  - Nodes = tuples
  - Edges = references between tuples
    - Foreign key, inclusion dependencies, ...
    - Edges are directed
[Figure: example graph with paper nodes "BANKS: Keyword search…" and "MultiQuery Optimization", writes nodes, and author nodes "Charuta", "S. Sudarshan" and "Prasan Roy"]

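A minimal sketch of this graph model, assuming tuples are identified by plain strings and each reference is given as a (source, target) pair. The identifiers below are hypothetical, modeled on the slide's example; they are not taken from the BANKS implementation.

```python
from collections import defaultdict

def build_graph(references):
    """Build adjacency lists from (source, target) tuple references."""
    forward = defaultdict(list)   # tuple -> tuples it references
    backward = defaultdict(list)  # tuple -> tuples that reference it
    for src, dst in references:
        forward[src].append(dst)
        backward[dst].append(src)
    return forward, backward

# Hypothetical tuple identifiers: each "writes" tuple references one paper
# and one author through its foreign keys.
references = [
    ("writes:1", "paper:MultiQuery Optimization"),
    ("writes:1", "author:S. Sudarshan"),
    ("writes:2", "paper:MultiQuery Optimization"),
    ("writes:2", "author:Prasan Roy"),
]
forward, backward = build_graph(references)
```

Keeping the reverse adjacency list alongside the forward one makes the indegree of a node available as `len(backward[node])`.
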
Answer Example
Query: sudarshan roy
[Figure: answer tree rooted at the paper "MultiQuery Optimization", connected through two writes nodes to the authors "S. Sudarshan" and "Prasan Roy"]

The BANKS Answer Model
- Query: set of keywords {k1, k2, ..., kn}
  - Each keyword ki matches a set of nodes Si
- Answer: rooted, directed tree connecting nodes, with one node from each Si
  - Root node has special significance, may be restricted to some relations
    - E.g. relations representing entities, not relationships
  - May include intermediate nodes not in any Si, and is hence a Steiner tree
- Multiple answers
  - Ranking based on proximity + prestige

Edge Directionality
- Some popular tuples are connected to many other tuples
  - E.g. students -> departments -> university
- Popular tuples would create misleading shortcuts from every tuple to every other
  - E.g. every student would be closely linked with every other student via the department/university
- Solution: define different forward and backward edge weights
  - Forward edges: in the direction of the foreign key reference

Edge Weight
- Weight of forward edge based on schema
  - E.g. citation link weights > "writes" link weights
- Weight of backward edge = indegree of the node (number of edges pointing to it)
[Figure: example graph with edges labeled with weights 1 and 3]

Edge Weight Scaling
- Problem: some backward edges have unduly large weights
  - Scale edge weights by using log(1 + raw-edge-weight)
- total-edge-weight = Σ edge-weights
- Edge score E = 1 / total-edge-weight

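The scaling and edge score can be sketched as follows, assuming total-edge-weight is the sum of the log-scaled weights of the edges in one answer tree. The function names are illustrative, and the handling of a zero total is an assumption made here because the slide does not cover that case.

```python
import math

def scaled_weight(raw_weight):
    """Dampen large raw weights: log(1 + raw-edge-weight)."""
    return math.log(1 + raw_weight)

def edge_score(raw_edge_weights):
    """Edge score E = 1 / total-edge-weight for one answer tree."""
    total = sum(scaled_weight(w) for w in raw_edge_weights)
    if total == 0:
        # Assumption: a tree with no edges (or only zero-weight edges)
        # gets the best possible score; the slide does not specify this.
        return math.inf
    return 1.0 / total
```

Under this sketch a tree containing one very heavy backward edge still scores lower than a tree of light edges, but the logarithm keeps the gap from growing linearly with the raw weight.
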
Node Weight
- Nodes have prestige weights too
  - Observation: nodes with intuitively greater prestige tend to have greater indegree
  - Set node weight = indegree
- Problem: nodes with many in-edges result in skewed answers
  - Subdue extreme node weights by using log(1 + indegree)
- Node score N = root-node-weight + Σ leaf-node-weights

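A small sketch of the node score, assuming the leaf contribution is the sum of the log-scaled weights of all leaf nodes in the answer tree. The function names are illustrative, not taken from BANKS.

```python
import math

def node_weight(indegree):
    """Prestige weight of a node, dampened as log(1 + indegree)."""
    return math.log(1 + indegree)

def node_score(root_indegree, leaf_indegrees):
    """Node score N = root-node-weight + sum of leaf-node-weights."""
    return node_weight(root_indegree) + sum(node_weight(d) for d in leaf_indegrees)
```

For example, `node_score(10, [3, 3])` rewards a well-referenced root without letting its indegree dominate the two leaves.
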
Combining Scores
- Problem: how to combine two independent metrics: node weight and edge weight
  - Normalize each to 0-1
  - Combine using weighting factor λ
    - Additive: (1 - λ) E + λ N
    - Multiplicative: E N^λ
- Performance study to compare alternatives and to find reasonable values for λ

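The two combination schemes can be sketched as below. This assumes both scores have already been normalized to the range 0 to 1, and that the weighting factor (called `lam` here) enters as (1 - lam) * E + lam * N in the additive form and as an exponent on N in the multiplicative form.

```python
def combine_additive(e, n, lam):
    """Additive combination: (1 - lam) * E + lam * N."""
    return (1 - lam) * e + lam * n

def combine_multiplicative(e, n, lam):
    """Multiplicative combination: E * N ** lam."""
    return e * n ** lam
```

In both forms `lam = 0` ignores the node score entirely, and larger values give prestige more influence on the final ranking.
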
Finding Answer Trees
- Backward Expanding Search Algorithm:
  - Intuition: find vertices from which a forward path exists to at least one node from each Si
  - Run concurrent single-source shortest-path algorithm from each node matching a keyword
    - Create an iterator for each node matching a keyword
    - Traverse the graph edges in reverse direction
  - Output a node whenever it is on the intersection of the sets of nodes reached from each keyword

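A simplified sketch of the backward expanding idea, not the BANKS implementation. It assumes edges are given as (source, target, weight) triples with non-negative weights and string node identifiers, interleaves the per-node shortest-path searches through one shared priority queue, and reports only the root and its total path cost rather than building the full answer tree. The node names in the usage example are hypothetical.

```python
import heapq
from collections import defaultdict

def backward_expanding_search(edges, keyword_node_sets):
    """Yield (root, cost) whenever a node has been reached from every keyword."""
    # Reverse adjacency: for each node, the nodes with an edge pointing to it.
    incoming = defaultdict(list)
    for src, dst, weight in edges:
        incoming[dst].append((src, weight))

    # One Dijkstra-style search per keyword node, interleaved through a
    # single heap so the globally closest node is always expanded next.
    heap = []
    for i, nodes in enumerate(keyword_node_sets):
        for origin in sorted(nodes):
            heapq.heappush(heap, (0.0, i, origin, origin))

    settled = set()           # (origin, node) pairs already expanded
    best = defaultdict(dict)  # node -> {keyword index: smallest distance}
    while heap:
        dist, i, origin, node = heapq.heappop(heap)
        if (origin, node) in settled:
            continue
        settled.add((origin, node))
        if i not in best[node]:
            best[node][i] = dist
            if len(best[node]) == len(keyword_node_sets):
                # The node lies in the intersection: it can reach every keyword.
                yield node, sum(best[node].values())
        for pred, weight in incoming[node]:
            if (origin, pred) not in settled:
                heapq.heappush(heap, (dist + weight, i, origin, pred))

# Usage: both edge directions are present, all with weight 1 for simplicity.
pairs = [("writes1", "paper"), ("writes1", "sudarshan"),
         ("writes2", "paper"), ("writes2", "roy")]
edges = [(a, b, 1) for a, b in pairs] + [(b, a, 1) for a, b in pairs]
first_root = next(backward_expanding_search(edges, [{"sudarshan"}, {"roy"}]))
```

Because the searches follow edges in reverse, a node is reported only when a forward path exists from it to some node of every keyword set.
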
Backward Expanding Search
Query: sudarshan roy
[Figure: backward expansion from the author nodes "S. Sudarshan" and "Prasan Roy" through writes nodes to the paper "MultiQuery Optimization"]

Result Ordering
- Answer trees may not be generated in relevance order
- Solution:
  - Best-first search across all iterators, based on path length
  - Output answers to a buffer
  - Output highest ranked answer from buffer to user when buffer is full

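The buffering step can be sketched with a fixed-size heap. This is an illustration under stated assumptions: answers arrive as (score, answer) pairs where a higher score means a more relevant answer, and `buffer_size` is at least 1. The function name and the flush at the end of the stream are choices made here, not details from the slides.

```python
import heapq

def reorder_with_buffer(answers, buffer_size):
    """Yield answers in approximate relevance order using a bounded buffer."""
    heap = []  # max-heap on score, implemented by negating the score
    for seq, (score, answer) in enumerate(answers):
        # seq breaks ties so the answers themselves are never compared.
        heapq.heappush(heap, (-score, seq, answer))
        if len(heap) >= buffer_size:
            # Buffer is full: release the highest-ranked answer seen so far.
            yield heapq.heappop(heap)[2]
    while heap:
        # Flush the remaining answers, best first, once the search is done.
        yield heapq.heappop(heap)[2]

generated = [(0.2, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d")]
ordered = list(reorder_with_buffer(generated, buffer_size=2))
```

A larger buffer gives an ordering closer to true relevance order at the cost of a longer wait before the first answer is shown.
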
The BANKS System
- BANKS provides keyword search coupled with extensive browsing facilities
  - Schema browsing + data browsing
  - Graphical display of data
- Implemented using Java + servlets
- Keyword search response times typically 1 to 3 seconds on
  - DBLP database with 100,000 tuples/300,000 edges
  - P3 600 MHz, 512 MB RAM
- Try it out at www.cse.iitb.ac.in/banks/

Example of Browsing in BANKS

Anecdotes
- "Mohan"
  - Returns C. Mohan at top based on prestige (number of papers written)
- "Transaction"
  - Returns Jim Gray's classic paper and textbook as top answers based on prestige (number of citations)
- "Sunita Seltzer"
  - No common papers, but both have papers with Stonebraker: system finds this connection

Effect of Parameters
- Log scaling of edge weights worked well
- (1 - λ) E + λ N versus E N^λ: made little difference
- Best with λ = 0.2 (subdue node weights but not entirely)

Related Work
- DataSpot (DTL)/Mercado Intuifind [VLDB 98]
  - Based on patent by Palmon (filed 1995, granted 1998)
  - Based on hypergraph model, similar answer model to ours
  - Differences: our model of backward link weights and prestige
- Proximity Search [VLDB 98]
  - Different model of proximity based on adding up support
  - No edge weights, prestige, different evaluation algorithm
- Information units (linked Web pages) [WWW10]
  - No directionality, only studied in Web context
- Microsoft DBExplorer (this conference)
  - No ranking, based on SQL generation
  - Addresses efficient construction of text indexes
- Microsoft English Query

Conclusions and Future Work
The next big wave: keyword searching and browsing of databases?
Future work:
- Keyword queries on XML
- Disambiguating queries by selecting
  - Nodes: G. W. Bush: "Bush Jr" or "Bush Sr"
  - Tree structure: "coauthors" or "cites"
- Boolean queries, stemming, thesaurus
- Metadata: column/relation names

Thank You

BANKS Query Result Example
- Result of "Soumen Sunita"

Browsing Features
- Hyperlinks are automatically added to all displayed results
- Template facilities to do a variety of tasks
  - Browsing data by grouping and creating crosstabs
    - E.g. theses grouped by department and year
  - Hierarchical views of data
    - Nested XML style, even on relational data
  - Graphical displays
    - Bar charts, pie charts, etc.
- Templates are generic and can be applied on any data matching assumed schema
  - Can be applied after applying selections
  - New templates can be created by user, interactively

Combining Keyword Search and Browsing
- Catalog searching applications
  - Keywords may restrict answers to a small set, then user needs to browse answers
  - If there are multiple answers, hierarchical browsing required on the answers

The BANKS System
[Figure: architecture diagram: User, connected over HTTP to Web Server + Servlets running BANKS, connected over JDBC to the Database]
- Available on the web, with (part of) DBLP data
  - http://www.cse.iitb.ac.in/banks
- Connects to any database using JDBC
  - JDBC metadata features used to provide schema browsing
- No programming needed for customization
  - Minimal preprocessing of database to create indices and give weights to links
- Extensive set of browsing features