Bidirectional Expansion for Keyword Search on Graph Databases

Varun Kacholia, Soumen Chakrabarti, Rushi Desai, Shashank Pandit, S. Sudarshan, Hrishikesh Karambelkar
31 Aug 2005
http://www.cse.iitb.ac.in/banks/

Keyword Search on Graph Representation of Data
- Keyword search on relational, XML, HTML, etc. data
  - BANKS, Discover, DBXplorer, XRank, etc.
- Need to find a (closely) connected set of nodes that together match all given keywords
- Focus of our work
  - Search algorithms to find connections between nodes

Outline
- Data, Query and Response Models
- Backward Search Algorithm
- Bidirectional Search Algorithm
- Experiments
- Related Work
- Conclusions

Graph Data Model
- Data modeled as a directed weighted graph: BANKS [ICDE'02]
  - Can model relational, XML, HTML, etc. data
- E.g., DBLP database
  - Node = tuple
  - Edge = foreign key reference
[Figure: DBLP example with paper nodes ("BANKS: Keyword search…", "Multi-Query Optimization"), writes nodes, and author nodes (Soumen, Sudarshan, Prasan Roy)]
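The tuple-and-foreign-key model above can be sketched in a few lines. This is an illustration only: the table names, keys, and the uniform edge weight of 1.0 are assumptions made for the example, not the weighting scheme BANKS uses.

```python
# Hypothetical DBLP-like tuples: (table, primary key) -> text of the tuple.
tuples = {
    ("paper", "p1"): "Multi-Query Optimization",
    ("author", "a1"): "Sudarshan",
    ("author", "a2"): "Prasan Roy",
    ("writes", "w1"): "",
    ("writes", "w2"): "",
}

# Foreign key references: (referencing tuple, referenced tuple).
foreign_keys = [
    (("writes", "w1"), ("paper", "p1")),
    (("writes", "w1"), ("author", "a1")),
    (("writes", "w2"), ("paper", "p1")),
    (("writes", "w2"), ("author", "a2")),
]


def build_graph(tuples, foreign_keys, weight=1.0):
    """Return an adjacency map: node -> {neighbor: edge weight}.

    One directed edge per foreign key reference; the uniform weight
    is an assumption of this sketch.
    """
    graph = {node: {} for node in tuples}
    for src, dst in foreign_keys:
        graph[src][dst] = weight
    return graph


def keyword_nodes(tuples, keyword):
    """Nodes whose text contains the keyword as a whole word (case-insensitive)."""
    kw = keyword.lower()
    return {node for node, text in tuples.items() if kw in text.lower().split()}


graph = build_graph(tuples, foreign_keys)
```

Here `keyword_nodes(tuples, "roy")` picks out the author tuple for Prasan Roy, which is the kind of starting set the search algorithms on later slides expand from.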

Graph Data Model (2)
- E.g., XML data
    <proceedings>
      <paper id="1">
        <title>Databases</title>
      </paper>
      <paper id="2">
        <title>Keyword Search</title>
        <cite ref="1">Databases</cite>
      </paper>
    </proceedings>
[Figure: graph with nodes proceedings, paper (@id = 1), paper (@id = 2), title, title, and cite]
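A minimal sketch of turning the XML fragment above into nodes and edges, using Python's standard `xml.etree.ElementTree`. Treating `ref` attributes as edges to the element with the matching `id` is an assumption of this sketch, as are the integer node ids and the edge labels "contains" and "ref".

```python
import xml.etree.ElementTree as ET

XML = """<proceedings>
  <paper id="1"><title>Databases</title></paper>
  <paper id="2">
    <title>Keyword Search</title>
    <cite ref="1">Databases</cite>
  </paper>
</proceedings>"""


def xml_to_graph(xml_text):
    """Return (nodes, edges): nodes maps id -> (tag, text); edges are (src, dst, kind)."""
    root = ET.fromstring(xml_text)
    nodes = {}
    edges = []
    by_xml_id = {}      # value of the XML id attribute -> node id
    pending_refs = []   # (node id, referenced XML id), resolved after the walk
    counter = 0

    def visit(elem, parent_id):
        nonlocal counter
        node_id = counter
        counter += 1
        nodes[node_id] = (elem.tag, (elem.text or "").strip())
        if parent_id is not None:
            edges.append((parent_id, node_id, "contains"))
        if "id" in elem.attrib:
            by_xml_id[elem.attrib["id"]] = node_id
        if "ref" in elem.attrib:
            pending_refs.append((node_id, elem.attrib["ref"]))
        for child in elem:
            visit(child, node_id)

    visit(root, None)
    for node_id, ref in pending_refs:
        edges.append((node_id, by_xml_id[ref], "ref"))
    return nodes, edges


nodes, edges = xml_to_graph(XML)
```

The `cite` element ends up with a "ref" edge to the first paper, so the result is a graph rather than a pure tree.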

Response Model
- Response: minimal, rooted tree connecting keyword nodes
  - Undirected: Discover, DBXplorer
  - Directed: BANKS
- E.g., Sudarshan Roy
[Figure: answer tree rooted at paper "Multi-Query Optimization", with writes nodes leading to authors Sudarshan and Prasan Roy]

Response Ranking
- Edge score = EA
  - Smaller tree => higher score
  - E.g., BANKS: EA = 1 / (Σ edge weights)
- Node score = NA
  - Measure of authority of nodes in tree
  - E.g., BANKS: NA = Σ (leaf and root node authorities)
- Overall score = f(EA, NA)
  - E.g., BANKS: f(EA, NA) = EA · NA
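The BANKS example formulas on this slide can be written out directly. The function names and the sample weights and authorities below are made up for illustration.

```python
def edge_score(edge_weights):
    """EA = 1 / (sum of edge weights). Assumes the tree has at least one edge."""
    return 1.0 / sum(edge_weights)


def node_score(leaf_authorities, root_authority):
    """NA = sum of the leaf and root node authorities."""
    return sum(leaf_authorities) + root_authority


def overall_score(edge_weights, leaf_authorities, root_authority):
    """f(EA, NA) = EA * NA, as in the BANKS example on the slide."""
    return edge_score(edge_weights) * node_score(leaf_authorities, root_authority)


# Hypothetical answer tree: four unit-weight edges, two leaves, one root.
score = overall_score([1, 1, 1, 1], leaf_authorities=[0.5, 0.5], root_authority=1.0)
```

With these sample values EA is 0.25 and NA is 2.0, so a tree with more or heavier edges would score lower for the same node authorities.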

Finding Answer Trees
- Backward Expanding Search: BANKS [ICDE'02]
  - Intuition: travel backwards from keyword nodes till you hit a common node
- Query: sudarshan roy
[Figure: paper "Multi-Query Optimization" connected through writes nodes to authors Sudarshan and Prasan Roy]

Backward Search: Algorithm
- Run concurrent single source shortest path iterators from each node matching a keyword
  - Traverse the graph edges in reverse direction
  - Output next nearest node on each get-next() call
- Do best-first search across iterators
  - Output node if in the intersection of sets of nodes reached from each keyword
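The steps above can be sketched as follows. This is a simplified illustration, not the BANKS implementation: the class and function names are hypothetical, the example graph is invented, and the linear scan over iterators stands in for whatever scheduling structure a real system would use.

```python
import heapq


def reverse_graph(graph):
    """graph: node -> {successor: weight}. Returns node -> {predecessor: weight}."""
    rev = {u: {} for u in graph}
    for u, nbrs in graph.items():
        for v, w in nbrs.items():
            rev.setdefault(v, {})[u] = w
    return rev


class ShortestPathIterator:
    """Dijkstra over reversed edges from one keyword node; get_next() yields nearest first."""

    def __init__(self, rev, source):
        self.rev = rev
        self.heap = [(0.0, source)]
        self.done = set()

    def peek(self):
        """Distance of the next node to be output, or None if exhausted."""
        while self.heap and self.heap[0][1] in self.done:
            heapq.heappop(self.heap)
        return self.heap[0][0] if self.heap else None

    def get_next(self):
        if self.peek() is None:
            return None
        dist, node = heapq.heappop(self.heap)
        self.done.add(node)
        for pred, w in self.rev.get(node, {}).items():
            if pred not in self.done:
                heapq.heappush(self.heap, (dist + w, pred))
        return node, dist


def backward_search(graph, keyword_sets, k=10):
    """Return up to k (root, total distance) pairs reached from every keyword."""
    rev = reverse_graph(graph)
    iterators = [(i, ShortestPathIterator(rev, s))
                 for i, nodes in enumerate(keyword_sets) for s in sorted(nodes)]
    reached = {}    # node -> {keyword index: distance}
    emitted = set()
    answers = []
    while len(answers) < k:
        live = [(it.peek(), idx) for idx, (_, it) in enumerate(iterators)
                if it.peek() is not None]
        if not live:
            break
        _, idx = min(live)              # best-first across iterators
        kw, it = iterators[idx]
        node, dist = it.get_next()
        dists = reached.setdefault(node, {})
        dists.setdefault(kw, dist)      # first arrival per keyword is the nearest
        if len(dists) == len(keyword_sets) and node not in emitted:
            emitted.add(node)
            answers.append((node, sum(dists.values())))
    return answers


# Invented example: edges point from the paper through writes nodes to authors.
graph = {
    "paper": {"w1": 1.0, "w2": 1.0},
    "w1": {"sudarshan": 1.0},
    "w2": {"roy": 1.0},
    "sudarshan": {},
    "roy": {},
}
answers = backward_search(graph, [{"sudarshan"}, {"roy"}])
```

On this graph the only node reached backward from both keywords is `paper`, which becomes the root of the single answer. Note that one iterator is created per matching node, which is exactly the cost the next slide criticizes.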

Backward Search: Limitations
- Wasteful exploration of graph:
  - Frequently occurring keywords
  - "Hub" nodes in the graph (high in-degree)
- E.g., "Shashank Sudarshan Database"
[Figure: example graph with schema legend (author, writes, paper) and nodes Shashank, Sudarshan, and Database …]

Bidirectional Search: Motivation

Bidir Search: Intuition
- First cut solution:
  - Don't go backward if keyword matches many nodes
  - Don't go backward if node points to a hub
  - Instead explore forward from other keywords

Bidir Search: Example
- "Shashank Sudarshan Database"
[Figure: example graph with schema legend (author, writes, paper) and nodes Shashank, Sudarshan, and Database …]

Bidir Search: Issues
- What should threshold for not expanding be?
  - Our solution: prioritize expansion of nodes based on spreading activation
    - to penalize frequent keywords and bushy trees
- How to manage exploration in both directions?

Bidir Search: Spreading Activation
- Node with highest activation explored first
- Every node given an initial activation
  - Gives low activation to frequently occurring keywords
[Figure: nodes matching "John", each with initial activation 1/5]

Bidir Search: Spreading Activation (2)
- Node with highest activation explored first
- Activation spread to neighbors (μ = 0.3)
  - Gives low activation to neighbors of hubs
[Figure: node with activation 1/5; labels 0.3 x 1/5 and 0.7 x 1/5 x 1/4]
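A small numeric sketch of the two spreading-activation slides. It assumes one reading of the figure labels: a node matching a keyword with five matches starts at 1/5, keeps μ x 1/5 when expanded, and splits the remaining (1 - μ) x 1/5 equally over its four neighbors. The equal split, the summing of incoming activation, and the function names are assumptions of this sketch; the full paper may define these details differently.

```python
def initial_activation(matching_nodes):
    """Each node matching a keyword starts with 1 / (number of matching nodes)."""
    share = 1.0 / len(matching_nodes)
    return {node: share for node in matching_nodes}


def spread(node, activation, neighbors, mu=0.3):
    """Return a new activation map after expanding `node`.

    Assumed reading of the slide: the node keeps a fraction mu of its
    activation and its neighbors share the remaining 1 - mu equally.
    """
    new = dict(activation)
    if not neighbors:
        return new
    received = activation[node]
    new[node] = mu * received
    per_neighbor = (1.0 - mu) * received / len(neighbors)
    for nbr in neighbors:
        new[nbr] = new.get(nbr, 0.0) + per_neighbor
    return new


# Five nodes match "John"; one of them has four neighbors.
act = initial_activation(["j1", "j2", "j3", "j4", "j5"])
after = spread("j1", act, ["n1", "n2", "n3", "n4"])
```

Under these assumptions `j1` ends near 0.06 and each neighbor near 0.035. A hub with many neighbors would hand each of them a much smaller share, which is the effect the slide describes.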

Bidir Search: Iterators
- How to manage exploration in both directions?
- Single backward iterator + single forward iterator w/ suitable data structures
  - E.g., to keep track of parents of nodes
  - Details in full paper
[Figure: example graph with nodes 1 to 7 and keyword nodes "A" and "B"; each node is labeled [dist from "A", dist from "B"], e.g. [0, ∞], [∞, 0], [1, ∞], [∞, 1], [2, ∞]]

Bidir Search: Algorithm
- Activate matching nodes; insert into backward iterator
- while (iterators are not empty)
  - Choose iterator for expansion in best-first manner
  - Explore node with highest activation
  - Spread activation to neighbors
  - Update path weights (and other data structures)
    - Propagate values to ancestors if necessary
  - Insert nodes explored in the backward direction into the forward iterator /* for future forward exploration */
- Stop when top-k results are produced
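The loop on this slide can be sketched as below. This is a heavily simplified illustration and not the algorithm from the full paper: it runs until both queues are empty instead of stopping at top-k, uses one scalar activation per node, orders the queues by the activation a node had when it was inserted, and splits activation equally among neighbors. All names are hypothetical.

```python
import heapq
import math


def bidirectional_search(graph, keyword_sets, mu=0.3):
    """graph: node -> {successor: weight}. Returns {answer root: total path weight}."""
    n = len(keyword_sets)
    preds = {u: {} for u in graph}
    for u, nbrs in graph.items():
        for v, w in nbrs.items():
            preds.setdefault(v, {})[u] = w

    dist = {}         # node -> best known distance to each keyword
    activation = {}   # node -> scalar activation
    parents = {}      # node -> {explored predecessor: edge weight}
    q_in, q_out = [], []   # backward / forward iterators (max-heaps via negation)
    seen_in, seen_out = set(), set()

    def d(u):
        return dist.setdefault(u, [math.inf] * n)

    def push(queue, seen, u):
        if u not in seen:
            seen.add(u)
            heapq.heappush(queue, (-activation.get(u, 0.0), u))

    def relax(u, v, w):
        """Use edge u -> v to improve u, then propagate to u's known ancestors."""
        parents.setdefault(v, {})[u] = w
        work = [(u, v, w)]
        while work:
            a, b, wt = work.pop()
            improved = False
            for i in range(n):
                if d(b)[i] + wt < d(a)[i]:
                    d(a)[i] = d(b)[i] + wt
                    improved = True
            if improved:
                work.extend((p, a, pw) for p, pw in parents.get(a, {}).items())

    def spread(src, targets):
        if targets:
            share = (1.0 - mu) * activation.get(src, 0.0) / len(targets)
            for t in targets:
                activation[t] = activation.get(t, 0.0) + share

    # Activate matching nodes; insert into the backward iterator.
    for i, nodes in enumerate(keyword_sets):
        for u in nodes:
            d(u)[i] = 0.0
            activation[u] = activation.get(u, 0.0) + 1.0 / len(nodes)
    for nodes in keyword_sets:
        for u in sorted(nodes):
            push(q_in, seen_in, u)

    while q_in or q_out:
        # Best-first choice between the two iterators.
        use_in = bool(q_in) and (not q_out or q_in[0] <= q_out[0])
        if use_in:
            _, v = heapq.heappop(q_in)
            spread(v, list(preds.get(v, {})))
            for u, w in preds.get(v, {}).items():
                relax(u, v, w)
                push(q_in, seen_in, u)
            push(q_out, seen_out, v)   # for future forward exploration
        else:
            _, u = heapq.heappop(q_out)
            spread(u, list(graph.get(u, {})))
            for v, w in graph.get(u, {}).items():
                relax(u, v, w)
                push(q_out, seen_out, v)
    return {u: sum(ds) for u, ds in dist.items() if all(x < math.inf for x in ds)}


# Invented example graph: paper -> writes -> author.
graph = {
    "paper": {"w1": 1.0, "w2": 1.0},
    "w1": {"sudarshan": 1.0},
    "w2": {"roy": 1.0},
    "sudarshan": {},
    "roy": {},
}
roots = bidirectional_search(graph, [{"sudarshan"}, {"roy"}])
```

The `parents` map plays the role of the "keep track of parents of nodes" structure: whenever a node's distance to a keyword improves, the improvement is pushed back along edges that were already explored.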

Bidir Search: top-k results
- Results need not be generated "in-order"
- Naïve solution
  - Store results in an intermediate heap
  - Output top k results after mk total results have been generated (m ~ 10)
- Can do better
  - Compute upper bound on score of next result; output answers with a higher score
  - Similar to NRA algorithm (Fagin et al., PODS'01)
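The naïve solution can be sketched as a small buffer over the result stream. The `(score, answer)` tuple shape and the function name are assumptions for illustration.

```python
import heapq
from itertools import islice


def naive_top_k(results, k, m=10):
    """Buffer the first m*k results from the stream, then return the k best by score.

    `results` is any iterable of (score, answer) pairs in generation order.
    """
    buffered = list(islice(results, m * k))
    return heapq.nlargest(k, buffered, key=lambda r: r[0])


# Made-up stream of results in the order a search might generate them.
stream = [(0.2, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d"), (0.95, "e")]
top = naive_top_k(iter(stream), k=2, m=2)
```

With `k=2` and `m=2` only the first four results are considered, so the late high-scoring result `"e"` is missed. That is the trade-off of the naïve scheme, and the motivation for the bound-based approach in the last bullet.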

Experiments
- Datasets
  - DBLP, IMDB ~ 2 million nodes, 9 million edges
  - US Patent DB ~ 4 million nodes, 15 million edges
- Workload
  - Keywords randomly picked from results of SQL join statements
- Search algorithms
  - MI-Bkwd: original backward search
    - Iterator for every node matching a keyword
  - SI-Bkwd: backward search with single backward iterator
  - Bidirec: bidirectional search
- Time taken/nodes explored
  - Measured when 10th answer is generated (or last answer if #answers < 10)
- Origin size
  - #nodes matched by keywords in the query

Experiments (2)
- MI-Bkwd versus SI-Bkwd
  - SI-Bkwd gain increases with origin size, # keywords

Experiments (3)
- SI-Bkwd versus Bidirec
  - Bidirec gain increases with origin size, # keywords

Experiments (4)
- Precision/Recall experiments
  - Relevant answers are well-defined; can be generated through SQL statements
- Both MI-Backward and Bidirectional show similar performance
  - Recall ~ 100%
  - Precision ~ 100% at near full recall
  - Few irrelevant answers produced before generating all relevant answers
- Bidirectional runs faster, yet minimal loss of relevant results!

Experiments (5)
- Comparison with Sparse: Hristidis et al. [VLDB'03]
  - Generate join expressions leading to query results
  - Use DB-provided scores for ranking tuples and aggregate them to rank answer trees
  - For top-k results: automatically determine required number of join expressions
- Sparse-LB
  - Manually generate required join expressions
  - Sparse needs to do at least this much (and usually a lot more!)
- Bidirectional versus Sparse-LB
  - Bidirectional outperforms by a factor of ~ 3 (esp. when #joins is large)

Experiments (6)
- SI-Bkwd versus Bidirec: by origin size
    A = (T, S, S, S)    B = (M, M, M, M)
    C = (M, L, L, L)    D = (M, M, L, L)
    E = (T, L, L, L)    F = (T, S, M, L)
    G = (T, M, L, L)    H = (T, T, T, L)
- Bidirec gains more with unbalanced origin sizes

Discussion
- Bidirectional search as dynamic per-tuple join ordering
  - Related work in this area: Eddies
- Bidirectional search
  - Schema-less
  - Prioritization based on activation instead of selectivity
  - Generate answers in relevance order

Related Work
- Keyword querying on relational data: Discover (UCSD), DBXplorer (Microsoft)
  - Use SQL generation, without in-memory data structures
  - Issues: generate join plans, re-use common sub-expressions, etc.
- Keyword querying on XML
  - XRank (Cornell), Schema-Free XQuery (Michigan), …
  - Tree model is too limited
- ObjectRank

Conclusions
- Graph model
  - Convenient common denominator representation
- Schema-free querying leads to graph search
  - Purely backward strategy inadequate
  - Bidirectional search with spreading activation performs much better
  - Dynamically choose join order on per-tuple basis

Thank You! Questions?

Future of Keyword Search in DBs
- Next generation of intelligent search will require context information
  - E.g. search email, files, calendar, ..
  - Information integration will be important
- Is there a killer app?
  - Graph structured data will be a key component
  - Deep web?
- Display of answers
  - Users don't want to see schema details
  - Can we leverage off existing (Web) apps?

BANKS Future Work
- Applications of BANKS
  - Exploit BANKS to integrate different sources of data
    - Extract information, infer soft links
  - BANKS for personal information management
    - Soumen Chakrabarti, Sunita Sarawagi and students
    - SPIN: Search Personal Information Networks
- Ongoing/future work on BANKS:
  - More sysadmin/user control on ranking
    - One size does not fit all
    - BANKS provides infrastructure
  - Characterize bidirectional search better
    - And find other applications
  - Security

Bidir Search: top-k results (2)
- Compute upper bound on score of next result; output answers with a higher score
- Computing the bound
  - m_i = minimum path length explored backward from keyword i
  - unseen answer node: 1/(m_1 + m_2 + … + m_n)
  - visited answer node: suppose reached from first x keywords with distance d_i
    - 1/[(d_1 + d_2 + … + d_x) + (m_{x+1} + m_{x+2} + … + m_n)]
    - combine this with max node prestige
- We simply use
  - 1/(m_1 + m_2 + … + m_n)!
  - Experiments show no significant loss in using this heuristic
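The two bounds can be written out as a short sketch. It covers only the edge-score part: the "combine this with max node prestige" step is left out, and returning infinity when all m_i are zero is an assumption added to avoid dividing by zero. Function names and sample numbers are made up.

```python
import math


def unseen_bound(m):
    """Bound for an answer node not yet seen: 1 / (m_1 + ... + m_n)."""
    total = sum(m)
    return math.inf if total == 0 else 1.0 / total


def visited_bound(d, m):
    """Bound for a node already reached from the first x keywords.

    d holds the x known distances d_1..d_x; m holds m_1..m_n, of which
    only m_{x+1}..m_n are used.
    """
    x = len(d)
    total = sum(d) + sum(m[x:])
    return math.inf if total == 0 else 1.0 / total


def releasable(buffered, m):
    """Buffered (score, answer) pairs scoring above the simple bound, best first."""
    bound = unseen_bound(m)
    return sorted((r for r in buffered if r[0] > bound), reverse=True)


# Hypothetical state: three keywords with frontier distances 2, 3 and 5.
m = [2, 3, 5]
ready = releasable([(0.25, "t1"), (0.08, "t2"), (0.125, "t3")], m)
```

With these numbers the simple bound is about 0.1, so `t1` and `t3` can be output immediately while `t2` stays buffered until the frontier distances grow.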