Keyword Search on Graph-Structured Data
S. Sudarshan, IIT Bombay
Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi K., Arvind Hulgeri, Bhavana Dalvi and Meghana Kshirsagar
Jan 2009

Outline
  • Motivation and Graph Data Model
  • Query/Answer models
    • Tree answer model
    • Proximity queries
  • Graph Search Algorithms
    • Backward Expanding Search
    • Bidirectional Search
    • Search on external memory graphs
  • Conclusion

Keyword Search on Semi-Structured Data
  • Keyword search of documents on the Web has been enormously successful
  • Much data is resident in databases
    • Organizational, government, scientific, medical data
    • Deep web
  • Goal: querying of data from multiple data sources, with different data models
    • Often with no schema or partially defined schema

Keyword Search on Structured/Semi-Structured Data
  • Key differences from IR/Web Search:
    • Normalization (implicit/explicit) splits related data across multiple tuples
    • To answer a keyword query we need to find a (closely) connected set of entities that together match all given keywords
      • e.g. "soumen crawling" or "soumen byron"
  [Figure: paper nodes "BANKS: Keyword search…" and "Focused Crawling…" connected through "writes" tuples to author nodes "Sudarshan", "Soumen C." and "Byron Dom"]

Graph Data Model
  • Lowest common denominator across many data models
    • Relational: node = tuple, edge = foreign key
    • XML: node = element, edge = containment/idref/keyref
    • HTML: node = page, edge = hyperlink
    • Documents: node = document, edge = links inferred by data extraction
    • Knowledge representation: node = entity, edge = relationship
    • Network data: e.g. social networks, communication networks
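The relational mapping (node = tuple, edge = foreign key) can be illustrated with a small sketch. The table layout, the `relational_to_graph` name and the `(table, column, ref_table)` foreign-key format are assumptions made for this example, not part of BANKS.

```python
def relational_to_graph(tables, foreign_keys):
    """Map relational data to a graph: one node per tuple, one edge per foreign key.

    tables: {table_name: {primary_key: row_dict}}            (hypothetical layout)
    foreign_keys: [(table, column, referenced_table), ...]   (hypothetical layout)
    """
    nodes = {}
    edges = []
    for table, rows in tables.items():
        for pk, row in rows.items():
            nodes[(table, pk)] = row  # node id = (table name, primary key)
    for table, column, ref_table in foreign_keys:
        for pk, row in tables[table].items():
            # Directed edge from the referencing tuple to the referenced tuple.
            edges.append(((table, pk), (ref_table, row[column])))
    return nodes, edges


tables = {
    "author": {1: {"name": "Soumen C."}},
    "paper": {10: {"title": "Focused Crawling"}},
    "writes": {100: {"author_id": 1, "paper_id": 10}},
}
foreign_keys = [("writes", "author_id", "author"), ("writes", "paper_id", "paper")]
nodes, edges = relational_to_graph(tables, foreign_keys)
```

Here the "writes" tuple becomes a node with one outgoing edge to its author and one to its paper, which is the shape used in the answer-tree examples.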

Graph Data Model (Cont)
  • Nodes can have
    • labels, e.g. relation name, or XML tag
    • textual or structured (attribute-value) data
  • Edges can have labels


Query/Answer Models
  • Basic query model:
    • Keywords match node text/labels
    • Can extend query model with attribute specification, path specifications
      • e.g. paper(year < 2005, title: xquery)
  • Alternative answer models
    • tree connecting nodes matching query keywords
    • nodes in proximity to (near) query keywords

Tree Answer Model
  • Answer: rooted, directed tree connecting keyword nodes
    • In general, a Steiner tree
    • Multiple answers possible
  • Answer relevance computed from
    • answer edge score combined with
    • answer node score
  • E.g. "Soumen Byron"
  [Figure: answer tree rooted at paper "Focused Crawling", with two "writes" nodes leading to authors "Soumen C." and "Byron Dom"]

Answer Ranking
  • Naïve model: answers ranked by number of edges
  • Problem:
    • Some tuples are connected to many other tuples
      • E.g. highly cited papers, popular web sites
    • Highly connected tuples create misleading shortcuts
      • six degrees of separation
  • Solution: use directed edges with edge weights
    • allow answer tree to have edge u → v if original graph has v → u, but at higher cost

Edge Weight Model-1
  • Forward edge weight (edge present in data)
    • Default to 1, can be based on schema
    • Lower weight → closer connection
  • Create extra backward edges v → u for each edge u → v present in data
    • Edge weight log(1 + #edges pointing to v)
  • Overall answer-tree edge score E_A = 1 / (Σ edge weights)
    • Higher score → better result
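A minimal sketch of this weighting scheme follows. The slide does not state the base of the logarithm, so base 2 is an assumption here, and the function names are illustrative only.

```python
import math
from collections import Counter


def build_weighted_edges(data_edges, forward_weight=1.0):
    """Return (source, target, weight) triples for forward and backward edges.

    data_edges: list of (u, v) pairs for edges u -> v present in the data.
    Assumption: log base 2 (the slide only says "log").
    """
    indegree = Counter(v for _, v in data_edges)
    weighted = []
    for u, v in data_edges:
        weighted.append((u, v, forward_weight))              # forward edge u -> v
        weighted.append((v, u, math.log2(1 + indegree[v])))  # backward edge v -> u
    return weighted


def edge_score(tree_edge_weights):
    """E_A = 1 / (sum of edge weights); higher means a closer connection."""
    return 1.0 / sum(tree_edge_weights)


weighted = build_weighted_edges([("w1", "paper"), ("w2", "paper"), ("w3", "paper")])
```

With three edges pointing to "paper", each backward edge out of "paper" costs log2(1 + 3) = 2, so trees that pass backward through heavily referenced nodes score lower.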

Edge Weight Model-2
  • Probabilistic edge scoring model
    • Edge traversal probability (from a given node):
      • Forward: 1/out-degree
      • Backward: 1/in-degree
      • Can be weighted by edge type
    • Path weight = probability of following each edge in path
    • Edge score = log(edge traversal probability)
    • Answer-tree edge score E_A = (harmonic) mean of path weights from root to each leaf
  • Note:
    • other edge weight models possible
    • our search algorithms are independent of how edge weights are computed
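One possible reading of the probabilistic model, as a sketch: the path weight is taken to be the product of the traversal probabilities along the path (an assumption; the slide only says "probability of following each edge in path"), and the answer-tree score is the harmonic mean of the root-to-leaf path weights. Function names are illustrative.

```python
import math
from statistics import harmonic_mean


def forward_probability(out_degree):
    """Probability of following one particular forward edge from a node."""
    return 1.0 / out_degree


def backward_probability(in_degree):
    """Probability of following one particular backward edge from a node."""
    return 1.0 / in_degree


def path_weight(edge_probabilities):
    """Assumed: probability of following every edge on the path, in sequence."""
    return math.prod(edge_probabilities)


def answer_edge_score(root_to_leaf_paths):
    """Harmonic mean of the path weights from the root to each leaf."""
    return harmonic_mean([path_weight(p) for p in root_to_leaf_paths])


# Two root-to-leaf paths with weights 0.5 and 0.25.
score = answer_edge_score([[0.5, 1.0], [0.5, 0.5]])
```

The harmonic mean is dominated by the weakest path, so one poorly connected keyword lowers the score of the whole tree.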

Node Weight
  • Node prestige based on indegree
    • More incoming edges → higher prestige
  • PageRank-style transfer of prestige
    • Node weight computed using biased random walk model
  • Node weight: function of node prestige, other optional criteria such as TF/IDF
  • Answer-tree node score N_A = root node weight + Σ leaf node weights
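As a rough illustration of prestige transfer, the sketch below runs plain PageRank by power iteration and then computes N_A. It is a simplification: the biased random walk used in BANKS is not reproduced, and the damping factor 0.85 and the function names are assumptions.

```python
def prestige(out_edges, damping=0.85, iterations=50):
    """Plain PageRank by power iteration (stand-in for the biased random walk).

    out_edges: {node: [successor, ...]}; every node must appear as a key.
    """
    nodes = list(out_edges)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for u in nodes:
            targets = out_edges[u] or nodes  # dangling node: spread uniformly
            share = damping * rank[u] / len(targets)
            for v in targets:
                nxt[v] += share
        rank = nxt
    return rank


def node_score(weights, root, leaves):
    """N_A = root node weight + sum of leaf node weights."""
    return weights[root] + sum(weights[leaf] for leaf in leaves)


rank = prestige({"p1": ["p3"], "p2": ["p3"], "p3": []})
```

In this toy citation graph "p3" is referenced by both other papers and ends up with the highest prestige.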

Overall Tree Answer Score
  • Overall score of answer tree A
    • combine tree and node scores
    • for details, and recall/precision metrics, see BANKS papers in ICDE 2002 and VLDB 2005
  • Anecdotal results on DBLP Bibliography
    • "Transaction": Jim Gray's classic paper and textbook at the top because of prestige (# of citations)
    • "soumen sudarshan": several coauthored papers, links via common co-authors
    • "goldman shivakumar hector": the VLDB 98 proximity search paper, followed by citation/coauthor connections

Answer Models
  • Tree Answer Model
  • Proximity (near query) model

Proximity Queries
  • Node weight by proximity
    • author (near olap) (on DBLP)
    • faculty (near earthquake) (on IITB thesis database)
  • Node prestige is higher if the node is close to multiple nodes matching the near keywords
  • Example applications
    • Finding experts on a particular area
  [Figure: author nodes "Raghu" and "Widom" and paper nodes "OLAP over uncertain…", "Computing sparse cubes…", "Overview of OLAP…", "Allocation in OLAP…"]

Proximity via Spreading Activation
  • Idea:
    • Each "near" keyword has activation of 1
      • Divided among nodes matching keyword, proportional to their node prestige
    • Each node
      • keeps fraction 1-μ of its received activation and
      • spreads fraction μ amongst its neighbors
    • Combine activation a_i received from neighbors
      • a = 1 - Π(1 - a_i) (belief function)
    • Graph may have cycles
      • Iterate till convergence
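The idea can be sketched as a fixed-point iteration. The details below are assumptions chosen to keep the example small: the graph is undirected, a node's spread fraction μ is split evenly among its neighbors, and the seed activation is treated as one more input to the belief combination.

```python
import math


def initial_activation(matching_nodes, prestige):
    """Divide an activation of 1 among matching nodes, proportional to prestige."""
    total = sum(prestige[n] for n in matching_nodes)
    return {n: prestige[n] / total for n in matching_nodes}


def spread_activation(neighbors, seeds, mu=0.5, tol=1e-9, max_iters=1000):
    """Iterate a = 1 - prod(1 - a_i) until the received activations stop changing.

    neighbors: {node: [neighbor, ...]} for an undirected graph (assumed).
    Returns the activation received by each node; a node would keep the
    fraction (1 - mu) of it.
    """
    act = {n: seeds.get(n, 0.0) for n in neighbors}
    for _ in range(max_iters):
        new = {}
        for v in neighbors:
            parts = [seeds.get(v, 0.0)]
            # Each neighbor u passes on fraction mu, split evenly over its neighbors.
            parts += [mu * act[u] / len(neighbors[u]) for u in neighbors[v]]
            new[v] = 1.0 - math.prod(1.0 - p for p in parts)
        if max(abs(new[v] - act[v]) for v in neighbors) < tol:
            return new
        act = new
    return act


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
seeds = initial_activation(["a"], {"a": 2.0})
act = spread_activation(graph, seeds)
```

On the path a - b - c with the keyword matching only "a", activation falls off with distance from the keyword node, and the cycle a ↔ b is handled by iterating to convergence.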

Example Answers
  • Anecdotal results on DBLP Bibliography
    • author (near recovery): Dave Lomet, C. Mohan, etc.
    • sudarshan (near change): Sudarshan Chawathe
    • sudarshan (near query): S. Sudarshan
  • Queries can combine proximity scores with tree scores
    • hector sudarshan (near query) vs. hector sudarshan
    • author (near transactions) data integration

Related Work
  • Proximity Search
    • Goldman, Shivakumar, Venkatasubramanian and Garcia-Molina [VLDB 98]
      • Considers only shortest path from each node, aggregates across nodes
      • Our version aggregates evidence from alternative paths
        • E.g. author (near "Surajit Chaudhuri")
  • ObjectRank [VLDB 04]
    • Similar idea to ours, precomputed

Related Work
  • Keyword querying on relational databases
    • DBXplorer (Microsoft, ICDE 02), Discover (UCSD, VLDB 02, VLDB 03)
    • Use SQL generation, not applicable to arbitrary graphs
    • ranking based only on #nodes/edges
  • Keyword querying on XML: Tree Model
    • XRank (Cornell, SIGMOD 03), proximity in XML (AT&T Research, VLDB 03), Schema-Free XQuery (Michigan, VLDB 04)
    • Tree model is too limited
  • Keyword querying on XML: Graph Model
    • XKeyword (UCSD, ICDE 03, VLDB 03), SphereSearch (Max Planck, VLDB 05)
    • ranking based only on #nodes/edges


Finding Answer Trees
  • Backward Expanding Search Algorithm (Bhalotia et al, ICDE 02):
    • Intuition: find vertices from which a forward path exists to at least one node from each Si, where Si is the set of nodes matching keyword i
    • Run concurrent single source shortest path algorithm from each node matching a keyword
      • Create an iterator for each node matching a keyword
      • Traverse the graph edges in reverse direction
      • Output next nearest node on each get-next() call
    • Do best-first search across iterators
    • Output an answer when its root has been reached from each keyword
    • Answer heap to collect and output results in score order
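A much-simplified sketch of backward expansion is shown below. To stay short it uses one Dijkstra-style iterator per keyword rather than one per keyword node, interleaves the iterators through a single global heap, and ranks roots by the sum of their path lengths instead of the full BANKS score. All names and the edge format are assumptions for illustration.

```python
import heapq
from collections import defaultdict


def backward_search(edges, keyword_nodes):
    """Find answer roots by expanding backward from the keyword nodes.

    edges: iterable of (u, v, weight) for directed edges u -> v.
    keyword_nodes: list of sets; keyword_nodes[i] holds nodes matching keyword i.
    Returns a sorted list of (total path length, root).
    """
    reverse = defaultdict(list)
    for u, v, w in edges:
        reverse[v].append((u, w))  # traverse edges in the reverse direction

    k = len(keyword_nodes)
    dist = [dict() for _ in range(k)]  # settled distance to keyword i per node
    heap = []
    for i, nodes in enumerate(keyword_nodes):
        for n in nodes:
            heapq.heappush(heap, (0.0, i, n))

    answers = []
    while heap:
        d, i, n = heapq.heappop(heap)  # best-first across all iterators
        if n in dist[i]:
            continue
        dist[i][n] = d
        if all(n in dist[j] for j in range(k)):
            # n has now been reached from every keyword: it roots an answer.
            answers.append((sum(dist[j][n] for j in range(k)), n))
        for u, w in reverse[n]:
            if u not in dist[i]:
                heapq.heappush(heap, (d + w, i, u))
    return sorted(answers)


edges = [("paper", "w1", 1), ("w1", "soumen", 1),
         ("paper", "w2", 1), ("w2", "byron", 1)]
result = backward_search(edges, [{"soumen"}, {"byron"}])
```

For the query "soumen byron" on this toy graph, the only node with a forward path to both keyword nodes is "paper", so it is the single answer root.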

Backward Expanding Search
  • Query: soumen byron
  [Figure: paper "Focused Crawling" connected through "writes" tuples to authors "Soumen C." and "Byron Dom"]

Backward Exp. Search Limitations
  • Too many iterators (one per keyword node)
    • Solution: single iterator per keyword (SI-Bkwd search)
      • tracks shortest path from node to keyword
    • Changes answer set slightly
      • Different justifications for same root may be lost
      • Not a big problem in practice
  • Nodes explored for different keywords can vary greatly
    • E.g. "mining" or "query" vs "knuth"
  • High fan-out when traversing backwards from some nodes
  • Connection with join ordering
    • Similar to traversing backwards from all relations that have selections

Bidirectional Search: Motivation
  [Figure]

Bidirectional Search: Intuition
  • First cut solution:
    • Don't expand backward if keyword matches many nodes
    • Instead explore forward from other keywords
  • Problems
    • Doesn't deal with high fan-out during search
    • What should cutoff for not expanding be?
  • Better solution: [Kacholia et al, VLDB 2005]
    • Perform forward search from all nodes reached
    • Prioritize expansion of nodes based on
      • path weight (as in backward expanding search)
      • + spreading activation, to penalize frequent keywords and bushy trees

Bidirectional Search: Example
  • Query: harper divesh olap
  [Figure: graph with keyword nodes "OLAP", "Harper" and "Divesh"]

Bidirectional Search (1)
  • Spreading activation to prioritize backward search
    • (Different from spreading activation for near queries)
    • Lower weight edges get higher share of activation
    • Nodes prioritized by sum of activations
  • Single forward iterator

Bidirectional Search (2)
  • Forward search iterator
    • Forward search from all nodes reached by backward search
    • Track best forward path to each keyword
      • Initially infinite cost
      • Whenever this changes, propagate cost change to all affected ancestors
  [Figure: nodes annotated with their best known path costs to keywords k1 and k2, with ∞ where no path is known yet]

Bidirectional Search (3)
  • On each path length update (due to backward or forward search)
    • Check if node can reach all keywords
    • If so, add it to output heap
  • When to output nodes from heap
    • For each keyword Ki, track Mi
      • Mi: minimum path length to Ki among all yet-to-be-explored nodes in backward search tree
    • Edge score bounds:
      • What is the best possible edge score of a future answer?
      • Bounds similar to NRA algorithm (Fagin)
      • Cheaper bounds (e.g. 1/Max(Mi)) or heuristics (e.g. 1/Sum(Mi)) can be used
    • Output answer if its score is > overall score upper bound for future answers
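The output rule can be sketched with the two cheap bounds named above. This is only an illustration: it considers edge scores alone (node scores are ignored), assumes every Mi is positive, and the heap layout and function name are made up for the example.

```python
import heapq


def pop_ready_answers(answer_heap, M, heuristic=False):
    """Release answers whose edge score beats the bound on any future answer.

    answer_heap: heapq list of (-edge_score, root) entries (assumed layout).
    M: M[i] is the minimum path length to keyword i among unexplored nodes;
       all values are assumed to be positive.
    heuristic: use 1/Sum(Mi) instead of the bound 1/Max(Mi).
    """
    bound = 1.0 / (sum(M) if heuristic else max(M))
    ready = []
    while answer_heap and -answer_heap[0][0] > bound:
        neg_score, root = heapq.heappop(answer_heap)
        ready.append((root, -neg_score))
    return ready


heap = [(-0.5, "a"), (-0.2, "b")]
heapq.heapify(heap)
first = pop_ready_answers(heap, [2, 4])                   # bound = 1/4
second = pop_ready_answers(heap, [2, 4], heuristic=True)  # bound = 1/6
```

With the 1/Max(Mi) bound only the answer scoring 0.5 is released; the looser 1/Sum(Mi) heuristic releases the remaining answer earlier, at the risk of emitting it out of order.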

Performance
  • Worst case complexity: polynomial in size of graph
    • But for typical (average) case, even linear is too expensive
    • Intuition: typical query should access only small part of graph
  • Studied experimentally
    • Datasets: DBLP, IMDB, US Patent
    • Queries: manually created
  • Typical cases
    • < 1 second to generate answer
    • 10K-100K nodes explored

Performance Results
  • Two versions of backward search:
    • Iterator per node (MI-Bkwd) vs iterator per keyword (SI-Bkwd)
  • Very minor loss in recall
  [Figure: time ratio MI/SI against origin size (number of nodes matching keywords)]

Performance Results
  • SI-Bkwd versus Bidirectional search
    • Bidirectional search gain increases with origin size, # keywords

Related Work (1)
  • Publish as document approach
    • Gather related data into a (virtual) document and index the document (Su/Widom, IDEAS 05)
  • Positives
    • Avoids run-time graph search
    • Works well for a class of applications
      • E.g. bibliographic data: DBLP page per author
  • Negatives
    • Not all connections can be captured
    • Duplication of data across multiple documents
    • High index space overhead

Related Work (2)
  • DPBF (Ding et al., ICDE 07)
    • dynamic programming technique
    • exact for top-1 answer, heuristic for top-k
  • BLINKS (He et al., SIGMOD 07)
    • Round-robin expansion across iterators
      • Optimal within a factor of m, with m keywords
    • Forward index: node to keyword distance
      • Used instead of searching forward
      • single level index: impractically large space
      • bi-level index: > main memory, IO efficiency?


External Memory Graph Search
  • Graph representation quite efficient
    • Requires < 20 bytes per node/edge
  • Problem: what if graph size > memory?
    • Alternative 1: Virtual Memory
      • thrashing
    • Alternative 2 (for relational data): SQL
      • not good for top-K answer generation across multiple SQL queries
    • Alternative 3: use compressed graph representation to reduce IO
      • [Dalvi et al, VLDB 2008]

Supernodes and Superedges
  [Figure]

Multi-granular Graph
  • Dumb algorithm
    • search on supernode graph, get k*F answers, expand their supernodes into memory, search on resultant graph
    • no guarantees on answers
  • Better idea: use multi-granular graph
    • Supernode graph in memory
    • Some nodes expanded
      • Expanded nodes are part of cache
    • Algorithms on multi-granular graph (coming up)

Multi-granular Graph
  [Figure]

Expanding Nodes
  • Key idea: edge score of answer containing a supernode is lower bound on actual edge score of any corresponding real answer

Iterative Search
  • Iterative search on multi-granular graph
    • Repeat
      • search on current multi-granular graph using any search algorithm, to find top results
      • expand supernodes in top results
    • Until top K answers are all pure
  • Guarantees finding top-K answers
    • Very good IO efficiency
    • But high CPU cost due to repeated work
  • Details: nodes expanded above never evicted from "virtual memory" cache

Incremental Search
  • Idea: when node expanded, incrementally update state of search algorithm to reflect change in multi-granular graph
  • Run search algorithm until top K answers are all pure
  • Currently implemented for backward search
    • Modifies the state of the Dijkstra shortest path algorithm used by backward search
      • One shortest path search iterator per keyword
      • SPI tree: shortest path iterator tree

Incremental Search (1)
  [Figure: SPI tree for k1]

Incremental Search (2)
  [Figure]

Incremental Search (3)
  [Figure]

External Memory Search: Performance
  [Figure: performance chart; axis label: Queries]

External Memory Search: Performance
  • Supernode graph very effective at minimizing IO
    • Cache misses with incremental often < # nodes matching keywords
  • Iterative has high CPU cost
  • VM (backward search with cache as virtual memory) has high IO cost
  • Incremental combines low IO cost with low CPU cost

Conclusions
  • Keyword search on graphs continues to grow in importance
    • E.g. graph representation of extracted knowledge in YAGO/NAGA (Max Planck)
  • Ranking is critical
    • Edge and node weights, spreading activation
  • Efficient graph search is important
    • In-memory and external-memory

Ongoing/Future Work
  • External memory graph search
    • Compression ratios for supernode graph for DBLP/IMDB: factor of 5 to 10
      • Ongoing work on graph clustering shows good results
  • Graph search in a parallel cluster
    • Goal: search integrated WWW/Wikipedia graph
    • New search algorithms
  • Integration with existing applications
    • To provide more natural display of results, hiding schema details
    • Authorization

BANKS References
  • Keyword Searching and Browsing in Databases using BANKS. Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan. ICDE 2002.
  • User Interaction in the BANKS System (demo paper). B. Aditya, Soumen Chakrabarti, Rushi Desai, Arvind Hulgeri, Hrishikesh Karambelkar, Rupesh Nasre, Parag, S. Sudarshan. ICDE 2003.
  • Bidirectional Expansion For Keyword Search on Graph Databases. Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi Desai and Hrishikesh Karambelkar. VLDB 2005.
  • Keyword Search on External Memory Data Graphs. Bhavana Dalvi, Meghana Kshirsagar and S. Sudarshan. VLDB 2008.

Thanks!


Time and Nodes Explored
  [Figure: chart with series "Bidir Nodes" and "BiDir Time"]

Screenshots (1)
  • author (near recovery)
  [Figure: screenshot]

Near Queries with Multiple Keywords
  • Spread activation from each keyword separately
  • Then combine the activations from different keywords
    • OR: use addition or belief combination
    • AND: take product of activations
      • Gives better results
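The two combination rules are easy to state in code. In this sketch the per-keyword activations are assumed to be dictionaries from node to a value in [0, 1]; the function names are illustrative.

```python
import math


def combine_or(activations):
    """Belief combination: 1 - prod(1 - a_i)."""
    return 1.0 - math.prod(1.0 - a for a in activations)


def combine_and(activations):
    """Product of activations: high only if every keyword is near."""
    return math.prod(activations)


def combine_per_node(per_keyword, combine):
    """per_keyword: list of {node: activation} dicts, one per near keyword."""
    nodes = set().union(*per_keyword)
    return {n: combine([acts.get(n, 0.0) for acts in per_keyword]) for n in nodes}


per_keyword = [{"x": 0.5, "y": 0.8}, {"x": 0.5}]
and_scores = combine_per_node(per_keyword, combine_and)
or_scores = combine_per_node(per_keyword, combine_or)
```

Node "y" is close to only one keyword, so the AND rule drops it to 0 while the OR rule still ranks it above "x".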

The BANKS System
  • Available on the web, with DBLP, IMDB and IITB ETD data: http://www.cse.iitb.ac.in/banks/
  • No programming needed for customization
    • Minimal preprocessing to create indices and give weights to links
  • Provides keyword search coupled with extensive browsing features
    • Schema browsing + data browsing
    • Hyperlinks are automatically added to all displayed results
    • Browsing data by grouping and creating crosstabs
    • Graphical display of data: bar charts, pie charts, etc.
  [Figure: User connects over HTTP to Web Server + Servlets running BANKS, which connects over JDBC to the Database]

BANKS Architecture
  • Data resident on disk
  • Graph structure of data resident in memory
    • Nodes and edges with their types/counts
      • 16×|V| + 8×|E| bytes
    • Search done in memory
  • Why
    • Allows us to use interesting graph traversal based algorithms without being constrained by SQL and related performance issues
    • With current memory sizes, database graphs for most applications will fit in memory
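The space formula gives a quick way to check whether a graph fits in memory. The node and edge counts below are made-up inputs for illustration, not measurements of any dataset.

```python
def graph_memory_bytes(num_nodes, num_edges, bytes_per_node=16, bytes_per_edge=8):
    """In-memory graph size from the slide's formula: 16*|V| + 8*|E| bytes."""
    return bytes_per_node * num_nodes + bytes_per_edge * num_edges


# Hypothetical graph with 1 million nodes and 5 million edges.
size = graph_memory_bytes(1_000_000, 5_000_000)
size_mb = size / 1_000_000
```

For these hypothetical counts the graph structure needs 56 MB, which is why only the graph skeleton is held in memory while the data itself stays on disk.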

Probabilistic Edge Score Model (2)
  • Paths from root to leaves are considered separately, even if they share edges
  • More efficient search algorithms with this model (He et al., SIGMOD 07)
  [Figure: example answer tree annotated with the values 0.5, 1, 0.5 and 0.167]