Finding and Approximating Top-k Answers in Keyword Proximity Search
Benny Kimelfeld and Yehoshua Sagiv
The Selim and Rachel Benin School of Engineering and Computer Science
האוניברסיטה העברית בירושלים (The Hebrew University of Jerusalem)
CIKM 2005

Keyword Proximity Search (KPS)
A paradigm for data extraction
Data have varying degrees of structure
  – Relational databases, XML, Web sites
Queries are sets of keywords
  – No structural constraints
The goal: extract meaningful parts of the data w.r.t. the keywords
PODS'06 Finding and Approximating Top-k Answers in Keyword Proximity Search

Querying Structure & Content by Keywords
Keywords appear in different parts of the data
Answers show occurrences of keywords, as well as the associations among these occurrences
Example search: Vardi Databases …
Proximity of the keywords in the answer indicates a close (strong) semantic association among them

Past Work on KPS (Keyword Proximity Search)
• DataSpot (SIGMOD 1998)
• Information Units (WWW 2001)
• BANKS (ICDE 2002, VLDB 2005)
• DISCOVER (VLDB 2002)
• DBXplorer (ICDE 2002)
• XKeyword (ICDE 2003)
• …

The Goal of this Paper
Devise efficient algorithms for finding high-quality answers in keyword proximity search

Contents
• Introduction
• Formal Setting
• The Main Results
• Enumerating in the Exact Order
• Enumerating in an Approximate Order
• Conclusion and Future Work

Data Graphs
• Structural and keyword nodes
• Edges may have weights
  – Weak relationships are penalized by high weights

Queries are sets of keywords from the data graph, e.g., Q = {Summers, Cohen, coffee}

Query Answers (figure)

Query Answers
An answer is a directed subtree of the data graph that
• Contains all keywords of the query
• Has no redundant edges (and nodes)
  – The root has two or more children
  – The keywords of the query are the leaves
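The conditions on this slide can be sketched as a small check. This is only an illustration under stated assumptions: the candidate answer is given as a list of directed (parent, child) edges, and the function name `is_answer` is hypothetical, not taken from the paper.

```python
from collections import defaultdict


def is_answer(edges, keywords):
    """Check the slide's conditions for a candidate answer.

    edges: list of directed (parent, child) pairs (assumed representation).
    keywords: the keywords of the query.
    """
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))

    # A directed tree has exactly one root; every other node has one parent.
    roots = [n for n in nodes if indegree[n] == 0]
    if len(roots) != 1 or any(indegree[n] > 1 for n in nodes):
        return False
    root = roots[0]

    # Every node must be reachable from the root.
    seen, stack = {root}, [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    if seen != nodes:
        return False

    # No redundancy: the leaves are exactly the keywords,
    # and the root has two or more children.
    leaves = {n for n in nodes if not children[n]}
    return leaves == set(keywords) and len(children[root]) >= 2
```

For example, a tree whose root has a single child fails the check, because the edge above that child is redundant.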

Ranking: Inversely Proportional to Weight
rank(A) = (weight(A))^(-1)
Smaller subtrees represent closer associations
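As a small illustration of the ranking formula, the sketch below assumes that an answer is a list of (parent, child, weight) edges and that its weight is the sum of its edge weights; the function names are hypothetical.

```python
def weight(edges):
    """Total weight of an answer given as (parent, child, weight) triples."""
    return sum(w for _, _, w in edges)


def rank(edges):
    """rank(A) = weight(A)^(-1): lighter answers rank higher."""
    return 1.0 / weight(edges)
```

With these definitions, an answer of total weight 2 has rank 0.5 and is preferred to an answer of total weight 4, whose rank is 0.25.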

Enumerating in Exact (Ranked) Order
If answer A is generated before answer B, then weight(A) ≤ weight(B)
The first k answers generated are the top-k answers

Enumerating in a C-Approximate Order
C may be a function of G and Q
If answer A is generated before answer B, then weight(A) ≤ C · weight(B)
The first k answers generated are a C-approximation of the top-k answers (Fagin et al., PODS’01)
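A minimal sketch of how such an order could be checked, assuming the condition reads: whenever an answer A is generated before an answer B, weight(A) ≤ C · weight(B). The function name and the list-of-weights input are illustrative assumptions; setting C = 1 gives the exact ranked order.

```python
def is_c_approximate_order(weights, c=1.0):
    """weights: the weights of the answers, in the order they were generated.

    Returns True if every earlier weight is at most c times every later one.
    It suffices to compare each weight with the largest weight seen so far.
    """
    largest_so_far = float("-inf")
    for w in weights:
        if largest_so_far > c * w:
            return False
        largest_so_far = max(largest_so_far, w)
    return True
```

For instance, the sequence of weights 2, 1, 3 is not in exact order, but it is a 2-approximate order because 2 ≤ 2 · 1.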

Polynomial Delay
Yardstick of efficiency: polynomial delay
Polynomial time between generating successive answers
Exponentially many answers even for 2 keywords (it is inefficient to generate all answers and then sort)

Top Answers are Steiner Trees
• Finding the top answer in KPS (a.k.a. the Steiner-tree problem) is intractable
  – Therefore, one cannot enumerate all answers in ranked order with polynomial delay
• However, the top answer can be found efficiently under data complexity
  – That is, the number of keywords is fixed
• Approximations can be found efficiently under query-and-data complexity
  – There is a lot of work on Steiner-tree approximations

So What Can Be Done?
Can answers of KPS be enumerated in the exact order with polynomial delay, under data complexity?
Can approximations of Steiner trees be used for efficiently enumerating in an approximate order (while preserving the approximation ratio)?

Our Results
Theorem 1: Under data complexity, answers of KPS can be enumerated in the exact order with polynomial delay

Our Results (cont’d)
Theorem 2: Under query-and-data complexity, given an efficient C-approximation for finding Steiner trees, one can enumerate with polynomial delay in a (C+1)-approximate order

The Meaning of the Results
KPS is tractable under data complexity
All results on Steiner trees can be applied to KPS
Under query-and-data complexity, an efficient enumeration in an approximate order can be done with almost the same ratios as Steiner trees
From a theoretical point of view, using heuristics is not the only option
Existing approaches to KPS are heuristics
  – Exponential delay in the worst case
  – No provable nontrivial approximation ratios

Lawler’s Method
• We use the technique of Lawler (1972), which is an iterative method for finding the top-k answers
• Each iteration generates the next answer by finding the top answer under constraints
• Lawler’s method is designed for general (discrete) optimization problems
• When applying it to a specific problem, one needs to deal with the following two issues

Two Problems to Solve
1. What exactly are the constraints? (That is, how can we apply Lawler’s method so that the constraints make it possible to find top answers efficiently?)
2. How can we efficiently find the top answer under constraints?

Solving the First Problem
Constraints are subtrees of the graph
• Pairwise node disjoint
• Their leaves are exactly the keywords of the query
An answer satisfies the constraints if it contains all the subtrees (i.e., it is a supertree)

Two Problems to Solve (One Left)
1. What exactly are the constraints? (That is, how can we apply Lawler’s method in a way that the constraints enable finding the top answer efficiently?)
2. How can we efficiently find the top answer under constraints?

Formulation of the Second Problem
Input: constraints (node-disjoint subtrees, keywords as leaves)
Objective: a minimal answer satisfying the constraints (i.e., containing all the subtrees)
Next, an algorithm that solves “almost” this problem, namely:
(Almost the same) Objective: a minimal supertree satisfying the constraints

Finding a Minimal Supertree
Input: G, T (constraints, i.e., subtrees)
1. Collapse each of the subtrees of T into a node
2. Find a Steiner tree T′ of the collapsed subtrees
3. Restore the collapsed subtrees in T′
(more details in the proceedings…)

This is not Enough!
Input: constraints (node-disjoint subtrees, keywords as leaves)
Objective: a minimal answer satisfying the constraints (i.e., containing all the subtrees)
Not the same!
(Almost the same) Objective: a minimal supertree satisfying the constraints

Query Answers Revisited
An answer is a directed subtree of the data graph that
• Contains all keywords of the query
• Has no redundant edges (and nodes)
  – Keywords are the leaves
  – The root has two or more children

An Example (figure)

An Example
Figure labels: the minimal supertree satisfying the constraints; the minimal answer satisfying the constraints
This edge is redundant! But, it cannot be removed since it is a constraint!
The answer can be completely different from the minimal supertree
Furthermore, there can be no answer even if there is a supertree

What if We Remove Edges of Constraints?
• What if we first generate a minimal supertree and if the root has only one child, then we just remove it (until an answer is obtained)?
• The constraints are violated, leading to a failure of Lawler’s method!
• That is,
  – Some answers will be duplicated
  – While other answers will not be generated at all

Our Approach
Constraints → Transform → New constraints → Min. Supertree → Answer
The root of this subtree has more than one child and it must be the root of the answer

This Process is Repeated
Constraints → Transform → Min. Supertree, repeated up to 2^(#keywords) times (fixed and usually fewer)
The best is the final answer

About the Transformation
• The details of the exact transformation and the proof of correctness are intricate
• All can be found in the proceedings…
This concludes the algorithm for enumerating in the exact order

A Different View: Chain of Reductions
Enumerating answers in ranked order
  (via: adapting Lawler’s method)
Finding the top answer under constraints
  (via: transformation of constraints)
Finding minimal supertrees
  (via: collapse and restore)
Finding Steiner trees

Modifying the Chain of Reductions
Enumeration in an approximate order
  (similar)
Finding approximate answers under constraints
  (completely different!)
Finding approximations of minimal supertrees
  (similar)
Finding approximations of Steiner trees

Exact Order Revisited
Constraints → Transform → Min. Supertree, repeated up to 2^(#keywords) times
We cannot allow it under query-and-data complexity!

The Algorithm
Constraints
• A C-approximation of the minimal supertree (collapse and restore): ≤ C times the optimum
• A minimal answer for 3 or fewer constraints (the algorithm for the exact order): ≤ 1 times the optimum

Combine the Subtrees
• A C-approximation of the minimal supertree (collapse and restore): ≤ C times the optimum
• A minimal answer for 3 or fewer constraints (the algorithm for the exact order): ≤ 1 times the optimum
• The combined subgraph contains an answer: ≤ (C+1) times the optimum

Keyword Proximity Search
• A common paradigm for keyword search over structured databases
• In the formal model:
  – Data are directed and weighted graphs
  – Queries are sets of keywords (i.e., nodes) from the data graph
  – Query answers are non-redundant subtrees containing the keywords of the query
• The goal is to find the top-k answers, where the rank is inversely proportional to the weight
• A stronger goal: enumeration with polynomial delay

Our Results
• Under data complexity, answers can be enumerated in the exact ranked order with polynomial delay
• Under query-and-data complexity, every efficient C-approximation to the Steiner-tree problem yields an algorithm for enumerating answers with polynomial delay in a (C+1)-approximate order

Our Chain of Reductions
Enumerating answers in sorted order
  (via: Lawler’s approach)
Finding the top answer under constraints
  (via: the intricate part …)
Finding minimal supertrees
  (via: subtree collapse/restore)
Finding Steiner trees

Other Variants of KPS
Our algorithms can be adapted to other popular variants of KPS

Undirected Variant
Answers are undirected trees

Strong Variant
Answers are undirected trees and keywords are leaves

Open Problems
• Can we improve the space efficiency of our algorithms?
• Some ranking functions (e.g., height) are easier than weight when looking for the top answer (no constraints), but
  – The chain of reductions doesn’t work
  – The complexity of finding the top answer under constraints is unknown
• Can our results hold for richer queries that also have structural constraints?

Implementation Considerations
• Bottlenecks: Steiner-tree algorithms and approximations
• Thin graphs allow in-memory execution of our algorithms, even for large XML documents (e.g., DBLP)
• New and intuitive ranking functions that are easier to implement efficiently

Related Work: Order vs. Efficiency
Orders, from more desirable to more efficient:
• Exact order (queries have a fixed size)
• Approximate order
• Heuristic order (no approx. guaranteed)
• No order
This work: exact and approximate order. Past work: heuristic order.

Thank you. Questions?

Illustration of Lawler’s Method

Lawler’s Method (1972) (figure)

1. Find the Top Answer
In principle, at this point we should find the second-best answer. But instead…

2. Partition the Remaining Answers
Each partition is defined by a distinct set of constraints

3. Find the Top of each Set

4. Find the Second Answer
The second answer is the best among all the top answers in the partitions

5. Further Divide the Chosen Partition

And so on…

Adapting Lawler’s Method

Our Constraints
Inclusion constraints
• Node-disjoint subtrees of the data graph
• All the leaves are keywords
• An answer must contain all the subtrees
Exclusion constraints
• Edges of the data graph
• An answer must not contain any of the edges

Partitioning a Partition (cont)
A partition with inclusion constraints I, exclusion constraints E and top answer A, where edges(A) \ I = {e1, …, ek}, is split into:
Top answer   Inclusion              Exclusion
A0           I                      E ⋃ {e1}
A1           I ⋃ {e1}               E ⋃ {e2}
A2           I ⋃ {e1, e2}           E ⋃ {e3}
A3           I ⋃ {e1, e2, e3}       E ⋃ {e4}
…
Ak-1         I ⋃ {e1, …, ek-1}      E ⋃ {ek}
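The partitioning scheme can be sketched on a toy problem. The code below is not the KPS algorithm: as a stated simplification, answers are all subsets of exactly m weighted items, inclusion and exclusion constraints are plain sets of items, and the top answer under constraints is found greedily. All names are hypothetical. What it does show is Lawler’s loop: pop the best partition, output its top answer, and split the rest of that partition with one added exclusion per new element.

```python
import heapq
from itertools import count


def best_under_constraints(items, m, include, exclude):
    """Cheapest m-subset that contains `include` and avoids `exclude`.

    items: dict mapping item -> weight. Returns (weight, subset) or None.
    """
    free = sorted((w, e) for e, w in items.items()
                  if e not in include and e not in exclude)
    need = m - len(include)
    if need < 0 or need > len(free):
        return None
    chosen = frozenset(include) | {e for _, e in free[:need]}
    return sum(items[e] for e in chosen), chosen


def lawler_top_k(items, m, k):
    """Enumerate up to k cheapest m-subsets in ranked order."""
    tie = count()  # tie-breaker, so the heap never compares sets
    heap = []
    first = best_under_constraints(items, m, frozenset(), frozenset())
    if first is not None:
        heapq.heappush(heap, (first[0], next(tie), first[1],
                              frozenset(), frozenset()))
    results = []
    while heap and len(results) < k:
        w, _, answer, inc, exc = heapq.heappop(heap)
        results.append((w, sorted(answer)))
        # e1, ..., ek: elements of the answer not already forced by inc.
        new_elems = sorted(answer - inc)
        for i, e in enumerate(new_elems):
            sub_inc = inc | frozenset(new_elems[:i])  # I with e1..e(i) added
            sub_exc = exc | {e}                       # E with e(i+1) added
            top = best_under_constraints(items, m, sub_inc, sub_exc)
            if top is not None:
                heapq.heappush(heap, (top[0], next(tie), top[1],
                                      sub_inc, sub_exc))
    return results
```

Because the sub-partitions are pairwise disjoint and cover every answer of the popped partition except its top answer, each subset is produced exactly once.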

Generating Constraints (intuition)
Constraints (subtrees/edges) are obtained from the existing constraints of the current partition and the top answer

Collapsing Subtrees

Collapsing a Subtree (figure)

1. Remove All Edges and Internal Nodes
Only the root is left

2. Remove Incoming Edges of Internal Nodes

3. Add Outgoing Edges to the Root
An edge that emanates from an internal node becomes an outgoing edge of the root

More Details
• When adding an outgoing edge (r, u) to the root, the weight of (r, u) is the minimal weight among all the edges from the collapsed subtree to u
• When restoring a subtree, each outgoing edge (r, u) of the root is replaced with an (arbitrary) original edge from the restored subtree to u, with the same weight
• Incoming edges of internal nodes of the subtree are never restored
  – Such edges cannot participate in G-supertrees
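The collapse step described above can be sketched as follows. The representation is an assumption made for illustration: the data graph is a dict mapping directed edges (u, v) to weights, the subtree is given by its root and its node set, and "internal nodes" is read as all non-root nodes of the subtree. The function name `collapse_subtree` and the `origin` map are hypothetical.

```python
def collapse_subtree(graph, root, subtree_nodes):
    """Collapse a subtree of the data graph into its root.

    graph: dict mapping a directed edge (u, v) to its weight.
    Returns (collapsed_graph, origin), where origin maps each new edge
    (root, u) to an original edge of the same weight, for later restoring.
    """
    inside = set(subtree_nodes)
    collapsed = {}
    origin = {}
    for (u, v), w in graph.items():
        if u in inside and v in inside:
            # Step 1: edges between subtree nodes disappear with the subtree.
            continue
        if v in inside and v != root:
            # Step 2: incoming edges of non-root subtree nodes are removed.
            continue
        if u in inside:
            # Step 3: an edge leaving the subtree becomes an outgoing edge
            # of the root, keeping the minimal weight among parallel edges.
            key = (root, v)
            if key not in collapsed or w < collapsed[key]:
                collapsed[key] = w
                origin[key] = (u, v)
        else:
            # Edges that do not touch the removed nodes are kept as they are.
            collapsed[(u, v)] = w
    return collapsed, origin
```

Keeping the `origin` map is one simple way to support the restore step: each edge (r, u) in a tree found on the collapsed graph can be swapped back for the recorded original edge of equal weight.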