Keyword Proximity Search in Complex Data Graphs Konstantin
- Slides: 63
Keyword Proximity Search in Complex Data Graphs • Konstantin Golenberg • Benny Kimelfeld • Yehoshua Sagiv The Selim and Rachel Benin School of Engineering and Computer Science
Overview Keyword Proximity Search System Overview Algorithm for Answer Generation Ranking Answers Conclusions & Future Work SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Schema-Free Extraction of Data Exposure to many databases Nowadays… • Different types (relational, XML, RDF…) • Different schemas • Not easy to use traditional paradigms of querying (e. g. , SQL, XQuery, SPARQL) and, moreover, they require a thorough understanding of the schema • Goal: Enable users to instantly pose (inaccurate) queries without knowing the schema The natural (and popular) option: Keyword Search −Problem: Inherently different from standard IR SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Keyword Proximity Search (KPS) Data have varying degrees of structure – Relational (w/ foreign keys), XML (w/ id-references) – Natural representation by a graph – Usually, data-centric rather than document-centric A query is a set of keywords − No structural constraints The Goal: Extract meaningful parts of data w. r. t. the keywords • Agrawal et al. ICDE’ 02 • Hristidis et al. , VLDB’ 02, 03, ICDE’ 03 • Bhalotia et al. VLDB’ 05 • Kacholia al. , VLDB’ 06 • Ding et al. , ICDE’ 07 • Liu et al. , SIGMOD’ 06 • Wang et al. , VLDB’ 06 • Luo et al. , SIGMOD’ 07 … SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Example: Search in RDB Cities Organizations ID Name Population ID Name Head Q. 22 Amsterdam 1101407 135 EU 73 73 Brussels 951580 175 ESA 81 Countries Memberships Code Name Area Capital Country Org. NL Netherlands 37330 22 B 135 B Belgium 30510 73 NL 135 search SIGMOD’ 08 Belgium , Brussels Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Brussels is the capital city of Belgium Cities Organizations ID Name Population ID Name Head Q. 22 Amsterdam 1101407 135 EU 73 73 Brussels 951580 175 ESA 81 Countries Memberships Code Name Area Capital Country Org. NL Netherlands 37330 22 B 135 B Belgium 30510 73 NL 135 search SIGMOD’ 08 Belgium , Brussels Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Brussels hosts EU and Belgium is a member Cities Organizations ID Name Population ID Name Head Q. 22 Amsterdam 1101407 135 EU 73 73 Brussels 951580 175 ESA 81 Countries Memberships Code Name Area Capital Country Org. NL Netherlands 37330 22 B 135 B Belgium 30510 73 NL 135 search SIGMOD’ 08 Belgium , Brussels Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Example: Search in XML search SIGMOD’ 08 Yannakakis , Approximation Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Yannakakis wrote a paper about Approximation search SIGMOD’ 08 Yannakakis , Approximation Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Yannakakis is cited by a paper about Approximation search SIGMOD’ 08 Yannakakis , Approximation Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Data Graphs v Structural and keyword nodes v Edges and nodes may have weights – Weak relationships are penalized by large weights Each keyword has one occurrence in the data graph (technical) SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Queries are sets of keywords from the data graph Q={ Summers , Cohen , coffee } SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
An Answer is a Reduced Subtree An answer is a subtree of the data graph v Contains all keywords of the query v Has no redundant edges (and nodes) This paper 3 variants: directed, undirected, strong (undirected, kw’s are leaves); SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Previous Solutions • Lack of guarantees − Highly relevant answers might be missed, and / or − Inefficient algorithms • Rather simple data sets – a (very) small number of relevant answers − They considered data that are essentially collections of entities, namely, DBLP, IMDB, Lyrics, etc. − An answer is usually within the scope of an entity → e. g. , the keywords appear in a single movie • Crucial problems ignored − In particular, the “repeated information” problem − Especially pervasive in complex data graphs SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Contributions A system for keyword proximity search • An algorithm for generating answers with guarantees − Does not miss (valuable) answers − Efficient (polynomial delay) − Answers generated in a 2 -approximate order by height • A ranking technique that is aware of the repeated-information problem − Gives preference to answers with low similarity to earlier ones • Experimentation over a highly-cyclic data graph − The Mondial database − Many “meaningful” connections among keywords SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
The MONDIAL Database Institute for Informatics Georg-August-Universität Göttingen http: //www. dbis. informatik. uni-goettingen. de/Mondial/
Overview Keyword Proximity Search System Overview Algorithm for Answer Generation Ranking Answers Conclusions & Future Work SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Challenges • Huge no. of answers; not instantiated! − Not simple to generate all relevant answers, even if ranking is ignored − For practical ranking functions, enumerating the answers in ranked order is probably impossible • For example, finding the smallest answer is the intractable Steiner-tree problem • Redundancy / repeated information − Many answers are very similar (altogether provide a low amount information) − Crucial in complex (highly cyclic) data graphs We employ a two-phase architecture: SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Architecture: Generator + Ranker Simplified ranking at first [Bhalotia et al. , ICDE’ 02, VLDB’ 05] Answer Generator Ranker Generates next M·k answers Ranks all answers generated up to now (simplified ranking function) (- printed ones) • search(keywords) • next k answers SIGMOD’ 08 top-k answers (relative to those that have already been printed) Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Overview Keyword Proximity Search System Overview Algorithm for Answer Generation Ranking Answers Conclusions & Future Work SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Generating the Top Answers: Not Trivial! To demonstrate the difficulty of generating the “good” (top) answers, let’s see how existing approaches operate on a simple example: SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Find the Answers in this Example! SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
The BANKS Approach [Bhalotia et al. , ICDE’ 02, VLDB’ 05] Answers are directed subtrees ∀ nodes v (in a “good” order) and keyword occurrences: Generate the min-height subtree emanating from v SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
The BANKS Approach [Bhalotia et al. , ICDE’ 02, VLDB’ 05] Answers are directed subtrees What about this answer? Never generated! ∀ nodes v (in a “good” order) and keyword occurrences: Generate the min-height subtree emanating from v SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
The NUITS Approach [Ding et al. , ICDE’ 07] Answers are undirected subtrees ∀ nodes v (in a “good” order): Generate the min-weight subtree that includes v SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
The NUITS Approach [Ding et al. , ICDE’ 07] Answers are undirected subtrees This node is redundant It is actually the previous answer! ∀ nodes v (in a “good” order): Generate the min-weight subtree that includes v SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
The NUITS Approach [Ding et al. , ICDE’ 07] Answers are undirected subtrees This node is redundant Again, the previous answer! ∀ nodes v (in a “good” order): Generate the min-weight subtree that includes v SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
The NUITS Approach [Ding et al. , ICDE’ 07] Answers are undirected subtrees What about this answer? Never generated! Severe limit on # of generated answers! (≤ one per node) ∀ nodes v (in a “good” order): Generate the min-weight subtree that includes v SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
The DISCOVER / DBXplorer Approach [Hristidis et al. , VLDB’ 02, 03, ICDE’ 03] [Agrawal et al. ICDE’ 02] Easy to implement! DBMS queries– No in-memory graph algorithms All answers are generated in ranked order! ∀ possible queries Q (from the schema) in inc. size: Evaluate Q over the database SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
The DISCOVER / DBXplorer Approach [Hristidis et al. , VLDB’ 02, 03, ICDE’ 03] [Agrawal et al. ICDE’ 02] Worst case: exponential in the data But many queries do not generate any answer at all! Inefficient! Limited Ranking! by the query (rather than the answer) weight ∀ possible queries Q (from the schema) in inc. size: Evaluate Q over the database SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
We Need Generators w/ Guarantees! All answers are generated − In particular, each of the “relevant” answers is produced at some point (100% recall is achievable) Controlled order of answers − For instance, increasing weight, increasing height, approximate (what is the ratio? ) / heuristic order Efficiency − The top-k answers should be generated efficiently − Bound on time between successive answers SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Order by Increasing Weight / Height If Then ≤ Top-k Answers SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Approximate and Heuristic Orders Approximate order Heuristic order There is a provable bound on the extent to which the actual order can deviate from the optimal one Intuitively, expected to be close to the optimal order, but there is no guarantee SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
C-Approximate Order (inc. Weight / Height) If Then ≤C C-Approximation of the Top-k Answers [Fagin et al. , PODS’ 01] SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Our Approach • PODS’ 06: Enum. by (exact / approx) inc. weight − Problem: Repeated application of Steiner-tree alg’s − “Heavy” – hard to implement efficiently • Here: Follow the basic approach of PODS’ 06 • But, we adopt the BANKS idea of using height (≠ weight) for the enumeration order − Recall: BANKS might miss highly relevant answers • Thus, we bypass Steiner trees and obtain a much faster algorithm • Our alg. has all 3 guarantees: answers are not missed, missed approximate order, order poly. delay SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
An Overview of the Algorithm Task: Enum. by (2 -approx. ) increasing height Task: Lawler / Yen method Types of Constraints: • Inclusion: “include edge e” • Exclusion: “exclude edge e” Find (a 2 -approx. of) the shortest answer under constraints The intricate part … Task: Backward-search (Dijkstra) iterators (~ BANKS) SIGMOD’ 08 Find the shortest answer (w/o constraints) Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Finding an Answer under Constraints • Inclusion: “include edge e” • Exclusion: “exclude edge e” Handling exclusion constraints is easy Simply remove the excluded edges from the graph SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Inclusion Constraints are the Problem • Inclusion: “include edge e” • Exclusion: “exclude edge e” redundant edge The shortest subtree that contains the kw’s and satisfies the const’s But it is not an answer! • Not reduced (has redundancy) • Moreover, includes a previously printed answer • Sometimes, no answer at all! SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
The Correct Answer • Inclusion: “include edge e” • Exclusion: “exclude edge e” Technique: 1. Generate a min-height subtree (as in the wrong solution) 2. Not an answer? → modify • Intricate to guarantee 2 -approx. • Details in the proceedings SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Running Times Each entry is an avg. of 4 queries SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Alg. Order vs. Weight Order How many answers are generated in order to obtain the top-k (among 1000) according to weight? Each entry is an avg. of 4 queries SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Effective Approx. Ratio: Height ↑ 2 keywords % 3 keywords k (answers) Effective approx. ratio SIGMOD’ 08 worst / best (among first k) Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Effective Approx. Ratio: Height ↑ 4 keywords % 5 keywords k (answers) Effective approx. ratio SIGMOD’ 08 worst / best (among first k) Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Effective Approx. Ratio: Weight ↑ 2 keywords % 3 keywords k (answers) Effective approx. ratio SIGMOD’ 08 worst / best (among first k) Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Effective Approx. Ratio: Weight ↑ 4 keywords % 5 keywords k (answers) Effective approx. ratio SIGMOD’ 08 worst / best (among first k) Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Overview Keyword Proximity Search System Overview Algorithm for Answer Generation Ranking Answers Conclusions & Future Work SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
The Basic Ranking Function 1 abs-rel(a) = weight(a) Σ weight(node) + Σ weight(edge) node∊a edge∊a weight(a) = SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Determining the Weight of an Edge org. enters many countries → weak connection (large weight) Many org’s enter country → weak connection (large weight) Strong connection (small weight) SIGMOD’ 08 Strongest! Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
The Basic Ranking Function (cont’d) 1 abs-rel(a) = weight(a) Σ weight(node) + Σ weight(edge) l e v node∊a edge∊a weight(a) R =e ant answ ers but … weight(node) = fixed (1) weight(edge) = log(1 + α·out(v 1→t 2) + (1 − α)·in(t 1→v 2)) edge = (v 1, v 2) # t 2 nodes with edges from v 1 tag(vi) = ti # t 1 nodes with edges to v 2 SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Answers with High Similarity SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Combinations of Connections But each individual answer is relevant! SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Dynamic Ranking … Output Candidate Answers Next-Answer() a ← extract-top-candidate() print(a) for all candidates c and pairs of keywords k 1, k 2 if c and a connect k 1 and k 2 similarly, then penalize(c) What does it mean SIGMOD’ 08 ? Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Two Types of “Similarity” a c 1 The same connection Iso mo r (sa phic me con sch nec em tion a) k 1, k 2 = Belgium, France 2 options: • Sum over printed answers • Max over printed answers SIGMOD’ 08 Penalty: 1 c 2 Penalty: p (≤ 1) Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
The General Ranking Function 1 abs-rel(c) = weight(c) ∑ rpt-inf(c) = k , k ∊∑ kw’s 1 2 score(c) = SIGMOD’ 08 printed p answers or 1 or ∑max p or 1 k 1, k 2 printed answers ∊ kw’s 1 1 + ε · rpt-inf(c) abs-rel(c) Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
The bottom configuration is better than the top one Score Loss ofvs. Diversity Smaller reduction score for similar/higher degree of diversity Score (1/weight) Connections (u. t. iso. ) Sum, p=1. 0 % of max. Max, p=0. 1 • 5 keywords • Avg. of 4 queries • Top-20 answers SIGMOD’ 08 ε Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Overview Keyword Proximity Search System Overview Algorithm for Answer Generation Ranking Answers Conclusions & Future Work SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Conclusions • KPS in complex data graphs has inherent problems that are ignored in existing systems • 2 -component arch. : answer generator & ranker • 1 st component: Enum. algorithm w/ guarantees − Efficient, correct (no missed answers), 2 -approximate order by height − In the paper: Ext. to OR semantics (exact order) • 2 nd component: Dynamically ranks candidates by penalizing them for repeated information − Our experiments over Mondial suggest a tuning of the parameters that gives the best tradeoff between information gain and score loss SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Current & Future Research • Improve / optimize the answer generator − Successful: Parallelism − Concurrent queries? • Implement different answer generators − E. g. , by (approx. ) increasing weight [KS-PODS’ 06] • Assessment by humans − Relevancy / repeated information − Methodology example: [Zhang et al. , SIGIR’ 02] • Other aspects − Answer presentation → SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Answer Presentation • On the Web, we instantly get the meaning of an answer (Web page) by the <title>, URL and, possibly, a snippet of the text • In KPS, understanding the meaning of a subtree is note straightforward—need to derive the semantics from the graphical presentation SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
What’s the Meaning of this Answer? Harder in XML! IMDB • No division into relations (everything is element / attribute) • What information is needed to describe a node? A snapshot of BANKS demo (http: //www. cse. iitb. ac. in/banks/) SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Answer Presentation • On the Web, we instantly understand the meaning of an answer (Web page) by reading the <title> element, the URL and, possibly, a snapshot of the text • In KPS, understanding the meaning of a subtree is cumbersome since we need to derive the semantics from the presentation • Graphical presentation is based on restructuring answers in terms of of entities, properties and Solution: relationships (under • Apply heuristics for determining the minimal develop. ) set of properties required for each entity SIGMOD’ 08 Keyword Proximity Search in Complex Data Graphs Benny. University Kimelfeld The Hebrew
Thank you! Questions?
- Keyword proximity example
- Keyword search techniques
- Keyword generation for search engine advertising
- Verbatim meaning in google search
- Good state and bad state graphs in software testing
- Comparing distance/time graphs to speed/time graphs
- Graphs that enlighten and graphs that deceive
- 5-3 practice polynomial functions
- Mining complex data objects
- Mining complex types of data
- Cs 412 introduction to data mining
- Pauline and bruno have a big argument
- Ghon complex
- Quiz on simple compound and complex sentences
- The electra complex
- Sigmund freud psychodynamic theory
- Difference between oedipus complex and electra complex
- Konstantin dmitriyevich ushinskiyning pedagogik tizimi
- Dr. konstantin ziolkowski
- Florian feistle
- Konstantin busch
- Matus kucera konstantin a metod
- Konstantin zioutas
- Three holes
- Dr. konstantin ziolkowski
- Dr konstantin nikitin
- Romeo in julija oznaka oseb
- Konštantín porfyrogenetos
- Hyperbolic geometry of complex networks
- Constantin cel mare naissus
- Konstantin busch
- Konstantin stanislavski born
- Konstantin boyarsky
- Konstantin zouboulis
- Konstantin stanislavski born
- Konstantin ustinovich chernenko
- Konstantin porfirogenet
- Konstantin stanislavski born
- Konstantin kudryashov
- Konstantin meyl
- Jojo stammbaum
- Konstantin busch
- Konstantin car
- Konstantin stefanov
- Konstantin yakunin
- Konstantin matchev
- Europaprogrammet
- Konstantin alekseyevich korovin
- Konstantin busch
- Konstantin ivanov nmr
- Location
- Contrast repetition alignment proximity
- Root proximity
- Effects layout
- What is proximity control
- Scalpel pic
- Contrast, repetition, alignment proximity examples
- Dissimilarity for attributes of mixed types examples
- With-it-ness classroom management
- Proximity printing
- Design principles contrast
- App inventor sensor
- App inventor gps
- Proximity printing