Efficient Keyword Search across Heterogeneous Relational Databases

Mayssam Sayyadian, AnHai Doan (University of Wisconsin-Madison)
Hieu LeKhac (University of Illinois at Urbana)
Luis Gravano (Columbia University)
Key Message of the Paper

- Precise data integration is expensive
- But IR-style data integration can be done very cheaply, with no manual cost:
  - just apply automatic schema/data matching
  - then do keyword search across the databases
  - no need to verify anything manually
- Already very useful
- Builds upon keyword search over a single database...
Keyword Search over a Single Relational Database

- A growing field, with numerous current works:
  - DBXplorer [ICDE 02], BANKS [ICDE 02], DISCOVER [VLDB 02]
  - Efficient IR-style keyword search in databases [VLDB 03], VLDB 05, SIGMOD 06, etc.
- Many related works over XML / other types of data:
  - XKeyword [ICDE 03], XRank [SIGMOD 03], TeXQuery [WWW 04]
  - ObjectRank [SIGMOD 06], TopX [VLDB 05], etc.
- More are coming at SIGMOD 07...
A Typical Scenario

Customers:
  tid  custid  name   contact        addr
  t1   c124    Cisco  Michael Jones  ...
  t2   c533    IBM    David Long     ...
  t3   c333    MSR    David Ross     ...

Complaints:
  tid  id    emp-name       comments
  u1   c124  Michael Smith  Repair didn't work
  u2   c124  John           Deferred work to John Smith

Foreign-key join: Customers.custid = Complaints.id

Query Q = [Michael Smith Cisco] returns a ranked list of answers:
1. t1 (c124, Cisco, Michael Jones, ...) joined with u1 (c124, Michael Smith, "Repair didn't work"), score = 0.8
2. t1 (c124, Cisco, Michael Jones, ...) joined with u2 (c124, John, "Deferred work to John Smith"), score = 0.7
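The single-database scenario above can be sketched in a few lines: find tuples that match the query keywords, join them along the foreign key, and rank the joined answers. The table contents come from the slide; the scoring function (simple keyword coverage rather than the paper's IR-style ranking) and all function names are illustrative assumptions.

```python
# Toy single-DB keyword search over Customers joined with Complaints.
customers = {  # tid -> (custid, name, contact)
    "t1": ("c124", "Cisco", "Michael Jones"),
    "t2": ("c533", "IBM", "David Long"),
    "t3": ("c333", "MSR", "David Ross"),
}
complaints = {  # tid -> (custid, emp_name, comments)
    "u1": ("c124", "Michael Smith", "Repair didn't work"),
    "u2": ("c124", "John", "Deferred work to John Smith"),
}

def score(text, keywords):
    """Fraction of query keywords appearing in the answer text (toy IR score)."""
    words = text.lower().split()
    return sum(kw.lower() in words for kw in keywords) / len(keywords)

def keyword_search(query):
    keywords = query.split()
    answers = []
    for t, (cust, name, contact) in customers.items():
        for u, (cust2, emp, comment) in complaints.items():
            if cust != cust2:  # foreign-key join: Customers.custid = Complaints.id
                continue
            s = score(" ".join([name, contact, emp, comment]), keywords)
            if s > 0:
                answers.append((s, t, u))
    return sorted(answers, reverse=True)

results = keyword_search("Michael Smith Cisco")
```

Both joined answers cover all three keywords under this toy score; a real system would weight term frequencies and tuple-tree size to produce the 0.8 vs. 0.7 ranking on the slide.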
Our Proposal: Keyword Search across Multiple Databases

Service-DB:
  Customers(tid, custid, name, contact, addr):
    t1 = (c124, Cisco, Michael Jones, ...)
    t2 = (c533, IBM, David Long, ...)
    t3 = (c333, MSR, Joan Brown, ...)
  Complaints(tid, id, emp-name, comments):
    u1 = (c124, Michael Smith, "Repair didn't work")
    u2 = (c124, John, "Deferred work to John Smith")

HR-DB:
  Employees(tid, empid, name):
    v1 = (e23, Mike D. Smith)
    v2 = (e14, John Brown)
    v3 = (e37, Jack Lucas)
  Groups(tid, eid, reports-to):
    x1 = (e23, e37)
    x2 = (e14, e37)

Query: [Cisco Jack Lucas]
Answer, joined across databases: t1 - u1 - v1 - x1 - v3
(Cisco's complaint was handled by Michael Smith, who matches employee Mike D. Smith, who reports to Jack Lucas)
This is IR-style data integration.
A Naive Solution

1. Manually identify FK joins across DBs
2. Manually identify matching data instances across DBs
3. Treat the combination of DBs as a single DB and apply current keyword-search techniques

Just like in traditional data integration, this is too much manual work.
Kite Solution

- Automatically find FK joins and matching data instances across databases
- No manual work is required from the user
(same Service-DB and HR-DB tables as on the previous slide)
Automatically Find FK Joins across Databases
(example: Complaints in Service-DB vs. Employees in HR-DB)

- Current solutions analyze data values (e.g., Bellman)
- Limited accuracy:
  - e.g., a "waterfront" column with values yes/no and an "electricity" column with values yes/no look joinable, but are not
- Our solution: data analysis + schema matching
  - improves accuracy drastically (by as much as 50% F-1)
- Automatic join/data matching can be wrong, so Kite incorporates confidence scores into answer scores
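The waterfront/electricity problem above is why pure value analysis fails: two unrelated yes/no columns have perfect value overlap. Combining value overlap with a schema-level name similarity separates them. A minimal sketch, with made-up columns and a deliberately trivial name matcher (the weights and similarity functions are assumptions, not Kite's actual matcher):

```python
def jaccard(a, b):
    """Data-level evidence: overlap of the two columns' value sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def name_sim(n1, n2):
    """Toy schema-level evidence: token overlap between column names."""
    t1, t2 = set(n1.lower().split("_")), set(n2.lower().split("_"))
    return len(t1 & t2) / len(t1 | t2)

def join_score(col1, col2, w_data=0.5, w_schema=0.5):
    """Combine data analysis with schema matching into one join-confidence score."""
    return (w_data * jaccard(col1["values"], col2["values"])
            + w_schema * name_sim(col1["name"], col2["name"]))

waterfront  = {"name": "waterfront",  "values": ["yes", "no"]}
electricity = {"name": "electricity", "values": ["yes", "no"]}
cust_id        = {"name": "cust_id", "values": ["c124", "c533", "c333"]}
complaint_cust = {"name": "cust_id", "values": ["c124", "c124"]}
```

Value overlap alone scores the spurious yes/no pair as a perfect match; with the schema term added, the genuine cust_id pair wins.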
Incorporate Confidence Scores into Answer Scores

- Recall the answer example in the single-DB setting: t1 joined with u1, score = 0.8
- In the multiple-DB setting, an answer A such as t1 - u1 - v1 - x1 - v3 also carries
  - a confidence score (e.g., 0.7) for the data matching (Michael Smith ~ Mike D. Smith)
  - a confidence score (e.g., 0.9) for the automatically discovered FK join
- Combined score:

  score(A, Q) = [α · score_kw(A, Q) + β · score_join(A, Q) + γ · score_data(A, Q)] / size(A)
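The combined score above can be sketched directly. Aggregating the per-edge join and data-match confidences by averaging, and the default weights α = β = γ = 1, are assumptions for illustration; the slide only fixes the overall form of the formula.

```python
def answer_score(score_kw, join_confidences, data_confidences, size,
                 alpha=1.0, beta=1.0, gamma=1.0):
    """Kite-style combined score (sketch): keyword score plus confidence in the
    automatically discovered FK joins and data matches, normalized by answer size.
    Averaging per-edge confidences is an illustrative assumption."""
    score_join = sum(join_confidences) / len(join_confidences) if join_confidences else 1.0
    score_data = sum(data_confidences) / len(data_confidences) if data_confidences else 1.0
    return (alpha * score_kw + beta * score_join + gamma * score_data) / size

# Answer t1 - u1 - v1 - x1 - v3 from the slide: keyword score 0.8, one discovered
# FK join with confidence 0.9, one cross-DB data match with confidence 0.7, 5 tuples.
s = answer_score(score_kw=0.8, join_confidences=[0.9], data_confidences=[0.7], size=5)
```

Dividing by size(A) penalizes long join chains, so a large answer must earn its keywords and confidences to outrank a compact one.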
Summary of Trade-Offs

- Precise data integration (then SQL queries): the holy grail, but expensive
- IR-style data integration, the naive way: manually identify FK joins and matching data; still too expensive
- IR-style data integration using Kite: automatic FK-join finding / data matching; cheap; only approximates the "ideal" ranked list found by the naive way
Kite Architecture

Offline preprocessing (over databases D1 ... Dn):
- Schema Matcher + Data-based Join Finder feed the Foreign-Key Join Finder (modes: Partial, Full, Deep)
- Data instance matcher
- Index Builder produces one IR index per database (IR index1 ... IR indexn)

Online querying (e.g., Q = [Smith Cisco]):
- Condensed CN Generator, using the discovered foreign-key joins
- Top-k Searcher, using refinement rules; issues distributed SQL queries over D1 ... Dn
Online Querying

Given the relations of each database, current (single-DB) solutions:
1. Create answer templates
2. Materialize the answer templates to obtain answers
Create Answer Templates

- Find tuples that contain query keywords, using each DB's IR index
- Example, Q = [Smith Cisco]:
  - Service-DB: Complaints^Q = {u1, u2}, Customers^Q = {v1}
  - HR-DB: Employees^Q = {t1}, Groups^Q = {}
- Create the tuple-set graph: take the schema graph (Customers, Complaints, Emps, Groups connected by joins J1 ... J4) and split each relation R into its keyword tuple set R^Q and its free tuple set R^{}, keeping the join edges between them
Create Answer Templates (cont.)

- Search the tuple-set graph to generate answer templates, also called candidate networks (CNs)
- Each answer template = one way to join tuple sets to form an answer
- Sample CNs over the tuple-set graph:
  - CN1: Customers^Q joined (via J1) with Complaints^Q
  - longer CNs route through free tuple sets, e.g., Emps^Q through Groups^{} and Emps^{} (via J3) to Complaints^{Q} (via J4)
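CN generation can be sketched as a breadth-first expansion over the tuple-set graph: start from each keyword tuple set, repeatedly attach adjacent tuple sets, and keep every connected combination up to a maximum size in which free tuple sets never sit at a leaf (a standard DISCOVER-style pruning rule). The tiny two-relation graph below is a hand-made stand-in, not the slide's full four-relation graph.

```python
# Tuple-set graph: nodes are tuple sets; ".Q" nodes contain query keywords,
# ".free" nodes are unrestricted tuple sets.
edges = {
    "Customers.Q":    ["Complaints.Q", "Complaints.free"],
    "Complaints.Q":   ["Customers.Q", "Customers.free"],
    "Complaints.free": ["Customers.Q"],
    "Customers.free":  ["Complaints.Q"],
}

def is_keyword_node(n):
    return n.endswith(".Q")

def neighbors_in(n, nodes):
    return [m for m in edges[n] if m in nodes]

def valid(cn):
    """Prune CNs where a free tuple set is a leaf (it adds a join but no keywords)."""
    return all(is_keyword_node(n) or len(neighbors_in(n, cn)) >= 2 for n in cn)

def generate_cns(max_size):
    """Enumerate connected tuple-set combinations up to max_size nodes."""
    seen, out = set(), []
    frontier = [frozenset([n]) for n in edges if is_keyword_node(n)]
    while frontier:
        cn = frontier.pop()
        if cn in seen:
            continue
        seen.add(cn)
        if valid(cn):
            out.append(cn)
        if len(cn) < max_size:  # grow by one adjacent tuple set, staying connected
            for n in cn:
                for m in edges[n]:
                    frontier.append(cn | {m})
    return out

cns = generate_cns(2)
```

With max size 2 this yields three valid CNs: the two single keyword tuple sets and Customers^Q joined with Complaints^Q; the combinations ending in a free leaf are pruned.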
Materialize Answer Templates to Generate Answers

- Materialize each template by generating and executing a SQL query
- Example, CN: Customers^Q joined (via J1) with Complaints^Q, where Customers^Q = {v1} and Complaints^Q = {u1, u2}:

  SELECT *
  FROM Customers C, Complaints P
  WHERE C.cust-id = P.id
    AND C.tuple-id = v1
    AND (P.tuple-id = u1 OR P.tuple-id = u2)

- Naive solution: materialize all answer templates, score, rank, then return answers
- Current solutions:
  - find only the top-k answers
  - materialize only certain answer templates
  - make decisions using refinement rules + statistics
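Turning one answer template into SQL is mechanical string building: one FROM entry per tuple set, the FK-join predicates, and a disjunction restricting each keyword tuple set to its matching tuple ids. A sketch using the slide's CN; the function shape, the `tuple_id` column name, and inlining ids as literals (rather than binding parameters) are illustrative assumptions.

```python
def cn_to_sql(relations, join_preds, tuple_sets):
    """Build the materialization query for one candidate network.
    relations:  alias -> table name
    join_preds: FK-join predicates as 'A.col = B.col' strings
    tuple_sets: alias -> list of matching tuple ids ([] = free tuple set)"""
    conds = list(join_preds)
    for alias, ids in tuple_sets.items():
        if ids:  # restrict keyword tuple sets to their matching tuples
            ors = " OR ".join(f"{alias}.tuple_id = '{t}'" for t in ids)
            conds.append(f"({ors})")
    from_clause = ", ".join(f"{table} {alias}" for alias, table in relations.items())
    return f"SELECT * FROM {from_clause} WHERE " + " AND ".join(conds)

sql = cn_to_sql(
    relations={"C": "Customers", "P": "Complaints"},
    join_preds=["C.cust_id = P.id"],
    tuple_sets={"C": ["v1"], "P": ["u1", "u2"]},
)
```

In the Kite setting this query would be issued as a distributed SQL query spanning the databases that own Customers and Complaints, which is exactly why materialization is expensive there.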
Challenges in the Kite Setting

- More databases means way too many answer templates to generate
  - can take hours on just 3-4 databases
- Materializing an answer template takes way too long
  - requires SQL query execution across multiple databases
  - invoking each database incurs a large overhead
- Difficult to obtain reliable statistics across databases
- See the paper for our solutions
Empirical Evaluation

Domains (statistics per domain: # DBs, avg # tables per DB, avg # attributes per table, avg # tuples per table, # approximate FK joins across DBs, total size):
- DBLP: 2 databases, 3 tables per DB, ~500K tuples, 400 MB total
- Inventory: 8 databases, 5.4 tables per DB on average, 50 MB total

Sample Inventory schema (Inventory1): AUTHOR, ARTIST, BOOK, CD, WH2BOOK, WH2CD, WAREHOUSE

The DBLP schemas:
- DBLP1: AR(aid, biblo), CITE(id1, id2), PU(aid, uid)
- DBLP2: AR(id, title), AU(id, name), CNF(id, name)
Runtime Performance (1)

- Runtime (sec) vs. maximum CCN size:
  - DBLP: 2-keyword queries, k = 10, 2 databases
  - Inventory: 2-keyword queries, k = 10, 5 databases
- Runtime (sec) vs. # of databases (Inventory; maximum CCN size = 4, 2-keyword queries, k = 10)
- Algorithms compared: the Hybrid algorithm adapted to run over multiple databases; Kite without adaptive rule selection and without rule Deep; Kite without condensed CNs; Kite without rule Deep; the full-fledged Kite algorithm
Runtime Performance (2)

- Runtime (sec) vs. # of keywords in the query |q|:
  - DBLP: max CCN = 6, k = 10, 2 databases
  - Inventory: max CCN = 4, k = 10, 5 databases
- Runtime (sec) vs. # of answers requested k (2-keyword queries, max CCN = 4, 5 databases)
Query Result Quality

- Pr@k = the fraction of the top-k answers that appear in the "ideal" ranked list
- Measured at varying k, for both OR-semantics and AND-semantics queries
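The Pr@k measure defined above can be computed directly: compare the system's top-k answers against the top-k of the ideal list. A minimal sketch (comparing against the ideal list's own top-k is an assumption about the slide's definition; answer ids are made up):

```python
def precision_at_k(returned, ideal, k):
    """Fraction of the top-k returned answers that appear in the ideal top-k list."""
    ideal_top = set(ideal[:k])
    return sum(a in ideal_top for a in returned[:k]) / k

# Hypothetical ranked lists: Kite's approximate answers vs. the "ideal" list
# that the naive (fully manual) approach would produce.
returned = ["a1", "a3", "a2", "a7"]
ideal    = ["a1", "a2", "a3", "a4"]
```

Note that Pr@k ignores the ordering within the top k; it only measures overlap with the ideal list, which matches how the approximation quality of Kite's ranked lists is evaluated.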
Summary

- Kite executes IR-style data integration: some automatic preprocessing, then immediate keyword querying
- Relatively painless: no manual work; no need to create a global schema or to understand SQL
- Can be very useful in many settings, e.g., on-the-fly, best-effort querying for non-technical people:
  - enterprises and the Web, where only a few answers are needed
  - emergencies (e.g., hospital + police), where answers are needed quickly
Future Directions

- Incorporate user feedback: interactive IR-style data integration
- More efficient query processing: large # of databases, network latency
- Extend to other types of data: XML, ontologies, extracted data, Web data

IR-style data integration is feasible and useful; it extends current work on keyword search over databases and raises many opportunities for future work.
BACKUP
Other Experiments

- Schema matching drastically improves the join-discovery algorithm (accuracy, F-1)
- Kite also improves the single-database keyword search algorithm (runtime in sec vs. max CCN size, compared against m-Hybrid)