Planning for the Web I Data Integration Dan

Planning for the Web I Data Integration Dan Weld University of Washington June, 2003 © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration

Acknowledgements • • Oren Etzioni Alon Halevy Zachary Ives Rao Kambhampati Caveat © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 2

My Two Talks for Today • Data Integration Providing uniform access to disparate data srcs AI meets DB Answering queries using views Execution in the face of uncertainty, latency • Service Integration Invoking and composing web services Query and update Planning with incomplete information © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 3

Overview: Data Integration • • Motivation / intro Wrappers / information extraction Database review Integrating data sources Content, completeness, capabilities Reformulation algorithms: Bucket © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 4

What is Data Integration? A system providing: Uniform (same query interface to all sources) Access to (queries; eventually updates too) Multiple (we want many, but 2 is hard too) Autonomous (DBA doesn’t report to you) Heterogeneous (data models are different) Structured (or at least semi-structured) Data Sources (not only databases). © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 5

User enters query Formulate queries Lycos . . . Excite Collate results Remove duplicates Post-process + rank Download? Present to user © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 6

• • • Meta-? Web Search Shopping Product Reviews Chat Finder Columnists (e. g. jokes, sports, …. ) Email Lookup Event Finder People Finder Restaurant Reviews Job Listings Classifieds Apartment + Real Estate © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 7

Intuition: Info Integration • Info aggregation … on Steroids! • Want agent such that • User says what she wants • Agent decides how & when to achieve it • Example: Show me all reviews of movies starring Matt Damon that are currently playing in Seattle Sidewalk IMDB Fa o g n a nd Ebert

Info. Aggregation vs. Integration prices of laptop with … store 1 store 2 … store. N movies in Seattle starring … IMDB sidewalk join rev 1 rev 2 … rev. N sort • • Join, sort aggregate More complex queries Dynamic generation/optimization of execution plan Applicable to wider range of problems Much harder to implement efficiently © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 9

Challenges User must know which sites have relevant info User must go to each one in turn Slow: Sequential access takes time Confusing: Each site has a different interface User must manually integrate information

Practical Motivation • Enterprise Business “dashboard’’; web-site construction. • WWW Comparison shopping Portals integrating data from multiple sources B 2 B, electronic marketplaces • Science and culture: Medical genetics: integrating genomic data Astrophysics: monitoring events in the sky. Environment: Puget Sound Regional Synthesis Model Culture: uniform access to all cultural databases produced by countries in Europe. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 11

The Problem: Data Integration Uniform query capability across autonomous, heterogeneous data sources on LAN, WAN, or Internet © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 12

Current Solutions • Mostly ad-hoc programming: create a special solution for every case; pay consultants a lot of money. • Data warehousing: load all the data periodically into a warehouse. 6 -18 months lead time Separates operational DBMS from decision support DBMS. (not only a solution to data integration). Performance is good; data may not be fresh. Need to clean, scrub you data. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 13

Data Warehouse Architecture User queries OLAP / Decision support/ Data cubes/ data mining Relational database (warehouse) Data extraction, cleaning/ scrubbing Data extraction programs Data source © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration Data source 14

• Pro Warehouse Summary Relatively simple Good performance (OLAP support) Mature technology (DB, ETL industries) • Con Expensive Stale data Risky – most warehouse projects fail • Rigid architecture • Fixed schema • Must know all queries ahead of time © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 15

Architecture for Virtual Integration Leave the data in the sources. When a query comes in: 1) 2) 3) 4) Determine which sources are relevant to query. Break query into sub-queries for each source. Get answers from sources Combine. Data is fresh. Challenge: performance. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 16

Virtual Integration Architecture User queries Mediated schema Reformulator Mediator: Optimizer Execution engine Data source catalog wrapper Data source Sources can be: relational, hierarchical (IMS), structured files, web sites. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 17

Research Projects • • • Garlic (IBM), Information Manifold (AT&T) Tsimmis, Info. Master (Stanford) Internet Softbot/Razor/Tukwila (U Wash. ) Hermes (Maryland) Telegraph / Eddies (UC Berkeley) Niagara (Univ Wisconsin) DISCO, Agora (INRIA, France) SIMS/Ariadne (USC/ISI) Emerac/Havasu (ASU) © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 18

Industry • • Nimble Technology Enosys Markets IBM BEA © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 19

Dimensions to Consider • • • How many sources are we accessing? How autonomous are they? Meta-data about sources? Is the data structured? Queries or also updates? Requirements: accuracy, completeness, performance, handling inconsistencies. • Closed world assumption vs. open world? © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 20

Outline üMotivation / introduction • Wrappers / information extraction • Database Review • Integrating data sources Content, completeness, capabilities Reformulation algorithms: Bucket © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 21

• Task Wrapper Programs to communicate with the data sources and do format translations. • Built w. r. t. a specific source. • Can sit either at the source or mediator. • Often hard to build (very little science). • Can be “intelligent” perform source-specific optimizations. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 22

Example Introduction to DB Transform: Phil Bernstein Eric Newcomer Addison Wesley, 1999 into: <book> <title> Introduction to DB </title> <author> Phil Bernstein </author> <author> Eric Newcomer </author> <publisher> Addison Wesley </publisher> <year> 1999 </year> </book> © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 23

Wrapper Construction • Use PERL, or • Generate wrappers automatically Get training examples • Human marks up selected pages with GUI tool Use shallow NLP to create features Favorite learning method • HMMs, VS on prefix, postfix strings, ? ? Boosting Co-training • See research on information extraction © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 24

Sem. Tag & Seeker • WWW-03 Best Paper Prize • Seeded with TAP ontology (72 k concepts) And ~700 human judgments • Crawled 264 million web pages • Extracted 434 million semantic tags Automatically disambiguated © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 25

Outline üMotivation / Introduction üWrappers / Information extraction • Database Review • Relational algebra, SQL, datalog • Views • Optimization (query planning) • Integrating data sources Content, completeness, capabilities Reformulation algorithms: Bucket © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 26

Traditional Database Architecture Query (SQL) Answer (relation) Database Manager (DBMS) -Storage mgmt -Query processing -View management -(Transaction processing) © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration Database (relational) 27

Relational Data: Terminology Product relation Name attribute (Arity=4) Price Category Manufacturer gizmo $19. 99 gadgets Gizmo. Works Power gizmo $29. 99 gadgets Gizmo. Works Single. Touch $149. 99 photography Canon Multi. Touch $203. 99 household Hitachi tuple schema Product(Name: string, Price: real, category: enum, Man: string) © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 28

Relational Algebra • Operators tuple sets as input, new set as output • Operations Union, Intersection, difference, . . Selection ( ) Projection ( ) Cartesian product (X) • Join ( ) City © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration Manufacturer Tempe Gizmo. Works Kyoto Canon Dayton Hitachi 29

SQL: A query language for Relational Algebra Many standards out there: SQL 92, SQL 3, SQL 99 Select attributes From relations (possibly multiple, joined) Where conditions (selections) Other features: aggregation, group-by etc. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration “Find companies that manufacture products bought by Joe Blow” SELECT Company. name FROM Company, Product WHERE Company. name=Product. maker AND Product. name IN (SELECT product FROM Purchase WHERE buyer = “Joe Blow”); 30

Deductive Databases • Tables viewed as predicates. • Ops on tables expressed as “datalog” rules (Horn clauses, without function symbols) Enames(Name) : - Employe(Name, SSN) [Projection] Wealthy-Employee(Name) : - Employee(Name, SSN), Salary(SSN, Money), Money> 10 [Selection] Ed(Name, Dname) : - Employee(Name, SSN), E_Dependents(SSN, Dname) [Join] ERelated(Name, Dname) : - Ed(Name, Dname) ERelated(Name, Dname) : - Ed(Name, D 1), ERelated(D 1, D 2) [Recursion] © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 31

Views are relations, except that they are not physically stored. Uses: • simplify complex queries, & • define conceptually different views of DB for diff. users. Example: purchases of telephony products: CREATE VIEW telephony-purchases AS SELECT product, buyer, seller, store FROM Purchase, Product WHERE Purchase. product = Product. name AND Product. category = “telephony” © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 33

A Different View CREATE VIEW Seattle-view AS SELECT buyer, seller, product, store FROM Person, Purchase WHERE Person. city = “Seattle” AND Person. name = Purchase. buyer We can later use the view: SELECT name, store FROM Seattle-view, Product WHERE Seattle-view. product = Product. name AND Product. category = “shoes” What’s really happening when we query a view? ? © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 34

Materialized Views • Views whose corresponding queries have been executed and the data is stored in a separate database Uses: Caching • Issues Using views in answering queries • Normally, views are available in addition to DB – (so, views are local caches) • In information integration, views may be the only things we have access to. – An internet source specializing in tom hanks movies can be seen as a view on a database of all movies. – Except, there is no DB out there which contains all movies. . © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 35

DB Inference • Type 1 Evaluate query on instance of data Exponential in size of query Key is performance in size of data • Type 2 Query containment Q 1 Q 2 No matter what data instances Logical entailment © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 36

Query Optimization Declarative query SELECT S. buyer FROM Purchase P, Person Q WHERE P. buyer=Q. name AND Q. city=‘seattle’ AND Q. phone > ‘ 5430000’ Imperative execution plan: buyer City=‘seattle’ phone>’ 5430000’ Inputs: Buyer=name (Simple Nested Loops) • the query • available memory Person • statistics about the data Purchase • indexes, (Table scan) (Index scan) • cardinalities, • selectivity factors Ideally: Want to find best plan. Practically: Avoid worst plans! © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 37

(On-the-fly) sname bid=100 (On-the-fly) sname (On-the-fly) rating > 5 Pipelined (Use hash index; do sid=sid not write result to temp) rating > 5 bid=100 (Simple Nested Loops) hash join sid=sid Reserves Sailors Goal of optimization: To find more efficient plans that compute the same answer. sname Sailors (On-the-fly) rating > 5 (On-the-fly) SELECT S. sname FROM Reserves R, Sailors S (Sort-Merge Join) sid=sid (Scan; write to temp T 1) bid=100 Sailors WHERE R. sid=S. sid AND R. bid=100 AND S. rating>5 Reserves © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 38

Relational Algebra Equivalences • Allow us to choose different join orders and to ‘push’ selections and projections ahead of joins. Selections: (Commute) Projections: Joins: (Cascade) R (S T) (R (R S) (S R) S) T (Associative) (Commute)

Optimizing Joins • Q(u, x) : - R(u, v), S(v, w), T(w, x) R S T • Many ways of doing a single join R S Symmetric vs. asymmetric join operations • Nested join, hash join, double pipe-lined hash join etc. Processing costs alone vs. proc. + transfer costs • Get R and S together vs, get R, get just the tuples of S that will join with R (“semi-join”) • Many orders in which to do the join (R join S) join T (S join R) join T (T join S) join R etc. • All with different costs © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 40

Determining Join Order In principle, must consider all possible join orderings: D D C A B C D A B C A As # of joins increases, # plans grows rapidly Must restrict search space. System-R: consider only left-deep join trees. This lets us generate all fully pipelined plans: B Intermediate results not written to temporary files. (Not all left-deep trees are fully pipelined)

Cost Estimation • For each plan considered, estimate cost: Estimate cost of each operation in plan tree. • Depends on input cardinalities. Estimate size of result for each op in tree! • Use information about the input relations. • Selectivity (Histograms) • For selections and joins, assume independence of predicates. • System R cost estimation approach. Very inexact, but works ok in practice. More sophisticated techniques known now.

Key Lessons in Optimization • Classic planning / execution scenario Uncertainty / replanning key for data integration • Main points Disk IO as cost metric Algebraic rules / use in query transformation. . Join ordering via dynamic programming Estimating cost of plans • Size of intermediate results. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 44

e s i epr R Integrator vs. DBMS No common schema Sources with heterogeneous schemas Semi-structured sources Legacy Sources Not relational-complete Access/process limitations Autonomous sources Uncontrolled source content overlap Lack of source statistics Tradeoffs: plan cost, coverage, quality, … Multi-objective cost models Unpredictable run-time behavior Makes query execution hard © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 45

Outline üMotivation üWrappers / information extraction üDatabase Review • Integrating data sources Content, completeness, capabilities Reformulation algorithms: Bucket © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 46

Remnder: Info Integration • Want agent such that • User says what she wants • Agent decides how & when to achieve it • Example: Show me all reviews of movies starring Matt Damon that are currently playing in Seattle Sidewalk IMDB Fa o g n a nd Ebert

Data Source Catalog • Contains meta-information about sources: Logical source contents (books, new cars). Source capabilities (can answer SQL queries? ) Source completeness (has all books). Physical properties of source and network. Statistics about the data (like in an RDBMS) Source reliability Mirror sources? Update frequency. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 48

Content Descriptions • User queries refer to the mediated schema. • Source data is stored in a local schema. • Content descriptions provide semantic mappings between different schemas. • Data integration system uses the descriptions to translate user queries into queries on the sources. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 49

Desiderata for Source Descriptions • Expressive power: distinguish between sources with closely related data. Enable pruning of access to irrelevant sources. • Easy addition: make it easy to add new data sources. • Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 50

Reformulation Problem • Given: A query Q posed over the mediated schema Descriptions of the data sources • Find: A query Q’ over the data source relations, such that: • Q’ provides only correct answers to Q, and • Q’ provides all possible answers from to Q given the sources. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 51

Approaches to Specifying Source Descriptions • Global-as-view: express the mediated schema relations as a set of views over the data source relations • Local-as-view: express the source relations as views over the mediated schema. • Can be combined with no additional cost. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 52

Global-as-View Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Create View Movie AS select * from S 1 [S 1(title, dir, year, genre)] union select * from S 2 [S 2(title, dir, year, genre)] union [S 3(title, dir), S 4(title, year, genre)] select S 3. title, S 3. dir, S 4. year, S 4. genre from S 3, S 4 where S 3. title=S 4. title © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 53

Global-as-View: Example 3 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S 4: S 4(cinema, genre) Create View Movie AS select NULL, genre from S 4 Create View Schedule AS select cinema, NULL from S 4. But what if we want to find which cinemas are playing comedies? © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 55

Global-as-View Summary C Very easy conceptually. Query reformulation view unfolding. C Can build hierarchies of mediated schemas. D Sometimes loose information. Not always natural. D Adding sources is hard. Need to consider all other sources that are available. May need to modify every global view defn © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 56

Local-as-View: example 1 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Create Source S 1 AS [S 1(title, dir, year, genre)] select * from Movie Create Source S 3 AS [S 3(title, dir)] select title, dir from Movie Create Source S 5 AS [S 5(title, dir, year)] select title, dir, year from Movie where year > 1960 AND genre=“Comedy” © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 57

Local-as-View: Example 2 Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time). Source S 4: S 4(cinema, genre) Create Source S 4 select cinema, genre from Movie m, Schedule s where m. title=s. title. Now if we want to find which cinemas are playing comedies, there is hope! © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 58

Local-as-View Summary • Very flexible. You have the power of the entire query language to define the contents of the source. • Hence, can easily distinguish between contents of closely related sources. • Adding sources is easy: They’re independent of each other. • Query reformulation: Answering queries using views! © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 59

The General Problem • Given a set of views V 1, …, Vn, and a query Q, can we answer Q using only the answers to V 1, …, Vn? Many, many papers on this problem. Great survey on the topic: (Halevy, 2001). • The best performing algorithm: Mini. Con (Pottinger & Levy, 2000). © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 60

Example Query: Q(x): - r 1(x, y) & r 2(y, x) Movie, Schedule Views: Create Source V 1 as V 1(a): -r 1(a, b) select a V 2(d): -r 2(c, d) from r 1 V 3(f): - r 1(f, g) & r 2(g, f) Create Source V 3 as select r 1. arg 1 from r 1, r 2 where r 1. arg 1 = r 2. arg 2 © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 61

¶ 1) Bucket Algorithm (1) For each subgoal in the query, place relevant views in the subgoal’s bucket Query: Q(x): - r 1(x, y) & r 2(y, x) Views: V 1(a): -r 1(a, b) V 2(d): -r 2(c, d) V 3(f): - r 1(f, g) & r 2(g, f) Buckets: r 1(x, y) r 2(y, x) V 1(x), V 3(x) V 2(x), V 3(x) © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 62

Bucket Algorithm (2) 2) For every combo in the Cartesian product of the buckets, “check containment of the query” Q(x): -r 1(x, y) & r 2(y, x) Q’ 4(x): -r 1(x, a) & r 2(a, x) & r 1(x, b) & r 2(b, x) Bucket Algorithm will check all possible Candidate rewritings: combinations Q’ 1(x) : - V 1(x) & V 2(x) r 2(y, x) Q’ 2(x) : - V 1(x) & V 3(x) r 1(x, y) V 2(x), V 3(x) Q’ 3(x) : - V 3(x) & V 2(x) V 1(x), V 3(x) Q’ 4(x) : - V 3(x) & V 3(x) r 1(x, y) r 2(y, x) © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 63

Modeling Source Capabilities • Negative capabilities: A web site may require certain inputs (in an HTML form). • Positive capabilities: A source may be an ODBC compliant system. Need to decide placement of operations according to capabilities. • Problem: how to describe and exploit source capabilities. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 64

Example #1: Access Patterns Mediated schema relation: cites(paper 1, paper 2) Create Source S 1 as select * from cites given paper 1 S 1($x, y) : - cites(x, y) Create Source S 2 as select paper 1 from cites S 2(x) : - cites(x, y) Q(x) : - cites(x, “W 03”) & cites(x, y) Select paper 1 from cites where paper 2=“W 03” Query: © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 65

Example #2: Access Patterns Create Source S 1 as select * from Cites given paper 1 Create Source S 2 as select paper. ID from UW-Papers Create Source S 3 as select paper. ID from Award. Papers given paper. ID Query: select * from Award. Papers © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 66

Example #2: Solutions • Can’t go directly to S 3 (it requires a binding). • Can go to S 1, get UW papers, and check if they’re in S 3. • Can go to S 1, get UW papers, feed them to S 2, and then check if they’re in S 3. • Can go to S 1, feed results into S 2 again, then check if they’re in S 3. • Note: we can’t a priori decide when to stop. Need recursive query processing. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 67

Algorithms • Inverse Rules [Duschka & Genesereth] • Minicon [Pottinger & Levy] © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 68

Complexity of finding maximallycontained plans • Complexity depends on sources, not query Sources as unions of conjunctive queries (NP-hard) • Disjunctive descriptions Sources as recursive queries (Undecidable) True source contents Advertised description © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 69

Matching Objects Across Sources • How do I know that D. Weld in source 1 is the same as Daniel S. Weld in source 2? • If uniform keys across sources, easy. • If not: Domain specific solutions • (e. g. , maybe look at the address, …). Use IR techniques (Cohen, 98). • Judge similarity as you would between documents. Use concordance tables. • These are time-consuming to build, but you can then sell them for lots of money. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 70

Schema Mapping Problem • Types of structures: Database schemas, XML DTDs, ontologies, …, • Input: Two (or more) structures, S 1 and S 2 (perhaps) Data instances for S 1 and S 2 Background knowledge • Output: A mapping between S 1 and S 2 Should enable translating between data instances. © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 71

Semantic Mappings between Schemas • Source schemas = XML DTDs house address contact-info agent-name num-baths agent-phone 1 -1 mapping non 1 -1 mapping house location contact name © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration full-baths half-baths phone 72

Summary • Data Integration Providing uniform access to disparate data srcs AI meets DB • Modeling Data Sources GAV, LAV • Query Reformulation Answering queries using views Bucket algorithm [Inverse rules, Minicon] • Schema Matching © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 73

Next Talk • Executing data integration plans Variable latency, Poor info on size & speed of remote sources Adaptive execution • Service integration Invoking and composing web services Query and update Planning with incomplete information © Daniel S. Weld, PLANET 2003 Tutorial on Data Integration 74