Intelligent Information Integration (I3), Chun-Nan Hsu


Intelligent Information Integration (I3)
Chun-Nan Hsu
Institute of Information Science, Academia Sinica, Taipei, TAIWAN
Copyright © 1998 Chun-Nan Hsu. All rights reserved.
Prepared for a presentation at IIS, AS, Taiwan, October 12, 1998
Lab name TBA, IIS internal talk

Information Distribution and Information Integration
• Real query: need a list of attorneys in the Phoenix Metro Area specialized in immigration and deportation. Also show their years in service, educational background, and languages spoken.
• The answer IS on the Web!
  » US West yellow page web site
  » US bar association member directory web site
  » Alumni directory of law schools and more…
• BUT…

Intelligent Information Integration
• Environment assumptions:
  » Autonomous information sources
  » Heterogeneous but relevant
  » Query only (no or limited update allowed)
• Desiderata:
  » Extensible – easily add new sources
  » Flexible – can be queried in as many ways as integrated sources
  » Scalable – integrate 1,000s to 100,000s of relevant information sources

Solution: Information Integration Systems (IIS)
• Also known as
  » information mediation agents, information mediators
  » information gathering agents
  » information brokering agents, information brokers
• Key ideas:
  » users access data through a domain model
  » information sources are represented by a source model
  » the mediator reformulates a domain model query into source model sub-queries
  » the mediator constructs a query plan that determines the order of data flow and execution to retrieve data

Architecture
• Human & Computer Users
• User Services: Query, Monitor
• Information Integration Service
  » Domain model: describes the domain (terms and their relations)
  » Source model: provides source descriptions and semantic integration
  » Query planner: determines how to answer input queries
  » Query plan optimizer/executor: optimizes and executes a query plan
  » Wrapper (SQL, ORB): provides translation and communication with sources
• Heterogeneous Data Sources
  » Text, Images/Video, Spreadsheets
  » Hierarchical & Network Databases
  » Relational Databases
  » Object & Knowledge Bases

Query processing flow
[Figure: inside the Information Integration Service, the query planner uses the domain model and source model to turn a user query into a query plan; the query plan optimizer/executor sends subqueries to wrappers (ORB, SQL), which exchange translated queries and data with the information sources; the answer is returned to the user.]

Query plan
• Query: Find immigration attorneys in Phoenix and their educational background...
• P (us-west.com):
    SELECT name1, phone, address FROM LAWFIRMS
    WHERE location = Phoenix and class = law-firms and specialty = immigration
• S (law.yale.edu, law.harvard.edu, ...):
    SELECT name2, year, degree FROM ALUMNI
    WHERE name2 is one of name1
• Output:
    SELECT * FROM P, S, … WHERE P.name1 = S.name2
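The data flow of this plan can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory tables, the sample rows, and the function names (`subquery_p`, `subquery_s`, `run_plan`) are hypothetical stand-ins for wrapped Web sources, not part of any system described in the talk.

```python
# Hypothetical stand-ins for the wrapped sources; every row is made up.
LAWFIRMS = [  # yellow-page source
    {"name1": "A. Lee", "phone": "555-0101", "address": "1 Main St",
     "location": "Phoenix", "class": "law-firms", "specialty": "immigration"},
    {"name1": "B. Cruz", "phone": "555-0102", "address": "2 Oak Ave",
     "location": "Phoenix", "class": "law-firms", "specialty": "tax"},
    {"name1": "C. Diaz", "phone": "555-0103", "address": "3 Elm Rd",
     "location": "Tucson", "class": "law-firms", "specialty": "immigration"},
]
ALUMNI = {  # one ALUMNI table per law-school site
    "law.yale.edu": [{"name2": "A. Lee", "year": 1985, "degree": "JD"}],
    "law.harvard.edu": [{"name2": "D. Kim", "year": 1990, "degree": "JD"}],
}


def subquery_p():
    """P: select immigration law firms in Phoenix from the yellow pages."""
    return [
        {k: r[k] for k in ("name1", "phone", "address")}
        for r in LAWFIRMS
        if r["location"] == "Phoenix"
        and r["class"] == "law-firms"
        and r["specialty"] == "immigration"
    ]


def subquery_s(names):
    """S: ask every alumni source only about names already found by P."""
    wanted = set(names)
    return [
        dict(r)
        for table in ALUMNI.values()
        for r in table
        if r["name2"] in wanted
    ]


def run_plan():
    """Output step: join P and S on P.name1 = S.name2."""
    p = subquery_p()
    s = subquery_s(r["name1"] for r in p)
    return [{**a, **b} for a in p for b in s if a["name1"] == b["name2"]]


answer = run_plan()
```

Running P first and passing its names to S mirrors the "name2 is one of name1" condition: the alumni sources are only asked about attorneys that already satisfy the first subquery.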

Representation and integration of domain and source models

Integrating domain and source models - Example
• Domain as view of source:
    pilots(pilot, airline) as a view over s1(pilot, aircraft), s2(aircraft, airline).
• Source as view of domain:
    s1(pilot, _) as a view over pilots(pilot, _).
    s2(_, airline) as a view over pilots(_, airline).
• Referential integrity constraint on aircraft
[Figure: domain class pilots (pilot, airline) connected by source-links to source classes s1 (pilot, aircraft) and s2 (aircraft, airline)]

Representation of queries
• Queries
  » enumerate?
  » Conjunctive?
  » Negation?
  » Disjunctive?
  » Aggregate operators? (group-by, having, etc. SQL stuff)

Properties of Query Plans
• Quality of the answer
  » Anything not asked is returned?
  » Maximally contained? (due to O. Duschka, 1998)
• Executable (retrievable) query plans
  » A plan that contains no domain model terms

Query Planning in SIMS -- decompose
• Q: Pilots for the same airline as Mike
    Q(?p): pilots(?p, ?a), pilots(mike, ?a).
• Sources:
    s1(pilot, aircraft). s2(aircraft, airline).
• Decomposed query:
    Q(?p): s1(mike, ?a), s2(?a, ?al), s2(?a2, ?al), s1(?p, ?a2).
[Figure: domain class pilots (pilot, airline) connected by source-links to source classes s1 (pilot, aircraft) and s2 (aircraft, airline)]
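To make the decomposed query concrete, the sketch below evaluates it over toy data. The tuples in `S1` and `S2`, the airline names, and the function name `same_airline_as` are assumptions made purely for illustration.

```python
# Made-up contents of the two sources.
S1 = {("mike", "ac1"), ("ann", "ac2"), ("bob", "ac3")}          # s1(pilot, aircraft)
S2 = {("ac1", "x-air"), ("ac2", "x-air"), ("ac3", "y-air")}     # s2(aircraft, airline)


def same_airline_as(pilot):
    """Evaluate Q(?p): s1(pilot, ?a), s2(?a, ?al), s2(?a2, ?al), s1(?p, ?a2)."""
    # s1(pilot, ?a), s2(?a, ?al): airlines reachable from the given pilot
    airlines = {al for (p, a) in S1 if p == pilot
                for (a2, al) in S2 if a2 == a}
    # s2(?a2, ?al), s1(?p, ?a2): every pilot flying an aircraft of those airlines
    return {p for (a2, al) in S2 if al in airlines
            for (p, a) in S1 if a == a2}


result = same_airline_as("mike")
```

Note that the query does not exclude Mike himself, so he appears in his own answer set.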

Query Planning in SIMS -- partition
• Q: What are Mike's flight hours?
    Q(?h): pilots(mike, ?h).
• Sources (civil pilots, military pilots):
    s3(pilot, aircraft, hours). s4(pilot, aircraft, hours).
• Partitioned subqueries:
    Q(?h): s3(mike, _, ?h).
    Q(?h): s4(mike, _, ?h).
[Figure: domain classes pilot, airline, and flight-hours; Civil pilots and Military pilots are each a subset-of pilot; source classes s3 and s4 (pilot, aircraft)]
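A partitioned query is answered by running the same subquery against each source and taking the union of the results. A minimal sketch, assuming made-up tuples and a hypothetical function name `flight_hours`:

```python
# Made-up contents of the two partitioned sources: (pilot, aircraft, hours).
S3 = [("mike", "ac1", 1200), ("ann", "ac2", 800)]
S4 = [("mike", "ac9", 300)]


def flight_hours(pilot):
    """Union of Q(?h): s3(pilot, _, ?h) and Q(?h): s4(pilot, _, ?h)."""
    return {h for source in (S3, S4)
            for (p, _aircraft, h) in source if p == pilot}


hours = flight_hours("mike")
```

Adding a further partition would only mean adding another table to the tuple of sources that the function loops over.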

Query planning in SIMS
• There are 7 other such operators (Arens et al. 1995, JIIS) for query “reformulation”
• In addition there are 9 other operators about opening a source, moving data around, etc. (Knoblock, 1996, AIPS)
• Planning involves selecting appropriate operators and determining the best order for these operators
• There are always many choices, and search is required to find the “optimal” query plan

Recursive query plan
• Q: Pilots for the same airline as Mike
    Q(?p): pilots(?p, ?a), pilots(mike, ?a).
• Sources:
    s1(pilot, aircraft).
• Non-recursive query plan?
• Maximally contained?
  – (this part is due to O. Duschka, 1997, Ph.D thesis, Stanford Univ.)
[Figure: domain class pilots (pilot, airline) connected by source-links to source class s1 (pilot, aircraft)]
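One way to see why a recursive plan is natural here: with only s1 available, pilots can be linked to Mike through chains of shared aircraft. The sketch below computes that closure as a fixpoint. It rests on assumptions that are not stated on the slide, namely that every aircraft belongs to exactly one airline and every pilot flies for exactly one airline; the data and the function name `same_airline_recursive` are hypothetical.

```python
# Made-up contents of s1(pilot, aircraft).
S1 = {
    ("mike", "ac1"), ("ann", "ac1"),
    ("ann", "ac2"), ("bob", "ac2"),
    ("eve", "ac3"),
}


def same_airline_recursive(pilot):
    """Fixpoint over s1: pilots connected to `pilot` by chains of shared aircraft.

    ASSUMPTION (not from the slides): sharing an aircraft implies sharing
    an airline, so every pilot reached this way is a sound answer.
    """
    reached = {pilot}
    frontier = {pilot}
    while frontier:
        # aircraft flown by the pilots found in the previous round
        aircraft = {a for (p, a) in S1 if p in frontier}
        # new pilots flying any of those aircraft
        frontier = {p for (p, a) in S1 if a in aircraft} - reached
        reached |= frontier
    return reached


linked = same_airline_recursive("mike")
```

Bob never shares an aircraft with Mike directly, yet he is reached through Ann. No fixed-length chain of joins covers every such case, which is why the loop runs until no new pilots appear.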

Negative results of query planning using source-as-view
• Query planning for a query plan equivalent to an input datalog query is UNDECIDABLE
  » otherwise, theorem-proving for first-order logic would be decidable
  » (see O. Duschka, 1998, Ph.D thesis, Stanford University)
• Query planning for conjunctive, comparison-free queries with a minimal number of sources accessed is NP-complete
  » otherwise, containment of two datalog programs would be polynomial
  » (see A. Levy, 1995, PODS)

Domain as view of source
• Simply replacing domain terms in a query with their view definitions will yield an executable query plan
• Adding a new source may require changing the whole domain model-source model integration
  » not a problem for source-as-view
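The "replace domain terms with their view definitions" step can be sketched as a small unfolding routine. The representation below (atoms as `(predicate, args)` tuples, a `VIEWS` table, the `unfold` function, and the fresh-variable naming scheme) is an assumption chosen for illustration, not the representation used by any of the systems in the talk.

```python
from itertools import count

# Hypothetical view table: domain predicate -> (head variables, body over sources).
# Encodes pilots(P, AL) as a view over s1(P, AC), s2(AC, AL).
VIEWS = {
    "pilots": (("P", "AL"), [("s1", ("P", "AC")), ("s2", ("AC", "AL"))]),
}


def unfold(query_body):
    """Replace each domain atom by its view body, renaming unbound variables."""
    fresh = count(1)
    plan = []
    for pred, args in query_body:
        if pred not in VIEWS:          # already a source term: keep as is
            plan.append((pred, args))
            continue
        head_vars, body = VIEWS[pred]
        binding = dict(zip(head_vars, args))
        n = next(fresh)                # one suffix per unfolded atom
        for body_pred, body_args in body:
            plan.append((
                body_pred,
                tuple(binding.get(v, f"?{v.lower()}{n}") for v in body_args),
            ))
    return plan


# Q(?p): pilots(?p, ?al), pilots(mike, ?al).
query = [("pilots", ("?p", "?al")), ("pilots", ("mike", "?al"))]
plan = unfold(query)
```

The resulting plan mentions only s1 and s2, so it is executable in the sense that it contains no domain model terms. The sketch assumes the query does not already use variable names such as `?ac1`.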

Query optimizations
• Semantic query optimization (Hsu and Knoblock, 1999 IEEE TKDE)
• Less “semantic” (using local completeness, functional dependency, etc.) (Kwok AAAI-96, Levy)
• Exploring parallelism in plans (Knoblock, IJCAI-95)
• Replanning failed retrieval (Knoblock, IJCAI-95)
• Caching (static)
• Dynamic caching (using partial results from subqueries)

Basic idea of adaptive semantic query optimization
• Input query: Give me all the papers written by “Chunnan”
• Semantic rules (learned by BASIL, the semantic rules learner/KDDer, from the databases):
  » R1: If AUTHOR is an “AIer” then PAPER is an “AI” paper
  » R2: “Chunnan” is an “AIer”
  » R3: ...
• PESTO query optimizer applies the rules to the input query
• Optimized query: Give me all the “AI” papers written by “Chunnan”
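The rewrite in this example can be mimicked with a toy rule application. This is a sketch of the general idea only, not of how PESTO works: the dictionary representation of a query, the `AIERS` set, and the `optimize` function are all hypothetical.

```python
# R2 as data: authors known to be "AIers" (hypothetical representation).
AIERS = {"Chunnan"}


def optimize(query):
    """Apply R1: if the author is an AIer, the paper is an AI paper.

    `query` maps attribute names to required values. The added constraint
    does not change the answer but lets a source skip non-AI papers.
    """
    optimized = dict(query)
    if query.get("author") in AIERS:
        optimized.setdefault("topic", "AI")
    return optimized


rewritten = optimize({"author": "Chunnan"})
```

The optimized query returns the same papers as the input query as long as the rules hold in the data, which is why the rules have to be learned from, and kept consistent with, the databases.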

Web wrapper

Name        Degree   School     Affiliation
WL Hsu      Ph.D     Cornell    IIS, Sinica
CS Ho       Ph.D     NTU        EE, NTIT
C. Chen     Ph.D     SUNY       EE, NTIT
C. Wu       Ph.D     Utexas     Cedu, NNU
Mark Liao   Ph.D     NWU        IIS, Sinica
CJ Liau     Ph.D     NTU        IIS, Sinica
WK Cheng    Ph.D     TKU        Tunghai
WC Wang     MS       Syracuse   FIT
:           :        :          :

Wrapper construction
• For structured databases, wrapper construction is an engineering problem
• Web sources require an information extractor
• Hand-encoded Web information extractor?
  » Web pages change frequently (8% monthly failure rate at Junglee)
• Web wrapper induction? YES (Hsu 1999, J of Info Systems; Kushmerick 1997, Ph.D Thesis, U of WA)
• XML will make wrapper induction easier
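To illustrate what a hand-encoded Web information extractor looks like, and why it breaks when a page changes, here is a minimal sketch built on Python's standard `html.parser`. The page snippet, the class name `TableWrapper`, and the `extract` helper are assumptions for illustration; real wrappers of the period were written against each site's actual layout.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Hand-coded extractor: one tuple per <tr>, one field per <td>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None        # cells of the row being read, or None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            if self._row:
                self.rows.append(tuple(cell.strip() for cell in self._row))
            self._row = None
            self._in_cell = False


def extract(page):
    """Return the (Name, Degree, School, Affiliation) tuples found in `page`."""
    wrapper = TableWrapper()
    wrapper.feed(page)
    wrapper.close()
    return wrapper.rows


# Hypothetical page layout matching the table on the previous slide.
PAGE = (
    "<table>"
    "<tr><td>WL Hsu</td><td>Ph.D</td><td>Cornell</td><td>IIS, Sinica</td></tr>"
    "<tr><td>WC Wang</td><td>MS</td><td>Syracuse</td><td>FIT</td></tr>"
    "</table>"
)
rows = extract(PAGE)

# The same data in a redesigned layout yields nothing: the wrapper has failed.
rows_after_redesign = extract("<div>WL Hsu | Ph.D | Cornell | IIS, Sinica</div>")
```

The extractor is tied to the `<tr>`/`<td>` layout, so a page redesign silently produces an empty result. Wrapper induction aims to regenerate such extractors from examples instead of rewriting them by hand.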

Major players (1)
• SIMS, Ariadne
  » Arens, Knoblock, Minton, Shen, Hsu, et al. at ISI of USC
  » flexible query planner, adaptive semantic query optimizer
• Information Manifold
  » Levy, Srivastava, Kirk, et al. at AT&T Lab
  » query reformulation, relevant source selections
• TSIMMIS
  » Hammer, Garcia-Molina, Papakonstantinou, Ullman et al. at Stanford University
  » object-based data modeling (OEM)

Major players (2)
• Softbot family: Occam, Razor, etc.
  » Etzioni, Weld, Kwok, Kushmerick, Friedman
  » Fast query planning, wrapper induction, query optimization
• Infomaster
  » Duschka and Genesereth at Stanford
  » recursive query plans, theoretical analysis of III
• Others
  » HERMES at U of Maryland, Broker Agents at SRI, Ontobroker at AIFB Germany, etc.
  » Taiwan? Academia Sinica (WL Hsu, CN Hsu) and VF at NTU (YJ Hsu), others?

Positive results of intelligent information integration
• Spin-offs
  » Junglee (www.junglee.com)
    – Key scientists: Mike Stonebraker? Peter Norvig
    – Largest integration: 700 Web sites, 30 attributes, 1000+ wrappers
    – Bought by Amazon.com for ~$180 million
  » Jango (www.jango.com)
    – Key scientists: Dan Weld, Oren Etzioni
    – Bought by Excite.com
• Startups
  » Socratix Systems
    – Key scientist: Oliver Duschka

Competing alternatives
• Hardwired
  » mostly applied?
• Schema Integration
  » dying?
• Distributed Heterogeneous Multi-Databases
  » dying? Name too long?
• Data warehousing
  » kicking real good!
  » Updating is a tough problem
• Software Reverse Engineering
  » Taiwan has an edge on this?

Research challenges
• Optimization
• Probabilistic representation of domain-source models
• Probabilistic query answering, anytime, imprecise query answering
• Automatic locating and integrating relevant new sources
• Sharing information between incompatible sources (F-C? Exchange rate? Aliases?)
• Wrapper induction
• Cooperative agents for information integration

Information sources of intelligent information integration
• Journals
  » Journal of intelligent systems, information systems, intelligent information systems, cooperative information systems, agents(?)
  » Other journals for AI, databases
• Meetings
  » 1998 AAAI Workshop on AI & Information Integration
  » 1998 ECAI Workshop on Intelligent Information Integration
  » 1997 SIGMOD Workshop on Semistructured data
  » 1997 German Annual Conference On AI Workshop on III
  » 1995 AAAI Symposium on Information Gathering from Distributed Heterogeneous Sources

More information sources on Intelligent Information Integration
• Best papers are usually published in AAAI, IJCAI, SIGMOD and PODS
• Upcoming meeting:
  » IJCAI-99 Workshop on Intelligent Information Integration (proposed)

The Future: Networked Information Mediators
[Figure: Human & Computer Users query a network of mediators (Attorney Mediator, Immigration Attorney Mediator, Phoenix Attorney Mediator, Good Attorney Mediator, Bad Attorney Mediator, Law school Mediator) layered over Heterogeneous Data Sources]