Crossing the Structure Chasm
Alon Halevy, University of Washington
FQAS 2002

Outline
  • The structure chasm:
      • Old problem
      • Reasons for renewed interest (Semantic Web).
  • Crossing the chasm at the U. of Washington:
      • Getting people to structure their data (Darwin)
      • Large-scale data sharing by peer-data management (Piazza)
      • What can you do with a corpus of 1 million schemas? (Let’s build it and see).
  • Challenges going forward.

The Structure Chasm
The U-World
  • Authoring: easy, learned in grade-school.
  • Querying: easy, keywords, I don’t need to know the database.
  • Data sharing: put our documents into the same corpus/search engine.
  • Results: approximate. For human viewing.
The S-World
  • Authoring: need to design structure first. Learned in college.
  • Querying: I need to know how you structured your data.
  • Data sharing: negotiations, possibly committee work.
  • Results: need to be precise; may affect bank accounts.

Why Should We Care?
  • The chasm is limiting the use of S-World technology:
      • Losing potential customers, applications.
      • Being ignored by friends and family.
  • People often end up using more lightweight solutions:
      • Causes loss of functionality
      • No easy migration path back to the S-World.

Creating The Semantic Web
  • We want a web of structured data, where searches are more meaningful:
      • It’s a gigantic distributed database.
      • Content authors are not DB/KR specialists.
  • The SW has not taken off yet precisely because of the chasm.
  • Claim: We do not know how to build large-scale data sharing systems.

Crossing the Structure Chasm
  • Goal: import some of the nice properties of the U-world into the S-world.
  • Make authoring, querying and sharing of data easier.
  • No illusions: The S-world will always be harder than the U-world.
  • Some non-solutions:
      • (from KR people): more expressive representation languages.
      • (from DB people): didn’t XML solve this problem?

Pause


Crossing the Chasm with Revere
  • Darwin: entice people to structure their data
  • Piazza: enable sharing of data without central control
  • The Corpus: tools for facilitating authoring and sharing of data

Crossing the Chasm w/ Revere
  • Key components:
      • Darwin: Get people to structure their data.
      • Piazza: Enable people to share their data.
      • Statistics over structures: Import the main technique of the U-world into the S-world.
  • Goal: create infrastructure for building semantic web applications:
      • First case study: creating a SW from data that is already on people’s web pages.

Darwin: an evolutionary approach to the semantic web
Joint work with: Etzioni, Gribble, Levy, McDowell, Vlasseva
  • Two challenges:
      • Can we create conditions to entice people to create semantic content?
      • Can a database evolve rather than being created in the traditional fashion?
  • Goal: create a semantic web from data that is already on web pages: events, contact info, …
      • Large number of very heterogeneous web pages.
      • Wrapper technology does not apply.
      • Accessing the pages at query time is not scalable.

Key Ideas of Darwin
  • Make it easy: tool for annotating HTML pages
      • No need to replicate data.
  • Immediate gratification:
      • A set of applications that provide immediate benefit (calendar, phone book)
      • Illustrate that even partial data is useful.
  • Defer checking integrity constraints
  • Start local, reach out to others later.

Darwin and the Chasm
  • Addresses first step: getting data into structured form.
  • Challenges:
      • How to entice people to create content?
      • How to evolve a database/knowledge base?
      • How to do this in a scalable fashion?

Large-Scale Data Sharing (With Ives, Mork, Suciu, Tatarinov)
  • Goal: to share structured data across multiple autonomous sites.
  • Current solution: data integration
      • Query a set of data sources through a mediated schema.
      • Use XML as a data sharing format, and XQuery.
      • Information Manifold (96), Tukwila (99), Nimble Technology (www.nimble.com).

Data Integration Systems


Limitations of Data Integration
  • The mediated schema:
      • Creating it is hard, often infeasible.
      • Mapping to it may involve repetitive work.
      • Querying it can be hard for users familiar with their own schema.
  • Note: much better than warehousing.
  • Goal: share data without a single mediated schema.

Peer Data-Management
  • PDMS: a network of peers
  • Peers can:
      • Export base data
      • Provide views on base data
      • Serve as logical mediators for other peers
  • A peer can be both a server and a client.
  • Semantic relationships are specified locally (between small sets of peers).

Advantages of PDMS
  • No need for a central mediated schema.
  • Can map data opportunistically, as is most convenient.
  • Queries are posed using the peer’s schema.
  • Answers come from anywhere in the system.
  • Relationship to peer-to-peer file sharing:
      • Data has rich semantics
      • Probably not as dynamic in membership.
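
The idea that a query posed at one peer can draw answers "from anywhere in the system" can be sketched as a walk over pairwise mappings. This is only an illustration under stated assumptions, not Piazza's algorithm: the peers "ES" and "FD" are hypothetical (only LH and H appear in the slides), mappings are treated as undirected, and real reformulation must also rewrite the query at each hop.

```python
from collections import deque

# Hypothetical network: each peer lists the peers it has a local mapping to.
# Only "LH" and "H" come from the slides; "ES" and "FD" are made up.
MAPPINGS = {
    "LH": ["H"],
    "H": ["ES"],
    "FD": ["ES"],
    "ES": [],
}

def reachable_peers(start, mappings):
    """Return the peers that may contribute answers to a query posed at
    `start`, found by following pairwise mappings breadth-first."""
    # Assumption for this sketch: a mapping can be used in both directions.
    neighbours = {peer: set() for peer in mappings}
    for src, targets in mappings.items():
        for dst in targets:
            neighbours[src].add(dst)
            neighbours.setdefault(dst, set()).add(src)

    seen, order, queue = {start}, [], deque([start])
    while queue:
        peer = queue.popleft()
        order.append(peer)
        for nxt in sorted(neighbours[peer]):  # sorted for a deterministic order
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

print(reachable_peers("LH", MAPPINGS))
```

No peer needs a global schema here: each one only knows its immediate neighbours, and reachability emerges from the chain of local mappings.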

Example PDMS
[Figure: example peer network; relations shown include]
  LH:CritBed(bed, hosp, room, PID, status)
  H:CritBed(bed, hosp, room), H:Patient(PID, bed, status)

Ad-hoc Additions to a PDMS


PDMS Research Directions
  • Schema mediation:
      • Languages for specifying mappings.
      • Algorithms for answering queries.
      • Easy generation of mappings.
  • Efficiency and optimization:
      • Avoiding redundant paths, following best ones.
      • Propagating updates efficiently (w/ Mork, Gribble).
      • Distributed indexing of views (Dalvi, Suciu).

Schema Mediation in PDMS
  • The formalism for the semantic glue.
  • From data integration, we have:
      • Global-as-view (GAV): mediated schema is defined as views over the sources [query composition].
      • Local-as-view (LAV): sources are defined as views over mediated schema [answering queries using views].
      • GLAV: a combination of both:
          • Qsource = Qschema
          • Qsource ⊆ Qschema
  • Query answering is understood for a two-tier network: a mediator over multiple sources.
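
A small sketch may help make the view-based formalisms concrete. It reuses the relation signatures from the Example PDMS slide and treats LH:CritBed as a view over H:CritBed and H:Patient joined on `bed`. That reading is an assumption, since the slide text does not preserve how the two sides are related, and all rows below (including the hospital name) are made-up sample data.

```python
# Toy instance of peer H. Every row is invented for illustration.
H_CRITBED = [  # (bed, hosp, room)
    ("b1", "Lakeview", "r10"),
    ("b2", "Lakeview", "r11"),
]
H_PATIENT = [  # (PID, bed, status)
    ("p7", "b1", "stable"),
]

def lh_critbed_view(critbed, patient):
    """Compute LH:CritBed(bed, hosp, room, PID, status) as the join of
    H:CritBed and H:Patient on the shared `bed` attribute."""
    patients_by_bed = {}
    for pid, bed, status in patient:
        patients_by_bed.setdefault(bed, []).append((pid, status))
    return [
        (bed, hosp, room, pid, status)
        for bed, hosp, room in critbed
        for pid, status in patients_by_bed.get(bed, [])
    ]

def beds_with_status(view_rows, wanted):
    """A query over the view: beds whose patient has the given status."""
    return [bed for bed, _hosp, _room, _pid, status in view_rows
            if status == wanted]

rows = lh_critbed_view(H_CRITBED, H_PATIENT)
print(beds_with_status(rows, "stable"))
```

When a relation is defined as a view over the sources (the GAV direction), answering a query amounts to composing it with the view definition, as `beds_with_status` does. In the LAV direction the sources are the views, and the system has to work out which views can answer the query, which is the harder "answering queries using views" problem.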

Mediation: the Relational Case
  • A mediation language that uses GLAV locally.
  • Precise conditions for when global query answering in a PDMS is tractable/decidable.
  • A query answering algorithm that combines chains of query composition and answering queries using views.
  • See ICDE-03 paper for details.

Mediation: the XML Case
  • Mediation language:
      • XQuery is inappropriate.
      • Our language allows incremental specification of mappings.
      • Uses subset of XQuery.
  • Query answering algorithm:
      • New techniques for answering queries using views:
          • Challenge: nesting in XML structure.
  • Implementation: based on XML.

Additional Mediation Issues
  • Mapping composition:
      • Given A-B and B-C mappings, is there an A-C mapping that doesn’t lose information?
      • Yes, and no. Even when yes, it may be infinite. [w/ Madhavan].
  • Basic framework and properties of mappings:
      • KR community needs to consider mappings as first-class citizens.
      • See [AAAI-02].

Piazza and the Chasm
  • Enable ad-hoc, large-scale data sharing.
      • No need for central control or schema.
  • Open issues:
      • Optimization (follow only good paths?)
      • Annotations on mappings?
      • Intelligent data placement.
      • Update propagation.

Corpus Based Tools
  • Information retrieval works by:
      • Large corpora of text
      • Statistics over word occurrences in texts.
  • Can we do the same in the S-World?
      • Create a corpus of schemas.
      • Use it to build tools that facilitate authoring, querying and sharing data.

The Corpus
  • Contents:
      • Schemas, ontologies, meta-data, queries.
  • Sample statistics:
      • How often does a word appear as a relation name?
      • When it does, what tend to be the attribute names?
      • What other tables are there?
      • What are the foreign keys?
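
The first two sample statistics can be sketched over a toy corpus. The corpus below is invented for illustration, and representing a schema as a dict from relation name to attribute list is an assumption about how such a corpus might be stored.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus: each schema maps relation names to attribute names.
CORPUS = [
    {"patient": ["pid", "name", "bed"], "bed": ["bed", "room"]},
    {"patient": ["pid", "name", "status"]},
    {"course": ["id", "title"], "patient": ["id", "name"]},
]

def relation_name_counts(corpus):
    """How often does each word appear as a relation name?"""
    return Counter(name for schema in corpus for name in schema)

def attribute_counts(corpus):
    """For each relation name, how often does each attribute accompany it?"""
    stats = defaultdict(Counter)
    for schema in corpus:
        for relation, attributes in schema.items():
            stats[relation].update(attributes)
    return stats

print(relation_name_counts(CORPUS)["patient"])
print(attribute_counts(CORPUS)["patient"].most_common(1))
```

Statistics about co-occurring tables and foreign keys would need richer schema records than this sketch carries.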

Sample Tools
  • Auto-complete:
      • I start creating a schema, and the tool suggests a completion (perhaps I start only with data, not schema).
  • Schema matcher:
      • I can map between two schemas by relating them both to the corpus.
  • Query reformulator:
      • I ask a query using my terminology, and the tool reformulates it to a particular database schema.
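
As a rough sketch of the auto-complete idea, a tool could rank the attributes that most often accompany a relation name in the corpus and suggest the ones the author has not typed yet. The corpus, the function name `suggest_attributes`, and the frequency-based ranking are all assumptions made for illustration; the slides do not describe how the actual tool ranks completions.

```python
from collections import Counter

# Made-up corpus: each schema maps relation names to attribute names.
CORPUS = [
    {"patient": ["pid", "name", "bed"], "bed": ["bed", "room"]},
    {"patient": ["pid", "name", "status"]},
    {"course": ["id", "title"], "patient": ["id", "name"]},
]

def suggest_attributes(relation, already_have, corpus, k=2):
    """Suggest up to k attributes for `relation`, ranked by how often they
    appear with that relation name in the corpus."""
    counts = Counter()
    for schema in corpus:
        for attr in schema.get(relation, []):
            if attr not in already_have:
                counts[attr] += 1
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [attr for attr, _count in ranked[:k]]

print(suggest_attributes("patient", {"pid"}, CORPUS))
```

A relation name the corpus has never seen simply yields no suggestions, which is the kind of gap a larger corpus is meant to close.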

Why are we Optimistic?
  • Because of our work on LSD and GLUE [w/ Doan, Domingos, Madhavan]:
      • We computed classifiers for attributes of schemas.
      • Classifiers are a particular kind of statistic.
  • This is a huge community project:
      • We need your help
      • Or at least, your schemas
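
To make "classifiers for attributes of schemas" concrete, here is a minimal sketch of a token-based Naive Bayes classifier that guesses which attribute a data value belongs to. It is not the LSD or GLUE implementation. The training values, labels, and function names are hypothetical, and the choice of Naive Bayes with add-one smoothing is an assumption made to keep the example small.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (attribute_label, value_string) pairs.
    Returns per-label token counts and per-label example counts."""
    token_counts = defaultdict(Counter)
    label_counts = Counter()
    for label, value in examples:
        label_counts[label] += 1
        token_counts[label].update(value.lower().split())
    return token_counts, label_counts

def classify(value, model):
    """Return the most probable label for `value` under Naive Bayes
    with add-one (Laplace) smoothing."""
    token_counts, label_counts = model
    vocabulary = {tok for counts in token_counts.values() for tok in counts}
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        counts = token_counts[label]
        denominator = sum(counts.values()) + len(vocabulary)
        score = math.log(label_counts[label] / total_examples)
        for tok in value.lower().split():
            score += math.log((counts[tok] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: values already labeled with an attribute name.
MODEL = train([
    ("address", "12 main street seattle"),
    ("address", "4 pine street tacoma"),
    ("phone", "206 555 0100"),
    ("phone", "425 555 0199"),
])

print(classify("9 oak street", MODEL))
print(classify("555 0142", MODEL))
```

The learned token counts are themselves statistics over previously seen data, which is the sense in which a classifier is "a particular kind of statistic".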

Summary
  • Takeaway questions:
      • How can we entice people to structure data?
      • Can we generalize peer-to-peer systems to structured data?
      • What can we do with a corpus of schemas, and how can we build it?
  • For more details, www.cs.washington.edu/homes/alon
      • [CIDR-03], [AAAI-02], [ICDE-03], [WWW-02], [SIGMOD-01].