Distributed Databases Heterogeneous Distributed Databases Course outlines Motivation
Distributed Databases Heterogeneous Distributed Databases Course outlines Motivation, Database Integration Data. Base Management Systems “DBMS”– An overview Old-School Approaches Database integration - Problem Dimensions Important Types of Integrated Database Systems Multidatabases Federated Databases Schema Architecture Handling Integration Conflicts A Generic Federated DBMS Architecture Wrapper- Architecture & Tasks, Example Schema Integration Query Processing Web as a Loosely Coupled Federated Database The Mediator - Mediator Systems Data Warehousing 1/23
Motivation If we all use the same database, things are “quite simple” However, we often use Heterogeneous data sources - Heterogeneous DBMS Different data formats/data types etc. Ê We will learn about data or database integration. Ê We will see approaches, which allow access to different databases from within the same application (e. g. JDBC, transaction monitors, or workflow management systems). Goal: 2
Database Integration Data Integration Ê We did not directly work on it; however, used different service interfaces which requires data interchange Ê Typically requires some manual interaction However, It is very important to understand the difference between database integration and access to different databases. With database integration the integrated DBMS provides data processing functionality (e. g. querying, updating) for the integrated database as if it were managed by a standard DBMS. There exist many situations where data processing can be done much more efficiently on the integrated database than if it would be performed on each of the component databases (i. e. the databases to be integrated). Integrated DBMS provide an intermediate layer that hides heterogeneity, distribution and autonomy. 3
Data. Base Management Systems “DBMS” – An overview Provide a unique point of access data The DBMSto is athe software system that ÊDDL/DML ÊTransactions ÊSoftware system Application-1 Application-2 Application-n supports DDL/DML/Transactions for accessing a database The main differences among different DBMS are: Ê Data and operational model Ê Type of transaction support Ê Performance Application-n Application-1 Application-2 DDL/DML Unique Interface to the database File calls Data-1 Data-2 Data-n Databasae Typical Data Management architecture Typical DBMS architec 4
Typical Data integration problem Manual, Global Integration Ê Manually merge multiple databases Application-1 Application-2 Application-n into a new global database Ê Time consuming and error prone Solution ? Ê Local autonomy lost DDL/DML Ê Static solution DBMS 1 DBMS 2 DBMSn Ê Does not scale with number of databases Multi-database Language Approach Data-1 Data-2 Ê No attempt at integrating schemas Ê Language (e. g. , SQL) used to integrate information sources at run-time Simple example: Federated DBMS Old-School Approaches Data-n Book( ISBN, Title, Price, Author) Author(Name, ISBN) Use S 1, S 2 SELECT Titre S 2 FROM S 1. Book, S 2. Livre(ISBN, Prix, Titre) WHERE S 1. Book. ISBN = S 2. Livre. ISBN Auteur(Prenom, Nom, ISB S 1 Ê Not transparent (you need to know all databases!) Ê Heavy burden on (expert) users Book(ISBN, Title) Ê Global queries subject to local changes 5 Author(Name,
Database integration - Problem Dimensions Distribution How to Deal with Distribution? Distributed database Problems Autonomy Ê data access over the network Ê inconsistent replicated data Solutions Ê solved by using Web access (over HTTP) Ê Web Services, Java RMI, … Ê publishing using JSP Ê JDBC to access remote databases Ê etc. Federated data Heterogeneity Centralized How to Deal with Heterogeneity? database Problems Ê Data models Ê Schemas Ê Data Solutions. Variants - General issues Solution How to Deal with Autonomy? Problems Ê changing structure of Web page Ê different coverage of Web sites Ê availability of services Solutions Ê integration FMappings, Bottom-upschema vs. top-down Ê Standards engineering F Virtual vs. materialized integration F Read-only vs. read-write access F Transparency: language, 6
Important Types of Integrated Database Systems Multidatabases - Integrated access to different relational DBMS Ê create an intermediate layer that would let them appear like single relational database. Federated Databases - Integrated access to different DBMS Ê integrating data from databases that use different data models Ê data is not stored in relational databases, but either in legacy databases (hierarchical or network databases), non-standard databases (object-oriented databases) or in document-oriented formats (files, EDI data, Web data) Mediator Systems - Integrated access to different data sources (on the web) Ê Typically support the semantic integration of the database schemas. Ê Another form of federated DBMS evolved which focuses on read-only access to online resources on the Web data sources typically do not support full DBMS functionality Data Warehouses - Materialized integrated data sources 7
Integrated Database Systems Multidatabases Integrated access to different relational DBMS Enable transparent access to multiple (relational) databases (relational database middleware) Ê Hide distribution, different database language variants Ê Overcome the differences among the different SQL dialects, and can decompose queries which are posed against the component databases Ê Process queries and updates against multiple databases (2 -phase commit) Ê An important aspect is that multidatabases do not hide the Sybase. PUBLICATIONS(PNR, TITLE, AUTHOR, multidatab different schemas from the component databases SELECT p 2. title JOURNAL, PAGES) ase ÊFROM the multidatabase schemasche is just the union of the. TITLE, schemas Sybase. AUTHORS(ANR, NAME, from Sybase. PUBLICATI AFFILIATION)in order to avoid name ma the component databases (after renaming, ONS p 1, Oracle. PAPERS(NUMBER, TITLE, WRITER, conflicts). Oracle. PAPERS p 2 PUBLISHED) WHERE p 1. title = p 2. title SELECT title FROM PUBLICATI PAPE ONS PAPERS(NUMBER, TITLE, WRITER, Oracle. WRITER(FIRSTNAME, LASTNAME, PUBLISHED) Sybase NROFPUBLICATIONS) WRITER(FIRSTNAME, LASTNAME, Site Oracle PUBLICATIONS(PNR, TITLE, NROFPUBLICATIONS) Site AUTHOR, JOURNAL) AUTHORS(ANR, TITLE, NAME, 8
Integrated Database Systems Federated Databases Schema Architecture Ê Accepted model for integrated database systems with integrated Schema Ê 5 -level architecture Ê Data independence Export Schema Ê provided by data source Ê Unifies data model Ê Defines access functions Import Schema Ê provided by wrapper Ê view on export schema Integrated Schema A Widely Used Archi View-1 View-2 View-n Integrated Schema Import Schema Import Schema Export Schema Export Schema Ê provided by mediator Ê Homogenizes and unions import schemas Application views Ê provided by application Ê as in centralized DBMS Relatio nal DBMS Objec t Orient ed DBMS Web Serve r File Syste m 9
A Widely Used Architecture: Federated DBMS Handling Integration Conflicts (1/2) Ê What types of problems can one encounter integrating corresponding data? Ê Different structural representation (e. g. attribute vs. table) Ê Different naming schemes Types of Conflicts Schema level Conflict Resolution Ê Depends on type of conflict Ê Requires construction of mappings Ê Mappings might be complex, e. g. not expressible as SQL views Naming Conflicts Ê Homonyms Ê Naming conflicts Ê same name used for different Ê Structural conflicts concepts Ê Classification conflicts Ê Resolution: introduce prefixes to distinguish the names Ê Constraint and behavioral conflicts Ê Synonyms Data level Ê Identification conflicts Ê Representational conflicts Ê different names for the same concepts Ê Resolution: introduce a mapping 10 to a common name
A Widely Used Architecture: Federated DBMS Handling Integration Conflicts (2/2) Structural Conflicts Data Correspondences Ê Different, non-corresponding attributes Resolution: create a relation with the union of the attributes Ê Different datatypes Resolution: build a mapping function Ê Different data model constructs e. g. attribute vs. relation Resolution: requires higher order mappings Classification Conflicts Ê Relations can have different coverage (inclusion, non-empty intersection) Resolution: build generalization hierarchies Ê Additional problem Ê Identification of corresponding Ê Corresponding data instances Ê similar to naming conflicts at schema level Ê Resolution: mapping tables and functions Ê Similarity functions Ê Corresponding data values, data conflicts Ê of corresponding data instances Ê Resolution: mapping tables and functions Prefer data from more trusted data source Constraint and Behavioral Conflicts Ê Cardinality conflicts different types of cardinality constraints on relationships Resolution: use the more general 11 constraint
A Widely Used Architecture: Federated DBMS A Generic Federated DBMS Architecture Wrapper-Mediator architecture mediators integrate the data from the DBs wrappers convert to a common representation Application-1 Application-n request/query Federated DBMS (Mediator) Wrappers Architecture result/data Wrapper Compensation for missing Transformation processingof data model Metadata capabilities Oracle Sybasel IBM-DB 2 Communication interface DB 1 DB 2 DBn Source data integrity constra 12
A Widely Used Architecture: Federated DBMS Wrapper- Architecture & Tasks Wrappers are software to overcome syntactic heterogeneity result/data (communication, data request/query model, functionality) Compensation for missing Transformation processingof data model Metadata capabilities Communication interface Goal: Source data integrity constrai Providing Transparent Access to Data Sources Tasks - Translate among different data models it has to: Ê receive requests and return results according to the canonical data model through it's API Ê transform requests and compensate for missing functionality Ê have a communication interface with which it contacts the source database 13
A Widely Used Architecture: Federated DBMS Wrapper - Examples Data Model consists of : Translate among different data models Data types, Integrity constraints, Operations (e. g. query language) ÊData model used by sources - relational? HTML? XML? RDF? Custom? Text? ÊData model used by integrated DB - canonical data model (e. g. relational, XML) ÊQuery models - SQL, retrieval queries (in information retrieval), data mining (statistics) Examples: Wrapping Relational Data into XML/HTML ÊData types - trivial ÊIntegrity Constraints (e. g. primary keys) , requires XML Schema Wrapping XML/HTML into ÊOperations - none in HTML Relational Data Ê Data Types - which difficulties? Ê Integrity Constraints - none in HTML Ê Operations - requires generally XQuery form fields can be considered as hardcoded queries 14
Federated DBMS Schema Integration Creation of an integrated schema from heterogeneous schemas Methodology Identifying Schema Correspondences Standard Methodology 1. Wrapper: Schema translation from the original data model into the canonical data model. 2. Identifying corresponding concepts: Explore semantic concepts relating different original schemas 3. Mapping corresponding concepts into a common representation Integrated Schema translation (wrapper) Corresponden ce Ê source schema, source database investigation Ê source application Conflict Ê database administrator, developer, user resolution Ê Semantic correspondences and Ê e. g. related names schema Ê One of the important current research topics integration Ê Structural correspondences - Reach-ability by paths “Integrated 15 Ê Sources of information
A Widely Used Architecture: Federated DBMS Example Data sources Wrapper – export schemas Importe d schem as Integrate d schema Vie ws Views <complex. Type name="Book"> The federated <element name="title" DBMS has to type="string"/> map the two <element name="authors" import schemas type="string"/> into one <element name="pages" integrated type="string"/> The schema </complex. Type> federated Integrated <complex. Type name="Book"> <complex. Type name="book"> database schema <element name="title“ <element name="title" determines type="string"/> the <element name="authors" <element name="author" Import schema DB 1 Import schema D compatible import type="string"/> schemas </complex. Type>name="Book"> </complex. Type> <complex. Type <element name="title" type="string"/> <complex. Type name="Book"> <element name="authors" type="string"/> <element name="title" type="str <element name="pages" type="string"/> <element name="author" type="s </complex. Type> (Wrapper – export schema) (Wrapper – export sche For transforming the data, the wrapper ORACLE - SQL DBMS would need to compute PAGES) a join PUBLICATIONS(PNR, TITLE, AUTHOR, JOURNAL, using the underlying AUTHORS(ANR, TITLE, FIRSTNAME, SURNAME AFFILIATION) database system. XML DTD <!ELEMENT book (Title, author)> <!ELEMENT title #PCDATA> <!ELEMENT author #PCDATA> 16
A Widely Used Architecture: Federated DBMS Federated Databases Query A query for all authors of books is now submitted to the integrated Processing database in XML Query Importe d schem as Integrate d schema Vie ws Q = FOR $b IN //Book RETURN $b/author Q 1 = FOR $b IN //Book RETURN A 1 = $b/authors {<authors> … <authors>} Data sources Wrapper – export schemas U Submit query Q Q 1' = SELECT a. name FROM AUTHORS ORACLE - SQL a DBMS A 1’ = {<author> … <author>} Result of Q A 2 = {<authors> … <authors>} Q 2 = FOR $b IN //Book RETURN $b/authors A 2 = {<authors> … <authors>} Q 2' = //book/author A 2 = {<authors> … <authors>} PUBLICATIONS(PNR, TITLE, AUTHOR, JOURNAL, PAGES) AUTHORS(ANR, TITLE, FIRSTNAME, SURNAME AFFILIATION) XML DTD <!ELEMENT book (Title, author)> <!ELEMENT title #PCDATA> <!ELEMENT author #PCDATA> 17
The Mediator - Mediator Systems Main differences to conventional federated databases Ê Read-only access Ê Non-database resources Ê Light-weight and flexible Ê Integrate data with same "real-world meaning", but different representations Ê Semantics are important Ê integration mapping ! schema integration Ê can be implemented, e. g. , as database views Ê Decompose queries against the integrated schema to queries against source DBs (only for virtual integration) Example: Local As View approach Ê Each local database is defined as a view on the integrated schema Ê A simple Example: Source A: R 1(prof, course, university) R 2(title, prof, course) Source B: Ê Definition of the global, integrated schema: ÊGlobal (prof, course, title, university) ÊSource A defined as: Source 18 B
Web as a Loosely Coupled Federated Database Accessing databases through the Web can be seen as a way of loosely coupling heterogeneous data sources Consider the caseofofthe data occurs at the user interface The integration multiple databases rather than at the data management system accessed by the same Web browser query web-enabled Federated database layer: such a. Federated Database query would be automatically Server decomposed and the results web-enabled would be collected from the Database different databases by the server federated database server itself. With integration at the user interface layer, the user would Strong coupling Loose coupling have to, manually, formulate 19 individual queries for each
Data Warehousing - An overview (1/2) Motivation Ê Economy is characterised by continuous changes in marketing and business opportunities Ê Companies have accumulated data about their business Ê Often, this data is heterogeneous, is obtained from different departments and/or customers/suppliers Ê Businesses use data as decision support systems (based on information system) Ê Information is often “hidden” What is a data warehouse (DW)? Ê Data repository with heterogeneous data sources ÊOne approach to data integration Ê Data is aggregated Ê Contains several materialised views Ê(expressed e. g. , as SQL views but stored) Ê Views are updated at regular intervals Ê Multi-dimension data model Ê Designed for quick data retrieval by ad-hoc queries 20
Data Warehousing - An overview (2/2) Are a specialized form of federated database system, where the integrated database is materialized Ê perform efficient data analysis Ê OLAP (on-line analytical processing) Ê Data mining Simplified DW Architecture End-Users Uses typically limited forms of database schemas (e. g. relational star schemas), supports DW Dedicated Software multidimensional data model OLAD / Data. Mining Main problems Ê Updates and data consistency Ê Wrapper development Ê Efficient processing of OLAP and data mining queries Data One central wherehouserepository Data Warehouse Features Ê Data from multiple external data sources is Ê Extracted, Filtered, Merged Ê Enriched by historical information Ê Stored in a central repository Ê Original normal forms are often not used any more! Operational databases or other data sources 21
- Slides: 21