Information Architectures course, A.Y. 2009-2010, Carlo Batini

5.1.2 Data Integration systems: architectural elements

Data Integration (or mediator) systems

Data Integration definition
Data integration is a major research and business area whose main purpose is to give users uniform access to multiple, autonomous, heterogeneous data sources by presenting a unified view of these data. Agreeing on this unified view is complex, because the differences and similarities among the source schemas must be identified before the schemas can be made to conform.

Advantages of data integration architectures over federated architectures
A data integration architecture manages:
– schema-level heterogeneities that are more complex than those handled in federated databases
– (to some extent) instance-level heterogeneities due to quality errors in data (affecting accuracy, currency, completeness, consistency, etc.)

Data integration – several approaches
Data integration stands for several approaches to combining data from different data sources [Hull, 1997]:
• Integrated read-only views (mediation): support an integrated, read-only view of data that resides in multiple databases (the approach of the majority of academic and commercial systems)
• Integrated read-write views (mediation with update): an extension of the mediation architecture that supports updates against an integrated view
Initially, we will deal only with the first approach.

Schema level heterogeneities

Schema level heterogeneities
NB: in the following, "heterogeneity" and "conflict" are used as synonyms.
Schema level heterogeneities are of two types:
• Name heterogeneities
• Type heterogeneities

Name heterogeneities
• Synonyms: different names for the same concept
  – employee, clerk
  – exam, course
  – code, num
• Homonyms: the same name for different concepts
  – Employee denotes an employee in one schema and a vendor in another schema
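One possible way to handle synonyms and homonyms is a lookup table from local names to global concepts. The sketch below is only illustrative: the schema labels (schema_a, schema_b, schema_c), the table NAME_MAP and the function global_concept are assumptions, not part of the course material. Keying the table on the schema as well as on the name is what lets a homonym resolve to different concepts.

```python
# Hypothetical lookup table: (schema, local name) -> global concept.
# The schema labels and concept names are made up for illustration.
NAME_MAP = {
    ("schema_a", "employee"): "Employee",
    ("schema_b", "clerk"): "Employee",    # synonym: different name, same concept
    ("schema_a", "code"): "Code",
    ("schema_b", "num"): "Code",          # synonym of "code"
    ("schema_c", "employee"): "Vendor",   # homonym: same name, different concept
}


def global_concept(schema: str, name: str) -> str:
    """Return the global concept for a local name, or raise if it is unmapped."""
    try:
        return NAME_MAP[(schema, name.lower())]
    except KeyError:
        raise KeyError(f"no mapping for {name!r} in {schema!r}") from None
```

With this table, "clerk" in schema_b and "employee" in schema_a resolve to the same concept, while "employee" in schema_c resolves to a different one.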

Examples of name heterogeneities
Name conflicts:
– HOMONYMS: Product price (production price) vs. Product price (sale price)
– SYNONYMS: Department vs. Division

Type conflicts
The same concept is represented with different conceptual structures in two schemas:
• Different definition domains for the same attribute in two schemas
• Attribute in one schema and derived value in another schema
• Attribute in one schema and entity in another schema
• Attribute in one schema and generalization hierarchy in another schema
• Entity in one schema and relationship in another schema
• Different abstraction levels for the same concept in two schemas, e.g. two entities with homonym names related by an IS-A hierarchy in two schemas
• Different granularities in the definition domains
• Different cardinalities in the same relationships
• Key conflicts
Examples follow in the next slides.

Examples of type conflicts - 1
TYPE CONFLICTS in a single attribute (e.g. NUMERIC, ALPHANUMERIC, ...)
• The attribute "gender" can be encoded as:
  – Male/Female
  – M/F
  – 0/1
  – In Italy, it is implicit in the "codice fiscale" (the Italian counterpart of the SSN)
• Year has a four-digit domain in one schema and a two-digit domain in another schema
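The domain conflicts above can be reconciled by mapping every local encoding onto one target domain. The following is a minimal sketch under stated assumptions: the slide does not say which of 0/1 stands for which gender, so the choice 0 = male, 1 = female is hypothetical, and so is the pivot of 50 used to expand two-digit years. The function names are also illustrative.

```python
# Assumed encodings; in particular 0 = male and 1 = female is a guess,
# the slide only says that a 0/1 encoding exists.
GENDER_CODES = {
    "male": "M", "m": "M", "0": "M",
    "female": "F", "f": "F", "1": "F",
}


def normalize_gender(value) -> str:
    """Map Male/Female, M/F or 0/1 onto the single target domain {M, F}."""
    return GENDER_CODES[str(value).strip().lower()]


def normalize_year(value, pivot: int = 50) -> int:
    """Expand a two-digit year to four digits; four-digit years pass through.

    The pivot is an assumption: two-digit values >= pivot are read as 19xx,
    values below it as 20xx.
    """
    year = int(value)
    if year >= 100:
        return year
    return 1900 + year if year >= pivot else 2000 + year
```

The pivot makes the hidden ambiguity of the two-digit domain explicit: without some such convention, a value like 09 cannot be mapped to a four-digit year at all.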

Examples of type conflicts - 2
• different currencies (euros, US dollars, etc.)
• different measurement systems (kilos vs. pounds, Celsius vs. Fahrenheit)
• different granularities (grams, kilos, etc.)
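Conflicts of this kind are usually resolved with conversion functions applied when source values are mapped to the global schema. A small sketch, with hypothetical function names; the currency conversion takes the exchange rate as a parameter because no rate is given in the material and rates change over time.

```python
KG_PER_LB = 0.45359237  # exact definition of the avoirdupois pound in kilograms


def pounds_to_kilos(pounds: float) -> float:
    """Measurement-system conflict: pounds -> kilos."""
    return pounds * KG_PER_LB


def fahrenheit_to_celsius(deg_f: float) -> float:
    """Measurement-system conflict: Fahrenheit -> Celsius."""
    return (deg_f - 32) * 5 / 9


def grams_to_kilos(grams: float) -> float:
    """Granularity conflict: grams -> kilos."""
    return grams / 1000


def to_euros(amount: float, rate_to_eur: float) -> float:
    """Currency conflict: the caller supplies the exchange rate (an assumption)."""
    return amount * rate_to_eur
```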

Examples of type conflicts - 3
Structure conflicts
[Figure: schema fragments showing structure conflicts. Recoverable labels: EMPLOYEE, DEPARTMENT, PROJECT; Person with sub-entities MAN and WOMAN; Person with attribute GENDER; BOOK, PUBLISHER]
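Reading the figure labels as a Person entity specialized into MAN and WOMAN in one schema and a Person entity with a GENDER attribute in the other (an interpretation of the figure, not something the slide states), the two structures can be converted into each other. The function names and the record layout below are hypothetical.

```python
def hierarchy_to_attribute(men, women):
    """MAN / WOMAN sub-entities -> Person records carrying a gender attribute."""
    return ([{"name": n, "gender": "M"} for n in men]
            + [{"name": n, "gender": "F"} for n in women])


def attribute_to_hierarchy(persons):
    """Person records with a gender attribute -> separate MAN and WOMAN lists."""
    men = [p["name"] for p in persons if p["gender"] == "M"]
    women = [p["name"] for p in persons if p["gender"] == "F"]
    return men, women
```

The two functions are inverses of each other on this toy layout, which is what makes the two structures equivalent representations of the same concept.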

Examples of type conflicts - 4
DEPENDENCY (OR CARDINALITY) CONFLICTS
[Figure: EMPLOYEE, DEPARTMENT and PROJECT linked by relationships labelled with the cardinalities 1:1 and 1:n]

Examples of type conflicts - 5
KEY CONFLICTS
[Figure: key conflict between two PRODUCT entities. Recoverable labels: PRODUCT, CODE; PRODUCT, LINE, CODE, DESCRIPTION]

Data integration
The research community has been investigating data integration for about 20 years. Different research communities (database, artificial intelligence, semantic web) have been addressing issues related to data integration:
– Definitions, architectures and classifications of the problems to be addressed have been proposed
– Data integration problems have been analyzed from different perspectives, and different approaches have been proposed
– Benchmarks have been developed that allow the approaches to be evaluated and compared (THALIA benchmark)
– Several commercial software suites have been released and are being tested in real environments

Integration of Heterogeneous & Distributed Data Sources
"Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data" [Lenzerini, 2002]. The unified view is the Global Virtual Schema (GS).
[Figure: a query is posed on the Global Schema (GS); mappings connect the GS to the local schemas of the sources (DB, DB, File, XML)]

Main elements of DI architecture
Three main elements can be distinguished in the architecture of a schema integration system:
• a global schema
• one or more source/local schemas
• mappings between the global schema and the source/local schemas
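The three elements can be pictured as one small data structure. This is a sketch under assumptions, not a prescribed representation: the class DataIntegrationSystem, its field layout and the helper unmapped_global_attributes are hypothetical, and the relation names are borrowed from the professor example used later in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class DataIntegrationSystem:
    # relation name -> list of attributes
    global_schema: dict
    # source name -> {relation name -> list of attributes}
    local_schemas: dict
    # (global attribute, source name, local attribute) triples
    mappings: list = field(default_factory=list)

    def unmapped_global_attributes(self):
        """List the global attributes that no mapping connects to any source."""
        mapped = {global_attr for global_attr, _, _ in self.mappings}
        every = {f"{rel}.{attr}"
                 for rel, attrs in self.global_schema.items()
                 for attr in attrs}
        return sorted(every - mapped)


# Illustrative instance with a deliberately incomplete set of mappings.
dis = DataIntegrationSystem(
    global_schema={"Full_professor": ["name", "mail", "area"]},
    local_schemas={"source_2": {"Faculty_member": ["name", "mail", "research_topic"]}},
    mappings=[
        ("Full_professor.name", "source_2", "Faculty_member.name"),
        ("Full_professor.mail", "source_2", "Faculty_member.mail"),
    ],
)
```

Here dis.unmapped_global_attributes() reports Full_professor.area, since no mapping for it has been declared yet.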

Typical architecture of a data integration system
[Figure, top to bottom: user query; global schema; mediator; mapping; one wrapper per source; local schema 1, local schema 2, ..., local schema n; source 1, source 2, ..., source n]

Definitions of global schema and mappings
• The global schema describes the structure of the schema representing the whole universe of discourse.
• The mappings, or connections, describe how each element in the local schemas relates to the global schema (REMARK: mappings can be expressed in the two directions)

Typical architecture of a data integration system
[Figure: the architecture (user query, global schema, mediator, mapping, wrappers, local schemas 1..n, sources 1..n) drawn twice, once with the mappings expressed from the local schemas to the global schema and once with the mappings expressed from the global schema to the local schemas]

Definitions of global schema and mappings
• The global schema describes the structure of the schema representing the whole universe of discourse.
• The mappings, or connections, describe how each element in the local schemas relates to the global schema.
• Mappings can be expressed in the two directions.
• In summary, the essence of integration is to combine information in a logical way, so that it can be queried as a whole through a common interface.
• The schema of each information source needs to be connected through a mapping to the global schema of the common interface to enable querying.

Mediators (1)
User request: search the mail of professors whose research activities are in the "Database" area.
GLOBAL SCHEMA (query interface, global schema view): Full_professor (name, mail, area)
Mapping: the request is translated into one query per local schema.
– SOURCE 1, Professor (first_name, last_name, e-mail, area): Select e-mail From Professor Where area = "Database"
– SOURCE 2, Faculty_member (name, mail, research_topic): Select mail From Faculty_member Where research_topic = "Database"
The answers of the local sources are returned to the user as a single result set.
Wise 2009 – Poznan (PL), Università di Modena e Reggio Emilia & Milano Bicocca
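The rewriting on this slide can be tried out with a small sketch. Everything beyond the two local schemas and the two queries is an assumption: both sources are simulated as tables in one in-memory SQLite database, the rows are invented sample data, and MAPPINGS and mails_by_area are hypothetical names. Note that the column e-mail has to be quoted in SQL because of the hyphen.

```python
import sqlite3

# Both sources are simulated in a single in-memory database (an assumption;
# real sources would be separate systems reached through wrappers).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Professor (first_name TEXT, last_name TEXT, "e-mail" TEXT, area TEXT);
    CREATE TABLE Faculty_member (name TEXT, mail TEXT, research_topic TEXT);
    -- Invented sample rows
    INSERT INTO Professor VALUES ('Ada', 'Rossi', 'ada@example.org', 'Database');
    INSERT INTO Professor VALUES ('Ugo', 'Bianchi', 'ugo@example.org', 'Networks');
    INSERT INTO Faculty_member VALUES ('Lia Verdi', 'lia@example.org', 'Database');
""")

# Mapping: how the global attributes mail and area of
# Full_professor(name, mail, area) are expressed in each local schema.
MAPPINGS = [
    {"table": "Professor", "mail": '"e-mail"', "area": "area"},
    {"table": "Faculty_member", "mail": "mail", "area": "research_topic"},
]


def mails_by_area(area):
    """Rewrite the global request into one query per source and merge the answers."""
    results = []
    for m in MAPPINGS:
        sql = f'SELECT {m["mail"]} FROM {m["table"]} WHERE {m["area"]} = ?'
        results.extend(row[0] for row in conn.execute(sql, (area,)))
    return sorted(results)
```

Calling mails_by_area("Database") issues the two local queries shown on the slide and returns the merged list of addresses.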

Mediators (2)
• The mediator builds a unified schema of several (heterogeneous) information sources and allows a user to formulate a query on it
• The user query is transformed into a set of sub-queries, one for each data source involved in the query
• The results are collected by the mediator, merged and shown to the user
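The three steps can be sketched with two toy wrappers and a mediator. This is a sketch under assumptions rather than the architecture of any specific system: the classes DictWrapper, CsvWrapper and Mediator, the sample data and the choice of a CSV file as the second source are all hypothetical. Each wrapper answers a sub-query in terms of the global attributes name, mail and area.

```python
import csv
import io


class DictWrapper:
    """Wraps an in-memory source with first_name / last_name / e-mail / area records."""

    def __init__(self, records):
        self.records = records

    def query(self, area):
        return [{"name": f'{r["first_name"]} {r["last_name"]}',
                 "mail": r["e-mail"], "area": r["area"]}
                for r in self.records if r["area"] == area]


class CsvWrapper:
    """Wraps a semicolon-separated text source with name;mail;research_topic columns."""

    def __init__(self, text):
        self.text = text

    def query(self, area):
        rows = csv.DictReader(io.StringIO(self.text), delimiter=";")
        return [{"name": r["name"], "mail": r["mail"], "area": r["research_topic"]}
                for r in rows if r["research_topic"] == area]


class Mediator:
    """Sends one sub-query to each wrapper and merges the collected results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, area):
        merged = []
        for wrapper in self.wrappers:
            merged.extend(wrapper.query(area))
        return merged


# Invented sample data for the two sources.
SOURCE_1 = [{"first_name": "Ada", "last_name": "Rossi",
             "e-mail": "ada@example.org", "area": "Database"}]
SOURCE_2 = "name;mail;research_topic\nLia Verdi;lia@example.org;Database\n"

mediator = Mediator([DictWrapper(SOURCE_1), CsvWrapper(SOURCE_2)])
```

The merge step here is plain concatenation; removing duplicates and resolving conflicting values would be a separate fusion step.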

Functional architecture of a Data Integration system: the Mediator
– Provides users with a single virtual representation of the sources, given by the global schema
– Translates queries into fragments, which are sent to the wrappers
– Recomposes the results returned by the wrappers
– Performs data fusion and resolves value-level heterogeneities
[Figure: clients; Multi-DBMS containing the mediator and the wrappers; DBMS; databases]

Instance level heterogeneities

Mediators: object fusion and reconciliation
A mediator's main functionality is object fusion:
• group together information about the same real-world entity
• remove redundancy among the various data sources
• resolve inconsistencies among the various data sources
• achieve accuracy, completeness, currency (and other data quality dimensions) among data from different data sources
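A minimal sketch of object fusion, under stated assumptions: records are matched on a normalized mail address, and conflicts are resolved by letting the most recently updated non-null value win. Both the matching key and the resolution policy are illustrative choices; the slide does not prescribe any. The function name fuse and the updated field are hypothetical.

```python
from collections import defaultdict


def fuse(records, key="mail"):
    """Group records describing the same entity and merge each group into one record."""
    groups = defaultdict(list)
    for record in records:
        # Same real-world entity = same key after trivial normalization (assumption).
        groups[record[key].strip().lower()].append(record)

    fused = []
    for normalized_key, group in groups.items():
        group.sort(key=lambda r: r["updated"])  # oldest first
        merged = {}
        for record in group:
            for field_name, value in record.items():
                if value is not None:
                    merged[field_name] = value  # newer non-null values overwrite older ones
        merged[key] = normalized_key
        fused.append(merged)
    return fused
```

Two records for the same address thus collapse into one (redundancy removed), a missing value in one source is filled from the other (completeness), and a conflicting value is taken from the more recent record (currency).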

Functional architecture of a Data Integration system: the Wrapper
– Translates the request coming from the mediator into the logical/physical representation of the underlying local schema
[Figure: clients; DI system containing the mediator and the wrappers; DBMS; databases]

Mediators (3)
The interactions with a mediator may be divided into two phases:
1. The creation of the unified representation (Publishing phase, at design time)
2. The formulation and the execution of a query on the unified representation (Querying phase)

Functional architecture of an MDBS in our example
[Figure: clients; Multi-DBMS with a global schema made of the entities Professor (Professore), Course (Corso) and Student (Studente); mediator; wrappers; DBMS; databases]

Functional architecture of a mediator system: example
[Figure: clients; Multi-DBMS with mediator and wrappers; DBMS; databases with two local schemas. Recoverable entity labels in the local schemas: Student (Studente), Course (Corso), Professor (Professore), Module (Modulo)]

Virtual Integration Architecture including optimization functionality
[Figure: user queries are posed on the mediated schema; the mediator comprises a reformulator, an optimizer and an execution engine, together with a data source catalog; each data source is reached through a wrapper]
Sources can be relational, hierarchical (IMS), structured files, or web sites.

DI Systems and design time vs run time issues
• Publishing phase (or design time)
  – The global schema and the mappings must be defined starting from the source schemas
• Run time
  – Queries are executed
  – The global schema, the local schemas and the mappings are maintained

Mediators – relevant challenges
[Figure: challenges arranged by layer (User Interface, Mediator, Data Sources) and by phase (Publishing, Querying)]
• User Interface
  – Publishing phase: visualizing the unified schema
  – Querying phase: model and language for formulating queries
• Mediator
  – Publishing phase: model and language for representing the unified schema; matching and mapping the unified schema and the local sources; building the unified schema; managing updates
  – Querying phase: model and language for querying the schema; query unfolding / rewriting; data fusion and cleaning
• Data Sources
  – Publishing phase: schema extraction
  – Querying phase: query transformation and execution

[Figure: design time vs. run time. Design time: mediated schema, semantic mappings, wrappers. Run time: query reformulation, optimization & execution, wrappers]

Basic properties of a DI System
A system providing:
– Uniform (the same query interface to all sources)
– Access to (queries; possibly updates too)
– Multiple (we want many, but even 2 is hard)
– Autonomous (the DBA doesn't report to you)
– Heterogeneous (the data models are different)
– Structured (or at least semi-structured)
– Data Sources (not only databases)