Research Topics in Computing Data Modelling for Data

  • Slides: 41
Download presentation
Research Topics in Computing Data Modelling for Data Schema Integration 1 March 2005 David

Research Topics in Computing Data Modelling for Data Schema Integration 1 March 2005 David George Data Integration

Modelling & Data Integration Key Elements of today’s Presentation David George n Key Drivers

Modelling & Data Integration Key Elements of today’s Presentation David George n Key Drivers for Data Integration n Dimensions and Issues in Integration n Three Integration Approaches Data Integration 2

Drivers for Data Integration David George Data Integration 3

Drivers for Data Integration David George Data Integration 3

Drivers for Data Integration n Organisations evolving as global entities with distributed data. Systems

Drivers for Data Integration n Organisations evolving as global entities with distributed data. Systems characterised by mix of legacy and new databases and applications. Organisational change : n n n David George (1) Organic growth – size and diversity. Business re-engineering. Corporate mergers and acquisitions. Data Integration 4

Drivers for Data Integration n n Organisations evolved as collections of distinct, autonomous departments

Drivers for Data Integration n n Organisations evolved as collections of distinct, autonomous departments with disconnected systems e. g. in financial services. Trends in Business Intelligence initiatives: n n David George (2) Decision-making support. Customer segmentation. Marketing strategies. Development of distributed or multidatabase systems. Data Integration 5

Dimensions and Issues in Integration David George Data Integration 6

Dimensions and Issues in Integration David George Data Integration 6

Architecture & Design Issues n Multidatabase systems can be classified in two ways: n

Architecture & Design Issues n Multidatabase systems can be classified in two ways: n Homogeneous systems – local databases having same techniques and language. n n Heterogeneous systems – local databases demonstrating diverse data models and language. Key Dimensions in systems heterogeneity n n David George System heterogeneity – hardware, OS, DBMS Semantic heterogeneity - models and data Data Integration 7

Why Heterogeneity/Conflict? Design >> >>>> << Check <<<< Translating conceptualisations of the real world

Why Heterogeneity/Conflict? Design >> >>>> << Check <<<< Translating conceptualisations of the real world into database world representations David George Data Integration 8

Research Work Conceptualised n Books Model (a) The data of interest is about Books,

Research Work Conceptualised n Books Model (a) The data of interest is about Books, their Publishers and adopting Universities. n Publications Model (b) The data of interest is about Publications and their Types David George Data Integration 9

Books Name Title Publisher Name Published by Address Book Adopted by City University Refer

Books Name Title Publisher Name Published by Address Book Adopted by City University Refer to Name Topics -------------------------- Publications Title Publication Code Publisher contains Word Keywords David George Data Integration Code Research Area 10

A Name Title Publisher Name Published by Address Book Adopted by City University Refer

A Name Title Publisher Name Published by Address Book Adopted by City University Refer to Name Topics B -------------------------Title Publisher Name Published by Publication Code contains Word Name Keywords Topics David George Data Integration Code Research Area 11

Books and Publications Integrated Name Publisher Name Title Address Published by Book Adopted by

Books and Publications Integrated Name Publisher Name Title Address Published by Book Adopted by City University Refer to Name Publication contains Topics Code Research Area Title David George Code Data Integration 12

Semantic Heterogeneity/Conflict Structural Conflicts n Generalisation versus Specialisation Conflicts. n Entity versus attributes. n

Semantic Heterogeneity/Conflict Structural Conflicts n Generalisation versus Specialisation Conflicts. n Entity versus attributes. n Naming conflicts. Attribute (Domain) Conflicts n Data Type conflicts. n Measure and Scale conflicts. n Integrity, Presence & Absence. n Data Values David George Data Integration 13

Semantic Heterogeneity/Conflict n Generalisation/Specialisation Conflicts. (i. e. Structural) n Naming conflicts. n Synonyms e.

Semantic Heterogeneity/Conflict n Generalisation/Specialisation Conflicts. (i. e. Structural) n Naming conflicts. n Synonyms e. g. Customer n Homonyms e. g. Market (Products) vs vs Data Integration Client Market (Customers) 14

Semantic Heterogeneity/Conflict n Data Type (representation) conflicts. n n n - 26254006 (integer or

Semantic Heterogeneity/Conflict n Data Type (representation) conflicts. n n n - 26254006 (integer or string) No vs Name (integer or string) Measure and Scale etc conflicts. n n n David George Student Dimension Measure Scale Precision Date - volume vs weight light years vs miles vs kilometres 1: 100 versus A: E dd/mm/yyyy vs mm-dd-yy ? ? ? Data Integration 15

Semantic Heterogeneity/Conflict n Integrity Constraints e. g. Age Range <21 vs Age >18 n

Semantic Heterogeneity/Conflict n Integrity Constraints e. g. Age Range <21 vs Age >18 n Referential conflict 1: 1 vs 1: M (e. g. 1 invoice for 1/ M orders) n n Presence/Absence. n n n Data Values n David George No null, nulls – e. g. optional No corresponding attribute Same items different values Data Integration 16

Integration Approaches David George Data Integration 17

Integration Approaches David George Data Integration 17

Integration Approaches n Federated Database (Multidatabase) Systems. n Data Warehouse (Materialised in house) Systems.

Integration Approaches n Federated Database (Multidatabase) Systems. n Data Warehouse (Materialised in house) Systems. n Mediators (Virtual integration) Systems. David George Data Integration 18

Federated Database Systems David George Data Integration 19

Federated Database Systems David George Data Integration 19

Federated Databases David George Data Integration (1) 20

Federated Databases David George Data Integration (1) 20

Federated Databases n A Class of heterogeneous databases that: n n Consist of both

Federated Databases n A Class of heterogeneous databases that: n n Consist of both new and old systems. Previously existed in their own stand-alone (autonomous) environments. Integration is a consequence of distribution. Organisation can adopt different architectures i. e. the way databases are mapped together: n n David George (2) Loosely Coupled integrations. Tightly Coupled integrations. Data Integration 21

Federated Databases n Tightly Coupled Federations n n David George (3) Federation administrator determines

Federated Databases n Tightly Coupled Federations n n David George (3) Federation administrator determines schema view for all component systems in the federation. Negotiates export schemas (tables and attributes) from federation participants who control exports of local schemas. n Local schema exports integrated as a federated schema. n Less autonomy at federation user level for view creation. Data Integration 22

Federated Databases n Loosely Coupled Federations n The federated component databases have a greater

Federated Databases n Loosely Coupled Federations n The federated component databases have a greater degree of autonomy. n No central schema view is imposed on users. n Federated user is effectively an administrator creating views. n David George (4) User employs a MDB Query Language (v TC schema integration). Data Integration 23

Federated Databases n n n David George (5) Sharing is made explicit by allowing

Federated Databases n n n David George (5) Sharing is made explicit by allowing export schemas from the local or component database. The export schemas are imported to the federation to represent the shareable federated database. Each source can call on others for information. FDBMSs differ from homogeneous Distributed DBMSs – they use the same data model and DBMS. DDBMSs sharing is therefore implicit. Data Integration 24

Data Warehousing Systems David George Data Integration 25

Data Warehousing Systems David George Data Integration 25

Data Warehousing Network Internet (1) Local Operational -------------------------- Integration & Storage R 2 David

Data Warehousing Network Internet (1) Local Operational -------------------------- Integration & Storage R 2 David George R 3 Data Integration Warehouse Decision Support & Mining 26

Data Warehousing n n Represents the physical separation of operational and decision support environments.

Data Warehousing n n Represents the physical separation of operational and decision support environments. Operational data provides the raw material for: n n David George (2) Decision support systems. Data-mining (DM). E. g. identifying trends or characteristics. DM = process of “non-trivial extraction of implicit, previously unknown, and potentially useful information”. Data Integration 27

Data Warehousing (3) n Warehouse integrates multiple, heterogeneous data sources - e. g. Relational

Data Warehousing (3) n Warehouse integrates multiple, heterogeneous data sources - e. g. Relational DBs, flat files. n Data is pre-fetched into a central or intermediate warehouse repository by mediation process. n Data is “cleaned” and data integration techniques applied e. g. filtered, joined or aggregated. n n David George Data may be transformed to conform to the warehouse schema. Provides consistency in naming conventions, data structures, attributes, etc. Data Integration 28

Data Warehousing n n (4) Data then stored (materialised) in warehouse repository – possibly

Data Warehousing n n (4) Data then stored (materialised) in warehouse repository – possibly in separate data marts. Result is a repository of synthesised data for management decision-making. n Queries are made over the repository’s global schema. n Information is independent from the source data. n Data extraction tends to be periodically. David George Data Integration 29

Mediator (+Wrapper) Systems David George Data Integration 30

Mediator (+Wrapper) Systems David George Data Integration 30

Mediator Systems Network (1) Internet Query Translation Mediator David George Data Integration 31

Mediator Systems Network (1) Internet Query Translation Mediator David George Data Integration 31

Mediator Systems (2) n Global schema created and mapped to the source schemas. n

Mediator Systems (2) n Global schema created and mapped to the source schemas. n User makes queries over global, mediated schema. n Mappings can be either: n Global-as-view (GAV). n Local-as-view (LAV). n Mediator translates global schema query and reformulates it into sub-queries of local schemas. n Wrappers execute and return. David George Data Integration 32

Mediator Systems n n David George (3) Wrappers standardise how source information is described

Mediator Systems n n David George (3) Wrappers standardise how source information is described and accessed (i. e. they translate or adapt). Query answers are returned to the user on demand – after sources are interrogated. Thus data is always up-to-date (v. Warehousing). Mediators integrate information view, without integrating the source data. Data Integration 33

Mediator Systems n n David George (4) Results in a homogeneous information source using

Mediator Systems n n David George (4) Results in a homogeneous information source using views - based on the mediated (global) schema. Integration is virtual i. e. retrieved by the mediator but not stored in any central repository. Differs from Warehousing Queries – where made to materialised data. In short – provides virtual source schema integration via schema mapping and integrated view. Data Integration 34

Comparisons David George Data Integration 35

Comparisons David George Data Integration 35

Federation versus Warehousing & Mediation n n David George Federation represents a more “static”

Federation versus Warehousing & Mediation n n David George Federation represents a more “static” approach – using agreed couplings to allow view creation. Warehousing and Mediation addresses integration in a more “dynamic” way – using extraction, transformation and integration processes. Data Integration 36

Warehousing vs. Mediation n Warehouse: n n n Heterogeneous data is integrated in advance

Warehousing vs. Mediation n Warehouse: n n n Heterogeneous data is integrated in advance and stored inhouse for direct query and analysis. Mediation: n n n David George Update-driven: i. e. in warehouse repository Wrapper and Mediator layer on top of source DBs. Query-driven: Query to mediated schema then translated into queries appropriate to sources. Results integrated into a global answer set. Data Integration 37

Summary David George Data Integration 38

Summary David George Data Integration 38

Summary n Drivers for Data Integration n Integration Issues n n n Different Conceptual

Summary n Drivers for Data Integration n Integration Issues n n n Different Conceptual Model representations. Resulting Semantic Heterogeneities. Integration Approaches n n David George Organisational change. Business Intelligence and Strategies. Federated Systems. Data Warehousing and Mediator Systems. Data Integration 39

Next step …… David George Data Integration 40

Next step …… David George Data Integration 40

Research Resources Reference Material n n n Journals Books Presentation slides UCLAN Website n

Research Resources Reference Material n n n Journals Books Presentation slides UCLAN Website n Internal: http: //janus/dgeorge/integration/journals. asp n External: http: //www. janus. computing. uclan. ac. uk/dgeorge/integration/journals. asp David George Data Integration 41