1 Last update 17 October 2007 Advanced databases
1 Last update: 17 October 2007 Advanced databases – Defining and combining heterogeneous databases: Basics and overview Bettina Berendt Katholieke Universiteit Leuven, Department of Computer Science http: //www. cs. kuleuven. be/~berendt/teaching/2007 w/adb/ Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 1
Motivation: Price comparison engines search & combine heterogeneous travel-agency DBs, which seach & combine heterogeneous airline DBs 2 Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 2
3 Agenda Goals and challenges Global schema integration (short survey) Federated database systems An example: IBM’s DB 2 User sovereignty & multidatabase language approach Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 3
4 Multi database systems Multiple databases created for the same functionality n Different operating systems, data formats, query languages etc Typically DBs managed by DBMSs running on heterogeneous computing platforms Information sharing across dissimilar platforms n Interconnect previously isolated software systems (DBMS) n Not only invoke but also coordinate interactions Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 4
5 Interoperating with heterogeneous databases – requirements (1) n Distributed transparency l n Heterogeneity transparency l n users must access a number of different databases in the same way as accessing a single database. users must access other schemas in the same way they access their local database (using a familiar model and language). The existing database systems and applications must not be changed. Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 5
6 Interoperating with heterogeneous databases – requirements (2) n Addition of new databases must be easily accommodated into the system. n The databases have to be accessed both for retrievals and updates. n The performance of heterogeneous systems has to be comparable to the one of homogeneous distributed systems. Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 6
7 Autonomy and heterogeneity Interconnection and cooperation of autonomous and heterogeneous databases must address n Distribution n Autonomy n Heterogeneity Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 7
8 Heterogeneity n Heterogeneity is independent of location of data n When is an information system homogeneous? n l Software that creates and manipulates data is the same l All data follows same structure and data model and is part of a single universe of discourse Different levels of heterogeneity l Different languages to write applications l Different query languages l Different models l Different DBMSs l Different File systems l Semantic heterogeneity etc. Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 8
9 Autonomy Databases usually under separate and independent control Aspects of autonomy n Design autonomy: Local DBs chose their own data model, query language, interpretation of data etc. n Communication autonomy: Local DBs decide when and how to respond to other DB requests n Execution autonomy: Execution of local/external operations/transactions is not controlled by any external DBMS n Association autonomy: Local DBs can decide how much of their data/functions/operations to share with other classes of users n Another kind of autonomy: User autonomy / sovereignty ! Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 9
10 Balancing autonomy and heterogeneity Different degrees of autonomy: n No/little autonomy (intra corporate, poor networking infrastructure) n More of autonomy and flexible bridging of heterogeneity (federated approach) n Autonomy over heterogeneity (multi database language approach) Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 10
11 Interoperability The ability to request and receive services between the interoperating systems and use each others’ functionality. Systems considered interoperable if n They can exchange messages and requests n They can receive services and operate as a unit in solving a common problem Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 11
12 Heterogeneous Distributed Databases Information systems that provide interoperation and varying degrees of integration among multiple DBs are called n Multi database systems or n Federated (database) systems or n More generally, heterogeneous distributed database systems (HDDBSs) Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 12
13 Solutions to integrating HDDBSs Global Schema Integration Federated Database systems Multi database language approach Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 13
14 Agenda Goals and challenges Global schema integration (short survey) Federated database systems An example: IBM’s DB 2 User sovereignty & multidatabase language approach Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 14
15 Definition and advantages Global database integration: n Based on complete integration to provide a single view Advantages: n Consistent, uniform view of and access to data for users n Users unaware of existing multiple existing DBs Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 15
16 Disadvantages n Hard to automate creation of a global schema: structural, semantic or behavioral conflicts n Autonomy esp. association autonomy sacrificed: all local data and operations to be revealed n Loss of semantic information depending on how the schema integration is performed n Correctness of global schema is hard to prove: hard because of context dependent meanings n Error prone, time consuming n Unsuitable for frequent dynamic changes to schemas n Does not scale well with size of DB networks Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 16
17 Agenda Goals and challenges Global schema integration (short survey) Federated database systems An example: IBM’s DB 2 User sovereignty & multidatabase language approach Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 17
18 Taxonomy - based on autonomy DBS either centralized or distributed n Centralized: a single DBMS managing a single DB n Distributed: a single distributed DBMS managing multiple DBs MDBS supports operations on multiple DBs Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 18
19 What is a federated system? (check that not redundant) A federated system integrates existing, possibly heterogeneous, databases while preserving their autonomy*. The main difference between federated systems and traditional distributed systems is that in federated systems each component remains autonomous. Autonomy of a component system means that the local administrator maintains some control over his/her system. * A. P. Sheth and J. A. Larson. Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys, 22(3): 183 -236, 1990. Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 19
20 Definition n A Federated Database System (FDBS) is a collection of cooperating but autonomous component DBSs. n Aim: remove the need for static global schema integration n Allows each local DB to have more control over the shareable information n Control is decentralized n Integration need not be complete but depends on needs of users n More terminology: FDBMS = the software that controls, coordinates the component DBSs of an FDBS Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 20
21 FDBS coupling Loosely coupled FDBS n If user’s responsibility to create and maintain the federation. No control enforced by the federation admin. Tightly coupled FDBS n If federation admin have responsibility for creating and maintaining the federation and actively controlling access to the component DBSs. Association autonomy of the individual component DBs still exists Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 21
22 FDBs as a compromise Compromise between n no integration in which users must explicitly interface between multiple autonomous DBs AND n Total integration in which autonomy of each component DBS is sacrificed so that users can access data through a single global interface but not as a local user Support local and global (federated) operations Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 22
A FDBS and its components – cooperation among independent systems 23 Can continue local operations and participate in more than 1 federation. Can be (de/) centralized or another FDBMS Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 23
24 Basic system components of the data management architecture Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 24
25 FDBSs Schemas Local schema n Conceptual schema of a component DB Component schema n Local schema translated to a common data model of the FDBS. Alleviates data model heterogeneity. Export schema n Specify shareable objects to other members or classes of members of the FDBS. Federated schema n A statically integrated schema or dynamic view of multiple export schemas. Can be multiple federated schemas. External schema n For customization when the federated schema is large and complicated. Another level of abstraction for class of users for example. Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 25
26 The systme of schemas needs to be extended Five level schema architecture of a FDBS Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 26
27 Loosely coupled FDBSs n User creates and maintains federation schema n Creating schema corresponds to creating a view against relevant export schemas n Therefore, each user must be aware of information and structure of the export schemas n Hard to support view updates – therefore, assume highly autonomous read-only DBs Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 27
28 Loosely coupled FDBSs - Advantages n Flexibility of different interpretations possible for same federated schema n Easier to cope with dynamic changes in schemas since it is easier to create views. Detection of changes is however expensive. Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 28
29 Loosely coupled FDBSs - Disadvantages n Duplicated effort in creation of similar federated schemas. n Difficulty in understanding the semantics of schemas available to the user. n Due to possible multiple view creations, view updating cannot be supported. Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 29
30 Tightly coupled FDBSs n Aim: provide location, replication and distribution transparency n Federation administrators have full control over creation and maintenance of federated schemas and access to other export schemas n Single federated schema same as global schema but view updates possible if administrators understand the mappings. Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 30
31 Tightly coupled FDBSs – Disadvantages n FDBS administrator and component DBSs negotiate creation of export schemas during which adm. has complete read access to component schema and/or data. Violates autonomy n Change in export/component schemas imply redoing federated schema creation. Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 31
32 Basic system components of the data management architecture Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 32
33 Processors in a FDBS n Transforming processors l n Filtering processors l n Uses mappings to transform commands from internal command language to local query language etc. Uses access control specified in export schema to limit allowable operations submitted to corresponding component schemas Constructing processors l Performs query decomposition and merges data Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 33
34 System architecture of an FDBS – schemas and processors Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 34
35 Data integration approaches n Multidatabase languages and declarative integration languages l n Conceptual-level abstraction from data sources l n Single-ontology (global) or multi-ontologies Semantic Web approaches l n Enable user to express specific views and ways to compose integrated data objects Ontology-based integration approaches l n Data integration performed on top of this conceptual layer Object-oriented virtual integration approaches l n Collective identifiers, semantic variables, virtual classes that form a global schema Ontology-based Taxonomic database systems l Support multiple, overlapping classifications in centralized, nonintegrated DB systems Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 35
36 Examples of integration approaches (1) Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 36
37 Examples of integration approaches (2) Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 37
38 Examples of integration approaches (3) Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 38
39 Agenda Goals and challenges Global schema integration (short survey) Federated database systems An example: IBM’s DB 2 User sovereignty & multidatabase language approach Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 39
40 Background n n n Garlic l research project l wrapper architecture ( virtual integration) l start from standard relational database, extend language and data model to support some object-oriented features l cross-source query optimization DB 2 Data. Joiner l commercial system l combine multiple heterogeneous relational sources l focus on query optimization DB 2 (see Haas et al. 2002) l incorporates ideas of both Garlic and Data. Joiner l user-defined fucntions to "federate" simple data sources Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 40
41 DB 2 architecture for database federation Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 41
42 Styles of federation n Scalar UDFs (user-defined functions) l Input: data from surrounding SQL statement l Output: a single scalar result Can federate function (combine data from one source with a function provided by another, in a single statement) n Table UDFs l Input: as in scalar UDFs l Output: a table Can federate data l n Note: UDFs can also be used to access Web services Wrappers: Federate function and data l A wrapper transforms an external data source to table form l This data source / table is then identified by a nickname (and can be queried like a „normal“ local table) l Wrappers for a variety of relational and non-relational sources are supplied (e. g. , Oracle, Excel, XML) l + a toolkit for developing wrappers for other data sources Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 42
43 Examples: Scalar UDFs n n Send a message to an MQSeries queue : db 2 mq. mqsend() l Built-in function l [MQSeries: a middleware that allows the exchange of messages between independent applications; all messages are transferred via this queue] Send a message with database content to the client application: SELECT db 2 mq. mqsend(a. headline) FROM Articles a WHERE a. article_timestamp >= CURRENT TIMESTAMP Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 43
44 Examples: Table UDFs Data source: address book in a Lotus Notes database SELECT a. first, a. last, a. phone, a. email FROM TABLE (addressbook( )) AS a, Company_Profiles c WHERE c. industry ‘FINANCIAL’ AND c. revenue > 50, 000 AND c. name = a. company_name Data source: local file system SELECT f. filename, f. author, f. last_modified_date FROM TABLE (dir(‘laurapapers’, ‘. pdf’)) AS f WHERE f. last_modified_date ‘ 07/04/2002’ Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 44
45 Using wrappers to integrate different relational databases (overview) Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 45
Using wrappers to integrate different relational databases (sample queries) 1. Register nicknames for transactions from 2 company branches: sf. Transactions, ny. Transactions 2. Create federated view 46 CREATE VIEW National_Transactions (store_id, tran_date, tran_id, item_id) AS SELECT store_id, tran_date, tran_id, item_id FROM sf. Transactions UNION ALL SELECT store_id, tran_date, tran_id, item_id FROM ny. Transactions 3. Generate a national sales report SELECT MONTH(tran_date), item_id, COUNT(*) FROM National_Transactions WHERE YEAR(tran_date)=2001 GROUP BY MONTH(tran_date), item_id NB: Can also generate materialized views (cache information locally): CREATE TABLE. . . AS. . . Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 46
47 Federation of nonrelational structured data (overview) (A single XML document may be mapped to multiple nicknames) Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 47
48 Federation of nonrelational structured data (sample query) Excel spreadsheets nicknames: Items, Suppliers SELECT i. mfg, s. id FROM Items i, Suppliers s WHERE i. id = s. id AND i. id = (SELECT g. id FROM (SELECT g. id, COUNT(*), ROWNUMBER( ) OVER (ORDER BY COUNT(*) DESC) AS rownum FROM National_Transactions g, Items it WHERE it. cat=‘television’ AND g. id = it. id AND YEAR(tran_date)=2001 GROUP BY g. id) AS tv_total_2001 WHERE rownum = 1) Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 48
49 Agenda Goals and challenges Global schema integration (short survey) Federated database systems An example: IBM’s DB 2 User sovereignty & multidatabase language approach Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 49
50 Autonomy of data sources – autonomy and sovereignty of users n Autonomy of data sources is valued highly l n Degree to which a local data source can operate independently must not be reduced by the integration system But what about the autonomy of data receivers? l Human users and applications l Autonomous: have different information needs, vary in the ways they perceive their domain of interest l Using integrated data should be non-intrusive: users should not be forced to adapt to any standard concerning structure and semantics of data they desire Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 50
51 The ASME criteria for evaluating data integration approaches n Abstraction l n Selection l n the possibility of user-specific selection of data and data sources for individual integration Modeling l n Shield users from low-level heterogeneities and underlying data sources The availability of means to incorporate user-specific ways to perceive a domain of interest for which integrated data is desired in the process of data integration Explicit semantics l Means for explicitly representing the real-world semantics of data è Do different approaches realize these (or not)? è Can we „have it all“? Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 51
52 Evaluation results (1) Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 52
53 Evaluation results (2) Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 53
54 Evaluation results (3) Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 54
55 Conclusion: Current state of Multi database language approach – disadvantages and future work needed Lack of distribution and location transparency for users. Users responsible for n finding relevant DBs, n understanding schemas, n detecting and resolving semantic conflicts n performing view integration Some support offered by the language constructs “abstracting the user from technical-level issues and supporting userspecific data selection and modeling are conflicting goals” (Ziegler 200, p. 6) Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 55
56 What about user sovereignty like this? n „Yahoo! Pipes is an interactive data aggregator and manipulator that lets you mashup your favorite online data sources. n Like Unix pipes, simple commands can be combined together to create output that meets your needs: l combine many feeds into one, then sort, filter and translate to create your ultimate custom feed. l remix your favorite data sources and use the Pipe to power a new application. l . . . “ (http: //pipes. yahoo. com/pipes/) Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 56
57 Next lecture Goals and challenges Global schema integration (short survey) Federated database systems An example: IBM’s DB 2 User sovereignty & multidatabase language approach Schema integration Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 57
58 References / background reading; acknowledgements n - - Slides 2 -35 are based on l Meena Nagarajan (2006). Federated database systems. Part I. l http: //lsdis. cs. uga. edu/~meena/Spring 06/ADB/Federated%20 Database%20 Systems. ppt - – which in turn reports the classic survey paper - Amit P. Sheth, James A. Larson: Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Comput. Surv. 22(3): 183 -236(1990) - (available for example at http: //www. cs. auc. dk/~tbp/Teaching/DAT 5 E 00/sheth. pdf ) The slides on DB 2 are based on the paper - Haas, L. M. , Lin, E. T. , & Roth, M. A. (2002). Data integration through database federation. IBM Systems Journal, 41(4), 578 -596 - http: //researchweb. watson. ibm. com/journal/sj/414/haas. pdf The slides on user sovereignty and slides 35 -38 are based on the paper - Ziegler, P. (2004). User-specific semantic integration of heterogeneous data: What remains to be done? IFI, University of Zurich, Technical Report ifi-2004. 01 - ftp: //ftp. ifi. unizh. ch/pub/techreports/TR-2004/ifi-2004. 01. pdf p. 40: Garlic: M. Tork Roth, P. Schwarz, and L. Haas, “An Architecture for Transparent Access to Diverse Data Sources, ” Component Database Systems, K. R. Dittrich, A. Geppert, Editors, Morgan-Kaufmann Publishers, San Mateo, CA (2001), pp. 175– 206. Data. Joiner: IBMCorporation, Data. Joiner, http: //www. software. ibm. com/datajoiner Berendt: Advanced databases, winter term 2007/08, http: //www. cs. kuleuven. ac. be/~berendt/teaching/2007 w/adb/ 58
- Slides: 58