Resolving Structural Conflicts in the Integration of XML Schemas: A Semantic Approach
Xia Yang, Mong Li Lee, Tok Wang Ling
National University of Singapore

Outline
• Introduction
• Background
• Preliminaries
• Motivating Example
• Integration Algorithm
• Related work
• Conclusion

Introduction
• Recent research in integrating XML data sources has mainly concentrated on schema matching.
• XML Schema and DTD lack semantics.
• The source schemas are heterogeneous and contain various conflicts, including naming conflicts, cardinality conflicts, and structural conflicts.

Background
• Most work on the integration of XML has focused on the matching problem: finding equivalent elements among the different sources.
• LSD [4] applies machine learning techniques based on instance information in its integration work.
• E. Jeong and C.-N. Hsu [7] use schema learning to generate a set of tree grammar rules from the DTDs in a class, then optimize the rules and transform them into an integrated view.

[4] A. Doan, P. Domingos, A. Levy. Learning Source Descriptions for Data Integration. WebDB, 2000.
[7] E. Jeong, C.-N. Hsu. Induction of Integrated View for XML Data with Heterogeneous DTDs. ACM CIKM, 2001.

Background (cont.)
• E. Jeong and C.-N. Hsu [7]:
  - DTD clustering groups DTDs in similar domains into classes.
  - Schema learning applies a tree grammar inference technique to generate a set of tree grammar rules from the DTDs in a class produced by the previous step.
  - Minimization optimizes the rules generated in the previous step and transforms them into an integrated view.

Background (cont.)
• These approaches lack semantic information, which may lead to a wrong integrated schema.
• None of them takes into consideration the importance of the individual data sources or how the majority of the local schemas model their data.
• They are binary strategies, which cannot take the importance of sources into consideration.

Preliminaries
• ORA-SS Model
  - The ORA-SS model (Object-Relationship-Attribute model for Semi-Structured data) is a semantically rich data model designed for semi-structured data [5].
  - The ORA-SS model distinguishes between objects, relationships and attributes.

[5] G. Dobbie, X. Wu, T. W. Ling, M. L. Lee. ORA-SS: An Object-Relationship-Attribute Model for Semistructured Data. Technical Report TR 21/00, National University of Singapore, 2000.

Preliminaries (cont.)
[Figure: ORA-SS schema diagram example]

Preliminaries (cont.)
• Assumptions for the algorithm
  - The input to the proposed integration algorithm is a set of ORA-SS schemas with source weights. The output of the algorithm is an integrated schema, also modeled in ORA-SS.
  - The integrated schema should contain all the information modeled in the original schemas. Further, the integrated schema should be as simple and concise as possible to facilitate users' understanding.
  - For meaningful integration to occur, we assume that the various sources model similar domains. Object classes and relationship sets with the same label name are considered to be semantically equivalent. Attributes of the same object class (or relationship set) with the same label name are also semantically equivalent.

Motivating Example
[Figure: (a) Schema S1, sw1 = 1; (b) Schema S2, sw2 = 1; (c) Schema S3, sw3 = 7; (d) Schema S4, sw4 = 1]
The sw_i under each schema indicates the source weight, i.e., the importance of the source. It is determined by users or computed from statistical information.

Motivating Example
• A. Resolve attribute-object class conflict.
• B. Resolve generalizations and specializations.
• C. Merge the schemas to obtain an integrated graph.
• D. Transform integrated graph to resolve structural conflicts and remove redundancy.
• E. Augment graph with attributes.

A. Resolve attribute-object class conflict.
• This occurs when a concept has been modeled as an attribute in one schema and as an object class in another schema.
• This conflict can be resolved by transforming the attribute into an object class.

A. Resolve attribute-object class conflict. (cont.) (example)
[Figure: (a) Schema S1, sw1 = 1; (b) Schema S2, sw2 = 1; (c) Schema S1']
Attribute “project manager” in schema S1 has been transformed into an object class “project manager” in S1'.
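The transformation from S1 to S1' can be sketched in Python as follows. This is only an illustration under assumptions: the dictionary representation (each object class maps to a list of attributes and a list of child object classes), the function name `promote_attribute`, and the extra attribute `name` are hypothetical and do not come from the slides.

```python
from copy import deepcopy

def promote_attribute(schema, owner, attribute):
    """Return a copy of `schema` in which `attribute` of object class `owner`
    is modeled as a child object class instead of an attribute."""
    result = deepcopy(schema)
    result[owner]["attributes"].remove(attribute)
    result[owner]["children"].append(attribute)
    # The new object class starts with no attributes or children of its own.
    result.setdefault(attribute, {"attributes": [], "children": []})
    return result

# Hypothetical fragment of S1: "project manager" is an attribute of "project".
s1 = {"project": {"attributes": ["name", "project manager"], "children": []}}
s1_prime = promote_attribute(s1, "project", "project manager")
```

The original schema is left untouched, so S1 and S1' can both be inspected after the step.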

B. Resolve generalizations and specializations.
• A generalization exists when an object class in one schema is the union of several object classes in another schema.
• The integrated schema will include the generalization (isa) hierarchy.

B. Resolve generalizations and specializations. (cont.) (example)
[Figure: Schema S1, sw1 = 1; Schema S4, sw4 = 1]
Build a generalization hierarchy from part of S1, which is used in the next step to generate an integrated graph.

C. Merge the schemas to obtain an integrated graph.
• Each node in the graph denotes an object class, and edges represent the relationship sets among the object classes.
• To facilitate processing, attributes are first omitted from the integrated graph. They are incorporated later into the final integrated schema.
• Compute the edge weight.

C. Merge the schemas to obtain an integrated graph. (cont.)
• Compute the edge weight.
  - For each original source, the edge weight is the source weight multiplied by the number of relationship sets involved in this edge.
  - The edge weight in the integrated graph is the sum of the edge weights of this edge over all original sources.
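The two rules above can be sketched in Python. The input format (a list of `(source_weight, edges)` pairs, where `edges` maps a `(parent, child)` pair to the number of relationship sets using that edge) and the function name are assumptions made for illustration; the sample sources are hypothetical and are not the schemas from the slides.

```python
from collections import defaultdict

def integrated_edge_weights(sources):
    """sources: list of (source_weight, edges), where edges maps a
    (parent, child) pair to the number of relationship sets that use it."""
    weights = defaultdict(int)
    for source_weight, edges in sources:
        for edge, num_relationship_sets in edges.items():
            # Per-source weight: source weight times number of relationship sets.
            weights[edge] += source_weight * num_relationship_sets
    return dict(weights)

# Hypothetical sources with weights 2 and 3.
sources = [
    (2, {("a", "b"): 1, ("b", "c"): 1}),
    (3, {("a", "b"): 2}),
]
weights = integrated_edge_weights(sources)
# Edge ("a", "b"): 2*1 + 3*2 = 8; edge ("b", "c"): 2*1 = 2.
```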

C. Merge the schemas to obtain an integrated graph. (cont.) (example)
[Figure: (a) Schema S1, sw1 = 1; (b) Schema S2, sw2 = 1; (c) Schema S3, sw3 = 7; (d) Schema S4, sw4 = 1]

C. Merge the schemas to obtain an integrated graph. (cont.)
• Example of edge weight
  - Since “project” is the parent of “project manager” in schemas S1 and S4, the weight of the edge from “project” to “project manager” is the sum of the weights of these schemas, that is, 1 + 1 = 2.
  - Since “project” is the parent of “staff” in schema S3 only, the weight of this edge is 7.
  - Since the edge from “project” to “supplier” in S3 is involved in two relationship sets, js and jsp, its edge weight is 7 × 2 = 14.
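The same computation can be written as a formula. The symbols $w(e)$ for the integrated edge weight and $r_i(e)$ for the number of relationship sets in source $S_i$ that involve edge $e$ are notation introduced here for illustration; only $sw_i$ comes from the slides.

```latex
\[
  w(e) = \sum_{i} sw_i \cdot r_i(e)
\]
\begin{align*}
  w(\text{project} \to \text{project manager}) &= sw_1 \cdot 1 + sw_4 \cdot 1 = 1 + 1 = 2 \\
  w(\text{project} \to \text{staff}) &= sw_3 \cdot 1 = 7 \\
  w(\text{project} \to \text{supplier}) &= sw_3 \cdot 2 = 7 \cdot 2 = 14
\end{align*}
```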

C. Merge the schemas to obtain an integrated graph. (cont.) (example)
[Figure: integrated graph obtained from the schemas of the Motivating Example]

D. Transform integrated graph to resolve structural conflicts and remove redundancy.
• D-1. Differentiate semantically different relationship sets among equivalent object classes.
• D-2. Remove relationship sets that are projections of higher degree relationship sets.
• D-3. Resolve ancestor-descendant conflicts.
• D-4. Remove transitive relationship sets.
• D-5. Remove other types of multiple parent nodes.

Multiple parent nodes
• If a node has more than one incoming edge in an integrated graph, it is called a multiple parent node.
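A small Python sketch of this definition, assuming the integrated graph is given as a list of `(parent, child)` edges (one entry per edge, so parallel edges from different relationship sets count separately). The function name and the sample edges are hypothetical.

```python
from collections import Counter

def multiple_parent_nodes(edges):
    """Return the nodes with more than one incoming edge, given an
    iterable of (parent, child) pairs."""
    incoming = Counter(child for _parent, child in edges)
    return {node for node, count in incoming.items() if count > 1}

# Hypothetical edges: "staff" has two incoming edges.
edges = [("project", "staff"), ("department", "staff"), ("project", "supplier")]
result = multiple_parent_nodes(edges)
```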

D-1. Differentiate semantically different relationship sets among equivalent object classes.
• If the relationship sets among the same object classes are semantically different, duplicate the nodes and create foreign key-key references in the integrated graph. Move the object classes involved in an n-ary (n > 2) relationship set.

D-1. Differentiate semantically different relationship sets among equivalent object classes. (cont.) (example)
[Figure: Schema S5; Schema S6; integrated graph G56; transformed graph G56']
ph1 and ph2 are semantically different, so G56 is transformed into G56'.

D-2. Remove relationship sets that are projections of higher degree relationship sets.
• A schema may model a relationship set that is a projection of a relationship set in another schema.
• Keep the complete relationship set in the integrated schema.

D-2. Remove relationship sets that are projections of higher degree relationship sets. (cont.) (example)
[Figure: Schema S1; Schema S3; part of integrated graph G13; part of transformed graph G13']
Condition: jp is a projection of jsp.

D-3. Resolve ancestor-descendant conflicts.
• An ancestor-descendant conflict arises when one schema models an object class A as an ancestor of object class B, and another schema models B as an ancestor of A. Such conflicts appear as cycles in the integrated graph.
• The simplest form of this conflict is the parent-child conflict.
• In an ancestor-descendant conflict cycle, remove the edge with the smallest edge weight, provided its relationship set can be derived from the other relationship sets in the cycle. If two edges share the smallest edge weight, remove either one.

D-3. Resolve ancestor-descendant conflicts. (cont.) (example)
[Figure: parent-child conflict]
Parent-child conflict: the edge from part to supplier is removed.
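The rule for ancestor-descendant conflicts can be sketched in Python. This is one possible reading of the rule, not the authors' implementation: it finds a cycle by depth-first search and removes the lightest edge among those a caller-supplied predicate `derivable` marks as derivable. The predicate, the function names, and the edge weights in the example are all assumptions, since the slides do not give weights for the part/supplier case.

```python
def find_cycle(edges):
    """Return one directed cycle as a list of (parent, child) edges, or None."""
    graph = {}
    for parent, child in edges:
        graph.setdefault(parent, []).append(child)
        graph.setdefault(child, [])
    state = {}  # node -> "active" (on the current DFS path) or "done"
    path = []

    def visit(node):
        state[node] = "active"
        path.append(node)
        for nxt in graph[node]:
            if state.get(nxt) == "active":
                cycle_nodes = path[path.index(nxt):] + [nxt]
                return list(zip(cycle_nodes, cycle_nodes[1:]))
            if nxt not in state:
                found = visit(nxt)
                if found:
                    return found
        state[node] = "done"
        path.pop()
        return None

    for node in graph:
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None


def resolve_cycles(weights, derivable):
    """weights: dict mapping (parent, child) -> edge weight.
    derivable: assumed user-supplied predicate telling whether an edge's
    relationship set can be derived from the others in its cycle."""
    weights = dict(weights)
    while True:
        cycle = find_cycle(weights)
        if cycle is None:
            return weights
        candidates = [edge for edge in cycle if derivable(edge)]
        if not candidates:
            raise ValueError(f"cycle {cycle} needs manual resolution")
        # Remove the lightest removable edge; ties are broken arbitrarily.
        del weights[min(candidates, key=weights.get)]


# Hypothetical weights for the parent-child conflict between supplier and part.
graph = {("supplier", "part"): 8, ("part", "supplier"): 1}
resolved = resolve_cycles(graph, derivable=lambda edge: True)
```

Raising an error when no edge in a cycle is derivable is a design choice of this sketch; the slides do not say what happens in that case.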

D-3. Resolve ancestor-descendant conflicts. (cont.) (example)
[Figure: Schema S7; Schema S8; integrated graph G78; transformed graphs G78''(a) and G78''(b)]
• Case 1: sd can be derived from dh and hs, and sw7 = 2 and sw8 = 1.
• Case 2(a): hs can be derived from sd and dh, and sw7 = 1 and sw8 = 2.
• Case 2(b): dh can be derived from hs and sd, and sw7 = 1 and sw8 = 2.

D-4. Remove transitive relationship sets and redundant object classes.
• If a relationship set from object class A to object class B can be derived from relationship sets that lead from A through other object classes to B, it is called a transitive relationship set.
• Transitive relationship sets are redundant and can be removed to keep the integrated graph concise, provided the intermediate node has attributes or other sub-object classes.
• If the intermediate node has no attributes and no other sub-object classes, it is considered a redundant object class.
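A structural first pass for this step can be sketched in Python. The sketch only finds candidate edges, that is, edges (A, B) for which another path from A to B exists through at least one intermediate node. Whether the relationship set is really derivable from that path is a semantic question the code does not answer. The function name and the sample graph are hypothetical.

```python
def transitive_candidates(edges):
    """Return edges (a, b) for which another path from a to b exists
    through at least one intermediate node."""
    graph = {}
    for parent, child in edges:
        graph.setdefault(parent, set()).add(child)

    def reachable(start, goal, skip_edge):
        # Depth-first search that is not allowed to use the direct edge.
        stack, seen = [start], {start}
        while stack:
            node = stack.pop()
            for nxt in graph.get(node, ()):
                if (node, nxt) == skip_edge or nxt in seen:
                    continue
                if nxt == goal:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return {edge for edge in edges if reachable(edge[0], edge[1], skip_edge=edge)}

# Hypothetical graph: a -> c is also reachable as a -> b -> c.
edges = [("a", "b"), ("b", "c"), ("a", "c")]
candidates = transitive_candidates(edges)
```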

D-4. Remove transitive relationship sets and redundant object classes. (cont.) (example)
[Figures: two example slides]

D-5. Remove other types of multiple parent nodes.
• If a node has more than one incoming edge in an integrated graph, it is called a multiple parent node.
  - Case 1 (D-1): Different relationship sets among the same object classes.
  - Case 2 (D-2): Relationship sets that are projections of higher degree relationship sets.
  - Case 3 (D-3): Ancestor-descendant conflicts.
  - Case 4 (D-4): Transitive relationship sets.
  - Case 5 (D-5): Others, as in the example that follows.

D-5. Remove multiple parent nodes. (cont.) (example)
[Figure: Case 5. Schema S9; Schema S10; integrated graph G9-10; transformed graph G9-10']

Transformed graph (summary)
[Figure: original integrated graph]

Transformed graph (cont.)
[Figure: transformed graph]

E. Augment Graph with Attributes
• Augment the graph with the attributes of object classes in the integrated schema.
• Augment the graph with the attributes of relationship sets in the integrated schema.
  - The attributes of object classes duplicated in cases D-1 and D-5 become attributes of the original object classes, not of the duplicates.
  - The attributes of a removed relationship set become attributes of the relationship sets from which it can be derived.

Final integrated schema (example)
[Figure: final integrated schema]

Integration Algorithm
1. Preprocessing.
   a. Resolve attribute-object class conflict.
   b. Resolve generalizations and specializations.
2. Construct integrated graph.
3. Transform graph.
4. Augment graph with attributes.

Step 3: Transform Graph
• 3.1 Differentiate semantically different relationship sets among equivalent object classes.
• 3.2 Remove relationship sets that are projections of higher degree relationship sets.
• 3.3 Resolve any ancestor-descendant conflicts which create cycles in G.
• 3.4 Remove transitive relationship sets and redundant object classes.
• 3.5 Remove other multiple parent nodes.

Related work (compare with [7])
[Figure: problems with the approach of E. Jeong and C.-N. Hsu [7]]

Related work (compare with [7]) (cont.)
[Figure: integrated schema obtained by our approach]

Related work (compare with [7]) (cont.)
• Our proposed method employs the ORA-SS conceptual model, which captures the semantics necessary for resolving structural conflicts during integration.
• We take into consideration the importance of the individual data sources and how the majority of the local schemas model their data.
• We employ an n-ary strategy. A binary strategy cannot utilize the source importance or how the majority of the sources model the data. The n-ary strategy is also faster than the binary strategy. (For example, suppose sw1 = 2, sw2 = 1, sw3 = 1, sw4 = 1, and schemas 2, 3 and 4 are the same. The binary strategy might treat schema 1 as the most important, while in fact schemas 2, 3 and 4, with a combined weight of 3, are.)
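The example in parentheses can be checked with a few lines of Python. The sketch sums the source weights per modeling choice, which is how an n-ary strategy can see that three agreeing light sources outweigh one heavier source. The source weights are those from the slide; the design labels "X" and "Y" and the function name are hypothetical.

```python
from collections import defaultdict

def weight_per_design(sources):
    """sources: list of (source_weight, design), where `design` is any
    hashable description of how a source models the data."""
    totals = defaultdict(int)
    for source_weight, design in sources:
        totals[design] += source_weight
    return dict(totals)

# Schema 1 uses design "X"; schemas 2, 3 and 4 share design "Y".
sources = [(2, "X"), (1, "Y"), (1, "Y"), (1, "Y")]
totals = weight_per_design(sources)
preferred = max(totals, key=totals.get)
```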

Conclusion
• In this paper, we have introduced a semantic approach to resolving structural conflicts in the integration of XML schemas. We employed the ORA-SS semantic data model to capture the implicit semantics in an XML schema.
• We presented a comprehensive n-ary algorithm to integrate XML schemas. Our algorithm takes into account the data semantics, the importance of a source, and how the majority of the sources model their data. Structural conflicts such as the attribute/object class conflict and the ancestor-descendant conflict are resolved in our approach. To obtain a concise integrated schema, we also remove redundant object classes and redundant relationship sets, such as transitive relationship sets and relationship sets that are projections of higher degree relationship sets.

References
1. S. Castano, V. Antonellis, S. C. Vimercati, M. Melchiori. An XML-Based Framework for Information Integration over the Web. IIWAS, 2000.
2. Y. B. Chen, T. W. Ling, M. L. Lee. Designing Valid XML Views. ER, 2002.
3. P. Buneman, S. Davidson, W. Fan, C. Hara, W. C. Tan. Keys for XML. WWW, 2001.
4. A. Doan, P. Domingos, A. Levy. Learning Source Descriptions for Data Integration. WebDB, 2000.
5. G. Dobbie, X. Wu, T. W. Ling, M. L. Lee. ORA-SS: An Object-Relationship-Attribute Model for Semi-structured Data. Technical Report TR 21/00, National University of Singapore, 2000.
6. E. Rahm, P. Bernstein. On Matching Schemas Automatically. MSR Tech. Report MSR-TR-2001-17, 2001.
7. E. Jeong, C.-N. Hsu. Induction of Integrated View for XML Data with Heterogeneous DTDs. ACM CIKM, 2001.
8. T. W. Ling, M. L. Lee. Relational to Entity-Relationship Schema Translation Using Semantic and Inclusion Dependencies. Journal of Integrated Computer-Aided Engineering, John Wiley Publishers, Vol. 2, No. 2, pages 125-145, 1995.
9. M. L. Lee, T. W. Ling. Resolving Structural Conflicts in the Integration of Entity-Relationship Schemas. OOER, 1995.
10. M. L. Lee, T. W. Ling. Resolving Constraint Conflicts in the Integration of Entity-Relationship Schemas. ER, 1997.
11. M. L. Lee, T. W. Ling, W. L. Low. Designing Functional Dependencies for XML. EDBT, 2002.
12. M. L. Lee, L. H. Yang, W. Hsu, X. Yang. XClust: Clustering XML Schemas for Effective Integration. ACM CIKM, 2002.
13. D. Maier. Theory of Relational Databases. Computer Science Press, 1983.
14. J. Madhavan, P. A. Bernstein, E. Rahm. Generic Schema Matching with Cupid. VLDB, 2001.
15. R. Mello, S. Castano, C. A. Heuser. A Method for the Unification of XML. Information and Software Technology Journal, 2002.
16. P. Mitra, G. Wiederhold, J. Jannink. Semi-automatic Integration of Knowledge Sources. Fusion, 1999.
17. P. Mitra, G. Wiederhold, M. Kersten. A Graph-Oriented Model for Articulation of Ontology Interdependencies. EDBT, 2000.
18. F. Naumann, U. Leser, J. C. Freytag. Quality-driven Integration of Heterogeneous Information Systems. VLDB, 1999.
19. C. Reynaud, J.-P. Sirot, D. Vodislav. Semantic Integration of XML Heterogeneous Data Sources. IDEAS, 2001.
20. http://www.cogsci.princeton.edu/~wn
21. Xyleme. A dynamic warehouse for XML Data of the Web. IEEE Data Engineering Bulletin 24(2): 40-47, 2001.
22. L. L. Yan, T. W. Ling. Translating Relational Schema with Constraints into OODB Schema. IFIP DS-5 Semantics of Interoperable Database Systems, 1992.