A Hybrid Match Algorithm for XML Schemas
K. Claypool, V. Hegde, N. Tansalarak
UMass Lowell, ICDE '06
Ray Dos Santos, Aug 21, 2009
XML integration
- A hybrid match algorithm that provides a framework for analyzing and exploiting the semantic and structural information inherent in XML schemas.
- Objective: find corresponding entities.
XML integration
- An XML match taxonomy: categorizes the structural and semantic overlap between two given XML schemas (qualitative).
- A weight-based match model: evaluates the quality of a match, assigning it an absolute numeric value (quantitative).
QoM Classification
Four axes:
- Label: atomic
- Properties: atomic
- Children: many values
- Nesting level: atomic
Atomic Values
- Exact match: the value v1 of the axis (the label, properties, or level axis) in schema S1 is identical to the value v2 of the same axis in schema S2.
Atomic Values
- Relaxed match: the value v1 of the axis (the label, properties, or level axis) in schema S1 has some degree of match, but not an exact one, to the value v2 of the same axis in schema S2.
- Needs a linguistic match algorithm.
Atomic Values
Relaxed match:
- Level: values are identical.
- Properties: decided individually. The property value of the source is a specialization or generalization of the target.
- Example properties: minOccurs, maxOccurs, and type (e.g. minOccurs=0 versus minOccurs=1).
Set-Valued Elements (children axis)
Coverage match:
- Total: all children (sub-elements and attributes) of the source element have a match with some child of the target element.
Set-Valued Elements (children axis)
Coverage match:
- Partial: some, but not all, of the children of the source element have a match with children of the target element.
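The total/partial distinction can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: `coverage` and the `matches` predicate are hypothetical names, and returning "none" when no child matches is an assumption for a case the slides do not cover.

```python
from typing import Callable, Sequence


def coverage(source_children: Sequence[str],
             target_children: Sequence[str],
             matches: Callable[[str, str], bool]) -> str:
    """Classify children-axis coverage of a source element against a target."""
    # Count source children that match at least one child of the target.
    matched = sum(
        1 for s in source_children
        if any(matches(s, t) for t in target_children)
    )
    if source_children and matched == len(source_children):
        return "total"
    if matched > 0:
        return "partial"
    return "none"  # assumption: case not covered by the slides


def same_label(a: str, b: str) -> bool:
    """Hypothetical matcher: case-insensitive label equality."""
    return a.lower() == b.lower()


print(coverage(["OrderNo", "Date"], ["orderno", "date", "Items"], same_label))
print(coverage(["OrderNo", "Lines"], ["orderno", "Items"], same_label))
```

Coverage is judged from the source side only, so extra children in the target do not prevent a total match.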
XML Match Taxonomy
Leaf match:
- A match between two leaf elements is exact, E1 = E2, if both the label and the set of properties match exactly.
- A match between two leaf elements E1 and E2 is relaxed if either the label or any of the properties of E1 has a relaxed match with the label or the properties of E2, respectively.
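A minimal sketch of the leaf classification, assuming the relaxed tests are supplied by the caller. The names `leaf_match`, `suffix_label`, and `occurs_prop` are hypothetical, and the two helper predicates are toy stand-ins for a real linguistic matcher and a real specialization/generalization check. The sketch also assumes that every axis must match at least in relaxed form for the leaves to match at all.

```python
from typing import Callable, Dict, Optional


def leaf_match(label1: str, props1: Dict[str, str],
               label2: str, props2: Dict[str, str],
               relaxed_label: Callable[[str, str], bool],
               relaxed_prop: Callable[[str, Optional[str], Optional[str]], bool],
               ) -> Optional[str]:
    """Return "exact", "relaxed", or None for two leaf elements."""
    if label1 == label2 and props1 == props2:
        return "exact"
    # Each axis must match exactly or in relaxed form (assumption).
    label_ok = label1 == label2 or relaxed_label(label1, label2)
    props_ok = all(
        props1.get(k) == props2.get(k)
        or relaxed_prop(k, props1.get(k), props2.get(k))
        for k in props1.keys() | props2.keys()
    )
    return "relaxed" if label_ok and props_ok else None


def suffix_label(a: str, b: str) -> bool:
    """Toy linguistic test: one label ends with the other."""
    a, b = a.lower(), b.lower()
    return a.endswith(b) or b.endswith(a)


def occurs_prop(name: str, v1: Optional[str], v2: Optional[str]) -> bool:
    """Toy property test: any difference in occurrence bounds counts as relaxed."""
    return name in {"minOccurs", "maxOccurs"}


print(leaf_match("PurchaseDate", {"type": "date", "minOccurs": "1"},
                 "Date", {"type": "date", "minOccurs": "0"},
                 suffix_label, occurs_prop))
```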
XML Match Taxonomy
Subtree match (intermediate node) considers:
- (1) the number of children matches;
- (2) the quality of match of the children;
- (3) the quality of match along the atomic-valued axes of the root node of the sub-tree.
Children axis:
- Total exact: all children to all children
- Total relaxed: all children to some children
- Partial exact: some children to some children
- Partial relaxed: some children to some children
Combining the Axes
- Total exact: an exact match along the label, properties, and level axes, and a total exact match along the children axis.
- Total relaxed: one or more relaxed matches along any of the atomic-valued axes, or a total relaxed match along the children axis.
- Partial exact: an exact match along all atomic-valued axes and a partial exact match along the children axis.
- Partial relaxed: a relaxed match along one or more atomic-valued axes and/or a partial relaxed match along the children axis.
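One way to read the four combined classes is that coverage (total or partial) comes from the children axis, while the quality is exact only if every axis is exact. The sketch below encodes that reading; the function name `combine` and the string encoding of the classes are assumptions made for illustration.

```python
def combine(label: str, properties: str, level: str, children: str) -> str:
    """Combine per-axis results into one of the four overall match classes.

    Atomic axes take "exact" or "relaxed"; the children axis takes
    "total exact", "total relaxed", "partial exact", or "partial relaxed".
    """
    atomic = (label, properties, level)
    if any(a not in ("exact", "relaxed") for a in atomic):
        raise ValueError("atomic axes must be 'exact' or 'relaxed'")
    coverage, child_quality = children.split()
    if coverage not in ("total", "partial") or child_quality not in ("exact", "relaxed"):
        raise ValueError("unknown children-axis class: " + children)
    # Exact overall only if all atomic axes and the children axis are exact.
    all_exact = all(a == "exact" for a in atomic) and child_quality == "exact"
    return f"{coverage} {'exact' if all_exact else 'relaxed'}"


print(combine("relaxed", "relaxed", "exact", "total relaxed"))
print(combine("relaxed", "exact", "exact", "partial exact"))
```

Under this reading, a single relaxed atomic axis is enough to downgrade an otherwise exact match to relaxed, while coverage is unaffected.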
Tree Match
- The two root elements, PO and Purchase Order, have a relaxed match along the label and properties axes.
- The PO root has three children; Purchase Order has five children.
- There is an exact match between the leaf children labeled OrderNo, and a relaxed match between the children PurchaseDate and Date.
- Match the sub-tree rooted at PurchaseInfo with all sub-trees in the Purchase Order.
- PurchaseInfo and Purchase Order have a relaxed match along the label and properties axes.
Tree Match
- The leaf children BillingAddr and ShippingAddr have a relaxed match with the leaf nodes BillTo and ShipTo in the Purchase Order.
- The sub-trees rooted at the two non-leaf nodes Lines and Items have a total relaxed match.
- Combining the matches along the different axes, the QoM for the match between the PO and Purchase Order root nodes is total relaxed.
Weight-based match model
- A match is classified based on the QoM of four axes: label, properties, children, and level.
- Weights are assigned to each individual axis.
- The highest match classification, total exact, always results in QoM(n1, n2) = 1.
- Leaf match: uses the label and properties axes.
- Subtree match: uses all four axes. A match along the children axis is given by:
  - The subtree weight: the normalized sum of the QoM of the children.
  - The cardinality ratio: the number of children matches to the number of children.
  - QoM: the QoM of node N along the children axis.
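The slide's formulas are not preserved here, so the following is only a sketch under a stated assumption: the overall QoM is a weighted sum of per-axis scores in [0, 1], with weights that sum to 1. That assumption is consistent with the statement that a total exact match yields QoM(n1, n2) = 1, but the function name `weighted_qom` and the example weights are hypothetical.

```python
from typing import Dict


def weighted_qom(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-axis QoM scores (assumed form, see lead-in)."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same axes")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[axis] * scores[axis] for axis in scores)


# Hypothetical weights for a subtree match (all four axes).
subtree_weights = {"label": 0.3, "properties": 0.2, "level": 0.1, "children": 0.4}
# Hypothetical weights for a leaf match (label and properties only).
leaf_weights = {"label": 0.6, "properties": 0.4}

total_exact = {"label": 1.0, "properties": 1.0, "level": 1.0, "children": 1.0}
print(weighted_qom(total_exact, subtree_weights))
print(weighted_qom({"label": 0.5, "properties": 1.0}, leaf_weights))
```

Requiring the weights to sum to 1 is what keeps the score of a total exact match at 1 regardless of how the weight is distributed across axes.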
Hybrid Match Algorithm
- Recursive, depth-first search.
- Match the roots.
- Calculate the children-axis match (QoMc).
- Calculate the atomic-valued axes (QoMl, QoMh, QoMp).
- Compute the final QoM match.
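The recursive, depth-first structure can be sketched as below. This is a simplified illustration rather than the published algorithm: it scores only the label and children axes (properties and level are omitted for brevity), `label_score` is a toy stand-in for a linguistic matcher, the weights are hypothetical, and multiplying the subtree weight by the cardinality ratio is an assumption, since the slides do not preserve how the two are combined.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def label_score(a: str, b: str) -> float:
    """Toy label matcher: 1.0 if equal, 0.5 if one contains the other, else 0.0."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    if a in b or b in a:
        return 0.5
    return 0.0


def qom(n1: Node, n2: Node, w_label: float = 0.5, w_children: float = 0.5) -> float:
    """Depth-first QoM of n1 against n2 over the label and children axes."""
    l_score = label_score(n1.label, n2.label)
    if not n1.children and not n2.children:
        return l_score  # leaf match: atomic axes only
    if not n1.children or not n2.children:
        return w_label * l_score  # no children to compare on one side
    # Recurse first: best match for each source child among the target children.
    best = [max(qom(c1, c2, w_label, w_children) for c2 in n2.children)
            for c1 in n1.children]
    subtree_weight = sum(best) / len(best)                    # normalized sum of child QoMs
    cardinality_ratio = sum(1 for b in best if b > 0) / len(best)
    c_score = subtree_weight * cardinality_ratio              # assumed combination
    return w_label * l_score + w_children * c_score


po = Node("PO", [Node("OrderNo"), Node("PurchaseDate")])
order = Node("PurchaseOrder", [Node("OrderNo"), Node("Date")])
print(qom(po, order))
```

Each source child is paired with its best-scoring target child, so the children are fully scored before the parent's own score is computed.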
Experiment
- XML schemas from XML Benchmark: http://db.uwaterloo.ca/ddbms/projects/xbench/
- Inventory, books, and protein.
- Compared three algorithms: linguistic, structural, and hybrid.
Experiment
- R = real matches
- P = matches found by the algorithm
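The slide's evaluation formulas are not preserved. Definitions of R and P like these are typically used with set-based measures such as precision and recall, so the sketch below assumes their standard definitions. The function name `precision_recall` and the sample match pairs are hypothetical, not data from the experiment.

```python
from typing import Set, Tuple

Match = Tuple[str, str]  # (source element, target element)


def precision_recall(real: Set[Match], found: Set[Match]) -> Tuple[float, float]:
    """Standard precision and recall of found matches P against real matches R."""
    correct = real & found
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(real) if real else 0.0
    return precision, recall


# Hypothetical match sets, for illustration only.
R = {("OrderNo", "OrderNo"), ("PurchaseDate", "Date"), ("BillingAddr", "BillTo")}
P = {("OrderNo", "OrderNo"), ("PurchaseDate", "Date"),
     ("Lines", "Date"), ("PO", "Items")}

precision, recall = precision_recall(R, P)
print(precision, recall)
```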
Conclusion
- Combined structural matching and linguistic matching into a hybrid algorithm.
- Provided a matching taxonomy and a weighted formula applied along the labels, children, properties, and levels of XML elements.
- Combined them into an algorithm to determine the highest QoM between two schemas.